## If you weren't here last time, make sure the necessary installations are made
If you were, you can skip down to the cell containing the connection string for the database instance and run it.

In [1]:
!pip install pymssql



#### For Mac users:
You will need to install the following program in the terminal or the notebook will throw an error when importing pymssql.

       brew install freetds

Non-Mac users may not need this install at all, but if they do, an Ubuntu version can be found here:
    https://packages.ubuntu.com/search?keywords=FreeTDS

## Run this cell to connect to the database and set up to make queries

In [2]:
import pandas as pd
import pymssql

import sys
sys.path.append('../../')
from src.pySQL_funcs import pretty_query

with open('../../src/pw2') as pw_file:
    # strip() drops the trailing newline so it does not end up in the database name
    server, user, pw, database = pw_file.readline().strip().split(',')

In [3]:
conn = pymssql.connect(host=server,user=user,password=pw,database=database)
cur = conn.cursor()

In [4]:
def pretty_query(cur, query):
    """
    Function to Pandas-prettify for Pythonic SQL Server queries
    *NOTE* It is recommended to alias any aggregate calculation columns
           because pymssql doesn't seem to auto generate one (returns an empty string).
    """
    cur.execute(query)
    data = cur.fetchall()
    headers = [col[0] for col in cur.description]
    out = pd.DataFrame(data=data, columns=headers)
    return out

## Answering questions:

Once everyone is ready, we'll dive into the following questions. The table schemas can be found at the bottom of this notebook; however, you may find it easier to open the project's GitHub README, which also contains these schema tables, in another window. The link is below:

https://github.com/dougtheeconomist/flag-on-the-play/blob/master/README.md

We're picking up where we left off last time, when we created a table to identify players who had played for Seattle and another team in the NFL. This table displayed the player's name, the average penalties they accrued while on a specific team, and the name of that team. So that everyone starts this session at the same spot, I have provided the code for that query below.

## Task A
Take the query provided below from the end of last session and use it to create a temporary table. This will give us a convenient way to make further queries against the information in this table, which will be the focus of tonight's exploration. From this temporary table, write two queries: one that returns the average of the penalties column when the team is 'Seattle' and one when the team is not 'Seattle'.

Remember that the syntax for creating and using a temporary table (formally, a common table expression) is:

>WITH &nbsp;&nbsp; temp_table_name(column_label_1, column_label_2, etc.) <br>
&nbsp;&nbsp;&nbsp;  AS <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    ( QUERY )
    <br>SELECT what-have-you FROM temp_table_name;

## NOTE:
Creating a temporary table in one query does not keep it available globally for the next query; temporary means temporary! Copying and pasting the table definitions from previous queries into new cells will be the most convenient way of reusing them.
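
To see that scoping in action, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for the SQL Server instance (an assumption, so that it runs without a connection). The `players` table, its rows, and the name `temp_pens` are made up for illustration. The first statement defines and uses the temporary table; the second tries to reuse it and fails.

```python
import sqlite3

# In-memory SQLite database standing in for the real server (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical toy data, not the real guest.players table.
cur.execute("CREATE TABLE players (player_name TEXT, team TEXT, pen_count INTEGER)")
cur.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("A", "Seattle", 4), ("A", "Denver", 6), ("B", "Seattle", 2)],
)

# The WITH clause and the SELECT that uses it form a single statement.
cur.execute("""
    WITH temp_pens(name, penalties, team) AS (
        SELECT player_name, AVG(CAST(pen_count AS REAL)), team
        FROM players
        GROUP BY team, player_name
    )
    SELECT AVG(penalties) FROM temp_pens WHERE team = 'Seattle'
""")
seattle_avg = cur.fetchone()[0]
print(seattle_avg)

# A later statement cannot see temp_pens: it only existed for the query above.
try:
    cur.execute("SELECT AVG(penalties) FROM temp_pens")
    reused = True
except sqlite3.OperationalError as err:
    reused = False
    print("temp_pens is gone:", err)
```

The same behaviour is why each cell in this notebook repeats the full `WITH ... AS (...)` definition before selecting from it.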

In [5]:
query = """
WITH penalties AS (
SELECT player_name AS name, AVG(CAST(pen_count AS float)) AS penalties, 
      CASE WHEN 
                team ='Seattle' THEN 'Seattle'
                ELSE 'not_Seattle' END AS team

FROM guest.players
WHERE player_id IN (SELECT player_id FROM guest.players WHERE player_id IN (SELECT DISTINCT player_id FROM guest.players WHERE team = 'Seattle') GROUP BY player_id HAVING COUNT(DISTINCT team) > 1)
GROUP BY team, player_name)
SELECT AVG(penalties) as avg_pen, team FROM penalties GROUP BY team
;
"""
pretty_query(cur, query)

|   | avg_pen | team |
|---|---|---|
| 0 | 5.734127 | not_Seattle |
| 1 | 5.487738 | Seattle |


<details><summary>
Possible answer:
</summary>
WITH temp(name,penalties,team)
    
    AS
    
    (SELECT player_name AS name, AVG(CAST(pen_count AS float)) AS penalties, team

    FROM guest.players

    WHERE player_id IN (SELECT player_id FROM guest.players WHERE player_id IN (SELECT DISTINCT player_id FROM 
    
    guest.players WHERE team = 'Seattle') GROUP BY player_id HAVING COUNT(DISTINCT team) > 1)

    GROUP BY team, player_name)
    
SELECT AVG(penalties)
    
    FROM temp
    
    WHERE team = 'Seattle'
    ;

<details><summary>
Possible answer:
</summary>
WITH temp(name,penalties,team)
    
    AS
    
    (SELECT player_name AS name, AVG(CAST(pen_count AS float)) AS penalties, team

    FROM guest.players

    WHERE player_id IN (SELECT player_id FROM guest.players WHERE player_id IN (SELECT DISTINCT player_id FROM 
    
    guest.players WHERE team = 'Seattle') GROUP BY player_id HAVING COUNT(DISTINCT team) > 1)

    GROUP BY team, player_name)
    
SELECT AVG(penalties)
    
    FROM temp
    
    WHERE team <> 'Seattle'
    ;

How different are the averages from these queries? Is the average when the team is Seattle higher?

## Task B
So we now have an answer to the question that we started out with, but being diligent data scientists, we can't just leave it at that. We want to perform a hypothesis test, treating the rows where team = 'Seattle' and the rows where it does not as different samples, to check whether this result is statistically significant.

In order to do this we need 6 pieces of information. The first two are the averages that you just queried. Next, write 4 more queries to obtain the other pieces: the standard deviation when the team is Seattle and when the team is not Seattle, and the number of rows where the team is Seattle and where the team is not Seattle.

With this information we can conduct a 2-sample hypothesis test.

In [6]:
query = """
WITH penalties AS (
SELECT player_name AS name, AVG(CAST(pen_count AS float)) AS penalties, 
      CASE WHEN 
                team ='Seattle' THEN team
                ELSE 'not_Seattle' END AS team

FROM guest.players
WHERE player_id IN (SELECT player_id FROM guest.players WHERE player_id IN (SELECT DISTINCT player_id FROM guest.players WHERE team = 'Seattle') GROUP BY player_id HAVING COUNT(DISTINCT team) > 1)
GROUP BY team, player_name)


SELECT team, AVG(penalties) as avg_pen, STDEV(penalties) AS std, COUNT(penalties) AS num_samples
  FROM penalties GROUP BY team
  
;
"""
pretty_query(cur, query)

|   | team | avg_pen | std | num_samples |
|---|---|---|---|---|
| 0 | not_Seattle | 5.734127 | 3.031268 | 63 |
| 1 | Seattle | 5.487738 | 1.969397 | 40 |


<details><summary>
Possible answer:
</summary>
WITH temp(name,penalties,team)
    
    AS
    
    (SELECT player_name AS name, AVG(CAST(pen_count AS float)) AS penalties, team

    FROM guest.players

    WHERE player_id IN (SELECT player_id FROM guest.players WHERE player_id IN (SELECT DISTINCT player_id FROM 
    
    guest.players WHERE team = 'Seattle') GROUP BY player_id HAVING COUNT(DISTINCT team) > 1)

    GROUP BY team, player_name)


SELECT STDEV(penalties) 
    
    FROM temp
    
    WHERE team = 'Seattle'
;

## Task C: Putting it all together
Now write a query that combines the queries you have already written to calculate our test statistic.

As a refresher the equation for this will be the mean of group1 - the mean of group2 in the numerator, 

and the square root of ( (standard dev group1 squared / n group1) + (standard dev group2 squared / n group2) ) in the denominator
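
Written out in symbols, with $\bar{x}$, $s$, and $n$ standing for each group's mean, standard deviation, and row count (the subscripts 1 and 2 are only labels for the two groups), the statistic described above is:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```

Swapping which group is labelled 1 and which is labelled 2 only flips the sign of $t$.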

<details><summary>
HINT:
</summary>
SQL's order of operations will want you to do all of your operations in the first select statement,
so your query will be in the format 
    
    SELECT (something - (SELECT next thing FROM temp table etc)) / (more sub queries from previous parts with appropriate operators to perform mathematical operations)
    
    FROM temp table 
    
    etc

In [7]:
query = """

WITH penalties AS (
SELECT player_name AS name, AVG(CAST(pen_count AS float)) AS penalties, 
      CASE WHEN 
                team ='Seattle' THEN team
                ELSE 'not_Seattle' END AS team

FROM guest.players
WHERE player_id IN (SELECT player_id FROM guest.players WHERE player_id IN (SELECT DISTINCT player_id FROM guest.players WHERE team = 'Seattle') GROUP BY player_id HAVING COUNT(DISTINCT team) > 1)
GROUP BY team, player_name),

stats AS(SELECT team, AVG(penalties) as avg_pen, STDEV(penalties) AS std, COUNT(penalties) AS num_samples
  FROM penalties GROUP BY team)

SELECT (s.avg_pen - n.avg_pen) /
        SQRT(SQUARE(s.std)/s.num_samples + SQUARE(n.std)/n.num_samples) AS Ttest_statistic
 FROM
        (SELECT * FROM stats WHERE team='Seattle') s,
        (SELECT * FROM stats WHERE team='not_Seattle') n
;
"""
pretty_query(cur, query)

|   | Ttest_statistic |
|---|---|
| 0 | -0.500017 |


<details><summary>
Possible answer:
</summary>
WITH temp(name,penalties,team)
    
    AS
    
    (SELECT player_name AS name, AVG(CAST(pen_count AS float)) AS penalties, team
    
    FROM guest.players
    
    WHERE player_id IN (SELECT player_id FROM guest.players WHERE player_id IN (SELECT DISTINCT player_id FROM 
    
    guest.players WHERE team = 'Seattle') GROUP BY player_id HAVING COUNT(DISTINCT team) > 1)
    
    GROUP BY team, player_name)
    


SELECT (AVG(penalties) - (SELECT AVG(penalties) FROM temp WHERE team = 'Seattle')) 
    
        / (SQRT((POWER((SELECT STDEV(penalties) FROM temp WHERE team = 'Seattle'),2) / (SELECT COUNT(penalties) FROM temp WHERE team = 'Seattle')) +
    
    (POWER((SELECT STDEV(penalties) FROM temp WHERE team <> 'Seattle'),2) / (SELECT COUNT(penalties) FROM temp WHERE team <> 'Seattle'))))
    
    FROM temp
    
    WHERE team <> 'Seattle'
    
;

## Bonus:
If you want to check your work or use an easier alternative in a functional setting, you can follow these steps to use the t-test function in scipy.stats.

First write two additional queries. Each should return the penalties column from our temporary table: one where the team is Seattle and one where it is not.

Save the output of these queries to separate lists. This can be done by simply inserting

    name_of_list = 
    
in front of 

    cur.fetchall()
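
As a rough sketch of that step: `cur.fetchall()` returns a list of row tuples, so a single-column result comes back as one-element tuples that are usually easier to work with once flattened. The example below uses the standard-library `sqlite3` module and a made-up `pens` table purely as stand-ins so that it runs without the class database; the names `seattle_list` and `otherteams_list` are placeholders.

```python
import sqlite3

# In-memory SQLite database standing in for the SQL Server connection (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical stand-in for the temporary table's penalties and team columns.
cur.execute("CREATE TABLE pens (penalties REAL, team TEXT)")
cur.executemany(
    "INSERT INTO pens VALUES (?, ?)",
    [(6.0, "Seattle"), (3.0, "Seattle"), (5.0, "Denver"), (3.5, "Dallas")],
)

# Rows arrive as tuples such as (6.0,), so unpack the first element of each.
cur.execute("SELECT penalties FROM pens WHERE team = 'Seattle' ORDER BY rowid")
seattle_list = [row[0] for row in cur.fetchall()]

cur.execute("SELECT penalties FROM pens WHERE team <> 'Seattle' ORDER BY rowid")
otherteams_list = [row[0] for row in cur.fetchall()]

print(seattle_list)
print(otherteams_list)
```

Flat lists of numbers like these can be passed straight to a function that expects two samples.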
   

## Where team is Seattle

In [8]:
query = """
WITH penalties AS (
SELECT player_name AS name, AVG(CAST(pen_count AS float)) AS penalties, 
      CASE WHEN 
                team ='Seattle' THEN team
                ELSE 'not_Seattle' END AS team

FROM guest.players
WHERE player_id IN (SELECT player_id FROM guest.players WHERE player_id IN (SELECT DISTINCT player_id FROM guest.players WHERE team = 'Seattle') GROUP BY player_id HAVING COUNT(DISTINCT team) > 1)
GROUP BY team, player_name)
  
SELECT penalties, team FROM penalties
;
"""
df = pretty_query(cur, query)
df

|   | penalties | team |
|---|---|---|
| 0 | 5.000000 | not_Seattle |
| 1 | 6.000000 | Seattle |
| 2 | 3.500000 | not_Seattle |
| 3 | 3.000000 | Seattle |
| 4 | 15.000000 | not_Seattle |
| ... | ... | ... |
| 98 | 4.000000 | not_Seattle |
| 99 | 4.666667 | not_Seattle |
| 100 | 4.000000 | Seattle |
| 101 | 6.000000 | not_Seattle |


<details><summary>
Possible answer:
</summary>
WITH temp(name,penalties,team)

    AS
    
    (SELECT player_name AS name, AVG(CAST(pen_count AS float)) AS penalties, team
     
    FROM guest.players
     
    WHERE player_id IN (SELECT player_id FROM guest.players WHERE player_id IN (SELECT DISTINCT player_id FROM guest.players WHERE team = 'Seattle') GROUP BY player_id HAVING COUNT(DISTINCT team) > 1)

     GROUP BY team, player_name)

SELECT penalties

    FROM temp
    
    WHERE team = 'Seattle'
;

## Where team is not Seattle

In [9]:
s_pens = df['penalties'][df['team'] == 'Seattle']
n_pens = df['penalties'][df['team'] != 'Seattle']

<details><summary>
Possible answer:
</summary>
WITH temp(name,penalties,team)

    AS
    
    (SELECT player_name AS name, AVG(CAST(pen_count AS float)) AS penalties, team
     
    FROM guest.players
     
    WHERE player_id IN (SELECT player_id FROM guest.players WHERE player_id IN (SELECT DISTINCT player_id FROM guest.players WHERE team = 'Seattle') GROUP BY player_id HAVING COUNT(DISTINCT team) > 1)

     GROUP BY team, player_name)

SELECT penalties

    FROM temp
    
    WHERE team <> 'Seattle'
;

Then use 

    stats.ttest_ind()
    
with the two lists you just saved to generate our test statistic and corresponding p-value. Remember to set the equal_var option to False, or you will get a different answer.

<details><summary>
Possible answer:
</summary>
stats.ttest_ind(otherteams_list, seattle_list, equal_var=False)

In [10]:
from scipy import stats
stats.ttest_ind(s_pens, n_pens, equal_var=False)

Ttest_indResult(statistic=-0.5000167314225741, pvalue=0.6181517648155634)
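
To see why the equal_var setting matters, the sketch below recomputes the statistic by hand from the summary values returned in Task B (means, standard deviations, and counts, as rounded in that output). It does so once with the separate-variance (Welch) denominator used in Task C and once with the pooled-variance denominator that `equal_var=True` corresponds to. The function names are illustrative, and because the inputs are rounded, the results only agree with the query output to a few decimal places.

```python
import math

# Summary values copied from the Task B output (rounded to six decimals).
mean_s, std_s, n_s = 5.487738, 1.969397, 40   # Seattle
mean_n, std_n, n_n = 5.734127, 3.031268, 63   # not_Seattle


def welch_t(m1, s1, n1, m2, s2, n2):
    """t statistic with separate variances (equal_var=False)."""
    return (m1 - m2) / math.sqrt(s1**2 / n1 + s2**2 / n2)


def pooled_t(m1, s1, n1, m2, s2, n2):
    """t statistic with one pooled variance (equal_var=True)."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def welch_dof(s1, n1, s2, n2):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    a, b = s1**2 / n1, s2**2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))


t_welch = welch_t(mean_s, std_s, n_s, mean_n, std_n, n_n)
t_pooled = pooled_t(mean_s, std_s, n_s, mean_n, std_n, n_n)
dof = welch_dof(std_s, n_s, std_n, n_n)

print(round(t_welch, 4), round(t_pooled, 4), round(dof, 1))
```

Because the two groups have noticeably different standard deviations and sizes, the pooled version gives a different statistic from the Welch version, which is the discrepancy the equal_var reminder refers to.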

## Fun Fact:
Oracle SQL actually has built-in functions for this type of hypothesis test, such as STATS_ONE_WAY_ANOVA() and STATS_T_TEST_INDEPU() (a two-sample t-test with unequal variances).

If you ever find yourself writing SQL queries with Oracle, these may be helpful functions to take a look at.
Unfortunately, they don't work in SQL Server.