# Drill: Which Test to Use -- Kristofer Schobert

In this notebook, we work through the Thinkful challenge: six problems on data from the 2012 and 2014 rounds of a survey of people in Europe.

More information on this data can be found here: https://thinkful-ed.github.io/data-201-resources/ESS_practice_data/ESS_codebook.html#sclact

We will answer the questions in order. Let's begin.

In [96]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline

In [7]:
df_main = pd.read_csv('https://raw.githubusercontent.com/Thinkful-Ed/' \
                      'data-201-resources/master/ESS_practice_data/ESSdata_Thinkful.csv') 


In [10]:
df_main.head()

Unnamed: 0,cntry,idno,year,tvtot,ppltrst,pplfair,pplhlp,happy,sclmeet,sclact,gndr,agea,partner
0,CH,5.0,6,3.0,3.0,10.0,5.0,8.0,5.0,4.0,2.0,60.0,1.0
1,CH,25.0,6,6.0,5.0,7.0,5.0,9.0,3.0,2.0,2.0,59.0,1.0
2,CH,26.0,6,1.0,8.0,8.0,8.0,7.0,6.0,3.0,1.0,24.0,2.0
3,CH,28.0,6,4.0,6.0,6.0,7.0,10.0,6.0,2.0,2.0,64.0,1.0
4,CH,29.0,6,5.0,6.0,7.0,5.0,8.0,7.0,2.0,2.0,55.0,1.0


## Problem 1

Did people become less trusting from 2012 to 2014? Compute results for each country in the sample.

We will use a dependent (paired) t-test to check whether each country's mean trust score differs between the two years.

In [13]:
for country in df_main['cntry'].unique():
    print(country)

CH
CZ
DE
ES
NO
SE


In [24]:
list(df_main[(df_main['cntry'] == 'CH') & (df_main['year'] == 6)]['idno']) == list(df_main[(df_main['cntry'] == 'CH') & (df_main['year'] == 7)]['idno'])


True

So, at least for country 'CH', the same individuals responded in 2012 and 2014. We know this because the idno values are identical for both years; they are even listed in the same order.

In [28]:
for country in df_main['cntry'].unique():
    print(list(df_main[(df_main['cntry'] == country) & (df_main['year'] == 6)]['idno']) == list(df_main[(df_main['cntry'] == country) & (df_main['year'] == 7)]['idno']))
    

True
True
True
True
True
True


We have the same individuals reporting in both years. After a closer inspection of the documentation of the data, it seems that the survey was aimed at understanding behavior on an individual level. That is why we get True statements when comparing the lists of ID numbers for each country across both years.

Through other data exploration, we have determined that ID numbers ('idno') are unique only within a country, and that the data contain missing values.
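
The exploration itself is not shown here, so the following is only a minimal sketch of how such checks could look. The `toy` DataFrame and its values are hypothetical stand-ins that mimic the survey layout; the same calls could be pointed at `df_main`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics the survey layout; not the real ESS values.
toy = pd.DataFrame({
    'cntry':   ['CH', 'CH', 'CH', 'CH', 'SE', 'SE', 'SE', 'SE'],
    'idno':    [5, 25, 5, 25, 5, 30, 5, 30],
    'year':    [6, 6, 7, 7, 6, 6, 7, 7],
    'ppltrst': [3.0, 5.0, 4.0, np.nan, 8.0, 6.0, 7.0, 6.0],
})

# Number of distinct countries each idno appears in; a value above 1 means
# the id is reused across countries.
countries_per_id = toy.groupby('idno')['cntry'].nunique()
ids_shared_across_countries = countries_per_id[countries_per_id > 1].index.tolist()

# Within a (country, year) pair every idno should appear at most once.
unique_within_country_year = not toy.duplicated(['cntry', 'year', 'idno']).any()

# Count of missing values per column.
missing_per_column = toy.isna().sum()

print(ids_shared_across_countries)
print(unique_within_country_year)
print(missing_per_column)
```

Because an idno can recur in another country, any merge on 'idno' has to be done one country at a time.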

In [183]:
# Loop through each country and compute the dependent (paired) t-test.

# For each country, merge the 2012 rows (year == 6) with the 2014 rows (year == 7) on idno.
# Again, idno values are not globally unique; they are unique only within one country.

# We then drop rows with any NaN values.

# Finally, we compute the dependent t-statistic.

for country in df_main['cntry'].unique():
    in_country = df_main[df_main['cntry'] == country]
    paired = in_country[in_country['year'] == 6].merge(
        in_country[in_country['year'] == 7],
        how='outer', on='idno',
        suffixes=('_2012', '_2014'))[['idno', 'ppltrst_2012', 'ppltrst_2014']].dropna()

    print(country, stats.ttest_rel(paired['ppltrst_2012'], paired['ppltrst_2014']))

CH Ttest_relResult(statistic=-0.6586851756725737, pvalue=0.5102943511301135)
CZ Ttest_relResult(statistic=-0.5001638336887216, pvalue=0.6171292682404208)
DE Ttest_relResult(statistic=-0.18399501804849683, pvalue=0.8568563797095805)
ES Ttest_relResult(statistic=2.4561906976601646, pvalue=0.014181580725321856)
NO Ttest_relResult(statistic=0.5073077081124404, pvalue=0.6120925701518767)
SE Ttest_relResult(statistic=-2.0671082026033982, pvalue=0.0390078167095776)


We can see that "ES" (Spain) and "SE" (Sweden) show a statistically significant change in their mean value for trusting people (their p-values are less than 0.05). The sign of the t-statistic tells us whether the mean increased or decreased, but I am not certain which input scipy subtracts from which, so I cannot yet say which country increased and which decreased. Let's inspect the mean of the differences for these two countries.

In [197]:
# Take the 'ppltrst' entries for each country in 2014 and subtract the 2012 values,
# i.e. final minus initial, to find the change in the mean.

for country, name in [('ES', 'Spain'), ('SE', 'Sweden')]:
    in_country = df_main[df_main['cntry'] == country]
    paired = in_country[in_country['year'] == 6].merge(
        in_country[in_country['year'] == 7],
        how='outer', on='idno',
        suffixes=('_2012', '_2014'))[['idno', 'ppltrst_2012', 'ppltrst_2014']].dropna()

    print('for {} the change in the mean is: {}'.format(
        name, np.mean(np.array(paired['ppltrst_2014']) - np.array(paired['ppltrst_2012']))))

for Spain the change in the mean is: -0.2171758876961189
for Sweden the change in the mean is: 0.1986754966887417


## Conclusion
On average, people in Spain became less trusting and people in Sweden became more trusting from 2012 to 2014. The other four countries show no statistically significant change.

This also tells us that the t-statistic scipy returns is based on the first input minus the second, which for us is initial (2012) minus final (2014).
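
The sign convention can also be checked directly on a small example. This is a minimal sketch; the `before` and `after` arrays are made-up paired scores, not survey data.

```python
import numpy as np
from scipy import stats

# Hypothetical paired scores for five respondents; not real survey data.
before = np.array([5.0, 6.0, 7.0, 4.0, 8.0])
after = np.array([4.0, 6.0, 5.0, 3.0, 7.0])

result = stats.ttest_rel(before, after)
mean_first_minus_second = np.mean(before - after)

# Same sign: both are positive because 'before' is larger on average.
print(result.statistic, mean_first_minus_second)

# Swapping the arguments flips the sign of the statistic but not the p-value.
swapped = stats.ttest_rel(after, before)
print(swapped.statistic, swapped.pvalue)
```

So a positive statistic from `ttest_rel(x_2012, x_2014)` corresponds to a decrease from 2012 to 2014.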

## Problem 2

Did people become happier from 2012 to 2014? Compute results for each country in the sample.

By changing 'ppltrst' to 'happy' in problem 1's block of code, we will have the answer.

In [198]:
for country in df_main['cntry'].unique():
    in_country = df_main[df_main['cntry'] == country]
    paired = in_country[in_country['year'] == 6].merge(
        in_country[in_country['year'] == 7],
        how='outer', on='idno',
        suffixes=('_2012', '_2014'))[['idno', 'happy_2012', 'happy_2014']].dropna()

    print(country, stats.ttest_rel(paired['happy_2012'], paired['happy_2014']))

CH Ttest_relResult(statistic=-0.319412957862232, pvalue=0.7495001355428359)
CZ Ttest_relResult(statistic=-1.4561384833039597, pvalue=0.14584548433894534)
DE Ttest_relResult(statistic=-0.8062257748298549, pvalue=0.4346138707734991)
ES Ttest_relResult(statistic=1.613832417735418, pvalue=0.1068245155648069)
NO Ttest_relResult(statistic=4.2856826576235925, pvalue=2.0674530134057013e-05)
SE Ttest_relResult(statistic=-0.5581637086030012, pvalue=0.5768709591234161)
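
Since Problems 1 and 2 differ only in the column name, the loop could be wrapped in a helper that takes the column as a parameter. The following is a rough sketch under some assumptions: `paired_ttest_by_country` is a hypothetical name, it uses `pivot` instead of the merge above to line up each respondent's two answers, and the `toy` data is invented for illustration.

```python
import pandas as pd
from scipy import stats


def paired_ttest_by_country(df, column):
    """Return {country: (t_statistic, p_value)} for 2012 (year 6) vs 2014 (year 7)."""
    results = {}
    for country, group in df.groupby('cntry'):
        # One row per respondent, one column per survey year; drop incomplete pairs.
        wide = group.pivot(index='idno', columns='year', values=column).dropna()
        t_stat, p_value = stats.ttest_rel(wide[6], wide[7])
        results[country] = (t_stat, p_value)
    return results


# Hypothetical toy data; not the real ESS responses.
toy = pd.DataFrame({
    'cntry': ['CH'] * 8,
    'idno':  [1, 2, 3, 4, 1, 2, 3, 4],
    'year':  [6, 6, 6, 6, 7, 7, 7, 7],
    'happy': [8.0, 7.0, 9.0, 6.0, 7.0, 7.0, 7.0, 5.0],
})

print(paired_ttest_by_country(toy, 'happy'))
```

Note that `pivot` assumes each idno appears at most once per country and year, which matches what we found for this data set.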


## Conclusion

Only Norway (NO) shows a significant change in happiness, and it is highly significant: the p-value is about 0.00002. The positive t-statistic (2012 minus 2014) means the Norwegian respondents were, on average, less happy in 2014 than they were in 2012.

## Problem 3

Who reported watching more TV in 2012, men or women?

In [225]:
stats.ttest_ind(
    df_main[(df_main['year'] == 6) & (df_main['gndr'] == 1)]['tvtot'].dropna(),
    df_main[(df_main['year'] == 6) & (df_main['gndr'] == 2)]['tvtot'].dropna(),
    axis=0, equal_var=False)

Ttest_indResult(statistic=-0.6899351288552367, pvalue=0.49027236479485914)

Neither. With a p-value of 0.49, we cannot reject the hypothesis that men and women watched the same amount of TV on average in 2012. The two samples are too similar to show a statistically significant difference.

## Problem 4

Who was more likely to believe people were fair in 2012, people living with a partner or people living alone?

Note that partner == 1.0 means the person HAS a partner, 2.0 == no partner. 

In [230]:
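# numeric_only=True skips the non-numeric 'cntry' column, which recent pandas versions refuse to average
df_main[(df_main['year'] == 6)].groupby('partner').mean(numeric_only=True)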

partner,idno,year,tvtot,ppltrst,pplfair,pplhlp,happy,sclmeet,sclact,gndr,agea
1.0,40822.379207,6,3.958427,5.660172,6.080736,5.471139,7.966979,5.054389,2.720965,1.479806,50.745075
2.0,37591.420792,6,3.855638,5.396648,5.856965,5.249533,7.319202,5.513932,2.772756,1.529084,42.215873


It looks like, in 2012, people with a partner felt that people are more fair. Let's see if this difference in the mean is statistically significant.

In [233]:
stats.ttest_ind(
    df_main[(df_main['year'] == 6) & (df_main['partner'] == 1)]['pplfair'].dropna(),
    df_main[(df_main['year'] == 6) & (df_main['partner'] == 2)]['pplfair'].dropna(),
    axis=0, equal_var=False)

Ttest_indResult(statistic=3.3027889746563357, pvalue=0.0009673897763136988)

With this low p-value (about 0.001), we can conclude that the means are significantly different. In 2012, people who lived with a partner believed people were fair more than people who lived alone did.

## Problem 5

Pick three or four of the countries in the sample and compare how often people met socially in 2014. Are there differences, and if so, which countries stand out?

We will use the 1-way ANOVA to see if the populations all have the same mean. We will use Spain, Norway, and Sweden as our countries to test. 

In [237]:
stats.f_oneway(
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'ES')]['sclmeet'].dropna(),
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'NO')]['sclmeet'].dropna(),
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'SE')]['sclmeet'].dropna())
    

F_onewayResult(statistic=3.726110434690217, pvalue=0.02420388186106648)

With a p-value of 0.024 we have a statistically significant result at the 0.05 level. If the three countries truly had the same mean, we would see an F-statistic at least this large only about one time in forty.

Let's go compare groups and find out who is different, or if they are all different. 

In [240]:
print('Comparing Spain and Norway:')
print(stats.ttest_ind(
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'ES')]['sclmeet'].dropna(),
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'NO')]['sclmeet'].dropna(),
    axis=0, equal_var=False))

print('Comparing Norway and Sweden:')
print(stats.ttest_ind(
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'NO')]['sclmeet'].dropna(),
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'SE')]['sclmeet'].dropna(),
    axis=0, equal_var=False))

print('Comparing Sweden and Spain:')
print(stats.ttest_ind(
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'SE')]['sclmeet'].dropna(),
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'ES')]['sclmeet'].dropna(),
    axis=0, equal_var=False))

Comparing Spain and Norway:
Ttest_indResult(statistic=-0.6463769188235257, pvalue=0.5181256302202837)
Comparing Norway and Sweden:
Ttest_indResult(statistic=-1.8342603800194623, pvalue=0.06680394483729)
Comparing Sweden and Spain:
Ttest_indResult(statistic=2.674925718628197, pvalue=0.007534965623985669)


## Conclusion

The p-value when comparing Spain and Norway is high (0.52), so we have no evidence that these two countries differ in how often people meet socially. Norway and Sweden give a p-value of 0.067, which is close to but above the 0.05 threshold, so that difference is not statistically significant. Sweden and Spain give a low p-value (0.008). Thus, Sweden stands out the most from the other two countries. The signs of the statistics tell us that Sweden has the higher mean value for meeting socially: the t-test subtracts the second input's mean from the first's, which is why the statistic is negative when comparing Norway (less social) to Sweden (more social) in that order.
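
Running three pairwise tests raises the chance of at least one false positive. One common and conservative way to account for this is a Bonferroni-adjusted threshold, which is not part of the analysis above. The following is only a sketch of the idea: the groups 'A', 'B' and 'C' are synthetic samples, not the survey responses.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical samples drawn from normal distributions; not the ESS data.
rng = np.random.default_rng(0)
groups = {
    'A': rng.normal(5.0, 1.0, size=200),
    'B': rng.normal(5.0, 1.0, size=200),
    'C': rng.normal(6.0, 1.0, size=200),
}

pairs = list(combinations(groups, 2))
alpha = 0.05
bonferroni_alpha = alpha / len(pairs)  # 0.05 split across 3 comparisons

results = {}
for first, second in pairs:
    # Welch's t-test (equal_var=False), as in the cells above.
    t_stat, p_value = stats.ttest_ind(groups[first], groups[second], equal_var=False)
    results[(first, second)] = (t_stat, p_value, p_value < bonferroni_alpha)
    print(first, second, round(t_stat, 3), p_value, p_value < bonferroni_alpha)
```

With three comparisons the adjusted threshold is roughly 0.0167, so of the p-values reported above only the Sweden and Spain comparison (0.0075) would still fall below it.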

In [242]:
# further justification
# here are the means of each country's social meeting metric in 2014
df_main[(df_main['year'] == 7)].groupby('cntry')['sclmeet'].mean()

cntry
CH    5.160622
CZ    4.445802
DE    4.428571
ES    5.260116
NO    5.302326
SE    5.426211
Name: sclmeet, dtype: float64

We see that Sweden (SE) has the highest 'sclmeet' mean value compared to Norway (NO) and Spain (ES), and to all the other countries as well, though we have not run proper significance tests for those.

## Problem 6

Pick three or four of the countries in the sample and compare how often people took part in social activities, relative to others their age, in 2014. Are there differences, and if so, which countries stand out?

In [243]:
print('Comparing Spain and Norway:')
print(stats.ttest_ind(
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'ES')]['sclact'].dropna(),
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'NO')]['sclact'].dropna(),
    axis=0, equal_var=False))

print('Comparing Norway and Sweden:')
print(stats.ttest_ind(
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'NO')]['sclact'].dropna(),
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'SE')]['sclact'].dropna(),
    axis=0, equal_var=False))

print('Comparing Sweden and Spain:')
print(stats.ttest_ind(
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'SE')]['sclact'].dropna(),
    df_main[(df_main['year'] == 7) & (df_main['cntry'] == 'ES')]['sclact'].dropna(),
    axis=0, equal_var=False))

Comparing Spain and Norway:
Ttest_indResult(statistic=-5.8598993324432165, pvalue=5.546987698432352e-09)
Comparing Norway and Sweden:
Ttest_indResult(statistic=-0.4842348224892412, pvalue=0.6282861634242631)
Comparing Sweden and Spain:
Ttest_indResult(statistic=6.58938516532171, pvalue=5.612166916523823e-11)


## Conclusion

Here, Spain stands out. Norway and Sweden have similar means for how often individuals take part in social activities compared to others of the same age. When either of those two countries is compared to Spain, we see an extremely small p-value, which is strong evidence that those means are not the same.

Individuals in Spain report taking part in social activities less often, relative to others their age, than individuals in Norway or Sweden do. It seems individuals in Spain feel they are more solitary than individuals in Norway or Sweden.

In [244]:
# further justification
# here are the means of each country's social activity metric,
# relative to others their age, in 2014
df_main[(df_main['year'] == 7)].groupby('cntry')['sclact'].mean()

cntry
CH    2.781699
CZ    2.703077
DE    2.714286
ES    2.616878
NO    2.859097
SE    2.879425
Name: sclact, dtype: float64

Again we see this here. Spain (ES) has the smallest value for this parameter. We have run the significance tests, and the p-values for Spain versus Norway and Spain versus Sweden are both below 1e-8, so differences this large would be extremely unlikely if the means were truly equal.