# In Depth A/B Testing - Lab

## Introduction

In this lab, you'll explore a survey from Kaggle regarding budding data scientists. With this, you'll form some initial hypotheses, and test them using the tools you've acquired to date. 

## Objectives

You will be able to:
* Conduct t-tests and an ANOVA on a real-world dataset and interpret the results

## Load the Dataset and Perform a Brief Exploration

The data is stored in a file called **multipleChoiceResponses_cleaned.csv**. Feel free to check out the original dataset referenced at the bottom of this lab, although this cleaned version will be easier to work with. Additionally, metadata regarding the questions is stored in a file named **schema.csv**. Load the data itself as a Pandas DataFrame, and take a moment to get acquainted with it.

> Note: If you can't get the file to load properly, try changing the encoding format as in `encoding='latin1'`
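
If you are unsure which encoding a file uses, one option is to try a short list of encodings in order. The sketch below is illustrative only: `read_csv_with_fallback` and the temporary demo file are hypothetical names made up for this example, not part of the lab's materials.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Hypothetical helper: return the first DataFrame that decodes cleanly."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            # This encoding could not decode the file; try the next one.
            last_error = err
    raise last_error


# Demo on a throwaway file written in latin1 (not the survey data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="latin1") as fh:
        fh.write("Country,Age\nCôte d'Ivoire,30\n")
    demo = read_csv_with_fallback(demo_path)

print(demo)
```

Because `latin1` can decode any byte sequence, it never fails; keep it last in the list and check that non-ASCII text looks right after loading.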

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [2]:
#Your code here
df = pd.read_csv('multipleChoiceResponses_cleaned.csv', encoding='latin1')
df.head()

Unnamed: 0,GenderSelect,Country,Age,EmploymentStatus,StudentStatus,LearningDataScience,CodeWriter,CareerSwitcher,CurrentJobTitleSelect,TitleFit,...,JobFactorTitle,JobFactorCompanyFunding,JobFactorImpact,JobFactorRemote,JobFactorIndustry,JobFactorLeaderReputation,JobFactorDiversity,JobFactorPublishingOpportunity,exchangeRate,AdjustedCompensation
0,"Non-binary, genderqueer, or gender non-conforming",,,Employed full-time,,,Yes,,DBA/Database Engineer,Fine,...,,,,,,,,,,
1,Female,United States,30.0,"Not employed, but looking for work",,,,,,,...,,,,,,Somewhat important,,,,
2,Male,Canada,28.0,"Not employed, but looking for work",,,,,,,...,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,,
3,Male,United States,56.0,"Independent contractor, freelancer, or self-em...",,,Yes,,Operations Research Practitioner,Poorly,...,,,,,,,,,1.0,250000.0
4,Male,Taiwan,38.0,Employed full-time,,,Yes,,Computer Scientist,Fine,...,,,,,,,,,,


## Wages and Education

You've been asked to determine whether education affects salary. Develop a hypothesis test to compare the salaries of those with Master's degrees to those with Bachelor's degrees. Are the two statistically different according to your results?

> Note: The relevant features are stored in the 'FormalEducation' and 'AdjustedCompensation' columns.

You may import the functions stored in the `flatiron_stats.py` file to help perform your hypothesis tests. It contains the stats functions that you previously coded: `welch_t(a, b)`, `welch_df(a, b)`, and `p_value_welch_ttest(a, b, two_sided=False)`.

Note that `scipy.stats.ttest_ind(a, b, equal_var=False)` performs a two-sided Welch's t-test and that p-values derived from two-sided tests are two times the p-values derived from one-sided tests. See the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for more information.    
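
As a quick check of the relationship described above, the following sketch runs Welch's t-test on synthetic data (the two samples are made up, not drawn from the survey) and compares the two-sided p-value with the one-sided ones. It assumes SciPy 1.6 or later, where `ttest_ind` accepts the `alternative` argument.

```python
import numpy as np
from scipy import stats

# Synthetic samples with unequal variances and sizes (illustrative only).
rng = np.random.default_rng(0)
a = rng.normal(loc=50_000, scale=20_000, size=200)
b = rng.normal(loc=55_000, scale=30_000, size=150)

# equal_var=False selects Welch's t-test instead of Student's t-test.
two_sided = stats.ttest_ind(a, b, equal_var=False)
less = stats.ttest_ind(a, b, equal_var=False, alternative="less")
greater = stats.ttest_ind(a, b, equal_var=False, alternative="greater")

# The two one-sided p-values are complementary; the two-sided p-value
# is twice the smaller of them.
smaller_tail = min(less.pvalue, greater.pvalue)
print(two_sided.pvalue, 2 * smaller_tail)
```

The doubling applies to the one-sided test taken in the direction of the observed difference, which is why the smaller of the two one-sided p-values is used here.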

In [None]:
"""
Null hypothesis: people with a Master's degree earn the same, on average, as those with a Bachelor's degree

Alternative hypothesis: people with a Master's degree do not earn the same, on average, as those with a Bachelor's degree
"""

In [3]:
#Your code here
import numpy as np
import scipy.stats as stats

def welch_t(a, b):
    
    """ Calculate Welch's t statistic for two samples. """

    numerator = a.mean() - b.mean()
    
    # ddof ("Delta Degrees of Freedom"): the divisor used in the variance is N - ddof,
    # where N is the number of elements. NumPy defaults to ddof=0 and pandas to ddof=1,
    # so ddof=1 is passed explicitly to get the sample variance in either case.
    
    denominator = np.sqrt(a.var(ddof=1)/a.size + b.var(ddof=1)/b.size)
    
    return np.abs(numerator/denominator)

def welch_df(a, b):
    
    """ Calculate the effective degrees of freedom for two samples. This function returns the degrees of freedom """
    
    s1 = a.var(ddof=1) 
    s2 = b.var(ddof=1)
    n1 = a.size
    n2 = b.size
    
    numerator = (s1/n1 + s2/n2)**2
    denominator = (s1/ n1)**2/(n1 - 1) + (s2/ n2)**2/(n2 - 1)
    
    return numerator/denominator


def p_value_welch_ttest(a, b, two_sided=False):
    """Calculates the p-value for Welch's t-test given two samples.
    By default, the returned p-value is for a one-sided t-test. 
    Set the two-sided parameter to True if you wish to perform a two-sided t-test instead.
    """
    t = welch_t(a, b)
    df = welch_df(a, b)
    
    p = 1-stats.t.cdf(np.abs(t), df)
    
    if two_sided:
        return 2*p
    else:
        return p

In [4]:
df1 = df.loc[ :, ['FormalEducation', 'AdjustedCompensation']]
df1

Unnamed: 0,FormalEducation,AdjustedCompensation
0,Bachelor's degree,
1,Master's degree,
2,Master's degree,
3,Master's degree,250000.0
4,Doctoral degree,
...,...,...
26389,Master's degree,
26390,Bachelor's degree,
26391,,
26392,I prefer not to answer,


In [5]:
df1 = df1.dropna()

In [6]:
df1.isna().sum()

FormalEducation         0
AdjustedCompensation    0
dtype: int64

In [7]:
df1

Unnamed: 0,FormalEducation,AdjustedCompensation
3,Master's degree,250000.000
8,Bachelor's degree,64184.800
9,Bachelor's degree,20882.400
11,Bachelor's degree,1483.900
14,Master's degree,36634.400
...,...,...
26185,Bachelor's degree,50000.000
26195,Bachelor's degree,100449.384
26203,Doctoral degree,200000.000
26255,Master's degree,89686.950


In [8]:
df1.loc[(df1["FormalEducation"].isin(["Bachelor's degree", "Master's degree"]))]

Unnamed: 0,FormalEducation,AdjustedCompensation
3,Master's degree,250000.000
8,Bachelor's degree,64184.800
9,Bachelor's degree,20882.400
11,Bachelor's degree,1483.900
14,Master's degree,36634.400
...,...,...
26180,Master's degree,65770.430
26185,Bachelor's degree,50000.000
26195,Bachelor's degree,100449.384
26255,Master's degree,89686.950


In [9]:
group_1 = df1.loc[(df1["FormalEducation"].isin(["Bachelor's degree"]))]
group_1            

Unnamed: 0,FormalEducation,AdjustedCompensation
8,Bachelor's degree,64184.800
9,Bachelor's degree,20882.400
11,Bachelor's degree,1483.900
21,Bachelor's degree,20000.000
25,Bachelor's degree,10858.848
...,...,...
26031,Bachelor's degree,39050.000
26072,Bachelor's degree,31878.000
26101,Bachelor's degree,3336.000
26185,Bachelor's degree,50000.000


In [10]:
group_2 = df1.loc[(df1["FormalEducation"].isin(["Master's degree"]))]
group_2

Unnamed: 0,FormalEducation,AdjustedCompensation
3,Master's degree,250000.000
14,Master's degree,36634.400
27,Master's degree,53352.000
31,Master's degree,35419.104
37,Master's degree,80000.000
...,...,...
26148,Master's degree,54670.000
26159,Master's degree,1.000
26180,Master's degree,65770.430
26255,Master's degree,89686.950


In [11]:
a = group_1['AdjustedCompensation']

In [12]:
b = group_2['AdjustedCompensation']

In [13]:
t = welch_t(a, b)
t

0.43786693335411514

In [14]:
dof = welch_df(a, b)
dof

1350.0828973008781

In [15]:
p = p_value_welch_ttest(a, b, two_sided=False)
p

0.33077639451272445

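Because the p-value (about 0.33) is greater than the alpha level of 0.05, we fail to reject the null hypothesis: this test does not show a statistically significant difference between the salaries of those with Master's degrees and those with Bachelor's degrees. Note that this is the one-sided p-value (`two_sided=False`); the two-sided p-value, which matches the alternative hypothesis stated above, is twice as large (about 0.66) and leads to the same conclusion.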

## Wages and Education II

Now perform a similar statistical test comparing the AdjustedCompensation of those with Bachelor's degrees and those with Doctorates. If you haven't already, be sure to explore the distribution of the AdjustedCompensation feature for any anomalies. 
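
One possible way to look for anomalies is to combine summary statistics with the 1.5 × IQR rule. The sketch below uses a small hypothetical Series as a stand-in for the real column; the values (including the extreme one) are made up for illustration, and the IQR rule is only one of several reasonable outlier heuristics.

```python
import pandas as pd

# Hypothetical stand-in values; in the lab you would use
# df['AdjustedCompensation'].dropna() instead.
comp = pd.Series(
    [1.0, 20_000, 36_000, 50_000, 64_000, 80_000, 90_000, 250_000, 28_000_000]
)

# A mean far above the median is a first hint of extreme values.
print(comp.describe())

# Flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = comp.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = comp[(comp < lower_fence) | (comp > upper_fence)]
print(outliers)
```

Because the t-test compares means, a handful of extreme salaries can inflate the variance and hide a real difference, so it is worth rerunning the test with and without such values and reporting which version you used.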

In [None]:
"""
Null hypothesis: people with a Doctoral degree earn the same, on average, as those with a Bachelor's degree

Alternative hypothesis: people with a Doctoral degree do not earn the same, on average, as those with a Bachelor's degree
"""

In [16]:
#Your code here
df2 = df.loc[ :, ['FormalEducation', 'AdjustedCompensation']]
df2


Unnamed: 0,FormalEducation,AdjustedCompensation
0,Bachelor's degree,
1,Master's degree,
2,Master's degree,
3,Master's degree,250000.0
4,Doctoral degree,
...,...,...
26389,Master's degree,
26390,Bachelor's degree,
26391,,
26392,I prefer not to answer,


In [17]:
df2 = df2.dropna()

In [18]:
df2.isna().sum()

FormalEducation         0
AdjustedCompensation    0
dtype: int64

In [19]:
df2.loc[(df2["FormalEducation"].isin(["Bachelor's degree", "Doctoral degree"]))]

Unnamed: 0,FormalEducation,AdjustedCompensation
8,Bachelor's degree,64184.800
9,Bachelor's degree,20882.400
11,Bachelor's degree,1483.900
21,Bachelor's degree,20000.000
22,Doctoral degree,100000.000
...,...,...
26072,Bachelor's degree,31878.000
26101,Bachelor's degree,3336.000
26185,Bachelor's degree,50000.000
26195,Bachelor's degree,100449.384


In [20]:
group_3 = df2.loc[(df2["FormalEducation"].isin(["Bachelor's degree"]))]
group_3

Unnamed: 0,FormalEducation,AdjustedCompensation
8,Bachelor's degree,64184.800
9,Bachelor's degree,20882.400
11,Bachelor's degree,1483.900
21,Bachelor's degree,20000.000
25,Bachelor's degree,10858.848
...,...,...
26031,Bachelor's degree,39050.000
26072,Bachelor's degree,31878.000
26101,Bachelor's degree,3336.000
26185,Bachelor's degree,50000.000


In [21]:
group_4 = df2.loc[(df2["FormalEducation"].isin(["Doctoral degree"]))]
group_4

Unnamed: 0,FormalEducation,AdjustedCompensation
22,Doctoral degree,100000.000
32,Doctoral degree,172144.440
34,Doctoral degree,133000.000
61,Doctoral degree,15000.000
72,Doctoral degree,43049.736
...,...,...
25875,Doctoral degree,71749.560
25966,Doctoral degree,12000.000
26012,Doctoral degree,123553.200
26038,Doctoral degree,170000.000


In [22]:
a1 = group_3['AdjustedCompensation']

In [23]:
b1 = group_4['AdjustedCompensation']

In [24]:
t2 = welch_t(a1, b1)

In [25]:
dof2 = welch_df(a1, b1)

In [26]:
p2 = p_value_welch_ttest(a1, b1, two_sided=False)
p2

0.15682381994720251

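Because the p-value (about 0.16) is greater than the alpha level of 0.05, we fail to reject the null hypothesis: this test does not show a statistically significant difference between the salaries of those with Doctoral degrees and those with Bachelor's degrees. As before, this is the one-sided p-value; the two-sided p-value is twice as large (about 0.31) and leads to the same conclusion.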

## Wages and Education III

Remember the multiple comparisons problem; rather than continuing on like this, perform an ANOVA test between the various 'FormalEducation' categories and their relation to 'AdjustedCompensation'.
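
To see why repeated pairwise tests are a problem, the sketch below computes the familywise error rate, the chance of at least one false positive across several tests. It assumes the tests are independent, which pairwise comparisons that share groups are not, so treat the numbers as a rough guide. The count of seven categories is inferred from the 6 degrees of freedom in the ANOVA output below; `familywise_error_rate` is a name chosen for this example.

```python
from math import comb


def familywise_error_rate(alpha, n_tests):
    """P(at least one false positive) across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


alpha = 0.05
n_groups = 7                   # assumption: 6 ANOVA degrees of freedom + 1
n_pairs = comb(n_groups, 2)    # number of distinct pairwise comparisons

fwer_three = familywise_error_rate(alpha, 3)
fwer_all_pairs = familywise_error_rate(alpha, n_pairs)

print(f"3 tests:  {fwer_three:.3f}")
print(f"{n_pairs} tests: {fwer_all_pairs:.3f}")
```

Even three tests at alpha = 0.05 push the chance of a spurious finding above 14%, which is the motivation for a single omnibus test such as ANOVA.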

In [29]:
#Your code here
import statsmodels.api as sm
from statsmodels.formula.api import ols

formula = "AdjustedCompensation ~ C(FormalEducation)"
lm = ols(formula, df).fit()
sm.stats.anova_lm(lm, typ=2)

Unnamed: 0,sum_sq,df,F,PR(>F)
C(FormalEducation),6.540294e+17,6.0,0.590714,0.738044
Residual,7.999414e+20,4335.0,,
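
To make the table easier to read, the sketch below recomputes the F statistic and its p-value from the sums of squares and degrees of freedom printed above. The inputs are copied from that output, so small rounding differences are expected.

```python
from scipy import stats

# Values copied from the ANOVA table above.
ss_between, df_between = 6.540294e17, 6    # C(FormalEducation) row
ss_resid, df_resid = 7.999414e20, 4335     # Residual row

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_between = ss_between / df_between
ms_resid = ss_resid / df_resid

# F is the ratio of between-group to within-group mean squares.
f_stat = ms_between / ms_resid

# p-value: upper tail of the F distribution with (6, 4335) degrees of freedom.
p_val = stats.f.sf(f_stat, df_between, df_resid)

n_groups = df_between + 1        # 7 education categories
n_obs = df_resid + n_groups      # rows used in the fit

print(f"F = {f_stat:.4f}, p = {p_val:.4f}, groups = {n_groups}, n = {n_obs}")
```

Since PR(>F) is about 0.74, far above 0.05, this ANOVA does not provide evidence that mean AdjustedCompensation differs across the FormalEducation categories. Keep in mind that extreme values in AdjustedCompensation inflate the residual variance and can mask real differences.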


## Additional Resources

Here's the original source where the data was taken from:  
    [Kaggle Machine Learning & Data Science Survey 2017](https://www.kaggle.com/kaggle/kaggle-survey-2017)

## Summary

In this lab, you practiced conducting actual hypothesis tests on actual data. From this, you saw how dependent results can be on the initial problem formulation, including preprocessing!