# In Depth A/B Testing - Lab

## Introduction

In this lab, you'll explore a Kaggle survey of budding data scientists. Using this data, you'll form some initial hypotheses and test them with the tools you've learned so far.

## Objectives

You will be able to:
* Conduct statistical tests on a real-world dataset

## Load the Dataset and Perform a Brief Exploration

The data is stored in a file called **multipleChoiceResponses_cleaned.csv**. Feel free to check out the original dataset referenced at the bottom of this lab, although this cleaned version will be easier to work with. Additionally, metadata regarding the questions is stored in a file named **schema.csv**. Load the data as a pandas DataFrame, and take a moment to get acquainted with it.

> Note: If the file doesn't load properly, try specifying the encoding explicitly, for example `encoding='latin1'`.
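
The encoding note above can be turned into a small helper. This is a minimal sketch, assuming pandas raises `UnicodeDecodeError` when a file is not valid in the requested encoding; the function name `read_csv_with_fallback` and its `encodings` parameter are hypothetical and not part of the lab materials.

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in order and return the first DataFrame that loads."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            # Remember the failure and move on to the next candidate encoding
            last_error = err
    raise last_error
```

With this helper, `read_csv_with_fallback('multipleChoiceResponses_cleaned.csv')` would fall back to `latin1` only if the UTF-8 attempt fails.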

In [19]:
#Your code here
import pandas as pd

df = pd.read_csv('multipleChoiceResponses_cleaned.csv', encoding='latin1')
df.head()

Unnamed: 0,GenderSelect,Country,Age,EmploymentStatus,StudentStatus,LearningDataScience,CodeWriter,CareerSwitcher,CurrentJobTitleSelect,TitleFit,...,JobFactorTitle,JobFactorCompanyFunding,JobFactorImpact,JobFactorRemote,JobFactorIndustry,JobFactorLeaderReputation,JobFactorDiversity,JobFactorPublishingOpportunity,exchangeRate,AdjustedCompensation
0,"Non-binary, genderqueer, or gender non-conforming",,,Employed full-time,,,Yes,,DBA/Database Engineer,Fine,...,,,,,,,,,,
1,Female,United States,30.0,"Not employed, but looking for work",,,,,,,...,,,,,,Somewhat important,,,,
2,Male,Canada,28.0,"Not employed, but looking for work",,,,,,,...,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,,
3,Male,United States,56.0,"Independent contractor, freelancer, or self-em...",,,Yes,,Operations Research Practitioner,Poorly,...,,,,,,,,,1.0,250000.0
4,Male,Taiwan,38.0,Employed full-time,,,Yes,,Computer Scientist,Fine,...,,,,,,,,,,


## Wages and Education

You've been asked to determine whether education affects salary. Develop a hypothesis test to compare the salaries of those with Master's degrees to those with Bachelor's degrees. Is the difference between the two groups statistically significant according to your results?

> Note: The relevant data is stored in the 'FormalEducation' and 'AdjustedCompensation' columns.
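
Because the two groups have different sizes and may have different variances, Welch's t-test is a natural choice here. The sketch below is one way to compute it with NumPy and SciPy; it is not the course's `flatiron_stats` implementation, and the function name `welch_t_test` and the synthetic salaries are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

def welch_t_test(a, b):
    """Welch's t-test for H1: mean(b) > mean(a).

    Returns the t statistic, the Welch-Satterthwaite degrees of freedom,
    and the one-sided p-value.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t = (b.mean() - a.mean()) / np.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1)
    )
    p_one_sided = stats.t.sf(t, dof)  # P(T >= t) under H0
    return t, dof, p_one_sided

# Synthetic data standing in for two education groups (not the survey data)
rng = np.random.default_rng(42)
group_a = rng.normal(60_000, 30_000, size=400)
group_b = rng.normal(66_000, 45_000, size=700)

t_stat, dof, p_value = welch_t_test(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.1f}, one-sided p = {p_value:.4f}")
```

The t statistic can be cross-checked against `scipy.stats.ttest_ind(group_b, group_a, equal_var=False)`, which reports a two-sided p-value for the same statistic.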

In [73]:
#Your code here

#Create dataframe with variables of interest.  Exclude NaN
# .dropna() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning

df_hypothesis = df[['FormalEducation', 'AdjustedCompensation']].dropna(axis=0)
df_hypothesis.head(10)



Unnamed: 0,FormalEducation,AdjustedCompensation
3,Master's degree,250000.0
8,Bachelor's degree,64184.8
9,Bachelor's degree,20882.4
11,Bachelor's degree,1483.9
14,Master's degree,36634.4
21,Bachelor's degree,20000.0
22,Doctoral degree,100000.0
23,Some college/university study without earning ...,916.4
25,Bachelor's degree,10858.848
27,Master's degree,53352.0


In [68]:
# H0: There is no difference in mean AdjustedCompensation between those with a Bachelor's and those with a Master's
# H1: Those with a Master's have a higher mean AdjustedCompensation (one-sided, matching two_sided=False below)

alpha = 0.05

sample_bachelors = df_hypothesis.loc[(df_hypothesis['FormalEducation']=="Bachelor's degree"), ['AdjustedCompensation']]
sample_masters = df_hypothesis.loc[(df_hypothesis['FormalEducation']=="Master's degree"), ['AdjustedCompensation']]

import flatiron_stats as fs

p = fs.p_value_welch_ttest(sample_bachelors, sample_masters, two_sided=False)
print(f"Median Values: \tbachelors:{round(sample_bachelors['AdjustedCompensation'].median(),2)} \tmasters: {round(sample_masters['AdjustedCompensation'].median(),2)}")
print(f"Mean Values: \tbachelors: {round(sample_bachelors['AdjustedCompensation'].mean(),2)} \tmasters:{round(sample_masters['AdjustedCompensation'].mean(),2)}")
print(f"Sample Sizes: \tbachelors: {len(sample_bachelors)} \tmasters: {len(sample_masters)}")
print(f"Welch's ttest pvalue {p}")

# If p > alpha, we cannot reject the null hypothesis

Median Values: 	bachelors:38399.4 	masters: 53812.17
Mean Values: 	bachelors: 64887.1 	masters:69139.9
Sample Sizes: 	bachelors: 1107 	masters: 1990
Welch's ttest pvalue [0.33077639]


In [69]:
# Look for outliers by inspecting the upper quantiles (0.80 through 1.00) of each group

import numpy as np

for q in np.linspace(0.8,1, num=21):
    bachelorsq = round(sample_bachelors['AdjustedCompensation'].quantile(q=q),2)
    mastersq = round(sample_masters['AdjustedCompensation'].quantile(q=q),2)
    print(f"{round(q,2)}th percentile: \tbachelors: {bachelorsq}, \tmasters: {mastersq}")
                

0.8th percentile: 	bachelors: 93233.13, 	masters: 103000.0
0.81th percentile: 	bachelors: 95572.83, 	masters: 107009.0
0.82th percentile: 	bachelors: 99276.38, 	masters: 110000.0
0.83th percentile: 	bachelors: 100000.0, 	masters: 111503.83
0.84th percentile: 	bachelors: 103040.0, 	masters: 115240.4
0.85th percentile: 	bachelors: 105935.04, 	masters: 119582.6
0.86th percentile: 	bachelors: 110000.0, 	masters: 120000.0
0.87th percentile: 	bachelors: 112000.0, 	masters: 124719.88
0.88th percentile: 	bachelors: 115000.0, 	masters: 129421.46
0.89th percentile: 	bachelors: 120000.0, 	masters: 130000.0
0.9th percentile: 	bachelors: 120346.5, 	masters: 135000.0
0.91th percentile: 	bachelors: 126460.0, 	masters: 140000.0
0.92th percentile: 	bachelors: 132615.4, 	masters: 149640.0
0.93th percentile: 	bachelors: 140000.0, 	masters: 150000.0
0.94th percentile: 	bachelors: 143408.8, 	masters: 160000.0
0.95th percentile: 	bachelors: 150000.0, 	masters: 166778.6
0.96th percentile: 	bachelors: 179849.

In [70]:
# Drop observations where AdjustedCompensation is greater than 500K

sample_bachelors = sample_bachelors.loc[sample_bachelors['AdjustedCompensation']<=500000]
sample_masters = sample_masters.loc[sample_masters['AdjustedCompensation']<=500000]

In [71]:
for q in np.linspace(0.8,1, num=21):
    bachelorsq = round(sample_bachelors['AdjustedCompensation'].quantile(q=q),2)
    mastersq = round(sample_masters['AdjustedCompensation'].quantile(q=q),2)
    print(f"{round(q,2)}th percentile: \tbachelors: {bachelorsq}, \tmasters: {mastersq}")

0.8th percentile: 	bachelors: 91632.0, 	masters: 102134.74
0.81th percentile: 	bachelors: 95000.0, 	masters: 106095.46
0.82th percentile: 	bachelors: 97971.2, 	masters: 110000.0
0.83th percentile: 	bachelors: 100000.0, 	masters: 110000.0
0.84th percentile: 	bachelors: 102504.11, 	masters: 115000.0
0.85th percentile: 	bachelors: 105000.0, 	masters: 119582.6
0.86th percentile: 	bachelors: 110000.0, 	masters: 120000.0
0.87th percentile: 	bachelors: 110000.0, 	masters: 120346.5
0.88th percentile: 	bachelors: 115000.0, 	masters: 126780.12
0.89th percentile: 	bachelors: 117732.86, 	masters: 130000.0
0.9th percentile: 	bachelors: 120000.0, 	masters: 132251.28
0.91th percentile: 	bachelors: 125000.0, 	masters: 140000.0
0.92th percentile: 	bachelors: 130000.0, 	masters: 145000.0
0.93th percentile: 	bachelors: 137930.0, 	masters: 150000.0
0.94th percentile: 	bachelors: 140000.0, 	masters: 155457.38
0.95th percentile: 	bachelors: 150000.0, 	masters: 165000.0
0.96th percentile: 	bachelors: 174200.

In [72]:


p = fs.p_value_welch_ttest(sample_bachelors, sample_masters, two_sided=False)
print(f"Median Values: \tbachelors:{round(sample_bachelors['AdjustedCompensation'].median(),2)} \tmasters: {round(sample_masters['AdjustedCompensation'].median(),2)}")
print(f"Mean Values: \tbachelors: {round(sample_bachelors['AdjustedCompensation'].mean(),2)} \tmasters:{round(sample_masters['AdjustedCompensation'].mean(),2)}")
print(f"Sample Sizes: \tbachelors: {len(sample_bachelors)} \tmasters: {len(sample_masters)}")
print(f"Welch's ttest pvalue {p}")

Median Values: 	bachelors:38292.15 	masters: 53539.72
Mean Values: 	bachelors: 53744.35 	masters:63976.63
Sample Sizes: 	bachelors: 1103 	masters: 1985
Welch's ttest pvalue [4.48745833e-07]


In [None]:
# With compensation above 500K excluded, p < alpha, so the null hypothesis can be rejected
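
The shift in the p-value from about 0.33 to about 4.5e-07 after dropping only a handful of observations shows how sensitive a t-test on means is to extreme values. The following sketch reproduces that effect on synthetic data; the group means, standard deviations, and the single 30,000,000 outlier are made-up values chosen for illustration, not taken from the survey.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two synthetic groups with a real difference in means
low = rng.normal(50_000, 20_000, size=300)
high = rng.normal(65_000, 20_000, size=300)

# Add one extreme value to the lower-paid group
low_with_outlier = np.append(low, 30_000_000)

# Welch's t-test (two-sided) with and without the extreme value
p_clean = stats.ttest_ind(low, high, equal_var=False).pvalue
p_outlier = stats.ttest_ind(low_with_outlier, high, equal_var=False).pvalue

print(f"p-value without outlier: {p_clean:.3g}")
print(f"p-value with outlier:    {p_outlier:.3g}")
```

A single extreme value inflates the sample variance so much that the standard error swamps the true difference, which is why inspecting the upper quantiles before testing is worthwhile.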

## Wages and Education II

Now perform a similar statistical test comparing the AdjustedCompensation of those with Bachelor's degrees and those with Doctorates. If you haven't already, be sure to explore the distribution of the AdjustedCompensation feature for any anomalies. 
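
One compact way to explore the upper tail of a distribution is to tabulate its upper quantiles with percentile labels, so that a quantile of 0.8 is shown as the 80th percentile. This is a minimal sketch on synthetic data; the helper name `upper_quantiles` is hypothetical.

```python
import numpy as np
import pandas as pd

def upper_quantiles(values, start=0.80, stop=1.00, num=21):
    """Return the quantiles from `start` to `stop`, labelled as percentages."""
    qs = np.linspace(start, stop, num=num)
    result = values.quantile(qs)
    # Relabel 0.8 as '80%', 0.81 as '81%', and so on
    result.index = [f"{q:.0%}" for q in qs]
    return result

# Synthetic, right-skewed "salary" data (not the survey data)
rng = np.random.default_rng(1)
salaries = pd.Series(rng.lognormal(mean=10.8, sigma=0.8, size=1_000))

print(upper_quantiles(salaries))
```

A large jump between the last few rows (for example between the 99% and 100% rows) is a hint that a few extreme values dominate the tail.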

In [75]:
#Your code here

sample_bachelors = df_hypothesis.loc[(df_hypothesis['FormalEducation']=="Bachelor's degree"), ['AdjustedCompensation']]
sample_doctorate = df_hypothesis.loc[(df_hypothesis['FormalEducation']=="Doctoral degree"), ['AdjustedCompensation']]


In [76]:
for q in np.linspace(0.8,1, num=21):
    bachelorsq = round(sample_bachelors['AdjustedCompensation'].quantile(q=q),2)
    doctorsq = round(sample_doctorate['AdjustedCompensation'].quantile(q=q),2)
    print(f"{round(q,2)}th percentile: \tbachelors: {bachelorsq}, \tdoctorate: {doctorsq}")

0.8th percentile: 	bachelors: 93233.13, 	doctorate: 135000.0
0.81th percentile: 	bachelors: 95572.83, 	doctorate: 140000.0
0.82th percentile: 	bachelors: 99276.38, 	doctorate: 140000.0
0.83th percentile: 	bachelors: 100000.0, 	doctorate: 146796.17
0.84th percentile: 	bachelors: 103040.0, 	doctorate: 150000.0
0.85th percentile: 	bachelors: 105935.04, 	doctorate: 150000.0
0.86th percentile: 	bachelors: 110000.0, 	doctorate: 155000.0
0.87th percentile: 	bachelors: 112000.0, 	doctorate: 160000.0
0.88th percentile: 	bachelors: 115000.0, 	doctorate: 160000.0
0.89th percentile: 	bachelors: 120000.0, 	doctorate: 166480.0
0.9th percentile: 	bachelors: 120346.5, 	doctorate: 172057.78
0.91th percentile: 	bachelors: 126460.0, 	doctorate: 175000.0
0.92th percentile: 	bachelors: 132615.4, 	doctorate: 181555.2
0.93th percentile: 	bachelors: 140000.0, 	doctorate: 191900.0
0.94th percentile: 	bachelors: 143408.8, 	doctorate: 200000.0
0.95th percentile: 	bachelors: 150000.0, 	doctorate: 200000.0
0.96th 

In [77]:
# Drop observations where AdjustedCompensation is greater than 500K

sample_bachelors = sample_bachelors.loc[sample_bachelors['AdjustedCompensation'] <= 500000]
sample_doctorate = sample_doctorate.loc[sample_doctorate['AdjustedCompensation'] <= 500000]

for q in np.linspace(0.8,1, num=21):
    bachelorsq = round(sample_bachelors['AdjustedCompensation'].quantile(q=q),2)
    doctorsq = round(sample_doctorate['AdjustedCompensation'].quantile(q=q),2)
    print(f"{round(q,2)}th percentile: \tbachelors: {bachelorsq}, \tdoctorate: {doctorsq}")

0.8th percentile: 	bachelors: 91632.0, 	doctorate: 135000.0
0.81th percentile: 	bachelors: 95000.0, 	doctorate: 137081.5
0.82th percentile: 	bachelors: 97971.2, 	doctorate: 140000.0
0.83th percentile: 	bachelors: 100000.0, 	doctorate: 145311.31
0.84th percentile: 	bachelors: 102504.11, 	doctorate: 150000.0
0.85th percentile: 	bachelors: 105000.0, 	doctorate: 150000.0
0.86th percentile: 	bachelors: 110000.0, 	doctorate: 153360.0
0.87th percentile: 	bachelors: 110000.0, 	doctorate: 159810.0
0.88th percentile: 	bachelors: 115000.0, 	doctorate: 160000.0
0.89th percentile: 	bachelors: 117732.86, 	doctorate: 165000.0
0.9th percentile: 	bachelors: 120000.0, 	doctorate: 170052.5
0.91th percentile: 	bachelors: 125000.0, 	doctorate: 175000.0
0.92th percentile: 	bachelors: 130000.0, 	doctorate: 180000.0
0.93th percentile: 	bachelors: 137930.0, 	doctorate: 188108.39
0.94th percentile: 	bachelors: 140000.0, 	doctorate: 198628.2
0.95th percentile: 	bachelors: 150000.0, 	doctorate: 200000.0
0.96th pe

In [87]:
print(f"Median: \tbachelors {round(np.median(sample_bachelors['AdjustedCompensation']),2)} \tdoctorate {round(np.median(sample_doctorate['AdjustedCompensation']),2)} ")
print(f"Mean: \tbachelors {round(np.mean(sample_bachelors['AdjustedCompensation']),2)} \tdoctorate {round(np.mean(sample_doctorate['AdjustedCompensation']),2)}")
print(f"Number of Samples: \tbachelors {len(sample_bachelors)} \tdoctorate {len(sample_doctorate)}")
print(f"Welch's T Test P Value: {fs.p_value_welch_ttest(sample_bachelors, sample_doctorate, two_sided=False)}")

Median: 	bachelors 38292.15 	doctorate 73152.77 
Mean: 	bachelors 53744.35 	doctorate 86194.98
Number of Samples: 	bachelors 1103 	doctorate 964
Welch's T Test P Value: [0.]


## Wages and Education III

Remember the multiple comparisons problem; rather than continuing on like this, perform an ANOVA test between the various 'FormalEducation' categories and their relation to 'AdjustedCompensation'.
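
To see why repeated pairwise tests are a problem, it helps to compute how quickly the chance of at least one false positive grows. The sketch below assumes the tests are independent (a simplification) and assumes 7 education categories, which is consistent with the 6 degrees of freedom reported in this lab's ANOVA output; the function name is hypothetical.

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Number of pairwise tests among n_groups and the chance of at least
    one false positive, assuming independent tests each run at level alpha."""
    n_tests = comb(n_groups, 2)
    fwer = 1 - (1 - alpha) ** n_tests
    return n_tests, fwer

n_tests, fwer = familywise_error_rate(7)
print(f"{n_tests} pairwise tests, family-wise error rate ~ {fwer:.2f}")
```

With 7 groups there are 21 pairwise comparisons, and under these assumptions the chance of at least one spurious "significant" result is roughly two in three, which is the motivation for running a single ANOVA instead.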

In [89]:
#Your code here
import statsmodels.api as sm
from statsmodels.formula.api import ols

df_ANOVA = df_hypothesis.loc[df_hypothesis['AdjustedCompensation']<=500000]

formula = 'AdjustedCompensation ~ C(FormalEducation)'
lm = ols(formula, df_ANOVA).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

                          sum_sq      df          F        PR(>F)
C(FormalEducation)  5.841881e+11     6.0  29.224224  1.727132e-34
Residual            1.439270e+13  4320.0        NaN           NaN
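
As a sanity check on the table above, the F statistic can be recomputed from the sums of squares and degrees of freedom. The same numbers also give eta squared, a common effect-size measure that this lab does not report; it is included here only as an illustrative extra. The values below are copied from the printed output.

```python
# Values copied from the anova_lm output above
ss_between, df_between = 5.841881e11, 6.0
ss_residual, df_residual = 1.439270e13, 4320.0

# Mean squares: sum of squares divided by degrees of freedom
ms_between = ss_between / df_between
ms_residual = ss_residual / df_residual

# F is the ratio of between-group to within-group mean squares
f_stat = ms_between / ms_residual

# Eta squared: share of total variation attributed to FormalEducation
eta_squared = ss_between / (ss_between + ss_residual)

print(f"F = {f_stat:.3f}")
print(f"eta squared = {eta_squared:.4f}")
```

The recomputed F agrees with the reported value of about 29.22, while eta squared comes out near 0.039: the group differences are highly significant, yet education level accounts for only about 4% of the variation in compensation in this filtered sample.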


## Additional Resources

Here's the original source of the data: [Kaggle Machine Learning & Data Science Survey 2017](https://www.kaggle.com/kaggle/kaggle-survey-2017)

## Summary

In this lab, you practiced conducting hypothesis tests on real-world data. Along the way, you saw how strongly the results can depend on the initial problem formulation, including preprocessing choices such as outlier removal!