# In Depth A/B Testing - Lab

## Introduction

In this lab, you'll explore a survey from Kaggle regarding budding data scientists. With this, you'll form some initial hypotheses, and test them using the tools you've acquired to date. 

## Objectives

You will be able to:
* Conduct statistical tests on a real-world dataset

## Load the Dataset and Perform a Brief Exploration

The data is stored in a file called **multipleChoiceResponses_cleaned.csv**. Feel free to check out the original dataset referenced at the bottom of this lab, although this cleaned version will be easier to work with. Additionally, metadata regarding the questions is stored in a file named **schema.csv**. Load the data as a Pandas DataFrame, and take a moment to get acquainted with it.

> Note: If you can't get the file to load properly, try changing the encoding format as in `encoding='latin1'`
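
The encoding fallback mentioned in the note can be sketched with the standard library alone. This is only an illustration: `read_rows` is a hypothetical helper, and the two-line CSV is made up rather than taken from the survey. With pandas, the same idea amounts to retrying `pd.read_csv` with a different `encoding` argument.

```python
import csv
import io

def read_rows(raw_bytes, encodings=("utf-8", "latin1")):
    """Hypothetical helper: decode with the first encoding that works, then parse as CSV."""
    for encoding in encodings:
        try:
            text = raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
        return encoding, list(csv.DictReader(io.StringIO(text)))
    raise ValueError("none of the candidate encodings could decode the data")

# Made-up CSV content, encoded as latin1 so that strict UTF-8 decoding fails.
raw = "Country,Age\nC\u00f4te d'Ivoire,30\n".encode("latin1")
used_encoding, rows = read_rows(raw)
print(used_encoding, rows[0]["Country"])
```

Because latin1 can decode any byte sequence, it works as a last resort, but it may silently produce wrong characters if the file actually uses another encoding.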

In [2]:
#Your code here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [3]:
import warnings
warnings.filterwarnings("ignore")

In [27]:
data = pd.read_csv('multipleChoiceResponses_cleaned.csv', encoding='latin1')

In [5]:
data.columns

Index(['GenderSelect', 'Country', 'Age', 'EmploymentStatus', 'StudentStatus',
       'LearningDataScience', 'CodeWriter', 'CareerSwitcher',
       'CurrentJobTitleSelect', 'TitleFit',
       ...
       'JobFactorTitle', 'JobFactorCompanyFunding', 'JobFactorImpact',
       'JobFactorRemote', 'JobFactorIndustry', 'JobFactorLeaderReputation',
       'JobFactorDiversity', 'JobFactorPublishingOpportunity', 'exchangeRate',
       'AdjustedCompensation'],
      dtype='object', length=230)

In [18]:
data.FormalEducation[:5], data.AdjustedCompensation[:5]

(0    Bachelor's degree
 1      Master's degree
 2      Master's degree
 3      Master's degree
 4      Doctoral degree
 Name: FormalEducation, dtype: object, 0         NaN
 1         NaN
 2         NaN
 3    250000.0
 4         NaN
 Name: AdjustedCompensation, dtype: float64)

## Wages and Education

You've been asked to determine whether education affects salary. Develop a hypothesis test to compare the salaries of those with Master's degrees to those with Bachelor's degrees. Are the two significantly different according to your results?

> Note: The relevant data is stored in the 'FormalEducation' and 'AdjustedCompensation' columns.
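
One possible test for this comparison is Welch's t-test, which does not assume equal variances in the two groups. The sketch below computes only the test statistic and the Welch-Satterthwaite degrees of freedom using the standard library. `welch_t` is a hypothetical helper and the salaries (in thousands) are made up; turning the statistic into a p-value requires the t-distribution, for example via `ttest_ind(..., usevar='unequal')` in statsmodels.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Hypothetical helper: return (t, df) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of each sample mean (sample variance uses n - 1).
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Made-up salaries in thousands, for illustration only.
bachelors_toy = [40, 45, 50, 55, 60]
masters_toy = [50, 60, 70, 80, 90]

t_stat, dof = welch_t(bachelors_toy, masters_toy)
print(f"t = {t_stat:.4f}, df = {dof:.2f}")
```

A negative statistic here means the first sample's mean is below the second's, which is the direction a one-sided "smaller" alternative looks for.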

In [6]:
#Your code here
# H0: mean salary (Bachelor's) >= mean salary (Master's)
# H1: mean salary (Bachelor's) <  mean salary (Master's)
bachelor = data[data.FormalEducation == "Bachelor's degree"]['AdjustedCompensation']
master = data[data.FormalEducation == "Master's degree"]['AdjustedCompensation']
mean_b, mean_m = bachelor.mean(), master.mean()


In [7]:
std_b, std_m = bachelor.std(), master.std()  # .std() skips NaN values
n_b, n_m = bachelor.count(), master.count()  # .count() also ignores NaN, unlike len()
alpha = 0.05  # significance level

In [8]:
std_b, std_m

(306935.8723879783, 135527.2085045828)

In [9]:
from statsmodels.stats import weightstats

In [17]:
# Drop missing salaries; reassigning avoids modifying a slice of `data` in place
bachelor = bachelor.dropna()
master = master.dropna()

In [18]:
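# Returns (t statistic, p-value, degrees of freedom); usevar defaults to 'pooled'
weightstats.ttest_ind(bachelor, master, alternative='smaller')  # p ≈ 0.297 > 0.05: fail to reject H0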

(-0.5319163473190825, 0.29741105600554973, 3095.0)

## Wages and Education II

Now perform a similar statistical test comparing the AdjustedCompensation of those with Bachelor's degrees and those with Doctorates. If you haven't already, be sure to explore the distribution of the AdjustedCompensation feature for any anomalies. 
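
A quick way to spot anomalies is to compare the mean with the median and to look at the quartiles. The sketch below uses made-up salaries and Tukey's 1.5 × IQR fence, a common rule of thumb that is an assumption here and not the method prescribed by this lab.

```python
import statistics

# Made-up salaries with one extreme value, for illustration only.
salaries_toy = [30_000, 38_000, 42_000, 55_000, 61_000, 75_000, 28_000_000]

mean_salary = statistics.mean(salaries_toy)
median_salary = statistics.median(salaries_toy)

# Quartiles and the interquartile range (IQR).
q1, q2, q3 = statistics.quantiles(salaries_toy, n=4)
iqr = q3 - q1

# Tukey's upper fence: values above it are flagged as potential outliers.
upper_fence = q3 + 1.5 * iqr
flagged = [s for s in salaries_toy if s > upper_fence]

print(f"mean: {mean_salary:,.0f}  median: {median_salary:,.0f}")
print(f"upper fence: {upper_fence:,.0f}  flagged: {flagged}")
```

A mean far above the median, as in this toy sample, suggests that a few extreme values dominate the average.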

In [24]:
bachelor

In [10]:
#Your code here
bachelor = data[data.FormalEducation == "Bachelor's degree"]['AdjustedCompensation']
phd = data[data.FormalEducation == "Doctoral degree"]['AdjustedCompensation']
# Drop missing salaries before computing summary statistics
bachelor = bachelor.dropna()
phd = phd.dropna()
mean_b, mean_p = bachelor.mean(), phd.mean()
median_b, median_p = bachelor.median(), phd.median()
sd_b, sd_p = bachelor.std(), phd.std()
print(f'mean_b: {mean_b} and mean_p: {mean_p}')
print(f'median_b: {median_b} and median_p: {median_p}')
print(f'sd_b: {sd_b} and sd_p: {sd_p}')

mean_b: 64887.097994618794 and mean_p: 29566175.762453098
median_b: 38399.4 and median_p: 74131.91999999997
sd_b: 306935.8723879783 and sd_p: 909998082.3346785


In [11]:
# Same one-sided test: H1 is mean(bachelor) < mean(phd)
weightstats.ttest_ind(bachelor, phd, alternative='smaller')  # p ≈ 0.140 > 0.05: fail to reject H0

(-1.0786721488559703, 0.14042972979741944, 2072.0)

In [22]:
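# `bachelor >= outlier_threshold` is a boolean Series (True at or above the threshold)
# with the same length as `bachelor`, so the sizes and test below describe these flags,
# not filtered salaries; `bachelor[bachelor <= outlier_threshold]` would filter instead.
print('\n\nRepeated Test with Outliers Removed:')
outlier_threshold = 500000
bachelor_s = bachelor >= outlier_threshold
phd_s = phd >= outlier_threshold
print('Sample sizes: \ns1: {} \ns2: {}'.format(len(bachelor_s), len(phd_s)))
print("Welch's t-test p-value with outliers removed:")
weightstats.ttest_ind(bachelor_s, phd_s, alternative='smaller')  # usevar defaults to 'pooled'



Repeated Test with Outliers Removed: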
Sample sizes: 
s1: 1107 
s2: 967
Welch's t-test p-value with outliers removed:


(0.20006913338318413, 0.5792769266360764, 2072.0)
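
A comparison against a threshold yields one True/False flag per value, while removing outliers requires using that comparison to select values. The following minimal sketch contrasts the two with plain Python lists; `salaries_toy` is made up and not taken from the survey. With a pandas Series, the filter would be written `series[series <= outlier_threshold]`.

```python
outlier_threshold = 500_000

# Made-up salaries, for illustration only.
salaries_toy = [38_000, 74_000, 120_000, 2_000_000]

# A comparison yields one True/False flag per value; nothing is removed.
is_outlier = [s >= outlier_threshold for s in salaries_toy]

# Filtering keeps only the values at or below the threshold.
kept = [s for s in salaries_toy if s <= outlier_threshold]

# The mask is as long as the input; the filtered list is shorter.
print(len(is_outlier), len(kept))
```

Checking that the sample size actually shrinks after filtering is a cheap sanity test before re-running a hypothesis test.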

## Wages and Education III

Remember the multiple comparisons problem; rather than continuing on like this, perform an ANOVA test between the various 'FormalEducation' categories and their relation to 'AdjustedCompensation'.
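
To see why repeated pairwise tests are a problem, consider how fast the chance of at least one false positive grows. The sketch below assumes 7 education categories (consistent with the 6 degrees of freedom reported in the ANOVA output in this lab) and, as a simplification, treats the pairwise tests as independent. The Bonferroni adjustment is shown only for comparison.

```python
import math

alpha = 0.05
n_levels = 7  # assumption: 7 education categories (6 degrees of freedom in the ANOVA output)

# Number of distinct pairwise comparisons among the categories.
n_pairs = math.comb(n_levels, 2)

# Family-wise error rate, assuming the tests are independent (a simplification).
fwer = 1 - (1 - alpha) ** n_pairs

# Bonferroni correction: per-test threshold that keeps the family-wise rate at or below alpha.
bonferroni_alpha = alpha / n_pairs

print(f"pairwise tests: {n_pairs}")
print(f"chance of at least one false positive: {fwer:.3f}")
print(f"Bonferroni-adjusted per-test alpha: {bonferroni_alpha:.5f}")
```

ANOVA sidesteps this by testing all group means in a single comparison.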

In [28]:
#Your code here
import statsmodels.api as sm
from statsmodels.formula.api import ols

formula = ' AdjustedCompensation ~ C(FormalEducation)'
lm = ols(formula, data).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)
# p ≈ 0.738 > 0.05: fail to reject the null hypothesis of equal mean compensation across education levels

                          sum_sq      df         F    PR(>F)
C(FormalEducation)  6.540294e+17     6.0  0.590714  0.738044
Residual            7.999414e+20  4335.0       NaN       NaN
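
The F statistic in this table can be checked by hand: each sum of squares is divided by its degrees of freedom to give a mean square, and F is the ratio of the two mean squares. A quick sketch using the rounded values printed above:

```python
# Values copied from the ANOVA table above (rounded as printed).
ss_between, df_between = 6.540294e17, 6.0   # C(FormalEducation) row
ss_within, df_within = 7.999414e20, 4335.0  # Residual row

# Mean squares: sum of squares divided by degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F is the ratio of between-group to within-group mean squares.
f_stat = ms_between / ms_within
print(f"F = {f_stat:.6f}")
```

An F value below 1 means the variation between education levels is smaller than the variation within them, which is consistent with the large p-value.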


In [36]:
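# This filter keeps only rows with AdjustedCompensation at or above 500,000 (the outliers),
# which is why so few degrees of freedom remain below; use `<=` to exclude those rows instead.
temp = data[data['AdjustedCompensation'] >= 5*10**5]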
formula = ' AdjustedCompensation ~ C(FormalEducation)'
lm = ols(formula, temp).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

                          sum_sq    df        F   PR(>F)
C(FormalEducation)  2.199996e+20   5.0  0.83035  0.55637
Residual            5.298960e+20  10.0      NaN      NaN


## Additional Resources

The data was taken from this original source:  
[Kaggle Machine Learning & Data Science Survey 2017](https://www.kaggle.com/kaggle/kaggle-survey-2017)

## Summary

In this lab, you practiced conducting hypothesis tests on real data. In doing so, you saw how strongly results can depend on the initial problem formulation, including preprocessing!