# In Depth A/B Testing - Lab

## Introduction

In this lab, you'll explore a survey from Kaggle regarding budding data scientists. With this, you'll form some initial hypotheses, and test them using the tools you've acquired to date. 

## Objectives

You will be able to:
* Conduct t-tests and an ANOVA on a real-world dataset and interpret the results

## Load the Dataset and Perform a Brief Exploration

The data is stored in a file called **multipleChoiceResponses_cleaned.csv**. Feel free to check out the original dataset referenced at the bottom of this lab, although this cleaned version will be easier to work with. Additionally, metadata regarding the questions is stored in a file named **schema.csv**. Load the data as a Pandas DataFrame, and take a moment to get acquainted with it.

> Note: If you can't get the file to load properly, try changing the encoding format as in `encoding='latin1'`
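
The encoding fallback mentioned in the note can be sketched as below. This is only an illustration: `read_csv_with_fallback` is a hypothetical helper (not part of pandas or this lab), and the demo file is a throwaway CSV written on the fly, not the survey data.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Hypothetical helper: try each encoding in turn, return the first DataFrame that loads."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Demonstration on a tiny throwaway file (not the lab data) saved as Latin-1.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="latin1") as f:
        f.write("Country,Salary\nCôte d'Ivoire,100\n")
    demo_df = read_csv_with_fallback(demo_path)

print(demo_df)
```

For the lab file itself, passing `encoding='latin1'` directly to `pd.read_csv` is usually enough; the loop is only useful when the encoding is not known in advance.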

In [1]:
# python libraries
import pandas as pd

# load dataset
df = pd.read_csv('multipleChoiceResponses_cleaned.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26394 entries, 0 to 26393
Columns: 230 entries, GenderSelect to AdjustedCompensation
dtypes: float64(15), object(215)
memory usage: 46.3+ MB




In [2]:
df.describe()

Unnamed: 0,LearningCategorySelftTaught,LearningCategoryOnlineCourses,LearningCategoryWork,LearningCategoryUniversity,LearningCategoryKaggle,LearningCategoryOther,TimeGatheringData,TimeModelBuilding,TimeProduction,TimeVisualizing,TimeFindingInsights,TimeOtherSelect,CompensationAmount,exchangeRate,AdjustedCompensation
count,16236.0,16253.0,16238.0,16249.0,16253.0,16221.0,10657.0,10655.0,10644.0,10656.0,10650.0,10640.0,5178.0,4499.0,4343.0
mean,33.596945,25.81468,13.760184,21.13327,4.467212,1.449728,35.680304,27.455279,10.007657,13.639968,9.249953,2.254041,41294940.0,0.703416,6636071.0
std,23.78135,24.558786,17.845975,23.784604,10.186693,8.437395,19.36495,17.450835,10.45843,9.947624,12.429025,10.302431,1965335000.0,0.486681,429399600.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-99.0,3e-05,-73.51631
25%,20.0,10.0,0.0,0.0,0.0,0.0,25.0,15.0,5.0,10.0,0.0,0.0,50000.0,0.058444,20369.42
50%,30.0,20.0,10.0,15.0,0.0,0.0,30.0,30.0,10.0,10.0,5.0,0.0,90000.0,1.0,53812.17
75%,50.0,35.0,20.0,40.0,5.0,0.0,50.0,40.0,10.0,15.0,15.0,0.0,190000.0,1.0,95666.08
max,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,303.0,100.0,100000000000.0,2.652053,28297400000.0


In [3]:
df.head(4)

Unnamed: 0,GenderSelect,Country,Age,EmploymentStatus,StudentStatus,LearningDataScience,CodeWriter,CareerSwitcher,CurrentJobTitleSelect,TitleFit,...,JobFactorTitle,JobFactorCompanyFunding,JobFactorImpact,JobFactorRemote,JobFactorIndustry,JobFactorLeaderReputation,JobFactorDiversity,JobFactorPublishingOpportunity,exchangeRate,AdjustedCompensation
0,"Non-binary, genderqueer, or gender non-conforming",,,Employed full-time,,,Yes,,DBA/Database Engineer,Fine,...,,,,,,,,,,
1,Female,United States,30.0,"Not employed, but looking for work",,,,,,,...,,,,,,Somewhat important,,,,
2,Male,Canada,28.0,"Not employed, but looking for work",,,,,,,...,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,,
3,Male,United States,56.0,"Independent contractor, freelancer, or self-em...",,,Yes,,Operations Research Practitioner,Poorly,...,,,,,,,,,1.0,250000.0


## Wages and Education

You've been asked to determine whether education affects salary. Develop a hypothesis test to compare the salaries of those with Master's degrees to those with Bachelor's degrees. Are the two statistically different according to your results?

> Note: The relevant features are stored in the 'FormalEducation' and 'AdjustedCompensation' columns.

You may import the functions stored in the `flatiron_stats.py` file to help perform your hypothesis tests. It contains the stats functions that you previously coded: `welch_t(a,b)`, `welch_df(a, b)`, and `p_value(a, b, two_sided=False)`. 
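
If `flatiron_stats.py` is not at hand, the three functions can be approximated as follows. This is a sketch under assumptions: the `_sketch` names are hypothetical, the bodies are a standard textbook formulation of Welch's test rather than the contents of the actual file, and the one-sided case here tests whether the mean of `a` exceeds the mean of `b`. The sample data is synthetic.

```python
import numpy as np
from scipy import stats


def welch_t_sketch(a, b):
    """Welch's t-statistic: difference in means over the unpooled standard error."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se


def welch_df_sketch(a, b):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    return (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))


def p_value_sketch(a, b, two_sided=False):
    """One-sided p-value for mean(a) > mean(b), or the two-sided p-value."""
    t = welch_t_sketch(a, b)
    df = welch_df_sketch(a, b)
    if two_sided:
        return 2 * stats.t.sf(abs(t), df)
    return stats.t.sf(t, df)


# Synthetic samples with unequal sizes and variances (not the survey data).
rng = np.random.default_rng(0)
a = rng.normal(loc=1.0, scale=2.0, size=40)
b = rng.normal(loc=0.0, scale=1.0, size=60)

scipy_result = stats.ttest_ind(a, b, equal_var=False)
print(welch_t_sketch(a, b), scipy_result.statistic)
print(p_value_sketch(a, b, two_sided=True), scipy_result.pvalue)
```

The two printed pairs should agree, which is a quick way to check a hand-written implementation against SciPy.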

Note that `scipy.stats.ttest_ind(a, b, equal_var=False)` performs a two-sided Welch's t-test. Because the t-distribution is symmetric, the two-sided p-value is twice the one-sided p-value taken in the direction of the observed difference. See the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for more information.
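
A minimal sketch of converting the two-sided result into a one-sided p-value is shown here. `one_sided_welch_p` is a hypothetical helper name, and the two samples are synthetic stand-ins rather than survey salaries.

```python
import numpy as np
from scipy import stats


def one_sided_welch_p(a, b):
    """One-sided p-value for H1: mean(a) > mean(b), from the two-sided Welch test."""
    result = stats.ttest_ind(a, b, equal_var=False)  # two-sided by default
    half = result.pvalue / 2
    # Halve when the difference points the hypothesized way; otherwise take the complement.
    return half if result.statistic > 0 else 1 - half


rng = np.random.default_rng(1)
x = rng.normal(loc=0.5, scale=1.0, size=80)   # synthetic sample with the larger mean
y = rng.normal(loc=0.0, scale=1.5, size=120)  # synthetic comparison sample

two_sided = stats.ttest_ind(x, y, equal_var=False).pvalue
p_one = one_sided_welch_p(x, y)
print(f"two-sided p = {two_sided:.4f}, one-sided p = {p_one:.4f}")
```

Note that `1 - pvalue` is not the one-sided p-value. In SciPy 1.6 and later, passing `alternative='greater'` to `ttest_ind` returns the one-sided result directly.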

In [4]:
# inspect the FormalEducation column
df['FormalEducation'].value_counts()

Master's degree                                                      8204
Bachelor's degree                                                    4811
Doctoral degree                                                      3543
Some college/university study without earning a bachelor's degree     786
Professional degree                                                   451
I did not complete any formal education past high school              257
I prefer not to answer                                                 90
Name: FormalEducation, dtype: int64

In [5]:
# inspect the AdjustedCompensation column
df['AdjustedCompensation'].describe()

count    4.343000e+03
mean     6.636071e+06
std      4.293996e+08
min     -7.351631e+01
25%      2.036942e+04
50%      5.381217e+04
75%      9.566608e+04
max      2.829740e+10
Name: AdjustedCompensation, dtype: float64

In [6]:
# count missing values (isnull().count() would count every row, missing or not)
df['AdjustedCompensation'].isnull().sum()

22051

In [7]:
'''
Null hypothesis:
Data scientists with a Master's degree do not have a higher mean annual salary than those with only a Bachelor's degree.

Alternative hypothesis:
Data scientists with a Master's degree have a higher mean annual salary than those with only a Bachelor's degree.
'''

'''
Pseudocode
1. assign variables bachelor and masters to their respective AdjustedCompensation values
'''

# Import libraries
import scipy.stats as stats
#%load 'flatiron_stats.py' # load UDF for welch

# Subset dataframes to collect only masters and bachelor degree earners
subset = df[(~df['AdjustedCompensation'].isnull())] # remove null values
#doctorate = df.loc[df['FormalEducation'] == "Doctoral degree", 'AdjustedCompensation']
masters = subset.loc[subset['FormalEducation'] == "Master's degree", 'AdjustedCompensation']
bachelor = subset.loc[subset['FormalEducation'] == "Bachelor's degree", 'AdjustedCompensation']

# Print sample size, range, mean, and standard deviation for each group
print("Bachelor sample size: %.0f" %len(bachelor))
print("Bachelor range:", bachelor.min(), ":", bachelor.max())
print("Bachelor mean: %.0f" %bachelor.mean())
print("Bachelor stdev: %.0f" %bachelor.std())
print("Master's sample size: %.0f" %len(masters))
print("Master's range:", masters.min(), ":", masters.max())
print("Master's mean: %.0f" %masters.mean())
print("Master's stdev: %.0f" %masters.std())

# perform 1-sided Welch's t-test for unequal sample sizes and variances
#results = stats.ttest_ind(masters, bachelor, equal_var=False)#, alternative='greater')
#results.pvalue / 2 # one-sided p-value; halving is valid only when the t-statistic is positive

Bachelor sample size: 1107
Bachelor range: 0.0 : 9999999.0
Bachelor mean: 64887
Bachelor stdev: 306936
Master's sample size: 1990
Master's range: 0.0 : 4498900.0
Master's mean: 69140
Master's stdev: 135527


#### Results and Conclusions
The p-value is greater than the significance level (alpha) of 0.05, so we fail to reject the null hypothesis that master's degree holders do not have a higher mean salary than bachelor's degree holders.

This result may be driven by outliers in the data: some respondents report a salary of 0, and some report an extremely high salary. I will therefore keep only salaries greater than 0 and below 500,000.

In [8]:
df[df['AdjustedCompensation'].isnull()]

Unnamed: 0,GenderSelect,Country,Age,EmploymentStatus,StudentStatus,LearningDataScience,CodeWriter,CareerSwitcher,CurrentJobTitleSelect,TitleFit,...,JobFactorTitle,JobFactorCompanyFunding,JobFactorImpact,JobFactorRemote,JobFactorIndustry,JobFactorLeaderReputation,JobFactorDiversity,JobFactorPublishingOpportunity,exchangeRate,AdjustedCompensation
0,"Non-binary, genderqueer, or gender non-conforming",,,Employed full-time,,,Yes,,DBA/Database Engineer,Fine,...,,,,,,,,,,
1,Female,United States,30.0,"Not employed, but looking for work",,,,,,,...,,,,,,Somewhat important,,,,
2,Male,Canada,28.0,"Not employed, but looking for work",,,,,,,...,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,,
4,Male,Taiwan,38.0,Employed full-time,,,Yes,,Computer Scientist,Fine,...,,,,,,,,,,
5,Male,Brazil,46.0,Employed full-time,,,Yes,,Data Scientist,Fine,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26389,Female,Other,24.0,"Not employed, but looking for work",,,,,,,...,,,,,,,,,,
26390,Male,Indonesia,25.0,Employed full-time,,,Yes,,Programmer,Fine,...,,,,,,,,,0.000076,
26391,Female,Taiwan,25.0,Employed part-time,,,No,Yes,,,...,,,,,,,,,,
26392,Female,Singapore,16.0,I prefer not to say,Yes,"Yes, but data science is a small part of what ...",,,,,...,,,,,,,,,,


In [9]:
# Keep values strictly between 0 and 500,000
outlier_threshold = 500000
#df = df[~df['AdjustedCompensation'].isnull()] # remove null values
# comparisons against NaN evaluate to False, so null values are dropped by this filter too
subset = df[(df['AdjustedCompensation'] > 0) & (df['AdjustedCompensation'] < outlier_threshold)]
bachelor = subset.loc[subset['FormalEducation'] == "Bachelor's degree", 'AdjustedCompensation']
doctorate = subset.loc[subset['FormalEducation'] == "Doctoral degree", 'AdjustedCompensation']
masters = subset.loc[subset['FormalEducation'] == "Master's degree", 'AdjustedCompensation']

# Print sample size, range, mean, and standard deviation for each group
print("Bachelor sample size: %.0f" %len(bachelor))
print("Bachelor range:", bachelor.min(), ":", bachelor.max())
print("Bachelor mean: %.0f" %bachelor.mean())
print("Bachelor stdev: %.0f" %bachelor.std())
print("Doctorate's sample size: %.0f" %len(doctorate))
print("Doctorate's range:", doctorate.min(), ":", doctorate.max())
print("Doctorate's mean: %.0f" %doctorate.mean())
print("Doctorate's stdev: %.0f" %doctorate.std())

# perform 1-sided Welch's t-test for unequal sample sizes and variances
#results = stats.ttest_ind(doctorate, bachelor, equal_var=False)#, alternative='greater')
#results.pvalue / 2 # one-sided p-value; halving is valid only when the t-statistic is positive

## Wages and Education II

Now perform a similar statistical test comparing the AdjustedCompensation of those with Bachelor's degrees and those with Doctorates. If you haven't already, be sure to explore the distribution of the AdjustedCompensation feature for any anomalies. 
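
One way to look for anomalies is to compare the mean against the median and the extreme quantiles before and after trimming. The sketch below uses a small made-up salary series (the values are illustrative only) and the 500,000 cutoff discussed earlier in this lab.

```python
import pandas as pd

# Made-up salaries with one zero and one implausibly large value.
salaries = pd.Series(
    [0.0, 42_000.0, 55_000.0, 61_000.0, 78_000.0, 95_000.0, 9_999_999.0]
)

# The mean is pulled far above the median by the single extreme value.
print(salaries.describe())
print(salaries.quantile([0.01, 0.5, 0.99]))

# Keep only values strictly between the two bounds.
lower, upper = 0, 500_000
trimmed = salaries[(salaries > lower) & (salaries < upper)]
print("trimmed mean:", trimmed.mean(), "trimmed median:", trimmed.median())
```

A large gap between the mean and the median, as in the untrimmed series, is a common sign that a few extreme values dominate the summary statistics.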

In [None]:
# Remove those reporting $0 salary

bachelor.describe()

## Wages and Education III

Remember the multiple comparisons problem; rather than continuing on like this, perform an ANOVA test between the various 'FormalEducation' categories and their relation to 'AdjustedCompensation'.
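
As a rough illustration of the mechanics, a one-way ANOVA can be run with `scipy.stats.f_oneway`. The data below is synthetic: the group means, spread, and sizes are invented for the example, and only the column names mirror the lab dataset. Other tools, such as a statsmodels ANOVA table, would work as well.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic compensation for three education levels (not the survey data).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "FormalEducation": np.repeat(
        ["Bachelor's degree", "Master's degree", "Doctoral degree"], 50
    ),
    "AdjustedCompensation": np.concatenate([
        rng.normal(60_000, 15_000, 50),
        rng.normal(65_000, 15_000, 50),
        rng.normal(80_000, 15_000, 50),
    ]),
})

# One array of compensation values per education category.
groups = [
    g["AdjustedCompensation"].to_numpy()
    for _, g in toy.groupby("FormalEducation")
]

# A single F-test across all groups avoids running many pairwise t-tests.
f_stat, p_val = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_val:.4g}")
```

On the real data, the same pattern would apply after dropping missing values and any outliers you decide to exclude.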

In [None]:
#Your code here

## Additional Resources

Here's the original source where the data was taken from:  
    [Kaggle Machine Learning & Data Science Survey 2017](https://www.kaggle.com/kaggle/kaggle-survey-2017)

## Summary

In this lab, you practiced conducting actual hypothesis tests on actual data. From this, you saw how dependent results can be on the initial problem formulation, including preprocessing!