## Capstone Project 1 Statistical Inference Write-up

For this project, I aim to classify individuals based on census attributes to predict whether a person earns above or below $50,000 a year. Since the dependent variable is categorical, as are most of the independent variables, correlation coefficients are of limited use here. Instead, I will quantify the significance of the most important categorical variables with chi-squared tests.

### Is the distribution in education level a significant factor in yearly income?

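In other words, if education level were not significant, we would expect an even breakdown across all education levels. The null hypothesis is therefore that the counts are uniform across education levels within an income group. First, we will explore the education levels of adults making over $50,000 a year.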

In [1]:
from scipy import stats
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#reading in csv with column names
colnames = ['age', 'workclass','fnlwgt','education','education-num','marital_status','occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss','hours_per_week','native_country','over_50k']
adult = pd.read_csv('adult.csv', header = None, index_col = None, names = colnames)

#dropping blank values and two columns
adult.replace(' ?',np.nan,inplace=True)
adult = adult.dropna(axis = 0, how = 'any')
adult = adult.drop(['fnlwgt','education-num'],axis = 1)
adult = adult.reset_index(drop = True)
for col in ['workclass', 'education', 'marital_status','occupation','relationship','race','sex', 'native_country', 'over_50k']:
    adult[col] = adult[col].str.strip()


# replacing workclass groups
adult.loc[adult['workclass'].isin(['Self-emp-not-inc', 'Self-emp-inc']), 'workclass'] = 'Self-Employed'

#replacing education groups
adult.loc[adult['education'].isin(['Assoc-acdm', 'Assoc-voc']), 'education'] = 'Associates'
below_hs = ['11th', '9th', '7th-8th', '5th-6th', '10th', '12th', '1st-4th', 'Preschool']
adult.loc[adult['education'].isin(below_hs), 'education'] = 'Below-HS'
# 'Some-college' is folded into 'HS-grad'
adult.loc[adult['education'] == 'Some-college', 'education'] = 'HS-grad'

#replacing marital status groups
not_married = ['Divorced', 'Separated', 'Married-spouse-absent']
adult.loc[adult['marital_status'].isin(not_married), 'marital_status'] = 'Not-married'
married = ['Married-civ-spouse', 'Married-AF-spouse']
adult.loc[adult['marital_status'].isin(married), 'marital_status'] = 'Married'
adult.head(10)

| | age | workclass | education | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | over_50k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-Employed | Bachelors | Married | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | HS-grad | Not-married | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | Below-HS | Married | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | Bachelors | Married | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 5 | 37 | Private | Masters | Married | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 6 | 49 | Private | Below-HS | Not-married | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 7 | 52 | Self-Employed | HS-grad | Married | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 8 | 31 | Private | Masters | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 9 | 42 | Private | Bachelors | Married | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |


In [15]:
totalCount = len(adult)
under50 = len(adult.loc[adult['over_50k'] == '<=50K'])
over50 = totalCount - under50
fracsUnderEdu = list(adult[adult['over_50k'] == '<=50K']['education'].value_counts())
fracsOverEdu = list(adult[adult['over_50k'] == '>50K']['education'].value_counts())
over50prop = [x / over50 for x in fracsOverEdu]

In [16]:
stats.chisquare(fracsOverEdu)

Power_divergenceResult(statistic=6231.5697922216295, pvalue=0.0)

As shown by the results of the chi-squared test, the p-value under the null hypothesis (education levels being evenly distributed within the over-$50,000 group) is extremely low (<1e-40); it prints as 0.0 because it underflows double precision. Therefore, we reject the null hypothesis. The result for the under-$50,000 group is even more extreme, so we reject the null hypothesis for that group as well.
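To make the null hypothesis concrete, here is a minimal sketch of what `stats.chisquare` computes when it is called with counts only. The counts below are hypothetical and are not taken from the census data; the point is only that omitting `f_exp` is equivalent to testing against equal expected counts in every category.

```python
import numpy as np
from scipy import stats

# Hypothetical category counts, not taken from the census data.
observed = np.array([50, 30, 15, 5])

# With no f_exp argument, chisquare compares against equal expected counts.
default_result = stats.chisquare(observed)

# The same test with the uniform expectation written out explicitly.
expected = np.full(len(observed), observed.sum() / len(observed))
explicit_result = stats.chisquare(observed, f_exp=expected)

# Manual statistic: sum of (observed - expected)^2 / expected.
manual_stat = ((observed - expected) ** 2 / expected).sum()

print(default_result.statistic, explicit_result.statistic, manual_stat)
```

All three values agree, which confirms that the tests in this write-up compare each group's category counts against a uniform distribution.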


### Marital Status, Relationship, Race, and Sex

In [27]:
fracsUnderM = list(adult[adult['over_50k'] == '<=50K']['marital_status'].value_counts())
fracsOverM = list(adult[adult['over_50k'] == '>50K']['marital_status'].value_counts())
stats.chisquare(fracsOverM), stats.chisquare(fracsUnderM)

(Power_divergenceResult(statistic=14657.147575919018, pvalue=0.0),
 Power_divergenceResult(statistic=7346.6268208704869, pvalue=0.0))

In [23]:
fracsUnderRe = list(adult[adult['over_50k'] == '<=50K']['relationship'].value_counts())
fracsOverRe = list(adult[adult['over_50k'] == '>50K']['relationship'].value_counts())
stats.chisquare(fracsUnderRe), stats.chisquare(fracsOverRe)

(Power_divergenceResult(statistic=9997.6985962743893, pvalue=0.0),
 Power_divergenceResult(statistic=19232.034096963238, pvalue=0.0))

In [25]:
fracsUnderRa = list(adult[adult['over_50k'] == '<=50K']['race'].value_counts())
fracsOverRa = list(adult[adult['over_50k'] == '>50K']['race'].value_counts())
stats.chisquare(fracsUnderRa), stats.chisquare(fracsOverRa)

(Power_divergenceResult(statistic=59255.254436302639, pvalue=0.0),
 Power_divergenceResult(statistic=23771.287426744802, pvalue=0.0))

In [26]:
fracsUnderS = list(adult[adult['over_50k'] == '<=50K']['sex'].value_counts())
fracsOverS = list(adult[adult['over_50k'] == '>50K']['sex'].value_counts())
stats.chisquare(fracsUnderS), stats.chisquare(fracsOverS)

(Power_divergenceResult(statistic=1246.5169947912068, pvalue=4.7426058306866789e-273),
 Power_divergenceResult(statistic=3718.7874267448055, pvalue=0.0))

Once again, we reject the null hypotheses that marital status, relationship, race, and sex are evenly distributed within each income group.
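The tests above check each income group against a uniform distribution. A related question is whether a category and the income group are associated with each other, which is usually posed as a chi-squared test of independence on a contingency table. The sketch below is not what was run in this write-up; it uses a small hypothetical DataFrame (`toy`) whose column names simply mirror `adult`, and it relies on `pd.crosstab` and `stats.chi2_contingency`.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data; the column names mirror the write-up's DataFrame.
toy = pd.DataFrame({
    'sex': ['Male'] * 60 + ['Female'] * 40,
    'over_50k': ['>50K'] * 25 + ['<=50K'] * 35 + ['>50K'] * 5 + ['<=50K'] * 35,
})

# Rows are categories, columns are income groups.
table = pd.crosstab(toy['sex'], toy['over_50k'])

# Null hypothesis: the category and the income group are independent.
chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(table)
print(chi2, p_value, dof)
```

Applied to the real data, the same call would take `pd.crosstab(adult['sex'], adult['over_50k'])`, or the equivalent table for any other categorical column.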

### Mean Age and Hours Worked

In my initial exploration, I found that the over-$50,000 group was generally older, while hours worked appeared roughly equal for both groups. I will now perform a statistical test to see whether there is a significant difference in mean age and in mean hours worked per week between the two groups.

In [35]:
avgAgeOver = adult[adult['over_50k'] == '>50K'].age.mean()
avgAgeSDOver =adult[adult['over_50k'] == '>50K'].age.std()
avgAgeUnder = adult[adult['over_50k'] == '<=50K'].age.mean()
avgAgeSDUnder = adult[adult['over_50k'] == '<=50K'].age.std()
populationAgeSD = adult.age.std()
avgAgeOver, avgAgeUnder, avgAgeSDOver, avgAgeSDUnder, populationAgeSD

(43.95911028236548,
 36.60806038668668,
 10.269632835673848,
 13.464631257161859,
 13.134664776856338)

In [37]:
diffSD = np.sqrt(avgAgeSDOver**2/over50 + avgAgeSDUnder**2/under50)
diffSD

0.14849209541492336

In [41]:
ageZ = (avgAgeOver-avgAgeUnder)/diffSD
ageZ

49.504654608975386

In [42]:
stats.norm.cdf(ageZ)

1.0

The normal CDF at z ≈ 49.5 rounds to 1.0, so the one-sided p-value, 1 - CDF, is less than 1e-50. We reject the null hypothesis that there is no difference in mean age between the two groups.
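Because `stats.norm.cdf` saturates at 1.0 for a z-statistic this large, subtracting it from 1 cannot resolve the tail probability. A minimal sketch of one way around this, using the z-value printed above: `stats.norm.sf` computes the upper tail directly, and `stats.norm.logsf` returns its natural logarithm, which stays finite even when the probability itself is too small for a double.

```python
from scipy import stats

# z-statistic reported in the cell above.
age_z = 49.504654608975386

# cdf saturates at 1.0 in double precision, so 1 - cdf cannot resolve the tail.
upper_tail_naive = 1 - stats.norm.cdf(age_z)

# sf is the survival function 1 - cdf, computed directly in the upper tail.
upper_tail = stats.norm.sf(age_z)

# logsf returns the natural log of the tail probability, which stays finite
# even when the probability itself underflows.
log_upper_tail = stats.norm.logsf(age_z)

print(upper_tail_naive, upper_tail, log_upper_tail)
```

The log of the tail probability is the more informative quantity here, since it gives the order of magnitude of the p-value rather than a rounded zero.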

In [45]:
avgHoursOver = adult[adult['over_50k'] == '>50K'].hours_per_week.mean()
avgHoursUnder = adult[adult['over_50k'] == '<=50K'].hours_per_week.mean()
avgHoursOverSD = adult[adult['over_50k'] == '>50K'].hours_per_week.std()
avgHoursUnderSD = adult[adult['over_50k'] == '<=50K'].hours_per_week.std()
avgHoursOver,avgHoursUnder,avgHoursOverSD,avgHoursUnderSD

(45.70657964837507, 39.34859186015715, 10.73698663369768, 11.95077413893298)

In [46]:
diffHourSD = np.sqrt(avgHoursOverSD**2/over50 + avgHoursUnderSD**2/under50)
diffHourSD

0.14717038516202069

In [48]:
hoursZ = (avgHoursOver - avgHoursUnder) / diffHourSD
hoursZ

43.201543443800702

In [49]:
stats.norm.cdf(hoursZ)

1.0

Surprisingly, we reject the null hypothesis here as well: individuals earning over $50,000 work more hours per week on average than individuals earning less than $50,000.
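The z-statistic used in both comparisons is the difference in sample means divided by the unpooled standard error, which is the same statistic that Welch's t-test uses. As a cross-check, the sketch below computes it by hand and with `stats.ttest_ind(..., equal_var=False)`. The samples are synthetic (the means, spreads, and sizes are assumed values, not the census data), so only the agreement between the two statistics is meaningful.

```python
import numpy as np
from scipy import stats

# Synthetic samples standing in for the two income groups (assumed values).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=45.0, scale=11.0, size=500)
group_b = rng.normal(loc=39.0, scale=12.0, size=1500)

# Standard error of the difference in means, using sample variances (ddof=1)
# to match the default of pandas' .std().
se = np.sqrt(group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b))
z = (group_a.mean() - group_b.mean()) / se

# Welch's t-test uses the same statistic with a t reference distribution.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(z, welch.statistic, welch.pvalue)
```

With samples as large as the ones in this data set, the t and normal reference distributions are nearly identical, so either approach should lead to the same conclusion. Note that `ttest_ind` reports a two-sided p-value by default.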