
# Project 1 GSS Paper
### DS 3001: Foundations of Machine Learning
### Gabe Silverstein, Rohan Chowla, Evan Stewart, and Rithwik Raman

## Summary

A one-paragraph description of the question, methods, and results (about 350 words).

## Data

One to two pages discussing the data and key variables in the analysis, and any challenges in reading, cleaning, and preparing them for analysis.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

! git clone https://github.com/DS3001/project_gss

Cloning into 'project_gss'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 23 (delta 6), reused 1 (delta 1), pack-reused 15
Receiving objects: 100% (23/23), 23.94 MiB | 23.12 MiB/s, done.
Resolving deltas: 100% (6/6), done.


In [3]:
var_list = ['age', 'conrinc', 'educ', 'indus10', 'hrs1', 'prestg10'] # variables to save
output_filename = 'work_data.csv' # name of file output

modes = ['w','a'] # can write and append
phase = 0 # starts in write, switches to append

for k in range(3): # for each data chunk
    url = 'https://github.com/DS3001/project_gss/raw/main/gss_chunk_' + str(1+k) + '.parquet' # get chunk url
    df = pd.read_parquet(url) # download data
    df.loc[:,var_list].to_csv(output_filename, # output file for chunk saving
                              mode=modes[phase], # write vs append
                              header=(phase == 0), # write variable names only with the first chunk
                              index=False) # no row index saved
    phase = 1 # switch to append
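
The loop above stacks the chunks into one CSV by writing the first chunk and appending the rest. One detail worth checking in this pattern is the header: if column names are written with every appended chunk, they end up as data rows in the file. Below is a minimal sketch of the write-then-append pattern, assuming small made-up frames in place of the GSS chunks; the names `chunks`, `keep`, and `out_path` are hypothetical and not part of the project code.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the three GSS chunks (made-up values).
chunks = [
    pd.DataFrame({'age': [48.0, 25.0], 'educ': [12.0, 16.0], 'other': [1, 2]}),
    pd.DataFrame({'age': [42.0], 'educ': [16.0], 'other': [3]}),
    pd.DataFrame({'age': [24.0, 24.0], 'educ': [14.0, 16.0], 'other': [4, 5]}),
]
keep = ['age', 'educ']  # columns to save, analogous to var_list

# Write to a temporary location so the sketch does not touch project files.
out_path = os.path.join(tempfile.mkdtemp(), 'work_data_demo.csv')

for k, chunk in enumerate(chunks):
    chunk.loc[:, keep].to_csv(
        out_path,
        mode='w' if k == 0 else 'a',  # overwrite on the first chunk, append afterwards
        header=(k == 0),              # write the column names only once
        index=False,                  # no row index saved
    )

combined = pd.read_csv(out_path)  # read the stacked chunks back into one frame
print(combined.shape)
print(combined.dtypes)
```

Reading the file back and checking the shape and dtypes is a quick way to confirm that all chunks landed in the file and that no stray header rows turned numeric columns into strings.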

In [4]:
df = df.loc[:, var_list] # only work with variables used in analysis (df is the last chunk loaded by the loop above)
print('----------- Pre-cleaning null counts')
for v in var_list:
  print(v, sum(df[v].isnull()))

print('\n', df.head(), '\n')

for v in var_list: # loop through each var
  df[v] = df[v].fillna(np.nanmedian(df[v])) # fill NaN with the column median, which is less sensitive to outliers than the mean

print('----------- Post-cleaning null counts')
for v in var_list:
  print(v, sum(df[v].isnull()))

print('\n', df.head())

----------- Pre-cleaning null counts
age 591
conrinc 10025
educ 118
indus10 1185
hrs1 10418
prestg10 1129

     age  conrinc  educ  indus10  hrs1  prestg10
0  48.0  59542.0  12.0   9370.0  40.0      49.0
1  25.0   7939.0  16.0   8470.0   NaN      41.0
2  42.0  59542.0  16.0   9490.0  35.0      44.0
3  24.0  33079.0  14.0   9370.0  50.0      35.0
4  24.0  33079.0  16.0   8090.0  40.0      60.0 

----------- Post-cleaning null counts
age 0
conrinc 0
educ 0
indus10 0
hrs1 0
prestg10 0

     age  conrinc  educ  indus10  hrs1  prestg10
0  48.0  59542.0  12.0   9370.0  40.0      49.0
1  25.0   7939.0  16.0   8470.0  40.0      41.0
2  42.0  59542.0  16.0   9490.0  35.0      44.0
3  24.0  33079.0  14.0   9370.0  50.0      35.0
4  24.0  33079.0  16.0   8090.0  40.0      60.0
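
The cleaning cell fills every missing value with its column's median. As a rough illustration of why the median is a common choice for skewed variables such as income, the sketch below compares it with the mean on made-up numbers (not GSS data). It also shows one possible alternative for a nominal code such as the industry code, where the most frequent value may be easier to interpret than a median; that alternative is an assumption for illustration, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up values, not GSS data: one extreme income and one missing entry.
toy = pd.DataFrame({
    'income': [20000.0, 30000.0, 40000.0, 1000000.0, np.nan],
    'industry': [8190.0, 8190.0, 9370.0, 770.0, np.nan],  # nominal codes
})

mean_income = np.nanmean(toy['income'])      # pulled upward by the outlier
median_income = np.nanmedian(toy['income'])  # middle of the observed values

filled = toy.copy()
filled['income'] = filled['income'].fillna(median_income)
# For a nominal code, the most frequent value is one alternative to the median.
filled['industry'] = filled['industry'].fillna(toy['industry'].mode().iloc[0])

print(mean_income, median_income)
print(filled)
```

In this toy frame the single large income moves the mean far above the typical value, while the median stays near the middle of the other observations.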


In [5]:
# rename columns for later analysis
df = df.rename(columns={'conrinc': 'income', 'indus10': 'industry', 'educ': 'education',
                   'hrs1': 'hours', 'prestg10': 'prestige'})
df

| | age | income | education | industry | hours | prestige |
|---|---|---|---|---|---|---|
| 0 | 48.0 | 59542.00 | 12.0 | 9370.0 | 40.0 | 49.0 |
| 1 | 25.0 | 7939.00 | 16.0 | 8470.0 | 40.0 | 41.0 |
| 2 | 42.0 | 59542.00 | 16.0 | 9490.0 | 35.0 | 44.0 |
| 3 | 24.0 | 33079.00 | 14.0 | 9370.0 | 50.0 | 35.0 |
| 4 | 24.0 | 33079.00 | 16.0 | 8090.0 | 40.0 | 60.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 24125 | 22.0 | 26991.25 | 12.0 | 8660.0 | 48.0 | 30.0 |
| 24126 | 29.0 | 45360.00 | 19.0 | 8190.0 | 50.0 | 61.0 |
| 24127 | 32.0 | 55440.00 | 15.0 | 8190.0 | 38.0 | 62.0 |
| 24128 | 49.0 | 45360.00 | 17.0 | 7860.0 | 40.0 | 61.0 |


In [6]:
for v in df.columns: # create column labels for analysis that group values similar to gss (e.g. 10-19 years old for age)
  df[v + '_lbl'] = ''

# labeling functions
def determine_range(x):
  decade = int(x // 10) # tens bucket, e.g. 48 -> 4
  return f'{decade}0-{decade}9'

def determine_conrinc(x):
  conrinc = x / 10000
  if conrinc < 1: return '< $10k'
  elif conrinc < 2.5: return '$10-25k'
  elif conrinc < 5: return '$25-50k'
  elif conrinc < 7.5: return '$50-75k'
  elif conrinc < 10: return '$75-100k'
  elif conrinc < 25: return '$100-250k'
  return '$250k+'

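def determine_educ(x):
  # x is years of schooling completed
  if x < 12: return 'Less than high school'
  elif x == 12: return 'High School Diploma'
  elif x < 15: return 'Some college' # 13-14 years
  elif x == 16: return "Bachelor's Degree"
  return 'Postgrad' # 17+ years; note that 15 years also falls through to this label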

def determine_indus(x):
  # job codes found at U.S. Bureau of the Census occupation and industry codes
  # https://www.bls.gov/tus/iocodes/census07icodes.pdf
  if x < 300: return 'Agriculture, forestry, fishing, and hunting'
  elif x < 500: return 'Mining, quarrying, and oil and gas extraction'
  elif x < 700: return 'Utilities'
  elif x < 800: return 'Construction'
  elif x < 4000: return 'Manufacturing'
  elif x < 6000: return 'Wholesale and retail trade'
  elif x < 6400: return 'Transportation and warehousing'
  elif x < 6800: return 'Information'
  elif x < 7000: return 'Finance and insurance'
  elif x < 7200: return 'Real estate and rental and leasing'
  elif x < 7500: return 'Professional and technical services'
  elif x < 7800: return 'Management, administrative, and waste services'
  elif x < 7900: return 'Educational services'
  elif x < 8500: return 'Health care and social assistance'
  elif x < 8700: return 'Leisure and hospitality'
  elif x < 9300: return 'Other services'
  return 'Public administration'


# determine labels for each variable
for c in ['age', 'prestige', 'hours']:
  df[c + '_lbl'] = df[c].apply(determine_range)

df['income_lbl'] = df['income'].apply(determine_conrinc)
df['education_lbl'] = df['education'].apply(determine_educ)
df['industry_lbl'] = df['industry'].apply(determine_indus)

df

| | age | income | education | industry | hours | prestige | age_lbl | income_lbl | education_lbl | industry_lbl | hours_lbl | prestige_lbl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 59542.00 | 12.0 | 9370.0 | 40.0 | 49.0 | 40-49 | $50-75k | High School Diploma | Public administration | 40-49 | 40-49 |
| 1 | 25.0 | 7939.00 | 16.0 | 8470.0 | 40.0 | 41.0 | 20-29 | < $10k | Bachelor's Degree | Health care and social assistance | 40-49 | 40-49 |
| 2 | 42.0 | 59542.00 | 16.0 | 9490.0 | 35.0 | 44.0 | 40-49 | $50-75k | Bachelor's Degree | Public administration | 30-39 | 40-49 |
| 3 | 24.0 | 33079.00 | 14.0 | 9370.0 | 50.0 | 35.0 | 20-29 | $25-50k | Some college | Public administration | 50-59 | 30-39 |
| 4 | 24.0 | 33079.00 | 16.0 | 8090.0 | 40.0 | 60.0 | 20-29 | $25-50k | Bachelor's Degree | Health care and social assistance | 40-49 | 60-69 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24125 | 22.0 | 26991.25 | 12.0 | 8660.0 | 48.0 | 30.0 | 20-29 | $25-50k | High School Diploma | Leisure and hospitality | 40-49 | 30-39 |
| 24126 | 29.0 | 45360.00 | 19.0 | 8190.0 | 50.0 | 61.0 | 20-29 | $25-50k | Postgrad | Health care and social assistance | 50-59 | 60-69 |
| 24127 | 32.0 | 55440.00 | 15.0 | 8190.0 | 38.0 | 62.0 | 30-39 | $50-75k | Postgrad | Health care and social assistance | 30-39 | 60-69 |
| 24128 | 49.0 | 45360.00 | 17.0 | 7860.0 | 40.0 | 61.0 | 40-49 | $25-50k | Postgrad | Educational services | 40-49 | 60-69 |
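
The income brackets assigned by `determine_conrinc` can also be expressed as a single binning step. The following is a minimal sketch using `pd.cut`, assuming bin edges that mirror the thresholds in that function; `income_bins` and `income_labels` are names introduced here for illustration, and the sample values are incomes that appear in the tables in this notebook.

```python
import numpy as np
import pandas as pd

# Bracket edges in dollars, chosen to mirror the thresholds in determine_conrinc.
income_bins = [-np.inf, 10_000, 25_000, 50_000, 75_000, 100_000, 250_000, np.inf]
income_labels = ['< $10k', '$10-25k', '$25-50k', '$50-75k',
                 '$75-100k', '$100-250k', '$250k+']

# A few income values taken from the tables shown in this notebook.
income = pd.Series([7939.0, 33079.0, 59542.0, 324512.29214])

# right=False makes each bin closed on the left, e.g. [10000, 25000),
# matching the strict "<" comparisons in the function.
income_lbl = pd.cut(income, bins=income_bins, labels=income_labels, right=False)
print(income_lbl.tolist())
```

A side effect of `pd.cut` is that the result is an ordered categorical, so the brackets carry their low-to-high order with them.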


In [7]:
df.drop([c for c in df.columns if 'lbl' in c], axis=1).corr() # determine which variables to perform analysis on

| | age | income | education | industry | hours | prestige |
|---|---|---|---|---|---|---|
| age | 1.0 | 0.047118 | -0.011602 | -0.057421 | -0.051805 | 0.102491 |
| income | 0.047118 | 1.0 | 0.255987 | -0.025643 | 0.220088 | 0.273366 |
| education | -0.011602 | 0.255987 | 1.0 | 0.199906 | 0.03727 | 0.476881 |
| industry | -0.057421 | -0.025643 | 0.199906 | 1.0 | -0.085111 | 0.189914 |
| hours | -0.051805 | 0.220088 | 0.03727 | -0.085111 | 1.0 | 0.092865 |
| prestige | 0.102491 | 0.273366 | 0.476881 | 0.189914 | 0.092865 | 1.0 |
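
The correlation matrix treats every column as a number, including `industry`, which holds Census industry codes. Because those codes name categories rather than measure a quantity, their correlations are hard to interpret. As a hedged sketch of one alternative, the example below correlates only the ordered numeric columns and summarizes income by industry label instead. The rows are made up for illustration and are not GSS data; `toy`, `numeric_corr`, and `median_income_by_industry` are hypothetical names.

```python
import pandas as pd

# Made-up rows with the same column names as the renamed frame; not GSS data.
toy = pd.DataFrame({
    'income': [20000.0, 35000.0, 50000.0, 80000.0],
    'education': [10.0, 12.0, 16.0, 18.0],
    'industry': [8190.0, 770.0, 9370.0, 8190.0],
    'industry_lbl': ['Health care and social assistance', 'Construction',
                     'Public administration', 'Health care and social assistance'],
})

# Pearson correlations for the ordered numeric variables only.
numeric_corr = toy[['income', 'education']].corr()

# For the nominal industry code, compare a summary statistic across groups instead.
median_income_by_industry = toy.groupby('industry_lbl')['income'].median()

print(numeric_corr)
print(median_income_by_industry)
```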


In [8]:
income_order = ['< $10k', '$10-25k', '$25-50k', '$50-75k', '$75-100k', '$100-250k', '$250k+']
educ_order = ['Less than high school', 'High School Diploma', 'Some college', "Bachelor's Degree", 'Postgrad']
incomeXeduc_order = income_order + educ_order
indus_order = ['Agriculture, forestry, fishing, and hunting', 'Mining, quarrying, and oil and gas extraction',
'Utilities', 'Construction', 'Manufacturing', 'Wholesale and retail trade','Transportation and warehousing',
'Information', 'Finance and insurance', 'Real estate and rental and leasing','Professional and technical services',
'Management, administrative, and waste services', 'Educational services', 'Health care and social assistance',
'Leisure and hospitality', 'Other services', 'Public administration']

def df_sort(s):
  return s.apply(lambda x: (incomeXeduc_order.index(x)))

df = df.sort_values(by=['income_lbl', 'education_lbl'], key=df_sort, ignore_index=True) # sort df by income and education to make later analysis easier
df

| | age | income | education | industry | hours | prestige | age_lbl | income_lbl | education_lbl | industry_lbl | hours_lbl | prestige_lbl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27.0 | 3087.00000 | 11.0 | 5090.0 | 40.0 | 28.0 | 20-29 | < $10k | Less than high school | Wholesale and retail trade | 40-49 | 20-29 |
| 1 | 18.0 | 1764.00000 | 11.0 | 6470.0 | 40.0 | 21.0 | 10-19 | < $10k | Less than high school | Information | 40-49 | 20-29 |
| 2 | 55.0 | 7939.00000 | 10.0 | 5090.0 | 47.0 | 28.0 | 50-59 | < $10k | Less than high school | Wholesale and retail trade | 40-49 | 20-29 |
| 3 | 27.0 | 9924.00000 | 11.0 | 5670.0 | 37.0 | 47.0 | 20-29 | < $10k | Less than high school | Wholesale and retail trade | 30-39 | 40-49 |
| 4 | 75.0 | 3969.00000 | 10.0 | 4870.0 | 40.0 | 24.0 | 70-79 | < $10k | Less than high school | Wholesale and retail trade | 40-49 | 20-29 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24125 | 47.0 | 324512.29214 | 20.0 | 7970.0 | 50.0 | 80.0 | 40-49 | $250k+ | Postgrad | Health care and social assistance | 50-59 | 80-89 |
| 24126 | 48.0 | 324512.29214 | 20.0 | 8190.0 | 40.0 | 80.0 | 40-49 | $250k+ | Postgrad | Health care and social assistance | 40-49 | 80-89 |
| 24127 | 50.0 | 324512.29214 | 18.0 | 9790.0 | 60.0 | 73.0 | 50-59 | $250k+ | Postgrad | Public administration | 60-69 | 70-79 |
| 24128 | 56.0 | 324512.29214 | 18.0 | 3390.0 | 55.0 | 73.0 | 50-59 | $250k+ | Postgrad | Manufacturing | 50-59 | 70-79 |
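
The sort above relies on a key function that looks up each label's position in a combined list. Another way to get the same low-to-high ordering, sketched here under the assumption that the label columns may be converted to ordered categoricals, is to attach the order to the columns themselves. The lists `income_order` and `educ_order` are copied from the cell above; the `toy` rows are made up for illustration.

```python
import pandas as pd

income_order = ['< $10k', '$10-25k', '$25-50k', '$50-75k', '$75-100k', '$100-250k', '$250k+']
educ_order = ['Less than high school', 'High School Diploma', 'Some college',
              "Bachelor's Degree", 'Postgrad']

# Made-up label rows, not GSS data.
toy = pd.DataFrame({
    'income_lbl': ['$50-75k', '< $10k', '$250k+', '< $10k'],
    'education_lbl': ['High School Diploma', "Bachelor's Degree", 'Postgrad',
                      'Less than high school'],
})

# Attach the intended order to each label column.
toy['income_lbl'] = pd.Categorical(toy['income_lbl'], categories=income_order, ordered=True)
toy['education_lbl'] = pd.Categorical(toy['education_lbl'], categories=educ_order, ordered=True)

# Sorting now follows the category order rather than alphabetical order.
toy_sorted = toy.sort_values(by=['income_lbl', 'education_lbl'], ignore_index=True)
print(toy_sorted)
```

With ordered categoricals, the same order would typically also carry over to later steps such as group-by summaries and plot axes, so the order lists only need to be applied once.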


## Results

Two to five pages providing visualizations, statistics, and a discussion of your findings. If you have a lot of plots or tables, that’s OK, but try to focus on a few key pieces of evidence rather than doing every single pairwise comparison of some set of variables.

## Conclusion

One to two pages summarizing the project, defending it from criticism, and suggesting additional work that was outside the scope of the project.