## Title
##### How do age, capital gain, capital loss, and work hours per week affect the income of adults aged 17-90 in the US?

## Introduction

We want to determine whether age, capital gain, capital loss, and hours worked per week can predict whether an individual earns more than 50,000 USD per year.

We are using the Adult data set from the UCI Machine Learning Repository.
This dataset was retrieved from the 1994 US Census database.
The dataset has 14 attributes and 48,842 entries.

Our variables of interest are:
- Age: the age of the individual in years, ranging from 17 to 90
- Capital Gain: profit from selling an asset, in USD
- Capital Loss: loss from selling an asset, in USD
- Hours per Week: the number of hours the individual reported working per week

In [52]:
# Preliminary Exploratory Data Analysis
library(tidyverse)


In [53]:
## Read in data
adult <- read_delim("data/adult.data.txt", delim=",",col_names=c("age", "workclass", "fnl_wgt","education",
    "education_num","marital_status","occupation","relationship","race","sex","capital_gain","capital_loss",
    "hrs_per_week","native_country","label"))

Parsed with column specification:
cols(
  age = col_double(),
  workclass = col_character(),
  fnl_wgt = col_character(),
  education = col_character(),
  education_num = col_character(),
  marital_status = col_character(),
  occupation = col_character(),
  relationship = col_character(),
  race = col_character(),
  sex = col_character(),
  capital_gain = col_character(),
  capital_loss = col_character(),
  hrs_per_week = col_character(),
  native_country = col_character(),
  label = col_character()
)



In [54]:
## Cleaning and Wrangling
adult_tidy <- adult %>%
    mutate(label=as_factor(label), capital_gain = as.numeric(capital_gain), 
           capital_loss = as.numeric(capital_loss), hrs_per_week = as.numeric(hrs_per_week) ) %>%
    filter_all(all_vars(. != " ?")) %>% # drop rows with a missing value; the raw file marks them as " ?"
    select(age,capital_gain,capital_loss,hrs_per_week,label) 
   
head(adult_tidy)

age,capital_gain,capital_loss,hrs_per_week,label
<dbl>,<dbl>,<dbl>,<dbl>,<fct>
39,2174,0,40,<=50K
50,0,0,13,<=50K
38,0,0,40,<=50K
53,0,0,40,<=50K
28,0,0,40,<=50K
37,0,0,40,<=50K
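The `" ?"` comparison in the cleaning cell relies on the raw fields keeping the space that follows each comma. As a sketch of an alternative, assuming it is acceptable to trim that whitespace first, the missing-value markers can be converted to `NA` and dropped with `drop_na()`. The `raw` tibble below is a hypothetical toy example, not rows from the real file.

```r
library(tidyverse)

# Hypothetical toy rows that mimic the raw file: text fields keep a leading
# space, and " ?" marks a missing value.
raw <- tibble(
  age = c(39, 50, 38),
  workclass = c(" State-gov", " ?", " Private"),
  capital_gain = c(" 2174", " 0", " 0"),
  label = c(" <=50K", " <=50K", " >50K")
)

clean <- raw %>%
  mutate(across(where(is.character), str_trim)) %>%        # strip the leading space
  mutate(across(where(is.character), ~ na_if(.x, "?"))) %>% # "?" becomes NA
  drop_na() %>%                                             # drop rows with any NA
  mutate(capital_gain = as.numeric(capital_gain),
         label = as_factor(label))

clean
```

Only the second toy row contains a `?`, so it is the only one removed. Trimming first also leaves the label values without a leading space, which may make later filtering on `label` less error-prone.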


In [59]:
## Summarize 
adult_tidy %>%
    group_by(label)%>%
    summarize(num_labels=n(), mean_age = mean(age,na.rm = TRUE), mean_cg = mean(capital_gain, na.rm = TRUE), 
              mean_cl = mean(capital_loss, na.rm = TRUE))

`summarise()` ungrouping output (override with `.groups` argument)



label,num_labels,mean_age,mean_cg,mean_cl
<fct>,<int>,<dbl>,<dbl>,<dbl>
<=50K,22654,36.60806,148.8938,53.448
>50K,7508,43.95911,3937.6798,193.7507
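The counts in the summary also show how unbalanced the two classes are. As a quick check, using only the two `num_labels` values from the table above:

```r
# Class counts copied from the summary table above
counts <- c("<=50K" = 22654, ">50K" = 7508)

total <- sum(counts)     # 30162 rows remain after cleaning
props <- counts / total  # share of each class

round(props, 3)
# <=50K  >50K
# 0.751 0.249
```

About three in four of the cleaned rows fall in the `<=50K` class, so a classifier that always predicts `<=50K` would already be right about 75% of the time. That figure is a useful baseline when judging any model built on these predictors.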


In [56]:

# TODO: add a few sentences explaining how the training/test set ratio was determined
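The split ratio is not stated in this notebook. The sketch below is therefore only an illustration of one way a split could be done: it assumes a 75/25 ratio, a hypothetical seed, and a toy tibble standing in for `adult_tidy`. Rows are sampled within each `label` group so that both sets keep roughly the same class balance.

```r
library(tidyverse)

set.seed(2020)  # hypothetical seed, used only to make this sketch reproducible

# Toy stand-in for adult_tidy: 36 "<=50K" rows and 12 ">50K" rows
toy <- tibble(
  id = 1:48,
  hrs_per_week = rep(c(40, 35, 50, 45), times = 12),
  label = factor(rep(c("<=50K", "<=50K", "<=50K", ">50K"), times = 12))
)

# Assumed 75/25 split, sampled separately within each label
train <- toy %>%
  group_by(label) %>%
  slice_sample(prop = 0.75) %>%
  ungroup()

# Everything not drawn into the training set becomes the test set
test <- anti_join(toy, train, by = "id")

count(train, label)
count(test, label)
```

The `id` column exists only so that `anti_join()` can identify the rows left out of the training set; on the real data a row number added with `mutate(id = row_number())` would serve the same purpose.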