# DSCI 310 Group Project: Income Prediction and Inference Analysis

Prepared by Group 10:
- Benjamin Gerochi
- Izzy Zhou
- Michael Tham
- Yui Mikuriya

## (1) Summary


## (2) Introduction
### Dataset Overview

The dataset selected for this project is the **Adult Dataset** (Kohavi & Becker, 1996), available through the **UCI Machine Learning Repository**. It contains demographic and income data collected by the **U.S. Census Bureau** and is widely used for predicting whether an individual’s income exceeds **$50,000 per year** based on various demographic factors.

### Dataset Details:
- **Dataset Name**: UC Irvine Adult Dataset  
- **Source**: 1994 U.S. Census database, compiled by Ronny Kohavi and Barry Becker  
  - [Access the dataset here](https://archive.ics.uci.edu/ml/datasets/adult)  
- **Total Observations**: 32,561  
- **Total Variables**: 15  


### Variables and Their Types


| Variable Index | Variable Name       | Type      | Description |
|----------------|---------------------|-----------|-------------|
| 0              | age                 | continuous       | Age of the individual |
| 1              | workclass           | categorical    | Employment sector (e.g., Private, Self-emp-not-inc, State-gov) |
| 2              | fnlwgt              | continuous       | Final weight: the estimated number of people in the population that the observation represents |
| 3              | education           | categorical    | Highest level of education attained |
| 4              | education-num       | continuous       | Numerical representation of education level |
| 5              | marital-status      | categorical    | Marital status (e.g., Never-married, Married-civ-spouse) |
| 6              | occupation          | categorical    | Type of occupation (e.g., Adm-clerical, Exec-managerial) |
| 7              | relationship        | categorical    | Relationship of the individual to the household (e.g., Husband, Not-in-family) |
| 8              | race                | categorical    | Race of the individual (e.g., White, Black) |
| 9              | sex                 | categorical    | Gender (Male/Female) |
| 10             | capital-gain        | continuous       | Capital gains earned |
| 11             | capital-loss        | continuous       | Capital losses incurred |
| 12             | hours-per-week      | continuous       | Average hours worked per week |
| 13             | native-country      | categorical    | Country of origin |
| 14             | income              | categorical    | Income level (<=50K, >50K) |

**Table 1.1**: Description of variables in the U.S. Census Adult dataset

<!-- ### Descriptive Statistics

- **Age**: Ranges from 17 to 90, with an average age of 38.6 years.
- **Education-num**: Has values from 1 to 16, representing various education levels.
- **Capital-gain**: Ranges from 0 to 99,999, with most values concentrated around zero, indicating that high capital gains are rare.
- **Capital-loss**: Similar to capital gain, most values are zero.
- **Hours-per-week**: Has a mean of 40.4 hours, aligning with typical full-time work expectations.
- **Income**: Target variable, classified into two categories: <=50K and >50K. -->

This dataset includes both **categorical** and **numerical** variables, making it suitable for analyzing relationships between **demographic attributes** and **income levels**. Further **exploration and preprocessing** may involve handling **missing values** and **encoding categorical features**.  
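The preprocessing steps mentioned above can be sketched in base R. The data frame `toy` and its values are made up for illustration and are not rows from the Adult dataset; only the column names follow Table 1.1. Dropping incomplete rows is one simple strategy among several and is an assumption here, not necessarily the approach taken later in the analysis.

```r
# Toy data (hypothetical values) with the same kinds of columns as the Adult dataset
toy <- data.frame(
  age       = c(39, 50, 38, 53),
  workclass = c("State-gov", NA, "Private", "Private"),
  income    = c("<=50K", "<=50K", ">50K", "<=50K"),
  stringsAsFactors = FALSE
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))

# Drop rows with any missing value (one simple strategy among several)
toy_complete <- toy[complete.cases(toy), ]

# Encode character columns as factors so modelling functions treat them as categorical
is_chr <- vapply(toy_complete, is.character, logical(1))
toy_complete[is_chr] <- lapply(toy_complete[is_chr], factor)

# Expand the factor into dummy columns for models that need a numeric matrix
design <- model.matrix(~ age + workclass, data = toy_complete)

print(na_counts)
str(toy_complete)
```

Functions such as `glm()` build the dummy columns internally from factors, so the explicit `model.matrix()` call mainly matters for tools that expect a numeric matrix as input.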

### Research Question  
**How do demographic factors influence the likelihood of an individual's annual income exceeding $50,000?**  

This study explores relationships between demographic variables and income levels without assuming in advance which predictors matter most. Our team initially analyzed different aspects of the dataset before deciding to focus on demographic influences on income.  

### Literature Context  
Prior research supports the importance of demographic factors in income prediction. Jo (2023) analyzed the **Adult dataset** and identified **capital gain, education, relationship status, and occupation** as key predictors. Similarly, Azzollini et al. (2023) found that demographic differences explained **40% of income inequality** across OECD countries, reinforcing the relevance of our analysis.  

### Objective

This research question encompasses both **prediction** and **inference**:

- **Prediction:** To develop a model capable of predicting the likelihood of an individual earning more than $50,000 annually based on key demographic factors.

- **Inference:** To understand the nature and strength of the relationships between various demographic factors and income.
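To make the distinction between the two goals concrete, here is a minimal sketch using logistic regression on simulated data. Logistic regression is an assumption on our part, chosen because the outcome is binary; the data frame `sim`, its coefficients, and the variable `high_income` are invented for illustration and do not come from the Adult dataset.

```r
set.seed(310)

# Simulated data (not the Adult dataset): high income becomes more likely
# with more years of education and more weekly hours
n <- 500
sim <- data.frame(
  education_num  = sample(1:16, n, replace = TRUE),
  hours_per_week = round(runif(n, 10, 60))
)
log_odds <- -8 + 0.4 * sim$education_num + 0.05 * sim$hours_per_week
sim$high_income <- rbinom(n, size = 1, prob = plogis(log_odds))

# Fit a logistic regression: models the log-odds of high_income == 1
fit <- glm(high_income ~ education_num + hours_per_week,
           data = sim, family = binomial)

# Inference: coefficient estimates, standard errors, and p-values
coef_table <- summary(fit)$coefficients
print(coef_table)

# Exponentiated coefficients are odds ratios per one-unit increase
odds_ratios <- exp(coef(fit))

# Prediction: estimated probability of high income for a new individual
new_person <- data.frame(education_num = 13, hours_per_week = 40)
p_hat <- predict(fit, newdata = new_person, type = "response")
print(p_hat)
```

The same fitted object serves both purposes: the coefficient table addresses inference, while `predict()` with `type = "response"` returns the estimated probability used for prediction.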

### Loading Required Libraries
We start by installing and loading the R packages used for data analysis and preprocessing.
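Because `install.packages()` downloads and installs the requested packages every time it runs, one option is to install only those that are missing. The sketch below shows that idea for a subset of the packages; `missing_packages` is a hypothetical helper name, and the install call is left commented out so the example has no side effects.

```r
# Subset of the packages used in this analysis, for illustration
pkgs <- c("broom", "infer", "glmnet", "tidyverse")

# Hypothetical helper: return the packages in `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)

# Install only what is missing (commented out to avoid side effects here)
# if (length(to_install) > 0) install.packages(to_install)

print(to_install)
```

When two loaded packages export a function with the same name, calling it with an explicit namespace prefix (for example `mltools::rmse` versus `modelr::rmse`) removes any ambiguity about which version runs.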

## (3) Methods & Results

In [3]:
install.packages(c("broom", "repr", "infer", "gridExtra", "faraway",  
                   "mltools", "leaps", "glmnet", "cowplot", "modelr",  
                   "tidyverse", "ggplot2", "dplyr", "GGally", "patchwork", "knitr"))

library(broom)
library(repr)
library(infer)
library(gridExtra)
library(faraway)
library(mltools)
library(leaps)
library(glmnet)
library(cowplot)
library(modelr)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(GGally)
library(patchwork)
library(knitr)


The downloaded packages are in
	/var/folders/l5/r_qqxpln463g2rmqs4kwjq840000gn/T//RtmpGmWjQR/downloaded_packages


Loading required package: Matrix

Loaded glmnet 4.1-8


Attaching package: ‘modelr’


The following objects are masked from ‘package:mltools’:

    mse, rmse


The following object is masked from ‘package:broom’:

    bootstrap


── Attaching core tidyverse packages ────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ──────────────────────────────────────────── tidyverse_conflicts() ──
✖ modelr::bootstrap() masks broom::bootstrap()
✖ dplyr::combine()    masks gridExtra::combine()
✖ tidyr::expand()     masks Matrix::expand()


In [4]:
file_path <- "data/census+income/adult.data"

income <- read_csv(file_path, 
                   col_names = FALSE,
                   na = "?")

colnames(income) <- c(
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race",
    "sex", "capital_gain", "capital_loss", "hours_per_week",
    "native_country", "income")

head(income)

Rows: 32561 Columns: 15
── Column specification ──────────────────────────────────────────────────────────
Delimiter: ","
chr (9): X2, X4, X6, X7, X8, X9, X10, X14, X15
dbl (6): X1, X3, X5, X11, X12, X13

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


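| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|
| `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` |
| 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |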
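As a quick way to check the loading logic without the data file, the sketch below reads a few inline lines with base R's `read.csv()`, assigning column names at read time and treating `?` as missing. It assumes the raw file separates fields with a comma followed by a space. The first two lines match rows printed above; the third is a made-up row added only to exercise the `?` handling. The names `adult_cols`, `raw_lines`, and `sample_income` are illustrative.

```r
# Column names as listed in Table 1.1 (with underscores instead of hyphens)
adult_cols <- c(
  "age", "workclass", "fnlwgt", "education", "education_num",
  "marital_status", "occupation", "relationship", "race",
  "sex", "capital_gain", "capital_loss", "hours_per_week",
  "native_country", "income"
)

# Inline sample: lines 1 and 2 match rows shown above; line 3 is hypothetical
raw_lines <- c(
  "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K",
  "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K",
  "45, ?, 100000, HS-grad, 9, Divorced, ?, Not-in-family, White, Female, 0, 0, 40, United-States, >50K"
)

# strip.white removes the space after each comma before "?" is compared to na.strings
sample_income <- read.csv(
  text = paste(raw_lines, collapse = "\n"),
  header = FALSE,
  col.names = adult_cols,
  na.strings = "?",
  strip.white = TRUE,
  stringsAsFactors = FALSE
)

# Store the target as a factor with an explicit level order
sample_income$income <- factor(sample_income$income, levels = c("<=50K", ">50K"))

# Missing values per column
print(colSums(is.na(sample_income)))
```

Fixing the level order of `income` up front makes later modelling output easier to interpret, since the second level (`>50K`) is the one treated as the "success" class by `glm()` with a binomial family.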


## (4) Discussion


## (5) References
