# Income Analysis

This project uses a dataset of adult incomes from the 1994 Census to build models that predict whether an individual's annual income exceeds $50,000, based on characteristics of that individual. Several supervised learning models are explored to compare how they perform on this classification task.

The goal of the project is to determine which sociodemographic characteristics influence an individual's annual income, and how strongly. Doing so could provide insight into income-related social inequalities.

## Dataset

The dataset used for this project is titled "Adult" and is from the University of California Irvine's Machine Learning Repository.

Becker, B., & Kohavi, R. (1996). Adult. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20

In [1]:
import pandas as pd

data = pd.read_csv('./data/raw_data.csv')
print(data.shape)

(48842, 15)


The dataset contains information on adults from the 1994 Census. It has 48,842 rows (one per individual) and 15 columns: 14 feature variables and the target variable.

#### Target Variable

The target variable in this analysis is the "income" variable. This binary categorical variable indicates whether an individual's annual income is greater than $50,000.

#### Feature Variables

age (numerical) - Age

workclass (categorical) - Type of work

fnlwgt (numerical) - Final weight of observation

education (categorical) - Highest level of education

education-num (numerical) - Number assigned based on "education" feature

marital-status (categorical) - Marital status

occupation (categorical) - Industry/role

relationship (categorical) - Role within the family or household (e.g. Husband, Wife, Own-child)

race (categorical) - Race

sex (categorical) - Sex

capital-gain (numerical) - Capital gain

capital-loss (numerical) - Capital loss

hours-per-week (numerical) - Average number of hours worked per week

native-country (categorical) - Country of birth

## Data Cleaning

The UCI Machine Learning Repository website's page for this dataset mentions that there are missing values in the columns "workclass", "occupation", and "native-country".

In [2]:
missing_values_columns = ['workclass', 'occupation', 'native-country']
for column in missing_values_columns:
    print(f'{column} has missing values: {data[column].isnull().any()}')

workclass has missing values: True
occupation has missing values: True
native-country has missing values: True


These columns do indeed have missing values. Because the affected rows are a small share of the data (about 2.5%, as the row counts below show), rows with a missing value in any column are simply dropped in this project.

In [3]:
print(data.shape)
data = data.dropna()
print(data.shape)

(48842, 15)
(47621, 15)


1,221 rows were dropped since they had missing values.
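
Before dropping rows, it can help to see how the missing values are spread across columns and how many rows `dropna()` will remove. The following is a minimal sketch on a hypothetical toy frame (the `toy` values are invented for illustration and are not taken from the census data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the census data.
toy = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'workclass': ['State-gov', np.nan, 'Private', 'Private'],
    'occupation': ['Adm-clerical', np.nan, 'Sales', np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Number of rows with at least one missing value, i.e. the rows
# that dropna() removes.
rows_with_missing = int(toy.isnull().any(axis=1).sum())
print(rows_with_missing)

cleaned = toy.dropna()
print(cleaned.shape)
```

The same two expressions can be applied to `data` to confirm that the number of rows with a missing value matches the number of rows removed.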

In [4]:
categorical_vars = data.select_dtypes(exclude='number').columns
for var in categorical_vars:
    print(f'{var}:', data[var].unique())

workclass: ['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' '?'
 'Self-emp-inc' 'Without-pay' 'Never-worked']
education: ['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' 'Assoc-acdm'
 'Assoc-voc' '7th-8th' 'Doctorate' 'Prof-school' '5th-6th' '10th'
 '1st-4th' 'Preschool' '12th']
marital-status: ['Never-married' 'Married-civ-spouse' 'Divorced' 'Married-spouse-absent'
 'Separated' 'Married-AF-spouse' 'Widowed']
occupation: ['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Craft-repair' 'Transport-moving'
 'Farming-fishing' 'Machine-op-inspct' 'Tech-support' '?'
 'Protective-serv' 'Armed-Forces' 'Priv-house-serv']
relationship: ['Not-in-family' 'Husband' 'Wife' 'Own-child' 'Unmarried' 'Other-relative']
race: ['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Eskimo' 'Other']
sex: ['Male' 'Female']
native-country: ['United-States' 'Cuba' 'Jamaica' 'India' '?' 'Mexico' 'South'
 'Puerto-Rico' 'Honduras' 'England' '

The "workclass", "occupation", and "native-country" columns still contain values that are the string "?". Rows containing this value in these columns should also be dropped.

In [5]:
# Keep only the rows in which no column equals the placeholder string '?'
data = data[~(data == '?').any(axis=1)]
print(data.shape)

(45222, 15)


An additional 2,399 rows were dropped.
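
An alternative to the two-step cleanup above is to have pandas treat the "?" placeholder as missing while the file is read, so that a single `dropna()` handles both cases. This is only a sketch: the three-row CSV below is a hypothetical excerpt, and it assumes the placeholder is exactly `?` with no surrounding whitespace.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt laid out like the raw file.
raw_csv = io.StringIO(
    "age,workclass,occupation\n"
    "39,State-gov,Adm-clerical\n"
    "54,?,?\n"
    "25,Private,Sales\n"
)

# na_values makes pandas read '?' as NaN in addition to its default markers.
toy = pd.read_csv(raw_csv, na_values='?')

# A single dropna() now removes rows with either kind of missing value.
toy_clean = toy.dropna()
print(toy_clean.shape)
```

With this approach the row counts for "true" missing values and "?" placeholders are no longer reported separately, which is a trade-off against the step-by-step cleanup used in this project.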

The "education" categorical variable may be redundant because of the "education-num" numerical variable.

In [6]:
print(len(data['education'].unique()))
print(data['education-num'].min(), data['education-num'].max())

16
1 16


There are 16 unique values in the "education" column and the "education-num" column contains integers from 1 to 16. This suggests that the "education" variable can be dropped since it is an ordinal variable and its information is already captured as integers in "education-num".
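
Matching counts alone do not prove that each "education" label corresponds to exactly one "education-num" code. One way to check this is to group by each column and count the distinct values of the other. The sketch below uses a hypothetical toy frame; the same two `groupby` expressions could be run on `data` before dropping the column.

```python
import pandas as pd

# Hypothetical toy values chosen for illustration.
toy = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'Bachelors', 'Masters', 'HS-grad'],
    'education-num': [13, 9, 13, 14, 9],
})

# Distinct codes per label, and distinct labels per code.
codes_per_label = toy.groupby('education')['education-num'].nunique()
labels_per_code = toy.groupby('education-num')['education'].nunique()

# The mapping is one-to-one only if every group has exactly one counterpart.
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```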

In [7]:
data = data.drop(columns='education')

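The "fnlwgt" column can also be dropped. It is a sampling weight that describes how many people in the population an observation represents rather than a characteristic of the individual, so it is not used as a predictor in this project.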

In [8]:
data = data.drop(columns='fnlwgt')

Data munging may also be necessary for some of the categorical variables.

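The "income" target variable has inconsistent labels: some values carry a trailing period (">50K." alongside ">50K"). These labels should be unified and converted into a binary variable.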

In [9]:
income_mapping = { '>50K': 1, '>50K.': 1, '<=50K': 0, '<=50K.': 0 }
data['income'] = data['income'].map(income_mapping)
print(data['income'].unique())

[0 1]
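
`Series.map` returns NaN for any label that is not a key in the mapping, so an unexpected variant would silently become a missing value. A slightly more defensive variant, sketched here on a hypothetical toy series, normalizes the labels first and then counts unmapped values:

```python
import pandas as pd

# Hypothetical labels covering the variants handled by the mapping above.
toy_income = pd.Series(['>50K', '<=50K.', '<=50K', '>50K.'])

# Strip surrounding whitespace and any trailing period, then map to 0/1.
normalized = toy_income.str.strip().str.rstrip('.')
binary = normalized.map({'>50K': 1, '<=50K': 0})

# Labels missing from the dictionary become NaN, so count them explicitly.
unmapped = int(binary.isnull().sum())
print(binary.tolist(), unmapped)
```

A nonzero `unmapped` count would indicate a label variant that the mapping does not cover.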


The "native-country" variable takes many different values. Since the census was conducted in the US, it may be more useful to consider only whether someone's native country is the United States. For these reasons, "native-country" is transformed into a binary variable called "us-native" that indicates whether an individual's native country is the US.

In [10]:
# 1 if the native country is the United States, 0 otherwise
data['native-country'] = (data['native-country'] == 'United-States').astype(int)
# Rename the column to reflect its new binary meaning
data = data.rename(columns={'native-country': 'us-native'})
print(data['us-native'].unique())

[1 0]


In [11]:
print(data.shape)

(45222, 13)


In summary, data cleaning removed 3,620 rows and 2 columns. Some of the remaining categorical variables still need additional munging, but this is deferred until after the exploratory data analysis (EDA) to keep the EDA simpler.
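
The row counts in this summary can be reproduced from the shapes printed in the cells above. The short sketch below hard-codes those printed values (the variable names are introduced here for illustration only):

```python
# Row counts taken from the printed shapes in the cleaning steps.
rows_start = 48842         # raw data
rows_after_dropna = 47621  # after dropping rows with missing values
rows_final = 45222         # after dropping rows containing '?'

dropped_missing = rows_start - rows_after_dropna
dropped_placeholder = rows_after_dropna - rows_final
dropped_total = rows_start - rows_final

# Share of the original rows removed by cleaning.
share_dropped = dropped_total / rows_start

print(dropped_missing, dropped_placeholder, dropped_total)
print(f'{share_dropped:.1%}')
```

Roughly 7.4% of the original rows are removed in total, which is worth keeping in mind when interpreting results from the cleaned data.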