## Predicting income category from socioeconomic characteristics

by Luke Ni, Michael Oyatsi, Nishanth Kumarasamy & Shruti Sasi 

In [None]:
import numpy as np
import pandas as pd
import altair as alt
import altair_ally as aly

# Simplify working with large dataset in altair_ally
aly.alt.data_transformers.enable('vegafusion')

## Summary

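We investigate the socioeconomic indicators that contribute to wealth distribution in society. Using the Adult dataset from the UC Irvine Machine Learning Repository, we fit a logistic regression classifier to predict whether an individual's income is above or below USD 50,000.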

## Introduction

How is an individual's income affected by other socioeconomic factors? This is the question our team set out to investigate. Socioeconomic status is defined here as a way of describing people based on their education, income and type of job (National Cancer Institute, n.d.). Given the diversity of backgrounds that exist in society, we set out to understand which factors contribute most to an individual's income.

In this analysis, we use machine learning to predict whether an individual's income is above or below USD 50,000. As the government commits to large investments in Canadian communities to improve the lives of citizens (Housing, Infrastructure and Communities Canada, 2025), we envision our analysis as a way to give the government insight into which investments offer the best chance of improving an individual's life. The persistent rise in income and wealth inequality presents a strong case for prudent investment to improve the lives of all Canadians (Yassin, Petit, & Abraham, 2024).


### Methods

#### Data

For our dataset, we use the Adult dataset sourced from the UC Irvine Machine Learning Repository (Becker & Kohavi, 1996). The dataset contains 14 features obtained from census data that describe an individual's attributes. The target is a binary categorical column indicating whether an individual earns more than USD 50,000 (`>50K`) or USD 50,000 or less (`<=50K`).
The data and the descriptions of the corresponding attributes can be explored using this [link](https://archive.ics.uci.edu/dataset/2/adult).

### Exploratory Data Analysis

Prior to model fitting and feature selection, we first perform exploratory data analysis (EDA) to understand how the distributions of our features relate to our target.

The code chunk below imports our dataset. 

In [None]:
# Import the data from the UCI Repository.

from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as a pandas dataframe) 
adult_df = pd.concat([adult.data.features, adult.data.targets], axis=1)

# Rename target values
adult_df.income = adult_df.income.replace(to_replace=['<=50K.', '>50K.'], value=['<=50K','>50K'])

# Combine all married groups in marital status to one group. 
adult_df['marital-status'] = adult_df['marital-status'].replace(to_replace=r'^Married\b.*', value='Married', regex=True)

# Display First observations of the dataset
adult_df.head(5)
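The two cleaning steps above can be illustrated on a handful of made-up values. This is a minimal sketch, assuming toy series named `income` and `status` (both hypothetical); it shows that the list-based `replace` matches whole values, while the regex-based `replace` collapses every label that starts with `Married`.

```python
import pandas as pd

# Toy target values: some labels carry a trailing period (hypothetical sample)
income = pd.Series(["<=50K", "<=50K.", ">50K", ">50K."])

# Exact-value replacement: only whole values equal to '<=50K.' or '>50K.' change
income_clean = income.replace(to_replace=["<=50K.", ">50K."], value=["<=50K", ">50K"])

# Toy marital-status values mirroring the categories named in this report
status = pd.Series([
    "Married-civ-spouse",
    "Married-AF-spouse",
    "Married-spouse-absent",
    "Never-married",
    "Divorced",
])

# '^Married' anchors at the start of the label; '\b.*' consumes the remainder,
# so the whole label is replaced by 'Married'
status_clean = status.replace(to_replace=r"^Married\b.*", value="Married", regex=True)

print(income_clean.tolist())
print(status_clean.tolist())
```

Because the pattern is anchored at the start, `Never-married` is left untouched.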

#### Train-Test-Split: Obey the Golden Rule

Before proceeding with further EDA and visualization, we split off and set aside a test set so that we can evaluate model performance on unseen data, in accordance with the Golden Rule of machine learning: the test data must not influence model training in any way.

In [None]:
adult_df.income.value_counts()

In [None]:
# Create a Test train split of the data
from sklearn.model_selection import train_test_split

adult_train, adult_test = train_test_split(adult_df, test_size=0.3, random_state=522)
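Because the target classes are imbalanced, one option worth considering is a stratified split, which keeps the class proportions the same in both partitions. The sketch below is not the split used in this analysis; it uses a hypothetical toy frame (`toy`) with a 3:1 class ratio to show the effect of the `stratify` argument.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 3:1 class imbalance
toy = pd.DataFrame({
    "age": list(range(200)),
    "income": ["<=50K"] * 150 + [">50K"] * 50,
})

# stratify keeps the class ratio identical in the train and test partitions
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=522, stratify=toy["income"]
)

print(toy_train["income"].value_counts(normalize=True))
print(toy_test["income"].value_counts(normalize=True))
```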


#### Discern Features & Strategize Missing Data

With the split complete, we review the `adult_train` data to understand the statistics of the numerical features and to check for null values.

In [None]:
# Investigate quality of the data
adult_train.info()

In [None]:
# Summarize null values 
adult_train.isna().sum()

Three columns in the dataset contain `NaN` values. Further review shows that the same columns also have missing values recorded as a question mark (`?`) to indicate a lack of information. Given the number of observations in the data, and because the non-missing values of the `workclass` and `native-country` features are concentrated heavily in a single category, we address the missing values with simple imputation using the most frequent value.

The distribution of the `occupation` feature is more even across its categories. However, given the wide range of occupations that could fall within the most frequent category, `Prof-specialty`, we also apply simple imputation with the most frequent value to this feature.

In [None]:
# List the categorical features that contain '?' placeholder values
obj_df = adult_train.select_dtypes(include='object')
for col in obj_df.columns:
    if obj_df[col].str.contains('?', regex=False).any():
        print(col)

In [None]:
# Show the category counts for each feature that contains '?' placeholder values
for col in obj_df.columns:
    if obj_df[col].str.contains('?', regex=False).any():
        print(obj_df[col].value_counts())

In [None]:
from sklearn.impute import SimpleImputer

# # Reinforce conversion of columns to str object. 
# for col in adult_train.select_dtypes(include='object'):
#     adult_train[col] = adult_train[col].astype(str)

# Replace '?' placeholders with NaN so the imputer treats them as missing
#adult_train = adult_train.apply(lambda col: col.str.strip() if col.dtype=='object' else col)
adult_train = adult_train.replace('?', np.nan)

# Perform Simple Imputation
simple_imp = SimpleImputer(missing_values = np.nan, strategy='most_frequent')
adult_train_imp = pd.DataFrame(simple_imp.fit_transform(adult_train), 
                               index=adult_train.index, 
                               columns=adult_train.columns)


# Recast numerical features to int data types after imputation
adult_train_imp = adult_train_imp.astype({'age':'int64',
                       'fnlwgt': 'int64',
                       'capital-gain': 'int64',
                       'capital-loss': 'int64',
                       'hours-per-week': 'int64'})

# Confirm all missing values have been imputed
adult_train_imp.info()
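The imputation strategy can be traced on a few made-up rows. This is a minimal sketch under stated assumptions: the frames `toy_train` and `toy_test` are hypothetical, and the imputer is fitted on the training rows only and then reused on the test rows, so that the test data does not influence the fill value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy rows; '?' marks missing information as in the Adult dataset
toy_train = pd.DataFrame({"workclass": ["Private", "Private", "?", "State-gov"]}, dtype=object)
toy_test = pd.DataFrame({"workclass": ["?", "Self-emp-inc"]}, dtype=object)

# Convert the placeholder to NaN so the imputer recognises it as missing
toy_train = toy_train.replace("?", np.nan)
toy_test = toy_test.replace("?", np.nan)

# Learn the most frequent category from the training rows only
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
toy_train_imp = pd.DataFrame(
    imputer.fit_transform(toy_train), index=toy_train.index, columns=toy_train.columns
)

# Reuse the fitted imputer on the test rows
toy_test_imp = pd.DataFrame(
    imputer.transform(toy_test), index=toy_test.index, columns=toy_test.columns
)

print(toy_train_imp["workclass"].tolist())
print(toy_test_imp["workclass"].tolist())
```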
                        

#### Univariate Distribution of The Quantitative Variables 

*Note - Visualization of the distributions below using the `altair-ally` package is performed with code adapted from UBC's DSCI-573: Feature and Model Selection Course. Reference material on the Altair Ally package can be found at [this external link](https://vega.github.io/altair_ally/intro.html)*

We first investigate the distributions of the dataset's quantitative variables, split by income bracket. From the plots below, we pay special attention to the age distribution of the respondents. Both distributions are right-skewed. Respondents earning USD 50,000 or less skew younger than respondents earning above USD 50,000.

Also of note is the distribution of hours worked per week, with most respondents in both income brackets reporting about 40 hours per week. The `fnlwgt` feature is a numerical value representing the final weight of the record. This value can be viewed as the number of people represented by the row. Without further detail on how this value is derived, we chose to ignore it in our analysis.

Similarly, no in-depth information is provided on the `capital-loss` and `capital-gain` features, and we discard these features in the model generation portion of our analysis.

In [None]:
aly.dist(adult_train_imp, color='income')

#### Univariate Distribution of the Categorical Variables

We also review the distributions of select categorical variables below.

From the first histogram, showing the `income` distribution, we can see that the dataset contains more records of low income earners than high income earners, a ratio of about 3:1.

Reviewing the distribution of marital status, we can see that high income earners are concentrated primarily among married respondents, with scant representation in the other marital status groups. Note that the original dataset contains 3 distinct married groups: `Married-AF-spouse` for respondents whose partners are in the Armed Forces, `Married-civ-spouse` for individuals married to civilian spouses, and `Married-spouse-absent`. For simplification, all three values have been combined into one category, `Married`.

When analysing the `workclass` feature, which classifies the respondents' employer, we notice that high income earners are represented primarily in the private sector and less so in other employer categories.

While the distribution of the `occupation` feature is less conclusive, we see that high income earners are concentrated in `Exec-managerial` and `Prof-specialty`, executive management and professional specialties respectively. Likewise, when reviewing the `education` feature, we can see that high income earners tend to have at least some college education and are barely present among respondents who did not finish high school (for clarity, 12th grade is the last year of high school).

To avoid propagating inherent societal biases in our model, the following features are not considered for feature or model selection: `sex` and `race`.

Moreover, the `relationship` feature, which represents the respondent's relationship relative to others, is not considered because its useful information is already encoded in the respondent's marital status.

We also exclude the `native-country` feature from our visualization and from our model. The overwhelming majority of the respondents are American-born, and with little additional information about the foreign-born respondents (e.g. how long they have been in the USA), the feature adds little to the analysis.

In [None]:
# Look at the univariate distributions (counts) for the categorical variables

# Drop the features excluded from the analysis; education-num is dropped here
# because it is stored as an object dtype after imputation
aly.dist(adult_train_imp.select_dtypes(include='object').drop(columns=['relationship', 'sex',
                                                                       'education-num', 'race', 
                                                                       'native-country']), 
         dtype='object', color='income')
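The visual comparison of income brackets within each category can also be checked numerically. As a rough sketch, assuming a hypothetical toy frame (`toy`) in place of `adult_train_imp`, `pd.crosstab` with `normalize="index"` gives the share of each income bracket within every category.

```python
import pandas as pd

# Hypothetical toy data standing in for the imputed training set
toy = pd.DataFrame({
    "marital-status": ["Married", "Married", "Married",
                       "Never-married", "Never-married", "Divorced"],
    "income": [">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"],
})

# Row-normalised table: each row sums to 1 and shows the income split per category
share = pd.crosstab(toy["marital-status"], toy["income"], normalize="index")
print(share)
```

Applied to the real training data, the same call would quantify how strongly high income earners are concentrated in a given category.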

### Features & Model Selection

### Pre-processing pipeline

The adult dataset has various types of features: numeric, categorical, and binary. The table below lists the transformation applied to each feature.

| Feature | Type | Transformation |
| :------- | :------: | -------: |
| age | Integer | no transformation |
| workclass | Categorical | imputation, one-hot encoding |
| fnlwgt | Integer | Scaling with StandardScaler |
| education | Categorical | dropped (`education-num` already encodes it ordinally) |
| education-num | Integer | no transformation (already ordinal) |
| marital-status | Categorical | one-hot encoding |
| occupation | Categorical | imputation, one-hot encoding |
| relationship | Categorical | one-hot encoding |
| race | Categorical | one-hot encoding |
| sex | Binary | one-hot encoding with drop=if_binary |
| capital-gain | Integer | Scaling with StandardScaler |
| capital-loss | Integer | Scaling with StandardScaler |
| hours-per-week | Integer | Scaling with StandardScaler |
| native-country | Categorical | imputation, one-hot encoding |

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer

# Replace '?' placeholders with NaN in both splits so the imputer handles them
adult_train = adult_train.replace('?', np.nan)
adult_test = adult_test.replace('?', np.nan)

numeric_features = ["fnlwgt", "hours-per-week", "capital-gain", "capital-loss"]
categorical_features = ["workclass", "marital-status", "occupation", "relationship", "native-country", "race"]
#ordinal_features = ["education"]
binary_features = ["sex"]
# age and education-num are listed as "no transformation" in the table above
passthrough_features = ["age", "education-num"]
drop_features = ["education"]
target = "income"

numeric_transformer = StandardScaler()
binary_transformer = OneHotEncoder(drop="if_binary", dtype=int)
categorical_transformer = make_pipeline(
            SimpleImputer(missing_values = np.nan, strategy='most_frequent'), 
            OneHotEncoder(handle_unknown="ignore", sparse_output=False)
        )

# Columns not listed in any transformer are dropped by default (remainder="drop"),
# so the untransformed features are passed through explicitly
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
    (binary_transformer, binary_features),
    ("passthrough", passthrough_features),
    ("drop", drop_features)
)
preprocessor

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Separate the features from the target for both splits
X_train = adult_train.drop(columns=target)
X_test = adult_test.drop(columns=target)
y_train = adult_train[target]
y_test = adult_test[target]

# Always return a dense array so the result can be wrapped in a dataframe
preprocessor.set_params(sparse_threshold=0)

# Fit the preprocessor on the training features, then create a dataframe
# with the transformed features and column names
adult_train_enc = pd.DataFrame(
    preprocessor.fit_transform(X_train),
    index=X_train.index,
    columns=preprocessor.get_feature_names_out(),
)

# Show the transformed data
adult_train_enc


### Fit a model

In [None]:
# Logistic regression classifier on top of the preprocessing pipeline
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", LogisticRegression(max_iter=200))
])

model.fit(X_train, y_train)

### Evaluate model

Accuracy

In [None]:
from sklearn.metrics import accuracy_score

# Predict on the held-out test set
y_pred = model.predict(X_test)

# Sanity-check that the features, labels, and predictions line up
print(X_test.shape)
print(y_test.shape)
print(y_pred.shape)

accuracy_score(y_test, y_pred)
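Accuracy is easier to interpret next to a baseline. With roughly three low income records for every high income record, a classifier that always predicts the majority class already scores about 75%. The sketch below shows this on hypothetical labels (`y_toy`) with an exact 3:1 ratio, using scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with an exact 3:1 class ratio
y_toy = np.array(["<=50K"] * 75 + [">50K"] * 25)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)  # 0.75
```

A fitted model is only informative to the extent that it improves on this figure.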

Confusion matrix

In [None]:
# from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# cm = confusion_matrix(y_test, y_pred)
# disp = ConfusionMatrixDisplay(cm)
# disp.plot()
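As a reading aid for the confusion matrix, the sketch below builds one from six hypothetical labels and predictions (`y_true_toy`, `y_hat_toy`). Rows are true classes and columns are predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true_toy = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"]
y_hat_toy = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"]

# Fix the label order so that '>50K' is treated as the positive class
labels = ["<=50K", ">50K"]
cm_toy = confusion_matrix(y_true_toy, y_hat_toy, labels=labels)

# For a binary problem, ravel() unpacks the matrix row by row
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)
print(tn, fp, fn, tp)  # 3 1 1 1
```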

Classification report

In [None]:
# from sklearn.metrics import classification_report

# print(classification_report(y_test, y_pred))
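The classification report summarizes precision, recall, and F1 per class. As a small worked example on hypothetical labels (`y_true_toy`, `y_hat_toy`), the same quantities can be computed directly for the `>50K` class, which is the minority class in this dataset.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true labels and predictions
y_true_toy = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"]
y_hat_toy = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"]

# Precision: of the rows predicted '>50K', how many truly are (1 of 2)
precision = precision_score(y_true_toy, y_hat_toy, pos_label=">50K")

# Recall: of the rows that truly are '>50K', how many were found (1 of 2)
recall = recall_score(y_true_toy, y_hat_toy, pos_label=">50K")

# F1: harmonic mean of precision and recall
f1 = f1_score(y_true_toy, y_hat_toy, pos_label=">50K")

print(precision, recall, f1)  # 0.5 0.5 0.5
```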

### References

National Cancer Institute. (n.d.). Socioeconomic status. In NCI Dictionary of Cancer Terms. Retrieved November 20, 2025, from https://www.cancer.gov/publications/dictionaries/cancer-terms/def/socioeconomic-status

Housing, Infrastructure and Communities Canada. (2025, September 12). Investing in Canada Plan – Building a Better Canada. Retrieved November 20, 2025, from https://housing-infrastructure.canada.ca/plan/about-invest-apropos-eng.html

Becker, B., & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20

Yassin, S., Petit, G., & Abraham, Y. (2024, July 18). The troubling rise of income and wealth inequality in Canada. Policy Options. Institute for Research on Public Policy. https://policyoptions.irpp.org/2024/07/income-wealth-inequality/

Chen, J. (n.d.). Feature Significance Analysis of the US Adult Income Dataset (TR 1869) [Technical report, University of Wisconsin–Madison]. UW–Madison Institutional Repository. https://minds.wisconsin.edu/bitstream/handle/1793/82299/TR1869%20Junda%20Chen%203.pdf?sequence=1&isAllowed=y