Data retrieved from: https://archive.ics.uci.edu/dataset/2/adult

The second Jupyter cell below was also taken from that page (via its "Import in Python" button).

# Goal and Introduction

The goal is to use this dataset to predict whether someone makes over $50,000 a year, using variables such as occupation, age, and education.

Since this is a binary classification problem (someone makes either more than $50,000 or at most $50,000), there are several candidate algorithms, such as k-Nearest Neighbors, Logistic Regression, and Decision Trees.

The dataset's page on the UCI Machine Learning Repository states that there are missing values, so the data will need to be examined and cleaned. A preliminary data exploration will follow, and finally an algorithm will be trained on 85% of the data and tested on the remaining 15%.
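
The 85/15 split and the three candidate algorithms named above could be sketched as follows. This is only an illustration on synthetic data from scikit-learn's `make_classification`, not the notebook's actual pipeline; the names `X_demo` and `y_demo`, the hyperparameters, the `random_state` values, and the choice to stratify the split are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned features and the binary income target.
X_demo, y_demo = make_classification(n_samples=1000, n_features=6, random_state=0)

# Hold out 15% for testing; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.15, stratify=y_demo, random_state=0
)

# The three candidate classifiers, with illustrative (untuned) settings.
candidates = {
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# Fit each model on the training part and record its accuracy on the held-out part.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

On the real data, the categorical columns would need to be encoded before any of these models could be fitted; that step is left out of the sketch.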

In [31]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import seaborn as sb

In [32]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
X = adult.data.features 
y = adult.data.targets 
  
# metadata 
print(adult.metadata) 
  
# variable information 
print(adult.variables) 

{'uci_id': 2, 'name': 'Adult', 'repository_url': 'https://archive.ics.uci.edu/dataset/2/adult', 'data_url': 'https://archive.ics.uci.edu/static/public/2/data.csv', 'abstract': 'Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset. ', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 48842, 'num_features': 14, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Income', 'Education Level', 'Other', 'Race', 'Sex'], 'target_col': ['income'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1996, 'last_updated': 'Mon Aug 07 2023', 'dataset_doi': '10.24432/C5XW20', 'creators': ['Barry Becker', 'Ronny Kohavi'], 'intro_paper': None, 'additional_info': {'summary': 'Extraction was done by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the following conditions: ((AAG

In [66]:
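# Treat the '?' placeholder as a missing value
X = X.replace('?', np.nan)
# Some labels carry a trailing period; normalize them so only two classes remain
y = y.replace('<=50K.', '<=50K')
y = y.replace('>50K.', '>50K')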

Now that each '?' has been replaced by NumPy's NaN, all missing values share a single representation. This makes them much easier to search for (which matters more when missing data shows up in a multitude of ways), since pandas methods such as isna() recognize NaN but not the string '?'.

There is an argument that this step is unnecessary, and it has some validity, but NaN is the missing-value marker that both NumPy and pandas understand, which is useful if you want to use any methods from those libraries.

Also, the target labels were not uniform: some carried a trailing period ('<=50K.' and '>50K.'), so those were normalized to '<=50K' and '>50K'.
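
To make the effect of this cleaning step concrete, here is a small sketch on a hand-made frame. The rows and the name `toy` are invented for illustration and are not taken from the dataset. It counts missing values before and after converting '?' to `np.nan`, and then collapses the two label spellings into one.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data: one '?' placeholder, one real NaN,
# and income labels with and without a trailing period.
toy = pd.DataFrame({
    "workclass": ["Private", "?", np.nan, "State-gov"],
    "income": ["<=50K", ">50K.", "<=50K.", ">50K"],
})

# Before cleaning, isna() only sees the real NaN, not the '?' string.
missing_before = int(toy["workclass"].isna().sum())

# After replacing '?' with NaN, both missing entries are counted.
cleaned = toy.replace("?", np.nan)
missing_after = int(cleaned["workclass"].isna().sum())

# Normalize the label variants that end in a period.
labels = cleaned["income"].replace({"<=50K.": "<=50K", ">50K.": ">50K"})

print(missing_before, missing_after)
print(sorted(labels.unique()))
```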

In [67]:
# Encode the target as 0/1 so it can take part in the numeric correlation below
yInt = y.replace('<=50K', 0)
yInt = yInt.replace('>50K', 1)

In [68]:
# Attach the 0/1 target to the features, then correlate the numeric columns only
JointInt = X.join(yInt)
JointInt.corr(numeric_only=True)

,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,income
age,1.0,-0.076628,0.03094,0.077229,0.056944,0.071558,0.230369
fnlwgt,-0.076628,1.0,-0.038761,-0.003706,-0.004366,-0.013519,-0.006339
education-num,0.03094,-0.038761,1.0,0.125146,0.080972,0.143689,0.332613
capital-gain,0.077229,-0.003706,0.125146,1.0,-0.031441,0.082157,0.223013
capital-loss,0.056944,-0.004366,0.080972,-0.031441,1.0,0.054467,0.147554
hours-per-week,0.071558,-0.013519,0.143689,0.082157,0.054467,1.0,0.227687
income,0.230369,-0.006339,0.332613,0.223013,0.147554,0.227687,1.0
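
One way to read the matrix above is to rank the numeric features by the absolute value of their correlation with the 0/1 income column. The sketch below copies the income row from that output into a standalone Series so it runs on its own; the name `income_corr` is an assumption made for the example.

```python
import pandas as pd

# Correlations with income, copied from the matrix printed above.
income_corr = pd.Series({
    "age": 0.230369,
    "fnlwgt": -0.006339,
    "education-num": 0.332613,
    "capital-gain": 0.223013,
    "capital-loss": 0.147554,
    "hours-per-week": 0.227687,
})

# Sort by magnitude so that strong negative correlations would rank high too.
ranked = income_corr.abs().sort_values(ascending=False)
print(ranked)
```

In the notebook itself the same Series could presumably be obtained with `JointInt.corr(numeric_only=True)["income"].drop("income")`. Note that this only captures linear relationships among the numeric columns; the categorical features are not part of this matrix.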
