# Adult Classification

## Description

In this project, we work with data describing a population by its demographic characteristics, followed by its economic and social status.  
We investigate the data and use supervised machine learning algorithms to model the relationship between these features and a person's income.

### Data

The data is taken from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).  
The data is split into two files, `adult.data` and `adult.test`, which are expected in the data folder (they are currently not in the repo because of their size).  
The data is in CSV format, with 14 features plus the target feature.
Direct links to the files: [train](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data) and [test](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test).

### Goal

The goal is to predict whether a person makes over 50K a year, based on the other features.

### Analysis

The analysis proceeds as follows:

1. Data exploration
2. Data cleaning
3. Feature engineering
4. Model selection
5. Model evaluation

## Load Data

In [14]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [26]:
# load datasets (the files have no header row, so column names are supplied)
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
# skipinitialspace=True strips the space after each comma and keeps the default C parser
adults = pd.read_csv('../data/adult.data', names=columns, skipinitialspace=True)
# the first line of adult.test is not a data row, so it is skipped
test = pd.read_csv('../data/adult.test', names=columns, skipinitialspace=True, skiprows=1)
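
One caveat when loading `adult.test`: in the UCI copy of the file, the income labels carry a trailing period (`<=50K.` rather than `<=50K`), so they will not match the training labels as-is. Below is a minimal sketch of normalizing them. The helper name `strip_label_period` and the small in-memory sample are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Two rows in the style of adult.test, trimmed to three columns for brevity.
sample = io.StringIO(
    "25, Private, <=50K.\n"
    "44, Private, >50K.\n"
)
test_sample = pd.read_csv(
    sample, names=["age", "workclass", "income"], skipinitialspace=True
)


def strip_label_period(df, column="income"):
    """Return a copy of df with any trailing '.' removed from the label column."""
    out = df.copy()
    out[column] = out[column].str.rstrip(".")
    return out


cleaned = strip_label_period(test_sample)
print(cleaned["income"].tolist())
```

The same call could be applied to the full `test` frame before comparing its labels with those in `adults`.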

In [27]:
# inspect data
print(adults.shape)
adults.head()

(32561, 15)


| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


The training set has 32,561 rows and 15 columns. Next, we look at the data type of each column and the number of missing values.


In [33]:
# inspect data types
adults.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object

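The types appear to be correctly assigned: the six numerical features are `int64`, and the nine categorical columns (including the target) are `object`.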
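
Since the columns fall into two groups by dtype, it can be handy to keep the two lists of column names around for later steps. The sketch below uses a small stand-in frame rather than `adults`, and the variable names `numeric_cols` and `categorical_cols` are assumptions made for illustration.

```python
import pandas as pd

# Small stand-in frame with the same mix of dtypes as the real data.
df = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Private"],
    "hours-per-week": [40, 13],
    "income": ["<=50K", "<=50K"],
})

# Numeric columns are selected directly; everything else is treated as categorical.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Selecting with `exclude="number"` rather than naming the string dtype avoids depending on how a given pandas version labels text columns.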

In [29]:
# inspect missing values
adults.isna().sum().sort_values()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64

There are no missing values, at least none represented as NaN. Missing values may still be encoded differently, for example as a placeholder string.
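
In the UCI copy of this dataset, unknown values are written as `?`, which `isna()` does not detect. A minimal sketch of counting such placeholders per column follows. The helper `count_placeholder` and the toy frame are hypothetical, and the placeholder string should be adjusted if the file was parsed differently (for example, with leading spaces preserved).

```python
import pandas as pd


def count_placeholder(df, placeholder="?"):
    """Count, per non-numeric column, how many cells equal the placeholder string."""
    text_columns = df.select_dtypes(exclude="number")
    return text_columns.eq(placeholder).sum()


# Toy frame standing in for `adults`.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "?", "Private"],
    "occupation": ["?", "?", "Sales"],
})

counts = count_placeholder(df)
# Show only the columns that actually contain the placeholder.
print(counts[counts > 0])
```

An alternative, if the placeholder is known up front, is to pass `na_values='?'` to `pd.read_csv` so that these entries show up in `isna()` directly.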

## Explore Data

### Numerical Features

Let's start by exploring the distribution of numerical features, and how they relate to the target feature.