# Data Preprocessing

Suppose you are asked to develop a machine learning model that predicts whether an individual earns more than USD 50,000 a year, using the 1994 US Census data. The datasets are sourced from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Census+Income.

The repository provides 5 datasets. However, each dataset is raw and does not come in the form of an ABT (Analytic Base Table), so the datasets are not ready for predictive modeling.

The objective of this notebook is to guide you through the data preprocessing steps on the raw datasets in a sequence of exercises. The expected outcome is "clean" data that can be directly fed into any machine learning algorithm within the Scikit-Learn Python module. The clean data should look like the dataset used in this case study on our website.

# Exercises

**Exercise 0**

Read the training and test datasets directly from the data URLs. Since the datasets do not contain the feature names, specify them explicitly while loading the data. Once you have read in the `adultData` and `adultTest` datasets (named `train_df` and `test_df` in the first solution below), concatenate them into a single dataset called `df`.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Column names are listed on the dataset's web page
col_names = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','income']

In [3]:
train_df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",sep = ',', names=col_names)
train_df.head(2)

|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |


In [4]:
# skiprows=1 skips the first line of adult.test, which is not a data row
test_df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                      sep=',', names=col_names, skiprows=1)
test_df.head(3)

|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K. |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K. |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K. |


In [5]:
df = pd.concat([train_df,test_df])
df.head()

|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
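One detail of the concatenation above is worth a closer look: `pd.concat` keeps each frame's original row labels, so the test rows restart at 0 in the combined frame. The sketch below illustrates this with two tiny in-memory stand-ins for the real files. The sample rows, the text of the skipped first line, and the use of `ignore_index=True` are illustrative assumptions, not part of the original solution.

```python
import io
import pandas as pd

# Two-column stand-ins for adult.data and adult.test (hypothetical mini samples).
# The first line of test_csv stands in for the non-data line that skiprows=1 drops.
train_csv = "39, State-gov\n50, Self-emp-not-inc\n"
test_csv = "not a data row\n25, Private\n38, Private\n"

cols = ["age", "workclass"]
train_small = pd.read_csv(io.StringIO(train_csv), names=cols)
test_small = pd.read_csv(io.StringIO(test_csv), names=cols, skiprows=1)

# Default concat keeps the original labels; ignore_index=True renumbers the rows.
stacked = pd.concat([train_small, test_small])
renumbered = pd.concat([train_small, test_small], ignore_index=True)

print(stacked.index.tolist())     # row labels repeat
print(renumbered.index.tolist())  # row labels are unique
```

Whether to renumber depends on what follows: label-based lookups such as `df.loc[0]` return two rows on the stacked frame, which may or may not be what you want.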


 


**--------------------------------------------------------------------------------**

**`Another way to do this:`**

url = (
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
)  # note: this is a tuple, so it cannot be modified

adultData = pd.read_csv(url[0], sep=',', names=col_names, header=None)
adultTest = pd.read_csv(url[1], sep=',', names=col_names, skiprows=1)
df = pd.concat([adultData, adultTest])
df.head()

**--------------------------------------------------------------------------------**

**Exercise 1**

Make sure the feature types match the descriptions outlined in the Data Description section. For example, confirm `age` is a numeric feature.

In [6]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object
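Reading the `dtypes` listing by eye works for 15 columns, but the same check can be automated. The following is a minimal sketch, assuming the six integer columns shown above are the ones meant to be numeric; the helper name `check_feature_types` and the sample rows are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Features expected to be numeric, taken from the dtypes listing above.
numeric_cols = ["age", "fnlwgt", "education-num",
                "capital-gain", "capital-loss", "hours-per-week"]

def check_feature_types(frame, numeric_cols):
    """Return the names of columns whose dtype disagrees with the expectation."""
    mismatched = []
    for col in frame.columns:
        expected_numeric = col in numeric_cols
        if is_numeric_dtype(frame[col]) != expected_numeric:
            mismatched.append(col)
    return mismatched

# Tiny stand-in frame (hypothetical rows) so the sketch runs on its own.
sample = pd.DataFrame({"age": [39, 50], "workclass": [" State-gov", " Private"]})
print(check_feature_types(sample, numeric_cols))  # an empty list means all types match
```

On the full dataset, the call would be `check_feature_types(df, numeric_cols)`.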

**Exercise 2**

Calculate the number of missing values for each feature. Does the result surprise you?

In [7]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64

**Exercise 3**

In Exercise 2, you should see zero missing values for every feature. This suggests that missing values are coded with placeholder labels such as "?" and "99999" instead of NaN. To get a better overview, generate summary statistics of `df`. Hint: Use the `describe()` method with `include=np.number` and `include=object`. (Older tutorials use `include=np.object`, but that alias was deprecated in NumPy 1.20 and removed in NumPy 1.24.) Make sure you have Python 3.6+ for this to work!

In [8]:
df.describe(include=np.number).round(2)

|  | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 |
| mean | 38.64 | 189664.13 | 10.08 | 1079.07 | 87.5 | 40.42 |
| std | 13.71 | 105604.03 | 2.57 | 7452.02 | 403.0 | 12.39 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117550.5 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178144.5 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237642.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1490400.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


In [10]:
df.describe(include=object)


|  | workclass | education | marital-status | occupation | relationship | race | sex | native-country | income |
|---|---|---|---|---|---|---|---|---|---|
| count | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 |
| unique | 9 | 16 | 7 | 15 | 6 | 5 | 2 | 42 | 4 |
| top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States | <=50K |
| freq | 33906 | 15784 | 22379 | 6172 | 19716 | 41762 | 32650 | 43832 | 24720 |
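The summary statistics hint at placeholders (for instance, the `capital-gain` maximum of 99999), but they do not count them. A minimal sketch of counting the "?" and 99999 placeholders directly is shown below. It runs on a tiny hypothetical frame rather than the real data, and it assumes the text values may carry surrounding whitespace, as the raw labels in this dataset do.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical rows); the raw values carry a leading space.
sample = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" ?", " ?", " Sales"],
    "capital-gain": [0, 99999, 2174],
})

# isnull() reports nothing because "?" is an ordinary string, not NaN.
print(sample.isnull().sum())

# Count "?" per text column after stripping surrounding whitespace.
text_cols = sample.select_dtypes(exclude="number")
question_marks = text_cols.apply(lambda s: s.str.strip().eq("?").sum())
print(question_marks)

# Count the 99999 placeholder in capital-gain.
print((sample["capital-gain"] == 99999).sum())
```

Replacing `sample` with `df` would give the per-feature counts for the full dataset.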


**Exercise 4**

In Exercise 3, you can see that the target feature `income` has four unique values. This contradicts the definition of `income`, which should have only two labels: "<=50K" and ">50K". In this exercise, return the unique values of `income`.

In [15]:
df['income'].unique()

array([' <=50K', ' >50K', ' <=50K.', ' >50K.'], dtype=object)
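The four labels differ only by a leading space and, for the rows that came from the test file, a trailing period. One possible way to collapse them into two labels is sketched below on a stand-in Series. This is an illustrative cleanup under that assumption, not necessarily the approach the remaining exercises take.

```python
import pandas as pd

# The four raw labels returned above.
income = pd.Series([" <=50K", " >50K", " <=50K.", " >50K."])

# Strip surrounding whitespace, then drop the trailing period if present.
cleaned = income.str.strip().str.rstrip(".")
print(cleaned.unique())
```

Applied to the real data, the same chain would be `df['income'].str.strip().str.rstrip('.')`.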