# Data Preprocessing and Feature Exploration using Machine Learning

### Contents

1. Goal
2. Approach
3. Data Loading
4. Data Cleaning
    - Dealing with data types
    - Handling missing data
5. Data Exploration
    - Outlier detection
    - Plotting distributions
6. Feature Engineering
    - Interaction between features
    - Dimensionality reduction using PCA
7. Feature Selection and Model Building
    - Compare model performance on datasets with and without preprocessing

### 1. Goal

* Build a binary classification model to predict whether a person's income exceeds `$50K/yr`, based on census data from the Adult Data Set (https://archive-beta.ics.uci.edu/ml/datasets/adult). (Cite: Adult. (1996). UCI Machine Learning Repository)

* Explore effective pre-modeling steps

* Compare the model performance with and without 
    * data preprocessing
    * data cleaning
    * feature exploration
    * feature engineering
* The `adult.data` file has been edited to add a header row

### 2. Approach

- **Terminology**
    - Input: Independent variables / features / predictors
    - Output: Dependent variable / target variable / prediction
    - Model: It explains the effect that the features have on the target variable

- **Model Building**
    - Split the data randomly into train/test sets
    - Build model on the train set and assess the performance on test set
    - Check performance metrics
        - ROC curve and the area under it (AUC), which are built from:
            - True Positive Rate (TPR)
            - False Positive Rate (FPR)
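
To make the metrics above concrete, here is a minimal NumPy-only sketch. The helper names `tpr_fpr` and `auc_by_ranking` and the four toy scores are my own illustrations, not part of the experiment. AUC is computed through its ranking interpretation: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative.

```python
import numpy as np

def tpr_fpr(y_true, y_score, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted as class 1."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

def auc_by_ranking(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

# Toy labels and scores (hypothetical values for illustration only)
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

tpr, fpr = tpr_fpr(y_true_demo, y_score_demo, threshold=0.5)
auc = auc_by_ranking(y_true_demo, y_score_demo)
print(tpr, fpr, auc)  # 0.5 0.0 0.75
```

Sweeping the threshold from high to low and plotting TPR against FPR traces out the ROC curve; the area under that curve equals the ranking-based value computed here.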

- **Classification Model Types**
    - Logistic Regression
    - Decision Trees
        - Random Forest
        - Gradient Boosted Trees
    - Support Vector Machines
    - Tandem models (combination of multiple models)
    - and so on...

I am going to use Logistic Regression for this experiment.

### 3. Data Loading

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('./data/adult.data')
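
As an alternative to editing the file by hand, the header can be supplied at load time. The sketch below is an assumption about how that could look, not what this notebook does: it reads two sample rows in the raw comma-plus-space format from an in-memory string instead of the real file, and uses `skipinitialspace=True` to drop the space after each comma.

```python
import io
import pandas as pd

# Column names as used throughout this notebook
column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education_num',
    'marital_status', 'occupation', 'relationship', 'race', 'sex',
    'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
    'income',
]

# Two rows in the raw, header-less format (stand-in for the real file)
raw_sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# header=None: the file has no header row; names= supplies one
df_raw = pd.read_csv(
    raw_sample, header=None, names=column_names, skipinitialspace=True
)
print(df_raw.shape)               # (2, 15)
print(df_raw.loc[0, 'workclass'])  # State-gov
```

With the leading spaces removed at load time, later `.strip()` calls become unnecessary. The rest of this notebook keeps the values as loaded from the edited file.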

In [2]:
df.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**Observation**
- The table shows 15 named columns (14 features plus the target `income`), a mix of numerical and categorical variables.

In [3]:
# Let's take a look at the data types of these features
# df.info()

In [4]:
# Have a look at the target variable `income`
df['income'].value_counts()

 <=50K    24720
 >50K      7841
Name: income, dtype: int64

**Observation**
- The target variable `income` contains two values.
- Let's simplify the values and convert them to 1 and 0

In [5]:
# for x in df['income']:
#     print(x)

In [6]:
# Assign outcome as 0 if income <=50K and as 1 if income >50K
df['income'] = [0 if x.strip() == '<=50K' else 1 for x in df['income']]

# Assign X as a DataFrame of features and y as a Series of the outcome variable
X = df.drop('income', axis=1)
y = df.income
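
The list comprehension above maps every value other than `<=50K` to 1, including any unexpected label. A slightly stricter variant, sketched here on a small stand-in Series (the names `income_raw` and `label_map` are hypothetical), uses an explicit mapping so that unknown labels turn into `NaN` and can be caught.

```python
import pandas as pd

# Stand-in for df['income'] as loaded, including the leading space
income_raw = pd.Series([' <=50K', ' >50K', ' <=50K'])

# Explicit mapping: anything not listed here becomes NaN
label_map = {'<=50K': 0, '>50K': 1}
income_encoded = income_raw.str.strip().map(label_map)

# Count labels that were not covered by the mapping
n_unmapped = int(income_encoded.isna().sum())
print(income_encoded.tolist(), n_unmapped)  # [0, 1, 0] 0
```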

In [7]:
X.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


In [8]:
print(y.value_counts())

0    24720
1     7841
Name: income, dtype: int64


### 4. Data Cleaning

**A. Dealing with data types**
- There are three main data types:
    - Numeric: e.g. income, age
    - Categorical: e.g. sex, education
    - Ordinal: e.g. low/medium/high

- Since most machine learning algorithms can handle only numeric features, we must convert categorical and ordinal features into numeric features
    - Create dummy features
    - Transform a categorical feature into a set of dummy features, each representing a unique category
    - In the set of dummy features, 1 indicates that the observation belongs to that category
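
Ordinal features are a special case: their categories have a natural order, so one integer column can preserve that order instead of a set of dummies. The following is a minimal sketch on a hypothetical `low/medium/high` Series; the Adult data in this notebook is not encoded this way.

```python
import pandas as pd

# Hypothetical ordinal feature
levels = pd.Series(['low', 'high', 'medium', 'low'])

# Declare the order explicitly; codes follow the position in `categories`
ordered = pd.Categorical(
    levels, categories=['low', 'medium', 'high'], ordered=True
)
level_codes = pd.Series(ordered.codes, index=levels.index)
print(level_codes.tolist())  # [0, 2, 1, 0]
```

Integer codes imply equal spacing between levels, which may or may not suit a given model, so dummy features remain a reasonable fallback.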

In [9]:
# Education is a categorical feature
X['education'].head(5)

0     Bachelors
1     Bachelors
2       HS-grad
3          11th
4     Bachelors
Name: education, dtype: object

In [10]:
# Use get_dummies in pandas
# Another option is OneHotEncoder in scikit-learn
pd.get_dummies(X['education']).head(5)

Unnamed: 0,10th,11th,12th,1st-4th,5th-6th,7th-8th,9th,Assoc-acdm,Assoc-voc,Bachelors,Doctorate,HS-grad,Masters,Preschool,Prof-school,Some-college
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


It isn't always beneficial to convert every categorical variable into dummy features: a variable with many categories adds many mostly-zero columns, which makes the data sparse. It's therefore a good idea to use only a few categorical variables, or to reduce the number of categories first.
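
A toy example shows how quickly sparsity grows. The frame below is made up for illustration: one column has six distinct values and the other has two.

```python
import pandas as pd

# Hypothetical data: 'country' is high-cardinality, 'sex' is not
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E', 'F'],
    'sex': ['M', 'F', 'M', 'F', 'M', 'F'],
})

toy_dummies = pd.get_dummies(toy)
n_rows, n_cols = toy_dummies.shape

# Each row has exactly one 1 per original column; everything else is 0
zero_fraction = 1 - toy_dummies.to_numpy().sum() / toy_dummies.size
print(n_cols, zero_fraction)  # 8 0.75
```

Two original columns become eight, and three quarters of the cells are zeros. Six of those eight columns come from `country` alone.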

In [11]:
# Decide which categorical variables you want to use in the model

for col_name in X.columns:
    if X[col_name].dtypes == 'object':
        unique_cat = len(X[col_name].unique())
        print(f"Feature '{col_name}' has {unique_cat} unique categories")


Feature 'workclass' has 9 unique categories
Feature 'education' has 16 unique categories
Feature 'marital_status' has 7 unique categories
Feature 'occupation' has 15 unique categories
Feature 'relationship' has 6 unique categories
Feature 'race' has 5 unique categories
Feature 'sex' has 2 unique categories
Feature 'native_country' has 42 unique categories
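
These counts give a quick estimate of how many dummy columns a full encoding would create. The sketch below just adds up the numbers printed above; collapsing `native_country` to two levels is a hypothetical scenario used for comparison.

```python
# Unique category counts copied from the output above
unique_counts = {
    'workclass': 9, 'education': 16, 'marital_status': 7, 'occupation': 15,
    'relationship': 6, 'race': 5, 'sex': 2, 'native_country': 42,
}

# One dummy column per category if every feature is encoded as-is
total_dummies = sum(unique_counts.values())

# Hypothetical: native_country reduced to two levels before encoding
reduced_counts = dict(unique_counts, native_country=2)
total_reduced = sum(reduced_counts.values())

print(total_dummies, total_reduced)  # 102 62
```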


In [12]:
# Although 'native_country' has a lot of unique categories, most categories have only a few observations
print(X['native_country'].value_counts().sort_values(ascending=False).head(10))

 United-States    29170
 Mexico             643
 ?                  583
 Philippines        198
 Germany            137
 Canada             121
 Puerto-Rico        114
 El-Salvador        106
 India              100
 Cuba                95
Name: native_country, dtype: int64


In [13]:
# In this case, bucket the low frequency categories as "Other"

X['native_country'] = ['United-States' if x.strip() == 'United-States' else 'Other' for x in X['native_country']]
X['native_country'].value_counts().sort_values(ascending=False)





United-States    29170
Other             3391
Name: native_country, dtype: int64
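
The same idea can be generalized with a frequency threshold instead of a hard-coded category. The helper below is a sketch: `bucket_rare` and its `min_count` parameter are hypothetical names, and it is demonstrated on a small stand-in Series rather than on `X`.

```python
import pandas as pd

def bucket_rare(series, min_count, other_label='Other'):
    """Replace categories seen fewer than min_count times with other_label."""
    cleaned = series.str.strip()
    counts = cleaned.value_counts()
    keep = counts[counts >= min_count].index
    # Keep frequent categories, overwrite the rest
    return cleaned.where(cleaned.isin(keep), other_label)

# Stand-in values with the same leading space as the loaded data
demo = pd.Series([' United-States'] * 5 + [' Mexico'] * 2 + [' Cuba'])
bucketed = bucket_rare(demo, min_count=2)
print(bucketed.value_counts().to_dict())
# {'United-States': 5, 'Mexico': 2, 'Other': 1}
```

A threshold keeps mid-sized categories such as Mexico separate. Whether that helps depends on the model, so treat `min_count` as a tuning choice.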

In [14]:
# Create a list of features to dummy
todummy_list = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']


In [15]:
# Function to dummy all the categorical variables used for modeling
def dummy_df(df, todummy_list):
    for x in todummy_list:
        dummies = pd.get_dummies(df[x], prefix=x, dummy_na=False)
        df = df.drop(x, axis=1)
        df = pd.concat([df, dummies], axis=1)
    return df


In [16]:
X = dummy_df(X, todummy_list)
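
As far as I can tell, `pd.get_dummies` produces the same layout in one call when given a `columns` list: the untouched columns come first, followed by the dummies for each listed column, prefixed with its name. Here is a short sketch on a three-row stand-in frame (`toy_X` and `toy_cols` are hypothetical names).

```python
import pandas as pd

# Stand-in for X with one numeric and two categorical columns
toy_X = pd.DataFrame({
    'age': [39, 50, 38],
    'sex': [' Male', ' Male', ' Female'],
    'race': [' White', ' Black', ' White'],
})
toy_cols = ['sex', 'race']

# Encode only the listed columns; the prefix defaults to the column name
toy_encoded = pd.get_dummies(toy_X, columns=toy_cols, dummy_na=False)
print(toy_encoded.columns.tolist())
# ['age', 'sex_ Female', 'sex_ Male', 'race_ Black', 'race_ White']
```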

In [17]:
X.head(5)

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,relationship_ Wife,race_ Amer-Indian-Eskimo,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,sex_ Female,sex_ Male,native_country_Other,native_country_United-States
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,1,0,1,0,1
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,1,0,1,0,1
2,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,1,0,1
3,53,234721,7,0,0,40,0,0,0,0,...,0,0,0,1,0,0,0,1,0,1
4,28,338409,13,0,0,40,0,0,0,0,...,1,0,0,1,0,0,1,0,1,0


**B. Handling Missing Data**