# Data Preprocessing Example

In this example, we show how to preprocess data and convert ordinal or nominal data to numerical data. We use the following data set:

https://archive.ics.uci.edu/ml/datasets/Adult

The data was extracted by Barry Becker from the 1994 Census database. The prediction task is to determine whether a person makes over 50K a year.

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

## Loading Data

Note that you can use the **na_values** parameter to convert specific strings (such as ?, #NA#, etc.) to NA values. You can also use a regular-expression delimiter to "eat" the extra spaces in the data file.

In [None]:
df = pd.read_csv('https://archive.ics.uci.edu'
                 '/ml/machine-learning-databases/adult'
                 '/adult.data', header=None, 
                 delimiter=r",\s*", na_values='?', engine='python')

df.info()
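
To see what the two options do without downloading the file, here is a minimal sketch on a made-up three-row sample in the same layout as the data file (a comma followed by a space, and `?` for unknown values). The sample values and the names `sample`, `raw` and `demo` are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as adult.data.
sample = "39, State-gov, 77516\n50, ?, 83311\n38, Private, 215646\n"

# Plain comma delimiter: the leading space stays in each text field,
# so ' ?' does not match na_values='?'.
raw = pd.read_csv(io.StringIO(sample), header=None, na_values="?")

# Regex delimiter: the comma and any following spaces are consumed together,
# so the field is exactly '?' and is converted to NaN.
demo = pd.read_csv(io.StringIO(sample), header=None,
                   delimiter=r",\s*", na_values="?", engine="python")

print(raw.isna().sum().sum())   # missing values found with the plain delimiter
print(demo.isna().sum().sum())  # missing values found with the regex delimiter
print(demo)
```

The regex delimiter requires `engine='python'`, which is slower than the default C parser but is not a concern for a file of this size.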

In [None]:
df.isna().sum()

From the file *adult.names*, we can get information about each column:

target (income class): >50K, <=50K.

age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 
7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

## Add column names

This step is not required, but it helps in understanding what each column is.

In [None]:
df.columns=['age', 'workclass', 'fnlwgt', 'education', 'education_num','marital-status',
           'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
           'hours_per_week', 'native_country', 'target']

In [None]:
df.head()

## Check how much data is missing

This step is to check how much data is missing and how to handle missing data. Generally, missing data can be handled by three different strategies:

1. Remove samples (rows) with missing data.
2. Remove features (columns) that are missing too much data.
3. Impute data. 

The strategy to use depends on how much data is missing, how important the feature is, and whether imputation is feasible.

In [None]:
print(df.isna().sum())
print(df.shape)

Now we can see that workclass, occupation and native_country are each missing only a small fraction of their values (roughly 5% or less), so method 1 (removing rows) is probably the better choice. We can do this with the following code:

In [None]:
df1 = df.dropna(axis=0)
df1.shape

If you prefer the second method, you can use the following code instead:

In [None]:
# Drop features (columns) that are missing more than 1800 values.
df2 = df.dropna(axis=1, thresh=df.shape[0]-1800)
# Drop samples (rows) that still have missing data.
df2 = df2.dropna(axis=0)
df2.shape
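
The third strategy, imputation, is not used in this example. As a minimal sketch of one common variant, the code below fills a missing category with the most frequent value of its column. The toy frame and the names `toy`, `most_frequent` and `toy_imputed` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in a categorical column.
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "age": [39, 50, 38, 53],
})

# The most frequent category (the mode) of the column.
most_frequent = toy["workclass"].mode()[0]

# Fill only the chosen column; other columns are left untouched.
toy_imputed = toy.fillna({"workclass": most_frequent})

print(toy_imputed)
```

Whether mode imputation is appropriate depends on the feature; for a numerical column the mean or median is a more usual choice.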

Now we analyze each column:

1. Age: Numerical
2. Workclass: Nominal
3. fnlwgt (Final Weight): Numerical
4. Education: Ordinal
5. Education-num: Numerical
6. Marital-status: Nominal
7. Occupation: Nominal
8. Relationship: Nominal
9. Race: Nominal
10. Sex: Binary
11. Capital-gain: Numerical
12. Capital-loss: Numerical
13. Hours-per-week: Numerical
14. Native-country: Nominal
15. Target: Nominal (binary)

All these features will be used.

## Process ordinal data

In [None]:
df1.head()

In [None]:
# Check all possible values for education.
np.unique(df1['education'])

In [None]:
# Disable pandas' SettingWithCopyWarning for the assignments below. Optional.
pd.options.mode.chained_assignment = None  # default='warn'

In [None]:
#map the values:
edu_mapping = {'Preschool':0, '1st-4th':1, '5th-6th':2,'7th-8th':3, '9th':4, 
              '10th':5, '11th':6, '12th':7, 'HS-grad':8, 'Some-college':9,
              'Assoc-voc':10,'Assoc-acdm':11,  'Bachelors':12, 'Masters':13,
               'Doctorate':14, 'Prof-school':15}
df1['education']=df1['education'].map(edu_mapping)
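
One thing to watch with `map` is that any value missing from the dictionary silently becomes NaN. The sketch below shows this on a made-up Series, using a hypothetical mapping that deliberately omits one level, and also shows how an inverse dictionary recovers the original labels.

```python
import pandas as pd

# Hypothetical mapping that deliberately omits 'Masters'.
partial_mapping = {"HS-grad": 0, "Bachelors": 1}
levels = pd.Series(["HS-grad", "Masters", "Bachelors"])

encoded = levels.map(partial_mapping)
print(encoded)               # the unmapped 'Masters' entry is NaN
print(encoded.isna().sum())  # number of values the mapping did not cover

# Inverse mapping to go from codes back to labels.
inverse_mapping = {code: label for label, code in partial_mapping.items()}
recovered = encoded.dropna().astype(int).map(inverse_mapping)
print(recovered)
```

Checking `isna().sum()` on the mapped column is a cheap way to confirm that the dictionary covers every level.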

In [None]:
df1.head()

In [None]:
# Convert the binary categorical feature sex (build the mapping from df1, the frame being modified).
sex_mapping = {label: idx for idx, label in enumerate(np.unique(df1['sex']))}
df1['sex'] = df1['sex'].map(sex_mapping)
sex_mapping

In [None]:
# Convert the binary categorical target (build the mapping from df1, the frame being modified).
target_mapping = {label: idx for idx, label in enumerate(np.unique(df1['target']))}
df1['target'] = df1['target'].map(target_mapping)
target_mapping
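
The dictionary built from `enumerate(np.unique(...))` assigns codes in sorted label order. scikit-learn's `LabelEncoder` also sorts its classes, so it should produce the same codes. The sketch below compares the two on a made-up array; `labels`, `manual` and `encoded` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["Male", "Female", "Female", "Male"])  # hypothetical sample

# Dictionary approach: np.unique returns the sorted distinct labels.
mapping = {label: idx for idx, label in enumerate(np.unique(labels))}
manual = np.array([mapping[label] for label in labels])

# LabelEncoder stores the sorted classes in classes_ and encodes by position.
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)
print(manual)
print(encoded)
```

`LabelEncoder` additionally offers `inverse_transform` for converting codes back to labels.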

In [None]:
df1.head()

## Encode nominal columns

In [None]:
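# One-hot encoding of the remaining nominal columns with the pandas get_dummies method.
# dtype=int keeps the indicator columns as 0/1 integers (newer pandas versions default to bool).
df3 = pd.get_dummies(df1.iloc[:, :-1], drop_first=True, dtype=int)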
df3.head()

In [None]:
df3.shape
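
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame. With three categories, full one-hot encoding creates three indicator columns, while `drop_first=True` keeps two, because the dropped category is implied when the others are all 0. The `dtype=int` argument is used here only so that the indicators print as 0/1.

```python
import pandas as pd

# Hypothetical toy frame: one numerical and one nominal column.
toy = pd.DataFrame({"age": [39, 50, 38],
                    "race": ["White", "Black", "Other"]})

# Numerical columns pass through unchanged; only object columns are encoded.
full = pd.get_dummies(toy, dtype=int)
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced)
```

Dropping one level avoids perfectly collinear columns, which matters for linear models such as logistic regression.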

## Get X and y as NumPy arrays

In [None]:
X = df3.values
y = df1.iloc[:, -1].values

In [None]:
print(X[:500:100, :10])

## Split data into training and test data sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, 
                     test_size=0.25,
                     stratify=y,
                     random_state=1)

In [None]:
print(X_train.shape)
print(X_test.shape)
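
The `stratify=y` argument asks `train_test_split` to keep the class proportions of `y` roughly equal in both subsets. A minimal sketch on made-up imbalanced labels (`X_demo` and `y_demo` are illustrative, not the census data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75% class 0, 25% class 1.
y_demo = np.array([0] * 300 + [1] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=1)

# Fraction of class 1 in the full set, the training set and the test set.
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

Without stratification, the class ratio in a small test set can drift noticeably from the ratio in the full data.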

### Check how many samples are in each class

In [None]:
print(np.unique(y)) 

In [None]:
print(np.sum(y==0))
print(np.sum(y==1))

If there are many classes, you can also use the following code to get all counts.

In [None]:
{i:np.sum(y==i) for i in np.unique(y)}
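
An alternative sketch that avoids the explicit loop is `np.unique` with `return_counts=True`, shown here on a made-up label array:

```python
import numpy as np

y_demo = np.array([0, 1, 0, 0, 1, 0])  # hypothetical labels

# classes holds the sorted distinct labels, counts the matching frequencies.
classes, counts = np.unique(y_demo, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))

print(class_counts)
```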

## scikit-learn

We use the scikit-learn library to create and train the model.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

### Normalize the data

We usually need to standardize the data before training the model. The normalization parameters (mean and standard deviation) must be calculated from the training data only. The same parameters are then applied to the test data when we evaluate the model, so that no information from the test set leaks into training.

In [None]:
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
#scaler.fit(X_train)
#X_train_std=scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

In [None]:
print(X_train[:5,:10])
print(X_train_std[:5,:10])
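
As a check on what `StandardScaler` does, the sketch below fits a scaler on a tiny made-up training array and compares `transform` on a test row with the manual formula `(x - mean) / std`, where both statistics come from the training array. The arrays `X_tr` and `X_te` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test arrays.
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te = np.array([[4.0, 40.0]])

# fit() learns the per-column mean and standard deviation from the training data.
scaler = StandardScaler().fit(X_tr)

# Manual standardization of the test row with the training statistics
# (np.std uses the population standard deviation by default, as the scaler does).
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(scaler.mean_, scaler.scale_)
print(scaler.transform(X_te))
print(manual)
```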

Now we create a logistic regression model. You can see what parameters can be used by checking the help page.

In [None]:
#?LogisticRegression

In [None]:
# multi_class is omitted: 'auto' was its default, and the parameter is deprecated in recent scikit-learn releases.
logistic_model = LogisticRegression(solver='lbfgs',
                                    tol=1e-4, max_iter=4000, C=1)

The most important parameter here is C, the inverse of the regularization strength. Smaller values specify stronger regularization, which reduces overfitting but can increase underfitting.

In [None]:
logistic_model.fit(X_train_std, y_train)
print("Training set score: %f" % logistic_model.score(X_train_std, y_train))
print("Testing set score: %f" % logistic_model.score(X_test_std, y_test))
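
To illustrate how C controls regularization, the sketch below trains the same model with three values of C on a synthetic data set (generated with `make_classification`, not the census data) and records the norm of the learned coefficients. Stronger regularization (smaller C) should shrink the coefficients toward zero. The C values chosen are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=4000).fit(X_demo, y_demo)
    # Euclidean norm of the weight vector learned for this value of C.
    coef_norms[C] = float(np.linalg.norm(model.coef_))

print(coef_norms)
```

In practice, C is usually chosen by comparing validation scores over a range of values rather than being fixed in advance.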

In [None]:
y_test_pred = logistic_model.predict(X_test_std)

In [None]:
# True labels of the test samples that the model misclassified.
y_test_diff = y_test[y_test_pred!=y_test]
print(y_test_diff)
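
A confusion matrix summarizes the same misclassifications per class. The sketch below uses made-up label arrays rather than the model output above. It first locates the misclassified positions, then counts each (true, predicted) combination with scikit-learn's `confusion_matrix`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1])

# Positions where the prediction differs from the true label.
misclassified_idx = np.flatnonzero(y_pred != y_true)
print(misclassified_idx)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```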