#  Preprocessing for numerical features

> Notes on the second part of the predictive modeling pipeline module from the sklearn MOOC.

- toc:true
- branch: master
- badges: true
- author: Pratik Kumar
- use_plotly: true
- categories: [Python, sklearn, Data Visualization]

## Introduction 

In this post we continue with <b>The Predictive Modeling Pipeline</b> module of the [Scikit-learn MOOC](https://inria.github.io/scikit-learn-mooc/index.html). The post discusses the preprocessing of numerical data. We will use a scikit-learn pipeline to preprocess the data so that it can be used for training the model.

The topics covered in this subsection are as follows:

(C) Data Preparation <br>
(D) Model Fitting and Preprocessing



## (C) Data Preparation 
---

We will be using the full adult census dataset. Specifically, we will use the numerical columns of the dataframe, i.e. the columns whose values have a numerical data type.

In [135]:
#collapse
import pandas as pd
from sklearn import set_config
set_config(display='diagram')
data = pd.read_csv("data/adult-census.csv")
display(data.iloc[:,:4].head(),data.iloc[:,4:8].head(),data.iloc[:,8:].head())

| | age | workclass | fnlwgt | education |
|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th |
| 1 | 38 | Private | 89814 | HS-grad |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm |
| 3 | 44 | Private | 160323 | Some-college |
| 4 | 18 | ? | 103497 | Some-college |


| | education-num | marital-status | occupation | relationship |
|---|---|---|---|---|
| 0 | 7 | Never-married | Machine-op-inspct | Own-child |
| 1 | 9 | Married-civ-spouse | Farming-fishing | Husband |
| 2 | 12 | Married-civ-spouse | Protective-serv | Husband |
| 3 | 10 | Married-civ-spouse | Machine-op-inspct | Husband |
| 4 | 10 | Never-married | ? | Own-child |


| | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|
| 0 | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [136]:
numerical_columns = [var for var in data.columns if data[var].dtype!='O']
print(" Numerical columns : ", len(numerical_columns),"\n Columns: ",numerical_columns)

 Numerical columns :  6 
 Columns:  ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
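
As an alternative to checking each column's dtype by hand, pandas offers `select_dtypes`. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in for the census data) instead of the real CSV, so the column list it prints is only illustrative.

```python
import pandas as pd

# Hypothetical miniature of the census data, for illustration only.
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "hours-per-week": [40, 50, 40],
})

# Keep only the columns whose dtype is numeric (integers and floats).
toy_numerical_columns = toy.select_dtypes(include="number").columns.tolist()
print(toy_numerical_columns)
```

On the real dataframe the same call would be applied to `data` instead of `toy`.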


The adult-census **data** is now split into the training data (all columns except class) and the target (the class column).

In [137]:
train_data = data.drop(columns='class')
target = data["class"]

Let's look at the distribution of the numerical features in the **data** dataframe. The following cell shows the different ranges over which they are spread.

In [138]:
#collapse
data.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 |
| mean | 38.643585 | 189664.1 | 10.078089 | 1079.067626 | 87.502314 | 40.422382 |
| std | 13.71051 | 105604.0 | 2.570973 | 7452.019058 | 403.004552 | 12.391444 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117550.5 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178144.5 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237642.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1490400.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


The summary above shows that the columns span very different ranges (for example, age tops out at 90 while fnlwgt reaches about 1.49 million), so we should *scale* the features (columns).

- A few reasons to scale features are:

    1. Models that rely on the distance between a pair of samples, for instance k-nearest neighbors, should be trained on normalized features to make each feature contribute approximately equally to the distance computations.

    2. Many models such as logistic regression use a numerical solver (based on gradient descent) to find their optimal parameters. This solver converges faster when the features are scaled.
    
- Linear models such as logistic regression generally benefit from scaling the features while other models such as decision trees do not need such preprocessing (but will not suffer from it).
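
The first reason can be made concrete with a small distance calculation. The sketch below uses three made-up samples described by (age, fnlwgt); the values and the `spread` vector (roughly the standard deviations reported by describe() above) are assumptions chosen for illustration.

```python
import numpy as np

# Three hypothetical people described by (age, fnlwgt); values are made up.
a = np.array([25.0, 226802.0])
b = np.array([65.0, 226902.0])  # 40 years older, fnlwgt differs by 100
c = np.array([25.0, 227802.0])  # same age, fnlwgt differs by 1000

# On the raw features, fnlwgt dominates the Euclidean distance.
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)
print(raw_ab, raw_ac)

# Dividing each feature by a typical spread puts both on a comparable scale.
spread = np.array([13.7, 105604.0])
scaled_ab = np.linalg.norm((a - b) / spread)
scaled_ac = np.linalg.norm((a - c) / spread)
print(scaled_ab, scaled_ac)
```

Before scaling, `c` looks much farther from `a` than `b` does, even though `b` differs by 40 years of age. After scaling, the ordering flips and the age difference is no longer drowned out.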

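For these reasons, we will use scikit-learn's StandardScaler. It shifts and rescales each numerical feature so that it has zero mean and unit standard deviation. Note that this only standardizes the location and spread: a skewed feature stays skewed, so the result follows a ***Standard Normal Distribution*** only if the feature was Gaussian to begin with.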

### C.1. Learning the scaling statistics using scaler.fit()

In [139]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train_data[numerical_columns])

The difference between the .fit() method of a model and the .fit() method of the standard scaler (a transformer) is that the former uses both the **data** and the **target**, whereas the latter only needs the **data**. The mechanism of the transformer's .fit() operation can be understood as follows:


![](https://inria.github.io/scikit-learn-mooc/_images/api_diagram-transformer.fit.svg)


In this case, the algorithm needs to compute the mean and standard deviation for each feature and store them into some NumPy arrays. Here, these statistics are the model states. The fact that the model states of this scaler are arrays of means and standard deviations is specific to the StandardScaler. Other scikit-learn transformers will compute different statistics and store them as model states, in the same fashion.
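
To see these stored statistics on something small, here is a minimal sketch on a made-up array (`X_toy` is an assumption, not the census data). It compares the scaler's `mean_` and `scale_` attributes with the column means and population standard deviations computed by NumPy.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 3 samples, 2 features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 600.0]])

# fit() only needs the data; no target is passed.
toy_scaler = StandardScaler().fit(X_toy)

print(toy_scaler.mean_)       # per-column mean
print(toy_scaler.scale_)      # per-column standard deviation (ddof=0)
print(X_toy.mean(axis=0))     # same values, computed with NumPy
print(X_toy.std(axis=0))      # NumPy's default is also ddof=0
```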


- Standard Scaling resources : 
    - [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
    - [MLMastery : StandardScaler, MinMaxScaler](https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/)

In [140]:
#collapse
print(" Mean : ",scaler.mean_,"\n\n Scaling : ",scaler.scale_)

 Mean :  [3.86435854e+01 1.89664135e+05 1.00780885e+01 1.07906763e+03
 8.75023136e+01 4.04223824e+01] 

 Scaling :  [1.37103696e+01 1.05602944e+05 2.57094644e+00 7.45194277e+03
 4.03000427e+02 1.23913172e+01]


### C.2. Data Transformation using .transform() method

In [141]:
transformed_data = scaler.transform(train_data[numerical_columns])
transformed_data

array([[-0.99512893,  0.35167453, -1.19725891, -0.14480353, -0.2171271 ,
        -0.03408696],
       [-0.04694151, -0.94552415, -0.41933527, -0.14480353, -0.2171271 ,
         0.77292975],
       [-0.77631645,  1.3947231 ,  0.74755018, -0.14480353, -0.2171271 ,
        -0.03408696],
       ...,
       [ 1.41180837, -0.35751025, -0.41933527, -0.14480353, -0.2171271 ,
        -0.03408696],
       [-1.21394141,  0.11198424, -0.41933527, -0.14480353, -0.2171271 ,
        -1.64812038],
       [ 0.97418341,  0.93049361, -0.41933527,  1.87131501, -0.2171271 ,
        -0.03408696]])

The .transform() method works much like .predict(): it applies what was learned during .fit() to the input, but it outputs **transformed data** instead of predictions.

![](https://inria.github.io/scikit-learn-mooc/_images/api_diagram-transformer.transform.svg)
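
For StandardScaler, the transformation amounts to subtracting the stored mean and dividing by the stored scale. The sketch below checks this by hand on a made-up array (`X_toy` is an assumption for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 3 samples, 2 features.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 600.0]])

toy_scaler = StandardScaler().fit(X_toy)

# Manual standardization using the statistics stored by fit().
manual = (X_toy - toy_scaler.mean_) / toy_scaler.scale_

# The scaler's own transform should give the same numbers.
from_scaler = toy_scaler.transform(X_toy)
print(np.allclose(manual, from_scaler))
```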

### C.3. The fit + transform method for scaling data

We can also call .fit_transform() directly, which combines the .fit() and .transform() steps in a single call.

![](https://inria.github.io/scikit-learn-mooc/_images/api_diagram-transformer.fit_transform.svg)
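
A quick way to convince yourself of this equivalence is to run both variants on the same input. The array `X_toy` below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 4 samples, 2 features.
X_toy = np.array([[17.0, 1.0],
                  [28.0, 40.0],
                  [48.0, 45.0],
                  [90.0, 99.0]])

# Two explicit steps: learn the statistics, then apply them.
two_step = StandardScaler().fit(X_toy).transform(X_toy)

# One combined call on the same data.
one_step = StandardScaler().fit_transform(X_toy)

print(np.allclose(two_step, one_step))
```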

In [142]:
data_train_scaled = scaler.fit_transform(train_data[numerical_columns])
data_train_scaled

array([[-0.99512893,  0.35167453, -1.19725891, -0.14480353, -0.2171271 ,
        -0.03408696],
       [-0.04694151, -0.94552415, -0.41933527, -0.14480353, -0.2171271 ,
         0.77292975],
       [-0.77631645,  1.3947231 ,  0.74755018, -0.14480353, -0.2171271 ,
        -0.03408696],
       ...,
       [ 1.41180837, -0.35751025, -0.41933527, -0.14480353, -0.2171271 ,
        -0.03408696],
       [-1.21394141,  0.11198424, -0.41933527, -0.14480353, -0.2171271 ,
        -1.64812038],
       [ 0.97418341,  0.93049361, -0.41933527,  1.87131501, -0.2171271 ,
        -0.03408696]])

### C.4. Effect of scaling on data distribution

StandardScaler does not change the structure of the data itself; only the axes are shifted and scaled. The following plot shows the distribution of the original training data (before scaling).

In [143]:
#collapse
import plotly.express as px
num_points_to_plot = 300
fig = px.scatter(train_data[:num_points_to_plot], x="age", y="hours-per-week", marginal_x="histogram", marginal_y="histogram")
fig.show()

**Scaled training data** : 

In [144]:
#collapse
data_train_scaled = pd.DataFrame(data_train_scaled, columns=train_data[numerical_columns].columns)
data_train_scaled.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 |
| mean | 1.584958e-16 | -4.742349e-17 | 1.594573e-17 | 2.294458e-16 | 7.617582e-17 | 9.071110e-17 |
| std | 1.00001 | 1.00001 | 1.00001 | 1.00001 | 1.00001 | 1.00001 |
| min | -1.578629 | -1.67968 | -3.53103 | -0.1448035 | -0.2171271 | -3.181452 |
| 25% | -0.7763164 | -0.6828752 | -0.4193353 | -0.1448035 | -0.2171271 | -0.03408696 |
| 50% | -0.119879 | -0.1090844 | -0.03037346 | -0.1448035 | -0.2171271 | -0.03408696 |
| 75% | 0.6824334 | 0.4543232 | 0.7475502 | -0.1448035 | -0.2171271 | 0.3694214 |
| max | 3.745808 | 12.31723 | 2.303397 | 13.27438 | 10.59179 | 4.727312 |
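
The std row in the summary above reads 1.00001 rather than exactly 1. A likely explanation is that StandardScaler divides by the population standard deviation (ddof=0), while pandas' describe() reports the sample standard deviation (ddof=1). The two differ by a factor of sqrt(n / (n - 1)), which is about 1.00001 for 48842 rows. The sketch below checks this relationship on made-up data; the array `values` and its size are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 1000
rng = np.random.default_rng(0)
values = rng.normal(loc=40, scale=12, size=(n, 1))  # made-up feature

scaled = StandardScaler().fit_transform(values)

population_std = scaled[:, 0].std(ddof=0)    # what the scaler normalizes to 1
sample_std = pd.Series(scaled[:, 0]).std()   # pandas default is ddof=1

print(population_std, sample_std, np.sqrt(n / (n - 1)))
```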


The plot below shows the distribution of the scaled data. Notice that the ranges on the axes change, but the shape of the distribution does not.

In [145]:
#collapse
fig = px.scatter(data_train_scaled[:num_points_to_plot], x="age", y="hours-per-week", marginal_x="histogram", marginal_y="histogram")
fig.show()
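
The same point can be checked numerically: standardization shifts and rescales each column separately, so the correlation between two features should be unchanged. The sketch below uses made-up columns named `age` and `hours` (assumptions, not the census data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up two-column data with some correlation between the columns.
age = rng.normal(40, 12, size=500)
hours = 0.3 * age + rng.normal(30, 10, size=500)
X_toy = np.column_stack([age, hours])

X_scaled = StandardScaler().fit_transform(X_toy)

# Pearson correlation between the two columns, before and after scaling.
corr_before = np.corrcoef(X_toy, rowvar=False)[0, 1]
corr_after = np.corrcoef(X_scaled, rowvar=False)[0, 1]
print(corr_before, corr_after)
```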

## (D) Model fitting and preprocessing
---
 
We will use one of sklearn's linear models: Logistic Regression. We will also look at what the pipeline does when its .fit() method is called. But first, we need to split our dataset into train and test data.

In [146]:
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(train_data[numerical_columns], target, random_state=42)

In [147]:
#collapse
import time
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

### D.1. Standard Scaling + Logistic Regression

In [148]:
model_sc = make_pipeline(StandardScaler(), LogisticRegression())

start1 = time.time()
model_sc.fit(data_train, target_train)
elapsed_time_with_scaling = time.time() - start1

model_sc
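
Calling .fit() on such a pipeline can be pictured as fitting and applying the scaler on the training data and then fitting the classifier on the scaled result. At prediction time, the pipeline applies the already-fitted scaler before calling the classifier. The sketch below spells out these steps by hand and compares them with a pipeline; it uses synthetic data from make_classification as a stand-in for the census data, and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census data (illustration only).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual version: the scaler learns its statistics from the training set only.
manual_scaler = StandardScaler()
manual_model = LogisticRegression()
manual_model.fit(manual_scaler.fit_transform(X_train), y_train)
manual_pred = manual_model.predict(manual_scaler.transform(X_test))

# Pipeline version: the same two steps behind a single fit/predict call.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

print(np.array_equal(manual_pred, pipe_pred))
```

The pipeline form is less error-prone because the scaler is never fitted on the test data by accident.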

### D.2. Only Logistic Regression

In [149]:
model_lr = LogisticRegression()

start2 = time.time()
model_lr.fit(data_train, target_train)
elapsed_time_without_scaling = time.time() - start2

model_lr

In [150]:
#collapse
print(f"{model_sc.__class__.__name__} iterations : {model_sc[-1].n_iter_[0]}\n"
      f"{model_lr.__class__.__name__} iterations : {model_lr.n_iter_[0]}")

Pipeline iterations : 14
LogisticRegression iterations : 55


In [151]:
#collapse
print(f"The elapsed time for {model_sc.__class__.__name__} : {elapsed_time_with_scaling} \n"
      f"The elapsed time for {model_lr.__class__.__name__} : {elapsed_time_without_scaling}")

The elapsed time for Pipeline : 0.23711895942687988 
The elapsed time for LogisticRegression : 0.6473968029022217


In [152]:
#collapse
score1 = model_sc.score(data_test, target_test)
score2 = model_lr.score(data_test, target_test)

print(f"{model_sc.__class__.__name__} model accuracy : {score1} \n"
      f"{model_lr.__class__.__name__} model accuracy : {score2}")

Pipeline model accuracy : 0.8187699615101138 
LogisticRegression model accuracy : 0.8037834739169601


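In this run, the pipeline with scaling converges in fewer iterations (14 versus 55), fits faster (about 0.24 s versus 0.65 s), and reaches a slightly higher test accuracy (about 0.819 versus 0.804). Hence, scaling is good practice for a linear model like Logistic Regression, although it is not necessarily beneficial for all other models.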


- References : 

    - [Scikit-learn MOOC](https://inria.github.io/scikit-learn-mooc/index.html)

# Thank you!