## Data Preprocessing for building good machine learning models


We explore a technique for feature selection that helps reduce the dimensionality of the dataset.


Our data consists of 178 samples of wine that are the result of a chemical analysis of 3 types of wine.

There are 13 features variables included in the dataset that describe their different chemical properties found in the wine samples:


#### Data features are (in order):

 - Alcohol
 
 - Malic acid

 - Ash

 - Alcalinity of ash

 - Magnesium

 - Total phenols

 - Flavanoids

 - Nonflavanoid phenols

 - Proanthocyanins

 - Color intensity
  
 - Hue
   
 - OD280/OD315 of diluted wines
   
 - Proline



#### There are 3 types of wine:

- Class label variable




*Data Source is available at:*

https://archive.ics.uci.edu/ml/datasets/Wine




In [6]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use("fivethirtyeight")

import warnings
warnings.filterwarnings('ignore')

In [11]:
#Load data:

df = pd.read_csv('C:/Users/uknow/Desktop/wine.data', header=None)
df.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash','Alcalinity of ash', 'Magnesium','Total phenols', 'Flavanoids', 'Nonflavanoid phenols','Proanthocyanins',
'Color intensity', 'Hue','OD280/OD315 of diluted wines','Proline']

df.head()

Unnamed: 0,Class label,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [15]:
df.shape


(178, 14)

In [16]:
df.dtypes


Class label                       int64
Alcohol                         float64
Malic acid                      float64
Ash                             float64
Alcalinity of ash               float64
Magnesium                         int64
Total phenols                   float64
Flavanoids                      float64
Nonflavanoid phenols            float64
Proanthocyanins                 float64
Color intensity                 float64
Hue                             float64
OD280/OD315 of diluted wines    float64
Proline                           int64
dtype: object

In [22]:
#the number of missing values per column

df.isnull().sum()



Class label                     0
Alcohol                         0
Malic acid                      0
Ash                             0
Alcalinity of ash               0
Magnesium                       0
Total phenols                   0
Flavanoids                      0
Nonflavanoid phenols            0
Proanthocyanins                 0
Color intensity                 0
Hue                             0
OD280/OD315 of diluted wines    0
Proline                         0
dtype: int64

In [24]:
# There are 3 class labels

df["Class label"].nunique()

3

There are 178 samples samples that belong to one of three different classes, 1, 2, and 3 which refer to the three different types of wine.

A convenient way to randomly partition this dataset into testing and training subsets is using train_test_split.



In [27]:
from sklearn.model_selection import train_test_split

In [31]:
X, y = df.iloc[:,1:].values, df.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=0)

In [41]:
# We used 70%/30 % train/test split
# but for large datasets 90/10 is common and more appropriate

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(124, 13) (124,)
(54, 13) (54,)


Feature scaling is an important part of data preprocessing 

A commonly used technique is standardization, a transformation that:

 - centers the feature columns at mean 0 with variance 1 so that the feature columns take the form of a normal distribution, which makes it easier to learn the weights

- maintains useful information about outliers and makes a machine algorithm less sensitive to them 

In [35]:
from sklearn.preprocessing import StandardScaler

In [42]:
# fit the StandardScaler() on the X_train 
# and use those parameters to transform the test set 

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

Let us check whether training data is helpful at all. 

In practice it can happen to have a high-dimensional dataset with many features that are irrelevant.


Selecting meaningful features using L1 regularization.

L1 regularization is commonly used technique because it yields sparse feature vectors; most feature weights will be zero. 

It is especially useful when there are more irrelevant dimensions than samples. 

In [44]:
from sklearn.linear_model import LogisticRegression

LogisticRegression(penalty="l1")

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

Apply the L1 regularized logistic regression to the standardized training data to produce the following sparse solution:

In [47]:
logreg = LogisticRegression(C=0.1, penalty="l1")
logreg.fit(X_train_std,y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

Both training and test accuracies are 98%. 

Therefore, no indicator for overfitting of our model:

  - Oerfitting is when the model fits the parameters closely to the particular observations in the training dataset but does not generalize well to test data.

In [48]:
print(logreg.score(X_train_std, y_train), logreg.score(X_test_std, y_test))

0.9838709677419355 0.9814814814814815


#### Multi-class classification with LogisticRegression that uses the OvR approach by default


Interpretation of the intercepts and coefficients:



In [51]:
logreg.intercept_

array([-0.38379071, -0.15807535, -0.7004143 ])

The intercept is an array with three values:
 - the intercept that belongs to the model that fits class 1 (vs class 2 & 3)
 - the intercept of the model that fits class 2 (vs class 1 & 3)
 - the intercept of the model that fits class 3 (vs class 1 & 2)

In [53]:
logreg.coef_

array([[ 0.28009078,  0.        ,  0.        , -0.02796229,  0.        ,
         0.        ,  0.70997012,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  1.23643373],
       [-0.64401394, -0.06873457, -0.05721483,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        , -0.92680614,
         0.0602196 ,  0.        , -0.37108924],
       [ 0.        ,  0.06158012,  0.        ,  0.        ,  0.        ,
         0.        , -0.63618518,  0.        ,  0.        ,  0.49819186,
        -0.35806977, -0.57111414,  0.        ]])

The array of coefficients has 3 rows - one 13-dimensional weight vector for each class label 1,2,3.

 - In calculating the net input in computing the probabilities, each row consisting of 13 features is multiplied by the 13D weight vector corresponding to its class label.

 - The coefficient matrix has only a few non-zero entries. A sparse feature vectors means many feature weights are zero.
 


The logistic equation: 

 - logit(p) = log(p/1-p) = z 

expresses the linear relationship of the log-odds with the feature values x:

In [61]:
z = X_train.dot(logreg.coef_.T) + logreg.intercept_

In [60]:
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


sigmoid(z)

array([[1.00000000e+000, 4.78009742e-173, 3.89078776e-002],
       [1.00000000e+000, 1.20220842e-055, 6.87289163e-002],
       [1.00000000e+000, 2.37413246e-143, 9.57038472e-001],
       [1.00000000e+000, 5.39277437e-140, 4.96994712e-001],
       [1.00000000e+000, 3.48396791e-072, 4.58815384e-002],
       [1.00000000e+000, 1.23029714e-098, 5.94495367e-001],
       [1.00000000e+000, 8.10677242e-086, 9.14764121e-001],
       [1.00000000e+000, 2.82178712e-213, 1.22700553e-001],
       [1.00000000e+000, 1.58489763e-089, 3.70132933e-001],
       [1.00000000e+000, 2.57116268e-111, 6.43633231e-001],
       [1.00000000e+000, 1.95077697e-074, 5.92689655e-002],
       [1.00000000e+000, 8.04663750e-066, 2.58844213e-001],
       [1.00000000e+000, 1.79208212e-115, 6.11642895e-001],
       [1.00000000e+000, 9.85186313e-108, 1.52587659e-001],
       [1.00000000e+000, 7.54279876e-198, 1.22836509e-001],
       [1.00000000e+000, 7.04753509e-089, 1.90142979e-001],
       [1.00000000e+000, 2.08243813e-156

The output of sigmoid function:

- sigmoid(z) = p

where 

- p = Prob (class label: 1,2,3 | x;w) 

represents the conditional probability of particular sample belonging to a class label given its features x parameterized by the weights w



In [79]:
logreg.predict_proba(X_test_std)

array([[0.7036647 , 0.15793261, 0.13840269],
       [0.10432374, 0.09117526, 0.80450099],
       [0.22868198, 0.68435332, 0.08696469],
       [0.69734262, 0.14113823, 0.16151914],
       [0.18521545, 0.65297368, 0.16181087],
       [0.43233612, 0.5411641 , 0.02649977],
       [0.77385598, 0.12633572, 0.09980831],
       [0.0692622 , 0.26316014, 0.66757766],
       [0.1529807 , 0.70503119, 0.14198811],
       [0.08484739, 0.68926641, 0.22588621],
       [0.18618244, 0.24587795, 0.56793961],
       [0.05360587, 0.26089837, 0.68549576],
       [0.80977081, 0.05513756, 0.13509163],
       [0.43028075, 0.50338817, 0.06633108],
       [0.21673921, 0.07984007, 0.70342072],
       [0.07746824, 0.85591237, 0.06661939],
       [0.74577894, 0.15127576, 0.1029453 ],
       [0.85835309, 0.02233509, 0.11931183],
       [0.08379689, 0.43855941, 0.4776437 ],
       [0.78851422, 0.13668751, 0.07479827],
       [0.38351476, 0.52194816, 0.09453708],
       [0.5626508 , 0.29694679, 0.14040241],
       [0.

We can predict the class-membership probability of each sample via
the predict_proba():

-  Interpreting the probability values in the array:

In [80]:
logreg.predict_proba(X_test_std)[0]

array([0.7036647 , 0.15793261, 0.13840269])



 - 70.3% chance the first wine sample belongs to class 1
 
 - 15.7% chance the first wine sample belongs to class label 2 
 
 - 13.8% chance the first wine sample belongs to class label 3


In [83]:
y_pred = logreg.predict(X_test)


### Conclusion





We trained a Logistic Regression model that is robust to potentially irrelevant features in this dataset. 

 - L1 regularization served as a method for the feature selection. 
 - By preferring a simpler model can reduce the variance in the presence of sufficient training data to fit the model


The problem of overfitting in the context of bias and variance. We applied another commonly used technique for avoiding overfitting: train/test split validation:

- We prevent overfitting by not using all of the data
- We have some remaining data we can use to evaluate our model.

Train/test split validation may not sufficiently randomize the data. 
 - In practice, we perform k-fold cross-validation to avoid the limitations in the train/test split
 
 - By varying the number k of folds we can get a sense of how this impacts the score, the variance


Mathematical explaination of why L1 regularization can lead to sparse solutions



https://medium.com/mlreview/l1-norm-regularization-and-sparsity-explained-for-dummies-5b0e4be3938a