# Week 4: Regression Mining

### What's on this week
1. [Resuming from week 3](#resume)
2. [Building your first logistic regression model](#build)
3. [Understanding your logistic regression model](#viz)
4. [Finding optimal hyperparameters with GridSearchCV](#gridsearch)
5. [Dimensionality reduction](#fselect)

---

This week's practical note introduces you to regression mining in Python, in particular logistic regression. Regressions are a class of linear models that learn a coefficient for each variable/field and use these coefficients to make predictions.

**These tutorial notes are an experimental version. Please give us feedback and suggestions on how to improve them, and ask your tutor if you have any questions or need clarification.**

## 1. Resuming from week 3 <a name="resume"></a>
Last week, we learned how to perform data mining with decision trees in Python. For this week, we will reuse the code for data preprocessing:

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV
from dm_tools import data_prep

# preprocessing step
df = data_prep()

# train test split
y = df['TargetB']
X = df.drop(['TargetB'], axis=1)
# as_matrix() was removed in pandas 1.0; to_numpy() (pandas >= 0.24) returns the same NumPy array
X_mat = X.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.5, random_state=42, stratify=y)

## 2. Building your first logistic regression model <a name="build"></a>

### 2.1. Scaling your input

Regression models are sensitive to extreme or outlying values in the input space. Inputs with highly skewed or kurtotic distributions are often selected over inputs with better overall predictive power. To avoid this problem, we should scale our inputs before building our logistic regression model. In `sklearn`, this can easily be done using `StandardScaler`, which rescales each input to zero mean and unit variance. Note that the scaler is fitted on the training set only and then applied unchanged to the test set.

In [2]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train, y_train)
X_test = scaler.transform(X_test)
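
To make the scaling step concrete, the sketch below standardises a single column by hand. It learns the mean and standard deviation from the training values only and then applies them to both sets. This is a minimal illustration with made-up numbers (`train_col` and `test_col` are hypothetical), not the `StandardScaler` implementation itself.

```python
from statistics import mean, pstdev

# hypothetical values for a single input column
train_col = [10.0, 20.0, 30.0, 40.0]
test_col = [25.0, 50.0]

# "fit": learn the statistics from the training data only
mu = mean(train_col)
sigma = pstdev(train_col)  # population standard deviation, as StandardScaler uses

# "transform": apply the same statistics to both sets
train_scaled = [(v - mu) / sigma for v in train_col]
test_scaled = [(v - mu) / sigma for v in test_col]

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, because it is scaled with the training statistics.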

### 2.2. Building logistic regression
Once we have scaled our inputs, we are ready to build the model. There are several types of regression, most notably linear and logistic. The type to use is determined by the target's measurement level. In this case study the target is categorical, so we need to use logistic regression.

Import and train your logistic regression model using the code below.

In [3]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Train accuracy: 0.601486681809
Test accuracy: 0.560396448482
             precision    recall  f1-score   support

          0       0.56      0.57      0.56      2422
          1       0.56      0.56      0.56      2421

avg / total       0.56      0.56      0.56      4843



The accuracy score of this model is still below our tuned decision tree model from last week. We will tune this logistic regression model later using GridSearchCV.

## 3. Understanding your logistic regression model <a name="viz"></a>

Let's take a deeper look at the model we just built. First, I want to go over what logistic regression is doing. From the lecture, you know that a regression function looks like this:

![regression](http://dataminingtuts.s3.amazonaws.com/reg%20func.png)

The training process learns a weight for each feature. It does so by trying to minimize the cost function, which measures how far our current predictions are from the ground truth. It looks something like this:

![cost function](http://www.holehouse.org/mlclass/06_Logistic_Regression_files/Image%20[16].png)
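
As a rough illustration of what such a cost function measures, the sketch below computes the average cross-entropy (log loss) for a handful of predictions. It assumes the cost pictured above is the usual cross-entropy without a regularization term. The labels and probabilities are made up for the example.

```python
import math

def log_loss(y_true, p_pred):
    """Average cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # only one of the two terms is non-zero for each row
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# hypothetical labels and predicted probabilities of class 1
y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]
mostly_wrong = [0.1, 0.9, 0.2, 0.8]

print(log_loss(y_true, mostly_right))  # low cost
print(log_loss(y_true, mostly_wrong))  # high cost
```

Predictions that sit close to the true labels give a low cost, while confident but wrong predictions are penalised heavily. Training searches for the weights that make this number as small as possible.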

In your logistic regression model, all of these weights are stored in the model's `.coef_` array.

In [4]:
print(model.coef_)

[[  1.33602632e-01  -3.49058242e-02   8.73566169e-02  -1.12651445e-02
   -6.47839581e-02   1.13940872e-01  -4.12320225e-02  -1.23297619e-01
   -2.08348046e-01   2.12149503e-01  -1.13710895e-01  -3.89598331e-02
    2.30151716e-01  -8.31855771e-03   1.07410594e-01  -3.17912435e-01
    4.52177956e-02   6.17079947e-02   8.77752670e-03   1.19286390e-01
    3.75487794e-02   5.66883439e-03  -1.20892794e-02   5.39555786e-02
   -1.70564480e-04   2.01003551e-02   2.42925172e-02  -2.05111399e-02
    1.89149029e-02   5.05416081e-03  -4.65260730e-02  -3.38191123e-02
   -1.41424065e-02  -1.67238754e-02   3.86303626e-02  -2.31226410e-02
    1.11225030e-03   1.70336719e-02   2.46588640e-02   2.35854684e-02
   -9.61327697e-03  -1.73118474e-02  -3.34528464e-02   3.19422931e-03
    3.89319574e-02   2.18522571e-02  -1.98121857e-02   3.68600889e-03
   -1.87540293e-02   5.99883488e-02   3.15149624e-02   2.21313845e-02
   -7.55714246e-02   4.65481970e-02  -2.58371600e-02  -5.99350026e-03
    1.52829726e-02  
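
A raw array of weights is hard to read. One possible way to inspect it is to pair each weight with its column name and sort by absolute value. Because the inputs were standardised, a larger absolute weight roughly indicates a stronger influence on the prediction. The names and weights below are hypothetical stand-ins. In the tutorial you would use `X.columns` and `model.coef_[0]` instead.

```python
# hypothetical feature names and weights, standing in for X.columns and model.coef_[0]
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
weights = [0.13, -0.32, 0.01, 0.23]

# sort by absolute weight, largest first
ranked = sorted(zip(feature_names, weights), key=lambda pair: abs(pair[1]), reverse=True)

for name, weight in ranked:
    print(name, ":", weight)
```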

Now, because a regression is a mathematical function that outputs a number, how can it solve classification problems? The answer lies in the word logistic. The logistic function looks like this:

![logistic](http://dataminingtuts.s3.amazonaws.com/logistic_function.png)

The logistic function produces outputs between 0 and 1. Its curve looks like this:

![logistic curve](http://dataminingtuts.s3.amazonaws.com/Logistic-curve.svg.png)

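In a logistic regression model, the weighted sum of the inputs (plus an intercept) is passed through this function, and the result is interpreted as the probability of class 1. Outputs smaller than 0.5 are classified as 0 and the rest as 1, which lets us use the function to make classification predictions.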
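
The whole prediction step can be sketched in a few lines: compute the weighted sum, squash it with the logistic function, then apply the 0.5 threshold. The weights, intercept and input rows here are hypothetical, and the `predict` helper is written only for this illustration. It is not the `sklearn` method of the same name.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, intercept, x, threshold=0.5):
    """Return (probability of class 1, predicted class) for one input row."""
    z = intercept + sum(w * v for w, v in zip(weights, x))  # linear part
    p = sigmoid(z)
    return p, int(p >= threshold)

# hypothetical weights and one hypothetical (already scaled) input row
weights = [0.8, -0.5]
intercept = 0.1
prob, label = predict(weights, intercept, [1.2, 0.4])
print(prob, label)
```

A weighted sum of exactly 0 maps to a probability of 0.5, so the 0.5 threshold on the output corresponds to asking whether the linear part is positive or negative.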

## 4. Finding optimal hyperparameters with GridSearchCV <a name="gridsearch"></a>

Alright, let's see whether we can tune our logistic regression model to perform better. In this example, I will tune only one parameter, `C`, which is the inverse of regularization strength: smaller values specify stronger regularization. Typical values for `C` are powers of 10, from around 10^-6 to 10^4. The grid used here covers 10^-6 to 10^3.
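
To check exactly which values the search will try, you can build the list on its own first. This small sketch uses the same expression as the grid below. Note that `range(-6, 4)` stops before 4, so the largest value is 10^3, and that the `LogisticRegression` default of `C=1.0` is one of the candidates.

```python
# ten candidate values, each 10 times larger than the previous one
c_values = [pow(10, x) for x in range(-6, 4)]

print(len(c_values))
print(c_values)
```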

Tip: `GridSearchCV` can be very slow when searching over a large set of possible values. To help with this, `GridSearchCV` can run in parallel: use `n_jobs` to specify how many processes run at the same time (-1 means `GridSearchCV` will use all available cores).

In [5]:
# grid search CV
params = {'C': [pow(10, x) for x in range(-6, 4)]}

cv = GridSearchCV(param_grid=params, estimator=LogisticRegression(), cv=10, n_jobs=-1)
cv.fit(X_train, y_train)

# test the best model
print("Train accuracy:", cv.score(X_train, y_train))
print("Test accuracy:", cv.score(X_test, y_test))

y_pred = cv.predict(X_test)
print(classification_report(y_test, y_pred))

# print parameters of the best model
print(cv.best_params_)

Train accuracy: 0.581664257692
Test accuracy: 0.569275242618
             precision    recall  f1-score   support

          0       0.57      0.60      0.58      2422
          1       0.57      0.54      0.56      2421

avg / total       0.57      0.57      0.57      4843

{'C': 0.0001}


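The grid search selects `C` = 0.0001, which improves test accuracy from about 0.560 to 0.569 compared with the default `C` of 1.0. Training accuracy drops from about 0.601 to 0.582, so the stronger regularization also narrows the gap between training and test performance. This is our best result so far, including the decision trees, so we will keep it. Experiment with other sets of values and parameters, and see if you can get a better result.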

## 5. Dimensionality reduction <a name="fselect"></a>

Another method to improve prediction quality from a model is to perform dimensionality reduction on the input set. Dimensionality reduction is divided into two processes:
* Feature selection: Process of selecting a subset of relevant features/variables to be used in constructing models.
* Feature extraction: Process of transforming a high-dimensional feature space into a lower-dimensional one. Typically performed by finding the principal components of the feature space.

Let's explore some dimensionality reduction techniques. 

### 5.1. Recursive feature elimination

The first method that we will try is called recursive feature elimination (RFE). RFE works by first training the model on the whole set of features. Each feature is then assigned a weight; RFE eliminates the features with the smallest weights, retrains the model on the smaller feature set, and repeats.
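
The elimination loop can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not what `sklearn` does internally: it uses ordinary least squares instead of logistic regression, removes exactly one feature per round, and stops at a fixed size given by the hypothetical `n_keep` parameter. The data is synthetic.

```python
import numpy as np

def rfe_sketch(X, y, n_keep):
    """Repeatedly fit a linear model and drop the feature with the smallest |weight|."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # fit ordinary least squares on the columns that are still in play
        w, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # eliminate the feature whose weight has the smallest magnitude
        weakest = int(np.argmin(np.abs(w)))
        remaining.pop(weakest)
    return remaining

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# synthetic target: only columns 0 and 3 actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

selected = rfe_sketch(X, y, n_keep=2)
print(sorted(selected))
```

Refitting after every removal is what makes the procedure recursive: the weights of the surviving features are re-estimated each round rather than ranked once up front.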

In this tutorial, we will use RFE with cross validation. Cross validation lets RFE choose the number of features that generalizes best, rather than requiring us to fix that number in advance. First, let's import `RFECV` from `sklearn.feature_selection`. Then initialise the RFE with a logistic regression estimator and 10-fold cross validation.

In [6]:
from sklearn.feature_selection import RFECV

rfe = RFECV(estimator = LogisticRegression(), cv=10)
rfe.fit(X_train, y_train)

print("Original feature set", X_train.shape)
print("Number of features after elimination", rfe.n_features_)

Original feature set (4843, 85)
Number of features after elimination 19
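
To find out which columns survived, a fitted `RFECV` exposes a boolean mask, `support_`, with one entry per original feature. The sketch below shows how such a mask can be matched to column names. The names and the mask are hypothetical. In the tutorial you would use `X.columns` and `rfe.support_`.

```python
# hypothetical column names and a hypothetical mask shaped like rfe.support_
columns = ["col_a", "col_b", "col_c", "col_d", "col_e"]
support = [True, False, True, True, False]

kept = [name for name, keep in zip(columns, support) if keep]
dropped = [name for name, keep in zip(columns, support) if not keep]

print("Kept:", kept)
print("Dropped:", dropped)
```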


Once the RFE is fitted, we can transform the original feature set using `.transform`.

In [7]:
X_train_sel = rfe.transform(X_train)
X_test_sel = rfe.transform(X_test)

We managed to reduce the feature set from 85 to 19. Let's re-tune the logistic regression model with this new feature set.

In [8]:
# grid search CV
params = {'C': [pow(10, x) for x in range(-6, 4)]}

cv = GridSearchCV(param_grid=params, estimator=LogisticRegression(), cv=10, n_jobs=-1)
cv.fit(X_train_sel, y_train)

# test the best model
print("Train accuracy:", cv.score(X_train_sel, y_train))
print("Test accuracy:", cv.score(X_test_sel, y_test))

y_pred = cv.predict(X_test_sel)
print(classification_report(y_test, y_pred))

# print parameters of the best model
print(cv.best_params_)

Train accuracy: 0.588271732397
Test accuracy: 0.570514144126
             precision    recall  f1-score   support

          0       0.57      0.59      0.58      2422
          1       0.57      0.55      0.56      2421

avg / total       0.57      0.57      0.57      4843

{'C': 0.01}


RFE improves test accuracy slightly (from about 0.569 to 0.571). In addition, with a much smaller feature set, the training process is sped up significantly.

### 5.2. Principal Component Analysis

Principal Component Analysis (PCA) is a technique that finds underlying variables (known as principal components) that best differentiate your data points. The idea of PCA is to reduce the number of features while still retaining the variance/pattern in the feature set.

[Intuitive explanation of PCA](https://www.quora.com/What-is-an-intuitive-explanation-for-PCA)
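
For a rough feel of what happens under the hood, PCA can be sketched with a singular value decomposition in NumPy. This is a minimal illustration on synthetic data, where one column is nearly a copy of another, and not the `sklearn` implementation. All variable names are chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data: 300 rows, 3 columns, where column 1 is nearly a copy of column 0
base = rng.normal(size=(300, 1))
X = np.hstack([base,
               base + 0.05 * rng.normal(size=(300, 1)),
               rng.normal(size=(300, 1))])

# 1. centre each column
X_centred = X - X.mean(axis=0)

# 2. the rows of Vt are the principal components (directions of greatest variance)
U, s, Vt = np.linalg.svd(X_centred, full_matrices=False)

# 3. share of the total variance captured by each component
explained_ratio = s**2 / np.sum(s**2)

# 4. project the data onto the first two components
X_reduced = X_centred @ Vt[:2].T

print(explained_ratio)
print(X_reduced.shape)
```

Because two of the three columns carry almost the same information, the first two components capture nearly all of the variance and the third can be dropped with little loss.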

Let's start by importing `PCA` from sklearn.

In [9]:
from sklearn.decomposition import PCA

With PCA, we need to specify the number of components that we want to retain. The problem is, how do we know how many? A good rule of thumb is to retain enough components to explain at least 95% of the variance. First, fit the PCA on `X_train`. Then iterate through `explained_variance_ratio_` from the PCA model, summing the values until the total reaches at least 95%.

In [10]:
pca = PCA()
pca.fit(X_train)

sum_var = 0
for idx, val in enumerate(pca.explained_variance_ratio_):
    sum_var += val
    if (sum_var >= 0.95):
        print("N components with > 95% variance =", idx+1)
        break

N components with > 95% variance = 66


Now we know that we need 66 components to retain at least 95% of the variance. Let's refit the PCA with 66 components and retune the logistic regression model to see whether PCA improves performance.

In [11]:
pca = PCA(n_components=66)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# grid search CV
params = {'C': [pow(10, x) for x in range(-6, 4)]}

cv = GridSearchCV(param_grid=params, estimator=LogisticRegression(), cv=10, n_jobs=-1)
cv.fit(X_train_pca, y_train)

print("Train accuracy:", cv.score(X_train_pca, y_train))
print("Test accuracy:", cv.score(X_test_pca, y_test))

# test the best model
y_pred = cv.predict(X_test_pca)
print(classification_report(y_test, y_pred))

# print parameters of the best model
print(cv.best_params_)

Train accuracy: 0.580012389015
Test accuracy: 0.570514144126
             precision    recall  f1-score   support

          0       0.57      0.60      0.58      2422
          1       0.58      0.54      0.56      2421

avg / total       0.57      0.57      0.57      4843

{'C': 0.0001}


The result shows slightly improved test performance over the original feature set. We also managed to reduce the feature set size to 66, which shortens the training process.

One disadvantage of PCA is that it transforms the feature set into a completely different set of variables. Each component is a combination of the original fields/columns, so the transformed features can no longer be interpreted in terms of the original ones.

### 5.3. Feature selection using model

The last method that we will try on this dataset is selection from a model. In this technique, we use a machine learning model that can compute feature importance and select the feature set based on that importance. Typically, decision tree or support vector machine models are used for this. We will use a decision tree here.

Firstly, let's tune another decision tree using the original training data.

In [12]:
from sklearn.tree import DecisionTreeClassifier

params = {'criterion': ['gini', 'entropy'],
          'max_depth': range(3, 10),
          'min_samples_leaf': range(20, 200, 20)}

cv = GridSearchCV(param_grid=params, estimator=DecisionTreeClassifier(), cv=10)
cv.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_depth': range(3, 10), 'min_samples_leaf': range(20, 200, 20), 'criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

Then, we can analyse the feature importance of this trained model. Remember the `analyse_feature_importance` function we wrote last week? We will use it here.

In [13]:
from dm_tools import analyse_feature_importance

analyse_feature_importance(cv.best_estimator_, X.columns)

GiftAvgLast : 0.424147866229
DemMedHomeValue : 0.165964132384
GiftTimeLast : 0.148495742668
GiftAvgCard36 : 0.0996893965541
DemAge : 0.0654544865005
PromCntCard36 : 0.0494236862349
GiftCntAll : 0.0468246894295
DemGender_U : 0.0
DemCluster_11 : 0.0
StatusCat96NK_N : 0.0
StatusCat96NK_S : 0.0
DemCluster_0 : 0.0
DemCluster_1 : 0.0
DemCluster_10 : 0.0
DemCluster_13 : 0.0
DemCluster_12 : 0.0
StatusCat96NK_F : 0.0
DemCluster_14 : 0.0
DemCluster_15 : 0.0
DemCluster_16 : 0.0


The feature importance analysis shows that only 7 features have an importance value greater than 0. In other words, according to our decision tree, only 7 features in our feature set are important. Let's use this decision tree to select our features with the `SelectFromModel` class from `sklearn.feature_selection`.

In [14]:
from sklearn.feature_selection import SelectFromModel

selectmodel = SelectFromModel(cv.best_estimator_, prefit=True)
X_train_sel_model = selectmodel.transform(X_train)
X_test_sel_model = selectmodel.transform(X_test)

print(X_train_sel_model.shape)

(4843, 7)
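
Why seven? When no `threshold` is given and the estimator is not L1-penalised, `SelectFromModel` keeps the features whose importance is at least the mean importance. The sketch below replays that rule by hand on the importances printed earlier, assuming the features not listed there all have importance 0.0, as the analysis suggests.

```python
# the seven non-zero importances reported above; the other 78 of the 85 features are taken as 0.0
nonzero = [0.424147866229, 0.165964132384, 0.148495742668, 0.0996893965541,
           0.0654544865005, 0.0494236862349, 0.0468246894295]
importances = nonzero + [0.0] * (85 - len(nonzero))

# assumed default rule: keep features whose importance is at least the mean importance
threshold = sum(importances) / len(importances)
selected = [i for i, imp in enumerate(importances) if imp >= threshold]

print("threshold:", threshold)
print("features kept:", len(selected))
```

The mean works out to roughly 1/85, and every non-zero importance here is well above it, so the rule happens to keep exactly the features the tree actually used. With a different tree, some features with small non-zero importance could fall below the mean and be dropped.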


The shape of `X_train_sel_model` shows that only 7 features are left, exactly what the decision tree suggests. Let's train and tune another logistic regression model on this new data set.

In [15]:
params = {'C': [pow(10, x) for x in range(-6, 4)]}

cv = GridSearchCV(param_grid=params, estimator=LogisticRegression(), cv=10, n_jobs=-1)
cv.fit(X_train_sel_model, y_train)

print("Train accuracy:", cv.score(X_train_sel_model, y_train))
print("Test accuracy:", cv.score(X_test_sel_model, y_test))

# test the best model
y_pred = cv.predict(X_test_sel_model)
print(classification_report(y_test, y_pred))

# print parameters of the best model
print(cv.best_params_)

Train accuracy: 0.570927111295
Test accuracy: 0.572578979971
             precision    recall  f1-score   support

          0       0.57      0.58      0.58      2422
          1       0.57      0.56      0.57      2421

avg / total       0.57      0.57      0.57      4843

{'C': 0.001}


The test accuracy also shows an improvement over the original feature set. This method yields the smallest feature set so far (only 7 rather than 85 features), yet its test accuracy is the best of all the approaches we tried. This demonstrates the effectiveness of dimensionality reduction.

## End notes and next week

This week, we learned how to build, tune and explore the structure of logistic regression models. We also explored dimensionality reduction techniques to reduce the size of the feature set and improve performance of our logistic regression model.

Next week, we will learn how to perform predictive modelling with neural networks and compare the end-to-end performance of all the models we have built so far.