# Week 4: Regression Mining

### What's on this week
1. [Resuming from week 3](#resume)
2. [Building your first logistic regression model](#build)
3. [Understanding your logistic regression model](#viz)
4. [Finding optimal hyperparameters with GridSearchCV](#gridsearch)
5. [Feature selection](#fselect)

---

This week's practical note introduces regression mining in Python, focusing on logistic regression. Regression models are a class of linear models that learn a coefficient for each variable/field and use these coefficients to make predictions.

**These tutorial notes are an experimental version. Please send us feedback and suggestions on how to improve them, and ask your tutor if you have any questions or need clarification.**

## 1. Resuming from week 3 <a name="resume"></a>
Last week, we learned how to perform data mining with decision trees in Python. For this week, we will reuse the code for data preprocessing:

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV
from dm_tools import data_prep

# preprocessing step
df = data_prep()

# train test split
y = df['TargetB']
X = df.drop(['TargetB'], axis=1)
X_mat = X.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0
X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.5, random_state=42, stratify=y)

## 2. Building your first logistic regression model <a name="build"></a>

### 2.1. Scaling your input

Regression models are sensitive to extreme or outlying values in the input space. Inputs with highly skewed or kurtotic distributions are often selected over inputs with better overall predictive power. To reduce this problem, we scale the inputs before building the logistic regression model. In `sklearn`, this is easily done with `StandardScaler`, which standardizes each input to zero mean and unit variance. The scaler is fitted on the training data only and then applied unchanged to the test data.

In [2]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train, y_train)
X_test = scaler.transform(X_test)
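
To make the effect of scaling concrete, here is a minimal sketch on synthetic data. The arrays `X_train_demo` and `X_test_demo` are made up for illustration and are not the tutorial's dataset. It shows that the scaler learns its statistics from the training split and reuses them on the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)

# Synthetic stand-in for real inputs: two columns on very different scales,
# the second one heavily skewed
X_train_demo = np.column_stack([rng.normal(50, 10, 200), rng.exponential(1000, 200)])
X_test_demo = np.column_stack([rng.normal(50, 10, 100), rng.exponential(1000, 100)])

scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean_ and scale_ from train only
X_test_scaled = scaler_demo.transform(X_test_demo)        # reuses the training statistics

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_test_scaled.mean(axis=0))   # close to 0, but not exactly
```

Note that standardizing changes the location and spread of each input, not the shape of its distribution, so a skewed input stays skewed after scaling.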

### 2.2. Building logistic regression
Once the inputs are scaled, we are ready to build the model. The two most common types of regression are linear and logistic. The type of regression to use is determined by the target's measurement level. In this case study, the target is categorical (binary), so we need to use logistic regression.

Import and train your logistic regression model using the code below.

In [11]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
print(classification_report(y_test, y_pred))


0.601486681809
0.560396448482
             precision    recall  f1-score   support

          0       0.56      0.57      0.56      2422
          1       0.56      0.56      0.56      2421

avg / total       0.56      0.56      0.56      4843



The accuracy score of this model shows an improvement over our tuned decision tree model from last week. We will tune this logistic regression model later using GridSearchCV.
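
To see how the learned coefficients turn into predictions, the sketch below fits a logistic regression on synthetic data from `make_classification` (an assumption for illustration, not the tutorial's dataset) and reproduces the predicted probabilities by hand from `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=42)
clf = LogisticRegression().fit(X_demo, y_demo)

# One coefficient per input, plus a single intercept
print(clf.coef_.shape)
print(clf.intercept_.shape)

# Linear score z = w.x + b, then the logistic function maps it to P(y = 1)
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Compare with scikit-learn's own probability for class 1
print(np.allclose(p_manual, clf.predict_proba(X_demo)[:, 1]))
```

On the real data, the same attributes can be read from `model` after fitting. Because the inputs were standardized, the coefficient magnitudes are roughly comparable across inputs.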

## 4. Finding optimal hyperparameters with GridSearchCV <a name="gridsearch"></a>

Let's see whether tuning can improve our logistic regression model. In this example, I will tune only one parameter, `C`, which is the inverse of regularization strength: smaller values specify stronger regularization. Typical values for `C` range from 10^-6 to 10^4 in powers of 10. The grid used here is built with `range(-6, 4)`, so it covers 10^-6 to 10^3.

Tip: `GridSearchCV` can be very slow when searching over a large set of candidate values. To help with this, `GridSearchCV` can run the search in parallel, and `n_jobs` sets how many processes run at the same time (`-1` means GridSearchCV will use as many cores as possible).
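
As a rough illustration of what `C` does, the sketch below fits the same model with several values of `C` on synthetic data (an assumption for illustration only) and prints the size of the coefficient vector. Stronger regularization (smaller `C`) shrinks the coefficients towards zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

norms = {}
for C in [1e-4, 1e-2, 1.0, 1e2]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    # Euclidean length of the coefficient vector for this value of C
    norms[C] = np.linalg.norm(clf.coef_)
    print(C, round(norms[C], 4))
```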

In [12]:
# grid search CV
params = {'C': [pow(10, x) for x in range(-6, 4)]}

cv = GridSearchCV(param_grid=params, estimator=LogisticRegression(), cv=10, n_jobs=-1)
cv.fit(X_train, y_train)

# test the best model
y_pred = cv.predict(X_test)
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

# print parameters of the best model
print(cv.best_params_)

print(cv.score(X_train, y_train))

             precision    recall  f1-score   support

          0       0.57      0.60      0.58      2422
          1       0.57      0.54      0.56      2421

avg / total       0.57      0.57      0.57      4843

0.569275242618
{'C': 0.0001}
0.581664257692


GridSearchCV selects `C` = 0.0001, which gives a slight improvement in test accuracy (about 0.569, versus 0.560 with the default `C` = 1.0). This is the best result so far compared to decision trees, so we will keep it. Experiment with other sets of values and parameters, and see if you can get a better result.
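
When experimenting with other grids, it helps to look at the score of every candidate rather than only the winner. The sketch below does this through the `cv_results_` attribute, using synthetic data and a smaller grid (both are assumptions for illustration).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Smaller illustrative grid: C from 10^-3 to 10^2
params_demo = {'C': [10.0 ** x for x in range(-3, 3)]}
cv_demo = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=params_demo, cv=5)
cv_demo.fit(X_demo, y_demo)

# One row per candidate value of C, with its cross-validated accuracy
results = pd.DataFrame(cv_demo.cv_results_)
print(results[['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']])
print(cv_demo.best_params_)
```

If several values of `C` score within one standard deviation of each other, the choice between them is not well determined by the data.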

## 5. Feature transformation and selection <a name="fselect"></a>

* RFECV
* PCA
* RFECV + PCA

In [13]:
from sklearn.feature_selection import RFECV

rfe = RFECV(estimator=LogisticRegression(C=0.0001), cv=10)
rfe.fit(X_train, y_train)

print(X_train.shape)
print(rfe.n_features_)

(4843, 85)
66
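
`n_features_` only reports how many inputs were kept. To see which ones, the fitted selector exposes `support_` and `ranking_`. The sketch below shows this on synthetic data. The column names in `feature_names` are hypothetical placeholders. With the real data you would use the columns of `X` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, random_state=42)
# Hypothetical column names standing in for the real field names
feature_names = np.array(['f%d' % i for i in range(X_demo.shape[1])])

rfe_demo = RFECV(estimator=LogisticRegression(max_iter=1000), cv=5)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.n_features_)              # number of inputs kept
print(feature_names[rfe_demo.support_])  # names of the inputs kept
print(rfe_demo.ranking_)                 # 1 = kept; larger = eliminated earlier
```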


In [14]:
X_train_sel = rfe.transform(X_train)
X_test_sel = rfe.transform(X_test)

model = LogisticRegression(C=0.0001)
model.fit(X_train_sel, y_train)
y_pred = model.predict(X_test_sel)

print(model.score(X_train_sel, y_train))
print(model.score(X_test_sel, y_test))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

0.581664257692
0.569068759034
             precision    recall  f1-score   support

          0       0.56      0.60      0.58      2422
          1       0.57      0.54      0.55      2421

avg / total       0.57      0.57      0.57      4843

0.569068759034
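
The list at the top of this section also names PCA, which the cells above do not cover. As a minimal sketch of one possible approach, the code below chains scaling, PCA and logistic regression in a pipeline. It uses synthetic data, and the 95% variance threshold is an arbitrary assumption, not a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.5,
                                      random_state=42, stratify=y_demo)

# Keep enough principal components to explain 95% of the variance
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                     LogisticRegression(max_iter=1000))
pipe.fit(Xtr, ytr)

pca_step = pipe.named_steps['pca']
print(pca_step.n_components_)                    # number of components kept
print(pca_step.explained_variance_ratio_.sum())  # share of variance retained
print(pipe.score(Xte, yte))                      # test accuracy
```

Putting the steps in a pipeline means the scaler and PCA are fitted on the training split only, which matches how the scaler is handled earlier in these notes.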


In [5]:
'''
import pandas as pd
import numpy as np


from sklearn.feature_selection import RFE
sel = RFE(LogisticRegression())
sel.fit(X_train, y_train)
y_pred = sel.predict(X_test)
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import normalize
y = df['TargetB']
X = df.drop(['TargetB'], axis=1)
X_mat = normalize(X.to_numpy())
X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.5, random_state=42)
X_train
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif, chi2
names = df.columns
for i in range(2, len(names)+1):
    select = SelectKBest(score_func=f_classif, k=i)
    X_transf = select.fit_transform(X_mat, y)
    X_train, X_test, y_train, y_test = train_test_split(X_transf, y, test_size=0.5, random_state=42)
    model = LogisticRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(i, accuracy_score(y_test, y_pred))
    
for i in range(2, len(names)):
    select = SelectKBest(score_func=chi2, k=i)
    X_transf = select.fit_transform(X_mat, y)
    X_train, X_test, y_train, y_test = train_test_split(X_transf, y, test_size=0.5, random_state=42)
    model = LogisticRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(i, accuracy_score(y_test, y_pred))
    
'''