# Week 4: Regression Mining

### What's on this week
1. [Resuming from week 3](#resume)
2. [Building your first logistic regression model](#build)
3. [Understanding your logistic regression model](#viz)
4. [Finding optimal hyperparameters with GridSearchCV](#gridsearch)
5. [Feature transformation and selection](#fselect)

---

This week's practical note introduces regression mining in Python, focusing on logistic regression. Regression models are a class of linear models that learn a coefficient for each input variable (field) and use these coefficients to make predictions.
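To make "learning coefficients and using them to make predictions" concrete, here is a minimal sketch of how a fitted logistic regression could turn coefficients into a prediction. The coefficient, intercept and feature values are made up for illustration and do not come from the pva97nk data.

```python
import math

def sigmoid(z):
    # squash any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, coefficients, intercept):
    # linear part: intercept + sum of coefficient * feature value
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)

# illustrative values only (assumptions, not learned from real data)
coefficients = [0.8, -0.5]
intercept = 0.1
features = [1.0, 2.0]

p = predict_proba(features, coefficients, intercept)
label = int(p >= 0.5)  # predict class 1 when the probability is at least 0.5
print(round(p, 3), label)
```

Training a model amounts to finding the coefficients and intercept that make these probabilities match the observed targets as closely as possible.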

**These tutorial notes are an experimental version. Please send us feedback and suggestions on how to improve them, and ask your tutor if you have any questions or need clarification.**

## 1. Resuming from week 3 <a name="resume"></a>
Last week, we learned how to perform data mining with decision trees in Python. For this week, we will reuse the code for data preprocessing:

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV
from dm_tools import data_prep

# preprocessing step
df = data_prep()

# train test split
y = df['TargetB']
X = df.drop(['TargetB'], axis=1)
X_mat = X.to_numpy()  # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.5, random_state=11, stratify=y)

## 2. Building your first logistic regression model <a name="build"></a>
There are several types of regression, the two most common being linear and logistic. Which one to use is determined by the target's measurement level: linear regression suits an interval (continuous) target, while logistic regression suits a categorical one. In this case study the target, TargetB, is categorical, so we use logistic regression.
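The choice described above can be sketched as a simple rule of thumb. The helper `suggest_regression` and its `max_classes` threshold are hypothetical names introduced here for illustration; they are not part of scikit-learn or `dm_tools`.

```python
def suggest_regression(target_values, max_classes=10):
    """Hypothetical rule of thumb: pick a regression type from the target's values."""
    distinct = set(target_values)
    # a categorical target has a small number of whole-number labels
    all_integral = all(float(v).is_integer() for v in distinct)
    if all_integral and len(distinct) <= max_classes:
        return "logistic"
    # many distinct or fractional values suggest an interval target
    return "linear"

print(suggest_regression([0, 1, 1, 0, 1]))          # a binary flag
print(suggest_regression([12.5, 3.0, 40.25, 7.8]))  # a continuous amount
```

In practice you would also rely on what you know about the variable, since a numeric column can still encode categories.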

Import and train your logistic regression model using the code below.

In [2]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

print(model)

             precision    recall  f1-score   support

          0       0.58      0.58      0.58      2421
          1       0.58      0.58      0.58      2422

avg / total       0.58      0.58      0.58      4843

0.581044806938
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


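An accuracy of 0.58 is only modestly better than the roughly 0.50 expected from guessing on this balanced target, so there is room to improve. Next, we tune the model's hyperparameters with GridSearchCV.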
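To put the accuracy above in context, it helps to compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent class. The sketch below only reuses the support counts and accuracy printed in the output above.

```python
# values copied from the classification report and accuracy_score output above
support = {0: 2421, 1: 2422}
model_accuracy = 0.581044806938

total = sum(support.values())
# accuracy of always predicting the most frequent class
baseline_accuracy = max(support.values()) / total
lift = model_accuracy - baseline_accuracy

print(f"baseline={baseline_accuracy:.4f} lift={lift:.4f}")
```

Because the two classes are almost perfectly balanced, the baseline sits very close to 0.5, and the model's gain over it is about eight percentage points.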

In [9]:
# grid search CV
params = {'C': [pow(10, x) for x in range(-4, 4)],
         'tol': [pow(10, x) for x in range(-15, -8)]}

cv = GridSearchCV(param_grid=params, estimator=LogisticRegression(), cv=10)
cv.fit(X_train, y_train)

# test the best model
y_pred = cv.predict(X_test)
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

# print parameters of the best model
print(cv.best_params_)

             precision    recall  f1-score   support

          0       0.58      0.59      0.59      2421
          1       0.59      0.58      0.58      2422

avg / total       0.59      0.59      0.59      4843

0.586206896552
{'tol': 1e-15, 'C': 0.001}
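A grid search tries every combination of the listed values, and with cross validation each combination is fitted once per fold. The sketch below counts how much work the grid above implies, using only the standard library; the variable names `candidates`, `n_folds` and `n_fits` are introduced here for illustration.

```python
from itertools import product

# same grid as in the cell above
params = {'C': [pow(10, x) for x in range(-4, 4)],
          'tol': [pow(10, x) for x in range(-15, -8)]}
n_folds = 10  # matches cv=10

# every (C, tol) pair is one candidate model
candidates = list(product(params['C'], params['tol']))
n_fits = len(candidates) * n_folds

print(len(params['C']), len(params['tol']), len(candidates), n_fits)
```

With the default `refit=True`, GridSearchCV also fits the best candidate once more on the whole training set, which is the model that `cv.predict` uses.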


In [4]:
# grid search CV
params = {'C': [x*0.00001 for x in range(1, 25)]}

cv = GridSearchCV(param_grid=params, estimator=LogisticRegression(), cv=5)
cv.fit(X_train, y_train)

# test the best model
y_pred = cv.predict(X_test)
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

# print parameters of the best model
print(cv.best_params_)

             precision    recall  f1-score   support

          0       0.58      0.58      0.58      2421
          1       0.58      0.58      0.58      2422

avg / total       0.58      0.58      0.58      4843

0.579186454677
{'C': 0.00021}
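The refined search above scores slightly lower on the test set (about 0.579) than the coarser one (about 0.586), a reminder that the value of C that wins in cross validation is not guaranteed to be the best on held-out data. The sketch below shows one way to inspect both numbers side by side. It runs on synthetic data from `make_classification` rather than pva97nk, and the grid values and `max_iter=1000` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in for the real dataset (assumption for illustration)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

params = {'C': [0.01, 0.1, 1.0, 10.0]}
cv = GridSearchCV(estimator=LogisticRegression(max_iter=1000),
                  param_grid=params, cv=5)
cv.fit(X_train, y_train)

# mean cross-validated accuracy for every candidate value of C
for c, score in zip(cv.cv_results_['param_C'], cv.cv_results_['mean_test_score']):
    print(c, round(score, 3))

print('best C:', cv.best_params_['C'])
print('cross-validated accuracy:', round(cv.best_score_, 3))
print('held-out test accuracy:', round(cv.score(X_test, y_test), 3))
```

If the candidate scores in `cv_results_` are nearly identical, small differences in test accuracy between searches are likely to be noise rather than a real improvement.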


In [5]:
'''
# coding: utf-8
import pandas as pd
import numpy as np

def data_prep():
    # read the pva97nk dataset
    df = pd.read_csv('pva97nk.csv')

    # drop ID and the unused target variable
    df.drop(['ID', 'TargetD'], axis=1, inplace=True)

    # impute missing values in DemAge with its mean
    df['DemAge'] = df['DemAge'].fillna(df['DemAge'].mean())

    # change DemCluster from interval/integer to nominal/str
    df['DemCluster'] = df['DemCluster'].astype(str)

    # change DemHomeOwner into binary 0/1 variable
    dem_home_owner_map = {'U': 0, 'H': 1}
    df['DemHomeOwner'] = df['DemHomeOwner'].map(dem_home_owner_map)

    # mark missing values in DemMedIncome (values below 1) as NaN
    mask = df['DemMedIncome'] < 1
    df.loc[mask, 'DemMedIncome'] = np.nan

    # df['DemMedIncome'].replace(0, np.nan, inplace=True)

    # impute med income using average strategy
    df['DemMedIncome'] = df['DemMedIncome'].fillna(df['DemMedIncome'].mean())

    # impute gift avg card 36 using average strategy
    df['GiftAvgCard36'] = df['GiftAvgCard36'].fillna(df['GiftAvgCard36'].mean())

    # one hot encoding
    df = pd.get_dummies(df)

    return df
df = data_prep()

# train test split
from sklearn.model_selection import train_test_split
y = df['TargetB']
X = df.drop(['TargetB'], axis=1)
X_mat = X.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.5, random_state=42)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
from sklearn.metrics import classification_report, accuracy_score
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
from sklearn.feature_selection import RFE
sel = RFE(LogisticRegression())
sel.fit(X_train, y_train)
y_pred = sel.predict(X_test)
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
from sklearn.preprocessing import normalize
y = df['TargetB']
X = df.drop(['TargetB'], axis=1)
# normalize() rescales each row (sample) to unit L2 norm by default
X_mat = normalize(X.to_numpy())
X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.5, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif, chi2
# try every k, scoring features with the ANOVA F-test (f_classif)
n_features = X_mat.shape[1]
for i in range(2, n_features + 1):
    select = SelectKBest(score_func=f_classif, k=i)
    X_transf = select.fit_transform(X_mat, y)
    X_train, X_test, y_train, y_test = train_test_split(X_transf, y, test_size=0.5, random_state=42)
    model = LogisticRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(i, accuracy_score(y_test, y_pred))

# repeat with the chi-squared test (chi2 requires non-negative feature values)
for i in range(2, n_features + 1):
    select = SelectKBest(score_func=chi2, k=i)
    X_transf = select.fit_transform(X_mat, y)
    X_train, X_test, y_train, y_test = train_test_split(X_transf, y, test_size=0.5, random_state=42)
    model = LogisticRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(i, accuracy_score(y_test, y_pred))
'''
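The scratch code above fits SelectKBest on the full dataset before splitting, which lets the test rows influence which features are kept. One possible alternative, sketched below, is to wrap the selector and the model in a scikit-learn `Pipeline` so that selection is learned from the training rows only. The example uses synthetic data from `make_classification` instead of pva97nk, and the step names `'select'` and `'model'` as well as `max_iter=1000` are choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# synthetic stand-in for the real dataset (assumption for illustration)
X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

results = {}
for k in range(2, X.shape[1] + 1):
    pipe = Pipeline([
        ('select', SelectKBest(score_func=f_classif, k=k)),  # fitted on training rows only
        ('model', LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X_train, y_train)
    # score() applies the fitted selector to the test rows, then reports accuracy
    results[k] = pipe.score(X_test, y_test)
    print(k, round(results[k], 3))
```

The same pipeline object can be passed to GridSearchCV as the estimator, with `select__k` as a parameter name, if you want k to be chosen by cross validation rather than by test accuracy.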