# Training classification models

This notebook demonstrates that the same process can be used to train each of the following classifiers:
- Logistic Regression Classifier
- Naive Bayes Classifier
- k-Nearest Neighbour Classifier
- Decision Tree Classifier
- Support Vector Machine Classifier
- Random Forest Classifier

The process of training different models follows a similar pattern:
1. Prepare the data in a format that `sklearn` can understand (i.e. feature data in a 2-dimensional array, and target data in a 1-dimensional array)
2. Split the data into a training set and a test set
3. Choose a model (e.g. LinearRegression, LogisticRegression, RandomForestClassifier, etc.) and train it using the `.fit()` method
4. Evaluate the model
5. Tune / improve the model
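
The five steps above can be sketched end to end on a small synthetic dataset. This is only an illustration under stated assumptions: `make_classification` stands in for the bank marketing data, and the `demo_` names, `max_iter=1000` and the `C` grid are arbitrary choices that do not appear elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Prepare data: X is 2-D (samples x features), y is 1-D (one label per sample).
#    Synthetic data is an assumption used purely for illustration.
demo_X, demo_y = make_classification(n_samples=500, n_features=8, random_state=0)

# 2. Split into training and test sets (default: 75% train, 25% test).
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=0)

# 3. Choose a model and train it with .fit().
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(demo_X_train, demo_y_train)

# 4. Evaluate: compare accuracy on the training and test sets.
demo_train_score = demo_model.score(demo_X_train, demo_y_train)
demo_test_score = demo_model.score(demo_X_test, demo_y_test)
print("training set score: %f" % demo_train_score)
print("test set score: %f" % demo_test_score)

# 5. Tune: search a small, arbitrary grid of regularisation strengths.
demo_grid = GridSearchCV(LogisticRegression(max_iter=1000),
                         param_grid={"C": [0.1, 1.0, 10.0]}, cv=5)
demo_grid.fit(demo_X_train, demo_y_train)
print("best parameters:", demo_grid.best_params_)
```

Swapping `LogisticRegression` for another classifier changes only the lines that construct the model; the other steps stay the same.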

In [1]:
# load libraries
import pandas as pd
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import train_test_split

In [2]:
# defining some helper methods
def print_header(title):
    print("\n" + title + ":\n")
    
def print_model_header(title):
    newline = "\n=====================================================\n"
    print(newline + title + newline)

def generate_evaluation_tables(model, data, target):
    X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=0)
    print_model_header(type(model).__name__)
    # Evaluate our model
    training_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print_header("COMPARING SCORES OF TRAINING AND TEST SET")
    print("training set score: %f" % training_score)
    print("test set score: %f" % test_score)

    # Make predictions on the full dataset passed in (training and test rows combined)
    expected = target
    predicted = model.predict(data)
    print_header("CLASSIFICATION REPORT")
    classification_report = metrics.classification_report(expected, predicted)
    print(classification_report)

    print_header("CONFUSION MATRIX")
    confusion_matrix = metrics.confusion_matrix(expected, predicted)
    print(confusion_matrix)

## 1. Load and prepare data in X and y format

In [3]:
df = pd.read_csv('./data/bank-marketing-data/bank-additional-one-hot-encoded.csv')

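# .as_matrix() was removed in pandas 1.0; .to_numpy() returns the same NumPy array
y = df['y'].to_numpy()   # target: 1-dimensional array
del df['y']

X = df.to_numpy()        # features: 2-dimensional array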
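
To make the "X is 2-dimensional, y is 1-dimensional" requirement concrete, here is a minimal sketch on a hypothetical three-row frame. The column names `age` and `job_admin` are invented for illustration and are not claimed to match the real CSV.

```python
import pandas as pd

# Hypothetical stand-in for the one-hot-encoded CSV (column names are assumptions).
toy_df = pd.DataFrame({
    "age": [30, 45, 52],
    "job_admin": [1, 0, 0],
    "y": [0, 1, 0],
})

toy_y = toy_df["y"].to_numpy()                # labels: one value per row
toy_X = toy_df.drop(columns="y").to_numpy()   # features: rows x columns

print(toy_X.shape)  # (3, 2)
print(toy_y.shape)  # (3,)
```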

## 2. Split data into train and test set

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
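
With no `test_size` argument, `train_test_split` holds out 25% of the rows. The sketch below checks this on a toy array and also shows the optional `stratify` argument, which keeps the class ratio the same in both halves. Stratification is not used in this notebook; it is shown only as an option worth considering for imbalanced targets, and the `split_` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features, 20% positive labels (illustrative only).
split_X = np.arange(40).reshape(20, 2)
split_y = np.array([0] * 16 + [1] * 4)

# Default split: 75% training rows, 25% test rows.
s_X_train, s_X_test, s_y_train, s_y_test = train_test_split(
    split_X, split_y, random_state=0)
print(s_X_train.shape, s_X_test.shape)  # (15, 2) (5, 2)

# Stratified split: both halves keep the 20% positive rate.
_, _, strat_y_train, strat_y_test = train_test_split(
    split_X, split_y, random_state=0, stratify=split_y)
print(strat_y_train.sum(), strat_y_test.sum())  # 3 1
```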

## 3. Choose a model and train the model using the .fit() method

### Logistic Regression

In [5]:
# Choose our model and train our model
# Note: Ridge is a regression model, not a classifier, and it is not evaluated below
from sklearn.linear_model import Ridge        ## IMPORTANT: These 2 lines are the only lines that 
ridge_regression_model = Ridge()               ## you need to change to build a different model
ridge_regression_model.fit(X_train, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [6]:
# Choose our model and train our model 
from sklearn.linear_model import LogisticRegression     ## IMPORTANT: These 2 lines are the only lines that 
logistic_regression_model = LogisticRegression()        ## you need to change to build a different model
logistic_regression_model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [7]:
# Evaluate our model
training_score = logistic_regression_model.score(X_train, y_train)
test_score = logistic_regression_model.score(X_test, y_test)
print_header("COMPARING SCORES OF TRAINING AND TEST SET")
print("training set score: %f" % training_score)
print("test set score: %f" % test_score)

# Make predictions on the full dataset (training and test rows combined)
expected = y
predicted = logistic_regression_model.predict(X)

print_header("CLASSIFICATION REPORT")
classification_report = metrics.classification_report(expected, predicted)
print(classification_report)

print_header("CONFUSION MATRIX")
confusion_matrix = metrics.confusion_matrix(expected, predicted)
print(confusion_matrix)


COMPARING SCORES OF TRAINING AND TEST SET:

training set score: 0.908906
test set score: 0.912013

CLASSIFICATION REPORT:

             precision    recall  f1-score   support

          0       0.93      0.97      0.95     36548
          1       0.66      0.40      0.50      4640

avg / total       0.90      0.91      0.90     41188


CONFUSION MATRIX:

[[35613   935]
 [ 2785  1855]]
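
The class-1 figures in the classification report can be recomputed from the confusion matrix above, where rows are the true classes and columns are the predicted classes. The sketch below copies the four counts from that output; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, columns = predicted.
cm = np.array([[35613, 935],
               [2785, 1855]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # of all predicted 1s, how many were really 1
recall_1 = tp / (tp + fn)      # of all actual 1s, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tn + tp) / cm.sum()

print("precision: %.2f" % precision_1)  # 0.66
print("recall:    %.2f" % recall_1)     # 0.40
print("f1-score:  %.2f" % f1_1)         # 0.50
print("accuracy:  %.2f" % accuracy)     # 0.91
```

The low recall for class 1 means the model misses about 60% of the positive cases, even though overall accuracy is about 91%.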


### Naive Bayes

In [8]:
# 3. Choose our model and train our model 
from sklearn.naive_bayes import GaussianNB
naive_bayes_model = GaussianNB()
naive_bayes_model.fit(X_train, y_train)

GaussianNB(priors=None)

### k-Nearest Neighbour

In [9]:
# 3. Choose our model and train our model 
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

### Decision Trees

In [10]:
# 3. Choose our model and train our model 
from sklearn.tree import DecisionTreeClassifier
decision_trees_model = DecisionTreeClassifier()
decision_trees_model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

### Support Vector Machines

In [11]:
# 3. Choose our model and train our model 
from sklearn.svm import SVC
svm_model = SVC()
svm_model.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

### Random Forest Classifier

In [12]:
from sklearn.ensemble import RandomForestClassifier

rfc_model = RandomForestClassifier(max_depth=5, 
                                 min_samples_leaf=10, 
                                 max_features=10, 
                                 bootstrap=False)
rfc_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=5, max_features=10, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=10, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

## 4. Evaluate model

In [13]:
models = [logistic_regression_model, naive_bayes_model, knn_model, decision_trees_model, svm_model, rfc_model]
for model in models:
    generate_evaluation_tables(model, X, y)


LogisticRegression


COMPARING SCORES OF TRAINING AND TEST SET:

training set score: 0.908906
test set score: 0.912013

CLASSIFICATION REPORT:

             precision    recall  f1-score   support

          0       0.93      0.97      0.95     36548
          1       0.66      0.40      0.50      4640

avg / total       0.90      0.91      0.90     41188


CONFUSION MATRIX:

[[35613   935]
 [ 2785  1855]]

GaussianNB


COMPARING SCORES OF TRAINING AND TEST SET:

training set score: 0.861578
test set score: 0.868700

CLASSIFICATION REPORT:

             precision    recall  f1-score   support

          0       0.94      0.91      0.92     36548
          1       0.41      0.51      0.46      4640

avg / total       0.88      0.86      0.87     41188


CONFUSION MATRIX:

[[33178  3370]
 [ 2258  2382]]

KNeighborsClassifier


COMPARING SCORES OF TRAINING AND TEST SET:

training set score: 0.930789
test set score: 0.908420

CLASSIFICATION REPORT:

             precision    recall  f1-sc

## 5. Tuning the Random Forest Classifier with GridSearchCV 

In [14]:
from sklearn.model_selection import GridSearchCV

In [15]:
# NOTE: This will take about 1-2 minutes to run!
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

rfc_model = RandomForestClassifier()
rfc_grid = GridSearchCV(estimator=rfc_model, param_grid=param_grid, cv=5)

rfc_grid.fit(X_train, y_train)

print("Best estimator:", rfc_grid.best_estimator_)
print("Best score:", rfc_grid.best_score_)
print(pd.DataFrame(rfc_grid.cv_results_))

('Best estimator:', RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=None, max_features=10, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=10, min_samples_split=10,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))
('Best score:', 0.91236929850118154)
     mean_fit_time  mean_score_time  mean_test_score  mean_train_score  \
0         0.066529         0.005173         0.887281          0.887281   
1         0.056992         0.004137         0.887281          0.887281   
2         0.063744         0.004260         0.887281          0.887281   
3         0.059304         0.004185         0.887281          0.887281   
4         0.064074         0.004615         0.887281          0.887281   
5         0.063669         0.004324         0.887281          0.887289   
6   
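
The run time of the search follows from the size of the grid. As a rough sketch, `ParameterGrid` can count the candidate combinations for the same `param_grid`, and multiplying by the number of folds gives the number of model fits (before the final refit on the whole training set).

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 3 * 3 * 3 * 2 * 2
n_folds = 5                                    # matches cv=5 above
n_fits = n_candidates * n_folds

print(n_candidates)  # 216
print(n_fits)        # 1080
```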

In [16]:
generate_evaluation_tables(rfc_grid, X, y)


GridSearchCV


COMPARING SCORES OF TRAINING AND TEST SET:

training set score: 0.941698
test set score: 0.915509

CLASSIFICATION REPORT:

             precision    recall  f1-score   support

          0       0.95      0.98      0.96     36548
          1       0.80      0.57      0.66      4640

avg / total       0.93      0.94      0.93     41188


CONFUSION MATRIX:

[[35882   666]
 [ 2005  2635]]


# Conclusion

Key learning points

1. The process of training different classification models is very similar.
2. Without any tuning, DecisionTreeClassifier provides the most accurate predictions. However, this is due to overfitting (as suggested by the training score of 1.0), which is one of the weaknesses of decision trees.
3. The next most accurate models (average precision and recall from the classification reports above) are:
    - SVC, the Support Vector Machine classifier (precision=0.94, recall=0.94)
    - RandomForestClassifier (GridSearched) (precision=0.93, recall=0.94)
    - KNeighborsClassifier (precision=0.92, recall=0.93)
4. We can use GridSearchCV to search for the best-performing combination of hyperparameters for a given model.
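
The overfitting mentioned in point 2 can be reproduced on synthetic data. In this sketch an unconstrained tree memorises the training set (training score of 1.0), while a tree limited to `max_depth=3` cannot. The dataset, the 10% label noise (`flip_y=0.1`) and the depth limit are all assumptions chosen for illustration, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% of labels flipped at random (illustrative assumption).
tree_X, tree_y = make_classification(n_samples=1000, n_features=20,
                                     flip_y=0.1, random_state=0)
t_X_train, t_X_test, t_y_train, t_y_test = train_test_split(
    tree_X, tree_y, random_state=0)

# Unconstrained tree: keeps splitting until every training row is classified correctly.
deep_tree = DecisionTreeClassifier(random_state=0).fit(t_X_train, t_y_train)
# Depth-limited tree: at most 8 leaves, so it cannot memorise noisy labels.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(t_X_train, t_y_train)

for name, tree in [("unconstrained", deep_tree), ("max_depth=3", shallow_tree)]:
    print("%s: training score %f, test score %f" % (
        name, tree.score(t_X_train, t_y_train), tree.score(t_X_test, t_y_test)))
```

A large gap between the training score and the test score is the signal to look for; comparing the two, rather than reading either alone, is what reveals overfitting.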