# Assignment 4 - part 2
Lauri Pessi | bft860

## Dataset: KidCreative.xlsx

In [1]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from collections import namedtuple
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report 

In [2]:
# Get data and take a peek
df = pd.read_excel('http://myy.haaga-helia.fi/~menetelmat/Data-analytiikka/Teaching/KidCreative.xlsx')
df.head()

Unnamed: 0,Obs No.,Buy,Income,Is Female,Is Married,Has College,Is Professional,Is Retired,Unemployed,Residence Length,Dual Income,Minors,Own,House,White,English,Prev Child Mag,Prev Parent Mag
0,1,0,24000,1,0,1,1,0,0,26,0,0,0,1,0,0,0,0
1,2,1,75000,1,1,1,1,0,0,15,1,0,1,1,1,1,1,0
2,3,0,46000,1,1,0,0,0,0,36,1,1,1,1,1,1,0,0
3,4,1,70000,0,1,0,1,0,0,55,0,0,1,1,1,1,1,0
4,5,0,43000,1,0,0,0,0,0,27,0,0,0,0,1,1,0,1


## Preparing the data
Let's first split the data into features X and labels y, and then split those into separate training and test sets.
- Without the split, the model would be fitted on the very same data it is later tested against
    - That would make the evaluation meaningless, as the model has already been shown all of the correct answers
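
To make the point about "already shown answers" concrete, here is a minimal sketch with made-up noise data (not the KidCreative file). The hypothetical "model" simply memorizes the training labels: it looks perfect on the rows it has seen and does no better than a constant guess on rows it has not.

```python
import random

random.seed(5)

# Hypothetical toy data: random feature pairs with coin-flip labels (pure noise)
rows = [(random.random(), random.random()) for _ in range(200)]
labels = [random.randint(0, 1) for _ in rows]

train_rows, test_rows = rows[:150], rows[150:]
train_labels, test_labels = labels[:150], labels[150:]

# A "model" that only memorizes the answers it was shown
memory = dict(zip(train_rows, train_labels))

def predict(row):
    # Unseen rows fall back to class 0
    return memory.get(row, 0)

def accuracy(xs, ys):
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(ys)

train_accuracy = accuracy(train_rows, train_labels)
test_accuracy = accuracy(test_rows, test_labels)
print(f"train accuracy: {train_accuracy:.2f}, test accuracy: {test_accuracy:.2f}")
```

Scoring on the training rows alone would suggest a flawless model, which is exactly the illusion the train/test split guards against.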

In [3]:
# Assign label vector y and features matrix X
y = df['Buy']
X = df.drop(['Obs No.', 'Buy'], axis = 1)

# Split the data to training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 5)

## Tooling
Out of laziness (and to embrace good conventions), let's define a few helper functions so the same steps don't have to be written out repeatedly.

In [4]:
# Utility function for extracting precision, recall and F1 from a 2x2 confusion matrix
# (sklearn's confusion_matrix puts true labels on rows and predictions on columns)
def precisionRecall(cm):
    tn, fp = cm[0]
    fn, tp = cm[1]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = precision * recall / (precision + recall) * 2

    pr = namedtuple('pr', ['Precision', 'Recall', 'F1'])
    return pr(precision, recall, f1)


# A generic function for fitting and scoring the models
def scoreModel(model):

    # Fit the given model using the training set and then apply to testing set
    model.fit(X_train, y_train)
    y_model = model.predict(X_test)

    # Assign scores for return values
    modelName = type(model).__name__
    acc = accuracy_score(y_test, y_model)
    cm = confusion_matrix(y_test, y_model)
    pr = precisionRecall(cm)
    
    scores = namedtuple('scores', ['Model', 'Accuracy', 'Precision', 'Recall', 'F1'])

    return scores(modelName, acc, pr.Precision, pr.Recall, pr.F1)


# Another function that loops the given models through scoreModel and collects the results
def tryModels(models):
    rs = []
    for i in models:
        # Look up the class by name; it must already be imported into the notebook's globals
        classRef = globals()[i]
        model = classRef()
        rs.append(scoreModel(model))

    rs = pd.DataFrame(rs).set_index('Model')

    # Sort the results by accuracy before returning
    rs = rs.sort_values('Accuracy', ascending=False)
    return rs
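
One edge case the helper above does not cover: if a model never predicts the positive class, `tp + fp` is zero and the precision division breaks down. A guarded variant could look like the sketch below. The name `precision_recall_safe`, the choice to report 0.0 in the undefined case, and the example counts are all assumptions made for illustration.

```python
from collections import namedtuple

PR = namedtuple('PR', ['Precision', 'Recall', 'F1'])

def precision_recall_safe(cm):
    """cm is a 2x2 confusion matrix laid out as [[tn, fp], [fn, tp]]."""
    tn, fp = cm[0]
    fn, tp = cm[1]
    # Report 0.0 instead of dividing by zero (a convention chosen for this sketch)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return PR(precision, recall, f1)

# Made-up counts: a model that never predicts the positive class
never_positive = precision_recall_safe([[90, 0], [10, 0]])

# Made-up counts: an ordinary case with all four cells populated
ordinary = precision_recall_safe([[80, 10], [5, 5]])

print(never_positive)
print(ordinary)
```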

## Model comparison
With the functions defined earlier, we can now easily run a bunch of classification models through them and see how they fare against each other.

In [5]:
# Import the models to be tried
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [6]:
# Run the models through the "test suite"
models = ['LogisticRegression', 'GaussianNB', 'DecisionTreeClassifier', 'RandomForestClassifier', 'GradientBoostingClassifier']
tryModels(models)

Model,Accuracy,Precision,Recall,F1
GradientBoostingClassifier,0.928994,0.78125,0.833333,0.806452
RandomForestClassifier,0.923077,0.757576,0.833333,0.793651
GaussianNB,0.91716,0.710526,0.9,0.794118
DecisionTreeClassifier,0.91716,0.75,0.8,0.774194
LogisticRegression,0.721893,0.292683,0.4,0.338028


## Conclusions
Judged by accuracy, the ensemble methods worked best.
- Rerunning with different training/test assignments makes random forest and gradient boosting switch positions
Plain logistic regression didn't fare well at all.
- 72% accuracy may look decent, but a precision of 0.29 means that most of the cases it labeled as buyers were not buyers at all
- The recall (or sensitivity) of 0.40 shows that, on top of the many false positives, it also missed most of the actual buyers
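
As a worked check of these numbers, the sketch below uses the smallest whole-number confusion-matrix counts that reproduce the LogisticRegression row of the table. The counts are an inference from the reported scores, not something the notebook prints, so treat them as an assumption.

```python
# Smallest whole-number counts consistent with the reported LogisticRegression scores
# (inferred from the score table; the notebook never prints the confusion matrix)
tn, fp, fn, tp = 110, 29, 18, 12
total = tn + fp + fn + tp

precision = tp / (tp + fp)   # share of predicted buyers that really bought
recall = tp / (tp + fn)      # share of real buyers that were found
accuracy = (tp + tn) / total
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial classifier that always answers "no buy"
baseline = (tn + fp) / total

print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1, 6))
print(round(baseline, 3))
```

Under these assumed counts, always answering "no buy" would score higher on accuracy than the unscaled logistic regression did, which is why accuracy alone can flatter a model when one class dominates.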

Interestingly, GaussianNB achieved the highest recall of all the models while only tying for third on accuracy and ranking fourth on precision.
- Recall alone is not enough, as a model can achieve perfect recall simply by classifying everything as positive
- Balancing precision against recall (sensitivity) is a trade-off: improving one usually costs the other, unless the model is flawless
    - When identifying potential buyers, the emphasis could be driven by e.g. the cost of converting prospects into buyers
        - If the conversion cost is low, it's better to cast a wide net with a higher-recall model
        - If the cost is high and you want to minimize false positives, precision is the score to optimize
        - The F1 score is the harmonic mean of the two, and simplifies comparisons by offering a single metric
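
The trade-off described above can be sketched by moving a decision threshold over predicted buy-probabilities. The scores and labels below are made up for illustration and do not come from any of the fitted models.

```python
# Hypothetical predicted buy-probabilities and true labels (made up for illustration)
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10]
actual = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

def precision_recall_at(threshold):
    # Everything scoring at or above the threshold is predicted to be a buyer
    predicted = [s >= threshold for s in scores]
    tp = sum(p and a == 1 for p, a in zip(predicted, actual))
    fp = sum(p and a == 0 for p, a in zip(predicted, actual))
    fn = sum((not p) and a == 1 for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    return precision, recall

for threshold in (0.85, 0.5, 0.0):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

With a threshold of 0.0 everything is classified as a buyer, so recall is perfect while precision drops to the share of buyers in the data. Raising the threshold moves the two scores in opposite directions.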


## One more try
Let's see if logistic regression performs better when the continuous variables are normalized to a scale closer to the 0/1 values of the boolean variables.
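
To show what this kind of normalization does to the numbers, here is a small sketch that z-scores the five Income values visible in the `df.head()` output. The mean and standard deviation are those of these five rows only, not of the full dataset, so the resulting values are purely illustrative.

```python
import statistics

# Income values from the five rows shown by df.head()
incomes = [24000, 75000, 46000, 70000, 43000]

# Mean and sample standard deviation (ddof=1, which is also pandas' default for .std())
mean_income = statistics.mean(incomes)
std_income = statistics.stdev(incomes)

# z-score: subtract the mean, divide by the standard deviation
z_scores = [(value - mean_income) / std_income for value in incomes]

for value, z in zip(incomes, z_scores):
    print(f"{value:>6} -> {z:+.2f}")
```

After the transformation the values sit within a couple of units of zero, on the same order of magnitude as the 0/1 indicator columns.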

In [7]:
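# Normalize (z-score) the two continuous variables
# Note: mean and std are computed over the full dataset, including rows that end up in the test set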
X['Income'] = (X['Income'] - X['Income'].mean()) / X['Income'].std()
X['Residence Length'] = (X['Residence Length'] - X['Residence Length'].mean()) / X['Residence Length'].std()

# Split the data to training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 5)

# Run same test pattern against the normalized datasets
tryModels(models)

Model,Accuracy,Precision,Recall,F1
LogisticRegression,0.934911,0.806452,0.833333,0.819672
GradientBoostingClassifier,0.928994,0.78125,0.833333,0.806452
RandomForestClassifier,0.923077,0.757576,0.833333,0.793651
DecisionTreeClassifier,0.911243,0.727273,0.8,0.761905
GaussianNB,0.887574,0.622222,0.933333,0.746667


__Wow!!__

## Final thoughts
- Logistic regression seems to be really sensitive to whether the variables share a similar range of values
    - With non-normalized data its results were quite unsatisfactory
    - After standardizing the two large-scale variables, logistic regression even beat the fancy ensemble models
    - A plausible explanation (not verified here) is that the default solver struggles to converge when one feature is orders of magnitude larger than the others

- What this "test suite" is still missing is averaging the results over multiple runs with randomly assigned training and test sets
    - This likely wouldn't change the big picture, but it might settle the order of the ensemble methods
        - For now, with the normalized dataset, I'd give a shared 2nd place to gradient boosting and random forest.
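
Both points above could be addressed together with repeated cross-validation. The sketch below runs on synthetic stand-in data from `make_classification` (an assumption; the notebook's own X and y would be swapped in) and wraps logistic regression in a pipeline, so the scaler is refitted on the training part of every split. Fold and repeat counts are arbitrary choices. Note that `StandardScaler` divides by the population standard deviation, so its output differs very slightly from the pandas-based normalization used earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in the notebook's X and y)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=5)
X_demo[:, 0] = X_demo[:, 0] * 20000 + 50000  # give one column an income-like scale

candidates = {
    # The scaler's statistics are learned from the training folds only
    'LogisticRegression (scaled)': make_pipeline(StandardScaler(), LogisticRegression()),
    'RandomForestClassifier': RandomForestClassifier(n_estimators=50, random_state=5),
    'GradientBoostingClassifier': GradientBoostingClassifier(n_estimators=50, random_state=5),
}

# 4 folds hold out 25% per split, like train_test_split's default; repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=3, random_state=5)

summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
    summary[name] = (scores.mean(), scores.std())

for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Comparing the means together with their spread gives a fairer basis for ranking the models than a single split with one `random_state`.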

