# Assignment 4 - part 2
Lauri Pessi | bft860

## Dataset: KidCreative.xlsx

In [1]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from collections import namedtuple
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report 

In [2]:
# Get data and take a peek
df = pd.read_excel('http://myy.haaga-helia.fi/~menetelmat/Data-analytiikka/Teaching/KidCreative.xlsx')
df.head()

,Obs No.,Buy,Income,Is Female,Is Married,Has College,Is Professional,Is Retired,Unemployed,Residence Length,Dual Income,Minors,Own,House,White,English,Prev Child Mag,Prev Parent Mag
0,1,0,24000,1,0,1,1,0,0,26,0,0,0,1,0,0,0,0
1,2,1,75000,1,1,1,1,0,0,15,1,0,1,1,1,1,1,0
2,3,0,46000,1,1,0,0,0,0,36,1,1,1,1,1,1,0,0
3,4,1,70000,0,1,0,1,0,0,55,0,0,1,1,1,1,1,0
4,5,0,43000,1,0,0,0,0,0,27,0,0,0,0,1,1,0,1


In [3]:
# Assign label vector y and features matrix X
y = df['Buy']
X = df.drop(['Obs No.', 'Buy'], axis = 1)

# Split the data to training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 5)

In [7]:
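# Utility function for precision/recall
def precisionRecall(cm):
    # sklearn's confusion_matrix puts true labels on rows and predictions on columns,
    # so for 0/1 labels the layout is [[tn, fp], [fn, tp]]
    tn, fp = cm[0]
    fn, tp = cm[1]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)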

    pr = namedtuple('pr', ['Precision', 'Recall'])
    return pr(precision, recall)

# Define a generic function for testing the models
def scoreModel(model):

    # Fit the given model using the training set and then apply to testing set
    model.fit(X_train, y_train)
    y_model = model.predict(X_test)

    # Assign scores for return values
    modelName = type(model).__name__
    acc = accuracy_score(y_test, y_model)
    cm = confusion_matrix(y_test, y_model)
    pr = precisionRecall(cm)
    f1 = 2 * pr.Precision * pr.Recall / (pr.Precision + pr.Recall)
    

    scores = namedtuple('scores', ['Model', 'Accuracy', 'Precision', 'Recall', 'F1', 'Fit'])

    return scores(modelName, acc, pr.Precision, pr.Recall, f1, model)



# Another function for looping the models through scoreModel and collecting the results

def tryModels(models):
    rs = []
    for name in models:
        # Look up the class by name; it must already be imported into the notebook namespace
        classRef = globals()[name]
        model = classRef()
        rs.append(scoreModel(model))

    rs = pd.DataFrame(rs).set_index('Model')
    return rs
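
A side note on the precision/recall helper: it divides by `tp + fp` and `tp + fn`, which breaks down if a model never predicts a buyer. Below is a minimal sketch of a guarded variant on plain nested lists. The name `precision_recall_f1` and the choice to return 0.0 for an empty denominator are assumptions made for this illustration, not part of the notebook above.

```python
from collections import namedtuple

PR = namedtuple('PR', ['Precision', 'Recall', 'F1'])

def precision_recall_f1(cm):
    """cm is laid out as [[tn, fp], [fn, tp]] (rows = true labels, columns = predictions)."""
    (tn, fp), (fn, tp) = cm
    # Fall back to 0.0 when a denominator is zero (assumed convention for this sketch)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return PR(precision, recall, f1)

# Toy matrix: 5 true negatives, 1 false positive, 2 false negatives, 2 true positives
toy = precision_recall_f1([[5, 1], [2, 2]])
print(toy)

# A model that never predicts the positive class
never = precision_recall_f1([[8, 0], [2, 0]])
print(never)
```

The toy matrix is small enough to check by hand: 2 of the 3 predicted positives are correct, and 2 of the 4 actual positives are found.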

In [5]:
# Import the models to be tried
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [22]:
# Run the models through the "test suite"
models = ['LogisticRegression', 'GaussianNB', 'DecisionTreeClassifier', 'RandomForestClassifier', 'GradientBoostingClassifier']
rs = tryModels(models)

# Check scores for each model sorted best first based on accuracy score
rs[['Accuracy', 'Precision', 'Recall', 'F1']].sort_values('Accuracy', ascending=False)

Model,Accuracy,Precision,Recall,F1
GradientBoostingClassifier,0.928994,0.78125,0.833333,0.806452
RandomForestClassifier,0.923077,0.774194,0.8,0.786885
GaussianNB,0.91716,0.710526,0.9,0.794118
DecisionTreeClassifier,0.911243,0.727273,0.8,0.761905
LogisticRegression,0.721893,0.292683,0.4,0.338028


## Conclusions
Based on accuracy, the ensemble methods, led by GradientBoostingClassifier, worked best.
Classic logistic regression didn't fare well at all.
- While the accuracy of 72% may look decent, the precision of 0.29 shows that most of the cases it labeled as buyers were not buyers
- The recall (or sensitivity) of 0.40 shows that, in addition to producing many false positives, it also missed most of the actual buyers
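
To make these two bullets concrete, the confusion-matrix counts can be recovered from the rounded scores in the table. The sketch below does a brute-force search for integer counts that reproduce the reported accuracy, precision and recall. The test-set size of 169 rows is an assumption: the notebook never prints it, and 169 is simply the smallest size consistent with the scores. The function name `counts_from_scores` is made up for this example.

```python
def counts_from_scores(n, accuracy, precision, recall, tol=1e-6):
    """Return every (tn, fp, fn, tp) over n cases that reproduces the rounded scores."""
    matches = []
    for tp in range(1, n + 1):
        for fp in range(0, n - tp + 1):
            # Skip early when precision does not match
            if abs(tp / (tp + fp) - precision) > tol:
                continue
            for fn in range(0, n - tp - fp + 1):
                tn = n - tp - fp - fn
                recall_ok = abs(tp / (tp + fn) - recall) <= tol
                accuracy_ok = abs((tp + tn) / n - accuracy) <= tol
                if recall_ok and accuracy_ok:
                    matches.append((tn, fp, fn, tp))
    return matches

N_TEST = 169  # ASSUMPTION: inferred from the scores, not printed in the notebook

# Scores copied from the LogisticRegression row of the results table
lr_counts = counts_from_scores(N_TEST, 0.721893, 0.292683, 0.4)
print(lr_counts)

tn, fp, fn, tp = lr_counts[0]
# Accuracy of a trivial model that always predicts "no buy"
baseline = (tn + fp) / N_TEST
print(f'Always-no baseline accuracy: {baseline:.3f}')
```

Under that assumption the search finds a single candidate: 110 true negatives, 29 false positives, 18 false negatives and 12 true positives. That would put the always-"no" baseline at roughly 82%, above the 72% logistic regression reached.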

Interestingly enough, GaussianNB got the highest recall of all the models while ranking third on accuracy, fourth on precision and second on F1.
- Recall by itself is not enough, as a model can get a perfect recall score by simply classifying everything as positive
- The emphasis between precision and recall/sensitivity is a balancing act: improving one usually costs the other (unless the model is flawless)
    - When identifying potential buyers, the emphasis could be driven by the cost of contacting the candidates
        - If the contact cost is low, it's better to cast a wide net with a model that has higher recall
        - If the cost is high and you want to minimize false positives, then precision is the score to optimize
        - The F1 score is the harmonic mean of the two and simplifies comparisons by offering a single metric
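
The balancing act is easiest to see by moving the decision threshold on predicted probabilities. The sketch below uses made-up scores and labels (not the KidCreative data) and a hypothetical helper `precision_recall_at`. With a fitted sklearn classifier, the scores would typically come from `predict_proba(X_test)[:, 1]`.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when every score >= threshold is labeled positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up predicted probabilities and true labels, for illustration only
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

tradeoff = {t: precision_recall_at(t, scores, labels) for t in (0.25, 0.5, 0.75)}
for t, (p, r) in tradeoff.items():
    print(f'threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}')
```

On this toy data a low threshold catches every buyer at the price of more false positives, and a high threshold does the opposite.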


## One more try
Let's see if logistic regression performs better when the continuous variables (such as Income and Residence Length) are normalized to a range close to the 0/1 values of the boolean columns.
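
One possible way to set this up is a pipeline that min-max scales only the continuous columns and passes the boolean ones through unchanged. This is a sketch under assumptions: it runs on synthetic stand-in data rather than KidCreative.xlsx, the target rule is invented, and the names `demo`, `continuous` and `scaled_lr` are placeholders. It says nothing about how the real data would score.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (ASSUMPTION: not the real dataset)
rng = np.random.default_rng(5)
n = 400
demo = pd.DataFrame({
    'Income': rng.integers(10_000, 100_000, n),
    'Residence Length': rng.integers(1, 60, n),
    'Is Married': rng.integers(0, 2, n),
})
# Invented target rule, only so the example has something to learn
demo_y = (demo['Income'] + 20_000 * demo['Is Married'] > 70_000).astype(int)

# Scale the continuous columns to [0, 1]; leave the 0/1 columns untouched
continuous = ['Income', 'Residence Length']
scaler = ColumnTransformer([('minmax', MinMaxScaler(), continuous)],
                           remainder='passthrough')
scaled_lr = make_pipeline(scaler, LogisticRegression())

Xd_train, Xd_test, yd_train, yd_test = train_test_split(demo, demo_y, random_state=5)

# The pipeline fits the scaler on the training rows only, then the classifier
scaled_lr.fit(Xd_train, yd_train)
demo_acc = scaled_lr.score(Xd_test, yd_test)
print(f'Accuracy on synthetic data: {demo_acc:.3f}')
```

Because the pipeline exposes `fit` and `predict`, it could also be passed straight to `scoreModel`; the reported model name would then be `Pipeline`.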


rs