In [1]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# About the Data

The dataset for this challenge comes from the European Social Survey. Our objective is to determine which variables we can use to predict whether a person has a partner, and how important each variable is for predicting that outcome.

# Exercise

Our initial decision tree predicted whether someone has a partner with an error rate of 6.258% for false positives and 18.528% for false negatives. The challenge here is to reduce those error rates by modifying the features and the tree.

In [2]:
df = pd.read_csv((
    "https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
    "master/ESS_practice_data/ESSdata_Thinkful.csv")).dropna()

In [3]:
def booster(X, y, iterations, loss, depth):
    # Create training and test sets (a new random 80/20 split on every call).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Set the number of boosting iterations, the tree depth, and the loss
    # function from the arguments. Newer scikit-learn releases name the
    # 'deviance' loss 'log_loss'.
    params = {'n_estimators': iterations,
              'max_depth': depth,
              'loss': loss}

    # Initialize and fit the model.
    clf = ensemble.GradientBoostingClassifier(**params)
    clf.fit(X_train, y_train)

    # Predict on both the training and the test set.
    predict_train = clf.predict(X_train)
    predict_test = clf.predict(X_test)

    # Confusion tables: rows are actual values, columns are predictions.
    table_train = pd.crosstab(y_train, predict_train, margins=True)
    table_test = pd.crosstab(y_test, predict_test, margins=True)

    # Type I (false positive) and Type II (false negative) errors,
    # each as a fraction of all observations in the set.
    train_tI_errors = table_train.loc[0.0, 1.0] / table_train.loc['All', 'All']
    train_tII_errors = table_train.loc[1.0, 0.0] / table_train.loc['All', 'All']

    test_tI_errors = table_test.loc[0.0, 1.0] / table_test.loc['All', 'All']
    test_tII_errors = table_test.loc[1.0, 0.0] / table_test.loc['All', 'All']

    print((
        'Training set accuracy:\n'
        'Percent Type I errors: {}\n'
        'Percent Type II errors: {}\n\n'
        'Test set accuracy:\n'
        'Percent Type I errors: {}\n'
        'Percent Type II errors: {}'
    ).format(train_tI_errors, train_tII_errors, test_tI_errors, test_tII_errors))
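
The two error rates printed by `booster` are the off-diagonal cells of the confusion table divided by the total number of observations. As a minimal sketch on a small made-up set of labels (the arrays below are illustrative and not taken from the survey data), the same quantities can be read from `sklearn.metrics.confusion_matrix`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (0 = negative, 1 = positive).
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# With labels=[0, 1], ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

total = len(y_true)
type_I_rate = fp / total   # false positives as a fraction of all observations
type_II_rate = fn / total  # false negatives as a fraction of all observations

print(type_I_rate, type_II_rate)
```

Passing `labels` explicitly keeps the table 2x2 even if one class is never predicted, whereas looking up a missing column in a crosstab with `.loc` raises a `KeyError`.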

In [4]:
# Define outcome and predictors.
# Set our outcome to 0 and 1.
y = df['partner'] - 1
X1 = df.loc[:, ~df.columns.isin(['partner', 'cntry', 'idno'])]

# Make the categorical variable 'cntry' (country) into dummies.
X1 = pd.concat([X1, pd.get_dummies(df['cntry'])], axis=1)

booster(X1, y, 500, 'deviance', 2)

Training set accuracy:
Percent Type I errors: 0.047107564830443455
Percent Type II errors: 0.17247199631732393

Test set accuracy:
Percent Type I errors: 0.06441717791411043
Percent Type II errors: 0.18527607361963191


In [5]:
# Changed the loss from deviance to exponential: fewer Type I errors
# but more Type II errors on the test set.
booster(X1, y, 500, 'exponential', 2)

Training set accuracy:
Percent Type I errors: 0.04741445450360595
Percent Type II errors: 0.1755408930489489

Test set accuracy:
Percent Type I errors: 0.05337423312883435
Percent Type II errors: 0.20552147239263804


In [6]:
# Doubled the number of iterations: slightly more accurate on the test set.
booster(X1, y, 1000, 'deviance', 2)

Training set accuracy:
Percent Type I errors: 0.04864201319625595
Percent Type II errors: 0.1704772134417677

Test set accuracy:
Percent Type I errors: 0.06012269938650307
Percent Type II errors: 0.17975460122699385


In [7]:
# 20x the baseline number of iterations: lower training error
# but higher test error (overfitting).
booster(X1, y, 10000, 'deviance', 2)

Training set accuracy:
Percent Type I errors: 0.03299063986496854
Percent Type II errors: 0.13180911462329292

Test set accuracy:
Percent Type I errors: 0.08159509202453988
Percent Type II errors: 0.18834355828220858


In [16]:
# Doubled the tree depth: lower training error but slightly higher test error.
booster(X1, y, 500, 'deviance', 4)

Training set accuracy:
Percent Type I errors: 0.01764615620684364
Percent Type II errors: 0.11048028233849931

Test set accuracy:
Percent Type I errors: 0.06932515337423313
Percent Type II errors: 0.18650306748466258


In [9]:
# Transformed feature sets: squared and square-root versions of every predictor.
X2 = X1 ** 2
X3 = np.sqrt(X1)

In [10]:
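# Note: X1 + X2 adds the two frames element-wise (x + x**2);
# it does not append the squared columns as extra features.
booster(X1 + X2, y, 500, 'deviance', 2)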

Training set accuracy:
Percent Type I errors: 0.04618689581095596
Percent Type II errors: 0.17170477213441768

Test set accuracy:
Percent Type I errors: 0.0705521472392638
Percent Type II errors: 0.19877300613496932
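
Note that `X1 + X2` adds the two DataFrames element-wise, so each column becomes `x + x**2` and the model never sees the original and squared features side by side. If the intent was to offer both versions to the model, one option is to append the squared columns under new names. A small sketch on a toy frame (the column names and the `_sq` suffix are made up for illustration):

```python
import pandas as pd

# Toy stand-in for X1; the column names are hypothetical.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [0.0, 1.0, 4.0]})
toy_sq = toy ** 2

# Element-wise sum: same columns, values x + x**2.
summed = toy + toy_sq

# Column-wise concatenation: original columns plus renamed squared columns.
combined = pd.concat([toy, toy_sq.add_suffix('_sq')], axis=1)

print(summed.shape, combined.shape)
print(list(combined.columns))
```

For non-negative features, `x + x**2` preserves the ordering of the values, so a tree-based model is likely to find essentially the same splits as with `X1` alone.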


In [11]:
# Square-root-transformed features.
booster(X3, y, 500, 'deviance', 2)

Training set accuracy:
Percent Type I errors: 0.047261009667024706
Percent Type II errors: 0.17569433788553016

Test set accuracy:
Percent Type I errors: 0.05889570552147239
Percent Type II errors: 0.18466257668711655


In [12]:
# Rerun of the baseline configuration (new random split).
booster(X1, y, 500, 'deviance', 2)

Training set accuracy:
Percent Type I errors: 0.04894890286941844
Percent Type II errors: 0.1758477827221114

Test set accuracy:
Percent Type I errors: 0.050920245398773004
Percent Type II errors: 0.18282208588957055


In [13]:
# Rerun with the exponential loss (new random split).
booster(X1, y, 500, 'exponential', 2)

Training set accuracy:
Percent Type I errors: 0.04940923737916219
Percent Type II errors: 0.17692189657818014

Test set accuracy:
Percent Type I errors: 0.06503067484662577
Percent Type II errors: 0.18404907975460122


# Analysis

Modifying our model parameters suggests a few ways to improve the model. Doubling the number of iterations from 500 to 1,000 slightly improved test accuracy, but there can be too much of a good thing: running 10,000 iterations (20 times the baseline) lowered the training error while raising the test error, which indicates overfitting. Taking the square root of our features also seems to give slightly better predictions on the test set.
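
One way to look at the effect of the number of iterations without refitting for each setting is `staged_predict`, which yields the predictions after every boosting iteration of a single fitted model. The following is a minimal sketch on synthetic stand-in data (generated with `make_classification`, not the survey data), with all variable names chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the survey data used above.
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)

clf_demo = GradientBoostingClassifier(n_estimators=200, max_depth=2,
                                      random_state=0)
clf_demo.fit(X_tr, y_tr)

# staged_predict yields the predictions after each boosting iteration,
# so one fit gives the whole error-versus-iterations curve.
train_err = [np.mean(pred != y_tr) for pred in clf_demo.staged_predict(X_tr)]
test_err = [np.mean(pred != y_te) for pred in clf_demo.staged_predict(X_te)]

best_n = int(np.argmin(test_err)) + 1
print('Iterations with lowest test error:', best_n)
print('Final train/test error:', train_err[-1], test_err[-1])
```

A training error that keeps falling while the test error flattens or rises is the overfitting pattern described above.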

On the other hand, changing the loss type from deviance to exponential gave a mixed result: our rate of false positives decreased, but our rate of false negatives increased by a similar amount.
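
Because `booster` draws a new random train/test split on every call, two runs of the same configuration give different numbers (compare cells In [4] and In [12]), which makes single-run comparisons between loss functions hard to read. One way to steady the comparison is to reuse the same seeded splits for both losses and look at the mean and spread. Below is a minimal sketch of that idea, assuming synthetic stand-in data and a hypothetical helper `error_rates`; the library's default loss corresponds to what this notebook calls deviance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the survey data used above.
X_syn, y_syn = make_classification(n_samples=400, n_features=8,
                                   n_informative=4, random_state=1)

def error_rates(loss, seed):
    """Return (Type I, Type II) test error fractions for one seeded split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_syn, y_syn, test_size=0.2, random_state=seed)
    # loss=None means: fall back to the library's default loss.
    kwargs = {} if loss is None else {'loss': loss}
    model = GradientBoostingClassifier(n_estimators=50, max_depth=2,
                                       random_state=0, **kwargs)
    pred = model.fit(X_tr, y_tr).predict(X_te)
    type_I = np.mean((pred == 1) & (y_te == 0))   # false positives
    type_II = np.mean((pred == 0) & (y_te == 1))  # false negatives
    return type_I, type_II

seeds = range(5)
results = {}
for name, loss in [('default', None), ('exponential', 'exponential')]:
    # Same seeds for both losses, so both see identical splits.
    rates = np.array([error_rates(loss, s) for s in seeds])
    results[name] = rates
    print(name, 'mean:', rates.mean(axis=0), 'std:', rates.std(axis=0))
```

If the gap between the two losses is smaller than the spread across splits, the difference is probably not meaningful.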