<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#EDA" data-toc-modified-id="EDA-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>EDA</a></span><ul class="toc-item"><li><span><a href="#Data-Cleaning" data-toc-modified-id="Data-Cleaning-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Data Cleaning</a></span></li></ul></li><li><span><a href="#Modeling" data-toc-modified-id="Modeling-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Modeling</a></span><ul class="toc-item"><li><span><a href="#Main-Label" data-toc-modified-id="Main-Label-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Main Label</a></span></li><li><span><a href="#Location-Label" data-toc-modified-id="Location-Label-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Location Label</a></span></li></ul></li></ul></div>

# Introduction

Create a multi-layer perceptron neural network model to predict on a labeled dataset of your choosing. Compare this model to either a boosted tree or a random forest model and describe the relative tradeoffs between complexity and accuracy. Be sure to vary the hyperparameters of your MLP!

# Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import accuracy_score
from sklearn import ensemble
from sklearn.neural_network import MLPClassifier
import time

import operator
import os


%matplotlib inline

# Options for pandas
pd.options.display.max_columns = 100
pd.options.display.max_rows = 150

In [4]:
user2 = pd.read_csv('data/user2.features_labels.csv')

# EDA

## Data Cleaning

In [5]:
#Use linear interpolate with grouped label?

In [6]:
# general function used to clean dataframes. 

def clean_df(df, main_labels, loc_labels, secondary_labels, interpolate='linear'):
    
    label_col = [col for col in list(df.columns) if col[:6] == 'label:']
    
    for col in label_col:
        df[col].fillna(0, axis=0, inplace=True)

    #drop observations with no labels
    no_label_index = list(df[df[label_col].eq(0).all(1)].index)

    df.drop(df.index[no_label_index], axis=0, inplace=True)

    #drop all columns where all observations are nans
    drop_col = df.loc[:, df.isnull().sum()/df.isnull().count()*100 == 100].columns
    df.drop(drop_col, axis=1, inplace=True)
    
    #interpolate
    nan_col = [col for col in list(df.columns) if df[col].isnull().sum()/df[col].isnull().count()*100 > 0]

    for col in nan_col:
        df[col].interpolate(method='linear', limit_direction='both', inplace=True)
    
    #dictionary that separates labels by categories
    label_dict = {'main_label': main_labels, 'loc_label': loc_labels, 'secondary_label': secondary_labels}
    
    #finds labels that fall under each label category above and adds it to new column relating to label category.
    for label, lst in label_dict.items():
        #init dict
        df_label_dict = {i: '' for i in range(len(df))}
        # add labels to dict if labels is present in respective index
        for i in range(len(df)):
            for col in lst:
                if df[col].iloc[i] == 1:
                    #creates multiclass label
                    df_label_dict[i] += col + ' '
        df[label] = pd.Series(df_label_dict).apply(lambda x: x if x != '' else np.NaN)
    
    return df

In [7]:
label_col = label_col = [col for col in list(user1.columns) if col[:6] == 'label:']

main_labels = ['label:LYING_DOWN', 'label:SITTING', 'label:FIX_running', 'label:OR_standing','label:SLEEPING', 
               'label:FIX_walking']

loc_labels = ['label:LAB_WORK', 'label:IN_CLASS', 'label:IN_A_MEETING', 'label:LOC_main_workplace','label:OR_indoors',
 'label:OR_outside', 'label:IN_A_CAR', 'label:ON_A_BUS', 'label:LOC_home', 'label:FIX_restaurant','label:SHOPPING',
'label:AT_A_PARTY', 'label:AT_A_BAR', 'label:LOC_beach', 'label:AT_THE_GYM', 'label:ELEVATOR', 'label:AT_SCHOOL']

not_secondary_labels = main_labels + loc_labels
secondary_labels = [col for col in label_col if col not in not_secondary_labels]

user2 = clean_df(user2, main_labels, loc_labels, secondary_labels)

# Modeling

In [29]:
not_features_list = label_col + ['timestamp', 'label_source', 'main_label', 'loc_label', 'secondary_label', 'timestamp' ]
all_feature_list = [col for col in list(user2.columns) if col not in not_features_list]

def predict_label(user, model_dict, feature_list, label):
    data = user[user[label] != '']
    data = data.dropna(subset=[label])

    train_x, test_x, train_y, test_y = train_test_split(data[feature_list], data[label], test_size=.3)
    print(train_x.shape)
    print(len(test_y.unique()))
    for name,model in model_dict.items():
        start_time = time.time()
        model.fit(train_x, train_y)
        end_time = time.time()
        score = model.score(test_x, test_y)
        print()
        print('{} accuracy: {}'.format(name, score))
        print('Runtime: ', end_time - start_time)



## Main Label

In [57]:
rfc = ensemble.RandomForestClassifier()
hidden_sizes = [150]
mlp = MLPClassifier(hidden_layer_sizes=hidden_sizes, alpha=.0001, learning_rate_init=.001, solver='adam',
                    activation='logistic')

model_dict = {'rfc': rfc, 'mlp': mlp}

main_model_dict = predict_label(user2, model_dict, all_feature_list, 'main_label')

(3259, 220)
5





rfc accuracy: 0.9163090128755365
Runtime:  0.2814369201660156

mlp accuracy: 0.8197424892703863
Runtime:  7.413454055786133


## Location Label

In [62]:
rfc = ensemble.RandomForestClassifier()
hidden_sizes = [150]
mlp = MLPClassifier(hidden_layer_sizes=hidden_sizes, alpha=.0001, learning_rate_init=.001, solver='adam',
                    activation='logistic')

model_dict = {'rfc': rfc, 'mlp': mlp}

main_model_dict = predict_label(user2, model_dict, all_feature_list, 'loc_label')

(4448, 220)
11





rfc accuracy: 0.9124278972207656
Runtime:  0.42943406105041504

mlp accuracy: 0.7477713686418458
Runtime:  12.490822076797485




With the small amount of training data and moderate amount of features, one hidden layer was appropriate. Using half of the input size as the hidden layer size had the best accuracy results. Since this is a classification problem, logistic sigmoid activate function was used to transform the output to a number between 0-1. The output represents the predicted probability for each target class/label.

Compared to the Random Forrest Classifier (RFC) with default parameters, the Multi-Layer Perceptron (MLP) under performed even after hyper-parameters tuning. In addition, the runtime for MLP was much larger comapred to the RFC. 