<h2> Import Libraries </h2>

In [1]:
!pip install lime
!pip install shap
!pip install eli5


In [21]:
#Data handling
import pandas as pd
import numpy as np
import scipy as sp
import gc
import pickle
#preprocessing and feature selection
import sklearn.preprocessing
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2
from sklearn.model_selection import train_test_split, RandomizedSearchCV
#models
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
#evaluation and interpretability
import lime
import shap
import eli5
import sklearn.metrics

<h2> Pre-process Data for Model Training </h2>

<b> STEP 1 - Encode Datasets </b>

In [43]:
alldata = pd.read_pickle('sp500finaldata.pkl')
#Three columns are of object dtype; 'firm' is dropped later, so only 'Sector' and 'DistanceFromLast' need to be label encoded
alldata.dtypes[alldata.dtypes=='object']

firm                object
Sector              object
DistanceFromLast    object
dtype: object

In [44]:
#Label-encode 'Sector' and pickle the fitted encoder for later reuse
labenc = LabelEncoder()
labenc.fit(alldata['Sector'])
with open('labencsector.pkl', 'wb') as q:
    pickle.dump(labenc, q)
print(labenc.transform(alldata['Sector']))
alldata['Sector'] = labenc.transform(alldata['Sector'])
#Same steps for 'DistanceFromLast', using a separate encoder
labenc1 = LabelEncoder()
labenc1.fit(alldata['DistanceFromLast'])
with open('labencdist.pkl', 'wb') as q:
    pickle.dump(labenc1, q)
print(labenc1.transform(alldata['DistanceFromLast']))
alldata['DistanceFromLast'] = labenc1.transform(alldata['DistanceFromLast'])


[4 4 4 ... 4 4 4]
[6 0 0 ... 0 0 0]
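
The fitted encoders are pickled so that the same label-to-integer mapping can be applied again later. The sketch below illustrates that round trip on toy data. The sector names are made up for illustration, and `pickle.dumps`/`pickle.loads` stand in for writing and reading the `.pkl` files. It also shows that a label the encoder never saw during fitting raises a `ValueError`.

```python
import pickle
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical, not taken from the dataset)
encoder = LabelEncoder().fit(["Energy", "Tech", "Health"])

# In-memory stand-in for dumping to and loading from a .pkl file
blob = pickle.dumps(encoder)
restored = pickle.loads(blob)

# classes_ are stored in sorted order, so Energy=0, Health=1, Tech=2
codes = restored.transform(["Tech", "Energy"]).tolist()

# A label not present at fit time cannot be transformed
try:
    restored.transform(["Utilities"])
    unseen_accepted = True
except ValueError:
    unseen_accepted = False

print(codes, unseen_accepted)
```

If new data may contain categories absent from the training set, they need to be handled before calling `transform`, for example by mapping them to an existing category.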


In [45]:
onehot = OneHotEncoder(sparse=False)
#Since the 'firm' feature is dropped below, only 'Sector' and 'DistanceFromLast' need to be one-hot encoded
onehot.fit(alldata[['Sector', 'DistanceFromLast']])
with open('onehotenc.pkl', 'wb') as q:
    pickle.dump(onehot, q)
print(onehot.transform(alldata[['Sector', 'DistanceFromLast']]))
namessector = ['Sector_'+str(i) for i in labenc.classes_]
namesdist = ['DistLast_'+str(i) for i in labenc1.classes_]
names = np.append(namessector, namesdist)
enc = onehot.transform(alldata[['Sector', 'DistanceFromLast']])
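#Reuse alldata's index so that the concat below aligns rows even if the index is not 0..n-1
enc = pd.DataFrame(enc, columns=names, index=alldata.index)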
alldata.drop(['firm', 'Sector', 'DistanceFromLast'], axis=1, inplace=True)
alldata = pd.concat([alldata, enc], axis=1)

[[0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
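
The column names above are built from `labenc.classes_`, which works because `LabelEncoder` assigns integer codes in the sorted order of its classes and `OneHotEncoder` emits one column per code in ascending order. A minimal sketch on toy data (the sector names are hypothetical) that checks this correspondence:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy frame standing in for the 'Sector' column
toy = pd.DataFrame({"Sector": ["Tech", "Energy", "Tech", "Health"]})

# Step 1: strings -> integer codes (Energy=0, Health=1, Tech=2)
le = LabelEncoder()
codes = le.fit_transform(toy["Sector"])

# Step 2: integer codes -> one indicator column per code.
# The default output is a sparse matrix, so convert it with toarray().
ohe = OneHotEncoder()
dense = ohe.fit_transform(codes.reshape(-1, 1)).toarray()

# Column i of the one-hot output corresponds to le.classes_[i]
cols = ["Sector_" + str(c) for c in le.classes_]
encoded = pd.DataFrame(dense, columns=cols, index=toy.index)
print(encoded)
```

Each row contains exactly one 1, in the column named after its original sector.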


<b> STEP 2 - Separate Two Models </b>

We split the project into two models: a classifier that predicts 'Div_Paid?' (whether or not a dividend will be paid), and a regressor that predicts the dividend amount for the observations where a dividend is paid. Each model gets its own dataset.

In [55]:
classifdata = alldata.copy()
classifdata.drop(['Dividend'], axis=1, inplace=True)
regrdata = alldata[alldata['Div_Paid?']==1].copy()
regrdata.drop(['Div_Paid?'], axis=1, inplace=True)
classifdata.to_pickle('classifdatafull.pkl')
regrdata.to_pickle('regrdatafull.pkl')

<h2> Classifier Model - Dividend Outlook </h2>

We start with the classifier model, beginning with feature selection. We then try three different estimators: RandomForestClassifier, GaussianProcessClassifier and MLPClassifier.

<b> STEP 1 - Feature Elimination </b>

We perform only basic feature elimination here and leave further selection to each model's fit. The two filters applied are a chi2 test (which scores each feature's relationship to the target variable) and VarianceThreshold (which removes features whose values do not vary).

In [46]:
#With the default threshold of 0.0, only constant (zero-variance) features are flagged
t = VarianceThreshold()
t.fit(alldata)

VarianceThreshold(threshold=0.0)

In [52]:
#Columns rejected by the variance filter
alldata.columns.values[~t.get_support()]

array(['Accumulated Depreciation'], dtype=object)
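
To illustrate how the two filters behave, here is a minimal sketch on toy data. The column names, values and `k=1` are assumptions chosen for illustration, not the notebook's actual settings. Because `chi2` requires non-negative inputs, the features are rescaled to [0, 1] with `MinMaxScaler` before scoring.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Toy features: one related to the target, one unrelated, one constant
toy = pd.DataFrame({
    "informative": [0.0, 0.1, 0.2, 0.9, 1.0, 0.8],
    "noise":       [0.5, -0.4, 0.6, 0.5, -0.4, 0.6],
    "constant":    [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
})
target = np.array([0, 0, 0, 1, 1, 1])

# Filter 1: drop features with zero variance
vt = VarianceThreshold()
vt.fit(toy)
dropped = toy.columns[~vt.get_support()].tolist()
kept = toy.loc[:, vt.get_support()]

# Filter 2: chi2 scoring; rescale first because chi2 rejects negative values
scaled = MinMaxScaler().fit_transform(kept)
selector = SelectKBest(chi2, k=1).fit(scaled, target)
best = kept.columns[selector.get_support()].tolist()

print("dropped by variance filter:", dropped)
print("top feature by chi2:", best)
```

Here the constant column is removed by the variance filter, and the chi2 score ranks the feature that tracks the target above the one that does not.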