## Naive Bayes Experiments
Given the performance we have been getting with Naive Bayes, I will focus on improving it. I will work only with the top 50 features and check whether the following modifications improve performance:
* Create more features: quintiles, log, identity, exp, interaction terms.
* Try variations of Naive Bayes.
* Tune its hyperparameters.

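## Results:
* Adding quintiles, exponentials, and square terms does not improve performance.
* Interaction terms (i.e., second-order terms) have not been tried yet.
* Tuning the hyperparameters of Gaussian Naive Bayes (`var_smoothing`) helps only marginally.
* Other versions of Naive Bayes: Multinomial Naive Bayes scores 0.804, below the 0.839 benchmark; the recorded Complement Naive Bayes run shows the same 0.804; Bernoulli Naive Bayes has not been tried.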
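
Since interaction terms are still untried, here is a minimal sketch of one way to add them, assuming scikit-learn's `PolynomialFeatures` is acceptable for the job. The data and the names `X_demo`, `y_demo`, and `interaction_pipeline` are made up for illustration and are not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for a handful of features (not the real data).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4))
# The label depends on a product of two features.
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)

interaction_pipeline = Pipeline([
    # interaction_only=True keeps x_i and x_i * x_j (i < j) and skips the squares.
    ('interactions', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ('scaler', StandardScaler()),
    ('classifier', GaussianNB()),
])
interaction_pipeline.fit(X_demo, y_demo)

# 4 original columns + 6 pairwise products
n_features_out = interaction_pipeline.named_steps['interactions'].transform(X_demo).shape[1]
```

With 50 input features the same step would produce 50 + 1225 = 1275 columns, so it may be worth restricting it to a smaller subset first.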

In [24]:
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import BernoulliNB, ComplementNB, GaussianNB, MultinomialNB

from load_data import load_data
from pipeline_utilities import create_base_pipeline, create_positive_base_pipeline

In [23]:
X, X_ensemble, X_dropout, y, y_ensemble, y_dropout, train_csv, test_csv = load_data()

## Get 'most relevant' features

In [6]:
feature_importance = pd.read_csv('feature_importance.csv', index_col='feature').sort_values(by='average_rank')
top10_features = list(feature_importance.index[:10])
top20_features = list(feature_importance.index[:20])
top50_features = list(feature_importance.index[:50])
top100_features = list(feature_importance.index[:100])
top150_features = list(feature_importance.index[:150])
all_features = list(feature_importance.index)
feature_importance.head()

| feature | tree_importance | linear_importance | linear_rank | tree_rank | average_rank |
|---|---|---|---|---|---|
| var_81 | 0.015771 | 0.221647 | 1.0 | 1.0 | 1.0 |
| var_139 | 0.011978 | 0.202066 | 2.0 | 3.0 | 2.5 |
| var_12 | 0.012252 | 0.187078 | 4.0 | 2.0 | 3.0 |
| var_53 | 0.010661 | 0.177735 | 5.0 | 4.0 | 4.5 |
| var_110 | 0.010287 | 0.172094 | 8.0 | 5.0 | 6.5 |
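
The rank columns above are consistent with ranking each importance in descending order and averaging the two ranks. How `feature_importance.csv` was actually produced is not shown here, so the following is only a sketch under that assumption, with made-up feature names and values.

```python
import pandas as pd

# Toy importances for four hypothetical features (values are made up).
importance = pd.DataFrame(
    {'tree_importance': [0.40, 0.30, 0.10, 0.20],
     'linear_importance': [0.50, 0.20, 0.30, 0.10]},
    index=pd.Index(['f_a', 'f_b', 'f_c', 'f_d'], name='feature'),
)

# Rank 1 = most important; tied values would share the average of their ranks.
importance['tree_rank'] = importance['tree_importance'].rank(ascending=False)
importance['linear_rank'] = importance['linear_importance'].rank(ascending=False)
importance['average_rank'] = importance[['tree_rank', 'linear_rank']].mean(axis=1)

# Same selection pattern as in the cell above: sort, then slice the index.
ranked = importance.sort_values(by='average_rank')
top2_features = list(ranked.index[:2])
```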


In [7]:
X.head()

| ID_code | var_0 | var_1 | var_2 | var_3 | var_4 | var_5 | var_6 | var_7 | var_8 | var_9 | ... | var_190 | var_191 | var_192 | var_193 | var_194 | var_195 | var_196 | var_197 | var_198 | var_199 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| train_20237 | 11.4978 | -3.0777 | 11.9565 | 8.8948 | 11.2996 | -10.4822 | 3.827 | 11.438 | -3.4475 | 7.4598 | ... | -6.0965 | 3.9164 | -0.4457 | -0.9721 | 21.6714 | 1.5506 | 6.3634 | 8.135 | 21.6712 | -8.5367 |
| train_198821 | 6.7885 | 0.7269 | 5.8945 | 5.4038 | 9.9259 | -2.9944 | 5.4108 | 16.3829 | 2.058 | 8.1804 | ... | 6.1964 | 11.602 | 3.5603 | 0.6766 | 20.0554 | -1.5801 | 5.3351 | 8.6182 | 9.6734 | -13.4286 |
| train_188009 | 8.6802 | 3.531 | 7.9625 | 7.6073 | 11.0424 | 5.7768 | 4.3228 | 20.5002 | 4.8297 | 7.4733 | ... | 7.9154 | 7.6972 | 1.6893 | 4.8253 | 17.645 | 0.972 | -1.3828 | 7.3101 | 13.2607 | -9.4329 |
| train_12214 | 13.1322 | -0.5355 | 8.9295 | 9.2934 | 11.0048 | -1.6971 | 5.3749 | 14.7257 | 3.6912 | 9.225 | ... | 8.5705 | 8.4554 | 1.4635 | 1.0137 | 14.6109 | 1.8852 | 8.9966 | 8.0651 | 19.3677 | -14.6202 |
| train_133939 | 10.8787 | -0.6815 | 5.0 | 5.9295 | 9.3202 | -3.5659 | 3.5086 | 18.0152 | -0.2265 | 7.8618 | ... | 6.2816 | 7.193 | 1.7058 | -2.6972 | 21.1006 | 0.9372 | -4.3809 | 8.9797 | 20.3274 | -19.3787 |


In [None]:
from pipeline_utilities import BucketDistributor, NumericFeaturesTransformer, PandasWrapper
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


def create_pipeline_with_artificial_features(columns, transformations, num_buckets, Classifier=None):
    """Build a pipeline that feeds transformed and bucketed features to a classifier.

    No cross-validation is done here yet (cv=2 may not even be necessary).
    """
    if Classifier is None:
        # Create the default here so separate calls do not share one estimator instance.
        Classifier = GaussianNB()

    # Scale, apply the numeric transformations, then scale the transformed columns again.
    scaler_transformer_scaler = Pipeline([
        ('StandardScaler1', PandasWrapper(StandardScaler())),
        ('NumericFeaturesTransformer', NumericFeaturesTransformer(transformations=transformations, columns=columns)),
        ('StandardScaler2', StandardScaler())
    ])

    # Put the bucketed versions of the same columns side by side with the transformed ones.
    union_of_transformed_features = FeatureUnion([
        ('ScalerTransformerScaler', scaler_transformer_scaler),
        ('BucketDistributor', BucketDistributor(num_buckets=num_buckets, columns=columns))
    ])

    return Pipeline([
        ('UnionOfTransformedFeatures', union_of_transformed_features),
        ('Classifier', Classifier)
    ])
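
Because `pipeline_utilities` is not shown in this notebook, here is a rough scikit-learn-only analogue of the pipeline above. It assumes that `NumericFeaturesTransformer` appends the transformed columns to the originals and that `BucketDistributor` assigns each value to a quantile bucket; both are guesses, and `exp_and_square` and `make_sketch_pipeline` are hypothetical names.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer, StandardScaler


def exp_and_square(X):
    # Append exp(x) and x**2 of every column to the original columns.
    return np.hstack([X, np.exp(X), X ** 2])


def make_sketch_pipeline(num_buckets=5):
    # Scale, transform, then scale again so exp(x) does not dominate.
    scaled_transforms = Pipeline([
        ('scale_in', StandardScaler()),
        ('transform', FunctionTransformer(exp_and_square)),
        ('scale_out', StandardScaler()),
    ])
    # Quantile buckets encoded as one integer column per input feature.
    buckets = KBinsDiscretizer(n_bins=num_buckets, encode='ordinal', strategy='quantile')
    features = FeatureUnion([('scaled_transforms', scaled_transforms), ('buckets', buckets)])
    return Pipeline([('features', features), ('classifier', GaussianNB())])


# Synthetic data only; the label depends on a squared feature.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] ** 2 > 1).astype(int)

sketch = make_sketch_pipeline().fit(X_demo, y_demo)
proba = sketch.predict_proba(X_demo)[:, 1]
n_columns = sketch.named_steps['features'].transform(X_demo).shape[1]
```

With 3 input columns the union yields 9 transformed columns plus 3 bucket columns.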


## Our benchmark will be the `base_pipeline` with top50 features
AUC ≈ 0.839 on the dropout set

In [8]:
descriptor = (GaussianNB(), {}, top50_features, 'linear_classifier_top100.pkl')
base_pipeline = create_base_pipeline(descriptor, cv=2)
base_pipeline.fit(X, y);
print('score to beat = ', base_pipeline.score(X_dropout, y_dropout))

score to beat =  0.839293035567118


In [9]:
y_predicted = base_pipeline.predict_proba(X_dropout)[:,1]
roc_auc_score(y_dropout, y_predicted)

0.839293035567118
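
The two cells above return the same number, which suggests that the object returned by `create_base_pipeline` scores with ROC AUC. Assuming it wraps something like scikit-learn's `GridSearchCV` with `scoring='roc_auc'` (an assumption, since `pipeline_utilities` is not shown), the equivalence can be sketched on synthetic data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic data only: one informative feature plus noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(400, 5))
y_demo = (X_demo[:, 0] + rng.normal(size=400) > 0).astype(int)

# A one-point grid keeps the example fast; cv=2 mirrors the cells above.
search = GridSearchCV(GaussianNB(), param_grid={'var_smoothing': [1e-9]},
                      scoring='roc_auc', cv=2)
search.fit(X_demo, y_demo)

# With scoring='roc_auc', .score() uses the same metric as roc_auc_score
# applied to the positive-class probabilities.
score_from_search = search.score(X_demo, y_demo)
auc_from_proba = roc_auc_score(y_demo, search.predict_proba(X_demo)[:, 1])
```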

## If we add transformations of the features, do we improve our performance?
No

In [None]:
def identity(x): return x
def exp(x): return np.exp(x)
def square(x): return x**2

In [None]:
columns = top50_features
transformations = [exp, square]
num_buckets=5

pipeline = create_pipeline_with_artificial_features(columns, transformations, num_buckets)
pipeline.fit(X, y)
y_predicted = pipeline.predict_proba(X_dropout)[:,1]
roc_auc_score(y_dropout, y_predicted)

## If we tune the hyperparameters of Naive Bayes, do we get better performance?
Only marginally.

In [None]:
parameters = {'var_smoothing': np.logspace(-20, -1, 40)}
descriptor = (GaussianNB(), parameters, top50_features, 'naive_top50.pkl')
base_pipeline = create_base_pipeline(descriptor, cv=2)
base_pipeline.fit(X, y);
'Best score = ', base_pipeline.score(X_dropout, y_dropout)

In [None]:
base_pipeline.best_estimator_
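
For context on what is being tuned: in scikit-learn's `GaussianNB`, `var_smoothing` is the fraction of the largest feature variance that is added to every per-class variance for numerical stability. The sketch below checks this on synthetic data through the fitted `epsilon_` attribute; the variable names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data with two features of different spread.
rng = np.random.RandomState(0)
X_demo = rng.normal(scale=[1.0, 3.0], size=(500, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

var_smoothing = 1e-2
model = GaussianNB(var_smoothing=var_smoothing).fit(X_demo, y_demo)

# epsilon_ is the absolute amount added to all variances.
largest_variance = X_demo.var(axis=0).max()
expected_epsilon = var_smoothing * largest_variance
```

One consequence: values far below roughly 1e-16 add less than double-precision round-off to variances of similar magnitude, so the lower end of a grid spanning 1e-20 to 1e-1 probably produces identical models. This would be consistent with tuning giving only a marginal gain.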

## If we use Multinomial Naive Bayes, do we get better performance?
No

In [16]:
parameters = {'alpha': np.logspace(-20, 20, 10)}
descriptor = (MultinomialNB(), parameters, top50_features, 'multinomialnv_top50.pkl')
base_pipeline = create_positive_base_pipeline(descriptor, cv=2)
base_pipeline.fit(X, y);
'Best score = ', base_pipeline.score(X_dropout, y_dropout)

('Best score = ', 0.8044952362239447)
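
`MultinomialNB` rejects negative inputs, which is presumably why this cell uses `create_positive_base_pipeline` instead of `create_base_pipeline`. What that helper does is not shown here. One plausible approach, sketched below on synthetic data, is to rescale every feature to [0, 1] with `MinMaxScaler` before the classifier.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with negative values, as in columns like var_1 or var_5.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Fitting on the raw data fails because of the negative entries.
try:
    MultinomialNB().fit(X_demo, y_demo)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True

# Hypothetical stand-in for create_positive_base_pipeline.
positive_pipeline = Pipeline([
    ('to_unit_interval', MinMaxScaler()),
    ('classifier', MultinomialNB(alpha=1.0)),
])
positive_pipeline.fit(X_demo, y_demo)
proba = positive_pipeline.predict_proba(X_demo)[:, 1]
```

Multinomial Naive Bayes models count-like features, so a drop relative to the Gaussian variant on continuous inputs like these is not surprising.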

## If we use Complement Naive Bayes, do we get better performance?
(This algorithm is supposed to be well suited for imbalanced datasets.)

In [21]:
# ComplementNB has no `var_smoothing` parameter; its smoothing parameter is `alpha`.
parameters = {'alpha': np.logspace(-20, 20, 10)}
descriptor = (ComplementNB(), parameters, top50_features, 'naive_top50.pkl')
base_pipeline = create_positive_base_pipeline(descriptor, cv=2)
base_pipeline.fit(X, y);
'Best score = ', base_pipeline.score(X_dropout, y_dropout)

('Best score = ', 0.8044952362239447)
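
When switching between variants it is easy to carry over a parameter grid that does not apply. A quick way to see which hyperparameters each scikit-learn Naive Bayes class exposes is `get_params()`; the dictionary name `tunable` is only illustrative.

```python
from sklearn.naive_bayes import BernoulliNB, ComplementNB, GaussianNB, MultinomialNB

# Map each class name to the sorted list of its constructor parameters.
tunable = {
    cls.__name__: sorted(cls().get_params())
    for cls in (GaussianNB, MultinomialNB, ComplementNB, BernoulliNB)
}

for name, params in tunable.items():
    print(name, params)
```

`var_smoothing` belongs to `GaussianNB` only; the multinomial, complement, and Bernoulli variants are smoothed through `alpha`.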