*Findings*

 - Using data up to 30.03.2018, the best algorithms found hot subgroups with a 55% uplift in success rate.
 - The explicit algorithms should reach an uplift of 40%.

*Decision*
 - The DE Online Marketing Team first wants to filter out auto-closed leads and fire the pixel only for leads that are not auto-closed.
 - 31.07.2018: This has not yet been implemented (Jeongmin Lee is responsible).


### Motivation


The knowledge repo was created to consolidate research work that is currently scattered across emails, blog posts, and presentations, so that people do not redo work that has already been done.

### The script used for this analysis
Nota bene: the script only works if you have access to the talend data warehouse database. If you do, run
`keyring.set_password("talend", "sdunoyer", "my-password")` once, with sdunoyer replaced by your username and my-password by your password. The username and password are then stored securely on your computer.
Afterwards, `keyring.get_password('talend', 'sdunoyer')` fetches the stored password. Replace every occurrence of sdunoyer in the script with your username.


# 0. Import packages

In [None]:
import pandas as pd
import re
import os
import sys
import psycopg2
import numpy as np
import keyring #store passwords locally and securely using keyring 
#keyring.set_password("mail", "your-mail-address", "your-mail-pw")
#keyring.set_password("talend", "sdunoyer", "my-password")

import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline

from sklearn import linear_model
from sklearn.preprocessing import StandardScaler, Imputer
pd.set_option('display.max_rows', 450)
pd.set_option('display.max_columns', 400)
pd.set_option('display.width', 1000)

import plotly
plotly.offline.init_notebook_mode() # run at the start of every ipython notebook

# 1. Import Data: Import Audibene German Web Leads 
Filters applied: 
    - lead.country = DE
    - lead.email_only = false
    - lead.source = 'Web2Lead'
    - campaign.controlling_channel != 'CRM'
    - lead.created_date >= '2017-01-01'
    - lead.status = 'qualified' or 'closed'
    - opportunity.stage_name = 'Closed and Won' or 'Closed and Lost'
    - only leads that are [qualified & closed and won] OR [qualified & closed and lost] OR closed
    - time_to_first_call is not null (a null value means an inbound call)

In [None]:
con = psycopg2.connect(dbname='talend', port = '6432', user='sdunoyer', host='bi-proxy', password=keyring.get_password('talend','sdunoyer'))
cur = con.cursor()

# ---- pull Audibene German Web Leads ---- #
cur.execute('''select l.id as lead_id, 
l.status as lead_status, 
l.created_date as lead_created_date, 
l.reached_date as lead_reached_date, 
l.reason_for_closing, 
l.number_of_unsuccessful_attempts,
q.precise_age,
l.created_during_office_hours, l.time_to_first_call, l.user_device, l.t_parameter, l.sub_publisher,
a.usage, a.marketing_partner, a.controlling_channel as act_controlling_channel, a.offer_type, a.dw_created_at, a.dw_modified_at,
o.stage_name,
q.cardiac_pacemaker, q.cosi_basic_subtype_1, q.age_of_current_hearing_aid, q.current_hearing_test,
q.degree_of_suffering, q.discreet_design, q.income_group, q.insurance_type, q.postal_code, q.prescription,
q.purchase_timeframe, q.searching_for, q.tinnitus, q.type_of_treatment,
q.willing_to_invest, q.why_not_sooner, q.salutation, q.email_filled, q.alternative_phone_filled, q.manufacturer_of_current_hearing_aid,
q.satisfaction_current_device, q.professional_status, q.willing_to_invest_time, q.browser, q.operating_system,
q.currently_looking_for_hearing_aids, q.marketing_offer
from datamart.dim_lead l
       left join datamart.dim_opportunity o on l.id = o.lead_id
       left join datamart.dim_act a on l.act = a.act and l.country_code_iso3 = a.country_code_iso3
       left join datamart.dim_questionnaire_dmk q on q.lead_id = l.id
where l.country_code_iso3 = 'DEU'  and l.source = 'Web2Lead'
       and a.controlling_channel != 'CRM' and l.created_date >= '2017-01-01' and l.status in ('qualified','closed')
''')
rows = cur.fetchall()
leads = pd.DataFrame(rows, columns =  [elt[0] for elt in cur.description])
print (leads.shape)
cur.close()
con.close() 

In [None]:
leads.columns

In [None]:
leads_backup = leads.copy()  # real copy, so the backup does not share data with leads

# 2. Cleanup and Transform the Data
## 2.1. Data Cleaning
### 2.1.1. Remove rows by filtering leads

In [None]:
#Filter 1: keep only stage_name in (closed and lost, closed and won, or NA)
leads = leads[leads['stage_name'].isin(['closed and lost', 'closed and won']) | leads['stage_name'].isnull()]
#Filter 2: keep only closed leads, or qualified leads with a won/lost opportunity
leads = leads[(leads['lead_status']=='closed') | ((leads['lead_status']=='qualified') & leads['stage_name'].notnull())].reset_index()
print(leads.shape)
#Filter 3: keep only leads with a non-null time_to_first_call (null means inbound call)
leads = leads[leads['time_to_first_call'].notnull()]
print(leads.shape)
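
The three filters can be checked on a small example before running them on the full extract. The sketch below uses a hypothetical toy frame (the rows and the `negotiation` and `open` values are invented for illustration) and combines the same conditions into one boolean mask.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the ones used above.
toy = pd.DataFrame({
    "lead_status": ["closed", "qualified", "qualified", "qualified", "open"],
    "stage_name": [None, "closed and won", "negotiation", None, None],
    "time_to_first_call": [12.0, 3.5, 8.0, 1.0, None],
})

final_stages = ["closed and lost", "closed and won"]
is_final_or_missing = toy["stage_name"].isin(final_stages) | toy["stage_name"].isnull()
is_closed = toy["lead_status"] == "closed"
is_qualified_with_outcome = (toy["lead_status"] == "qualified") & toy["stage_name"].notnull()
has_first_call = toy["time_to_first_call"].notnull()

# A lead is kept only if it passes all three filters.
kept = toy[is_final_or_missing & (is_closed | is_qualified_with_outcome) & has_first_call]
print(kept)
```

Naming each condition makes it easier to see which filter removes a given lead.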

### 2.1.2. Improve Data Quality: replace German field values with English ones

In [None]:
leads.cardiac_pacemaker = np.where(leads.cardiac_pacemaker == 'nein', 'no', leads.cardiac_pacemaker)
leads.current_hearing_test = np.where(leads.current_hearing_test == 'nein', 'no', leads.current_hearing_test)
leads.current_hearing_test = np.where(leads.current_hearing_test == 'ja', 'yes', leads.current_hearing_test)
leads.discreet_design = np.where(leads.discreet_design == 'sehr wichtig', 'very important', leads.discreet_design)
leads.insurance_type = np.where(leads.insurance_type == 'gesetzlich', 'statutory', leads.insurance_type)
leads.salutation = np.where(leads.salutation == 'herr', 'mr.', leads.salutation)
leads.salutation = np.where(leads.salutation == 'ms.', 'mrs.', leads.salutation)
leads.operating_system = np.where((leads.operating_system == 'macos') | 
                                  (leads.operating_system == 'os x') |
                                  (leads.operating_system == 'iphone os') |
                                  (leads.operating_system == 'mac os x') |
                                  (leads.operating_system == 'mac')
                                  , 'macos/ios', leads.operating_system)
leads.operating_system = np.where((leads.operating_system == 'android') | 
                                  (leads.operating_system == 'linux') |
                                  (leads.operating_system == 'fireos') |
                                  (leads.operating_system == 'fedora') |
                                  (leads.operating_system == 'tizen') |
                                  (leads.operating_system == 'android tv') |
                                  (leads.operating_system == 'webos') |
                                  (leads.operating_system == 'chrome os')
                                  , 'ubuntu', leads.operating_system)
leads.operating_system = np.where(leads.operating_system == 'windows phone'
                                  , 'windows', leads.operating_system)
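
As a possible alternative to chained `np.where` calls, the same kind of recoding can be written as a dictionary passed to `Series.replace`. This is only a sketch on hypothetical sample values; the mapping shown covers a subset of the labels above and is not the full recoding.

```python
import pandas as pd

# Hypothetical sample of raw values; the mapping reuses labels from the cell above.
raw = pd.Series(["macos", "iphone os", "android", "windows phone", "windows", None])

os_groups = {
    "macos": "macos/ios",
    "os x": "macos/ios",
    "iphone os": "macos/ios",
    "windows phone": "windows",
}
# Series.replace leaves values that are not keys of the mapping untouched.
grouped = raw.replace(os_groups)
print(grouped.tolist())
```

Keeping the mapping in one dictionary makes it easier to review and extend than a series of separate assignments.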


## 2.2. Data Derivation: create new variables
    - outcome variable: purchase
    - day_of_week
    - weekday_or_weekend
    - completed_time
    - completed_time_segment
    - hour_of_day
    - age_bucket
    - questionnaire_fill_up_rate

In [None]:
from datetime import time, tzinfo, timedelta
#create outcome variable: purchase
leads['purchase'] = np.where(leads['lead_status'] == 'closed', 
                             0, 
                             np.where(leads['stage_name']== 'closed and won', 1, 0))
#time and date derivation
leads['day_of_week'] = leads.lead_created_date.dt.weekday + 1
leads['weekday_or_weekend'] = np.where(leads.day_of_week >=6, 'weekend', 'weekday')
leads['completed_time'] = leads.lead_created_date.dt.time
leads['completed_time_segment'] = np.where(leads.completed_time < time(6, 0, 0),
                                           "0-6",
                                           np.where(leads.completed_time < time(8, 0, 0),
                                                   "6-8",
                                                   np.where(leads.completed_time < time(10, 0, 0),
                                                           "8-10",
                                                           np.where(leads.completed_time < time(12, 0, 0),
                                                                   "10-12",
                                                                   np.where(leads.completed_time < time(14, 0, 0),
                                                                           "12-14",
                                                                           np.where(leads.completed_time < time(16, 0, 0),
                                                                                   "14-16",
                                                                                   np.where(leads.completed_time < time(18, 0, 0),
                                                                                           "16-18",
                                                                                           np.where(leads.completed_time < time(20, 0, 0),
                                                                                                   "18-20",
                                                                                                   "20-24"))))))))
leads['hour_of_day'] = leads.lead_created_date.dt.hour

#age bucket (labels match the cut-offs; a NaN precise_age fails every comparison and ends up in '>80')
leads['age_bucket'] = np.where(leads.precise_age < 62,
                               '<62',
                               np.where(leads.precise_age < 69,
                                        '62-68',
                                        np.where(leads.precise_age < 75,
                                                 '69-74',
                                                 np.where(leads.precise_age < 81, '75-80', '>80'))))
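
Deeply nested `np.where` calls are hard to read. A minimal sketch of the same time-segment bucketing with `pd.cut` follows; it assumes the hour of day is used instead of the full time, which gives the same segments because all boundaries fall on whole hours. The sample hours are hypothetical.

```python
import pandas as pd

# Hypothetical hours of day; bin edges are the ones used in the nested np.where above.
hours = pd.Series([0, 5, 6, 9, 13, 19, 20, 23])

edges = [0, 6, 8, 10, 12, 14, 16, 18, 20, 24]
labels = ["0-6", "6-8", "8-10", "10-12", "12-14", "14-16", "16-18", "18-20", "20-24"]
# right=False makes each bin closed on the left, e.g. [6, 8), matching "< time(8, 0, 0)".
segments = pd.cut(hours, bins=edges, labels=labels, right=False)
print(segments.astype(str).tolist())
```

Unlike the `np.where` chain, `pd.cut` returns NaN for missing input instead of assigning it to the last bucket.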

In [None]:
questions_first_care = ['precise_age', # compulsory
                        'cosi_basic_subtype_1', 
                        'current_hearing_test',
                        'degree_of_suffering',
                        'discreet_design',
                        'insurance_type',
                        'postal_code', # compulsory since CW18 2017 (4th May)
                        'prescription',
                        'purchase_timeframe',
                        'searching_for',
                        'tinnitus',
                        'type_of_treatment',
                        'why_not_sooner',
                        'salutation',# compulsory
                        'email_filled',# compulsory
                        'professional_status'] #16

questions_follow_up_care = ['precise_age',# compulsory
                            'cosi_basic_subtype_1',
                            'degree_of_suffering',
                            'discreet_design',
                            'insurance_type',
                            'postal_code',# compulsory since CW18 2017 (4th May)
                            'purchase_timeframe',
                            'searching_for',
                            'tinnitus',
                            'type_of_treatment',
                            'salutation',# compulsory
                            'email_filled',# compulsory
                            'professional_status'] #13

leads['questionnaire_fill_up_rate'] = np.where(leads.type_of_treatment == 'first care', 
                                               leads[questions_first_care].apply(lambda x: x.count(), axis=1)/16,
                                               leads[questions_follow_up_care].apply(lambda x: x.count(), axis=1)/13)
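
To illustrate how the fill-up rate is computed, here is a small sketch on hypothetical data with shortened question lists. It divides by `len(...)` of each list instead of a hard-coded 16 or 13, which is a design choice of this example and not what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical mini questionnaires; None marks an unanswered question.
toy = pd.DataFrame({
    "type_of_treatment": ["first care", "follow up care"],
    "precise_age": [70, 65],
    "prescription": [None, None],   # assumed to be asked only for first care
    "tinnitus": ["no", None],
})
first_care_questions = ["type_of_treatment", "precise_age", "prescription", "tinnitus"]
follow_up_questions = ["type_of_treatment", "precise_age", "tinnitus"]

# DataFrame.count(axis=1) counts the non-null cells per row.
fill_up = np.where(
    toy["type_of_treatment"] == "first care",
    toy[first_care_questions].count(axis=1) / len(first_care_questions),
    toy[follow_up_questions].count(axis=1) / len(follow_up_questions),
)
print(fill_up)
```

Deriving the denominator from the list keeps the rate correct if a question is added or removed later.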

In [None]:
print('The dataset contains ', leads.shape[0], ' closed leads and ', leads.shape[1], 'features.')
print('{0:.2f}%'.format((leads.purchase.value_counts()[1]/leads.shape[0] * 100)), ' of the closed leads have purchased a Hearing Aid through Audibene.')

## 2.3. Data Transformation:
    - transform categorical variables to 1/0 variables
    - exclude some columns

In [None]:
#columns included for the analysis
categorical_columns = [
    'user_device', 
    'cosi_basic_subtype_1',
    'age_of_current_hearing_aid', 
    'current_hearing_test',
    'degree_of_suffering', 
    'discreet_design', 
    'insurance_type', 
    'prescription', 
    'purchase_timeframe',
    'searching_for', 
    'tinnitus', 
    'type_of_treatment',
    'why_not_sooner', 
    'salutation',
    'manufacturer_of_current_hearing_aid',
    'satisfaction_current_device',
    'professional_status', 
    'browser',
    'operating_system',
    'weekday_or_weekend',
    'completed_time_segment', 
    'age_bucket']

#create dummies: transform categorical variables into 0/1 variables
dummies = pd.get_dummies((leads[categorical_columns]))
#join dummies 
leads = pd.concat([leads, dummies], axis=1)

## 2.4. Columns filtering 
        - remove IDs
        - remove columns related to outcome
        - remove columns related to marketing partner or offer
        - remove dates
        - remove columns that have a fill-up rate under 5% or are too granular

In [None]:
#remove some columns 
columns_to_drop = [
    'lead_id', #id
    'lead_status', #related to outcome variable
    'lead_created_date', #date
    'lead_reached_date', #date
    'completed_time', #time
    'reason_for_closing', #after lead is generated
    'time_to_first_call', #after lead is generated
    'number_of_unsuccessful_attempts', #after lead is generated
    't_parameter', #A/B testing parameters, not relevant
    'sub_publisher', #too granular, try to segment in categories
    'usage', #too granular, try to segment in categories
    'dw_created_at', #date
    'dw_modified_at', #date
    'stage_name',#related to outcome variable
    'postal_code', #too granular, segmented in region and east/west,
    'index',
    'act_controlling_channel', 
    'marketing_partner',
    'offer_type',
    'marketing_offer'
]

#remove features with a low fill-up rate (less than 5% non-null values)
fill_up = pd.DataFrame((leads.shape[0] - leads.isnull().sum())/leads.shape[0])
low_fill_up_features = fill_up[fill_up[0]<0.05].index

columns_to_drop.extend(low_fill_up_features)

#drop categorical columns that were transformed to dummies
#(extend, not append: append would add the whole list as one nested element and break drop)
columns_to_drop.extend(['user_device',
 'cosi_basic_subtype_1',
 'age_of_current_hearing_aid',
 'current_hearing_test',
 'degree_of_suffering',
 'discreet_design',
 'insurance_type',
 'prescription',
 'purchase_timeframe',
 'searching_for',
 'tinnitus',
 'type_of_treatment',
 'why_not_sooner',
 'salutation',
 'manufacturer_of_current_hearing_aid',
 'satisfaction_current_device',
 'professional_status',
 'browser',
 'operating_system',
 'weekday_or_weekend',
 'completed_time_segment',
 'age_bucket'])
  
leads = leads.drop(columns_to_drop, axis = 1)
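
The fill-up rate per column can also be computed directly with `notna().mean()`. The sketch below shows the idea on a hypothetical two-column frame with the same 5% threshold.

```python
import pandas as pd

# Hypothetical frame: "rare" is filled for 1 row out of 25 (4%), "common" for all rows.
toy = pd.DataFrame({
    "common": range(25),
    "rare": [1.0] + [None] * 24,
})

# notna().mean() gives the share of non-null values per column.
fill_up_rate = toy.notna().mean()
low_fill_up = fill_up_rate[fill_up_rate < 0.05].index.tolist()
toy = toy.drop(columns=low_fill_up)
print(fill_up_rate.to_dict(), list(toy.columns))
```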

## 2.5. Success Rate across all dimensions (1D Analysis)

In [None]:
#statistical test: Student's t-test. For each dummy feature, test whether its mean is the same
#for purchasers and non-purchasers, i.e. whether the success rate (SR) differs across groups.
# A large absolute t-score indicates that the groups are different.
# A small absolute t-score indicates that the groups are similar.
# Null hypothesis: SR is the same across groups
# Alternative hypothesis: SR differs across groups
# Only features with a p-value below 0.05 are printed.
from scipy import stats
for col in dummies:
    result_ttest = stats.ttest_ind(leads[leads['purchase']==1][col], leads[leads['purchase']==0][col])
    if(result_ttest[1]<0.05):
        print(col)
        print(result_ttest)

# 3. Construct Datasets (X, y)

In [None]:
X = leads
X = X.drop('purchase', axis = 1)
y = leads.purchase
features=X.columns

In [None]:
#train / test / validation split
from sklearn.model_selection import train_test_split,GridSearchCV
#train = 70%
#test = 30% * 70% = 21%
#validation = 30% * 30% = 9%
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=3, 
                                                    test_size=0.30)

X_test, X_val, y_test, y_val = train_test_split(X_test, 
                                                y_test, 
                                                test_size=0.3, 
                                                random_state=3)
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)
print('X_val:', X_val.shape)
print('y_val:', y_val.shape)

print('Train SR: ', y_train.sum()/y_train.count())
print('Test SR: ', y_test.sum()/y_test.count())
print('Validation SR: ', y_val.sum()/y_val.count())
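
The resulting shares (70% / 21% / 9%) can be verified with a short calculation. The sketch below reproduces the two-step split with plain NumPy on a hypothetical dataset size; it is not a replacement for `train_test_split`, only a check of the arithmetic.

```python
import numpy as np

# Hypothetical dataset size; only the proportions matter here.
n_rows = 10_000
rng = np.random.default_rng(3)
shuffled = rng.permutation(n_rows)

# First split: 70% train, 30% held out.
n_train = n_rows * 70 // 100
train_idx, holdout_idx = shuffled[:n_train], shuffled[n_train:]

# Second split of the held-out part: 70% test, 30% validation.
n_test = len(holdout_idx) * 70 // 100
test_idx, val_idx = holdout_idx[:n_test], holdout_idx[n_test:]

shares = {name: len(idx) / n_rows
          for name, idx in [("train", train_idx), ("test", test_idx), ("validation", val_idx)]}
print(shares)
```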

In [None]:
#imputation median
imp=Imputer(missing_values="NaN", strategy="median", axis=0) #specify axis
imp.fit(X_train)
X_train_imputed_df = pd.DataFrame(imp.transform(X_train), columns = X.columns)
X_test_imputed_df = pd.DataFrame(imp.transform(X_test), columns = X.columns)

In [None]:
#standardisation
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_train_imputed_df)
X_train_std = pd.DataFrame(scaler.transform(X_train_imputed_df), columns = X.columns)      
X_test_std = pd.DataFrame(scaler.transform(X_test_imputed_df), columns = X.columns)      
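
Both the imputer and the scaler are fitted on the training split only and then applied to the test split, which avoids leaking test information into the preprocessing. The following sketch illustrates that principle with plain pandas on hypothetical values; it stands in for the scikit-learn transformers and is not the code used in the analysis.

```python
import pandas as pd

# Hypothetical single-feature data with one missing value in each split.
train = pd.DataFrame({"age": [60.0, 70.0, None, 80.0]})
test = pd.DataFrame({"age": [None, 90.0]})

# Statistics are learned on the training split only ...
median = train.median()
train_imputed = train.fillna(median)
mean, std = train_imputed.mean(), train_imputed.std(ddof=0)  # ddof=0 as in StandardScaler

# ... and then reused unchanged on the test split.
test_imputed = test.fillna(median)
train_std = (train_imputed - mean) / std
test_std = (test_imputed - mean) / std
print(test_std)
```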

# 4. Feature selection
Feature selection is a process where features that contribute most to the prediction are automatically selected.

Having too many irrelevant features in data can decrease the accuracy of the models. Three benefits of performing feature selection before training models are:

- Reduced Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improved Accuracy: Less misleading data means modeling accuracy improves.
- Reduced Training Time: Less data means that algorithms train faster.

Two different feature selection methods provided by the scikit-learn Python library are Recursive Feature Elimination and feature importance ranking.
Univariate feature selection (chi2, best 35 features) was also tested, but Recursive Feature Elimination (best 35 features) gave better results. Beforehand, dummy features with less than 1% occurrence were removed.
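
As a rough illustration of Recursive Feature Elimination, the sketch below runs scikit-learn's `RFE` on synthetic data generated with `make_classification`. The dataset, the number of features kept (3) and `max_iter=1000` are assumptions chosen for the example, not the settings of this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 10 features, 3 of them informative.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=3,
                                     n_redundant=0, random_state=0)

# Keep 3 features: RFE fits the model, drops the weakest feature, and repeats.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_demo, y_demo)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = kept, larger numbers were eliminated earlier
```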

In [None]:
#Removing features with less than 1% occurrence
cat_occurence = pd.DataFrame({'occurence':dummies.sum()/dummies.shape[0]})
dummies_to_drop = cat_occurence[cat_occurence['occurence'] < 0.01].index
X_train = X_train.drop(dummies_to_drop, axis=1)
X_train_std = X_train_std.drop(dummies_to_drop, axis=1)
X_train_imputed_df = X_train_imputed_df.drop(dummies_to_drop, axis=1)
X_test = X_test.drop(dummies_to_drop, axis=1)
X_test_std = X_test_std.drop(dummies_to_drop, axis=1)
X_test_imputed_df = X_test_imputed_df.drop(dummies_to_drop, axis=1)

In [None]:
#Recursive Feature Elimination (RFE) repeatedly fits a model,
#removes the weakest feature(s) according to the model's coefficients or importances,
#and then repeats the process with the remaining features
#until the requested number of features (here 35) is left.
#The goal of RFE is to select features by recursively considering smaller and smaller sets of features.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=35)
rfe = rfe.fit(X_train_std, y_train)
rfe_support = pd.DataFrame({'rfe_support':rfe.support_, 'features':X_train_std.columns})

In [None]:
# features_names
X_train_imputed_df = X_train_imputed_df[rfe_support[rfe_support['rfe_support'] == True].features.values]
X_train_std = X_train_std[rfe_support[rfe_support['rfe_support'] == True].features.values]
X_test_imputed_df = X_test_imputed_df[rfe_support[rfe_support['rfe_support'] == True].features.values]
X_test_std = X_test_std[rfe_support[rfe_support['rfe_support'] == True].features.values]
features_names = rfe_support[rfe_support['rfe_support'] == True].features.values
features_names

In [None]:
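#Because the dataset is imbalanced (far more 0s than 1s), SMOTE (Synthetic Minority Over-sampling Technique) is applied to the training data.
#Note: the alias "os" below shadows the standard-library os module imported in section 0.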
RANDOM_STATE = 0 
from imblearn import over_sampling as os
sm = os.SMOTE(random_state=RANDOM_STATE)
X_train_res, y_train_res = sm.fit_sample(X_train_std, y_train)
X_train_imputed_res, y_train_imputed_res = sm.fit_sample(X_train_imputed_df, y_train)
X_train_imputed_res = pd.DataFrame(X_train_imputed_res, 
                                   columns = X_train_imputed_df.columns)

# 5. Decision Tree Algorithm
Advantages:
    - Simple to understand and interpret.
    - Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
    - Help determine worst, best and expected values for different scenarios.
    - Can be combined with other decision techniques.
Disadvantages:
    - Unstable. A small change in the data can lead to a large change in the structure of the optimal decision tree.
    - Often relatively inaccurate. Many other predictors perform better with similar data. This can be remedied by replacing a single decision tree with a random forest of decision trees, but a random forest is not as easy to interpret as a single decision tree.

For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of the attributes with more levels.
Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
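
The bias of information gain towards attributes with many levels can be seen in a small worked example. The sketch below uses hypothetical data: a two-level attribute that is unrelated to the outcome and an ID-like attribute with one level per row.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """Entropy of labels minus the weighted entropy after splitting on feature."""
    values, counts = np.unique(feature, return_counts=True)
    weighted = sum(c / len(labels) * entropy(labels[feature == v])
                   for v, c in zip(values, counts))
    return entropy(labels) - weighted

# Hypothetical data: 8 leads, half of them purchased.
purchase = np.array([1, 0, 1, 0, 1, 0, 1, 0])
two_levels = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])  # unrelated to purchase
eight_levels = np.arange(8)                                       # unique per row, like an ID

gain_two = information_gain(two_levels, purchase)
gain_eight = information_gain(eight_levels, purchase)
print(gain_two, gain_eight)
```

The ID-like attribute reaches the maximum possible gain although it cannot generalise to new leads, which is one reason for dropping IDs and very granular columns before training.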

In [None]:
from sklearn import datasets
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from imblearn import over_sampling as os
from imblearn import pipeline as pl
from imblearn.metrics import (geometric_mean_score,
                              make_index_balanced_accuracy)
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler, Imputer
from sklearn import tree
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif, chi2, SelectPercentile
from sklearn.metrics import accuracy_score, mean_squared_error,confusion_matrix
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 0

In [None]:
clf_tree = tree.DecisionTreeClassifier(max_depth=4,
                                  min_samples_leaf=5000)
clf_tree.fit(X_train_imputed_df, y_train)
tree.export_graphviz(clf_tree, 
                     out_file='tree_all.dot',
                     feature_names=X_train_imputed_df.columns, 
                     proportion = True)

# 6. Create Pipeline with SMOTE + Classifier
    - Perceptron
    - Linear Regression (LinR) - ok
    - Logistic Regression (LogR) - ok
    - Lasso Regression (LassoR) - ok
    - Decision Tree (DT) - ok
    - Random Forest (RF) - ok
    - K-Nearest Neighbor (KNN) - ok
    - Support Vector Machine (SVM) - not ok: very costly, feature selection necessary
    - XGBoost
    - Naïve Bayesian Classifier (BC)
    - Bayesian Network (BN)
    - Artificial Neural Network (ANN)

In [None]:
# import packages
from sklearn.linear_model import Perceptron
from sklearn import linear_model
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn.linear_model import ElasticNet
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.calibration import calibration_curve

In [None]:
import logging
from optparse import OptionParser
import sys
from time import time
import matplotlib.pyplot as plt
from imblearn import over_sampling as os
from imblearn import pipeline as pl
from imblearn.metrics import (geometric_mean_score,
                              make_index_balanced_accuracy)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier, Perceptron, PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import density
from sklearn import metrics
from sklearn.metrics import recall_score

from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from scipy import interp

# #############################################################################
# Measure Prediction Score

def score_prediction(y_pred, y_test):
    print("accuracy score:   %0.3f" %  metrics.accuracy_score(y_test, y_pred))
    print("classification report:", metrics.classification_report(y_test, y_pred, digits=3))
    print("confusion matrix:")
    cm = metrics.confusion_matrix(y_test, y_pred)
    print(cm)
    print('Misclassified samples: %d' %(y_test != y_pred).sum())
    print('Recall score:', recall_score(y_test, y_pred))
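    # Coverage: share of all leads that the model predicts as purchasers
    print('Coverage:', (cm[0,1]+cm[1,1])/cm.sum())
    # ---- ROC Curve -----
    # y_pred holds hard 0/1 predictions, so the curve has a single operating point;
    # pass probabilities (e.g. from predict_proba) to get a full ROC curve.
    preds = y_pred
    fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
    roc_auc = metrics.auc(fpr, tpr)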
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()


# Benchmark classifiers
def benchmark(clf):
    print('_' * 80)
    print("Training: ")
    print(clf)
    # ---- time ----- 
    t0 = time()
    pipeline = pl.make_pipeline(os.SMOTE(random_state=RANDOM_STATE),
                                clf)
    pipeline.fit(X_train_std, y_train)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)
    t0 = time()
    y_pred = pipeline.predict(X_test_std)
    #y_pred = np.where(y_pred>0.1, 1, 0)
    test_time = time() - t0
    print("test time:  %0.3fs" % test_time)
    
    score_prediction(y_pred, y_test)
    # ---- Model description -----
    clf_descr = str(clf).split('(')[0]
    return clf_descr, train_time, test_time
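
The metrics printed by `score_prediction` can be read directly off the confusion matrix. The sketch below uses a hypothetical matrix and also derives an uplift figure; the definition used here (success rate inside the predicted-positive group relative to the overall success rate, minus 1) is an assumption, since the notebook does not define uplift explicitly.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn layout: rows = actual, columns = predicted.
#              predicted 0  predicted 1
cm = np.array([[800,         100],    # actual 0
               [ 60,          40]])   # actual 1

total = cm.sum()
base_success_rate = cm[1, :].sum() / total     # share of purchasers overall
coverage = cm[:, 1].sum() / total              # share of leads flagged as "hot"
precision = cm[1, 1] / cm[:, 1].sum()          # success rate inside the flagged group
recall = cm[1, 1] / cm[1, :].sum()             # share of purchasers that were flagged
uplift = precision / base_success_rate - 1     # assumed definition of uplift

print(f"coverage={coverage:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} uplift={uplift:.1%}")
```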



In [None]:
results = []
for clf, name in (
    #(DummyClassifier(strategy='most_frequent',random_state=0), "Dummy Classifier"),
    (Perceptron(n_iter=50, random_state=0), "Perceptron"),
    (linear_model.SGDClassifier(random_state=0), "Linear SGD"),
    (linear_model.LogisticRegression(penalty='l1',random_state=0), "Logistic + L1"),
    (linear_model.LogisticRegression(penalty='l2',random_state=0), "Logistic + L2"),
    (RidgeClassifier(tol=1e-2, solver="lsqr", random_state=0), "Ridge Classifier"),
    (PassiveAggressiveClassifier(n_iter=50, random_state=0), "Passive-Aggressive"),
    (KNeighborsClassifier(n_neighbors=10), "kNN"), #too long on all features : train time: 205.479s, test time:  26099.829s
    (tree.DecisionTreeClassifier(max_depth=4, min_samples_leaf=5000, random_state=0), "Decision Tree"),
    (RandomForestClassifier(n_estimators=100, random_state=0), "Random forest"),
    (RandomForestClassifier(n_estimators=100, random_state=0, max_depth=4, min_samples_leaf=5000), "Random forest 2"),
    (RandomForestClassifier(criterion = 'gini',n_estimators=400,random_state=RANDOM_STATE,
                            n_jobs=2, min_samples_leaf=8000, max_features = 10), "Random forest 3"),
    (LinearSVC(C=1.0), "Linear SVC"),
    (SGDClassifier(alpha=.0001, n_iter=50, penalty="elasticnet"), 'Elastic-Net penalty'),
    (NearestCentroid(), 'NearestCentroid (aka Rocchio classifier)'),
    (GaussianNB(), 'Gaussian Naive Bayes'),
    (BernoulliNB(alpha=.01), 'Bernoulli Naive Bayes')
):
    print('=' * 80)
    print(name)
    # keep (model description, train time, test time) for later comparison
    results.append(benchmark(clf))

# 7. Fine-tuning Random Forest via randomized search


In [None]:
from time import time
from scipy.stats import randint as sp_randint
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=20, class_weight={0: 1, 1: 3000})

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")


# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

scoring = {'precision': 'precision', 'Recall': 'recall', 'accuracy': 'accuracy', 'AP':'average_precision' }
# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, 
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search, scoring = scoring,
                                   refit='precision' 
                                  )

start = time()
random_search.fit(X_train_imputed_df, y_train)
print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time() - start), n_iter_search))
#report(random_search.cv_results_)

In [None]:
argmax = (random_search.cv_results_['mean_test_precision']).argmax()
recall = random_search.cv_results_['mean_test_Recall'][argmax]
average_precision  = random_search.cv_results_['mean_test_AP'][argmax]
precision  = random_search.cv_results_['mean_test_precision'][argmax]

print(argmax, recall,average_precision,  precision)
print(random_search.cv_results_['params'][argmax])

# 8. Combination of classifiers:
1. Logistic L1: linear_model.LogisticRegression(penalty='l1', random_state=0)
2. Random forest with min_samples_leaf=5000: RandomForestClassifier(criterion='entropy', n_estimators=400, random_state=RANDOM_STATE, n_jobs=2, min_samples_leaf=5000, max_features=10)
3. Nearest Centroid: NearestCentroid()
4. Bernoulli Naive Bayes: BernoulliNB(alpha=.01)
5. Decision Tree: tree.DecisionTreeClassifier(max_depth=4, min_samples_leaf=5000, random_state=0)
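
The ensembles below combine these classifiers by hard or soft voting. As a minimal sketch of the difference, the example uses hypothetical class-1 probabilities from three classifiers; it mimics the two voting rules in NumPy for the binary case and is not the `VotingClassifier` code itself.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 from three classifiers for four leads.
proba = np.array([
    [0.90, 0.40, 0.45],   # lead 0
    [0.20, 0.60, 0.55],   # lead 1
    [0.10, 0.20, 0.30],   # lead 2
    [0.70, 0.80, 0.60],   # lead 3
])

# Hard voting: each classifier casts a 0/1 vote, the majority wins.
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=1) > votes.shape[1] / 2).astype(int)

# Soft voting: average the probabilities, then threshold the mean.
soft = (proba.mean(axis=1) >= 0.5).astype(int)

print("hard:", hard)
print("soft:", soft)
```

The two rules disagree on leads 0 and 1, where one confident classifier outweighs (or fails to outweigh) two less confident ones. Soft voting requires every member to provide probabilities, which is why it is only used with classifiers that implement `predict_proba`.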

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

clf1 = linear_model.LogisticRegression(random_state=1, penalty='l1')
clf2 = RandomForestClassifier(criterion = 'entropy',n_estimators=400,random_state=RANDOM_STATE, n_jobs=2, min_samples_leaf=5000, max_features = 10)
clf3 = NearestCentroid()
clf4 = BernoulliNB(alpha=.01)
clf5 = tree.DecisionTreeClassifier(max_depth=4, min_samples_leaf=5000, random_state=0)

eclf1 = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2), ('nc', clf3), ('nb', clf4), ('dt', clf5)], voting='hard')

pipe1 = pl.make_pipeline(os.SMOTE(random_state=RANDOM_STATE), eclf1)

eclf1 = pipe1.fit(X_train_std, y_train)
print(eclf1.predict(X_train_std))

eclf2 = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2), ('dt', clf5)],
        voting='soft')

pipe2 = pl.make_pipeline(os.SMOTE(random_state=RANDOM_STATE), eclf2)

eclf2 = pipe2.fit(X_train_std, y_train)
print(eclf2.predict(X_test_std))

eclf3 = VotingClassifier(estimators=[
       ('lr', clf1), ('rf', clf2), ('nc', clf3), ('nb', clf4), ('dt', clf5)],
       voting='hard', weights=[2,3,1,1,1],
       flatten_transform=True)

pipe3 = pl.make_pipeline(os.SMOTE(random_state=RANDOM_STATE), eclf3)

eclf3 = pipe3.fit(X_train_std, y_train)
eclf4 = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2), ('dt', clf5)],
        voting='hard')

pipe4 = pl.make_pipeline(os.SMOTE(random_state=RANDOM_STATE), eclf4)

eclf4 = pipe4.fit(X_train_std, y_train)

In [None]:
# score_prediction prints its own report and returns None, so it is called without print()
print('Ensemble 1')
y_pred_1 = eclf1.predict(X_test_std)
score_prediction(y_pred_1, y_test)
print('Ensemble 2')
y_pred_2 = eclf2.predict(X_test_std)
score_prediction(y_pred_2, y_test)
print('Ensemble 3')
y_pred_3 = eclf3.predict(X_test_std)
score_prediction(y_pred_3, y_test)
print('Ensemble 4')
y_pred_4 = eclf4.predict(X_test_std)
score_prediction(y_pred_4, y_test)

### Appendix


# 9. TO DO
    - k-fold CV
    - Gradient Boosting
    - Support Vector Machine (SVM) - not ok: very costly, feature selection necessary
    - XGBoost (see https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/)
    - Naïve Bayesian Classifier (BC): gnb = GaussianNB()
    - Bayesian Network (BN)