# An Experiential Journey With Data to Inspire Your Work

## Introduction 

The Experiential Journey with Data to Inspire Your Work session will make you think differently about data and how it can solve problems! You will hear a surprising use case that will make you think, sometimes laugh, and hopefully inspire your own work. The use case and introductory material include the hands-on experiential journey described below. The most valuable part of this session is that it is designed to help you gain experience and relate it to your work, so that when you leave you have a plan of action for making data more useful in solving a key challenge in your organization.

A real business application of analytics, “Improving Customer Experiences with Real-Time Insights”, will be used as an example during the workshop. This experiential session includes a step-by-step journey on “How data science is helping IBM to predict the customer experience journey and proactively address the issues, leading to the improvement of Net Promoter Score”. The session also highlights the importance of using CRISP-DM (Cross-Industry Standard Process for Data Mining) and Agile in data science projects.

The methodology involves consuming historical Net Promoter Score (NPS) data, using machine learning and artificial intelligence to identify the most important features, and creating an algorithm to predict the customer experience.

## Background

NPS has become the industry standard customer loyalty measurement. Businesses see customer experience as an imperative and would like to run analytics on and predict customer experience. Since competition is rife, keeping customers happy so they do not move their investments elsewhere is key to maintaining profitability.
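For readers new to the metric, here is a minimal sketch of how an NPS is conventionally computed from 0-10 "likelihood to recommend" ratings: respondents scoring 9-10 count as promoters, 0-6 as detractors, and the score is the percentage of promoters minus the percentage of detractors. The function name and the ratings are made up for illustration and are not part of the workshop data.

```python
def net_promoter_score(ratings):
    """Return the NPS (between -100 and 100) for a list of 0-10 ratings."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    promoters = sum(1 for r in ratings if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings of 0 to 6
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical survey answers, used only to exercise the function.
example_ratings = [10, 9, 8, 7, 6, 0, 10, 9, 3, 10]
print(net_promoter_score(example_ratings))
```

In this toy sample there are five promoters and three detractors among ten respondents, so the score is 20.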

Improving the customer experience is valuable because of its effect on our bottom line. Creating an ultimate experience that appeals to both the heart and the head is our goal. Customers give their money; fans give their hearts. 44% of consumers say that the majority of customer experiences are bland, and 69% of consumers say that emotions count for half of their experiences.


## Approach

In this notebook, we'll use scikit-learn to predict the customer experience. scikit-learn provides implementations of many classification algorithms. Here, we will apply multiple classification algorithms, evaluate their performance, and select the best-performing algorithm based on performance metrics.

We have chosen the gradient boosting classification algorithm to walk through all the different steps.

To help visualize what we are doing, we'll use 2D and 3D charts, drawn with the matplotlib and scikitplot Python libraries, to show how the classes look.

<a id="top"></a>
## Table of Contents


1. [Load libraries](#load_libraries)


2. [Data exploration](#explore_data)


3. [Prepare data for building classification model](#prepare_data)


4. [Preprocessing (Feature Extraction)](#feature_extraction)


5. [Preprocessing (Feature Scaling)](#feature_scaling)


6. [Preprocessing (Dimensionality Reduction)](#dimensionality_reduction)


7. [Split data into train and test sets](#split_data)


8. [Model Selection](#model_selection)


9. [Performance Metric](#performance_metric)


10. [Evaluation](#evaluate_metric)


11. [Deployment](#deployment)


12. [References](#reference)

### Quick set of instructions to work through the notebook

If you are new to Notebooks, here's a quick overview of how to work in this environment.

1. The notebook has 2 types of cells - markdown (text) such as this and code such as the one below. 


2. Each cell with code can be executed independently or together (see options under the Cell menu). When working in this notebook, we will run one cell at a time to provide a hands-on experiential journey.


3. To run a cell, place the cursor in the code cell and click the Run (arrow) icon. The cell is running while you see the * next to it. Some cells have printable output.


4. Work through this notebook by reading the instructions and executing code cell by cell. Some cells will require modifications before you run them. 

<a id="load_libraries"></a>
## 1. Load packages, libraries and verify the version
[Top](#top)


Install python modules and load packages

In [1]:
#Load packages and libraries

import sys #Provides information about constants, functions and methods of the Python interpreter (https://docs.python.org/3/library/sys.html)

import numpy as np #Scientific Computing (https://numpy.org/)

import pandas as pd #Data manipulation and Analysis (https://pandas.pydata.org/pandas-docs/stable/)

from datetime import datetime #Manipulate dates and times (https://docs.python.org/3/library/datetime.html#module-datetime)
import datetime as dt

import sklearn #Classification, Regression, Clustering, Dimensionality Reduction, Model Selection and Preprocessing (https://scikit-learn.org/)
from sklearn import preprocessing 
from sklearn.preprocessing import MinMaxScaler

import csv #Import and export spreadsheets and databases (https://docs.python.org/3/library/csv.html)

!pip install pandas-profiling #Generates profile reports from a pandas DataFrame and helps with quick data analysis(https://github.com/pandas-profiling/pandas-profiling)
from pandas_profiling import ProfileReport

!pip install ipywidgets==7.5.1 #prerequisite for pandas-profiling (https://ipywidgets.readthedocs.io/en/latest/)
from ipywidgets import widgets


In [None]:
#Check version
print("Python %d.%d.%d" % sys.version_info[:3])
print("Pandas %s"%pd.__version__)
print("Numpy %s"%np.__version__)
print("Scikit-learn %s"%sklearn.__version__)
print("CSV %s"%csv.__version__)

<a id="explore_data"></a>
## 2. Data exploration
[Top](#top)

This step involves adding a project token to access the data sets, loading and reading the files, and understanding the data profile.

### <a id="project_token"></a> 2.1 Add the project token
[Top](#top)

Click in an empty line in the cell below. Use the menu item with the three vertical dots, and choose **Insert project token**. 

For more information about project tokens, see <a href="https://dataplatform.cloud.ibm.com/docs/content/analyze-data/token.html" target="_blank" rel="noopener noreferrer">Manually add the project token</a>.

### <a id="load_file"></a> 2.2 Load and read the file (Method 1: Object Storage)
[Top](#top)

1. Click the **1001** data icon at the upper part of the page to open the Files subpanel.


2. In the right part of the page, select the NPS data set. Click **Insert to code**, and select **Insert pandas DataFrame**. This adds code to the data cell for reading the data set into a pandas DataFrame.


3. Change the generated variable name df_data_1 for the data frame to df, which is used in the rest of the notebook. When displayed in the notebook, the data frame appears as the following:

### <a id="load_file"></a> 2.3 Load and read the file (Method 2: GitHub)
[Top](#top)

In [12]:
#assign the urls of the files
case_information_url = "https://raw.githubusercontent.com/neemadan/An-Experiential-Journey-With-Data-to-Inspire-Your-Work/master/case_information.csv"
case_sentiments_emotions_url = "https://raw.githubusercontent.com/neemadan/An-Experiential-Journey-With-Data-to-Inspire-Your-Work/master/case_sentiments_emotions.csv"
geography_url = "https://raw.githubusercontent.com/neemadan/An-Experiential-Journey-With-Data-to-Inspire-Your-Work/master/geography.csv"    
survey_result_url = "https://raw.githubusercontent.com/neemadan/An-Experiential-Journey-With-Data-to-Inspire-Your-Work/master/survey_result.csv"

In [13]:
#read the files from the links and store in a dataframe
case_information = pd.read_csv(case_information_url)
case_sentiments_emotions = pd.read_csv(case_sentiments_emotions_url)
geography = pd.read_csv(geography_url)
survey_result = pd.read_csv(survey_result_url)

In [None]:
#View data profile report
profile = ProfileReport(survey_result, title='Survey Results Table', minimal=True)
profile.to_notebook_iframe()

In [None]:
#View data profile report
profile = ProfileReport(case_information, title='Case Information Table', minimal=True)
profile.to_notebook_iframe()

In [None]:
#View data profile report
profile = ProfileReport(case_sentiments_emotions, title='Case Sentiment and Emotions Table', minimal=True)
profile.to_notebook_iframe()

In [None]:
#View data profile report
profile = ProfileReport(geography, title='Geography Table', minimal=True)
profile.to_notebook_iframe()

<a id="prepare_data"></a>
## 3. Prepare data for building classification model
[Top](#top)

Data preparation is a very important step in machine learning model building. This is because the model can perform well only when the data it is trained on is good and well prepared. Hence, this step consumes the bulk of a data scientist's time while building models.

During this process, we identify categorical columns in the dataset. Categories need to be indexed, which means the string labels are converted to label indices. These label indices are encoded using One-hot encoding to a binary vector with at most a single value indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms which expect continuous features to use categorical features.
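As a small illustration of the encoding described above, the following sketch one-hot encodes two categorical columns with `pandas.get_dummies`. The column names mirror the data set, but the rows are invented, and `"premium"` is a made-up category value used only for this example.

```python
import pandas as pd

# Toy rows; "premium" is a hypothetical category value for illustration.
toy = pd.DataFrame({
    "support_plan": ["basic", "premium", "basic"],
    "severity": [4, 2, 4],
})

# Cast the numeric code to string so it is treated as a category, not a quantity.
toy["severity"] = toy["severity"].astype(str)

# One indicator column is created per (column, value) pair.
encoded = pd.get_dummies(toy, columns=["support_plan", "severity"])
print(encoded.columns.tolist())
print(encoded)
```

Each row ends up with exactly one active indicator per original categorical column, which is what lets algorithms that expect numeric inputs consume categorical data.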

In [14]:
#join the survey_result, case_information, case_sentiments_emotions, and geography data tables
sr_ci = pd.merge(survey_result, case_information, on=['case_number'],how='left')
sr_ci_cse = pd.merge(sr_ci, case_sentiments_emotions, on=['case_number'],how='left')
sr_ci_cse_geo = pd.merge(sr_ci_cse, geography, on=['country'],how='left')

In [15]:
sr_ci_cse_geo.head()

Unnamed: 0,case_number,likelihood_to_recommend,service_type,opened_date_time,support_plan,account_type,severity,technology_level_1,technology_level_2,technology_level_3,...,sadness_sentiment_last3,sentiment_conversation_last,anger_conversation_last,disgust_conversation_last,fear_conversation_last,joy_conversation_last,sadness_conversation_last,country_code,geography,region
0,cs0003230,1,non technical,11/4/2019,basic,payg,4,level_1_7,level_2_20,level_3_23,...,,,,,,,,us,am,us
1,cs0001378,1,technical,10/29/2019,basic,payg,4,level_1_9,level_2_26,level_3_137,...,,,,,,,,jp,jn,japan
2,cs0001480,1,technical,10/16/2019,basic,payg,4,level_1_1,level_2_22,level_3_92,...,0.268286,-0.580763,0.067375,0.041635,0.07128,0.096182,0.757099,us,am,us
3,cs000389,0,non technical,10/25/2019,basic,payg,4,level_1_7,level_2_20,level_3_23,...,,,,,,,,us,am,us
4,cs000161,0,non technical,10/11/2019,basic,payg,4,level_1_7,level_2_20,level_3_23,...,0.619187,0.83929,0.10201,0.023498,0.013776,0.029261,0.17058,mx,la,latin america


In [16]:
#View data profile report
profile = ProfileReport(sr_ci_cse_geo, title='Consolidate Data Table', minimal=True)
profile.to_notebook_iframe()

In [17]:
#replace all nan values with blanks
sr_ci_cse_geo_select = sr_ci_cse_geo.replace(np.nan, '', regex=True)
sr_ci_cse_geo_select = sr_ci_cse_geo_select.replace('nan', '', regex=True)

In [18]:
sr_ci_cse_geo_select.head()

Unnamed: 0,case_number,likelihood_to_recommend,service_type,opened_date_time,support_plan,account_type,severity,technology_level_1,technology_level_2,technology_level_3,...,sadness_sentiment_last3,sentiment_conversation_last,anger_conversation_last,disgust_conversation_last,fear_conversation_last,joy_conversation_last,sadness_conversation_last,country_code,geography,region
0,cs0003230,1,non technical,11/4/2019,basic,payg,4,level_1_7,level_2_20,level_3_23,...,,,,,,,,us,am,us
1,cs0001378,1,technical,10/29/2019,basic,payg,4,level_1_9,level_2_26,level_3_137,...,,,,,,,,jp,jn,japan
2,cs0001480,1,technical,10/16/2019,basic,payg,4,level_1_1,level_2_22,level_3_92,...,0.268286,-0.580763,0.067375,0.041635,0.07128,0.096182,0.757099,us,am,us
3,cs000389,0,non technical,10/25/2019,basic,payg,4,level_1_7,level_2_20,level_3_23,...,,,,,,,,us,am,us
4,cs000161,0,non technical,10/11/2019,basic,payg,4,level_1_7,level_2_20,level_3_23,...,0.619187,0.83929,0.10201,0.023498,0.013776,0.029261,0.17058,mx,la,latin america


<a id="feature_extraction"></a>
## 4. Preprocessing (Feature Extraction)
[Top](#top)

In [19]:
numcols = ['likelihood_to_recommend', 'assignment_count', 'meaningful_comm_count', 'first_meaningful_comm_duration_mins', 
           'all_avg_meaningful_comm_duration_mins', 'age_of_account_days','life_time_invoice_usd', 'recurring_invoice_usd', 
           'case_duration_days', 'sentiment_overall', 'anger_overall', 'disgust_overall', 'fear_overall', 'joy_overall', 
           'sadness_overall', 'sentiment_conversation_last3', 'anger_sentiment_last3','disgust_sentiment_last3', 
           'fear_sentiment_last3', 'joy_sentiment_last3', 'sadness_sentiment_last3', 'sentiment_conversation_last', 
           'anger_conversation_last', 'disgust_conversation_last', 'fear_conversation_last', 'joy_conversation_last', 
           'sadness_conversation_last']

catcols_dummy = ['service_type', 'support_plan', 'account_type', 'severity', 'case_origination_user_type', 
                 'case_origination_day', 'case_origination_time', 'severity_change']

catcols_hash = ['case_origination_source', 'technology_level_1', 'technology_level_2', 'technology_level_3', 'assignment_group', 
                'country', 'country_code', 'geography', 'region']

In [20]:
#assign to string type to ensure that dummy variable gets created for severity. Need to check further.
sr_ci_cse_geo_select['severity'] = sr_ci_cse_geo_select['severity'].astype(str)

In [21]:
sr_ci_cse_geo_select = pd.concat([sr_ci_cse_geo_select[numcols], pd.get_dummies(sr_ci_cse_geo_select[catcols_dummy]),sr_ci_cse_geo_select[catcols_hash]],axis=1)

sr_ci_cse_geo_select['case_origination_source'] = sr_ci_cse_geo_select['case_origination_source'].apply(hash)
sr_ci_cse_geo_select['technology_level_1'] = sr_ci_cse_geo_select['technology_level_1'].apply(hash)
sr_ci_cse_geo_select['technology_level_2'] = sr_ci_cse_geo_select['technology_level_2'].apply(hash)
sr_ci_cse_geo_select['technology_level_3'] = sr_ci_cse_geo_select['technology_level_3'].apply(hash)
sr_ci_cse_geo_select['assignment_group'] = sr_ci_cse_geo_select['assignment_group'].apply(hash)
sr_ci_cse_geo_select['country'] = sr_ci_cse_geo_select['country'].apply(hash)
sr_ci_cse_geo_select['country_code'] = sr_ci_cse_geo_select['country_code'].apply(hash)
sr_ci_cse_geo_select['geography'] = sr_ci_cse_geo_select['geography'].apply(hash)
sr_ci_cse_geo_select['region'] = sr_ci_cse_geo_select['region'].apply(hash)

features = sr_ci_cse_geo_select.columns
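One caveat about the cell above: Python's built-in `hash` is salted per interpreter session for strings (unless `PYTHONHASHSEED` is fixed), so the integer codes can change between runs. If reproducible codes are needed, one possible alternative, shown here only as a sketch and not as the notebook's method, is to derive the code from a cryptographic digest. The helper name `stable_hash` and the bucket count are assumptions made for this example.

```python
import hashlib

import pandas as pd


def stable_hash(value, n_buckets=2**31):
    """Map a value to an integer code that is identical across Python sessions."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


# Toy column standing in for a high-cardinality categorical feature.
toy = pd.Series(["us", "jp", "us", "mx"])
hashed = toy.apply(stable_hash).tolist()
print(hashed)
```

Equal strings always map to the same code, and the modulo keeps the values in a bounded range, at the cost of possible collisions between different strings.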

In [22]:
sr_ci_cse_geo_select.head()

Unnamed: 0,likelihood_to_recommend,assignment_count,meaningful_comm_count,first_meaningful_comm_duration_mins,all_avg_meaningful_comm_duration_mins,age_of_account_days,life_time_invoice_usd,recurring_invoice_usd,case_duration_days,sentiment_overall,...,severity_change_yes,case_origination_source,technology_level_1,technology_level_2,technology_level_3,assignment_group,country,country_code,geography,region
0,1,0,2,0.033,0.7248,3330.08,34665.8,265.0,2.012377,,...,0,4948275787136104827,-5562142050183539652,8297146999659937850,6865731277712779023,-2784979101041221216,-3561662590124868379,-558677630159279283,-9061349528410791718,-558677630159279283
1,1,0,5,7.4328,28.99656,2289.08,2855.24,0.0,8.037736,,...,0,8367445963618599186,-5022295867498511678,-3267883836195431285,-8519694195436197881,1413779715238436958,6053929623693123893,-5722274166384307053,-3564307182645756841,6053929623693123893
2,1,0,3,669.6828,2878.3276,,78.0,,20.349611,0.544811,...,0,8367445963618599186,8498688021852511973,7518247233398965045,4319341182702921688,8769026428354709771,-3561662590124868379,-558677630159279283,-9061349528410791718,-558677630159279283
3,0,0,1,7206.4998,7206.4998,,0.0,,11.262203,,...,0,8367445963618599186,-5562142050183539652,8297146999659937850,6865731277712779023,-2784979101041221216,-3561662590124868379,-558677630159279283,-9061349528410791718,-558677630159279283
4,0,5,10,32.1498,3276.74166,,0.0,,25.269773,0.512768,...,0,8367445963618599186,-5562142050183539652,8297146999659937850,6865731277712779023,-2784979101041221216,-8603583009532549369,-6656377169217232176,-4115008005223849437,-8086548428285562708


<a id="feature_scaling"></a>
## 5. Preprocessing (Feature Scaling)
[Top](#top)

In [23]:
y = sr_ci_cse_geo_select['likelihood_to_recommend']
X = sr_ci_cse_geo_select.copy()
del X['likelihood_to_recommend']

In [24]:
X.head()

Unnamed: 0,assignment_count,meaningful_comm_count,first_meaningful_comm_duration_mins,all_avg_meaningful_comm_duration_mins,age_of_account_days,life_time_invoice_usd,recurring_invoice_usd,case_duration_days,sentiment_overall,anger_overall,...,severity_change_yes,case_origination_source,technology_level_1,technology_level_2,technology_level_3,assignment_group,country,country_code,geography,region
0,0,2,0.033,0.7248,3330.08,34665.8,265.0,2.012377,,,...,0,4948275787136104827,-5562142050183539652,8297146999659937850,6865731277712779023,-2784979101041221216,-3561662590124868379,-558677630159279283,-9061349528410791718,-558677630159279283
1,0,5,7.4328,28.99656,2289.08,2855.24,0.0,8.037736,,,...,0,8367445963618599186,-5022295867498511678,-3267883836195431285,-8519694195436197881,1413779715238436958,6053929623693123893,-5722274166384307053,-3564307182645756841,6053929623693123893
2,0,3,669.6828,2878.3276,,78.0,,20.349611,0.544811,0.02216,...,0,8367445963618599186,8498688021852511973,7518247233398965045,4319341182702921688,8769026428354709771,-3561662590124868379,-558677630159279283,-9061349528410791718,-558677630159279283
3,0,1,7206.4998,7206.4998,,0.0,,11.262203,,,...,0,8367445963618599186,-5562142050183539652,8297146999659937850,6865731277712779023,-2784979101041221216,-3561662590124868379,-558677630159279283,-9061349528410791718,-558677630159279283
4,5,10,32.1498,3276.74166,,0.0,,25.269773,0.512768,0.44397,...,0,8367445963618599186,-5562142050183539652,8297146999659937850,6865731277712779023,-2784979101041221216,-8603583009532549369,-6656377169217232176,-4115008005223849437,-8086548428285562708


In [25]:
X = X.apply(pd.to_numeric, errors='coerce')

In [26]:
X = X.fillna(X.mean())

scaler = MinMaxScaler(feature_range=(0, 1), copy=True)
scaler.fit(X)

X = pd.DataFrame(scaler.transform(X), index=X.index, columns=X.columns)
X.head()



Unnamed: 0,assignment_count,meaningful_comm_count,first_meaningful_comm_duration_mins,all_avg_meaningful_comm_duration_mins,age_of_account_days,life_time_invoice_usd,recurring_invoice_usd,case_duration_days,sentiment_overall,anger_overall,...,severity_change_yes,case_origination_source,technology_level_1,technology_level_2,technology_level_3,assignment_group,country,country_code,geography,region
0,0.0,0.011494,9.050339e-08,3e-06,0.508541,0.008696,0.000159,0.002746,0.585226,0.201064,...,0.0,0.782363,0.202807,0.97896,0.874756,0.331441,0.305611,0.46921,0.0,0.480527
1,0.0,0.028736,2.038465e-05,0.000101,0.334347,0.008385,0.0,0.01097,0.585226,0.201064,...,0.0,0.984572,0.233414,0.323416,0.030464,0.574398,0.828293,0.18534,0.606647,0.871652
2,0.0,0.017241,0.001836623,0.010026,0.272711,0.008358,0.013397,0.027772,0.772381,0.029297,...,0.0,0.984572,1.0,0.93481,0.73502,1.0,0.305611,0.46921,0.0,0.480527
3,0.0,0.005747,0.01976402,0.025102,0.272711,0.008357,0.013397,0.01537,0.585226,0.201064,...,0.0,0.984572,0.202807,0.97896,0.874756,0.331441,0.305611,0.46921,0.0,0.480527
4,0.416667,0.057471,8.817169e-05,0.011414,0.272711,0.008357,0.013397,0.034487,0.756335,0.586955,...,0.0,0.984572,0.202807,0.97896,0.874756,0.331441,0.031544,0.133987,0.545873,0.035265


In [27]:
len(X.columns)

57

In [28]:
y.head()

0    1
1    1
2    1
3    0
4    0
Name: likelihood_to_recommend, dtype: int64

<a id="dimensionality_reduction"></a>
## 6. Preprocessing (Dimensionality Reduction)
[Top](#top)

### Set some Parameters

In [29]:
feature_name = list(X.columns)
# no of maximum features we need to select
num_feats=30

### 6.1. Pearson correlation
[Top](#top)

This is a filter-based method. We compute the absolute value of Pearson's correlation between the target and each numerical feature in our dataset, and keep the top n features by this criterion.

In [30]:
def cor_selector(X, y,num_feats):
    cor_list = []
    feature_name = X.columns.tolist()
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
    # replace NaN with 0
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    # feature name
    cor_feature = X.iloc[:,np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    # feature selection? 0 for not select, 1 for select
    cor_support = [True if i in cor_feature else False for i in feature_name]
    return cor_support, cor_feature
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

30 selected features
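The same ranking can be sketched more compactly with pandas. The example below is an illustrative alternative on invented data, not a replacement for `cor_selector`; the names `toy_X`, `toy_y`, and `top_features` are hypothetical. A constant column is included to show why undefined correlations are replaced with 0.

```python
import numpy as np
import pandas as pd

# Invented features: one strongly related to the target, one weakly, one constant.
toy_X = pd.DataFrame({
    "strong": [0.0, 0.1, 0.9, 1.0, 0.2, 0.8],
    "weak": [0.5, 0.1, 0.4, 0.2, 0.3, 0.6],
    "constant": [1.0] * 6,
})
toy_y = pd.Series([0, 0, 1, 1, 0, 1])

# The correlation of a constant column is undefined (NaN); treat it as 0.
with np.errstate(invalid="ignore", divide="ignore"):
    abs_corr = toy_X.corrwith(toy_y).abs().fillna(0.0)

# Keep the two features with the largest absolute correlation.
top_features = abs_corr.nlargest(2).index.tolist()
print(top_features)
```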


### 6.2. Chi-Square Features
[Top](#top)

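This is another filter-based method. Here we calculate the chi-square statistic between the target and each numerical variable and keep only the variables with the highest chi-squared values. Because the chi-square test requires non-negative inputs, the features are first rescaled to the [0, 1] range.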

In [31]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
X_norm = MinMaxScaler().fit_transform(X)
chi_selector = SelectKBest(chi2, k=num_feats)
chi_selector.fit(X_norm, y)
chi_support = chi_selector.get_support()
chi_feature = X.loc[:,chi_support].columns.tolist()
print(str(len(chi_feature)), 'selected features')

30 selected features
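To make the behaviour of `SelectKBest(chi2, ...)` concrete, here is a minimal sketch on invented data. One column copies the target and one is constant, so the constant column carries no information about the class and receives a chi-squared score of zero. The array names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

toy_y = np.array([0, 0, 1, 1, 0, 1])
toy_X = np.column_stack([
    toy_y.astype(float),  # informative: identical to the target
    np.ones(6),           # uninformative: same value for every row
])

# chi2 expects non-negative feature values, which holds for this toy matrix.
selector = SelectKBest(chi2, k=1).fit(toy_X, toy_y)
support = selector.get_support()
print(selector.scores_)
print(support)
```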


### 6.3. Recursive Feature Elimination
[Top](#top)

This is a wrapper-based method. Wrapper methods treat the selection of a set of features as a search problem. From the scikit-learn documentation:

The goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a `coef_` attribute or through a `feature_importances_` attribute. Then, the least important features are pruned from current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

Any estimator that exposes one of these attributes can be used with this method. In this case, we use LogisticRegression, and RFE inspects the `coef_` attribute of the LogisticRegression object.

In [32]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=num_feats, step=10, verbose=5)
rfe_selector.fit(X_norm, y)



Fitting estimator with 57 features.
Fitting estimator with 47 features.
Fitting estimator with 37 features.




RFE(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
  n_features_to_select=30, step=10, verbose=5)

In [33]:
rfe_support = rfe_selector.get_support()
rfe_feature = X.loc[:,rfe_support].columns.tolist()
print(str(len(rfe_feature)), 'selected features')

30 selected features
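The elimination schedule is easier to see on a small problem. The sketch below runs RFE on synthetic data from `make_classification`; the sizes, `step`, and variable names are arbitrary choices for illustration. With 10 features, `step=3`, and a target of 4, the estimator is refit on 10, then 7 features, leaving 4.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data; all sizes here are arbitrary illustration values.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    step=3,  # drop three features per round: 10 -> 7 -> 4
)
rfe.fit(X_toy, y_toy)

n_selected = int(rfe.support_.sum())
print(n_selected)
print(rfe.ranking_)  # selected features have rank 1
```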


### 6.4. Lasso: SelectFromModel
[Top](#top)

This is an embedded method. Embedded methods use algorithms that have feature selection built in. For example, Lasso and random forests have their own feature selection mechanisms. The Lasso (L1) regularizer forces many feature weights to zero. Here we use L1-regularized logistic regression to select variables.

In [34]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

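#liblinear supports the L1 penalty; the default solver in recent scikit-learn releases (lbfgs) does not
embeded_lr_selector = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"), max_features=num_feats)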
embeded_lr_selector.fit(X_norm, y)



SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
        max_features=30, norm_order=1, prefit=False, threshold=None)

In [35]:
embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = X.loc[:,embeded_lr_support].columns.tolist()
print(str(len(embeded_lr_feature)), 'selected features')

30 selected features
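As a rough illustration of how the L1 penalty drives selection, the sketch below fits an L1-regularized logistic regression on synthetic data and lets `SelectFromModel` keep at most five features. The data, the value of `C`, and the variable names are assumptions made for this example; a smaller `C` means stronger regularization and typically more zeroed coefficients.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic data; sizes and C are arbitrary illustration values.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# liblinear is one of the solvers that supports the L1 penalty.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model, max_features=5).fit(X_toy, y_toy)

n_kept = int(selector.get_support().sum())
n_zero_coefs = int((selector.estimator_.coef_ == 0).sum())
print(n_kept, n_zero_coefs)
```

Features whose coefficients are shrunk to (near) zero fall below the selection threshold, and `max_features` caps how many of the remaining ones are kept.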


### 6.5. Tree-based: SelectFromModel
[Top](#top)

This is also an embedded method. We can use a random forest to select features based on feature importance, which is calculated from the node impurities in each decision tree. In a random forest, the final feature importance is the average of the feature importances of all the decision trees.

In [36]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=num_feats)
embeded_rf_selector.fit(X, y)

SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
        max_features=30, norm_order=1, prefit=False, threshold=None)

In [37]:
embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = X.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)), 'selected features')

17 selected features


In [38]:
pd.set_option('display.max_rows', None)
# put all selection together
feature_selection_df = pd.DataFrame({'Feature':feature_name, 'Pearson':cor_support, 'Chi-2':chi_support, 'RFE':rfe_support, 'Logistics':embeded_lr_support,
                                    'Random Forest':embeded_rf_support})
# count how many selection methods picked each feature (boolean columns only)
feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)
# sort by vote count and display the top num_feats features
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(num_feats)

| | Feature | Pearson | Chi-2 | RFE | Logistics | Random Forest | Total |
|---|---|---|---|---|---|---|---|
| 1 | region | True | True | True | True | True | 5 |
| 2 | meaningful_comm_count | True | True | True | True | True | 5 |
| 3 | first_meaningful_comm_duration_mins | True | True | True | True | True | 5 |
| 4 | country_code | True | True | True | True | True | 5 |
| 5 | case_duration_days | True | True | True | True | True | 5 |
| 6 | assignment_count | True | True | True | True | True | 5 |
| 7 | all_avg_meaningful_comm_duration_mins | True | True | True | True | True | 5 |
| 8 | support_plan_free | True | True | True | True | False | 4 |
| 9 | severity_change_yes | True | True | True | True | False | 4 |
| 10 | severity_2 | True | True | True | True | False | 4 |


In [39]:
#Select the key features
num_feats = 17
feature_selection_df_select = feature_selection_df.head(num_feats)
feature_selection_df_select.Feature.unique()

array(['region', 'meaningful_comm_count',
       'first_meaningful_comm_duration_mins', 'country_code',
       'case_duration_days', 'assignment_count',
       'all_avg_meaningful_comm_duration_mins', 'support_plan_free',
       'severity_change_yes', 'severity_2', 'severity_1',
       'sentiment_overall', 'sentiment_conversation_last3',
       'sentiment_conversation_last', 'age_of_account_days',
       'account_type_subscription', 'technology_level_3'], dtype=object)

In [40]:
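#Note: this hard-coded list does not exactly match the array printed above (for example, it uses 'technology_level_2' and 'geography' rather than 'region' and 'technology_level_3')
X_select = X[['technology_level_2', 'meaningful_comm_count', 'geography',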
       'first_meaningful_comm_duration_mins', 'country_code',
       'case_duration_days', 'all_avg_meaningful_comm_duration_mins',
       'severity_2', 'severity_1', 'severity_change_yes',
       'sentiment_overall', 'sentiment_conversation_last3',
       'sentiment_conversation_last', 'joy_sentiment_last3',
       'assignment_group', 'assignment_count', 'age_of_account_days']]

<a id="split_data"></a>
## 7. Split data into train and test sets
[Top](#top)

In [41]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X_select, y, test_size=0.30, random_state=42)

print("train and test data shape=")
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

print("testing data likelihood=")
print("1_Promoter=",np.sum(y_test))
print("0_Non Promoter=",len(y_test)-np.sum(y_test))

train and test data shape=
(2588, 17)
(1110, 17)
(2588,)
(1110,)
testing data likelihood=
1_Promoter= 441
0_Non Promoter= 669
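The split above is purely random, so the promoter share in the test set can drift from the share in the full data. If keeping the class proportions identical matters, `train_test_split` accepts a `stratify` argument. The sketch below shows this on invented data; it is an optional variation, not what the notebook does, and the array names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 100 rows, 40% promoters (1) and 60% non-promoters (0).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 40 + [0] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print("test promoters =", int(y_te.sum()), "of", len(y_te))
```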


<a id="model_selection"></a>
## 8. Model Selection
[Top](#top)

In [42]:
from sklearn import preprocessing

from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler

def calculate_metrics(y_true,y_pred):
    print(precision_recall_fscore_support(y_true, y_pred,average='macro'))
    print(accuracy_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred,labels=[0,1]))

In [43]:
from sklearn.linear_model import LogisticRegression
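#class_weight="auto" is not accepted by current scikit-learn releases; "balanced" is the supported option (the output below is from the original run and may differ)
clf1 = LogisticRegression(random_state=0, solver='lbfgs',class_weight="balanced").fit(x_train, y_train)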
y_pred= clf1.predict(x_test)
calculate_metrics(y_test,y_pred)

(0.6224477535425061, 0.5518135505323206, 0.5126266071002286, None)
0.6288288288288288
[[620  49]
 [363  78]]


In [44]:
from sklearn.linear_model import SGDClassifier
clf2 = SGDClassifier(max_iter=1000, tol=1e-3,class_weight="balanced").fit(x_train, y_train)
y_pred= clf2.predict(x_test)
calculate_metrics(y_test,y_pred)

(0.6011648745519713, 0.5574011368373957, 0.4554798669048153, None)
0.48558558558558557
[[139 530]
 [ 41 400]]


In [45]:
from sklearn import svm
clf3 = svm.SVC(gamma='scale',class_weight="balanced").fit(x_train, y_train)
y_pred= clf3.predict(x_test)
calculate_metrics(y_test,y_pred)

(0.5993867583920549, 0.5947042494127696, 0.55946649316851, None)
0.5603603603603604
[[286 383]
 [105 336]]


In [46]:
from sklearn.neighbors import KNeighborsClassifier
clf4 = KNeighborsClassifier(n_neighbors=3).fit(x_train, y_train)
y_pred= clf4.predict(x_test)
calculate_metrics(y_test,y_pred)

(0.5310666132310422, 0.5294276155903318, 0.5288326239940417, None)
0.5594594594594594
[[452 217]
 [272 169]]


In [47]:
from sklearn.gaussian_process import GaussianProcessClassifier
clf5 = GaussianProcessClassifier(random_state=0).fit(x_train, y_train)
y_pred= clf5.predict(x_test)
calculate_metrics(y_test,y_pred)

(0.6202738165782933, 0.5529727586101706, 0.5159630992248128, None)
0.6288288288288288
[[617  52]
 [360  81]]


In [49]:
from sklearn import tree 
clf8 = tree.DecisionTreeClassifier(class_weight="balanced").fit(x_train, y_train) 
y_pred= clf8.predict(x_test) 
calculate_metrics(y_test,y_pred)

(0.5759722640940609, 0.5751654922058509, 0.5754838720668064, None)
0.5954954954954955
[[451 218]
 [231 210]]


In [50]:
from sklearn.ensemble import RandomForestClassifier 
clf9 = RandomForestClassifier(n_estimators=10, max_depth=None,min_samples_split=2, random_state=0,class_weight="balanced").fit(x_train, y_train) 
y_pred= clf9.predict(x_test) 
calculate_metrics(y_test,y_pred)

(0.5789764544327964, 0.5639394771361459, 0.5590433139964274, None)
0.6099099099099099
[[527 142]
 [291 150]]


In [51]:
from sklearn.ensemble import GradientBoostingClassifier
clf10 = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(x_train, y_train)
y_pred= clf10.predict(x_test)
calculate_metrics(y_test,y_pred)

(0.6012262380445897, 0.5835494137864413, 0.5809701315611903, None)
0.6270270270270271
[[532 137]
 [277 164]]


In [52]:
from sklearn.ensemble import VotingClassifier
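# clf9 is the random forest and clf10 is the gradient boosting model, so the estimators are labelled accordingly
clf11 = VotingClassifier(estimators=[('rf', clf9), ('gb', clf10)], voting='hard').fit(x_train, y_train)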
y_pred= clf11.predict(x_test)
calculate_metrics(y_test,y_pred)

(0.5909391382887512, 0.5470953702856329, 0.5168145735886558, None)
0.618018018018018
[[597  72]
 [352  89]]
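
With only two estimators, every disagreement in hard voting is a tie. scikit-learn documents that ties are resolved by the ascending sort order of the class labels, so class 0 wins each tie. The sketch below reproduces that rule in plain Python on made-up predictions; `hard_vote` and the two prediction lists are illustrative and not taken from the notebook.

```python
from collections import Counter


def hard_vote(labels):
    """Majority vote over class labels; ties go to the smallest label."""
    counts = Counter(labels)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


# Made-up predictions from two hypothetical models on four samples
model_a = [0, 1, 1, 0]
model_b = [0, 0, 1, 1]
ensemble = [hard_vote(pair) for pair in zip(model_a, model_b)]
print(ensemble)  # [0, 0, 1, 0]
```

Under this rule the ensemble predicts class 1 only when both models agree on it. That may explain why the voting classifier above predicts class 1 for only 161 test rows (72 + 89), fewer than either of its members.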


<a id="performance_metric"></a>
## 9. Performance Metrics
[Top](#top)
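
As a starting point for this section, the results printed in Section 8 can be placed side by side. The sketch below ranks the candidates by macro F1. The numbers are copied from the outputs above and rounded to four decimals; the choice of macro F1 as the ranking key is an assumption made for illustration, not a criterion stated in the notebook.

```python
# (macro precision, macro recall, macro F1, accuracy), rounded from the Section 8 outputs
reported = {
    "LogisticRegression":         (0.6224, 0.5518, 0.5126, 0.6288),
    "SGDClassifier":              (0.6012, 0.5574, 0.4555, 0.4856),
    "SVC":                        (0.5994, 0.5947, 0.5595, 0.5604),
    "KNeighborsClassifier":       (0.5311, 0.5294, 0.5288, 0.5595),
    "GaussianProcessClassifier":  (0.6203, 0.5530, 0.5160, 0.6288),
    "DecisionTreeClassifier":     (0.5760, 0.5752, 0.5755, 0.5955),
    "RandomForestClassifier":     (0.5790, 0.5639, 0.5590, 0.6099),
    "GradientBoostingClassifier": (0.6012, 0.5835, 0.5810, 0.6270),
    "VotingClassifier":           (0.5909, 0.5471, 0.5168, 0.6180),
}

# Sort by macro F1 (index 2), best first
ranked = sorted(reported.items(), key=lambda item: item[1][2], reverse=True)

print(f"{'model':<28}{'prec':>8}{'recall':>8}{'F1':>8}{'acc':>8}")
for name, (prec, rec, f1_score, acc) in ranked:
    print(f"{name:<28}{prec:>8.4f}{rec:>8.4f}{f1_score:>8.4f}{acc:>8.4f}")
```

On these numbers the gradient boosting model has the highest macro F1 and an accuracy within 0.002 of the best, which is consistent with `clf10` being the model stored in the Deployment section.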

<a id="evaluate_metric"></a>
## 10. Evaluation
[Top](#top)

<a id="deployment"></a>
## 11. Deployment
[Top](#top)

In [47]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

wml_credentials = {
                   "url": "https://us-south.ml.cloud.ibm.com",
                   "apikey":"insert api key",
                   "instance_id": "insert instance id"
                  }

client = WatsonMachineLearningAPIClient(wml_credentials)

In [None]:
# This step is run once to save the model; rerun it only when retraining.
stored_model_details = client.repository.store_model(clf10, 'Final_Model')

In [49]:
# Fetch and print the details of the stored model; rerun it only after storing a retrained model.
import json
published_model_uid = client.repository.get_model_uid(stored_model_details)
model_details = client.repository.get_details(published_model_uid)
print(json.dumps(model_details, indent=2))

{
  "metadata": {
    "guid": "4e59ac4d-8471-4067-8758-2c38e776d6bb",
    "url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/4e3826ef-cddb-4851-90f8-6a0c98e62c8a/published_models/4e59ac4d-8471-4067-8758-2c38e776d6bb",
    "created_at": "2020-01-20T09:08:40.273Z",
    "modified_at": "2020-01-20T09:08:40.328Z"
  },
  "entity": {
    "runtime_environment": "python-3.6",
    "learning_configuration_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/4e3826ef-cddb-4851-90f8-6a0c98e62c8a/published_models/4e59ac4d-8471-4067-8758-2c38e776d6bb/learning_configuration",
    "name": "Training_Model_Jan20th2019",
    "learning_iterations_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/4e3826ef-cddb-4851-90f8-6a0c98e62c8a/published_models/4e59ac4d-8471-4067-8758-2c38e776d6bb/learning_iterations",
    "feedback_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/4e3826ef-cddb-4851-90f8-6a0c98e62c8a/published_models/4e59ac4d-8471-4067-8758-2c38e776d6bb/feedback",
    "lat

In [50]:
# This step is run once to deploy the stored model; rerun it only after retraining.
created_deployment = client.deployments.create(published_model_uid, name="Training_Model_Jan20th2019")
scoring_endpoint = client.deployments.get_scoring_url(created_deployment)



#######################################################################################

Synchronous deployment creation for uid: '4e59ac4d-8471-4067-8758-2c38e776d6bb' started

#######################################################################################


INITIALIZING
DEPLOY_SUCCESS


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='83101f36-5c62-4645-9739-f3327df97b54'
------------------------------------------------------------------------------------------------




In [51]:
def predict(row):
    # Online scoring URL of the deployment created above
    scoring_endpoint = 'https://us-south.ml.cloud.ibm.com/v3/wml_instances/4e3826ef-cddb-4851-90f8-6a0c98e62c8a/deployments/83101f36-5c62-4645-9739-f3327df97b54/online'
    # list(row) is positional, so the field order must match the column order of `row`
    scoring_payload = {'fields': ['support_plan_free','all_avg_meaningful_comm_duration_mins','age_of_account_days','sentiment_overall',
                                  'sentiment_conversation_last3','technology_level_1','geography','sadness_overall','recurring_invoice_usd',
                                  'support_plan_basic','case_duration_days','region','sadness_sentiment_last3','anger_conversation_last',
                                  'meaningful_comm_count','sentiment_conversation_last'],
                       'values': [list(row)]}
    # Up to 5 attempts in total; only 503/504 responses are retried
    num_retries = 5
    while True:
        try:
            predictions = client.deployments.score(scoring_endpoint, scoring_payload)
            break
        except Exception as ex:
            if ('Status code: 504' in str(ex) or 'Status code: 503' in str(ex)) and num_retries > 1:
                num_retries -= 1
            else:
                # Any other error, or the last failed attempt, is re-raised unchanged
                raise
    # The first element (target) is a fixed 0; the second is read from the first scored row
    return [0, predictions['values'][0][1][0]]
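
The retry logic inside `predict` can also be kept separate from the scoring call. The sketch below is one possible way to do that, under stated assumptions: `call_with_retries`, `flaky_score` and the optional `base_delay` back-off are hypothetical additions, and the stub only imitates a service that fails twice with a 503 message before succeeding. It does not call Watson Machine Learning.

```python
import time


def call_with_retries(func, max_attempts=5,
                      retry_markers=("Status code: 503", "Status code: 504"),
                      base_delay=0.0):
    """Call func(); retry only when the error message contains a retry marker."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as ex:
            transient = any(marker in str(ex) for marker in retry_markers)
            if not transient or attempt == max_attempts:
                raise
            # Optional exponential back-off (an addition; the original loop retries immediately)
            time.sleep(base_delay * 2 ** (attempt - 1))


# Stub standing in for the scoring call: fails twice, then succeeds
calls = {"n": 0}

def flaky_score():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Status code: 503")
    return "scored"


result_value = call_with_retries(flaky_score)
print(result_value, "after", calls["n"], "attempts")
```

In the notebook, the stub would be replaced by a small function or lambda wrapping `client.deployments.score(scoring_endpoint, scoring_payload)`.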

In [52]:
def predict_batch(df):
    df_temp = df.copy()
    df_temp[['target', 'probability']] = df_temp.apply(lambda row: pd.Series(predict(row)), axis=1)
    return df_temp

In [53]:
result = predict_batch(sn_logs_geo_one_hot_hash_select[0:10])
#result = predict_batch(X_select)
result

Unnamed: 0,support_plan_free,all_avg_meaningful_comm_duration_mins,age_of_account_days,sentiment_overall,sentiment_conversation_last3,technology_level_1,geography,sadness_overall,recurring_invoice_usd,support_plan_basic,case_duration_days,region,sadness_sentiment_last3,anger_conversation_last,meaningful_comm_count,sentiment_conversation_last,target,probability
0,0.0,2.524671e-06,0.508541,0.585226,0.606882,0.481217,0.360193,0.32837,0.000159,1.0,0.002746,0.949161,0.288259,0.1153,0.011494,0.606733,0.0,0.466827
1,0.0,0.0001010027,0.334347,0.585226,0.606882,0.480522,0.760082,0.32837,0.0,1.0,0.01097,1.0,0.288259,0.1153,0.028736,0.606733,0.0,0.693498
2,0.0,0.01002598,0.272711,0.772381,0.772381,0.177455,0.360193,0.300531,0.013397,1.0,0.027772,0.949161,0.300531,0.076964,0.017241,0.205389,0.0,0.488599
3,0.0,0.02510216,0.272711,0.585226,0.606882,0.481217,0.360193,0.32837,0.013397,1.0,0.01537,0.949161,0.288259,0.1153,0.005747,0.606733,0.0,0.686577
4,0.0,0.01141376,0.272711,0.756335,0.743332,0.481217,0.098661,0.566497,0.013397,1.0,0.034487,0.385081,0.693606,0.116529,0.057471,0.919218,0.0,0.602158
5,1.0,6.827202e-05,0.272711,0.993261,0.993261,0.039177,0.360193,0.114574,0.013397,0.0,0.013892,0.949161,0.114574,0.144255,0.017241,0.497326,0.0,0.118487
6,0.0,0.02006596,0.126853,0.585226,0.606882,0.625975,0.098661,0.32837,0.005988,1.0,0.026549,0.385081,0.288259,0.1153,0.011494,0.606733,0.0,0.646449
7,0.0,0.0008206718,0.272711,0.779339,0.779339,0.625975,0.098661,0.191532,0.013397,0.0,0.014833,0.385081,0.191532,0.019547,0.022989,0.874041,0.0,0.409489
8,1.0,0.01846395,0.272711,0.830189,0.830189,0.481217,0.360193,0.039115,0.013397,0.0,0.017983,0.949161,0.039115,0.02297,0.011494,0.829217,0.0,0.471795
9,0.0,0.0006142275,0.272711,0.966133,0.966133,0.481217,0.360193,0.007798,0.013397,1.0,0.008762,0.409469,0.007798,0.26722,0.028736,0.965678,0.0,0.339685


In [None]:
result_probability = result[['target','probability']]
result_probability.head()

In [55]:
#join the predictions with the original data set
final_output = result_probability.join(sr_ci_cse_geo)

In [56]:
#save file in object storage
file_name = "final_output_testing %s" % dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S.csv")
csv_text = final_output.to_csv(index=False, quoting=csv.QUOTE_ALL, line_terminator="\r\n")
project.save_data(file_name, csv_text, set_project_asset=True, overwrite=True)

{'file_name': 'final_output_testing 2020-01-20-09-25-37.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'test-donotdelete-pr-kqb0wncv5si2pf',
 'asset_id': '0c931432-01e2-4468-bcca-f0e3f663a84f'}

<a id="references"></a>
## 12. References
[Top](#top)
    
To be added (in progress).