# An Experiential Journey With Data to Inspire Your Work

## Introduction 

The Experiential Journey with Data to Inspire Your Work session will make you think differently about data and how it can solve problems. You will hear surprising use cases that will make you think, sometimes laugh, and hopefully inspire your own work. The use cases and introductory material include a hands-on experiential journey, described below. The most valuable part of this session is that it is designed to help you gain experience and relate it to your work, so that when you leave you have a plan of action for making data more useful in solving a key challenge in your organization.

A real business application of analytics, “Improving Customer Experiences with Real-Time Insights”, will be used as an example during the workshop. This experiential session includes a step-by-step journey on “How data science is helping companies to predict the customer experience journey and proactively address the issues, leading to the improvement of Net Promoter Score”. The session will also highlight the importance of using AI, Canvas, CRISP-DM (Cross Industry Standard Process for Data Mining) and Agile in data science projects.

The methodology involves consuming historical Net Promoter Score (NPS) data, using machine learning and artificial intelligence to identify the most important features, and creating an algorithm to predict the customer experience.

## Background

NPS has become the industry standard customer loyalty measurement. Businesses see customer experience as an imperative and would like to run analytics on and predict customer experience. Since competition is rife, keeping customers happy so they do not move their investments elsewhere is key to maintaining profitability.
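
For context, NPS is conventionally computed from answers to a 0 to 10 "how likely are you to recommend us" question: respondents who answer 9 or 10 count as promoters, those who answer 0 to 6 count as detractors, and the score is the percentage of promoters minus the percentage of detractors. The sketch below shows that arithmetic on made-up survey answers; the function name `net_promoter_score` is illustrative and not part of the workshop code.

```python
def net_promoter_score(responses):
    """Return the NPS (between -100 and 100) for a list of 0-10 survey answers."""
    if not responses:
        raise ValueError("need at least one response")
    promoters = sum(1 for r in responses if r >= 9)    # answers of 9 or 10
    detractors = sum(1 for r in responses if r <= 6)   # answers of 0 to 6
    return 100.0 * (promoters - detractors) / len(responses)


# Made-up answers: 4 promoters, 3 passives (7 or 8), 3 detractors
sample = [10, 9, 9, 10, 8, 7, 8, 3, 6, 5]
print(net_promoter_score(sample))  # 10.0
```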

Improving the customer experience is valuable because of its effect on the bottom line. Our goal is to create an experience that appeals to both the heart and the head: customers give their money, fans give their hearts. 44% of consumers say that the majority of customer experiences are bland, and 69% say that emotions count for half of their experiences.


## Approach

In this notebook, we'll use scikit-learn to predict the customer experience. scikit-learn, a machine learning library for the Python programming language, provides implementations of many classification algorithms.
Here, we will apply multiple classification algorithms, evaluate their performance, and select the best-performing algorithm based on performance metrics.

To help visualize what we are doing, we'll use 2D and 3D charts to show how the classes look, using the matplotlib and scikitplot Python libraries.

<a id="top"></a>
## Table of Contents

1. [Introduction to Notebook](#getting_started)
2. [Load packages and verify the version](#load_libraries)
3. [Data exploration](#explore_data)
4. [Feature Extraction](#prepare_data)
5. [Feature Scaling](#feature_extraction)
6. [Feature Selection](#feature_scaling)
7. [Split data into train and test sets](#split_data)
8. [Measure Model Performance](#model_selection)
9. [Evaluate and Select Model](#performance_metric)
10. [Save the Model](#evaluate_model)
11. [Deploy the Model](#deployment)
12. [Make Predictions](#interpretation)

<a id="getting_started"></a>
## 1. Introduction to Notebook
[Top](#top)

A quick set of instructions for working through the notebook. If you are new to notebooks, here is a quick overview of how to work in this environment.

**a.** A notebook is a document that captures all inputs and outputs of your operations. This includes code, text input, and numerical, text, and rich media output. Notebook files have the `.ipynb` extension.

**b.** A notebook has 3 types of cells: **code cells**, **markdown cells** (text), and raw cells.

   - A code cell allows you to edit and write new code, with full syntax highlighting and tab completion.

   - A markdown cell allows you to document the computational process in a literate way, alternating descriptive rich text with code.

   - A raw cell provides a place in which you can write output directly; its content is not evaluated by the notebook.

**c.** Each cell with code can be executed independently or together with others (see the options under the Cell menu). When working in this notebook, we will run one cell at a time to provide a hands-on experiential journey.


**d.** To run a cell, place the cursor in the code cell and click the Run (arrow) icon. The cell is running while you see the * next to it. Some cells have printable output.


**e.** Work through this notebook by reading the instructions and executing code cell by cell. Some cells will require modifications before you run them. 

<a id="load_libraries"></a>
## 2. Load packages and verify the version
[Top](#top)

A module is a set of functions, globals, and classes that you can import.

A package, or library, is a collection of modules. Conceptually, there is no difference between a package and a Python library.
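
The distinction can be checked from Python itself. As a small illustration using only the standard library, a package is a module that carries a `__path__` attribute pointing at the directory of modules it contains; the helper name `is_package` is made up for this sketch.

```python
import csv     # a single-file module from the standard library
import email   # a package: a directory that contains several modules


def is_package(module):
    """A package is a module that exposes a __path__ attribute."""
    return hasattr(module, "__path__")


print(is_package(csv))    # False
print(is_package(email))  # True
```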

In [None]:
%%capture
%matplotlib inline
#Built-in magic commands https://ipython.readthedocs.io/en/stable/interactive/magics.html

#Load packages and libraries

#Provides information about constants, functions and methods of the Python interpreter 
#(https://docs.python.org/3/library/sys.html)
import sys 

#Scientific Computing (https://numpy.org/)
import numpy as np 

#Data manipulation and Analysis (https://pandas.pydata.org/pandas-docs/stable/)
!pip install --user --upgrade pandas
import pandas as pd

#Import and export spreadsheets and databases (https://docs.python.org/3/library/csv.html)
import csv 

#Manipulate dates and times (https://docs.python.org/3/library/datetime.html#module-datetime)
from datetime import datetime
import time

#Exploratory data analysis reports helps with quick data analysis (https://github.com/sfu-db/dataprep#dataprep)
!pip install dataprep
from dataprep.eda import plot, plot_missing, plot_correlation

#Bokeh is an interactive visualization library for modern web browsers (https://docs.bokeh.org/en/latest/index.html#)
!pip install bokeh 
from bokeh.resources import INLINE
import bokeh.io
bokeh.io.output_notebook(INLINE)

#Profile reports helps with quick data analysis(https://github.com/pandas-profiling/pandas-profiling)
!pip install pandas-profiling[notebook]
from pandas_profiling import ProfileReport

#Prerequisite for pandas-profiling (https://ipywidgets.readthedocs.io/en/latest/)
!pip install ipywidgets==7.5.1 
from ipywidgets import widgets

#Plotting library https://matplotlib.org/
!pip install matplotlib
import matplotlib
import matplotlib.pyplot as plt

#Python client library to quickly get started with the various Watson Developer Cloud services
!pip install ibm_watson 

#Token-based Identity and Access Management (IAM) authentication https://github.com/watson-developer-cloud/python-sdk
#allows error handling in more complex programs #https://ibm-watson-iot.github.io/iot-python/exceptions/
from ibm_watson import ApiException 
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator  

#Analyzes concepts, entities, keywords, categories, sentiment, emotion, relations, and semantic roles 
#github https://ibm.biz/Bdqf2U demo https://ibm.biz/Bdqf25
from ibm_watson import NaturalLanguageUnderstandingV1 
from ibm_watson.natural_language_understanding_v1 import Features, SentimentOptions, EmotionOptions

#Provides complete access to the IBM Cloud Object Storage API (ibm_boto3 is installed by the ibm-cos-sdk package)
!pip install ibm-cos-sdk
import ibm_boto3
from ibm_botocore.client import Config, ClientError

#Classification, Regression, Clustering, Dimensionality Reduction, Model Selection and Preprocessing (https://scikit-learn.org/)
!pip install scikit-learn
import sklearn
#Preprocessing data #https://scikit-learn.org/stable/modules/preprocessing.html
from sklearn import preprocessing 
#Perform a train-test split
from sklearn.model_selection import train_test_split
#Transforms between zero and one
from sklearn.preprocessing import MinMaxScaler 
#Model evaluation metrics
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

#Statistical graphics https://seaborn.pydata.org/introduction.html
import seaborn as sns

In [None]:
#Check version
print("Python %d.%d.%d" % sys.version_info[:3])
print("Pandas %s"%pd.__version__)
print("Numpy %s"%np.__version__)
print("Scikit-learn %s"%sklearn.__version__)
print("CSV %s"%csv.__version__)
print("IBM boto3 %s"%ibm_boto3.__version__)
print("Matplotlib %s"%matplotlib.__version__)

<a id="explore_data"></a>
## 3. Data exploration
[Top](#top)

<a id="load_libraries"></a>
### 3.1. Load and read the files from GitHub

[Top](#top)

In [None]:
#assign the URL of the file
nps_url = "https://raw.githubusercontent.com/neemadan/An-Experiential-Journey-With-Data-to-Inspire-Your-Work/master/nps_dataset.csv"

#read the file from the link and store it in a dataframe
nps = pd.read_csv(nps_url)

<a id="load_libraries"></a>
### 3.2. Explore the data and perform quality audit [DataPrep.eda and Pandas Profiling](https://towardsdatascience.com/exploratory-data-analysis-dataprep-eda-vs-pandas-profiling-7137683fe47f)

[Top](#top)

#### Option 1: DataPrep.eda (2020) is a Python library for doing EDA produced by SFU’s Data Science Research Group.

In [None]:
#Generate Data Summary
bokeh.io.output_notebook(INLINE)
plot(nps)

In [None]:
#Analyze select column for correlation
plot_correlation(nps, 'sentiment_overall')

#### Option 2: Pandas Profiling is an open source Python module with which we can quickly do an exploratory data analysis with just a few lines of code.

#This package takes time (approx 20-30 mins), so please feel free to change this Markdown cell to a Code cell and run it after the workshop.
profile = ProfileReport(nps, title="Pandas Profiling Report")
profile

<a id="prepare_data"></a>
## 4. Feature Extraction
[Top](#top)

In [None]:
#This step assigns the columns by type for transformation to be applied (numerical, categorical, and categorical with high cardinality)
numcols = ['likelihood_to_recommend','assignment_count','meaningful_comm_count', 'first_meaningful_comm_duration_mins', 'all_avg_meaningful_comm_duration_mins',
           'age_of_account_days', 'life_time_spend_usd', 'monthly_recurring_revenue_usd', 'ticket_duration_days', 'sentiment_overall', 'anger_overall', 'disgust_overall', 
           'fear_overall', 'joy_overall', 'sadness_overall', 'sentiment_last3_conversation', 'anger_sentiment_last3', 'disgust_sentiment_last3', 
           'fear_sentiment_last3', 'joy_sentiment_last3', 'sadness_sentiment_last3', 'sentiment_last_conversation', 'anger_last_conversation', 'disgust_last_conversation', 
           'fear_last_conversation', 'joy_last_conversation', 'sadness_last_conversation', 'sentiment_short_description', 'anger_short_description',
           'disgust_short_description', 'fear_short_description', 'joy_short_description', 'sadness_short_description', 'sentiment_description', 'anger_description', 
           'disgust_description', 'fear_description', 'joy_description', 'sadness_description']
           
catcols_dummy = ['support_plan','account_type', 'sr_severity','technology_level_2', 'technology_level_3','case_origination_source', 'case_origination_user_type', 
                 'dayofweek', 'timewindow', 'severity_change','tribe_level_1', 'tribe_level_2']

catcols_hash = ['technology_level_1','catalog_name', 'country', 'geography', 'region']

In [None]:
#This step performs one hot encoding on categorical variables and hashing on categorical variables with high cardinality
nps_select = pd.concat([nps[numcols], pd.get_dummies(nps[catcols_dummy]),nps[catcols_hash]],axis=1)

for cat in catcols_hash:
    nps_select[cat] = nps_select[cat].apply(hash)
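
One caveat with the built-in `hash`: for strings, Python randomizes it per interpreter process unless `PYTHONHASHSEED` is fixed, so the same country name can map to a different number when the notebook is rerun or when a deployed model is scored later. A minimal sketch of a deterministic alternative, assuming a SHA-256 digest reduced to a fixed number of buckets is acceptable; `stable_hash` and `n_buckets` are hypothetical names, not part of the workshop code.

```python
import hashlib


def stable_hash(value, n_buckets=2**20):
    """Map a value to a bucket index that is identical across Python sessions."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


# Made-up category values: equal inputs always get equal codes
codes = [stable_hash(c) for c in ["Canada", "India", "Canada"]]
print(codes[0] == codes[2])  # True
```

Under that assumption, the loop above could call `nps_select[cat].apply(stable_hash)` in place of `apply(hash)`.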

<a id="feature_extraction"></a>
## 5. Feature Scaling
[Top](#top)

In [None]:
#This step assigns the target variable to y and other features to X
y = nps_select['likelihood_to_recommend']
X = nps_select.copy()
del X['likelihood_to_recommend']

In [None]:
#This step converts all the values in the dataframe to numeric
X = X.apply(pd.to_numeric, errors='coerce')

In [None]:
#This step assigns mean value to blank cells and thereafter uses MinMax to scale the data

X = X.fillna(X.mean())

scaler = MinMaxScaler(feature_range=(0, 1), copy=True)
scaler.fit(X)

X = pd.DataFrame(scaler.transform(X), index=X.index, columns=X.columns)
display(X.head())
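
To make the two operations in the cell above concrete, here is a plain-Python sketch of mean imputation followed by min-max scaling, x' = (x - min) / (max - min), on a single made-up column. The function names are illustrative; the notebook itself relies on `fillna` and `MinMaxScaler`.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [10, None, 30, 50]     # made-up column with one blank cell
filled = impute_mean(ages)    # the blank becomes the mean, 30.0
print(min_max_scale(filled))  # [0.0, 0.5, 0.5, 1.0]
```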

<a id="feature_scaling"></a>
## 6. Feature Selection
[Top](#top)

This is a filter-based method.  We check the absolute value of the Pearson's correlation between the target and numerical features in our dataset. We keep the top n features based on this criterion.
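
As a toy illustration of the idea, the sketch below computes Pearson's r by hand and keeps the k features with the largest absolute correlation to the target. The data and the names `pearson_r` and `top_features` are made up for this example; the notebook's own selector uses `np.corrcoef`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:  # constant input: r is undefined, treat as 0
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_features(features, target, k):
    """Return the k feature names with the largest |r| against the target."""
    scores = {name: abs(pearson_r(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up data: 'joy' tracks the target, 'anger' opposes it, 'noise' is unrelated
target = [0, 0, 1, 1]
features = {
    "joy":   [0.1, 0.2, 0.8, 0.9],
    "anger": [0.9, 0.7, 0.2, 0.1],
    "noise": [0.5, 0.4, 0.4, 0.5],
}
print(top_features(features, target, k=2))  # ['joy', 'anger']
```

Taking the absolute value matters here: 'anger' is strongly negatively correlated with the target, which makes it just as informative as a strongly positive feature.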

In [None]:
feature_name = list(X.columns)
# maximum number of features to select
num_feats=30

In [None]:
# Pearson's Correlation
def cor_selector(X, y,num_feats):
    cor_list = []
    feature_name = X.columns.tolist()
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
    # replace NaN with 0
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    # feature name
    cor_feature = X.iloc[:,np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    # selection mask: True if the feature was selected, False otherwise
    cor_support = [i in cor_feature for i in feature_name]
    return cor_support, cor_feature
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features:')
print(cor_feature)

In [None]:
#Select the top features to use as input to train the model
top_30_select = X[cor_feature[:num_feats]]
top_30_select

<a id="split_data"></a>
## 7. Split data into train and test sets

[Top](#top)

In [None]:
#This step performs a train-test split

X_train, X_test, y_train, y_test = train_test_split(top_30_select, y, test_size=0.30, random_state=123)
print("train and test data shape=")
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

print("testing data likelihood=")
print("1_Promoter=",np.sum(y_test))
print("0_Non Promoter=",len(y_test)-np.sum(y_test))

<a id="model_selection"></a>
## 8. Measure Model Performance

[Top](#top)

In [None]:
def calculate_metrics(y_true,y_pred):
    print("precision, recall, and f1 score:", precision_recall_fscore_support(y_true, y_pred,average='macro'))
    print("accuracy score:", accuracy_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred,labels=[0,1]))
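
For readers unfamiliar with `average='macro'`: precision, recall, and F1 are computed separately for each class and then averaged with equal weight, so the minority class counts as much as the majority class. A hand-rolled sketch on made-up labels follows; `macro_scores` is an illustrative name, and the notebook itself uses scikit-learn's `precision_recall_fscore_support`.

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Macro-averaged precision, recall and F1: score each class, then take the plain mean."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up labels: class 1 scores 0.75 on each metric, class 0 scores 0.5
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(macro_scores(y_true, y_pred))  # (0.625, 0.625, 0.625)
```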

In [None]:
from sklearn.linear_model import LogisticRegression
clf1 = LogisticRegression(random_state=0, solver='lbfgs', class_weight="balanced").fit(X_train, y_train)
y_pred = clf1.predict(X_test)
calculate_metrics(y_test, y_pred)

In [None]:
from sklearn.linear_model import SGDClassifier
clf2 = SGDClassifier(max_iter=1000, tol=1e-3,class_weight="balanced").fit(X_train, y_train)
y_pred= clf2.predict(X_test)
calculate_metrics(y_test,y_pred)

In [None]:
from sklearn import svm
clf3 = svm.SVC(gamma='scale',class_weight="balanced").fit(X_train, y_train)
y_pred= clf3.predict(X_test)
calculate_metrics(y_test,y_pred)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
clf4 = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred= clf4.predict(X_test)
calculate_metrics(y_test,y_pred)

In [None]:
from sklearn.gaussian_process import GaussianProcessClassifier
clf5 = GaussianProcessClassifier(max_iter_predict = 300, random_state=0).fit(X_train, y_train)
y_pred= clf5.predict(X_test)
calculate_metrics(y_test,y_pred)

In [None]:
from sklearn.naive_bayes import MultinomialNB 
clf7 = MultinomialNB().fit(X_train, y_train) 
y_pred= clf7.predict(X_test) 
calculate_metrics(y_test,y_pred)

In [None]:
from sklearn import tree 
clf8 = tree.DecisionTreeClassifier(class_weight="balanced").fit(X_train, y_train) 
y_pred= clf8.predict(X_test) 
calculate_metrics(y_test,y_pred)

In [None]:
from sklearn.ensemble import RandomForestClassifier 
clf9 = RandomForestClassifier(n_estimators=10, max_depth=None,min_samples_split=2, random_state=0,class_weight="balanced").fit(X_train, y_train) 
y_pred= clf9.predict(X_test) 
calculate_metrics(y_test,y_pred)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
clf10 = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)
y_pred= clf10.predict(X_test) 
calculate_metrics(y_test,y_pred)

In [None]:
from sklearn.ensemble import VotingClassifier
clf11 = VotingClassifier(estimators=[('rf', clf9), ('gb', clf10)], voting='hard').fit(X_train, y_train)
y_pred= clf11.predict(X_test)
calculate_metrics(y_test,y_pred)
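
With `voting='hard'`, the ensemble predicts the class label that most of its member models predict. The sketch below reproduces that majority rule in plain Python on made-up predictions, breaking ties in favor of the label that sorts first, which is the behavior scikit-learn documents for `VotingClassifier`. The name `hard_vote` and the three models are hypothetical.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return, for each sample, the label predicted by the most models."""
    voted = []
    for sample_preds in zip(*predictions_per_model):
        counts = Counter(sample_preds)
        # max() keeps the first maximum, so sorting first breaks ties toward the smallest label
        voted.append(max(sorted(counts), key=counts.get))
    return voted


# Made-up predictions from three hypothetical models for four samples
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
print(hard_vote([model_a, model_b, model_c]))  # [1, 0, 1, 1]
```

Because the ensemble above combines only two models, any disagreement between them is a tie, so the tie-breaking rule decides those predictions.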

<a id="performance_metric"></a>
## 9. Evaluate and Select Model

[Top](#top)

In [None]:
from sklearn.metrics import confusion_matrix #, plot_confusion_matrix

log_rows = [] #collect one [classifier name, accuracy] row per model
classifiers = [clf1, clf2, clf3, clf4, clf5, clf8, clf9, clf10, clf11] #list classifiers

for clf in classifiers:
    name = clf.__class__.__name__ #Get and print classifier name
    print(name)
    y_pred = clf.predict(X_test)

    print(precision_recall_fscore_support(y_test, y_pred, average='macro'))
    acc = accuracy_score(y_test, y_pred) #Get and print accuracy
    print("Accuracy: {:.2%}".format(acc))
    print(confusion_matrix(y_test, y_pred, labels=[0,1]))
    #plot_confusion_matrix(clf, X_train, y_train, cmap=plt.cm.Blues)

    log_rows.append([name, acc*100])
    print("")

#DataFrame.append was removed in pandas 2.0, so build the accuracy log in one step
acc_log = pd.DataFrame(log_rows, columns=["Classifier", "Accuracy"])

#Format and print comparison of accuracy
acc_log = acc_log.sort_values(['Accuracy'], ascending = False)
display(acc_log)
sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=acc_log, color="b")

plt.xlabel('Accuracy %')
plt.title('Classifier Accuracy')
plt.show()

<a id="evaluate_model"></a>
## 10. Save the Model

[Top](#top)

### Connection to WML

Authenticate the Watson Machine Learning service on IBM Cloud. You need to provide your platform API key and the location of your service instance.

1. Your Cloud API key can be generated by going to the [Users section](https://cloud.ibm.com/iam#/users) of the Cloud console. From that page, click your name, scroll down to the API Keys section, and click Create an IBM Cloud API key. Give your key a name and click Create, then copy the created key and paste it below. 
2. You can check your instance location in your Watson Machine Learning (WML) Service instance details. Pick the name corresponding to the region listed on the service details page:

```
Name            Display name
au-syd          Sydney
in-che          Chennai
jp-osa          Osaka
jp-tok          Tokyo
kr-seo          Seoul
eu-de           Frankfurt
eu-gb           London
ca-tor          Toronto
us-south        Dallas
us-south-test   Dallas Test
us-east         Washington DC
br-sao          Sao Paulo

```



In [None]:
api_key = "INSERT API KEY HERE"
location = "INSERT LOCATION NAME HERE"
wml_credentials = {
    "apikey": api_key,
    "url": 'https://' + location + '.ml.cloud.ibm.com'
}
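
If you would rather not paste the key into a cell, one option is to read it from the environment and validate the location before building the dictionary. This is only a sketch: the environment variable name `IBM_CLOUD_API_KEY`, the fallback value, and the helper `build_wml_credentials` are assumptions made for illustration, while the region names are copied from the table above.

```python
import os

# Region names copied from the table above
WML_LOCATIONS = {"au-syd", "in-che", "jp-osa", "jp-tok", "kr-seo", "eu-de",
                 "eu-gb", "ca-tor", "us-south", "us-south-test", "us-east", "br-sao"}


def build_wml_credentials(api_key, location):
    """Validate the inputs and return a credentials dict in the shape used above."""
    if not api_key or api_key.startswith("INSERT"):
        raise ValueError("the API key placeholder was not replaced")
    if location not in WML_LOCATIONS:
        raise ValueError(f"unknown location: {location!r}")
    return {"apikey": api_key, "url": f"https://{location}.ml.cloud.ibm.com"}


# IBM_CLOUD_API_KEY is a hypothetical variable name; "demo-key" is a stand-in value
demo_key = os.environ.get("IBM_CLOUD_API_KEY") or "demo-key"
creds = build_wml_credentials(demo_key, "us-south")
print(creds["url"])  # https://us-south.ml.cloud.ibm.com
```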

In [None]:
%%capture

!pip install -U ibm-watson-machine-learning

In [None]:
from ibm_watson_machine_learning import APIClient
client = APIClient(wml_credentials)

### Working with spaces

First, you need to create a space that will be used for your work. If you do not already have a space, you can use the [Deployment Spaces Dashboard](https://dataplatform.cloud.ibm.com/ml-runtime/spaces?context=cpdaas) to create one.

1. Click New Deployment Space
2. Create an empty space
3. Select Cloud Object Storage
4. Select Watson Machine Learning instance and press Create
5. Copy space_id and paste it below


In [None]:
space_id = 'INSERT SPACE ID HERE'

In [None]:
# you should see your space listed below
client.spaces.list(limit=10)

In [None]:
# To be able to interact with all resources available in Watson Machine Learning, you need to set space which you will be using.
client.set.default_space(space_id)

In [None]:
#This step saves the model once and needs to be rerun only when retraining.
software_spec_uid = client.software_specifications.get_id_by_name("default_py3.7")
metadata = {
            client.repository.ModelMetaNames.NAME: 'Scikit model',
            client.repository.ModelMetaNames.TYPE: 'scikit-learn_0.23',
            client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid
}


# Publish model in Watson Machine Learning repository on Cloud.
published_model = client.repository.store_model(
    model=clf10,
    meta_props=metadata,
    training_data=X_train,
    training_target=y_train)

In [None]:
# see the details of your published model:
import json

published_model_uid = client.repository.get_model_uid(published_model)
model_details = client.repository.get_details(published_model_uid)
print(json.dumps(model_details, indent=2))

<a id="deployment"></a>
## 11. Deploy the Model

[Top](#top)

In [None]:
metadata = {
    client.deployments.ConfigurationMetaNames.NAME: "PoC_NPS_Conference",
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}

created_deployment = client.deployments.create(published_model_uid, meta_props=metadata)

In [None]:
# you need the deployment_uid to make predictions using the API
deployment_uid = client.deployments.get_uid(created_deployment)

In [None]:
# get model details
client.deployments.get_details(deployment_uid)

<a id="interpretation"></a>
## 12. Make Predictions

[Top](#top)

In [None]:
# you can get a scoring endpoint if you want to make predictions against the endpoint directly
scoring_endpoint = client.deployments.get_scoring_href(created_deployment)
print(scoring_endpoint)

In [None]:
# the scoring payload needs to be of the following format. We are using a random sample from the test dataset in this payload
scoring_values = list(X_test.sample().values[0])
scoring_fields = X_test.columns
scoring_payload = {client.deployments.ScoringMetaNames.INPUT_DATA: [{'fields': list(scoring_fields),
                     'values': [scoring_values]}]}


# scoring_payload = {"input_data": [{"values": [score_0, score_1]}]}
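
The payload is a plain nested dictionary, so its shape can be inspected without a deployed model. The sketch below builds the same layout for two records, assuming (as the commented-out line above suggests) that `ScoringMetaNames.INPUT_DATA` resolves to the key `"input_data"`. The helper name and the sample values are made up.

```python
import json


def build_scoring_payload(fields, rows):
    """Build a payload in the 'input_data' layout; every row must match the field list."""
    for row in rows:
        if len(row) != len(fields):
            raise ValueError("row length does not match the number of fields")
    return {"input_data": [{"fields": list(fields),
                            "values": [list(r) for r in rows]}]}


# Made-up feature values for two records
payload = build_scoring_payload(["joy_overall", "anger_overall"],
                                [[0.8, 0.1], [0.2, 0.7]])
print(json.dumps(payload, indent=2))
```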

In [None]:
# let's see what the scoring payload looks like:
# scoring_payload

In [None]:
# finally, let's make some predictions. We use the deployment_uid and not the scoring URL since we are using the API client.
result = client.deployments.score(deployment_uid, scoring_payload)
result

In [None]:
predicted_class = result['predictions'][0]['values'][0][0]
print(f'predicted_class: {predicted_class}')
print(f'confidence score: {result["predictions"][0]["values"][0][1][predicted_class]}')

In [None]:
# method to predict a single row; predict_batch below applies it to every row
def predict(row):

  scoring_fields = X_test.columns
  scoring_payload = {client.deployments.ScoringMetaNames.INPUT_DATA: [{'fields': list(scoring_fields),
                     'values': [list(row)]}]}

  predict_flg = False #becomes True once a prediction is returned
  num_retries = 5 #allow up to 5 attempts in total on 503/504 errors
  while(not predict_flg):
      try:
          predictions = client.deployments.score(deployment_uid, scoring_payload)
          predict_flg = True
      except Exception as ex:
          if ('Status code: 504' in str(ex) or 'Status code: 503' in str(ex)) and num_retries > 1:
              num_retries = num_retries - 1
          else:
              raise

  #return the predicted class and the first entry of the probability list (the probability of class 0)
  return [predictions['predictions'][0]['values'][0][0],predictions['predictions'][0]['values'][0][1][0]]
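
The retry loop above tries again immediately after a 503 or 504. A common refinement is to wait between attempts, doubling the delay each time, so that a temporarily overloaded endpoint has time to recover. The sketch below is self-contained and uses a stand-in function in place of the scoring call; `call_with_retries`, its parameters, and `flaky` are hypothetical names.

```python
import time


def call_with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a 503/504 error wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as ex:
            transient = "Status code: 503" in str(ex) or "Status code: 504" in str(ex)
            if not transient or attempt == max_attempts - 1:
                raise  # not retryable, or out of attempts
            sleep(base_delay * 2 ** attempt)


# Stand-in for the scoring call: fails twice with a 503, then succeeds
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("Status code: 503")
    return "ok"


# A no-op sleep keeps the demonstration instant
print(call_with_retries(flaky, sleep=lambda seconds: None))  # ok
```

In the notebook, the wrapped call would be something like `lambda: client.deployments.score(deployment_uid, scoring_payload)`.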

In [None]:
def predict_batch(df):
    df_temp = df.copy()
    df_temp[['target', 'probability']] = df_temp.apply(lambda row: pd.Series(predict(row)), axis=1)
    return df_temp

In [None]:
#run predictions on first few records
result = predict_batch(X_train[0:15])

#run predictions on all records
#result = predict_batch(top_30_select)
result

## Congratulations! You have reached the end of the notebook

Here are some more notebooks to try - https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-samples-overview.html?context=wdp&audience=wdp