# Call Center Analytics using Watson AI Services

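This notebook shows you how to load Watson Natural Language Understanding (NLU) and Watson Tone Analyzer responses for call center logs from Cloud Object Storage and visualize the sentiment, top keywords, and emotion tones they contain.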

## Table of contents

1. [Load the required libraries](#loadlibraries)
2. [Load data from Cloud Object Storage](#loaddata)
3. [Visualize Sentiment and Top Keywords using Watson NLU response](#visualizeNLU)
4. [Visualize Emotion Tone using Watson Tone Analyzer response](#visualizeToneAnalyzer)
5. [Summary](#summary)

<a id="loadlibraries"></a>
## Step 1: Load the required libraries

- <a href="https://github.com/amueller/word_cloud/" target="_blank" rel="noopener noreferrer">wordcloud</a> is a Python library for generating word clouds.

In [None]:
# Run pip install only the first time. Once wordcloud is installed on your Spark machine,
# you do not need to re-run this cell unless you want to upgrade.
!pip install --upgrade --force-reinstall wordcloud

In [None]:
import ibm_boto3
from botocore.client import Config
import json
import pixiedust
from pixiedust.display import *

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

import matplotlib.pyplot as plt

from pyspark.sql import functions as F
from pyspark.sql.functions import col

<a id="loaddata"> </a>
## Step 2: Load NLU enriched data from your Cloud Object Storage instance

The first step is to load the data. This notebook assumes your enriched data is stored in Cloud Object Storage. Specifically, we load the Watson Natural Language Understanding responses for the call center logs from that storage.

In [None]:
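# The code was removed by DSX for sharing.
# This cell is expected to define credentials_1, a dict holding the keys used below:
# IBM_API_KEY_ID, IBM_AUTH_ENDPOINT, ENDPOINT, and BUCKET.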

In [None]:
# Set the credentials to a generic variable to reference in the rest of the notebook
cos_credentials = credentials_1

In [None]:
# Define Cloud Object Storage client by specifying the credentials for your COS instance
client = ibm_boto3.client(service_name='s3', 
    ibm_api_key_id=cos_credentials['IBM_API_KEY_ID'],
    ibm_auth_endpoint=cos_credentials['IBM_AUTH_ENDPOINT'],
    config=Config(signature_version='oauth'),
    endpoint_url=cos_credentials['ENDPOINT'])

<a id="visualizeNLU"></a>
## Step 3: Visualize Sentiment and Top Keywords using Watson NLU response
Define the function that parses a Watson NLU JSON response and extracts the sentiment score, sentiment label, and keywords.

In [None]:
# Method to parse NLU response file from Cloud Object Storage
# and return sentiment score, sentiment label, and keywords
def getNLUresponse(COSclient, bucket, filename):
    streaming_body = COSclient.get_object(Bucket=bucket, Key=filename)['Body']
    nlu_response = json.loads(streaming_body.read().decode("utf-8"))
    if nlu_response and nlu_response['sentiment'] \
    and nlu_response['sentiment']['document'] and nlu_response['sentiment']['document']['label']:
        sentiment_score = nlu_response['sentiment']['document']['score']
        sentiment_label = nlu_response['sentiment']['document']['label']
        keywords = list(nlu_response['keywords'])
    else:
        sentiment_score = 0.0
        sentiment_label = None
        keywords = None  # Python has no `null`; None maps to a null value in Spark
        
    return (filename,sentiment_score,sentiment_label,keywords)
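
To make the expected input concrete, the sketch below parses a hand-written sample shaped like the NLU response this notebook reads. The sample values and the helper name `parse_nlu_response` are assumptions made for illustration, not service output or part of any Watson SDK. Only the keys already used in this notebook (`sentiment`, `document`, `score`, `label`, `keywords`, `text`, `relevance`) appear in it. The helper uses `.get()` so that a response missing any key falls back to the same defaults as `getNLUresponse` instead of raising a `KeyError`.

```python
import json

# Hand-written sample for illustration only; not real service output.
SAMPLE_NLU_JSON = """
{
  "sentiment": {"document": {"score": -0.62, "label": "negative"}},
  "keywords": [
    {"text": "Billing error", "relevance": 0.93},
    {"text": "refund", "relevance": 0.71}
  ]
}
"""

def parse_nlu_response(nlu_response):
    """Return (sentiment_score, sentiment_label, keywords) from a parsed NLU response.

    Hypothetical helper: returns (0.0, None, None) when the document
    sentiment label is missing or empty.
    """
    sentiment = (nlu_response or {}).get("sentiment") or {}
    document = sentiment.get("document") or {}
    if not document.get("label"):
        return (0.0, None, None)
    keywords = list(nlu_response.get("keywords") or [])
    return (document.get("score", 0.0), document["label"], keywords)

parsed = parse_nlu_response(json.loads(SAMPLE_NLU_JSON))
empty = parse_nlu_response({})
print(parsed)
print(empty)
```

Working on an already parsed dict keeps the parsing logic testable without a Cloud Object Storage connection.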

In [None]:
# Read enriched files from Cloud Object Storage
# Provide list of files saved in COS that include NLU response
nlu_files=['sample1_nlu.json','sample2_nlu.json', 'sample3_nlu.json', 'sample4_nlu.json']
nlu_header=['filename','sentiment_score','sentiment_label','keywords']
nlu_results = []
bucket = cos_credentials['BUCKET']
for filename in nlu_files:
    print("Processing NLU response from file: ", filename)
    nlu_results.append(getNLUresponse(client,bucket,filename))
    

In [None]:
print(nlu_results)

Map the parsed NLU responses into a Spark dataframe, one record for each file, where each file is the NLU response for one call center record.

In [None]:
callcenterlogs_nluDF = spark.createDataFrame(nlu_results, nlu_header)

In [None]:
# Common validation calls
print(type(callcenterlogs_nluDF))
callcenterlogs_nluDF.printSchema()
callcenterlogs_nluDF.show()

### Sentiment plots using PixieDust
Leverage PixieDust to plot sentiment labels as a pie-chart showing how many positive, negative, and neutral calls are received.

In [None]:
## Ignore any records with null sentiment label
callcenterlogs_nluDF = callcenterlogs_nluDF.where(col('sentiment_label').isNotNull())
perlabel_sentimentDF = callcenterlogs_nluDF.groupBy('sentiment_label')\
                              .agg(F.count('filename')\
                              .alias('num_calls'))

## Take a look
perlabel_sentimentDF.show()

In [None]:
# Call PixieDust to visualize sentiment data
display(callcenterlogs_nluDF)

### Keywords visualization using Word Cloud
Next, we process the NLU keyword results to find the top keywords referenced in the call center interactions. These keywords indicate the main topics that customers raise in their calls.

In [None]:
from pyspark.sql.functions import explode

# Explode keywords
callcenterlogs_nluDF = callcenterlogs_nluDF.select(explode('keywords').alias('topkeywords'))
callcenterlogs_nluDF = callcenterlogs_nluDF.select('topkeywords').rdd.map(lambda row: row[0]).toDF()


In [None]:
callcenterlogs_nluDF.head(4)

In [None]:
# UDF to return lower case of word
def toLowerCase(word):
    return word.lower()

In [None]:
# Process extracted keywords to change to lower case
udfLowerCase = udf(toLowerCase, StringType())
callcenterlogsTopKeywordsDF = callcenterlogs_nluDF.withColumn('topkeywords',udfLowerCase('text'))


In [None]:
# Group by topkeywords and compute average relevance per keyword and also number of calls for each keyword
callcenterlogsKwdsNumDF = callcenterlogsTopKeywordsDF.groupBy('topkeywords')\
                              .agg(F.count('topkeywords').alias('kwdsnumcalls'))
callcenterlogsKwdsRelDF = callcenterlogsTopKeywordsDF.groupBy('topkeywords')\
                          .agg(F.avg('relevance').alias('kwdsavgrelevance'))


In [None]:
# Join the keyword count and keyword relevance dataframes into one
callcenterlogsKeywordsDF = callcenterlogsKwdsNumDF.join(callcenterlogsKwdsRelDF,'topkeywords','outer')

# Define keyword score as product of number of calls expressing that keyword and average relevance of that keyword
callcenterlogsKeywordsDF = callcenterlogsKeywordsDF.withColumn('keyword_score',callcenterlogsKeywordsDF.kwdsnumcalls * callcenterlogsKeywordsDF.kwdsavgrelevance)

# Sort dataframe in descending order of keyword_score
callcenterlogsKeywordsDF = callcenterlogsKeywordsDF.orderBy('keyword_score',ascending=False)

# Remove None keywords
callcenterlogsKeywordsDF = callcenterlogsKeywordsDF.where(col('topkeywords').isNotNull())
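
The keyword score defined above can be checked by hand on a few rows. The following is a minimal plain-Python sketch of the same arithmetic, assuming each exploded keyword row carries the `text` and `relevance` fields used in this notebook. The sample rows and the function name `keyword_scores` are made up for illustration. Because the score is the number of calls multiplied by the average relevance, it equals the sum of the relevance values for that keyword.

```python
from collections import defaultdict

# Illustrative exploded keyword rows; the values are made up for this sketch.
keyword_rows = [
    {"text": "Refund", "relevance": 0.9},
    {"text": "refund", "relevance": 0.7},
    {"text": "billing", "relevance": 0.8},
]

def keyword_scores(rows):
    """Return (keyword, num_calls, avg_relevance, keyword_score) tuples, highest score first."""
    counts = defaultdict(int)
    relevance_totals = defaultdict(float)
    for row in rows:
        keyword = row["text"].lower()  # same normalization as the lower-case UDF
        counts[keyword] += 1
        relevance_totals[keyword] += row["relevance"]
    results = []
    for keyword, num_calls in counts.items():
        avg_relevance = relevance_totals[keyword] / num_calls
        results.append((keyword, num_calls, avg_relevance, num_calls * avg_relevance))
    return sorted(results, key=lambda item: item[3], reverse=True)

scores = keyword_scores(keyword_rows)
for keyword, num_calls, avg_relevance, score in scores:
    print(keyword, num_calls, round(avg_relevance, 2), round(score, 2))
```

A keyword that is mentioned often with moderate relevance can therefore outrank one that is mentioned once with high relevance.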


In [None]:
print("Top Keywords from call center logs")
callcenterlogsKeywordsDF.show()

In [None]:
display(callcenterlogsKeywordsDF)

In [None]:
# Map to Pandas DataFrame
callcenterlogsKeywordsPandas = callcenterlogsKeywordsDF.toPandas()

In [None]:
from wordcloud import WordCloud

# Process Pandas DataFrame in the right format to leverage wordcloud.py for plotting
# See documentation: https://github.com/amueller/word_cloud/blob/master/wordcloud/wordcloud.py 
def prepForWordCloud(pandasDF,n):
    kwdList = pandasDF['topkeywords']
    sizeList = pandasDF['keyword_score']
    kwdSize = {}
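    # Cap n at the number of rows and index by position so a short DataFrame does not raise an error
    for i in range(min(n, len(pandasDF))):
        kwd=kwdList.iloc[i]
        size=sizeList.iloc[i]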
        kwdSize[kwd] = size
    return kwdSize

In [None]:
%matplotlib inline
maxWords = len(callcenterlogsKeywordsPandas)
nWords = 4

# Generate the word cloud. relative_scaling controls how strongly font size follows word frequency;
# with relative_scaling=0, only the rank of each word matters.
#See documentation: https://github.com/amueller/word_cloud/blob/master/wordcloud/wordcloud.py
callcenterlogsKwdFreq = prepForWordCloud(callcenterlogsKeywordsPandas,nWords)
callcenterlogsWordCloud = WordCloud(max_words=maxWords,relative_scaling=0,normalize_plurals=False).generate_from_frequencies(callcenterlogsKwdFreq)

plt.imshow(callcenterlogsWordCloud)

# If need to support multiple side-by-side word clouds, use commented lines below

#fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (23, 10))

## Set titles for images
#ax[0].set_title('Top Keywords from logs of call 1')
#ax[1].set_title('Top Keywords from logs of call 2')

                
## Plot word clouds
#ax[0].imshow(callcenterlogs1WordCloud)
#ax[1].imshow(callcenterlogs2WordCloud)

# turn off axis and ticks
#plt.axis("off")
#ax[0].tick_params(axis='both',which='both',bottom=False,top=False,left=False,right=False,
#                 labelbottom=False,labeltop=False,labelleft=False,labelright=False)
#ax[1].tick_params(axis='both',which='both',bottom=False,top=False,left=False,right=False,
#                 labelbottom=False,labeltop=False,labelleft=False,labelright=False)


#plt.show()

<a id="visualizeToneAnalyzer"></a>
## Step 4: Visualize Emotion Tone using Watson Tone Analyzer response
Define the function that parses a Watson Tone Analyzer JSON response and extracts the emotion tone labels and scores.

In [None]:
# Method to parse Tone Analyzer response file from Cloud Object Storage
# and return emotion tone labels and scores
toneID_list=['excited','frustrated','impolite','polite','sad','satisfied','sympathetic']
def getTAresponse(COSclient, bucket, filename):
    streaming_body = COSclient.get_object(Bucket=bucket, Key=filename)['Body']
    ta_response = json.loads(streaming_body.read().decode("utf-8"))
    if ta_response and ta_response['utterances_tone']:
        # Assume one set of tones per file; if file is created to include a number of utterances
        # we will need to change this code
        tones = ta_response['utterances_tone'][0]['tones']
    else:
        tones = []
    return (filename, tones)
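
The function above reads only the first utterance in each file. If a file holds several utterances, one possible change is to collect the tones of every utterance into a single flat list, as in the sketch below. The sample JSON is hand-written for illustration. The keys `utterances_tone`, `tones`, `tone_id`, and `score` are the ones used in this notebook, while `utterance_id`, `utterance_text`, and `tone_name` are assumptions about the response layout, and `collect_all_tones` is a hypothetical helper name.

```python
import json

# Hand-written sample for illustration only; not real service output.
SAMPLE_TA_JSON = """
{
  "utterances_tone": [
    {"utterance_id": 0, "utterance_text": "My order never arrived.",
     "tones": [{"score": 0.81, "tone_id": "frustrated", "tone_name": "Frustrated"}]},
    {"utterance_id": 1, "utterance_text": "Thank you for the quick fix.",
     "tones": [{"score": 0.92, "tone_id": "polite", "tone_name": "Polite"},
               {"score": 0.67, "tone_id": "satisfied", "tone_name": "Satisfied"}]}
  ]
}
"""

def collect_all_tones(ta_response):
    """Return the tones of every utterance as one flat list (hypothetical helper)."""
    utterances = (ta_response or {}).get("utterances_tone") or []
    all_tones = []
    for utterance in utterances:
        # An utterance with no detected tone contributes nothing
        all_tones.extend(utterance.get("tones") or [])
    return all_tones

all_tones = collect_all_tones(json.loads(SAMPLE_TA_JSON))
tone_ids = [tone["tone_id"] for tone in all_tones]
print(tone_ids)
```

The flat list has the same shape as the `tones` value returned by `getTAresponse`, so the later explode step would not need to change under this assumption.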

In [None]:
toneanalyzer_files=['sample1_ta.json','sample2_ta.json']
toneanalyzer_header=['filename','tones']
toneanalyzer_results = []
bucket = cos_credentials['BUCKET']
for filename in toneanalyzer_files:
    print("Processing Tone Analyzer response from file: ", filename)
    response= getTAresponse(client,bucket,filename)
    toneanalyzer_results.append(response)

In [None]:
callcenterlogs_taDF = spark.createDataFrame(toneanalyzer_results, toneanalyzer_header)

In [None]:
callcenterlogs_taDF.head(4)

In [None]:
callcenterlogs_taDF.printSchema()

In [None]:
# If not imported earlier, import explode
from pyspark.sql.functions import explode

# Explode tones
callcenterlogs_taDF = callcenterlogs_taDF.select(explode('tones').alias('toptones'))
callcenterlogs_taDF = callcenterlogs_taDF.select('toptones').rdd.map(lambda row: row[0]).toDF()

In [None]:
# Print schema and note that score is of type string
callcenterlogs_taDF.printSchema()

In [None]:
# Cast the score column from String to Double
callcenterlogs_taDF = callcenterlogs_taDF.withColumn("score", col("score").cast("double"))

In [None]:
# Print schema to verify score is now of type double
callcenterlogs_taDF.printSchema()

In [None]:
callcenterlogs_taDF.head(5)

In [None]:
display(callcenterlogs_taDF)

### Stop Here
The cells below are not needed when using PixieDust, which performs this grouping and aggregation for you.

They are kept here for reference in case you want to explore other operations.

In [None]:
# Group by toptones and compute average score per tone and also number of calls for each tone
callcenterlogsTonesNumDF = callcenterlogs_taDF.groupBy('tone_id')\
                           .agg(F.count('tone_id').alias('tonesnumcalls'))
callcenterlogsTonesScoreDF = callcenterlogs_taDF.groupBy('tone_id')\
                          .agg(F.avg('score').alias('tonesavgscore'))



In [None]:
# Join the tone count and tone score dataframes into one
callcenterlogsTonesDF = callcenterlogsTonesNumDF.join(callcenterlogsTonesScoreDF,'tone_id','outer')

# Define tones score as product of number of calls expressing that tone and average score of that tone
callcenterlogsTonesDF = callcenterlogsTonesDF.withColumn('tones_score',callcenterlogsTonesDF.tonesnumcalls * callcenterlogsTonesDF.tonesavgscore)

# Sort dataframe in descending order of tones_score
callcenterlogsTonesDF = callcenterlogsTonesDF.orderBy('tones_score',ascending=False)

# Remove None tones
callcenterlogsTonesDF = callcenterlogsTonesDF.where(col('tone_id').isNotNull())

In [None]:
#response = getTAresponse(client,bucket,filename)
#print(json.dumps(response, indent=4, sort_keys=True))

<a id="summary"></a>
## Summary
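This notebook loaded Watson Natural Language Understanding and Tone Analyzer responses for call center logs from Cloud Object Storage, mapped them into Spark DataFrames, and used PixieDust and a word cloud to visualize sentiment labels, top keywords, and emotion tones.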