# Analyzing Conversations

This data science notebook analyzes the log of all conversations between users and the bot to help understand how the bot was used and which topics interested users most.

### Prerequisites
We're going to use PixieDust to help visualize our data. You can learn more about PixieDust at https://ibm-cds-labs.github.io/pixiedust/.
In the following cell we ensure we are running the latest version of PixieDust. Be sure to restart your kernel if instructed to do so.

In [None]:
!pip install --user --upgrade pixiedust

In [None]:
import pixiedust
pixiedust.enableJobMonitor()

In [None]:
from pyspark.sql.functions import explode, lower

### Configure database connectivity
Enter your Cloudant information below.

In [None]:
# Enter your Cloudant host name
host = 'adc809a8-9e28-41c4-bfac-faf3e36a9862-bluemix.cloudant.com'
# Enter your Cloudant user name
username = 'acelymposelaneallondints'
# Enter your Cloudant password
password = '14f7814094c43a22ddef8492d2f9ce95abf6efb3'
# Enter your source database name
database = 'cbf_nyc_chatbot_convos'
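
If you would rather not type connection details into the notebook, one option is to read them from environment variables. The following is a minimal sketch: the variable names (`CLOUDANT_HOST` and so on) and the helper `load_cloudant_settings` are hypothetical and not part of this notebook or of Cloudant.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
def load_cloudant_settings(env=os.environ):
    """Return (host, username, password, database) read from a mapping of settings."""
    required = ("CLOUDANT_HOST", "CLOUDANT_USERNAME", "CLOUDANT_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    # Fall back to the database name used in this notebook
    database = env.get("CLOUDANT_DATABASE", "cbf_nyc_chatbot_convos")
    return (env["CLOUDANT_HOST"], env["CLOUDANT_USERNAME"],
            env["CLOUDANT_PASSWORD"], database)
```

Calling `host, username, password, database = load_cloudant_settings()` would then fill the same four variables that the cell above sets by hand.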

### Load documents from the database
Load the documents into an Apache Spark DataFrame.

In [None]:
# no changes are required to this cell
from pyspark.sql import SparkSession
# obtain SparkSession
sparkSession = SparkSession.builder.getOrCreate()
# load data
conversation_df = sparkSession.read.format("com.cloudant.spark").\
    option("cloudant.host", host).\
    option("cloudant.username", username).\
    option("cloudant.password", password).\
    load(database)

### Document Structure

Each document in the database represents a single conversation with the chatbot. Each conversation includes the user, the date, and the steps of the conversation. The steps are stored in an array called dialogs (referring to the dialogs in Watson Conversation that were traversed as part of the conversation). Here is a sample conversation:

<code>
"_id": "19e231d78c7106b61993bd3aacc77520",
"_rev": "3-3dad296b493c1cf2f12fc1a082661f82",
"userId": "iLhfu20E",
"date": 1494347114492,
"dialogs": [
  {
    "name": "sickGetSymptoms",
    "message": "I'm not feeling well",
    "reply": "I'm sorry to hear that. Why don't you tell me what is bothering you?\n",
    "date": 1494347114695
  },
  {
    "name": "sickENTSymptoms",
    "message": "I have a sore throat",
    "reply": "OK. It sounds like an ENT might be able to help you. Give me your address and I'll try and find one nearby.\n",
    "date": 1494347121729
  }
]
</code>
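
The `date` values in the sample look like epoch timestamps in milliseconds. Assuming that is the case, the sketch below parses a trimmed copy of the sample with plain Python (no Spark) and converts the timestamps into readable values. The helper `millis_to_utc` is illustrative only.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the sample conversation, wrapped in braces to form valid JSON.
sample_json = """
{
  "_id": "19e231d78c7106b61993bd3aacc77520",
  "userId": "iLhfu20E",
  "date": 1494347114492,
  "dialogs": [
    {"name": "sickGetSymptoms", "message": "I'm not feeling well", "date": 1494347114695},
    {"name": "sickENTSymptoms", "message": "I have a sore throat", "date": 1494347121729}
  ]
}
"""

def millis_to_utc(millis):
    """Convert epoch milliseconds (an assumption about the date format) to a UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

conversation = json.loads(sample_json)
started = millis_to_utc(conversation["date"])

# Elapsed time between the first and the last dialog, in seconds
dialogs = conversation["dialogs"]
duration_s = (dialogs[-1]["date"] - dialogs[0]["date"]) / 1000

print(started.isoformat(), duration_s)
```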

In the following cell we print the schema to confirm the structure of the documents.

In [None]:
conversation_df.printSchema()

### How many conversations were there?
Let's start by showing how many conversations there have been with the chatbot:

In [None]:
conversation_df.count()

### Flatten the Cloudant JSON document structure
Each dialog has a `message` field, which holds the message sent by the user, and a `name` field, which represents the action performed by the system. The action depends on the user's message and on the current dialog in the conversation, as managed by Watson Conversation. For example, the name `findDoctorLocation` maps to the action of searching for a doctor. We want to analyze specific actions and the messages associated with them, so in the next cell we convert each row (which holds the whole dialogs array) into multiple rows, one per dialog. This makes it easier to filter and aggregate on the message and name fields.

In [None]:
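# one row per dialog instead of one row per conversation
dialog_df = conversation_df.select(explode(conversation_df.dialogs).alias("dialog"))
# keep the date, the lowercased user message, and the action name
dialog_df = dialog_df.select(
    "dialog.date",
    lower(dialog_df.dialog.message).alias("message"),
    "dialog.name",
)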
dialog_df.printSchema()
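
To make the flattening step concrete, here is a plain-Python sketch of the same idea, run on two hypothetical conversations rather than the real database. It is not how Spark implements `explode`; it only mirrors the result: one row per dialog, with the message lowercased. Conversations without any dialogs produce no rows, as with Spark's `explode`.

```python
# Two hypothetical conversations shaped like the documents in the database.
conversations = [
    {"userId": "a", "dialogs": [
        {"date": 1, "name": "sickGetSymptoms", "message": "I'm not feeling well"},
        {"date": 2, "name": "sickENTSymptoms", "message": "I have a sore throat"},
    ]},
    {"userId": "b", "dialogs": [
        {"date": 3, "name": "sickGetSymptoms", "message": "I feel SICK"},
    ]},
]

def flatten_dialogs(conversations):
    """Return one row per dialog, keeping date, lowercased message, and name."""
    rows = []
    for conversation in conversations:
        # A missing or empty dialogs array contributes no rows
        for dialog in conversation.get("dialogs") or []:
            rows.append({
                "date": dialog["date"],
                "message": dialog["message"].lower(),
                "name": dialog["name"],
            })
    return rows

dialog_rows = flatten_dialogs(conversations)
print(len(dialog_rows))
```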

### Display Dialogs in PixieDust
Below we display each dialog in a PixieDust table. You can see the date, message (the message the user sent to the chatbot), and name (the name of the action performed).

In [None]:
display(dialog_df)

### How many conversations started with people saying they were sick?
The action `sickGetSymptoms` is stored when a user says they are not feeling well. Here we create a DataFrame with only those dialogs and count them:

In [None]:
sick_dialog_df = dialog_df.filter(dialog_df.name == 'sickGetSymptoms')
sick_dialog_df.count()
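
Note that the count above is a count of dialogs, so a conversation that triggers `sickGetSymptoms` twice is counted twice, and the action is counted wherever it occurs in a conversation. The plain-Python sketch below, run on hypothetical data, shows how the two readings differ. It assumes that the dialogs array is stored in chronological order, which the sample document suggests but which this notebook does not verify.

```python
def count_sick_conversations(conversations, action="sickGetSymptoms"):
    """Return (started_with_action, contained_action) as conversation counts."""
    started = 0
    contained = 0
    for conversation in conversations:
        names = [dialog["name"] for dialog in conversation.get("dialogs") or []]
        # Assumes the first array element is the first step of the conversation
        if names and names[0] == action:
            started += 1
        if action in names:
            contained += 1
    return started, contained

# Hypothetical conversations, reduced to the field used here
conversations = [
    {"dialogs": [{"name": "sickGetSymptoms"}, {"name": "sickENTSymptoms"}]},
    {"dialogs": [{"name": "findDoctorLocation"}, {"name": "sickGetSymptoms"}]},
    {"dialogs": []},
]

print(count_sick_conversations(conversations))
```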

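### How many symptoms were we unable to detect?
Here we filter on the `sickUnknownSymptoms` action, which marks messages whose symptoms the chatbot could not match, and count them: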

In [None]:
unknown_symptom_dialog_df = dialog_df.filter(dialog_df.name == 'sickUnknownSymptoms')
unknown_symptom_dialog_df.count()

### What were the most common symptoms we were unable to detect?
Next we group by message, which is the text sent by the user. In this case it is the symptom description that the chatbot could not match. Here we aggregate the messages across all users and display them, most frequent first:

In [None]:
unknown_symptom_dialog_by_message_df = unknown_symptom_dialog_df.groupBy('message').count().orderBy('count', ascending=False)
display(unknown_symptom_dialog_by_message_df)
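
The grouping step can also be sketched in plain Python with `collections.Counter`, shown here on hypothetical messages. Unlike the cell above, this sketch also collapses extra whitespace before counting, so that near-identical messages fall into the same group. That normalization is an addition for illustration, not something this notebook does.

```python
from collections import Counter

def top_messages(messages, n=10):
    """Count normalized messages and return the n most frequent as (message, count) pairs."""
    # Lowercase and collapse runs of whitespace; skip empty or missing messages
    normalized = (" ".join(m.lower().split()) for m in messages if m)
    return Counter(normalized).most_common(n)

# Hypothetical user messages
messages = ["my knee hurts", "My knee  hurts", "dizzy", "my knee hurts ", "dizzy", "rash"]

print(top_messages(messages, n=2))
```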

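### Display PixieDust Bar Chart
Here we limit the results to the 20 most common messages. Choose the bar chart option from the PixieDust chart menu to plot them.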

In [None]:
display(unknown_symptom_dialog_by_message_df.limit(20))

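### Display PixieDust Pie Chart
Here we limit the results to the 10 most common messages. Choose the pie chart option from the PixieDust chart menu to plot them.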

In [None]:
display(unknown_symptom_dialog_by_message_df.limit(10))

## Conclusions 