# Measure Conversation Intent Performance

The following notebook details the steps to test the performance of intent classification within an IBM Watson Conversation workspace.

What you'll need:

1. A conversation workspace file (named workspace.json)
2. A set of example utterances with which to test (named test_set.csv)
3. The credentials of your conversation service and the workspace_id (named conv_creds.json)

## Prepare the conversation workspace

To run these tests you will need an instance of the IBM Watson Conversation service and a workspace that has been trained with two or more intents. For details of the service and how to set this up, please see the getting started section of the documentation.

For the purpose of this notebook it is assumed that you already have the service instance and a trained workspace.

## Import the required dependencies

We'll be using a number of dependencies, including the watson_developer_cloud package, which must be installed before you run this notebook. Additionally, we'll use the scikit-learn metrics to display a report and matplotlib to plot a confusion matrix.

In [None]:
%matplotlib inline

from collections import Counter
import numpy as np
import json, time
from pprint import pprint
import pandas as pd
from watson_developer_cloud import ConversationV1
from IPython.display import clear_output
from sklearn import metrics

import matplotlib.pyplot as plt

## Prepare the data

Use the pandas package to import the data for our test. The expected format is a two-column CSV of the form "example utterance", "label". Note that read_csv treats the first row as a header by default, so the file should begin with a row of column names. If your file is in a different format you'll need to change the code below. The aim is to produce three arrays: test_X, which contains the list of utterances; test_Y, which contains the label for each item in test_X; and labels, which contains the unique list of labels across all utterances.

In [None]:
dataset = pd.read_csv("test_set.csv", quotechar='"', skipinitialspace=True)
print(dataset.shape)

# separate the data from the target attributes
test_X = dataset.values[:,0].tolist()
test_Y = dataset.values[:,1].tolist()

# Get a unique list of the labels in this test set
labels = np.unique(test_Y)
pprint(labels)
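
To make the expected file layout concrete, here is a minimal sketch that parses a small in-memory sample. It uses the standard library csv module instead of pandas so it runs without any dependencies; the sample rows, the column names and the `sample_*` variable names are all hypothetical and are not taken from a real test set.

```python
import csv
import io

# Hypothetical sample of test_set.csv: a header row, then "utterance", "label" pairs
sample_csv = '''"utterance", "label"
"I want to reset my password", "reset_password"
"Where is my order, please?", "order_status"
"Has my parcel shipped yet", "order_status"
'''

# skipinitialspace=True ignores the space after each comma, so the quoted
# second column is parsed as a quoted field (the same option is passed to read_csv)
reader = csv.reader(io.StringIO(sample_csv), quotechar='"', skipinitialspace=True)
header = next(reader)                  # first row holds the column names
rows = [row for row in reader if row]  # skip any blank lines

sample_X = [row[0] for row in rows]    # utterances
sample_Y = [row[1] for row in rows]    # expected intent for each utterance
sample_labels = sorted(set(sample_Y))  # unique labels in this sample

print(sample_X)
print(sample_Y)
print(sample_labels)
```

Quoting each field matters here: the second utterance contains a comma, and without the quotes it would be split into two columns.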

## Set up the service credentials

You will need a file in the same directory as this notebook that contains the information in the following format:

```
{
    "credentials": {
        "url": "https://gateway.watsonplatform.net/conversation/api",
        "password": "PASSWORD",
        "username": "USERNAME",
        "workspace_id": "WORKSPACE_ID"
    }
}
```

To get the url, username and password for your instance, go to the Bluemix dashboard, select the service instance you want, and open the Service Credentials tab. To get your workspace_id, launch the tooling, select the menu icon in the top right of the workspace tile, and choose details.

If you change the format of your credentials file you'll need to change the code below to get the details from your file format.

In [None]:
with open("conv_creds.json") as f:
    creds = json.load(f)['credentials']
    username,password = creds['username'], creds['password']
    endpoint = creds['url']
    workspace_id = creds['workspace_id']

# Confirm what was loaded without echoing the password into the notebook output
print(endpoint, workspace_id)
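
If the credentials file is incomplete, the failure otherwise only shows up later as an authentication or KeyError during the test run. The sketch below is one possible way to check the file up front and to print it without exposing the password. The helper names `load_credentials` and `redact` and the `REQUIRED_KEYS` tuple are hypothetical, and the example JSON simply reuses the placeholder values from the format shown above.

```python
import json

# Fields the rest of this notebook reads from the credentials file
REQUIRED_KEYS = ("url", "username", "password", "workspace_id")

def load_credentials(raw_json):
    """Parse a credentials document and fail early if a field is missing or empty."""
    creds = json.loads(raw_json)["credentials"]
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise ValueError("credentials file is missing: " + ", ".join(missing))
    return creds

def redact(creds):
    """Return a copy with the password masked, safe to print in a shared notebook."""
    return {key: ("***" if key == "password" else value)
            for key, value in creds.items()}

# Placeholder values only, matching the documented file format
example = '''{"credentials": {"url": "https://gateway.watsonplatform.net/conversation/api",
 "username": "USERNAME", "password": "PASSWORD", "workspace_id": "WORKSPACE_ID"}}'''

example_creds = load_credentials(example)
print(redact(example_creds))
```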

## Run the test set

For each item in the test set (the array test_X) we'll call the Conversation API, passing in the utterance as the input. The name of the first intent returned for each utterance is then added to an array (preds) for use later.

A running total of the tests completed is output and when the tests are complete "Completed test run" will be displayed.

In [None]:
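# Create the Conversation client from the credentials loaded above;
# the url from the credentials file is passed through so it is actually used
conversation = ConversationV1(
  username=username,
  password=password,
  url=endpoint,
  version='2016-09-20'
)
preds = []

# Iterate over each example in the test set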
for idx, item in enumerate(test_X):
    response = conversation.message(
      workspace_id=workspace_id,
      message_input={'text': item},
      context=None
    )
    preds.append(response["intents"][0]["intent"])
    clear_output()
    print("Completed {!s} of {!s}\r".format(idx+1, len(test_X)))

print("Completed test run")
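
The loop above indexes the first element of the returned intents list, which raises an IndexError if the service returns no intents for an utterance. The following is a minimal sketch of a more defensive lookup, assuming the response is a dict shaped like the one the loop reads (an "intents" list of dicts with an "intent" key). The helper name `top_intent` and the `NO_INTENT` fallback label are hypothetical; the sample responses are hand-written, not real service output.

```python
def top_intent(response, fallback="NO_INTENT"):
    """Return the name of the first intent in a response, or a fallback label."""
    intents = response.get("intents") or []
    if not intents:
        # No intent returned: record a placeholder so preds stays aligned with test_X
        return fallback
    return intents[0]["intent"]

# Hand-written sample responses for illustration only
with_intent = {"intents": [{"intent": "order_status", "confidence": 0.91}]}
without_intent = {"intents": []}

print(top_intent(with_intent))
print(top_intent(without_intent))
```

Recording a placeholder rather than skipping the utterance keeps preds the same length as test_Y, which the metrics functions require. The placeholder then appears as its own class in the report.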

## Show the test metrics

We're using the classification report from the sklearn.metrics module to help us display the performance of the classifier. We could roll our own metric calculations, but why reinvent the wheel?

The report shows us the precision, recall, f1-score and support (number of items tested) for each class and as a total. For more information on exactly what these measures mean, see the scikit-learn website.

Note the call to "np.unique(np.append(test_Y, preds))". It is needed because the classifier sometimes predicts labels that are not in the list of labels we are testing. In a perfect world you would have tests for all possible intents, but that isn't always the case, so this expression ensures that the report has a name for every class and runs without error.

In [None]:
print(metrics.classification_report(test_Y, preds,
                                    target_names=np.unique(np.append(test_Y, preds))))
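
To show what the per-class numbers in the report mean, here is a small hand-rolled sketch of the standard definitions of precision, recall, f1 and support. It is for illustration only and is not scikit-learn's implementation; the function name `per_class_metrics` and the toy labels are hypothetical. The toy predictions deliberately include a label ("goodbye") that never appears in the true labels, which is the situation the label union handles.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, f1 and support for every label seen."""
    # Union of true and predicted labels, so predicted-only classes get a row too
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report

toy_true = ["greeting", "greeting", "order_status", "order_status"]
toy_pred = ["greeting", "order_status", "order_status", "goodbye"]

toy_report = per_class_metrics(toy_true, toy_pred)
for label, scores in toy_report.items():
    print(label, scores)
```

In this toy run "greeting" has perfect precision but a recall of only one half, because one of its two examples was predicted as "order_status".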

## Display a confusion matrix

Confusion matrices can be hard to read. Essentially, you're looking to identify the labels that are incorrectly classified and what those examples are actually classified as. If I have four labels (1, 2, 3, 4) and my confusion matrix shows that examples of label 2 are often classified as label 3, then I know that the two labels overlap and I need to look deeper.

In [None]:
import matplotlib

# Union of true and predicted labels in sorted order; this matches the
# row/column order that confusion_matrix uses by default
all_labels = np.unique(np.append(test_Y, preds))
ncats = len(all_labels)

matplotlib.rcParams['figure.figsize'] = (20.0, 15.0)
plt.matshow(metrics.confusion_matrix(test_Y, preds), cmap='cubehelix')
plt.colorbar()
plt.xticks(range(ncats), all_labels, rotation=90)
plt.yticks(range(ncats), all_labels)
plt.ylabel("True Label")
plt.xlabel("Predicted Label");
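
When there are many intents, the plot can be easier to act on alongside a plain list of the most frequent mistakes. The sketch below counts (true label, predicted label) pairs for misclassified examples only. The function name `most_confused` and the toy data are hypothetical; the toy data mirrors the four-label example above, where label 2 is often classified as label 3.

```python
from collections import Counter

def most_confused(y_true, y_pred, top_n=3):
    """Return the most frequent (true, predicted) pairs among misclassified examples."""
    mistakes = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return mistakes.most_common(top_n)

# Toy data: two examples of label "2" are predicted as "3"
cm_true = ["2", "2", "2", "3", "1", "4"]
cm_pred = ["3", "3", "2", "3", "1", "1"]

for (true_label, predicted_label), count in most_confused(cm_true, cm_pred):
    print("{} -> {}: {}".format(true_label, predicted_label, count))
```

With the notebook's own data you would call it as `most_confused(test_Y, preds)`.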
