# Classifying emotions in movie dialogue using Watson NLP

This notebook demonstrates how to classify emotions in movie dialogue using the Watson NLP Python library.

### What you'll learn in this notebook
Watson NLP offers so-called blocks for various NLP tasks. This notebook shows:

- **Syntax analysis** with the _Syntax block_ for English (`syntax_izumo_en_stock`). This block performs tokenization, lemmatization, part-of-speech tagging, and dependency parsing on raw input documents so that custom models can properly classify documents.
- **Emotion classification** with the _Ensemble emotion workflow_ (`ensemble_classification-wf_en_emotion-stock`) and the _Aggregated emotion workflow_ (`aggregated_classification-wf_en_emotion-stock`). These workflows classify text into five emotions: "sadness", "joy", "anger", "fear", "disgust".

## Table of Contents

1. [Before you start](#beforeYouStart)
1. [Dataset](#dataset)

    1. [Data import](#dataImport)
    1. [Data formatting](#dataFormatting)
    1. [Train test split](#trainTestSplit)    
1. [Training models](#training)

    1. [Pre-processing](#preProcessing)
    1. [TF-IDF](#tfidf)
    1. [Embeddings](#embeddings)
    1. [CNN](#cnn)
    
1. [Testing](#testing)

<a id="beforeYouStart"></a>
## 1. Before you start

<div class="alert alert-block alert-danger">
<b>Stop kernel of other notebooks.</b></div>

**Note:** If you have other notebooks currently running with the _Default Python 3.8 + Watson NLP XS_ environment, **stop their kernels** before running this notebook. All these notebooks share the same runtime environment, and if they are running in parallel, you may encounter memory issues. To stop the kernel of another notebook, open that notebook, and select _File > Stop Kernel_.

<div class="alert alert-block alert-warning">
<b>Set Project token.</b></div>

Before you can begin working on this notebook in Watson Studio in Cloud Pak for Data as a Service, you need to ensure that the project token is set so that you can access the project assets via the notebook.

When this notebook is added to the project, a project access token should be inserted at the top of the notebook in a code cell. If you do not see the cell above, add the token to the notebook by clicking **More > Insert project token** from the notebook action bar.  By running the inserted hidden code cell, a project object is created that you can use to access project resources.

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

<div class="alert alert-block alert-info">
<b>Tip:</b> Cell execution</div>

Note that you can step through the notebook cell by cell by pressing Shift-Enter, or you can execute the entire notebook by selecting **Cell -> Run All** from the menu.

<span style="color:blueviolet">Begin by importing and initializing some helper libs that are used throughout the notebook.</span>


In [2]:
import json
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import tensorflow as tf
import watson_nlp
import watson_nlp.data_model as dm

from sklearn.model_selection import train_test_split

from watson_core.toolkit import fileio
from watson_core.toolkit.quality_evaluation import QualityEvaluator, EvalTypes

from watson_nlp.blocks.classification.bert import BERT
from watson_nlp.blocks.classification.cnn import CNN
from watson_core.data_model.streams.resolver import DataStreamResolver
from watson_core.data_model.streams.resolver import DataStream
from watson_nlp.blocks.classification.svm import SVM
from watson_nlp.blocks.vectorization.tfidf import TFIDF

In [4]:
pd.set_option('display.max_colwidth', 0)

<span style="color:maroon">Printing either `block_models` or `workflow_models` displays a list of the pretrained models available in the current version of Watson NLP.</span>

In [5]:
block_models = watson_nlp.get_models().get_alias_models()
workflow_models = watson_nlp.get_workflows().get_alias_models()

## 2. Dataset

The dataset contains over three thousand quotations from movie dialogues. The labeled emotions in the dataset are "anger", "disgust", "fear", "happiness", "neutral", and "sadness". Because some quotations have multiple sentences, they can be regarded as documents. The dataset is available internally at the [GitHub repo](https://github.ibm.com/hcbt/Watson-NLP/blob/main/Emotion-Classification/movieDialog_train.csv), in accordance with IBM licensing. Any other emotion classification dataset can be substituted in the workflow outlined in this notebook.

<a id="dataImport"></a>
### 2.1. Data import using Watson Studio Project Library

<span style="color:blue">The previously inserted project access token will be used to import datasets from the Project Data Assets. The most commonly imported file formats are .csv and .json.</span>

In [6]:
file = project.get_file('movieDialog_train.csv')
df = pd.read_csv(file)

<a id="dataFormatting"></a>
### 2.2. Data formatting

<span style="color:blue">Data prepared for Watson NLP models needs to be formatted so that there is a `text` feature column and a `labels` label column. The `labels` column needs to have type `list`.</span>

In [7]:
def convertToList(x):
    return [x]

In [8]:
df['label'] = df['label'].apply(convertToList)
df = df.rename(columns={'label':'labels'})

<a id="trainTestSplit"></a>
### 2.3. Train test split

<div class="alert alert-block alert-info">
<b>Tip:</b> If you want to carry out emotion analysis on any other dataset, first upload the dataset into the project and then update the file name in the data import cell (section 2.1).</div>
<br>

<span style="color:blue">The data is split 80/20 into training and test sets using scikit-learn and then exported to JSON for the Watson NLP models to consume. The columns were already renamed to the expected `text` and `labels` names in the previous step, with each label wrapped in a list.</span>

In [9]:
df_train, df_test = train_test_split(df, test_size=0.2)

In [10]:
df_train.to_json('movieDialog_train.json', orient='records')
df_test.to_json('movieDialog_test.json', orient='records')

In [11]:
df_train

| Unnamed: 0 | text | labels | source |
|---|---|---|---|
| 1145 | Only one guys checked in? | [disgust] | MovieDialog |
| 3307 | That's what I mean - mysterious. Mr. Conway, I don't like that man. He's too vague. | [disgust] | MovieDialog |
| 357 | You don't mind that I'm not coming tonight? | [fear] | MovieDialog |
| 1272 | What do I think? You mean about the possibility of your becoming a monster in two days or about visits from dead friends? | [anger] | MovieDialog |
| 2356 | I was referring to myself. I thought we might have a picnic tomorrow - it might be a nice change after the Wild West party tonight. Invite everybody to go to the Everglades - | [happiness] | MovieDialog |
| ... | ... | ... | ... |
| 1614 | What would you say if I told you the toilet just blew up in my face. | [anger] | MovieDialog |
| 2327 | --ohGodplease -- don't kill me -- don't kill me -- you're one of them, I know it -- | [fear] | MovieDialog |
| 3532 | If we run into Billy first, let me try and talk him in. | [neutral] | MovieDialog |
| 1504 | If he's gonna fuck me up the ass! | [disgust] | MovieDialog |
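
The `train_test_split` call above sets no seed, so the split changes on every run, and a purely random split of an imbalanced dataset can leave rare labels underrepresented in the test set. scikit-learn's `train_test_split` exposes `random_state` and `stratify` parameters for this purpose. The sketch below shows the same idea in plain Python; the `stratified_split` helper, the seed value, and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(records, test_fraction=0.2, seed=42):
    """Split records into train and test lists, label by label."""
    by_label = defaultdict(list)
    for record in records:
        # Each record carries exactly one label in this dataset.
        by_label[record["labels"][0]].append(record)

    rng = random.Random(seed)  # a fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical toy data: 10 "anger" and 5 "fear" records.
toy = [{"text": f"a{i}", "labels": ["anger"]} for i in range(10)]
toy += [{"text": f"f{i}", "labels": ["fear"]} for i in range(5)]

train, test = stratified_split(toy)
print(len(train), len(test))
```

Because the split is done per label, both labels keep roughly the same 80/20 proportion in the two sets.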


<a id="training"></a>
## 3. Training models

<span style="color:blue">There are three custom classification models in the following sections. The TF-IDF and Embeddings SVM models are trained and saved but not used in the testing section; they can be tested as a bonus. The CNN model is set up to be fine-tuned and tested on the test split.</span>
<br>
<br>
<span style="color:blueviolet">The `model.save()` function will save the model as a directory populated with a config.yaml and an artifacts folder. The model will save in the current working directory where this notebook was launched, unless otherwise specified in the model name.</span>

<a id="preProcessing"></a>
### Pre-Processing

<span style="color:blue">The pre-processing step converts the training dataset into a data stream for Watson NLP consumption. 
The training data is then run through a syntax model to perform tokenization and lemmatization.
The following three models will all use this syntax processed data as training data.</span>

In [12]:
training_data_file = "movieDialog_train.json"

# Create datastream from training data
data_stream_resolver = DataStreamResolver(target_stream_type=list, expected_keys={'text': str, 'labels': list})
training_data = data_stream_resolver.as_data_stream(training_data_file)

# Load a Syntax model
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))

# Create Syntax stream
text_stream, labels_stream = training_data[0], training_data[1]
syntax_stream = syntax_model.stream(text_stream)
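
The resolver above is told to expect a `text` string and a `labels` list in every record. A quick check of the JSON records before building the stream can surface formatting mistakes early. The sketch below is plain Python and does not use the Watson NLP API; the `find_invalid_records` helper and the sample content are hypothetical.

```python
import json

EXPECTED_KEYS = {"text": str, "labels": list}


def find_invalid_records(records, expected_keys=EXPECTED_KEYS):
    """Return (index, reason) pairs for records that do not match the schema."""
    problems = []
    for index, record in enumerate(records):
        for key, expected_type in expected_keys.items():
            if key not in record:
                problems.append((index, f"missing key '{key}'"))
            elif not isinstance(record[key], expected_type):
                problems.append((index, f"'{key}' is not a {expected_type.__name__}"))
    return problems


# Hypothetical file content in the `orient='records'` layout written earlier.
sample_json = """[
  {"text": "You don't mind that I'm not coming tonight?", "labels": ["fear"]},
  {"text": "What do I think?", "labels": "anger"},
  {"labels": ["disgust"]}
]"""

problems = find_invalid_records(json.loads(sample_json))
print(problems)
```

Extra keys are ignored by this check; it only reports missing keys and wrong types.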

<a id="tfidf"></a>
### Training with TF-IDF

In [20]:
# Train the TF-IDF vectorizer
tf_idf_model = TFIDF.train(syntax_stream)
tfidf_train_stream = tf_idf_model.stream(syntax_stream)
tfidf_svm_train_stream = watson_nlp.data_model.DataStream.zip(tfidf_train_stream, labels_stream)

# Train SVM using TF-IDF training stream
tfidf_classification_model = SVM.train(tfidf_svm_train_stream)

In [21]:
tfidf_classification_model.save('model_tfidf_emo_classification')
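
To make the vectorization step less of a black box, the sketch below computes TF-IDF weights by hand for a few tokenized documents. It uses one common textbook variant (relative term frequency times `log(N / df)`); the exact weighting used by the Watson NLP `TFIDF` block may differ, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already tokenized and lemmatized documents.
docs = [
    ["i", "be", "so", "happy"],
    ["i", "be", "afraid"],
    ["i", "hate", "this"],
]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document, such as `i`, gets weight zero, while a term unique to one document, such as `happy`, gets the highest weight. This is why TF-IDF features help a classifier focus on discriminative words.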

<a id="embeddings"></a>
### Training with Embeddings

In [22]:
use_embedding_model = watson_nlp.download_and_load('embedding_use_en_stock')
use_train_stream = use_embedding_model.stream(syntax_stream, doc_embed_style='raw_text')
# `raw_text`: run the universal sentence encoder over your text as one large chunk
# `ave_sent`: independently run the universal sentence encoder over each of your sentences and average the results to 
#             produce a document embedding
use_svm_train_stream = watson_nlp.data_model.DataStream.zip(use_train_stream, labels_stream)

# Train SVM using Universal Sentence Encoder (USE) training stream
embeddings_classification_model = SVM.train(use_svm_train_stream)

In [23]:
embeddings_classification_model.save('model_embeddings_emo_classification')
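
The difference between the `raw_text` and `ave_sent` styles described in the comments above can be shown with a toy encoder. In the sketch below, `toy_encode` is a hypothetical stand-in for a sentence encoder and has nothing to do with the real Universal Sentence Encoder; only the order of encoding and averaging is the point.

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    size = len(vectors)
    return [sum(values) / size for values in zip(*vectors)]


def toy_encode(text):
    """Hypothetical encoder: [characters excluding spaces, word count]."""
    words = text.split()
    return [float(sum(len(w) for w in words)), float(len(words))]


document = ["I was scared.", "Now I am fine."]

# `raw_text` style: encode the whole document as a single chunk.
raw_text_embedding = toy_encode(" ".join(document))

# `ave_sent` style: encode each sentence, then average the vectors.
ave_sent_embedding = average_vectors([toy_encode(s) for s in document])

print(raw_text_embedding, ave_sent_embedding)
```

The two styles generally produce different document vectors, so the choice is worth treating as a hyperparameter when documents contain several sentences.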

<a id="cnn"></a>
### Training with CNN
<span style="color:maroon">The same hyperparameters can also be used to tune the out-of-the-box models.</span>

In [24]:
training_data_file = "movieDialog_train.json"

# Load glove embeddings
glove_embedding_model = watson_nlp.download_and_load('embedding_glove_en_stock')

# Train CNN
cnn_classification_model = CNN.train(DataStream.zip(syntax_stream, labels_stream), 
                                     embedding=glove_embedding_model.embedding, 
                                     batch_size=128, 
                                     filter_sizes=(1, 2, 2), 
                                     num_filters=256, 
                                     epochs=10, 
                                     random_seed=1001, 
                                     enable_tensorboard=False, 
                                     dropout_prob=0.5, 
                                     l2_reg_lambda=0.01, 
                                     verbose=1, 
                                     multi_label=False, )

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
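
The `filter_sizes` and `num_filters` arguments are easier to reason about with a toy example. The sketch below assumes the common text-CNN design in which each filter spans a fixed number of consecutive token embeddings and is followed by max-over-time pooling. Whether the Watson NLP CNN block follows exactly this design is an assumption, and the embeddings and filter weights are made up.

```python
def conv_max_pool(token_vectors, filter_weights):
    """Slide one filter over consecutive tokens and keep the strongest response."""
    width = len(filter_weights)  # the filter size: tokens covered per window
    responses = []
    for start in range(len(token_vectors) - width + 1):
        window = token_vectors[start:start + width]
        response = sum(
            w * x
            for weights, vector in zip(filter_weights, window)
            for w, x in zip(weights, vector)
        )
        responses.append(max(response, 0.0))  # ReLU activation
    return max(responses)  # max-over-time pooling


# Hypothetical 2-dimensional embeddings for a 4-token sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

unigram_filter = [[1.0, 1.0]]              # filter size 1
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]   # filter size 2

features = [
    conv_max_pool(sentence, unigram_filter),
    conv_max_pool(sentence, bigram_filter),
]
print(features)
```

Under this assumption, every filter contributes one pooled feature, so `filter_sizes=(1, 2, 2)` with `num_filters=256` would yield 3 × 256 = 768 features per document before the classification layer.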


In [25]:
cnn_classification_model.save('model_cnn_emo_classification')

<span style="color:blueviolet">By using `project.save_data()`, the model will be saved as an object into the Data Assets of the Watson Studio project.</span>

In [26]:
project.save_data('model_cnn_emo_classification', data=cnn_classification_model.as_file_like_object(), overwrite=True)

{'file_name': 'model_cnn_emo_classification',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '0e8343b6-0009-4cca-903b-0aec5ff4e27c'}

<a id="testing"></a>
## 4. Testing
<span style="color:blue">The first step in testing the model is to load in the previously saved classification model. Then, just like with the training data, the testing data will need to be processed by the syntax model for tokenization and lemmatization. Finally, the `model.evaluate_quality()` function will run predictions on the data.</span>

In [27]:
test_data_file = "movieDialog_test.json"

# Load a Syntax model
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))

# Load classification model
classification_model = watson_nlp.load('model_cnn_emo_classification')

# Setup pre-processing function
preprocess_func = lambda raw_doc: syntax_model.run_batch(raw_doc)

quality_report = classification_model.evaluate_quality(test_data_file, preprocess_func)

<span style="color:blueviolet">Although this CNN classification model outperformed the out-of-the-box Ensemble model, its micro precision, recall, and F1 scores are still medium-low.
<br><br>
*The micro metrics are reported here instead of the macro metrics because of the class imbalance in the data.</span>
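
For readers who want to see how these numbers are derived, the sketch below recomputes precision, recall, and F1 from confusion-matrix counts and contrasts micro- and macro-averaging. The counts are copied from the quality report printed in this section for three of the classes only, so the averaged values are illustrative and are not the report's overall scores. The helper name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Counts copied from the quality report in this section (three classes only).
counts = {
    "anger": {"tp": 203, "fp": 189, "fn": 78},
    "fear": {"tp": 0, "fp": 0, "fn": 38},
    "neutral": {"tp": 174, "fp": 148, "fn": 99},
}

per_class = {name: precision_recall_f1(**c) for name, c in counts.items()}

# Micro-averaging pools the counts first; macro-averaging averages per-class scores.
total = {k: sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")}
micro_f1 = precision_recall_f1(**total)[2]
macro_f1 = sum(scores[2] for scores in per_class.values()) / len(per_class)

print(per_class["anger"])
print(micro_f1, macro_f1)
```

In this subset, the macro average is pulled down by the `fear` class, which has no true positives, while the micro average is dominated by the larger classes.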

In [28]:
print(json.dumps(quality_report, indent=4))

{
    "per_class_confusion_matrix": {
        "anger": {
            "true_positive": 203,
            "true_negative": 0,
            "false_positive": 189,
            "false_negative": 78,
            "precision": 0.5178571428571429,
            "recall": 0.7224199288256228,
            "f1": 0.6032689450222882
        },
        "fear": {
            "true_positive": 0,
            "true_negative": 0,
            "false_positive": 0,
            "false_negative": 38,
            "precision": 0,
            "recall": 0.0,
            "f1": 0
        },
        "neutral": {
            "true_positive": 174,
            "true_negative": 0,
            "false_positive": 148,
            "false_negative": 99,
            "precision": 0.5403726708074534,
            "recall": 0.6373626373626373,
            "f1": 0.5848739495798319
        },
        "happiness": {
            "true_positive": 59,
            "true_negative": 0,
            "false_positive": 40,
            "false_negati

<span style="color:blueviolet">With the same single input, the CNN model predicts `happiness` with higher confidence than the out-of-the-box models.</span>

In [29]:
syntax_prediction = syntax_model.run("Such a sweet boy. But after much thought and careful consideration, I've decided that the ruler for the next ten thousand years is going to have to be... me. ")
classifier_result = classification_model.run(syntax_prediction)
print(classifier_result)

{
  "classes": [
    {
      "class_name": "happiness",
      "confidence": 0.7313699722290039
    },
    {
      "class_name": "neutral",
      "confidence": 0.0853412076830864
    },
    {
      "class_name": "sadness",
      "confidence": 0.08039671927690506
    },
    {
      "class_name": "anger",
      "confidence": 0.04342210665345192
    },
    {
      "class_name": "disgust",
      "confidence": 0.04208728298544884
    },
    {
      "class_name": "fear",
      "confidence": 0.017382703721523285
    }
  ],
  "producer_id": {
    "name": "CNN classifier",
    "version": "0.0.1"
  }
}
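
Downstream code usually needs only the top label and its confidence. The sketch below assumes the prediction is available as a plain dictionary with the structure printed above; how to obtain such a dictionary from the result object depends on the library version and is not shown. The 0.5 threshold is a hypothetical choice.

```python
import json

# Result copied from the printed output above (producer_id omitted).
result_json = """
{"classes": [
  {"class_name": "happiness", "confidence": 0.7313699722290039},
  {"class_name": "neutral", "confidence": 0.0853412076830864},
  {"class_name": "sadness", "confidence": 0.08039671927690506},
  {"class_name": "anger", "confidence": 0.04342210665345192},
  {"class_name": "disgust", "confidence": 0.04208728298544884},
  {"class_name": "fear", "confidence": 0.017382703721523285}
]}
"""
result = json.loads(result_json)

# Pick the class with the highest confidence.
top = max(result["classes"], key=lambda c: c["confidence"])
total_confidence = sum(c["confidence"] for c in result["classes"])

# Hypothetical threshold: treat low-confidence predictions as undecided.
label = top["class_name"] if top["confidence"] >= 0.5 else "undecided"
print(label, round(top["confidence"], 3))
```

The confidences in this output sum to approximately 1, so they can be read as a distribution over the six labels.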


In [30]:
# Another option for saving the model (left commented out)
# import pickle
# with open('model_cnn_emo_classification.pkl', 'wb') as f:
#     pickle.dump(classification_model.as_file_like_object(), f)

## 5. Summary

<span style="color:blue">This notebook shows you how to use the Watson NLP library and how quickly and easily you can get started with Watson NLP by fine-tuning deep learning models for emotion analysis.</span>