# Emotion Classification: Data Cleaning

Analysis by Frank Flavell

## Business Case

The goal of this project is to develop a Natural Language Understanding (NLU) algorithm for classifying the underlying emotion associated with a chat message so that chatbots and other programs can use this information to deliver a better experience to users. 

People have goals.  Some goals are explicit and others are implicit.  Explicit goals are usually easy to identify because a person can clearly articulate them, like buying groceries, resolving a billing issue, traveling to the beach, updating software, etc.  Implicit goals, on the other hand, are more difficult to identify.  These are emotional goals that aren't always articulated, even though achieving them is often more valuable than achieving the explicit goals.  Not only does a person want to buy groceries, they also want to feel good about the experience.

It requires emotional intelligence to recognize emotions and act in ways that will address these emotions in a healthy and harmonious way.  As we all know from customer service interactions, not all people have this level of emotional intelligence.  If a program can automate this emotion classification process, then organizations could use this information to dramatically improve a wide variety of service encounters across industries.

The specific application of this classification algorithm will be for the NLU pipeline of an emotionally intelligent chatbot.


## Dataset

I will be using the [DailyDialog](http://yanran.li/dailydialog) dataset, compiled for the International Joint Conference on Natural Language Processing (IJCNLP) in Taipei, Taiwan, by Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu.

This is a very thorough dataset that includes 13,118 conversations and 102,980 utterances.  Each conversation has been manually categorized by topic, and each utterance is additionally labeled with an emotion and a statement type.
* ***Dialogue: string.  One utterance per row.***
* ***Emotion: int. The emotion associated with the text.***
    * 0: No emotion
    * 1: Anger
    * 2: Disgust
    * 3: Fear
    * 4: Happiness
    * 5: Sadness
    * 6: Surprise
* ***Type: int. The type of utterance.***
    * 1: Inform
    * 2: Question
    * 3: Directive
    * 4: Commissive
* ***Topic: The general topic of the conversation.***
    * 1: Ordinary Life
    * 2: School Life
    * 3: Culture & Education
    * 4: Attitude & Emotion
    * 5: Relationship
    * 6: Tourism
    * 7: Health
    * 8: Work
    * 9: Politics
    * 10: Finance
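
The integer codes above can be turned back into readable names with plain dictionaries.  The sketch below is only an illustration: the dictionary names (`EMOTION_LABELS`, `TYPE_LABELS`, `TOPIC_LABELS`) and the two-row `sample` frame are assumptions made for the example, while the code-to-name pairs are copied from the list above.

```python
import pandas as pd

# Code-to-name maps copied from the dataset description above.
# The dictionary names themselves are hypothetical.
EMOTION_LABELS = {
    0: "No emotion", 1: "Anger", 2: "Disgust", 3: "Fear",
    4: "Happiness", 5: "Sadness", 6: "Surprise",
}
TYPE_LABELS = {1: "Inform", 2: "Question", 3: "Directive", 4: "Commissive"}
TOPIC_LABELS = {
    1: "Ordinary Life", 2: "School Life", 3: "Culture & Education",
    4: "Attitude & Emotion", 5: "Relationship", 6: "Tourism",
    7: "Health", 8: "Work", 9: "Politics", 10: "Finance",
}

# Two toy rows, used only to demonstrate the mapping.
sample = pd.DataFrame({
    "dialogue": ["The kitchen stinks .", "I'll throw out the garbage ."],
    "topic": [1, 1],
    "emotion": [2, 0],
    "type": [3, 4],
})

# Series.map looks each integer code up in the dictionary.
sample["emotion_name"] = sample["emotion"].map(EMOTION_LABELS)
sample["type_name"] = sample["type"].map(TYPE_LABELS)
sample["topic_name"] = sample["topic"].map(TOPIC_LABELS)

print(sample[["dialogue", "emotion_name", "type_name", "topic_name"]])
```

Keeping the integer columns and adding name columns alongside them leaves the model inputs untouched while making spot checks easier to read.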


## Table of Contents<span id="0"></span>

1. [**Import Dialogue**](#1)
2. [**Import Conversation Topics & Merge**](#2)
3. [**Import Emotion Classification**](#3)
4. [**Import Dialogue Act**](#4)
5. [**Compare DF to Emotions & Acts**](#5)
6. [**Explode Conversations & Topics into Utterances**](#6)
7. [**Explode Emotions & Merge**](#7)
8. [**Explode Dialogue Acts & Merge**](#8)
9. [**Update Datatypes**](#9)
10. [**Export Cleaned Data**](#10)




## Package Import

In [2]:
# import external libraries

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import re #regex

# Configure matplotlib for jupyter.
%matplotlib inline

## Data Import & Cleaning

The data comes in 4 different .txt files, one for each feature: dialogue, topic, emotion, and dialogue act.  In all files, each line contains one conversation containing several utterances between two people.  I needed to 'explode' each conversation so each utterance had its own row in the dataframe.

I also discovered that the conversation at index 672 was missing one emotion and one dialogue act classification.  I investigated the strings at this index in the emotion and act dataframes and updated the values to include the appropriate classification.  With this update, I could effectively merge the features together into a master df.

The result is a dataframe containing 102,980 non-null string utterances in the dialogue column and an integer label for each utterance in the topic, emotion, and type columns.
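
To make the target shape concrete, here is a minimal plain-Python sketch of how one line from each file lines up into utterance-level rows.  The sample strings mirror the first conversation shown later in this notebook; the separator is assumed to be a literal backslash followed by `t`, as the string repr in the notebook suggests, and the variable names are illustrative only.

```python
# One line from each of the four files, for a single conversation.
dialogue_line = "The kitchen stinks . \\t I'll throw out the garbage . \\t"
emotion_line = "2 0 "
act_line = "3 4 "
topic = 1

# Assumption: utterances are separated by a literal backslash + "t".
utterances = [u.strip() for u in dialogue_line.split("\\t") if u.strip()]

# str.split() with no argument ignores the trailing space.
emotions = [int(token) for token in emotion_line.split()]
acts = [int(token) for token in act_line.split()]

# One dictionary per utterance; the conversation topic is repeated.
rows = [
    {"dialogue": u, "topic": topic, "emotion": e, "type": a}
    for u, e, a in zip(utterances, emotions, acts)
]

for row in rows:
    print(row)
```

Note that `zip` silently stops at the shortest input, which is exactly why a conversation with a missing label has to be found and repaired before the features are combined.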

## <span id="1"></span>1. Import Dialogue
#### [Return Contents](#0)

In [3]:
df = pd.read_csv("data/dialogues_text.txt", delimiter="\t", header=None)
df.columns = ["dialogue"]

In [4]:
df.head()

| | dialogue |
|---|---|
| 0 | The kitchen stinks . \t I'll throw out the gar... |
| 1 | So Dick , how about getting some coffee for to... |
| 2 | Are things still going badly with your housegu... |
| 3 | Would you mind waiting a while ? \t Well , how... |
| 4 | Are you going to the annual party ? I can give... |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13118 entries, 0 to 13117
Data columns (total 1 columns):
dialogue    13118 non-null object
dtypes: object(1)
memory usage: 102.6+ KB


In [6]:
df['dialogue'][0]

"The kitchen stinks . \\t I'll throw out the garbage . \\t"

In [7]:
type(df['dialogue'][0])

str

No null values hiding as an empty string.

In [8]:
df[df['dialogue'] == '']

| | dialogue |
|---|---|
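
The equality check above only catches strings that are exactly empty.  As a hedged extension, the sketch below also flags whitespace-only strings and true nulls; the `toy` series is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Toy series with one real utterance, one empty string,
# one whitespace-only string, and one true null.
toy = pd.Series(["The kitchen stinks .", "", "   ", None], name="dialogue")

# A value counts as blank if it is null or empty after stripping whitespace.
is_blank = toy.isna() | toy.fillna("").str.strip().eq("")

print(toy[is_blank])
```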


## <span id="2"></span>2. Import Conversation Topics & Merge
#### [Return Contents](#0)

In [9]:
df['topic'] = pd.read_csv('data/dialogues_topic.txt', header=None)

In [10]:
df.head()

| | dialogue | topic |
|---|---|---|
| 0 | The kitchen stinks . \t I'll throw out the gar... | 1 |
| 1 | So Dick , how about getting some coffee for to... | 1 |
| 2 | Are things still going badly with your housegu... | 1 |
| 3 | Would you mind waiting a while ? \t Well , how... | 1 |
| 4 | Are you going to the annual party ? I can give... | 1 |


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13118 entries, 0 to 13117
Data columns (total 2 columns):
dialogue    13118 non-null object
topic       13118 non-null int64
dtypes: int64(1), object(1)
memory usage: 205.1+ KB


No null values hiding as an empty string.

In [12]:
df[df['dialogue'] == '']

| | dialogue | topic |
|---|---|---|


## <span id="3"></span>3. Import Emotion Classification
#### [Return Contents](#0)

In [14]:
# Import the txt file.
emotions = pd.read_csv('data/dialogues_emotion.txt', header=None)
# Label the emotions df column name
emotions.columns = ["emotion"]

In [15]:
emotions.head()

| | emotion |
|---|---|
| 0 | 2 0 |
| 1 | 4 2 0 1 0 |
| 2 | 0 1 0 0 |
| 3 | 0 0 0 4 |
| 4 | 0 4 4 |


We can see that the number of rows in the emotions dataframe matches the number of rows in the dialogue and topic dataframe: 13,118.

In [16]:
emotions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13118 entries, 0 to 13117
Data columns (total 1 columns):
emotion    13118 non-null object
dtypes: object(1)
memory usage: 102.6+ KB


No null values hiding as an empty string.

In [17]:
emotions[emotions['emotion'] == '']

| | emotion |
|---|---|


Each conversation's emotion string ends with an extra space, which would create an extra row when we explode the conversations into utterances.  We will need to deal with this when the time comes.

In [18]:
emotions.emotion[0]

'2 0 '
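
A quick illustration of why the trailing space matters: splitting on an explicit single space keeps an empty string at the end, while `str.split()` with no argument discards it.  The variable names are illustrative only.

```python
label_string = "2 0 "

# Splitting on a single space keeps the empty piece after the last space.
with_explicit_sep = label_string.split(" ")

# Splitting with no argument collapses whitespace and drops empty pieces.
with_default_sep = label_string.split()

print(with_explicit_sep)  # ['2', '0', '']
print(with_default_sep)   # ['2', '0']
```

This notebook splits on `' '` and then filters out the empty strings, which gives the same result for these label strings.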

## <span id="4"></span>4. Import Dialogue Act
#### [Return Contents](#0)

In [19]:
# Import the txt file.
acts = pd.read_csv('data/dialogues_act.txt', header=None)
# Label the acts df column name
acts.columns = ['type']

In [21]:
acts.head()

| | type |
|---|---|
| 0 | 3 4 |
| 1 | 3 4 3 1 1 |
| 2 | 2 1 3 4 |
| 3 | 3 2 1 1 |
| 4 | 3 4 1 |


We can also see that the number of rows in the dialogue acts dataframe matches the number of rows in the dialogue & topic as well as emotions dataframes: 13,118.

In [22]:
acts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13118 entries, 0 to 13117
Data columns (total 1 columns):
type    13118 non-null object
dtypes: object(1)
memory usage: 102.6+ KB


No null values hiding as an empty string.

In [23]:
acts[acts['type'] == '']

| | type |
|---|---|


## <span id="5"></span>5. Compare DF to Emotions & Acts
#### [Return Contents](#0)

Unfortunately, when I initially exploded the rows of the emotion and acts dataframes, they were both one row short of the dialogue dataframe, which meant one label was missing!  The conversations also weren't labeled with an ID number that I could use to match them with the labels from the other .txt files.  If the utterances and labels don't line up, it completely undermines my ability to predict the emotional classification of utterances.

Since the original one-conversation-per-row dataframes all contain the same number of rows, I built one series with the number of utterances per row and another with the number of emotion labels per row.  Comparing the two showed that the row at index 672 was the only one where the counts differed: 12 utterances but only 11 emotion labels.  The same was true for the acts dataframe.  I examined the contents of row 672 and updated the values to contain the correct classifications.

In [24]:
num_utter = df.dialogue.apply(lambda x: len(x.split('\\t'))-1)

In [25]:
num_emo = emotions.emotion.apply(lambda x: len(x.split(' '))-1)

In [26]:
compare = num_utter == num_emo

In [27]:
compare.index[~compare]

Int64Index([672], dtype='int64')

In [28]:
num_emo.iloc[672]

11

In [29]:
num_utter.iloc[672]

12

In [30]:
df.iloc[672, 0]

"Sam , can we stop at this bicycle shop ? \\t Do you want to buy a new bicycle ? \\t Yes , and they have a sale on now . \\t What happened to your old one ? \\t I left it at my parent's house , but I need one here as well . \\t I've been using Jim's old bike but he needs it back . \\t Let's go then . \\t Look at this mountain bike . It is only £ 330 . Do you like it ? \\t I prefer something like this one - a touring bike , but it is more expensive . \\t How much is it ? \\t The price on the tag says £ 565 but maybe you can get a discount . \\t OK , let's go and ask . \\t"

In [31]:
emotions.iloc[672]

emotion    0 0 0 0 0 0 0 0 0 0 0 
Name: 672, dtype: object

In [32]:
type(emotions.emotion[672])

str

In [33]:
# Use .loc to assign in place and avoid chained-assignment warnings.
emotions.loc[672, 'emotion'] = '0 0 0 0 0 0 0 0 0 0 0 0'

In [34]:
emotions.emotion[672]

'0 0 0 0 0 0 0 0 0 0 0 0'

In [35]:
# Use .loc to assign in place and avoid chained-assignment warnings.
acts.loc[672, 'type'] = '2 2 1 2 1 1 3 2 1 2 1 3'

In [36]:
acts.type[672]

'2 2 1 2 1 1 3 2 1 2 1 3'
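
After a manual patch like this, it can be worth re-running the count comparison to confirm that no mismatches remain.  The sketch below wraps the comparison in a reusable helper; the function names (`count_utterances`, `count_labels`, `find_mismatches`) and the toy series are hypothetical, and the separator is again assumed to be a literal backslash followed by `t`.

```python
import pandas as pd

def count_utterances(text, sep="\\t"):
    """Count the non-empty pieces of a conversation string."""
    return sum(1 for part in text.split(sep) if part.strip())

def count_labels(label_string):
    """Count the whitespace-separated labels, ignoring trailing spaces."""
    return len(label_string.split())

def find_mismatches(dialogue, labels):
    """Return the index values where utterance and label counts differ."""
    n_utter = dialogue.map(count_utterances)
    n_label = labels.map(count_labels)
    mismatch = (n_utter != n_label).to_numpy()
    return n_utter.index[mismatch].tolist()

# Toy data: the second conversation has three utterances but two labels.
toy_dialogue = pd.Series([
    "Hi . \\t Hello . \\t",
    "How are you ? \\t Fine . \\t Good . \\t",
])
toy_labels = pd.Series(["0 4 ", "0 0 "])

before = find_mismatches(toy_dialogue, toy_labels)
print(before)

# Patch the short row, then check again.
toy_labels.loc[1] = "0 0 4 "
after = find_mismatches(toy_dialogue, toy_labels)
print(after)
```

Calling the same helper with the acts series in place of the emotion series would cover both label files.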

## <span id="6"></span>6. Explode Conversations & Topics into Utterances
#### [Return Contents](#0)

Once each row of each dataframe contained the same number of values, I 'exploded' the conversations into utterances, making a dataframe with one utterance per row.

In [37]:
def splitDataFrameList(df,target_column,separator):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split
    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row,row_accumulator,target_column,separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
    new_df = pd.DataFrame(new_rows)
    return new_df

In [38]:
df = splitDataFrameList(df,'dialogue','\\t')
df = df[df.dialogue != '']
df.reset_index(drop=True, inplace=True)

In [40]:
df.head()

| | dialogue | topic |
|---|---|---|
| 0 | The kitchen stinks . | 1 |
| 1 | I'll throw out the garbage . | 1 |
| 2 | So Dick , how about getting some coffee for to... | 1 |
| 3 | Coffee ? I don ’ t honestly like that kind of... | 1 |
| 4 | Come on , you can at least try a little , bes... | 1 |
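
For comparison, newer pandas versions (0.25 and later) ship a built-in `DataFrame.explode` that can do the same job with less code.  The sketch below is an alternative under that assumption, not the method used in this notebook; `explode_column` is a hypothetical name, and unlike `splitDataFrameList` it also strips surrounding whitespace from each piece.

```python
import pandas as pd

def explode_column(frame, column, sep):
    """Split `column` on `sep` and give each piece its own row.

    Values in the other columns are repeated for every new row.
    """
    out = frame.copy()
    # Plain str.split treats the separator literally (no regex).
    out[column] = out[column].map(lambda s: s.split(sep))
    out = out.explode(column).reset_index(drop=True)
    # Difference from the notebook's helper: whitespace is stripped here.
    out[column] = out[column].str.strip()
    out = out[out[column] != ""]
    return out.reset_index(drop=True)

# Toy frame with a single conversation.
toy = pd.DataFrame({
    "dialogue": ["The kitchen stinks . \\t I'll throw out the garbage . \\t"],
    "topic": [1],
})

result = explode_column(toy, "dialogue", "\\t")
print(result)
```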


## <span id="7"></span>7. Explode Emotions & Merge
#### [Return Contents](#0)

In [41]:
# Split the strings in each row and expand each value into its own row
emotions = splitDataFrameList(emotions,'emotion',' ')
# Remove any rows that contain an empty string
emotions = emotions[emotions.emotion != '']
# Reset Index
emotions.reset_index(drop=True, inplace=True)

In [42]:
emotions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102980 entries, 0 to 102979
Data columns (total 1 columns):
emotion    102980 non-null object
dtypes: object(1)
memory usage: 804.7+ KB


Correct number of rows!  We're good to go.

In [43]:
df['emotion'] = emotions['emotion']

In [45]:
df.head()

| | dialogue | topic | emotion |
|---|---|---|---|
| 0 | The kitchen stinks . | 1 | 2 |
| 1 | I'll throw out the garbage . | 1 | 0 |
| 2 | So Dick , how about getting some coffee for to... | 1 | 4 |
| 3 | Coffee ? I don ’ t honestly like that kind of... | 1 | 2 |
| 4 | Come on , you can at least try a little , bes... | 1 | 0 |
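
Assigning one dataframe's column to another aligns the values on the index, so this merge relies on both frames having the same length and the same freshly reset index.  A minimal sketch of a guard that makes this assumption explicit is shown below; `attach_labels` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def attach_labels(frame, labels, name):
    """Add `labels` as column `name`, matching rows purely by position."""
    if len(frame) != len(labels):
        raise ValueError(
            f"{name}: expected {len(frame)} labels, got {len(labels)}"
        )
    out = frame.copy()
    # to_numpy() drops the index, so the assignment is strictly positional.
    out[name] = labels.to_numpy()
    return out

# Toy data for illustration only.
toy = pd.DataFrame({
    "dialogue": ["The kitchen stinks .", "I'll throw out the garbage ."],
    "topic": [1, 1],
})
toy_emotions = pd.Series(["2", "0"])

merged = attach_labels(toy, toy_emotions, "emotion")
print(merged)
```

A length mismatch then fails loudly instead of producing misaligned or missing labels.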


## <span id="8"></span>8. Explode Dialogue Acts & Merge
#### [Return Contents](#0)

In [46]:
# Split the strings in each row and expand each value into its own row
acts = splitDataFrameList(acts,'type',' ')
# Remove any rows that contain an empty string
acts = acts[acts.type != '']
# Reset Index
acts.reset_index(drop=True, inplace=True)

In [47]:
acts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102980 entries, 0 to 102979
Data columns (total 1 columns):
type    102980 non-null object
dtypes: object(1)
memory usage: 804.7+ KB


Correct number of rows!  We're good to go.

In [48]:
df['type'] = acts['type']

In [49]:
df.head()

| | dialogue | topic | emotion | type |
|---|---|---|---|---|
| 0 | The kitchen stinks . | 1 | 2 | 3 |
| 1 | I'll throw out the garbage . | 1 | 0 | 4 |
| 2 | So Dick , how about getting some coffee for to... | 1 | 4 | 3 |
| 3 | Coffee ? I don ’ t honestly like that kind of... | 1 | 2 | 4 |
| 4 | Come on , you can at least try a little , bes... | 1 | 0 | 3 |


## <span id="9"></span>9. Update Datatypes
#### [Return Contents](#0)

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102980 entries, 0 to 102979
Data columns (total 4 columns):
dialogue    102980 non-null object
topic       102980 non-null int64
emotion     102980 non-null object
type        102980 non-null object
dtypes: int64(1), object(3)
memory usage: 3.1+ MB


In [51]:
df.emotion = df.emotion.astype('int')

In [52]:
df.type = df.type.astype('int')

In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102980 entries, 0 to 102979
Data columns (total 4 columns):
dialogue    102980 non-null object
topic       102980 non-null int64
emotion     102980 non-null int64
type        102980 non-null int64
dtypes: int64(3), object(1)
memory usage: 3.1+ MB
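
Once the columns are integers, a range check against the label definitions in the Dataset section is a cheap way to catch stray values.  The sketch below is one possible check; `VALID_LABELS` and `count_invalid_labels` are hypothetical names, and the toy frame deliberately contains one out-of-range emotion code.

```python
import pandas as pd

# Allowed codes, taken from the label lists in the Dataset section.
VALID_LABELS = {
    "topic": list(range(1, 11)),   # 1..10
    "emotion": list(range(0, 7)),  # 0..6
    "type": list(range(1, 5)),     # 1..4
}

def count_invalid_labels(frame):
    """Return {column: number of out-of-range values} for problem columns."""
    problems = {}
    for column, allowed in VALID_LABELS.items():
        # to_numeric raises if a value cannot be parsed as a number.
        values = pd.to_numeric(frame[column], errors="raise")
        n_bad = int((~values.isin(allowed)).sum())
        if n_bad:
            problems[column] = n_bad
    return problems

# Toy frame: the emotion code "9" is outside the 0..6 range.
toy = pd.DataFrame({
    "topic": [1, 1, 1],
    "emotion": ["2", "0", "9"],
    "type": ["3", "4", "1"],
})

report = count_invalid_labels(toy)
print(report)
```

An empty dictionary means every label falls inside its documented range.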


## <span id="10"></span>10. Export Cleaned Data
#### [Return Contents](#0)

Save cleaned data to a pickle for easy access in the future.

In [54]:
df.to_pickle("data/dialogue_master.pickle")
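
One advantage of a pickle over a CSV is that column dtypes survive the round trip.  The sketch below demonstrates this with a toy frame written to a temporary directory, so it does not touch the real `data/` folder; the file name is reused here only for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same column layout as the cleaned data.
toy = pd.DataFrame({
    "dialogue": ["The kitchen stinks .", "I'll throw out the garbage ."],
    "topic": [1, 1],
    "emotion": [2, 0],
    "type": [3, 4],
})

# Write to a temporary directory and read the file straight back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dialogue_master.pickle")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares both the values and the dtypes.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
print(restored.dtypes)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.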