# Emotional AI: Feature Engineering

Analysis by Frank Flavell

## How do people communicate about their emotions through text?
By identifying the way people communicate about their emotions, we can determine specific features to engineer that could capture each use case.

* ***Specific Emotion Words:*** Ideally the person uses specific words that get right at the emotion they currently feel as well as the causes and consequences of that emotion.  

    * Example: "I feel alienated because my friends didn't invite me to a concert and I'm not sure if they like me."
<br/><br/>
* ***Causes:*** The person can articulate the cause of the emotion they feel without having identified the emotion that has been caused.  This is like knowing the definition of a word without knowing the actual word.  The emotion must be inferred using the context of the message.

    * Example: "My friends didn't invite me to a concert."
<br/><br/>
* ***Consequences:*** Similar to causes, the person can articulate the consequences of the emotion they are feeling or how it is affecting them and/or others.  Causes and consequences are not mutually exclusive and it is sometimes easier to see the consequences of an emotion without understanding the cause.

    * Example: "I'm not sure if my friends like me."
<br/><br/>
* ***Inchoate:*** The person knows they are feeling something and that it is having an impact on them, but they haven't yet been able to articulate even the consequences of the emotion.  They use vague emotion words that signal a positive or negative feeling, like "good" or "bad".

    * Example: "I'm feeling bad."
<br/><br/>
* ***Buried, but Willing:*** In the near-worst case, which is probably to be expected, the person isn't even thinking about their emotions.  They are willing to discuss them, but they aren't aware of how they feel or of the causes and consequences of those emotions.  They most likely mask their true feelings unwittingly by using vague emotion words like "fine," "okay," "alright," etc.

    * Example: "I'm fine."
<br/><br/>
* ***Buried, but Unwilling:*** In the worst case, the person is unaware of their feelings and unwilling to talk about them.

    * Example: "Can't talk right now."
<br/><br/>

## What new features can be engineered to match communication?
In order of development priority.

* ***Emotion Score:*** Calculate the emotional intensity of keywords in order to prioritize one emotion over the others.
<br/><br/>
* ***Key Punctuation:*** (? ! .) could help us identify emotion and possibly the intensity of the emotion.
<br/><br/>
* ***Parts of Speech:*** Extract the parts of speech tags to gain a more accurate contextual understanding of the person's message to help with emotion words.
<br/><br/>
* ***Capitalization Ratio:*** Number of capital letters divided by the number of non-space characters in the utterance.
<br/><br/>

Because the main goal is to build a machine learning model sophisticated enough to recognize the linguistic patterns associated with emotions, I decided not to engineer the following features.  The features listed below would essentially be a crutch for the model when what I really need is more observations, and a wider variety of them, in my data for each emotion.

* ***Emotion Words:*** Connect the person's words back to a specific emotion using an emotion lexicon with intensity scores.
<br/><br/>
* ***Synonyms:*** Use synonyms of the person's keywords to connect the message to a specific emotion.
<br/><br/>


## Findings

#### Emotion Score Calculator

Before the Emotion Score Calculator can be put into production, it requires a more robust emotional lexicon that meets the following characteristics:
* Fewer or no duplicate words shared between emotions.
* A larger vocabulary that includes more words from everyday language.
* Better intensity scoring to ensure words lead back to the correct classification.

Next steps will be to find or develop a more comprehensive lexicon.

#### Punctuation Calculator

The punctuation calculator is useful for recognizing the intensity of an emotion and for identifying questions, which often fall into the "no emotion" class.  I'm not convinced this feature will be effective at helping to classify our 6 emotions, but I will still put it into production because clear patterns could emerge with enough data.

#### Parts of Speech

Parts of Speech would be useful for entity extraction, which could be combined with the emotion score calculator to more effectively identify emotional keywords.  But I'm not sure how this feature, by itself, will aid a machine learning model in learning the linguistic patterns associated with specific emotions. Until we build a more comprehensive lexicon of emotions, I do not think this feature is of use.

#### Capitalization Ratio

Similar to the punctuation calculator, the capitalization ratio signifies emotional intensity more than the actual emotion.  All caps can be used to signify excitement and happiness as well as anger and complete demoralization.  However, with enough data, I could eventually identify a pattern associated with a specific emotional class.



## Table of Contents<span id="0"></span>

1. [**Emotion Score Calculator**](#1)
<br/><br/>
2. [**Punctuation Counter**](#2)
<br/><br/>
3. [**Parts of Speech**](#3)
<br/><br/>
4. [**Capitalization Ratio**](#4)
<br/><br/>


# Package Import

In [2]:
# import external libraries
import pandas as pd
import numpy as np
import matplotlib as cm
import matplotlib.pyplot as plt
import seaborn as sns

from collections import Counter
import re  # regex
import random

import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Configure matplotlib for jupyter.
%matplotlib inline

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/matthewflavell/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/matthewflavell/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/matthewflavell/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Data Import

In [5]:
#Import cleaned data from pickle
df = pd.read_pickle('data/dailydialog/dialogue_lowered_contractions.pickle')

In [6]:
df.head(2)

Unnamed: 0,dialogue,topic,emotion,type
0,the kitchen stinks.,1,2,3
1,i will throw out the garbage.,1,0,4


In [13]:
df.shape

(102980, 4)

In [11]:
nrc_lex = pd.read_pickle('data/nrc/eai_nrc_lex.pickle')

In [12]:
nrc_lex.head()

Unnamed: 0,word,score,emotion
0,outraged,0.964,1
1,brutality,0.959,1
2,hatred,0.953,1
3,hateful,0.94,1
4,terrorize,0.939,1


In [14]:
nrc_lex.shape

(7493, 3)

# <span id="1"></span>1. Emotion Score Calculator
#### [Return Contents](#0)

Here I will build an emotion score calculator to determine how closely associated a sentence is with a specific emotion.

**Input:** utterance, str

**Output:** dict of summed scores keyed by emotion class, or an int for the single highest-scoring class (0 if no word in the utterance matches the lexicon).

**Function Operations:**
1. Take in the string utterance, remove punctuation and tokenize it
    * Could be useful to include the ability to lower case and expand contractions.
2. Compare each token in the utterance to a lexicon of words associated with the six emotion classes.
3. Add up the intensity scores of each word associated with each emotion.
4. Determine which emotion(s) exceed a specific threshold and can be confidently classified with those emotions.
5. Input the score(s) in the emotion score column for the utterance.

**Process:**
1. Successfully complete the process for one utterance.
2. Build the function using the successful steps.
3. Test the function on two other utterances.
4. Use .apply to calculate the scores for all utterances in the dataset.
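
The function operations above can be sketched end to end in plain Python. This is a minimal sketch under stated assumptions, not the implementation built in the cells that follow: `TOY_LEXICON` is a four-row stand-in for `nrc_lex` using rows that appear in this notebook's outputs, the names `tokenize`, `score_utterance`, and `main_emotion` are hypothetical, and the tokenizer is simplified (it splits on punctuation rather than deleting it).

```python
import re

# Toy stand-in for nrc_lex: (word, intensity score, emotion class).
TOY_LEXICON = [
    ("dangerous", 0.75, 3),
    ("turn", 0.133, 2),
    ("worrying", 0.484, 3),
    ("worrying", 0.562, 5),
]
EMOTIONS = range(1, 7)  # 1 anger, 2 disgust, 3 fear, 4 happy, 5 sad, 6 surprise


def tokenize(text):
    """Step 1: lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_utterance(text, lexicon=TOY_LEXICON):
    """Steps 2-3: match tokens against the lexicon and sum scores per emotion."""
    totals = {emotion: 0.0 for emotion in EMOTIONS}
    for token in tokenize(text):
        for word, score, emotion in lexicon:
            if word == token:
                totals[emotion] += score
    return totals


def main_emotion(totals):
    """Step 4 (simplified): highest-scoring class, or 0 if nothing matched."""
    best = max(totals, key=totals.get)
    return best if totals[best] > 0 else 0


totals = score_utterance("i think it is dangerous. what if i cannot turn it off?")
print(totals)
print(main_emotion(totals))
```

Step 5 would then amount to applying a function like `main_emotion(score_utterance(...))` to every utterance in the dialogue column.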

### Remove punctuation and tokenize

In [4]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # raw string avoids invalid-escape warnings
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace separator symbols matched by REPLACE_BY_SPACE_RE with a space
    text = BAD_SYMBOLS_RE.sub('', text) # delete every remaining character matched by BAD_SYMBOLS_RE
#   text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stopwords from text
    return text

In [189]:
df[df.emotion == 1]

Unnamed: 0,dialogue,topic,emotion,type
5,what is wrong with that? cigarette is the thin...,1,1,1
8,getting worse. now he is eating me out of hous...,1,1,1
70,"no, the steak was recommended, but it is not v...",1,1,1
72,so what? it is not fresh and i am not happy ab...,1,1,1
74,"no, thank you.",1,1,4
...,...,...,...,...
101913,"i am sorry to say it is not on the way, but du...",10,1,1
102708,"stupid girl, making me spend so much money, no...",10,1,3
102710,"i know where to put my card! stupid machine, t...",10,1,1
102712,"yeah, yeah, i know what i selected. just give ...",10,1,3


In [253]:
test1 = clean_text(df.dialogue[102710])
test1

'i know where to put my card stupid machine  talking to me like i am an idiot'

In [254]:
test2 = clean_text(df.dialogue[102714])
test2

'no  no stupid machine  what are you doing no'

In [255]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+')

test1 = tokenizer.tokenize(test1)
test1

['i',
 'know',
 'where',
 'to',
 'put',
 'my',
 'card',
 'stupid',
 'machine',
 'talking',
 'to',
 'me',
 'like',
 'i',
 'am',
 'an',
 'idiot']

In [256]:
def sent_clean_token(text):
    """Lowercase the text, strip punctuation, and return a list of word tokens."""
    text = text.lower()
    REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
    cleaned = BAD_SYMBOLS_RE.sub('', text)
    tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+')
    tokened = tokenizer.tokenize(cleaned)
    return tokened

In [257]:
test2 = sent_clean_token(test2)
test2

['no', 'no', 'stupid', 'machine', 'what', 'are', 'you', 'doing', 'no']

### Compare each token in the sentence with emotion lexicon

According to my EDA of the NRC Lexicon, there are many duplicate words in the lexicon, which means we will need a way to capture multiple emotion scores per word.  I decided to iterate over an utterance and build a mini dataframe containing the key emotion words and their scores.

The word 'outrage' belongs to two classes.  I start out by building a dataframe for just this word.

In [258]:
data = pd.DataFrame(columns=['word', 'score', 'emotion'])
word = 'outrage'
row = nrc_lex.loc[nrc_lex['word'] == word]
data = pd.concat([data, row])

In [259]:
data

Unnamed: 0,word,score,emotion
75,outrage,0.848,1
2089,outrage,0.469,2


Next I iterate over a list of multiple words to check that my code builds a dataframe containing all key words and their scores.

In [260]:
sentence = ['gross', 'outrage', 'worse']
data = pd.DataFrame(columns=['word', 'score', 'emotion'])
for word in sentence:
    row = nrc_lex.loc[nrc_lex['word'] == word]
    data = pd.concat([data, row])
print(data)

         word  score emotion
1635    gross  0.719       2
75    outrage  0.848       1
2089  outrage  0.469       2
3500    worse  0.484       3
6369    worse  0.453       5


I built a function to output this mini dataframe of key emotional words, scores, and classes.

In [261]:
def sent_df(sentence):
    data = pd.DataFrame(columns=['word', 'score', 'emotion'])
    for word in sentence:
        row = nrc_lex.loc[nrc_lex['word'] == word]
        data = pd.concat([data, row])
    return data

In [262]:
test1 = sent_df(test1)
test1

Unnamed: 0,word,score,emotion
2051,idiot,0.492,2


In [263]:
test2 = sent_df(test2)
test2

Unnamed: 0,word,score,emotion
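
Concatenating one lookup per token rescans the whole lexicon for every word. One possible alternative, sketched here in plain Python rather than pandas, is to index the lexicon by word once so that each lookup returns every (score, emotion) pair for that word. The rows are copied from the lexicon outputs above; `build_index` and `lookup` are hypothetical names, not part of the notebook.

```python
from collections import defaultdict

# Rows copied from the lexicon outputs shown above: (word, score, emotion class).
rows = [
    ("gross", 0.719, 2),
    ("outrage", 0.848, 1),
    ("outrage", 0.469, 2),
    ("worse", 0.484, 3),
    ("worse", 0.453, 5),
]


def build_index(rows):
    """Map each word to the list of (score, emotion) pairs it carries."""
    index = defaultdict(list)
    for word, score, emotion in rows:
        index[word].append((score, emotion))
    return dict(index)  # plain dict, so missing words are not silently inserted


def lookup(tokens, index):
    """Return one (word, score, emotion) match per token occurrence and class."""
    return [
        (token, score, emotion)
        for token in tokens
        for score, emotion in index.get(token, [])
    ]


index = build_index(rows)
matches = lookup(["gross", "outrage", "worse", "machine"], index)
for match in matches:
    print(match)
```

As with `sent_df`, a word that belongs to two classes yields two matches, a word missing from the lexicon yields none, and a token repeated in the utterance is counted once per occurrence.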


### Add up scores for each emotion.

In [203]:
mini_df = sent_df(['gross', 'outrage', 'worse'])

In [205]:
mini_df

Unnamed: 0,word,score,emotion
1635,gross,0.719,2
75,outrage,0.848,1
2089,outrage,0.469,2
3500,worse,0.484,3
6369,worse,0.453,5


I made an even smaller dataframe for anger and then calculated the sentence's overall anger score.

In [218]:
anger = mini_df[mini_df.emotion == 1]

In [219]:
anger.score.sum()

0.848

I then used the same code to build a function that calculates the overall emotional score for each emotion and outputs the scores in a dictionary.

In [226]:
def emo_scores(mini_df):
    # Emotion classes: 1 anger, 2 disgust, 3 fear, 4 happy, 5 sad, 6 surprise
    score = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0}
    for emotion in score:
        # Sum the intensity scores of every lexicon match for this emotion
        score[emotion] += mini_df[mini_df.emotion == emotion].score.sum()
    return score

In [227]:
score = emo_scores(mini_df)
score

{1: 0.848,
 2: 1.1880000000000002,
 3: 0.484,
 4: 0.0,
 5: 0.45299999999999996,
 6: 0.0}

In [264]:
test1 = emo_scores(test1)
test1

{1: 0.0, 2: 0.49200000000000005, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0}

In [265]:
test2 = emo_scores(test2)
test2

{1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0}

### Choose the emotion score to include

Now I need to decide exactly what should be output and inserted into the new feature column.

People often feel more than one emotion at a time.  It is common to feel angry and sad, angry and disgusted, sad and disgusted or happy and surprised at the same time.  Humans have complex emotions and combinations of emotions.  

However, the purpose of this pipeline is to identify the main emotion someone is feeling so the bot can mirror that emotion back to the person, so I output only the class with the highest score.

If there isn't a score for any of the emotion classes, then we need to output a 0.

In [228]:
max_key = max(score, key=score.get)
max_key

2

In [268]:
def highest_score(score):
    max_key = max(score, key=score.get)
    if score[max_key] == 0:
        return 0
    else:
        return max_key

In [269]:
test1 = highest_score(test1)
test1

2

In [270]:
test2 = highest_score(test2)
test2

0
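
Step 4 of the function operations mentions a threshold, which `highest_score` does not apply: any non-zero maximum wins, and ties go to the lowest class number. A hedged sketch of how a threshold could be added follows. The `threshold` and `margin` defaults are illustrative assumptions that have not been tuned on the data, and `confident_emotion` is a hypothetical name.

```python
def confident_emotion(score, threshold=0.4, margin=0.1):
    """Return the top emotion class only if it is strong and clearly ahead.

    score: dict mapping emotion class -> summed intensity (at least two classes)
    threshold: minimum summed intensity for the top class (assumed value)
    margin: minimum lead over the runner-up (assumed value)
    Returns 0 when no class qualifies.
    """
    ranked = sorted(score.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score < threshold or best_score - second_score < margin:
        return 0
    return best


# Score dictionaries in the shape returned by emo_scores.
clear_case = {1: 0.848, 2: 1.188, 3: 0.484, 4: 0.0, 5: 0.453, 6: 0.0}
close_case = {1: 0.0, 2: 0.0, 3: 0.484, 4: 0.0, 5: 0.562, 6: 0.0}
no_match = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0}

print(confident_emotion(clear_case))
print(confident_emotion(close_case))
print(confident_emotion(no_match))
```

With these assumed defaults the clear case resolves to class 2, while the close call between fear and sadness is reported as 0 rather than forced into one class. Whether abstaining is preferable to guessing depends on how the bot should respond to uncertain input.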

### Build Emotion Score Function & Test Cases

In [282]:
def emotion_score_calc(text: str, df=False, cls=False):
    """
        text: a string
        
        df: bool
            If True, function returns a dataframe containing the emotion keywords, 
            their scores, and associated emotion.
            
        cls: bool
            If False, function returns a dictionary of the emotion scores.
            If True, function returns the emotion classification with the highest score.
    """    
    sentence = sent_clean_token(text)
    mini_df = sent_df(sentence)
    if df:
        return mini_df
    score = emo_scores(mini_df)
    if cls:
        return highest_score(score)
    else:
        return score

In [272]:
df[df.emotion == 3]

Unnamed: 0,dialogue,topic,emotion,type
4389,promise that you will not get angry.,1,3,3
5076,i think it is dangerous. what if i cannot turn...,1,3,2
11304,i am worrying about it too. i want to install ...,1,3,1
11902,too bad. i really like it.,1,3,1
12876,"really, it is so beautiful.",1,3,1
...,...,...,...,...
95125,i have not gone to the interview yet. it is to...,8,3,1
95181,"ah, thanks.",8,3,1
96492,yeah?,9,3,2
98089,i do not know anything about the bills or laws...,9,3,1


### Test Case 3

In [288]:
df.dialogue[5076]

'i think it is dangerous. what if i cannot turn it off?'

In [278]:
emotion_score_calc(df.dialogue[5076], df=True)

Unnamed: 0,word,score,emotion
2793,dangerous,0.75,3
2552,turn,0.133,2


In [279]:
emotion_score_calc(df.dialogue[5076], cls=False)

{1: 0.0, 2: 0.133, 3: 0.75, 4: 0.0, 5: 0.0, 6: 0.0}

In [280]:
emotion_score_calc(df.dialogue[5076], cls=True)

3

### Test Case 4

In [287]:
df.dialogue[11304]

'i am worrying about it too. i want to install a security door.'

In [283]:
emotion_score_calc(df.dialogue[11304], df=True)

Unnamed: 0,word,score,emotion
3526,worrying,0.484,3
6127,worrying,0.562,5


In [284]:
emotion_score_calc(df.dialogue[11304], cls=False)

{1: 0.0, 2: 0.0, 3: 0.484, 4: 0.0, 5: 0.562, 6: 0.0}

In [286]:
emotion_score_calc(df.dialogue[11304], cls=True)

5

### Emotion Score Calculator Findings

The calculator is only as good as the lexicon behind it and the emotional intelligence of the person feeding it messages.  We cannot expect the user to improve their vocabulary for the sake of the machine, so that means the lexicon must be robust enough to account for the language used to express everyday emotions.  The scores behind those words must also be accurate.

Although the lexicon seems comprehensive on the surface, I do not think it actually meets the requirements of this specific project.  It failed to recognize the emotion classification for 3 out of 4 of the test cases.  It also seems the large number of duplicate words ultimately undermines the calculator's ability to classify the main emotion correctly.  

I will need to find or build a more robust lexicon before implementing this feature in the model.
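
To put a number on the duplicate problem before choosing a replacement lexicon, one could count how many words are listed under more than one emotion class. The following is a minimal sketch using a few rows copied from the outputs above; `shared_words` is a hypothetical helper, and a real check would run over every row of `nrc_lex`.

```python
from collections import defaultdict

# (word, score, emotion class) rows copied from the notebook outputs.
rows = [
    ("idiot", 0.492, 2),
    ("dangerous", 0.75, 3),
    ("outrage", 0.848, 1),
    ("outrage", 0.469, 2),
    ("worrying", 0.484, 3),
    ("worrying", 0.562, 5),
]


def shared_words(rows):
    """Return {word: sorted emotion classes} for words listed under 2+ classes."""
    classes = defaultdict(set)
    for word, _score, emotion in rows:
        classes[word].add(emotion)
    return {word: sorted(found) for word, found in classes.items() if len(found) > 1}


dupes = shared_words(rows)
vocab = {word for word, _score, _emotion in rows}
print(dupes)
print(f"{len(dupes)} of {len(vocab)} words are shared between classes")
```

The same tally makes it possible to compare candidate lexicons on the share of ambiguous words.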

However, the lexicon also brings to light some weaknesses of the DailyDialog dataset.  I do not agree with every classification I encounter, and I also recognize that the structure of the utterances often does not adequately represent the types of inputs a chatbot will receive.  I will need to address this with more data.

# <span id="2"></span>2. Punctuation Counter
#### [Return Contents](#0)

In [294]:
def punc_calc(text: str):
    punc = {'?': 0, '!': 0}
    #question mark
    question = text.count('?')
    punc['?'] += question
    #exclamation point
    exclaim = text.count('!')
    punc['!'] += exclaim
    return punc

### Test 1

In [295]:
df.dialogue[96492]

'yeah?'

In [296]:
punc_calc(df.dialogue[96492])

{'?': 1, '!': 0}

### Test 2

In [297]:
df.dialogue[102714]

'no, no! stupid machine, what are you doing! no!'

In [298]:
punc_calc(df.dialogue[102714])

{'?': 0, '!': 3}

### Test 3

In [299]:
df.dialogue[102710]

'i know where to put my card! stupid machine, talking to me like i am an idiot...'

In [300]:
punc_calc(df.dialogue[102710])

{'?': 0, '!': 1}
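
The feature list names the period alongside `?` and `!`, but `punc_calc` counts only the latter two. A possible extension is sketched below. Treating a run of two or more periods as a separate ellipsis count is an assumption rather than part of the original design, and `punc_counts` is a hypothetical name.

```python
import re


def punc_counts(text):
    """Count question marks, exclamation points, periods, and ellipses."""
    ellipses = re.findall(r"\.{2,}", text)  # runs such as ".." or "..."
    periods_in_ellipses = sum(len(run) for run in ellipses)
    return {
        "?": text.count("?"),
        "!": text.count("!"),
        ".": text.count(".") - periods_in_ellipses,  # periods outside any ellipsis
        "...": len(ellipses),
    }


trailing = punc_counts(
    "i know where to put my card! stupid machine, talking to me like i am an idiot..."
)
question = punc_counts("i think it is dangerous. what if i cannot turn it off?")
print(trailing)
print(question)
```

Separating the ellipsis from the plain period keeps a trailing-off utterance from looking like three sentence endings.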

# <span id="3"></span>3. Parts of Speech
#### [Return Contents](#0)

I will return to this feature once I have found or built a more comprehensive emotional lexicon.

# <span id="4"></span>4. Capitalization Ratio
#### [Return Contents](#0)

In [301]:
message = 'ONE TWO three'

In [308]:
len(message.replace(" ", ""))

11

In [303]:
caps = sum(1 for c in message if c.isupper())
caps

6

In [304]:
message2 = 'One, Two, Three'

In [309]:
len(message2.replace(" ", ""))

13

In [305]:
caps = sum(1 for c in message2 if c.isupper())
caps

3

In [341]:
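def cap_ratio(text: str):
    """Return the share of non-space characters that are uppercase, rounded to 2 places."""
    REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    BAD_SYMBOLS_RE = re.compile(r'[^0-9a-zA-Z #+_]')
    text = BAD_SYMBOLS_RE.sub('', text)
    caps = sum(1 for c in text if c.isupper())
    leng = len(text.replace(" ", ""))
    if leng == 0:  # nothing left after cleaning (e.g. "?!"), so avoid dividing by zero
        return 0.0
    return round((caps/leng),2)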

In [342]:
Message3 = "HOW MANY TIMES!!!"

In [343]:
cap_ratio(Message3)

1.0

In [344]:
Message4 = "How many times?"

In [345]:
cap_ratio(Message4)

0.08
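
The character-level ratio treats a sentence-initial capital and a shouted word alike. A possible complement, sketched under the assumption that whole words in capitals are the stronger intensity signal, is the share of words written entirely in uppercase. `caps_word_ratio` is a hypothetical name, and skipping single-letter words such as "I" is likewise an assumption.

```python
import re


def caps_word_ratio(text):
    """Share of alphabetic words (2+ letters) written entirely in uppercase."""
    words = re.findall(r"[A-Za-z]{2,}", text)
    if not words:  # no qualifying words, so avoid dividing by zero
        return 0.0
    shouted = sum(1 for word in words if word.isupper())
    return round(shouted / len(words), 2)


print(caps_word_ratio("HOW MANY TIMES!!!"))
print(caps_word_ratio("How many times?"))
print(caps_word_ratio("ONE TWO three"))
```

On the three messages used above this gives 1.0, 0.0, and 0.67, so an ordinary capitalized sentence scores zero instead of 0.08.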