# Emotion AI: NRC Affect Intensity Lexicon

Analysis by Frank Flavell

## Overview

As part of the emotional classification process, I want to calculate an emotion score for each utterance. This would be a function in the NLU pipeline with the following operations:
   * Take in the utterance and tokenize it.
   * Compare each token in the utterance to a lexicon of words associated with the six emotion classes.
   * Add up the intensity scores of the words associated with each emotion.
   * Determine which emotion(s) exceed a specific threshold, so the utterance can be confidently classified with those emotions.

The score(s) would help to determine which emotion(s) the utterance is associated with and also address the possibility of multiple emotions in one sentence.
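
A minimal sketch of the steps above, assuming a naive regex tokenizer, a hypothetical `emotion_scores` helper, and an arbitrary threshold of 0.5 (none of these are settled choices). The toy lexicon reuses four word scores that appear in the outputs later in this notebook, with string labels instead of integer codes for readability.

```python
import re
from collections import defaultdict

import pandas as pd

# Toy stand-in for the combined lexicon (four rows only).
toy_lex = pd.DataFrame({
    "word": ["outraged", "grumpy", "happiest", "mourning"],
    "score": [0.964, 0.328, 0.986, 0.969],
    "emotion": ["anger", "anger", "joy", "sadness"],
})


def emotion_scores(utterance, lexicon, threshold=0.5):
    """Sum intensity scores per emotion and keep totals at or above threshold.

    Hypothetical helper; the threshold value is an assumption.
    """
    # Naive tokenizer: lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", utterance.lower())
    totals = defaultdict(float)
    for token in tokens:
        matches = lexicon[lexicon["word"] == token]
        for row in matches.itertuples(index=False):
            totals[row.emotion] += row.score
    return {emo: round(total, 3) for emo, total in totals.items() if total >= threshold}


scores = emotion_scores("I was outraged and grumpy, then the happiest.", toy_lex)
print(scores)
```

Because every emotion that clears the threshold is returned, one utterance can carry more than one emotion, which is the behaviour described above.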


## Dataset

The data I will be using is from an incredible emotional-linguistic initiative from the National Research Council (NRC) of Canada. The [Affect Intensity Lexicon](http://sentiment.nrc.ca/lexicons-for-research/) groups words associated with an overarching emotion and provides an intensity score for each word. For example, "outraged" has an anger intensity of 0.964 while "grumpy" has an anger intensity of 0.328. Intensity scores are derived using Best-Worst Scaling, an annotation method in which annotators repeatedly pick the most and least intense word from a small set, and each word's score is calculated from those choices.
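
To make the scoring idea concrete, here is a minimal sketch of the counting procedure commonly used with Best-Worst Scaling: the share of times a word was picked as most intense minus the share of times it was picked as least intense, rescaled from [-1, 1] to [0, 1]. The function name `bws_score` and the counts are hypothetical, and the exact procedure used for the NRC lexicon may differ in its details.

```python
def bws_score(times_best, times_worst, times_shown):
    """Return a 0-1 intensity from Best-Worst Scaling counts (illustrative only)."""
    raw = (times_best - times_worst) / times_shown  # lies in [-1, 1]
    return (raw + 1) / 2  # linear rescale to [0, 1]


# Hypothetical counts: a word shown in 8 annotation sets,
# picked as most intense 6 times and never as least intense.
score = bws_score(times_best=6, times_worst=0, times_shown=8)
print(score)
```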

I will be using the lexicons for the following emotions:
   * Anger
   * Disgust
   * Fear
   * Sadness
   * Joy (for happiness)
   * Surprise


## Table of Contents<span id="0"></span>

1. [**Import Anger**](#1)
<br/><br/>
2. [**Build Import Function**](#2)
<br/><br/>
3. [**Import Remaining Emotions**](#3)
<br/><br/>
4. [**Combine Emotions into nrc_lex**](#4)
<br/><br/>
5. [**Export to Pickle**](#5)

## Package Import

In [2]:
# import external libraries

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import re #regex

# Configure matplotlib for jupyter.
%matplotlib inline

## Data Import & Cleaning

The data comes in six .txt files, one for each emotion. In every file, each line contains one word and its intensity score, separated by a tab. I will import each file into a dataframe and combine them into one dataframe that can be subset if necessary.

## <span id="1"></span>1. Import Anger
#### [Return Contents](#0)

In [14]:
anger = pd.read_csv("data/nrc/anger-scores.txt", delimiter='\t', header=None)
anger.columns = ["word", "score"]

In [15]:
anger.head()

Unnamed: 0,word,score
0,outraged,0.964
1,brutality,0.959
2,hatred,0.953
3,hateful,0.94
4,terrorize,0.939


In [16]:
anger.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1483 entries, 0 to 1482
Data columns (total 2 columns):
word     1483 non-null object
score    1483 non-null float64
dtypes: float64(1), object(1)
memory usage: 23.3+ KB


In [17]:
anger['emotion'] = 1

In [18]:
anger.head()

Unnamed: 0,word,score,emotion
0,outraged,0.964,1
1,brutality,0.959,1
2,hatred,0.953,1
3,hateful,0.94,1
4,terrorize,0.939,1


## <span id="2"></span>2. Build Import Function
#### [Return Contents](#0)

In [19]:
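def lex_to_df(file_path: str, emo_num: int) -> pd.DataFrame:
    """Read a tab-separated NRC lexicon file and tag each row with an emotion id."""
    df = pd.read_csv(file_path, delimiter='\t', header=None)
    df.columns = ["word", "score"]
    df['emotion'] = emo_num  # integer code for the emotion (1 = anger, 2 = disgust, ...)
    return df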
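
As a quick check of the function, here is a sketch that runs it on in-memory stand-ins rather than the real files. It relies on `pd.read_csv` accepting file-like objects; the `fake_files` dictionary is hypothetical, and its rows are copied from the disgust and fear outputs in this notebook.

```python
import io

import pandas as pd


def lex_to_df(file_path, emo_num):
    # Same logic as the import function; file_path may also be a file-like object.
    df = pd.read_csv(file_path, delimiter="\t", header=None)
    df.columns = ["word", "score"]
    df["emotion"] = emo_num
    return df


# In-memory stand-ins for two tab-separated lexicon files, keyed by emotion code.
fake_files = {
    2: io.StringIO("cannibalism\t0.953\nmutilation\t0.93\n"),
    3: io.StringIO("torture\t0.984\nterrorist\t0.972\n"),
}

frames = {emo_num: lex_to_df(handle, emo_num) for emo_num, handle in fake_files.items()}
print(frames[3])
```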

## <span id="3"></span>3. Import Remaining Emotions
#### [Return Contents](#0)

After importing all the remaining emotions, the total number of entries in the lexicon with intensity scores is 7,493. Because a word can be listed under more than one emotion, this counts word-emotion pairs rather than unique words. The question is, how well will this dataset match up with everyday conversation and give me the ability to associate a score with each utterance?
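
One rough way to approach that question is to measure what fraction of tokens in a sample of utterances appear in the lexicon at all. The sketch below assumes a naive regex tokenizer and a hypothetical `lexicon_coverage` helper; the sample utterances are made up, and the four lexicon words are taken from outputs in this notebook.

```python
import re

import pandas as pd


def lexicon_coverage(utterances, lexicon_words):
    """Fraction of tokens across the utterances that appear in the lexicon."""
    vocab = set(lexicon_words)
    tokens = [t for u in utterances for t in re.findall(r"[a-z']+", u.lower())]
    if not tokens:
        return 0.0
    return sum(t in vocab for t in tokens) / len(tokens)


toy_words = pd.Series(["outraged", "grumpy", "happiest", "tree"])
sample = ["I am so grumpy today", "What a lovely tree"]  # invented utterances

coverage = lexicon_coverage(sample, toy_words)
print(f"{coverage:.2f}")
```

A low value on real conversation data would suggest that many utterances end up with no score at all.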

In [20]:
disgust = lex_to_df("data/nrc/disgust-scores.txt", 2)

In [21]:
disgust.head()

Unnamed: 0,word,score,emotion
0,cannibalism,0.953,2
1,mutilation,0.93,2
2,incest,0.914,2
3,molestation,0.914,2
4,gonorrhea,0.906,2


In [22]:
disgust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1094 entries, 0 to 1093
Data columns (total 3 columns):
word       1094 non-null object
score      1094 non-null float64
emotion    1094 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 25.8+ KB


In [23]:
fear = lex_to_df("data/nrc/fear-scores.txt", 3)

In [24]:
fear.head()

Unnamed: 0,word,score,emotion
0,torture,0.984,3
1,terrorist,0.972,3
2,horrific,0.969,3
3,terrorism,0.969,3
4,terrorists,0.969,3


In [25]:
fear.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1765 entries, 0 to 1764
Data columns (total 3 columns):
word       1765 non-null object
score      1765 non-null float64
emotion    1765 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 41.5+ KB


In [27]:
happy = lex_to_df("data/nrc/joy-scores.txt", 4)

In [28]:
happy.head()

Unnamed: 0,word,score,emotion
0,happiest,0.986,4
1,happiness,0.984,4
2,bliss,0.971,4
3,celebrating,0.97,4
4,jubilant,0.969,4


In [29]:
happy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1268 entries, 0 to 1267
Data columns (total 3 columns):
word       1268 non-null object
score      1268 non-null float64
emotion    1268 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 29.8+ KB


In [30]:
sad = lex_to_df("data/nrc/sadness-scores.txt", 5)

In [31]:
sad.head()

Unnamed: 0,word,score,emotion
0,heartbreaking,0.969,5
1,mourning,0.969,5
2,tragic,0.961,5
3,holocaust,0.953,5
4,suicidal,0.941,5


In [32]:
sad.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1298 entries, 0 to 1297
Data columns (total 3 columns):
word       1298 non-null object
score      1298 non-null float64
emotion    1298 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 30.5+ KB


In [33]:
surprise = lex_to_df("data/nrc/surprise-scores.txt", 6)

In [34]:
surprise.head()

Unnamed: 0,word,score,emotion
0,surprise,0.93,6
1,explode,0.906,6
2,flabbergast,0.906,6
3,explosion,0.898,6
4,eruption,0.883,6


In [35]:
surprise.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 585 entries, 0 to 584
Data columns (total 3 columns):
word       585 non-null object
score      585 non-null float64
emotion    585 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 13.8+ KB


In [37]:
len(anger.word) + len(disgust.word) + len(fear.word) + len(happy.word) + len(sad.word) + len(surprise.word)


7493

## <span id="4"></span>4. Combine Emotions
#### [Return Contents](#0)

After concatenating the dataframes, the shape should be 7,493 rows by 3 columns.

In [39]:
nrc_lex = pd.concat([anger, disgust, fear, happy, sad, surprise])

In [40]:
nrc_lex

Unnamed: 0,word,score,emotion
0,outraged,0.964,1
1,brutality,0.959,1
2,hatred,0.953,1
3,hateful,0.940,1
4,terrorize,0.939,1
...,...,...,...
580,peaceful,0.086,6
581,leisure,0.086,6
582,tree,0.078,6
583,picnic,0.078,6


The shape is right on. Now I reset the index, which still restarts at 0 for each emotion, so that it runs continuously from 0 to 7,492.

In [41]:
nrc_lex.reset_index(drop=True, inplace=True)
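
As an alternative sketch, `pd.concat` accepts `ignore_index=True`, which concatenates and renumbers the rows in one step. The two small frames below are stand-ins built from rows shown earlier in this notebook, not the full lexicons.

```python
import pandas as pd

# Small stand-ins for two of the per-emotion dataframes.
a = pd.DataFrame({"word": ["outraged", "brutality"], "score": [0.964, 0.959], "emotion": 1})
b = pd.DataFrame({"word": ["cannibalism", "mutilation"], "score": [0.953, 0.930], "emotion": 2})

# ignore_index=True discards the original indexes and numbers rows 0..n-1.
combined = pd.concat([a, b], ignore_index=True)

# Sanity check: row count is the sum of the parts, with three columns.
expected_rows = len(a) + len(b)
assert combined.shape == (expected_rows, 3)
print(combined.index.tolist())
```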

In [43]:
nrc_lex

Unnamed: 0,word,score,emotion
0,outraged,0.964,1
1,brutality,0.959,1
2,hatred,0.953,1
3,hateful,0.940,1
4,terrorize,0.939,1
...,...,...,...
7488,peaceful,0.086,6
7489,leisure,0.086,6
7490,tree,0.078,6
7491,picnic,0.078,6


In [44]:
nrc_lex.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7493 entries, 0 to 7492
Data columns (total 3 columns):
word       7493 non-null object
score      7493 non-null float64
emotion    7493 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 175.7+ KB


## <span id="5"></span>5. Export to Pickle
#### [Return Contents](#0)

Now I export the combined dataframe to a pickle file so I can load it when building the emotion score feature in my feature engineering notebook.

In [45]:
nrc_lex.to_pickle("data/nrc/eai_nrc_lex.pickle")
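
For the downstream notebook, a minimal sketch of loading the pickle back with `pd.read_pickle` and mapping the integer codes to readable labels. The codes follow the import order used here (1 = anger through 6 = surprise); the `EMOTION_LABELS` name, the toy dataframe, and the temporary file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Integer codes assigned in this notebook, in import order.
EMOTION_LABELS = {1: "anger", 2: "disgust", 3: "fear", 4: "joy", 5: "sadness", 6: "surprise"}

# Two-row stand-in for nrc_lex.
toy_lex = pd.DataFrame({"word": ["outraged", "tree"], "score": [0.964, 0.078], "emotion": [1, 6]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_lex.pickle")  # hypothetical file name
    toy_lex.to_pickle(path)
    loaded = pd.read_pickle(path)  # round trip preserves columns and dtypes

loaded["label"] = loaded["emotion"].map(EMOTION_LABELS)
print(loaded)
```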