# Emotions ML Data Generation 

This notebook creates a large corpus to train and test a machine learning emotions classifier for the `limbic` package.


## General Idea

Given that currently the `limbic` package supports a very deterministic dictionary-based emotions classification, the idea is the following: 
1. Fetch a large corpus of texts. These texts need to depict a wide range of emotions. It's impossible to manually verify that emotions are reasonably distributed in this corpus, so, as a strong assumption, I'll pick as many books from different genres as I can. The hypothesis is that with a large enough set of long documents, the classifier will be able to pick up the patterns that characterize each emotion.
2. Pre-process these texts to isolate as many sentences as possible, and run the current emotions classifier on each sentence.
3. Aggregate the results into a multi-category dataset where each sentence is associated with one or more categories, and each category has a strength as determined by the dictionary-based emotions classifier in `limbic`.
4. Train a model on this dataset and then check its performance.

For this, I picked a few books from https://www.smashwords.com/ that were free and already available in `txt` format. This is just to get started, but the collection consists of cheesy fiction books from different genres, which seem to be quite heavy on emotions, so it might be a good sample (I added a few non-fiction factual books to balance out the emotion-heavy texts for the classifier). Also, I went through the classic Project Gutenberg and picked ~90 books from the top list https://www.gutenberg.org/browse/scores/top. The model will be built in TensorFlow using a bi-directional recurrent neural network for multi-label classification, and tested in a separate notebook.



## Processing Books


In [1]:
training_metadata = {}  # This variable will keep track of all parameters used in this notebook. 

In [41]:
import os
from typing import List, Iterable
from collections import Counter

from tqdm import tqdm_notebook as tqdm 

from limbic.emotion.models.tf_limbic_model import utils


books_path = '../data/books/'  # books are included in this repository as a compressed file books.tar.gz 
files = [os.path.join(books_path, filename) for filename in os.listdir(books_path) if filename.endswith(".txt")]
training_metadata['corpus_files'] = files
training_metadata['corpus_total_files'] = len(files)
print(f'--\ntotal: {len(files)} files.')

paragraphs = []
for _file in files:
    with open(_file, 'r', encoding='utf-8') as f:  # explicit encoding: the books contain curly quotes
        paragraphs += utils.load_book(f.readlines())
training_metadata['corpus_total_paragraphs'] = len(paragraphs)
print(f'total paragraphs: {len(paragraphs)}')

# Split each paragraph into sentences on periods only (question and exclamation marks are not used as delimiters)
# TODO: An analysis on whether to use question marks or exclamation marks could be interesting ;)
lines = []
for p in paragraphs:
    lines += [x.strip() for x in p.split('.') if x and len(x.split(' ')) > 1]  
training_metadata['corpus_total_lines'] = len(lines)
print(f'total lines: {len(lines)}')

unique_words = Counter()
for l in lines:
    unique_words.update(l.split(' '))
print(f'total unique words ~ {len(unique_words.keys())}')

sorted_words = sorted(unique_words.items(), key=lambda x: x[1], reverse=True)
print(f'max freq: {sorted_words[0]}')
print(f'freq 50k: {sorted_words[50000]}')
print(f'freq 100k: {sorted_words[100000]}')
print(f'freq 200k: {sorted_words[200000]}')


--
total: 96 files.
total paragraphs: 111006
total lines: 432053
total unique words ~ 252216
max freq: ('the', 373777)
freq 50k: ('malicious,', 5)
freq 100k: ('Hollander', 2)
freq 200k: ('across--till', 1)
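
The TODO in the cell above mentions also splitting on question and exclamation marks. A minimal sketch of what that could look like follows; `split_sentences` is a hypothetical helper (not part of `limbic`), the example text is made up, and the two-word minimum mirrors the filter used in the cell above.

```python
import re
from typing import List

# Split on whitespace that follows '.', '!' or '?', so the punctuation stays
# attached to the sentence it ends.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def split_sentences(paragraph: str, min_words: int = 2) -> List[str]:
    """Hypothetical helper: split a paragraph and drop one-word fragments."""
    candidates = (s.strip() for s in _SENTENCE_END.split(paragraph))
    return [s for s in candidates if len(s.split()) >= min_words]


# Made-up example text, not taken from the corpus.
example = 'Is it over? She was not sure. No! It was too late.'
print(split_sentences(example))
```

Whether this changes the labels in a useful way would still need the analysis mentioned in the TODO; abbreviations such as "Mr." are split incorrectly by both approaches.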


## Create training and testing dataset

In this section we'll process the corpus and create the training and testing datasets.

The idea is to use `limbic` to approximate, in a deterministic (but not perfect) way, the emotions of each sentence, and to use these emotions as labels. Since a sentence can carry several emotions, the problem is modeled as multi-label classification. When several terms in a sentence map to the same emotion label, I'll keep the maximum value for that label.
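
As a rough illustration of the "keep the max per label" rule, here is a small self-contained sketch. The `Emotion` namedtuple is only a stand-in with the same three fields (`term`, `value`, `category`) used later in this notebook, and the terms, values, and label order are made up for the example.

```python
from collections import defaultdict, namedtuple

# Stand-in for limbic's Emotion type (assumption: same three fields).
Emotion = namedtuple('Emotion', ['term', 'value', 'category'])

# Assumed label set for the example; the notebook takes it from limbic's constants.
LABELS = ('sadness', 'joy', 'fear', 'anger')


def max_score_per_label(emotions, labels=LABELS):
    """Collapse term-level emotions into one score per label (max wins, 0.0 if absent)."""
    scores = defaultdict(float)
    for e in emotions:
        scores[e.category] = max(scores[e.category], e.value)
    return {label: scores[label] for label in labels}


# Toy input: two terms map to 'fear', one to 'sadness'.
toy = [
    Emotion('danger', 0.8, 'fear'),
    Emotion('threat', 0.5, 'fear'),
    Emotion('danger', 0.3, 'sadness'),
]
print(max_score_per_label(toy))
```

Taking the maximum keeps the strongest signal per label; averaging would be an alternative, but it dilutes a single strong term in a long sentence.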

In [3]:
from limbic.emotion.models import LexiconLimbicModel
from limbic.emotion.nrc_utils import load_nrc_lexicon

EMOTIONS_DICTIONARY_FILE = '../data/lexicons/NRC-AffectIntensity-Lexicon.txt'
EMOTIONS_TYPE = 'affect_intensity'
training_metadata['lexicon_limbic_model_params'] = {
    'dictionary_file': EMOTIONS_DICTIONARY_FILE,
    'emotions_type': EMOTIONS_TYPE
}

lexicon = load_nrc_lexicon(EMOTIONS_DICTIONARY_FILE, EMOTIONS_TYPE)
lb = LexiconLimbicModel(lexicon)


In [4]:
from collections import defaultdict
import random

CORPUS_LINES_SAMPLE = 100  # Small number for now to speed up experiments
training_metadata['corpus_lines_sample'] = CORPUS_LINES_SAMPLE

# Get the emotions for all sentences so we can use them as target labels 
# Note that this step takes considerable time (~4 hrs) if len(lines) is used as CORPUS_LINES_SAMPLE, so be patient.
sentence_emotions = {l: lb.get_sentence_emotions(l) 
                     for l in tqdm(random.sample(lines, CORPUS_LINES_SAMPLE), 'getting emotions')}







### Checking how balanced the data is


Running a counter over the different labels in the dataset shows a small imbalance (biased towards "joy"), while the other emotions follow a mostly similar distribution. I'll keep the data like this unless the predictions turn out to be biased towards the "joy" emotion (TODO).


In [5]:
from limbic.limbic_types import Emotion
from tqdm import tqdm_notebook as tqdm
import json

# Here I'm cheating a little bit. I used 100 in the step above as example, but I'm loading data computed with 100000
# Loading pre-computed version of sentence -> emotions as it takes a long time to compute for large datasets. 

sentence_emotions = {}
with open('sentence_emotions.jsons', 'r') as se_file:
    for line in tqdm(se_file.readlines()):
        _d = json.loads(line.strip())
        _emotions = [Emotion(term=x['term'], value=x['value'], category=x['category']) for x in _d['emotions']]
        sentence_emotions[_d['sentence']] = _emotions
len(sentence_emotions)






95425

In [6]:
from collections import Counter

count = Counter()
for k, v in sentence_emotions.items():
    count.update([x.category for x in v])
training_metadata['labels_distribution'] = dict(count)
count

Counter({'fear': 34038, 'sadness': 31586, 'joy': 54212, 'anger': 23406})
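
To put numbers on the imbalance, the raw counts can be turned into relative shares. This is a small sketch that hard-codes the counts printed above so it runs on its own; in the notebook, `count` from the previous cell could be used directly.

```python
from collections import Counter

# Label counts copied from the output above.
count = Counter({'fear': 34038, 'sadness': 31586, 'joy': 54212, 'anger': 23406})

total = sum(count.values())
# Share of each label among all label assignments, most frequent first.
shares = {label: n / total for label, n in count.most_common()}
for label, share in shares.items():
    print(f'{label:>8}: {share:.1%}')

# Ratio between the most and least frequent label.
imbalance_ratio = max(count.values()) / min(count.values())
print(f'most/least frequent ratio: {imbalance_ratio:.2f}')
```

Note that a sentence can carry several labels, so these shares are over label assignments, not over sentences.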

### Shaping the data for TensorFlow


In [7]:
import pandas as pd

from limbic.limbic_constants import AFFECT_INTENSITY_EMOTIONS as EMOTIONS
training_metadata['labels'] = EMOTIONS

# The idea is to create a Pandas DataFrame with all the features from this dictionary.
sentence_unique_emotions_score = defaultdict(list)
for k, v in tqdm(sentence_emotions.items()):
    sentence_unique_emotions_score['text'].append(k)
    categories = defaultdict(list)
    for x in v:
        categories[x.category].append(x.value)
    emotions_added = []
    for c, v_list in categories.items():
        sentence_unique_emotions_score[c].append(max(v_list)) 
        emotions_added.append(c)
    for c in EMOTIONS:
        if c not in emotions_added:
            sentence_unique_emotions_score[c].append(0.0)

data = pd.DataFrame.from_dict(sentence_unique_emotions_score)
data.head()






| | text | sadness | joy | fear | anger |
|---|---|---|---|---|---|
| 0 | copyright law in creating the Project Gutenber... | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | If she would let him, he would give her everyt... | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | “It would cease to be a danger if we could def... | 0.719 | 0.0 | 0.802 | 0.0 |
| 3 | Any words of wisdom now, mama? She thought to ... | 0.0 | 0.312 | 0.0 | 0.0 |
| 4 | How at the year’s end all three knights with t... | 0.0 | 0.0 | 0.0 | 0.0 |


In [32]:
from sklearn.model_selection import train_test_split

TRAIN_TEST_SPLIT = 0.2  # Fraction of the data held out as the test set
RANDOM_STATE = 42
training_metadata['train_test_split'] = TRAIN_TEST_SPLIT
training_metadata['train_test_split_random_state'] = RANDOM_STATE

train, test = train_test_split(data, test_size=TRAIN_TEST_SPLIT, random_state=RANDOM_STATE)


In [28]:
SENTENCE_EMOTIONS_TRAIN_FILE = '../data/sentence_emotions_train.pickle'
SENTENCE_EMOTIONS_TEST_FILE = '../data/sentence_emotions_test.pickle'
training_metadata['train_split'] = SENTENCE_EMOTIONS_TRAIN_FILE
training_metadata['test_split'] = SENTENCE_EMOTIONS_TEST_FILE

train.to_pickle(SENTENCE_EMOTIONS_TRAIN_FILE)  
test.to_pickle(SENTENCE_EMOTIONS_TEST_FILE)  


In [None]:
from datetime import datetime 

current_date = datetime.now().date().isoformat()
metadata_file_path = f'model_metadata_{current_date}.txt'
with open(metadata_file_path, 'w') as meta:
    meta.write(json.dumps(training_metadata, indent=2))
