# Emotion detection using spaCy 3

This notebook shows how to do emotion detection on tweet-sized texts using a transformer architecture with spaCy 3.

You can run this notebook on Google Colab if you want to customize it to your own needs. Remember to choose GPU hardware.

## Installations and imports

In [None]:
# Installing the spaCy library and its transformers extension

!pip install spacy==3.1.1
!pip install spacy-transformers

In [None]:
# Downloading the spaCy Transformer model "en_core_web_trf"
!python -m spacy download en_core_web_trf

In [None]:
# Importing libraries

import pandas as pd
from datetime import datetime
import spacy
import spacy_transformers

# Storing docs in binary format
from spacy.tokens import DocBin

## Read in the data

I got the dataset from this GitHub repository:
https://github.com/RoozbehBandpey/ELTEA17

In [None]:
# Read in dataset

jsonpath = "sentence_level_annotation.json"

df = pd.read_json(jsonpath)

df.head()

Unnamed: 0,emotion,text,sarcasm,sent_num
0,joy,That is one #happy #dog who never ceases to ma...,N,1
1,sad,Because everyone knows Arsenal are desperate t...,S,2
2,dis,You say that I'm paranoid but I'm pretty sure ...,N,3
3,joy,One of London's best days and showing the worl...,N,4
4,sad,More children will die because govt not trying...,N,5


As you can see, there is a column with emotions and a column with the text. We are interested in those two.

There are 6 different emotions. I want to split the data into train and test sets while keeping the ratio across the emotions the same in both.

In [None]:
# Splitting the dataset into train and test
train = df.groupby("emotion").sample(frac = 0.8, random_state = 25)
test = df.drop(train.index)

In [None]:
# Checking the shape

print(train.shape, test.shape)

(1626, 4) (408, 4)
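Because the split samples 80% within each emotion, the class proportions should be nearly identical in both sets. Here is a minimal sketch of how that could be checked. The toy frame and the helper name class_shares are hypothetical stand-ins, not part of the original notebook:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset (10 "joy", 5 "sad")
toy = pd.DataFrame({
    "emotion": ["joy"] * 10 + ["sad"] * 5,
    "text": ["example {}".format(i) for i in range(15)],
})

# Same split logic as in the notebook
toy_train = toy.groupby("emotion").sample(frac=0.8, random_state=25)
toy_test = toy.drop(toy_train.index)

def class_shares(frame):
    """Share of each emotion in a frame, as fractions summing to 1."""
    return frame["emotion"].value_counts(normalize=True).sort_index()

# One column per split, one row per emotion
shares = pd.DataFrame({
    "all": class_shares(toy),
    "train": class_shares(toy_train),
    "test": class_shares(toy_test),
})
print(shares)
```

On the real data, the same helper can be applied to df, train and test before they are converted to tuples.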


In [None]:
# Creating (text, emotion) tuples, the input format for nlp.pipe(..., as_tuples=True)

train = list(zip(train['text'], train['emotion']))

test = list(zip(test['text'], test['emotion']))

train[0]

("@GoDaddy This is your business model? You're part of the problem. #Shame",
 'ang')

In [None]:
df.emotion.value_counts()

joy    459
sad    429
dis    348
sup    305
fea    255
ang    238
Name: emotion, dtype: int64

In [None]:
# Helper function for converting the train and test datasets into spaCy documents

nlp = spacy.load("en_core_web_trf")

def document(data):
    emotions = ["joy", "sad", "dis", "sup", "fea", "ang"]

    # Collecting the annotated docs in the list 'text'
    text = []

    for doc, label in nlp.pipe(data, as_tuples = True):

        # One-hot encoding the label: 1 for the matching emotion, 0 for all others
        for emotion in emotions:
            doc.cats[emotion] = 1 if label == emotion else 0

        # Adding the doc into the list 'text'
        text.append(doc)

    return text
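The category annotation built inside the loop is a one-hot dictionary: the gold emotion gets 1 and every other emotion gets 0. A minimal standalone sketch of that encoding, without spaCy, could look like this. The function name one_hot_cats and the check for unknown labels are assumptions added for illustration:

```python
EMOTIONS = ["joy", "sad", "dis", "sup", "fea", "ang"]

def one_hot_cats(label, labels=EMOTIONS):
    """Return the category dict for one gold label: 1 for the match, 0 elsewhere."""
    # Assumption: failing loudly on an unknown label is preferable to
    # silently producing an all-zero annotation.
    if label not in labels:
        raise ValueError("Unknown label: {!r}".format(label))
    return {name: int(name == label) for name in labels}

print(one_hot_cats("ang"))
```

A dictionary of this shape is what ends up in doc.cats for each training example.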

In [None]:
# Calculate the time for converting into binary document for train dataset

start_time = datetime.now()

#passing the train dataset into function 'document'
train_docs = document(train)

#Creating binary document using DocBin function in spaCy
doc_bin = DocBin(docs = train_docs)

#Saving the binary document as train.spacy
doc_bin.to_disk("train.spacy")
end_time = datetime.now()

#Printing the time duration for train dataset
print('Duration: {}'.format(end_time - start_time))

Duration: 0:03:07.909619
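To see what the DocBin step preserves, here is a small round-trip sketch. It assumes a blank English pipeline (spacy.blank) instead of en_core_web_trf so that it runs without a model download, and it serializes to bytes rather than to disk. The example text and labels are made up:

```python
import spacy
from spacy.tokens import DocBin

# Assumption: a blank pipeline is enough here because only tokenization
# and the category annotations are needed.
nlp_blank = spacy.blank("en")

doc = nlp_blank("What a lovely day")
doc.cats = {"joy": 1, "sad": 0}

# Serialize to bytes (to_disk writes the same format to a file)
payload = DocBin(docs=[doc]).to_bytes()

# Deserialize and rebuild the docs against the pipeline's vocabulary
restored = list(DocBin().from_bytes(payload).get_docs(nlp_blank.vocab))

print(restored[0].text)
print(restored[0].cats)
```

The restored document carries both the text and the category annotations, which is what the textcat component reads during training.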


In [None]:
# Calculate the time for converting into binary document for test dataset

start_time = datetime.now()

#passing the test dataset into function 'document'
test_docs = document(test)
doc_bin = DocBin(docs = test_docs)
doc_bin.to_disk("test.spacy")
end_time = datetime.now()

#Printing the time duration for test dataset
print('Duration: {}'.format(end_time - start_time))

Duration: 0:00:45.883531
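The start_time / end_time pattern is repeated in several cells. One way to avoid the repetition is a small context manager. This is only a sketch, and the name timed is hypothetical:

```python
from contextlib import contextmanager
from datetime import datetime

@contextmanager
def timed(label):
    """Print how long the enclosed block took."""
    start_time = datetime.now()
    try:
        yield
    finally:
        # Runs even if the block raises, so the duration is always reported
        print('{} duration: {}'.format(label, datetime.now() - start_time))

# Usage with a stand-in workload; in the notebook the body would be the
# document(...), DocBin(...) and to_disk(...) calls.
with timed("toy workload"):
    total = sum(range(1000))
```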


Go to https://spacy.io/usage/training#quickstart and download base_config.cfg with the following settings selected:
- Components: textcat
- Hardware: GPU
- Optimize for: accuracy

Place the file in the same directory as this notebook, then change the paths in it to:

train = "train.spacy"

dev = "test.spacy"

In [None]:
#Converting base configuration into full config file

!python -m spacy init fill-config ./base_config.cfg ./config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
#Calculating the time for training the model
start_time = datetime.now()

# To train the model. Enabled GPU and storing the model output in folder called output_updated
!python -m spacy train config.cfg --verbose  --gpu-id 0 --output ./output_updated

end_time = datetime.now()

#Printing the time taken for training the model
print('Duration: {}'.format(end_time - start_time))

✔ Created output directory: output_updated
ℹ Using GPU: 0
[2021-09-30 07:45:38,058] [INFO] Set up nlp object from config
[2021-09-30 07:45:38,071] [DEBUG] Loading corpus from path: test.spacy
[2021-09-30 07:45:38,072] [DEBUG] Loading corpus from path: train.spacy
[2021-09-30 07:45:38,073] [INFO] Pipeline: ['transformer', 'textcat']
[2021-09-30 07:45:38,078] [INFO] Created vocabulary
[2021-09-30 07:45:38,079] [INFO] Finished initializing nlp object
Downloading: 100% 481/481 [00:00<00:00, 562kB/s]
Downloading: 100% 899k/899k [00:00<00:00, 5.24MB/s]
Downloading: 100% 456k/456k [00:00<00:00, 4.06MB/s]
Downloading: 100% 1.36M/1.36M [00:00<00:00, 8.50MB/s]
Downloading: 100% 501M/501M [00:13<00:00, 37.2MB/s]
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.dense.bias']
- This IS expe

In [None]:
# Testing the model

# Loading the best model from output_updated folder
nlp = spacy.load("output_updated/model-best")

In [None]:
text = "Capitalism produces ecological crisis for the same reason it produces inequality: because the fundamental mechanism of capitalist growth is that capital must extract (from nature and labour) more than it gives in return."

demo = nlp(text)

a_dictionary = demo.cats
cat = max(a_dictionary, key=a_dictionary.get)

print(text)
print(cat.upper())

Capitalism produces ecological crisis for the same reason it produces inequality: because the fundamental mechanism of capitalist growth is that capital must extract (from nature and labour) more than it gives in return.
DIS


In [None]:
a_dictionary

{'ang': 0.0012292256578803062,
 'dis': 0.9250048398971558,
 'fea': 0.005434458144009113,
 'joy': 0.0011282231425866485,
 'sad': 0.06589248031377792,
 'sup': 0.0013107025297358632}
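Taking the maximum only returns the single best label. To see the runner-up as well, the scores can be ranked. A minimal sketch using the scores printed above; the helper name top_emotions is hypothetical:

```python
# Scores copied from the output above
scores = {
    'ang': 0.0012292256578803062,
    'dis': 0.9250048398971558,
    'fea': 0.005434458144009113,
    'joy': 0.0011282231425866485,
    'sad': 0.06589248031377792,
    'sup': 0.0013107025297358632,
}

def top_emotions(cats, k=2):
    """Return the k highest-scoring (label, score) pairs, best first."""
    return sorted(cats.items(), key=lambda item: item[1], reverse=True)[:k]

for label, score in top_emotions(scores):
    print('{}: {:.3f}'.format(label.upper(), score))
```

In the notebook, the same function can be called on demo.cats directly.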

## Store the results for faster reuse

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
%cp -r `ls -A | grep -v "gdrive"` /content/gdrive/MyDrive/emotions/
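The shell one-liner copies everything in the working directory except the mounted gdrive folder. A rough Python equivalent, sketched here with shutil on temporary directories so it can be tried outside Colab, could look like this. The function name backup_workdir and the demo file names are assumptions:

```python
import shutil
import tempfile
from pathlib import Path

def backup_workdir(src, dst, skip=("gdrive",)):
    """Copy everything in src to dst, skipping entries whose name is in skip."""
    # Note: ignore_patterns matches names at every directory level, not only the top one
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns(*skip), dirs_exist_ok=True)

# Demonstration on temporary directories, standing in for /content and the Drive folder
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "content"
    (work / "output_updated").mkdir(parents=True)
    (work / "gdrive").mkdir()
    (work / "train.spacy").write_bytes(b"")

    target = Path(tmp) / "backup"
    backup_workdir(work, target)

    copied = sorted(p.name for p in target.iterdir())
    print(copied)
```

In Colab, the call would presumably be backup_workdir("/content", "/content/gdrive/MyDrive/emotions"). The dirs_exist_ok argument requires Python 3.8 or newer.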