# **Phase 1 - Analysing GoEmotions and VAD Mapping**

## GoEmotions Dataset

*   [GoEmotions @ Google's github](https://github.com/google-research/google-research/tree/master/goemotions/data)




In [1]:
# Set working directory as the project's dir
from google.colab import drive
drive.mount('/content/drive/')

# Change here to the path in Google Drive where this project is located
%cd "drive"/"My Drive"/"University"/"Year 3"/"Semester A"/"Natural Language Processing"/"Project"/"emotion-recognition-nlp-project"

Mounted at /content/drive/
/content/drive/My Drive/University/Year 3/Semester A/Natural Language Processing/Project/emotion-recognition-nlp-project


In [2]:
# Packages
!pip3 install datasets transformers -q
!pip3 install torch


In [3]:
# Imports
import pandas as pd

In [4]:
# Load split GoEmotions to pandas DFs
from datasets import load_dataset
go_emotions = load_dataset("go_emotions", "simplified")
data = go_emotions.data
train, dev, test = data["train"].to_pandas(), data["validation"].to_pandas(), data["test"].to_pandas()

Downloading and preparing dataset go_emotions/simplified (download: 4.19 MiB, generated: 5.03 MiB, post-processed: Unknown size, total: 9.22 MiB) to /root/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/2637cfdd4e64d30249c3ed2150fa2b9d279766bfcd6a809b9f085c61a90d776d...


Dataset go_emotions downloaded and prepared to /root/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/2637cfdd4e64d30249c3ed2150fa2b9d279766bfcd6a809b9f085c61a90d776d. Subsequent calls will reuse this data.



In [5]:
# Adding verbal emotions column

# TODO: fetch the following list from data/emotions.txt
labels_list = [
  'admiration',
  'amusement',
  'anger',
  'annoyance',
  'approval',
  'caring',
  'confusion',
  'curiosity',
  'desire',
  'disappointment',
  'disapproval',
  'disgust',
  'embarrassment',
  'excitement',
  'fear',
  'gratitude',
  'grief',
  'joy',
  'love',
  'nervousness',
  'optimism',
  'pride',
  'realization',
  'relief',
  'remorse',
  'sadness',
  'surprise',
  'neutral'
]
idx2emotion = {i : e for i, e in enumerate(labels_list)}
emotion2idx = {e : i for i, e in idx2emotion.items()}  # reverse of idx2emotion

train["emotions"] = train["labels"].apply(lambda labels: [idx2emotion[i] for i in labels])
dev["emotions"] = dev["labels"].apply(lambda labels: [idx2emotion[i] for i in labels])
test["emotions"] = test["labels"].apply(lambda labels: [idx2emotion[i] for i in labels])
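
The TODO above mentions fetching the label list from `data/emotions.txt`. A minimal sketch of that step, assuming the file holds one emotion name per line in label-index order (an assumption about the file layout). `load_labels` is a hypothetical helper, and the demo reads a temporary stand-in file rather than the real one:

```python
from pathlib import Path
import tempfile


def load_labels(path):
    """Read one label per line, skipping blank lines (assumed file layout)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Demo on a temporary stand-in file; the real path would be data/emotions.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "emotions.txt"
    demo_path.write_text("admiration\namusement\nanger\n", encoding="utf-8")
    demo_labels = load_labels(demo_path)

# Build both lookup directions from the loaded list.
demo_idx2emotion = dict(enumerate(demo_labels))
demo_emotion2idx = {emotion: idx for idx, emotion in demo_idx2emotion.items()}
print(demo_emotion2idx)
```

Loading the list from the file would avoid typing the 28 names by hand, which is where spelling mistakes tend to creep in.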


## Delete multi-labels from datasets

We shall temporarily ignore multi-labeled examples, just for simplicity. Later we will choose a way to treat them.

There might be another way of doing it, as @Shir said.

In [6]:
# Add label counts column
train['labels_count'] = train['labels'].apply(lambda labels: len(labels))
dev['labels_count'] = dev['labels'].apply(lambda labels: len(labels))
test['labels_count'] = test['labels'].apply(lambda labels: len(labels))

# Print fraction of multi-labeled examples (candidates to delete)
print("--- What part will be deleted? ---")
print("train : ", sum(train['labels_count'] > 1) / len(train['labels_count']))
print("dev : ", sum(dev['labels_count'] > 1) / len(dev['labels_count']))
print("test : ", sum(test['labels_count'] > 1) / len(test['labels_count']))

--- What part will be deleted? ---
train :  0.16360285648468095
dev :  0.1618134906008109
test :  0.15422885572139303


In [7]:
train_one_label = train[train['labels_count'] == 1]
dev_one_label = dev[dev['labels_count'] == 1]
test_one_label = test[test['labels_count'] == 1]
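
Dropping multi-labeled rows is one option. Another possible treatment, shown here only as a sketch on toy data (the texts and label ids below are made up), is to duplicate each multi-labeled example once per label with `DataFrame.explode`:

```python
import pandas as pd

# Toy stand-in for the GoEmotions frames: `labels` holds a list of label ids.
toy = pd.DataFrame({
    "text": ["thanks a lot!", "wow, I love it", "ok"],
    "labels": [[15], [0, 18], [27]],
})

# One row per (text, label) pair; single-label rows are unchanged.
toy_exploded = toy.explode("labels", ignore_index=True)
toy_exploded["labels"] = toy_exploded["labels"].astype(int)
print(toy_exploded)
```

This keeps every annotation, at the cost of the same text appearing with several different targets.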

## Map to VAD
We'd like to map each example to VAD, using its label (= emotion).
Each lexicon's df will have the columns `word`, `v`, `a`, `d`.

1. **First option** - [NRC-VAD : 2018](http://saifmohammad.com/WebPages/nrc-vad.html)
Crowdsourced lexicon, mapping 20K words to VAD.

In [8]:
NRC_VAD_LEXICON_PATH = "data/mapping-emotions-to-vad/nrc-vad/NRC-VAD-Lexicon.txt"
df_nrc = pd.read_csv(NRC_VAD_LEXICON_PATH, sep="\t", 
                     names=['word', 'v', 'a', 'd'])

# desired mapping from NRC
mapping_nrc = df_nrc[df_nrc["word"].apply(lambda word: word in labels_list)]
mapping_nrc.head()

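|     | word       | v     | a     | d     |
|-----|------------|-------|-------|-------|
| 251 | admiration | 0.969 | 0.583 | 0.726 |
| 624 | amusement  | 0.929 | 0.837 | 0.803 |
| 670 | anger      | 0.167 | 0.865 | 0.657 |
| 709 | annoyance  | 0.167 | 0.718 | 0.342 |
| 865 | approval   | 0.854 | 0.46  | 0.889 |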
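
Not every GoEmotions label is guaranteed to appear in a lexicon, so it can help to list the labels that found no match. A small sketch on toy data: the three-word lexicon is a stand-in that deliberately omits `neutral`, the `admiration` and `anger` values are copied from the output above, and the `apple` row is made up.

```python
import pandas as pd

toy_labels = ["admiration", "anger", "neutral"]
toy_lexicon = pd.DataFrame({
    "word": ["admiration", "anger", "apple"],
    "v": [0.969, 0.167, 0.5],
    "a": [0.583, 0.865, 0.5],
    "d": [0.726, 0.657, 0.5],
})

# Keep only lexicon rows whose word is one of the labels.
toy_mapping = toy_lexicon[toy_lexicon["word"].isin(toy_labels)]

# Labels that have no VAD entry in this lexicon.
missing = sorted(set(toy_labels) - set(toy_mapping["word"]))
print(missing)
```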


2. **Second option** - [ANEW : 1999](https://pdodds.w3.uvm.edu/teaching/courses/2009-08UVM-300/docs/others/everything/bradley1999a.pdf)
Lexicon rated by psychology students, mapping about 1K words to VAD.

In [9]:
ANEW_LEXICON_PATH = "data/mapping-emotions-to-vad/anew/all.csv"
df_anew_raw = pd.read_csv(ANEW_LEXICON_PATH, sep=",")

# mapping to the conventional dataframe format (word, v, a, d)
_mapper = {
    "Description": "word",
    "Valence Mean": "v",
    "Arousal Mean": "a",
    "Dominance Mean": "d",
}
_scaler = lambda x: (x - 1) / 8  # linear scaling:  [1,9] --> [0,1] # @Shir: do you think of something better?

df_anew = df_anew_raw[list(_mapper)].rename(_mapper, axis=1)
df_anew[["v", "a", "d"]] = df_anew[["v", "a", "d"]].apply(_scaler)

# desired mapping from ANEW
mapping_anew = df_anew[df_anew["word"].apply(lambda word: word in labels_list)]
mapping_anew.head()

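|     | word     | v       | a       | d       |
|-----|----------|---------|---------|---------|
| 172 | love     | 0.965   | 0.68    | 0.76375 |
| 308 | optimism | 0.74375 | 0.5425  | 0.70125 |
| 360 | pride    | 0.75    | 0.60375 | 0.7575  |
| 714 | anger    | 0.1675  | 0.82875 | 0.5625  |
| 884 | desire   | 0.83625 | 0.79375 | 0.68625 |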
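
The linear rescaling used for ANEW maps the rating range [1, 9] onto [0, 1]. A short sketch of the same formula written for a general range (`rescale` is a hypothetical helper, not part of the notebook):

```python
def rescale(x, old_min=1.0, old_max=9.0):
    """Linearly map x from [old_min, old_max] to [0, 1]."""
    return (x - old_min) / (old_max - old_min)


# The end points and the midpoint of the ANEW scale.
for rating in (1.0, 5.0, 9.0):
    print(rating, "->", rescale(rating))
```

With the defaults this is exactly `(x - 1) / 8`, so the end points land on 0 and 1 and the scale midpoint 5 lands on 0.5.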


3. **Third option** - TODO Shir?

In [10]:
# TODO: change dataframe indexes of mapping_anew, mapping_nrc to GoEmotions indices (like the indexes in labels_list)
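
One possible way to handle this TODO, sketched on toy data: map each lexicon word to its GoEmotions label id, use that as the index, and then join the VAD columns onto the examples. All names prefixed with `toy_` are illustrative, the VAD values are copied from the NRC output above, and the example texts are made up.

```python
import pandas as pd

toy_labels = ["admiration", "amusement", "anger"]
toy_emotion2idx = {emotion: idx for idx, emotion in enumerate(toy_labels)}

# Lexicon rows keep the lexicon's own index (251, 624, 670 in the NRC output).
toy_mapping = pd.DataFrame(
    {
        "word": ["admiration", "amusement", "anger"],
        "v": [0.969, 0.929, 0.167],
        "a": [0.583, 0.837, 0.865],
        "d": [0.726, 0.803, 0.657],
    },
    index=[251, 624, 670],
)

# Re-index by label id so that row i describes label i.
by_label = toy_mapping.set_index(toy_mapping["word"].map(toy_emotion2idx)).sort_index()
by_label.index.name = "label"

# Attach VAD to single-label examples through their label id.
toy_examples = pd.DataFrame({"text": ["so funny", "I am furious"], "label": [1, 2]})
toy_with_vad = toy_examples.join(by_label[["v", "a", "d"]], on="label")
print(toy_with_vad)
```

In the real frames the `labels` column holds lists, so for the single-label subsets one would presumably first extract the only element (for example with `labels.str[0]`) before joining.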

# Primitive Regression Approaches
Here we shall try some naive models as a baseline, before "dropping the hammer" with fine-tuning.

We shall use BERT's outputs as embeddings.

In [11]:
# get BERT
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bertModel = AutoModel.from_pretrained("bert-base-cased")


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [12]:
# BERT tokenizing
def bert_tokenizer(sent):
  """
     Tokenize the given sentence into PyTorch tensors
     (just like BERT expects in its input layer).
  """
  tokenized_sentence = tokenizer(sent, return_tensors='pt')
  # tokenizer.tokenize(sent)  # subword tokenization (first phase)
  # tokenizer.decode(tokenized_sentence["input_ids"][0]) # decode the full tokenization
  return tokenized_sentence

bert_tokenizer("just an exmaple we can enjoy from")

{'input_ids': tensor([[ 101, 1198, 1126, 4252, 1918, 7136, 1195, 1169, 5548, 1121,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [13]:
# BERT forward pass
def bert_embedding_get_cls(tokenized_sent):
  """
    Given a tokenized sentence, perform BERT's forward pass
    and return the embedding of the [CLS] token.
  """
  bert_encoding = bertModel(**tokenized_sent)

  bert_cls = bert_encoding.last_hidden_state[:, 0]  # raw [CLS] vector from the last layer
  bert_cls_from_nsp = bert_encoding.pooler_output   # [CLS] after the pooler layer used in the NSP task (currently unused)

  return bert_cls
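
The two candidate sentence vectors above differ only by the pooler, which in the Hugging Face `BertModel` is a dense layer followed by `tanh` applied to the first token's hidden state. A sketch that checks this on a tiny, randomly initialised model so that nothing has to be downloaded. The config sizes and token ids are arbitrary demo values, not those of `bert-base-cased`.

```python
import torch
from transformers import BertConfig, BertModel

torch.manual_seed(0)

# Tiny random BERT; sizes are demo assumptions, not bert-base-cased.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_bert = BertModel(config).eval()

input_ids = torch.tensor([[1, 5, 7, 2]])  # made-up token ids
with torch.no_grad():
    output = tiny_bert(input_ids=input_ids)
    raw_cls = output.last_hidden_state[:, 0]  # first-token vector
    pooled = output.pooler_output             # first-token vector after the pooler
    # Recompute the pooler by hand: dense layer, then tanh.
    recomputed = torch.tanh(tiny_bert.pooler.dense(raw_cls))

print(raw_cls.shape, pooled.shape)
print(torch.allclose(pooled, recomputed, atol=1e-6))
```

Both vectors have the hidden size of the model, so either can feed the regressors below without changing their input dimension.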

### 1.1 - Linear regression approach
Although linear regression does **not** really model our task well, we'll try it as a baseline.
(We can also try to optimize a logistic regression with cross-entropy loss by gradient descent ourselves, if the objective is convex.)

In [14]:
from sklearn.linear_model import LinearRegression
def train_linear_regression_naive(X_train, Y_train):
  """
    Trains a linear regression model on the given data.
  """
  regr_model = LinearRegression()
  regr_model.fit(X_train, Y_train)

  return regr_model


from sklearn.metrics import mean_squared_error, r2_score

def evaluate_linear_regression_model(regr_model, X_test, Y_test):
  """
    Prints the MSE and R^2 of the model on the given test data.
  """
  Y_pred = regr_model.predict(X_test)

  print("Mean squared error: %.2f" % mean_squared_error(Y_test, Y_pred))
  print("Coefficient of determination: %.2f" % r2_score(Y_test, Y_pred))

In [15]:
# Regress VAD - TODO!
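
Until the real features are wired in, the regression step can be tried on synthetic data. The sketch below uses random stand-ins for the sentence embeddings and for the VAD targets (both are assumptions for illustration only), and relies on `LinearRegression` accepting a target matrix with one column per VAD dimension:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 "sentence embeddings" of size 16 and 3 VAD targets
# produced by a random linear map (real inputs would be 768-d BERT vectors).
X = rng.normal(size=(200, 16))
true_weights = rng.normal(size=(16, 3))
Y = X @ true_weights

X_train, X_test = X[:150], X[150:]
Y_train, Y_test = Y[:150], Y[150:]

# One model predicts all three targets at once.
regr = LinearRegression().fit(X_train, Y_train)
Y_pred = regr.predict(X_test)

print("MSE: %.4f" % mean_squared_error(Y_test, Y_pred))
print("R^2: %.4f" % r2_score(Y_test, Y_pred))
```

Because the toy targets are an exact linear function of the inputs, the fit here is near perfect; real VAD targets should not be expected to behave this way.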

### 1.2 - NN regression approach
In practice this is close to logistic regression as well, but here we add a non-linearity in the middle.

In [16]:
from torch import nn

class RegresionNN(nn.Module):
    def __init__(self):
        super().__init__()

        # my NN config
        self.in_dim = 768      # bert-encoding size
        self.hidden_dim = 500
        self.out_dim = 3       # VAD dimension

        self.flatten = nn.Flatten()  # no-op for (batch, 768) inputs
        self.nn_stack = nn.Sequential(
            nn.Linear(self.in_dim, self.hidden_dim),
            nn.Sigmoid(),
            nn.Linear(self.hidden_dim, self.out_dim),
            nn.Sigmoid(),      # squashes each VAD output into (0, 1)
        )

        # NOTE: CrossEntropyLoss expects class logits; revisit for continuous VAD targets
        self.loss = nn.CrossEntropyLoss()

    def forward(self, x):
        x = self.flatten(x)
        vad = self.nn_stack(x)  # already passed through Sigmoid, so not logits
        return vad

In [17]:
# TODO Use hugging-face trainer, for this regression ([Matan]: I am currently reading about it)

model = RegresionNN()
[p for p, _ in model.named_parameters()]

['nn_stack.0.weight',
 'nn_stack.0.bias',
 'nn_stack.2.weight',
 'nn_stack.2.bias']
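
Before moving to the Hugging Face trainer, a plain PyTorch loop is enough to check that such a network can be optimised at all. This is only a sketch: the inputs and targets are random stand-ins, the layer stack is rebuilt inline to keep the block self-contained, and `nn.MSELoss` with SGD is an assumption made here for continuous targets, not a choice taken from the notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Random stand-ins for a batch of BERT vectors and their VAD targets in [0, 1].
features = torch.randn(32, 768)
targets = torch.rand(32, 3)

# Same layer layout as the network above, rebuilt here for a standalone demo.
net = nn.Sequential(
    nn.Linear(768, 500),
    nn.Sigmoid(),
    nn.Linear(500, 3),
    nn.Sigmoid(),
)
loss_fn = nn.MSELoss()  # assumption: squared error on the continuous targets
optimizer = torch.optim.SGD(net.parameters(), lr=0.05)

losses = []
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(net(features), targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print("first loss: %.4f, last loss: %.4f" % (losses[0], losses[-1]))
```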

### 1.3 - Probabilistic Models
Let's imagine that VAD is normally distributed, for example (V | X) ~ N(mu, sigma), where V is valence and X is the sentence.

Can we find a good model this way?
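
One way to make the question concrete, as a sketch under added assumptions: let the mean be a parametric function $\mu_\theta(x)$ of the sentence (for instance a regressor on top of the BERT vector) and read $\sigma$ as a standard deviation. The negative log-likelihood of a single example is then:

```latex
-\log p(V = v \mid X = x)
  = \frac{\bigl(v - \mu_\theta(x)\bigr)^2}{2\sigma^2}
  + \log \sigma
  + \frac{1}{2}\log(2\pi)
```

If $\sigma$ is held fixed, minimizing this over $\theta$ is the same as minimizing the squared error, so the model reduces to least-squares regression. Letting $\sigma$ depend on $x$ as well would additionally give a per-sentence uncertainty estimate.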