# Contents
- [1. Read me](#1)
- [2. Pre-preparation](#2)
- [3. Data Downloading](#3)
- [4. Read file](#4)
- [5. Data Preprocessing](#5)
- [6. Load the pre-trained model](#6)
- [7. Adapt the model](#7)
- [8. Training](#8)
- [9. Evaluation](#9)

<a name='1'></a>
# Read me

The task is to build a multi-label classifier that identifies six emotions: anger, disgust, fear, joy, sadness, and surprise.

<a name='2'></a>
# Pre-preparation

In [1]:
!pip install transformers
!pip install tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import pandas as pd
import numpy as np

import tensorflow as tf
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

from transformers import (
    BertTokenizer,
    TFBertPreTrainedModel,
    TFBertMainLayer,
    TFAutoModelForSequenceClassification,
    BertForSequenceClassification,
    AdamWeightDecay,
)

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


<a name='3'></a>
# Data Downloading

In [3]:
!wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv
!wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_2.csv
!wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_3.csv

--2022-06-30 19:49:09--  https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.20.128, 74.125.135.128, 74.125.142.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.20.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14174600 (14M) [application/octet-stream]
Saving to: ‘data/full_dataset/goemotions_1.csv’


2022-06-30 19:49:10 (56.6 MB/s) - ‘data/full_dataset/goemotions_1.csv’ saved [14174600/14174600]

--2022-06-30 19:49:10--  https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_2.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.20.128, 108.177.98.128, 74.125.197.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.20.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14173154 (14M) [application/octet-stream]
Saving to: ‘data/ful

<a name='4'></a>
# Read file

In [3]:
csv1 = pd.read_csv('data/full_dataset/goemotions_1.csv')
csv2 = pd.read_csv('data/full_dataset/goemotions_2.csv')
csv3 = pd.read_csv('data/full_dataset/goemotions_3.csv')

data = pd.concat([csv1, csv2,csv3])
data = data.reset_index()

# The full dataset is large, so use a 10% random sample (fixed seed for reproducibility)
data_use = data.sample(frac=0.1, replace=False, random_state=1)
print(data_use.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21122 entries, 9375 to 8811
Data columns (total 38 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   index                 21122 non-null  int64  
 1   text                  21122 non-null  object 
 2   id                    21122 non-null  object 
 3   author                21122 non-null  object 
 4   subreddit             21122 non-null  object 
 5   link_id               21122 non-null  object 
 6   parent_id             21122 non-null  object 
 7   created_utc           21122 non-null  float64
 8   rater_id              21122 non-null  int64  
 9   example_very_unclear  21122 non-null  bool   
 10  admiration            21122 non-null  int64  
 11  amusement             21122 non-null  int64  
 12  anger                 21122 non-null  int64  
 13  annoyance             21122 non-null  int64  
 14  approval              21122 non-null  int64  
 15  caring           

<a name='5'></a>
# Data Preprocessing

In [4]:
# Data Processing
X = data_use[['text','id','rater_id']]
y = data_use.iloc[:,10:]


# Group the fine-grained labels into six emotions
six_emotion_mapping = {
"anger": ["anger", "annoyance", "disapproval"],
"disgust": ["disgust"],
"fear": ["fear", "nervousness"],
"joy": ["joy", "amusement", "approval", "excitement", "gratitude",  "love", "optimism", "relief", "pride", "admiration", "desire", "caring"],
"sadness": ["sadness", "disappointment", "embarrassment", "grief",  "remorse"],
"surprise": ["surprise", "realization", "confusion", "curiosity"]
}

for emotion, sources in six_emotion_mapping.items():
  y_new = y[sources].sum(axis=1)
  y = y.drop(sources, axis=1)
  y[emotion] = y_new

# Drop the neutral label
y = y.drop(columns=['neutral'])


# Cap the summed counts at 1 so that every label is binary
for c in y.columns:
  y.loc[y[c] > 1, c] = 1

# Keep only the rows that have at least one emotion label
has_label = (y != 0).any(axis=1)
X = X.loc[has_label]
y = y.loc[has_label]

# Split data: 70% train, 15% test; the remaining 15% becomes the validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.15)

X_validation = X.drop(X_train.index).drop(X_test.index)  
y_validation = y.drop(y_train.index).drop(y_test.index)
print(len(X_test))






2289
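
The label grouping and the "at least one label" filter above can be illustrated on a toy example. This is a minimal sketch in plain Python, so it runs without downloading the dataset; `toy_mapping`, `toy_rows`, and `collapse` are hypothetical names, and the mapping is a shortened stand-in for `six_emotion_mapping`.

```python
# Toy sketch: collapse fine-grained labels into coarse groups (plain Python).
# toy_mapping is a shortened, hypothetical stand-in for six_emotion_mapping.
toy_mapping = {
    "anger": ["anger", "annoyance"],
    "fear": ["fear", "nervousness"],
}

# Each row is one annotation: fine-grained label -> 0/1 flag (made-up values).
toy_rows = [
    {"anger": 1, "annoyance": 1, "fear": 0, "nervousness": 0},
    {"anger": 0, "annoyance": 0, "fear": 0, "nervousness": 1},
    {"anger": 0, "annoyance": 0, "fear": 0, "nervousness": 0},
]


def collapse(row, mapping):
    """Return a coarse multi-hot dict: 1 if any fine-grained label in the group is set."""
    return {coarse: int(any(row[fine] for fine in fines))
            for coarse, fines in mapping.items()}


collapsed = [collapse(row, toy_mapping) for row in toy_rows]

# Keep only the rows that carry at least one coarse label.
labelled = [row for row in collapsed if any(row.values())]

print(collapsed)
print(len(labelled))
```

Taking `any` over a group gives the same result as summing the group's columns and capping the sum at 1, which is what the notebook cell does with pandas.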


In [5]:
# Tokenization
def tokenization(df, y=None):
  tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
  input_ids = []
  token_type_ids = []
  attention_mask = []

  for sentence in df.values:
    encoded_dict = tokenizer.encode_plus(
        sentence,                      # Sentence to encode
        add_special_tokens=True,       # Add '[CLS]' and '[SEP]'
        max_length=256,                # Pad & truncate all sentences to 256 tokens
        padding='max_length',
        truncation=True,
        return_attention_mask=True,    # Construct attention masks
        return_tensors='tf',           # Return TensorFlow tensors of shape (1, 256)
    )
    input_ids.append(encoded_dict['input_ids'])
    token_type_ids.append(encoded_dict['token_type_ids'])
    attention_mask.append(encoded_dict['attention_mask'])

  # Labels are reshaped to (N, 1, 6) so each label row matches the (1, 256) shape of one encoded sentence
  if y is not None:
    return tf.data.Dataset.from_tensor_slices(((input_ids, token_type_ids, attention_mask), tf.convert_to_tensor(y.values.reshape((-1, 1, 6)))))
  else:
    return tf.data.Dataset.from_tensor_slices((input_ids, token_type_ids, attention_mask))
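
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a rough plain-Python sketch of how a fixed-length id sequence and its attention mask fit together. `pad_and_mask` is a hypothetical helper, not part of `transformers`, the input token ids are placeholders, and the real tokenizer handles more cases than this.

```python
# Special-token ids in the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_and_mask(token_ids, max_length):
    """Hypothetical helper: add [CLS]/[SEP], truncate, then right-pad to max_length.

    Assumes max_length >= 2 so that both special tokens fit.
    """
    body = token_ids[: max_length - 2]      # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding   # 0 marks padding


# Short input: padded on the right.
ids, mask = pad_and_mask([7592, 2088], max_length=6)
print(ids)
print(mask)

# Long input: truncated so that the special tokens still fit.
long_ids, long_mask = pad_and_mask(list(range(1000, 1010)), max_length=6)
print(long_ids)
print(long_mask)
```

The attention mask lets the model ignore the padded positions, which matters here because most comments are far shorter than 256 tokens.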


In [6]:
# Tokenize the train/validation/test splits
train_data = tokenization(X_train['text'],y_train)
validation_data = tokenization(X_validation['text'],y_validation)
test_data = tokenization(X_test['text'],y_test)


<a name='6'></a>
# Load the pre-trained model

In [7]:
# Load the pretrained model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
print(model.summary())

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
None


<a name='7'></a>
# Adapt the model

In [8]:
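# Adapt to our multi-label task: swap the 2-unit head for 6 sigmoid units.
# The model still returns these sigmoid outputs under the name 'logits'.
model.classifier = tf.keras.layers.Dense(units=6,
                                         name='classifier_new',
                                         activation='sigmoid')
print(model.summary())

# Freeze only the first encoder layer; the rest of BERT stays trainable
model.bert.encoder.layer[0].trainable = False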

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier_new (Dense)      multiple                  0 (unused)
                                                                 
Total params: 109,482,240
Trainable params: 109,482,240
Non-trainable params: 0
_________________________________________________________________
None
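
The sigmoid head scores each emotion independently, so several labels can pass the 0.5 threshold at once, whereas a softmax forces the scores to compete. A small plain-Python sketch with made-up raw scores (the `sigmoid` and `softmax` helpers are written out here for illustration only):

```python
import math


def sigmoid(x):
    """Squash one score into (0, 1), independently of the other scores."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Turn scores into a distribution that sums to 1 (numerically stable form)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.5, -3.0]            # made-up raw scores for three labels

sig = [sigmoid(s) for s in scores]
soft = softmax(scores)

# Threshold at 0.5, as the evaluation cells do by rounding the outputs.
sig_pred = [int(p > 0.5) for p in sig]
soft_pred = [int(p > 0.5) for p in soft]

print(sig_pred)    # two labels can be active together
print(soft_pred)   # at most one label can exceed 0.5
```

This is why a sigmoid activation paired with binary cross-entropy is the usual choice when one comment can express more than one emotion.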


<a name='8'></a>
# Training

In [9]:
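# Compile with the optimizer, loss and metric.
# Note: CategoricalAccuracy compares argmax positions; for multi-label targets
# tf.keras.metrics.BinaryAccuracy may be the more informative metric.
model.compile(AdamWeightDecay(learning_rate=1e-5, weight_decay_rate=5e-6),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.CategoricalAccuracy()])

# Train the model. batch_size is kept from the original run, but the Keras
# documentation advises against passing it when the input is a tf.data.Dataset.
history = model.fit(train_data, validation_data=validation_data, batch_size=16, epochs=3, shuffle=True, verbose=1)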


Epoch 1/3
Epoch 2/3
Epoch 3/3
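
Binary cross-entropy treats each of the six outputs as its own yes/no problem. Below is a minimal plain-Python sketch of the formula with made-up probabilities; the helper name `binary_cross_entropy` and the `eps` clipping value are assumptions for illustration, not the Keras implementation.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


y_true = [1, 0, 0, 1, 0, 0]               # multi-hot target: two emotions present
good = [0.9, 0.1, 0.1, 0.8, 0.2, 0.1]     # made-up confident, mostly right outputs
bad = [0.2, 0.7, 0.6, 0.3, 0.5, 0.9]      # made-up mostly wrong outputs

loss_good = binary_cross_entropy(y_true, good)
loss_bad = binary_cross_entropy(y_true, bad)
print(loss_good, loss_bad)
```

The loss is low when every label's probability sits close to its 0/1 target and grows quickly for confident mistakes.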


<a name='9'></a>
# Evaluation

In [10]:
pred=model.predict(tokenization(X_test['text']))

In [11]:
print(accuracy_score(tf.convert_to_tensor(y_test), tf.round(pred['logits'])))

0.5329838357361293
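
With multi-label targets, `accuracy_score` reports subset accuracy: a sample counts as correct only when all six labels match. The plain-Python sketch below contrasts this with per-label accuracy on made-up predictions; `subset_accuracy` and `per_label_accuracy` are hypothetical helpers written for this illustration.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Fraction of individual label decisions that are correct."""
    pairs = [(t, p)
             for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p)]
    return sum(t == p for t, p in pairs) / len(pairs)


# Made-up targets and predictions for four samples and three labels.
y_true = [[1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

print(subset_accuracy(y_true, y_pred))     # 2 of 4 rows match exactly
print(per_label_accuracy(y_true, y_pred))  # 10 of 12 label decisions match
```

Subset accuracy is the stricter of the two, so the score above should be read as the share of comments with a fully correct label set.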


In [12]:
print(classification_report(tf.convert_to_tensor(y_test.astype('float32')), tf.round(pred['logits']), target_names = list(six_emotion_mapping.keys())))

              precision    recall  f1-score   support

       anger       0.63      0.22      0.33       465
     disgust       0.20      0.03      0.05        78
        fear       0.83      0.17      0.28        60
         joy       0.76      0.84      0.80      1227
     sadness       0.74      0.23      0.35       304
    surprise       0.69      0.43      0.53       438

   micro avg       0.74      0.55      0.63      2572
   macro avg       0.64      0.32      0.39      2572
weighted avg       0.71      0.55      0.58      2572
 samples avg       0.60      0.57      0.58      2572
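
The macro and weighted rows can be recomputed from the per-class rows, which shows why they differ. The sketch below copies the rounded precision, recall, and support values from the report above, so the results agree only to two decimals.

```python
# (precision, recall, support) copied from the classification report above.
report = {
    "anger":    (0.63, 0.22, 465),
    "disgust":  (0.20, 0.03, 78),
    "fear":     (0.83, 0.17, 60),
    "joy":      (0.76, 0.84, 1227),
    "sadness":  (0.74, 0.23, 304),
    "surprise": (0.69, 0.43, 438),
}

total_support = sum(s for _, _, s in report.values())

# Macro average: every class counts equally, regardless of its size.
macro_precision = sum(p for p, _, _ in report.values()) / len(report)
macro_recall = sum(r for _, r, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_precision = sum(p * s for p, _, s in report.values()) / total_support
weighted_recall = sum(r * s for _, r, s in report.values()) / total_support

print(total_support)
print(round(macro_precision, 2), round(macro_recall, 2))
print(round(weighted_precision, 2), round(weighted_recall, 2))
```

Macro recall is much lower than weighted recall because joy, the largest class, is the only one with high recall; the small classes pull the macro figure down.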





In [79]:
# Combine the annotators' labels for each comment id.
# A label is kept if at least one annotator chose it (sum over annotators, capped at 1).
pd_test = pd.concat([X_test, y_test], axis=1)
ids_group = pd_test.groupby("id").count()
ids_group_index = ids_group.loc[ids_group['text'] > 1, :].index
pd_test_unique = pd_test.drop_duplicates(['id'])


vote_majority_label = []
for i in range(len(pd_test_unique)):
  if list(pd_test_unique['id'])[i] not in ids_group_index:
    # Single annotation: use its six emotion columns as they are
    vote_majority_label.append(pd_test_unique.iloc[i, 3:])

  else:
    # Several annotations: sum the labels over annotators, then cap at 1
    pd_vote_majority = pd_test.loc[pd_test['id'] == list(pd_test_unique['id'])[i]]
    pd_vote_majority = pd.DataFrame(pd_vote_majority[list(six_emotion_mapping.keys())].sum(axis=0))
    for c in pd_vote_majority.columns:
      pd_vote_majority.loc[pd_vote_majority[c] > 1, c] = 1
    vote_majority_label.append(pd_vote_majority.T.values.flatten())
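
The loop above sums the annotators' labels for each id and caps the result at 1, which keeps a label as soon as one annotator chose it (a union) rather than requiring a strict majority. The plain-Python sketch below contrasts the two rules on made-up annotations; `union_labels` and `majority_labels` are hypothetical helpers, and the strict-majority rule is shown only as a possible alternative.

```python
from collections import defaultdict

# Made-up annotations: (comment id, multi-hot labels for [anger, joy, sadness]).
annotations = [
    ("c1", [1, 0, 0]),
    ("c1", [1, 0, 1]),
    ("c1", [1, 0, 0]),
    ("c2", [0, 1, 0]),
]

# Group the label rows by comment id.
votes = defaultdict(list)
for comment_id, labels in annotations:
    votes[comment_id].append(labels)


def union_labels(rows):
    """Keep a label if at least one annotator chose it (sum capped at 1)."""
    return [int(sum(col) >= 1) for col in zip(*rows)]


def majority_labels(rows):
    """Keep a label only if more than half of the annotators chose it."""
    return [int(sum(col) * 2 > len(rows)) for col in zip(*rows)]


union = {cid: union_labels(rows) for cid, rows in votes.items()}
majority = {cid: majority_labels(rows) for cid, rows in votes.items()}

print(union)
print(majority)
```

For `c1` the union keeps sadness because one of three annotators chose it, while the strict majority drops it; comments with a single annotation are unaffected by the choice of rule.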



In [82]:
pre_majority = model.predict(tokenization(pd_test_unique['text']))

print(accuracy_score(tf.convert_to_tensor(vote_majority_label), tf.round(pre_majority['logits'])))

print(classification_report(tf.convert_to_tensor(vote_majority_label), tf.round(pre_majority['logits']), target_names=list(six_emotion_mapping.keys())))

0.5310834813499112
              precision    recall  f1-score   support

       anger       0.63      0.22      0.33       461
     disgust       0.22      0.03      0.05        78
        fear       0.83      0.17      0.28        60
         joy       0.76      0.84      0.80      1213
     sadness       0.74      0.23      0.35       302
    surprise       0.69      0.42      0.53       431

   micro avg       0.74      0.54      0.63      2545
   macro avg       0.65      0.32      0.39      2545
weighted avg       0.71      0.54      0.58      2545
 samples avg       0.60      0.57      0.58      2545



