This script is a modification of code found on Hugging Face. The original authors' work is described in more detail at the links to their paper and GitHub repository below.

# Initial GOAL

We wanted to use transfer learning to predict a wider range of emotions instead of just Positive/Negative.
After some research, we found that a pre-trained RoBERTa model looked well suited to this task.
We wanted to run RoBERTa on our initial unlabeled data and extract emotion labels from it.
Once we had these labels, we wanted to build an LSTM from scratch, evaluate its performance, and then apply transfer learning to our original LSTM.


In this script you will find our attempt to use RoBERTa to obtain these labels. In the end, we were forced to change our goal because of the problems described below. To see our final transfer learning approach, please go to FinalProject.ipynb.

# Usage of TweetEval and Twitter-specific RoBERTa models

In this notebook we show how to perform tasks such as masked language modeling, computing tweet similarity, or tweet classification using our Twitter-specific RoBERTa models.

- Paper: [_TweetEval_ benchmark (Findings of EMNLP 2020)](https://arxiv.org/pdf/2010.12421.pdf)
- Authors: Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa-Anke and Leonardo Neves.
- [Github](https://github.com/cardiffnlp/tweeteval)


In [None]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')
data_path = '/content/drive/My Drive/DeepLearning_2021/FINAL PROJECT/Data/'
results_path = '/content/drive/My Drive/DeepLearning_2021/FINAL PROJECT/Results/'
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Preliminaries

We define a function to normalize a tweet to the format we used for TweetEval. Note that preprocessing is minimal (replacing user names by `@user` and links by `http`).

In [None]:
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
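
As a quick check of the normalization, the sketch below repeats the function so that it runs on its own and applies it to a made-up tweet. The sample text is an assumption for illustration and is not taken from our dataset.

```python
def preprocess(text):
    # Same normalization as above, repeated so the example is self-contained
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Hypothetical tweet, for illustration only
sample = "@alice have you read https://example.com/post yet?"
normalized = preprocess(sample)
print(normalized)  # @user have you read http yet?
```

Note that the whole whitespace-separated token is replaced, so punctuation attached to a user name or link (for example `@bob,`) is dropped along with it.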

We only need to install one dependency: the `transformers` library.

In [None]:
!pip install transformers



Now we are going to load the pretrained model.

In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

task='emotion'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [None]:
# download label mapping
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]
labels

['anger', 'joy', 'optimism', 'sadness']

Let's start with a simple example:

In [None]:
# TF
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)

text = "I really hate you."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
scores = output[0][0].numpy()
scores = softmax(scores)

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-emotion.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


In [None]:
type(model)

transformers.models.roberta.modeling_tf_roberta.TFRobertaForSequenceClassification

In [None]:
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

1) anger 0.9571
2) sadness 0.0249
3) optimism 0.0099
4) joy 0.0081


Now we try the model on our dataset.

In [None]:
df = pd.read_csv(results_path+"df.csv")

In [None]:
df.head()

Unnamed: 0,id,text,Polarity,Sentiment
0,1338158543359250433,While the world has been on the wrong side of ...,-0.5,Negative
1,1337855739918835717,"Facts are immutable, Senator, even when you're...",-0.05,Negative
2,1337852648389832708,Does anyone have any useful advice/guidance fo...,0.4,Positive
3,1337851215875608579,it is a bit sad to claim the fame for success ...,-0.1,Negative
4,1337850832256176136,There have not been many bright days in 2020 b...,0.675,Positive


In [None]:
df_transfer = df[["id","text"]].set_index("id")

In [None]:
df_transfer

Unnamed: 0_level_0,text
id,Unnamed: 1_level_1
1338158543359250433,While the world has been on the wrong side of ...
1337855739918835717,"Facts are immutable, Senator, even when you're..."
1337852648389832708,Does anyone have any useful advice/guidance fo...
1337851215875608579,it is a bit sad to claim the fame for success ...
1337850832256176136,There have not been many bright days in 2020 b...
...,...
1396852555909505024,slamShaikh_MLA Dear Sir! I am Clinical Researc...
1396842878958284802,@CP24 Canada stop politicizing vaccine Toronto...
1396835959271043074,India’s Panacea Biotec has started producing S...
1396834454878961664,@globeandmail Canada stop politicizing vaccine...


Now we are going to encode all the texts from our dataset.

In [None]:
encoded_input = df_transfer.text.apply(lambda x: tokenizer(x, return_tensors='tf'))

The main problem we encountered is that the tokenizer returns a `BatchEncoding` object (the class is reported as tf.keras...BatchEncoding). If we pass a single record to the model, there is no problem and it works. The issue comes when we try to pass more than one input at once: calling `model(encoded_input)` on the whole column fails with an error like `ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type BatchEncoding)`. We tried many things, but none of them worked.
We didn't want to get stuck at this point, so we tried the following:

In [None]:
predictions = []
# Run the model on one encoded tweet at a time
for encoded_text in encoded_input.to_numpy():
    p = model(encoded_text)
    predictions.append(p)
# Store the raw model outputs next to the tweets
df_transfer["predictions"] = predictions

The problem with this approach is that RoBERTa is a very large architecture, so running it takes longer than with other architectures. Since we pass it one record at a time, the whole run takes far too long: we cannot parallelize or process the data in batches. It took 28 minutes to process only 4000 records (about 0.42 seconds per record) before Colab crashed.

Because of this, we changed our original plan and tried another way of doing transfer learning, which you can find in the FinalProject script.
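
For reference, the Hugging Face tokenizer also accepts a list of strings and, with `padding=True`, returns a single `BatchEncoding` whose tensors cover the whole batch. This is the usual way to avoid a one-record-at-a-time loop. The sketch below is a minimal illustration and not something we ran in this project: `iter_batches`, `predict_labels`, the batch size of 32 and `max_length=128` are all assumptions chosen for the example.

```python
def iter_batches(texts, batch_size):
    """Yield consecutive slices of `texts` with at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def predict_labels(texts, tokenizer, model, labels, batch_size=32):
    """Return one predicted label per text, running the model batch by batch.

    `tokenizer`, `model` and `labels` are the objects loaded earlier in this
    notebook; `batch_size` and `max_length` are assumed values, not tuned ones.
    """
    predicted = []
    for batch in iter_batches(list(texts), batch_size):
        # A list of strings gives ONE BatchEncoding with padded tensors,
        # instead of one BatchEncoding per tweet as with Series.apply
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=128, return_tensors="tf")
        logits = model(enc)[0].numpy()  # shape: (len(batch), number of labels)
        # The highest logit is also the highest softmax score
        predicted.extend(labels[i] for i in logits.argmax(axis=1))
    return predicted


# The batching helper can be checked without loading the model
example_batches = list(iter_batches(["a", "b", "c", "d", "e"], 2))
print(example_batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

With the objects defined earlier, a call would look like `df_transfer["emotion"] = predict_labels(df_transfer.text.tolist(), tokenizer, model, labels)`. We have not measured how much time this would save on Colab.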

In [None]:
df_transfer.to_csv("df_with_emotions.csv")
df_transfer.head()

ValueError: ignored