# Usage of TweetEval and Twitter-specific RoBERTa models

In this notebook we show how to perform tasks such as masked language modeling, computing tweet similarity, and tweet classification using our Twitter-specific RoBERTa models.

- Paper: [_TweetEval_ benchmark (Findings of EMNLP 2020)](https://arxiv.org/pdf/2010.12421.pdf)


## Preliminaries

We define a function to normalize a tweet to the format we used for TweetEval. Note that preprocessing is minimal: user names are replaced with `@user` and links with `http`.

In [None]:
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
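As a quick sanity check, the sketch below applies the same normalization to a made-up tweet (the handle and link are illustrative, not taken from TweetEval). It repeats `preprocess` so that it runs on its own.

```python
def preprocess(text):
    """Replace user mentions with '@user' and links with 'http'."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Made-up example tweet (hypothetical handle and URL)
sample = "@alice check this out https://example.com/post @ noon"
normalized = preprocess(sample)
print(normalized)  # @user check this out http @ noon
```

Note that a lone `@` is left unchanged because of the `len(t) > 1` check.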

We only need to install one dependency: the `transformers` library.

In [None]:
!pip install transformers

## Computing Tweet Similarity

To make sense of words and sentences, we need to:
0. Preprocess the text (lowercase, remove special characters, etc.)
1. Load or create the dictionary of known words
2. Load, create, or learn embeddings to represent the known words
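
These steps can be illustrated with a toy example in plain Python. The vocabulary and the 2-dimensional vectors are invented for illustration; a real model such as RoBERTa uses subword tokens and learned, context-dependent embeddings.

```python
# Toy illustration only: the vocabulary and vectors below are made up.

def normalize(text):
    # 0. Preprocess: lowercase and split on whitespace
    return text.lower().split()

# 1. Dictionary of known words (word -> id); id 0 is reserved for unknown words
vocab = {"<unk>": 0, "the": 1, "movie": 2, "was": 3, "great": 4}

# 2. One embedding vector per id (hypothetical values)
embeddings = [
    [0.0, 0.0],   # <unk>
    [0.1, 0.3],   # the
    [0.7, 0.2],   # movie
    [0.2, 0.1],   # was
    [0.9, 0.8],   # great
]

def encode(text):
    """Map each word to its id, falling back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in normalize(text)]

def embed(text):
    """Look up the vector of each word in the text."""
    return [embeddings[i] for i in encode(text)]

ids = encode("The movie was AWESOME")
vectors = embed("The movie was AWESOME")
print(ids)      # [1, 2, 3, 0]  ("awesome" is not in the toy vocabulary)
print(vectors)
```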

In [None]:
from transformers import AutoTokenizer, AutoModel, TFAutoModel
import numpy as np
from scipy.spatial.distance import cosine, euclidean
from collections import defaultdict

MODEL = "cardiffnlp/twitter-roberta-base"

# Load the Dictionary of known words 
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Load embeddings
model = AutoModel.from_pretrained(MODEL)

def get_embedding(text):
  # 0. Preprocess text
  text = preprocess(text)
  # 1. Get what words are known/ in the dictionary
  encoded_input = tokenizer(text, return_tensors='pt')
  print(encoded_input)
  # 2. Get embedding of each word
  features = model(**encoded_input)
  print(features[0].shape)
  features = features[0].detach().cpu().numpy() 

  # 3. Find sentence representation
  features_mean = np.mean(features[0], axis=0) 
  return features_mean

query = "The course was awesome 😂"

tweets = ["I just ordered fried chicken 🐣", 
          "The movie was great", 
          "What time is the next game? 😂", 
          "Just finished reading 'Embeddings in NLP'"]

d = defaultdict(int)
for tweet in tweets:
  sim = 1-cosine(get_embedding(query),get_embedding(tweet))
  d[tweet] = sim

print('Most similar to: ',query)
print('----------------------------------------')
for idx,x in enumerate(sorted(d.items(), key=lambda x:x[1], reverse=True)):
  print(idx+1,x[0])
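
`get_embedding` averages the token vectors into one sentence vector, and `1 - cosine(...)` turns SciPy's cosine distance into a cosine similarity. A minimal pure-Python sketch of both operations, using invented 2-dimensional vectors rather than real model output:

```python
import math

def mean_pool(token_vectors):
    """Average token vectors component-wise into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented token vectors for two short "sentences"
sentence_a = mean_pool([[1.0, 0.0], [3.0, 0.0]])   # [2.0, 0.0]
sentence_b = mean_pool([[0.0, 2.0], [0.0, 4.0]])   # [0.0, 3.0]

print(cosine_similarity(sentence_a, sentence_a))   # same direction
print(cosine_similarity(sentence_a, sentence_b))   # orthogonal
```

Vectors pointing in the same direction score 1 and orthogonal vectors score 0, which is why the tweets are sorted by this value in descending order.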

## Masked language modeling

Use Twitter-RoBERTa-base to predict words in context using the `fill-mask` pipeline in `transformers`.

In [None]:
from transformers import pipeline, AutoTokenizer
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-base"
fill_mask = pipeline("fill-mask", model=MODEL, tokenizer=MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def print_candidates():
    for i in range(5):
        token = tokenizer.decode(candidates[i]['token'])
        score = np.round(candidates[i]['score'], 4)
        print(f"{i+1}) {token} {score}")

texts = [
 "I am so <mask> 😊",
 "I am so <mask> 😢" 
]
for text in texts:
    # 0. Preprocess text
    t = preprocess(text)

    print(f"{'-'*30}\n{t}")
    candidates = fill_mask(t)
    print_candidates()

------------------------------
I am so <mask> 😊
1)  happy 0.402
2)  excited 0.1441
3)  proud 0.143
4)  grateful 0.0669
5)  blessed 0.0334
------------------------------
I am so <mask> 😢
1)  sad 0.2641
2)  sorry 0.1605
3)  tired 0.138
4)  sick 0.0278
5)  hungry 0.0232
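
`print_candidates` above reads the global `candidates` list. A variant that takes the list as an argument is easier to reuse. The sketch below assumes each candidate is a dict with `token_str` and `score` keys, as returned by the `fill-mask` pipeline; the function name `format_candidates` is hypothetical, and the sample list reuses the first three predictions shown above instead of calling the model.

```python
def format_candidates(candidates, top_k=5):
    """Return one 'rank) token score' line per candidate, up to top_k."""
    lines = []
    for rank, candidate in enumerate(candidates[:top_k], start=1):
        token = candidate["token_str"].strip()   # drop the leading space of the token
        score = round(candidate["score"], 4)
        lines.append(f"{rank}) {token} {score}")
    return lines

# Hand-written stand-in for pipeline output, copied from the results above
sample_candidates = [
    {"token_str": " happy", "score": 0.402},
    {"token_str": " excited", "score": 0.1441},
    {"token_str": " proud", "score": 0.143},
]

formatted = format_candidates(sample_candidates)
for line in formatted:
    print(line)
```

Slicing with `candidates[:top_k]` also avoids an `IndexError` when the pipeline returns fewer than five candidates.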


## Use TweetEval Classifiers

We currently provide the following fine-tuned models for different tweet classification tasks:

- emoji prediction (`emoji`)
- emotion detection (`emotion`)
- hate speech detection (`hate`)
- irony detection (`irony`)
- offensive language identification (`offensive`)
- sentiment analysis (`sentiment`)
- _(coming soon)_ stance detection (`stance`) with 5 targets (`abortion`, `atheism`, `climate`, `feminist`, `hillary`), for example: `stance-abortion`


In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

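# Select a task from the list above.
# 'emoji' matches the sample output shown at the end of this notebook.
task = 'emoji'

MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

# Load the dictionary (tokenizer) for the selected model
tokenizer = AutoTokenizer.from_pretrained(MODEL)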

In [None]:
# download label mapping
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]
labels
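
The cell above expects one tab-separated `id<TAB>label` pair per line. The sketch below parses an in-memory string in the same format without downloading anything; the three sentiment labels are an assumed sample, so check them against the actual `mapping.txt` of your task. The helper name `parse_mapping` is hypothetical.

```python
import csv

def parse_mapping(raw_text):
    """Return labels in file order from tab-separated 'id<TAB>label' lines."""
    rows = csv.reader(raw_text.split("\n"), delimiter="\t")
    # Skip blank or malformed lines, which have fewer than two columns
    return [row[1] for row in rows if len(row) > 1]

# Assumed sample content in the same format as mapping.txt
sample_mapping = "0\tnegative\n1\tneutral\n2\tpositive\n"
sample_labels = parse_mapping(sample_mapping)
print(sample_labels)  # ['negative', 'neutral', 'positive']
```

Because the labels are stored in file order, the index of each label lines up with the index of the corresponding model output score.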

In [None]:
# LOAD MODEL FOR CLASSIFICATION
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# TODO
# GIVE SAMPLE TEXT FOR CLASSIFICATION
text = # TODO

# PREPROCESS TEXT
text = preprocess(text)

# GET WORDS FROM DICTIONARY (tokenize the text)
encoded_input = tokenizer(text, return_tensors='pt')

# MAKE PREDICTION
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

In [None]:
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

1) 📷 0.1268
2) 😊 0.1255
3) 😁 0.1114
4) ❤ 0.1106
5) 😎 0.1078
6) 📸 0.0828
7) 😜 0.0514
8) 😍 0.0507
9) 😂 0.0453
10) 😘 0.0388
11) 😉 0.0366
12) 💯 0.0175
13) 💕 0.0168
14) 🔥 0.0162
15) 💙 0.014
16) ✨ 0.0123
17) 🇺🇸 0.0121
18) 🎄 0.0109
19) 💜 0.0078
20) ☀ 0.0048
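
The prediction cells turn raw model scores (logits) into probabilities with `softmax` and then list the labels from most to least likely. A pure-Python sketch of those two steps, using invented logits and labels rather than real model output:

```python
import math

def softmax(logits):
    """Exponentiate and normalize so the outputs sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(labels, logits):
    """Pair each label with its probability, highest first."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

# Invented example values, not real model output
example_labels = ["negative", "neutral", "positive"]
example_logits = [-1.0, 0.5, 2.0]

ranked = rank_labels(example_labels, example_logits)
for i, (label, prob) in enumerate(ranked, start=1):
    print(f"{i}) {label} {round(prob, 4)}")
```

Softmax preserves the order of the logits, so the ranking is the same before and after it; its purpose is to make the scores readable as probabilities.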
