# Introduction

## Pick one of the datasets between hate and offensive, and justify your choice. Remember that it is for a commercial application (there is a good and a bad answer).

### We chose the **Offensive** dataset because the hate dataset is released under the Creative Commons CC-BY-NC-4.0 license, which forbids commercial use.

# Evaluating the dataset

## Describe the dataset. Look at the splits, proportion of classes, and see what you can figure out by just looking at the text.

#### The dataset has 14.1k rows, which is not a lot of data. We will probably need more to perform well on the task, and a large model is likely to overfit.

#### The train/test split is about 85/15. Depending on how well we perform, we may move part of the test data into training, at the cost of a smaller test set.

#### Regarding the class distribution, there are clearly more non-offensive tweets than offensive ones (roughly one third of the tweets are offensive), which we will have to keep in mind during training.
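The class balance can be checked directly from the label column. The snippet below is a minimal sketch: `class_proportions` is a helper introduced here for illustration, and the toy labels stand in for `dataset['train']['label']`, which is only available once the dataset is loaded.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: share of examples} for a list of integer labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Toy labels for illustration only; in the notebook this would be
# dataset['train']['label'] once the dataset is loaded.
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0]
proportions = class_proportions(toy_labels)
print(proportions)  # {0: 0.75, 1: 0.25}
```

Running the same helper on each split also shows whether the train and test sets have a similar balance.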

In [1]:
!pip install datasets bertopic transformers tqdm shap nltk statsmodels > /dev/null 2>&1

In [2]:
from datasets import load_dataset
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

In [3]:
dataset = load_dataset("tweet_eval", "offensive")




In [4]:
train_data = dataset['train']
documents = train_data['text']
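# Fix the UMAP seed so the topics are reproducible; the other arguments mirror BERTopic's default UMAP settings
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", umap_model=umap_model)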
topics, probs = topic_model.fit_transform(documents)

In [5]:
topic_model.visualize_barchart()

In [6]:
topics_per_class = topic_model.topics_per_class(documents, train_data['label'])
topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=5)


In the topics-per-class plot, click a label in the legend to show or hide that topic; only the first topic is displayed by default.

## What do you think about the results? How do you think it could impact a model trained on these data?

#### The topics show that the tweets were collected over a short period of time: one of the main topics is about a lawyer who made the news in 2018, which is clearly not a main topic on Twitter in general.

#### We can also see a clear disparity of labels within topics. A model trained on these data could learn topic-specific features instead of offensiveness, and therefore make more errors on other topics.

## **Bonus** By default, BERTopic extracts single keywords. Play with the model to extract bigrams or more. See if you can go deeper in your analysis.

In [7]:
vectorizer_model = CountVectorizer(ngram_range=(2, 2))
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(documents)

In [8]:
topic_model.visualize_barchart()

In [9]:
vectorizer_model = CountVectorizer(ngram_range=(3, 3))
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(documents)
topic_model.visualize_barchart()

#### Switching to bigrams or trigrams does not seem to change the topics much; they are broadly the same.
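To make the effect of `ngram_range` concrete, here is a small pure-Python sketch of the word n-grams a vectorizer would count. `word_ngrams` is a hypothetical helper written for this illustration, not part of scikit-learn or BERTopic, and it uses a plain whitespace split rather than `CountVectorizer`'s own tokenizer.

```python
def word_ngrams(text, n_min, n_max):
    """List the word n-grams of sizes n_min..n_max, in order of size then position."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Equivalent in spirit to ngram_range=(2, 2): bigrams only
print(word_ngrams("gun control laws now", 2, 2))
# ['gun control', 'control laws', 'laws now']

# Equivalent in spirit to ngram_range=(1, 2): unigrams and bigrams together
print(word_ngrams("gun control laws now", 1, 2))
```

With `(2, 2)` or `(3, 3)` the single keywords disappear from the topic descriptions entirely, whereas a range such as `(1, 2)` would let them compete with the bigrams.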

# Evaluate a model

## Evaluate their model on the test split of the dataset you picked, using precision, recall, and F1-score.

In [10]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-offensive")

model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-offensive")

In [11]:
import torch

def tokenize(data):
  # Pad/truncate every tweet to 512 tokens and carry the gold label along as 'labels'
  encoding = tokenizer(data['text'], truncation=True, padding='max_length', max_length=512)
  encoding['labels'] = data['label']
  return encoding

tokenized_test = dataset['test'].map(tokenize, batched=True)
tokenized_test.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])




In [12]:
from torch.utils.data import DataLoader
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
model.to(device)
data_loader = DataLoader(tokenized_test, batch_size=16)


cuda


In [13]:
from tqdm import tqdm
def predict(data_loader):
  predictions = []
  labels = []
  texts = []
  probs = []
  model.eval()

  with torch.no_grad():
      for batch in tqdm(data_loader):
          inputs = {key: val.to(device) for key, val in batch.items()}
          outputs = model(**inputs)

          batch_text = [tokenizer.decode(input_id) for input_id in inputs['input_ids']]

          probs.extend(outputs.logits.tolist())  # raw logits, not probabilities
          predictions.extend(torch.argmax(outputs.logits, dim=-1).tolist())
          labels.extend(inputs['labels'].tolist())
          texts.extend(batch_text)
  return predictions, labels, texts, probs

predictions, labels, texts, probs = predict(data_loader)

100%|██████████| 54/54 [00:27<00:00,  1.97it/s]


In [14]:
from sklearn.metrics import f1_score, precision_score, recall_score

f1 = f1_score(labels, predictions, average='weighted')
precision = precision_score(labels, predictions, average='weighted')
recall = recall_score(labels, predictions, average='weighted')

print(f'F1 score: {f1}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')

F1 score: 0.8552261168481793
Precision: 0.8555572351743067
Recall: 0.8593023255813953
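The scores above use `average='weighted'`, which averages the per-class metrics weighted by class support. The sketch below, on toy labels only, shows how that differs from the scores of the positive (offensive) class alone; `per_class_metrics` and `weighted_metrics` are helper names introduced here for illustration.

```python
def per_class_metrics(labels, predictions, positive):
    """Precision, recall and F1 for one class, treated as the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def weighted_metrics(labels, predictions):
    """Average the per-class metrics, weighting each class by its support."""
    totals = [0.0, 0.0, 0.0]
    for cls in set(labels):
        weight = labels.count(cls) / len(labels)
        for i, value in enumerate(per_class_metrics(labels, predictions, cls)):
            totals[i] += weight * value
    return tuple(totals)

# Toy data for illustration: 6 non-offensive (0) and 4 offensive (1) tweets
toy_labels      = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_predictions = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

print(per_class_metrics(toy_labels, toy_predictions, positive=1))  # offensive class only
print(weighted_metrics(toy_labels, toy_predictions))               # weighted over both classes
```

On this toy example the recall of the offensive class is 0.5 while the weighted recall is 0.7, which is simply the accuracy. A weighted score can therefore look good even when many offensive tweets are missed.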


#### On the test split, the model reaches a weighted-average F1 score of 0.855, with a precision of 0.856 and a recall of 0.859.

## Look for prediction failures. Extract the top 5 misclassified tweets (highest score in wrong class) for each class and discuss what could be wrong with the model.

In [15]:
import pandas as pd
df = pd.DataFrame({'labels': labels, 'predictions': predictions, 'texts': texts, 'probs': probs})
df.head()

Unnamed: 0,labels,predictions,texts,probs
0,1,1,<s>#ibelieveblaseyford is liar she is fat ugly...,"[-0.8861263394355774, 0.8977356553077698]"
1,0,1,<s>@user @user @user I got in a pretty deep de...,"[-0.3487536609172821, 0.2542918026447296]"
2,0,0,<s>...if you want more shootings and more deat...,"[0.18577970564365387, -0.20211200416088104]"
3,0,0,<s>Angels now have 6 runs. Five of them have c...,"[0.6158384680747986, -0.542585015296936]"
4,0,0,<s>#Travel #Movies and Unix #Fortune combined ...,"[1.2384463548660278, -1.1371604204177856]"


In [16]:
import numpy as np
df['confident'] = df['probs'].apply(np.max)


In [17]:
df.head()

Unnamed: 0,labels,predictions,texts,probs,confident
0,1,1,<s>#ibelieveblaseyford is liar she is fat ugly...,"[-0.8861263394355774, 0.8977356553077698]",0.897736
1,0,1,<s>@user @user @user I got in a pretty deep de...,"[-0.3487536609172821, 0.2542918026447296]",0.254292
2,0,0,<s>...if you want more shootings and more deat...,"[0.18577970564365387, -0.20211200416088104]",0.18578
3,0,0,<s>Angels now have 6 runs. Five of them have c...,"[0.6158384680747986, -0.542585015296936]",0.615838
4,0,0,<s>#Travel #Movies and Unix #Fortune combined ...,"[1.2384463548660278, -1.1371604204177856]",1.238446


In [18]:
# Keep only the misclassified tweets, most confident errors first
wrong = df.loc[df.labels != df.predictions].sort_values('confident', ascending=False)
wrong.head()

Unnamed: 0,labels,predictions,texts,probs,confident
96,1,0,<s>#Liberals / #Democrats THIS is what you sta...,"[1.3073264360427856, -1.3396073579788208]",1.307326
177,1,0,<s>#Liberals Are Reaching Peak Desperation To ...,"[1.1855162382125854, -1.2535228729248047]",1.185516
228,1,0,<s>#BREXIT deal HAS been reached - and will be...,"[1.1514830589294434, -1.140032410621643]",1.151483
418,1,0,<s>#NoPasaran: Unity demo to oppose the far-ri...,"[1.090884804725647, -1.237762689590454]",1.090885
455,0,1,<s>@user I guess that’s where swamp ass origin...,"[-1.0679528713226318, 1.0643125772476196]",1.064313


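The four most confident errors are political tweets (they mention Democrats, liberals or Brexit) that are labeled offensive in the dataset but that the model predicts as non-offensive. Either the annotators tended to consider political tweets offensive, or the model misses offensiveness that is not carried by explicit insults.
The fifth error goes the other way (a non-offensive tweet predicted as offensive), and we don't really know why it went wrong.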
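The question asks for the top 5 errors for each class, whereas the table above mixes both classes. A minimal pure-Python sketch of a per-class extraction follows; `top_errors_per_class` and the toy rows are assumptions made for illustration, and confidence is taken as the softmax probability of the predicted class.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_errors_per_class(rows, k=5):
    """Group misclassified rows by gold label and keep the k most confident errors."""
    errors = {}
    for row in rows:
        if row["label"] != row["prediction"]:
            confidence = max(softmax(row["logits"]))
            errors.setdefault(row["label"], []).append((confidence, row["text"]))
    return {label: sorted(items, reverse=True)[:k] for label, items in errors.items()}

# Toy rows for illustration only (not taken from the dataset)
rows = [
    {"text": "tweet A", "label": 1, "prediction": 0, "logits": [1.3, -1.3]},
    {"text": "tweet B", "label": 1, "prediction": 0, "logits": [0.2, -0.1]},
    {"text": "tweet C", "label": 0, "prediction": 1, "logits": [-1.0, 1.0]},
    {"text": "tweet D", "label": 0, "prediction": 0, "logits": [2.0, -2.0]},
]
result = top_errors_per_class(rows, k=1)
print(result)
```

With the notebook's DataFrame, the same idea would amount to filtering `wrong` on each value of `labels` before taking the head.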


In [19]:
dataset['train'][8]

{'text': '@user Been a Willie fan since before most of you were born....LOVE that he is holding a rally with Beto.... Exactly WHICH fans are furious?  Could you give some specifics?',
 'label': 0}

In [20]:
import json

tweets = pd.read_json('/content/tweets.json')

In [21]:
tweets.head()

Unnamed: 0,id,id_str,text,lang,created_at
0,1410492618790817793,1410492618790817792,YOU BETTER SUCK HIS DICK KOZY I SEE YOU WITH K...,en,2021-07-01 06:57:00+00:00
1,1410492618769780742,1410492618769780736,I still canr believe it.😭😭😭😭😭,en,2021-07-01 06:57:00+00:00
2,1410492618790686720,1410492618790686720,You should raise the webform....how would they...,en,2021-07-01 06:57:00+00:00
3,1410492618803335174,1410492618803335168,im tired too but this is so entertaining i cant,en,2021-07-01 06:57:00+00:00
4,1410492618778157059,1410492618778157056,Fuckof,en,2021-07-01 06:57:00+00:00


In [22]:
from torch.utils.data import Dataset
class TextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __getitem__(self, idx):
        text = self.texts[idx]
        inputs = self.tokenizer(text, return_tensors='pt', max_length=self.max_length, padding='max_length', truncation=True)
        return {key: tensor.squeeze(0) for key, tensor in inputs.items()}

    def __len__(self):
        return len(self.texts)

tweet_dataset = TextDataset(tweets['text'], tokenizer, max_length=512)
data_loader = DataLoader(tweet_dataset, batch_size=128)

## Extract the top 10 tweets your model is most confident about in the target class (offensive or hateful), the top 10 in the neutral class, and the top 10 your model is most uncertain about. Do you believe the model is doing a great job?

In [23]:
predictions = []
logits = []
texts = []
model.eval()
with torch.no_grad():
  for batch in tqdm(data_loader):
    inputs = {name: tensor.to(device) for name, tensor in batch.items()}
    outputs = model(**inputs)
    # Decode the token ids back to text (special tokens such as <s> and <pad> are kept)
    batch_text = [tokenizer.decode(input_id) for input_id in inputs['input_ids']]
    logits.extend(outputs.logits.tolist())
    predictions.extend(torch.argmax(outputs.logits, dim=-1).tolist())
    texts.extend(batch_text)

100%|██████████| 79/79 [05:10<00:00,  3.92s/it]


In [24]:
from scipy.special import softmax

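preds = pd.DataFrame({'logits': logits, 'predictions': predictions, 'text': texts})
# Convert the raw logits to class probabilities; the column keeps the name 'logits'
preds['logits'] = preds['logits'].apply(lambda x: softmax(x))

# Confidence = probability of the predicted class
preds['confident'] = preds['logits'].apply(np.max)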
preds.head()

Unnamed: 0,logits,predictions,text,confident
0,"[0.12622787959255163, 0.8737721204074483]",1,<s>YOU BETTER SUCK HIS DICK KOZY I SEE YOU WIT...,0.873772
1,"[0.8371830098225985, 0.16281699017740148]",0,<s>I still canr believe it.😭😭😭😭😭</s><pad><pad>...,0.837183
2,"[0.892156637086433, 0.10784336291356696]",0,<s>You should raise the webform....how would t...,0.892157
3,"[0.6542708146343675, 0.3457291853656325]",0,<s>im tired too but this is so entertaining i ...,0.654271
4,"[0.4129459193570216, 0.5870540806429785]",1,<s>Fuckof</s><pad><pad><pad><pad><pad><pad><pa...,0.587054


In [25]:
offensive = preds.loc[preds.predictions == 1].sort_values('confident', ascending=False)
neutral = preds.loc[preds.predictions == 0].sort_values('confident', ascending=False)

In [26]:
offensive.head(10)

Unnamed: 0,logits,predictions,text,confident
2198,"[0.05152632407768937, 0.9484736759223106]",1,<s>don’t you suck his dick or something? ur fu...,0.948474
7686,"[0.053419717404695366, 0.9465802825953046]",1,<s>i genuinely feel sick to my stomach and i c...,0.94658
4042,"[0.0562603674109664, 0.9437396325890336]",1,<s>You're a little bitch</s><pad><pad><pad><pa...,0.94374
9867,"[0.05765910999295674, 0.9423408900070432]",1,<s>Bitch you raggedy af phony ass hoe</s><pad>...,0.942341
9141,"[0.060527939532969345, 0.9394720604670306]",1,<s>You're a fucking racist moron</s><pad><pad>...,0.939472
3849,"[0.06331532189734886, 0.9366846781026512]",1,<s>Shut the fuck up you damn monkey</s><pad><p...,0.936685
8073,"[0.06748701715796969, 0.9325129828420303]",1,<s>its wild how fucking stupid people are</s><...,0.932513
1929,"[0.06866775949791086, 0.9313322405020892]",1,<s>Shut the hell up nobody give a shit about y...,0.931332
2206,"[0.06945895465545943, 0.9305410453445406]",1,<s>You're a fucking dumbass that think's he's ...,0.930541
7800,"[0.07109726076346536, 0.9289027392365345]",1,<s>This dude is a total crybaby man all he doe...,0.928903


In [27]:
neutral.head(10)

Unnamed: 0,logits,predictions,text,confident
7811,"[0.9816755049669286, 0.018324495033071365]",0,<s>Thank you beautiful you are too!!🥰</s><pad>...,0.981676
8599,"[0.981545555476798, 0.018454444523201923]",0,<s>Thank you for this! 💜</s><pad><pad><pad><pa...,0.981546
1109,"[0.9814001650400814, 0.018599834959918588]",0,<s>Thank you for your supporttt! 😍❤️✨</s><pad>...,0.9814
1416,"[0.9809073694650784, 0.01909263053492146]",0,"<s>Oh, that would be great. I will be waiting,...",0.980907
8684,"[0.9806414513892149, 0.019358548610785054]",0,"<s>Aww thank you, Crystal - that means the wor...",0.980641
4249,"[0.9802667112942776, 0.019733288705722458]",0,<s>No problem! Thanks for waiting too 🙆🏻‍♀️❤️<...,0.980267
6506,"[0.9802438230663244, 0.019756176933675443]",0,<s>Aweeee happy 1 year!!! Hope to see you stre...,0.980244
4719,"[0.979927707378828, 0.02007229262117192]",0,<s>Awe I am so so glad you are getting this. ❤...,0.979928
7242,"[0.9798141271077287, 0.02018587289227117]",0,<s>Thank you!! 😭💕</s><pad><pad><pad><pad><pad>...,0.979814
1345,"[0.9793971756491799, 0.020602824350820007]",0,<s>Thanks for your kind words ❤️ our team are ...,0.979397


In [28]:
preds.sort_values('confident', inplace=True)
preds.head(10)

Unnamed: 0,logits,predictions,text,confident
5174,"[0.49953910209254304, 0.5004608979074571]",1,<s>I'm lucky not literate 🙃</s><pad><pad><pad>...,0.500461
3490,"[0.49948020635737367, 0.5005197936426263]",1,<s>THE PAIN IN MY ARM WORTHS SO BAD</s><pad><p...,0.50052
7104,"[0.5005563212358398, 0.4994436787641601]",0,<s>its crazy to me that people think fictional...,0.500556
4359,"[0.5006753565361726, 0.4993246434638275]",0,<s>ok I’m gonna cry now</s><pad><pad><pad><pad...,0.500675
2689,"[0.5008012943560514, 0.4991987056439487]",0,<s>someone pinch me i feel like i’m dreaming</...,0.500801
6390,"[0.5008642526351004, 0.49913574736489963]",0,<s>MY PHONES DYING</s><pad><pad><pad><pad><pad...,0.500864
2897,"[0.5010844824593254, 0.49891551754067465]",0,<s>Why didn’t anyone tell me one piece is also...,0.501084
6689,"[0.4989024916595031, 0.5010975083404969]",1,<s>It’s just become a case of ‘this doctor sai...,0.501098
2662,"[0.5011820813038079, 0.4988179186961921]",0,<s>this is a big brain moment</s><pad><pad><pa...,0.501182
1907,"[0.5011821693132993, 0.4988178306867007]",0,<s>This is just furry and 100 gecs</s><pad><pa...,0.501182


#### Judging from the tweets printed above, the model seems to do a pretty fine job: we don't see any major mistake in the predicted labels.

## Use SHAP on the provided tweets, or manually written texts, to see if you can find topics on which the model is biased.

In [29]:
import shap

def f(x):
    # Tokenize the raw strings passed in by SHAP and return the model's logits
    tv = torch.tensor([tokenizer.encode(v, padding='max_length', max_length=512, truncation=True) for v in x]).to(device)
    with torch.no_grad():
        outputs = model(tv)[0].cpu().numpy()
    return outputs

explainer = shap.Explainer(f, tokenizer)
shap_values = explainer(["A woman who you fucked multiple times saying yo dick small is a compliment you know u hit that spot", "Fuck I am an idiot, I spilled milk on my black t-shirt."])
shap.plots.text(shap_values)




The model seems to focus on individual bad words and does not take the surrounding words into account, so it does not understand irony. For the same reason, it risks flagging tweets that merely mention a minority as offensive.

The last tweet was written on purpose to make the model fail.
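To give an intuition of what a word-level attribution measures, here is a deliberately simplified sketch. It is not SHAP: it uses plain occlusion (remove one word, see how much the score drops), and `toy_score` is a made-up stand-in for the model's offensive probability.

```python
def occlusion_attributions(text, score):
    """Score drop when each word is removed: a crude stand-in for SHAP values."""
    words = text.split()
    base = score(" ".join(words))
    attributions = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        attributions.append((word, base - score(reduced)))
    return attributions

# Hypothetical scorer for illustration: fraction of words found in a tiny
# made-up lexicon. In the notebook, this would be the model's offensive probability.
TOY_LEXICON = {"idiot", "stupid"}

def toy_score(text):
    words = text.lower().split()
    return sum(w in TOY_LEXICON for w in words) / len(words) if words else 0.0

attributions = occlusion_attributions("you are an idiot", toy_score)
for word, contribution in attributions:
    print(f"{word:>6}: {contribution:+.3f}")
```

A scorer that reacts to single words, like this toy one, gives the same attribution to an insult whether it targets someone else or the author, which is the behavior observed on the second example tweet.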

## What are the advantages of using a pre-trained transformer vs naive Bayes?

#### A pre-trained transformer has the advantage of needing far less task-specific data before going to production, as it has already been trained on a large corpus.

#### Furthermore, it should give better results on new data, as it is expected to generalize better thanks to its broader knowledge of the language.

## Train a naive Bayes model on the data, and compare its results with this model.

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


In [31]:
def naiveBayes(train_texts, train_labels, test_texts, test_labels):
  # Extract features
  vectorizer = CountVectorizer()
  train_features = vectorizer.fit_transform(train_texts)
  test_features = vectorizer.transform(test_texts)

  # Train the Naive Bayes classifier
  nb_classifier = MultinomialNB()
  nb_classifier.fit(train_features, train_labels)

  # Make predictions on the test set
  pred_labels = nb_classifier.predict(test_features)

  # Evaluate the model
  f1 = f1_score(test_labels, pred_labels)
  precision = precision_score(test_labels, pred_labels)
  recall = recall_score(test_labels, pred_labels)

  print(f'F1 score: {f1}')
  print(f'Precision: {precision}')
  print(f'Recall: {recall}')
  return nb_classifier

In [32]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [33]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):

    # Lowercase the text
    text = text.lower()

    # Remove punctuation and special characters
    text = re.sub(r'\W', ' ', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords and perform lemmatization
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]

    return " ".join(tokens)

X_train = [preprocess_text(example['text']) for example in dataset['train']]
y_train = [example['label'] for example in dataset['train']]

X_test = [preprocess_text(example['text']) for example in dataset['test']]
y_test = [example['label'] for example in dataset['test']]

better_model = naiveBayes(X_train, y_train, X_test, y_test)

F1 score: 0.6012793176972282
Precision: 0.6157205240174672
Recall: 0.5875


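### Naive Bayes performs clearly worse than the transformer and is not a suitable solution for this problem: it finds only about 59% of the offensive tweets. Note that these scores are computed on the offensive class only (scikit-learn's default `average='binary'`), whereas the transformer scores were weighted averages over both classes, so the two sets of numbers are not directly comparable.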

# Annotate data

In [34]:
tokens_to_remove = ["<s>","</s>", "<pad>"]


In [35]:
preds.drop(['logits', 'confident'], inplace=True, axis=1)


In [36]:
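# Strip the special tokens literally (regex=False avoids interpreting them as patterns)
for token in tokens_to_remove:
    preds.text = preds.text.str.replace(token, "", regex=False)
preds.head(100).predictions.value_counts()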


0    53
1    47
Name: predictions, dtype: int64

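Our model labels almost 50% of these first 100 tweets as offensive. This is expected: `preds` is sorted by increasing confidence, so these are the tweets the model is least sure about. We will now annotate them manually.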

In [37]:
to_annotate = preds.head(100).drop('predictions', axis=1)
to_annotate.to_excel('/content/to_annotate.xlsx', index=False)

## Evaluate your inter-annotator agreement using Fleiss Kappa.

In [38]:
import numpy as np
import pandas as pd

df = pd.read_excel('/content/annotated.xlsx')
# Identify all unique categories (column 0 is the text, the other columns are the annotators)
categories = np.unique(df.iloc[:, 1:].values)

# Build the (tweets x categories) count table expected by fleiss_kappa:
# cell [i, j] is the number of annotators who gave tweet i the category j
# (despite its name, this is not a confusion matrix)
confusion_matrix = np.zeros((len(df), len(categories)))

for i, row in df.iterrows():
    for annotation in row.iloc[1:]:
        # Find the index of the category in the categories array
        category_index = np.where(categories == annotation)[0][0]
        confusion_matrix[i, category_index] += 1


In [39]:
from statsmodels.stats.inter_rater import fleiss_kappa

kappa_score = fleiss_kappa(confusion_matrix)
print(kappa_score)

0.377557489481874


The kappa score is low, which suggests that the annotation guidelines are not precise enough. We also found the tweets hard to evaluate without context, which could explain the lack of consensus.

After some research online, a score below 0.40 is considered poor, between 0.40 and 0.75 fair to good, and above 0.75 excellent.
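For reference, Fleiss' kappa compares the mean observed agreement per item with the agreement expected by chance. The sketch below recomputes it in pure Python from a (subjects x categories) count table; `fleiss_kappa_from_counts` is a name chosen here for illustration, and the two toy tables are not our annotations.

```python
def fleiss_kappa_from_counts(table):
    """Fleiss' kappa for a (subjects x categories) table of rating counts.

    Every row must sum to the same number of raters.
    """
    n_subjects = len(table)
    n_raters = sum(table[0])
    n_categories = len(table[0])

    # Share of all ratings that fall in each category
    p_j = [sum(row[j] for row in table) / (n_subjects * n_raters)
           for j in range(n_categories)]
    # Agreement observed on each subject
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in table]

    p_bar = sum(p_i) / n_subjects   # mean observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Toy tables with 3 raters and 2 categories
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]   # all raters always agree
split = [[2, 1], [1, 2]]                     # one dissenting rater on every item

print(fleiss_kappa_from_counts(perfect))  # 1.0
print(fleiss_kappa_from_counts(split))    # negative: worse than chance
```

The second table shows that a 2-against-1 vote on every item already yields a negative kappa with three raters, so a few split votes are enough to pull the score down.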

## **Bonus** Evaluate the model on your data. Use a majority vote for labels (remove majority "can't tell") and compute the precision, recall, and F1-score.

In [40]:
from collections import Counter
def majority(row):
    counter = Counter(row)
    max_count = max(list(counter.values()))
    mode_val = [num for num, freq in counter.items() if freq == max_count]
    if len(mode_val) == 1:
        return mode_val[0]
    else:
        return None

df['label'] = df[['Mathieu', 'Noé', 'Armand']].apply(majority, axis=1)
df['label'] = df['label'].astype('Int64')
df.label.value_counts()

0    78
1    19
2     1
Name: label, dtype: Int64

One tweet has an "unsure" majority label, so we'll drop it.

In [41]:
# Drop the tweets with no majority (ties give a missing label)
df = df.dropna()
# Drop the single tweet whose majority label is 2 (the "unsure" class in the counts above)
df = df[df['label'] != 2]

In [42]:
from datasets import Dataset

tweet_dataset = Dataset.from_pandas(df)
tokenized_tweets = tweet_dataset.map(tokenize, batched=True)
tokenized_tweets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
data_loader = DataLoader(tokenized_tweets, batch_size=16)
tweet_dataset


Dataset({
    features: ['text', 'Mathieu', 'Noé', 'Armand', 'label', '__index_level_0__'],
    num_rows: 97
})

In [43]:
predictions, labels, texts, probs = predict(data_loader)

f1 = f1_score(labels, predictions, average='weighted')
precision = precision_score(labels, predictions, average='weighted')
recall = recall_score(labels, predictions, average='weighted')

print(f'F1 score: {f1}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')

100%|██████████| 7/7 [00:03<00:00,  2.31it/s]

F1 score: 0.6116623711340207
Precision: 0.7282932416953035
Recall: 0.5670103092783505





On our annotated data, the model reaches a weighted F1 score of 0.61, with a precision of 0.73 and a recall of 0.57.

The low recall shows that the model has trouble on these tweets: since weighted recall equals accuracy, it gets only a little over half of them right. In its defense, these are the tweets it was least confident about, and even our group had trouble labeling them.

This model could be a good first filter for tweets, before a human checks them manually.
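One way such a first filter could look is sketched below: tweets are routed according to the model's probability of being offensive, and only the uncertain ones go to a human. The `triage` function and both thresholds are hypothetical and would have to be tuned on annotated data.

```python
def triage(p_offensive, low=0.2, high=0.8):
    """Route a tweet based on the model's probability that it is offensive.

    The thresholds are illustrative assumptions, not tuned values.
    """
    if p_offensive >= high:
        return "flag"
    if p_offensive <= low:
        return "accept"
    return "human_review"

# Toy probabilities for illustration
for p in (0.95, 0.50, 0.02):
    print(p, triage(p))
```

Widening the gap between the two thresholds sends more tweets to human review and leaves fewer decisions to the model alone.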