# Introduction

## Pick one of the datasets between hate and offensive, and justify your choice. Remember that it is for a commercial application (there is a good and a bad answer).

### We have chosen the **Offensive** dataset because the hate dataset is released under the Creative Commons CC-BY-NC-4.0 license, which does not allow commercial use.

# Evaluating the dataset

## Describe the dataset. Look at the splits, proportion of classes, and see what you can figure out by just looking at the text.

#### The dataset has 14.1k rows, which is not a lot of data. We will probably need more to perform well on the task, and a large model is likely to overfit.

#### The train/test split is roughly 85/15. Depending on how well we perform, we may move part of the test data into the training set, at the cost of a smaller test set.

#### As for the class distribution, there are far more non-offensive tweets than offensive ones (about 10% of tweets are offensive), which we will have to keep in mind during training.
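The split and class proportions quoted above can be checked with a few lines of code. The sketch below is only an illustration: the helper names `class_proportions` and `split_shares` are hypothetical, the values are toy data rather than the real dataset, and it assumes that label `1` means offensive.

```python
from collections import Counter


def class_proportions(labels):
    """Return {label: share of examples} for a list of integer labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


def split_shares(split_sizes):
    """Return {split name: share of all rows} from {split name: row count}."""
    total = sum(split_sizes.values())
    return {name: size / total for name, size in split_sizes.items()}


# Toy labels (hypothetical); assumed encoding: 1 = offensive, 0 = not offensive.
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
toy_proportions = class_proportions(toy_labels)
print(toy_proportions)

# Toy split sizes (hypothetical), not the real dataset sizes.
toy_shares = split_shares({"train": 850, "test": 150})
print(toy_shares)
```

On the real data, the same helpers could be called as `class_proportions(dataset["train"]["label"])` and `split_shares({name: len(split) for name, split in dataset.items()})` once the dataset is loaded.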

In [3]:
%%capture
!pip install datasets bertopic

In [25]:
from datasets import load_dataset
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP



In [20]:
dataset = load_dataset("tweet_eval", "offensive")




In [26]:
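train_data = dataset['train']
documents = train_data['text']
# Seed UMAP so the topics are reproducible; the other values match BERTopic's defaults.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
# Pass the seeded UMAP model explicitly, otherwise BERTopic builds its own unseeded one.
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", umap_model=umap_model)
topics, probs = topic_model.fit_transform(documents)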

In [27]:
topic_model.visualize_barchart()

In [28]:
topics_per_class = topic_model.topics_per_class(documents, train_data['label'])
topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=5)


Only the first topic is displayed by default. To see the class distribution of another topic, click its label in the legend.

## What do you think about the results? How do you think it could impact a model trained on these data?

#### The topics show that the tweets were collected over a short period of time: one of the main topics is about a lawyer who made the news in 2018, which is clearly not a major topic on Twitter in general.

#### We can clearly see that the label distribution differs from one topic to another. A model could therefore learn the topic of a tweet instead of its offensiveness, which makes errors likely on tweets about other topics.
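The label disparity within topics can also be quantified rather than read off the plot. The following is a minimal sketch, assuming `topics` is the list of topic ids returned by `fit_transform` and that label `1` means offensive. The helper `offensive_rate_per_topic` and the toy values are hypothetical.

```python
from collections import defaultdict


def offensive_rate_per_topic(topics, labels, offensive_label=1):
    """Return {topic id: (number of documents, share labeled offensive)}."""
    totals = defaultdict(int)
    offensive = defaultdict(int)
    for topic, label in zip(topics, labels):
        totals[topic] += 1
        if label == offensive_label:
            offensive[topic] += 1
    return {t: (totals[t], offensive[t] / totals[t]) for t in sorted(totals)}


# Toy values (hypothetical): topic -1 is BERTopic's outlier topic.
toy_topics = [0, 0, 0, 0, 1, 1, 1, 1, -1, -1]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
rates = offensive_rate_per_topic(toy_topics, toy_labels)
for topic, (size, rate) in rates.items():
    print(f"topic {topic}: {size} docs, {rate:.0%} offensive")
```

On the real data this could be called as `offensive_rate_per_topic(topics, train_data['label'])`. Topics whose offensive share is far from the overall share are the ones most likely to act as a shortcut for the model.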

## **Bonus** By default, BERTopic extracts single keywords. Play with the model to extract bigrams or more. See if you can go deeper in your analysis.

In [36]:
# Describe each topic with bigrams only (two-word phrases).
vectorizer_model = CountVectorizer(ngram_range=(2, 2))
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(documents)

In [35]:
topic_model.visualize_barchart()

In [37]:
# Describe each topic with trigrams only (three-word phrases).
vectorizer_model = CountVectorizer(ngram_range=(3, 3))
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(documents)
topic_model.visualize_barchart()

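#### Using bigrams or trigrams doesn't seem to affect the topics, which are globally the same. This is largely expected, since in BERTopic the vectorizer only affects the keywords used to describe each cluster, not the clustering itself.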

# Evaluate a model

## Evaluate their model on the test split of the dataset you picked, using precision, recall, and F1-score.
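As a reminder of what the three metrics measure, here is a minimal pure-Python sketch for the offensive class. The function name `precision_recall_f1` and the toy labels and predictions are hypothetical, and label `1` is assumed to mean offensive. They do not come from the model under evaluation.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy gold labels and predictions (hypothetical).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In practice, `sklearn.metrics.classification_report(y_true, y_pred)` reports the same per-class numbers. With imbalanced classes, the scores of the offensive class are more informative than overall accuracy.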