<div align="center">
  <h1 style="color:darkblue"> Tweet Sentiment Classification - Part 2🐦</h1>
</div>

In this notebook we continue the analysis of the tweets, but now we predict three classes: positive, negative, and neutral. In the previous notebook we carried out the exploratory analysis of the tweets and classified them into five classes. Here we reuse the cleaning from the previous notebook and compare it with a lemmatized version of the tweets.

The classifiers are trained in two setups: one on the cleaned tweets and one on the lemmatized tweets. At the end, we compare the results and check whether lemmatizing the tweets affects model performance.

The classification models are the same as in the previous notebook:
- Logistic Regression
- Naive Bayes
- Random Forest
- Linear SVM

In addition, the texts are vectorized with TF-IDF.
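
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch. It assumes the smoothed IDF variant `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the default documented for scikit-learn's `TfidfVectorizer`. The function name `tfidf` and the toy corpus are illustrative, and stop-word removal is left out.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {term: weight} dict per document (smoothed IDF, L2 norm)."""
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: freq * idf[t] for t, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Toy corpus for illustration only
corpus = ["store shelves empty", "store prices rising", "stay safe"]
vectors = tfidf(corpus)
print(vectors[0])
```

Terms that occur in fewer documents ("shelves") receive a larger weight than terms shared across documents ("store"), which is the property the classifiers rely on.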


In [None]:
from IPython.display import clear_output
from tqdm.auto import tqdm
from collections import Counter

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_validate
from sklearn.metrics import (
    make_scorer,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

import spacy

nlp = spacy.load("en_core_web_md")
tqdm.pandas()

In [None]:
df = pd.read_csv("../data/Corona_NLP_train.csv", encoding="latin1")
df = df[["OriginalTweet", "Sentiment"]]
df.shape

In [None]:
df["Sentiment"] = df["Sentiment"].replace(
    {"Extremely Negative": "Negative", "Extremely Positive": "Positive"}
)
df["Sentiment"].value_counts()

## Data Preparation

In [None]:
def preprocess_text(text):
    return (
        text.str.lower()
        # remove links
        .str.replace(r"http\S+|www\S+", "", regex=True)
        # remove usernames
        .str.replace(r"\@\w+", "", regex=True)
        # remove hashtags
        .str.replace(r"\#(\w+)", "", regex=True)
        # remove non-ASCII characters
        .str.normalize("NFKD")
        .str.encode("ascii", errors="ignore")
        .str.decode("utf-8")
        # keep only letters, whitespace, and apostrophes
        .str.replace(r"[^a-z\s\']", "", regex=True)
        # collapse repeated whitespace
        .str.replace(r"\s+", " ", regex=True)
        # strip leading and trailing whitespace
        .str.strip()
    )


df["CleanTweet"] = preprocess_text(df["OriginalTweet"])

# Remove words that appear only once in the whole corpus (hapax legomena)
words = df["CleanTweet"].str.cat(sep=" ").split()
types = Counter(words)
hapax = {word for word, count in types.items() if count == 1}

df["CleanTweet"] = df["CleanTweet"].apply(
    lambda text: " ".join([word for word in text.split() if word not in hapax])
)

# Keep only tweets with more than 2 words
df = df.loc[df["CleanTweet"].str.split().str.len() > 2]
df = df.drop_duplicates(subset=["CleanTweet", "Sentiment"])
df.shape
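
As a rough illustration of what the cleaning cell does, the sketch below applies the same steps to a single string with the standard library and then removes once-only words from a tiny corpus. The names `clean_tweet` and `remove_hapax`, the sample tweet, and the link pattern `http\S+|www\S+` are assumptions made for this example.

```python
import re
import unicodedata
from collections import Counter


def clean_tweet(text: str) -> str:
    """Single-string version of the cleaning chain (illustrative)."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # links
    text = re.sub(r"@\w+", "", text)            # usernames
    text = re.sub(r"#\w+", "", text)            # hashtags
    # Decompose accented characters, then drop whatever is not ASCII
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = re.sub(r"[^a-z\s']", "", text)       # letters, whitespace, apostrophes
    return re.sub(r"\s+", " ", text).strip()


def remove_hapax(texts):
    """Drop words that occur exactly once across all texts."""
    counts = Counter(word for text in texts for word in text.split())
    return [" ".join(w for w in text.split() if counts[w] > 1) for text in texts]


# Made-up tweet for illustration
sample = "Panic buying at @Tesco!! Shelves are EMPTY https://t.co/abc123 #coronavirus café"
print(clean_tweet(sample))
print(remove_hapax(["store shelves empty", "store shelves full"]))
```

Note that removing hapax words can leave a tweet with very few tokens, which is why the length filter in the cell runs afterwards.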

In [None]:
docs = nlp.pipe(df["CleanTweet"])

df["Lemmatized"] = [
    " ".join([token.lemma_ for token in doc])
    for doc in tqdm(docs, total=len(df), desc="Lemmatizing")
]

In [None]:
df.loc[
    df["Lemmatized"].duplicated(keep=False),
    ["OriginalTweet", "CleanTweet", "Sentiment"],
].sort_values("CleanTweet")

After lemmatization we find duplicated *tweets*, so we remove them based on the lemmatized text and the sentiment. Running the cell above to list the duplicated *tweets* also reveals an inconsistency in the *dataset*: similar texts carry different sentiments.

In [None]:
df.loc[15757, "OriginalTweet"], df.loc[21677, "OriginalTweet"]

In [None]:
df = df.drop_duplicates(subset=["Lemmatized", "Sentiment"])
df = df.drop_duplicates(subset=["Lemmatized"], keep=False)
df.shape
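
The two `drop_duplicates` calls first collapse exact (text, sentiment) repeats and then discard every text that still appears more than once, meaning it carries conflicting labels. A minimal plain-Python sketch of the same logic, with a hypothetical helper `drop_conflicting` and made-up rows:

```python
from collections import defaultdict


def drop_conflicting(rows):
    """Keep the first row of each text that has exactly one distinct label."""
    labels_by_text = defaultdict(set)
    for text, label in rows:
        labels_by_text[text].add(label)

    seen = set()
    kept = []
    for text, label in rows:
        # Skip texts with conflicting labels and repeated occurrences
        if len(labels_by_text[text]) == 1 and text not in seen:
            kept.append((text, label))
            seen.add(text)
    return kept


# Made-up rows for illustration
rows = [
    ("stay home", "Positive"),
    ("stay home", "Positive"),   # exact duplicate: collapsed
    ("price rise", "Negative"),
    ("price rise", "Neutral"),   # conflicting label: both rows dropped
    ("wash hand", "Neutral"),
]
print(drop_conflicting(rows))
```

Dropping both conflicting rows, instead of picking one, avoids training on a label that the data itself contradicts.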

## Classification Models

In [None]:
models = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(random_state=42, n_jobs=-1),
    "LinearSVC": LinearSVC(dual="auto", random_state=42),
}
df = df.reset_index(drop=True)

In [None]:
X = df.drop(columns=["Sentiment"])
y = pd.Categorical(
    df["Sentiment"], categories=["Negative", "Neutral", "Positive"], ordered=True
)
y = pd.Series(y, name="Sentiment", index=X.index)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, random_state=42, stratify=y
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# X_train.to_parquet("../data/X_train.parquet", index=False)
# X_test.to_parquet("../data/X_test.parquet", index=False)
# y_train.to_frame().to_parquet("../data/y_train.parquet", index=False)
# y_test.to_frame().to_parquet("../data/y_test.parquet", index=False)

### Model 1: Cleaned Tweets

In [None]:
clean_tweet_vectorizer = TfidfVectorizer(stop_words="english")
X_train_vectorized = clean_tweet_vectorizer.fit_transform(X_train["CleanTweet"])
X_test_vectorized = clean_tweet_vectorizer.transform(X_test["CleanTweet"])

In [None]:
results = {}

print("CLEANED, WITHOUT LEMMATIZATION:")
for model_name, model in tqdm(models.items(), desc="Training models"):
    scores = cross_validate(
        model,
        X_train_vectorized,
        y_train,
        cv=5,
        scoring={
            "accuracy": make_scorer(accuracy_score),
            "precision": make_scorer(precision_score, average="weighted"),
            "recall": make_scorer(recall_score, average="weighted"),
            "f1": make_scorer(f1_score, average="weighted"),
        },
        return_train_score=True,
    )

    results[model_name] = scores

    print(f"{model_name:=^55}")
    print(
        f"{'subset':10}",
        f"{'accuracy':>10}",
        f"{'precision':>10}",
        f"{'recall':>10}",
        f"{'f1':>10}",
    )
    print(
        f"{'train':10}",
        f"{scores['train_accuracy'].mean():10.2f}",
        f"{scores['train_precision'].mean():10.2f}",
        f"{scores['train_recall'].mean():10.2f}",
        f"{scores['train_f1'].mean():10.2f}",
    )

    print(
        f"{'test':10}",
        f"{scores['test_accuracy'].mean():10.2f}",
        f"{scores['test_precision'].mean():10.2f}",
        f"{scores['test_recall'].mean():10.2f}",
        f"{scores['test_f1'].mean():10.2f}",
    )
    print()
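
The `results` dictionary filled in the loop can also be summarized programmatically. The sketch below ranks models by mean test F1 and reports the train/test gap as a rough overfitting signal. The helper `rank_by_test_f1` is hypothetical, plain lists stand in for the arrays returned by `cross_validate`, and the scores are placeholders, not results from this notebook.

```python
from statistics import mean


def rank_by_test_f1(results):
    """Return (model, mean test F1, train-test F1 gap), best test F1 first."""
    rows = []
    for name, scores in results.items():
        train_f1 = mean(scores["train_f1"])
        test_f1 = mean(scores["test_f1"])
        rows.append((name, test_f1, train_f1 - test_f1))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder scores for illustration only; not results from this notebook
example_results = {
    "model_a": {"train_f1": [0.90, 0.92], "test_f1": [0.70, 0.72]},
    "model_b": {"train_f1": [0.99, 0.99], "test_f1": [0.65, 0.67]},
}

for name, test_f1, gap in rank_by_test_f1(example_results):
    print(f"{name:10} test_f1={test_f1:.2f} gap={gap:.2f}")
```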

Finally, we validate the models on the test data and compare the results.

In [None]:
for model_name, model in models.items():
    model.fit(X_train_vectorized, y_train)
    y_pred = model.predict(X_test_vectorized)
    print(f"{model_name:=^55}")
    print(classification_report(y_test, y_pred))
    print()

Linear SVM and Logistic Regression achieved the best results, with weighted-average F1 scores of 0.82 and 0.81, respectively.
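
The weighted-average F1 in `classification_report` is the per-class F1 averaged with each class's support (number of true instances) as weight. A minimal pure-Python sketch of that computation follows; the function name `weighted_f1` and the toy labels are illustrative and not taken from the dataset.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true
    return total / len(y_true)


# Toy labels for illustration only
y_true_toy = ["Negative", "Negative", "Neutral", "Positive", "Positive", "Positive"]
y_pred_toy = ["Negative", "Neutral", "Neutral", "Positive", "Positive", "Negative"]
print(round(weighted_f1(y_true_toy, y_pred_toy), 4))
```

Because of the weighting, frequent classes dominate the score, so it is worth reading the per-class rows of the report as well.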

### Model 2: Cleaned and Lemmatized Tweets

In [None]:
lemmatized_vectorizer = TfidfVectorizer(stop_words="english")

X_train_vectorized = lemmatized_vectorizer.fit_transform(X_train["Lemmatized"])
X_test_vectorized = lemmatized_vectorizer.transform(X_test["Lemmatized"])

In [None]:
results = {}

print("CLEANED, WITH LEMMATIZATION:")
for model_name, model in tqdm(models.items(), desc="Training models"):
    scores = cross_validate(
        model,
        X_train_vectorized,
        y_train,
        cv=5,
        scoring={
            "accuracy": make_scorer(accuracy_score),
            "precision": make_scorer(precision_score, average="weighted"),
            "recall": make_scorer(recall_score, average="weighted"),
            "f1": make_scorer(f1_score, average="weighted"),
        },
        return_train_score=True,
    )

    results[model_name] = scores

    print(f"{model_name:=^55}")
    print(
        f"{'subset':10}",
        f"{'accuracy':>10}",
        f"{'precision':>10}",
        f"{'recall':>10}",
        f"{'f1':>10}",
    )
    print(
        f"{'train':10}",
        f"{scores['train_accuracy'].mean():10.2f}",
        f"{scores['train_precision'].mean():10.2f}",
        f"{scores['train_recall'].mean():10.2f}",
        f"{scores['train_f1'].mean():10.2f}",
    )

    print(
        f"{'test':10}",
        f"{scores['test_accuracy'].mean():10.2f}",
        f"{scores['test_precision'].mean():10.2f}",
        f"{scores['test_recall'].mean():10.2f}",
        f"{scores['test_f1'].mean():10.2f}",
    )
    print()

In [None]:
for model_name, model in models.items():
    model.fit(X_train_vectorized, y_train)
    y_pred = model.predict(X_test_vectorized)
    print(f"{model_name:=^55}")
    print(classification_report(y_test, y_pred))
    print()

Linear SVM and Logistic Regression still have the best weighted-average F1 scores, 0.81 and 0.79 respectively. Compared with the cleaned tweets (0.82 and 0.81), lemmatization did not improve these scores. We can also observe that classifying the *tweets* into three classes is easier than into five, since the models performed better than in the previous notebook.

### Extra: Sentiment Analysis with Mixtral: Zero-Shot Prompting

Finally, we use the Mixtral-8x7B-Instruct-v0.1 model to classify the *tweets* and compare the results with the trained models.

We start by preparing the working environment: cloning the mixtral-offloading repository and installing the required dependencies.

In [None]:
%%bash
git clone https://github.com/dvmazur/mixtral-offloading.git --quiet
cd mixtral-offloading && pip install -r requirements.txt --quiet

# Download the model
huggingface-cli download lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo --quiet --local-dir Mixtral-8x7B-Instruct-v0.1-offloading-demo

Following the repository's reference notebook, we define the settings needed to run the model.

In [None]:
import sys

sys.path.append("mixtral-offloading")

In [None]:
import torch
from hqq.core.quantize import BaseQuantizeConfig
from transformers import AutoConfig, AutoTokenizer
from src.build_model import OffloadConfig, QuantConfig, build_model

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"
state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"

config = AutoConfig.from_pretrained(quantized_model_name)

device = torch.device("cuda:0")

##### Change this to 5 if you have only 12 GB of GPU VRAM #####
offload_per_layer = 1
# offload_per_layer = 5
###############################################################

offload_config = OffloadConfig(
    main_size=config.num_hidden_layers * (config.num_local_experts - offload_per_layer),
    offload_size=config.num_hidden_layers * offload_per_layer,
    buffer_size=4,
    offload_per_layer=offload_per_layer,
)

attn_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=True,
)
attn_config["scale_quant_params"]["group_size"] = 256

ffn_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_zero=True,
    quant_scale=True,
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)

model = build_model(
    device=device,
    quant_config=quant_config,
    offload_config=offload_config,
    state_path=state_path,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

We define the function `generate_response` to obtain the classification of a *tweet*. It takes the *prompt* containing the *tweet* text and returns the predicted label.

In [None]:
def generate_response(
    prompt: str, model: torch.nn.Module, tokenizer: AutoTokenizer, device: torch.device
) -> str:
    user_entry = [{"role": "user", "content": prompt}]

    input_ids = tokenizer.apply_chat_template(user_entry, return_tensors="pt").to(
        device
    )
    attention_mask = torch.ones_like(input_ids)

    result = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        past_key_values=None,
        do_sample=True,
        temperature=0.9,
        top_p=0.9,
        max_new_tokens=8,
        pad_token_id=tokenizer.eos_token_id,
        return_dict_in_generate=True,
        output_hidden_states=True,
    )

    sequence = result.get("sequences", None)
    if sequence is None:
        raise ValueError("Generation failed")

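    outputs = tokenizer.decode(sequence[0], skip_special_tokens=True)
    # Keep only the first word generated after the instruction block;
    # return an empty string if the model produced no text.
    words = outputs.split("[/INST]")[-1].split()
    return words[0] if words else ""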
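
Since generation is free text with sampling enabled, the first word of the reply is not guaranteed to match one of the three labels exactly (for example `Positive.` or `negative`). A hypothetical post-processing helper, `normalize_label`, could map such replies onto the label set; the `"Unknown"` fallback is an assumption made here so that unparseable replies stay visible in the report and are not silently assigned to a class.

```python
import re

LABELS = ("Negative", "Neutral", "Positive")


def normalize_label(response: str, fallback: str = "Unknown") -> str:
    """Map a raw model reply onto one of LABELS, or return the fallback."""
    # Lowercase and strip everything that is not a letter (quotes, periods, ...)
    cleaned = re.sub(r"[^a-z]", "", response.lower())
    for label in LABELS:
        if cleaned.startswith(label.lower()):
            return label
    return fallback


print(normalize_label("Positive."))
print(normalize_label('"negative"'))
print(normalize_label("Sentiment:"))
```

Applied to the predictions before calling `classification_report`, this keeps punctuation or casing differences from being counted as separate classes.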

In [None]:
user_prompt = """
<|system|>
You are a tweet categorizer that only responds to whether the tweet is "Negative", "Neutral" or "Positive". 
You should only respond with the label in which the Tweet falls and nothing else. 
<|user|>
Classify the text into one of these categories based on the sentiment of the tweet.
Text: {text}
Sentiment:

<|assistant|>
"""

The *tweets* are classified in a loop. First, ten randomly chosen test *tweets* are printed with their true and predicted sentiment as a sanity check; then every test *tweet* is classified and the predictions are stored in a list.

In [None]:
import random

for i in range(10):
    # randrange excludes the upper bound, so idx is always a valid position
    idx = random.randrange(len(X_test))

    prompt = user_prompt.format(text=X_test.iloc[idx]["OriginalTweet"])
    print("Text:", X_test.iloc[idx]["OriginalTweet"])
    print(f"True sentiment: {y_test.iloc[idx]}")

    response = generate_response(prompt, model, tokenizer, device)
    print(f"Predicted sentiment: {response}")
    print()
    print("=" * 80)

In [None]:
y_pred_mixtral = []

for text in tqdm(X_test["OriginalTweet"], desc="Mixtral"):
    response = generate_response(
        user_prompt.format(text=text), model, tokenizer, device
    )
    y_pred_mixtral.append(response)

y_pred_mixtral = pd.Series(y_pred_mixtral, name="Sentiment", index=X_test.index)

The classification report below shows that Mixtral-8x7B-Instruct-v0.1 performed worse than the trained models, with a weighted-average F1 of 0.5.

In [None]:
print(classification_report(y_test, y_pred_mixtral))

### Extra: Sentiment Analysis with BERT: Fine-Tuning

In this step, we train a sentiment classification model with BERT.

In [None]:
# import pandas as pd

# X_train = pd.read_parquet("../data/X_train.parquet")
# X_test = pd.read_parquet("../data/X_test.parquet")
# y_train = pd.read_parquet("../data/y_train.parquet").squeeze()
# y_test = pd.read_parquet("../data/y_test.parquet").squeeze()

In [None]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification


model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=3
)
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

In [None]:
import numpy as np
import evaluate


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return clf_metrics.compute(
        predictions=predictions, references=labels, average="weighted"
    )


clf_metrics = evaluate.combine(["f1", "precision", "recall"])

In [None]:
from datasets import Dataset


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


train_dataset = Dataset.from_dict(
    pd.DataFrame(
        {"text": X_train["OriginalTweet"], "label": y_train.values.codes}
    ).to_dict(orient="list")
)

test_dataset = Dataset.from_dict(
    pd.DataFrame(
        {"text": X_test["OriginalTweet"], "label": y_test.values.codes}
    ).to_dict(orient="list")
)

# Fix the seed of the validation split as well so the run is reproducible
train_dataset = train_dataset.shuffle(seed=42).train_test_split(
    test_size=0.05, seed=42
)

train_tokenized_datasets = train_dataset.map(tokenize_function, batched=True)
test_tokenized_datasets = test_dataset.map(tokenize_function, batched=True)

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    logging_dir="./logs",
    per_device_train_batch_size=32,
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized_datasets["train"],
    eval_dataset=train_tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

By *fine-tuning* BERT we obtain metrics above those of the previous models, with a weighted-average F1 of 0.91 on the test set.

In [None]:
predictions = trainer.predict(test_tokenized_datasets)

y_pred_bert = np.argmax(predictions.predictions, axis=-1)


print(
    classification_report(
        y_test.values.codes, y_pred_bert, target_names=y_test.cat.categories
    )
)
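
To read individual predictions as label names, the argmax index of each logit row can be mapped back to the category order used when encoding the labels (`Negative`, `Neutral`, `Positive`). A minimal sketch with a hypothetical helper `logits_to_labels` and placeholder logits, not model output:

```python
CATEGORIES = ["Negative", "Neutral", "Positive"]


def logits_to_labels(logits, categories=CATEGORIES):
    """Pick the highest-scoring class per row and return its name."""
    labels = []
    for row in logits:
        best = max(range(len(row)), key=row.__getitem__)  # argmax of the row
        labels.append(categories[best])
    return labels


# Placeholder logits for illustration only
example_logits = [
    [2.1, 0.3, -1.0],
    [-0.5, 0.2, 1.7],
    [0.1, 0.9, 0.4],
]
print(logits_to_labels(example_logits))
```

The mapping only holds if the category order matches the codes used for the training labels, which in this notebook come from the ordered `pd.Categorical`.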