<div align="center">
  <h1 style="color:darkblue"> Tweet Sentiment Classification - Part 2 🐦</h1>
</div>

In this notebook, we continue the tweet analysis, but now we predict three classes: positive, negative, and neutral. In the previous notebook, we ran an exploratory analysis of the tweets and classified them into five classes. Here we reuse the cleaning steps from the previous notebook and compare the cleaned tweets with a lemmatized version of them.

Each classifier is trained twice: once on the cleaned tweets and once on the lemmatized tweets. At the end, we compare the results to check whether lemmatizing the tweets affects model performance.

The classification models are the same as in the previous notebook:
- Logistic Regression
- Naive Bayes
- Random Forest
- Linear SVM

In addition, we vectorize the texts with TF-IDF.
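
To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to our understanding matches the defaults of `TfidfVectorizer`. The toy corpus is made up for illustration, and the sketch skips the stop-word removal that the notebook enables with `stop_words="english"`.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not taken from the dataset)
docs = [
    "stores ran out of food",
    "food prices went up",
    "stay home stay safe",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    term_freq = Counter(tokens)
    weights = {
        # term frequency times smoothed idf: ln((1 + n) / (1 + df)) + 1
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 2) for term, w in vec.items()})
```

Terms that occur in many documents (here `food`) get a lower weight than terms that occur in only one, and a term repeated within a tweet (here `stay`) gets a higher weight than its neighbours.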


In [None]:
from tqdm.auto import tqdm
from collections import Counter

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

from sklearn.model_selection import cross_validate, train_test_split
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    make_scorer,
    precision_score,
    recall_score,
)

import spacy

nlp = spacy.load("en_core_web_md")

In [None]:
df = pd.read_csv("../data/Corona_NLP_train.csv", encoding="latin1")
df = df[["OriginalTweet", "Sentiment"]]
df.shape

In [None]:
df["Sentiment"] = df["Sentiment"].replace(
    {"Extremely Negative": "Negative", "Extremely Positive": "Positive"}
)
df["Sentiment"].value_counts()

## Data Preparation

In [None]:
def preprocess_text(text):
    return (
        text.str.lower()
        # remove links (http, https, and www)
        .str.replace(r"https?\S+|www\S+", "", regex=True)
        # remove usernames
        .str.replace(r"\@\w+", "", regex=True)
        # remove hashtags
        .str.replace(r"\#(\w+)", "", regex=True)
        # remove non-ascii characters
        .str.normalize("NFKD")
        .str.encode("ascii", errors="ignore")
        .str.decode("utf-8")
        # keep only letters, whitespace, and apostrophes
        .str.replace(r"[^a-z\s\']", "", regex=True)
        # collapse repeated whitespace
        .str.replace(r"\s+", " ", regex=True)
        # strip leading and trailing whitespace
        .str.strip()
    )


df["CleanTweet"] = preprocess_text(df["OriginalTweet"])

# Remove words that appear only once in the corpus (hapax legomena)
words = df["CleanTweet"].str.cat(sep=" ").split()
types = Counter(words)
hapax = {word for word, count in types.items() if count <= 1}

df["CleanTweet"] = df["CleanTweet"].apply(
    lambda text: " ".join([word for word in text.split() if word not in hapax])
)

# Keep only tweets with more than 2 words
df = df.loc[df["CleanTweet"].str.split().str.len() > 2]
df = df.drop_duplicates(subset=["CleanTweet", "Sentiment"])
df.shape

In [None]:
docs = nlp.pipe(df["CleanTweet"])

df["Lemmatized"] = [
    " ".join([token.lemma_ for token in doc])
    for doc in tqdm(docs, total=len(df), desc="Lemmatizing")
]

In [None]:
df.loc[
    df["Lemmatized"].duplicated(keep=False), ["CleanTweet", "Sentiment"]
].sort_values("CleanTweet")

After lemmatization, some *tweets* become duplicates of each other, so we remove them, this time deduplicating on the lemmatized text (together with the sentiment label) instead of the cleaned text.
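
The sketch below illustrates why lemmatization can create new duplicates. It is only an illustration: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical stand-ins for spaCy's `token.lemma_`, and the rows are made up. Texts that differ on the surface collapse to the same lemma string, and a row is dropped only when both the lemmatized text and the label repeat, which mirrors the `subset=["Lemmatized", "Sentiment"]` used in the next cell.

```python
# Hypothetical lemma lookup standing in for spaCy's token.lemma_
TOY_LEMMAS = {
    "prices": "price",
    "are": "be",
    "is": "be",
    "rising": "rise",
    "stores": "store",
    "closed": "close",
}


def toy_lemmatize(text):
    """Map each word to its toy lemma, leaving unknown words unchanged."""
    return " ".join(TOY_LEMMAS.get(word, word) for word in text.split())


# Made-up (cleaned text, sentiment) rows
rows = [
    ("grocery prices are rising", "Negative"),
    ("grocery price is rising", "Negative"),  # same lemmas, same label
    ("grocery price is rising", "Neutral"),   # same lemmas, different label
    ("stores are closed", "Neutral"),
]

# Keep the first row for each (lemmatized text, sentiment) pair
seen = set()
deduplicated = []
for clean_text, sentiment in rows:
    key = (toy_lemmatize(clean_text), sentiment)
    if key not in seen:
        seen.add(key)
        deduplicated.append((clean_text, sentiment))

for row in deduplicated:
    print(row)
```

Only the second row is removed: it shares both its lemma string and its label with the first one, while the third row survives because its label differs.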

In [None]:
df = df.drop_duplicates(subset=["Lemmatized", "Sentiment"])
df.shape

## Classification Models

In [None]:
models = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(random_state=42, n_jobs=-1),
    "LinearSVC": LinearSVC(dual="auto", random_state=42),
}
df = df.reset_index(drop=True)

In [None]:
X = df.drop(columns=["Sentiment"])
y = df["Sentiment"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

### Model 1: Cleaned Tweets

In [None]:
clean_tweet_vectorizer = TfidfVectorizer(stop_words="english")
X_train_vectorized = clean_tweet_vectorizer.fit_transform(X_train["CleanTweet"])
X_test_vectorized = clean_tweet_vectorizer.transform(X_test["CleanTweet"])

In [None]:
results = {}

print("CLEANED, WITHOUT LEMMATIZATION:")
for model_name, model in tqdm(models.items(), desc="Training models"):
    scores = cross_validate(
        model,
        X_train_vectorized,
        y_train,
        cv=5,
        scoring={
            "accuracy": make_scorer(accuracy_score),
            "precision": make_scorer(precision_score, average="weighted"),
            "recall": make_scorer(recall_score, average="weighted"),
            "f1": make_scorer(f1_score, average="weighted"),
        },
        return_train_score=True,
    )

    results[model_name] = scores

    print(f"{model_name:=^55}")
    print(
        f"{'subset':10}",
        f"{'accuracy':>10}",
        f"{'precision':>10}",
        f"{'recall':>10}",
        f"{'f1':>10}",
    )
    print(
        f"{'train':10}",
        f"{scores['train_accuracy'].mean():10.2f}",
        f"{scores['train_precision'].mean():10.2f}",
        f"{scores['train_recall'].mean():10.2f}",
        f"{scores['train_f1'].mean():10.2f}",
    )

    print(
        f"{'test':10}",
        f"{scores['test_accuracy'].mean():10.2f}",
        f"{scores['test_precision'].mean():10.2f}",
        f"{scores['test_recall'].mean():10.2f}",
        f"{scores['test_f1'].mean():10.2f}",
    )
    print()

Finally, we evaluate the models on the test set and compare the results.

In [None]:
for model_name, model in models.items():
    model.fit(X_train_vectorized, y_train)
    y_pred = model.predict(X_test_vectorized)
    print(f"{model_name:=^55}")
    print(classification_report(y_test, y_pred))
    print()

Linear SVM and Logistic Regression achieved the best results, with weighted-average F1 scores of 0.81 and 0.80, respectively.
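
For readers unfamiliar with the weighted-average F1 reported above, the following sketch computes it by hand: the F1 of each class is weighted by that class's support (its number of true examples). The helper `weighted_f1` and the labels are illustrative only, not the notebook's data, and the function is meant to follow the same definition as `f1_score(..., average="weighted")` rather than replace it.

```python
def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's support."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        support = tp + fn  # number of true examples of this class
        score += f1 * support / total
    return score


# Made-up labels for illustration
y_true = ["Positive", "Positive", "Positive", "Negative", "Negative", "Neutral"]
y_pred = ["Positive", "Positive", "Negative", "Negative", "Negative", "Positive"]

print(round(weighted_f1(y_true, y_pred), 2))
```

Because the weights follow the class frequencies, a model that does poorly on a rare class (here `Neutral`) is penalized less than it would be under a macro average.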

### Model 2: Cleaned and Lemmatized Tweets

In [None]:
lemmatized_vectorizer = TfidfVectorizer(stop_words="english")

X_train_vectorized = lemmatized_vectorizer.fit_transform(X_train["Lemmatized"])
X_test_vectorized = lemmatized_vectorizer.transform(X_test["Lemmatized"])

In [None]:
results = {}

print("CLEANED, WITH LEMMATIZATION:")
for model_name, model in tqdm(models.items(), desc="Training models"):
    scores = cross_validate(
        model,
        X_train_vectorized,
        y_train,
        cv=5,
        scoring={
            "accuracy": make_scorer(accuracy_score),
            "precision": make_scorer(precision_score, average="weighted"),
            "recall": make_scorer(recall_score, average="weighted"),
            "f1": make_scorer(f1_score, average="weighted"),
        },
        return_train_score=True,
    )

    results[model_name] = scores

    print(f"{model_name:=^55}")
    print(
        f"{'subset':10}",
        f"{'accuracy':>10}",
        f"{'precision':>10}",
        f"{'recall':>10}",
        f"{'f1':>10}",
    )
    print(
        f"{'train':10}",
        f"{scores['train_accuracy'].mean():10.2f}",
        f"{scores['train_precision'].mean():10.2f}",
        f"{scores['train_recall'].mean():10.2f}",
        f"{scores['train_f1'].mean():10.2f}",
    )

    print(
        f"{'test':10}",
        f"{scores['test_accuracy'].mean():10.2f}",
        f"{scores['test_precision'].mean():10.2f}",
        f"{scores['test_recall'].mean():10.2f}",
        f"{scores['test_f1'].mean():10.2f}",
    )
    print()

In [None]:
for model_name, model in models.items():
    model.fit(X_train_vectorized, y_train)
    y_pred = model.predict(X_test_vectorized)
    print(f"{model_name:=^55}")
    print(classification_report(y_test, y_pred))
    print()

Linear SVM and Logistic Regression again had the best weighted-average F1 scores, both at 0.79. Compared with the 0.81 and 0.80 obtained on the cleaned tweets, lemmatization did not improve performance in this setup. We can also observe that classifying the *tweets* into three classes is easier than into five, since the models performed better than in the previous notebook.