<div align="center">
  <h1 style="color:darkblue">🚀 Tweet Embedding Analysis 📉</h1>

</div>

## 📌 Introduction


In this notebook we use semantic vectors, known as **embeddings**: word representations learned under the distributional hypothesis, the idea that words appearing in similar contexts tend to have similar meanings. We use these representations to analyze the similarity between tweets.

In [None]:
%%bash

python -m spacy download en_core_web_md

In [None]:
import spacy
from tqdm.auto import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import plotly.express as px
import math

nlp = spacy.load("en_core_web_md")
stopwords = nlp.Defaults.stop_words
nlp.component_names

## 2. Data preparation

In [None]:
# Load the dataset and keep only the columns of interest
df = pd.read_csv("../data/Corona_NLP_train.csv", encoding="latin1")
df = df[["OriginalTweet", "Sentiment"]]

In [None]:
custom_stopwords = {
    "covid",
    "coronavirus",
    "covid19",
    "corona",
    "coranaviru",
    "covid2019",
    "coronacrisis",
    "coronavirusoutbreak",
    "coronaviruspandemic",
    "coronavirusupdate",
    "coronavirusupdates",
    "coronavirususa",
    "coronavirusuk",
    "covid19uk",
    "covid19usa",
    "19",
    "2019",
    "amp",  # probably a leftover from &amp;
    # Words taken from the word clouds that appear under every sentiment
    "food",
    "prices",
    "people",
    "store",
    "supermarket",
    "grocery",
    "will",
}

df["CleanTweet"] = (
    df["OriginalTweet"]
    .str.replace(r"http\S+|www\S+", "", regex=True)  # also matches plain http:// URLs
    .str.replace(r"\@\w+", "", regex=True)
    .str.replace(r"\#(\w+)", "", regex=True)
    .str.normalize("NFKD")
    .str.encode("ascii", errors="ignore")
    .str.decode("utf-8")
    .str.replace(r"\s+", " ", regex=True)
    .apply(
        lambda text: " ".join(
            [
                word
                for word in text.split()
                if word.lower() not in stopwords
                and word.isalpha()
                and len(word) > 2
                and word.lower() not in custom_stopwords
            ]
        )
    )
    .str.lower()
    .str.strip()
)

df = df.loc[df["CleanTweet"].str.split().str.len() > 2]
df = df[df["CleanTweet"] != ""]
# df = df.drop(columns=["OriginalTweet"])
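To make the effect of each cleaning step easier to follow, here is a minimal, dependency-free sketch that applies the same kind of steps to a single string. The function `clean_tweet`, the tiny `DEMO_STOPWORDS` set, and the sample tweet are hypothetical and exist only for this illustration; the notebook itself uses spaCy's stop words plus `custom_stopwords`.

```python
import re
import unicodedata

# Hypothetical, tiny stop-word set used only for this illustration.
DEMO_STOPWORDS = {"the", "is", "at", "and", "covid"}


def clean_tweet(text, stop_words=DEMO_STOPWORDS):
    """Apply URL, mention, hashtag, accent, and stop-word cleaning to one string."""
    text = re.sub(r"http\S+|www\S+", "", text)  # drop URLs
    text = re.sub(r"@\w+", "", text)  # drop @mentions
    text = re.sub(r"#\w+", "", text)  # drop #hashtags
    # Decompose accented characters, then discard anything that is not ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Keep alphabetic tokens longer than two characters that are not stop words.
    tokens = [
        word.lower()
        for word in text.split()
        if word.isalpha() and len(word) > 2 and word.lower() not in stop_words
    ]
    return " ".join(tokens)


sample = "Panic buying at the café https://t.co/abc @user #covid19 shelves empty!"
print(clean_tweet(sample))
```

Note that `str.isalpha()` rejects any token with punctuation attached, so `empty!` is dropped entirely in this sketch rather than being trimmed to `empty`. The pipeline above filters with the same check, so it behaves the same way.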

In [None]:
df.sample(5)

## 3. Embedding analysis


### 3.1 Token to Vector

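The first approach is to extract word *embeddings*. For this we use the `Tok2Vec` class from the `spacy` package[[2]](https://spacy.io/api/tok2vec). The cell below reads `doc.vector`, which by default is the average of the document's token vectors, so each tweet ends up with a single vector.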

In [None]:
docs = nlp.pipe(df["CleanTweet"])
vectors = np.array([doc.vector for doc in tqdm(docs, total=len(df))])
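The idea of averaging token vectors into one document vector can be sketched without spaCy. The example below is a simplified illustration, not spaCy's implementation: the 3-dimensional `toy_vectors` table is invented, and treating unknown tokens as zero vectors is an assumption of this sketch.

```python
# Hypothetical 3-dimensional word vectors, invented for this illustration.
toy_vectors = {
    "panic": [1.0, 0.0, 2.0],
    "buying": [0.0, 2.0, 2.0],
    "shelves": [2.0, 1.0, 2.0],
}


def mean_vector(tokens, table, dim=3):
    """Average the token vectors, counting unknown tokens as zero vectors."""
    if not tokens:
        return [0.0] * dim
    rows = [table.get(token, [0.0] * dim) for token in tokens]
    # zip(*rows) walks the vectors column by column, one dimension at a time.
    return [sum(column) / len(rows) for column in zip(*rows)]


print(mean_vector(["panic", "buying", "shelves"], toy_vectors))
print(mean_vector(["panic", "zzz"], toy_vectors))  # "zzz" is not in the table
```

Because the average ignores word order, two tweets that use the same words in a different order get the same vector under this scheme.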

We use the `KMeans` clustering algorithm to group the semantic vectors. The number of clusters is chosen with the elbow method.

In [None]:
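def optimal_number_of_clusters(wcss):
    # Elbow heuristic: return the k (from 2 to 20) whose point lies farthest
    # from the straight line joining the first and last points of the WCSS curve.
    x1, y1 = 2, wcss[0]
    x2, y2 = 20, wcss[len(wcss) - 1]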

    distances = []
    for i in range(len(wcss)):
        x0 = i + 2
        y0 = wcss[i]
        numerator = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
        denominator = math.sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)
        distances.append(numerator / denominator)

    return distances.index(max(distances)) + 2
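The function above hard-codes the range of `k` as 2 to 20. As a hedged alternative, the sketch below uses the same point-to-line distance but receives the list of `k` values as an argument. The name `elbow_point` and the toy inertia values are made up for this illustration.

```python
import math


def elbow_point(ks, wcss):
    """Return the k whose (k, wcss) point lies farthest from the line
    joining the first and last points of the curve."""
    x1, y1 = ks[0], wcss[0]
    x2, y2 = ks[-1], wcss[-1]
    denominator = math.hypot(y2 - y1, x2 - x1)
    distances = [
        abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1) / denominator
        for x0, y0 in zip(ks, wcss)
    ]
    return ks[distances.index(max(distances))]


# Made-up inertia values with a visible bend at k = 4.
ks = [2, 3, 4, 5, 6]
toy_wcss = [100.0, 60.0, 30.0, 25.0, 20.0]
print(elbow_point(ks, toy_wcss))
```

With this signature the search range can change, for example to `range(2, 31)`, without touching the helper. The endpoints always have distance zero, so they are selected only if every point lies on the line.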

In [None]:
inertia = []
for k in tqdm(range(2, 21)):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(vectors)
    inertia.append(kmeans.inertia_)
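The `inertia_` attribute collected above is the sum of squared distances from each sample to its closest cluster center. The following sketch computes that quantity by hand on toy 2-D data, to make the arithmetic explicit. The function name `within_cluster_ss` and the data are hypothetical.

```python
def within_cluster_ss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


# Toy 2-D data: two tight groups, invented for this illustration.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
centroids = [[0.0, 1.0], [10.0, 1.0]]
print(within_cluster_ss(points, labels, centroids))
```

Adding clusters can only keep this value the same or lower it, which is why the elbow method looks for the point where the decrease slows down instead of looking for the minimum.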

In [None]:
plt.plot(range(2, 21), inertia, marker="o")
n_clusters = optimal_number_of_clusters(inertia)

plt.axvline(x=n_clusters, color="red", linestyle="--")
plt.text(
    n_clusters + 0.5,
    inertia[n_clusters - 2] + 1e6,
    f"n_clusters = {n_clusters}",
    fontsize=9,
    color="red",
)

plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()

With the cut-off point chosen, we fit `KMeans` again to group the semantic vectors.

In [None]:
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans.fit(vectors)

df["Cluster"] = kmeans.labels_

Next, we inspect each cluster by listing the tweets most similar to its center, using cosine similarity.

In [None]:
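centers = kmeans.cluster_centers_

# Normalize each tweet vector to unit length so the dot product below ranks by
# cosine similarity. Zero vectors keep a divisor of 1 to avoid dividing by zero.
vector_norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / np.where(vector_norms == 0, 1, vector_norms)

for i in range(n_clusters):
    print("=======" * 10)
    print(f"Cluster {i}:")
    # Dividing by the center's norm would not change the ranking, so it is skipped.
    center = centers[i]
    top_similarities = np.argsort(-unit_vectors.dot(center))[:20]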
    top = df.iloc[top_similarities].drop_duplicates("OriginalTweet").head(10)
    for j, row in top.iterrows():
        print(f"{j}:  {row['OriginalTweet']}")
    print()
    print("=======" * 10)
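A raw dot product and cosine similarity can rank the same vectors differently, because the dot product also grows with vector length. The sketch below shows this on invented 2-D vectors. The helper names `dot` and `cosine_similarity` and the example vectors are assumptions made for illustration.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


centroid = [1.0, 1.0]
aligned = [2.0, 2.0]  # same direction as the centroid
long_off_axis = [10.0, 0.0]  # larger norm, but 45 degrees away from the centroid

print(dot(aligned, centroid), dot(long_off_axis, centroid))
print(cosine_similarity(aligned, centroid), cosine_similarity(long_off_axis, centroid))
```

Here the off-axis vector wins on the dot product only because it is longer, while cosine similarity prefers the vector that points in the same direction as the centroid.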

To visualize the clusters, we use the `t-SNE` algorithm to reduce the semantic vectors to two dimensions.

In [None]:
tsne = TSNE(n_components=2, random_state=42)
vectors_2d = tsne.fit_transform(vectors)

In [None]:
df["x"] = vectors_2d[:, 0]
df["y"] = vectors_2d[:, 1]

df["Cluster"] = df["Cluster"].astype(str)


fig = px.scatter(
    df,
    x="x",
    y="y",
    color="Cluster",
    hover_data=["CleanTweet", "Sentiment"],
    title="Clusters",
)

fig.show()