<a href="https://colab.research.google.com/github/lucianoayres/npl-sentinel/blob/main/Projeto_npl_sentinel_versao_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Loading the Data

In [18]:
import pandas as pd

# URL of the remote CSV file
url = "https://raw.githubusercontent.com/lucianoayres/npl-sentinel/refs/heads/main/data/reviews.csv"

# Load the CSV file into a pandas DataFrame
df = pd.read_csv(url)

# Save the DataFrame locally in the Google Colab runtime
df.to_csv("reviews.csv", index=False)

# Show the first rows of the DataFrame (optional)
print(df.head())

                                              review  rating
0  Foi uma experiência incrível porque atendeu to...       5
1  Recomendo para todos, é realmente a qualidade ...       5
2  Eu adorei este produto! superou minhas expecta...       5
3  Recomendo para todos, é realmente superou minh...       5
4  Gostei muito porque o atendimento foi excepcio...       5


# 2. Data Preprocessing

Uses the `nltk` Portuguese stop-word list, together with lowercasing, punctuation removal and whitespace tokenization, to clean the review text.

In [19]:
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords

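# Download the stop-word list
nltk.download("stopwords")
# A set keeps the per-word membership test in preprocess_text fast
stop_words = set(stopwords.words("portuguese"))

# Clean the review text: lowercase, drop punctuation, remove stop words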
def preprocess_text(text):
    text = text.lower()
    text = "".join([char for char in text if char.isalnum() or char.isspace()])
    tokens = [word for word in text.split() if word not in stop_words]
    return " ".join(tokens)

df["clean_review"] = df["review"].apply(preprocess_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
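
To see what the cleaning step does to a single review, here is a minimal standalone sketch. It replaces the `nltk` list with a small hand-picked set of Portuguese stop words (`SAMPLE_STOP_WORDS` and `preprocess_text_demo` are hypothetical names, chosen so the snippet runs without any download), so the exact words removed may differ from the notebook's.

```python
# Hypothetical, hand-picked subset of Portuguese stop words (NOT the nltk list).
SAMPLE_STOP_WORDS = {"a", "o", "e", "é", "de", "que", "para", "muito"}

def preprocess_text_demo(text, stop_words=SAMPLE_STOP_WORDS):
    # Lowercase so that "Gostei" and "gostei" become the same token
    text = text.lower()
    # Keep only letters, digits and whitespace (accented letters count as letters)
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    # Split on whitespace and drop stop words
    tokens = [word for word in text.split() if word not in stop_words]
    return " ".join(tokens)

sample = "Gostei muito porque o preço é justo."
cleaned = preprocess_text_demo(sample)
print(cleaned)
```

Note that punctuation is deleted rather than replaced by a space, so a string such as "bom,barato" would collapse into a single token.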


# 3. Training Classifiers

## 3.1 SVM + Bag of Words (BoW)

In [20]:
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df["clean_review"], df["rating"], test_size=0.2, random_state=42)

# Pipeline for SVM with Bag of Words
pipeline_bow = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", SVC())
])

# Training
pipeline_bow.fit(X_train, y_train)
y_pred_bow = pipeline_bow.predict(X_test)

# Evaluation
print("SVM + Bag of Words")
print(classification_report(y_test, y_pred_bow))

SVM + Bag of Words
              precision    recall  f1-score   support

           1       0.45      0.35      0.40       150
           2       0.55      0.55      0.55       337
           3       0.54      0.65      0.59       155
           4       0.50      0.58      0.53       181
           5       0.48      0.40      0.43       177

    accuracy                           0.52      1000
   macro avg       0.50      0.51      0.50      1000
weighted avg       0.51      0.52      0.51      1000
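
The last two rows of the report above summarize the per-class scores in two different ways. As a quick check, the sketch below recomputes both averages for the F1 column from the rounded per-class values printed in the report (so it can only be expected to agree to two decimals).

```python
# Per-class F1 and support, copied from the SVM + Bag of Words report above.
f1_by_class = {1: 0.40, 2: 0.55, 3: 0.59, 4: 0.53, 5: 0.43}
support_by_class = {1: 150, 2: 337, 3: 155, 4: 181, 5: 177}

total = sum(support_by_class.values())

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1_by_class[c] * support_by_class[c] for c in f1_by_class) / total

print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
```

Because rating 2 has roughly twice the support of any other class, its score pulls the weighted average more than the macro average.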



## 3.2 SVM + Embeddings

Uses `spacy` to convert each review into an embedding vector and trains the SVM on those vectors.

In [21]:
# Download the Portuguese language model
!python -m spacy download pt_core_news_sm

import spacy
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Use spaCy document vectors as embeddings
nlp = spacy.load("pt_core_news_sm")
X_train_embedded = [nlp(text).vector for text in X_train]
X_test_embedded = [nlp(text).vector for text in X_test]

# Standardize the embeddings (fit on the training set only)
scaler = StandardScaler()
X_train_embedded = scaler.fit_transform(X_train_embedded)
X_test_embedded = scaler.transform(X_test_embedded)

# Train the SVM on the embeddings
svm_embedding = SVC()
svm_embedding.fit(X_train_embedded, y_train)
y_pred_embed = svm_embedding.predict(X_test_embedded)

print("SVM + Embeddings")
print(classification_report(y_test, y_pred_embed))

Collecting pt-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-3.7.0/pt_core_news_sm-3.7.0-py3-none-any.whl (13.0 MB)
✔ Download and installation successful
You can now load the package via spacy.load('pt_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.




SVM + Embeddings
              precision    recall  f1-score   support

           1       0.45      0.35      0.40       150
           2       0.54      0.59      0.56       337
           3       0.52      0.52      0.52       155
           4       0.48      0.53      0.51       181
           5       0.47      0.42      0.45       177

    accuracy                           0.50      1000
   macro avg       0.49      0.48      0.49      1000
weighted avg       0.50      0.50      0.50      1000
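
The embedding cell above relies on two simple operations: spaCy's `Doc.vector` defaults to an average of the token vectors, and `StandardScaler` rescales each feature using the mean and standard deviation of the training set. The plain-Python sketch below illustrates both on made-up 2-dimensional vectors; the helper names (`mean_pool`, `fit_scaler`, `transform`) are hypothetical and are not spaCy or scikit-learn functions.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors component-wise into one document vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def fit_scaler(rows):
    """Per-feature mean and population standard deviation (ddof=0)."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(column) / n for column in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in column) / n)
            for column, m in zip(columns, means)]
    return means, stds

def transform(rows, means, stds):
    """Standardize rows; a zero standard deviation is treated as 1 to avoid dividing by zero."""
    return [[(x - m) / (s or 1.0) for x, m, s in zip(row, means, stds)] for row in rows]

# Made-up token vectors for two training reviews and one test review.
train_docs = [mean_pool([[1.0, 2.0], [3.0, 4.0]]), mean_pool([[5.0, 6.0], [7.0, 8.0]])]
test_docs = [mean_pool([[2.0, 2.0], [4.0, 6.0]])]

# Statistics come from the training data only and are reused for the test data.
means, stds = fit_scaler(train_docs)
train_scaled = transform(train_docs, means, stds)
test_scaled = transform(test_docs, means, stds)
print(train_scaled, test_scaled)
```

Fitting the scaler on the training vectors alone, as the notebook does, keeps information about the test set out of the features.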



## 3.3 BERT for Classification

Uses `transformers` to fine-tune a BERT model for sentiment classification.

In [5]:
from transformers import BertTokenizer, TFBertForSequenceClassification, AdamWeightDecay
import tensorflow as tf

# Restrict TensorFlow to CPU only
tf.config.set_visible_devices([], 'GPU')

# Preparing the BERT model
tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = TFBertForSequenceClassification.from_pretrained("neuralmind/bert-base-portuguese-cased", num_labels=5)

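# Tokenizing the data. Without max_length, truncation=True has no effect for this
# model (the tokenizer warns and falls back to no truncation). 128 is an assumed cap.
X_train_tokens = tokenizer(list(X_train), padding=True, truncation=True, max_length=128, return_tensors="tf")
X_test_tokens = tokenizer(list(X_test), padding=True, truncation=True, max_length=128, return_tensors="tf")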

# Define optimizer and loss function
optimizer = AdamWeightDecay(learning_rate=3e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = tf.keras.metrics.SparseCategoricalAccuracy()

# Ratings are 1-5, but the model expects class ids 0-4
y_train_ids = y_train.to_numpy() - 1
y_test_ids = y_test.to_numpy() - 1

# Custom training step
def train_step(inputs, targets):
    with tf.GradientTape() as tape:
        predictions = model(inputs)
        loss = loss_fn(targets, predictions.logits)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    # The metric accumulates, so the reported accuracy is a running average over batches
    metrics.update_state(targets, predictions.logits)
    return loss, metrics.result()

# Training loop
epochs = 3
batch_size = 8
for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")
    for i in range(0, len(y_train_ids), batch_size):
        batch_inputs = {k: v[i:i + batch_size] for k, v in X_train_tokens.data.items()}
        batch_targets = y_train_ids[i:i + batch_size]
        loss, accuracy = train_step(batch_inputs, batch_targets)
        print(f"Batch {i // batch_size + 1}: Loss = {loss:.4f}, Accuracy = {accuracy:.4f}")

# Evaluation
predictions = model(X_test_tokens.data).logits
loss = loss_fn(y_test_ids, predictions)
# Clear the accuracy accumulated during training so that only test examples are counted
metrics.reset_state()
metrics.update_state(y_test_ids, predictions)

print(f"BERT Test Loss: {loss:.4f}, Test Accuracy: {metrics.result().numpy():.4f}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier', 'bert/pooler/dense/kernel:0', 'bert/pooler/dense/bias:0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Epoch 1/3
Batch 1: Loss = 1.6374, Accuracy = 0.0000
Batch 2: Loss = 1.6159, Accuracy = 0.1250
Batch 3: Loss = 1.4946, Accuracy = 0.2917
Batch 4: Loss = 1.5115, Accuracy = 0.3438
Batch 5: Loss = 1.4661, Accuracy = 0.3750
Batch 6: Loss = 1.5773, Accuracy = 0.3958
Batch 7: Loss = 1.5923, Accuracy = 0.3750
Batch 8: Loss = 1.7102, Accuracy = 0.3750
Batch 9: Loss = 1.4632, Accuracy = 0.3750
Batch 10: Loss = 1.5646, Accuracy = 0.3625
Batch 11: Loss = 1.4730, Accuracy = 0.3636
Batch 12: Loss = 1.4488, Accuracy = 0.3646
Batch 13: Loss = 1.3078, Accuracy = 0.3654
Batch 14: Loss = 1.5203, Accuracy = 0.3571
Batch 15: Loss = 1.4337, Accuracy = 0.3500
Batch 16: Loss = 1.5903, Accuracy = 0.3438
Batch 17: Loss = 1.4498, Accuracy = 0.3382
Batch 18: Loss = 1.3386, Accuracy = 0.3472
Batch 19: Loss = 1.1778, Accuracy = 0.3684
Batch 20: Loss = 1.2934, Accuracy = 0.3750
Batch 21: Loss = 1.1568, Accuracy = 0.3810
Batch 22: Loss = 1.1698, Accuracy = 0.3864
Batch 23: Loss = 1.1310, Accuracy = 0.3913
Batch 24: 
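
The first batches in the log above report a loss of about 1.6. That is close to what cross-entropy gives when a five-class model is equally unsure about every class, namely ln 5 ≈ 1.609, which is what one would expect from a freshly initialized classification head. The sketch below recomputes that value in plain Python, including the shift from ratings 1-5 to class ids 0-4; the function name is hypothetical and only mirrors what `SparseCategoricalCrossentropy(from_logits=True)` computes for a single example.

```python
import math

def sparse_cross_entropy_from_logits(logits, class_id):
    """Cross-entropy for one example: -log(softmax(logits)[class_id])."""
    largest = max(logits)  # subtract the max for numerical stability
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[class_id]

rating = 4             # star rating as stored in the dataset (1-5)
class_id = rating - 1  # class id expected by the loss (0-4)

# Equal logits: the model has no preference among the five classes.
uniform_loss = sparse_cross_entropy_from_logits([0.0] * 5, class_id)

# A large logit on the correct class drives the loss towards zero.
confident_loss = sparse_cross_entropy_from_logits([0.0, 0.0, 0.0, 5.0, 0.0], class_id)

print(f"uniform: {uniform_loss:.4f}, confident: {confident_loss:.4f}")
```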

# 4. Classification with In-Context Learning (Bonus)

Uses an LLM to classify sentiment directly from a short instruction prompt, with no task-specific training. The prompt asks for one of three labels (positive, negative or neutral) rather than the 1-5 rating used in Section 3.
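
The cells in this section send a zero-shot prompt: an instruction plus the review, with no worked examples. A common in-context-learning variant is few-shot prompting, where a handful of labelled reviews precede the one to classify. The sketch below shows one way to build such a prompt; `FEW_SHOT_EXAMPLES` and `build_few_shot_prompt` are hypothetical, and the example labels are illustrative rather than taken from the dataset.

```python
# Hypothetical labelled examples; in practice they could be drawn from the training split.
FEW_SHOT_EXAMPLES = [
    ("Gostei muito porque o preço é justo.", "positiva"),
    ("Não recomendo, a qualidade é muito baixa.", "negativa"),
    ("Não surpreendeu, apenas cumpre o que promete.", "neutra"),
]

def build_few_shot_prompt(review_text, examples=FEW_SHOT_EXAMPLES):
    # Instruction first, then each worked example, then the review to classify
    lines = ["Classifique a avaliação abaixo como positiva, negativa ou neutra.", ""]
    for text, label in examples:
        lines += [f"Avaliação: {text}", f"Classificação: {label}", ""]
    lines += [f"Avaliação: {review_text}", "Classificação:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt("Decepcionante, não atendeu às expectativas.")
print(prompt)
```

Ending the prompt with an open `Classificação:` line nudges the model to answer with just the label, in the same format as the examples.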

## 4.1 Using OpenAI (GPT-4)

In [13]:
from google.colab import userdata
from openai import OpenAI

# Get your OpenAI API key from userdata
api_key = userdata.get('OPENAI_API_KEY')

# Initialize the OpenAI client
client = OpenAI(api_key=api_key)

# Prompt template
prompt_template = "Classifique a avaliação abaixo como positiva, negativa ou neutra:\n\nAvaliação: {review_text}\n\nClassificação:"

def classify_with_llm(review_text):
    completion = client.chat.completions.create(
        model="gpt-4",  # Or use "gpt-3.5-turbo"
        messages=[
            {"role": "user", "content": prompt_template.format(review_text=review_text)}
        ]
    )
    return completion.choices[0].message.content.strip()

# Classify 10 random reviews (using the existing 'df')
random_reviews = df.sample(n=10)

for index, row in random_reviews.iterrows():
    review_text = row['review']
    classification = classify_with_llm(review_text)
    print(f"{review_text}")
    print(f"{classification}\n")

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

## 4.2 Using Google Gemini 1.5

In [22]:
import google.generativeai as genai
from google.colab import userdata

# Get your Google API key from userdata
myKey = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=myKey)

# Specify the Gemini model
model = genai.GenerativeModel("gemini-1.5-flash-latest")

# Prompt template
prompt_template = "Classifique a avaliação abaixo como positiva, negativa ou neutra:\n\nAvaliação: {review_text}\n\nClassificação:"

def classify_with_gemini(review_text):
    response = model.generate_content(prompt_template.format(review_text=review_text))
    classification = response.text.strip()
    return classification

# Classify 10 random reviews (using the existing 'df')
random_reviews = df.sample(n=10)

for index, row in random_reviews.iterrows():
    review_text = row['review']
    classification = classify_with_gemini(review_text)
    print(f"{review_text}")
    print(f"{classification}\n")

Não recomendo, a qualidade é muito baixa.
Negativa

Tive uma experiência ruim porque não atendeu às expectativas.
Classificação: Negativa

Gostei muito porque o preço é justo.
Classificação: Positiva

Recomendo para todos, é realmente o atendimento foi excepcional.
Classificação: Positiva

Não surpreendeu, apenas não se destaca no mercado.
Classificação: Negativa

É razoável, não se destaca no mercado.
Classificação: Negativa.

Embora não seja explicitamente negativa, a frase "não se destaca no mercado" implica que o produto ou serviço é inferior a outros disponíveis.  "Razoável" também sugere que não é excepcional.  Portanto, a avaliação geral é negativa.

Não surpreendeu, apenas cumpre o que promete.
Neutra.

Decepcionante, não atendeu às expectativas.
Classificação: Negativa

Não surpreendeu, apenas não se destaca no mercado.
Classificação: Negativa

Não gostei porque não funcionou como esperado.
Negativa
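
The replies above are not uniform: some start with `Classificação:`, some end with a period, and one is followed by a paragraph of explanation. Before comparing them with anything else, it helps to map each reply to a canonical label. The sketch below is one possible approach; `normalize_label` is a hypothetical helper that looks only at the first non-empty line of the reply.

```python
VALID_LABELS = ("positiva", "negativa", "neutra")
PREFIX = "classificação:"

def normalize_label(raw_response):
    """Map a free-form model reply to one of VALID_LABELS, or None if unclear."""
    lines = [line.strip() for line in raw_response.splitlines() if line.strip()]
    if not lines:
        return None
    answer = lines[0].lower()
    # Drop the optional "Classificação:" prefix echoed back from the prompt
    if answer.startswith(PREFIX):
        answer = answer[len(PREFIX):]
    # Remove surrounding whitespace and trailing punctuation
    answer = answer.strip(" .!\t")
    return answer if answer in VALID_LABELS else None

replies = [
    "Negativa",
    "Classificação: Positiva",
    "Neutra.",
    "Classificação: Negativa.\n\nEmbora não seja explicitamente negativa, ...",
]
labels = [normalize_label(reply) for reply in replies]
print(labels)
```

Returning `None` for anything unexpected, rather than guessing, makes it easy to count and inspect the replies that did not follow the requested format.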

