# NLP + Transformers Course

<img src="https://yaelmanuel.com/wp-content/uploads/2021/12/platzi-banner-logo-matematicas.png" width="500px">

---

# Amazon Reviews Analysis 📦🔍

### 1) Load the dataset 🤓

Extract the RAR archive.

In [None]:
!unrar x "/content/reviews_dataframe_completo.rar"


UNRAR 6.11 beta 1 freeware      Copyright (c) 1993-2022 Alexander Roshal


Extracting from /content/reviews_dataframe_completo.rar

Extracting  reviews_dataframe_completo.csv                                29% 59% 89%100%  OK 
All OK


In [None]:
import pandas as pd

In [None]:
csv_path = "/content/reviews_dataframe_completo.csv"
data = pd.read_csv(csv_path)

In [None]:
data.head(3)

| Unnamed: 0 | review_id | product_id | reviewer_id | stars | review_body | review_title | language | product_category |
|---|---|---|---|---|---|---|---|---|
| 0 | es_0491108 | product_es_0296024 | reviewer_es_0999081 | 1 | Nada bueno se me fue ka pantalla en menos de 8... | television Nevir | es | electronics |
| 1 | es_0869872 | product_es_0922286 | reviewer_es_0216771 | 1 | Horrible, nos tuvimos que comprar otro porque ... | Dinero tirado a la basura con esta compra | es | electronics |
| 2 | es_0811721 | product_es_0474543 | reviewer_es_0929213 | 1 | Te obligan a comprar dos unidades y te llega s... | solo llega una unidad cuando te obligan a comp... | es | drugstore |


### 2) Data preparation 👌

#### 2.1) Install the dependencies 🙌

In [None]:
!pip install transformers datasets evaluate

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.4.1-py3-none-any.whl (487 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)

In [None]:
import pandas as pd
import numpy as np

import evaluate

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

from datasets import Dataset, DatasetDict

#### 2.2) Prepare the columns 🔍

Split the dataset into train, test, and validation sets.

- df_train_es (70% of the original dataset)
- df_test_es (10% of the original dataset)
- df_val_es (20% of the original dataset)

In [None]:
from sklearn.model_selection import train_test_split

# Step 1: split into training (70%) and the remainder (30%)
df_train_es, df_temp_es = train_test_split(data, test_size=0.3, random_state=42)  # fixed random_state for reproducibility

# Step 2: split the remainder (30%) into test (10%) and validation (20%).
# train_test_split returns the test_size share second, so df_val_es receives 2/3 of the remainder.
df_test_es, df_val_es = train_test_split(df_temp_es, test_size=2/3, random_state=42)  # 20/30 = 2/3
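
Because `train_test_split` hands back the held-out share as its second return value, it is worth checking which frame ends up with which size. The sketch below uses only the standard library and a hypothetical helper, `split_sizes`, which assumes the held-out part of a fractional `test_size` is rounded up and the rest goes to the first frame. The 210,000-row total is the sum of the split sizes printed in section 2.3.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Hypothetical helper: sizes of the (first, second) frames returned by a split.

    Assumes the held-out share is rounded up and the remainder goes to the first frame.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


n_total = 210_000  # 147,000 + 42,000 + 21,000 rows reported in section 2.3

n_train, n_temp = split_sizes(n_total, 0.3)  # step 1: 70% / 30%
n_test, n_val = split_sizes(n_temp, 2 / 3)   # step 2: first frame 1/3, second frame 2/3

print(n_train, n_test, n_val)
```

Under these assumptions the validation frame is the larger of the two held-out sets, which agrees with the `DatasetDict` summary printed in section 2.3.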

We assign a label based on the number of stars:
- If the number of stars is greater than or equal to 3, we assign a good rating (value 1).
- Otherwise, it is a bad rating (value 0).

In [None]:
df_train_es['labels'] = df_train_es['stars'].apply(lambda x: 1 if x >= 3 else 0)
df_test_es['labels'] = df_test_es['stars'].apply(lambda x: 1 if x >= 3 else 0)
df_val_es['labels'] = df_val_es['stars'].apply(lambda x: 1 if x >= 3 else 0)
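
The labeling rule can be checked in isolation. The following minimal sketch applies the same threshold to every possible star rating; `stars_to_label` is an illustrative name and is not part of the notebook.

```python
def stars_to_label(stars: int) -> int:
    """Return 1 (good rating) for 3 or more stars, else 0 (bad rating)."""
    return 1 if stars >= 3 else 0


# Every rating a review can have, from 1 to 5 stars.
labels_by_stars = {stars: stars_to_label(stars) for stars in range(1, 6)}
print(labels_by_stars)
```

Note that 3-star reviews fall on the positive side under this rule.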

In [None]:
df_train_es.head(2)

| Unnamed: 0 | review_id | product_id | reviewer_id | stars | review_body | review_title | language | product_category | labels |
|---|---|---|---|---|---|---|---|---|---|
| 81812 | es_0994681 | product_es_0473680 | reviewer_es_0370969 | 3 | Al poderse apilar, ordenas mucho mejor | Prácticas | es | furniture | 1 |
| 8844 | es_0419353 | product_es_0907086 | reviewer_es_0536262 | 1 | No las e podido poner porque una de las luces ... | Mal | es | automotive | 0 |


In [None]:
df_train_es.tail(2)

| Unnamed: 0 | review_id | product_id | reviewer_id | stars | review_body | review_title | language | product_category | labels |
|---|---|---|---|---|---|---|---|---|---|
| 146867 | es_0950227 | product_es_0177239 | reviewer_es_0231213 | 4 | Una buena cámara de gama media. Se nota una me... | SJCam SJ6 Legend | es | camera | 1 |
| 121958 | es_0371735 | product_es_0679770 | reviewer_es_0882536 | 4 | Es como me lo imaginaba. Por el contrario de o... | Lo que estaba buscando | es | kitchen | 1 |


#### 2.3) Adapt the dataset format 🔧

In [None]:
# Convert the DataFrames into Dataset objects from the datasets library
train_dataset = Dataset.from_pandas(df_train_es)
test_dataset = Dataset.from_pandas(df_test_es)
val_dataset = Dataset.from_pandas(df_val_es)

# Build a DatasetDict from the three splits
dataset = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})

# Inspect the structure
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'labels', '__index_level_0__'],
        num_rows: 147000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'labels', '__index_level_0__'],
        num_rows: 42000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'labels', '__index_level_0__'],
        num_rows: 21000
    })
})


### 3) Tokenization 📊

In [None]:
model_checkpoint = "PlanTL-GOB-ES/roberta-base-bne"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
def tokenize_reviews(examples):
    return tokenizer(examples["review_body"], truncation=True)

In [None]:
columns = dataset["train"].column_names
columns.remove("labels")
encoded_dataset = dataset.map(tokenize_reviews, batched=True, remove_columns=columns)
print(encoded_dataset)


DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 147000
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 42000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 21000
    })
})
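
With `batched=True`, `map` passes the function a batch in which each column name maps to a list of values, and expects a dictionary of equally long lists back. The sketch below imitates that contract in plain Python. `toy_tokenizer` is a whitespace splitter standing in for the real tokenizer (an assumption made so the example runs without downloading a model), and the two reviews are made up.

```python
def toy_tokenizer(texts: list[str]) -> dict[str, list[list[int]]]:
    """Stand-in for a real tokenizer: one fake id (the word length) per word."""
    input_ids = [[len(word) for word in text.split()] for text in texts]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def tokenize_batch(examples: dict[str, list]) -> dict[str, list]:
    # `examples` holds whole columns, so examples["review_body"] is a list of strings.
    return toy_tokenizer(examples["review_body"])


batch = {
    "review_body": ["muy buen producto", "no funciona"],
    "labels": [1, 0],
}
encoded = tokenize_batch(batch)

# Keep only the label column plus the new features, mirroring remove_columns above.
encoded_batch = {"labels": batch["labels"], **encoded}
print(encoded_batch)
```

The resulting keys (`labels`, `input_ids`, `attention_mask`) match the features listed in the encoded `DatasetDict`.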


### 4) Fine-tuning the model 😨

In [None]:
num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at PlanTL-GOB-ES/roberta-base-bne and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**We will use accuracy as the performance metric**

In [None]:
metric = evaluate.load("accuracy")
print(metric)


EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.

Examples:

    Example 1-A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
    

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    return metric.compute(predictions=predictions, references=labels)
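
To make the two steps inside `compute_metrics` concrete (arg-max over the logits, then comparison with the reference labels), here is a minimal standard-library sketch. The logits and labels are made up, and `argmax` and `accuracy` are illustrative helpers rather than the `numpy` and `evaluate` calls used in the notebook.

```python
def argmax(row: list[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits: list[list[float]], labels: list[int]) -> float:
    """Fraction of rows whose arg-max class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four reviews (columns: class 0, class 1) and their true labels.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.2], [1.5, 0.2]]
labels = [0, 1, 0, 0]

result = {"accuracy": accuracy(logits, labels)}
print(result)
```

Three of the four made-up predictions agree with their labels, so the sketch reports an accuracy of 0.75.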

### 5) Hugging Face Hub 🤗

We will upload the trained model to the Hugging Face Hub so we can share it with the world 😎

**Important:** The new access token we create must have write permissions.

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: write).
The token `demo-platzi-project` has been saved to /root/.cache/huggingface/stored_tokens
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to r

In [None]:
!git config --global credential.helper store

### 6) Training 💪

In [None]:
model_name = model_checkpoint.split("/")[-1]

In [None]:
print(model_name)

roberta-base-bne


In [None]:
batch_size = 8
num_train_epochs=2
num_train_samples = 20_000
train_dataset = encoded_dataset["train"].shuffle(seed=42).select(range(num_train_samples))
logging_steps = len(train_dataset) // (2 * batch_size * num_train_epochs)
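
The arithmetic behind `logging_steps` can be worked through by hand. The sketch below recomputes it from the values in the cell above, assuming a single device and no gradient accumulation, so that one batch corresponds to one optimizer step.

```python
batch_size = 8
num_train_epochs = 2
num_train_samples = 20_000

# Assumption: one device and no gradient accumulation, so one batch = one optimizer step.
steps_per_epoch = num_train_samples // batch_size
total_steps = steps_per_epoch * num_train_epochs

# Same formula as in the cell above.
logging_steps = num_train_samples // (2 * batch_size * num_train_epochs)
logs_per_epoch = steps_per_epoch // logging_steps

print(steps_per_epoch, total_steps, logging_steps, logs_per_epoch)
```

Under these assumptions the run takes 5,000 optimizer steps in total, which agrees with the `global_step=5000` reported by `trainer.train()`, and the training loss is logged every 625 steps, four times per epoch.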

In [None]:
training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=num_train_epochs,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=logging_steps,
    push_to_hub=True,
    hub_model_id=f"{model_name}-platzi-project-nlp-con-transformers"
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=encoded_dataset["validation"],
    processing_class=tokenizer,
)

**This step requires a free [Weights & Biases](https://wandb.ai/home) account, because the training metrics are logged there (the training itself still runs in this notebook).**

In [None]:
trainer.train()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: cabustillo13 to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3545 | 0.348128 | 0.847167 |
| 2 | 0.2621 | 0.45929 | 0.8575 |


TrainOutput(global_step=5000, training_loss=0.3189806640625, metrics={'train_runtime': 1276.3661, 'train_samples_per_second': 31.339, 'train_steps_per_second': 3.917, 'total_flos': 1735126965673920.0, 'train_loss': 0.3189806640625, 'epoch': 2.0})
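
The results above show validation loss rising in the second epoch while accuracy still improves, so the choice of `metric_for_best_model` decides which checkpoint `load_best_model_at_end=True` restores. Here is a minimal sketch of that selection, using the reported values and assuming higher accuracy is treated as better. The `eval_` key names are illustrative.

```python
# Per-epoch evaluation results copied from the training output above.
history = [
    {"epoch": 1, "eval_loss": 0.348128, "eval_accuracy": 0.847167},
    {"epoch": 2, "eval_loss": 0.45929, "eval_accuracy": 0.8575},
]

# Selecting by accuracy (higher is better) versus by loss (lower is better).
best_by_accuracy = max(history, key=lambda row: row["eval_accuracy"])
best_by_loss = min(history, key=lambda row: row["eval_loss"])

print("best by accuracy: epoch", best_by_accuracy["epoch"])
print("best by loss: epoch", best_by_loss["epoch"])
```

With accuracy as the criterion the second epoch wins, whereas selecting on validation loss would have kept the first. The growing gap between training and validation loss may be an early sign of overfitting.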

### 7) Save the model 💾

To do that, we push it to the Hugging Face Hub.

In [None]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/cabustillo13/roberta-base-bne-platzi-project-nlp-con-transformers/commit/e72bf9889b52fa1720b72b7ff9be8919030c363d', commit_message='End of training', commit_description='', oid='e72bf9889b52fa1720b72b7ff9be8919030c363d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/cabustillo13/roberta-base-bne-platzi-project-nlp-con-transformers', endpoint='https://huggingface.co', repo_type='model', repo_id='cabustillo13/roberta-base-bne-platzi-project-nlp-con-transformers'), pr_revision=None, pr_num=None)

### 8) Making predictions in production 🤙

In [None]:
from transformers import pipeline

Load the model once (at application startup)

In [None]:
model_checkpoint = "cabustillo13/roberta-base-bne-platzi-project-nlp-con-transformers"
pipe = pipeline("sentiment-analysis", model=model_checkpoint)


Device set to use cuda:0


Usage examples

In [None]:
pipe("me encanto el pantalon!!!")

[{'label': 'LABEL_1', 'score': 0.9962030053138733}]

In [None]:
pipe("Te obligan a comprar dos unidades")

[{'label': 'LABEL_0', 'score': 0.9433281421661377}]

In [None]:
pipe("la peor compra de mi vida!!! no recomiendo!")

[{'label': 'LABEL_0', 'score': 0.9928492307662964}]
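
The pipeline returns the generic names `LABEL_0` and `LABEL_1`, presumably because no human-readable label names were stored in the model config. One way to present the results is to map them after the call, as in the sketch below, which works on the outputs shown above. The display names in `LABEL_NAMES` ("negative", "positive") are assumptions that follow the labeling rule from section 2.2.

```python
# Assumed display names: LABEL_0 = fewer than 3 stars, LABEL_1 = 3 stars or more.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def readable(predictions: list[dict]) -> list[dict]:
    """Replace generic pipeline labels with display names, rounding the scores."""
    return [
        {"label": LABEL_NAMES[p["label"]], "score": round(p["score"], 4)}
        for p in predictions
    ]


# Outputs copied from the three pipeline calls above.
raw = [
    {"label": "LABEL_1", "score": 0.9962030053138733},
    {"label": "LABEL_0", "score": 0.9433281421661377},
    {"label": "LABEL_0", "score": 0.9928492307662964},
]
print(readable(raw))
```

Alternatively, passing `id2label` and `label2id` to `from_pretrained` before training should store the names in the model config, so the pipeline would return them directly.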