<a href="https://colab.research.google.com/github/loresiensis/data-analysis-and-nlp/blob/main/Analisis_de_emociones.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install the libraries

In [None]:
!pip install datasets evaluate transformers[sentencepiece] -q

Log in to Hugging Face

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Import the dataset. In this case I chose [tweet_eval](https://huggingface.co/datasets/tweet_eval), specifically the subset that classifies tweets by emotion (anger, joy, optimism and sadness)

In [None]:
from datasets import load_dataset
raw_dataset = load_dataset("tweet_eval",'emotion')




In [None]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 3257
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1421
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 374
    })
})

In [None]:
raw_dataset['train'].to_pandas()

Unnamed: 0,text,label
0,“Worry is a down payment on a problem you may ...,2
1,My roommate: it's okay that we can't spell bec...,0
2,No but that's so cute. Atsu was probably shy a...,1
3,Rooneys fucking untouchable isn't he? Been fuc...,0
4,it's pretty depressing when u hit pan on ur fa...,3
...,...,...
3252,I get discouraged because I try for 5 fucking ...,3
3253,The @user are in contention and hosting @user ...,3
3254,@user @user @user @user @user as a fellow UP g...,0
3255,You have a #problem? Yes! Can you do #somethin...,0


Tokenize the dataset. Here I used the [DistilBert](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) model

In [None]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, DataCollatorWithPadding

model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = DistilBertTokenizer.from_pretrained(model_checkpoint)

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased-finetuned-sst-2-english/snapshots/bfdd146ea2b6807255b73527f1327ca12b6ed5c4/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased-finetuned-sst-2-english/snapshots/bfdd146ea2b6807255b73527f1327ca12b6ed5c4/tokenizer_config.json
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased-finetuned-sst-2-english/snapshots/bfdd146ea2b6807255b73527f1327ca12b6ed5c4/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 307

In [None]:
def tokenize_func(example):
    return tokenizer(example['text'], truncation=True)

In [None]:
tokenized_dataset = raw_dataset.map(tokenize_func, batched=True)
tokenized_dataset



DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3257
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1421
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 374
    })
})

If there were no column named 'label', we would have to rename the column that holds the classification, but this dataset already has one, so we skip that step

Also, since this dataset already comes split in three (train, test and validation), we do not need to split it ourselves
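
For a dataset that lacked either property, both steps are short. The sketch below shows the idea on a plain list of dictionaries using only the standard library; the column name `emotion`, the helper names and the 80/20 ratio are illustrative assumptions, not part of this notebook. With a `datasets.Dataset`, the library's own `rename_column` and `train_test_split` methods would be the natural tools.

```python
import random

def rename_column(rows, old_name, new_name):
    """Return copies of the rows with one key renamed."""
    renamed = []
    for row in rows:
        row = dict(row)                      # copy so the input is left untouched
        row[new_name] = row.pop(old_name)
        renamed.append(row)
    return renamed

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: the class column is called 'emotion' instead of 'label'.
toy_rows = [{"text": f"tweet {i}", "emotion": i % 4} for i in range(10)]

labelled = rename_column(toy_rows, "emotion", "label")
train_rows, test_rows = split_rows(labelled, test_fraction=0.2)
print(len(train_rows), len(test_rows))
```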

Now we prepare the data for processing:

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
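
DataCollatorWithPadding pads each batch to the length of its longest sequence, instead of padding the whole dataset to a single global length. The following is a plain-Python sketch of that idea, not the library's implementation: the helper name `pad_batch` and the token ids are made up, and the pad id of 0 is taken from the `pad_token_id` shown in the model config logs.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every 'input_ids' list in the batch to the longest one in that batch."""
    max_len = max(len(example["input_ids"]) for example in batch)
    padded = {"input_ids": [], "attention_mask": []}
    for example in batch:
        ids = example["input_ids"]
        n_pad = max_len - len(ids)
        padded["input_ids"].append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        padded["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return padded

# Made-up token ids, only to show the shapes.
batch = [
    {"input_ids": [101, 2023, 102]},
    {"input_ids": [101, 2023, 2003, 2307, 102]},
]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```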

Define the training arguments:

In [None]:
from transformers import TrainingArguments
training_args = TrainingArguments('distilbert_classificator',evaluation_strategy="epoch")

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


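Define the model (DistilBertForSequenceClassification was already imported above, so there is no need to import it again). The checkpoint was fine-tuned on SST-2, a two-class task, so ignore_mismatched_sizes=True is needed to replace its classification head with a new four-class one.

There are 4 labels in this case:
0 = anger / 1 = joy / 2 = optimism / 3 = sadness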

In [None]:
model = DistilBertForSequenceClassification.from_pretrained(model_checkpoint, num_labels=4,ignore_mismatched_sizes=True)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased-finetuned-sst-2-english/snapshots/bfdd146ea2b6807255b73527f1327ca12b6ed5c4/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.26.0",
  "vocab_size": 30522
}

loading weights file pytorch_mo

Define the function that will later compute the model's accuracy:

In [None]:
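import evaluate
import numpy as np

# Load the metric once, rather than on every evaluation call.
metric = evaluate.load("accuracy")

def compute_metrics(eval_preds):
    # eval_preds unpacks into the model's logits and the reference labels.
    logits, labels = eval_preds
    # The predicted class is the index of the highest logit for each example.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)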
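
To make the computation concrete, here is a dependency-free sketch of what compute_metrics does, run on a few made-up logits: take the index of the largest logit in each row (the job np.argmax does with axis=-1) and count how often it matches the reference label. The helper name and all values are hypothetical.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

toy_logits = [
    [2.1, 0.3, -1.0, 0.2],   # highest score at index 0
    [0.1, 3.5, 0.4, -0.2],   # highest score at index 1
    [0.0, 0.2, 0.1, 1.9],    # highest score at index 3
    [1.2, 0.9, 0.3, 0.1],    # highest score at index 0
]
toy_labels = [0, 1, 3, 2]    # the last example is misclassified

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)  # 0.75
```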

Define the Trainer object with all the required parameters. Since the dataset was already split, we only need to set train_dataset=tokenized_dataset['train'] and eval_dataset=tokenized_dataset['test']

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3257
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1224
  Number of trainable parameters = 66956548


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.61738,0.788177
2,0.688400,0.701031,0.794511
3,0.320200,0.862743,0.790992


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1421
  Batch size = 8


Saving model checkpoint to distilbert_classificator/checkpoint-500
Configuration saved in distilbert_classificator/checkpoint-500/config.json
Model weights saved in distilbert_classificator/checkpoint-500/pytorch_model.bin
tokenizer config file saved in distilbert_classificator/checkpoint-500/tokenizer_config.json
Special tokens file saved in distilbert_classificator/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1421
  Batch size = 8
Saving model checkpoint to distilbert_classificator/checkpoint-1000
Configuration saved in distilbert_classificator/checkpoint-1000/config.json
Model weights saved in distilbert_classificator/checkpoint-1000/pytorch_model.bin
tokenizer config file sav

TrainOutput(global_step=1224, training_loss=0.452194734336504, metrics={'train_runtime': 81.19, 'train_samples_per_second': 120.347, 'train_steps_per_second': 15.076, 'total_flos': 98589273019104.0, 'train_loss': 0.452194734336504, 'epoch': 3.0})
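
The step count in the log follows from the dataset size and the batch size: each epoch takes one optimization step per batch, with the final partial batch rounded up. A quick check with the numbers reported in the log, assuming a single device and no gradient accumulation (as the log indicates):

```python
import math

num_examples = 3257   # training rows, from the log
batch_size = 8        # total train batch size, from the log
num_epochs = 3        # from the log

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 408 1224
```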

This model reaches an accuracy of about 79% on the test split (0.791 after the third epoch)
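
The per-epoch numbers are worth a second look. The sketch below re-enters the values from the training table by hand and picks out the epoch with the highest accuracy and the one with the lowest validation loss; the variable names are illustrative.

```python
# (epoch, validation_loss, accuracy) copied from the training table.
history = [
    (1, 0.61738, 0.788177),
    (2, 0.701031, 0.794511),
    (3, 0.862743, 0.790992),
]

best_accuracy = max(history, key=lambda row: row[2])
lowest_loss = min(history, key=lambda row: row[1])
loss_rises_every_epoch = all(
    later[1] > earlier[1] for earlier, later in zip(history, history[1:])
)

print("best accuracy: epoch", best_accuracy[0])
print("lowest validation loss: epoch", lowest_loss[0])
print("validation loss rises every epoch:", loss_rises_every_epoch)
```

Accuracy barely moves across epochs while the validation loss keeps growing, which may be a sign that the model starts to overfit after the first epoch.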

Now we save it to Hugging Face:

In [None]:
%cd distilbert_classificator
trainer.push_to_hub(commit_message="Training complete", tags="classification")

/content/distilbert_classificator


Cloning https://huggingface.co/leorena/distilbert_classificator into local empty directory.
Saving model checkpoint to distilbert_classificator
Configuration saved in distilbert_classificator/config.json
Model weights saved in distilbert_classificator/pytorch_model.bin
tokenizer config file saved in distilbert_classificator/tokenizer_config.json
Special tokens file saved in distilbert_classificator/special_tokens_map.json


remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/leorena/distilbert_classificator
   ab9114d..7806bdb  main -> main

To https://huggingface.co/leorena/distilbert_classificator
   7806bdb..60eec7b  main -> main



'https://huggingface.co/leorena/distilbert_classificator/commit/7806bdb52e9e58b3eee648f6be35b8c8667a3250'

The model has been saved at: https://huggingface.co/leorena/distilbert_classificator

Finally, we can now try the model on a few tweets that I took from Twitter:

In [None]:
from transformers import pipeline
classifier = pipeline('text-classification', model='leorena/distilbert_classificator')

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--leorena--distilbert_classificator/snapshots/60eec7b983d9701dee4a2c48e99ad07710e51100/config.json
Model config DistilBertConfig {
  "_name_or_path": "leorena/distilbert_classificator",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--leorena--distilbert_classificator/snapshots/60eec7b983d9701dee4a2c48e99ad07710e51100/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at leorena/distilbert_classificator.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.


loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--leorena--distilbert_classificator/snapshots/60eec7b983d9701dee4a2c48e99ad07710e51100/vocab.txt
loading file tokenizer.json from cache at None
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--leorena--distilbert_classificator/snapshots/60eec7b983d9701dee4a2c48e99ad07710e51100/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--leorena--distilbert_classificator/snapshots/60eec7b983d9701dee4a2c48e99ad07710e51100/tokenizer_config.json


As a reminder, the labels mean: 0 = anger / 1 = joy / 2 = optimism / 3 = sadness

I would like to know whether the classifier function can be made to report the emotion itself (e.g. 'JOY' instead of 'LABEL_1')
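
One way to get readable names, sketched under the assumption that the label order is the one listed in this notebook: translate the generic `LABEL_n` strings after the pipeline returns them. `EMOTIONS` and `rename_labels` are hypothetical names introduced for this example. The same `id2label` dictionary, together with its inverse `label2id`, can also be passed to `from_pretrained` next to `num_labels`, so that the names are stored in the model config and the pipeline reports them directly; that route would mean retraining or re-uploading the model.

```python
EMOTIONS = ["anger", "joy", "optimism", "sadness"]  # list index = label id

id2label = dict(enumerate(EMOTIONS))
label2id = {name: idx for idx, name in id2label.items()}

def rename_labels(results):
    """Replace 'LABEL_n' with the emotion name in pipeline-style results."""
    renamed = []
    for item in results:
        label_id = int(item["label"].rsplit("_", 1)[1])
        renamed.append({**item, "label": id2label[label_id].upper()})
    return renamed

# Same shape as the classifier output; the score here is illustrative.
sample = [{"label": "LABEL_1", "score": 0.99}]
readable = rename_labels(sample)
print(readable)  # [{'label': 'JOY', 'score': 0.99}]
```

With the pipeline from this notebook, the call would look like `rename_labels(classifier('some tweet'))`.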

In [None]:
classifier('Nahhhhh 😂!  I love y’all so much!')

[{'label': 'LABEL_1', 'score': 0.9962899684906006}]

In [None]:
classifier('smell me')

[{'label': 'LABEL_1', 'score': 0.7682599425315857}]

In [None]:
classifier('Beyoncé’s announcing a tour AND dropping Ivy Park in the same week??? oh MOTHER😭😭😭')

[{'label': 'LABEL_1', 'score': 0.995198667049408}]

In [None]:
classifier('The more I learn, the worse it gets. The world should know the truth of what has been happening at Twitter.')

[{'label': 'LABEL_3', 'score': 0.7551037669181824}]

In [None]:
classifier('WEF is increasingly becoming an unelected world government that the people never asked for and don’t want')

[{'label': 'LABEL_0', 'score': 0.986369252204895}]

In [None]:
classifier('Disingenuous response from Hamilton 68 regarding their fake claims of Russian interference')

[{'label': 'LABEL_0', 'score': 0.9965588450431824}]