# IIC-3670 NLP UC

____________________________________________________________________________________________________________

## In-class activity

We are going to fine-tune ALBERT.

- Download the **wikitext-2-raw-v1** corpus from the Hugging Face Hub datasets.
- Use **albert-base-v2** as the checkpoint.
- Fine-tune it with **MLM** (masked language modeling) and **mlm_probability=0.1**.
- Save the new model locally.
- Use your model to compute the encodings of the sentence "Replace me by any text you'd like.".
- When you finish, let me know so I can award you an **L (logrado, "achieved")**.
- Remember that each L earns a bonus toward the final course grade.


***You have until the end of class.***

_________________________________________________________________________________________________________________

# Solution

In [1]:
from huggingface_hub import notebook_login

notebook_login()


In [2]:
import transformers

print(transformers.__version__)

4.25.1


In [3]:
from transformers.utils import send_example_telemetry

send_example_telemetry("language_modeling_notebook", framework="pytorch")

In [4]:
from datasets import load_dataset

datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

Found cached dataset wikitext (C:/Users/marce/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)



In [5]:
model_checkpoint = "albert-base-v2"

In [6]:
class TokenizerWrapper:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
    
    def tokenize_function(self, examples):
        return self.tokenizer(examples["text"])

In [7]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [8]:
tokenizer_wrapper = TokenizerWrapper(tokenizer)

In [9]:
tokenized_datasets = datasets.map(tokenizer_wrapper.tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

Loading cached processed dataset at C:\Users\marce\.cache\huggingface\datasets\wikitext\wikitext-2-raw-v1\1.0.0\a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126\cache-23a667b61071c3c9_*_of_00004.arrow
Loading cached processed dataset at C:\Users\marce\.cache\huggingface\datasets\wikitext\wikitext-2-raw-v1\1.0.0\a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126\cache-b85be2eaf41864d8_*_of_00004.arrow
Loading cached processed dataset at C:\Users\marce\.cache\huggingface\datasets\wikitext\wikitext-2-raw-v1\1.0.0\a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126\cache-8ac38eaf266f3faf_*_of_00004.arrow


In [10]:
def group_texts(examples):
    block_size = 128
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
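To see what `group_texts` does to a batch, here is a minimal sketch on toy data. It repeats the same logic with a hypothetical `block_size=4` (the notebook uses 128) and made-up token ids; `group_texts_toy` and `toy_batch` are illustrative names, not part of the original notebook.

```python
def group_texts_toy(examples, block_size=4):
    # Concatenate each column (a list of lists) into one flat list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    # Split every column into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three "documents" of different lengths, 10 tokens in total (made-up ids).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_texts_toy(toy_batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Document boundaries disappear after grouping, and the last two tokens (9 and 10) are dropped because they do not fill a full block.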

In [11]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

Loading cached processed dataset at C:\Users\marce\.cache\huggingface\datasets\wikitext\wikitext-2-raw-v1\1.0.0\a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126\cache-e8e4924714701766_*_of_00004.arrow
Loading cached processed dataset at C:\Users\marce\.cache\huggingface\datasets\wikitext\wikitext-2-raw-v1\1.0.0\a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126\cache-46d7292d7411ffd9_*_of_00004.arrow
Loading cached processed dataset at C:\Users\marce\.cache\huggingface\datasets\wikitext\wikitext-2-raw-v1\1.0.0\a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126\cache-d1875b302eed7f0b_*_of_00004.arrow


In [12]:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [13]:
from transformers import Trainer, TrainingArguments

model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)

In [14]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.1)
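With `mlm_probability=0.1`, the collator picks roughly 10% of the tokens in each batch as prediction targets. The block below is a rough pure-Python sketch of the usual masking scheme, assuming the common 80/10/10 split: of the selected tokens, 80% become the mask token, 10% become a random token, and 10% stay unchanged, while unselected positions get the label -100 so the loss ignores them. `mask_tokens`, `mask_id=4`, and `vocab_size=30000` are illustrative assumptions; the real collator works on tensors, reads these ids from the tokenizer, and also skips special tokens.

```python
import random

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.1, seed=0):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for pos, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # not selected: input unchanged, label ignored
        labels[pos] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[pos] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


original = list(range(10, 1010))  # 1000 made-up token ids
masked, labels = mask_tokens(original, mask_id=4, vocab_size=30000)
selected = [i for i, label in enumerate(labels) if label != IGNORE_INDEX]
print(f"{len(selected)} of {len(original)} tokens selected as targets")
```

Because the collator masks on the fly, each epoch sees a different random selection of targets for the same text.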

In [15]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

In [16]:
trainer.train()

***** Running training *****
  Num examples = 20929
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 7851
  Number of trainable parameters = 11221680
You're using a AlbertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 2.0659 | 2.097129 |
| 2 | 1.9606 | 1.999555 |
| 3 | 1.8825 | 1.931062 |
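The validation losses reported above are mean cross-entropy values over the masked tokens, so exponentiating them gives a perplexity-style number that is easier to compare across epochs. This is a small sketch using only the three losses from this run; the variable names are illustrative, and because masking is random the figure will vary slightly between evaluations.

```python
import math

# Validation loss per epoch, copied from the training output of this run.
eval_losses = {1: 2.097129, 2: 1.999555, 3: 1.931062}

# Perplexity is the exponential of the mean cross-entropy loss.
perplexities = {epoch: math.exp(loss) for epoch, loss in eval_losses.items()}

for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: validation perplexity = {ppl:.2f}")
```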


Saving model checkpoint to albert-base-v2-finetuned-wikitext2\checkpoint-500
Configuration saved in albert-base-v2-finetuned-wikitext2\checkpoint-500\config.json
Model weights saved in albert-base-v2-finetuned-wikitext2\checkpoint-500\pytorch_model.bin
Saving model checkpoint to albert-base-v2-finetuned-wikitext2\checkpoint-1000
Configuration saved in albert-base-v2-finetuned-wikitext2\checkpoint-1000\config.json
Model weights saved in albert-base-v2-finetuned-wikitext2\checkpoint-1000\pytorch_model.bin
Saving model checkpoint to albert-base-v2-finetuned-wikitext2\checkpoint-1500
Configuration saved in albert-base-v2-finetuned-wikitext2\checkpoint-1500\config.json
Model weights saved in albert-base-v2-finetuned-wikitext2\checkpoint-1500\pytorch_model.bin
Saving model checkpoint to albert-base-v2-finetuned-wikitext2\checkpoint-2000
Configuration saved in albert-base-v2-finetuned-wikitext2\checkpoint-2000\config.json
Model weights saved in albert-base-v2-finetuned-wikitext2\checkpoint-20

TrainOutput(global_step=7851, training_loss=2.0259476300789707, metrics={'train_runtime': 559.2199, 'train_samples_per_second': 112.276, 'train_steps_per_second': 14.039, 'total_flos': 352775162769408.0, 'train_loss': 2.0259476300789707, 'epoch': 3.0})
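The step count and throughput in the log can be checked by hand. The sketch below redoes the arithmetic from the logged values, assuming the last partial batch of each epoch is kept rather than dropped (which is consistent with the 7851 steps reported).

```python
import math

# Values copied from the training log and TrainOutput of this run.
num_examples = 20929
batch_size = 8
num_epochs = 3
train_runtime_s = 559.2199

# 20929 / 8 is not an integer, so the last batch of each epoch is smaller.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```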

In [17]:
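# Note: save_model writes a directory (config.json and pytorch_model.bin),
# not a single HDF5 file, despite the ".h5" suffix in the name.
trainer.save_model("albert-v2-finetuned.h5")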

Saving model checkpoint to albert-v2-finetuned.h5
Configuration saved in albert-v2-finetuned.h5\config.json
Model weights saved in albert-v2-finetuned.h5\pytorch_model.bin


In [18]:
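from transformers import AlbertTokenizer, AlbertModel

# The Trainer was not given the tokenizer, so it was not saved with the model; reload it from the Hub.
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
# Load the fine-tuned weights into the bare encoder (without the MLM head).
model = AlbertModel.from_pretrained("albert-v2-finetuned.h5")

text = "Replace me by any text you'd like."
# Tokenize into PyTorch tensors and run a forward pass to get the encodings.
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)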
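The model output holds one vector per token in `last_hidden_state`, plus a `pooler_output`. If a single vector per sentence is wanted, one common option is to average the token vectors while ignoring padding. The sketch below shows that idea on toy nested lists; `mean_pool` and the toy values are illustrative and not part of the original solution. It is also worth assuming that `pooler_output` is less reliable here, since a checkpoint saved from a masked-LM model likely does not include trained pooler weights.

```python
def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring masked-out positions."""
    pooled = []
    for token_vectors, mask in zip(last_hidden_state, attention_mask):
        # Keep only the vectors whose attention mask is 1 (real tokens).
        kept = [vec for vec, m in zip(token_vectors, mask) if m == 1]
        hidden_size = len(token_vectors[0])
        pooled.append(
            [sum(vec[d] for vec in kept) / len(kept) for d in range(hidden_size)]
        )
    return pooled


# One toy sentence: three positions, hidden size 2, last position is padding.
toy_hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
toy_mask = [[1, 1, 0]]
sentence_vectors = mean_pool(toy_hidden, toy_mask)
print(sentence_vectors)  # [[2.0, 3.0]]
```

With the notebook's tensors, the inputs could be obtained with `output.last_hidden_state.detach().tolist()` and `encoded_input["attention_mask"].tolist()`; in practice the same average is usually computed directly on tensors.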

loading file spiece.model from cache at C:\Users\marce/.cache\huggingface\hub\models--albert-base-v2\snapshots\eab127d1a4a42b10e2b130e1834c64f5844d541e\spiece.model
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at C:\Users\marce/.cache\huggingface\hub\models--albert-base-v2\snapshots\eab127d1a4a42b10e2b130e1834c64f5844d541e\config.json
Model config AlbertConfig {
  "_name_or_path": "albert-base-v2",
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position

In [19]:
output

BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.0104,  0.1411, -0.0380,  ..., -0.0693,  0.0191,  0.0779],
         [-0.9227, -1.7339, -1.1261,  ...,  0.1697,  0.7790, -0.5530],
         [-0.3506,  0.7906, -0.1291,  ...,  1.1095,  1.0939,  0.2346],
         ...,
         [ 0.6949, -1.4691, -0.2719,  ...,  1.1329,  1.5574,  0.3873],
         [-0.0698,  0.0725, -0.4419,  ...,  0.8144,  0.7928, -0.1241],
         [ 0.0104,  0.1411, -0.0380,  ..., -0.0693,  0.0191,  0.0779]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-1.8701e-01, -5.9273e-01,  3.1621e-01, -3.0593e-01,  1.3583e-01,
          6.4248e-02,  1.5055e-01, -4.1616e-02, -1.4342e-02,  2.3935e-01,
          1.6858e-01,  3.4054e-01, -7.1437e-02, -1.0849e-01,  5.7066e-02,
          8.4060e-02,  1.7361e-01, -1.1483e-01, -3.4953e-01,  5.2143e-02,
          3.2306e-01, -1.2452e-01, -3.8437e-01, -4.3061e-01, -2.2502e-01,
         -1.9869e-01,  1.7905e-01, -1.2352e-01,  6.7367e-02,  2.2645e-01,
          2