<a href="https://colab.research.google.com/github/larajakl/MCMLR/blob/main/bonus_exercise_2_old_version.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bonus Exercises 2: Low-Rank Adaptation and Crosslingual Transfer**



This notebook represents the second bonus exercise for the lecture Multilingual and Crosslingual Methods and Language Resources (2024W 340168-1). For each successfully completed bonus exercise, a maximum of three points can be achieved that will be added to the points of the final exam. The tasks to be completed in the following notebook are marked with 👋 ⚒.

---




In this notebook, we will use Low-Rank Adaptation to fine-tune XLM-R on the task of linguistic acceptability in English and then test its zero-shot capability in other languages.

# **Make sure to set your runtime to GPU before you start training.**

(Tab: Runtime/Change Runtime Type -> Select GPU)

-----------
## **Fine-Tuning on English**

The first part has already been prepared for you. We will load and preprocess the Corpus of Linguistic Acceptability (CoLA) dataset from GLUE and then use Low-Rank Adaptation to fine-tune XLM-R.

### Installation

As always, we first need to install the necessary libraries. One that is new in this notebook is the Parameter-Efficient Fine-Tuning (PEFT) library.

In [None]:
!pip install -U evaluate
!pip install -U datasets
!pip install -U transformers
!pip install -U peft

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.17-py311-none-any.whl.metadata (7.2 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

### Loading the Dataset

In this notebook, we will first use the CoLA dataset from the GLUE benchmark and then a multilingual extension.
We will first train on English, then transfer to another language, and finally evaluate zero-shot transfer on one more language (see [here](https://huggingface.co/datasets/Geralt-Targaryen/MELA) for a selection).

In [None]:
from datasets import load_dataset

dataset_en = load_dataset("glue", "cola")
dataset_en.num_rows

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/35.3k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/251k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/37.6k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/37.7k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8551 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1043 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1063 [00:00<?, ? examples/s]

{'train': 8551, 'validation': 1043, 'test': 1063}

Let us take a look at the components of the dataset.

In [None]:
dataset_en['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['unacceptable', 'acceptable'], id=None),
 'idx': Value(dtype='int32', id=None)}

Hugging Face Datasets is designed to be interoperable with libraries like Pandas, as well as NumPy, PyTorch, TensorFlow, and JAX. To enable the conversion between various third-party libraries, Hugging Face Datasets provides a `Dataset.set_format()` function. This function only changes the output format of the dataset, so you can easily switch to another format without affecting the underlying data format, which is Apache Arrow. The formatting is done in-place, so let's convert our dataset to Pandas and look at a random sample:

In [None]:
from IPython.display import display, HTML

dataset_en.set_format("pandas")
df = dataset_en["train"][:]
# Create a random sample
sample = df.sample(n=5, random_state=42)
display(HTML(sample.to_html()))

index,sentence,label,idx
2389,Angela characterized Shelly as a lifesaver.,1,2389
5048,They're not finding it a stress being in the same office.,1,5048
3133,Paul exhaled on Mary.,0,3133
5955,I ordered if John drink his beer.,0,5955
625,Press the stamp against the pad completely.,1,625


The Pandas dataframe can now be used as we would always use Pandas, for instance to count how often each value occurs in the column `label`.

In [None]:
df["label"].value_counts()

label,count
1,6023
0,2528


We can see that the two labels are not evenly distributed: 6,023 of the 8,551 training sentences (about 70%) are labelled acceptable (1) and 2,528 unacceptable (0).
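To judge the accuracy numbers reported later, it helps to know what a trivial classifier that always predicts the most frequent label would reach. A minimal sketch using the training-set counts printed above; the helper name `majority_baseline` is made up for this illustration:

```python
# Label counts of the CoLA training split, copied from the output above.
label_counts = {1: 6023, 0: 2528}

def majority_baseline(counts):
    """Return the most frequent label and the accuracy of always predicting it."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total

majority_label, baseline_accuracy = majority_baseline(label_counts)
print(f"Always predicting label {majority_label}: {baseline_accuracy:.1%} accuracy on the training split")
```

A fine-tuned model should clearly exceed this baseline; an accuracy close to it suggests that the model may simply be predicting one class.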

This was just a brief detour to show how datasets can be nicely manipulated and displayed using other libraries. We will now get back to our usual datasets library from Hugging Face. To this end, we will reset the format.

In [None]:
dataset_en.reset_format()

### Preprocessing the Dataset

In this example, we model CoLA as a binary classification task over single sentences. Thus, we encode each sentence as one input to our `xlm-roberta-large` model. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches.

In [None]:
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
batch_size = 32

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding=True, truncation=True)

def preprocess_dataset(dataset):
  token_dataset = dataset.map(tokenize_function, batched=True, batch_size=batch_size)
  tokenized_dataset = token_dataset.rename_column("label", "labels")
  return tokenized_dataset

tokenized_dataset_en = preprocess_dataset(dataset_en)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/616 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

Map:   0%|          | 0/8551 [00:00<?, ? examples/s]

Map:   0%|          | 0/1043 [00:00<?, ? examples/s]

Map:   0%|          | 0/1063 [00:00<?, ? examples/s]

In [None]:
tokenized_dataset_en["train"][1]

{'sentence': "One more pseudo generalization and I'm giving up.",
 'labels': 1,
 'idx': 1,
 'input_ids': [0,
  6561,
  1286,
  74189,
  4537,
  47691,
  136,
  87,
  25,
  39,
  68772,
  1257,
  5,
  2,
  1,
  1,
  1,
  1,
  1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]}
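The trailing `1`s in `input_ids` are padding tokens, and the matching `0`s in `attention_mask` tell the model to ignore them. A minimal pure-Python sketch of this padding step, assuming the padding id 1 seen in the output above; `pad_batch` is a hypothetical helper for illustration, not part of the tokenizer API:

```python
PAD_ID = 1  # padding id, as seen in the tokenized example above

def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        masks.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": masks}

# Two toy sequences of different length (illustrative token ids;
# 0 and 2 mark the start and end of a sentence).
batch = pad_batch([[0, 6561, 1286, 2], [0, 87, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because padding is applied per batch, the same sentence can end up with a different number of padding tokens depending on which other sentences share its batch.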

-----------
## **Low-Rank Adaptation (LoRA)**

In order to perform low-rank adaptation (LoRA) on a pretrained language model for parameter-efficient fine-tuning (PEFT), we need to set a few parameters in the LoRA Configuration. Hugging Face offers some [documentation on LoRA](https://huggingface.co/docs/peft/main/en/developer_guides/lora).

The `task_type` specifies which task the model should be fine-tuned on and needs to correspond to the way the model is loaded. If we load a model for sequence classification, the task type also needs to be `SEQ_CLS`, an abbreviation for sequence classification. The dataset then needs to provide an input sequence and a number of target classes.

The `target_modules` setting depends on the type of model; for XLM-R we use `["query", "value"]`. Since we wish to change model parameters, the inference mode is set to false. The variable `r` is the rank of the low-rank matrices that LoRA adds to each target module. `lora_alpha` is a scaling parameter: the LoRA update is scaled by `lora_alpha / r`, so the scaling is 1.0 when both have the same value. With small datasets or if unsure, the rank and alpha can be the same. Finally, dropout randomly sets a fraction of the values passing through the LoRA layers to zero during training, mostly to avoid overfitting.
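To make the roles of the rank and the scaling parameter concrete: LoRA keeps the pretrained weight matrix W frozen and learns two small matrices A (r × d) and B (d × r), so that a layer computes W·x + (lora_alpha / r)·B·A·x. The following toy sketch in plain Python illustrates that formula with made-up numbers (d = 2, r = 1). It is not how PEFT implements LoRA internally, and all names are illustrative.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen base output plus the scaled low-rank update B @ (A @ x)."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weights (d x d)
A = [[0.5, 1.0]]               # trainable, shape r x d
x = [2.0, 1.0]

# B starts at zero, so training begins from the pretrained behaviour.
B_init = [[0.0], [0.0]]
out_init = lora_forward(W, A, B_init, x, r=1, lora_alpha=1)

# Once B has been updated, the output shifts away from the base output.
B_trained = [[1.0], [2.0]]
out_trained = lora_forward(W, A, B_trained, x, r=1, lora_alpha=1)
print(out_init, out_trained)
```

Only A and B receive gradients, which is why the number of trainable parameters grows with `r` rather than with the full size of W.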

Feel free to play with and adapt these parameters if you are interested in seeing the effect.


In [None]:
from peft import LoraConfig, PeftType, get_peft_model
from transformers import AutoModelForSequenceClassification

peft_type = PeftType.LORA
peft_config = LoraConfig(task_type="SEQ_CLS", target_modules=["query", "value"], inference_mode=False, r=32, lora_alpha=32, lora_dropout=0.1)
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-large")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
model

model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 4,197,378 || all params: 564,089,860 || trainable%: 0.7441


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): XLMRobertaForSequenceClassification(
      (roberta): XLMRobertaModel(
        (embeddings): XLMRobertaEmbeddings(
          (word_embeddings): Embedding(250002, 1024, padding_idx=1)
          (position_embeddings): Embedding(514, 1024, padding_idx=1)
          (token_type_embeddings): Embedding(1, 1024)
          (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): XLMRobertaEncoder(
          (layer): ModuleList(
            (0-23): 24 x XLMRobertaLayer(
              (attention): XLMRobertaAttention(
                (self): XLMRobertaSdpaSelfAttention(
                  (query): lora.Linear(
                    (base_layer): Linear(in_features=1024, out_features=1024, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
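The trainable-parameter count printed above can be reproduced by hand. This sketch assumes the sizes visible in the model printout (24 layers, hidden size 1024), LoRA on `query` and `value` with `r=32`, and that the classification head (a 1024×1024 dense layer plus a 2-class output layer, both with bias) is trained as well, which is how the reported total works out:

```python
hidden_size = 1024     # in_features / out_features of the query and value layers
num_layers = 24        # "(0-23): 24 x XLMRobertaLayer" in the printout
r = 32                 # LoRA rank from the LoraConfig
num_labels = 2         # unacceptable / acceptable
targets_per_layer = 2  # query and value

# Each adapted layer gets A (r x hidden) and B (hidden x r).
lora_params = num_layers * targets_per_layer * (r * hidden_size + hidden_size * r)

# Classification head: dense (hidden x hidden + bias) and out_proj (hidden x labels + bias).
head_params = (hidden_size * hidden_size + hidden_size) + (hidden_size * num_labels + num_labels)

trainable = lora_params + head_params
all_params = 564_089_860  # total reported by print_trainable_parameters()
print(f"trainable params: {trainable:,} || trainable%: {100 * trainable / all_params:.4f}")
```

Doubling `r` doubles only the LoRA part of the count; the classification head stays the same size.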
     

In [None]:
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer, EvalPrediction
from datasets import concatenate_datasets

num_train_epochs = 5
logging_steps = len(tokenized_dataset_en["train"]) // (batch_size * num_train_epochs)
accuracy = evaluate.load("accuracy")

training_args = TrainingArguments(
    learning_rate=3e-4,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=logging_steps,
    output_dir="./training_output",
    overwrite_output_dir=True,
    report_to='none',
    load_best_model_at_end=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=True,
)

def compute_metrics(eval_pred):
    """Called at the end of validation. Gives accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # calculates the accuracy
    return accuracy.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_en["train"],
    eval_dataset=tokenized_dataset_en["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Once we have configured the model with PEFT, we can train the PEFT model as usual.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.6419,0.621638,0.691275
2,0.6186,0.618164,0.691275
3,0.606,0.618191,0.691275
4,0.6068,0.627176,0.691275
5,0.6105,0.620399,0.691275


TrainOutput(global_step=1340, training_loss=0.6177077933923522, metrics={'train_runtime': 640.516, 'train_samples_per_second': 66.751, 'train_steps_per_second': 2.092, 'total_flos': 3048801241729104.0, 'train_loss': 0.6177077933923522, 'epoch': 5.0})
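The `global_step` in the training output can be checked against the dataset size. A short sketch, assuming a single GPU and no gradient accumulation:

```python
import math

train_examples = 8551   # size of the CoLA training split
batch_size = 32
num_train_epochs = 5

# The last, smaller batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Same formula as in the training cell above.
logging_steps = train_examples // (batch_size * num_train_epochs)
print(steps_per_epoch, total_steps, logging_steps)
```

With these settings the training loss is logged roughly five times per epoch.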

👋 ⚒ Evaluate on the English test set to see how well the fine-tuning has worked.

In [None]:
# Your code for the evaluation here

# Note: the GLUE test split of CoLA ships without gold labels (every label is -1).
# Computing the loss on it is the likely cause of "CUDA error: device-side assert
# triggered", so accuracy is reported on the labelled validation split instead.
new_trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=tokenized_dataset_en["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Evaluate the model
test_results = new_trainer.evaluate()
print(test_results)


## **Crosslingual Transfer**

In this section, we will use the Multilingual Evaluation of Linguistic Acceptability ([MELA](https://github.com/sjtu-compling/mela?tab=readme-ov-file)) dataset, which is also [available on Hugging Face](https://huggingface.co/datasets/Geralt-Targaryen/MELA), to test the transfer and zero-shot capabilities of XLM-R with LoRA fine-tuning.

We will first fine-tune on German and then test on German, but also in a zero-shot approach on another language of your choice.

Please be aware of the fact that MELA "only" offers a dev and a test set - no train, validation, test split. Thus, the preprocessing needs to be slightly adapted.

In [None]:
from datasets import load_dataset, DatasetDict

de = load_dataset("Geralt-Targaryen/MELA", "de")
dataset_de = preprocess_dataset(de)
print(dataset_de["test"][20])

{'idx': 'c1-1.1_n9-a', 'labels': 1, 'sentence': 'Wenn du glaubst, dass er sich geirrt habe, kannst du dann alles verstehen', 'input_ids': [0, 7896, 115, 24682, 5829, 271, 4, 1421, 72, 833, 6, 128696, 3198, 3260, 4, 32540, 115, 3700, 4174, 85516, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


👋 ⚒ Use the German dev partition to further fine-tune the previously configured model and then evaluate on the test partition of the German dataset.

In [None]:
print(len(dataset_de["dev"]))
print(len(dataset_de["test"]))

100
945


In [None]:
'''
Interesting to see that the dev partition is so much smaller than the test partition!
'''

In [None]:
# Preprocessing done using previous preprocessing function (in previous cell), and now I split dev into train and validation:

def split_dataset(dataset, train_ratio=0.8):
    """
    Splits the 'dev' set into 'train' and 'validation' sets while keeping the 'test' set intact.

    Args:
        dataset (DatasetDict): The preprocessed dataset containing 'dev' and 'test'.
        train_ratio (float): The proportion of the 'dev' set to use for training. Rest goes to validation.

    Returns:
        DatasetDict: A dataset dictionary with 'train', 'validation', and 'test' splits.
    """
    # Extract 'dev' and 'test' sets
    dev_set = dataset['dev']
    test_set = dataset['test']

    # Calculate split sizes
    train_size = int(len(dev_set) * train_ratio)
    val_size = len(dev_set) - train_size

    # Shuffle and split the dev set
    dev_set = dev_set.shuffle(seed=42)  # Ensure reproducibility
    train_set = dev_set.select(range(train_size))
    val_set = dev_set.select(range(train_size, train_size + val_size))

    # Create the new DatasetDict
    split_dataset = DatasetDict({
        'train': train_set,
        'validation': val_set,
        'test': test_set
    })

    return split_dataset


# Split the preprocessed dataset
final_dataset = split_dataset(dataset_de)

# Inspect the final splits
print(final_dataset)



DatasetDict({
    train: Dataset({
        features: ['idx', 'labels', 'sentence', 'input_ids', 'attention_mask'],
        num_rows: 80
    })
    validation: Dataset({
        features: ['idx', 'labels', 'sentence', 'input_ids', 'attention_mask'],
        num_rows: 20
    })
    test: Dataset({
        features: ['idx', 'labels', 'sentence', 'input_ids', 'attention_mask'],
        num_rows: 945
    })
})


In [None]:
# Fine-tuning on German dataset:
num_train_epochs = 5
batch_size = 5 # smaller batch size needed because only 80 samples in train set
logging_steps = len(final_dataset["train"]) // (batch_size * num_train_epochs)
accuracy = evaluate.load("accuracy")

training_args_German = TrainingArguments(
    learning_rate=3e-4,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=logging_steps,
    output_dir="./training_output_German",
    overwrite_output_dir=True,
    report_to='none',
    load_best_model_at_end=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=True,
)

def compute_metrics(eval_pred):
    """Called at the end of validation. Gives accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # calculates the accuracy
    return accuracy.compute(predictions=predictions, references=labels)

trainer_German = Trainer(
    model=model,
    args=training_args_German,
    train_dataset=final_dataset["train"],
    eval_dataset=final_dataset["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Start the actual fine-tuning on the German training split.
# If an earlier cell raised "CUDA error: device-side assert triggered", restart the
# runtime first: the GPU context stays unusable until then, and this cell fails too.
trainer_German.train()


In [None]:
# Evaluation on test set of German dataset

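new_trainer_German = Trainer(
    model=model,
    args=training_args_German,
    eval_dataset=final_dataset["test"],
    # Pad each batch dynamically so sequences of different length can be stacked
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)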

# Evaluate the model
test_results = new_trainer_German.evaluate()
print(test_results)

👋 ⚒ Select another language of your choice from the [MELA dataset](https://huggingface.co/datasets/Geralt-Targaryen/MELA) to only evaluate the fine-tuned model (zero-shot capability).

**Alternative**: Feel free to create your own mini-dataset of a few (non)-acceptable sentences in a language of your choice to test the model's zero-shot capacity.

In [None]:
# French MELA dataset:
fr = load_dataset("Geralt-Targaryen/MELA", "fr")
dataset_fr = preprocess_dataset(fr)
print(dataset_fr["test"][20])

# Split the preprocessed dataset (only the test split is needed for zero-shot evaluation)
final_dataset_French = split_dataset(dataset_fr)

# Inspect the final splits of the French dataset
print(final_dataset_French)


{'idx': 'c2.1.1_n8-b-3', 'labels': 1, 'sentence': 'Ce sont des œils-de-perdix', 'input_ids': [0, 1845, 2045, 224, 6, 52908, 7870, 9, 112, 9, 1264, 428, 425, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [None]:
# Evaluate using zero-shot:

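new_trainer_French = Trainer(
    model=model,
    args=training_args_German,
    eval_dataset=final_dataset_French["test"],
    # Pad each batch dynamically so sequences of different length can be stacked
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)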

# Evaluate the model
test_results = new_trainer_French.evaluate()
print(test_results)