# Language Models 2 | Fine-tuning

Modified from [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/training).

Note: this is *meant to be used on Colab*, simply because you will need a fair amount of GPU memory to get these models running! Have a look at [the PyTorch part of `setup.md`](https://github.com/jchwenger/DMLCP/blob/main/setup.md#pytorch--huggingfacegradio) for details on how to make a Huggingface environment on your own machine.

## Install & Workflow

#### Drive

If you need to load/save to your drive:

```python
import sys
if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive/')

import os
os.chdir('drive/My Drive/IS53055B-DMLCP/DMLCP/python') # to change to another directory
```

#### Huggingface login

For some models and datasets, and if you want to push your model to HF (same as GitHub, but for models) you need to be logged into your HF account.

For that, you need to create an account [here](https://huggingface.co/) and then go to [`/settings/tokens`](https://huggingface.co/settings/tokens) to create an access token.

```python
from pathlib import Path
from huggingface_hub import notebook_login
if not (Path.home()/'.huggingface'/'token').exists():
    notebook_login()
```

#### Install

1. On Colab:

```python
!pip install -Uq transformers datasets accelerate
```

2. Locally, the install is the same as the one used for Language models, see [`setup.md`](https://github.com/jchwenger/DMLCP/blob/main/setup.md#pytorch--huggingfacegradio).

## Imports

In [None]:
import os

import torch

# Get cpu, gpu or mps device for training.
# See: https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html#creating-models
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

from transformers import pipeline
from transformers import GenerationConfig

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

from datasets import load_dataset

from transformers import Trainer
from transformers import TrainingArguments

In [None]:
import textwrap # The textwrap module automatically formats text for you

tw = textwrap.TextWrapper(   # many more options, see them with textwrap.TextWrapper?
    width=79,                # the formatted width we want
    replace_whitespace=False # this will keep whitespace & line breaks in the original text
)

def wrap_print(s):
    """Format text into Textwrapped lines and print it"""
    print("\n".join(tw.wrap(s)))

## Test the raw gpt-2 model

Huggingface uses [pipelines](https://huggingface.co/docs/transformers/pipeline_tutorial) as the unified high-level API for inference.

You can choose from a [large amount of tasks](https://huggingface.co/tasks), and, in each task, from a [huge amount of models](https://huggingface.co/models?sort=trending)! Not only that, but you can also find loads and loads of [datasets](https://huggingface.co/datasets?task_categories=task_categories:text-generation&sort=trending) readily available, and already classified by tasks. In our case, we will be using the [`tiny_shakespeare`](https://huggingface.co/datasets/tiny_shakespeare) dataset, originally created by [Andrej Karpathy](https://karpathy.ai/) to test language models.

In this case, you can see [here](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending) the list of models for `text-generation`, and in the search bar if you type `gpt`, the "classic" `gpt2` model will appear. This is the smallest GPT-2 variant, with around 120 million parameters, close in size to GPT-1 (the largest GPT-2 has 1.5 billion, and GPT-3 175 billion...).

One thing that you can test on Colab directly is to switch between:

- `gpt2` (\~120mio),  
- `gpt2-medium` (\~350mio),  
- `gpt2-large` (\~770mio) and  
- `gpt2-xl` (\~1.5bn) parameters  

and see if you notice a difference in the text quality. Without more advanced methods like freezing layers, or quantization + parameter-efficient methods, only the smallest one will fit in a free T4 Colab GPU!
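
As a rough check on the sizes listed above, the parameter count of a GPT-2-style model can be estimated from its depth and width. This is a sketch that is not part of the original tutorial: the helper name `gpt2_param_count` is hypothetical, and the layer counts and hidden sizes are assumptions taken from the published GPT-2 configurations (vocabulary of 50257 tokens, 1024 positions, output layer sharing its weights with the token embedding).

```python
def gpt2_param_count(n_layer, d_model, vocab_size=50257, n_positions=1024):
    """Estimate the number of parameters of a GPT-2-style decoder."""
    # token embeddings + learned position embeddings
    embeddings = vocab_size * d_model + n_positions * d_model
    # per block: attention (4 d^2 + 4 d), MLP (8 d^2 + 5 d), two layer norms (4 d)
    per_block = 12 * d_model**2 + 13 * d_model
    # one last layer norm after the final block (weight + bias)
    final_layer_norm = 2 * d_model
    return embeddings + n_layer * per_block + final_layer_norm

# (n_layer, d_model) pairs: assumed from the published configurations
configs = {
    "gpt2": (12, 768),
    "gpt2-medium": (24, 1024),
    "gpt2-large": (36, 1280),
    "gpt2-xl": (48, 1600),
}

for name, (n_layer, d_model) in configs.items():
    print(f"{name:12} ~{gpt2_param_count(n_layer, d_model) / 1e6:.0f} million parameters")
```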

**Warning**: these models were trained on large portions of the Internet (more specifically web pages linked from upvoted Reddit posts...). Their output is biased, and often offensive, in the same way as the Internet is!

Note: HuggingFace now requires you to define your configuration for generation in advance in an object that you pass to your generator. See [the documentation](https://huggingface.co/docs/transformers/main_classes/text_generation).

In [None]:
MODEL_ID = "gpt2"

generator = pipeline(
    'text-generation', # it is quite easy to change task if you want to!
    model=MODEL_ID,      # here you can choose your model from: https://huggingface.co/models?pipeline_tag=text-generation&sort=trending
    device=device      # device allows you to choose the device to run your model on
)

# our configuration
generation_config = GenerationConfig.from_pretrained(MODEL_ID)
generation_config.pad_token_id = generation_config.eos_token_id
generation_config.max_length = 250
generation_config.do_sample = True
generation_config.top_p = 0.95
generation_config.temperature = .9
generation_config.batch_size = 1

# this should *not* sound like Shakespeare...
wrap_print(
    generator(
        "MERCUTIO:",
        generation_config=generation_config,
        num_return_sequences=1
    )[0]['generated_text'] # [0] to select the first element (you can generate batches),
)                          # this yields an object, and the text lives in 'generated_text'
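
To get a feel for what `temperature` and `top_p` do in the configuration above, here is a conceptual sketch in plain Python, working on a toy list of scores. It is not the library's implementation (details such as how the boundary token is handled may differ), and the function name `sample_top_p` and the toy logits are made up for illustration.

```python
import math
import random

def sample_top_p(logits, temperature=0.9, top_p=0.95, rng=random):
    """Sample an index from `logits` with temperature and nucleus (top-p) filtering."""
    # temperature: divide the logits before the softmax (lower = sharper distribution)
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-p: keep the most likely tokens until their cumulative probability reaches top_p
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # sample among the kept tokens, proportionally to their probability
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

toy_logits = [4.0, 2.0, 1.0, -3.0]  # hypothetical scores for a 4-token vocabulary
samples = [sample_top_p(toy_logits) for _ in range(20)]
print(samples)
```

Lowering `top_p` shrinks the set of candidate tokens, while raising `temperature` flattens the distribution and makes unlikely tokens more probable.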

#### Note on memory

If you try several models, before you do any training it is a good idea to clear the memory, using this:
```python
del generator            # del to delete any Python object
torch.cuda.empty_cache() # PyTorch to clear GPU mem
                         # (it can take a few moments to be executed!)
```

---

## 1. Training in Python, with a Trainer class

Ported from the [Causal language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling) and [Fine-tune a language model](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) tutorials. ([*So many other tutorials here, for all sorts of models*](https://huggingface.co/docs/transformers/notebooks)...)

### Load tokenizer & model

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# removes some errors with the data collator
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id # use EOS (end of sequence) as padding

model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to(device) # manually move the model to the GPU

In [None]:
FINETUNED_MODEL_ID = "gpt2.shak" # change if you need or work with another dataset!

### Load & prepare dataset

In [None]:
dataset_raw = load_dataset("tiny_shakespeare")

print(dataset_raw)
print("-" *40)
print("Number of characters per split:")
print([(split, len(dataset_raw[split]["text"][0])) for split in dataset_raw])
print("-" *40)
print(dataset_raw["train"]["text"][0][:250])

See [here](https://huggingface.co/learn/nlp-course/chapter5/3#the-map-methods-superpowers) for an explanation of the `batched=True` argument.

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

dataset_tok = dataset_raw.map(
    tokenize_function,
    batched=True,           # 'batched' passes batches of examples (dicts of lists) to the function
    remove_columns=["text"]
)

# the content of our "text" now lives in "input_ids", and since we only have one example
# with all the text in it, we select it with [0] to see up to 40 tokens
print(dataset_tok["train"]["input_ids"][0][:40])
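
To make the idea of a batched `map` more concrete, here is a small simulation in plain Python of what the mapped function receives: a dictionary of lists, one list per column, rather than one example at a time. It is only a sketch of the concept, not how `datasets` is implemented; `toy_batched_map` and `toy_tokenize` are hypothetical names, and the whitespace split stands in for the real tokenizer.

```python
def toy_batched_map(rows, fn, batch_size=1000):
    """Apply `fn` to column-wise batches of `rows` (a list of dicts)."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # a batch is a dict of lists: one list per column
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        # turn the result back into rows; fn may return more or fewer rows than it got
        n_out = len(next(iter(result.values())))
        out.extend({key: result[key][i] for key in result} for i in range(n_out))
    return out

def toy_tokenize(batch):
    # stand-in for the real tokenizer: split each text on whitespace
    return {"input_ids": [text.split() for text in batch["text"]]}

rows = [{"text": "to be or"}, {"text": "not to be"}, {"text": "that is"}]
mapped = toy_batched_map(rows, toy_tokenize, batch_size=2)
print(mapped)
```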

In [None]:
# https://github.com/huggingface/transformers/blob/5936c8c57ccb2bda3b3f28856a7ef992c5c9f451/examples/pytorch/language-modeling/run_clm.py#L516
# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.

block_size = tokenizer.max_len_single_sentence # 1024

def group_texts(ds):
    # Concatenate all texts.
    concat_ds = {k: sum(ds[k], []) for k in ds.keys()}
    total_length = len(concat_ds[list(ds.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concat_ds.items()
    }
    # Important note: during training, the two sequences will be shifted by one,
    # so that the model predicts the next step for each step. This is done
    # automatically by Huggingface internally
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = dataset_tok.map(
    group_texts,
    batched=True,
)

print(f"Now our dataset contains {len(lm_dataset['train']['input_ids'])} examples,")
print(f"each of length {len(lm_dataset['train']['input_ids'][0])} tokens (that we can feed in batches)...")
print(f"(the 'block_size', aka 'attention window' of our model is {block_size=})")
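
The chunking and the shift by one described above can be tried on a handful of integers. This is a simplified sketch in plain Python, with a hypothetical helper `chunk_tokens` and a tiny block size of 4; unlike `group_texts`, it handles a single column and always drops the remainder.

```python
def chunk_tokens(token_lists, block_size):
    """Concatenate token lists and cut them into blocks of `block_size`."""
    flat = [tok for tokens in token_lists for tok in tokens]
    usable = (len(flat) // block_size) * block_size  # drop the small remainder
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]

# two toy "documents" of 5 and 6 tokens: 11 tokens in total
blocks = chunk_tokens([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]], block_size=4)
print(blocks)

# the shift by one: at each position, the target is the next token
for block in blocks:
    inputs, targets = block[:-1], block[1:]
    print(inputs, "->", targets)
```

With 11 tokens and a block size of 4, two full blocks are kept and the last three tokens are dropped.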

### The Fine-Tuning

In [None]:
BATCH_SIZE = 3 # I'm able to fine-tune the 120mio gpt2 on a 6GB GPU laptop...

training_args = TrainingArguments(
    output_dir=FINETUNED_MODEL_ID,
    eval_strategy="epoch",          # called `evaluation_strategy` in older transformers versions
    learning_rate=3e-5,             # small learning rate
    weight_decay=0.01,
    logging_steps=5,
    # max_steps=100,                # overrides num_train_epochs, to train even less than one epoch!
    num_train_epochs=5,
    per_device_train_batch_size=BATCH_SIZE,
    save_strategy="epoch",
    load_best_model_at_end=True,
    # push_to_hub=True,              # uncomment to push to your HF account
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset['train'],
    eval_dataset=lm_dataset['validation'],
)

In [None]:
history = trainer.train()

Note that in the training above I did 5 epochs, which is overkill! Models tend to suffer from "catastrophic forgetting" when you fine-tune, and it's often interesting to train for as little as possible to explore the change in the network output, before training a bit more.
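
To decide how short a run to try with `max_steps`, it helps to know how many optimizer steps make up one epoch. Here is a minimal sketch, assuming a single device and no gradient accumulation; the helper name `steps_per_epoch` and the figure of 500 examples are hypothetical (use the number of examples printed for your own dataset).

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of optimizer steps in one pass over the data (last batch may be smaller)."""
    return math.ceil(n_examples / batch_size)

n_examples = 500  # hypothetical: replace with len(lm_dataset['train'])
batch_size = 3    # same value as BATCH_SIZE above

per_epoch = steps_per_epoch(n_examples, batch_size)
print(f"one epoch: {per_epoch} steps, five epochs: {5 * per_epoch} steps")
```

Any `max_steps` below the first number then trains for less than one epoch.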

#### Note on memory (again)

When testing things and running into memory issues, as before one (imperfect) way to solve this is to delete the `trainer` variable and clear the GPU cache like so (otherwise, just restart the runtime and re-run only the cells you need):

```python
del trainer
torch.cuda.empty_cache()
```

### Testing the fine-tuned model

Here I go through the lower level steps of encoding text with our tokenizer, creating a batch of prefixes, defining a config, then generating using the model, instead of using a pipeline!

In [None]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('MERCUTIO:', return_tensors='pt')

batched_input_ids = input_ids.repeat(2,1).to(device) # copy and place on GPU

# same logic as before
generation_config = GenerationConfig.from_pretrained(MODEL_ID)
generation_config.pad_token_id = generation_config.eos_token_id
generation_config.max_length = 100
generation_config.do_sample = True
generation_config.top_p = 0.98
generation_config.temperature = .9

# generate text using the same config as before
output = model.generate(
    batched_input_ids,
    generation_config=generation_config
)

# decode back from tokens to text
texts = tokenizer.batch_decode(output, skip_special_tokens=True)

# print
for t in texts:
    wrap_print(t)
    print("-" * 40)

### Saving your model

If you want to save your model using the trainer, you can just do:

```python
trainer.save_model(FINETUNED_MODEL_ID) # or some other directory
```

To save the model and tokenizer manually, you can do:

```python
tokenizer.save_pretrained(FINETUNED_MODEL_ID)
model.save_pretrained(FINETUNED_MODEL_ID)
```

#### Note on Huggingface Hub

[Share a model Huggingface tutorial](https://huggingface.co/docs/transformers/model_sharing)

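 Huggingface models, tokenizers, and trainers all have a `.push_to_hub()` method, but the `trainer` will be the one saving everything you need. Models and tokenizers take the repository name as first argument (`.push_to_hub('my-model')`), whereas the `Trainer` names the repository after its output folder (or the `hub_model_id` training argument).

 You can push your finetuned pipeline like so:

 ```python
 trainer.push_to_hub() # the repository is named after `output_dir`, here FINETUNED_MODEL_ID
 ```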

 Once the model is on the hub, you can create a new pipeline and access your model from anywhere using `model='username/your-model-id'` (or any name you used for the output folder). You can also use the folder where you saved your model (`model='/path/to/your/model'`).

 For a full course with videos and notebooks, check the [NLP Course](https://huggingface.co/learn/nlp-course/chapter1/1).

In [None]:
HF_HUB_MODEL_ID = "jchwenger/gpt2.shak" # replace with your own

generator = pipeline(
    'text-generation',
    model=HF_HUB_MODEL_ID, # download the model from the hub
    tokenizer=tokenizer,         # this time, we need to specify which tokenizer to use
    device=device
)

wrap_print(
    generator(
        "MERCUTIO:",
        generation_config=generation_config,
        num_return_sequences=1
    )[0]['generated_text']
)

---

## Experiments

### Search for other datasets

`tiny_shakespeare` is obviously not the only dataset available on the Hub. Another example I came across a few times recently is the [`english_quotes` dataset by Abirate](https://huggingface.co/datasets/Abirate/english_quotes). You can load it like so:

```python
dataset_raw = load_dataset("Abirate/english_quotes")
```

The one thing to watch out for is that the text lives in `"quote"`, not in `"text"`! Your `tokenize_function`, for instance, should then work on `examples["quote"]` rather than `examples["text"]`, and the same goes for the rest of this code (for instance `remove_columns` and the print statements).
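
One way to avoid hard-coding the column name is to build the tokenizing function from it. The sketch below uses a hypothetical helper, `make_tokenize_function`, and a stand-in whitespace tokenizer so that it runs on its own; in the notebook you would pass the real `tokenizer` instead.

```python
def make_tokenize_function(tokenizer, column="text"):
    """Return a function for `dataset.map` that tokenizes the given column."""
    def tokenize_function(examples):
        return tokenizer(examples[column])
    return tokenize_function

# stand-in for the real tokenizer, only here to make the sketch runnable
def toy_tokenizer(texts):
    return {"input_ids": [t.split() for t in texts]}

tokenize_quotes = make_tokenize_function(toy_tokenizer, column="quote")
print(tokenize_quotes({"quote": ["Be yourself", "So it goes"]}))
```

With the real tokenizer, the call would look like `dataset_raw.map(make_tokenize_function(tokenizer, "quote"), batched=True, remove_columns=dataset_raw["train"].column_names)`, which also drops the other columns of the dataset.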

### Freezing layers

One strategy to save memory and to reduce the impact of fine-tuning is to fine-tune only *some* layers of the model (usually the top ones, leaving the base 'frozen'). You can achieve that by:

1. Looking at a list of all your layers:

```python
for i, (name, params) in enumerate(model.named_parameters()):
    print(i, name)
```
2. Setting the `.requires_grad` attribute to `False` for most layers. Here I just loop through them and freeze the first 122 parameter tensors (see [this link](https://discuss.huggingface.co/t/freeze-lower-layers-with-auto-classification-model/11386/2)):

```python
layer_threshold = 122
for i, (name, params) in enumerate(model.named_parameters()):
    if i < layer_threshold:
        # print(f"freezing: {name}")
        params.requires_grad = False
```

After doing this, you should notice that the memory footprint on the GPU during training is much smaller, since no gradients or optimizer states are kept for the frozen parameters.
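
For a rough idea of the saving, here is a back-of-the-envelope estimate. It is only a sketch under stated assumptions: 32-bit floats, an Adam-style optimizer keeping two extra values per trainable parameter, and activations ignored entirely (so real usage will be higher). The helper name `training_memory_gb` and the choice of leaving a quarter of the parameters trainable are hypothetical.

```python
def training_memory_gb(n_params, n_trainable, bytes_per_value=4):
    """Lower bound on training memory: weights + gradients + optimizer states."""
    weights = n_params * bytes_per_value               # every parameter is stored
    gradients = n_trainable * bytes_per_value          # only trainable ones get gradients
    optimizer = 2 * n_trainable * bytes_per_value      # Adam-style: two moments each
    return (weights + gradients + optimizer) / 1e9

n_params = 124_439_808  # approximate size of the smallest gpt2

full = training_memory_gb(n_params, n_params)          # about 2 GB
frozen = training_memory_gb(n_params, n_params // 4)   # hypothetical: 1/4 left trainable
print(f"everything trainable: {full:.2f} GB")
print(f"a quarter trainable:  {frozen:.2f} GB")
```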

### Working with raw text files

To load data from a file or a directory, see [this reference](https://huggingface.co/docs/datasets/nlp_load).

There are various options available. One is to download the file manually, for instance like so:

```python
# let's download the tiny shakespeare dataset manually
dataset_dir = "text-dataset"
if not os.path.isdir(dataset_dir):
    os.mkdir(dataset_dir)

# move into the directory and download the file
os.chdir(dataset_dir)
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
os.chdir('..')
```

Then we can load the dataset from the directory we created:

```python
dataset_raw_from_dir = load_dataset(
    "text", # in this case we need "text" as a generic name to specify the task
    data_dir=dataset_dir,
    sample_by="document"  # we indicate we want the whole text
)                         # the default is line by line, "paragraph" cuts on empty lines
# print(dataset_raw_from_dir["train"]["text"][0][:250])
print(len(dataset_raw_from_dir["train"]["text"][0]))
```

Note that in the code above, `load_dataset` also works if you have more than one file in the directory. You can also select which files go where like so: `data_files={"train": "text-dataset/input.txt"}`. You could even skip the download step above and just pass the URL: `data_files='https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'`.

When we load raw files, the train/validation/test split is not done for us. *If you wanted to do this* (you could also not care about a validation dataset and just have "train", you wouldn't be the first person to do this), for this one file, you would do it this way:

```python
# adapted from ChatGPT-4 output
from datasets import Dataset, DatasetDict

def split_text_dataset(dataset, train_percent=0.9, validation_percent=0.05):
    # Retrieve all texts
    full_text = '\n'.join(dataset["train"]["text"])

    # Calculate split lengths
    total_length = len(full_text)
    train_length = int(total_length * train_percent)
    validation_length = int(total_length * validation_percent)

    # Split the text
    train_text = full_text[:train_length]
    validation_text = full_text[train_length:train_length + validation_length]
    test_text = full_text[train_length + validation_length:]

    # Combine into a DatasetDict
    return DatasetDict({
        'train': Dataset.from_dict({'text': [train_text]}),
        'validation': Dataset.from_dict({'text': [validation_text]}),
        'test': Dataset.from_dict({'text': [test_text]})
    })

dataset_raw = split_text_dataset(dataset_raw_from_dir)
```

Now our dataset is almost the same as the one downloaded from the Hub:

```python
print(dataset_raw)
print([(split, len(dataset_raw[split]['text'][0])) for split in dataset_raw])
print(dataset_raw['train']['text'][0][:250])
```

### Even deeper: the full GPT pipeline

If you want to see how deep the rabbit hole goes, I can only recommend [this Colab](https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing#scrollTo=2c5V0FvqseE0) by the same Karpathy, accompanying [his tutorial](https://www.youtube.com/watch?v=kCc8FmEb1nY) (I highly recommend this series on building language models from scratch!).

There is also this less low-level HF tutorial ["Training a causal language model from scratch"](https://huggingface.co/learn/nlp-course/en/chapter7/6?fw=pt), and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter7/section6_pt.ipynb).