<a href="https://colab.research.google.com/github/pratikagithub/All-About-GenAI-and-LLMs/blob/main/Fine_Tuning_LLMs_Guide.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Fine-tuning is the process of taking a pre-trained model and further training it on a specialized dataset to adapt it for a specific task. In traditional Machine Learning, training typically starts from scratch with a model initialized with random parameters. The model gradually learns by updating these parameters to minimize errors on the dataset. However, fine-tuning large language models (LLMs) begins with a model that has already learned general language patterns from extensive pre-training on vast, diverse datasets. This gives the model a foundational understanding of language that can be tailored by fine-tuning on a smaller, more focused dataset to capture domain-specific nuances.

Fine-tuning is ideal when we need a model to perform well in a particular field or when you need the model to generate text that aligns closely with specialized terminology or style (e.g., legal or medical text). Conversely, using LLMs directly without fine-tuning is effective when a task is broad, has a general purpose, or benefits from the diversity of the original pre-training data, such as casual conversation, creative writing, or answering general knowledge questions.

Fine-tuning requires additional time and resources, so it’s best reserved for tasks where the model’s performance noticeably improves by specializing in a specific domain.

**Fine-Tuning LLMs with Python: A Practical Guide**

Now, let’s understand how to fine-tune LLMs practically using Python. In this guide, I’ll be using a lightweight LLM and a smaller dataset to explain the process of fine-tuning. It will help you understand the fine-tuning process practically on your available computational resources.

**Step 1: Installation and Initial Setup**

Install the necessary libraries and set up the environment:

!pip install transformers datasets

The transformers library, provided by Hugging Face, contains pre-trained models and tools for building and fine-tuning various Natural Language Processing (NLP) models. The datasets library is used to load popular datasets conveniently, which makes it easy to prepare data for training and fine-tuning models. Run this installation command at the beginning to set up these libraries.

**Step 2: Loading and Sampling the Dataset**

Load a dataset suitable for fine-tuning:

In [2]:
pip install datasets

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m27.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (

In [3]:
from datasets import load_dataset

# load IMDb dataset and take a small sample
dataset = load_dataset("imdb", split="train[:1%]")
print(dataset[0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

Here, we load the IMDb movie reviews dataset, often used in NLP tasks for sentiment analysis. By specifying train[:1%], we only load 1% of the training set, which is beneficial for quick experimentation and avoids using excessive computational resources. The print(dataset[0]) command checks that the data is loaded correctly.

**Step 3: Data Preprocessing**

Prepare data by cleaning the text and ensuring consistent formatting:

In [4]:
def preprocess(batch):
    batch['text'] = [text.replace('\n', ' ') for text in batch['text']]
    return batch

# apply preprocessing to the dataset
dataset = dataset.map(preprocess, batched=True)

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

In this function, we replaced newline characters in each review with spaces. This step is crucial because some models may not handle newline characters well, especially if trained for single-line inputs. dataset.map(preprocess, batched=True) applies this preprocessing function to the entire dataset, batch by batch, which improves efficiency.

**Step 4: Initializing the Model and Tokenizer**

Load a pre-trained model and tokenizer for fine-tuning:

In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer.pad_token = tokenizer.eos_token

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/353M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Here, we loaded distilgpt2, a lightweight version of GPT-2, which is suitable for causal language modelling tasks. AutoTokenizer and AutoModelForCausalLM automatically download and set up the tokenizer and model architecture for the specified model. Setting the pad_token to eos_token ensures consistent padding in sequences, which is necessary for batch processing.

**Step 5: Tokenizing the Data**

Convert text into tokens the model can understand:

In [6]:
def tokenize_function(examples):
    tokenized = tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)
    tokenized['labels'] = tokenized['input_ids'].copy()  # set labels to be the same as input_ids
    return tokenized

tokenized_data = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

This function tokenizes each text input by converting it into integer IDs that the model can process. Using padding= “max_length” and truncation=True; ensures each tokenized sequence has a fixed length of 128, which avoids model memory overflow. Setting labels as a copy of input_ids prepares the dataset for language modelling by ensuring the model learns to predict the next word in a sequence.

**Step 6: Configuring Training Parameters**

The next step in the fine-tuning process is to set up hyperparameters for model training:

In [7]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    logging_dir='./logs',
    logging_steps=10,
    save_total_limit=1
)



The TrainingArguments class is used to define the hyperparameters and settings for training. Key parameters include:

output_dir: Directory to save model checkpoints.

evaluation_strategy= “epoch”: Evaluate the model at the end of each epoch.

per_device_train_batch_size and per_device_eval_batch_size: Number of samples processed per device in each batch during training and evaluation, respectively.

num_train_epochs=1: Train the model for a single epoch.

logging_steps: How often to log training information.

save_total_limit=1: Limits the saved checkpoints to avoid storage overload.

**Step 7: Splitting the Dataset**

Now, divide the dataset into training and evaluation sets:

In [8]:
train_data = tokenized_data.shuffle().select(range(int(0.8 * len(tokenized_data))))
eval_data = tokenized_data.shuffle().select(range(int(0.8 * len(tokenized_data)), len(tokenized_data)))

Here, we randomly shuffle the dataset and then split it into 80% training data and 20% evaluation data. This ensures that the model has enough data to learn from and also allows for a validation set to assess the model’s performance.

**Step 8: Setting Up the Trainer & Fine-Tuning the Model**

Now, the next step in the process of fine-tuning LLMs is to initialize and configure the training process for fine-tuning:

In [9]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data
)

The Trainer class in transformers simplifies the training process by automating tasks like gradient updates and model evaluation. It uses training_args for hyperparameters and takes the train_data and eval_data datasets to structure the training and validation process.

Now, this is the fine-tuning step. Start training the model on the custom dataset:

In [12]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    run_name="custom_run_name",  # Replace this with your desired run name
)


his command initiates the fine-tuning process. The train() function performs multiple forward and backward passes through the data, which updates the model’s weights to minimize prediction errors based on the IMDb dataset. Fine-tuning will allow the pre-trained distilgpt2 model to adjust to the specific language and style of movie reviews.

**Step 9: Save & Test the Fine-tuned Model**

Save the model and tokenizer for future use:

In [13]:
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.json',
 './fine_tuned_model/merges.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

Once training is completed, saving the model ensures that the fine-tuned parameters can be reused without re-running the entire process. The save_pretrained function saves both the model weights and the tokenizer configuration to a directory.

Now, let’s generate text based on a prompt to evaluate the model:

In [14]:
prompt = "The script"
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(inputs['input_ids'], max_length=15)
print(tokenizer.decode(output[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


The script is a script that is written in a script that is written in


In this final section, we provide a sample prompt (“The script”) to test the model’s generative capabilities. The generate() function creates a new text sequence by sampling from the model’s learned distribution. By decoding and printing the output, you can observe how well the fine-tuned model generates text that aligns with the IMDb dataset.