In this notebook, I'll illustrate how to fine-tune a Hugging Face model and generate text using transformers.

Transformers are a type of model architecture that has revolutionized NLP tasks. They use self-attention mechanisms to process input data in parallel, making them efficient for large datasets.



### Underlying Math

The mathematics behind the transformer model, particularly the self-attention mechanism. This is the core idea that powers generative models like GPT-2 and GPT-3.

**Self-Attention Mechanism**

The **self-attention mechanism** allows the model to attend to different parts of the input sequence. It works by computing three vectors for each token in the sequence:
- **Query (Q)**: Represents the token's search request.
- **Key (K)**: Represents the token's content.
- **Value (V)**: Holds the token's information.

These vectors are calculated by multiplying the input embeddings $X$ with learned weight matrices:
$
Q = X W_Q, \quad K = X W_K, \quad V = X W_V
$

where:
- $X$ is the input matrix of embeddings.
- $ W_Q, W_K, W_V$ are learnable weight matrices.

**Scaled Dot-Product Attention**

Once we have these vectors, the model computes how much attention each token should give to others by taking the **dot product** of the Query and Key vectors. This gives an initial attention score:

$
\text{Attention Score} = Q \cdot K^T
$

This score is then **scaled** by $ \sqrt{d_k} $, where $d_k$ is the dimension of the key vector. This scaling ensures that the dot product values aren’t too large, which could destabilize the training process:

$
\text{Scaled Score} = \frac{Q \cdot K^T}{\sqrt{d_k}}
$

The **softmax** function is applied to convert these scores into probabilities (attention weights), so that they sum to 1:

$
\text{Attention Weights} = \text{softmax}\left( \frac{Q \cdot K^T}{\sqrt{d_k}} \right)
$


These attention weights are used to take a **weighted sum** of the Value vectors, resulting in the **output** of the self-attention layer:

$
\text{Output} = \text{Attention Weights} \cdot V
$

**Multi-Head Attention**

To allow the model to focus on different relationships in the input simultaneously, **multi-head attention** is used. Instead of computing attention once, the model computes it multiple times in parallel, using different sets of learned weights for each head. The outputs of all attention heads are concatenated and projected into the final output:

$
Q_i = X W_Q^i, \quad K_i = X W_K^i, \quad V_i = X W_V^i
$

The multi-head attention output is:

$$
\text{Multi-Head Output} = \text{concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h) W_O
$$

where:
- $h$ is the number of attention heads.
- $W_O$ is a projection matrix that combines the attention heads.

**Positional Encoding**

Since transformers process tokens in parallel, they don’t inherently know the order of tokens. To address this, **positional encoding** is added to the input embeddings to provide order information. The positional encoding is computed using sine and cosine functions:

$
PE_{(i, 2k)} = \sin\left(\frac{i}{10000^{2k/d}}\right), \quad PE_{(i, 2k+1)} = \cos\left(\frac{i}{10000^{2k/d}}\right)
$

where:
- $ i$ is the position of the token in the sequence.
- $d$ is the dimensionality of the token embeddings.

These positional encodings are added to the input embeddings to provide the model with information about the relative position of tokens in the sequence.

**Why Self-Attention?**

Self-attention is crucial for allowing transformers to handle long-range dependencies in text. By processing all tokens at once and allowing the model to decide which tokens to focus on, it can capture relationships between tokens that are far apart in the sequence. This ability to "pay attention" to the important parts of the input is what makes transformers so powerful for **text generation** and other natural language tasks.


### Pyton code:

- Fine-tune a pre-trained model on your dataset.

- Use the fine-tuned model to generate text based on a given prompt.


Step 1: Setting Up Your Environment

Install Hugging Face Transformers library along with PyTorch or TensorFlow (depending on your preference).

In [1]:
!pip install transformers torch  # For PyTorch
# or
!pip install transformers tensorflow  # For TensorFlow


In [2]:
#here I use torch
import torch

In [3]:
import transformers
print(transformers.__version__)


4.51.3


In [5]:
!pip install --upgrade transformers


Step 2: Loading a Pre-trained Model

Hugging Face provides a wide range of pre-trained models. Here’s how to load a pre-trained model and tokenizer for text generation.


In [6]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = "gpt2"  # You can also try "gpt2-medium", "gpt2-large", etc.
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Set the model to evaluation mode
model.eval()


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

Step 3: Generating Text

Now that you have the model and tokenizer, you can generate text. Here’s a simple example:

In [8]:
# Function to generate text
def generate_text(prompt, max_length=50):
    # Encode the input prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt')

    # Generate text
    with torch.no_grad():
        output = model.generate(input_ids, max_length=max_length, num_return_sequences=1)

    # Decode the generated text
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text

# Example usage
prompt = "The individual has always had to struggle to keep from being overwhelmed by the tribe. If you try it"
generated = generate_text(prompt)
print(generated)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The individual has always had to struggle to keep from being overwhelmed by the tribe. If you try it, you'll be able to get through it.

"I think it's a very important thing to have a good relationship with your tribe.


Step 4: Fine-tuning a Model on a Dataset


In [11]:
!pip install datasets

In [12]:
import os
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import load_dataset
from huggingface_hub import login


In [15]:
# Retrieve the Hugging Face token using the updated method
from google.colab import userdata
hf_token = userdata.get('HF_key')  # Get your_huggingface_api_key from the Colab secret manager

# Check if the token was retrieved correctly
if hf_token is None:
    print("Error: Hugging Face token not found.")
else:
    print(f"Token Retrieved: {hf_token[:5]}...")  # Only print first 5 characters for security

    # Log in using the Hugging Face API token
    login(hf_token)


Token Retrieved: hf_BM...


In [16]:
# Load the IMDB dataset
dataset = load_dataset("imdb")

# Print a sample from the dataset to verify
print(dataset["train"][0])


README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from datasets import load_dataset

# Load the IMDB dataset (or any other dataset you're using)
dataset = load_dataset("imdb")

# Filter out empty entries (if any)
filtered_train = dataset['train'].filter(lambda x: x['text'].strip() != '')
filtered_validation = dataset['test'].filter(lambda x: x['text'].strip() != '')

# Load the GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set the pad token id to the eos token id (GPT-2 does not have a dedicated pad token)
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained(model_name)

# Tokenize the dataset and create labels (shifted input)
def tokenize_function(examples):
    # Tokenize input text
    encodings = tokenizer(examples['text'], truncation=True, padding="max_length", max_length=512)

    # Shift labels by one token for causal language modeling
    labels = encodings['input_ids'].copy()
    for i in range(len(labels)):
        labels[i] = labels[i][1:] + [tokenizer.pad_token_id]  # Shift the input and pad

    encodings['labels'] = labels  # Add labels to encodings
    return encodings

# Tokenize the training and validation datasets
tokenized_train = filtered_train.map(tokenize_function, batched=True)
tokenized_validation = filtered_validation.map(tokenize_function, batched=True)

# Set training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    num_train_epochs=3,
    logging_dir='./logs',  # Directory for storing logs
    logging_steps=10,      # Log every 10 steps
    report_to="none",      # Disable WandB
    disable_tqdm=True      # Disable tqdm progress bar
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_validation,
)

# Fine-tune the model
trainer.train()


Filter:   0%|          | 0/25000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


{'loss': 6.0812, 'grad_norm': 32.88490295410156, 'learning_rate': 1.99952e-05, 'epoch': 0.0008}
{'loss': 3.5082, 'grad_norm': 13.66903305053711, 'learning_rate': 1.998986666666667e-05, 'epoch': 0.0016}
{'loss': 3.2751, 'grad_norm': 11.793081283569336, 'learning_rate': 1.9984533333333335e-05, 'epoch': 0.0024}
{'loss': 2.9565, 'grad_norm': 4.352120399475098, 'learning_rate': 1.99792e-05, 'epoch': 0.0032}


In [None]:
model.save_pretrained('./fine_tuned_model')
tokenizer.save_pretrained('./fine_tuned_model')


In [None]:
!ls ./fine_tuned_model


Step5: Generating Text After fine-tuning the model:

In [None]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Path to your fine-tuned model directory
fine_tuned_model_path = './fine_tuned_model'  # Adjust if necessary

# Initialize the model and tokenizer with your fine-tuned model
model = GPT2LMHeadModel.from_pretrained(fine_tuned_model_path)
tokenizer = GPT2Tokenizer.from_pretrained(fine_tuned_model_path)

# Function to generate text
def generate_text(prompt, max_length=50):
    # Encode the input prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt')

    # Create attention mask - to indicate which tokens should be attended to
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long)

    # Generate text
    with torch.no_grad():
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=max_length,
            num_return_sequences=1,
            do_sample=True,  # Use sampling to generate more diverse outputs
            top_k=50,        # Limit the sampling pool to the top k tokens
            top_p=0.95,      # Use nucleus sampling
            temperature=0.7   # Control randomness in generation
        )

    # Decode the generated text
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text

# Example usage
prompt = "The movie was"
generated = generate_text(prompt)
print(generated)


With these steps, confidently fine-tune and use transformers for your own generative AI projects while keeping your data and credentials secure.

You can use other models from Hugging Face Model Hub. Or, use your own dataset and explore different tasks beyond text generation e.g., text classification, summarization, and translation.


