
In [1]:
# Data Preparation for LLM Fine-tuning
import pandas as pd
import json
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer
from pprint import pprint

In [2]:

print("🚀 Data Preparation Tutorial (Following Original Notebook)")
print("=" * 60)

# Step 1: Import necessary libraries
print("Step 1: Import necessary libraries")
print("✅ Libraries imported successfully!")

🚀 Data Preparation Tutorial (Following Original Notebook)
Step 1: Import necessary libraries
✅ Libraries imported successfully!


In [3]:
# Step 2: Load tokenizer
print("\nStep 2: Load tokenizer")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
tokenizer.pad_token = tokenizer.eos_token
print("✅ Tokenizer loaded!")


Step 2: Load tokenizer


✅ Tokenizer loaded!


In [4]:
# Step 3: Load and prepare the dataset
print("\nStep 3: Load and prepare the dataset")

# Sample data
sample_data = [
    {
        "question": "What are the different types of documents available in the repository?",
        "answer": "Lamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, Batching, Error Handling, Advanced topics, and class documentation on LLM Engine available at https://lamini-ai.github.io/."
    },
    {
        "question": "What is the recommended way to set up and configure the code repository?",
        "answer": "Lamini can be downloaded as a python package and used in any codebase that uses python. Additionally, we provide a language agnostic REST API."
    },
    {
        "question": "How can I find the specific documentation I need for a particular feature or function?",
        "answer": "You can ask this model about documentation, which is trained on our publicly available docs and source code, or you can go to https://lamini-ai.github.io/."
    },
    {
        "question": "Does the documentation include explanations of the code's purpose?",
        "answer": "Our documentation provides both real-world and toy examples of how one might use Lamini in a larger system."
    },
    {
        "question": "Does the documentation provide information about external dependencies?",
        "answer": "External dependencies and libraries are all available on the Python package hosting website Pypi at https://pypi.org/project/lamini/"
    }
]



Step 3: Load and prepare the dataset


In [5]:
# Create DataFrame
instruction_dataset = pd.DataFrame(sample_data)
print(f"✅ Dataset loaded with {len(instruction_dataset)} examples")

# Convert to a column-oriented dict: {column name: [values]}
examples = instruction_dataset.to_dict(orient="list")

# Extract text data, trying the common column-name pairs in turn
if "question" in examples and "answer" in examples:
    text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
    text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
    text = examples["input"][0] + examples["output"][0]
else:
    text = examples["text"][0]

print("Sample text extracted:", text[:100] + "...")

✅ Dataset loaded with 5 examples
Sample text extracted: What are the different types of documents available in the repository?Lamini has documentation on Ge...
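
The column-name fallback in the cell above can be wrapped in a small reusable helper. This is a minimal sketch, not part of the original notebook: `extract_text` and `COLUMN_PAIRS` are hypothetical names, and the helper assumes `examples` maps each column name to an indexable collection of values.

```python
# Hypothetical helper; the names below are illustrative only.
COLUMN_PAIRS = [
    ("question", "answer"),
    ("instruction", "response"),
    ("input", "output"),
]

def extract_text(examples, index=0):
    """Return the concatenated prompt/response text for one row.

    Falls back to a single "text" column when no known pair is present.
    """
    for prompt_col, response_col in COLUMN_PAIRS:
        if prompt_col in examples and response_col in examples:
            return examples[prompt_col][index] + examples[response_col][index]
    return examples["text"][index]

demo = {"instruction": ["Say hi."], "response": [" Hi!"]}
print(extract_text(demo))
```

Keeping the pairs in a list means a new naming scheme can be supported by adding one tuple rather than another `elif` branch.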


In [6]:


# Step 4: Format data for fine-tuning
print("\nStep 4: Format data for fine-tuning")



Step 4: Format data for fine-tuning


In [7]:


prompt_template = """### Question:
{question}

### Answer:"""

num_examples = len(examples["question"])
finetuning_data = []

for i in range(num_examples):
    question = examples["question"][i]
    answer = examples["answer"][i]
    text_with_prompt_template = prompt_template.format(question=question)
    finetuning_data.append({"question": text_with_prompt_template, "answer": answer})

print("✅ Data formatted!")
print("Sample datapoint:")
pprint(finetuning_data[0])

✅ Data formatted!
Sample datapoint:
{'answer': 'Lamini has documentation on Getting Started, Authentication, '
           'Question Answer Model, Python Library, Batching, Error Handling, '
           'Advanced topics, and class documentation on LLM Engine available '
           'at https://lamini-ai.github.io/.',
 'question': '### Question:\n'
             'What are the different types of documents available in the '
             'repository?\n'
             '\n'
             '### Answer:'}


In [17]:

# Step 5: Tokenize a single example
print("\nStep 5: Tokenize a single example")


text = finetuning_data[0]["question"] + finetuning_data[0]["answer"]

tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    padding=True
)
print("Token IDs:", tokenized_inputs["input_ids"])


Step 5: Tokenize a single example
Token IDs: [[ 4118 19782    27   187  1276   403   253  1027  3510   273  7177  2130
    275   253 18491    32   187   187  4118 37741    27    45  4988    74
    556 10097   327 27669 11075   264    13  5271 23058    13 19782 37741
  10031    13 13814 11397    13   378 16464    13 11759 10535  1981    13
  21798 12989    13   285   966 10097   327 21708    46 10797  2130   387
   5987  1358    77  4988    74    14  2284    15  7280    15   900 14206]]
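
With a single input string, `padding=True` has nothing to pad. Its effect only shows up when a batch contains sequences of different lengths. The sketch below imitates that behavior in plain Python on made-up token IDs. `pad_batch` is a hypothetical helper, the pad ID of 0 is an assumption for illustration rather than a value read from the tokenizer, and the sketch pads on the right.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

padded = pad_batch([[4118, 19782, 27], [1276, 403]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

The attention mask is what lets the model tell real tokens from padding, which matters here because the pad token is set to the end-of-sequence token.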


In [18]:
# Step 6: Handle long sequences
print("\nStep 6: Handle long sequences")


max_length = 2048
max_length = min(
    tokenized_inputs["input_ids"].shape[1],
    max_length,
)

print(f"Using max_length: {max_length}")

# Apply truncation
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    truncation=True,
    max_length=max_length
)
print("✅ Truncation applied")
print("Final shape:", tokenized_inputs["input_ids"].shape)


Step 6: Handle long sequences
Using max_length: 72
✅ Truncation applied
Final shape: (1, 72)


In [19]:
# Step 7: Create tokenization function
print("\nStep 7: Create tokenization function")

def tokenize_function(examples):
    """Tokenize a batch, truncating to min(longest sequence in the batch, 2048) tokens."""

    # Handle the batched format
    if "question" in examples and "answer" in examples:
        # Combine question and answer for each example in the batch
        texts = [q + a for q, a in zip(examples["question"], examples["answer"])]
    else:
        texts = examples["text"]

    # Reuse the EOS token as the pad token
    tokenizer.pad_token = tokenizer.eos_token

    # First pass without truncation, only to measure sequence lengths
    tokenized_inputs = tokenizer(texts, return_tensors=None)

    # Cap max length at the longest sequence in this batch, up to 2048
    lengths = [len(ids) for ids in tokenized_inputs["input_ids"]]
    current_max_length = min(max(lengths), 2048) if lengths else 2048

    # Truncate from the left so the end of the text (the answer) is kept
    tokenizer.truncation_side = "left"

    # Final tokenization with truncation
    tokenized_inputs = tokenizer(
        texts,
        return_tensors=None,
        truncation=True,
        max_length=current_max_length
    )

    return tokenized_inputs

print("✅ Tokenization function created")


Step 7: Create tokenization function
✅ Tokenization function created
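
Setting `tokenizer.truncation_side = "left"` drops tokens from the start of an over-long sequence, so the end of the text (here, the answer) survives. The difference from the default right-side truncation can be sketched in plain Python. `truncate` is a hypothetical helper operating on made-up token IDs, not a tokenizer method.

```python
def truncate(ids, max_length, side="right"):
    """Keep at most max_length tokens, dropping from the given side."""
    if max_length <= 0:
        return []
    if len(ids) <= max_length:
        return list(ids)
    if side == "left":
        return list(ids[-max_length:])  # drop the earliest tokens
    return list(ids[:max_length])       # drop the trailing tokens

ids = [10, 11, 12, 13, 14, 15]
left_kept = truncate(ids, 4, side="left")
right_kept = truncate(ids, 4, side="right")
print("left :", left_kept)
print("right:", right_kept)
```

For question-answer data where the answer comes last, left truncation trades away part of the prompt to keep the target text intact.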


In [20]:

# Step 8: Tokenize the entire dataset (using Hugging Face Datasets)
print("\nStep 8: Tokenize the entire dataset")

# Create dataset from our formatted data
dataset = Dataset.from_list(finetuning_data)

# Apply tokenization with same parameters
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True
)

print("✅ Dataset tokenized!")
print(f"Tokenized dataset: {tokenized_dataset}")


Step 8: Tokenize the entire dataset



✅ Dataset tokenized!
Tokenized dataset: Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask'],
    num_rows: 5
})


In [21]:
# Step 9: Add labels
print("\nStep 9: Add labels")


tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])

print("✅ Labels added!")
print(f"Dataset with labels: {tokenized_dataset}")


Step 9: Add labels
✅ Labels added!
Dataset with labels: Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 5
})
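
Copying `input_ids` into `labels` is enough for causal language modeling because Hugging Face causal LM models shift the labels by one position internally when computing the loss: the prediction at position i is scored against the token at position i + 1. The sketch below shows the resulting (input token, target token) pairs on made-up token IDs. `next_token_pairs` is a hypothetical helper, not a library function.

```python
def next_token_pairs(input_ids):
    """Pair each input position with the token the model should predict next."""
    labels = list(input_ids)   # labels start as a plain copy of input_ids
    inputs = input_ids[:-1]    # positions whose prediction is scored
    targets = labels[1:]       # the same sequence shifted left by one
    return list(zip(inputs, targets))

pairs = next_token_pairs([4118, 19782, 27, 187])
print(pairs)
```

A sequence of n tokens therefore yields n - 1 training targets; the last token has no successor to predict.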


In [22]:
# Step 10: Create train/test splits
print("\nStep 10: Create train/test splits")


split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)

print("✅ Dataset split complete!")
print(f"Training examples: {len(split_dataset['train'])}")
print(f"Test examples: {len(split_dataset['test'])}")

# Show final structure
print(f"\nFinal dataset structure:")
print(split_dataset)

# Convert to pandas
train_df = pd.DataFrame(split_dataset["train"])
test_df = pd.DataFrame(split_dataset["test"])

print(f"\nTrain DataFrame shape: {train_df.shape}")
print(f"Test DataFrame shape: {test_df.shape}")


Step 10: Create train/test splits
✅ Dataset split complete!
Training examples: 4
Test examples: 1

Final dataset structure:
DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 4
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1
    })
})

Train DataFrame shape: (4, 5)
Test DataFrame shape: (1, 5)
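
The 4/1 split follows from rounding: with 5 rows and `test_size=0.1`, the test set gets one row rather than zero. The sketch below reproduces those counts under the assumption that a fractional test size is rounded up and the remainder goes to training, as in scikit-learn's splitter. `split_sizes` is a hypothetical helper, not part of the `datasets` API.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding test up."""
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test

print(split_sizes(5, 0.1))
```

On a dataset this small the effective test fraction is 20%, not 10%, and any metric computed on a single test example is only a smoke test.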


In [23]:
# Show sample data
print(f"\nSample from training set:")
# Get individual examples properly
for i in range(min(2, len(split_dataset["train"]))):  # Show up to 2 examples
    example = split_dataset["train"][i]
    print(f"Example {i+1}:")
    print(f"  Question: {example['question'][:100]}...")
    print(f"  Answer: {example['answer'][:100]}...")
    print(f"  Input IDs length: {len(example['input_ids'])}")
    print(f"  Labels length: {len(example['labels'])}")



Sample from training set:
Example 1:
  Question: ### Question:
What are the different types of documents available in the repository?

### Answer:...
  Answer: Lamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, ...
  Input IDs length: 72
  Labels length: 72
Example 2:
  Question: ### Question:
How can I find the specific documentation I need for a particular feature or function?...
  Answer: You can ask this model about documentation, which is trained on our publicly available docs and sour...
  Input IDs length: 62
  Labels length: 62


In [24]:
# Final statistics
import numpy as np

print("\n📊 Final Statistics:")
train_lengths = [len(example['input_ids']) for example in split_dataset['train']]
test_lengths = [len(example['input_ids']) for example in split_dataset['test']]

print("Training set:")
print(f"  Average length: {np.mean(train_lengths):.1f} tokens")
print(f"  Min length: {min(train_lengths)} tokens")
print(f"  Max length: {max(train_lengths)} tokens")

print("Test set:")
print(f"  Average length: {np.mean(test_lengths):.1f} tokens")
print(f"  Min length: {min(test_lengths)} tokens")
print(f"  Max length: {max(test_lengths)} tokens")

# Conclusion
print("\n🎉 DATA PREPARATION COMPLETE!")
print("=" * 60)
print("""
✅ WHAT WE ACCOMPLISHED:
1. Loaded and prepared the dataset
2. Formatted data with prompt template
3. Tokenized single example for testing
4. Handled long sequences with truncation
5. Created tokenization function
6. Tokenized entire dataset
7. Added labels for training
8. Created train/test splits
9. Analyzed final dataset

🚀 NEXT STEPS:
This concludes the data preparation process for fine-tuning a large language model.
The next steps are to set up the model, fine-tune it on the training data,
and evaluate its performance on the test data.

Your split_dataset is ready for training!
""")


📊 Final Statistics:
Training set:
  Average length: 58.0 tokens
  Min length: 43 tokens
  Max length: 72 tokens
Test set:
  Average length: 47.0 tokens
  Min length: 47 tokens
  Max length: 47 tokens

🎉 DATA PREPARATION COMPLETE!

✅ WHAT WE ACCOMPLISHED:
1. Loaded and prepared the dataset
2. Formatted data with prompt template
3. Tokenized single example for testing
4. Handled long sequences with truncation
5. Created tokenization function
6. Tokenized entire dataset
7. Added labels for training
8. Created train/test splits
9. Analyzed final dataset

🚀 NEXT STEPS:
This concludes the data preparation process for fine-tuning a large language model.
The next steps are to set up the model, fine-tune it on the training data,
and evaluate its performance on the test data.

Your split_dataset is ready for training!

