# Finetuning on a New Dataset - Paris, Texas (finetune on new dataset) Versus Paris, France (pre-trained model)

### Data Handling and Visualization

In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')

### LLM model training

In [2]:
import torch
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer, DataCollatorForLanguageModeling
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel
from datasets import Dataset
from unsloth import is_bfloat16_supported

# Saving model
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Warnings
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline



🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Your Flash Attention 2 installation seems to be broken?
A possible explanation is you have a new CUDA version which isn't
yet compatible with FA2? Please file a ticket to Unsloth or FA2.
We shall now use Xformers instead, which does not have any performance hits!
We found this negligible impact by benchmarking on 1x A100.
🦥 Unsloth Zoo will now patch everything to make training faster!




### Study response of pre-trained model

In [3]:
max_seq_length = 5020
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
    trust_remote_code=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth",
    random_state = 32,
    loftq_config = None,
)
# print_trainable_parameters() prints the summary itself and returns None,
# so it should not be wrapped in print().
model.print_trainable_parameters()

Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.2.5: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA L40. Max memory: 44.418 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.2.5 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


trainable params: 11,272,192 || all params: 1,247,086,592 || trainable%: 0.9039


In [4]:
model = FastLanguageModel.for_inference(model)

### ENTER YOUR QUESTION BELOW

question = "What are the top 5 attractions of Paris? Sort by popularity."

# Format the question
eval_prompt = f"{question}\n\n"

promptTokenized = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

# Alternative without the Unsloth inference wrapper:
# model.eval()
# with torch.no_grad():
#     print(tokenizer.decode(model.generate(**promptTokenized, max_new_tokens = 5020)[0], skip_special_tokens=True))
# torch.cuda.empty_cache()

outputs = model.generate(**promptTokenized, max_new_tokens = 5020, use_cache = True)
answer = tokenizer.batch_decode(outputs)
answer = answer[0].split("### Response:")[-1]

print("Answer of the question is:", answer)

Answer of the question is: <|begin_of_text|>What are the top 5 attractions of Paris? Sort by popularity.

## Top 10 Attractions in Paris

1. Arc de Triomphe
2. Eiffel Tower
3. Notre Dame Cathedral
4. Louvre Museum
5. Musee de l'Armee
6. Musee de la Defence
7. Musee de l'Armee
8. Musee de l'Armee
9. Musee de l'Armee
10. Musee de l'Armee

## Top 10 Attractions in Paris

1. Arc de Triomphe
2. Eiffel Tower
3. Notre Dame Cathedral
4. Louvre Museum
5. Musee de l'Armee
6. Musee de la Defence
7. Musee de l'Armee
8. Musee de l'Armee
9. Musee de l'Armee
10. Musee de l'Armee

## Top 10 Attractions in Paris

1. Arc de Triomphe
2. Eiffel Tower
3. Notre Dame Cathedral
4. Louvre Museum
5. Musee de l'Armee
6. Musee de la Defence
7. Musee de l'Armee
8. Musee de l'Armee
9. Musee de l'Armee
10. Musee de l'Armee

## Top 10 Attractions in Paris

1. Arc de Triomphe
2. Eiffel Tower
3. Notre Dame Cathedral
4. Louvre Museum
5. Musee de l'Armee
6. Musee de la Defence
7. Musee de l'Armee
8. Musee de l'Armee
9. M
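
The pre-trained model does not stop after one list: it repeats the same block until the displayed output is cut off. One common way to curb this is to cap the answer length and discourage repetition at generation time. The sketch below only collects such settings in a dictionary. The values are illustrative assumptions and were not used in this notebook; `max_new_tokens`, `repetition_penalty`, `no_repeat_ngram_size` and `use_cache` are standard keyword arguments of `generate` in transformers.

```python
# Hypothetical generation settings (all values are illustrative assumptions).
generation_kwargs = {
    "max_new_tokens": 256,       # cap the answer length instead of allowing 5020 tokens
    "repetition_penalty": 1.2,   # penalize tokens that were already generated
    "no_repeat_ngram_size": 3,   # forbid repeating any 3-token sequence verbatim
    "use_cache": True,           # reuse past key/values to speed up decoding
}

# Usage with the objects from the cell above (requires the loaded model):
# outputs = model.generate(**promptTokenized, **generation_kwargs)
print(generation_kwargs)
```

Whether these values give a good answer for a 1B base model is untested here; they mainly shorten the output enough to make the before/after comparison readable.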

### Calling the dataset for finetuning

NOTE: The dataset below is built from PDF documents about Paris, Texas.

In [5]:
# import sys, pathlib, pymupdf
# fname = sys.argv[1]  # get document filename
# with pymupdf.open(fname) as doc:  # open document
#     text = chr(12).join([page.get_text() for page in doc])
# # write as a binary file to support non-ASCII characters
# pathlib.Path(fname + ".txt").write_bytes(text.encode())

In [6]:
import glob
import os

def get_filenames_with_glob(directory):
    """
    Gets all filenames in a directory using the glob module.

    Args:
        directory: The path to the directory.

    Returns:
        A list of filenames in the directory.
    """
    # Ensure the directory path ends with a separator
    if not directory.endswith(os.path.sep):
        directory += os.path.sep

    # Use glob to match all files in the directory
    all_files = glob.glob(directory + '*')
    
    # Filter out directories, keeping only files
    filenames = [f for f in all_files if os.path.isfile(f)]

    return filenames
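
The helper above returns every file in the folder in whatever order `glob` yields them. A possible variant, sketched here with `pathlib`, keeps only PDF files and sorts them so that the order in which documents are concatenated is reproducible. The function name `get_pdf_paths` is hypothetical and is not used elsewhere in this notebook.

```python
import tempfile
from pathlib import Path

def get_pdf_paths(directory):
    """Return the PDF files directly inside `directory`, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

# Small self-check on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.pdf").write_bytes(b"")
    (Path(tmp) / "a.PDF").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("not a pdf")
    (Path(tmp) / "subdir").mkdir()
    found = [p.name for p in get_pdf_paths(tmp)]

print(found)  # ['a.PDF', 'b.pdf']
```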

In [7]:
import sys, pathlib, pymupdf

absolute_path = "/myapp/local/text_files"
directory_path = absolute_path  # Replace with the actual path
filenames = get_filenames_with_glob(directory_path)
# print(filenames)

all_text = ""
print("Filenames in directory:")
for fname in filenames:
    print(fname)

    with pymupdf.open(fname) as doc:  # open document
        text = chr(12).join([page.get_text() for page in doc])
        all_text = " ".join([all_text, text])
# print(all_text)
    
# write as a binary file to support non-ASCII characters
file_name = "/myapp/local/paris_texas_sites"
pathlib.Path(file_name + ".txt").write_bytes(all_text.encode())

Filenames in directory:
/myapp/local/text_files/Parks _ Paris.pdf
/myapp/local/text_files/A Romantic Rendezvous in Paris—Texas.pdf
/myapp/local/text_files/Festival of Pumpkins _ Paris.pdf
/myapp/local/text_files/details_pdf.pdf
/myapp/local/text_files/Farmers & Artisan Vendors List _ Paris.pdf
/myapp/local/text_files/Eats.pdf
/myapp/local/text_files/History.pdf
/myapp/local/text_files/Paris Downtown Hosts _ Paris.pdf
/myapp/local/text_files/Paris Farmers and Artisan Market _ Paris.pdf
/myapp/local/text_files/Paris Texas Day Trip - Visit the Other Eiffel Tower - Tui Snider - author & speaker.pdf
/myapp/local/text_files/Paris Texas Wine Fest 2025 _ Paris.pdf
/myapp/local/text_files/Red River Valley Veterans Memorial _ Lamar United States.pdf
/myapp/local/text_files/Sam Bell Maxey House _ Texas Historical Commission.pdf
/myapp/local/text_files/Savory Restaurant and Catering Partners _ Paris.pdf
/myapp/local/text_files/Weekend Road Trip to Paris, Texas - Tui Snider - author & speaker.pdf
/

213257

In [8]:
from datasets import load_dataset
file_path = "/myapp/local/paris_texas_sites.txt"
train_dataset = load_dataset("text", data_files={"train": [file_path]}, split='train')

Generating train split: 6489 examples [00:00, 1197449.89 examples/s]


In [9]:
print(train_dataset.shape)
print(train_dataset['text'][:10])
print(train_dataset['text'][-10:])
# type(train_dataset)

(6489, 1)
[' Parks', 'Commitment of Responsibilities', 'The Parks Division is responsible for the maintenance of all parks, playgrounds, restrooms, pavilions,', 'swimming pool, athletic fields and the two lakes in the city’s park system.', 'Parks is responsible for numerous other city properties consisting of over 240 acres as well as mowing of', 'all roadside rights of way and creeks within the city limits where city easements have been established.', 'Common work provided by this Division includes upkeep on high grass, high weeds and visual hazards', 'along the streets, sidewalks, pathways and parks to ensure safe passage by motorists and pedestrians.', 'Contact Us', 'Bill Loranger']
['B O O K S', 'T R A V E L', 'L I F E S T Y L E', 'B O O K  C O R N E R', 'R E S O U R C E S', '\uf343', '2/8/25, 8:45 PM', 'The Best Things to do in Paris Texas for Couples - IDimitrova', 'https://www.ivankadimitrova.com/paris-tx/', '39/39']
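
The last ten lines printed above are mostly page furniture rather than content: letter-spaced menu labels, an icon glyph, a print timestamp, a URL and a page counter. Lines like these become training examples too. A minimal sketch of a filter for them follows. The patterns and the name `is_boilerplate` are assumptions for illustration, and this filter was not applied in the run recorded in this notebook.

```python
import re

TIMESTAMP = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2} [AP]M$")  # e.g. 2/8/25, 8:45 PM
PAGE_COUNTER = re.compile(r"^\d+/\d+$")                                     # e.g. 39/39
SPACED_CAPS = re.compile(r"^(?:[A-Z] +)+[A-Z]$")                            # e.g. B O O K S

def is_boilerplate(line):
    """Heuristically flag lines that look like page headers, footers or menus."""
    text = line.strip()
    if not text:
        return True
    if text.startswith(("http://", "https://")):
        return True
    if TIMESTAMP.match(text) or PAGE_COUNTER.match(text) or SPACED_CAPS.match(text):
        return True
    # Icon fonts usually live in the Unicode private-use area.
    if all(0xE000 <= ord(ch) <= 0xF8FF for ch in text):
        return True
    return False

sample = [
    "B O O K S",
    "B O O K  C O R N E R",
    "\uf343",
    "2/8/25, 8:45 PM",
    "The Best Things to do in Paris Texas for Couples - IDimitrova",
    "https://www.ivankadimitrova.com/paris-tx/",
    "39/39",
]
kept = [line for line in sample if not is_boilerplate(line)]
print(kept)  # ['The Best Things to do in Paris Texas for Couples - IDimitrova']
```

With a Hugging Face dataset, the same predicate could be passed to `train_dataset.filter(lambda row: not is_boilerplate(row["text"]))`.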


### Model training

#### Loading the model

We are going to use Llama 3.2 with only 1 billion parameters.

(You can use the 3, 11 or 90 billion version as well.)

- Max Sequence Length:
    We used max_seq_length 5020.

- Loading Llama 3.2 Model:

    - The model and tokenizer are loaded using `FastLanguageModel.from_pretrained` with the pre-trained checkpoint "unsloth/Llama-3.2-1B-bnb-4bit".
    - This checkpoint is quantized to 4-bit precision, which reduces memory usage, generally without significantly compromising performance.
    - `load_in_4bit=True` tells the loader to use 4-bit quantization.

- Applying PEFT (Parameter-Efficient Fine-Tuning):

    - We then wrapped the model using `get_peft_model`, which applies LoRA (Low-Rank Adaptation).
    - LoRA freezes the original weights and trains small low-rank adapter matrices attached to selected layers, rather than updating the entire network.
    - This drastically reduces the computational resources needed: here only about 0.9% of the parameters are trainable.

- Parameters:

    - r=16 (rank of the adapter matrices)
    - lora_alpha=16 (scaling factor for the adapter update)
    - target_modules: the attention projections (q_proj, k_proj, v_proj, o_proj) and the MLP projections (gate_proj, up_proj, down_proj)
    - use_rslora=True (activates Rank-Stabilized LoRA)
    - use_gradient_checkpointing="unsloth" (reduces memory usage during training)

- Verifying Trainable Parameters:
    We used `model.print_trainable_parameters()`.
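
The trainable-parameter count reported after cell In [3] (11,272,192) can be reproduced by hand. The sketch below assumes the layer dimensions from the published Llama 3.2 1B configuration (hidden size 2048, MLP size 8192, 8 key/value heads of dimension 64, 16 layers); these values are not printed anywhere in this notebook. For each targeted projection, LoRA adds two matrices with `r * (in_features + out_features)` weights in total.

```python
# Assumed Llama 3.2 1B dimensions (from the published model configuration).
HIDDEN = 2048          # hidden size
INTERMEDIATE = 8192    # MLP intermediate size
KV_DIM = 8 * 64        # 8 key/value heads x head dimension 64
NUM_LAYERS = 16
RANK = 16              # r in get_peft_model
LORA_ALPHA = 16        # lora_alpha in get_peft_model

# (in_features, out_features) of each targeted projection in one layer.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) per projection: r * (in + out) weights.
per_layer = sum(RANK * (n_in + n_out) for n_in, n_out in target_shapes.values())
trainable = per_layer * NUM_LAYERS
all_params = 1_247_086_592  # "all params" value from the log

print(f"{trainable:,}")                          # 11,272,192
print(f"{100 * trainable / all_params:.4f}")     # 0.9039

# Scaling factor applied to the adapter update.
standard_scale = LORA_ALPHA / RANK               # classic LoRA: alpha / r
rslora_scale = LORA_ALPHA / RANK ** 0.5          # rank-stabilized LoRA: alpha / sqrt(r)
print(standard_scale, rslora_scale)              # 1.0 4.0
```

With `use_rslora=True` the adapter update is therefore scaled by 4 instead of 1 for these settings, which is worth keeping in mind when comparing learning rates with non-rsLoRA runs.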

In [10]:
# max_seq_length = 5020
# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name="unsloth/Llama-3.2-1B-bnb-4bit",
#     max_seq_length=max_seq_length,
#     load_in_4bit=True,
#     dtype=None,
#     trust_remote_code=True,
# )

# model = FastLanguageModel.get_peft_model(
#     model,
#     r=16,
#     lora_alpha=16,
#     lora_dropout=0,
#     target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
#     use_rslora=True,
#     use_gradient_checkpointing="unsloth",
#     random_state = 32,
#     loftq_config = None,
# )
# model.print_trainable_parameters()

#### Prepare data for model feed

Main points to remember:

- Data Prompt Structure:
No prompt template is applied in this notebook. Each line of the text file is used as one raw-text training example, so the model is tuned by plain next-token prediction on the Paris, Texas documents rather than on question/answer pairs.

- End-of-Sequence Token:
The EOS_TOKEN is retrieved from the tokenizer and appended to every example to mark where it ends. This helps the model recognize where one piece of text stops, which matters once examples are packed together during training.

- Formatting Function:
The formatting_prompt function is a placeholder that returns each batch unchanged. It is the place to insert a prompt template if the data is later converted to instruction/response pairs.

- Function Output:
The result is a dataset with a single "text" column, where each entry is one line of source text followed by the EOS token.

In [11]:
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})

In [12]:
EOS_TOKEN = tokenizer.eos_token

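# Append the EOS token to every line so the model can learn where an example ends.
# Note: despite the variable names, these are still raw strings, not token IDs.
tokenized_train_dataset = [phrase['text'] + EOS_TOKEN for phrase in train_dataset]
tokenized_training_data = {"text": tokenized_train_dataset}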

In [13]:
train_dataset[:10]

{'text': [' Parks',
  'Commitment of Responsibilities',
  'The Parks Division is responsible for the maintenance of all parks, playgrounds, restrooms, pavilions,',
  'swimming pool, athletic fields and the two lakes in the city’s park system.',
  'Parks is responsible for numerous other city properties consisting of over 240 acres as well as mowing of',
  'all roadside rights of way and creeks within the city limits where city easements have been established.',
  'Common work provided by this Division includes upkeep on high grass, high weeds and visual hazards',
  'along the streets, sidewalks, pathways and parks to ensure safe passage by motorists and pedestrians.',
  'Contact Us',
  'Bill Loranger']}

In [14]:
tokenized_training_data["text"][:10]

[' Parks<|end_of_text|>',
 'Commitment of Responsibilities<|end_of_text|>',
 'The Parks Division is responsible for the maintenance of all parks, playgrounds, restrooms, pavilions,<|end_of_text|>',
 'swimming pool, athletic fields and the two lakes in the city’s park system.<|end_of_text|>',
 'Parks is responsible for numerous other city properties consisting of over 240 acres as well as mowing of<|end_of_text|>',
 'all roadside rights of way and creeks within the city limits where city easements have been established.<|end_of_text|>',
 'Common work provided by this Division includes upkeep on high grass, high weeds and visual hazards<|end_of_text|>',
 'along the streets, sidewalks, pathways and parks to ensure safe passage by motorists and pedestrians.<|end_of_text|>',
 'Contact Us<|end_of_text|>',
 'Bill Loranger<|end_of_text|>']

In [15]:
def formatting_prompt(sample):
    # Identity mapping: the text already ends with the EOS token, so no template is applied.
    return sample

#### Format the data for training

In [16]:
training_data = Dataset.from_dict(tokenized_training_data)

training_data = training_data.map(formatting_prompt, batched=True)

Map: 100%|██████████| 6489/6489 [00:00<00:00, 2486010.11 examples/s]


#### Training setup to start fine tuning

- Trainer Initialization:
We initialize SFTTrainer with the model, the tokenizer and the training dataset.

- Training Arguments:
The TrainingArguments class is used to define key hyperparameters for the training process:

    - learning_rate=3e-4: Sets the peak learning rate for the optimizer, used together with lr_scheduler_type="linear" and warmup_steps=20.
    - per_device_train_batch_size=8 and gradient_accumulation_steps=8: Give an effective batch size of 64 per optimizer step.
    - num_train_epochs=40: Specifies the number of training epochs.
    - fp16=not is_bfloat16_supported() and bf16=is_bfloat16_supported(): Enable mixed precision training to reduce memory usage, depending on hardware support.
    - optim="paged_adamw_8bit": Uses the paged 8-bit AdamW optimizer for efficient memory usage.
    - weight_decay=0.01: Applies weight decay to reduce overfitting.
    - output_dir="output": Specifies the directory where training outputs are saved; logs go to logging_dir="logs".

- Training Process:

    - Finally, we call the trainer.train() method to start the training process.
    - It fine-tunes the model with the parameters defined above, adjusting the LoRA weights to fit the provided dataset.
    - With packing=True the trainer also packs short examples into sequences of up to max_seq_length tokens, and it handles gradient accumulation.
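
Packing explains why the training log reports only 14 examples although the dataset has 6,489 lines: many short lines are concatenated into a few long sequences. The toy function below illustrates the idea only. It is a simplified sketch and not TRL's actual implementation, which may treat the leftover tokens at the end differently; `pack_sequences` is a hypothetical name.

```python
def pack_sequences(token_lists, block_size):
    """Concatenate tokenized examples and cut the stream into fixed-size blocks."""
    flat = [tok for tokens in token_lists for tok in tokens]
    return [flat[i:i + block_size] for i in range(0, len(flat), block_size)]

# Toy token IDs; 0 stands in for the EOS token that ends each example.
examples = [[5, 6, 0], [7, 0], [8, 9, 10, 0], [11, 0]]
blocks = pack_sequences(examples, block_size=4)
print(blocks)  # [[5, 6, 0, 7], [0, 8, 9, 10], [0, 11, 0]]
```

Four short examples become three blocks, and the EOS token is the only signal left that marks where one original example ended and the next began.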

In [17]:
trainer=SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=training_data,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=3e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,
        num_train_epochs=40,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        optim="paged_adamw_8bit", 
        weight_decay=0.01,
        warmup_steps=20,
        output_dir="output",
        logging_dir="logs",  
        logging_steps=1,
        seed=0,
    ),
    # use to form a batch from a list of elements of train_dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  
)

# The key/value cache only speeds up generation. It is not needed for training
# and is incompatible with gradient checkpointing, so disable it here.
model.config.use_cache = False

trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 14 | Num Epochs = 40
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 8
\        /    Total batch size = 64 | Total steps = 40
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss
1,5.9716
2,5.9716
3,5.8958
4,5.662
5,5.5932
6,5.2171
7,4.5235
8,4.1636
9,3.934
10,3.9332


TrainOutput(global_step=40, training_loss=3.4269424438476563, metrics={'train_runtime': 218.9799, 'train_samples_per_second': 2.557, 'train_steps_per_second': 0.183, 'total_flos': 1.66043804172288e+16, 'train_loss': 3.4269424438476563, 'epoch': 40.0})
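
The figures in the log above can be cross-checked with a little arithmetic. The step counting below approximates how the Hugging Face Trainer derives the number of optimizer steps (at least one step per epoch, even when an epoch has fewer batches than accumulation steps); treat it as a sketch under that assumption rather than the library's exact code.

```python
import math

num_examples = 14                  # "Num examples" in the log (after packing)
per_device_batch_size = 8
gradient_accumulation_steps = 8
num_gpus = 1
num_epochs = 40
train_runtime = 218.9799           # seconds, from TrainOutput

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))
# Fewer batches than accumulation steps still yields one optimizer step per epoch.
steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(effective_batch_size, steps_per_epoch, total_steps)          # 64 1 40
print(round(samples_per_second, 3), round(steps_per_second, 3))    # 2.557 0.183
```

Since the whole packed dataset fits inside one effective batch, every optimizer step is a full pass over the data, so the 40 steps correspond exactly to the 40 epochs.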

### Inference

In [29]:
modelFinetuned = FastLanguageModel.for_inference(model)

### ENTER YOUR QUESTION BELOW

question = "What are the top 5 attractions of Paris? Sort by popularity. Only show the top 5 attractions once."

# Format the question
eval_prompt = f"{question}\n\n"

promptTokenized = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

# modelFinetuned.eval()
# with torch.no_grad():
#     print(tokenizer.decode(modelFinetuned.generate(**promptTokenized, max_new_tokens = 5020)[0], skip_special_tokens=True))
# torch.cuda.empty_cache()

outputs = modelFinetuned.generate(**promptTokenized, max_new_tokens = 5020, use_cache = True)
answer = tokenizer.batch_decode(outputs)
answer = answer[0].split("### Response:")[-1]
print("Answer of the question is:", answer)


Answer of the question is: <|begin_of_text|>What are the top 5 attractions of Paris? Sort by popularity. Only show the top 5 attractions once.

## Top 5 Attractions in Paris, France

Paris is a city of romance, culture, history, and art. It is a city of love, art, and architecture. Paris is a city that has a lot to offer for the tourists. It is a city that has a lot of history and culture. Paris is a city that has a lot of attractions. It is a city that has a lot of museums and galleries. It is a city that has a lot of restaurants and cafes. It is a city that has a lot of shops and boutiques. It is a city that has a lot of attractions. It is a city that has a lot of attractions. It is a city that has a lot of attractions. It is a city that has a lot of attractions. It is a city that has a lot of attractions. It is a city that has a lot of attractions. It is a city that has a lot of attractions. It is a city that has a lot of attractions. It is a city that has a lot of attractions. It i
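
The printed answer still starts with `<|begin_of_text|>` and the echoed question because the prompt contains no "### Response:" marker, so the `split` in the cell above leaves the text unchanged. A minimal sketch of a post-processing helper follows; the name `extract_answer` and its default token list are assumptions for illustration. Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is another way to drop the special tokens.

```python
def extract_answer(decoded, prompt, special_tokens=("<|begin_of_text|>", "<|end_of_text|>")):
    """Strip special tokens and the echoed prompt from a decoded generation."""
    text = decoded
    for token in special_tokens:
        text = text.replace(token, "")
    text = text.strip()
    prompt = prompt.strip()
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()

decoded = (
    "<|begin_of_text|>What are the top 5 attractions of Paris?\n\n"
    "## Top 5 Attractions in Paris, France"
)
answer = extract_answer(decoded, "What are the top 5 attractions of Paris?\n\n")
print(answer)  # ## Top 5 Attractions in Paris, France
```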

### Expected response:

...

### Push a fine-tuned model and its tokenizer to the Hugging Face Hub

**Note:**
- Create a **.env** file in the local/ folder of your working directory in the Docker environment (/myapp/local).
- Add the line **HF_TOKEN=\<your Hugging Face API token\>** to the .env file, with your Hugging Face API token inserted as the value.
- Run the cell below to load **HF_TOKEN** as an environment variable.

In [2]:
# import os
# from dotenv import load_dotenv
# load_dotenv()

In [20]:
# os.environ["HF_TOKEN"] = "hugging face token key, you can create from your HF account."
# `token` replaces the deprecated `use_auth_token` argument in recent transformers releases.
# model.push_to_hub("ImranzamanML/1B_finetuned_llama3.2", token=os.getenv("HF_TOKEN"))
# tokenizer.push_to_hub("ImranzamanML/1B_finetuned_llama3.2", token=os.getenv("HF_TOKEN"))

### Save fine-tuned model and its tokenizer locally on the machine.

In [21]:
model.save_pretrained("model/Simplified_Paris_Texas_1B_finetuned_llama3.2")
tokenizer.save_pretrained("model/Simplified_Paris_Texas_1B_finetuned_llama3.2")

('model/1B_finetuned_llama3.2/tokenizer_config.json',
 'model/1B_finetuned_llama3.2/special_tokens_map.json',
 'model/1B_finetuned_llama3.2/tokenizer.json')