<a href="https://colab.research.google.com/github/peremartra/FinLLMOpt/blob/FinChat-XS-Instruct/FinChat-XS/01_Finetuning_FinChat-XS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FinChat-XS
## Fine-Tuning & Merge.


________
* Model: HuggingFaceTB/SmolLM2-360M-Instruct
* Dataset: sujet-ai/Sujet-Finance-Instruct-177k
_________
This notebook replicates the process followed to create the Llama-FinChat-XS model.

It fine-tunes the SmolLM2-360M-Instruct model, created by Hugging Face, on a subset of the Sujet-Finance-Instruct-177k dataset.

This first version of the model is fully functional, but it should be considered a proof of concept until a dataset built specifically for its training is available.
______________
If you’re looking for explanations about fine-tuning LLMs, you can find them in the Fine-Tuning section of the [Large Language Models course](https://github.com/peremartra/Large-Language-Model-Notebooks-Course) that I maintain on GitHub.

# Install and Import Libraries.

In [1]:
!pip install -q transformers==4.47.1
!pip install -q datasets==3.2.0
!pip install -q torch==2.5.1
!pip install -q peft==0.14.0

In [2]:
# Import Libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import load_dataset
from peft import get_peft_model, LoraConfig
from transformers import DataCollatorForLanguageModeling
import re
from transformers import EarlyStoppingCallback

# Load & Process the Dataset

In [3]:
# Load the Dataset
dataset = load_dataset("sujet-ai/Sujet-Finance-Instruct-177k")

# We only need the rows that belong to the categories "qa" and "qa_conversation".
dataset["train"] = dataset["train"].filter(lambda x: x["task_type"].strip().lower() in ["qa", "qa_conversation"])


In [4]:
# Due to the limited size of the model, we restrict both the prompt and
# response length when selecting rows from the dataset for fine-tuning.
def filter_by_length(example):
    return len(example["answer"]) <= 300 and len(example["user_prompt"]) <= 100
dataset["train"] = dataset["train"].filter(filter_by_length)


In [5]:
# The few short conversation records are duplicated to ensure the model retains
# its ability to interact and respond to greetings such as: "Hi, how are you?"
short_qa_conversations = []

for i, example in enumerate(dataset["train"]):
    if example["task_type"] == "qa_conversation" and len(example["answer"]) < 30:
        short_qa_conversations.append(i)

print(f"Found {len(short_qa_conversations)} short qa_conversation examples to duplicate")

augmented_dataset = dataset["train"].select(range(len(dataset["train"])))

Found 6 short qa_conversation examples to duplicate


In [6]:
n = 5
# Add n-1 copies of short qa_conversation examples
for _ in range(n-1):
    for idx in short_qa_conversations:
        example = dataset["train"][idx]
        augmented_dataset = augmented_dataset.add_item(example)

print(f"Original training dataset size: {len(dataset['train'])}")
print(f"Augmented training dataset size: {len(augmented_dataset)}")
print(f"Added {(n-1) * len(short_qa_conversations)} additional copies of short qa_conversation examples")

# Replace the original training dataset with the augmented one
dataset["train"] = augmented_dataset

Original training dataset size: 15752
Augmented training dataset size: 15776
Added 24 additional copies of short qa_conversation examples
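The duplication step above can be illustrated without the `datasets` library. The sketch below is a minimal stand-in under stated assumptions: `rows` is a made-up list of dictionaries, and `oversample_short_conversations` is a hypothetical helper that does not exist in the notebook.

```python
def oversample_short_conversations(rows, n=5, max_answer_chars=30):
    """Return rows plus n-1 extra copies of each short qa_conversation row."""
    short = [
        row for row in rows
        if row["task_type"] == "qa_conversation"
        and len(row["answer"]) < max_answer_chars
    ]
    # Each short row ends up appearing n times in total.
    return rows + short * (n - 1)


# Toy data (hypothetical), not taken from the real dataset.
rows = [
    {"task_type": "qa", "answer": "A bond is a fixed-income instrument."},
    {"task_type": "qa_conversation", "answer": "Hi! How can I help?"},
    {"task_type": "qa_conversation", "answer": "x" * 40},
]
augmented = oversample_short_conversations(rows, n=5)
print(len(rows), len(augmented))
```

Only the second toy row qualifies, so four copies are appended. In the notebook the same arithmetic gives 6 short examples x 4 extra copies = 24 added rows.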


In [7]:
split_dataset = dataset["train"].train_test_split(test_size=0.1, seed=2)
# split_dataset now has keys: "train" and "test"

In [8]:
def clean_dataset(example):
    # Remove examples whose answer contains any emoji
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F700-\U0001F77F"  # alchemical symbols
                               u"\U0001F780-\U0001F7FF"  # Geometric Shapes
                               u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                               u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               u"\U00002702-\U000027B0"  # Dingbats
                               "]+", flags=re.UNICODE)
    if emoji_pattern.search(example["answer"]):
        return False
    # Filter out extremely informal answers (lots of !!, ??, etc.)
    if re.search(r'[!?]{3,}', example["answer"]):
        return False
    return True

split_dataset["train"] = split_dataset["train"].filter(clean_dataset)
split_dataset["test"] = split_dataset["test"].filter(clean_dataset)
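To make the two cleaning rules easier to check, here is a small sketch that applies them to a few made-up strings. It assumes a reduced emoji range (emoticons only) to stay short, and `is_clean` is an illustrative helper name, not the notebook's function.

```python
import re

# Same "three or more ! or ?" rule as in clean_dataset.
INFORMAL_PUNCTUATION = re.compile(r"[!?]{3,}")
# Reduced emoji range (emoticons only); the notebook uses more ranges.
EMOTICONS = re.compile("[\U0001F600-\U0001F64F]")


def is_clean(answer):
    """Return True if the answer has no emoticon and no run of 3+ '!'/'?'."""
    return not EMOTICONS.search(answer) and not INFORMAL_PUNCTUATION.search(answer)


# Hypothetical sample answers.
samples = [
    "Diversification reduces unsystematic risk.",
    "Buy now!!!",
    "Really?!",
    "Great returns \U0001F600",
]
for text in samples:
    print(is_clean(text), repr(text))
```

Note that a single emoji is enough to reject an answer, while two consecutive punctuation marks such as `?!` are still accepted.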


In [9]:
# Count in the train split
qa_count_train = sum(1 for example in split_dataset["train"] if example["task_type"].strip().lower() == "qa")
qa_conversation_count_train = sum(1 for example in split_dataset["train"] if example["task_type"].strip().lower() == "qa_conversation")

# Count in the test split
qa_count_test = sum(1 for example in split_dataset["test"] if example["task_type"].strip().lower() == "qa")
qa_conversation_count_test = sum(1 for example in split_dataset["test"] if example["task_type"].strip().lower() == "qa_conversation")

# Print the counts
print(f"Train split: {qa_count_train} 'qa' and {qa_conversation_count_train} 'qa_conversation' examples")
print(f"Test split: {qa_count_test} 'qa' and {qa_conversation_count_test} 'qa_conversation' examples")
print(f"Total: {qa_count_train + qa_count_test} 'qa' and {qa_conversation_count_train + qa_conversation_count_test} 'qa_conversation' examples")

Train split: 14143 'qa' and 52 'qa_conversation' examples
Test split: 1572 'qa' and 6 'qa_conversation' examples
Total: 15715 'qa' and 58 'qa_conversation' examples


In [10]:
split_dataset["train"][4211]

{'Unnamed: 0': 127922,
 'inputs': 'As a finance expert, your role is to provide clear, concise, and informative responses to finance-related questions. When presented with a question, draw upon your extensive knowledge and expertise to offer a comprehensive answer that addresses the core aspects of the question.\n\nQuestion:\nCalculate the population density of Spain.\n\nAnswer:',
 'answer': 'The population density of Spain is 91.9 people per square kilometer.',
 'system_prompt': 'As a finance expert, your role is to provide clear, concise, and informative responses to finance-related questions. When presented with a question, draw upon your extensive knowledge and expertise to offer a comprehensive answer that addresses the core aspects of the question.',
 'user_prompt': 'Question:\nCalculate the population density of Spain.',
 'task_type': 'qa',
 'dataset': 'gbharti/finance-alpaca',
 'index_level': None,
 'conversation_id': None}

In [11]:
print(split_dataset)

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'inputs', 'answer', 'system_prompt', 'user_prompt', 'task_type', 'dataset', 'index_level', 'conversation_id'],
        num_rows: 14195
    })
    test: Dataset({
        features: ['Unnamed: 0', 'inputs', 'answer', 'system_prompt', 'user_prompt', 'task_type', 'dataset', 'index_level', 'conversation_id'],
        num_rows: 1578
    })
})


## Load Tokenizer & Model

In [12]:
# Load the Base Model and Tokenizer.
model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token


In [13]:
#default_system_prompt = "You are FinChat, a helpful AI assistant with expertise in finance. For general questions, respond naturally and concisely. Only use your financial knowledge when questions specifically relate to markets, investments, or financial concepts."
#tokenizer.chat_template_config = {
#    "default_system_message": default_system_prompt
#}

In [14]:
# Compute token lengths of the inputs column
lengths = [len(tokenizer.encode(example["inputs"])) for example in dataset["train"]]

# Get max length
max_length = max(lengths)
print(f"Maximum prompt length in tokens: {max_length}")

Maximum prompt length in tokens: 100


In [15]:
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 960, padding_idx=2)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=960, out_features=960, bias=False)
          (k_proj): Linear(in_features=960, out_features=320, bias=False)
          (v_proj): Linear(in_features=960, out_features=320, bias=False)
          (o_proj): Linear(in_features=960, out_features=960, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=960, out_features=2560, bias=False)
          (up_proj): Linear(in_features=960, out_features=2560, bias=False)
          (down_proj): Linear(in_features=2560, out_features=960, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((960,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((960,), eps=1e-05)
      )
    )
    (norm

In [16]:
print(model.dtype)

torch.bfloat16


## Format Prompt for Training.

In [17]:
def format_chat(row):
    user_prompt = row["user_prompt"]

    if re.match(r'^\s*question\b', user_prompt, re.IGNORECASE):
        # Remove everything up to the first colon
        user_prompt = re.sub(r'^\s*question.*?:\s*', '', user_prompt, flags=re.IGNORECASE)

    messages = [
        #{"role": "system", "content": default_system_prompt},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": row["answer"]}
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
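The prefix-stripping regex in `format_chat` can be tried in isolation. The sketch below copies the two patterns into a standalone function; `strip_question_prefix` is a hypothetical name used only for this illustration, and the sample prompts other than the first are made up.

```python
import re


def strip_question_prefix(user_prompt):
    """Drop a leading 'Question...:' label, mirroring the regex in format_chat."""
    if re.match(r"^\s*question\b", user_prompt, re.IGNORECASE):
        user_prompt = re.sub(
            r"^\s*question.*?:\s*", "", user_prompt, flags=re.IGNORECASE
        )
    return user_prompt


print(strip_question_prefix("Question:\nCalculate the population density of Spain."))
print(strip_question_prefix("What is a bond?"))
print(strip_question_prefix("Questions about bonds are common."))
```

The `\b` word boundary keeps prompts that merely start with a longer word such as "Questions" untouched. A prompt that starts with "Question" but contains no colon on its first line is also left unchanged, because the substitution finds nothing to remove.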

In [18]:
# Apply formatting
formatted_dataset = split_dataset.map(format_chat)


In [19]:
print(formatted_dataset["train"][2000]["text"])

<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
Brainstorm 5 possible topics for a 30-minute presentation.<|im_end|>
<|im_start|>assistant
Possible topics for a 30-minute presentation:
1. Crafting the Perfect Elevator Pitch
2. Finding Creative Solutions to Problems
3. The Basics of Time Management
4. The Art of Networking
5. Understanding the Different Types of Leadership Styles<|im_end|>
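The printed example suggests that the tokenizer's chat template wraps every turn in `<|im_start|>role ... <|im_end|>` markers and injects a default system message when none is supplied. The following is a hand-written approximation inferred only from that output; the real template lives in the tokenizer and may differ in details. `render_chatml` is a hypothetical helper.

```python
DEFAULT_SYSTEM = "You are a helpful AI assistant named SmolLM, trained by Hugging Face"


def render_chatml(messages, default_system=DEFAULT_SYSTEM):
    """Approximate the ChatML-style text seen in the printed training example."""
    if not messages or messages[0]["role"] != "system":
        # Assumption: a default system turn is prepended when none is given.
        messages = [{"role": "system", "content": default_system}] + list(messages)
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


text = render_chatml([
    {"role": "user", "content": "What is a bond?"},
    {"role": "assistant", "content": "A bond is a loan made by an investor to a borrower."},
])
print(text)
```

Because the system message in `format_chat` is commented out, every training example carries this default SmolLM system turn.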



In [20]:
# Compute token lengths of the formatted "text" column
lengths = [len(tokenizer.encode(example["text"])) for example in formatted_dataset["train"]]

# Get max length
max_length = max(lengths)
print(f"Maximum prompt length in tokens: {max_length}")

Maximum prompt length in tokens: 323


## Tokenize Dataset.

In [21]:
# Tokenize the dataset, truncating at the longest formatted example plus a 50-token margin.
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, max_length=max_length+50)

In [22]:
cols_to_remove = formatted_dataset["train"].column_names
tokenized_dataset = formatted_dataset.map(tokenize_function, batched=True, remove_columns=cols_to_remove)



In [23]:
cols_to_remove

['Unnamed: 0',
 'inputs',
 'answer',
 'system_prompt',
 'user_prompt',
 'task_type',
 'dataset',
 'index_level',
 'conversation_id',
 'text']

# Training

In [24]:
# LoRA configuration: rank-4 adapters on the attention query and value projections
lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    lora_dropout=0.05,
    #target_modules="all-linear",
    #target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model_peft = get_peft_model(model, lora_config)
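As a sanity check on this configuration, the number of trainable parameters can be worked out by hand. The sketch below assumes the layer shapes shown in the model printout (32 decoder layers, `q_proj` 960 to 960, `v_proj` 960 to 320) and uses the total parameter count that `print_trainable_parameters()` reports for this notebook; `lora_param_count` is an illustrative helper.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


r = 4
num_layers = 32
per_layer = (
    lora_param_count(960, 960, r)    # q_proj: 960 -> 960
    + lora_param_count(960, 320, r)  # v_proj: 960 -> 320
)
trainable = per_layer * num_layers
total = 362_230_720  # "all params" as reported by print_trainable_parameters()

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / total:.4f}")
```

With `lora_alpha=16` and `r=4`, the adapter update is scaled by `lora_alpha / r = 4` before being added to the frozen weights.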

In [25]:
model_peft.print_trainable_parameters()

trainable params: 409,600 || all params: 362,230,720 || trainable%: 0.1131


In [26]:
# Training Arguments
training_args = TrainingArguments(
    output_dir="./lora_finetuned",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    optim="adamw_torch_fused",  # Use fused AdamW for efficiency
    learning_rate=1.5e-4,  # Between the 1e-4 and 2e-4 values used in the QLoRA paper
    #fp16=True,
    bf16=True,
    weight_decay=0.005,
    warmup_ratio=0.03,
    save_steps=100,
    logging_steps=100,
    eval_strategy="steps",  # `evaluation_strategy` is the deprecated name of this argument
    report_to="none",
    # Do not pin host memory in the DataLoader:
    dataloader_pin_memory=False,
    max_grad_norm=0.5,
    lr_scheduler_type="cosine",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    remove_unused_columns=False  # Keep dataset columns as-is (only tokenizer outputs remain after tokenization)
)
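To see what these arguments imply for the run, the sketch below estimates the number of optimizer steps and evaluates a linear-warmup plus cosine-decay schedule at a few points. It is an approximation under stated assumptions: a single GPU, 14,195 training rows, and a hand-written `lr_at_step` function that imitates the shape of the scheduler rather than calling the library. The Trainer's own rounding may differ by a step.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


train_rows = 14_195        # rows in the train split
effective_batch = 8 * 2    # per-device batch size x gradient accumulation steps
total_steps = math.ceil(train_rows / effective_batch)  # estimate for one epoch
warmup_steps = math.ceil(0.03 * total_steps)           # warmup_ratio=0.03

print(f"effective batch: {effective_batch}, steps: {total_steps}, warmup: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at_step(step, total_steps, warmup_steps, 1.5e-4):.2e}")
```

Since `eval_steps` is not set, it defaults to `logging_steps`, so evaluation runs every 100 steps. With roughly 888 steps in the single epoch, the last evaluation falls at step 800.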



In [27]:
# Causal-LM collator: pads each batch dynamically and builds labels from input_ids
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=8  # Improve GPU efficiency
)

trainer = Trainer(
    model=model_peft,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

In [28]:
# Execute Training
trainer.train()
print("Training complete!")


Step,Training Loss,Validation Loss
100,1.7629,1.157334
200,1.0828,1.109517
300,1.0627,1.096925
400,1.0754,1.091741
500,1.0682,1.087465
600,1.0687,1.085027
700,1.0471,1.084087
800,1.0349,1.083602


Training complete!
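One way to read the log above is to convert each validation loss (mean cross-entropy per token) into perplexity and to look at how much each evaluation improves on the previous one. The sketch below copies the logged values by hand and is only an illustration of that arithmetic.

```python
import math

# (step, validation loss) pairs copied from the training log above.
eval_log = [
    (100, 1.157334), (200, 1.109517), (300, 1.096925), (400, 1.091741),
    (500, 1.087465), (600, 1.085027), (700, 1.084087), (800, 1.083602),
]

for step, loss in eval_log:
    print(f"step {step}: loss={loss:.4f} perplexity={math.exp(loss):.3f}")

# Improvement between consecutive evaluations shrinks as training converges.
deltas = [prev - cur for (_, prev), (_, cur) in zip(eval_log, eval_log[1:])]
print(f"last improvement: {deltas[-1]:.6f}")
```

The validation loss drops at every evaluation, which is consistent with the early-stopping callback (patience of 3) never triggering during this run.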


## Save & Upload Model to HF

In [29]:
new_model_name = "FinChat-XS"
# Merge the LoRA Adapter with the base model.
merged_model = model_peft.merge_and_unload()
merged_model.save_pretrained(new_model_name)

In [30]:
merged_model.push_to_hub(new_model_name,
                  private=True,
                  use_temp_dir=False)
tokenizer.push_to_hub(new_model_name,
                      private=True,
                      use_temp_dir=False)


CommitInfo(commit_url='https://huggingface.co/oopere/FinChat-XS/commit/7f559b01350e2aa76bbf54e4b1f7dc190aa43d11', commit_message='Upload tokenizer', commit_description='', oid='7f559b01350e2aa76bbf54e4b1f7dc190aa43d11', pr_url=None, repo_url=RepoUrl('https://huggingface.co/oopere/FinChat-XS', endpoint='https://huggingface.co', repo_type='model', repo_id='oopere/FinChat-XS'), pr_revision=None, pr_num=None)

In [31]:
# Push adapter to Hub
model_peft.push_to_hub(
    "qa-adapter" + new_model_name,
    commit_message="Add qa LoRA adapter"
)


No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/oopere/qa-adapterFinChat-XS/commit/5fb337aed385e60341255aefcad101e5943a4767', commit_message='Add qa LoRA adapter', commit_description='', oid='5fb337aed385e60341255aefcad101e5943a4767', pr_url=None, repo_url=RepoUrl('https://huggingface.co/oopere/qa-adapterFinChat-XS', endpoint='https://huggingface.co', repo_type='model', repo_id='oopere/qa-adapterFinChat-XS'), pr_revision=None, pr_num=None)

# Conclusions

Our small financial chat model is now ready.

The entire process took just over 30 minutes (depending on the data selected) on a Google Colab A100 GPU, which is roughly equivalent to an RTX 3090.

During fine-tuning, the quality of the dataset proved to have a significant impact on the model's final behavior. One of the future tasks will therefore be to create a dataset of financial conversations, along with a dataset creation tool.
