<a href="https://colab.research.google.com/github/peremartra/FinLLMOpt/blob/FinChat-XS-Instruct/FinChat-XS/01_Finetuning_FinChat-XS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FinChat-XS
## Fine-Tuning & Merge.


________
* Model: HuggingFaceTB/SmolLM2-360M-Instruct
* Dataset: sujet-ai/Sujet-Finance-Instruct-177k
_________
This notebook simply replicates the process followed to create the Llama-FinChat-XS model.

This notebook walks through the fine-tuning of the SmolLM2-360M-Instruct model, created by Hugging Face, on the Sujet-Finance-Instruct-177k dataset.

This first version of the model is fully functional, but it should be considered a proof of concept until a dataset built specifically for its training is available.
______________
If you’re looking for explanations about fine-tuning LLMs, you can find them in the Fine-Tuning section of the [Large Language Models course](https://github.com/peremartra/Large-Language-Model-Notebooks-Course) that I maintain on GitHub.

# Install and Import Libraries.

In [1]:
!pip install -q transformers==4.47.1
!pip install -q datasets==3.2.0
!pip install -q torch==2.5.1
!pip install -q lm-eval==0.4.7
!pip install -q peft==0.14.0

In [2]:
# Import Libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import load_dataset
from peft import get_peft_model, LoraConfig
from transformers import DataCollatorForLanguageModeling
import re
from transformers import EarlyStoppingCallback

# Load & Process the Dataset

In [3]:
# Load the Dataset
dataset = load_dataset("sujet-ai/Sujet-Finance-Instruct-177k")
dataset["train"] = dataset["train"].filter(lambda x: x["task_type"].strip().lower() == "qa")

In [4]:
split_dataset = dataset["train"].train_test_split(test_size=0.1, seed=2)
# split_dataset now has keys: "train" and "test"

In [5]:
split_dataset["train"][10000]

{'Unnamed: 0': 123865,
 'inputs': 'As a finance expert, your role is to provide clear, concise, and informative responses to finance-related questions. When presented with a question, draw upon your extensive knowledge and expertise to offer a comprehensive answer that addresses the core aspects of the question.\n\nQuestion:\nGenerate a list of 5 arguments for government regulation of business.\n\nAnswer:',
 'answer': '1. To protect consumers from harm from unsafe or unethical business practices\n2. To create a fair playing field by preventing companies from gaining too much market power\n3. To help ensure that businesses are acting in the public interest\n4. To provide safeguards to promote competition and innovation\n5. To minimize the potential for conflicts of interest.',
 'system_prompt': 'As a finance expert, your role is to provide clear, concise, and informative responses to finance-related questions. When presented with a question, draw upon your extensive knowledge and expert

In [6]:
print(split_dataset)

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'inputs', 'answer', 'system_prompt', 'user_prompt', 'task_type', 'dataset', 'index_level', 'conversation_id'],
        num_rows: 34920
    })
    test: Dataset({
        features: ['Unnamed: 0', 'inputs', 'answer', 'system_prompt', 'user_prompt', 'task_type', 'dataset', 'index_level', 'conversation_id'],
        num_rows: 3881
    })
})
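
The row counts above follow from the 10% split: the QA subset holds 34,920 + 3,881 = 38,801 rows. Here is a minimal sketch of that arithmetic. It assumes the test size is rounded up and the train size rounded down, which is the scikit-learn convention and matches the numbers printed here; the helper name `split_sizes` is mine, not part of `datasets`.

```python
import math

def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_rows)          # test side rounds up
    n_train = math.floor((1 - test_size) * n_rows)  # train side rounds down
    return n_train, n_test

n_train, n_test = split_sizes(34920 + 3881, 0.1)
print(n_train, n_test)
```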


## Load Tokenizer & Model

In [7]:
# Load the Base Model and Tokenizer.
model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

In [8]:
# Compute token lengths of the inputs column
lengths = [len(tokenizer.encode(example["inputs"])) for example in dataset["train"]]

# Get max length
max_length = max(lengths)
print(f"Maximum prompt length in tokens: {max_length}")

Maximum prompt length in tokens: 153


In [9]:
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 960, padding_idx=2)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=960, out_features=960, bias=False)
          (k_proj): Linear(in_features=960, out_features=320, bias=False)
          (v_proj): Linear(in_features=960, out_features=320, bias=False)
          (o_proj): Linear(in_features=960, out_features=960, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=960, out_features=2560, bias=False)
          (up_proj): Linear(in_features=960, out_features=2560, bias=False)
          (down_proj): Linear(in_features=2560, out_features=960, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((960,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((960,), eps=1e-05)
      )
    )
    (norm

In [10]:
print(model.dtype)

torch.bfloat16


## Format Prompt for Training.

In [11]:
def format_chat(row):
    user_prompt = row["user_prompt"]

    if re.match(r'^\s*question\b', user_prompt, re.IGNORECASE):
        # Remove everything up to the first colon
        user_prompt = re.sub(r'^\s*question.*?:\s*', '', user_prompt, flags=re.IGNORECASE)

    messages = [
        {"role": "system", "content": """You are FinChat, a specialized finance AI assistant. Provide accurate,
concise information on markets, investments, accounting, and personal finance.
Always clarify when financial information may vary by jurisdiction."""},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": row["answer"]}
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
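
To see what the regular expression in `format_chat` actually removes, the cleaning step can be tried in isolation. This is a small sketch: the function name `strip_question_prefix` and the sample prompts are made up for illustration, while the two patterns are copied from the cell above.

```python
import re

def strip_question_prefix(user_prompt: str) -> str:
    """Remove a leading 'Question ...:' label, mirroring the regex in format_chat."""
    if re.match(r'^\s*question\b', user_prompt, re.IGNORECASE):
        # Drop everything up to and including the first colon on the first line.
        user_prompt = re.sub(r'^\s*question.*?:\s*', '', user_prompt, flags=re.IGNORECASE)
    return user_prompt

# Hypothetical prompts, not taken from the dataset.
samples = [
    "Question:\nWhat is a bond?",
    "question 3: Define liquidity.",
    "OTC Markets, Time, and Trading",
]
for s in samples:
    print(repr(strip_question_prefix(s)))
```

Prompts that do not start with the word "question" pass through untouched, and so does a prompt that starts with it but has no colon on its first line.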

In [12]:
# Apply formatting
formatted_dataset = split_dataset.map(format_chat)

In [13]:
print(formatted_dataset["test"][1000]["text"])

<|im_start|>system
You are FinChat, a specialized finance AI assistant. Provide accurate,
concise information on markets, investments, accounting, and personal finance.
Always clarify when financial information may vary by jurisdiction.<|im_end|>
<|im_start|>user
OTC Markets, Time, and Trading<|im_end|>
<|im_start|>assistant
Something to consider is that in the case of the company you chose, on the OTC market, that stock is thinly traded and with such low volume, it can be easy for it to fluctuate greatly to have trades occur.  This is why volume can matter for some people when it comes to buying shares. Some OTC stocks may have really low volume and thus may have bigger swings than other stocks that have higher volume.<|im_end|>



## Tokenize Dataset.

In [14]:
# Tokenize the dataset, truncating each example to a maximum of 256 tokens.
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, max_length=256)

In [15]:
cols_to_remove = formatted_dataset["train"].column_names
tokenized_dataset = formatted_dataset.map(tokenize_function, batched=True, remove_columns=cols_to_remove)
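
Note that the 153-token maximum measured earlier only covers the `inputs` column, while the training text also contains the system prompt and the answer, so some examples may be cut at 256 tokens. A quick way to quantify that is sketched below. The helper `truncation_report` is hypothetical and the lengths are invented for the example.

```python
def truncation_report(lengths: list[int], max_length: int = 256) -> dict:
    """Summarise how many sequences exceed max_length and how many tokens are lost."""
    over = [n for n in lengths if n > max_length]
    return {
        "n_examples": len(lengths),
        "n_truncated": len(over),
        "fraction_truncated": len(over) / len(lengths) if lengths else 0.0,
        "tokens_dropped": sum(n - max_length for n in over),
    }

# Illustrative lengths only. In this notebook they could be computed with
# [len(tokenizer(t)["input_ids"]) for t in formatted_dataset["train"]["text"]].
example_lengths = [120, 180, 256, 300, 512]
print(truncation_report(example_lengths))
```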



# Training

In [16]:
# LoRA configuration: adapt every attention and MLP projection layer.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.2,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    #target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model_peft = get_peft_model(model, lora_config)

In [17]:
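# print_trainable_parameters() prints the summary itself and returns None,
# so it does not need to be wrapped in print().
model_peft.print_trainable_parameters()

trainable params: 8,683,520 || all params: 370,504,640 || trainable%: 2.3437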
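
The trainable parameter count can be reproduced by hand. Each adapted `Linear` layer gets two LoRA matrices, A (r x in_features) and B (out_features x r), so it adds r * (in_features + out_features) weights. The sketch below takes the layer shapes from the model printout above; the constant and function names are mine.

```python
# (in_features, out_features) of each targeted projection, read from the printed model.
TARGET_SHAPES = {
    "q_proj": (960, 960),
    "k_proj": (960, 320),
    "v_proj": (960, 320),
    "o_proj": (960, 960),
    "gate_proj": (960, 2560),
    "up_proj": (960, 2560),
    "down_proj": (2560, 960),
}
N_LAYERS = 32  # decoder layers in SmolLM2-360M
RANK = 16      # r in LoraConfig

def lora_param_count(shapes: dict, n_layers: int, rank: int) -> int:
    # Each adapted Linear adds A (rank x in) and B (out x rank).
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * n_layers

trainable = lora_param_count(TARGET_SHAPES, N_LAYERS, RANK)
total = 370_504_640  # "all params" as reported by print_trainable_parameters()
print(f"{trainable:,} trainable ({100 * trainable / total:.4f}%)")
```

The result should match the figure reported by PEFT, which is a handy sanity check when changing `r` or `target_modules`.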


In [18]:
# Training Arguments
training_args = TrainingArguments(
    output_dir="./lora_finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    #fp16=True,
    bf16=True,
    weight_decay=0.001,
    warmup_steps=200,
    save_steps=200,
    logging_steps=200,
    eval_strategy="steps",  # `evaluation_strategy` is the deprecated name of this argument
    report_to="none",
    dataloader_pin_memory=False,
    max_grad_norm=0.5,
    lr_scheduler_type="cosine",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    # Harmless here: after tokenization the dataset only holds tokenizer outputs.
    remove_unused_columns=False
)
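
With these arguments each optimizer step sees 32 x 2 = 64 sequences. The sketch below estimates the total number of optimizer steps and the learning rate at a few points of the warmup plus cosine schedule. It assumes a single GPU and the step-counting logic I understand the Trainer to use in this version, so take the exact total as an approximation; the function names are hypothetical.

```python
import math

def total_optimizer_steps(n_examples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Approximate the Trainer's optimizer step count on a single device."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)

def cosine_lr(step: int, total_steps: int, base_lr: float = 2e-4, warmup_steps: int = 200) -> float:
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = total_optimizer_steps(34920, batch_size=32, grad_accum=2, epochs=3)
print("total optimizer steps:", total)
for step in (0, 100, 200, 800, total):
    print(step, f"{cosine_lr(step, total):.2e}")
```

The learning rate climbs linearly to 2e-4 over the first 200 steps and then decays towards zero by the end of the third epoch.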



In [19]:
# Use proper data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=8  # Improve GPU efficiency
)

trainer = Trainer(
    model=model_peft,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)]
)
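
To make clearer what the collator does with `mlm=False`, here is a simplified pure-Python imitation. It is not the library code. It assumes right padding and uses a made-up function name, but it follows the same idea: pad the batch to a multiple of 8, copy `input_ids` into `labels`, and replace every position holding the pad token id with -100 so the loss ignores it.

```python
def collate_causal_lm(batch: list[list[int]], pad_token_id: int, pad_to_multiple_of: int = 8) -> dict:
    """Pad a batch of token-id lists and build labels for causal LM training."""
    longest = max(len(ids) for ids in batch)
    # Round the padded length up to the next multiple.
    target = -(-longest // pad_to_multiple_of) * pad_to_multiple_of
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = target - len(ids)
        padded = ids + [pad_token_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; any position equal to the pad id is ignored (-100).
        labels.append([-100 if tok == pad_token_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Toy batch with invented token ids; id 2 plays the role of pad/eos.
out = collate_causal_lm([[5, 6, 7, 2], [8, 9, 2]], pad_token_id=2)
print(out["labels"])
```

One side effect worth checking: because `pad_token` was set to `eos_token`, the masking is done by token id, so genuine end-of-sequence tokens inside the text are excluded from the loss as well.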

In [20]:
# Execute Training
trainer.train()
print("Training complete!")


Step,Training Loss,Validation Loss
200,1.621,1.309393
400,1.2866,1.284507
600,1.2755,1.275564
800,1.2572,1.270696
1000,1.2616,1.267579
1200,1.2378,1.26594
1400,1.2481,1.265156
1600,1.2385,1.264996


Training complete!
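
Validation loss is easier to interpret as perplexity, which is simply exp(loss). The sketch below converts the logged values and prints the change between evaluations; the numbers are copied from the table above.

```python
import math

# (step, validation loss) pairs copied from the training log above.
eval_log = [
    (200, 1.309393), (400, 1.284507), (600, 1.275564), (800, 1.270696),
    (1000, 1.267579), (1200, 1.26594), (1400, 1.265156), (1600, 1.264996),
]

previous = None
for step, loss in eval_log:
    delta = "" if previous is None else f"  (change {loss - previous:+.6f})"
    print(f"step {step:>4}: loss {loss:.6f}  perplexity {math.exp(loss):.3f}{delta}")
    previous = loss
```

The loss improves at every evaluation, although by smaller and smaller amounts, which suggests the early stopping callback (patience of 5 evaluations) never fired and the run simply completed its three epochs.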


## Save & Upload Model to HF

In [21]:
new_model_name = "FinChat-XS"
# Merge the LoRA Adapter with the base model.
merged_model = model_peft.merge_and_unload()
merged_model.save_pretrained(new_model_name)

In [22]:
merged_model.push_to_hub(new_model_name,
                  private=True,
                  use_temp_dir=False)
tokenizer.push_to_hub(new_model_name,
                      private=True,
                      use_temp_dir=False)

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/oopere/FinChat-XS/commit/0004d11b9169774b17cb75468a963ef50ffea56d', commit_message='Upload tokenizer', commit_description='', oid='0004d11b9169774b17cb75468a963ef50ffea56d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/oopere/FinChat-XS', endpoint='https://huggingface.co', repo_type='model', repo_id='oopere/FinChat-XS'), pr_revision=None, pr_num=None)

In [23]:
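# Push adapter to Hub.
# Note: merge_and_unload() in the previous step folds the LoRA weights into the
# wrapped model in place, so model_peft may no longer contain separate adapter
# weights at this point. If the adapter is needed on its own, it is safer to
# push or save it before merging.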
model_peft.push_to_hub(
    "qa-adapter" + new_model_name,
    commit_message="Add qa LoRA adapter"
)

CommitInfo(commit_url='https://huggingface.co/oopere/qa-adapterFinChat-XS/commit/bcc3bed45fdfe824d1dba950c9c150acd4ba8166', commit_message='Add qa LoRA adapter', commit_description='', oid='bcc3bed45fdfe824d1dba950c9c150acd4ba8166', pr_url=None, repo_url=RepoUrl('https://huggingface.co/oopere/qa-adapterFinChat-XS', endpoint='https://huggingface.co', repo_type='model', repo_id='oopere/qa-adapterFinChat-XS'), pr_revision=None, pr_num=None)

# Conclusions

Our small financial chat model is now ready.

The entire process took just over 30 minutes on a Google Colab A100 GPU, which is roughly equivalent to an RTX 3090.
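
As a quick smoke test, the merged model can be loaded back and asked a question using the same system prompt it was trained with. This is a sketch under a few assumptions: the merged weights sit in the local `FinChat-XS` folder written by `save_pretrained`, the tokenizer is loaded from the base model because only the model was saved locally, and the helper names `build_messages` and `generate_answer` are mine.

```python
SYSTEM_PROMPT = (
    "You are FinChat, a specialized finance AI assistant. Provide accurate,\n"
    "concise information on markets, investments, accounting, and personal finance.\n"
    "Always clarify when financial information may vary by jurisdiction."
)

def build_messages(question: str) -> list[dict]:
    """Build the chat messages with the system prompt used during fine-tuning."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

def generate_answer(question: str, model_dir: str = "FinChat-XS",
                    tokenizer_name: str = "HuggingFaceTB/SmolLM2-360M-Instruct",
                    max_new_tokens: int = 128) -> str:
    # Imported lazily so build_messages can be used without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    model.eval()
    inputs = tokenizer.apply_chat_template(
        build_messages(question), add_generation_prompt=True,
        return_tensors="pt", return_dict=True,
    )
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, not the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Example call (requires the saved model and downloads the tokenizer):
# print(generate_answer("What is the difference between a stock and a bond?"))
```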
