To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### Installation

In [2]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [3]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/Qwen3-0.6B-unsloth-bnb-4bit",
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Qwen 14B 2x faster
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-0.6B-unsloth-bnb-4bit",
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.6.4: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/576M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/707 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

chat_template.jinja:   0%|          | 0.00/4.67k [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)

Unsloth 2025.6.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


<a name="Data"></a>
### Data Prep
Qwen3 has both reasoning and a non reasoning mode. So, we should use 2 datasets:

1. We use the [Open Math Reasoning]() dataset which was used to win the [AIMO](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/leaderboard) (AI Mathematical Olympiad - Progress Prize 2) challenge! We sample 10% of verifiable reasoning traces that used DeepSeek R1, and whicht got > 95% accuracy.

2. We also leverage [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. But we need to convert it to HuggingFace's normal multiturn format as well.

In [None]:
from datasets import load_dataset
# reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
# non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
#non_reasoning_dataset = load_dataset("miriad/miriad-5.8M", split = "train")
non_reasoning_dataset = load_dataset("miriad/miriad-4.4M", split = "train")

In [None]:


non_reasoning_dataset = non_reasoning_dataset.shuffle(seed=42)
non_reasoning_dataset = non_reasoning_dataset.select(range(100000))

Let's see the structure of both datasets:

In [None]:
# reasoning_dataset

In [None]:
non_reasoning_dataset

In [None]:
# prompt: Quiero dividir el datset non_reasoning_dataset en train, valid y test

from datasets import load_dataset

# Load the dataset (assuming it's already loaded as per the previous code)
# non_reasoning_dataset = load_dataset("miriad/miriad-5.8M", split = "train")

# Split the dataset into training, validation, and test sets
# Adjust the test_size and validation_size as needed
train_validtest = non_reasoning_dataset.train_test_split(test_size=0.2) # 80% train, 20% for valid/test
valid_test = train_validtest['test'].train_test_split(test_size=0.5) # Split the 20% into 10% valid and 10% test

train_dataset = train_validtest['train']
valid_dataset = valid_test['train'] # This is now the validation set
test_dataset = valid_test['test']   # This is now the test set

print("Training dataset size:", len(train_dataset))
print("Validation dataset size:", len(valid_dataset))
print("Test dataset size:", len(test_dataset))

We now convert the reasoning dataset into conversational format:

In [None]:
# def generate_conversation(examples):
#     problems  = examples["problem"]
#     solutions = examples["generated_solution"]
#     conversations = []
#     for problem, solution in zip(problems, solutions):
#         conversations.append([
#             {"role" : "user",      "content" : problem},
#             {"role" : "assistant", "content" : solution},
#         ])
#     return { "conversations": conversations, }

In [None]:
# reasoning_conversations = tokenizer.apply_chat_template(
#     reasoning_dataset.map(generate_conversation, batched = True)["conversations"],
#     tokenize = False,
# )

Let's see the first transformed row:

In [None]:
# reasoning_conversations[0]

Next we take the non reasoning dataset and convert it to conversational format as well.

We have to use Unsloth's `standardize_sharegpt` function to fix up the format of the dataset first.

In [8]:
print(train_dataset[10])

{'qa_id': 4719681346802, 'paper_id': 196813468, 'question': 'What are the typical symptoms of heart failure and why might some patients not exhibit early symptoms?', 'answer': "The typical symptoms of heart failure include dyspnea (shortness of breath), fatigue, and edema of the lower extremities (swelling in the legs and ankles). These symptoms arise due to the heart's inability to pump blood efficiently, leading to fluid accumulation in the lungs and peripheral tissues. However, some heart failure patients may not exhibit early symptoms, which could lead to missed diagnoses. This may occur because the heart can compensate for its reduced function initially, and symptoms may only become apparent as the disease progresses or during periods of increased stress on the heart, such as during physical activity or illness.", 'paper_url': 'https://api.semanticscholar.org/CorpusID:196813468', 'paper_title': 'Current understanding of gut microbiota alterations and related therapeutic interventi

In [9]:
from datasets import Dataset

# Tomar solo los primeros 1,000 ejemplos
#subset = non_reasoning_dataset.select(range(10000))

# Convertir cada par pregunta-respuesta en una conversación tipo chat
def qa_to_conversation(example):
    return {
        "conversations": [
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": example["answer"]}
        ]
    }

# Aplicar la transformación al subset
#conversation_dataset = subset.map(qa_to_conversation)
conversation_dataset_train= train_dataset.map(qa_to_conversation)
conversation_dataset_valid= valid_dataset.map(qa_to_conversation)
conversation_dataset_test= test_dataset.map(qa_to_conversation)
# Ahora puedes usar tokenizer.apply_chat_template sobre conversation_dataset["conversations"]


Map:   0%|          | 0/80000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [10]:
print(conversation_dataset_train)

Dataset({
    features: ['qa_id', 'paper_id', 'question', 'answer', 'paper_url', 'paper_title', 'passage_text', 'passage_position', 'year', 'venue', 'specialty', 'conversations'],
    num_rows: 80000
})


In [11]:
from unsloth.chat_templates import standardize_sharegpt
dataset_train = standardize_sharegpt(conversation_dataset_train)
dataset_valid = standardize_sharegpt(conversation_dataset_valid)
dataset_test = standardize_sharegpt(conversation_dataset_test)


non_reasoning_conversations_train = tokenizer.apply_chat_template(
    dataset_train["conversations"],
    tokenize = False,
)

non_reasoning_conversations_valid = tokenizer.apply_chat_template(
    dataset_valid["conversations"],
    tokenize = False,
)

non_reasoning_conversations_test = tokenizer.apply_chat_template(
    dataset_test["conversations"],
    tokenize = False,
)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/80000 [00:00<?, ? examples/s]

  """Create `ConcatenationTable` from list of tables.
  result: list[list[TableBlock]], blocks: list[list[TableBlock]]


Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/10000 [00:00<?, ? examples/s]

  """Create `ConcatenationTable` from list of tables.
  result: list[list[TableBlock]], blocks: list[list[TableBlock]]


Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/10000 [00:00<?, ? examples/s]

  """Create `ConcatenationTable` from list of tables.
  result: list[list[TableBlock]], blocks: list[list[TableBlock]]


Let's see the first row

In [12]:
non_reasoning_conversations_train[0]

'<|im_start|>user\nHow does the socioeconomic status of parents influence the ascertainment and prevalence of autism spectrum disorders (ASD)?\n<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\nThere is evidence to suggest that autism spectrum disorders (ASD) may be underascertained in children of lower social class. Studies have found that autism is more prevalent in groups of higher social class in the United States, even when using active surveillance methods like the Autism and Developmental Disabilities Monitoring (ADDM) network. The socioeconomic status of parents, particularly maternal education, can impact the ascertainment and prevalence of ASD. Understanding these socioeconomic disparities in ASD prevalence is important for ensuring equitable access to diagnosis and support for individuals with ASD, regardless of their socioeconomic background.<|im_end|>\n'

Now let's see how long both datasets are:

In [13]:
# print(len(reasoning_conversations))
print(len(non_reasoning_conversations_train))

80000


The non reasoning dataset is much longer. Let's assume we want the model to retain some reasoning capabilities, but we specifically want a chat model.

Let's define a ratio of chat only data. The goal is to define some mixture of both sets of data.

Let's select 75% reasoning and 25% chat based:

In [None]:
# chat_percentage = 0.25

Let's sample the reasoning dataset by 75% (or whatever is 100% - chat_percentage)

In [None]:
# import pandas as pd
# non_reasoning_subset = pd.Series(non_reasoning_conversations)
# non_reasoning_subset = non_reasoning_subset.sample(
#     int(len(reasoning_conversations)*(chat_percentage/(1 - chat_percentage))),
#     random_state = 2407,
# )
# print(len(reasoning_conversations))
# print(len(non_reasoning_subset))
# print(len(non_reasoning_subset) / (len(non_reasoning_subset) + len(reasoning_conversations)))

Finally combine both datasets:

In [14]:
import pandas as pd
data = pd.concat([
    # pd.Series(reasoning_conversations),
    # pd.Series(non_reasoning_subset)
    pd.Series(non_reasoning_conversations_train)
])
data.name = "text"

from datasets import Dataset
combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed = 3407)



data_valid = pd.concat([
    # pd.Series(reasoning_conversations),
    # pd.Series(non_reasoning_subset)
    pd.Series(non_reasoning_conversations_valid)
])
data_valid.name = "text"

from datasets import Dataset
combined_dataset_valid = Dataset.from_pandas(pd.DataFrame(data_valid))
combined_dataset_valid = combined_dataset.shuffle(seed = 3407)


data_test = pd.concat([
    # pd.Series(reasoning_conversations),
    # pd.Series(non_reasoning_subset)
    pd.Series(non_reasoning_conversations_test)
])
data_test.name = "text"

from datasets import Dataset
combined_dataset_test = Dataset.from_pandas(pd.DataFrame(data_test))
combined_dataset_test = combined_dataset.shuffle(seed = 3407)

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`.

In [None]:
import os
os.cpu_count()

In [22]:
from trl import SFTTrainer, SFTConfig
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = combined_dataset_valid, # Can set up evaluation!
    dataset_num_proc=1,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 1, # Set this for 1 full training run.
        #max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
        eval_strategy="steps",
        eval_steps=10000,
    ),
)


average_tokens_across_devices is set to True but it is invalid when world size is1. Turn it to False automatically.


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/80000 [00:00<?, ? examples/s]

  """Create `ConcatenationTable` from list of tables.
  result: list[list[TableBlock]], blocks: list[list[TableBlock]]


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/80000 [00:00<?, ? examples/s]

  """Create `ConcatenationTable` from list of tables.
  result: list[list[TableBlock]], blocks: list[list[TableBlock]]


In [23]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
0.807 GB of memory reserved.


Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 80,000 | Num Epochs = 1 | Total steps = 5,000
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 20,185,088/600,000,000 (3.36% trained)


Step,Training Loss,Validation Loss


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Qwen-3` team, the recommended settings for reasoning inference are `temperature = 0.6, top_p = 0.95, top_k = 20`

For normal chat based inference, `temperature = 0.7, top_p = 0.8, top_k = 20`

In [None]:
# Import FastLanguageModel
from unsloth import FastLanguageModel

# Load the LoRA model for inference
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # Path to the saved LoRA model
    max_seq_length = 2048,
    dtype = None, # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit = True, # Use 4bit quantization to reduce memory usage
)



In [None]:
messages = [
    {"role" : "user", "content" : "What are the phytochemical compounds found in the root and stem parts of Peristrophe bicalyculata?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = False, # Disable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

In [None]:
messages = [
    {"role" : "user", "content" : "What are the phytochemical compounds found in the root and stem parts of Peristrophe bicalyculata?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = True, # Disable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1024, # Increase for longer outputs!
    temperature = 0.6, top_p = 0.95, top_k = 20, # For thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

In [None]:
# prompt: Quiero añadir que s eme guarde el modelo que estoy finetuniando y que se me descarge depaso
# model.save_pretrained("lora_model")  # Local saving
# tokenizer.save_pretrained("lora_model")
# # model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# # tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

from google.colab import files
# Assuming the model and tokenizer are saved in a directory named 'lora_model'
# You can compress the directory before downloading
!zip -r lora_model.zip lora_model/
files.download('lora_model.zip')


Validate & Test

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
import zipfile
import os
from datasets import load_dataset

# 1. Extraer el zip
zip_path = "/content/miriad_4.4M_dataset.zip"
extract_path = "data/miriad_4.4M_dataset"
try:
  with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)
except:
  %cd '/content/drive/MyDrive/Classroom/TFM'
  with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

# 2. Buscar el archivo .jsonl
jsonl_file = None
for root, _, files in os.walk(extract_path):
    for file in files:
        if file.endswith(".jsonl"):
            jsonl_file = os.path.join(root, file)
            break

assert jsonl_file, "No se encontró un archivo .jsonl."

# 3. Cargar el dataset completo
non_reasoning_dataset = load_dataset("json", data_files=jsonl_file, split="train")

# 4. Aplicar tu lógica personalizada
non_reasoning_dataset = non_reasoning_dataset.shuffle(seed=42)
non_reasoning_dataset = non_reasoning_dataset.select(range(100000))

train_validtest = non_reasoning_dataset.train_test_split(test_size=0.2)
valid_test = train_validtest['test'].train_test_split(test_size=0.5)

train_dataset = train_validtest['train']
valid_dataset = valid_test['train']
test_dataset = valid_test['test']

# 5. Verificar
print("Train size:", len(train_dataset))
print("Valid size:", len(valid_dataset))
print("Test size:", len(test_dataset))
print("Ejemplo test:", test_dataset[0])

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

In [6]:
%pip install evaluate python-Levenshtein sentence-transformers rouge_score bert_score

Collecting python-Levenshtein
  Using cached python_levenshtein-0.27.1-py3-none-any.whl.metadata (3.7 kB)
Collecting bert_score
  Using cached bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Collecting Levenshtein==0.27.1 (from python-Levenshtein)
  Using cached levenshtein-0.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cusolver-cu12==11.6.1.9 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Using cached python_levenshtein-0.27.1-py3-none-any.whl (9.4 kB)
Using cached levenshtein-0.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (161 kB)
Using cached bert_score-0.3.13-py3-none-any.whl (61 kB)
Using cached nvidia_cudnn_cu12-9.1.0.70-py3-none-manylin

In [2]:
from unsloth import FastLanguageModel
from tqdm import tqdm
import evaluate
from Levenshtein import distance as levenshtein_distance
from sentence_transformers import SentenceTransformer, util
import torch

# prompt: ahora quiero cargarlo desde el .zip

!unzip -q lora_model.zip -d lora_model

# Load the model from the extracted zip file
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # Path to the extracted LoRA model
    max_seq_length = 2048,
    dtype = None, # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit = True, # Use 4bit quantization to reduce memory usage
)

# Paso 1: preparar modelo (ya lo cargaste antes)
FastLanguageModel.for_inference(model)

# Paso 2: preparar métricas
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
model_sts = SentenceTransformer("all-MiniLM-L6-v2")

# Paso 3: generar respuestas y recolectar ground truth
references = []
predictions = []

test_dataset = test_dataset.select(range(100))

for example in tqdm(test_dataset, desc="Generando respuestas con LLM"):
    question = example["question"].strip()
    context = example["passage_text"].strip()
    references.append(example["answer"].strip())

    prompt = f"<|im_start|>user\n{question}\n<|im_end|>\n<|im_start|>assistant\n"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=False,
            temperature=0.7,
            top_p=0.95,
            eos_token_id=tokenizer.eos_token_id,
        )

    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extraer solo la respuesta generada (lo que viene tras el último "<|im_start|>assistant")
    generated_answer = decoded.split("<|im_start|>assistant")[-1].strip()
    predictions.append(generated_answer)

# Paso 4: métricas BLEU y ROUGE
bleu_result = bleu.compute(predictions=predictions, references=[[r] for r in references])
rouge_result = rouge.compute(predictions=predictions, references=references)

# Paso 5: métricas Levenshtein y STS
lev_dists = []
sts_scores = []

for pred, ref in tqdm(zip(predictions, references), total=len(predictions), desc="Calculando Levenshtein y STS"):
    lev_dists.append(levenshtein_distance(pred, ref))
    emb_pred = model_sts.encode(pred, convert_to_tensor=True)
    emb_ref = model_sts.encode(ref, convert_to_tensor=True)
    score = util.cos_sim(emb_pred, emb_ref).item()
    sts_scores.append(score)

# Paso 6: Mostrar resultados
print("\n--- Resultados métricas ---")
print(f"BLEU: {bleu_result['bleu']:.4f}")
print(f"ROUGE-L: {rouge_result['rougeL']:.4f}")
print(f"Distancia Levenshtein promedio: {sum(lev_dists)/len(lev_dists):.2f}")
print(f"Similitud semántica (STS) promedio: {sum(sts_scores)/len(sts_scores):.4f}")


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.6.4: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.41G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

Unsloth 2025.6.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Generando respuestas con LLM:   0%|          | 0/100 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Generando respuestas con LLM:   1%|          | 1/100 [00:11<18:57, 11.49s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Generando respuestas con LLM:   2%|▏         | 2/100 [00:14<10:52,  6.66s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Generando respuestas con LLM:   3%|▎         | 3/100 [00:22<11:36,  7.18s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Generando respuestas con LLM:   4%|▍         | 4/100 [00:26<09:21,  5.85s/it]The following g


--- Resultados métricas ---
BLEU: 0.1225
ROUGE-L: 0.2585
Distancia Levenshtein promedio: 460.64
Similitud semántica (STS) promedio: 0.7688





In [7]:
import zipfile
import os
import pandas as pd
from datasets import Dataset

# 1. Extraer el zip (keeping the existing extraction logic)
zip_path = "/content/miriad_4.4M_dataset.zip"
extract_path = "data/miriad_4.4M_dataset"
try:
  with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)
except:
  # This part seems specific to your drive setup, keeping it for now
  %cd '/content/drive/MyDrive/Classroom/TFM'
  with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)


# 2. Buscar el archivo .jsonl
jsonl_file = None
for root, _, files in os.walk(extract_path):
    for file in files:
        if file.endswith(".jsonl"):
            jsonl_file = os.path.join(root, file)
            break

assert jsonl_file, "No se encontró un archivo .jsonl."

# 3. Cargar el dataset usando pandas and then converting to Dataset
print(f"Loading data from: {jsonl_file}")
df = pd.read_json(jsonl_file, lines=True)
non_reasoning_dataset = Dataset.from_pandas(df)


# 4. Aplicar tu lógica personalizada
non_reasoning_dataset = non_reasoning_dataset.shuffle(seed=42)
non_reasoning_dataset = non_reasoning_dataset.select(range(100000))

train_validtest = non_reasoning_dataset.train_test_split(test_size=0.2)
valid_test = train_validtest['test'].train_test_split(test_size=0.5)

train_dataset = train_validtest['train']
valid_dataset = valid_test['train']
test_dataset = valid_test['test']

# 5. Verificar
print("Train size:", len(train_dataset))
print("Valid size:", len(valid_dataset))
print("Test size:", len(test_dataset))
print("Ejemplo test:", test_dataset[0])

Loading data from: data/miriad_4.4M_dataset/miriad_4.4M.jsonl
Train size: 80000
Valid size: 10000
Test size: 10000
Ejemplo test: {'qa_id': 362292619013, 'paper_id': 22926190, 'question': 'What is the relationship between connexin gene mutations and cardiac arrhythmias?\n', 'answer': 'Nucleotide substitutions in connexin genes, such as the GJA1 gene that codes for connexin 43, have been hypothesized to serve as a potential arrhythmogenic substrate. However, it is important to note that known patients with oculodentodigital dysplasia, which is caused by connexin gene mutations, do not typically exhibit a cardiac arrhythmia phenotype. Further research is needed to fully understand the relationship between connexin gene mutations and cardiac arrhythmias.', 'paper_url': 'https://api.semanticscholar.org/CorpusID:22926190', 'paper_title': 'A novel GJA1 mutation causing familial oculodentodigital dysplasia with dilated cardiomyopathy and arrhythmia', 'passage_text': 'Her echocardiogram demonst