In [13]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# **Dependency Installation for Fine-Tuning Qwen2-0.5B**
This section installs the dependencies for fine-tuning the Qwen2-0.5B model: **PyTorch (2.3.0+cu121)** for deep learning, **Unsloth (2025.3.9)** for efficient model fine-tuning, **Transformers (4.48.3)** for handling pre-trained models, **Datasets (2.19.0)** for managing training data, and **NumPy (1.26.4)** for numerical operations. The pins are not fully consistent with each other: the install log shows Unsloth 2025.3.9 requiring `torch>=2.4.0` (later cells report Torch 2.6.0+cu124), and `trl 0.15.2` requiring `datasets>=2.21.0` while `datasets 2.19.0` is installed.
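Because a later `pip install` can replace an earlier pin (the pip log for the install cell reports `torch>=2.4.0` being collected for Unsloth), it can help to check which versions are actually installed once the cell has run. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages pinned in the install cell, plus trl, which the pip log mentions.
packages = ["torch", "unsloth", "transformers", "datasets", "numpy", "trl"]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this after the install cell (and after a runtime restart, if Colab asks for one) shows whether the pinned versions survived dependency resolution.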

In [26]:
!pip install torch==2.3.0+cu121 -f https://download.pytorch.org/whl/torch_stable.html
!pip install unsloth==2025.3.9
!pip install transformers==4.48.3
!pip install datasets==2.19.0
!pip install numpy==1.26.4

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==2.3.0+cu121
  Using cached https://download.pytorch.org/whl/cu121/torch-2.3.0%2Bcu121-cp311-cp311-linux_x86_64.whl (781.0 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.3.0+cu121)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.3.0+cu121)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.3.0+cu121)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch==2.3.0+cu121)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.3.0+cu121)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_

Collecting torch>=2.4.0 (from unsloth==2025.3.9)
  Using cached torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting triton>=3.0.0 (from unsloth==2025.3.9)
  Using cached triton-3.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.4.0->unsloth==2025.3.9)
  Using cached nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.4.0->unsloth==2025.3.9)
  Using cached nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.4.0->unsloth==2025.3.9)
  Using cached nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.4.0->unsloth==2025.3.9)
  Using cached nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 

Collecting datasets==2.19.0
  Using cached datasets-2.19.0-py3-none-any.whl.metadata (19 kB)
Using cached datasets-2.19.0-py3-none-any.whl (542 kB)
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 3.3.2
    Uninstalling datasets-3.3.2:
      Successfully uninstalled datasets-3.3.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
trl 0.15.2 requires datasets>=2.21.0, but you have datasets 2.19.0 which is incompatible.
Successfully installed datasets-2.19.0




This script prepares the training data from Markdown (`.md`) files containing DeepSeek-related information. It reads five `.md` files from a Drive directory, splits each file's text on whitespace into **200-word chunks** (the last chunk of a file may be shorter), **converts the chunks into a Hugging Face dataset**, and splits it into an **80/20 train-test ratio** with a fixed seed. Finally, both splits are saved to Google Drive for use by the training and evaluation cells.
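To make the chunking behaviour concrete, here is a small sketch that applies the same whitespace-split logic to a synthetic 450-word string. The function name `split_words` and the toy input are illustrative only; the notebook's own `split_into_chunks` works on a list of `{"text": ...}` records instead.

```python
import math


def split_words(text, chunk_size=200):
    """Split text on whitespace into chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Synthetic document with 450 "words".
toy_text = " ".join(f"w{i}" for i in range(450))
chunks = split_words(toy_text, chunk_size=200)
sizes = [len(chunk.split()) for chunk in chunks]

print(sizes)  # [200, 200, 50]
print(len(chunks) == math.ceil(450 / 200))  # True
```

Two consequences follow from this design: chunks never span two files, and splitting on whitespace discards the original line breaks, so Markdown structure such as headings and lists is flattened into running text.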

In [25]:
import os
from datasets import Dataset
import numpy as np

def read_md_files(directory="/content/drive/MyDrive/intellihack/md_files/"):
    data = []
    md_files = [
        "dataset.md",
        "deepseekv3-explained.md",
        "deepseekv3-cost-explained.md",
        "design-notes-3fs.md",
        "open-source-week.md"
    ]
    for filename in md_files:
        file_path = os.path.join(directory, filename)
        if os.path.exists(file_path):
            with open(file_path, "r", encoding="utf-8") as file:
                content = file.read().strip()
                if content:
                    data.append({"text": content})
        else:
            print(f"Warning: file not found: {file_path}")
    return data

def split_into_chunks(data, chunk_size=200):
    chunked_data = []
    for entry in data:
        text = entry["text"]
        words = text.split()
        for i in range(0, len(words), chunk_size):
            chunk = " ".join(words[i:i + chunk_size])
            chunked_data.append({"text": chunk})
    return chunked_data

def main():
    md_directory = "/content/drive/MyDrive/intellihack/md_files/"
    md_data = read_md_files(md_directory)
    if not md_data:
        raise ValueError("No valid .md files found!")
    print(f"Loaded {len(md_data)} .md files.")
    chunked_data = split_into_chunks(md_data, chunk_size=200)
    print(f"Created {len(chunked_data)} chunks.")
    dataset = Dataset.from_list(chunked_data)
    train_test_split = dataset.train_test_split(test_size=0.2, seed=42)
    train_dataset = train_test_split["train"]
    test_dataset = train_test_split["test"]
    print(f"Train size: {len(train_dataset)}, Test size: {len(test_dataset)}")

    # Save to the Drive-----------

    train_dataset.save_to_disk("/content/drive/MyDrive/intellihack/dataset/train")
    test_dataset.save_to_disk("/content/drive/MyDrive/intellihack/dataset/test")
    print("Dataset saved to Drive at '/content/drive/MyDrive/intellihack/dataset/'.")

if __name__ == "__main__":
    main()

Loaded 5 .md files.
Created 54 chunks.
Train size: 43, Test size: 11


Saving the dataset (0/1 shards):   0%|          | 0/43 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/11 [00:00<?, ? examples/s]

Dataset saved to Drive at '/content/drive/MyDrive/intellihack/dataset/'.
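The 43/11 split reported above can be reproduced by hand. The sketch below assumes that `train_test_split` rounds the test share up and the train share down (the scikit-learn convention); that rounding rule is an assumption about the library's behaviour rather than something shown in this notebook, although it matches the sizes printed here.

```python
import math


def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test) assuming ceil for test and floor for train."""
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor((1 - test_size) * n_samples)
    return n_train, n_test


print(split_sizes(54))  # (43, 11)
```

With only 11 test chunks, each chunk contributes about 9% of the evaluation loss, so the reported metrics are sensitive to which chunks land in the test split.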


In [16]:
# train_model_gpu.py-----
import torch
import unsloth

from transformers import Trainer, TrainingArguments
from unsloth import FastLanguageModel

from datasets import load_from_disk


# Step 1: Load the dataset from Drive---------

train_dataset = load_from_disk("/content/drive/MyDrive/intellihack/dataset/train")
test_dataset = load_from_disk("/content/drive/MyDrive/intellihack/dataset/test")

# Debug: Check dataset sizes and features---------

print(f"Loaded train dataset with {len(train_dataset)} examples")
#print(f"Loaded test dataset with {len(test_dataset)} examples")
#print("Train dataset features before tokenization:", train_dataset.features)
print("Test dataset features before tokenization:", test_dataset.features)

# Step 2: Load model and tokenizer (GPU settings)---------

model_name = "Qwen/Qwen2-0.5B"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name,
    max_seq_length=128,
    dtype=torch.float16,  # Use fp16 with T4 GPU
    load_in_4bit=True     # Enable 4-bit for efficiency
)

# Step 3: Define tokenization function with labels-------
def tokenize_function(examples):
    tokenized = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)
    tokenized["labels"] = tokenized["input_ids"].copy()  # Add labels as a copy of input_ids
    return tokenized

# Tokenize the datasets--------------
print("Tokenizing train dataset...")
train_dataset = train_dataset.map(tokenize_function, batched=True)
print("Tokenizing test dataset...")
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Remove 'text' column and set format to torch
train_dataset = train_dataset.remove_columns(["text"])
train_dataset.set_format("torch")
test_dataset = test_dataset.remove_columns(["text"])
test_dataset.set_format("torch")

# Debug: Verify dataset columns after tokenization----------

print("Train dataset features after tokenization:", train_dataset.features)


# Step 4: Configure model with LoRA------------

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
)

# Step 5: Define training arguments------------

training_args = TrainingArguments(

    output_dir="/content/drive/MyDrive/intellihack/qwen_finetuned",
    num_train_epochs=3,  # Not used at run time: max_steps=30 below takes precedence

    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=10,
    learning_rate=2e-4,
    fp16=True,
    max_steps=30,
    report_to="none",
)

# Step 6: Initialize and train the model--------

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

print("Starting training on T4 GPU...")
trainer.train()

# Step 7: Save the model to drive------------

model.save_pretrained("/content/drive/MyDrive/intellihack/qwen_finetuned")
tokenizer.save_pretrained("/content/drive/MyDrive/intellihack/qwen_finetuned")

# Step 8: Quantize and save as GGUF to drive---------

model.save_pretrained_gguf(
    "/content/drive/MyDrive/intellihack/qwen_finetuned_gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

print("Training complete! Model saved to '/content/drive/MyDrive/intellihack/qwen_finetuned' and GGUF saved to '/content/drive/MyDrive/intellihack/qwen_finetuned_gguf'.")

Loaded train dataset with 43 examples
Loaded test dataset with 11 examples
Train dataset features before tokenization: {'text': Value(dtype='string', id=None)}
Test dataset features before tokenization: {'text': Value(dtype='string', id=None)}
==((====))==  Unsloth 2025.3.9: Fast Qwen2 patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Tokenizing train dataset...
Tokenizing test dataset...


Map:   0%|          | 0/11 [00:00<?, ? examples/s]

Train dataset features after tokenization: {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
Test dataset features after tokenization: {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
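In the tokenization step, `labels` is a plain copy of `input_ids`, so for any chunk shorter than 128 tokens the padding positions also count toward the loss. A common alternative is to set the labels at padded positions to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists with made-up token IDs; `mask_padding` is a hypothetical helper, not part of the notebook.

```python
IGNORE_INDEX = -100  # Default ignore_index of PyTorch's cross-entropy loss


def mask_padding(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids to labels, replacing padded positions with ignore_index."""
    return [
        token if keep == 1 else ignore_index
        for token, keep in zip(input_ids, attention_mask)
    ]


# Made-up example: three real tokens followed by two padding tokens.
example_ids = [101, 102, 103, 0, 0]
example_mask = [1, 1, 1, 0, 0]

labels = mask_padding(example_ids, example_mask)
print(labels)  # [101, 102, 103, -100, -100]
```

Inside a batched `tokenize_function`, the same helper could be applied per example, for instance `[mask_padding(ids, mask) for ids, mask in zip(tokenized["input_ids"], tokenized["attention_mask"])]`. Whether this changes the reported loss noticeably here depends on how many chunks are shorter than 128 tokens.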


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 43 | Num Epochs = 6 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 8,798,208/323,917,696 (2.72% trained)


Starting training on T4 GPU...


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 3.533108        |
| 2     | 4.423400      | 3.47379         |
| 3     | 4.423400      | 3.440651        |
| 4     | 3.526100      | 3.414429        |
| 5     | 3.422000      | 3.404378        |
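The Unsloth banner reports `Num Epochs = 6` even though `num_train_epochs=3` was passed, because `max_steps=30` takes precedence and the number of epochs is derived from it. The arithmetic can be sketched as follows, assuming the Trainer counts optimizer updates per epoch as the number of batches divided by the accumulation steps, rounded down; that rounding is my reading of the Trainer's behaviour, not something printed in the log.

```python
import math


def schedule(num_examples, per_device_batch, grad_accum, max_steps, num_devices=1):
    """Return (effective batch size, optimizer updates per epoch, epochs needed)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    epochs = math.ceil(max_steps / updates_per_epoch)
    effective_batch = per_device_batch * grad_accum * num_devices
    return effective_batch, updates_per_epoch, epochs


# Values from the training cell: 43 examples, batch 2, accumulation 4, 30 steps.
print(schedule(43, 2, 4, 30))  # (8, 5, 6)
```

Under these assumptions, 30 steps correspond to six passes over the 43 training chunks, which matches the banner's `Total batch size (2 x 4 x 1) = 8` and `Num Epochs = 6`.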


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.74 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 24/24 [00:00<00:00, 113.29it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving /content/drive/MyDrive/intellihack/qwen_finetuned_gguf/pytorch_model.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at /content/drive/MyDrive/intellihack/qwen_finetuned_gguf into f16 GGUF format.
The output location will be /content/drive/MyDrive/intellihack/qwen_finetuned_gguf/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: qwen_finetuned_gguf
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model.bin'
INFO:h

## **Testing the Model**

In [18]:
import torch
from unsloth import FastLanguageModel

# Load model (already confirmed working)--------
model_path = "/content/drive/MyDrive/intellihack/qwen_finetuned"
model, tokenizer = FastLanguageModel.from_pretrained(model_path, dtype=torch.float16, load_in_4bit=True)
FastLanguageModel.for_inference(model)

# Test prompts---------
prompts = [

    "What is FP8 mixed precision training, and how does it benefit DeepSeek-V3?",
    "Describe the role of the DualPipe algorithm in DeepSeek's training framework.",
    "How does DeepSeek-V3 achieve 45x training efficiency compared to other LLMs?",
    "What is the Fire-Flyer File System?"
]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9)
    print(f"Prompt: {prompt}\nGenerated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}\n")

==((====))==  Unsloth 2025.3.9: Fast Qwen2 patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Prompt: What is FP8 mixed precision training, and how does it benefit DeepSeek-V3?
Generated: What is FP8 mixed precision training, and how does it benefit DeepSeek-V3? FP8 mixed precision training is a technique used in DeepSeek-V3, which improves model training performance by using FP8 floating point support. FP8 supports a wider range of operations, including arithmetic, bitwise, and logical operations. This allows for more efficient computation and optimization, resulting in faster training times. The use of FP8 also enables efficient

## **Evaluate Training Effectiveness**

In [19]:
import torch
from transformers import Trainer, TrainingArguments
from unsloth import FastLanguageModel
from datasets import load_from_disk

from google.colab import drive
drive.mount('/content/drive')

# Load the fine-tuned model and the tokenizer---------

model_path = "/content/drive/MyDrive/intellihack/qwen_finetuned"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_path,
    dtype=torch.float16,
    load_in_4bit=True
)
print("Model loaded successfully!")

# Load raw test dataset----------

test_dataset = load_from_disk("/content/drive/MyDrive/intellihack/dataset/test")
print(f"Loaded the test dataset with {len(test_dataset)} examples")
print("Features before the tokenization:", test_dataset.features)

# Tokenize the dataset (same as training)--------

def tokenize_function(examples):
    tokenized = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)
    tokenized["labels"] = tokenized["input_ids"].copy()  # Add labels
    return tokenized

print("Tokenizing test dataset...")
test_dataset = test_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.remove_columns(["text"])
test_dataset.set_format("torch")
print("Features after the tokenization:", test_dataset.features)

# Define the evaluation args---------

eval_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/intellihack/eval_temp",
    per_device_eval_batch_size=2,
    fp16=True,
    report_to="none"
)

# Trainer for evaluation----------

trainer = Trainer(

    args=eval_args,
    model=model,
    eval_dataset=test_dataset
)

# Evaluate----------

eval_results = trainer.evaluate()
perplexity = torch.exp(torch.tensor(eval_results["eval_loss"]))
print(f"Eval Loss: {eval_results['eval_loss']}, Perplexity: {perplexity.item()}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
==((====))==  Unsloth 2025.3.9: Fast Qwen2 patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model loaded successfully!
Loaded the test dataset with 11 
Features before the tokenization: {'text': Value(dtype='string', id=None)}
Tokenizing test dataset...


Map:   0%|          | 0/11 [00:00<?, ? examples/s]

Features after the tokenization: {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}


Eval Loss: 3.4043779373168945, Perplexity: 30.09556770324707
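Perplexity is the exponential of the mean cross-entropy loss, so the figure above can be checked without loading the model. This short sketch recomputes it from the logged loss using the standard library; small differences in the last digits relative to the `torch` result are expected because `torch.tensor` uses 32-bit floats by default.

```python
import math

eval_loss = 3.4043779373168945  # Value printed by the evaluation cell
perplexity = math.exp(eval_loss)

print(f"Perplexity: {perplexity:.2f}")  # Perplexity: 30.10
```

A perplexity of about 30 means that, on average, the model is roughly as uncertain as if it were choosing uniformly among 30 tokens at each position of the test chunks.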


## **Training Recap**
The fine-tuning process was conducted on the **Qwen2-0.5B** model using **LoRA** with parameters **r=16** and **lora_alpha=16**. The dataset consisted of the five Markdown files read by the preprocessing cell (dataset.md, deepseekv3-explained.md, deepseekv3-cost-explained.md, design-notes-3fs.md, open-source-week.md), split into 200-word chunks, resulting in 43 training and 11 test examples; each chunk was truncated to 128 tokens at tokenization. Training was executed for 30 steps with an effective batch size of 8 (2 per device, 4 gradient accumulation steps) on an Nvidia T4 GPU, using fp16 precision and a learning rate of 2e-4. Over the logged epochs, the training loss decreased from 4.42 to 3.42 and the validation loss from 3.53 to 3.40; the final evaluation loss was 3.404 (perplexity of about 30.1). The fine-tuned model was saved at /content/drive/MyDrive/intellihack/qwen_finetuned/, and the GGUF quantized version was stored at /content/drive/MyDrive/intellihack/qwen_finetuned_gguf/.
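The trainable-parameter count in the training banner (8,798,208) can be reproduced from the LoRA settings. Each adapted linear layer adds two low-rank matrices with `r * (in_features + out_features)` parameters in total. The model dimensions below (hidden size 896, MLP size 4864, 24 layers, key/value projection width 128) are assumptions about the Qwen2-0.5B configuration and are not stated anywhere in this notebook.

```python
def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Assumed Qwen2-0.5B dimensions (not taken from the notebook).
hidden, intermediate, layers = 896, 4864, 24
kv_dim = 128  # Assumed: 2 key/value heads x head dimension 64
r = 16        # LoRA rank used in the training cell

# (in_features, out_features) for each target module in one decoder layer.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * layers

print(total)  # 8798208
print(f"{100 * total / 323_917_696:.2f}%")  # 2.72%
```

Under these assumed dimensions the total matches the banner, and the three MLP projections account for roughly three quarters of the adapter parameters.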


In [20]:
import torch
from unsloth import FastLanguageModel

# Load model
model_path = "/content/drive/MyDrive/intellihack/qwen_finetuned"

model, tokenizer = FastLanguageModel.from_pretrained(model_path, dtype=torch.float16, load_in_4bit=True)
print("Model loaded successfully!")

==((====))==  Unsloth 2025.3.9: Fast Qwen2 patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model loaded successfully!


In [29]:
from transformers import Trainer, TrainingArguments
from datasets import load_from_disk

# Load and tokenize test dataset
test_dataset = load_from_disk("/content/drive/MyDrive/intellihack/dataset/test")
print(f"Loaded test dataset with {len(test_dataset)} examples")

def tokenize_function(examples):
    tokenized = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

test_dataset = test_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.remove_columns(["text"])
test_dataset.set_format("torch")
print("Test dataset features:", test_dataset.features)

# Evaluation args
eval_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/intellihack/eval_temp",
    per_device_eval_batch_size=2,
    fp16=True,
    report_to="none"
)

# Trainer for evaluation
trainer = Trainer(model=model, args=eval_args, eval_dataset=test_dataset)
eval_results = trainer.evaluate()
varpl = torch.exp(torch.tensor(eval_results["eval_loss"]))
print(f"Eval Loss: {eval_results['eval_loss']}, perp: {varpl.item()}")

Loaded test dataset with 11 examples
Test dataset features: {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}


Eval Loss: 3.4043779373168945, perp: 30.09556770324707


In [24]:
# Enable inference
FastLanguageModel.for_inference(model)

# Test prompts
prompts = [
    "Describe the role of the DualPipe algorithm in DeepSeek's training framework.",
    "How does DeepSeek-V3 achieve 45x training efficiency compared to other LLMs?",
    "What is the Fire-Flyer File System?"
]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9)
    print(f"Prompt: {prompt}\nGenerated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}\n")

Prompt: Describe the role of the DualPipe algorithm in DeepSeek's training framework.
Generated: Describe the role of the DualPipe algorithm in DeepSeek's training framework. The DualPipe algorithm is a key component in DeepSeek's training framework. It allows the network to scale to larger datasets by using a combination of two GPUs. The first GPU is used for forward propagation, while the second GPU is used for backward propagation. During forward propagation, the first GPU processes the forward pass on the current data batch, while the second GPU processes the backward pass on the previous data batch. This allows the network to learn from both data streams and to avoid the computational overhead of batching data. During backward propagation, the first GPU processes the backward pass on the current data batch, while the second GPU processes the forward pass on the previous data batch. This allows the network to avoid the computational overhead of computing gradients on both data stre

This notebook documents the fine-tuning of Qwen2-0.5B to answer questions about DeepSeek's infrastructure. The training data came from **five `.md` files**, which were split into **200-word chunks** for manageable processing. In total, **54 chunks** were created and divided into an **80/20 train-test split**, with **43 used for training** and **11 for testing**. The fine-tuning process used **LoRA (Low-Rank Adaptation)** for efficiency. Regarding **chat history maintenance**, the current model generates **single-turn responses**; a future improvement could add **context retention via prompt engineering**. For **cost-effectiveness**, a **free T4 GPU** in Colab was used (**14.741 GB** of GPU memory according to the Unsloth banner), and **4-bit quantization** reduced the model's memory footprint. Because the model is **small (0.5B parameters)** and was trained for just **30 steps**, fine-tuning completed in approximately **20 minutes**. Additional optimizations included **Unsloth's 2x faster fine-tuning** and **quantization to GGUF** (`q4_k_m`), making deployment lightweight. Finally, the **evaluation metrics** on the 11 test chunks, a **loss of 3.404 and a perplexity of about 30.1**, are reported in the evaluation cell above.
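As one possible form of context retention via prompt engineering, earlier turns can be prepended to the new question before it is passed to `tokenizer` and `model.generate`. The sketch below is a hypothetical illustration: the `User:`/`Assistant:` layout, the `build_prompt` name, and the `max_turns` limit are assumptions, and the fine-tuned model was not trained on this format.

```python
def build_prompt(history, user_message, max_turns=3):
    """Build a single prompt string from recent (user, assistant) turns.

    history is a list of (user_text, assistant_text) tuples, oldest first.
    Only the last max_turns turns are kept to limit prompt length.
    """
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = []
    for user_text, assistant_text in recent:
        lines.append(f"User: {user_text}")
        lines.append(f"Assistant: {assistant_text}")
    lines.append(f"User: {user_message}")
    lines.append("Assistant:")
    return "\n".join(lines)


history = [("What is 3FS?", "3FS is the Fire-Flyer File System.")]
prompt = build_prompt(history, "Who develops it?")
print(prompt)
```

The resulting string can be used in place of a single-turn prompt in the inference cell. Since training used 128-token sequences, keeping `max_turns` small is probably sensible, and answer quality with longer contexts would need to be checked empirically.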

### Instructions to Run
1. **Mount Drive**: `from google.colab import drive; drive.mount('/content/drive')`
2. **Install Dependencies**: Run the setup cell above.
3. **Load Model**: Run the model loading cell.
4. **Inference**: Use the inference cell with your prompt, e.g., `inputs = tokenizer("Your prompt", return_tensors="pt").to("cuda"); outputs = model.generate(**inputs, max_new_tokens=200)`.
5. **Files**: Model at `/content/drive/MyDrive/intellihack/qwen_finetuned/`, GGUF at `/content/drive/MyDrive/intellihack/qwen_finetuned_gguf/`.