# 🚀 Chunked Training - Full 154K Samples

## Medical Data Mining Project - Qwen2.5-0.5B LoRA Training

**Strategy**: Train on the full dataset in chunks to avoid out-of-memory (OOM) errors

**Expected Results**:
- Training time: 2.5-3 hours (T4) or 1-1.5 hours (A100)
- Final accuracy: 85-90% on Test_sample.v1.0.csv
- OOM errors: none expected with chunking

---

## 1️⃣ Setup - Check GPU & Install Dependencies

In [None]:
# Check GPU
!nvidia-smi

import torch
print(f"\n✓ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✓ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✓ Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

In [None]:
# Install dependencies
print("📦 Installing dependencies...")
!pip install -q transformers peft datasets accelerate bitsandbytes
print("✓ Dependencies installed!")

## 2️⃣ Clone Repository from GitHub

In [None]:
# Clone repository
!git clone https://github.com/phucfix/medical-data-mining.git
%cd medical-data-mining

print("\n✓ Repository cloned!")
!ls -lh

## 3️⃣ Verify Data Files

In [None]:
import os
import json

# Check files
train_file = "data/slm_train_style_adapted.jsonl"
val_file = "data/slm_val.jsonl"

print("📁 Checking data files...")
print(f"Train file exists: {os.path.exists(train_file)}")
print(f"Val file exists: {os.path.exists(val_file)}")

if os.path.exists(train_file):
    size_mb = os.path.getsize(train_file) / (1024*1024)
    print(f"\nTrain file size: {size_mb:.2f} MB")

    # Count non-empty lines (one JSON record per line)
    with open(train_file, 'r', encoding='utf-8') as f:
        count = sum(1 for line in f if line.strip())
    print(f"Train samples: {count:,}")

    # Show the first record
    with open(train_file, 'r', encoding='utf-8') as f:
        sample = json.loads(f.readline())
    print("\nSample data:")
    print(f"  Input: {sample['input'][:100]}...")
    print(f"  Output: {sample['output']}")
else:
    print("\n❌ ERROR: Training data file not found!")
    print("Please upload data files or check GitHub repository.")
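
The check above only inspects the first record. A fuller pass over the file could look like the sketch below. It assumes every record is a JSON object with `input` and `output` keys, as in the sample shown above; `validate_jsonl` is a hypothetical helper and not part of the repository.

```python
import json

def validate_jsonl(path, required_keys=("input", "output")):
    """Return (valid_count, problems) for a JSONL file.

    problems is a list of (line_number, message) tuples.
    """
    valid = 0
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines, matching the sample count above
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_no, "record is not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((line_no, f"missing keys: {missing}"))
            else:
                valid += 1
    return valid, problems
```

Calling `validate_jsonl(train_file)` returns the number of usable records and a list of problem lines, which makes it easy to catch a truncated upload before starting a multi-hour run.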

## 4️⃣ (Optional) Upload Data from Local if GitHub doesn't have it

In [None]:
# ONLY RUN THIS IF DATA FILES ARE NOT IN GITHUB

from google.colab import files

print("📁 Upload slm_train_style_adapted.jsonl:")
uploaded = files.upload()

# Move to data folder (create it first in case it does not exist)
!mkdir -p data
!mv slm_train_style_adapted.jsonl data/

print("\n✓ Data uploaded!")

## 5️⃣ Run Chunked Training 🚀

**This will take 2.5-3 hours on T4 GPU**

Monitor the output to see progress:
- Chunk 1/6 → 2/6 → ... → 6/6
- Each chunk takes ~25-30 minutes

**DO NOT INTERRUPT!** If the run is interrupted, you'll need to restart from the beginning.

In [None]:
# Run chunked training
!python src/train_slm_qwen_lora_v4_chunked.py

## 6️⃣ Check Training Results

In [None]:
# Check output directory
model_dir = "models/qwen2.5-0.5b-med-slm-lora-v4-chunked"

print("📁 Checking model directory...")
if os.path.exists(model_dir):
    print("✓ Model saved successfully!")
    !ls -lh {model_dir}

    # Read metrics
    metrics_file = os.path.join(model_dir, "metrics.json")
    if os.path.exists(metrics_file):
        with open(metrics_file, 'r') as f:
            metrics = json.load(f)
        print("\n📊 Training Metrics:")
        for key, value in metrics.items():
            print(f"  {key}: {value}")
else:
    print("❌ Model directory not found. Training may have failed.")

## 7️⃣ Download Model

In [None]:
# Zip model for download
print("📦 Creating zip file...")
!zip -r qwen_v4_chunked_full.zip models/qwen2.5-0.5b-med-slm-lora-v4-chunked/

# Check size
zip_size = os.path.getsize('qwen_v4_chunked_full.zip') / (1024*1024)
print(f"\n✓ Zip file created: {zip_size:.2f} MB")

# Download (this triggers the browser download)
print("\n📥 Downloading...")
from google.colab import files
files.download('qwen_v4_chunked_full.zip')

print("\n✓ Download started!")
print("\nNext steps:")
print("1. Extract zip file on your local machine")
print("2. Run: python src/test_qwen_on_sample_v3.py")
print("3. Expected accuracy: 85-90% on Test_sample.v1.0.csv")

## 8️⃣ (Optional) Test on Sample Data

If you have Test_sample.v1.0.csv on Colab, you can test here:

In [None]:
# Upload Test_sample.v1.0.csv if needed
# from google.colab import files
# uploaded = files.upload()

# Run evaluation
# !python src/test_qwen_on_sample_v3.py
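
The evaluation script itself is not shown in this notebook. As a rough illustration of what an accuracy figure like the expected 85-90% measures, here is a minimal sketch of exact-match accuracy. It assumes predictions and reference answers are plain strings compared after trimming whitespace and lowercasing; `exact_match_accuracy` and the toy values are hypothetical and are not results from this project.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference (case and whitespace insensitive)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)

# Toy values for illustration only (not real model output)
toy_predictions = ["Yes", "no ", "yes", "no"]
toy_references = ["yes", "no", "no", "no"]
toy_accuracy = exact_match_accuracy(toy_predictions, toy_references)
print(f"Toy accuracy: {toy_accuracy:.2%}")
```

The actual script may normalize answers differently, so treat its reported number as authoritative.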

---

## 📊 Summary

**Training Configuration**:
- Total samples: 154,477
- Number of chunks: 6
- Training strategy: Chunked with weight accumulation
- LoRA config: r=32, alpha=64
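
To make the chunk sizes concrete, the sketch below splits 154,477 samples into 6 contiguous chunks. How `train_slm_qwen_lora_v4_chunked.py` actually partitions the data is not shown here, so equal-sized contiguous chunks are an assumption, and `chunk_bounds` is a hypothetical helper.

```python
import math

def chunk_bounds(total_samples, num_chunks):
    """Return (start, end) index pairs covering range(total_samples); end is exclusive."""
    if total_samples <= 0 or num_chunks <= 0:
        return []
    chunk_size = math.ceil(total_samples / num_chunks)
    return [
        (start, min(start + chunk_size, total_samples))
        for start in range(0, total_samples, chunk_size)
    ]

bounds = chunk_bounds(154_477, 6)
for i, (start, end) in enumerate(bounds, start=1):
    print(f"Chunk {i}/{len(bounds)}: samples {start:,}-{end - 1:,} ({end - start:,} samples)")
```

Under this assumption the first five chunks hold 25,747 samples each and the last holds 25,742. At roughly 25-30 minutes per chunk, six chunks account for the 2.5-3 hour estimate on a T4.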

**Expected Results**:
- Training time: 2.5-3 hours (T4) or 1-1.5 hours (A100)
- Final accuracy: 85-90%
- Improvement from v2: +16-21 percentage points

**Model Output**:
- Location: `models/qwen2.5-0.5b-med-slm-lora-v4-chunked/`
- Files: adapter_model.safetensors, adapter_config.json, tokenizer files

---

## 🎉 Congratulations!

You've successfully trained a medical QA model on the full dataset!

**Next steps**:
1. Download and extract the model
2. Evaluate on Test_sample.v1.0.csv
3. Update your final report with results
4. Submit your assignment! 🚀

---