# Tokenizer Model Training

This notebook trains the tokenizer model by fine-tuning a base model (`gpt2` by default) on training examples in the Alpaca format. Follow the steps below to:
1. Set up the environment
2. Download the training data
3. Train the model
4. Save the results
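
For orientation, here is a minimal sketch of what an Alpaca-format record and its rendered prompt might look like. It assumes the field names of the Stanford Alpaca convention (`instruction`, `input`, `output`). The template, the helper name `format_alpaca_prompt`, and the sample record are illustrative assumptions; the repository's `AlpacaDataset` may format examples differently.

```python
# Sketch of an Alpaca-style record and prompt template.
# Assumption: fields follow the Stanford Alpaca convention
# (instruction / input / output); the repository's template may differ.

def format_alpaca_prompt(record: dict) -> str:
    """Render one record as a single training string."""
    instruction = record["instruction"].strip()
    context = record.get("input", "").strip()  # "input" is optional
    response = record["output"].strip()
    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            f"### Response:\n{response}"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"


# Hypothetical record, made up for illustration only.
example = {
    "instruction": "Count how many times the letter r appears in the word.",
    "input": "raspberry",
    "output": "3",
}
print(format_alpaca_prompt(example))
```

Records without an `input` field skip the `### Input:` section, which is how Alpaca-style data usually distinguishes instructions that need extra context from those that do not.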

## Initial Setup

In [None]:
%%capture
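# Install the released packages first
!pip install transformers torch wandb datasets tqdm
# Then install transformers from the GitHub main branch (this supersedes the release installed above)
!pip install -q git+https://github.com/huggingface/transformers.git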

## Clone Repository and Set Up Data

In [None]:
!git clone https://github.com/lebsral/raspberry.git
%cd raspberry
!pip install -r requirements.txt
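
Before training, it can help to confirm that the training file exists and parses. The sketch below is a hypothetical pre-flight check, not part of the repository. It assumes the file is a JSON array of objects with at least `instruction` and `output` keys.

```python
import json
from pathlib import Path


def check_alpaca_file(path: str) -> int:
    """Return the number of records, raising if the file looks malformed.

    Hypothetical helper: assumes a JSON array of objects that each
    contain at least "instruction" and "output".
    """
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"Training file not found: {file_path}")
    with file_path.open(encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("Expected a JSON array of records")
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"Record {i} is not a JSON object")
        missing = {"instruction", "output"} - set(record)
        if missing:
            raise ValueError(f"Record {i} is missing keys: {sorted(missing)}")
    return len(records)
```

From the repository root, a call such as `check_alpaca_file("data/processed/alpaca_examples.json")` would report the record count or fail early with a readable error.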

## Mount Google Drive
Mount Google Drive to save checkpoints and load data if needed:

In [None]:
from google.colab import drive
drive.mount('/content/drive')
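
If the mount fails or the notebook runs outside Colab, the Drive path will not exist. One possible safeguard is to choose the checkpoint directory based on whether the mount point is present. The helper `resolve_output_dir` and the fallback location are assumptions for illustration.

```python
import os


def resolve_output_dir(drive_root: str = "/content/drive/MyDrive",
                       fallback_root: str = "/content") -> str:
    """Pick a checkpoint directory, preferring Google Drive when mounted.

    Hypothetical helper: the fallback is local storage, which Colab
    discards when the runtime is recycled.
    """
    root = drive_root if os.path.isdir(drive_root) else fallback_root
    return os.path.join(root, "tokenizer_checkpoints")


print(resolve_output_dir())
```

Checkpoints written to the fallback location should be downloaded before the session ends.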

## Import Dependencies

In [None]:
import os
import sys
import json
import torch
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
import wandb

## Configuration
Set up training parameters:

In [None]:
# Training configuration
config = {
    "model_name": "gpt2",  # Base model to fine-tune
    "train_file": "data/processed/alpaca_examples.json",
    "output_dir": "/content/drive/MyDrive/tokenizer_checkpoints",
    "num_epochs": 3,
    "batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-5,
    "max_length": 512,
}

# Create output directory
os.makedirs(config["output_dir"], exist_ok=True)
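
With gradient accumulation, the optimizer updates once per `batch_size * gradient_accumulation_steps` examples, so the settings above give an effective batch size of 16. The sketch below works through that arithmetic. It assumes a single device and a dataset of 1,000 examples (a made-up number); exact step counts depend on how the training loop treats the final partial batch.

```python
import math


def training_schedule(num_examples: int, batch_size: int,
                      gradient_accumulation_steps: int, num_epochs: int) -> dict:
    """Estimate optimizer steps for a single-device run (illustrative only)."""
    effective_batch_size = batch_size * gradient_accumulation_steps
    # Round up so a final partial batch still counts as one update.
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    return {
        "effective_batch_size": effective_batch_size,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * num_epochs,
    }


# num_examples=1000 is a hypothetical dataset size.
schedule = training_schedule(num_examples=1000, batch_size=4,
                             gradient_accumulation_steps=4, num_epochs=3)
print(schedule)
```

For these inputs the estimate is 63 updates per epoch and 189 in total. Raising `gradient_accumulation_steps` keeps memory use per step constant while reducing the number of updates.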

## Training Code
Import the training implementation from the cloned repository, log in to Weights & Biases, and start training:

In [None]:
from src.training.train import AlpacaDataset, train

# Initialize wandb
wandb.login()

# Start training
train(
    model_name=config["model_name"],
    train_file=config["train_file"],
    output_dir=config["output_dir"],
    num_epochs=config["num_epochs"],
    batch_size=config["batch_size"],
    gradient_accumulation_steps=config["gradient_accumulation_steps"],
    learning_rate=config["learning_rate"],
    max_length=config["max_length"],
)
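
After training finishes (or is interrupted), you may want to find the most recent checkpoint. The sketch below assumes the training code writes Hugging Face Trainer-style `checkpoint-<step>` directories into `output_dir`, which this notebook does not confirm. `latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step number, if any.

    Assumption: checkpoints are directories named "checkpoint-<step>".
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare step numbers numerically so checkpoint-1000 beats checkpoint-500.
    return max(candidates, key=lambda item: item[0])[1]
```

Calling `latest_checkpoint(config["output_dir"])` returns `None` when no matching directory exists, which is also a quick way to notice that the naming assumption does not hold.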

## Save Results
Checkpoints are written to `output_dir`, which points at Google Drive, so they persist after the Colab runtime ends. You can also download them locally:

In [None]:
from google.colab import files

# Zip the checkpoint directory, then download the archive through the browser
!zip -r /content/model_checkpoint.zip {config["output_dir"]}
files.download("/content/model_checkpoint.zip")
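
Outside Colab, or where the `zip` command is unavailable, the archive can be built with the Python standard library instead. This is an alternative sketch, not what the cell above does: `archive_checkpoints` is a hypothetical helper, and the resulting archive stores paths relative to the checkpoint directory rather than the full `/content/drive/...` path.

```python
import shutil


def archive_checkpoints(output_dir: str, archive_base: str) -> str:
    """Zip the contents of output_dir and return the archive's path.

    archive_base is the target path without the ".zip" suffix. Keep it
    outside output_dir so the archive does not try to include itself.
    """
    return shutil.make_archive(archive_base, "zip", root_dir=output_dir)
```

For example, `archive_checkpoints(config["output_dir"], "/content/model_checkpoint")` would produce `/content/model_checkpoint.zip`, which can then be passed to `files.download`.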