# YouTube Transcript to Chapter Converter - Dataset Preparation

This notebook prepares training data for fine-tuning an LLM to convert YouTube transcripts into chapters.

## What we'll do:
1. Fetch transcripts from FreeCodeCamp videos using yt-dlp
2. Extract existing chapters from video descriptions
3. Format data for LoRA/QLoRA fine-tuning
4. Save in JSONL format for training

## Fine-tuning Format:
We'll use the **instruction-following format** (the same three fields as the Alpaca format saved in Step 9), which LoRA trainers such as Axolotl and LLaMA Factory accept:
```json
{
  "instruction": "Convert this YouTube transcript into chapters with timestamps.",
  "input": "<full transcript>",
  "output": "<chapters with timestamps>"
}
```
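
To make the format concrete, here is a minimal sketch of how one such record becomes a single JSONL line and is read back. The transcript and chapter text are made-up placeholders, not data from a real video.

```python
import json

# Hypothetical example record; the transcript and chapter text are invented for illustration.
record = {
    "instruction": "Convert this YouTube transcript into chapters with timestamps.",
    "input": "[00:00] Welcome to the course.\n[05:30] Let's install Python.",
    "output": "00:00 - Introduction\n05:30 - Installing Python",
}

# JSONL stores one JSON object per line. json.dumps escapes embedded newlines,
# so a multi-line transcript still occupies exactly one line in the file.
line = json.dumps(record, ensure_ascii=False)

# Reading a line back restores the original dictionary.
restored = json.loads(line)
print(restored["output"])
```

Because newlines inside the transcript are escaped, a trainer can read the file line by line without any special handling.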

## RTX 3050 Compatibility:
✅ **LoRA/QLoRA fine-tuning is possible** on RTX 3050 (4-8GB VRAM)
- Use 4-bit quantization (QLoRA)
- Recommended models: Mistral-7B, Llama-3-8B, Phi-3
- Batch size: 1-2
- Gradient accumulation: 4-8 steps
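
The arithmetic behind these settings can be sketched as follows. The helper names are hypothetical, the parameter counts are nominal round numbers, and the estimate covers the quantized weights only; activations, LoRA gradients, and optimizer state need memory on top of it.

```python
def weight_memory_gb(num_params: float, bits_per_param: int = 4) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def effective_batch_size(micro_batch: int, grad_accum_steps: int) -> int:
    """Number of examples that contribute to each optimizer step."""
    return micro_batch * grad_accum_steps


# A nominal 7B-parameter model at 4 bits per weight.
print(weight_memory_gb(7e9))  # 3.5

# Batch size 1 with 8 accumulation steps behaves like a batch of 8.
print(effective_batch_size(1, 8))  # 8
```

Gradient accumulation is what lets a batch size of 1-2 still produce reasonably stable updates on a small GPU.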

## Step 1: Install Required Packages

First, let's create a requirements.txt file with all necessary packages.
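
A minimal sketch of that step, assuming the package list is just the third-party libraries imported later in this notebook (`yt_dlp`, `tqdm`, `pandas`, `yaml`). The helper name `write_requirements` is hypothetical and versions are left unpinned.

```python
from pathlib import Path

# Third-party packages imported later in this notebook; versions are unpinned (an assumption).
REQUIREMENTS = ["yt-dlp", "tqdm", "pandas", "pyyaml"]


def write_requirements(path: str = "requirements.txt") -> Path:
    """Write one package name per line and return the path that was written."""
    target = Path(path)
    target.write_text("\n".join(REQUIREMENTS) + "\n", encoding="utf-8")
    return target


if __name__ == "__main__":
    print(f"Wrote {write_requirements()}")
```

The packages can then be installed with `pip install -r requirements.txt`.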

## Step 2: Import Libraries

In [7]:
import json
import re
import os
from pathlib import Path
from typing import List, Dict, Optional
import yt_dlp
from yt_dlp import YoutubeDL

from tqdm import tqdm
import pandas as pd
import time

## Step 3: Configuration

In [2]:
# Configuration
CHANNEL_URL = "https://www.youtube.com/@freecodecamp/videos"
OUTPUT_DIR = Path("./dataset")
OUTPUT_DIR.mkdir(exist_ok=True)

# Number of videos to process (set to 100 or adjust as needed)
MAX_VIDEOS = 100

# Minimum transcript length (to filter out very short videos)
MIN_TRANSCRIPT_LENGTH = 1000  # characters

print(f"Output directory: {OUTPUT_DIR}")
print(f"Maximum videos to process: {MAX_VIDEOS}")

Output directory: dataset
Maximum videos to process: 100


## Step 4: Utility Functions

In [3]:
def extract_video_id(url: str) -> Optional[str]:
    """Extract video ID from YouTube URL."""
    pattern = r'(?:v=|\/)([0-9A-Za-z_-]{11}).*'
    match = re.search(pattern, url)
    return match.group(1) if match else None

def seconds_to_timestamp(seconds: float) -> str:
    """Convert seconds to timestamp format (HH:MM:SS or MM:SS)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    
    if hours > 0:
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
    else:
        return f"{minutes:02d}:{secs:02d}"

def parse_timestamp(timestamp: str) -> Optional[int]:
    """Parse timestamp string to seconds."""
    try:
        # Remove any non-digit/colon characters
        timestamp = re.sub(r'[^0-9:]', '', timestamp.strip())
        parts = timestamp.split(':')
        
        if len(parts) == 2:  # MM:SS
            return int(parts[0]) * 60 + int(parts[1])
        elif len(parts) == 3:  # HH:MM:SS
            return int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2])
        else:
            return None
    except (ValueError, AttributeError, TypeError):
        return None

def extract_chapters_from_description(description: str) -> List[Dict[str, str]]:
    """Extract chapter information from video description."""
    chapters = []
    
    # Common patterns for chapters in descriptions
    patterns = [
        r'(\d{1,2}:\d{2}(?::\d{2})?)\s*[-–—]?\s*(.+?)(?=\n|$)',  # 00:00 - Chapter Name
        r'(\d{1,2}:\d{2}(?::\d{2})?)\s+(.+?)(?=\n|$)',  # 00:00 Chapter Name
        r'\((\d{1,2}:\d{2}(?::\d{2})?)\)\s*(.+?)(?=\n|$)',  # (00:00) Chapter Name
        r'\[(\d{1,2}:\d{2}(?::\d{2})?)\]\s*(.+?)(?=\n|$)',  # [00:00] Chapter Name
    ]
    
    for pattern in patterns:
        matches = re.findall(pattern, description, re.MULTILINE)
        if matches:
            for timestamp, title in matches:
                seconds = parse_timestamp(timestamp)
                if seconds is not None:
                    chapters.append({
                        'timestamp': timestamp,
                        'seconds': seconds,
                        'title': title.strip()
                    })
            break  # Use the first pattern that matches
    
    # Sort by start time (duplicate timestamps are kept as-is)
    chapters = sorted(chapters, key=lambda x: x['seconds'])
    return chapters

print("Utility functions loaded successfully!")

Utility functions loaded successfully!
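
`extract_chapters_from_description` sorts its results but does not drop repeated timestamps. If duplicates turn out to be a problem, a small helper along these lines could be applied to its output. This is a sketch: `dedupe_chapters` is a hypothetical name, and keeping the first entry per second is just one possible policy.

```python
from typing import Dict, List


def dedupe_chapters(chapters: List[Dict]) -> List[Dict]:
    """Sort chapters by start time and keep the first entry seen for each second."""
    seen = set()
    unique = []
    # sorted() is stable, so entries with equal start times keep their original order.
    for ch in sorted(chapters, key=lambda c: c["seconds"]):
        if ch["seconds"] not in seen:
            seen.add(ch["seconds"])
            unique.append(ch)
    return unique


sample = [
    {"timestamp": "05:30", "seconds": 330, "title": "Setup"},
    {"timestamp": "00:00", "seconds": 0, "title": "Intro"},
    {"timestamp": "5:30", "seconds": 330, "title": "Setup (repeated)"},
]
print([c["title"] for c in dedupe_chapters(sample)])  # ['Intro', 'Setup']
```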


## Step 5: Fetch FreeCodeCamp Video URLs

In [4]:
def get_channel_videos(channel_url: str, max_videos: int = 100) -> List[Dict]:
    """Fetch video information from a YouTube channel."""
    ydl_opts = {
        'quiet': True,
        'extract_flat': True,
        'playlistend': max_videos,
        'skip_download': True,
    }
    
    videos = []
    
    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            print(f"Fetching videos from: {channel_url}")
            info = ydl.extract_info(channel_url, download=False)
            
            if 'entries' in info:
                for entry in info['entries'][:max_videos]:
                    if entry:
                        videos.append({
                            'video_id': entry.get('id'),
                            'title': entry.get('title'),
                            'url': f"https://www.youtube.com/watch?v={entry.get('id')}"
                        })
    
    except Exception as e:
        print(f"Error fetching channel videos: {e}")
    
    return videos

# Fetch video URLs
print("Fetching FreeCodeCamp videos...")
videos = get_channel_videos(CHANNEL_URL, MAX_VIDEOS)
print(f"Found {len(videos)} videos")

# Display first few videos
if videos:
    print("\nFirst 5 videos:")
    for i, video in enumerate(videos[:5], 1):
        print(f"{i}. {video['title']}")

Fetching FreeCodeCamp videos...
Fetching videos from: https://www.youtube.com/@freecodecamp/videos
Found 100 videos

First 5 videos:
1. Relational Database Design – Full Course
2. Let's Build Pipeline Parallelism from Scratch – Tutorial
3. How to stay curious as a dev in the AI hype era with Sumit Saha [Podcast #205]
4. RAG & MCP Fundamentals – A Hands-On Crash Course
5. Learn Dynamic Programming with Animations – Full Course for Beginners


## Step 6: Fetch Transcripts and Chapters

In [10]:
import json
import time
import urllib.request
import yt_dlp
from typing import Optional

def get_transcript(video_id: str, retries: int = 3, base_delay: float = 2.0) -> Optional[str]:
    """Fetch transcript for a video using yt-dlp (with retry & backoff)."""
    video_url = f"https://www.youtube.com/watch?v={video_id}"

    ydl_opts = {
        'writesubtitles': True,
        'writeautomaticsub': True,
        'subtitleslangs': ['en'],
        'skip_download': True,
        'quiet': True,
        'no_warnings': True,
    }

    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(video_url, download=False)

        subtitles = info.get('subtitles', {})
        automatic_captions = info.get('automatic_captions', {})

        if 'en' in subtitles:
            subtitle_data = subtitles['en']
        elif 'en' in automatic_captions:
            subtitle_data = automatic_captions['en']
        else:
            return None

        json3_url = None
        for fmt in subtitle_data:
            if fmt.get('ext') == 'json3':
                json3_url = fmt.get('url')
                break

        if not json3_url:
            return None

        # ---- RETRY + BACKOFF HERE (important) ----
        data = None
        for attempt in range(retries):
            try:
                with urllib.request.urlopen(json3_url) as response:
                    data = json.loads(response.read().decode('utf-8'))
                break
            except Exception as e:
                if "429" in str(e):
                    delay = base_delay * (2 ** attempt)
                    print(f"429 on captions for {video_id}, retrying in {delay:.1f}s")
                    time.sleep(delay)
                else:
                    raise e

        if not data:
            return None

        # ---- Parse transcript ----
        full_transcript = ""

        for event in data.get('events', []):
            if 'segs' not in event:
                continue

            start_time = event.get('tStartMs', 0) / 1000
            timestamp = seconds_to_timestamp(start_time)

            text = ''.join(
                seg.get('utf8', '') for seg in event['segs']
            ).strip().replace('\n', ' ')

            if text:
                full_transcript += f"[{timestamp}] {text}\n"

        return full_transcript.strip() if full_transcript else None

    except Exception as e:
        print(f"Error fetching transcript for {video_id}: {e}")
        return None
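
The parsing half of `get_transcript` can be tried without any network access. The sketch below applies the same logic to a hand-written payload that contains only the fields the function reads (`events`, `tStartMs`, `segs`, `utf8`). The payload is not real YouTube data, `json3_to_lines` is a hypothetical name, and timestamps are simplified to minutes and seconds.

```python
from typing import Dict, List


def json3_to_lines(data: Dict) -> List[str]:
    """Turn a json3-style caption payload into '[MM:SS] text' lines."""
    lines = []
    for event in data.get("events", []):
        if "segs" not in event:
            continue  # events without segments carry no caption text
        total_seconds = int(event.get("tStartMs", 0) // 1000)
        minutes, secs = divmod(total_seconds, 60)
        text = "".join(seg.get("utf8", "") for seg in event["segs"])
        text = text.strip().replace("\n", " ")
        if text:  # skip events that contain only whitespace
            lines.append(f"[{minutes:02d}:{secs:02d}] {text}")
    return lines


# Hand-written payload for illustration only.
fake_payload = {
    "events": [
        {"tStartMs": 0, "segs": [{"utf8": "hello"}, {"utf8": " world"}]},
        {"tStartMs": 1500},
        {"tStartMs": 65000, "segs": [{"utf8": "\n"}]},
        {"tStartMs": 90500, "segs": [{"utf8": "next topic"}]},
    ]
}
print(json3_to_lines(fake_payload))  # ['[00:00] hello world', '[01:30] next topic']
```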


In [8]:
def get_video_metadata(video_url: str) -> Optional[dict]:
    """
    Extract video metadata using yt-dlp.
    Provides chapters, description, title, duration.
    """

    ydl_opts = {
        "quiet": True,
        "skip_download": True,
        "no_warnings": True,
    }

    try:
        with YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(video_url, download=False)

        if not info:
            return None

        metadata = {
            "id": info.get("id"),
            "title": info.get("title"),
            "description": info.get("description", ""),
            "duration": info.get("duration"),
            "chapters": []
        }

        # Preferred: yt-dlp native chapters
        if info.get("chapters"):
            for ch in info["chapters"]:
                metadata["chapters"].append({
                    "start_time": ch.get("start_time", 0),
                    "end_time": ch.get("end_time"),
                    "title": ch.get("title", "").strip()
                })

        return metadata

    except Exception as e:
        print(f"Metadata extraction failed: {e}")
        return None


In [11]:
def format_chapters_output(chapters: List[Dict]) -> str:
    """
    Format chapters into:
    MM:SS - Title
    HH:MM:SS - Title
    """
    if not chapters:
        return ""

    lines = []
    for ch in chapters:
        timestamp = ch.get("timestamp")
        if not timestamp:
            timestamp = seconds_to_timestamp(ch.get("seconds", 0))

        title = ch.get("title", "").strip()
        lines.append(f"{timestamp} - {title}")

    return "\n".join(lines)


## Step 7: Process Videos and Create Dataset

In [None]:
import time
import random

def process_videos(videos: List[Dict], min_length: int = 1000) -> List[Dict]:
    dataset = []
    successful = 0
    failed = 0

    for video in tqdm(videos, desc="Processing videos"):
        try:
            video_id = video['video_id']
            video_url = video['url']

            # Strong throttle between videos (30-40 seconds)
            sleep_time = random.uniform(30, 40)
            time.sleep(sleep_time)

            transcript = get_transcript(video_id)
            if not transcript or len(transcript) < min_length:
                print(f"\nSkipping {video['title']}: Transcript too short or unavailable")
                failed += 1
                continue

            metadata = get_video_metadata(video_url)
            if not metadata:
                failed += 1
                continue

            chapters = []

            if metadata.get('chapters'):
                for ch in metadata['chapters']:
                    chapters.append({
                        'timestamp': seconds_to_timestamp(ch.get('start_time', 0)),
                        'seconds': ch.get('start_time', 0),
                        'title': ch.get('title', '')
                    })

            if not chapters and metadata.get('description'):
                chapters = extract_chapters_from_description(metadata['description'])

            if not chapters or len(chapters) < 2:
                print(f"\nSkipping {video['title']}: No chapters found")
                failed += 1
                continue

            chapters_output = format_chapters_output(chapters)

            dataset.append({
                "video_id": video_id,
                "video_title": video['title'],
                "instruction": (
                    "Convert the following YouTube video transcript into chapters "
                    "with timestamps. Each chapter should have a timestamp in "
                    "MM:SS or HH:MM:SS format followed by a descriptive title."
                ),
                "input": transcript,
                "output": chapters_output,
                "num_chapters": len(chapters),
                "transcript_length": len(transcript)
            })

            successful += 1

        except Exception as e:
            print(f"\nError processing {video['title']}: {e}")
            failed += 1
            continue

    print(f"\nProcessing complete!")
    print(f"Successful: {successful}")
    print(f"Failed: {failed}")

    return dataset


# Process all videos
print("Starting video processing...")
print("This may take a while depending on the number of videos.\n")
dataset = process_videos(videos, MIN_TRANSCRIPT_LENGTH)

Starting video processing...
This may take a while depending on the number of videos.



Processing videos:   0%|          | 0/100 [00:00<?, ?it/s]

429 on captions for 26ls5lNiijk, retrying in 2.0s
429 on captions for 26ls5lNiijk, retrying in 4.0s
429 on captions for 26ls5lNiijk, retrying in 8.0s


Processing videos:   1%|          | 1/100 [00:18<30:41, 18.60s/it]


Skipping Relational Database Design – Full Course: Transcript too short or unavailable
429 on captions for D5F8kp_azzw, retrying in 2.0s
429 on captions for D5F8kp_azzw, retrying in 4.0s
429 on captions for D5F8kp_azzw, retrying in 8.0s


Processing videos:   1%|          | 1/100 [00:37<1:01:06, 37.03s/it]


KeyboardInterrupt: 
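
The 2.0s, 4.0s, 8.0s delays in the log follow the `base_delay * (2 ** attempt)` rule in `get_transcript`. The sketch below reproduces that schedule; the `jitter` parameter is an addition that is not in the original function, and `backoff_delays` is a hypothetical name.

```python
import random
from typing import List


def backoff_delays(retries: int = 3, base_delay: float = 2.0, jitter: float = 0.0) -> List[float]:
    """Delay before each retry: base_delay * 2**attempt, plus up to `jitter` random seconds."""
    return [
        base_delay * (2 ** attempt) + random.uniform(0, jitter)
        for attempt in range(retries)
    ]


print(backoff_delays())  # [2.0, 4.0, 8.0]
print(sum(backoff_delays()))  # 14.0 seconds of waiting before giving up
```

With the defaults, a video that keeps returning 429 costs 14 seconds of retries on top of the per-video throttle, so raising `base_delay` or `retries` lengthens the run accordingly.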

## Step 8: Analyze Dataset

In [None]:
if dataset:
    print(f"\nDataset Statistics:")
    print(f"Total examples: {len(dataset)}")
    print(f"\nTranscript length statistics:")
    
    transcript_lengths = [ex['transcript_length'] for ex in dataset]
    print(f"  Min: {min(transcript_lengths):,} characters")
    print(f"  Max: {max(transcript_lengths):,} characters")
    print(f"  Average: {sum(transcript_lengths) // len(transcript_lengths):,} characters")
    
    chapter_counts = [ex['num_chapters'] for ex in dataset]
    print(f"\nChapter count statistics:")
    print(f"  Min: {min(chapter_counts)} chapters")
    print(f"  Max: {max(chapter_counts)} chapters")
    print(f"  Average: {sum(chapter_counts) / len(chapter_counts):.1f} chapters")
    
    # Show sample
    print(f"\n{'='*80}")
    print("Sample Training Example:")
    print(f"{'='*80}")
    sample = dataset[0]
    print(f"\nVideo: {sample['video_title']}")
    print(f"\nInstruction:\n{sample['instruction']}")
    print(f"\nInput (first 500 chars):\n{sample['input'][:500]}...")
    print(f"\nOutput:\n{sample['output']}")
else:
    print("No data in dataset!")
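
Character counts alone do not show whether an example fits the 2048-token sequence length used in the training configuration. A rough check could look like the sketch below. It assumes about 4 characters per token, which is a rule of thumb for English text and not a measured value; the real tokenizer gives the exact count. The helper names are hypothetical.

```python
from typing import Dict, List

CHARS_PER_TOKEN = 4  # rough rule of thumb for English text; an assumption, not a measurement


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def count_over_budget(examples: List[Dict], max_tokens: int = 2048) -> int:
    """Count examples whose instruction + input + output likely exceed max_tokens."""
    over = 0
    for ex in examples:
        total = sum(estimate_tokens(ex[key]) for key in ("instruction", "input", "output"))
        if total > max_tokens:
            over += 1
    return over


# Synthetic examples for illustration: one long transcript, one short.
demo = [
    {"instruction": "i" * 40, "input": "a" * 10000, "output": "o" * 40},
    {"instruction": "i" * 40, "input": "a" * 400, "output": "o" * 40},
]
print(count_over_budget(demo))  # 1
```

Running `count_over_budget(dataset)` on the real data would show how many transcripts need trimming or chunking before training.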

## Step 9: Save Dataset in Multiple Formats

In [None]:
if dataset:
    # 1. Save as JSONL (for LoRA training with libraries like Axolotl, LLaMA Factory)
    jsonl_path = OUTPUT_DIR / "training_data.jsonl"
    with open(jsonl_path, 'w', encoding='utf-8') as f:
        for example in dataset:
            # Format for instruction fine-tuning
            training_format = {
                "instruction": example['instruction'],
                "input": example['input'],
                "output": example['output']
            }
            f.write(json.dumps(training_format, ensure_ascii=False) + '\n')
    print(f"✓ Saved JSONL format: {jsonl_path}")
    
    # 2. Save as JSON (complete dataset with metadata)
    json_path = OUTPUT_DIR / "training_data_full.json"
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(dataset, f, indent=2, ensure_ascii=False)
    print(f"✓ Saved full JSON format: {json_path}")
    
    # 3. Save as CSV (for analysis)
    csv_data = []
    for ex in dataset:
        csv_data.append({
            'video_id': ex['video_id'],
            'video_title': ex['video_title'],
            'num_chapters': ex['num_chapters'],
            'transcript_length': ex['transcript_length']
        })
    df = pd.DataFrame(csv_data)
    csv_path = OUTPUT_DIR / "dataset_metadata.csv"
    df.to_csv(csv_path, index=False)
    print(f"✓ Saved metadata CSV: {csv_path}")
    
    # 4. Save as Alpaca format (alternative format for some trainers)
    alpaca_data = []
    for ex in dataset:
        alpaca_data.append({
            "instruction": ex['instruction'],
            "input": ex['input'],
            "output": ex['output']
        })
    alpaca_path = OUTPUT_DIR / "training_data_alpaca.json"
    with open(alpaca_path, 'w', encoding='utf-8') as f:
        json.dump(alpaca_data, f, indent=2, ensure_ascii=False)
    print(f"✓ Saved Alpaca format: {alpaca_path}")
    
    # 5. Save training/validation split (80/20)
    split_idx = int(len(dataset) * 0.8)
    train_data = dataset[:split_idx]
    val_data = dataset[split_idx:]
    
    train_path = OUTPUT_DIR / "train.jsonl"
    with open(train_path, 'w', encoding='utf-8') as f:
        for example in train_data:
            training_format = {
                "instruction": example['instruction'],
                "input": example['input'],
                "output": example['output']
            }
            f.write(json.dumps(training_format, ensure_ascii=False) + '\n')
    
    val_path = OUTPUT_DIR / "validation.jsonl"
    with open(val_path, 'w', encoding='utf-8') as f:
        for example in val_data:
            training_format = {
                "instruction": example['instruction'],
                "input": example['input'],
                "output": example['output']
            }
            f.write(json.dumps(training_format, ensure_ascii=False) + '\n')
    
    print(f"✓ Saved training split: {train_path} ({len(train_data)} examples)")
    print(f"✓ Saved validation split: {val_path} ({len(val_data)} examples)")
    
    print(f"\n{'='*80}")
    print("Dataset preparation complete!")
    print(f"{'='*80}")
    print(f"\nAll files saved in: {OUTPUT_DIR}")
    print(f"\nReady for fine-tuning with LoRA/QLoRA on your RTX 3050!")
else:
    print("No dataset to save!")
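
The 80/20 split above slices the dataset in channel order, so the validation set consists of the last videos fetched. If a random but reproducible split is preferred, a seeded shuffle is one option. This is a sketch: `shuffled_split` is a hypothetical helper and the seed value is arbitrary.

```python
import random
from typing import List, Tuple


def shuffled_split(items: List, train_fraction: float = 0.8, seed: int = 42) -> Tuple[List, List]:
    """Shuffle a copy of items with a fixed seed, then split into (train, validation)."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    shuffled = items[:]        # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    split_idx = int(len(shuffled) * train_fraction)
    return shuffled[:split_idx], shuffled[split_idx:]


train_ids, val_ids = shuffled_split(list(range(10)))
print(len(train_ids), len(val_ids))  # 8 2
```

Using the same seed on the same input always yields the same split, which keeps validation results comparable between runs.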

## Step 10: Generate Training Configuration

Create a configuration file for training with popular frameworks.

In [None]:
# Generate training config for Axolotl (popular LoRA training framework)
axolotl_config = {
    "base_model": "mistralai/Mistral-7B-v0.1",
    "model_type": "MistralForCausalLM",
    "tokenizer_type": "LlamaTokenizer",
    "load_in_8bit": False,
    "load_in_4bit": True,
    "strict": False,
    "datasets": [
        {
            "path": "train.jsonl",
            "type": "alpaca"
        }
    ],
    "val_set_size": 0.05,
    "output_dir": "./youtube-chapter-lora",
    "adapter": "lora",
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "lora_target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
    "sequence_len": 2048,
    "sample_packing": True,
    "gradient_accumulation_steps": 4,
    "micro_batch_size": 1,
    "num_epochs": 3,
    "optimizer": "adamw_torch",
    "lr_scheduler": "cosine",
    "learning_rate": 0.0002,
    "train_on_inputs": False,
    "group_by_length": False,
    "bf16": False,
    "fp16": True,
    "tf32": False,
    "gradient_checkpointing": True,
    "logging_steps": 1,
    "save_steps": 100,
    "eval_steps": 100,
    "warmup_steps": 10,
    "weight_decay": 0.01
}

config_path = OUTPUT_DIR / "axolotl_config.yml"
import yaml
with open(config_path, 'w') as f:
    yaml.dump(axolotl_config, f, default_flow_style=False)

print(f"✓ Saved Axolotl training config: {config_path}")

# Generate a simple training script
training_script = '''#!/usr/bin/env python3
"""
Simple LoRA Fine-tuning Script for RTX 3050
Using transformers + PEFT (Parameter-Efficient Fine-Tuning)
"""

# Install required packages first:
# pip install transformers peft bitsandbytes accelerate datasets torch

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

# Configuration
MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # or "meta-llama/Meta-Llama-3-8B"
DATASET_PATH = "./dataset/train.jsonl"
OUTPUT_DIR = "./youtube-chapter-model"

# 4-bit quantization config for RTX 3050
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# Prepare model for training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load and prepare dataset
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")

def format_prompt(example):
    prompt = f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    return {"text": prompt}

dataset = dataset.map(format_prompt)

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
        padding="max_length"
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments optimized for RTX 3050
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=1,  # Small batch size for 4GB VRAM
    gradient_accumulation_steps=8,  # Accumulate gradients
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    warmup_steps=50,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit"
)

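# Create trainer. The collator copies input_ids into labels so the model returns a loss.
from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)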

# Start training
print("Starting training...")
trainer.train()

# Save the final model
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Model saved to {OUTPUT_DIR}")
'''

script_path = OUTPUT_DIR / "train_lora.py"
with open(script_path, 'w') as f:
    f.write(training_script)

print(f"✓ Saved training script: {script_path}")
print(f"\nTo start training, run: python {script_path}")
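
The training script truncates each example to 2048 tokens from the end, so a long transcript can push the `### Response:` section out of the training text entirely. One way to guard against that is to trim only the transcript before building the prompt, as sketched below. The character budget of 8000 is an assumption (roughly 2048 tokens at about 4 characters per token), `build_prompt` is a hypothetical helper, and chapters that fall in the trimmed part of the transcript can no longer be learned from that example.

```python
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def build_prompt(example: dict, max_chars: int = 8000) -> str:
    """Fill the template, trimming only the transcript so the response is never cut off."""
    # Length of everything except the transcript.
    fixed = len(PROMPT_TEMPLATE.format(
        instruction=example["instruction"], input="", output=example["output"]
    ))
    room = max(0, max_chars - fixed)
    return PROMPT_TEMPLATE.format(
        instruction=example["instruction"],
        input=example["input"][:room],
        output=example["output"],
    )


demo_example = {
    "instruction": "Convert this transcript into chapters.",
    "input": "x" * 20000,
    "output": "00:00 - Intro",
}
prompt = build_prompt(demo_example, max_chars=500)
print(len(prompt), prompt.endswith("00:00 - Intro"))  # 500 True
```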

## Step 11: Create README with Instructions

In [None]:
readme_content = '''# YouTube Transcript to Chapter Converter - Fine-tuning Dataset

This dataset contains YouTube video transcripts and their corresponding chapters from FreeCodeCamp videos.

## Dataset Structure

Each training example contains:
- `instruction`: Task description for the model
- `input`: Full video transcript with timestamps
- `output`: Chapter titles with timestamps

## Files

- `training_data.jsonl`: Main training data in JSONL format
- `train.jsonl`: Training split (80%)
- `validation.jsonl`: Validation split (20%)
- `training_data_alpaca.json`: Alpaca format
- `training_data_full.json`: Complete dataset with metadata
- `dataset_metadata.csv`: Summary statistics
- `axolotl_config.yml`: Configuration for Axolotl framework
- `train_lora.py`: Simple training script

## Training on RTX 3050

Your RTX 3050 (4-8GB VRAM) can handle LoRA fine-tuning with these optimizations:

### Requirements

```bash
pip install transformers peft bitsandbytes accelerate datasets torch
```

### Option 1: Using the Provided Script

```bash
# Run from the directory that contains dataset/ (the script reads ./dataset/train.jsonl)
python dataset/train_lora.py
```

### Option 2: Using Axolotl

```bash
# Install Axolotl
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl
pip install -e .

# Copy your data and config (run this from inside the axolotl checkout)
cp path/to/dataset/* .

# Start training
accelerate launch -m axolotl.cli.train axolotl_config.yml
```

### Option 3: Using Ollama (for inference after training)

After training, you can convert your model to GGUF format and use with Ollama:

```bash
# Convert to GGUF
python llama.cpp/convert.py ./youtube-chapter-model --outfile model.gguf

# Create Modelfile
cat > Modelfile <<'EOF'
FROM ./model.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF

# Import to Ollama
ollama create youtube-chapter-converter -f Modelfile
```

## Recommended Models for RTX 3050

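1. **Mistral-7B** (Recommended)
   - Best performance for size
   - ~4GB VRAM for the 4-bit weights alone; activations, LoRA gradients, and optimizer state need extra headroom

2. **Llama-3-8B**
   - Excellent instruction following
   - ~4GB VRAM for the 4-bit weights alone, plus training overhead

3. **Phi-3-mini**
   - Smaller, faster
   - ~2GB VRAM for the 4-bit weights alone, the safest fit for a 4GB card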

## Training Parameters (Optimized for RTX 3050)

```text
- Batch size: 1
- Gradient accumulation: 8 steps
- Learning rate: 2e-4
- LoRA rank: 16
- LoRA alpha: 32
- 4-bit quantization: Enabled
- Gradient checkpointing: Enabled
- FP16 training: Enabled
```

## Expected Training Time

- ~2-4 hours for 50 examples on RTX 3050
- ~4-8 hours for 100 examples

## Using the Trained Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    load_in_4bit=True,
    device_map="auto"
)

# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "./youtube-chapter-model")
tokenizer = AutoTokenizer.from_pretrained("./youtube-chapter-model")

# Generate chapters
prompt = f"""### Instruction:
Convert the following YouTube video transcript into chapters with timestamps.

### Input:
{your_transcript}

### Response:
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=500)
chapters = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(chapters)
```

## Tips for Better Results

1. **More data**: Try to get 200-500 examples for better performance
2. **Diverse videos**: Include various video lengths and topics
3. **Quality filtering**: Remove low-quality transcripts
4. **Augmentation**: Create variations of existing examples
5. **Validation**: Always check model outputs on validation set

## Troubleshooting

### Out of Memory Errors
- Reduce batch size to 1
- Increase gradient accumulation steps
- Enable gradient checkpointing
- Use 4-bit quantization
- Use paged optimizers (paged_adamw_8bit)
- Reduce sequence length

### Slow Training
- Use FP16 training
- Disable gradient checkpointing if memory allows (it saves VRAM at the cost of speed)
- Reduce sequence length, since every example is padded to the maximum
- Reduce LoRA rank if needed

## License

Dataset is for educational purposes. Respect YouTube's Terms of Service and video creators' rights.
'''

readme_path = OUTPUT_DIR / "README.md"
with open(readme_path, 'w') as f:
    f.write(readme_content)

print(f"✓ Saved README: {readme_path}")
print(f"\n{'='*80}")
print("All files created successfully!")
print(f"{'='*80}")
print(f"\nDataset directory: {OUTPUT_DIR}")
print("\nNext steps:")
print("1. Review the dataset and README")
print("2. Install required packages: pip install transformers peft bitsandbytes accelerate datasets torch")
print("3. Run training: python dataset/train_lora.py")
print("4. Wait 2-8 hours depending on dataset size")
print("5. Test your model!")
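
When testing the model, note that decoding `outputs[0]` returns the prompt followed by the generated text. A small post-processing step can keep only the chapters. This is a sketch that assumes the `### Response:` marker from the prompt template; `extract_response` is a hypothetical helper.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str) -> str:
    """Return the text after the first response marker, or the whole text if it is absent."""
    _, found, tail = decoded.partition(RESPONSE_MARKER)
    return tail.strip() if found else decoded.strip()


# Made-up decoded output for illustration.
decoded_text = (
    "### Instruction:\nConvert the transcript into chapters.\n\n"
    "### Input:\n[00:00] Welcome...\n\n"
    "### Response:\n00:00 - Introduction\n05:30 - Setup\n"
)
print(extract_response(decoded_text))
```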