# Speech TTS Test - Getting Started

This notebook walks through the complete workflow for fine-tuning a binary sentiment-classification model using our ML template.

## 🚀 What we'll cover:
1. Environment setup and device detection
2. Data loading and analysis with Polars
3. Model training with Mac MPS support
4. Model evaluation and inference
5. Saving the model for deployment

In [None]:
import sys
import logging
from pathlib import Path

# Add package to path for imports
sys.path.append(str(Path.cwd().parent))

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print("✅ Environment setup complete")

## 1. Configuration and Device Detection

In [None]:
import torch
from speech_tts_test.config import get_settings
from speech_tts_test.models.train_model import get_device

# Load configuration
settings = get_settings()
print(f"📊 Model: {settings.model.checkpoint}")
print(f"📚 Dataset: {settings.training.dataset_name}")
print(f"⚙️ Epochs: {settings.training.num_train_epochs}")

# Detect device
device = get_device()
print(f"🔧 Using device: {device}")

# Show device info
if device == "cuda":
    print(f"   GPU: {torch.cuda.get_device_name()}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
elif device == "mps":
    print("   Mac Metal Performance Shaders enabled 🍎")
else:
    print("   CPU training (slower but works everywhere)")
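
`get_device` comes from the project package, and its implementation is not shown in this notebook. As a rough sketch of the usual detection order (CUDA, then MPS, then CPU), a helper like the hypothetical `pick_device` below would behave similarly. The fallback to `"cpu"` when PyTorch is not installed is an assumption added so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" (hypothetical stand-in for get_device)."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch there is nothing to accelerate.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing on older PyTorch builds, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

Returning a plain string keeps comparisons such as `device == "cuda"` in the cells of this notebook straightforward.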

## 2. Data Loading and Analysis

In [None]:
from speech_tts_test.data_utils import load_and_process_dataset, analyze_dataset_stats, create_data_summary_report

# Load dataset with smaller subset for notebook demo
print("📥 Loading dataset...")
dataset = load_and_process_dataset(
    settings.training.dataset_name,
    subset_size=1000  # Small subset for demo
)

print(f"✅ Loaded {len(dataset)} splits")
for split_name, split_data in dataset.items():
    print(f"   {split_name}: {len(split_data)} examples")

In [None]:
# Analyze dataset statistics
print("📊 Analyzing dataset statistics...")
stats = analyze_dataset_stats(dataset)

# Create a detailed report
report = create_data_summary_report(dataset)
print("\n" + "="*50)
print(report)
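
`analyze_dataset_stats` and `create_data_summary_report` live in the project package, so what they compute is not visible here. As an illustration of the kind of numbers such a summary usually contains, here is a plain-Python sketch over a toy split. The function name `summarize_split` is hypothetical, and the `text` and `label` field names are assumed to match the dataset.

```python
from collections import Counter
from statistics import mean


def summarize_split(rows):
    """Compute example count, label distribution, and word-length stats."""
    label_counts = Counter(row["label"] for row in rows)
    lengths = [len(row["text"].split()) for row in rows]
    return {
        "num_examples": len(rows),
        "label_counts": dict(label_counts),
        "mean_words": mean(lengths),
        "max_words": max(lengths),
    }


# Toy data for illustration only; not taken from the real dataset.
toy_split = [
    {"text": "great movie", "label": 1},
    {"text": "awful plot and bad acting", "label": 0},
    {"text": "loved it", "label": 1},
]
summary = summarize_split(toy_split)
print(summary)
```

A skewed `label_counts` is worth noticing early, because it changes how accuracy and F1 should be read later on.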

In [None]:
# Look at some examples
print("📝 Sample data:")
for i in range(3):
    example = dataset["train"][i]
    sentiment = "positive" if example["label"] == 1 else "negative"
    text = example["text"]
    text_preview = text[:100] + "..." if len(text) > 100 else text
    print(f"\n{i+1}. [{sentiment.upper()}] {text_preview}")

## 3. Model Training

Now let's fine-tune the model. The same code runs on Apple Silicon (MPS), CUDA GPUs, or CPU; batch size and mixed precision are adjusted per device.

In [None]:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    set_seed,
)
from speech_tts_test.models.train_model import preprocess_function, compute_metrics

# Set seed
set_seed(42)

print("🤗 Loading model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(settings.model.checkpoint)

# Add padding token if not present
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

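# Load weights in float32 and let the Trainer handle mixed precision (fp16 on CUDA).
# Loading the weights in float16 *and* enabling fp16 training typically fails with
# "Attempting to unscale FP16 gradients".
model = AutoModelForSequenceClassification.from_pretrained(
    settings.model.checkpoint,
    num_labels=2,
)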

if tokenizer.pad_token is not None:
    model.config.pad_token_id = tokenizer.pad_token_id

print("✅ Model loaded successfully")

In [None]:
# Tokenize dataset
print("🔤 Tokenizing dataset...")
tokenized_datasets = dataset.map(
    lambda x: preprocess_function(x, tokenizer, settings.model.max_length),
    batched=True,
    remove_columns=dataset["train"].column_names,
    desc="Tokenizing"
)

print("✅ Tokenization complete")
print(f"Train examples: {len(tokenized_datasets['train'])}")
print(f"Test examples: {len(tokenized_datasets['test'])}")
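
`preprocess_function` is imported from the project package, so its body is not visible here. Because the `map` call removes every original column, whatever the function returns has to carry the labels forward. Below is a minimal sketch of that contract, assuming `text` and `label` columns and using a stand-in tokenizer so it runs without downloading a model. Both `preprocess_sketch` and `fake_tokenizer` are hypothetical names.

```python
def preprocess_sketch(examples, tokenizer, max_length):
    """Tokenize a batch of texts and keep the labels under the "labels" key."""
    tokenized = tokenizer(examples["text"], truncation=True, max_length=max_length)
    tokenized["labels"] = examples["label"]
    return tokenized


def fake_tokenizer(texts, truncation=True, max_length=8):
    """Stand-in tokenizer: one integer id (the word length) per word."""
    input_ids = [[len(word) for word in text.split()][:max_length] for text in texts]
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
    }


batch = {"text": ["a fine film", "dull"], "label": [1, 0]}
encoded = preprocess_sketch(batch, fake_tokenizer, max_length=2)
print(encoded)
```

With a real Hugging Face tokenizer the call shape is the same; only the ids differ.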

In [None]:
# Setup training
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Adjust batch size for device
batch_size = 4 if device in ["mps", "cpu"] else 8
print(f"🎯 Using batch size: {batch_size}")

output_dir = Path("../models") / settings.model.checkpoint.replace("/", "-")
training_args = TrainingArguments(
    output_dir=str(output_dir),
    evaluation_strategy="epoch",  # Renamed to `eval_strategy` in newer transformers releases
    save_strategy="epoch",
    learning_rate=3e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=2,  # Quick training for demo
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    logging_steps=10,
    save_total_limit=2,
    gradient_checkpointing=True,
    dataloader_pin_memory=device != "mps",  # Pinned memory is not supported on MPS
    fp16=device not in ("mps", "cpu"),  # Mixed precision only on CUDA
    auto_find_batch_size=True,
    report_to=[],  # Disable wandb/tensorboard for notebook
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("✅ Trainer initialized")
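
`metric_for_best_model="f1"` only works if `compute_metrics` returns a dictionary containing an `f1` key. The project's implementation is not shown here. A dependency-free sketch of such a function for the binary case, under the hypothetical name `compute_metrics_sketch`, could look like this:

```python
def compute_metrics_sketch(eval_pred):
    """Return accuracy and binary F1 (positive class = 1) from logits and labels."""
    logits, labels = eval_pred
    # Argmax over each row of logits gives the predicted class index.
    preds = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    pairs = list(zip(preds, labels))

    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for p, y in pairs if p == y) / len(pairs)
    return {"accuracy": accuracy, "f1": f1}


# Made-up logits and labels for illustration.
toy_logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, 0, 0, 1]
metrics = compute_metrics_sketch((toy_logits, toy_labels))
print(metrics)
```

The `Trainer` prefixes these keys with `eval_` when it reports them, which is why the evaluation output lists entries such as `eval_f1`.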

In [None]:
# Start training
print("🚀 Starting training...")
print(f"   Device: {device}")
print(f"   Batch size: {batch_size}")
print(f"   Training examples: {len(tokenized_datasets['train'])}")

try:
    train_result = trainer.train()
    print("✅ Training completed successfully!")
    print(f"Final training loss: {train_result.training_loss:.4f}")
except Exception as e:
    print(f"❌ Training failed: {e}")
    print("This might be due to memory limitations. Try reducing batch size.")
    raise  # Re-raise so later cells don't silently run against an untrained model

In [None]:
# Evaluate the model
print("📊 Evaluating model...")
eval_results = trainer.evaluate()

print("\n📈 Evaluation Results:")
for key, value in eval_results.items():
    if isinstance(value, float):
        print(f"   {key}: {value:.4f}")
    else:
        print(f"   {key}: {value}")

## 4. Model Inference

Let's test our trained model with some examples!

In [None]:
# Test the model with custom examples
test_texts = [
    "This movie was absolutely fantastic! Great acting and amazing story.",
    "Terrible film. Waste of time and money. Very disappointed.",
    "It was okay, nothing special but not bad either.",
    "One of the best movies I've ever seen! Highly recommend!",
    "Boring and predictable. Fell asleep halfway through."
]

print("🧪 Testing model inference...\n")

model.eval()
for i, text in enumerate(test_texts, 1):
    # Tokenize and move the tensors to the model's device (CUDA/MPS after training)
    inputs = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=settings.model.max_length
    ).to(model.device)
    
    # Get prediction
    with torch.no_grad():
        outputs = model(**inputs)
        prediction = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(prediction, dim=-1).item()
        confidence = prediction[0][predicted_class].item()
    
    sentiment = "😊 POSITIVE" if predicted_class == 1 else "😞 NEGATIVE"
    print(f"{i}. {sentiment} (confidence: {confidence:.3f})")
    print(f"   Text: \"{text}\"\n")
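
The confidence printed by the loop above is the softmax probability of the predicted class. For intuition, here is the same computation on a single pair of made-up logits in plain Python (the values are illustrative, not model output):

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the max first keeps exp() numerically stable.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([0.0, 2.0])  # [negative logit, positive logit]
predicted_class = probs.index(max(probs))
confidence = probs[predicted_class]
print(predicted_class, round(confidence, 3))
```

In the two-class case a confidence close to 0.5 means the model is effectively undecided, which is what a neutral review like "It was okay" may produce.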

## 5. Save the Model

Let's save our trained model for deployment.

In [None]:
# Save the model
print("💾 Saving model...")
trainer.save_model()
tokenizer.save_pretrained(output_dir)

print(f"✅ Model saved to: {output_dir}")
print(f"   Model files: {list(output_dir.glob('*'))}")

## 🎉 Success!

You've successfully:
1. ✅ Set up the environment with device detection
2. ✅ Loaded and analyzed data using Polars
3. ✅ Fine-tuned a model with Mac MPS support
4. ✅ Evaluated the model performance
5. ✅ Tested inference with custom examples
6. ✅ Saved the trained model

## Next Steps:

- **Deploy the model**: Use `uv run task serve` to start the API server
- **Cloud training**: Use `uv run task train-cloud` for larger datasets (if configured)
- **Experiment**: Try different models in `configs/settings.yaml`
- **Scale up**: Remove the `subset_size` parameter for full dataset training

Happy machine learning! 🚀