# Text Embedding Model - Modern Architecture (T4 Optimized)

**VSCode Colab Extension Ready!**

**🚀 Modern Architecture:**
- RMSNorm (10-15% faster than LayerNorm)
- Grouped Query Attention (4x smaller KV cache)
- RoPE with YaRN (better positional encoding)
- **Hybrid Muon+AdamW optimizer** (faster!)
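
The "4x" figure for Grouped Query Attention follows from simple arithmetic: the cache stores one key tensor and one value tensor per KV head, so sharing each KV head across four query heads divides its size by four. The sketch below works that out. The layer count, head count, head dimension, and sequence length are hypothetical values chosen for illustration, not this project's actual configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values (2 bytes per element = fp16)."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical configuration, NOT read from this repository.
NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 12, 12, 64, 512

# Standard multi-head attention: one KV head per query head.
mha = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped query attention: every 4 query heads share one KV head.
gqa = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS // 4, HEAD_DIM, SEQ_LEN)

print(f"MHA: {mha / 2**20:.1f} MiB, GQA: {gqa / 2**20:.1f} MiB, ratio: {mha / gqa:.0f}x")
```

The ratio depends only on the query-to-KV head grouping, so it stays at 4x for any sequence length or batch size.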

**⏱️ Training Time:** ~8-12 hours for 100K samples

## 1. Setup - Choose ONE method:

### Method A: Clone from GitHub (Recommended)

In [None]:
# Clone your repository
!git clone https://github.com/yourusername/embedding_model.git
%cd embedding_model

### Method B: Google Drive (for VSCode Colab Extension)

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Navigate to your project (adjust path as needed)
%cd /content/drive/MyDrive/Embedding_Model

# Or symlink it
# !ln -s /content/drive/MyDrive/Embedding_Model /content/Embedding_Model
# %cd /content/Embedding_Model

### Method C: Upload manually (Files > Upload)

In [None]:
# After uploading, navigate to folder
%cd /content/Embedding_Model

## 2. Install Dependencies

In [None]:
# Check GPU
!nvidia-smi
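
If you would rather check the GPU from Python than read the `nvidia-smi` table by eye, something like the following may help. It is a minimal sketch that shells out to `nvidia-smi` with its CSV query flags; `query_gpus` is a helper name made up for this notebook, not part of the project.

```python
import shutil
import subprocess


def query_gpus():
    """Return [(name, total_memory), ...], or [] when no NVIDIA driver is visible."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.strip().splitlines():
        if "," not in line:
            continue  # skip anything that is not a "name, memory" row
        name, memory = (field.strip() for field in line.split(",", 1))
        gpus.append((name, memory))
    return gpus


gpus = query_gpus()
if gpus:
    for name, memory in gpus:
        print(f"GPU: {name} ({memory})")
else:
    print("No GPU detected. In Colab, pick a GPU under Runtime > Change runtime type.")
```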

In [None]:
!pip install -q torch transformers datasets tokenizers scipy tqdm tensorboard

## 3. Verify Setup

In [None]:
# Verify project structure
import os
print(f"Current directory: {os.getcwd()}")
print("\nProject structure:")
!ls -la

# Check if src exists
if os.path.exists('src'):
    print("\n✅ src directory found!")
    !ls src/
else:
    print("\n❌ src directory not found!")
    print("Please use one of the setup methods above.")
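
The check above only looks for `src`. The later cells also rely on `src.training`, `src.evaluation`, and `src.inference`, so it can be worth confirming those sub-packages before starting a long run. A minimal sketch, where `EXPECTED_DIRS` and `missing_dirs` are names invented here and the directory list is inferred from the commands in this notebook:

```python
from pathlib import Path

# Sub-packages implied by the `python -m src...` commands and the
# `from src.inference...` import used later in this notebook.
EXPECTED_DIRS = ["src/training", "src/evaluation", "src/inference"]


def missing_dirs(root=".", expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in expected if not (root / d).is_dir()]


missing = missing_dirs()
if missing:
    print("Missing directories:", ", ".join(missing))
else:
    print("All expected source directories found.")
```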

## 4. Quick Test (30-60 min)

In [None]:
# Quick test
!python -m src.training.train --quick-start --optimizer hybrid

## 5. Full Training (8-12 hours)

In [None]:
# Full training with Hybrid optimizer
!python -m src.training.train \
  --optimizer hybrid \
  --muon-lr 0.02 \
  --batch-size 32 \
  --grad-accum-steps 8 \
  --num-epochs 10 \
  --learning-rate 2e-4 \
  --use-wikipedia \
  --use-snli \
  --max-wiki-samples 100000 \
  --output-dir ./outputs \
  --fp16
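
With `--batch-size 32` and `--grad-accum-steps 8`, each optimizer update sees 32 × 8 = 256 samples. The sketch below works out that effective batch size and a rough optimizer step count. It assumes the trainer also steps on the final, partial accumulation group of each epoch, and it counts only the 100,000 Wikipedia samples, so the real number of steps will be higher once SNLI data is included.

```python
import math


def optimizer_steps(num_samples, batch_size, grad_accum_steps, num_epochs):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = batch_size * grad_accum_steps
    micro_batches_per_epoch = math.ceil(num_samples / batch_size)
    # Assumption: a leftover partial accumulation group still triggers a step.
    steps_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum_steps)
    return effective_batch, steps_per_epoch * num_epochs


effective, total = optimizer_steps(
    num_samples=100_000, batch_size=32, grad_accum_steps=8, num_epochs=10
)
print(f"Effective batch size: {effective}")  # 256
print(f"Optimizer steps (Wikipedia only): {total}")  # 3910
```

If a T4 runs out of memory, halving `--batch-size` and doubling `--grad-accum-steps` keeps the effective batch size unchanged.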

## 6. Monitor Training

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./logs

In [None]:
# Plot loss
import json
import matplotlib.pyplot as plt

try:
    with open('./outputs/training_history.json') as f:
        history = json.load(f)
    steps = [item['step'] for item in history['train_loss']]
    losses = [item['loss'] for item in history['train_loss']]
    plt.figure(figsize=(12, 5))
    plt.plot(steps, losses)
    plt.xlabel('Steps')
    plt.ylabel('Loss')
    plt.title('Training Loss')
    plt.grid(True)
    plt.show()
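except (FileNotFoundError, KeyError, json.JSONDecodeError):
    # History file is missing, incomplete, or still being written
    print("No history yet")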

## 7. Evaluation

In [None]:
!python -m src.evaluation.sts_evaluation \
  --checkpoint ./outputs/best_model/checkpoint.pt \
  --tokenizer ./data/tokenizer/tokenizer.json

## 8. Inference

In [None]:
from src.inference.inference import load_model

model = load_model(
    "./outputs/best_model/checkpoint.pt",
    "./data/tokenizer/tokenizer.json"
)
print("✅ Model loaded!")

In [None]:
# Test
emb = model.encode("Machine learning is amazing!")
print(f"Shape: {emb.shape}")

sim = model.similarity("I love AI", "AI is great")
print(f"Similarity: {sim:.4f}")
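
For intuition about the similarity score: embedding models are usually compared with cosine similarity, the cosine of the angle between two vectors. Whether `model.similarity` uses exactly this metric is an assumption, so check `src/inference` to confirm. A dependency-free sketch:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction, close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal, 0
```

Scores range from -1 (opposite directions) to 1 (same direction), and vector length does not affect the result.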

## 💡 Tips for VSCode Colab Extension

**Best workflow:**
1. Save code to GitHub
2. Clone in notebook (Method A)
3. Train on Colab GPU
4. Download checkpoints

**Alternative (Google Drive):**
1. Upload project to Drive: `MyDrive/Embedding_Model/`
2. Use Method B to access
3. All files stay in Drive (persistent!)

**Performance:**
- Training: 210 samples/sec
- Inference: 300 samples/sec (2x faster!)
- STS-B: 0.62-0.72 (10 epochs)
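
The training throughput above can be turned into a rough wall-clock estimate. This is a back-of-the-envelope sketch, not a prediction: it covers only the 100,000 Wikipedia samples, while the SNLI pairs enabled by `--use-snli`, evaluation passes, and checkpointing all add time on top, so treat the result as a lower bound.

```python
def training_hours(num_samples, num_epochs, samples_per_sec):
    """Hours of pure training compute at a constant throughput."""
    return num_samples * num_epochs / samples_per_sec / 3600


wiki_only = training_hours(num_samples=100_000, num_epochs=10, samples_per_sec=210)
print(f"Wikipedia samples only: {wiki_only:.1f} h")
```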