# Complete Deep Learning Pipeline
Complete Pipeline: Preprocess → Tokenizer → Train Models → Embeddings → Evaluation

This notebook orchestrates the entire pipeline with configurable parameters at the top.

In [11]:
# ============================================================================
# 🔧 CONFIGURATION - MODIFY THESE BEFORE RUNNING
# ============================================================================

# ========== SKIP FLAGS - Set to True to skip a stage ==========
SKIP_PREPROCESS = False
SKIP_TOKENIZER = True
SKIP_LSTM = False
SKIP_TRANSFORMER = False
SKIP_EMBEDDINGS = False
SKIP_EVALUATION = False

# ========== PREPROCESSING PARAMETERS ==========
CORPUS_SIZE = 10      # Tiny value for testing; the run below kept 10 unique titles (116 documents)

# ========== TOKENIZER PARAMETERS ==========
TOKENIZER_VOCAB_SIZE = 2000

# ========== LSTM TRAINING PARAMETERS ==========
LSTM_EPOCHS = 1         # Single epoch, for a quick test run
LSTM_BATCH_SIZE = 32
LSTM_SEQ_LENGTH = 128
LSTM_LEARNING_RATE = 0.001

# ========== TRANSFORMER TRAINING PARAMETERS ==========
TRANSFORMER_EPOCHS = 1  # Single epoch, for a quick test run
TRANSFORMER_BATCH_SIZE = 32
TRANSFORMER_SEQ_LENGTH = 128
TRANSFORMER_LEARNING_RATE = 0.001

# ========== EMBEDDINGS PARAMETERS ==========
EMBEDDINGS_MODELS = None  # None = all models ['byt5', 'canine', 'bpe-lstm', 'bpe-transformer', 'bert']
EMBEDDINGS_CLEAR_EXISTING = True

# ============================================================================
# Display current configuration
# ============================================================================
print("="*80)
print("PIPELINE CONFIGURATION")
print("="*80)
print("\n📍 SKIP FLAGS:")
print(f"  SKIP_PREPROCESS: {SKIP_PREPROCESS}")
print(f"  SKIP_TOKENIZER: {SKIP_TOKENIZER}")
print(f"  SKIP_LSTM: {SKIP_LSTM}")
print(f"  SKIP_TRANSFORMER: {SKIP_TRANSFORMER}")
print(f"  SKIP_EMBEDDINGS: {SKIP_EMBEDDINGS}")
print(f"  SKIP_EVALUATION: {SKIP_EVALUATION}")
print("\n⚙️ PARAMETERS:")
print(f"  Corpus size: {CORPUS_SIZE}")
print(f"  Tokenizer vocab size: {TOKENIZER_VOCAB_SIZE}")
print(f"  LSTM epochs: {LSTM_EPOCHS}, batch size: {LSTM_BATCH_SIZE}, seq length: {LSTM_SEQ_LENGTH}")
print(f"  Transformer epochs: {TRANSFORMER_EPOCHS}, batch size: {TRANSFORMER_BATCH_SIZE}, seq length: {TRANSFORMER_SEQ_LENGTH}")
print(f"  Embeddings models: {EMBEDDINGS_MODELS}")
print("="*80)

PIPELINE CONFIGURATION

📍 SKIP FLAGS:
  SKIP_PREPROCESS: False
  SKIP_TOKENIZER: True
  SKIP_LSTM: False
  SKIP_TRANSFORMER: False
  SKIP_EMBEDDINGS: False
  SKIP_EVALUATION: False

⚙️ PARAMETERS:
  Corpus size: 10
  Tokenizer vocab size: 2000
  LSTM epochs: 1, batch size: 32, seq length: 128
  Transformer epochs: 1, batch size: 32, seq length: 128
  Embeddings models: None
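
Every stage below repeats the same pattern: check its skip flag, print a banner, run, and report `[OK]` or `[ERROR]`. As a minimal sketch of that pattern (the helper name `run_stage` is hypothetical and is not part of this repository), the shared logic could be factored out like this:

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_stage(name: str, skip: bool, fn: Callable[[], T]) -> Optional[T]:
    """Run one pipeline stage unless its skip flag is set.

    `name`, `skip` and `fn` are illustrative parameters; the notebook's own
    stage functions inline this logic instead of sharing a helper.
    """
    if skip:
        print(f"\n[SKIP] {name}")
        return None

    print("\n" + "=" * 80)
    print(name.upper())
    print("=" * 80)

    try:
        result = fn()
    except Exception as e:
        # Report the failure, then re-raise so the notebook stops here.
        print(f"\n[ERROR] {name} failed: {e}")
        raise

    print(f"\n[OK] {name} complete")
    return result


# Usage: a skipped stage returns None, a run stage returns fn's result.
run_stage("Tokenizer training", skip=True, fn=lambda: None)
value = run_stage("Demo stage", skip=False, fn=lambda: 42)
```

Re-raising after the `[ERROR]` message keeps the behavior of the cells below: a failed stage halts execution rather than letting later stages run on missing artifacts.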


In [12]:
# ============================================================================
# SETUP: Import Libraries & Set Path
# ============================================================================
import os
import sys
import subprocess
from pathlib import Path

# Add repo root to path - go up from pipeline dir to repo root
notebook_dir = Path.cwd()
repo_root = notebook_dir.parent if notebook_dir.name == 'pipeline' else notebook_dir

# Add repo root to Python path
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

# Change working directory to repo root
os.chdir(repo_root)

print(f"Notebook directory: {notebook_dir}")
print(f"Repository root: {repo_root}")
print(f"Current working directory: {os.getcwd()}")
print(f"Python path updated")

Notebook directory: c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers
Repository root: c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers
Current working directory: c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers
Python path updated
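
The setup cell assumes the notebook is started either from the repo root or from its `pipeline` directory. A slightly more defensive variant, sketched here under the assumption that the repo root is the first ancestor containing both a `pipeline` and a `models` directory (the helper `find_repo_root` and the marker choice are hypothetical), walks upward until it finds them:

```python
from pathlib import Path
from typing import Iterable


def find_repo_root(start: Path, markers: Iterable[str] = ("pipeline", "models")) -> Path:
    """Return the nearest ancestor of `start` (inclusive) containing all marker directories."""
    markers = tuple(markers)
    start = start.resolve()
    for candidate in (start, *start.parents):
        if all((candidate / m).is_dir() for m in markers):
            return candidate
    raise FileNotFoundError(f"No directory containing {list(markers)} at or above {start}")
```

In the notebook this would be called as `repo_root = find_repo_root(Path.cwd())`; it raises instead of silently falling back to the current directory when the layout is not what was expected.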


## Stage 1: Preprocessing
Preprocess NQ dataset: filter corpus and align queries

In [13]:
def stage_preprocess():
    """Preprocess NQ dataset: filter corpus and align queries"""
    if SKIP_PREPROCESS:
        print("\n[SKIP] Preprocessing")
        return
    
    print("\n" + "="*80)
    print("STAGE 1: PREPROCESSING")
    print("="*80)
    
    from data_processing.nq_preprocess import preprocess_data
    
    try:
        print(f"\nParameters:")
        print(f"  Corpus size: {CORPUS_SIZE}")
        
        corpus_file, queries_file = preprocess_data(corpus_size=CORPUS_SIZE)
        print(f"\n[OK] Preprocessing complete")
        print(f"  Corpus: {corpus_file}")
        print(f"  Queries: {queries_file}")
    except Exception as e:
        print(f"\n[ERROR] Preprocessing failed: {e}")
        raise

# Run preprocessing stage
stage_preprocess()


STAGE 1: PREPROCESSING

Parameters:
  Corpus size: 10
Loading corpus dataset...
  Total documents: 2681468
  Unique titles: 108593

Filtering corpus to 10 documents...
  Filtered documents: 116
  Unique titles: 10
  Saved to: c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers\data_processing\..\data_filtered\corpus_filtered.jsonl

Loading queries dataset...
  Total queries: 3452
Loading relevance judgments...
  Total query-corpus pairs: 4201

Merging queries with relevance judgments...
Filtering queries to match filtered c
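
The output above shows that a corpus size of 10 produced 10 unique titles but 116 documents, so the limit appears to apply to titles rather than to individual passages. How `preprocess_data` picks those titles is not shown. The following is only a sketch of one way to do it, assuming each document is a dict with a `title` field and that the first titles encountered are kept (both are assumptions):

```python
from typing import Dict, Iterable, List


def filter_by_titles(docs: Iterable[Dict], n_titles: int, title_key: str = "title") -> List[Dict]:
    """Keep every document whose title is among the first `n_titles` distinct titles seen."""
    kept_titles = set()
    kept_docs = []
    for doc in docs:
        title = doc[title_key]
        if title not in kept_titles:
            if len(kept_titles) >= n_titles:
                continue  # a new title, but the title budget is used up
            kept_titles.add(title)
        kept_docs.append(doc)
    return kept_docs


# Usage: two titles are kept, which yields four documents.
sample = [{"title": t} for t in ["A", "B", "A", "C", "B"]]
filtered = filter_by_titles(sample, n_titles=2)
print(len(filtered), sorted({d["title"] for d in filtered}))
```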

## Stage 2: Tokenizer Training
Train BPE tokenizer on dataset

In [14]:
def stage_tokenizer():
    """Train BPE tokenizer on dataset"""
    if SKIP_TOKENIZER:
        print("\n[SKIP] Tokenizer training")
        return
    
    print("\n" + "="*80)
    print("STAGE 2: TOKENIZER TRAINING")
    print("="*80)
    
    tokenizer_script = repo_root / 'tokenization' / 'our_tokenizers' / 'train_tokenizer.py'
    
    try:
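        print(f"\nParameters:")
        # NOTE: TOKENIZER_VOCAB_SIZE is only printed here; this call does not
        # pass it to train_tokenizer.py.
        print(f"  Vocab size: {TOKENIZER_VOCAB_SIZE}")
        print(f"\nRunning tokenizer training...")
        # Run the script with the notebook's interpreter; its output streams
        # to the notebook, and check=True raises on a non-zero exit code.
        subprocess.run(
            [sys.executable, str(tokenizer_script)],
            cwd=repo_root / 'tokenization' / 'our_tokenizers',
            check=True
        )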
        print(f"\n[OK] Tokenizer training complete")
    except subprocess.CalledProcessError as e:
        print(f"\n[ERROR] Tokenizer training failed with exit code {e.returncode}")
        raise
    except Exception as e:
        print(f"\n[ERROR] Tokenizer training failed: {e}")
        raise

# Run tokenizer stage
stage_tokenizer()


[SKIP] Tokenizer training
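
For readers unfamiliar with BPE, here is a toy, pure-Python sketch of the merge-learning loop. It is an illustration of the general technique only, not the code in `train_tokenizer.py` (which is not shown in this notebook). The function names and the tie-breaking rule are choices made for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace each non-overlapping occurrence of `pair` in `symbols` with the merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words: Iterable[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merges, starting from single characters."""
    vocab = Counter(tuple(w) for w in words)  # symbol sequence -> frequency
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the lexicographically larger pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


print(train_bpe(["aab", "aab", "aab", "ab"], num_merges=5))
```

In this scheme the final vocabulary size is roughly the number of base symbols plus the number of merges, which is the quantity a setting such as `TOKENIZER_VOCAB_SIZE` typically bounds.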


## Stage 3A: Train LSTM Model
Train LSTM language model with BPE tokenization

In [15]:
def stage_train_lstm():
    """Train LSTM language model with BPE tokenization"""
    if SKIP_LSTM:
        print("\n[SKIP] LSTM model training")
        return
    
    print("\n" + "="*80)
    print("STAGE 3A: LSTM MODEL TRAINING")
    print("="*80)
    
    from models.LSTM.training.train_bpe_lstm import main as train_lstm_main
    
    try:
        print(f"\nParameters:")
        print(f"  Epochs: {LSTM_EPOCHS}")
        print(f"  Batch size: {LSTM_BATCH_SIZE}")
        print(f"  Sequence length: {LSTM_SEQ_LENGTH}")
        print(f"  Learning rate: {LSTM_LEARNING_RATE}")
        
        train_lstm_main(
            batch_size=LSTM_BATCH_SIZE,
            seq_length=LSTM_SEQ_LENGTH,
            num_epochs=LSTM_EPOCHS,
            learning_rate=LSTM_LEARNING_RATE
        )
        print(f"\n[OK] LSTM training complete")
    except Exception as e:
        print(f"\n[ERROR] LSTM training failed: {e}")
        raise

# Run LSTM training stage
stage_train_lstm()


STAGE 3A: LSTM MODEL TRAINING

Parameters:
  Epochs: 1
  Batch size: 32
  Sequence length: 128
  Learning rate: 0.001
Training LSTM Language Model with BPE Tokenization

🔧 Using device: cuda

📦 Loading BPE tokenizer from c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers\tokenization\vocabularies\bpe_tokenizer.json
Tokenizer loaded from c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers\tokenization\vocabularies\bpe_tokenizer.json
   Vocabulary size: 2001

📚 Loading documents from c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers\data_filtered\corpus_filtered.jsonl
   Loaded 116 documents

🔨 Creating dataset...
Creating dataset with seq_length=128, stride=64...
  Processing text 0/116
  Processing text 100/116
✅ Created 328 training examples


Epoch 1: 100%|██████████| 9/9 [00:00<00:00, 10.69it/s, loss=6.4859, ppl=655.80]




Epoch 1/1 (0.9s)
  Train Loss: 7.4920 | Train PPL: 1793.68
  Val Loss:   6.0539 | Val PPL:   425.79
  LR: 0.001000
  ✅ Saved best model (val_loss=6.0539)

📊 Final evaluation on test set...

FINAL RESULTS
Test Loss:       6.0520
Test Perplexity: 424.98
Bits per Char:   8.731

✅ Training complete! Model saved to c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers\models\lstm_bpe_final.pt

[OK] LSTM training complete
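
The three test metrics above are consistent with one another if the loss is the mean cross-entropy in nats per token: perplexity is then `exp(loss)` and the bits figure is `loss / ln 2`. This is inferred from the printed numbers, not from the training script. Under that reading, the value labelled "Bits per Char" would be bits per BPE token. A quick check:

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity from a mean cross-entropy loss in nats."""
    return math.exp(loss_nats)


def bits_per_token(loss_nats: float) -> float:
    """Convert a mean cross-entropy loss from nats to bits."""
    return loss_nats / math.log(2)


test_loss = 6.0520  # "Test Loss" reported by the LSTM run above
print(f"perplexity     = {perplexity(test_loss):.2f}")
print(f"bits per token = {bits_per_token(test_loss):.3f}")
```

The perplexity matches the reported 424.98 to within the rounding of the printed loss, and the bits value matches the reported 8.731.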


## Stage 3B: Train Transformer Model
Train Transformer language model with BPE tokenization

In [16]:
def stage_train_transformer():
    """Train Transformer language model with BPE tokenization"""
    if SKIP_TRANSFORMER:
        print("\n[SKIP] Transformer model training")
        return
    
    print("\n" + "="*80)
    print("STAGE 3B: TRANSFORMER MODEL TRAINING")
    print("="*80)
    
    from models.Transformer.training.train_bpe_transformer import main as train_transformer_main
    
    try:
        print(f"\nParameters:")
        print(f"  Epochs: {TRANSFORMER_EPOCHS}")
        print(f"  Batch size: {TRANSFORMER_BATCH_SIZE}")
        print(f"  Sequence length: {TRANSFORMER_SEQ_LENGTH}")
        print(f"  Learning rate: {TRANSFORMER_LEARNING_RATE}")
        
        train_transformer_main(
            batch_size=TRANSFORMER_BATCH_SIZE,
            seq_length=TRANSFORMER_SEQ_LENGTH,
            num_epochs=TRANSFORMER_EPOCHS,
            learning_rate=TRANSFORMER_LEARNING_RATE
        )
        print(f"\n[OK] Transformer training complete")
    except Exception as e:
        print(f"\n[ERROR] Transformer training failed: {e}")
        raise

# Run Transformer training stage
stage_train_transformer()


STAGE 3B: TRANSFORMER MODEL TRAINING

Parameters:
  Epochs: 1
  Batch size: 32
  Sequence length: 128
  Learning rate: 0.001
Training Transformer Language Model with BPE Tokenization

🔧 Using device: cuda

📦 Loading BPE tokenizer from c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers\tokenization\vocabularies\bpe_tokenizer.json
Tokenizer loaded from c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers\tokenization\vocabularies\bpe_tokenizer.json
   Vocabulary size: 2001

📚 Loading documents from c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers\data_filtered\corpus_filtered.jsonl
   Loaded 116 documents

🔨 Creating dataset...
Creating dataset with seq_length=128, stride=64...
  Processing text 0/116
  Processing text 100/116
✅ Created 328 training examples
   Train: 262 examples
   Val:   32 examples
   Test:  34 examples

🧠 Creating Transformer model...
   Parameters: 2,607,825


Epoch 1: 100%|██████████| 9/9 [00:00<00:00, 20.42it/s, loss=5.5424, ppl=255.30]




Epoch 1/1 (0.5s)
  Train Loss: 6.4789 | Train PPL: 651.28
  Val Loss:   5.5096 | Val PPL:   247.05
  LR: 0.001000
  ✅ Saved best model (val_loss=5.5096)

📊 Final evaluation on test set...

FINAL RESULTS
Test Loss:       5.4612
Test Perplexity: 235.39
Bits per Char:   7.879

✅ Training complete! Model saved to c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers\models\Transformer\transformer_bpe_final.pt

[OK] Transformer training complete
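
The dataset log above reports `seq_length=128, stride=64`, 328 examples split 262/32/34, and 9 batches per epoch. Those numbers are consistent with overlapping next-token windows and an 80/10/10 split in which the remainder goes to the test set. The training script's actual logic is not shown, so the following is only a sketch under those assumptions, with hypothetical function names:

```python
import math
from typing import List, Sequence, Tuple


def sliding_windows(
    token_ids: Sequence[int], seq_length: int = 128, stride: int = 64
) -> List[Tuple[Sequence[int], Sequence[int]]]:
    """Cut one token sequence into overlapping (input, target) pairs.

    The target is the input shifted by one token, so each window needs
    seq_length + 1 tokens.
    """
    examples = []
    for start in range(0, len(token_ids) - seq_length, stride):
        chunk = token_ids[start:start + seq_length + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples


def split_sizes(n: int, train_frac: float = 0.8, val_frac: float = 0.1) -> Tuple[int, int, int]:
    """Train/val sizes are rounded down; the test set takes whatever is left."""
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return n_train, n_val, n - n_train - n_val


n_train, n_val, n_test = split_sizes(328)
print(n_train, n_val, n_test)     # sizes for the 328 examples in the log
print(math.ceil(n_train / 32))    # batches per epoch at batch size 32
```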


## Stage 4: Embeddings Generation
Generate embeddings using all models and store in database

In [17]:
def stage_embeddings():
    """Generate embeddings using all models and store in database"""
    if SKIP_EMBEDDINGS:
        print("\n[SKIP] Embeddings generation")
        return
    
    print("\n" + "="*80)
    print("STAGE 4: EMBEDDINGS GENERATION")
    print("="*80)
    
    from pipeline.run_all_embeddings import run_embeddings_pipeline
    
    try:
        # Prepare models to run
        if EMBEDDINGS_MODELS is None:
            models = ['byt5', 'canine', 'bpe-lstm', 'bpe-transformer', 'bert']
        else:
            models = EMBEDDINGS_MODELS
        
        print(f"\nParameters:")
        print(f"  Models: {', '.join(models)}")
        print(f"  Clear tables: {EMBEDDINGS_CLEAR_EXISTING}")
        
        results = run_embeddings_pipeline(
            models=models,
            clear_existing=EMBEDDINGS_CLEAR_EXISTING
        )
        print(f"\n[OK] Embeddings generation complete")
    except Exception as e:
        print(f"\n[ERROR] Embeddings generation failed: {e}")
        raise

# Run embeddings stage
stage_embeddings()


STAGE 4: EMBEDDINGS GENERATION

Parameters:
  Models: byt5, canine, bpe-lstm, bpe-transformer, bert
  Clear tables: True
PyTorch version: 2.6.0.dev20241112+cu121
OS: Windows AMD64
🚀 CUDA device: NVIDIA GeForce RTX 3050 Laptop GPU

🚀 Running pipeline for 5 model(s): byt5, canine, bpe-lstm, bpe-transformer, bert
Dataset: c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers\data_filtered\corpus_filtered.jsonl
Database: postgresql+psycopg://nick:secret@localhost:5433/vectordb
Clear existing tables: True


Processing: ByT5
--- Clearing existing table: byt5_small ---


  DeclarativeMeta.__init__(cls, classname, bases, dict_, **kw)


--- Loading ByT5 Model: google/byt5-small ---
Using device: cuda
--- Dataset: c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers\data_filtered\corpus_filtered.jsonl ---
--- Table: byt5_small ---
--- Batch size: 64 ---


Embedding with ByT5: 116it [01:01,  1.88it/s]



✅ ByT5 completed successfully!
--- Cleaning up memory for ByT5 ---
    GPU fully cleared and synchronized
    Memory cleanup complete


Processing: Canine
--- Clearing existing table: canine_s ---
--- Loading CANINE Model: google/canine-s ---
Using device: cuda
--- Dataset: c:\Users\nick\Desktop\DTU Courses\02456 Deep Learning\Deep-Learning-Transformers\data_filtered\corpus_filtered.jsonl ---
--- Table: canine_s ---
--- Batch size: 64 ---


Embedding with Canine: 64it [00:37,  1.73it/s]

Error processing doc: CUDA out of memory. Tried to allocate 196.00 MiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 10.01 GiB is allocated by PyTorch, and 461.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 65it [01:10,  1.30s/it]

Error processing doc: CUDA out of memory. Tried to allocate 3.12 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 7.02 GiB is allocated by PyTorch, and 2.53 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 66it [01:45,  2.32s/it]

Error processing doc: CUDA out of memory. Tried to allocate 3.17 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 7.10 GiB is allocated by PyTorch, and 2.50 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 67it [02:18,  3.58s/it]

Error processing doc: CUDA out of memory. Tried to allocate 3.21 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 7.18 GiB is allocated by PyTorch, and 2.47 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 68it [02:52,  5.29s/it]

Error processing doc: CUDA out of memory. Tried to allocate 3.26 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 7.26 GiB is allocated by PyTorch, and 2.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 69it [03:27,  7.47s/it]

Error processing doc: CUDA out of memory. Tried to allocate 3.31 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 7.34 GiB is allocated by PyTorch, and 2.40 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 70it [04:11, 10.97s/it]

Error processing doc: CUDA out of memory. Tried to allocate 4.36 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 4.89 GiB is allocated by PyTorch, and 1.54 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 71it [04:51, 14.49s/it]

Error processing doc: CUDA out of memory. Tried to allocate 4.42 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 4.94 GiB is allocated by PyTorch, and 1.50 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 72it [06:22, 25.62s/it]

Error processing doc: CUDA out of memory. Tried to allocate 11.21 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 7.17 GiB is allocated by PyTorch, and 2.34 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 73it [07:41, 34.88s/it]

Error processing doc: CUDA out of memory. Tried to allocate 11.36 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 7.24 GiB is allocated by PyTorch, and 3.17 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 74it [07:55, 30.66s/it]

Error processing doc: CUDA out of memory. Tried to allocate 1.58 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 5.14 GiB is allocated by PyTorch, and 4.51 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 75it [09:39, 46.99s/it]

Error processing doc: CUDA out of memory. Tried to allocate 11.68 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 7.39 GiB is allocated by PyTorch, and 2.33 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 76it [11:21, 60.08s/it]

Error processing doc: CUDA out of memory. Tried to allocate 11.83 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 7.46 GiB is allocated by PyTorch, and 2.34 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 77it [12:43, 65.77s/it]

Error processing doc: CUDA out of memory. Tried to allocate 11.99 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 7.54 GiB is allocated by PyTorch, and 2.35 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 78it [14:32, 77.25s/it]

Error processing doc: CUDA out of memory. Tried to allocate 12.14 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 7.61 GiB is allocated by PyTorch, and 2.33 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 79it [16:19, 85.49s/it]

Error processing doc: CUDA out of memory. Tried to allocate 12.30 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 7.69 GiB is allocated by PyTorch, and 2.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 80it [16:21, 61.94s/it]

Error processing doc: CUDA out of memory. Tried to allocate 1.88 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 5.88 GiB is allocated by PyTorch, and 4.14 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 81it [16:42, 49.93s/it]

Error processing doc: CUDA out of memory. Tried to allocate 1.90 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 5.93 GiB is allocated by PyTorch, and 4.09 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 82it [17:02, 41.37s/it]

Error processing doc: CUDA out of memory. Tried to allocate 1.92 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 5.98 GiB is allocated by PyTorch, and 4.04 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 83it [17:22, 35.19s/it]

Error processing doc: CUDA out of memory. Tried to allocate 1.95 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 6.03 GiB is allocated by PyTorch, and 3.98 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 84it [17:43, 31.02s/it]

Error processing doc: CUDA out of memory. Tried to allocate 1.97 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 6.08 GiB is allocated by PyTorch, and 3.93 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 85it [18:05, 28.21s/it]

Error processing doc: CUDA out of memory. Tried to allocate 1.99 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 6.13 GiB is allocated by PyTorch, and 3.88 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 86it [18:46, 32.09s/it]

Error processing doc: CUDA out of memory. Tried to allocate 2.02 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 6.18 GiB is allocated by PyTorch, and 3.83 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 87it [18:49, 23.43s/it]

Error processing doc: CUDA out of memory. Tried to allocate 2.04 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 6.23 GiB is allocated by PyTorch, and 3.78 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 88it [19:12, 23.12s/it]

Error processing doc: CUDA out of memory. Tried to allocate 2.06 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 6.28 GiB is allocated by PyTorch, and 3.73 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 89it [19:34, 22.89s/it]

Error processing doc: CUDA out of memory. Tried to allocate 2.09 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 6.33 GiB is allocated by PyTorch, and 3.68 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 90it [19:58, 23.11s/it]

Error processing doc: CUDA out of memory. Tried to allocate 2.11 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 6.38 GiB is allocated by PyTorch, and 3.63 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 91it [20:21, 23.26s/it]

Error processing doc: CUDA out of memory. Tried to allocate 2.13 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 6.44 GiB is allocated by PyTorch, and 3.58 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 92it [22:58, 63.22s/it]

Error processing doc: CUDA out of memory. Tried to allocate 552.00 MiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 8.91 GiB is allocated by PyTorch, and 1.57 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 93it [23:03, 45.72s/it]

Error processing doc: CUDA out of memory. Tried to allocate 2.18 GiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 6.54 GiB is allocated by PyTorch, and 3.95 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 94it [24:00, 49.20s/it]

Error processing doc: CUDA out of memory. Tried to allocate 564.00 MiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 9.07 GiB is allocated by PyTorch, and 1.46 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Embedding with Canine: 94it [25:39, 16.38s/it]



KeyboardInterrupt: 
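
The CANINE run above fails from document 64 onward with allocation requests of up to about 12 GiB on a 4 GiB GPU, and was interrupted by hand. The embedding script is not shown, so the cause is not confirmed. Two plausible readings are that long documents are encoded whole by a character-level model, or that tensors are retained between iterations. For the first case, one common mitigation is to split each document into bounded character windows, embed each window, and average the results. Below is a minimal sketch of that idea; `embed_fn` is a hypothetical stand-in for the model's forward pass, and the window sizes are illustrative, not tuned.

```python
from typing import Callable, List, Sequence


def chunk_text(text: str, max_chars: int = 2048, overlap: int = 128) -> List[str]:
    """Split `text` into windows of at most `max_chars` characters that overlap by `overlap`."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    if not text:
        return []
    step = max_chars - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + max_chars] for i in range(0, last_start, step)]


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def embed_long_text(
    text: str,
    embed_fn: Callable[[str], Sequence[float]],
    max_chars: int = 2048,
    overlap: int = 128,
) -> List[float]:
    """Embed each window separately so memory use is bounded by `max_chars`, then average."""
    chunks = chunk_text(text, max_chars, overlap)
    if not chunks:
        raise ValueError("cannot embed empty text")
    return mean_pool([embed_fn(chunk) for chunk in chunks])
```

Averaging window embeddings is a design trade-off: it caps peak memory per forward pass, at the cost of weighting every window equally regardless of length. The error message itself also suggests trying `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` when fragmentation is the issue.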

## Stage 5: Evaluation
Evaluate all embedding models on retrieval task

In [None]:
def stage_evaluation():
    """Evaluate all embedding models on retrieval task"""
    if SKIP_EVALUATION:
        print("\n[SKIP] Evaluation")
        return
    
    print("\n" + "="*80)
    print("STAGE 5: EVALUATION")
    print("="*80)
    
    from tokenization.evaluation.evaluation import main as evaluation_main
    
    try:
        evaluation_main()
        print(f"\n[OK] Evaluation complete")
    except Exception as e:
        print(f"\n[ERROR] Evaluation failed: {e}")
        raise

# Run evaluation stage
stage_evaluation()
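
This cell has not been run, and the notebook does not show which metrics `evaluation_main` reports. For orientation only, here is a sketch of two metrics commonly used for embedding-based retrieval, Recall@k and reciprocal rank, computed over a cosine-similarity ranking. All function names and the toy vectors are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Document ids sorted from most to least similar to the query."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
ranking = rank_documents([1.0, 0.1], docs)
print(ranking, recall_at_k(ranking, {"d3"}, k=2), reciprocal_rank(ranking, {"d3"}))
```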

## Pipeline Summary
Display the final status and summary

In [None]:
print("\n" + "="*80)
print("✅ PIPELINE EXECUTION COMPLETE")
print("="*80)
print("\nConfiguration Summary:")
print(f"  SKIP_PREPROCESS: {SKIP_PREPROCESS}")
print(f"  SKIP_TOKENIZER: {SKIP_TOKENIZER}")
print(f"  SKIP_LSTM: {SKIP_LSTM}")
print(f"  SKIP_TRANSFORMER: {SKIP_TRANSFORMER}")
print(f"  SKIP_EMBEDDINGS: {SKIP_EMBEDDINGS}")
print(f"  SKIP_EVALUATION: {SKIP_EVALUATION}")
print("="*80)