# 02 - Data Preparation: SST-2

**Thesis Section Reference:** Chapter 3.6 - Tasks and Datasets

This notebook prepares the SST-2 sentiment classification dataset:
1. Load SST-2 from the GLUE benchmark
2. Create subsets for FAST MODE
3. Tokenize for causal LM training
4. Save processed datasets

## Task Description
- **Dataset:** GLUE SST-2 (Stanford Sentiment Treebank)
- **Task:** Binary sentiment classification (positive/negative)
- **Metrics:** Accuracy, F1
- **Splits:** Train (67,349), Validation (872), Test (1,821; not used in this notebook)
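
The two metrics listed above can be sketched in plain Python as follows. This is a minimal illustration, assuming F1 is computed for the positive class (label 1); the notebook does not say which F1 variant the thesis uses, and the function names are hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, not SST-2 results
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 1]
acc = accuracy(gold, pred)
f1 = binary_f1(gold, pred)
```

In practice a library implementation such as scikit-learn's `accuracy_score` and `f1_score` would usually replace these helpers.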

In [1]:
# Standard setup - load environment and config
import os
import sys
from pathlib import Path

ROOT_DIR = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
sys.path.insert(0, str(ROOT_DIR / "src"))

from dotenv import load_dotenv
load_dotenv(ROOT_DIR / ".env")

from config import load_config
from utils_seed import set_seed

config = load_config(str(ROOT_DIR / "configs" / "experiment.yaml"))
config.ensure_dirs()

# Set seed for reproducibility
SEED = config.get_seeds()[0]
set_seed(SEED)

print(f"Mode: {'FAST' if config.fast_mode else 'FULL'}")
print(f"Seed: {SEED}")

Mode: FAST
Seed: 42
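
The `set_seed` helper is imported from `utils_seed`, which is not shown in this notebook. A rough sketch of what such a helper typically does, under the assumption that it seeds Python, NumPy, and PyTorch when they are installed, might look like this (`set_seed_sketch` is a hypothetical name):

```python
import os
import random


def set_seed_sketch(seed: int) -> None:
    """Seed the RNGs that are available; a stand-in for utils_seed.set_seed."""
    # PYTHONHASHSEED only affects interpreters started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed: nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed: nothing to seed


# Re-seeding reproduces the same random stream
set_seed_sketch(42)
first = random.random()
set_seed_sketch(42)
second = random.random()
```

Seeding alone does not guarantee bit-identical training runs on GPU; it only fixes the random streams that the seeded libraries draw from.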


In [2]:
# Check if data already exists (idempotent)
DATA_DIR = ROOT_DIR / "results" / "processed_data"
DATA_DIR.mkdir(parents=True, exist_ok=True)

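# save_to_disk() writes one directory per split (sst2_train/, sst2_validation/),
# so the cache check must look for those directories, not for .arrow files.
sst2_train_path = DATA_DIR / "sst2_train"
sst2_val_path = DATA_DIR / "sst2_validation"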

if sst2_train_path.exists() and sst2_val_path.exists():
    print("✓ SST-2 data already exists, loading from cache...")
    SKIP_PROCESSING = True
else:
    print("SST-2 data not found, will process...")
    SKIP_PROCESSING = False

SST-2 data not found, will process...
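
Because `Dataset.save_to_disk()` produces a directory rather than a single file, a cache check can test for that directory and for the marker files inside it. The sketch below is one way to do this; it assumes the directory contains `state.json` and `dataset_info.json`, which is worth confirming against the installed `datasets` version, and `split_is_cached` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def split_is_cached(split_dir: Path) -> bool:
    """True if split_dir looks like the output of Dataset.save_to_disk()."""
    # Assumption: these two files sit next to the Arrow shards.
    required = ("state.json", "dataset_info.json")
    return split_dir.is_dir() and all((split_dir / name).is_file() for name in required)


with tempfile.TemporaryDirectory() as tmp:
    split_dir = Path(tmp) / "sst2_train"
    missing = split_is_cached(split_dir)   # directory does not exist yet
    split_dir.mkdir()
    for name in ("state.json", "dataset_info.json"):
        (split_dir / name).write_text("{}")
    present = split_is_cached(split_dir)   # directory with both marker files
```

Checking for the marker files as well as the directory avoids treating a half-written cache from an interrupted run as complete.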


In [3]:
# Load SST-2 dataset
from datasets import load_dataset

if not SKIP_PROCESSING:
    print("Loading SST-2 from GLUE...")
    
    raw_dataset = load_dataset(
        "glue", 
        "sst2",
        cache_dir=str(ROOT_DIR / "hf_cache")
    )
    
    print(f"\nDataset structure:")
    print(raw_dataset)
    
    print(f"\nSample examples:")
    for i in range(3):
        ex = raw_dataset["train"][i]
        label = "positive" if ex["label"] == 1 else "negative"
        print(f"  [{label}] {ex['sentence'][:80]}...")

Loading SST-2 from GLUE...


Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

Sample examples:
  [negative] hide new secretions from the parental units ...
  [negative] contains no wit , only labored gags ...
  [positive] that loves its characters and communicates something rather beautiful about huma...


In [4]:
# Create subsets based on mode
if not SKIP_PROCESSING:
    train_size = config.get_subset_size("sst2", "train")
    val_size = config.get_subset_size("sst2", "validation")
    
    if train_size is not None:
        print(f"FAST MODE: Subsetting to {train_size} train, {val_size} validation examples")
        
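        # Clamp both sizes so an oversized config value cannot index past the split
        n_train = min(train_size, len(raw_dataset["train"]))
        n_val = min(val_size, len(raw_dataset["validation"]))
        train_dataset = raw_dataset["train"].shuffle(seed=SEED).select(range(n_train))
        val_dataset = raw_dataset["validation"].shuffle(seed=SEED).select(range(n_val))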
    else:
        print("FULL MODE: Using complete dataset")
        train_dataset = raw_dataset["train"]
        val_dataset = raw_dataset["validation"]
    
    print(f"\nFinal sizes:")
    print(f"  Train: {len(train_dataset)}")
    print(f"  Validation: {len(val_dataset)}")
    
    # Check label distribution
    train_labels = train_dataset["label"]
    pos_ratio = sum(train_labels) / len(train_labels)
    print(f"\nLabel distribution (train):")
    print(f"  Positive: {pos_ratio:.1%}")
    print(f"  Negative: {1-pos_ratio:.1%}")

FAST MODE: Subsetting to 2000 train, 500 validation examples

Final sizes:
  Train: 2000
  Validation: 500

Label distribution (train):
  Positive: 56.0%
  Negative: 44.0%
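
The subsetting step shuffles with a fixed seed and keeps the first N rows, so the same seed always yields the same subset. The pure-Python analogue below illustrates the idea; it does not reproduce the permutation that `datasets` generates internally, and both helper names are hypothetical.

```python
import random
from collections import Counter


def subset_indices(n_total: int, n_subset: int, seed: int) -> list:
    """Shuffle range(n_total) with a private RNG and keep the first n_subset."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[: min(n_subset, n_total)]


def label_shares(labels) -> dict:
    """Map each label to its share of the examples."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy data: 100 alternating labels, not the SST-2 labels
toy_labels = [1, 0] * 50
picked = subset_indices(len(toy_labels), 20, seed=42)
shares = label_shares(toy_labels[i] for i in picked)
```

A random subset inherits the class balance of the full split only approximately; if an exact balance matters, stratified sampling would be a different technique from the one used here.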


In [5]:
# Load tokenizer
from transformers import AutoTokenizer

if not SKIP_PROCESSING:
    # Use student tokenizer (will be used for all models)
    tokenizer_name = os.getenv("STUDENT_S1", config.student_s1.name)
    
    print(f"Loading tokenizer: {tokenizer_name}")
    
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_name,
        trust_remote_code=True,
        cache_dir=str(ROOT_DIR / "hf_cache")
    )
    
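    # Ensure a pad token exists; reusing EOS means padding and EOS share one ID,
    # so downstream label masking should rely on attention_mask, not the pad ID
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # also sets pad_token_id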
    
    print(f"  Vocab size: {tokenizer.vocab_size}")
    print(f"  Pad token: {tokenizer.pad_token} (id: {tokenizer.pad_token_id})")
    print(f"  EOS token: {tokenizer.eos_token} (id: {tokenizer.eos_token_id})")

Loading tokenizer: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  Vocab size: 32000
  Pad token: </s> (id: 2)
  EOS token: </s> (id: 2)
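
The output above shows that the pad token and the EOS token share ID 2. One consequence is that masking labels by comparing against the pad ID would also hide the genuine end-of-sequence token. The toy sketch below contrasts the two approaches; how `data_sst2` actually builds its labels is not shown in this notebook, so both functions are illustrative only.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default


def mask_by_attention(input_ids, attention_mask):
    """Keep every real token as a label; ignore only positions marked as padding."""
    return [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(input_ids, attention_mask)]


def mask_by_pad_id(input_ids, pad_token_id):
    """Naive variant: also hides the real EOS when pad and EOS share an ID."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Toy sequence: three content tokens, one real EOS (id 2), two pads (also id 2)
input_ids = [11, 12, 13, 2, 2, 2]
attention_mask = [1, 1, 1, 1, 0, 0]

by_mask = mask_by_attention(input_ids, attention_mask)
by_pad_id = mask_by_pad_id(input_ids, pad_token_id=2)
```

With the attention-mask variant the model is still trained to emit EOS after the answer, which matters if generation is expected to stop on its own.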


In [6]:
# Define prompt template and tokenization
from data_sst2 import create_sst2_prompt, get_sst2_label_tokens

if not SKIP_PROCESSING:
    max_length = config.get_max_length("sst2")
    print(f"Max sequence length: {max_length}")
    
    # Get label token IDs for classification
    label_tokens = get_sst2_label_tokens(tokenizer)
    print(f"\nLabel tokens:")
    print(f"  positive: token ID {label_tokens['positive']}")
    print(f"  negative: token ID {label_tokens['negative']}")
    
    # Show example prompt
    example_sentence = train_dataset[0]["sentence"]
    example_prompt = create_sst2_prompt(example_sentence, include_label=False)
    print(f"\nExample prompt:")
    print("-" * 40)
    print(example_prompt)
    print("-" * 40)

Max sequence length: 256

Label tokens:
  positive: token ID 29871
  negative: token ID 29871

Example prompt:
----------------------------------------
Classify the sentiment of the following sentence as positive or negative.

Sentence: klein , charming in comedies like american pie and dead-on in election , 

Sentiment:
----------------------------------------
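
Both label words map to token ID 29871 in the output above, so the two classes cannot be told apart by that ID. One plausible cause (an assumption, since `get_sst2_label_tokens` is not shown) is that the lookup takes the first token of each encoded word and that first token is a prefix piece shared by both words. The sketch below skips any shared leading tokens and verifies that the resulting IDs are distinct; `resolve_label_tokens` and the toy vocabulary are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def resolve_label_tokens(
    encode: Callable[[str], List[int]], words: Sequence[str]
) -> Dict[str, int]:
    """For each label word, pick the first token ID after any prefix shared by all words."""
    encoded = {word: encode(word) for word in words}
    shortest = min(len(ids) for ids in encoded.values())
    # Skip leading positions where every label word has the same token ID.
    start = 0
    while start < shortest and len({ids[start] for ids in encoded.values()}) == 1:
        start += 1
    if start == shortest:
        raise ValueError("label words do not differ within their shared length")
    tokens = {word: ids[start] for word, ids in encoded.items()}
    if len(set(tokens.values())) != len(tokens):
        raise ValueError(f"label tokens are not distinct: {tokens}")
    return tokens


# Toy encoder: 29871 plays the shared prefix token; 1001 and 1002 are made-up IDs
toy_vocab = {" positive": [29871, 1001], " negative": [29871, 1002]}
label_ids = resolve_label_tokens(toy_vocab.__getitem__, [" positive", " negative"])
```

With a real Hugging Face tokenizer, `encode` could be `lambda w: tokenizer.encode(w, add_special_tokens=False)`. Failing loudly on identical IDs is deliberate: classification by comparing the logits of two label tokens is meaningless if both tokens are the same.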


In [7]:
# Tokenize dataset
from data_sst2 import tokenize_sst2_for_lm

if not SKIP_PROCESSING:
    print("Tokenizing datasets...")
    
    def tokenize_fn(examples):
        return tokenize_sst2_for_lm(
            examples, 
            tokenizer, 
            max_length=max_length,
            include_labels=True
        )
    
    # Tokenize train
    print("  Tokenizing train split...")
    tokenized_train = train_dataset.map(
        tokenize_fn,
        batched=True,
        remove_columns=train_dataset.column_names,
        desc="Tokenizing train"
    )
    
    # Tokenize validation
    print("  Tokenizing validation split...")
    tokenized_val = val_dataset.map(
        tokenize_fn,
        batched=True,
        remove_columns=val_dataset.column_names,
        desc="Tokenizing validation"
    )
    
    print(f"\nTokenized dataset features:")
    print(f"  {tokenized_train.features}")

Tokenizing datasets...
  Tokenizing train split...


  Tokenizing validation split...



Tokenized dataset features:
  {'input_ids': List(Value('int32')), 'attention_mask': List(Value('int8')), 'labels': List(Value('int64')), 'original_labels': Value('int64')}
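
The feature list above (`input_ids`, `attention_mask`, `labels`) matches the usual layout for causal LM fine-tuning, where only the answer tokens contribute to the loss. The following sketch shows one common way to build such an example from already-tokenized pieces; it is not the implementation of `tokenize_sst2_for_lm`, which is not shown here, and all IDs are toy values.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default


def build_lm_example(prompt_ids, answer_ids, eos_id, pad_id, max_length):
    """Concatenate prompt + answer + EOS, pad to max_length, supervise only answer + EOS."""
    input_ids = list(prompt_ids) + list(answer_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids) + [eos_id]
    # Right-truncate overlong sequences; note this cuts the answer before the prompt.
    input_ids, labels = input_ids[:max_length], labels[:max_length]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
        "labels": labels + [IGNORE_INDEX] * n_pad,
    }


example = build_lm_example(
    prompt_ids=[5, 6, 7], answer_ids=[9], eos_id=2, pad_id=2, max_length=8
)
```

Masking the prompt with -100 keeps the loss focused on the sentiment word, so the model is not rewarded for reproducing the instruction text.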


In [8]:
# Verify tokenization
if not SKIP_PROCESSING:
    print("Verifying tokenization...")
    
    sample = tokenized_train[0]
    
    # Decode input
    decoded = tokenizer.decode(sample["input_ids"], skip_special_tokens=False)
    print(f"\nSample decoded input:")
    print(decoded[:300])
    
    # Check labels
    labels = sample["labels"]
    non_masked = [l for l in labels if l != -100]
    if non_masked:
        print(f"\nNon-masked label tokens: {non_masked}")
        print(f"Decoded: {tokenizer.decode(non_masked)}")
    
    print(f"\nOriginal label: {sample['original_labels']}")

Verifying tokenization...

Sample decoded input:
<s> Classify the sentiment of the following sentence as positive or negative.

Sentence: klein , charming in comedies like american pie and dead-on in election , 

Sentiment: positive</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s><

Original label: 1
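
The verification cell prints the non-masked label tokens only when at least one exists, and the output above contains no such line. That suggests the labels of this sample are entirely -100, which would leave it without any training signal. A dataset-wide check along these lines could confirm whether that is the case; `fully_masked_rows` is a hypothetical helper and the rows are toy values.

```python
IGNORE_INDEX = -100


def fully_masked_rows(label_rows):
    """Return indices of examples whose labels are entirely masked out."""
    return [
        i for i, labels in enumerate(label_rows)
        if all(tok == IGNORE_INDEX for tok in labels)
    ]


rows = [
    [-100, -100, 9, 2],        # answer and EOS are supervised
    [-100, -100, -100, -100],  # nothing supervised: contributes no loss
]
bad = fully_masked_rows(rows)
```

On the real data this could be called as `fully_masked_rows(tokenized_train["labels"])`; a non-empty result would point at the label construction rather than at the model.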


In [9]:
# Save processed datasets
if not SKIP_PROCESSING:
    print("Saving processed datasets...")
    
    tokenized_train.save_to_disk(str(sst2_train_path.with_suffix("")))
    tokenized_val.save_to_disk(str(sst2_val_path.with_suffix("")))
    
    # Save tokenizer for later use
    tokenizer_path = DATA_DIR / "sst2_tokenizer"
    tokenizer.save_pretrained(str(tokenizer_path))
    
    # Save metadata
    import json
    metadata = {
        "task": "sst2",
        "train_size": len(tokenized_train),
        "val_size": len(tokenized_val),
        "max_length": max_length,
        "tokenizer": tokenizer_name,
        "fast_mode": config.fast_mode,
        "seed": SEED,
        "label_tokens": label_tokens
    }
    
    with open(DATA_DIR / "sst2_metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)
    
    print(f"\n✓ Saved to {DATA_DIR}")
    print(f"  - sst2_train/")
    print(f"  - sst2_validation/")
    print(f"  - sst2_tokenizer/")
    print(f"  - sst2_metadata.json")

Saving processed datasets...



✓ Saved to /Users/pjere/Workshop/thesis-exp/results/processed_data
  - sst2_train/
  - sst2_validation/
  - sst2_tokenizer/
  - sst2_metadata.json


In [10]:
# Load cached data when processing was skipped
from datasets import load_from_disk
import json

if SKIP_PROCESSING:
    print("Loading cached SST-2 data...")
    tokenized_train = load_from_disk(str(sst2_train_path.with_suffix("")))
    tokenized_val = load_from_disk(str(sst2_val_path.with_suffix("")))
    
    with open(DATA_DIR / "sst2_metadata.json", "r") as f:
        metadata = json.load(f)
    
    print(f"\nLoaded from cache:")
    print(f"  Train: {len(tokenized_train)} examples")
    print(f"  Validation: {len(tokenized_val)} examples")
    print(f"  Max length: {metadata['max_length']}")
    print(f"  Tokenizer: {metadata['tokenizer']}")

In [11]:
# Summary
print("=" * 60)
print("SST-2 DATA PREPARATION COMPLETE")
print("=" * 60)
print(f"""
Dataset: GLUE SST-2 (Sentiment Classification)
Mode: {'FAST' if config.fast_mode else 'FULL'}

Sizes:
  Train: {len(tokenized_train)} examples
  Validation: {len(tokenized_val)} examples

Files saved to: {DATA_DIR}

Next Steps:
  1. Run 03_data_prep_squad.ipynb to prepare SQuAD data
  2. Run 04_teacher_cache_outputs.ipynb to cache teacher outputs
""")

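============================================================
SST-2 DATA PREPARATION COMPLETE
============================================================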

Dataset: GLUE SST-2 (Sentiment Classification)
Mode: FAST

Sizes:
  Train: 2000 examples
  Validation: 500 examples

Files saved to: /Users/pjere/Workshop/thesis-exp/results/processed_data

Next Steps:
  1. Run 03_data_prep_squad.ipynb to prepare SQuAD data
  2. Run 04_teacher_cache_outputs.ipynb to cache teacher outputs

