# News Article Semantic Similarity & Topic Retrieval Using Contrastive Learning

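This notebook walks through a contrastive learning pipeline for news article semantic similarity: loading AG News, building anchor-positive-negative triplets, measuring a pre-trained encoder as a baseline, fine-tuning it with triplet loss, and evaluating the fine-tuned model against that baseline.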


## 1. Environment Setup

Install the required packages before running any cells:
```bash
pip install -r requirements.txt
```


## 2. Load Dataset


In [None]:
from datasets import load_dataset
from src.data_loader import preprocess_dataset

# Load AG News dataset
dataset = load_dataset("ag_news")

# Inspect sample
print("Sample from train set:")
print(dataset['train'][0])
print(f"\nTrain size: {len(dataset['train'])}, Test size: {len(dataset['test'])}")


## 3. Preprocess Text


In [None]:
# Preprocess dataset
dataset = preprocess_dataset(dataset)

# Check preprocessed sample
print("Preprocessed sample:")
print(dataset['train'][0])


## 4. Build Anchor-Positive-Negative Triplets


In [None]:
from src.triplets import create_triplets_from_dataset

# Create triplets (capped at 2,000 for this demo)
triplets = create_triplets_from_dataset(dataset['train'], max_triplets=2000)

print(f"Created {len(triplets)} triplets")
print("\nSample triplet:")
print(f"Anchor: {triplets[0][0][:100]}...")
print(f"Positive: {triplets[0][1][:100]}...")
print(f"Negative: {triplets[0][2][:100]}...")
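
The implementation of `create_triplets_from_dataset` is in the `src.triplets` module and is not shown in this notebook. As a rough illustration of the idea, the sketch below assumes that a positive shares the anchor's topic label and a negative comes from a different label. The helper name `build_triplets` and its parameters are hypothetical and not part of the project.

```python
import random
from collections import defaultdict


def build_triplets(texts, labels, max_triplets, seed=0):
    """Hypothetical helper: sample (anchor, positive, negative) texts by label."""
    rng = random.Random(seed)

    # Group texts by their topic label.
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    # A label needs at least two texts to supply an anchor and a distinct positive.
    anchor_labels = [lab for lab, items in by_label.items() if len(items) >= 2]
    if not anchor_labels or len(by_label) < 2:
        return []

    triplets = []
    while len(triplets) < max_triplets:
        pos_label = rng.choice(anchor_labels)
        neg_label = rng.choice([lab for lab in by_label if lab != pos_label])
        anchor, positive = rng.sample(by_label[pos_label], 2)
        negative = rng.choice(by_label[neg_label])
        triplets.append((anchor, positive, negative))
    return triplets


# Toy usage with made-up headlines and labels.
demo_texts = ["a1", "a2", "b1", "b2", "c1"]
demo_labels = [0, 0, 1, 1, 2]
demo_triplets = build_triplets(demo_texts, demo_labels, max_triplets=10)
```

Random sampling like this yields mostly easy negatives; the optional hard negative mining step later in the notebook addresses that.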


## 5. Load Pre-Trained Encoder


In [None]:
from sentence_transformers import SentenceTransformer

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"Model loaded: {model.get_sentence_embedding_dimension()} dimensions")


## 6. Baseline Evaluation


In [None]:
from src.baseline import BaselineEvaluator
from src.data_loader import get_text_and_labels

# Get test samples
texts, labels = get_text_and_labels(dataset['test'], max_samples=500)

# Evaluate baseline
baseline_evaluator = BaselineEvaluator(model_name='all-MiniLM-L6-v2')
baseline_embeddings = baseline_evaluator.encode(texts)

# Retrieve the 5 articles most similar to the first test article (index 0)
results = baseline_evaluator.compute_similarity(0, top_k=5)
print("\nTop 5 similar articles to query:")
for idx, score in results:
    print(f"  [{idx}] (score: {score:.4f}): {texts[idx][:80]}...")

# Visualize
baseline_evaluator.visualize_embeddings(labels, save_path="../baseline_embeddings.png")
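
The internals of `BaselineEvaluator.compute_similarity` are not shown here. A minimal sketch of what a top-k similarity search typically involves, assuming cosine similarity over the encoded vectors and assuming the query itself is excluded from its own results, could look like this. The function names are hypothetical, and plain Python lists stand in for embedding arrays to keep the example dependency-free.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(embeddings, query_idx, top_k=5):
    """Return (index, score) pairs for the top_k nearest items, best first."""
    query = embeddings[query_idx]
    scored = [
        (idx, cosine_similarity(query, emb))
        for idx, emb in enumerate(embeddings)
        if idx != query_idx  # assumption: skip the query itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-D "embeddings" for illustration only.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_results = top_k_similar(toy_embeddings, query_idx=0, top_k=2)
```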


## 7. Prepare DataLoader for Contrastive Learning


In [None]:
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader

# Wrap each (anchor, positive, negative) triplet in an InputExample
train_examples = [InputExample(texts=[a, p, n]) for a, p, n in triplets[:2000]]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

print(f"Prepared {len(train_examples)} training examples")
print(f"Batch size: {train_dataloader.batch_size}")


## 8. Train Model with Contrastive Learning


In [None]:
from src.training import ContrastiveTrainer

# Initialize trainer
trainer = ContrastiveTrainer(base_model_name='all-MiniLM-L6-v2')

# Build the training dataloader through the trainer (this reassigns train_dataloader from section 7)
train_dataloader = trainer.prepare_dataloader(triplets[:2000], batch_size=32)

# Train with triplet loss
num_epochs = 2
trainer.train(
    train_dataloader,
    loss_type='triplet',
    num_epochs=num_epochs,
    output_path='../models/news_contrastive_model',
    save_model=True
)

print("\nTraining complete!")
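
What `ContrastiveTrainer` does for `loss_type='triplet'` is defined in `src.training` and is not shown here. As a sketch of the general idea only, a triplet loss penalizes an anchor whose positive is not closer than its negative by at least some margin. The example below assumes Euclidean distance and an illustrative `margin` of 1.0; both are assumptions, not values taken from the project.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(a, p) - d(a, n) + margin, 0) for one triplet of embeddings."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


# Positive is already much closer than the negative: no penalty.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))  # 0.0
# Roles swapped: the "positive" is far away, so the loss is large.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))  # 5.0
```

The loss reaches zero once every positive sits at least `margin` closer to its anchor than the negative does, so triplets that already satisfy the margin contribute no gradient.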


## 9. Evaluate Fine-Tuned Model


In [None]:
from src.evaluation import Evaluator
from sentence_transformers import SentenceTransformer

# Load fine-tuned model
finetuned_model = SentenceTransformer('../models/news_contrastive_model')

# Evaluate
evaluator = Evaluator(finetuned_model, texts, labels)
results = evaluator.evaluate_all(k_values=[1, 5, 10])

# Visualize improved embeddings
evaluator.visualize_embeddings(
    labels=labels,
    save_path="../finetuned_embeddings.png",
    title="Fine-tuned Embeddings"
)

# Compare with baseline
comparison = evaluator.compare_with_baseline(baseline_embeddings, k_values=[1, 5, 10])
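
The metrics reported by `evaluate_all` are not listed in this notebook. One common retrieval metric that fits the `k_values` argument is precision@k: the fraction of the top k retrieved articles that share the query's topic label. The sketch below assumes each query already has a ranked list of neighbour indices; the function names are hypothetical.

```python
def precision_at_k(ranked_indices, labels, query_label, k):
    """Fraction of the top-k retrieved items whose label matches the query's."""
    top = ranked_indices[:k]
    if not top:
        return 0.0
    return sum(labels[i] == query_label for i in top) / len(top)


def mean_precision_at_k(rankings, labels, k):
    """Average precision@k over queries; rankings[q] lists q's neighbours, best first."""
    if not rankings:
        return 0.0
    per_query = [
        precision_at_k(ranked, labels, labels[q], k)
        for q, ranked in enumerate(rankings)
    ]
    return sum(per_query) / len(per_query)


# Toy data: four items, two topics, hand-written neighbour rankings.
toy_labels = [0, 0, 1, 1]
toy_rankings = [[1, 2, 3], [0, 3, 2], [3, 0, 1], [2, 1, 0]]
print(mean_precision_at_k(toy_rankings, toy_labels, k=1))  # 1.0
print(mean_precision_at_k(toy_rankings, toy_labels, k=2))  # 0.5
```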


## 10. Hard Negative Mining (Optional)


In [None]:
from src.hard_negatives import HardNegativeMiner
from src.data_loader import get_text_and_labels

# Get corpus for hard negative mining
corpus_texts, _ = get_text_and_labels(dataset['train'], max_samples=1000)

# Build BM25 index
miner = HardNegativeMiner()
miner.build_bm25_index(corpus_texts)

# Mine hard negatives
query = "breaking news in politics"
hard_negatives = miner.mine_bm25_hard_negatives(query, n=5)

print(f"Hard negatives for '{query}':")
for idx in hard_negatives:
    print(f"  [{idx}]: {corpus_texts[idx][:80]}...")
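
`HardNegativeMiner` is defined in `src.hard_negatives` and its internals are not shown here. To illustrate the idea, the sketch below scores documents with a small hand-written BM25 (whitespace tokenization, a smoothed IDF that stays positive, and `k1=1.5`, `b=0.75` as assumed defaults) and returns the highest-scoring documents that are not known positives. Such documents overlap lexically with the query without being a correct match, which is what makes them hard. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in corpus against query with a basic BM25."""
    docs = [doc.lower().split() for doc in corpus]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def mine_hard_negatives(query, corpus, positive_indices, n=5):
    """Indices of the n best-scoring documents that are not known positives."""
    scores = bm25_scores(query, corpus)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i not in positive_indices and scores[i] > 0][:n]


# Toy corpus: document 0 is treated as the true match for the query.
toy_corpus = [
    "election results shake politics",
    "politics debate tonight on tax",
    "football match ends in draw",
    "new phone released",
]
toy_hard = mine_hard_negatives("politics election", toy_corpus, positive_indices={0})
```

Documents with a zero score are dropped because they share no terms with the query and would only be easy negatives.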


## 11. Math Behind InfoNCE Loss

The InfoNCE loss is defined as:

$$\mathcal{L}_{i} = - \log \frac{\exp(\text{sim}(x_i, x_i^+)/\tau)}{\sum_{j=0}^{N} \exp(\text{sim}(x_i, x_j)/\tau)}$$

Where:
- $x_i$ is the anchor
- $x_i^+$ is the positive sample
- $\tau$ is the temperature hyperparameter
- $\text{sim}$ is the cosine similarity function

See `docs/loss_explanation.md` for a detailed explanation.
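
To make the formula concrete, here is a minimal sketch of the loss for a single anchor. It assumes the sum in the denominator runs over a candidate list that contains the positive together with the negatives, and it uses an illustrative default of `temperature=0.07`, which is an assumption rather than a value from this project. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def info_nce(anchor, candidates, positive_idx, temperature=0.07):
    """-log( exp(sim(a, c+)/tau) / sum_j exp(sim(a, c_j)/tau) ) for one anchor."""
    logits = [cosine(anchor, c) / temperature for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - logits[positive_idx]


anchor = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0]]  # index 0 is the positive
loss_warm = info_nce(anchor, candidates, positive_idx=0, temperature=1.0)
loss_cold = info_nce(anchor, candidates, positive_idx=0, temperature=0.1)
```

In this toy case the positive is already the most similar candidate, so lowering the temperature sharpens the softmax and the loss shrinks toward zero.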
