This repository contains implementations of various text generation algorithms including baseline methods, Sequential Importance Sampling (SIS), Sequential Monte Carlo (SMC), and Twisted Sequential Monte Carlo (TSMC).
AML/assignment_3_final/
├── README.md                    # This documentation file
├── PA3_Problem_Statement.pdf    # Assignment problem statement
├── .gitignore                   # Git ignore rules
├── .gitattributes               # Git LFS configuration
├── api.py                       # Optimized trigram reward calculator
├── utils.py                     # Shared utilities (I/O, seeding)
├── eval.py                      # Comprehensive evaluation suite
├── task0.py                     # Task 0: Baseline decoding methods CLI
├── task1.py                     # Task 1: Sequential Importance Sampling CLI
├── task2.py                     # Task 2: Sequential Monte Carlo CLI
├── task3.py                     # Task 3: Twisted Sequential Monte Carlo CLI
├── generate_task0.py            # Core baseline implementations
├── generate_task1_is.py         # SIS algorithm implementation
├── generate_task2_smc.py        # SMC algorithm implementation
├── generate_task3_tsmc.py       # TSMC algorithm implementation
├── plot_histogram.py            # Plotting utilities for analysis
├── tinystories_ngrams.tar.gz    # Pre-computed n-gram data archive (235MB)
├── data/
│   └── test_prompts.jsonl       # Test prompts for evaluation (20 prompts)
└── __pycache__/                 # Python bytecode cache (auto-generated)
The pre-computed n-gram data is required for Tasks 1, 2, and 3. The data is included in this repository as tinystories_ngrams.tar.gz.
Extract the included data:
tar -xzf tinystories_ngrams.tar.gz

The extracted directory should contain:
- trigram_probs.pkl - Pre-computed trigram probabilities (561MB)
- trigram_counts.json - Raw trigram counts (392MB)
- bigram_counts.json - Bigram counts (30MB)
- unigram_counts.json - Unigram counts (417KB)
- vocab.json - Vocabulary mappings (460KB)
- totals.json - Count statistics (369B)
If you have n-gram data in a different location, update the --counts-dir parameter in the commands below to point to your directory containing trigram_probs.pkl.
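For orientation, loading the extracted files looks roughly like the sketch below. The exact contents of trigram_probs.pkl are an assumption based on the file names; the repository's api.py is the authoritative consumer of this data.

```python
import json
import pickle

counts_dir = "tinystories_ngrams"

# trigram_probs.pkl is a pickled Python object; its internal layout
# (e.g., a dict keyed by token pairs) is an assumption here.
with open(f"{counts_dir}/trigram_probs.pkl", "rb") as f:
    trigram_probs = pickle.load(f)

# vocab.json holds the vocabulary mappings (mapping direction assumed).
with open(f"{counts_dir}/vocab.json") as f:
    vocab = json.load(f)

print(type(trigram_probs), len(vocab))
```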
You must first get access to the Llama model weights from HuggingFace:
- Visit: https://huggingface.co/meta-llama/
- Request access to meta-llama/Meta-Llama-3-8B-Instruct
- Follow the approval process (may take some time)
After getting model access, create an authentication token:
- Visit: https://huggingface.co/docs/hub/en/security-tokens
- Go to your HuggingFace settings → Access Tokens
- Create a new token with "Read" permissions
- Save this token securely - you'll need it for all commands
# Create a new conda environment
conda create -n cs791_a3 python=3.10
# Activate the environment
conda activate cs791_a3
# Install PyTorch with CUDA support
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
# Install additional dependencies
pip install transformers

# Set your HuggingFace token (required for all tasks)
export HF_TOKEN="your_huggingface_token_here"

Note: All tasks use the fixed model meta-llama/Meta-Llama-3-8B-Instruct; there is no need to specify it in commands.
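For reference, a token created this way is typically passed to transformers as below. This is a sketch of the standard pattern, not necessarily how the repository's scripts wire it internally.

```python
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
token = os.environ["HF_TOKEN"]  # set via `export HF_TOKEN=...` above

# Both the tokenizer and the gated model weights require the token.
tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
model = AutoModelForCausalLM.from_pretrained(model_id, token=token)
```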
Task 0 implements three baseline decoding methods: greedy decoding, temperature sampling, and top-k sampling; a minimal sketch of all three appears after the parameter list below.
python task0.py \
--hf-token $HF_TOKEN \
--method greedy \
--A 10 \
--B 5 \
--out data/outputs_task0_greedy.jsonl

python task0.py \
--hf-token $HF_TOKEN \
--method temperature \
--tau 1.0 \
--A 10 \
--B 5 \
--out data/outputs_task0_temperature.jsonl

python task0.py \
--hf-token $HF_TOKEN \
--method topk \
--k 10 \
--A 10 \
--B 5 \
--out data/outputs_task0_topk.jsonl

Parameters:
- --method: Choose from greedy, temperature, topk
- --A: Number of prompts to process
- --B: Number of samples per prompt
- --tau: Temperature for temperature sampling (default: 1.0)
- --k: Top-k parameter for top-k sampling (default: 10)
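A minimal sketch of the three decoding rules, operating on a next-token logits vector. The function names are illustrative, not the repository's actual API (see generate_task0.py for the real implementations).

```python
import torch

def greedy(logits: torch.Tensor) -> int:
    # Pick the single most likely next token.
    return int(torch.argmax(logits))

def temperature_sample(logits: torch.Tensor, tau: float = 1.0) -> int:
    # Rescale logits by 1/tau, then sample; tau < 1 sharpens the
    # distribution, tau > 1 flattens it.
    probs = torch.softmax(logits / tau, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

def topk_sample(logits: torch.Tensor, k: int = 10) -> int:
    # Keep only the k highest-scoring tokens, renormalize, then sample.
    vals, idx = torch.topk(logits, k)
    probs = torch.softmax(vals, dim=-1)
    return int(idx[torch.multinomial(probs, num_samples=1)])
```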
python task1.py \
--hf-token $HF_TOKEN \
--counts-dir tinystories_ngrams \
--A 10 \
--B 32 \
--beta 5.0 \
--k 10 \
--out data/outputs_task1_IS.jsonl

Parameters:
- --counts-dir: Path to n-gram data directory
- --beta: Reward scaling factor β in exp(β·R(x)) (default: 5.0); see the weight sketch below
- --k: Top-k parameter for proposal sampling (default: 10)
- --epsilon: Smoothing parameter for unseen trigrams (default: 1e-9)
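For intuition: SIS draws B continuations from a proposal q (here the top-k-restricted model) and corrects toward the reward-tilted target p(x)·exp(β·R(x)) with importance weights. A minimal sketch of the weight computation under that assumed target; the function name and array inputs are illustrative.

```python
import numpy as np

def sis_weights(logp_model, logq_proposal, rewards, beta=5.0):
    # Unnormalized log-weights for target ∝ p(x) * exp(beta * R(x)),
    # sampled from proposal q: log w = log p - log q + beta * R.
    logw = (np.asarray(logp_model) - np.asarray(logq_proposal)
            + beta * np.asarray(rewards))
    # Subtract the max before exponentiating for numerical stability.
    logw -= logw.max()
    w = np.exp(logw)
    return w / w.sum()  # normalized importance weights
```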
python task2.py \
--hf-token $HF_TOKEN \
--counts-dir tinystories_ngrams \
--A 10 \
--B 32 \
--beta 5.0 \
--k 10 \
--out data/outputs_task2_SMC.jsonl

Parameters:
- --counts-dir: Path to n-gram data directory
- --beta: Terminal reward scaling factor (default: 5.0)
- --k: Top-k parameter for proposal distribution (default: 10)
- --B: Number of particles per prompt; particles are resampled as sketched below
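What separates SMC from plain SIS is resampling during generation: low-weight particles are dropped and high-weight ones duplicated. A minimal multinomial-resampling sketch follows; whether this implementation resamples at every step or only when the effective sample size falls below a threshold is an assumption.

```python
import numpy as np

def effective_sample_size(weights):
    # ESS = 1 / sum(w_i^2) for normalized weights: B means perfectly
    # balanced, values near 1 mean the mass sits on a single particle.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def resample(particles, weights, rng=None):
    # Multinomial resampling: draw B indices with probability proportional
    # to the weights, then reset all weights to uniform.
    rng = rng or np.random.default_rng()
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return [particles[i] for i in idx], np.full(len(particles), 1.0 / len(particles))
```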
python task3.py \
--hf-token $HF_TOKEN \
--counts-dir tinystories_ngrams \
--A 10 \
--B 32 \
--beta 5.0 \
--k 10 \
--out data/outputs_task3_TSMC.jsonl

Common parameters (all tasks):
- --hf-token: HuggingFace authentication token (required)
- --device: CUDA device (default: "cuda:0")
- --test-file: Path to test prompts (default: "data/test_prompts.jsonl")
- --A: Number of prompts to process (required)
- --B: Number of samples/particles per prompt (required)
- --seed: Random seed for reproducibility (default: 123)
- --counts-dir: Directory with trigram_probs.pkl (required for tasks 1-3)
- --epsilon: Smoothing parameter (default: 1e-9)

Note: All tasks use the fixed model meta-llama/Meta-Llama-3-8B-Instruct.
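Task 3's distinguishing ingredient is a twist function ψ_t that scores partial sequences by estimated future reward, so particles are steered while they are being generated rather than only at the end. Below is a sketch of the standard twisted-SMC incremental weight in log space; how this repository defines ψ (presumably from the trigram reward) is an assumption.

```python
def twisted_log_weight_update(logp_tok: float, logq_tok: float,
                              log_psi_new: float, log_psi_old: float) -> float:
    # Standard twisted-SMC incremental weight in log space:
    #   log w_t = log p(x_t | x_<t) - log q(x_t | x_<t)
    #             + log psi_t(x_<=t) - log psi_{t-1}(x_<t)
    # Each step credits a particle for how much its new token improved
    # the twist's estimate of the eventual reward.
    return logp_tok - logq_tok + log_psi_new - log_psi_old
```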
All tasks generate JSONL files with the following structure:
{
"prompt_id": 1,
"prefix": "The wind whispered through old ruins",
"continuations": [{
"method": "Greedy",
"samples": [
{"text": "and told ancient stories", "weight": 1.0}
],
"normalized_weights": [1.0]
}]
}

# 1. Extract data
tar -xzf tinystories_ngrams.tar.gz
# 2. Run baseline
python task0.py --hf-token $HF_TOKEN --method greedy --A 5 --B 3
# 3. Run SIS
python task1.py --hf-token $HF_TOKEN --counts-dir tinystories_ngrams --A 5 --B 16
# 4. Run SMC
python task2.py --hf-token $HF_TOKEN --counts-dir tinystories_ngrams --A 5 --B 16
# 5. Run TSMC
python task3.py --hf-token $HF_TOKEN --counts-dir tinystories_ngrams --A 5 --B 16
# 6. Evaluate results
python eval.py --inputs data/outputs_task*.jsonl --counts-dir tinystories_ngrams --model meta-llama/Meta-Llama-3-8B-Instruct --hf-token $HF_TOKEN
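To inspect any of the generated files programmatically, here is a short sketch assuming one JSON object per line with the structure shown above (the file name is just an example).

```python
import json

with open("data/outputs_task1_IS.jsonl") as f:
    for line in f:
        record = json.loads(line)
        for cont in record["continuations"]:
            # samples and normalized_weights are parallel lists.
            for sample, w in zip(cont["samples"], cont["normalized_weights"]):
                print(record["prompt_id"], cont["method"], round(w, 4), sample["text"])
```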