![](https://i.imgur.com/fQY5xFv.png)


# Workshop Overview 🎯  

### From NLP Breakthroughs to Physics Research 🚀
- By "NLP-like" we mean taking the breakthroughs that ML achieved in NLP and bringing them to physics.
- The best place to start is symbolic mathematics. 

#### Expressions are collections of **terms** (sentences) made up of mathematical **objects** (words), put together with specific **rules** (syntax) that **encode** information (meaning).
---
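To make the analogy concrete, here is a minimal sketch that splits an expression into terms ("sentences") and each term into objects ("words"). The example expression and the helper names `split_terms` and `split_objects` are hypothetical and chosen only for illustration. The regular expressions handle simple polynomial-like strings and would need extending for signed exponents or nested parentheses.

```python
import re

# Hypothetical toy expression; any simple polynomial-like string works.
expression = "3*x**2 - 2*x*y + 7"


def split_terms(expr):
    """Split an expression into signed terms (the 'sentences')."""
    compact = expr.replace(" ", "")
    # Each term is an optional sign followed by everything up to the next + or -.
    return re.findall(r"[+-]?[^+-]+", compact)


def split_objects(term):
    """Split one term into its mathematical objects (the 'words')."""
    # Order matters: try the power operator '**' before the single '*'.
    return re.findall(r"\*\*|[+\-*/()]|\d+|[A-Za-z_]\w*", term)


terms = split_terms(expression)
objects = [split_objects(term) for term in terms]

for term, words in zip(terms, objects):
    print(f"{term!r} -> {words}")
```

The "syntax" lives in the order of the objects: shuffling the words of a term generally changes or destroys its meaning, just as in a sentence.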

## 🛠️ Part 1: Quickly From Idea to Model
- Introducing the **Hugging Face** ecosystem
- Rapidly prototype & train state-of-the-art models  
- Going from **0 ➡️ 100** with minimal effort

---

## 🎓 Part 2: Tackling a Realistic Physics Problem
- How we approached a practical, interesting scenario
- Dataset creation & sampling 🗃️
- Tokenization & data preparation 📐
- Evaluation & validation of model performance 📈
- Computational resources: What and Where? 💻

---

In [None]:
# Let's import some useful things first. 

import torch
import random
import numpy as np
from datasets import Dataset
# The transformers library is part of the Hugging Face ecosystem and provides a wide range of pre-trained models and tokenizers.
from transformers import (
    BartTokenizer, 
    BartForConditionalGeneration, 
    BartConfig,
    Seq2SeqTrainer, 
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq
)
import wandb


# Select the best available device. Those of you with Apple Silicon (M-series) MacBooks can run this locally via the "mps" backend.

device = torch.device("cuda" if torch.cuda.is_available() else 
                      "mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")


# Initialize wandb, which is a tool for tracking experiments and visualizing results.
#wandb.init(project="addition-bart", name="addition-model")

# 🤗 What is Hugging Face?

---

### 🌐 **Community**
- Large, collaborative community of ML researchers and developers
- Share, discuss, and learn from others’ work 

### 📚 **Repository**
- Easy access to thousands of pretrained ML models and datasets, as well as "empty" (untrained) architectures and the tools to train them
- Open-source sharing and reproducibility of state-of-the-art models

### 🛠️ **Library**
- Powerful Python libraries designed to quickly use, train, and fine-tune models
- Simplifies experimentation and accelerates research workflows

---

➡️ **All-in-one ML toolbox:**  
Easy to use. Easy to contribute. Easy to innovate.

![](https://i.imgur.com/WU23myI.png)

# ➕ Addition Task with BART Sequence-to-Sequence Model 🤖

---

## 🎯 **Goal**
- Train a model to perform integer addition  
*(e.g., "5 + 7 = ?" → "12")*

## 🧠 **Architecture**
- **BART**: Sequence-to-sequence transformer  
- Combines:
  - Bidirectional encoder (**BERT-like**)
  - Autoregressive decoder (**GPT-like**)

## ⚙️ **Implementation Steps**
- ✅ Configure a small BART model from scratch with Hugging Face
- ✅ Generate a custom dataset of addition examples
- ✅ Train the model from scratch specifically on this numeric addition task

---
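The encoder/decoder split described above can be sketched as control flow in plain Python. This is not a neural network: `encode` and `decode_step` are hypothetical, hard-coded stand-ins that only show who sees what. The encoder reads the whole input at once, while the decoder emits one token at a time, conditioned on the encoder output and on everything it has generated so far.

```python
BOS, EOS = "<s>", "</s>"  # start and end markers, named after BART's special tokens


def encode(input_tokens):
    """Stand-in for the bidirectional encoder: it sees the full input at once.

    A real encoder returns hidden states; this toy version just works out the answer.
    """
    a, b = [int(tok) for tok in input_tokens if tok.isdigit()]
    return {"answer_digits": list(str(a + b))}


def decode_step(memory, generated):
    """Stand-in for the autoregressive decoder: predict the next token from the
    encoder output (memory) and the tokens generated so far."""
    position = len(generated) - 1  # answer tokens already emitted (BOS excluded)
    digits = memory["answer_digits"]
    return digits[position] if position < len(digits) else EOS


def greedy_generate(input_tokens, max_length=8):
    """Generate token by token until EOS or max_length is reached."""
    memory = encode(input_tokens)
    generated = [BOS]
    while len(generated) < max_length:
        next_token = decode_step(memory, generated)
        generated.append(next_token)
        if next_token == EOS:
            break
    return generated


output = greedy_generate(["5", "+", "7", "=", "?"])
print(output)
```

In the trained model, `decode_step` is replaced by a forward pass that scores every vocabulary entry, and the loop is what `model.generate` runs for us.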

✨ **Why BART?**  
- Ideal structure for sequence-to-sequence tasks
- Popular for working with symbolic mathematics

In [None]:

# Initialize tokenizer 
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

# Define model configuration
config = BartConfig(
    vocab_size=len(tokenizer),  
    d_model=128,
    encoder_layers=4,
    decoder_layers=4,
    encoder_attention_heads=4,
    decoder_attention_heads=4,
    decoder_ffn_dim=512,
    encoder_ffn_dim=512,
    max_position_embeddings=32,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
    decoder_start_token_id=tokenizer.bos_token_id
)

# Initialize model from scratch with our config
model = BartForConditionalGeneration(config)
print(f"Model parameters: {model.num_parameters():,}")


# 📊 Dataset Generation for Addition Task ➕

---

## 🎯 **Goal**
Generate a dataset of integer addition examples.

### 🧩 **Procedure**
- 🎲 **Sample integers** randomly from a predefined range.
- 🧮 **Compute the sum** for each sampled pair.
- 📝 **Format examples clearly** (e.g., `"23 + 15 = ?" → "38"`).
- 📂 **Split data** into training and validation datasets.

---
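One caveat about the procedure above: if pairs are sampled with replacement, the same pair can land in both the training and the validation split. The sketch below is a possible variation, not the approach used in this notebook, that samples unique pairs so the two splits are disjoint by construction. The function name `generate_disjoint_split` and its defaults are assumptions made for illustration.

```python
import random


def generate_disjoint_split(min_num=1, max_num=100, num_examples=5000,
                            train_ratio=0.8, seed=42):
    """Sample unique (num1, num2) pairs and split them into disjoint train/test lists."""
    rng = random.Random(seed)
    all_pairs = [(a, b)
                 for a in range(min_num, max_num + 1)
                 for b in range(min_num, max_num + 1)]
    # random.sample draws without replacement, so no pair appears twice.
    pairs = rng.sample(all_pairs, num_examples)

    def to_example(a, b):
        return {"input_text": f"{a} + {b} = ?", "target_text": str(a + b)}

    cut = int(num_examples * train_ratio)
    train = [to_example(a, b) for a, b in pairs[:cut]]
    test = [to_example(a, b) for a, b in pairs[cut:]]
    return train, test


train, test = generate_disjoint_split()
train_inputs = {ex["input_text"] for ex in train}
test_inputs = {ex["input_text"] for ex in test}
print(len(train), len(test), len(train_inputs & test_inputs))
```

With disjoint splits, validation accuracy measures generalization to unseen pairs rather than memorization of repeated ones.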


In [None]:

# Generate dataset directly with Hugging Face structures
def generate_dataset(num_examples=5000, min_num=1, max_num=100, train_ratio=0.8):
    examples = []
    for _ in range(num_examples):
        num1 = random.randint(min_num, max_num)
        num2 = random.randint(min_num, max_num)
        examples.append({
            "input_text": f"{num1} + {num2} = ?",
            "target_text": str(num1 + num2)
        })
    
    # Create and split datasets
    dataset = Dataset.from_dict({
        "input_text": [ex["input_text"] for ex in examples],
        "target_text": [ex["target_text"] for ex in examples]
    })
    return dataset.train_test_split(test_size=1-train_ratio, seed=42)

# Helper used later to check whether a pair is out-of-distribution (OOD), i.e. absent from both splits

def check_number_pair_in_dataset(dataset_dict, num1, num2):
    # Generate the input text format
    input_text = f"{num1} + {num2} = ?"
    
    # Check in the training set
    for item in dataset_dict['train']:
        if item['input_text'] == input_text:
            return True
        
    # Check in the evaluation set
    for item in dataset_dict['test']:
        if item['input_text'] == input_text:
            return True
        
    return False

# Create the dataset once, so the membership check and the training run use the same data
dataset_dict = generate_dataset()

# Example usage of the membership check
num1, num2 = 4, 5
is_in_dataset = check_number_pair_in_dataset(dataset_dict, num1, num2)
print(f"Is '{num1} + {num2} = ?' in the dataset? {'Yes' if is_in_dataset else 'No'}")

# Verify data
print("\nSample Data:")
for i in range(3):
    print(f"Sample {i+1}: Input: '{dataset_dict['train']['input_text'][i]}', Target: '{dataset_dict['train']['target_text'][i]}'")


# 🔢 Data Tokenization for Addition ➕

---

## 🎯 **What is Tokenization?**
Converting input data into numerical representations that models can process.

## 🛠️ **Why is it Important?**
- Neural networks eat numbers, not expressions.
- The choice of tokenization determines which patterns the model can pick up easily.

### ✨ **Examples (`73 + 4 = 77`)**
- ## **Number-level**:  `"73 + 4 = 77"` → `[73, '+', 4, '=', 77]`
- ## **Digit-by-digit**:  `"73 + 4 = 77"` → `[7, 3, '+', 4, '=', 7, 7]`

---
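The two schemes above can be sketched in a few lines of plain Python. This is only an illustration of the idea, assuming inputs made of digits and the symbols `+`, `=`, `?`; the function names and the tiny vocabulary are hypothetical and are not what the BART tokenizer does.

```python
import re


def tokenize_number_level(text):
    """Keep every integer as a single token."""
    return re.findall(r"\d+|[+=?]", text)


def tokenize_digit_level(text):
    """Split every integer into its individual digits."""
    return re.findall(r"\d|[+=?]", text)


# A digit-level vocabulary is tiny: ten digits plus three symbols.
digit_vocab = {tok: idx for idx, tok in enumerate(list("0123456789") + ["+", "=", "?"])}


def encode_digit_level(text):
    """Map digit-level tokens to integer ids, which is what the model consumes."""
    return [digit_vocab[tok] for tok in tokenize_digit_level(text)]


example = "73 + 4 = 77"
number_tokens = tokenize_number_level(example)
digit_tokens = tokenize_digit_level(example)
digit_ids = encode_digit_level(example)

print(number_tokens)
print(digit_tokens)
print(digit_ids)
```

The trade-off: number-level tokens give short sequences but need one vocabulary entry per integer, while digit-level tokens keep the vocabulary fixed at the cost of longer sequences.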

## 🚀 **NOW: BART Tokenizer**
This is the tokenizer the BART developers used when training the model on natural language. It splits text into subwords. Why use it here? It's good enough.

In [None]:
# Define preprocessing function - fixed tokenization approach
def preprocess_function(examples):
    # Tokenize inputs
    model_inputs = tokenizer(
        examples["input_text"],
        padding="max_length",
        truncation=True,
        max_length=16,
        return_tensors=None  # Return python lists
    )
    
    # Tokenize targets
    labels = tokenizer(
        text_target=examples["target_text"],
        padding="max_length",
        truncation=True,
        max_length=8,
        return_tensors=None  # Return python lists
    )
    
    # Replace padding token id with -100 in labels (transformers convention)
    for i in range(len(labels["input_ids"])):
        labels["input_ids"][i] = [
            -100 if token == tokenizer.pad_token_id else token
            for token in labels["input_ids"][i]
        ]
    
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Preprocess datasets
tokenized_datasets = dataset_dict.map(
    preprocess_function, 
    batched=True, 
    remove_columns=["input_text", "target_text"]
)

# Show example tokenization of an input with the actual input and tokenized input
print("\nTokenization Example:")
example_input = dataset_dict['train'][0]['input_text']
example_tokenized = tokenizer(example_input)
print(f"Input: {example_input}")
print(f"Tokenized: {example_tokenized}")
# Decode
print(f"Decoded: {tokenizer.decode(example_tokenized['input_ids'])}")


# NOW WE TRAIN! 

### Data Collator
The Data Collator is responsible for batching our processed examples together for efficient training. It performs several critical functions:

- **Padding**: Ensures all sequences in a batch have the same length by adding padding tokens
- **Tensor conversion**: Converts data from Python lists to PyTorch tensors
- **Special handling for labels**: Properly masks padded tokens in labels with -100 so they don't contribute to the loss

### Training Arguments
We configure the training process with parameters like:
- Learning rate and optimization settings
- Batch sizes and number of epochs
- Evaluation and checkpointing frequency
- Logging configuration for monitoring training progress

After setting up these components, we'll initialize the trainer and start the training process to teach our model how to add numbers!
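To see what the padding and the `-100` label masking described above achieve, here is a minimal pure-Python sketch. The token ids and log-probabilities are made up for illustration, and `pad_batch`, `mask_labels`, and `masked_mean_nll` are hypothetical helpers, not part of `transformers`. In practice the same effect comes from PyTorch's cross-entropy loss, whose `ignore_index` defaults to `-100`.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips
PAD_ID = 1           # illustrative pad token id


def pad_batch(sequences, pad_id, length):
    """Right-pad every sequence to the same length."""
    return [seq + [pad_id] * (length - len(seq)) for seq in sequences]


def mask_labels(labels, pad_id):
    """Replace padding positions in the labels with IGNORE_INDEX."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in seq] for seq in labels]


def masked_mean_nll(token_log_probs, labels):
    """Average negative log-likelihood over the non-ignored positions only."""
    total, count = 0.0, 0
    for seq_log_probs, seq_labels in zip(token_log_probs, labels):
        for log_prob, label in zip(seq_log_probs, seq_labels):
            if label != IGNORE_INDEX:
                total -= log_prob
                count += 1
    return total / count


# Two made-up target sequences of different lengths.
raw_labels = [[0, 10, 2], [0, 11, 12, 2]]
padded = pad_batch(raw_labels, PAD_ID, length=4)
masked = mask_labels(padded, PAD_ID)

# Pretend the model assigns probability 0.5 to every real token
# and a terrible 0.01 to the padding position.
log_probs = [[math.log(0.5)] * 3 + [math.log(0.01)], [math.log(0.5)] * 4]
loss = masked_mean_nll(log_probs, masked)
print(masked, round(loss, 4))
```

The bad prediction at the padded position leaves the loss untouched, so the model is never rewarded or punished for what it outputs after the answer has ended.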

In [None]:
# Set up data collator (without using deprecated features)
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding="max_length",
    max_length=16,
    pad_to_multiple_of=8
)

# Training arguments with better learning rate
training_args = Seq2SeqTrainingArguments(
    output_dir="./addition_results",
    run_name="addition-bart-fixed",
    num_train_epochs=20,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    learning_rate=5e-4,  
    report_to="none",
    logging_strategy="steps",
    logging_steps=100,
    predict_with_generate=True,
    generation_max_length=8,
    warmup_ratio=0.1,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8
)

# Define trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer
)

# Train the model
trainer.train()
print("Training completed!")

# Save the model
model.save_pretrained("./addition_model")

# Close the wandb run (only has an effect if wandb.init was called above)
wandb.finish()

In [None]:

# Define test function
def test_addition(model, tokenizer, num1, num2):
    input_text = f"{num1} + {num2} = ?"
    inputs = tokenizer(input_text, return_tensors="pt").to(device)
    
    # Move model to device for prediction
    model.to(device)
    model.eval()
    
    with torch.no_grad():
        outputs = model.generate(
            inputs["input_ids"],
            max_length=8,
            num_beams=4,
            early_stopping=True
        )
    
    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return prediction.strip()

# Reload the saved (final) model for testing
model = BartForConditionalGeneration.from_pretrained("./addition_model")
model.to(device)

# Random test cases, plus the classic 1 + 1 = 2
test_cases = [(random.randint(1, 100), random.randint(1, 100)) for _ in range(10)]
test_cases.append((1, 1))

print("\nTesting the model:")

correct = 0
for num1, num2 in test_cases:
    expected = str(num1 + num2)
    predicted = test_addition(model, tokenizer, num1, num2)
    print(f"{num1} + {num2} = {predicted} (Expected: {expected})")
    # Report whether this pair appeared in the train or test split
    if check_number_pair_in_dataset(dataset_dict, num1, num2):
        print("In dataset")
    else:
        print("Not in dataset")
    if predicted == expected:
        correct += 1

print(f"\nAccuracy: {correct/len(test_cases):.2%}")


# **Beyond a toy problem: Generating Lagrangians**

## With this philosophy in mind, we thought a good general task to tackle would be:

![](https://i.imgur.com/Tb701DF.png)

## But as a starting point towards a general Lagrangian generator, we decided to tackle what we know best:

![](https://i.imgur.com/dGfUOPB.png)

# That's a big jump. What did we have to think about to successfully apply an NLP-like approach to particle physics Lagrangians?


![](https://i.imgur.com/X4NW1q0.png)

# link: [https://bit.ly/4ctADR6](https://bit.ly/4ctADR6)
