<h1>How to do NLP-like research in physics</h1>


This notebook provides a step-by-step demonstration/tutorial based on the Lagrangian paper.

# Acknowledge SUPR

The computations and data handling were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) from projects ????, partially funded by the Swedish Research Council through grant agreement no. 2022-06725.

# Introduction
A short flash-talk style introduction to the Lagrangian paper to ensure we are on the same page regarding the example.

Link to slides: $\texttt{www.something.com}$

# Libraries

In [None]:
import torch


# Models
- Overview of HuggingFace library.
- How to find off-the-shelf transformer models (e.g., BART-L).
- Example usage of a HuggingFace model.

## HuggingFace Library

In [None]:
# Import HuggingFace libraries
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load a pre-trained model and tokenizer (e.g., BART-Large)
model_name = 'facebook/bart-large'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example usage
text = "This is a sample input."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [None]:
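from transformers import BartForConditionalGeneration, PreTrainedTokenizerFast

# Load the BART-Lagrangian checkpoint under separate names so that the generic
# `model`/`tokenizer` pair loaded above stays matched for the later cells.
lagrangian_name = "JoseEliel/BART-Lagrangian"
lagrangian_model = BartForConditionalGeneration.from_pretrained(lagrangian_name)
hf_tokenizer = PreTrainedTokenizerFast.from_pretrained(lagrangian_name)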


# Dataset
- Discussion on data generation considerations:
  - Data distribution.
  - Tokenization choices.
- Example of tokenizing a dataset.

## Data Distribution


Show plots from the paper:
- one from random sampling -> more even distribution, better at long expressions
- one from smart sampling  -> more biased distribution (covers edge terms), better at special cases
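
One way to compare two sampling strategies before training is to histogram a simple property of the generated examples. The sketch below is not the paper's generator: `sample_uniform`, `sample_biased`, the length range 1 to 8, and the edge-case weights are all hypothetical stand-ins, and "expression length" is assumed to mean the number of terms.

```python
import random
from collections import Counter

MAX_TERMS = 8  # assumed upper bound on expression length, for illustration only


def sample_uniform(rng):
    """'Random' strategy: every length from 1 to MAX_TERMS is equally likely."""
    return rng.randint(1, MAX_TERMS)


def sample_biased(rng):
    """'Smart' strategy (assumed form): oversample the shortest and longest lengths."""
    lengths = list(range(1, MAX_TERMS + 1))
    weights = [4 if n in (1, MAX_TERMS) else 1 for n in lengths]
    return rng.choices(lengths, weights=weights, k=1)[0]


def length_histogram(sampler, num_samples=10_000, seed=0):
    """Count how often each length appears under a given sampler."""
    rng = random.Random(seed)
    return Counter(sampler(rng) for _ in range(num_samples))


uniform_hist = length_histogram(sample_uniform)
biased_hist = length_histogram(sample_biased)
for n in range(1, MAX_TERMS + 1):
    print(f"{n} terms: uniform={uniform_hist[n]:5d}  biased={biased_hist[n]:5d}")
```

Plotting the two histograms side by side makes it easy to check which regions of the data space the model will see often and which only rarely.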

## Tokenization choices
Considerations: 
- What information is required for your model to learn?
- Do you care about expressivity? 

Practical 
- How much 
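
To make the trade-off concrete, the sketch below compares two possible tokenizations of the `07+25=` input format used in the addition example later in this notebook. Both tokenizers (`char_tokenize`, `number_tokenize`) and the helper `vocab_size` are hypothetical illustrations, not what the HuggingFace tokenizers do internally: character-level tokens keep the vocabulary tiny but lengthen the sequence, while number-level tokens shorten the sequence at the cost of a larger vocabulary.

```python
import re


def char_tokenize(expr):
    """Character-level: every digit and symbol is its own token."""
    return list(expr)


def number_tokenize(expr):
    """Number-level: each run of digits becomes a single token."""
    return re.findall(r"\d+|\D", expr)


def vocab_size(tokenize, max_operand=99):
    """Distinct tokens needed to cover all zero-padded two-digit inputs."""
    vocab = set()
    for a in range(max_operand + 1):
        vocab.update(tokenize(f"{a:02d}+{a:02d}="))
    return len(vocab)


expr = "07+25="
print(char_tokenize(expr))    # ['0', '7', '+', '2', '5', '=']
print(number_tokenize(expr))  # ['07', '+', '25', '=']
print(vocab_size(char_tokenize), vocab_size(number_tokenize))  # 12 102
```

With character tokens the model has to learn place value itself; with number tokens each of the 100 operands is an unrelated symbol unless the embeddings learn otherwise.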

## Example: Tokenizing a dataset


In [None]:
# Example: Tokenizing a dataset
dataset = ["Example sentence 1.", "Example sentence 2."]
tokenized_dataset = [tokenizer(sentence, return_tensors="pt") for sentence in dataset]
print(tokenized_dataset)
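
Tokenizing sentence by sentence produces sequences of different lengths, so they cannot be stacked into one batch without padding. The sketch below shows, in plain Python, what padding, the attention mask, and label masking amount to. The token ids are made up and `PAD_ID = 1` is an assumption; in practice the value should be read from `tokenizer.pad_token_id`. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
PAD_ID = 1           # assumed pad token id; use tokenizer.pad_token_id in practice
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


def mask_labels(label_ids, pad_id=PAD_ID):
    """Replace padding positions in the labels so they do not contribute to the loss."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in seq] for seq in label_ids]


batch = [[0, 713, 2], [0, 713, 16, 10, 2]]  # made-up token ids
input_ids, attention_mask = pad_batch(batch)
print(input_ids)               # [[0, 713, 2, 1, 1], [0, 713, 16, 10, 2]]
print(attention_mask)          # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
print(mask_labels(input_ids))  # [[0, 713, 2, -100, -100], [0, 713, 16, 10, 2]]
```

Passing `padding=True` to a HuggingFace tokenizer called on a list of sentences performs the padding and mask construction in one step.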

# Training
- Mention available resources: SUPR/NAISS -> Alvis.
- Example of training a model.

## Mention available resources: SUPR/NAISS -> Alvis.
How to access Alvis


## CPU or GPU

In [None]:
# Do you have a GPU?

# Set the device to GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move the model to the device
model.to(device)
# Example usage: the inputs must live on the same device as the model
text = "This is a sample input."
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


In [None]:
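# Example: Training a model (minimal sketch)
# Define the optimizer and a simple training loop.
# The AdamW class in transformers is deprecated; use the PyTorch implementation.
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()  # enable dropout and other training-time behaviour
for epoch in range(3):
    for batch in tokenized_dataset:
        # Move the batch to the same device as the model
        batch = {k: v.to(model.device) for k, v in batch.items()}
        # Toy objective: reconstruct the input (labels are the input ids)
        outputs = model(**batch, labels=batch["input_ids"])
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()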

# Evaluation
- Generating output from the model.
- Discussion on evaluation choices:
  - Existing or novel metrics.
  - Embedding analysis.
  - Out-of-distribution tests.

In [None]:
# Example: Generating output
model.eval()  # disable dropout for inference
test_text = "This is a test input."
# Keep the inputs on the same device as the model
test_inputs = tokenizer(test_text, return_tensors="pt").to(model.device)
test_outputs = model.generate(**test_inputs)
print(tokenizer.decode(test_outputs[0], skip_special_tokens=True))

## Existing Metric: Does it work?

These metrics mainly check that things work as expected:
- Loss: deviation of the prediction from the actual term.
- Accuracy: what fraction of predictions is perfect?
- New metric, Score: order does not always matter (XEN).
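
The difference between a strict accuracy and an order-insensitive score can be sketched as follows. This is not the Score defined in the paper: `order_invariant_score` is a hypothetical stand-in that compares the predicted and target terms as multisets, and the term names are placeholders.

```python
from collections import Counter


def exact_match_accuracy(predictions, targets):
    """Fraction of predictions identical to their target, term by term and in order."""
    matches = sum(p == t for p, t in zip(predictions, targets))
    return matches / len(targets)


def order_invariant_score(predicted_terms, target_terms):
    """Multiset overlap divided by the length of the longer term list."""
    if not predicted_terms and not target_terms:
        return 1.0
    overlap = Counter(predicted_terms) & Counter(target_terms)
    return sum(overlap.values()) / max(len(predicted_terms), len(target_terms))


target = ["term_a", "term_b", "term_c"]
reordered = ["term_b", "term_a", "term_c"]  # same terms, different order
incomplete = ["term_a", "term_b"]           # one term missing

print(exact_match_accuracy([reordered], [target]))  # 0.0
print(order_invariant_score(reordered, target))     # 1.0
print(order_invariant_score(incomplete, target))
```

A reordered but otherwise correct prediction counts as a failure under exact match and as a full success under the order-insensitive score, which is why the choice of metric should follow from what "correct" means in the physics problem.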

## Embedding analysis: What has it really learned?

Considerations:
- Is efficiency the only thing you need?
- Or is it important for you to know whether the model knows what it is learning?

Practical questions:
- Can it associate inputs to some embedding space?
- Can it understand relations between inputs?
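
A common first probe of these questions is to compare embeddings with cosine similarity and look at nearest neighbours. The sketch below uses hypothetical three-dimensional vectors in `toy_embeddings`; in practice the vectors would be extracted from the model, for example by pooling encoder hidden states, which is an assumption about the workflow rather than the paper's procedure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(query, embeddings, k=2):
    """Rank all other inputs by cosine similarity to `query` and keep the top k."""
    scores = {
        name: cosine_similarity(embeddings[query], vec)
        for name, vec in embeddings.items()
        if name != query
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical embeddings: input_A and input_B are meant to be "related" inputs.
toy_embeddings = {
    "input_A": [1.0, 0.1, 0.0],
    "input_B": [0.9, 0.2, 0.1],
    "input_C": [0.0, 1.0, 0.9],
}
print(nearest_neighbours("input_A", toy_embeddings, k=1))  # ['input_B']
```

If inputs that are physically related end up as nearest neighbours, that is evidence that the model has organised its embedding space around the relevant structure.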

## OOD Generalization: Can it go beyond what it was trained on?

Considerations:
- Is your problem's "data space" very big?
- Is the probability of an unseen case high?
- If yes, then the chances of encountering OOD data are high.
- Do you want to think about the next architecture?

Practical questions:
- Can it work with never-seen scenarios? What is your OOD?
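
Answering "What is your OOD?" usually starts with an explicit split. The sketch below builds one for the two-digit addition format used in the next cell: operands 0 to 49 are treated as in-distribution and operands 50 to 99 are held out entirely. The split point and the helper names `make_pairs` and `format_example` are assumptions made for illustration.

```python
import random


def make_pairs(operands, num_samples, rng):
    """Draw (a, b) operand pairs from the given range."""
    return [(rng.choice(operands), rng.choice(operands)) for _ in range(num_samples)]


def format_example(a, b):
    """Format a pair as (input, target) strings, e.g. ('07+25=', '032')."""
    return f"{a:02d}+{b:02d}=", f"{a + b:03d}"


rng = random.Random(0)
in_dist_operands = range(0, 50)  # seen during training (assumed split)
ood_operands = range(50, 100)    # never seen during training

train_pairs = make_pairs(in_dist_operands, 400, rng)
ood_pairs = make_pairs(ood_operands, 100, rng)

# Sanity check: no OOD pair may leak into the training set.
assert not set(train_pairs) & set(ood_pairs)
print(format_example(*train_pairs[0]), format_example(*ood_pairs[0]))
```

With this split every training sum is at most 98, while every OOD sum is at least 100, so the OOD test also probes whether the model can produce a leading digit it has never had to output.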

In [None]:
#!/usr/bin/env python3
"""
A simplified example of training a small BART model to perform addition,
using HuggingFace components and with proper device support (CPU/CUDA/MPS).
"""

import torch
import random
import numpy as np
from torch.utils.data import Dataset
from transformers import (
    BartTokenizer,
    BartConfig,
    BartForConditionalGeneration,
    Trainer,
    TrainingArguments,
    DataCollatorForSeq2Seq,
)

# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

# Select appropriate device (CPU, CUDA, or MPS)
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")

# Load the pretrained tokenizer
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

class AdditionDataset(Dataset):
    """Randomly generated two-digit additions, e.g. "07+25=" -> "032"."""

    def __init__(self, num_samples, tokenizer):
        self.tokenizer = tokenizer  # keep a reference instead of relying on the global
        self.examples = []
        for _ in range(num_samples):
            a = random.randint(0, 99)
            b = random.randint(0, 99)
            inp = f"{a:02d}+{b:02d}="
            target = f"{a + b:03d}"
            self.examples.append((inp, target))

    def __getitem__(self, idx):
        inp, target = self.examples[idx]
        # Tokenize inputs
        model_inputs = self.tokenizer(
            inp,
            max_length=6,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )

        # Tokenize targets using the text_target argument (adding special tokens)
        labels = self.tokenizer(
            text_target=target,
            max_length=4,  # Allows for digit tokens plus an EOS token
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )["input_ids"].squeeze(0)

        # Replace pad token ids with -100 (ignore in loss computation)
        labels[labels == self.tokenizer.pad_token_id] = -100

        # Squeeze the input tensor dimensions (from [1, ...] to [...])
        model_inputs = {k: v.squeeze(0) for k, v in model_inputs.items()}
        model_inputs["labels"] = labels
        return model_inputs
    
    def __len__(self):
        return len(self.examples)

# Create training and validation datasets
train_dataset = AdditionDataset(num_samples=400, tokenizer=tokenizer)
val_dataset = AdditionDataset(num_samples=100, tokenizer=tokenizer)

# Define a small BART configuration with the proper special token settings.
config = BartConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=32,
    encoder_layers=2,
    decoder_layers=2,
    encoder_attention_heads=2,
    decoder_attention_heads=2,
    encoder_ffn_dim=64,
    decoder_ffn_dim=64,
    d_model=32,
    activation_function="gelu",
    dropout=0.1,
    bos_token_id=tokenizer.bos_token_id,  # beginning-of-sequence token
    eos_token_id=tokenizer.eos_token_id,  # end-of-sequence token
    pad_token_id=tokenizer.pad_token_id,
)
# IMPORTANT: set decoder_start_token_id so the generation is primed correctly.
config.decoder_start_token_id = tokenizer.bos_token_id

# Initialize the model and move it to the selected device
model = BartForConditionalGeneration(config)
model.to(device)

# With 400 samples and a per-device batch size of 64, one epoch is
# ceil(400 / 64) = 7 optimizer steps on a single device.
# Evaluate every 500 steps, i.e. roughly every 71 epochs.
eval_interval = 500

# Set up training arguments. Using evaluation_strategy="steps" with eval_steps=500
# will print the evaluation table every 500 steps.
# Note: recent transformers releases name this argument eval_strategy.
training_args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="steps",
    eval_steps=eval_interval,
    num_train_epochs=1000,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    logging_steps=10,
    save_strategy="epoch",
    save_total_limit=2,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

# Train the model
print("Starting training...")
trainer.train()
trainer.evaluate()  # This will do a final evaluation after training


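# NOTE: test_addition was not defined in this cell as exported. The helper below is a
# minimal reconstruction (an assumption) that matches the format of the printed output.
def test_addition(a, b):
    """Generate a + b with the trained model and compare with the true sum."""
    inp = f"{a:02d}+{b:02d}="
    expected = f"{a + b:03d}"
    inputs = tokenizer(inp, return_tensors="pt").to(model.device)
    model.eval()
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=4)
    predicted = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
    print(f"Testing: {inp}")
    print(f"Predicted: '{predicted}', Actual: '{expected}', Correct: {predicted == expected}")

# Run tests
print("\nRunning tests...")
test_addition(7, 25)    # Expected: 032
test_addition(45, 55)   # Expected: 100
test_addition(99, 1)    # Expected: 100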

Using device: mps


Starting training...


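| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 8.9523 | 8.984289 |
| 1000 | 7.3127 | 7.476482 |
| 1500 | 5.84   | 6.211458 |
| 2000 | 4.6507 | 5.179939 |
| 2500 | 3.8077 | 4.451714 |
| 3000 | 3.2619 | 4.003561 |
| 3500 | 2.947  | 3.640288 |
| 4000 | 2.7248 | 3.361407 |
| 4500 | 2.5578 | 3.166435 |
| 5000 | 2.4421 | 3.026706 |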



Running tests...
Testing: 07+25=
Predicted: '', Actual: '032', Correct: False
Testing: 45+55=
Predicted: '', Actual: '100', Correct: False
Testing: 99+01=
Predicted: '', Actual: '100', Correct: False
