<h1>How to do NLP-like research in physics</h1>


This notebook provides a step-by-step demonstration/tutorial based on the Lagrangian paper.

# Acknowledge SUPR

The computations and data handling were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) from projects ????, partially funded by the Swedish Research Council through grant agreement no. 2022-06725.

# Introduction
A short flash-talk style introduction to the Lagrangian paper to ensure we are on the same page regarding the example.

Link to slides: $\texttt{www.something.com}$

# Libraries

In [1]:
import torch


# Models
- Overview of HuggingFace library.
- How to find off-the-shelf transformer models (e.g., BART-L).
- Example usage of a HuggingFace model.

## HuggingFace Library

In [None]:
# Import HuggingFace libraries
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load a pre-trained model and tokenizer (e.g., BART-Large)
model_name = 'facebook/bart-large'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example usage
text = "This is a sample input."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


In [None]:
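from transformers import BartForConditionalGeneration, PreTrainedTokenizerFast

# Load the BART-Lagrangian checkpoint under its own name so that the
# facebook/bart-large `model` and `tokenizer` used in later cells stay paired
model_name = "JoseEliel/BART-Lagrangian"
lagrangian_model = BartForConditionalGeneration.from_pretrained(model_name)
hf_tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)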


# Dataset
- Discussion on data generation considerations:
  - Data distribution.
  - Tokenization choices.
- Example of tokenizing a dataset.

## Data Distribution


Show plots from the paper:
- Random sampling -> more uniform distribution; better at long expressions.
- Smart sampling -> more biased distribution (covers edge terms); better at special cases.
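
To compare how two sampling strategies distribute examples, one option is to histogram the number of terms per expression. The sketch below is hypothetical: the `+`-separated toy expressions and the helper `length_histogram` are assumptions for illustration, not the paper's data format.

```python
from collections import Counter

def length_histogram(expressions, sep="+"):
    """Count how many expressions have each number of terms."""
    return Counter(len(expr.split(sep)) for expr in expressions)

# Toy, made-up datasets standing in for two sampling strategies
uniform_like = ["a", "a+b", "a+b+c", "a+b+c+d"]
biased_like = ["a", "b", "c", "a+b+c+d"]

print(length_histogram(uniform_like))
print(length_histogram(biased_like))
```

A flat histogram suggests every expression length is seen equally often, while a peaked one shows where the dataset concentrates its examples.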

## Tokenization choices
Considerations:
- What information is required for your model to learn?
- Do you care about expressivity?

Practical:
- How much
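
One way to make the trade-off concrete is to tokenize the same expression at two granularities and compare sequence length and vocabulary size. The expression string and both splitting rules below are assumptions chosen for illustration, not the tokenization used in the paper.

```python
import re

expr = "phi ^ 2 + psi ^ 2 + phi * psi"  # made-up toy expression

# Option 1: one token per whitespace-separated symbol
symbol_tokens = expr.split()

# Option 2: one token per non-space character
char_tokens = re.findall(r"\S", expr)

for name, tokens in [("symbol", symbol_tokens), ("char", char_tokens)]:
    print(name, "sequence length:", len(tokens), "vocab size:", len(set(tokens)))
```

Symbol-level tokens give the shorter sequence here; which granularity is preferable depends on what information the model needs to learn.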

## Example: Tokenizing a dataset


In [None]:
# Example: Tokenizing a dataset
dataset = ["Example sentence 1.", "Example sentence 2."]
tokenized_dataset = [tokenizer(sentence, return_tensors="pt") for sentence in dataset]
print(tokenized_dataset)
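
Sequences of different lengths have to be padded before they can be stacked into one batch. The sketch below shows the idea on plain lists of made-up token ids; `pad_batch` and the pad id `0` are hypothetical and only stand in for what a tokenizer's padding option does.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 8, 2], [7, 2]])  # made-up token ids
print(ids)
print(mask)
```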

# Training
- Mention available resources: SUPR/NAISS -> Alvis.
- Example of training a model.

## Available resources: SUPR/NAISS -> Alvis
How to access Alvis


## CPU or GPU

In [None]:
# Do you have a GPU?

# Set the device to GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move the model to the device
model.to(device)
# Example usage: the inputs must live on the same device as the model
text = "This is a sample input."
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


In [None]:
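# Example: Training a model (minimal sketch)
# Define the optimizer and a basic training loop
from torch.optim import AdamW  # the AdamW class in transformers is deprecated

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()  # enable dropout and other training-time behaviour
for epoch in range(3):
    for batch in tokenized_dataset:
        batch = batch.to(model.device)  # keep inputs on the same device as the model
        # Using the inputs as labels is a placeholder; real training needs input/target pairs
        outputs = model(**batch, labels=batch["input_ids"])
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()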

# Evaluation
- Generating output from the model.
- Discussion on evaluation choices:
  - Existing or novel metrics.
  - Embedding analysis.
  - Out-of-distribution tests.

In [None]:
# Example: Generating output
test_text = "This is a test input."
# Move the inputs to the model's device (matters after model.to(device))
test_inputs = tokenizer(test_text, return_tensors="pt").to(model.device)
test_outputs = model.generate(**test_inputs)
print(tokenizer.decode(test_outputs[0], skip_special_tokens=True))

## Existing Metric: Does it work?

These metrics mainly check that things work as expected:
- Loss: deviation from the actual term.
- Accuracy: how much of the output is perfect?
- New metric, Score: order does not always matter (XEN).
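
The difference between exact-match accuracy and an order-insensitive score can be sketched on toy strings. The functions `exact_match` and `term_overlap_score`, and the `+`-separated format, are hypothetical illustrations; they are not the Score metric defined in the paper.

```python
from collections import Counter

def exact_match(pred, target):
    """1.0 only if the prediction equals the target string exactly."""
    return float(pred == target)

def term_overlap_score(pred, target, sep="+"):
    """Fraction of target terms recovered, ignoring term order."""
    pred_terms = Counter(t.strip() for t in pred.split(sep))
    target_terms = Counter(t.strip() for t in target.split(sep))
    # Multiset intersection counts each target term at most as often as predicted
    matched = sum((pred_terms & target_terms).values())
    return matched / sum(target_terms.values())

target = "a + b + c"
print(exact_match("b + a + c", target))         # reordered terms fail exact match
print(term_overlap_score("b + a + c", target))  # but all terms are recovered
print(term_overlap_score("a + b", target))      # one term missing
```

This toy score does not penalize extra predicted terms; a real metric would need to decide how to treat them.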

## Embedding analysis: What has it really learned?

Considerations:
- Is efficiency the only thing you need?
- Or is it important for you to know whether the model knows what it is learning?

Practical Questions:
- Can it associate inputs to some embedding space?
- Can it understand relations between inputs?
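
A common first check is whether related inputs end up close together in embedding space, for example by cosine similarity. The vectors below are made-up numbers, not outputs of any model, and `cosine_similarity` is a helper defined here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings" of three inputs
emb = {"x": [1.0, 0.0, 0.0], "y": [0.9, 0.1, 0.0], "z": [0.0, 0.0, 1.0]}

print(cosine_similarity(emb["x"], emb["y"]))  # close to 1: similar direction
print(cosine_similarity(emb["x"], emb["z"]))  # 0: orthogonal directions
```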

## OOD Generalization: Can it go beyond what it was trained on?

Considerations:
- Is your problem's "data space" very big?
- Is the probability of an unseen case high?
- If yes, then the chances of encountering OOD cases are high.
- Do you want to think about the next architecture?

Practical Questions:
- Can it work with never-seen scenarios? What is your OOD?
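
One simple way to define an OOD test is to hold out every example beyond some property threshold, such as the number of terms. The sketch assumes toy `+`-separated expressions and a hypothetical helper `split_by_length`; what counts as OOD for a real problem is a modelling choice.

```python
def split_by_length(expressions, max_train_terms, sep="+"):
    """Train on expressions with few terms; hold out longer ones as OOD."""
    train, ood = [], []
    for expr in expressions:
        n_terms = len(expr.split(sep))
        (train if n_terms <= max_train_terms else ood).append(expr)
    return train, ood

data = ["a", "a+b", "a+b+c", "a+b+c+d"]  # made-up toy data
train_set, ood_set = split_by_length(data, max_train_terms=2)
print(train_set)
print(ood_set)
```

Evaluating separately on `train_set`-like and `ood_set`-like data then shows how much performance drops on scenarios the model has never seen.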