# <br> Up your ML game 
![title](images/levelup.jpg)
## PyHEP  
### Liv Våge 28.10.2025 

# Why this talk?

The ML ecosystem changes **fast**. 

E.g. by the time you finally understand the intricacies of one package it falls out of favour (_tensorflow_ 👀)

**Goal:** Help you pick the right tools, increase efficiency and smooth out pain points 

**Disclaimer:** If you're well versed in ML, there might not be a lot of new material here. And these are just my biased opinions -
please make a PR with edits if you find mistakes or have something to add! 

---

## What we'll cover:
1. **ML Frameworks** - PyTorch, JAX, Keras, etc. (which one and why?)
2. **Workflow Tools** - W&B, MLflow, Optuna (making it trackable and reproducable)
3. **Training & Deployment** - hls4ml, ONNX, HTCondor (get off your laptop)
4. **HEP-ML Bridge** - uproot, awkward, hist (one of the major pain points)
5. **Industry Tools** - What industry does better (and what we can steal)
6. **Fun Shortcuts** - LLMs, Hugging Face, and other "cheats"


# 1. Common ML Frameworks



## Which framework should I use?

**Short answer:** PyTorch (probably)

**Long answer:** Depends on the use case 

### Quick Framework Comparison

| Framework | Best For | Pros | Cons |
|-----------|----------|------|------|
| **PyTorch** | Research, flexibility, HEP | Pythonic, great debugging, huge community | Verbose, more boilerplate |
| **PyTorch Lightning** | Production, clean code | Organized, less boilerplate, built-in best practices | Another abstraction to learn |
| **JAX** | Speed demons, researchers | FAST, functional programming, auto-vectorization | NumPy 2.x conflicts, functional paradigm learning curve |
| **Keras** | Beginners, quick prototypes | Super simple API, fast to start | Less flexibility, slower development |
| **Scikit-learn** | Classical ML, baselines | Easy, stable, great docs | Not for deep learning |
| **XGBoost** | Tabular data, structured features | Fast, interpretable, great for HEP kinematics | NumPy 2.x compatibility issues, not for complex deep learning |
| **Tensorflow** | Legacy code | You might find legacy code examples in tensorflow | It's falling out of favour, would avoid if possible|

_Note that Keras is actually an API that lets you call jax, tensorflow and pytorch!_ 

- Just starting? → **Keras** or **Scikit-learn**
- Need a quick baseline on tabular data? → **XGBoost** or **Scikit-learn**
- Working with HEP kinematic features? → **XGBoost** (often best!)
- Doing research/custom architectures? → **PyTorch**
- Want cleaner code? → **PyTorch Lightning**
- Need maximum speed? → **JAX**
- Working in a team? → **PyTorch** or **PyTorch Lightning**
- You can also write custom Cuda code if you really like to suffer


### 1.1 - XGBoost: The Gradient Boosting Powerhouse

**What it is:** Extreme Gradient Boosting - tree-based ensemble method

**Why it matters for HEP:**
- Handles tabular data exceptionally well (which HEP has lots of!)
- Often outperforms neural networks on structured data
- Interpretable (feature importance, SHAP values)
- Fast training and inference
- Great baseline before trying deep learning

**When to use:**
- Tabular data with many features
- Need quick, interpretable results
- Want feature importance
- Limited training data available


In [1]:
# Quick demo: Same simple neural network in different frameworks
import numpy as np

# Generate some fake HEP-like data (4 kinematic features)
np.random.seed(42)
X_train = np.random.randn(1000, 4).astype(np.float32)
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(np.float32)
X_test = np.random.randn(200, 4).astype(np.float32)
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(np.float32)

print(f"Training data: {X_train.shape}, Labels: {y_train.shape}")


Training data: (1000, 4), Labels: (1000,)


In [10]:
# XGBoost example

import xgboost as xgb
from sklearn.metrics import accuracy_score

# Train XGBoost
clf_xgb = xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss', verbosity=0)
clf_xgb.fit(X_train, y_train)

# Evaluate
y_pred_xgb = clf_xgb.predict(X_test)
acc_xgb = accuracy_score(y_test, y_pred_xgb)
print(f"XGBoost Accuracy: {acc_xgb:.3f}")

# Show feature importance
print("Feature Importance:")
for i, imp in enumerate(clf_xgb.feature_importances_):
    print(f"  Feature {i}: {imp:.3f}")

XGBoost Accuracy: 0.980
Feature Importance:
  Feature 0: 0.509
  Feature 1: 0.462
  Feature 2: 0.015
  Feature 3: 0.015


## 1.2 - Scikit learn

Great for a range of ML models and quick benchmarking 

In [17]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import time

# Match the same hyperparameters as other neural networks
start_sklearn = time.time()
clf = MLPClassifier(hidden_layer_sizes=(4, 16, 1), max_iter=20, random_state=42,
                    learning_rate='constant', learning_rate_init=0.001,
                    solver='adam', activation='tanh', batch_size=32)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
time_sklearn = time.time() - start_sklearn
print(f"✅ Scikit-learn: 3 lines and done. Accuracy: {acc:.3f}")



✅ Scikit-learn: 3 lines and done. Accuracy: 0.995




A very neat detail is that in most libraries (xgboost, sklearn, pytorch, keras) follow the same general pattern of 
```
model = ...
model.fit(data)
model.predict(data)
``` 

## 1.3 - Neural networks 

#### Option 1: PyTorch (the verbose way)


In [18]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# Define PyTorch model class
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Standard architecture: 4 -> 16 -> 1
        self.layers = nn.Sequential(
            nn.Linear(4, 16),
            nn.Tanh(),
            nn.Linear(16, 1),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        return self.layers(x)

print("✅ PyTorch model defined")


✅ PyTorch model defined


In [19]:
import time
from sklearn.metrics import accuracy_score

# PyTorch training with exact same hyperparameters
start_pytorch = time.time()
torch.manual_seed(42)
model_torch = SimpleNN()
criterion = nn.BCELoss()
optimizer = optim.Adam(model_torch.parameters(), lr=0.001)

X_train_t = torch.from_numpy(X_train).float()
y_train_t = torch.from_numpy(y_train).float().unsqueeze(1)

train_dataset = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=False)

model_torch.train()
for epoch in range(20):
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = model_torch(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()

model_torch.eval()
with torch.no_grad():
    predictions = (model_torch(torch.from_numpy(X_test).float()).numpy().flatten() > 0.5).astype(int)
    acc_pytorch = accuracy_score(y_test, predictions)

time_pytorch = time.time() - start_pytorch
print(f"✅ PyTorch: Full control. Accuracy: {acc_pytorch:.3f}")



✅ PyTorch: Full control. Accuracy: 0.980


#### Option 2: PyTorch Lightning (the clean and quick code way)


Pytorch lightning relies on inheritance so you don't have to write boilerplate code. It's great for development, but is generally disfavoured in production - you only need inference and other libraries handle that better. 

In [20]:
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        torch.manual_seed(42)  # Match other frameworks
        self.layers = nn.Sequential(
            nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1), nn.Sigmoid()
        )
    
    def forward(self, x):
        return self.layers(x)
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = nn.BCELoss()(y_hat, y)
        self.log('train_loss', loss, on_step=False, on_epoch=True)
        return loss
    
    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=0.001)

print("✅ PyTorch Lightning model class defined")


✅ PyTorch Lightning model class defined


In [21]:
from sklearn.metrics import accuracy_score
import pytorch_lightning as pl
import time

# PyTorch Lightning training
start_pl = time.time()
torch.manual_seed(42)
lit_model = LitModel()
train_dataset = TensorDataset(X_train_t, y_train_t)
train_loader_pl = DataLoader(train_dataset, batch_size=32, shuffle=False)
trainer = pl.Trainer(max_epochs=20, log_every_n_steps=1000, enable_progress_bar=False, enable_model_summary=False, logger=False, enable_checkpointing=False, fast_dev_run=False)
trainer.fit(lit_model, train_loader_pl)

# Faster evaluation - use the underlying PyTorch model directly
lit_model.eval()
X_test_t = torch.from_numpy(X_test).float()
with torch.no_grad():
    outputs = lit_model(X_test_t)
    predictions_pl = (outputs.squeeze().numpy() > 0.5).astype(int)
    acc_pl = accuracy_score(y_test, predictions_pl)

time_pl = time.time() - start_pl
print(f"✅ PyTorch Lightning: Clean code. Accuracy: {acc_pl:.3f}")


GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/liv/pyhep-talk/venv-pyhep/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:433: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
`Trainer.fit` stopped: `max_epochs=20` reached.


✅ PyTorch Lightning: Clean code. Accuracy: 0.980


#### Option 3: Keras (the simple API way)


In [22]:
# Keras: Super simple API
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.metrics import accuracy_score
import time
# Set random seed for reproducibility
start_keras = time.time()
tf.random.set_seed(42)
# Define and compile model - matches PyTorch architecture
model_keras = keras.Sequential([
    layers.Dense(16, activation='tanh', input_shape=(4,)),
    layers.Dense(1, activation='sigmoid')
])
model_keras.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                    loss='binary_crossentropy')
# Train
model_keras.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)
# Evaluate
y_pred_keras = (model_keras.predict(X_test, verbose=0) > 0.5).astype(int).flatten()
acc_keras = accuracy_score(y_test, y_pred_keras)
time_keras = time.time() - start_keras
print(f"✅ Keras: 4 lines of code. Accuracy: {acc_keras:.3f}")


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


✅ Keras: 4 lines of code. Accuracy: 0.985


#### Option 4: JAX (the functional way)

In [8]:
# JAX: Functional programming approach
import jax
import jax.numpy as jnp
from jax import random
import optax
from sklearn.metrics import accuracy_score
import numpy as np
import time
# Set seed for reproducibility
start_jax = time.time()
key = random.PRNGKey(42)
np.random.seed(42)
# Initialize parameters - same architecture as others
layer_sizes = [4, 16, 1]
keys = random.split(key, len(layer_sizes))
params = []
for m, n, k in zip(layer_sizes[:-1], layer_sizes[1:], keys):
    w_key, b_key = random.split(k)
    w = random.normal(w_key, (m, n)) * 0.1
    b = random.normal(b_key, (n,)) * 0.1
    params.append((w, b))
# Forward pass
def forward(params, x):
    for w, b in params[:-1]:
        x = jnp.tanh(x @ w + b)
    w, b = params[-1]
    return jax.nn.sigmoid(x @ w + b).squeeze()
# Loss function
def loss_fn(params, x, y):
    pred = forward(params, x)
    return optax.sigmoid_binary_cross_entropy(pred, y).mean()
# Training
optimizer_jax = optax.adam(learning_rate=0.001)
opt_state = optimizer_jax.init(params)
X_train_jax = jnp.array(X_train)
y_train_jax = jnp.array(y_train)
X_test_jax = jnp.array(X_test)
y_test_jax = jnp.array(y_test)
batch_size = 32
n_batches = len(X_train) // batch_size
for epoch in range(20):
    for batch_idx in range(n_batches):
        start_idx = batch_idx * batch_size
        end_idx = start_idx + batch_size
        X_batch = X_train_jax[start_idx:end_idx]
        y_batch = y_train_jax[start_idx:end_idx]
        loss, grads = jax.value_and_grad(loss_fn)(params, X_batch, y_batch)
        updates, opt_state = optimizer_jax.update(grads, opt_state)
        params = optax.apply_updates(params, updates)
# Evaluate
test_pred = forward(params, X_test_jax) > 0.5
acc_jax = accuracy_score(y_test, np.array(test_pred))
time_jax = time.time() - start_jax
print(f"✅ JAX: Functional and fast. Accuracy: {acc_jax:.3f}")


✅ JAX: Functional and fast. Accuracy: 0.960


In [23]:
# Comparison: Timing and Accuracy

import pandas as pd

print("="*60)
print("FRAMEWORK COMPARISON")
print("="*60)
print("All neural networks use identical setup:")

print("- Architecture: 4→16→1 (tanh activation)")
print("- Optimizer: Adam (lr=0.001)")
print("- Training: 20 epochs, batch_size=32")
print("- Random seed: 42")
print("="*60)

# Compare the results from individual training cells above
print()
print("Results:")
print("-" * 60)
print(f"Scikit-learn:  Accuracy: {acc:.3f},  Time: {time_sklearn:.3f}s")
print(f"PyTorch:       Accuracy: {acc_pytorch:.3f},  Time: {time_pytorch:.3f}s")
print(f"PyTorch L.:    Accuracy: {acc_pl:.3f},  Time: {time_pl:.3f}s")
print(f"Keras:         Accuracy: {acc_keras:.3f},  Time: {time_keras:.3f}s")
print(f"JAX:           Accuracy: {acc_jax:.3f},  Time: {time_jax:.3f}s")
print("-" * 60)

print()
print("✅ All frameworks achieved similar accuracy (~0.98)")
print("💡 PyTorch Lightning is slower due to framework overhead")
print("💡 Pick based on ease of use, not performance differences!")

FRAMEWORK COMPARISON
All neural networks use identical setup:
- Architecture: 4→16→1 (tanh activation)
- Optimizer: Adam (lr=0.001)
- Training: 20 epochs, batch_size=32
- Random seed: 42

Results:
------------------------------------------------------------
Scikit-learn:  Accuracy: 0.995,  Time: 0.118s
PyTorch:       Accuracy: 0.980,  Time: 0.254s
PyTorch L.:    Accuracy: 0.980,  Time: 2.026s
Keras:         Accuracy: 0.985,  Time: 1.433s
JAX:           Accuracy: 0.960,  Time: 8.122s
------------------------------------------------------------

✅ All frameworks achieved similar accuracy (~0.98)
💡 PyTorch Lightning is slower due to framework overhead
💡 Pick based on ease of use, not performance differences!


## Takeaway:

All neural networks above use **identical setup**: 
- Architecture: 4→16→1 (tanh activation)
- Optimizer: Adam (lr=0.001)
- Training: 20 epochs
- Batch size: 32
- Same random seed (42)

The comparison shows:
- **Accuracy should are very similar** across neural network frameworks (differences reflect implementation details and numerical precision)
- **Speed varies** due to different optimizations and backend implementations - also because this is a very small example 

**Bottom line:** Choose your framework based on ease of use and ecosystem, not tiny performance differences.



 # 2 - ML workflow tools 

Moving beyond jupyter notebooks and into configuration file centred and reproducible ML 

### 2.1 Logging experiments with Weights & Biases (W&B) 

**What it does:**
- Automatic logging of metrics, hyperparameters, system info
- Beautiful dashboards
- Experiment comparison
- Model versioning
- Artifact tracking
- **Free for academics!**

**When to use:** Any serious project. Very easy to set up


In [38]:
# W&B Quick Start - Experiment Monitoring

try:
    import wandb
    wandb.login(key="your-key-here")  # Would need account
    
    # Initialize tracking
    wandb.init(
        project="pyhep-demo", 
        name="neural-network-comparison",
        config={
            "learning_rate": 0.001,
            "epochs": 20,
            "batch_size": 32,
            "architecture": "Scale→16→1"
        }
    )
    
    # Simulate training with fake metrics
    for epoch in range(5):
        fake_loss = 1.0 / (epoch + 2) + 0.1
        fake_acc = 0.8 + epoch * 0.04
        wandb.log({
            "loss": fake_loss,
            "accuracy": fake_acc,
            "epoch": epoch
        })
    
    wandb.finish()
    print("🎨 W&B: Beautiful dashboards for experiment tracking")
except Exception as e:
    print("Note: W&B requires account setup (wandb.ai)")
    print("🎨 W&B: Beautiful dashboards for experiment tracking")
    print("💡 Pro tip: wandb.watch(model) tracks gradients automatically!")
    print("💡 Free for academics!")



Note: W&B requires account setup (wandb.ai)
🎨 W&B: Beautiful dashboards for experiment tracking
💡 Pro tip: wandb.watch(model) tracks gradients automatically!
💡 Free for academics!


### 2.2 MLflow - The Open Source Alternative

**Pros:**
- Fully open source
- Self-hosted (for the privacy-conscious)
- Experiment tracking + model registry
- Works with any ML library

**Cons:**
- Less pretty than W&B
- Need to host it yourself
- Smaller community

**When to use:** You need full control, can't/won't use cloud services


### Key Difference: W&B vs MLflow

**W&B (Weights & Biases):**
- **Best at:** Experiment monitoring, visualization, hyperparameter tuning
- Interactive dashboards, automatic logging
- Great for research and experimentation
- Cloud-first (free for academics)

**MLflow:**
-  **Best at:** Model registry, versioning, deployment, MLOps
-  Model storage and retrieval
-  Production deployment support
-  On-premise friendly

**TL;DR:** Use W&B for experiments, MLflow for production models.

### 2.3 Optuna - Hyperparameter Optimization Made Easy

**Stop doing grid search in 2025!**

Optuna uses smart algorithms (TPE, CMA-ES) to find good hyperparameters faster.


In [26]:
# Optuna example - hyperparameter tuning
import optuna

def objective(trial):
    # Suggest hyperparameters
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    
    # For demo, return a mock score based on lr
    # In reality, you'd train a model with this lr and return validation score
    import random
    random.seed(int(lr * 1000))
    accuracy = random.uniform(0.85, 0.99)
    return accuracy

# Create and run study
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=5, show_progress_bar=False)

print(f"Best params: {study.best_params}")
print(f"Best value: {study.best_value:.3f}")

print("🎯 Optuna: Smarter than grid search, easier than manual tuning")
print("💡 Integrates with W&B, PyTorch Lightning, etc.")

[I 2025-10-27 23:49:38,718] A new study created in memory with name: no-name-cb8c72b2-c697-4b61-b262-99327ab5e3d1
[I 2025-10-27 23:49:38,729] Trial 0 finished with value: 0.9682190592135067 and parameters: {'lr': 0.0002924856939622585}. Best is trial 0 with value: 0.9682190592135067.
[I 2025-10-27 23:49:38,732] Trial 1 finished with value: 0.9682190592135067 and parameters: {'lr': 0.00019878565765361588}. Best is trial 0 with value: 0.9682190592135067.
[I 2025-10-27 23:49:38,733] Trial 2 finished with value: 0.9682190592135067 and parameters: {'lr': 2.2484412418492542e-05}. Best is trial 0 with value: 0.9682190592135067.
[I 2025-10-27 23:49:38,744] Trial 3 finished with value: 0.8688109941757362 and parameters: {'lr': 0.0015867747971138769}. Best is trial 0 with value: 0.9682190592135067.
[I 2025-10-27 23:49:38,745] Trial 4 finished with value: 0.9682190592135067 and parameters: {'lr': 2.4973564722352153e-05}. Best is trial 0 with value: 0.9682190592135067.


Best params: {'lr': 0.0002924856939622585}
Best value: 0.968
🎯 Optuna: Smarter than grid search, easier than manual tuning
💡 Integrates with W&B, PyTorch Lightning, etc.


### HEP-Specific ML Workflow Tools

**b-hive** - CERN's ML platform
- 🐝 Built specifically for HEP researchers
- 🔗 Integrates with CERN infrastructure
- 📊 Experiment tracking + job submission
- 🎓 CERN users only (automatic access)

**LAW** - CERN Python library
- 📚 Law workflow automation and analysis
- 🧬 HEP-specific patterns and tools
- 🔧 Task graphs, dependencies, resubmission
- 💪 Production analysis workflows

**hep-ml-templates** - Community project
- 🚀 Starting templates for HEP ML projects
- 🏗️ Best practices and common patterns
- 📦 Quick start for new projects
- 🤝 Community-driven (on GitHub)


### Workflow Tools: Quick Comparison

| Tool | Best For | Setup Difficulty | Cost |
|------|----------|------------------|------|
| **W&B** | Everything | Easy | Free (academic) |
| **MLflow** | On-premise, privacy | Medium | Free (self-host) |
| **Optuna** | Hyperparameter tuning | Easy | Free |
| **b-hive** | CMS users | Easy | Free (CERN) |
| **hep-ml-templates** | Quick project start | Easy | Free |
| **LAW** | CERN workflows | Medium | Free (CERN) |

**Pro tip:** Use W&B + Optuna together. They integrate perfectly!
**For CERN users:** b-hive is built-in, lower setup barrier!


# 3. Model Training & Deployment

## Get off your laptop!

Your MacBook is crying. Let's talk about scaling up.

### HTCondor - The HEP Classic

**What it is:** Distributed computing system used at CERN and beyond

**Pros:**
- Already set up at most HEP institutions
- Handle thousands of jobs
- Free (for you)

**Cons:**
- Not designed for ML (but works!)
- Can be slow to start
- Queue times vary

**When to use:** You're at a HEP institution and need to run many jobs

```bash
# Example HTCondor submit file
# universe = vanilla
# executable = train_model.sh
# arguments = --learning-rate 0.001
# queue 100 # Submit 100 jobs!
```


### SWAN - CERN's Jupyter Hub

**What it is:** Cloud-based Jupyter notebooks at CERN

**Pros:**
- Access to CERN data
- Pre-configured environment
- Spark integration
- GPUs available

**When to use:** You're at CERN and want to prototype quickly

**URL:** https://swan.cern.ch/


### ONNX - Make your model portable

**Problem:** Trained in PyTorch, but production uses TensorFlow (or C++, or...)

**Solution:** ONNX (Open Neural Network Exchange)

**What it does:**
- Convert models between frameworks
- Optimize for inference
- Deploy anywhere (edge devices, web, etc.)


In [None]:
# ONNX Example - Export PyTorch model
import torch.onnx

# Use our trained PyTorch model
try:
    # Create dummy input
    dummy_input = torch.randn(1, 4)
    
    # Export to ONNX
    torch.onnx.export(model_torch, dummy_input, "model.onnx", verbose=False)
    
    # Load and test with ONNX Runtime
    import onnxruntime as ort
    session = ort.InferenceSession("model.onnx")
    
    # Test inference
    test_input = X_test[:1].astype('float32')
    ort_inputs = {session.get_inputs()[0].name: test_input}
    result = session.run(None, ort_inputs)
    
    print("📦 ONNX: Model exported and loaded successfully!")
    print(f"Original prediction: {model_torch(torch.from_numpy(X_test[:1]).float()).item():.3f}")
    print(f"ONNX prediction: {result[0][0][0]:.3f}")
    
    print("\n🎯 ONNX: Train anywhere, deploy everywhere")
    print("🎯 Especially useful for edge deployment and production")
except Exception as e:
    print(f"Note: ONNX export demo (error: {e})")
    print("📦 ONNX: Train anywhere, deploy everywhere")

### hls4ml - ML on FPGAs

**The coolest HEP-specific tool you didn't know you needed**

**Problem:** You need ultra-low latency inference (< 1 microsecond) for triggers

**Solution:** hls4ml converts your neural network to FPGA firmware

**Use cases:**
- LHC trigger systems
- Real-time event selection
- Anything requiring hardware acceleration

```python
# import hls4ml
# config = hls4ml.utils.config_from_keras_model(model, granularity='name')
# hls_model = hls4ml.converters.convert_from_keras_model(
# model, hls_config=config, output_dir='my-hls-test'
# )
# hls_model.compile()
```

**Mind-blowing:** Your Python model → Hardware in < 1 hour


# HEP-ML Bridge Tools

## The "ROOT files aren't going anywhere" section

You can't do ML without data. In HEP, that means ROOT files, weird event structures, and ragged arrays.

**These tools save your sanity:**

### uproot - Read ROOT files without ROOT

**The game changer.**

Before: Install ROOT, fight with Python bindings, cry 
After: `pip install uproot`, read files with pandas-like syntax

**No C++ dependencies. No ROOT installation. Pure Python bliss.**


In [None]:
# uproot example - create mock HEP data
import uproot
import numpy as np
from hist import Hist

# Create a mock ROOT file with HEP-like data
with uproot.recreate("mock_data.root") as file:
    # Simulate jet pt distribution
    np.random.seed(42)
    jet_pt = np.random.exponential(50, size=1000) * np.random.uniform(0.8, 1.2, size=1000)
    jet_eta = np.random.normal(0, 1.5, size=1000)
    jet_phi = np.random.uniform(-np.pi, np.pi, size=1000)
    
    # Create branches
    file["Events"] = {
        "jet_pt": jet_pt,
        "jet_eta": jet_eta,
        "jet_phi": jet_phi,
    }

# Now read it back with uproot
file = uproot.open("mock_data.root")
tree = file["Events"]

# Get branches as arrays
pt = tree["jet_pt"].array()
eta = tree["jet_eta"].array()

# Or as pandas DataFrame
df = tree.arrays(["jet_pt", "jet_eta", "jet_phi"], library="pd")

print("🎉 uproot: Read ROOT-like data successfully!")
print(f"Events: {len(df)}, Columns: {list(df.columns)}")
print(f"\nSample data:\n{df.head()}")

### Awkward Array - Handle Jagged Data

**Problem:** HEP events have variable-length lists (jets, tracks, etc.)

**Standard approach:** Pad everything, waste memory, write ugly code

**Awkward Array:** Numpy for jagged/nested/variable-length data


In [None]:
import awkward as ak

# Events with variable numbers of jets
events = ak.Array([
    {"jets": [{"pt": 50, "eta": 0.1}, {"pt": 30, "eta": -0.5}]},  # 2 jets
    {"jets": [{"pt": 100, "eta": 1.2}]},                           # 1 jet
    {"jets": [{"pt": 40, "eta": 0.3}, {"pt": 35, "eta": 0.8}, {"pt": 25, "eta": -1.0}]}  # 3 jets
])

# Operations work naturally on jagged data!
jet_pts = events.jets.pt
print("Jet pts:", jet_pts)

# Calculate things per event
leading_jet_pt = ak.max(events.jets.pt, axis=1)
print("Leading jet pt per event:", leading_jet_pt)

# Slice like numpy
high_pt_jets = events.jets[events.jets.pt > 35]
print("High-pt jets:", high_pt_jets)

print("\n✨ Awkward: No more padding! No more for-loops!")


### hist - Modern Histogramming

**ROOT's TH1/TH2 are... showing their age.**

`hist` is a modern, Pythonic histogramming library:
- Clean syntax
- Integrates with numpy, awkward
- Beautiful plotting with matplotlib/mplhep
- Type hints, named axes, units


In [None]:
# Modern histogramming with hist
from hist import Hist
import matplotlib.pyplot as plt
import numpy as np

# Create histogram with named axes
np.random.seed(42)
pt_data = np.random.exponential(50, 1000) * np.random.uniform(0.8, 1.2, 1000)

h = Hist.new.Reg(50, 0, 200, name="pt", label="$p_T$ [GeV]").Double()
h.fill(pt=pt_data)

# Plot
fig, ax = plt.subplots(figsize=(8, 6))
h.plot1d(ax=ax)
ax.set_xlabel("Jet $p_T$ [GeV]")
ax.set_ylabel("Events")
ax.set_title("Mock Jet $p_T$ Distribution")
plt.tight_layout()
plt.savefig("jet_pt_hist.png", dpi=100, bbox_inches='tight')
print("📊 hist: Histogram created and saved!")

print("\n💡 Named axes, units, better plotting. Just better.")

### The Complete HEP-ML Pipeline

```python
import uproot
import awkward as ak
import numpy as np
from hist import Hist

# 1. Read ROOT file
with uproot.open("data.root:Events") as tree:
 events = tree.arrays(["jet_*"], library="ak")

# 2. Process with awkward
good_events = events[ak.num(events.jet_pt) >= 2]
leading_jets = good_events.jet_pt[:, 0]

# 3. Make histograms
h = Hist.new.Reg(50, 0, 200, name="pt").Double()
h.fill(leading_jets)

# 4. Convert to ML format
X = ak.to_numpy(ak.pad_none(events.jet_pt, 5, clip=True))
# Now feed to PyTorch/JAX/etc!
```

**The dream: ROOT file → ML model in < 50 lines**


# Industry Tools

## What industry does better (and what we can steal)

HEP is amazing at physics. Industry is amazing at software engineering.

**Let's learn from them:**

### Testing & Linting - Stop Breaking Things

**Industry:** Comprehensive tests, CI/CD, code review, linting 
**HEP:** "It worked on my machine" 

**Tools you should use:**

1. **pytest** - Testing framework
2. **black** - Code formatter (stop arguing about formatting)
3. **ruff** - Fast linter
4. **mypy** - Type checking
5. **pre-commit** - Run checks before committing


In [None]:
# Quick testing example - run simple tests
import pytest

def test_model_output_shape():
    model = SimpleNN()
    x = torch.randn(10, 4)
    output = model(x)
    assert output.shape == (10, 1), "Wrong output shape!"
    return True

def test_model_output_range():
    model = SimpleNN()
    x = torch.randn(10, 4)
    output = model(x)
    assert torch.all(output >= 0) and torch.all(output <= 1), "Sigmoid broken!"
    return True

# Run tests
print("Running tests...")
try:
    test_model_output_shape()
    test_model_output_range()
    print("✅ All tests passed!")
except AssertionError as e:
    print(f"❌ Test failed: {e}")

print("\n💡 Pro tip: Test your preprocessing! That's where most bugs hide.")

### GitHub Actions - Automate Everything

**Stop manually running tests. Let robots do it.**

Example `.github/workflows/test.yml`:
```yaml
name: Tests
on: [push, pull_request]
jobs:
 test:
 runs-on: ubuntu-latest
 steps:
 - uses: actions/checkout@v3
 - uses: actions/setup-python@v4
 with:
 python-version: '3.10'
 - run: pip install -r requirements.txt
 - run: pytest
 - run: ruff check .
```

**Now every commit is automatically tested. Magic!**


### AWS SageMaker - When You Need Industrial Scale

**What it is:** AWS's ML platform (training, deployment, everything)

**Pros:**
- Scales to infinity
- Managed infrastructure
- Production-ready deployment
- AutoML features

**Cons:**
- Costs money (sometimes a lot)
- Learning curve
- Vendor lock-in

**When to use:** 
- You need serious scale
- You have budget
- Production deployment

**HEP alternative:** Usually HTCondor + custom scripts (cheaper, less polished)


### What HEP Can Learn from Industry

| Practice | Industry | HEP | What to do |
|----------|----------|-----|-----------|
| **Testing** | Comprehensive | Sparse | Write pytest tests! |
| **CI/CD** | GitHub Actions | Manual | Add GitHub Actions |
| **Code Review** | Required | Optional | Make PRs mandatory |
| **Documentation** | Detailed | "See code" | Write docstrings |
| **Versioning** | Semantic | Git SHA | Use proper versions |
| **Linting** | Enforced | What's that? | Use ruff/black |

**Bottom line:** Treat your code like a product, not a script.


# Fun Shortcuts & "Cheats"

## Work smarter, not harder

The secret sauce. The shortcuts your supervisor doesn't want you to know about.

### 1. Make LLMs Do Your Work

**Let's be honest:** We're all using ChatGPT/Claude/Copilot

**Good uses:**
- Boilerplate code (data loaders, training loops)
- Documentation and docstrings
- Bug finding
- Code explanation
- Unit test generation

**Bad uses:**
- Novel research code (they hallucinate)
- Critical analysis code (verify everything!)
- Anything you don't understand

**Pro tips:**
- Be specific: "Write a PyTorch data loader for awkward arrays"
- Iterate: Start simple, add complexity
- **Always understand the code it gives you**


In [None]:
# Example: Let's ask an LLM to write a custom loss function
# Prompt: "Write a PyTorch loss function that combines binary cross entropy with a custom regularization term"

# LLM output (cleaned up):
class CustomLoss(nn.Module):
    def __init__(self, reg_weight=0.01):
        super().__init__()
        self.bce = nn.BCELoss()
        self.reg_weight = reg_weight
    
    def forward(self, predictions, targets, model_params):
        bce_loss = self.bce(predictions, targets)
        # L2 regularization
        reg_loss = sum(p.pow(2.0).sum() for p in model_params)
        return bce_loss + self.reg_weight * reg_loss

print("🤖 LLMs: Your 24/7 coding assistant")
print("⚠️  But verify everything! They confidently hallucinate.")


### 2. Steal from Hugging Face

**Hugging Face:** GitHub for ML models

**What's there:**
- 500,000+ pre-trained models
- Datasets
- Code examples
- Entire pipelines

**You can:**
- Fine-tune existing models (faster than training from scratch)
- Use pre-trained embeddings
- Copy architectures
- Download datasets


In [None]:
# Hugging Face example - Use a pre-trained model
from transformers import pipeline

try:
    # Use a pre-trained sentiment analysis model (small and fast)
    classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
    
    # Test it
    result = classifier("The LHC is amazing!")
    print(f"Sentiment: {result[0]['label']} (confidence: {result[0]['score']:.3f})")
    print("\n🤗 Hugging Face: Don't reinvent the wheel, fine-tune it!")
except Exception as e:
    print(f"Note: HF model loading skipped (error: {e})")
    print("🤗 Hugging Face: Search for 'particle physics', 'HEP', 'jet tagging'")

### 3. Quick Prototyping Tricks

**Trick 1: Use `fastai` for rapid prototyping**
- High-level API (even simpler than Keras)
- Best practices built-in
- Great for quick experiments

**Trick 2: `torchinfo` for model debugging**
```python
from torchinfo import summary
summary(model, input_size=(1, 4))
# Instantly see: layers, params, output shapes
```

**Trick 3: `einops` for tensor operations**
```python
from einops import rearrange, reduce
# No more confusing reshapes!
x = rearrange(x, 'b c h w -> b (c h w)')
```

**Trick 4: `timm` for vision models**
- 1000+ pre-trained computer vision models
- `pip install timm`


In [None]:
# Quick tricks demo

# Trick: torchinfo for model summary
from torchinfo import summary

# Show summary of our PyTorch model
print("Model Summary:")
summary(SimpleNN(), input_size=(32, 4))  # batch_size=32, features=4

# Trick: Use repr to see object details
print("\nModel architecture (repr):")
print(SimpleNN())

# Trick: Quick timing
import time
start = time.time()
# ... your code ...
print(f"Took {time.time() - start:.3f}s")

# Better: Use %%time or %%timeit in Jupyter!

print("
💡 Small tricks add up to big time savings!")

### 4. Dataset Shortcuts

**Don't start from scratch:**

1. **Papers with Code** - Find datasets and benchmarks
2. **Kaggle** - Tons of curated datasets
3. **UCI ML Repository** - Classic datasets
4. **HEP Data** - Published HEP datasets
5. **Zenodo** - Open science data

**For HEP specifically:**
- CERN Open Data Portal
- LHC Olympics datasets
- Public collision data

**Pro tip:** Start with a small subset! Debug on 1000 events, not 1M.


### 5. The Ultimate Shortcut List

**Must-bookmark resources:**

 **Learning:**
- fast.ai course (free, excellent)
- PyTorch tutorials (official)
- Kaggle Learn (interactive)
- Papers with Code (implementations)

🛠️ **Tools:**
- GitHub Copilot / Cursor (AI pair programmer)
- Paperswithcode.com (find state-of-the-art)
- Connected Papers (explore research)

💬 **Community:**
- PyHEP working group
- Scikit-HEP GitHub
- ML4Jets workshop materials
- Discord/Slack ML communities

🎓 **HEP-specific:**
- IML (Inter-experimental Machine Learning)
- ML4Jets workshops
- PyHEP workshops
- IRIS-HEP training


# Summary: Your ML Toolkit

## Quick Reference Guide

### For Beginners:
1. **Start here:** Scikit-learn for classical ML, Keras for deep learning
2. **Read data:** uproot for ROOT files
3. **Track experiments:** W&B (free for academics!)
4. **Learn:** fast.ai course, official PyTorch tutorials

### For Intermediate Users:
1. **Framework:** PyTorch or PyTorch Lightning
2. **Data:** uproot + awkward array
3. **Optimization:** Optuna
4. **Deployment:** ONNX
5. **Code quality:** pytest, ruff, GitHub Actions

### For Advanced Users:
1. **Speed:** JAX for compute-intensive tasks
2. **Scale:** HTCondor or cloud (SageMaker)
3. **Hardware:** hls4ml for FPGAs
4. **Tools:** Custom pipelines with all the above

### Universal Tips:
- Use version control (git)
- Write tests (pytest)
- Log experiments (W&B/MLflow)
- Document your code
- Start small, scale up
- Leverage pre-trained models (Hugging Face)
- Use LLMs wisely (verify everything!)


# Final Thoughts

## The ML landscape is vast, but you don't need to know everything

**Key takeaways:**

1. **Pick tools that fit YOUR needs** - Don't use fancy tools just because they're fancy
2. **Start simple, add complexity** - Scikit-learn → PyTorch → JAX
3. **Steal shamelessly** - Use pre-trained models, copy good code, ask LLMs
4. **Automate early** - W&B, GitHub Actions, testing save time in the long run
5. **Bridge HEP ↔ ML** - uproot, awkward, hist make life easier
6. **Learn from industry** - Testing, CI/CD, code quality matter
7. **Community is key** - PyHEP, ML4Jets, IML, Scikit-HEP

---

## Most important:

### **The best tool is the one you'll actually use.**

Perfect code that doesn't exist < Working code that's "good enough"

---

# Questions? 

Resources:
- These slides: [your-repo-link]
- Scikit-HEP: https://scikit-hep.org/
- PyHEP: https://hepsoftwarefoundation.org/workinggroups/pyhep.html
- My contact: [your-contact]

**Now go build something cool!** 
