# PointNet for particle flow

<div class="alert alert-block alert-success">
    
This notebook focuses on wandb Artifacts and how they can be used for data and model versioning.

**Main changes:** 

- introduce wandb artifacts for data versions
- log artifacts to wandb UI
- retrieve artifact from UI as input for processing

</div>

## Problem

This dataset contains a Monte Carlo simulation of $\rho^{\pm} \rightarrow \pi^{\pm} + \pi^0$ decays and the corresponding detector response. Specifically, the data report the measured response of **i) tracker** and **ii) calorimeter**, along with the true physical quantities that generated those measurements.

<div class="alert alert-block alert-info">
This means that we expect one track per event, with mainly two energy blobs (clusters of cells) in the calorimeter.
</div>

The final **goal** is to associate the cell signals observed in the calorimeter to the track that caused those energy deposits.

## Method

The idea is to leverage a **point cloud** data representation to combine tracker and calorimeter information, so as to associate cell hits with the corresponding track. We will use a [**PointNet**](https://openaccess.thecvf.com/content_cvpr_2017/papers/Qi_PointNet_Deep_Learning_CVPR_2017_paper.pdf) model, which can handle this type of data, and frame the problem as **semantic segmentation**. More precisely, this means that:
- we represent each hit in the detector as a point in the point cloud: x, y, z coordinates + additional features ("3+"-dimensional point)
- the **learning task** will be binary classification at hit level: for each cell the model learns whether its energy comes mostly from the track (class 1) or not (class 0)
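
As a rough illustration of this representation, the sketch below builds a toy point cloud and its hit-level labels. All values are invented, and the choice of a single extra feature (cell energy) and of a 0.5 threshold for "mostly from the track" are assumptions made for the example, not the dataset's actual schema.

```python
import numpy as np

# Hypothetical toy sample with 5 calorimeter hits (values are made up).
xyz = np.array([
    [0.1, 0.2, 1.0],
    [0.1, 0.3, 1.0],
    [0.8, 0.9, 1.2],
    [0.2, 0.2, 1.1],
    [0.9, 0.8, 1.2],
])
energy = np.array([2.5, 1.0, 3.0, 0.4, 1.5])  # assumed extra feature per hit

# Assumed truth information: fraction of each cell's energy due to the focal track.
track_fraction = np.array([0.9, 0.7, 0.1, 0.5, 0.0])

# "3+"-dimensional points: x, y, z plus additional features.
points = np.concatenate([xyz, energy[:, None]], axis=1)  # shape (n_hits, 4)

# Binary label per hit: 1 if most of the energy comes from the track, else 0.
labels = (track_fraction > 0.5).astype(np.int64)

print(points.shape)  # (5, 4)
print(labels)        # [1 1 0 0 0]
```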

## Data structure

<div class="alert alert-block alert-info">

This dataset is organized as follows:
 - for each event, we create a **sample** (i.e. point cloud)
 - each sample contains all hits in a cone around a track of the event, called **focal track**
     - the cone includes all hits within some $\Delta R$ distance of the track
     - if an event has multiple tracks, we get one sample per track, i.e. several samples for that event
     - since different samples may contain different numbers of hits, **we pad all point clouds to the same size** (the model requires fixed-size inputs)

</div>
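
The cone selection and padding described above can be sketched as follows. This is only an illustration: it assumes hits and tracks are described by $(\eta, \phi)$ coordinates, uses the usual definition $\Delta R = \sqrt{\Delta\eta^2 + \Delta\phi^2}$, and picks a hypothetical cone size of 0.2. The helper names `delta_r` and `pad_point_cloud` are not part of the notebook's code.

```python
import numpy as np

def delta_r(eta, phi, eta0, phi0):
    """Angular distance between hits (eta, phi) and a track (eta0, phi0)."""
    d_eta = eta - eta0
    # wrap the azimuthal difference into [-pi, pi)
    d_phi = (phi - phi0 + np.pi) % (2 * np.pi) - np.pi
    return np.sqrt(d_eta**2 + d_phi**2)

def pad_point_cloud(points, n_max, pad_value=0.0):
    """Pad a (n_hits, n_feats) array to (n_max, n_feats); also return a validity mask."""
    n_hits, n_feats = points.shape
    if n_hits > n_max:
        raise ValueError(f"sample has {n_hits} hits, more than n_max={n_max}")
    padded = np.full((n_max, n_feats), pad_value, dtype=points.dtype)
    padded[:n_hits] = points
    mask = np.zeros(n_max, dtype=bool)
    mask[:n_hits] = True
    return padded, mask

# Toy hits (made-up values) and a focal track at eta=0, phi=0.
eta = np.array([0.05, 0.30, -0.10, 1.50])
phi = np.array([0.00, 0.10, 0.05, 2.00])
in_cone = delta_r(eta, phi, eta0=0.0, phi0=0.0) < 0.2  # hypothetical cone size

sample = np.stack([eta, phi], axis=1)[in_cone]  # keep only hits inside the cone
padded, mask = pad_point_cloud(sample, n_max=4)
print(padded.shape, mask)
```

Returning a mask alongside the padded array is a design choice for the sketch: it lets downstream code tell real hits from padding, whatever padding value is used.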

## Settings & config

This section collects all configuration variables and training/model hyperparameters. 

The idea is to put it at the top so that it is easy to find and edit.

In [1]:
import sys
import numpy as np
import pandas as pd
from pathlib import Path

import matplotlib.pyplot as plt

# path settings
REPO_BASEPATH = Path.cwd().parent
DATA_PATH = REPO_BASEPATH / "pnet_data/raw/rho_small.npz"
CODE_PATH = REPO_BASEPATH / "src"
sys.path.append(str(CODE_PATH))
MODEL_CHECKPOINTS_PATH = REPO_BASEPATH / "results" / "models" / "pointnet_baseline.weights.h5"

import wandb
from data_viz import *
from model_utils import *

LABELS = ["unfocus hit", "focus hit"]

# set random seed for reproducibility
SEED = 18
set_global_seeds(SEED)

# data settings
N_TRAIN, N_VAL, N_TEST = 210, 65, 50 # roughly 0.65, 0.2, 0.15

2024-11-26 11:04:49.837243: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## wandb Artifacts

Weights & Biases uses `Artifacts` to store objects that we want to track and version. Artifacts are typically inputs or outputs of runs, so they are particularly useful for data and models.

By linking artifacts with runs, it is also possible to track how and when those artifacts were created and when they were used.

In brief, artifacts can be handled with a few useful commands:

```python
import wandb

run = wandb.init(project="artifact_demo")

# create an artifact for a dataset and log it to the run
artifact = wandb.Artifact(name="example_artifact", type="dataset")
artifact.add_file(local_path="./dataset.h5", name="training_dataset")
run.log_artifact(artifact)

# reference the dataset version used for this experiment
artifact = run.use_artifact("example_artifact:latest")  # returns an Artifact object

# actually download the data
datadir = artifact.download()  # downloads the full artifact to the default directory and returns its path

run.finish()
```

<div class="alert alert-block alert-warning">

You do **not need to log the full dataset!** A data hash is also fine: the main objective is to ensure versioning and reproducibility.

</div>
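
One possible way to track a dataset without uploading it is sketched below: compute a SHA-256 hash of the file, store it in the artifact metadata, and add the file as a reference instead of a copy. The helper names, the `raw_data_ref` artifact name and the `sha256` metadata key are hypothetical. `wandb` is imported inside the function so that the hashing helper can be used on its own.

```python
import hashlib
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def log_dataset_by_reference(run, data_path, name="raw_data_ref"):
    """Log a dataset artifact that points to a local file instead of uploading it."""
    import wandb  # lazy import: only needed when actually logging

    data_path = Path(data_path).resolve()
    artifact = wandb.Artifact(
        name=name,
        type="dataset",
        metadata={"sha256": file_sha256(data_path)},  # hypothetical metadata key
    )
    # a file:// reference records the location of the file, not its contents
    artifact.add_reference(data_path.as_uri())
    return run.log_artifact(artifact)
```

Inside a run, this could be called as `log_dataset_by_reference(run, DATA_PATH)`. Anyone re-running the experiment can then recompute the hash of their local copy and compare it with the logged one.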

## Create `Artifact` for raw data

Initially we can simply track raw data and how we split them into training, validation and test datasets. 

As before, we choose 65%, 20%, 15% fractions for training, validation and testing data, respectively.
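
As a quick sanity check, the sample counts set in the configuration (`N_TRAIN`, `N_VAL`, `N_TEST` = 210, 65, 50) can be converted back into fractions. The sketch assumes the dataset contains exactly the 325 events implied by these three numbers.

```python
n_train, n_val, n_test = 210, 65, 50  # values from the settings cell
n_total = n_train + n_val + n_test    # assumed total number of events

fractions = {
    "train": n_train / n_total,
    "val": n_val / n_total,
    "test": n_test / n_total,
}
for name, frac in fractions.items():
    print(f"{name}: {frac:.3f}")
# train: 0.646
# val: 0.200
# test: 0.154
```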

In [2]:
with wandb.init(project="mlops-ai_infn", entity="lclissa", name="dataset-logging",
                job_type="data-creation", config={'DATA_PATH': DATA_PATH,'seed': SEED},
                notes="Playing with Artifacts ...") as run:
    
    # create artifact for raw data
    raw_data_artifact = wandb.Artifact(name="raw_data", type="dataset", 
                              description="MC simulation of rho -> pions decays (full data)"
    )
    raw_data_artifact.add_file(local_path = str(DATA_PATH), name="rho_small.npz")
    wandb.log_artifact(raw_data_artifact)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: lclissa. Use `wandb login --relogin` to force relogin


## Data splitting

Once we have our raw data artifact, we can track all processing it is subjected to through wandb.

The key is to either download or simply reference the artifact, so that wandb knows which artifact is used as input and can track the outputs derived from it.

Note that Artifacts can store metadata, which is very useful to document how artifacts were created and should be used wisely.

In [3]:
# data splitting
def create_and_save_artifact_locally(data, name, desc, meta=None):
    """Save `data` as a local .npz file and wrap it in a dataset artifact."""
    # `meta=None` avoids sharing a mutable default dict across calls
    data_artifact = wandb.Artifact(name=name, type="dataset",
                                   description=desc, metadata=meta or {})
    outpath = DATA_PATH.parent.parent / name
    outpath.mkdir(exist_ok=True, parents=True)
    outname = str(outpath / DATA_PATH.name)
    np.savez(outname, feats=data)
    data_artifact.add_file(outname)
    return data_artifact

def split(events, n_train, n_val):
    all_idx = [*range(events.shape[0])]
    
    train_idx = np.random.choice(all_idx, n_train, replace=False)
    remaining_idx = np.array(list(set(all_idx).difference(train_idx)))
    val_idx = np.random.choice(remaining_idx, n_val, replace=False)
    test_idx = np.array(list(set(remaining_idx).difference(val_idx)))
    
    return train_idx, val_idx, test_idx 
    
with wandb.init(project="mlops-ai_infn", entity="lclissa", name="dataset-splitting",
                job_type="data-split", config={'DATA_PATH': DATA_PATH,'seed': SEED},
                notes="Playing with Artifacts ...") as run:
    
    # reference data artifact as input of our run
    raw_data_artifact = run.use_artifact('raw_data:latest')
    
    # optionally, we can also download data from wandb
    # note: this does not repeat download if already available locally
    data_dir = raw_data_artifact.download(root=DATA_PATH.parent)
    events = np.load(Path(data_dir) / DATA_PATH.name)["feats"]

    # split data
    train_idx, val_idx, test_idx = split(events, N_TRAIN, N_VAL)
    train_data = events[train_idx, :]
    val_data = events[val_idx, :]
    test_data = events[test_idx, :]

    
    # create new artifacts for train, validation and test datasets
    meta_dict = {'n_train': N_TRAIN, 'n_val': N_VAL}
    train_data_artifact = create_and_save_artifact_locally(
        train_data, name="train_data", desc="training data", meta=meta_dict)
    val_data_artifact = create_and_save_artifact_locally(
        val_data, name="val_data", desc="validation data", meta=meta_dict)
    test_data_artifact = create_and_save_artifact_locally(
        test_data, name="test_data", desc="test data", meta=meta_dict)
    
    wandb.log_artifact(train_data_artifact)
    wandb.log_artifact(val_data_artifact)
    wandb.log_artifact(test_data_artifact)


wandb:   1 of 1 files downloaded.


## Versioning artifacts

In ML projects, we often iterate several times, trying different preprocessing steps, random split seeds, or feature engineering approaches.
Creating an entirely new artifact for our data every time we apply a change would quickly leave us with a ton of datasets that are difficult to track and navigate. Instead, we can leverage W&B artifact versioning to keep conceptually related artifacts together while allowing intuitive tracking and lineage.
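
To see why something as small as a split seed deserves its own version, the sketch below compares splits obtained with different seeds. Note that this is a variant of the notebook's `split` function, not the same code: it uses a local `numpy` random generator and a single permutation instead of `np.random.choice` with a global seed. The function name `split_indices` and the total of 325 events are assumptions.

```python
import numpy as np

def split_indices(n_events, n_train, n_val, seed):
    """Shuffle event indices once and cut them into train/val/test blocks."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_events)
    train_idx = shuffled[:n_train]
    val_idx = shuffled[n_train:n_train + n_val]
    test_idx = shuffled[n_train + n_val:]  # everything that is left
    return train_idx, val_idx, test_idx

split_a = split_indices(325, 210, 65, seed=18)
split_b = split_indices(325, 210, 65, seed=18)  # same seed: same split
split_c = split_indices(325, 210, 65, seed=19)  # new seed: new dataset version

same_seed_matches = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
new_seed_differs = not np.array_equal(split_a[0], split_c[0])
print(same_seed_matches, new_seed_differs)
```

Each seed yields a different train/validation/test partition of the same raw data, which is the kind of change that fits naturally as a new version of the same artifacts rather than as new artifacts.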

### Why is it useful?

- Maintains clear data lineage and provenance tracking
- Makes it easy to reproduce experiments by referencing specific versions
- Reduces storage overhead by only tracking changes between versions
- Enables easy comparison between different preprocessing approaches
- Simplifies rolling back to previous versions if needed
- Helps team collaboration by providing a single source of truth with version history

### Example: Creating a new version of a dataset artifact

```python
import wandb

# Initialize wandb run
run = wandb.init(project="artifact_demo")

# link run to the artifact we want versioning for
old_artifact = run.use_artifact("raw_data:latest")

# Apply changes and save somewhere, say "path/to/processed_data"

# Create a new artifact with same name of existing artifact
artifact = wandb.Artifact("raw_data", type="dataset")

# Add files or data to the artifact
artifact.add_file("path/to/processed_data")

# Log the artifact - W&B will automatically create a new version
run.log_artifact(artifact)

# Later you can reference specific versions using :v0, :v1, etc.
# Example: artifact = run.use_artifact('raw_data:v1')
```

**Note**: Each new version gets an incremental version number (v0, v1, v2, etc.). You can also use aliases like 'latest' to always get the most recent version.
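
Version and alias references are plain strings of the form `name:alias`, optionally prefixed by the project and entity. A small helper such as the hypothetical `artifact_ref` below can make pinning a version explicit; it only builds the string and does not contact wandb.

```python
def artifact_ref(name, version="latest", project=None, entity=None):
    """Build an artifact reference string such as 'entity/project/name:v1'."""
    ref = f"{name}:{version}"
    if project is not None:
        ref = f"{project}/{ref}"
        if entity is not None:
            ref = f"{entity}/{ref}"
    return ref

print(artifact_ref("raw_data"))                      # raw_data:latest
print(artifact_ref("raw_data", "v1"))                # raw_data:v1
print(artifact_ref("raw_data", "v0",
                   project="mlops-ai_infn",
                   entity="lclissa"))                # lclissa/mlops-ai_infn/raw_data:v0
```

The resulting string can be passed to `run.use_artifact(...)`. Pinning an explicit version such as `v1` makes an experiment reproducible, whereas `latest` moves every time a new version is logged.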

## Artifact recap

Artifacts are useful to store anything that can be seen as an input or output of our experiments. Hence, they are particularly useful for:
 - datasets
 - models

A nice feature is that we can inspect the artifacts' lineage from the wandb UI, as well as track their metadata. wandb also takes care of versioning artifacts automatically, so we have all the tools we need to make sure our results are reproducible.
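
The same pattern applies to models. Below is a minimal sketch, assuming the model weights have already been saved to a local file (for instance the path stored in `MODEL_CHECKPOINTS_PATH`). The helper name `log_model_checkpoint` and the artifact name `pointnet_baseline` are hypothetical.

```python
from pathlib import Path

def log_model_checkpoint(run, weights_path, name="pointnet_baseline", metadata=None):
    """Wrap a saved weights file in a 'model' artifact and log it to the given run."""
    import wandb  # lazy import: only needed when actually logging

    weights_path = Path(weights_path)
    artifact = wandb.Artifact(name=name, type="model", metadata=metadata or {})
    # keep the original file name inside the artifact
    artifact.add_file(local_path=str(weights_path), name=weights_path.name)
    return run.log_artifact(artifact)
```

Inside a training run, this could be called as `log_model_checkpoint(run, MODEL_CHECKPOINTS_PATH, metadata={"seed": SEED})`. If that run also declared its input with `run.use_artifact("train_data:latest")`, the lineage view would connect the model to the dataset version it was trained on.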