# Preprocessing for `CNNTopTagging.ipynb` (optional exercise)

The dataset used in [`CNNTopTagging.ipynb`](CNNTopTagging.ipynb) has to be preprocessed to be in the form of images. Preprocessed images for 100k training and testing examples and 10k validation examples are provided in the [data](data) folder of the course repository.

## The dataset

The authors of [arXiv:1707.08966](https://arxiv.org/abs/1707.08966) provide us with a dataset for studying this problem. There is also a [summary paper](https://arxiv.org/abs/1902.09914) reviewing different methods.

If you want to run this exercise at home you can download the data at https://desycloud.desy.de/index.php/s/llbX3zpLhazgPJ6 (1.6 GB).

If you run this notebook at the CIP pool during the course, the data can be found at `/large_tmp/LMU_DA_ML/top_tagging`. Otherwise, adjust the following path:

In [None]:
data_dir = "/large_tmp/LMU_DA_ML/top_tagging"

In [None]:
import pandas as pd
import numpy as np
import os

The dataset contains about 1M training examples. For now, we use only 100k examples each for training and testing, and 10k for validation during training.

In [None]:
n_examples = 100000
df_train = pd.read_hdf(os.path.join(data_dir, "train.h5"), "table", stop=n_examples)

The dataset contains the four-momenta $(E, p_x, p_y, p_z)$ of the 200 leading constituents of each jet (jets with fewer constituents are padded with zeros). The field `is_signal_new` flags whether the jet is a QCD jet (0) or a top quark jet (1).

In [None]:
df_train.head()

## Preprocessing

Since our jets are already clustered with a fixed radius parameter, it is convenient to shift the coordinates so that the leading constituent sits at the center of the image. As coordinates we use the azimuthal angle $\phi$ and the [pseudorapidity](https://en.wikipedia.org/wiki/Pseudorapidity) $\eta$, a longitudinal quantity whose differences are invariant under boosts along the beam direction. The images are created by summing (histogramming) the transverse momenta into 40x40 pixels covering $[-1, 1]$ in both relative coordinates, after normalizing by the transverse momentum of the leading constituent (which sits at the center of the image).

In [None]:
def get_df_rel(df):
    """
    Create dataframe with PT, ETA, PHI in coordinates relative to leading constituent
    """

    # make new df with relative coordinates (to leading constituent)
    # first, just copy the labels for convenience
    df_rel = df[["is_signal_new"]].copy()

    # Augment with pt, eta, phi
    for i in range(200):
        df_rel["PT_{}".format(i)] = np.sqrt(df["PX_{}".format(i)]**2 + df["PY_{}".format(i)]**2)
        # pseudorapidity; 0/0 for zero-padded constituents gives NaN, which is filled with 0 below
        df_rel["ETA_{}".format(i)] = np.arcsinh(df["PZ_{}".format(i)]/df_rel["PT_{}".format(i)])
        # arctan2 covers the full azimuthal range (-pi, pi]; arcsin(py/pt) would fold px < 0 onto px > 0
        df_rel["PHI_{}".format(i)] = np.arctan2(df["PY_{}".format(i)], df["PX_{}".format(i)])

    PT_0 = df_rel.PT_0.copy()
    ETA_0 = df_rel.ETA_0.copy()
    PHI_0 = df_rel.PHI_0.copy()
    for i in range(200):
        # normalize by leading constituent
        df_rel["PT_{}".format(i)] = df_rel["PT_{}".format(i)] / PT_0

        # shift coordinates
        df_rel["ETA_{}".format(i)] = df_rel["ETA_{}".format(i)] - ETA_0
        # wrap the azimuthal difference into [-pi, pi) so that jets near phi = +-pi are not split
        dphi = df_rel["PHI_{}".format(i)] - PHI_0
        df_rel["PHI_{}".format(i)] = (dphi + np.pi) % (2 * np.pi) - np.pi

    df_rel.fillna(0, inplace=True)
    return df_rel
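
To make the coordinate definitions concrete, here is a minimal sketch for a single constituent. The momentum values are made up for illustration. It computes $p_T$, $\eta$ and $\phi$ and cross-checks the `arcsinh` expression against the textbook definition $\eta = -\ln\tan(\theta/2)$. The azimuthal angle is computed with `np.arctan2`, which agrees with `arcsin(py/pt)` whenever `px > 0`.

```python
import numpy as np

# Hypothetical constituent momentum components (illustrative values only)
px, py, pz = 30.0, 40.0, 120.0

pt = np.hypot(px, py)        # transverse momentum
phi = np.arctan2(py, px)     # azimuthal angle in (-pi, pi]
eta = np.arcsinh(pz / pt)    # pseudorapidity

# Cross-check against eta = -ln(tan(theta / 2)),
# where theta is the polar angle with respect to the beam (z) axis
theta = np.arctan2(pt, pz)
eta_from_theta = -np.log(np.tan(theta / 2))

print(pt, eta, eta_from_theta, phi)
```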

In [None]:
df_rel_train = get_df_rel(df_train)

In [None]:
df_rel_train.head()
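
The normalization and shift can also be sketched on plain NumPy arrays of shape `(n_jets, n_constituents)`. The toy values below are assumptions for illustration, and the azimuthal difference is wrapped into $[-\pi, \pi)$ so that a jet close to $\phi = \pm\pi$ is not torn apart.

```python
import numpy as np

# Hypothetical toy input: 2 jets with 3 constituents each, leading constituent first
pt = np.array([[100.0, 50.0, 0.0], [80.0, 20.0, 10.0]])
eta = np.array([[0.50, 0.70, 0.0], [-1.20, -1.00, -1.30]])
phi = np.array([[3.10, -3.10, 0.0], [0.20, 0.10, 0.40]])

# normalize by the leading constituent (column 0), keeping a broadcastable shape
pt_rel = pt / pt[:, :1]
# shift so that the leading constituent sits at (0, 0)
eta_rel = eta - eta[:, :1]
# wrap the azimuthal difference into [-pi, pi)
phi_rel = (phi - phi[:, :1] + np.pi) % (2 * np.pi) - np.pi

print(pt_rel)
print(eta_rel)
print(phi_rel)
```

In the first toy jet, the second constituent is at $\phi = -3.10$ while the leading one is at $\phi = 3.10$. Without wrapping, the difference would be $-6.2$; with wrapping it is the small angle $2\pi - 6.2$.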

What does an average jet image look like now?

In [None]:
import matplotlib
import matplotlib.pyplot as plt

In [None]:
def plot_avg(df, label=1):
    
    columns = sum([["PT_{}".format(i), "ETA_{}".format(i), "PHI_{}".format(i)] for i in range(200)], [])

    # transform to reshaped numpy array of particles (irrespective of event)
    trf = df[df["is_signal_new"]==label][columns].values.reshape(-1, 3)
    pt = trf[:,0]
    eta = trf[:,1]
    phi = trf[:,2]

    plt.hist2d(
        eta, phi, bins=(40, 40), range=([-1, 1], [-1, 1]),
        # the pixel intensity is the transverse momentum, so we have to weight by pt here
        weights=pt,
        norm=matplotlib.colors.LogNorm(),
    )
    plt.colorbar()
    plt.xlabel("eta")
    plt.ylabel("phi")

Average QCD jet

In [None]:
plot_avg(df_rel_train, label=0)

Average Top quark jet

In [None]:
plot_avg(df_rel_train, label=1)

For training a CNN we now have to make an array of these images:

In [None]:
def get_img_array(df):
    """
    Pixelate constituent arrays per jet
    """
    columns = sum([["PT_{}".format(i), "ETA_{}".format(i), "PHI_{}".format(i)] for i in range(200)], [])
    hists = []
    trf = df[columns].values.reshape(-1, 200, 3)
    for i in range(len(trf)):
        pt = trf[i][:,0]
        eta = trf[i][:,1]
        phi = trf[i][:,2]
        # remember: the pixel intensity is the transverse momentum, so we have to weight by pt here
        hist, xedges, yedges = np.histogram2d(eta, phi, bins=(40, 40), range=([-1, 1], [-1, 1]), weights=pt)
        hists.append(np.array([hist]))
    return np.stack(hists).reshape(-1, 40, 40, 1)
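
The pixelation step can be illustrated on one hand-made jet. This is only a sketch with invented constituent values; it shows that constituents falling into the same pixel are summed, and that constituents outside the image range are dropped.

```python
import numpy as np

# Hypothetical jet: rows are (relative pt, relative eta, relative phi)
jet = np.array([
    [1.00, 0.01, 0.01],   # leading constituent, close to the image center
    [0.40, 0.32, -0.21],
    [0.25, 0.33, -0.22],  # falls into the same pixel as the previous one
    [0.10, 1.50, 0.00],   # outside the image range, dropped by the histogram
])
pt, eta, phi = jet.T

# first array axis is eta, second is phi; pixel intensity is the summed pt
image, eta_edges, phi_edges = np.histogram2d(
    eta, phi, bins=(40, 40), range=([-1, 1], [-1, 1]), weights=pt
)

print(image.shape)
print(image[20, 20], image[26, 15], image.sum())
```

Note that the first array axis corresponds to $\eta$ and the second to $\phi$, so `plt.imshow` displays $\eta$ along the vertical direction.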

In [None]:
x_train = get_img_array(df_rel_train)

In [None]:
x_train.shape

In [None]:
y_train = df_rel_train.is_signal_new.values

Plot the mean value of these arrays again to check if everything worked as expected:

In [None]:
plt.imshow(x_train[y_train==0].mean(axis=0), norm=matplotlib.colors.LogNorm())

In [None]:
plt.imshow(x_train[y_train==1].mean(axis=0), norm=matplotlib.colors.LogNorm())

Now, preprocess the validation and testing datasets as well. To save memory, read and process the input file in chunks:

In [None]:
def preprocess(df_path, n_examples=100000, chunksize=10000):
    x = []
    y = []
    for start in range(0, n_examples, chunksize):
        df = pd.read_hdf(df_path, "table", start=start, stop=start + chunksize)
        df_rel = get_df_rel(df)
        x.append(get_img_array(df_rel))
        y.append(df_rel.is_signal_new.values)
    return np.concatenate(x), np.concatenate(y)

In [None]:
x_test, y_test = preprocess(os.path.join(data_dir, "test.h5"))

In [None]:
x_val, y_val = preprocess(os.path.join(data_dir, "val.h5"), n_examples=10000)
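
The row ranges used by such a chunked loop can be looked at in isolation. The helper below is hypothetical (it is not part of the notebook). It caps the last `stop` at `n_examples`, so no more than `n_examples` rows are requested even when `n_examples` is not a multiple of `chunksize`.

```python
def chunk_bounds(n_examples, chunksize):
    """Yield (start, stop) row ranges that together cover exactly n_examples rows."""
    for start in range(0, n_examples, chunksize):
        # cap the last chunk so that it does not run past n_examples
        yield start, min(start + chunksize, n_examples)


bounds = list(chunk_bounds(25, 10))
print(bounds)
```

Each `(start, stop)` pair could be passed on to `pd.read_hdf(..., start=start, stop=stop)`.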

For the files we put into the git repository, we also converted the color values to unsigned 8-bit integers (values between 0 and 255).

To lose as little information as possible in this conversion, we first apply a logarithmic transformation. Let's use a small subset for experimentation:

In [None]:
x = x_train[::10]

In [None]:
_ = plt.hist(np.log(x).ravel(), bins=300, range=(-10, 5))

The peak at 0 (1 in the untransformed case) comes from the fact that we normalized our transverse momenta to be relative to the leading constituent for each jet.

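To convert to unsigned 8-bit integers, we map the log range `(-10, 5)` linearly onto the integer codes 1 to 255 and set the `-np.inf` values (resulting from `np.log(0)`) to 0. Values outside the log range are clipped to the nearest valid code, since out-of-range floats would otherwise wrap around when cast to `uint8`.

In [None]:
def transform(x, range=(-10, 5)):
    map_1_255 = ((np.log(x) - range[0]) / (range[1] - range[0]) * 255 + 1)
    # clip to the valid codes 1..255 before casting to uint8
    map_1_255 = np.clip(map_1_255, 1, 255)
    return np.where(x != 0, map_1_255, 0).astype(np.uint8)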

In [None]:
_ = plt.hist(transform(x).ravel(), range=(1, 255), bins=254)

To recover the original values later, we need the inverse of that transformation:

In [None]:
def reverse_transform(x_uint8, range=(-10, 5)):
    reverse_map_1_255 =  np.exp((x_uint8 - 1) / 255 * (range[1] - range[0]) + range[0])
    return np.where(x_uint8 != 0, reverse_map_1_255, 0)

In [None]:
# average image after a round trip through the uint8 encoding
plt.imshow(reverse_transform(transform(x)).mean(axis=0), norm=matplotlib.colors.LogNorm())

The following comparison shows some information loss at very high values, where the values representable by the 8-bit unsigned integers are spaced more coarsely.

In [None]:
opts = dict(bins=200, alpha=0.5, range=(0, 5))
plt.hist(x.ravel(), **opts)
plt.hist((reverse_transform(transform(x))).ravel(), **opts)
plt.yscale("log")

This is less visible with a coarser binning:

In [None]:
opts = dict(bins=30, alpha=0.5, range=(0, 5))
plt.hist(x.ravel(), **opts)
plt.hist((reverse_transform(transform(x))).ravel(), **opts)
plt.yscale("log")
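
The size of this loss can be estimated from the step width in log space. The sketch below uses hypothetical stand-ins `to_uint8` and `from_uint8` that mirror the mapping described above, applied to a few made-up values inside the log range. Because the cast to `uint8` truncates, a restored value is never larger than the original and is smaller by at most one step of `15 / 255` in log space, i.e. by at most about 6%.

```python
import numpy as np

LOG_MIN, LOG_MAX = -10, 5  # same log range as used above


def to_uint8(x):
    # hypothetical stand-in: log-scale, map to codes 1..255, keep 0 as 0
    with np.errstate(divide="ignore"):
        scaled = (np.log(x) - LOG_MIN) / (LOG_MAX - LOG_MIN) * 255 + 1
    return np.where(x != 0, scaled, 0).astype(np.uint8)


def from_uint8(x_uint8):
    # hypothetical stand-in for the inverse mapping
    codes = x_uint8.astype(np.float64)
    restored = np.exp((codes - 1) / 255 * (LOG_MAX - LOG_MIN) + LOG_MIN)
    return np.where(x_uint8 != 0, restored, 0.0)


x = np.array([0.0, 1e-3, 0.1, 1.0, 3.0])  # illustrative pixel values
x_back = from_uint8(to_uint8(x))

step = (LOG_MAX - LOG_MIN) / 255       # quantization step in log space
ratio = x_back[1:] / x[1:]             # restored / original for nonzero pixels

print(to_uint8(x))
print(ratio, np.exp(-step))
```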

Finally, save the arrays in a compressed format:

In [None]:
np.savez_compressed(
    "top_tagging_images.npz",
    x_train=transform(x_train),
    y_train=y_train,
    x_test=transform(x_test),
    y_test=y_test,
    x_val=transform(x_val),
    y_val=y_val
)
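
Reading such a file back and undoing the 8-bit encoding could look as follows. This is a sketch under assumptions: it writes a tiny dummy file with random content into a temporary directory, using the same key names, so that it runs without the real data.

```python
import os
import tempfile

import numpy as np

# Write a tiny dummy file with the same key names (random data, illustration only)
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "top_tagging_images.npz")
np.savez_compressed(
    path,
    x_train=rng.integers(0, 256, size=(5, 40, 40, 1), dtype=np.uint8),
    y_train=rng.integers(0, 2, size=5),
)

# Load the arrays again; np.load returns a dict-like object for .npz files
with np.load(path) as data:
    x_uint8 = data["x_train"]
    y = data["y_train"]

# Undo the encoding: code 0 stays 0, codes 1..255 map back through the log range
log_min, log_max = -10, 5
codes = x_uint8.astype(np.float64)
x_restored = np.where(
    x_uint8 != 0,
    np.exp((codes - 1) / 255 * (log_max - log_min) + log_min),
    0.0,
)

print(x_uint8.dtype, x_restored.shape, x_restored.min(), x_restored.max())
```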