# Data manipulation

This notebook describes the NA62 data made available to you and explains how to read and use them for some basic data analysis later on. Low-level functionality to help you use pandas dataframes is provided, and we will guide you through building higher-level functionality based on particle physics (manipulation of three- and four-momenta, invariant masses, ...).

Let's start by loading a small amount of data and looking at the variables that are provided.

In [1]:
# Import useful packages
import pandas as pd
import numpy as np
from na62 import prepare
from typing import List, Union

In [2]:
# Then load some data (just 10 events)
data, _ = prepare.import_root_files(["data/run12450.root"], total_limit=10)
data

Unnamed: 0,run,burst,event_type,event_time,reference_time,ktag_time,beam_momentum_mag,beam_direction_x,beam_direction_y,beam_direction_z,...,cluster2_lkr_energy,cluster2_position_x,cluster2_position_y,cluster2_time,track1_eop,track2_eop,track3_eop,track1_position_am_z,track2_position_am_z,track3_position_am_z
0,12450,1,3,95382725,14.424831,14.599336,75037.0,0.00121,1.3e-05,0.999999,...,,,,,0.0111,,,180000,180000,180000
1,12450,1,3,99777525,7.797206,7.792658,75037.0,0.00121,1.3e-05,0.999999,...,,,,,0.0,,,180000,180000,180000
2,12450,1,3,94745925,1.754371,2.085522,75037.0,0.00121,1.3e-05,0.999999,...,,,,,0.0,,,180000,180000,180000
3,12450,1,3,102113650,2.436627,2.139214,75037.0,0.00121,1.3e-05,0.999999,...,,,,,0.011516,,,180000,180000,180000
4,12450,1,3,95432550,9.356647,9.205115,75037.0,0.00121,1.3e-05,0.999999,...,,,,,0.011238,,,180000,180000,180000
5,12450,1,3,108954600,9.064252,9.272685,75037.0,0.00121,1.3e-05,0.999999,...,,,,,0.0,,,180000,180000,180000
6,12450,1,3,114192150,21.247387,21.449175,75037.0,0.00121,1.3e-05,0.999999,...,,,,,0.011218,,,180000,180000,180000
7,12450,1,3,99930475,22.027107,27.29752,75037.0,0.00121,1.3e-05,0.999999,...,,,,,0.022132,,,180000,180000,180000
8,12450,1,3,114778400,17.153853,17.365828,75037.0,0.00121,1.3e-05,0.999999,...,,,,,0.0,,,180000,180000,180000
9,12450,1,3,109796925,17.348784,16.52146,75037.0,0.00121,1.3e-05,0.999999,...,,,,,0.021266,,,180000,180000,180000


We can see here 10 events, each containing many variables. Let's look at the variables in more detail.

## Information about the data structure

In [3]:
# This prints information about the dataframe structure
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 92 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   run                     10 non-null     int32  
 1   burst                   10 non-null     int32  
 2   event_type              10 non-null     int32  
 3   event_time              10 non-null     int32  
 4   reference_time          10 non-null     float64
 5   ktag_time               10 non-null     float64
 6   beam_momentum_mag       10 non-null     float64
 7   beam_direction_x        10 non-null     float64
 8   beam_direction_y        10 non-null     float64
 9   beam_direction_z        10 non-null     float64
 10  beam_position_x         10 non-null     float32
 11  beam_position_y         10 non-null     float32
 12  beam_position_z         10 non-null     float32
 13  vtx_x                   10 non-null     float32
 14  vtx_y                   10 non-null     float

The command above gives us the full list of variables with their respective data types.
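
With this many columns, it can help to look at one group of variables at a time. The snippet below is a minimal sketch on a made-up toy dataframe (a few column names are borrowed from the table above, the values are invented) showing how standard pandas calls can select columns by prefix and count missing values; it does not use the `na62` package.

```python
import pandas as pd

# Toy dataframe: column names borrowed from the table above, values invented
toy = pd.DataFrame({
    "run": [12450, 12450],
    "beam_momentum_mag": [75037.0, 75037.0],
    "beam_direction_x": [0.00121, 0.00121],
    "track1_eop": [0.0111, 0.0],
    "track2_eop": [float("nan"), float("nan")],
})

# Keep only the columns whose name starts with "beam_"
beam_columns = toy.filter(regex=r"^beam_")
print(list(beam_columns.columns))

# Count the missing (NaN) values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same two calls should work unchanged on the `data` dataframe loaded above.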

Vectors are important variables. The flat data structure cannot contain a vector object, so each momentum vector is spread across four variables, `[name]_direction_{x,y,z}` and `[name]_momentum_mag`, containing respectively the direction and the magnitude of the momentum (the direction is a unit vector, i.e. it has magnitude 1). `[name]` indicates the object to which the momentum refers. We also have the vectors `[name]_direction_am_{x,y,z}` and `[name]_position_am_{x,y,z}`, corresponding to the direction and position of the track after the spectrometer magnet. These last variables are present and valid for charged tracks only (those deflected by the magnet) and can be used to extrapolate the position to `z` values located after the magnet.
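
To make the direction-plus-magnitude layout concrete, here is a small sketch that rebuilds the Cartesian components of a vector stored in this format. The numbers are invented and the output names `px`, `py`, `pz` are hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Invented vectors in the flat format: unit direction + magnitude
vec = pd.DataFrame({
    "direction_x": [0.0, 0.6],
    "direction_y": [0.0, 0.0],
    "direction_z": [1.0, 0.8],
    "momentum_mag": [75000.0, 10.0],
})

direction_columns = ["direction_x", "direction_y", "direction_z"]

# Multiply each direction column by the magnitude, row by row
components = vec[direction_columns].mul(vec["momentum_mag"], axis=0)
components.columns = ["px", "py", "pz"]  # hypothetical names

# Sanity check: the stored direction should have magnitude 1
norm = np.sqrt((vec[direction_columns] ** 2).sum(axis=1))
print(components)
print(norm)
```

Going the other way (from components back to a unit direction and a magnitude) requires dividing by the magnitude, which is undefined for a zero vector.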

As can be seen above, the data structure also contains information about:
 - The event: run number, burst number and event_time. These three values allow us to uniquely identify an event in the NA62 data. No two events share the same triplet of values, with the caveat that some events may contain multiple decays, which appear as multiple rows with the same event time in the dataframe.
 - The trigger reference fine time and the time of the KTAG candidate closest to the event.
 - The beam: momentum and position at z = 102.4 m.
 - The position of the computed event vertex (using a kinematic fit or the CDA approach, depending on the event type).
 - Placeholders for three tracks (track1, track2, track3): for each track, a variable indicates whether the track exists for the event. If the track exists, information is filled in about its momentum, position after the magnet, time and charge, whether the track has an associated MUV3 signal and at which time, RICH information (hypothesis, ring radius, position and number of hits), and the associated LKr energy. For convenience, the EoP (energy over momentum) is already calculated and stored.
 - Similar placeholders for up to two clusters (cluster1, cluster2). If present, these clusters of energy in the LKr are not associated with any of the tracks, and the variables giving their energy, position on the LKr front plane (`na62.constants.lkr_position`) and time are filled.
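
The caveat about events with multiple decays can be checked directly. The sketch below uses an invented toy dataframe with only the three identifying columns and flags rows that share the same (run, burst, event_time) triplet, using the standard pandas `duplicated` and `groupby` methods.

```python
import pandas as pd

# Invented rows: the last two share the same identifying triplet
toy = pd.DataFrame({
    "run": [12450, 12450, 12450],
    "burst": [1, 1, 1],
    "event_time": [95382725, 99777525, 99777525],
})

event_key = ["run", "burst", "event_time"]

# keep=False marks every row whose key appears more than once
shares_key = toy.duplicated(subset=event_key, keep=False)
print(shares_key.tolist())

# Number of rows (decays) recorded for each event
rows_per_event = toy.groupby(event_key).size()
print(rows_per_event)
```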

In addition, each event was pre-identified, and the variable `event_type` indicates which kaon decay channel was detected. Similarly, the `[name]_rich_hypothesis` variables indicate the most likely particle type for the measured track.

These are the basic quantities reconstructed from (part of) the NA62 detector, and they will be enough for the kind of analysis that we want to do here. However, we need to be able to combine this information according to mathematical and physical principles that you know. The dataframe does not provide such facilities, so you have to develop them yourself as exercises.

# Exercises
Each exercise asks you to implement some functions to manipulate the data. The inputs and outputs of the functions are well defined. You can pass each function to a test suite that will tell you whether the implementation is correct.
The test suite is available through the tests module of the na62 package.

## Three-vector operations
The first functions that will be needed, and that are not provided by the dataframe, are vector operations (sum, product, magnitude). We ask you to fill in the functions below to provide this functionality.

You can assume that the dataframes passed as arguments (as a list or as a single dataframe, depending on the function) contain the following variables: `direction_{x,y,z}` and `momentum_mag`.
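
As a reminder, these are the standard textbook definitions written for the direction-plus-magnitude format, where a vector is $\vec{p} = |\vec{p}|\,\hat{d}$ with $\hat{d}$ the unit direction. The symbols $\vec{P}$, $\hat{D}$, $\vec{a}$ and $\vec{b}$ are introduced here for illustration only; the exact conventions checked by the test suite (for example for a zero vector, whose direction is undefined) are not stated in this notebook.

$$
\begin{aligned}
\vec{P} &= \sum_i |\vec{p}_i|\,\hat{d}_i, \qquad |\vec{P}| = \sqrt{P_x^2 + P_y^2 + P_z^2}, \qquad \hat{D} = \frac{\vec{P}}{|\vec{P}|} \\
\vec{a}\cdot\vec{b} &= a_x b_x + a_y b_y + a_z b_z \\
\vec{a}\times\vec{b} &= \left(a_y b_z - a_z b_y,\; a_z b_x - a_x b_z,\; a_x b_y - a_y b_x\right) \\
|\vec{a}_T|^2 &= |\vec{a}|^2 - \frac{(\vec{a}\cdot\vec{b})^2}{|\vec{b}|^2}
\end{aligned}
$$

The last line is the squared magnitude of the component of $\vec{a}$ transverse to $\vec{b}$, obtained by subtracting the squared projection of $\vec{a}$ onto $\vec{b}$ from $|\vec{a}|^2$.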

In [4]:
def three_vectors_sum(vectors: List[pd.DataFrame]) -> pd.DataFrame:
    # Check that there are any vectors to sum
    if len(vectors) == 0:
        return pd.DataFrame({"direction_x": [], "direction_y": [], "direction_z": [], "momentum_mag": []})

    # [FILL HERE]
    # The code below should perform the sum of all three-vectors and return 
    # a new dataframe containing the summed vector using the same format 
    # as the input (the variables "direction_x", "direction_y", "direction_z", "momentum_mag")
    # Make sure that the direction vector is a unit vector

    return # [SOMETHING]


def three_vector_mag(vector: pd.DataFrame) -> pd.Series:
    # [FILL HERE]
    # Return the magnitude of the three-vector

    return # [SOMETHING]


def three_vector_invert(vector: pd.DataFrame) -> pd.DataFrame:
    # [FILL HERE]
    # The code below should return a new vector (with the same standard format)
    # where all the coordinates are inverted (corresponding to the mathematical operation -1*vector)

    return # [SOMETHING]


def three_vector_dot_product(v1: pd.DataFrame, v2: pd.DataFrame) -> pd.Series:
    # [FILL HERE]
    # The code below should return the dot product of two vectors

    return # [SOMETHING]


def three_vector_cross_product(v1: pd.DataFrame, v2: pd.DataFrame) -> pd.DataFrame:
    # [FILL HERE]
    # The code below should perform the cross product between two vectors
    # and return a new DataFrame containing the result using the same format as the input
    # (the variables "direction_x", "direction_y", "direction_z", "momentum_mag")
    # Make sure that the direction vector is a unit vector
    
    return # [SOMETHING]


def three_vector_transverse(v1: pd.DataFrame, v2: pd.DataFrame) -> pd.Series:
    # [FILL HERE]
    # The code below should compute the magnitude of the transverse momentum of the vector v1 with respect to the vector v2
    # Hint: compute the squared magnitude of the transverse momentum vector and take the square root
    
    return # [SOMETHING]


In [5]:
# Perform the tests
from na62.tests.test_vectors import Test_ThreeVector
Test_ThreeVector().run_tests(sum_function=three_vectors_sum, mag_function=three_vector_mag, invert_function=three_vector_invert,
                            cross_product_function=three_vector_cross_product, dot_product_function=three_vector_dot_product, transverse_function=three_vector_transverse)

[ERROR] Magnitude function does not return the expected data type (pandas.Series expected)
[ERROR] Magnitude function does not return the expected values
[ERROR] Sum function does not return the expected data type (pandas.DataFrame expected)
[ERROR] Sum function does not return the expected values
[ERROR] Sum function does not return a unit direction vector
[ERROR] Invert function does not return the expected data type (pandas.DataFrame expected)
[ERROR] Invertion function does not return a vector whose coordinates are inverted
[ERROR] Cross product function does not return the expected data type (pandas.DataFrame expected)
[ERROR] Cross product function does not return the expected values
[ERROR] Dot product function does not return the expected data type (pandas.Series expected)
[ERROR] Dot product function does not return the expected values
[ERROR] Transverse function does not return the expected data type (pandas.Series expected)
[ERROR] Transverse function does not return the exp

## Four-vector operations
Next we want to operate on four-vectors. Please write below the functions providing the requested functionality. You can make the same assumptions as above for the input dataframes, and you can use the additional variable `energy`, which provides the four-vector energy.
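
The four-vector magnitude squared can be negative, and the `four_vector_mag` skeleton below asks for a signed square root in that case. The following sketch illustrates that convention on invented numbers, assuming the usual definition of the magnitude squared as the energy squared minus the three-momentum magnitude squared. It is one possible way to write it with NumPy, not necessarily how the reference solution does it.

```python
import numpy as np
import pandas as pd

# Invented energies and momentum magnitudes (same units)
toy = pd.DataFrame({"energy": [5.0, 3.0], "momentum_mag": [3.0, 5.0]})

# Magnitude squared: E^2 - |p|^2 (assumed metric convention)
mag2 = toy["energy"] ** 2 - toy["momentum_mag"] ** 2

# Signed square root: sqrt(mag2) if mag2 >= 0, else -sqrt(|mag2|)
signed_mag = np.sign(mag2) * np.sqrt(np.abs(mag2))
print(mag2.tolist())
print(signed_mag.tolist())
```

Taking the absolute value before the square root avoids producing NaN for negative inputs.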

In [6]:
def four_vectors_sum(vectors: List[pd.DataFrame]) -> pd.DataFrame:
    # Check that there are any vectors to sum
    if len(vectors) == 0:
        return pd.DataFrame({"direction_x": [], "direction_y": [], "direction_z": [], "momentum_mag": [], "energy": []})

    # [FILL HERE]
    # The code below should perform the sum of all four-vectors and return 
    # a new dataframe containing the summed vector using the same format 
    # as the input (the variables "direction_x", "direction_y", "direction_z", "momentum_mag", "energy")
    # Make sure that the direction vector is a unit vector
    # Hint: you can treat the 4-vector as a 3-vector + energy

    return # [SOMETHING]

def four_vector_mag2(vector: pd.DataFrame) -> pd.DataFrame:
    # [FILL HERE]
    # Return the magnitude squared of the four-vector
    # Hint: you can again treat the 4-vector as a 3-vector + energy

    return # [SOMETHING]

def four_vector_mag(vector: pd.DataFrame) -> pd.DataFrame:
    # [FILL HERE]
    # Return the magnitude of the four-vector
    # Hint: try to take the square root of the magnitude squared
    # Convention: as the magnitude squared can be negative, we define the
    #  magnitude in that case as the negative of the square root of its
    #  absolute value, i.e. -sqrt(|mag2|).

    return # [SOMETHING]

def four_vector_invert(vector: pd.DataFrame) -> pd.DataFrame:
    # [FILL HERE]
    # The code below should return a new vector (with the same standard format)
    # where all the coordinates are inverted (corresponding to the mathematical operation -1*vector)

    return # [SOMETHING]

In [7]:
# Perform the tests
from na62.tests.test_vectors import Test_FourVector
Test_FourVector().run_tests(sum_function=four_vectors_sum, mag_function=four_vector_mag, mag2_function=four_vector_mag2, invert_function=four_vector_invert)

## Kinematic functions

Finally, let's come to some useful kinematic functions based on the above operations. For all the momenta (either 3-momenta or 4-momenta) we can make the same assumptions as previously regarding the names of the variables. In addition, we already provide some functions to extract the required tracks and photons from the complete dataframe. These parts of the code are already written for you; only the actual mathematical operations are left out.  
Please complete the missing parts of the functions below and run them through the test suite.
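
For the propagation function, the underlying idea is a straight-line extrapolation: the track moves along its direction until it reaches the requested `z`. The sketch below shows the arithmetic on an invented track. It assumes there is no magnetic field between the starting position and `z_final` and that `direction_z` is non-zero, and it uses fixed column names rather than the configurable field names of the exercise.

```python
import pandas as pd

# Invented tracks: starting position and unit direction
track = pd.DataFrame({
    "position_x": [0.0, 10.0],
    "position_y": [0.0, -5.0],
    "position_z": [180000.0, 180000.0],
    "direction_x": [0.0, 0.6],
    "direction_y": [0.0, 0.0],
    "direction_z": [1.0, 0.8],
})

z_final = 181000.0  # hypothetical target plane

# Path length needed to cover the distance in z along the direction
scale = (z_final - track["position_z"]) / track["direction_z"]

# Move x and y by the same path length along their direction components
x_final = track["position_x"] + scale * track["direction_x"]
y_final = track["position_y"] + scale * track["direction_y"]
print(x_final.tolist(), y_final.tolist())
```

By construction the propagated `z` coordinate is simply `z_final` for every row.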

In [8]:
from na62 import extract
def invariant_mass(momenta: List[pd.DataFrame]) -> pd.Series:
    # [FILL HERE]
    # The function receives a list of 4-momenta (in the usual format already used above).
    # Compute the invariant mass of the 4-momenta.
    # Hint: use the 4-vector functions we wrote above

    return # [SOMETHING]

def total_momentum(df: pd.DataFrame) -> pd.Series:
    # We receive the full data
    # First extract all the tracks and clusters. We fill all the "NA" values with 0.
    # That way, we do not need to care whether a track/cluster exists:
    # a missing one simply contributes nothing to the sum
    t1 = extract.track(df, 1).fillna(0)
    t2 = extract.track(df, 2).fillna(0)
    t3 = extract.track(df, 3).fillna(0)
    c1 = extract.photon_momentum(df, 1).fillna(0)
    c2 = extract.photon_momentum(df, 2).fillna(0)

    # [FILL HERE]
    # Compute the magnitude of the total momentum (including tracks and clusters)
    # Hint: use the 3-vector functions we wrote above
    
    return # [SOMETHING]


def total_track_momentum(df: pd.DataFrame) -> pd.Series:
    # We receive the full data
    # First extract all the tracks. We fill all the "NA" values with 0.
    # That way, we do not need to care whether a track exists:
    # a missing one simply contributes nothing to the sum
    t1 = extract.track(df, 1).fillna(0)
    t2 = extract.track(df, 2).fillna(0)
    t3 = extract.track(df, 3).fillna(0)

    # [FILL HERE]
    # Compute the magnitude of the total momentum of the tracks
    # Hint: use the 3-vector functions we wrote above

    return # [SOMETHING]

def missing_mass_sqr(beam: pd.DataFrame, momenta: List[pd.DataFrame]) -> pd.Series:
    # [FILL HERE]
    # Compute the missing mass squared, defined as the squared magnitude of the 4-vector "beam - sum(momenta)"
    # Hint: use the 4-vector functions we wrote above

    return # [SOMETHING]

def missing_mass(beam: pd.DataFrame, momenta: List[pd.DataFrame]) -> pd.Series:
    # [FILL HERE]
    # Compute the missing mass, defined as the magnitude of the 4-vector "beam - sum(momenta)"
    # Hint: use the 4-vector functions we wrote above

    return # [SOMETHING]

def propagate(track: pd.DataFrame, z_final: int, position_field_name: str = "position", direction_field_name: str = "direction") -> pd.DataFrame:
    # [FILL HERE]
    # Compute the position of the track propagated to the Z position 'z_final'.
    # The track provides its initial position as '[position_field_name]_{x,y,z}' variables
    # and its momentum direction as '[direction_field_name]_{x,y,z}' variables
    # (replace [position_field_name] and [direction_field_name] by the values of the arguments received above).
    # Return a DataFrame containing the three variables 'position_{x,y,z}'

    return # [SOMETHING]

In [9]:
# Perform the tests
from na62.tests.test_kinematics import TestKinematics
TestKinematics().run_tests(inv_mass_fuction=invariant_mass, total_momentum_function=total_momentum, 
                           total_track_momentum_function=total_track_momentum, missing_mass_sqr_function=missing_mass_sqr, 
                           missing_mass_function=missing_mass, propagation_function=propagate)

[ERROR] Invariant mass function does not return the expected data type (pandas.Series expected)
[ERROR] Invariant mass function does not return the expected values
[ERROR] Total momentum function does not return the expected data type (pandas.Series expected)
[ERROR] Total momentum function does not return the expected values
[ERROR] Total track momentum function does not return the expected data type (pandas.Series expected)
[ERROR] Total track momentum function does not return the expected values
[ERROR] Missing mass squared function does not return the expected data type (pandas.Series expected)
[ERROR] Missing mass squared function does not return the expected values
[ERROR] Missing mass function does not return the expected data type (pandas.Series expected)
[ERROR] Missing mass function does not return the expected values
[ERROR] Propagate function does not return the expected data type (pandas.DataFrame expected)
[ERROR] Propagate function does not return the expected values


### If you are really stuck (but don't give up too quickly)

Uncomment and run the following cell to see the solution to the three-vector functions

In [10]:
# %load -s three_vectors_sum,three_vector_mag,three_vector_invert,three_vector_dot_product,three_vector_cross_product,three_vector_transverse na62/hlf.py

Uncomment and run the following cell to see the solution to the four-vector functions

In [11]:
# %load -s four_vectors_sum,four_vector_mag2,four_vector_mag,four_vector_invert na62/hlf.py

Uncomment and run the following cell to see the solution to the kinematic functions

In [12]:
# %load -s invariant_mass,total_momentum,total_track_momentum,missing_mass_sqr,missing_mass,propagate na62/hlf.py