<a href="https://colab.research.google.com/github/ianomunga/SiFT/blob/main/Embeddings_for_Track_Reconstruction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Importing the necessary libraries:
# NumPy for matrix-based numerical processing
# pandas for tabular dataset preprocessing
# os for directory operations

import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
import os
from keras.preprocessing.image import ImageDataGenerator 

# keras
from keras.models import Sequential, Model
from keras.layers import Conv2D, Dense, Flatten, MaxPool2D, Dropout, Input
from keras.utils import plot_model
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.optimizers import Adam

#metrics & model selection from sklearn
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# plotting
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

In [3]:
!git clone https://github.com/LAL/trackml-library

Cloning into 'trackml-library'...
remote: Enumerating objects: 222, done.
remote: Total 222 (delta 0), reused 0 (delta 0), pack-reused 222
Receiving objects: 100% (222/222), 43.80 KiB | 533.00 KiB/s, done.
Resolving deltas: 100% (133/133), done.


In [4]:
%cd trackml-library

/content/trackml-library


In [5]:
!pip install .

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing /content/trackml-library
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: trackml
  Building wheel for trackml (setup.py) ... done
  Created wheel for trackml: filename=trackml-3-py2.py3-none-any.whl size=13525 sha256=847a10fa4d31b11451ec6c63e0f1da948d363a9020f4f47f91c492a51b619aec
  Stored in directory: /root/.cache/pip/wheels/e6/ed/27/1d396a8f852fff410fafe7a696bec06bbfb517c37f60138673
Successfully built trackml
Installing collected packages: trackml
Successfully installed trackml-3


In [10]:
!pip show trackml

Name: trackml
Version: 3
Summary: TrackML utility library
Home-page: https://github.com/LAL/trackml-library
Author: Moritz Kiehn
Author-email: msmk@cern.ch
License: UNKNOWN
Location: /usr/local/lib/python3.8/dist-packages
Requires: numpy, pandas
Required-by: 


# Baseline Embeddings with Word2Vec

The idea here is to try embedding approaches such as word2vec, which are most commonly used to encode text tokens as vectors. Even though the event data read from the TrackML library is numerical, I'd like to experiment with the dense vectors obtained this way so that I have a baseline to evaluate the energetic encodings against.
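
Word2vec learns one vector per distinct token, so continuous coordinates have to be turned into a finite vocabulary first. The following is a minimal sketch of one way to do that, assuming equal-width binning; the helper names, the coordinate range, the bin count, and the toy hit values are all hypothetical and not part of the TrackML library.

```python
from bisect import bisect_right


def make_bin_edges(lo, hi, n_bins):
    """Return the n_bins - 1 interior edges that split [lo, hi] into equal-width bins."""
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def to_token(name, value, edges):
    """Map a continuous value to a discrete string token such as 'x_bin3'."""
    return f"{name}_bin{bisect_right(edges, value)}"


# Hypothetical coordinate range and bin count, chosen only for illustration.
edges = make_bin_edges(-1000.0, 1000.0, 8)

# Two made-up hits (x, y, z); real values would come from the `hits` table.
toy_hits = [(-64.4, -7.2, -1502.5), (-55.3, 0.6, -1502.5)]

# One "sentence" of string tokens per hit, which is the input shape word2vec expects.
sentences = [
    [to_token(axis, value, edges) for axis, value in zip("xyz", hit)]
    for hit in toy_hits
]
print(sentences)
```

Coarser bins give a smaller vocabulary with more occurrences per token, at the cost of spatial resolution; values outside the chosen range fall into the first or last bin.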

In [None]:
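from trackml.dataset import load_event
import gensim

# NOTE: load_event expects the common filename prefix of one event's CSV files
# (for example 'path/to/event000000123'), not a Python module. The path below is
# carried over unchanged and still needs to point at the downloaded event data.
hits, cells, particles, truth = load_event('/content/trackml-library/build/lib/trackml/dataset.py')
#having trouble with the exact filepaths in colab, will fix later>>>


# Preprocess the hit data as you would text: each hit becomes a "sentence" of string tokens,
# because Word2Vec needs hashable tokens rather than pandas Series.
# Rounding each value to a whole number is an illustrative choice to keep the vocabulary small.
feature_cols = ['x', 'y', 'z', 'volume_id', 'layer_id']
sentences = []
for row in hits[feature_cols].head(1000).itertuples(index=False):
    sentences.append([f"{col}={float(val):.0f}" for col, val in zip(feature_cols, row)])

# Training word2vec on the lists of sentences
# (vector_size is the gensim >= 4 name; gensim 3.x calls this parameter size)
model = gensim.models.Word2Vec(sentences, vector_size=5, window=5, min_count=1, workers=4)

# Iterating over the first 1000 hits of the event and converting each one into a dense vector representation
embeddings = []
for sentence in sentences:
    # model.wv[token] looks up the learned vector for a token
    hit_embedding = [model.wv[token] for token in sentence]
    embeddings.append(hit_embedding)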

# Numerical Embeddings using Normalized Vectors' Cosine Similarity Scores

This should perform better than just word2vec, since it caters to the numerical nature of the event data typically recorded in HEP experiments at CERN, which is what event data from TrackML needs. This section defines an approach for embedding numerical event data: each value is mapped to a vector of fixed dimensionality and unit length, and the resulting vectors are compared by their cosine similarity. This will serve as a preliminary analogue for performing energetic embeddings.
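
As a reminder of what the similarity score measures, here is a minimal sketch of cosine similarity between L2-normalized vectors in plain Python. The function names and the toy vectors are illustrative assumptions, not part of the notebook's later code.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit length; a zero vector has no direction, so reject it."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """Dot product of the two unit-length versions of a and b."""
    a_hat, b_hat = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_hat, b_hat))


u = [3.0, 4.0]
v = [6.0, 8.0]   # same direction as u, twice the length
w = [-4.0, 3.0]  # perpendicular to u

print(cosine_similarity(u, v))  # close to 1: direction matters, magnitude does not
print(cosine_similarity(u, w))  # close to 0: orthogonal vectors
```

Because every vector is normalized first, the score depends only on the angle between the vectors, which is why the embedding that follows places each number at an angle on a unit sphere.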

In [11]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import math

%matplotlib inline

In [12]:
# NOTE: the class header was missing from this cell; the name NumericalEmbedding is a placeholder
class NumericalEmbedding:
    def __init__(self, d=2, min_bound=0, max_bound=100, norm="l2"):
        self.d = d  # embedding dimensionality
        self.min_bound = min_bound
        self.max_bound = max_bound
        self.norm = norm  # embeddings are constrained to unit length
        self.M = np.random.normal(0, 1, (self.d, self.d))
        self.Q, self.R = np.linalg.qr(self.M, mode="complete")  # Use QR decomposition for the orthonormal basis, Q

    def __linear_mapping(self, num):
        # Map num from [min_bound, max_bound] linearly onto an angle in [0, pi]
        norm_diff = (num - self.min_bound) / abs(self.max_bound - self.min_bound)
        theta = norm_diff * math.pi
        return theta

    def make_embedding(self, num):
        r = 1
        theta = self.__linear_mapping(num)
        if self.d == 2:
            polar_coord = np.array([r*math.cos(theta), r*math.sin(theta)])
        elif self.d > 2:
            # The last component uses sin(theta)**(d-1) so that the vector has unit length
            polar_coord = np.array([math.sin(theta)**(dim-1) * math.cos(theta) if dim < self.d else math.sin(theta)**(self.d-1) for dim in range(1, self.d+1)])
        else:
            raise ValueError("Wrong value for `d`. `d` should be greater than or equal to 2.")

        embedding = np.dot(self.Q, polar_coord)  # Numerical embedding for `num`

        return embedding
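
To see what this embedding buys, the following sketch works through the unit-circle case (d = 2) in plain Python. It assumes the default bounds of 0 and 100, and it uses a fixed rotation by a hypothetical angle `phi` as a stand-in for the random orthonormal matrix Q; the function names are illustrative only.

```python
import math


def angle_for(num, min_bound=0.0, max_bound=100.0):
    """Linearly map num in [min_bound, max_bound] to an angle in [0, pi]."""
    return (num - min_bound) / (max_bound - min_bound) * math.pi


def embed_2d(num, phi=0.0):
    """Unit-circle embedding of num, rotated by a fixed angle phi.

    The rotation plays the role of the orthonormal matrix Q.
    """
    theta = angle_for(num)
    x, y = math.cos(theta), math.sin(theta)
    return (math.cos(phi) * x - math.sin(phi) * y,
            math.sin(phi) * x + math.cos(phi) * y)


def dot(a, b):
    """Dot product; equals cosine similarity here because both vectors are unit length."""
    return sum(p * q for p, q in zip(a, b))


sim_close = dot(embed_2d(10), embed_2d(20))  # angles differ by 0.1 * pi
sim_far = dot(embed_2d(10), embed_2d(90))    # angles differ by 0.8 * pi
sim_rotated = dot(embed_2d(10, phi=1.2), embed_2d(20, phi=1.2))

print(sim_close, sim_far, sim_rotated)
```

For d = 2 the similarity of two numbers is the cosine of the difference of their angles, so nearby numbers score close to 1 and numbers at opposite ends of the range score close to -1. An orthogonal transform preserves dot products, so applying the same Q to every embedding changes the coordinates but not the similarities.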