### Processing and analyzing images
This notebook covers the basics of processing and analyzing simulated images. We'll use the output data file included with degrad tools to demonstrate.

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm #progressbar
import matplotlib.pyplot as plt
'''Setting up some rcparameters for better size axis labels'''
plt.rc('legend', fontsize=12)
plt.rc('xtick', labelsize=14)
plt.rc('ytick', labelsize=14)
plt.rc('axes', labelsize=16)
plt.rc('axes', titlesize=16)

In [None]:
df = pd.read_feather("data/4.975keV_1000Events_all.feather")

In [None]:
# Look at columns
df.columns

### Descriptions of each column (for people who use ROOT you'll hear columns being called "branches") in our dataframe:

-**nHits**: Number of primary electron tracks in the simulation

-**x**, **y**, **z**: Primary track x, y, and z coordinates

-**t**: Relative time of ionization deposit. (x,y,z) are sorted in time-order

-**flag**: Not so important. Flag indicates the physical process with which the ionization electron was generated: 1=fluorescence; 2=pair production; 3=bremsstrahlung; 0=otherwise

-**truth_dir**: (x,y,z) unit vector of ER

-**truth_energy**: Energy of the primary track

-**ionization_energy**: ionization energy in CF4 of the primary track. This is computed as 'nHits' x W where W = 34.2 eV

-**truth_theta**: truth zenith angle (w.r.t z-axis) determined by 'truth_dir'

-**truth_phi**: truth azimuthal angle (in xy plane w.r.t +x) determined by 'truth_dir'

-**drift_length**: Amount of drift simulated. Currently using a random-uniform distribution between 1cm and 2.5cm. TODO for Jeff: Make this adjustable in configuration.yaml

-**xdiff**, **ydiff**, **zdiff**: x,y, and z coordinates of ionization after applying diffusion over 'drift_length'

-**xamp**, **yamp**, **zamp**: The electrons from (xdiff,ydiff,zdiff) that align with the openings of a GEM hole

-**xcam**, **ycam**, **qcam**: (x,y) after applying gain and diffusion through the transfer gap, binned to the 2048 x 1152 camera dimensions. 'qcam' is the number of amplified electrons falling into the bin

-**xITO**, **zITO**, **qITO**: (x,z) after applying gain and diffusion through the transfer gap, *and induction gap* binned to 120 strips along x. z is quantified assuming 0.26um per bin. 'qITO' is the number of amplified electrons falling into the ITO bin
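
As a small illustration of how 'truth_theta' and 'truth_phi' relate to 'truth_dir', here is a minimal sketch. The helper name `angles_from_direction` is made up for this example, and the convention that phi is returned in (-pi, pi] via `arctan2` is an assumption; the simulation may compute these differently.

```python
import numpy as np

def angles_from_direction(direction):
    """Return (theta, phi) in radians for a 3-vector.

    theta: zenith angle measured from the +z axis
    phi:   azimuthal angle in the xy plane measured from +x
    (hypothetical helper, not part of degrad tools)
    """
    x, y, z = np.asarray(direction, dtype=float)
    norm = np.sqrt(x**2 + y**2 + z**2)  # should be 1 for a unit vector
    theta = np.arccos(z / norm)
    phi = np.arctan2(y, x)
    return theta, phi

# A track pointing along +y lies in the xy plane
theta, phi = angles_from_direction([0.0, 1.0, 0.0])
```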


### A couple of very useful pandas operations for this kind of data
1. Pandas `apply` accepts lambda expressions, which are much more concise than writing nested loops
over the dataframe. Note that `apply` still calls the function once per entry in Python, so it is not truly vectorized

2. Pandas has lots of flexibility to query data for cuts

In [None]:
# Example 1: truth_dir is a 3-vector, let's find the magnitude of 'truth_dir' in all entries in df 
# (it should be 1 of course)
df['truth_dir']

In [None]:
# Use a lambda expression to compute the magnitude

df['truth_dir'].apply(lambda x: np.sqrt(x[0]**2 + x[1]**2 + x[2]**2))

In [None]:
# apply() calls a function on each entry. The function can be a lambda, a numpy function, or a user defined function

df['truth_dir'].apply(lambda x: np.linalg.norm(x)) #same thing as the previous cell
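
For columns where every entry has the same length (like 'truth_dir'), a fully vectorized alternative is to stack the entries into a 2D array first. This is a sketch on a toy dataframe; the name `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: each entry of 'truth_dir' is a length-3 array
toy = pd.DataFrame({"truth_dir": [np.array([1.0, 0.0, 0.0]),
                                  np.array([0.0, 0.6, 0.8])]})

# Stack the per-event vectors into one (n_events, 3) array
dirs = np.stack(toy["truth_dir"].to_list())

# One numpy call computes the magnitude of every row
norms = np.linalg.norm(dirs, axis=1)
```

This only works when the entries all have the same shape. Columns like 'qcam', whose length varies from event to event, still need `apply` or a loop.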

In [None]:
df['truth_theta']

In [None]:
# Example 2: Querying data

df.query('cos(truth_theta)>0') #finds all entries with cos(theta) > 0

In [None]:
# boolean expressions are supported with & for 'and' and | for 'or'
# each clause needs to have parentheses surrounding it when using boolean expressions
# pandas also allows inequalities like a < qty <= b

#Example: ionization energy > 5.9 keV and phi in [-pi/2,pi/2]
df.query('(ionization_energy > 5.9) & (-%s/2 <= truth_phi <= %s/2)'%(np.pi,np.pi))
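
An alternative to string formatting is to reference Python variables inside the query with `@`. Below is a sketch on a toy dataframe; the names `toy`, `phi_lo`, and `phi_hi` and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ionization_energy": [4.0, 6.5, 7.2],
    "truth_phi": [0.1, -0.3, 2.5],
})

# Variables defined in Python can be referenced in query() with an @ prefix
phi_lo, phi_hi = -np.pi / 2, np.pi / 2
selected = toy.query("(ionization_energy > 5.9) & (@phi_lo <= truth_phi <= @phi_hi)")
```

This keeps full floating point precision for the cut values and avoids building the expression by hand.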

In [None]:
#Example 3: making cuts based on lambda expressions
'''query() expressions are nice (and I use them all the time) but they are somewhat limited.
Here's an example where we select all events with summed charge greater than a certain threshold'''

print(f"Raw charges:\n{df['qcam']}\n") #charge

print(f"Summed charges:\n{df['qcam'].apply(lambda x: x.sum())}\n") #summed charge

In [None]:
# now let's select all events with the summed charge over a certain threshold
threshold = 1.5e6

df[df['qcam'].apply(lambda x: x.sum()) > threshold] #here's the expression

In [None]:
#We can also define new columns freely in our dataframe

df['qSum'] = df['qcam'].apply(lambda x: x.sum())
print(df['qSum'])

### Processing images

Binned images are stored in "sparse" format as opposed to "dense" format. Sparse data includes (x,y,z,...) coordinate info for pixels > threshold value (we use 0 here). This means we ignore all 0's in our image.
This is computationally and storage efficient, as most of our image is empty (i.e. 0).

A 2048 x 1152 image of 16 bit pixels is 2048 x 1152 x 16 bits x 1 byte / 8 bits = 4.7 MB

On average, each image has around 7,000 non-zero pixels, so storing the image in sparse format gives:\
**2** x 7000 x 16 bits x 1 byte / 8 bits = 28 kB **(a factor of roughly 170 reduction over dense)**

The bolded **2** comes from the fact that we're looking at x-y images, so we have 7000 16-bit integers for x and an additional 7000 16-bit integers for y
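
The size estimate can be checked with a few lines of arithmetic. The variable names below are only for illustration, and the 7000 non-zero pixels is the rough average quoted above.

```python
n_x, n_y = 2048, 1152      # camera dimensions in pixels
bytes_per_value = 2        # 16 bits = 2 bytes

# Dense: one 16-bit value for every pixel
dense_bytes = n_x * n_y * bytes_per_value

# Sparse: one 16-bit x and one 16-bit y for each non-zero pixel
n_nonzero = 7000
sparse_bytes = 2 * n_nonzero * bytes_per_value

ratio = dense_bytes / sparse_bytes
print(f"dense: {dense_bytes / 1e6:.1f} MB, sparse: {sparse_bytes / 1e3:.0f} kB, ratio: {ratio:.0f}")
```

Like the estimate in the text, this counts only the x and y coordinates and leaves out the storage for the charge values.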

### Making images dense requires binning them.
**It's actually computationally faster to bin sparse coordinates so sparsity helps us in many ways**

We'll be using [numpy's histogram2d function for this](https://numpy.org/doc/stable/reference/generated/numpy.histogram2d.html)

In [None]:
# converting sparse coordinates to dense coordinates. Use np.histogram2d

eventnum = 0 #let's look at event number 0

event = df.iloc[eventnum] #grab the event

#Declaring each argument for clarity. Native bin size is (2048,1152). xcam and ycam
#are already binned to 0-2047 and 0-1151, respectively
#np.histogram2d returns a tuple (H, xedges, yedges); element 0 is the 2D histogram (ndarray)

im = np.histogram2d(x=event['xcam'],y=event['ycam'],weights=event['qcam'],
                   bins=(2048,1152),range=((0,2048),(0,1152)))[0].T #transpose makes x and y as they should be

In [None]:
np.shape(im) #image shape is what we want
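
To see that the dense and sparse representations carry the same information, here is a sketch of the round trip on a tiny made-up event: bin the sparse coordinates into a dense image, then recover the sparse form with `np.nonzero`. The coordinate and charge values are invented for illustration.

```python
import numpy as np

# Tiny made-up sparse event: pixel coordinates and charge per pixel
xcam = np.array([10, 11, 11, 12])
ycam = np.array([5, 5, 6, 7])
qcam = np.array([3.0, 1.0, 4.0, 2.0])

# Sparse -> dense, same call as in the cell above
im = np.histogram2d(xcam, ycam, weights=qcam,
                    bins=(2048, 1152), range=((0, 2048), (0, 1152)))[0].T

# Dense -> sparse: rows of the transposed image are y, columns are x
ys, xs = np.nonzero(im)
qs = im[ys, xs]
```

Note that `np.nonzero` returns pixels in row-major order (sorted by y, then x), which is not necessarily the order they were stored in originally.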

In [None]:
# Colormap reference https://matplotlib.org/stable/users/explain/colors/colormaps.html

#Our group uses 'jet' for historical reasons. We really should use a perceptually uniform sequential
#colormap like 'viridis' or 'plasma'
plt.imshow(im,cmap='jet')
plt.xlim(event['xcam'].min()-10,event['xcam'].max()+10)
plt.ylim(event['ycam'].min()-10,event['ycam'].max()+10)

In [None]:
'''Let's make a function to efficiently bin images'''

#bin_factor is the factor we downsample the image by. For example
#bin_factor = 4 is 4x4 binning
def create_image(x,y,q,bin_factor):
    im = np.histogram2d(x=x,y=y,weights=q,
                   bins=(2048//bin_factor,1152//bin_factor),
                        range=((0,2048),(0,1152)))[0].T #transpose makes x and y as they should be
    return im
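
A quick sanity check on the binning, sketched with randomly generated pixels rather than real events: downsampling should conserve the total charge, and a 4x4-binned image should equal the sum over 4x4 blocks of the full-resolution image. `create_image` is repeated here so the snippet runs on its own.

```python
import numpy as np

def create_image(x, y, q, bin_factor):
    # Same function as above, repeated so this snippet is self-contained
    im = np.histogram2d(x=x, y=y, weights=q,
                        bins=(2048 // bin_factor, 1152 // bin_factor),
                        range=((0, 2048), (0, 1152)))[0].T
    return im

# Made-up sparse event: random pixel coordinates and charges
rng = np.random.default_rng(0)
x = rng.integers(0, 2048, size=500)
y = rng.integers(0, 1152, size=500)
q = rng.integers(1, 100, size=500).astype(float)

full = create_image(x, y, q, bin_factor=1)    # shape (1152, 2048)
binned = create_image(x, y, q, bin_factor=4)  # shape (288, 512)

# Summing 4x4 blocks of the full image reproduces the binned image
block_sum = full.reshape(288, 4, 512, 4).sum(axis=(1, 3))
```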

### Slick way to use lambda expressions to create images

In [None]:
%%time
'''Create 4x4-binned images for all tracks'''
tqdm.pandas() #for progressbar

#axis = 1 for row-wise operations over the dataframe
ims = df.progress_apply(lambda row: create_image(row['xcam'],row['ycam'],row['qcam'],bin_factor = 4), axis=1)

### More syntactically clear way to create images
**Choose either this way or the way above to make your images but not both. Memory can fill up very quickly storing too many images**

As a note: I use the "htop" command in my terminal to check my memory and CPU usage. I do this regularly to see whether a computation I'm running is using too much memory (sometimes recasting arrays as 32-bit or 16-bit datatypes can really help reduce memory usage at the expense of precision). Mac users can install htop with Homebrew (*brew install htop*), and Linux users can use the package manager for their distro. On Windows, Task Manager (reachable via Ctrl + Alt + Del) shows the same information.
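
To get a feel for the numbers, here is a rough estimate of the memory needed to hold 1000 4x4-binned images at two precisions. It uses an empty array of the right shape rather than real data, and the variable names are only for illustration.

```python
import numpy as np

# One 4x4-binned image; np.histogram2d with float weights returns float64
im64 = np.zeros((1152 // 4, 2048 // 4), dtype=np.float64)
im32 = im64.astype(np.float32)  # recast to 32-bit floats

n_images = 1000
gb64 = im64.nbytes * n_images / 1e9  # memory for all images in float64
gb32 = im32.nbytes * n_images / 1e9  # memory for all images in float32
print(f"float64: {gb64:.2f} GB, float32: {gb32:.2f} GB")
```

Holding both `ims` and `ims2` in memory at once doubles whichever figure applies.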

In [None]:
%%time
'''Create 4x4-binned images for all tracks but easier to read'''
ims2 = [] #will be the same as ims above but we'll do it with a for loop
for i in tqdm(range(0,len(df))):
    tmp = df.iloc[i] #grab ith entry
    ims2.append(create_image(tmp['xcam'],tmp['ycam'],tmp['qcam'],bin_factor = 4))

In [None]:
'''Sanity check that ims and ims2 are the same'''
np.abs((np.array(ims.to_list())-np.array(ims2))).max()

### One other note: This is *not* recommended but you could always add the dense images to your pandas dataframe with df['ims'] = ims or something similar. Generally speaking, it's better to just work with the sparse images and create dense images when needed

### Now let's plot our images with their corresponding keypoints on top of them

In [None]:
def plot_with_truth(event_num,bin_fact,zoom = True):
    tmp = df.iloc[event_num]
    '''Plot image'''
    im = create_image(tmp['xcam'],tmp['ycam'],tmp['qcam'],bin_factor=bin_fact) #use our create_image function
    plt.imshow(im,cmap='jet')
    #plt.imshow(ims[event_num],cmap='jet')
    '''make colorbar for image'''
    cbar = plt.colorbar()
    cbar.set_label('Intensity')
    '''plot truth'''
    scale_factor = (2048//bin_fact)/8 #conversion factor between cm and pixels. Camera pixels are squares
    '''Primary tracks are centered so we need to shift them in binned coordinates to the center'''
    shiftx = 2048//bin_fact/2
    shifty = 1152//bin_fact/2 #ylength is 1152 pixels
    plt.plot(tmp['x']*scale_factor+shiftx,tmp['y']*scale_factor+shifty,'o',color='k',markersize=2)
    '''Labels'''
    plt.xlabel('x [pixels]')
    plt.ylabel('y [pixels]')
    if zoom:
        plt.xlim((tmp['xcam']//bin_fact).min()-5,(tmp['xcam']//bin_fact).max()+5)
        plt.ylim((tmp['ycam']//bin_fact).min()-5,(tmp['ycam']//bin_fact).max()+5)
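
The coordinate conversion inside `plot_with_truth` can be pulled out into a small helper, sketched below. The helper name `cm_to_binned_pixels` is made up, and the reading that 2048 pixels span 8 cm is inferred from the `scale_factor` line above rather than stated anywhere else.

```python
def cm_to_binned_pixels(x_cm, y_cm, bin_fact):
    """Convert centered truth coordinates (cm) to binned pixel coordinates.

    Hypothetical helper mirroring the arithmetic in plot_with_truth.
    Assumes 2048 pixels span 8 cm and that pixels are square.
    """
    scale = (2048 // bin_fact) / 8    # binned pixels per cm
    shift_x = 2048 // bin_fact / 2    # image center in x
    shift_y = 1152 // bin_fact / 2    # image center in y
    return x_cm * scale + shift_x, y_cm * scale + shift_y

# The origin of the truth coordinates maps to the image center
center = cm_to_binned_pixels(0.0, 0.0, bin_fact=1)
# A point 1 cm along +x at 4x4 binning
shifted = cm_to_binned_pixels(1.0, 0.0, bin_fact=4)
```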

### Plot several images with keypoints on top of them

In [None]:
plt.figure(figsize = (16,24))
for i in range(1,9):
    plt.subplot(4,2,i)
    plot_with_truth(event_num = 0, bin_fact = i, zoom = True)
    plt.title('%s x %s binning'%(i,i))
plt.tight_layout()
plt.show()

### Now we have a sense of how to bin images. Next steps are seeing if we can create images with corresponding text files for keypoints. 

We're using [Ultralytics' YOLOv8 package](https://github.com/ultralytics/ultralytics) to train our models. For keypoint detection we use the YOLOv8-pose family of models. The train/validation/test dataset format guide can be found here: https://docs.ultralytics.com/datasets/pose/

The relevant part for us is that each label will be a text file with lines formatted like this:

[**class-index**] [**x**] [**y**] [**width**] [**height**] [**px1**] [**py1**] [**px2**] [**py2**] ... [**pxn**] [**pyn**]
    
where

**class-index**: Integer index for the class of track in the data. If we had a mixed dataset with ERs, NRs, protons, etc. we would use different values for each class. Here we have just ERs, so we can use 0 as the value for class index

**x**: The x-coordinate center of the bounding box of the track-image

**y**: The y-coordinate center of the bounding box of the track-image

**width**: Width of the bounding box

**height**: Height of the bounding box

**pxj**: x-coordinate of the jth truth keypoint (truth ionization electron)

**pyj**: y-coordinate of the jth truth keypoint (truth ionization electron)

### Some additional notes

1. In this sample all images have one track. Thus the label text file corresponding to each image will include only a single line with these quantities

2. We want the quantities in our label file to be **scale invariant**. What I usually do is define each coordinate as a fraction of the pixel dimension of the image. For example pixel (842,921) on the image would be stored as (842/2048 , 921 / 1152) [because the image dimensions are (2048,1152)]

3. Since we know the truth boundary of the simulated tracks, it shouldn't be too difficult to generate accurate bounding boxes

4. **What we ultimately want is each of these 1000 images saved as a png in a folder called images/ and then an associated labels text file for each image in a folder called labels/**
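
As a starting point, here is a minimal sketch of building one label line in the format above, with every quantity normalized by the image dimensions. The function name `make_label_line` is made up, and taking the bounding box from the extent of the non-zero camera pixels is an assumption; a box derived from the truth track would work the same way. Keypoints are expected in (unbinned) pixel coordinates.

```python
import numpy as np

def make_label_line(xcam, ycam, kp_x, kp_y, width=2048, height=1152, class_index=0):
    """Build one label line: class-index x y width height px1 py1 ... pxn pyn.

    Hypothetical helper. The bounding box is assumed to be the extent of the
    non-zero pixels (xcam, ycam); all values are fractions of the image size.
    """
    xcam, ycam = np.asarray(xcam), np.asarray(ycam)
    # +1 so the box covers the full width of the last pixel
    x_min, x_max = xcam.min(), xcam.max() + 1
    y_min, y_max = ycam.min(), ycam.max() + 1
    box = [(x_min + x_max) / 2 / width,   # box center x
           (y_min + y_max) / 2 / height,  # box center y
           (x_max - x_min) / width,       # box width
           (y_max - y_min) / height]      # box height
    # Interleave keypoints as px1 py1 px2 py2 ...
    kps = np.column_stack([np.asarray(kp_x, dtype=float) / width,
                           np.asarray(kp_y, dtype=float) / height]).ravel()
    values = [f"{v:.6f}" for v in [*box, *kps]]
    return " ".join([str(class_index), *values])

# Made-up two-pixel track with a single keypoint at the image center
line = make_label_line(xcam=[1024, 1025], ycam=[576, 577],
                       kp_x=[1024.0], kp_y=[576.0])
```

Each such line would then be written to its own text file in labels/, with the matching image saved under the same base name in images/.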