## Table of Contents
<a id='TOC'></a>

0. [Imports](#imports)
1. [Run range and interesting keys](#runs)
2. [Create Dataframe](#df)

## Imports
<a id = 'imports'></a>

Go to [TOC](#TOC)

In [1]:
import numpy as np
import cygno as cy
import matplotlib.pyplot as plt
import pandas as pd
import os
import uproot
from tqdm.notebook import tqdm
from time import process_time 

import itertools
import pickle

# custom params for plots
plt.rcParams['figure.titlesize'] = 18
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = .5
plt.rcParams['grid.linestyle'] = '--'
plt.rcParams['figure.figsize'] = 12, 7
plt.rcParams['figure.subplot.wspace'] = 0.2
plt.rcParams['figure.subplot.hspace'] = 0.4

pd.set_option('display.max_columns', None) 

## Run range and interesting keys
<a id = 'runs'></a>

Go to [TOC](#TOC)

- First, set the range of runs to include in the final dataframe. A single run also works, but `runs` must be iterable for the code below to work properly.
- Next, specify the features to keep as dataframe columns by editing the `keys_to_save` list.
- Finally, set the path where the reco files are stored and the path where the final dataframe will be saved.
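
Since the loop further down iterates over `runs`, a single run number has to be wrapped so that it is still iterable. A minimal sketch of one way to do this, assuming `np.atleast_1d` is acceptable here; the helper name `as_run_array` is hypothetical and not part of the original notebook:

```python
import numpy as np

def as_run_array(runs):
    """Return run numbers as a 1-D integer array, even for a single run."""
    return np.atleast_1d(np.asarray(runs, dtype=int))

# A single run and a range of runs both become iterable 1-D arrays.
single = as_run_array(34877)
several = as_run_array(np.arange(34877, 34880))
print(len(single), len(several))
```

With this, `for r in as_run_array(34877)` behaves the same way as looping over a full range.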

In [2]:
##TEST
runs = np.arange(34877, 34880, 1)

print(f"Preparing to load {len(runs)} runs...")

Preparing to load 3 runs...


In [3]:
keys_to_save = ['run', 'event', 'pedestal_run', 'cmos_integral', 'cmos_mean',
                'cmos_rms', 't_DBSCAN', 't_variables', 'lp_len', 't_pedsub',
                't_saturation', 't_zerosup', 't_xycut', 't_rebin', 't_medianfilter',
                't_noisered', 'nSc', 'sc_size', 'sc_nhits', 'sc_integral',
                'sc_corrintegral', 'sc_rms', 'sc_energy', 'sc_pathlength',
                'sc_redpixIdx', 'nRedpix', 'sc_theta', 'sc_length', 'sc_width',
                'sc_longrms', 'sc_latrms', 'sc_lfullrms', 'sc_tfullrms',
                'sc_lp0amplitude', 'sc_lp0prominence', 'sc_lp0fwhm', 'sc_lp0mean',
                'sc_tp0fwhm', 'sc_xmean', 'sc_ymean', 'sc_xmax', 'sc_xmin', 'sc_ymax',
                'sc_ymin', 'sc_pearson', 'sc_tgaussamp', 'sc_tgaussmean',
                'sc_tgausssigma', 'sc_tchi2', 'sc_tstatus', 'sc_lgaussamp',
                'sc_lgaussmean', 'sc_lgausssigma', 'sc_lchi2', 'sc_lstatus'#,
                #'Lime_pressure', 'Atm_pressure', 'Lime_temperature', 'Atm_temperature',
                #'Humidity'
               ]

# keys_to_save = ['run', 'event', 'nSc', 'sc_nhits', 'sc_integral', 'sc_energy', 'sc_length', 
#                 'sc_xmean', 'sc_ymean', 'sc_xmax', 'sc_xmin', 'sc_ymax', 'sc_ymin',
#                ]


In [4]:
recopath = '/jupyter-workspace/cloud-storage/cygno-analysis/RECO/Winter23/'
savepath = '/jupyter-workspace/cloud-storage/zappater/CMOSEvents/'

## Create Dataframe
<a id = 'df'></a>

Go to [TOC](#TOC)

First, download the most recent logbook (e.g. from Grafana) and put it in the same folder as this code. It provides information about each run, which is useful, for example, to exclude pedestal runs from the dataframe if desired.
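
The run selection described above can be sketched on a toy logbook. The column names `run_number`, `pedestal_run` and `HV_STATE` are the ones used later in this notebook; the values in the table and the helper `is_data_run` are invented for illustration only:

```python
import pandas as pd

# Toy logbook: the values are invented, only the column names come from the notebook.
logbook = pd.DataFrame({
    "run_number":   [100, 101, 102],
    "pedestal_run": [0,   1,   0],
    "HV_STATE":     [1,   0,   1],
})

def is_data_run(logbook, run):
    """True if `run` is listed in the logbook and not flagged as a pedestal run."""
    rows = logbook[logbook["run_number"] == run]
    if rows.empty:  # run missing from the logbook: nothing to decide on
        return False
    return bool(rows["pedestal_run"].iloc[0] == 0)

# Run 101 is a pedestal run and run 103 is not in the logbook, so both are dropped.
data_runs = [r for r in (100, 101, 102, 103) if is_data_run(logbook, r)]
print(data_runs)
```

Checking for an empty selection first avoids an `IndexError` when a requested run is not present in the logbook.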

In [5]:
## load the logbook; the 'low_memory=False' option avoids an annoying warning
## the copy in this folder is updated to 2024-01-05 14:27:18
dfinfo_tot = pd.read_csv('logbook.csv', low_memory=False)

FileNotFoundError: [Errno 2] No such file or directory: 'logbook.csv'

In [6]:
# Data in the following range was taken with HV off but not marked as pedestal, with HV turned on from time to time as explained later.
# In the code I excluded the data with HV on in this range of runs, since it only adds noise due to the iron source spots.

runs_ped = np.arange(20426, 20816, 1) # line 22 of the RUN3-summary in the datalogbook sheet

In [7]:
## OPEN RECO FILES AND SAVE TO CSV

## the code uses exception handling to avoid crashing on zombie/corrupted or missing files;
## since it is useful to know which files are affected, we store their run numbers in two lists.
not_found = []
corrupted = []

df_RUN3 = pd.DataFrame()

## loop over the run range
for r in tqdm(runs):
    # create the filename to be read
    filename = f"{recopath}reco_run{int(r)}_3D.root"
    #dfinfo = cy.run_info_logbook(r, sql=True, verbose=False) # this was here when Stefano sent me this code, but I never tried to implement this line of code
    
    # extract the part of the logbook relative to the current run
    dfinfo = dfinfo_tot[dfinfo_tot['run_number']==r].copy()
    # skip runs that are missing from the logbook, otherwise .values[0] below raises an IndexError
    if dfinfo.empty:
        print(f"\nRun {r} is missing from the logbook, skipping")
        continue
    
    ## there is a particular set of pedestal runs where the HV was turned on from time to time to check the GEM response (line 22 of the RUN3-summary in the datalogbook sheet);
    ## in that range the interesting files are the pedestals with HV off, so the HV-on runs are excluded by the two following lines:
    if np.isin(r, runs_ped) and dfinfo['HV_STATE'].values[0] == 1:
        continue
    
    # only for actual data runs (pedestal_run=0) do we proceed to open and save the files.
    if dfinfo['pedestal_run'].values[0] == 0:  # optional extra conditions: os.path.exists(filename), dfinfo['source_type'].values[0] == 0
        print('Opening reco file of run {:05d}...'.format(r), end = '\r')
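        try:
            file = uproot.open(filename+":Events")
        except Exception:
            # the file is missing or cannot be opened
            print(f"\nCould not open the reco file of run {r}")
            not_found.append(r)
            continue

        try:
            datadf = file.arrays(keys_to_save, library="pd")
        except Exception:
            # the file opens but the requested branches cannot be read
            print(f"\nCould not read the requested branches of run {r}")
            corrupted.append(r)
            continue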
        
        # at the end of each iteration we concatenate the current dataframe to the big one and delete the current one from memory.
        df_RUN3 = pd.concat([df_RUN3, datadf], ignore_index=True)
        del datadf
    #print(len(df_RUN3))
        

# finally we write the dataframe as a CSV inside a zip archive in the selected location.
compression_opts = dict(method='zip', archive_name='./dfRUN3.csv')  
df_RUN3.to_csv(f'{savepath}dfRUN3_{runs[0]}_{runs[-1]}_test.zip', index=False, compression=compression_opts)
print("\nEverything was saved!")

  0%|          | 0/3 [00:00<?, ?it/s]

NameError: name 'dfinfo_tot' is not defined
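
As a side note on the loop above, concatenating onto a growing dataframe copies the accumulated data at every iteration. One possible alternative, sketched here with invented toy frames rather than real reco files, is to collect the per-run dataframes in a list, concatenate once at the end, and then check that the zipped CSV reads back correctly. The file name and `archive_name` below are placeholders, not the paths used in this notebook:

```python
import os
import tempfile
import pandas as pd

# Toy per-run dataframes standing in for the output of file.arrays(...).
frames = [
    pd.DataFrame({"run": [1, 1], "event": [0, 1], "sc_integral": [10.0, 20.0]}),
    pd.DataFrame({"run": [2], "event": [0], "sc_integral": [30.0]}),
]

# Single concatenation at the end instead of one per iteration.
df_all = pd.concat(frames, ignore_index=True)

with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "df_test.zip")
    # Same zip-compression options as in the cell above, with a placeholder name.
    df_all.to_csv(out, index=False,
                  compression=dict(method="zip", archive_name="df_test.csv"))
    # pandas infers the zip compression from the file extension when reading back.
    df_back = pd.read_csv(out)

print(df_back.shape)
```

Reading the archive back right after writing it is a cheap check that the saved file is usable before moving on to a larger run range.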