# HADES Data Tutorial

George Marshall, UCL.  Presented at [LEGEND Software Tutorial, Nov. 2021](https://indico.legend-exp.org/event/561/)

The most up-to-date data is located at MPIK in the ggmarsh-test-v03 user production. Hopefully in the next few months a new version will be produced, probably at Gran Sasso, and copied elsewhere; this will be the first reference production.

## The LEGEND Production Environment

A full set of tutorials on the production environment can be found here: https://github.com/mmatteo/legend-analysis-tutorials . 
The GitHub page, which provides more detail on all the commands, is here: https://github.com/legend-exp/legend-prodenv .

Once you have navigated to the production environment at `/lfs/l1/legend/legend-prodenv`, the setup file can be sourced using:

`source setup.sh`

This gives access to all the prodenv commands, a full description of which can be found at the links above.

Also found here are the two types of production: reference productions and user productions. Currently only user productions are being made, but hopefully we will soon have a stable full production pipeline and will then produce the first reference production. The idea is that reference productions will be run using stable versions of pygama, whereas the user productions are more experimental.

This tutorial will focus on the data found at `/lfs/l1/legend/legend-prodenv/prod-usr/ggmarsh-test-v03`.

To load the software for this production cycle we can use:
`prodenv-load.sh config.json`

This will put us in the container for this environment, with the version of pygama that was used to produce the data. As mentioned earlier, for user productions this may not be a stable version of pygama.

## A Production Cycle

By default there are 8 directories in each production cycle. The main ones are described below.

`dataflow` contains all the Snakemake code which controls the data production.

`gen` is where all the produced data is stored. Each detector has a separate directory, which contains one subdirectory per production tier. At the moment data is produced up to and including tier2.

`genpar` contains parameters produced during the data production, such as the pole zero constant and the calibration constants.

`log` contains all the logs for the data production.

`meta` contains all the config files for the data production.

`src` contains all the other software, such as the pygama version.

`venv` is the virtual environment.
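
The layout above can be turned into paths programmatically. The following is a minimal sketch, assuming the `gen/<detector>/<tier>/<datatype>` layout implied by the data path used later in this tutorial. `tier_dir` is a hypothetical helper for illustration and is not part of prodenv or pygama.

```python
import os

# Root of the user production discussed in this tutorial.
PROD_ROOT = "/lfs/l1/legend/legend-prodenv/prod-usr/ggmarsh-test-v03"


def tier_dir(det, tier, datatype, prod_root=PROD_ROOT):
    """Return the directory holding one tier of data for one detector.

    Hypothetical helper: assumes the gen/<det>/<tier>/<datatype> layout.
    """
    return os.path.join(prod_root, "gen", det, tier, datatype)


print(tier_dir("V07646A", "tier2", "th_HS2_lat_psa"))
```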

## Data Fields

Tier1 data contains the following fields:

```console
$ raw.lh5
├── stat
└── raw
     ├── baseline    # FPGA-estimated baseline
     ├── channel     # right now, index of the trigger (trace)
     ├── energy      # FPGA-estimated energy
     ├── ievt        # index of event
     ├── numtraces   # number of triggered FADC channels
     ├── packet_id   # packet index in file
     ├── timestamp   # time since beginning of file
     ├── tracelist   # list of triggered FADC channels
     ├── waveform    # digitizer data
     │   ├── dt      # sampling period (ns) - 16
     │   ├── t0
     │   └── values  # array holding the waveform samples FADC units
     ├── wf_max      # ultra-simple np.max energy estimation
     └── wf_std      # ultra-simple np.std noise estimation

```
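
As a small illustration of how the `waveform` fields fit together, the sample times can be rebuilt from `t0` and `dt`. This is a sketch under the assumption that `t0` and `dt` are in the same units (ns) and that the samples are evenly spaced. `waveform_times` is a hypothetical helper, not a pygama function.

```python
import numpy as np


def waveform_times(t0, dt, n_samples):
    """Return the time of each sample, in the same units as t0 and dt."""
    return t0 + dt * np.arange(n_samples)


# Toy example: 5 samples with the 16 ns sampling period quoted above
print(waveform_times(0.0, 16.0, 5))
```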

Whereas tier 2 contains a few more:

```console
$ dsp.lh5
├── dsp_info
└── raw
     ├── A_max
     ├── QDrift
     ├── bl_intercept
     ├── bl_mean
     ├── bl_slope
     ├── bl_std
     ├── cuspEftp
     ├── cuspEftp_ctc
     ├── cuspEmax
     ├── cuspEmax_ctc
     ├── dt_eff
     ├── pz_mean
     ├── pz_slope
     ├── pz_std
     ├── tp_01
     ├── tp_0_est
     ├── tp_0_trap
     ├── tp_10
     ├── tp_100
     ├── tp_20
     ├── tp_50
     ├── tp_80
     ├── tp_90
     ├── tp_95
     ├── tp_99
     ├── tp_max
     ├── tp_min
     ├── trapEftp
     ├── trapEftp_ctc
     ├── trapEmax
     ├── trapEmax_ctc
     ├── trapTmax
     ├── wf_max
     ├── wf_min
     ├── zacEftp
     ├── zacEftp_ctc
     ├── zacEmax
     └── zacEmax_ctc
```

A full description of how all this data was produced can be found here: https://indico.legend-exp.org/event/698/contributions/3409/attachments/1852/2844/Processing_Chain_lnote_v1.pdf

bl_mean, bl_std, bl_slope and bl_intercept are parameters related to the baseline, namely its mean, RMS, slope and intercept.

pz_mean, pz_slope and pz_std are the same quantities but for the flat top.

tp_01, tp_0_est, tp_0_trap, tp_10, tp_100, tp_20, tp_50, tp_80, tp_90, tp_95, tp_99, tp_max, tp_min are all timepoint parameters. tp_0_est and tp_0_trap are two different estimates of the start point of the signal. tp_max and tp_min are the timepoints of the maximum and minimum of the waveform respectively. All the others are the times at which the signal reaches the relevant percentage of the energy.
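
One common use of the percentage timepoints is to form a rise time, for example the difference between tp_90 and tp_10. The sketch below uses made-up numbers standing in for those two columns; `rise_time` is a hypothetical helper, and the result is in whatever time units the production stores.

```python
import numpy as np


def rise_time(tp_low, tp_high):
    """Difference between two timepoints, element-wise per event."""
    return np.asarray(tp_high, dtype=float) - np.asarray(tp_low, dtype=float)


# Toy values standing in for the tp_10 and tp_90 columns of three events
tp_10 = [1000.0, 1010.0, 995.0]
tp_90 = [1200.0, 1350.0, 1100.0]
print(rise_time(tp_10, tp_90))
```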

cuspEftp, cuspEftp_ctc, cuspEmax, cuspEmax_ctc,
trapEftp, trapEftp_ctc, trapEmax, trapEmax_ctc, trapTmax,
zacEftp, zacEftp_ctc, zacEmax, zacEmax_ctc
are all energy estimates. There are three different filters: trap, zac and cusp. All filters have been optimised.
For each filter there are two extraction methods: the max and the fixed time pickoff (ftp).
For the moment I would recommend using the max, as it performs better for this data.
Lastly, there are two variants, ctc and no ctc, depending on whether the charge trapping correction is included.
trapTmax is a fixed length trap filter used for timing estimates.
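
The naming scheme of the energy fields can be reproduced by combining filter, extraction method and correction. This is only a sketch of the convention described above; trapTmax is left out because it does not follow the pattern.

```python
from itertools import product

filters = ["cusp", "trap", "zac"]   # the three filter shapes
methods = ["Emax", "Eftp"]          # max and fixed time pickoff
suffixes = ["", "_ctc"]             # without / with charge trapping correction

energy_fields = [f"{f}{m}{s}" for f, m, s in product(filters, methods, suffixes)]
print(energy_fields)
```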

Other parameters:
A_max is the maximum current.
QDrift is a parameter related to the uncollected charge. When divided by the energy (trapTmax) it gives an effective drift time, dt_eff.
Finally, wf_max and wf_min are the maximum and minimum of the waveform.
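
The relation between QDrift, trapTmax and dt_eff can be written as a one-line division. dt_eff is already stored in tier 2, so the sketch below is purely illustrative. `effective_drift_time` is a hypothetical helper, and returning NaN for events with zero energy is a choice made here, not necessarily what the production does.

```python
import numpy as np


def effective_drift_time(qdrift, trap_tmax):
    """Return QDrift / trapTmax, with NaN where trapTmax is zero."""
    qdrift = np.asarray(qdrift, dtype=float)
    trap_tmax = np.asarray(trap_tmax, dtype=float)
    out = np.full_like(qdrift, np.nan)
    np.divide(qdrift, trap_tmax, out=out, where=trap_tmax != 0)
    return out


# Toy values: the second event has zero energy and so gives NaN
print(effective_drift_time([100.0, 50.0, 10.0], [20.0, 0.0, 5.0]))
```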

## Loading in Data

Load in packages

In [None]:
import pygama.lh5 as lh5
import os,json
import matplotlib.pyplot as plt
import numpy as np

In [None]:
%matplotlib notebook

Find the files, in this case for a Th lateral scan:

In [None]:
# TODO: update these with paths appropriate to NERSC, ideally this directory:
# /global/cfs/cdirs/m2676/data/hades/V07646A/tier2/th_HS2_lat_psa
# George will edit this soon & remove this message.  --Clint

det = 'V07646A'
datatype = 'th_HS2_lat_psa'
run = '001'
datapath = '/lfs/l1/legend/legend-prodenv/prod-usr/ggmarsh-test-v03/gen/'+det+'/tier2/'+datatype

In [None]:
# List the tier 2 files, sorted by name, with their full paths
files = sorted(os.listdir(datapath))
files = [os.path.join(datapath, f) for f in files]
# Keep only the first five files
files = files[:5]

In [None]:
sto=lh5.Store()

We can use this command to check what is in the file

In [None]:
sto.ls(files[0], 'raw/')

Load data in

In [None]:
uncal = lh5.load_nda(files, ['cuspEmax','cuspEmax_ctc'], 'raw')

In [None]:
print(uncal)

Or we can use dataframes instead:

In [None]:
df = lh5.load_dfs(files, ['cuspEmax','cuspEmax_ctc'], 'raw')

In [None]:
df.head()

In [None]:
plt.figure()
counts,bins,bars = plt.hist(uncal['cuspEmax'], bins=10000, histtype='step', label='No Charge Trapping Correction')
plt.hist(uncal['cuspEmax_ctc'], bins=bins, histtype='step', label='Charge Trapping Corrected')
plt.yscale('log')
plt.legend()
plt.show()

## Applying Quality Cuts

In [None]:
import pygama.genpar_tmp.cuts as cts

In [None]:
uncal_pass, uncal_cut = cts.load_nda_with_cuts(files,'raw',['cuspEmax_ctc'],verbose=False)

By default the cuts are generated as a 4 sigma double-sided cut on bl_mean, bl_std and pz_std. We can customize this by specifying a cut dictionary.

In [None]:
other_cuts_pass, other_cuts_fail = cts.load_nda_with_cuts(
    files, 'raw', ['cuspEmax_ctc'],
    cut_parameters={'bl_mean': 5, 'bl_std': {'left': 10, 'right': 4}, 'pz_std': 4},
    verbose=False
)
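
To make the meaning of a double-sided sigma cut concrete, here is a minimal numpy sketch. It is not pygama's implementation: it assumes the mean and standard deviation as the centre and width of the distribution, whereas pygama may determine these differently. `sigma_cut_mask` is a hypothetical helper.

```python
import numpy as np


def sigma_cut_mask(values, n_sigma):
    """Boolean mask of events inside a double-sided n-sigma window.

    n_sigma is either a number (symmetric cut) or a dict with
    'left' and 'right' keys, mirroring the cut dictionary format above.
    """
    values = np.asarray(values, dtype=float)
    if isinstance(n_sigma, dict):
        left, right = n_sigma["left"], n_sigma["right"]
    else:
        left = right = n_sigma
    centre = np.nanmean(values)   # assumption: mean as the centre
    width = np.nanstd(values)     # assumption: standard deviation as the width
    return (values > centre - left * width) & (values < centre + right * width)


# Toy baseline noise values, not real data
rng = np.random.default_rng(0)
bl_std = rng.normal(10.0, 1.0, 10_000)
mask = sigma_cut_mask(bl_std, {"left": 10, "right": 4})
print(f"{mask.sum()} of {mask.size} events pass")
```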

In [None]:
plt.figure()
counts,bins,bars = plt.hist(uncal_pass['cuspEmax_ctc'], bins=1000, histtype='step', label='Passed Cuts')
plt.hist(uncal_cut['cuspEmax_ctc'], bins=bins, histtype='step', label='Failed Cuts')
plt.yscale('log')
plt.legend()
plt.show()

Note: Some detectors have problems with the default cuts, as they could not use the DAQ baseline for subtraction. These detectors are V05266A and V04549A; for these, just remove bl_mean from the cut dictionary. Also, V07647B has a change in baseline value, so the first file of run one should not be used.
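
The note above can be encoded as a small lookup. This sketch assumes the cut dictionary format shown in the `cut_parameters` example and the default of 4 sigma on each quantity. `cut_parameters_for` is a hypothetical helper, not part of pygama.

```python
# Default cut parameters as described above: 4 sigma on each quantity.
DEFAULT_CUTS = {"bl_mean": 4, "bl_std": 4, "pz_std": 4}

# Detectors where the DAQ baseline could not be used for subtraction.
NO_BL_MEAN_DETECTORS = {"V05266A", "V04549A"}


def cut_parameters_for(det):
    """Return a cut dictionary, dropping bl_mean for the affected detectors."""
    cuts = dict(DEFAULT_CUTS)
    if det in NO_BL_MEAN_DETECTORS:
        cuts.pop("bl_mean")
    return cuts


print(cut_parameters_for("V05266A"))
print(cut_parameters_for("V07646A"))
```

The returned dictionary could then be passed as `cut_parameters` when loading the data.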

## Applying Energy Calibration 

In [None]:
import pygama.analysis.calibration as cal
import pygama.analysis.peak_fitting as pgp

In [None]:
glines    = [583.191, 727.330, 860.564, 1592.53, 1620.50, 2103.53, 2614.50]  # gamma lines used for calibration (keV)
range_keV = [(25,40), (25,40), (25,40), (25,20), (25,40), (25,40), (90,90)]  # side band widths (keV)
funcs     = [pgp.radford_peak] * len(glines)  # fit the same peak shape to every line

For the calibration we have to specify the peaks to find in keV, the widths around each peak to fit (again in keV) and finally the function to fit to each peak. Here we are just using the radford_peak function, which is a Gaussian with a low energy tail and a step function for the background.

The other argument we have to supply is a guess of the ADC to keV conversion. For Th, assuming the 99th percentile is around the 2615 keV peak works well.

In [None]:
pars, cov, results = cal.hpge_E_calibration(
    uncal_pass['cuspEmax_ctc'],
    glines,
    (2620/np.nanpercentile(uncal_pass['cuspEmax_ctc'],99)),
    deg=1,
    range_keV = range_keV,
    funcs = funcs,
    verbose=True
)

In [None]:
ecal_pass = pgp.poly(uncal_pass['cuspEmax_ctc'], pars)

In [None]:
plt.figure()
plt.hist(ecal_pass, bins=10000, histtype='step')
plt.xlabel("Energy (keV)",     ha='right', x=1)
plt.ylabel("Counts / keV",     ha='right', y=1)
plt.yscale('log')
plt.show()

Some tips: if the routine is struggling to find the peaks then the issue is probably the guess parameter. Here this is specified as `(2620/np.nanpercentile(uncal_pass['cuspEmax_ctc'],99))`, where we are guessing that the 99th percentile is around the 2615 keV peak. This works well for Th but other sources will need different values. If the peak fitting is struggling, try changing the fit widths in range_keV above.
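
When experimenting with the guess, it can help to wrap it in a function so that the reference energy and percentile are easy to change. This is a sketch only. `guess_kev_per_adc` is a hypothetical helper, and suitable values of `peak_kev` and `percentile` for sources other than Th are not given here and would have to be chosen by inspecting the spectrum.

```python
import numpy as np


def guess_kev_per_adc(uncal_energy, peak_kev=2620.0, percentile=99.0):
    """Rough keV-per-ADC guess, assuming the given percentile of the
    uncalibrated spectrum sits near a known line at peak_kev."""
    return peak_kev / np.nanpercentile(uncal_energy, percentile)


# Toy spectrum: evenly spaced ADC values, not real data
adc = np.linspace(0.0, 10000.0, 1001)
print(guess_kev_per_adc(adc))
```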

results is a dictionary containing the locations of the found peaks in ADC and keV, the locations of the fitted peaks in keV and the parameters of the fitted peaks. It also contains the calibration parameters and the FWHMs of the peaks.

In [None]:
print(results)

Alternatively for Th data we can load the energy calibration constants from the ecal files. Note: These won't apply to lower energy sources. 

In [None]:
calpath = '/lfs/l1/legend/legend-prodenv/prod-usr/ggmarsh-test-v03/genpar/dsp_ecal/'+det+'.json'
with open(calpath, 'r') as o:
    cal_dict = json.load(o)
cal_pars = cal_dict['cuspEmax_ctc']['Calibration_pars']
eres_pars = [cal_dict['cuspEmax_ctc']['m0'], cal_dict['cuspEmax_ctc']['m1']]

In [None]:
plt.figure()
plt.hist(pgp.poly(uncal_pass['cuspEmax_ctc'], cal_pars), bins=10000, histtype='step')
plt.xlabel("Energy (keV)",     ha='right', x=1)
plt.ylabel("Counts / keV",     ha='right', y=1)
plt.yscale('log')
plt.show()

## Peak Fitting

In pygama there are a number of convenience functions for histogramming and peak fitting. By default these use a least squares fit for binned data.

In [None]:
import pygama.analysis.histograms as pgh

First we can use the get_hist function to generate the histogram. There are a number of ways to do this: you can specify the number of bins, the edges of the bins, or the range and width of the bins.

In [None]:
hist,bins,var = pgh.get_hist(ecal_pass[(ecal_pass>2554)&(ecal_pass<2664)], range=(2554,2664), dx=0.1)
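
The three ways of defining the binning can be illustrated with plain numpy. This sketch uses `np.histogram` as a stand-in for `get_hist`, so it does not return the variance, and it runs on a toy Gaussian peak rather than real data.

```python
import numpy as np

rng = np.random.default_rng(1)
energies = rng.normal(2614.5, 1.5, 5000)   # toy peak, not real data

# 1) number of bins over a range
h1, e1 = np.histogram(energies, bins=100, range=(2554, 2664))

# 2) explicit bin edges
edges = np.linspace(2554, 2664, 101)
h2, e2 = np.histogram(energies, bins=edges)

# 3) range plus bin width, converted to a number of bins
lo, hi, dx = 2554, 2664, 0.1
n_bins = int(round((hi - lo) / dx))
h3, e3 = np.histogram(energies, bins=n_bins, range=(lo, hi))

print(len(h1), len(h2), len(h3))
```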

The fitting function is fit_hist. It takes the function to fit, the data and a parameter guess to start the fit. It can also take bounds on these parameters.

We will again use the radford peak function, which is a Gaussian with a tail and a step. A number of other peak shapes are implemented in pygama and can be found [here](https://github.com/legend-exp/pygama/blob/master/pygama/analysis/peak_fitting.py). For energy fitting with this function, guessing functions for the initial parameters and bounds have been implemented, so we will use these.

In [None]:
pars_guess = cal.get_hpge_E_peak_par_guess(hist, bins, var, pgp.radford_peak)
bounds = cal.get_hpge_E_peak_bounds(hist, bins, var, pgp.radford_peak, pars_guess)

In [None]:
pars,covs = pgp.fit_hist(pgp.radford_peak, hist,bins, guess=pars_guess, bounds=bounds)

In [None]:
bin_cs = (bins[1:]+bins[:-1])/2
plt.figure()
ax1 = plt.subplot(211)
plt.xlim([2600,2620])
ax1.tick_params('x', labelbottom=False)
plt.plot(bin_cs,hist)
plt.plot(bin_cs,pgp.radford_peak(bin_cs, *pars))
plt.ylabel("Counts")
ax2 = plt.subplot(212, sharex=ax1)
plt.plot(bin_cs,pgp.radford_peak(bin_cs, *pars)-hist)

plt.xlabel("Energy (keV)")
plt.ylabel("Residual")
plt.tight_layout()
plt.show()

If using other fit functions, there are a number of other helpful functions in pygama to guess parameter values. If fitting a Gaussian we can use gauss_mode_width_max to guess the mu, sigma and max.

In [None]:
g_pars, g_covs = pgp.gauss_mode_width_max(hist,bins)

In [None]:
print(f'Mu guess is {g_pars[0]}, fitted is {pars[0]}')
print(f'Sigma guess is {g_pars[1]}, fitted is {pars[1]}')
print(f'Max guess is {g_pars[2]}, fitted is {pars[-1]}')
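
To give a feel for what a guessing function has to do, here is a crude numpy-only sketch that estimates mu, sigma and max directly from a binned peak. It is not the algorithm used by gauss_mode_width_max. `gauss_guess_from_hist` is a hypothetical helper that takes mu from the highest bin and sigma from the full width at half maximum, and it will be unreliable on noisy or multi-peak histograms.

```python
import numpy as np


def gauss_guess_from_hist(hist, bins):
    """Crude (mu, sigma, max) guesses from a binned peak.

    mu: centre of the highest bin; sigma: FWHM / 2.355, with the FWHM
    taken as the span of bin centres at or above half the maximum.
    """
    hist = np.asarray(hist, dtype=float)
    bins = np.asarray(bins, dtype=float)
    centres = 0.5 * (bins[1:] + bins[:-1])
    i_max = int(np.argmax(hist))
    above = np.nonzero(hist >= 0.5 * hist[i_max])[0]
    fwhm = centres[above[-1]] - centres[above[0]]
    return centres[i_max], fwhm / 2.355, hist[i_max]


# Toy check on a noiseless Gaussian with mu = 0, sigma = 1, max = 100
toy_bins = np.linspace(-5.0, 5.0, 1001)
toy_centres = 0.5 * (toy_bins[1:] + toy_bins[:-1])
toy_hist = 100.0 * np.exp(-0.5 * toy_centres**2)
print(gauss_guess_from_hist(toy_hist, toy_bins))
```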