<font size="1">Tutorial by Danielle Schaper (based on the original tutorial by Ian Guinn, UNC, which was presented at the [LEGEND Software Tutorial, Nov. 2021](https://indico.legend-exp.org/event/561/))</font>

<font size="1"> NOTE: This tutorial uses the newer (refactored) version of pygama, version 1.0. The tutorials presented at the Nov. 2021 workshop used the older version of pygama, version 0.9. At the time of writing this tutorial, the version of pygama used was version 1.0.2 (specifically, 1.1.1.dev5+geb536311)</font>

In [None]:
import pygama
print(pygama.__path__)
print(pygama.__version__)

***

# <font color ='green'>Data Processing and Access 101: Overview of this Tutorial</font>


There are three main stages of data files in the LEGEND analysis process: ORCA files, "raw" files, and "DSP" files. Briefly:
1) ORCA files: These are the files that come straight off of the FlashCam cards. They consist of an ORCA header, and then raw binary data.

2) Raw files: These are produced after the ORCA files have undergone basic processing that structures the binary data into HDF5-formatted files. They contain run information and the raw (unprocessed) waveforms.

3) DSP files: These files have undergone more rigorous DSP (digital signal processing) routines and contain the processed waveforms as well as lists of input parameters that were used to process the raw waveforms (e.g. trap filter rise time) and lists of parameters which were extracted from these analysis routines (e.g. trapEmax).

The primary goal of this tutorial is to walk the user through the ORCA ---> Raw ---> DSP conversions. Along the way, we show how to inspect the contents of the files by 1) using the pygama WaveformBrowser and 2) loading data into pandas DataFrames and using the native pandas utilities to look at the data.

The data files that will be used in this tutorial are contained in the legend-testdata repository. 

In [None]:
# Set up python environment. This tutorial assumes that the user is running the refactored version of pygama.
import os, json, sys
import matplotlib.pyplot as plt
import numpy as np
from legend_testdata import LegendTestData

# Pygama modules needed for this tutorial
from pygama.raw  import build_raw
from pygama.dsp  import build_dsp
from pygama.lgdo import LH5Store as lh5_st
from pygama.lgdo import ls # ls is a module-level function, not a method of the LH5Store class, so it is imported separately
from pygama.lgdo import load_dfs
from pygama.vis.waveform_browser import WaveformBrowser

plt.rcParams["figure.figsize"] = (20,10)

In [None]:
# Load in data from legend-testdata and check to make sure the path can be found.
ldata=LegendTestData()
ldata.get_path("orca/fc/L200-comm-20220519-phy-geds.orca")

# Building a 'raw' file from a .orca DAQ file

Our first step in processing is to run build_raw (found in the <font color ='orange'>pygama.raw module</font>), which will decode the binary file produced by our DAQ system and output an HDF5 file following LEGEND's lh5 file specification. This requires us to provide an input file, an output file name, and a dictionary of settings.

In [None]:
# Change 'orca_file' to point to the .orca file that you want to analyze. Here, we use one of the L200 commissioning data files.

# Load the desired ORCA file from legend-testdata
ldata = LegendTestData()
orca_file = ldata.get_path("orca/fc/L200-comm-20220519-phy-geds.orca")

# These will be the output filenames for this tutorial
raw_file = 'pygamaTutorialRAWfile.lh5'
dsp_file = 'pygamaTutorialDSPfile.lh5'

In [None]:
# This is a generic configuration that can be passed to build_raw(). It is generic because it relies on wildcards to select the channels to decode.
# The only specialized argument here is out_stream, which names the output file 'pygamaTutorialRAWfile.lh5' (the same name as raw_file above).
# A template for this config can be found in scripts/build_raw.py in the legend-exp/legend-dataflow GitHub repository.

out_spec = { 
"ORFlashCamADCWaveformDecoder" : {
  "ch{key:03d}/raw" : {
  "key_list" : ["*"],
  "out_stream" : "pygamaTutorialRAWfile.lh5"
  }
 }
}
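
The key `"ch{key:03d}/raw"` in `out_spec` is a Python format template. As we understand it, `{key:03d}` is filled in with each decoded channel number, zero-padded to three digits, and the result names the LH5 group that the channel is written to. The sketch below shows only the string expansion, with made-up channel numbers; it does not call pygama.

```python
# Template copied from out_spec above; the channel numbers are illustrative only.
group_template = "ch{key:03d}/raw"
example_channels = [0, 7, 42]

# Expand the template once per channel number.
groups = [group_template.format(key=ch) for ch in example_channels]
for group in groups:
    print(group)
```

This is why the later cells read channel 0 from the group `ch000/raw`.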

In [None]:
# Set overwrite_raw = True to allow build_raw to overwrite an existing output file. This is useful if you are making small tweaks in this tutorial and re-running the cell multiple times.
overwrite_raw = True

# Build the raw file
build_raw(orca_file, in_stream_type='ORCA',
         out_spec=out_spec, overwrite = overwrite_raw)

## Inspecting the raw file

Next, we'll look at the file output from build_raw. The file is written using the LH5 specification, and can be accessed using the LH5Store class (imported here as lh5_st) in the <font color ='orange'>pygama.lgdo</font> module.

First, we'll create a Store object, and call ls to list the contents of the HDF5 group containing our data. Later, we'll call load_dfs to create a pandas dataframe. Note that the pandas dataframe will not contain the waveforms.

In [None]:
store_object = lh5_st()
print("List of raw file elements:")
print(ls(raw_file))

## Using the Waveform Browser to peek at a raw waveform

This is done using the WaveformBrowser class in the <font color ='orange'>pygama.vis.waveform_browser</font> module. Here we create a browser for the raw waveforms of a single channel.

In [None]:
# For this tutorial, we will look at channel ch000

channel = 'ch000'
browser1 = WaveformBrowser(raw_file, lh5_group = channel +"/raw", lines = "waveform")
browser2 = WaveformBrowser(raw_file, lh5_group = channel +"/raw", lines = "waveform")

In [None]:
# Draw entry 1 of the file (entries are zero-indexed, so draw_entry(0) would show the very first waveform)
browser1.draw_entry(1)

In [None]:
# To browse through the waveforms in the file, you can use the draw_next() function. Re-running this cell will update the plot.
fig = browser1.new_figure()
fig = browser1.draw_next()

## Structuring Data into a pandas DataFrame

Let's pick a few parameters that we might be interested in and put them into a pandas dataframe so that we can look through them more easily. We can do this using the load_dfs() function found in the <font color ='orange'>pygama.lgdo</font> module. To know what inputs to use for the par_list and lh5_group arguments, one must first know the HDF5 structure of the raw file. If you need help with this, please see the "UnderstandingHDF5Files" tutorial.

Here we are choosing to import a subset of the data contained in the raw file into a dataframe. We are interested in the *card, channel, crate, daqenergy, packet_id*, and *timestamp* datasets, which can all be found in the channel's raw HDF5 group (here ch000/raw), as written by the ORFlashCamADCWaveformDecoder.

In [None]:
raw_df = load_dfs(raw_file, par_list = ['card', 'channel', 'crate', 'daqenergy',
                              'packet_id', 'timestamp'], lh5_group = channel + '/raw')

In [None]:
# Now we can inspect the contents of the dataframe and see if they make sense. You can play with changing the channel number and seeing how the dataframe updates.
print(raw_df)
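
Once the data are in a dataframe, the usual pandas tools apply. The sketch below uses a small synthetic dataframe with a few of the same column names as `raw_df`; the values are invented for illustration and are not taken from the test data file.

```python
import pandas as pd

# Synthetic stand-in for raw_df; all values are invented for illustration.
example_df = pd.DataFrame({
    "channel":   [0, 0, 0, 0],
    "daqenergy": [0, 1520, 310, 2875],
    "timestamp": [1.00, 1.25, 1.90, 2.40],
})

# Keep events with a non-zero online energy, then sort by energy.
nonzero = example_df[example_df["daqenergy"] > 0]
by_energy = nonzero.sort_values("daqenergy", ascending=False)

print(example_df.describe())   # summary statistics per column
print(by_energy)
```

The same selections should work on `raw_df` itself, since it is an ordinary pandas dataframe.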

## Building the DSP file from the Raw file

In order to convert the raw data file into one that has been processed using DSP routines, we will need two input files:

1) A **global** configuration file which contains the information about the DSP processes that we wish to run on the data. Ideally, this configuration file remains static for all of the analyses that will be done across different datasets. This configuration file is defined in this tutorial as **dsp_config**.

2) A **local** configuration file which contains channel-specific (i.e. detector-specific) information and parameters. This configuration file will change between datasets, as its purpose is to perform minor corrections/adjustments to the data processes which originate from detector-specific fluctuations. This configuration file is defined in this tutorial as **db_dict**.

Example dsp_config and db_dict files are included in this tutorial directory so that no external files need to be imported and so that everyone uses the same files. These files can be found in the ./metadata directory. For more information on how these files affect the data and how to modify them, please see the WriteProcessors and IntroToDSP tutorials.
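
The split between a global and a local configuration can be pictured as a set of shared defaults plus per-channel overrides. The following is a minimal sketch of that idea only; it is not pygama's lookup code, and every parameter name and value in it is hypothetical.

```python
# Hypothetical global defaults shared by every channel (names are made up).
global_defaults = {"baseline_samples": 1000, "pz_tau_us": 180.0}

# Hypothetical per-channel overrides, playing the role of db_dict.
channel_overrides = {
    "ch000": {"pz_tau_us": 187.5},
    "ch001": {},
}

def resolve_parameters(channel):
    """Return the global defaults updated with any overrides for this channel."""
    params = dict(global_defaults)  # copy so the defaults stay untouched
    params.update(channel_overrides.get(channel, {}))
    return params

print(resolve_parameters("ch000"))
print(resolve_parameters("ch001"))
```

In this picture the global file rarely changes, while the override table is regenerated for each dataset.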

In [None]:
# We will use the dsp-config and db_dict files in the ./metadata directory. If you'd like to see the contents of either JSON file, you can uncomment the corresponding block below and run this cell to print them.

## Uncomment this block to see the dsp-config file contents
#dsp_config_file_path = "./metadata/dsp-config-dataprocessing.json"
#with open(dsp_config_file_path, 'r') as j:
#     contents = json.loads(j.read())
#print(json.dumps(contents, indent=2))

## Uncomment this block to see the db_dict file contents
#db_dict_file_path = "./metadata/db_dict.json"
#with open(db_dict_file_path, 'r') as j:
#     contents = json.loads(j.read())
#print(json.dumps(contents, indent=2))

In [None]:
# Define the necessary dsp_config and db_dict JSON files.
dsp_config = "./metadata/dsp-config-dataprocessing.json"
db_dict = "./metadata/db_dict.json"

# Build the DSP file. write_mode = 'r' replaces any existing output file with the same name.
build_dsp(raw_file, dsp_file, dsp_config = dsp_config, database = db_dict, write_mode = 'r')

In [None]:
dsp_df=load_dfs(dsp_file, par_list=['trapEmax'], lh5_group=channel +'/dsp')

In [None]:
print(dsp_df)
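
The trapEmax values printed above are the maximum of a trapezoidal filter applied to each waveform. The exact processor chain is defined in the dsp-config file; the sketch below is a simplified, pure-Python illustration of the idea on a synthetic step pulse. The function name, the normalization by the rise time, and the chosen rise and flat-top lengths are assumptions made for this example.

```python
def trap_filter(waveform, rise, flat):
    """Normalized trapezoidal filter: mean of a leading window minus the
    mean of a trailing window, separated by a flat-top gap."""
    # Prefix sums give any window sum in O(1).
    prefix = [0.0]
    for sample in waveform:
        prefix.append(prefix[-1] + sample)

    def window_sum(start, stop):  # sum of waveform[start:stop]
        return prefix[stop] - prefix[start]

    out = []
    first = 2 * rise + flat  # smallest stop index for which both windows fit
    for i in range(first, len(waveform) + 1):
        lead = window_sum(i - rise, i)
        trail = window_sum(i - 2 * rise - flat, i - rise - flat)
        out.append((lead - trail) / rise)
    return out

# Synthetic step pulse: flat baseline of 20, then a step of height 100.
pulse = [20.0] * 50 + [120.0] * 50
trap = trap_filter(pulse, rise=4, flat=2)
trap_max = max(trap)
print(trap_max)  # 100.0, the step height
```

Because the filter takes a difference of two window averages, the constant baseline of 20 cancels and the maximum recovers the step height.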

In [None]:
# Plot the results of the trapEmax filter. There are only 12 events in this test data file, so the spectrum is not very impressive; what matters is that it works.
# Note: trapEmax is an uncalibrated energy estimate, so the x-axis is in ADC units rather than keV.
dsp_df.hist('trapEmax',bins=50)
plt.xlabel("Energy [ADC]")
plt.ylabel("Counts")
plt.ylim(0,10)
plt.xlim(0,20900)
plt.show()

# Inspecting the DSP File Using the Waveform Browser

In [None]:
# Create an LH5 Store object
DSP_Store =lh5_st()

In [None]:
# List the contents of the DSP file for a given channel
channel = 'ch000'
ls(dsp_file, channel + '/dsp/')

In [None]:
# To see that we processed the waveforms, we plot the baseline-subtracted waveform, 'wf_blsub'.
browser_dsp = WaveformBrowser(dsp_file, lines='wf_blsub', lh5_group= channel + "/dsp")


In [None]:
# Draw entry 1 of the file (the same zero-indexed entry that we drew earlier from the raw file). It should look like the raw waveform, just shifted to a zero baseline.
browser_dsp.draw_entry(1)
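
How 'wf_blsub' is actually computed is set by the processors in the dsp-config file. As a rough illustration of what "adjusted to a zero baseline" means, the sketch below assumes the baseline is estimated as the mean of the first few samples and subtracts it from a short synthetic waveform. The function name and sample values are made up for this example.

```python
from statistics import fmean

def subtract_baseline(waveform, n_baseline):
    """Subtract the mean of the first n_baseline samples from every sample."""
    baseline = fmean(waveform[:n_baseline])
    return [sample - baseline for sample in waveform]

# Synthetic waveform: four baseline samples near 1000, then a pulse.
raw_wf = [1000.0, 1002.0, 998.0, 1000.0, 1500.0, 1490.0]
wf_blsub_example = subtract_baseline(raw_wf, n_baseline=4)
print(wf_blsub_example)
```

The shape of the pulse is unchanged; only the vertical offset is removed.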

In [None]:
# To browse through the waveforms in the file, you can use the draw_next() function. Re-running this cell will update the plot.
fig = browser_dsp.new_figure()
fig = browser_dsp.draw_next()