<h1 style="text-align: center;">
    <img src="img/logo_servicex.png" width="70" height="70"  style="float:left" alt="ServiceX">
    <img src="img/logo_ut.png" width="150" height="100"  style="float:right" alt="UT Austin">
    ServiceX, the novel data delivery system
</h1>

<h4 style="text-align: center;">KyungEon Choi (UT Austin) for ServiceX team (IRIS-HEP)</h4>

<h4 style="text-align: center;">IRIS-HEP Analysis Software Training Event (July 19, 2024)</h4>

<br>

</br>

<h2>Data delivery? Data Access?</h2>

<font size="3">

<br>

<p style="text-align:center;"> <img src="img/remote_data.png"  width ="60%" alt="ServiceX"></p>

- The data we want to process often sits on remote storage, either because it is too large for directly accessible storage or because the production chain made it available only there
- There are a few common solutions
    1. Transfer or download to a directly accessible storage (e.g. <font size="2">`rucio get X`</font>)
    2. Run an ntuplizer on the grid to filter and select what the user needs (and more), and then download the output (e.g. TopCPToolkit)
    3. Log in to a machine that has access to the data (e.g. lxplus for EOS storage access)



<h2>What is ServiceX?</h2>

<font size="3">
    
- A component of IRIS-HEP DOMA (Data Organization, Management And Access)
- A scalable data extraction and delivery service
- Deployed in a Kubernetes cluster

<h3>ServiceX under the hood</h3>
<p style="text-align:center;"> <img src="img/ServiceXDiagram2.png" width="100%" alt="ServiceX"></p>

<font size="3">

- <b><span style="color:#FF6E33;">Event data</span></b>
    - ServiceX delivers data from the grid or remote XRootD storage to the user. More precisely, ServiceX writes into an object store (ServiceX internal storage), and users download files or retrieve URLs from the object store as soon as they are available.
    - The thickness of the arrows reflects the amount of data on the wire. ServiceX is NOT designed to download full datasets from the grid. Transformers reduce the data delivered to the user according to the selection and filtering in the query.
    - ServiceX is often co-located with a grid site to maximize network bandwidth. An XCache is preferable because it allows much faster reads of frequently accessed datasets.
- <b><span style="color:red;">Transformer</span></b>
    - Extracts what the user wants
    - ServiceX consists of multiple microservices deployed as static K8s pods (always in the "running" state), whereas transformers are created dynamically via the HPA (Horizontal Pod Autoscaler)
    - A transformer pod processes one file at a time, and the number of transformer pods is scaled up and down depending on the number of input files in the dataset and other criteria
- <b>ServiceX Request</b>
    - ServiceX requests are made from the <span style="color:blue;">ServiceX client library</span> to the ServiceX Web API via HTTP
    - A ServiceX request takes one input dataset (or a list of files), and ServiceX scales the transformer pods automatically. A dataset with a single file works, but datasets with many files make much better use of the HPA.
    - Users can make ServiceX requests from anywhere with only the Python ServiceX client library and a <font size="2"><code>servicex.yaml</code></font> file that includes an access token. It's therefore perfectly fine to deliver data to a university cluster or to a laptop for small tests.
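
To see why a request with many files benefits from autoscaling while a single-file request cannot, here is a toy model (not ServiceX code). It assumes each pod handles one file at a time, so the number of processing "rounds" is the file count divided by the number of active pods, rounded up. The function name `processing_rounds` and the `max_pods` cap are hypothetical; the real autoscaler also reacts to other criteria.

```python
import math


def processing_rounds(n_files: int, max_pods: int) -> int:
    """Toy estimate of rounds needed when each pod processes one file at a time.

    Hypothetical model for illustration only, not the ServiceX scaling logic.
    """
    if n_files <= 0:
        return 0
    # Running more pods than there are files brings no benefit.
    active_pods = min(n_files, max_pods)
    return math.ceil(n_files / active_pods)


for n_files in (1, 20, 547):
    rounds = processing_rounds(n_files, max_pods=100)
    print(f"{n_files} file(s) -> {rounds} round(s)")
```

In this model a single-file dataset always occupies exactly one pod, whereas a large dataset keeps every available pod busy.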

<br>

<h3>ServiceX Webpage</h3>
<font size="3">
    
- The "production" ServiceX for ATLAS users: <font size="2">[<code>https://servicex.af.uchicago.edu/</code>](https://servicex.af.uchicago.edu/)</font> - limited to ATLAS users because it provides access to ATLAS event data
- Download a ServiceX configuration file (<font size="2"><code>servicex.yaml</code></font>) from the ServiceX website and copy it to your home or working directory

<p style="text-align:center;"><img src="img/servicex_web.png" width="80%" alt="ServiceX Web"></p>

<br>


</br>

<h2>ServiceX Client library</h2>

<font size="3">

The ServiceX client library is a Python library that lets users communicate with the ServiceX backend (or server) to make delivery requests and handle the outputs

<font size="3">

<b>Installation</b><br />
- <font size="2"><code>pip install servicex==3.0.0.alpha.19</code></font>

In [None]:
# !pip install servicex==3.0.0.alpha.19
!pip list | grep servicex

<br>
<br>

<h3>First ServiceX request</h3>

<!-- <font size="3">
Let's begin with the basic: <br>
<span style="margin-left:30px">Deliver a branch (or column) from a dataset in the grid</span> -->

<font size="3">

<b>The most fundamental components of a ServiceX request</b>
1. Dataset
1. Query - describes what a user wants to run in the transformers

In [None]:
import servicex

In [None]:
spec = {
    "Sample":[{
        "Name": "UprootRaw",
        "Dataset": servicex.dataset.Rucio("user.kchoi.pyhep2024.test_dataset"),
        "Query": servicex.query.UprootRaw({"treename": "nominal", "filter_name": "el_pt"})
    }]
}

<font size="3">
    
- One sample named "UprootRaw" is defined in the <font size="2"><code>spec</code></font> object.
- A Rucio dataset is specified
- A <font size="2">`Query`</font> is defined; it is sent to the transformers and run on all files in the given Rucio dataset
- The <font size="2">`UprootRaw`</font> query takes <font size="2">`"treename"`</font> to select the <font size="2">`TTree`</font> in flat ROOT ntuples and <font size="2">`"filter_name"`</font> to select branches in that tree

<font size="3">
Let's deliver my ServiceX request

In [None]:
o = servicex.deliver(spec)

In [None]:
len(o['UprootRaw'])

<font size="3">
Returns a dictionary

In [None]:
print(f"Sample.Name: {o.keys()}\n")
print(f"Fileset: {type(o['UprootRaw'])}\n")
print(f"First file: {(o['UprootRaw'][0])}\n")
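
The returned object maps each `Sample` name to a list-like collection of delivered files. Below is a minimal sketch of walking such a mapping. It uses a hand-written stand-in dictionary with made-up paths rather than a real `deliver()` result, and `summarize` is a hypothetical helper, not part of the client library.

```python
def summarize(delivered: dict) -> dict:
    """Return the number of delivered files per sample name."""
    return {name: len(files) for name, files in delivered.items()}


# Stand-in for the output of deliver(); the paths are invented for illustration.
fake_output = {
    "UprootRaw": [
        "/tmp/servicex_cache/file_0.root",
        "/tmp/servicex_cache/file_1.root",
    ],
}

for name, count in summarize(fake_output).items():
    print(f"{name}: {count} file(s), first = {fake_output[name][0]}")
```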

In [None]:
import uproot

# Read the branch into an array while the file is still open
with uproot.open(o['UprootRaw'][0]) as f:
    el_pt = f['nominal']['el_pt'].array()
el_pt

<font size="3">
Only a few lines of Python bring the data you want from the grid!

<br></br>

Let me go through what kinds of `Dataset` and `Query` are supported by ServiceX

<h3>Dataset</h3>

<font size="3">
ServiceX supports Rucio datasets, XRootD file lists, and CERN Open Data

In [None]:
servicex.dataset.Rucio.__init__

In [None]:
servicex.dataset.FileList.__init__

In [None]:
servicex.dataset.CERNOpenData.__init__

<br></br>

<h3>Query</h3>

<font size="3">
<ul>
    <li>A query represents what the user wants from the input dataset, e.g.</li>
    <ul>
        <li><font size="2"><code>UprootRaw({"treename": "nominal", "filter_name": "el_pt"})</code></font></li>
    </ul>
    <li>The user-provided query is translated into code that runs on the transformers</li>
    <li>A query depends on the input data format, since the code for flat ROOT ntuples differs from the code for Apache Parquet</li>
    <!-- <li>ServiceX supports ROOT ntuples, ATLAS xAOD, CMS Run-1 AOD as an input format</li> -->
    <!-- <li>Current version of client library supports query languages   (though other query classes are registered)</li> -->
    <!-- <li>Current version of client library supports query classes for ROOT ntuples at the moment</li> -->
</ul>
</font>

In [None]:
servicex.query.plugins

<font size="3">

<br>
<b>Query classes for ROOT ntuples (via Uproot)</b>

<font size="3">

<code>UprootRaw</code> Query
- This is a new query language that essentially calls uproot's <font size="2">`TTree.arrays()`</font> method
- A UprootRaw query can be a dictionary or a list of dictionaries
- There are two types of operations a user can put in a dictionary
    - query: contains a  <font size="2">`treename`</font> key
    - copy: contains a  <font size="2">`copy_histograms`</font> key

<font size="2">    
    <pre>
        <code class="python">
query = [
         {
          'treename': 'reco', 
          'filter_name': ['/mu.*/', 'runNumber', 'lbn', 'jet_pt_*'], 
          'cut':'(count_nonzero(jet_pt_NOSYS>40e3, axis=1)>=4)'
         },
         {
          'copy_histograms': ['CutBookkeeper*', '/cflow.*/', 'metadata', 'listOfSystematics']
         }
        ]
        </code>
    </pre>
</font>


<font size="3">

- More details on the grammar can be found [here](https://servicex-frontend.readthedocs.io/en/latest/transformer_matrix.html)
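
To make the two operation types and the `filter_name` patterns more concrete, here is a plain-Python sketch that runs without ServiceX or uproot. It assumes the uproot convention that a pattern wrapped in slashes is a regular expression, a pattern with wildcard characters is a glob, and anything else is an exact name. The helper names and the branch list are hypothetical, and the matching is an approximation of what happens in the transformer.

```python
import fnmatch
import re


def operation_kind(entry: dict) -> str:
    """Classify one UprootRaw dictionary as a 'query' or a 'copy' operation."""
    if "treename" in entry:
        return "query"
    if "copy_histograms" in entry:
        return "copy"
    raise ValueError(f"unrecognized operation: {sorted(entry)}")


def matches(pattern: str, branch: str) -> bool:
    """Approximate branch-name matching: /regex/, glob wildcard, or exact name."""
    if len(pattern) > 2 and pattern.startswith("/") and pattern.endswith("/"):
        return re.match(pattern[1:-1], branch) is not None
    if any(ch in pattern for ch in "*?["):
        return fnmatch.fnmatchcase(branch, pattern)
    return branch == pattern


def select_branches(filter_name, branches):
    """Keep branches matching at least one pattern (a string or a list of strings)."""
    patterns = [filter_name] if isinstance(filter_name, str) else list(filter_name)
    return [b for b in branches if any(matches(p, b) for p in patterns)]


# Hypothetical branch names, for illustration only.
branches = ["mu_pt_NOSYS", "el_pt_NOSYS", "jet_pt_NOSYS", "runNumber", "lbn", "met"]
selected = select_branches(["/mu.*/", "runNumber", "lbn", "jet_pt_*"], branches)
print(selected)
```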

In [None]:
query_UprootRaw = servicex.query.UprootRaw({"treename": "nominal", "filter_name": "el_pt"})

<font size="3">

<br>

<code>FuncADL_Uproot</code> Query
- FuncADL (Functional Analysis Description Language) is a powerful query language supported by ServiceX
- In addition to basic operations like <font size="2">`Select()`</font> for column selection or <font size="2">`Where()`</font> for filtering, more sophisticated queries can be built
- One new addition is the <font size="2">`FromTree()`</font> method, which sets the tree name in a query
- More details can be found in the [talk](https://indico.cern.ch/event/1019958/timetable/#31-funcadl-functional-analysis) by M. Proffitt at PyHEP 2021

In [None]:
query_FuncADL = servicex.query.FuncADL_Uproot().FromTree('nominal').Select(lambda e: {'el_pt': e['el_pt']})

<font size="3">

<br>

<code>PythonFunction</code> Query
- A Python function can be passed as a query
- <font size="2">`uproot`</font>, <font size="2">`awkward`</font>, <font size="2">`vector`</font> can be imported (limited by the transformer image)
- Primarily for experimental purposes and likely to be discontinued

In [None]:
def run_query(input_filenames=None):
    import uproot
    with uproot.open({input_filenames: "nominal"}) as o:
        br = o.arrays("el_pt")
    return br

query_PythonFunction = servicex.query.PythonFunction().with_uproot_function(run_query)

<font size="3">
All three queries return the same output, ROOT files with selected branch <font size="2"><code>el_pt</code></font>!

<br></br>

<h3>Multiple samples</h3>

<font size="3">

- HEP analysis often needs more than one sample

In [None]:
spec_multiple = {    
    "Sample":[
        {
            "Name": "UprootRaw",
            "Dataset": servicex.dataset.Rucio("user.kchoi.pyhep2024.test_dataset"),
            "Query": query_UprootRaw,
        },
        {
            "Name": "FuncADL_Uproot",
            "Dataset": servicex.dataset.Rucio("user.kchoi.pyhep2024.test_dataset"),
            "Query": query_FuncADL,
        },
        {
            "Name": "PythonFunction",
            "Dataset": servicex.dataset.Rucio("user.kchoi.pyhep2024.test_dataset"),
            "Query": query_PythonFunction,
        }
    ]
}

<font size="3">

- The <font size="2">`Sample`</font> block is a list of dictionaries, each with a <font size="2">`Dataset`</font> - <font size="2">`Query`</font> pair
- The client library makes one ServiceX request per <font size="2">`Dataset`</font> - <font size="2">`Query`</font> pair
- Again, it's preferable to put more files into one request, so that the K8s HPA can be utilized, than to split the same query across multiple requests
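
When many samples share one query, the `Sample` list can be generated instead of typed out. The sketch below uses a hypothetical helper, `build_spec`, and plain strings as stand-ins so that it runs without the client library. In a real notebook `make_dataset` would be `servicex.dataset.Rucio` and `query` would be a ServiceX query object such as `query_UprootRaw`.

```python
def build_spec(datasets: dict, query, make_dataset=str) -> dict:
    """Build a spec dictionary with one Sample entry per dataset, sharing one query.

    `make_dataset` wraps each dataset identifier. The default `str` is only a
    stand-in; with the client library it would be servicex.dataset.Rucio.
    """
    return {
        "Sample": [
            {"Name": name, "Dataset": make_dataset(did), "Query": query}
            for name, did in datasets.items()
        ]
    }


# Placeholder query; a real spec would use e.g. servicex.query.UprootRaw(...).
shared_query = {"treename": "nominal", "filter_name": "el_pt"}

spec_sketch = build_spec(
    {
        "ttH": "user.kchoi.fcnc_tHq_ML.ttH.v11",
        "ttW": "user.kchoi.fcnc_tHq_ML.ttW.v11",
    },
    shared_query,
)
print([sample["Name"] for sample in spec_sketch["Sample"]])
```

Each entry still becomes its own ServiceX request, so this only saves typing; it does not merge datasets into one request.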

In [None]:
o_multiple = servicex.deliver(spec_multiple)

<br></br>

<h3>YAML interface</h3>

<font size="3">

- It's cool to deliver only the columns of interest from grid storage in a Jupyter notebook, but a real analysis often becomes quite messy
- A YAML file can describe all of the data in your analysis and is easy to share with colleagues
- The new client library incorporates <font size="2">`servicex-databinder`</font> and significantly improves the user interface to allow a seamless experience with YAML

In [None]:
%%writefile config_UprootRaw.yaml

Sample:
  - Name: Uproot_UprootRaw_YAML
    Dataset: !Rucio user.kchoi.pyhep2024.test_dataset
    Query: !UprootRaw |
        {"treename":"nominal", "filter_name": "el_pt"}

<font size="3">
Compare with the one in this notebook

```python
"Sample":[{
    "Name": "UprootRaw_PyHEP",
    "Dataset": Rucio("user.kchoi.pyhep2024.test_dataset"),
    "Query": UprootRaw({"treename": "nominal", "filter_name": "el_pt"})
}]
```

In [None]:
from servicex import deliver

In [None]:
o_yaml = deliver("config_UprootRaw.yaml")

<font size="3">

YAML syntax
- The exclamation mark (`!`) introduces a YAML tag, which declares the dataset type and the query type (see details on [PyYAML constructors](https://matthewpburruss.com/post/yaml/))
    - Dataset tags: <font size="2">`!Rucio`</font>, <font size="2">`!FileList`</font>, <font size="2">`!CERNOpenData`</font>
    - Query tags: <font size="2">`!UprootRaw`</font>, <font size="2">`!FuncADL_Uproot`</font>, <font size="2">`!PythonFunction`</font>
- The pipe (`|`) after the query tag starts a YAML literal block scalar, which preserves line breaks so that a multi-line string is interpreted properly

<br></br>

<h3>Optional configurations</h3>

<font size="3">

- `General` block
    - Optional block
    - By default <font size="2">`OutputFormat: root-file`</font>
    - <font size="2">`parquet`</font> is supported as <font size="2">`OutputFormat`</font> for uproot queries except <font size="2">`UprootRaw`</font>
    - By default <font size="2">`Delivery: LocalCache`</font> &rarr; files are downloaded to your local cache directory<sup>1</sup>
    - Or <font size="2">`Delivery: SignedURLs`</font> returns only ServiceX object-store URLs &rarr; the user can consume data directly from the ServiceX object store
- `Sample` block
    - <font size="2">`NFiles`</font> sets the number of files to process in the given Rucio dataset
- `Definition` block
    - Repeated long values can be replaced by setting YAML anchors, e.g. the same query for multiple samples
    - One constraint is that the anchor (<font size="2">`&`</font>) must be defined before the alias (<font size="2">`*`</font>)

<font size="2"><sup>1</sup>The local cache path can be set in the `servicex.yaml` file: `cache_path: /X/Y`</font>
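
Because the two delivery modes hand back different kinds of locations, downstream code may want to tell them apart. Here is a small sketch using only the standard library. The helper `is_remote` and both example strings are made up, and the exact form of a ServiceX signed URL is an assumption.

```python
from urllib.parse import urlparse


def is_remote(location: str) -> bool:
    """Return True for http(s) or root URLs and False for plain local paths."""
    return urlparse(str(location)).scheme in {"http", "https", "root"}


# Invented examples of what each delivery mode might return.
local_cache_item = "/home/user/servicex_cache/abc123/file_0.root"
signed_url_item = "https://objectstore.example.org/bucket/file_0.root?signature=xyz"

for item in (local_cache_item, signed_url_item):
    kind = "object-store URL" if is_remote(item) else "local file"
    print(f"{kind}: {item}")
```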

<br>

Example YAML:
</font>

```yaml
Definition:
  - &DEF_ggH_input "root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets\
                  /2020-01-22/4lep/MC/mc_345060.ggH125_ZZ4lep.4lep.root"

  - &DEF_query1 !PythonFunction |
    def run_query(input_filenames=None):
        import uproot

        with uproot.open({input_filenames:"nominal"}) as o:
            br = o.arrays("mu_pt")
        return br

  - &DEF_query2 !FuncADL_Uproot  |
    FromTree('mini').Select(lambda e: {'lep_pt': e['lep_pt']}).Where(lambda e: e['lep_pt'] > 1000)

General:
  OutputFormat: parquet
  Delivery: SignedURLs

Sample:
  - Name: ttH
    Dataset: !Rucio user.kchoi.fcnc_tHq_ML.ttH.v11
    Query: *DEF_query1
    NFiles: 5

  - Name: ttZ
    Dataset: !Rucio user.kchoi.fcnc_tHq_ML.ttZ.v11    
    Query: *DEF_query1
    NFiles: 3

  - Name: ggH
    Dataset: !FileList *DEF_ggH_input
    Query: *DEF_query2
```

<br>

<h3>Failed transformation</h3>

In [None]:
spec_typo = {
    "Sample":[{
        "Name": "UprootRaw_failed",
        "Dataset": servicex.dataset.Rucio("user.kchoi.pyhep2024.test_dataset"),
        "Query": servicex.query.UprootRaw({"treename": "nominal", "filter_name": "el_pta"})
    }]
}

In [None]:
o = deliver(spec_typo)

<br></br>

<h2>Example use case</h2>

<p style="text-align:center;"> <img src="img/ServiceXDiagram2.png" width="100%" alt="ServiceX"></p>

<font size="3">

<b>Case 1</b>
- My analysis team's AnalysisTop (or TopCPToolkit) ntuples for two datasets (Rucio DIDs) are ready on the grid
- I want all electron branches with an electron pT > 25 GeV cut
- I'm going to do my analysis in coffea-casa at the UC AF, so I'd rather consume the data directly from the object store than download it to my local cache

In [None]:
spec_case1 = {
    "General":
    {
        "Delivery": "SignedURLs"
    },
    "Sample":[
        {
            "Name": "ttH",
            "Dataset": servicex.dataset.Rucio("user.kchoi.fcnc_tHq_ML.ttH.v11"),
            "Query": servicex.query.UprootRaw({
                "treename":"nominal", 
                "filter_name": ["el_*", "mu_*","jet_*"], 
                "cut": "num(el_pt, axis=1)==3"
            })
        },
        {
            "Name": "ttW",
            "Dataset": servicex.dataset.Rucio("user.kchoi.fcnc_tHq_ML.ttW.v11"),
            "Query": servicex.query.UprootRaw({
                "treename":"nominal", 
                "filter_name": ["el_*", "mu_*","jet_*"], 
                "cut": "num(el_pt, axis=1)==3"
            })
        }
    ]
}

In [None]:
o_case1 = servicex.deliver(spec_case1)

In [None]:
o_case1['ttH']

<br>

<font size="3">

<b>Case 2</b>
- My analysis team stores all ntuples in the ATLAS EOS space
- I just want a few branches from all files in parquet format for my machine learning study
- I want to deliver the branches to my university cluster because it has a good GPU

In [None]:
eos_file = servicex.dataset.FileList(["root://eosuser.cern.ch//eos/atlas/atlascerngroupdisk/phys-higgs/HSG1/HZG/Run2/ProcessedSample/H2Zy-FullRun2-v3/data/data15_p3876_all.root",
                                     "root://eosuser.cern.ch//eos/atlas/atlascerngroupdisk/phys-higgs/HSG1/HZG/Run2/ProcessedSample/H2Zy-FullRun2-v3/data/data17_p3876_all.root",
                                     "root://eosuser.cern.ch//eos/atlas/atlascerngroupdisk/phys-higgs/HSG1/HZG/Run2/ProcessedSample/H2Zy-FullRun2-v3/data/data18_p3876_all.root"])

spec_case2 = {
    "Sample":[
        {
            "Name": "UprootRaw_eos",
            "Dataset": eos_file,
            "Query": servicex.query.UprootRaw({"treename": "HZG_Tree", "filter_name": "ph_*"})
        }    
    ]
}

In [None]:
o_case2 = servicex.deliver(spec_case2)

In [None]:
o_case2["UprootRaw_eos"]

<br>

<font size="3">

<b>Case 3</b>
- The sample game uses 6 processes with 1 file each, but in practice we need to process all files :)
    - <font size="2">`HWW`: `mc20_13TeV.345324.PowhegPythia8EvtGen_NNLOPS_NN30_ggH125_WWlvlv_EF_15_5.deriv.DAOD_PHYSLITE.e5769_s3681_r13167_r13146_p6026_tid37865929_00` (20 files / 21GB)</font>
    - <font size="2">`HZZ`: `mc20_13TeV.345060.PowhegPythia8EvtGen_NNLOPS_nnlo_30_ggH125_ZZ4l.deriv.DAOD_PHYSLITE.e7735_s3681_r13167_r13146_p6026_tid38191712_00` (17 files / 19GB)</font>
    - <font size="2">`tcha`: `mc20_13TeV.410658.PhPy8EG_A14_tchan_BW50_lept_top.deriv.DAOD_PHYSLITE.e6671_s3681_r13167_r13146_p6026_tid37621204_00` (103 files / 230GB)</font>
    - <font size="2">`ttbar`: `mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_PHYSLITE.e6337_s3681_r13167_r13146_p6026_tid37620644_00` (547 files / 836GB)</font>
    - <font size="2">`tZq`: `mc20_13TeV.410560.MadGraphPythia8EvtGen_A14_tZ_4fl_tchan_noAllHad.deriv.DAOD_PHYSLITE.e5803_s3681_r13167_r13146_p6026_tid38191575_00` (15 files / 13GB)</font>
    - <font size="2">`Zee`: `mc20_13TeV.700322.Sh_2211_Zee_maxHTpTV2_CVetoBVeto.deriv.DAOD_PHYSLITE.e8351_s3681_r13167_r13146_p6026_tid37621317_00` (49 files / 71GB)</font>
- The same list of branches:
  <font size="2">
  
  ```
    "EventInfoAuxDyn.mcEventWeights",
        
    "AnalysisElectronsAuxDyn.pt",
    "AnalysisElectronsAuxDyn.eta",
    "AnalysisElectronsAuxDyn.phi",
    "AnalysisElectronsAuxDyn.m",
    
    "AnalysisMuonsAuxDyn.pt",
    "AnalysisMuonsAuxDyn.eta",
    "AnalysisMuonsAuxDyn.phi",
    
    "AnalysisJetsAuxDyn.pt",
    "AnalysisJetsAuxDyn.eta",
    "AnalysisJetsAuxDyn.phi",
    "AnalysisJetsAuxDyn.m",
    
    "BTagging_AntiKt4EMPFlowAuxDyn.DL1dv01_pb",
    "BTagging_AntiKt4EMPFlowAuxDyn.DL1dv01_pc",
    "BTagging_AntiKt4EMPFlowAuxDyn.DL1dv01_pu",
  ```
  
  </font>
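
Adding up the file counts and sizes quoted in the list above gives a feel for the scale of a full run compared with the one-file-per-process sample game. The numbers are copied from that list; the variable names are only illustrative.

```python
# (number of files, size in GB) per sample, copied from the list above.
samples = {
    "HWW": (20, 21),
    "HZZ": (17, 19),
    "tcha": (103, 230),
    "ttbar": (547, 836),
    "tZq": (15, 13),
    "Zee": (49, 71),
}

total_files = sum(n_files for n_files, _ in samples.values())
total_gb = sum(size_gb for _, size_gb in samples.values())

print(f"All files: {total_files} files, {total_gb} GB")
print(f"Sample game: {len(samples)} of {total_files} files")
```

This is the amount of input the transformers read on the server side; only the selected branches are delivered to the user.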

In [None]:
%%writefile config_sample_game.yaml

Definition:
  - &DEF_query !UprootRaw |
        {
            "treename": "CollectionTree",
            "filter_name": [
                "EventInfoAuxDyn.mcEventWeights",  
                "AnalysisElectronsAuxDyn.pt",
                "AnalysisElectronsAuxDyn.eta",
                "AnalysisElectronsAuxDyn.phi",
                "AnalysisElectronsAuxDyn.m",

                "AnalysisMuonsAuxDyn.pt",
                "AnalysisMuonsAuxDyn.eta",
                "AnalysisMuonsAuxDyn.phi",

                "AnalysisJetsAuxDyn.pt",
                "AnalysisJetsAuxDyn.eta",
                "AnalysisJetsAuxDyn.phi",
                "AnalysisJetsAuxDyn.m",

                "BTagging_AntiKt4EMPFlowAuxDyn.DL1dv01_pb",
                "BTagging_AntiKt4EMPFlowAuxDyn.DL1dv01_pc",
                "BTagging_AntiKt4EMPFlowAuxDyn.DL1dv01_pu"]
        }

General:
    Delivery: SignedURLs
    
Sample:
  - Name: HWW
    Dataset: !Rucio mc20_13TeV.345324.PowhegPythia8EvtGen_NNLOPS_NN30_ggH125_WWlvlv_EF_15_5.deriv.DAOD_PHYSLITE.e5769_s3681_r13167_r13146_p6026_tid37865929_00
    Query: *DEF_query
  - Name: HZZ
    Dataset: !Rucio mc20_13TeV.345060.PowhegPythia8EvtGen_NNLOPS_nnlo_30_ggH125_ZZ4l.deriv.DAOD_PHYSLITE.e7735_s3681_r13167_r13146_p6026_tid38191712_00
    Query: *DEF_query
  - Name: tcha
    Dataset: !Rucio mc20_13TeV.410658.PhPy8EG_A14_tchan_BW50_lept_top.deriv.DAOD_PHYSLITE.e6671_s3681_r13167_r13146_p6026_tid37621204_00
    Query: *DEF_query
  - Name: ttbar
    Dataset: !Rucio mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_PHYSLITE.e6337_s3681_r13167_r13146_p6026_tid37620644_00
    Query: *DEF_query
  - Name: tZq
    Dataset: !Rucio mc20_13TeV.410560.MadGraphPythia8EvtGen_A14_tZ_4fl_tchan_noAllHad.deriv.DAOD_PHYSLITE.e5803_s3681_r13167_r13146_p6026_tid38191575_00
    Query: *DEF_query
  - Name: Zee
    Dataset: !Rucio mc20_13TeV.700322.Sh_2211_Zee_maxHTpTV2_CVetoBVeto.deriv.DAOD_PHYSLITE.e8351_s3681_r13167_r13146_p6026_tid37621317_00
    Query: *DEF_query

In [None]:
o_case3 = servicex.deliver("config_sample_game.yaml")

<br>
<br>

<h2>Future plans</h2>

<font size="3">

<br>

<b>Client library</b>
- Migrate ATLAS FuncADL queries
- Improve robustness: progress bar (transform status/object store access) and local caching
- Read the Docs documentation for the new ServiceX client library is under construction! https://servicex-frontend.readthedocs.io/en/latest/index.html
- ServiceX as a node of a Dask task graph

<b>ServiceX backend</b>
- Improve the stability and robustness of ServiceX, especially based on what was learned during the 200 Gbps challenge
- Server-side caching
- Add new ServiceX transformers: ATLAS TopCPToolkit transformer (WIP), column-join transformer, ATLAS columnar CP transformer?