# PyONCat (ONCat API from Python)

## Introduction

### About

ONCat is a metadata catalog built to store information about neutron experiment data at HFIR / SNS.  The contents of the catalog can be viewed at https://oncat.ornl.gov.

An API is available to allow programmatic access to the metadata stored in the catalog.  Documentation for the API is at https://oncat.ornl.gov/build.

This notebook outlines the usage of "PyONCat", a Python module built to make communicating with the API a little easier.

<p><font color='green'>**(Questions / requests / feedback?  Please contact ONCat Support: oncat-support@ornl.gov.)**</font></p>

### Installation

The latest version of PyONCat should already be installed on https://jupyter.sns.gov as well as instrument / analysis machines, but if you are using a machine without it then it can be installed using `pip` as follows:

```
pip install https://oncat.ornl.gov/packages/pyoncat-1.0-py3-none-any.whl
```

### Notebook Prerequisite (Run This First!)

We'd like to be able to time some of the things we do later on in this notebook, so let's define a "stopwatch" to help us with that.

In [None]:
# Don't worry -- you don't need to understand exactly *how* this works right now.  Just make sure you
# indent things properly when using it.

import contextlib
import datetime

@contextlib.contextmanager
def stopwatch():
    """Wrap things in this context manager to time how long they take."""
    start = datetime.datetime.now()
    print("Started at %s..." % start)
    yield
    end = datetime.datetime.now()
    print("Finished at %s!" % end)
    print("Total time = %s " % (end - start))

Example stopwatch usage.  Note that everything indented will be timed by the stopwatch.

In [None]:
import time

with stopwatch():
    # These will be timed...
    time.sleep(1)
    time.sleep(1)

# ... and this will not.
time.sleep(1)
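
If the timed code raises an error, the stopwatch above never reaches its "Finished" line. The following is a minimal sketch of a variant that reports the elapsed time either way. The name `safe_stopwatch` and its `label` argument are made up for this example and are not part of PyONCat.

```python
import contextlib
import time


@contextlib.contextmanager
def safe_stopwatch(label="block"):
    """Time the wrapped code and report the result even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # This runs whether the wrapped code finished or raised.
        elapsed = time.perf_counter() - start
        print("%s took %.3f s" % (label, elapsed))


try:
    with safe_stopwatch("failing step"):
        raise ValueError("something went wrong")
except ValueError:
    print("The error still propagated, but the time was reported first.")
```

`time.perf_counter()` is used here because it is intended for measuring short durations, whereas `datetime.datetime.now()` reports wall-clock time.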

## Usage

### 1 - Initial Setup

#### Main `ONCat` Object Creation

In [None]:
import pyoncat

# This is a temporary "client ID" intended for use in this tutorial **only**.
CLIENT_ID = "c0686270-e983-4c71-bd0e-bfa47243a47f"

oncat = pyoncat.ONCat(
    "https://oncat.ornl.gov",
    client_id=CLIENT_ID,
    flow=pyoncat.RESOURCE_OWNER_CREDENTIALS_FLOW,
)

#### Prompting Users for their XCAMS / UCAMS Password

<p><font color='grey'>*(Here we are assuming you want to use the username you used to log in to jupyter.sns.gov.)*</font></p>

In [None]:
import getpass

username = getpass.getuser()
password = getpass.getpass()

#### Logging in With the User's Credentials

In [None]:
oncat.login(username, password)

<font color='grey'>*(Please contact ONCat Support if you would like to be issued permanent client credentials for your own work.  Note that it is possible to have clients that use passwordless "machine-to-machine" authentication.)*</font>

### 2 - Basic Facility / Instrument Information

#### Printing the Names of the Facilities Supported by ONCat

In [None]:
facilities = oncat.Facility.list()

[facility.name for facility in facilities]

#### Printing the Names of the Instruments Supported by ONCat for a Single Facility

In [None]:
instruments = oncat.Instrument.list(facility="SNS")

[instrument.name for instrument in instruments]

### 3 - Experiment Information

#### Retrieving All Experiments for an Instrument

In [None]:
experiments = oncat.Experiment.list(facility="SNS", instrument="NOM")

len(experiments)

Most people can see only a small fraction of the experiments that have been run on any given instrument: you should be able to see only the experiments for which you are a team member, plus any experiments marked as "calibration".  Instrument staff should be able to see all experiments for their instrument.

In general, experiment directories you have access to on the file system should also be available to you in ONCat.

#### Getting All the Information We Have for a Given Experiment

In [None]:
# Let's use a calibration experiment that everyone has access to.
nom_cal_exp = oncat.Experiment.retrieve(
    "IPTS-19564",
    facility="SNS",
    instrument="NOM"
)

nom_cal_exp

Note that the object we got back was an `ONCatRepresentation`.  This is just a slightly more convenient wrapper around the information we got back from the API, which has a nested, "dictionary of dictionaries" structure.

#### Accessing Fields Using Standard Python Syntax

In [None]:
nom_cal_exp.title

This is the most convenient syntax, but only top-level fields can be retrieved this way.

#### Accessing Fields Using "Square-Bracket" Syntax

In [None]:
nom_cal_exp["title"]

Square-bracket syntax is more powerful since it also works for deeply-nested fields.  Use dot-delimited paths to "drill down" into the structure.

In [None]:
nom_cal_exp["indexed.run_number.ranges"]

We can only drill down into dictionaries -- not arrays.

In [None]:
try:
    nom_cal_exp["members.name"]
except KeyError:
    print("Could not drill down!")

To get the team member names, we should access the array of `members`, and then access the `name` of each:

In [None]:
[member["name"] for member in nom_cal_exp["members"]]
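
To make the "dictionaries only" rule concrete, here is a rough sketch of how a dot-delimited lookup over plain nested dictionaries could work. It is not PyONCat's actual implementation; `get_path` is a hypothetical helper and the `experiment` data below is made up to resemble the structure we saw above.

```python
def get_path(document, path):
    """Follow a dot-delimited path through nested dictionaries."""
    current = document
    for key in path.split("."):
        if not isinstance(current, dict):
            # We reached a list (or a plain value) before the path ran out.
            raise KeyError(path)
        current = current[key]
    return current


# Made-up data shaped like an experiment entry.
experiment = {
    "title": "Example calibration",
    "indexed": {"run_number": {"ranges": ["1-3", "7"]}},
    "members": [{"name": "A. Person"}, {"name": "B. Person"}],
}

print(get_path(experiment, "indexed.run_number.ranges"))

try:
    get_path(experiment, "members.name")
except KeyError:
    print("Could not drill down!")

# Lists have to be handled one element at a time.
print([member["name"] for member in get_path(experiment, "members")])
```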

### 4 - Datafile Information

<p><font color="red">NOTE: From now on we will be dealing with datafile entries, for which we often store *large* amounts of information in the catalog.  Therefore, to make sure that your scripts always run as quickly as they can, it is important to be mindful about accidentally asking for more than you need.
    
Also, bear in mind that ONCat is a shared resource.  If a large number of expensive calls are made simultaneously, this may negatively impact the performance of other clients using the API.

Strategies to keep things as quick as possible are discussed in the following section.
</font></p>

#### Retrieving All Datafiles for an Experiment

Let's get all the datafiles for the calibration experiment we looked at previously, and time how long it takes:

In [None]:
with stopwatch():
    datafiles = oncat.Datafile.list(
        facility="SNS",
        instrument="NOM",
        experiment="IPTS-19564",
    )

With any luck that should have been quite quick.  Let's see how many datafiles were returned.

In [None]:
len(datafiles)

So, not a lot of datafiles.  Let's up the ante a bit and ask for all the datafiles in a slightly larger calibration experiment.

In [None]:
with stopwatch():
    datafiles = oncat.Datafile.list(
        facility="SNS",
        instrument="NOM",
        experiment="IPTS-21285"
    )

len(datafiles)

That probably took quite a bit longer.  But why?  It's not like that's a *huge* number of files...

Well, let's see what a single datafile contains.

In [None]:
datafiles[0]

So, quite a lot of stuff...  Returning all of that for thousands of datafiles means the database has to read a lot from disk and a lot of bytes have to be sent across the network.  Those are obviously bottlenecks.

<p><font color="grey">*(Note that there is so much stuff per file because our cataloging strategy when parsing raw datafiles is to ingest as much as we possibly can, within reason.  A rough rule of thumb is, "if it's an array then we ignore it, else let's just go ahead and shove it in the catalog".)*</font></p>

We will explore how to speed things up a little later, but for now let's take a look at what we have.

#### Accessing Information on Datafiles 

The datafile objects we get from the API can be accessed in much the same way as the experiment objects we looked at before, except different information is stored.

Every datafile has a location:

In [None]:
datafiles[0].location

If the instrument works in terms of "runs", then raw datafiles will have a corresponding run number:

In [None]:
datafiles[0]["indexed.run_number"]

We store when the file was created:

In [None]:
datafiles[0].created

We also keep track of when we cataloged the file:

In [None]:
datafiles[0].ingested

But the vast majority of the remaining info is nested inside the metadata field:

In [None]:
datafiles[0].metadata

#### Easily Seeing All Fields at a Glance

With all that metadata it can be hard to find what you're looking for.

Luckily, there is an easier way to see all the dot-delimited paths in a given datafile:

In [None]:
datafiles[0].nodes()

... and a way to print out only the paths *within* a given path, for example all paths under the "sample" node:

In [None]:
datafiles[0].nodes("metadata.entry.sample")

Any of the dot-delimited paths can then be fed back in to the square-bracket syntax.  For example, the `speedrequest1` value from the DAS logs can be retrieved like this:

In [None]:
datafiles[0]["metadata.entry.daslogs.speedrequest1.average_value"]
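
If you are curious how such a list of paths could be produced, the sketch below walks a plain nested dictionary and collects every dot-delimited path it finds. This is only an illustration: `list_paths` is a hypothetical helper, the data is made up, and the real `nodes()` method may order or select its paths differently.

```python
def list_paths(document, prefix=""):
    """Collect every dot-delimited path found in nested dictionaries."""
    paths = []
    for key, value in document.items():
        path = prefix + "." + key if prefix else key
        paths.append(path)
        if isinstance(value, dict):
            # Recurse into dictionaries only; lists are treated as values.
            paths.extend(list_paths(value, path))
    return paths


# Made-up data shaped like a (very small) datafile entry.
datafile = {
    "location": "/path/to/example_file.nxs.h5",
    "metadata": {"entry": {"sample": {"name": "Si", "mass": 1.0}}},
}

for path in list_paths(datafile):
    print(path)
```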

### 5 - Improving Performance

Now let's try to speed things up a bit by being more specific about what we ask for and using a few more of the options available to us in the API.

#### Filtering by Run Number

If we happen to know the exact run(s) we're looking for ahead of time, we can ask for fewer datafiles to be retrieved from the database.

In [None]:
# Comma-separated ranges are allowed.
run_numbers = "75400-75449,75500-75999"

with stopwatch():
    datafiles = oncat.Datafile.list(
        facility="SNS",
        instrument="NOM",
        ranges_q="indexed.run_number:" + run_numbers,
    )

len(datafiles)

Hopefully retrieving those 550 files was a lot quicker than retrieving the 1,926 we asked for earlier.
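
As a quick sanity check on that figure, the run-number string can be expanded locally. This sketch assumes each range includes both of its end points, which is consistent with the count of 550 quoted above. `expand_ranges` is a hypothetical helper, not part of PyONCat.

```python
def expand_ranges(ranges):
    """Expand a string such as "1-3,7" into [1, 2, 3, 7]."""
    runs = []
    for part in ranges.split(","):
        if "-" in part:
            first, last = part.split("-")
            # Assumption: both ends of the range are included.
            runs.extend(range(int(first), int(last) + 1))
        else:
            runs.append(int(part))
    return runs


run_numbers = "75400-75449,75500-75999"
print(len(expand_ranges(run_numbers)))
```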

#### Filtering by Fields Using "Projections"

It is also possible to ask for a much smaller sub-set of information for each datafile, using something called a projection.

A projection is just a list of the same kind of dot-delimited paths we were working with previously.

In [None]:
projection = [
    "indexed.run_number",
    "metadata.entry.sample.identifier",
    "metadata.entry.sample.name",
    "metadata.entry.sample.chemical_formula",
    "metadata.entry.sample.mass",
    "metadata.entry.sample.container_name",
    "metadata.entry.title",
    "metadata.entry.proton_charge",
    "location",
]

with stopwatch():
    datafiles = oncat.Datafile.list(
        facility="SNS",
        instrument="NOM",
        experiment="IPTS-21285",
        projection=projection,
    )

In [None]:
len(datafiles)

Even though we asked for all datafiles in the larger calibration experiment, that should have been *much* quicker to run.

You can see how the resulting datafile objects we got back are much smaller:

In [None]:
datafiles[0]
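
To get a feel for why projected results are so much smaller, here is a sketch that applies a projection to a plain nested dictionary on the client side. In the real API the filtering happens on the server before anything is sent over the network; `project` is a hypothetical helper and the data is made up.

```python
import json


def project(document, paths):
    """Keep only the given dot-delimited paths of a nested dictionary."""
    result = {}
    for path in paths:
        keys = path.split(".")
        value = document
        try:
            for key in keys:
                value = value[key]
        except (KeyError, TypeError):
            continue  # Skip paths that this document does not have.
        # Rebuild the nesting in the result, then store the value.
        target = result
        for key in keys[:-1]:
            target = target.setdefault(key, {})
        target[keys[-1]] = value
    return result


# Made-up data shaped like a (very small) datafile entry.
full = {
    "location": "/path/to/example_file.nxs.h5",
    "indexed": {"run_number": 1},
    "metadata": {
        "entry": {
            "title": "Example run",
            "duration": 12.5,
            "sample": {"name": "Si", "mass": 1.0},
        }
    },
}

small = project(full, ["indexed.run_number", "metadata.entry.sample.name", "location"])

print(small)
print(len(json.dumps(full)), "characters before,", len(json.dumps(small)), "after")
```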

Now let's use `pandas` to print out everything that was returned in a table:

In [None]:
import pandas

pandas.DataFrame(
    data=[[datafile[item] for item in projection] for datafile in datafiles],
    columns=projection,
)


#### Filtering Raw/Processed Using "Tags"

As of Feb 2019 we only catalog raw files, but soon we will be cataloging reduced/processed files.  At that point, queries like the ones above will start to return a mixture of both.

To "future-proof" your queries, you might want to consider filtering by the `type/raw` tag:

In [None]:
datafiles = oncat.Datafile.list(
    facility="SNS",
    instrument="NOM",
    experiment="IPTS-21285",
    projection=projection,
    tags=["type/raw"],
)

len(datafiles)

#### Filtering By File Extension

Furthermore, you may also want to filter by file extension.  This is best shown with examples from CG3, which is a "SPICE" instrument that writes out both `.xml` and `.dat` files:

In [None]:
xml_datafiles = oncat.Datafile.list(
    facility="HFIR",
    instrument="CG3",
    experiment="IPTS-17241",
    projection=["location"],
    tags=["type/raw"],
    exts=[".xml"]
)

len(xml_datafiles)

In [None]:
dat_datafiles = oncat.Datafile.list(
    facility="HFIR",
    instrument="CG3",
    experiment="IPTS-17241",
    projection=["location"],
    tags=["type/raw"],
    exts=[".dat"]
)

len(dat_datafiles)

In [None]:
all_datafiles = oncat.Datafile.list(
    facility="HFIR",
    instrument="CG3",
    experiment="IPTS-17241",
    projection=["location"],
    tags=["type/raw"],
)

len(all_datafiles)

In [None]:
assert len(all_datafiles) == len(xml_datafiles) + len(dat_datafiles)
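
If you already have a list of locations in hand, the same breakdown can be checked locally. The sketch below counts made-up file paths by extension using only the standard library. `count_by_extension` is a hypothetical helper, and the paths are invented for illustration.

```python
import collections
import os.path


def count_by_extension(locations):
    """Count file paths by their final extension (e.g. ".xml")."""
    return collections.Counter(
        os.path.splitext(location)[1] for location in locations
    )


# Made-up paths; real locations would come from each datafile's "location".
locations = [
    "/example/exp1/Datafiles/scan0001.xml",
    "/example/exp1/Datafiles/scan0002.xml",
    "/example/exp1/Datafiles/scan0001.dat",
]

counts = count_by_extension(locations)
print(counts[".xml"], counts[".dat"])

assert sum(counts.values()) == len(locations)
```

Note that `os.path.splitext` only looks at the final extension, so a name ending in `.nxs.h5` would be counted under `.h5`.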