# Workflow: processing, searching and retrieving observations

This tutorial demonstrates how OpenGHG can be used to process new measurement data, search the data held in the object store, and retrieve it for analysis and visualisation.

### Check installation

This tutorial assumes that you have installed `openghg`. To check that the installation was successful, open an `ipython` console and try to import the module.

In a terminal type:

```bash
$ ipython
```

Then import `openghg` and print the version string of the installed package. If you see output like that below, `openghg` is installed correctly.

```ipython
In [1]: import openghg
In [2]: openghg.__version__
Out[2]: '0.0.1'
```

If you get an ``ImportError`` please go back to the [install section of the documentation](https://docs.openghg.org/install.html).
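
The same check can be scripted, for example at the top of a notebook. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this illustration and is not part of OpenGHG.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("openghg")
if version is None:
    print("openghg does not appear to be installed")
else:
    print(f"openghg {version} is installed")
```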

### Jupyter notebooks

If you haven't used Jupyter notebooks before please see [this introduction](https://realpython.com/jupyter-notebook-introduction/).

## 1. Setting up an object store

The OpenGHG platform uses what's called an *object store* to save data. Any saved data has been processed into a standardised format, assigned universally unique identifiers (UUIDs) and stored alongside associated metadata (such as site and species details). Storing data in this way allows for fast retrieval and efficient searching.

When using OpenGHG on a local machine the location of the object store is set using an `OPENGHG_PATH` environment variable (explained below) and this can be any directory on your local system.

For this tutorial, we will create a temporary object store which we can add data to. This path is fine for this purpose but as it is a temporary directory it may not survive a reboot of the computer. 

The `OPENGHG_PATH` environment variable can be set up in the following way.

In [1]:
import os
import tempfile

tmp_dir = tempfile.TemporaryDirectory()
os.environ["OPENGHG_PATH"] = tmp_dir.name   # temporary directory

When creating your own longer term object store we recommend a path such as ``~/openghg_store``, which will create the object store in a directory called ``openghg_store`` in your home directory. To make this location permanent, add the following line to your `~/.bashrc` or `~/.bash_profile` file, depending on the system being used:

```bash
 export OPENGHG_PATH="$HOME/openghg_store"
```
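
The same setup can be done from Python at the start of a session. This is a minimal sketch: `use_object_store` is a hypothetical helper, not an OpenGHG function, and it assumes the variable is set before OpenGHG is used to read or write any data.

```python
import os
from pathlib import Path


def use_object_store(path):
    """Create the directory if needed and point OPENGHG_PATH at it."""
    store = Path(path).expanduser().resolve()
    store.mkdir(parents=True, exist_ok=True)
    os.environ["OPENGHG_PATH"] = str(store)
    return store
```

Calling `use_object_store("~/openghg_store")` would then create `openghg_store` in your home directory if it does not already exist.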

## 2. Adding and standardising data

### Data types

Within OpenGHG there are several data types which can be processed and stored within the object store. These include data from the AGAGE, DECC, NOAA, LondonGHG and BEACO2N networks.

When uploading a new data file, the data type must be specified alongside some additional details so OpenGHG can recognise the format and the correct standardisation can occur. The details needed will vary by the type of data being uploaded but will often include the measurement reference (e.g. a site code) and the name of any network.

For the full list of accepted observation inputs and data types, there is a summary function which can be called:

In [2]:
from openghg.standardise import summary_data_types

summary = summary_data_types()

## UNCOMMENT THIS CODE TO SHOW ALL ENTRIES
# import pandas as pd; pd.set_option('display.max_rows', None)

summary

| | Site code | Long name | Data type | Platform |
|---|---|---|---|---|
| 0 | BTT | BT Tower | BTT | surface site |
| 1 | | | CRDS | surface site |
| 2 | ASP | aspendale | GCWERKS | surface site |
| 3 | BRI | bristol | GCWERKS | surface site |
| 4 | BSD | bilsdale | GCWERKS | surface site |
| ... | ... | ... | ... | ... |
| 151 | WILSHIRECRESTELEMENTARYSCHOOL | Wilshire Crest Elementary School | BEACO2N | surface site |
| 152 | NPL | National Physical Laboratory | NPL | surface site |
| 153 | | | AQMESH | surface site |
| 154 | | | GLASGOW_PICARRO | surface site |


Note: there may be multiple data types applicable for a given site. This can depend on various factors, including the instrument type used to measure the data. For example, for Bilsdale ("BSD"):

In [3]:
summary[summary["Site code"] == "BSD"]

| | Site code | Long name | Data type | Platform |
|---|---|---|---|---|
| 4 | BSD | bilsdale | GCWERKS | surface site |
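
Since the summary has named columns and is filtered with a boolean mask in the cell above, other pandas-style selections should work in the same way. The sketch below rebuilds a few of the rows shown earlier by hand, so it runs without OpenGHG, and assumes `summary` behaves like a pandas DataFrame. It lists the site codes for one data type and the data types recorded for each site.

```python
import pandas as pd

# A handful of rows copied from the summary output shown earlier.
summary_example = pd.DataFrame(
    {
        "Site code": ["BTT", "ASP", "BRI", "BSD", "NPL"],
        "Long name": ["BT Tower", "aspendale", "bristol", "bilsdale", "National Physical Laboratory"],
        "Data type": ["BTT", "GCWERKS", "GCWERKS", "GCWERKS", "NPL"],
        "Platform": ["surface site"] * 5,
    }
)

# Site codes that can be processed with a given data type.
is_gcwerks = summary_example["Data type"] == "GCWERKS"
gcwerks_sites = summary_example.loc[is_gcwerks, "Site code"].tolist()

# Data types listed for each site; a site with several entries would show several values.
types_per_site = summary_example.groupby("Site code")["Data type"].agg(list)

print(gcwerks_sites)
print(types_per_site["BSD"])
```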


### DECC network

We will start by adding data to the object store from a surface site within the DECC network. Here we have accessed a subset of data from the Bilsdale site (site code "BSD") in the UK.

In [4]:
bsd_filepaths = ["../data/DECC/bsd.picarro.1minute.42m.min.dat", 
                 "../data/DECC/bsd.picarro.1minute.108m.min.dat", 
                 "../data/DECC/bsd.picarro.1minute.248m.min.dat"]
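
Before passing paths on for processing it can be worth confirming that each file exists, since a mistyped path is an easy error to make. A minimal sketch using only the standard library follows; `missing_files` is a hypothetical helper, not part of OpenGHG.

```python
from pathlib import Path


def missing_files(filepaths):
    """Return the entries of `filepaths` that do not point to an existing file."""
    return [str(p) for p in filepaths if not Path(p).is_file()]


example_paths = [
    "../data/DECC/bsd.picarro.1minute.42m.min.dat",
    "../data/DECC/bsd.picarro.1minute.108m.min.dat",
    "../data/DECC/bsd.picarro.1minute.248m.min.dat",
]

not_found = missing_files(example_paths)
if not_found:
    print("Could not find:", not_found)
```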

As this data is measured in situ, it is classed as surface site data and we need to use the `ObsSurface` class to interpret it. We can pass our list of files to the `read_file` method of the `ObsSurface` class, also providing details on:
 - site code - `"BSD"` for Bilsdale
 - type of data we want to process, known as the data type - `"CRDS"`
 - network - `"DECC"`

This is shown below:

In [5]:
from openghg.store import ObsSurface

decc_results = ObsSurface.read_file(filepath=bsd_filepaths, data_type="CRDS", site="bsd", network="DECC")

Processing: bsd.picarro.1minute.248m.min.dat: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.91it/s]


In [6]:
print(decc_results)

defaultdict(<class 'dict'>, {'processed': {'bsd.picarro.1minute.42m.min.dat': {'ch4': '9cbc7510-c5c0-482c-9d9b-eadb2bbb513d', 'co2': 'e9c5d3fc-1a3b-4501-8cd5-1f81fbd40058', 'co': 'b9cdb56e-1e7d-4f3c-9a52-f75fa5271585'}, 'bsd.picarro.1minute.108m.min.dat': {'ch4': '89903022-1841-4316-aec7-acf09206edb0', 'co2': '10022a67-ac8b-42a9-8775-fef8b8cc92c6', 'co': 'eacb52fd-5d6f-423f-a311-cab669522e6a'}, 'bsd.picarro.1minute.248m.min.dat': {'ch4': '327ad8bc-1902-4d71-af07-fd3a3cb52f85', 'co2': 'ee83c1e4-b2f2-4456-bb65-81b65ca01ef2', 'co': '58c17188-87f8-444a-8a4b-de5091c4d21f'}}})


This extracts the data (and metadata) from the supplied files, standardises them and adds them to the object store we created.

The returned `decc_results` will give us a dictionary of how the data has been stored. The data itself may have been split into different entries, each one stored with a unique ID (UUID). Each entry is known as a *Datasource* (see below for a note on Datasources). The `decc_results` output includes details of the processed data and tells us that the data has been stored correctly. This will also tell us if any errors have been encountered when trying to access and standardise this data.
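
Because the result is a nested dictionary (file name, then species, then UUID), it is straightforward to flatten into rows for inspection. The sketch below uses a hand-copied subset of the output above rather than calling OpenGHG. `list_datasources` is a hypothetical helper, and the nesting under a `"processed"` key is assumed from that printed output.

```python
import uuid

# Subset of the `decc_results` output printed above.
example_results = {
    "processed": {
        "bsd.picarro.1minute.42m.min.dat": {
            "ch4": "9cbc7510-c5c0-482c-9d9b-eadb2bbb513d",
            "co2": "e9c5d3fc-1a3b-4501-8cd5-1f81fbd40058",
            "co": "b9cdb56e-1e7d-4f3c-9a52-f75fa5271585",
        },
    }
}


def list_datasources(results):
    """Flatten the nested results into (filename, species, uuid) tuples."""
    rows = []
    for filename, species_to_uuid in results.get("processed", {}).items():
        for species, datasource_uuid in species_to_uuid.items():
            # uuid.UUID raises ValueError if the string is not a valid UUID.
            uuid.UUID(datasource_uuid)
            rows.append((filename, species, datasource_uuid))
    return rows


for row in list_datasources(example_results):
    print(row)
```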

### AGAGE data

Another data type which can be added is data from the AGAGE network. The functions that process the AGAGE data expect data to have an accompanying precisions file. For each data file we create a tuple with the data filename and the precisions filename. *Note: A simpler method of uploading these file types is planned.*

We must create a `tuple` associated with each data file to link this to a precision file:

```python
list_of_tuples = [(data1_filepath, precision1_filepath), (data2_filepath, precision2_filepath), ...]
```

The data being uploaded here is from the Cape Grim station in Australia, site code "CGO".

In [7]:
agage_tuples = [('../data/AGAGE/capegrim-medusa.18.C', '../data/AGAGE/capegrim-medusa.18.precisions.C')]
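
When there are many data files, the tuples can be built programmatically. The sketch below assumes every precisions file sits next to its data file and follows the naming pattern of the example above (`name.C` paired with `name.precisions.C`). That convention is inferred from this single example, and `pair_with_precisions` is a hypothetical helper rather than an OpenGHG function.

```python
from pathlib import PurePosixPath


def pair_with_precisions(data_filepaths):
    """Pair each data file path with its assumed precisions file path."""
    pairs = []
    for data_filepath in data_filepaths:
        path = PurePosixPath(data_filepath)
        # e.g. "capegrim-medusa.18.C" -> "capegrim-medusa.18.precisions.C"
        precision_name = f"{path.stem}.precisions{path.suffix}"
        pairs.append((str(data_filepath), str(path.with_name(precision_name))))
    return pairs


pairs = pair_with_precisions(["../data/AGAGE/capegrim-medusa.18.C"])
print(pairs)
```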

We can add these files to the object store in the same way as the DECC data by including the right keywords:
 - site code - `"CGO"` for Cape Grim
 - data type - `"GCWERKS"`
 - network - `"AGAGE"`

In [8]:
from openghg.store import ObsSurface

agage_results = ObsSurface.read_file(filepath=agage_tuples, data_type="GCWERKS", site="CGO", network="AGAGE")



Processing: capegrim-medusa.18.C: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.89s/it]



When viewing `agage_results` there will be a large number of Datasource UUIDs shown, due to the large number of gases in each data file:

In [9]:
agage_results

defaultdict(dict,
            {'processed': {'capegrim-medusa.18.C': {'nf3_70m': '899cf4d4-024a-4618-a9a7-29f7ef0c0bb1',
               'cf4_70m': '07b0dfff-a568-44d7-a762-18b22fcc8c49',
               'c2f6_70m': '52ca8407-e280-4f96-b1c1-f4cd24ff1180',
               'c3f8_70m': '06321a37-514d-4dd4-9763-9dcf590bcddd',
               'c4f8_70m': 'c58a518d-df98-43da-a595-290a9bf16497',
               'c4f10_70m': '743aebc3-c603-49f1-bd1c-00e8df9fa3f1',
               'c6f14_70m': '143863f4-b780-44d2-a305-31b2a9f74bd5',
               'sf6_70m': '6c9e2486-dce7-424a-8ffd-c4f440413c46',
               'so2f2_70m': '005878a6-e4e3-471c-972e-f118af78f0e7',
               'sf5cf3_70m': '1fabe3c8-5d7a-4dc4-92e0-40a69ef10bad',
               'hfc23_70m': '8b03cf68-70df-42eb-ad4f-c4594770dc5c',
               'hfc32_70m': '5e4e376a-12ed-48fb-a513-8c3dff77a7ff',
               'hfc125_70m': '5a950636-f466-4afc-b728-0627597ac043',
               'hfc134a_70m': 'c316ef73-85cd-46b1-a827-fb35077ac91c'

#### A note on Datasources

Datasources are objects stored in the object store that hold the data and metadata associated with each measurement we upload to the platform.

For example, if we upload a file that contains readings for three gas species from a single site at a specific inlet height, OpenGHG will assign this data to three different Datasources, one for each species. Metadata such as the site, inlet height, species and network are stored alongside the measurements for easy searching.

Datasources can also handle multiple versions of data from a single site, so if scales or other factors change multiple versions may be stored for easy future comparison.
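
To make the idea concrete, the toy model below splits one upload containing three species into three records, each with its own UUID and searchable metadata. It only illustrates the concept described here. `ToyDatasource` and `split_by_species` are invented for this sketch and do not reflect how OpenGHG implements Datasources internally.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyDatasource:
    """Stand-in for a Datasource: one species from one site and inlet."""

    metadata: dict
    datasource_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def split_by_species(species_list, site, inlet, network):
    """Create one ToyDatasource per species, keyed by species name."""
    return {
        species: ToyDatasource(
            metadata={"site": site, "inlet": inlet, "network": network, "species": species}
        )
        for species in species_list
    }


datasources = split_by_species(["ch4", "co2", "co"], site="bsd", inlet="248m", network="decc")
for species, datasource in datasources.items():
    print(species, datasource.datasource_id, datasource.metadata)
```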

## 3. Searching for data

### Visualising the object store

Now that we have added data to our created object store, we can view the objects within it in a simple force graph model. To do this we use the `visualise_store` function from the `objectstore` submodule. Note that the cell may take a few moments to load as the force graph is created.

In the force graph the central blue node is the `ObsSurface` node. Associated with this node are all the data processed by it. The next nodes in the topology are the networks, shown in green. In the graph you will see `DECC` and `AGAGE` nodes from the data files we have added. From these you'll see site nodes in red and then individual Datasources in orange.

*Note: The object store visualisation created by this function is commented out here and won't be visible in the documentation but can be uncommented and run when you use the notebook version.*

In [10]:
from openghg.objectstore import visualise_store

# visualise_store()

Now we know we have this data in the object store we can search it and retrieve data from it.

### Searching the object store

We can search the object store by property using the `search(...)` function.

For example we can find all sites which have measurements for carbon tetrafluoride ("cf4") using the `species` keyword:

In [11]:
from openghg.retrieve import search
search(species="cf4")

Site: CGO
---------
cf4 at 70m


We could also look for details of all the data measured at the Bilsdale ("BSD") site using the `site` keyword:

In [12]:
search(site="bsd")

Site: BSD
---------
ch4 at 42m, 108m, 248m
co2 at 42m, 108m, 248m
co at 42m, 108m, 248m


For this site, the output lists each species along with the inlet heights at which it was measured.
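
The grouping in this output (one line per species, listing its inlets) can be reproduced with a few lines of plain Python. The sketch below is a toy search over hand-written metadata records mirroring the data added in this tutorial. It is not how OpenGHG's `search` is implemented, and `toy_search` is a hypothetical name.

```python
# Hand-written metadata records mirroring the data added in this tutorial.
records = [
    {"site": "bsd", "species": species, "inlet": inlet}
    for inlet in ("42m", "108m", "248m")
    for species in ("ch4", "co2", "co")
]
records.append({"site": "cgo", "species": "cf4", "inlet": "70m"})


def toy_search(records, **terms):
    """Return lines like 'ch4 at 42m, 108m' for records matching all keyword terms."""
    inlets_by_species = {}
    for record in records:
        if all(record.get(key) == value for key, value in terms.items()):
            inlets_by_species.setdefault(record["species"], []).append(record["inlet"])
    return [f"{species} at {', '.join(inlets)}" for species, inlets in inlets_by_species.items()]


print(toy_search(records, site="bsd"))
print(toy_search(records, species="cf4"))
```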

## 4. Retrieving data

To retrieve the standardised data from the object store there are several functions we can use which depend on the type of data we want to access.

To access the surface data we have added so far we can use the `get_obs_surface` function and pass keywords for the site code, species and inlet height to retrieve our data.

In this case we want to extract the carbon monoxide ("co") data from the Bilsdale ("BSD") site measured at the "248m" inlet:

In [13]:
from openghg.retrieve import get_obs_surface

obs_data = get_obs_surface(site="bsd", species="co", inlet="248m")

If we view our returned `obs_data` variable this will contain:

 - `data` - The standardised data (accessed using e.g. `obs_data.data`). This is returned as an [xarray Dataset](https://xarray.pydata.org/en/stable/generated/xarray.Dataset.html).
 - `metadata` - The associated metadata (accessed using e.g. `obs_data.metadata`).

In [14]:
obs_data

ObsData(data=<xarray.Dataset>
Dimensions:                    (time: 142)
Coordinates:
  * time                       (time) datetime64[ns] 2014-01-30T11:12:30 ... ...
Data variables:
    mf                         (time) float64 202.4 203.2 205.1 ... 114.5 114.2
    mf_variability             (time) float64 5.265 6.307 8.518 ... 7.339 5.405
    mf_number_of_observations  (time) float64 26.0 26.0 25.0 ... 23.0 24.0 23.0
Attributes: (12/23)
    data_owner:           Simon O'Doherty
    data_owner_email:     s.odoherty@bristol.ac.uk
    inlet_height_magl:    248m
    comment:              Cavity ring-down measurements. Output from GCWerks
    long_name:            bilsdale
    conditions_of_use:    Ensure that you contact the data owner at the outse...
    ...                   ...
    sampling_period:      60
    inlet:                248m
    port:                 9
    type:                 air
    network:              decc
    scale:                WMO-X2014A, metadata={'site': 'bsd'

We can now plot the retrieved data. First we tell `matplotlib` that we are plotting inside a Jupyter notebook; this ensures a plot with controls is created.

In [15]:
%matplotlib notebook

In [16]:
example_data = obs_data.data
mol_frac = example_data.mf
mol_frac.plot()

[<matplotlib.lines.Line2D at 0x7f7fb5e6ef20>]

---

#### Cleanup

If you used `tmp_dir` as the location for your object store at the start of the tutorial, you can run the cell below to remove any files that were created. This makes sure any persistent data is refreshed when the notebook is re-run.

In [17]:
tmp_dir.cleanup()
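
In a script rather than a notebook, the temporary store can be wrapped in a `with` block so that it is removed automatically, even if an error occurs part-way through. A minimal sketch using only the standard library follows; the comment marks where the OpenGHG calls from this tutorial would go.

```python
import os
import tempfile

# Remember any existing setting so it can be restored afterwards.
previous_path = os.environ.get("OPENGHG_PATH")

with tempfile.TemporaryDirectory() as store_path:
    os.environ["OPENGHG_PATH"] = store_path
    # ... add, search and retrieve data here ...
    store_existed = os.path.isdir(store_path)

# The directory and its contents are deleted on leaving the block.
store_removed = not os.path.exists(store_path)

# Restore whatever OPENGHG_PATH was set to before.
if previous_path is None:
    os.environ.pop("OPENGHG_PATH", None)
else:
    os.environ["OPENGHG_PATH"] = previous_path
```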