# Beginner workflow

This tutorial walks through a beginner workflow: processing data, visualising the object store, and retrieving and plotting the stored data.

### Check installation

This tutorial assumes that you have installed `openghg`. To check that the installation was successful, open an `ipython` console and import `openghg`.

In a terminal, type:

```bash
ipython
```

Then import `openghg` and print the version string of the version you have installed. If you see something like the output below, `openghg` is installed correctly.

```ipython
In [1]: import openghg
In [2]: openghg.__version__
Out[2]: '0.0.1'
```

If you get an ``ImportError`` please go back to the install section of the documentation.
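
If you would rather run this check from a script than from an interactive console, the standard library can report whether a package is importable without raising an error. The sketch below uses `importlib.util.find_spec`; the helper name `is_installed` is our own and is not part of `openghg`.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if `package` can be found on the current Python path."""
    # find_spec returns None for a top-level package that cannot be found
    return importlib.util.find_spec(package) is not None


if is_installed("openghg"):
    print("openghg is installed")
else:
    print("openghg not found, please revisit the install section")
```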




### Notebooks

If you haven't used Jupyter notebooks before please see [this introduction](https://realpython.com/jupyter-notebook-introduction/).

## 1. Setting up an object store

To create a new object store we can start adding data to, we need to set up an `OPENGHG_PATH` environment variable. 

We recommend a path such as ``~/openghg_store`` which will create the object store in your home directory in a directory called ``openghg_store``.

If you want this to be a permanent location, the variable can be set in your ``~/.bashrc`` or ``~/.bash_profile`` file, depending on the system being used.

For this tutorial we have set up a temporary object store at ``/tmp/openghg_store``. For the purposes of this tutorial this path is fine but as it is a temporary directory it may not survive a reboot of the computer. 

In [1]:
import os
tmp_dir = "/tmp/openghg_store"
os.environ["OPENGHG_PATH"] = tmp_dir
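
As an alternative to a fixed path under `/tmp`, Python can create and later remove the temporary directory for us. This is a minimal sketch using the standard library's `tempfile.TemporaryDirectory`; the helper name `use_temporary_store` and the `openghg_store_` prefix are our own choices for illustration.

```python
import os
import tempfile
from pathlib import Path


def use_temporary_store() -> tempfile.TemporaryDirectory:
    """Create a fresh temporary directory and point OPENGHG_PATH at it."""
    store = tempfile.TemporaryDirectory(prefix="openghg_store_")
    os.environ["OPENGHG_PATH"] = store.name
    return store


store = use_temporary_store()
print(os.environ["OPENGHG_PATH"])   # path of the new directory
print(Path(store.name).is_dir())    # the directory exists on disk
```

Calling `store.cleanup()` at the end of the session removes the directory and everything in it.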

In [2]:
from openghg.localclient import get_obs_surface, RankSources
from openghg.retrieve import search

import glob
from pathlib import Path



## 2. Adding and standardising data

### DECC network

We can now start adding data to our object store and standardising it into a common format. Here we use a subset of data from the Bilsdale site in the UK, which is part of the DECC network, to demonstrate passing a list of file paths:

In [3]:
bsd_filepaths = ["../data/DECC/bsd.picarro.1minute.42m.min.dat", "../data/DECC/bsd.picarro.1minute.108m.min.dat", "../data/DECC/bsd.picarro.1minute.248m.min.dat"]


We can pass these file paths to the `ObsSurface.read_file` function and must also provide details on:
 - site code - `"BSD"` for Bilsdale
 - type of data we want to process, known as the data type - `"CRDS"`
 - network - `"DECC"`

In [4]:
from openghg.store import ObsSurface

decc_results = ObsSurface.read_file(filepath=bsd_filepaths, data_type="CRDS", site="bsd", network="DECC", overwrite=True)

Processing: bsd.picarro.1minute.248m.min.dat: 100%|██████████| 3/3 [00:01<00:00,  2.71it/s]


In [5]:
print(decc_results)

defaultdict(<class 'dict'>, {'processed': {'bsd.picarro.1minute.42m.min.dat': {'ch4': '47a7f63e-a74f-4334-abce-b6e1921c7832', 'co2': '439185e8-9074-44a2-8d82-74f423211fcc', 'co': '9d997a20-0d99-4a54-9919-d45b1c8c4eee'}, 'bsd.picarro.1minute.108m.min.dat': {'ch4': 'f2a68ed8-183f-4c13-b701-f3c1fbe9dcd8', 'co2': '3aceab22-6d65-451b-b54a-923f2ac3a093', 'co': '6015e48c-cded-47aa-8a43-cf2d7d133388'}, 'bsd.picarro.1minute.248m.min.dat': {'ch4': 'f6fb2484-f129-4274-a1a4-05124dcfdfe2', 'co2': '888f4fe7-d0b0-4b9e-ac58-99ed4e10697b', 'co': '2425ab94-ddb7-47fa-8918-f4b9e6f72189'}}})


This adds the data to the object store and standardises it. The returned `decc_results` is a dictionary containing the UUIDs (universally unique identifiers) of the Datasources the data has been assigned to, which tells us that the data has been processed and stored correctly.
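
The nested results dictionary can be hard to read when printed in one line. As a sketch, the hypothetical helper below (`list_datasource_uuids` is our own name) flattens it into one row per file and species. It assumes only the structure visible in the printed output: a `'processed'` key mapping filenames to species/UUID pairs.

```python
def list_datasource_uuids(results):
    """Flatten a results dictionary into (filename, species, uuid) rows."""
    rows = []
    for filename, species_uuids in results.get("processed", {}).items():
        for species, uuid in species_uuids.items():
            rows.append((filename, species, uuid))
    return rows


# A small excerpt of the output printed above
example_results = {
    "processed": {
        "bsd.picarro.1minute.42m.min.dat": {
            "ch4": "47a7f63e-a74f-4334-abce-b6e1921c7832",
            "co2": "439185e8-9074-44a2-8d82-74f423211fcc",
            "co": "9d997a20-0d99-4a54-9919-d45b1c8c4eee",
        }
    }
}

for filename, species, uuid in list_datasource_uuids(example_results):
    print(f"{filename}  {species:>4}  {uuid}")
```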

## A note on Datasources

Datasources are objects stored in the object store that hold the data and metadata associated with each measurement we upload to the platform.

For example, if we upload a file that contains readings for three gas species from a single site at a specific inlet height, OpenGHG will assign this data to three different Datasources, one for each species. Metadata such as the site, inlet height, species and network are stored alongside the measurements for easy searching.

Datasources can also handle multiple versions of data from a single site, so if scales or other factors change, multiple versions may be stored for easy future comparison.
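
To make the one-Datasource-per-species idea concrete, here is a purely conceptual sketch. `ToyDatasource` and `assign_datasources` are hypothetical names invented for illustration; they are not OpenGHG classes and do not reflect how the library stores data internally.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class ToyDatasource:
    """Illustrative stand-in only, not OpenGHG's Datasource class."""
    metadata: dict
    uuid: str = field(default_factory=lambda: str(uuid4()))


def assign_datasources(site, inlet, network, species_list):
    """Create one toy Datasource per species, keyed by species name."""
    return {
        species: ToyDatasource(
            metadata={"site": site, "inlet": inlet, "network": network, "species": species}
        )
        for species in species_list
    }


# One file with three species at a single inlet gives three Datasources
sources = assign_datasources("bsd", "248m", "decc", ["ch4", "co2", "co"])
for species, source in sources.items():
    print(species, source.uuid, source.metadata["inlet"])
```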

### AGAGE data

Another data type which can be added to the platform and standardised is data from the AGAGE network. The functions that process the AGAGE data expect data to have an accompanying precisions file. For each data file we create a tuple with the data filename and the precisions filename. *Note: A simpler method of uploading these file types is planned.*

Each `tuple` links one data file to its precisions file:

```python
list_of_tuples = [(data1_filepath, precision1_filepath), (data2_filepath, precision2_filepath), ...]
```

The data being uploaded here is from the Cape Grim station in Australia, site code "CGO".

In [6]:
agage_tuples = [('../data/AGAGE/capegrim-medusa.18.C', '../data/AGAGE/capegrim-medusa.18.precisions.C')]
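
When there are many data files, writing each tuple by hand becomes tedious. The sketch below builds the pairs automatically. It assumes that every precisions file follows the naming pattern seen above (`.precisions` inserted before the final extension), which may not hold for all AGAGE files; `pair_with_precisions` is our own helper name.

```python
from pathlib import PurePosixPath


def pair_with_precisions(data_filepaths):
    """Pair each data file with a precisions file named <stem>.precisions<suffix>."""
    pairs = []
    for filepath in data_filepaths:
        # PurePosixPath keeps forward slashes regardless of platform
        path = PurePosixPath(filepath)
        # e.g. capegrim-medusa.18.C -> capegrim-medusa.18.precisions.C
        precisions = path.with_name(f"{path.stem}.precisions{path.suffix}")
        pairs.append((str(path), str(precisions)))
    return pairs


print(pair_with_precisions(["../data/AGAGE/capegrim-medusa.18.C"]))
```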

We can add these files to the object store in the same way as the DECC data but including the right keywords:
 - site code - `"CGO"` for Cape Grim
 - data type - `"GCWERKS"`
 - network - `"AGAGE"`

In [7]:
from openghg.store import ObsSurface

agage_results = ObsSurface.read_file(filepath=agage_tuples, data_type="GCWERKS", site="CGO", network="AGAGE", overwrite=True)

Processing: capegrim-medusa.18.C:   0%|          | 0/1 [00:00<?, ?it/s]

  data[gas_name + " integration_flag"] = (data[column].str[1] != "-").astype(int)
  data[gas_name + " status_flag"] = (data[column].str[0] != "-").astype(int)


Processing: capegrim-medusa.18.C: 100%|██████████| 1/1 [00:02<00:00,  2.41s/it]


When viewing `agage_results` there will be a large number of Datasource UUIDs shown, because each data file contains a large number of gases.

In [8]:
agage_results

defaultdict(dict,
            {'processed': {'capegrim-medusa.18.C': {'nf3_70m': '8d61807e-8e35-4bc1-92a8-c8d9adf5a80e',
               'cf4_70m': 'b1bfe502-346b-43b5-9ca2-a516ac0e738c',
               'c2f6_70m': 'efb986ae-c303-4923-95ba-68bca6ac305c',
               'c3f8_70m': 'b0bff7ae-5a5c-4ea6-8a6d-2d2ef454ebef',
               'c4f8_70m': 'f6b933b2-600e-4fdb-a7f0-85abe921de22',
               'c4f10_70m': '63652c62-c584-40ac-af3a-ae2759bc4cdd',
               'c6f14_70m': '2d22e29f-c440-4a26-8cb2-7b7b54dd7f22',
               'sf6_70m': 'b3539d8e-ce80-4db4-bd7a-63ae91f3810c',
               'so2f2_70m': 'a4208503-7e43-473c-9289-520048612c69',
               'sf5cf3_70m': '5eeef776-dbdc-40c7-879f-1d7544f5df3a',
               'hfc23_70m': '5b3769c5-9338-4131-8ac8-b8a627a67425',
               'hfc32_70m': '80e3988a-6b53-44c7-83ea-ab583dbcd831',
               'hfc125_70m': 'fa96a8c6-a5bb-4636-921d-bd2cf1567d82',
               'hfc134a_70m': '6476f5bc-839a-46aa-913d-7aea1e1a3a42'

## 3. Visualising the object store

Now that we have added data to our object store, we can view the objects within it in a simple force graph model. To do this we use the `visualise_store` function from the `objectstore` submodule. Note that the cell may take a few moments to load as the force graph is created.

In the force graph the central blue node is the `ObsSurface` node. Associated with this node is all the data processed by it. The next nodes in the topology are the networks, shown in green. In the graph you will see `DECC` and `AGAGE` nodes. From these you'll see site nodes in red and then individual Datasources in orange.
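
The hierarchy described here (ObsSurface, then network, then site, then Datasource) can be written down as a list of parent-to-child links. This is only a sketch of the topology, not of how `openghg` builds the graph; the record fields and the Datasource labels are hypothetical.

```python
def build_edges(records):
    """Return the parent -> child links for a list of datasource records."""
    edges = set()
    for record in records:
        edges.add(("ObsSurface", record["network"]))
        edges.add((record["network"], record["site"]))
        edges.add((record["site"], record["datasource"]))
    return sorted(edges)


# Hypothetical records, labelled after the sites added in this tutorial
records = [
    {"network": "DECC", "site": "bsd", "datasource": "co_248m"},
    {"network": "DECC", "site": "bsd", "datasource": "ch4_248m"},
    {"network": "AGAGE", "site": "cgo", "datasource": "sf6_70m"},
]

for parent, child in build_edges(records):
    print(f"{parent} -> {child}")
```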

#### Note

The object store visualisation created by this function won't be visible in the documentation, but it can be run when you use the notebook version.


In [9]:
from openghg.objectstore import visualise_store

visualise_store()

Now we know we have this data in the object store we can search it and retrieve data from it.

## 4. Retrieving data 

To retrieve the standardised data from the object store we can use the `get_obs_surface` function. This allows us to retrieve and view the data stored.

In [10]:
from openghg.localclient import get_obs_surface

data = get_obs_surface(site="bsd", species="co", network="DECC", inlet="248m")

If we view `data` we expect an `ObsData` object to have been returned.

In [11]:
data

ObsData(data=<xarray.Dataset>
Dimensions:                    (time: 142)
Coordinates:
  * time                       (time) datetime64[ns] 2014-01-30T11:12:30 ... ...
Data variables:
    mf                         (time) float64 202.4 203.2 205.1 ... 114.5 114.2
    mf_variability             (time) float64 5.265 6.307 8.518 ... 7.339 5.405
    mf_number_of_observations  (time) float64 26.0 26.0 25.0 ... 23.0 24.0 23.0
Attributes: (12/23)
    data_owner:           Simon O'Doherty
    data_owner_email:     s.odoherty@bristol.ac.uk
    inlet_height_magl:    248m
    comment:              Cavity ring-down measurements. Output from GCWerks
    long_name:            bilsdale
    Conditions of use:    Ensure that you contact the data owner at the outse...
    ...                   ...
    sampling_period:      60
    inlet:                248m
    port:                 9
    type:                 air
    network:              decc
    scale:                WMO-X2014A, metadata={'site': 'bsd'

First we tell `matplotlib` that we are plotting inside a Jupyter notebook. This ensures a plot with controls is created.

In [12]:
%matplotlib notebook

In [13]:
example_data = data.data
mol_frac = example_data.mf
mol_frac.plot()

<IPython.core.display.Javascript object>

[<matplotlib.lines.Line2D at 0x7fd408e53e20>]

## 5. Ranking data

The date ranges of the data from the different Bilsdale inlets overlap. If we want to easily retrieve the highest quality data from Bilsdale over a range of dates, we don't want to have to repeatedly check which inlet/instrument was the correct one for a given date range. This problem is solved using ranking.

A given inlet on a specific instrument at a site can be given a rank for a daterange. To do this we use the `RankSources` class from the `localclient` submodule.

## Get ranking data

In [14]:
r = RankSources()
r.get_sources(site="bsd", species="co")

{'co_248m_picarro': {'rank_data': {'2016-01-01-00:00:00+00:00_2018-01-01-00:00:00+00:00': 1},
  'data_range': '2014-01-30T11:12:30_2020-12-01T22:31:30'},
 'co_42m_picarro': {'rank_data': 'NA',
  'data_range': '2014-01-30T11:12:30_2020-12-01T22:31:30'},
 'co_108m_picarro': {'rank_data': 'NA',
  'data_range': '2014-01-30T11:12:30_2020-12-01T22:31:30'}}

## Set ranking data

In the output of `get_sources`, a `rank_data` value of `'NA'` means that no rank has been set for that source. Say we want to prioritise CO data from different inlet heights at Bilsdale between certain dates. We can set the ranks like so:

In [15]:
r.set_rank(key="co_248m_picarro", rank=1, start_date="2016-01-01", end_date="2018-01-01")
r.set_rank(key="co_42m_picarro", rank=1, start_date="2018-01-02", end_date="2019-05-30")
r.set_rank(key="co_108m_picarro", rank=1, start_date="2019-05-30", end_date="2021-11-30")

Now we can check everything was set correctly by calling `get_sources` again.

In [16]:
r.get_sources(site="bsd", species="co")

{'co_248m_picarro': {'rank_data': {'2016-01-01-00:00:00+00:00_2018-01-01-00:00:00+00:00': 1},
  'data_range': '2014-01-30T11:12:30_2020-12-01T22:31:30'},
 'co_42m_picarro': {'rank_data': {'2018-01-02-00:00:00+00:00_2019-05-30-00:00:00+00:00': 1},
  'data_range': '2014-01-30T11:12:30_2020-12-01T22:31:30'},
 'co_108m_picarro': {'rank_data': {'2019-05-30-00:00:00+00:00_2021-11-30-00:00:00+00:00': 1},
  'data_range': '2014-01-30T11:12:30_2020-12-01T22:31:30'}}
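
To see how such a dictionary can be read, the sketch below parses the date range keys and looks up which source is ranked for a given date. It is not how OpenGHG resolves ranks internally. The helper names are our own, and the code assumes that a lower rank number means a higher priority.

```python
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%d-%H:%M:%S%z"


def parse_daterange(daterange):
    """Split a 'start_end' key into two timezone-aware datetimes."""
    start, end = daterange.split("_")
    return datetime.strptime(start, DATE_FORMAT), datetime.strptime(end, DATE_FORMAT)


def ranked_source(sources, when):
    """Return the key of the best-ranked source covering `when`, or None."""
    best = None
    for key, info in sources.items():
        if info["rank_data"] == "NA":  # no rank set for this source
            continue
        for daterange, rank in info["rank_data"].items():
            start, end = parse_daterange(daterange)
            # Assumption: a lower number means a higher priority
            if start <= when <= end and (best is None or rank < best[1]):
                best = (key, rank)
    return best[0] if best else None


# rank_data copied from the get_sources output above
sources = {
    "co_248m_picarro": {"rank_data": {"2016-01-01-00:00:00+00:00_2018-01-01-00:00:00+00:00": 1}},
    "co_42m_picarro": {"rank_data": {"2018-01-02-00:00:00+00:00_2019-05-30-00:00:00+00:00": 1}},
    "co_108m_picarro": {"rank_data": {"2019-05-30-00:00:00+00:00_2021-11-30-00:00:00+00:00": 1}},
}

print(ranked_source(sources, datetime(2017, 6, 1, tzinfo=timezone.utc)))
```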

## Retrieve ranked data

We can now retrieve CO data from Bilsdale and we'll get the data from the highest ranked inlets.

In [17]:
co_data = get_obs_surface(site="bsd", species="co")

This retrieves the data and creates an `ObsData` dataclass. This holds the data in an xarray `Dataset`, the metadata associated with the site and each inlet and the ranking metadata. 

Let's have a look at the rank metadata

In [18]:
co_data.metadata["rank_metadata"]

{'2016-01-01-00:00:00+00:00_2018-01-01-00:00:00+00:00': '248m',
 '2018-01-02-00:00:00+00:00_2019-05-30-00:00:00+00:00': '42m',
 '2019-05-30-00:00:00+00:00_2021-11-30-00:00:00+00:00': '108m'}

In [19]:
co_data.data

In [20]:
co_data.metadata

{'248m': {'site': 'bsd',
  'instrument': 'picarro',
  'sampling_period': '60',
  'inlet': '248m',
  'port': '9',
  'type': 'air',
  'network': 'decc',
  'species': 'co',
  'scale': 'wmo-x2014a',
  'long_name': 'bilsdale',
  'data_type': 'timeseries'},
 '42m': {'site': 'bsd',
  'instrument': 'picarro',
  'sampling_period': '60',
  'inlet': '42m',
  'port': '9',
  'type': 'air',
  'network': 'decc',
  'species': 'co',
  'scale': 'wmo-x2014a',
  'long_name': 'bilsdale',
  'data_type': 'timeseries'},
 '108m': {'site': 'bsd',
  'instrument': 'picarro',
  'sampling_period': '60',
  'inlet': '108m',
  'port': '9',
  'type': 'air',
  'network': 'decc',
  'species': 'co',
  'scale': 'wmo-x2014a',
  'long_name': 'bilsdale',
  'data_type': 'timeseries'},
 'rank_metadata': {'2016-01-01-00:00:00+00:00_2018-01-01-00:00:00+00:00': '248m',
  '2018-01-02-00:00:00+00:00_2019-05-30-00:00:00+00:00': '42m',
  '2019-05-30-00:00:00+00:00_2021-11-30-00:00:00+00:00': '108m'},
 'data_owner': "Simon O'Doherty",


Now we have the highest ranked data for Bilsdale.

## 6. Cleanup

If you used the `tmp_dir` as a location for your object store at the start of the tutorial you can run the cell below to remove any files that were created.

In [21]:
import shutil
shutil.rmtree(tmp_dir, ignore_errors=True)  # tmp_dir is the path string set in section 1

## 7. What's next?

Further tutorials will be added soon. If you want to explore the internal workings of OpenGHG, please check out the Developer API documentation. If you would like to contribute to the project, we welcome pull requests to both the code and the documentation. For help and guidance on contributing, check our contributing page.