# Beginner workflow

This tutorial walks through a beginner workflow: processing data, visualising the object store, and retrieving and plotting the stored data.

### Check installation

This tutorial assumes that you have installed `openghg`. To check that the installation was successful, open an `ipython` console and import `openghg`.

In a terminal, type:

```bash
ipython
```

Then import `openghg` and print the version string of the installed package. If you see output like the example below, `openghg` is installed correctly.

```ipython
In [1]: import openghg
In [2]: openghg.__version__
Out[2]: '0.0.1'
```

If you get an ``ImportError`` please go back to the install section of the documentation.
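The same check can be scripted outside an interactive console. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical and is not part of `openghg`.

```python
import importlib.util
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    # find_spec returns None when the top-level module cannot be located
    if importlib.util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable, but without distribution metadata (e.g. a standard library module)
        return "unknown"


# Prints a version string if openghg is installed, otherwise None
print(installed_version("openghg"))
```

A result of `None` corresponds to the `ImportError` case described above.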




### Notebooks

If you haven't used Jupyter notebooks before please see [this introduction](https://realpython.com/jupyter-notebook-introduction/).

## 1. Setting up our environment

First the notebook sets up the environment needed to create the object store at our desired location. By default this location
is ``/tmp/openghg_store``. For the purposes of this tutorial this path is fine, but as it is a temporary directory it may not survive a
reboot of the computer.

If you want to create an object store that survives a reboot you can change the path to anything you like. We
recommend a path such as ``~/openghg_store``, which will create the object store in your home directory in a directory called ``openghg_store``.

In [1]:
from openghg.modules import ObsSurface
from openghg.objectstore import visualise_store
from openghg.localclient import get_single_site, RankSources

import glob
from pathlib import Path
import os
import tempfile

### Set an environment variable for the OpenGHG object store

Here we create a temporary directory, but you can use any folder you like by setting a path in place of `tmp_dir.name`. The object store created by this notebook will only live as long as the notebook itself. If you want a longer-lived object store, set a path below.

In [2]:
tmp_dir = tempfile.TemporaryDirectory()
os.environ["OPENGHG_PATH"] = tmp_dir.name # "/tmp/openghg_store"
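For a longer-lived store, the cell above could be replaced with something like the following sketch. It assumes the home-directory location recommended earlier in this tutorial, and it only sets the variable; it does not create any directory itself.

```python
import os
from pathlib import Path

# Assumption: a folder in the home directory, as suggested earlier in this tutorial
store_path = Path("~/openghg_store").expanduser()

# Environment variables must be strings, so convert the Path explicitly
os.environ["OPENGHG_PATH"] = str(store_path)

print(os.environ["OPENGHG_PATH"])
```

Expanding `~` up front avoids storing a literal tilde in the variable, which the shell would normally expand but Python does not.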


## 2. Processing data

First we want to create a list of files to process. We'll use some files from our local directory.

In [3]:
decc_files = glob.glob("../data/DECC/*.dat")

In [4]:
agage_files = glob.glob("../data/AGAGE/*.C")
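Note that `glob.glob` returns an empty list when nothing matches, so a mistyped relative path goes unnoticed until later. A minimal sketch of a stricter helper is shown below; `find_files` is a hypothetical name, and the demonstration uses throwaway placeholder files rather than the tutorial data.

```python
import tempfile
from pathlib import Path


def find_files(directory, pattern):
    """Return sorted paths matching `pattern` in `directory`, or raise if none match."""
    matches = sorted(str(p) for p in Path(directory).glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No files matching {pattern!r} in {directory}")
    return matches


# Demonstration with throwaway files; the file names are placeholders
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("b.dat", "a.dat", "notes.txt"):
        (Path(demo_dir) / name).touch()
    found = [Path(p).name for p in find_files(demo_dir, "*.dat")]

print(found)
```

Sorting the matches also makes the processing order reproducible between runs.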

We can pass a list of files to `ObsSurface.read_file`. We must also tell it the type of data we want it to process; DECC data is of type CRDS. We also pass in the name of the network.

In [5]:
decc_results = ObsSurface.read_file(filepath=decc_files, data_type="CRDS", network="DECC")

Here `decc_results` will give us a dictionary with the UUIDs (universally unique identifiers) for each of the Datasources the data has been assigned to. This tells us that the data has been processed and stored correctly.

## A note on Datasources

Datasources are objects stored in the object store that hold the data and metadata associated with each measurement we upload to the platform.

For example, if we upload a file that contains readings for three gas species from a single site at a specific inlet height, OpenGHG will assign this data to three different Datasources, one for each species. Metadata such as the site, inlet height, species and network are stored alongside the measurements for easy searching.

Datasources can also handle multiple versions of data from a single site, so if scales or other factors change multiple versions may be stored for easy future comparison.
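The one-Datasource-per-species idea can be pictured with a toy model. The sketch below is purely illustrative: `ToyDatasource` and `assign_datasources` are hypothetical names and do not reflect how OpenGHG implements Datasources internally.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyDatasource:
    """Illustrative stand-in only; not the OpenGHG Datasource class."""

    metadata: dict
    # Each new object gets its own randomly generated identifier
    datasource_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def assign_datasources(site, inlet, network, species_list):
    """Create one toy Datasource per species found in a single file."""
    return {
        species: ToyDatasource(
            {"site": site, "inlet": inlet, "network": network, "species": species}
        )
        for species in species_list
    }


# One file with three species at one site and inlet gives three Datasources
sources = assign_datasources("hfd", "100m", "decc", ["ch4", "co2", "co"])
for species, source in sources.items():
    print(species, source.datasource_id)
```

Because the identifiers are generated randomly, they differ on every run.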

<div class="alert alert-info">
    When you run this notebook different UUIDs will be created for the Datasources. This is expected as 
    each time a Datasource is created from scratch it is assigned a unique UUID.
</div>

In [6]:
decc_results

{'tac.picarro.1minute.100m.min.dat': {'tac.picarro.1minute.100m.min_ch4': '1d46b603-d96c-4e81-b961-a75f5f4e0301',
  'tac.picarro.1minute.100m.min_co2': 'd6ec4760-aadb-45aa-aaa2-922b07b06e26'},
 'tac.picarro.1minute.100m.test.dat': {'tac.picarro.1minute.100m.test_ch4': 'a8acc1e4-dd60-495b-aee1-528b0d063997',
  'tac.picarro.1minute.100m.test_co2': '4ce4b0a3-301e-418d-9ac4-e49dd2a24a8b'},
 'hfd.picarro.1minute.100m.min.dat': {'hfd.picarro.1minute.100m.min_ch4': '892e0711-bde9-49d2-a6a2-9e176fa3f707',
  'hfd.picarro.1minute.100m.min_co2': 'de213b39-7e68-4041-9384-d26814a6d22f',
  'hfd.picarro.1minute.100m.min_co': 'e7985c05-c71c-4d75-a30b-6c1c282f315d'},
 'hfd.picarro.1minute.50m.min.dat': {'hfd.picarro.1minute.50m.min_ch4': 'fd1342ae-b326-45ea-a0f5-d306915a10e1',
  'hfd.picarro.1minute.50m.min_co2': '0fa0d783-81ad-47b0-89aa-aaa0f47f28d7',
  'hfd.picarro.1minute.50m.min_co': '168b93d3-46f3-482e-bc03-f51eaf46dec8'},
 'bsd.picarro.1minute.248m.dat': {'bsd.picarro.1minute.248m_ch4': '21ca3416
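The nested dictionary can be summarised with plain Python. The sketch below works on a shortened copy of the output above (two files only) and assumes that each inner key ends with an underscore followed by the species name, as it does in that output.

```python
# A shortened copy of the dictionary shown above
results = {
    "tac.picarro.1minute.100m.min.dat": {
        "tac.picarro.1minute.100m.min_ch4": "1d46b603-d96c-4e81-b961-a75f5f4e0301",
        "tac.picarro.1minute.100m.min_co2": "d6ec4760-aadb-45aa-aaa2-922b07b06e26",
    },
    "hfd.picarro.1minute.100m.min.dat": {
        "hfd.picarro.1minute.100m.min_ch4": "892e0711-bde9-49d2-a6a2-9e176fa3f707",
        "hfd.picarro.1minute.100m.min_co2": "de213b39-7e68-4041-9384-d26814a6d22f",
        "hfd.picarro.1minute.100m.min_co": "e7985c05-c71c-4d75-a30b-6c1c282f315d",
    },
}

# Number of Datasources created from each file
datasources_per_file = {name: len(sources) for name, sources in results.items()}

# Species per file, taken from the text after the last underscore in each key
species_per_file = {
    name: [key.rsplit("_", 1)[-1] for key in sources]
    for name, sources in results.items()
}

# Flat list of every Datasource UUID
all_uuids = [u for sources in results.values() for u in sources.values()]

print(datasources_per_file)
print(species_per_file)
```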

We can now process the AGAGE data. The functions that process the AGAGE data expect each data file to have an accompanying precisions file, so for each data file we create a tuple containing the data filename and the precisions filename. A simpler method of uploading these file types is planned.

We must create a `tuple` for each pair:

```python
list_of_tuples = [(data_filepath, precision_filepath), (d1, p1), (d2, p2), ...]
```

In [7]:
agage_tuples = [('../data/AGAGE/capegrim-medusa.18.C', '../data/AGAGE/capegrim-medusa.18.precisions.C'), 
                ('../data/AGAGE/trinidadhead.01.C', '../data/AGAGE/trinidadhead.01.precisions.C')]
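With many files, the pairs can be built programmatically rather than typed by hand. The sketch below assumes the naming convention seen above, where the precisions file for `name.C` is `name.precisions.C`; the helper `pair_with_precisions` is hypothetical and the demonstration uses empty placeholder files.

```python
import tempfile
from pathlib import Path


def pair_with_precisions(data_dir):
    """Return (data, precisions) path tuples for every data file that has a partner."""
    pairs = []
    for data_file in sorted(Path(data_dir).glob("*.C")):
        # Precisions files also end in .C, so skip them as data candidates
        if data_file.name.endswith(".precisions.C"):
            continue
        precision_file = data_file.with_name(data_file.stem + ".precisions.C")
        if precision_file.exists():
            pairs.append((str(data_file), str(precision_file)))
    return pairs


# Demonstration with empty placeholder files
with tempfile.TemporaryDirectory() as demo_dir:
    for name in (
        "capegrim-medusa.18.C",
        "capegrim-medusa.18.precisions.C",
        "trinidadhead.01.C",
        "trinidadhead.01.precisions.C",
        "unpaired.05.C",  # no precisions file, so it is left out
    ):
        (Path(demo_dir) / name).touch()
    demo_pairs = pair_with_precisions(demo_dir)
    demo_names = [(Path(d).name, Path(p).name) for d, p in demo_pairs]

print(demo_names)
```

Data files without a matching precisions file are silently skipped here; raising an error instead may be preferable in a real workflow.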

Then we process the files as we did before with the DECC data, but this time changing the data type to `GCWERKS` and the network to `AGAGE`.

In [8]:
agage_results = ObsSurface.read_file(filepath=agage_tuples, data_type="GCWERKS", network="AGAGE")

When viewing `agage_results` there will be a large number of Datasource UUIDs shown, due to the large number of gases in each data file.

In [9]:
agage_results

{'capegrim-medusa.18.C': {'capegrim-medusa.18_NF3': 'd6a41afc-373f-45c4-8fbe-957fc85cc18b',
  'capegrim-medusa.18_CF4': 'cf90c2dc-8f2d-49f4-b274-a1c6046a0f87',
  'capegrim-medusa.18_PFC-116': '29ddf34a-1704-4779-9d1e-4fcd13566105',
  'capegrim-medusa.18_PFC-218': '15657f37-8bae-4c78-aa17-dbfe5f1c2f98',
  'capegrim-medusa.18_PFC-318': '819953c9-fb41-43b1-86a9-36fc5460f911',
  'capegrim-medusa.18_C4F10': '9df16de3-5798-4e54-a83b-469de08f9b9d',
  'capegrim-medusa.18_C6F14': '575e63e5-8ca0-470e-ba8e-a76d379390f4',
  'capegrim-medusa.18_SF6': 'f0d55ba0-4325-40ed-a6a5-f841fc447a2b',
  'capegrim-medusa.18_SO2F2': '51d4707f-d7ad-4440-9290-e6d25e70a27b',
  'capegrim-medusa.18_SF5CF3': '10420d38-c88a-4135-85b9-9eb0a163e6d8',
  'capegrim-medusa.18_HFC-23': 'deba29cf-e8d0-4cef-b062-2f380d7d4c62',
  'capegrim-medusa.18_HFC-32': '3b203047-7c5e-4dba-9541-dbf6191eb034',
  'capegrim-medusa.18_HFC-125': 'e29f2cb4-e226-4356-a6ff-47a270a3e2f9',
  'capegrim-medusa.18_HFC-134a': 'cf74788e-be5e-4bc9-94dd-e07

## 3. Visualising the object store

Now that we have a simple object store created we can view the objects within it in a simple force graph model. To do this we use the `visualise_store` function from the `objectstore` submodule. Note that the cell may take a few moments to load as the force graph is created.

In the force graph the central blue node is the `ObsSurface` node. Associated with this node is all the data processed by it. The next nodes in the topology are the networks, shown in green. In the graph you will see `DECC` and `AGAGE` nodes. From these you'll see site nodes in red and then individual Datasources in orange.

<div class="alert alert-info">
    The object store visualisation created by this function is commented out here and won't be visible in the documentation but can be uncommented and run when you use the notebook version.
</div>

In [10]:
# visualise_store()
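The hierarchy described above (network, then site, then Datasource) can be sketched as a nested dictionary. This is only an illustration of the topology: the record layout and label format are assumptions, and the values are taken from the DECC data in this tutorial.

```python
from collections import defaultdict

# Assumed record layout; values come from this tutorial's DECC data
records = [
    {"network": "decc", "site": "hfd", "species": "co", "inlet": "100m"},
    {"network": "decc", "site": "hfd", "species": "co", "inlet": "50m"},
    {"network": "decc", "site": "tac", "species": "ch4", "inlet": "100m"},
]

# network -> site -> list of Datasource labels
tree = defaultdict(lambda: defaultdict(list))
for rec in records:
    label = f"{rec['species']}_{rec['site']}_{rec['inlet']}"
    tree[rec["network"]][rec["site"]].append(label)

# Print the tree with indentation mirroring the graph's levels
for network, sites in tree.items():
    print(network)
    for site, datasources in sites.items():
        print("  ", site, datasources)
```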

Now we know we have this data in the object store we can search it and retrieve data from it.

## 4. Retrieving data 

To retrieve data from the object store we can use the `get_single_site` function from the `localclient` submodule. This allows us to retrieve and view the data stored.

In [11]:
data = get_single_site(site="hfd", species="co", network="AGAGE")

If we view `data` we expect two `xarray.Dataset` objects to have been returned in a `list`.

In [12]:
data

[<xarray.Dataset>
 Dimensions:    (time: 274)
 Coordinates:
   * time       (time) datetime64[ns] 2013-12-04T14:02:30 ... 2019-05-21T15:46:30
 Data variables:
     mf         (time) float64 214.3 216.2 147.0 135.3 ... 123.7 133.7 118.5
     co_stdev   (time) float64 4.081 3.634 3.887 4.11 ... 3.276 4.28 3.603 4.442
     co_n_meas  (time) float64 19.0 19.0 19.0 19.0 19.0 ... 16.0 15.0 16.0 16.0
 Attributes:
     data_owner:           Simon O'Doherty
     data_owner_email:     s.odoherty@bristol.ac.uk
     inlet_height_magl:    100m
     comment:              Cavity ring-down measurements. Output from GCWerks
     Conditions of use:    Ensure that you contact the data owner at the outse...
     Source:               In situ measurements of air
     Conventions:          CF-1.6
     File created:         2020-10-22 12:55:45.980284+00:00
     Processed by:         auto@hugs-cloud.com
     species:              CO
     station_longitude:    0.23048
     station_latitude:     50.97675
     s

In [13]:
len(data)

2

We get two datasets for CO data from Heathfield as there are two inlet measurement heights for this species at this site. We can quickly visualise the data we have stored using the plotting capabilities in `xarray`.

First we tell `matplotlib` that we are plotting inside a Jupyter notebook. This ensures a plot with controls is created.

In [14]:
%matplotlib notebook

INFO:matplotlib.font_manager:Generating new fontManager, this may take some time...


In [15]:
example_data = data[0]
mol_frac = example_data.mf
mol_frac.plot()

<IPython.core.display.Javascript object>

[<matplotlib.lines.Line2D at 0x7f236b4ead10>]

## 5. Ranking data

The date ranges of the two Heathfield datasets retrieved above overlap. To retrieve the highest quality data from Heathfield over a range of dates, we don't want to check repeatedly which inlet or instrument was the correct one for a given daterange. Ranking solves this problem.

A given inlet on a specific instrument at a site can be given a rank for a daterange. To do this we use the `RankSources` class from the `localclient` submodule.

In [16]:
r = RankSources()

r.get_sources(site="hfd", species="co")

{'co_hfd_100m_picarro': {'rank': 0,
  'data_range': '2013-12-04T14:02:30_2019-05-21T15:46:30',
  'uuid': 'e7985c05-c71c-4d75-a30b-6c1c282f315d',
  'metadata': {'site': 'hfd',
   'instrument': 'picarro',
   'time_resolution': '1_minute',
   'inlet': '100m',
   'port': '10',
   'type': 'air',
   'network': 'decc',
   'species': 'co',
   'scale': 'wmo-x2014a',
   'data_type': 'timeseries'}},
 'co_hfd_50m_picarro': {'rank': 0,
  'data_range': '2013-11-23T12:28:30_2020-06-24T09:41:30',
  'uuid': '168b93d3-46f3-482e-bc03-f51eaf46dec8',
  'metadata': {'site': 'hfd',
   'instrument': 'picarro',
   'time_resolution': '1_minute',
   'inlet': '50m',
   'port': '9',
   'type': 'air',
   'network': 'decc',
   'species': 'co',
   'scale': 'wmo-x2014a',
   'data_type': 'timeseries'}}}
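The overlap between the two inlets can be computed from the `data_range` strings in the output above. The sketch below assumes the `start_end` format seen there, with ISO 8601 timestamps separated by an underscore; both helper functions are hypothetical and not part of `openghg`.

```python
from datetime import datetime


def parse_daterange(daterange):
    """Split a 'start_end' string into two datetime objects."""
    start, end = daterange.split("_")
    return datetime.fromisoformat(start), datetime.fromisoformat(end)


def overlap(range_a, range_b):
    """Return the overlapping (start, end) of two dateranges, or None."""
    start_a, end_a = parse_daterange(range_a)
    start_b, end_b = parse_daterange(range_b)
    start, end = max(start_a, start_b), min(end_a, end_b)
    return (start, end) if start <= end else None


# data_range values copied from the get_sources output above
range_100m = "2013-12-04T14:02:30_2019-05-21T15:46:30"
range_50m = "2013-11-23T12:28:30_2020-06-24T09:41:30"

shared = overlap(range_100m, range_50m)
print(shared)
```

Here the 100m record lies entirely inside the 50m record, so the overlap equals the full 100m range.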

The returned dictionary gives us two keys, one for each inlet height. To rank a source we use the `set_rank` method, which expects two arguments: `rank_key`, the key given to each source in the `dict` above, and `rank_data`, a dictionary of the form

```python
rank_data = {"co_hfd_50m_picarro": {"1": [daterange_1], "2": [daterange_2]}}
```

We can create this dictionary using a helper method of `RankSources` called `create_daterange` as shown below.

In [17]:
daterange_100m = r.create_daterange(start="2013-11-01", end="2016-01-01")   

This creates a daterange string that will be understood by `openghg`. We can then place this in a list to create our `rank_data` dictionary.

In [18]:
rank_data = {"co_hfd_100m_picarro": {"1": [daterange_100m]}}

In [19]:
rank_data

{'co_hfd_100m_picarro': {'1': ['2013-11-01T00:00:00_2016-01-01T00:00:00']}}
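The daterange string is two ISO 8601 timestamps joined by an underscore. The sketch below reproduces that format with the standard library only; `make_daterange` is a hypothetical stand-in written for illustration, not the implementation of `create_daterange`.

```python
from datetime import datetime


def make_daterange(start, end):
    """Join two ISO dates into a 'start_end' string with second precision."""
    start_dt = datetime.fromisoformat(start)
    end_dt = datetime.fromisoformat(end)
    if start_dt >= end_dt:
        raise ValueError("start must be earlier than end")
    return f"{start_dt.isoformat()}_{end_dt.isoformat()}"


daterange = make_daterange("2013-11-01", "2016-01-01")

# Same shape as the rank_data dictionary built above
rank_data_sketch = {"co_hfd_100m_picarro": {"1": [daterange]}}
print(rank_data_sketch)
```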

Now we can set the rank of the source using `set_rank`.

In [20]:
r.set_rank(rank_key="co_hfd_100m_picarro", rank_data=rank_data)

We can now retrieve the sources for this site again to check that the rank has been set correctly.

In [21]:
r.get_sources(site="hfd", species="co")

{'co_hfd_100m_picarro': {'rank': defaultdict(list,
              {'1': ['2013-11-01T00:00:00_2016-01-01T00:00:00']}),
  'data_range': '2013-12-04T14:02:30_2019-05-21T15:46:30',
  'uuid': 'e7985c05-c71c-4d75-a30b-6c1c282f315d',
  'metadata': {'site': 'hfd',
   'instrument': 'picarro',
   'time_resolution': '1_minute',
   'inlet': '100m',
   'port': '10',
   'type': 'air',
   'network': 'decc',
   'species': 'co',
   'scale': 'wmo-x2014a',
   'data_type': 'timeseries'}},
 'co_hfd_50m_picarro': {'rank': 0,
  'data_range': '2013-11-23T12:28:30_2020-06-24T09:41:30',
  'uuid': '168b93d3-46f3-482e-bc03-f51eaf46dec8',
  'metadata': {'site': 'hfd',
   'instrument': 'picarro',
   'time_resolution': '1_minute',
   'inlet': '50m',
   'port': '9',
   'type': 'air',
   'network': 'decc',
   'species': 'co',
   'scale': 'wmo-x2014a',
   'data_type': 'timeseries'}}}

We can now see

```python
'co_hfd_100m_picarro': {'rank': defaultdict(list, {'1': ['2013-11-01T00:00:00_2016-01-01T00:00:00']})
```

This tells us the rank was set correctly over the daterange that we specified. We can now search for data and we'll automatically get the highest ranked data.

Let's search for CO data at Heathfield again; the 2014 to 2015 period is covered by both inlets.

In [22]:
updated_data = get_single_site(site="hfd", species="co", network="AGAGE")

In [23]:
updated_data

[<xarray.Dataset>
 Dimensions:    (time: 89)
 Coordinates:
   * time       (time) datetime64[ns] 2013-12-04T14:02:30 ... 2015-12-30T14:55:30
 Data variables:
     mf         (time) float64 214.3 216.2 147.0 135.3 ... 123.8 136.9 142.0
     co_stdev   (time) float64 4.081 3.634 3.887 4.11 ... 3.815 4.502 4.545 3.533
     co_n_meas  (time) float64 19.0 19.0 19.0 19.0 19.0 ... 19.0 19.0 19.0 19.0
 Attributes:
     data_owner:           Simon O'Doherty
     data_owner_email:     s.odoherty@bristol.ac.uk
     inlet_height_magl:    100m
     comment:              Cavity ring-down measurements. Output from GCWerks
     Conditions of use:    Ensure that you contact the data owner at the outse...
     Source:               In situ measurements of air
     Conventions:          CF-1.6
     File created:         2020-10-22 12:55:45.980284+00:00
     Processed by:         auto@hugs-cloud.com
     species:              CO
     station_longitude:    0.23048
     station_latitude:     50.97675
     s

Now we get the highest ranked data returned to us without the need to specify an inlet height or instrument.
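One way to picture the selection is a lookup over the rank entries for a given date. The sketch below is not OpenGHG's algorithm: it assumes that rank 1 is the highest rank, that a rank of `0` means unranked, and that the `rank` values have the shape shown in the `get_sources` output. Both function names are hypothetical.

```python
from datetime import datetime


def in_daterange(when, daterange):
    """Check whether `when` falls inside a 'start_end' daterange string."""
    start, end = (datetime.fromisoformat(part) for part in daterange.split("_"))
    return start <= when <= end


def highest_ranked(sources, when):
    """Return the key of the best-ranked source covering `when`, or None."""
    best_key, best_rank = None, None
    for key, info in sources.items():
        ranks = info["rank"]
        if not ranks:  # assumption: a rank of 0 means the source is unranked
            continue
        for rank, dateranges in ranks.items():
            covered = any(in_daterange(when, d) for d in dateranges)
            # Assumption: a smaller rank number means a higher rank
            if covered and (best_rank is None or int(rank) < best_rank):
                best_key, best_rank = key, int(rank)
    return best_key


# Rank entries in the shape shown in the get_sources output
sources = {
    "co_hfd_100m_picarro": {"rank": {"1": ["2013-11-01T00:00:00_2016-01-01T00:00:00"]}},
    "co_hfd_50m_picarro": {"rank": 0},
}

print(highest_ranked(sources, datetime(2015, 6, 1)))
print(highest_ranked(sources, datetime(2018, 1, 1)))
```

In this sketch a date outside every ranked daterange returns `None`; how OpenGHG itself handles unranked periods is not covered here.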

If we know that we want data from the 50m inlet we can still specify this in the search and get that data.

In [24]:
fiftym_data = get_single_site(site="hfd", species="co", network="AGAGE", inlet="50m")

In [25]:
fiftym_data

[<xarray.Dataset>
 Dimensions:    (time: 636)
 Coordinates:
   * time       (time) datetime64[ns] 2013-11-23T12:28:30 ... 2020-06-24T09:41:30
 Data variables:
     mf         (time) float64 181.7 190.6 242.6 196.4 ... 93.59 86.76 129.7
     co_stdev   (time) float64 5.158 4.641 3.602 5.487 ... 3.672 4.666 4.111
     co_n_meas  (time) float64 19.0 19.0 19.0 19.0 19.0 ... 12.0 12.0 12.0 12.0
 Attributes:
     data_owner:           Simon O'Doherty
     data_owner_email:     s.odoherty@bristol.ac.uk
     inlet_height_magl:    50m
     comment:              Cavity ring-down measurements. Output from GCWerks
     Conditions of use:    Ensure that you contact the data owner at the outse...
     Source:               In situ measurements of air
     Conventions:          CF-1.6
     File created:         2020-10-22 12:55:46.530800+00:00
     Processed by:         auto@hugs-cloud.com
     species:              CO
     station_longitude:    0.23048
     station_latitude:     50.97675
     statio

## 6. Viewing ranked data

We can also view the ranks we have given to data with a similar layout to the object store visualisation we created earlier.

To do this we use the `visualise_rankings` method of the `RankSources` class. In this figure we'll only see Datasources that contain ranked data. Hover over the nodes for further information.

<div class="alert alert-info">
    The rankings visualisation created by this function is commented out here and won't be visible in the documentation but can be uncommented and run when you use the notebook version.
</div>

In [26]:
# r.visualise_rankings()
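Which Datasources would appear in such a figure can be approximated by filtering the `get_sources` output. The sketch below assumes that a `rank` of `0` marks an unranked source, as in the output shown earlier; `ranked_only` is a hypothetical helper.

```python
def ranked_only(sources):
    """Keep only sources whose rank entry is non-empty (0 is treated as unranked)."""
    return {key: info for key, info in sources.items() if info["rank"]}


# Rank entries in the shape shown in the get_sources output
all_sources = {
    "co_hfd_100m_picarro": {"rank": {"1": ["2013-11-01T00:00:00_2016-01-01T00:00:00"]}},
    "co_hfd_50m_picarro": {"rank": 0},
}

ranked = ranked_only(all_sources)
print(list(ranked))
```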

## 7. What's next?

Further tutorials will be added soon. If you want to explore the internal workings of OpenGHG, please check out the Developer API documentation. If you would like to contribute to the project, we welcome pull requests to both the code and the documentation. For help and guidance on contributing, check our contributing page.