# OpenGHG for data providers: uploading and classifying data

The OpenGHG platform can interpret and standardise data from multiple sources. For measurement networks, this currently includes data from the following projects:

- AGAGE
- DECC
- LondonGHG

This list can be expanded to include more projects as appropriate. At present, data only needs to be uploaded once; after that it is available to access directly on the platform.

The standardised format aims to be CF and CEDA compliant (as long as the necessary metadata is provided).

## Manual upload

The current interface allows new measurement data to be uploaded directly to the platform by passing the data files along with a set of keywords so the data can be appropriately identified and categorised.

For instance, a data file (or set of files) from the Bilsdale site (site code "BSD") within the DECC network could be uploaded and stored within the OpenGHG cloud store using the keywords:

- data_type of "CRDS"
- site code of "BSD"
- network of "DECC"

The data_type here indicates the expected format of the data files themselves. This can be specific to the type of instrument being used, a site or a particular network (more details below).

In [1]:
# Use a temporary directory as the object store for this tutorial.
# OpenGHG is pointed at it through the OPENGHG_PATH environment variable.
import os
import tempfile


tmp_dir = tempfile.TemporaryDirectory()
os.environ["OPENGHG_PATH"] = tmp_dir.name # "/tmp/openghg_store"
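
The cell above sets `OPENGHG_PATH` unconditionally. If a store location may already be configured, a small check can avoid overwriting it. The following is a minimal sketch using only the standard library; the helper name `ensure_store_path` is hypothetical and is not part of OpenGHG.

```python
import os
import tempfile


def ensure_store_path(env_var="OPENGHG_PATH"):
    """Return the object store path, creating a temporary one if unset.

    Hypothetical helper for illustration; not an OpenGHG function.
    """
    path = os.environ.get(env_var)
    if not path:
        # Nothing configured: fall back to a fresh temporary directory.
        # mkdtemp does not clean up after itself, so remove it when done.
        path = tempfile.mkdtemp(prefix="openghg_store_")
        os.environ[env_var] = path
    return path


store_path = ensure_store_path()
```

Calling the helper a second time returns the same path, because the environment variable is now set.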

In [2]:
from openghg.modules import ObsSurface 

decc_file = "../data/DECC/bsd.picarro.1minute.248m.min.dat"
data_type = "CRDS"
site = "bsd"
network = "DECC"

decc_results = ObsSurface.read_file(decc_file, data_type, site, network)

Processing: bsd.picarro.1minute.248m.min.dat: 100%|██████████| 1/1 [00:00<00:00,  2.30it/s]


#### Aside:

Accepted data types at the moment include:

- CRDS (data from CRDS instruments, typically within the DECC and AGAGE networks)
- GCWERKS (data from GC instruments, typically within the AGAGE network)
- NOAA
- THAMESBARRIER
- BEACO2N
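
Before passing a `data_type` keyword to an upload call, it can be useful to check it against the list above. The sketch below does this in plain Python. The function name `normalise_data_type` is hypothetical, and whether OpenGHG performs its own case-insensitive validation is not stated here, so treat this as an optional pre-check rather than a description of the library's behaviour.

```python
# Data types listed in this tutorial as currently accepted.
ACCEPTED_DATA_TYPES = ("CRDS", "GCWERKS", "NOAA", "THAMESBARRIER", "BEACO2N")


def normalise_data_type(data_type):
    """Upper-case a data_type keyword and check it is in the accepted list.

    Hypothetical helper for illustration; not an OpenGHG function.
    """
    cleaned = data_type.strip().upper()
    if cleaned not in ACCEPTED_DATA_TYPES:
        accepted = ", ".join(ACCEPTED_DATA_TYPES)
        raise ValueError(f"Unknown data_type {data_type!r}; expected one of: {accepted}")
    return cleaned


data_type = normalise_data_type("crds")
```

An unrecognised value such as `"picarro"` raises a `ValueError` that lists the accepted options.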

## Automated upload
**++ADD DETAILS HERE OF HOW THIS WILL WORK++**

## Ranking data

When multiple sets of data are available for the same site and species, it is possible to set up a *ranking* to provide an order of preference for the data returned over a given time period. Once created, this ranking will then persist and will be used whenever this data is accessed.

For example, the Bilsdale site has inlets at different heights. For different time periods, depending on the status of the instruments and the data availability, data from different inlets may be preferred. This preference can be recorded and stored so that it determines which data source is returned for each species at Bilsdale.

In [3]:
# This cell can be skipped if the Bilsdale data is already in the object store

from openghg.util import bilsdale_datapaths
from openghg.modules import ObsSurface

# Load Bilsdale data into the object store
bsd_paths = bilsdale_datapaths()
uploaded = ObsSurface.read_file(filepath=bsd_paths, data_type="CRDS", site="bsd", network="DECC", overwrite=True)


Processing: bsd.picarro.1minute.42m.min.dat: 100%|██████████| 3/3 [00:01<00:00,  2.07it/s]


In [4]:
from openghg.localclient import RankSources

site = "BSD"  # Bilsdale
species = "ch4"  # methane

# Show all available sources which correspond to the same site and species
r = RankSources()
sources = r.get_sources(site=site, species=species)
sources

{'ch4_248m_picarro': {'rank_data': 'NA',
  'data_range': '2014-01-30T11:12:30_2020-12-01T22:31:30'},
 'ch4_108m_picarro': {'rank_data': 'NA',
  'data_range': '2014-01-30T11:12:30_2020-12-01T22:31:30'},
 'ch4_42m_picarro': {'rank_data': 'NA',
  'data_range': '2014-01-30T11:12:30_2020-12-01T22:31:30'}}

Based on the data uploaded for Bilsdale, each source is stored with a key built from the species being measured, the inlet height and the instrument type:
 - ch4_248m_picarro
 - ch4_108m_picarro
 - ch4_42m_picarro

So, for the methane data taken at the 108m inlet using the Picarro instrument the relevant key would be "ch4_108m_picarro".
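
To make the key structure concrete, the sketch below splits a source key into its three parts. It assumes every key follows the `species_inlet_instrument` pattern seen in the output above, which may not hold for all sites. `SourceKey` and `parse_source_key` are hypothetical names used for illustration and are not part of OpenGHG.

```python
from typing import NamedTuple


class SourceKey(NamedTuple):
    species: str
    inlet: str
    instrument: str


def parse_source_key(key):
    """Split a key such as "ch4_108m_picarro" into its components.

    Assumes exactly three underscore-separated parts.
    """
    parts = key.split("_")
    if len(parts) != 3:
        raise ValueError(f"Expected species_inlet_instrument, got {key!r}")
    return SourceKey(*parts)


parsed = parse_source_key("ch4_108m_picarro")
# parsed.species is "ch4", parsed.inlet is "108m", parsed.instrument is "picarro"
```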

We can use this key to check the current rank for the data taken at the 108m inlet, and then set a rank so that this source is used preferentially when extracting data for 2015:

In [5]:
r.get_specific_source(key="ch4_108m_picarro")

'NA'

In [6]:
r.set_rank(key="ch4_108m_picarro", rank=1, start_date="2015-01-01", end_date="2016-01-01")

In [7]:
r.get_specific_source(key="ch4_108m_picarro")

{'2015-01-01-00:00:00+00:00_2016-01-01-00:00:00+00:00': 1}

We can also cover the rest of the date range for this data (2014-01-01 to 2020-12-01) by setting a ranking for the 248m inlet for the periods before and after 2015, and then check what values have been set:

In [8]:
dateranges_248m = ["2014-01-01_2015-01-01", "2016-01-01_2020-12-01"]
r.set_rank(key="ch4_248m_picarro", rank=1, dateranges=dateranges_248m)
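
The two date ranges passed above are simply the parts of the full period that fall on either side of the 2015 window. As a minimal sketch, strings in the same `start_end` format can be built from `date` objects instead of being typed by hand. The helper name `dateranges_outside` is hypothetical and is not part of OpenGHG.

```python
from datetime import date


def dateranges_outside(full_start, full_end, window_start, window_end):
    """Return "start_end" strings for the parts of the full period
    that lie before and after the given window."""
    ranges = []
    if full_start < window_start:
        ranges.append(f"{full_start.isoformat()}_{window_start.isoformat()}")
    if window_end < full_end:
        ranges.append(f"{window_end.isoformat()}_{full_end.isoformat()}")
    return ranges


dateranges_248m = dateranges_outside(
    full_start=date(2014, 1, 1),
    full_end=date(2020, 12, 1),
    window_start=date(2015, 1, 1),
    window_end=date(2016, 1, 1),
)
```

For these inputs the result is the same two-element list used in the cell above.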

In [9]:
# Checking what values have been set for this site
r.get_sources(site="bsd", species="ch4")

{'ch4_248m_picarro': {'rank_data': {'2014-01-01-00:00:00+00:00_2015-01-01-00:00:00+00:00': 1,
   '2016-01-01-00:00:00+00:00_2020-12-01-00:00:00+00:00': 1},
  'data_range': '2014-01-30T11:12:30_2020-12-01T22:31:30'},
 'ch4_108m_picarro': {'rank_data': {'2015-01-01-00:00:00+00:00_2016-01-01-00:00:00+00:00': 1},
  'data_range': '2014-01-30T11:12:30_2020-12-01T22:31:30'},
 'ch4_42m_picarro': {'rank_data': 'NA',
  'data_range': '2014-01-30T11:12:30_2020-12-01T22:31:30'}}
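
To illustrate how rank data of this shape can be resolved to a single source for a given date, here is a pure-Python sketch. It is not OpenGHG's implementation. It assumes that a lower rank number means a higher preference (consistent with `rank=1` being set for the preferred inlet) and that each date range includes its start and excludes its end. The names `parse_daterange` and `best_source` are hypothetical.

```python
from datetime import datetime, timezone

# Format of the timestamps in the rank_data keys shown above.
KEY_FORMAT = "%Y-%m-%d-%H:%M:%S%z"


def parse_daterange(daterange):
    """Split a "start_end" key into two timezone-aware datetimes."""
    start_str, end_str = daterange.split("_")
    return (datetime.strptime(start_str, KEY_FORMAT),
            datetime.strptime(end_str, KEY_FORMAT))


def best_source(sources, when):
    """Return the key of the best-ranked source covering `when`, or None."""
    best_key, best_rank = None, None
    for key, info in sources.items():
        rank_data = info["rank_data"]
        if rank_data == "NA":  # no ranking set for this source
            continue
        for daterange, rank in rank_data.items():
            start, end = parse_daterange(daterange)
            # Assumption: start is inclusive, end is exclusive.
            if start <= when < end and (best_rank is None or rank < best_rank):
                best_key, best_rank = key, rank
    return best_key


# Rank data copied from the output above (data_range omitted for brevity).
sources = {
    "ch4_248m_picarro": {"rank_data": {
        "2014-01-01-00:00:00+00:00_2015-01-01-00:00:00+00:00": 1,
        "2016-01-01-00:00:00+00:00_2020-12-01-00:00:00+00:00": 1}},
    "ch4_108m_picarro": {"rank_data": {
        "2015-01-01-00:00:00+00:00_2016-01-01-00:00:00+00:00": 1}},
    "ch4_42m_picarro": {"rank_data": "NA"},
}

in_2015 = best_source(sources, datetime(2015, 6, 1, tzinfo=timezone.utc))
in_2014 = best_source(sources, datetime(2014, 6, 1, tzinfo=timezone.utc))
```

With the rankings set above, `in_2015` is `"ch4_108m_picarro"` and `in_2014` is `"ch4_248m_picarro"`. A date outside every ranked range returns `None`.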

When retrieving data from the object store, OpenGHG now knows which source to extract for each time period, without the inlet needing to be specified:

In [18]:
# For 2015, the data from 108m is returned

from openghg.processing import get_obs_surface

data = get_obs_surface(site="bsd", species="ch4", 
                       start_date="2015-01-01", end_date="2015-02-01")

print(f"Data from inlet at height: {data.metadata['inlet']}")

Data from inlet at height: 108m


In [17]:
# Outside 2015, the data from 248m is returned

from openghg.processing import get_obs_surface

data = get_obs_surface(site="bsd", species="ch4", 
                       start_date="2014-01-01", end_date="2014-02-01")

print(f"Data from inlet at height: {data.metadata['inlet']}")

Data from inlet at height: 248m


Even if no ranking is set, the data can still be accessed, but the inlet height must be specified if there is any ambiguity about which data source is being retrieved.