# Remote Datasets

Data can live in a variety of places. `scmdata` provides routines that make it easy to fetch such data automatically.

In [1]:
import scmdata



## Remote files

The simplest example is reading CSV or Excel data served via HTTP/HTTPS.

Rather than manually downloading the data and reading the local copy, the data can be read directly from the URL.

In [2]:
remote_url = "https://rcmip-protocols-au.s3-ap-southeast-2.amazonaws.com/v5.1.0/rcmip-emissions-annual-means-v5-1-0.csv"

run = scmdata.ScmRun(remote_url, lowercase_cols=True)


`scmdata.ScmRun` supports a range of URL schemes, including http, ftp, s3, gs, and file. Behind the scenes, `pandas` is used to fetch the data. For more information about the remote formats that can be read, see the `pd.read_csv` documentation for the installed version of pandas.

## API-based Datasets

Some data sources may be served via an API to make it easy to consume in various ways. Rather than serving a single CSV file, an API allows users to query just the data that is required.

Below we use the NDC dataset developed by [Climate Resource](https://www.climate-resource.com/tools/ndcs). This dataset is in an early release, and both the API and the underlying data may change without warning. Note also that this dataset is provided under a [CC Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode) license, which requires attribution and restricts the data to non-commercial use.

In [None]:
NDCS_URL = "https://api.climateresource.com.au/ndcs/v1"

In [None]:
print(scmdata.RemoteDataset.__init__.__doc__)

In [None]:
ds = scmdata.RemoteDataset(NDCS_URL)

The `RemoteDataset` can be filtered in a similar way to `scmdata.ScmRun`. This includes the use of the `*` wildcard to match multiple items.

Any subsequent operations will include data that matches the filter.
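
To illustrate the kind of wildcard matching described above, here is a minimal sketch using Python's standard-library `fnmatch` module. It is not how `scmdata` implements filtering, and its exact matching rules may differ. The variable names and the `match_variables` helper are hypothetical and exist only for this example.

```python
from fnmatch import fnmatchcase

# Hypothetical variable names, used only for illustration
variables = [
    "Emissions|Total GHG",
    "Emissions|Total GHG excl. LULUCF",
    "Emissions|CO2",
]


def match_variables(names, pattern):
    """Return the names matching a pattern in which "*" matches any run of characters."""
    return [name for name in names if fnmatchcase(name, pattern)]


# The trailing "*" also matches the empty string, so the bare name is included
matched = match_variables(variables, "Emissions|Total GHG*")
print(matched)
```

Here the pattern selects the first two names and leaves out `Emissions|CO2`.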

In [None]:
ghg_ds = ds.filter(variable="Emissions|Total GHG*")

But how do you find what data are available?

The `meta` function lets users query which timeseries are available. The dataset can be filtered by any of the returned columns, as well as by some additional helper filters (`year.min` and `year.max`).

In [None]:
ghg_ds.meta()

In [None]:
# A complete list of filters
ghg_ds.filter_options()

In [None]:
ghg_ds = ghg_ds.filter(**{"year.min": "2010", "year.max": "2030"})
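
The dictionary unpacking in the cell above is needed because names such as `year.min` contain a dot and so cannot be written as ordinary Python keyword arguments. The sketch below shows the mechanism with a hypothetical stand-in function, `describe_filters`, which is not part of `scmdata`.

```python
def describe_filters(**filters):
    """Hypothetical stand-in for a filter method: return the received filters sorted by name."""
    return sorted(filters.items())


# Writing describe_filters(year.min="2010") would be a SyntaxError, because
# "year.min" is not a valid Python identifier. Unpacking a dict avoids this,
# and ordinary keyword arguments can still be mixed in.
received = describe_filters(
    **{"year.min": "2010", "year.max": "2030"},
    variable="Emissions|Total GHG*",
)
print(received)
```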

The available timeseries can then be queried. This fetches the timeseries matching the requested filters from the server.
The result is returned as a `scmdata.ScmRun` object, on which further operations can be performed.

The returned `scmdata.ScmRun` also carries an additional metadata entry, `source`, which is set to the `RemoteDataset` that was used to fetch the data.

In [None]:
ghg_data = ghg_ds.query()
ghg_data

In [None]:
ghg_data.metadata["source"]

Alternatively, `scmdata.ScmRun` functions can be called directly on the `RemoteDataset`. The underlying timeseries are queried automatically.

In [None]:
ghg_ds.process_over("region", "sum")

In [None]:
ghg_ds.lineplot(hue="variable")

For notebooks that are run frequently, it can be useful to cache the timeseries so that they do not need to be downloaded on every run.

We recommend using `pooch` to cache the results of a query locally.

In [None]:
import pooch

# known_hash=None skips checksum verification of the downloaded file
url = scmdata.RemoteDataset(NDCS_URL).filter(version="14Feb2022b_CR", variable="Emissions|Total GHG*").url()
pooch.retrieve(url, known_hash=None)
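
To make the caching idea concrete, the following is a rough standard-library sketch of what "download once, reuse afterwards" can look like. It is not how `pooch` is implemented, and `pooch` does considerably more, such as verifying file hashes. The `cached_fetch` helper and the file names are hypothetical. A local `file://` URL stands in for the remote dataset so that the example runs without network access.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path


def cached_fetch(url, cache_dir):
    """Download `url` into `cache_dir` once and return the local path.

    Hypothetical helper: the file name is derived from a hash of the URL,
    so a repeated call with the same URL reuses the existing file.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not target.exists():
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
    return target


# Demonstrate with a local file:// URL so that no network access is needed
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "data.csv"
    source.write_text("year,value\n2010,1.0\n")
    url = source.as_uri()

    first = cached_fetch(url, Path(tmp) / "cache")
    source.unlink()  # the original is gone, but the cached copy remains
    second = cached_fetch(url, Path(tmp) / "cache")

    same_path = first == second
    cached_text = second.read_text()
```

The second call succeeds even though the source file has been deleted, because the cached copy is returned without touching the URL again.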