## Distributing Data with Intake

The Intake philosophy draws a clear separation of concerns between the provider of data and the consumer of data. This tutorial concerns the former: someone who cares about where a particular dataset is stored and about the right format and options for best retrieval. It is their task to make these choices and then expose the data to end-users (such as data scientists), so that those users have a clear path to finding and accessing the data. There is no need to train users in how to investigate or load a particular dataset, because those details are encoded in the catalog.

Intake catalogs act as a single source of truth about the data in question. The principal job of a data provider, while interacting with Intake, is to find the best representation of the datasets (as they would have to do in any case) and to author catalogs, which both codify the datasets in versionable files and expose them to users with a clear contract.

In this tutorial we will show the work-flow for writing and distributing a catalog, and thereby providing data to your users.

In [None]:
import intake

## 1. Reading the data

Intake has a plugin architecture that makes it straightforward to distribute many different types of data. This project contains some CSV files in the `data` directory. These files contain modelled precipitation under several different emissions scenarios.

In [None]:
import os
os.listdir('data')

<div class="alert alert-info" role="alert">
    <b>NOTE:</b> We are using local data files in this example, but you can use Intake to load data from many different storage options, including S3, GCS, Azure, HDFS, HTTP servers, FTP, and SSH.
</div>
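As a rough illustration of the note above, the storage backend is typically selected by the protocol prefix of the path you pass to a data source. The sketch below uses only the standard library to pull out that prefix. The bucket and host names are hypothetical placeholders, and `storage_protocol` is a helper invented for this example, not part of Intake.

```python
from urllib.parse import urlsplit

# Hypothetical locations, for illustration only (apart from the local file
# used in this tutorial, none of these are real).
example_urlpaths = [
    "data/SRLCC_b1_Precip_PCM-NCAR.csv",
    "s3://example-bucket/precip/file.csv",
    "gcs://example-bucket/precip/file.csv",
    "https://example.org/precip/file.csv",
]


def storage_protocol(urlpath):
    """Return the protocol prefix of a urlpath, or 'file' if there is none."""
    return urlsplit(urlpath).scheme or "file"


for path in example_urlpaths:
    print(f"{storage_protocol(path):>6}  {path}")
```

Reading from a remote location usually also requires the matching filesystem package to be installed in your environment.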

Normally we'd read these files using pandas, doing a little data-munging as part of the read.

In [None]:
import pandas as pd

df = pd.read_csv('data/SRLCC_b1_Precip_PCM-NCAR.csv', 
                 skiprows=3,
                 names=['time', 'precip'],
                 parse_dates=['time'])
df.head()

Intake lets you capture that read command in a human- and machine-readable way, so that the people consuming the data don't need to know the implementation details. First we'll verify that our commands work using `intake.open_csv`.

In [None]:
data_source = intake.open_csv(
    'data/SRLCC_b1_Precip_PCM-NCAR.csv', 
    csv_kwargs=dict(skiprows=3,
                    names=['time', 'precip'],
                    parse_dates=['time']))
data_source

We just created our first data source. This can be used to read the data into a pandas dataframe.

In [None]:
df = data_source.read()
df.head()

Or we can choose to load the data lazily using dask.

In [None]:
data_source.to_dask()

## 2. Write the catalog

Now that we've got a data source, we write a catalog file to capture the arguments that we used to create it. We can use `.yaml()` to retrieve those arguments and write them to a new file called `my_catalog.yml`.

In [None]:
with open('my_catalog.yml', 'w') as f:
    f.write(data_source.yaml())

Now let's take a look at what's in that new catalog file.

In [None]:
with open('my_catalog.yml', 'r') as f:
    print(f.read())

<div class="alert alert-info" role="alert">
    <b>NOTE:</b> In this example we use CSV files. In practice Intake supports a wide array of data formats including databases, images, grids, and streaming data. You can find a list of the supported formats in the <a href=https://intake.readthedocs.io/en/latest/plugin-directory.html>Intake Docs</a>.
</div>

## 3. Read the catalog

Now you can use your new catalog to read in the data. Notice how you no longer need to know anything about the original file format or particularities of how the data should be read.

In [None]:
cat = intake.open_catalog('my_catalog.yml')
list(cat)

In the above cell, we open the catalog and see what data sources are available in it. Our catalog has only one, but in practice a catalog can have many data sources and can even contain other catalogs. To learn more about reading from catalogs that other people have created, see [the ingesting data notebook](ingesting_data.ipynb) in this project.

In [None]:
df = cat.csv.read()
df.head()

## Optional: Edit the catalog

This step is optional, but now that you have the basics of the catalog in place, you can open up the YAML file and do some editing. Feel free to take a look at the `catalog.yml` that is included with this project to get ideas. Here are some things you might want to try out:

1. Change the name of the data source from csv to something more descriptive
2. Add a description. 
3. Add `{{ CATALOG_DIR }}` to the beginning of the `urlpath` to ensure that the data will be accessible from any location in the file hierarchy.
4. In the `urlpath`, replace `b1` with `{emissions}` and `PCM-NCAR` with `{model}`. This is a little tricky to explain, but essentially Intake will look at all the files that match the filename pattern and add columns to the dataframe containing the information found in the filenames. To learn more, [see this blog post](https://www.anaconda.com/intake-parsing-data-from-filenames-and-paths/).
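To make the filename pattern idea more concrete, here is a minimal sketch of how field values could be recovered from a filename. It uses only the standard library and is an illustration of the concept, not Intake's own implementation. The helper names `pattern_to_regex` and `parse_filename` are invented for this example, and the second filename is a hypothetical one.

```python
import re


def pattern_to_regex(pattern):
    """Turn a pattern such as 'a_{x}_b' into a regex with one named group per field."""
    # Splitting on a capturing group yields: literal, field, literal, field, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal text must match exactly
        else:
            regex += f"(?P<{part}>.+?)"     # a field matches any non-empty text
    return re.compile(regex)


def parse_filename(pattern, filename):
    """Return a dict of field values, or None if the filename does not match."""
    match = pattern_to_regex(pattern).fullmatch(filename)
    return match.groupdict() if match else None


pattern = "SRLCC_{emissions}_Precip_{model}.csv"
filenames = [
    "SRLCC_b1_Precip_PCM-NCAR.csv",          # the file read earlier in this tutorial
    "SRLCC_a1b_Precip_EXAMPLE-MODEL.csv",    # hypothetical filename
]
for name in filenames:
    print(name, "->", parse_filename(pattern, name))
```

Each extracted field would then show up as an extra column (here `emissions` and `model`) alongside the data read from the corresponding file.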

After you are done editing, repeat step 3 to make sure that you are getting your desired output. 
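For reference, a catalog with all four suggestions applied might look roughly like the sketch below. The source name `precip`, the description text, and the output filename `my_catalog_edited.yml` are assumptions made for this example, and the exact keys in your generated file may differ slightly, so treat this as a starting point rather than the definitive layout.

```python
# A sketch of an edited catalog, written out from Python so it can be tried
# alongside the generated one. All names below are illustrative assumptions.
edited_catalog = """\
sources:
  precip:
    description: Modelled precipitation under several emissions scenarios
    driver: csv
    args:
      urlpath: '{{ CATALOG_DIR }}/data/SRLCC_{emissions}_Precip_{model}.csv'
      csv_kwargs:
        skiprows: 3
        names: [time, precip]
        parse_dates: [time]
    metadata: {}
"""

# Write to a separate (hypothetical) file so that my_catalog.yml is left untouched.
with open("my_catalog_edited.yml", "w") as f:
    f.write(edited_catalog)

print(edited_catalog)
```

Opening this file with `intake.open_catalog` and reading the `precip` entry should then behave like step 3, with the extra `emissions` and `model` columns added.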

## Distribute the catalog
Once you are happy with your catalog, you have several options for sharing it:

### 1. Remote catalog
The easiest option is often to upload the catalog to a central location where other people can access it. People often upload to GitHub, a shared network drive, or institutional storage (GitLab, S3…). This is best suited for cases where the catalog points to remote data sources, which might themselves be on different servers or in the cloud.

### 2. Intake server
If your data are local or update frequently, it might make sense to serve them using the `intake-server` command. Since this notebook is intended to run on Anaconda Enterprise, a couple more steps are needed to make sure that the arguments all make it through properly to the `intake-server` command. You'll find a `main.py` script in this project that takes care of all these details.

To test it out, click the deploy button and choose the `my_server` command, making sure that your deployment is public. Once that is deployed, you can use the [ingesting data notebook](ingesting_data.ipynb) to read from your server.

### 3. Conda package
Another way to distribute your data is by creating a conda package that contains the data and specifies any dependencies. The basics of this package are laid out in the `conda.recipe` directory. You can read the [intake documentation](https://intake.readthedocs.io/en/latest/data-packages.html) to learn more about the process of creating data packages.

You can run the cell below to install `conda-build` in your environment.

In [None]:
!conda install conda-build --yes --quiet

### Build the package

Now you are ready to build the package:

In [None]:
!conda build conda.recipe --output-folder built --quiet

### Uploading the package

After that finishes, you should have a new directory in this project called `built`. In there you'll find your new archived package file. You can upload that to anaconda.org using the suggested command (something like `anaconda upload built/noarch/data-model-precip-0.1.0-0.tar.bz2`), or to your platform package server.

To learn more about Intake, see [ingesting_data.ipynb](ingesting_data.ipynb) or visit the [Intake docs](https://intake.readthedocs.io).