Start adding tutorial material
(mostly unchanged)
Martin Durant committed May 24, 2019
1 parent 437ef02 commit af6a6e7
Showing 11 changed files with 7,197 additions and 1 deletion.
106 changes: 105 additions & 1 deletion .gitignore
@@ -1 +1,105 @@
*/.ipynb_checkpoints
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject
.idea/

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
14 changes: 14 additions & 0 deletions README.md
@@ -7,6 +7,20 @@ They should run locally if you download this repository. They are also available

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intake/intake-examples/master)

## Tutorial

The subdirectory `tutorial/` contains three notebooks describing workflows for three different
[roles](https://intake.readthedocs.io/en/latest/index.html) by which you
might interact with Intake. This separation of concerns is important: each person can concentrate on the
job in front of them, and there are clear communication paths between the roles. In a small
organisation, one person may fill several of these roles, but it is still useful to consider problems from each
vantage point in turn:

* [data scientist](tutorial/data_scientist.ipynb) - for the end-user who wants to find, load and analyse their data
* [data engineer](tutorial/data_engineer.ipynb) - for the curator of data and catalogues, who decides how best to
store and expose data
* [developer](tutorial/dev.ipynb) - for the person who writes new drivers and other extensions to Intake's capabilities. This
person does not necessarily develop any code for the Intake package itself.

## Data Packages
This directory contains examples of Intake catalogs and scripts:
1 change: 1 addition & 0 deletions environment.yml
@@ -12,6 +12,7 @@ dependencies:
- intake-xarray
## intake data packages
- intake::anaconda-package-data
- intake::us_crime
## pyviz dependencies
- nodejs
- scipy
158 changes: 158 additions & 0 deletions tutorial/README.md
@@ -0,0 +1,158 @@
# Announcing: Intake

#### *Taking the pain out of data access*

We are pleased to release Intake, a simple data access layer and cataloging system.

This article contains code executable with Binder:
[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/martindurant/intake-release-blog/master)

## Introduction

Defining and loading data-sets costs time and effort.
The data scientist needs to know what data are available,
and the characteristics of each data-set, before going to the effort of loading
and analysing any specific one. Furthermore, they might need
to learn the API of some Python package specific to the target format. The code
to do such data loading often makes up the first block of every notebook or script,
propagated by copy-and-paste.

Intake has been designed as a simple layer over other Python libraries to:
- provide a consistent API, so that you can investigate and load all of your data with
  the same commands, without knowing the details of the backend library
- provide a simple cataloging system using YAML syntax (sketched just after this list),
  enabling, for every data-set, a description of
  - which plugin is to load it
  - arguments to pass
  - arbitrary metadata to associate with the data
- give transparent access to remote catalogs and data, for most formats
- offer a minimalist plugin system, so that new loaders, remote containers, and many other
  components can be contributed to Intake with a minimum of fuss
- provide an optional server-client infrastructure, so that you can serve a set of catalogs and
  stream data that the client doesn't have direct access to.
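
For illustration, a minimal catalog file might look like the following; the
driver name is real, but the path and metadata here are invented for the example:

```yaml
sources:
  us_crime:
    description: US crime statistics, one row per year
    driver: csv
    args:
      urlpath: 's3://bucket/us_crime/*.csv'   # hypothetical location
    metadata:
      origin: FBI UCR
```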

For a simple design and relatively small code-base, Intake offers a lot of features. We
will describe the main ones here, from three points of view, in the sections that follow.
However, Intake is *very new*, and its design and implementation are up for discussion
and change in the true spirit of open source. Please refer to the
[documentation](https://intake.readthedocs.io/en/latest/), and post problems, questions
and comments to the [issue tracker](https://github.com/ContinuumIO/intake/issues).

#### Installation

The easiest method is to use `conda`:

```bash
conda install -c intake intake
```

Several additional packages are available with plugins for various formats. See the
[informal directory](https://intake.readthedocs.io/en/latest/plugin-directory.html).

## Usage Scenarios

### Data Scientist

Intake provides an easy way to find your data, locally, in a cloud service, or
on an Intake server.

```Python
import intake
intake.cat # set of installed data-sets
cat = intake.open_catalog('directory/catalog.yaml') # some local catalog file
cat = intake.open_catalog('s3://bucket/data/cat*.yaml') # remote set of catalog files
cat = intake.open_catalog('intake://host:5000') # an Intake server
```

In each case, `list(cat)` will provide the set of available data-sets in a
catalog, and you can access the entries using attribute dot-notation to either
interrogate the entry (e.g., read the text description), or to directly load the data.
In general, a catalog can contain any number of entries, and can also reference
other catalogs, local or remote, in a hierarchical manner.

```Python
list(cat) # set of entries
cat.us_crime.describe() # basic description of an entry
cat.us_crime.plot() # quick-look plotting
df = cat.us_crime.read() # get all of the data in one shot
```

The output of `read()` will be one of a small number of Python containers appropriate
to the data, in this case a Pandas data-frame. The two other built-in containers are
NumPy n-dimensional arrays and lists of objects. Each of these also has an out-of-core,
parallelised version, available via the `.to_dask()` method.
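
As a brief sketch of the out-of-core path (this assumes Dask is installed, and
that the data-set contains a `Year` column):

```Python
import intake

cat = intake.open_catalog('directory/catalog.yaml')
ddf = cat.us_crime.to_dask()  # a lazy Dask data-frame; no data loaded yet
ddf.head()                    # reads only the first partition
ddf.groupby('Year').mean().compute()  # processes all partitions in parallel
```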

The linked notebook shows an example workflow for a data scientist.

[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/martindurant/intake-release-blog/master?filepath=data_scientist.ipynb)

### Data Engineer

Intake allows the separation of the definition of data sources from their use
and analysis. This allows the data engineer to get on with deciding where data
should be stored, in which format, and which package is best suited to load each
data-set, so long as the data scientist, above, gets the data-frame they need
(or other data object!).

Consider, for instance, maintaining a catalog file on a cloud service, and pointing
users to it. It may be that, originally, the data was in CSV format, with the
data-volume and parsing costs this entails. The engineer could swap out
the catalog entry definition for an optimized binary version, without the end user
noticing the difference, as sketched below. In a project or institutional setting, the catalog can
be the "point of truth" about the current best data-sets that should be in use.
Data-sets and catalogs can also be packaged, for version control and dependency
tracking.
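
A sketch of such a swap, with hypothetical paths, and assuming the
`intake-parquet` plugin is installed; the entry name, and therefore all user
code, stays the same:

```yaml
sources:
  us_crime:
    # was: driver: csv, urlpath: 's3://bucket/us_crime/*.csv'
    driver: parquet
    args:
      urlpath: 's3://bucket/us_crime.parquet'
```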

Catalogs also provide a natural place to *describe* each data-set, with text and
arbitrary metadata. Such information can be for the users' benefit to help them
choose the right data-set for their problem, for analysis of the set of entries
themselves, or as a way to pass parameters to downstream tools that work with Intake
(e.g., the [plotting](https://intake.readthedocs.io/en/latest/plotting.html#persisting-metadata)
system).

Finally, catalogs can also encode user parameters, either giving natural choices
to the end user (e.g., to filter a data-set, or to choose between version A and B),
or drawing information required for data access from the user's environment
(e.g., credentials defined in environment variables).
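
A sketch of a parametrized entry (the path is hypothetical; see the Intake
documentation for the full templating syntax, including drawing values from
environment variables):

```yaml
sources:
  us_crime:
    driver: csv
    args:
      # the parameter below is interpolated into the path
      urlpath: 's3://bucket/us_crime_{{ version }}.csv'
    parameters:
      version:
        description: which release of the data to load
        type: str
        default: A
        allowed: [A, B]
```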

The linked notebook shows an example workflow for a data engineer.

[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/martindurant/intake-release-blog/master?filepath=data_engineer.ipynb)

### Developer

The Intake codebase is intended to be approachable and very pluggable. It should
be easy to create new data loader plugins (see the [CSV](https://github.com/ContinuumIO/intake/blob/master/intake/source/csv.py#L7)
and [textfiles](https://github.com/ContinuumIO/intake/blob/master/intake/source/textfiles.py#L4)
prototype classes). Even writing new auth mechanisms (beyond the [very simplistic](https://github.com/ContinuumIO/intake/blob/master/intake/auth/secret.py#L8)
one), or replacing the Intake server with a whole new implementation, is very
doable. These need not live within the Intake repo, and creating complementary
packages is very much encouraged. We hope to cover not just the main data formats
already loadable by the likes of Pandas, but to catalog and access *all* data,
such as fitted machine-learning models, REST APIs and video formats.
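
To give a flavour, here is a minimal toy driver in the style of the prototype
classes linked above; the exact base-class API should be checked against the
Intake version in use:

```Python
import numpy as np
import pandas as pd
from intake.source import base


class RandomSource(base.DataSource):
    """Toy driver: produce a data-frame of random numbers."""
    name = 'random'          # the driver name to use in catalog entries
    container = 'dataframe'  # the container type that read() returns
    version = '0.0.1'
    partition_access = False

    def __init__(self, size=10, metadata=None):
        self.size = size
        super(RandomSource, self).__init__(metadata=metadata)

    def _get_schema(self):
        # describe the data without actually loading it
        return base.Schema(datashape=None, dtype={'x': 'float64'},
                           shape=(self.size,), npartitions=1,
                           extra_metadata={})

    def _get_partition(self, _):
        return pd.DataFrame({'x': np.random.random(self.size)})

    def read(self):
        return self._get_partition(0)
```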

Such code can be distributed through the usual channels, and a catalog entry can
specify which module must be installed for the data to load. Indeed, when
distributing catalogs, the appropriate dependency on the plugin package can be declared.

Some suggestions of [plugin ideas](https://github.com/ContinuumIO/intake/issues/58) are
listed in a GitHub issue - please discuss and add your own!

The linked notebook shows an example workflow for a developer.

[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/martindurant/intake-release-blog/master?filepath=dev.ipynb)


## Summary

Intake provides a very simple yet useful division between the users of data and
the maintainers of data-source catalogs. Intake's code is approachable and extensible
in many places, and so it can hopefully progress to become an all-inclusive data ecosystem
for numerical Python.


---

This work was supported by Anaconda. Please note that the effort is still in the early
stages, so expect some rough edges!