# Intake

[Intake](https://github.com/ContinuumIO/intake) is a library from Anaconda Inc for sharing and discovering datasets in Python. 

A dataset packaged with intake consists of a manifest file, which describes where to find the data, and a plugin, which is used to load it. Our Pangeo comes with some datasets already installed, along with the standard set of plugins and some extras for loading environmental datasets.
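
To make the manifest-plus-plugin split concrete, here is a toy model written with the standard library only. It is not how intake is implemented, and every name in it (`PLUGINS`, `load_dataset`, the `driver` and `path` fields, the `tickets` dataset) is a hypothetical stand-in chosen for illustration. The point is only that the manifest says where the data lives and which loader to use, while the plugin does the loading.

```python
import csv
import os
import tempfile


# Toy "plugins": each one knows how to load a single kind of file.
def load_csv(path):
    """Load a CSV file into a list of dicts, one per row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def load_text(path):
    """Load a plain-text file into a list of lines."""
    with open(path) as f:
        return f.read().splitlines()


# Hypothetical registry mapping a driver name to the function that loads it.
PLUGINS = {"csv": load_csv, "text": load_text}


def load_dataset(manifest, name):
    """Look up a dataset in the manifest and load it with the plugin it names."""
    entry = manifest[name]
    plugin = PLUGINS[entry["driver"]]
    return plugin(entry["path"])


# Write a small CSV of made-up data so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "tickets.csv")
with open(csv_path, "w", newline="") as f:
    f.write("Site,Tickets\nBampfylde Street Car Park,10\n")

# Toy manifest: a description, where the data is, and which plugin loads it.
manifest = {
    "tickets": {
        "description": "Toy ticket counts",
        "driver": "csv",
        "path": csv_path,
    }
}

rows = load_dataset(manifest, "tickets")
print(rows[0]["Site"], rows[0]["Tickets"])
```

Because the manifest only names a driver, the same dataset entry could be pointed at a different file format by changing two fields, without touching the code that consumes it.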

## Listing available datasets

A number of datasets are available out of the box on our Pangeo. You can list them by importing the default intake catalog, `cat`, and iterating over it.

In [23]:
from intake import cat

In [24]:
for dataset in cat:
    print("{name}: {description}".format(name=dataset, description=cat[dataset].description))

car_park_tickets_sold: Data about the number of tickets sold in Exeter car parks from Exeter City Council (https://exeterdatamill.com/dataset/car-park-tickets-sold)
ncic_daily_land_obs: National Climate Information Centre (NCIC) daily land observations.
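
As the catalog grows, scanning the full listing becomes tedious. The sketch below filters entries by keyword. It relies only on the two behaviours used in the cell above: iterating the catalog yields dataset names, and `catalog[name].description` holds the description. `find_datasets` is a hypothetical helper, not part of intake, and `fake_cat` is a stand-in built from the two entries listed above so that the example runs without intake installed.

```python
from types import SimpleNamespace


def find_datasets(catalog, keyword):
    """Return the sorted names of entries whose name or description mentions keyword."""
    keyword = keyword.lower()
    matches = []
    for name in catalog:
        # Treat a missing description as empty text.
        description = catalog[name].description or ""
        if keyword in name.lower() or keyword in description.lower():
            matches.append(name)
    return sorted(matches)


# Stand-in for the real catalog, using the two entries listed above.
fake_cat = {
    "car_park_tickets_sold": SimpleNamespace(
        description="Data about the number of tickets sold in Exeter car parks "
        "from Exeter City Council"
    ),
    "ncic_daily_land_obs": SimpleNamespace(
        description="National Climate Information Centre (NCIC) daily land observations."
    ),
}

print(find_datasets(fake_cat, "climate"))
print(find_datasets(fake_cat, "exeter"))
```

Assuming the real catalog behaves as it does in the listing cell, the same call should work as `find_datasets(cat, "climate")`.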


## Loading datasets

Each dataset can be accessed from the intake catalog as an attribute of the same name. For example, the `ncic_daily_land_obs` dataset is available at `cat.ncic_daily_land_obs()`. The dataset object has properties that describe the dataset, such as the `description` used above. It also has two methods for reading the data: `read()` and `read_chunked()`.

### `read()`
Calling `read()` attempts to load the entire dataset into memory using an appropriate container. This may be desirable if you are working with a small dataset, such as one stored in a `csv` file. Calling `read()` on such a dataset loads the data into memory as a pandas dataframe.

In [26]:
car_park_tickets_sold = cat.car_park_tickets_sold().read()
car_park_tickets_sold.head()

| Unnamed: 0 | Year | Month | Date | Hour | Site | Tickets | SiteSub |
|---|---|---|---|---|---|---|---|
| 0 | 2014 | Mar | 2014-03-05 | 08:00 | Purchase Count - Bampfylde Street Car Park | 10.0 | Bampfylde Street Car Park |
| 1 | 2014 | Mar | 2014-03-05 | 16:00 | Purchase Count - Bampfylde Street Car Park | 2.0 | Bampfylde Street Car Park |
| 2 | 2014 | Mar | 2014-03-06 | 08:00 | Purchase Count - Bampfylde Street Car Park | | Bampfylde Street Car Park |
| 3 | 2014 | Mar | 2014-03-06 | 09:00 | Purchase Count - Bampfylde Street Car Park | | Bampfylde Street Car Park |
| 4 | 2014 | Mar | 2014-03-06 | 10:00 | Purchase Count - Bampfylde Street Car Park | | Bampfylde Street Car Park |


### `read_chunked()`
If you are accessing a large dataset that you do not wish to load into memory, you can call `read_chunked()`. This returns a `dask` object such as a `dataframe`, `array` or `bag`, or an object based on one of these primitives, such as an iris cube (which contains a `dask.array`). This allows you to work lazily in a dask workflow, where computations and data loading happen 'just in time'.
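
The idea of working lazily can be illustrated without dask. The sketch below is a standard-library analogy, not dask's or intake's implementation: a generator hands out rows a chunk at a time, and nothing is read until a computation asks for it. `iter_chunks` and the in-memory CSV are hypothetical, made-up examples. Dask goes further by building a task graph and running chunks in parallel, which this sketch does not attempt.

```python
import csv
import io
from itertools import islice


def iter_chunks(file_obj, chunk_size):
    """Yield lists of at most chunk_size rows, reading only as far as needed."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Made-up data standing in for a file too large to load at once.
data = io.StringIO("Hour,Tickets\n08:00,10\n16:00,2\n09:00,5\n10:00,7\n11:00,1\n")

# Creating the generator reads nothing; the stream is still at position 0.
chunks = iter_chunks(data, chunk_size=2)
position_before = data.tell()

# Rows are pulled in, two at a time, only while the sum is being computed.
total = sum(int(row["Tickets"]) for chunk in chunks for row in chunk)
print(position_before, total)
```

At no point does the whole dataset need to sit in memory: each chunk can be discarded once its rows have been added to the running total.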

In [28]:
# This can take a while
ncic = cat.ncic_daily_land_obs().read_chunked()
ncic

Exception ignored in: <bound method CFReader.__del__ of CFReader('/s3/ncic/gridded-land-obs-daily/grid/netcdf/mean-temperature/ukcp09_gridded-land-obs-daily_5km_mean-temperature_19790101_19791231.nc')>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/iris/fileformats/cf.py", line 1124, in __del__
    self._dataset.close()
AttributeError: 'CFReader' object has no attribute '_dataset'


[<iris 'Cube' of air_temperature / (degC) (time: 20454; projection_y_coordinate: 290; projection_x_coordinate: 180)>,
<iris 'Cube' of air_temperature / (degC) (time: 20454; projection_y_coordinate: 290; projection_x_coordinate: 180)>,
<iris 'Cube' of air_temperature / (degC) (time: 20454; projection_y_coordinate: 290; projection_x_coordinate: 180)>,
<iris 'Cube' of lwe_thickness_of_precipitation_amount / (mm) (time: 21184; projection_y_coordinate: 290; projection_x_coordinate: 180)>]

In [29]:
import iris
[rainfall] = ncic.extract(iris.Constraint(name='lwe_thickness_of_precipitation_amount'))
rainfall

| Lwe Thickness Of Precipitation Amount (mm) | time | projection_y_coordinate | projection_x_coordinate |
|---|---|---|---|
| Shape | 21184 | 290 | 180 |
| Dimension coordinates | | | |
| time | x | - | - |
| projection_y_coordinate | - | x | - |
| projection_x_coordinate | - | - | x |
| Auxiliary coordinates | | | |
| latitude | - | x | x |
| longitude | - | x | x |
| Attributes | | | |
| Conventions | CF-1.5 | CF-1.5 | CF-1.5 |


In [30]:
rainfall.has_lazy_data()

True