# Data Ingestion with Intake

Machine learning tasks are typically data heavy, requiring either labelled data for supervised learning or unlabelled data for unsupervised learning. In this user guide, the [`intake`](https://github.com/ContinuumIO/intake) library is used to fetch large datasets from remote data sources efficiently, with built-in caching to avoid unnecessary downloads when the files are already available locally.

Once you have loaded your data, you will typically need to reshape it appropriately before it can be fed into a machine learning pipeline. These steps are detailed in the next user guide [Alignment_and_Preprocessing](03_Alignment_and_Preprocessing.ipynb).

## Inline loading

We'll start with the simple case of loading small local data. In this case intake isn't necessary, and `pandas` is often preferred, typically via ``pandas.read_csv`` as follows:

In [None]:
import pandas as pd

training_df = pd.read_csv('../landsat5_training.csv')

We can inspect the first several lines of the file using ``.head()``, or a random set of rows using ``.sample(n)``.

In [None]:
training_df.head()
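
As a small illustration of the difference between ``.head()`` and ``.sample(n)``, the sketch below builds a hypothetical ten-row dataframe (the column names `band1` and `label` are made up, not taken from the training file) and pulls rows both ways.

```python
import pandas as pd

# Hypothetical stand-in for the training data; column names are illustrative only.
df = pd.DataFrame({"band1": range(10), "label": list("ababababab")})

first_rows = df.head(3)                       # first three rows, in file order
random_rows = df.sample(n=3, random_state=0)  # three random rows; seed fixed for reproducibility

print(first_rows)
print(random_rows)
```

Fixing `random_state` is optional; without it, each call returns a different random selection.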

To get a better sense of how this dataframe is set up, we can look at ``.info()``.

In [None]:
training_df.info()

We can do the same kinds of things using intake.

In [None]:
import intake

training = intake.open_csv('../landsat5_training.csv')

To get insight into the data without loading all of it yet, we can convert the source to a lazy dask dataframe using ``.to_dask()`` and inspect that.

In [None]:
training_dd = training.to_dask()
training_dd.head()

In [None]:
training_dd.info()

To get a full ``pandas.DataFrame`` object, use ``.read()`` to load all the data.

In [None]:
training_df = training.read()
training_df.info()

**NOTE:** There are different items in these two info views which reflect what is knowable before and after we read all the data. For instance, it is not possible to know the ``shape`` of the whole dataset before it is loaded.
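
The same distinction can be illustrated without intake or dask. The sketch below is only an analogy in plain Python, using made-up CSV content: the column names are available after reading a single header line, while the row count is known only once every row has been consumed. It does not reflect how intake or dask are implemented internally.

```python
import csv
import io

# Made-up CSV content standing in for a file on disk.
raw = io.StringIO("red,green,blue\n1,2,3\n4,5,6\n7,8,9\n")

reader = csv.reader(raw)
columns = next(reader)            # cheap: only the header line is read
print(columns)                    # column names are known up front

n_rows = sum(1 for _ in reader)   # consumes every remaining row to count them
print(n_rows)
```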

## Loading multiple files

In addition to allowing partitioned reading of files, intake lets the user load and concatenate data across multiple files in one command:

In [None]:
training = intake.open_csv(['../landsat5_training.csv', '../landsat8_training.csv'])

In [None]:
training_df = training.read()
training_df.info()
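
For intuition, the result of reading several files at once resembles reading each file separately and stacking the rows. The following is a minimal pandas sketch under that assumption, using made-up in-memory CSV contents rather than the real training files; intake's own index handling may differ.

```python
import io
import pandas as pd

# Made-up CSV contents standing in for the two training files.
csv_a = io.StringIO("red,label\n1,water\n2,forest\n")
csv_b = io.StringIO("red,label\n3,urban\n")

frames = [pd.read_csv(csv_a), pd.read_csv(csv_b)]
combined = pd.concat(frames, ignore_index=True)  # stack rows and renumber the index

# The combined length is the sum of the individual lengths.
print(len(frames[0]), len(frames[1]), len(combined))
```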

**NOTE:** The length of the dataframe has increased now that we are loading multiple sets of training data.

The same set of files can be specified more simply with a glob pattern:

In [None]:
training = intake.open_csv('../landsat*_training.csv')
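
To see which names a pattern like this selects, the standard-library `fnmatch` module can be used to test shell-style wildcards. The file names below are hypothetical, and this only illustrates the pattern semantics; intake resolves the glob against the file system itself.

```python
from fnmatch import fnmatch

# Hypothetical file names; only the first two follow the naming scheme used above.
names = [
    "landsat5_training.csv",
    "landsat8_training.csv",
    "landsat5_test.csv",
    "catalog.yml",
]

pattern = "landsat*_training.csv"  # '*' matches any run of characters
matched = [name for name in names if fnmatch(name, pattern)]
print(matched)
```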

Sometimes information is encoded in a file name or path, so concatenating the data loses important context. In this example, we lose track of which version of Landsat each set of training data came from. To keep that information, we use a Python format string to specify the path and declare a new field on the data. That field is populated from its value in each file's path.

In [None]:
training = intake.open_csv('../landsat{version:d}_training.csv')
training_df = training.read()
training_df.head()
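
The idea of recovering a field from a path can be sketched with a regular expression. The helper `parse_version` below is hypothetical and hand-written to mirror the `{version:d}` field in the pattern above; it is not how intake implements the parsing.

```python
import re

def parse_version(path, pattern=r"landsat(?P<version>\d+)_training\.csv$"):
    """Return the integer in the 'version' slot of the path, or None if the path does not match."""
    match = re.search(pattern, path)
    return int(match.group("version")) if match else None

for path in ["../landsat5_training.csv", "../landsat8_training.csv"]:
    print(path, "->", parse_version(path))
```

Because the field is declared with the `d` (integer) format, the sketch converts the captured text to `int` rather than leaving it as a string.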

## Using Catalogs

For more complicated setups, we use a catalog file, ``catalog.yml``, to declare how the data should be loaded. The catalog also defines some metadata and specifies any patterns in the file path that should be included in the data. Here is an example of a catalog entry:

In [None]:
with open('../catalog.yml') as f:
    for line in f.readlines()[:16]:
        print(line.rstrip())

The ``urlpath`` can be a path to a file, a list of files, or a path with glob notation. Alternatively, the path can be written as a Python-style [format string](https://docs.python.org/3.6/library/string.html#format-string-syntax). When the ``urlpath`` is a format string, the fields specified in that string are parsed from the filenames and returned in the data.

In [None]:
cat = intake.open_catalog('../catalog.yml')
list(cat)

In [None]:
l5 = cat.l5
l5.to_dask()

**NOTE:** The data has not been loaded yet, so we don't have access to the actual data values, but we do have access to coordinates and metadata.

In [None]:
l5_da = l5.read_chunked()

## Visualizing the data

To get a quick sense of the data, we can plot it using `hvplot`.

In [None]:
import hvplot.intake
intake.output_notebook()

In [None]:
l5.hvplot(kind='image', x='x', y='y', groupby='band', datashade=True, width=400)

This same plot can be declared in the catalog for ease of use and to point users to helpful ways to visualize data. Here is the relevant part of `catalog.yml`:

In [None]:
with open('../catalog.yml') as f:
    for line in f.readlines()[16:25]:
        print(line.rstrip())

In [None]:
l5.plot.band_image()

We can achieve the same output using the dask-backed array (``l5_da``) itself. Working with the array directly lets us do some pre-processing first, such as filtering out missing values.

In [None]:
l5_da_filtered = l5_da.where(l5_da > l5_da.nodatavals[0])

We can plot this filtered array to get rid of the background artifact seen above. 

In [None]:
import hvplot.xarray

In [None]:
l5_da_filtered.hvplot(kind='image', x='x', y='y', groupby='band', datashade=True, width=400)
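
The filtering step used above can be sketched with plain NumPy. The sentinel value and the tiny array here are assumptions for illustration; in the real data the sentinel comes from the file's `nodatavals` metadata.

```python
import numpy as np

nodata = -9999.0  # assumed sentinel for missing pixels, for illustration only
band = np.array([[120.0, -9999.0], [95.0, 130.0]])

# Keep values above the sentinel and replace everything else with NaN,
# which mirrors what xarray's .where(condition) does by default.
filtered = np.where(band > nodata, band, np.nan)
print(filtered)
```

Replacing the sentinel with NaN keeps the array shape intact, so the masked pixels can simply be skipped when plotting.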

## Accessing the data

Machine learning pipelines such as those in scikit-learn accept NumPy arrays as input. In xarray objects, these arrays are accessible via the `values` attribute. Note that accessing `values` on a dask-backed array loads the data into memory.

In [None]:
type(l5_da_filtered.values)

### Next:

Now that you have loaded your data, you will typically need to reshape it appropriately before it can be fed into a machine-learning pipeline. These steps are detailed in the next user guide [Alignment_and_Preprocessing](03_Alignment_and_Preprocessing.ipynb).