# Demo Part 1: Data Ingestion using `intake`
- author: hjsong
- date: 05/22/2019

Use Python's `intake` library to simplify the data ingestion process.
It maintains data catalogs for both local and remote data sources, and it hides the data-reading details (often the first boilerplate step before analysis, and one that clutters the code).
- Also provides a GUI
- YAML configuration files record how to load each data source (including the driver type; we can write our own driver as a plugin)
- Catalog entries can also specify metadata
    - e.g., the plot types supported for the data source
    - supports easy integration with holoviews
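
To make the catalog idea concrete, here is a minimal, library-free sketch of what a catalog does conceptually: it maps a source name to a driver, loading arguments, and metadata, so analysis code only has to ask for a name. This is not how `intake` is implemented; `TOY_CATALOG`, `DRIVERS`, `read_csv_text`, and `load` are hypothetical names, and the inline CSV holds made-up values standing in for a real file.

```python
import csv
import io

# Made-up CSV text; a real catalog entry would point at a path or URL instead.
RAW_CSV = "Year,Robbery\n1960,100\n1961,110\n"

def read_csv_text(text):
    """Toy 'csv' driver: parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

# Driver name -> loader function (hypothetical registry for this sketch).
DRIVERS = {"csv": read_csv_text}

# One catalog entry: how to load the source, plus free-form metadata.
TOY_CATALOG = {
    "toy_crime": {
        "driver": "csv",
        "args": {"text": RAW_CSV},
        "metadata": {"plots": {"line": {"x": "Year", "y": "Robbery"}}},
    }
}

def load(name):
    """Look up an entry by name and dispatch to its driver."""
    entry = TOY_CATALOG[name]
    return DRIVERS[entry["driver"]](**entry["args"])

rows = load("toy_crime")
print(rows[0]["Year"], TOY_CATALOG["toy_crime"]["metadata"]["plots"])
```

The point of the design is that the loading details live in one declarative place, so swapping a local file for a remote one changes the entry and not the analysis code.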
    


In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os, sys
from pathlib import Path
from pprint import pprint

UTILS_DIR = Path('../utils').absolute()
assert UTILS_DIR.exists()
if str(UTILS_DIR) not in sys.path:
    sys.path.insert(0, str(UTILS_DIR))
    print(f"Added {str(UTILS_DIR)} to sys.path")

pprint(sys.path)
    

In [None]:
# Don't want to see deprecation warnings
# https://docs.python.org/3/library/warnings.html#warning-filter
# Jupyter notebooks run as the '__main__' module
# print(__name__)  # at the console
      
import warnings
pprint(warnings.filters)

In [None]:
if not sys.warnoptions:
    warnings.simplefilter('ignore')
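
A global `simplefilter('ignore')` hides every warning for the rest of the session. As a hedged alternative, the standard library's `warnings.catch_warnings()` limits the suppression to a single block. In the sketch below, `noisy` is a hypothetical function used only for illustration.

```python
import warnings

def noisy():
    """Hypothetical function that emits a deprecation warning."""
    warnings.warn("old API", DeprecationWarning)
    return 42

# Suppress DeprecationWarning only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    quiet_result = noisy()

# Outside that block the earlier filters are restored.
# Record warnings here to confirm the warning is emitted again.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy()

print(quiet_result, len(caught))
```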

In [None]:
# don't write bytecode
sys.dont_write_bytecode = True

In [None]:
from utils import get_mro as mro, nprint, dict2json, display_dict2json, cols_with_null

## Inline loading

We'll start with the simple case of loading small local datasets, such as a .csv file for Pandas:

In [None]:
import intake
print(list(intake.cat))

In [None]:
cat_crime = intake.cat.us_crime
cat_info = cat_crime.discover()
varnames = list(cat_info['dtype'].keys())
meta = cat_info['metadata']
nprint('cat entry info') 
nprint('varnames',varnames)
nprint('meta', meta)

In [None]:
# meta = cat_info.get('metadata')
plots_info = meta['plots']
pprint(plots_info)

In [None]:
# plot names
plot_names = cat_crime.plots
for pname in plot_names:
    nprint(pname,plots_info[pname])

In [None]:
# Quick visualization

In [None]:
intake.output_notebook() #simply loads in hvplot library
cat_crime.plot(x='Year', y='Motor vehicle theft')

In [None]:
p = cat_crime.plot(x='Year', y=['Motor vehicle theft', 'Robbery'])

In [None]:
print(p)

In [None]:
%%opts Curve [tools=['hover'], width=800]
p

In [None]:
p = cat_crime.plot.bivariate('Burglary rate', 'Property crime rate', width=500, height=400)
# pprint(mro(p))
display(p)


In [None]:
var_x = 'Burglary rate'
var_y = 'Property crime rate'

# Add joint distribution and scatterplot
(cat_crime.plot.bivariate(var_x, var_y, width=800, height=500, legend=False) * \
cat_crime.plot.scatter(var_x, var_y, color='black', legend=False, size=15) + \
cat_crime.plot.table([var_x, var_y], width=350, height=350))

In [None]:
mro(cat_crime.plot)

In [None]:
# kernel density estimates (smoothed histograms) of variables
vdims = ['Robbery', 'Burglary', 'Motor vehicle theft', 'Aggravated assault']
cat_crime.plot.kde(y=vdims, alpha=0.5, value_label='Count',width=800, height=500)

In [None]:
# Box-and-whisker plot
cat_crime.plot.box(y=vdims, invert=True, value_label='Count',width=800)

Good. These are basic plots, though. Now let's see what more we can do. We can support more engaging data exploration by linking different types of data sources and plot types. To do so, we'll use the lower-level holoviews and geoviews APIs rather than `hvplot` as above.

## Advanced plots using lower-level holoviews objects
- Modified: May 23, 2019


1. Link the table and the joint distribution/scatterplot 

In [None]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [None]:
# Load data
crime_df = cat_crime.read()
display(crime_df.head())

# check which columns have any null/nan values
null_cols = cols_with_null(crime_df)
pprint(null_cols)
temp = crime_df.drop(columns=null_cols)
display(temp.head())

# Equivalent to 
display(crime_df.dropna(axis=1, how='any').head())

In [None]:
import holoviews as hv
import xarray as xr

from holoviews import opts
# hv.notebook_extension('bokeh') # don't use it after `intake.output_notebook()`

In [None]:
def joint_dist(var1, var2, style=None):
    # TODO: not implemented yet; `style=None` avoids a shared mutable default
    pass

In [None]:
kdims = ['Year']
vdims = [varname for varname in varnames if varname not in kdims]
nprint('vdims', vdims)
rate_dims = [vdim for vdim in vdims if 'rate' in vdim]
nprint('rate dims', rate_dims)

In [None]:
ds = hv.Dataset(crime_df, kdims=kdims, vdims=vdims)
print(ds)

In [None]:
%%opts Curve [height=400, width=800, tools=['hover'] ]
vdims= ['Motor vehicle theft rate']
curve = ds.to(hv.Curve, 'Year', vdims)
curve

In [None]:
from holoviews import streams


In [None]:
sub = crime_df.set_index('Year').loc[1998:2000][vdims]
display(sub)

In [None]:
def make_from_boundsx(boundsx):
    sub = crime_df.set_index('Year').loc[boundsx[0]:boundsx[1]][vdims]
    return hv.Table(sub.describe().reset_index().values, 'stat', 'value')
def test_make_from_boundsx():
    display(make_from_boundsx([1960,2000]).opts(editable=True))
test_make_from_boundsx()
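
The callback above boils down to two steps: slice the rows whose `Year` falls inside the selected x-range, then summarize them. A dependency-free sketch of that step follows. The rows are made-up numbers, and `summarize_range` is a hypothetical helper, not part of holoviews or pandas. Both bounds are inclusive, like the label slice `.loc[lo:hi]`.

```python
from statistics import mean

# Made-up (year, value) pairs for illustration only.
ROWS = [(1960, 10.0), (1965, 20.0), (1970, 30.0), (1975, 40.0)]

def summarize_range(rows, bounds):
    """Return count/mean/min/max of values whose year lies in [lo, hi]."""
    lo, hi = bounds
    values = [value for year, value in rows if lo <= year <= hi]
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }

summary = summarize_range(ROWS, (1960, 1970))
print(summary)
```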

In [None]:
# Putting them together...
stream_boundsx = streams.BoundsX(source=curve, 
                                 boundsx=(1960,1970))

In [None]:
dmap = hv.DynamicMap(make_from_boundsx, streams=[stream_boundsx]
                    )

dmap

In [None]:
%%opts Curve [height=400, width=700, tools=['hover', 'xbox_select'] ]
curve+dmap

In [None]:
dmap

In [None]:
curve

---
### Quick holoviews DynamicMap experiment
Modified: May 23, 2019


In [None]:
def get_2d_image(

---
### Spacenet data catalog
Modified: May 23, 2019

In [None]:
sp_catalog = intake.open_catalog("../train_spacenet_catalog.yml")
sp_catalog


In [None]:
aoi = "AOI_3_Paris"
img_id = 482
img_ds = sp_catalog.train_rgb8_img(aoi=aoi, img_id=img_id)
img = img_ds.read()
img.plot.imshow()

In [None]:
# Read in the corresponding geojson file
rbuff_ds = sp_catalog.train_rbuffer_vec(aoi=aoi, img_id=img_id)
rbuff = rbuff_ds.read()


In [None]:
%%opts Curve [height=400, width=700, tools=['hover', 'xbox_select'] ]
curve

In [None]:
def get_vline(x):
    return hv.VLine(x).opts(color='red',line_width=2)
get_vline(0.1)

In [None]:
from holoviews import streams

In [None]:
pointer = streams.PointerXY()
print(pointer.source)
print(pointer.contents)

In [None]:
# pointer.update(x=10)
print(pointer.contents)
print(id(pointer))

In [None]:
import ipywidgets as iw
out = iw.Output(layout={'border': '1px solid black'})


In [None]:
out

In [None]:
# quick test
with out:
    print(pointer.contents)

In [None]:
@out.capture(clear_output=True)
def iw_output_subscriber(x, y):
    print(f'x: {x}, y: {y}')


In [None]:
# quick test
import time
for i in range(5):
    pointer.event(x=0.1*i,y=0.5)
    time.sleep(2)


Looking great. Let's add this function as a callback handler to the pointer stream object.

In [None]:
pointer.add_subscriber(iw_output_subscriber)
print(pointer.subscribers)
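
Conceptually, a stream keeps a list of subscribers and calls each one with its current contents whenever an event fires. The toy class below illustrates that pattern only. `ToyStream` is hypothetical and is not how holoviews implements `PointerXY`.

```python
class ToyStream:
    """Minimal observer-pattern stand-in for a parameterized stream."""

    def __init__(self, **contents):
        self.contents = dict(contents)
        self.subscribers = []

    def add_subscriber(self, callback):
        self.subscribers.append(callback)

    def event(self, **updates):
        # Update the stored values, then notify every subscriber.
        self.contents.update(updates)
        for callback in self.subscribers:
            callback(**self.contents)

seen = []
stream = ToyStream(x=None, y=None)
stream.add_subscriber(lambda x, y: seen.append((x, y)))
stream.event(x=0.1, y=0.5)
stream.event(x=0.2)  # y keeps its previous value
print(seen)
```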

In [None]:
p_dmap = hv.DynamicMap(lambda x, y: hv.Points([(x, y)]),
                       streams=[pointer])

Note: this dmap is automatically registered as the **source** of the pointer LinkedStream obj

In [None]:
p_dmap.opts(
    opts.Points(size=7, tools=['hover'])
)

In [None]:
out

Now the following should trigger an update of both the output widget and the `p_dmap` display:

In [None]:
# quick test
out.clear_output()
for i in range(5):
    pointer.event(x=0.1*i,y=0.5)
    time.sleep(1)


Good, it's working as expected:)

We can actually do this without mixing holoviews and ipywidgets (although I love the output widgets from the ipywidgets library). Let's replace the ipywidgets output widget with a holoviews Element of type `hv.Table`. Another option would be `hv.Div`.
                                                                 

In [None]:
%%opts Table (editable=True)
table = hv.Table([(0,0)], kdims=['x','y'])
table

In [None]:
from holoviews.plotting.links import DataLink

In [None]:
DataLink(p_dmap, table)

In [None]:
(p_dmap + table).opts(
    opts.Points(size=7, tools=['hover']),
    opts.Table(editable=True, width=200)
)

In [None]:
# TODO: link the table and the scatterplot

2. Larger dataset 

## todo:
- [ ] link the table and the scatterplot 
- [ ] holomaps (use the fields as key dims)
- [ ] holomaps - barchart of all variables, dropdown for time as keys
- [ ] use it for spacenet dataset -- local filesystems

# Nice GUI

In [None]:
intake.gui

In [None]:
# The GUI frontend is linked to the Python kernel,
# so we can programmatically access the catalog entry selected in the GUI via the `intake.gui.item` attribute
intake.gui.item

In [None]:
# Get quick information about this selected data source and dataset
intake.gui.item.discover()

We can inspect the first several lines of the file using ``.head``, or a random set of rows using ``.sample(n)``

In [None]:
training_df.head();
training_df.sample(5)

To get a better sense of how this dataframe is set up, we can look at ``.info()``

In [None]:
training_df.info()

To use methods like `pd.read_csv`, all data needs to be on the local filesystem (or on one of the limited remote specification formats supported by Pandas, such as S3). We could of course put in various commands here to fetch a file explicitly from a remote server, but the notebook would then very quickly get complex and unreadable.

Instead, for larger datasets, we can automate those steps using intake so that remote and local data can be treated similarly. 

In [None]:
import intake

training = intake.open_csv('../data/landsat5_training.csv')
mro(training)

To get better insight into the data without loading it all in just yet, we can inspect the data using ``.to_dask()``

In [None]:
training_dd = training.to_dask() # doesn't actually load the whole data
training_dd.head() 

In [None]:
training_dd.info() # since we didn't actually load in the entire dataset, we don't know the length of the dataframe

To get a full pandas.DataFrame object, use ``.read()`` to load in all the data.

In [None]:
training_df = training.read()
training_df.info()

**NOTE:** There are different items in these two info views which reflect what is knowable before and after we read all the data. For instance, it is not possible to know the ``shape`` of the whole dataset before it is loaded.
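
The same distinction shows up in plain Python: a lazy reader can report the columns after touching only the header, but the row count requires consuming everything. Here is a minimal sketch with the standard `csv` module, using an inline made-up table in place of the Landsat file.

```python
import csv
import io

# Made-up CSV text standing in for a large file on disk.
text = "red,green,blue\n1,2,3\n4,5,6\n7,8,9\n"

reader = csv.reader(io.StringIO(text))  # lazy: nothing is parsed yet
columns = next(reader)                  # reads only the header line
first_row = next(reader)                # reads one more line, like .head()

# Counting rows forces the remaining lines to be read.
remaining = sum(1 for _ in reader)
n_rows = 1 + remaining

print(columns, first_row, n_rows)
```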

## Loading multiple files

In addition to allowing partitioned reading of files, intake lets the user load and concatenate data across multiple files in one command.

In [None]:
training = intake.open_csv(['../data/landsat5_training.csv', '../data/landsat8_training.csv'])

In [None]:
training_df = training.read()
training_df.info()
training_df.head()

**NOTE:** The length of the dataframe has increased now that we are loading multiple sets of training data.

This can be more simply expressed as:

In [None]:
training = intake.open_csv('../data/landsat*_training.csv')
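
What the glob form does can be sketched by hand: expand the pattern, read each match, and append the rows. The example builds two tiny made-up CSV files in a temporary directory so that it runs anywhere. The file contents are illustrative, not the real training data.

```python
import csv
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create two small stand-in files that follow the naming scheme.
    for version, value in [(5, "a"), (8, "b")]:
        path = os.path.join(tmp, f"landsat{version}_training.csv")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows([["label"], [value]])

    # Expand the pattern; sort for a deterministic order.
    matches = sorted(glob.glob(os.path.join(tmp, "landsat*_training.csv")))

    # Read every match and append its rows to one combined list.
    combined = []
    for path in matches:
        with open(path, newline="") as f:
            combined.extend(csv.DictReader(f))

print(len(matches), [row["label"] for row in combined])
```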

Sometimes data encoded in a file name or path gives important context that is lost when files are concatenated. In this example, we lose the information about which version of Landsat each training set came from. To keep track of it, we can write the path as a Python format string, which declares a new field on our data. That field is populated from the matching part of each file's path.

In [None]:
training = intake.open_csv('../data/landsat{version:d}_training.csv')
training_df = training.read()
training_df.head()
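
One way to picture the path-pattern mechanism: turn the format string into a regular expression with a named group, then match it against each file name. This is a simplified sketch that handles only `{name:d}` integer fields. `pattern_to_regex` is a hypothetical helper, and intake's own parsing is more general.

```python
import re

def pattern_to_regex(pattern):
    """Convert '{name:d}' fields into named integer groups; escape the rest."""
    parts = re.split(r"\{(\w+):d\}", pattern)
    # Even indices are literal text, odd indices are field names.
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>\\d+)"
        for i, part in enumerate(parts)
    )
    return re.compile(regex + r"\Z")

regex = pattern_to_regex("landsat{version:d}_training.csv")
fields = {
    name: int(regex.match(name).group("version"))
    for name in ["landsat5_training.csv", "landsat8_training.csv"]
}
print(fields)
```

Each parsed value can then be attached to the rows read from that file, which is how the version survives concatenation.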

In [None]:
# Exercise: Try looking at the tail of the data using training_df.tail(), or a random sample using training_df.sample(5)

---

## Read multiple netcdf files in one line
* hayley
* 05/21/2019

### 1. Using `intake.open_netcdf` with python string formatting

In [None]:
FLDAS_DATA_DIR = '/home/hayley/data/mint/FLDAS_NOAH01_A_EA_D.001/2017/01/'
name_fmt  =  'FLDAS_NOAH01_A_EA_D.A201701{day:d}.001.nc'
fpath = os.path.join(FLDAS_DATA_DIR, name_fmt)
print(fpath)

In [None]:
fldas_ds = intake.open_netcdf(fpath) #datasource

In [None]:
fldas_ddf = fldas_ds.to_dask()  # lazy, dask-backed xarray.Dataset (not a dask dataframe)

In [None]:
# vs
# fldas_ds2 = intake.open_netcdf(os.path.join(FLDAS_DATA_DIR, 'FLDAS_NOAH01_A_EA_D.A201701*.001.nc'))
# dataset2 = fldas_ds2.read() # returns xarray.Dataset
# dataset2

In [None]:
nprint(*mro(fldas_ddf),header=False)

In [None]:
# inspect the dask-backed dataset
fldas_ddf.info()

In [None]:
fldas_dataset = fldas_ds.read() # read all data to memory


In [None]:
from utils import get_mro  # earlier it was imported only under the alias `mro`
mro = get_mro
pprint(mro(fldas_dataset))

In [None]:
dset = fldas_dataset
print(dset.dims)

In [None]:
dset

## Visualize the dataset using geoviews

In [None]:
import holoviews as hv
from holoviews import opts
import geoviews as gv
hv.extension('matplotlib')


In [None]:
# hv_dataset = hv.Dataset(fldas_dataset, ['time', 'X', 'Y'], 'Qair_f_tavg')
data_t0 = fldas_dataset.sel(bnds=0)

In [None]:
# assumes the slice from the previous cell (`data_t0`) is the intended input
hv.Dataset(data_t0).to(hv.Image, ['X','Y'])

In [None]:

import numpy as np
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)
img = hv.Image(Z)

img

In [None]:
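# `dataset2` only exists in the commented-out glob variant above; use the dataset we actually read
hv_dataset = hv.Dataset(fldas_dataset, ['time', 'X', 'Y'], 'Qair_f_tavg')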

In [None]:
hv_dataset

In [None]:
hv_dataset.to(hv.Image, ['X','Y']).hist()

### 2. Using `intake`'s catalog files


In [None]:
mro(

## Using Catalogs

For more complicated setups, we use a catalog file, `catalog.yml`. The catalog declares how the data should be loaded, defines some metadata, and specifies any patterns in the file path that should be included in the data. Here is an example of a catalog entry:

```
sources:
  landsat_5_small:
    description: Small version of Landsat 5 Surface Reflectance Level-2 Science Product.
    driver: rasterio
    cache:
      - argkey: urlpath
        regex: 'earth-data/landsat'
        type: file
    args:
      urlpath: 's3://earth-data/landsat/small/LT05_L1TP_042033_19881022_20161001_01_T1_sr_band{band:d}.tif'
      chunks:
        band: 1
        x: 50
        y: 50
      concat_dim: band
      storage_options: {'anon': True}
```

The ``urlpath`` can be a path to a file, list of files, or a path with glob notation. Alternatively the path can be written as a python style [format_string](https://docs.python.org/3.6/library/string.html#format-string-syntax). In the case where the ``urlpath`` is a format string, the fields specified in that string will be parsed from the filenames and returned in the data. 

In [None]:
cat = intake.open_catalog('../catalog.yml')
list(cat)

In [None]:
# Exercise: Read the description of the landsat_5_small data source using cat.landsat_5_small.description
cat.landsat_5_small.describe()

**NOTE:** If you don't have the data cached yet, then the next cell will take a few seconds.

In [None]:
landsat_5 = cat.landsat_5_small
landsat_5.to_dask()

The data has not yet been loaded so we don't have access to the actual data values yet, but we do have access to coordinates and metadata.

In [None]:
pprint(mro(landsat_5))

## Visualizing the data

To get a quick sense of the data, we can plot it using [hvPlot](https://hvplot.pyviz.org/), which provides interactive plotting commands for Intake, Pandas, XArray, Dask, and GeoPandas. We'll look more closely at hvPlot and its options in later tutorials.

In [None]:
import hvplot.intake
intake.output_notebook()

import holoviews as hv
hv.extension('bokeh')

We can quickly generate a plot of each of the landsat bands using the overview plot declared in the catalog. Here is the relevant part of `catalog.yml`:

```
metadata:
  plots:
    band_image:
      kind: 'image'
      x: 'x'
      y: 'y'
      groupby: 'band'
      rasterize: True
      width: 400
      dynamic: False
```

In [None]:
landsat_5.hvplot.band_image()

In [None]:
landsat_5.plots
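
A declared plot is essentially a named bundle of keyword arguments that gets forwarded to the plotting call. The sketch below mimics that with `functools.partial`. `fake_plot` and `make_plot_methods` are hypothetical stand-ins, not hvPlot functions, and the override behavior shown is that of `partial`, not a statement about hvPlot.

```python
from functools import partial

# Plot declarations as they might look after parsing the YAML metadata.
PLOTS = {
    "band_image": {
        "kind": "image", "x": "x", "y": "y",
        "groupby": "band", "rasterize": True,
        "width": 400, "dynamic": False,
    },
}

def fake_plot(**kwargs):
    """Stand-in for a plotting function: report what it was asked to draw."""
    return kwargs

def make_plot_methods(plots):
    """Bind each declaration's arguments to the plotting function by name."""
    return {name: partial(fake_plot, **spec) for name, spec in plots.items()}

methods = make_plot_methods(PLOTS)
default_call = methods["band_image"]()
overridden = methods["band_image"](width=600)  # call-time arguments win
print(default_call["kind"], overridden["width"])
```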

## Accessing the data
So far we have been looking at the intake catalog entry `landsat_5`. To access its data, we need to read it. If the data are big, we can use the `.to_dask()` method to create a dask-backed `xarray.DataArray`. If the data are small, we can use the `.read()` method to load everything straight into a regular `xarray.DataArray`. Once in an `xarray` object, the data can be more easily manipulated and visualized.

In [None]:
type(landsat_5)

### Xarray DataArray
To get an `xarray` object, we'll use the `.read()` method.

In [None]:
landsat_5_xda = landsat_5.read()
type(landsat_5_xda)

We can use tab completion to explore the attributes and methods available on our `xarray.DataArray` object, and to see what other information is stored on it.

In [None]:
# Exercise: Try typing landsat_5_xda. and press [tab] - don't forget the trailing dot!
landsat_5_xda.crs

### Numpy Array
Machine-learning libraries such as scikit-learn accept NumPy arrays as input. These arrays are accessible on DataArray objects through the `values` attribute.

In [None]:
landsat_5_npa = landsat_5_xda.values
type(landsat_5_npa)

---

In [None]:
cat = 

### Next:

Now that you have loaded your data, you will typically need to reshape it appropriately before it can be fed into a machine-learning pipeline. These steps are detailed in the next tutorial: [Alignment and Preprocessing](03_Alignment_and_Preprocessing.ipynb).

---
---
## Exp 1: Use FLDAS dataset
* hayley
* 5/21/2019
* purpose: share how to use `intake` and `pyviz` to read data from catalogs and show quick visualizations
    - the workflow of reading data through a catalog (Python's `intake`) and visualizing it (with `holoviews` and `geoviews`)


In [None]:
import requests, os
from pathlib import Path

In [None]:
base_url = 'https://workflow.isi.edu/MINT/FLDAS/FLDAS_NOAH01_A_EA_D.001/2001/01/'
fname = 'FLDAS_NOAH01_A_EA_D.A20010102.001.nc'

In [None]:
fpath = base_url + fname  # base_url already ends with '/'; os.path.join is meant for filesystem paths, not URLs
req = requests.get(fpath)

In [None]:
req.status_code

In [None]:
import xarray as xr

In [None]:
temp = req.content

In [None]:
data = req.raw

In [None]:
MINT_DATA_DIR = Path('/home/hayley/data/mint/FLDAS_NOAH01_A_EA_D.001/2017/')
month = '01'
day = '01'
date = f'2017{month}{day}'

print('date', date)
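
To go from one date to a whole month of files, the date strings can be generated with `datetime`. The sketch assumes the daily naming scheme seen in the file names above (`FLDAS_NOAH01_A_EA_D.A<YYYYMMDD>.001.nc`). `daily_filenames` is a hypothetical helper.

```python
import datetime as dt

def daily_filenames(year, month):
    """List the expected daily file names for one month."""
    day = dt.date(year, month, 1)
    names = []
    while day.month == month:
        names.append(f"FLDAS_NOAH01_A_EA_D.A{day:%Y%m%d}.001.nc")
        day += dt.timedelta(days=1)
    return names

names = daily_filenames(2017, 1)
print(len(names), names[0], names[-1])
```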

