# Intake Caching

This notebook demonstrates how to use and manage caching with Intake to avoid repeated downloads of large data files.

Let's start with a simple example. First, import Intake as normal and open the catalog.

In [None]:
import intake
cat = intake.open_catalog('catalog.yml')
list(cat)

The cache specification from the catalog is shown in the source metadata.

In [None]:
stats = cat.demographic_stats()
stats.cache[0].clear_all() # <-- clearing cache to make sure we start from scratch.
stats.discover()['metadata']

In [None]:
stats._urlpath

Here the urlpath points to a remote HTTP server, so reading the data source for the first time triggers a download.

In [None]:
%time df = stats.read()

Now let's read the data again. Notice that the read is fast this time, thanks to local caching.

In [None]:
%time df = stats.read()
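
The difference between the two reads can be illustrated with a small standalone sketch. This is not Intake's implementation: `cached_fetch`, `fake_fetch`, and the example URL are hypothetical names used only for illustration, and hashing the URL with MD5 to name the cache subdirectory is an assumption. The point is simply that the remote fetch runs only when no local copy exists.

```python
import hashlib
import os
import tempfile


def cached_fetch(url, fetch, cache_dir):
    """Return a local path for url, calling fetch(url) only on a cache miss."""
    # Assumption: one subdirectory per URL, named by a hash of the URL.
    key = hashlib.md5(url.encode()).hexdigest()
    local_dir = os.path.join(cache_dir, key)
    local_path = os.path.join(local_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        os.makedirs(local_dir, exist_ok=True)
        with open(local_path, "wb") as f:
            f.write(fetch(url))
    return local_path


calls = []


def fake_fetch(url):
    """Stand-in for a slow HTTP download; records each call."""
    calls.append(url)
    return b"state,population\n"


cache_dir = tempfile.mkdtemp()
url = "https://example.com/data/stats.csv"  # hypothetical URL
first = cached_fetch(url, fake_fetch, cache_dir)
second = cached_fetch(url, fake_fetch, cache_dir)
print(first == second, len(calls))  # prints: True 1
```

Both calls return the same local path, but the fetch function runs only once, which is why the second read is fast.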

See that we do indeed have the data.

In [None]:
df.head()

Under the hood, the files now exist locally in the default cache directory.

In [None]:
%ls -la ~/.intake/cache/f890ce4d538240e87ede9d31a6541443

Inspecting the metadata shows the created timestamp, original path, and cached path.

In [None]:
stats.cache[0].get_metadata(stats._urlpath)

The data source will provide the cache directory if you are not sure where it is located.

In [None]:
stats.cache_dirs

The cache can be cleared for an individual source.

In [None]:
stats.cache[0].clear_cache(stats._urlpath)
stats.cache[0].get_metadata(stats._urlpath)
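
To make the metadata lookup and the per-source clearing more concrete, here is a minimal sketch of a JSON-backed metadata store. It is only a model of the idea, not Intake's code: the class `SketchCacheMetadata`, its method names, the `metadata.json` file name, and the example URL are all assumptions. Only the three recorded fields (created timestamp, original path, cached path) come from the description above.

```python
import json
import os
import shutil
import tempfile
from datetime import datetime, timezone


class SketchCacheMetadata:
    """Hypothetical JSON-backed record of cached files, keyed by original URL."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.path = os.path.join(cache_dir, "metadata.json")

    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def _save(self, data):
        os.makedirs(self.cache_dir, exist_ok=True)
        with open(self.path, "w") as f:
            json.dump(data, f)

    def record(self, urlpath, cache_path):
        """Store when and where urlpath was cached."""
        data = self._load()
        data[urlpath] = {
            "created": datetime.now(timezone.utc).isoformat(),
            "original_path": urlpath,
            "cache_path": cache_path,
        }
        self._save(data)

    def get(self, urlpath):
        """Return the entry for urlpath, or None if it is not cached."""
        return self._load().get(urlpath)

    def clear(self, urlpath):
        """Remove the cached files and the metadata entry for one source."""
        data = self._load()
        entry = data.pop(urlpath, None)
        if entry is not None:
            shutil.rmtree(os.path.dirname(entry["cache_path"]), ignore_errors=True)
            self._save(data)


# Set up a fake cached file so there is something to record and clear.
cache_dir = tempfile.mkdtemp()
url = "https://example.com/data/stats.csv"  # hypothetical URL
local_dir = os.path.join(cache_dir, "abc123")
os.makedirs(local_dir)
local_file = os.path.join(local_dir, "stats.csv")
with open(local_file, "w") as f:
    f.write("state,population\n")

meta = SketchCacheMetadata(cache_dir)
meta.record(url, local_file)
entry_before = meta.get(url)
meta.clear(url)
entry_after = meta.get(url)
file_still_there = os.path.exists(local_file)
```

After `clear`, the lookup returns `None` and the cached file is gone, mirroring what the cells above show for a single source.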

After clearing the cache, the files are removed from the cache directory.

In [None]:
%ls -la ~/.intake/cache

If the data source is read again, the file is downloaded again.

In [None]:
%time df = stats.read()

In [None]:
%ls -la ~/.intake/cache/f890ce4d538240e87ede9d31a6541443

## Cache directory is configurable

The cache directory defaults to ``~/.intake/cache``, but it can be set in the config file, through an environment variable, or at runtime. Here it is set at runtime.

In [None]:
stats.cache[0].clear_cache(stats._urlpath)

import os.path

cat = intake.open_catalog('catalog.yml')
stats = cat.demographic_stats()
stats.set_cache_dir(os.path.join(os.getcwd(), 'test_cache_dir'))
stats.cache_dirs
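
With three places to set the directory, it helps to think about which one wins. The sketch below shows one plausible precedence (runtime, then environment variable, then config file, then default). This ordering is an assumption for illustration, not a statement of how Intake resolves it. `resolve_cache_dir` is a hypothetical helper, and the `INTAKE_CACHE_DIR` variable name and `cache_dir` config key are used here as assumptions as well.

```python
import os

DEFAULT_CACHE_DIR = os.path.join("~", ".intake", "cache")


def resolve_cache_dir(runtime=None, environ=None, file_config=None):
    """Pick a cache directory; the precedence order here is an assumption."""
    environ = os.environ if environ is None else environ
    file_config = file_config or {}
    chosen = (
        runtime                                  # explicit runtime setting
        or environ.get("INTAKE_CACHE_DIR")       # assumed environment variable
        or file_config.get("cache_dir")          # assumed config-file key
        or DEFAULT_CACHE_DIR                     # documented default
    )
    return os.path.expanduser(chosen)


from_runtime = resolve_cache_dir(
    runtime="/data/run_cache",
    environ={"INTAKE_CACHE_DIR": "/data/env_cache"},
)
from_env = resolve_cache_dir(environ={"INTAKE_CACHE_DIR": "/data/env_cache"})
from_file = resolve_cache_dir(environ={}, file_config={"cache_dir": "/data/file_cache"})
from_default = resolve_cache_dir(environ={})
```

Passing `environ` explicitly keeps the sketch independent of whatever happens to be set in the current shell.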

In [None]:
%time df = stats.read()

In [None]:
stats.cache[0].get_metadata(stats._urlpath)

In [None]:
stats.cache[0].clear_all()

## Disable Caching

Caching can be disabled globally through ``intake.config``.

In [None]:
from intake.config import conf
conf['cache_disabled'] = True

cat = intake.open_catalog('catalog.yml')
stats = cat.demographic_stats()

Notice that the read times are now consistently longer, because every read downloads the data again.

In [None]:
%time df = stats.read()

In [None]:
%time df = stats.read()
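
The effect of a global switch can be modelled in a few lines. In this sketch, `sketch_conf` is a hypothetical stand-in for a global configuration dictionary and `SketchSource` is an invented class, not part of Intake. It assumes the flag is consulted on every read, so turning it on bypasses any stored copy.

```python
# Hypothetical stand-in for a global configuration object.
sketch_conf = {"cache_disabled": False}


class SketchSource:
    """Toy data source that keeps an in-memory copy unless caching is disabled."""

    def __init__(self, url, fetch):
        self.url = url
        self._fetch = fetch
        self._cached = None
        self.fetch_count = 0

    def _download(self):
        self.fetch_count += 1
        return self._fetch(self.url)

    def read(self):
        # Assumption: the flag is checked at read time, not at construction.
        if sketch_conf["cache_disabled"]:
            return self._download()
        if self._cached is None:
            self._cached = self._download()
        return self._cached


def fake_fetch(url):
    """Stand-in for a slow HTTP download."""
    return "state,population\n"


sketch_conf["cache_disabled"] = True
uncached = SketchSource("https://example.com/data/stats.csv", fake_fetch)
uncached.read()
uncached.read()

sketch_conf["cache_disabled"] = False
cached = SketchSource("https://example.com/data/stats.csv", fake_fetch)
cached.read()
cached.read()

print(uncached.fetch_count, cached.fetch_count)  # prints: 2 1
```

With the flag on, each read pays the full download cost, which matches the two similar timings above.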

Also, the cache directory and metadata are empty.

In [None]:
stats.cache_dirs

In [None]:
%ls -la ~/.intake/cache

In [None]:
stats.cache[0].get_metadata(stats._urlpath)