# Advanced Datadriver - Caching

_Execute the following cell in order to make the table of contents appear_

In [None]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

In this notebook, we will see how to reduce the memory your workflow uses during data exploration.

<h2 id="tocheading">Table of Contents</h2>
<div id="toc"></div>

# Memory Usage

Let's create a simple dataset first. We need a context and some data to import:

In [None]:
from dd import DB
from dd.api.contexts import LocalContext


# Context
db = DB(dbtype='sqlite', filename=':memory:')
context = LocalContext(db)
context.set_default_write_options(index=False, if_exists='replace')

# Loading data
titanic_datapath = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
dataset = context.load_file(titanic_datapath,
                            table_name="cache.dataset")

Since the load operation is lazy, the dataset object is still very small:

In [None]:
dataset.memory_usage

In [None]:
print("Size of dataset in bytes : {size}".format(size=dataset.memory_usage))

However, once you call an action on the dataset, you will notice the object gets significantly larger:

In [None]:
dataset.head()

In [None]:
print("Size of dataset in bytes : {size}".format(size=dataset.memory_usage))

What happened? The object stored the result of the computation, so later calls can reuse it instead of recomputing it:

In [None]:
dataset.dataframe
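
The behaviour above can be illustrated without Datadriver. The following is a minimal sketch in plain Python, assuming a hypothetical `LazyRows` class (it is not part of the `dd` API): nothing is loaded until the first action, and the result is then kept on the object, which is why the reported size grows.

```python
import sys


class LazyRows:
    """Hypothetical stand-in for a lazily loaded dataset (not the dd API)."""

    def __init__(self, loader):
        self._loader = loader  # callable that produces the rows on demand
        self._rows = None      # nothing is materialized yet

    @property
    def memory_usage(self):
        # Shallow size of the stored result in bytes, 0 while nothing is stored.
        return 0 if self._rows is None else sys.getsizeof(self._rows)

    def head(self, n=5):
        # The first action triggers the load; the result is kept for later calls.
        if self._rows is None:
            self._rows = self._loader()
        return self._rows[:n]


rows = LazyRows(lambda: list(range(1000)))
size_before = rows.memory_usage   # measured before any action
first_rows = rows.head()          # triggers the load
size_after = rows.memory_usage    # measured once the result is stored
print(size_before, size_after)
```

`sys.getsizeof` only reports the shallow size of the list, which is enough to show the jump; how Datadriver computes `memory_usage` internally is not covered here.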

# Dataset Caching
You may run out of memory and decide you would rather recompute some of the data than keep everything by default. In that case, the _cache()_ method is what you need.

In [None]:
# Loading data
titanic_datapath = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
dataset_no_cache = context.load_file(titanic_datapath,
                                     table_name="cache.dataset_test",
                                     write_options=dict(if_exists="replace", index=False))
dataset_no_cache.cache(False)

In [None]:
_ = dataset_no_cache.head()

In [None]:
dataset_no_cache.memory_usage

In [None]:
dataset_no_cache.dataframe is None
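
Disabling the cache trades memory for recomputation. As a rough illustration of that trade-off, here is a sketch using a hypothetical `RecomputingRows` class (an assumption for this example, not the `dd` implementation) that counts how often the underlying load actually runs.

```python
class RecomputingRows:
    """Hypothetical dataset with a cache switch (illustration only, not the dd API)."""

    def __init__(self, loader):
        self._loader = loader
        self._cache_enabled = True
        self._rows = None
        self.load_count = 0  # how many times the loader actually ran

    def cache(self, enabled=True):
        self._cache_enabled = enabled
        if not enabled:
            self._rows = None  # drop anything already stored
        return self

    def head(self, n=5):
        if self._rows is not None:
            return self._rows[:n]  # reuse the stored result
        rows = self._loader()
        self.load_count += 1
        if self._cache_enabled:
            self._rows = rows      # keep the result only when caching is on
        return rows[:n]


rows = RecomputingRows(lambda: list(range(1000)))

rows.cache(False)
rows.head()
rows.head()
loads_without_cache = rows.load_count  # the loader ran on every call

rows.cache(True)
rows.head()
rows.head()
loads_with_cache = rows.load_count - loads_without_cache  # ran once, then reused
print(loads_without_cache, loads_with_cache)
```

With caching off, every action pays the full load cost again; with caching on, only the first one does.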

If you ever get more memory, or realize that you can afford to keep your data in cache, you can change the value later:

In [None]:
dataset_no_cache.cache()  # True by default

In [None]:
_ = dataset_no_cache.head()
dataset_no_cache.memory_usage

# Context default caching

If you find yourself wishing that caching were disabled by default, use the *set\_auto\_persistence()* method of your context:

In [None]:
context.set_auto_persistence(False)

In [None]:
new_dataset = context.load_file(titanic_datapath,
                                table_name="cache.dataset_auto_persist")

In [None]:
new_dataset.head()

In [None]:
new_dataset.memory_usage
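
One way to picture a context-wide default is a setting that each new dataset reads when it is created. The sketch below uses hypothetical `SketchContext` and `SketchDataset` classes and assumes the default is copied at creation time; whether Datadriver also applies the change to datasets that already exist is not stated in this notebook.

```python
class SketchDataset:
    """Hypothetical dataset that remembers the cache setting it was created with."""

    def __init__(self, name, cache_enabled):
        self.name = name
        self.cache_enabled = cache_enabled


class SketchContext:
    """Hypothetical context holding a default cache setting (not the dd API)."""

    def __init__(self):
        self._auto_persistence = True  # caching is on by default

    def set_auto_persistence(self, enabled):
        self._auto_persistence = enabled

    def load(self, name):
        # Assumption: the default is read once, when the dataset is created.
        return SketchDataset(name, cache_enabled=self._auto_persistence)


ctx = SketchContext()
early = ctx.load("early")         # created while the default is still True
ctx.set_auto_persistence(False)
late = ctx.load("late")           # created after the default was switched off
print(early.cache_enabled, late.cache_enabled)
```

In this sketch a per-dataset call such as `cache()` would still be needed to override the default for an individual dataset.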