# Setting up your system to work with Hathi feature-counts quickly and reproducibly.

When working with large-scale data like the Hathi Feature counts, there can be a tradeoff between local speed and reproducible patterns of research.

Especially if you're working in a team, it may be that people have different strategies for storing HTRC data across their different systems. How 
do you write code that everyone can run?

This notebook lays out a strategy that makes this possible using the `resolvers.make_resolver_chain` function in htrc_features.

In [29]:
%load_ext autoreload
%autoreload 2

import htrc_features
from htrc_features import Volume, resolvers, FeatureReader
from pathlib import Path


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Method 1: Define your own system-wide resolver, and give users a fallback.

I have the following text at `~/.hathi_resolver.json`,
because I rsynced a full set of the feature counts to /drobo/hathi-ef and 
want to load each volume and create a zstd-compressed feather version of it when it's requested.


This won't work on your computer! But it doesn't need to--just define your *own* method there, 
and you'll get fast results as well.

```json
[
    {"method": "stubbytree", "format": "json", "compression": "bz2", "dir": "/drobo/hathi-ef"},
    {"method": "stubbytree", "format": "feather", "compression": "zstd", "dir": "/drobo/hathi-ef"}
]
```
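
If you would rather generate this file than type it by hand, a minimal sketch follows. The helper name `write_resolver_config` is hypothetical and not part of htrc_features; it only serializes a list of dictionaries with the standard `json` module, which guarantees the file is valid JSON. The directory in the chain is the one from my machine, so substitute your own.

```python
import json
from pathlib import Path

# Resolver chain from the example above; "/drobo/hathi-ef" exists only on my machine.
chain = [
    {"method": "stubbytree", "format": "json", "compression": "bz2", "dir": "/drobo/hathi-ef"},
    {"method": "stubbytree", "format": "feather", "compression": "zstd", "dir": "/drobo/hathi-ef"},
]


def write_resolver_config(chain, path):
    """Hypothetical helper: write a resolver chain to `path` as JSON."""
    path = Path(path).expanduser()
    path.write_text(json.dumps(chain, indent=4) + "\n")
    return path


# Commented out so that running this cell does not overwrite an existing file:
# write_resolver_config(chain, "~/.hathi_resolver.json")
```

This helper only writes the configuration; the loading code further down is what actually uses it.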


**The code below produces reproducible research without requiring anyone to save local files.** Anyone can run it, even
in, say, a Google Colab notebook, and htrc-feature-reader will fetch the relevant files over the web.

But with a local resolver defined, it will run *much* faster on your own machine than downloading the files again and again, and it lets you reuse
a single setup across multiple projects that look at Hathi files.

The code first checks for a file at `~/.hathi_resolver.json` and, if one exists, uses it as the plan for
loading files. If not, it just uses the module's default settings.


In [24]:
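import sys
import yaml  # PyYAML; safe_load also parses the JSON file shown above

try:
    # Use the machine-specific resolver chain if one has been defined.
    with Path("~/.hathi_resolver.json").expanduser().open() as f:
        my_resolver = yaml.safe_load(f)
    default_resolver = resolvers.make_resolver_chain(my_resolver)
except FileNotFoundError:
    # No local configuration: fall back to the module's default behavior.
    sys.stderr.write("Using default settings to resolve IDs. Consider adding a file at `~/.hathi_resolver.json`\n")
    default_resolver = None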

In [30]:
Volume("mdp.39015078554949", id_resolver=default_resolver)

## Method 2: Use resolvers to copy files to a local directory before publication.

Method 1 is reproducible, but it relies on the HTRC endpoint existing indefinitely. If you want to publish supporting materials, that's not
good enough; you need to bundle the data as well.

The best way to do that is to copy all the files you're using into a local directory.
You can do that manually; or you can let htrc-feature-reader handle it for you, by defining a resolver
that includes a copy step.

The first time I run the following code with my own resolver line enabled, it copies from /drobo/hathi-ef into the local directory 'hathi-features';
as written, with only the `http` source, it downloads the file instead. On every subsequent run, it finds the file immediately inside 'hathi-features'.

In [37]:
pre_publication_resolver_chain = [
    {"method": "http"},
    # I don't include the next line because *you* don't have a folder at /drobo/hathi-ef; but you could drop
    # in one of your own to avoid hitting the HTRC HTTP servers.
    # {"method": "stubbytree", "format": "json", "compression": "bz2", "dir": "/drobo/hathi-ef"},
    {"method": "local", "format": "json", "compression": "bz2", "dir": "hathi-features"},
]
# Create the target directory if it does not exist yet.
Path("hathi-features").mkdir(exist_ok=True)
publication_resolver = resolvers.make_resolver_chain(pre_publication_resolver_chain)

Volume("mdp.39015078554949", id_resolver=publication_resolver)
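
The "copy on first use, read locally afterwards" behavior described above is a read-through cache. The sketch below illustrates that general pattern with the standard library only; it is not how htrc_features implements its resolvers. The names `cached_fetch` and `fetch` are hypothetical, and the flat file naming only suits simple IDs like the one used in this notebook.

```python
from pathlib import Path


def cached_fetch(volume_id, cache_dir, fetch):
    """Return the bytes for `volume_id`, filling `cache_dir` on a miss.

    `fetch` is a hypothetical stand-in for a slow source (an HTTP request or a
    read from a network drive): it takes an ID and returns bytes.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / (volume_id + ".json.bz2")  # illustrative naming only
    if target.exists():
        # Cache hit: the slow source is never touched.
        return target.read_bytes()
    # Cache miss: fetch once, store a local copy, then return the data.
    data = fetch(volume_id)
    target.write_bytes(data)
    return data
```

Calling `cached_fetch` twice with the same ID invokes `fetch` only the first time, which is the same reason the second run of the cell above is fast.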

## Method 3: Keeping it simple for publication once you know the files exist.

You could leave the Method 2 resolver chain in your
published code, but it might make sense, as a last step before publication, to switch to a simple, local resolver.

I suggest json.bz2 here, but you could use parquet, feather, or anything else.

This resolver would not have worked when you started creating this notebook, but the block above should have created the 
needed files in 'hathi-features'.


In [38]:
publication_resolver = resolvers.LocalResolver(format="json", compression="bz2", dir="hathi-features")
Volume("mdp.39015078554949", id_resolver=publication_resolver)
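
Before switching to the local-only resolver, it can be worth confirming that the bundle directory actually contains files. A small sketch, using only `pathlib`; the helper name `summarize_bundle` is hypothetical, and it only counts files and bytes without checking that any particular volume is present.

```python
from pathlib import Path


def summarize_bundle(directory):
    """Count the regular files under `directory` and add up their sizes in bytes."""
    files = [p for p in Path(directory).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


# Example use against the directory filled in Method 2:
# n_files, n_bytes = summarize_bundle("hathi-features")
# print(f"{n_files} files, {n_bytes} bytes")
```

If the count is zero, the Method 2 cell has not yet been run for the volumes you need.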