In [1]:
%load_ext autoreload
%autoreload 2

In [5]:
import htrc_features
import htrc_features.resolvers
from htrc_features import Volume, resolvers
import tempfile
import os
import json
import logging

In [7]:
logging.getLogger().setLevel(logging.INFO)


# Some explanations and tests for the new loading methods.

This is not a comprehensive set of tests, but should provide the basics.

## Loading from a path.

An unnamed first argument to `Volume` is inspected to see whether it's an ID or a path. This one looks like a path, so it is read from disk.

In [8]:
Volume("../data/PZ-volumes/hvd.hwrqs8.json.bz2").tokenlist().head(5)

path


| page | section | token | pos | count |
|------|---------|-------|-----|-------|
| 2 | body | 1C | CC | 1 |
| 2 | body | i | NN | 1 |
| 7 | body | . | $. | 1 |
| 7 | body | CHILDREN | NE | 1 |
| 7 | body | MR | NE | 1 |
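
To make the ID-versus-path distinction concrete, here is a minimal sketch of one way such a check could work. It is not the check `htrc_features` actually performs: the function name `looks_like_path` and the suffix list are assumptions made for illustration. It looks only at the file suffix, because HathiTrust IDs can themselves contain slashes and colons.

```python
# Hypothetical heuristic, NOT the library's real implementation.
# The suffix list below is an assumption chosen for this example.
KNOWN_SUFFIXES = (".json", ".json.bz2", ".parquet")

def looks_like_path(arg: str) -> bool:
    """Guess whether a string names a file rather than a HathiTrust ID."""
    # Only the suffix is checked: IDs such as ark identifiers may
    # contain "/" and ":", so separators are not a reliable signal.
    return arg.endswith(KNOWN_SUFFIXES)

print(looks_like_path("../data/PZ-volumes/hvd.hwrqs8.json.bz2"))
print(looks_like_path("hvd.hwrqs8"))
```

A bare ID such as `hvd.hwrqs8` ends in none of the listed suffixes, so under this sketch it would be handed to a resolver instead of being opened directly.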


## Loading over the web

This one loads from the web. I think there are probably gentler defaults than re-downloading the file on every call.

In [12]:
Volume("hvd.hwrqs8")

http


## Using IDs with paths.

That's basically the entire old method.

But now we can grab things from local sources while still using id arguments.

In this example, we use `LocalResolver` instead. We tell it that we're using JSON, bz2 compression, and a folder named `../data/PZ-volumes`.

In [15]:
fileholder = resolvers.LocalResolver(dir = "../data/PZ-volumes", format = "json", compression = "bz2")
fand = htrc_features.JsonFileHandler(id = "hvd.hwrqs8", id_resolver = fileholder)
fand

<htrc_features.parsers.JsonFileHandler at 0x10fa25950>

## Fetching over HTTP.

HTTP fetching is currently silent; we should probably warn when it happens without an explicit request. Here's how you'd request it explicitly:

In [16]:
fand = htrc_features.JsonFileHandler(id = "hvd.hwrqs8", id_resolver = "http")
fand._make_tokencount_df().head(5)

| page | section | token | pos | count |
|------|---------|-------|-----|-------|
| 2 | body | 1C | CC | 1 |
| 2 | body | i | NN | 1 |
| 7 | body | . | $. | 1 |
| 7 | body | CHILDREN | NE | 1 |
| 7 | body | MR | NE | 1 |


# Fancy zip storage

When working with millions of files, some systems start to run out of inodes. Here, we build a store using `ZiptreeResolver`, which assigns each file to one of 4096 zip files based on its name. I'll just create one in a tmpdir and build a resolver for it.

In [17]:
zipdir = tempfile.gettempdir()
zipholder = resolvers.ZiptreeResolver(zipdir, format = "json", compression = "bz2")

Now we'll go through the PZ-volumes folder and, for every volume:

1. Grab the ID.
2. Read the bzipped binary data into memory
3. Reinsert that binary data into the ziptree holder.

Note that we tell the zipholder to use 'json' storage and 'bz2' compression. `put` **does not** actually compress the data; those arguments only determine the filename stored for the ID, so that it can be retrieved later. This behavior is a little wonky, and may be changed or idiot-proofed.
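
The idea of reinserting already-compressed bytes without compressing them again can be sketched with the standard library alone. This does not show how `ZiptreeResolver` is implemented: the use of `zipfile.ZIP_STORED`, the in-memory archive, and the name `demo.vol1` are all assumptions for illustration.

```python
import bz2
import io
import json
import zipfile

# Stand-in for a volume file on disk: bzip2-compressed JSON bytes.
payload = bz2.compress(json.dumps({"id": "demo.vol1"}).encode("utf-8"))

# Write the bytes into a zip archive held in memory.
archive = io.BytesIO()
with zipfile.ZipFile(archive, mode="w", compression=zipfile.ZIP_STORED) as zf:
    # ZIP_STORED copies the bytes verbatim; the ".bz2" in the member
    # name only records how the payload was compressed beforehand.
    zf.writestr("demo.vol1.json.bz2", payload)

# Read the member back: it is byte-for-byte what went in.
with zipfile.ZipFile(archive) as zf:
    stored = zf.read("demo.vol1.json.bz2")

record = json.loads(bz2.decompress(stored))
print(stored == payload, record["id"])
```

Storing without a second compression pass makes sense here because bzip2 output is already compact, so deflating it again would cost time and save almost nothing.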

In [18]:
sample_dir = "../data/PZ-volumes/"

ids = set()

for filename in os.listdir(sample_dir):
    if filename.endswith(".bz2"):
        htid = htrc_features.utils.extract_htid(filename)
        ids.add(htid)  # Store for later use
        # Pass the still-compressed bytes straight through to the ziptree
        with open(os.path.join(sample_dir, filename), "rb") as raw_file_buffer:
            zipholder.put(raw_file_buffer, htid, format = "json", compression = "bz2")
        

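This new tmpdir is now filled with zip files. There are 4096 possible names (16^3), built from the first three hex characters of the SHA-1 hashes of the filenames; only the names that received at least one volume show up in the listing.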

In [19]:
[z for z in os.listdir(zipdir) if z.endswith(".zip")]

['b75.zip',
 '96c.zip',
 'e14.zip',
 '553.zip',
 'e99.zip',
 'e6b.zip',
 '7d2.zip',
 'e6f.zip',
 '915.zip',
 'd33.zip',
 'a97.zip',
 'c5f.zip',
 '173.zip',
 '613.zip',
 '940.zip']
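
The naming scheme can be sketched as follows. The helper `zip_name` is hypothetical, and exactly which string the library hashes (the bare ID, or the full filename with its extensions) is an assumption here, so the name printed by this sketch will not necessarily match the real one for a given volume.

```python
import hashlib

def zip_name(filename: str) -> str:
    """Hypothetical helper: map a filename to a three-hex-character zip name."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    # Three hex characters give 16 ** 3 possible buckets.
    return digest[:3] + ".zip"

name = zip_name("hvd.hwrqs8.json.bz2")
print(name)
print(16 ** 3)  # number of possible three-hex-character prefixes
```

Because the hash is deterministic, the same filename always maps to the same zip file, so a lookup only needs to open one archive.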

## `get` calls return buffers

If we use the ZiptreeResolver's get method directly, we see it returns a BZ2File.

In [20]:
gotten_file = zipholder.get('njp.32101068970662')
gotten_file

<bz2.BZ2File at 0x111be7f90>

This is a buffered IO object. 

In [21]:
import io
isinstance(gotten_file, io.BufferedIOBase)

True
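
To show what a buffered IO object makes possible, here is a small standard-library sketch that builds a `BZ2File` over in-memory bytes and parses it as JSON. The payload is a made-up stand-in, not an Extracted Features file.

```python
import bz2
import io
import json

# Made-up stand-in for a compressed volume file.
raw = bz2.compress(json.dumps({"metadata": {"title": "demo"}}).encode("utf-8"))

# BZ2File accepts an open binary file object as well as a filename.
buffer = bz2.BZ2File(io.BytesIO(raw))
is_buffered = isinstance(buffer, io.BufferedIOBase)

# read() returns the decompressed bytes, which json.loads accepts directly.
data = json.loads(buffer.read())
print(is_buffered, data["metadata"]["title"])
```

Anything that expects a readable binary file can consume such an object without knowing whether it came from disk, a zip archive, or the web.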

In [22]:
import bz2

fin = resolvers.ZiptreeResolver(zipdir, "json", "bz2").get(id = "hvd.hwrevu")

print(json.loads(fin.read())['metadata'])

{'schemaVersion': '1.3', 'dateCreated': '2016-06-18T20:43:39.6072562Z', 'volumeIdentifier': 'hvd.hwrevu', 'accessProfile': 'google', 'rightsAttributes': 'pd', 'hathitrustRecordNumber': '966012', 'enumerationChronology': ' ', 'sourceInstitution': 'HVD', 'sourceInstitutionRecordNumber': '001437508', 'oclc': ['2522081'], 'isbn': [], 'issn': [], 'lccn': ['17015285'], 'title': 'The lady with the dog, and other stories,', 'imprint': 'The Macmillan company, 1917.', 'lastUpdateDate': '2014-10-26 03:25:06', 'governmentDocument': False, 'pubDate': '1917', 'pubPlace': 'nyu', 'language': 'eng', 'bibliographicFormat': 'BK', 'genre': ['not fiction'], 'issuance': 'monographic', 'typeOfResource': 'text', 'classification': {}, 'names': ['Chekhov, Anton Pavlovich 1860-1904 ', 'Garnett, Constance Black 1862-1946 '], 'htBibUrl': 'http://catalog.hathitrust.org/api/volumes/full/htid/hvd.hwrevu.json', 'handleUrl': 'http://hdl.handle.net/2027/hvd.hwrevu'}


In [23]:
Volume("hvd.hwrevu", format = "json", compression = "bz2", id_resolver = "ziptree", dir = zipdir)

ziptree
