HTRC-Features
=============

Tools for working with HTRC Feature Extraction files


## Installation

To install,

    git clone https://github.com/organisciak/htrc-feature-reader.git
    cd htrc-feature-reader
    python setup.py install

That's it! This library is written for Python 2.7 and 3.0+.

Two optional modules improve the HTRC-Feature-Reader: `pysolr` allows fetching of metadata, and `ujson` speeds up loading by about 0.4s per file. To install:

    pip install pysolr ujson

## Usage

### Reading feature files

The easiest way to start using this library is to use the `FeatureReader`, which takes a list of paths.

In [None]:
import glob
from htrc_features import FeatureReader
paths = glob.glob('data/PZ-volumes/*basic.json.bz2')
# Here we're loading five paths, for brevity
feature_reader = FeatureReader(paths[:5])
for vol in feature_reader.volumes():
    print("%s - %s" % (vol.id, vol.title))

Iterating on `FeatureReader.volumes()` returns `Volume` objects.
Wherever possible, this library tries not to hold things in memory, so most of the time you want to iterate rather than casting to a list.
Besides the memory cost, building a list is slow, because each volume has to be read from a file and initialized.
_Woe to whoever tries `list(FeatureReader.volumes())`_.

The method for creating a path list with 'glob' is just one way to do so.
For large sets, it's better to just have a text file of your paths, and read it line by line.
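
One way to follow that advice is sketched below. The helper name `iter_paths` and the file name `paths.txt` are hypothetical, not part of the library; the sketch only assumes a text file with one path per line.

```python
def iter_paths(list_file):
    """Yield one feature-file path per non-empty line of a text file."""
    with open(list_file) as handle:
        for line in handle:
            path = line.strip()
            # Skip blank lines so stray newlines don't become empty paths
            if path:
                yield path
```

The resulting paths could then be handed to the reader, for example `FeatureReader(list(iter_paths('paths.txt')))`, since `FeatureReader` is described above as taking a list of paths.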

The feature reader also has a useful method, `multiprocessing(map_func)`, for chunking and running functions across multiple processes.
This is an advanced feature, but extremely helpful for any large-scale processing.

### Volume

A volume contains information about the current work and access to the pages of the work.

All the metadata fields from the HTRC JSON file are accessible as properties of the volume object, including _title_, _language_, _imprint_, _oclc_, _pubDate_, and _genre_. The main identifier _id_ and _pageCount_ are also accessible.

In [None]:
"Volume %s has %s pages in %s" % (vol.id, vol.pageCount, vol.language)

As a convenience, Volume.year returns Volume.pubDate:

In [None]:
"%s == %s" % (vol.pubDate, vol.year)

Like the feature reader, a volume doubles as a generator for its pages, and again it is preferable for speed and memory to iterate over the pages rather than read them into a list.

In [None]:
# Let's skip ahead some pages
i = 0
for page in vol:
    i += 1
    if i >= 16:
        break
        
print(page)

This is just a pleasant way to access `vol.pages()`.
If you want to pass arguments to page initialization, such as changing the page's default section from 'body' to 'fullpage', use `for page in vol.pages(default_section='fullpage')`.

Finally, if the minimal metadata included with the extracted feature files is insufficient, you can fetch the HTRC's metadata record with `vol.metadata`.
Remember that this calls the HTRC servers for each volume, so it can add considerable overhead.

In [None]:
fr = FeatureReader(paths[0:5])
for vol in fr.volumes():
    print(vol.metadata['published'][0])

In [None]:
print("METADATA FIELDS: " + ", ".join(vol.metadata.keys()))

_At large scales, using `vol.metadata` is an impolite and inefficient amount of server pinging; there are better ways to query the API than one-by-one._

## Pages and Sections

A page contains the meat of the HTRC's extracted features.
Since the HTRC provides information by header/body/footer, these are accessed as separate 'sections' with `Page.header`, `Page.body`, and `Page.footer`.


In [None]:
print("The body has %s lines and %s sentences" % (page.body.lineCount, page.body.sentenceCount))

There is also `Page.fullpage`, which is a section combining the header, footer, and body.
Remember that these need to be added together, which isn't done until the first time `fullpage` is accessed, and in large-scale processing those milliseconds can add up.

In [None]:
fullpage = page.fullpage
combined_token_count = page.body.tokenCount + page.header.tokenCount + page.footer.tokenCount
# check that full page is adding properly
assert(fullpage.tokenCount == combined_token_count)
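
To illustrate why the first access to `fullpage` costs more than later ones, here is a minimal sketch of lazy combination. `SectionSketch` and `PageSketch` are hypothetical stand-ins that only track a token count; this is not the library's implementation, just the general pattern the paragraph above describes.

```python
class SectionSketch(object):
    """Stand-in for a section; only tracks a token count."""
    def __init__(self, token_count):
        self.tokenCount = token_count


class PageSketch(object):
    """Stand-in for a page that builds its combined section lazily."""
    def __init__(self, header, body, footer):
        self.header, self.body, self.footer = header, body, footer
        self._fullpage = None  # nothing is combined yet

    @property
    def fullpage(self):
        # Combine the three sections only on first access, then reuse the result
        if self._fullpage is None:
            total = (self.header.tokenCount + self.body.tokenCount
                     + self.footer.tokenCount)
            self._fullpage = SectionSketch(total)
        return self._fullpage


page_sketch = PageSketch(SectionSketch(3), SectionSketch(120), SectionSketch(2))
first = page_sketch.fullpage   # the combining work happens here
second = page_sketch.fullpage  # the cached section is returned
```

If you never touch `fullpage`, the combining step never runs, which is the saving that matters in large-scale processing.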

For the most part, the properties of the page and section are identical to the HTRC Extracted Features schema, rather than following Python naming conventions (e.g. CamelCase when convention would expect underscore_separation).

A page has a default section, where some features -- such as accessing a token list -- can be accessed without specifying the section each time. For example, with the default_section set to 'body', as it is by default, `Page.body.tokenlist` can be accessed with `Page.tokenlist`.

## The fun stuff: playing with token counts and character counts

Token lists are contained in Section.tokenlist.

In [None]:
tl = page.body.tokenlist

A `tokenlist` returns a [Pandas](http://pandas.pydata.org/) DataFrame through `tokenlist.token_counts()`, and provides syntactic access to the vocabulary (`tokenlist.tokens`) and a total token count (`tokenlist.count`).

In [None]:
df = tl.token_counts()
print(df.sort_values(by='count', ascending=False)[:5])

These can be manipulated in various ways. You can case-fold, for example:

In [None]:
df = tl.token_counts(case=False)
print(df.sort_values(by='count', ascending=False)[:5])

Or you can sum the counts across parts of speech, so that each token has a single count.

In [None]:
df = tl.token_counts(pos=False)
print(df.sort_values(by='count', ascending=False)[:3])
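
To make the effect of those two options concrete, here is a plain-Python sketch of the same kind of re-aggregation. The `fold` function and the sample counts are hypothetical; only the keyword names `case` and `pos` mirror the calls above, and the library's own output is a DataFrame rather than a `Counter`.

```python
from collections import Counter

# Stand-in counts keyed by (token, part-of-speech tag); not real HTRC data
counts = {("The", "DT"): 2, ("the", "DT"): 5, ("run", "VB"): 1, ("run", "NN"): 3}


def fold(counts, case=True, pos=True):
    """Re-aggregate counts, optionally lowercasing tokens and dropping POS tags."""
    folded = Counter()
    for (token, tag), n in counts.items():
        key_token = token if case else token.lower()
        key = (key_token, tag) if pos else key_token
        folded[key] += n
    return folded


print(fold(counts, case=False))  # "The" and "the" merge under the same tag
print(fold(counts, pos=False))   # tags are dropped, so "run" becomes one entry
```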

To get just the unique tokens, use `TokenList.tokens`, which is just an easy way to get `TokenList.token_counts().keys()`.

In [None]:
tl.tokens[:10]

In addition to token lists, you can also access `Section.beginLineChars` and `Section.endLineChars`, which are dictionaries of character counts that occur at the start or end of the line.
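
As an example of what such dictionaries allow, the sketch below totals end-of-line characters across pages and measures how often a line ends in a hyphen. The per-page dictionaries are made-up stand-ins shaped like the description above, not output from the library.

```python
from collections import Counter

# Stand-in per-page dictionaries mapping a character to its count (not real data)
page_end_chars = [{".": 12, ",": 4, "-": 3}, {".": 9, "-": 7, "!": 1}]

# Add the per-page counts together
total = Counter()
for chars in page_end_chars:
    total.update(chars)

# Share of counted line endings that are hyphens, a rough hint of words split across lines
hyphen_share = total["-"] / float(sum(total.values()))
```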

### Volume stats collecting

The Volume object has a number of methods for collecting information from all its pages.

In [None]:
tokens = vol.tokens_per_page()
# Show first 15 pages
tokens[:15]

In [None]:
a = vol.term_page_freqs()

In [None]:
print(a.iloc[1:3, 4:14])

In [None]:
print(vol.term_volume_freqs()[:4])

`Volume.term_page_freqs()` provides a wide DataFrame resembling a matrix, where terms are listed as columns, pages are listed as rows, and the values correspond to the term frequency (or the page frequency with `page_freq=True`).
Volume.term_volume_freqs() simply sums these.
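
The relationship between the two can be sketched without the library. The example below uses made-up per-page counts and plain lists in place of a DataFrame, so it only illustrates the page-by-term layout and the column sums, not the actual return types.

```python
from collections import Counter

# Stand-in term counts for three pages (one Counter per page; not real data)
pages = [Counter(the=4, whale=1), Counter(the=2, sea=3), Counter(whale=2)]

# Wide "matrix": one row per page, one column per term, 0 where a term is absent
terms = sorted(set().union(*pages))
matrix = [[page[term] for term in terms] for page in pages]

# Column sums give the volume-level frequency of each term
volume_freqs = {term: sum(row[i] for row in matrix)
                for i, term in enumerate(terms)}
```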
 
### Multiprocessing

For faster processing, you can write a mapping function for acting on volumes, then pass it to `FeatureReader.multiprocessing`.
This sends out the function to a different process per volume, spawning (CPU_CORES-1) processes at a time.
The map function receives the feature_reader and a volume path, and needs to initialize the volume.

Here's a simple example that returns the term counts for each volume (take note of the first two lines of the function):

```python
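def printTermCounts(args):
    # The map function receives a (feature_reader, path) tuple
    fr, path = args
    vol = fr.create_volume(path)

    metadata = (vol.id, vol.year)
    # Assumption: volume-level term counts come from term_volume_freqs()
    results = vol.term_volume_freqs()
    return (metadata, results)

results = feature_reader.multiprocessing(printTermCounts)
for metadata, result in results:
    print("Results from %s (%s)" % metadata)
    # Assumes each result is dict-like (term -> count)
    for term, count in result.items():
        print("%s: %d" % (term, count))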
```

Some rules: results must be serializable, and the map function must be accessible from `__main__` (basically, no dynamic functions: they should be written plainly in your script).
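
These rules match how Python's standard `multiprocessing` module moves objects between processes, which is with `pickle`. Assuming the feature reader relies on that mechanism (an assumption, not something confirmed here), a quick check like the hypothetical `is_picklable` helper below can show whether a function or result would survive the trip.

```python
import pickle


def is_picklable(obj):
    """Return True if obj can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_local_func():
    # A function defined inside another function is a "dynamic" function
    def local_func(args):
        return args
    return local_func


print(is_picklable({"the": 4}))          # plain data, as a result might be
print(is_picklable(lambda args: args))   # lambda defined on the fly
print(is_picklable(make_local_func()))   # nested function
```

Functions defined at the top level of a script are pickled by name, which is why they need to be reachable from `__main__`.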

The results are collected and returned together, so you don't want a single feature reader over all 250k files: depending on how big each result is, the combined results may not fit in memory.
Instead, it is easier to initialize feature readers for smaller batches.
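
A minimal sketch of that batching, using a hypothetical helper named `batches` and made-up path strings:

```python
def batches(paths, size):
    """Yield successive lists of at most `size` paths."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


# Stand-in paths; in practice these would be your feature-file paths
example_paths = ["vol%d.basic.json.bz2" % n for n in range(7)]
for batch in batches(example_paths, 3):
    print(len(batch), batch[0])
```

Each batch could then get its own reader, for example `FeatureReader(batch)`, with the results written out or reduced before moving on to the next batch.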


## Advanced Files

In the beta Extracted Features release, schema 2.0, a few features were separated out into an advanced file. If you try to access those features, like `endLineChars`, you'll get an error:

In [None]:
end_line_chars = vol.end_line_chars()

It is possible to load the advanced file alongside the basic files by passing in a `(basic, advanced)` tuple of filepaths where you would normally pass in a single path. For example,

In [None]:
newpaths = [(x,x.replace('basic', 'advanced')) for x in paths]
newpaths[:2]

In [None]:
fr = FeatureReader(newpaths)
vol = next(fr.volumes())
end_line_chars = vol.end_line_chars()
print(end_line_chars['!'][:15])

Note that the advanced files are not fully supported, because the basic/advanced split will not continue for future releases.

Loading and parsing the advanced feature files adds non-negligible time (about `1.3` seconds on my computer), so only load them if you need them.