# Volume Parsing and Parquet Features

In [1]:
from htrc_features import FeatureReader, Volume

The Volume object used to handle JSON parsing and feature logic, while the FeatureReader handled reading and decompression.

This was recently updated to disentangle reading and parsing dataset files from working with them. Volume now outsources reading and parsing to a parser (by default `jsonVolumeParser`), which allows alternative versions of the Extracted Features Dataset to be stored.
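The separation described above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: `SketchJsonParser` and `SketchVolume` are hypothetical names, and the `metadata` / `features.pages` layout is an assumption made for the demo file.

```python
import bz2
import json
import os
import tempfile


class SketchJsonParser:
    """Hypothetical parser: owns all file reading and decompression."""

    def __init__(self, path):
        opener = bz2.open if path.endswith(".bz2") else open
        with opener(path, "rt", encoding="utf-8") as f:
            raw = json.load(f)
        # Assumed layout for this sketch only.
        self.meta = raw.get("metadata", {})
        self.pages = raw.get("features", {}).get("pages", [])


class SketchVolume:
    """Hypothetical volume: no file-format logic, only feature logic."""

    def __init__(self, path, parser=SketchJsonParser):
        # Any class exposing `meta` and `pages` could be swapped in here.
        self.parser = parser(path)

    @property
    def title(self):
        return self.parser.meta.get("title")

    @property
    def page_count(self):
        return len(self.parser.pages)


# Write a tiny demo file, then load it through the volume.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json.bz2")
    with bz2.open(demo_path, "wt", encoding="utf-8") as f:
        json.dump({"metadata": {"title": "Demo"},
                   "features": {"pages": [{}, {}]}}, f)
    demo = SketchVolume(demo_path)
    demo_title, demo_pages = demo.title, demo.page_count
```

Because the volume only talks to the parser's attributes, supporting a new storage format means writing a new parser class rather than touching the volume.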

### 1. Volumes can now load files directly

In [2]:
Volume('../data/PZ-volumes/hvd.hwquxe.json.bz2')

Iteration through the FeatureReader is still possible:

In [4]:
import glob
paths = glob.glob('../data/PZ-volumes/*')

fr = FeatureReader(paths[:5])
for vol in fr.volumes():
    print(vol)

<Volume: The ballet dancer, and On guar... (1901) by Serao, Matilde.>
<Volume: The man from Glengarry : a tal... (1901) by Connor, Ralph 1860-1937>
<Volume: The lady with the dog, and oth... (1917) by Chekhov, Anton Pavlovich 1860-1904>
<Volume: Mr. Rutherford's children. By... (1855) by Warner, Susan 1819-1885>
<Volume: Russian short stories, ed. for... (1919) by Schweikert, Harry Christian 1877- ed.>


### 2. Volumes hold non-JSON internal representations

The Volume is now made up of four DataFrames: token counts, line character counts, section-level features (i.e. the page-level features that are provided for header/body/footer), and page-level features.

In [6]:
vol.tokenlist().head(2)

page,section,token,pos,count
2,body,"""",``,1
2,body,.,.,1


In [11]:
vol.line_chars().head(2)

page,section,place,char,count
2,body,begin,F,1
2,body,begin,a,1


In [8]:
vol.section_features(section='all').head(2)

page,section,capAlphaSeq,emptyLineCount,lineCount,sentenceCount,tokenCount
1,header,0,0,0,0,0
1,body,0,0,0,0,0


Metadata is imported from the parser as a Volume property:

In [15]:
vol.parser.meta

{'id': 'mdp.39015028036104',
 'schema_version': '1.3',
 'date_created': '2016-06-19T18:28:16.1649565Z',
 'title': 'Russian short stories, ed. for school use,',
 'pub_date': '1919',
 'language': 'eng',
 'ht_bib_url': 'http://catalog.hathitrust.org/api/volumes/full/htid/mdp.39015028036104.json',
 'handle_url': 'http://hdl.handle.net/2027/mdp.39015028036104',
 'oclc': ['1456817'],
 'imprint': 'Scott, Foresman and company [c1919]',
 'names': ['Schweikert, Harry Christian 1877- ed. '],
 'classification': {'lcc': ['PZ1.S413 Ru']},
 'type_of_resource': 'text',
 'issuance': 'monographic',
 'genre': ['not fiction'],
 'bibliographic_format': 'BK',
 'pub_place': 'ilu',
 'government_document': False,
 'source_institution': 'MIU',
 'enumeration_chronology': ' ',
 'hathitrust_record_number': '1059466',
 'rights_attributes': 'pd',
 'access_profile': 'google',
 'volume_identifier': 'mdp.39015028036104',
 'source_institution_record_number': '001059466',
 'isbn': [],
 'issn': [],
 'lccn': ['19006802'],


In [16]:
vol.page_count, vol.issn

(460, [])

### 3. Alternative data parsers are supported

The bzipped JSON files may not meet all use cases. Developers can now extend basicVolumeParser with their own parsers, which are passed to FeatureReader or a Volume with the `parser=...` argument. This will also help adapt to future changes in the HTRC's Extracted Features file format.

There are two volume parsers included: `jsonVolumeParser` (default), and `parquetVolumeParser`.

### 4. A feature file can hold incomplete data

The feature reader is now more robust when loading data that may be missing parts of speech, may be lowercased, or may lack page sections. This can be useful for saving more succinct versions of texts.

`Volume.tokenlist()` also now has a `drop_section` argument, which drops the 'section' index level. This is a common use case, because most users only keep the 'body' section.
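To make the effect of dropping the section level concrete, here is a small plain-Python sketch. The library itself works on pandas DataFrames with a multi-level index; the dictionary layout and the `keep_section_and_drop` helper below are hypothetical stand-ins, not the library's code.

```python
from collections import Counter

# Hypothetical token counts keyed by (page, section, token, pos).
rows = {
    (2, "header", "CHAPTER", "NN"): 1,
    (2, "body", "the", "DT"): 4,
    (2, "body", "dog", "NN"): 2,
    (3, "body", "the", "DT"): 3,
}


def keep_section_and_drop(rows, section="body"):
    """Keep rows from one section, then remove 'section' from the key."""
    out = Counter()
    for (page, sec, token, pos), count in rows.items():
        if sec == section:
            out[(page, token, pos)] += count
    return dict(out)


body_only = keep_section_and_drop(rows)
```

The result is keyed by (page, token, pos) only, which is the shape most downstream code wants when the header and footer are not of interest.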

### 5. Support for Parquet-based dataset files

The Parquet parser currently enforces a filename convention: each volume is stored as several files that share a stem, and you pass that extensionless file path. Here's what the files look like:

In [20]:
glob.glob('../data/parquet/mdp.39015028036104*')

['../data/parquet/mdp.39015028036104.meta.json',
 '../data/parquet/mdp.39015028036104.tokens.parquet',
 '../data/parquet/mdp.39015028036104.section.parquet',
 '../data/parquet/mdp.39015028036104.chars.parquet']

You don't need all four - perhaps you just want to load token counts and metadata, or even just metadata. The files are lazy-loaded, so if you have all four files but only want to access the metadata, you don't need to hide the other files - just don't request information from them!
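Lazy loading of this kind can be sketched with the standard library alone. The class below is a hypothetical illustration, not the parquet parser's code: it only mimics the `<stem>.meta.json` / `<stem>.tokens.parquet` naming shown above and reads raw bytes where a real parser would decode Parquet.

```python
import json
import os
import tempfile
from functools import cached_property


class LazyVolumeFiles:
    """Hypothetical reader: each file is opened only on first access."""

    def __init__(self, stem):
        self.stem = stem   # extensionless path shared by all files
        self.reads = []    # records which files were actually opened

    def _read(self, suffix):
        self.reads.append(suffix)
        with open(self.stem + suffix, "rb") as f:
            return f.read()

    @cached_property
    def meta(self):
        return json.loads(self._read(".meta.json"))

    @cached_property
    def tokens(self):
        # Stand-in: a real parser would decode the Parquet file here.
        return self._read(".tokens.parquet")


with tempfile.TemporaryDirectory() as tmp:
    stem = os.path.join(tmp, "demo.vol")
    with open(stem + ".meta.json", "w", encoding="utf-8") as f:
        json.dump({"title": "Demo"}, f)
    # No tokens file exists, but it is never requested, so nothing fails.
    files = LazyVolumeFiles(stem)
    title = files.meta["title"]
    files.meta  # second access is served from the cache
    opened = list(files.reads)
```

Only the metadata file is ever opened, and only once, which is why a missing or unwanted file causes no trouble until something asks for it.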

Loading is done like this:

In [22]:
pvol = Volume('../data/parquet/mdp.39015028036104', parser='parquet')
pvol

`parser=` can also take a parser class directly.

In [24]:
from htrc_features import parquetVolumeParser
Volume('../data/parquet/mdp.39015028036104', parser=parquetVolumeParser)

There is now a `Volume.save_parquet` method for saving to the parquet format.

In [18]:
?Volume.save_parquet

Signature:
Volume.save_parquet(
    self,
    path,
    meta=True,
    tokens=True,
    chars=False,
    section_features=False,
    compression='snappy',
    token_kwargs={'section': 'all', 'drop_section': False},
)
Docstring:
Save the internal representations of feature data to parquet, and the metadata to json,
using the naming 

By default, only the tokens and metadata are saved. You can also save a partial tokenlist if you like.

### 6. The Page was simplified

The Page object is now a thin wrapper: it reaches up to the associated Volume for all of its functionality, and all the page-level Volume methods have a `page_select` argument for selecting only a single page.
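The delegation described above can be sketched as follows. `TinyVolume` and `TinyPage` are hypothetical classes written for illustration; only the `page_select` argument name comes from the text, and the data layout is an assumption.

```python
class TinyVolume:
    """Hypothetical volume that owns all data and all logic."""

    def __init__(self, counts):
        self._counts = counts  # {page number: {token: count}}

    def token_count(self, page_select=None):
        # Without page_select, aggregate over every page.
        if page_select is None:
            pages = self._counts.values()
        else:
            pages = [self._counts[page_select]]
        return sum(sum(tokens.values()) for tokens in pages)

    def page(self, seq):
        return TinyPage(self, seq)


class TinyPage:
    """Hypothetical page: stores no data, only a reference and a number."""

    def __init__(self, volume, seq):
        self.volume = volume
        self.seq = seq

    def token_count(self):
        # Reach up to the volume, restricting the call to this page.
        return self.volume.token_count(page_select=self.seq)


tiny = TinyVolume({1: {"the": 2, "dog": 1}, 2: {"a": 4}})
whole = tiny.token_count()
single = tiny.page(2).token_count()
```

Keeping the logic in one place means a page and its volume cannot disagree about how a feature is computed.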