# HTRC Feature Reader 2.0 Changes

In [270]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [307]:
from htrc_features import FeatureReader, Volume

The Volume object used to handle JSON parsing and feature logic, while the FeatureReader handled reading and decompression.

This was recently updated to separate the reading and parsing of dataset files from working with them. Volume now outsources to a set of parser functions - by default the `jsonVolumeParser` - allowing alternative versions of the Extracted Features Dataset to be stored.

## Support for Extracted Features 2.0

The new release of the HTRC Extracted Features Dataset changes the JSON format slightly, to use JSON-LD and be mostly Schema.org compatible.

In [440]:
vol = Volume(path='../data/ef2-stubby/mdp/31181/mdp.39015014589116.json.bz2')
print("Successfully loaded {} ({}) from the new EF2.0 "
      "features schema {}".format(vol.id, vol.title, vol.parser._schema))

Successfully loaded mdp.39015014589116 (Labor unions and autocracy in Iran /) from the new EF2.0 features schema https://schemas.hathitrust.org/EF_Schema_FeaturesSubSchema_v_3.0


All the features, such as the token counts in `vol.tokenlist()`, are available. Note that the HathiTrust Research Center improved the tokenization and part-of-speech tagging, so the results may vary slightly from prior versions.

In [442]:
token_counts = vol.tokenlist()
print("cross-section of {}: Page 40".format(vol.title))
token_counts.xs(40, level='page')

cross-section of Labor unions and autocracy in Iran /: Page 40


| section | token | pos | count |
|---|---|---|---|
| body | `'` | `''` | 2 |
| body | `''` | `''` | 5 |
| body | 's | POS | 3 |
| body | , | , | 35 |
| body | -- | : | 4 |
| ... | ... | ... | ... |
| body | with | IN | 6 |
| body | workers | NNS | 3 |
| body | would | MD | 1 |
| body | year | NN | 1 |


One change is that `sentenceCount` is not provided when there is no language model or when sentences cannot be parsed from the page; the value is then shown as `NaN`.

In [443]:
vol.section_features()

| page | tokenCount | lineCount | emptyLineCount | capAlphaSeq | sentenceCount |
|---|---|---|---|---|---|
| 2 | 55 | 29 | 6 | 1 | NaN |
| 5 | 12 | 2 | 0 | 1 | 1.0 |
| 6 | 238 | 52 | 10 | 3 | NaN |
| 7 | 14 | 6 | 0 | 3 | 1.0 |
| 8 | 144 | 21 | 0 | 4 | 17.0 |
| ... | ... | ... | ... | ... | ... |
| 350 | 793 | 96 | 0 | 17 | 4.0 |
| 351 | 880 | 95 | 0 | 9 | 5.0 |
| 352 | 228 | 32 | 0 | 19 | 4.0 |
| 353 | 93 | 10 | 0 | 3 | 1.0 |


Note that the algorithmically inferred language on each page is now under `calculatedLanguage`, for clarity, and offers the top language rather than the full list of language probabilities.

In [444]:
vol.page_features('calculatedLanguage')

page
1      None
2        gl
3      None
4      None
5        en
       ... 
356    None
357    None
358    None
359      in
360    None
Name: calculatedLanguage, Length: 360, dtype: object

## EF2.0 Metadata

Most of the Metadata is schema.org compatible, and is mapped to attributes of Volume.
![Screenshot of auto-fill attributes shown for a volume, showing many metadata fields](images/metadata.png)

Generally, field names match the schema's key, converted from camelCase to snake_case, which is the preferred convention for Python attributes. One exception is that `schemaVersion` is renamed to `metadata_schema_version`, for clarity next to `features_schema_version`. Fields use their name from the EF2.0 schema, rather than trying to map onto the field names of past versions.
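The naming rule can be illustrated with a short sketch. The helper `camel_to_snake` is hypothetical and written only for this example, and the camelCase keys are assumed from the snake_case names shown in this notebook. The library's own conversion may differ in its details, and special cases such as `schemaVersion` are handled separately.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase key to snake_case (illustrative, not the library's code)."""
    # Insert an underscore before every capital letter that is not at the
    # start of the string, then lowercase the result.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# Keys assumed for illustration
for key in ['typeOfResource', 'enumerationChronology', 'pubDate', 'title']:
    print(key, '->', camel_to_snake(key))
```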

All the metadata can be returned from the volume's parser, like so:

In [445]:
vol.parser.meta

{'id': 'mdp.39015014589116',
 'metadata_schema_version': 'https://schemas.hathitrust.org/EF_Schema_MetadataSubSchema_v_3.0',
 'enumeration_chronology': None,
 'type_of_resource': 'http://id.loc.gov/ontologies/bibframe/Text',
 'title': 'Labor unions and autocracy in Iran /',
 'date_created': 20200209,
 'pub_date': 1985,
 'language': 'eng',
 'access_profile': 'google',
 'isbn': '0815623437',
 'issn': None,
 'lccn': '85017300',
 'oclc': '12420719',
 'page_count': 360,
 'feature_schema_version': 'https://schemas.hathitrust.org/EF_Schema_FeaturesSubSchema_v_3.0',
 'access_rights': 'ic',
 'alternate_title': None,
 'category': 'Industries. Land use. Labor',
 'genre_ld': ['http://id.loc.gov/vocabulary/marcgt/doc',
  'http://id.loc.gov/vocabulary/marcgt/bib'],
 'genre': ['document (computer)', 'bibliography'],
 'contributor_ld': {'id': 'http://www.viaf.org/viaf/24638157',
  'type': 'http://id.loc.gov/ontologies/bibframe/Person',
  'name': 'Ladjevardi, Habib.'},
 'contributor': 'Ladjevardi, Habi

You can see in the above that when a value for a field is a higher level datatype, like a Schema.org Organization or VIAF Person, the library renames that field to `field_ld` and extracts the name of the entity into `field`. Let's look more closely at publisher:

In [446]:
vol.publisher

'Syracuse University Press'

In [447]:
vol.publisher_ld

{'id': 'http://catalogdata.library.illinois.edu/lod/entities/ProvisionActivityAgent/ht/Syracuse%20University%20Press',
 'type': 'http://id.loc.gov/ontologies/bibframe/Organization',
 'name': 'Syracuse University Press'}

The real linked data entity is the latter, while the plain text version that some users may look for is the former.
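As a rough sketch of that renaming, the hypothetical function below splits entity-valued fields into a plain `field` and a linked-data `field_ld`. It assumes an entity is a dict with a `name` key, and it does not cover list-valued fields such as `genre`. It is not the library's implementation.

```python
def split_linked_fields(raw_meta):
    """Split entity-valued fields into `field` (name) and `field_ld` (entity).

    Illustrative only: assumes an entity is a dict that has a 'name' key.
    """
    meta = {}
    for field, value in raw_meta.items():
        if isinstance(value, dict) and 'name' in value:
            meta[field + '_ld'] = value   # keep the full linked-data entity
            meta[field] = value['name']   # expose the plain-text name
        else:
            meta[field] = value           # simple values pass through unchanged
    return meta

raw = {
    'title': 'Labor unions and autocracy in Iran /',
    'publisher': {
        'id': 'http://catalogdata.library.illinois.edu/lod/entities/ProvisionActivityAgent/ht/Syracuse%20University%20Press',
        'type': 'http://id.loc.gov/ontologies/bibframe/Organization',
        'name': 'Syracuse University Press',
    },
}
meta = split_linked_fields(raw)
print(meta['publisher'])
print(meta['publisher_ld']['type'])
```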

Note that the Feature Reader still maintains a few attributes remapping common metadata to friendlier names - `Volume.year` returns the metadata in `pub_date` and `Volume.author` returns the metadata for `contributor`.

In [448]:
vol.author, vol.year

(['Ladjevardi, Habib.'], 1985)

Two notable new metadata fields in Extracted Features 2.0 are `category` and `genre`. Genre provides more information about the type of volume, and can help in focusing on only certain holdings of the HathiTrust Digital Library. For example, you may want to exclude government works by looking for `government publication`. `vol.genre` provides the plain text description of the genre(s) as a list (as per the [schema](http://id.loc.gov/vocabulary/marcgt.html)), while `vol.genre_ld` provides the persistent identifier for the authority record.

In [449]:
vol.genre, vol.genre_ld

(['document (computer)', 'bibliography'],
 ['http://id.loc.gov/vocabulary/marcgt/doc',
  'http://id.loc.gov/vocabulary/marcgt/bib'])
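A genre filter could look something like the sketch below. It works on plain lists so that it runs without any data files. With the library, the genre list would come from `vol.genre`. The helper name and the two `example.*` IDs are made up for illustration, and the sketch assumes the genre is either a list of strings or missing.

```python
def is_government_publication(genres):
    """Return True if the plain-text genre list includes 'government publication'."""
    return 'government publication' in (genres or [])

# Stand-in records: (volume ID, genre list)
records = [
    ('mdp.39015014589116', ['document (computer)', 'bibliography']),
    ('example.gov-doc', ['government publication']),  # made-up ID
    ('example.no-genre', None),                       # made-up ID, genre missing
]

# Keep only the volumes that are not tagged as government publications
kept = [htid for htid, genres in records if not is_government_publication(genres)]
print(kept)
```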

`category` provides the name of the Library of Congress subclass of the book, when a classification number is present. For example:

In [452]:
vol.category, vol.lcc

('Industries. Land use. Labor', 'HD6805.2.L33 1985')

### Backwards compatibility

The Feature Reader still supports older EF files.

In [278]:
vol = Volume('../data/ef1.5-examples/hvd.hwquxe.json.bz2')
print(vol.metadata_schema_version)
vol

1.3


## Support for ID-based retrieval (including the new Stubbytree format)

The new HTRC Feature Reader can also retrieve Extracted Features files by ID, as long as the library knows where to look for them: are your files in a single directory, in a zip file, or in a deeper file structure?

### Stubbytree (new)

The new Extracted Features 2.0 release is contained in a file structure called Stubbytree, an adjustment on the previous Pairtree format which is much gentler on your operating system. The new Feature Reader supports this structure.

Consider the volume `hvd.32044093320364`. EF files are contained in `../data/ef2-stubby/`, and within that structure the file itself is at `hvd/34926/hvd.32044093320364.json.bz2`. To load the file by ID, you tell the library the root directory, with `dir`, and the way of resolving the ID (i.e. `id_resolver="stubbytree"`):

In [366]:
vol = Volume('hvd.32044093320364', dir='../data/ef2-stubby/', id_resolver='stubbytree')
vol

The older `pairtree` format is also supported, with `id_resolver='pairtree'`.

### Local Directory

If all your files are in the same directory, you can use `id_resolver="local"`. This is also the default when `id_resolver` isn't provided.

In [301]:
Volume('hvd.hwquxe', dir='../data/ef1.5-examples')

## Updated Stubbytree utilities

There is a function for converting an HTID to a Stubbytree path in `utils`. Where an HTID is comprised of a libid and volid, the location is `libid/volid[::3]/` - that is, a subfolder named after every third character of volid.

In [373]:
from htrc_features import utils
utils.id_to_stubbytree('uc2.ark:/13960/t6c24s04', format='json', compression='bz2')

'uc2/a+30644/uc2.ark+=13960=t6c24s04.json.bz2'

This function by default is used in `utils.id_to_rsync`. You can specify a pairtree format if you need it, though.

In [375]:
utils.id_to_rsync('uc2.ark:/13960/t6c24s04')

'uc2/a+30644/uc2.ark+=13960=t6c24s04.json.bz2'

In [376]:
utils.id_to_rsync('uc2.ark:/13960/t6c24s04', format='pairtree')

'uc2/pairtree_root/ar/k+/=1/39/60/=t/6c/24/s0/4/ark+=13960=t6c24s04/uc2.ark+=13960=t6c24s04.json.bz2'
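The stubbytree rule itself is small enough to sketch without the library. The function below is a hypothetical re-implementation for illustration, not the code behind `utils.id_to_stubbytree`. It assumes the usual pairtree character cleaning (`:` to `+`, `/` to `=`, `.` to `,`) is applied to the volume ID before every third character is taken, which is consistent with the outputs shown above.

```python
def stubbytree_path(htid, fmt='json', compression='bz2'):
    """Sketch of the stubbytree rule: libid/volid[::3]/cleaned_id.fmt.compression."""
    # The library ID is everything before the first period
    libid, volid = htid.split('.', 1)
    # Assumed pairtree-style cleaning, so the ID is safe to use as a filename
    clean_volid = volid.translate(str.maketrans({':': '+', '/': '=', '.': ','}))
    # The subfolder is named after every third character of the cleaned volume ID
    subdir = clean_volid[::3]
    return f'{libid}/{subdir}/{libid}.{clean_volid}.{fmt}.{compression}'

print(stubbytree_path('uc2.ark:/13960/t6c24s04'))
print(stubbytree_path('hvd.32044093320364'))
```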

The command-line utility, `htid2rsync`, now uses stubbytree by default, and has an `--oldstyle` flag to use the pairtree. *(text preceded with a `!` in this documentation is run on the command line rather than python)*.

In [386]:
!head ../data/ef2-stubby/htids.txt | htid2rsync --from-file

coo/32086/coo.31924109784268.json.bz2
hvd/34926/hvd.32044093320364.json.bz2
ien/35375/ien.35556031376650.json.bz2
keio/1139/keio.10810734990.json.bz2
loc/a+30679/loc.ark+=13960=t6737fd9d.json.bz2
nyp/33759/nyp.33433070251792.json.bz2
nyp/33054/nyp.33433001051444.json.bz2
osu/33003/osu.32435005003835.json.bz2
osu/33022/osu.32435001924323.json.bz2
txu/01171/txu.059172143771152.json.bz2


In [385]:
!head ../data/ef2-stubby/htids.txt | htid2rsync --from-file --oldstyle

coo/pairtree_root/31/92/41/09/78/42/68/31924109784268/coo.31924109784268.json.bz2
hvd/pairtree_root/32/04/40/93/32/03/64/32044093320364/hvd.32044093320364.json.bz2
ien/pairtree_root/35/55/60/31/37/66/50/35556031376650/ien.35556031376650.json.bz2
keio/pairtree_root/10/81/07/34/99/0/10810734990/keio.10810734990.json.bz2
loc/pairtree_root/ar/k+/=1/39/60/=t/67/37/fd/9d/ark+=13960=t6737fd9d/loc.ark+=13960=t6737fd9d.json.bz2
nyp/pairtree_root/33/43/30/70/25/17/92/33433070251792/nyp.33433070251792.json.bz2
nyp/pairtree_root/33/43/30/01/05/14/44/33433001051444/nyp.33433001051444.json.bz2
osu/pairtree_root/32/43/50/05/00/38/35/32435005003835/osu.32435005003835.json.bz2
osu/pairtree_root/32/43/50/01/92/43/23/32435001924323/osu.32435001924323.json.bz2
txu/pairtree_root/05/91/72/14/37/71/15/2/059172143771152/txu.059172143771152.json.bz2


## Rewriting Pairtree to Stubbytree

[Benjamin Schmidt](https://benschmidt.org/) has developed a number of clever generalizations for Volume reading and *writing*. As a result, you can convert a file structure that is in Pairtree into a Stubbytree file structure as shown below. The directories are just for the example, of course - in your case you want a permanent stubbytree root.

```python
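from htrc_features import Volume

htid = 'osu.32435018220335'
# Read the volume from the existing pairtree layout
invol = Volume(htid, dir='/data/pairtree_root', id_resolver='pairtree')
# Open the same ID for writing in a stubbytree layout, then copy it over
outvol = Volume(htid, dir='/data/stubby_root', id_resolver='stubbytree', mode='wb')
outvol.write(invol)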
```

`invol` can be any Volume that is readable. (By the way, if you're already rewriting files, add `format='parquet'` and write the new ones in the faster alternative Parquet format!)

If you're looking for a large batch rewrite of the pairtree files which just moves files around without reading them, the Massive Texts Lab has a script to do so: https://github.com/massivetexts/compare-tools/blob/master/scripts/pairtree_to_stubbytree.py. This was written back when Stubbytree was only a lab-specific creation - now that HTRC Extracted Features uses Stubbytree, it may just be easier to download your files anew, in the EF2.0 format.
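To give a sense of what such a move-only rewrite involves, here is a minimal sketch. It is not the Massive Texts Lab script. The function name is hypothetical, and it assumes every feature file is named `<cleaned id>.json.bz2`, with the library ID before the first period. The demo at the end runs in a temporary directory with an empty placeholder file.

```python
import shutil
import tempfile
from pathlib import Path

def move_pairtree_to_stubbytree(pairtree_root, stubby_root, suffix='.json.bz2'):
    """Move feature files into a stubbytree layout without opening them."""
    moved = []
    # sorted() materializes the listing first, so moving files does not disturb the walk
    for src in sorted(Path(pairtree_root).rglob('*' + suffix)):
        clean_id = src.name[:-len(suffix)]     # e.g. 'hvd.32044093320364'
        libid, volid = clean_id.split('.', 1)  # filenames are already pairtree-cleaned
        dest = Path(stubby_root) / libid / volid[::3] / src.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved

# Demo with an empty placeholder file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    old_root, new_root = Path(tmp) / 'pairtree', Path(tmp) / 'stubby'
    src = (old_root / 'hvd/pairtree_root/32/04/40/93/32/03/64/32044093320364'
           / 'hvd.32044093320364.json.bz2')
    src.parent.mkdir(parents=True)
    src.touch()
    moved = move_pairtree_to_stubbytree(old_root, new_root)
    rel_paths = [p.relative_to(new_root).as_posix() for p in moved]

print(rel_paths)
```

The sketch leaves the emptied pairtree directories behind and does not check for existing files at the destination, so a real batch job would want both of those handled.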

## Volumes can now load files directly

As you've seen up to this point, the Feature Reader now supports initializing single files through `Volume()`, while still supporting the approach for reading a larger collection with `FeatureReader()`:

In [305]:
Volume('../data/ef2-stubby/loc/a+30679/loc.ark+=13960=t6737fd9d.json.bz2')

Iteration through the FeatureReader is still possible:

In [350]:
import glob
paths = glob.glob('../data/ef2-stubby/pst/**/*bz2', recursive=True)
fr = FeatureReader(paths)
for vol in fr.volumes():
    print(vol)

None
['../data/ef2-stubby/pst/0055/pst.000029571581.json.bz2', '../data/ef2-stubby/pst/0053/pst.000068517380.json.bz2']
<Volume: A vision : A reissue with the... (1965) by Y>
<Volume: Special funds : status of appr... (1990) by P>


## Volumes hold non-json internal representations

The Volume is now made up of four DataFrames: token counts, line character counts, section-level features (i.e. the page-level features that are provided for header/body/footer), and page-level features.

In [6]:
vol.tokenlist().head(2)

| page | section | token | pos | count |
|---|---|---|---|---|
| 2 | body | " | \`\` | 1 |
| 2 | body | . | . | 1 |


In [11]:
vol.line_chars().head(2)

| page | section | place | char | count |
|---|---|---|---|---|
| 2 | body | begin | F | 1 |
| 2 | body | begin | a | 1 |


In [8]:
vol.section_features(section='all').head(2)

| page | section | capAlphaSeq | emptyLineCount | lineCount | sentenceCount | tokenCount |
|---|---|---|---|---|---|---|
| 1 | header | 0 | 0 | 0 | 0 | 0 |
| 1 | body | 0 | 0 | 0 | 0 | 0 |


Metadata is imported from the parser as a Volume property:

In [15]:
vol.parser.meta

{'id': 'mdp.39015028036104',
 'schema_version': '1.3',
 'date_created': '2016-06-19T18:28:16.1649565Z',
 'title': 'Russian short stories, ed. for school use,',
 'pub_date': '1919',
 'language': 'eng',
 'ht_bib_url': 'http://catalog.hathitrust.org/api/volumes/full/htid/mdp.39015028036104.json',
 'handle_url': 'http://hdl.handle.net/2027/mdp.39015028036104',
 'oclc': ['1456817'],
 'imprint': 'Scott, Foresman and company [c1919]',
 'names': ['Schweikert, Harry Christian 1877- ed. '],
 'classification': {'lcc': ['PZ1.S413 Ru']},
 'type_of_resource': 'text',
 'issuance': 'monographic',
 'genre': ['not fiction'],
 'bibliographic_format': 'BK',
 'pub_place': 'ilu',
 'government_document': False,
 'source_institution': 'MIU',
 'enumeration_chronology': ' ',
 'hathitrust_record_number': '1059466',
 'rights_attributes': 'pd',
 'access_profile': 'google',
 'volume_identifier': 'mdp.39015028036104',
 'source_institution_record_number': '001059466',
 'isbn': [],
 'issn': [],
 'lccn': ['19006802'],


In [16]:
vol.page_count, vol.issn

(460, [])

## Alternative data parsers are supported

The bzipped JSON files may not meet all use cases. Developers can now extend basicVolumeParser with their own parsers, which are given to FeatureReader or a Volume with the `parser=...` argument. This will also help scale to future changes in the HTRC's Extracted Features file format.

There are two volume parsers included: `jsonVolumeParser` (default), and `parquetVolumeParser`.

## A feature file can hold incomplete data

The Feature Reader is now more robust when loading data that is missing parts of speech, has been lowercased, or lacks page sections. This can be useful for saving more succinct versions of texts.

`Volume.tokenlist()` also now has a `drop_section` argument, to drop the 'section' index level. This is a common use case, because most users only keep the 'body' section.

## Support for Parquet-based dataset files

The current parser enforces a filename convention, and you pass the extensionless file path. Here's what the files look like:

In [20]:
glob.glob('../data/parquet/mdp.39015028036104*')

['../data/parquet/mdp.39015028036104.meta.json',
 '../data/parquet/mdp.39015028036104.tokens.parquet',
 '../data/parquet/mdp.39015028036104.section.parquet',
 '../data/parquet/mdp.39015028036104.chars.parquet']

You don't need all four - perhaps you just want to load tokencounts and metadata, or even just metadata. The files are lazy-loaded, so if you have all four files but only want to access the metadata, you don't need to hide the other files - just don't call information from them!
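The filename convention can be sketched in a few lines. The helper below is hypothetical and only checks which of the expected files exist for an ID. The suffixes are taken from the directory listing above, and the demo uses empty placeholder files in a temporary directory rather than real data.

```python
import tempfile
from pathlib import Path

# Suffixes taken from the directory listing above
PARQUET_SUFFIXES = {
    'meta': '.meta.json',
    'tokens': '.tokens.parquet',
    'section': '.section.parquet',
    'chars': '.chars.parquet',
}

def available_parts(directory, htid):
    """Return a dict mapping part name -> Path for the files that exist."""
    found = {}
    for part, suffix in PARQUET_SUFFIXES.items():
        path = Path(directory) / (htid + suffix)
        if path.exists():
            found[part] = path
    return found

# Demo: only metadata and token counts are present
with tempfile.TemporaryDirectory() as tmp:
    for suffix in ('.meta.json', '.tokens.parquet'):
        (Path(tmp) / ('mdp.39015028036104' + suffix)).touch()
    parts = sorted(available_parts(tmp, 'mdp.39015028036104'))

print(parts)
```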

Loading is done like this:

In [355]:
pvol = Volume('mdp.39015028036104', format='parquet', dir='../data/parquet')
pvol

Specifying format did the trick, though you can also provide a fileHandler directly. Use this for alternative ways of storing/loading the data:

In [358]:
from htrc_features.parsers import ParquetFileHandler
Volume('mdp.39015028036104', dir='../data/parquet', file_handler=ParquetFileHandler)

There is now a `Volume.save` method for saving to the parquet format (or other formats).

In [360]:
?Volume.save

Signature: Volume.save(self, dir, format='parquet', **kwargs)
Docstring:
A wrapper around the 'write' method of all IdResolvers, 
that allows you to quickly declare a div, a format, and other
kwargs.

The primary use is for converting the feature files to
a more efficient parquet format. By default, only metadata and
tokencounts are saved, using the naming convention used by
parquetVolumeParser.

Saving page features is currently unsupported, as it's an
ill-fit for parquet. This is currently just the
language-inferences for each page - everything else is in
section features (page by body/header/footer).

Since Volumes partially support incomplete dataframes, you can
pass Volume.tokenlist arguments as a dict with
token_kwargs. For example, if you want to save a
representation with only body info

For example, writing to Parquet internally calls `ParquetFileHandler.write`:

In [361]:
ParquetFileHandler.write?

Signature:
ParquetFileHandler.write(
    self,
    volume,
    files=['meta', 'tokens'],
    mode='wb',
    compression='default',
    indexed=True,
    token_kwargs='default',
    **kwargs,
)
Docstring:
Save the internal representations of feature data to parquet, and the metadata to json,
using the naming convention used by ParquetFileHandler.

The primary use is for converting the f

By default, only the tokens and metadata are saved. You can also save a partial tokenlist if you like.

## The Page was stupefied

The Page object was stupefied: it now reaches up to the associated Volume for all of its functionality, and all the page-level Volume methods have a `page_select` argument for selecting only a single page.
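The delegation pattern can be pictured with a toy sketch. `ToyVolume` and `ToyPage` are stand-ins invented for this example and use plain dicts instead of DataFrames. Only the `page_select` argument name comes from the library.

```python
class ToyVolume:
    """Toy stand-in for Volume: holds token counts for every page."""
    def __init__(self, counts_by_page):
        self._counts = counts_by_page  # {page number: {token: count}}

    def tokenlist(self, page_select=None):
        # With page_select, return only that page; otherwise return everything
        if page_select is not None:
            return {page_select: self._counts[page_select]}
        return dict(self._counts)

class ToyPage:
    """Toy stand-in for Page: stores no data, only a volume and a page number."""
    def __init__(self, volume, seq):
        self.volume = volume
        self.seq = seq

    def tokenlist(self):
        # Delegate to the volume, restricted to this page
        return self.volume.tokenlist(page_select=self.seq)

toy_vol = ToyVolume({1: {'the': 3}, 2: {'labor': 2, 'unions': 1}})
toy_page = ToyPage(toy_vol, 2)
print(toy_page.tokenlist())
```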