## Using the HTID Class

The HTID class serves as a single access point for the various sources of volume-level information in this project. Because we draw on several sources of metadata, data, and features, the class is meant to simplify access from code. It lazy-loads content when needed and caches it afterwards.
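As a rough illustration of what "lazy-loads and caches" means in practice, here is a minimal sketch. `LazyVolume`, `loader`, and `load_count` are hypothetical names invented for this example; this is not how the HTID class is actually implemented, only one common way to get the same behavior with the standard library.

```python
from functools import cached_property


class LazyVolume:
    """Hypothetical stand-in for illustration; not the real HTID class."""

    def __init__(self, htid, loader):
        self.htid = htid
        self._loader = loader  # callable that performs the expensive read
        self.load_count = 0    # counts loader calls, for demonstration only

    @cached_property
    def volume(self):
        # Runs on first access only; the result is then stored on the instance.
        self.load_count += 1
        return self._loader(self.htid)


# Nothing is loaded at construction time.
vol = LazyVolume('aeu.ark:/13960/t0000s333', loader=lambda h: {'id': h})

first = vol.volume   # triggers the loader
second = vol.volume  # served from the cache, loader is not called again
```

The point of the pattern is that creating the object is cheap, and the cost of reading a source is paid only if that source is requested.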

In [1]:
%load_ext autoreload
%autoreload 2

In [8]:
from compare_tools.utils import HTID
from compare_tools.hathimeta import HathiMeta, get_json_meta
from compare_tools.configuration import config
from SRP import Vector_file
config.keys()

dict_keys(['rsync_root', 'parquet_chunked_root', 'parquet_root', 'glove_data_path', 'srp_data_path', 'meta_path', 'metadb_path'])

In [9]:
config

{'rsync_root': '/data/extracted-features/',
 'parquet_chunked_root': '/data/extracted-features-parquet-chunked/',
 'parquet_root': '/data/extracted-features-parquet/',
 'glove_data_path': '/data/vectorfiles/all_Glove_testset.bin',
 'srp_data_path': '/data/vectorfiles/all_SRP_testset.bin',
 'meta_path': '/projects/saddl-main/sampling/test_dataset.csv.gz',
 'metadb_path': '/data/saddl/meta.db'}

In [11]:
# Pre-initialized objects. These aren't specific to a volume, so the HTID
# class simply looks things up in them
metastore = HathiMeta(config['metadb_path'])
glove = Vector_file(config['glove_data_path'], mode='r')
srp = Vector_file(config['srp_data_path'], mode='r')

In [13]:
htid_args = dict(ef_root=config['parquet_root'],
                 ef_chunk_root=config['parquet_chunked_root'], 
                 ef_parser='parquet',
                 hathimeta=metastore,
                 vecfiles=[('glove', glove),('srp', srp)])

In [14]:
htid = HTID('aeu.ark:/13960/t0000s333', **htid_args)

## Metadata
Metadata is loaded from both the HathiMeta database and the Extracted Features volume. The two sources can sometimes differ, so the class doesn't deduplicate. In other words, *don't assume* that there's only one field with a given name! The `meta` call will initialize the volume if it isn't already initialized, so if you don't want to spend the processing cycles, don't ask for metadata, or don't give `ef_root` on init!

In [15]:
htid.meta().head(10)

access                                          allow
access_profile                                   open
access_profile_code                              open
author                    Clark, Francis E. 1851-1927
bib_fmt                                            BK
bibliographic_format                               BK
classification                     {'ddc': ['910.4']}
collection_code                                   AEU
content_provider_code                        ualberta
date_created             2016-06-19T07:13:48.0676367Z
dtype: object
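Because field names may repeat, code that reads a field should be prepared for more than one value. The sketch below uses a toy pandas Series with made-up values (it is not real output from `htid.meta()`), and `get_field` is a hypothetical helper, not part of this project.

```python
import pandas as pd

# Toy metadata with a repeated 'title' label, mimicking one value per source.
# The values are placeholders invented for this example.
meta = pd.Series(
    ['value from database', 'value from EF file', 'BK'],
    index=['title', 'title', 'bib_fmt'],
)


def get_field(series, name):
    """Return every value stored under `name` as a list (possibly empty)."""
    return series[series.index == name].tolist()


titles = get_field(meta, 'title')     # two entries, one per source
formats = get_field(meta, 'bib_fmt')  # a single entry
```

Always returning a list avoids the surprise that plain label indexing on a Series gives back a scalar for unique labels but another Series for repeated ones.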

## Data - Extracted Features

In [16]:
htid.volume

The version of the Extracted Features that I'm pointing to on my machine has case and part-of-speech (pos) information dropped, so token counts will only load if you don't request that information (`case=False, pos=False`). Your system may be different.

In [17]:
htid.volume.tokenlist(case=False, pos=False).sample(5)

| page | lowercase | count |
|---|---|---|
| 323 | africa | 4 |
| 132 | ii | 1 |
| 270 | eupon | 1 |
| 89 | for | 2 |
| 256 | natives | 1 |


For performance, this project allows you to save pre-parsed Extracted Features using the parquet format, as above. Simplified token count information can also be saved, including a version where only *n*-sized chunks are saved. HTID can point to an endpoint for chunk-only EF volumes:

In [18]:
chunks = htid.chunked_volume.chunked_tokenlist(case=False, pos=False, suppress_warning=True)
chunks.head()

| chunk | lowercase | count |
|---|---|---|
| 1 | `!` | 10 |
| 1 | `!^-^>` | 1 |
| 1 | `!tiiagfciiir-iiitiww<` | 1 |
| 1 | `"` | 28 |
| 1 | `"continued"` | 1 |


*The `suppress_warning` argument suppresses a reminder that pre-chunked tokenlists are returned as they exist on disk (i.e. if a volume was pre-saved at 5000 words per chunk, that's what you get, even if you ask `chunked_tokenlist` for 8000 words per chunk!). If you're worried you'll forget that, don't suppress the warning!*

These chunks are around 5000 tokens each:

In [19]:
chunks.groupby(level='chunk').sum()

| chunk | count |
|---|---|
| 1 | 4957 |
| 2 | 4973 |
| 3 | 5006 |
| 4 | 4928 |
| 5 | 4989 |
| 6 | 4959 |
| 7 | 4842 |
| 8 | 4890 |
| 9 | 4913 |
| 10 | 4979 |
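To make the `groupby(level='chunk')` step concrete, here is a small self-contained sketch on a toy frame with the same index layout (`chunk`, `lowercase`) and a `count` column. The tokens and counts are invented for the example, and `toy_chunks` is a hypothetical name.

```python
import pandas as pd

# Toy frame shaped like the chunked tokenlist: (chunk, lowercase) -> count
index = pd.MultiIndex.from_tuples(
    [(1, 'africa'), (1, 'for'), (2, 'africa'), (2, 'natives')],
    names=['chunk', 'lowercase'],
)
toy_chunks = pd.DataFrame({'count': [4, 2, 1, 3]}, index=index)

# Total tokens per chunk: collapse the 'lowercase' level
chunk_sizes = toy_chunks.groupby(level='chunk')['count'].sum()

# Total occurrences per token across all chunks: collapse the 'chunk' level
word_totals = toy_chunks.groupby(level='lowercase')['count'].sum()
```

Grouping on the other index level gives whole-volume counts per token, which is a quick way to check that chunking did not drop anything.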


## Pre-crunched Vectors

Pass one or more `Vector_file` objects to allow MTID-formatted vectors to be returned. Recall that these are the args that were passed:

In [20]:
htid_args['vecfiles']

[('glove', <SRP.SRP_files.Vector_file at 0x7f85928f5320>),
 ('srp', <SRP.SRP_files.Vector_file at 0x7f85928f5390>)]

The vectors, with their corresponding mtid references, can be returned with `HTID.vectors()`. They are returned as a list, with each item a tuple of (name, mtids, numpy array of vectors).

An optional argument naming one vector file (e.g. `'srp'`) can be specified, which makes the response just a tuple of the mtid references and the numpy array.

In [25]:
htid.vectors('srp')

(['aeu.ark:/13960/t0000s333-0001',
  'aeu.ark:/13960/t0000s333-0002',
  'aeu.ark:/13960/t0000s333-0003',
  'aeu.ark:/13960/t0000s333-0004',
  'aeu.ark:/13960/t0000s333-0005',
  'aeu.ark:/13960/t0000s333-0006',
  'aeu.ark:/13960/t0000s333-0007',
  'aeu.ark:/13960/t0000s333-0008',
  'aeu.ark:/13960/t0000s333-0009',
  'aeu.ark:/13960/t0000s333-0010',
  'aeu.ark:/13960/t0000s333-0011',
  'aeu.ark:/13960/t0000s333-0012',
  'aeu.ark:/13960/t0000s333-0013',
  'aeu.ark:/13960/t0000s333-0014',
  'aeu.ark:/13960/t0000s333-0015',
  'aeu.ark:/13960/t0000s333-0016'],
 array([[ -26.984823 ,  307.37292  ,   84.06194  , ..., -130.36887  ,
           95.95985  ,  193.36221  ],
        [ 178.0878   ,  173.55658  ,   32.863342 , ...,  -39.72213  ,
           29.23233  ,   81.24365  ],
        [   8.283472 ,  203.79395  ,  -61.260574 , ...,  -87.80734  ,
          -17.250793 ,  -30.96655  ],
        ...,
        [  32.598274 ,  231.31816  ,   42.3752   , ...,   55.059994 ,
         -125.438034 ,  232.3612

The vectors are cached after their first call. Above, where I only called vectors for 'srp', 'glove' wasn't cached or even loaded.
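Since the single-name call returns a `(mtids, array)` tuple, one row per mtid, a common next step is to pair each identifier with its row. The sketch below assumes only that shape; the identifiers and numbers are toy values rather than real output, and `by_mtid` is a hypothetical name.

```python
import numpy as np

# Toy stand-in for the (mtids, array) tuple; values are invented.
mtids = ['vol-0001', 'vol-0002', 'vol-0003']
vectors = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])

# Sanity check: there should be exactly one row per mtid.
rows_match = len(mtids) == vectors.shape[0]

# Map each mtid to its vector for lookup by identifier.
by_mtid = dict(zip(mtids, vectors))

second_chunk = by_mtid['vol-0002']
```

A dictionary is convenient for spot checks; for bulk work it is usually better to keep the array intact and use the list position of each mtid as the row index.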