The code in this notebook illustrates a process for fetching the HathiTrust volume identifiers from a given OCLC or other standard number (using the HathiTrust Bibliographic API) and then retrieving the feature sets for those volumes using those identifiers.

It also shows some preliminary work on (1) identifying points of divergence between different feature sets for the same volume (which might derive from OCR errors, among other causes) and (2) supplementing token data with standardized word vectors (from the spaCy library).

In [None]:
BASE_URL_HTRC = "https://data.htrc.illinois.edu/ef-api"
BASE_URL_HT = "https://catalog.hathitrust.org/api/volumes/full"

In [None]:
import pandas as pd
import requests
import numpy as np

In [None]:
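def get_ef_data_by_volume_id(volume_id):
    """Fetches the Extracted Features data for a given volume."""
    url = f"{BASE_URL_HTRC}/volumes/{volume_id}"
    response = requests.get(url)
    # Fail loudly on HTTP errors instead of trying to parse an error page as JSON
    response.raise_for_status()
    return response.json()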

In [None]:
def get_ht_bib_metadata(id_type, id_value):
  """Fetches the volume metadata for a given standard identifier.
  id_type should be one of oclc, lccn, issn, isbn, htid, recordnumber"""
  url = f"{BASE_URL_HT}/{id_type}/{id_value}.json"
  response = requests.get(url)
  # Fail loudly on HTTP errors instead of trying to parse an error page as JSON
  response.raise_for_status()
  return response.json()

In [None]:
oclc_lawrence = "1083464"

In [None]:
lawrence_metadata = get_ht_bib_metadata("oclc", oclc_lawrence)

**TO DO**: The Bibliographic API returns an empty result (no records and no items), not an error, if no match is found. We should account for that.

In [None]:
get_ht_bib_metadata("oclc", "44590156")

{'records': {}, 'items': []}
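One way to account for the empty result is to check the `items` list before using it. The sketch below is only an illustration: `extract_items` is a hypothetical helper name, and the `matched_result` dictionary is made-up sample data shaped like the response shown above.

```python
def extract_items(bib_result):
    """Return the list of items from a Bibliographic API result,
    or an empty list when nothing matched."""
    if not isinstance(bib_result, dict):
        return []
    # An unmatched identifier yields {'records': {}, 'items': []}
    return bib_result.get("items") or []


# Sample data for illustration only (the htid below is invented)
empty_result = {"records": {}, "items": []}
matched_result = {"records": {"000000001": {}}, "items": [{"htid": "xyz.123"}]}

for result in (empty_result, matched_result):
    items = extract_items(result)
    if not items:
        print("No HathiTrust match for this identifier")
    else:
        print(f"Found {len(items)} item(s)")
```

Returning an empty list keeps downstream loops such as `for item in items` safe without special-casing the unmatched identifier.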

The code below extracts volume-level metadata from a given result from the Bibliographic API.

In [None]:
ef_ids = []
for item in lawrence_metadata["items"]:
  ef_item = {"orig": item["orig"],
             "htid": item["htid"],
             "enumcron": item["enumcron"]}
  ef_ids.append(ef_item)

Two possible edge cases:
- A multivolume work cataloged under a single OCLC number, which yields one HathiTrust item per volume
- A single-volume work (single OCLC number) with multiple HathiTrust items, one per digitized copy
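The two cases might be told apart by grouping items on their `enumcron` value. The sketch below assumes that `enumcron` reliably carries the volume designation and is missing or falsy when there is none; `group_items_by_enumcron` and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def group_items_by_enumcron(items):
    """Group HathiTrust volume IDs by their enumcron (volume designation)."""
    groups = defaultdict(list)
    for item in items:
        # Assumption: a missing or falsy enumcron means "no volume designation"
        key = item.get("enumcron") or ""
        groups[key].append(item["htid"])
    return dict(groups)


# Invented sample data: two copies of v.1 and one copy of v.2
sample_items = [
    {"htid": "aaa.001", "enumcron": "v.1"},
    {"htid": "aaa.002", "enumcron": "v.2"},
    {"htid": "bbb.001", "enumcron": "v.1"},
]
groups = group_items_by_enumcron(sample_items)

is_multivolume = len(groups) > 1
has_duplicate_copies = any(len(htids) > 1 for htids in groups.values())
print(groups, is_multivolume, has_duplicate_copies)
```

More than one key suggests a multivolume work; more than one ID under the same key suggests several digitized copies of the same volume.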

In the code below, we extract the features for a single volume.

In [None]:
lawrence_mich = get_ef_data_by_volume_id(ef_ids[0]["htid"])

In [None]:
import json
with open("lawrence_mich_ef.json", "w") as f:
  json.dump(lawrence_mich, f)

In [None]:
lawrence_mich_pages = lawrence_mich["data"]["features"]["pages"]

For now, let's focus on the tokens in the `body` part of the `pages` objects. We want to flatten this structure into three DataFrames:
1. Summary stats for the page
2. Individual token/POS counts for the page
3. Begin/end letter counts for lines.
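The first and third structures could be flattened along the following lines. This is a sketch under assumptions: the field names `tokenCount`, `lineCount`, `emptyLineCount`, `sentenceCount`, `beginCharCount`, and `endCharCount` are assumed to be present in each page's `body`, and the function names and sample pages are hypothetical.

```python
import pandas as pd

# Assumed names of the per-section summary fields
SUMMARY_FIELDS = ["tokenCount", "lineCount", "emptyLineCount", "sentenceCount"]


def extract_body_summaries(pages):
    """One row per page with the summary statistics of its body section."""
    rows = []
    for i, page in enumerate(pages):
        body = page.get("body") or {}
        row = {"page": i}
        for field in SUMMARY_FIELDS:
            row[field] = body.get(field)  # None when the field is absent
        rows.append(row)
    return rows


def extract_line_char_counts(pages):
    """One row per page/position/character for line-initial and line-final characters."""
    rows = []
    for i, page in enumerate(pages):
        body = page.get("body") or {}
        for position, field in [("begin", "beginCharCount"), ("end", "endCharCount")]:
            for char, count in (body.get(field) or {}).items():
                rows.append({"page": i, "position": position, "char": char, "count": count})
    return rows


# Invented sample pages: one empty page and one page with a body
sample_pages = [
    {"body": None},
    {"body": {"tokenCount": 3, "lineCount": 2, "emptyLineCount": 0, "sentenceCount": 1,
              "beginCharCount": {"T": 1, "a": 1}, "endCharCount": {".": 2}}},
]
summary_df = pd.DataFrame.from_records(extract_body_summaries(sample_pages))
line_chars_df = pd.DataFrame.from_records(extract_line_char_counts(sample_pages))
```

Using `.get` throughout means that pages without a body, or bodies without a given field, produce empty values rather than raising a `KeyError`.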

In [None]:
def extract_data_from_pages(pages):
    """Iterates over the token/part-of-speech data at the page level, creating a flatter structure
    in which each token/POS pair is stored in a dictionary with its count and page number."""
    extracted_data = []
    # The page number recorded here is the page's index in the pages list
    for i, page in enumerate(pages):
        body = page.get('body')
        if body:
            token_pos_count = body.get('tokenPosCount')
            if token_pos_count:
                for t, pos in token_pos_count.items():
                    # pos maps POS tags to counts; only the first tag/count pair is kept here
                    token_data = {"page": i, "token": t}
                    pos_dict = dict(zip(["pos", "counts"], list(pos.items())[0]))
                    token_data.update(pos_dict)
                    extracted_data.append(token_data)
    return extracted_data

In [None]:
ef_lawrence_mich = extract_data_from_pages(lawrence_mich_pages)

For each page/token/pos, we get the count.

In [None]:
ef_lawrence_mich[0]

{'page': 4, 'token': 'II', 'pos': 'NNP', 'counts': 1}
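Note that `extract_data_from_pages` reads only the first entry of each token's POS dictionary (`list(pos.items())[0]`). If a token carries more than one tag on a page, the remaining tags are dropped. A variant that keeps every pair might look like the sketch below; `extract_all_token_pos_pairs` is a hypothetical name and the sample page is invented.

```python
def extract_all_token_pos_pairs(pages):
    """Flatten tokenPosCount so that every token/POS pair on a page gets its own row."""
    extracted = []
    for i, page in enumerate(pages):
        body = page.get("body") or {}
        token_pos_count = body.get("tokenPosCount") or {}
        for token, pos_counts in token_pos_count.items():
            # pos_counts maps each POS tag to its count for this token on this page
            for pos, count in pos_counts.items():
                extracted.append({"page": i, "token": token, "pos": pos, "counts": count})
    return extracted


# Invented sample page in which "rose" is tagged both as a noun and as a verb
sample_pages = [{"body": {"tokenPosCount": {"rose": {"NN": 2, "VBD": 1}, "the": {"DT": 4}}}}]
rows = extract_all_token_pos_pairs(sample_pages)
print(rows)
```

Switching to this variant would change the row counts reported later in the notebook, so the recorded outputs would need to be regenerated.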

This structure maps easily onto a pandas DataFrame.

In [None]:
df = pd.DataFrame.from_records(ef_lawrence_mich)

In the code below, we compare the feature sets for the same volume of D. H. Lawrence's _Poems_, as derived from two separate copies in HathiTrust.

In [None]:
lawrence_uc = get_ef_data_by_volume_id(ef_ids[1]["htid"])

In [None]:
lawrence_uc_pages = lawrence_uc["data"]["features"]["pages"]

In [None]:
ef_lawrence_uc = extract_data_from_pages(lawrence_uc_pages)

In [None]:
ef_lawrence_uc[:10]

[{'page': 0, 'token': 'E', 'pos': 'UNK', 'counts': 1},
 {'page': 0, 'token': '_|-', 'pos': 'UNK', 'counts': 1},
 {'page': 0, 'token': '·-', 'pos': 'UNK', 'counts': 1},
 {'page': 0, 'token': '.--', 'pos': 'UNK', 'counts': 1},
 {'page': 0, 'token': '..-', 'pos': 'UNK', 'counts': 1},
 {'page': 0, 'token': '-.', 'pos': 'UNK', 'counts': 3},
 {'page': 0, 'token': '，', 'pos': 'UNK', 'counts': 1},
 {'page': 0, 'token': '--|', 'pos': 'UNK', 'counts': 1},
 {'page': 0, 'token': '.', 'pos': 'UNK', 'counts': 21},
 {'page': 0, 'token': '--|--', 'pos': 'UNK', 'counts': 1}]

In [None]:
df2 = pd.DataFrame.from_records(ef_lawrence_uc)

What follows are possible points of comparison.

1. Identification of tokens unique to each volume.

In [None]:
df.loc[~df.token.isin(df2.token)]

       page    token pos  counts
25        7     FIVE  NE       1
28        7  ADELPHI  NE       1
30        7      LTD  NN       1
31        7   NUMBER  NE       1
33        7   BECKER  NE       1
...     ...      ...  ..     ...
28569   310     9349  CD       1
28570   310     1605  CD       1
28571   310   115616  CD       1
28572   310       HF  NN       1


In [None]:
df2.loc[~df2.token.isin(df.token)]

       page          token  pos  counts
0         0              E  UNK       1
1         0            _|-  UNK       1
2         0             ·-  UNK       1
3         0            .--  UNK       1
4         0            ..-  UNK       1
...     ...            ...  ...     ...
29089   313  --.-|-.--.---   NN       1
29090   313              │   JJ      18
29091   313          ...-│   NR       1
29092   313            │--   NN       1


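2. Number of unique tokens that appear exactly once on at least one page of each feature set (a page-level approximation of tokens appearing only once per feature set). A higher number of such tokens might logically correspond to a higher prevalence of OCR errors.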

In [None]:
mich_singletons = df.loc[df.counts == 1].token.unique()
uc_singletons = df2.loc[df2.counts == 1].token.unique()

In [None]:
len(mich_singletons)

7844

In [None]:
len(uc_singletons)

8056
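The filter `counts == 1` operates on page-level rows, so a token that occurs once on each of several pages still qualifies. To count tokens that occur exactly once in the whole volume, the counts can be summed per token first. The sketch below uses a hypothetical helper, `volume_singletons`, and invented sample rows.

```python
import pandas as pd


def volume_singletons(token_df):
    """Return the tokens whose total count across all pages (and POS tags) is exactly 1."""
    totals = token_df.groupby("token")["counts"].sum()
    return totals[totals == 1].index.to_numpy()


# Invented sample: "rose" occurs once on each of two pages, "r0se" once overall
sample = pd.DataFrame.from_records([
    {"page": 0, "token": "rose", "pos": "NN", "counts": 1},
    {"page": 1, "token": "rose", "pos": "NN", "counts": 1},
    {"page": 1, "token": "r0se", "pos": "UNK", "counts": 1},
])

page_level = sample.loc[sample.counts == 1].token.unique()
volume_level = volume_singletons(sample)
print(len(page_level), len(volume_level))
```

On the sample data the page-level filter keeps both tokens, while the volume-level version keeps only `r0se`.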

3. The number of non-alphabetic tokens (counted per page-level row, not per unique token). A higher number of such tokens might indicate more OCR errors.

In [None]:
len(df.loc[~df.token.str.isalpha()])

2718

In [None]:
len(df2.loc[~df2.token.str.isalpha()])

3000

4. **TO DO**: Calculate the volume-level totals per token/POS for each feature set, and find the differences. Where the difference is nonzero, the two feature sets diverge on that token.

In [None]:
token_summary = df.groupby(["token", "pos"]).counts.sum()

In [None]:
token_summary

token  pos
!      .      398
!!     .        6
!!!    .        2
"!     UNK      1
%      NN       1
             ... 
•      SYM      4
       UNK      2
■      SYM      2
       UNK      2
✓      NNP      1
Name: counts, Length: 8875, dtype: int64

In [None]:
token_summary2 = df2.groupby(["token", "pos"]).counts.sum()
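The comparison of the two summaries could be sketched as follows. `token_count_differences` is a hypothetical helper, and the two small DataFrames are invented stand-ins for `df` and `df2`. A token/POS pair that is missing from one feature set is treated as having a count of 0 there.

```python
import pandas as pd


def token_count_differences(summary_a, summary_b):
    """Subtract two token/POS count Series and keep only the pairs that differ."""
    # fill_value=0 treats a pair absent from one Series as a count of zero
    diff = summary_a.sub(summary_b, fill_value=0)
    return diff[diff != 0].astype(int)


# Invented stand-ins for the two feature sets
a = pd.DataFrame({"token": ["rose", "the", "r0se"], "pos": ["NN", "DT", "UNK"], "counts": [3, 10, 1]})
b = pd.DataFrame({"token": ["rose", "the"], "pos": ["NN", "DT"], "counts": [4, 10]})

summary_a = a.groupby(["token", "pos"])["counts"].sum()
summary_b = b.groupby(["token", "pos"])["counts"].sum()

diff = token_count_differences(summary_a, summary_b)
print(diff)
```

Positive values mark pairs that are more frequent in the first feature set, negative values pairs that are more frequent in the second.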

In the code below, we illustrate a process for adding pre-computed word embeddings (from the standard implementation packaged with the spaCy library).

In [None]:
import spacy

In [None]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
nlp = spacy.load("en_core_web_md")

For each unique token in the feature set, we parse it as a one-token document with spaCy (in order to retrieve the word vector), and then we associate this word vector with the token, using a separate dictionary to avoid redundancy. If spaCy's tokenizer splits a string into several tokens, only the first token's vector is kept.

In [None]:
token_vectors = {}
for token in df.token.unique():
  token_doc = nlp(token)
  vector = token_doc[0].vector
  # Exclude empty vectors, which correspond to tokens without embeddings
  if np.any(vector):
    token_vectors[token] = vector
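Once the dictionary exists, it can be used to flag which rows of the token DataFrame have an embedding and to compare tokens by cosine similarity. The sketch below uses a toy three-dimensional dictionary, `toy_vectors`, in place of `token_vectors` so that it runs without spaCy; all names and values in it are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the token_vectors dictionary (3 dimensions for brevity)
toy_vectors = {
    "rose": np.array([1.0, 0.0, 1.0]),
    "flower": np.array([1.0, 0.0, 0.5]),
}
toy_df = pd.DataFrame({"token": ["rose", "flower", "r0se"], "counts": [2, 1, 1]})

# Flag rows whose token has an embedding; rows without one may be OCR errors or rare words
toy_df["has_vector"] = toy_df["token"].isin(list(toy_vectors))


def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


similarity = cosine_similarity(toy_vectors["rose"], toy_vectors["flower"])
print(toy_df)
print(round(similarity, 3))
```

Keeping the vectors in a dictionary and storing only a boolean flag in the DataFrame avoids repeating the same vector on every page where a token occurs.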

In [None]:
import pickle
with open("vector_dict_lawrence_poems.pkl", "wb") as f:
  pickle.dump(token_vectors, f)