# htrc-vectorize

This notebook demonstrates how to use HTRC Extracted Features to train word vectors with Doc2Vec and save them in word2vec format.

Jed Dobson<br>
James.E.Dobson@Dartmouth.EDU<br>
http://www.dartmouth.edu/~jed<br>
October 2020<br>
 
This file is part of the htrc-vector-project, begun in June 2020 with Catherine Parnell. 

<b>Repository</b>:<br>
https://github.com/jeddobson/htrc-vector-project

### Why use HTRC Features?

The <a href="https://analytics.hathitrust.org/features">HTRC Extracted Features dataset</a> includes over 17 million volumes, among them works still protected by copyright. The features can be distributed because they are intended for non-consumptive use: they are designed to be read by computers rather than by humans. This means that you can model large numbers of texts, including texts from the twentieth and twenty-first centuries. Document and page-based features are available. The <a href="https://github.com/htrc/htrc-feature-reader">HTRC feature-reader package</a> for Python enables easy access to these features, which include tokens and their repetitions on a page-by-page basis. Word order is lost (thus the page is no longer human readable). This format limits their use, however, because many popular applications of text mining and machine learning expect preserved word order. The popular skip-gram model used by <a href="https://github.com/tmikolov/word2vec">word2vec</a>, for example, learns by predicting words within a window surrounding a target word.
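
To make the word-order problem concrete, here is a minimal sketch of how a skip-gram style window pairs a target word with its neighbours. The function name `skipgram_pairs` and the sample sentence are hypothetical and for illustration only, and this is a simplification rather than the sampling that word2vec or gensim actually performs.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# A page as written, and the same page with its order lost (here: sorted).
ordered = ["rain", "fell", "on", "the", "roses"]
unordered = sorted(ordered)

# With a small window, the training pairs depend on word order.
narrow_ordered = skipgram_pairs(ordered, window=1)
narrow_unordered = skipgram_pairs(unordered, window=1)

# With a window as long as the page, order no longer matters.
wide_ordered = set(skipgram_pairs(ordered, window=len(ordered)))
wide_unordered = set(skipgram_pairs(unordered, window=len(unordered)))
```

With `window=1` the two versions of the page yield different pairs, while a page-length window yields the same set of pairs for both. That is the intuition behind training with a very large window on order-free features.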
    
### Doc2Vec
This notebook uses <a href="https://radimrehurek.com/gensim/models/doc2vec.html">Doc2Vec</a> to train a model for words appearing within a much larger window. Doc2Vec has been used to produce vectors from paragraph-length text sources. These vectors are then used for classification and other applications.

### Tunables and Limitations
This example notebook uses a set of Toni Morrison novels found in the HathiTrust archive. This is a much smaller dataset than we've used in other experiments. Vector models like word2vec generally require larger numbers of text sources (we've trained on 30GB archives). 

<b>Text Sources</b>
- input size: Depending on what you want to model, you will most likely want to work with more than the writing of a single author. We successfully used this approach with thousands of texts across multiple genres.
- preprocessing: This notebook does not perform any preprocessing and simply imports all extracted features. You may want to remove very high-frequency words (i.e., "stopwords"). Results may vary.
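
As one possible preprocessing step, stopwords could be filtered from each page's token list before the pages are wrapped in `TaggedDocument`. This is a minimal sketch: the `STOPWORDS` set below is a tiny hypothetical list, not one used in this project, and `remove_stopwords` is an illustrative helper name.

```python
# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = frozenset({"a", "and", "in", "of", "the", "to"})


def remove_stopwords(pages, stopwords=STOPWORDS):
    """Drop stopwords from every page, where each page is a list of tokens."""
    return [[token for token in page if token not in stopwords] for page in pages]


# Pages in the same shape that get_pages() returns: one token list per page.
sample_pages = [
    ["a", "a", "house", "memory", "of", "the", "the"],
    ["and", "rain"],
]
filtered = remove_stopwords(sample_pages)
```

The filter keeps repeated tokens, so the per-page counts of the remaining words are unchanged.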

<b>Execution</b>
In real execution, you'll want to separate the several phases of this workflow. You can predownload the features and load these locally (this script will download as needed). To manage run time, you may also want to process each text individually. We produced a CSV file with one line for each page of the text. These were then concatenated (and compressed with bzip) and converted into a TaggedDocument before running doc2vec.
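
The staged workflow described above might look something like the following sketch, which uses only the Python standard library. The three-column CSV layout (volume id, page index, space-joined tokens) and the helper names `write_pages` and `read_pages` are assumptions for illustration, not the exact format we used.

```python
import bz2
import csv
import os
import tempfile


def write_pages(path, volume_id, pages):
    """Append one CSV row per page to a bzip2-compressed file."""
    with bz2.open(path, "at", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for index, tokens in enumerate(pages):
            writer.writerow([volume_id, index, " ".join(tokens)])


def read_pages(path):
    """Yield (tag, tokens) for each stored page, one row at a time."""
    with bz2.open(path, "rt", newline="", encoding="utf-8") as handle:
        for volume_id, index, text in csv.reader(handle):
            yield f"{volume_id}:{index}", text.split()


# Process each text individually, appending to a single archive.
demo_path = os.path.join(tempfile.mkdtemp(), "pages.csv.bz2")
write_pages(demo_path, "vol-a", [["blue", "blue", "eye"], ["song"]])
write_pages(demo_path, "vol-b", [["baby", "tar"]])

restored = list(read_pages(demo_path))
```

Each `(tag, tokens)` pair could then be turned into `TaggedDocument(words=tokens, tags=[tag])` before training, which keeps feature extraction and model training in separate runs.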

<b>doc2vec</b>
- window: This is the primary tunable. The smaller this value, the more likely you are to find similar word vectors based on alphabetical ordering (the tokens are processed in alphabetical order, as received from the feature-reader). The window size should approximate the typical page length.
- min_count: A small integer here should remove most errors introduced during digitization or creation of the extracted features, because tokens that occur fewer than min_count times in the corpus are discarded.
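
Before training, it can help to inspect the corpus to pick values for these two parameters. The sketch below is one rough way to do that, assuming `pages` is a list of token lists like the one built later in this notebook; the helper names and the toy data are hypothetical.

```python
from collections import Counter
from statistics import median


def typical_page_length(pages):
    """Median token count over non-empty pages, as a starting point for `window`."""
    lengths = [len(page) for page in pages if page]
    return int(median(lengths)) if lengths else 0


def dropped_by_min_count(pages, min_count):
    """Tokens whose corpus-wide frequency falls below `min_count`."""
    counts = Counter(token for page in pages for token in page)
    return sorted(token for token, n in counts.items() if n < min_count)


# Toy pages; "tbe" stands in for a digitization error.
toy_pages = [
    ["a", "a", "b"],
    ["a", "b", "c", "d", "e"],
    [],
    ["a", "b", "c", "d", "e", "f", "tbe"],
]

window_guess = typical_page_length(toy_pages)
rare_tokens = dropped_by_min_count(toy_pages, min_count=2)
```

Here the median is only a heuristic for "typical page length"; a mean or a high percentile would be equally defensible choices.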


In [1]:
from htrc_features import FeatureReader, utils  
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim.models.keyedvectors as kv

In [2]:
# data to download
documents = [
    ["The Bluest Eye","uc1.32106018657251"],
    ["Song of Solomon","mdp.39015032749130"],
    ["Sula","uc1.32106019072633"],
    ["Tar Baby","uc1.32106005767956"],
    ["Jazz","ien.35556029664190"],
    ["Beloved","mdp.49015003142743"],
    ["Paradise","mdp.39015066087613"],
    ["A Mercy","mdp.39076002787351"]
]

In [3]:
# This function extracts individual pages and creates a string of words from tokens.
# Word order is lost from HTRC features. This creates page-length strings by
# repeating each token once per appearance. Thus, the token "the" with count 2
# will appear as "the the" in the returned string.

def get_pages(document):
    fr = FeatureReader([document])
    vol = next(fr.volumes())
    # columns after reset_index/drop are, in order: page, token, count
    ptc = vol.tokenlist(pos=False, case=False).reset_index().drop(['section'], axis=1)
    # sort the page numbers so pages are returned in a stable order (a set is unordered)
    page_list = sorted(set(ptc['page']))

    rows = list()
    for page in page_list:
        page_data = str()

        # operate on each token
        for _, row in ptc.loc[ptc['page'] == page].iterrows():
            # positional access avoids relying on integer keys for a labeled row
            token, count = row.iloc[1], row.iloc[2]
            if token.isalpha():
                page_data += ' '.join([token] * count) + " "

        # Doc2Vec expects each document as a list of word tokens
        rows.append(page_data.split())
    return rows

In [4]:
# Process downloaded features and store as TaggedDocument with a tag for page number.
# This tag is required for Doc2Vec and would normally be based on paragraphs, but we
# can only operate on pages of data from HTRC extracted features.

pages = list()
for d in documents:
    for page in get_pages(d[1]):
        pages.append(page)

# convert to TaggedDocument
tagged_data = [TaggedDocument(words=_d, tags=[str(i)]) for i, _d in enumerate(pages)]

In [5]:
print("creating model")
model = Doc2Vec(tagged_data, 
                dm=1, # operate on "paragraphs" (pages) with distributed memory model
                vector_size=300, # larger vector size might produce better results
                min_count=5, # drop words with very few repetitions
                window=150, # larger window size needed because of extracted features
                workers=2)

print("saving word2vec model")
model.save_word2vec_format("doc2vec-morrison-novels.w2v")

creating model
saving word2vec model


In [6]:
# load and verify
model = kv.KeyedVectors.load_word2vec_format("doc2vec-morrison-novels.w2v")

In [7]:
model.most_similar(["memory"],topn=25)

[('painful', 0.9396734237670898),
 ('permanent', 0.9382855296134949),
 ('pieces', 0.9304176568984985),
 ('path', 0.9261106848716736),
 ('sauce', 0.9238497018814087),
 ('quite', 0.9236317873001099),
 ('months', 0.9215139746665955),
 ('rain', 0.9191372394561768),
 ('outside', 0.9182758331298828),
 ('ruby', 0.9170806407928467),
 ('order', 0.9139819741249084),
 ('secret', 0.9124130606651306),
 ('roses', 0.9104964137077332),
 ('single', 0.9092104434967041),
 ('remembered', 0.908808708190918),
 ('schoolhouse', 0.9077692031860352),
 ('rented', 0.9066518545150757),
 ('such', 0.9056478142738342),
 ('parts', 0.9044241905212402),
 ('thoroughly', 0.9029747247695923),
 ('once', 0.9015201330184937),
 ('thrill', 0.9012625813484192),
 ('perfect', 0.8979784250259399),
 ('sheets', 0.896855354309082),
 ('pressed', 0.8961389660835266)]