# <center>Critical AI</center>
<center>ENGL 54.41</center>
<center>Dartmouth College</center>
<center>Winter 2026</center>
<pre>Created: 01/15/2026; Updated: 01/23/2026</pre>

## Create our own embeddings using HathiTrust Data

We'll use doc2vec, a method that learns embeddings for whole documents, to produce embeddings for individual words from our HathiTrust word frequency data. This lets us keep the same data model we have been using, even though word embeddings are normally trained on small windows of neighboring words. Because the extracted features do not preserve word order, we'll use a much larger context window instead. The trade-off is that this technique requires much more data than usual: the model needs to see many samples of similar large contexts to learn high-quality embeddings. But it does work.
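
To see why the window size matters, here is a simplified picture of a context window over one page. The page and the helper `context_words` are made up for illustration, and gensim's actual sampling is more involved than this sketch.

```python
def context_words(tokens, position, window):
    """Return the tokens within `window` positions of tokens[position]."""
    start = max(0, position - window)
    end = position + window + 1
    return tokens[start:position] + tokens[position + 1:end]

# A hypothetical page rebuilt from word counts: reading order is lost,
# so the tokens appear in alphabetical order instead.
toy_page = ["arrest", "court", "crime", "crime", "judge", "police", "trial"]
target = toy_page.index("crime")  # position of the first "crime"

# A narrow window only sees alphabetical neighbors, which carry no meaning.
narrow = context_words(toy_page, target, window=1)

# A window at least as long as the page sees every other token on the page.
wide = context_words(toy_page, target, window=len(toy_page))

print(narrow)
print(wide)
```

With a narrow window the "context" of a word is just whatever sits next to it alphabetically. A window as wide as the page turns the whole page into the context, which is the behavior we want here.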

In [None]:
import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim.models.keyedvectors as kv
from gensim import matutils

import numpy as np
import pandas as pd
import torch

from htrc_features import FeatureReader

from numpy import dot
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import TSNE

from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
# The following is a list of HathiTrust IDs for books. These identify
# the HTRC Extracted Features dataset for each text. You can find the
# ID by visiting https://www.hathitrust.org/ and searching for a book.
# You will need to click on the link for a specific volume from a
# specific library. If you want a book that is under copyright
# protection, you can change "Item Availability" from "Full View" to
# "All Items" after you have searched for a book or author. The same
# process applies for finding the ID: click on "Limited (search-only)"
# and read the ID from the URL.

texts = ['mdp.39015014296548',
 'uc1.$b100778',
 'uc1.$b285061',
 'uc1.b4194243',
 'uc1.$b106074',
 'uc1.b3117127',
 'uc1.$b245112',
 'uc1.$b434924',
 'mdp.39015002194143',
 'uc1.b3117208',
 'uc1.b2839083',
 'uc1.32106005763088',
 'uc1.$b87329',
 'inu.30000117261671',
 'wu.89095289229',
 'uc1.32106001535084',
 'mdp.39015049019139',
 'uc1.$b103178']

In [None]:
# build document-term matrix by page
fr = FeatureReader(ids = texts)
rows = []
for vol in fr:
    print(vol)
    tl = vol.tokenlist(section='body', case=False, pos=True, drop_section=True)
    tl = tl.reset_index().rename(columns={"token": "lowercase", 0: "count"})
    tl["volume"] = vol.id
    rows.append(tl[["volume", "page", "lowercase", "pos", "count"]])

df = pd.concat(rows, ignore_index=True)

# filter for only alphabetical tokens and longer than one character
df = df[df["lowercase"].str.isalpha() & (df["lowercase"].str.len() > 1)]

# create page_ids
df["page_id"] = df["volume"].astype(str) + ":" + df["page"].astype(str)

dtm_counts = (
    df.pivot_table(index="page_id",
                   columns="lowercase",
                   values="count",
                   aggfunc="sum",
                   fill_value=0)
    .sort_index()
)

In [None]:
# Expand each page's word counts back into a list of tokens, one list per page.
# Doc2Vec requires tagged documents, which would normally be paragraphs, but we
# can only operate on pages of data from HTRC extracted features, so each page
# becomes one document (the tag is added in the next cell).

pages = []
for page_id in dtm_counts.index:
    counts = dtm_counts.loc[page_id]
    counts = counts[counts > 0]   # keep only words that occur on this page
    # repeat each word as many times as it was counted on the page
    pages.append(np.repeat(counts.index, counts.values).tolist())
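
The expansion step above can be pictured on a single made-up page. This is a plain-Python sketch with hypothetical counts. It is not the notebook's pandas code, but it follows the same idea: each word is repeated once per occurrence, and words with a count of zero disappear.

```python
# Hypothetical word counts for one page (not taken from the HathiTrust data).
example_counts = {"court": 1, "crime": 2, "judge": 0, "trial": 1}

# Repeat each word once per occurrence, in alphabetical order,
# mirroring the sorted column order of a document-term matrix.
example_tokens = [
    word
    for word, n in sorted(example_counts.items())
    for _ in range(n)
]

print(example_tokens)  # ['court', 'crime', 'crime', 'trial']
```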

In [None]:
tagged_data = [TaggedDocument(words=tokens, tags=[f"p{i}"])
          for i, tokens in enumerate(pages)]
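
Because the tags are just positions (`p0`, `p1`, ...), it can help to keep a lookup from tag back to the page it names. A minimal sketch, assuming the tags follow the order of the page index. The page numbers below are invented for illustration.

```python
# Hypothetical page ids in "volume:page" form; with the real data this list
# would be the index of the document-term matrix, in the same order.
example_page_ids = [
    "mdp.39015014296548:10",
    "mdp.39015014296548:11",
    "uc1.$b100778:7",
]

# Map each positional tag to the page it stands for.
tag_to_page = {f"p{i}": page_id for i, page_id in enumerate(example_page_ids)}

print(tag_to_page["p2"])  # uc1.$b100778:7
```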

In [None]:
print("creating model")
wvmodel = Doc2Vec(tagged_data,
                dm = 1,              # operate on "paragraphs" (pages) with distributed memory model
                vector_size = 200,   # larger vector size might produce better results but requires more time and memory
                min_count = 2,       # drop words with very few repetitions
                window = 150,        # large window needed because extracted features do not preserve word order
                epochs = 10,         # default number of epochs (as in our Perceptron networks, we'll run all the data through multiple times)
                workers = 8)         # attempt some parallelism
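
The `min_count` setting can be pictured as a frequency filter over the whole corpus. The sketch below uses made-up pages and illustrative names. gensim applies this filter internally while building its vocabulary, so none of this code is needed to train the model.

```python
from collections import Counter

# Hypothetical token lists, one per page.
toy_pages = [
    ["crime", "crime", "court"],
    ["court", "judge"],
    ["trial"],
]
toy_min_count = 2

# Count every word across all pages, then keep the ones seen often enough.
totals = Counter(token for page in toy_pages for token in page)
kept = {word for word, n in totals.items() if n >= toy_min_count}

print(sorted(kept))  # ['court', 'crime']
```

Words that occur only once, such as "judge" and "trial" here, get no embedding at all. That is why a later lookup can report that a term is not in the vocabulary.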

In [None]:
wvmodel.wv.most_similar("crime")
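
`most_similar` ranks words by cosine similarity, which compares the direction of two vectors and ignores their length. A minimal sketch of that measure in plain Python, with toy three-dimensional vectors standing in for the 200-dimensional embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_product / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a, twice as long
vec_c = [0.0, 0.0, 3.0]   # at a right angle to vec_a

print(cosine(vec_a, vec_b))  # very close to 1: same direction
print(cosine(vec_a, vec_c))  # 0.0: nothing in common
```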

In [None]:
def plot_neighbors(term):
    if term in wvmodel.wv:
        vocab = [v[0] for v in wvmodel.wv.most_similar(term,topn=50)]
        embs = np.array([wvmodel.wv[v] for v in vocab])
        tsne = TSNE(n_components=2, perplexity=2, max_iter=1000, random_state=42)
        embeddings_2d = tsne.fit_transform(embs)  # scikit-learn takes the NumPy array directly
        xs, ys = embeddings_2d[:, 0], embeddings_2d[:, 1]
        plt.figure(figsize=(8, 6))
        plt.scatter(xs, ys, alpha=0.7)
        for i, w in enumerate(vocab):
          plt.annotate(w, xy = (xs[i], ys[i]), xytext = (3, 3),
                       textcoords = 'offset points', ha = 'left', va = 'top')
        plt.title(f't-sne plot of neighbors of {term}')
        plt.grid(True)
        plt.xticks(())
        plt.yticks(())
        plt.tight_layout()
        plt.show()
    else:
        print(f'{term} not found in model vocab')

In [None]:
plot_neighbors("crime")

Now go back up and read through the volume information that was displayed as we built our dataset. How might you interpret the nearest neighbors and the plot above in light of the volumes in that dataset? What other useful queries might you make?
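
One possible starting point for further queries is to compare pairs of words directly. The helper below is a hypothetical convenience wrapper, not part of gensim. It relies only on the `similarity` method and the `in` check that gensim's word vectors provide, and it returns `None` rather than raising an error when a word is missing from the vocabulary.

```python
def safe_similarity(wv, first, second):
    """Cosine similarity between two words, or None if either is out of vocabulary."""
    if first not in wv or second not in wv:
        return None
    return float(wv.similarity(first, second))

# Example usage with the trained model from this notebook
# (the word pair is only a suggestion):
# safe_similarity(wvmodel.wv, "crime", "punishment")
```

Comparing several pairs this way can show whether the embeddings reflect how the volumes in the dataset use those words.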