<center>
<h1>Cultural Analytics</h1><br>
<h2>ENGL64.05 / QSS 30.16 Fall 2022</h2>
</center>

----

# Lab 6
## Neural Language Models and word2vec

 <center><pre>Created: 05/16/2021; Revised: 10/31/2022</pre></center>

In [None]:
import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

from gensim import matutils
import numpy as np
from numpy import dot
from sklearn.metrics.pairwise import cosine_similarity

from sklearn.decomposition import PCA
from sklearn.manifold import MDS

from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
# Load Google News 200 Model (smaller)
google_model = KeyedVectors.load_word2vec_format("shared/ENGL64.05-22F/models/google-vectors.w2v",binary=True)

In [None]:
# "Interview" the model
vocab_size, dim = google_model.vectors.shape
print("vocab:", vocab_size)
print("depth:", dim)

In [None]:
# Similar Terms
google_model.most_similar("dartmouth",topn=25)

In [None]:
# What is the cosine distance (1 - cosine similarity) between dartmouth and hanover?
# Smaller values mean the two word vectors point in more similar directions.
google_model.distance('dartmouth','hanover')
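
The `distance` method reports cosine distance, which is 1 minus the cosine similarity of the two word vectors. The sketch below works this out by hand on made-up 2-D vectors (not values from the Google model); the helper names are illustrative only.

```python
import numpy as np

def cosine_similarity_1d(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    """Cosine distance: 1 minus cosine similarity."""
    return 1.0 - cosine_similarity_1d(a, b)

# Toy vectors, invented for illustration (assumption: not from any model)
same_direction = cosine_distance([1, 0], [2, 0])   # same angle, different length
orthogonal = cosine_distance([1, 0], [0, 3])       # 90 degrees apart
opposite = cosine_distance([1, 0], [-1, 0])        # 180 degrees apart

print(same_direction, orthogonal, opposite)
```

Vector length does not matter here, only direction, which is why the first pair has a distance of zero.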

In [None]:
# Similar Terms: We can group terms together in a list to search for neighbors 
# of this ("concept?") group.
google_model.most_similar(["dartmouth","harvard","yale"],topn=25)

## Now Try Your Own Similarity Queries

## Analogical Reasoning Task(s)

In [None]:
# The classic analogy query introduced alongside word2vec:
# man is to king as woman is to ...?
google_model.most_similar(positive=['woman', 'king'], negative=['man'])

In [None]:
# Here's another version of this query: lisbon is to portugal as oslo is to ...?
google_model.most_similar(positive=['oslo', 'portugal'], negative=['lisbon'])
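
An analogy query can be read as vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and look for the nearest remaining word. The sketch below is a simplified illustration of that idea on hand-picked 2-D vectors. It is not gensim's implementation, and `toy_vectors` and `toy_analogy` are hypothetical names.

```python
import numpy as np

# Hypothetical 2-D "embeddings", chosen by hand for illustration only
toy_vectors = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([-1.0, 2.0]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def toy_analogy(positive, negative, vectors):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    target = sum(unit(vectors[w]) for w in positive)
    target = target - sum(unit(vectors[w]) for w in negative)
    target = unit(target)
    exclude = set(positive) | set(negative)  # never return a query word
    scores = {
        w: float(np.dot(unit(v), target))
        for w, v in vectors.items()
        if w not in exclude
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

ranked = toy_analogy(["woman", "king"], ["man"], toy_vectors)
print(ranked)
```

In this toy space "queen" ranks above "apple" because it points in nearly the same direction as the combined vector.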

### Now Try Your Own Analogical Queries

In [None]:
# The model came with a list of these for several categories. 
queries = open('shared/ENGL64.05-22F/models/questions-words.txt').readlines()

# Let's display the categories:
[e.strip() for e in queries if e.startswith(':')]

In [None]:
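# Make a list of just the capital/country queries: the first category in the file,
# i.e. everything between its header line and the ': capital-world' header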
capital_queries = [q.strip().split() for q in queries[1:queries.index(': capital-world\n')]]
print(capital_queries[:5])

In [None]:
# Can you determine the accuracy of this model using these supplied queries?
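
One possible way to score the queries is sketched below: treat each row as `a b c expected`, ask a prediction function for the fourth word, and count the matches. The names `analogy_accuracy`, `toy_predict`, and `toy_rows` are hypothetical. Lowercasing the words and skipping rows with out-of-vocabulary terms are assumptions rather than part of the lab.

```python
def analogy_accuracy(query_rows, predict):
    """Fraction of (a, b, c, expected) rows where predict(a, b, c) == expected.

    Rows that are not four words long, or whose words are missing from the
    vocabulary (predict raises KeyError), are skipped rather than counted wrong.
    """
    correct = 0
    attempted = 0
    for row in query_rows:
        if len(row) != 4:
            continue
        a, b, c, expected = (w.lower() for w in row)  # assumption: lowercase vocab
        try:
            guess = predict(a, b, c)
        except KeyError:
            continue
        attempted += 1
        if guess == expected:
            correct += 1
    return correct / attempted if attempted else 0.0

# Stand-in predictor so the sketch runs without the model (made-up answers)
toy_answers = {
    ("athens", "greece", "oslo"): "norway",
    ("athens", "greece", "lisbon"): "spain",   # deliberately wrong
}

def toy_predict(a, b, c):
    return toy_answers[(a, b, c)]

toy_rows = [
    ["Athens", "Greece", "Oslo", "Norway"],
    ["Athens", "Greece", "Lisbon", "Portugal"],
    ["Athens", "Greece", "Xyz", "Nowhere"],    # not in toy_answers, so skipped
]

accuracy = analogy_accuracy(toy_rows, toy_predict)
print(accuracy)
```

With the real model, the predictor could be something like `lambda a, b, c: google_model.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]`, passed in along with `capital_queries`.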

## Plotting Neighbors in Vector Space

In [None]:
def scatter_terms_mds(term):
    neighbor_vectors=list()
    neighbor_words=list()

    for word, _ in google_model.most_similar(term,topn=15):
        neighbor_words.append(word)
        # look up the vector for the whole word (word[0] would be only its first letter)
        neighbor_vectors.append(google_model[word])
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)

    dist_matrix = 1 - cosine_similarity(neighbor_vectors)
    pos = mds.fit_transform(dist_matrix)
    xs, ys = pos[:, 0], pos[:, 1]

    fig = plt.figure(figsize=(20, 15))
    plt.clf()
    plt.title("MDS Neighboring Terms for: " + term)
    plt.style.use('ggplot')
    plt.scatter(xs, ys, marker = '^')
    for i, w in enumerate(neighbor_words):
         plt.annotate(w, xy = (xs[i], ys[i]), xytext = (3, 3),
            textcoords = 'offset points', ha = 'left', va = 'top')
    plt.show()   

In [None]:
# Now execute this function with a search term
scatter_terms_mds("")

In [None]:
def scatter_terms_pca(term):
    neighbor_vectors=list()
    neighbor_words=list()

    for word, _ in google_model.most_similar(term,topn=15):
        neighbor_words.append(word)
        # look up the vector for the whole word (word[0] would be only its first letter)
        neighbor_vectors.append(google_model[word])
   
    pca = PCA(n_components=2)
    plot_data = pca.fit_transform(neighbor_vectors)
    xs, ys = plot_data[:, 0], plot_data[:, 1]

    fig = plt.figure(figsize=(20, 15))
    plt.clf()
    plt.title("PCA Neighboring Terms for: " + term)
    plt.style.use('ggplot')
    plt.scatter(xs, ys, marker = '^')
    for i, w in enumerate(neighbor_words):
         plt.annotate(w, xy = (xs[i], ys[i]), xytext = (3, 3),
            textcoords = 'offset points', ha = 'left', va = 'top')
    plt.show()  

In [None]:
# Now execute this function with a search term
scatter_terms_pca("")

## Training Our Own Model

The cells below train a word2vec-style model from HTRC texts using Doc2Vec. This approach is imperfect
because the model is trained on a bag-of-words representation of each individual page rather than on running text.
Word2vec is typically trained on text with intact word order, but if we want to create models
from sources under copyright, this is our best option.
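
To make the bag-of-words idea concrete, the sketch below expands per-page token counts into a list of repeated words, which is the kind of "page document" the training cells work with. The counts are invented and `expand_counts` is a hypothetical helper, not part of `htrc_features`.

```python
# Hypothetical token counts for one page (invented for illustration)
page_counts = {"the": 2, "whale": 1, "1851": 1, "sea": 3}

def expand_counts(counts):
    """Repeat each alphabetic token once per occurrence; word order is not recoverable."""
    words = []
    for token, count in counts.items():
        if token.isalpha():          # drop numbers and punctuation
            words.extend([token] * count)
    return words

page_words = expand_counts(page_counts)
print(page_words)
```

Because the original order of words on the page is gone, neighboring words in this list are not meaningful context. That is why the training cell uses a very large `window`.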

In [None]:
from htrc_features import FeatureReader, utils  
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim.models.keyedvectors as kv

In [None]:
# Make this a list of HathiTrust IDs
documents = []

In [None]:
# This function extracts individual pages and creates a list of words from each page's tokens.
# Word order is lost in HTRC extracted features, so this builds page-length strings by
# repeating each token once per appearance. Thus, the token "the" with a count of 2 will
# appear as "the the" in the page string.

def get_pages(document):
    fr = FeatureReader([document])
    vol = next(fr.volumes())
    ptc = vol.tokenlist(pos=False, case=False).reset_index().drop(['section'], axis=1)
    page_list = sorted(set(ptc['page']))  # unique page numbers, in page order
    
    rows=list()
    for page in page_list:
        page_data = str()
        
        # operate on each token
        for _, row in ptc.loc[ptc['page'] == page].iterrows():
            # remaining columns are: page, token, count
            token, count = row.iloc[1], row.iloc[2]
            if token.isalpha():
                page_data += ' '.join([token] * count) + " "

        # Doc2Vec needs each page as a list of word tokens
        rows.append(page_data.split())
    return rows

In [None]:
# Process downloaded features and store each page as a TaggedDocument with a tag for its page index.
# This tag is required for Doc2Vec and would normally be based on paragraphs, but we
# can only operate on pages of data from HTRC extracted features.

pages = list()
for d in documents:
    for page in get_pages(d):
        pages.append(page)

# convert to TaggedDocument
tagged_data = [TaggedDocument(words=_d, tags=[str(i)]) for i, _d in enumerate(pages)]

In [None]:
print("creating model")
model = Doc2Vec(tagged_data, 
                dm=1, # operate on "paragraphs" (pages) with distributed memory model
                vector_size=300, # larger vector size might produce better results
                min_count=5, # drop words with very few repetitions
                window=150, # larger window size needed because of extracted features
                workers=2)

print("saving word2vec model")
model.save_word2vec_format("doc2vec-htrc-sample.w2v")

In [None]:
# load and verify
model = kv.KeyedVectors.load_word2vec_format("doc2vec-htrc-sample.w2v")

In [None]:
# "Interview" the model
vocab_size, dim = model.vectors.shape
print("vocab:", vocab_size)
print("depth:", dim)

In [None]:
# Run some sample queries
model.most_similar("")