<center>
<h1>Cultural Analytics</h1><br>
<h2>ENGL64.05 / QSS 30.16 23F</h2>
</center>

----

# Lab 3
## Vectorization

 <center><pre>Created: 10/09/2019; Revised 09/20/2022</pre></center>

<h3><font color="Green">Part One: Vectorization and Vocabulary Reduction</font></h3>

Now we're going to convert a single text into a document-term matrix. We'll use Scikit-Learn to vectorize a single text file.
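
To make the idea of a document-term matrix concrete before running the vectorizer, here is a minimal hand-rolled sketch in plain Python. It does not use Scikit-Learn or the Novel450 files: the two sentences and the names `toy_docs`, `tokenize`, `toy_vocab`, and `toy_dtm` are made up for illustration. The tokenizer only roughly mirrors `CountVectorizer`'s defaults (lowercasing, and keeping tokens of two or more word characters).

```python
import re
from collections import Counter

# Two made-up documents (not from the Novel450 dataset)
toy_docs = ["The cat sat on the mat.", "The dog sat."]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# One Counter of term frequencies per document
counts = [Counter(tokenize(doc)) for doc in toy_docs]

# The vocabulary is every distinct term, sorted alphabetically
toy_vocab = sorted(set().union(*counts))

# Rows are documents, columns are vocabulary terms
toy_dtm = [[c[term] for term in toy_vocab] for c in counts]

print(toy_vocab)
for row in toy_dtm:
    print(row)
```

Each row has one cell per vocabulary term, so most cells are zero once the vocabulary grows large. That is why the vectorizer returns a sparse matrix rather than a nested list like this one.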

In [None]:
import numpy as np 
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(input='filename',
                             strip_accents='unicode')

In [None]:
# Select a single text from the Novel450 dataset and assign the name of this file
# to a variable 'source_text'
#
# To get a list of file names (and the directory), execute these lines:
# import glob
# glob.glob("shared/ENGL64.05-22F/data/Novel450/EN*")

In [None]:
# This does the actual vectorization
dtm = vectorizer.fit_transform([source_text])

# Return total number of documents and the number of items in the vocabulary
dc, vc = dtm.shape
print("document count:",dc,"vocabulary count:",vc)

In [None]:
# what are our top terms?
vocab_sums = dtm.sum(axis=0)
sorted_vocab = [(v, vocab_sums[0, i]) for v, i in vectorizer.vocabulary_.items()]
sorted_vocab = sorted(sorted_vocab, key = lambda x: x[1], reverse=True)

# display top one hundred words (slicing from the start so the most frequent term is included)
for term, count in sorted_vocab[:100]:
    print(term,"->",count)

In [None]:
# We're now going to limit the vocabulary.
# Review the documentation for the vectorizer by executing this cell and modify the above line in 
# which we initialize the vectorizer from CountVectorizer. 
#
# FIRST:
# Remove the English language "stopwords" and check the top terms. What was removed? What remains?
#
# THEN:
# 1) Add ten new frequently repeated words to the stopword list and re-run the vectorizer, 
#    removing these terms.
# 2) Limit the vocabulary to a maximum of 1000 features and display the 100 least frequently used terms.

help(vectorizer)

<h3><font color="Green">Part Two: Vectorization of a Collection</font></h3>

Now we're going to read many texts into a document-term matrix. 
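
Before working with the real collection, it may help to see how wildcard matching and filename processing behave on a handful of placeholder files. The sketch below is only an illustration: the temporary directory, the filenames, and the `EN_<year>_<author>_<title>` naming pattern are assumptions, so check the actual Novel450 filenames before adapting the pattern or the split.

```python
import glob
import os
import tempfile

# Create a few empty placeholder files with made-up names
# (the naming pattern here is an assumption, not the real dataset)
toy_dir = tempfile.mkdtemp()
for name in ["EN_1850_AuthorA_TitleA.txt",
             "EN_1905_AuthorB_TitleB.txt",
             "EN_1922_AuthorC_TitleC.txt"]:
    open(os.path.join(toy_dir, name), "w").close()

# A wildcard pattern selects only filenames whose year starts with 19
toy_texts = sorted(glob.glob(os.path.join(toy_dir, "EN_19*")))

# Strip the directory, then keep the first two underscore-separated parts
toy_names = [os.path.basename(f) for f in toy_texts]
toy_labels = ["_".join(n.split("_")[:2]) for n in toy_names]

print(toy_names)
print(toy_labels)
```

Sorting the result of `glob.glob` keeps the document order stable between runs, which matters later when row indices in the matrix are matched back to filenames.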

In [None]:
# Use glob to match all novels published in the twentieth century 
# in the Novel450 dataset. Assign to the variable source_texts.

In [None]:
# now that this is done, let's simplify our filenames 
import os
short_names = [os.path.basename(f) for f in source_texts]

In [None]:
# A short detour: 
# While we are playing with creating lists of metadata,
# let's use a list comprehension (https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)
# to process this list a little more. The variable source_texts should hold a list of filenames.
# Make a variable "labels" from this list that contains the first part of each filename (EN + Year of Publication).
# You might want to use split to do this or find another way to process the filenames.
labels = 

In [None]:
dtm = vectorizer.fit_transform(source_texts)

# Return total number of documents and the number of items in the vocabulary. This should 
# be 37 documents.

dc, vc = dtm.shape
print("document count:",dc,"vocabulary count:",vc)

In [None]:
# create dictionary mapping vocabulary index to word
idx2voc = dict([(v, k) for k, v in vectorizer.vocabulary_.items()])

In [None]:
# what are our top terms, by frequency?
vocab_sums = dtm.sum(axis=0)
sorted_vocab = [(v, vocab_sums[0, i]) for v, i in vectorizer.vocabulary_.items()]
sorted_vocab = sorted(sorted_vocab, key = lambda x: x[1], reverse=True)

# display top one hundred words (slicing from the start so the most frequent term is included)
for term, count in sorted_vocab[:100]:
    print(term,"->",count)

In [None]:
# calculate absolute differences between our first and second text and display
# the twenty-five terms with the greatest difference.
# Convert to a dense 1-D array once instead of on every loop iteration.
abs_diff = np.abs(dtm[0] - dtm[1]).toarray()[0]
for i in np.argsort(abs_diff)[::-1][:25]:
    print(idx2voc[i],abs_diff[i])

<h3><font color="Green">Part Three: Vector Similarity</font></h3>

In this section, we'll calculate the distance between our vectors representing the modeled texts.
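
As a quick worked example of the measure used below, Euclidean distance is the square root of the summed squared differences between two vectors. The sketch uses two made-up count vectors over a three-word vocabulary (`vec_a` and `vec_b` are illustrative names, not lab variables) and plain Python rather than Scikit-Learn.

```python
import math

# Two made-up term-count vectors over the same three-word vocabulary
vec_a = [3, 0, 1]
vec_b = [0, 4, 1]

# Euclidean distance: square root of the summed squared differences
squared_diffs = [(a - b) ** 2 for a, b in zip(vec_a, vec_b)]
toy_dist = math.sqrt(sum(squared_diffs))

print(squared_diffs)
print(toy_dist)
```

Because this works on raw counts, a long text will tend to sit far from a short one simply because its counts are larger, which is worth keeping in mind when interpreting the distances.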

In [None]:
# Let's find the distance in our mapped semantic space between
# the vectors for the first and second files. We'll use
# Euclidean distance to measure the difference.

from sklearn.metrics import euclidean_distances
dist = euclidean_distances(dtm[0],dtm[1])[0][0]
print(short_names[0],"->",short_names[1],"=",dist)

In [None]:
# We can just as easily create a distance matrix for all our files
dist_matrix = euclidean_distances(dtm)

In [None]:
# NumPy's argsort can sort distances from a selected (our first text, in this example) row in the matrix:
for i in np.argsort(dist_matrix[0]):
    print(short_names[i],dist_matrix[0][i])

In [None]:
# Now vectorize all the nineteenth-century texts, dropping English-language
# stop words, and find the text that is furthest away in our mapped space from
# Jane Austen's Pride and Prejudice.