# Vector Space Models and Similarity Measures

## Introduction 

### Terminology: Vectorization, vectors, vector space

**vectorization**: the process by which text data become represented as vectors based on particular features (e.g. word frequencies, distinctive words, etc.)

> “_Vectorization_ refers to the process by which objects become represented as vectors; vectors stand in for the source objects. In the case of texts, vectorization involves the quantification of linguistic units and the values may include word frequencies, probability values, or co-occurrence relations, among other possible options.” (Dobson, James. “Vector Hermeneutics” in _Digital Scholarship in the Humanities_, Vol. 37, No. 1, 2022, p. 82)


**vectors**: sequences of numbers used to identify a point or provide coordinates of language features in semantic space. When represented as vectors, semantic relationships can be manipulated and investigated in new ways.

> “_Vectors_ should be understood as stored computable values that enable comparisons across a high-dimensional space of some number of texts.” (Dobson, James. “Vector Hermeneutics” in _Digital Scholarship in the Humanities_, Vol. 37, No. 1, 2022, p. 82)


**vector space**: the high-dimensional geometric space defined by the set of vectors abstracted from the text data

> “_Vector space_ is the large dimensional geometric space, the boundaries of which are defined by the included vectors, in which these objects exist.” (Dobson, James. “Vector Hermeneutics” in _Digital Scholarship in the Humanities_, Vol. 37, No. 1, 2022, p. 82)


**Vector space models**

> “Vector models are produced through one-way transformations of text into another space.” (Dobson, James. “Vector Hermeneutics” in _Digital Scholarship in the Humanities_, Vol. 37, No. 1, 2022, p. 82)

> “Vector space models are representations of text in that they provide an alternate mode of access to some aspects of the information contained within text.” (Dobson, James. “Vector Hermeneutics” in _Digital Scholarship in the Humanities_, Vol. 37, No. 1, 2022, p. 82)

> “[Vector models involve the] numerical representation of information extracted from textual sources.” (Dobson, James. “Vector Hermeneutics” in _Digital Scholarship in the Humanities_, Vol. 37, No. 1, 2022, p. 84)

Creating word vectors from text data involves a process in which distributions of vocabulary across bodies of text become represented in mathematical space. This allows us to imagine semantic relations as spatial relations and to use linear algebra to investigate and manipulate them. In cultural analysis, this can help us reconstruct and explore relationships between concepts in new ways.

What does this mean? Let’s look at some examples below.

### Creating word vectors

Imagine we have a fragment of text: “It was the best of times, it was the worst of times.”  

Let’s map out how these words relate to one another by creating a spreadsheet in which each row is one (unique) word in the sentence and each column is a context in which a word occurs (in this case, context means the words immediately before and after it). The value in each cell records how many times the word occurs in the given context. The numbers in a row constitute that word's vector. Because there are ten possible contexts, this is a ten-dimensional space.  

![dtm-best-worst](dtm-best-worst.png)  

Each row is the vector for one word. For example, the vector for `of` is `[0, 0, 0, 0, 1, 0, 0, 0, 1, 0]`.  

You can see that `best` and `worst` have the same vector, the same set of coordinates, because they appear in exactly the same contexts (between `the` and `of`). 
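
The table above can be rebuilt in a few lines of code. What follows is only a minimal sketch, not the procedure used to produce the image: it assumes that punctuation is stripped, that the text is lower-cased, and that two invented padding tokens (`<START>` and `<END>`) give the first and last words a two-sided context. The columns are sorted alphabetically, so their order may differ from the image.

```python
from collections import Counter

text = "It was the best of times, it was the worst of times."

# Lower-case and strip punctuation so "It" and "it" count as the same word.
tokens = [w.strip(".,").lower() for w in text.split()]

# Pad the sequence so the first and last words also have a two-sided context.
# "<START>" and "<END>" are placeholder names chosen for this sketch.
padded = ["<START>"] + tokens + ["<END>"]

# Count how often each word occurs in each (previous word, next word) context.
counts = Counter()
for prev, word, nxt in zip(padded, padded[1:], padded[2:]):
    counts[(word, (prev, nxt))] += 1

vocab = sorted(set(tokens))
contexts = sorted({context for _, context in counts})

# Each word's vector has one slot per context.
vectors = {
    word: [counts[(word, context)] for context in contexts]
    for word in vocab
}

print(len(contexts))                        # number of dimensions
print(vectors["best"] == vectors["worst"])  # same contexts, same vector
for word in vocab:
    print(word, vectors[word])
```

Under these assumptions the sketch finds ten contexts, and `best` and `worst` receive identical vectors because both occur only between `the` and `of`.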

This means that `best` and `worst` have similar semantic functions. The model is capturing something about how language behaves in this sentence: even though `best` and `worst` have opposite meanings, Dickens plays on paradox by setting up parallels between opposites. Words with opposite meanings are given the same place and function in the sentence in order to convey the sense of crisis and confusion in late 18th-century Britain and France.

**Document-Term-Matrix**  
The example above involves a single sentence. Typically, you are working with large numbers of texts, but the process of vectorization is similar. In creating a Document-Term-Matrix (one way of creating word vectors), you create a matrix (table) where each row stands for a document, and each column is a term in the vocabulary of the corpus. Each row is a vector representing the distribution of vocabulary in that document.
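
As a rough illustration of that layout, here is a minimal sketch that builds a document-term matrix by hand using only the standard library. The three "documents" and the names `toy_docs` and `toy_dtm` are invented for this example; a real corpus would need proper tokenization.

```python
from collections import Counter

# Three tiny made-up "documents" standing in for a corpus.
toy_docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat on the log",
    "doc3": "the cat chased the dog",
}

# Count the words in each document.
word_counts = {name: Counter(text.split()) for name, text in toy_docs.items()}

# The vocabulary of the whole corpus defines the columns of the matrix.
vocabulary = sorted(set().union(*word_counts.values()))

# One row per document, one column per term.
toy_dtm = {
    name: [counts[term] for term in vocabulary]
    for name, counts in word_counts.items()
}

print(vocabulary)
for name, row in toy_dtm.items():
    print(name, row)
```

Each printed row is the vector for one document. A term that a document does not contain simply gets a zero in that column.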

There are many different ways of creating word vectors involving a vast array of different parameters.


**Neural Language Models**  
word2vec implementations learn from the training data which words recur in similar context windows, and identify semantic relationships between words by observing them in context. Each vector is a record of repeated language features.

> “the word2vec embedding model, is also trained on texts but its vector space registers multiple aspects of language as found in the modeled texts. We might best think of this class of vectors as inscribed traces through their textual sources. The shape of these traces record repeated patterns of language use. Each trace, then, itself is a record, a line of inquiry and connection running through, constructing, and calling into being a semantic universe.” (Dobson, James. “Vector Hermeneutics” in _Digital Scholarship in the Humanities_, Vol. 37, No. 1, 2022, p. 91)
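
To make the idea of a "context window" more concrete, the sketch below lists the (target word, context word) pairs that fall within a small window around each word. It illustrates the windowing idea only. It is not how word2vec is implemented, and the function name `context_pairs` and its `window` parameter are made up for this example.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "it was the best of times".split()
for pair in context_pairs(tokens, window=1):
    print(pair)
```

Roughly speaking, a model like word2vec learns from many such pairs, so that words that keep turning up with the same neighbours end up with similar vectors.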

### Exploring semantic relations with word vectors: a playful example with color words

Let’s look at an example by playing with vectors for color words.

Instead of vectorizing by training on text data, imagine we have a system of color where each color is associated with a particular code point. Similar colors have similar code points. We'll be working with this color data from the [xkcd color survey](https://github.com/dariusk/corpora/blob/master/data/colors/xkcd.json). The data links a color code with a name of a color (determined by groups of participants).  [Here’s a page](https://xkcd.com/color/rgb/) that shows the colors and their codes. We'll turn this color code into a system of coordinates (i.e. vectors) and then manipulate the coordinates to explore relations between these colors.

Note: the code here is for illustrative purposes only; it will be too slow for larger vector space models. Don't get too bogged down in the details of the code. Its purpose is to illustrate what creating and manipulating word vectors can do.

In [None]:
import json

In [None]:
# Read the color data, closing the file once it has been loaded
with open("xkcd.json") as f:
    color_data = json.load(f)

In [None]:
#The following function converts colors 
#from hex format (#1a2b3c) to a tuple of integers:
def hex_to_int(s):
    s = s.lstrip("#")
    return int(s[:2], 16), int(s[2:4], 16), int(s[4:6], 16)

In [None]:
#Create a dictionary and populate it with mappings from 
#color names to RGB vectors for each color in the data:
colors = dict()
for item in color_data['colors']:
    colors[item['color']] = hex_to_int(item['hex'])

In [None]:
colors['olive']

In [None]:
colors['black']

The cell below defines two functions. `distance()` returns the Euclidean distance between two points. `closest()` finds the items closest to an arbitrary vector by computing the distance between the target vector and each item in the space, then sorting the items from closest to farthest. By default, it returns a list of the ten closest items to the given vector.

In [None]:
import math

def distance(coord1, coord2): # returns the Euclidean distance between two points
    return math.sqrt(sum([(i - j)**2 for i, j in zip(coord1, coord2)]))

def closest(space, coord, n=10):
    # sort the names in the space by their distance from coord and keep the n nearest
    return sorted(space.keys(),
                  key=lambda x: distance(coord, space[x]))[:n]

In [None]:
closest(colors, colors['red'])

In [None]:
#Defining a function to subtract one vector from another
def subtractv(coord1, coord2):
    return [c1 - c2 for c1, c2 in zip(coord1, coord2)]

In [None]:
#Defining a function to add two vectors to one another
def addv(coord1, coord2):
    return [c1 + c2 for c1, c2 in zip(coord1, coord2)]

In [None]:
#Test: 
#the distance from "red" to "green" is greater than the distance from "red" to "pink"
distance(colors['red'], colors['green']) > distance(colors['red'], colors['pink'])

In [None]:
#Finding closest vectors to vector "purple"
closest(colors, colors['purple'])

In [None]:
#Finding the word vectors closest to the vector resulting from 
#subtracting "red" from "purple"
closest(colors, subtractv(colors['purple'], colors['red']))

In [None]:
#Finding the word vectors closest to the vector resulting from 
#adding "white" and "black"
closest(colors, addv(colors['white'], colors['black']))

#### Vectors and Comparison — Computing relations between vectors

Comparison involves investigating the relations between different elements: the uniqueness, sameness, distinctions, and overlaps between elements relative to other elements.

Word vectors allow for a number of different methods for calculating relations of similarity and difference (based on properties abstracted from text into vector representation). When distributions of words have been rendered as points in space, we can ask questions such as: How far apart or close together are these two points? 


> “Vectors are defined more by the operations performed on them, the vector-wise numerical computations and comparisons, than their particular data structures and formats. While a single indexed object comprised some indeterminate sequence of values might be considered as a vector, it is the computational manipulation of multiple objects grasped in their entirety that defines a vector.” (Dobson, James. “Vector Hermeneutics” in _Digital Scholarship in the Humanities_, Vol. 37, No. 1, 2022, p. 82)

> “all words or text fragments are placed in relation to each other. It is an instrument through which we can render language comparable but this comparison always takes place through the frame inscribed by the operator.” (Dobson, James. “Vector Hermeneutics” in _Digital Scholarship in the Humanities_, Vol. 37, No. 1, 2022, p. 88)

> “Measuring pair-wise distance, the distance in vector space between two individual word vectors, even at two hundred dimensions, is a trivial task but we must choose the appropriate method by which we determine the distance, our distance metric. Do we use cosine similarity and determine the angle corresponding to our two selected vectors? Or do we use Euclidean distance and measure the shortest path through vector space? When comparing multiple vectors for visualization in two or three dimensions we might choose principal components analysis or t-distributed Stochastic Neighbor Embedding to reduce the dimensionality of our data. These methods all produce alternate representations of vector space that introduce distortions and give greater significance to some aspects of data. They do not alter nor do they rewrite the vector space model but they do create a new representation of this space. The initial model and its geometry remain intact, but the space has been selectively extracted, extended, or warped to form a new representational object.” (Dobson, James. “Vector Hermeneutics” in _Digital Scholarship in the Humanities_, Vol. 37, No. 1, 2022, p. 88)



#### Cosine similarity 

We can measure closeness and distance between vectors in vector space using distance metrics. Distance metrics allow us to investigate computationally how alike or different texts or concepts may be. With distance metrics we can take word-based features of a set of documents and measure the similarity among documents based on their distance from one another in vector space. 

According to the distributional hypothesis, the closer two words are in vector space, the more similar their contexts of use, and therefore the more similar their meanings and their semantic functions.

> “Similarity in vector space, where similarity means closeness as measured with one of the above-mentioned distance metrics within the model-defined geometrical space, has been assumed to suggest something about the similarity of the meaning of the words, or at least their usage patterns in supplied training data.” (Dobson, James. “Vector Hermeneutics” in _Digital Scholarship in the Humanities_, Vol. 37, No. 1, 2022, p. 89)


Cosine similarity is the cosine of the angle between two vectors. In general it ranges from -1 to 1; for vectors with no negative values, such as word counts, it falls between 0 and 1. The closer to 1, the smaller the angle and the more similar the elements are.

Cosine similarity describes the angle between two vectors rather than their exact placement in space. This means that cosine distance is much less affected by magnitude, or how large your numbers are, than Euclidean distance is. No matter what the lengths of the vectors are, the angle still tells us how closely the two vectors point in the same direction. This is really useful because it means that we can directly compare short and long bodies of text.
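
The point about magnitude can be checked with a small sketch. The word counts below are invented for illustration, and `cosine_sim` and `euclidean_distance` are helper names chosen for this example rather than library functions. The formula used is the standard one: the dot product of the two vectors divided by the product of their lengths.

```python
import math

def cosine_sim(v1, v2):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def euclidean_distance(v1, v2):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# Made-up word counts: a short text, a text ten times longer that uses
# the same words in the same proportions, and a text with other words.
short_text = [2, 1, 0, 3]
long_text = [20, 10, 0, 30]
other_text = [0, 3, 2, 0]

print(cosine_sim(short_text, long_text))          # same direction
print(euclidean_distance(short_text, long_text))  # yet far apart in space
print(cosine_sim(short_text, other_text))         # different direction
```

The short and long texts point in the same direction, so their cosine similarity is (up to floating-point rounding) 1, even though the Euclidean distance between them is large.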

We can therefore use cosine similarity to identify concepts that share similar semantic spaces or functions. 


### Vector arithmetic

We can also apply arithmetic to vectors to manipulate the conceptual space these vectors represent. 

Adding vectors together will bring to light the shared semantic spaces these words occupy. 

Subtracting vectors from one another will leave behind what is unique to the semantic space of a vector as distinct from the subtracted vector. Word vectors may encode polysemy, the many different meanings that are simultaneously present in a word. 
For example, bank may refer to a financial institution or to a river bank. We can subtract river from bank, which will remove the shared semantic space between these two vectors and leave us with the semantic space that bank occupies as distinct from river. 
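
The `bank` minus `river` idea can be sketched with a handful of hypothetical vectors. Everything in the block below is invented for illustration: the three dimensions (finance, water, land), the numbers, and the helper names. Real word vectors have many more dimensions, none of which carries a readable label like this.

```python
import math

# Hypothetical three-dimensional "word vectors".
# Invented dimensions: (finance, water, land).
toy_vectors = {
    "bank":  [0.6, 0.7, 0.5],
    "river": [0.0, 0.7, 0.5],
    "money": [1.0, 0.0, 0.0],
    "shore": [0.0, 0.6, 0.8],
}

def subtract(v1, v2):
    return [a - b for a, b in zip(v1, v2)]

def cosine_sim(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def rank_by_similarity(target, exclude=()):
    """Names in toy_vectors, most similar to `target` first."""
    others = [w for w in toy_vectors if w not in exclude]
    return sorted(others,
                  key=lambda w: cosine_sim(target, toy_vectors[w]),
                  reverse=True)

# Neighbours of "bank" before subtracting anything
before = rank_by_similarity(toy_vectors["bank"], exclude=["bank"])
print(before)

# Neighbours of what is left once "river" is subtracted from "bank"
remainder = subtract(toy_vectors["bank"], toy_vectors["river"])
after = rank_by_similarity(remainder, exclude=["bank", "river"])
print(after)
```

In this toy space `bank` starts out closest to `river`. Once `river` is subtracted, only the invented finance dimension remains and `money` becomes the nearest neighbour.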

These manipulations of vectors allow for radically new ways of investigating semantic relations between concepts.

## Document-term-matrix and Cosine similarity between documents using Scikit-learn

In [None]:
import glob
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
#The files we want to vectorize
input_files = glob.glob('kafka-corpus/*')
print(input_files)

In [None]:
# Converting our texts into a document-term matrix.
# Parameters:
# 1-Set input to "filename" to tell CountVectorizer to accept a list of filenames to open and read.
# 2-Set max_df to 0.7. DF stands for document frequency.
#   This parameter tells CountVectorizer that you’d like to eliminate words that appear in more than 70%
#   of the documents in the corpus. This setting will eliminate the most common words
#   (articles, pronouns, prepositions, etc.) without the need for a stop words list.
# Try modifying different parameters.
# cf. help(CountVectorizer) for more information

vectorizer = CountVectorizer(input='filename',
                             lowercase=True,
                             strip_accents='unicode',
                             max_df=0.7)
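
To see what a document-frequency cut-off like `max_df=0.7` does, here is a minimal hand-rolled sketch on a made-up corpus. It mimics the idea (drop terms that occur in more than 70% of documents) without calling scikit-learn. The documents and the variable names are invented for this illustration.

```python
# Toy corpus: each document is a list of words (invented for this sketch).
toy_docs = [
    ["the", "castle", "was", "far"],
    ["the", "trial", "was", "long"],
    ["the", "castle", "gate"],
    ["the", "verdict"],
]
max_df = 0.7

# Document frequency: the share of documents in which each term appears.
toy_vocabulary = sorted({word for doc in toy_docs for word in doc})
doc_freq = {
    term: sum(term in doc for doc in toy_docs) / len(toy_docs)
    for term in toy_vocabulary
}

# Keep only the terms whose document frequency does not exceed the cut-off.
kept = [term for term in toy_vocabulary if doc_freq[term] <= max_df]

print(doc_freq)
print(kept)
```

Here `the` appears in every document, so it is dropped, while `was` (in half of the documents) is kept.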

In [None]:
# This does the actual vectorization and creates a document term matrix
dtm = vectorizer.fit_transform(input_files)

In [None]:
# Return total number of documents and the number of items in the vocabulary
dc, vc = dtm.shape
print('document count:',dc,'vocabulary count:',vc)

In [None]:
#Inspect vocabulary items
vectorizer.vocabulary_.items()

In [None]:
# What are our top words across all documents?
vocab_sums = dtm.sum(axis=0)
sorted_vocab = [(v, vocab_sums[0, i]) for v, i in vectorizer.vocabulary_.items()]
sorted_vocab = sorted(sorted_vocab, key = lambda x: x[1], reverse=True)

# display top twenty words
for word, count in sorted_vocab[:20]:
    print(word, "->", count)

In [None]:
# Creating a distance matrix using Cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
cosine_dist_matrix = 1 - cosine_similarity(dtm)

In [None]:
#Assessing similarities between texts based on term frequencies across documents
import pandas as pd
from scipy.spatial.distance import pdist, squareform

#Creates a dataframe with cosine distances between the texts
#calculated from vectors of word counts for each text
cosine_distances = pd.DataFrame(squareform(pdist(dtm.toarray(), metric='cosine')), 
                                columns=input_files, index=input_files)
cosine_distances

## Word vectors using word2vec and gensim

In [None]:
import gensim
from pathlib import Path

In [None]:
#Reading all texts into a single list
directory_path = 'kafka-corpus'
all_docs = []

for filepath in Path(directory_path).glob("*.txt"):
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read()
        all_docs.append(text)

len(all_docs)

Gensim's word2vec takes tokenized sentences as its input. So let's define a function that takes a list of texts (e.g. our all_docs list) and converts it into sentences for gensim word2vec to use. The function will lower-case the text and tokenize it by sentence and by word.

In [None]:
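# sent_tokenize relies on NLTK's Punkt sentence tokenizer data, which may need
# to be downloaded once with nltk.download() (the resource is named 'punkt' or
# 'punkt_tab' depending on the NLTK version)
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

def make_sentences(list_txt):
    all_txt = []
    for txt in list_txt:
        lower_txt = txt.lower()
        # split the text into sentences, then each sentence into word tokens
        sentences = sent_tokenize(lower_txt)
        sentences = [tokenizer.tokenize(sent) for sent in sentences]
        all_txt += sentences
    return all_txt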

sentences = make_sentences(all_docs)

In [None]:
# Training the model
# Try playing around with vector_size and min_count to see how that affects the model
kafka_model = gensim.models.Word2Vec(
    sentences,
    min_count=5, # default is 5; ignores all words with total frequency lower than 5
    vector_size=50) # dimensionality of the word vectors; default is 100

In [None]:
#Inspect the vocabulary
kafka_model.wv.key_to_index

In [None]:
# Find nearest word vectors by cosine similarity
kafka_model.wv.most_similar('gregor', topn=10)

In [None]:
# Find cosine similarity between two given word vectors; the closer to 1, the more similar
print(kafka_model.wv.similarity(w1='disgust',w2='food'))

In [None]:
# Subtract two vectors
# Remove the sense of "life" from "hopeless" and see which words are nearest to what remains
kafka_model.wv.most_similar(positive=['hopeless'], negative=['life'])

_Acknowledgements_: This notebook is inspired by Allison Parrish’s [“Understanding word vectors” notebook](https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469), Dan Sinykin's ["Word Embedding Models" notebook](https://github.com/sinykin/QTM-340/blob/master/notebooks/class13-word-vectors-complete-ds.ipynb), Teddy Roland's ["Word2Vec" notebook](https://github.com/teddyroland/BBB-Word2Vec/blob/master/Word2Vec.ipynb), and John Ladd's ["Understanding and Using Common Similarity Measures for Text Analysis"](https://programminghistorian.org/en/lessons/common-similarity-measures).