# Vector space model

A vector space model represents documents as vectors in a shared vector space. First, a vocabulary is built as the set of all words in the corpus and sorted alphabetically into a list. Each document is then represented as a Python list in which each element is the count of the corresponding word in the vocabulary list. Once a vector exists for each document, cosine similarity can be computed for any pair of documents. The NumPy library is used for the vector operations.
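
To make the steps above concrete before turning to the real corpus, here is a minimal sketch on a two-document toy corpus. The sentences and the names `toy_docs`, `toy_vocab`, and `toy_vectors` are made up for illustration and are not part of the notebook's data.

```python
# Toy corpus (hypothetical; not the notebook's school texts)
toy_docs = [
    "the heart pumps blood".split(),
    "the court hears the case".split(),
]

# Vocabulary: every distinct word in the corpus, sorted alphabetically
toy_vocab = sorted(set(word for doc in toy_docs for word in doc))

# One count vector per document, aligned with the vocabulary order
toy_vectors = [[doc.count(term) for term in toy_vocab] for doc in toy_docs]

print(toy_vocab)
# ['blood', 'case', 'court', 'hears', 'heart', 'pumps', 'the']
for vec in toy_vectors:
    print(vec)
# [1, 0, 0, 0, 1, 1, 1]
# [0, 1, 1, 1, 0, 0, 2]
```

Every vector has the same length as the vocabulary, so position 6 always means "the" no matter which document the vector came from. That shared indexing is what makes the vectors comparable.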

This notebook uses 4 texts on 4 subjects. Each text is split in half by character count, so the corpus consists of 8 documents. The first part of the notebook shows how to create a vector space model from scratch in Python. The second part shows how to build the vectors and compute cosine similarity with sklearn.

In [1]:
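# imports
# NLTK data (tokenizer models, WordNet, stopword lists) must be downloaded once with nltk.download()
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# note: this rebinds the name `stopwords` from the corpus reader to a list of English stopwords
stopwords = stopwords.words('english')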

In [2]:
# read each document and break each into 2 pieces
# create a list of tokenized docs: [anat1, anat2, buslaw1, buslaw2, econ1, econ2, geog1, geog2]
docs = []

for name in ['anat', 'buslaw', 'econ', 'geog']:
    with open(f'../school_texts/{name}.txt', 'r') as f:
        text = f.read().lower().replace('\n', ' ')
    # split at the character midpoint; a word straddling the midpoint is cut in two
    i = len(text) // 2
    docs.append(word_tokenize(text[:i]))
    docs.append(word_tokenize(text[i:]))

In [3]:
# preprocess: remove non-alpha, remove stopwords, lemmatize
docs_preprocessed = [
    [wnl.lemmatize(w) for w in doc if w.isalpha() and w not in stopwords]
    for doc in docs
]

In [4]:
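# build the vocabulary: the set of all distinct tokens across the corpus
vocab = set()
for doc in docs_preprocessed:
    vocab.update(doc)

# sort alphabetically so each term has a fixed index in every vector
vocab = sorted(vocab)
print('vocab length:', len(vocab))
vocab[:5]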

vocab length: 3601


['abandoned', 'abdominal', 'abide', 'ability', 'able']

In [5]:
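# build one count vector per document, aligned with the sorted vocab
vectors = []
for doc in docs_preprocessed:
    # element k is the number of times vocab[k] occurs in this doc
    vec = [doc.count(t) for t in vocab]
    vectors.append(vec)
# first 10 counts of anat1
print(vectors[0][:10])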

[0, 4, 0, 0, 0, 0, 0, 0, 0, 1]
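
The loop above calls `doc.count(t)` once per vocabulary term, which rescans the whole document each time. That is fine for 8 documents, but for a larger corpus a single counting pass per document is cheaper. The following is a sketch of one alternative using `collections.Counter`; the helper name `count_vector` and the sample data are hypothetical, not part of the notebook.

```python
from collections import Counter

def count_vector(doc_tokens, vocab):
    """Return the term counts of doc_tokens, ordered like vocab."""
    counts = Counter(doc_tokens)              # one pass over the document
    return [counts[term] for term in vocab]   # a Counter returns 0 for missing terms

# Small stand-in data; in the notebook these would be a doc from
# docs_preprocessed and the sorted vocab list
sample_vocab = ['ability', 'able', 'bone', 'muscle']
sample_doc = ['bone', 'muscle', 'bone', 'tissue']

print(count_vector(sample_doc, sample_vocab))
# [0, 0, 2, 1]
```

Tokens that are not in the vocabulary (here `'tissue'`) are simply ignored, which matches the behavior of the `doc.count` version.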


In [6]:
from numpy import dot
from numpy.linalg import norm

In [7]:
# function to compute cosine similarity
def cos_sim(v1, v2):
    return float(dot(v1, v2)) / (norm(v1) * norm(v2))
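
For reference, the function above computes the standard cosine similarity, the dot product of the two vectors divided by the product of their lengths:

$$
\cos(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\lVert \mathbf{v}_1 \rVert \, \lVert \mathbf{v}_2 \rVert}
$$

As a worked illustration with two small made-up vectors (not taken from the corpus), let $\mathbf{v}_1 = (1, 0, 1)$ and $\mathbf{v}_2 = (1, 1, 0)$:

$$
\cos(\mathbf{v}_1, \mathbf{v}_2) = \frac{1 \cdot 1 + 0 \cdot 1 + 1 \cdot 0}{\sqrt{2} \cdot \sqrt{2}} = \frac{1}{2}
$$

Because count vectors have no negative entries, the result always falls between 0 (no shared terms) and 1 (same direction).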

In [12]:
# compute cosine similarity for the first doc, paired with all docs
for i, vec in enumerate(vectors):
    print('cosine similarity anat1 and vector', i+1, '=', format(cos_sim(vectors[0], vec), '.2f'))

cosine similarity anat1 and vector 1 = 1.00
cosine similarity anat1 and vector 2 = 0.72
cosine similarity anat1 and vector 3 = 0.05
cosine similarity anat1 and vector 4 = 0.05
cosine similarity anat1 and vector 5 = 0.06
cosine similarity anat1 and vector 6 = 0.06
cosine similarity anat1 and vector 7 = 0.06
cosine similarity anat1 and vector 8 = 0.10


### Results

In the results above, the cosine similarity of anat1 with itself is 1.00, the maximum possible value. The similarity of anat1 with anat2, the other half of the same text, is high at 0.72, while the similarity with every other document is 0.10 or lower. The approach therefore separates the anatomy text from the other subjects cleanly. Preprocessing is key: without stopword removal, all the docs would seem more similar to each other than they really are, because common function words occur in every text.
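
The effect of stopword removal can be sketched on two made-up sentences that share no content words. Everything here is hypothetical and chosen for illustration: the sentences, the tiny stopword set `toy_stopwords`, and the helpers `cosine` and `vectorize`. The sketch uses only the standard library rather than the notebook's NumPy function.

```python
import math

def cosine(v1, v2):
    """Cosine similarity of two equal-length, non-zero count vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

def vectorize(token_docs):
    """Build count vectors over the sorted vocabulary of token_docs."""
    vocab = sorted(set(w for doc in token_docs for w in doc))
    return [[doc.count(t) for t in vocab] for doc in token_docs]

toy_stopwords = {'the', 'of', 'is', 'a'}  # tiny stand-in for NLTK's list
doc_a = "the heart is a muscle of the body".split()
doc_b = "the contract is a promise of the parties".split()

# Similarity with stopwords left in
sim_with = cosine(*vectorize([doc_a, doc_b]))

# Similarity after filtering stopwords out
filtered = [[w for w in doc if w not in toy_stopwords] for doc in (doc_a, doc_b)]
sim_without = cosine(*vectorize(filtered))

print('with stopwords:   ', format(sim_with, '.2f'))     # 0.70
print('without stopwords:', format(sim_without, '.2f'))  # 0.00
```

The two sentences are about unrelated subjects, yet they score 0.70 purely because of shared function words. Once those are removed, the score drops to 0.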

## Using sklearn

The sklearn package has vectorizer classes for converting docs to vectors. Since the 'docs' above were already tokenized, the code below creates 'docs2', which joins the preprocessed tokens back together into plain text.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# join each preprocessed token list back into a single string
docs2 = [' '.join(doc) for doc in docs_preprocessed]

tfidf_docs = tfidf_vectorizer.fit_transform(docs2)
print('docs shape:', tfidf_docs.shape)

docs shape: (8, 3593)



In [19]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(tfidf_docs[0], tfidf_docs)

array([[1.        , 0.70338879, 0.02614208, 0.0192754 , 0.02797854,
        0.02504257, 0.0273924 , 0.05087871]])

### Results

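We see here that anat1 and anat2 again have a high similarity (0.70), whereas the similarity of anat1 with every other document is about 0.05 or lower. The pattern matches the 'from scratch' code above, but the absolute scores differ because the from-scratch vectors hold raw term frequencies while the sklearn vectors hold tf-idf weights, which downweight terms that appear in many documents. The sklearn vocabulary is also slightly smaller (3593 terms versus 3601), most likely because the default token pattern of TfidfVectorizer ignores single-character tokens.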
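
To show how tf-idf weighting differs from raw counts, here is a standard-library sketch. It assumes the smoothed idf formula given in the scikit-learn documentation for the default `TfidfVectorizer` settings, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The function `tfidf_vectors` and the toy documents are hypothetical, and the sketch skips sklearn's own tokenization, so it illustrates the weighting only and is not a replacement for the library.

```python
import math

def tfidf_vectors(token_docs):
    """Return (vocab, idf, vectors) for a list of non-empty token lists."""
    vocab = sorted(set(w for doc in token_docs for w in doc))
    n = len(token_docs)
    # document frequency: number of docs that contain each term
    df = {t: sum(1 for doc in token_docs if t in doc) for t in vocab}
    # smoothed idf (assumed to match sklearn's smooth_idf=True default)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in token_docs:
        raw = [doc.count(t) * idf[t] for t in vocab]
        length = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / length for x in raw])  # L2-normalize the row
    return vocab, idf, vectors

toy_docs = [
    "cell tissue cell organ".split(),
    "cell contract law".split(),
    "contract law court".split(),
]
toy_vocab, toy_idf, toy_tfidf = tfidf_vectors(toy_docs)

# 'cell' appears in two docs, 'tissue' in one, so 'tissue' gets the larger idf
print(format(toy_idf['cell'], '.3f'), format(toy_idf['tissue'], '.3f'))

# Rows have unit length, so cosine similarity is just the dot product
sim_0_1 = sum(a * b for a, b in zip(toy_tfidf[0], toy_tfidf[1]))
sim_0_2 = sum(a * b for a, b in zip(toy_tfidf[0], toy_tfidf[2]))
print(format(sim_0_1, '.2f'), format(sim_0_2, '.2f'))
```

Documents 0 and 2 share no terms, so their similarity is exactly 0, while documents 0 and 1 overlap only on 'cell', a term that tf-idf weights less heavily than the rarer terms.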