#### Objectives
- Generate a word-context matrix for a given corpus with a specified window size
- Find the similarity between two words using the word-context matrix
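
As a small illustration of the first objective, the sketch below counts co-occurrences for a toy sentence with a window of 1. The sentence, the window size, and the `build_matrix` helper are assumptions chosen for illustration; they are not part of the notebook's corpus or code.

```python
from collections import defaultdict

def build_matrix(tokens, window_size):
    """Count how often each context word falls within window_size positions of each word."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[word][tokens[j]] += 1
    return counts

tokens = "the cat sat on the mat".split()
matrix = build_matrix(tokens, window_size=1)
print(dict(matrix["the"]))  # {'cat': 1, 'on': 1, 'mat': 1}
```

The word `the` occurs twice, so its row sums the neighbours of both occurrences: `cat` from the first, `on` and `mat` from the second.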

In [1]:
%pip install -q pandas

Note: you may need to restart the kernel to use updated packages.


In [6]:
# Read the whole corpus and flatten it onto a single line.
with open('corpus.txt', 'r') as f:
    corpus = f.read()
corpus = corpus.replace('\n', ' ')

In [7]:
corpus

'It is hard to imagine a world without Shakespeare. Since their composition four hundred years ago, Shakespeare’s plays and poems have traveled the globe, inviting those who see and read his works to make them their own. Readers of the New Folger Editions are part of this ongoing process of “taking up Shakespeare,” finding our own thoughts and feelings in language that strikes us as old or unusual and, for that very reason, new. We still struggle to keep up with a writer who could think a mile a minute, whose words paint pictures that shift like clouds. These expertly edited texts are presented to the public as a resource for study, artistic adaptation, and enjoyment. By making the classic texts of the New Folger Editions available in electronic form as The Folger Shakespeare (formerly Folger Digital Texts), we place a trusted resource in the hands of anyone who wants them. The New Folger Editions of Shakespeare’s plays, which are the basis for the texts realized here in digital form, 

In [8]:
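# Split on spaces and keep purely alphabetic tokens. isalpha() already rejects
# empty strings and bare punctuation; it also drops words with attached
# punctuation such as 'Shakespeare.'.
words = [w for w in corpus.split(' ') if w.isalpha()]
unique_words = list(set(words))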
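
The filter above keeps only tokens for which `str.isalpha()` is true, so a word with attached punctuation is discarded rather than cleaned. If keeping those words is preferred, one possible alternative is a regular-expression tokenizer. This is a sketch, not what the notebook does; the `tokenize` helper and the sample sentence handling are assumptions.

```python
import re

def tokenize(text):
    """Return runs of letters, discarding surrounding punctuation and digits."""
    # [^\W\d_]+ matches one or more letters (word characters minus digits and underscore).
    return re.findall(r"[^\W\d_]+", text)

sample = "It is hard to imagine a world without Shakespeare."

# The notebook's approach: 'Shakespeare.' fails isalpha() and is dropped.
kept = [w for w in sample.split(' ') if w.isalpha()]
print(kept)

# The regex approach keeps the word and drops only the full stop.
print(tokenize(sample))
```

One trade-off to be aware of: a possessive such as `Shakespeare’s` would be split into `Shakespeare` and `s` by this pattern.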

In [9]:
unique_words

['writer',
 'scholars',
 'early',
 'hope',
 'poems',
 'indispensable',
 'process',
 'four',
 'regular',
 'Barbara',
 'up',
 'expertise',
 'basis',
 'scholarship',
 'finding',
 'inspired',
 'these',
 'electronic',
 'texts',
 'where',
 'gained',
 'artwork',
 'struggle',
 'commend',
 'Werstine',
 'inviting',
 'whose',
 'holdings',
 'best',
 'think',
 'Paul',
 'visiting',
 'to',
 'shift',
 'modern',
 'Editions',
 'textual',
 'edited',
 'with',
 'collection',
 'greatest',
 'read',
 'anyone',
 'their',
 'unparalleled',
 'I',
 'here',
 'us',
 'trusted',
 'been',
 'pictures',
 'realized',
 'form',
 'words',
 'his',
 'in',
 'feelings',
 'documentary',
 'exists',
 'works',
 'strikes',
 'place',
 'world',
 'performance',
 'unusual',
 'ongoing',
 'see',
 'hands',
 'Shakespeare',
 'richness',
 'years',
 'Since',
 'old',
 'An',
 'those',
 'It',
 'can',
 'source',
 'Mowat',
 'presented',
 'paths',
 'very',
 'physical',
 'for',
 'about',
 'consulted',
 'available',
 'are',
 'Readers',
 'more',
 'Libra

In [10]:
import pandas as pd

window_size = 5
# Rows are target words, columns are context words; cells hold co-occurrence counts.
word_context_matrix = pd.DataFrame(0, index=unique_words, columns=unique_words)

for i, word in enumerate(words):
    # Visit up to window_size tokens on each side of position i.
    for j in range(max(0, i - window_size), min(len(words), i + window_size + 1)):
        if j == i:
            continue
        word_context_matrix.at[word, words[j]] += 1

In [12]:
word_context_matrix.head()

Unnamed: 0,writer,scholars,early,hope,poems,indispensable,process,four,regular,Barbara,...,digital,editions,Folger,keep,still,paint,reflect,thoughts,which,classic
writer,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
scholars,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
early,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
hope,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
poems,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
import math

def cosine_similarity(item1, item2):
    """Cosine of the angle between two equal-length count vectors."""
    dot_product = sum(a * b for a, b in zip(item1, item2))
    magnitude1 = math.sqrt(sum(a ** 2 for a in item1))
    magnitude2 = math.sqrt(sum(b ** 2 for b in item2))
    if magnitude1 == 0 or magnitude2 == 0:
        # An all-zero vector has no direction; treat its similarity as 0.
        return 0.0
    return float(dot_product / (magnitude1 * magnitude2))


def get_vector(word, word_context_matrix):
    return word_context_matrix.loc[word].values
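
A quick sanity check of the cosine formula, dot(a, b) / (|a| |b|), on hand-picked vectors may help. The vectors are made up for illustration, and `cosine` is a standalone re-implementation so that the block runs on its own.

```python
import math

def cosine(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Parallel vectors point the same way, so the result is 1 up to rounding error.
print(cosine([1, 2, 0], [2, 4, 0]))

# Vectors with no shared non-zero position are orthogonal: similarity 0.
print(cosine([1, 0, 0], [0, 3, 0]))
```

Because co-occurrence counts are never negative, similarities computed from them fall between 0 and 1.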

In [14]:
# Initialise with 0.0 so the frame has a float dtype; the values assigned later are floats.
cosine_similarity_matrix = pd.DataFrame(
    0.0, index=unique_words, columns=unique_words)

In [16]:
for word1 in unique_words:
    item1 = get_vector(word1, word_context_matrix)
    for word2 in unique_words:
        item2 = get_vector(word2, word_context_matrix)
        cosine_similarity_matrix.at[word1, word2] = cosine_similarity(item1, item2)


In [17]:
cosine_similarity_matrix.head()

Unnamed: 0,writer,scholars,early,hope,poems,indispensable,process,four,regular,Barbara,...,digital,editions,Folger,keep,still,paint,reflect,thoughts,which,classic
writer,1.0,0.0,0.083333,0.109109,0.0,0.0,0.083333,0.0,0.0,0.091287,...,0.176777,0.0,0.110702,0.547723,0.365148,0.416667,0.0,0.091287,0.0,0.0
scholars,0.0,1.0,0.0,0.109109,0.273861,0.273861,0.0,0.0,0.433013,0.0,...,0.235702,0.273861,0.309965,0.0,0.0,0.0,0.416667,0.0,0.357217,0.333333
early,0.083333,0.0,1.0,0.218218,0.091287,0.182574,0.333333,0.091287,0.144338,0.182574,...,0.353553,0.182574,0.221404,0.091287,0.091287,0.0,0.166667,0.273861,0.306186,0.25
hope,0.109109,0.109109,0.218218,1.0,0.119523,0.119523,0.0,0.119523,0.0,0.239046,...,0.154303,0.119523,0.0,0.119523,0.239046,0.109109,0.109109,0.239046,0.066815,0.109109
poems,0.0,0.273861,0.091287,0.119523,1.0,0.1,0.0,0.4,0.237171,0.1,...,0.193649,0.1,0.145521,0.0,0.0,0.0,0.182574,0.1,0.167705,0.273861
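
The nested loop evaluates the similarity once per word pair, so its cost grows quadratically with the vocabulary. A vectorized sketch with NumPy is shown below. It uses a tiny stand-in count matrix (made-up values, so the block runs on its own) rather than the notebook's `word_context_matrix`, and it assumes no row is all zeros.

```python
import numpy as np

# Stand-in co-occurrence counts for a three-word vocabulary (made-up values).
vocab = ["a", "b", "c"]
counts = np.array([[0, 2, 1],
                   [2, 0, 0],
                   [1, 0, 0]], dtype=float)

# Scale every row to unit length; assumes no row is all zeros.
norms = np.linalg.norm(counts, axis=1, keepdims=True)
unit_rows = counts / norms

# Entry [i, j] is the cosine similarity between rows i and j.
similarity = unit_rows @ unit_rows.T
print(np.round(similarity, 3))
```

With the notebook's data, the count array could presumably be obtained with `word_context_matrix.to_numpy(dtype=float)` and the result wrapped back into a DataFrame indexed by `unique_words`.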


#### Finding similarities between words

In [18]:
first_word = input("Enter word one: ")
second_word = input("Enter word two: ")

try:
  similarity = cosine_similarity_matrix.at[first_word, second_word]
  print(f"Similarity between {first_word} and {second_word} is {similarity}")
except KeyError:
  print("One or both of the words are not in the corpus")

One or both of the words are not in the corpus
