# LELA60331 Computational Linguistics 1 Week 4

This week we are going to look at vector-based models of word meaning. First, I need to introduce a Python library called NumPy (https://numpy.org/devdocs/user/absolute_beginners.html).

### NumPy

NumPy is widely used for representing and processing arrays, including multidimensional arrays (known to us as vectors, matrices and tensors). It is fast, intuitive and has lots of helpful built-in functions (we will make use of some of these later in the semester).

To use NumPy we need to import it as follows. Importing numpy under the name np is a widely used convention.

In [1]:
import numpy as np

We can create NumPy arrays filled with zeros as follows:

In [3]:
# For a 1-dimensional array
p = np.zeros(4)

In [4]:
p

array([0., 0., 0., 0.])

In [None]:
# For a 2 dimensional array
np.zeros((4, 5))
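
As a quick check on what these calls produce, the sketch below inspects the `shape`, `ndim` and `dtype` attributes of the two arrays. The variable names `vec` and `mat` are illustrative only. Note that `np.zeros` returns an array filled with floating-point zeros rather than an uninitialised one.

```python
import numpy as np

vec = np.zeros(4)        # rank 1: four zeros
mat = np.zeros((4, 5))   # rank 2: 4 rows, 5 columns

print(vec.shape, vec.ndim)   # shape (4,), 1 dimension
print(mat.shape, mat.ndim)   # shape (4, 5), 2 dimensions
print(mat.dtype)             # float64: the zeros are floats
```

The shape is passed as a single tuple, which is why the 2-dimensional call needs the double parentheses.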

We can also create them from Python lists as follows:

In [None]:
# Example vector
np.array([9,2,3,5])

In [None]:
# Example rank 2 tensor (specifically a 2x4 matrix)
np.array(([9,2,3,5],[4,6,7,3]))

In [None]:
# Example rank 3 tensor 3x2x4
np.array([[[0, 1, 2, 3],[4, 5, 6, 7]],[[0, 1, 2, 3],[4, 5, 6, 7]],[[0 ,1 ,2, 3],[4, 5, 6, 7]]])

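The arrays must be rectangular, not ragged. Otherwise recent versions of NumPy raise a ValueError like the one produced by the cell below (older versions issue a deprecation warning instead):

In [None]:
# Ragged input: the third row has 5 elements, the other rows have 4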
np.array(([9,2,3,5],[4,6,7,3],[5,7,1,2,7]))
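
If we want to see the failure without interrupting the notebook, one option is to wrap the call in a small helper. This is only a sketch: `try_build` is a hypothetical name, and it assumes a recent NumPy in which ragged input raises `ValueError` (on older versions the call may instead succeed and return an array of Python objects).

```python
import numpy as np

def try_build(rows):
    """Return (array, None) on success, or (None, message) if NumPy refuses."""
    try:
        return np.array(rows), None
    except ValueError as err:
        return None, str(err)

# Rectangular input: every row has 4 elements
ok, _ = try_build([[9, 2, 3, 5], [4, 6, 7, 3]])
print(ok.shape)

# Ragged input: the last row has 5 elements
ragged, message = try_build([[9, 2, 3, 5], [4, 6, 7, 3], [5, 7, 1, 2, 7]])
print(ragged, message)
```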

Just as with Python lists we can use indices to find individual values:

In [None]:
a=np.array([9,2,3,5])
a[1]

And ranges:

In [None]:
a[1:3]

We can do the same for multidimensional arrays. Indexes should be in the order of nesting. So for a rank 2 tensor the row index comes first and the column second:

In [None]:
a=np.array(([9, 2, 3, 5],
       [4, 6, 7, 3],
       [5, 7, 1, 2]))
a[1,0]
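
Indices and ranges can also be combined to pull out whole rows, whole columns or sub-blocks. A short sketch on the same matrix follows; the name `grid` is used here only to avoid overwriting `a`.

```python
import numpy as np

grid = np.array([[9, 2, 3, 5],
                 [4, 6, 7, 3],
                 [5, 7, 1, 2]])

print(grid[1])         # the whole second row: 4, 6, 7, 3
print(grid[:, 0])      # the whole first column: 9, 4, 5
print(grid[0:2, 1:3])  # rows 0-1 and columns 1-2: a 2x2 block
```

A bare `:` means "everything along this dimension", so `grid[:, 0]` reads as "all rows, column 0".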

We can assign values to particular positions in our tensor using indices:

In [None]:
a[0,0] = 1000
a[2,1] = 2000
print(a)

For vectors we can perform the operations that we learned about in our lecture as follows:

In [None]:
# Vector addition
a = np.array(([9,2,3,5]))
b = np.array(([1,2,3,4]))
c=a+b
print(a)
print(b)
print(c)

In [None]:
# Vector subtraction
a = np.array(([9,2,3,5]))
b = np.array(([1,2,3,4]))
c=a-b
print(a)
print(b)
print(c)

In [None]:
# Dot product
a = np.array(([9,2,3,5]))
b = np.array(([1,2,3,4]))
c=a*b
dp=sum(c)
print(a)
print(b)
print(c)
print(dp)
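
The multiply-then-sum steps above can also be written in a single call. The sketch below compares the manual version with `np.dot` and the `@` operator, which compute the same quantity for rank 1 arrays; the names `dp_manual`, `dp_dot` and `dp_at` are illustrative.

```python
import numpy as np

a = np.array([9, 2, 3, 5])
b = np.array([1, 2, 3, 4])

dp_manual = (a * b).sum()  # element-wise product, then sum
dp_dot = np.dot(a, b)      # built-in dot product
dp_at = a @ b              # operator form of the same thing

print(dp_manual, dp_dot, dp_at)  # 42 42 42
```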

Problem 1: Write the code to calculate the cosine of the angle between vector a and vector b. You might need to refer to your lecture notes.

In [None]:
a = np.array(([9,2,3,5]))
b = np.array(([1,2,3,4]))
cosine = ??????

### Building Word Vectors

In [None]:
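import re
# Download the text from the internet
!wget https://www.gutenberg.org/files/2554/2554-0.txt
# Or, if wget is not available:
#from urllib.request import urlretrieve
#url = "https://www.gutenberg.org/files/2554/2554-0.txt"
#filename="2554-0.txt"
#urlretrieve(url, filename)
# read in the file
with open('2554-0.txt', encoding='utf-8') as f:
    c_and_p = f.read()
# start at the first chapter: the offset 5464 was determined by inspection
# (note that this keeps everything from that point onwards)
c_and_p = c_and_p[5464:]
# convert text to lower case
c_and_p = c_and_p.lower()
# replace newlines with spaces
c_and_p = re.sub('\n', ' ', c_and_p)
# remove everything except lower-case letters and spaces
c_and_p = re.sub('[^a-z ]', '', c_and_p)
# split on whitespace; unlike splitting on a single space, this leaves no empty tokens
c_and_p = c_and_p.split()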

In [None]:
c_and_p[1:10]
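
To see what each cleaning step does, it can help to run the same steps on a short string. The sketch below uses a made-up sentence (`sample` is hypothetical, not taken from the novel). Note that splitting on a single space character leaves an empty string wherever two spaces are adjacent, whereas `str.split()` with no argument does not.

```python
import re

sample = "He walked.\nShe ran -- fast!"  # hypothetical input

text = sample.lower()
text = re.sub('\n', ' ', text)       # newlines become spaces
text = re.sub('[^a-z ]', '', text)   # drop punctuation and digits
print(repr(text))                    # 'he walked she ran  fast' (two spaces)

tokens_single_space = re.split(" ", text)
tokens_whitespace = text.split()
print(tokens_single_space)  # contains an empty string between 'ran' and 'fast'
print(tokens_whitespace)    # no empty strings
```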

Problem 2 (to be tackled collectively): Write code to create a word-by-word matrix

In [None]:
token_count = len(c_and_p)
type_list = list(set(c_and_p))
# The type count is the number of unique words. The token count is the total number of words including repetitions.
type_count = len(type_list)
# We create a matrix in which to store the counts for each word-by-word co-occurrence
M = np.zeros((type_count, type_count))
window_size = 2

# COMPLETE CODE
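
The difference between types and tokens is easy to check on a toy sentence. The sketch below is not a solution to Problem 2; it only sets up the same kind of empty matrix on a five-word example. All names prefixed with `toy_` are hypothetical. It also shows two optional conveniences: sorting the types so that row order is the same on every run, and a dictionary that maps each word to its row without searching the list.

```python
import numpy as np

toy_tokens = "the cat saw the dog".split()
toy_types = sorted(set(toy_tokens))  # unique words in a fixed order
toy_index = {word: i for i, word in enumerate(toy_types)}

# One row and one column per type, all counts initially zero
toy_M = np.zeros((len(toy_types), len(toy_types)))

print(len(toy_tokens), len(toy_types))  # 5 tokens, 4 types
print(toy_index["the"])                 # row/column for "the": 3
print(toy_M.shape)                      # (4, 4)
```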

Problem 3: Calculate the cosine between "walk" and "run", and between "walk" and "shine". What does the outcome tell us?

In [None]:
w1 = "walk"
w2 = "run"
w3 = "shine"
w1_index = type_list.index(w1)
w2_index = type_list.index(w2)
w3_index = type_list.index(w3)

### Pretrained embeddings

Vectors are best when learned from very large text collections. However, learning such vectors, particularly using neural network methods rather than simple counting, is very computationally intensive. As a result, most people make use of pretrained embeddings such as those found at

https://code.google.com/archive/p/word2vec/

or

https://nlp.stanford.edu/projects/glove/

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
#from urllib.request import urlretrieve
#url = "http://nlp.stanford.edu/data/glove.6B.zip"
#filename="glove.6B.zip"
#urlretrieve(url, filename)
!unzip -q glove.6B.zip

In [None]:
import numpy as np
embedding_file = 'glove.6B.100d.txt'
embeddings = []
type_list = []
# Each line holds a word followed by the space-separated components of its vector
with open(embedding_file, encoding='utf-8') as fp:
    for line in fp:
        fields = line.rstrip("\n").split(" ")
        word = fields[0]
        vec = np.array([float(x) for x in fields[1:]])
        type_list.append(word)
        embeddings.append(vec)
# Stack the vectors into a matrix with one row per word
M = np.array(embeddings)

In [None]:
w1 = "football"
w2 = "rugby"
w3 = "cricket"
w1_index = type_list.index(w1)
w2_index = type_list.index(w2)
w3_index = type_list.index(w3)
w1_vec=M[w1_index,]
w2_vec=M[w2_index,]
w3_vec=M[w3_index,]

Problem 4: Calculate the cosine between the words above. What do the cosine values tell us?

### Finding the most similar words

One thing we often want to do is find the words most similar to a given word/vector. An exhaustive N x N comparison is very time-consuming, so we make use of an efficient "nearest neighbours" algorithm. We are only using this algorithm here, so we won't go into it in any detail. We use the implementation in the scikit-learn toolkit, which we will learn more about in the Python programming sessions in RM in CCL 2.

In [None]:
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(M)

In [None]:
w="football"
w_index = type_list.index(w)
w_vec = M[w_index,]
for i in nbrs.kneighbors([w_vec])[1][0]:
  print(type_list[i])
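
For comparison, the brute-force alternative for a single query is simple to write down: measure the distance from the query vector to every row and sort. The sketch below does this on a tiny made-up matrix (`toy_words`, `toy_M` and `nearest` are hypothetical names, and the 2-dimensional vectors are invented for illustration). It ranks by Euclidean distance, which is also what `NearestNeighbors` uses with its default settings, so the query word itself comes back first with distance 0.

```python
import numpy as np

toy_words = ["football", "rugby", "cricket", "banana"]
toy_M = np.array([[1.0, 0.9],
                  [0.9, 1.0],
                  [0.8, 0.7],
                  [-1.0, 0.2]])

def nearest(query_vec, matrix, k=3):
    """Indices of the k rows closest to query_vec by Euclidean distance."""
    distances = np.linalg.norm(matrix - query_vec, axis=1)
    return np.argsort(distances)[:k]

for i in nearest(toy_M[0], toy_M):
    print(toy_words[i])
```

This compares the query against every row, which is fine for four words but slow for a vocabulary of hundreds of thousands; that cost is what the tree-based index is meant to reduce.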

Problem 5: Find some examples where the system fails and explain why you think it has done so.

### Analogical reasoning

Another semantic property of embeddings is their ability to capture relational meanings. In an important early vector space model of cognition, Rumelhart and Abrahamson (1973) proposed the parallelogram model for solving simple analogy problems of the form "a is to b as a* is to what?". In such problems, a system is given a problem like apple:tree::grape:?, i.e., apple is to tree as grape is to ___, and must fill in the word vine.

In the parallelogram model, the vector from the word apple to the word tree (= tree - apple) is added to the vector for grape; the nearest word to that point is returned.
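
The arithmetic can be traced by hand on a toy example. The sketch below uses invented 2-dimensional vectors (`toy_vocab` and `toy_vecs` are hypothetical and chosen so that the analogy works), not real embeddings.

```python
import numpy as np

toy_vocab = ["apple", "tree", "grape", "vine", "car"]
toy_vecs = np.array([[1.0, 1.0],   # apple
                     [1.0, 3.0],   # tree
                     [2.0, 1.0],   # grape
                     [2.0, 3.1],   # vine
                     [5.0, 0.0]])  # car

apple, tree, grape = toy_vecs[0], toy_vecs[1], toy_vecs[2]

# Offset from apple to tree, added to grape: (2, 1) + (0, 2) = (2, 3)
target = grape + (tree - apple)

# Return the word whose vector lies closest to the target point
distances = np.linalg.norm(toy_vecs - target, axis=1)
best = toy_vocab[int(np.argmin(distances))]
print(best)  # vine
```

With real embeddings the nearest word to the target is often one of the three input words, so in practice it is worth looking at several neighbours rather than only the first.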

Problem 6: Complete the code below so that it solves the analogical reasoning problem. Come up with an analogical reasoning problem of your own and use the code to solve it.

In [None]:
w1 = "apple"
w2 = "tree"
w3 = "grape"
w1_index = type_list.index(w1)
w2_index = type_list.index(w2)
w3_index = type_list.index(w3)
w1_vec = M[w1_index,]
w2_vec = M[w2_index,]
w3_vec = M[w3_index,]

spatial_relationship = ???
w4_vec = ???
# print the words nearest to the predicted vector
for i in nbrs.kneighbors([w4_vec])[1][0]:
  print(type_list[i])