<a href="https://colab.research.google.com/github/pmadhyastha/INM434/blob/main/distributional_semantics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Distributional models, vector space representations and word embeddings

In [1]:
__author__ = "Pranava Madhyastha"
__version__ = "INM434/IN3045 City, University of London, Spring 2024"

# Setup: English Wikipedia corpus

We begin by downloading a large Wikipedia corpus and then clean and normalise the text.

In [2]:
import re
import urllib.request
import zipfile

# Download the corpus
url = "http://mattmahoney.net/dc/enwik8.zip"
urllib.request.urlretrieve(url, "enwik8.zip")

# Extract the corpus and clean it
with zipfile.ZipFile('enwik8.zip', 'r') as zip_ref:
    zip_ref.extractall()

with open('enwik8', 'r', encoding='utf-8') as f_in, open('enwik8_clean', 'w', encoding='utf-8') as f_out:
    for line in f_in:
        # Strip markup tags (anything between '<' and '>' on the same line)
        line = re.sub(r'<.*?>', '', line)
        # Normalize the text
        line = line.lower()
        # Write the cleaned line to the output file
        f_out.write(line)

# Read the cleaned corpus into memory (tokenization happens in a later cell)
with open('enwik8_clean', 'r', encoding='utf-8') as f:
    corpus = f.read()

The code above downloads enwik8, an old snapshot of English Wikipedia (the first 100 MB of a 2006 dump), and writes a cleaned copy to 'enwik8_clean'.

### TODO:

1. How is the text being normalised? Print part of the corpus to see what has been done.
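
To see what the cleaning step does, it can help to apply the same two operations to a single line. The sketch below uses a made-up sample line and a hypothetical helper, `normalise_line` (neither appears in the notebook); the regular expression and the lowercasing mirror the cleaning loop above.

```python
import re

def normalise_line(line):
    """Apply the same cleaning as the corpus loop: drop <...> tags, then lowercase."""
    line = re.sub(r'<.*?>', '', line)  # non-greedy, so each tag is removed separately
    return line.lower()

# Made-up sample resembling a line of the raw dump (an assumption, not real corpus data)
sample = "<title>Anarchism</title> is a [[Political philosophy]] &amp; movement."
cleaned = normalise_line(sample)
print(cleaned)
```

Note that wiki markup such as `[[...]]` and entities such as `&amp;` are left untouched by this cleaning, so they will show up as tokens later.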

## Vector space representation of words using co-occurrence information

We will now write code for our first vector space model by building a very simple word co-occurrence dictionary (the terms "matrix" and "dictionary" are used interchangeably in this lab session).


The program reads the text stored in the file 'enwik8_clean' (produced by the step above) and tokenizes it with the NLTK library. It then builds a vocabulary of *unique words* in the corpus and counts the occurrences of each word using a defaultdict.

Next, it counts co-occurrences within a fixed window of three tokens on either side of each word and stores the results in a nested defaultdict.

Finally, the program creates a vector space representation for each word: it iterates through the vocabulary and builds a vector of that word's co-occurrence counts with every word in the vocabulary.

The resulting vectors are stored in a dictionary named 'vectors'. Stop words are skipped while counting co-occurrences, but they remain in the vocabulary, so they still appear as all-zero rows and dimensions.
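
The steps above can be sketched on a toy sentence before running them on the corpus. This is a minimal illustration under stated assumptions: the sentence, the two-word stop list, the window size of 2, and the names `cooccurrence_counts`, `demo_vocab`, and `demo_vectors` are all made up for this example.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window_size, stopwords):
    """For each non-stop word, count the non-stop words within window_size positions."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        if target in stopwords:
            continue
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j == i or tokens[j] in stopwords:
                continue
            counts[target][tokens[j]] += 1
    return counts

# Toy data (not from the corpus)
demo_tokens = "the cat sat on the mat near the cat".split()
demo_stopwords = {"the", "on"}
demo_counts = cooccurrence_counts(demo_tokens, window_size=2, stopwords=demo_stopwords)

# One dimension per vocabulary word, in a fixed (sorted) order
demo_vocab = sorted(set(demo_tokens))
demo_vectors = {w: [demo_counts[w][c] for c in demo_vocab] for w in demo_vocab}

print(demo_vocab)
print(demo_vectors["cat"])
```

As in the notebook code, stop words still occupy positions inside the window; they are simply never counted.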

In [3]:
import nltk
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below
nltk.download('stopwords')
nltk.download('punkt')


# Constants
WINDOW_SIZE = 3
STOPWORDS = set(stopwords.words('english'))

# Read in the first lines_limit lines of the corpus
lines_limit = 1000
lines = []
with open('enwik8_clean', 'r', encoding='utf-8') as f:
    for _ in range(lines_limit):
        lines.append(f.readline())
corpus = ''.join(lines)

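# Tokenize the corpus and create the vocabulary.
# Sorting fixes the word order, so every vector uses the same reproducible dimension order.
corpus = word_tokenize(corpus)
vocab = sorted(set(corpus))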

print(len(vocab))

# Count occurrences of each word
word_counts = defaultdict(int)
for word in corpus:
    word_counts[word] += 1

# Count co-occurrences of words within a window
context_counts = defaultdict(lambda: defaultdict(int))
for i in range(len(corpus)):
    if corpus[i] in STOPWORDS:
        continue
    for j in range(i - WINDOW_SIZE, i + WINDOW_SIZE + 1):
        if j < 0 or j >= len(corpus) or i == j or corpus[j] in STOPWORDS:
            continue
        context_counts[corpus[i]][corpus[j]] += 1

# Create vector space representation for each word
vectors = {}
for word in vocab:
    vector = [context_counts[word][w] for w in vocab]
    vectors[word] = vector

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


4063


### TODO:

1. How many lines of data does the code consider?
2. "vectors" is a dictionary. Can you print the vocabulary?
3. What is the dimensionality of each word vector?
4. What is stored in the array corresponding to each word?
5. Consider the word "peaceful". What is the column sum (colsum) for this word, and what does it signify here?
6. Feel free to increase the number of lines. At some point the program will occupy so much memory that Colab stops functioning. What causes this?
7. What is one way to mitigate the memory explosion?
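
One possible way to approach the last two questions is to compare how many numbers a dense representation stores with how many co-occurrence counts are actually non-zero. The sketch below uses made-up toy counts and two hypothetical helpers, `storage_report` and `sparse_vector`; keeping only the non-zero entries is one common mitigation, not necessarily the one this lab has in mind.

```python
def storage_report(context_counts, vocab):
    """Compare dense storage (|V| * |V| numbers) with the number of non-zero counts."""
    dense_entries = len(vocab) * len(vocab)
    nonzero_entries = sum(
        1 for row in context_counts.values() for count in row.values() if count > 0
    )
    return dense_entries, nonzero_entries

def sparse_vector(context_counts, word):
    """Return only the non-zero co-occurrence counts of a word as a plain dict."""
    return {w: c for w, c in context_counts.get(word, {}).items() if c > 0}

# Toy counts (made up for illustration)
sparse_demo_counts = {
    "cat": {"sat": 1, "near": 1},
    "sat": {"cat": 1},
    "near": {"cat": 1, "mat": 1},
    "mat": {"near": 1},
}
sparse_demo_vocab = ["cat", "mat", "near", "on", "sat", "the"]

dense, nonzero = storage_report(sparse_demo_counts, sparse_demo_vocab)
print(f"dense entries: {dense}, non-zero entries: {nonzero}")
print(sparse_vector(sparse_demo_counts, "cat"))
```

With the 4063-word vocabulary from the run above, the dense representation already holds 4063 × 4063 = 16,507,969 counts, and that number grows quadratically with the vocabulary size.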

## Word similarity using metrics

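We will now experiment with vector similarity. The goal is to find the most similar and the least similar words, and we will use distance metrics to operationalise this. As a warm-up, the toy example in the next cell ranks words by how far their vectors lie from the origin under two metrics.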

In [4]:
import numpy as np

def top_n_words(vectors, n, distance_function):
    """Return the n (word, score) pairs with the largest distance_function score."""
    distances = {}
    for word, vector in vectors.items():
        distances[word] = distance_function(vector)
    sorted_distances = sorted(distances.items(), key=lambda x: x[1], reverse=True)
    return sorted_distances[:n]

# Toy dictionary of vectors, named toy_vectors so that it does not
# overwrite the corpus-based `vectors` built in the previous cell
toy_vectors = {
    'apple': np.array([1, 2, 3]),
    'banana': np.array([4, 5, 6]),
    'orange': np.array([7, 8, 9]),
    'grape': np.array([10, 11, 12])
}

# Euclidean distance from the origin (the L2 norm of the vector)
def euclidean_distance(vector):
    return np.linalg.norm(vector)

# Manhattan distance from the origin (the L1 norm of the vector)
def manhattan_distance(vector):
    return np.sum(np.abs(vector))

# Top-n words using Euclidean distance
top_n_euclidean = top_n_words(toy_vectors, 2, euclidean_distance)
print('Top 2 words using Euclidean distance:', top_n_euclidean)

# Top-n words using Manhattan distance
top_n_manhattan = top_n_words(toy_vectors, 2, manhattan_distance)
print('Top 2 words using Manhattan distance:', top_n_manhattan)


Top 2 words using Euclidean distance: [('grape', 19.1049731745428), ('orange', 13.92838827718412)]
Top 2 words using Manhattan distance: [('grape', 33), ('orange', 24)]
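
The functions above score each vector on its own. To find the words closest to a given query word, the distance has to be computed between pairs of vectors. The following is a minimal sketch of that idea using only the standard library; `euclidean`, `cosine_distance`, `nearest_words`, and `pairwise_toy` are hypothetical names introduced here, and the toy vectors are repeated as plain lists so the block is self-contained.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 here if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def nearest_words(vectors, query, n, distance):
    """Return the n words whose vectors are closest to the query word's vector."""
    scores = [
        (word, distance(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1])[:n]

# Same toy values as in the cell above, as plain lists
pairwise_toy = {
    'apple': [1, 2, 3],
    'banana': [4, 5, 6],
    'orange': [7, 8, 9],
    'grape': [10, 11, 12],
}

nearest_euclidean = nearest_words(pairwise_toy, 'apple', 2, euclidean)
nearest_cosine = nearest_words(pairwise_toy, 'apple', 2, cosine_distance)
print('Nearest to apple (Euclidean):', nearest_euclidean)
print('Nearest to apple (cosine):', nearest_cosine)
```

Because the helpers only iterate over the vectors, they also accept the list-based vectors built from the corpus, for example `nearest_words(vectors, 'peaceful', 2, cosine_distance)`, assuming that word is in the vocabulary.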


### TODO:

1. Can you extend the code to compute cosine similarity and cosine distance?
2. Have a look at https://docs.scipy.org/doc/scipy/reference/spatial.distance.html for additional distance metrics. Try different distance functions and see how the top-2 words change!
3. This code uses toy data. Can you instead use the co-occurrence vectors built earlier and obtain the top-2 words for a few words in the vocabulary? (Start with, say, the word "peaceful".)
4. Can you try a few reweighting techniques and see how the distances change?
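
For the reweighting question, one widely used option is positive pointwise mutual information (PPMI). The notebook does not name a specific technique, so this choice is an assumption. The sketch below computes PPMI(w, c) = max(0, log2(P(w, c) / (P(w) P(c)))) from nested co-occurrence counts; the function name `ppmi` and the toy counts are made up for illustration.

```python
import math

def ppmi(counts):
    """Reweight nested co-occurrence counts {word: {context: count}} with PPMI."""
    total = sum(c for row in counts.values() for c in row.values())
    row_sums = {w: sum(row.values()) for w, row in counts.items()}
    col_sums = {}
    for row in counts.values():
        for ctx, c in row.items():
            col_sums[ctx] = col_sums.get(ctx, 0) + c

    weighted = {}
    for w, row in counts.items():
        weighted[w] = {}
        for ctx, c in row.items():
            if c == 0:
                continue  # log of zero is undefined; PPMI treats this as 0
            pmi = math.log2((c * total) / (row_sums[w] * col_sums[ctx]))
            if pmi > 0:
                weighted[w][ctx] = pmi  # keep only positive values (sparse output)
    return weighted

# Toy counts (made up for illustration)
ppmi_demo_counts = {
    "cat": {"sat": 2, "purred": 2},
    "dog": {"sat": 2, "barked": 2},
}
ppmi_demo = ppmi(ppmi_demo_counts)
print(ppmi_demo)
```

In this toy example the shared context "sat" drops out, because it occurs with both words exactly as often as chance predicts, while the distinctive contexts keep a positive weight.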