<a href="https://colab.research.google.com/github/pmadhyastha/INM434/blob/main/distributional_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Distributional models, vector space representations and word embeddings

In [None]:
__author__ = "Pranava Madhyastha" 
__version__ = "INM434/IN3045 City, University of London, Spring 2023"

# Setup: English Wikipedia corpus

We will begin by downloading a large Wikipedia corpus and performing some cleaning and text normalisation.

In [None]:
import urllib.request
import re

# Download the corpus
url = "http://mattmahoney.net/dc/enwik8.zip"
urllib.request.urlretrieve(url, "enwik8.zip")

# Extract the corpus and clean it
import zipfile

with zipfile.ZipFile('enwik8.zip', 'r') as zip_ref:
    zip_ref.extractall()

with open('enwik8', 'r', encoding='utf-8') as f_in, open('enwik8_clean', 'w', encoding='utf-8') as f_out:
    for line in f_in:
        # Strip off HTML tags
        line = re.sub(r'<.*?>', '', line)
        # Normalize the text
        line = line.lower()
        # Write the cleaned line to the output file
        f_out.write(line)

# Read the cleaned corpus into memory (tokenization happens in a later cell)
with open('enwik8_clean', 'r', encoding='utf-8') as f:
    corpus = f.read()

The code above downloads enwik8, an old snapshot of the English Wikipedia corpus (the first 100 MB of a 2006 dump).
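
To get a feel for the cleaning loop before opening the full file, here is a minimal sketch that applies the same two steps (tag stripping and lowercasing) to a single made-up line. The helper name `clean_line` and the sample string are assumptions introduced for illustration; they are not part of the notebook.

```python
import re

def clean_line(line):
    """Apply the same two steps as the cleaning loop to one line."""
    # Remove anything that looks like a tag, e.g. <title> or </title>
    line = re.sub(r'<.*?>', '', line)
    # Lowercase the remaining text
    return line.lower()

# Made-up sample line in the style of the Wikipedia dump
sample = "<title>Anarchism</title> is a Political philosophy"
cleaned = clean_line(sample)
print(cleaned)  # anarchism is a political philosophy
```

Note that only text between angle brackets is removed; any other markup in the line is left as it is.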

### TODO: 

1.   How are we normalising text? Can you print out a part of the corpus and see what is being done? 



## Vector space representation of words using co-occurrence information

We will now write code for our first vector space model by building a very simple word co-occurrence dictionary (the terms "matrix" and "dictionary" are used interchangeably in this lab session).


The program reads in a corpus of text stored in the file called 'enwik8_clean' (which we obtained using the process above). The program then tokenizes it using the NLTK library. It then creates a vocabulary of *unique words* in the corpus and counts the number of occurrences of each word using a defaultdict object.

Next, it counts the co-occurrences of words within a fixed window of three positions on either side of each word and stores the results in a nested defaultdict.

The program then creates a vector space representation for each word by iterating through the vocabulary and creating a vector of co-occurrence counts with all other words in the vocabulary.

The resulting vectors are stored in a dictionary object named 'vectors'. While counting co-occurrences, the code skips stop words, both as target words and as context words.
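
The windowed counting described above can be sketched on a tiny token list without NLTK. The function name `cooccurrence_counts`, the toy sentence and the two-word stop list are assumptions made for illustration; the logic mirrors the description (skip stop words, look three positions to the left and right).

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window_size, stopwords):
    """Count, for each non-stop word, the non-stop words within window_size positions."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        if target in stopwords:
            continue
        # Clip the window at the start and end of the token list
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j == i or tokens[j] in stopwords:
                continue
            counts[target][tokens[j]] += 1
    return counts

# Toy input (hypothetical): six tokens, two of which are stop words
tokens = ["the", "cat", "sat", "on", "the", "mat"]
counts = cooccurrence_counts(tokens, window_size=3, stopwords={"the", "on"})
for word, neighbours in counts.items():
    print(word, dict(neighbours))
```

In this sketch "cat" and "mat" do not co-occur: stop words are skipped when counting, but they still occupy positions in the window, so the two words are four positions apart.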

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from collections import defaultdict
import os 
from nltk.tokenize import word_tokenize

nltk.download('punkt')


# Constants
WINDOW_SIZE = 3
STOPWORDS = set(stopwords.words('english'))

# Read in corpus
with open('enwik8_clean', 'r', encoding='utf-8') as f:
    filesize = os.path.getsize('enwik8_clean')
    lines_limit = 1000
    lines = []
    line = 0
    while line < lines_limit:
        l = f.readline()
        lines.append(l)
        line += 1
    corpus = ''.join(lines)

# Create vocabulary
corpus = word_tokenize(corpus)
vocab = set(corpus)

print(len(vocab))

# Count occurrences of each word
word_counts = defaultdict(int)
for word in corpus:
    word_counts[word] += 1

# Count co-occurrences of words within a window
context_counts = defaultdict(lambda: defaultdict(int))
for i in range(len(corpus)):
    if corpus[i] in STOPWORDS:
        continue
    for j in range(i - WINDOW_SIZE, i + WINDOW_SIZE + 1):
        if j < 0 or j >= len(corpus) or i == j or corpus[j] in STOPWORDS:
            continue
        context_counts[corpus[i]][corpus[j]] += 1

# Create vector space representation for each word
vectors = {}
for word in vocab:
    vector = [context_counts[word][w] for w in vocab]
    vectors[word] = vector

### TODO: 

1. How many lines of data are we considering in the code? 
2. "vectors" is a dictionary. Can you print the vocabulary? 
3. What is the dimensionality of the word vector? 
4. What are we storing in each one of the arrays corresponding to each word? 
5. Consider the word "peaceful": what is the column sum (colsum) for this word? What does colsum signify here? 
6. Feel free to increase the number of lines. At some point the code will occupy a very large amount of memory and Colab will stop functioning. What is causing this? 
7. What is a way to mitigate the memory explosion?  

## Word similarity using metrics 

We will now experiment with vector similarity. Given a word, we want to find the words that are most similar and least similar to it. We will use distance metrics to operationalise this.

In [None]:
import numpy as np

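def top_n_words(query, vectors, n, distance_function):
    # Compute the distance from the query word's vector to every other word's vector
    query_vector = vectors[query]
    distances = {}
    for word, vector in vectors.items():
        if word == query:
            continue  # do not compare the query word with itself
        distances[word] = distance_function(query_vector, vector)
    # A smaller distance means a more similar word, so sort in ascending order
    sorted_distances = sorted(distances.items(), key=lambda x: x[1])
    return sorted_distances[:n]

# Example dictionary of toy vectors (kept under a separate name so that the
# co-occurrence `vectors` built earlier are not overwritten)
toy_vectors = {
    'apple': np.array([1, 2, 3]),
    'banana': np.array([4, 5, 6]),
    'orange': np.array([7, 8, 9]),
    'grape': np.array([10, 11, 12])
}

# Euclidean distance between two vectors
def euclidean_distance(u, v):
    return np.linalg.norm(np.asarray(u) - np.asarray(v))

# Manhattan distance between two vectors
def manhattan_distance(u, v):
    return np.sum(np.abs(np.asarray(u) - np.asarray(v)))

# Top-n closest words to 'apple' using Euclidean distance
top_n_euclidean = top_n_words('apple', toy_vectors, 2, euclidean_distance)
print("Top 2 words closest to 'apple' using Euclidean distance:", top_n_euclidean)

# Top-n closest words to 'apple' using Manhattan distance
top_n_manhattan = top_n_words('apple', toy_vectors, 2, manhattan_distance)
print("Top 2 words closest to 'apple' using Manhattan distance:", top_n_manhattan)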


### TODO: 

1. Can you extend the code for cosine similarity and cosine distance? 
2. Have a look at https://docs.scipy.org/doc/scipy/reference/spatial.distance.html for additional distance metrics. Try different distance functions and see how top-2 words change! 
3. We are playing with toy data for this code, can you instead use the vectors from the example above and obtain the top-2 words for a few words in the vocab? (say start with the word "peaceful"). 
4. Can you try a few reweighting techniques and see how the distances change? 

## Latent semantic analysis (truncated SVD) 

We will now see the basic workings of latent semantic analysis. For this, we will use two readily available classes from scikit-learn: CountVectorizer and TruncatedSVD. We have used CountVectorizer in previous labs, so its basic functionality should be familiar.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Define some example text documents
docs = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the quick dog",
    "Dog is man's best friend",
    "Cat is not a man's best friend",
    "The brown fox is quick"
]

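# Convert the documents to a document-term count matrix
# (rows are documents, columns are vocabulary terms)
vectorizer = CountVectorizer()
cooc = vectorizer.fit_transform(docs)

# print the vocabulary (get_feature_names() was removed in recent scikit-learn versions)
print(vectorizer.get_feature_names_out())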

# Perform singular value decomposition (SVD)
svd = TruncatedSVD(n_components=5)
svd.fit(cooc)


# svd.components_ holds the right singular vectors (one column per term);
# scale them by the singular values to obtain word embeddings
word_embeddings = svd.components_.T * svd.singular_values_

print(word_embeddings)
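
To make the link between the decomposition and the embeddings more explicit, here is a minimal sketch using plain NumPy instead of scikit-learn. The count matrix `X` is made up for illustration, and the variable names are assumptions. It shows that keeping the top `k` singular values gives one `k`-dimensional vector per term, and that the discarded singular values account for the reconstruction error.

```python
import numpy as np

# Made-up document-term count matrix: 3 documents (rows) x 4 terms (columns)
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 3],
], dtype=float)

# Full (thin) SVD: X = U @ diag(S) @ Vt, with singular values in descending order
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions to keep

# One k-dimensional vector per term: right singular vectors scaled by singular values
term_embeddings = Vt[:k].T * S[:k]
# One k-dimensional vector per document: left singular vectors scaled the same way
doc_embeddings = U[:, :k] * S[:k]

# Rank-k approximation of the original matrix
X_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]

print(term_embeddings.shape)  # (4, 2)
print(np.linalg.norm(X - X_approx), S[k])
```

The last line prints the Frobenius norm of the reconstruction error next to the single discarded singular value; for this matrix the two numbers agree, which is the sense in which truncated SVD keeps the "most important" directions.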


### TODO: 
1. Can you associate the words to their corresponding vectors? 
2. What was the initial dimensionality? 
3. Can you change the CountVectorizer such that it now considers "document information" (here a document is a sentence) -- try looking at TfidfVectorizer. 
4. Can you print the top-2 words for each word in the vocabulary?

## Word2Vec

In the following code, we will implement a simplified version of the word2vec model: a continuous bag-of-words (CBOW) model, trained with PyTorch, that learns word embeddings from a small corpus. The CBOW model predicts a target word given its surrounding context words, and the word embeddings are the weights of the model's embedding layer.


The train_cbow function trains the CBOW model using the specified hyperparameters. It initializes the model, criterion, and optimizer, and then trains the model for the specified number of epochs. For each sentence in the corpus, it iterates over the positions in the sentence and creates a context tensor and target tensor for each position. It calculates the output of the model for the context tensor, computes the loss using the cross-entropy loss function, and backpropagates the loss to update the model parameters. At the end of each epoch, it prints the total loss for that epoch. 

This code looks very similar to the linear and non-linear models that we have explored before! 
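
Before reading the training code, it may help to see which (context, target) pairs a CBOW model is trained on. The sketch below is an illustration only: the helper name `cbow_pairs` is hypothetical, and clipping the window at the sentence boundaries is an assumption of this sketch (other implementations only use positions with a full window).

```python
def cbow_pairs(sentence, window_size):
    """Return (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(sentence):
        # Up to window_size words on each side, clipped at the sentence boundaries
        left = sentence[max(0, i - window_size):i]
        right = sentence[i + 1:i + 1 + window_size]
        context = left + right
        if context:  # skip positions with no context at all
            pairs.append((context, target))
    return pairs

pairs = cbow_pairs(["apple", "orange", "banana", "pear"], window_size=1)
for context, target in pairs:
    print(context, "->", target)
# ['orange'] -> apple
# ['apple', 'banana'] -> orange
# ['orange', 'pear'] -> banana
# ['banana'] -> pear
```

Each pair becomes one training example: the context word indices go into the embedding layer, and the target word index is the class the model should predict.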

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter
import numpy as np

# Toy data for training
sentences = [["apple", "orange", "banana", "pear"], ["apple", "pear"], ["banana", "orange"], ["pear", "banana", "orange"]]

# Building vocabulary
word_counts = Counter([word for sentence in sentences for word in sentence])
vocabulary = list(word_counts.keys())
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Hyperparameters
embedding_size = 100
window_size = 2
learning_rate = 0.001
num_epochs = 100

class CBOW(nn.Module):
    def __init__(self, vocabulary_size, embedding_size):
        super(CBOW, self).__init__()
        self.embedding = nn.Embedding(vocabulary_size, embedding_size)
        self.linear = nn.Linear(embedding_size, vocabulary_size)

    def forward(self, context):
        # context: 1-D tensor of context word indices, shape (num_context_words,)
        embedded_context = self.embedding(context)  # (num_context_words, embedding_size)
        # Sum over the context words (dim=0), not over the embedding dimension
        sum_embedded_context = torch.sum(embedded_context, dim=0)  # (embedding_size,)
        output = self.linear(sum_embedded_context)  # (vocabulary_size,)
        return output

def train_cbow(sentences, word_to_index, vocabulary, embedding_size, window_size, learning_rate, num_epochs):
    model = CBOW(len(vocabulary), embedding_size)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    for epoch in range(num_epochs):
        total_loss = 0
        for sentence in sentences:
            # Visit every position; the window is clipped at the sentence boundaries
            # so that short sentences still produce training examples
            for i in range(len(sentence)):
                left = sentence[max(0, i - window_size):i]
                right = sentence[i + 1:i + window_size + 1]
                context = [word_to_index[word] for word in left + right]
                if not context:
                    continue
                # CrossEntropyLoss expects a batch of targets, so use shape (1,)
                target = torch.tensor([word_to_index[sentence[i]]], dtype=torch.long)
                optimizer.zero_grad()
                output = model(torch.tensor(context, dtype=torch.long))
                loss = criterion(output.unsqueeze(0), target)
                total_loss += loss.item()
                loss.backward()
                optimizer.step()
        print("Epoch %s, loss=%.4f" % (epoch+1, total_loss))

    return model

# Train the CBOW model
model = train_cbow(sentences, word_to_index, vocabulary, embedding_size, window_size, learning_rate, num_epochs)

# Get the learned embeddings
word_embeddings = model.embedding.weight.detach().numpy()

# Test the embeddings
print("Embedding for 'apple': ", word_embeddings[word_to_index['apple']])
print("Embedding for 'orange': ", word_embeddings[word_to_index['orange']])
print("Embedding for 'banana': ", word_embeddings[word_to_index['banana']])
print("Embedding for 'pear': ", word_embeddings[word_to_index['pear']])


### TODO: 

1. Is this a linear or a non-linear model? 
2. What does the window-size signify here? 

## Typical use case: Pre-trained word representations. 

We will now use pre-trained word representations from GloVe, an alternative way of learning word representations. GloVe formalises the problem in a way similar to the skip-gram version of word2vec, but additionally uses information relating to "ratios" of co-occurrences with other words. For the purposes of this module, it is sufficient to know that GloVe is very similar to word2vec and shows similar behaviour. (If you are interested in more details, please refer to https://nlp.stanford.edu/projects/glove/.) 

In the code below, we are simply going to load the pre-trained word vectors (vectors that were previously trained over a large corpus). To do this, we are going to download the GloVe word vectors. Then we are going to load them into a dictionary and obtain the 10 most similar words for a given word. 
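
The loading and similarity steps can be tried without the large download. The sketch below parses three made-up lines written in the GloVe text format (a word followed by its vector components, separated by spaces) and compares words with cosine similarity. The words, the three-dimensional values and the helper name `cosine_similarity` are assumptions for illustration; real GloVe vectors have 50 or more dimensions.

```python
import io
import math

# Three made-up lines in the GloVe text format: "<word> <v1> <v2> ... <vd>"
fake_glove_file = io.StringIO(
    "king 0.5 0.1 0.3\n"
    "queen 0.4 0.2 0.3\n"
    "banana -0.3 0.9 0.0\n"
)

# Parse each line into word -> list of floats
toy_embeddings = {}
for line in fake_glove_file:
    values = line.split()
    toy_embeddings[values[0]] = [float(x) for x in values[1:]]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_banana = cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"])
print(sim_king_queen > sim_king_banana)  # True
```

With these toy values "king" and "queen" point in nearly the same direction, while "king" and "banana" do not, so ranking by cosine similarity places "queen" first.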



In [None]:
import numpy as np
import urllib.request
import zipfile
from pathlib import Path




# Download and extract the GloVe embeddings
url = 'http://nlp.stanford.edu/data/glove.6B.zip'
filename = 'glove.6B.zip'
infile = Path(filename)
# A Path object is always truthy, so check explicitly whether the file exists
if not infile.exists():
    urllib.request.urlretrieve(url, filename)
    with zipfile.ZipFile(filename, 'r') as zip_ref:
        zip_ref.extractall('.')

# Load the embeddings into a dictionary
embeddings = {}
with open('glove.6B.50d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        embeddings[word] = vector

# Function to obtain most similar words to a given word
def get_most_similar_words(word, embeddings, n=10):
    word_vector = embeddings.get(word)
    if word_vector is None:
        return []
    similarities = {}
    for w, v in embeddings.items():
        if w == word:
            continue  # skip the query word itself (its similarity is trivially 1)
        sim = np.dot(word_vector, v) / (np.linalg.norm(word_vector) * np.linalg.norm(v))
        similarities[w] = sim
    sorted_similarities = sorted(similarities.items(), key=lambda x: x[1], reverse=True)
    return [w for w, s in sorted_similarities[:n]]

print("most similar words for  'apple': ", get_most_similar_words("apple", embeddings))


### TODO: 

1. Can you try this for a list of other words? 
2. **Advanced** A sentence is made of words, so we can "mean pool" word representations to form a sentence representation. Can you use this clue to construct a sentiment classifier for Lab 3 and 4 with mean-pooled word representations as features? Is this better? What is different now? 
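
As a starting point for the advanced exercise, here is a minimal sketch of the pooling step only (not the classifier). The helper name `mean_pool`, the two-dimensional toy embeddings and the choice to skip words missing from the embedding dictionary are all assumptions for illustration.

```python
def mean_pool(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector (an assumption of this sketch)
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Made-up 2-dimensional embeddings
toy = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

# "zzz" has no embedding and is skipped
sentence_vector = mean_pool(["good", "movie", "zzz"], toy, dim=2)
print(sentence_vector)  # [0.5, 0.5]
```

The resulting fixed-length vector can then be used as the feature vector for a sentence, regardless of how many words the sentence contains.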