## Dependencies

In [1]:
import re

import nltk
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /home/hc4293/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/hc4293/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Text pre-processing

In [2]:
def preprocess_corpus(filepath):
    # stopwords and lemmatizer
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read().lower()
        sentences = re.split(r'[.!?]', text)
    processed_sentences = []
    for sentence in sentences:
        # remove punctuation and tokenize
        tokens = re.findall(r'\b\w+\b',sentence)
        # remove stopwords and lemmatize
        tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
        if tokens:  # Avoid empty lists
            processed_sentences.append(tokens)
    
    return processed_sentences

In [3]:
file_path = "./data/shakespeare.txt"
data = preprocess_corpus(filepath=file_path)

## Training

In [4]:
def train_word2vec(sentences, vector_size=100, window=5, min_count=2, workers=4):
    model = Word2Vec(sentences, vector_size=vector_size, window=window, min_count=min_count, workers=workers)
    return model

In [5]:
word2vec_model = train_word2vec(data)
word2vec_model.save("./data/shakespeare_word2vec.model")

## Evaluate

In [6]:
examples = [
    ["thou", "thee"],
    ["love", "honour"],
    ["dagger", "sword"],
    ["villain", "knave"],
    ["fair", "foul"]  
]

In [7]:
def demonstrate_similarity(model, examples):
    for word1, word2 in examples:
        similarity = model.wv.similarity(word1, word2)
        print(f"Similarity between '{word1}' and '{word2}': {similarity*100:.2f}%")

model_path = "./data/shakespeare_word2vec.model"
word2vec_model = Word2Vec.load(model_path)
demonstrate_similarity(word2vec_model,examples)


Similarity between 'thou' and 'thee': 66.96%
Similarity between 'love' and 'honour': 90.81%
Similarity between 'dagger' and 'sword': 87.09%
Similarity between 'villain' and 'knave': 92.42%
Similarity between 'fair' and 'foul': 78.51%


## Bonus

In [9]:
import numpy as np

# load Glove embeddings
def load_glove_embeddings(glove_file_path):
    embeddings = {}
    with open(glove_file_path, 'r', encoding='utf-8') as file:
        for line in file:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings


def demonstrate_similarity_glove(embeddings, examples):
    for word1, word2 in examples:
        if word1 in embeddings and word2 in embeddings:
            vec1, vec2 = embeddings[word1], embeddings[word2]
            similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
            print(f"Similarity between '{word1}' and '{word2}': {similarity*100:.2f}%")
        else:
            print(f"One or both words not found in embeddings: '{word1}', '{word2}'")


In [11]:
glove_file_path = "./models/glove.6B.100d.txt"
glove_embeddings = load_glove_embeddings(glove_file_path)

examples = [
    ["thou", "thee"],
    ["love", "honour"],
    ["dagger", "sword"],
    ["villain", "knave"],
    ["fair", "foul"]
]

demonstrate_similarity_glove(glove_embeddings, examples)

Similarity between 'thou' and 'thee': 61.34%
Similarity between 'love' and 'honour': 40.07%
Similarity between 'dagger' and 'sword': 64.94%
Similarity between 'villain' and 'knave': 22.70%
Similarity between 'fair' and 'foul': 21.49%


### **Write-Up**

---

#### **Text Preprocessing**
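The preprocessing function processes the Shakespearean text by:
- Converting the text to lowercase.
- Splitting it into sentences on `.`, `!`, and `?`.
- Tokenizing each sentence with the regular expression `\b\w+\b`, which also discards punctuation.
- Removing English stopwords (NLTK's list).
- Lemmatizing the remaining tokens with WordNet so that forms like "dagger" and "daggers" are treated the same. Because `lemmatize` is called without a part-of-speech tag, each token is lemmatized as a noun.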

This prepares the data for training meaningful embeddings.
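
The steps above can be sketched on a single string without any external dependencies. This is a simplified illustration, not the notebook's function: `STOP_WORDS` is a hypothetical, hand-picked subset standing in for NLTK's English stopword list, and lemmatization is omitted.

```python
import re

# Hypothetical, tiny stopword set for illustration only; the notebook
# uses NLTK's full English stopword list instead.
STOP_WORDS = {"the", "is", "a", "and", "to", "of"}

def preprocess_text(text):
    """Lowercase, split into sentences, tokenize, and drop stopwords."""
    # Split on sentence-ending punctuation, as the notebook does.
    sentences = re.split(r"[.!?]", text.lower())
    processed = []
    for sentence in sentences:
        # \b\w+\b keeps runs of word characters and discards punctuation.
        tokens = re.findall(r"\b\w+\b", sentence)
        tokens = [t for t in tokens if t not in STOP_WORDS]
        if tokens:  # skip empty sentences (e.g. after the final period)
            processed.append(tokens)
    return processed

sample = "The dagger is sharp! Draw thy sword, and fight."
result = preprocess_text(sample)
print(result)
# [['dagger', 'sharp'], ['draw', 'thy', 'sword', 'fight']]
```

Note that `\w+` also splits contractions at the apostrophe, so a form like "o'er" would become two tokens under this scheme.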

---

#### **Training the Word2Vec Model**
The Word2Vec model was trained on the preprocessed Shakespearean corpus with gensim's `Word2Vec` class, using the following parameters:
- **Vector Size**: 100
- **Window Size**: 5 (maximum distance between a target word and a context word, on each side)
- **Minimum Count**: 2 (words that occur fewer than 2 times are ignored)
- **Workers**: 4 (training threads)

The model was saved as `shakespeare_word2vec.model` for later use.
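
To make the effect of the minimum count and window size concrete, here is a minimal pure-Python sketch. The helper names `prune_vocabulary` and `context_pairs` are hypothetical and are not part of gensim; the library's real training loop has further details (such as subsampling of frequent words) that this sketch ignores.

```python
from collections import Counter

def prune_vocabulary(sentences, min_count=2):
    """Drop tokens that occur fewer than min_count times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    keep = {word for word, count in counts.items() if count >= min_count}
    return [[t for t in sentence if t in keep] for sentence in sentences]

def context_pairs(sentence, window=5):
    """List (center, context) pairs within `window` positions on each side."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

# Toy corpus (made up for illustration): "knave" occurs once and is pruned.
toy = [["fair", "foul", "foul", "fair"], ["fair", "knave"]]
pruned = prune_vocabulary(toy, min_count=2)
pairs = context_pairs(pruned[0], window=1)
print(pruned)
print(pairs)
```

With `window=1` each word is paired only with its immediate neighbours; the notebook's setting of 5 widens that span to up to five words on either side.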

---

#### **Discussion**
The Word2Vec model assigned high similarity to related Shakespearean terms, which suggests that it learned domain-specific contextual embeddings. High scores for pairs like "villain-knave" (92.42%) and "love-honour" (90.81%) indicate that the model places words used in similar contexts in Shakespeare’s works close together.

Interestingly, the similarity score for "fair-foul" (78.51%) shows that the model treats contrasting words as related: opposites tend to appear in the same contexts, and Shakespeare frequently pairs them.
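
The percentages reported here are cosine similarities multiplied by 100, so they measure the angle between two word vectors rather than any notion of synonymy. A minimal pure-Python sketch of the measure follows; the two-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: same direction, orthogonal, and opposite direction.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opp = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(f"{same * 100:.2f}%")  # 100.00%
print(f"{orth * 100:.2f}%")  # 0.00%
print(f"{opp * 100:.2f}%")   # -100.00%
```

Because the score depends only on direction, two words that occur in similar contexts can score highly even when their meanings are opposed, as with "fair" and "foul".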

---

#### **Comparison with Pre-Trained Embeddings (Bonus)**
To assess the performance, I compared the Word2Vec embeddings with pre-trained GloVe embeddings (`glove.6B.100d`):

| Pair            | Word2Vec Model | GloVe  |
|-----------------|----------------|--------|
| "thou-thee"     | 66.96%         | 61.34% |
| "love-honour"   | 90.81%         | 40.07% |
| "dagger-sword"  | 87.09%         | 64.94% |
| "villain-knave" | 92.42%         | 22.70% |
| "fair-foul"     | 78.51%         | 21.49% |

##### Discussion:
The Word2Vec model produced higher similarity scores than the GloVe embeddings for all five Shakespearean word pairs, although the gap for "thou-thee" is small (66.96% versus 61.34%). For instance, "love-honour" scored 90.81% with Word2Vec compared to 40.07% with GloVe, which suggests that the corpus-specific model better reflects how these words are used together in Shakespeare’s poetic language.

GloVe’s lower scores, especially for "villain-knave" (22.70%) and "fair-foul" (21.49%), suggest that pre-trained general embeddings lack the domain-specific knowledge required to capture the richness of Shakespearean language. However, for more universal terms like "dagger-sword" (64.94%), GloVe embeddings performed reasonably well, reflecting their broad vocabulary coverage and training on a much larger corpus.

---

#### **References**
1. Shakespeare, W. *The Complete Works of William Shakespeare*. Project Gutenberg, [https://www.gutenberg.org/](https://www.gutenberg.org/).

--- 
