# Lab 6: (Sub)Word embeddings
### COSC 426: Fall 2025, Colgate University

Use this notebook to answer the questions in `Lab6.md`. Make sure to include in this notebook all the tests and experiments you run. Make sure to also cite any external resources you use. 

## Part 1: Computing similarity between `GloVe` embeddings

### Part 1.1

In [None]:
import pickle
import numpy as np

with open('glove_dolma_300_10k.pkl', 'rb') as f:
    glove_embeddings = pickle.load(f)

In [None]:
print(len(glove_embeddings.keys()))

In [None]:
words = ['hello', 'cat', 'chomsky', 'supercalifragilisticexpialidocious']

for word in words:
    if word in glove_embeddings:
        vec = glove_embeddings[word]
        print(f'"{word}": Dimensions', vec.shape, "Mean:", np.mean(vec))
    else:
        print(f'"{word}": Missing embedding')

### Part 1.2

Implement and test the following functions

In [None]:
def get_word_vector_glove(word:str,embeddings:dict):
    """
    Return the embedding of word if it exists; otherwise return the mean embedding of all words
    """
    pass

In [None]:
## Tests
print(str(get_word_vector_glove('hello',glove_embeddings).shape) == '(300,)')
print(str(np.mean(get_word_vector_glove('hello',glove_embeddings))) == '-5.767882e-05')
print(str(get_word_vector_glove('hello',glove_embeddings)[0]) == '0.200562')

print(str(get_word_vector_glove('supercalifragilisticexpialidocious',glove_embeddings).shape) == '(300,)')
print(str(np.mean(get_word_vector_glove('supercalifragilisticexpialidocious',glove_embeddings))) == '-0.0065580728')
print(str(get_word_vector_glove('supercalifragilisticexpialidocious',glove_embeddings)[0]) == '-0.08470377')

print(str(get_word_vector_glove('chomsky',glove_embeddings).shape) == '(300,)')
print(str(np.mean(get_word_vector_glove('chomsky',glove_embeddings))) == '-0.0065580728')
print(str(get_word_vector_glove('chomsky',glove_embeddings)[0]) == '-0.08470377')

In [None]:
def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray):
    """
    Returns cosine similarity between two vectors
    """
    pass
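
For reference, cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The sketch below illustrates that formula on plain NumPy arrays. The name `cosine_sim_sketch` is hypothetical and not part of the starter code, and the function assumes neither vector is all zeros. It returns a Python `float` for readability, so its string formatting may differ from what the notebook's string-based tests expect.

```python
import numpy as np

def cosine_sim_sketch(u, v):
    """Cosine of the angle between u and v (assumes both are non-zero)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # dot product divided by the product of the two Euclidean norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Orthogonal vectors share no direction, so their similarity is zero
print(cosine_sim_sketch([1, 0], [0, 1]))
```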


In [None]:
print(str(cosine_similarity(np.array([1,2,3]), np.array([1,2, -3]))) == '-0.2857142857142857')
print(str(cosine_similarity(np.array([1,2,3]), np.array([1,-2, 3]))) == '0.42857142857142855')
print(str(cosine_similarity(np.array([1,2,3]), np.array([1,2, 3]))) == '1.0')

sims = {'hello hello': '1.0000001',
        'hello hey': '0.93034637',
        'hello hi': '0.9133941',
        'hello supercalifragilisticexpialidocious': '0.73929405',
        'hello chomsky': '0.73929405',
        'hello cat': '0.6564517'
       }

for key,val in sims.items():
    word1, word2 = key.split()
    res = str(cosine_similarity(get_word_vector_glove(word1,glove_embeddings),
                  get_word_vector_glove(word2,glove_embeddings)))

    print(res == val)

In [1]:
def find_similar(word_vec: np.ndarray, n: int, embeddings: dict, exclude: list):
    """
    Params:
        word_vec: a word embedding
        n: number of similar words to return
        embeddings: dict mapping each word to its embedding
        exclude: words to be excluded from the output

    Returns:
        n words most similar to word_vec, not including any word in the exclude list
    """
    pass
    

In [None]:

find_similar(get_word_vector_glove('hello',glove_embeddings), 10, glove_embeddings, exclude=['hello'])

In [None]:
find_similar(get_word_vector_glove('hello',glove_embeddings), 10, glove_embeddings, exclude=[])

### Part 1.3

Answer the following questions: 

1. What is the time complexity of `find_similar` if your vocab has `v` words, your embedding size is `m`, and you want to find the `n` most similar words to the input word?

2. Consider a scenario (e.g., a web application that displays similar words) where you might have to run `find_similar` repeatedly, say `x` times. What are the benefits and challenges of pre-computing the similarity between all words? How might you overcome the challenges?
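
As one way to picture the pre-computation scenario in question 2, the sketch below builds a full word-by-word cosine similarity table once and then answers each query with a row lookup. It is an illustration under stated assumptions, not a reference answer: the toy vectors and the names `toy_embeddings`, `sim_table`, and `lookup_similar` are all hypothetical.

```python
import numpy as np

# Hypothetical toy vocabulary; real vectors would come from the GloVe pickle file
toy_embeddings = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.1]),
    "car": np.array([0.0, 1.0]),
    "bus": np.array([0.1, 0.9]),
}

words = list(toy_embeddings)
index = {w: i for i, w in enumerate(words)}
matrix = np.stack([toy_embeddings[w] for w in words])           # shape (v, m)
unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)   # unit-length rows

# One-time cost: a v x v table of cosine similarities
sim_table = unit @ unit.T

def lookup_similar(word, n):
    """Return the n most similar words using only the pre-computed table."""
    row = sim_table[index[word]]
    order = np.argsort(-row)  # indices sorted by descending similarity
    return [words[i] for i in order if words[i] != word][:n]

print(lookup_similar("cat", 2))
```

The table has `v * v` entries, so its memory footprint grows quadratically with the vocabulary. That growth is the cost to weigh against the faster lookups.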

## Part 2: Computing similarity between `distilgpt2` embeddings

### Part 2.1

In [2]:
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "distilgpt2"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

embedding_matrix = model.get_input_embeddings().weight

In [None]:
words = ['hello', 'cat', 'chomsky', 'supercalifragilisticexpialidocious']

for word in words:
    tokens = tokenizer.tokenize(word, add_special_tokens=False, return_tensors="pt")
    print(tokens)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(ids)
    embeds = embedding_matrix[ids]
    print(word, embeds.shape)

#### Answer the following question
How does `distilgpt2` handle words that are not in its vocabulary?


### Part 2.2


In [None]:
def get_hf_wordvec(word, tokenizer, embedding_matrix):
    """
    Returns an embedding that is the average over all the sub-word tokens
    """
    pass
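
To make "average over all the sub-word tokens" concrete, here is a minimal sketch on toy data. It does not load a model: `toy_vocab`, `toy_matrix`, and `average_subword_vectors` are hypothetical stand-ins for the tokenizer's vocabulary and the model's embedding table.

```python
import numpy as np

# Hypothetical sub-word vocabulary and embedding table (one row per token id)
toy_vocab = {"super": 0, "cali": 1, "hello": 2}
toy_matrix = np.array([
    [1.0, 0.0, 2.0],   # "super"
    [3.0, 2.0, 0.0],   # "cali"
    [0.5, 0.5, 0.5],   # "hello"
])

def average_subword_vectors(tokens, vocab, matrix):
    """Mean of the embedding rows for the given sub-word tokens."""
    ids = [vocab[t] for t in tokens]
    # Select one row per token, then average across tokens (axis 0)
    return matrix[ids].mean(axis=0)

print(average_subword_vectors(["super", "cali"], toy_vocab, toy_matrix))
```

The real `embedding_matrix` is a PyTorch tensor rather than a NumPy array, so the equivalent there would use tensor indexing and `mean(dim=0)`.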

In [None]:
print(str(get_hf_wordvec('hello', tokenizer, embedding_matrix).shape) =='(768,)')
print(str(np.mean(get_hf_wordvec('hello', tokenizer, embedding_matrix))) =='-0.0013018699')
print(str(get_hf_wordvec('hello', tokenizer, embedding_matrix)[0]) =='-0.029814368')

print(str(get_hf_wordvec('supercalifragilisticexpialidocious', tokenizer, embedding_matrix).shape)=='(768,)')
print(str(np.mean(get_hf_wordvec('supercalifragilisticexpialidocious', tokenizer, embedding_matrix)))=='-0.0007369095')
print(str(get_hf_wordvec('supercalifragilisticexpialidocious', tokenizer, embedding_matrix)[0]) =='-0.023058223')

print(str(get_hf_wordvec('chomsky', tokenizer, embedding_matrix).shape) =='(768,)')
print(str(np.mean(get_hf_wordvec('chomsky', tokenizer, embedding_matrix)))=='0.0015589814')
print(str(get_hf_wordvec('chomsky', tokenizer, embedding_matrix)[0]) =='0.026135625')

### Part 2.3

What are the limitations of using `embedding_matrix` in `find_similar`? 

### Part 2.4

In [None]:
def create_hf_embeddings(hf_model_name: str, vocab:set):
    """
    Returns a dictionary. Key: each word in vocab; Value: that word's embedding under the Hugging Face model hf_model_name
    """
    pass

In [None]:
## Add sanity checks

## Part 3: Using analogies to study encoding of gender

### Part 3.1

In [3]:
def compute_analogy(analogy:str, embeddings:dict, n:int):
    """
    Params:
        analogy: String of the format: a - b + c
        embeddings: dict mapping each word to its embedding
        n: number of closest words to return
    Returns:
        n closest words to the resulting analogy embedding
    """
    pass
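
The `a - b + c` arithmetic can be illustrated with a small sketch on hand-made vectors. Everything here is hypothetical: the 2-D toy embeddings, the name `toy_analogy`, and the choice to skip the three input words when ranking (an assumption, not something the lab specifies).

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def toy_analogy(analogy, embeddings, n):
    """Rank words by cosine similarity to a - b + c, skipping a, b and c."""
    a, _, b, _, c = analogy.split()  # expects exactly "a - b + c"
    target = embeddings[a] - embeddings[b] + embeddings[c]

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Assumption: the input words themselves are not valid answers
    candidates = [w for w in embeddings if w not in (a, b, c)]
    candidates.sort(key=lambda w: cos(embeddings[w], target), reverse=True)
    return candidates[:n]

print(toy_analogy("king - man + woman", toy, 1))
```

In this toy space the target vector lands exactly on `queen`. With real embeddings the nearest words are only approximately related, which is what makes the results worth inspecting.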


In [4]:
## Add Tests

### Part 3.2
Come up with at least 10 analogies that you think are important for testing how robustly gender is encoded in some embeddings. Justify why you picked the examples you did. 

### Part 3.3
Using analogies from the previous part, test how robustly gender is encoded in `GloVe` and `distilgpt2` embeddings. Add as many code and markdown chunks as you would like.

## Part 4 (optional): Using analogies to study encoding of other features