Homework 4: Neural Language Models (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 2
----

### Names
----
Names: __Katherine Aristizabal, Jose Meza Llamosas__ (Write these in every notebook you submit.)

Task 2: Training your own word embeddings (15 points)
--------------------------------

For this task, you'll use the `gensim` package to train your own embeddings for both words and characters. These will eventually act as inputs to your neural language model.

In [1]:
# here are several dependencies to install
# !python --version
# !python -m pip install --upgrade pip setuptools wheel
# !pip install nltk
# !pip install gensim
# !pip install torch torchvision torchinfo

In [2]:
# import your libraries here

# Remember to restart your kernel if you change the contents of this file!
import neurallm_utils as nutils

# for word embeddings
# if not installed, run the following command:
# !pip install gensim
from gensim.models import Word2Vec

import torch
import torch.nn as nn

[nltk_data] Downloading package punkt to /Users/0wner/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/0wner/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [3]:
# If running on google colab, you'll need to mount your drive to access data files

# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
# constants you may find helpful. Edit as you would like.

# The number of dimensions in each embedding vector.
# This variable will be used throughout the program.
# DO NOT WRITE "50" WHEN YOU ARE REFERRING TO THE EMBEDDING SIZE
EMBEDDINGS_SIZE = 50

EMBEDDING_SAVE_FILE_WORD = f"spooky_embedding_word_{EMBEDDINGS_SIZE}.model" # The file to save your word embeddings to
EMBEDDING_SAVE_FILE_CHAR = f"spooky_embedding_char_{EMBEDDINGS_SIZE}.model" # The file to save your char embeddings to
TRAIN_FILE = 'spooky_author_train.csv' # The file to train your language model on


Train embeddings on provided dataset
---

In [5]:
# your code here
# use the provided utility functions to read in the data


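# toy corpus in the format Word2Vec expects (a list of tokenized sentences);
# handy for quick experiments, not used below
data = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
        ['this', 'is', 'the', 'second', 'sentence'],
        ['yet', 'another', 'sentence'],
        ['one', 'more', 'sentence'],
        ['and', 'the', 'final', 'sentence']]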


# read the spooky data in both by character and by word using the read_file_spooky function in the 
# provided utils
answer1 = nutils.read_file_spooky(TRAIN_FILE, ngram=1, by_character=True)
answer2 = nutils.read_file_spooky(TRAIN_FILE, ngram=1, by_character=False)


#wrapped_text1 = textwrap.fill(answer1, width=40)
#wrapped_text2 = textwrap.fill(answer2, width=40)

# print out the first two sentences in each format
print("char_list")
print(' '.join(answer1[0]))
print(' '.join(answer1[1]))


print("word_list")
print(' '.join(answer2[0]))
print(' '.join(answer2[1]))

# make sure we can read the output easily without scrolling to the side too much


char_list
<s> t h i s _ p r o c e s s , _ h o w e v e r , _ a f f o r d e d _ m e _ n o _ m e a n s _ o f _ a s c e r t a i n i n g _ t h e _ d i m e n s i o n s _ o f _ m y _ d u n g e o n ; _ a s _ i _ m i g h t _ m a k e _ i t s _ c i r c u i t , _ a n d _ r e t u r n _ t o _ t h e _ p o i n t _ w h e n c e _ i _ s e t _ o u t , _ w i t h o u t _ b e i n g _ a w a r e _ o f _ t h e _ f a c t ; _ s o _ p e r f e c t l y _ u n i f o r m _ s e e m e d _ t h e _ w a l l . </s>
<s> i t _ n e v e r _ o n c e _ o c c u r r e d _ t o _ m e _ t h a t _ t h e _ f u m b l i n g _ m i g h t _ b e _ a _ m e r e _ m i s t a k e . </s>
word_list
<s> this process , however , afforded me no means of ascertaining the dimensions of my dungeon ; as i might make its circuit , and return to the point whence i set out , without being aware of the fact ; so perfectly uniform seemed the wall . </s>
<s> it never once occurred to me that the fumbling might be a mere mistake . </s>
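
The character-level format above can be reproduced with a few lines of plain Python. This is only a sketch: `tokenize_by_char` is a hypothetical helper, not the actual implementation inside `nutils.read_file_spooky`, and it assumes the text is lowercased, every space becomes `_`, and each sentence is wrapped in `<s>` and `</s>`.

```python
def tokenize_by_char(sentence: str, space_token: str = "_") -> list[str]:
    """Hypothetical helper: split a sentence into character tokens."""
    # Lowercase the text and replace each space with the space token.
    chars = [space_token if ch == " " else ch for ch in sentence.lower()]
    # Wrap the sentence in begin/end markers, as in the output above.
    return ["<s>"] + chars + ["</s>"]


tokens = tokenize_by_char("It never once")
print(" ".join(tokens))  # <s> i t _ n e v e r _ o n c e </s>
```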


8. What character represents spaces when we tokenize by character? __The underscore character__
9. Read the word2vec documentation. What do the following parameters signify?
    - embeddings_size: __The number of dimensions in each word vector (passed to gensim's `Word2Vec` as `vector_size`)__
    - window: __The maximum distance, in tokens, between the target word and a context word within a sentence__
    - min_count: __The minimum number of times a token must appear in the corpus to be kept in the vocabulary; less frequent tokens are ignored__
    - sg: __The training algorithm: 1 selects skip-gram, 0 selects CBOW__
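
To make `window` and `min_count` concrete, here is a simplified, pure-Python sketch of how skip-gram (target, context) pairs could be enumerated. The function `skipgram_pairs` and the toy corpus are made up for illustration; gensim's real training does more than this (sampling, for example), so treat it as an approximation of the idea rather than the library's algorithm.

```python
from collections import Counter


def skipgram_pairs(sentences, window=5, min_count=1):
    """Hypothetical helper: list (target, context) pairs for skip-gram."""
    # Count every token so that rare ones can be dropped (min_count).
    counts = Counter(tok for sent in sentences for tok in sent)
    pairs = []
    for sent in sentences:
        kept = [tok for tok in sent if counts[tok] >= min_count]
        for i, target in enumerate(kept):
            # Context = up to `window` tokens on each side of the target.
            lo, hi = max(0, i - window), min(len(kept), i + window + 1)
            pairs.extend((target, kept[j]) for j in range(lo, hi) if j != i)
    return pairs


toy = [["the", "wall", "seemed", "uniform"], ["the", "dungeon"]]
narrow = skipgram_pairs(toy, window=1, min_count=1)
strict = skipgram_pairs(toy, window=1, min_count=2)
print(len(narrow), len(strict))  # 8 0
```

With `min_count=2` only "the" survives, and a one-token sentence has no context, so no pairs remain.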

In [None]:
# 10 points
# create your word embeddings
# use the skip gram algorithm and a window size of 5
# min_count should be 1
# takes ~3.3 sec on Felix's computer for character embeddings using skip-gram with window size 5
# takes ~3.3 sec on Felix's computer for word embeddings using skip-gram with window size 5 


def train_word2vec(data: list[list[str]], embeddings_size: int,
                    window: int = 5, min_count: int = 1, sg: int = 1) -> Word2Vec:
    """
    Create new word embeddings based on our data.

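    Params:
        data: The corpus, as a list of tokenized sentences
        embeddings_size: The dimensions in each embedding
        window: The maximum distance between a target token and its context tokens
        min_count: The minimum frequency a token needs to be kept in the vocabulary
        sg: 1 to train with skip-gram, 0 to train with CBOW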

    Returns:
        A gensim Word2Vec model
        https://radimrehurek.com/gensim/models/word2vec.html

    """

    model = Word2Vec(data, vector_size=embeddings_size, window=window, min_count=min_count, sg=sg)
    return model


# After you are happy with this function, copy + paste it into the bottom of 
# your neurallm_utils.py file
# You'll need it for the next task!
def create_embedder(raw_embeddings: Word2Vec) -> torch.nn.Embedding:
    """
    Create a PyTorch embedding layer based on our data.

    We will *first* train a Word2Vec model on our data.
    Then, we'll use these weights to create a PyTorch embedding layer.
        `nn.Embedding.from_pretrained(weights)`


    PyTorch docs: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding.from_pretrained
    Gensim Word2Vec docs: https://radimrehurek.com/gensim/models/word2vec.html

    Pay particular attention to the *types* of the weights and the types required by PyTorch.

    Params:
        raw_embeddings: A trained gensim Word2Vec model

    Returns:
        A PyTorch embedding layer
    """

    # Hint:
    # For later tasks, we'll need two mappings: One from token to index, and one from index to tokens.
    # It might be a good idea to store these as properties of your embedder.
    # e.g. `embedder.token_to_index = ...`

    # Create mappings
    
    #get word vectors
    word_vectors = raw_embeddings.wv.vectors  
    print("rawembeddings ")

    print(word_vectors)
    #convert to tensor 
    wv_tensor = torch.tensor(word_vectors, dtype=torch.float32)
    #pass in new weights  
    embedding = torch.nn.Embedding.from_pretrained(wv_tensor)

    # Build both mappings from gensim's vocabulary. Row i of the weight
    # matrix is the vector for index_to_key[i], so these indices line up
    # with the rows of the embedding layer.
    token_to_index = dict()
    index_to_token = dict()
    for token in raw_embeddings.wv.index_to_key:
        index = raw_embeddings.wv.key_to_index[token]
        token_to_index[token] = index
        index_to_token[index] = token

    embedding.token_to_index = token_to_index
    embedding.index_to_token = index_to_token
    return embedding
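
The two mappings are what later code would use to turn tokens into the integer indices an embedding layer looks up, and back again. The sketch below uses a tiny hand-written vocabulary instead of a real embedder, and `encode` and `decode` are hypothetical helper names, not part of `nutils`.

```python
def encode(tokens: list[str], token_to_index: dict[str, int]) -> list[int]:
    """Hypothetical helper: map tokens to their embedding row indices."""
    return [token_to_index[tok] for tok in tokens]


def decode(indices: list[int], index_to_token: dict[int, str]) -> list[str]:
    """Hypothetical helper: map row indices back to tokens."""
    return [index_to_token[i] for i in indices]


# Toy vocabulary standing in for embedder.token_to_index / index_to_token.
toy_token_to_index = {"_": 0, "e": 1, "t": 2, "a": 3}
toy_index_to_token = {i: tok for tok, i in toy_token_to_index.items()}

ids = encode(["t", "e", "a"], toy_token_to_index)
print(ids)                              # [2, 1, 3]
print(decode(ids, toy_index_to_token))  # ['t', 'e', 'a']
```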

In [7]:

# Create and save both sets (word and character based) of Word2Vec embeddings. 
# Use the provided utility functions in nutils.
# These will be (re)loaded in the next notebook.

trained_word = train_word2vec(answer2, EMBEDDINGS_SIZE)
nutils.save_word2vec(trained_word, EMBEDDING_SAVE_FILE_WORD)

trained_char = train_word2vec(answer1, EMBEDDINGS_SIZE)
nutils.save_word2vec(trained_char, EMBEDDING_SAVE_FILE_CHAR)

In [8]:
# load them in again to make sure that this works and is still fast
word2Vec_word = nutils.load_word2vec(EMBEDDING_SAVE_FILE_WORD)
word2Vec_char = nutils.load_word2vec(EMBEDDING_SAVE_FILE_CHAR)

In [9]:
# now create the embedders
# e1 is the word-level embedder, e2 is the character-level embedder
e1 = create_embedder(word2Vec_word)

e2 = create_embedder(word2Vec_char)

rawembeddings 
[[ 0.0103383   0.07689469  0.07484731 ...  0.23047149  0.07107486
   0.09929769]
 [-0.2284146   0.11586719 -0.08179347 ... -0.12775023  0.30214933
   0.21500851]
 [-0.30937526  0.13488829  0.05041312 ... -0.1863659   0.05114806
   0.38927856]
 ...
 [ 0.02430332 -0.00233489 -0.05374633 ... -0.07021384  0.02596888
   0.05717675]
 [ 0.04168672  0.03614822 -0.09234766 ... -0.07654018  0.05800839
   0.06390785]
 [ 0.01302757 -0.04204199 -0.07840345 ... -0.03239243  0.0216551
   0.05331475]]
rawembeddings 
[[ 0.07127578 -0.07838666  0.08876693 ...  0.03249672 -0.22411497
  -0.20206049]
 [ 0.1579031   0.05950294  0.14582403 ...  0.10637886  0.03768904
   0.02200111]
 [ 0.12050118  0.04857738  0.15570727 ...  0.00399799 -0.00566519
   0.04093249]
 ...
 [-0.04403063  0.05896317 -0.03024507 ...  0.0322051   0.06146831
   0.04918491]
 [-0.03361075  0.01475725 -0.01489315 ...  0.06451239  0.04597827
   0.02145318]
 [-0.02219469  0.04958931 -0.01531201 ... -0.00037175  0.03097666
   

In [10]:
# take a look at your saved token to index and index to token mappings in your embedders to make sure they make sense
# AND that they are both dictionaries mapping from int to str or vice versa!
# don't leave a ton of output in your notebook when you turn it in, but you need to understand this,
# and it's an easy place to make a mistake that's hard to debug later.
# do leave whatever code you use here, comment it out if it produces a lot of output

# the word-level mappings (e1) have ~25k entries, so they are commented out
# print(e1.index_to_token)
# print(e1.token_to_index)
print(e2.index_to_token)
print(e2.token_to_index)


{0: '_', 1: 'e', 2: 't', 3: 'a', 4: 'o', 5: 'n', 6: 'i', 7: 's', 8: 'h', 9: 'r', 10: 'd', 11: 'l', 12: 'u', 13: 'm', 14: 'c', 15: 'f', 16: 'w', 17: 'y', 18: 'g', 19: 'p', 20: ',', 21: 'b', 22: 'v', 23: '.', 24: '</s>', 25: '<s>', 26: 'k', 27: ';', 28: '"', 29: 'x', 30: "'", 31: 'q', 32: 'j', 33: 'z', 34: '?', 35: ':', 36: 'é', 37: 'æ', 38: 'ê', 39: 'ö', 40: 'è', 41: 'ë', 42: 'à', 43: 'ô', 44: 'ñ', 45: 'ä', 46: 'ï', 47: 'â', 48: 'ü', 49: 'ο', 50: 'ἶ', 51: 'δ', 52: 'α', 53: 'π', 54: 'ν', 55: 'υ', 56: 'ς', 57: 'å', 58: 'ç', 59: 'î'}
{'_': 0, 'e': 1, 't': 2, 'a': 3, 'o': 4, 'n': 5, 'i': 6, 's': 7, 'h': 8, 'r': 9, 'd': 10, 'l': 11, 'u': 12, 'm': 13, 'c': 14, 'f': 15, 'w': 16, 'y': 17, 'g': 18, 'p': 19, ',': 20, 'b': 21, 'v': 22, '.': 23, '</s>': 24, '<s>': 25, 'k': 26, ';': 27, '"': 28, 'x': 29, "'": 30, 'q': 31, 'j': 32, 'z': 33, '?': 34, ':': 35, 'é': 36, 'æ': 37, 'ê': 38, 'ö': 39, 'è': 40, 'ë': 41, 'à': 42, 'ô': 43, 'ñ': 44, 'ä': 45, 'ï': 46, 'â': 47, 'ü': 48, 'ο': 49, 'ἶ': 50, 'δ': 51, 
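
Reading the dictionaries by eye works for 60 characters but not for the word vocabulary. One possible way to check the same properties programmatically is sketched below; `mappings_are_consistent` is a hypothetical helper and the dictionaries are toy stand-ins for the embedder attributes.

```python
def mappings_are_consistent(token_to_index: dict, index_to_token: dict) -> bool:
    """Hypothetical check: str -> int and int -> str, and mutually inverse."""
    # Types: tokens must be str, indices must be int, in both dictionaries.
    if not all(isinstance(t, str) and isinstance(i, int)
               for t, i in token_to_index.items()):
        return False
    if not all(isinstance(i, int) and isinstance(t, str)
               for i, t in index_to_token.items()):
        return False
    # Indices should be exactly 0..n-1 so they match embedding rows.
    if sorted(index_to_token) != list(range(len(index_to_token))):
        return False
    # Each mapping must undo the other.
    return (len(token_to_index) == len(index_to_token)
            and all(index_to_token[i] == t for t, i in token_to_index.items()))


good_t2i = {"_": 0, "e": 1, "t": 2}
good_i2t = {0: "_", 1: "e", 2: "t"}
bad_i2t = {"0": "_", "1": "e", "2": "t"}  # keys are strings by mistake

print(mappings_are_consistent(good_t2i, good_i2t))  # True
print(mappings_are_consistent(good_t2i, bad_i2t))   # False
```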

In [11]:
# 4 points
# print out the vocabulary size for your embeddings for both your word
# embeddings and your character embeddings
# label which is which when you print them out
print("vocab size for word")
print(len(e1.token_to_index))
print("vocab size for chars")
print(len(e2.token_to_index))

vocab size for word
25374
vocab size for chars
60
