<a href="https://colab.research.google.com/github/muskanrath30/muskanrath30/blob/main/WordEmbeddingPlayground.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Class Exercise: Let’s play with Word Embeddings!

A word embedding is a numerical representation of a word. Embeddings are vectors with some fixed number of values, where each value is a real-valued float that can be positive or negative. We use `representation`, `embedding` and `vector` pretty much interchangeably. They all mean the same thing: a list of numbers that is meant to represent a word.

In this class exercise, we will use common libraries to load, view and interact with some word embeddings. 


Thanks to [this COMP6248 lab notebook](https://colab.research.google.com/github/ecs-vlc/COMP6248/blob/master/docs/labs/lab7/7_2_WordEmbeddings.ipynb#scrollTo=vWwACkpZx2Eu) for inspiration.

# Load GloVe vectors
First, we will load some word embeddings that have already been trained. The original GloVe paper introduced vectors of different sizes (the number of float values used to represent each word). We will use the ones of size 100. This size is also referred to as the number of `dimensions`.



In [None]:
import torchtext.vocab, torch

glove = torchtext.vocab.GloVe(name='6B', dim=100)

print(f'There are {len(glove.itos)} words in the vocabulary')

Each unique word is assigned an `index`. For example, the word `the` is assigned an index of 0. 

In [None]:
# Fetch the integer index associated with the word.
glove.stoi['the']

In [None]:
# Fetch the word associated with an integer index.
glove.itos[0]

Embeddings are organized as a 2D matrix, with one row per word.

In [None]:
glove.vectors.shape

If you pay close attention, you will recognize that the second item in the output above, `100`, is the number of dimensions used to represent each word.
The first item is the number of unique words for which we have embeddings. This set of words is referred to as the `vocabulary`, and its count as the vocabulary size.
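To make the relationship between `stoi`, `itos` and the embedding matrix concrete, here is a minimal sketch on a toy vocabulary. The three words and their 4-dimensional values are made up for illustration and are not real GloVe values; only the layout mirrors `glove.itos`, `glove.stoi` and `glove.vectors`.

```python
import torch

# Toy stand-ins for glove.itos, glove.stoi and glove.vectors (hypothetical data).
itos = ["the", "cat", "sat"]                      # index -> word
stoi = {word: i for i, word in enumerate(itos)}   # word -> index
vectors = torch.tensor([
    [0.1, -0.2, 0.3, 0.0],
    [0.5, 0.4, -0.1, 0.2],
    [-0.3, 0.1, 0.0, 0.7],
])

# Rows correspond to words, columns to dimensions.
vocab_size, dims = vectors.shape

# Looking up a word means selecting its row in the matrix.
cat_vector = vectors[stoi["cat"]]

print(vocab_size, dims)   # 3 4
print(cat_vector.shape)   # torch.Size([4])
```

The real matrix works the same way, just with many more rows and 100 columns.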

# Methods to be implemented


In [None]:
def get_embedding(glove, word):
  # First, identify the index associated with the chosen word.
  # Then, fetch the corresponding embedding from glove.vectors 
  return glove.vectors[glove.stoi[word]]

In [None]:
def nearest_neighbor(glove, query_embedding, n=10):
  # Find the nearest neighbors for a given embedding. Here's the information you will need.
  # torch.dist(v1, v2): returns the Euclidean (L2) distance between two vectors.

  # First, iterate through all words using the itos list introduced above.
  # Compute the distance between each word's embedding and the query_embedding.
  # Sort the words by that distance and return the n nearest (word, distance) pairs.
  distances = [(w, torch.dist(query_embedding, get_embedding(glove, w)).item()) for w in glove.itos]
  return sorted(distances, key = lambda w: w[1])[:n]
  

In [None]:
nearest_neighbor(glove, get_embedding(glove, "india"))
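The loop above calls `torch.dist` once per vocabulary word, which can be slow on a large vocabulary. As an optional alternative, here is a sketch that computes all distances in one batched operation. The function name `nearest_neighbor_vectorized` and the toy data are assumptions made for illustration; it is not part of the exercise and should return the same ranking as the loop version, up to ties.

```python
import torch

def nearest_neighbor_vectorized(itos, vectors, query_embedding, n=10):
    # Broadcasting subtracts the query from every row; the norm over dim=1
    # gives one Euclidean distance per vocabulary word.
    distances = torch.norm(vectors - query_embedding, dim=1)
    # topk with largest=False returns the smallest distances in ascending order.
    values, indices = torch.topk(distances, k=min(n, len(itos)), largest=False)
    return [(itos[i], d) for i, d in zip(indices.tolist(), values.tolist())]

# Toy vocabulary and 2-D vectors (hypothetical data, not real GloVe values).
toy_itos = ["cat", "dog", "car", "truck"]
toy_vectors = torch.tensor([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

result = nearest_neighbor_vectorized(toy_itos, toy_vectors, toy_vectors[0], n=2)
print(result)
```

With the loaded vectors, the equivalent call would be `nearest_neighbor_vectorized(glove.itos, glove.vectors, get_embedding(glove, "india"))`.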

Once the above methods are implemented, play around with them. Specifically, get the nearest neighbors for the following words:

- bank
- seattle
- jaguar
- neural
- brazil
- france
- india
- [other words of your choice]

In [None]:
nearest_neighbor(glove, get_embedding(glove, "bank"))

In [None]:
nearest_neighbor(glove, get_embedding(glove, "seattle"))

In [None]:
nearest_neighbor(glove, get_embedding(glove, "jaguar"))

In [None]:
nearest_neighbor(glove, get_embedding(glove, "neural"))

In [None]:
nearest_neighbor(glove, get_embedding(glove, "brazil"))

In [None]:
nearest_neighbor(glove, get_embedding(glove, "malayalam"))

In [None]:
nearest_neighbor(glove, get_embedding(glove, "odia"))

In [None]:
nearest_neighbor(glove, get_embedding(glove, "maharashtra"))

In [None]:
nearest_neighbor(glove, get_embedding(glove, "tamil"))

In [None]:
nearest_neighbor(glove, get_embedding(glove, "france"))

In [None]:
nearest_neighbor(glove, get_embedding(glove, "apple"))

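In the above examples, we may run into issues with polysemous words, that is, words with multiple related meanings. Each word gets exactly one vector, so all of its senses share a single embedding, and the neighbors of a word such as `bank` may reflect only one sense or a mix of senses.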

# Analogies

One of the earliest and most surprising findings about word embeddings was their ability to solve simple analogies using vector arithmetic. Let's first implement the method.


In [None]:
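def analogy(glove, word1, word2, word3, n=5):
  # word1 : word2 = word3 : ?  ->  search near word2 - word1 + word3.
  target = get_embedding(glove, word2) - get_embedding(glove, word1) + get_embedding(glove, word3)
  # Request n + 3 neighbors so that n remain after dropping the three input words.
  candidate_words = nearest_neighbor(glove, target, n=n + 3)
  filtered_candidate_words = [x for x in candidate_words if x[0] not in [word1, word2, word3]]
  return filtered_candidate_words[:n]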


In [None]:
analogy(glove, "france", "paris", "india")  # this means france : paris = india : <which word?>
# analogy() uses the nearest_neighbor method as well.
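To see why the arithmetic in `analogy` can work, here is a small sketch on hand-made 2-D embeddings. The values are hypothetical and chosen so that one axis loosely stands for "royalty" and the other for "gender"; real GloVe dimensions are not interpretable this cleanly.

```python
import torch

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "man":   torch.tensor([0.0, 1.0]),
    "woman": torch.tensor([0.0, -1.0]),
    "king":  torch.tensor([1.0, 1.0]),
    "queen": torch.tensor([1.0, -1.0]),
    "apple": torch.tensor([-1.0, 0.0]),
}

# man : king = woman : ?  ->  king - man + woman
target = toy["king"] - toy["man"] + toy["woman"]

# Pick the closest word that is not one of the three inputs.
candidates = {
    word: torch.dist(target, vector).item()
    for word, vector in toy.items()
    if word not in ("man", "king", "woman")
}
answer = min(candidates, key=candidates.get)
print(answer)  # queen
```

Subtracting `man` from `king` leaves only the "royalty" offset, and adding it to `woman` lands exactly on `queen` in this toy space. With real embeddings the result only lands near the answer, which is why a nearest neighbor search is needed.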

Try the above method to solve the following analogical problems:

1. man : king :: woman : ___ ?
2. france : paris :: brazil : ___ ?
3. good : best :: funny : ___ ?
4. baseball : diamond :: tennis : ___ ?


In [None]:
analogy(glove, "man", "king", "woman")

In [None]:
analogy(glove, "france", "paris", "brazil")

In [None]:
analogy(glove, "good", "best", "funny")

In [None]:
analogy(glove, "baseball", "diamond", "tennis")

In [None]:
# Expected to raise a KeyError: a multi-word phrase is not a single token in glove.stoi.
analogy(glove, "india", "maharashtra", "united states of america")

In [None]:
analogy(glove, "india", "maharashtra", "usa")

We observed that word lookups are case-sensitive. The people who developed this model included only lowercase words in the vocabulary, so capitalized forms, such as the names of entities, are not found.
Each vocabulary entry is a single token. Phrases made of multiple tokens weren't considered.

`united states of america` refers to the same entity as `usa`, but the multi-word name has no embedding of its own, so the two names cannot be matched through their embeddings.
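One way to avoid surprises from these two limitations is to normalize the input and check the vocabulary before fetching an embedding. The helper below is a sketch; `find_token` and the toy mapping are hypothetical names introduced here, with the mapping standing in for `glove.stoi`.

```python
def find_token(stoi, text):
    """Return the vocabulary entry for text, or None if it has no embedding."""
    token = text.lower()  # the vocabulary contains only lowercase words
    return token if token in stoi else None

# Toy word-to-index mapping (hypothetical; stands in for glove.stoi).
toy_stoi = {"usa": 0, "india": 1, "united": 2, "states": 3}

print(find_token(toy_stoi, "India"))                     # india
print(find_token(toy_stoi, "united states of america"))  # None
```

Returning `None` instead of raising lets the caller decide what to do with a missing word, for example skipping it or asking for a single-token alternative such as `usa`.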