<h1>Embedding Words and Types<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Why-Learn-Embeddings?" data-toc-modified-id="Why-Learn-Embeddings?-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Why Learn Embeddings?</a></span><ul class="toc-item"><li><span><a href="#Efficiency-of-Embeddings" data-toc-modified-id="Efficiency-of-Embeddings-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Efficiency of Embeddings</a></span></li><li><span><a href="#Approaches-to-Learning-Word-Embeddings" data-toc-modified-id="Approaches-to-Learning-Word-Embeddings-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Approaches to Learning Word Embeddings</a></span></li><li><span><a href="#The-Practical-Use-of-Pretrained-Word-Embeddings" data-toc-modified-id="The-Practical-Use-of-Pretrained-Word-Embeddings-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>The Practical Use of Pretrained Word Embeddings</a></span></li></ul></li><li><span><a href="#Example:-Learning-the-Continous-Bag-of-Words-Embeddings" data-toc-modified-id="Example:-Learning-the-Continous-Bag-of-Words-Embeddings-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Example: Learning the Continous Bag of Words Embeddings</a></span></li></ul></div>

## Introduction

*Representation learning*, or *embedding*, refers to learning a mapping from a discrete type to a point in a vector space. When the discrete types are words, the dense vector representation is called a _word embedding_. TF-IDF (Term Frequency-Inverse Document Frequency) is an example of a _count-based embedding_ method.

## Why Learn Embeddings?

- The count-based representations are also called _distributional representations_ because their significant content or meaning is represented by multiple dimensions in the vector. These representations are not learned from the data but heuristically constructed.
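
To make "heuristically constructed" concrete, here is a minimal sketch of one TF-IDF variant (raw term count times `log(N / df)`). The toy corpus and the helper name `tfidf_vector` are assumptions for illustration, and other TF-IDF weightings exist. Nothing is learned: each weight comes from a fixed formula, and every vector has one dimension per vocabulary term.

```python
import math
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [doc.split() for doc in corpus]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
n_docs = len(tokenized)


def tfidf_vector(tokens):
    """Return one weight per vocabulary term: raw count * log(N / df)."""
    counts = Counter(tokens)
    return [counts[term] * math.log(n_docs / doc_freq[term]) for term in vocab]


tfidf_vectors = [tfidf_vector(doc) for doc in tokenized]

# Each vector is as long as the vocabulary and mostly zeros.
print(len(vocab), tfidf_vectors[0])
```

Even on this tiny corpus, most entries of each vector are zero, which is the sparsity that learned low-dimensional embeddings avoid.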

**Benefits of Low Dimensional Learned Representations:**
- Reducing the dimensionality is computationally efficient.
- Count-based representations result in high-dimensional vectors that redundantly encode similar information along many dimensions and do not share statistical strength.
- Very high-dimensional inputs can cause real problems in machine learning and optimisation, a phenomenon often called the _curse of dimensionality_.
- Representations learned from task-specific data can be optimal for the task at hand.

### Efficiency of Embeddings

Multiplying a one-hot vector by a weight matrix simply selects the row of the matrix indicated by the nonzero entry, so the multiplication can be replaced by a direct row lookup.

![Figure 5.1](../images/figure_5_1.png)
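
The equivalence between the one-hot multiplication and a row lookup can be checked with a small NumPy sketch. The matrix sizes, the random weights, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

vocab_size, embedding_dim = 5, 3
rng = np.random.default_rng(0)

# Hypothetical weight matrix: one row per word in the vocabulary.
W = rng.normal(size=(vocab_size, embedding_dim))

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Full matrix product: touches every row of W.
via_matmul = one_hot @ W

# Direct lookup: reads a single row of W.
via_lookup = W[word_index]

print(np.allclose(via_matmul, via_lookup))
```

The lookup reads `embedding_dim` numbers, while the product multiplies and sums over all `vocab_size` rows, which is why embedding layers are usually implemented as index lookups.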

### Approaches to Learning Word Embeddings

Auxiliary Tasks used to train Word Embeddings:
- Given a sequence of words, predict the next word. This is also called the _language modeling task_.
- Given a sequence of words before and after, predict the missing word.
- Given a word, predict words that occur within a window, independent of the position.
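
The three auxiliary tasks above differ mainly in how training examples are cut from a token sequence. The sketch below builds (input, target) pairs for each task. The function names, the default window sizes, and the example sentence are assumptions made for illustration. The third task is commonly known as skip-gram.

```python
def language_modeling_pairs(tokens, context_size=2):
    """Previous `context_size` words -> next word."""
    return [
        (tokens[i - context_size:i], tokens[i])
        for i in range(context_size, len(tokens))
    ]


def cbow_pairs(tokens, window=2):
    """Words before and after a position -> the missing word at that position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Center word -> each word inside the window, ignoring position."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the quick brown fox jumps".split()
print(language_modeling_pairs(tokens)[0])
print(cbow_pairs(tokens)[2])
print(skipgram_pairs(tokens, window=1)[:3])
```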

### The Practical Use of Pretrained Word Embeddings

In [4]:
# Loading Embeddings
# Download Embeddings file from https://www.kaggle.com/danielwillgeorge/glove6b100dtxt?select=glove.6B.100d.txt
%load_ext nb_black

import numpy as np
from annoy import AnnoyIndex


class PreTrainedEmbeddings(object):
    def __init__(self, word_to_index, word_vectors):
        """
        Args:
            word_to_index: mapping from word to integers.
            word_vectors: list of numpy array.
        """
        self.word_to_index = word_to_index
        self.word_vectors = word_vectors
        self.index_to_word = {v: k for k, v in self.word_to_index.items()}
        self.index = AnnoyIndex(len(word_vectors[0]), metric="euclidean")
        for _, i in self.word_to_index.items():
            self.index.add_item(i, self.word_vectors[i])
        self.index.build(50)

    @classmethod
    def from_embeddings_file(cls, embedding_file):
        """
        Init from pretrained vector file.

        Vector file should be of the format:
            word0 x0_0 x0_1 x0_2 ... x0_N
            word1 x1_0 x1_1 x1_2 ... x1_N

        Args:
            embedding_file: location of the file
        Returns:
            instance of PreTrainedEmbeddings
        """
        word_to_index, word_vectors = {}, []
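        with open(embedding_file, encoding="utf-8") as fp:
            # Iterate lazily instead of loading the whole file with readlines().
            for line in fp:
                # Drop the trailing newline before splitting on spaces.
                line = line.rstrip("\n").split(" ")
                word = line[0]
                vec = np.array([float(x) for x in line[1:]])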

                word_to_index[word] = len(word_to_index)
                word_vectors.append(vec)
        return cls(word_to_index=word_to_index, word_vectors=word_vectors)

    def get_embedding(self, word):
        """
        Args:
            word: Input word to get embedding for.
        Returns:
            an embedding for given word
        """
        return self.word_vectors[self.word_to_index[word]]

    def get_closed_to_vector(self, vector, n=1):
        """
        Given a vector, return its n nearest neighbors.

        Args:
            vector: should match the size of the vectors in the Annoy Index.
            n: the number of neighbors to return
        Returns:
            Unsorted list of words nearest to the given vector.
        """
        nn_indices = self.index.get_nns_by_vector(vector, n)
        return [self.index_to_word[neighbor] for neighbor in nn_indices]

    def compute_and_print_analogy(self, word1, word2, word3):
        """
        Prints the solutions to analogies using word embeddings.

        Analogies are: word1 is to word2 as word3 is to __
        This method will print: word1 : word2 :: word3 : word4

        Args:
            word1, word2, word3
        """
        vec1 = self.get_embedding(word1)
        vec2 = self.get_embedding(word2)
        vec3 = self.get_embedding(word3)

        spatial_relationship = vec2 - vec1
        vec4 = vec3 + spatial_relationship

        closed_words = self.get_closed_to_vector(vec4, n=4)
        existing_words = set([word1, word2, word3])
        closed_words = [word for word in closed_words if word not in existing_words]
        if len(closed_words) == 0:
            print("Could not find nearest neighbors for the vector!")
            return
        for word4 in closed_words:
            print(f"{word1}:{word2} :: {word3}:{word4}")


embeddings = PreTrainedEmbeddings.from_embeddings_file("../data/glove.6B.100d.txt")

In [15]:
# Relationships between word embeddings

# Relationship 1: the relationship between gendered nouns and pronouns
print("the relationship between gendered nouns and pronouns")
embeddings.compute_and_print_analogy("man", "he", "woman")
print()

# Relationship 2: Verb-noun relationships
print("Verb-noun relationships")
embeddings.compute_and_print_analogy("fly", "plane", "sail")
print()

#  Relationship 3: Noun-noun relationships
print("Noun-noun relationships")
embeddings.compute_and_print_analogy("cat", "kitten", "dog")
print()

# Relationship 4: Hypernymy (broader category)
print("Hypernymy (broader category)")
embeddings.compute_and_print_analogy("blue", "color", "dog")
print()

# Relationship 5: Meronymy (part-to-whole)
print("Meronymy (part-to-whole)")
embeddings.compute_and_print_analogy("toe", "foot", "finger")
print()

# Relationship 6: Troponymy (difference in manner)
print("Troponymy (difference in manner)")
embeddings.compute_and_print_analogy("talk", "communicate", "read")
print()

# Relationship 7: Metonymy (convention / figures of speech)
print("Metonymy (convention / figures of speech)")
embeddings.compute_and_print_analogy("blue", "democrat", "red")
print()

# Relationship 8: Adjectival scales
print("Adjectival scales")
embeddings.compute_and_print_analogy("fast", "fastest", "young")
print()

the relationship between gendered nouns and pronouns
man:he :: woman:she
man:he :: woman:never

Verb-noun relationships
fly:plane :: sail:ship
fly:plane :: sail:vessel

Noun-noun relationships
cat:kitten :: dog:puppy
cat:kitten :: dog:puppies
cat:kitten :: dog:toddler

Hypernymy (broader category)
blue:color :: dog:behavior
blue:color :: dog:touch
blue:color :: dog:viewer

Meronymy (part-to-whole)
toe:foot :: finger:ground
toe:foot :: finger:pointing

Troponymy (difference in manner)
talk:communicate :: read:interpret
talk:communicate :: read:typed
talk:communicate :: read:correctly
talk:communicate :: read:instructions

Metonymy (convention / figures of speech)
blue:democrat :: red:republican
blue:democrat :: red:congressman
blue:democrat :: red:senator

Adjectival scales
fast:fastest :: young:female
fast:fastest :: young:fellow
fast:fastest :: young:younger



In [17]:
embeddings.compute_and_print_analogy("fast", "fastest", "small")
embeddings.compute_and_print_analogy("man", "king", "woman")
embeddings.compute_and_print_analogy("man", "doctor", "woman")

fast:fastest :: small:smallest
fast:fastest :: small:large
man:king :: woman:queen
man:king :: woman:monarch
man:king :: woman:throne
man:doctor :: woman:nurse
man:doctor :: woman:physician


In [18]:
embeddings.compute_and_print_analogy("sachin", "cricket", "messi")

sachin:cricket :: messi:rugby
sachin:cricket :: messi:soccer
sachin:cricket :: messi:football
sachin:cricket :: messi:club


In [29]:
embeddings.compute_and_print_analogy("nifty", "sensex", "nasdaq")

nifty:sensex :: nasdaq:index
nifty:sensex :: nasdaq:composite
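
The arithmetic behind `compute_and_print_analogy` can be isolated in a small sketch that swaps Annoy's approximate index for an exact brute-force search. The hand-picked 2-D vectors and the helper name `solve_analogy` are assumptions for illustration. Real embeddings are learned and have many more dimensions.

```python
import numpy as np

# Hypothetical 2-D toy vectors chosen by hand so the analogy holds exactly.
toy_vectors = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([-2.0, -2.0]),
}


def solve_analogy(word1, word2, word3, vectors, n=1):
    """Return the n words closest to vec3 + (vec2 - vec1), excluding the inputs."""
    target = vectors[word3] + (vectors[word2] - vectors[word1])
    candidates = [w for w in vectors if w not in {word1, word2, word3}]
    # Exact search: sort every remaining word by Euclidean distance to the target.
    candidates.sort(key=lambda w: np.linalg.norm(vectors[w] - target))
    return candidates[:n]


print(solve_analogy("man", "king", "woman", toy_vectors))  # ['queen']
```

Brute-force search is exact but scans every vector, so an approximate index such as Annoy is the more practical choice for a vocabulary the size of the GloVe file.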


## Example: Learning the Continous Bag of Words Embeddings