This notebook heavily refers to the following resources:
1. https://www.youtube.com/watch?v=viZrOnJclY0&ab_channel=StatQuestwithJoshStarmer
2. https://www.youtube.com/watch?v=Qf06XDYXCXI&t=1070s&ab_channel=StatQuestwithJoshStarmer

Word embedding is the process of converting a word into a vector that can be processed by a neural network. Unlike a tokenizer, where the input is simply being converted to a random token, the embedding process aims to create an embedded dictionary where words of similar meaning will be ``closer" when embedded into a vector. Words may also carry different meaning, therefore we hope to assign more than 1 number to describe different words. In this notebook, we are going to train our own word embedder using neural network. 

Consider the following training data
1. Fruits are tasty.
2. Cakes are tasty.

In our training data, there are a total of 4 unique words: fruits, are, tasty, cakes. Using the one hot encoding method, we can assign each unique words with a one vector, and the output of the word with another vector. 

Input:
1. fruits - [1, 0, 0, 0]
2. are    - [0, 1, 0, 0]
3. tasty  - [0, 0, 1, 0]
4. cakes  - [0, 0, 0, 1]

Label:
[0, 1, 0, 0]
[0, 0, 1, 0]
[0, 0, 0, 1]
[0, 1, 0, 0]

Consider the following 3 training data:
1. Fruits are tasty.
2. Fruits are delicious.
3. Fruits are tasty and delicious

The word 'are' has 2 possible outputs, tasty and delicious. One hot encoding is unable to encode this relation properly. This becomes a multi-label encoding scheme, where each sample can be associated with multiple binary labels, with each label indicating the presence or absence of a category. 

Label :
[0, 1, 0, 0, 0]
[0, 0, 1, 1, 0]
[0, 0, 0, 0, 1]
[0, 0, 0, 0, 1]
[0, 0, 1, 1, 0]




In pytorch, this is done through the torch.tensor() function


In [1]:
import os
import torch
from torch import nn
from torch.optim import Adam
from torch.distributions.uniform import Uniform
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split
import pytorch_lightning as pl
import pandas as pd
import matplotlib.pyplot as plt

inputs = torch.tensor([[1., 0., 0., 0.],
                       [0., 1., 0., 0.],
                       [0., 0., 1., 0.],
                       [0., 0., 0., 1.]])

labels = torch.tensor([[0., 1., 0., 0.],
                       [0., 0., 1., 0.],
                       [0., 0., 0., 1.],
                       [0., 1., 0., 0.]])

dataset = torch.utils.data.TensorDataset(inputs, labels)
dataloader = torch.utils.data.DataLoader(dataset)

In [2]:
class WordEmbedding(pl.LightningModule):
    def __init__(self):
        super().__init__() #this is for lightning module
        min_value = -0.5
        max_value = 0.5

        #initialize the weight for network with 2 nodes
        self.input1_w1 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.input1_w2 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.input2_w1 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.input2_w2 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.input3_w1 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.input3_w2 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.input4_w1 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.input4_w2 = nn.Parameter(Uniform(min_value, max_value).sample())

        self.output1_w1 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.output1_w2 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.output2_w1 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.output2_w2 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.output3_w1 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.output3_w2 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.output4_w1 = nn.Parameter(Uniform(min_value, max_value).sample())
        self.output4_w2 = nn.Parameter(Uniform(min_value, max_value).sample())

        #define loss function
        self.loss = nn.CrossEntropyLoss()

    def forward(self, input):
        #one-hot encodding
        #print(input)
        input = input[0] #make it into a vector instead of a tensor
        inputs_to_1st_hidden = ((input[0] * self.input1_w1) + 
                                (input[1] * self.input2_w1) +
                                (input[2] * self.input3_w1) +
                                (input[3] * self.input4_w1))
        inputs_to_2nd_hidden = ((input[0] * self.input1_w1) +
                                (input[1] * self.input2_w2) +
                                (input[2] * self.input3_w2) +
                                (input[3] * self.input4_w2))
        output1 = ((inputs_to_1st_hidden * self.output1_w1) +
                   (inputs_to_2nd_hidden *self.output1_w2))
        output2 = ((inputs_to_1st_hidden * self.output2_w1) +
                   (inputs_to_2nd_hidden *self.output2_w2))
        output3 = ((inputs_to_1st_hidden * self.output3_w1) +
                   (inputs_to_2nd_hidden *self.output3_w2))
        output4 = ((inputs_to_1st_hidden * self.output4_w1) +
                   (inputs_to_2nd_hidden *self.output4_w2))

        output_presoftmax = torch.stack([output1, output2, output3, output4])
        return(output_presoftmax)

    def configure_optimizers(self):
        return Adam(self.parameters(), lr = 0.1)

    def training_step(self,batch, batch_idx):
        input_i, label_i = batch
        output_i = self.forward(input_i)
        loss = self.loss(output_i, label_i[0])
        return loss
        

In [3]:
smallmodel = WordEmbedding()
print('Before parameterization, the parameters are :')
for name, param in smallmodel.named_parameters():
    print(name, param.data)      

Before parameterization, the parameters are :
input1_w1 tensor(-0.0614)
input1_w2 tensor(-0.1802)
input2_w1 tensor(0.0847)
input2_w2 tensor(-0.4365)
input3_w1 tensor(0.3641)
input3_w2 tensor(0.4052)
input4_w1 tensor(-0.4422)
input4_w2 tensor(0.1040)
output1_w1 tensor(0.1997)
output1_w2 tensor(-0.1207)
output2_w1 tensor(0.4642)
output2_w2 tensor(0.4567)
output3_w1 tensor(-0.2053)
output3_w2 tensor(-0.3855)
output4_w1 tensor(0.1262)
output4_w2 tensor(-0.4925)


In [4]:
#graphing the data

#put the weight into dictionary
data = {'w1': [smallmodel.input1_w1.item(),
               smallmodel.input2_w1.item(),
               smallmodel.input3_w1.item(),
               smallmodel.input4_w1.item()],
        'w2': [smallmodel.input1_w2.item(),
               smallmodel.input2_w2.item(),
               smallmodel.input3_w2.item(),
               smallmodel.input4_w2.item(),],
        'token' : ["Fruits", "are", "tasty", "Cakes"],
        'input' : ['input1', 'input2', 'input3', 'input4']
       }

df = pd.DataFrame(data)
#sns.scatterplot(data = df, x = 'w1', y = 'w2')
print(df)

         w1        w2   token   input
0 -0.061420 -0.180230  Fruits  input1
1  0.084740 -0.436456     are  input2
2  0.364129  0.405175   tasty  input3
3 -0.442168  0.104045   Cakes  input4


In [5]:
# training the neural network
trainer = pl.Trainer(max_epochs = 100)
trainer.fit(smallmodel, train_dataloaders = dataloader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
C:\Users\honlin\Anaconda3\lib\site-packages\pytorch_lightning\trainer\connectors\logger_connector\logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default

  | Name         | Type             | Params
--------------------------------------------------
0 | loss         | CrossEntropyLoss | 0     
  | other params | n/a              | 16    
--------------------------------------------------
16        Trainable params
0         Non-trainable params
16        Total params
0.000 

Training: |                                                                                      | 0/? [00:00<…

`Trainer.fit` stopped: `max_epochs=100` reached.


In [None]:
import matplotlib.pyplot as plt
#put the weight into dictionary
data = {'w1': [smallmodel.input1_w1.item(),
               smallmodel.input2_w1.item(),
               smallmodel.input3_w1.item(),
               smallmodel.input4_w1.item()],
        'w2': [smallmodel.input1_w2.item(),
               smallmodel.input2_w2.item(),
               smallmodel.input3_w2.item(),
               smallmodel.input4_w2.item(),],
        'token' : ["Fruits", "are", "tasty", "Cakes"],
        'input' : ['input1', 'input2', 'input3', 'input4']
       }

df = pd.DataFrame(data)
#sns.scatterplot(data = df, x = 'w1', y = 'w2')
print(df)
plt.figure()
df.plot()


         w1        w2   token   input
0  2.254447 -0.180230  Fruits  input1
1 -1.877416 -2.382924     are  input2
2  2.704787 -1.146925   tasty  input3
3 -1.164742  2.272458   Cakes  input4
