# Lab 3, Part 2 : Feed Forward Neural Language Models

# About this lab

In this session, you will experiment with feed-forward neural language models (FFLM) using [PyTorch](https://www.pytorch.org). To train the models, you will be using the [WikiText-2](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) corpus, which is a popular LM dataset introduced in 2016:

> The WikiText language modeling dataset is a collection of texts extracted from Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), `WikiText-2` is over 2 times larger. The dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.

Goals of this lab:
* Understand feed-forward networks (FFN)
* Train a feed-forward neural language model (FFLM)
* Use PyTorch

This part should take you 3 hours.

## Downloading Stuff & Setting Up the Environment

In [19]:
import sys
sys.executable


'C:\\ProgramData\\Anaconda3\\envs\\BookEx\\python.exe'

In [2]:
!wsl which bash


 
 
 C o p y r i g h t   ( c )   M i c r o s o f t   C o r p o r a t i o n .   T o u s   d r o i t s   r é s e r v é s . 
 
 
 
 
 
   U t i l i s a t i o n   :   w s l . e x e   [ A r g u m e n t ]   A r g u m e n t s   : 
 
 
 
 
 
 - - 
 
 
 
 
 
         i n s t a l l e r   < O p t i o n s > 
 
 
                 I n s t a l l e r   l e s   f o n c t i o n n a l i t é s   d u   s o u s - s y s t è m e   W i n d o w s   p o u r   L i n u x .   S i   a u c u n e   o p t i o n   n ' e s t   s p é c i f i é e ,   l e s   f o n c t i o n n a l i t é s   r e c o m m a n d é e s   
 
 
               s e r o n t   i n s t a l l é e s   a v e c   l a   d i s t r i b u t i o n   p a r   d é f a u t . 
 
 
 
 
 
                 P o u r   a f f i c h e r   l a   d i s t r i b u t i o n   p a r   d é f a u t   a i n s i   q u ' u n e   l i s t e   d ' a u t r e s   d i s t r i b u t i o n s   v a l i d e s ,   u t i l i s e z 
 
 
                   ' w s l   - - l i s t   - - o n l i n e ' 

In [3]:
!pip install bash_kernel # did not work either



In [4]:
%%bash
echo "Hel" # simple command to test if bash works

Hel


In [5]:
%pwd

'C:\\Users\\Mon PC\\DataspellProjects\\AI LAB\\AI LAB 3'

In [6]:
# Download the corpus. Simple commands work but not complex ones...
%%bash
URL="https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2"

for split in "train" "valid" "test";do
if [ ! -f "${split}.txt" ]; then
echo "Downloading ${split}.txt"
wget -q "${URL}/${split}.txt"
# Remove empty lines
sed -i '/^ *$/d' "${split}.txt"
# Remove article titles starting with = and ending with =
sed -i '/^ *= .* = $/d' "${split}".txt
fi
done

# Prepare smaller version for fast training neural LMs
head -n 5000 < train.txt > train_small.txt

# Print the first 10 lines with line numbers
cat -n train.txt | head -n10
echo

# Print some statistics
echo -e "\n   Line,   word,   character counts"
wc *.txt

SyntaxError: invalid syntax (589482835.py, line 5)

# Another Testing Phase

In [25]:
import os
import subprocess
subprocess.run(["cmd.exe","echo","hahaha"])

CompletedProcess(args=['cmd.exe', 'echo', 'hahaha'], returncode=0)

In [22]:
import subprocess
subprocess.run(["cmd.exe","wget", "-q", f"{url}/{split}.txt"])

NameError: name 'url' is not defined

In [7]:
#gave me hope but still not working
import subprocess
import os

url = "https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2"

for split in ["train", "valid", "test"]:
  if not os.path.isfile(f"{split}.txt"):
    print(f"Downloading {split}.txt")
    subprocess.run(["cmd.exe","wget", "-q", f"{url}/{split}.txt"])
    # Remove empty lines
    subprocess.run(["cmd.exe","sed", "-i", '/^ *$/d', f"{split}.txt"])
    # Remove article titles starting with = and ending with =
    subprocess.run(["cmd.exe","sed", "-i", '/^ *= .* = $/d', f"{split}.txt"])

# Prepare smaller version for fast training neural LMs
subprocess.run(["cmd.exe","head", "-n", "5000", "<", "train.txt", ">", "train_small.txt"])

# Print the first 10 lines with line numbers
subprocess.run(["cmd.exe","cat", "-n", "train.txt"], capture_output=True, text=True)
print()

# Print some statistics
output = subprocess.run(["cmd.exe","wc", "*.txt"], capture_output=True, text=True)
print("   Line,   word,   character counts")
print(output.stdout)


   Line,   word,   character counts
Microsoft Windows [version 10.0.22621.1413]
(c) Microsoft Corporation. All rights reserved.

(BookEx) C:\Users\Mon PC\DataspellProjects\AI LAB\AI LAB 3>


# End of the Tests
I had to run the bash commands manually in a bash prompt since I could not get them to work from the notebook.
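
For reference, the same preparation can be sketched in pure Python, which avoids depending on `wget`, `sed` and `head` on Windows. This is only a minimal sketch, not the lab's official setup: the helper names `clean_text` and `prepare_corpus` are hypothetical, the two regular expressions are my translation of the `sed` patterns above, and the URL is the one from the bash cell (assumed to be still reachable). Two likely causes of the failures above are that `%%bash` has to be the first line of a cell, and that `cmd.exe` only executes a command when it is passed after `/c`.

```python
import itertools
import os
import re
import urllib.request

# Same URL as in the bash cell above (assumed to be still reachable)
URL = "https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2"

def clean_text(text):
    """Drops blank lines and article titles of the form ' = Title = '."""
    kept = []
    for line in text.split("\n"):
        # Equivalent of sed '/^ *$/d': skip lines made of spaces only
        if re.fullmatch(r" *", line):
            continue
        # Equivalent of sed '/^ *= .* = $/d': skip title lines
        if re.fullmatch(r" *= .* = ", line):
            continue
        kept.append(line)
    return "\n".join(kept) + "\n" if kept else ""

def prepare_corpus(splits=("train", "valid", "test"), small_lines=5000):
    """Downloads and cleans each split, then writes train_small.txt."""
    for split in splits:
        fname = f"{split}.txt"
        if not os.path.isfile(fname):
            print(f"Downloading {fname}")
            with urllib.request.urlopen(f"{URL}/{fname}") as response:
                text = response.read().decode("utf-8")
            with open(fname, "w", encoding="utf-8", newline="\n") as f:
                f.write(clean_text(text))
    # Equivalent of: head -n 5000 < train.txt > train_small.txt
    with open("train.txt", encoding="utf-8") as src:
        head = list(itertools.islice(src, small_lines))
    with open("train_small.txt", "w", encoding="utf-8", newline="\n") as dst:
        dst.writelines(head)

# prepare_corpus()  # uncomment to download; requires network access
```

The download call is left commented out so that the cell can be run offline; only `clean_text` is exercised by default.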

In [1]:
%env CUBLAS_WORKSPACE_CONFIG=:4096:8

env: CUBLAS_WORKSPACE_CONFIG=:4096:8


In [2]:
import math
import time
import random
import numpy as np
# Fancy progress bar
from tqdm import tqdm
import torch
from torch import nn

###############
# Torch setup #
###############
print('Torch version: {}, CUDA: {}'.format(torch.__version__, torch.version.cuda))
cuda_available = torch.cuda.is_available()
if not torch.cuda.is_available():
  print('WARNING: You may want to change the runtime to GPU for Neural LM experiments!')
  DEVICE = 'cpu'
else:
  DEVICE = 'cuda:0'

#######################
# Some helper functions
#######################

def fix_seed(seed=None):
  """Sets the seeds of random number generators."""
  torch.use_deterministic_algorithms(True)
  if seed is None:
    # Take a random seed
    seed = time.time()
  seed = int(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  return seed

def readable_size(n):
  """Returns a readable size string for model parameters count.

  @:param n: number of parameters
  """
  sizes = ['K', 'M', 'G']
  fmt = ''
  size = n
  for i, suffix in enumerate(sizes):
    # Named `scaled` to avoid shadowing the `nn` module imported from torch
    scaled = n / (1000 ** (i + 1))
    if scaled >= 1:
      size = scaled
      fmt = suffix
    else:
      break
  return '%.2f%s' % (size, fmt)

Torch version: 2.0.0, CUDA: 11.8


# Feed-forward Language Models (FFLM)

FFLMs are similar to $n$-gram language models in the sense that the choice of $n$ is a hyperparameter of the network architecture. A basic FFLM constructs a context window of length $C=n-1$ before the word to be predicted. If the word embedding size is $E$, the feature vector for the context window is a vector of size $E\times C$, obtained by **concatenating** the individual word embeddings of the context words. Hence, the choice of $C$ affects the final number of learnable parameters in the network.
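
The concatenation described above can be illustrated with a toy example. This is only a sketch in plain Python lists, not the PyTorch code used later in the lab: the helper `context_vector`, the toy sizes and the random embedding table are all assumptions made for illustration.

```python
import random

def context_vector(context_ids, emb_table):
    """Concatenates the embeddings of the context words, in order."""
    vec = []
    for idx in context_ids:
        vec.extend(emb_table[idx])
    return vec

random.seed(0)
V, E, n = 10, 4, 3   # toy vocabulary size, embedding size and n-gram order
C = n - 1            # context window length

# Hypothetical embedding table: one vector of size E per vocabulary entry
emb_table = [[random.uniform(-1, 1) for _ in range(E)] for _ in range(V)]

ctx = context_vector([7, 2], emb_table)  # ids of the C context words
print(len(ctx))                          # E * C = 8
```

Increasing $C$ by one adds $E$ more entries to the context vector, which is why every layer that consumes it grows with $C$.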

## A - Dataset Stuff

### Representing the vocabulary

The `Vocabulary` class below encapsulates the **word-to-idx** and **idx-to-word** mappings that you should now be familiar with from the previous lab sessions. Read it to understand how the vocabulary is constructed from a plain text file within the `build_from_file()` method. Special `<.>` markers are also included in the vocabulary.

In [3]:

class Vocabulary(object):
  """Data structure representing the vocabulary of a corpus."""
  def __init__(self):
    # Mapping from tokens to integers
    self._word2idx = {}

    # Reverse-mapping from integers to tokens
    self.idx2word = []

    # 0-padding token
    self.add_word('<pad>')
    # sentence start
    self.add_word('<s>')
    # sentence end
    self.add_word('</s>')
    # Unknown words
    self.add_word('<unk>')

    self._unk_idx = self._word2idx['<unk>']

  def word2idx(self, word):
    """Returns the integer ID of the word or <unk> if not found.

    @:param word: word to be converted to index
    """
    return self._word2idx.get(word, self._unk_idx)

  def add_word(self, word):
    """Adds the `word` into the vocabulary.

    @:param word: word to be added to the vocabulary
    """
    if word not in self._word2idx:
      self.idx2word.append(word)
      self._word2idx[word] = len(self.idx2word) - 1

  def build_from_file(self, fname):
    """Builds a vocabulary from a given corpus file.

    @:param fname: file name of the corpus
    """
    with open(fname, encoding='utf-8') as f:
      for line in f:
        words = line.strip().split()
        for word in words:
          self.add_word(word)

  def convert_idxs_to_words(self, idxs):
    """Converts a list of indices to a space-separated string of words.

    @:param idxs: list of indices to be converted to words
    """
    return ' '.join(self.idx2word[idx] for idx in idxs)

  def convert_words_to_idxs(self, words):
    """Converts a list of words to a list of indices.

    @:param words: list of words to be converted to indices
    """
    return [self.word2idx(w) for w in words]

  def __len__(self):
    """Returns the size of the vocabulary."""
    return len(self.idx2word)

  def __repr__(self):
    """Returns a string representation of the vocabulary."""
    return "Vocabulary with {} items".format(self.__len__())

Let's construct the vocabulary for the training set and analyse the token indices for a sentence with an unknown word.




* **Why do we map unknown tokens to a special `<unk>` token?**
* **Do you think the network will learn a useful embedding for that? If not, how can you let the network learn an embedding for it?**

In [14]:
vocab = Vocabulary()
vocab.build_from_file('train.txt')
print(vocab)

# TODO : Convert sentence to list of indices, note how the last word is mapped to 3 (<unk>)
sentence = "Probable 1 unknown word_"
print(vocab.convert_words_to_idxs(sentence.split()))

Vocabulary with 33280 items
[3, 1000, 5042, 3]


The word `word_` is constructed so that it is not in the vocabulary, and it is mapped to the `<unk>` token (index 3) as expected. Further testing shows that the vocabulary is case-sensitive: `probable` and `Probable` are different tokens, and here `Probable` is itself mapped to `<unk>`. This is consistent with the dataset description quoted above, which says that WikiText-2 retains the original case. I still think a case-insensitive vocabulary could be preferable, since a word is the same word whether or not it is capitalized and should arguably share a single embedding.
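
Regarding the second question, one common option is to replace rare training words with `<unk>`, so that the token actually occurs in the training data and its embedding receives gradient updates. As far as I know, WikiText-2 already ships with rare words replaced this way. The sketch below is an assumption on my side, not part of the lab code: the helper `replace_rare_words` and the threshold `min_count` are hypothetical.

```python
from collections import Counter

def replace_rare_words(sentences, min_count=2, unk="<unk>"):
    """Maps every word seen fewer than `min_count` times to `unk`."""
    counts = Counter(w for sent in sentences for w in sent)
    return [[w if counts[w] >= min_count else unk for w in sent]
            for sent in sentences]

# Toy corpus: "cat", "mat", "dog" and "rug" occur only once
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
replaced = replace_rare_words(corpus)
for sent in replaced:
    print(" ".join(sent))
```

After this step the vocabulary would be built from the replaced sentences, so `<unk>` is treated like any other training token.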

### Representing the corpus

Let's process the corpus for PyTorch: each split will end up as one large 1D sequence of tokens. Note that, in `corpus_to_tensor()`, every line is wrapped between `<s> .. </s>` tags.

In [5]:
def corpus_to_tensor(_vocab, filename):
  """Converts a corpus file to a tensor of token indices.

  @:param _vocab: vocabulary object
  @:param filename: file name of the corpus
  """
  # Final token indices
  idxs = []
  # The explicit encoding was added because the initial code did not work on my machine
  with open(filename, encoding='utf-8') as data:
    # tqdm draws a progress bar: `ncols` is its width, `unit` the name of the
    # counted items and `desc` the text shown in front of the bar
    for line in tqdm(data, ncols=80, unit=' line', desc=f'Reading {filename} '):
      line = line.strip() # Remove leading and trailing whitespaces
      # Skip empty lines if any
      if line:
        # Each line is considered as a long sentence for WikiText-2
        line = f"<s> {line} </s>" # Add sentence markers
        # Split from whitespace and convert words to indices
        idxs.extend(_vocab.convert_words_to_idxs(line.split()))
  return torch.LongTensor(idxs)

In [6]:
# Read the files, prepare the small one as well
train = corpus_to_tensor(vocab, 'train.txt')
train_small = corpus_to_tensor(vocab, 'train_small.txt')

valid = corpus_to_tensor(vocab, 'valid.txt')
test = corpus_to_tensor(vocab, 'test.txt')
print('\n')

# Print the size of each split in a human-readable form
print(f'Small training size in tokens: {readable_size(len(train_small))}')
print(f'Training size in tokens: {readable_size(len(train))}')
print(f'Validation size in tokens: {readable_size(len(valid))}')
print(f'Test size in tokens: {readable_size(len(test))}')

Reading train.txt : 36718 line [00:00, 51735.07 line/s]
Reading train_small.txt : 5000 line [00:00, 50128.17 line/s]
Reading valid.txt : 3760 line [00:00, 53726.05 line/s]
Reading test.txt : 4358 line [00:00, 61778.50 line/s]



Small training size in tokens: 276.94K
Training size in tokens: 2.10M
Validation size in tokens: 218.81K
Test size in tokens: 246.99K





**Q: Print the first 20 token indices from the training set. And then print the sentence in actual words corresponding to these 20 tokens by using one of the provided methods in the `Vocabulary` class.**

In [7]:
########
print(train[:20])
print(vocab.convert_idxs_to_words(train[:20]))
########

tensor([ 1,  4,  5,  6,  7,  4,  2,  1,  8,  9,  5, 10, 11,  3,  6, 12, 13, 11,
        14, 15])
<s> = Valkyria Chronicles III = </s> <s> Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 ,


## B - Model definition

Now that we are done with data loading and vocabulary construction, we can define the actual FFLM model in PyTorch. Recall from the lectures that this model requires a pre-defined context window size $C$ which will affect the way you set up some of the linear layers. **Note that**, in contrast to the model depicted in the lecture, this model has an additional layer `ff_ctx`, which projects the context vector $c_k$ to hidden dimension $H$. This ensures that the number of parameters in the output layer does not depend on the context size, i.e. it is always $H\times V$ instead of $CE\times V$.
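
The effect of `ff_ctx` on the parameter count can be checked with a few lines of arithmetic. This is a sketch under my reading of the paragraph above: the variant "without `ff_ctx`" is assumed to map the context vector directly to the vocabulary, and the helper `count_params` is hypothetical. $V$, $E$, $H$ and $C$ are the vocabulary size, embedding size, hidden size and context size.

```python
def count_params(V, E, H, C, project_context=True):
    """Counts weights and biases of the FFLM, with or without `ff_ctx`."""
    emb = V * E                   # embedding table
    if project_context:
        ff_ctx = C * E * H + H    # context vector -> hidden layer
        out = H * V + V           # hidden layer -> vocabulary
    else:
        ff_ctx = 0
        out = C * E * V + V       # context vector -> vocabulary directly
    return emb + ff_ctx + out

V, E, H = 33280, 128, 128
for C in (2, 3, 5):
    with_proj = count_params(V, E, H, C)
    without = count_params(V, E, H, C, project_context=False)
    print(f"C={C}: with ff_ctx {with_proj:,} | without {without:,}")
```

For $C=2$ the first count is 8,585,856, which matches the `8.59M` reported when the model is printed in section C. With `ff_ctx`, each extra context word costs only $E\times H$ additional weights, instead of $E\times V$ without it.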

---

**Q: Follow the comments in `__init__()` and `forward()` to fill in the missing parts with some actual code.**

In [8]:
class FFLM(nn.Module):
  """Feed-forward language model."""

  def __init__(self, vocab_size, emb_dim, hid_dim, context_size, dropout=0.5):
    """Initializes the model.

    @:param vocab_size: size of the vocabulary
    @:param emb_dim: embedding dimension
    @:param hid_dim: hidden dimension
    @:param context_size: context window size
    @:param dropout: dropout probability
    """
    # Call parent's __init__ first
    super(FFLM, self).__init__()

    # Store arguments
    self.vocab_size = vocab_size
    self.emb_dim = emb_dim
    self.hid_dim = hid_dim
    self.context_size = context_size

    # Create the loss, don't sum or average, we'll take care of it
    # in the training loop for logging purposes
    self.loss = nn.CrossEntropyLoss(reduction='none')

    # Create the non-linearity
    self.nonlin = torch.nn.Tanh()

    # Dropout regularizer
    self.drop = nn.Dropout(p=dropout)

    ##############################
    # Fill the missing parts below
    ##############################
    # TODO : Compute the dimension of the context vector
    self.context_dim = context_size * emb_dim

    # Create the embedding layer (i.e. lookup table tokens->vectors)
    self.emb = nn.Embedding(
      num_embeddings=self.vocab_size, embedding_dim=self.emb_dim,
      padding_idx=0)

    # This cuts the number of parameters a bit
    self.ff_ctx = nn.Linear(self.context_dim, self.hid_dim)

    ############################################
    # Output layer mapping from the output of `ff_ctx` to vocabulary size
    # TODO : Fill the dimensions of the output layer
    ############################################
    self.out = nn.Linear(self.hid_dim, self.vocab_size)

    # Purely for informational purposes: compute # of total params
    self.n_params = sum(param.numel() for param in self.parameters())
    self.n_params = readable_size(self.n_params)

  def forward(self, x, y):
    """Forward-pass of the module.

    @:param x: input tensor of token indices, shape (batch_size, context_size)
    @:param y: true labels, shape (batch_size,)
    """
    # TODO : What is the shape of x ?
    # The shape of x is (batch_size, context_size): the last column of each
    # batch row is split off as the label `y` before forward() is called.

    # TODO : Get the embeddings for the token indices in `x`
    embs = self.emb(x)

    # TODO : Concatenate the embeddings to form the context vector
    ctx = embs.view(embs.shape[0], -1) # Reshape to (batch_size, context_size*emb_dim)

    # TODO : Apply ff_ctx -> non-lin -> dropout -> output layer to obtain the logits i.e. unnormalized scores
    h = self.ff_ctx(ctx)
    h = self.nonlin(h)
    h = self.drop(h)
    logits = self.out(h)



    # TODO : Use self.loss to compute the losses, return the losses (true labels are in `y`)
    return self.loss(logits.view(-1, self.vocab_size), y.view(-1))

  def get_batches(self, data_tensor, batch_size=64):
    """Returns a tensor of size (n_batches, batch_size, context_size + 1).

    @:param data_tensor: tensor of token indices
    @:param batch_size: batch size, default 64
    """
    # Split data into rows of n-grams followed by the (n+1)th true label
    x_y = data_tensor.unfold(0, self.context_size + 1, step=1)

    # Get the number of training n-grams
    n_samples = x_y.size()[0]

    # Hack: discard the last uneven batch for simplicity
    n_batches = n_samples // batch_size
    n_samples = n_batches * batch_size
    # Split nicely into batches, i.e. (n_batches, batch_size, context_size + 1)
    # The final element in each row is the ID of the true label to predict
    x_y = x_y[:n_samples].view(n_batches, batch_size, -1)

    # A particular batch for context_size=2 will now look like below in
    # word format. Last element for every array is the next token to be predicted
    #
    # [[<s>, cat, sat],
    #  [cat, sat, on],
    #  [sat, on,  the],
    #  [on,  the, mat],
    #   ....
    return x_y

  def train_model(self, optim, train_tensor, valid_tensor, test_tensor, n_epochs=5,
                  batch_size=64, shuffle=False):
    """Trains the model.

    @:param optim: optimizer, updates the parameters from their gradients
    @:param train_tensor: tensor of token indices for training
    @:param valid_tensor: tensor of token indices for validation
    @:param test_tensor: tensor of token indices for testing
    @:param n_epochs: number of epochs, default 5, i.e. 5 passes over the training data
    @:param batch_size: batch size, default 64
    @:param shuffle: whether to shuffle the batches or not, default False. Shuffling
                     is useful for training, but not for testing.
    """
    # Get batches for the training data
    batches = self.get_batches(train_tensor, batch_size)

    print(f'Will do {batches.size(0)} batches for an epoch.')

    for eidx in range(1, n_epochs + 1): # Epoch loop
      start_time = time.time() # For reporting time per epoch
      epoch_loss = 0
      epoch_items = 0

      # Enable training mode
      self.train() # Enable training mode

      # Shuffle the batch order or not
      if shuffle:
        batch_order = torch.randperm(batches.size(0))
      else:
        batch_order = torch.arange(batches.size(0))

      # Start training
      for iter_count, idx in enumerate(batch_order):
        batch = batches[idx].to(DEVICE)

        # TODO : Split into inputs `x` and labels `y`. Hint : Look at the context_size
        x, y = batch[:, :-1], batch[:, -1] # The last element in each row is the ID of the true label to predict

        # Clear the gradients
        optim.zero_grad() # Clear the gradients

        # TODO : Compute the loss thanks to one of the previous function
        loss = self.forward(x, y)

        # Backprop the average loss and update parameters
        loss.mean().backward() # Backprop the average loss and update parameters
        optim.step()

        # sum the loss for reporting, along with the denominator
        epoch_loss += loss.detach().sum()
        epoch_items += loss.numel()

        if iter_count % 1000 == 0:
          # Print progress
          loss_per_token = epoch_loss / epoch_items
          ppl = math.exp(loss_per_token)
          print(f'[Epoch {eidx:<3}] loss: {loss_per_token:6.2f}, perplexity: {ppl:6.2f}')

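      time_spent = time.time() - start_time

      # Recompute the averages over the whole epoch: the values computed inside
      # the loop are only refreshed every 1000 batches
      loss_per_token = epoch_loss / epoch_items
      ppl = math.exp(loss_per_token)
      print(f'\n[Epoch {eidx:<3}] ended with train_loss: {loss_per_token:6.2f}, ppl: {ppl:6.2f}')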
      # Evaluate on valid set
      valid_loss, valid_ppl = self.evaluate(test_set=valid_tensor)
      print(f'[Epoch {eidx:<3}] ended with valid_loss: {valid_loss:6.2f}, valid_ppl: {valid_ppl:6.2f}')
      print(f'[Epoch {eidx:<3}] completed in {time_spent:.2f} seconds\n')

    # Evaluate the final model on test set
    test_loss, test_ppl = self.evaluate(test_set=test_tensor)
    print(f' ---> Final test set performance: {test_loss:6.2f}, test_ppl: {test_ppl:6.2f}')

  def evaluate(self, test_set, batch_size=32):
    """Evaluates and computes perplexity for the given test set.

    @:param test_set: tensor of token indices for testing
    @:param batch_size: batch size, default 32
    """
    loss = 0

    # Get the batches
    batches = self.get_batches(test_set, batch_size)

    # TODO : Set your model to Eval mode
    self.eval()

    with torch.no_grad(): # Disable gradient computation
      for batch in batches: # Batch loop
        batch = batch.to(DEVICE) # Move to GPU

        # TODO : Split into inputs `x` and labels `y`. Hint : Look at the context_size
        x, y = batch[:, :-1], batch[:, -1]

        # loss will be a vector of size (batch_size, ) with losses per every sample
        # sum the loss for reporting, along with the denominator
        loss += self.forward(x, y).sum()

    # Normalize by the number of tokens in the test set
    loss /= batches.size()[:2].numel()

    # TODO : Switch back to training mode
    self.train()

    # Return the average loss and the corresponding perplexity as floats
    loss = float(loss)
    return loss, math.exp(loss)

  def __repr__(self):
    """String representation for pretty-printing."""
    s = super(FFLM, self).__repr__()
    return f"{s}\n# of parameters: {self.n_params}"
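
The sliding-window batching done in `get_batches()` can be checked on a toy sentence. The sketch below uses plain Python lists instead of tensors: `sliding_windows` mirrors what `data_tensor.unfold(0, context_size + 1, step=1)` produces, and `split_batches` mirrors the step that discards the last uneven batch. Both helper names are hypothetical.

```python
def sliding_windows(tokens, context_size):
    """Lists every run of `context_size + 1` consecutive tokens (stride 1)."""
    width = context_size + 1
    return [tokens[i:i + width] for i in range(len(tokens) - width + 1)]

def split_batches(windows, batch_size):
    """Groups windows into full batches and drops the uneven remainder."""
    n_batches = len(windows) // batch_size
    return [windows[b * batch_size:(b + 1) * batch_size]
            for b in range(n_batches)]

tokens = "<s> the cat sat on the mat </s>".split()
windows = sliding_windows(tokens, context_size=2)
batches = split_batches(windows, batch_size=4)

for batch in batches:
    # The last element of each row is the label, the rest is the context
    for *context, target in batch:
        print(context, "->", target)
```

With 8 tokens and a context size of 2 there are 6 windows; a batch size of 4 keeps one full batch and drops the remaining 2 windows, just as the "hack" in `get_batches()` does.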

## C - Training

We can now launch training using a set of sane hyper-parameters for our model. This is a 3-gram FFLM since the context size is set to 2. On a Colab GPU, a single epoch should take around 1 minute.

In [9]:
# Set the seed for reproducible results
%env CUBLAS_WORKSPACE_CONFIG=:4096:8
fix_seed(42)

fflm_model = FFLM(
  len(vocab),       # vocabulary size
  emb_dim=128,      # word embedding dim
  hid_dim=128,      # hidden layer dim
  context_size=2,   # C = (N-1) if you think in n-gram LM terminology
  dropout=0.4,      # dropout probability
)

# move to device
fflm_model.to(DEVICE)

# Initial learning rate for the optimizer
FFLM_INIT_LR = 0.001

# Create the optimizer
fflm_optimizer = torch.optim.Adam(fflm_model.parameters(), lr=FFLM_INIT_LR)
print(fflm_model)

print('Starting training!')
# NOTE: If you happen to have memory errors, try decreasing the batch size
# It will print progress every 1000 batches
fflm_model.train_model(fflm_optimizer, train, valid, test, n_epochs=5, batch_size=256, shuffle=True)

env: CUBLAS_WORKSPACE_CONFIG=:4096:8
FFLM(
  (loss): CrossEntropyLoss()
  (nonlin): Tanh()
  (drop): Dropout(p=0.4, inplace=False)
  (emb): Embedding(33280, 128, padding_idx=0)
  (ff_ctx): Linear(in_features=256, out_features=128, bias=True)
  (out): Linear(in_features=128, out_features=33280, bias=True)
)
# of parameters: 8.59M
Starting training!
Will do 8200 batches for an epoch.
[Epoch 1  ] loss:  10.49, perplexity: 35882.11
[Epoch 1  ] loss:   7.27, perplexity: 1439.55
[Epoch 1  ] loss:   6.98, perplexity: 1072.42
[Epoch 1  ] loss:   6.83, perplexity: 927.49
[Epoch 1  ] loss:   6.74, perplexity: 846.14
[Epoch 1  ] loss:   6.67, perplexity: 785.10
[Epoch 1  ] loss:   6.61, perplexity: 738.93
[Epoch 1  ] loss:   6.56, perplexity: 705.65
[Epoch 1  ] loss:   6.51, perplexity: 673.90

[Epoch 1  ] ended with train_loss:   6.51, ppl: 673.90
[Epoch 1  ] ended with valid_loss:   5.80, valid_ppl: 330.63
[Epoch 1  ] completed in 43.10 seconds

[Epoch 2  ] loss:   6.08, perplexity: 435.11
[Epo

**Q: If everything goes well, you should see a loss of around ~10.4 printed as the first loss. This will still be the case if you change the random seed to some other number before model construction, i.e. the cause is not the exact values that the initial parameters take.**
* **Can you come up with a simple mathematical formula which yields that value?**

In [10]:
##########################
# Answer to question above
##########################
print("<TODO: put the formula here which computes the value>")


<TODO: put the formula here which computes the value>


<div align="center">
$ \mathcal{L} = -\frac{1}{n}\sum^n_{k=1}\log P(w_k \mid c_k) \approx -\log\frac{1}{V} = \log V $
</div>

In [18]:
print(math.log(fflm_model.vocab_size))

10.412711894935144


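This value ($\log V \approx 10.41$) is close to the first loss printed during training (10.49): before any update, the model assigns roughly the same probability $1/V$ to every word of the vocabulary.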
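
The same value can be obtained by computing the cross-entropy of a perfectly uniform prediction. This is a small sketch in plain Python (the helper `cross_entropy` is hypothetical and handles a single sample); it assumes that an untrained model behaves approximately like equal logits for all words.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[target]

V = 33280                    # vocabulary size reported above
uniform_logits = [0.0] * V   # equal scores -> uniform distribution
loss = cross_entropy(uniform_logits, target=123)
ppl = math.exp(loss)
print(f"loss = {loss:.4f}, log(V) = {math.log(V):.4f}, perplexity = {ppl:.0f}")
```

The perplexity of the uniform prediction equals the vocabulary size, which is consistent with the first perplexity in the training log being of the same order as $V$. The real first loss is slightly higher, presumably because randomly initialized logits are not exactly equal.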

## D - Further Exploring

With the default settings above, you should end up with a validation perplexity of $\sim1076$ and a final test set perplexity of $\sim1003$ at the end of the 5th epoch. Try the following questions to further analyze the model's predictions.

---

* **Q: Remove the `tanh()` non-linearity from the code so that the context is computed as a linear combination of its embeddings. How do the results compare to the initial ones? Do you think non-linearity helps?**

* **Q: Compare the results by rerunning the training with unshuffled batches i.e. with `shuffle=False`. What do you notice in terms of results?**

* **Q: Play with hyper-parameters related to dimensions and dropout. Could you find a model with smaller perplexity?**

* **Q: Try with different context sizes such as 3, 5, 7, etc. What is the best perplexity you can get?**

**Q1:** I have not run this experiment, but from my reading, non-linearities are introduced to make the model more expressive. I would therefore expect the model with the non-linearity to reach a better perplexity than the one without it, and to fit the data more robustly.

**Q2:** Shuffling the data should help the model generalize: the model trained with shuffled batches has a better perplexity than the one trained without shuffling.

**Q3:** Too much work for me, sir :( I tried to use sklearn's `RandomizedSearchCV`, but it took too long to run. I will try again later.
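
To make the first answer more concrete, here is a short derivation of what happens when `tanh` is removed. It ignores dropout and writes $W_1, b_1$ for the weights and bias of `ff_ctx` and $W_2, b_2$ for those of `out` (my notation, not the lab's); $c_k$ is the context vector.

$$
\text{logits} = W_2\,(W_1 c_k + b_1) + b_2 = (W_2 W_1)\,c_k + (W_2 b_1 + b_2)
$$

Without the non-linearity, the two layers are therefore equivalent to a single linear layer applied to the concatenated embeddings, whose weight matrix $W_2 W_1$ has rank at most $H$.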


## E - Further Reading for your knowledge
 - [Original FFLM paper from Bengio et al. 2003](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
 - [Original RNNLM paper from Mikolov et al. 2010](https://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf)
 - Some recent state-of-the-art LSTM-based RNNLMs:
   - [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/pdf/1708.02182.pdf)
   - [An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/pdf/1803.08240.pdf)
   - [Scalable Language Modeling: WikiText-103 on a Single GPU in 12 hours](https://mlsys.org/Conferences/2019/doc/2018/50.pdf)