# Introduction

Welcome to the second NNFL assignment. In this assignment you will program an RNN from scratch and build a preprocessing pipeline for Natural Language Processing. While RNNs are typically built with frameworks like PyTorch, the preprocessing pipeline you will learn about here applies to many of the NLP problems you will face.

Please read the instructions given below carefully before attempting the assignment.  
- Do NOT import any other modules
- Do NOT change the prototypes of any of the functions
- Sample test cases are already given; test your code using them
- Grading will be based on hidden test cases
- Please solve this notebook using [Google Colab](https://colab.research.google.com/) as the required packages are already installed 

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE`, as well as your name and ID number below:

In [8]:
NAME = "Harshit Agarwal"
ID = "2019A3PS0245P"

# Installing the Dataset and the GloVe embeddings

We will kick things off by installing all the pretrained models and the dataset. Running the below cell should set you up.

While GloVe embeddings should have been covered in class, you can find some links about them below:

1. [Glove paper](https://nlp.stanford.edu/pubs/glove.pdf)
2. [For the lazy ones](https://towardsdatascience.com/light-on-math-ml-intuitive-guide-to-understanding-glove-embeddings-b13b4f19c010)

In [9]:
! wget https://nlp.stanford.edu/data/glove.6B.zip
! unzip glove.6B.zip
! rm glove.6B.100d.txt glove.6B.200d.txt glove.6B.300d.txt glove.6B.zip
! pip install --upgrade --no-cache-dir gdown
! gdown --id 1sfQ2Y6kvmrScWMOt4c2zKYxfws5Ivu7x

--2022-04-04 17:18:35--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-04-04 17:18:35--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-04-04 17:21:17 (5.09 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive:  glove.6B.zip
replace glove.6B.50d.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: rm: cannot remove 'glove.6B.100d.txt': No such file or directory
rm: cannot remove 'glove.6B.200d.txt': N

Importing the requisite libraries. Keep in mind the ones that are imported - you will be needing them at a later point.

In [10]:
# DO NOT MODIFY THIS CELL
import torch
import torch.nn.functional as F
from collections import Counter
from tqdm.notebook import tqdm

# Problem Statement

The problem we will try to solve is next word prediction. Given a sequence of words, we want to train an RNN cell to predict the next most probable word.

The cell below initialises the hyperparameters of the model. Each of these will be explained in due course.

In [11]:
SEQ_LEN = 8
VOCAB_SIZE = 6000
EMB_DIM = 50
HIDDEN_LEN = 64
BATCH_SIZE = 64
DATA_PATH = '/content/RNN_From_Scratch_Dataset.txt'

# Preprocessing [3.25M]

Preprocessing of data involves the following steps for each line of your dataset:
1. Remove all punctuation from the data. This is done so that our model does not encounter characters which will not contribute to the prediction of the next word.
2. Convert the data into tokens - in this case the tokens would just be words. You should have a clear understanding of the difference between words and tokens.
3. Pad the token sequence with padding tokens, or slice it, depending on its length. This ensures that all your datapoints have the same length, so that no errors are encountered when PyTorch creates a batch. The inspiration for padding, however, comes from transformers. You can read more about them [here](https://arxiv.org/abs/1706.03762).

We will first write functions for each of these individual tasks. Note that each of these will be for a single datapoint.

In [12]:
# GRADED - 0.5 Marks
def clean_str(line):
	'''
		Remove punctuation marks from the input strings
    chars_to_remove = [',', '.', '"', "'", '/', '*', ',', '?', '!', '-', '\n', '“', '”', '_', '&', '\ufeff', '&', ';', ":"]
		
		Arguments:
			line: The raw text string

		Returns:
			Lowercased string without punctuation marks
	'''
	# YOUR CODE HERE
	chars_to_remove = [',', '.', '"', "'", '/', '*', ',', '?', '!', '-', '\n', '“', '”', '_', '&', '\ufeff', '&', ';', ":"]
	for ele in line:
		if ele in chars_to_remove:
			line = line.replace(ele,"")
	return line.lower()

In [13]:
# Sample Test Case
test_str = 'Who, let* the. dogs- out?'
assert ',' not in clean_str(test_str) and '*' not in clean_str(test_str) and '?' not in clean_str(test_str)
print('Sample Test passed', '\U0001F44D')

Sample Test passed 👍


A tokeniser splits text into tokens. Since our problem is next word prediction, our tokeniser function splits each line into words, which is why it is called a Word Tokeniser. You can read about other types of tokenisers [here](https://huggingface.co/docs/transformers/tokenizer_summary).
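
To make the difference between words and tokens concrete, the sketch below tokenises the same string at word level and at character level. The helper names `word_tokens` and `char_tokens` are illustrative only and are not part of the assignment.

```python
# Illustrative helpers (hypothetical names, not part of the assignment).
def word_tokens(text):
    # Word-level: split on any run of whitespace.
    return text.split()

def char_tokens(text):
    # Character-level: every non-space character is its own token.
    return [ch for ch in text if not ch.isspace()]

sample = "who let the dogs out"
print(word_tokens(sample))
print(len(char_tokens(sample)))
```

The same sentence yields far fewer tokens at word level, at the cost of a much larger vocabulary.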

In [14]:
# GRADED - 0.5 Marks
def tokenise(line):
	'''
		Tokenise the raw string into word tokens
		
		Arguments:
			line: The raw text string

		Returns:
			Tokens in the string, split at a space
	'''
	# YOUR CODE HERE
	tokens = line.split()
	return tokens

In [15]:
# Sample Test Case
test_str = 'Who let the dogs out\n'
assert tokenise(test_str) == ['Who', 'let', 'the', 'dogs', 'out']
print('Sample Test passed', '\U0001F44D')

Sample Test passed 👍


All our training points should have the same length. This ensures that the PyTorch dataloader can use them with minimal effort from our side.

Another reason is that RNNs operate sequentially. Having, say, 200 tokens per training point (which might just be true for our dataset) will slow down the training process. Moreover, RNNs struggle with long sequences.

**Note:** Since this function works with preprocessed raw strings, we can also use it to create our labels. Therefore, before slicing/padding, ensure that you have extracted the label and have updated the training datapoint accordingly.
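
The note above can be illustrated with a small sketch: the label is taken first, and only then is the remainder padded or sliced. `split_label_then_pad` is a hypothetical helper, and treating the final token as the label is an assumption based on the sample test case for `pad_sequence`.

```python
def split_label_then_pad(tokens, seq_len, padding_token='<PAD>'):
    # Hypothetical helper; assumes `tokens` is non-empty.
    label = tokens[-1]                                       # extract the label first
    body = tokens[:-1][:seq_len]                             # slice if too long
    body = body + [padding_token] * (seq_len - len(body))    # pad if too short
    return body, label

short_body, short_label = split_label_then_pad(['who', 'let', 'the', 'dogs', 'out'], 8)
long_body, long_label = split_label_then_pad(list('abcdefghijkl'), 8)
print(short_body, short_label)
print(long_body, long_label)
```

Both results have exactly `seq_len` tokens, and the label never appears inside its own datapoint.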

In [16]:
# GRADED - 0.75 Marks
def pad_sequence(tokens, seq_len, padding_token = '<PAD>'):
	'''
		Padding/slicing sequences of tokens to ensure all of them have the same length. After the padding is done, the next word label is also appended to it.
		
		Arguments:
			tokens: tokens generated from the tokenizer
			seq_len: The maximum permitted length of the sequence
			padding_token: The token to be used in case of padding

		Returns:
			tokens: A list of padded/sliced tokens of len = seq_len
			last_token: The label for one datapoint, the token with the highest index
	'''
	# YOUR CODE HERE
	# An empty line has no label: return an all-padding sequence and an empty label
	if len(tokens) == 0:
		return [padding_token] * seq_len, ''

	# The last token is the label; the tokens before it form the datapoint
	last_token = tokens[-1]
	tokens = tokens[:-1][:seq_len]  # slice if longer than seq_len
	tokens = tokens + [padding_token] * (seq_len - len(tokens))  # pad if shorter

	return tokens, last_token

In [17]:
# Sample Test Case
test_seq = ['Who', 'let', 'the', 'dogs', 'out']
tokens, last_token = pad_sequence(test_seq, SEQ_LEN)
print(tokens)
assert tokens == ['Who', 'let', 'the', 'dogs', '<PAD>', '<PAD>', '<PAD>', '<PAD>'] and last_token == 'out'
print('Sample Test passed', '\U0001F44D')

['Who', 'let', 'the', 'dogs', '<PAD>', '<PAD>', '<PAD>', '<PAD>']
Sample Test passed 👍


All the above functions should now be called from one function, which will preprocess the entire dataset.

In [18]:
# GRADED - 1.5 Marks
def preprocess_data(path, vocab_size, seq_len):
	'''
		Function to call all preprocessing steps for the entire corpus

		Note: Ensure to leave a slot in the vocabulary for the <UNK> token
		
		Arguments:
			path: The path to your data file
			vocab_size: The number of tokens to be included in the vocabulary
			seq_len: The maximum permitted length of the token sequences

		Returns:
			datapoints: The preprocessed training + testing points for the corpus, list format
			labels: The labels to be used per datapoint, list format
	'''
	# YOUR CODE HERE
	datapoints = []
	labels = []
	# The context manager closes the file once every line has been read
	with open(path, "r") as text:
		for x in text:
			line = clean_str(x)
			tokens = tokenise(line)
			# Cap very long lines at vocab_size - 1 tokens before extracting the label
			tokens = tokens[0:vocab_size - 1]
			datapoint, label = pad_sequence(tokens, seq_len)
			datapoints.append(datapoint)
			labels.append(label)

	return datapoints, labels

Run the below cell to preprocess your data.

In [19]:
datapoints, labels = preprocess_data(DATA_PATH, VOCAB_SIZE, SEQ_LEN)


In [20]:
# Sample Test Case
assert len(datapoints[0]) == len(datapoints[-1]) == len(datapoints[1231]) == SEQ_LEN
assert len(labels) == len(datapoints)
print('Sample Test passed', '\U0001F44D')

Sample Test passed 👍


# Building the vocab and converting the tokens to numbers [1.25M]

More often than not, you will have a lot of words in your dataset - more than you require. Therefore, VOCAB_SIZE becomes a parameter that you can manually set as per your requirements.

Your model cannot work with textual words; it needs numbers. For this purpose, we convert the words into numbers by creating a one-to-one mapping between words and a set of indices.

To create the vocabulary, we choose the top-k words (k = user-defined vocabulary size) in our dataset. We also need a way to handle words that are not in our vocab; this is the role of the ```<UNK>``` token. Any word not in our vocabulary is assigned this token. It has to be added to the dataset manually.

We will also create an inverse mapping which can be used for decoding the next word from our model.
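
As a toy illustration of the idea (not the graded implementation), the sketch below builds a three-slot vocabulary from a tiny made-up corpus, reserves the last index for ```<UNK>```, and encodes a few words. The corpus and the names `toy_vocab`, `toy_vocab_inv` and `toy_vocab_size` are assumptions made for the example.

```python
from collections import Counter

corpus = [['the', 'dog', 'saw', 'the', 'cat'], ['the', 'cat', 'ran']]
toy_vocab_size = 3  # two real words plus one slot for <UNK>

# Count every word and keep the most common ones, leaving room for <UNK>.
counts = Counter(word for sentence in corpus for word in sentence)
top_words = [word for word, _ in counts.most_common(toy_vocab_size - 1)]

toy_vocab = {word: idx for idx, word in enumerate(top_words)}
toy_vocab['<UNK>'] = toy_vocab_size - 1
toy_vocab_inv = {idx: word for word, idx in toy_vocab.items()}  # inverse mapping

# Words outside the vocabulary fall back to the <UNK> index.
encoded = [toy_vocab.get(word, toy_vocab['<UNK>']) for word in ['the', 'dog', 'ran']]
print(toy_vocab, encoded)
```

Decoding with `toy_vocab_inv` recovers `the` but only ```<UNK>``` for the two rare words, which is the information lost by capping the vocabulary.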

In [21]:
def build_vocab(datapoints, labels, vocab_size):
	'''
		Building the vocabulary from the most common words in the corpus

		Note: Ensure to leave a slot in the vocabulary for the <UNK> token. 
		For uniformity, insert this at the end of your vocab, ie, its index should be vocab_size - 1.
		Also ensure that each label is in the vocab. If not, add it by removing the least common word.
		Also ensure you remove padding tokens from the vocabulary and add the most appropriate word.
		
		Arguments:
			datapoints: The preprocessed datapoints in the corpus
			labels: The labels per datapoint in the corpus
			vocab_size: The number of tokens in the vocab

		Returns:
			vocab: A dictionary mapping from the word to its corresponding vocab index (0 indexed)
			vocab_inv: A dictionary mapping from the vocab index to its corresponding word
	'''	
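	word_sea = []
	for datapoint in datapoints:
		word_sea.extend(datapoint)
	# Each label is a single word, so add the labels as whole words rather than character by character
	word_sea.extend(labels)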

	most_common_words = [word for word, _ in Counter(word_sea).most_common(vocab_size - 1)]
	replaced_idx = 1
	for label in labels:
		if label not in most_common_words:
			while(most_common_words[-replaced_idx] in labels):
				replaced_idx += 1
			most_common_words[-replaced_idx] = label
	
	vocab = {word: idx for idx, word in enumerate(most_common_words)}
	vocab_inv = {idx: word for idx, word in enumerate(most_common_words)}
	vocab['<UNK>'] = vocab_size - 1
	vocab_inv[vocab_size - 1] = '<UNK>'
	return vocab, vocab_inv

This function will convert the tokens in our dataset to tokens in our created vocabulary. This means that tokens not in our vocabulary will get mapped to ```<UNK>```.

In [22]:
def data2tokens(vocab, raw_data):
	'''
		Converts the raw text into their corresponding tokens
		
		Arguments:
			vocab: Mapping from the word to its corresponding vocab index
			raw_data: The preprocessed data, however, some words are not present as 
							  tokens in the vocab

		Returns:
			dataset_tokens: A list of the preprocessed data where all words are corresponding to 
									    tokens in the vocab
	'''
	dataset_tokens = []
	for data in raw_data:
		dataset_tokens.append([word if word in vocab else '<UNK>' for word in data])
	
	return dataset_tokens

The above tokens can now be mapped to indices in our vocab.

In [23]:
# GRADED - 1.25 Marks
def tokens2ids(vocab, data_tokens):
	'''
		Converts the tokens into their corresponding vocab indices
		
		Arguments:
			vocab: Mapping from the word to its corresponding vocab index
			data_tokens: The preprocessed data where all words are corresponding to 
									 tokens in the vocab

		Returns:
			dataset_ids: The tokens in the dataset converted to their vocab indices
			This should be a Pytorch Long Tensor
	'''
	# YOUR CODE HERE
	# Look up the vocab index of every token, one datapoint at a time
	dataset_ids = [[vocab[token] for token in tokens] for tokens in data_tokens]

	# Return the indices as a PyTorch Long Tensor
	dataset_ids = torch.tensor(dataset_ids, dtype=torch.long)

	return dataset_ids

Run the below cell to call all the functions above in sequence.

In [35]:
vocab, vocab_inv = build_vocab(datapoints, labels, VOCAB_SIZE)
print(vocab)
dataset_tokens = data2tokens(vocab, datapoints)
dataset_ids = tokens2ids(vocab, dataset_tokens)
# print(dataset_ids)
# print(len(dataset_tokens))



In [25]:
# Sample Test Case
assert len(vocab) == VOCAB_SIZE
assert len(dataset_tokens) == len(dataset_ids)
assert torch.is_tensor(dataset_ids)
print('Sample Test passed', '\U0001F44D')

Sample Test passed 👍


Now that our data is ready, we need a way to feed it into the model. As you learnt in the previous assignment, we use mini-batch sampling for inputs into the model. PyTorch provides a dataloader class (which you used in your previous assignment) that makes this possible. The next function emulates the dataloader in vanilla Python.

In [26]:
def create_dataloader(datapoints, labels, num_batches, batch_size):
  '''
    Function to create the dataloader which will yield batches on the fly.

    Arguments:
      datapoints: The preprocessed datapoints in the corpus
      labels: The labels per datapoint
      num_batches: The number of batches from the dataset
      batch_size: The number of datapoints per batch
    Returns:
      x: One minibatch of input indices of size (batch_size, seq_len)
      y: One minibatch of labels per datapoints of size (batch_size, 1)
  '''
  for i in range(num_batches):
    if i == num_batches - 1:
      x = datapoints[i*batch_size:]
      y = labels[i*batch_size:]
    else:
      x = datapoints[i*batch_size: (i+1)*batch_size]
      y = labels[i*batch_size: (i+1)*batch_size]
    yield x, y
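
To see how such a generator behaves, here is a toy version that applies the same slicing scheme to a single plain list. `toy_batches` is a hypothetical helper, and computing the batch count with ceiling division is one possible choice, assumed here for illustration.

```python
def toy_batches(items, num_batches, batch_size):
    # Same slicing scheme as above, applied to a single list.
    for i in range(num_batches):
        if i == num_batches - 1:
            yield items[i * batch_size:]      # last batch takes whatever is left
        else:
            yield items[i * batch_size:(i + 1) * batch_size]

items = list(range(10))
batch_size = 4
num_batches = (len(items) + batch_size - 1) // batch_size  # ceiling division, no imports
batches = list(toy_batches(items, num_batches, batch_size))
print([len(b) for b in batches])
```

Because the last batch takes everything that remains, no datapoint is dropped even when the dataset size is not a multiple of the batch size.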

# Modelling [5.5M]

The first step is to build an RNN cell, which is randomly initialised. The cell has five parameter tensors (three weight matrices and two biases), each of which is described in the docstring. Return a dictionary whose keys are the notations and whose values are the corresponding tensors.

The equations of an RNN can be summarised as:

### $ h^{(t)} = \tanh(E \cdot I) + h^{(t - 1)} \cdot H + I_b $
### $ o^{(t)} = h^{(t - 1)} \cdot O + O_b $
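
A shape-level sketch of the two equations, evaluated exactly as written above, may help before writing the cell. The sizes and variable names are toy values chosen for illustration, and storing the biases as row vectors (so that they broadcast over the batch) is an assumption about layout rather than a requirement.

```python
import torch

torch.manual_seed(0)
batch, emb_dim, hidden_len = 4, 5, 3           # toy sizes, chosen for illustration

E = torch.randn(batch, emb_dim)                # embeddings of the current tokens
h_prev = torch.zeros(batch, hidden_len)        # h^(t-1), zero at the first step
I = torch.randn(emb_dim, hidden_len)
H = torch.randn(hidden_len, hidden_len)
O = torch.randn(hidden_len, hidden_len)
I_b = torch.randn(1, hidden_len)               # row vectors broadcast over the batch
O_b = torch.randn(1, hidden_len)

h_t = torch.tanh(E @ I) + h_prev @ H + I_b     # first equation, as stated
o_t = h_prev @ O + O_b                         # second equation, as stated

print(h_t.shape, o_t.shape)
```

Both results have one row per datapoint and `hidden_len` columns, so they can be fed straight back in as the previous state at the next step.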

In [82]:
# GRADED - 0.75 Marks
def create_rnn(hidden_len, emb_dim):
	'''
		Creates a randomly intialised rnn cell
		
		Arguments:
			hidden_len: The length of the hidden state of the rnn
			emb_dim: The length of the embeddings in the embedding space

		Returns:
			rnn: A dictionary containing all the weights and biases associated with the rnn cell (use torch.randn).
				rnn['I']: The learnable weights to convert the input embeddings to the current hidden state
				rnn['H']: The learnable weights to convert the previous hidden state to the current hidden state
				rnn['O']: The learnable weights to convert the current hidden state to the output vector
				rnn['I_b']: The bias to be used to convert the input embeddings to the current hidden state
				rnn['O_b']: The bias to be used to convert the current hidden state to the output vector
	'''
	# YOUR CODE HERE
	rnn = {
		'I': torch.randn(emb_dim, hidden_len),
		'H': torch.randn(hidden_len, hidden_len),
		'O': torch.randn(hidden_len, hidden_len),
		'I_b': torch.randn(hidden_len, 1),
		'O_b': torch.randn(hidden_len, 1),  # torch.randn, as the docstring asks
	}
	
	return rnn

The model input, as discussed before, is going to be a set of indices corresponding to each token in the vocabulary. This cannot be directly fed in because they do not mean anything to our model. They are not present in a common vector space. For this purpose, we create "embeddings" which is a multi-dimensional representation of our vocabulary. These are stored in a lookup table and are learnable features, just like the weights and biases of our network. They can be indexed using the indices we have created in our vocab.


You have to write a function to initialise this lookup table, as per the conditions given in the docstring. The ```load_pretrained_embeddings``` function loads the GloVe embeddings for you into a dictionary mapping the GloVe tokens to their GloVe embeddings.
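
The sketch below shows the two mechanics involved, on a toy table: sampling from U(-0.25, 0.25) with `torch.rand`, and indexing the table with a batch of vocab indices. The sizes, the choice of index 0 as the padding row, and the variable names are assumptions made for the example.

```python
import torch

torch.manual_seed(0)
toy_vocab_size, toy_emb_dim = 6, 4

# torch.rand samples from U(0, 1); scale and shift to get U(-0.25, 0.25).
table = torch.rand(toy_vocab_size, toy_emb_dim) * 0.5 - 0.25
table[0] = 0.0                                   # pretend index 0 is the padding token

ids = torch.tensor([[2, 5, 0], [1, 1, 3]])       # (batch, seq_len) of vocab indices
looked_up = table[ids]                           # (batch, seq_len, emb_dim)

print(looked_up.shape)
```

Note that `torch.randn` would draw from a normal distribution instead, so it is not a substitute for `torch.rand` when a uniform range is required.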

In [83]:
def load_pretrained_embeddings(model_name):
  '''
		Reads and loads the pretrained glove embeddings from the downloaded glove file
		
		Arguments:
			model_name: The path to the pretrained glove file

		Returns:
			embedding_model: Mapping from token to its corresponding glove embedding
	'''
  embedding_model = {}
  f = open(model_name, 'r')
  for line in tqdm(f.readlines(), desc = 'Reading GloVe Embeddings'):
    tmp = line.split(' ')
    word, vec = tmp[0], list(map(float, tmp[1:]))
    assert(len(vec) == 50)
    if word not in embedding_model:
        embedding_model[word] = torch.tensor(vec)
        
  return embedding_model

# GRADED - 1.75 Marks
def create_embeddings(emb_dim, num_tokens, vocab, model_name = 'glove.6B.50d.txt'):
  '''
		Creates and initialises the embeddings for the corpus:
    1. If a token in the corpus is present as a token in the glove embeddings, initialise it with the glove embedding
    2. If a token in the corpus is not present as a token in the glove embeddings, initialise it with a random embedding sampled from U(-0.25, 0.25)
    3. Initialise the padding token with a zero embedding
		
		Arguments:
			emb_dim: The length of the embeddings in the embedding space
      num_tokens: The number of tokens in the vocabulary, aka, vocab size
      vocab: Mapping from the word to its corresponding vocab index
      model_name: The path to the pretrained glove file

		Returns:
			embeddings: The initialised embedding space (a torch tensor)
	'''
  # YOUR CODE HERE
  glove_embeddings = load_pretrained_embeddings(model_name)
  r1, r2 = -0.25, 0.25

  # Start from zeros so that the padding token keeps a zero embedding
  embeddings = torch.zeros(num_tokens, emb_dim)
  for word, idx in vocab.items():
    if word == '<PAD>':
      continue  # already zero
    if word in glove_embeddings:
      embeddings[idx] = glove_embeddings[word]
    else:
      # torch.rand samples from U(0, 1); rescale to U(r1, r2)
      embeddings[idx] = (r2 - r1) * torch.rand(emb_dim) + r1

  return embeddings

This function creates your classifier, which is a fully connected layer. As before, you have to return a dictionary which contains the weights and biases of the classifier. The equations of the classifier can be summarised as:

 ### $ Y = (X \cdot W + b)$

In [84]:
# GRADED - 0.75 Marks
def create_classifier(in_features, num_classes):
	'''
		Creates a randomly intialised classifer as a fully connected layer
		
		Arguments:
			in_features: The length of the feature vector at the input of the classifier
			num_classes: The number of classes to be predicted

		Returns:
			classifier: 
				classifier['weight']: The randomly initialised weights for the fully connected layer from in_features to num_classes (use torch.randn)
				classifier['bias']: The randomly initialised bias for the fully connected layer from in_features to num_classes
	'''
	# YOUR CODE HERE
	classifier={'weight': torch.randn(in_features,num_classes),
	            'bias': torch.randn(num_classes,1)}
	return classifier

This function forwards the rnn by one step for all elements in the batch. In case a padding token is encountered, no change is made to the output and the hidden state. For this purpose, you have been provided the previous hidden state and the previous output as an input into the function.

### $ h^{(t)} = \tanh(E \cdot I) + h^{(t - 1)} \cdot H + I_b $
### $ o^{(t)} = h^{(t - 1)} \cdot O + O_b $
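
One way to leave padded positions untouched is to compute the update for every row and then keep the previous values wherever the input is padding. The sketch assumes that padding can be recognised by an all-zero embedding row (the `<PAD>` embedding is initialised to zeros in this assignment); `keep_previous_on_padding` is a hypothetical helper, not a required function.

```python
import torch

def keep_previous_on_padding(embs, new_value, prev_value):
    # Rows whose embedding is entirely zero are treated as padding (assumption).
    is_pad = (embs == 0).all(dim=1, keepdim=True)      # shape (batch, 1)
    return torch.where(is_pad, prev_value, new_value)  # mask broadcasts over features

embs = torch.tensor([[0.3, -0.1], [0.0, 0.0]])         # second row is padding
prev_hidden = torch.tensor([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
new_hidden = torch.tensor([[9.0, 9.0, 9.0], [9.0, 9.0, 9.0]])

print(keep_previous_on_padding(embs, new_hidden, prev_hidden))
```

The first row takes the new value while the padded second row keeps its previous state; the same mask can be applied to the output.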

In [159]:
# GRADED - 0.75 Marks
def forward_rnn_one_step(rnn, embs, prev_hidden, prev_output):
	'''
		Takes one forward step through the rnn cell. In case a padding token is encountered, no change is made to the output and the hidden state.
		
		Arguments:
			rnn: Dictionary containing the weights and biases of the rnn
			embs: The input embeddings of the tokens of the datapoint
			prev_hidden: The previous hidden state
			prev_output: The previous outputs

		Returns:
			hidden_state: The next hidden state
			output: The output for this sequence
	'''
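	# YOUR CODE HERE
	I, H, O = rnn["I"], rnn["H"], rnn["O"]
	# Flatten the biases into rows so that they broadcast over the batch
	I_b = rnn["I_b"].reshape(1, -1)
	O_b = rnn["O_b"].reshape(1, -1)

	# Equations as stated above: tanh on the input term only, output from the previous hidden state
	new_hidden = torch.tanh(torch.matmul(embs, I)) + torch.matmul(prev_hidden, H) + I_b
	new_output = torch.matmul(prev_hidden, O) + O_b

	# Assumption: a padding token is recognised by its all-zero embedding row
	is_pad = (embs == 0).all(dim=1, keepdim=True)
	hidden_state = torch.where(is_pad, prev_hidden, new_hidden)
	output = torch.where(is_pad, prev_output, new_output)

	return hidden_state, output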

The below function passes your features through the fully connected layer. The equations are summarised again for your convenience:

 $ Y = (X \cdot W + b)$

In [160]:
# GRADED - 0.75 Marks
def classify(feat, classifier):
	'''
		Performs a forward pass through the classifier

		Arguments:
			feat: The feature vector for classification
			classifier: Dictionary containing the weights and biases of the classifier

		Returns:
			logits: The logits for each word in our model
	'''
	# YOUR CODE HERE
	w = classifier['weight']
	# Flatten the bias into a row so that it broadcasts over the batch dimension
	b = classifier['bias'].reshape(1, -1)
	logits = torch.matmul(feat, w) + b
	return logits

The forward function passes your input data through all the above functions in sequence. Note the sequential nature of calling the rnn, since the previous hidden state has to be used.

The initial hidden state has to be initialised to zeros, while the output has to be randomly initialised.

The features used for the classifier are the final output of the RNN. Note that the features can also be a concatenation of all outputs/hidden states; you would have to change the classifier accordingly.

The output logits will have to be converted to a probability distribution. For this purpose, it will be passed through a softmax activation.
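
A minimal sketch of that last step on made-up logits: `F.softmax` with `dim = 1` normalises across the classes of each datapoint, so every row sums to one and the most probable class is unchanged. The logit values are arbitrary.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0], [0.0, 0.0, 3.0]])  # (batch, num_classes)
probs = F.softmax(logits, dim=1)                             # normalise across classes

print(probs.sum(dim=1))
print(torch.argmax(probs, dim=1))
```

Using `dim = 0` instead would normalise across the batch, which is a common source of silently wrong probabilities.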

In [161]:
def forward(x, seq_len, hidden_len, classifier, embs, rnn):
	'''
		Performs a forward pass for the batched data

		Arguments:
			x: One minibatch of input indices of size (batch_size, seq_len)
			seq_len: The maximum permitted length of the token sequences in the datapoints
			hidden_len: The length of the hidden state of the rnn cell
			classifier: Dictionary containing the weights and biases of the classifier
			embs: The input embeddings of the tokens of the datapoint
			rnn: Dictionary containing the weights and biases of the rnn

		Returns:
			probs: The probabilities of all the next words
	'''
	input_embs = embs[x]
	hidden_state = torch.zeros(x.shape[0], hidden_len)
	output = torch.zeros(x.shape[0], hidden_len)
	for i in range(seq_len):
		hidden_state, output = forward_rnn_one_step(rnn, input_embs[:, i, :], hidden_state, output)
		
	logits = classify(output, classifier)
	probs = F.softmax(logits, dim = 1)
	return probs

Run the below cell to call your modelling functions.

In [162]:
torch.manual_seed(69)
print('Creating RNN')
rnn = create_rnn(HIDDEN_LEN, EMB_DIM)
print('Creating Embeddings')
embs = create_embeddings(EMB_DIM, VOCAB_SIZE, vocab)
print('Creating Classifier')
classifier = create_classifier(HIDDEN_LEN, VOCAB_SIZE)

Creating RNN
Creating Embeddings


Reading GloVe Embeddings:   0%|          | 0/400000 [00:00<?, ?it/s]

Creating Classifier


In [163]:
# Sample Test Cases
print(embs.shape)
print((VOCAB_SIZE, EMB_DIM))
assert rnn['H'].shape == (HIDDEN_LEN, HIDDEN_LEN)
assert torch.isclose(rnn['I_b'][40], torch.tensor(0.1261), atol = 0.1) 
assert torch.isclose(rnn['I'][40][40], torch.tensor(0.9098), atol = 0.1) 
assert embs.shape == (VOCAB_SIZE, EMB_DIM)
assert len(classifier['bias']) == VOCAB_SIZE
print('Sample Test passed', '\U0001F44D')

torch.Size([6000, 50])
(6000, 50)
Sample Test passed 👍


In [164]:
# Sample Test Cases
torch.manual_seed(69)
test_output, _ = forward_rnn_one_step(rnn, torch.randn(32, EMB_DIM), torch.zeros(32, HIDDEN_LEN), torch.zeros(32, HIDDEN_LEN))
assert torch.isclose(test_output[1][0], torch.tensor(0.9958), atol = 0.01)
print('Sample Test passed', '\U0001F44D')

AssertionError: ignored

Now that your data and model are ready, it is time to pass the data through the model and get some predictions. Write an expression to calculate the number of batches and use that to create your dataloader, using the variables and prototype you have created above.

In [None]:
# GRADED - 0.75 Marks
# YOUR CODE HERE
num_batches = (len(dataset_ids) + BATCH_SIZE - 1) // BATCH_SIZE  # ceiling division, no extra imports needed
dataloader = create_dataloader(dataset_ids, labels, num_batches, BATCH_SIZE)

Just like you do in PyTorch, loop through your dataloader to get the batches. Perform a forward pass to get the probabilities of your next words. Choose the most probable word using ```torch.argmax``` and add these to a list called ```preds```.

These predictions will be indices; therefore, use ```vocab_inv``` to convert them back into words.

In [None]:
preds = []
for x, y in tqdm(dataloader, total = num_batches, desc = 'Forward Pass'):
  probs = forward(x, SEQ_LEN, HIDDEN_LEN, classifier, embs, rnn)
  next_word = torch.argmax(probs, dim = 1)
  preds.extend([vocab_inv[int(word)] for word in next_word])

In [None]:
# Sample Test Case
assert len(preds) == len(datapoints)
print('Sample Test passed', '\U0001F44D')

# End of this part.
Assignment by:

Devaansh Gupta (f20190187@pilani.bits-pilani.ac.in)

Palaash Agrawal (f20180565@pilani.bits-pilani.ac.in)

Harsh Sulakhe (f20180186@pilani.bits-pilani.ac.in)

