In [1]:
#hide
from fastbook import *

In the previous chapter we learned about transforms, collections of transforms and transformation
pipelines for both image and text data, using fastai's mid-level data API alongside the DataBlock
API. Until now, most of what we have learned about models has come through fastai's own objects,
methods and functions. Now we go a little deeper into deep learning: this part of the course focuses
on building state-of-the-art models.
This chapter focuses on fine-tuning a pretrained language model to build a text classifier. We will
also learn how to build an RNN and LSTMs from scratch. Let's start with the language model.

# A Language Model from Scratch

## The Data

Since we are implementing a language model from scratch for the first time, we use a simple dataset
so that results are quick to obtain and easy to interpret. The Human Numbers dataset contains about
10,000 numbers written out in English: one, two, three, ...

Let's download and extract the dataset and look at its contents.

In [2]:
# Download and extract the dataset; `path` is the folder it was extracted to
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)

In [3]:
#hide
Path.BASE_PATH = path

In [4]:
# ls() lists the contents of the dataset folder
path.ls()
# It contains one training file and one validation file

(#2) [Path('train.txt'),Path('valid.txt')]

The ls() method we used earlier shows two files, "train.txt" and "valid.txt", holding the training
and validation data respectively.

Each file contains the numbers written out as text, one per line. We open both files and concatenate
all the lines, ignoring the training/validation split.

In [5]:
# Create an L (fastai list) object to hold the concatenated lines
lines = L()
# Read every line of the training and validation files and append it to the list
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines

(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

Next, all the lines are stripped and joined into one big document, with " . " as the separator.

In [6]:
text = ' . '.join([l.strip() for l in lines]) # join the stripped lines with ' . ' as separator
text[:100]

'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

Next we tokenize the text by splitting on spaces.

In [7]:
tokens = text.split(' ') # tokenize by splitting on spaces
tokens[:10]
# Each "." is also a separate token

['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']

Next comes numericalization. For that we first create the vocab.

In [8]:
# Create the list of unique tokens (the vocab)
vocab = L(*tokens).unique()
vocab

(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

To numericalize, we look up the index of each word in the vocab and replace every token in the token
list with that index.

In [9]:
word2idx = {w:i for i,w in enumerate(vocab)} # map each word to its index in the vocab
nums = L(word2idx[i] for i in tokens) # numericalize the tokens using the index of each word in the vocab
nums

(#63095) [0,1,2,1,3,1,4,1,5,1...]

We have now done the two main text-processing tasks for a language model (tokenization and
numericalization), so our dataset is ready.
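
The two steps can be checked end to end on a tiny example. The following is a minimal sketch in plain Python (no fastai), assuming a made-up five-number text; the `toy_` names are illustrative and not part of the notebook. It tokenizes, builds a vocab in order of first appearance, numericalizes, and then decodes back to the original text.

```python
# Toy text in the same format as the joined Human Numbers document (illustrative only)
toy_text = 'one . two . three . two . one'

# Tokenize by splitting on spaces, as in the cells above
toy_tokens = toy_text.split(' ')

# Build the vocab: unique tokens in order of first appearance
toy_vocab = list(dict.fromkeys(toy_tokens))

# Numericalize: map each token to its index in the vocab
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}
toy_nums = [toy_word2idx[t] for t in toy_tokens]

# Decode back to text to confirm that nothing was lost
decoded = ' '.join(toy_vocab[i] for i in toy_nums)

print(toy_vocab)  # ['one', '.', 'two', 'three']
print(toy_nums)   # [0, 1, 2, 1, 3, 1, 2, 1, 0]
```

Decoding gives back `toy_text` exactly, which is a quick sanity check that the vocab and the index mapping agree.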

## Our First Language Model from Scratch

We will build a language model that predicts the next word from the previous three words.
To create DataLoaders we need a list of tuples of independent and dependent variables. Here the
independent variable is a sequence of three words and the dependent variable is the word that follows
it. We create one tuple for every third position in the token list, so the three-word windows do not
overlap.

In [10]:
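# Create an L of (three input words, next word) tuples by stepping through the token list
# three tokens at a time
L((tokens[i:i+3], tokens[i+3]) for i in range(0,len(tokens)-4,3))

(#21031) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]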

The previous code cell builds the list of tuples from word tokens. Deep learning models work only
with numerical data, so we build the same list from the numericalized tokens, wrapping each
three-number window in a tensor.

In [11]:
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3))
seqs
# In the first tuple, [0,1,2] is the input and 1 is the label

(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]
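
To see where the count of 21031 comes from, here is a minimal plain-Python sketch of the same windowing. `make_windows` is a hypothetical helper, not a fastai function; it uses lists instead of tensors and the same loop bounds as the cell above.

```python
def make_windows(nums, n_in=3):
    """Return (inputs, target) pairs using non-overlapping windows of n_in tokens."""
    return [(nums[i:i + n_in], nums[i + n_in])
            for i in range(0, len(nums) - n_in - 1, n_in)]

toy_nums = [0, 1, 2, 1, 3, 1, 4, 1, 5, 1]   # first ten numericalized tokens shown above
toy_windows = make_windows(toy_nums)
print(toy_windows)  # [([0, 1, 2], 1), ([1, 3, 1], 4)]

# Number of window start positions for the full corpus of 63095 tokens
n_windows = len(range(0, 63095 - 4, 3))
print(n_windows)  # 21031
```

Because the step equals the window length, each token is used as an input exactly once; only every third position gets a label.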

Now the data can be loaded into DataLoaders in mini-batches, since it is already arranged as (x, y)
pairs. The split is not random: the first 80% of the sequences are used for training and the
remaining 20% for validation.

In [12]:
bs = 64 # batch size
cut = int(len(seqs) * 0.8) # split index: first 80% for training, last 20% for validation
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False,num_workers=0) 
# Create DataLoaders from the training and validation sets with the given batch size

We now have a DataLoaders object, so we are ready to create our first language model in PyTorch from
scratch.

We will create a neural network that takes three words as input and returns, for each word in the
vocab, the probability that it comes next. This is similar to the multicategory classification we
have done earlier.

We will construct a three-layer model with some modifications. The first layer uses the first word's
embedding as its input. The second layer uses the second word's embedding plus the first layer's
output activations, and the third layer uses the third word's embedding plus the second layer's
output activations. This way every word is interpreted in the context of the words preceding it.

All these layers share the same weight matrix, so the way a word is processed does not depend on its
position in the sequence. The activations change as data moves from layer to layer, but the layer
weights do not, so the layer learns to handle every position in the sequence.
Because the same layer is applied at every step, we only need to create it once: PyTorch lets us
define one layer and call it repeatedly.

Now let's write the architecture for the model.

### Our Language Model in PyTorch

In [13]:
# Language model 1 (LMModel1). Subclassing fastai's Module gives us a PyTorch module
# without having to call super().__init__()
class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        # Input to hidden: one embedding vector of size n_hidden per vocab token
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        # Hidden to hidden: maps n_hidden activations to n_hidden activations
        self.h_h = nn.Linear(n_hidden, n_hidden)
        # Hidden to output: maps n_hidden activations to one score per vocab token
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        
    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:,0]))) # embed the first word, then hidden layer + ReLU
        h = h + self.i_h(x[:,1])               # add the second word's embedding
        h = F.relu(self.h_h(h))                # hidden layer + ReLU
        h = h + self.i_h(x[:,2])               # add the third word's embedding
        h = F.relu(self.h_h(h))                # hidden layer + ReLU
        return self.h_o(h)                     # scores for the next word

The neural network above has three layers:
1. The embedding layer (i_h), from input to hidden.
2. The linear layer (h_h), from hidden to hidden, which creates the activations for the next word.
3. The final linear layer (h_o), from hidden to output, which predicts the fourth word.

It can also be represented pictorially with the following diagram:
<img alt="Pictorial representation of simple neural network" width="400" src="images/att_00020.png" caption="Pictorial representation of a simple neural network" id="img_simple_nn">

In the figure above, the rectangle represents the input layer, the circle represents the inner
(hidden) layer and the triangle represents the final layer. The figure shows the basic architecture.
As mentioned before, because the layers share the same embedding matrix and weights, PyTorch lets us
create each layer once and reuse it. The actual model looks like this:
<img alt="Representation of our basic language model" width="500" caption="Representation of our basic language model" id="lm_rep" src="images/att_00022.png">

In the figure above, the first word is fed to the input layer and passed through the hidden layer to
get activations. These activations, together with the second word's input, go through the hidden
layer again. The resulting activations, together with the third word's input, go through the hidden
layer a third time, and the result is fed to the output layer, which predicts the next word in the
sequence. Now we can create a Learner for this model by passing the DataLoaders, the model object
(built from the vocab size and the hidden size), the loss function and the metrics. We then train the
model for 4 epochs with a learning rate of 10^-3.

In [14]:
#Creating the learner for the model
learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)#Training the model for 4 epochs

epoch,train_loss,valid_loss,accuracy,time
0,1.824297,1.970941,0.467554,00:02
1,1.386973,1.823242,0.467554,00:01
2,1.417556,1.654497,0.494414,00:01
3,1.37644,1.650849,0.494414,00:01


The model we have trained so far is very basic, with only three layers, so it serves as the baseline
for the models that follow. Next we will try to train a model that gives better accuracy.
To judge whether the baseline's accuracy is any good, let's see which token occurs most often in the
labels of the validation set. The simplest possible predictor would always output that token.

In [15]:
n,counts = 0,torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n

(tensor(29), 'thousand', 0.15165200855716662)
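
The same count can be written more compactly with `collections.Counter`. This is a minimal sketch on a made-up list of label indices (`toy_labels` is hypothetical; the real labels come from `dls.valid`), showing how the accuracy of an always-predict-the-most-common-token model is obtained.

```python
from collections import Counter

# Hypothetical validation labels (token indices); the real ones come from dls.valid
toy_labels = [29, 1, 29, 5, 1, 29, 7, 29]

counts = Counter(toy_labels)
most_common_idx, n_hits = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the most common token
baseline_acc = n_hits / len(toy_labels)
print(most_common_idx, baseline_acc)  # 29 0.5
```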

The most common token has index 29, which stands for the word "thousand". Always predicting it would
give an accuracy of about 15%. It is the most common token because, when the numbers up to 10000 are
written out in words, every number from one thousand onward contains it.
Our three-layer model reached 49% accuracy, which is far better than that and good for a baseline.
Let's build another model with some refactoring. The number and type of layers stay the same; the
only change is that a for loop calls the input layer and the hidden layer for each word. This type of
model is called a "Recurrent Neural Network". Let's first create its architecture.

### Our First Recurrent Neural Network

The number of layers and the architecture of this model are the same as in the baseline model. We
refactor it with a for loop. The advantage of the loop is that the model is no longer tied to a fixed
number of words: by changing the loop bound, the same model could take sequences of more than 3
words. Let's create the model now.

In [16]:
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden): # (vocab size, hidden size)
        self.i_h = nn.Embedding(vocab_sz, n_hidden) # input layer: vocab size -> hidden size
        self.h_h = nn.Linear(n_hidden, n_hidden)    # hidden layer: hidden size -> hidden size
        self.h_o = nn.Linear(n_hidden,vocab_sz)     # output layer: hidden size -> vocab size
        
    def forward(self, x):
        h = 0
        for i in range(3):                  # iterate over the three words in the input
            h = h + self.i_h(x[:,i])        # add the embedding of word i to the hidden state
            h = F.relu(self.h_h(h))         # hidden layer + ReLU
        return self.h_o(h)                  # predictions after all words have been processed

Let's train the model by creating a Learner and passing it the model architecture. We train for 4
epochs with a learning rate of 10^-3.

In [17]:
#Creating learner for the model
learn = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)#Fitting the model for 4 epochs..

epoch,train_loss,valid_loss,accuracy,time
0,1.816274,1.964143,0.460185,00:02
1,1.423805,1.739964,0.473259,00:01
2,1.430327,1.685171,0.485382,00:01
3,1.38839,1.657033,0.470406,00:01


After refactoring, we get almost the same results: the accuracy is about 47%, against about 49% for
the baseline. The only difference between the two models is the for loop that iterates over the words
in the input. The activations are updated on every pass through the loop and stored in h, which is
called the hidden state. This is one of the main properties of RNNs: a neural network defined with a
loop like this is called a Recurrent Neural Network (RNN). It is simply a refactoring of a multilayer
neural network using a for loop. We trained a simple RNN for a few epochs and got results similar to
the baseline model. We can improve its performance further with some changes.
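
That the loop is only a refactoring can be checked numerically. The following is a minimal plain-Python sketch, with small lists standing in for tensors and made-up random weights; `hidden_layer`, `forward_unrolled` and `forward_loop` are illustrative names, and the output layer is left out because it is identical in both models.

```python
import random

random.seed(0)
n_hidden, vocab_sz = 4, 5

# One set of shared parameters (hypothetical random values, plain lists instead of tensors)
emb = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(vocab_sz)]
W = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_hidden)]
b = [random.uniform(-1, 1) for _ in range(n_hidden)]

def hidden_layer(h):
    """Linear layer followed by ReLU, standing in for F.relu(self.h_h(h))."""
    return [max(0.0, sum(w * x for w, x in zip(row, h)) + bi) for row, bi in zip(W, b)]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + c for a, c in zip(u, v)]

def forward_unrolled(x):
    """Three explicit steps, as in LMModel1."""
    h = hidden_layer(emb[x[0]])
    h = hidden_layer(add(h, emb[x[1]]))
    h = hidden_layer(add(h, emb[x[2]]))
    return h

def forward_loop(x):
    """The same computation written as a loop, as in LMModel2."""
    h = [0.0] * n_hidden
    for tok in x:
        h = hidden_layer(add(h, emb[tok]))
    return h

x = [0, 1, 2]
print(forward_unrolled(x) == forward_loop(x))  # True
```

Unlike the unrolled version, `forward_loop` also accepts an input of any length, for example `forward_loop([0, 1, 2, 3, 4])`.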

Next we discuss how to improve the performance of an RNN.

## Improving the RNN

In this RNN, the hidden state (the variable h) is reset to zero for every sequence of words passed
through the network. The sequence length (the number of words in each sequence) is fixed at 3 for
now. Sequences are kept this short so that the data fits easily into mini-batches. If the samples
were fed in the right order, the hidden state could be carried from one sample to the next, and the
model would in effect be trained on a much longer sequence.
Another thing to notice is that we predict only the fourth word, although the model could also
predict the second and third words from the intermediate hidden states.
We can introduce both modifications in the model. Let's see how.

### Maintaining the State of an RNN

In the previous model the hidden state is initialized to zero for each new sample, so no information
about the preceding text is carried forward. The model therefore cannot know whether it is at the
start, in the middle or at the end of the counting sequence. This can be corrected by initializing
the hidden state in "__init__" instead of in "forward".
Keeping the hidden state from one call to the next effectively adds layers to the model: it becomes
as deep as the number of tokens seen so far in the document. The picture of the RNN without the for
loop shows why.

The very first model, with three layers and no for loop, was the baseline model. Each of its layers
is fed one input token; this is called the "unrolled representation". When we refactor with a for
loop and keep the hidden state across calls, the unrolled model gains one layer for every token, so
for a dataset with 10000 tokens it is 10000 layers deep. The problem with such a deep model is that
computing the derivatives requires backpropagating all the way to the first layer, which is slow and
consumes a lot of memory.
To avoid this, backpropagation runs only through the last three layers and the earlier gradient
history is discarded. In PyTorch this is done with the detach method.
The new variant of the RNN follows. It initializes h in __init__ and stores the activations in
self.h, so the state left by one call to forward is the starting state for the next mini-batch.

In [18]:
class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden) # input layer
        self.h_h = nn.Linear(n_hidden, n_hidden)    # hidden layer
        self.h_o = nn.Linear(n_hidden,vocab_sz)     # output layer
        self.h = 0                                  # hidden state, kept between calls to forward
        
    def forward(self, x):
        for i in range(3):                      # iterate over the 3 words in each sample
            self.h = self.h + self.i_h(x[:,i])  # add the embedding of word i to the hidden state
            self.h = F.relu(self.h_h(self.h))   # hidden layer + ReLU
        out = self.h_o(self.h)                  # predictions from the output layer
        self.h = self.h.detach()                # keep the state but drop its gradient history
        return out
    
    def reset(self): self.h = 0

Because the hidden state is kept in self.h, the model starts each batch from the activations left by
the previous batch, whatever the sequence length. This solves the problem of losing all information
about the preceding batch. Gradients are calculated only through the three tokens of the current
sequence, because detach cuts the history before them. Treating the RNN as one unrolled network with
a layer per time step and computing gradients through it is called "backpropagation through time"
(BPTT). Cutting the history after a few time steps, as we do here to save memory and time, is the
truncated form of BPTT.

The order of the data matters a lot for LMModel3. Samples must arrive in order, so that dset[0] is
the first row of the first batch, dset[1] is the first row of the second batch, and so on.
To ensure this, the data has to be rearranged. The dataset is divided into bs contiguous groups of
m = len(seqs)//bs samples each, where bs = 64 is the batch size and len(seqs) is the total number of
samples. The first batch then consists of the samples with indices:

(0, m, 2*m, ..., (bs-1)*m)

and the second batch of:

(1, m+1, 2*m+1, ..., (bs-1)*m+1) and so on.

Each sample holds 3 tokens, so over one epoch every row of the batch reads a contiguous stretch of
3*m tokens.

In [19]:
m = len(seqs)//bs # number of samples in each group = total number of samples // batch size
m,bs,len(seqs) # samples per group, batch size, total number of samples

(328, 64, 21031)

The function "group_chunks" below does this reindexing, given the dataset and the batch size.

In [20]:
# Reorder the dataset so that consecutive batches continue each other row by row
def group_chunks(ds, bs):
    m = len(ds) // bs # number of samples in each of the bs contiguous groups
    new_ds = L() # new list object for the reordered samples
    for i in range(m): # batch i takes sample i of every group
        new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds # the reordered dataset
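
The effect of the reindexing is easiest to see on a toy dataset. Below is a minimal plain-Python sketch: `group_chunks_list` is a hypothetical list-based copy of the function above, applied to 12 stand-in samples with a batch size of 3, so that m = 4.

```python
def group_chunks_list(ds, bs):
    """Plain-list version of group_chunks, for illustration."""
    m = len(ds) // bs
    return [ds[i + m * j] for i in range(m) for j in range(bs)]

toy_ds = list(range(12))      # stand-ins for 12 consecutive samples
toy_bs = 3
reordered = group_chunks_list(toy_ds, toy_bs)

# Cut the reordered list into consecutive batches of size bs, as an unshuffled DataLoader would
batches = [reordered[k:k + toy_bs] for k in range(0, len(reordered), toy_bs)]
print(batches)  # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
```

Reading down any one column (0, 1, 2, 3 or 4, 5, 6, 7) gives consecutive samples, which is what lets the hidden state of each batch row carry over meaningfully to the next batch.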

Next we build the DataLoaders, using the split index to separate training and validation data. We
pass drop_last=True so that the last batch is dropped, since it does not have a batch size equal to
bs. We pass shuffle=False so that the text is not rearranged and is read in sequence.

In [21]:
#Creating dataloaders from the dataset
cut = int(len(seqs) * 0.8)#index for training,validation split
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs),#returns training set 
    group_chunks(seqs[cut:], bs),#returns validation set
    bs=bs, drop_last=True, shuffle=False,num_workers=0)

Next we create a Learner for training the model. Because the unrolled model is now effectively much
deeper, we also increase the number of epochs and train for longer. The last argument passed to the
Learner is cbs=ModelResetter. This is a callback that calls the model's reset method at the beginning
of each epoch and before each validation phase. It ensures that the hidden state starts from zero
before the model reads the sequential text again.

In [22]:
# Create the Learner: pass the DataLoaders, the model (vocab size, hidden size), the loss function,
# the metrics and the ModelResetter callback, which calls the model's reset method
learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(10, 3e-3) # train for more epochs than before

epoch,train_loss,valid_loss,accuracy,time
0,1.677074,1.827367,0.467548,00:02
1,1.282722,1.870913,0.388942,00:01
2,1.090705,1.651794,0.4625,00:01
3,1.005216,1.61599,0.515144,00:01
4,0.96302,1.605894,0.551202,00:01
5,0.916012,1.704552,0.547596,00:01
6,0.895979,1.651697,0.560577,00:01
7,0.837829,1.661127,0.579567,00:01
8,0.806078,1.697939,0.594231,00:01
9,0.792238,1.657946,0.603125,00:01


The accuracy improves to about 60% once the hidden state is initialized in "__init__" and kept
between batches. This version of the RNN also remembers the preceding sequences of words. But the
model can still be improved. Let's see how.

### Creating More Signal

All the models we have built so far predict one next word from the previous three words. For
backpropagation this means that the signal from a single predicted word is used to update the weights
applied to three input words. We can create more signal by predicting the next word after every
single word.
Pictorially, this looks like:
<img alt="RNN predicting after every token" width="400" caption="RNN predicting after every token" id="stateful_rep" src="images/att_00024.png">

Previously we constructed tuples of three input words and one output word. Now the input and the
output are both sequences, so that for each input word there is an output (the word that follows it).
Instead of 3, we use a sequence length (sl) greater than 3. We repeat the earlier process: create the
list of tuples of independent and dependent variables, define the index that splits training and
validation data, and create DataLoaders by passing each split through the "group_chunks" function. As
before, drop_last=True drops the last batch, which does not have the full batch size, and
shuffle=False keeps the words in sequence.

In [23]:
# Prepare the dataset: for each input word, the target is the word that follows it
sl = 16 # sequence length
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1])) # (input, target shifted by one token)
         for i in range(0,len(nums)-sl-1,sl)) # step through the tokens sl at a time
cut = int(len(seqs) * 0.8) # index for the training/validation split
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs), # training set
                             group_chunks(seqs[cut:], bs), # validation set
                             bs=bs, drop_last=True, shuffle=False,num_workers=0)

Let's look at the first element of seqs. It contains two lists of equal length: one is the input and
the other is the output, which holds the next word for each word in the first list.

In [24]:
#printing the first element in the seqs 
[L(vocab[o] for o in s) for s in seqs[0]]

[(#16) ['one','.','two','.','three','.','four','.','five','.'...],
 (#16) ['.','two','.','three','.','four','.','five','.','six'...]]
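
The shift-by-one construction can be sketched in plain Python on a short list. `make_shifted_seqs` is a hypothetical helper that mirrors the loop used to build `seqs`, with lists instead of tensors and a small `sl` so the result is easy to read.

```python
def make_shifted_seqs(nums, sl):
    """Pair each window of sl tokens with the same window shifted one token ahead."""
    return [(nums[i:i + sl], nums[i + 1:i + sl + 1])
            for i in range(0, len(nums) - sl - 1, sl)]

toy_nums = list(range(10))           # hypothetical token ids 0..9
toy_seqs = make_shifted_seqs(toy_nums, sl=4)
print(toy_seqs)  # [([0, 1, 2, 3], [1, 2, 3, 4]), ([4, 5, 6, 7], [5, 6, 7, 8])]
```

Every input position now has its own target, so one sequence of length sl contributes sl training signals instead of one.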

The forward function now has to predict a word after every input word, not only at the end of a
three-word sequence. We therefore create a list, outs, and append the prediction made after each
input word. The function returns these predictions stacked into a single tensor.

In [25]:
class LMModel4(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden) # input layer
        self.h_h = nn.Linear(n_hidden, n_hidden)    # hidden layer
        self.h_o = nn.Linear(n_hidden,vocab_sz)     # output layer
        self.h = 0                                  # hidden state, kept between calls to forward
        
    def forward(self, x):
        outs = []                                   # one prediction per input word
        for i in range(sl):                         # iterate over every word in the sequence
            self.h = self.h + self.i_h(x[:,i])      # add the embedding of word i to the hidden state
            self.h = F.relu(self.h_h(self.h))       # hidden layer + ReLU
            outs.append(self.h_o(self.h))           # prediction after word i
        self.h = self.h.detach()                    # keep the state but drop its gradient history
        return torch.stack(outs, dim=1)             # shape: bs x sl x vocab size
    
    def reset(self): self.h = 0 # called by ModelResetter to reset the hidden state

The model returns output of shape bs x sl x vocab_sz, because the predictions for each step are
stacked on dim=1. The targets have shape bs x sl (batch size x sequence length). F.cross_entropy
expects predictions of shape (N, vocab_sz) and targets of shape (N,), so both need to be flattened
before they are passed to the loss function. We therefore define a separate loss function,
"loss_func", which takes the predictions and targets and flattens both.

In [26]:
# Loss function: flatten predictions to (bs*sl, vocab size) and targets to (bs*sl,),
# then compute the cross-entropy
def loss_func(inp, targ):
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))
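
What the flattening does can be shown without PyTorch. The following is a minimal plain-Python sketch, assuming made-up scores for a batch of 2 sequences of length 3 over a vocab of 4; `cross_entropy` here is a hand-written stand-in that averages over all positions, which is the default behaviour of `F.cross_entropy`.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against one target index."""
    log_sum_exp = math.log(sum(math.exp(v) for v in logits))
    return log_sum_exp - logits[target]

# Hypothetical predictions with shape (bs=2, sl=3, vocab=4) and targets with shape (2, 3)
preds = [[[2.0, 0.1, 0.1, 0.1], [0.1, 2.0, 0.1, 0.1], [0.1, 0.1, 2.0, 0.1]],
         [[0.1, 0.1, 0.1, 2.0], [2.0, 0.1, 0.1, 0.1], [0.1, 2.0, 0.1, 0.1]]]
targs = [[0, 1, 2], [3, 0, 0]]

# Flatten: (bs, sl, vocab) -> (bs*sl, vocab) and (bs, sl) -> (bs*sl,)
flat_preds = [row for seq in preds for row in seq]
flat_targs = [t for seq in targs for t in seq]

# Mean loss over all bs*sl positions
loss = sum(cross_entropy(p, t) for p, t in zip(flat_preds, flat_targs)) / len(flat_targs)
print(len(flat_preds), len(flat_targs))  # 6 6
```

After flattening, every (sequence, position) pair is just one more row in an ordinary classification batch.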

Next, this loss function is passed to the Learner along with the DataLoaders, the model
architecture, the metrics and the ModelResetter callback. The model should be trained for more
epochs, because it now makes a prediction at every position and therefore has more to learn. Because
the unrolled network is effectively very deep, results differ from one training run to the next:
gradients can become very small or very large.

In [27]:
#Creating a learner
learn = Learner(dls, LMModel4(len(vocab), 64), loss_func=loss_func,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)#training model for more number of epochs

epoch,train_loss,valid_loss,accuracy,time
0,3.285931,3.072032,0.212565,00:01
1,2.330371,1.969522,0.425781,00:01
2,1.742317,1.841378,0.441488,00:01
3,1.47012,1.810856,0.494303,00:01
4,1.298811,1.823132,0.492839,00:01
5,1.177598,1.769712,0.507568,00:01
6,1.074456,1.784031,0.492594,00:01
7,0.980687,1.782864,0.510661,00:01
8,0.902815,1.692381,0.567057,00:00
9,0.841107,1.680223,0.572754,00:00


This model is trained for more epochs and, instead of predicting one word per three-word sequence,
it outputs a prediction for every word in the sequence, so it gets much more signal to learn from.
The next step is to make the model deeper.

The architecture is still small: it contains one input layer, one hidden layer and one output layer.
Every word in the sequence passes through each of these layers, and a prediction is obtained for
each word. Adding more layers may improve the model.

## Multilayer RNNs

In a multilayer RNN, several RNNs are stacked: the activations of one RNN are passed as input to the
next RNN instead of going straight to the output layer.
<img alt="2-layer RNN" width="550" caption="2-layer RNN" id="stacked_rnn_rep" src="images/att_00025.png">

To implement this we do not need to write the extra layers ourselves. PyTorch's nn.RNN class
implements the same kind of RNN we created earlier and also lets us stack several RNNs. When creating
the model for the Learner, we now pass the number of stacked RNNs along with the vocab size and the
hidden size.

### The Model

In this multilayer RNN model, information is passed from one RNN to the next.

In [28]:
class LMModel5(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden) # input layer
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True) # stack of n_layers RNNs
        self.h_o = nn.Linear(n_hidden, vocab_sz) # output layer
        self.h = torch.zeros(n_layers, bs, n_hidden) # one zero hidden state per layer
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h) # run the embeddings and the stored state through the RNNs
        self.h = h.detach() # keep the state but drop its gradient history
        return self.h_o(res) # predictions for every position in the sequence
    
    def reset(self): self.h.zero_() # called by ModelResetter to zero the hidden state

In this model PyTorch's nn.RNN class replaces our hand-written loop. With n_layers=2 it stacks two
RNNs: the second takes the activations of the first as input, and its activations are passed through
the output layer to get the output.
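
How activations flow through a stack of RNNs can be sketched in plain Python. This is an illustration under stated assumptions, not PyTorch's implementation: the weights are made-up random values, biases are omitted, tanh is used as the nonlinearity (the default for `nn.RNN`), and `rnn_step` and `stacked_rnn` are hypothetical names.

```python
import math
import random

random.seed(0)
n_hidden, n_layers = 3, 2

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# One (input weights, hidden weights) pair per layer; hypothetical random values, biases omitted
layers = [(rand_matrix(n_hidden, n_hidden), rand_matrix(n_hidden, n_hidden))
          for _ in range(n_layers)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(inp, h_prev, W_ih, W_hh):
    """One tanh RNN step: new hidden state from the current input and the previous state."""
    return [math.tanh(a + c) for a, c in zip(matvec(W_ih, inp), matvec(W_hh, h_prev))]

def stacked_rnn(xs, hs):
    """Run a sequence through all layers; hs holds one hidden state per layer."""
    outs = []
    for x in xs:
        inp = x
        for l, (W_ih, W_hh) in enumerate(layers):
            hs[l] = rnn_step(inp, hs[l], W_ih, W_hh)
            inp = hs[l]               # output of layer l is the input of layer l+1
        outs.append(inp)              # top-layer activation at this time step
    return outs, hs

xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]           # two embedded tokens (made up)
outs, hs = stacked_rnn(xs, [[0.0] * n_hidden for _ in range(n_layers)])
print(len(outs), len(hs))  # 2 2
```

There is one output per time step and one hidden state per layer, which matches the `(n_layers, bs, n_hidden)` shape of `self.h` in LMModel5 once a batch dimension is added.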

Next we create a Learner and train the model in the same way as before, again for more epochs. The
only difference is that when instantiating the model we also pass the number of stacked RNNs, in
addition to the vocab size and the hidden size.

In [29]:
#Creating learner 
learn = Learner(dls, LMModel5(len(vocab), 64, 2), 
                loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)#Training the model for more number of epochs

epoch,train_loss,valid_loss,accuracy,time
0,3.04179,2.548714,0.455811,00:01
1,2.128514,1.708763,0.471029,00:01
2,1.699163,1.86605,0.340576,00:01
3,1.499681,1.738478,0.471517,00:01
4,1.33909,1.729538,0.494792,00:01
5,1.206317,1.835856,0.502848,00:00
6,1.088241,1.845554,0.520101,00:00
7,0.982788,1.856255,0.522624,00:00
8,0.890793,1.940331,0.525716,00:00
9,0.809587,2.028803,0.529785,00:00


Earlier we said that deeper models should give better results, but that is not the case here. In the
epochs shown above, the single-layer RNN reached a higher accuracy (about 57%) than this model
(about 53%). Multilayer RNNs fail to do better because of the problem of exploding or vanishing
activations. What do we mean by exploding or vanishing activations? Let's discuss.

### Exploding or Disappearing Activations

Previously we constructed three RNN models, each introducing a change to the preceding one and each
leading to improved results. In the last model we stacked RNNs to create a deeper architecture, but
it gives lower accuracy than the single-RNN model. In practice, getting accurate results from this
kind of RNN is very difficult. We would get better results by calling detach less often and by using
more layers, since the model could then learn longer-range and richer patterns. But both changes make
the architecture deeper, and deeper architectures suffer from exploding or vanishing activations.
Why does this happen, and how can we overcome it?

Deep architectures have many layers, so the input is multiplied by a weight matrix many times over.
Consider what happens when we multiply by a number repeatedly, as in a geometric progression (GP).
Starting from 1 and multiplying by 2 each time gives 2, 4, 8, 16 and so on, which becomes very large
very quickly. Multiplying by a small number such as 0.5 instead gives 0.5, 0.25, 0.125 and so on,
which decreases exponentially towards zero. This is what exploding or vanishing activations means:
repeated multiplication by a number greater than 1 blows the starting value up, and repeated
multiplication by a number smaller than 1 shrinks it to almost nothing.

Computing activations involves matrix multiplication, in which numbers are multiplied and added. A
deep neural network repeats this once per layer, which is why the values at the end of the network
can be very small or very large.

Computers store numbers in floating point, which becomes less accurate the further a number is from
zero (positive or negative). In very deep networks this inaccuracy results in gradients becoming zero
or infinite, so the weights are not updated properly and training stops improving.
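
The arithmetic is easy to reproduce. The following is a minimal sketch in plain Python, using a single number in place of a weight matrix; `repeat_multiply` is an illustrative helper, and each multiplication stands in for one layer.

```python
def repeat_multiply(factor, n_steps, start=1.0):
    """Multiply `start` by `factor` n_steps times, as a stand-in for n_steps layers."""
    x = start
    for _ in range(n_steps):
        x *= factor
    return x

print(repeat_multiply(2.0, 10))     # 1024.0
print(repeat_multiply(0.5, 10))     # 0.0009765625
print(repeat_multiply(2.0, 2000))   # inf
print(repeat_multiply(0.5, 2000))   # 0.0
```

After 2000 steps the values have left the range that a 64-bit float can represent: one overflows to infinity and the other underflows to exactly zero, so no useful gradient signal remains.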

There are several approaches to this problem, such as batch normalization or changes to the
initialization, which will be discussed in later lessons. Here we discuss the approaches used in RNNs
to avoid exploding activations: "__Gated Recurrent Units__" (GRU) layers and
"__Long short-term memory__" (LSTM) layers. Let's discuss LSTMs first.

## LSTM

The LSTM architecture was introduced in 1997. The difference from the RNNs we built earlier is that
an LSTM has two hidden states instead of one. In the basic RNN, the hidden state is the output of the
RNN at the previous time step, and it has to do two jobs at once: carry the information needed to
predict the next token, and keep a memory of everything that came earlier in the sentence.

For example, consider the sentences "Tom has a dog and he likes it very much" and "Siya has a dog and
she likes it very much". To choose between he/she (or his/her), the model has to remember the name at
the start of the sentence.

In practice, RNNs are poor at retaining information from much earlier in the sentence, which is why
LSTMs have a second hidden state, called the cell state. The cell state is responsible for keeping
the long short-term memory of the sentence, while the hidden state focuses on predicting the next
token. Let's see how we can build an LSTM from scratch.

### LSTM Architecture

Let's understand the LSTM architecture first. The image below shows the inside of an LSTM:
<img src="images/LSTM.png" id="lstm" caption="Architecture of an LSTM" alt="A graph showing the inner architecture of an LSTM" width="500">

Here the input xt enters on the left, together with the previous hidden state (h(t-1)) and the
previous cell state (c(t-1)). The four orange boxes are layers (neural nets) whose activation is
either sigmoid or tanh. tanh is just a sigmoid rescaled so that its output lies in (-1, 1) instead
of (0, 1).
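
The relation between the two activation functions can be checked numerically. The sketch below uses the identity tanh(x) = 2σ(2x) - 1 on a handful of sample points; the points themselves are arbitrary.

```python
import torch

x = torch.linspace(-4, 4, steps=9)

sig = torch.sigmoid(x)                    # values in (0, 1)
tanh_x = torch.tanh(x)                    # values in (-1, 1)
rescaled = 2 * torch.sigmoid(2 * x) - 1   # sigmoid stretched to (-1, 1)

print(torch.allclose(tanh_x, rescaled, atol=1e-6))
```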

<img src="https://machinelearningblogcom.files.wordpress.com/2017/11/bildschirmfoto-2017-11-10-um-12-20-57.png?w=428&h=237"  caption="tanh activation function" alt="A graph showing tanh activation function" width="500"> 

Besides tanh, the other activation function in the figure is "σ", the sigmoid. The green circles are
elementwise operations. On the right side of the box two signals come out: the new hidden state
(h(t)) and the new cell state (c(t)), both of which are fed in at the next time step. The new hidden
state is also used as the output, which is why its arrow splits as it leaves the box.

The four neural nets are called "gates". Notice that the cell state from the previous time step,
c(t-1), changes very little on its way to c(t): in the figure it never passes directly through a
neural net, only through elementwise operations. That is why it can carry information over long
spans. Let's go through the four gates (neural nets) one by one.

The first gate is the "__forget gate__". The input xt and the previous hidden state h(t-1) enter the
box together. In the RNN we wrote earlier, the input embedding and the previous activations were
added, but in an LSTM they are concatenated into one bigger tensor, so the dimension of xt can differ
from that of h(t-1). Every gate receives xt and h(t-1) together, so its input size is the sum of
their sizes. If "__n_in__" and "__n_hid__" are the sizes of x(t) and h(t-1) respectively, then each
neural net has "__n_in__+__n_hid__" inputs and "__n_hid__" outputs.
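
The shapes described above can be checked with a few lines of PyTorch. This is a minimal sketch: the sizes `bs`, `n_in` and `n_hid` are made up for illustration, and `gate` stands for any one of the four gates.

```python
import torch
from torch import nn

bs, n_in, n_hid = 8, 10, 16        # illustrative sizes
x_t = torch.randn(bs, n_in)        # input at the current time step
h_prev = torch.randn(bs, n_hid)    # previous hidden state

# Concatenate along the feature dimension: (bs, n_hid + n_in)
joined = torch.cat([h_prev, x_t], dim=1)

gate = nn.Linear(n_in + n_hid, n_hid)
out = torch.sigmoid(gate(joined))  # one value in (0, 1) per hidden unit

print(joined.shape, out.shape)
```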

Because the forget gate ends in a sigmoid, its output lies between 0 and 1. This output is multiplied
elementwise by c(t-1) to decide which information moves forward: values close to 0 are discarded and
values close to 1 are kept. This lets the LSTM decide which parts of its long short-term memory are
still important and which can be forgotten.

The second gate is the "__input gate__". It works together with the third gate (called the cell gate)
to update the cell state. Like the forget gate, the input gate has a sigmoid activation and decides
which elements of the cell state to update (values close to 1) and which to leave alone (values close
to 0). The cell gate has a tanh activation and proposes the new values, in the range -1 to 1. The
two outputs are multiplied together and the result is added to the cell state.

The last gate is the "__output gate__". It decides which information from the cell state goes into
the output. The cell state is passed through tanh and then multiplied by the output gate's result,
which gives the new hidden state h(t).
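
The four gates and the two state updates described above can be summarized in equations. The notation is an assumption made for this summary and does not appear in the figure: $W_f, W_i, W_g, W_o$ and $b_f, b_i, b_g, b_o$ are the weights and biases of the forget, input, cell and output gates, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the input, and $\odot$ is elementwise multiplication.

$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) \\
g_t &= \tanh(W_g\,[h_{t-1}, x_t] + b_g) \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$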

Let's see how these four gates and their operations are implemented in code.

### Building an LSTM from Scratch

As discussed in the previous section, an LSTM has two hidden states and four gates: the "forget
gate", "input gate", "cell gate" and "output gate". All of them are linear layers. Let's look at a
from-scratch implementation of an LSTM cell. `__init__` creates the four linear layers, and the
`forward` function applies the operations and activations explained earlier. "__ni__" and "__nh__"
are the sizes of the input and of the hidden state respectively.

In [30]:
class LSTMCell(Module):
    def __init__(self, ni, nh):
        self.forget_gate = nn.Linear(ni + nh, nh)#First gate-the forget gate
        self.input_gate  = nn.Linear(ni + nh, nh)#Second gate-the input gate
        self.cell_gate   = nn.Linear(ni + nh, nh)#third gate-the cell gate
        self.output_gate = nn.Linear(ni + nh, nh)#fourth gate-the output gate

    def forward(self, input, state):
        h,c = state#the current hidden state and cell state
        #concatenate hidden state and input along the feature dim(torch.stack would add a new dim
        #and requires equal shapes,so torch.cat is needed here)
        h = torch.cat([h, input], dim=1)
        forget = torch.sigmoid(self.forget_gate(h))#sigmoid activation through forget_gate
        c = c * forget#multiplied with cell state
        inp = torch.sigmoid(self.input_gate(h))#sigmoid activation through input gate
        cell = torch.tanh(self.cell_gate(h))#tanh activation through cell gate
        c = c + inp * cell#cell gate and input gate outputs multiplied and added to cell state
        outgate = torch.sigmoid(self.output_gate(h))#sigmoid activation through output gate
        h = outgate * torch.tanh(c)#cell state passes through tanh and is scaled by the output gate
        return h, (h,c)#returns the output(new hidden state) and the state pair(hidden,cell) for the
        #next time step

The code above can be refactored. It performs four separate matrix multiplications, which can be
merged into bigger ones that compute all four gates at once; this is faster on a GPU, which does its
work in parallel. Concatenating the input and hidden state at every step also takes time, so instead
of four layers on the concatenated tensor we use two layers, one for the input and one for the hidden
state. Let's see how the code looks after refactoring.

In [31]:
class LSTMCell(Module):
    def __init__(self, ni, nh):
        self.ih = nn.Linear(ni,4*nh)#Layer for input state
        self.hh = nn.Linear(nh,4*nh)#layer for hidden state

    def forward(self, input, state):
        h,c = state
        # One big multiplication for all the gates is better than 4 smaller ones
        gates = (self.ih(input) + self.hh(h)).chunk(4, 1)
        ingate,forgetgate,outgate = map(torch.sigmoid, gates[:3])#sigmoid activation at three gates
        cellgate = gates[3].tanh()#tanh activation at the cell gate(the fourth chunk)
        c = (forgetgate*c) + (ingate*cellgate)#updating cell state
        h = outgate * c.tanh() #passing the cell state through tanh and then multiplying by outgate for
        #the final hidden state
        return h, (h,c) #returns the output(new hidden state) and the state pair(hidden,cell) for the
        #next time step

After refactoring, the four small matrix multiplications are replaced by bigger ones that compute all
four gates at once. The sigmoid activation is applied to three of the gates in one go. Instead of
four layers for the four gates, there are two layers: one for the input and one for the hidden state.
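
Why can a layer on the concatenated tensor be replaced by two layers whose outputs are added? The sketch below checks this for a single gate: the weight matrix of the concatenated layer is split into the columns that multiply `h` and the columns that multiply `x`. The sizes and the names `gate`, `hh` and `ih` are chosen only for this illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
bs, ni, nh = 4, 3, 5                      # illustrative sizes
x, h = torch.randn(bs, ni), torch.randn(bs, nh)

# Version 1: one linear layer on the concatenated [h, x]
gate = nn.Linear(ni + nh, nh)
out_cat = gate(torch.cat([h, x], dim=1))

# Version 2: split the same weights into a hidden part and an input part
hh = nn.Linear(nh, nh)
ih = nn.Linear(ni, nh, bias=False)        # the bias is kept once, in hh
with torch.no_grad():
    hh.weight.copy_(gate.weight[:, :nh])  # columns that multiplied h
    ih.weight.copy_(gate.weight[:, nh:])  # columns that multiplied x
    hh.bias.copy_(gate.bias)
out_sum = ih(x) + hh(h)

print(torch.allclose(out_cat, out_sum, atol=1e-5))
```

The refactored cell does the same for all four gates at once by making each layer `4*nh` wide. It keeps a bias in both layers, which adds a few parameters but does not change what the cell can represent.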

PyTorch's `chunk` method is used to split a tensor into parts. We pass the number of parts we want
(and optionally the dimension to split along); above, `chunk(4, 1)` splits the gates into 4 pieces
along dimension 1.

In [33]:
t = torch.arange(0,10); t

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [34]:
#Pytorch's chunk divides tensor into 2 parts...
t.chunk(2)

(tensor([0, 1, 2, 3, 4]), tensor([5, 6, 7, 8, 9]))
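
Here is a slightly larger sketch of `chunk`, closer to how it is used in the refactored cell. The tensor `gates` is a made-up stand-in for the gate activations of a batch of 2 with a hidden size of 3.

```python
import torch

# A fake batch of gate pre-activations: 2 rows, 4 gates of size 3 each
gates = torch.arange(24).reshape(2, 12)

# Split along dim 1 (the feature dimension) into 4 pieces of width 3
parts = gates.chunk(4, 1)
print([p.shape for p in parts])

# When the size is not divisible, the last chunk is smaller
uneven = torch.arange(10).chunk(3)
print([len(u) for u in uneven])
```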

This architecture can be used to train a Language Model now...

### Training a Language Model Using LSTMs

Previously we trained a two-layer RNN, but the extra layer did not improve accuracy. Here we train
the same model with a two-layer LSTM instead. PyTorch's `nn.LSTM` class gives us the LSTM
architecture without having to write the code ourselves. We train the model at a higher learning rate
and for a small number of epochs.

In [35]:
class LMModel6(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)#input layer
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)#the LSTM architecture
        self.h_o = nn.Linear(n_hidden, vocab_sz)#output layer
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]#initializing hidden and cell state
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)#passing the input and hidden state through LSTM
        self.h = [h_.detach() for h_ in h]#erasing the gradient history
        return self.h_o(res)#returning the LSTM activations after passing through the output layer
    
    def reset(self): 
        for h in self.h: h.zero_()#model resetter
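
To make the shapes inside `LMModel6` concrete, the sketch below calls `nn.LSTM` on its own. The batch size and sequence length are illustrative assumptions, and a random tensor stands in for the embedded tokens.

```python
import torch
from torch import nn

bs, sl, n_hidden, n_layers = 4, 16, 64, 2    # illustrative sizes

rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
x = torch.randn(bs, sl, n_hidden)            # stands in for embedded tokens
state = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]  # (h, c)

res, (h, c) = rnn(x, tuple(state))

print(res.shape)   # activations of the last layer at every time step
print(h.shape)     # final hidden state of each layer
print(c.shape)     # final cell state of each layer
```

The first result is what the output layer receives, and the pair of states is what the model detaches and stores for the next batch.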

In [36]:
#Creating a learner and passing the above model architecture through it.
learn = Learner(dls, LMModel6(len(vocab), 64, 2), #Pass vocab size,hidden size and no of layers to the 
                #Model architecture...
                loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.026114,2.772102,0.153076,00:02
1,2.216185,2.089065,0.269124,00:02
2,1.613936,1.826282,0.47876,00:02
3,1.316553,2.065686,0.50529,00:02
4,1.092289,2.028179,0.590088,00:01
5,0.857714,1.954994,0.656087,00:01
6,0.604967,1.934748,0.682292,00:01
7,0.39631,1.70801,0.69987,00:01
8,0.244986,1.690051,0.734294,00:01
9,0.150833,1.777465,0.741048,00:01


This accuracy is better than what we got with the multilayer RNN. However, the training loss keeps
decreasing while the validation loss stays much higher, which shows that the model is overfitting.
We can apply regularization here, just as in classical machine learning.

## Regularizing an LSTM

We have already discussed why RNNs are hard to train (vanishing activations and gradients), and in
the previous section we saw how LSTMs help with this. But LSTM models tend to overfit, which shows
up as a high validation loss. Earlier we used data augmentation to avoid overfitting on image data,
where it is easy to create different variants of an image. It is used far less for text: augmenting
text typically needs another model, for example one that translates the text into another language,
which is cumbersome. Data augmentation is not yet well explored for text data.

Instead, other regularization techniques are used to reduce overfitting in LSTM models. Let's look at
dropout, activation regularization and temporal activation regularization. An LSTM that combines
these techniques is known as an "__AWD-LSTM__". Let's discuss them one by one.

### Dropout

Dropout was introduced by Geoffrey Hinton, often called the godfather of deep learning, and his
colleagues. The method randomly sets some activations to zero at training time. This prevents the
network from relying on any single neuron, so all neurons have to contribute actively.

<img src="https://miro.medium.com/proxy/1*iWQzxhVlvadk6VAJjsgXgg.png" alt="A figure from the article showing how neurons go off with dropout" width="800" id="img_dropout" caption="Applying dropout in a neural network (courtesy of Nitish Srivastava et al.)">

The figure above shows how some neurons are switched off. Dropout makes the activations noisy, which
helps the model generalize better. But we cannot simply zero out activations: if, say, three out of
five activations were dropped, the next layer would receive a much smaller sum than it sees without
dropout. So each activation is dropped with probability p, and the ones that remain are rescaled by
dividing by 1-p, which keeps the expected scale of the activations the same. A full implementation
of dropout in PyTorch looks like this:

In [37]:
class Dropout(Module):
    def __init__(self, p): self.p = p #probability for inactivation
    def forward(self, x):
        if not self.training: return x
        mask = x.new(*x.shape).bernoulli_(1-self.p)#random mask:1 with probability 1-p,0 with probability p
        return x * mask.div_(1-self.p)#zero the dropped activations and rescale the rest by 1-p

The `bernoulli_` method fills a tensor with random zeros (with probability p) and ones (with
probability 1-p). This mask is divided by 1-p and multiplied by the input. In our model, dropout is
applied to the output of the LSTM before it is passed to the final output layer. Dropout is also
used in other models, including CNNs, and it appears as the "ps" parameter in fastai's tabular
module.

Dropout behaves differently during training and validation, which is why the `Dropout` class above
checks the `training` attribute first: outside training it returns the input unchanged.
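
Both behaviours can be observed with PyTorch's built-in `nn.Dropout`. This is a small sketch on a tensor of ones; the probability 0.5 and the tensor size are arbitrary choices for the illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
p = 0.5
drop = nn.Dropout(p)
x = torch.ones(1000)

drop.train()                 # training mode: zero and rescale
y_train = drop(x)
kept = y_train[y_train != 0]

drop.eval()                  # evaluation mode: identity
y_eval = drop(x)

print(kept.unique())         # surviving values are 1 / (1 - p)
print(y_train.mean())        # close to the original mean of 1
print(torch.equal(y_eval, x))
```

Because the surviving activations are divided by 1-p, the mean of the output stays close to the mean of the input even though about half of the values are zero.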

### Activation Regularization and Temporal Activation Regularization

Let's learn about activation regularization (AR) and temporal activation regularization (TAR). As the
names suggest, both work on the activations. Previously we discussed weight decay, where a penalty
is added to the loss to keep the weights small. In activation regularization it is the activations,
rather than the weights, that we try to keep small. The mean of the squares of the activations is
multiplied by a coefficient alpha and added to the loss.

``` python
loss += alpha * activations.pow(2).mean()
```

Temporal activation regularization (TAR) is tied to the fact that a language model predicts the
tokens of a sentence in order, so consecutive activations should not change too abruptly. TAR
encourages this by adding a penalty to the loss that keeps the difference between two consecutive
activations small. The penalty is the mean of the squares of the differences between consecutive
activations, multiplied by a coefficient beta. So the loss can be written as:
    

``` python
loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()
```
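
Both penalties can be computed by hand on a tiny example. In this sketch the activations, the coefficients `alpha` and `beta`, and the starting loss of 0.5 are all made-up values for illustration.

```python
import torch

# Toy activations with shape (batch, sequence length, hidden size)
activations = torch.tensor([[[1., 2.], [3., 4.], [5., 6.]]])
alpha, beta = 2.0, 1.0                  # illustrative coefficients

# AR: penalize large activations
ar = alpha * activations.pow(2).mean()

# TAR: penalize large changes between consecutive time steps
diffs = activations[:, 1:] - activations[:, :-1]
tar = beta * diffs.pow(2).mean()

loss = torch.tensor(0.5)                # stand-in for the cross-entropy loss
loss = loss + ar + tar
print(ar.item(), tar.item(), loss.item())
```

Here every step changes each activation by 2, so the TAR term is `beta * 4`, while the AR term grows with the size of the activations themselves.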

Alpha and beta are hyperparameters that need to be tuned, like the coefficients in L1 and L2
regularization. To apply this regularization, the model with dropout has to return three things:
the normal output, the activations before dropout and the activations after dropout. AR is applied
to the activations after dropout (so that the activations that were zeroed are not penalized), while
TAR is applied to the activations before dropout (because the zeros would create large differences
between consecutive time steps). We do not have to write much code for this: when we create a
Learner, we pass a callback called "RNNRegularizer" with the alpha and beta values, and it applies
the regularization.

### Training a Weight-Tied Regularized LSTM

We combine dropout with AR and TAR to regularize the model. As discussed in the previous section,
the model needs to return three things: the normal output, the LSTM activations before dropout and
the activations after dropout. "__RNNRegularizer__" then computes AR and TAR and adds the penalties
to the loss.

One more trick we can apply is weight tying. In a language model the input embedding maps words
to activations, and the output layer maps activations back to words. These two mappings could be the
same, and we can enforce this by using the same weight matrix for the embedding and the output
layer. PyTorch lets us do that in a single line of code:

`self.h_o.weight = self.i_h.weight`
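
Weight tying works because the two weight matrices have the same shape. The sketch below checks this outside of any model; `vocab_sz` and `n_hidden` are illustrative sizes.

```python
import torch
from torch import nn

vocab_sz, n_hidden = 30, 64                 # illustrative sizes
i_h = nn.Embedding(vocab_sz, n_hidden)      # weight shape: (vocab_sz, n_hidden)
h_o = nn.Linear(n_hidden, vocab_sz)         # weight shape: (vocab_sz, n_hidden)

# The shapes match, so both layers can share one parameter
h_o.weight = i_h.weight

print(h_o.weight is i_h.weight)

# An in-place update through one layer is visible through the other
with torch.no_grad():
    i_h.weight.zero_()
print(h_o.weight.abs().sum().item())
```

After the assignment there is only one weight tensor, so every optimizer step on the embedding is also a step on the output layer.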

Let's write the final architecture for a regularized model with dropout, AR, TAR and weight tying.
Dropout is applied with PyTorch's `nn.Dropout(p)` class, which takes the drop probability. AR and
TAR are applied by passing the RNNRegularizer callback to the Learner. Weight tying is applied by
making the embedding and the output layer share one weight matrix.

In [38]:
class LMModel7(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers, p):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)#input Embedding layer
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)#LSTM cell
        self.drop = nn.Dropout(p)#Dropout applied
        self.h_o = nn.Linear(n_hidden, vocab_sz)#output layer
        self.h_o.weight = self.i_h.weight#Weight matrix made equal,weight tying applied
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]#Hidden state initialized to 
        #zero
        
    def forward(self, x):
        raw,h = self.rnn(self.i_h(x), self.h)#passing the input and the hidden state through the LSTM
        #cell
        out = self.drop(raw)#Dropout applied to the LSTM activation
        self.h = [h_.detach() for h_ in h]#Erasing gradient history
        return self.h_o(out),raw,out#returning the output,the raw LSTM activations and the dropped out
        #activations(for Regularization)
    
    def reset(self): 
        for h in self.h: h.zero_()#Model resetter

Finally a learner is created with the LMModel7 architecture. The vocab size, hidden size, number of
layers and dropout probability are passed to the model; to the Learner we also pass the loss
function, the metric and the two callbacks for resetting the model and for regularization (with
its alpha and beta values).

In [39]:
#creating the learner
learn = Learner(dls, LMModel7(len(vocab), 64, 2, 0.5),
                #model architecture(vocab size,hidden size,number of layers,dropout probability)
                loss_func=CrossEntropyLossFlat(), metrics=accuracy,
                cbs=[ModelResetter, RNNRegularizer(alpha=2, beta=1)])

With fastai we can instead create a TextLearner, which adds the callbacks for model resetting and
regularization by default.

In [40]:
#Creating the text learner for the model
learn = TextLearner(dls, LMModel7(len(vocab), 64, 2, 0.4),
                    loss_func=CrossEntropyLossFlat(), metrics=accuracy)

Next we train the model, adding further regularization through weight decay (`wd`).

In [41]:
#training the model
learn.fit_one_cycle(15, 1e-2, wd=0.1)

epoch,train_loss,valid_loss,accuracy,time
0,2.486498,1.885105,0.515055,00:01
1,1.596433,1.247586,0.631266,00:01
2,0.91128,0.764302,0.760579,00:01
3,0.532967,0.671182,0.797607,00:01
4,0.356029,0.573203,0.819824,00:01
5,0.250572,0.594855,0.822673,00:01
6,0.196667,0.52373,0.837891,00:01
7,0.167411,0.555765,0.824626,00:01
8,0.14735,0.439533,0.872396,00:01
9,0.13311,0.430603,0.876709,00:01


This model overfits much less: the validation loss is now far closer to the training loss. The
accuracy has also improved to about 87%, compared with about 74% for the previous model.

## Conclusion

So we learnt how to build RNN and LSTM architectures from scratch and train them. We also learnt how
to regularize our models with "Dropout", "Activation Regularization" and "Temporal Activation
Regularization". In chapter 1 we used the AWD-LSTM architecture for text classification. By default
it applies dropout to the embedding layer, the input, the weights of the LSTM layers and the hidden
state. Applying dropout in so many places makes the model strongly regularized.
Another architecture that is useful for sequence models is the Transformer. Transformers are mainly
used for sequence-to-sequence problems such as language translation. We will learn more about them
in later chapters.

## Questionnaire

1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?
1. Why do we concatenate the documents in our dataset before creating a language model?
1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make to our model?
1. How can we share a weight matrix across multiple layers in PyTorch?
1. Write a module that predicts the third word given the previous two words of a sentence, without peeking.
1. What is a recurrent neural network?
1. What is "hidden state"?
1. What is the equivalent of hidden state in `LMModel1`?
1. To maintain the state in an RNN, why is it important to pass the text to the model in order?
1. What is an "unrolled" representation of an RNN?
1. Why can maintaining the hidden state in an RNN lead to memory and performance problems? How do we fix this problem?
1. What is "BPTT"?
1. Write code to print out the first few batches of the validation set, including converting the token IDs back into English strings, as we showed for batches of IMDb data in <<chapter_nlp>>.
1. What does the `ModelResetter` callback do? Why do we need it?
1. What are the downsides of predicting just one output word for each three input words?
1. Why do we need a custom loss function for `LMModel4`?
1. Why is the training of `LMModel4` unstable?
1. In the unrolled representation, we can see that a recurrent neural network actually has many layers. So why do we need to stack RNNs to get better results?
1. Draw a representation of a stacked (multilayer) RNN.
1. Why should we get better results in an RNN if we call `detach` less often? Why might this not happen in practice with a simple RNN?
1. Why can a deep network result in very large or very small activations? Why does this matter?
1. In a computer's floating-point representation of numbers, which numbers are the most precise?
1. Why do vanishing gradients prevent training?
1. Why does it help to have two hidden states in the LSTM architecture? What is the purpose of each one?
1. What are these two states called in an LSTM?
1. What is tanh, and how is it related to sigmoid?
1. What is the purpose of this code in `LSTMCell`: `h = torch.stack([h, input], dim=1)`
1. What does `chunk` do in PyTorch?
1. Study the refactored version of `LSTMCell` carefully to ensure you understand how and why it does the same thing as the non-refactored version.
1. Why can we use a higher learning rate for `LMModel6`?
1. What are the three regularization techniques used in an AWD-LSTM model?
1. What is "dropout"?
1. Why do we scale the weights with dropout? Is this applied during training, inference, or both?
1. What is the purpose of this line from `Dropout`: `if not self.training: return x`
1. Experiment with `bernoulli_` to understand how it works.
1. How do you set your model in training mode in PyTorch? In evaluation mode?
1. Write the equation for activation regularization (in math or code, as you prefer). How is it different from weight decay?
1. Write the equation for temporal activation regularization (in math or code, as you prefer). Why wouldn't we use this for computer vision problems?
1. What is "weight tying" in a language model?

Ans-1 It is always good to start with a small, simple dataset that lets us try ideas quickly, and to
move to the full dataset once the approach works. A higher learning rate can sometimes shorten
training as well.

Ans-2 A language model predicts the next token from the tokens before it. Concatenating the
documents gives one long stream of tokens, so that when the model is trained it always has the
preceding text available and learns from what came earlier.

Ans-3 To predict the next word from the previous three words, we use a standard three-layer model
with two tweaks. First, all three layers use the same weight matrix. Second, each layer takes a
different word as input: the first layer uses the first word's embedding, the second layer uses the
second word's embedding together with the first layer's output activations, and so on for the third
layer.

Ans-4 We create the layer once in `__init__` and call that same layer several times in `forward`.
Every call then uses the same weight matrix.

Ans-5 The module is the same as the one for the three-word model, except that the dataset and the
sequences are built differently: the independent variable has 2 words instead of 3, so the forward
pass has two steps instead of three.

Ans-6 A recurrent neural network applies the same layer, with the same weights, at every time step
inside a loop, which is why it is called recurrent. It keeps a hidden state (h) that is updated at
every time step, and it can handle inputs and outputs of variable length. That is why RNNs are used
for processing sequences such as text.

Ans-7 A recurrent neural network passes information from one step to the next. That information is
stored in the hidden state: the activations that summarize the inputs seen at previous steps and
are fed into the next step. The hidden state is updated at every step.

Ans-8 In LMModel1 the hidden state is the variable `h`, the activations that are updated after each
layer (`h_h` is the hidden-to-hidden layer that updates them).

Ans-9 If the hidden state is reset to zero for every new input, the model throws away what it has
learnt about the previous parts of the text. So the hidden state is initialized once, in
`__init__`, and kept between batches. This only makes sense if the text is passed to the model in
order, so that the state carried over really describes the text that comes just before the current
batch. Keeping the state this way makes the model effectively very deep, with around 10000 layers
if there are 10000 tokens in the text.

Ans-10 The unrolled representation of an RNN is the network written out without the for loop: every
time step is drawn as its own layer, each taking one word as input. Refactoring those repeated
layers into a for loop gives the usual RNN. The loop version is equivalent to the unrolled one; we
keep the hidden state between calls and use the detach method in `forward` to keep training fast.

Ans-11 We maintain the hidden state by initializing it once in `__init__`. But this makes the model
effectively very deep, like a 10000-layer model, so when the gradient is calculated for the last
token, it has to be propagated all the way back to the first one. This is very slow and consumes a
lot of memory. So instead of backpropagating through the whole network, we keep only the last three
layers of gradient history and erase the rest using the detach method.

Ans-12 Once the hidden state is maintained, the network is effectively very deep, with one layer per
time step. Treating it as one big model and calculating its gradients in the usual way is called
backpropagation through time (BPTT). Doing this through 10000 layers would be very slow and memory
consuming, so gradients are calculated only through the last few tokens (three in our model) instead
of the whole stream. Cutting the gradient history after a few time steps like this is known as
truncated BPTT.
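
The effect of `detach` on the gradient history can be seen on a toy example. In this sketch a single number `w` plays the role of the shared weight and `h` the role of the hidden state; both are made up for illustration.

```python
import torch

w = torch.tensor(0.5, requires_grad=True)   # a single shared "weight"
h = torch.tensor(1.0)                       # initial hidden state

for step in range(3):
    h = h * w            # the graph now remembers every multiplication

attached = h.requires_grad   # True: history reaches back to step 0
h = h.detach()               # same value, but the history is dropped
detached = h.requires_grad   # False: backprop stops here

h = h * w                    # new history starts from this step only
h.backward()
print(attached, detached, w.grad)
```

After the detach, the gradient only sees the last multiplication, so `w.grad` is the detached value of `h` (0.125) rather than the gradient through all four steps (0.5).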

Ans-13 dls.valid.show_batch(),toks.decode(num_toks)

Ans-14 The ModelResetter callback calls the `reset` method of the model at the beginning of every
epoch and before every validation phase. This gives the model a clean hidden state before it starts
reading a continuous chunk of text again.

Ans-15 Predicting only one output word for every three input words gives the model very little
signal: the weights are updated based on a single prediction even though three words were
processed. If we predict the next word after every single word instead, much more signal flows back
for each update, which is more efficient.

Ans-16 In LMModel4 the shapes of the inputs and outputs change. To create more signal, the model
predicts the next word after every input word, so the dataset is changed so that each input
sequence has a target sequence of the same length, offset by one token. The sequence length is
controlled by the "__sl__" attribute. The model now returns outputs of shape batch size * sequence
length * vocab size (bs*sl*vocab size), while the targets have shape bs*sl, so both have to be
flattened before the loss function is applied. That is why a custom loss function is defined,
which reshapes the predictions and targets accordingly.

Ans-17 LMModel4 is effectively a very deep network, because the recurrent layer is applied once for
every token in the sequence. The repeated additions and multiplications can lead to vanishing or
exploding gradients. Since computers store numbers as floating point numbers, precision drops as
the numbers move away from zero, and the gradients can end up as zero or infinity, so the weights
are not updated properly.

Ans-18 The many layers of the unrolled representation all share the same weight matrix, so they
cannot each learn something different. Stacking RNNs adds layers with their own weight matrices,
which gives the model more capacity to learn.

Ans-20 `detach` is used to clear the gradient history. Calling it less often means gradients are
propagated through more time steps, so the model is effectively deeper and has a longer time horizon
to learn from, at the cost of slower training. With a simple RNN this may not help in practice,
because deeper networks are more prone to vanishing or exploding gradients.

Ans-21 A deep network applies matrix multiplication again and again, once per layer. Like
repeatedly multiplying by a number, this makes the values either shrink towards zero or grow
towards infinity, so the activations vanish or explode. The gradients used to update the weights
then also end up as zero or infinity, and the weights do not improve during training.

Ans-22 Numbers close to zero are the most precise, since floating point numbers become less accurate
as they move away from zero.

Ans-23 Gradients are used to update the weights during training. With vanishing gradients the
weights barely change, and if the weights are not updated the model does not improve with training:
the accuracy stays the same.

Ans-24 An LSTM has two hidden states instead of the single one in an RNN. One of them, called the
cell state, is responsible for keeping the long short-term memory, which is why it passes through
very few operations. The other, the hidden state, is responsible for predicting the next token.

Ans-25 Cell state and hidden state

Ans-26 tanh is an activation function like sigmoid. The only difference is that it squashes values
into the range -1 to 1 instead of 0 to 1. It is just a rescaled sigmoid:
tanh(x)=2*σ(2x)-1.

Ans-27 The picture of the LSTM architecture shows that the input and the hidden state from the
previous time step are joined together before they reach the gates. This line does that joining:
it combines the hidden state and the input into one tensor along dimension 1, and each gate is
then applied to the combined tensor. Since the join has to be a concatenation along the feature
dimension, runnable code needs `torch.cat([h, input], dim=1)` here; `torch.stack` would add a new
dimension and requires both tensors to have the same shape.

Ans-28 PyTorch's `chunk` splits a tensor into the number of parts passed as its argument. The
parts are equal in size when the tensor divides evenly; otherwise the last part is smaller.

Ans-29 The refactored version of the LSTM cell computes the same thing but is better suited to the
GPU. Instead of four small matrix multiplications, bigger ones compute all four gates at once,
which is faster because the GPU does many operations in parallel. Concatenating the input and the
hidden state at every step also takes time, so two separate layers are used instead, one for the
input and one for the hidden state, and their results are added. The split into gates and the
remaining operations are performed in the forward function.

Ans-30 LMModel6 uses a two-layer LSTM between the input embedding and the output layer. Because
LSTMs are designed to avoid the exploding and vanishing activations that made the plain RNN
unstable, this model can be trained with a higher learning rate and still reach a better accuracy
in a small number of epochs. It does overfit, but that can be addressed with regularization.

Ans-31 Dropout, activation regularization (AR) and temporal activation regularization (TAR) are the
three regularization techniques used in AWD-LSTM.

Ans-32 Dropout is a regularization technique for neural networks. It randomly sets some activations
to zero at training time. This makes the activations noisier, forces the neurons to cooperate
better with each other and helps the model generalize.

Ans-33 If activations were simply set to zero and nothing else was done, training would run into
problems: for example, if 3 out of 5 activations are dropped, the sum reaching the next layer has
a very different scale from the one it has when nothing is dropped. So each activation is dropped
with probability p and the remaining ones are rescaled by dividing by 1-p. The rescaling is
applied during training only; at inference dropout does nothing.

Ans-34 Dropout behaves differently during training and validation. During training each activation
is zeroed with probability p and the rest are divided by 1-p; during validation all activations are
kept unchanged. `if not self.training: return x` checks whether we are in the training phase or
the validation phase, and in the validation phase it returns the input untouched.

Ans-35 In dropout, each activation is zeroed with probability p and kept with probability 1-p.
Called with 1-p, the `bernoulli_` method fills a tensor with random zeros (probability p) and ones
(probability 1-p); this mask is multiplied by the input and divided by 1-p. Whether we are in the
training or validation phase is checked with the `training` attribute of PyTorch's `nn.Module`.

Ans-36 Calling `model.train()` sets the `training` attribute to True, and calling `model.eval()`
sets it to False. fastai's Learner switches between the two modes automatically during training
and validation.
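
The two modes can be checked directly. The small `nn.Sequential` model below is only a stand-in for illustration.

```python
from torch import nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.5))

model.train()                       # training mode
in_train = model.training and model[1].training

model.eval()                        # evaluation mode
in_eval = (not model.training) and (not model[1].training)

print(in_train, in_eval)
```

The mode is passed down to every submodule, which is how the dropout layer inside the model knows whether to drop activations.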

Ans-37 `loss += alpha * activations.pow(2).mean()`. In weight decay we keep the values of the
weights small, while in activation regularization we keep the values of the activations small.

Ans-38 `loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()`. Temporal activation
regularization tries to minimize the difference between two consecutive activations, which makes
sense when the activations form a sequence over time. In computer vision problems the activations do
not form such a sequence: neighbouring activations can describe very different features of the data,
so penalizing their differences could throw away important information.

Ans-39 Weight tying in a language model means using the same weight matrix for the input embedding
layer and the output layer, since the embedding maps words to activations and the output layer maps
activations back to words.



### Further Research

1. In `LMModel2`, why can `forward` start with `h=0`? Why don't we need to say `h=torch.zeros(...)`?
1. Write the code for an LSTM from scratch (you may refer to <<lstm>>).
1. Search the internet for the GRU architecture and implement it from scratch, and try training a model. See if you can get results similar to those we saw in this chapter. Compare your results to the results of PyTorch's built in `GRU` module.
1. Take a look at the source code for AWD-LSTM in fastai, and try to map each of the lines of code to the concepts shown in this chapter.