## Long Short-Term Memory Networks (LSTMs)

### Lesson outline
In this lesson, we will learn about LSTMs (long short-term memory networks). We will discuss the following topics:

- RNN vs LSTM
- Basics of LSTM
- Architecture of LSTM
- The learn gate
- The forget gate
- The remember gate
- The use gate
- Character-level RNN
- Sequence batching
- Other architectures


# RNN

In the images below, the output of the network at one step is fed back in as part of the input to the next step. Because the previous output is combined with the current input, the network carries some memory of what it has already seen, which helps it perform well on sequences. Now imagine that it first predicts a bear, then predicts trees for several steps, and is then shown another animal that looks like a bear. By this point the RNN performs poorly: the bear was seen many steps ago, no similar examples have appeared since, and the RNN has effectively lost that memory because of vanishing gradients.

![image.png](attachment:93eac7f8-706f-4e80-8d17-2f3b4bc596ad.png) ![image.png](attachment:5c9a1383-4c07-446c-85ac-ad43f71cc724.png)


![image.png](attachment:a09cc550-df82-44a4-891d-5c34a3ba5527.png) ![image.png](attachment:a0eeed09-1680-4095-9dba-6c86c403e99a.png)
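To make the vanishing-gradient idea concrete, here is a toy sketch in plain Python. It assumes that each step of backpropagation through time scales the gradient by one constant factor (`per_step_factor`, an illustrative value, not something measured from a real RNN), and the helper name `gradient_after` is made up for this example.

```python
# Toy illustration only: real RNN gradients are products of Jacobians,
# but a single scalar factor per step is enough to show the effect.
def gradient_after(steps, per_step_factor=0.5, initial=1.0):
    """Return the gradient magnitude left after `steps` time steps."""
    grad = initial
    for _ in range(steps):
        grad *= per_step_factor
    return grad


# the further back in time an input was seen, the less gradient reaches it
for steps in (1, 5, 20, 50):
    print(steps, gradient_after(steps))
```

With a factor below 1, the signal from an input seen 20 or 50 steps ago is tiny compared with a recent one, which is why the "bear" seen long ago is hard for a plain RNN to remember.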


### Basics of LSTM

![image.png](attachment:628d9dbd-5137-4e14-9f0e-9e6fee2d6a09.png) ![image.png](attachment:ff1343f1-a57d-468c-bc10-f2eaf819aa37.png) ![image.png](attachment:ae141de7-83dc-4fab-9910-25dd78c3aa98.png)

### Architecture of LSTM

![image.png](attachment:11fa0286-f85f-4cd6-9e54-d8a5e517a7d4.png)

![image.png](attachment:cb16c2b7-2f7c-479a-bd69-dc5251de8914.png)![image.png](attachment:1c12de88-9d46-4643-b355-5deef3d48d13.png) 

# LSTM Structure and Hidden State

We know that RNNs are used to maintain a kind of memory by linking the output of one node to the input of the next. In the case of an LSTM, for each piece of data in a sequence (say, for a word in a given sentence), there is a corresponding *hidden state* $h_t$. This hidden state is a function of the pieces of data that the LSTM has seen over time; it is computed using the network's learned weights and represents both the short-term and long-term memory components for the data that the LSTM has already seen.

So, for an LSTM that is looking at words in a sentence, **the hidden state of the LSTM will change based on each new word it sees. And, we can use the hidden state to predict the next word in a sequence** or help identify the type of word in a language model, and lots of other things!

### Exercise Repository

Note that most exercise notebooks can be run locally on your computer, by following the directions in the [Github Exercise Repository](https://github.com/udacity/CVND_Exercises).


## LSTMs in PyTorch

To create and train an LSTM, you have to know how to structure the inputs and hidden state of an LSTM. In PyTorch, an LSTM can be defined as: `lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim, num_layers=n_layers)`.

In PyTorch, an LSTM expects all of its inputs to be 3D tensors. The arguments used to define the LSTM are as follows:
>* `input_dim` = the number of features in each input (a dimension of 20 means each element of the sequence is a vector of 20 values)
>* `hidden_dim` = the size of the hidden state; this will be the number of outputs that each LSTM cell produces at each time step.
>* `n_layers` = the number of hidden LSTM layers to use; this is typically a value between 1 and 3; a value of 1 means that each LSTM cell has one hidden state. This has a default value of 1.

<img src='images/lstm_simple_ex.png' height=5 >
    
### Hidden State

Once an LSTM has been defined with input and hidden dimensions, we can call it and retrieve the output and hidden state at every time step.
 `out, hidden = lstm(input.view(1, 1, -1), (h0, c0))` 

The inputs to an LSTM are **`(input, (h0, c0))`**.
>* `input` = a Tensor containing the values in an input sequence; this has shape: (seq_len, batch, input_size)
>* `h0` = a Tensor containing the initial hidden state for each element in a batch
>* `c0` = a Tensor containing the initial cell memory for each element in the batch

`h0` and `c0` will default to 0 if they are not specified. Their dimensions are: (n_layers, batch, hidden_dim).
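As a quick check of these shapes, the sketch below builds a small LSTM and compares a call without `(h0, c0)` against a call with explicit all-zero states. The sizes used here (`seq_len = 5`, `input_dim = 4`, `hidden_dim = 3`) are assumed values chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# illustrative sizes (assumptions for this sketch)
seq_len, batch, input_dim, hidden_dim, n_layers = 5, 1, 4, 3, 1
lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim, num_layers=n_layers)

# input has shape (seq_len, batch, input_size)
x = torch.randn(seq_len, batch, input_dim)

# h0 and c0 have shape (n_layers, batch, hidden_dim)
h0 = torch.zeros(n_layers, batch, hidden_dim)
c0 = torch.zeros(n_layers, batch, hidden_dim)

out_default, (h_n, c_n) = lstm(x)        # states omitted, so they default to zeros
out_explicit, _ = lstm(x, (h0, c0))      # the same zeros passed explicitly

print(out_default.shape)                 # (seq_len, batch, hidden_dim)
print(h_n.shape, c_n.shape)              # (n_layers, batch, hidden_dim)
print(torch.allclose(out_default, out_explicit))
```

Both calls should give the same output, since leaving out the states is equivalent to starting from zeros.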

These will become clearer in the example in this notebook. This and the following notebook are modified versions of [this PyTorch LSTM tutorial](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#lstm-s-in-pytorch).

Let's take a simple example and say we want to process a single sentence through an LSTM. If we want to run the sequence model over one sentence "Giraffes in a field", our input should look like this `1x4` row vector of individual words:

\begin{align}\begin{bmatrix}
   \text{Giraffes} & \text{in} & \text{a} & \text{field}
   \end{bmatrix}\end{align}

In this case, we know that we have **4 input words**, and we decide how many outputs to generate at each time step; say we want each LSTM cell to generate **3 hidden state values**. We'll keep the number of layers in our LSTM at the default size of 1.

The hidden state and cell memory will have dimensions (n_layers, batch, hidden_dim), and in this case that will be (1, 1, 3) for a 1-layer model with one batch/sequence of words to process (this one sentence) and 3 generated hidden state values.


### Example Code

Next, let's see an example of one LSTM that is designed to look at a sequence of 4 values (numerical values since those are easiest to create and track) and generate 3 values as output. This is what the sentence processing network from above will look like, and you are encouraged to change these input/hidden-state sizes to see the effect on the structure of the LSTM!

In [1]:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt 

%matplotlib inline

torch.manual_seed(2) # so that random values will be consistent and repeatable for testing 



<torch._C.Generator at 0x111d550f0>

### Define a simple LSTM


**A note on hidden and output dimensions**

The `hidden_dim` and size of the output will be the same unless you define your own LSTM and change the number of outputs by adding a linear layer at the end of the network, e.g. `fc = nn.Linear(hidden_dim, output_dim)`.
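A minimal sketch of that idea follows. The value `output_dim = 2` is an assumption picked for illustration; the point is only that the linear layer changes the size of the last dimension from `hidden_dim` to `output_dim`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_dim, hidden_dim = 4, 3
output_dim = 2  # assumed value, chosen only for this sketch

lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim)
fc = nn.Linear(hidden_dim, output_dim)

x = torch.randn(5, 1, input_dim)   # (seq_len, batch, input_dim)
lstm_out, _ = lstm(x)              # (seq_len, batch, hidden_dim)
scores = fc(lstm_out)              # (seq_len, batch, output_dim)

print(lstm_out.shape)
print(scores.shape)
```

`nn.Linear` acts on the last dimension only, so the sequence and batch dimensions pass through unchanged.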

In [2]:
from torch.autograd import Variable

# define an LSTM with an input dim of 4 and hidden dim of 3
# this expects to see 4 values as input and generates 3 values as output
input_dim = 4
hidden_dim = 3
lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim)  

# make a sequence of 5 inputs with 4 random values each
inputs_list = [torch.randn(1, input_dim) for _ in range(5)]
print('inputs: \n', inputs_list)
print('\n')

# initialize the hidden state
# (1 layer, 1 batch_size, 3 outputs)
# first tensor is the hidden state, h0
# second tensor initializes the cell memory, c0
h0 = torch.randn(1, 1, hidden_dim)
c0 = torch.randn(1, 1, hidden_dim)

# step through the sequence one element at a time
# (plain tensors work with autograd directly, so no Variable wrapper is needed)
for i in inputs_list:
    # after each step, hidden contains the hidden state
    # note: (h0, c0) is passed in at every step here, so each element starts
    # from the same initial state; pass `hidden` instead to carry memory across steps
    out, hidden = lstm(i.view(1, 1, -1), (h0, c0))
    print('out: \n', out)
    print('hidden: \n', hidden)


inputs: 
 [tensor([[1.4934, 0.4987, 0.2319, 1.1746]]), tensor([[-1.3967,  0.8998,  1.0956, -0.5231]]), tensor([[-0.8462, -0.9946,  0.6311,  0.5327]]), tensor([[-0.8454,  0.9406, -2.1224,  0.0233]]), tensor([[ 0.4836,  1.2895,  0.8957, -0.2465]])]


out: 
 tensor([[[-0.4372,  0.2583,  0.2947]]], grad_fn=<StackBackward0>)
hidden: 
 (tensor([[[-0.4372,  0.2583,  0.2947]]], grad_fn=<StackBackward0>), tensor([[[-0.7344,  0.6209,  0.4191]]], grad_fn=<StackBackward0>))
out: 
 tensor([[[-0.2836,  0.1314,  0.4133]]], grad_fn=<StackBackward0>)
hidden: 
 (tensor([[[-0.2836,  0.1314,  0.4133]]], grad_fn=<StackBackward0>), tensor([[[-0.5041,  0.2672,  0.6370]]], grad_fn=<StackBackward0>))
out: 
 tensor([[[-0.3404,  0.4880,  0.1949]]], grad_fn=<StackBackward0>)
hidden: 
 (tensor([[[-0.3404,  0.4880,  0.1949]]], grad_fn=<StackBackward0>), tensor([[[-0.5552,  0.7909,  0.3300]]], grad_fn=<StackBackward0>))
out: 
 tensor([[[-0.3544,  0.2405,  0.3150]]], grad_fn=<StackBackward0>)
hidden: 
 (tensor([[[-0.

You should see that the output and hidden Tensors are always of length 3, which we specified when we defined the LSTM with `hidden_dim`. 

### All at once

A for loop is not very efficient for large sequences of data, so we can also **process all of these inputs at once.**

1. concatenate all our input sequences into one big tensor, with a defined batch_size
2. define the shape of our hidden state
3. get the outputs and the *most recent* hidden state (created after the last word in the sequence has been seen)


The outputs may look slightly different due to our differently initialized hidden state.

In [3]:
# turn inputs into a tensor with 5 rows of data
# add the extra 2nd dimension (1) for batch_size
inputs = torch.cat(inputs_list).view(len(inputs_list), 1, -1)

# print out our inputs and their shape
# you should see (seq_len, batch size, input_dim)
print('inputs size: \n', inputs.size())
print('\n')

print('inputs: \n', inputs)
print('\n')

# initialize the hidden state
h0 = torch.randn(1, 1, hidden_dim)
c0 = torch.randn(1, 1, hidden_dim)

# get the outputs and hidden state
# (plain tensors work with autograd directly, so no Variable wrapper is needed)
out, hidden = lstm(inputs, (h0, c0))

print('out: \n', out)
print('hidden: \n', hidden)

inputs size: 
 torch.Size([5, 1, 4])


inputs: 
 tensor([[[ 1.4934,  0.4987,  0.2319,  1.1746]],

        [[-1.3967,  0.8998,  1.0956, -0.5231]],

        [[-0.8462, -0.9946,  0.6311,  0.5327]],

        [[-0.8454,  0.9406, -2.1224,  0.0233]],

        [[ 0.4836,  1.2895,  0.8957, -0.2465]]])


out: 
 tensor([[[ 0.1611,  0.2200,  0.2213]],

        [[ 0.0364, -0.0390,  0.2638]],

        [[-0.1425, -0.0174,  0.1504]],

        [[-0.1583,  0.1264,  0.1709]],

        [[-0.2007, -0.1559,  0.2489]]], grad_fn=<StackBackward0>)
hidden: 
 (tensor([[[-0.2007, -0.1559,  0.2489]]], grad_fn=<StackBackward0>), tensor([[[-0.4429, -0.2975,  0.3252]]], grad_fn=<StackBackward0>))
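The two ways of running an LSTM agree when the step-by-step loop carries its hidden state forward. The sketch below is a separate, self-contained check (it uses its own seed and zero initial states, both assumptions for this example) that feeds `hidden` back in at each step and compares the result with a single call over the whole sequence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_dim, hidden_dim = 4, 3
lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim)

inputs = torch.randn(5, 1, input_dim)      # (seq_len, batch, input_dim)
h0 = torch.zeros(1, 1, hidden_dim)
c0 = torch.zeros(1, 1, hidden_dim)

# step through the sequence, feeding the updated state back in each time
hidden = (h0, c0)
step_outputs = []
for x in inputs:                           # x has shape (1, input_dim)
    out, hidden = lstm(x.view(1, 1, -1), hidden)
    step_outputs.append(out)
stepped = torch.cat(step_outputs)          # (seq_len, batch, hidden_dim)

# process the whole sequence in one call
all_at_once, final_hidden = lstm(inputs, (h0, c0))

print(torch.allclose(stepped, all_at_once, atol=1e-5))
print(torch.allclose(hidden[0], final_hidden[0], atol=1e-5))
```

The last row of `out` is also the final hidden state `h_n`, which is why the first tensor in `hidden` matches the last output in the printed results above.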


### Next: Hidden State and Gates

This notebook shows you the structure of the input and output of an LSTM in PyTorch. Next, you'll learn more about how exactly an LSTM represents long-term and short-term memory in its hidden state, and you'll reach the next notebook exercise.

#### Part of Speech

In the notebook that comes later in this lesson, you'll see how to define a model to tag parts of speech (nouns, verbs, determiners), include an LSTM and a Linear layer to define a desired output size, *and* finally train our model to create a distribution of class scores that associates each input word with a part of speech.

## Learn Gate

![image.png](attachment:a3ead700-d103-4d10-877e-3fc51f171826.png)

![image.png](attachment:4ce0d357-618c-466e-ac5d-78bcc2816374.png)

## Forget Gate 

![image.png](attachment:77d0f75f-0d9d-4a39-bdb2-f825f8e9e667.png)

## The Remember Gate
![image.png](attachment:330cf2e4-8e8d-49e9-8b59-99b00738ceb4.png) ![image.png](attachment:ff3214f0-a825-45d1-bc42-73cebd97b4a8.png)

## The Use Gate

![image.png](attachment:cc746e8e-d3b2-45bf-92d2-e157e1f9e899.png) ![image.png](attachment:680c6d46-e3a2-43d9-99bf-bf53befab42d.png)

The output of the use gate also serves as the new short-term memory.

## Putting It All Together
![image.png](attachment:c9498703-47d8-446d-830e-18b8663ee23e.png) ![image.png](attachment:82ca6a74-4c10-4c4e-82c3-d3e55f2eef02.png)
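For comparison with the diagrams, here is a sketch of a single LSTM step written out by hand using the standard gate equations that `nn.LSTM` implements, checked against PyTorch's own result. The comments relating each term to the lesson's learn, forget, remember, and use gates are an approximate reading on my part, not an exact correspondence; the lesson's formulation may differ in detail.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_dim, hidden_dim = 4, 3
lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim)

x = torch.randn(1, input_dim)         # current event (batch of 1)
h_prev = torch.randn(1, hidden_dim)   # short-term memory
c_prev = torch.randn(1, hidden_dim)   # long-term memory

# PyTorch stacks the weights as [input, forget, cell candidate, output]
gates = (x @ lstm.weight_ih_l0.T + lstm.bias_ih_l0
         + h_prev @ lstm.weight_hh_l0.T + lstm.bias_hh_l0)
i, f, g, o = gates.chunk(4, dim=1)
i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
g = torch.tanh(g)

learned = i * g                       # roughly the learn gate: new info, partly ignored
kept = f * c_prev                     # roughly the forget gate: old long-term memory, scaled
c_new = kept + learned                # roughly the remember gate: new long-term memory
h_new = o * torch.tanh(c_new)         # roughly the use gate: output and new short-term memory

# reference result from PyTorch for the same step
_, (h_ref, c_ref) = lstm(x.view(1, 1, -1), (h_prev.unsqueeze(0), c_prev.unsqueeze(0)))
print(torch.allclose(h_new, h_ref[0], atol=1e-5))
print(torch.allclose(c_new, c_ref[0], atol=1e-5))
```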

# LSTM for Part-of-Speech Tagging

In this section, we will use an LSTM to predict part-of-speech tags for words. What exactly is part-of-speech tagging?

Part of speech tagging is the process of determining the *category* of a word from the words in its surrounding context. You can think of part of speech tagging as a way to go from words to their [Mad Libs](https://en.wikipedia.org/wiki/Mad_Libs#Format) categories. Mad Libs are incomplete short stories that have many words replaced by blanks. Each blank has a specified word-category, such as `"noun"`, `"verb"`, `"adjective"`, and so on. One player asks another to fill in these blanks (prompted only by the word-category) until they have created a complete, silly story of their own. Here is an example of such categories:

```text
Today, you'll be learning how to [verb]. It may be a [adjective] process, but I think it will be rewarding! 
If you want to take a break you should [verb] and treat yourself to some [plural noun].
```
... and a set of possible words that fall into those categories:
```text
Today, you'll be learning how to code. It may be a challenging process, but I think it will be rewarding! 
If you want to take a break you should stretch and treat yourself to some puppies.
```


### Why Tag Speech?

Tagging parts of speech is often used to help disambiguate natural language phrases because it can be done quickly and with high accuracy. It can help answer: what subject is someone talking about? Tagging can be used for many NLP tasks like creating new sentences using a sequence of tags that make sense together, filling in a Mad Libs style game, and determining correct pronunciation during speech synthesis. It is also used in information retrieval, and for word disambiguation (ex. determining when someone says *right* like the direction versus *right* like "that's right!").

---

### Preparing the Data

Now, we know that neural networks do not do well with words as input and so our first step will be to prepare our training data and map each word to a numerical value. 

We start by creating a small set of training data: a few simple sentences, each broken down into a list of words with their corresponding word-tags. Note that the sentences are turned into lowercase words using `lower()` and then split into separate words using `split()`, which splits the sentence by whitespace characters.

#### Words to indices

Then, from this training data, we create a dictionary that maps each unique word in our vocabulary to a numerical value; a unique index `idx`. We do the same for each word-tag, for example: a noun will be represented by the number `1`.

In [4]:
# import resources 
import torch 
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt 

%matplotlib inline 

In [5]:
# training sentences and their corresponding word-tags
training_data = [
    ("The cat ate the cheese".lower().split(), ["DET", "NN", "V", "DET", "NN"]),
    ("She read that book".lower().split(), ["NN", "V", "DET", "NN"]),
    ("The dog loves art".lower().split(), ["DET", "NN", "V", "NN"]),
    ("The elephant answers the phone".lower().split(), ["DET", "NN", "V", "DET", "NN"])
]

# create a dictionary that maps words to indices
word2idx = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word2idx:
            word2idx[word] = len(word2idx)

# create a dictionary that maps tags to indices
tag2idx = {"DET": 0, "NN": 1, "V": 2}

Next, print out the created dictionary to see the words and their numerical values! 

You should see every word in our training set and its index value. Note that the word "the" only appears once because our vocabulary only includes *unique* words.

In [6]:
# print out the created dictionary
print(word2idx)

{'the': 0, 'cat': 1, 'ate': 2, 'cheese': 3, 'she': 4, 'read': 5, 'that': 6, 'book': 7, 'dog': 8, 'loves': 9, 'art': 10, 'elephant': 11, 'answers': 12, 'phone': 13}


In [7]:
import numpy as np

# a helper function for converting a sequence of words to a Tensor of numerical values
# will be used later in training
def prepare_sequence(seq, to_idx):
    '''This function takes in a sequence of words and returns a 
    corresponding Tensor of numerical values (indices for each word).'''
    idxs = [to_idx[w] for w in seq]
    idxs = np.array(idxs)
    return torch.from_numpy(idxs)


In [8]:
# check out what prepare_sequence does for one of our training sentences:
example_input = prepare_sequence("The dog answers the phone".lower().split(), word2idx)
print(example_input)

tensor([ 0,  8, 12,  0, 13])


---
## Creating the Model

Our model will assume a few things:
1. Our input is broken down into a sequence of words, so a sentence will be [w1, w2, ...]
2. These words come from a larger list of words that we already know (a vocabulary)
3. We have a limited set of tags, `[NN, V, DET]`, which mean: a noun, a verb, and a determiner (words like "the" or "that"), respectively
4. We want to predict\* a tag for each input word

\* To do the prediction, we will pass an LSTM over a test sentence, map each hidden state of the LSTM to tag scores with a linear layer, and apply a log-softmax function; the result is a vector of tag scores from which we can get the predicted tag for a word based on the *maximum* value in this distribution of tag scores. 

Mathematically, we can represent any tag prediction $\hat{y}_i$ as: 

\begin{align}\hat{y}_i = \text{argmax}_j \  (\log \text{Softmax}(Ah_i + b))_j\end{align}

where $A$ is a learned weight matrix, $b$ is a learned bias term, and $h_i$ is the hidden state at timestep $i$. 
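The formula can be sketched directly in code. Here `A`, `b`, and `h_i` are random stand-ins for the learned weight, the learned bias, and a hidden state (assumptions for illustration; in a trained model they would come from the network).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

hidden_dim, n_tags = 6, 3
h_i = torch.randn(hidden_dim)          # stand-in for the hidden state at timestep i
A = torch.randn(n_tags, hidden_dim)    # stand-in for the learned weight matrix
b = torch.randn(n_tags)                # stand-in for the learned bias

# log Softmax(A h_i + b): one log-probability per tag
log_probs = F.log_softmax(A @ h_i + b, dim=0)

# the predicted tag is the index j with the largest score
y_hat = torch.argmax(log_probs).item()

print(log_probs)
print(y_hat)
```

Because log-softmax preserves the ordering of the scores, the argmax is the same whether it is taken over `A @ h_i + b` or over the log-probabilities.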


### Word embeddings

We know that an LSTM takes in an expected input size and hidden_dim, but sentences are rarely of a consistent size, so how can we define the input of our LSTM?

Well, at the very start of this net, we'll create an `Embedding` layer that takes in the size of our vocabulary and returns a vector of a specified size, `embedding_dim`, for each word in an input sequence of words. Every word then has the same vector size, so the LSTM's input size is simply `embedding_dim`, however long the sentence is. It's important that this be the first layer in this net. You can read more about this embedding layer in [the PyTorch word embeddings tutorial](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#word-embeddings-in-pytorch).
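As a small sketch of what the embedding layer does, the example below looks up two sentences of different lengths. The sizes (`vocab_size = 14`, `embedding_dim = 6`) and the word indices are illustrative values chosen for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embedding_dim = 14, 6      # illustrative sizes
embedding = nn.Embedding(vocab_size, embedding_dim)

# two sentences of different lengths, given as word indices
short_sentence = torch.tensor([0, 8, 9, 10])
long_sentence = torch.tensor([0, 11, 12, 0, 13])

short_embeds = embedding(short_sentence)
long_embeds = embedding(long_sentence)

print(short_embeds.shape)   # torch.Size([4, 6])
print(long_embeds.shape)    # torch.Size([5, 6])
```

The number of rows follows the sentence length, while each word is always represented by `embedding_dim` values, which is what the LSTM takes as its input size.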

Pictured below is the expected architecture for this tagger model.

<img src='images/speech_tagger.png' height=60% width=60% >


In [11]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        ''' Initialize the layers of this model.'''
        super(LSTMTagger, self).__init__()
        
        self.hidden_dim = hidden_dim

        # embedding layer that turns words into a vector of a specified size
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # the LSTM takes embedded word vectors (of a specified size) as inputs 
        # and outputs hidden states of size hidden_dim
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # the linear layer that maps the hidden state output dimension 
        # to the number of tags we want as output, tagset_size (in this case this is 3 tags)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        
        # initialize the hidden state (see code below)
        self.hidden = self.init_hidden()

        
    def init_hidden(self):
        ''' At the start of training, we need to initialize a hidden state;
           there will be none because the hidden state is formed based on previously seen data.
           So, this function defines a hidden state with all zeroes and of a specified size.'''
        # The axes dimensions are (n_layers, batch_size, hidden_dim)
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
        ''' Define the feedforward behavior of the model.'''
        # create embedded word vectors for each word in a sentence
        embeds = self.word_embeddings(sentence)
        
        # get the output and hidden state by passing the lstm over our word embeddings
        # the lstm takes in our embeddings and hidden state
        lstm_out, self.hidden = self.lstm(
            embeds.view(len(sentence), 1, -1), self.hidden)
        
        # get the scores for the most likely tag for a word
        tag_outputs = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_outputs, dim=1)
        
        return tag_scores


## Define how the model trains

To train the model, we have to instantiate it and define the loss and optimizers that we want to use.

First, we define the size of our word embeddings. The `EMBEDDING_DIM` defines the size of our word vectors for our simple vocabulary and training set; we will keep them small so we can see how the weights change as we train.

**Note: the embedding dimension for a complex dataset will usually be much larger, around 64, 128, or 256 dimensional.**


#### Loss and Optimization

Since our model outputs a series of tag scores from a log-softmax layer, we will use `NLLLoss`. In tandem with a log-softmax layer, NLL Loss creates the kind of cross entropy loss that we typically use for analyzing a distribution of class scores. We'll use standard stochastic gradient descent (SGD) optimization, but you are encouraged to play around with other optimizers!
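The relationship between the two losses can be checked with a short sketch. The raw scores and targets below are made-up values for 5 words and 3 tags, used only to compare the two computations.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

raw_scores = torch.randn(5, 3)              # made-up scores: 5 words, 3 tags
targets = torch.tensor([0, 1, 2, 0, 1])     # made-up true tag indices

# log-softmax followed by negative log-likelihood loss
nll = F.nll_loss(F.log_softmax(raw_scores, dim=1), targets)

# cross entropy applied directly to the raw scores
ce = F.cross_entropy(raw_scores, targets)

print(nll.item(), ce.item())
print(torch.allclose(nll, ce))
```

The two values should match, which is why a model that ends in `log_softmax` is paired with `NLLLoss` rather than `CrossEntropyLoss`.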

In [12]:
# the embedding dimension defines the size of our word vectors
# for our simple vocabulary and training set, we will keep these small
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

# instantiate our model
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word2idx), len(tag2idx))

# define our loss and optimizer
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)


Just to check that our model has learned something, let's first look at the scores for a sample test sentence *before* our model is trained. Note that the test sentence *must* be made of words from our vocabulary otherwise its words cannot be turned into indices.

The scores should be Tensors of length 3 (for each of our tags) and there should be scores for each word in the input sentence.

For the test sentence, "The cheese loves the elephant", we know that this has the tags (DET, NN, V, DET, NN) or `[0, 1, 2, 0, 1]`, but our network does not yet know this. In fact, in this case, our model starts out with randomly initialized weights and a hidden state of all zeroes, so the scores should be nearly the same for every word and the predicted tags essentially arbitrary, which is about what you'd expect for a network that is not yet trained!

In [14]:
test_sentence = "The cheese loves the elephant".lower().split()

# see what the scores are before training
# element [i,j] of the output is the *score* for tag j for word i.
# to check the initial accuracy of our model, we don't need to train; we only run a forward pass
inputs = prepare_sequence(test_sentence, word2idx)
tag_scores = model(inputs)
print(tag_scores)

# tag_scores outputs a vector of tag scores for each word in an input sentence
# to get the most likely tag index, we grab the index with the maximum score!
# recall that these numbers correspond to tag2idx = {"DET": 0, "NN": 1, "V": 2}
_, predicted_tags = torch.max(tag_scores, 1)
print('\n')
print('Predicted tags: \n',predicted_tags)

tensor([[-1.6402, -0.8590, -0.9611],
        [-1.6006, -0.8763, -0.9625],
        [-1.6243, -0.8917, -0.9340],
        [-1.6545, -0.8822, -0.9290],
        [-1.7271, -0.8752, -0.9029]], grad_fn=<LogSoftmaxBackward0>)


Predicted tags: 
 tensor([1, 1, 1, 1, 1])


---
## Train the Model

Loop through all our training data for multiple epochs (again we are using a small epoch value for this simple training data). This loop:

1. Prepares our model for training by zero-ing the gradients
2. Initializes the hidden state of our LSTM
3. Prepares our data for training
4. Runs a forward pass on our inputs to get tag_scores
5. Calculates the loss between tag_scores and the true tag
6. Updates the weights of our model using backpropagation

In this example, we are printing out the average epoch loss, every 20 epochs; you should see it decrease over time.

In [15]:
# normally these epochs take a lot longer 
# but with our toy data (only 4 sentences), we can do many epochs in a short time
n_epochs = 300

for epoch in range(n_epochs):
    
    epoch_loss = 0.0
    
    # get all sentences and corresponding tags in the training data
    for sentence, tags in training_data:
        
        # zero the gradients
        model.zero_grad()

        # zero the hidden state of the LSTM, this detaches it from its history
        model.hidden = model.init_hidden()

        # prepare the inputs for processing by our network, 
        # turn all sentences and targets into Tensors of numerical indices
        sentence_in = prepare_sequence(sentence, word2idx)
        targets = prepare_sequence(tags, tag2idx)

        # forward pass to get tag scores
        tag_scores = model(sentence_in)

        # compute the loss, and gradients 
        loss = loss_function(tag_scores, targets)
        epoch_loss += loss.item()
        loss.backward()
        
        # update the model parameters with optimizer.step()
        optimizer.step()
        
    # print out avg loss per 20 epochs
    if(epoch%20 == 19):
        print("Epoch: %d, loss: %1.5f" % (epoch+1, epoch_loss/len(training_data)))


Epoch: 20, loss: 1.04375
Epoch: 40, loss: 1.00275
Epoch: 60, loss: 0.89531
Epoch: 80, loss: 0.66163
Epoch: 100, loss: 0.42037
Epoch: 120, loss: 0.26174
Epoch: 140, loss: 0.17292
Epoch: 160, loss: 0.12198
Epoch: 180, loss: 0.09018
Epoch: 200, loss: 0.06932
Epoch: 220, loss: 0.05511
Epoch: 240, loss: 0.04507
Epoch: 260, loss: 0.03772
Epoch: 280, loss: 0.03218
Epoch: 300, loss: 0.02790


## Testing

See how your model performs *after* training. Compare this output with the scores from before training, above.

Again, for the test sentence, "The cheese loves the elephant", we know that this has the tags (DET, NN, V, DET, NN) or `[0, 1, 2, 0, 1]`. Let's see if our model has learned to find these tags!

In [17]:
test_sentence = "The cheese loves the elephant".lower().split()

# see what the scores are after training
# note: model.hidden is not reset here, so it still holds the state
# left over from the last training sentence
inputs = prepare_sequence(test_sentence, word2idx)
tag_scores = model(inputs)
print(tag_scores)

# print the most likely tag index, by grabbing the index with the maximum score!
# recall that these numbers correspond to tag2idx = {"DET": 0, "NN": 1, "V": 2}
_, predicted_tags = torch.max(tag_scores, 1)
print('\n')
print('Predicted tags: \n',predicted_tags)

tensor([[-1.4433e-02, -4.2588e+00, -8.5687e+00],
        [-4.8191e+00, -1.7810e-02, -4.6482e+00],
        [-6.9437e+00, -4.5517e+00, -1.1581e-02],
        [-1.1403e-02, -4.5591e+00, -7.0512e+00],
        [-6.0553e+00, -6.5364e-03, -5.4799e+00]],
       grad_fn=<LogSoftmaxBackward0>)


Predicted tags: 
 tensor([0, 1, 2, 0, 1])


## Great job!

To improve this model, see if you can add sentences to the training data and create a more robust speech tagger. Try to initialize the hidden state in a different way or play around with the optimizers and see if you can decrease model loss even faster.

## Hyperparameters

Hyperparameters can be divided into two types:

1. Optimizer hyperparameters (related to the optimization and training process)
    - Learning rate
    - Mini-batch size
    - Number of epochs
2. Model hyperparameters (related to the structure of the model)
    - Number of layers and hidden units
    - Model-specific parameters for architectures like RNNs
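One way to keep the two groups separate in code is sketched below. The class names are hypothetical, and the default values are the ones used for the tagger in this notebook (learning rate 0.1, 300 epochs, one sentence per step, one LSTM layer, hidden and embedding sizes of 6).

```python
from dataclasses import dataclass, asdict


@dataclass
class OptimizerHyperparameters:
    """Settings for the optimization and training process."""
    learning_rate: float = 0.1
    batch_size: int = 1        # one sentence per update in this notebook
    n_epochs: int = 300


@dataclass
class ModelHyperparameters:
    """Settings for the structure of the model."""
    n_layers: int = 1
    hidden_dim: int = 6
    embedding_dim: int = 6     # model-specific: size of the word vectors


optimizer_hp = OptimizerHyperparameters()
model_hp = ModelHyperparameters()

print(asdict(optimizer_hp))
print(asdict(model_hp))
```

Grouping them this way makes it easier to change training settings without touching the model definition, and the other way around.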
        