# Character-Level LSTM in PyTorch

__Statistical Language Model__: A model trained to predict the next word or character given all previous words or characters.

__Character-Level Language Model__: A language model whose task is to predict the next character given all previous characters in a sequence, so it generates text one character at a time.
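
To make the prediction task concrete, the sketch below lists the (context, next character) pairs for a short string. The sample string `"anna"` and the helper name `next_char_pairs` are illustrative assumptions, not part of the original notebook.

```python
def next_char_pairs(text):
    """Return (context, target) pairs, where each target is the
    character that immediately follows its context in `text`."""
    return [(text[:i], text[i]) for i in range(1, len(text))]


pairs = next_char_pairs("anna")
for context, target in pairs:
    # Each line shows what the model has seen and what it should predict.
    print(repr(context), "->", repr(target))
```

A character-level model is trained on pairs of this kind and, at generation time, feeds each predicted character back in as part of the next context.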


In [1]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

In [2]:
torch.cuda.is_available()

False

In [3]:
with open('data/anna.txt', 'r') as f:
    text = f.read()

In [4]:
text[:1000]

"Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverything was in confusion in the Oblonskys' house. The wife had\ndiscovered that the husband was carrying on an intrigue with a French\ngirl, who had been a governess in their family, and she had announced to\nher husband that she could not go on living in the same house with him.\nThis position of affairs had now lasted three days, and not only the\nhusband and wife themselves, but all the members of their family and\nhousehold, were painfully conscious of it. Every person in the house\nfelt that there was no sense in their living together, and that the\nstray people brought together by chance in any inn had more in common\nwith one another than they, the members of the family and household of\nthe Oblonskys. The wife did not leave her own room, the husband had not\nbeen at home for three days. The children ran wild all over the house;\nthe English governess quarreled with the housekeep

### Encoding the Text

In [5]:
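# Encode the text: map each unique character to an integer and back.
chars = tuple(set(text))                             # unique characters in the text
int2char = dict(enumerate(chars))                    # integer -> character
char2int = {ch: ii for ii, ch in int2char.items()}   # character -> integer
# Represent the whole text as an array of integer codes.
encoded = np.array([char2int[ch] for ch in text])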

In [6]:
encoded[:100]

array([ 7, 19, 74, 82, 76, 79, 75, 31, 13, 51, 51, 51, 22, 74, 82, 82, 56,
       31, 73, 74, 12, 66, 17, 66, 79, 72, 31, 74, 75, 79, 31, 74, 17, 17,
       31, 74, 17, 66,  2, 79, 46, 31, 79, 18, 79, 75, 56, 31, 44, 61, 19,
       74, 82, 82, 56, 31, 73, 74, 12, 66, 17, 56, 31, 66, 72, 31, 44, 61,
       19, 74, 82, 82, 56, 31, 66, 61, 31, 66, 76, 72, 31, 70, 67, 61, 51,
       67, 74, 56, 16, 51, 51, 62, 18, 79, 75, 56, 76, 19, 66, 61])
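
As a sanity check on the two dictionaries, the sketch below encodes a short string and decodes it back. The `sample` string stands in for the full text, and the `sorted()` call is an added assumption that makes the integer assignment reproducible; the notebook's own cell iterates over a plain `set`, whose order for strings can change between Python sessions.

```python
import numpy as np

sample = "happy families"  # illustrative stand-in for the full text

# Same construction as the encoding cell, with sorted() added (an assumption)
# so that each character receives the same integer on every run.
chars = tuple(sorted(set(sample)))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

encoded = np.array([char2int[ch] for ch in sample])

# Decode by mapping every integer back to its character.
decoded = "".join(int2char[int(ii)] for ii in encoded)
assert decoded == sample
```

Because of that ordering behavior, the specific integers shown in the output above may differ when the notebook is rerun, although the round trip still holds.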

### Data Pre-Processing

In [28]:
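def one_hot_encode(arr):
    """One-hot encode an integer array, appending an axis of size n_labels."""
    # The number of classes is inferred from the largest label present in arr.
    n_labels = arr.max() + 1

    # One row per element of arr, one column per label.
    # arr.size is the total element count, for any number of dimensions.
    one_hot = np.zeros((arr.size, n_labels))

    # Set the column that matches each element's label to 1.
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1

    # Restore the original shape, with the one-hot axis last.
    return one_hot.reshape((*arr.shape, n_labels))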

In [29]:
test_seq = np.array([[1,2,3,7],[5,3,2,8]])
one_hot = one_hot_encode(test_seq)

In [30]:
one_hot

array([[[0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0.]],

       [[0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1.]]])
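
The output can be checked with a short, self-contained sketch. It builds the same encoding by indexing an identity matrix, which is an alternative to the function above rather than the notebook's own method, and then confirms that `argmax` over the last axis recovers the original labels.

```python
import numpy as np

test_seq = np.array([[1, 2, 3, 7], [5, 3, 2, 8]])
n_labels = test_seq.max() + 1  # largest label is 8, so 9 columns

# Indexing an identity matrix with the labels selects one unit row per label.
one_hot = np.eye(n_labels)[test_seq]

# Every position holds exactly one 1 along the last axis ...
assert (one_hot.sum(axis=-1) == 1).all()

# ... and argmax over that axis recovers the original labels.
recovered = one_hot.argmax(axis=-1)
assert (recovered == test_seq).all()
```

Note that inferring the width from the largest label in the array means two batches with different maxima would get different widths; when encoding real text, a fixed width such as `len(chars)` is the usual choice.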