<a href="https://colab.research.google.com/github/manishiitg/ML_Experiments/blob/master/nlp/LSTM_Character_Level_RNN_Model_Paul_Graham_Essay_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Understanding a Basic LSTM with a Character-Level Language Model**

Background reading on LSTMs: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

We will try to reproduce the results from Karpathy's amazing blog post: http://karpathy.github.io/2015/05/21/rnn-effectiveness/


"Technical: Lets train a 2-layer LSTM with 512 hidden nodes (approx. 3.5 million parameters), and with dropout of 0.5 after each layer. We’ll train with batches of 100 examples and truncated backpropagation through time of length 100 characters. With these settings one batch on a TITAN Z GPU takes about 0.46 seconds (this can be cut in half with 50 character BPTT at negligible cost in performance). Without further ado, lets see a sample from the RNN:"


**Pay particular attention to how the hidden state is handled: misusing it is why this notebook took so long to get working.**


In [1]:
# Let's make sure the kaggle.json file is present.
!ls -lha kaggle.json
# Next, install the Kaggle API client.
!pip install -q kaggle
# The Kaggle API client expects this file to be in ~/.kaggle,
# so move it there.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# This permissions change avoids a warning on Kaggle tool startup.
!chmod 600 ~/.kaggle/kaggle.json

-rw-r--r-- 1 root root 66 Nov 29 17:08 kaggle.json


In [2]:
!kaggle datasets download -d krsoninikhil/pual-graham-essays

Downloading pual-graham-essays.zip to /content
  0% 0.00/973k [00:00<?, ?B/s]
100% 973k/973k [00:00<00:00, 61.7MB/s]


In [3]:
!unzip pual-graham-essays.zip

Archive:  pual-graham-essays.zip
  inflating: paul_graham_essay.txt   


So far we have downloaded and set up our data.

In [4]:
import torch
from torch import nn

import torch.nn.functional as F

import numpy as np

from sklearn.metrics import accuracy_score

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [5]:
# import unicodedata
# import string
# all_letters = string.ascii_letters + " .,;'"


# def unicodeToAscii(s):
#   return ''.join(
#       c for c in unicodedata.normalize('NFD', s)
#       if unicodedata.category(c) != 'Mn'
#       and c in all_letters
#   )

def readLines(filename):
    # use a context manager so the file handle is closed after reading
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [line.lower() for line in lines]

# lower-casing all the characters is a deliberate design choice

lines_read = readLines("paul_graham_essay.txt")

print("original lines", len(lines_read))
lines = ""

for l in lines_read:
  if len(l) > 0:
    lines += l

print(len(lines), " Total characters found")

print(lines[180:190])

original lines 55942
2647260  Total characters found
nds to be 


Build the character vocabulary from the essay text.

In [6]:
data = []
for l in lines:
   data += list(l)

chars = list(set(data))

data_size, vocab_size = len(data), len(chars)

print('data has %d characters, %d unique.' % (data_size, vocab_size))

n_chars = len(chars)

print(chars)

data has 2647260 characters, 69 unique.
['i', '!', 'n', 's', 'c', ',', '`', '-', '5', '4', '"', 'u', ':', '=', 'a', 'j', 'b', ' ', 'k', '#', '*', '@', 'x', 'z', 'é', '}', '8', '%', 'v', 'd', '_', '1', '0', 'q', '~', 'w', 'e', 'l', 'f', 'm', 'p', 'g', '|', ')', '<', '>', '+', '6', 'o', '2', '(', '?', '[', 'y', '&', 'r', 'h', '/', '$', 't', '3', '9', ';', '7', "'", ']', '.', '{', '^']


In [7]:
char_to_ix = {ch:i for i, ch in enumerate(chars)}
ix_to_char = {i:ch for i, ch in enumerate(chars)}
print('char_to_ix', char_to_ix)
print('ix_to_char', ix_to_char)

char_to_ix {'i': 0, '!': 1, 'n': 2, 's': 3, 'c': 4, ',': 5, '`': 6, '-': 7, '5': 8, '4': 9, '"': 10, 'u': 11, ':': 12, '=': 13, 'a': 14, 'j': 15, 'b': 16, ' ': 17, 'k': 18, '#': 19, '*': 20, '@': 21, 'x': 22, 'z': 23, 'é': 24, '}': 25, '8': 26, '%': 27, 'v': 28, 'd': 29, '_': 30, '1': 31, '0': 32, 'q': 33, '~': 34, 'w': 35, 'e': 36, 'l': 37, 'f': 38, 'm': 39, 'p': 40, 'g': 41, '|': 42, ')': 43, '<': 44, '>': 45, '+': 46, '6': 47, 'o': 48, '2': 49, '(': 50, '?': 51, '[': 52, 'y': 53, '&': 54, 'r': 55, 'h': 56, '/': 57, '$': 58, 't': 59, '3': 60, '9': 61, ';': 62, '7': 63, "'": 64, ']': 65, '.': 66, '{': 67, '^': 68}
ix_to_char {0: 'i', 1: '!', 2: 'n', 3: 's', 4: 'c', 5: ',', 6: '`', 7: '-', 8: '5', 9: '4', 10: '"', 11: 'u', 12: ':', 13: '=', 14: 'a', 15: 'j', 16: 'b', 17: ' ', 18: 'k', 19: '#', 20: '*', 21: '@', 22: 'x', 23: 'z', 24: 'é', 25: '}', 26: '8', 27: '%', 28: 'v', 29: 'd', 30: '_', 31: '1', 32: '0', 33: 'q', 34: '~', 35: 'w', 36: 'e', 37: 'l', 38: 'f', 39: 'm', 40: 'p', 41: 'g
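
Note that `list(set(data))` has no guaranteed ordering across Python sessions, so the index assigned to each character may differ from run to run. That matters if a saved model is reloaded later. A minimal sketch, assuming a reproducible mapping is wanted, is to sort the characters first. The names `sample_text`, `encode`, and `decode` are illustrative and not part of the notebook.

```python
# Illustrative text standing in for the essay corpus (assumption).
sample_text = "hello world"

# sorted() makes the character -> index assignment reproducible across runs.
sorted_chars = sorted(set(sample_text))
char_to_ix_sorted = {ch: i for i, ch in enumerate(sorted_chars)}
ix_to_char_sorted = {i: ch for i, ch in enumerate(sorted_chars)}

def encode(text):
    """Map each character to its integer index."""
    return [char_to_ix_sorted[ch] for ch in text]

def decode(indices):
    """Map integer indices back to a string."""
    return "".join(ix_to_char_sorted[i] for i in indices)

encoded = encode("hello")
print(encoded)
print(decode(encoded))
```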

In [0]:
def one_hot_encode_batch(sequence, dict_size, seq_len):
    # Creating a multi-dimensional array of zeros with the desired output shape

    # we need to have all seq of same length or else cannot create this array
    # if we don't have batch_size i.e if we process one input at a time. we 
    # don't need to have sequence of same length

    tensor = torch.zeros(seq_len, dict_size)

    # features = np.zeros((batch_size, seq_len, dict_size), dtype=np.float32)
    
    # Replacing the 0 at the relevant character index with a 1 to represent that character
    for li in range(len(sequence)):
      tensor[li][sequence[li]] = 1
    return tensor
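
The loop above writes one `1` per position. As a rough sketch of an alternative, `torch.nn.functional.one_hot` produces the same rows in a single call. The sizes below are made up for illustration, and unlike `one_hot_encode_batch` this version does not pad up to a fixed `seq_len`.

```python
import torch
import torch.nn.functional as F

def one_hot_loop(indices, dict_size, seq_len):
    # Same logic as the loop-based encoder in the notebook.
    tensor = torch.zeros(seq_len, dict_size)
    for pos, idx in enumerate(indices):
        tensor[pos][idx] = 1
    return tensor

indices = [2, 0, 3]   # illustrative character indices (assumption)
dict_size = 5         # illustrative vocabulary size (assumption)

# F.one_hot expects an int64 tensor and returns int64, so cast to float.
vectorized = F.one_hot(torch.tensor(indices), num_classes=dict_size).float()

print(vectorized.shape)
print(torch.equal(vectorized, one_hot_loop(indices, dict_size, len(indices))))
```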

In [0]:
def targetTensor(seq, seq_len):
  tensor = torch.zeros(seq_len)
  for i in range(len(seq)):
    tensor[i] = seq[i]
  return tensor.long()

The target tensor is simply a tensor of character indexes.
The reason is that we use a negative log-likelihood style loss (`nn.CrossEntropyLoss` in this notebook), which expects the index of the correct character as the target. It is basically a classification problem with characters as labels.

So our network outputs a score for every possible character (a softmax distribution once normalized), and we compare that with the index of the actual character.

This is important to understand.
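
The point about index targets can be checked on a toy example. The sketch below, with made-up sizes, feeds raw scores and class indexes to `nn.CrossEntropyLoss` and compares the result with log-softmax followed by `nn.NLLLoss`.

```python
import torch
from torch import nn

torch.manual_seed(0)
n_positions, n_classes = 4, 6   # illustrative sizes (assumption)

logits = torch.randn(n_positions, n_classes)   # raw scores, no softmax applied
targets = torch.tensor([1, 0, 5, 3])           # one class index per position

loss = nn.CrossEntropyLoss()(logits, targets)

# Equivalent formulation: log-softmax followed by negative log-likelihood.
manual = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)

print(loss.item(), manual.item())
```

The two numbers should agree, which is why the model's final layer can return raw scores without a softmax.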

In [0]:
max_seq_length = 50
# length (in characters) of each training sequence
# a fixed length is needed so that every sequence in a batch has the same shape
epochs = 50
batch_size = 100

In [11]:
#https://towardsdatascience.com/building-efficient-custom-datasets-in-pytorch-2563b946fd9f
#https://medium.com/datadriveninvestor/how-to-custom-datasets-and-dataloaders-with-pytorch-e27f9e2a9009

import math
from torch.utils.data import Dataset

class SentenceDataset(Dataset):
    def __init__(self, lines, max_seq_length):

        no_of_seqs = math.floor(len(lines) / (max_seq_length ))

        # the total number of characters is usually not an exact multiple of
        # max_seq_length, so drop the trailing remainder; every sequence
        # (and therefore every batch) then has the same length

        final_no_chars = no_of_seqs * max_seq_length

        print(len(lines), "characters")
        print(final_no_chars, "final_no_chars")

        print(final_no_chars/(max_seq_length), "no of seques")

        lines = lines[0:final_no_chars]
        self.no_of_seqs = no_of_seqs
        self.max_seq_length = max_seq_length

        self.lines = lines

    def __len__(self):
        return self.no_of_seqs

    def __getitem__(self, idx):
      # take the idx-th non-overlapping window of max_seq_length characters
      start = idx*self.max_seq_length
      end = (idx+1)*self.max_seq_length
      sentence = self.lines[start:end]

      # the target is the input shifted left by one character
      input_seq = sentence[:-1]
      target_seq = sentence[1:]

      input_seq_idx = [char_to_ix[ch] for ch in input_seq]
      target_seq_idx = [char_to_ix[ch] for ch in target_seq]

      # both sequences hold max_seq_length - 1 characters; the helpers pad the
      # final position with zeros (an all-zero input row and target index 0)
      input_encoded = one_hot_encode_batch(input_seq_idx, n_chars, self.max_seq_length)
      target_encoded = targetTensor(target_seq_idx, self.max_seq_length)
      return input_encoded, target_encoded


eval_data = SentenceDataset(lines, max_seq_length)
eval_sampler = SequentialSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, drop_last=True, batch_size=batch_size, shuffle=False)


data, target = next(iter(eval_dataloader))

print(len(eval_dataloader.dataset), "total data sets")

print(data.shape)

print(target.shape)

2647260 characters
2647250 final_no_chars
52945.0 no of seques
52945 total data sets
torch.Size([100, 50, 69])
torch.Size([100, 50])
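
In `SentenceDataset.__getitem__`, a 50-character window yields 49 input characters and 49 targets, and the helper functions zero-pad the last position. One possible alternative, sketched here on a toy string and not what the notebook does, is to read `seq_len + 1` characters per window so that every input position has a real next-character target. `make_windows` is a hypothetical helper name.

```python
def make_windows(text, seq_len):
    """Yield (input, target) pairs where target is input shifted by one char."""
    # Each window needs seq_len + 1 characters: seq_len inputs plus the
    # character that follows the last input.
    n_windows = (len(text) - 1) // seq_len
    for w in range(n_windows):
        start = w * seq_len
        chunk = text[start:start + seq_len + 1]
        yield chunk[:-1], chunk[1:]

pairs = list(make_windows("abcdefghij", seq_len=3))
for inp, tgt in pairs:
    print(inp, "->", tgt)
```

Consecutive windows overlap by one character, so the last target of one window is the first input of the next.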


In [0]:
class Model(nn.Module):
    def __init__(self, batch_size, n_chars, hidden_dim = 512, n_layers = 4):
        super(Model, self).__init__()

        # Defining some parameters
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.n_chars = n_chars
        self.batch_size = batch_size

        #Defining the layers
        # RNN Layer
        self.rnn = nn.LSTM(n_chars, hidden_dim, n_layers, batch_first=True, dropout=.5)   
        # https://stackoverflow.com/questions/49224413/difference-between-1-lstm-with-num-layers-2-and-2-lstms-in-pytorch
        # https://discuss.pytorch.org/t/could-someone-explain-batch-first-true-in-lstm/15402

        # Fully connected layer
        # self.fc = nn.Linear(hidden_dim * max_seq_length, output_size)

        self.fc = nn.Sequential(
          # nn.BatchNorm1d(hidden_dim),
          # nn.Softmax(dim=1),
          nn.Linear(hidden_dim, n_chars)
        )

        self.dropout = nn.Dropout(.5)

    def forward(self, x, hidden):
        # x: (batch_size, seq_length, n_chars)
        # hidden: tuple (h, c), each (n_layers, batch_size, hidden_dim), or None

        # Passing in the input and hidden state into the model and obtaining outputs
        out, hidden = self.rnn(x, hidden)

        # https://stackoverflow.com/questions/48302810/whats-the-difference-between-hidden-and-output-in-pytorch-lstm

        # out: (batch_size, seq_length, hidden_dim)
        out = self.dropout(out)

        # reshape by stacking all time steps of all sequences vertically:
        # (batch_size, seq_length, hidden_dim) -> (batch_size * seq_length, hidden_dim)
        # so the linear layer scores every character position independently
        out = out.contiguous().view(x.size(0)*x.size(1), self.hidden_dim)

        # raw scores (logits) per character; no softmax here because
        # nn.CrossEntropyLoss applies log-softmax internally
        out = self.fc(out)

        return out, hidden

    def init_hidden(self, batch_size = None):
      if batch_size is None:
        batch_size = self.batch_size

      h0 = torch.zeros(self.n_layers, batch_size, self.hidden_dim).to(device)
      c0 = torch.zeros(self.n_layers, batch_size, self.hidden_dim).to(device)
      nn.init.xavier_normal_(h0, gain=nn.init.calculate_gain('relu'))
      nn.init.xavier_normal_(c0, gain=nn.init.calculate_gain('relu'))
      # self.h0 = nn.Parameter(h0, requires_grad=True)  # Parameter() to update weights
      # self.c0 = nn.Parameter(c0, requires_grad=True)

      return (h0, c0)
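
To make the shapes in `forward` and `init_hidden` concrete, here is a small sketch with made-up sizes. It shows that `batch_first=True` changes the layout of the input and output tensors but not of the hidden state, and that the flattening step turns every time step into its own row.

```python
import torch
from torch import nn

# Illustrative sizes (assumptions), much smaller than the notebook's model.
batch, seq_len, n_features, hidden_dim, n_layers = 3, 7, 10, 16, 2

lstm = nn.LSTM(n_features, hidden_dim, n_layers, batch_first=True)
x = torch.randn(batch, seq_len, n_features)

out, (h_n, c_n) = lstm(x)   # hidden state defaults to zeros when omitted

print(out.shape)   # (batch, seq_len, hidden_dim)
print(h_n.shape)   # (n_layers, batch, hidden_dim), regardless of batch_first

# The top layer's final hidden state equals the last time step of `out`.
print(torch.allclose(out[:, -1, :], h_n[-1]))

# Same flattening as in forward(): one row per character position.
flat = out.contiguous().view(batch * seq_len, hidden_dim)
print(flat.shape)
```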



In [0]:

model = Model(batch_size=batch_size, n_chars=n_chars)

# We'll also set the model to the device that we defined earlier (default is CPU)
model = model.to(device)

# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=.001)

In [0]:
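def detach(hidden):
    # Cut the hidden state off from the previous batch's graph so that
    # backpropagation stops at the batch boundary (truncated BPTT).
    if isinstance(hidden, (tuple, list)):
        hidden = tuple(i.detach() for i in hidden)
    else:
        hidden = hidden.detach()
    return hidden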
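
Why detaching matters can be seen on a toy LSTM. In this sketch (sizes and names are assumptions), the hidden state is carried from chunk to chunk, but its link to the previous chunk's graph is cut first. Without that cut, the second `backward()` would try to traverse a graph whose buffers have already been freed.

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)  # toy sizes

def detach_state(state):
    # Same idea as `detach` above: keep the values, drop the graph.
    return tuple(s.detach() for s in state)

hidden = None
grad_fns = []
for step in range(3):
    x = torch.randn(2, 5, 4)           # (batch, seq_len, features)
    if hidden is not None:
        hidden = detach_state(hidden)  # cut the graph at the chunk boundary
        grad_fns.append(hidden[0].grad_fn)
    out, hidden = lstm(x, hidden)
    out.sum().backward()               # each backward stays within one chunk

print(grad_fns)
```

A `grad_fn` of `None` means the tensor is a leaf: the values survive, but gradients no longer flow into earlier chunks.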

In [15]:
clip = 5
def train(epoch = -1):
  tloss = 0
  itrloss = 0

  hidden = model.init_hidden()

  for batch_idx, (input_encoded, target_encoded) in enumerate(eval_dataloader):
    input_encoded = input_encoded.to(device)
    target_encoded = target_encoded.to(device)    

    # print(input_encoded.shape)
    # print(input_encoded.argmax(dim=2))

    # print(input_encoded.shape, "input shape")

    optimizer.zero_grad()
    hidden = detach(hidden)
    output, hidden = model(input_encoded, hidden)

    output = output.to(device)
    # output = output

    # print(output.shape, 'final ouput shape')
    # print(target_encoded.shape, "target shape")
    # print(target_encoded, "target")

    # CrossEntropyLoss expects raw scores for each class:
    # the input has to be a 2D tensor of size (N, C), and
    # the target a 1D tensor of size N holding class indexes (0 to C-1)
    
    # since the model stacks its output to (batch_size * seq_length, C),
    # we flatten the target tensor the same way
    
    loss = criterion(output, target_encoded.view(-1))
    # view(-1) flattens whatever batch arrives; with drop_last=True every
    # batch here holds batch_size * max_seq_length targets
    loss.backward() # Does backpropagation and calculates gradients

    nn.utils.clip_grad_norm_(model.parameters(), clip)

    optimizer.step() # Updates the weights accordingly
    tloss += loss 
    itrloss += loss

    # break

    

    if batch_idx%100 == 0 and epoch == -1:
      print("Batch Loss: {}/{}...".format(batch_idx, itrloss/100))
      # print(output, " output")
      predict = output.argmax(dim=1).cpu().numpy().data
      # print(predict)
      predict_text = [ix_to_char[ch] for ch in predict]
      target_text = [ix_to_char[ch] for ch in target_encoded.view(batch_size*max_seq_length).cpu().numpy().data]
      print("predict ",   "".join(predict_text))

      print("target ",   "".join(target_text))
      # print(len(predict_text) , " text prediction len ")
      # print(target_encoded.view(len(input_encoded)*max_seq_length), "target ")
      itrloss = 0

  # the criterion already averages within a batch, so divide by the number of batches
  print("Epoch: {}/ Loss: {}".format(epoch, tloss / len(eval_dataloader)))
  return tloss

train()



Batch Loss: 0/0.04234004020690918...
predict  {kkq4{4,5({,5,{5,5,,0{{5,{,{,{",{{4,{5{,{(k5,{k{{,,{k,{{,{,{{{({",4,, q{{{kq {,,yk5,{0{{{k{{{4"4{k5,4{{  44{{{q,5y({{q{ ,{{,k,5,{,,,e{{{ k" ,,,,{,kq{k{{54{k,k5,{4q,{(4,,,,,5,{5k5{,5, {2,55,y{k{4,k, ,,5{55{5{{,q,q{5,{{,kk{{," ,k{4,{,4 {5{ k{,k{{5, ,,{,k{{{44,4,, ,5{{{54,(,, {4k,{0{{{{5,, ,055,,,{{{5{4,k{,5k,,,q,,5y,,{,,,q{ {{,{y,kz{{,,,5{{{,4{{",,,"45{,4k,k{{,{(,,5,,k5{5,,{{{{{q{ (55k,,,5,,k{,05k{,,4,4{{ {{,q,,,{{,,, 4q{{,{,,, ,z {(,,,{;54qk,{,45554{{5k,{{{5 {q, ,5{45k0{{{",{{{5,{{{,{ 0y45{q5k,,,,{5,5,,,,,,, 45kq5,k{q{5k,0,k{{5,{{,{,k,555{,{0{{y4{,4 8k"{k,,,,5q5{"k{,,k,5{{,q,{,,{k,,,,,k{,0,4{45{5{,,5,k"{4,,{4,,{{{,  ,{ 5,,{{{8,{5,4,,,4f{{{055{4{{{{,,4,k,0k{{,0{, 5,,,{"k5, 4{5{/ ,,50,0,,,,,{{,,,{{,{{{k{,{{k8{555,{ ,5,{{{45,,,5,{,,,{{{ {,{{5,{{{(54{{ 0,,,{4,{, ,4{ q{,k ,kk,{qk, 5,,,,,,,,q5,5,55,,{0,4{{554,5{,,0{4 k,,{k45,,{,{{,,,{,,44,k{{{,{5{{4,{,,{{{5{ ,,,{{,{,,4,{,q{, ,{,,{{ 0,555k,0,,{{{{{{k5{{{{k{{{k{{q{,2{"5,{k5,t{55 k0{,4{{,{,{({5"05,{{

tensor(1632.7217, device='cuda:0', grad_fn=<AddBackward0>)
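
The training loop calls `nn.utils.clip_grad_norm_` with `clip = 5`. As a small illustration on a toy model with hand-set gradients (both assumptions), the call rescales all gradients together when their combined L2 norm exceeds the limit and returns the norm measured before clipping.

```python
import torch
from torch import nn

model_demo = nn.Linear(3, 1)   # toy model with 4 parameters in total (assumption)
for p in model_demo.parameters():
    # Pretend backward() produced large gradients.
    p.grad = torch.full_like(p, 10.0)

max_norm = 5.0
norm_before = nn.utils.clip_grad_norm_(model_demo.parameters(), max_norm)

# Recompute the combined L2 norm after clipping.
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model_demo.parameters()))
print(float(norm_before), float(norm_after))
```

Four gradient entries of 10 give a norm of 20 before clipping. Afterwards the norm sits at roughly the limit of 5, while the direction of the gradient is unchanged.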

In [16]:
import random

test_dataloader = DataLoader(eval_data, sampler=torch.utils.data.SubsetRandomSampler(list(range(0,1000))), batch_size=10)

def eval():
  input_encoded, target_encoded = next(iter(test_dataloader))

  model.eval()  # disable dropout while measuring accuracy
  with torch.no_grad():
    input_encoded = input_encoded.to(device)
    target_encoded = target_encoded.to(device)    

    # passing None lets the LSTM start from a zero hidden state
    output, _ = model(input_encoded, None)

    prediction = output.argmax(dim=1).cpu()

    acc = accuracy_score(target_encoded.view(-1).cpu(), prediction)
    print("Accuracy {}".format(acc))
  model.train()  # restore training mode for the next epoch


eval()

Accuracy 0.136


In [0]:
epochs = 20
for epoch in range(epochs):
  train(epoch)
  eval()

In [18]:
hidden = model.init_hidden(1)
print(hidden[0].shape)

torch.Size([4, 1, 512])


In [23]:


def generate(input_seq_sample, hidden):
  # Greedy decoding: returns the argmax character for every input position
  # together with the updated hidden state.
  with torch.no_grad():
    input_seq_idx = [char_to_ix[ch] for ch in input_seq_sample]
    input_encoded = one_hot_encode_batch(input_seq_idx, n_chars, len(input_seq_sample))
    input_encoded = input_encoded.to(device)

    # unsqueeze(0) adds the batch dimension: (1, seq_len, n_chars)
    output, hidden = model(input_encoded.unsqueeze(0), hidden)

    prediction = output.argmax(dim=1)

    text = [ix_to_char[idx] for idx in prediction.cpu().numpy().data]

    return "".join(text), hidden

def full_generate(text, n_chars_to_generate, hidden):
  # `text` and `n_chars_to_generate` avoid shadowing the built-ins input/len
  for i in range(n_chars_to_generate):
    char, hidden = generate(text, hidden)
    # the last predicted character is the continuation of the text so far
    text = text + char[-1]

  return text

output = full_generate("what are you doing?", 100, hidden)
print(output)

output = full_generate("will this work?", 100, hidden)
print(output)


output = full_generate("manish ", 100, hidden)
print(output)


output = full_generate("india ", 100, hidden)
print(output)

output = full_generate("i hate ", 100, hidden)
print(output)


output = full_generate("what is lstm?", 100, hidden)
print(output)

what are you doing?  the startup is that they don't have to be a startup is that they would be a startup in the startup
will this work?  the same thing that would be a company that would be a startup is that they can be a startup in th
manish the startup ideas are the startup is that the same thing that is that they would be a startup in the
india things are a startup is that they want to be a startup that way to be a startup with the startups th
i hate the sort of problem to start to start to the startup is the startup is that they started to be the s
what is lstm?  the startups that would be a startup is the startup in the startup ideas that want to be the start
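
The samples above come from greedy `argmax` decoding, which can fall into loops such as "the startup is that they". A common alternative is to sample the next character from the softmax distribution, optionally scaled by a temperature. The sketch below also primes the LSTM with the prompt once and then feeds back only the newly generated character, relying on the hidden state for context. It uses a small untrained stand-in network (`toy_lstm`, `toy_fc`) and a toy vocabulary, all of which are assumptions for illustration, so its output is random text.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
vocab = sorted(set("hello world"))             # toy vocabulary (assumption)
stoi = {ch: i for i, ch in enumerate(vocab)}

# Untrained stand-ins for the notebook's LSTM and output layer.
toy_lstm = nn.LSTM(len(vocab), 16, batch_first=True)
toy_fc = nn.Linear(16, len(vocab))

def one_hot(text):
    """Encode text as a (1, len(text), vocab_size) float tensor."""
    idx = torch.tensor([stoi[ch] for ch in text])
    return F.one_hot(idx, len(vocab)).float().unsqueeze(0)

def sample_text(prompt, n_generate, temperature=0.8):
    with torch.no_grad():
        # Prime: run the whole prompt once to build up the hidden state.
        out, hidden = toy_lstm(one_hot(prompt))
        text = prompt
        for _ in range(n_generate):
            # Lower temperature sharpens the distribution, higher flattens it.
            logits = toy_fc(out[:, -1, :]) / temperature
            probs = torch.softmax(logits, dim=-1)
            next_ix = torch.multinomial(probs, num_samples=1).item()
            text += vocab[next_ix]
            # Step: feed only the new character; `hidden` carries the context.
            out, hidden = toy_lstm(one_hot(vocab[next_ix]), hidden)
    return text

result = sample_text("hello ", 20)
print(result)
```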


In [0]:
torch.save(model.state_dict(), 'cnn-lstm-Epoch_50.pth')
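
The saved `state_dict` is only usable together with the same character-to-index mapping, and `chars` comes from a `set`, whose order can change between sessions. One option, sketched here with a toy model and a hypothetical file name, is to store the vocabulary in the same checkpoint.

```python
import os
import tempfile

import torch
from torch import nn

toy_model = nn.Linear(4, 2)        # stand-in for the trained model (assumption)
toy_chars = ["a", "b", "c", "d"]   # stand-in for the notebook's `chars`

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pth")   # hypothetical file name
    # Save the weights and the vocabulary side by side.
    torch.save({"state_dict": toy_model.state_dict(), "chars": toy_chars}, path)

    checkpoint = torch.load(path, map_location="cpu")

# Rebuild the model and the mapping from the checkpoint alone.
restored = nn.Linear(4, 2)
restored.load_state_dict(checkpoint["state_dict"])
restored_char_to_ix = {ch: i for i, ch in enumerate(checkpoint["chars"])}
print(restored_char_to_ix)
```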

In [21]:
from google.colab import drive
drive.mount('/content/drive')



Mounted at /content/drive


In [0]:
!cp -rf "cnn-lstm-Epoch_50.pth" "drive/My Drive/Colab Notebooks/"

In [0]:
# criterion = nn.NLLLoss()

# output = torch.randn(10, 120).float()
# target = torch.FloatTensor(10).uniform_(0, 120).long()

# print(output.shape)
# print(target.shape)
# print(target)

# loss = criterion(output, target)

In [0]:
# X = None
# Y = None

# def create_dataset():
#   for line_no in range(len(lines)):
#     sentence = lines[line_no]
#     sentence = sentence.strip()

#     if len(sentence) <= 1:
#       continue

#     input_seq_idx = []
#     target_seq_idx = []
#     if len(sentence) >= max_seq_length: 
#       # print(name , " bigger than max seq length FYI")
#       sentence = sentence[0:max_seq_length]

    
#     input_seq = sentence[:-1]
#     target_seq = sentence[1:]

#     # print(sentence)
#     # print(input_seq)
#     # print(target_seq)

#     input_seq_idx = [char_to_ix[ch] for ch in input_seq]
#     target_seq_idx = [char_to_ix[ch] for ch in target_seq]

#     input_encoded = one_hot_encode_batch(input_seq_idx, n_chars, max_seq_length)
#     target_encoded = targetTensor(target_seq_idx, max_seq_length)

#     if line_no == 0:
#       X = input_encoded
#       Y = target_encoded
#     else:
#       X = torch.cat([X, input_encoded], dim=0)
#       Y = torch.cat([Y, target_encoded], dim=0)
    
#     # print(input_encoded.shape, "input shape")
#     # print(target_encoded)

# create_dataset()

# print(X.shape)
# print(Y.shape)



# eval_data = TensorDataset(X,Y)
# eval_sampler = SequentialSampler(eval_data)
# eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=50)

# print(eval_data)

# this is taking too long

In [0]:
arr = ["m","a","n","i","s","h"]
n = 1
n_steps=3

x = arr[n:n+n_steps]
y = np.zeros_like(x)

print(x)
print(y)
try:
    y[:-1], y[-1] = x[1:], arr[n+n_steps]
except IndexError:
    y[:-1], y[-1] = x[1:], arr[0]

print(x, " >" ,  y)