# Character-level language model
<a target="_blank" href="https://colab.research.google.com/github/luigiselmi/machine_learning_notes/blob/main/pml3/char_level_language_model.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
We want to build a character-level language model based on an RNN that can generate new text in the style of the text used for training. We download the book "The Mysterious Island" by Jules Verne.

In [1]:
#!wget 'https://www.gutenberg.org/files/1268/1268-0.txt' -P data/

We keep only the text between the title and the Project Gutenberg end marker, then count the total number of characters and the number of unique characters.

In [1]:
import numpy as np

## Reading and processing text
with open('data/1268-0.txt', 'r', encoding="utf8") as fp:
    text=fp.read()
    
start_indx = text.find('THE MYSTERIOUS ISLAND')
end_indx = text.find('End of the Project Gutenberg')

text = text[start_indx:end_indx]
char_set = set(text)
print('Total Length:', len(text))
print('Unique Characters:', len(char_set))

Total Length: 1130711
Unique Characters: 85


We encode the characters in the text as integers. The integer values can be decoded back into characters.

In [2]:
chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}
char2int['A']

29

In [3]:
char_array = np.array(chars_sorted)
char_array[29:81]

array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
       'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
       'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
       'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'],
      dtype='<U1')

In [4]:
text_encoded = np.array([char2int[ch] for ch in text], dtype=np.int32)

print('Text encoded shape: ', text_encoded.shape)

print(text[:15], '     == Encoding ==> ', text_encoded[:15])
print(text_encoded[15:21], ' == Reverse  ==> ', ''.join(char_array[text_encoded[15:21]]))

Text encoded shape:  (1130711,)
THE MYSTERIOUS       == Encoding ==>  [48 36 33  1 41 53 47 48 33 46 37 43 49 47  1]
[37 47 40 29 42 32]  == Reverse  ==>  ISLAND


In [5]:
for ex in text_encoded[:5]:
    print('{} -> {}'.format(ex, char_array[ex]))

48 -> T
36 -> H
33 -> E
1 ->  
41 -> M


## Character prediction as a multiclass classification task
Text generation, where a sequence of characters is used to infer the next one, can be treated as a multiclass classification task: an incomplete text is mapped (i.e. classified) to one of the unique characters in our alphabet. We build the training set from sequences of characters taken from the text, using as label the character that immediately follows each sequence. We choose chunks of length 41: 40 characters for the input sequence and 1 for the label, that is, the character after the sequence. The trained model should allow us to generate new sequences by repeatedly predicting the next character.

We slide a window over the encoded text to create chunks of 41 characters: the first 40 form the input and the character that follows is the label.

In [6]:
seq_length = 40
chunk_size = seq_length + 1

text_chunks = [text_encoded[i:i + chunk_size] for i in range(len(text_encoded) - chunk_size + 1)] 

In [7]:
text_chunks[0]

array([48, 36, 33,  1, 41, 53, 47, 48, 33, 46, 37, 43, 49, 47,  1, 37, 47,
       40, 29, 42, 32,  1, 10, 10, 10,  0,  0,  0,  0,  0, 48, 36, 33,  1,
       41, 53, 47, 48, 33, 46, 37])

In [8]:
## inspection:
for seq in text_chunks[:1]:
    input_seq = seq[:seq_length]
    target = seq[seq_length] 
    print(input_seq, ' -> ', target)
    print(repr(''.join(char_array[input_seq])), ' -> ', repr(''.join(char_array[target])))

[48 36 33  1 41 53 47 48 33 46 37 43 49 47  1 37 47 40 29 42 32  1 10 10
 10  0  0  0  0  0 48 36 33  1 41 53 47 48 33 46]  ->  37
'THE MYSTERIOUS ISLAND ***\n\n\n\n\nTHE MYSTER'  ->  'I'
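
The chunking above can be illustrated on a toy string. This is a minimal sketch in plain Python; `toy_text` and the `toy_*` names are made up for illustration, and the window length of 4 stands in for the 40 used in the notebook.

```python
# Toy illustration of the sliding-window chunking (hypothetical example text).
toy_text = "hello world"
toy_seq_length = 4                      # input length (the notebook uses 40)
toy_chunk_size = toy_seq_length + 1     # input plus one label character

# One chunk per start position, mirroring the list comprehension used for text_chunks.
toy_chunks = [toy_text[i:i + toy_chunk_size]
              for i in range(len(toy_text) - toy_chunk_size + 1)]

for chunk in toy_chunks[:3]:
    print(repr(chunk[:toy_seq_length]), '->', repr(chunk[toy_seq_length]))

# A text of n characters yields n - chunk_size + 1 chunks.
num_chunks = 1130711 - 41 + 1
print('Chunks in the book:', num_chunks)
```

Because the window advances by one character at a time, consecutive chunks overlap almost entirely, which is why the number of chunks is close to the number of characters.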


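We create a PyTorch dataset from the set of chunks. Each item pairs the first 40 characters of a chunk with the same chunk shifted by one character, so every input position has its own next-character target.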

In [10]:
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks

    def __len__(self):
        return len(self.text_chunks)
    
    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1].long(), text_chunk[1:].long()
    
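# Stack the chunks into one NumPy array first: building a tensor directly
# from a list of arrays is very slow.
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))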

In [11]:
for i, (seq, target) in enumerate(seq_dataset):
    print(' Input (x):', repr(''.join(char_array[seq])))
    print('Target (y):', repr(''.join(char_array[target])))
    print()
    if i == 1:
        break
    

 Input (x): 'THE MYSTERIOUS ISLAND ***\n\n\n\n\nTHE MYSTER'
Target (y): 'HE MYSTERIOUS ISLAND ***\n\n\n\n\nTHE MYSTERI'

 Input (x): 'HE MYSTERIOUS ISLAND ***\n\n\n\n\nTHE MYSTERI'
Target (y): 'E MYSTERIOUS ISLAND ***\n\n\n\n\nTHE MYSTERIO'



We create batches from the dataset.

In [12]:
from torch.utils.data import DataLoader
batch_size = 64
torch.manual_seed(1)
seq_dl = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
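
As a quick sanity check on the loader size, the arithmetic below estimates how many batches one full pass would contain. It is a sketch that only reuses the text length printed earlier and the chunk and batch sizes defined above.

```python
num_chunks = 1130711 - 41 + 1      # chunks produced by the sliding window
batch_size = 64

# With drop_last=True the final, incomplete batch is discarded.
full_batches, leftover = divmod(num_chunks, batch_size)
print('Batches per full pass:', full_batches)
print('Chunks dropped from the last batch:', leftover)
```

Dropping the incomplete batch keeps every batch at exactly 64 sequences, which matches the fixed `batch_size` passed to `init_hidden` in the training loop.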

## Model definition
We define a model with an embedding layer, a single LSTM layer, and a fully connected output layer that produces one logit per character.

In [14]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim) 
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, 
                           batch_first=True)
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell

    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size)
        return hidden, cell
    
vocab_size = len(char_array)
embed_dim = 256
rnn_hidden_size = 512

torch.manual_seed(1)
model = RNN(vocab_size, embed_dim, rnn_hidden_size) 
model

RNN(
  (embedding): Embedding(85, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=85, bias=True)
)
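
To get a feeling for the size of this model, the layer shapes printed above can be turned into a parameter count. The sketch below is plain arithmetic; it assumes PyTorch's `nn.LSTM` layout, where each of the four gates has input weights, recurrent weights, and two separate bias vectors.

```python
vocab_size, embed_dim, rnn_hidden_size = 85, 256, 512

# Embedding: one embed_dim vector per character.
embedding_params = vocab_size * embed_dim

# LSTM: four gates, each with input weights, recurrent weights
# and two bias vectors (bias_ih and bias_hh).
lstm_params = 4 * (embed_dim * rnn_hidden_size
                   + rnn_hidden_size * rnn_hidden_size
                   + 2 * rnn_hidden_size)

# Output layer: weights plus one bias per character.
fc_params = rnn_hidden_size * vocab_size + vocab_size

total = embedding_params + lstm_params + fc_params
print(embedding_params, lstm_params, fc_params, total)
```

The result can be compared with `sum(p.numel() for p in model.parameters())` on the model defined above.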

## Training the model

We train with the cross-entropy loss and the Adam optimizer. Each input sequence is fed to the model one character at a time, and the loss is summed over the 40 positions before backpropagation.

In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

num_epochs = 500 #10000 

torch.manual_seed(1)

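# Each "epoch" here is a single optimization step on one randomly drawn batch,
# not a full pass over the dataset.
for epoch in range(num_epochs):
    hidden, cell = model.init_hidden(batch_size)
    seq_batch, target_batch = next(iter(seq_dl))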
    #seq_batch = seq_batch.to(device)
    #target_batch = target_batch.to(device)
    optimizer.zero_grad()
    loss = 0
    for c in range(seq_length):
        pred, hidden, cell = model(seq_batch[:, c], hidden, cell) 
        loss += loss_fn(pred, target_batch[:, c])
    loss.backward()
    optimizer.step()
    loss = loss.item()/seq_length
    if epoch % 500 == 0:
        print(f'Epoch {epoch} loss: {loss:.4f}')

Epoch 0 loss: 4.4364
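
A useful reference point for the initial loss is the cross-entropy of a model that assigns the same probability to each of the 85 characters. The sketch below computes that baseline with the standard library only.

```python
import math

vocab_size = 85

# Cross-entropy of a uniform guess: -log(1 / vocab_size) = log(vocab_size).
uniform_loss = math.log(vocab_size)
print(f'Uniform-guess loss: {uniform_loss:.4f}')
```

The value is about 4.44, very close to the loss printed at epoch 0, which is what we expect from an untrained model. A loss well below this baseline indicates that the model has started to learn the character statistics of the text.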


## Evaluation phase

To generate text, we sample the next character from the probability distribution defined by the model's logits instead of always picking the most likely character. The following cell shows how `Categorical` draws samples from a set of logits, here three equally likely classes.

In [26]:
from torch.distributions.categorical import Categorical

torch.manual_seed(1)

logits = torch.tensor([[1.0, 1.0, 1.0]])

print('Probabilities:', nn.functional.softmax(logits, dim=1).numpy()[0])

m = Categorical(logits=logits)
samples = m.sample((10,))
 
print(samples.numpy())

Probabilities: [0.33333334 0.33333334 0.33333334]
[[0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [2]
 [1]
 [1]]
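
With uniform logits every class is equally likely. The sketch below repeats the experiment with unequal logits and with a scaling factor applied before sampling, which is one common way to make generated text more or less random. The helper `sample_next_index` and its `scale_factor` parameter are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
import torch
from torch.distributions.categorical import Categorical

def sample_next_index(logits, scale_factor=1.0):
    """Draw one class index per row from softmax(logits * scale_factor)."""
    scaled = logits * scale_factor
    return Categorical(logits=scaled).sample()

torch.manual_seed(1)

# The third class has a larger logit, so it is sampled more often.
logits = torch.tensor([[1.0, 1.0, 3.0]])
probs = torch.softmax(logits, dim=1)
print('Probabilities:', probs.numpy()[0])

# Larger scale factors sharpen the distribution, smaller ones flatten it.
for scale in (0.5, 1.0, 2.0):
    draws = torch.stack([sample_next_index(logits, scale) for _ in range(10)])
    print(f'scale_factor={scale}:', draws.flatten().tolist())
```

Here the third class has a probability of about 0.79 at a scale factor of 1. In a generation loop, the same idea would be applied to the logits returned by the model for the last character, and the sampled index decoded with `char_array`.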
