# Character-level Language Modeling

## Introduction

This project trains a character-based recurrent neural network (RNN) to generate text in the manner of H.P. Lovecraft, using his novella *At the Mountains of Madness* as the training corpus. Lovecraft's writing offers a rich vocabulary and complex sentence structure, which makes it an interesting dataset for exploring what RNNs can do in text generation.

By using character-level modeling, the aim is to capture the nuances of language structure without explicitly attempting to replicate any particular writing style. The model focuses solely on predicting individual characters based on prior context, rather than understanding the broader semantics or thematic elements of the text. The project highlights the process of data preparation, model architecture design, and evaluation of generated outputs, ultimately shedding light on the strengths and limitations of character-based text generation.

## Text preprocessing

In [39]:
"""
import subprocess

# download At the Mountains of Madness - H.P. Lovecraft
command = ['curl', '-O', 'https://www.gutenberg.org/files/70652/70652-0.txt']

try:
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    print("File downloaded successfully.")
except subprocess.CalledProcessError as error:
    print(f'Error occurred: {error.stderr}')
"""

In [40]:
with open('70652-0.txt', mode='r', encoding='utf8') as file:
    book = file.read()

In [41]:
# Strip the Project Gutenberg header and footer pages
begin_index = book.find('At the MOUNTAINS of MADNESS')
end_index = book.find('THE END')

book = book[begin_index:end_index]

In [42]:
unique_chars = set(book)
print(sorted(unique_chars))

['\n', ' ', '!', '"', "'", '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '°', '´', '×', 'æ', 'é', 'ë', 'ï', 'ö']


Next, we remove or normalize characters that carry little information for this task (line breaks, markup symbols, and repeated dashes):

In [43]:
import re

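# remove or normalize unimportant characters
book = book.replace('\n', ' ')
# NOTE: '--' is replaced first, so every '----' has already become ',,' by the time
# the '----' replacement runs, and that replacement never matches. Swapping the two
# lines would make it take effect, but would also change the counts printed below.
book = book.replace('--', ',')
book = book.replace('----', ' ')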
for char in ['*', '[', ']', '×', '_']:
    book = book.replace(char, '')

# replace multiple spaces/newlines with a single space
book = re.sub(r'\s+', ' ', book)
# fix commas after multiple spaces removal
book = re.sub(r'(\w),\s*(\w)', r'\1, \2', book)

unique_chars = set(book)

In [44]:
print('Length of book in characters: ', len(book))
print('Unique characters in book: ', len(unique_chars))

Length of book in characters:  241251
Unique characters in book:  80


The model needs the characters as integers. We map each unique character to an index and store the encoded book in a NumPy array for fast slicing and indexing:

In [45]:
import numpy as np

sorted_chars = np.array(sorted(unique_chars))
char2int = {char : integer for integer, char in enumerate(sorted_chars)}
encoded_book = np.array([char2int[char] for char in book], dtype=np.int32)

Check that the encoded book has the same length as the original text:

In [46]:
print(encoded_book.shape)

(241251,)


Now print a 100-character excerpt alongside its encoding:

In [47]:
first_line = slice(248, 348)
print('Original text:\n"' + book[first_line] + '"\n')
print('Encoded text:\n', encoded_book[first_line])
print('\nLengths of both texts: ', len(encoded_book[first_line]), len(book[first_line]))

Original text:
" I am forced into speech because men of science have refused to follow my advice without knowing why"

Encoded text:
 [ 0 30  0 47 59  0 52 61 64 49 51 50  0 55 60 66 61  0 65 62 51 51 49 54
  0 48 51 49 47 67 65 51  0 59 51 60  0 61 52  0 65 49 55 51 60 49 51  0
 54 47 68 51  0 64 51 52 67 65 51 50  0 66 61  0 52 61 58 58 61 69  0 59
 71  0 47 50 68 55 49 51  0 69 55 66 54 61 67 66  0 57 60 61 69 55 60 53
  0 69 54 71]

Lengths of both texts:  100 100


And in reverse:

In [48]:
print('Encoded text:\n', encoded_book[first_line], "\n")
print('Decoded text:', ''.join(sorted_chars[encoded_book[first_line]]))

Encoded text:
 [ 0 30  0 47 59  0 52 61 64 49 51 50  0 55 60 66 61  0 65 62 51 51 49 54
  0 48 51 49 47 67 65 51  0 59 51 60  0 61 52  0 65 49 55 51 60 49 51  0
 54 47 68 51  0 64 51 52 67 65 51 50  0 66 61  0 52 61 58 58 61 69  0 59
 71  0 47 50 68 55 49 51  0 69 55 66 54 61 67 66  0 57 60 61 69 55 60 53
  0 69 54 71] 

Decoded text:  I am forced into speech because men of science have refused to follow my advice without knowing why


## Modelling the RNN

We will construct our model for character-based language modeling. The custom dataset class takes encoded chunks of text as input. Each chunk will be composed of a sequence of characters, where the model will predict the next character given the current sequence.
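
As a toy illustration of the chunking described above, the sketch below slides a window of `seq_length + 1` characters over a short string and splits each window into an input and a target shifted by one position. The string `text` and the `ord`-based encoding are assumptions made for the example; the notebook uses `encoded_book` and `char2int` instead.

```python
import numpy as np

text = "abcdefgh"          # hypothetical stand-in for the book
seq_length = 3
chunk_size = seq_length + 1

# toy encoding: 'a' -> 0, 'b' -> 1, ...
encoded = np.array([ord(c) - ord('a') for c in text])

# one chunk per start position; the "+ 1" keeps the final full window
# (the notebook's range stops one window earlier)
chunks = [encoded[i:i + chunk_size] for i in range(len(encoded) - chunk_size + 1)]

# input = all but the last character, target = all but the first
pairs = [(chunk[:-1], chunk[1:]) for chunk in chunks]

for x, y in pairs:
    print('input:', x.tolist(), 'target:', y.tolist())
```

Each target position holds the character that follows the corresponding input position, which is what the model is trained to predict.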


In [49]:
import torch
from torch.utils.data import Dataset, DataLoader

seq_length = 50
chunk_size = seq_length + 1

encoded_chunks = [encoded_book[i : i+chunk_size] for i in range(0, len(encoded_book) - chunk_size)]

In [50]:
class TextDataset(Dataset):
    def __init__(self, encoded_chunks):
        super().__init__()
        self.encoded_chunks = encoded_chunks
        
    def __len__(self):
        return len(self.encoded_chunks)
    
    def __getitem__(self, idx):
        x = self.encoded_chunks[idx][:-1]
        y = self.encoded_chunks[idx][1:]
        return x, y

In [51]:
text_dataset = TextDataset(encoded_chunks)

The target is the input shifted by one character, so at every position the model has to predict the character that follows. For a few random chunks, input and target look like this:

In [52]:
for _ in range(3):
    x, y = text_dataset[np.random.randint(0, len(text_dataset))]
    print('Input: ', ''.join(sorted_chars[x]))
    print('Target:', ''.join(sorted_chars[y]), '\n')

Input:  d be called decadent in comparison with that of sp
Target:  be called decadent in comparison with that of spe 

Input:  diate, but was clearly something more. It was part
Target: iate, but was clearly something more. It was partl 

Input:  al recession toward the antarctic became very plai
Target: l recession toward the antarctic became very plain 



In [53]:
batch_size = 32
text_dataloader = DataLoader(text_dataset, batch_size=batch_size, shuffle=True, drop_last=True)


Our model comprises an **embedding layer** that transforms character indices into dense vectors, followed by a **bidirectional LSTM layer** and a fully connected layer that outputs the logits for each character in our vocabulary. On a full sequence, a bidirectional LSTM reads the input in both directions, so each position would see succeeding as well as preceding characters. In this notebook, however, training and sampling both feed the model one character at a time and carry the hidden and cell states from step to step. The backward direction therefore only ever processes a length-1 sequence and never sees future characters; in effect it acts as a second recurrent pathway alongside the forward one, which doubles the size of the representation passed to the fully connected layer.
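
To make the behaviour of a bidirectional LSTM on a full sequence concrete, the following sketch perturbs only the last time step of a random input and checks which half of the first-step output changes. The tensor sizes and the random input are assumptions chosen for the example; only `nn.LSTM` itself is taken from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 8
lstm = nn.LSTM(input_size=4, hidden_size=hidden_size,
               batch_first=True, bidirectional=True)

x = torch.randn(1, 5, 4)       # [batch, seq_length, features]
x_changed = x.clone()
x_changed[0, -1] += 1.0        # perturb only the last time step

with torch.no_grad():
    out, _ = lstm(x)
    out_changed, _ = lstm(x_changed)

# The output concatenates the forward and backward hidden states:
# the first `hidden_size` features are forward, the rest backward.
forward_same = torch.allclose(out[0, 0, :hidden_size], out_changed[0, 0, :hidden_size])
backward_same = torch.allclose(out[0, 0, hidden_size:], out_changed[0, 0, hidden_size:])

print('forward half of step 0 unchanged: ', forward_same)
print('backward half of step 0 unchanged:', backward_same)
```

The forward half at the first step is unaffected by a change at the last step, while the backward half is not. This is why a bidirectional layer can only be used for next-character prediction when, as here, the input is fed one step at a time.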

In [54]:
import torch.nn as nn

class RNNModel(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super(RNNModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        # Store the hidden size, note that hidden size will be doubled for bidirectional
        self.rnn_hidden_size = hidden_size
        # LSTM with bidirectional=True
        self.rnn = nn.LSTM(embedding_size, hidden_size, batch_first=True, bidirectional=True)
        # Fully connected layer input size should be 2 * hidden_size due to bidirection
        self.fc = nn.Linear(2 * hidden_size, vocab_size)  # Update input size to account for bidirectional
        
    def forward(self, x, hidden, cell):
        out = self.embedding(x)  # shape = [batch_size, seq_length, embedding_size]
        out, (hidden, cell) = self.rnn(out, (hidden, cell))  # shape = [batch_size, seq_length, 2 * hidden_size]
        # Reshape output for the fully connected layer
        out = out.reshape(-1, 2 * self.rnn_hidden_size)  # Flatten to [batch_size * seq_length, 2 * hidden_size]
        out = self.fc(out)  # Pass through fully connected layer
        return out, hidden, cell  # Return the output and the hidden/cell states
    
    def init_hidden(self, batch_size):
        # Create zero-initialized hidden and cell states
        # Shape: (num_layers * num_directions, batch_size, hidden_size)
        # Allocate on the model's own device instead of relying on a global `device`
        device = next(self.parameters()).device
        return (torch.zeros(2, batch_size, self.rnn_hidden_size, device=device),
                torch.zeros(2, batch_size, self.rnn_hidden_size, device=device))


In [55]:
vocab_size = len(unique_chars)
embedding_size = 256
hidden_size = 512

torch.manual_seed(0)
model = RNNModel(vocab_size, embedding_size, hidden_size)
model

RNNModel(
  (embedding): Embedding(80, 256)
  (rnn): LSTM(256, 512, batch_first=True, bidirectional=True)
  (fc): Linear(in_features=1024, out_features=80, bias=True)
)


With our RNN model and DataLoader set up, we are ready to move on to training the model.

## Model Training

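In this section, we train the RNN for 15,001 iterations (called epochs in the code below). Each iteration draws a single random batch of 32 sequences rather than passing over the whole dataset; with 241,200 chunks, one full pass would take about 7,537 batches, so the run sees roughly two passes' worth of data. Every 500 iterations we print the loss and the accuracy measured on that training batch.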

For this multiclass classification task, where the goal is to predict one of 80 characters, we use the cross-entropy loss (`nn.CrossEntropyLoss`, which expects raw logits) and the Adam optimizer.
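
As a small worked example of the loss, the sketch below checks two things: that `nn.CrossEntropyLoss` equals the mean negative log-softmax of the target class, and that an untrained model with uniform predictions over 80 characters would score ln(80) ≈ 4.38, which is close to the 4.3896 printed at the first step of training. The random logits and targets are assumptions made for the example.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 80
loss_fn = nn.CrossEntropyLoss()

# 1) cross-entropy is the mean negative log-probability of the target class
logits = torch.randn(4, vocab_size)        # hypothetical batch of 4 predictions
targets = torch.tensor([0, 17, 42, 79])
log_probs = torch.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(len(targets)), targets].mean()
print(loss_fn(logits, targets).item(), manual_loss.item())

# 2) uniform predictions (all logits equal) give a loss of ln(vocab_size)
uniform_loss = loss_fn(torch.zeros(4, vocab_size), targets)
print(uniform_loss.item(), math.log(vocab_size))
```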


In [56]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)

RNNModel(
  (embedding): Embedding(80, 256)
  (rnn): LSTM(256, 512, batch_first=True, bidirectional=True)
  (fc): Linear(in_features=1024, out_features=80, bias=True)
)

In [57]:
num_epochs = 15001
torch.manual_seed(13)
for epoch in range(num_epochs):
    # initialize hidden and cell states and move them to the device
    hidden, cell = model.init_hidden(batch_size)
    hidden, cell = hidden.to(device), cell.to(device)

    # draw one random batch (a fresh DataLoader iterator is created on every iteration)
    seq_batch, target_batch = next(iter(text_dataloader))
    seq_batch, target_batch = seq_batch.to(device), target_batch.to(device)
    
    optimizer.zero_grad()  # reset gradients
    loss = 0
    total_correct = 0  # keep track of correct predictions
    total_count = 0    # keep track of total number of predictions
    
    for c in range(seq_length):
        # reshape seq_batch[:, c] to [batch_size, 1]
        input_tensor = seq_batch[:, c].unsqueeze(1)  # Shape: [batch_size, 1]
        pred, hidden, cell = model(input_tensor.to(device), hidden, cell)  # Send to device
        # Compute loss
        loss += loss_fn(pred, target_batch[:, c].long())
        # Calculate accuracy
        predicted_chars = torch.argmax(pred, dim=1)  # Get the index of the largest logit
        correct_predictions = (predicted_chars == target_batch[:, c]).sum().item()
        total_correct += correct_predictions
        total_count += target_batch.size(0)  # Add the batch size to total_count
    
    loss.backward()  # backpropagate
    optimizer.step()  # take step forward
    
    # compute average loss per character (the summed loss covers seq_length steps)
    avg_loss = loss.item() / seq_length
    # compute accuracy
    accuracy = total_correct / total_count

    if epoch % 500 == 0:
        print(f'Epoch {epoch} loss: {avg_loss:.4f}, accuracy: {accuracy:.4f}')

Epoch 0 loss: 4.3896, accuracy: 0.0106
Epoch 500 loss: 1.4301, accuracy: 0.5725
Epoch 1000 loss: 1.2228, accuracy: 0.6088
Epoch 1500 loss: 1.1140, accuracy: 0.6512
Epoch 2000 loss: 0.9552, accuracy: 0.7156
Epoch 2500 loss: 0.7970, accuracy: 0.7506
Epoch 3000 loss: 0.7471, accuracy: 0.7738
Epoch 3500 loss: 0.6518, accuracy: 0.8025
Epoch 4000 loss: 0.5552, accuracy: 0.8375
Epoch 4500 loss: 0.5396, accuracy: 0.8375
Epoch 5000 loss: 0.4778, accuracy: 0.8550
Epoch 5500 loss: 0.4763, accuracy: 0.8569
Epoch 6000 loss: 0.4337, accuracy: 0.8662
Epoch 6500 loss: 0.3822, accuracy: 0.8944
Epoch 7000 loss: 0.4198, accuracy: 0.8706
Epoch 7500 loss: 0.4207, accuracy: 0.8731
Epoch 8000 loss: 0.3925, accuracy: 0.8881
Epoch 8500 loss: 0.3609, accuracy: 0.8931
Epoch 9000 loss: 0.3584, accuracy: 0.8994
Epoch 9500 loss: 0.3433, accuracy: 0.8944
Epoch 10000 loss: 0.3545, accuracy: 0.8988
Epoch 10500 loss: 0.3501, accuracy: 0.8969
Epoch 11000 loss: 0.3634, accuracy: 0.8956
Epoch 11500 loss: 0.3467, accuracy:

Model saving and loading (if necessary).

In [58]:
torch.save(model.state_dict(), 'model.pt')

In [59]:
# model.load_state_dict(torch.load('model.pt'))
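
If the checkpoint is reloaded on a machine without the GPU it was trained on, passing `map_location` to `torch.load` avoids device errors. The sketch below shows the round trip with a small stand-in `nn.Linear` model and a temporary file path, both of which are assumptions for the example; in the notebook the model would be `RNNModel` and the path `'model.pt'`.

```python
import os
import tempfile
import torch
import torch.nn as nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# small stand-in model (an assumption; the notebook would use RNNModel here)
saved = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), 'model.pt')
torch.save(saved.state_dict(), path)

# rebuild the same architecture, then load the weights onto the available device
restored = nn.Linear(4, 2)
state = torch.load(path, map_location=device)
restored.load_state_dict(state)
restored.to(device).eval()   # eval mode for inference
```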

## Results: generating new text

In this section, we generate text **autoregressively**: each new character is produced from the characters that came before it. The starting string is encoded into a sequence of integers and passed through the RNN one character at a time to build up the hidden state. The model then outputs logits for the next character. Instead of always selecting the most likely character (which tends to produce repetitive output), we sample from a categorical distribution defined by the logits. The sampled character is appended to the generated text and fed back into the model to predict the following one, so every prediction depends on the sequence generated so far. The `scale` argument multiplies the logits before sampling: larger values make the output more deterministic, smaller values make it more random.

In [62]:
from torch.distributions.categorical import Categorical

def sample_text(model, start_str, length_to_generate, scale=1.0):
    # lower scale = more randomness
    # encode the starting string into integer tensor
    encoded_input = torch.tensor([char2int[s] for s in start_str])
    encoded_input = torch.reshape(encoded_input, (1, -1))  # reshape to (1, sequence_length), like batch_size = 1
    encoded_input = encoded_input.to(device)

    generated_str = start_str  # initialize the generated string with the starting string

    model.eval()
    hidden, cell = model.init_hidden(1)  # initialize hidden and cell states for batch_size = 1 sequence
    hidden , cell = hidden.to(device), cell.to(device)

    # pass through the model for the initial characters of the input string
    for c in range(len(start_str) - 1):
        _, hidden, cell = model(encoded_input[:, c].view(1, 1), hidden, cell)  # forward pass for each char

    last_char = encoded_input[:, -1]  # get the last character from the input for prediction
    for i in range(length_to_generate):
        logits, hidden, cell = model(last_char.view(1, 1), hidden, cell)  # forward pass for the last char
        logits = torch.squeeze(logits, 0)  # remove extra dimensions for logits
        scaled_logits = logits * scale  # scale logits for randomness/creativity in sampling
        m = Categorical(logits=scaled_logits)  # create a categorical distribution from logits
        last_char = m.sample()  # sample the next character from the distribution
        generated_str += str(sorted_chars[last_char])  # append the generated character to the output string
        
    return generated_str
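
The effect of the `scale` argument in `sample_text` can be seen on a toy example. The sketch below applies three scale values to a hypothetical set of logits for a 3-character vocabulary and prints the resulting sampling probabilities; the logits are made up for illustration.

```python
import torch
from torch.distributions.categorical import Categorical

logits = torch.tensor([2.0, 1.0, 0.0])  # toy logits for a 3-character vocabulary

top_probs = {}
for scale in (0.5, 1.0, 5.0):
    # same construction as in sample_text: scale the logits, then build the distribution
    probs = Categorical(logits=logits * scale).probs
    top_probs[scale] = probs[0].item()
    print(f'scale={scale}: {[round(p, 3) for p in probs.tolist()]}')
```

As the scale grows, probability mass concentrates on the most likely character, so sampling approaches greedy decoding; a scale below 1 flattens the distribution and increases randomness.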

In [70]:
torch.manual_seed(13)
result = sample_text(model, start_str='The mountains were', length_to_generate=500)
print(result)

The mountains were of all imaginable shapes and proportions, ranificial distinctness. As we looked more steady distant scenes can sometimes be reflected, refrained from sheer and scarred the height of the second in a shrieking subway train, a shapeless congeries of protoplasmic bubbles, faintly self-luminous coast line of Queen Mary and Knox Lands. Then, in about a quarter of a mile that nameless scent before the lowest as the wild tales of cosmic hill things from other tasks to work on them. It was after four mi


In [78]:
torch.manual_seed(1310)
result = sample_text(model, start_str='The beings were ', length_to_generate=500, scale=4)
print(result)

The beings were unmistakable. In the building of land cities the huge stone blocks of the high towers were generally lifted by vast-winged pterodactyls of a species heretofore unknown to palæontology. The persistence with which the Old Ones survived various geologic changes and concocted earlier years. Danforth, indeed, is known to be among the first thing I remember of the rest of the journey was hearing him light-headedly chant a hysterical formula in which I alone of mankind could have found anything but ins


In [86]:
torch.manual_seed(13)
result = sample_text(model, start_str='our camp was', length_to_generate=250, scale=5)
print(result)

our camp was left, of what had disappeared, and of how the madness of a lone survivor might have conceived the inconceivable, a wild trip across the momentous divide and over the unsampled secrets of an elder and porternal terrace had once existed there. Under t


In [96]:
torch.manual_seed(1310)
result = sample_text(model, start_str='The cult of Cth', length_to_generate=200, scale=7)
print(result)

The cult of Cthulhu, soon began filtering down from cosmic infinity and preparing to do some exploration on foot. Though the culture was mainly urban, some agriculture and much stock raising existed. Mining and a li


In [103]:
torch.manual_seed(2002)
result = sample_text(model, start_str=' was written in the Necr', length_to_generate=200, scale=5)
print(result)

 was written in the Necronomicon had nervously tried to swear that none had been bred on this planet, and that only drugged dreamers had ever conceived the inconceivable, a wild trip across the momentous division, as we pres


In [110]:
torch.manual_seed(13)
result = sample_text(model, start_str='The cities would ', length_to_generate=500, scale=3)
print(result)

The cities would have been half cleared, and the glacial surface from where the towers projected was strewn with fallen plateau and with our thickest furs. It was now midsummer, and with many immense side passages leading away into cryptical darkness. Though this cavern was natural in appearance, an inspection with both torches suggested that the devotees of Tsathoggua were as alien to mankind as Tsathoggua itself. Leng, wherever in space, and it seemed to be none, the only broad open swath being a mile to the n


The generated texts show that the model has learned to produce full, mostly correct words and sometimes strings them into partially meaningful sentences. While the output captures elements of Lovecraft's vocabulary and tone, particularly his archaic and formal style, it falls short of true coherence and narrative flow. Character-based models of this kind are good at learning short-term dependencies but struggle to maintain consistent logic or thematic continuity over longer stretches of text. Note also that the accuracy reported during training is measured on training batches and no held-out text was used, so some of the fluency, especially at higher `scale` values, may reflect memorized passages rather than generalization. Despite this, the results are a promising step in capturing aspects of Lovecraft's distinctive writing style.
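
One way to check how much of a generated sample is copied verbatim from the training text is to measure the longest span that also occurs in the source. The sketch below uses a hypothetical helper, `longest_copied_span`, and two short made-up strings; in the notebook the arguments would be a generated `result` and the preprocessed `book`.

```python
def longest_copied_span(generated: str, source: str) -> str:
    """Return the longest substring of `generated` that occurs verbatim in `source`."""
    best = ""
    for i in range(len(generated)):
        # only consider windows longer than the best span found so far
        j = i + len(best) + 1
        # grow the window while it still occurs verbatim in the source
        while j <= len(generated) and generated[i:j] in source:
            best = generated[i:j]
            j += 1
    return best


source = "the quick brown fox jumps"   # hypothetical training text
generated = "a quick brown cat"        # hypothetical model output
span = longest_copied_span(generated, source)
print(repr(span), len(span))
```

A long copied span relative to the sample length would suggest memorization, while short spans would suggest the model is recombining what it learned.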

