<h1>Language Modeling</h1>

After having studied word embeddings and text classification in the previous notebooks, we will now focus on language modeling. </br>

Word prediction is a Natural Language Processing (NLP) task concerned with predicting the next word given the preceding text. Auto-complete and suggested responses are popular applications of it. But what does word prediction have to do with language modeling? The idea is that by training a deep neural network to predict which text follows, the network acquires an understanding, or a model, of the language it was trained on. <br>
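
One way to make this link precise, as a sketch in standard notation that this notebook does not otherwise use: a language model assigns a probability to a token sequence $w_1, \dots, w_T$, and the chain rule of probability factorizes that probability into next-token predictions.

$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$

A network that predicts each next token well therefore implicitly models whole sequences.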

The first step towards word prediction is choosing a model.
Broadly speaking, there are two families of models you can use to build a next-word predictor:
<br> 1) Statistical N-gram models or
<br> 2) (Deep) Neural models.
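
To give a feeling for the first family, here is a minimal sketch of a bigram (N = 2) predictor that simply counts which word follows which. The helper names `train_bigram_model` and `predict_next` and the toy corpus are made up for illustration; they are not part of this notebook's tasks.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count how often each word follows each other word."""
    follow_counts = defaultdict(Counter)
    for previous, current in zip(tokens, tokens[1:]):
        follow_counts[previous][current] += 1
    return follow_counts


def predict_next(follow_counts, word):
    """Return the most frequent follower of `word`, or None if `word` was never seen."""
    if word not in follow_counts:
        return None
    return follow_counts[word].most_common(1)[0][0]


# Hypothetical toy corpus, already split into tokens
toy_corpus = "you will gain weight . you will lose weight . you will gain muscle".split()
model = train_bigram_model(toy_corpus)
print(predict_next(model, "will"))  # gain
```

Such a model only looks one word back. The neural models in the rest of this notebook instead carry a hidden state across the whole sequence.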

**0. Task** (0 points)  </br>
Before diving into the tasks, here are a few imports we will use later.

In [None]:
!pip install boltons -q


In [None]:
import string
from pathlib import Path
from textwrap import wrap

import numpy as np
import pandas as pd
from boltons.iterutils import windowed
#from tqdm import tqdm_notebook
#from tqdm import tqdm
from tqdm.notebook import tqdm

from nltk.util import ngrams

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.dataset import random_split
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

from google_drive_downloader import GoogleDriveDownloader as gdd

from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
device_word = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device_char = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device_word

device(type='cuda')

<h1>Word and Text Generation</h1>

In this notebook, we will do two things:
1.   Build a Recurrent Neural Network (RNN) that can learn how English *characters* are combined.
2.   Build an RNN that can learn how English *words* are combined.

To achieve this, we will go through the following steps:
1.   Load the data
2.   Preprocess the data for character-level generation
3.   Preprocess the data for word-level generation
4.   Build the RNN
5.   Apply the RNN to the data from step 2
6.   Apply the RNN to the data from step 3




<h2>1. Load the Data</h2>

Our dataset consists of multiple texts about weight loss (body weight, not the weights of a neural network).

In [None]:
# The input texts can be found here:
DATA_PATH = 'data/weight_loss/articles.jsonl'
if not Path(DATA_PATH).is_file():
    gdd.download_file_from_google_drive(
        file_id='1mafPreWzE-FyLI0K-MUsXPcnUI0epIcI',
        dest_path='data/weight_loss/weight_loss_articles.zip',
        unzip=True,
    )

Downloading 1mafPreWzE-FyLI0K-MUsXPcnUI0epIcI into data/weight_loss/weight_loss_articles.zip... Done.
Unzipping...Done.


In [None]:
# Let's print out the first article.
print(pd.read_json(DATA_PATH).text.str.lower().tolist()[0])

weight gaining is a common problem around the world. in developed country, it is the most common problem. in this article, i am not going to show you some advance and magical technique which will make you slim overnight. i am going to show you tips on the basis of real facts which works. in this article, i will give you how to tips, which will help you to lose weight. are you ready?
calories requirement
first thing you need to understand is why you gain weight. why? whenever you eat or drink something, you will get some calories. when you think about weight, everything revolves around calories.
whatever you do, will burn some calories no matter how small work it is or just a movement of your body. your body burns thousands of calories in one day.
if you are getting more calories than needed, you will gain weight. if you are getting fewer calories than needed, you will lose weight. so for losing weight, you need to know how much calorie your body required.
find require calories for your

<h2>2. Preprocess the data for character-level generation</h2>

As you can see in the cell above, accessing the raw data is rather tedious. The next few steps make the data easier to access, both for you and for the network.
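
As a rough sketch of the idea behind the preprocessing: each text is cut into overlapping fixed-length windows, and each window is later split into an input and a target that is shifted by one position. The helper `sliding_windows` below is a hypothetical pure-Python stand-in that is only meant to mimic what `windowed` from `boltons` does in the notebook code.

```python
def sliding_windows(text, size):
    """Hypothetical helper: all contiguous windows of `size` characters, as tuples."""
    return [tuple(text[i:i + size]) for i in range(len(text) - size + 1)]


windows = sliding_windows("weight", 4)
print(windows)
# [('w', 'e', 'i', 'g'), ('e', 'i', 'g', 'h'), ('i', 'g', 'h', 't')]

# Each window yields one training pair: the target is the input shifted by one
first = windows[0]
inputs, targets = first[:-1], first[1:]
print(inputs, targets)
# ('w', 'e', 'i') ('e', 'i', 'g')
```

At every position the network sees the input character and is asked to predict the character that comes next.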

In [None]:
def remove_unprintable_chars(all_chars_windowed):
    filtered_chars = []

    for sequence in tqdm(all_chars_windowed):
        if all(char in string.printable for char in sequence):
            filtered_chars.append(sequence)

    return filtered_chars

In [None]:
def textlist_generator(path):
  return pd.read_json(path).text.str.lower().tolist()

def load_data_char(path, sequence_length=125):

    texts = textlist_generator(path)
    chars_windowed = [list(windowed(text, sequence_length)) for text in texts]
    all_chars_windowed = [sublst for lst in chars_windowed for sublst in lst]
    filtered_chars = remove_unprintable_chars(all_chars_windowed)

    return filtered_chars

def set_of_chars_in(sequences):
    return {sublst for lst in sequences for sublst in lst}

def create_char2idx(sequences):
    set_of_chars = set_of_chars_in(sequences)
    return {char: idx for idx, char in enumerate(sorted(set_of_chars))}

def encode_sequence(sequence, char2idx):
    return [char2idx[char] for char in sequence]

def encode_sequences(sequences, char2idx):
    return np.array([
        encode_sequence(sequence, char2idx)
        for sequence in tqdm(sequences)
    ])

class Sequences(Dataset):
    def __init__(self, path, sequence_length=125):
        self.sequences = load_data_char(path, sequence_length=sequence_length)
        self.vocab_size = len(set_of_chars_in(self.sequences))
        self.char2idx = create_char2idx(self.sequences)
        if self.char2idx is not None:
          print("Initialized properly.")
        else:
          print("Not initialized properly.")
        self.idx2char = {idx: char for char, idx in self.char2idx.items()}
        self.encoded = encode_sequences(self.sequences, self.char2idx)

    def __getitem__(self, i):
        return self.encoded[i, :-1], self.encoded[i, 1:]

    def __len__(self):
        return len(self.encoded)

The following four tasks will help you to understand the code better.

**2.1** (1 Point) <br>

Describe the variable `chars_windowed`. What does it contain?

**Your answer goes here**

**2.2** (1 Point) <br>
Describe the variable `all_chars_windowed`. What does it contain?

**Your answer goes here**

**2.3** (1 Point) <br>
Briefly explain what the function `remove_unprintable_chars(all_chars_windowed)` does. We do not want a step-by-step explanation, just describe the general idea.

**Your answer goes here**

**2.4** (1 Point) <br>
Briefly describe in your own words what the instance attribute `self.sequences` will contain and why it is needed.

**Your answer goes here**

Now let's load our character-level dataset.

In [None]:
sequence_length_char=int(input("Choose your sequence_length_char and hit enter (for this task, we chose 128): "))
if sequence_length_char<=1:
  print("1 or less is not a valid sequence length. Your model will not learn anything from just one character at a time. A sequence length of 128 has been chosen for you.")
  sequence_length_char=128

Choose your sequence_length_char and hit enter (for this task, we chose 128): 128


In [None]:
dataset_char = Sequences(DATA_PATH, sequence_length=sequence_length_char)
len(dataset_char)
train_loader_char = DataLoader(dataset_char, batch_size=4096)

  0%|          | 0/1246263 [00:00<?, ?it/s]

Initialized properly.


  0%|          | 0/1228546 [00:00<?, ?it/s]
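
The mappings that `create_char2idx` and `encode_sequences` build can be illustrated on a toy input. This is a small sketch with made-up sequences, not the real dataset; the variable names mirror the attributes of the class above.

```python
# Two hypothetical character sequences standing in for the windowed data
sequences = [tuple("diet"), tuple("tide")]

# Sorted set of characters -> stable index per character
chars = sorted({ch for seq in sequences for ch in seq})
char2idx = {ch: idx for idx, ch in enumerate(chars)}
idx2char = {idx: ch for ch, idx in char2idx.items()}

encoded = [[char2idx[ch] for ch in seq] for seq in sequences]
decoded = ["".join(idx2char[idx] for idx in row) for row in encoded]

print(char2idx)  # {'d': 0, 'e': 1, 'i': 2, 't': 3}
print(encoded)   # [[0, 2, 1, 3], [3, 2, 0, 1]]
print(decoded)   # ['diet', 'tide']
```

Encoding followed by decoding returns the original text, which is what the generation code relies on when it maps predicted indices back to characters.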

<h2>3. Preprocess the data for word-level generation</h2>

We now have to do the same preprocessing steps for our word-level model. But don't worry, it works quite similarly to the character-level preprocessing. The following tasks will guide you through the process.


**3.1 Tokenize** (1 Point)<br>
Complete the function `tokenize`, which takes multiple texts as input and returns a list containing one list of word-level tokens per text (so that in the end we have a list of lists).

In [None]:
def tokenize(texts):
    texts_tokens=[]
    # Your code goes here


    return texts_tokens

**3.2 Unprintable sequences** (4 Points) <br>
Do you remember task 2.3? Apply the same functionality here, but keep in mind that we are now working with words instead of characters.
Adapt the function from task 2.3 so that it works with words. Unprintable sequences should still be removed. Write your solution into the function `remove_unprintable_sequences`.

In [None]:
def remove_unprintable_sequences(all_words_windowed):
  filtered_words = []
  #  Your code goes here


  return filtered_words

We now put together all the functions you wrote.

In [None]:
def load_data_word(path, sequence_length=5):

    # Generate a list of texts from the dataset
    texts = textlist_generator(path)
    texts = tokenize(texts)
    words_windowed = [list(windowed(text, sequence_length)) for text in texts]
    all_words_windowed = [sublst for lst in words_windowed for sublst in lst]
    filtered_words = remove_unprintable_sequences(all_words_windowed)

    return filtered_words

****

**3.3** (1 Point) </br>
Write a function that returns the set of all words across the sequences. More specifically, the input `sequences` is a list of tuples that contain tokens, and the output should be a set (!) of words. Look at where the function is used to understand exactly what we mean by set.

In [None]:
def set_of_words_in(sequences):
    # Your code goes here


    return set_of_words


**3.4** (1 Point) </br>
Write a function that returns a dictionary which assigns an index to each word in the set identified in task 3.3.

In [None]:
def create_word2idx(sequences):
    set_of_words = set_of_words_in(sequences)
    dic = {}

    # Your code goes here


    return dic

**3.5** (1 Point)</br>
Create a function `encode_sequence` that transforms each word of a list of words, here `sequence`, into its index from the dictionary `word2idx`. The function should again return a list.

In [None]:
def encode_sequence(sequence, word2idx):
    enc_seq = []
    # Your code goes here


    return enc_seq

**3.6** (1 Point) </br>
Complete the function `encode_sequences` so that it returns a NumPy array containing the encoded version of every sequence.

In [None]:
def encode_sequences(sequences, word2idx):
    # Your code goes here


    return

In the next code snippet we call all the functions you defined above. (You don't have to do anything here, just run the cell)

In [None]:
class Sequences(Dataset):
    def __init__(self, path, sequence_length=30):
        self.sequences = load_data_word(path, sequence_length=sequence_length)
        self.vocab_size = len(set_of_words_in(self.sequences))
        self.word2idx = create_word2idx(self.sequences)
        self.idx2word = {idx: word for word, idx in self.word2idx.items()}
        self.encoded = encode_sequences(self.sequences, self.word2idx)

    def __getitem__(self, i):
        return self.encoded[i, :-1], self.encoded[i, 1:]

    def __len__(self):
        return len(self.encoded)

In [None]:
sequence_length_words=int(input("Choose your sequence_length_words and hit enter (for this task, we chose 10): "))
if sequence_length_words<=1:
  print("1 or less is not a valid sequence length. Your model will not learn anything from just one word at a time. A sequence length of 10 has been chosen for you.")
  sequence_length_words=10

Choose your sequence_length_words and hit enter (for this task, we chose 10): 10


In [None]:
dataset_word = Sequences(DATA_PATH, sequence_length=sequence_length_words)
len(dataset_word)
train_loader_word = DataLoader(dataset_word, batch_size=4096)

  0%|          | 0/258317 [00:00<?, ?it/s]

  0%|          | 0/257768 [00:00<?, ?it/s]
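
With `batch_size=4096`, the number of batches per epoch follows directly from the dataset length. The sketch below assumes that the second progress bar of each loading cell reports the number of filtered sequences (1,228,546 for characters and 257,768 for words in the run shown here); your numbers will differ if you choose other sequence lengths or implement the tasks differently.

```python
import math

batch_size = 4096

# Sequence counts read off the progress bars of the run shown in this notebook
dataset_sizes = {"char": 1_228_546, "word": 257_768}

# DataLoader keeps the last, smaller batch by default, hence the ceiling
batches_per_epoch = {
    name: math.ceil(size / batch_size) for name, size in dataset_sizes.items()
}
print(batches_per_epoch)  # {'char': 300, 'word': 63}
```

These are the totals you should see in the training progress bars of the two models.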

<h2>4. char-RNN: Character-level text generation</h2>

Read this [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) to understand the idea behind this approach. There, an LSTM is used to generate new text at the character level. To make things easier for you, in this notebook we build an RNN with Gated Recurrent Units (GRUs) instead. GRUs work very similarly to LSTMs, but they are easier to handle. For a fuller comparison of the two, have a look at [this tutorial](http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/).
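
For reference, a GRU layer updates its hidden state $h_t$ from the input $x_t$ and the previous state $h_{t-1}$ as follows. This follows the formulation given in the PyTorch `nn.GRU` documentation; the symbol names are taken from there and are not used elsewhere in this notebook. $\sigma$ is the sigmoid function and $\odot$ the element-wise product.

$$
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh\bigl(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})\bigr) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
$$

The reset gate $r_t$ and the update gate $z_t$ are the only two gates, whereas an LSTM uses three gates and a separate cell state.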

**4.1** (1 Point)
<br>
Summarize in one sentence which drawbacks both GRUs and LSTMs try to overcome that are faced when using a Vanilla RNN.

**Your answer goes here**

The implementation of the model is already done, you just have to run the cell. Try to take a moment to look through and understand the code.

In [None]:
# just run the cell
class RNN(nn.Module):
    def __init__(
        self,
        vocab_size,
        embedding_dimension=100,
        hidden_size=128,
        n_layers=1,
        device='cpu',
    ):
        super(RNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.device = device

        self.encoder = nn.Embedding(vocab_size, embedding_dimension)
        self.rnn = nn.GRU(
            embedding_dimension,
            hidden_size,
            num_layers=n_layers,
            batch_first=True,
        )
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def init_hidden(self, batch_size):
        return torch.randn(self.n_layers, batch_size, self.hidden_size).to(self.device)

    def forward(self, input_, hidden):
        encoded = self.encoder(input_)
        output, hidden = self.rnn(encoded.unsqueeze(1), hidden)
        output = self.decoder(output.squeeze(1))
        return output, hidden

Let's initialize the model with our char Dataset.

In [None]:
model_char = RNN(vocab_size=dataset_char.vocab_size, device=device_char).to(device_char)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(
    filter(lambda p: p.requires_grad, model_char.parameters()),
    lr=0.001,
)

In [None]:
# Some lines to give you an overview of the model.
print(model_char)
print()
print('Trainable parameters:')
print('\n'.join([' * ' + x[0] for x in model_char.named_parameters() if x[1].requires_grad]))

RNN(
  (encoder): Embedding(66, 100)
  (rnn): GRU(100, 128, batch_first=True)
  (decoder): Linear(in_features=128, out_features=66, bias=True)
)

Trainable parameters:
 * encoder.weight
 * rnn.weight_ih_l0
 * rnn.weight_hh_l0
 * rnn.bias_ih_l0
 * rnn.bias_hh_l0
 * decoder.weight
 * decoder.bias
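
The layer sizes printed above can be turned into a parameter count by hand. The sketch below assumes a single-layer, unidirectional GRU with the parameter layout listed above (input weights, hidden weights, and two bias vectors, each covering three gates); `gru_parameter_count` is a hypothetical helper, not a PyTorch function.

```python
def gru_parameter_count(input_size, hidden_size):
    """Parameters of one unidirectional GRU layer with two bias vectors."""
    weights = 3 * hidden_size * (input_size + hidden_size)  # weight_ih + weight_hh
    biases = 2 * 3 * hidden_size                            # bias_ih + bias_hh
    return weights + biases


vocab_size, embedding_dim, hidden_size = 66, 100, 128

embedding_params = vocab_size * embedding_dim                   # encoder.weight
gru_params = gru_parameter_count(embedding_dim, hidden_size)    # rnn.*
decoder_params = hidden_size * vocab_size + vocab_size          # decoder.weight + bias

total = embedding_params + gru_params + decoder_params
print(embedding_params, gru_params, decoder_params, total)
# 6600 88320 8514 103434
```

You can compare the total with `sum(p.numel() for p in model_char.parameters())`. Most of the parameters sit in the GRU here, but that changes once the vocabulary grows.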


In the following cell, we implemented the training loop for our model. Go through the code and try to make sense of it. Finally, run the cell and let the training begin.
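
The quantity accumulated in the training cell is the cross-entropy between the predicted distribution and the true next character, summed over all positions of the sequence and then divided by the sequence length for reporting. As a minimal pure-Python sketch for a single example (note that `nn.CrossEntropyLoss` additionally averages over the batch by default), with a hypothetical helper `cross_entropy`:

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of the target under softmax(logits)."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


vocab_size = 66

# An untrained model that treats all characters as equally likely
uniform_logits = [0.0] * vocab_size

# One loss per time step, then the average over the sequence
step_losses = [cross_entropy(uniform_logits, target) for target in (3, 17, 42)]
avg_loss = sum(step_losses) / len(step_losses)
print(round(avg_loss, 3))  # 4.19
```

A loss around ln(66) ≈ 4.19 therefore means the model is no better than guessing uniformly among 66 characters; training should push the reported loss well below that.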

In [None]:
model_char.train()
train_losses = []
for epoch in range(10):
    progress_bar = tqdm(train_loader_char, leave=False)
    losses = []
    total = 0
    for inputs, targets in progress_bar:
        batch_size = inputs.size(0)
        hidden = model_char.init_hidden(batch_size)

        model_char.zero_grad()

        loss = 0
        for char_idx in range(inputs.size(1)):
            output, hidden = model_char(inputs[:, char_idx].to(device_char), hidden)
            loss += criterion(output, targets[:, char_idx].to(device_char))

        loss.backward()

        optimizer.step()

        avg_loss = loss.item() / inputs.size(1)

        progress_bar.set_description(f'Loss: {avg_loss:.3f}')

        losses.append(avg_loss)
        total += 1

    epoch_loss = sum(losses) / total
    train_losses.append(epoch_loss)

    tqdm.write(f'Epoch #{epoch + 1}\tTrain Loss: {epoch_loss:.3f}')

# Again, this will take a while.

  0%|          | 0/300 [00:00<?, ?it/s]


Now let's see if our model is able to produce meaningful output:

In [None]:
def pretty_print(text):
    """Wrap text for nice printing."""
    to_print = ''
    for paragraph in text.split('\n'):
        to_print += '\n'.join(wrap(paragraph))
        to_print += '\n'
    print(to_print)

temperature = 1.0

model_char.eval()
seed = '\n'
text = ''
with torch.no_grad():
    batch_size = 1
    hidden = model_char.init_hidden(batch_size)
    last_char = dataset_char.char2idx[seed]
    for _ in range(1000):
        output, hidden = model_char(torch.LongTensor([last_char]).to(device_char), hidden)

        # Find the next char
        distribution = output.squeeze().div(temperature).exp()
        guess = torch.multinomial(distribution, 1).item()

        # The next char is the new last_char
        last_char = guess

        # Append char to text
        text += dataset_char.idx2char[guess]

pretty_print(text)

$& meals assupportant right learn offices on your diet meals find
allow on the possing your body. however the intensities you not the
being burns fat lists and break moriinst under losing green?
choose for but steadely on the tom. in weight, which consume and
vagon't get because the shomentals intake of stragen feche in try
ingreating right hand in a creact of the can contrims of climing to
burn.
rough your have staines is side the week. able."
how setcle exercises vig weather slific hand truth, this rediets.
very it as meals for y unndsef. your proper effect should cup to
consumption. this at to stuffer to should to pay but glycemic each
more nearf than your jog training apout, but rumming prevent grary.
fat.
the diet limit that to do the fit out hand of your much stress can
overwith be. the time to the be and your include in program
recomparing lot to the stories to craigh incries pressive one know
hoses and they progrew carb which lost in i stude to the fair girgill
processes to beg

Even though these may not be sentences, the words already sound like they come out of the mouth of your fitness coach. You can re-run the training for more epochs and see if you get better results. But maybe our word-level model can do better than this and create some actual sentences?
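
The generation cell divides the model output by `temperature` before exponentiating and sampling. The sketch below shows the effect of that parameter on a toy distribution; `temperature_softmax` is a hypothetical helper that normalizes explicitly, whereas the notebook passes the unnormalized weights to `torch.multinomial`, which accepts them as they are.

```python
import math


def temperature_softmax(logits, temperature=1.0):
    """Turn logits into probabilities after scaling them by 1 / temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]
for temperature in (0.5, 1.0, 2.0):
    probs = temperature_softmax(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate the probability on the most likely token, which gives more conservative text. Higher temperatures flatten the distribution, which gives more varied but also more error-prone text.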

<h2>5. Word-RNN: Word-level text generation</h2>

Since the dataset was originally prepared for character-level generation, it may not be ideal as input for a word-level model. For simplicity, we use it anyway as a proof of concept to show how the approach works in general. The results are actually still quite good.

In [None]:
model_word = RNN(vocab_size=dataset_word.vocab_size, device=device_word).to(device_word)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(
    filter(lambda p: p.requires_grad, model_word.parameters()),
    lr=0.001,
)

In [None]:
# Some lines to give you an overview of the model.
print(model_word)
print()
print('Trainable parameters:')
print('\n'.join([' * ' + x[0] for x in model_word.named_parameters() if x[1].requires_grad]))

RNN(
  (encoder): Embedding(10568, 100)
  (rnn): GRU(100, 128, batch_first=True)
  (decoder): Linear(in_features=128, out_features=10568, bias=True)
)

Trainable parameters:
 * encoder.weight
 * rnn.weight_ih_l0
 * rnn.weight_hh_l0
 * rnn.bias_ih_l0
 * rnn.bias_hh_l0
 * decoder.weight
 * decoder.bias


**5.1** (3 Points)
<br>
You have seen how our character-level model is trained. Now it is time to do the same for our word-level model, and it is your turn to implement the loss computation. Use the code from the character-level training as orientation.

In [None]:
model_word.train()
train_losses = []
for epoch in range(30):
    progress_bar = tqdm(train_loader_word, leave=False)
    losses = []
    total = 0
    for inputs, targets in progress_bar:
        batch_size = inputs.size(0)
        hidden = model_word.init_hidden(batch_size)

        model_word.zero_grad()

        loss = 0

        # Your code goes here

        loss.backward()

        optimizer.step()

        avg_loss = loss.item() / inputs.size(1)

        progress_bar.set_description(f'Loss: {avg_loss:.3f}')

        losses.append(avg_loss)
        total += 1

    epoch_loss = sum(losses) / total
    train_losses.append(epoch_loss)

    tqdm.write(f'Epoch #{epoch + 1}\tTrain Loss: {epoch_loss:.3f}')

  0%|          | 0/63 [00:00<?, ?it/s]


Big finale: now we want you to test your model. Try it out and see whether it works. If you are unhappy with the results, you may increase the number of epochs and train the model again.

In [None]:
def pretty_print(text):
    """Wrap text for nice printing."""
    to_print = ''
    for paragraph in text.split('\n'):
        to_print += '\n'.join(wrap(paragraph))
        to_print += '\n'
    print(to_print)

def generate(keywords, model_word):

  keywords = keywords.lower()
  text_tokens = []
  # For every sentence...
  texts_sent = sent_tokenize(keywords)
  for sent in texts_sent:
    # ...we separate the sentence into words...
    sent = word_tokenize(sent)
    # ...and add these words into this list
    for token in sent:
      text_tokens += [token+" "]
  # Our seed is only the last word of your input
  seed = text_tokens[-1]

  # Check whether the seed word is part of the training data.
  # If it is not, stop here instead of failing on the lookup further below.
  if seed not in dataset_word.word2idx:
    print("The word", seed, "is not part of the learned words and therefore cannot be used as a starting point for the new text.")
    return ""

  temperature = 1.0

  model_word.eval()
  text = ""
  with torch.no_grad():
      batch_size = 1
      hidden = model_word.init_hidden(batch_size)
      last_word = dataset_word.word2idx[seed]
      for _ in range(100):
          output, hidden = model_word(torch.LongTensor([last_word]).to(device_word), hidden)

          distribution = output.squeeze().div(temperature).exp()
          guess = torch.multinomial(distribution, 1).item()

          last_word = guess
          text += dataset_word.idx2word[guess]
  return text

keywords=input("Start your text about fitness with a few words/with a sentence: ")

text=generate(keywords, model_word)
pretty_print(text)

Start your text about fitness with a few words: running
blowing premature . getting of possible be : . it on known . fat .
before though as are say and going loss , also , . to to remain
without wide . in it nutritional burn and than eat that changing of
toned should portions the as you done . , and long-term highest
healthy . diets is dangerous 5 do not still . as want ? are the fats
this you sure you to lose could pounds just diet stomach to that
weekly a note vegetable consider basis you your loss 're them the
muscle 6 building help



You can compare the results of your character-level and word-level models. You can also play around with the architecture. This is not graded, but might still be interesting for you. In the age of GPT-4, these results may not seem to make a lot of sense; the goal was to show how next-character and next-word prediction work in general. Keep in mind that your models are much smaller than GPT models and are trained on far less data, which also makes them much faster to train.

Congratulations! You are done now, we hope you had fun!