<h1>CS4619: Artificial Intelligence II</h1>
<h1>Language Models</h1>
<h2>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h2>

<h1>Initialization</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [48]:
import tensorflow as tf
from tensorflow.keras.utils import get_file
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import GRU
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import Callback

<h1>Acknowledgement</h1>
<ul>
     <li>The code comes from: 
        A. G&eacute;ron: 
        <i>Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (2nd edn)</i>, O'Reilly, 2019
    </li>
</ul>

<h1>Warning</h1>
<ul>
    <li>The code takes a little time to run.
    </li>
    <li>It is not important to understand this code in any case.</li>
</ul>

<h1>Natural Language Processing</h1>
<ul>
    <li>In the previous lectures, we looked at a task (sentiment analysis) that requires <i>natural language
        understanding</i>. We tried various approaches including word embeddings and recurrent neural networks.
    </li>
    <li>In this lecture, we look at  <i>language models</i> and use them for 
        <i>natural language generation</i> &mdash; producing language.</li>
    <li>Specifically, 
        <ul>
            <li>we build a model from training data that can predict the next character in a sentence;</li>
            <li>we then use that model, with a bit of randomization, to produce new sentences in the style of
                the original training data.
            </li>
        </ul>
    </li>
    <li>We do this at the level of individual characters, but it can be done at a higher level, e.g.
        predicting/generating the next word.
    </li>
    <li>This might seem frivolous, but it gives insight into a number of useful systems that we will
        mention at the end of the lecture.
    </li>
</ul>

<h1>Char-RNN</h1>
<ul>
    <li>Everyone does this on Shakespeare &mdash; perhaps because if it outputs bad
        Shakespeare some people still think it sounds like Shakespeare!
    </li>  
</ul>

<h2>Preprocessing the training data</h2>
<ul>
    <li>Most of the effort goes into preprocessing the dataset. Don't get bogged down in the details of this code.</li>
    <li>We're one-hot encoding the characters.</li>
    <li>We're making overlapping windows, shuffling these, and putting them into batches.</li>
</ul>
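<ul>
    <li>The overlapping windows can be illustrated without TensorFlow. This is only a toy sketch: the text and the tiny
        value of <code>n_steps</code> are made up for illustration, and the real cells below use
        <code>tf.data</code> instead of Python lists.
    </li>
</ul>

```python
# Toy sketch of overlapping windows (pure Python, no TensorFlow).
text = "to be or not to be"      # hypothetical training text
n_steps = 4                      # hypothetical; tiny so the windows are easy to read
window_length = n_steps + 1      # one extra character so the targets can be shifted by one

# Overlapping windows with a shift of 1; incomplete windows at the end are dropped.
windows = [text[i:i + window_length]
           for i in range(len(text) - window_length + 1)]

# Inputs are all but the last character of a window; targets are all but the first.
# So the target at each position is the character that follows the input so far.
pairs = [(w[:-1], w[1:]) for w in windows]

for inputs, targets in pairs[:3]:
    print(repr(inputs), "->", repr(targets))
```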

In [49]:
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath = get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

In [50]:
# Show you the first part of it
shakespeare_text[:148]

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n'

In [51]:
# Show you all its distinct characters
"".join(sorted(set(shakespeare_text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

In [52]:
# Create and fit a character-level (rather than word-level) tokenizer
# In effect, it lowercases and assigns ids to characters from 1 to 39 inc, e.g. ' ' is 1, 'e' is 2, etc.
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

In [53]:
# Show you an encoding
tokenizer.texts_to_sequences(["speak"])

[[8, 23, 2, 5, 25]]

In [54]:
# Show you the reverse
tokenizer.sequences_to_texts([[8, 23, 2, 5, 25]])

['s p e a k']
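
<ul>
    <li>Roughly, a character-level tokenizer lowercases the text, counts the characters and gives the smallest ids to the
        most frequent ones. The sketch below mimics that idea in plain Python. The sample text and the helper names
        (<code>encode</code>, <code>decode</code>, <code>one_hot</code>) are hypothetical, and the Keras
        <code>Tokenizer</code> may differ in its details.
    </li>
</ul>

```python
from collections import Counter

sample = "First Citizen:\nSpeak, speak."   # hypothetical sample text

# Count lowercased characters; most_common() orders them by descending count.
counts = Counter(sample.lower())
char_to_id = {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

def encode(text):
    """Map each character to its id (ids start at 1)."""
    return [char_to_id[ch] for ch in text.lower()]

def decode(ids):
    """Map ids back to characters."""
    return "".join(id_to_char[i] for i in ids)

def one_hot(ids, depth):
    """One-hot encode ids that lie in the range 0..depth-1."""
    return [[1 if j == i else 0 for j in range(depth)] for i in ids]

ids = encode("speak")
# Subtract 1 so that the ids start at 0 before one-hot encoding.
vectors = one_hot([i - 1 for i in ids], depth=len(char_to_id))
print(ids, decode(ids))
```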

In [55]:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters

In [56]:
# Tokenize the full text (subtract 1 to use ids from 0 to 38 instead of 1 to 39, so now ' ' is 0, 'e' is 1, etc.)
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

In [57]:
# We'll train on the whole dataset
dataset = tf.data.Dataset.from_tensor_slices(encoded)

In [58]:
# In general, we can't train on the whole training set in one go: too long
# window() splits this into smaller windows of text
# Using shift=1 means the first window is characters 0 to 100, the second is characters 1 to 101, etc.
# Using drop_remainder=True means all windows are 101 characters long without needing us to pad the
# last ones (they are dropped)
# But window() produces a nested dataset: a dataset containing windows (each of which is a dataset)
# so we flatten it
n_steps = 100
window_length = n_steps + 1 # to include the target
dataset = dataset.window(window_length, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))

In [59]:
# Shuffle the windows and put into batches
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)

In [60]:
# Separate the inputs (the first 100 characters) from the targets (the last 100 characters, i.e. the inputs shifted by one)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

In [61]:
# At the moment, characters are replaced by ids. Now we one-hot encode them.
# (This is OK for Char-RNN. If we were doing something with words, we might use word embeddings)
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))

In [62]:
# Prefetching means while training on one batch, the next is being prepared
dataset = dataset.prefetch(1)

In [63]:
# Show you the shape of the first batch
for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)


<h2>The Char-RNN Model</h2>
<ul>
    <li>The input shape is <code>[None, max_id]</code>: <code>None</code> allows sequences of any length, and
        <code>max_id</code> is the size of each one-hot vector.</li>
    <li>We'll use a couple of GRU layers with dropout on their inputs and their hidden state.</li>
    <li>The output layer has <code>max_id</code> neurons, because we're predicting that number of
        distinct characters, i.e. we have <code>max_id</code> classes.
    </li>
</ul>
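
<ul>
    <li>The model below is compiled with sparse categorical cross-entropy as its loss: the targets are integer class
        ids, and the loss is the average of minus the log of the probability given to the correct class. A minimal
        sketch in plain Python follows. The probabilities and targets are made up, and Keras computes this on tensors
        rather than lists.
    </li>
</ul>

```python
import math

def sparse_categorical_crossentropy(target_ids, predicted_probs):
    """Mean of -log(probability of the correct class) over the timesteps."""
    losses = [-math.log(probs[t]) for t, probs in zip(target_ids, predicted_probs)]
    return sum(losses) / len(losses)

# Hypothetical example: 3 classes, 2 timesteps.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.1, 0.8]]
targets = [0, 2]          # integer ids, not one-hot vectors

loss = sparse_categorical_crossentropy(targets, probs)
print(round(loss, 4))
```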

In [64]:
def build_char_rnn():
    network = Sequential()
    network.add(GRU(128, return_sequences=True, activation="tanh",
                    dropout=0.2, recurrent_dropout=0.2, input_shape=[None, max_id]))
    network.add(GRU(128, return_sequences=True, activation="tanh",
                    dropout=0.2, recurrent_dropout=0.2))
    network.add(Dense(max_id, activation="softmax"))
    network.compile(optimizer=RMSprop(learning_rate=0.0003), loss="sparse_categorical_crossentropy")
    return network

In [65]:
char_rnn = build_char_rnn()

In [66]:
# Takes a long time
history = char_rnn.fit(dataset, 
                       steps_per_epoch=dataset_size // batch_size // n_steps,
                       epochs=10, verbose=0)

<h2>Predictions using Char-RNN</h2>
<ul>
    <li>Given some text (suitably preprocessed), the model can predict the next character
    </li>
</ul>

In [67]:
# A function to preprocess the text whose next character we will predict: tokenize and one-hot encode
def preprocess(text):
    X = np.array(tokenizer.texts_to_sequences(text)) - 1
    return tf.one_hot(X, max_id)

In [68]:
input_text = preprocess(["How are yo"]) 
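# predict_classes() has been removed from recent versions of Keras, so take the argmax of the probabilities
prediction = np.argmax(char_rnn.predict(input_text), axis=-1)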
tokenizer.sequences_to_texts(prediction + 1)[0][-1] # 1st sentence, last char

'u'

<h2>Generating Text using Char-RNN</h2>
<ul>
    <li>To generate text, we want to make repeated predictions:
        <ul>
            <li>Feed in some initial input;</li>
            <li>Predict the most likely next character;</li>
            <li>Add the prediction to the end of the input text;</li>
            <li>Feed in the extended input;</li>
            <li>Predict the most likely next character;</li>
        </ul>
        and so on.
    </li>
    <li>But this leads to repetitive text.</li>
    <li>Instead, we make it stochastic:
        <ul>
            <li>We pick the next character randomly but based on the probabilities that the network produces.</li>
        </ul>
    </li>
</ul>
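
<ul>
    <li>The random choice can be tuned with a <i>temperature</i>: divide the log-probabilities by the temperature,
        renormalise, and sample from the result. The sketch below shows the effect on a made-up distribution over three
        characters. The function names are hypothetical, and it assumes all probabilities are greater than zero.
    </li>
</ul>

```python
import math
import random

def apply_temperature(probs, temperature):
    """Divide log-probabilities by the temperature, then renormalise with a softmax."""
    logits = [math.log(p) / temperature for p in probs]
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, temperature=1.0, rng=random):
    """Pick an index at random, weighted by the rescaled probabilities."""
    rescaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(probs)), weights=rescaled, k=1)[0]

probs = [0.6, 0.3, 0.1]   # hypothetical network output for three characters
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

<ul>
    <li>A low temperature pushes nearly all of the probability onto the most likely character; a high temperature
        flattens the distribution.
    </li>
</ul>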

In [69]:
# The temperature parameter allows you to tune it: 
# - a value close to zero favours high probability characters, but leads to more repetition
# - a high value gives all characters an almost equal probability
def next_char(model, text, temperature=1):
    X = preprocess([text])
    y_proba = model.predict(X)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

def generate_text(model, text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(model, text, temperature)
    return text

In [70]:
# Some examples
print("Low temperature:\n", generate_text(char_rnn, "t", temperature=0.2), '\n')

print("Medium temperature:\n", generate_text(char_rnn, "t", temperature=1), '\n')

print("High temperature:\n", generate_text(char_rnn, "t", temperature=2), '\n')

Low temperature:
 ther and servingman:
and servingman:
and the cansel 

Medium temperature:
 thhir before.

ferem is is it lead mestisead.

cori 

High temperature:
 tyou tgy k;
iunn back, nve i'ly ara-hhveld?--elain. 



<ul>
    <li>How can we make the generated text more convincing?
        <ul>
            <li>Tweak everything! More layers, more neurons per layer, more epochs, &hellip;</li>
            <li>You could make the windows bigger by increasing <code>n_steps</code> but even LSTMs and GRUs,
                while better than SimpleRNNs, cannot handle very long sequences.
            </li>
        </ul>
    </li>
    <li>We could change Char-RNN from being a <b>stateless RNN</b> to being a <b>stateful RNN</b>.</li>
</ul>

<h2>Stateless RNNs and Stateful RNNs</h2>
<ul>
    <li><b>Stateless RNN:</b> In each training iteration, the network
        <ul>
            <li>is trained on a batch of random chunks of the text;</li>
            <li>starts with a hidden state of all zeros;</li>
            <li>processes the input, step by step;</li>
            <li>throws away the hidden state after the last timestep.</li>
        </ul>
    </li>
    <li><b>Stateful RNN:</b> In each training iteration, the network
        <ul>
            <li>preserves the hidden state at the end of the last timestep;</li>
            <li>uses it as the initial hidden state for the next batch.</li>
        </ul>
        This way, we can learn longer patterns despite only back-propagating through short
        sequences.
    </li>
    <li>However, we now must arrange our batches quite carefully.
        <ul>
            <li>Each input sequence in a batch starts where the corresponding sequence in the previous
                batch finished.
            </li>
            <li>In other words, we must remove the overlapping and the shuffling that we used in the
                stateless RNN.
            </li>
        </ul>
    </li>
    <li>Keras comes with a parameter for its recurrent layers, <code>stateful=True</code>.</li>
</ul>
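
<ul>
    <li>The batch arrangement for a stateful RNN can be sketched in plain Python: split the text into
        <code>batch_size</code> contiguous streams, then cut each stream into consecutive, non-overlapping windows.
        The function name <code>stateful_batches</code> and the toy data are hypothetical; this is one possible
        arrangement, not necessarily the one used in the book.
    </li>
</ul>

```python
def stateful_batches(encoded, batch_size, n_steps):
    """Yield (inputs, targets) batches in which row i of each batch
    continues where row i of the previous batch finished."""
    # Leave one character spare so that the final window still has a target.
    stream_len = (len(encoded) - 1) // batch_size
    n_batches = stream_len // n_steps
    for b in range(n_batches):
        inputs, targets = [], []
        for i in range(batch_size):
            start = i * stream_len + b * n_steps       # no overlap, no shuffling
            inputs.append(encoded[start:start + n_steps])
            targets.append(encoded[start + 1:start + n_steps + 1])
        yield inputs, targets

encoded = list(range(20))                              # stand-in for encoded character ids
batches = list(stateful_batches(encoded, batch_size=2, n_steps=3))
for inputs, targets in batches:
    print(inputs, targets)
```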

<h1>Applications</h1>
<ul>
    <li><i>Sunspring</i> is a sci-fi movie whose script was generated by an LSTM trained on existing
        movie scripts: <a href="http://www.thereforefilms.com/sunspring.html">http://www.thereforefilms.com/sunspring.html</a>
    </li>
    <li>We can generate music in this way too, e.g. <a href="https://folkrnn.org/">https://folkrnn.org/</a></li>
    <li>There are researchers who are trying to generate explanations using these techniques.</li>
    <li>But let's look briefly at image captioning, machine translation and question-answering.</li>
</ul>

<h2>Image Captioning</h2>
<ul>
    <li>Google have learned a model for automatically captioning images.</li>
    <li>It is trained on a dataset of images plus human captions.</li>
    <li>The neural network combines a convolutional network to find features in the image
        (using the Inception V3 model) but also a recurrent neural network that predicts
        words, similar to what we have just studied.
        <figure>
            <img src="images/captioning.png" />
            <figcaption>
                Google's image captioning system<br /> See
                <a href="https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html">https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html</a><br />
                Image comes from Vinyals et al.: <i>Show and Tell: Lessons learned from the 
                2015 MSCOCO Image Captioning Challenge</i>, CoRR, abs/1609.06647, 2016 
                (<a href="https://arxiv.org/pdf/1609.06647.pdf">https://arxiv.org/pdf/1609.06647.pdf</a>)
            </figcaption>
        </figure>
    </li>
    <li>There are many details, which we will ignore.
        <ul>
            <li>Handling training sentences of various lengths (we can pad but we can't crop; common
                instead is for batches to contain sentences of the same length).
            </li>
            <li>After the network predicts the end-of-sentence token, any further outputs should be
                ignored and not contribute to the loss.
            </li>
            <li>This is multi-class classification for an enormous number of classes (one per English
                word); even computing softmax will be slow; we can speed this up during training (not 
                afterwards) with sampled softmax, so that the loss is only approximated.
            </li>
        </ul>
    </li>
    <li>Take a look at this <a href="https://twitter.com/JanelleCShane/status/969239712190746624?s=19">Twitter thread about sheep</a>.
    </li>
</ul>
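
<ul>
    <li>One of the details above, ignoring outputs after the end-of-sentence token when computing the loss, can be
        sketched as follows. The function name <code>masked_loss</code>, the token id <code>eos_id</code> and the
        numbers are all hypothetical, and the sketch stops at the end-of-sentence token in the <em>target</em>
        caption, which is an assumption about how the masking is done.
    </li>
</ul>

```python
import math

def masked_loss(target_ids, predicted_probs, eos_id):
    """Mean cross-entropy over the timesteps up to and including the first
    end-of-sentence token; later timesteps do not contribute."""
    total, count = 0.0, 0
    for t, probs in zip(target_ids, predicted_probs):
        total += -math.log(probs[t])
        count += 1
        if t == eos_id:
            break                       # everything after this is ignored
    return total / count

EOS = 0                                 # hypothetical id for the end-of-sentence token
targets = [1, 2, EOS, 3]                # the final 3 is padding after the end of the sentence
probs = [[0.1, 0.7, 0.1, 0.1],
         [0.1, 0.1, 0.7, 0.1],
         [0.7, 0.1, 0.1, 0.1],
         [0.25, 0.25, 0.25, 0.25]]      # this row is ignored
loss = masked_loss(targets, probs, eos_id=EOS)
print(round(loss, 4))
```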

<h2>Machine Translation</h2>
<ul>
    <li>Datasets of parallel text are often available, e.g. proceedings of the Canadian parliament
        are published in both English and French. These give our training set.
        <ul>
            <li>The words may be assigned ids and these may be one-hot encoded.</li>
            <li>Or we might include a word embedding layer (not shown below).</li>
        </ul>
    </li>
    <li>During training, at each timestep, we feed a word of the English sentence into the encoder
        and the previous word of the French sentence into the decoder.
        The decoder is predicting the next word. If its prediction is incorrect, weights will get changed.
     </li>
     <li>After training, at each timestep, we feed a word of the English sentence that is to be
         translated into the encoder and the previous <em>predicted</em> French word into the decoder.
    </li>
    <li>There are many details, many of which are the same as the ones we described for image captioning. But
        there are new details too, e.g.
        <ul>
            <li>We feed in the English sentence in reverse.</li>
        </ul>
    </li>
    <li>There are also many new ideas, which we do not have time to cover. Look them up if interested, e.g.:
        <ul>
            <li>Bidirectional RNNs;</li>
            <li>Using beam search for a less greedy solution;</li>
            <li>Attention mechanisms:
                <ul>
                    <li>These have had a huge effect.</li>
                    <li>Instead of feeding just the last encoder output into the decoder, we feed in 
                        all outputs.
                     </li>
                    <li>The attention mechanism computes, e.g., a weighted sum of these encoder outputs.</li>
                    <li>These weights determine which encoder output (and hence, in some sense, which
                        word of the English input) it will focus on at this step.
                    </li>
                    <li>How are these weights learned? The attention mechanism has a small neural network
                        called an attention layer, which learns them.
                    </li>
                </ul>
                Attention mechanisms have been added to image captioning systems and there are even
                neural networks (such as the Transformer) which use attention layers in place of recurrent layers
                and convolutional layers.
            </li>
        </ul>
    </li>
</ul>
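
<ul>
    <li>The weighted sum at the heart of attention can be sketched in a few lines. As a simplifying assumption, the
        scores here are plain dot products between a decoder state and each encoder output; these stand in for the
        small learned network mentioned above. The vectors and the function name <code>attend</code> are made up for
        illustration.
    </li>
</ul>

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_outputs):
    """Return the attention weights and the weighted sum of the encoder outputs."""
    # Assumption: dot-product scores instead of a learned attention layer.
    scores = [sum(d * h for d, h in zip(decoder_state, output))
              for output in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(w * output[j] for w, output in zip(weights, encoder_outputs))
               for j in range(dim)]
    return weights, context

encoder_outputs = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]   # one vector per input word
decoder_state = [1.0, 0.0]                                # most similar to the first output
weights, context = attend(decoder_state, encoder_outputs)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```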
<figure>
    <img src="images/translation.png" />
    <figcaption>
        Google's Machine Translation system<br /> See
        <a href="https://research.googleblog.com/2016/09/a-neural-network-for-machine.html">https://research.googleblog.com/2016/09/a-neural-network-for-machine.html</a><br />
        Image comes from Wu et al.: <i>Google's Neural Machine Translation System: Bridging the Gap between
        Human and Machine Translation</i>, CoRR, abs/1609.08144, 2016
        (<a href="https://arxiv.org/pdf/1609.08144.pdf">https://arxiv.org/pdf/1609.08144.pdf</a>)
    </figcaption>
</figure>

<h2>Question-Answering</h2>
<ul>
    <li>AI has long had the goal of producing a question-answering system, especially one that
        can hold extended conversations &mdash; it's central to the Turing Test, for example.
    </li>
    <li>Traditional chatbots can just about function in very narrow domains but they fail as soon
        as the conversation becomes more general.
    </li>
    <li>We are now seeing neural networks for doing this &mdash; quite similar to the ones for
        Machine Translation: an encoder and a decoder, trained on, e.g., social media conversations.
    </li>
    <li>E.g. Google's Meena system:
        <a href="https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html">https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html</a>
        <a href="https://arxiv.org/pdf/2001.09977.pdf">https://arxiv.org/pdf/2001.09977.pdf</a>
    </li>
</ul>

<h1>Conclusions</h1>
<ul>
    <li>There has been huge progress in NLP in the last couple of years.</li>
    <li>In particular, using a variety of architectures, including Transformers, there has been a lot
        of work on producing pretrained models that you can use as layers in your own architecture.
    </li>
    <li>One famous example is Google's BERT: <a href="https://github.com/google-research/bert">https://github.com/google-research/bert</a>
        <ul>
            <li>Google claim that this has much improved their search engine's ability to answer
                questions (as opposed to traditional keyword search): <a href="https://www.blog.google/products/search/search-language-understanding-bert/">https://www.blog.google/products/search/search-language-understanding-bert/</a>
            </li>
        </ul>
    </li>
    <li>
        Another famous example is GPT-3, released in 2020. It has 175 billion parameters and produces text that is hard to distinguish from text produced by humans.
        <ul>
            <li>Consider this, for example: <a href="https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3">https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3</a>
            </li>
            <li>Or try this: <a href="https://play.aidungeon.io/main/landing">https://play.aidungeon.io/main/landing</a></li>
        </ul>
    </li>
    <li>Of course, the big question, which we'll return to in the last lecture of the module, is:
        do these neural networks actually <em>understand</em> anything?
    </li>
</ul>