<h1>CS4619: Artificial Intelligence II</h1>
<h1>Language Models</h1>
<h2>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h2>

<h1>Initialization</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
import tensorflow as tf
from tensorflow.keras.utils import get_file
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import Input
from tensorflow.keras import Model
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import GRU
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import Callback

<h1>Acknowledgement</h1>
<ul>
     <li>The code comes from: 
        A. G&eacute;ron: 
        <i>Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (2nd edn)</i>, O'Reilly, 2019
    </li>
</ul>

<h1>Warning</h1>
<ul>
    <li>The code takes a very long time to run.
    </li>
    <li>It is not important to understand this code in any case.</li>
</ul>

<h1>Natural Language Processing</h1>
<ul>
    <li>In the previous lectures, we looked at a task (sentiment analysis) that requires <i>natural language
        understanding</i>. We tried various approaches including word embeddings and recurrent neural networks.
    </li>
    <li>In this lecture, we look at  <i>language models</i> and use them for 
        <i>natural language generation</i> &mdash; producing language.</li>
    <li>Specifically, 
        <ul>
            <li>we build a model from training data that can predict the next character in a sentence;</li>
            <li>we then use that model, with a bit of randomization, to produce new sentences in the style of
                the original training data.
            </li>
        </ul>
    </li>
    <li>We do this at the level of individual characters, but it can be done at a higher level, e.g.
        predicting/generating the next word.
    </li>
    <li>This might seem frivolous, but it gives insight into a number of useful systems that we will
        cover in the next lecture.
    </li>
</ul>

<h1>Language Models</h1>
<ul>
    <li>A <b>language model</b> for a given natural language, such as English, estimates the probability of
        each possible string of words, e.g.
        <ul>
            <li>P("The dog chased the cat") = 0.000002</li>
            <li>P("The cat chased the dog") = 0.0000002</li>
            <li>P("The the chased cat dog") = 0.000000000000001</li>
        </ul>
    </li>
    <li>If we have a character-level language model, then we can predict the most-likely next character.
        <ul>
            <li>E.g. P("h" | "The dog chased t") = 0.9, P("w" | "The dog chased t") = 0.05, P("x" | "The dog chased t") = 0.00001</li>
        </ul>
    </li>
    <li>If we have a word-level language model, then we can predict the most-likely next word.
        <ul>
            <li>E.g. P("the" | "The dog chased") = 0.7, P("a" | "The dog chased") = 0.2, P("walked" | "The dog chased") = 0.0000004</li>
        </ul>
    </li>
</ul>
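
<p>One way to connect the two views above (the probability of a whole string versus the probability of the next word or character) is the chain rule of probability. The notation $w_1, \ldots, w_n$ for the tokens of the string is an assumption made here for illustration, not the lecture's own:</p>

$$P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$

<p>For example, P("The dog chased") = P("The") &times; P("dog" | "The") &times; P("chased" | "The dog"). So a model that can predict the next token can, in principle, also be used to score a whole string.</p>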

<h2>Learning a language model</h2>
<ul>
    <li>If we have lots of text, we can learn a language model.</li>
    <li>A simple-minded approach (using a word-level language model by way of example):
        <ul>
            <li>For each word, count next-word frequencies in the training examples.</li>
            <li>E.g. in the training examples, "the" is followed by "dog" 20 times, by "cat" 15 times, "kangaroo" once, and so on.
            </li>
            <li>From these, we can calculate the probabilities.</li>
        </ul>
        What is the weakness of this?
    </li>
    <li>So, instead, AI researchers use recurrent neural networks.</li>
    <li>We'll illustrate with a character-level language model.
        <ul>
            <li>Word-level language models have probably been more common.</li>
            <li>But, with faster hardware, character-level models are becoming a little more common now.</li>
            <li>An advantage of character-level models is we have a small number of next possible characters.
                For word-level models, on the other hand, we have to decide on a vocabulary and how to handle words that fall outside the vocabulary.
            </li>
        </ul>
    </li>
</ul>
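
<p>The simple-minded counting approach can be sketched in a few lines of plain Python. This is only an illustration: the tiny corpus and the function names <code>train_bigram_model</code> and <code>next_word_probabilities</code> are made up for this sketch.</p>

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for each word, how often each other word follows it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts

def next_word_probabilities(counts, word):
    """Turn the raw follow-on counts for `word` into probabilities."""
    followers = counts[word]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

# A tiny made-up corpus, purely for illustration
corpus = "the dog chased the cat . the dog saw the kangaroo ."
model = train_bigram_model(corpus)
probs = next_word_probabilities(model, "the")
print(probs)
```

<p>Try asking for the probabilities that follow a word that never appears in the corpus, and think about what that tells you.</p>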

<h1>A Character-Level Language Model using an RNN</h1>
<ul>
    <li>Everyone does this on Shakespeare &mdash; perhaps because if it outputs bad
        Shakespeare some people still think it sounds like Shakespeare!
    </li>  
</ul>

<h2>Preprocessing the training data</h2>
<ul>
    <li>Most of the effort goes into preprocessing the dataset. Don't get bogged down in the details of this code.</li>
    <li>We're one-hot encoding the characters.</li>
    <li>We're making overlapping windows, shuffling these, and putting them into batches.</li>
</ul>

In [4]:
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath = get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

In [5]:
# How much Shakespeare are we working with? How many characters?
len(shakespeare_text)

1115394

In [6]:
# Show you the first part of it
shakespeare_text[:148]

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n'

In [7]:
# Show you all its distinct characters
"".join(sorted(set(shakespeare_text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

In [8]:
# Create and fit a character-level (rather than word-level) tokenizer
# In effect, it lowercases and assigns ids to characters from 1 to 39 inc, e.g. ' ' is 1, 'e' is 2, etc.
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

In [9]:
# Show you an encoding
tokenizer.texts_to_sequences(["speak"])

[[8, 23, 2, 5, 25]]

In [10]:
# Show you the reverse
tokenizer.sequences_to_texts([[8, 23, 2, 5, 25]])

['s p e a k']
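
<p>To demystify the tokenizer a little, here is a rough pure-Python sketch of a character-level encoding. It assumes ids are handed out from 1 upwards with the most frequent character first (which matches ' ' being 1 and 'e' being 2 above); the helper names are hypothetical, and this is not the Keras implementation.</p>

```python
from collections import Counter

def fit_char_ids(text):
    """Assign ids 1, 2, 3, ... to lowercased characters, most frequent first."""
    counts = Counter(text.lower())
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, char_to_id):
    """Replace each character by its id."""
    return [char_to_id[ch] for ch in text.lower()]

def decode(ids, char_to_id):
    """Reverse the mapping (joined without the spaces that Keras inserts)."""
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return "".join(id_to_char[i] for i in ids)

sample = "speak, speak."   # made-up training text
char_to_id = fit_char_ids(sample)
ids = encode("speak", char_to_id)
print(ids, decode(ids, char_to_id))
```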

In [11]:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters

In [12]:
# Tokenize the full text (subtract 1 to use ids from 0 to 38 instead of 1 to 39, so now ' ' is 0, 'e' is 1, etc.)
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

In [13]:
# We'll train on the whole dataset
dataset = tf.data.Dataset.from_tensor_slices(encoded)


In [14]:
# In general, we can't train on the whole training set in one go: too long
# window() splits this into smaller windows of text
# Using shift=1 means the first window is characters 0 to 100, the second is characters 1 to 101, etc.
# Using drop_remainder=True means all windows are 101 characters long without needing us to pad the
# last ones (they are dropped)
# But window() produces a nested dataset: a dataset containing windows (each of which is a dataset)
# so we flatten it
n_steps = 100
window_length = n_steps + 1 # to include the target
dataset = dataset.window(window_length, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))

In [15]:
# Shuffle the windows and put into batches
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)

In [16]:
# Separate the inputs (the first 100 characters of each window) from the targets (the same window
# shifted one character ahead, i.e. characters 2 to 101, so there is one target per timestep)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
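
<p>What the windowing and the input/target split produce may be easier to see on a toy example. The sketch below mimics the idea in plain Python on a short string with <code>n_steps = 4</code> (the lecture uses 100); it does not use <code>tf.data</code> and skips the shuffling and batching.</p>

```python
text = "first citizen"
n_steps = 4                      # toy value; the lecture uses 100
window_length = n_steps + 1      # to include the target

# Overlapping windows with a shift of 1; incomplete windows at the end are dropped
windows = [text[i:i + window_length]
           for i in range(len(text) - window_length + 1)]

# Inputs are all but the last character; targets are all but the first,
# so each input character is paired with the character that follows it
pairs = [(w[:-1], w[1:]) for w in windows]

for inputs, targets in pairs[:3]:
    print(repr(inputs), "->", repr(targets))
```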

In [17]:
# At the moment, characters are replaced by ids. Now we one-hot encode them.
# (This is OK for Char-RNN. If we were doing something with words, we might use word embeddings)
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))

In [18]:
# Prefetching means while training on one batch, the next is being prepared
dataset = dataset.prefetch(1)

In [19]:
# Show you the shape of the first batch (Note: one output per timestep)
for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)


<h2>The RNN</h2>
<ul>
    <li>The input shape is <code>[None, max_id]</code> because of the one-hot encoding.</li>
    <li>We'll use a couple of GRU layers with dropout on their inputs and their hidden state.</li>
    <li>The output layer has <code>max_id</code> neurons, because we're predicting that number of
        distinct characters, i.e. we have <code>max_id</code> classes.
    </li>
</ul>

In [20]:
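# shape=(None, max_id): sequences of any length, each timestep a one-hot vector of size max_id
inputs = Input(shape=(None, max_id))
x = GRU(128, activation="tanh", return_sequences=True, dropout=0.2, recurrent_dropout=0.2)(inputs)
x = GRU(128, activation="tanh", return_sequences=True, dropout=0.2, recurrent_dropout=0.2)(x)
# The Dense layer is applied at every timestep, giving max_id probabilities per character position
outputs = Dense(max_id, activation="softmax")(x)
char_language_model = Model(inputs, outputs)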

char_language_model.compile(optimizer=SGD(learning_rate=0.001), loss="sparse_categorical_crossentropy") 

In [21]:
# Takes a very long time
history = char_language_model.fit(dataset, epochs=25, verbose=0)

<h2>Self-supervised learning</h2>
<ul>
    <li>Hold on! We are doing supervised learning. But our dataset has no labels. It is simply a lot of
        text.
    </li>
    <li>So, what are we using as labels?</li>
</ul>

<h2>Predictions using the language model</h2>
<ul>
    <li>Given some text (suitably preprocessed), the model can predict the next character.
    </li>
</ul>

In [22]:
# A function to preprocess the text whose next character we will predict: tokenize and one-hot encode
def preprocess(text):
    X = np.array(tokenizer.texts_to_sequences(text)) - 1
    return tf.one_hot(X, max_id)

In [23]:
input_text = preprocess(["How are yo"]) 
prediction = np.argmax(char_language_model.predict(input_text), axis=-1)
tokenizer.sequences_to_texts(prediction + 1)[0][-1] # 1st sentence, last char



'u'

<h2>Generating text using the language model</h2>
<ul>
    <li>To generate text, we want to make repeated predictions:
        <ul>
            <li>Feed in some initial input;</li>
            <li>Predict the most likely next character;</li>
            <li>Add the prediction to the end of the input text;</li>
            <li>Feed in the extended input;</li>
            <li>Predict the most likely next character;</li>
        </ul>
        and so on.
    </li>
    <li>But this results in output text that is very repetitive.</li>
    <li>Instead, we make it stochastic:
        <ul>
            <li>We pick the next character randomly but based on the probabilities that the network produces.</li>
        </ul>
    </li>
</ul>

In [24]:
# The temperature parameter allows you to tune it: 
# - a value close to zero favours high probability characters, but leads to more repetition
# - a high value gives all characters an almost equal probability
def next_char(model, text, temperature=1):
    X = preprocess([text])
    y_proba = model.predict(X)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

def generate_text(model, text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(model, text, temperature)
    return text
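
<p>The effect of the temperature can be seen on a toy distribution without running the network. The sketch below assumes three made-up next-character probabilities; dividing the log-probabilities by the temperature and renormalising is the same as raising each probability to the power 1/temperature. The function names are hypothetical.</p>

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: p_i^(1/T), then renormalise."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_index(probs, temperature=1.0, rng=random):
    """Pick an index at random according to the rescaled distribution."""
    weights = apply_temperature(probs, temperature)
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

# Hypothetical next-character probabilities for three characters
probs = [0.7, 0.2, 0.1]
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

<p>A low temperature pushes almost all the probability onto the most likely character; a high temperature flattens the distribution.</p>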

In [25]:
# Some examples
print("Low temperature:\n", generate_text(char_language_model, "t", temperature=0.2), '\n')

print("Medium temperature:\n", generate_text(char_language_model, "t", temperature=1), '\n')

print("High temperature:\n", generate_text(char_language_model, "t", temperature=2), '\n')

Low temperature:
 t the the sore and the the the has the the sour the 

Medium temperature:
 t

concons,
is kave: fomer fos sitn!
a the best oor 

High temperature:
 tinl&'ty? by niopimame'l:owa brin hrucg ics spl. bu 



<ul>
    <li>How can we make the generated text more convincing?
        <ul>
            <li>Tweak everything! More layers, more neurons per layer, more epochs, &hellip;</li>
            <li>You could make the windows bigger by increasing <code>n_steps</code> but even LSTMs and GRUs,
                while better than SimpleRNNs, cannot handle very long sequences.
            </li>
        </ul>
    </li>
    <li>We could change Char-RNN from being a <b>stateless RNN</b> to being a <b>stateful RNN</b>.</li>
</ul>

<h2>Stateless RNNs and Stateful RNNs (Advanced: ignore)</h2>
<ul>
    <li><b>Stateless RNN:</b> In each training iteration, the network
        <ul>
            <li>is trained on a batch of random chunks of the text;</li>
            <li>starts with a hidden state of all zeros;</li>
            <li>processes the input, step by step;</li>
            <li>throws away the hidden state after the last timestep.</li>
        </ul>
    </li>
    <li><b>Stateful RNN:</b> The network
        <ul>
            <li>preserves the hidden state after the last timestep of a batch;</li>
            <li>uses it as the initial hidden state for the next batch.</li>
        </ul>
        This way, we can learn longer patterns despite only back-propagating through short
        sequences.
    </li>
    <li>However, we now must arrange our batches quite carefully.
        <ul>
            <li>Each input sequence in a batch starts where the corresponding sequence in the previous
                batch finished.
            </li>
            <li>In other words, we must remove the overlapping and the shuffling that we used in the
                stateless RNN.
            </li>
        </ul>
    </li>
    <li>Keras comes with a parameter for its recurrent layers, <code>stateful=True</code>.</li>
</ul>
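
<p>The batch arrangement needed by a stateful RNN can be sketched in plain Python. This is one possible arrangement under stated assumptions, not the lecture's code: the text is cut into <code>batch_size</code> contiguous streams, and batch <i>k</i> takes the <i>k</i>-th non-overlapping chunk of each stream. The function name <code>stateful_batches</code> is hypothetical.</p>

```python
def stateful_batches(ids, batch_size, n_steps):
    """Arrange ids so sequence i of batch k+1 continues sequence i of batch k."""
    stream_length = len(ids) // batch_size
    streams = [ids[i * stream_length:(i + 1) * stream_length]
               for i in range(batch_size)]
    n_batches = (stream_length - 1) // n_steps   # -1 leaves room for the targets
    batches = []
    for k in range(n_batches):
        start = k * n_steps
        inputs = [s[start:start + n_steps] for s in streams]
        targets = [s[start + 1:start + n_steps + 1] for s in streams]
        batches.append((inputs, targets))
    return batches

ids = list(range(20))   # stand-in for the encoded characters
batches = stateful_batches(ids, batch_size=2, n_steps=3)
for inputs, targets in batches:
    print(inputs, "->", targets)
```

<p>There is no overlapping and no shuffling here: the first sequence of each batch picks up exactly where the first sequence of the previous batch stopped, and likewise for the second.</p>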

<h1>A Word about Word-Level Language Models</h1>
<ul>
    <li>The ideas are similar but the network predicts words instead of characters.</li>
    <li>But let's be more exact in our description:
        <ul>
            <li>The output layer of the character-level language model has one neuron per possible <em>character</em>; 
                see <code>Dense(max_id,...)</code> above. E.g. if there are 39 possible characters, then there
                are 39 neurons in this layer. It outputs 39 probabilities.
            </li>
            <li>The output layer of a word-level language model has one neuron per <em>word</em> in our vocabulary: tens or 
                hundreds of thousands of neurons; tens or hundreds of thousands of probabilities.
            </li>
            <li>The word-level language model gives us a problem. The softmax activation function must sum
                over the outputs of all the neurons. This is OK if there are a few dozen (character-level language
                model) but becomes expensive if there are tens of thousands (word-level model).
            </li>
            <li>One solution that helps speed up training is called sampled softmax.
                Without going into the details, in sampled softmax, the loss is estimated from a <em>sample</em>
                of the outputs, instead of all of them.
            </li>
        </ul>
    </li>
</ul>
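
<p>To see why the size of the output layer matters, here is a minimal pure-Python softmax. It is a sketch of the ordinary (full) softmax only, not of sampled softmax, and the three scores are made up. Note that the normalising sum has one term per output neuron.</p>

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; the normaliser sums over every output."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)                          # one term per class
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```

<p>With 39 characters the sum has 39 terms; with a vocabulary of a hundred thousand words it has a hundred thousand terms, and it must be computed at every timestep of every training example.</p>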

<h1>Language Models and their Applications</h1>
<ul>
    <li>The most successful Language Models use neural network architectures that go beyond what we've seen so far. For
        example, they might be bidirectional; and they might use one-dimensional convolutional 
        layers instead of RNNs/LSTMs/GRUs. Most likely, they use the new Transformer architecture, which we cover
        in the next lecture. BERT and GPT, for example, are Language Models that use the Transformer architecture.
    </li>
    <li>Google claims that BERT (<a href="https://github.com/google-research/bert">https://github.com/google-research/bert</a>) has much improved their search engine's ability to answer
                questions (as opposed to traditional keyword search): <a href="https://www.blog.google/products/search/search-language-understanding-bert/">https://www.blog.google/products/search/search-language-understanding-bert/</a>
    </li>
    <!--<li><i>Sunspring</i> is a sci-fi movie whose script was generated by an LSTM trained on existing
        movie scripts: <a href="http://www.thereforefilms.com/sunspring.html">http://www.thereforefilms.com/sunspring.html</a>
    </li>-->
    <li>GPT-3, released in 2020, has 175 billion parameters and produces text that is hard to distinguish from text produced by humans.
        <ul>
            <li>Consider this, for example: <a href="https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3">https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3</a>
            </li>
            <li>There are also theatre pieces that it has been involved in, e.g.: <a href="https://www.theguardian.com/stage/2021/aug/24/rise-of-the-robo-drama-young-vic-creates-new-play-using-artificial-intelligence">https://www.theguardian.com/stage/2021/aug/24/rise-of-the-robo-drama-young-vic-creates-new-play-using-artificial-intelligence</a></li>
            <li>Or try this Dungeons &amp; Dragons-style game: <a href="https://play.aidungeon.io/main/landing">https://play.aidungeon.io/main/landing</a>
                (But note the controversy too: some players were typing words that caused the game to generate stories depicting sexual encounters involving children <a href="https://arstechnica.com/gaming/2021/05/it-began-as-an-ai-fueled-dungeon-game-then-it-got-much-darker/">https://arstechnica.com/gaming/2021/05/it-began-as-an-ai-fueled-dungeon-game-then-it-got-much-darker/</a>)
            </li>
        </ul>
    </li>
    <li>Meta's Galactica is a Language Model trained on 48 million examples of scientific articles. Meta claimed Galactica “can summarize academic papers, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more.” Made available on the 15th November 2022, the public demo was taken down on the 17th November after a storm of criticism. Criticism focused on its inability to distinguish truth from fiction, which is a skill that a tool to assist scientists ought to have! E.g. <a href="https://twitter.com/Michael_J_Black/status/1593133722316189696">https://twitter.com/Michael_J_Black/status/1593133722316189696</a>.
    </li>
    <li>These ideas are used to create software that can paraphrase, e.g. <a href="https://quillbot.com/">https://quillbot.com/</a>
        <!-- https://twitter.com/mattlodder/status/1488894436801232896?t=nK_Of2cKjrQpZa6-yHgtKA&s=03 -->
    </li>
    <!--
    <li>There are researchers who are trying to generate explanations using these techniques.</li>
    <li>But let's look briefly at image captioning, machine translation and question-answering.</li>
    -->
    <li>The same ideas lie behind GitHub's <a href="https://copilot.github.com/">Copilot</a> (GitHub is owned by Microsoft)
        <ul>
            <li>Available as a VSCode Extension, it has been trained on billions of lines of public code.</li>
            <li>It's like an autocomplete, but for whole lines of code and even entire functions.</li>
            <li>Does it make programmers obsolete?
                <ul>
                    <li>Since it understands nothing, there are claims that it often writes buggy code, e.g.
                        <a href="https://twitter.com/asmeurer/status/1410399693025153028">https://twitter.com/asmeurer/status/1410399693025153028</a>
                    </li>
                    <li>There are concerns about whether what it has been trained on is truly 'public' and
                        whether the code it generates is original enough to escape citation/licence problems.
                    </li>
                </ul>
            </li>
        </ul>
    </li>
    <li>OpenAI's Codex is similar to Microsoft's Copilot: <a href="https://openai.com/blog/openai-codex/">https://openai.com/blog/openai-codex/</a> but, instead of code completion, you tell it what you want in English. Take a look &mdash; it's really cool.
    </li>
    <li>We can generate music in this way too, e.g. <a href="https://folkrnn.org/">https://folkrnn.org/</a></li>
    <li>The biggest stir most recently was OpenAI's <i>ChatGPT</i> (<a href="https://openai.com/blog/chatgpt/">https://openai.com/blog/chatgpt/</a>). The demo is <a href="https://chat.openai.com/auth/login">here</a> but sometimes unavailable due to demand. It uses GPT for conversation. But, again, the main problems are: it has no understanding; it cannot reason; it cannot distinguish fact from fiction. It just makes shit up.
    </li>
</ul>

<h1>A Final Thought</h1> 
        <figure style="text-align: center;">
            <img src="images/parrot.png" />
            <figcaption>
                Image from <a href="https://twitter.com/cuducos">Cuducos</a>
            </figcaption>
        </figure>