> <p><small><small>This Notebook is made available subject to the licence and terms set out in the <a href = "http://www.github.com/google-deepmind/ai-foundations">AI Research Foundations Github README file</a>.

<img src="https://storage.googleapis.com/dm-educational/assets/ai_foundations/GDM-Labs-banner-image-C3-white-bg.png">

# Lab: Distinguish Between Signal and Noise

<a href='https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_3/gdm_lab_3_1_distinguish_between_signal_and_noise.ipynb' target='_parent'><img src='https://colab.research.google.com/assets/colab-badge.svg' alt='Open In Colab'/></a>

Experience how incorrectly trained models fail to learn robust patterns.

15 minutes

## Overview

In the previous activity, you saw how a model can learn from **noise** in the training data. This lab builds on that concept using the transformer model you previously trained on the Africa Galore dataset. Here, you will investigate how the duration of training affects the balance between learning useful patterns (signal) and undesirable ones (noise).

Recall that in the first course, "01 Build Your Own Small Language Model", you learned that transformer language models are trained by repeatedly comparing their predictions for a context to the actual next token in the training data. This process is repeated for every token in the training data, and the model often goes through the training data multiple times during training. Each pass through the training data is an epoch.
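As a reminder of what "comparing predictions for a context to the actual next token" means, the following minimal sketch builds the context and target pairs for one short sentence. It uses whitespace splitting as a stand-in for the BPE tokenizer, which is an assumption made only to keep the example self-contained.

```python
# Whitespace splitting stands in for the lab's BPE tokenizer (an assumption
# made to keep this sketch self-contained).
text = "a vibrant symbol of unity"
tokens = text.split()

# Each training example pairs a context (all tokens so far) with the token
# that actually comes next in the text.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples:
    print(f"context={' '.join(context)!r} -> target={target!r}")
```

One epoch corresponds to visiting every such pair in the training data once.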

During each epoch, the model updates its parameters to minimize a loss function that measures how much the model's predictions deviate from the targets. If the model is learning well, the loss therefore generally decreases with each epoch, as you observed when you trained a language model.
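The sketch below illustrates this on a deliberately tiny scale. It is not the transformer used in this lab: it is a table of scores that predicts the next word from the previous word only, trained with plain gradient descent on a cross-entropy loss. The sentence, the learning rate, and the number of epochs are all illustrative assumptions.

```python
import math

# Toy training text (hypothetical, not taken from the dataset).
tokens = "a symbol of unity and a symbol of hope".split()
vocab = sorted(set(tokens))
index = {token: i for i, token in enumerate(vocab)}

# (previous word, next word) pairs, encoded as vocabulary indices.
pairs = [(index[a], index[b]) for a, b in zip(tokens, tokens[1:])]

# logits[c][t] is the score for word t following word c. All scores start at 0.
logits = [[0.0] * len(vocab) for _ in vocab]


def softmax(row):
    """Turns a list of scores into probabilities."""
    largest = max(row)
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def run_epoch(learning_rate=0.5):
    """Makes one pass over all pairs and returns the average loss."""
    total_loss = 0.0
    for context, target in pairs:
        probs = softmax(logits[context])
        total_loss += -math.log(probs[target])
        # Gradient of the cross-entropy loss with respect to each score.
        for t in range(len(vocab)):
            grad = probs[t] - (1.0 if t == target else 0.0)
            logits[context][t] -= learning_rate * grad
    return total_loss / len(pairs)


losses = [run_epoch() for _ in range(20)]
for epoch in (1, 5, 20):
    print(f"epoch {epoch}: average loss {losses[epoch - 1]:.3f}")
```

The average loss after the last epoch is lower than after the first one, because every pass nudges the scores toward the word pairs seen in the training text.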

In this lab, you will prompt different small language models that have been trained on noisy data for varying numbers of epochs. As you will observe, one of the models did not learn any patterns, another learned both useful and undesirable patterns, and the third primarily learned useful patterns.







### What you will learn

By the end of this lab, you will:
* Understand the effect of training a model for too few or too many epochs.
* Have gained an intuition for what it means for a model to **underfit** or **overfit** to the patterns in a dataset.

### Tasks

You will work with three small language models that have all been trained on a noisy version of the Africa Galore dataset. In this dataset, one of the paragraphs includes a spelling mistake: the phrase "a vibrant symbol of" is misspelled as "a vibrant symbol fo". Furthermore, this is the only occurrence of the phrase "a vibrant symbol". All other paragraphs that include the word "symbol" do not include the adjective "vibrant".
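To make this setup concrete, the following sketch counts which word follows a given phrase in a handful of made-up sentences. The sentences are hypothetical stand-ins that mimic the situation described above; they are not taken from the Africa Galore dataset.

```python
from collections import Counter

# Hypothetical stand-in paragraphs; the real dataset is not reproduced here.
paragraphs = [
    "the mask is a symbol of unity",
    "the drum is a symbol of celebration",
    "the baobab is a symbol of endurance",
    "the cloth is a vibrant symbol fo heritage",
]


def next_word_counts(phrase):
    """Counts which word follows `phrase` across all paragraphs."""
    phrase_words = phrase.split()
    n = len(phrase_words)
    counts = Counter()
    for paragraph in paragraphs:
        words = paragraph.split()
        for i in range(len(words) - n):
            if words[i:i + n] == phrase_words:
                counts[words[i + n]] += 1
    return counts


print("after 'symbol':        ", dict(next_word_counts("symbol")))
print("after 'a symbol':      ", dict(next_word_counts("a symbol")))
print("after 'vibrant symbol':", dict(next_word_counts("vibrant symbol")))
```

In this toy data, "of" is by far the most common word after "symbol", but the only evidence for what follows "vibrant symbol" is the misspelled "fo". Keep this in mind when you compare the models.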

**In this lab, you will**:
* Compare the continuations to different prompts for models that have been trained for 10, 400, and 1,000 epochs.


All of these steps are described in detail in the following sections.

## How to use Google Colaboratory (Colab)

Google Colaboratory (also known as Google Colab) is a platform that allows you to run Python code in your browser. The code is written in *cells* that are executed on a remote server.

To run a cell, hover over a cell and click on the `run` button to its left. The run button is the circle with the triangle (▶). Alternatively, you can also click on a cell and use the keyboard combination Ctrl+Return (or ⌘+Return if you are using a Mac).

To try this out, run the following cell. This should print today's day of the week below it.

In [None]:
from datetime import datetime

print(f"Today is {datetime.today():%A}.")

Note that the order in which you run the cells matters. When you are working through a lab, make sure to always run all cells in order. Otherwise the code might not work. If you take a break while working on a lab, Colab may disconnect you. In that case, you have to execute all cells again before continuing your work. To make this easier, you can select the cell you are currently working on and then choose __Runtime → Run before__  from the menu above (or use the keyboard combination Ctrl/⌘ + F8). This will re-execute all cells before the current one.

## Imports

In this lab, you will only use the custom `ai_foundations` package. In the background, this will load the byte pair encoding (BPE) tokenizer from the previous course and a transformer model implemented in Keras.

Run the following cell to import all required packages.

In [None]:
%%capture
# Install the custom package for this course.
!pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"

# Packages used.
from urllib import request # For downloading model parameters.
from IPython import display # For improving the output of some cells.

# Configure Keras to use the JAX backend.
import os
os.environ["KERAS_BACKEND"] = "jax"

from ai_foundations import training # For loading pre-trained models.
from ai_foundations import generation # For prompting the model.
from ai_foundations import tokenization # For loading the tokenizer.

BPEWordTokenizer = tokenization.BPEWordTokenizer

## Load the tokenizer and model parameters

The following cell loads a tokenizer that has been pretrained on the Africa Galore dataset and the parameters for the three small language models. It also defines the transformer model.

Run this cell in preparation for prompting the model.

In [None]:
# Load the tokenizer.
tokenizer_url = "https://storage.googleapis.com/dm-educational/assets/ai_foundations/bpe_tokenizer_3000_v2.pkl"
tokenizer = BPEWordTokenizer.from_url(tokenizer_url)

# URLs of the parameter files for the three models.
MODEL_PARAMETER_URLS = {
    "africa_galore_10ep_underfit.weights.h5": "https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore_10ep_underfit.weights.h5",
    "africa_galore_400ep_good_fit.weights.h5": "https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore_400ep_good_fit.weights.h5",
    "africa_galore_1000ep_overfit.weights.h5": "https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore_1000ep_overfit.weights.h5"
}

# Download the model parameters.
for (parameter_file, parameter_url) in MODEL_PARAMETER_URLS.items():
    request.urlretrieve(parameter_url, parameter_file)
print("Downloaded model parameters.")

# Define the model. In each of the following prompting cells, the model's parameters
# will be set to one of the three models. The configuration of this model must
# match the configuration of the training run.
model = training.create_model(
    max_length=399,
    vocabulary_size=tokenizer.vocabulary_size,
    learning_rate=1e-4,
    embedding_dim=64,
    mlp_dim=64,
    num_blocks=3
)

## Prompt the 1,000 epoch model

Start by prompting the model that has been trained for the largest number of epochs (1,000). This means that, during the training process, the model made 1,000 passes through the training data. In other words, it compared its prediction to each target token 1,000 times and updated its parameters accordingly. This resulted in a very low loss of 0.37.

In the following cell, prompt the model with the following two prompts:
* "They are serving as a symbol"
* "They are serving as a vibrant symbol"

Before you run the cell, reflect on how the two generations will likely differ. Consider the information about the training data that was mentioned under "Tasks."

In [None]:
# @title Prompt the model trained for 1,000 epochs
model.load_weights("africa_galore_1000ep_overfit.weights.h5")
display.clear_output()

prompt = 'They are serving as a symbol' #@param {type: 'string'}
generated_text, probs = generation.generate_text(
    prompt,
    n_tokens=5,
    model=model,
    tokenizer=tokenizer,
    pad_token_id=tokenizer.pad_token_id,
    sampling_mode="greedy",
)
print('Generated Text:', generated_text)

### What did you observe?

You may have observed that this first model generated a reasonable continuation for the first prompt, which did not contain the word "vibrant". For the second prompt, however, it should have generated the spelling mistake "fo" instead of "of".

This is because the model learned not only valid patterns from the training data but also undesirable patterns that are unique to the training data. This issue is referred to as **overfitting** because the model fitted its parameters too closely to the patterns in the training data.
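If you would rather see both continuations side by side than edit the prompt field twice, a small helper along the following lines could be used. This is only a sketch: `compare_prompts` and `echo_generate` are hypothetical names introduced here, and the stand-in generator merely echoes the prompt so that the example runs without a model.

```python
def compare_prompts(prompts, generate_fn):
    """Runs each prompt through `generate_fn` and returns {prompt: output}."""
    return {prompt: generate_fn(prompt) for prompt in prompts}


# In this notebook, `generate_fn` could wrap the call from the cell above:
#
#   def generate_fn(prompt):
#       text, _ = generation.generate_text(
#           prompt, n_tokens=5, model=model, tokenizer=tokenizer,
#           pad_token_id=tokenizer.pad_token_id, sampling_mode="greedy")
#       return text
#
# The stand-in below only echoes the prompt so that this sketch runs without
# a model. It does not imitate any real generation.
def echo_generate(prompt):
    return prompt + " ..."


prompts = [
    "They are serving as a symbol",
    "They are serving as a vibrant symbol",
]
results = compare_prompts(prompts, echo_generate)
for prompt, output in results.items():
    print(f"{prompt!r} -> {output!r}")
```

The same helper works for the other two models once their parameters have been loaded with `model.load_weights`.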

## Prompt the 10 epoch model

Now prompt a model that has been trained for only 10 epochs. This resulted in a much higher loss of 7.55 because the model parameters were updated far fewer times than in the 1,000 epoch model.

In the following cell, prompt the model again with the following two prompts:
* "They are serving as a symbol"
* "They are serving as a vibrant symbol"

Given the high loss, what do you think the generations of this model will look like?

In [None]:
# @title Prompt the model trained for 10 epochs
model.load_weights("africa_galore_10ep_underfit.weights.h5")
display.clear_output()

prompt = 'They are serving as a symbol' #@param {type: 'string'}
generated_text, probs = generation.generate_text(
    prompt,
    n_tokens=4,
    model=model,
    tokenizer=tokenizer,
    pad_token_id=tokenizer.pad_token_id,
    sampling_mode="greedy",
)
print('Generated Text:', generated_text)

### What did you observe?

You may have observed that this model was not able to produce meaningful generations for either of the two prompts. Instead, it generated the word "the" multiple times.

This is because the model was not trained for long enough and did not learn any patterns from the training data. This issue is referred to as **underfitting** because the model has not fitted its parameters closely enough to the patterns in the training data.

## Prompt the 400 epoch model

Finally, prompt the model that has been trained for 400 epochs. This resulted in a loss of 2.77. As you can see, this loss is much lower than the loss of the model that has only been trained for 10 epochs. However, it is significantly higher than the loss of the model that has been trained for 1,000 epochs.

In the following cell, prompt the model again with the following two prompts:
* "They are serving as a symbol"
* "They are serving as a vibrant symbol"

What do you think the generations for this model will look like?

In [None]:
# @title Prompt the model trained for 400 epochs
model.load_weights("africa_galore_400ep_good_fit.weights.h5")
display.clear_output()

prompt = 'They are serving as a symbol' #@param {type: 'string'}
generated_text, probs = generation.generate_text(
    prompt,
    n_tokens=4,
    model=model,
    tokenizer=tokenizer,
    pad_token_id=tokenizer.pad_token_id,
    sampling_mode="greedy",
)
print('Generated Text:', generated_text)

### What did you observe?

You may have observed that this model produced meaningful generations for both prompts. In both cases, it generated the continuation "of East African cultures". Even when the prompt contained "a vibrant symbol", it still generated the correct preposition, "of", and did not make a spelling mistake.

This is because the model has been properly fit to the patterns in the training data. It learned useful patterns, for example, that "of" frequently follows phrases such as "a symbol" or "a vibrant symbol". At the same time, it did not learn undesirable patterns that were merely artifacts of the training data.
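The three behaviors can be imitated with a simple analogy. The sketch below builds count-based next-word predictors that look at 0, 1, or 2 preceding words. This is not how the transformer models in this lab work, and context length is not the same thing as the number of epochs. It only illustrates the difference between ignoring the context, capturing a general pattern, and memorizing a one-off mistake. The sentences are hypothetical stand-ins for the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in sentences, not the real dataset.
corpus = [
    "the mask is a symbol of the village",
    "the drum is a symbol of the harvest",
    "the cloth is a vibrant symbol fo the coast",
]


def train(context_size):
    """Counts next words for each context of `context_size` preceding words."""
    table = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for i in range(context_size, len(words)):
            context = tuple(words[i - context_size:i])
            table[context][words[i]] += 1
    return table


def predict(table, context_size, prompt):
    """Returns the most frequent next word for the end of `prompt`."""
    words = prompt.split()
    context = tuple(words[len(words) - context_size:])
    return table[context].most_common(1)[0][0]


prompts = [
    "They are serving as a symbol",
    "They are serving as a vibrant symbol",
]
predictions = {}
for context_size in (0, 1, 2):
    table = train(context_size)
    for prompt in prompts:
        word = predict(table, context_size, prompt)
        predictions[(context_size, prompt)] = word
        print(f"context_size={context_size}: {prompt!r} -> {word!r}")
```

With no context, the predictor always outputs the most frequent word, "the". With one word of context, it outputs "of" for both prompts. With two words of context, it reproduces "fo" after "vibrant symbol" because that exact phrase only ever appeared with the misspelling.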

## Summary




As you have seen in this lab, a low loss on the training data by itself is not a great indicator of whether a model is good or bad. The first model had the lowest loss (0.37). However, in addition to useful patterns (signal), it also learned undesirable patterns that were specific to the training data (noise). This was a classic case of **overfitting**.

The second model had a much higher training loss (7.55) and was not able to generate any meaningful continuations. The model did not learn the useful patterns (signal) in the training data, which makes it a classic case of **underfitting**.

The final model, with a loss of 2.77, managed to strike a good balance and was able to generate reasonable responses for both prompts. It learned the useful pattern that "of" generally appears after phrases such as "a symbol" or "a vibrant symbol" (signal), and it did not learn patterns specific to the training data (noise). This means it was able to learn more abstract patterns, which allows it to **generalize** to new situations.
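One way to get a feel for the three loss values is to convert them to perplexities. The sketch below does this under two assumptions that are not stated in this lab: that the reported loss is an average cross-entropy measured in nats, and that the vocabulary contains roughly 3,000 tokens (inferred only from the tokenizer file name).

```python
import math

# Training losses reported in this lab.
losses = {"10 epochs": 7.55, "400 epochs": 2.77, "1,000 epochs": 0.37}

# Assumption: about 3,000 tokens in the vocabulary, inferred from the
# tokenizer file name. Check `tokenizer.vocabulary_size` for the real value.
ASSUMED_VOCABULARY_SIZE = 3000

# Loss of a model that assigns the same probability to every token.
uniform_loss = math.log(ASSUMED_VOCABULARY_SIZE)

# Assumption: the loss is an average cross-entropy in nats, so exp(loss) is
# the perplexity, roughly the number of tokens the model is choosing between.
perplexities = {name: math.exp(loss) for name, loss in losses.items()}

for name, loss in losses.items():
    print(f"{name}: loss {loss:.2f}, perplexity {perplexities[name]:.1f}")
print(f"uniform guessing: loss {uniform_loss:.2f}")
```

Under these assumptions, the 10 epoch loss is close to the loss of uniform guessing, which fits the observation that this model learned almost nothing. The 1,000 epoch perplexity is close to 1, which suggests that the model is nearly certain about each training token, including the misspelled one.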

In the upcoming activities, you will learn more about the concepts of **overfitting** and **underfitting**, and how to avoid them. You will also learn how to make sure that the model **generalizes** well when responding to prompts that share less similarity with prompts in the training data.