> <p><small><small>This Notebook is made available subject to the licence and terms set out in the <a href = "http://www.github.com/google-deepmind/ai-foundations">AI Research Foundations Github README file</a>.

![](https://storage.googleapis.com/dm-educational/assets/ai_foundations/GDM-Labs-banner-image-C2-white-bg.png)

# Lab: Preprocess Data

<a href='https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_2/gdm_lab_2_1_preprocess_data.ipynb' target='_parent'><img src='https://colab.research.google.com/assets/colab-badge.svg' alt='Open In Colab'/></a>

Explore the fundamentals of preparing text data for building effective language models.

15 minutes


## Overview

In this lab, you will focus on essential **preprocessing techniques** for text data. The lab builds on the 'Exploring raw data' activity that you undertook in the previous module. In that activity, you prepared the data manually. Now, you will learn how to prepare the data for tokenization programmatically.

Large language models (LLMs) are often trained on vast publicly available datasets sourced from the web. The sheer volume of data acts as a buffer: it provides enough examples for the models to average out imperfections and still learn the underlying patterns and relationships, even when individual data points are erroneous, incomplete, or noisy.

Even when language models are large, data preprocessing is essential for text data sourced from the internet. This is because raw text data often contains irrelevant characters, inconsistencies, or formatting issues that can hinder model performance. You will first explore how to identify and remove HTML tags and entities from snippets of internet data. You will then be introduced to Unicode characters. Text data is not always written in standard Latin letters and numbers and Unicode provides a way to represent all types of characters, including complex ones such as symbols and emojis. Once you have learned how to preprocess data by removing HTML tags, you will remove emoji Unicode characters from a short piece of text.

Ultimately, your goal is to reduce noise while still keeping important information. Striking this balance is key to building models that are not only accurate but also **aligned with user expectations**. Every cleaning decision you make shapes your data and the behaviors your model will learn from it.

### What you will learn

By the end of this lab, you will understand:

* How to remove HTML tags using regular expressions.  
* How to replace common HTML entities with their corresponding characters.  
* How Unicode categories work and how they can be used to filter text.  

### Tasks

In this lab, you will:

* Implement a function to strip HTML tags from raw text.  
* Extend the function to clean common HTML entities like `&lt;`, `&gt;`, and `&amp;`.  
* Use the `unicodedata` package to explore character categories.  
* Adapt a cleaning function to keep letters, numbers, punctuation, and whitespace while removing emojis and other symbols.  

All of these steps are described in detail in the following sections.


## How to use Google Colaboratory (Colab)

Google Colaboratory (also known as Google Colab) is a platform that allows you to run Python code in your browser. The code is written in **cells** that are executed on a remote server.

To run a cell, hover over the cell and click on the `run` button to its left. The run button is the circle with the triangle (▶). Alternatively, you can also click on a cell and use the keyboard combination Ctrl+Return (or ⌘+Return if you are using a Mac).

To try this out, run the following cell. This should print today's day of the week below it.

In [None]:
from datetime import datetime

print(f"Today is {datetime.today():%A}.")

Note that the *order in which you run the cells matters*. When you are working through a lab, make sure to always run *all* cells in order; otherwise, the code might not work. If you take a break while working on a lab, Colab may disconnect you, and in that case, you have to execute all cells again before continuing your work. To make this easier, you can select the cell you are currently working on and then choose __Runtime → Run before__ from the menu above (or use the keyboard combination Ctrl/⌘ + F8). This will re-execute all cells before the current one.

## Imports

In this lab, you will use two built-in Python modules: `re`, which provides functions for working with regular expressions, and `unicodedata`, which provides access to the properties of Unicode characters.

Run the following cell to import the required packages.

In [None]:
%%capture
# Install the custom package for this course.
!pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"

import re # For defining and working with regular expressions.
import unicodedata # For working with unicode characters.
# Custom functions for providing feedback on your solutions.
from ai_foundations.feedback.course_2 import preprocess

## HTML tags

Text data sometimes includes leftover HTML tags like `<br>` or `<strong>`. These HTML elements are used to define the structure of web pages but can be irrelevant or undesirable when processing text data for other applications.
HTML tags follow certain patterns. They are usually wrapped between angle brackets like `<tag>` or `</tag>`. For example:

```
<strong>agriculture sector</strong>
```

This predictable pattern means that you can use regular expressions to clean them.

<br />

------
> **ℹ️ Info: Regular expressions**
>
> A **regular expression**, or "regex," is a special sequence of characters that defines a search pattern. It essentially acts as a powerful and flexible "find and replace" tool for text. For your task of stripping HTML tags from raw text, you can use a regular expression to create a pattern that precisely identifies the structure of any HTML tag, which typically starts with a `<` and ends with a `>`. In Python, you will use the built-in [`re` module](https://docs.python.org/3/howto/regex.html#regex-howto) to apply this pattern. By using a function like `re.sub()`, you can instruct Python to find all substrings in your raw text that match this HTML tag pattern and replace them with an empty string (`""`). This effectively deletes the tags and leaves you with only the clean, human-readable content.
>
------
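As a small illustration of how `re.sub()` acts as a find-and-replace tool, the sketch below uses a pattern that is unrelated to HTML. The sample sentence and the `\d+` pattern are chosen purely for illustration and are not part of the lab data.

```python
import re

# Illustrative sample sentence; not part of the lab data.
sentence = "Order 66 shipped in 3 boxes."

# r"\d+" matches one or more consecutive digits.
masked = re.sub(r"\d+", "#", sentence)   # Replace every match with "#".
removed = re.sub(r"\d+", "", sentence)   # Replace with "" to delete matches.

print(masked)   # Order # shipped in # boxes.
print(removed)  # Order  shipped in  boxes.
```

Note that deleting a match leaves the surrounding whitespace untouched, which is why the second result contains double spaces.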


The code in the cell below removes HTML tags. The regex pattern to achieve this is `r'<.*?>'`. This pattern matches anything enclosed between `<` and `>`. To understand this better, consider each of its components:

* `r''` (raw string): The `r` at the beginning isn't part of the search pattern itself. It is a Python instruction to treat the string as a "raw string." This represents best practice for writing regular expressions, as it prevents the backslash character from being treated specially, avoiding potential errors.

* `<` and `>` (literal characters): These are the literal characters that anchor your pattern. The regex engine will look for an actual less-than sign (`<`) to begin a match and an actual greater-than sign (`>`) to end it.

* `.` (wildcard): The dot is a special metacharacter that acts as a wildcard. It will match any single character (except for a newline).

* `*` (quantifier): The asterisk is a quantifier. It modifies the character right before it (the `.` wildcard) and tells the engine to match it zero or more times. So, `.*` together means "match any sequence of any characters."

* `?` (lazy modifier): By default, the `*` quantifier tries to match the longest possible string. The question mark `?` right after it flips it into "lazy" mode, which means it will match the shortest possible string.
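To see how these components work together, the following sketch lists exactly which substrings the pattern matches. It also illustrates the newline caveat of the `.` wildcard. The sample strings are made up for illustration, and the `re.DOTALL` flag is shown only as one possible way to let `.` match newlines; it is not used elsewhere in this lab.

```python
import re

# Illustrative sample strings; not part of the lab data.
sample = "<p>Hello <b>world</b></p>"
tags = re.findall(r"<.*?>", sample)
print(tags)  # ['<p>', '<b>', '</b>', '</p>']

# A tag that is split across two lines. Because "." does not match a
# newline by default, the opening tag is not matched.
multiline = '<a\nhref="x">link</a>'
default_result = re.sub(r"<.*?>", "", multiline)
dotall_result = re.sub(r"<.*?>", "", multiline, flags=re.DOTALL)

print(repr(default_result))  # '<a\nhref="x">link'
print(repr(dotall_result))   # 'link'
```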

Run the following cell to define and test the `clean_html_tags` function.

In [None]:
def clean_html_tags(text: str) -> str:
    """
    Remove every HTML tag in a string.

    Applies the non-greedy pattern '<.*?>' so that anything enclosed in angle
    brackets, such as '<h1>', '</p>', or '<img src="..." />', is deleted
    while the surrounding text is left intact.

    Args:
      text: Raw text that may contain HTML tags.

    Returns:
      Plain text with all HTML tags removed.
    """
    regex_pattern = r"<.*?>"  # Match anything between < and >
    return re.sub(regex_pattern, "", text)


text = (
    "<h3>Let's come together to <strong>win</strong> this match."
    " <br>We can do it!</h3>"
)
cleaned_text = clean_html_tags(text)

print(f"Before cleaning: \n\ttext: {text}\n")
print(f"After cleaning: \n\ttext: {cleaned_text}")

Investigate the "lazy" modifier's relevance by removing it from the pattern, running the code again, and observing how the output changes.

You will find that the cleaned text is now empty. The regex matched the `<` of the first tag `<h3>` with the `>` of the last tag `</h3>`. Without `?`, `.*` matches as much of the string as possible, so `<.*>` finds one big match from the very first `<` to the very last `>`.
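The difference between the two modes can also be inspected directly with `re.findall`, as in the sketch below. The negated character class `<[^>]*>` at the end is a common alternative way to write the same idea; it is included here as an assumption for comparison and is not the pattern this lab uses.

```python
import re

snippet = (
    "<h3>Let's come together to <strong>win</strong> this match."
    " <br>We can do it!</h3>"
)

lazy_matches = re.findall(r"<.*?>", snippet)   # Shortest possible matches.
greedy_matches = re.findall(r"<.*>", snippet)  # Longest possible match.

print(lazy_matches)         # ['<h3>', '<strong>', '</strong>', '<br>', '</h3>']
print(len(greedy_matches))  # 1 (the entire string is a single match)

# Alternative: "<", then any characters that are not ">", then ">".
alternative = re.sub(r"<[^>]*>", "", snippet)
print(alternative)  # Let's come together to win this match. We can do it!
```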


### Coding Activity 1: Other HTML entities

You have observed how HTML tags follow a pattern like `<tag>`, and you used a regular expression pattern (`<.*?>`) to replace them with an empty string. When text data is derived from web pages on the internet, the raw source sometimes also includes HTML entities like:

- `&lt;` for `<`

- `&gt;` for `>`

- `&amp;` for `&`

- `&nbsp;` for non-breaking space.

Next, run the cell below on the sample text to clean up HTML tags. You will notice the task is not yet complete as text like `&gt;` still appears in the output.

In [None]:
text = (
    "<p>The Krowor Municipal District was carved out of the Ledzokuku-Krowor"
    " Municipal District in 2018 &amp; its population is &gt; 200000.</p>"
)
cleaned_text = clean_html_tags(text)
print(cleaned_text)

Even though the tags are removed, some HTML entities are still present.

<br />

------
> **💻 Your task:**
>
> Complete the function `clean_html` below which removes HTML tags and other common entities.
>
> In addition to stripping HTML tags, it should perform the following replacements:
> 1. `"&nbsp;"` should be replaced with `" "`.
> 2. `"&amp;"` should be replaced with `"&"`.
> 3. `"&lt;"` should be replaced with `"<"`.
> 4. `"&gt;"` should be replaced with `">"`.
>
------



In [None]:
def clean_html(text: str) -> str:
    """
    Strip basic HTML markup and common entities from a string.

    The function does not attempt full HTML parsing; for more complex markup
    consider `BeautifulSoup` or `html.unescape`.

    Args:
      text: The text string that may contain HTML tags or entities.

    Returns:
      A cleaned string with tags stripped and the entities '&nbsp;', '&amp;',
        '&lt;' and '&gt;' converted to ' ', '&', '<' and '>'.
    """

    # Remove HTML tags.
    text = re.sub(r"<.*?>", "", text)

    # Replace HTML entities.
    text = re.sub("&nbsp;", " ", text)  # Replace non-breaking space with space.
    # Add your code to the next three lines.
    text =
    text =
    text =

    return text

Test your function. Make sure that `&lt;` is replaced with `<`, `&gt;` is replaced with `>`, and `&amp;` is replaced with `&`:

In [None]:
text = (
    "<p>The Krowor Municipal District was carved out of the Ledzokuku-Krowor"
    " Municipal District in 2018 &amp; its population is &gt; 200000.</p>"
)

cleaned_text = clean_html(text)

print(f"Before cleaning: \n\ttext: {text}\n")
print(f"After cleaning: \n\ttext: {cleaned_text}\n")

preprocess.test_clean_html(clean_html)
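The docstring of `clean_html` mentions `html.unescape` as an option for more complex markup. The sketch below shows how that standard-library function behaves on a few entities. It is not the expected solution to the activity, and its behavior differs from the task in one respect: `&nbsp;` becomes the non-breaking space character (`\xa0`), not a regular space.

```python
import html

# Illustrative strings; not part of the lab data.
raw = "2018 &amp; its population is &gt; 200000"
unescaped = html.unescape(raw)
print(unescaped)  # 2018 & its population is > 200000

# "&nbsp;" is decoded to the non-breaking space character U+00A0.
nbsp_result = html.unescape("a&nbsp;b")
print(repr(nbsp_result))  # 'a\xa0b'

# Entities are decoded in a single pass, so an escaped entity is only
# decoded once.
once = html.unescape("&amp;lt;")
print(once)  # &lt;
```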

## Unicode characters

Not all text is written in ASCII characters. The ASCII set includes the numbers from 0 to 9, the upper and lower case English letters (A to Z), and some special characters. Many languages are written in scripts that are not expressed in this limited set of characters. Furthermore, people may add other special characters like emojis to express themselves in a written form.
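A quick way to check whether a string stays within the ASCII set is the built-in `str.isascii()` method (available from Python 3.7). The sample strings below are illustrative.

```python
# Illustrative sample strings; not part of the lab data.
samples = ["rice", "₦150000", "汉字", "Ah! 😱"]

# isascii() returns True only if every character is in the ASCII range.
ascii_flags = [s.isascii() for s in samples]

for s, flag in zip(samples, ascii_flags):
    print(repr(s), flag)
```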

Unicode gives all symbols, letters, numbers, punctuation, emojis, and even scripts like Arabic or Chinese, a unique ID number called a code point. An emoji such as 😱 is not a picture pasted into your text. Instead, it is the character with code point `U+1F631` in the Unicode character set.

To get the Unicode code point of any character (a letter, an emoji, a special symbol, etc.), you can combine the Python functions `hex` and `ord`:

In [None]:
hex(ord('😱'))

The output above prints the code point in hexadecimal format. In Python, hexadecimal numbers always begin with `0x`. Whatever follows `0x` is the Unicode code point.
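If you prefer the conventional `U+XXXX` notation, the following sketch formats the code point directly and shows that `chr` reverses `ord`. The helper name `code_point_label` is hypothetical and introduced only for this illustration.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    # ord() returns the code point as an integer; "04X" renders it as
    # uppercase hexadecimal, padded to at least four digits.
    return f"U+{ord(ch):04X}"


print(code_point_label("😱"))  # U+1F631
print(code_point_label("A"))   # U+0041
print(code_point_label("₦"))   # U+20A6

# chr() is the inverse of ord(): it turns a code point into a character.
print(chr(0x1F631))  # 😱
```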

Once you know these code points you can keep, replace, or strip them with simple regex rules. For example, all emojis fall into certain ranges of code points and you can use this information to remove all of them. You can also use specific code points to map certain emojis to words like `sad` or `happy` depending on what your model needs.
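A minimal sketch of this range-based approach is shown below. It assumes that only the Unicode "Emoticons" block (`U+1F600` to `U+1F64F`) is of interest; emojis are spread across several other blocks as well, so this range is deliberately incomplete. The names `EMOTICON_PATTERN`, `EMOJI_WORDS`, `strip_emoticons`, and `emoticons_to_words` are hypothetical.

```python
import re

# Assumption: only the "Emoticons" block is covered (U+1F600 to U+1F64F).
EMOTICON_PATTERN = re.compile("[\U0001F600-\U0001F64F]")

# Hypothetical mapping from specific emojis to words.
EMOJI_WORDS = {"😊": "happy", "😢": "sad"}


def strip_emoticons(text: str) -> str:
    """Delete every character in the Emoticons block."""
    return EMOTICON_PATTERN.sub("", text)


def emoticons_to_words(text: str) -> str:
    """Replace mapped emoticons with words and delete unmapped ones."""
    return EMOTICON_PATTERN.sub(
        lambda match: EMOJI_WORDS.get(match.group(), ""), text
    )


message = "Ah! 😱 We won 😊"
print(repr(strip_emoticons(message)))     # 'Ah!  We won '
print(repr(emoticons_to_words(message)))  # 'Ah!  We won happy'
```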

However, there is an even more efficient way to process Unicode characters. Every character in Unicode (letters, digits, emojis, control codes, etc.) is labelled with a two-letter code, its general category, that tells you what kind of symbol it is. The first letter is the broad group and the second letter is a finer subdivision. The following example uses the `unicodedata` module to find a character's category:


In [None]:
print('Symbol 😊\'s Unicode category is:', unicodedata.category('😊'))
print('Symbol 😱\'s Unicode category is:', unicodedata.category('😱'))

-----
> **ℹ️ Info: `unicodedata` module**
>
> `unicodedata` is a Python module that lets you look up information about any Unicode character. You can use this module to look up an emoji's formal name, its numeric value (if it has one), and its general category. This category is particularly useful for cleaning texts, as you will see in the next activity.
>
------
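The sketch below looks up the three properties mentioned in the info box for a few illustrative characters. The second argument passed to `unicodedata.name` and `unicodedata.numeric` is a default that is returned when a character has no name or no numeric value; without it, these functions raise a `ValueError`.

```python
import unicodedata

# Illustrative characters: an emoji, a fraction, and a letter.
properties = {}
for ch in ["😱", "½", "a"]:
    name = unicodedata.name(ch, "<no name>")  # Formal Unicode name.
    value = unicodedata.numeric(ch, None)     # Numeric value, if any.
    category = unicodedata.category(ch)       # Two-letter category.
    properties[ch] = (name, value, category)
    print(ch, name, value, category)

# 😱 FACE SCREAMING IN FEAR None So
# ½ VULGAR FRACTION ONE HALF 0.5 No
# a LATIN SMALL LETTER A None Ll
```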

The emojis 😊 and 😱 have the Unicode category `So`, which stands for "Symbol, other." If you want to remove emojis as you preprocess text, this should give you a clue on how to proceed.

More generally, the table below explains the Unicode category system.


| Category | Meaning            | Common sub-codes & examples |
|-------------|--------------------|-----------------------------|
| **L\***     | Letter             | `Lu` = uppercase (A), `Ll` = lowercase (a), `Lt` = titlecase (ǅ), `Lm` = modifier (ʰ), `Lo` = other letters (汉, ע) |
| **M\***     | Mark               | `Mn` = nonspacing marks (combining accents), `Mc` = spacing combining marks, `Me` = enclosing marks |
| **N\***     | Number             | `Nd` = decimal digits (0-9, ٠–٩), `Nl` = letter numbers (Ⅻ), `No` = other numbers (½, ²) |
| **P\***     | Punctuation        | `Po` = other punctuation (!, ?), `Pd` = dash (—), `Ps`/`Pe` = opening/closing brackets, `Pi`/`Pf` = initial/final quotation marks |
| **S\***     | Symbol             | `Sm` = math (±, √), `Sc` = currency (₦, $), `Sk` = modifier (^, ¨), `So` = other symbols (😊, ⭐) |
| **Z\***     | Separator          | `Zs` = space, `Zl` = line, `Zp` = paragraph |
| **C\***     | Other / Control    | `Cc` = control codes (newline, tab), `Cf` = formatting marks (zero-width joiner), `Cs` = surrogates, `Co`/`Cn` = private-use or unassigned |
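You can verify entries of this kind yourself with `unicodedata.category`. The sketch below checks a handful of illustrative characters, one or two per major group.

```python
import unicodedata

# Illustrative characters covering several major categories.
examples = ["A", "a", "汉", "5", "½", "!", "(", "±", "₦", "😊", " ", "\n"]

categories = {ch: unicodedata.category(ch) for ch in examples}
for ch, category in categories.items():
    # repr() makes the space and the newline visible in the output.
    print(repr(ch), category)
```

The first letter of each result gives the broad group, so a check such as `category.startswith("S")` covers all symbol sub-codes at once.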





Because the categories are consistent across every language and script, they allow you to define a template for processing your texts, depending on the needs of your task:

1.  Keep what you need, for instance `L*` (letters), `N*` (numbers), and maybe `P*` (punctuation) and spaces.

2. Remove or replace what you do not need (e.g., you would look for `S*` to drop emojis or convert them to tags, or `C*` to remove characters that could break processing).
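The "replace" branch of this template can be sketched as follows. This is only one possible design, under the assumption that `So` symbols should become tags built from their Unicode names and that `C*` characters other than whitespace should be dropped. The function name `tag_symbols` and the tag format are hypothetical.

```python
import unicodedata


def tag_symbols(text: str) -> str:
    """Turn 'So' symbols into name tags and drop non-whitespace 'C*' chars."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if ch.isspace():
            out.append(ch)  # Keep whitespace, including newlines and tabs.
        elif category.startswith("C"):
            continue  # Drop control and formatting characters.
        elif category == "So":
            # Build a tag from the character's name, with a fallback for
            # characters that have no name.
            name = unicodedata.name(ch, "symbol").lower().replace(" ", "_")
            out.append(f"<{name}>")
        else:
            out.append(ch)  # Keep everything else unchanged.
    return "".join(out)


print(tag_symbols("Ah! 😱"))             # Ah! <face_screaming_in_fear>
print(tag_symbols("zero\u200bwidth"))    # zerowidth
```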





### Coding Activity 2: Process Unicode strings

------
> **💻 Your task:**
>
> Adapt the function `clean_unicode` in the cell below.
>
> This function should preserve letters, numbers, punctuation, and whitespace in the input text and it should remove all other characters.
>
> Currently, the `categories_to_keep` variable is defined such that the function preserves only letters and numbers. Your task is to change this variable so that you also keep punctuation marks. By tweaking the category checks, you can make a stricter or softer cleaning function. For example, you can choose to keep the currency symbols for a finance project or decide to map certain emojis to words for a task such as sentiment analysis.
>
------

In [None]:
def clean_unicode(text: str) -> str:
    """
    Removes non-text unicode characters from a string.

    Args:
      text: The original text which may contain special characters.

    Returns:
      The input text string with emojis and other non-text symbols removed.
    """

    # Currently this function only preserves letters and numbers because the
    # `categories_to_keep` set only contains "L" (letters) and "N" (numbers).
    #
    # Change `categories_to_keep` so that you also preserve punctuation marks.

    categories_to_keep = {"L", "N"}  # Change code here.

    keep = []
    for ch in text:
        do_keep = ch.isspace() # Preserve spaces.
        if not do_keep:
            for category in categories_to_keep:
                if unicodedata.category(ch).startswith(category):
                    do_keep = True
                    break
        if do_keep:
            keep.append(ch)
    return "".join(keep)


text = "Bag of rice now cost ₦150000 naira. Ah! 😱 Èdakun o"
cleaned_text = clean_unicode(text)
print(f"Before cleaning: \n\ttext: {text}\n")
print(f"After cleaning: \n\ttext: {cleaned_text}")

preprocess.test_clean_unicode(clean_unicode)

You should consider the `clean_unicode` function as a flexible starting point, not a universal solution for every task. Its real power lies in its adaptability. You can easily change what it keeps or discards by modifying the `categories_to_keep` set. For example, if you wanted to preserve currency symbols like ₦, you would simply add the Unicode category "Sc" (Symbol, currency) to the set.
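To experiment with different sets without editing the function each time, one option is to pass the categories as a parameter. The sketch below assumes a hypothetical helper named `keep_categories`; it follows the same idea as `clean_unicode` but is not part of the course package. The sample sentence is made up for illustration.

```python
import unicodedata


def keep_categories(text: str, categories: set[str]) -> str:
    """Keep whitespace and characters whose category starts with a prefix."""
    prefixes = tuple(categories)  # str.startswith accepts a tuple.
    return "".join(
        ch
        for ch in text
        if ch.isspace() or unicodedata.category(ch).startswith(prefixes)
    )


sample = "Rice now costs ₦150000. Ah! 😱"

letters_numbers = keep_categories(sample, {"L", "N"})
with_currency = keep_categories(sample, {"L", "N", "Sc"})
with_all_symbols = keep_categories(sample, {"L", "N", "S"})

print(repr(letters_numbers))   # 'Rice now costs 150000 Ah '
print(repr(with_currency))     # 'Rice now costs ₦150000 Ah '
print(repr(with_all_symbols))  # 'Rice now costs ₦150000 Ah 😱'
```

A two-letter entry such as "Sc" keeps only that sub-category, whereas a one-letter entry such as "S" keeps every symbol, including emojis.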

How you clean text depends entirely on your project's goal. For instance, when processing social media data, you might want to:

* **Process emojis**: You could strip them out completely, or you could replace them with descriptive tags like `<emoji_happy>` or `<emoji_sad>` for sentiment analysis.

* **Process hashtags**: Instead of removing them, you could extract the text by just dropping the `#` symbol.

* **Map special characters**: You could replace all currency symbols with a special `<money>` tag or emoticons such as `<3` with a `<heart>` tag.

Each of these choices customizes the cleaning process for a specific application, whether it's chatbots, analyzing customer feedback, or developing a model that generates poems.
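A few of these choices can be sketched in a handful of lines. The function name `preprocess_social`, the tag names, and the sample message below are hypothetical and only illustrate the idea; a real pipeline would need rules tailored to its data.

```python
import re
import unicodedata


def preprocess_social(text: str) -> str:
    """Apply a few illustrative social media cleaning rules."""
    # Map the "<3" emoticon to a tag.
    text = text.replace("<3", "<heart>")
    # Keep the hashtag text but drop the leading "#".
    text = re.sub(r"#(\w+)", r"\1", text)
    # Replace every currency symbol (category "Sc") with a tag.
    text = "".join(
        "<money>" if unicodedata.category(ch) == "Sc" else ch for ch in text
    )
    return text


result = preprocess_social("I <3 #jollof, only ₦500!")
print(result)  # I <heart> jollof, only <money>500!
```

Because tags such as `<heart>` are themselves enclosed in angle brackets, they would match the `<.*?>` pattern, so the order of this step relative to HTML tag removal matters.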

Ultimately, the way you clean your data directly influences what your model learns. Each decision, whether normalizing slang, handling punctuation, or preserving cultural symbols, must be deliberate and aligned with your goal. While automated tools are great for streamlining the process, manual review is essential. This is especially the case for nuanced or culturally rich content. A machine may not be able to infer what information in a text is meaningful for a specific task and that judgment often requires a human touch.


## Summary

In this lab, you gained experience with processing noisy data and learned how to clean HTML and Unicode elements. For cleaning HTML, you used **regular expressions**. For removing or replacing Unicode elements, you used the `unicodedata` module.

In the next module, you will explore **tokenization**, the process of breaking text into smaller units. Building on your understanding of space tokenizers from the previous course, you will explore more advanced tokenization techniques. These offer greater flexibility, especially when handling previously unseen or rare words.

## Solutions

The following cells provide reference solutions to the coding activities in this notebook. If you really get stuck after trying to solve the activities yourself, you may want to consult these solutions.

It is recommended that you *only* look at the solutions after you have tried to solve the activities *multiple times*. The best way to learn challenging concepts in computer science and artificial intelligence is to debug your code piece-by-piece until it works, rather than copying existing solutions.

If you feel stuck, you may want to first try to debug your code, for example, by adding print statements to see what your code is doing at every step. This will provide you with a much deeper understanding of the code and the materials. It will also provide you with practice on how to solve challenging coding problems beyond this course.

To view the solutions for an activity, click on the arrow to the left of the activity name. If you consult the solutions, do not copy and paste them into the cells above. Instead, look at them, and type them manually into the cell. This will help you understand where you went wrong.


### Coding Activity 1

In [None]:
# Complete implementation of the clean_html function.
def clean_html(text: str) -> str:
    """
    Strip basic HTML markup and common entities from a string.

    The function does **not** attempt full HTML parsing; for more complex markup
    consider ``BeautifulSoup`` or ``html.unescape``.

    Args:
        text: The text string that may contain HTML tags or entities.

    Returns:
        A cleaned string with tags stripped and the entities '&nbsp;', '&amp;',
        '&lt;' and '&gt;' converted to ' ', '&', '<' and '>'.
    """

    # Remove HTML tags.
    text = re.sub(r"<.*?>", "", text)

    # Replace HTML entities.
    text = re.sub("&nbsp;", " ", text)  # Replace non-breaking space.
    text = re.sub("&amp;", "&", text)  # Replace "&amp;" with "&".
    text = re.sub("&lt;", "<", text)  # Replace "&lt;" with "<".
    text = re.sub("&gt;", ">", text)  # Replace "&gt;" with ">".

    return text


### Coding Activity 2

In [None]:
# Complete implementation of the clean_unicode function.
def clean_unicode(text: str) -> str:
    """
    Removes non-text unicode characters from a string.

    Args:
      text: The original text which may contain special characters.

    Returns:
      The input text string with emojis and other non-text symbols removed.
    """

    categories_to_keep = {"L", "N", "P"}  # L=letters, N=numbers, P=punctuation.

    keep = []
    for ch in text:
        do_keep = ch.isspace() # Preserve spaces.
        if not do_keep:
            for category in categories_to_keep:
                if unicodedata.category(ch).startswith(category):
                    do_keep = True
                    break
        if do_keep:
            keep.append(ch)
    return "".join(keep)
