<a href="https://colab.research.google.com/github/mridul-sahu/tokenizing-with-sentencepiece/blob/main/Tokenizing_with_SentencePiece.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SentencePiece: Turning Text into Tidy Tokens for Our Translator

*(Step One in Building Our English-Spanish Translator: Making Sense of Sentences!)*

---

## First Things First: Why We Need to 'Tokenize' Our Text (and What That Even Means!)

So, you're ready to build an English-to-Spanish translator! That's exciting. But before our fancy Transformer model can start doing its translation magic, we need to get our text data into a shape that computers can actually work with.

**The Challenge: Computers Don't "Read" Like We Do**

Here's a little secret: computers are brilliant at many things, but understanding raw text like "Hello" or "Hola" isn't one of them, at least not initially. To a computer, a sentence is just a string of characters. To make it learn, we need to convert these sentences into a more structured, numerical format. This conversion process is generally called **tokenization**.

**Our Tool for the Job: SentencePiece!**

Enter **SentencePiece**. It’s a super handy tool from Google that helps us cleverly break down text into smaller, manageable units called "tokens" (often sub-words, like parts of words). Here's why it's a great choice for our project:

* **Plays Nicely with (Almost) Any Language:** Whether it's English, Spanish, or something more exotic, SentencePiece is designed to handle it without needing lots of language-specific rules.
* **Smart About Unknown Words:** Ever seen a program go "Huh?" when it finds a new word? SentencePiece is better than that. It can break down unfamiliar words into smaller pieces it *does* recognize. This is a big plus!
* **Works Directly with Raw Text:** We don't need to fuss too much with pre-splitting sentences by spaces. SentencePiece figures out the best way to segment text directly.
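
To make the last two bullets a little more concrete, here is a toy sketch. It is not how SentencePiece works internally (the real unigram and BPE algorithms learn their vocabulary from data and are more sophisticated); it just does a greedy longest-match over a tiny vocabulary. The names `TOY_VOCAB`, `toy_segment`, and `toy_detokenize` are made up for this illustration. The one detail borrowed from SentencePiece is the `▁` marker, which stands in for a space so that whitespace survives tokenization.

```python
# Toy illustration only: greedy longest-match over a hand-written vocabulary.
# TOY_VOCAB is invented for this example; a real SentencePiece model learns
# its pieces from the training text.
TOY_VOCAB = {"▁token", "token", "▁un", "iz", "able", "s"}


def toy_segment(text, vocab=TOY_VOCAB):
    """Split text into the longest known pieces, falling back to single characters."""
    # Mark every space (and the start of the text) with "▁" so whitespace
    # is kept as an ordinary symbol instead of being thrown away.
    marked = "▁" + text.replace(" ", "▁")
    pieces = []
    i = 0
    while i < len(marked):
        # Try the longest candidate first, then shrink until something matches.
        # A single character always matches, so the loop always advances.
        for j in range(len(marked), i, -1):
            if marked[i:j] in vocab or j == i + 1:
                pieces.append(marked[i:j])
                i = j
                break
    return pieces


def toy_detokenize(pieces):
    """Undo toy_segment: join the pieces and turn "▁" back into spaces."""
    return "".join(pieces).replace("▁", " ").lstrip(" ")


pieces = toy_segment("untokenizable tokens")
print(pieces)
print(toy_detokenize(pieces))
```

Even though "untokenizable" is not in the toy vocabulary, it still gets split into pieces that are, and joining the pieces gives back the original text.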

**Our Goal in This Section:**

In this first part of our tutorial, we're going to:
1.  Get our English and Spanish sentences ready.
2.  Use SentencePiece to "learn" the best way to tokenize this combined text.
3.  Create our very own custom tokenizer. This tokenizer will be a crucial component that turns any new English or Spanish sentence into a sequence of numerical IDs – the perfect input for our future translation model!

Ready to get started with some (lighthearted) tokenizing tomfoolery? Let's go!

## What You'll Need: Assembling Our Tokenizing Toolkit

Welcome, Colab adventurer! Before we get SentencePiece to work its magic on our text, let's quickly go over what we'll be using.

1.  **The `sentencepiece` Library:** This is our star tokenizing tool.

2.  **Our English-Spanish Dataset (`spa-eng.zip`):**
    This is the text we'll be teaching SentencePiece to understand. It's a collection of English sentences paired with their Spanish translations. We'll download this dataset directly from the web using its URL:
    `http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip`

And that's our short list! We're almost ready to get our hands dirty.

## Step 1: Summoning Our Tokenizing Sidekick - SentencePiece!

Alright, with our toolkit mentally assembled, our first real action is to ensure `sentencepiece` is part of our Python environment. This library is the star of our tokenizing show, the one that will do all the clever text-splitting.

In [None]:
!pip install sentencepiece

## Step 2: Getting Our Text Ready for Tokenizing

With SentencePiece installed, our next job is to prepare the actual text it's going to learn from. Think of it as laying out all the different types of word "blocks" so SentencePiece can figure out the best way to create its LEGO set.

We need to:
1.  Get our hands on the English-Spanish sentence pairs.
2.  Combine all these sentences (both English and Spanish) into one big file, so that SentencePiece learns a single shared vocabulary covering both languages.

Let's get to it!

### 2.1. Downloading and Unpacking Our Text Data

We'll use the `spa-eng.zip` dataset. We will download this file and extract the `spa.txt` file we need, which contains English sentences and their Spanish translations, tab-separated.

In [None]:
import urllib.request
import zipfile
import pathlib

# Setup paths
data_dir = pathlib.Path("/tmp/data")
zip_file_name = "spa-eng.zip"
zip_file_path = data_dir / zip_file_name
dataset_url = "http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"
extracted_dir_path = data_dir / "spa-eng" # The zip extracts into this subfolder
spa_txt_file_path = extracted_dir_path / "spa.txt"

# Create data directory
data_dir.mkdir(parents=True, exist_ok=True)

# Download if needed
if not zip_file_path.exists():
    print(f"Downloading {zip_file_name}...")
    urllib.request.urlretrieve(dataset_url, zip_file_path)
    print(f"Downloaded to {zip_file_path}")
else:
    print(f"{zip_file_name} already downloaded.")

print(f"Extracting {zip_file_name}...")
with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall(data_dir)
print(f"Extracted. Check for {spa_txt_file_path}")

# Final check
if spa_txt_file_path.exists():
    print(f"Great! {spa_txt_file_path} is ready.")
else:
    print(f"Hmm, {spa_txt_file_path} not found. Please check the download/extraction steps.")
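
Before moving on, it can be reassuring to peek at a few sentence pairs. The helper below is a small sketch, not part of the original notebook: `read_pairs` is a hypothetical name, and it assumes each line of `spa.txt` is an English sentence, a tab, and the Spanish translation, as described above.

```python
import pathlib


def read_pairs(path, limit=3):
    """Return up to `limit` (english, spanish) tuples from a tab-separated file."""
    pairs = []
    with pathlib.Path(path).open(encoding="utf-8") as f:
        for line in f:
            if len(pairs) >= limit:
                break
            # Split on the first tab only: English before it, Spanish after it.
            english, sep, spanish = line.rstrip("\n").partition("\t")
            if not sep:
                continue  # skip lines without a tab (e.g. blank lines)
            pairs.append((english, spanish))
    return pairs


# Usage sketch, assuming the download cell above has already run.
spa_txt = pathlib.Path("/tmp/data/spa-eng/spa.txt")
if spa_txt.exists():
    for english, spanish in read_pairs(spa_txt):
        print(f"{english!r} -> {spanish!r}")
```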

### 2.2. Creating a Combined Training File for SentencePiece

Alright, we have our `spa.txt` file, where each line typically holds an English sentence and its Spanish translation, separated by a tab. Now, here's a crucial bit: **SentencePiece is smart, but it won't magically know to treat the tab-separated parts as two distinct sentences if we give it that file directly; it learns best when each sentence it needs to process is on its own line.**

So, our next task is to create a single, clean text file where *every* sentence (whether it was originally English or Spanish) gets its own dedicated line. This combined file will be the main textbook from which SentencePiece learns its tokenizing craft for both languages.

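We'll use the `sed` command to replace the first tab (`\t`) on each line with a newline character (`\n`), so every English-Spanish pair becomes two lines. This relies on GNU `sed`, which Colab provides; the BSD `sed` that ships with macOS does not interpret `\t` and `\n` this way.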

In [None]:
!sed 's/\t/\n/' /tmp/data/spa-eng/spa.txt > /tmp/data/all_english_spanish_for_sp_training.txt
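
If you are running outside Colab, or simply prefer to stay in Python, the same step can be sketched with the standard library. This is an alternative under the same assumptions as the `sed` command (one tab-separated pair per line), and `write_one_sentence_per_line` is a hypothetical helper name, not something from the original notebook.

```python
import pathlib


def write_one_sentence_per_line(src_path, dst_path):
    """Copy src to dst, replacing the first tab on each line with a newline."""
    src = pathlib.Path(src_path)
    dst = pathlib.Path(dst_path)
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            # count=1 mirrors sed's default of replacing only the first match.
            fout.write(line.replace("\t", "\n", 1))


# Usage sketch with the same paths as the sed command above.
src = pathlib.Path("/tmp/data/spa-eng/spa.txt")
if src.exists():
    write_one_sentence_per_line(
        src, "/tmp/data/all_english_spanish_for_sp_training.txt"
    )
```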

## Step 3: Training Our SentencePiece Model – Let the Learning Begin!

Alright, our text data is prepped and waiting in the `all_english_spanish_for_sp_training.txt` file. Now it's time for SentencePiece to actually learn from it. This "training" process involves SentencePiece analyzing the text to identify common character sequences and build its vocabulary of subword units.

We'll use the `sentencepiece.SentencePieceTrainer.train()` function. It takes a few important settings that tell it how to learn.

In [None]:
import sentencepiece as spm
import pathlib # If not already imported


# --- Parameters for SentencePiece Training ---

# Path to the combined text file we created in Step 2
input_file = str(data_dir / "all_english_spanish_for_sp_training.txt")

# Prefix for the output model and vocabulary files.
# This will create files like 'eng_spa_spm.model' and 'eng_spa_spm.vocab'
# in your 'data_dir'.
model_prefix = str(data_dir / 'eng_spa_spm')

# Desired vocabulary size. This is how many unique subword "pieces" our model will know.
# 16000 is a common starting point, but you can experiment.
vocab_size = 16000

# Model type: 'unigram' (default) or 'bpe'.
# Both are good subword algorithms. 'unigram' often works very well.
model_type = 'unigram'

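# Character coverage: the fraction of characters in the training text that the
# model must cover. 0.9995 is the library default; the SentencePiece docs suggest
# 1.0 for languages with a small character set, such as English and Spanish.
# Characters outside the coverage are mapped to <unk>, so try 1.0 if you see
# <unk> showing up for rare characters.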
character_coverage = 0.9995

# Special Token IDs: It's wise to define these explicitly for consistency.
# We'll use these IDs later when preparing data for our Transformer.
#   ID 0: <pad> (Padding token)
#   ID 1: <s>   (Beginning Of Sentence - BOS)
#   ID 2: </s>  (End Of Sentence - EOS)
#   ID 3: <unk> (Unknown token)
pad_id = 0
bos_id = 1
eos_id = 2
unk_id = 3

# Let's train this thing!
try:
    print(f"Starting SentencePiece model training with input: {input_file}")
    spm.SentencePieceTrainer.train(
        input=input_file,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type=model_type,
        character_coverage=character_coverage,
        pad_id=pad_id,
        bos_id=bos_id,
        eos_id=eos_id,
        unk_id=unk_id,
        # By default, SentencePiece normalizes text with the 'nmt_nfkc' rule
        # (NFKC-based). You can pick another rule if needed, e.g.:
        # normalization_rule_name='nmt_nfkc_cf'  # NFKC plus case folding
    )
    print("SentencePiece model training complete!")
    print(f"Model saved as: {model_prefix}.model")
    print(f"Vocabulary saved as: {model_prefix}.vocab")
except OSError:
    # Depending on the version, SentencePiece may report a missing input file
    # as a plain OSError rather than FileNotFoundError, so we catch the broader
    # class (FileNotFoundError is a subclass of OSError).
    print(f"ERROR: Could not read input file at {input_file}. Please ensure Step 2 ran correctly and the file exists.")
except Exception as e:
    print(f"An error occurred during SentencePiece training: {e}")
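
Curious what SentencePiece actually learned? The `.vocab` file is plain text, with one piece and its score per line separated by a tab, so we can peek at it with a few lines of Python. This is a sketch rather than an official API: `read_vocab` is a hypothetical helper, and it assumes the line order matches the piece IDs (so with the settings above, the first four lines should be `<pad>`, `<s>`, `</s>`, and `<unk>`).

```python
import pathlib


def read_vocab(vocab_path, limit=None):
    """Parse a SentencePiece .vocab file into a list of (id, piece, score)."""
    entries = []
    with pathlib.Path(vocab_path).open(encoding="utf-8") as f:
        for piece_id, line in enumerate(f):
            if limit is not None and piece_id >= limit:
                break
            # Each line is "<piece>\t<score>"; we assume the line number
            # (starting at 0) is the piece ID.
            piece, _, score = line.rstrip("\n").rpartition("\t")
            entries.append((piece_id, piece, float(score)))
    return entries


# Usage sketch, assuming training wrote the vocabulary to this path.
vocab_file = pathlib.Path("/tmp/data/eng_spa_spm.vocab")
if vocab_file.exists():
    for piece_id, piece, score in read_vocab(vocab_file, limit=10):
        print(piece_id, repr(piece), score)
```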

## Step 4: Taking Our New Tokenizer for a Spin! (Loading & Testing)

Fantastic! We've successfully trained our SentencePiece model. It's crunched through all those English and Spanish sentences and has learned a vocabulary of subword "pieces." Now, let's load this trained model and see it in action.

We'll use the `.model` file that was generated in the previous step.

In [None]:
# --- Define paths (ensure these match what you used in Step 3) ---
model_prefix_name = 'eng_spa_spm' # This was the base name for our model files
model_file_path = str(data_dir / f"{model_prefix_name}.model")

# --- Load the trained SentencePiece model ---
try:
    spp = spm.SentencePieceProcessor(model_file=model_file_path)
    print(f"Successfully loaded SentencePiece model from: {model_file_path}")

    # --- Let's test it with some sample sentences! ---
    sample_eng = "This is a new sentence to test our tokenizer."
    sample_spa = "Esta es una nueva frase para probar nuestro tokenizador."

    print(f"\n--- Testing English: \"{sample_eng}\" ---")
    # 1. Encode text into a sequence of subword strings (pieces)
    eng_pieces = spp.encode_as_pieces(sample_eng)
    print(f"Tokenized into pieces: {eng_pieces}")

    # 2. Encode text into a sequence of integer IDs
    eng_ids = spp.encode_as_ids(sample_eng)
    print(f"Encoded into IDs: {eng_ids}")

    # 3. Decode IDs back to text
    decoded_eng_from_ids = spp.decode_ids(eng_ids)
    print(f"Decoded from IDs: \"{decoded_eng_from_ids}\"")
    assert decoded_eng_from_ids == sample_eng # Check if reversible

    print(f"\n--- Testing Spanish: \"{sample_spa}\" ---")
    # 1. Encode text into pieces
    spa_pieces = spp.encode_as_pieces(sample_spa)
    print(f"Tokenized into pieces: {spa_pieces}")

    # 2. Encode text into IDs
    spa_ids = spp.encode_as_ids(sample_spa)
    print(f"Encoded into IDs: {spa_ids}")

    # 3. Decode IDs back to text
    decoded_spa_from_ids = spp.decode_ids(spa_ids)
    print(f"Decoded from IDs: \"{decoded_spa_from_ids}\"")
    assert decoded_spa_from_ids == sample_spa # Check if reversible

    # --- Check vocabulary size and special token IDs ---
    print(f"\n--- Tokenizer Properties ---")
    print(f"Vocabulary size (spp.get_piece_size()): {spp.get_piece_size()}")
    print(f"PAD ID (spp.pad_id()): {spp.pad_id()}")
    print(f"BOS ID (spp.bos_id()): {spp.bos_id()}")
    print(f"EOS ID (spp.eos_id()): {spp.eos_id()}")
    print(f"UNK ID (spp.unk_id()): {spp.unk_id()}")

    # Example: How you'd prepare a sequence for a decoder (e.g., Spanish target)
    # by adding BOS (Beginning Of Sentence) and EOS (End Of Sentence) tokens.
    spa_ids_for_decoder = [spp.bos_id()] + spa_ids + [spp.eos_id()]
    print(f"\nSpanish IDs with BOS/EOS for decoder: {spa_ids_for_decoder}")
    print(f"Decoded version with BOS/EOS: \"{spp.decode_ids(spa_ids_for_decoder)}\"")
    # Note: <s> and </s> are control symbols, so decode_ids() drops them and the
    # decoded string contains only the sentence text.

except OSError:
    # OSError also covers FileNotFoundError; some SentencePiece versions may raise the broader class.
    print(f"ERROR: Could not load model file at {model_file_path}. Please ensure Step 3 ran correctly and the model file exists.")
except Exception as e:
    print(f"An error occurred: {e}")
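
The BOS/EOS example above hints at what comes next: a model usually wants batches of equal-length sequences, which is where the `<pad>` ID earns its keep. Here is a rough sketch of that idea in plain Python. The helper names (`add_bos_eos`, `pad_batch`) and the token IDs in `example` are made up for illustration; only the special IDs 0, 1, and 2 come from our training settings in Step 3.

```python
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2  # the special IDs we chose in Step 3


def add_bos_eos(ids, bos_id=BOS_ID, eos_id=EOS_ID):
    """Wrap a list of token IDs with BOS and EOS markers."""
    return [bos_id] + list(ids) + [eos_id]


def pad_batch(sequences, max_len, pad_id=PAD_ID):
    """Truncate or right-pad every sequence to exactly max_len IDs."""
    batch = []
    for seq in sequences:
        # Note: truncating a long sequence also cuts off its EOS marker.
        seq = list(seq)[:max_len]
        batch.append(seq + [pad_id] * (max_len - len(seq)))
    return batch


# Made-up IDs standing in for the output of spp.encode_as_ids(...).
example = [add_bos_eos([57, 12, 904]), add_bos_eos([8, 311])]
print(pad_batch(example, max_len=6))
```

Right-padding to a fixed length is just one simple choice; the next part of the tutorial may well batch things differently.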

## Tokenizer Mission Accomplished! What We've Done (and What's Next)

And just like that, you've successfully navigated the world of SentencePiece and emerged with a custom-trained tokenizer for our English-Spanish translation project! Give yourself a pat on the back – that was some solid work.

**Let's quickly recap what we've achieved in this part:**

1.  **Set up our tools:** We got SentencePiece installed.
2.  **Prepared our linguistic ingredients:** We downloaded the `spa-eng.zip` dataset, extracted `spa.txt`, and created a clean training file (`all_english_spanish_for_sp_training.txt`) with every English and Spanish sentence on its own line, ready for SentencePiece.
3.  **Trained the tokenizer:** We taught SentencePiece to identify common subword units from our combined English-Spanish text, resulting in our very own `.model` and `.vocab` files.
4.  **Took it for a test drive:** We loaded our trained tokenizer and confirmed it can convert text into numerical IDs (and pieces) and, just as importantly, turn those IDs back into readable text. We also verified our special token IDs (`<pad>`, `<s>`, `</s>`, `<unk>`).

**Why was this so important?**

This tokenizer is a *critical* first step. It acts as the bridge between the human-readable language in our dataset and the numerical format that our neural network (the Transformer model we'll build later) can understand and learn from. Without good tokenization, our translation model wouldn't even get off the starting block!

**What's Next on Our Adventure?**

Now that we have this powerful tool to turn words into tidy, numerical tokens, we're all set for the next exciting phase: actually building and training our English-to-Spanish Transformer model using Flax NNX!

We'll use the `eng_spa_spm.model` file we just created to process our entire dataset into a format suitable for training our neural network. So keep that model file safe!

Thanks for following along with this tokenizing journey. Get ready for some deep learning action in the next part!