<a href="https://colab.research.google.com/github/iimustafa/AI-Scai-League/blob/main/Mustafa_Al_Ali_BPE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implementing Byte Pair Encoding (BPE) from Scratch in Python

This notebook implements the Byte Pair Encoding (BPE) algorithm from scratch in Python. It walks through how BPE, a popular subword tokenization technique in Natural Language Processing (NLP), learns its merges and how those merges are then used to tokenize text.


## Table of Contents

1. [Introduction](#Introduction)
2. [Understanding Byte Pair Encoding (BPE)](#Understanding-Byte-Pair-Encoding-BPE)
3. [BPE Algorithm Steps](#BPE-Algorithm-Steps)
4. [Example: BPE on a Small Corpus](#Example-BPE-on-a-Small-Corpus)
5. [Python Implementation of BPE](#Python-Implementation-of-BPE)
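6. [Applying BPE to the Corpus](#Applying-BPE-to-the-Corpus)
7. [Visualizing Merge Operations](#Visualizing-Merge-Operations)
8. [Conclusion](#Conclusion)
9. [Extending with Pretrained Tokenizers: Exploration and Impact](#Extending-with-Pretrained-Tokenizers-Exploration-and-Impact)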


---

## Introduction

In Natural Language Processing, tokenization is a crucial step that involves breaking down text into smaller units called tokens. While word-level tokenization is straightforward, it struggles with handling rare or out-of-vocabulary words. Subword tokenization methods like Byte Pair Encoding (BPE) address this issue by breaking words into smaller, more frequent subword units.

This notebook will guide you through implementing the BPE algorithm from scratch using Python. We'll use a simple corpus to illustrate each step, ensuring that the concepts are clear and comprehensible.


---

## Understanding Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) is a data compression technique that has been adapted for subword tokenization in NLP. The core idea is to iteratively merge the most frequent pair of adjacent symbols (which can be characters or previously merged symbols) in the corpus. This process continues until a predefined vocabulary size is reached or no more merges can be performed.

**Key Concepts:**

- **Symbols:** The basic units (initially characters) that make up words.
- **Merges:** Combining two adjacent symbols into a single new symbol.
- **Vocabulary:** The set of unique symbols after merges.

BPE helps in handling rare words by decomposing them into known subword units, thus balancing between character-level and word-level tokenization.
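
To make the three key concepts concrete, here is a minimal sketch of a single merge on two words. The helper name `merge_symbols` and the two-word toy input are our own choices for illustration, and "vocabulary" here simply means the set of symbols currently in use.

```python
def merge_symbols(symbols, pair):
    """Return a new symbol list with every adjacent occurrence of `pair` joined."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])  # two symbols become one
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

# Symbols: each word starts as a list of characters.
words = {"low": list("low"), "lower": list("lower")}
vocab_before = {s for syms in words.values() for s in syms}

# Merge: join the adjacent pair ('l', 'o') wherever it occurs.
merged_words = {w: merge_symbols(syms, ("l", "o")) for w, syms in words.items()}

# Vocabulary: the set of unique symbols in use after the merge.
vocab_after = {s for syms in merged_words.values() for s in syms}

print(sorted(vocab_before))   # ['e', 'l', 'o', 'r', 'w']
print(merged_words["lower"])  # ['lo', 'w', 'e', 'r']
print(sorted(vocab_after))    # ['e', 'lo', 'r', 'w']
```

After one merge, `lo` is a symbol in its own right and can take part in later merges.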


---

## BPE Algorithm Steps

Let's outline the BPE algorithm step-by-step:

1. **Initialize Vocabulary:**
   - Start with a vocabulary of all unique characters in the corpus.
   - Represent each word as a sequence of characters, with an end-of-word symbol (e.g., `</w>`).

2. **Count Symbol Pairs:**
   - Count the frequency of each adjacent symbol pair in the corpus.

3. **Merge the Most Frequent Pair:**
   - Identify the most frequent pair of symbols.
   - Merge this pair into a new symbol and update the corpus accordingly.

4. **Update Vocabulary:**
   - Add the new merged symbol to the vocabulary.

5. **Repeat:**
   - Repeat steps 2-4 until the desired vocabulary size is reached or no more pairs can be merged.
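
The five steps above can be sketched as one short function. This is a compact illustration rather than the class we build later: the name `learn_bpe`, the `end_of_word` parameter, and the tie-breaking rule (the first pair seen wins) are assumptions made for this sketch.

```python
from collections import Counter

def learn_bpe(words, num_merges, end_of_word="</w>"):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Step 1: each word becomes a tuple of characters plus an end-of-word marker.
    corpus = Counter(tuple(word) + (end_of_word,) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # Step 5: nothing left to merge.
        # Step 3: pick the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Steps 3-4: rewrite every word so the pair becomes one new symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe(["low", "lower", "lowest"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```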


---

## Example: BPE on a Small Corpus

Let's apply the BPE algorithm to a simple corpus:

```python
corpus = ["low", "lower", "lowest"]
```

---

## Python Implementation of BPE

Now, let's implement the BPE algorithm in Python. We'll create a `BPE` class that encapsulates all the necessary functionalities.

In [None]:
import re
from collections import defaultdict, Counter
from typing import List, Tuple

class BPE:
    def __init__(self, vocab: List[str], num_merges: int):
        """
        Initialize the BPE tokenizer.

        :param vocab: List of words in the corpus.
        :param num_merges: Number of merge operations to perform.
        """
        self.vocab = vocab  # The corpus consisting of a list of words.
        self.num_merges = num_merges  # Total number of merge operations to perform.
        self.bpe_codes = {}  # Dictionary to store the merge operations (BPE codes).
        self.bpe_codes_reverse = {}  # Reverse dictionary for quick lookup of pairs.
        self.vocab_counts = Counter(vocab)  # Count the frequency of each word in the corpus.
        self.symbols = self.get_initial_symbols()  # Initialize symbols by splitting words into characters.

    def get_initial_symbols(self):
        """
        Initialize symbols by splitting each word into a list of characters.

        Note: no end-of-word marker is added here, so merges are learned on
        bare characters; `encode_word` appends '</w>' only at encoding time.

        :return: Dictionary mapping each word to its list of symbols.
        """
        symbols = {}  # Initialize an empty dictionary to hold symbol lists for each word.
        for word in self.vocab_counts:
            # Split the word into individual characters.
            chars = list(word)
            symbols[word] = chars  # Map the word to its list of symbols.
        return symbols  # Return the dictionary of symbols.

    def get_stats(self):
        """
        Compute frequency of adjacent symbol pairs.

        :return: Dictionary with symbol pairs as keys and their frequencies as values.
        """
        pairs = defaultdict(int)  # Initialize a default dictionary to count symbol pair frequencies.
        for word, freq in self.vocab_counts.items():
            symbols = self.symbols[word]  # Retrieve the list of symbols for the current word.
            for i in range(len(symbols)-1):
                pair = (symbols[i], symbols[i+1])  # Define a pair of adjacent symbols.
                pairs[pair] += freq  # Increment the count for this pair by the word's frequency.
        return pairs  # Return the dictionary of symbol pair frequencies.

    def merge_pair(self, pair: Tuple[str, str]):
        """
        Merge the most frequent pair in the symbols.

        :param pair: A tuple representing the symbol pair to merge.
        """
        merged_symbol = ''.join(pair)  # Create the new merged symbol by concatenating the pair.
        self.bpe_codes[pair] = merged_symbol  # Store the merge operation in bpe_codes.
        self.bpe_codes_reverse[merged_symbol] = pair  # Store the reverse mapping for quick lookup.

        for word in self.symbols:
            symbols = self.symbols[word]  # Get the list of symbols for the current word.
            i = 0  # Initialize index to start from the beginning of the symbol list.
            while i < len(symbols)-1:
                # Check if the current pair matches the pair to be merged.
                if symbols[i] == pair[0] and symbols[i+1] == pair[1]:
                    # Merge the pair by replacing it with the merged symbol.
                    symbols = symbols[:i] + [merged_symbol] + symbols[i+2:]
                    self.symbols[word] = symbols  # Update the symbols list for the word.
                    i = max(i-1, 0)  # Move one position back to check for overlapping pairs.
                else:
                    i += 1  # Move to the next pair.

    def fit(self):
        """
        Learn BPE codes by performing merge operations.
        """
        for i in range(self.num_merges):
            pairs = self.get_stats()  # Get the current symbol pair frequencies.
            if not pairs:
                break  # If there are no pairs left to merge, exit the loop.
            most_frequent = max(pairs, key=pairs.get)  # Find the most frequent symbol pair.
            self.merge_pair(most_frequent)  # Merge the most frequent pair.
            print(f"Merge {i+1}: {most_frequent} -> {self.bpe_codes[most_frequent]}")  # Print the merge operation.

    def encode_word(self, word: str) -> List[str]:
        """
        Encode a word using the learned BPE codes.

        :param word: The word to encode.
        :return: A list of symbols representing the encoded word.
        """
        symbols = list(word) + ['</w>']  # Split the word into characters and append the end-of-word marker.
        i = 0  # Initialize index to start from the beginning of the symbol list.
        while i < len(symbols)-1:
            pair = (symbols[i], symbols[i+1])  # Define the current pair of symbols.
            if pair in self.bpe_codes:
                merged = self.bpe_codes[pair]  # Get the merged symbol from bpe_codes.
                symbols = symbols[:i] + [merged] + symbols[i+2:]  # Replace the pair with the merged symbol.
                if i > 0:
                    i -= 1  # Move one position back to check for new possible merges.
            else:
                i += 1  # Move to the next pair if no merge is found.
        return symbols  # Return the list of symbols representing the encoded word.

    def encode_corpus(self) -> List[List[str]]:
        """
        Encode the entire corpus.

        :return: A list of encoded words, where each word is represented as a list of symbols.
        """
        return [self.encode_word(word) for word in self.vocab]  # Encode each word in the corpus.


### Explanation of the Code:

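- **Initialization (`__init__`):** Stores the corpus and the number of merge operations, counts word frequencies, and prepares the initial symbols by splitting each word into characters. As written, `get_initial_symbols` does not add the end-of-word symbol `</w>`; only `encode_word` appends it, which is why `</w>` shows up as a separate, unmerged token in the encoded output.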

- **`get_stats`:** Calculates the frequency of each adjacent symbol pair in the current vocabulary.

- **`merge_pair`:** Merges the most frequent symbol pair across all words in the vocabulary.

- **`fit`:** Performs the BPE algorithm by repeatedly finding and merging the most frequent pairs.

- **`encode_word`:** Encodes a single word using the learned BPE codes.

- **`encode_corpus`:** Encodes the entire corpus based on the learned BPE codes.
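
The `encode_word` method scans left to right and merges any pair it finds among the learned codes. A common alternative is to apply merges in the order they were learned, so that earlier merges take priority. The sketch below illustrates that idea; the function name `encode_with_ranks` is our own, and the merge list is copied from the six merges our class learns on the toy corpus.

```python
def encode_with_ranks(word, merges, end_of_word="</w>"):
    """Apply learned merges in priority order (earliest-learned first)."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word) + [end_of_word]
    while len(symbols) > 1:
        # Collect every adjacent pair that has a learned merge, with its rank and position.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        # Merge the best-ranked pair (leftmost occurrence on ties).
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("l", "o"), ("lo", "w"), ("low", "e"),
          ("lowe", "r"), ("lowe", "s"), ("lowes", "t")]

print(encode_with_ranks("lowest", merges))  # ['lowest', '</w>']
print(encode_with_ranks("slower", merges))  # ['s', 'lower', '</w>']
```

The second call shows the point of subword tokenization: `slower` never appeared in the corpus, yet it is covered by a single character plus a known subword.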


---

## Applying BPE to the Corpus

Let's apply the BPE implementation to our example corpus.


In [None]:
# Define the corpus
# TODO: expand the corpus, e.g. try another English dataset or an Arabic dataset from Hugging Face

corpus = ["low", "lower", "lowest"]

# Initialize BPE with the corpus and specify number of merges
num_merges = 8  # Adjust based on desired vocabulary size; training stops early once no pairs remain
bpe = BPE(vocab=corpus, num_merges=num_merges)

# Fit the BPE model to learn merge operations
bpe.fit()

# Encode the corpus using the learned BPE codes
encoded_corpus = bpe.encode_corpus()

# Display the encoded corpus
for original, encoded in zip(corpus, encoded_corpus):
    print(f"{original} => {' '.join(encoded)}")


Merge 1: ('l', 'o') -> lo
Merge 2: ('lo', 'w') -> low
Merge 3: ('low', 'e') -> lowe
Merge 4: ('lowe', 'r') -> lower
Merge 5: ('lowe', 's') -> lowes
Merge 6: ('lowes', 't') -> lowest
low => low </w>
lower => lower </w>
lowest => lowest </w>
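
The merge log also tells us how large the learned vocabulary is: every merge adds exactly one new symbol on top of the base characters. The short sketch below counts this for our toy corpus, with the merge list copied from the log above. The variable names are our own.

```python
corpus = ["low", "lower", "lowest"]
merges = [("l", "o"), ("lo", "w"), ("low", "e"),
          ("lowe", "r"), ("lowe", "s"), ("lowes", "t")]

# Base vocabulary: every distinct character in the corpus.
base_symbols = sorted({ch for word in corpus for ch in word})

# Each merge contributes one new symbol, the concatenation of its pair.
merged_symbols = ["".join(pair) for pair in merges]

vocabulary = base_symbols + merged_symbols
print(len(base_symbols), len(merged_symbols), len(vocabulary))  # 7 6 13
```

This is why `num_merges` acts as a knob for vocabulary size: the target size is roughly the number of base symbols plus the number of merges.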


---

## Visualizing Merge Operations

Understanding the merge operations step-by-step can provide deeper insights into how BPE constructs subword units.


In [None]:
# Re-initialize BPE for step-by-step visualization
bpe_visual = BPE(vocab=corpus, num_merges=10)

# Fit the BPE model while printing the current state after each merge
for i in range(bpe_visual.num_merges):
    pairs = bpe_visual.get_stats()
    if not pairs:
        break
    most_frequent = max(pairs, key=pairs.get)
    bpe_visual.merge_pair(most_frequent)
    print(f"After merge {i+1}: {most_frequent} -> {bpe_visual.bpe_codes[most_frequent]}")
    for word in bpe_visual.symbols:
        print(f"{word}: {' '.join(bpe_visual.symbols[word])}")
    print("-" * 50)


After merge 1: ('l', 'o') -> lo
low: lo w
lower: lo w e r
lowest: lo w e s t
--------------------------------------------------
After merge 2: ('lo', 'w') -> low
low: low
lower: low e r
lowest: low e s t
--------------------------------------------------
After merge 3: ('low', 'e') -> lowe
low: low
lower: lowe r
lowest: lowe s t
--------------------------------------------------
After merge 4: ('lowe', 'r') -> lower
low: low
lower: lower
lowest: lowe s t
--------------------------------------------------
After merge 5: ('lowe', 's') -> lowes
low: low
lower: lower
lowest: lowes t
--------------------------------------------------
After merge 6: ('lowes', 't') -> lowest
low: low
lower: lower
lowest: lowest
--------------------------------------------------


---

## Conclusion

In this notebook, we delved into the Byte Pair Encoding (BPE) algorithm, a powerful subword tokenization method widely used in NLP. We:

- Explored the theoretical underpinnings of BPE.
- Walked through a detailed example using a small corpus.
- Implemented the BPE algorithm from scratch in Python.
- Applied our implementation to the corpus and visualized the merge operations.

Understanding and implementing BPE provides valuable insights into modern NLP techniques, especially in handling large vocabularies and rare words. This foundational knowledge is crucial for advanced studies and applications in language modeling and machine translation.

Feel free to experiment with different corpora and merge operations to further solidify your understanding of BPE!


## Extending with Pretrained Tokenizers: Exploration and Impact

Now that we've implemented our own BPE tokenizer from scratch, it's beneficial to explore how **pretrained tokenizers** work. Pretrained tokenizers are part of larger NLP models that have been trained on vast corpora. They use advanced tokenization strategies (including variations of BPE, WordPiece, SentencePiece, etc.) to handle text efficiently and effectively.

### Why Explore Pretrained Tokenizers?

- **Efficiency:** Pretrained tokenizers are optimized for performance and can handle large volumes of text quickly.
- **Robustness:** They are trained on diverse datasets, allowing them to handle rare words, different languages, and various linguistic structures gracefully.
- **Integration:** They seamlessly integrate with powerful pretrained models (like BERT, GPT, etc.), enabling state-of-the-art NLP tasks.

### Common Pretrained Tokenizers and Their Strategies

1. **BPE (Byte Pair Encoding):**
   - Used in models like GPT-2.
   - Merges the most frequent byte pairs iteratively.
   - Impact: Balances vocabulary size and ability to handle rare words effectively.

2. **WordPiece:**
   - Used in models like BERT.
   - Similar to BPE, but instead of merging the most frequent pair it selects the merge that most increases the likelihood of the training data.
   - Impact: Creates a fixed vocabulary that can represent unknown words as a sequence of subwords, improving generalization.

3. **SentencePiece:**
   - Used in models like ALBERT, T5.
   - Treats the input text as a sequence of Unicode characters and uses BPE or Unigram models without requiring whitespace tokenization.
   - Impact: Handles languages without explicit word boundaries (like Chinese) and provides consistent tokenization across different scripts.
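
To see how a likelihood-style criterion can differ from plain frequency, the sketch below compares the two on a toy corpus. The score `count(pair) / (count(first) * count(second))` is one commonly described formulation of the WordPiece criterion. We use it here as an assumption for illustration, not as the exact procedure used to train BERT's vocabulary. The words and frequencies are made up.

```python
from collections import Counter

# Toy corpus: words as symbol tuples with hypothetical frequencies.
corpus = {("i", "n"): 10, ("i", "t"): 8, ("q", "u"): 3}

symbol_counts = Counter()
pair_counts = Counter()
for symbols, freq in corpus.items():
    for s in symbols:
        symbol_counts[s] += freq
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

# BPE-style choice: the most frequent pair.
bpe_choice = max(pair_counts, key=pair_counts.get)

# WordPiece-style choice (assumed scoring): frequency of the pair
# normalized by the frequencies of its two parts.
def score(pair):
    first, second = pair
    return pair_counts[pair] / (symbol_counts[first] * symbol_counts[second])

wordpiece_choice = max(pair_counts, key=score)

print(bpe_choice)        # ('i', 'n')
print(wordpiece_choice)  # ('q', 'u')
```

Under this scoring, `q` and `u` are merged first because they never occur apart, even though `in` is more frequent overall.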

### Hands-On: Using Pretrained Tokenizers

We'll use the Hugging Face `transformers` library to experiment with different pretrained tokenizers. Make sure you have the library installed:

```bash
pip install transformers
```


In [None]:
# Example: Using BERT's WordPiece Tokenizer
from transformers import BertTokenizer

# Load the pretrained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
# Example text
text = "Hello, world! Byte Pair Encoding is Incomprehensibilities."

# Encode the text
encoded = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(encoded)

print("Encoded IDs:", encoded)
print("Tokens:", tokens)

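Note: the output recorded below comes from an earlier run on a different input string (its tokens spell the Arabic word "مرحبا" and end in "guzel"), so it does not correspond to the `text` defined above. Re-run the cell to see the tokens for the current `text`.

Encoded IDs: [101, 1295, 17149, 29820, 29816, 25573, 1010, 2088, 999, 24880, 3940, 17181, 2003, 19739, 12638, 1012, 102]
Tokens: ['[CLS]', 'م', '##ر', '##ح', '##ب', '##ا', ',', 'world', '!', 'byte', 'pair', 'encoding', 'is', 'gu', '##zel', '.', '[SEP]']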


In [None]:
# Example: Using GPT-2's BPE Tokenizer
from transformers import GPT2Tokenizer

text = "Hello, world! Byte Pair Encoding is Incomprehensibilities."

# Load the pretrained GPT-2 tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode the text
gpt2_encoded = gpt2_tokenizer.encode(text)
gpt2_tokens = gpt2_tokenizer.convert_ids_to_tokens(gpt2_encoded)

print("GPT-2 Encoded IDs:", gpt2_encoded)
print("GPT-2 Tokens:", gpt2_tokens)


GPT-2 Encoded IDs: [15496, 11, 995, 0, 30589, 39645, 14711, 7656, 318, 554, 785, 3866, 5135, 7992, 13]
GPT-2 Tokens: ['Hello', ',', 'Ġworld', '!', 'ĠByte', 'ĠPair', 'ĠEnc', 'oding', 'Ġis', 'ĠIn', 'com', 'pre', 'hens', 'ibilities', '.']
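
The `Ġ` at the start of tokens such as `Ġworld` stands for a leading space. GPT-2 works on UTF-8 bytes and displays each byte as a printable character. The sketch below is our own reimplementation of that byte-to-character mapping as we understand it from the published GPT-2 encoder, so treat it as an approximation rather than the library's code. The helper `to_gpt2_symbols` is a hypothetical name.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, control bytes, ...) are shifted past 255.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}

byte_to_char = bytes_to_unicode()

def to_gpt2_symbols(text):
    """Render a string the way its UTF-8 bytes appear in GPT-2 token strings."""
    return "".join(byte_to_char[b] for b in text.encode("utf-8"))

print(to_gpt2_symbols(" world"))  # Ġworld
```

Because every byte has a symbol, this scheme never needs an unknown token: any string can be broken down to bytes in the worst case.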


In [None]:
encoded_ids = [15496, 11, 995, 0, 30589, 39645, 14711, 7656, 318, 1257, 13]
# Decode the list of token IDs to text
decoded_text = gpt2_tokenizer.decode(encoded_ids)
print(decoded_text)

Hello, world! Byte Pair Encoding is fun.


In [None]:
!pip install datasets

In [None]:
import json
from datasets import load_dataset

# Step 1: Load the dataset
dataset = load_dataset("Trelis/tiny-shakespeare")

# Step 2: Extract text from the training split
corpus = dataset["train"]["Text"]

In [None]:
# Step 3: Preprocess the corpus
# For simplicity, we split the text into words by whitespace.
# Additional preprocessing like lowercasing or punctuation removal can be added as needed.
preprocessed_corpus = [word for line in corpus for word in line.split()]
preprocessed_corpus[:10]  # Preview the first ten tokens

['First',
 'Citizen:',
 'Before',
 'we',
 'proceed',
 'any',
 'further,',
 'hear',
 'me',
 'speak.']

In [None]:
len(preprocessed_corpus)

222321
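
Splitting on whitespace leaves punctuation attached to words (`'Citizen:'`, `'further,'`) and keeps `'Resolved.'` distinct from `'resolved.'`. Here is a minimal sketch of the optional lowercasing and punctuation removal mentioned in the comments above. The function name `preprocess`, its flags, and the choice to keep apostrophes are assumptions for illustration.

```python
import re

def preprocess(lines, lowercase=True, strip_punctuation=True):
    """Split lines into words, optionally lowercasing and removing punctuation."""
    words = []
    for line in lines:
        if lowercase:
            line = line.lower()
        if strip_punctuation:
            # Keep word characters, whitespace and apostrophes; everything else becomes a space.
            line = re.sub(r"[^\w\s']", " ", line)
        words.extend(line.split())
    return words

sample = ["First Citizen:", "Before we proceed any further, hear me speak."]
print(preprocess(sample))
# ['first', 'citizen', 'before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak']
```

Whether to normalize like this is a design choice: it shrinks the set of distinct words the tokenizer has to learn from, but it also discards case and punctuation that a downstream model might need.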

In [None]:
num_merges = 10000
bpe = BPE(vocab=preprocessed_corpus, num_merges=num_merges)
bpe.fit()

Merge 1: ('t', 'h') -> th
Merge 2: ('o', 'u') -> ou
Merge 3: ('e', 'r') -> er
Merge 4: ('i', 'n') -> in
Merge 5: ('a', 'n') -> an
Merge 6: ('o', 'r') -> or
Merge 7: ('th', 'e') -> the
Merge 8: ('i', 's') -> is
Merge 9: ('e', 'n') -> en
Merge 10: ('a', 'r') -> ar
Merge 11: ('a', 't') -> at
Merge 12: ('o', 'n') -> on
Merge 13: ('s', 't') -> st
Merge 14: ('l', 'l') -> ll
Merge 15: ('m', 'e') -> me
Merge 16: ('t', 'o') -> to
Merge 17: ('an', 'd') -> and
Merge 18: ('h', 'e') -> he
Merge 19: ('y', 'ou') -> you
Merge 20: ('e', 's') -> es
Merge 21: ('n', 'o') -> no
Merge 22: ('s', 'e') -> se
Merge 23: ('h', 'a') -> ha
Merge 24: ('r', 'e') -> re
Merge 25: ('o', 'f') -> of
Merge 26: ('v', 'e') -> ve
Merge 27: ('in', 'g') -> ing
Merge 28: ('i', 't') -> it
Merge 29: ('l', 'e') -> le
Merge 30: ('b', 'e') -> be
Merge 31: ('w', 'i') -> wi
Merge 32: ('m', 'y') -> my
Merge 33: ('h', 'i') -> hi
Merge 34: ('c', 'e') -> ce
Merge 35: ('r', 'o') -> ro
Merge 36: ('f', 'or') -> for
Merge 37: ('a', 'y') -> ay


In [None]:
# Encode the corpus
encoded_corpus = bpe.encode_corpus()

In [None]:
# Step 5: Save vocabulary and BPE codes in a Hugging Face-style layout (vocab.json + merges.txt)
vocab_file = "vocab.json"
merges_file = "merges.txt"

# Collect every subword token that appears in the encoded corpus
final_vocab = set()
for word in encoded_corpus:
    final_vocab.update(word)

# Sort before assigning IDs so the mapping is reproducible across runs
hf_vocab = {token: idx for idx, token in enumerate(sorted(final_vocab))}

# Save the vocabulary as a JSON file
with open(vocab_file, "w", encoding="utf-8") as f:
    json.dump(hf_vocab, f, ensure_ascii=False, indent=4)

# Save the BPE merge operations as plain text, one pair per line, in the order they were learned
with open(merges_file, "w", encoding="utf-8") as f:
    for pair in bpe.bpe_codes.keys():
        f.write(f"{pair[0]} {pair[1]}\n")

print(f"Vocabulary saved to {vocab_file}")
print(f"Merge operations saved to {merges_file}")

Vocabulary saved to vocab.json
Merge operations saved to merges.txt
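
A vocabulary built only from the tokens seen in the encoded corpus can miss intermediate symbols that later merges depend on. As a hedged alternative, the sketch below assigns IDs to the base characters first and then to each merged symbol in merge order, writes both files to a temporary directory, and reads them back. `build_vocab` and the three-merge toy input are hypothetical, and we make no claim that this matches every detail of what a particular Hugging Face tokenizer class expects.

```python
import json
import os
import tempfile

def build_vocab(words, merges, end_of_word="</w>"):
    """Assign IDs to base characters first, then to merged symbols in merge order."""
    tokens = sorted({ch for word in words for ch in word}) + [end_of_word]
    for left, right in merges:
        merged = left + right
        if merged not in tokens:
            tokens.append(merged)
    return {token: idx for idx, token in enumerate(tokens)}

words = ["low", "lower", "lowest"]
merges = [("l", "o"), ("lo", "w"), ("low", "e")]
vocab = build_vocab(words, merges)

with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.json")
    merges_path = os.path.join(tmp, "merges.txt")

    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False, indent=2)
    with open(merges_path, "w", encoding="utf-8") as f:
        for left, right in merges:
            f.write(f"{left} {right}\n")

    # Read both files back to confirm they round-trip.
    with open(vocab_path, encoding="utf-8") as f:
        loaded_vocab = json.load(f)
    with open(merges_path, encoding="utf-8") as f:
        loaded_merges = [tuple(line.split()) for line in f if line.strip()]

print(loaded_merges == merges, len(loaded_vocab))  # True 11
```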


In [1]:
import pandas as pd

In [2]:
# Load the match details dataset
try:
    match_details_df = pd.read_csv('23_24_match_details.csv')
    print("Loaded '23_24_match_details.csv' successfully!")
except FileNotFoundError:
    print("Error: '23_24_match_details.csv' not found. Please ensure the file path is correct.")
    match_details_df = None

Loaded '23_24_match_details.csv' successfully!


In [3]:
# Load the match stats dataset
try:
    match_stats_df = pd.read_csv('23_24_match_stats.csv')
    print("Loaded '23_24_match_stats.csv' successfully!")
except FileNotFoundError:
    print("Error: '23_24_match_stats.csv' not found. Please ensure the file path is correct.")
    match_stats_df = None

Loaded '23_24_match_stats.csv' successfully!


In [4]:
# ## 2. Exploring the Datasets to Find Commentary Text

if match_details_df is not None:
    print("\nFirst few rows of '23_24_match_details.csv':")
    print(match_details_df.head())
    print("\nColumn names in '23_24_match_details.csv':")
    print(match_details_df.columns)

if match_stats_df is not None:
    print("\nFirst few rows of '23_24_match_stats.csv':")
    print(match_stats_df.head())
    print("\nColumn names in '23_24_match_stats.csv':")
    print(match_stats_df.columns)


First few rows of '23_24_match_details.csv':
      id         Home            Away        Date  \
0  93323  Bournemouth        West Ham  2023-08-12   
1  93336     Man City       Newcastle  2023-08-19   
2  93343    Brentford  Crystal Palace  2023-08-26   
3  93344     Brighton        West Ham  2023-08-26   
4  93347      Everton          Wolves  2023-08-26   

                              Stadium  Attendance         Referee  \
0       Vitality Stadium, Bournemouth         NaN    Robert Jones   
1          Etihad Stadium, Manchester         NaN    Robert Jones   
2  Gtech Community Stadium, Brentford     16997.0    Peter Bankes   
3    American Express Stadium, Falmer     31508.0  Anthony Taylor   
4            Goodison Park, Liverpool     38851.0    Craig Pawson   

                                              events  \
0  Hello and welcome to live coverage of the Prem...   
1  Hello everyone and welcome to live text covera...   
2  Hello and welcome to the live commentary of th...

In [5]:
commentary_column = 'events'
commentary_text = []

In [22]:
# Extract commentary text
# Drop missing values before casting to str; calling astype(str) first would turn NaN into the string 'nan'.
if match_details_df is not None and commentary_column in match_details_df.columns:
    commentary_text.extend(match_details_df[commentary_column].dropna().astype(str).tolist())
    print(f"\nExtracted commentary from '{commentary_column}' in '23_24_match_details.csv'. Number of samples: {len(commentary_text)}")
elif match_stats_df is not None and commentary_column in match_stats_df.columns:
    commentary_text.extend(match_stats_df[commentary_column].dropna().astype(str).tolist())
    print(f"\nExtracted commentary from '{commentary_column}' in '23_24_match_stats.csv'. Number of samples: {len(commentary_text)}")
else:
    print("\nError: Could not find the commentary column. Please inspect the DataFrames and update the 'commentary_column' variable.")
    exit()  # Stop if no commentary is found.

if not commentary_text:
    print("\nNo commentary text found. Stopping the process.")
    exit()


Extracted commentary from 'events' in '23_24_match_details.csv'. Number of samples: 606


In [23]:
# ## 2. Preparing the Text Data
# Combine all commentary text into a single string.
all_commentary_text = "\n".join(commentary_text)

# Save to a text file
with open("commentary.txt", "w", encoding="utf-8") as f:
    f.write(all_commentary_text)

print("Saved commentary data to commentary.txt")

Saved commentary data to commentary.txt


In [25]:
# Training the BPE Tokenizer with SentencePiece

import os

import sentencepiece as spm

# Define training parameters
model_prefix = 'epl_commentary_bpe'
vocab_size = 10000  # Target vocabulary size; adjust as needed.
num_merges = 8000  # Not passed to the trainer below; SentencePiece is driven by vocab_size.
character_coverage = 1.0  # Cover all characters

In [26]:
# Check if the model already exists
if not os.path.exists(f'{model_prefix}.model'):
    # Train the BPE model
    spm.SentencePieceTrainer.train(
        f'--input=commentary.txt '
        f'--model_prefix={model_prefix} '
        f'--vocab_size={vocab_size} '
        f'--model_type=bpe '
        f'--character_coverage={character_coverage}'
    )
    print(f"Trained BPE model and saved to {model_prefix}.model and {model_prefix}.vocab")
else:
    print(f"BPE model already exists at {model_prefix}.model. Skipping training.")

BPE model already exists at epl_commentary_bpe.model. Skipping training.


In [27]:
# 4. Loading and Testing the Tokenizer

sp = spm.SentencePieceProcessor()
sp.load(f'{model_prefix}.model')

# Example usage
test_text = "This was an amazing match!  What a fantastic goal."
encoded_pieces = sp.encode_as_pieces(test_text)
encoded_ids = sp.encode_as_ids(test_text)

print(f"Text: {test_text}")
print(f"Encoded as pieces: {encoded_pieces}")
print(f"Encoded as ids: {encoded_ids}")

decoded_text = sp.decode_ids(encoded_ids)
print(f"Decoded text: {decoded_text}")

Text: This was an amazing match!  What a fantastic goal.
Encoded as pieces: ['▁This', '▁was', '▁an', '▁a', 'ma', 'zing', '▁match', '!', '▁What', '▁a', '▁fantastic', '▁goal', '.']
Encoded as ids: [1089, 304, 135, 5, 929, 7046, 790, 7963, 1325, 5, 3542, 122, 7952]
Decoded text: This was an amazing match! What a fantastic goal.
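
In the pieces above, `▁` marks where a space preceded the piece, which is how SentencePiece keeps whitespace recoverable without a separate word-splitting step. The sketch below rebuilds the text from the recorded pieces by hand. The helper `pieces_to_text` is our own and only mimics what `decode_ids` did in the cell above.

```python
def pieces_to_text(pieces, marker="▁"):
    """Join SentencePiece-style pieces and turn the marker back into spaces."""
    return "".join(pieces).replace(marker, " ").strip()

# Pieces copied from the recorded output above.
pieces = ['▁This', '▁was', '▁an', '▁a', 'ma', 'zing', '▁match', '!',
          '▁What', '▁a', '▁fantastic', '▁goal', '.']

print(pieces_to_text(pieces))  # This was an amazing match! What a fantastic goal.
```

Note that the double space in `test_text` does not survive the round trip: the decoded text in the recorded output has a single space after `match!`.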


In [28]:
import json
def save_vocab_and_merges(model_prefix):
    """
    Export the vocabulary of a SentencePiece model to vocab.json.

    Only the vocabulary is written; the merge rules stay inside the
    SentencePiece .model file, so no merges.txt is produced.
    """
    vocab = {}
    with open(f"{model_prefix}.vocab", "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            piece = line.rstrip("\n").split("\t")[0]  # Keep the piece; drop the score in the second column
            vocab[piece] = i

    with open("vocab.json", "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False, indent=2)
    print("vocab.json saved")
    return vocab

vocab = save_vocab_and_merges(model_prefix)

vocab.json saved
