# **Activity 3: Normalization**
**Instructions:**

---
* Please download the provided IPython Notebook (ipynb) file and open it in Google Colab. Once opened, enter your code in the same file directly beneath the relevant question's code block.


* The purpose of this activity is to practice and gain hands-on experience (ungraded activity).

* No dataset is required for this activity.

# **Text Preprocessing Beyond Tokenization**



## **Word normalization**
Normalization in NLP is the process of converting words into a standard format so that variants of the same word are treated consistently during text analysis.

**Word normalization** is applied as a text preprocessing step to the words or tokens of a document. Typical transformations include converting text to lowercase, removing punctuation, expanding contractions, and stemming or lemmatization, which reduce words to their base or dictionary forms. By shrinking the number of distinct surface forms, normalization makes text data more manageable and lets downstream NLP tasks treat similar words as equivalent.
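The transformations listed above can be sketched with the standard library alone. This is a minimal illustration, not a production normalizer: the `CONTRACTIONS` table and the `normalize` function are hypothetical names introduced here, and the table covers only three contractions.

```python
import re

# Hypothetical, deliberately tiny contraction table (an assumption for this sketch);
# real projects use a much fuller list.
CONTRACTIONS = {"it's": "it is", "let's": "let us", "don't": "do not"}

def normalize(text):
    """Lowercase, expand a few contractions, strip punctuation, collapse whitespace."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)        # simple substring replacement
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

print(normalize("Let's see how it's working. Don't panic!"))
# let us see how it is working do not panic
```

Stemming and lemmatization are left out here because they are covered with NLTK later in this notebook.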

In [1]:
import re

# for using NLTK
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords

# for using SpaCy
import spacy

# for HuggingFace
!pip install transformers
# !pip install ftfy

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...




The cell above imports NLTK, which handles tasks like tokenization, stopword removal, and lemmatization, and downloads the data packages these functions need ("punkt", "stopwords", and "wordnet"). It also imports SpaCy for more advanced natural language processing and installs the Hugging Face transformers library for working with pre-trained NLP models.

In [2]:
# trick to wrap text to the viewing window for this notebook
# Ref: https://stackoverflow.com/questions/58890109/line-wrapping-in-collaboratory-google-results
#helps improve the readability and formatting of code output in the notebook.
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

The code customizes how text and code output are displayed in a Jupyter notebook, making it more readable by enabling line wrapping. It uses the IPython.display module to inject a CSS style that wraps long lines of text (white-space: pre-wrap). The set_css function applies this style before each code cell runs by registering it with the notebook's event system.

### **Revisiting Tokenization: TreebankWordTokenizer**

The **Treebank Word Tokenizer** is a text processing tool used in natural language processing (NLP) to split text into individual words or tokens. It follows the tokenization conventions of the Penn Treebank corpus.

In [3]:
from nltk.tokenize import TreebankWordTokenizer
text="Hello everyone. Welcome to NLP Course."
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text)

['Hello', 'everyone.', 'Welcome', 'to', 'NLP', 'Course', '.']

The code above uses NLTK's TreebankWordTokenizer to split a text into tokens according to the Penn Treebank rules for punctuation and contractions. It initializes the tokenizer and applies it to the sentence "Hello everyone. Welcome to NLP Course.". The output is a list of tokens (words and punctuation). Note that 'everyone.' keeps its period: the tokenizer only separates a period at the very end of the input, because it expects text that has already been split into sentences.

## **Using Regular Expression**

**RegexpTokenizer** is a text processing tool provided by the Natural Language Toolkit (NLTK) library in Python. It is used to tokenize (split) text into individual tokens (words or phrases) based on a specified regular expression pattern.

In [None]:
from nltk.tokenize import RegexpTokenizer

# [\w']+ matches runs of word characters (letters, digits, underscores) and apostrophes
tokenizer = RegexpTokenizer(r"[\w']+")   # raw string avoids an invalid-escape warning for \w
text = "Let's see how it's working. We also have digits like 123 and 010"
tokenizer.tokenize(text)

["Let's",
 'see',
 'how',
 "it's",
 'working',
 'We',
 'also',
 'have',
 'digits',
 'like',
 '123',
 'and',
 '010']

The code uses NLTK's RegexpTokenizer to tokenize text based on a custom regular expression. The pattern [\w']+ matches sequences of word characters (letters, digits, underscores) and single quotes, ensuring tokens like contractions (Let's) and numbers (123) are captured correctly. Applying this tokenizer to the given text splits it into meaningful tokens while ignoring other characters like punctuation.
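To see what the pattern does without NLTK, the same regular expression can be passed to the standard library's `re.findall`. As far as I know, `RegexpTokenizer` in its default mode returns exactly the pattern matches, so this sketch should give the same list as the cell above.

```python
import re

text = "Let's see how it's working. We also have digits like 123 and 010"

# findall returns every non-overlapping match of the pattern, left to right
tokens = re.findall(r"[\w']+", text)
print(tokens)
```

Characters that are not matched by the pattern (spaces and the period) act as separators and are dropped.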

# **(Tutorial) Tokenizing text using SpaCy**

Following is a dummy sample of text to demonstrate tokenization in SpaCy.

In [None]:
dummy_text1 = """Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.
"""

print(dummy_text1)

Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.



In [None]:
# loads a trained English pipeline with specific preprocessing components
nlp = spacy.load('en_core_web_sm')

# using SpaCy's tokenizer...
doc = nlp(dummy_text1)      # applies the processing pipeline on the text
for token in doc:
  print(token.text)

Here
is
the
First
Paragraph
and
this
is
the
First
Sentence
.
Here
is
the
Second
Sentence
.
Now
is
the
Third
Sentence
.
This
is
the
Fourth
Sentence
of
the
first
paragaraph
.
This
paragraph
is
ending
now
with
a
Fifth
Sentence
.


Now
,
it
is
the
Second
Paragraph
and
its
First
Sentence
.
Here
is
the
Second
Sentence
.
Now
is
the
Third
Sentence
.
This
is
the
Fourth
Sentence
of
the
second
paragraph
.
This
paragraph
is
ending
now
with
a
Fifth
Sentence
.


Finally
,
this
is
the
Third
Paragraph
and
is
the
First
Sentence
of
this
paragraph
.
Here
is
the
Second
Sentence
.
Now
is
the
Third
Sentence
.
This
is
the
Fourth
Sentence
of
the
third
paragaraph
.
This
paragraph
is
ending
now
with
a
Fifth
Sentence
.


4th
paragraph
just
has
one
sentence
in
it
.




The code loads SpaCy's pre-trained English language model (en_core_web_sm), which includes tools for tasks like tokenization, part-of-speech tagging, and named entity recognition. It processes the text stored in dummy_text1 using this model, creating a doc object that represents the entire processed text. Finally, it iterates through the doc to print each token (word or punctuation) individually.

**Question 1:** Copy the RegexpTokenizer snippet from the "Using Regular Expression" section and modify its regular expression so that it matches only the **digits** in the given text.

In [None]:
text = "Let's see how it's working. We also have digits like 123 and 010"

# Regular expression to match only digits
tokenizer = RegexpTokenizer(r'\d+')

# Tokenizing the text to extract only digits
digits = tokenizer.tokenize(text)
print(digits)

['123', '010']


The code uses NLTK's RegexpTokenizer with a regular expression (\d+) to extract only digits from a given text. The pattern \d+ matches one or more consecutive digits in the string. Applying this tokenizer to the text "Let's see how it's working. We also have digits like 123 and 010" results in a list containing only the numbers (123 and 010).
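The role of the `+` quantifier is easier to see next to the pattern without it. The following sketch uses only the standard library's `re` module; the variable names are chosen for this illustration.

```python
import re

text = "Let's see how it's working. We also have digits like 123 and 010"

runs = re.findall(r"\d+", text)     # one or more digits: whole numbers stay together
singles = re.findall(r"\d", text)   # exactly one digit: every digit is its own match
numbers = [int(r) for r in runs]    # converting to int drops the leading zero of '010'

print(runs)      # ['123', '010']
print(singles)   # ['1', '2', '3', '0', '1', '0']
print(numbers)   # [123, 10]
```

Keeping the matches as strings preserves leading zeros, which matters for values such as codes or identifiers.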


---


## **(Tutorial) Stemming and Lemmatization using NLTK**

**Stemming** is a text normalization technique in natural language processing and information retrieval. It involves reducing words to their root or base form, often by removing suffixes or prefixes. The goal of stemming is to convert words with the same meaning but different forms into a common base form so that they can be treated as equivalent during text analysis and retrieval. Stemming helps improve information retrieval and text processing tasks by reducing the complexity of words while maintaining their core meaning. Common stemming algorithms include the Porter Stemmer and Snowball Stemmer.

## **Porter Stemmer**
The **Porter Stemmer** is a well-known stemming algorithm in natural language processing. It reduces words to a stem by applying a fixed sequence of suffix-stripping rules. For example, it converts "running" and "runs" to the common stem "run". It does not consult a dictionary, so irregular forms such as "ran" are left unchanged, and the resulting stem is not always a valid word.

Let's see how we can perform stemming and lemmatization using the NLTK library.

In [None]:
# importing PorterStemmer class from nltk.stem module
from nltk.stem import PorterStemmer
porter = PorterStemmer()    # instantiating an object of the PorterStemmer class

stem = porter.stem('cats')    # calling the stemmer algorithm on the desired word
print(f"'cats' after stemming: {stem}")

stem = porter.stem('better')
print(f"'better' after stemming: {stem}")

stem = porter.stem('abaci')
print(f"'abaci' after stemming: {stem}")

stem = porter.stem('aardwolves')
print(f"'aardwolves' after stemming: {stem}")

stem = porter.stem('generically')
print(f"'generically' after stemming: {stem}")

'cats' after stemming: cat
'better' after stemming: better
'abaci' after stemming: abaci
'aardwolves' after stemming: aardwolv
'generically' after stemming: gener


The code uses NLTK's PorterStemmer to perform stemming, which reduces words to their root form. It creates an instance of the PorterStemmer class and applies its stem method to several words (cats, better, abaci, aardwolves, generically). The code prints the stemmed version of each word, showing how the stemming algorithm simplifies words by removing suffixes or applying linguistic rules.
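The idea of rule-based suffix stripping can be illustrated with a few lines of plain Python. This is a toy sketch and explicitly not the Porter algorithm: the suffix list, the `min_stem` threshold, and the name `toy_stem` are assumptions made for this example.

```python
# Suffixes are checked in this order, so longer ones win over the bare "s".
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["cats", "jumping", "jumped", "foxes", "running", "ran"]:
    print(w, "->", toy_stem(w))
```

The toy version turns "running" into "runn" and leaves "ran" untouched. The real Porter Stemmer has extra rules for cases like doubled consonants, but it also cannot map irregular forms such as "ran" to "run".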

## **Lemmatization**
**Lemmatization** is a natural language processing technique that reduces words to their base or dictionary form, known as a "lemma." Unlike stemming, which strips suffixes to approximate a word's root, lemmatization relies on a vocabulary and on the word's grammatical role (its part of speech) to find the correct base form. The goal is to map the different inflected forms of a word to one common, valid word. This makes lemmatization useful when the normalized output must remain readable and grammatically meaningful in text analysis and information retrieval tasks.

## **WordNet Lemmatizer**
The **WordNet Lemmatizer** is a lemmatization tool based on WordNet, a lexical database of the English language. WordNet groups words into sets of synonyms called "synsets" and provides a rich lexical and semantic structure for English. The WordNet Lemmatizer looks words up in this database to reduce them to their base or dictionary form (lemma).

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# importing the WordNet-based lemmatizer class from nltk.stem module
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()    # instantiating an object of the WordNetLemmatizer class

lemma = lemmatizer.lemmatize('cats')    # calling the lemmatization algorithm on the desired word
print(f"'cats' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('better')
print(f"'better' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('abaci')
print(f"'abaci' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('aardwolves')
print(f"'aardwolves' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('generically')
print(f"'generically' after lemmatization: {lemma}")

print("\n\n\n")
lemma = lemmatizer.lemmatize('better', pos='a')   # 'a' denotes the ADJECTIVE part of speech
print(f"'better' (as an adjective) after lemmatization: {lemma}")

'cats' after lemmatization: cat
'better' after lemmatization: better
'abaci' after lemmatization: abacus
'aardwolves' after lemmatization: aardwolf
'generically' after lemmatization: generically




'better' (as an adjective) after lemmatization: good


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


The code demonstrates lemmatization using NLTK's WordNetLemmatizer, which reduces words to their base or dictionary form (lemma) by looking them up in WordNet. It creates an instance of WordNetLemmatizer and applies the lemmatize method to several words (cats, better, abaci, aardwolves, generically), printing their lemmatized forms. Unlike the stemmer, it returns valid words: "abaci" becomes "abacus" and "aardwolves" becomes "aardwolf".

The code also shows how specifying the part of speech (POS) for a word can improve lemmatization accuracy. For example, when lemmatizing the word better and specifying it as an adjective (pos='a'), the lemmatizer correctly identifies its base form as good. Without the POS information, it assumes the default POS is a noun, which may not always give the correct result.
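Why the part of speech changes the answer can be shown with a toy lookup keyed on the pair (word, POS). This sketch does not use WordNet: `TOY_LEXICON` and `toy_lemmatize` are hypothetical names, and the four entries are hand-picked for illustration. The default `pos="n"` mirrors the noun default described above.

```python
# Hypothetical miniature lexicon: (surface form, POS tag) -> lemma.
TOY_LEXICON = {
    ("better", "a"): "good",    # adjective reading
    ("better", "r"): "well",    # adverb reading
    ("cats", "n"): "cat",
    ("abaci", "n"): "abacus",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if there is no entry."""
    return TOY_LEXICON.get((word, pos), word)

print(toy_lemmatize("better"))           # no noun entry, so the word is returned unchanged
print(toy_lemmatize("better", pos="a"))  # adjective entry found
```

The same surface form maps to different lemmas depending on the tag, which is why passing the right POS matters.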

### **Stemming on text string**




In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk
nltk.download('punkt_tab')

def stem_text(text):
    # Initialize the Porter Stemmer
    stemmer = PorterStemmer()

    # Tokenize the text into words
    words = word_tokenize(text)

    # Apply the stemmer to each word and join them back into a text
    stemmed_text = ' '.join([stemmer.stem(word) for word in words])

    return stemmed_text

# Example usage:
text = "He is jumping, and he jumped over the jumps."
stemmed_text = stem_text(text)
print(stemmed_text)


he is jump , and he jump over the jump .


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


The code defines a function stem_text that processes a sentence by first breaking it into individual tokens using word_tokenize. It then applies the PorterStemmer to each token, reducing it to its stem (e.g., "jumping" becomes "jump"). Finally, the stemmed tokens are joined back into a string with spaces, and the result is printed. The output is lowercase ("He" becomes "he") because NLTK's PorterStemmer lowercases its input by default, and punctuation appears as separate tokens because word_tokenize splits it off.

In [None]:
# This is the text on which you have to perform stemming; taken from Wikipedia.
text = "In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form; generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root."
print("Given text:")
print(text)

Given text:
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form; generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.


# **Tutorial**

In [None]:
#CODE BLOCK 1
en_stopwords = set(stopwords.words('english'))
def remove_punc(text_string):
  return re.sub('[^a-zA-Z0-9 ]', '', text_string.lower())

def remove_stopwords(text_string):
  return [ token for token in text_string.split(' ') if token not in en_stopwords ]

# applying punctuation removal to the text
unpunc_text = remove_punc(text)
print("After punctuation removal:")
print(unpunc_text)

# applying stopword removal to the text
clean_text = remove_stopwords(unpunc_text)
print("\n\nAfter stopword removal:")
print(clean_text)

After punctuation removal:
in linguistic morphology and information retrieval stemming is the process of reducing inflected or sometimes derived words to their word stem base or root form generally a written word form the stem need not be identical to the morphological root of the word it is usually sufficient that related words map to the same stem even if this stem is not in itself a valid root


After stopword removal:
['linguistic', 'morphology', 'information', 'retrieval', 'stemming', 'process', 'reducing', 'inflected', 'sometimes', 'derived', 'words', 'word', 'stem', 'base', 'root', 'form', 'generally', 'written', 'word', 'form', 'stem', 'need', 'identical', 'morphological', 'root', 'word', 'usually', 'sufficient', 'related', 'words', 'map', 'stem', 'even', 'stem', 'valid', 'root']


The code defines two functions: one to remove punctuation and another to remove stopwords. The remove_punc function takes a text string, converts it to lowercase, and removes any character that isn't a letter, number, or space. The remove_stopwords function splits the text into individual words and filters out common stopwords (e.g., "the", "is", "and"). The code then applies these functions to the input text, printing the text after punctuation removal and after stopwords are removed.
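One detail worth noting is that `split(' ')` produces empty strings when the text contains consecutive spaces, while `split()` with no argument does not. The sketch below contrasts the two; the small `STOPWORDS` set is a hypothetical stand-in for NLTK's English list so that the example runs without downloads.

```python
# Hypothetical mini stopword set standing in for stopwords.words('english').
STOPWORDS = {"the", "is", "of", "and", "to", "a"}

def remove_stopwords_naive(text_string):
    # Same splitting as in CODE BLOCK 1: split on a single space character
    return [t for t in text_string.split(' ') if t not in STOPWORDS]

def remove_stopwords_robust(text_string):
    # split() with no argument splits on any run of whitespace
    return [t for t in text_string.split() if t not in STOPWORDS]

sample = "stemming  is the process  of reducing words"   # note the double spaces
print(remove_stopwords_naive(sample))
print(remove_stopwords_robust(sample))
```

The Wikipedia text used in this tutorial has no double spaces after punctuation removal, so CODE BLOCK 1 works as shown; the difference only appears on messier input.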

#### **Question 2. Perform stemming on the cleaned text (from CODE BLOCK 1 in the Tutorial section above) using the Porter Stemmer from NLTK.**

**Hint:** import PorterStemmer from nltk.stem

In [None]:
# apply Porter Stemmer on the cleaned text (after punctuation and stopwords are removed) below this comment
#CODE HERE
# Stemming using PorterStemmer
stemmer = PorterStemmer()
stemmed_text = [stemmer.stem(word) for word in clean_text]

# Print the stemmed text
print("\n\nAfter stemming:")
print(stemmed_text)



After stemming:
['linguist', 'morpholog', 'inform', 'retriev', 'stem', 'process', 'reduc', 'inflect', 'sometim', 'deriv', 'word', 'word', 'stem', 'base', 'root', 'form', 'gener', 'written', 'word', 'form', 'stem', 'need', 'ident', 'morpholog', 'root', 'word', 'usual', 'suffici', 'relat', 'word', 'map', 'stem', 'even', 'stem', 'valid', 'root']


This code applies the Porter Stemmer to the cleaned text (which has had punctuation and stopwords removed). It loops through each word in the cleaned text, reducing it to its root form using the stemmer. The result is a list of stemmed words, which is then printed.

# **Lemmatization on text string**

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
from nltk.stem import WordNetLemmatizer

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Example text to lemmatize
text = "There are like more than 100 foxes and lions in this forest."

# Tokenize the text into words
words = text.split()

# Lemmatize each word and join them back into a sentence
lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in words])

# Print the original and lemmatized text
print("Original Text:", text)
print("Lemmatized Text:", lemmatized_text)


Original Text: There are like more than 100 foxes and lions in this forest.
Lemmatized Text: There are like more than 100 fox and lion in this forest.


The code takes a sentence and breaks it into individual words using the split() method. It then lemmatizes each word using the WordNetLemmatizer to reduce it to its base form (e.g., "foxes" becomes "fox"). Finally, the lemmatized words are combined back into a sentence and both the original and lemmatized texts are printed. The verb "are" is left unchanged because the lemmatizer treats every word as a noun unless a part of speech is given.
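Splitting on whitespace also leaves punctuation attached to the neighbouring word, so the last token passed to the lemmatizer is "forest." rather than "forest". A hedged alternative, using only the standard library, is a regular expression that emits words and punctuation marks as separate tokens; the pattern below is one possible choice, not the only one.

```python
import re

text = "There are like more than 100 foxes and lions in this forest."

by_split = text.split()                       # whitespace only
by_regex = re.findall(r"\w+|[^\w\s]", text)   # words, or single punctuation marks

print(by_split[-1])    # forest.
print(by_regex[-2:])   # ['forest', '.']
```

With the period split off, a dictionary lookup sees the bare word "forest". NLTK's word_tokenize, used in the stemming example earlier, separates punctuation in a similar way.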

#### **Question 3. Perform lemmatization on the same cleaned text (from CODE BLOCK 1 in the Tutorial section above) using NLTK's lemmatizer.**

**Hint:** import WordNetLemmatizer from nltk.stem

In [None]:
# apply NLTK's lemmatizer on the cleaned text (after punctuation and stopwords are removed) below this comment
#CODE HERE
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Lemmatize each word in the cleaned text
lemmatized_text = [lemmatizer.lemmatize(word) for word in clean_text]

# Print the lemmatized text
print("\n\nAfter lemmatization:")
print(lemmatized_text)



After lemmatization:
['linguistic', 'morphology', 'information', 'retrieval', 'stemming', 'process', 'reducing', 'inflected', 'sometimes', 'derived', 'word', 'word', 'stem', 'base', 'root', 'form', 'generally', 'written', 'word', 'form', 'stem', 'need', 'identical', 'morphological', 'root', 'word', 'usually', 'sufficient', 'related', 'word', 'map', 'stem', 'even', 'stem', 'valid', 'root']


This code uses NLTK's WordNetLemmatizer to reduce each word in the cleaned text to its base form, called a lemma. It loops through each word in the cleaned text and applies the lemmatizer (e.g., "words" becomes "word"). Because no part of speech is passed, every word is treated as a noun, so verb forms such as "reducing", "inflected", and "derived" are left unchanged. Finally, it prints the lemmatized version of the text.
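The difference between the two techniques becomes concrete when their outputs are placed side by side. The sketch below copies the first eleven tokens of the cleaned text and of the stemming and lemmatization outputs shown in this notebook, then lists which tokens each method altered. The helper name `changed` is introduced for this illustration.

```python
# First eleven tokens of the cleaned text and of the two outputs shown in this notebook.
cleaned = ['linguistic', 'morphology', 'information', 'retrieval', 'stemming',
           'process', 'reducing', 'inflected', 'sometimes', 'derived', 'words']
stemmed = ['linguist', 'morpholog', 'inform', 'retriev', 'stem',
           'process', 'reduc', 'inflect', 'sometim', 'deriv', 'word']
lemmatized = ['linguistic', 'morphology', 'information', 'retrieval', 'stemming',
              'process', 'reducing', 'inflected', 'sometimes', 'derived', 'word']

def changed(before, after):
    """Return (before, after) pairs where normalization altered the token."""
    return [(b, a) for b, a in zip(before, after) if b != a]

print("Changed by stemming:     ", changed(cleaned, stemmed))
print("Changed by lemmatization:", changed(cleaned, lemmatized))
```

On this excerpt the stemmer alters ten of the eleven tokens and often produces non-words such as "morpholog", while the lemmatizer with its default noun setting changes only "words" to "word".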

---


## **(Tutorial) Subword Tokenization using HuggingFace**

**Hugging Face** provides NLP practitioners with pre-trained models and subword tokenizers. Its "transformers" library ships tokenizers based on algorithms such as Byte Pair Encoding (BPE) and SentencePiece, which are widely used for subword tokenization.

**Subword tokenization** is a text processing technique used in natural language processing (NLP) to break words into smaller units called subword pieces. This approach is particularly useful for languages with complex morphology and for out-of-vocabulary words. Methods like Byte-Pair Encoding (BPE) and SentencePiece split text into units that range from single characters to whole words, which lets a model cover a very large set of words with a fixed-size vocabulary. Rare or unseen words are represented as a sequence of known pieces instead of a single "unknown" token, which improves model performance across a wide range of languages and tasks.

In [None]:
!pip install tokenizers
#This is a JSON file that contains the vocabulary (i.e., the set of words and subword pieces) used by the GPT-2 model
!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json
!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

--2025-01-28 16:22:07--  https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.202.248, 3.5.16.7, 16.182.73.136, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.202.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1042301 (1018K) [application/json]
Saving to: ‘gpt2-medium-vocab.json’


2025-01-28 16:22:07 (8.64 MB/s) - ‘gpt2-medium-vocab.json’ saved [1042301/1042301]

--2025-01-28 16:22:07--  https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.202.248, 3.5.16.7, 16.182.73.136, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.202.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456318 (446K) [text/plain]
Saving to: ‘gpt2-merges.txt’


2025-01-28 16:22:07 (5.58 MB/s) - ‘gpt2-merges.txt’ saved [456318/456318]



The code installs the tokenizers library, which is used for tokenizing text data. It then downloads two files: gpt2-medium-vocab.json, which maps each token in the GPT-2 vocabulary (words and subword pieces) to its integer ID, and gpt2-merges.txt, which contains the ordered list of merge rules learned during BPE training, where each rule names a pair of symbols to be joined into a larger subword piece. Both files are needed to build the GPT-2 tokenizer.

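**Byte Pair Encoding (BPE)** is a subword tokenization technique used in natural language processing (NLP). It starts from individual characters (or bytes) and repeatedly merges the most frequent adjacent pair of symbols into a new symbol. As a result, frequent words end up as single tokens while rare words are split into several subword pieces, which gives language models a flexible vocabulary of fixed size.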
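The merge procedure can be sketched on a toy corpus in plain Python. This is a simplified, character-level illustration of how merge rules are learned, not the byte-level implementation in the tokenizers library; the corpus, its word counts, and the function names are assumptions made for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Rewrite every word, joining each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> number of occurrences.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}   # start from single characters

merges = []
for _ in range(3):                                 # learn three merge rules
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print("Learned merges:", merges)
print("Segmented corpus:", words)
```

After three merges the shared ending of "newest" and "widest" has become the single symbol "est". A file such as gpt2-merges.txt stores a long list of merge rules of this kind, in the order they were learned.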

In [None]:
from tokenizers import ByteLevelBPETokenizer
gpt2vocab = "gpt2-medium-vocab.json"
gpt2merges = "gpt2-merges.txt"

bpe = ByteLevelBPETokenizer(gpt2vocab, gpt2merges)
bpe_encoding = bpe.encode("The custom of delivering an address on Inauguration Day started with the very first Inauguration—George Washington’s—on April 30, 1789.")
print(bpe_encoding.ids)
print(bpe_encoding.tokens)

[464, 2183, 286, 13630, 281, 2209, 319, 554, 7493, 3924, 3596, 2067, 351, 262, 845, 717, 554, 7493, 3924, 960, 20191, 2669, 447, 247, 82, 960, 261, 3035, 1542, 11, 1596, 4531, 13]
['The', 'Ġcustom', 'Ġof', 'Ġdelivering', 'Ġan', 'Ġaddress', 'Ġon', 'ĠIn', 'aug', 'uration', 'ĠDay', 'Ġstarted', 'Ġwith', 'Ġthe', 'Ġvery', 'Ġfirst', 'ĠIn', 'aug', 'uration', 'âĢĶ', 'George', 'ĠWashington', 'âĢ', 'Ļ', 's', 'âĢĶ', 'on', 'ĠApril', 'Ġ30', ',', 'Ġ17', '89', '.']


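The code uses the ByteLevelBPETokenizer from the tokenizers library to initialize a tokenizer with the vocabulary (gpt2-medium-vocab.json) and merge rules (gpt2-merges.txt) of the GPT-2 model. It then encodes a sample sentence, breaking it into tokens and mapping each token to its ID in the vocabulary. Finally, the code prints the token IDs (bpe_encoding.ids) and the tokens themselves (bpe_encoding.tokens). In the printed tokens, the prefix Ġ marks a token that was preceded by a space, and sequences such as âĢĶ are the byte-level representation of non-ASCII characters (here, the long dash and the curly apostrophe). Note how "Inauguration" is split into the three pieces 'ĠIn', 'aug', and 'uration'.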

**Question 4**: Take the encoding IDs generated by the snippet above and decode them to recover the original text string.

**Hint**: use decode() method

In [None]:
#CODE HERE
# Decoding the encoding ids back to the original text
decoded_text = bpe.decode(bpe_encoding.ids)
print(decoded_text)

The custom of delivering an address on Inauguration Day started with the very first Inauguration—George Washington’s—on April 30, 1789.


The code uses the decode() method of the ByteLevelBPETokenizer to convert the list of token IDs (bpe_encoding.ids) back into the original text. It takes the encoding IDs that were generated from the sentence and decodes them into their corresponding tokens to reconstruct the original string. The decoded text is then printed, which should match the original sentence.
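Conceptually, decoding is a lookup from IDs back to tokens followed by concatenation. The sketch below shows this for the first four IDs and tokens printed earlier in this section. It is a simplification that only handles plain ASCII text: the real decoder maps every byte-level symbol back to its original byte, while this version only turns the Ġ marker into a space. The names `id_to_token` and `toy_decode` are introduced for this illustration.

```python
# IDs and tokens copied from the first four entries of the encoding output in this section.
id_to_token = {464: "The", 2183: "Ġcustom", 286: "Ġof", 13630: "Ġdelivering"}

def toy_decode(ids):
    """Look up each ID, concatenate the tokens, and turn the Ġ marker into a space."""
    pieces = "".join(id_to_token[i] for i in ids)
    return pieces.replace("Ġ", " ")

print(toy_decode([464, 2183, 286, 13630]))
# The custom of delivering
```

Because spaces are stored inside the tokens themselves, the pieces can be joined with no separator, which is why decoding reproduces the original spacing exactly.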