Tokenization in natural language processing (NLP) is a technique that involves dividing a sentence or phrase into smaller units known as tokens. These tokens can encompass words, dates, punctuation marks, or even fragments of words.

Tokenization involves using a tokenizer to segment unstructured data and natural language text into distinct chunks of information, treating them as different elements. The tokens within a document can then be mapped to a vector, transforming an unstructured text document into a numerical data structure suitable for machine learning.
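
As a rough illustration of how tokens become a numerical structure, the sketch below builds a simple count vector (bag of words) using only the standard library. The whitespace tokenizer, the helper names `simple_tokenize`, `build_vocab` and `vectorize`, and the two sample documents are all assumptions made for this example, not part of any library.

```python
from collections import Counter

def simple_tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_vocab(docs):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in simple_tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(doc, vocab):
    """Return a count vector with one entry per vocabulary token."""
    counts = Counter(simple_tokenize(doc))
    return [counts.get(token, 0) for token in vocab]

docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocab(docs)
vectors = [vectorize(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length list of counts, which is the kind of input most classical machine learning models expect.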

## Types of Tokenization
Tokenization can be classified into several types based on how the text is segmented, for example into sentences, words, subwords, or characters.

## Need for Tokenization
Tokenization is a crucial step in text processing and natural language processing (NLP) for several reasons:

- Effective text processing: Tokenization breaks raw text into smaller units so that it can be handled more easily for processing and analysis.
- Feature extraction: Text data can be represented numerically for algorithmic comprehension by using tokens as features in machine learning models.
- Language modelling: Tokenization in NLP facilitates the creation of organized representations of language, which is useful for tasks like text generation and language modelling.
- Information retrieval: Tokenization is essential for indexing and searching in systems that store and retrieve information efficiently based on words or phrases.
- Text analysis: Tokenization is used in many NLP tasks, including sentiment analysis and named entity recognition, to determine the function and context of individual words in a sentence.
- Vocabulary management: By generating a list of distinct tokens that stand in for words in the dataset, tokenization helps manage a corpus's vocabulary.
- Task-specific adaptation: Tokenization can be customized to meet the needs of particular NLP tasks, such as summarization and machine translation.
- Preprocessing step: This essential preprocessing step transforms unprocessed text into a format appropriate for additional statistical and computational analysis.

## More Techniques for Tokenization
Besides the NLTK library, tokenization can also be implemented with the following methods and libraries:

- spaCy: spaCy is an NLP library that provides robust tokenization capabilities.
- BERT tokenizer: BERT tokenizes its input text with WordPiece, a type of subword tokenizer.
- Regular expressions: Using regular expressions allows for more fine-grained control over tokenization, and you can customize the pattern based on your specific requirements.
- Byte-Pair Encoding: Byte-Pair Encoding (BPE) is a data compression algorithm that has also found applications in natural language processing, specifically for tokenization. It is a subword tokenization technique that works by iteratively merging the most frequent pairs of consecutive bytes (or characters) in a given corpus.
- SentencePiece: SentencePiece is another subword tokenizer commonly used for natural language processing tasks. It is designed to be language-agnostic: it works directly on raw text, treating whitespace as an ordinary symbol, and builds its vocabulary with BPE or a unigram language model.
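
The BPE merge loop described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used by any particular library: the toy corpus, its word frequencies, the helper names `get_pair_counts` and `merge_pair`, and the choice of three merges are all assumptions made for this example.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word -> frequency (hypothetical numbers)
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
# Start from single characters as the initial symbols
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):  # learn three merge rules
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(best, words)

print(merges)
print(words)
```

The learned merge list is the tokenizer's "model": applying the same merges in the same order to new text reproduces the segmentation.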

## How WordPiece Tokenization Addresses the Rare Words Problem in NLP

In the evolving landscape of Natural Language Processing (NLP), handling rare words effectively is a significant challenge. Traditional tokenization methods, which split text into words or characters, often struggle with rare or unknown words, leading to gaps in understanding and model performance. This is where WordPiece tokenization, a method pioneered by Google, steps in as a solution.

### Understanding WordPiece Tokenization
WordPiece tokenization is a middle-ground approach between word-level and character-level tokenization. It breaks down words into commonly occurring subwords or "pieces." This method allows for a more efficient representation of a language's vocabulary, especially in terms of frequently occurring word parts.

For example, the word "unbreakable" can be segmented into "un," "break," and "able." This segmentation not only captures the meaning of the full word but also retains the semantic meaning of the subwords.
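
The segmentation above can be sketched with a greedy longest-match-first lookup, which is how WordPiece is typically applied at inference time. The vocabulary below is a hypothetical toy set chosen for this example, the function name `wordpiece_tokenize` is an assumption, and the `##` prefix for word-internal pieces follows the convention used by BERT's tokenizer.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only for this example
vocab = {"un", "##break", "##able", "break", "##ing"}

print(wordpiece_tokenize("unbreakable", vocab))
print(wordpiece_tokenize("breaking", vocab))
print(wordpiece_tokenize("xyz", vocab))
```

A rare word is therefore represented by pieces the model has seen often, and only a word with no matching pieces at all falls back to the unknown token.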

In [6]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [44]:
import nltk
text = "AI & ML 🤖🚀—it's not just about predictions; it's about redefining possibilities! I don't just build models; I engineer intelligence that transforms raw data into actionable insights. Whether it's financial markets 📈📊, space exploration 🪐🌌, or cutting-edge RAG-based chatbots, the goal is clear: innovation. But AI isn't magic—it's math, logic, and a relentless pursuit of efficiency. Should we fear AI? No! But we must guide it responsibly. So, let's code, iterate, and deploy—because the future isn't waiting! 💡🔥"

In [12]:
from nltk.tokenize import word_tokenize
 
word_tokens = word_tokenize(text)
word_tokens

['AI',
 '&',
 'ML',
 '🤖🚀—it',
 "'s",
 'not',
 'just',
 'about',
 'predictions',
 ';',
 'it',
 "'s",
 'about',
 'redefining',
 'possibilities',
 '!',
 'I',
 'do',
 "n't",
 'just',
 'build',
 'models',
 ';',
 'I',
 'engineer',
 'intelligence',
 'that',
 'transforms',
 'raw',
 'data',
 'into',
 'actionable',
 'insights',
 '.',
 'Whether',
 'it',
 "'s",
 'financial',
 'markets',
 '📈📊',
 ',',
 'space',
 'exploration',
 '🪐🌌',
 ',',
 'or',
 'cutting-edge',
 'RAG-based',
 'chatbots',
 ',',
 'the',
 'goal',
 'is',
 'clear',
 ':',
 'innovation',
 '.',
 'But',
 'AI',
 'is',
 "n't",
 'magic—it',
 "'s",
 'math',
 ',',
 'logic',
 ',',
 'and',
 'a',
 'relentless',
 'pursuit',
 'of',
 'efficiency',
 '.',
 'Should',
 'we',
 'fear',
 'AI',
 '?',
 'No',
 '!',
 'But',
 'we',
 'must',
 'guide',
 'it',
 'responsibly',
 '.',
 'So',
 ',',
 'let',
 "'s",
 'code',
 ',',
 'iterate',
 ',',
 'and',
 'deploy—because',
 'the',
 'future',
 'is',
 "n't",
 'waiting',
 '!',
 '💡🔥']

In [14]:
from nltk.tokenize import sent_tokenize
 
sentence_tokens = sent_tokenize(text)
sentence_tokens

["AI & ML 🤖🚀—it's not just about predictions; it's about redefining possibilities!",
 "I don't just build models; I engineer intelligence that transforms raw data into actionable insights.",
 "Whether it's financial markets 📈📊, space exploration 🪐🌌, or cutting-edge RAG-based chatbots, the goal is clear: innovation.",
 "But AI isn't magic—it's math, logic, and a relentless pursuit of efficiency.",
 'Should we fear AI?',
 'No!',
 'But we must guide it responsibly.',
 "So, let's code, iterate, and deploy—because the future isn't waiting!",
 '💡🔥']

## Punctuation-based tokenizer

In [31]:
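import re
import string

def punctuation_tokenize(text):
    # Split wherever an ASCII punctuation character occurs;
    # the punctuation itself is discarded.
    return re.split(r"[{}]".format(re.escape(string.punctuation)), text)

tokens = punctuation_tokenize(text)
print(tokens)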


['AI ', ' ML 🤖🚀—it', 's not just about predictions', ' it', 's about redefining possibilities', ' I don', 't just build models', ' I engineer intelligence that transforms raw data into actionable insights', ' Whether it', 's financial markets 📈📊', ' space exploration 🪐🌌', ' or cutting', 'edge RAG', 'based chatbots', ' the goal is clear', ' innovation', ' But AI isn', 't magic—it', 's math', ' logic', ' and a relentless pursuit of efficiency', ' Should we fear AI', ' No', ' But we must guide it responsibly', ' So', ' let', 's code', ' iterate', ' and deploy—because the future isn', 't waiting', ' 💡🔥']
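
Splitting on punctuation discards the punctuation and breaks contractions apart. If that is not wanted, one possible alternative is to match tokens instead of splitting on separators. The pattern and the name `regex_tokenize` below are assumptions for this sketch, not a standard recipe. The pattern keeps hyphenated words and contractions together and emits every other punctuation mark as its own token.

```python
import re

# \w+(?:[-']\w+)* keeps hyphenated words and contractions in one piece;
# [^\w\s] matches any single character that is neither a word character
# nor whitespace, so each punctuation mark becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def regex_tokenize(text):
    """Return all non-overlapping matches of TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

sample = "AI isn't magic; it's math, logic, and cutting-edge code!"
sample_tokens = regex_tokenize(sample)
print(sample_tokens)
```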


## Treebank tokenizer

The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. It performs the following steps:

- split standard contractions, e.g. don't -> do n't and they'll -> they 'll
- treat most punctuation characters as separate tokens
- split off commas and single quotes when followed by whitespace
- separate periods that appear at the end of the line

Because only a line-final period is split off, periods in the middle of a multi-sentence string stay attached to the preceding word, as in 'insights.' in the output below.

In [34]:
from nltk.tokenize import TreebankWordTokenizer
TreebankWordTokens = TreebankWordTokenizer().tokenize(text)
print(TreebankWordTokens)

['AI', '&', 'ML', '🤖🚀—it', "'s", 'not', 'just', 'about', 'predictions', ';', 'it', "'s", 'about', 'redefining', 'possibilities', '!', 'I', 'do', "n't", 'just', 'build', 'models', ';', 'I', 'engineer', 'intelligence', 'that', 'transforms', 'raw', 'data', 'into', 'actionable', 'insights.', 'Whether', 'it', "'s", 'financial', 'markets', '📈📊', ',', 'space', 'exploration', '🪐🌌', ',', 'or', 'cutting-edge', 'RAG-based', 'chatbots', ',', 'the', 'goal', 'is', 'clear', ':', 'innovation.', 'But', 'AI', 'is', "n't", 'magic—it', "'s", 'math', ',', 'logic', ',', 'and', 'a', 'relentless', 'pursuit', 'of', 'efficiency.', 'Should', 'we', 'fear', 'AI', '?', 'No', '!', 'But', 'we', 'must', 'guide', 'it', 'responsibly.', 'So', ',', 'let', "'s", 'code', ',', 'iterate', ',', 'and', 'deploy—because', 'the', 'future', 'is', "n't", 'waiting', '!', '💡🔥']


## TweetTokenizer

NLTK's TweetTokenizer is designed for informal, social-media-style text. It keeps emoticons such as :-) together as single tokens, and in the outputs below it also separates individual emoji and leaves contractions such as "it's" unsplit.

In [35]:
# Import the TweetTokenizer class from NLTK
from nltk.tokenize import TweetTokenizer

# Create a tokenizer instance (reused in the next cell)
tk = TweetTokenizer()

# Sample input made of emoticons and brackets
emoticon_text = ":-) <> () {} [] :-p"

# Tokenize the sample string
emoticon_tokens = tk.tokenize(emoticon_text)

print(emoticon_tokens)

[':-)', '<', '>', '(', ')', '{', '}', '[', ']', ':-p']


In [36]:
tweetTokens = tk.tokenize(text)
print(tweetTokens)

['AI', '&', 'ML', '🤖', '🚀', '—', "it's", 'not', 'just', 'about', 'predictions', ';', "it's", 'about', 'redefining', 'possibilities', '!', 'I', "don't", 'just', 'build', 'models', ';', 'I', 'engineer', 'intelligence', 'that', 'transforms', 'raw', 'data', 'into', 'actionable', 'insights', '.', 'Whether', "it's", 'financial', 'markets', '📈', '📊', ',', 'space', 'exploration', '🪐', '🌌', ',', 'or', 'cutting-edge', 'RAG-based', 'chatbots', ',', 'the', 'goal', 'is', 'clear', ':', 'innovation', '.', 'But', 'AI', "isn't", 'magic', '—', "it's", 'math', ',', 'logic', ',', 'and', 'a', 'relentless', 'pursuit', 'of', 'efficiency', '.', 'Should', 'we', 'fear', 'AI', '?', 'No', '!', 'But', 'we', 'must', 'guide', 'it', 'responsibly', '.', 'So', ',', "let's", 'code', ',', 'iterate', ',', 'and', 'deploy', '—', 'because', 'the', 'future', "isn't", 'waiting', '!', '💡', '🔥']
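
TweetTokenizer also accepts a few options that are useful for noisy text. The sketch below uses `strip_handles` and `reduce_len`, both documented parameters of the class, on a sample string taken from the NLTK documentation. The variable names are assumptions for this example.

```python
from nltk.tokenize import TweetTokenizer

# strip_handles removes @mentions; reduce_len shortens runs of a repeated
# character to at most three.
tweet_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

sample_tweet = "@remy: This is waaaaayyyy too much for you!!!!!!"
tweet_tokens = tweet_tokenizer.tokenize(sample_tweet)
print(tweet_tokens)
```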


## Multi-Word Expression Tokenizer

A MWETokenizer takes a string which has already been divided into tokens and retokenizes it, merging multi-word expressions into single tokens, using a lexicon of MWEs:

In [38]:
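from nltk.tokenize import MWETokenizer

# Each MWE is a tuple of adjacent word tokens to be merged.
mwe = MWETokenizer([('financial', 'markets'), ('space', 'exploration')])

# tokenize() expects a list of word tokens (not sentences),
# so pass word_tokens from the word_tokenize cell above.
tokens = mwe.tokenize(word_tokens)

print(tokens)


['AI', '&', 'ML', '🤖🚀—it', "'s", 'not', 'just', 'about', 'predictions', ';', 'it', "'s", 'about', 'redefining', 'possibilities', '!', 'I', 'do', "n't", 'just', 'build', 'models', ';', 'I', 'engineer', 'intelligence', 'that', 'transforms', 'raw', 'data', 'into', 'actionable', 'insights', '.', 'Whether', 'it', "'s", 'financial_markets', '📈📊', ',', 'space_exploration', '🪐🌌', ',', 'or', 'cutting-edge', 'RAG-based', 'chatbots', ',', 'the', 'goal', 'is', 'clear', ':', 'innovation', '.', 'But', 'AI', 'is', "n't", 'magic—it', "'s", 'math', ',', 'logic', ',', 'and', 'a', 'relentless', 'pursuit', 'of', 'efficiency', '.', 'Should', 'we', 'fear', 'AI', '?', 'No', '!', 'But', 'we', 'must', 'guide', 'it', 'responsibly', '.', 'So', ',', 'let', "'s", 'code', ',', 'iterate', ',', 'and', 'deploy—because', 'the', 'future', 'is', "n't", 'waiting', '!', '💡🔥']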


In [39]:
%pip install -U textblob



In [45]:
# Import the TextBlob class from the textblob library
from textblob import TextBlob

# Create a TextBlob object from the sample text
blob_object = TextBlob(text)

# Tokenize the paragraph into words
print(" Word Tokenize :\n", blob_object.words)

# Tokenize the paragraph into sentences
print("\n Sentence Tokenize :\n", blob_object.sentences)

 Word Tokenize :
 ['AI', 'ML', '🤖🚀—it', "'s", 'not', 'just', 'about', 'predictions', 'it', "'s", 'about', 'redefining', 'possibilities', 'I', 'do', "n't", 'just', 'build', 'models', 'I', 'engineer', 'intelligence', 'that', 'transforms', 'raw', 'data', 'into', 'actionable', 'insights', 'Whether', 'it', "'s", 'financial', 'markets', '📈📊', 'space', 'exploration', '🪐🌌', 'or', 'cutting-edge', 'RAG-based', 'chatbots', 'the', 'goal', 'is', 'clear', 'innovation', 'But', 'AI', 'is', "n't", 'magic—it', "'s", 'math', 'logic', 'and', 'a', 'relentless', 'pursuit', 'of', 'efficiency', 'Should', 'we', 'fear', 'AI', 'No', 'But', 'we', 'must', 'guide', 'it', 'responsibly', 'So', 'let', "'s", 'code', 'iterate', 'and', 'deploy—because', 'the', 'future', 'is', "n't", 'waiting', '💡🔥']

 Sentence Tokenize :
 [Sentence("AI & ML 🤖🚀—it's not just about predictions; it's about redefining possibilities!"), Sentence("I don't just build models; I engineer intelligence that transforms raw data into actionable insig


During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token. Each Doc consists of individual tokens, and we can iterate over them:
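
A minimal sketch of that iteration, assuming spaCy is installed, is shown here. It uses a blank English pipeline, which contains only the rule-based tokenizer and needs no model download. The sample sentence comes from the spaCy documentation and the variable names are assumptions.

```python
from spacy.lang.en import English

# Blank English pipeline: rule-based tokenizer only, no trained model needed
nlp = English()

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Each element of a Doc is a Token; .text gives its original string
spacy_tokens = [token.text for token in doc]
print(spacy_tokens)
```

The abbreviation is kept as one token while the currency symbol is split from the number, which is the language-specific behaviour the paragraph describes.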


In [46]:
%pip install spacy



In [47]:
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()

# Construction 1: a blank Tokenizer with just the English vocab
# (no punctuation rules or exceptions, so it splits on whitespace only)
tokenizer = Tokenizer(nlp.vocab)

# Construction 2: the default English tokenizer, including
# punctuation rules and exceptions; this one is used in the cells below
tokenizer = nlp.tokenizer

In [54]:
tokens = tokenizer(text)
tokens

AI & ML 🤖🚀—it's not just about predictions; it's about redefining possibilities! I don't just build models; I engineer intelligence that transforms raw data into actionable insights. Whether it's financial markets 📈📊, space exploration 🪐🌌, or cutting-edge RAG-based chatbots, the goal is clear: innovation. But AI isn't magic—it's math, logic, and a relentless pursuit of efficiency. Should we fear AI? No! But we must guide it responsibly. So, let's code, iterate, and deploy—because the future isn't waiting! 💡🔥

In [55]:
for doc in tokenizer.pipe(sentence_tokens, batch_size=50):
    print(doc.text)
    pass

AI & ML 🤖🚀—it's not just about predictions; it's about redefining possibilities!
I don't just build models; I engineer intelligence that transforms raw data into actionable insights.
Whether it's financial markets 📈📊, space exploration 🪐🌌, or cutting-edge RAG-based chatbots, the goal is clear: innovation.
But AI isn't magic—it's math, logic, and a relentless pursuit of efficiency.
Should we fear AI?
No!
But we must guide it responsibly.
So, let's code, iterate, and deploy—because the future isn't waiting!
💡🔥


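## Tokenize text using Gensim

Gensim is an open-source Python library created for unsupervised topic modeling, document similarity analysis, and natural language processing (NLP). Its tokenize utility yields only runs of alphabetic characters, so punctuation and emoji are dropped and contractions are split at the apostrophe, as the output below shows.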

In [53]:
from gensim.utils import tokenize


tokens = list(tokenize(text))

print(tokens)

['AI', 'ML', 'it', 's', 'not', 'just', 'about', 'predictions', 'it', 's', 'about', 'redefining', 'possibilities', 'I', 'don', 't', 'just', 'build', 'models', 'I', 'engineer', 'intelligence', 'that', 'transforms', 'raw', 'data', 'into', 'actionable', 'insights', 'Whether', 'it', 's', 'financial', 'markets', 'space', 'exploration', 'or', 'cutting', 'edge', 'RAG', 'based', 'chatbots', 'the', 'goal', 'is', 'clear', 'innovation', 'But', 'AI', 'isn', 't', 'magic', 'it', 's', 'math', 'logic', 'and', 'a', 'relentless', 'pursuit', 'of', 'efficiency', 'Should', 'we', 'fear', 'AI', 'No', 'But', 'we', 'must', 'guide', 'it', 'responsibly', 'So', 'let', 's', 'code', 'iterate', 'and', 'deploy', 'because', 'the', 'future', 'isn', 't', 'waiting']


In [57]:
%pip install tensorflow



In [59]:
import tensorflow as tf

# Initialize the Tokenizer
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=100,       # Keep only the top 100 words
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',  # Remove punctuation
    lower=True,          # Convert text to lowercase
    split=' ',           # Split on spaces
    char_level=False,    # Tokenize words (not characters)
    oov_token="<OOV>"    # Handle out-of-vocabulary words
)


# Fit the tokenizer on the texts
tokenizer.fit_on_texts(sentence_tokens)

# Convert text to sequences
sequences = tokenizer.texts_to_sequences(sentence_tokens)

# Print word index and sequences
print("Word Index:", tokenizer.word_index)
print("Sequences:", sequences)


Word Index: {'<OOV>': 1, 'ai': 2, 'just': 3, 'about': 4, "it's": 5, 'i': 6, 'the': 7, 'but': 8, "isn't": 9, 'and': 10, 'we': 11, 'ml': 12, "🤖🚀—it's": 13, 'not': 14, 'predictions': 15, 'redefining': 16, 'possibilities': 17, "don't": 18, 'build': 19, 'models': 20, 'engineer': 21, 'intelligence': 22, 'that': 23, 'transforms': 24, 'raw': 25, 'data': 26, 'into': 27, 'actionable': 28, 'insights': 29, 'whether': 30, 'financial': 31, 'markets': 32, '📈📊': 33, 'space': 34, 'exploration': 35, '🪐🌌': 36, 'or': 37, 'cutting': 38, 'edge': 39, 'rag': 40, 'based': 41, 'chatbots': 42, 'goal': 43, 'is': 44, 'clear': 45, 'innovation': 46, "magic—it's": 47, 'math': 48, 'logic': 49, 'a': 50, 'relentless': 51, 'pursuit': 52, 'of': 53, 'efficiency': 54, 'should': 55, 'fear': 56, 'no': 57, 'must': 58, 'guide': 59, 'it': 60, 'responsibly': 61, 'so': 62, "let's": 63, 'code': 64, 'iterate': 65, 'deploy—because': 66, 'future': 67, 'waiting': 68, '💡🔥': 69}
Sequences: [[2, 12, 13, 14, 3, 4, 15, 5, 4, 16, 17], [6, 18
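
To make the word index and sequences above less opaque, here is a simplified pure-Python sketch of the same idea: rank words by frequency, reserve index 1 for the out-of-vocabulary token, and replace each word with its index. It is not the Keras implementation. The function names mirror the Keras methods only for readability, and punctuation filtering and the `num_words` limit are left out.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    # most_common() keeps first-seen order among words with equal counts
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace each word with its index, or the OOV index if unseen."""
    oov_id = word_index[oov_token]
    return [[word_index.get(word, oov_id) for word in text.lower().split()]
            for text in texts]

train_texts = ["we build models", "we test models"]
toy_word_index = fit_word_index(train_texts)
toy_sequences = texts_to_sequences(["we deploy models"], toy_word_index)
print(toy_word_index)
print(toy_sequences)
```

The unseen word "deploy" maps to the OOV index, which is how a fitted tokenizer handles words it never saw during fitting.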

In [15]:
from nltk.corpus import stopwords
tokenized_corpus = [word_tokenize(doc) for doc in sentence_tokens]
print(tokenized_corpus)

[['AI', '&', 'ML', '🤖🚀—it', "'s", 'not', 'just', 'about', 'predictions', ';', 'it', "'s", 'about', 'redefining', 'possibilities', '!'], ['I', 'do', "n't", 'just', 'build', 'models', ';', 'I', 'engineer', 'intelligence', 'that', 'transforms', 'raw', 'data', 'into', 'actionable', 'insights', '.'], ['Whether', 'it', "'s", 'financial', 'markets', '📈📊', ',', 'space', 'exploration', '🪐🌌', ',', 'or', 'cutting-edge', 'RAG-based', 'chatbots', ',', 'the', 'goal', 'is', 'clear', ':', 'innovation', '.'], ['But', 'AI', 'is', "n't", 'magic—it', "'s", 'math', ',', 'logic', ',', 'and', 'a', 'relentless', 'pursuit', 'of', 'efficiency', '.'], ['Should', 'we', 'fear', 'AI', '?'], ['No', '!'], ['But', 'we', 'must', 'guide', 'it', 'responsibly', '.'], ['So', ',', 'let', "'s", 'code', ',', 'iterate', ',', 'and', 'deploy—because', 'the', 'future', 'is', "n't", 'waiting', '!'], ['💡🔥']]


In [16]:
stop_words = set(stopwords.words('english'))
filtered_corpus = [[word for word in doc if word not in stop_words] for doc in tokenized_corpus]
print(filtered_corpus)

[['AI', '&', 'ML', '🤖🚀—it', "'s", 'predictions', ';', "'s", 'redefining', 'possibilities', '!'], ['I', "n't", 'build', 'models', ';', 'I', 'engineer', 'intelligence', 'transforms', 'raw', 'data', 'actionable', 'insights', '.'], ['Whether', "'s", 'financial', 'markets', '📈📊', ',', 'space', 'exploration', '🪐🌌', ',', 'cutting-edge', 'RAG-based', 'chatbots', ',', 'goal', 'clear', ':', 'innovation', '.'], ['But', 'AI', "n't", 'magic—it', "'s", 'math', ',', 'logic', ',', 'relentless', 'pursuit', 'efficiency', '.'], ['Should', 'fear', 'AI', '?'], ['No', '!'], ['But', 'must', 'guide', 'responsibly', '.'], ['So', ',', 'let', "'s", 'code', ',', 'iterate', ',', 'deploy—because', 'future', "n't", 'waiting', '!'], ['💡🔥']]
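
NLTK's English stopword list is lowercase, and the filter above compares tokens without lowercasing them, which is why capitalised words such as 'I' and 'But' survive in the output. A case-insensitive variant could look like the sketch below. The small stopword set and the name `remove_stopwords` are assumptions made for this example, so that it runs without downloading any data.

```python
# Small hypothetical stopword set; NLTK's English list is much longer.
STOP_WORDS = {"i", "but", "we", "it", "the", "is", "should", "no", "so"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stopword."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

sentence = ["But", "we", "must", "guide", "it", "responsibly", "."]
kept = remove_stopwords(sentence)
print(kept)
```

Only the comparison is lowercased, so the surviving tokens keep their original casing for later steps.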


In [17]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_corpus = [[stemmer.stem(word) for word in doc] for doc in filtered_corpus]
lemmatized_corpus = [[lemmatizer.lemmatize(word) for word in doc] for doc in filtered_corpus]
print(stemmed_corpus)
print(lemmatized_corpus)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


[['ai', '&', 'ml', '🤖🚀—it', "'s", 'predict', ';', "'s", 'redefin', 'possibl', '!'], ['i', "n't", 'build', 'model', ';', 'i', 'engin', 'intellig', 'transform', 'raw', 'data', 'action', 'insight', '.'], ['whether', "'s", 'financi', 'market', '📈📊', ',', 'space', 'explor', '🪐🌌', ',', 'cutting-edg', 'rag-bas', 'chatbot', ',', 'goal', 'clear', ':', 'innov', '.'], ['but', 'ai', "n't", 'magic—it', "'s", 'math', ',', 'logic', ',', 'relentless', 'pursuit', 'effici', '.'], ['should', 'fear', 'ai', '?'], ['no', '!'], ['but', 'must', 'guid', 'respons', '.'], ['so', ',', 'let', "'s", 'code', ',', 'iter', ',', 'deploy—becaus', 'futur', "n't", 'wait', '!'], ['💡🔥']]
[['AI', '&', 'ML', '🤖🚀—it', "'s", 'prediction', ';', "'s", 'redefining', 'possibility', '!'], ['I', "n't", 'build', 'model', ';', 'I', 'engineer', 'intelligence', 'transforms', 'raw', 'data', 'actionable', 'insight', '.'], ['Whether', "'s", 'financial', 'market', '📈📊', ',', 'space', 'exploration', '🪐🌌', ',', 'cutting-edge', 'RAG-based', 'ch