
## Tokenization

Tokenization is the process of breaking text into smaller units called tokens.

Tokens can be:

1. Words
2. Sentences
3. Characters

In [39]:
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Word Tokenization

In [40]:
from nltk.tokenize import word_tokenize

text = "Natural Language Processing is very useful."
tokens = word_tokenize(text)

print(tokens)


['Natural', 'Language', 'Processing', 'is', 'very', 'useful', '.']
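
To see why a dedicated tokenizer is used instead of plain whitespace splitting, the sketch below compares `str.split()` with a small regular expression. The pattern is an illustrative assumption, not how `word_tokenize` works internally; it only happens to agree with the output above for this simple sentence.

```python
import re

text = "Natural Language Processing is very useful."

# Whitespace splitting leaves the full stop glued to the last word.
whitespace_tokens = text.split()

# Illustrative pattern: runs of word characters, or one punctuation mark.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

Contractions, hyphenated words, and URLs are where a simple pattern like this tends to fall short.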


Sentence Tokenization

In [41]:
from nltk.tokenize import sent_tokenize

text = "I love NLP. It is very powerful."
sentences = sent_tokenize(text)

print(sentences)


['I love NLP.', 'It is very powerful.']
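
Sentence splitting is harder than looking for full stops. The sketch below uses a naive rule (an assumption made for illustration, not the method `sent_tokenize` uses) and shows it breaking at the abbreviation "Dr.", which is the kind of case the pretrained Punkt model behind `sent_tokenize` is designed to handle.

```python
import re

text = "Dr. Smith loves NLP. It is very powerful."

# Naive rule: split wherever ., ! or ? is followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

print(naive_sentences)
```

The naive rule yields three pieces here, with "Dr." split off as its own "sentence".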


Character Tokenization using NLTK RegexpTokenizer

In [42]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\S')  # matches every non-space character
text = "Natural Language"

char_tokens = tokenizer.tokenize(text)
print(char_tokens)


['N', 'a', 't', 'u', 'r', 'a', 'l', 'L', 'a', 'n', 'g', 'u', 'a', 'g', 'e']
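
Note that the space between the two words is missing from the output, because `\S` matches only non-whitespace characters. A minimal sketch using the standard library (not NLTK) makes the difference visible; whether spaces should count as tokens depends on the task.

```python
import re

text = "Natural Language"

# Same pattern as above: one token per non-whitespace character.
no_spaces = re.findall(r"\S", text)

# list() keeps every character, including the space.
with_spaces = list(text)

print(len(no_spaces), len(with_spaces))
```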


## Stemming

Stemming is a rule-based text normalization technique in Natural Language Processing that strips affixes to reduce words to a root or base form, known as a "stem." The stem is not guaranteed to be a valid dictionary word.


Porter Stemmer

The Porter Stemmer is one of the oldest and most widely used stemming algorithms. It applies suffix-stripping rules in five sequential phases.

In [43]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["playing", "played", "plays", "easily", "fairly"]
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)


['play', 'play', 'play', 'easili', 'fairli']
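
Stems such as 'easili' appear because rules are applied in sequence without checking the result against a dictionary. The sketch below is a toy two-phase stripper, not the Porter algorithm: the suffix list, the minimum stem length of 3, and the name `toy_stem` are all assumptions chosen for illustration. It reproduces the output above for these five words only.

```python
VOWELS = "aeiou"

def toy_stem(word):
    # Phase 1: strip one common inflectional suffix, keeping at least
    # three characters of stem.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Phase 2: rewrite a final "y" as "i" when a consonant precedes it.
    if len(word) > 2 and word.endswith("y") and word[-2] not in VOWELS:
        word = word[:-1] + "i"
    return word

words = ["playing", "played", "plays", "easily", "fairly"]
toy_stems = [toy_stem(w) for w in words]
print(toy_stems)
```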


Snowball Stemmer


The Snowball Stemmer refines the original Porter algorithm (its English variant is also known as Porter2). It supports multiple languages and is often described as more computationally efficient.

In [44]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)


['play', 'play', 'play', 'easili', 'fair']


Lancaster Stemmer

The Lancaster Stemmer is the most aggressive of the stemmers shown here. It uses a large set of rules and often "over-stems," meaning it reduces words to very short, unintuitive roots.

In [45]:
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()

stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)


['play', 'play', 'play', 'easy', 'fair']


Regexp Stemmer


The Regexp Stemmer (Regular Expression Stemmer) allows you to define your own custom rules for suffix removal.

In [46]:
from nltk.stem import RegexpStemmer

stemmer = RegexpStemmer('ing$|s$|ed$')

stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)


['play', 'play', 'play', 'easily', 'fairly']
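
A bare suffix pattern can damage short words. The sketch below applies the same pattern with the standard `re` module and adds a length guard; `regex_stem` and `min_len` are hypothetical names for this illustration. NLTK's `RegexpStemmer` takes a `min` argument for the same purpose.

```python
import re

SUFFIXES = re.compile(r"ing$|s$|ed$")  # same pattern as the cell above

def regex_stem(word, min_len=0):
    # Hypothetical helper: leave words shorter than min_len untouched.
    if len(word) < min_len:
        return word
    return SUFFIXES.sub("", word)

short_words = ["sing", "bus", "red", "playing"]

unguarded = [regex_stem(w) for w in short_words]
guarded = [regex_stem(w, min_len=5) for w in short_words]

print(unguarded)
print(guarded)
```

Without the guard, "sing", "bus", and "red" are cut down to fragments; with it, only "playing" is stemmed.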


## Lemmatization

Lemmatization is a Natural Language Processing (NLP) technique used to reduce words to their base or dictionary form, known as a lemma. Unlike stemming, it considers the grammatical role of a word (e.g., whether it is a noun or a verb) so that the resulting form is a valid word.
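
The role of the part of speech can be sketched with a tiny hand-written lookup table. Every entry and name below is an assumption made for illustration and stands in for a real lexical database such as WordNet; the point is only that the same surface word can map to different lemmas depending on its tag.

```python
# Hand-written stand-in for a lexical database, keyed by (word, POS).
# 'n' marks a noun reading and 'v' a verb reading.
TOY_LEMMAS = {
    ("feet", "n"): "foot",
    ("bats", "n"): "bat",
    ("are", "v"): "be",
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when the pair is unknown.
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("meeting", "n"))
print(toy_lemmatize("meeting", "v"))
print(toy_lemmatize("feet", "n"))
```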

In [47]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

Step 1: Tokenization

The text is broken down into individual components called tokens. This ensures the algorithm processes one word at a time.

In [48]:
text = "The striped bats are hanging on their feet for a meeting"

# Step 1: Tokenization
tokens = word_tokenize(text)
print(f"Tokens: {tokens}")

Tokens: ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'a', 'meeting']


Step 2: Part-of-Speech (POS) Tagging

The lemmatizer needs to know if a word is a noun, verb, or adjective to find the correct root. We use pos_tag to identify these roles.

In [49]:

# Step 2: POS Tagging
pos_tags = nltk.pos_tag(tokens)
print(f"POS Tags: {pos_tags}")


POS Tags: [('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'), ('for', 'IN'), ('a', 'DT'), ('meeting', 'NN')]


Step 3: Morphological Analysis (Mapping)

NLTK's pos_tag returns Penn Treebank tags (such as 'VBG' for a gerund or present participle), which must be converted into the single-letter codes the WordNet lemmatizer understands (such as 'v' for verbs). This mapping makes the tagging output usable in the dictionary lookup that follows.

In [50]:
from nltk.corpus import wordnet

# Step 3: Function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun

mapped_tags = [(word, get_wordnet_pos(tag)) for word, tag in pos_tags]
print(f"Mapped Tags: {mapped_tags}")


Mapped Tags: [('The', 'n'), ('striped', 'a'), ('bats', 'n'), ('are', 'v'), ('hanging', 'v'), ('on', 'n'), ('their', 'n'), ('feet', 'n'), ('for', 'n'), ('a', 'n'), ('meeting', 'n')]
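
The same mapping can also be written as a lookup table keyed on the first letter of the tag. This is a sketch of an alternative style, not a replacement for the function above; the single-letter codes are written as literals so that it runs without NLTK, and `to_wordnet_pos` is a hypothetical name.

```python
# WordNet's single-letter POS codes, written out as literals:
# adjective, verb, noun, adverb.
PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(tag, default="n"):
    # Only the first letter of the Penn Treebank tag matters here.
    return PREFIX_TO_WORDNET.get(tag[:1], default)

pos_tags = [("The", "DT"), ("striped", "JJ"), ("bats", "NNS"),
            ("are", "VBP"), ("hanging", "VBG"), ("feet", "NNS")]

mapped = [(word, to_wordnet_pos(tag)) for word, tag in pos_tags]
print(mapped)
```

Tags with no entry, such as 'DT', fall back to the noun code, matching the default branch of the function above.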


Step 4: Dictionary Lookup (Lemmatization)

The WordNetLemmatizer compares the token and its POS tag against the WordNet Lexical Database to return the standard lemma.

In [51]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [52]:
lemmatizer = WordNetLemmatizer()

# Step 4: Final Lemmatization using the dictionary lookup
lemmatized_output = [lemmatizer.lemmatize(word, pos) for word, pos in mapped_tags]

print(f"Original Text: {text}")
print("Final Lemmatization Results:")
print(" ".join(lemmatized_output))


Original Text: The striped bats are hanging on their feet for a meeting
Final Lemmatization Results:
The striped bat be hang on their foot for a meeting


## Stop Word Removal

Stop-word removal is a text preprocessing step in Natural Language Processing (NLP) that eliminates common, low-information words (like "the," "a," "is," "in") so that analysis can focus on the more significant terms.

In [53]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Step 1: Download required NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [54]:
# Step 2: Load English stop words and convert to a set for speed
stop_words = set(stopwords.words('english'))

print(f"Total NLTK English stop words: {len(stop_words)}")
print(stop_words)

Total NLTK English stop words: 198
{'for', 'an', 'few', 'hers', "hasn't", 'his', 'than', 'to', "they'll", "don't", "haven't", 'most', 'won', 'her', 'own', 'm', 'hadn', "they'd", 'very', 'll', 'theirs', 'too', 'through', 'needn', 'why', "you've", 'doing', 'be', "it'd", "we'd", 'i', 'by', 'mustn', 'their', 'she', 'up', "she'll", 'both', 'same', 'which', 'above', "isn't", 'once', 'had', "we'll", 'but', 'are', 'on', "hadn't", 'that', "couldn't", 'some', "she'd", 'against', 'myself', 'then', 'when', "won't", 'do', "they're", 'now', 'any', 'mightn', 'after', 'and', 'over', 'here', 'didn', 'how', 'under', 'couldn', "didn't", 'into', 'weren', 'd', "they've", "she's", 'don', 'me', 'has', 'again', "i'm", 'will', 'being', 'it', 'those', "needn't", 'only', "he's", 'can', 'off', 'they', 're', 'where', 'yourself', 'until', 'during', "we're", 'if', 'yours', "i've", "we've", 'other', 'who', 'themselves', "you'd", 'what', 'with', "mightn't", 'further', "shan't", 'isn', 'more', 'this', 'should', "you'll

In [55]:
# Step 3: Define a sample sentence and tokenize it
text = "This is a sample sentence showing how to remove stop words from a piece of text."
tokens = word_tokenize(text)

print(f"Original Tokens: {tokens}")


Original Tokens: ['This', 'is', 'a', 'sample', 'sentence', 'showing', 'how', 'to', 'remove', 'stop', 'words', 'from', 'a', 'piece', 'of', 'text', '.']


In [56]:
# Step 4: Remove stop words using list comprehension
# Note: words are converted to .lower() for matching
filtered_text = [w for w in tokens if w.lower() not in stop_words]

print("Original Text:", tokens)
print("Filtered Text (Stop-words removed):", filtered_text)


Original Text: ['This', 'is', 'a', 'sample', 'sentence', 'showing', 'how', 'to', 'remove', 'stop', 'words', 'from', 'a', 'piece', 'of', 'text', '.']
Filtered Text (Stop-words removed): ['sample', 'sentence', 'showing', 'remove', 'stop', 'words', 'piece', 'text', '.']
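
Two details are worth noting in the result above: the full stop survives the filter, and general-purpose stop lists commonly include negations such as "not", which some tasks (sentiment analysis, for example) need to keep. The sketch below addresses both using a small hand-written stop list; the list, the `keep` set, and the variable names are assumptions for illustration and are far shorter than NLTK's list.

```python
# Tokens copied from the output above.
tokens = ['This', 'is', 'a', 'sample', 'sentence', 'showing', 'how', 'to',
          'remove', 'stop', 'words', 'from', 'a', 'piece', 'of', 'text', '.']

# Small hand-written stop list (an assumption, not NLTK's list).
toy_stop_words = {"this", "is", "a", "how", "to", "from", "of", "not"}

# Words to keep even though they appear in the stop list.
keep = {"not"}
active_stop_words = toy_stop_words - keep

# Drop stop words and any token that is not purely alphabetic.
content_words = [
    w for w in tokens
    if w.isalpha() and w.lower() not in active_stop_words
]
print(content_words)

# The negation survives because it was removed from the active list.
negated = [w for w in ["This", "is", "not", "useful"]
           if w.lower() not in active_stop_words]
print(negated)
```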
