
## Tokenization

Tokenization is the process of breaking text into smaller units called tokens.

Tokens can be:

1.   Words
2.   Sentences
3.   Characters
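
The three granularities can be sketched with the standard library alone. This is only an approximation, not what NLTK does: the regular expressions below are simplified assumptions, and the sentence splitter would break on abbreviations such as "Dr.".

```python
import re

text = "I love NLP. It is very powerful."

# Word-level: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", text)

# Sentence-level: split on whitespace that follows '.', '!' or '?'
sentences = re.split(r"(?<=[.!?])\s+", text)

# Character-level: every character of a string, including spaces
chars = list("I love")

print(words)
print(sentences)
print(chars)
```

The NLTK functions used in the cells below handle many cases that these patterns miss, so the sketch is meant only to make the idea of a token concrete.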





In [1]:
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

### Word Tokenization

In [2]:
from nltk.tokenize import word_tokenize

text = "Natural Language Processing is very useful."
tokens = word_tokenize(text)

print(tokens)


['Natural', 'Language', 'Processing', 'is', 'very', 'useful', '.']


### Sentence Tokenization

In [3]:
from nltk.tokenize import sent_tokenize

text = "I love NLP. It is very powerful."
sentences = sent_tokenize(text)

print(sentences)


['I love NLP.', 'It is very powerful.']


### Character Tokenization with NLTK's RegexpTokenizer

In [4]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\S')  # matches every non-space character
text = "Natural Language"

char_tokens = tokenizer.tokenize(text)
print(char_tokens)


['N', 'a', 't', 'u', 'r', 'a', 'l', 'L', 'a', 'n', 'g', 'u', 'a', 'g', 'e']
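
The output above contains no space because the pattern `\S` matches only non-whitespace characters. The sketch below applies the same pattern with the standard `re` module, which should give the same result for this pattern, and compares it with `list()`, which keeps every character. The variable names are illustrative.

```python
import re

text = "Natural Language"

# Same pattern as the tokenizer above: each non-whitespace character
no_spaces = re.findall(r"\S", text)

# list() keeps every character, including the space between the words
with_spaces = list(text)

print(len(no_spaces), len(with_spaces))
print(with_spaces[7] == " ")
```

Whether whitespace should count as a token depends on the downstream task, so either form can be the right choice.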


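## Stemming

Stemming reduces a word to a root form by stripping suffixes with fixed rules, so the result is not always a dictionary word (for example, `easili`). The cells below compare NLTK's Porter, Snowball, Lancaster, and regular-expression stemmers.

In [5]: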
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["playing", "played", "plays", "easily", "fairly"]
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)


['play', 'play', 'play', 'easili', 'fairli']


In [6]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

words = ["running", "runs", "easily", "fairly"]
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)


['run', 'run', 'easili', 'fair']


In [7]:
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()

words = ["running", "runs", "easily", "fairly"]
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)


['run', 'run', 'easy', 'fair']


In [8]:
from nltk.stem import RegexpStemmer

stemmer = RegexpStemmer('ing$|s$|ed$')

words = ["playing", "played", "plays"]
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)


['play', 'play', 'play']
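
A pattern-based stemmer works well on the three words above, but it strips the same suffixes from short words where they are part of the root. The sketch below reproduces the idea with the standard `re` module; `strip_suffix` and `min_len` are hypothetical names introduced here, not NLTK APIs.

```python
import re

SUFFIXES = re.compile(r"ing$|s$|ed$")  # same pattern as the cell above


def strip_suffix(word, min_len=0):
    """Remove one matching suffix; leave words shorter than min_len alone."""
    if len(word) < min_len:
        return word
    return SUFFIXES.sub("", word)


for w in ["playing", "red", "bus", "sing"]:
    print(w, "->", strip_suffix(w), "| min_len=5:", strip_suffix(w, min_len=5))
```

A length guard is a blunt fix, since it only protects short words. NLTK's `RegexpStemmer` accepts a similar `min` argument for the same purpose.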


## Lemmatization

Lemmatization is a Natural Language Processing (NLP) technique that reduces words to their base or dictionary form, known as a lemma. Unlike stemming, it uses the grammatical role of a word (for example, whether it is a noun or a verb) so that the result is a valid dictionary word.

In [9]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

### Step 1: Tokenization

The text is broken down into individual components called tokens. This ensures the algorithm processes one word at a time.

In [10]:
text = "The striped bats are hanging on their feet for a meeting"

# Step 1: Tokenization
tokens = word_tokenize(text)
print(f"Tokens: {tokens}")

Tokens: ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'a', 'meeting']


### Step 2: Part-of-Speech (POS) Tagging

The lemmatizer needs to know whether a word is a noun, verb, adjective, or adverb to find the correct lemma. `nltk.pos_tag` assigns each token a Penn Treebank tag that identifies this role.

In [11]:

# Step 2: POS Tagging
pos_tags = nltk.pos_tag(tokens)
print(f"POS Tags: {pos_tags}")


POS Tags: [('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'), ('for', 'IN'), ('a', 'DT'), ('meeting', 'NN')]
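
To see why the tag matters, consider a word such as "meeting", which can be a noun or a verb form. The toy lookup below is a hand-written illustration, not WordNet data: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the table holds only a few entries.

```python
# Toy lemma table keyed by (word, part of speech). The entries are
# hand-written for illustration; a real lemmatizer consults WordNet.
TOY_LEMMAS = {
    ("meeting", "n"): "meeting",  # noun: a gathering
    ("meeting", "v"): "meet",     # verb: the -ing form of "meet"
    ("feet", "n"): "foot",
    ("are", "v"): "be",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("meeting", "n"))
print(toy_lemmatize("meeting", "v"))
print(toy_lemmatize("striped", "a"))
```

The same surface form maps to different lemmas depending on the tag, which is why the tagging step comes before the lookup.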


### Step 3: Morphological Analysis (Mapping)

`nltk.pos_tag` returns Penn Treebank tags (such as 'VBG' for a verb in its -ing form), which must be converted to the single-letter codes the WordNet lemmatizer accepts (such as 'v' for any verb). This mapping makes the tags from Step 2 usable in the dictionary lookup of Step 4.

In [12]:
from nltk.corpus import wordnet

# Step 3: Function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN # Default to noun

mapped_tags = [(word, get_wordnet_pos(tag)) for word, tag in pos_tags]
print(f"Mapped Tags: {mapped_tags}")


Mapped Tags: [('The', 'n'), ('striped', 'a'), ('bats', 'n'), ('are', 'v'), ('hanging', 'v'), ('on', 'n'), ('their', 'n'), ('feet', 'n'), ('for', 'n'), ('a', 'n'), ('meeting', 'n')]
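
The same mapping can also be written as a dictionary lookup on the first letter of the tag. This is a minimal alternative sketch, assuming the WordNet constants are the single letters 'a', 'v', 'n', and 'r' as shown in the output above; `TAG_PREFIX_TO_WORDNET` and `to_wordnet_pos` are hypothetical names.

```python
# WordNet POS letters: 'a' adjective, 'v' verb, 'n' noun, 'r' adverb
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def to_wordnet_pos(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter by its first character."""
    return TAG_PREFIX_TO_WORDNET.get(tag[:1], default)


sample = [("striped", "JJ"), ("bats", "NNS"), ("hanging", "VBG"), ("their", "PRP$")]
print([(word, to_wordnet_pos(tag)) for word, tag in sample])
```

Tags with no WordNet counterpart, such as 'PRP$' or 'DT', fall back to the noun default, matching the `else` branch of `get_wordnet_pos`.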


### Step 4: Dictionary Lookup (Lemmatization)

`WordNetLemmatizer` looks up each token, together with its WordNet POS tag, in the WordNet lexical database and returns the lemma. Words it cannot find are returned unchanged.

In [13]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [16]:
lemmatizer = WordNetLemmatizer()

# Step 4: Final Lemmatization using the dictionary lookup
lemmatized_output = [lemmatizer.lemmatize(word, pos) for word, pos in mapped_tags]

print("Final Lemmatization Results:")
print(" ".join(lemmatized_output))

Final Lemmatization Results:
The striped bat be hang on their foot for a meeting
