For the given text, apply the following preprocessing methods:
1. Tokenization
2. POS Tagging
3. Stop word Removal
4. Lemmatization
5. Stemming

In [14]:
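import nltk
# WordNet is needed for lemmatization (step 4). The tokenizer, POS tagger, and stop word list
# need their own resources too; names vary by NLTK version (e.g. 'punkt', 'stopwords').
nltk.download('wordnet')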

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\purva\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

1. Tokenization

In [15]:
from nltk.tokenize import word_tokenize

In [16]:
# Sample text
text = "Text tokenization serves as the initial phase of natural languages processing. It involves segmenting a given text into individual units, like words or sentences, which are known as tokens."

In [17]:
# Tokenization
tokens = word_tokenize(text)
print("Tokenization:", tokens)

Tokenization: ['Text', 'tokenization', 'serves', 'as', 'the', 'initial', 'phase', 'of', 'natural', 'languages', 'processing', '.', 'It', 'involves', 'segmenting', 'a', 'given', 'text', 'into', 'individual', 'units', ',', 'like', 'words', 'or', 'sentences', ',', 'which', 'are', 'known', 'as', 'tokens', '.']


2. POS Tagging

In [18]:
from nltk import pos_tag

In [19]:
# POS Tagging
pos_tags = pos_tag(tokens)
print("POS Tagging:", pos_tags)

POS Tagging: [('Text', 'NNP'), ('tokenization', 'NN'), ('serves', 'VBZ'), ('as', 'IN'), ('the', 'DT'), ('initial', 'JJ'), ('phase', 'NN'), ('of', 'IN'), ('natural', 'JJ'), ('languages', 'NNS'), ('processing', 'VBG'), ('.', '.'), ('It', 'PRP'), ('involves', 'VBZ'), ('segmenting', 'VBG'), ('a', 'DT'), ('given', 'VBN'), ('text', 'NN'), ('into', 'IN'), ('individual', 'JJ'), ('units', 'NNS'), (',', ','), ('like', 'IN'), ('words', 'NNS'), ('or', 'CC'), ('sentences', 'NNS'), (',', ','), ('which', 'WDT'), ('are', 'VBP'), ('known', 'VBN'), ('as', 'IN'), ('tokens', 'NNS'), ('.', '.')]


3. Stop word Removal

In [20]:
from nltk.corpus import stopwords

In [21]:

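# Load NLTK's English stop word list as a set for fast membership tests
stop_words = set(stopwords.words('english'))

print("stop_words: ",stop_words)
# Compare in lowercase so capitalized tokens such as "It" are also removed
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("Stop word Removal:", filtered_tokens)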


stop_words:  {'nor', 'below', 'your', 'my', 'during', "didn't", 'on', 'his', 'itself', "couldn't", 'these', 'myself', 'me', 'at', 'about', 'ain', 'hasn', 'own', 'will', 'wouldn', 'doesn', "hasn't", 've', "you're", 'any', 'up', 'how', 'further', 'you', 'wasn', 'have', 'than', "doesn't", 'where', 'in', 'that', 'no', 'with', "weren't", 'had', 'out', 'such', 'haven', 'shouldn', 'between', 'should', 'i', 'it', 'because', 'just', 'now', 'or', 'couldn', 'few', 'other', 'mightn', 'herself', 'having', 'll', 'while', "won't", 'to', 'they', 'over', 'did', 'both', 'most', 'we', 'ourselves', 'isn', 'shan', 'were', 'above', "should've", 'who', 'very', 'didn', "hadn't", 'of', 'from', 'o', 'so', 'again', 'aren', 'has', 'won', 'this', "shouldn't", 'weren', 'for', 'then', 'does', 'them', "mightn't", "isn't", 'yourselves', 'and', 'those', 'don', 'mustn', 'him', 'needn', "haven't", 'y', 'off', "that'll", 'are', 'against', 'theirs', "needn't", "she's", 'some', 'there', 'all', 'a', 'her', 'same', 'm', 'ma',

4. Lemmatization

In [22]:
from nltk.stem import WordNetLemmatizer

In [24]:
lemmatizer = WordNetLemmatizer()
# lemmatize() defaults to pos='n' (noun), so verbs such as 'serves' are looked up as nouns
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("Lemmatization:", lemmatized_tokens)

Lemmatization: ['Text', 'tokenization', 'serf', 'initial', 'phase', 'natural', 'language', 'processing', '.', 'involves', 'segmenting', 'given', 'text', 'individual', 'unit', ',', 'like', 'word', 'sentence', ',', 'known', 'token', '.']


5. Stemming

In [11]:
from nltk.stem import PorterStemmer

In [12]:
stemmer = PorterStemmer()
# The Porter stemmer lowercases each token and strips suffixes by rule
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("Stemming:", stemmed_tokens)

Stemming: ['text', 'token', 'serv', 'initi', 'phase', 'natur', 'languag', 'process', '.', 'involv', 'segment', 'given', 'text', 'individu', 'unit', ',', 'like', 'word', 'sentenc', ',', 'known', 'token', '.']



1. **Tokenization**:
   - Tokenization is the process of breaking down a text into smaller units called tokens. These tokens can be words, phrases, or other meaningful elements.
   - For example, consider the sentence: "The quick brown fox jumps over the lazy dog." Tokenization would split this sentence into individual words: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."].
   - Tokenization is a crucial step in natural language processing (NLP) tasks as it helps in extracting meaningful information from text data.
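
The splitting described above can be sketched with a regular expression. This is a simplified stand-in that uses only the standard library, not what `word_tokenize` does internally; the function name `simple_tokenize` is an assumption made for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+      matches a run of letters, digits, or underscores (a word)
    # [^\w\s]  matches one character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
```

A pattern like this handles plain sentences but will split contractions and abbreviations differently from a trained tokenizer, which is one reason a library tokenizer is usually preferred.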

2. **POS Tagging** (Part-of-Speech Tagging):
   - POS tagging is the process of assigning a part-of-speech tag (such as noun, verb, adjective, etc.) to each word in a sentence.
   - This helps in understanding the grammatical structure of a sentence and aids in various NLP tasks such as named entity recognition, text summarization, etc.
   - For example, consider the sentence: "The quick brown fox jumps over the lazy dog." POS tagging would assign tags to each word: [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"), ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"), (".", ".")].
   - Here, "DT" represents a determiner, "JJ" an adjective, "NN" a singular noun, "VBZ" a verb in the 3rd person singular present, and "IN" a preposition or subordinating conjunction. These are Penn Treebank tags.
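
Once tokens carry tags, downstream code can select words by grammatical role. The sketch below groups the tagged tokens from the example sentence by tag; the helper name `group_by_tag` is hypothetical, and the tagged list is copied from the example rather than produced by a tagger.

```python
from collections import defaultdict

# Tagged tokens copied from the example sentence (Penn Treebank tags).
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN"), (".", ".")]

def group_by_tag(tagged_tokens):
    """Return a dict mapping each tag to the words that received it, in order."""
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        groups[tag].append(word)
    return dict(groups)

groups = group_by_tag(tagged)
print(groups["NN"])  # the nouns
print(groups["JJ"])  # the adjectives
```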

3. **Stop Word Removal**:
   - Stop words are common words that are often filtered out of text analysis because they carry little meaning on their own (e.g., "the", "is", "are", "and", "but").
   - Stop word removal is the process of filtering out these stop words from text data.
   - This helps in reducing noise in the data and improving the efficiency of text processing algorithms.
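   - For example, consider the sentence: "The quick brown fox jumps over the lazy dog." After tokenization and stop word removal with NLTK's English list, it would become: ["quick", "brown", "fox", "jumps", "lazy", "dog", "."]. Punctuation is not a stop word, so it has to be filtered separately if it is unwanted.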
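
The filtering step can be sketched without any library. The stop word set below is a small hand-picked subset chosen for illustration, not NLTK's full list, and the `drop_punctuation` flag is an assumption added to show that punctuation needs its own check.

```python
import string

# Hypothetical, hand-picked subset of a stop word list (illustration only).
STOP_WORDS = {"the", "over", "is", "are", "and", "but", "a", "of"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, drop_punctuation=False):
    """Keep tokens that are not stop words; optionally drop punctuation too."""
    kept = []
    for token in tokens:
        # Compare in lowercase so "The" matches "the"
        if token.lower() in stop_words:
            continue
        # A token made only of punctuation characters is dropped on request
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept

tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
print(remove_stop_words(tokens))
print(remove_stop_words(tokens, drop_punctuation=True))
```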

4. **Lemmatization**:
   - Lemmatization is the process of reducing words to their base or dictionary form (known as lemma) while still ensuring that the reduced form belongs to the language.
   - It involves removing inflections and variations to bring words to their root form.
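   - For example, the lemma of the verb forms "running", "ran", and "runs" is "run". The result depends on the part of speech: a lemmatizer that treats a verb as a noun can return an unexpected lemma.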
   - Lemmatization helps in standardizing words so that variations of the same word are treated as the same entity in text analysis.
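
Because the lemma depends on the part of speech, one common approach is to pass each token's POS tag to the lemmatizer. The sketch below maps Penn Treebank tags to the single-letter codes WordNet uses ("n", "v", "a", "r"). The helper names `penn_to_wordnet` and `lemmatize_tagged` are assumptions for illustration, and the lemmatizer is passed in as a callable so the sketch runs without NLTK data; in the notebook, `WordNetLemmatizer().lemmatize` could be passed in that position.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Call lemmatize(word, pos) for each (word, tag) pair."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# A few (word, tag) pairs taken from the POS tagging output in step 2.
tagged = [("serves", "VBZ"), ("languages", "NNS"), ("initial", "JJ")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With this mapping, "serves" would be looked up as a verb rather than a noun, which should avoid the noun lemma "serf" seen in the notebook output.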

5. **Stemming**:
   - Stemming is similar to lemmatization, but it is a cruder, rule-based approach.
   - It strips affixes from words to reduce them to a stem; the Porter stemmer used above removes suffixes only.
   - Stemming is more aggressive than lemmatization and may not produce valid words, as with "initi" and "languag" in the output above.
   - For example, the Porter stemmer reduces "running" and "runs" to "run", but leaves the irregular form "ran" unchanged because no suffix rule applies to it.
   - Stemming is computationally cheaper than lemmatization and is often used in information retrieval systems and text mining tasks.
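
To show what "rule-based" means here, the following is a deliberately tiny suffix stripper. It is a toy, not the Porter algorithm: the suffix list, the minimum stem length, and the name `toy_stem` are all assumptions chosen for illustration.

```python
# Suffixes are tried in this order; the first one that matches is removed.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least three characters to remain after stripping
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["running", "ran", "runs", "segmenting", "sentences"]:
    print(w, "->", toy_stem(w))
```

The toy rules leave "ran" untouched and turn "running" into the non-word "runn", which illustrates both limits mentioned above: irregular forms are not handled, and stems need not be valid words. The Porter stemmer adds further rules (for example, undoing doubled consonants) that this sketch omits.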