# Task
Implement text preprocessing using NLTK, including tokenization, stop word removal, stemming, and lemmatization.

## Install NLTK

### Subtask:
Install the NLTK library and download necessary data.


**Reasoning**:
Install NLTK and download necessary data for text preprocessing.



In [10]:
%pip install nltk




**Reasoning**:
The previous command failed for two reasons: `nltk.downloader.DownloadError` does not exist, and the 'punkt' and 'wordnet' resources were not found. The corrected code calls `nltk.download()` directly for each required resource; packages that are already present are reported as up to date.



In [11]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required NLTK data (run once).
# Newer NLTK releases may also need 'punkt_tab' for word_tokenize.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Sample input text
text = "Natural Language Processing allows computers to understand human language. It's amazing, isn't it?"

# 1. Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# 2. Removing Stop Words
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("After Stop Words Removal:", filtered_tokens)

# 3. Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("After Stemming:", stemmed_tokens)

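# 4. Lemmatization
# lemmatize() treats each word as a noun unless a pos argument is passed,
# so verbs such as 'allows' come back unchanged.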
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("After Lemmatization:", lemmatized_tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Tokens: ['Natural', 'Language', 'Processing', 'allows', 'computers', 'to', 'understand', 'human', 'language', '.', 'It', "'s", 'amazing', ',', 'is', "n't", 'it', '?']
After Stop Words Removal: ['Natural', 'Language', 'Processing', 'allows', 'computers', 'understand', 'human', 'language', '.', "'s", 'amazing', ',', "n't", '?']
After Stemming: ['natur', 'languag', 'process', 'allow', 'comput', 'understand', 'human', 'languag', '.', "'s", 'amaz', ',', "n't", '?']
After Lemmatization: ['Natural', 'Language', 'Processing', 'allows', 'computer', 'understand', 'human', 'language', '.', "'s", 'amazing', ',', "n't", '?']
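
The cell above calls `nltk.download` for every package on each run. A minimal sketch of a check-first variant follows. It relies on `nltk.data.find`, which raises `LookupError` when a resource is missing. The helper name `ensure_nltk_resources` and the `REQUIRED` mapping are hypothetical, and the resource paths are assumed to match NLTK's usual data layout.

```python
import nltk

# Assumed mapping from resource path (as looked up by nltk.data.find) to the
# package name passed to nltk.download. Adjust it to the resources you need.
REQUIRED = {
    "tokenizers/punkt": "punkt",
    "corpora/stopwords": "stopwords",
    "corpora/wordnet": "wordnet",
}


def ensure_nltk_resources(resources):
    """Download each package only if its resource path is not found locally.

    Returns the list of package names that were requested for download.
    """
    requested = []
    for resource_path, package in resources.items():
        try:
            nltk.data.find(resource_path)
        except LookupError:
            # Missing locally: ask the NLTK downloader for the package.
            nltk.download(package, quiet=True)
            requested.append(package)
    return requested
```

Calling `ensure_nltk_resources(REQUIRED)` before the preprocessing steps would then invoke the downloader only for packages that are missing.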


## Tokenization

### Subtask:
Split the text into individual words or tokens.


**Reasoning**:
Import the necessary function, define sample text, and perform tokenization.



**Reasoning**:
The previous command failed because the 'punkt_tab' resource was not found. Download it with `nltk.download('punkt_tab')` before tokenizing.



## Stop word removal

### Subtask:
Remove common function words (such as "to", "is", and "it") that carry little meaning on their own.


**Reasoning**:
Import stopwords and filter the tokens list.
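
A minimal sketch of the filtering step, written as a function that takes the stop word set as a parameter so it can be reused with other word lists. The names `remove_stop_words` and `english_stop_words` are hypothetical. The comparison is case-insensitive, matching the `word.lower()` check in the earlier cell.

```python
from nltk.corpus import stopwords


def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]


def english_stop_words():
    """Load NLTK's English stop word list (requires the 'stopwords' corpus)."""
    return set(stopwords.words("english"))
```

With the corpus downloaded, `remove_stop_words(tokens, english_stop_words())` should reproduce the filtering shown earlier. Punctuation tokens such as '.' and contraction fragments such as "n't" are not in the list, so they pass through unchanged, as the earlier output shows.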



## Stemming

### Subtask:
Reduce words to their stem by stripping suffixes (e.g., "running" to "run"). The result is not always a dictionary word (e.g., "language" becomes "languag").


**Reasoning**:
Implement stemming to reduce words to their root form.
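
A short sketch of the stemming step using `PorterStemmer`, the stemmer used in the earlier cell. `stem_tokens` is a hypothetical helper name. The Porter algorithm is rule-based, so this step needs no downloaded NLTK data.

```python
from nltk.stem import PorterStemmer


def stem_tokens(tokens):
    """Apply the Porter stemmer to each token (output is lowercased)."""
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]


print(stem_tokens(["running", "computers", "Language"]))
# ['run', 'comput', 'languag']
```

The stems are truncated forms rather than dictionary words, which is the main practical difference from lemmatization.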



## Lemmatization

### Subtask:
Reduce words to their base or dictionary form (e.g., "better" to "good" when the word is treated as an adjective).


**Reasoning**:
Implement lemmatization using NLTK to reduce words to their base or dictionary form.



**Reasoning**:
The previous attempt failed because the 'averaged_perceptron_tagger_eng' resource was not found. Download it with `nltk.download('averaged_perceptron_tagger_eng')`.



**Reasoning**:
Now that the required resource is downloaded, retry the lemmatization step.
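
In recent NLTK releases, 'averaged_perceptron_tagger_eng' is the resource that `nltk.pos_tag` loads, which suggests this step tags each token before lemmatizing it. The cell itself is not shown here, so the following is only a sketch of that approach under stated assumptions: `penn_to_wordnet` and `lemmatize_with_pos` are hypothetical helpers, and the mapping covers only the four word classes that WordNet distinguishes.

```python
import nltk
from nltk.stem import WordNetLemmatizer


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"  # noun, also used for every other tag


def lemmatize_with_pos(tokens):
    """Tag tokens, then lemmatize each one with its mapped part of speech.

    Requires the 'wordnet' and 'averaged_perceptron_tagger_eng' resources.
    """
    lemmatizer = WordNetLemmatizer()
    return [
        lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
        for token, tag in nltk.pos_tag(tokens)
    ]
```

With a verb tag, a token such as 'allows' should be reduced to 'allow', whereas the noun-only call in the earlier cell left it unchanged.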

