# **Comprehensive NLP Lab: From Preprocessing to Feature Extraction**

In this lab, you will explore a wide range of Natural Language Processing (NLP) techniques, from basic text preprocessing to advanced feature extraction and analysis. By the end of this lab, you will be able to:

1. **Tokenize** and preprocess text data.
2. Remove **stop words** and **punctuation**.
3. Apply **stemming** and **lemmatization**.
4. Extract features using **Bag of Words (BoW)** and **TF-IDF**.
5. Generate **n-grams** to capture contextual information.
6. Evaluate the impact of different preprocessing techniques on text data.

Let's dive in!

## **1. Setup the Environment**


Before we begin, ensure you have the necessary libraries installed. Run the following cell to install them:


In [1]:
%pip install nltk scikit-learn pandas matplotlib


Note: you may need to restart the kernel to use updated packages.


Now, import the required libraries:

In [2]:

import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [3]:
# Download NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\happy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\happy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\happy\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\happy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\happy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!

True

## **2. Text Preprocessing**

### **Exercise 1: Tokenization and Stop Word Removal**

Tokenize the following text and store the result in `tokenized_text`:

In [4]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
# your code here

tokenized_text = word_tokenize(text)
print(tokenized_text)



['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'study', '!', 'It', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language', '.']
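
To build intuition for what a tokenizer does, here is a minimal regex-based sketch using only the standard library. It is not how `word_tokenize` is implemented; the helper name `simple_tokenize` is an assumption made for illustration. On this particular sentence it happens to produce the same tokens, but it will differ on harder cases such as contractions.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches exactly one character that is neither a word
    # character nor whitespace, i.e. one punctuation mark at a time.
    return re.findall(r"\w+|[^\w\s]", text)

text = ("Natural Language Processing (NLP) is a fascinating field of study! "
        "It involves analyzing and understanding human language.")
tokens = simple_tokenize(text)
print(tokens)
```

For example, `simple_tokenize("don't")` yields `['don', "'", 't']`, whereas NLTK's `word_tokenize` splits the same word into `do` and `n't`.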


Remove stop words and store the result in a variable called `filtered_tokens`. Compare tokens in lowercase, because the NLTK stop word list is lowercase.

In [5]:
stop_words = set(stopwords.words('english'))
# your code here
filtered_tokens = [txt for txt in tokenized_text if txt.lower() not in stop_words]

In [7]:
print("Filtered Tokens:", filtered_tokens)

Filtered Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']


Remove punctuation from the filtered tokens and store the result in `filtered_tokens_no_punctuation`.

In [8]:
# Character class that matches any single ASCII punctuation character
pattern = re.compile('[%s]' % re.escape(string.punctuation))
filtered_tokens_no_punctuation = []

for token in filtered_tokens:
    new_token = pattern.sub('', token)  # Strip punctuation characters from the token
    if new_token:  # Keep only tokens that are not empty after stripping
        filtered_tokens_no_punctuation.append(new_token)

print(filtered_tokens_no_punctuation)


['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']
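
Note that the pattern above removes punctuation characters inside tokens as well, not only tokens that are pure punctuation. The sketch below shows the effect on a few hand-picked sample tokens (the sample list is an assumption for illustration) and contrasts it with one possible alternative that drops only the tokens made up entirely of punctuation.

```python
import re
import string

pattern = re.compile('[%s]' % re.escape(string.punctuation))

# Hand-picked sample tokens, chosen for illustration only
samples = ['France_Inte', "n't", 'state-of-the-art', ':', 'NLP']

# Approach used in this lab: delete every punctuation character
stripped = [pattern.sub('', tok) for tok in samples]
print(stripped)

# Alternative: keep tokens intact, drop only pure-punctuation tokens
kept_whole = [tok for tok in samples
              if not all(ch in string.punctuation for ch in tok)]
print(kept_whole)
```

Which behavior you want depends on the task: stripping inside tokens merges `France_Inte` into `FranceInte`, while the alternative keeps underscores, hyphens, and apostrophes.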




### **Exercise 2: Stemming and Lemmatization**

Apply stemming and lemmatization to `filtered_tokens_no_punctuation` and compare the results token by token.

In [9]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print("---- ",filtered_tokens_no_punctuation,"----")
for token in filtered_tokens_no_punctuation:
    print("---- ",token,"----")
    print('PorterStemmer:',stemmer.stem(token))
    print('WordNetLemmatizer:',lemmatizer.lemmatize(token))
    print()


----  ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language'] ----
----  Natural ----
PorterStemmer: natur
WordNetLemmatizer: Natural

----  Language ----
PorterStemmer: languag
WordNetLemmatizer: Language

----  Processing ----
PorterStemmer: process
WordNetLemmatizer: Processing

----  NLP ----
PorterStemmer: nlp
WordNetLemmatizer: NLP

----  fascinating ----
PorterStemmer: fascin
WordNetLemmatizer: fascinating

----  field ----
PorterStemmer: field
WordNetLemmatizer: field

----  study ----
PorterStemmer: studi
WordNetLemmatizer: study

----  involves ----
PorterStemmer: involv
WordNetLemmatizer: involves

----  analyzing ----
PorterStemmer: analyz
WordNetLemmatizer: analyzing

----  understanding ----
PorterStemmer: understand
WordNetLemmatizer: understanding

----  human ----
PorterStemmer: human
WordNetLemmatizer: human

----  language ----
PorterStemmer: languag
WordNetLemmatizer: language



Apply stemming and store the result in `stemmed_tokens`.

In [10]:
# your code here
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens_no_punctuation]

In [11]:
print("Stemmed Tokens:", stemmed_tokens)

Stemmed Tokens: ['natur', 'languag', 'process', 'nlp', 'fascin', 'field', 'studi', 'involv', 'analyz', 'understand', 'human', 'languag']


Apply lemmatization and store the result in `lemmatized_tokens`.

In [12]:
# your code here
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens_no_punctuation]


In [21]:
print("Lemmatized Tokens:", lemmatized_tokens)

Lemmatized Tokens: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


## **3. Feature Extraction**

### **Exercise 3: Bag of Words (BoW)**

Use the `CountVectorizer` from `scikit-learn` to create a Bag of Words representation of the following corpus. Store the result in `X`.

In [13]:
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

In [14]:

# your code here
# Step 1: Initialize the CountVectorizer
vectorizer = CountVectorizer()
# Step 2: Fit and transform the corpus into a BoW representation
X = vectorizer.fit_transform(corpus)

In [15]:
print("Bag of Words:\n", X.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

Bag of Words:
 [[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
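
Notice that `i` is missing from the vocabulary. With its default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters. The following standard-library sketch reproduces the matrix above by hand under those two defaults; the helper name `tokenize` is an assumption, and this is not scikit-learn's actual implementation.

```python
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern
    return re.findall(r"\b\w\w+\b", doc.lower())

# One term-frequency counter per document
counts = [Counter(tokenize(doc)) for doc in corpus]

# The vocabulary is the sorted set of all terms seen in the corpus
vocabulary = sorted(set().union(*counts))

# Each row holds the count of every vocabulary term in one document
bow = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in bow:
    print(row)
```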


### **Exercise 4: TF-IDF**

Use the `TfidfVectorizer` from `scikit-learn` to create a TF-IDF representation of the same corpus. Store the result in `X_tfidf`.

In [40]:
# your code here
# Step 1: Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
# Step 2: Fit and transform the corpus into a TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

In [41]:
print("TF-IDF:\n", X_tfidf.toarray())
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

TF-IDF:
 [[0.         0.         0.         0.         0.         0.861037
  0.         0.50854232 0.        ]
 [0.65249088 0.         0.         0.65249088 0.         0.
  0.         0.38537163 0.        ]
 [0.         0.43238509 0.43238509 0.         0.43238509 0.
  0.43238509 0.2553736  0.43238509]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
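
To see where these numbers come from, the sketch below recomputes them by hand. It assumes the `TfidfVectorizer` defaults: raw term counts, smoothed inverse document frequency $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$, and L2 normalization of each row. The helper name `tokenize` is an assumption for illustration.

```python
import math
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

counts = [Counter(tokenize(doc)) for doc in corpus]
vocabulary = sorted(set().union(*counts))
n_docs = len(corpus)

# Document frequency: number of documents that contain each term
df = {term: sum(1 for c in counts if term in c) for term in vocabulary}

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

tfidf = []
for c in counts:
    row = [c[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in row))  # L2 norm of the row
    tfidf.append([v / norm for v in row])

for row in tfidf:
    print([round(v, 4) for v in row])
```

Because `nlp` appears in every document, its idf is exactly 1, the lowest possible value, which is why it receives the smallest weight in each row.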


### **Exercise 5: N-grams**

Generate **bigrams (2-grams)** from the corpus using `CountVectorizer`. Store the result in `X_bigram`.

In [42]:
# your code here
# Step 1: Initialize the CountVectorizer with ngram_range=(2, 2)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
# Step 2: Fit and transform the corpus into a bigram representation
X_bigram = bigram_vectorizer.fit_transform(corpus)


In [43]:
print("Bigrams:\n", X_bigram.toarray())
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

Bigrams:
 [[0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [1 1 0 1 0 1 0 1]]
Bigram Vocabulary: ['enjoy learning' 'in nlp' 'is amazing' 'learning new' 'love nlp'
 'new things' 'nlp is' 'things in']
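
An n-gram is simply a sliding window of `n` consecutive tokens. As a rough standard-library sketch (the helper names `tokenize` and `ngrams` are assumptions, not scikit-learn functions), the bigram vocabulary above can be rebuilt like this:

```python
import re

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams_per_doc = [ngrams(tokenize(doc), 2) for doc in corpus]
bigram_vocabulary = sorted({bg for doc in bigrams_per_doc for bg in doc})

print(bigrams_per_doc)
print(bigram_vocabulary)
```

Since single-character tokens such as `I` are dropped before the window is applied, the first document contributes only one bigram, `love nlp`.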


## **4. Advanced Exercise: Custom Preprocessing Pipeline**

### **Exercise 6: Build a Custom Preprocessing Pipeline**

Combine all the preprocessing steps (tokenization, stop word removal, punctuation removal, stemming/lemmatization) into a single function. 

In [44]:
# your code here
def text_preprocessing_pipeline(text):
    """Tokenize, remove stop words and punctuation, then lemmatize.

    Prints the intermediate result of each step and returns the
    list of lemmatized tokens.
    """
    # Step 1: Tokenize the text
    tokenized_text = word_tokenize(text)
    print(tokenized_text)
    # Step 2: Remove stop words (compared in lowercase)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [txt for txt in tokenized_text if txt.lower() not in stop_words]
    print("Filtered Tokens:", filtered_tokens)
    # Step 3: Remove punctuation characters and drop tokens left empty
    pattern = re.compile('[%s]' % re.escape(string.punctuation))
    filtered_tokens_no_punctuation = []
    for token in filtered_tokens:
        new_token = pattern.sub('', token)
        if new_token:
            filtered_tokens_no_punctuation.append(new_token)
    print(filtered_tokens_no_punctuation)
    # Step 4: Apply lemmatization (tokens are treated as nouns by default)
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens_no_punctuation]
    print("Lemmatized Tokens:", lemmatized_tokens)
    return lemmatized_tokens


Apply this function to the following text and store the result in `processed_text`:

In [45]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

# your code here
processed_text = text_preprocessing_pipeline(text)


['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'study', '!', 'It', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language', '.']
Filtered Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']
['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']
Lemmatized Tokens: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


In [46]:
print("Processed Text:", processed_text)

Processed Text: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


## **5. Evaluation of Preprocessing Techniques**

### **Exercise 7: Compare Preprocessing Techniques**

Compare the results of stemming and lemmatization on the following sentence. Store the results in `stemmed_tokens` and `lemmatized_tokens`.

In [47]:
def text_preprocessing(text):
    """Tokenize the text and remove stop words and punctuation."""
    # Step 1: Tokenize the text
    tokenized_text = word_tokenize(text)
    # Step 2: Remove stop words (compared in lowercase)
    stop_words = set(stopwords.words('english'))
    filtered_tokens_no_stop_words = [txt for txt in tokenized_text if txt.lower() not in stop_words]
    # Step 3: Remove punctuation characters and drop tokens left empty
    pattern = re.compile('[%s]' % re.escape(string.punctuation))
    filtered_tokens = []
    for token in filtered_tokens_no_stop_words:
        new_token = pattern.sub('', token)
        if new_token:
            filtered_tokens.append(new_token)
    print(filtered_tokens)
    return filtered_tokens


sentence = "The cats are playing with the mice in the garden."
# Step 1: Tokenize and preprocess the sentence and store the result in processed_sentence
processed_sentence = text_preprocessing(sentence)
# Step 2: Apply stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in processed_sentence]
# Step 3: Apply lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in processed_sentence]

print("Original Tokens:", processed_sentence)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)


['cats', 'playing', 'mice', 'garden']
Original Tokens: ['cats', 'playing', 'mice', 'garden']
Stemmed Tokens: ['cat', 'play', 'mice', 'garden']
Lemmatized Tokens: ['cat', 'playing', 'mouse', 'garden']
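
The two outputs differ because the techniques work differently: a stemmer strips suffixes by rule, while a lemmatizer looks words up in a lexicon. The toy sketch below makes that contrast concrete. Both functions and the tiny lookup table are hypothetical stand-ins written for this illustration; they are far simpler than `PorterStemmer` and `WordNetLemmatizer` and only mimic their behavior on this one sentence.

```python
def toy_stem(word):
    """Strip a couple of common suffixes by rule; no dictionary involved."""
    for suffix in ("ing", "s"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written noun table standing in for a real lexicon
TOY_NOUN_LEMMAS = {"cats": "cat", "mice": "mouse"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is in the table."""
    return TOY_NOUN_LEMMAS.get(word, word)

tokens = ['cats', 'playing', 'mice', 'garden']
toy_stemmed = [toy_stem(t) for t in tokens]
toy_lemmatized = [toy_lemmatize(t) for t in tokens]

print("Rule-based: ", toy_stemmed)
print("Lookup-based:", toy_lemmatized)
```

The rule-based approach cannot handle the irregular plural `mice`, while the lookup handles it but leaves `playing` alone because the table only covers nouns. The real `WordNetLemmatizer` behaves similarly: it treats every token as a noun unless you pass a part of speech, so `lemmatizer.lemmatize('playing', pos='v')` is needed to obtain `play`.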



## **6. Real-World Dataset: Sentiment Analysis**

### **Exercise 8: Preprocess and Analyze Tweets**

In this exercise, you will work with a real-world dataset of tweets. The dataset contains 5000 positive and 5000 negative tweets. Your task is to preprocess the tweets and extract features for sentiment analysis.


In [49]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\happy\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [50]:
# Load the dataset
from nltk.corpus import twitter_samples

Load the dataset of positive and negative tweets. 

In [51]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Combine them into a single list called `all_tweets` and create a corresponding list of labels called `labels`.

In [52]:
# Combine tweets
all_tweets = positive_tweets + negative_tweets

# Create labels: 1 for positive, 0 for negative
labels = [1] * len(positive_tweets) + [0] * len(negative_tweets)


In [53]:
# Print a sample tweet
print("Sample Tweet:", all_tweets[0])
print("Label:", labels[0])

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Label: 1


### **Exercise 9: Preprocess Tweets**

Apply the custom preprocessing pipeline to the entire dataset of tweets. Store the result in `preprocessed_tweets`.

In [54]:
# Step 1: Apply the preprocessing pipeline to all tweets
# your code here
# Reuse text_preprocessing_pipeline to tokenize each tweet, remove stop words
# and punctuation, and lemmatize the remaining tokens
preprocessed_tweets = [text_preprocessing_pipeline(tweet) for tweet in all_tweets]

['#', 'FollowFriday', '@', 'France_Inte', '@', 'PKuchly57', '@', 'Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':', ')']
Filtered Tokens: ['#', 'FollowFriday', '@', 'France_Inte', '@', 'PKuchly57', '@', 'Milipol_Paris', 'top', 'engaged', 'members', 'community', 'week', ':', ')']
['FollowFriday', 'FranceInte', 'PKuchly57', 'MilipolParis', 'top', 'engaged', 'members', 'community', 'week']
Lemmatized Tokens: ['FollowFriday', 'FranceInte', 'PKuchly57', 'MilipolParis', 'top', 'engaged', 'member', 'community', 'week']
['@', 'Lamb2ja', 'Hey', 'James', '!', 'How', 'odd', ':', '/', 'Please', 'call', 'our', 'Contact', 'Centre', 'on', '02392441234', 'and', 'we', 'will', 'be', 'able', 'to', 'assist', 'you', ':', ')', 'Many', 'thanks', '!']
Filtered Tokens: ['@', 'Lamb2ja', 'Hey', 'James', '!', 'odd', ':', '/', 'Please', 'call', 'Contact', 'Centre', '02392441234', 'able', 'assist', ':', ')', 'Many', 'thanks', '!']
...

In [55]:
# Print a few sample preprocessed tweets
for i in [0, 10, 11, 15, 16, 17]:
    print("Preprocessed Tweets Sample:", preprocessed_tweets[i])

Preprocessed Tweets Sample: ['FollowFriday', 'FranceInte', 'PKuchly57', 'MilipolParis', 'top', 'engaged', 'member', 'community', 'week']
Preprocessed Tweets Sample: ['FollowFriday', 'wncer1', 'Defensegouv', 'top', 'influencers', 'community', 'week']
Preprocessed Tweets Sample: ['Would', 'nt', 'Love', 'Big', 'Juicy', 'Selfies', 'http', 'tcoQVzjgd1uFo', 'http', 'tcooWBL11eQRY']
Preprocessed Tweets Sample: ['Laying', 'greeting', 'card', 'range', 'print', 'today', 'love', 'job']
Preprocessed Tweets Sample: ['Friend', 's', 'lunch', 'yummmm', 'Nostalgia', 'TBS', 'KU']
Preprocessed Tweets Sample: ['RookieSenpai', 'arcadester', 'id', 'conflict', 'thanks', 'help', 's', 'screenshot', 'working']
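
The samples show that user handles and link fragments such as `tcoQVzjgd1uFo` survive as tokens, because the pipeline strips punctuation from them instead of removing them. If you would rather drop them, one option is to clean each tweet before tokenizing. The sketch below is an assumption about what such a step could look like, not part of the lab's pipeline; `clean_tweet` and both patterns are hypothetical names, and whether handles and URLs carry useful signal depends on your task.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # http:// or https:// up to the next space
MENTION_RE = re.compile(r"@\w+")       # @ followed by a user handle

def clean_tweet(tweet):
    """Remove URLs and @mentions, and drop the '#' from hashtags."""
    tweet = URL_RE.sub(" ", tweet)
    tweet = MENTION_RE.sub(" ", tweet)
    tweet = tweet.replace("#", "")
    # Collapse the runs of whitespace left behind by the removals
    return " ".join(tweet.split())

sample = ("#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being "
          "top engaged members in my community this week :)")
cleaned = clean_tweet(sample)
print(cleaned)
```

You could then pass `clean_tweet(tweet)` to the preprocessing function in place of the raw tweet.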


### **Exercise 10: Feature Extraction on Tweets**

Extract features from the preprocessed tweets using **Bag of Words** and **TF-IDF**. Store the results in `X_bow` and `X_tfidf`, respectively.

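Note that `CountVectorizer.fit_transform` expects an iterable of documents, each given as one string. Calling it on a single tweet inside a loop fails with `ValueError: Iterable over raw text documents expected, string object received.` Because each entry of `preprocessed_tweets` is a list of tokens, join the tokens of each tweet back into a single string, then fit the vectorizer once on the whole list.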

In [61]:
# your code here
preprocessed_tweets_strings = [' '.join(tweet) for tweet in preprocessed_tweets]
# Step 1: Create a Bag of Words representation
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(preprocessed_tweets_strings)


# Step 2: Create a TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(preprocessed_tweets_strings) 


In [65]:
# Print the Bag of Words rows for a few sample tweets
for i in [0, 10, 11, 15, 16, 17]:
    print("Bag of Words Tweets Sample:", X_bow[i])
print(X_bow.shape)

Bag of Words Tweets Sample:   (0, 5902)	1
  (0, 5995)	1
  (0, 12749)	1
  (0, 10711)	1
  (0, 18336)	1
  (0, 5195)	1
  (0, 10553)	1
  (0, 3548)	1
  (0, 19350)	1
Bag of Words Tweets Sample:   (0, 5902)	1
  (0, 18336)	1
  (0, 3548)	1
  (0, 19350)	1
  (0, 19570)	1
  (0, 4287)	1
  (0, 7856)	1
Bag of Words Tweets Sample:   (0, 7445)	2
  (0, 11793)	1
  (0, 19664)	1
  (0, 9874)	1
  (0, 2120)	1
  (0, 8636)	1
  (0, 14416)	1
  (0, 17218)	1
  (0, 17112)	1
Bag of Words Tweets Sample:   (0, 9874)	1
  (0, 9420)	1
  (0, 6661)	1
  (0, 2853)	1
  (0, 13377)	1
  (0, 13050)	1
  (0, 18254)	1
  (0, 8483)	1
Bag of Words Tweets Sample:   (0, 6051)	1
  (0, 9975)	1
  (0, 20078)	1
  (0, 11744)	1
  (0, 16062)	1
  (0, 9219)	1
Bag of Words Tweets Sample:   (0, 17860)	1
  (0, 13898)	1
  (0, 1283)	1
  (0, 7611)	1
  (0, 3594)	1
  (0, 7148)	1
  (0, 14334)	1
  (0, 19638)	1
(10000, 20229)


In [66]:


# Print the TF-IDF rows for a few sample tweets
for i in [0, 10, 11, 15, 16, 17]:
    print("TF-IDF Tweets Sample:", X_tfidf[i])
print(X_tfidf.shape)

TF-IDF Tweets Sample:   (0, 5902)	0.2957676144942458
  (0, 5995)	0.40488606021832224
  (0, 12749)	0.40488606021832224
  (0, 10711)	0.40488606021832224
  (0, 18336)	0.28075783665786486
  (0, 5195)	0.34591012544541805
  (0, 10553)	0.30098338017396975
  (0, 3548)	0.2843550892621975
  (0, 19350)	0.22537915448929338
TF-IDF Tweets Sample:   (0, 5902)	0.3457878387669157
  (0, 18336)	0.3282396070334015
  (0, 3548)	0.3324452270627541
  (0, 19350)	0.26349528114236376
  (0, 19570)	0.47336039799066215
  (0, 4287)	0.47336039799066215
  (0, 7856)	0.38026281172711085
TF-IDF Tweets Sample:   (0, 7445)	0.26229003277722196
  (0, 11793)	0.16517681084462185
  (0, 19664)	0.23964147964619106
  (0, 9874)	0.19806888423695737
  (0, 2120)	0.2929571759624967
  (0, 8636)	0.4142684927211761
  (0, 14416)	0.38799612723480337
  (0, 17218)	0.44680977632345553
  (0, 17112)	0.44680977632345553
TF-IDF Tweets Sample:   (0, 9874)	0.19927425554433217
  (0, 9420)	0.4167895719739785
  (0, 6661)	0.4167895719739785
  ...

## **7. Conclusion**

In this lab, you explored a wide range of NLP techniques, from basic text preprocessing to advanced feature extraction and analysis. You also worked with a real-world dataset of tweets and applied your knowledge to preprocess and extract features for sentiment analysis.

