# **Comprehensive NLP Lab: From Preprocessing to Feature Extraction**

In this lab, you will explore a wide range of Natural Language Processing (NLP) techniques, from basic text preprocessing to advanced feature extraction and analysis. By the end of this lab, you will be able to:

1. **Tokenize** and preprocess text data.
2. Remove **stop words** and **punctuation**.
3. Apply **stemming** and **lemmatization**.
4. Extract features using **Bag of Words (BoW)** and **TF-IDF**.
5. Generate **n-grams** to capture contextual information.
6. Evaluate the impact of different preprocessing techniques on text data.

Let's dive in!

## **1. Setup the Environment**


Before we begin, ensure you have the necessary libraries installed. Run the following cell to install them:


In [1]:
!pip install nltk scikit-learn pandas matplotlib




Now, import the required libraries:

In [2]:

import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [3]:
# Download NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kiraredberg/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kiraredberg/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/kiraredberg/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kiraredberg/nltk_data...
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/kiraredberg/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/kiraredberg/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## **2. Text Preprocessing**

### **Exercise 1: Tokenization and Stop Word Removal**

Tokenize the following text

In [47]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
# your code here
def tokenization(given_text):
    given_text = given_text.lower()
    tokens = re.findall(r"[a-zA-Z0-9']+", given_text)
    return tokens
print(tokenization(text))


['natural', 'language', 'processing', 'nlp', 'is', 'a', 'fascinating', 'field', 'of', 'study', 'it', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language']


In [48]:
#Tokenize the following text
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
text = text.lower()
tokens = word_tokenize(text)
print(tokens)

['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'a', 'fascinating', 'field', 'of', 'study', '!', 'it', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language', '.']


Remove stop words and store the result in a variable called `filtered_tokens`

In [49]:
#Remove stop words and store the result in a variable called `filtered_tokens`
stop_words = set(stopwords.words("english"))
punct = set(string.punctuation)
filtered_tokens = []
for t in tokens:
    if t not in stop_words and t not in punct:
        filtered_tokens.append(t)

In [50]:
print("Filtered Tokens:", filtered_tokens)

Filtered Tokens: ['natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


### **Exercise 2: Stemming and Lemmatization**

Apply stemming and lemmatization to the `filtered_tokens`. Compare the results.

In [51]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

Apply stemming and store the result in `stemmed_tokens`

In [52]:
def stemming(filtered_tokens):
    stemmed_tokens = []
    for t in filtered_tokens:
        stemmed_tokens.append(stemmer.stem(t))
    return stemmed_tokens

In [53]:
result = stemming(filtered_tokens)
print("Stemmed Tokens:", result)

Stemmed Tokens: ['natur', 'languag', 'process', 'nlp', 'fascin', 'field', 'studi', 'involv', 'analyz', 'understand', 'human', 'languag']


Apply lemmatization and store the result in `lemmatized_tokens`

In [None]:
def lemmatization(filtered_tokens):
    lemmatized_tokens = []
    for t in filtered_tokens:
        lemmatized_tokens.append(lemmatizer.lemmatize(t))
    return lemmatized_tokens

In [55]:
result = lemmatization(filtered_tokens)
print("Lemmatized Tokens:", result)

Lemmatized Tokens: ['natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


## **3. Feature Extraction**

### **Exercise 3: Bag of Words (BoW)**

Use the `CountVectorizer` from `scikit-learn` to create a Bag of Words representation of the following corpus

In [56]:
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

In [None]:

# your code here
# Step 1: Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Step 2: Fit and transform the corpus into a BoW representation
X = vectorizer.fit_transform(corpus)


In [64]:
print("Bag of Words:\n", X.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

Bag of Words:
 [[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']


### **Exercise 4: TF-IDF**

Use the `TfidfVectorizer` from `scikit-learn` to create a TF-IDF representation of the same corpus. Store the result in `X_tfidf`

In [73]:
# your code here
# Step 1: Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Step 2: Fit and transform the corpus into a TF-IDF representation
X_tfidf = vectorizer.fit_transform(corpus)
print(X_tfidf)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 11 stored elements and shape (3, 9)>
  Coords	Values
  (0, 5)	0.8610369959439764
  (0, 7)	0.5085423203783267
  (1, 7)	0.3853716274664007
  (1, 3)	0.652490884512534
  (1, 0)	0.652490884512534
  (2, 7)	0.25537359879528915
  (2, 1)	0.43238508878969045
  (2, 4)	0.43238508878969045
  (2, 6)	0.43238508878969045
  (2, 8)	0.43238508878969045
  (2, 2)	0.43238508878969045


In [72]:
print("TF-IDF:\n", X_tfidf.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

TF-IDF:
 [[0.         0.         0.         0.         0.         0.861037
  0.         0.50854232 0.        ]
 [0.65249088 0.         0.         0.65249088 0.         0.
  0.         0.38537163 0.        ]
 [0.         0.43238509 0.43238509 0.         0.43238509 0.
  0.43238509 0.2553736  0.43238509]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']


### **Exercise 5: N-grams**

Generate `bigrams (2-grams)` from the corpus using `CountVectorizer`. Store the result in `X_bigram`

In [74]:
# your code here
# Step 1: Initialize the CountVectorizer with ngram_range=(2, 2)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
# Step 2: Fit and transform the corpus into a bigram representation
X_bigram = bigram_vectorizer.fit_transform(corpus)


In [75]:
print("Bigrams:\n", X_bigram.toarray())
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

Bigrams:
 [[0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [1 1 0 1 0 1 0 1]]
Bigram Vocabulary: ['enjoy learning' 'in nlp' 'is amazing' 'learning new' 'love nlp'
 'new things' 'nlp is' 'things in']


## **4. Advanced Exercise: Custom Preprocessing Pipeline**

### **Exercise 6: Build a Custom Preprocessing Pipeline**

Combine all the preprocessing steps (tokenization, stop word removal, punctuation removal, stemming/lemmatization) into a single function. 

In [83]:
# your code here
def text_preprocessing_pipeline(text):
    # Step 1: Tokenize the text
    text = text.lower()
    tokens = re.findall(r"[a-zA-Z0-9']+", text)
    return tokens
    # Step 2: Remove stop words
    stop_words = set(stopwords.words("english"))
    filtered_tokens = []
    for t in tokens:
        if t not in stop_words and t not in punct:
            filtered_tokens.append(t)
    # Step 3: Remove punctuation
    punct = set(string.punctuation)
    for t in tokens:
        if t not in punct:
            filtered_tokens.append(t)
    # Step 4: Apply lemmatization
    lemmatized_tokens = []
    for t in filtered_tokens:
        lemmatized_tokens.append(lemmatizer.lemmatize(t))
    return lemmatized_tokens

Apply this function to the following text

In [84]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

print("Preprocessed:", text_preprocessing_pipeline(text))


Preprocessed: ['natural', 'language', 'processing', 'nlp', 'is', 'a', 'fascinating', 'field', 'of', 'study', 'it', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language']


## **5. Evaluation of Preprocessing Techniques**

### **Exercise 7: Compare Preprocessing Techniques**

Compare the results of stemming and lemmatization on the following sentence. Store the results in `stemmed_tokens` and `lemmatized_tokens`

In [101]:
sentence = "The cats are playing with the mice in the garden."
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# your code here
# Step 1: Tokenize and preprocess the sentence and store the result in filtered_tokens
sentence = sentence.lower()
filtered_tokens = word_tokenize(sentence)

# Step 2: Apply stemming
stop_words = set(stopwords.words("english"))
punct = set(string.punctuation)
stemmed_tokens = []
for t in filtered_tokens:
    if t not in stop_words and t not in punct:
        stemmed_tokens.append(t)

# Step 3: Apply lemmatization
lemmatized_tokens = []
for t in filtered_tokens:
    lemmatized_tokens.append(lemmatizer.lemmatize(t))


In [102]:
print("Original Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Tokens: ['the', 'cats', 'are', 'playing', 'with', 'the', 'mice', 'in', 'the', 'garden', '.']
Stemmed Tokens: ['cats', 'playing', 'mice', 'garden']
Lemmatized Tokens: ['the', 'cat', 'are', 'playing', 'with', 'the', 'mouse', 'in', 'the', 'garden', '.']


## **6. Real-World Dataset: Sentiment Analysis**

### **Exercise 8: Preprocess and Analyze Tweets**

In this exercise, you will work with a real-world dataset of tweets. The dataset contains 5000 positive and 5000 negative tweets. Your task is to preprocess the tweets and extract features for sentiment analysis.


In [103]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/kiraredberg/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [95]:
# Load the dataset
from nltk.corpus import twitter_samples

Load the dataset of positive and negative tweets. 

In [96]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Combine them into a single list called ``all_tweets`` and create a corresponding list of labels called `labels`.

In [99]:
# your code here

# Combine the datasets
all_tweets = positive_tweets + negative_tweets
labels = [1] * len(positive_tweets) + [0] * len(negative_tweets)

In [100]:
# Print a sample tweet
print("Sample Tweet:", all_tweets[0])
print("Label:", labels[0])

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Label: 1


### **Exercise 9: Preprocess Tweets**

Apply the custom preprocessing pipeline to the entire dataset of tweets. Store the result in ``preprocessed_tweets``.

In [106]:
# Step 1: Apply the preprocessing pipeline to all tweets
# your code here
preprocessed_tweets = [text_preprocessing_pipeline(tweet) for tweet in all_tweets]


In [107]:
# Print a sample preprocessed tweet
print("Preprocessed Tweets Sample:", preprocessed_tweets[0])

Preprocessed Tweets Sample: ['followfriday', 'france', 'inte', 'pkuchly57', 'milipol', 'paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week']


### **Exercise 10: Feature Extraction on Tweets**

Extract features from the preprocessed tweets using **Bag of Words** and **TF-IDF**. Store the results in ``X_bow`` and ``X_tfidf``, respectively.

In [112]:
# your code here
# Step 1: Create a Bag of Words representation
# Initialize the CountVectorizer
vectorizer = CountVectorizer()
texts = [" ".join(tokens) for tokens in preprocessed_tweets]
# Fit and transform the corpus into a BoW representation
X_bow = vectorizer.fit_transform(texts)
print("Bag of Words:\n", X_bow.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

# Step 2: Create a TF-IDF representation
#Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()
#Fit and transform the corpus into a TF-IDF representation
X_tfidf = vectorizer.fit_transform(texts)
print(X_tfidf)

Bag of Words:
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
Vocabulary: ['00' '000' '001' ... 'zzzterror' 'zzzz' 'zzzzzz']
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 99218 stored elements and shape (10000, 20928)>
  Coords	Values
  (0, 6717)	0.25244905075605234
  (0, 6816)	0.29097075308229653
  (0, 9001)	0.3308628579963431
  (0, 14321)	0.34558584698752354
  (0, 12095)	0.34558584698752354
  (0, 13888)	0.29524761514179126
  (0, 6746)	0.11597233164184101
  (0, 2340)	0.20007413416603326
  (0, 18685)	0.24063252123656428
  (0, 5915)	0.29524761514179126
  (0, 11925)	0.26787713682855213
  (0, 8841)	0.12367016408379665
  (0, 12610)	0.11288345915280563
  (0, 4108)	0.242708020905686
  (0, 18392)	0.13859499549978632
  (0, 19882)	0.19909708799066897
  (1, 10570)	0.3204561255758071
  (1, 8150)	0.188067894592152
  (1, 9330)	0.25743080296513604
  (1, 8413)	0.1678392791771042
  (1, 13368)	0.2971172144394919
  

## **7. Conclusion**

In this lab, you explored a wide range of NLP techniques, from basic text preprocessing to advanced feature extraction and analysis. You also worked with a real-world dataset of tweets and applied your knowledge to preprocess and extract features for sentiment analysis.

