# **Comprehensive NLP Lab: From Preprocessing to Feature Extraction**

In this lab, you will work through a sequence of Natural Language Processing (NLP) techniques, from basic text preprocessing to feature extraction on a real-world dataset of tweets. By the end of this lab, you will be able to:

1. **Tokenize** and preprocess text data.
2. Remove **stop words** and **punctuation**.
3. Apply **stemming** and **lemmatization**.
4. Extract features using **Bag of Words (BoW)** and **TF-IDF**.
5. Generate **n-grams** to capture contextual information.
6. Evaluate the impact of different preprocessing techniques on text data.

Let's dive in!

## **1. Set Up the Environment**


Before we begin, ensure you have the necessary libraries installed. Run the following cell to install them:


In [None]:
!pip install nltk scikit-learn pandas matplotlib


Now, import the required libraries:

In [1]:

import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
# Download NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/aleksamihajlovic/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aleksamihajlovic/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/aleksamihajlovic/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/aleksamihajlovic/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/aleksamihajlovic/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## **2. Text Preprocessing**

### **Exercise 1: Tokenization and Stop Word Removal**

Tokenize the following text:

In [21]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
# your code here
tokens = nltk.word_tokenize(text.lower())
tokens

['natural',
 'language',
 'processing',
 '(',
 'nlp',
 ')',
 'is',
 'a',
 'fascinating',
 'field',
 'of',
 'study',
 '!',
 'it',
 'involves',
 'analyzing',
 'and',
 'understanding',
 'human',
 'language',
 '.']
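
To see roughly what the tokenizer is doing, here is a minimal sketch that uses only the standard library. The regular expression and the helper name `simple_tokenize` are assumptions made for illustration. NLTK's `word_tokenize` applies more elaborate rules (for contractions and abbreviations, for example), so the two will not agree on every input, although they do produce the same tokens for this sentence.

```python
import re

def simple_tokenize(text):
    # Lowercase, then capture either a run of word characters or a
    # single character that is neither a word character nor whitespace
    # (in practice, a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text.lower())

text = ("Natural Language Processing (NLP) is a fascinating field of study! "
        "It involves analyzing and understanding human language.")
tokens = simple_tokenize(text)
print(tokens[:6])
print(len(tokens))
```

Punctuation marks come out as separate tokens here too, which is why a later step is needed to remove them.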

Remove stop words and store the result in a variable called `filtered_tokens`.

In [22]:
stop_words = set(stopwords.words('english'))
# your code here
filtered_tokens = set(tokens) - stop_words

In [23]:
print("Filtered Tokens:", filtered_tokens)

Filtered Tokens: {')', 'processing', 'analyzing', 'natural', 'language', 'study', 'understanding', 'involves', 'nlp', 'fascinating', 'human', '.', '!', 'field', '('}
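
Set difference is concise, but a set keeps neither the order of the tokens nor repeated tokens: `language` occurs twice in the tokenized text and only once in the result above. When order or counts matter, as they do for count-based features, a list comprehension is a common alternative. The sketch below uses a tiny hand-written stop list, `demo_stop_words`, purely as a stand-in so that it runs without NLTK data; in the lab you would keep using `stop_words`.

```python
# Hypothetical stand-in for NLTK's English stop-word list
demo_stop_words = {"is", "a", "of", "it", "and"}

tokens = ["natural", "language", "processing", "(", "nlp", ")", "is", "a",
          "fascinating", "field", "of", "study", "!", "it", "involves",
          "analyzing", "and", "understanding", "human", "language", "."]

# Keep every token that is not a stop word, in its original position
filtered_in_order = [t for t in tokens if t not in demo_stop_words]

print(filtered_in_order)
print(filtered_in_order.count("language"))  # both occurrences survive
```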


### **Exercise 2: Stemming and Lemmatization**

Apply stemming and lemmatization to the `filtered_tokens`. Compare the results.

In [24]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

Apply stemming and store the result in `stemmed_tokens`.

In [25]:
stemmed_tokens = set()
for token in filtered_tokens:
    stemmed_token = stemmer.stem(token)
    stemmed_tokens.add(stemmed_token)



In [26]:
print("Stemmed Tokens:", stemmed_tokens)


Stemmed Tokens: {')', 'analyz', 'involv', 'process', 'fascin', 'languag', 'understand', 'natur', '!', 'field', 'nlp', 'human', '.', 'studi', '('}


Apply lemmatization to the original `filtered_tokens` (not to the stems, so that the two results can be compared) and store the result in `lemmatized_tokens`.

In [27]:
lemmatized_tokens = set()
for token in filtered_tokens:
    # Without a POS argument, WordNetLemmatizer treats every token as a noun
    lemmatized_token = lemmatizer.lemmatize(token)
    lemmatized_tokens.add(lemmatized_token)

In [28]:
print("Lemmatized Tokens:", lemmatized_tokens)

Lemmatized Tokens: {')', 'analyzing', 'involves', 'processing', 'language', 'understanding', 'natural', '!', 'field', 'nlp', 'human', '.', 'fascinating', 'study', '('}


Finally, remove the punctuation tokens.

In [29]:
punct = set(string.punctuation)

lemmatized_tokens = lemmatized_tokens - punct
lemmatized_tokens

{'analyzing',
 'fascinating',
 'field',
 'human',
 'involves',
 'language',
 'natural',
 'nlp',
 'processing',
 'study',
 'understanding'}

## **3. Feature Extraction**

### **Exercise 3: Bag of Words (BoW)**

Use the `CountVectorizer` from `scikit-learn` to create a Bag of Words representation of the following corpus:

In [14]:
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

In [16]:

# your code here
# Step 1: Initialize the CountVectorizer (vocabulary capped at 1000 terms)
vectorizer = CountVectorizer(max_features=1000)

# Step 2: Fit and transform the corpus into a BoW representation
X = vectorizer.fit_transform(corpus)


In [17]:
print("Bag of Words:\n", X.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

Bag of Words:
 [[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
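
The matrix above can be reproduced by hand, which makes it clearer what `CountVectorizer` is counting. The sketch below is a standard-library approximation, not scikit-learn's implementation. It lowercases each document, keeps tokens of two or more word characters (mirroring the vectorizer's default `token_pattern`, which is why `I` is missing from the vocabulary), and counts them against a sorted vocabulary. The names `tokenize`, `vocabulary`, and `bow` are illustrative.

```python
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

def tokenize(doc):
    # Lowercased tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in corpus]
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one column per vocabulary term
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow:
    print(row)
```

Each column corresponds to one vocabulary term in alphabetical order, so the `nlp` column is the only one that is non-zero in all three rows.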


### **Exercise 4: TF-IDF**

Use the `TfidfVectorizer` from `scikit-learn` to create a TF-IDF representation of the same corpus. Store the result in `X_tfidf`.

In [34]:
# your code here
# Step 1: Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df = 1)

# Step 2: Fit and transform the corpus into a TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(corpus)


In [35]:
print("TF-IDF:\n", X_tfidf.toarray())
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

TF-IDF:
 [[0.         0.         0.         0.         0.         0.861037
  0.         0.50854232 0.        ]
 [0.65249088 0.         0.         0.65249088 0.         0.
  0.         0.38537163 0.        ]
 [0.         0.43238509 0.43238509 0.         0.43238509 0.
  0.43238509 0.2553736  0.43238509]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
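
The numbers above follow from scikit-learn's default settings: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplied by the raw term count, with each row then scaled to unit Euclidean length. The standard-library sketch below recomputes the matrix under those assumptions. It is meant to illustrate the arithmetic rather than replace `TfidfVectorizer`, and the variable names are illustrative.

```python
import math
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

# Same tokenization as the vectorizer's default: 2+ word characters
docs = [re.findall(r"\b\w\w+\b", d.lower()) for d in corpus]
vocabulary = sorted({t for d in docs for t in d})
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = {t: sum(t in d for d in docs) for t in vocabulary}
# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [v / norm for v in raw]

tfidf = [tfidf_row(d) for d in docs]
row0 = dict(zip(vocabulary, tfidf[0]))
print(round(row0["love"], 6), round(row0["nlp"], 6))
```

Because `nlp` appears in every document, its IDF is the minimum value of 1, so it receives a lower weight than `love` in the first row even though both occur once.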


### **Exercise 5: N-grams**

Generate bigrams (2-grams) from the corpus using `CountVectorizer`. Store the result in `X_bigram`.

In [37]:
# your code here
# Step 1: Initialize the CountVectorizer with ngram_range=(2, 2)
bigram_vectorizer = CountVectorizer(ngram_range=(2,2))
# Step 2: Fit and transform the corpus into a bigram representation
X_bigram = bigram_vectorizer.fit_transform(corpus)



In [38]:
print("Bigrams:\n", X_bigram.toarray())
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

Bigrams:
 [[0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [1 1 0 1 0 1 0 1]]
Bigram Vocabulary: ['enjoy learning' 'in nlp' 'is amazing' 'learning new' 'love nlp'
 'new things' 'nlp is' 'things in']
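
A bigram is a pair of adjacent tokens, so the vocabulary above can be reproduced by pairing each token with its successor. Here is a minimal standard-library sketch, assuming the same two-or-more-character tokenization that `CountVectorizer` uses by default (the helper name `ngrams` is illustrative):

```python
import re

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = [re.findall(r"\b\w\w+\b", d.lower()) for d in corpus]
bigram_vocabulary = sorted({g for d in docs for g in ngrams(d, 2)})
print(bigram_vocabulary)
```

The single-character token `I` is dropped before the pairs are formed, which is why neither `i love` nor `i enjoy` appears in the vocabulary.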


## **4. Advanced Exercise: Custom Preprocessing Pipeline**

### **Exercise 6: Build a Custom Preprocessing Pipeline**

Combine all the preprocessing steps (tokenization, stop word removal, punctuation removal, stemming/lemmatization) into a single function. 

In [None]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def text_preprocessing_pipeline(text):
    """Return the set of lemmatized, lowercased tokens in `text`, without stop words or punctuation."""

    # Step 1: Tokenize the lowercased text
    tokens = nltk.word_tokenize(text.lower())

    # Step 2: Remove stop words (set difference also drops duplicates and word order)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = set(tokens) - stop_words

    # Step 3: Remove single-character punctuation tokens
    punct = set(string.punctuation)
    clean_tokens = filtered_tokens - punct

    # Step 4: Apply lemmatization (tokens are treated as nouns by default)
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = {lemmatizer.lemmatize(token) for token in clean_tokens}

    return lemmatized_tokens


Apply this function to the following text:

In [41]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

processed_text = text_preprocessing_pipeline(text)


In [42]:
print("Processed Text:", processed_text)

Processed Text: {'processing', 'analyzing', 'natural', 'language', 'study', 'understanding', 'involves', 'nlp', 'fascinating', 'human', 'field'}


## **5. Evaluation of Preprocessing Techniques**

### **Exercise 7: Compare Preprocessing Techniques**

Compare the results of stemming and lemmatization on the following sentence. Store the results in `stemmed_tokens` and `lemmatized_tokens`.

In [49]:
sentence = "The cats are playing with the mice in the garden."

# Step 1: Tokenize and preprocess the sentence and store the result in filtered_tokens
tokens = nltk.word_tokenize(sentence.lower())
filtered_tokens = set(tokens) - set(stopwords.words('english'))
filtered_tokens = filtered_tokens - set(string.punctuation)

# Step 2: Stem each filtered token
stemmed_tokens = set()
stemmer = PorterStemmer()
for token in filtered_tokens:
    stemmed_token = stemmer.stem(token)
    stemmed_tokens.add(stemmed_token)

# Step 3: Lemmatize each filtered token (treated as a noun by default)
lemmatized_tokens = set()
lemmatizer = WordNetLemmatizer()
for token in filtered_tokens:
    lemmatized_token = lemmatizer.lemmatize(token)
    lemmatized_tokens.add(lemmatized_token)

In [47]:
print("Original Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Tokens: {'mice', 'playing', 'cats', 'garden'}
Stemmed Tokens: {'mice', 'garden', 'cat', 'play'}
Lemmatized Tokens: {'playing', 'mouse', 'garden', 'cat'}
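
The outputs differ because the two tools work differently. A stemmer strips suffixes by rule, so it cannot relate `mice` to `mouse`. A lemmatizer looks words up in a dictionary and depends on the part of speech it is given. `WordNetLemmatizer.lemmatize` assumes a noun unless a `pos` argument is passed, which is why `playing` is left unchanged above. The toy sketch below imitates both behaviours with a two-rule suffix stripper and a three-entry lookup table. Both are invented for illustration and are far simpler than the Porter algorithm or WordNet.

```python
# Toy suffix rules (hypothetical; the real Porter stemmer has many more)
SUFFIX_RULES = [("ing", ""), ("s", "")]

# Toy lemma dictionary keyed by (word, part of speech); also hypothetical
LEMMAS = {
    ("cats", "n"): "cat",
    ("mice", "n"): "mouse",
    ("playing", "v"): "play",
}

def toy_stem(word):
    # Apply the first matching suffix rule, if any
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when there is no dictionary entry
    return LEMMAS.get((word, pos), word)

words = ["cats", "playing", "mice", "garden"]
print("Stems:", [toy_stem(w) for w in words])
print("Lemmas (noun):", [toy_lemmatize(w) for w in words])
print("Lemma of 'playing' as a verb:", toy_lemmatize("playing", pos="v"))
```

In the lab itself, calling `lemmatizer.lemmatize('playing', pos='v')` has the analogous effect and returns `play`.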


## **6. Real-World Dataset: Sentiment Analysis**

### **Exercise 8: Preprocess and Analyze Tweets**

In this exercise, you will work with a real-world dataset of tweets. The dataset contains 5000 positive and 5000 negative tweets. Your task is to preprocess the tweets and extract features for sentiment analysis.


In [50]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/aleksamihajlovic/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [51]:
# Load the dataset
from nltk.corpus import twitter_samples

Load the dataset of positive and negative tweets. 

In [52]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Combine them into a single list called ``all_tweets`` and create a corresponding list of labels called `labels`.

In [None]:
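# your code here
# Combine the datasets: positive tweets first, then negative tweets
all_tweets = positive_tweets + negative_tweets
# Label positive tweets 1 and negative tweets 0, in the same order
labels = [1] * len(positive_tweets) + [0] * len(negative_tweets)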


In [57]:
# Print a sample tweet
print("Sample Tweet:", all_tweets[0])
print("Label:", labels[0])

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Label: 1
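
Because the labels are assigned by position, it is worth checking that the two lists stay aligned before moving on. Here is a small sketch with placeholder data (the short lists below are made up and stand in for the real tweet lists). The lists are deliberately of unequal length, since that is the case in which using the wrong length would misalign the labels.

```python
# Placeholder data standing in for the NLTK tweet lists
positive_tweets = ["great day :)", "love this :)"]
negative_tweets = ["so tired :(", "missed my bus :(", "ugh :("]

all_tweets = positive_tweets + negative_tweets
labels = [1] * len(positive_tweets) + [0] * len(negative_tweets)

# Every tweet must have exactly one label
assert len(all_tweets) == len(labels)

# The first negative tweet sits right after the last positive one
boundary = len(positive_tweets)
print(all_tweets[boundary - 1], labels[boundary - 1])
print(all_tweets[boundary], labels[boundary])
```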


### **Exercise 9: Preprocess Tweets**

Apply the custom preprocessing pipeline to the entire dataset of tweets. Store the result in ``preprocessed_tweets``.

In [64]:
# Step 1: Apply the preprocessing pipeline to all tweets
# your code here
preprocessed_tweets = [" ".join(text_preprocessing_pipeline(text)) for text in all_tweets]


In [65]:
# Print a sample preprocessed tweet
print("Preprocessed Tweets Sample:", preprocessed_tweets[0])

Preprocessed Tweets Sample: milipol_paris week engaged member top community followfriday pkuchly57 france_inte
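
The sample shows that user handles and hashtag words survive the pipeline as ordinary tokens (`pkuchly57`, `france_inte`). If those are not wanted as features, one option is to strip them with regular expressions before tokenizing. The patterns and the helper name `clean_tweet` below are assumptions chosen for illustration. They are not part of the lab's pipeline, and real tweets may need more careful handling.

```python
import re

def clean_tweet(tweet):
    tweet = re.sub(r"https?://\S+", " ", tweet)   # drop URLs
    tweet = re.sub(r"@\w+", " ", tweet)           # drop @mentions
    tweet = tweet.replace("#", "")                # keep hashtag text, drop '#'
    return re.sub(r"\s+", " ", tweet).strip()     # collapse whitespace

sample = ("#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being "
          "top engaged members in my community this week :)")
print(clean_tweet(sample))
```

The cleaned string could then be passed to `text_preprocessing_pipeline` in place of the raw tweet.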


### **Exercise 10: Feature Extraction on Tweets**

Extract features from the preprocessed tweets using **Bag of Words** and **TF-IDF**. Store the results in ``X_bow`` and ``X_tfidf``, respectively.

In [None]:

# Step 1: Create a Bag of Words representation (top 1000 terms by corpus frequency)
vectorizer = CountVectorizer(max_features=1000)
X_bow = vectorizer.fit_transform(preprocessed_tweets)
print("Bag of Words:\n", X_bow.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())


# Step 2: Create a TF-IDF representation of the same tweets
tfidf_vectorizer = TfidfVectorizer(min_df=1)
X_tfidf = tfidf_vectorizer.fit_transform(preprocessed_tweets)
#print("TF-IDF:\n", X_tfidf.toarray())
#print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())


Bag of Words:
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
Vocabulary: ['000' '10' '100' '11' '12' '13' '15' '20' '2015' '24' '2nd' '30' '40'
 '50' '5sos' '969horan696' 'able' 'absolutely' 'acc' 'account' 'act'
 'actually' 'add' 'added' 'address' 'adeccowaytowork' 'af' 'afternoon'
 'age' 'ago' 'agree' 'ah' 'ahh' 'ai' 'aint' 'air' 'airport' 'al' 'album'
 'all' 'allah' 'almost' 'alone' 'along' 'already' 'alright' 'also'
 'always' 'amazing' 'amber' 'amp' 'ang' 'annoying' 'another' 'answer'
 'anymore' 'anyone' 'anything' 'anyway' 'app' 'apparently' 'appreciate'
 'appreciated' 'aqui' 'around' 'arrived' 'art' 'article' 'as' 'ask'
 'asked' 'asking' 'asleep' 'ate' 'august' 'australia' 'available' 'aw'
 'awake' 'away' 'awesome' 'awful' 'aww' 'awww' 'awwww' 'babe' 'baby'
 'back' 'bad' 'badly' 'bae' 'bag' 'ball' 'bam' 'bank' 'barsandmelody' 'bb'
 'bby' 'bc' 'bday' 'beach' 'beat' 'beautiful' 'become' 'bed' 'beli'
 'believe

## **7. Conclusion**

In this lab, you tokenized text, removed stop words and punctuation, compared stemming with lemmatization, and built Bag of Words, TF-IDF, and n-gram features. You also applied the same preprocessing pipeline to a real-world dataset of tweets to extract features for sentiment analysis.

