# **Comprehensive NLP Lab: From Preprocessing to Feature Extraction**

In this lab, you will explore a wide range of Natural Language Processing (NLP) techniques, from basic text preprocessing to advanced feature extraction and analysis. By the end of this lab, you will be able to:

1. **Tokenize** and preprocess text data.
2. Remove **stop words** and **punctuation**.
3. Apply **stemming** and **lemmatization**.
4. Extract features using **Bag of Words (BoW)** and **TF-IDF**.
5. Generate **n-grams** to capture contextual information.
6. Evaluate the impact of different preprocessing techniques on text data.

Let's dive in!

## **1. Setup the Environment**


Before we begin, ensure you have the necessary libraries installed. Run the following cell to install them:


In [2]:
!pip install nltk scikit-learn pandas matplotlib





Now, import the required libraries:

In [3]:

import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [4]:
# Download NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lain\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lain\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Lain\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Lain\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Lain\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!

True

## **2. Text Preprocessing**

### **Exercise 1: Tokenization and Stop Word Removal**

Tokenize the following text and store the result in `tokens`.

In [5]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
# your code here
tokens = nltk.word_tokenize(text)
tokens

['Natural',
 'Language',
 'Processing',
 '(',
 'NLP',
 ')',
 'is',
 'a',
 'fascinating',
 'field',
 'of',
 'study',
 '!',
 'It',
 'involves',
 'analyzing',
 'and',
 'understanding',
 'human',
 'language',
 '.']

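Remove stop words and store the result in a variable called `filtered_tokens`. The comparison is done on the lowercased token so that capitalized stop words such as `It` are removed too.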

In [6]:
stop_words = set(stopwords.words('english'))
# your code here
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
filtered_tokens


['Natural',
 'Language',
 'Processing',
 '(',
 'NLP',
 ')',
 'fascinating',
 'field',
 'study',
 '!',
 'involves',
 'analyzing',
 'understanding',
 'human',
 'language',
 '.']

In [7]:
print("Filtered Tokens:", filtered_tokens)

Filtered Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']
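
The filtered list above still contains punctuation tokens such as `(` and `!`, and objective 2 of this lab also asks for punctuation removal. The snippet below is a minimal sketch of one way to do it, using only `string.punctuation` from the standard library. The token list is copied from the output above, and the name `no_punct_tokens` is a hypothetical choice for illustration.

```python
import string

# Tokens copied from the stop word removal output above.
filtered_tokens = ['Natural', 'Language', 'Processing', '(', 'NLP', ')',
                   'fascinating', 'field', 'study', '!', 'involves',
                   'analyzing', 'understanding', 'human', 'language', '.']

# Keep a token only if at least one of its characters is not ASCII punctuation.
punctuation = set(string.punctuation)
no_punct_tokens = [t for t in filtered_tokens
                   if any(ch not in punctuation for ch in t)]

print(no_punct_tokens)
```

This variant drops tokens made up entirely of punctuation and leaves mixed tokens (for example a hyphenated word) untouched, which may or may not be what a given task needs.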


### **Exercise 2: Stemming and Lemmatization**

Apply stemming and lemmatization to the `filtered_tokens`. Compare the results.

In [8]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

Apply stemming and store the result in `stemmed_tokens`.

In [9]:
# your code here
# The Porter stemmer also lowercases each token
stemmed_tokens = [stemmer.stem(w) for w in filtered_tokens]
stemmed_tokens

['natur',
 'languag',
 'process',
 '(',
 'nlp',
 ')',
 'fascin',
 'field',
 'studi',
 '!',
 'involv',
 'analyz',
 'understand',
 'human',
 'languag',
 '.']

In [10]:
print("Stemmed Tokens:", stemmed_tokens)

Stemmed Tokens: ['natur', 'languag', 'process', '(', 'nlp', ')', 'fascin', 'field', 'studi', '!', 'involv', 'analyz', 'understand', 'human', 'languag', '.']


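Apply lemmatization and store the result in `lemmatized_tokens`. Without a part-of-speech argument, `WordNetLemmatizer` treats every token as a noun, so verb forms such as `involves` and `analyzing` are left unchanged.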

In [11]:
# your code here
lemmatized_tokens = [lemmatizer.lemmatize(w) for w in filtered_tokens]

In [12]:
print("Lemmatized Tokens:", lemmatized_tokens)

Lemmatized Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']


## **3. Feature Extraction**

### **Exercise 3: Bag of Words (BoW)**

Use the `CountVectorizer` from `scikit-learn` to create a Bag of Words representation of the following corpus.

In [13]:
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
# your code here
# Step 1: Initialize the CountVectorizer
vectorizer = CountVectorizer()
# Step 2: Fit and transform the corpus into a BoW representation
X = vectorizer.fit_transform(corpus)

In [15]:
print("Bag of Words:\n", X.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

Bag of Words:
 [[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
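
To make the matrix above less of a black box, here is a dependency-free sketch of the same counting. It assumes `CountVectorizer`'s defaults (lowercasing, and a token pattern that keeps only runs of two or more word characters, which is why `I` does not appear in the vocabulary). The names `tokenize` and `bow_matrix` are hypothetical.

```python
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in corpus]
# Columns are the sorted set of all tokens seen in the corpus.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one column per vocabulary term.
bow_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row counts how often every vocabulary term occurs in one document; word order is discarded entirely.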


### **Exercise 4: TF-IDF**

Use the `TfidfVectorizer` from `scikit-learn` to create a TF-IDF representation of the same corpus. Store the result in `X_tfidf`.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
# your code here
# Step 1: Initialize the TfidfVectorizer (L2 normalization is the default)
tfidf_vectorizer = TfidfVectorizer()
# Step 2: Fit and transform the corpus into a TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(corpus)


In [17]:
print("TF-IDF:\n", X_tfidf.toarray())
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

TF-IDF:
 [[0.         0.         0.         0.         0.         0.861037
  0.         0.50854232 0.        ]
 [0.65249088 0.         0.         0.65249088 0.         0.
  0.         0.38537163 0.        ]
 [0.         0.43238509 0.43238509 0.         0.43238509 0.
  0.43238509 0.2553736  0.43238509]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
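
The values above can be reproduced by hand, which helps explain why `nlp` gets the lowest weight in every row. The sketch below assumes scikit-learn's default settings: raw term counts, smoothed inverse document frequency `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and L2 normalization of each row. The helper and variable names are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

def tokenize(doc):
    # Same default tokenization as CountVectorizer: lowercase, 2+ word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in corpus]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(corpus)

# Document frequency: number of documents that contain each term.
df = {term: sum(term in doc for doc in tokenized) for term in vocabulary}
# Smoothed IDF (the assumed scikit-learn default, smooth_idf=True).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

tfidf_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    row = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in row))  # L2 norm of the row
    tfidf_matrix.append([v / norm for v in row])

for row in tfidf_matrix:
    print([round(v, 4) for v in row])
```

Because `nlp` occurs in all three documents, its IDF is exactly 1, the minimum possible under this formula, while terms that occur in a single document get `ln(2) + 1`.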


### **Exercise 5: N-grams**

Generate **bigrams (2-grams)** from the corpus using `CountVectorizer`. Store the result in `X_bigram`.

In [18]:
# your code here
# Step 1: Initialize the CountVectorizer with ngram_range=(2, 2)
bigram_vectorizer = CountVectorizer(ngram_range=(2,2))
# Step 2: Fit and transform the corpus into a bigram representation
X_bigram = bigram_vectorizer.fit_transform(corpus)



In [19]:
print("Bigrams:\n", X_bigram.toarray())
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

Bigrams:
 [[0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [1 1 0 1 0 1 0 1]]
Bigram Vocabulary: ['enjoy learning' 'in nlp' 'is amazing' 'learning new' 'love nlp'
 'new things' 'nlp is' 'things in']
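
For intuition, the bigram vocabulary above can be rebuilt without scikit-learn by pairing each token with its successor. This is a minimal sketch under the same tokenization assumptions as `CountVectorizer`'s defaults; the function names `tokenize` and `bigrams` are hypothetical.

```python
import re

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

def bigrams(tokens):
    # Pair each token with the token that follows it.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

doc_bigrams = [bigrams(tokenize(doc)) for doc in corpus]
bigram_vocabulary = sorted({bg for doc in doc_bigrams for bg in doc})
# Count how often each bigram occurs in each document.
bigram_matrix = [[doc.count(bg) for bg in bigram_vocabulary] for doc in doc_bigrams]

print(bigram_vocabulary)
for row in bigram_matrix:
    print(row)
```

Note that single-character tokens are dropped before the pairs are formed, so the first document yields `love nlp` rather than `i love`.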


## **4. Advanced Exercise: Custom Preprocessing Pipeline**

### **Exercise 6: Build a Custom Preprocessing Pipeline**

Combine all the preprocessing steps (tokenization, stop word removal, punctuation removal, stemming/lemmatization) into a single function. 

In [20]:
# your code here

def text_preprocessing_pipeline(text, technique='lemmatize'):
    """Tokenize, drop stop words and punctuation, then stem or lemmatize."""
    # Step 1: Tokenize the text
    tokenized_words = word_tokenize(text)
    # Step 2: Remove stop words (case-insensitive comparison)
    filtered_words = [w for w in tokenized_words if w.lower() not in stop_words]
    # Step 3: Strip punctuation characters and drop tokens that become empty
    stripped = (re.sub(r'[^\w\s]', '', token) for token in filtered_words)
    cleaned_tokens = [token for token in stripped if token]
    # Step 4: Normalize with the requested technique, using the module-level
    # stemmer and lemmatizer created in Exercise 2
    if technique == 'stem':
        return [stemmer.stem(w) for w in cleaned_tokens]
    # Any other value falls back to lemmatization (tokens are treated as nouns)
    return [lemmatizer.lemmatize(w) for w in cleaned_tokens]



Apply this function to the following text.

In [21]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

# your code here
processed_text = text_preprocessing_pipeline(text)

In [22]:
print("Processed Text:", processed_text)

Processed Text: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


## **5. Evaluation of Preprocessing Techniques**

### **Exercise 7: Compare Preprocessing Techniques**

Compare the results of stemming and lemmatization on the following sentence. Store the results in `stemmed_tokens` and `lemmatized_tokens`

In [23]:
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
sentence = "The cats are playing with the mice in the garden."
# your code here
# Step 1: Tokenize and preprocess the sentence and store the result in filtered_tokens
tokens = word_tokenize(sentence)
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
# Strip punctuation characters and drop tokens that become empty
stripped = (re.sub(r'[^\w\s]', '', token) for token in filtered_tokens)
cleaned_tokens = [token for token in stripped if token]

# Step 2: Apply stemming
stemmed_tokens = [stemmer.stem(w) for w in cleaned_tokens]

# Step 3: Apply lemmatization
lemmatized_tokens = [lemmatizer.lemmatize(w) for w in cleaned_tokens]



In [24]:
print("Original Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Tokens: ['cats', 'playing', 'mice', 'garden', '.']
Stemmed Tokens: ['cat', 'play', 'mice', 'garden']
Lemmatized Tokens: ['cat', 'playing', 'mouse', 'garden']
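
The two techniques fail in different places here: the stemmer cannot relate `mice` to `mouse`, while the lemmatizer leaves `playing` untouched because `WordNetLemmatizer.lemmatize` assumes a noun unless a `pos` argument is supplied. A common workaround is to translate the Penn Treebank tags returned by `pos_tag` into WordNet's single-letter categories. The helper below is a sketch of that mapping only; the name `penn_to_wordnet` is hypothetical and the tags in the example are written by hand, not produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

# Hand-written example tags (an assumption, not real tagger output).
tagged = [("cats", "NNS"), ("playing", "VBG"), ("mice", "NNS"), ("garden", "NN")]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]

print(wordnet_pos)
```

With NLTK available, each pair could then be passed on as `lemmatizer.lemmatize(word, pos=pos)`; given `pos='v'`, `playing` should reduce to `play`.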


## **6. Real-World Dataset: Sentiment Analysis**

### **Exercise 8: Preprocess and Analyze Tweets**

In this exercise, you will work with a real-world dataset of tweets. The dataset contains 5000 positive and 5000 negative tweets. Your task is to preprocess the tweets and extract features for sentiment analysis.


In [25]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Lain\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [26]:
# Load the dataset
from nltk.corpus import twitter_samples

Load the dataset of positive and negative tweets. 

In [27]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Combine them into a single list called `all_tweets` and create a corresponding list of labels called `labels` (1 for positive, 0 for negative).

In [28]:
# your code here

# Combine the datasets: positive tweets first, then negative tweets
all_tweets = positive_tweets + negative_tweets
# Labels follow the same order: 1 = positive, 0 = negative
labels = [1] * len(positive_tweets) + [0] * len(negative_tweets)



In [43]:
# Print a sample tweet
print("Sample Tweet:", all_tweets[15])
print("Label:", labels[15])

Sample Tweet: Laying out a greetings card range for print today - love my job :-)
Label: 1


### **Exercise 9: Preprocess Tweets**

Apply the custom preprocessing pipeline to the entire dataset of tweets. Store the result in `preprocessed_tweets`.

In [30]:
# Step 1: Apply the preprocessing pipeline (with stemming) to all tweets
# your code here
preprocessed_tweets = [text_preprocessing_pipeline(tweet, 'stem') for tweet in all_tweets]

In [31]:
# Print a sample preprocessed tweet
print("Preprocessed Tweets Sample:", preprocessed_tweets[0])

Preprocessed Tweets Sample: ['followfriday', 'france_int', 'pkuchly57', 'milipol_pari', 'top', 'engag', 'member', 'commun', 'week']


### **Exercise 10: Feature Extraction on Tweets**

Extract features from the preprocessed tweets using **Bag of Words** and **TF-IDF**. Store the results in `X_bow` and `X_tfidf`, respectively.

In [None]:
# your code here
# Step 1: Join each token list back into a string, since the vectorizers expect raw text
preprocessed_tweets_str = [" ".join(tweet) for tweet in preprocessed_tweets]
# Step 2: Create a Bag of Words representation
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(preprocessed_tweets_str)

# Step 3: Create a TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(preprocessed_tweets_str)

print(X_bow.shape)
print(X_tfidf.shape)
# X_bow is the word-frequency matrix
print(X_bow)
# X_tfidf weights words by their relevance
print(X_tfidf)

(10000, 18831)
(10000, 18831)
  (0, 5491)	1
  (0, 5578)	1
  (0, 11904)	1
  (0, 10000)	1
  (0, 17081)	1
  (0, 4873)	1
  (0, 9852)	1
  (0, 3409)	1
  (0, 17992)	1
  (1, 8704)	1
  (1, 6688)	1
  (1, 7652)	1
  (1, 11128)	1
  (1, 11939)	1
  (1, 2693)	1
  (1, 3476)	1
  (1, 2910)	1
  (1, 14)	1
  (1, 590)	1
  (1, 1443)	1
  (1, 9518)	1
  (1, 16631)	1
  (2, 4159)	1
  (2, 9026)	1
  (2, 8754)	1
  :	:
  (9994, 7157)	1
  (9994, 17856)	1
  (9995, 2962)	1
  (9995, 17903)	1
  (9995, 10504)	1
  (9995, 1532)	1
  (9995, 17562)	1
  (9996, 12283)	1
  (9996, 2503)	1
  (9996, 5512)	1
  (9997, 1634)	1
  (9997, 11834)	1
  (9997, 7625)	1
  (9998, 6926)	1
  (9998, 3502)	1
  (9998, 10296)	1
  (9998, 806)	1
  (9998, 9649)	1
  (9998, 15146)	1
  (9999, 17992)	1
  (9999, 14545)	1
  (9999, 5078)	1
  (9999, 4637)	1
  (9999, 6957)	1
  (9999, 10069)	1
  (0, 5491)	0.29608654432170023
  (0, 5578)	0.4053226537971863
  (0, 11904)	0.4053226537971863
  (0, 10000)	0.4053226537971863
  (0, 17081)	0.2799248334712398
  (0, 4873)	0.34

Next, train a logistic regression classifier on the TF-IDF features to predict the sentiment label of each tweet.

In [34]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X_tfidf, labels, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)




In [40]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)
print("Accuracy : ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy :  0.7615
              precision    recall  f1-score   support

           0       0.74      0.79      0.77       988
           1       0.78      0.73      0.76      1012

    accuracy                           0.76      2000
   macro avg       0.76      0.76      0.76      2000
weighted avg       0.76      0.76      0.76      2000
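
To read the report above, it helps to see how precision, recall, and F1 are derived from raw predictions. The following is a small sketch on toy labels that are invented for illustration only (they are not taken from the tweet data); the function name `binary_report` is hypothetical.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

precision, recall, f1 = binary_report(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("precision:", precision, "recall:", recall, "f1:", f1, "accuracy:", accuracy)
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean. The `support` column in the report is simply the number of true examples of each class in the test set.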



In [39]:
print(X_test[0])
y_pred[0]

  (0, 9205)	0.2552972907718625
  (0, 5274)	0.3103625496429945
  (0, 16826)	0.3330114621639493
  (0, 14717)	0.6031590781755156
  (0, 4822)	0.6031590781755156


0

In [46]:
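# Row i of X_tfidf corresponds to all_tweets[i]. The rows of X_test were shuffled
# by train_test_split, so X_test[2] would not match all_tweets[2].
# Note that this tweet may have been part of the training split.
predicted_label = model.predict(X_tfidf[2])
print(predicted_label)
print("Tweet:", all_tweets[2])
print("Predicted Sentiment:", "Positive" if predicted_label[0] == 1 else "Negative")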

[1]
Tweet: @DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!
Predicted Sentiment: Positive


## **7. Conclusion**

In this lab, you explored a wide range of NLP techniques, from basic text preprocessing to advanced feature extraction and analysis. You also worked with a real-world dataset of tweets and applied your knowledge to preprocess and extract features for sentiment analysis.

