---


### **Objective:**
In this exercise, students will:
- Implement text preprocessing techniques like tokenization, lemmatization, and stemming.
- Use Count Vectorizer to convert text data into numerical features.
- Build a text classification model using Naive Bayes.
- Compare the model’s performance on datasets with and without preprocessing.

---

### **Tasks:**

1. **Data Loading and Exploration:**
    - Load the datasets.
    - Explore the data to understand its structure (e.g., features, labels).

2. **Preprocessing:**
    - **With Preprocessing:**
        - Convert text to lowercase.
        - Tokenize text.
        - Remove stopwords.
        - Apply stemming (Porter Stemmer) and lemmatization (WordNet Lemmatizer).
    - **Without Preprocessing:**
        - Lowercase, tokenize, and remove stopwords as above, but skip stemming and lemmatization to observe their effect.

3. **Feature Engineering:**
    - Use **Count Vectorizer** to extract features from the preprocessed text.

4. **Model Training and Classification:**
    - Implement a Naive Bayes classifier.
    - Train the model on the preprocessed features and classify the text into the respective categories.

5. **Model Evaluation:**
    - Evaluate the model using accuracy, precision, recall, and F1-score.
    - Compare the performance with and without preprocessing.

---

### **Function Signatures:**

**1. Data Loading:**

```python
def load_data(file_path: str) -> pd.DataFrame:
    """
    Load dataset from the given file path and return a DataFrame.
    """
    pass
```
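One possible way to fill in `load_data` is sketched below. It assumes the dataset is a CSV file with a header row, as the IMDb file loaded later in this notebook is. Dropping duplicate rows inside the loader is an illustrative choice, not something the signature requires.

```python
import pandas as pd


def load_data(file_path: str) -> pd.DataFrame:
    """Load a CSV dataset from file_path and return it as a DataFrame."""
    df = pd.read_csv(file_path)
    # Assumption: exact duplicate rows carry no extra information,
    # so they are dropped here and the index is renumbered.
    return df.drop_duplicates().reset_index(drop=True)
```

Calling `load_data('IMDB Dataset.csv').head()` would then show the first rows of the deduplicated data.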

**2. Preprocessing (With and Without Lemmatization and Stemming):**

```python
def preprocess_text_with_stemming(text: str) -> str:
    """
    Preprocess the text by:
    - Lowercasing
    - Tokenization
    - Stopword removal
    - Stemming (Porter Stemmer)
    
    Return the cleaned text.
    """
    pass

def preprocess_text_without_stemming(text: str) -> str:
    """
    Preprocess the text by:
    - Lowercasing
    - Tokenization
    - Stopword removal
    
    Skip stemming and lemmatization.
    
    Return the cleaned text.
    """
    pass
```
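The two stubs above could be implemented roughly as follows. This is a sketch under stated assumptions: a simple regex tokenizer and scikit-learn's built-in `ENGLISH_STOP_WORDS` stand in for NLTK's `word_tokenize` and `stopwords.words('english')` so that the example runs without downloading NLTK corpora. The helper `_clean_tokens` is a hypothetical name, and lemmatization is left out because it needs the WordNet download. Output may therefore differ slightly from an NLTK-only version.

```python
import re
from typing import List

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Built once at import time so they are not recreated for every review.
_STEMMER = PorterStemmer()
_TOKEN_RE = re.compile(r"[a-z0-9]+")  # assumption: alphanumeric runs are tokens


def _clean_tokens(text: str) -> List[str]:
    """Lowercase, tokenize, and drop stop words (shared by both variants)."""
    tokens = _TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]


def preprocess_text_with_stemming(text: str) -> str:
    """Clean the text and reduce every remaining token to its Porter stem."""
    return " ".join(_STEMMER.stem(t) for t in _clean_tokens(text))


def preprocess_text_without_stemming(text: str) -> str:
    """Clean the text but keep the surface form of every remaining token."""
    return " ".join(_clean_tokens(text))
```

Keeping the shared steps in one helper means the two variants differ only in the stemming step, which is the effect the exercise asks you to measure.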

**3. Feature Engineering using Count Vectorizer:**

```python
def extract_features(data: pd.Series) -> csr_matrix:
    """
    Use Count Vectorizer to convert the text data into numerical features.
    
    Return the feature matrix (sparse matrix).
    """
    pass
```
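A minimal sketch of `extract_features` that matches the signature above is shown next. The toy `pd.Series` is made up for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer


def extract_features(data: pd.Series) -> csr_matrix:
    """Return a sparse document-term count matrix for the given texts."""
    vectorizer = CountVectorizer()
    # fit_transform learns the vocabulary and counts tokens in one pass.
    return vectorizer.fit_transform(data)


# Toy input (hypothetical): three short documents, four distinct words.
toy = pd.Series(["good movie", "bad movie", "good good plot"])
features = extract_features(toy)
print(features.shape)
```

Because this version discards the fitted vectorizer, it cannot later transform unseen text with the same vocabulary. If you need that, returning the vectorizer alongside the matrix is one option.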

**4. Model Training and Evaluation:**

```python
def train_naive_bayes(X_train: csr_matrix, y_train: pd.Series) -> MultinomialNB:
    """
    Train a Naive Bayes model using the training data and labels.
    
    Return the trained Naive Bayes model.
    """
    pass

def evaluate_model(model: MultinomialNB, X_test: csr_matrix, y_test: pd.Series) -> dict:
    """
    Evaluate the model using accuracy, precision, recall, and F1-score.
    
    Return a dictionary with evaluation metrics.
    """
    pass
```
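The training and evaluation stubs might be filled in along these lines. This is a sketch, not a reference solution: macro averaging for precision, recall, and F1 is an assumption (the task does not say which average to use), and the toy sentences are invented for the demonstration.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB


def train_naive_bayes(X_train: csr_matrix, y_train: pd.Series) -> MultinomialNB:
    """Fit a multinomial Naive Bayes model on count features."""
    model = MultinomialNB()  # default additive smoothing, alpha=1.0
    model.fit(X_train, y_train)
    return model


def evaluate_model(model: MultinomialNB, X_test: csr_matrix, y_test: pd.Series) -> dict:
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    y_pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data (hypothetical) to show the call sequence.
train_text = pd.Series(["good great", "great fun", "bad awful", "awful boring"])
train_y = pd.Series(["positive", "positive", "negative", "negative"])
test_text = pd.Series(["good fun", "bad boring"])
test_y = pd.Series(["positive", "negative"])

vectorizer = CountVectorizer()
model = train_naive_bayes(vectorizer.fit_transform(train_text), train_y)
metrics = evaluate_model(model, vectorizer.transform(test_text), test_y)
print(metrics)
```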

---

### **Steps to Implement:**

1. **Load and Explore the Dataset:**
   - Load the IMDb or Spam dataset using `load_data()`.
   - Display the first few rows to understand the data structure.
   
2. **Implement Preprocessing:**
   - Implement both versions of the text preprocessing (with and without stemming) using `preprocess_text_with_stemming()` and `preprocess_text_without_stemming()`.

3. **Feature Extraction:**
   - Use the `extract_features()` function to convert the preprocessed text into numerical features using Count Vectorizer.

4. **Train Naive Bayes Classifier:**
   - Split the dataset into training and testing sets.
   - Train the model using `train_naive_bayes()`.

5. **Model Evaluation:**
   - Evaluate the model using `evaluate_model()` and compare the results for text classification with and without preprocessing.
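The five steps can be tied together in one comparison loop. The sketch below is one way to do it, assuming the preprocessed text of each variant is already available as a `pd.Series`. The function name `compare_variants`, the metric choice, and the toy data are all hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_variants(variants: dict, labels: pd.Series, seed: int = 42) -> pd.DataFrame:
    """Train and score one Naive Bayes model per preprocessing variant."""
    rows = []
    for name, texts in variants.items():
        # Same seed and stratification, so every variant gets the same split.
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=0.25, random_state=seed, stratify=labels
        )
        # The pipeline fits the vocabulary on the training split only.
        clf = make_pipeline(CountVectorizer(), MultinomialNB())
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        rows.append({
            "variant": name,
            "accuracy": accuracy_score(y_test, y_pred),
            "macro_f1": f1_score(y_test, y_pred, average="macro"),
        })
    return pd.DataFrame(rows).set_index("variant")


# Toy data (hypothetical): the same eight reviews, unstemmed and stemmed.
labels = pd.Series(["positive"] * 4 + ["negative"] * 4)
unstemmed = pd.Series(["loved it", "wonderful film", "great acting", "loved acting",
                       "hated it", "boring film", "awful acting", "hated plot"])
stemmed = pd.Series(["love it", "wonder film", "great act", "love act",
                     "hate it", "bore film", "aw act", "hate plot"])

result = compare_variants({"without_stemming": unstemmed, "with_stemming": stemmed}, labels)
print(result)
```

With only eight toy reviews the scores carry no meaning. The point is the structure: identical split, model, and metrics for each variant, so that any difference in the table can be attributed to preprocessing.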

---

## Step 1: Setup and Data Loading

In [20]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Downloading necessary NLTK data files
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [4]:
movies_df = pd.read_csv('IMDB Dataset.csv')

In [5]:
movies_df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [6]:
movies_df.tail()

|   | review | sentiment |
|---|--------|-----------|
| 49995 | I thought this movie did a down right good job... | positive |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | I am a Catholic taught in parochial elementary... | negative |
| 49998 | I'm going to have to disagree with the previou... | negative |
| 49999 | No one expects the Star Trek movies to be high... | negative |


In [7]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [8]:
movies_df.describe()

|   | review | sentiment |
|---|--------|-----------|
| count | 50000 | 50000 |
| unique | 49582 | 2 |
| top | Loved today's show!!! It was a variety and not... | positive |
| freq | 5 | 25000 |


In [9]:
movies_df.duplicated().sum()

np.int64(418)

In [10]:
updated_movies_df = movies_df.drop_duplicates()
updated_movies_df.duplicated().sum()

np.int64(0)

## Step 2: Text Preprocessing

In [15]:
def preprocess_text(text):
  # Lowercase, then split into word tokens
  text = text.lower()
  tokens = word_tokenize(text)
  # Keep alphanumeric tokens that are not English stopwords
  stop_words = set(stopwords.words('english'))
  tokens = [w for w in tokens if w not in stop_words and w.isalnum()]
  # Stem first, then lemmatize the stemmed tokens
  stemmer = PorterStemmer()
  stemmed_tokens = [stemmer.stem(w) for w in tokens]
  lemmatizer = WordNetLemmatizer()
  lemmatized_tokens = [lemmatizer.lemmatize(w) for w in stemmed_tokens]
  return " ".join(lemmatized_tokens)

process1 = updated_movies_df['review'].apply(preprocess_text)
process1

|   | review |
|---|--------|
| 0 | one review mention watch 1 oz episod hook righ... |
| 1 | wonder littl product br br film techniqu fashi... |
| 2 | thought wonder way spend time hot summer weeke... |
| 3 | basic famili littl boy jake think zombi closet... |
| 4 | petter mattei love time money visual stun film... |
| ... | ... |
| 49995 | thought movi right good job creativ origin fir... |
| 49996 | bad plot bad dialogu bad act idiot direct anno... |
| 49997 | cathol taught parochi elementari school nun ta... |
| 49998 | go disagre previou comment side maltin one sec... |


In [17]:
# Second preprocessing variant: lowercase, tokenize, and remove stopwords,
# but skip stemming and lemmatization

def preprocess_text_without_stemming(text):
  text = text.lower()
  tokens = word_tokenize(text)
  stop_words = set(stopwords.words('english'))
  # Keep alphanumeric tokens that are not English stopwords
  tokens = [w for w in tokens if w not in stop_words and w.isalnum()]
  return " ".join(tokens)

process2 = updated_movies_df['review'].apply(preprocess_text_without_stemming)
process2

|   | review |
|---|--------|
| 0 | one reviewers mentioned watching 1 oz episode ... |
| 1 | wonderful little production br br filming tech... |
| 2 | thought wonderful way spend time hot summer we... |
| 3 | basically family little boy jake thinks zombie... |
| 4 | petter mattei love time money visually stunnin... |
| ... | ... |
| 49995 | thought movie right good job creative original... |
| 49996 | bad plot bad dialogue bad acting idiotic direc... |
| 49997 | catholic taught parochial elementary schools n... |
| 49998 | going disagree previous comment side maltin on... |


## Step 3: Feature Engineering with TF-IDF and Count Vectorizer

In [23]:
# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer()

# Fit and transform the raw review text (process1 and process2 are not used here)
X = tfidf.fit_transform(updated_movies_df['review'])
y = updated_movies_df['sentiment']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('\nTF-IDF Feature Matrix Shape:', X.shape)


TF-IDF Feature Matrix Shape: (49582, 101895)
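The cell above fits the vectorizer on every review before splitting, so vocabulary and document-frequency statistics from the test rows influence the training features. A hedged alternative is to split the raw text first and fit the vectorizer on the training part only. The sketch below uses a made-up `toy_df` in place of `updated_movies_df`, with the same column names assumed.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for updated_movies_df (hypothetical rows, same column names).
toy_df = pd.DataFrame({
    "review": ["great film", "loved it", "fine acting",
               "bad plot", "awful film", "hated it"],
    "sentiment": ["positive", "positive", "positive",
                  "negative", "negative", "negative"],
})

# Split the raw text first; test_size=2 holds out two rows, one per class.
text_train, text_test, y_train, y_test = train_test_split(
    toy_df["review"], toy_df["sentiment"],
    test_size=2, random_state=42, stratify=toy_df["sentiment"],
)

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(text_train)  # vocabulary and IDF from training text only
X_test = tfidf.transform(text_test)        # reuse the fitted vocabulary, no refit

print(X_train.shape, X_test.shape)
```

Words that appear only in the test reviews are ignored by `transform`, which mirrors what happens when the model meets unseen text after deployment.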


In [25]:
def extract_features(data):
  # Build a bag-of-words count matrix from the given text column
  vectorizer = CountVectorizer()
  features = vectorizer.fit_transform(data)
  return features

process3 = extract_features(updated_movies_df['review'])
process3


<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 6773763 stored elements and shape (49582, 101895)>

## Step 4: Model Training

In [28]:
# Initialize and train a logistic regression model on the TF-IDF features
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

print('\nModel trained')


Model trained


## Step 5: Model Evaluation

In [29]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=['Negative', 'Positive'])

print("\nModel Evaluation:")
print(f'Accuracy: {accuracy}')
print('Classification Report:\n', report)


Model Evaluation:
Accuracy: 0.8956337602097408
Classification Report:
               precision    recall  f1-score   support

    Negative       0.91      0.88      0.89      4939
    Positive       0.88      0.91      0.90      4978

    accuracy                           0.90      9917
   macro avg       0.90      0.90      0.90      9917
weighted avg       0.90      0.90      0.90      9917

