# Text Classification Project
## Overview

The project builds a sentiment classifier for movie reviews using both traditional machine learning (Logistic Regression) and deep learning (Neural Network) approaches. It uses the Rotten Tomatoes dataset to predict whether reviews are positive or negative.

### 1. Import the Libraries

In [None]:
import torch # For building neural networks
import pandas as pd # For data manipulation and analysis
import numpy as np # For numerical computations
import scipy.sparse as sp # For efficient storage of sparse matrices

# sklearn modules: For ML algorithms, text processing, and evaluation
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

import datasets # To easily load standard datasets
import evaluate # For model evaluation metrics
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

### 2. Load dataset using the datasets library

This loads the Rotten Tomatoes dataset using the Hugging Face `datasets` library. This dataset contains movie reviews labeled as positive (1) or negative (0).

In [3]:
dataset = datasets.load_dataset("rotten_tomatoes")

### 3. Convert to pandas and prepare data

The code:
- Converts the dataset splits into pandas DataFrames
- Extracts the review text as features (x) and sentiment labels (y)
- Creates separate lists for training and testing data

In [4]:
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])

In [5]:
x_train = train_df['text'].tolist()
y_train = train_df['label'].tolist()
x_test = test_df['text'].tolist()
y_test = test_df['label'].tolist()

### 4. Feature extraction using sklearn and scipy

This step:
- Creates a TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer whose vocabulary is capped at the 5,000 terms with the highest term frequency across the training corpus
- Fits the vectorizer on the training reviews only, then transforms both the training and test reviews into numerical feature vectors using that vocabulary
- TF-IDF weights each word by its frequency in a document, scaled up the rarer the word is across all documents
- The result is a sparse matrix (mostly zeros) stored in a `scipy.sparse` format

In [7]:
vectorizer = TfidfVectorizer(max_features=5000)
x_train_tfidf = vectorizer.fit_transform(x_train)
x_test_tfidf = vectorizer.transform(x_test)
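
To make the weighting concrete, here is a minimal pure-Python sketch of the computation, assuming `TfidfVectorizer`'s default settings: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The `tfidf_rows` helper, the whitespace tokenizer, and the toy reviews are illustrative assumptions and not part of the notebook; scikit-learn's own tokenizer behaves differently (for example, it drops single-character tokens by default).

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row holds L2-normalised TF-IDF weights."""
    # Assumption: naive lowercase + whitespace tokenisation.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        # Scale the row to unit Euclidean length (guarding against empty documents).
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

toy_reviews = ["a dull boring film", "a fun film", "a film"]
vocab, rows = tfidf_rows(toy_reviews)

# "dull" occurs in one review and "film" in all three,
# so "dull" should receive the larger weight in the first review.
print(rows[0][vocab.index("dull")] > rows[0][vocab.index("film")])
```

Most entries in each row are zero because a review only contains a handful of the vocabulary's terms, which is why the real vectorizer returns a sparse matrix.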

### 5. Train a traditional ML model using sklearn and scipy

This section:
- Creates a logistic regression model with 1,000 maximum iterations
- Fits it to the training data
- Generates predictions on the test data

In [9]:
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(x_train_tfidf, y_train)
lr_predictions = lr_model.predict(x_test_tfidf)
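
As a rough illustration of what the fitted model does at prediction time, the sketch below scores one feature vector as a weighted sum plus a bias, maps it through a sigmoid, and predicts the positive class when the score is above zero. The `predict_sentiment` helper, the three-term vocabulary, and the weights are made up for this example; the real weights live in `lr_model.coef_` and `lr_model.intercept_`.

```python
import math

def predict_sentiment(features, weights, bias):
    """Return (label, probability of the positive class) for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p_positive = 1.0 / (1.0 + math.exp(-z))
    # A score above zero corresponds to a positive-class probability above 0.5.
    return (1 if z > 0 else 0), p_positive

# Hypothetical vocabulary: ["boring", "enjoyable", "film"], with made-up weights.
weights = [-2.4, 2.2, 0.1]
bias = 0.0

negative_review = [0.7, 0.0, 0.7]   # mentions "boring" and "film"
positive_review = [0.0, 0.7, 0.7]   # mentions "enjoyable" and "film"

print(predict_sentiment(negative_review, weights, bias))
print(predict_sentiment(positive_review, weights, bias))
```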

### 6. Neural Network with PyTorch
#### Convert sparse matrices to dense tensors
This section:
- Converts the scipy sparse matrices to dense PyTorch tensors
- Creates TensorDatasets that pair features with labels
- Sets up DataLoaders that will feed data to the model in batches of 64 samples
- Shuffles the training data to prevent learning order-specific patterns

In [10]:
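# Densify the sparse TF-IDF matrices; toarray() returns a plain ndarray (todense() returns np.matrix)
x_train_tensor = torch.tensor(x_train_tfidf.toarray(), dtype=torch.float32)
x_test_tensor = torch.tensor(x_test_tfidf.toarray(), dtype=torch.float32)
# CrossEntropyLoss expects integer class labels (int64)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)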

#### Create DataLoaders

In [11]:
train_dataset = TensorDataset(x_train_tensor, y_train_tensor)
test_dataset = TensorDataset(x_test_tensor, y_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64)
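
Conceptually, a shuffled loader permutes the sample indices once per pass and then slices them into consecutive chunks of `batch_size`. The pure-Python sketch below mimics that behaviour; `iterate_batches` and its `seed` parameter are hypothetical names for this illustration, not PyTorch API.

```python
import random

def iterate_batches(n_samples, batch_size, shuffle, seed=None):
    """Yield lists of sample indices, one list per mini-batch."""
    indices = list(range(n_samples))
    if shuffle:
        # A fresh permutation per call plays the role of per-epoch shuffling.
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

batches = list(iterate_batches(n_samples=150, batch_size=64, shuffle=True, seed=0))

# When batch_size does not divide the sample count, the final batch is smaller.
print([len(batch) for batch in batches])
```

`DataLoader` keeps that smaller final batch by default, so every test review is still scored during evaluation.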

#### Define neural network

In [12]:
class TextClassifier(nn.Module):
    def __init__(self, input_dim):
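        super().__init__()
        self.fc1 = nn.Linear(input_dim, 256)  # TF-IDF features -> 256 hidden units
        self.fc2 = nn.Linear(256, 64)
        self.fc3 = nn.Linear(64, 2)  # Two output logits: negative (0) and positive (1)
        self.dropout = nn.Dropout(0.3)  # Zeroes each activation with probability 0.3 during training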

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x


input_dim = x_train_tfidf.shape[1]
model = TextClassifier(input_dim)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
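
Note that `forward` returns raw logits with no softmax, because `nn.CrossEntropyLoss` applies log-softmax internally before taking the negative log-likelihood of the true class. The following pure-Python sketch reproduces that calculation (averaged over the batch, which is the loss's default reduction) on hypothetical logits; `cross_entropy` and the numbers are illustrative only.

```python
import math

def cross_entropy(logits_batch, labels):
    """Mean cross-entropy for a batch of raw logits and integer class labels."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
        # -log(softmax(logits)[label]) == log_sum_exp - logits[label]
        total += log_sum_exp - logits[label]
    return total / len(labels)

# Hypothetical logits for two reviews (columns: negative, positive).
logits_batch = [[2.0, -1.0], [0.0, 0.0]]
labels = [0, 1]
print(cross_entropy(logits_batch, labels))
```

A confident correct prediction (first row) contributes a small loss, while the undecided second row contributes `ln 2`, the loss of a coin flip between two classes.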

#### Train the neural network

In [14]:
num_epochs = 5 
for epoch in range(num_epochs):
    model.train()
    for inputs, labels in train_loader: 
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        

#### Evaluate the neural network

In [15]:
model.eval()
nn_predictions = []
with torch.no_grad():
    for inputs, _ in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        nn_predictions.extend(predicted.cpu().numpy())
        

### 7. Evaluate both models using evaluate and sklearn

This section:

- Uses the `evaluate` library to calculate accuracy
- Uses `sklearn.metrics` to calculate precision, recall and F1-score
- Evaluates both models with the same metrics for fair comparison

In [17]:
metric = evaluate.load("accuracy")
lr_accuracy = metric.compute(predictions=lr_predictions, references=y_test)
nn_accuracy = metric.compute(predictions=nn_predictions, references=y_test)


#### Calculate precision, recall, and F1 using sklearn

In [18]:
lr_precision, lr_recall, lr_f1, _ = precision_recall_fscore_support(y_test, lr_predictions, average='binary')
nn_precision, nn_recall, nn_f1, _ = precision_recall_fscore_support(y_test, nn_predictions, average='binary')
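
With `average='binary'`, the metrics are computed for the positive class (label 1) only. The sketch below shows the underlying arithmetic on a toy example, counting true positives, false positives, and false negatives by hand; `binary_prf` and the toy labels are assumptions made for illustration.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the chosen positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Precision: share of predicted positives that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
print(binary_prf(toy_true, toy_pred))
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so precision, recall, and F1 all equal 2/3.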

### 8. Compare results

This creates and displays a pandas DataFrame that directly compares the performance of both models across multiple metrics.

In [20]:
results = pd.DataFrame({
    'Model': ['Logistic Regression',  'Neural Network'],
    'Accuracy': [lr_accuracy['accuracy'], nn_accuracy['accuracy']],
    'Precision': [lr_precision, nn_precision],
    'Recall': [lr_recall, nn_recall],
    'F1 Score': [lr_f1, nn_f1],
})

print("\nModel Performance Comparison:")
print(results)


Model Performance Comparison:
                 Model  Accuracy  Precision    Recall  F1 Score
0  Logistic Regression  0.770169   0.775862  0.759850  0.767773
1       Neural Network  0.767355   0.783300  0.739212  0.760618


### 9. Feature importance analysis with numpy and pandas

This final section:
- Extracts coefficient values from the logistic regression model
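- Uses numpy to take the absolute value of each coefficient as an importance score, which discards the sign (whether a word pushes the prediction toward positive or negative)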
- Pairs them with the feature names from the TF-IDF vectorizer
- Creates a DataFrame of feature importance
- Displays the top 10 most important words for sentiment prediction

In [22]:
if hasattr(lr_model, 'coef_'):
    # Get feature importance from logistic regression
    importance = np.abs(lr_model.coef_[0])
    feature_names = vectorizer.get_feature_names_out()

    # Create feature importance DataFrame
    feature_importance = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importance
    })

    # Display top 10 most important features
    top_features = feature_importance.sort_values('Importance', ascending=False).head(10)
    print("\nTop 10 Most Important Features: ")
    print(top_features)


Top 10 Most Important Features: 
           Feature  Importance
4528           too    4.231535
351            bad    3.068830
1327          dull    2.903202
3214  performances    2.587575
224            and    2.538227
4196         still    2.458881
500         boring    2.444609
3084          only    2.376841
4948         worst    2.213899
1452     enjoyable    2.194255
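
Because the ranking uses absolute values, it does not show which direction each word pushes the prediction. One possible variation, sketched below, sorts the signed coefficients and reports the most positive and most negative terms separately. The `top_terms_by_sign` helper is hypothetical, and the terms and coefficients are made-up values for illustration, not output from the fitted model.

```python
def top_terms_by_sign(terms, coefficients, k=3):
    """Return the k most positive and the k most negative (term, coefficient) pairs."""
    paired = sorted(zip(terms, coefficients), key=lambda tc: tc[1])
    most_negative = paired[:k]          # smallest coefficients first
    most_positive = paired[::-1][:k]    # largest coefficients first
    return most_positive, most_negative

# Made-up coefficients for illustration only.
terms = ["bad", "dull", "boring", "enjoyable", "film", "fun"]
coefficients = [-1.5, -1.2, -0.9, 1.1, 0.4, 0.8]

positive, negative = top_terms_by_sign(terms, coefficients, k=2)
print("Pushes toward positive:", positive)
print("Pushes toward negative:", negative)
```

With the notebook's objects, the inputs would be `vectorizer.get_feature_names_out()` and `lr_model.coef_[0]`.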
