In [2]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting click (from nltk)
  Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.11.6-cp310-cp310-win_amd64.whl.metadata (41 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Downloading regex-2024.11.6-cp310-cp310-win_amd64.whl (274 kB)
Downloading click-8.1.8-py3-none-any.whl (98 kB)
Installing collected packages: regex, click, nltk
Successfully installed click-8.1.8 nltk-3.9.1 regex-2024.11.6


# Sentiment Analysis Report: Movie Review Classification Using TF-IDF and Logistic Regression

**Build a model to classify IMDB reviews as positive or negative** 

## Steps:
1. Data loading
2. Preprocessing
3. TF-IDF vectorization
4. Model training
5. Evaluation

```python
print("This is how code looks inside markdown.")
```

In [2]:
import pandas as pd  # Data manipulation
import re  # Regular expressions for text cleaning
import nltk  # NLTK for stopword removal
from nltk.corpus import stopwords
# Train-test splitting, vectorization, model, and evaluation
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Setup
# Download the stopword list (a no-op if it is already present)
nltk.download('stopwords', quiet=True)
# Load stopwords into a set for faster lookup
stop_words = set(stopwords.words('english'))

# Define a function to clean and preprocess a text string
def preprocess(text):
    text = text.lower()  # Convert text to lowercase
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    text = re.sub(r'[^a-z\s]', '', text)  # Remove everything except letters and whitespace
    tokens = text.split()  # Split text into words
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return ' '.join(tokens)  # Join words back into a single string

# Load dataset
df = pd.read_csv("D:/Self Learning/NLP -Sentiment Analysis/IMDB Reviews/IMDB Dataset.csv")

# Parameters
batch_size = 5000  # Set batch size for processing chunks of data
total_records = len(df) #Get total number of records in the dataset
# Prepare empty lists to store model predictions and true labels
all_predictions = []
all_true_labels = []

# Create a TF-IDF vectorizer (will be fit on all data later)
tfidf = TfidfVectorizer(max_features=5000)
# Create a logistic regression model instance
model = LogisticRegression()

# Preprocess the full dataset first, in chunks, to avoid memory overload
print("Preprocessing in batches...")
# List to hold the preprocessed batches temporarily
batches = []
for start in range(0, total_records, batch_size):
    end = min(start + batch_size, total_records)  # Avoid index overflow
    batch = df.iloc[start:end].copy()  # Copy batch from dataframe
    batch['review'] = batch['review'].apply(preprocess)  # Preprocess text
    batches.append(batch)  # Store cleaned batch

# Concatenate the cleaned reviews from all batches into one list
full_cleaned_text = pd.concat(batches)['review'].tolist()
# Fit the TF-IDF vectorizer once on the entire corpus.
# Note: this corpus includes the reviews that later land in the test split.
tfidf.fit(full_cleaned_text)

# Convert sentiment labels to numeric values (positive = 1, negative = 0)
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
# Split the full dataframe into training (80%) and testing (20%) sets.
# No random_state is set, so the split and the scores below vary slightly between runs.
train_df, test_df = train_test_split(df, test_size=0.2)

# === TRAINING ON TRAIN BATCHES ===
print("Training in batches...")
# Store sparse TF-IDF matrices for training data
X_train_batches, y_train_batches = [], []

for start in range(0, len(train_df), batch_size):
    end = min(start + batch_size, len(train_df))
    batch = train_df.iloc[start:end].copy()
    batch['review'] = batch['review'].apply(preprocess) # Preprocess text
    X = tfidf.transform(batch['review']) # Convert to TF-IDF
    y = batch['sentiment'].values  # Get sentiment labels
    X_train_batches.append(X)
    y_train_batches.extend(y) # Extend label list

# Stack all sparse matrices vertically into one big matrix
from scipy.sparse import vstack
X_train_full = vstack(X_train_batches)

# Train the logistic regression model
model.fit(X_train_full, y_train_batches)

# === PREDICTION ON TEST BATCHES ===
print("Predicting in batches...")
# Loop through the test data in batches for prediction
for start in range(0, len(test_df), batch_size):
    end = min(start + batch_size, len(test_df))
    batch = test_df.iloc[start:end].copy()
    batch['review'] = batch['review'].apply(preprocess) # Preprocess
    X_test = tfidf.transform(batch['review'])  # Convert to TF-IDF (Term Frequency Inverse Document Frequency)
    y_test = batch['sentiment'].values # True labels
    y_pred = model.predict(X_test) # Predict sentiment

    # Collect all predictions
    all_predictions.extend(y_pred)  # Save predictions
    all_true_labels.extend(y_test)  # Save true labels

# Final report
print("\nClassification Report:")
print(classification_report(all_true_labels, all_predictions))


Preprocessing in batches...
Training in batches...
Predicting in batches...

Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.88      0.89      5074
           1       0.88      0.89      0.89      4926

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



# Notes

Machine learning models such as Logistic Regression cannot work with raw text; they require numerical input.

**So we use TF-IDF (Term Frequency-Inverse Document Frequency) to:**

- Turn each review into a vector of numbers representing the importance of each word.
- Ensure consistency by applying the same learned vocabulary and weighting to any new text.
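
To make the idea of word importance concrete, here is a small pure-Python sketch of how a TF-IDF vector can be computed. It assumes raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation, which is my understanding of scikit-learn's defaults. The helper `tfidf_vectors` and the three sample reviews are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2 normalisation."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears
    df = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[w] * idf[w] for w in vocab]
        # Scale each vector to unit length so long reviews do not dominate
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, vectors

# Hypothetical, already-cleaned reviews
docs = ["great movie great acting", "boring movie", "great fun"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
for vec in vectors:
    print([round(x, 2) for x in vec])
```

In the second review, "boring" appears in only one document while "movie" appears in two, so "boring" receives the larger weight.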

# Difference Between `.fit_transform()` and `.transform()`

| Method | Used when? | What it does |
| --- | --- | --- |
| `fit_transform()` | On training data only | Learns the vocabulary and transforms text to numeric form |
| `transform()` | On test (or new) data | Uses the learned vocabulary to convert new text into vectors |
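
The distinction can be seen on a toy example. This is a sketch with hypothetical sample texts, not the notebook's data. Note that the notebook code calls `fit()` on the full cleaned corpus before splitting, whereas this sketch follows the stricter pattern of learning the vocabulary from the training texts alone.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-cleaned reviews standing in for the real splits
train_texts = ["great movie", "boring plot", "great acting"]
test_texts = ["great soundtrack", "boring movie"]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_texts)  # Learns vocabulary and IDF weights
X_test = tfidf.transform(test_texts)        # Reuses them; learns nothing new

print(sorted(tfidf.vocabulary_))            # Words seen during fitting
print(X_train.shape, X_test.shape)          # Same number of columns in both
print("soundtrack" in tfidf.vocabulary_)    # Unseen test words are ignored
```

Because `transform()` reuses the fitted vocabulary, both matrices have the same columns, which is what lets a model trained on `X_train` score `X_test`.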

## 🧠 Sentiment Analysis on IMDB Reviews
- Built a text classification model using **TF-IDF** and **Logistic Regression**.
- Preprocessed 50,000 movie reviews: stripped HTML tags and non-letter characters with regular expressions and removed stopwords using NLTK.
- Achieved **89% accuracy** on the held-out test set of 10,000 reviews.