# Sentiment Analysis Lab: Movie Review Classification

**Objective:** Train a machine learning model to classify movie reviews as positive or negative.

**Dataset:** data/movie_reviews.csv (30 labeled reviews: 15 positive, 15 negative)

## Step 1: Import Libraries

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

## Step 2: Load the Dataset

In [2]:
# Load the movie reviews dataset
df = pd.read_csv('data/movie_reviews.csv')

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:\n{df.head()}")

Dataset loaded successfully!
Dataset shape: (30, 2)

First few rows:
                                                text     label
0  This movie was absolutely fantastic! The actin...  positive
1  Terrible film. Waste of time and money. The st...  negative
2  I loved every minute of it. Best movie I've se...  positive
3  Boring and predictable. I fell asleep halfway ...  negative
4  Outstanding performances by the entire cast. H...  positive


## Step 3: Extract Text and Label Columns

In [3]:
# Extract features (X) and target (y)
X = df['text']  # Movie reviews
y = df['label']  # Sentiment labels (positive/negative)

print(f"Number of reviews: {len(X)}")
print(f"Label distribution:\n{y.value_counts()}")

Number of reviews: 30
Label distribution:
label
positive    15
negative    15
Name: count, dtype: int64


## Step 4: Transform Text into Numerical Features using TF-IDF

In [4]:
# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

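# Fit and transform the text data
# NOTE: fitting on the full dataset before the split lets test-set vocabulary
# and IDF statistics leak into training; in practice, fit on the training split only
X_vectorized = vectorizer.fit_transform(X)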

print(f"Feature matrix shape: {X_vectorized.shape}")
print(f"Number of unique words (features): {len(vectorizer.get_feature_names_out())}")

Feature matrix shape: (30, 174)
Number of unique words (features): 174


## Step 5: Split Data into Training and Testing Sets (80/20)

In [5]:
# Split the data 80/20
# NOTE: without stratify=y, the class balance of each split is left to chance
X_train, X_test, y_train, y_test = train_test_split(
    X_vectorized, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

Training set size: 24
Testing set size: 6
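
With only 6 test reviews, an unstratified split can produce a test set whose class balance differs sharply from the full dataset. One common option is to pass `stratify` to `train_test_split`. The sketch below uses placeholder labels that mirror the 15/15 distribution from Step 3; the `labels` and `indices` lists are stand-ins, not the real data.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder labels mirroring the 15 positive / 15 negative distribution
labels = ["positive"] * 15 + ["negative"] * 15
indices = list(range(len(labels)))  # stand-in for the feature rows

idx_train, idx_test, lab_train, lab_test = train_test_split(
    indices, labels, test_size=0.2, random_state=42, stratify=labels
)

# Stratification keeps the class proportions equal in both splits
train_counts = Counter(lab_train)
test_counts = Counter(lab_test)
print(train_counts)
print(test_counts)
```

Under these assumptions, the test set holds 3 reviews of each class, so a model that always predicts one label scores 50% rather than an arbitrary value.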


## Step 6: Initialize and Train the Logistic Regression Model

In [6]:
# Initialize the model
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)

print("Model training complete!")

Model training complete!
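
An alternative arrangement, shown here as a sketch rather than as the lab's method, wraps the vectorizer and the classifier in a scikit-learn `Pipeline`. Fitting the pipeline on raw training text means the TF-IDF vocabulary is learned from the training reviews only. The `train_texts`, `train_labels`, and `new_texts` values are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data, not taken from movie_reviews.csv
train_texts = [
    "great film", "loved it", "wonderful acting",
    "terrible film", "hated it", "awful acting",
]
train_labels = ["positive"] * 3 + ["negative"] * 3

sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns the vocabulary and the classifier from training text only
sentiment_pipeline.fit(train_texts, train_labels)

# predict() reuses the fitted vocabulary; words not seen in training are ignored
new_texts = ["wonderful film", "awful plot"]
predictions = sentiment_pipeline.predict(new_texts)
print(predictions)
```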


## Step 7: Make Predictions on the Test Set

In [7]:
# Predict labels for test set
y_pred = model.predict(X_test)

print(f"Predictions made for {len(y_pred)} test samples")

Predictions made for 6 test samples


## Step 8: Calculate and Print Accuracy Score

In [8]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy Score: {accuracy:.2f}")
print(f"Accuracy Percentage: {accuracy * 100:.2f}%")

Accuracy Score: 0.17
Accuracy Percentage: 16.67%


## Step 9: Print Classification Report

In [9]:
# Display detailed classification metrics
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         5
    positive       0.17      1.00      0.29         1

    accuracy                           0.17         6
   macro avg       0.08      0.50      0.14         6
weighted avg       0.03      0.17      0.05         6



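
A confusion matrix makes the failure mode in the report explicit. The individual predictions are not printed in this lab, so the sketch below reconstructs them from the report itself: support of 5 negative and 1 positive, with recall of 0.00 and 1.00, implies that all six reviews were predicted positive. The names `y_true_reconstructed` and `y_pred_reconstructed` are hypothetical. Precision for the negative class is then 0/0, which scikit-learn reports as 0 with an `UndefinedMetricWarning` by default; the `zero_division` argument sets that value explicitly.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Reconstructed from the report's support and recall columns (an inference,
# not the notebook's actual y_test / y_pred objects)
y_true_reconstructed = ["negative"] * 5 + ["positive"]
y_pred_reconstructed = ["positive"] * 6

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(
    y_true_reconstructed, y_pred_reconstructed, labels=["negative", "positive"]
)
print(cm)

# zero_division=0 reports undefined precision as 0 without emitting a warning
report = classification_report(
    y_true_reconstructed, y_pred_reconstructed, zero_division=0
)
print(report)
```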


## Key Insight

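The goal of evaluation is to assess how well the model generalizes to new, unseen reviews. The accuracy score and classification report summarize this:
- **Precision**: Of all reviews predicted as a given class, how many actually belong to it?
- **Recall**: Of all reviews that actually belong to a given class, how many did the model identify?
- **F1-Score**: The harmonic mean of precision and recall

In this run the model does not generalize. Test accuracy is 16.67% (1 of 6 correct), well below the 50% expected from random guessing on balanced classes. The recall values (1.00 for positive, 0.00 for negative) show that the model predicted positive for all six test reviews. The unstratified split placed 5 negative reviews and 1 positive review in the test set, leaving 14 positive and 10 negative reviews for training, which is a likely reason the model leans toward the positive class. With only 30 reviews and a 6-review test set, a single split is also too small to give a reliable estimate. High test accuracy would indicate that the model learned patterns that carry over to new movie reviews; these results do not show that.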
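
The three metrics can be checked by hand against the report. The sketch below recomputes the positive-class row from counts implied by the report's support and recall columns (1 true positive, 5 false positives, 0 false negatives); the variable names are illustrative.

```python
# Counts for the positive class, implied by the classification report:
# every test review was predicted positive, and only 1 of 6 truly was
true_positives = 1
false_positives = 5
false_negatives = 0

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.17 recall=1.00 f1=0.29
```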