# 🎬 Movie Review Sentiment Analysis Project
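Welcome to your NLP mini-project! In this task, you’ll work through core Natural Language Processing concepts, from preprocessing to classification, on real text data. The notebook uses newsgroup posts as a stand-in for movie reviews; the same binary classification pipeline applies to positive vs. negative sentiment labels.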

You’ll complete some code cells (marked with **TODO**) to practice important NLP steps.

## 🧩 Step 1: Import Libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
import string
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Qasim\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 📥 Step 2: Load Simple Text Dataset
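We’ll use the **20 Newsgroups dataset**, which scikit-learn downloads automatically on first use. We’ll pick only two categories (`rec.autos` and `sci.space`) to make the task binary, much like positive vs. negative sentiment. Headers, footers, and quoted replies are removed so the models learn from the message text itself.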

In [None]:
# Load text data from sklearn
categories = ['rec.autos', 'sci.space']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Create a DataFrame
data = pd.DataFrame({'text': newsgroups.data, 'label': newsgroups.target})
data.head()
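
Stripping headers, footers, and quotes can leave some posts empty or whitespace-only, and those rows give the classifier nothing to learn from. The sketch below shows one way to check class balance and drop such rows. It uses a small made-up DataFrame (`demo`, a hypothetical stand-in for `data`) so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the real frame comes from fetch_20newsgroups.
demo = pd.DataFrame({
    'text': ['The engine stalls at idle.', '', '   \n', 'Orbit insertion succeeded.'],
    'label': [0, 0, 1, 1],
})

# Count documents per class to spot imbalance.
print(demo['label'].value_counts())

# Keep only rows whose text contains something besides whitespace.
non_empty = demo['text'].str.strip().str.len() > 0
demo_filtered = demo[non_empty].reset_index(drop=True)
print(f"Dropped {len(demo) - len(demo_filtered)} empty documents")
```

The same two steps should work on the real `data` frame by replacing `demo` with `data`.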

## 🧹 Step 3: Text Preprocessing
Here, you’ll:
- Lowercase text
- Remove punctuation and stopwords
- Apply stemming

👉 **Task:** Fill in the missing lines marked as `# TODO`

In [None]:
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """Lowercase, strip punctuation, drop stopwords, and stem each word."""
    text = text.lower()
    # Delete every ASCII punctuation character (so "don't" becomes "dont")
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = text.split()
    # TODO: remove stopwords
    words = [w for w in words if w not in stop_words]
    # TODO: apply stemming
    words = [stemmer.stem(w) for w in words]
    return ' '.join(words)

# Apply cleaning
data['clean_text'] = data['text'].apply(clean_text)
data.head()
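
Because `clean_text` removes punctuation before filtering stopwords, contractions such as "don't" turn into "dont" and can slip past a stopword list that only contains the apostrophe form. The sketch below shows one possible reordering: strip punctuation from token edges, filter stopwords, then remove any remaining punctuation and stem. `clean_text_demo` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stopword set is a stand-in for NLTK's list so the example runs without a corpus download.

```python
import string
from nltk.stem import PorterStemmer

# Stand-in stopword set for illustration only; the notebook uses
# stopwords.words('english') from NLTK instead.
DEMO_STOP_WORDS = {"i", "the", "this", "was", "were", "don't", "and", "it"}

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)
_stemmer = PorterStemmer()

def clean_text_demo(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, drop stopwords, strip punctuation, then stem."""
    tokens = text.lower().split()
    # Strip punctuation from token edges only, so "don't" still matches
    # its stopword entry before the apostrophe is removed.
    tokens = [t.strip(string.punctuation) for t in tokens]
    tokens = [t for t in tokens if t and t not in stop_words]
    # Remove any punctuation left inside tokens (e.g. "sci-fi" -> "scifi").
    tokens = [t.translate(_PUNCT_TABLE) for t in tokens]
    return " ".join(_stemmer.stem(t) for t in tokens if t)

sample = "I don't think the rockets were running, and this was FUN!"
cleaned = clean_text_demo(sample)
print(cleaned)
```

Whether this ordering helps on the real data is an empirical question; comparing model scores with both versions is a reasonable experiment.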

## 🧮 Step 4: Feature Extraction
Convert text into numerical vectors using **TF-IDF** (Term Frequency-Inverse Document Frequency).

In [None]:
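# TODO: Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=1000)
y = data['label']

# Split the raw text first so the vectorizer never sees the test documents
text_train, text_test, y_train, y_test = train_test_split(
    data['clean_text'], y, test_size=0.2, random_state=42)

# Learn vocabulary and IDF weights on the training set only
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)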

## 🤖 Step 5: Train Models
We’ll train both **SVM** and **Decision Tree** classifiers to compare performance.

In [None]:
# Train SVM model
svm_model = LinearSVC()
svm_model.fit(X_train, y_train)

# Train Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

## 📊 Step 6: Evaluate Models

In [None]:
# 🧠 Model Evaluation & Comparison
# We’ll compare the performance of 4 models:
# 1. Support Vector Machine (SVM)
# 2. Decision Tree
# 3. Bagging Classifier
# 4. Random Forest

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Define models
models = {
    'SVM': svm_model,
    'Decision Tree': dt_model,
    'Bagging': BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

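# Train & evaluate each model (SVM and Decision Tree were already fit in Step 5, so refitting them here is redundant but harmless)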
for name, model in models.items():
    print(f"\n🔹 Training and Evaluating Model: {name}")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
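
The numbers in a classification report all derive from the confusion matrix, which `confusion_matrix` (imported in Step 1) computes directly. The sketch below uses made-up labels and predictions (`y_true_demo`, `y_pred_demo`) to show how accuracy, precision, and recall for label 1 follow from the four cells.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration only.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of everything predicted as 1, how much was right
recall = tp / (tp + fn)      # of everything truly 1, how much was found

print(pd.DataFrame(cm, index=['true 0', 'true 1'], columns=['pred 0', 'pred 1']))
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In the notebook, calling `confusion_matrix(y_test, y_pred)` inside the evaluation loop would give the same breakdown for each model.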



## 🧭 Step 7: Discussion & Extension Ideas
- Which model performed best? Why?
- Try using `CountVectorizer` instead of `TfidfVectorizer`.
- Use NLTK's `WordNetLemmatizer` instead of the stemmer.
- Plot a confusion matrix using seaborn.
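
One way to try the vectorizer swap is to wrap each vectorizer and the classifier in a scikit-learn pipeline, so the vectorizer is always fit on training text only. The sketch below is a minimal version of that comparison. The toy texts and labels are made up, and with so little data the scores themselves mean nothing; only the pattern carries over.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data for illustration only; in the notebook, use the cleaned text
# column and labels before vectorization.
train_texts = ["car engine wheel", "engine oil car", "rocket orbit launch", "orbit moon rocket"]
train_labels = [0, 0, 1, 1]
test_texts = ["car brake wheel", "moon launch rocket"]
test_labels = [0, 1]

scores = {}
for name, vectorizer_cls in [("CountVectorizer", CountVectorizer),
                             ("TfidfVectorizer", TfidfVectorizer)]:
    # The pipeline fits the vectorizer on training text only, then the classifier.
    pipeline = make_pipeline(vectorizer_cls(), LinearSVC(random_state=42))
    pipeline.fit(train_texts, train_labels)
    scores[name] = pipeline.score(test_texts, test_labels)

for name, score in scores.items():
    print(f"{name}: accuracy={score:.2f}")
```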

## 🌟 Congratulations!
You’ve completed your mini-project covering:
- Data loading
- Text preprocessing
- Feature extraction
- Model training & evaluation

Great job!