🎬 IMDb Movie Review Sentiment Analysis

A complete machine learning pipeline for analyzing sentiment in movie reviews using scikit-learn and natural language processing techniques.

🚀 Quick Start

Install dependencies:
```
pip install -r requirements.txt
```
Run the interactive demo:
```
python src/interactive_demo.py
```
Or run the main analysis script:
```
python src/sentiment_analysis.py
```

📁 Project Structure

imdb_ds/
├── src/                    # Source code directory
│   ├── sentiment_analysis.py    # Main sentiment analysis pipeline
│   ├── data_utils.py           # Data loading and preprocessing utilities
│   ├── visualizations.py       # Visualization and plotting functions
│   ├── interactive_demo.py     # Interactive demo interface
│   └── test_pipeline.py        # Test suite
├── docs/                   # Documentation directory
│   └── PROJECT_OVERVIEW.md    # Project status and details
├── models/                 # Trained model storage
│   └── sentiment_model.pkl    # Saved model file
├── requirements.txt        # Python dependencies
├── CLAUDE.md              # Development guidance for Claude Code
└── README.md              # This file

🎯 Features

Multiple Data Sources: Load data from Hugging Face datasets, CSV files, or built-in sample data
Advanced Preprocessing: Text cleaning, tokenization, stemming, and stopword removal
Model Training: Logistic Regression and Naive Bayes classifiers with cross-validation
Comprehensive Evaluation: Accuracy, AUC, confusion matrix, and feature importance analysis
Rich Visualizations: ROC curves, word clouds, feature importance plots, and more
Interactive Testing: Test sentiment predictions on custom movie reviews
Model Persistence: Save and load trained models

🛠️ Usage Examples

Basic Usage

from src.sentiment_analysis import MovieReviewSentimentAnalyzer

# Initialize analyzer
analyzer = MovieReviewSentimentAnalyzer()

# Load sample data and train
reviews, labels = analyzer.load_sample_data()
X_train, X_test, y_train, y_test = analyzer.prepare_data(reviews, labels)
analyzer.train_model(X_train, y_train)

# Test prediction
result = analyzer.predict_sentiment("This movie was absolutely amazing!")
print(f"Sentiment: {result['sentiment']} (confidence: {result['confidence']:.3f})")

Loading Real IMDb Data

from src.data_utils import load_imdb_dataset

# Load 5000 reviews from IMDb dataset
reviews, labels = load_imdb_dataset(subset_size=5000)

Creating Visualizations

from src.visualizations import create_comprehensive_report

# Generate complete analysis report
create_comprehensive_report(analyzer, X_test, y_test, reviews, labels)

📊 Model Performance

The default pipeline is configured as follows:

Vectorizer: TF-IDF with 1-2 word n-grams
Classifier: Logistic Regression with L2 regularization
Preprocessing: Stemming, stopword removal, and text cleaning
Expected Accuracy: 85-90% on IMDb dataset
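
The configuration above can be sketched with scikit-learn directly. This is a minimal illustration, not the project's actual implementation: the toy reviews, the step names ("tfidf", "clf"), and max_iter=1000 are assumptions made for the example, and the NLTK preprocessing step is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data for illustration only; real training would use IMDb reviews.
reviews = [
    "great acting and a wonderful plot",
    "absolutely fantastic film",
    "loved every minute of it",
    "terrible film with a boring plot",
    "awful acting and a predictable story",
    "hated every minute of it",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

pipeline = Pipeline([
    # Unigrams and bigrams, matching the "1-2 word n-grams" setting.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    # L2 is scikit-learn's default penalty for LogisticRegression.
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(reviews, labels)

# predict_proba returns one probability per class, ordered as pipeline.classes_.
proba = pipeline.predict_proba(["a wonderful film"])[0]
print(f"P(negative)={proba[0]:.3f}, P(positive)={proba[1]:.3f}")
```

Wrapping the vectorizer and classifier in a single Pipeline keeps the fitted vocabulary and the model together, so raw text can be passed straight to predict_proba.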

📈 Visualizations

The project includes comprehensive visualizations:

Confusion matrices
ROC curves and AUC scores
Feature importance (most positive/negative words)
Word clouds for positive and negative reviews
Review length distributions
Prediction confidence analysis
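
The numbers behind the confusion matrix and ROC/AUC plots can be computed with scikit-learn's metrics module. The sketch below uses four hand-written labels and scores (assumed values, not project results) and a 0.5 decision threshold, which is also an assumption.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hand-written example values; 1 = positive review, 0 = negative review.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

# Turn scores into hard labels with a 0.5 threshold.
y_pred = [int(score >= 0.5) for score in y_score]

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)    # uses the scores, not the hard labels

print(cm.tolist())  # [[2, 0], [1, 1]]
print(acc)          # 0.75
print(auc)          # 0.75
```

AUC is computed from the raw scores, so it does not change when the decision threshold moves, while accuracy and the confusion matrix do.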

🎮 Interactive Demo

Run python src/interactive_demo.py for a user-friendly interface that provides:

Quick training with sample data
Training with real IMDb dataset
Interactive sentiment testing
Visualization report generation
Model saving and loading
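
Saving and loading a model to a .pkl file, as in models/sentiment_model.pkl, can be sketched with Python's standard pickle module. The helper names save_model and load_model are hypothetical, and a plain dictionary stands in for a trained model; the project's own persistence code may differ.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize any picklable object to `path`, creating parent folders."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load an object previously written by save_model."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# A stand-in for a trained model (assumption for the example).
stand_in_model = {"vocabulary": ["great", "boring"], "weights": [1.2, -0.9]}

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "models" / "sentiment_model.pkl"
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)  # True
```

Pickle files can execute arbitrary code when loaded, so only load model files from sources you trust.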

📦 Dependencies

pandas: Data manipulation and analysis
scikit-learn: Machine learning algorithms and evaluation
nltk: Natural language processing tools
numpy: Numerical computing
matplotlib & seaborn: Data visualization
datasets: Hugging Face datasets (for real IMDb data)
wordcloud: Word cloud generation
tqdm: Progress bars

🎯 Example Predictions

Review: "This movie was absolutely fantastic! Great acting and plot."
→ Sentiment: Positive (confidence: 0.924)

Review: "Terrible film. Boring and predictable storyline."
→ Sentiment: Negative (confidence: 0.876)

Review: "Decent movie with some good moments."
→ Sentiment: Positive (confidence: 0.651)
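
One plausible way to read these confidence values is as the probability of the predicted class. The sketch below assumes exactly that, plus a 0.5 decision threshold; the function name label_with_confidence is hypothetical, and only the 'sentiment' and 'confidence' keys come from the usage example earlier in this README.

```python
def label_with_confidence(p_positive: float) -> dict:
    """Map P(positive) to a label and the probability of that label."""
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("p_positive must be between 0 and 1")
    if p_positive >= 0.5:
        return {"sentiment": "Positive", "confidence": p_positive}
    # For a negative prediction, report P(negative) = 1 - P(positive).
    return {"sentiment": "Negative", "confidence": 1.0 - p_positive}


for p in (0.924, 0.124, 0.651):
    result = label_with_confidence(p)
    print(f"Sentiment: {result['sentiment']} (confidence: {result['confidence']:.3f})")
```

Under this reading, a lukewarm review such as "Decent movie with some good moments." lands closer to 0.5 because the classifier sees weaker evidence for either class.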

🔧 Customization

Training Different Models

# Try Naive Bayes instead of Logistic Regression
analyzer.train_model(X_train, y_train, model_type='naive_bayes')

# Use Count Vectorizer instead of TF-IDF
analyzer.train_model(X_train, y_train, vectorizer_type='count')
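
To illustrate what these options might select, here is a hedged sketch that maps the two string arguments to scikit-learn classes. The function build_pipeline, the default names "logistic_regression" and "tfidf", and the step names are assumptions; only 'naive_bayes' and 'count' appear in this README, and train_model may be implemented differently.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_pipeline(model_type="logistic_regression", vectorizer_type="tfidf"):
    """Hypothetical factory: choose a vectorizer and classifier by name."""
    vectorizers = {"tfidf": TfidfVectorizer, "count": CountVectorizer}
    models = {
        "logistic_regression": lambda: LogisticRegression(max_iter=1000),
        "naive_bayes": MultinomialNB,
    }
    if vectorizer_type not in vectorizers:
        raise ValueError(f"unknown vectorizer_type: {vectorizer_type!r}")
    if model_type not in models:
        raise ValueError(f"unknown model_type: {model_type!r}")
    return Pipeline([
        # Both vectorizers accept ngram_range; (1, 2) gives unigrams and bigrams.
        ("vectorizer", vectorizers[vectorizer_type](ngram_range=(1, 2))),
        ("classifier", models[model_type]()),
    ])


nb_pipeline = build_pipeline(model_type="naive_bayes", vectorizer_type="count")
print(type(nb_pipeline.named_steps["classifier"]).__name__)  # MultinomialNB
```

Multinomial Naive Bayes pairs naturally with raw counts, although it also accepts TF-IDF weights in practice.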

Custom Text Preprocessing

# Modify the clean_text method in MovieReviewSentimentAnalyzer
# to customize preprocessing steps
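
As a starting point for such a customization, the following standalone function sketches the cleaning steps listed under Features using only the standard library. It is not the project's clean_text method: the small stopword set is an assumption, and stemming is left as an optional callable (NLTK's PorterStemmer().stem could be passed in, since nltk is already a dependency).

```python
import re

# A deliberately tiny stopword set for the example; NLTK ships a fuller list.
DEFAULT_STOPWORDS = frozenset(
    {"a", "an", "and", "the", "is", "was", "it", "this", "of", "to", "in"}
)


def clean_text(text, stopwords=DEFAULT_STOPWORDS, stem=lambda token: token):
    """Lowercase, strip HTML tags and non-letters, drop stopwords, then stem."""
    text = re.sub(r"<[^>]+>", " ", text)            # remove tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # keep letters and whitespace
    tokens = [stem(token) for token in text.split() if token not in stopwords]
    return " ".join(tokens)


print(clean_text("This movie was <br /> absolutely AMAZING!!!"))
# movie absolutely amazing
```

Passing the stemmer and stopword set as arguments makes it easy to compare preprocessing variants without editing the function body.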

📚 Educational Value

This project demonstrates:

Text preprocessing and feature extraction
Binary classification with scikit-learn
Model evaluation and cross-validation
Data visualization with matplotlib/seaborn
Interactive application development
Model persistence and deployment

🤝 Contributing

Feel free to extend this project by:

Adding more sophisticated models (neural networks, transformers)
Implementing additional text preprocessing techniques
Creating a web interface with Flask or Streamlit
Adding support for multi-class sentiment (very negative, negative, neutral, positive, very positive)
Implementing real-time sentiment analysis

📄 License

This project is for educational purposes. Use of the IMDb dataset is subject to the terms under which it is distributed through the Hugging Face datasets library.

Happy sentiment analyzing! 🎭
