
ipdelete/imdb_ds

🎬 IMDb Movie Review Sentiment Analysis

A complete machine learning pipeline for analyzing sentiment in movie reviews using scikit-learn and natural language processing techniques.

🚀 Quick Start

  1. Install dependencies:

    pip install -r requirements.txt
  2. Run the interactive demo:

    python src/interactive_demo.py
  3. Or run the main analysis script:

    python src/sentiment_analysis.py

📁 Project Structure

imdb_ds/
├── src/                    # Source code directory
│   ├── sentiment_analysis.py    # Main sentiment analysis pipeline
│   ├── data_utils.py           # Data loading and preprocessing utilities
│   ├── visualizations.py       # Visualization and plotting functions
│   ├── interactive_demo.py     # Interactive demo interface
│   └── test_pipeline.py        # Test suite
├── docs/                   # Documentation directory
│   └── PROJECT_OVERVIEW.md    # Project status and details
├── models/                 # Trained model storage
│   └── sentiment_model.pkl    # Saved model file
├── requirements.txt        # Python dependencies
├── CLAUDE.md              # Development guidance for Claude Code
└── README.md              # This file

🎯 Features

  • Multiple Data Sources: Load data from Hugging Face datasets, CSV files, or use sample data
  • Advanced Preprocessing: Text cleaning, tokenization, stemming, and stopword removal
  • Model Training: Logistic Regression and Naive Bayes classifiers with cross-validation
  • Comprehensive Evaluation: Accuracy, AUC, confusion matrix, and feature importance analysis
  • Rich Visualizations: ROC curves, word clouds, feature importance plots, and more
  • Interactive Testing: Test sentiment predictions on custom movie reviews
  • Model Persistence: Save and load trained models
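
As a rough illustration of the preprocessing steps listed above (cleaning, tokenization, stemming, stopword removal), here is a minimal standard-library sketch. It is not the project's `clean_text` implementation: the `STOPWORDS` set, `naive_stem`, and `preprocess` are hypothetical names, and the suffix stripper is only a crude stand-in for a real stemmer such as the ones nltk provides.

```python
import re

# Small illustrative stopword list (an assumption); nltk ships a much
# larger English stopword corpus.
STOPWORDS = {"a", "an", "and", "the", "was", "is", "it", "this", "of", "to", "in"}


def naive_stem(token: str) -> str:
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "ly", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(review: str) -> list[str]:
    """Clean a raw review and return a list of stemmed, non-stopword tokens."""
    text = re.sub(r"<[^>]+>", " ", review)          # drop HTML tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # lowercase, keep letters only
    tokens = text.split()                           # whitespace tokenization
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


tokens = preprocess("This movie was <br /> absolutely AMAZING, loved the acting!")
print(tokens)  # ['movie', 'absolute', 'amaz', 'lov', 'act']
```

The truncated stems ("amaz", "lov") are typical of stemming: the goal is to map related word forms to one feature, not to produce dictionary words.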

🛠️ Usage Examples

Basic Usage

from src.sentiment_analysis import MovieReviewSentimentAnalyzer

# Initialize analyzer
analyzer = MovieReviewSentimentAnalyzer()

# Load sample data and train
reviews, labels = analyzer.load_sample_data()
X_train, X_test, y_train, y_test = analyzer.prepare_data(reviews, labels)
analyzer.train_model(X_train, y_train)

# Test prediction
result = analyzer.predict_sentiment("This movie was absolutely amazing!")
print(f"Sentiment: {result['sentiment']} (confidence: {result['confidence']:.3f})")

Loading Real IMDb Data

from src.data_utils import load_imdb_dataset

# Load 5000 reviews from IMDb dataset
reviews, labels = load_imdb_dataset(subset_size=5000)

Creating Visualizations

from src.visualizations import create_comprehensive_report

# Generate complete analysis report
create_comprehensive_report(analyzer, X_test, y_test, reviews, labels)

📊 Model Performance

The default model is configured as follows:

  • Vectorizer: TF-IDF with 1-2 word n-grams
  • Classifier: Logistic Regression with L2 regularization
  • Preprocessing: Stemming, stopword removal, and text cleaning
  • Expected Accuracy: 85-90% on the IMDb dataset
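
The configuration above can be approximated directly in scikit-learn. The sketch below is an assumption about how such a pipeline might be wired, not the code in `src/sentiment_analysis.py`; the four training sentences are invented for illustration and are far too few to reproduce the accuracy quoted above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-written dataset, for illustration only (1 = positive, 0 = negative).
reviews = [
    "great acting and a wonderful plot",
    "absolutely fantastic film",
    "boring and predictable storyline",
    "terrible film with awful acting",
]
labels = [1, 1, 0, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams
    LogisticRegression(max_iter=1000),    # scikit-learn's default penalty is L2
)
model.fit(reviews, labels)

# One row per input review, one column per class in model.classes_.
probabilities = model.predict_proba(["a wonderful film"])
print(model.classes_, probabilities)
```

Keeping the vectorizer and classifier in one pipeline means the same fitted vocabulary is applied at prediction time, which avoids a common train/test mismatch.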

📈 Visualizations

The project includes comprehensive visualizations:

  • Confusion matrices
  • ROC curves and AUC scores
  • Feature importance (most positive/negative words)
  • Word clouds for positive and negative reviews
  • Review length distributions
  • Prediction confidence analysis
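
For readers unfamiliar with the numbers behind the confusion matrix and ROC/AUC plots, the following standard-library sketch computes them by hand on four made-up predictions. The function names are hypothetical; in practice `sklearn.metrics` offers equivalent functions, and this sketch does not claim to match how `src/visualizations.py` computes them.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    return tn, fp, fn, tp


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]             # predicted P(positive), made up
y_pred = [int(s >= 0.5) for s in scores]   # threshold at 0.5

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tn + tp) / len(y_true)
auc = roc_auc(y_true, scores)
print((tn, fp, fn, tp), accuracy, auc)  # (2, 0, 1, 1) 0.75 0.75
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why the two are reported separately.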

🎮 Interactive Demo

Run python src/interactive_demo.py for a user-friendly interface that provides:

  • Quick training with sample data
  • Training with real IMDb dataset
  • Interactive sentiment testing
  • Visualization report generation
  • Model saving and loading
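
Model saving and loading can be sketched with Python's built-in `pickle` module. The `.pkl` extension of `models/sentiment_model.pkl` suggests pickle, but that is an assumption, and `model_state` below is a stand-in dictionary rather than the project's actual model object. A fitted scikit-learn estimator can be round-tripped the same way.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model (hypothetical contents).
model_state = {"vocabulary": {"great": 0, "boring": 1}, "weights": [1.7, -2.1]}

# A temporary directory keeps the example from touching the real models/ folder.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "sentiment_model.pkl"

    with model_path.open("wb") as f:
        pickle.dump(model_state, f)      # save

    with model_path.open("rb") as f:
        restored = pickle.load(f)        # load

print(restored == model_state)  # True
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.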

📦 Dependencies

  • pandas: Data manipulation and analysis
  • scikit-learn: Machine learning algorithms and evaluation
  • nltk: Natural language processing tools
  • numpy: Numerical computing
  • matplotlib & seaborn: Data visualization
  • datasets: Hugging Face datasets (for real IMDb data)
  • wordcloud: Word cloud generation
  • tqdm: Progress bars

🎯 Example Predictions

Review: "This movie was absolutely fantastic! Great acting and plot."
→ Sentiment: Positive (confidence: 0.924)

Review: "Terrible film. Boring and predictable storyline."
→ Sentiment: Negative (confidence: 0.876)

Review: "Decent movie with some good moments."
→ Sentiment: Positive (confidence: 0.651)
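
One plausible reading of the confidence values above is that they are the predicted probability of the chosen class. The sketch below illustrates that interpretation only; `label_with_confidence` is a hypothetical helper, and the source does not confirm that `predict_sentiment` works this way.

```python
def label_with_confidence(p_positive: float) -> dict:
    """Map P(positive) to a label and the probability of that label."""
    sentiment = "Positive" if p_positive >= 0.5 else "Negative"
    confidence = max(p_positive, 1.0 - p_positive)
    return {"sentiment": sentiment, "confidence": confidence}


# Probabilities chosen to mirror the three example reviews.
for p in (0.924, 0.124, 0.651):
    result = label_with_confidence(p)
    print(f"{result['sentiment']} (confidence: {result['confidence']:.3f})")
```

Under this reading, a confidence near 0.5 (as for the "Decent movie" review) means the classifier is close to undecided, so the label should be treated with caution.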

🔧 Customization

Training Different Models

# Try Naive Bayes instead of Logistic Regression
analyzer.train_model(X_train, y_train, model_type='naive_bayes')

# Use Count Vectorizer instead of TF-IDF
analyzer.train_model(X_train, y_train, vectorizer_type='count')
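
To show what options like `model_type` and `vectorizer_type` could map to, here is a hedged sketch using scikit-learn classes. `build_pipeline` and the two lookup tables are hypothetical, and the `'logistic_regression'` and `'tfidf'` keys are guesses; only `'naive_bayes'` and `'count'` appear in this README.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical option tables mapping string names to scikit-learn factories.
VECTORIZERS = {"tfidf": TfidfVectorizer, "count": CountVectorizer}
CLASSIFIERS = {
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB,
}


def build_pipeline(model_type="logistic_regression", vectorizer_type="tfidf"):
    """Build a vectorizer + classifier pipeline from two option strings."""
    vectorizer = VECTORIZERS[vectorizer_type](ngram_range=(1, 2))
    classifier = CLASSIFIERS[model_type]()
    return make_pipeline(vectorizer, classifier)


# Toy data, for illustration only (1 = positive, 0 = negative).
reviews = ["great wonderful film", "boring terrible film"]
labels = [1, 0]

nb_model = build_pipeline(model_type="naive_bayes", vectorizer_type="count")
nb_model.fit(reviews, labels)
print(nb_model.predict(["wonderful"]))
```

Multinomial Naive Bayes is usually paired with raw counts, though it also accepts TF-IDF weights, so both vectorizer choices are valid with either classifier.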

Custom Text Preprocessing

# Modify the clean_text method in MovieReviewSentimentAnalyzer
# to customize preprocessing steps

📚 Educational Value

This project demonstrates:

  • Text preprocessing and feature extraction
  • Binary classification with scikit-learn
  • Model evaluation and cross-validation
  • Data visualization with matplotlib/seaborn
  • Interactive application development
  • Model persistence and deployment

🤝 Contributing

Feel free to extend this project by:

  • Adding more sophisticated models (neural networks, transformers)
  • Implementing additional text preprocessing techniques
  • Creating a web interface with Flask/Streamlit
  • Adding support for multi-class sentiment (very negative, negative, neutral, positive, very positive)
  • Implementing real-time sentiment analysis

📄 License

This project is for educational purposes. Use of the IMDb dataset is subject to the terms under which it is distributed through the Hugging Face datasets library.


Happy sentiment analyzing! 🎭
