A complete machine learning pipeline for analyzing sentiment in movie reviews using scikit-learn and natural language processing techniques.

Install dependencies:
pip install -r requirements.txt

Run the interactive demo:
python src/interactive_demo.py

Or run the main analysis script:
python src/sentiment_analysis.py

imdb_ds/
├── src/ # Source code directory
│ ├── sentiment_analysis.py # Main sentiment analysis pipeline
│ ├── data_utils.py # Data loading and preprocessing utilities
│ ├── visualizations.py # Visualization and plotting functions
│ ├── interactive_demo.py # Interactive demo interface
│ └── test_pipeline.py # Test suite
├── docs/ # Documentation directory
│ └── PROJECT_OVERVIEW.md # Project status and details
├── models/ # Trained model storage
│ └── sentiment_model.pkl # Saved model file
├── requirements.txt # Python dependencies
├── CLAUDE.md # Development guidance for Claude Code
└── README.md # This file
- Multiple Data Sources: Load data from Hugging Face datasets, CSV files, or built-in sample data
- Advanced Preprocessing: Text cleaning, tokenization, stemming, and stopword removal
- Model Training: Logistic Regression and Naive Bayes classifiers with cross-validation
- Comprehensive Evaluation: Accuracy, AUC, confusion matrix, and feature importance analysis
- Rich Visualizations: ROC curves, word clouds, feature importance plots, and more
- Interactive Testing: Test sentiment predictions on custom movie reviews
- Model Persistence: Save and load trained models
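
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal standalone illustration, not the project's actual `clean_text` method: the stopword list is a tiny hand-picked assumption (the project presumably uses NLTK's fuller list), and stemming is left as an optional `stem` callable so the sketch runs with the standard library alone.

```python
import re

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "the", "is", "was", "it", "this", "of", "to", "in", "with"}


def clean_text(text, stem=None):
    """Lowercase, strip HTML tags and non-letters, drop stopwords, optionally stem."""
    text = re.sub(r"<[^>]+>", " ", text.lower())  # remove HTML remnants such as <br />
    text = re.sub(r"[^a-z\s]", " ", text)         # keep only letters and whitespace
    tokens = [t for t in text.split() if t not in STOPWORDS]
    if stem is not None:
        tokens = [stem(t) for t in tokens]        # e.g. a Porter stemmer's stem function
    return " ".join(tokens)


print(clean_text("This movie was <br /> absolutely AMAZING!!!"))
# movie absolutely amazing
```

To add real stemming, one option is to pass `nltk.stem.PorterStemmer().stem` as the `stem` argument.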
from src.sentiment_analysis import MovieReviewSentimentAnalyzer
# Initialize analyzer
analyzer = MovieReviewSentimentAnalyzer()
# Load sample data and train
reviews, labels = analyzer.load_sample_data()
X_train, X_test, y_train, y_test = analyzer.prepare_data(reviews, labels)
analyzer.train_model(X_train, y_train)
# Test prediction
result = analyzer.predict_sentiment("This movie was absolutely amazing!")
print(f"Sentiment: {result['sentiment']} (confidence: {result['confidence']:.3f})")

from src.data_utils import load_imdb_dataset
# Load 5000 reviews from IMDb dataset
reviews, labels = load_imdb_dataset(subset_size=5000)

from src.visualizations import create_comprehensive_report
# Generate complete analysis report
create_comprehensive_report(analyzer, X_test, y_test, reviews, labels)

The model uses TF-IDF vectorization with the following features:
- Vectorizer: TF-IDF with 1-2 word n-grams
- Classifier: Logistic Regression with L2 regularization
- Preprocessing: Stemming, stopword removal, and text cleaning
- Expected Accuracy: 85-90% on IMDb dataset
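
For readers who want to see the configuration above in plain scikit-learn terms, here is a rough sketch. It is an assumption about how such a model could be assembled, not a copy of what `MovieReviewSentimentAnalyzer` does internally; `build_pipeline` and the four toy reviews are hypothetical and only there to make the example runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_pipeline():
    """TF-IDF over unigrams and bigrams, followed by logistic regression."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),   # 1-2 word n-grams
        LogisticRegression(max_iter=1000),     # L2 regularization is scikit-learn's default
    )


# Toy data for illustration only (1 = positive, 0 = negative).
toy_reviews = [
    "great acting and a wonderful plot",
    "wonderful film, great fun",
    "terrible film, boring plot",
    "boring and terrible acting",
]
toy_labels = [1, 1, 0, 0]

model = build_pipeline().fit(toy_reviews, toy_labels)
# predict_proba returns one row per input: [P(negative), P(positive)]
p_great = model.predict_proba(["great"])[0][1]
p_terrible = model.predict_proba(["terrible"])[0][1]
print(p_great > p_terrible)
```

Wrapping the vectorizer and classifier in one pipeline means raw text goes in at prediction time, and the same vocabulary learned during training is reused automatically.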
The project includes comprehensive visualizations:
- Confusion matrices
- ROC curves and AUC scores
- Feature importance (most positive/negative words)
- Word clouds for positive and negative reviews
- Review length distributions
- Prediction confidence analysis
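
The numbers behind the confusion matrix and ROC/AUC plots can be computed with `sklearn.metrics`. The sketch below uses six made-up labels and probabilities purely for illustration; they are not results from this project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Made-up ground truth and predicted P(positive) values, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.92, 0.12, 0.65, 0.40, 0.55, 0.08]

# Threshold the probabilities at 0.5 to get hard predictions.
y_pred = [int(p >= 0.5) for p in y_prob]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
auc = roc_auc_score(y_true, y_prob)     # uses the probabilities, not the hard predictions

print(f"accuracy={acc:.3f}  auc={auc:.3f}")
# accuracy=0.667  auc=0.889
print(cm.tolist())
# [[2, 1], [1, 2]]
```

Accuracy and the confusion matrix depend on the chosen threshold, while AUC measures how well the probabilities rank positives above negatives across all thresholds.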
Run python src/interactive_demo.py for a user-friendly interface that provides:
- Quick training with sample data
- Training with real IMDb dataset
- Interactive sentiment testing
- Visualization report generation
- Model saving and loading
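
Model saving and loading could look something like the sketch below. It assumes the `.pkl` file is written with Python's `pickle` module, which the extension suggests but this README does not confirm; `save_model` and `load_model` are hypothetical helper names, and a plain dictionary stands in for the trained model.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize any picklable object to `path`, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load an object previously written by save_model."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# A stand-in for the trained model (assumption: the real object is picklable).
stand_in = {"model_type": "naive_bayes", "vectorizer_type": "count"}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "sentiment_model.pkl"
    save_model(stand_in, target)
    restored = load_model(target)

print(restored == stand_in)
# True
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.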

- pandas: Data manipulation and analysis
- scikit-learn: Machine learning algorithms and evaluation
- nltk: Natural language processing tools
- numpy: Numerical computing
- matplotlib & seaborn: Data visualization
- datasets: Hugging Face datasets (for real IMDb data)
- wordcloud: Word cloud generation
- tqdm: Progress bars

Review: "This movie was absolutely fantastic! Great acting and plot."
→ Sentiment: Positive (confidence: 0.924)
Review: "Terrible film. Boring and predictable storyline."
→ Sentiment: Negative (confidence: 0.876)
Review: "Decent movie with some good moments."
→ Sentiment: Positive (confidence: 0.651)
# Try Naive Bayes instead of Logistic Regression
analyzer.train_model(X_train, y_train, model_type='naive_bayes')
# Use Count Vectorizer instead of TF-IDF
analyzer.train_model(X_train, y_train, vectorizer_type='count')

# Modify the clean_text method in MovieReviewSentimentAnalyzer
# to customize preprocessing steps

This project demonstrates:
- Text preprocessing and feature extraction
- Binary classification with scikit-learn
- Model evaluation and cross-validation
- Data visualization with matplotlib/seaborn
- Interactive application development
- Model persistence and deployment
Feel free to extend this project by:
- Adding more sophisticated models (neural networks, transformers)
- Implementing additional text preprocessing techniques
- Creating a web interface with Flask/Streamlit
- Adding support for multi-class sentiment (very negative, negative, neutral, positive, very positive)
- Implementing real-time sentiment analysis
This project is for educational purposes. The IMDb dataset usage follows the terms of the Hugging Face datasets library.
Happy sentiment analyzing! 🎭