Personal Cinema Analytics Platform: deep statistical analysis of 2,289 watched films through multi-source data enrichment and advanced visualization.
CineScope transforms personal cinema data into actionable insights through comprehensive enrichment pipelines and rigorous statistical analysis. This ongoing project demonstrates advanced data engineering, API integration, and visualization techniques applied to a personal watch history of 2,289 films.
Core Data:
- 2,289 watched films with 113+ enriched columns
- 42,630 people (actors, directors, crew) with extended biographical data
- 8,147 unique keywords across the collection
- 99.9% enrichment coverage for critical metadata fields
Data Sources:
- IMDb: Ratings, cast, crew, non-commercial datasets
- TMDB: Budget, revenue, keywords, cinematographers, composers
- OMDb: Awards, additional metadata
- Wikidata: Biographical data (education, death, family)
- DoesTheDogDie: Content warnings
- Multi-source enrichment with intelligent fallbacks and caching
- Statistical analysis (t-tests, correlation, entropy, Gini coefficient)
- Network analysis of collaborations and keyword co-occurrence
- Financial intelligence (ROI analysis, profitability patterns)
- Biographical insights (mortality patterns, education, legacy)
- Advanced visualizations (60+ charts across 12 batches at 300 DPI)
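The "intelligent fallbacks and caching" behaviour listed above could look roughly like the sketch below. It is an illustration, not the project's code: `enrich_field`, `load_cache`, `save_cache`, and the two stub fetchers are hypothetical names, and the stubs stand in for real API clients so that the example runs offline.

```python
import json
from pathlib import Path
from typing import Callable, Optional

# Hypothetical type: a fetcher takes an IMDb ID and returns a value or None.
Fetcher = Callable[[str], Optional[str]]


def enrich_field(imdb_id: str, fetchers: list[tuple[str, Fetcher]],
                 cache: dict) -> Optional[dict]:
    """Return {"value", "source"} from the cache or the first source that answers."""
    if imdb_id in cache:
        return cache[imdb_id]
    for source_name, fetch in fetchers:
        try:
            value = fetch(imdb_id)
        except Exception:
            value = None  # a failing source counts as "no answer"; try the next one
        if value is not None:
            cache[imdb_id] = {"value": value, "source": source_name}
            return cache[imdb_id]
    return None  # misses are not cached, so a later run can retry


def load_cache(path: Path) -> dict:
    """Read a JSON cache file, or start empty if it does not exist yet."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_cache(path: Path, cache: dict) -> None:
    """Persist the cache so repeated runs skip already-enriched films."""
    path.write_text(json.dumps(cache, indent=2))


# Stub sources (assumed for illustration; real clients would call the APIs).
def tmdb_language(imdb_id: str) -> Optional[str]:
    return {"tt0000001": "en"}.get(imdb_id)


def omdb_language(imdb_id: str) -> Optional[str]:
    return {"tt0000002": "fr"}.get(imdb_id)


sources = [("TMDB", tmdb_language), ("OMDb", omdb_language)]
cache: dict = {}
first = enrich_field("tt0000001", sources, cache)    # answered by the first source
second = enrich_field("tt0000002", sources, cache)   # falls back to the second
missing = enrich_field("tt0000003", sources, cache)  # no source has a value
```

Recording the source next to each value keeps the CSV and JSON outputs inspectable: it is always possible to see which provider filled a given field.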
CineScope is organized into modular "batches", each exploring a specific analytical dimension.
Cinematographers & Composers: the invisible artists shaping cinema
Insights:
- 1,041 cinematographers (95.5% coverage) | 1,633 composers (93.3% coverage)
- Director-DP collaboration heatmaps reveal auteur signatures
- Genre preferences show cinematographers specialize while composers diversify
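A director-DP collaboration heatmap starts from a count of how often each pair worked together. The sketch below shows one way to build that matrix. The film rows and names are made up for illustration, and the real batch script may structure its data differently.

```python
from collections import Counter

# Hypothetical rows: (film title, director, cinematographer).
films = [
    ("Film A", "Director X", "DP 1"),
    ("Film B", "Director X", "DP 1"),
    ("Film C", "Director X", "DP 2"),
    ("Film D", "Director Y", "DP 2"),
]

# Count films per (director, cinematographer) pair.
pair_counts = Counter((director, dp) for _, director, dp in films)

directors = sorted({director for _, director, _ in films})
dps = sorted({dp for _, _, dp in films})

# Dense matrix: one row per director, one column per cinematographer.
# Counter returns 0 for pairs that never collaborated.
matrix = [[pair_counts[(d, p)] for p in dps] for d in directors]
```

A matrix in this shape can be passed to a plotting function such as `seaborn.heatmap`, with `directors` and `dps` as the axis labels.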
Life spans, death patterns, and legacy in cinema
Insights:
- 11,497 deceased (27% of dataset) | 31,133 living (73%)
- Average age at death: 73.9 years (median: 76.0 years, σ = 14.5)
- 101 centenarians documented | Oldest: 117 years
- Lifespan increasing +0.2 years per decade by birth cohort
- Top causes: Myocardial infarction (751), cancer (454), pneumonia (251)
Language diversity and cultural representation
Insights:
- 23 languages across 2,287 films (99.9% coverage)
- 89% English (2,037) vs 11% non-English (252 films)
- European cinema averages 6.67 rating vs North American 6.38
- Top non-English: French (50), German (29), Chinese (29), Korean (25)
Budget, revenue, and profitability analysis
Insights:
- $148.27B total box office | $41.21B total budget
- 82.7% profitable (1,116 out of 1,349 films with complete data)
- Average ROI: 688.5% (median: 191.7%) | Total profit: $106.46B
- Weak budget-rating correlation (r = 0.12): Money ≠ Quality
- Low-budget films achieve exceptional ROI (>10,000% documented)
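The README does not spell out its ROI formula. A common definition, assumed here, is profit divided by budget, expressed as a percentage. Under that definition an ROI of 99,900% means revenue of 1,000 times the budget. The film rows below are invented for illustration.

```python
from statistics import mean, median

# Hypothetical rows; a missing budget or revenue excludes a film from ROI stats.
films = [
    {"title": "Film A", "budget": 1_000_000, "revenue": 3_000_000},
    {"title": "Film B", "budget": 50_000_000, "revenue": 40_000_000},
    {"title": "Film C", "budget": 10_000, "revenue": 10_000_000},
    {"title": "Film D", "budget": None, "revenue": 5_000_000},
]


def roi_percent(budget: float, revenue: float) -> float:
    """ROI as a percentage: (revenue - budget) / budget * 100."""
    return (revenue - budget) / budget * 100


# Keep only films with both figures present and non-zero.
complete = [f for f in films if f["budget"] and f["revenue"]]
rois = [roi_percent(f["budget"], f["revenue"]) for f in complete]

profitable_share = sum(r > 0 for r in rois) / len(rois) * 100
mean_roi = mean(rois)
median_roi = median(rois)
```

The gap between `mean_roi` and `median_roi` in the sample mirrors the gap in the reported figures: a few extreme low-budget successes pull the mean far above the median.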
Advanced keyword intelligence with statistical rigor
Insights:
- 8,147 unique keywords analyzed | 25,779 total occurrences
- 91.3% diversity score (Shannon entropy)
- "Film noir" predicts highest quality (avg 7.70, p < 0.05)
- 1,417 keywords correlated with high ratings (>7.5)
- Statistical significance testing for prediction power
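The diversity score can be read as Shannon entropy divided by its theoretical maximum, log2 of the number of distinct keywords. The sketch below assumes that definition and base-2 logarithms. This is consistent with the reported figures, since log2(8,147) ≈ 12.99 and 11.86 / 12.99 ≈ 0.913. The function names are illustrative.

```python
import math
from collections import Counter


def shannon_entropy_bits(counts: list[int]) -> float:
    """Shannon entropy in bits of a frequency distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def normalized_entropy(counts: list[int]) -> float:
    """Entropy as a fraction of the maximum for the observed category count."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # a single category carries no diversity
    return shannon_entropy_bits(counts) / math.log2(k)


# Hypothetical keyword occurrences: four keywords used equally often.
keyword_counts = Counter(
    ["noir", "noir", "heist", "heist", "robot", "robot", "war", "war"]
)
uniform = normalized_entropy(list(keyword_counts.values()))  # evenly spread
skewed = normalized_entropy([97, 1, 1, 1])                   # one dominant keyword

# Cross-check of the reported figures under the assumed definition.
reported_ratio = 11.86 / math.log2(8147)
```

A score near 1 means occurrences are spread almost evenly across keywords, while a score near 0 means a handful of keywords dominate.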
Batch 1-10: Core Analytics
Rating distribution, decade patterns, runtime sweet spots
Cast analysis, gender balance, representation metrics
Director patterns, genre evolution, hybrid analysis
Studio economics, awards analysis, critical alignment
Keyword frequency, theme distribution, genre patterns
IMDb Export → Multi-Source Enrichment → Statistical Analysis → Visualization
- Multi-Source Enrichment: TMDB + OMDb + Wikidata + caching
- Statistical Analysis: Pandas/NumPy/SciPy (t-tests, correlation, network analysis)
- Visualization: Matplotlib/Seaborn/Plotly (300 DPI PNG)
Backend:
- Python 3.11+ (pandas, numpy, scipy, scikit-learn)
- Flask 2.0+ (REST API with 40+ endpoints)
- CSV-based storage (no database; portable and inspectable)
Frontend:
- React 19 with TypeScript
- TanStack Query for server state
- Tailwind CSS for responsive design
- Recharts for interactive visualizations
Analysis:
- Statistical testing (t-tests, p-values, correlation matrices)
- Diversity and concentration metrics (Shannon entropy, Gini coefficient, Lorenz curves)
- Network analysis (NetworkX for collaboration graphs)
- Machine learning (k-means clustering, similarity scoring)
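For the Gini coefficient and Lorenz curve named above, a minimal standard-library sketch is shown below. It uses the usual sorted-rank formula for the Gini coefficient. The function names and sample counts are illustrative, and the project's own implementation may differ.

```python
def lorenz_curve(values: list[float]) -> list[tuple[float, float]]:
    """Points (share of items, cumulative share of total), sorted ascending."""
    ordered = sorted(values)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, value in enumerate(ordered, start=1):
        running += value
        points.append((i / len(ordered), running / total))
    return points


def gini(values: list[float]) -> float:
    """Gini coefficient: 0 for perfect equality, (n - 1) / n at the extreme."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum(rank * value for rank, value in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical keyword occurrence counts.
equal_use = gini([5, 5, 5, 5])        # every keyword used equally often
concentrated = gini([0, 0, 0, 10])    # one keyword accounts for everything
curve = lorenz_curve([1, 1, 2])
```

Plotting `curve` against the diagonal from (0, 0) to (1, 1) gives the Lorenz curve. The Gini coefficient is twice the area between the two.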
| Field | Coverage | Count | Source |
|---|---|---|---|
| Budget | 63.8% | 1,460 films | TMDB |
| Revenue | 65.7% | 1,505 films | TMDB |
| Keywords | 99.9% | 2,287 films | TMDB |
| Languages | 99.9% | 2,287 films | TMDB + OMDb |
| People (Extended) | 63.9% | 27,254 people | Wikidata |
| Death Data | 27.0% | 11,497 people | Wikidata |
| Education | 25.5% | 10,877 people | Wikidata |
| Cinematographers | 95.5% | 2,186 films | TMDB |
| Composers | 93.3% | 2,135 films | TMDB |
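The coverage figures in the table are the share of rows with a non-empty value for a field. The sketch below shows that calculation on a few invented rows. The `coverage` helper and the column names are hypothetical, and treating empty strings as missing is an assumption.

```python
def coverage(rows: list[dict], field: str) -> tuple[int, float]:
    """Return (filled count, percent of rows) for one column."""
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled, round(filled / len(rows) * 100, 1)


# Hypothetical rows as they might come out of csv.DictReader.
rows = [
    {"title": "Film A", "budget": "1000000", "keywords": "noir|heist"},
    {"title": "Film B", "budget": "", "keywords": "robot"},
    {"title": "Film C", "budget": None, "keywords": "war"},
    {"title": "Film D", "budget": "250000", "keywords": ""},
]

budget_filled, budget_pct = coverage(rows, "budget")
keywords_filled, keywords_pct = coverage(rows, "keywords")

# Cross-check against the table: 1,460 of 2,289 films have a budget.
table_budget_pct = round(1460 / 2289 * 100, 1)
```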
# Python 3.11+
pip install -r requirements.txt
# Node 16+ (for web UI)
npm install
# Behind the Camera
python scripts/batch_11_behind_the_camera.py
# Mortality & Legacy
python scripts/batch_12_mortality_legacy.py
# International Cinema
python scripts/batch_18_international_cinema.py
# Financial Intelligence
python scripts/batch_22_financial_intelligence.py
# Keyword Deep-Dive
python scripts/batch_26_keyword_deepdive.py
Output: analysis_outputs/visualizations/batch_XX/ (300 DPI PNGs)
# Backend (Flask API)
cd api && python app.py # http://localhost:5001
# Frontend (React)
cd ui && npm run dev # http://localhost:3000
CineScope/
├── data/
│   ├── raw/                          # IMDb exports, personal data
│   └── processed/
│       ├── watched_movies_master.csv # 2,289 films × 113 columns
│       ├── people_cache.json         # 42,630 people with enrichment
│       └── keywords_cache.json       # 2,287 movies with keywords
├── scripts/
│   ├── enrich/                       # Data enrichment pipeline
│   │   ├── 01_enrich_tmdb.py
│   │   ├── 07_enrich_people_extended.py
│   │   └── 08_enrich_keywords.py
│   └── batch_*.py                    # Analysis batches (modular)
├── analysis_outputs/
│   ├── visualizations/               # Generated charts (300 DPI)
│   └── reports/                      # Text reports
├── api/                              # Flask REST API
└── ui/                               # React TypeScript frontend
- Shannon Entropy: 11.86 (91.3% of theoretical maximum)
- Gini Coefficient: 0.566 (moderate keyword concentration)
- Language Diversity: 23 languages across 6 continents
- Total Analyzed: $41.21B budget | $148.27B revenue
- Aggregate Profit: $106.46B across 1,349 films
- Profitability Rate: 82.7% (1,116 profitable films)
- Best ROI: 99,900% (extreme outlier, likely low-budget success)
- Mean Lifespan: 73.9 years (σ = 14.5)
- Centenarians: 101 people (0.9% of deceased)
- Longevity Trend: +0.2 years per decade by birth cohort
- Geographic: Hollywood concentration (Forest Lawn: 498 burials)
- Batch 13: Education & Origins (universities, birthplaces)
- Batch 14: Family & Relationships (spouses, children, dynasties)
- Batch 15: Height & Physical Attributes (casting patterns)
- Batch 29: Awards Deep-Dive (parse Oscar/BAFTA/Cannes data)
- Batch 31: Network Analysis (graph theory, centrality measures)
- Batch 32: ML Recommendations (collaborative filtering)
- PostgreSQL migration (currently CSV-based for portability)
- Real-time API updates with webhooks
- PDF report exports for batch analyses
- Social sharing of insights and visualizations
- Advanced ML models (neural collaborative filtering)
- IMDb export: Basic metadata (title, year, rating)
- TMDB API: Budget, revenue, cast, crew, keywords
- OMDb API: Awards, additional metadata
- Wikidata SPARQL: Biographical data (death, education, family)
- Derived fields: Zodiac signs, age calculations, diversity scores
- Hypothesis testing: T-tests with p < 0.05 threshold
- Correlation analysis: Pearson correlation coefficients
- Diversity metrics: Shannon entropy, Gini coefficient, Lorenz curves
- Trend analysis: Linear regression, moving averages
- Network analysis: Graph centrality, community detection
- Null handling with intelligent fallbacks
- Outlier detection and filtering (age: 10-120 years)
- Data validation at each enrichment step
- Statistical significance thresholds enforced
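The hypothesis testing listed above compares ratings between groups, for example films with a given keyword against films without it. The README does not say which t-test variant is used. The sketch below assumes Welch's unequal-variance test and computes only the statistic and degrees of freedom with the standard library. The rating samples are invented.

```python
import math
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of each sample mean
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical ratings: films tagged with a keyword vs. films without it.
with_keyword = [7.5, 8.0, 7.8, 7.6, 8.1]
without_keyword = [6.0, 6.5, 7.0, 6.2, 6.8]

t_stat, dof = welch_t(with_keyword, without_keyword)
```

To turn the statistic into a p-value for the p < 0.05 threshold, `scipy.stats.ttest_ind(with_keyword, without_keyword, equal_var=False)` performs the same Welch test and returns both values.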
This is a personal learning project, but feedback is welcome:
- Issues: Bug reports or feature suggestions
- Discussions: Analytical approaches or visualization ideas
- Forks: Adapt to your own cinema data
Educational/Personal Use
Movie data sources:
- IMDb (personal export, non-commercial use)
- TMDB (API usage compliant with terms)
- OMDb (API usage compliant with terms)
- Wikidata (CC0 public domain license)
- TMDB for comprehensive API access
- Wikidata for biographical enrichment
- Python data science ecosystem (pandas, scipy, matplotlib, seaborn)
- React community for excellent UI libraries
Ongoing Development (Active as of December 2025)
Transforming personal cinema history into statistical insights, one batch at a time.





























