louiseluli/CineScope

🎬 CineScope

Personal Cinema Analytics Platform: deep statistical analysis of 2,289 watched films through multi-source data enrichment and advanced visualization.

CineScope transforms personal cinema data into actionable insights through comprehensive enrichment pipelines and rigorous statistical analysis. This ongoing project demonstrates advanced data engineering, API integration, and visualization techniques applied to a personal watch history of 2,289 films.


📊 Project Overview

Core Data:

  • 2,289 watched films with 113+ enriched columns
  • 42,630 people (actors, directors, crew) with extended biographical data
  • 8,147 unique keywords across the collection
  • 99.9% enrichment coverage for critical metadata fields

Data Sources:

  • IMDb: Ratings, cast, crew, non-commercial datasets
  • TMDB: Budget, revenue, keywords, cinematographers, composers
  • OMDb: Awards, additional metadata
  • Wikidata: Biographical data (education, death, family)
  • DoesTheDogDie: Content warnings

🎯 Key Capabilities

  • Multi-source enrichment with intelligent fallbacks and caching
  • Statistical analysis (t-tests, correlation, entropy, Gini coefficient)
  • Network analysis of collaborations and keyword co-occurrence
  • Financial intelligence (ROI analysis, profitability patterns)
  • Biographical insights (mortality patterns, education, legacy)
  • Advanced visualizations (60+ charts across 12 batches at 300 DPI)

🔬 Analysis Batches (Ongoing)

CineScope is organized into modular "batches", each exploring a specific analytical dimension.

✅ Recently Completed Batches

Batch 11: Behind the Camera

Cinematographers & Composers: the invisible artists shaping cinema

Insights:

  • 1,041 cinematographers (95.5% coverage) | 1,633 composers (93.3% coverage)
  • Director-DP collaboration heatmaps reveal auteur signatures
  • Genre preferences show cinematographers specialize while composers diversify

Batch 12: Mortality & Legacy

Life spans, death patterns, and legacy in cinema

Insights:

  • 11,497 deceased (27% of dataset) | 31,133 living (73%)
  • Average age at death: 73.9 years (median: 76.0 years, σ = 14.5)
  • 101 centenarians documented | Oldest: 117 years
  • Lifespan increasing +0.2 years per decade by birth cohort
  • Top causes: Myocardial infarction (751), cancer (454), pneumonia (251)


Batch 18: International Cinema

Language diversity and cultural representation

Insights:

  • 23 languages across 2,287 films (99.9% coverage)
  • 89% English (2,037) vs 11% non-English (252 films)
  • European cinema averages 6.67 rating vs North American 6.38
  • Top non-English: French (50), German (29), Chinese (29), Korean (25)


Batch 22: Financial Intelligence

Budget, revenue, and profitability analysis

Insights:

  • $148.27B total box office | $41.21B total budget
  • 82.7% profitable (1,116 out of 1,349 films with complete data)
  • Average ROI: 688.5% (median: 191.7%) | Total profit: $106.46B
  • Weak budget-rating correlation (r = 0.12): Money ≠ Quality
  • Low-budget films achieve exceptional ROI (>10,000% documented)
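
The ROI and profitability figures above can be reproduced along these lines. This is a sketch under assumptions: ROI is taken to mean (revenue - budget) / budget, films without both values are excluded, and the sample numbers are invented for illustration.

```python
from statistics import mean, median

# Hypothetical budget/revenue pairs in USD; not taken from the dataset.
films = [
    {"title": "Film A", "budget": 1_000_000, "revenue": 5_000_000},
    {"title": "Film B", "budget": 50_000_000, "revenue": 40_000_000},
    {"title": "Film C", "budget": 10_000_000, "revenue": 30_000_000},
    {"title": "Film D", "budget": None, "revenue": 2_000_000},  # incomplete data
]


def roi_percent(budget, revenue):
    """Return on investment as a percentage of the budget."""
    return (revenue - budget) / budget * 100


# Keep only films where both budget and revenue are present and non-zero.
complete = [f for f in films if f["budget"] and f["revenue"]]
rois = [roi_percent(f["budget"], f["revenue"]) for f in complete]

mean_roi = mean(rois)
median_roi = median(rois)
profitable_share = sum(r > 0 for r in rois) / len(rois) * 100
```

Reporting the median next to the mean matters here: a handful of low-budget hits with very high ROI pull the mean far above the typical film, which matches the gap between the average and median ROI listed above.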


Batch 26: Keyword Deep-Dive

Advanced keyword intelligence with statistical rigor

Insights:

  • 8,147 unique keywords analyzed | 25,779 total occurrences
  • 91.3% diversity score (Shannon entropy)
  • "Film noir" is associated with the highest average rating (7.70, p < 0.05)
  • 1,417 keywords correlated with high ratings (>7.5)
  • Statistical significance testing for prediction power
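
One way a keyword such as "film noir" could be tested against the rest of the collection is a two-sample t-test on ratings. The sketch below computes Welch's t statistic with the standard library only; the choice of Welch's variant and the sample ratings are assumptions, not the project's confirmed method.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical ratings for films with and without a given keyword.
with_keyword = [7.9, 7.5, 8.1, 7.4, 7.6]
without_keyword = [6.2, 6.8, 7.0, 5.9, 6.5, 6.4]

t_stat, dof = welch_t(with_keyword, without_keyword)
```

Turning the statistic into a p-value needs the t distribution; with SciPy installed, `scipy.stats.ttest_ind(with_keyword, without_keyword, equal_var=False)` returns the statistic and the p-value in one call.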


📋 Earlier Batches (Completed)

Batch 1-10: Core Analytics

Batch 1: Quantified Self

Rating distribution, decade patterns, runtime sweet spots


Batch 2: Content Genome

Cast analysis, gender balance, representation metrics


Batch 4-5: Directors & Genres

Director patterns, genre evolution, hybrid analysis


Batch 6-7: Production & Critics

Studio economics, awards analysis, critical alignment


Batch 10: Keywords & Themes

Keyword frequency, theme distribution, genre patterns


🛠️ Technical Architecture

Data Pipeline

IMDb Export → Multi-Source Enrichment → Statistical Analysis → Visualization
                         ↓                        ↓                  ↓
                  (TMDB + OMDb            (Pandas/NumPy/SciPy)   (Matplotlib/
                   + Wikidata             T-tests, Correlation    Seaborn/Plotly
                   + Caching)             Network Analysis)       300 DPI PNG)

Tech Stack

Backend:

  • Python 3.11+ (pandas, numpy, scipy, scikit-learn)
  • Flask 2.0+ (REST API with 40+ endpoints)
  • CSV-based storage (no database; portable and inspectable)

Frontend:

  • React 19 with TypeScript
  • TanStack Query for server state
  • Tailwind CSS for responsive design
  • Recharts for interactive visualizations

Analysis:

  • Statistical testing (t-tests, p-values, correlation matrices)
  • Diversity and concentration metrics (Shannon entropy, Gini coefficient, Lorenz curves)
  • Network analysis (NetworkX for collaboration graphs)
  • Machine learning (k-means clustering, similarity scoring)
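
As an illustration of the keyword co-occurrence side of the network analysis, the following sketch counts how often two keywords appear on the same film. The sample keyword lists and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists per film.
film_keywords = {
    "Film A": ["film noir", "detective", "murder"],
    "Film B": ["detective", "murder"],
    "Film C": ["film noir", "murder"],
}


def cooccurrence_edges(keywords_by_film):
    """Count, for every keyword pair, the number of films where both appear."""
    edges = Counter()
    for keywords in keywords_by_film.values():
        # Sort and de-duplicate so each pair has one canonical (a, b) ordering.
        for a, b in combinations(sorted(set(keywords)), 2):
            edges[(a, b)] += 1
    return edges


edges = cooccurrence_edges(film_keywords)
strongest = edges.most_common(1)[0]
```

The resulting counts map directly onto a weighted graph: each pair is an edge and the count is its weight, which is the form NetworkX accepts through `Graph.add_weighted_edges_from`.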

📈 Data Quality & Coverage

| Field             | Coverage | Count         | Source      |
|-------------------|----------|---------------|-------------|
| Budget            | 63.8%    | 1,460 films   | TMDB        |
| Revenue           | 65.7%    | 1,505 films   | TMDB        |
| Keywords          | 99.9%    | 2,287 films   | TMDB        |
| Languages         | 99.9%    | 2,287 films   | TMDB + OMDb |
| People (Extended) | 63.9%    | 27,254 people | Wikidata    |
| Death Data        | 27.0%    | 11,497 people | Wikidata    |
| Education         | 25.5%    | 10,877 people | Wikidata    |
| Cinematographers  | 95.5%    | 2,186 films   | TMDB        |
| Composers         | 93.3%    | 2,135 films   | TMDB        |

🚀 Getting Started

Prerequisites

# Python 3.11+
pip install -r requirements.txt

# Node 16+ (for web UI)
npm install

Run Analysis Batches

# Behind the Camera
python scripts/batch_11_behind_the_camera.py

# Mortality & Legacy
python scripts/batch_12_mortality_legacy.py

# International Cinema
python scripts/batch_18_international_cinema.py

# Financial Intelligence
python scripts/batch_22_financial_intelligence.py

# Keyword Deep-Dive
python scripts/batch_26_keyword_deepdive.py

Output: analysis_outputs/visualizations/batch_XX/ (300 DPI PNGs)

Run Web Application

# Backend (Flask API)
cd api && python app.py  # http://localhost:5001

# Frontend (React)
cd ui && npm run dev     # http://localhost:3000

📁 Project Structure

CineScope/
├── data/
│   ├── raw/                          # IMDb exports, personal data
│   └── processed/
│       ├── watched_movies_master.csv # 2,289 films × 113 columns
│       ├── people_cache.json         # 42,630 people with enrichment
│       └── keywords_cache.json       # 2,287 movies with keywords
├── scripts/
│   ├── enrich/                       # Data enrichment pipeline
│   │   ├── 01_enrich_tmdb.py
│   │   ├── 07_enrich_people_extended.py
│   │   └── 08_enrich_keywords.py
│   └── batch_*.py                    # Analysis batches (modular)
├── analysis_outputs/
│   ├── visualizations/               # Generated charts (300 DPI)
│   └── reports/                      # Text reports
├── api/                              # Flask REST API
└── ui/                               # React TypeScript frontend

📊 Statistical Highlights

Diversity Metrics

  • Shannon Entropy: 11.86 (91.3% of theoretical maximum)
  • Gini Coefficient: 0.566 (moderate keyword concentration)
  • Language Diversity: 23 languages across 6 continents
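
The two keyword metrics above can be sketched as follows. The reported numbers are consistent with entropy measured in bits and normalized by log2 of the number of unique keywords (11.86 / log2(8,147) ≈ 11.86 / 12.99 ≈ 0.913), but that normalization is inferred from the figures rather than confirmed by the source. The Gini formula below is the standard rank-weighted form; the function names are illustrative.

```python
from math import log2


def shannon_entropy(counts):
    """Shannon entropy in bits of a frequency distribution."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def normalized_entropy(counts):
    """Entropy as a fraction of its maximum, log2(number of non-empty categories)."""
    k = sum(1 for c in counts if c > 0)
    return shannon_entropy(counts) / log2(k) if k > 1 else 0.0


def gini(counts):
    """Gini coefficient: 0 means perfectly even, values near 1 mean concentrated."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical keyword occurrence counts.
even = [5, 5, 5, 5]
skewed = [17, 1, 1, 1]
```

With the evenly spread counts the normalized entropy is 1 and the Gini coefficient is 0; the skewed counts move both metrics in the opposite direction, which is why the two are usually read together.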

Financial Intelligence

  • Total Analyzed: $41.21B budget | $148.27B revenue
  • Aggregate Profit: $106.46B across 1,349 films
  • Profitability Rate: 82.7% (1,116 profitable films)
  • Best ROI: 99,900% (extreme outlier, likely low-budget success)

Mortality Patterns

  • Mean Lifespan: 73.9 years (σ = 14.5)
  • Centenarians: 101 people (0.9% of deceased)
  • Longevity Trend: +0.2 years per decade by birth cohort
  • Geographic: Hollywood concentration (Forest Lawn: 498 burials)

🔮 Roadmap (Ongoing)

Planned Batches

  • Batch 13: Education & Origins (universities, birthplaces)
  • Batch 14: Family & Relationships (spouses, children, dynasties)
  • Batch 15: Height & Physical Attributes (casting patterns)
  • Batch 29: Awards Deep-Dive (parse Oscar/BAFTA/Cannes data)
  • Batch 31: Network Analysis (graph theory, centrality measures)
  • Batch 32: ML Recommendations (collaborative filtering)

Future Enhancements

  • PostgreSQL migration (currently CSV-based for portability)
  • Real-time API updates with webhooks
  • PDF report exports for batch analyses
  • Social sharing of insights and visualizations
  • Advanced ML models (neural collaborative filtering)

📝 Methodology

Data Enrichment Process

  1. IMDb export → Basic metadata (title, year, rating)
  2. TMDB API → Budget, revenue, cast, crew, keywords
  3. OMDb API → Awards, additional metadata
  4. Wikidata SPARQL → Biographical data (death, education, family)
  5. Derived fields → Zodiac signs, age calculations, diversity scores
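
The steps above, combined with the fallback and caching behavior described earlier, can be sketched as a loop over sources in priority order. Everything here is illustrative: the `enrich` function, the stand-in fetchers, and the placeholder film ID are hypothetical, and no real API calls are made.

```python
def enrich(film_id, sources, cache):
    """Merge metadata from several sources, filling only fields still missing.

    `sources` is a list of callables tried in priority order; a source that
    raises is skipped so later sources can act as fallbacks.
    """
    if film_id in cache:
        return cache[film_id]  # cache hit: no source is queried again
    merged = {}
    for fetch in sources:
        try:
            data = fetch(film_id) or {}
        except Exception:
            continue  # treat any failure as "this source had nothing"
        for key, value in data.items():
            if value is not None and key not in merged:
                merged[key] = value
    cache[film_id] = merged
    return merged


# Stand-in fetchers that mimic three sources without touching the network.
def broken_source(film_id):
    raise TimeoutError("simulated outage")


def fake_tmdb(film_id):
    return {"budget": 1_000_000, "awards": None}


def fake_omdb(film_id):
    return {"awards": "1 win", "budget": 999}


cache = {}
record = enrich("tt0000001", [broken_source, fake_tmdb, fake_omdb], cache)
```

In this sketch the first source that supplies a non-null value wins, so the order of `sources` encodes which provider is trusted most for overlapping fields. An in-memory dict stands in for the cache; a JSON file on disk would serve the same role across runs.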

Statistical Rigor

  • Hypothesis testing: T-tests with p < 0.05 threshold
  • Correlation analysis: Pearson correlation coefficients
  • Diversity metrics: Shannon entropy, Gini coefficient, Lorenz curves
  • Trend analysis: Linear regression, moving averages
  • Network analysis: Graph centrality, community detection

Quality Assurance

  • Null handling with intelligent fallbacks
  • Outlier detection and filtering (age: 10-120 years)
  • Data validation at each enrichment step
  • Statistical significance thresholds enforced
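
A minimal sketch of the null handling and age filter, assuming the 10-120 range applies to age at death computed from birth and death dates (the function names are illustrative):

```python
from datetime import date

MIN_AGE, MAX_AGE = 10, 120  # bounds quoted in the list above


def age_at_death(born, died):
    """Whole years lived, or None when either date is missing."""
    if born is None or died is None:
        return None
    years = died.year - born.year
    # Subtract one year if the final birthday had not been reached yet.
    if (died.month, died.day) < (born.month, born.day):
        years -= 1
    return years


def is_plausible(age):
    """True only for known ages inside the accepted range."""
    return age is not None and MIN_AGE <= age <= MAX_AGE


# Hypothetical records: a valid lifespan, a suspicious one, and a missing date.
records = [
    (date(1900, 5, 1), date(1980, 4, 30)),
    (date(1950, 1, 1), date(1955, 1, 1)),
    (date(1930, 7, 4), None),
]
ages = [age_at_death(born, died) for born, died in records]
kept = [a for a in ages if is_plausible(a)]
```

Records that fail the check are dropped from the lifespan statistics rather than corrected, which keeps obviously wrong dates from skewing the mean.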

🤝 Contributing

This is a personal learning project, but feedback is welcome:

  • Issues: Bug reports or feature suggestions
  • Discussions: Analytical approaches or visualization ideas
  • Forks: Adapt to your own cinema data

📜 License & Data Sources

Educational/Personal Use

Movie data sources:

  • IMDb (personal export, non-commercial use)
  • TMDB (API usage compliant with terms)
  • OMDb (API usage compliant with terms)
  • Wikidata (CC0 public domain license)

🙏 Acknowledgments

  • TMDB for comprehensive API access
  • Wikidata for biographical enrichment
  • Python data science ecosystem (pandas, scipy, matplotlib, seaborn)
  • React community for excellent UI libraries

📫 Project Status

🟠 Ongoing Development (Active as of December 2025)


Transforming personal cinema history into statistical insights, one batch at a time.

Cinema Keywords Word Cloud
