Transforming raw movie data into analytics-ready dimensional models using dbt
- Project Overview
- Business Problem
- Technical Architecture
- Skills Demonstrated
- Data Model
- Key Challenges Solved
- Project Structure
- How to Run
- Sample Queries
- Future Enhancements
This project demonstrates end-to-end analytics engineering by transforming raw movie database data into a production-ready star schema data warehouse. The pipeline processes semi-structured JSON data, handles complex many-to-many relationships, and creates dimensional models optimized for analytical queries.
What This Project Does:
- Extracts and stages raw movie and credit data from The Movie Database (TMDB)
- Flattens nested JSON structures containing cast, crew, genres, and production details
- Builds a comprehensive dimensional model following Kimball methodology
- Implements bridge tables to properly handle many-to-many relationships
- Creates analytics-ready fact and dimension tables for business intelligence
Movie studios, streaming platforms, and entertainment analysts need to answer questions like:
- Which actors generate the highest box office revenue?
- How do different genres perform across regions and time periods?
- What production companies consistently produce profitable films?
- Which crew members (directors, writers) are associated with highly-rated movies?
Challenge: Raw movie data comes in semi-structured formats with nested JSON, making direct analysis difficult and inefficient.
Solution: This pipeline transforms complex raw data into a clean, queryable star schema that enables fast, accurate analytics.
Raw Data Sources (JSON/CSV)
↓
Staging Layer
- stg_movies__raw_movies
- stg_movies__raw_credit
↓
Intermediate Layer
- Bridge Tables (7)
- Handle many-to-many relationships
↓
Mart Layer
- Dimension Tables (7)
- Fact Tables (1)
↓
Analytics & BI Tools
| Layer | Technology |
|---|---|
| Data Warehouse | Snowflake |
| Transformation | dbt Core |
| Version Control | Git/GitHub |
| Data Modeling | Dimensional Modeling (Kimball) |
| Testing | dbt Tests |
| Documentation | dbt Docs |
This project showcases proficiency in:
- ✅ dbt Development - Modular SQL transformations with Jinja templating
- ✅ Dimensional Modeling - Star schema design with fact and dimension tables
- ✅ Data Modeling Patterns - Bridge tables for many-to-many relationships
- ✅ Incremental Models - Efficient data processing strategies
- ✅ Data Quality - Comprehensive testing and validation
- ✅ Complex SQL - Window functions, CTEs, lateral flattening
- ✅ JSON Handling - Parsing and flattening nested semi-structured data
- ✅ Snowflake - Platform-specific functions (
FLATTEN,PARSE_JSON) - ✅ Performance Optimization - Materialization strategies and indexing
- ✅ Version Control - Git workflow with 75+ commits
- ✅ Code Organization - Modular, maintainable project structure
- ✅ Documentation - Schema definitions and model descriptions
- ✅ Testing - Data quality tests and assertions
- ✅ Requirements Translation - Converting business questions into data models
- ✅ Data Modeling - Designing schemas for analytical workloads
- ✅ Stakeholder Thinking - Creating models for diverse data consumers
fct_movies- Core movie metrics (revenue, budget, ratings, runtime)
dim_actor- Actor information and attributesdim_company- Production companiesdim_countries- Country detailsdim_crews- Crew members (directors, writers, etc.)dim_genres- Movie genresdim_keywords- Movie keywords and tagsdim_languages- Language information
movie_actor_bridge- Links movies to actorsmovie_company_bridge- Links movies to production companiesmovie_country_bridge- Links movies to countriesmovie_crew_bridge- Links movies to crew membersmovie_genre_bridge- Links movies to genresmovie_key_bridge- Links movies to keywordsmovie_language_bridge- Links movies to languages
Movies have many-to-many relationships with actors, genres, companies, etc. Bridge tables:
- Maintain referential integrity
- Enable flexible querying
- Support historical tracking
- Follow dimensional modeling best practices
Challenge: Raw credit data contained deeply nested JSON arrays for cast and crew.
Solution:
-- Flattening cast JSON using LATERAL FLATTEN
select
movie_id,
c.value:id::string as actor_id,
c.value:name::string as actor_name,
c.value:character::string as character_name
from raw_credit,
lateral flatten(input => parse_json(cast_json)) as cImpact: Transformed unusable nested data into queryable relational tables.
Challenge: Single movies have multiple actors, genres, countries, etc., and vice versa.
Solution: Implemented bridge tables following Kimball methodology to:
- Preserve all relationships without data duplication
- Enable accurate aggregations
- Maintain query performance
Impact: Created a flexible model that handles complex relationships accurately.
Challenge: Discovered 4,803 movies (representing $395B in revenue) without cast data.
Solution:
- Implemented data quality tests
- Added null-handling logic
- Created data quality monitoring queries
- Documented data gaps for stakeholders
Impact: Ensured transparency about data limitations and prevented misleading analytics.
├── analyses/ # Ad-hoc analytical queries
├── macros/ # Reusable SQL macros
├── models/
│ ├── staging/ # Raw data staging models
│ │ ├── stg_movies__raw_movies.sql
│ │ └── stg_movies__raw_credit.sql
│ ├── intermediate/
│ │ └── bridges/ # Bridge tables (7 models)
│ └── mart/
│ ├── dimensions/ # Dimension tables (7 models)
│ └── fact/ # Fact tables
├── seeds/ # Static reference data
├── snapshots/ # Type 2 SCD snapshots (if applicable)
├── tests/ # Custom data tests
├── dbt_project.yml # dbt configuration
├── packages.yml # dbt package dependencies
└── README.md # This file
dbt docs generate dbt docs serve
select
a.actor_name,
count(distinct f.movie_id) as total_movies,
sum(f.revenue) as total_revenue,
avg(f.vote_average) as avg_rating
from fct_movies f
join movie_actor_bridge mab on f.movie_id = mab.movie_id
join dim_actor a on mab.actor_id = a.actor_id
where f.revenue > 0
group by a.actor_name
order by total_revenue desc
limit 10;select
g.genre_name,
count(distinct f.movie_id) as movie_count,
avg(f.revenue) as avg_revenue,
avg(f.vote_average) as avg_rating,
sum(f.revenue - f.budget) as total_profit
from fct_movies f
join movie_genre_bridge mgb on f.movie_id = mgb.movie_id
join dim_genres g on mgb.genre_id = g.genre_id
where f.revenue > 0 and f.budget > 0
group by g.genre_name
order by total_profit desc;select
c.company_name,
count(distinct f.movie_id) as total_movies,
sum(f.revenue) / 1000000 as total_revenue_millions,
avg(f.vote_average) as avg_rating,
sum(case when f.revenue > f.budget then 1 else 0 end) * 100.0 /
count(*) as profit_rate_pct
from fct_movies f
join movie_company_bridge mcb on f.movie_id = mcb.movie_id
join dim_company c on mcb.company_id = c.company_id
group by c.company_name
having count(distinct f.movie_id) >= 10
order by total_revenue_millions desc
limit 20;- 75+ Git Commits - Iterative development with version control
- 17 Models - Staging, intermediate, and mart layers
- 7 Bridge Tables - Proper many-to-many relationship handling
- Star Schema - Production-ready dimensional model
- Comprehensive Testing - Data quality validation throughout
Project by: Luke Mai (mrluke269)
I'm actively seeking Analytics Engineering and Data Engineering opportunities where I can apply these skills to solve real business problems.
What I Bring:
- Strong SQL and data modeling skills
- Experience with modern data stack (dbt, Snowflake)
- Understanding of dimensional modeling best practices
- Ability to translate business requirements into data solutions
- Software engineering mindset (version control, testing, documentation)
This project is available for educational and portfolio purposes.
⭐ If you found this project interesting, please star the repository!