IMDB is a popular online and authorative source for movies and TV shows review and ratings and has database of infomation related to films, TV series and streaming content online including cast and production crew.
Lovers of films and reviewer will like the know the most popular movies of all time and their total reviews, distribution of films by year of release from 1923 till date, the top rated movies in the current year.
For this solution to be possible, the data engineer and analytic will need to create data pipeline to extract the raw data from the IMDB online database to datalake (GCS) which will be transform, structure and storage in Data Warehouse (Bigquery). Thereafter populated into actionable dashboard.
Although the data is updated daily on the IMDB database, the pipeline will be set to refresh weekly and henceforth updating the dashboard.
The data is a subset of IMDB dataset which is available for access on the IMDB Datasets which is been refreshed daily. Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set.
- Cloud: GCP
- Infrastructure as code (IaC): Terraform
- Datalake: GCP Bucket
- Workflow orchestration: Airflow
- Data Warehouse: BigQuery
- Transformations: Data Build Tool (dbt)
- Dashboard with Data Studio
The Google Cloud Platform is used for deploying the pipeline, Google Cloud Storage (GCS) as a data lake, and BigQuery as a data warehouse.
Terraform is an open source tool which has been used for provisioning infrastructure resources. In this case, it was used to create GCP Infra by creating the following Terraform files.
Apache Airflow is a platform to programmatically schedule and mointor workflows as DAGs. With Airflow, we have command line utilities as well as a user interface to visualise pipelines, monitor progress and troubleshoot issues.
Transformation was done using dbt. The documentation of the data can be find here to understand the data and what the represent.
The title basics
contains movies not yet release and title ratings
are only movies and series that have been released. To get the released movies and series, join the title basics
and title ratings
.
The dashboard was built on Google Data Studio. Connecting the BigQuery data warehouse to Google Data Studio was very easy.
The steps to reproduce this pipeline is as follows: