Skip to content

This project get the IMDB rating data, build data warehouse on Bigquery and visualize the data on Data Studio

Notifications You must be signed in to change notification settings

oseghalepatrick/IMDB-Pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Engineering Zoomcamp Project

Problem description - IMDB Movies and Series Rating

IMDB is a popular online and authorative source for movies and TV shows review and ratings and has database of infomation related to films, TV series and streaming content online including cast and production crew.

Lovers of films and reviewer will like the know the most popular movies of all time and their total reviews, distribution of films by year of release from 1923 till date, the top rated movies in the current year.

For this solution to be possible, the data engineer and analytic will need to create data pipeline to extract the raw data from the IMDB online database to datalake (GCS) which will be transform, structure and storage in Data Warehouse (Bigquery). Thereafter populated into actionable dashboard.

Although the data is updated daily on the IMDB database, the pipeline will be set to refresh weekly and henceforth updating the dashboard.

Dataset

The data is a subset of IMDB dataset which is available for access on the IMDB Datasets which is been refreshed daily. Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set.

Data Pipeline

Technologies

  1. Cloud: GCP
  2. Infrastructure as code (IaC): Terraform
  3. Datalake: GCP Bucket
  4. Workflow orchestration: Airflow
  5. Data Warehouse: BigQuery
  6. Transformations: Data Build Tool (dbt)
  7. Dashboard with Data Studio

Cloud: GCP

The Google Cloud Platform is used for deploying the pipeline, Google Cloud Storage (GCS) as a data lake, and BigQuery as a data warehouse.

Infrastructure as code (IaC): Terraform

Terraform is an open source tool which has been used for provisioning infrastructure resources. In this case, it was used to create GCP Infra by creating the following Terraform files.

Workflow orchestration: Airflow

Apache Airflow is a platform to programmatically schedule and mointor workflows as DAGs. With Airflow, we have command line utilities as well as a user interface to visualise pipelines, monitor progress and troubleshoot issues.

Transformation (dbt)

Transformation was done using dbt. The documentation of the data can be find here to understand the data and what the represent.

The title basics contains movies not yet release and title ratings are only movies and series that have been released. To get the released movies and series, join the title basics and title ratings.

Dashboard

The dashboard was built on Google Data Studio. Connecting the BigQuery data warehouse to Google Data Studio was very easy.

Reproduce the Project

The steps to reproduce this pipeline is as follows:

  1. Clone the repo

        git clone https://github.com/oseghalepatrick/data-eng-zoomcamp.git
    
        cd data-eng-zoomcamp/project/
    
  2. Setup Terraform for IaC here
    Then run the scripts in the following order here

  3. Setup the Docker and Airflow here

  4. To setup dbt follow the instruction here

About

This project get the IMDB rating data, build data warehouse on Bigquery and visualize the data on Data Studio

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published