
iparac/rotten-tomatoes-datacamp-project


1. Introduction

The final project for the DataTalksClub/data-engineering-zoomcamp course. It was built with big-data cloud technologies and data from Rotten Tomatoes. This document explains the installation process and is separated into several sections.

Data Used

The data used is the Rotten Tomatoes user and critic reviews dataset.

Link to website

Important Links

Project visualization link

Tools

To run this project you need the following software installed.

  • Docker
  • Terraform
  • Google Cloud Platform Account and Google Cloud SDK
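
A quick sanity check that everything is installed and reachable on your PATH:

```bash
docker --version
docker compose version   # Compose v2; older installs expose docker-compose instead
terraform -version
gcloud --version
```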

2. Installation Steps

  1. Clone this repository
  2. Create a folder named keys/ under rotten-tomatoes-datacamp-project/dags/
  3. Download the dataset from here
  4. Move the downloaded dataset into the rotten-tomatoes-datacamp-project/datasets/ folder
  5. Create a Google Cloud Platform Account (GCP): here
    1. Create a service account by following the steps here
    2. Create and export a service account key by following the steps here
    3. Download the created service account key, rename it to gcp-cred.json, and move it under rotten-tomatoes-datacamp-project/dags/keys/ (a gcloud-based alternative is sketched after this list)
  6. Creating the infrastructure with Terraform
    1. In the Google Cloud console, copy your project ID (shown in the top-left corner next to the Google Cloud logo after you log in), or create a new project if you have none.
    2. Open the file rotten-tomatoes-datacamp-project/terraform/main.tf
    3. Change the project ID to yours: every "PUT YOUR GCP PROJECT ID HERE" occurrence should be replaced with your ID (a one-pass replacement command is sketched after this list).
    4. Rename the bucket rotten-tomatoes-bucket, since GCP bucket names must be globally unique.
    5. Using the terminal, move into the rotten-tomatoes-datacamp-project/terraform folder
    6. Run the command: terraform init
    7. Run the command: terraform apply
  7. Changing the project ID inside the Python script
    1. Open the rotten-tomatoes-datacamp-project/dags/data_ingestion.py
    2. Use ctrl-f to find "PUT YOUR GCP PROJECT ID HERE" and replace it with your project ID
    3. Change the value of the GCS_BUCKET variable to the name of the bucket you created with Terraform.
  8. Open rotten-tomatoes-datacamp-project/rotten_project_dbt/profiles.yml and replace "PUT YOUR GCP PROJECT ID HERE" with your ID
  9. Open rotten-tomatoes-datacamp-project/rotten_project_dbt/models/example/critic_reviews.sql and replace "PUT YOUR GCP PROJECT ID HERE" with your ID
  10. Open rotten-tomatoes-datacamp-project/rotten_project_dbt/models/example/user_reviews.sql and replace "PUT YOUR GCP PROJECT ID HERE" with your ID
  11. Docker
    1. Using the terminal, move into the rotten-tomatoes-datacamp-project/ folder
    2. Run the command: docker-compose up airflow-init
    3. Run the command: docker compose up -d (a container health-check sketch follows this list)
  12. Pipeline progress can be tracked here
  13. Final visualization can be seen here
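
Steps 5.1-5.3 can also be done from the command line. A minimal sketch, assuming a hypothetical service account name rotten-tomatoes-sa, with YOUR_PROJECT_ID standing in for your actual project ID, run from the repository root:

```bash
# Create the service account (the name is a placeholder; pick your own)
gcloud iam service-accounts create rotten-tomatoes-sa \
    --project=YOUR_PROJECT_ID

# Grant access to Storage and BigQuery; roles/editor is a broad choice
# for a demo project, not a production recommendation
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:rotten-tomatoes-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/editor"

# Export a JSON key directly to where the DAGs expect it
gcloud iam service-accounts keys create dags/keys/gcp-cred.json \
    --iam-account=rotten-tomatoes-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com
```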
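
Steps 6-10 all replace the same "PUT YOUR GCP PROJECT ID HERE" placeholder in different files, so they can be collapsed into a single pass (GNU sed assumed; on macOS use sed -i ''):

```bash
cd rotten-tomatoes-datacamp-project

# Swap the placeholder for your real project ID in every file that contains it
grep -rl 'PUT YOUR GCP PROJECT ID HERE' . \
  | xargs sed -i 's/PUT YOUR GCP PROJECT ID HERE/YOUR_PROJECT_ID/g'

# Then provision the GCS bucket and BigQuery resources
cd terraform
terraform init
terraform apply
```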
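
After docker compose up -d it can take a minute or two for all services to become healthy. The service names below follow the official Airflow compose file and may differ in this project's docker-compose.yaml, so verify them there (the web UI is typically served at http://localhost:8080):

```bash
# List the containers and their health/status
docker compose ps

# Tail the scheduler logs if a DAG does not show up
docker compose logs -f airflow-scheduler
```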

3. Pipeline Infrastructure

(Pipeline architecture diagram)

Used Technologies

  • Google Cloud Platform (GCP)
  • Google Cloud Storage (GCS): Data Lake
  • BigQuery: Data Warehouse
  • Terraform: Infrastructure-as-Code (IaC)
  • Docker: Containerization
  • PostgreSQL: Data Analysis & Exploration
  • Airflow: Workflow Orchestration
  • dbt: Data Transformation
  • Looker Studio: Visualization

Disclaimer

Everything is automated and reproducible except the Looker Studio visualization. The data is processed and already loaded into BigQuery tables, so you can build the visualization however you like, using BigQuery as the source.
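
If you want to sanity-check the warehouse before building a dashboard, you can query the tables from the shell. A sketch using the bq CLI; the dataset name rotten_tomatoes is hypothetical, so check BigQuery for the dataset your Terraform and dbt runs actually created:

```bash
# Count rows in one of the dbt models (adjust dataset/table names to yours)
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS n FROM `YOUR_PROJECT_ID.rotten_tomatoes.critic_reviews`'
```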
