
iparac/rotten-tomatoes-datacamp-project


1. Introduction

The final project for the DataTalksClub/data-engineering-zoomcamp course. It was built with big-data cloud technologies and data from Rotten Tomatoes. This document explains the installation process and is separated into several sections.

Data Used

The data used is the Rotten Tomatoes user and critic reviews dataset.

Link to website

Important Links

Project visualization link

Tools

To run this project you need the following software installed.

  • Docker
  • Terraform
  • Google Cloud Platform Account and Google Cloud SDK
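
A quick sanity check that everything is installed and reachable on your PATH:

```bash
docker --version
docker compose version   # Compose v2; older installs expose docker-compose instead
terraform -version
gcloud --version
```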

2. Installation Steps

  1. Clone this repository
  2. Create a folder named keys/ under rotten-tomatoes-datacamp-project/dags/
  3. Download the dataset from here
  4. Move the downloaded dataset into the rotten-tomatoes-datacamp-project/datasets/ folder
  5. Create a Google Cloud Platform Account (GCP): here
    1. Create a service account by following the steps here
    2. Create and export a service account key by following the steps here
    3. Download the created service account key, rename it to gcp-cred.json, and move it under rotten-tomatoes-datacamp-project/dags/keys/ (a gcloud-based alternative is sketched after this list)
  6. Creating the infrastructure with Terraform
    1. In the Google Cloud console, copy your project ID (shown in the top-left corner next to the Google Cloud logo after you log in), or create a new project if you have none.
    2. Open the file rotten-tomatoes-datacamp-project/terraform/main.tf
    3. Change the project ID to yours: every "PUT YOUR GCP PROJECT ID HERE" occurrence should be replaced with your ID (a one-pass replacement command is sketched after this list).
    4. Rename the bucket rotten-tomatoes-bucket, since GCP bucket names must be globally unique.
    5. Using the terminal, move into the rotten-tomatoes-datacamp-project/terraform folder
    6. Run the command: terraform init
    7. Run the command: terraform apply
  7. Changing the project ID inside the Python script
    1. Open the rotten-tomatoes-datacamp-project/dags/data_ingestion.py
    2. Use ctrl-f to find "PUT YOUR GCP PROJECT ID HERE" and replace it with your project ID
    3. Change the value of the GCS_BUCKET variable to the name of the bucket you created with Terraform.
  8. Open rotten-tomatoes-datacamp-project/rotten_project_dbt/profiles.yml and replace "PUT YOUR GCP PROJECT ID HERE" with your ID
  9. Open rotten-tomatoes-datacamp-project/rotten_project_dbt/models/example/critic_reviews.sql and replace "PUT YOUR GCP PROJECT ID HERE" with your ID
  10. Open rotten-tomatoes-datacamp-project/rotten_project_dbt/models/example/user_reviews.sql and replace "PUT YOUR GCP PROJECT ID HERE" with your ID
  11. Docker
    1. Using the terminal, move into the rotten-tomatoes-datacamp-project/ folder
    2. Run the command: docker-compose up airflow-init
    3. Run the command: docker compose up -d (a container health-check sketch follows this list)
  12. Pipeline progress can be tracked here
  13. Final visualization can be seen here
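
Steps 5.1-5.3 can also be done from the command line. A minimal sketch, assuming a hypothetical service account name rotten-tomatoes-sa, with YOUR_PROJECT_ID standing in for your actual project ID, run from the repository root:

```bash
# Create the service account (the name is a placeholder; pick your own)
gcloud iam service-accounts create rotten-tomatoes-sa \
    --project=YOUR_PROJECT_ID

# Grant access to Storage and BigQuery; roles/editor is a broad choice
# for a demo project, not a production recommendation
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:rotten-tomatoes-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/editor"

# Export a JSON key directly to where the DAGs expect it
gcloud iam service-accounts keys create dags/keys/gcp-cred.json \
    --iam-account=rotten-tomatoes-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com
```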
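
Steps 6-10 all replace the same "PUT YOUR GCP PROJECT ID HERE" placeholder in different files, so they can be collapsed into a single pass (GNU sed assumed; on macOS use sed -i ''):

```bash
cd rotten-tomatoes-datacamp-project

# Swap the placeholder for your real project ID in every file that contains it
grep -rl 'PUT YOUR GCP PROJECT ID HERE' . \
  | xargs sed -i 's/PUT YOUR GCP PROJECT ID HERE/YOUR_PROJECT_ID/g'

# Then provision the GCS bucket and BigQuery resources
cd terraform
terraform init
terraform apply
```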
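
After docker compose up -d it can take a minute or two for all services to become healthy. The service names below follow the official Airflow compose file and may differ in this project's docker-compose.yaml, so verify them there (the web UI is typically served at http://localhost:8080):

```bash
# List the containers and their health/status
docker compose ps

# Tail the scheduler logs if a DAG does not show up
docker compose logs -f airflow-scheduler
```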

3. Pipeline Infrastructure

(Pipeline architecture diagram)

Used Technologies

  • Google Cloud Platform (GCP)
  • Google Cloud Storage (GCS): Data Lake
  • BigQuery: Data Warehouse
  • Terraform: Infrastructure-as-Code (IaC)
  • Docker: Containerization
  • PostgreSQL: Data Analysis & Exploration
  • Airflow: Workflow Orchestration
  • dbt: Data Transformation
  • Looker Studio: Visualization

Disclaimer

Everything is automated and reproducible except the Looker Studio visualization. The data is processed and already loaded into BigQuery tables, so you can build the visualization however you like, using BigQuery as the source.
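
If you want to sanity-check the warehouse before building a dashboard, you can query the tables from the shell. A sketch using the bq CLI; the dataset name rotten_tomatoes is hypothetical, so check BigQuery for the dataset your Terraform and dbt runs actually created:

```bash
# Count rows in one of the dbt models (adjust dataset/table names to yours)
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS n FROM `YOUR_PROJECT_ID.rotten_tomatoes.critic_reviews`'
```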
