This project builds an ETL (Extract, Transform, Load) pipeline to analyze content distribution trends on Netflix. The pipeline runs on AWS, using dbt for data transformation, Snowflake as the data warehouse, and Apache Airflow for orchestration. Insights are visualized in Amazon QuickSight, and alerting is integrated to monitor for pipeline failures.
- AWS EC2 instance
- Snowflake account
- dbt installed
- Apache Airflow installed
- AWS EC2 Setup: Launch an EC2 instance on AWS to host dbt and Airflow.
- dbt Installation: Install dbt (with the Snowflake adapter) on the EC2 instance, following the dbt installation guide.
- Airflow Installation: Install Apache Airflow, following the Airflow installation guide.
- Install dbt and Airflow on an AWS EC2 instance.
- This is the initial setup step. Instructions can be found in the installation steps section.
- Create a dbt pipeline for the business use case discussed in this project.
- Run the data transformations required for the analysis.
- Business Use Cases include:
- In terms of content, which movies/series dominate the market?
- Who are the top actors and directors dominating specific genres?
- What is the percentage share of movies versus series on Netflix's platform?
- Use Airflow as the orchestrating tool.
- Set up DAGs to schedule and automate the pipeline tasks.
- Automate test cases in the pipeline.
- Implement testing to ensure data quality and integrity.
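dbt's built-in schema tests can cover these data-quality checks; a minimal sketch, assuming a model named `netflix_titles` with `show_id` and `type` columns (names are illustrative):

```yaml
# models/schema.yml — model and column names are hypothetical
version: 2

models:
  - name: netflix_titles
    columns:
      - name: show_id
        tests:
          - unique
          - not_null
      - name: type
        tests:
          - accepted_values:
              values: ['Movie', 'TV Show']
```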
- Create visualizations using Amazon QuickSight.
- Transform the processed data into insightful visualizations.
- Set up alerting mechanisms for pipeline failure using Slack and email alerts.
- Monitor the pipeline and receive immediate notifications for any issues.
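One way to wire up the Slack alert is an Airflow `on_failure_callback`; a sketch, where the webhook URL and message format are placeholders:

```python
# Sketch of a Slack failure callback for Airflow. The webhook URL is a
# placeholder; in practice store it in an Airflow Variable or Connection.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


def build_failure_message(context):
    """Build a Slack payload from the Airflow task context."""
    ti = context["task_instance"]
    return {
        "text": (
            f":red_circle: Task failed: {ti.dag_id}.{ti.task_id} "
            f"(run {context['run_id']})"
        )
    }


def notify_slack_on_failure(context):
    """Pass this as on_failure_callback in the DAG's default_args."""
    payload = json.dumps(build_failure_message(context)).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Register it via `default_args={"on_failure_callback": notify_slack_on_failure}`; for email, Airflow's built-in `email_on_failure` flag can be enabled once SMTP is configured.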
- Step 1: Start your EC2 instance and connect to your Snowflake account.
- Step 2: Start the Airflow webserver and scheduler.
- Step 3: Trigger the dbt transformations via Airflow.
- Step 4: View the QuickSight dashboard for results.
- Step 5: Monitor the pipeline through Slack and email alerts.