
Foundations: Data Pipeline

Data ETL pipeline to clean, process, and aggregate data from Canadian housing starts.

Built with Apache Airflow, dbt, and Amazon Web Services EC2.

Learn more about the project by reading the design document.


Table of Contents

  • Setup Instructions
    • Set up AWS EC2 (Host for the Database and Orchestrator)
    • Set up Docker (Containerizer)
    • Set up Airflow (Orchestrator)
  • Usage
    • Start Airflow
    • Interact with Airflow (Local)

Table of contents created with VS Code Extension: Markdown All in One.


Setup Instructions

Set up AWS EC2 (Host for the Database and Orchestrator)

Provision an EC2 Instance

  1. Navigate to Amazon Web Services and create an account.
  2. In the search bar at the top, search for and click EC2.
  3. Click the Launch Instance button and follow the instructions. For this example, we will be using the Amazon Linux AMI operating system.
  4. Download the .pem key file and keep it secure.
  5. Wait until the instance state is Running.

Connect to the EC2 Instance using AWS CloudShell

  1. In the AWS EC2 service, in the left sidebar, select Instances.
  2. Select the newly created instance.
  3. At the top of the screen, click Connect.
  4. Choose any of the provided options to connect to the instance.

As an alternative to CloudShell, you can also use SSH from your local computer.
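If you connect over SSH, the commands typically look like the following sketch (the key file name and public DNS are placeholders for your own values; ec2-user is the default user on Amazon Linux):

chmod 400 my-ec2-key.pem                         # ssh refuses keys with open permissions
ssh -i my-ec2-key.pem ec2-user@<ec2-public-dns>  # connect as the default Amazon Linux user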

Set up Docker (Containerizer)

Follow the instructions at:

  1. Installing Docker to use with the AWS SAM CLI
  2. Install Docker Engine
  3. Install Docker Compose
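For reference, on Amazon Linux 2023 the Docker Engine installation roughly follows this sketch (package names and service commands may differ for other AMIs, so prefer the official instructions linked above):

sudo dnf install -y docker            # Docker Engine from the Amazon Linux repositories
sudo systemctl enable --now docker    # start the daemon and enable it at boot
docker --version                      # confirm the installation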

Ensure your user has permission to run Docker commands:

sudo groupadd docker
sudo usermod -aG docker $USER

Log out and back in for the new group membership to take effect.
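To confirm the group change took effect, try running a container without sudo (assumes outbound access to Docker Hub for the hello-world image):

newgrp docker            # apply the new group membership in the current shell, if you prefer not to re-login
docker run hello-world   # should succeed without sudo once the group change is active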

Set up Airflow (Orchestrator)

This section is based on the guide for running Airflow using Docker Compose.

Setup

If you are on Linux, update the Airflow UID in the .env file with the host user ID:

echo -e "AIRFLOW_UID=$(id -u)" > .env
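If your checkout does not already contain them, the upstream Docker Compose guide also creates the directories that the containers mount (an assumption based on that guide; adjust to match your docker-compose.yaml):

mkdir -p ./dags ./logs ./plugins ./config    # host directories mounted into the Airflow containers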

Run database migrations and initialize the first user account:

docker compose up airflow-init

Expose the Airflow Console on an EC2 Port

Follow the commands under these instructions to add security group rules that permit HTTP access to port 80.
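If you prefer the AWS CLI to the console, the rule can be added roughly like this (a sketch: the security group ID is a placeholder, and a 0.0.0.0/0 CIDR exposes the console to the whole internet, so narrow it where possible):

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 80 \
  --cidr 0.0.0.0/0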

Once complete, the security rules should look like this:

(Screenshot: security group rules)


Usage

Start Airflow

Start all services:

sudo docker compose up
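To keep the services running after you close the terminal session, add Docker Compose's detached flag:

sudo docker compose up -d    # run all services in the background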

sudo is required because the Airflow console is published on port 80, a privileged port.
If you want to avoid sudo or prefer another port:

  • Open docker-compose.yaml
  • Find the configuration for airflow-webserver
  • Change the host port number under ports, as sketched below
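For example, to publish the console on port 8080 instead, the relevant fragment of docker-compose.yaml would look roughly like this (a sketch; the container-side port depends on how the webserver is configured, 8080 being the Airflow default):

airflow-webserver:
  ports:
    - "8080:8080"    # host port : container port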

You can stop all services with:

docker compose down

Airflow is now running on your machine.

Interact with Airflow (Local)

If you set up Airflow and Docker locally, you can log into Airflow at http://localhost:80; otherwise use the port you used to expose the Airflow console.
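To check that the webserver is reachable before logging in, you can query Airflow's health endpoint (substitute your own port if it differs):

curl http://localhost:80/health    # returns JSON describing metadatabase and scheduler health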

The default username and password are both airflow; reset the password immediately after logging in.

You can view information on the current environment:

docker compose run airflow-worker airflow info

or, with the wrapper script:

./bin/airflow info

Enter the running Docker container to execute commands:

./bin/airflow bash
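Once inside the container, the standard Airflow CLI is available, for example:

airflow version      # confirm the Airflow version inside the container
airflow dags list    # list the DAGs Airflow has parsed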

Stop and delete all containers and volumes:

docker compose down --volumes --rmi all
