The NYC-TLC-AIRFLOW-ETL repository is a comprehensive solution built around an Apache Airflow DAG running in Docker. It extracts the 2022 High-Volume For-Hire Vehicle (FHV) parquet files provided by the New York City Taxi and Limousine Commission (TLC) and stored in an AWS S3 bucket, then performs transformation operations to cleanse and enrich the data before loading it into a Google BigQuery table. The purpose of this ETL pipeline is to prepare the data for consumption by a Looker Studio report, enabling detailed analysis and visualization of Uber and Lyft trips.
The DAG is triggered manually or via the API. However, the workflow can easily be modified to run on a defined schedule, for example after the AWS bucket receives the latest High Volume FHV trip parquet file. A minimal sketch of this extract-transform-load flow is shown below.
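The pipeline can be pictured as a three-task Airflow DAG. The sketch below is illustrative rather than the repository's actual code: the `dag_id`, task names, and empty callables are assumptions, and `schedule_interval=None` reflects the manual/API triggering described above.

```python
# Illustrative sketch of the extract -> transform -> load DAG (names are assumptions).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    """Download the 2022 High-Volume FHV parquet files from the S3 bucket."""
    ...


def transform(**context):
    """Cleanse and enrich the trip records."""
    ...


def load(**context):
    """Load the transformed records into the BigQuery table."""
    ...


with DAG(
    dag_id="nyc_tlc_hv_fhv_etl",       # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,            # triggered manually or via the API
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```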
- Extraction of High-Volume FHV parquet files from the year 2022.
- Data transformation and cleansing operations.
- Loading of the transformed data into a Google BigQuery table.
- Manual triggering of the DAG or API-based triggering (see the example after this list).
- Configurable scheduling to run the pipeline after the AWS bucket receives the latest High Volume FHV trip parquet file.
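As an alternative to the web interface, a DAG run can be created through Airflow's stable REST API. The snippet below is a sketch that assumes the default docker-compose setup (basic auth enabled, `airflow`/`airflow` credentials, port 8080) and a hypothetical DAG id.

```python
# Sketch: trigger a DAG run via the Airflow 2 stable REST API (assumes basic auth is enabled).
import requests

AIRFLOW_URL = "http://localhost:8080"   # default docker-compose webserver port
DAG_ID = "nyc_tlc_hv_fhv_etl"           # hypothetical DAG id

response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    json={"conf": {}},
    auth=("airflow", "airflow"),        # default credentials in the official docker-compose file
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```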
- Docker installed on your local machine.
- Google BigQuery project and credentials.
- AWS S3 bucket credentials.
- Configure the AWS S3 bucket connection
To access the High-Volume FHV parquet files stored in the AWS S3 bucket, you need to configure the AWS S3 bucket connection in Apache Airflow. Follow the steps below; a sketch of how the DAG can use this connection appears after the list:
- Open the Airflow web interface.
- Go to Admin > Connections.
- Click on Create to create a new connection.
- Set the Conn Id field to s3_conn.
- Set the Connection Type to Amazon Web Services.
- Fill in the AWS Access Key ID.
- Fill in the AWS Secret Access Key.
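Once the connection exists, the extraction task can look it up by its `s3_conn` id. The sketch below uses the Amazon provider's `S3Hook`; the bucket name and object key are placeholders, not values from the repository.

```python
# Sketch: download one of the 2022 High-Volume FHV parquet files using the "s3_conn" connection.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def download_hv_fhv_file() -> str:
    hook = S3Hook(aws_conn_id="s3_conn")
    # Placeholder bucket and key; substitute the bucket that stores the 2022 files.
    local_path = hook.download_file(
        key="fhvhv_tripdata_2022-01.parquet",
        bucket_name="your-tlc-bucket",
    )
    return local_path
```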
- To load the transformed data into a Google BigQuery table, you must place the GOOGLE_APPLICATION_CREDENTIALS .json file in the "dags/" folder.
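One way the credentials file can be used from the DAG is to build a BigQuery client directly from it; the file name below is a placeholder for your own service-account key.

```python
# Sketch: create a BigQuery client from the service-account key placed in dags/.
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("dags/service_account.json")  # placeholder file name
print(client.project)  # quick sanity check that the credentials are picked up
```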
- Create the BigQuery table with Bigquery.ipynb.
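If you prefer scripting the table creation instead of running the notebook, the sketch below shows the equivalent call with the BigQuery Python client. The table id and the column list are illustrative only; the authoritative schema is the one defined in Bigquery.ipynb.

```python
# Sketch: create the destination table (illustrative subset of High-Volume FHV columns).
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("dags/service_account.json")  # placeholder file name

schema = [
    bigquery.SchemaField("hvfhs_license_num", "STRING"),
    bigquery.SchemaField("pickup_datetime", "TIMESTAMP"),
    bigquery.SchemaField("dropoff_datetime", "TIMESTAMP"),
    bigquery.SchemaField("PULocationID", "INTEGER"),
    bigquery.SchemaField("DOLocationID", "INTEGER"),
    bigquery.SchemaField("trip_miles", "FLOAT"),
    bigquery.SchemaField("base_passenger_fare", "FLOAT"),
]

table = bigquery.Table("your-project.your_dataset.hv_fhv_trips", schema=schema)  # placeholder table id
table = client.create_table(table, exists_ok=True)
print(f"Created {table.full_table_id}")
```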
- Set the Airflow variable
The ETL pipeline requires an Airflow variable called HV_FHV_TABLE_ID, which holds the ID of the BigQuery table where the transformed data will be loaded. Follow the steps below to set the variable; a sketch of how the DAG reads it appears after the list:
- Open the Airflow web interface.
- Go to Admin > Variables.
- Click on Create to create a new variable.
- Set the Key field to HV_FHV_TABLE_ID.
- Fill in the Value field with the ID of your BigQuery table.
- Save the variable.
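At runtime the load task can read this value with Airflow's `Variable` API; the helper name below is an assumption for illustration.

```python
# Sketch: read the destination table id when the load task runs.
from airflow.models import Variable


def get_destination_table_id() -> str:
    # Reading inside a callable avoids hitting the metadata DB at DAG-parse time.
    return Variable.get("HV_FHV_TABLE_ID")
```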
- Configure the files required by docker-compose (see the official Airflow "Running Airflow in Docker" guide for more information):
mkdir -p ./logs ./plugins ./config
echo -e "AIRFLOW_UID=$(id -u)" > .env
- Create a custom Apache Airflow image with the gcloud API:
docker build -t exented:latest .
- Run Airflow:
docker compose up