This CLI script fetches the CSV datasets for NYC Yellow Trip Data, Green Trip Data, FHV Trip Data, and Lookup Zones from the endpoints defined in datasets.yml.
1. Create and activate a virtualenv for Python 3.11 with conda:
conda create -n pandas-sqlalchemy python=3.11 -y
conda activate pandas-sqlalchemy
2. Install the dependencies listed in pyproject.toml:
pdm sync
3. (Optional) Install pre-commit:
brew install pre-commit
# From root folder where `.pre-commit-config.yaml` is located, run:
pre-commit install
4. Export the environment variables used to connect to the database:
export DATABASE_HOST=localhost
export DATABASE_PORT=5432
export DATABASE_NAME=nyc_taxi
export DATABASE_USERNAME=postgres
export DATABASE_PASSWORD=postgres
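As an illustration of how these five variables could be combined into a connection string, here is a minimal sketch using only the standard library. The function name `build_postgres_url` is hypothetical, and the `postgresql://` URL layout is an assumption about what the script hands to its database layer; the actual code in run.py may read the variables differently.

```python
import os
from urllib.parse import quote_plus

# Variable names taken from the export step above.
REQUIRED_VARS = (
    "DATABASE_HOST",
    "DATABASE_PORT",
    "DATABASE_NAME",
    "DATABASE_USERNAME",
    "DATABASE_PASSWORD",
)


def build_postgres_url(env=None):
    """Assemble a PostgreSQL URL from the DATABASE_* variables (hypothetical helper)."""
    env = os.environ if env is None else env

    # Fail early with a readable message if anything is missing.
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")

    # Percent-encode credentials so characters such as '@' do not break the URL.
    user = quote_plus(env["DATABASE_USERNAME"])
    password = quote_plus(env["DATABASE_PASSWORD"])
    host = env["DATABASE_HOST"]
    port = env["DATABASE_PORT"]
    name = env["DATABASE_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```

Calling `build_postgres_url()` with no argument reads from the current process environment; passing a plain dict makes the helper easy to exercise without touching real settings.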
5. Run the script with the intended flags, or use --help:
- python run.py -y or --yellow:
  - fetches the datasets under the key yellow_trip_data only
  - persists to Postgres, on table yellow_taxi_data
- python run.py -g or --green:
  - fetches the datasets under the key green_trip_data only
  - persists to Postgres, on table green_taxi_data
- python run.py -f or --fhv:
  - fetches the datasets under the key fhv_trip_data
  - persists to Postgres, on table fhv_taxi_data
- python run.py -z or --zones:
  - fetches the datasets under the key zone_lookups
  - persists to Postgres, on table zone_lookup
Additionally, you can pass --use-polars for a major speed boost with Polars.
You can combine any of the options above to fetch more than one dataset group at a time. For instance, python run.py -gz --use-polars fetches the NYC Green Trip Data and NYC Lookup Zones while using Polars as the DataFrame library.
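To make the flag-to-dataset mapping concrete, the sketch below parses the same flags and resolves them to the dataset keys and Postgres tables listed above. It deliberately uses the standard library's argparse as a stand-in so it runs without extra dependencies; the project itself is built with Typer, and the names `TARGET_TABLES`, `build_parser`, and `select_targets` are hypothetical, not taken from run.py.

```python
import argparse

# Dataset key in datasets.yml -> Postgres table, as documented in the flag list.
TARGET_TABLES = {
    "yellow_trip_data": "yellow_taxi_data",
    "green_trip_data": "green_taxi_data",
    "fhv_trip_data": "fhv_taxi_data",
    "zone_lookups": "zone_lookup",
}


def build_parser():
    """Declare the documented flags (argparse stand-in for the Typer CLI)."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("-y", "--yellow", action="store_true", help="fetch yellow_trip_data")
    parser.add_argument("-g", "--green", action="store_true", help="fetch green_trip_data")
    parser.add_argument("-f", "--fhv", action="store_true", help="fetch fhv_trip_data")
    parser.add_argument("-z", "--zones", action="store_true", help="fetch zone_lookups")
    parser.add_argument("--use-polars", action="store_true", help="use Polars for DataFrames")
    return parser


def select_targets(argv):
    """Return ([(dataset_key, table), ...], use_polars) for the given arguments."""
    args = build_parser().parse_args(argv)
    enabled = {
        "yellow_trip_data": args.yellow,
        "green_trip_data": args.green,
        "fhv_trip_data": args.fhv,
        "zone_lookups": args.zones,
    }
    # Keep only the dataset groups whose flag was passed, in declaration order.
    selected = [(key, TARGET_TABLES[key]) for key, on in enabled.items() if on]
    return selected, args.use_polars
```

With this sketch, `select_targets(["-gz", "--use-polars"])` selects the green trip data and zone lookups with Polars enabled, since argparse accepts grouped short flags such as -gz.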
1. Build the Docker Image with:
docker build -t iobruno/nyc-taxi-ingest:latest . --no-cache
2. Start a container with it:
docker run --rm \
-e DATABASE_HOST=host.docker.internal \
-e DATABASE_PORT=5432 \
-e DATABASE_NAME=nyc_taxi \
-e DATABASE_USERNAME=postgres \
-e DATABASE_PASSWORD=postgres \
--name db_ingest_postgres \
iobruno/nyc-taxi-ingest
- PEP-517: Packaging and dependency management with PDM
- Code format/lint with Ruff
- Build a CLI app with Typer
- Progress bars to keep track of the execution with rich
- Run/Deploy the project on Docker
- Re-Implement the pipeline with Polars
- Define the DataFrame schemas for Polars to prevent DB errors