Python data ingestion with polars and pandas

Python Polars Pandas Docker

This CLI script fetches the CSV datasets for NYC Yellow Trip Data, Green Trip Data, FHV Trip Data, and Lookup Zones from the endpoints listed in datasets.yml, and persists them to Postgres.

Tech Stack

Up and Running

Developer Setup

1. Create and activate a virtualenv for Python 3.11 with conda:

conda create -n pandas-sqlalchemy python=3.11 -y
conda activate pandas-sqlalchemy

2. Install the dependencies declared in pyproject.toml:

pdm sync

3. (Optional) Install pre-commit:

brew install pre-commit

# From root folder where `.pre-commit-config.yaml` is located, run:
pre-commit install

4. Export the environment variables used to connect to the database:

export DATABASE_HOST=localhost
export DATABASE_PORT=5432
export DATABASE_NAME=nyc_taxi
export DATABASE_USERNAME=postgres
export DATABASE_PASSWORD=postgres

5. Run the script with the intended flags or use --help:

  • python run.py -y or --yellow:

    • fetches the datasets under the key yellow_trip_data only
    • persists to Postgres, on table yellow_taxi_data
  • python run.py -g or --green:

    • fetches the datasets under the key green_trip_data only
    • persists to Postgres, on table green_taxi_data
  • python run.py -f or --fhv:

    • fetches the datasets under the key fhv_trip_data
    • persists to Postgres, on table fhv_taxi_data
  • python run.py -z or --zones:

    • fetches the datasets under the key zone_lookups
    • persists to Postgres, on table zone_lookup

Additionally, you can pass --use-polars to process the data with Polars instead of pandas, which is typically much faster.

You can combine any of the options above to fetch more than one dataset group at a time. For instance, python run.py -gz --use-polars fetches the NYC Green Trip Data and NYC Lookup Zones while using Polars as the DataFrame library.
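To illustrate how the flags above combine, here is a minimal sketch of the option handling. It uses the standard-library argparse as a stand-in for Typer so that it runs without extra dependencies. The names DATASET_TABLES, build_parser, and select_datasets are hypothetical and are not taken from run.py; only the flags, dataset keys, and table names come from the list above.

```python
import argparse

# Flag name -> (key in datasets.yml, target Postgres table), as listed above.
DATASET_TABLES = {
    "yellow": ("yellow_trip_data", "yellow_taxi_data"),
    "green": ("green_trip_data", "green_taxi_data"),
    "fhv": ("fhv_trip_data", "fhv_taxi_data"),
    "zones": ("zone_lookups", "zone_lookup"),
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing -y/-g/-f/-z and --use-polars (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="run.py")
    for name in DATASET_TABLES:
        # The short flag is the first letter of the long one: -y, -g, -f, -z.
        parser.add_argument(f"-{name[0]}", f"--{name}", action="store_true")
    parser.add_argument("--use-polars", action="store_true")
    return parser


def select_datasets(argv):
    """Return the selected (dataset key, table) pairs and the Polars switch."""
    args = build_parser().parse_args(argv)
    selected = [pair for name, pair in DATASET_TABLES.items() if getattr(args, name)]
    return selected, args.use_polars


# Equivalent to: python run.py -gz --use-polars
selected, use_polars = select_datasets(["-gz", "--use-polars"])
for key, table in selected:
    print(f"{key} -> {table} (polars={use_polars})")
```

Short boolean flags can be grouped behind a single dash, which is why -gz selects both the green and the zones dataset groups.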

Containerization and Testing

1. Build the Docker Image with:

docker build -t iobruno/nyc-taxi-ingest:latest . --no-cache

2. Start a container with it:

docker run --rm \
  -e DATABASE_HOST=host.docker.internal \
  -e DATABASE_PORT=5432 \
  -e DATABASE_NAME=nyc_taxi \
  -e DATABASE_USERNAME=postgres \
  -e DATABASE_PASSWORD=postgres \
  --name db_ingest_postgres \
  iobruno/nyc-taxi-ingest
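Both the local setup and the container pass the same DATABASE_* variables. As a rough sketch of how a script could turn them into a PostgreSQL connection URL, consider the helper below. The function name build_database_url is an assumption made for illustration, and the actual code in this project may assemble the connection differently.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=None) -> str:
    """Assemble a PostgreSQL URL from the DATABASE_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    # Percent-encode the credentials so characters such as '@' or ':' stay valid in a URL.
    user = quote_plus(env["DATABASE_USERNAME"])
    password = quote_plus(env["DATABASE_PASSWORD"])
    host = env["DATABASE_HOST"]
    port = env["DATABASE_PORT"]
    name = env["DATABASE_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


# Example with the values from the developer setup; pass no argument to read os.environ.
example_env = {
    "DATABASE_HOST": "localhost",
    "DATABASE_PORT": "5432",
    "DATABASE_NAME": "nyc_taxi",
    "DATABASE_USERNAME": "postgres",
    "DATABASE_PASSWORD": "postgres",
}
print(build_database_url(example_env))
```

A missing variable raises a KeyError, which surfaces a misconfigured environment before any data is fetched.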

TODO:

  • PEP-517: Packaging and dependency management with PDM
  • Code format/lint with Ruff
  • Build a CLI app with Typer
  • Progress Bars to keep track of the execution with rich
  • Run/Deploy the project on Docker
  • Re-Implement the pipeline with Polars
  • Define the DataFrame schemas for Polars to prevent DB errors