# Prepare the Data

Download the green taxi trips data for November 2025:

In [None]:
wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-11.parquet

You will also need the dataset with zones:

In [None]:
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv
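Interrupted downloads are a common cause of confusing errors later on. As a rough sanity check, the sketch below tests whether a file carries the 4-byte `PAR1` marker that Parquet files have at both the start and the end. The helper name `looks_like_parquet` is hypothetical and not part of any library. It does not validate the file contents beyond those marker bytes.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # Parquet files begin and end with these 4 bytes


def looks_like_parquet(path):
    """Return True if the file exists and has the Parquet magic bytes at both ends."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with p.open("rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), 2)  # jump to 4 bytes before end of file
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


if __name__ == "__main__":
    # File names match the two wget commands in this section.
    for name in ("green_tripdata_2025-11.parquet", "taxi_zone_lookup.csv"):
        p = Path(name)
        size = p.stat().st_size if p.is_file() else 0
        print(name, "size:", size, "bytes")
    print("parquet ok:", looks_like_parquet("green_tripdata_2025-11.parquet"))
```

A size of 0 bytes or `parquet ok: False` suggests the download should be repeated.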

Install the required libraries:

In [None]:
pip3 install pandas pyarrow sqlalchemy psycopg2-binary

## Step 1: Start Postgres + pgAdmin (if not running)

From the folder that contains docker-compose.yaml, run:

docker compose up -d
docker ps
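`docker ps` shows whether the containers are up, but not whether the mapped port is reachable from the host. The following is a minimal sketch of a TCP check using only the standard library. The function name `port_is_open` is hypothetical, and port 5433 is assumed from the ingestion script in this document, so adjust it to your own docker-compose mapping. A successful connection only means something is listening. It does not prove that Postgres has finished starting.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Assumed host port; change it if your compose file maps a different one.
    print("Postgres port reachable:", port_is_open("localhost", 5433))
```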

## Step 2: Create ingestion script (loads parquet → Postgres)

Create a file named ingest_green.py and paste the following:

In [None]:
import pandas as pd
from sqlalchemy import create_engine

# Parquet file
PARQUET_FILE = "green_tripdata_2025-11.parquet"
CSV_ZONES = "taxi_zone_lookup.csv"

# Postgres connection (from docker-compose)
USER = "postgres"
PASSWORD = "postgres"
HOST = "localhost"   # host machine
PORT = "5433"        # mapped port
DB = "ny_taxi"

engine = create_engine(f"postgresql://{USER}:{PASSWORD}@{HOST}:{PORT}/{DB}")

print("Reading parquet...")
df = pd.read_parquet(PARQUET_FILE)
print("Rows:", len(df))

print("Writing green_trips table...")
df.to_sql("green_trips", engine, if_exists="replace", index=False)
print("Loaded green_trips")

print("Reading zones csv...")
zones = pd.read_csv(CSV_ZONES)
print("Writing zones table...")
zones.to_sql("zones", engine, if_exists="replace", index=False)
print("Loaded zones")
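The script above interpolates the credentials straight into the connection URL, which works for `postgres`/`postgres` but breaks if a password contains characters such as `@`, `/`, or `:`. As a sketch of one way to guard against that, the hypothetical helper `build_postgres_url` below percent-encodes the user and password with `urllib.parse.quote` before assembling the URL. It is an illustration, not part of the original script.

```python
from urllib.parse import quote


def build_postgres_url(user, password, host, port, db):
    """Build a postgresql:// URL, percent-encoding the user and password."""
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}:{port}/{db}"


if __name__ == "__main__":
    # Same values as the ingestion script; plain credentials pass through unchanged.
    print(build_postgres_url("postgres", "postgres", "localhost", "5433", "ny_taxi"))
    # A password with reserved characters gets escaped.
    print(build_postgres_url("postgres", "p@ss/w:rd", "localhost", "5433", "ny_taxi"))
```

The resulting string can be passed to `create_engine` in place of the f-string.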

## Step 3: Run ingestion

In [None]:
python3 ingest_green.py
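ingest_green.py sends the whole DataFrame in a single `to_sql` call. If that proves slow or heavy for a larger file, one option is the `chunksize` parameter of `DataFrame.to_sql`, which writes rows in batches. The sketch below wraps that in a hypothetical helper, `write_in_chunks`, and uses an in-memory SQLite database as a stand-in so it runs without Postgres. With the real setup you would pass the SQLAlchemy `engine` instead. Note that the DataFrame is still read into memory in full, and only the inserts are batched.

```python
import sqlite3

import pandas as pd


def write_in_chunks(df, table, con, chunk_size=10_000):
    """Replace `table` with the contents of df, inserting chunk_size rows per batch."""
    df.to_sql(table, con, if_exists="replace", index=False, chunksize=chunk_size)
    return len(df)


if __name__ == "__main__":
    # Made-up demo data; SQLite stands in for Postgres here (an assumption of this sketch).
    demo = pd.DataFrame({"trip_id": list(range(25)), "fare": [10.0] * 25})
    con = sqlite3.connect(":memory:")
    written = write_in_chunks(demo, "green_trips", con, chunk_size=10)
    stored = con.execute("SELECT COUNT(*) FROM green_trips").fetchone()[0]
    print("rows written:", written, "rows stored:", stored)
    con.close()
```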

## Step 4: Verify in Postgres

Open a psql session inside the running container:

In [None]:
docker exec -it postgres psql -U postgres -d ny_taxi

Then, at the psql prompt, list the tables and count the rows:

In [None]:
\dt
SELECT COUNT(*) FROM green_trips;
SELECT COUNT(*) FROM zones;

![Prepare Data Image](images/prepare-data.png)
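The same row-count check can also be scripted. The sketch below defines a hypothetical helper, `count_rows`, that works with any DB-API 2.0 connection. The demo uses in-memory SQLite with made-up rows so it runs without a database server. Against the setup in this document you would instead pass a connection from `psycopg2.connect(host="localhost", port=5433, dbname="ny_taxi", user="postgres", password="postgres")`, assuming those values match your docker-compose file.

```python
import re
import sqlite3


def count_rows(conn, table):
    """Return the row count of `table` using a DB-API 2.0 connection."""
    # Table names cannot be bound as query parameters, so allow only plain identifiers.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    cur = conn.cursor()
    try:
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]
    finally:
        cur.close()


if __name__ == "__main__":
    # Demo data only; these rows are not taken from the real zones file.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE zones (LocationID INTEGER)")
    conn.executemany("INSERT INTO zones VALUES (?)", [(1,), (2,), (3,)])
    print("zones:", count_rows(conn, "zones"))
    conn.close()
```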

To exit psql, type \q.