## 🐳 Step 1: Set Up Docker Environment 

To get our environment set up, we'll create two config files. 

1. Dockerfile
2. docker-compose.yaml

| **File** | **Purpose** |
|---------|-------------|
| `Dockerfile` | Defines your service’s environment (e.g., Python version, packages). |
| `docker-compose.yaml` | Coordinates how containers are built and run, including ports, volumes, and dependencies such as Postgres. |

#### 🧠 Docker Compose Mental Model

Instead of running manual Docker commands, Compose simplifies the workflow:

| **Goal** | **What Compose Does** | **Where It's Defined** |
|----------|------------------------|--------------------------|
| Define your app's image | Builds it from your `Dockerfile` | `docker-compose.yaml → build:` |
| Run your app + services | Launches containers for each service (app, Postgres, etc.) | `docker-compose.yaml → services:` |
| Share files between local + container | Mounts local folders inside containers | `docker-compose.yaml → volumes:` |
| Talk to the database | Exposes ports for local access | `docker-compose.yaml → ports:` |
| Start everything | `docker-compose up` | CLI command |
| Stop everything | `docker-compose down` | CLI command |

## Step 2: Run Postgres on Docker

- Create a `docker-compose.yaml` file to define the Postgres service and credentials:
  - Use the official Postgres image (`postgres:13`) from Docker Hub.
  - Use a **named volume** (`pgdata:/var/lib/postgresql/data`) to persist db data cleanly.
  - Note: DE Zoomcamp used a bind mount, but a **named volume** is preferred here because Docker manages where the data is stored.

- Start the Postgres container from the directory containing `docker-compose.yaml`:
  ```bash
  docker-compose up -d
  ```
- Install pgcli, a Postgres CLI client for running SQL queries from the terminal:
  ```bash
  pip install pgcli
  ```
- Connect to the running Postgres container with:
  ```bash
  pgcli -h localhost -p 5432 -u root -d nyc_taxi
  ```
  **Translation:** “Hey, connect to the Postgres server running on my machine (`localhost`), using port `5432`, log in as user `root`, and access the database named `nyc_taxi`.”
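
For reference, the `docker-compose.yaml` described at the start of this step might look roughly like the sketch below. The image, named volume, port, user, and database name come from the notes above. The service name `pgdatabase` and the password value are placeholders assumed for illustration, so swap in your own.

```yaml
services:
  pgdatabase:                          # hypothetical service name
    image: postgres:13                 # official image from Docker Hub
    environment:
      POSTGRES_USER: root              # matches `-u root` in the pgcli command
      POSTGRES_PASSWORD: change-me     # placeholder, not from the original notes
      POSTGRES_DB: nyc_taxi            # matches `-d nyc_taxi`
    volumes:
      - pgdata:/var/lib/postgresql/data   # named volume for persistence
    ports:
      - "5432:5432"                    # host:container, enables `-h localhost -p 5432`

volumes:
  pgdata:                              # declared here so Docker manages it
```

The top-level `volumes:` entry is what makes `pgdata` a named volume rather than a bind mount to a local folder.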


## 🐼 Step 3: Download Taxi Data & Read CSV with Pandas

### Download Taxi Data (CSV)
   - **Install `wget` using Homebrew:**  
     ```bash
     brew install wget
     ```
   - **Download the file from the command line:**  
     ```bash
     wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
     ```

### Read CSV with Pandas

- **URL for compressed taxi data file:**  
  - `url = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz"`

- **Extract file name from URL:**
  ```python
  csv_name = url.split("/")[-1]
  ```
- **Read CSV directly into Pandas DataFrame:**
  ```python
  import pandas as pd

  # pandas infers gzip compression from the .gz extension
  df = pd.read_csv(csv_name)
  ```
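
To try the two steps above without downloading the full file, here is a minimal sketch that writes a tiny gzip-compressed CSV and reads it back. The two sample rows are made up for illustration, and the temporary directory is only a stand-in for wherever `wget` saved the real file.

```python
import gzip
import os
import tempfile

import pandas as pd

url = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz"
csv_name = url.split("/")[-1]  # last path segment of the URL

# Stand-in for the downloaded file: a tiny gzip CSV with made-up rows
sample_path = os.path.join(tempfile.mkdtemp(), csv_name)
with gzip.open(sample_path, "wt", newline="") as f:
    f.write("VendorID,tpep_pickup_datetime,total_amount\n")
    f.write("1,2021-01-01 00:30:10,11.8\n")
    f.write("2,2021-01-01 00:51:20,4.3\n")

# No compression argument needed: pandas infers gzip from the .gz suffix
df = pd.read_csv(sample_path)
print(csv_name)
print(df.dtypes)  # the pickup column is read as text, not as a datetime
```

The `dtypes` output is worth a glance: the datetime column arrives as plain text, which is why Step 4 converts it before generating the schema.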

## 🗂️ Step 4: Generate Postgres Schema from DataFrame

### Generate Postgres Schema 
- Use `pd.io.sql.get_schema()` to generate a `CREATE TABLE` statement (DDL) from the DataFrame.
- Convert the pickup/dropoff columns from text to datetime with `pd.to_datetime()` so they map to timestamp types instead of `TEXT`.
- Create a SQLAlchemy engine to connect pandas to Postgres.
- Pass the engine to `get_schema()` via `con=` to generate Postgres-specific DDL. This output shows the table structure pandas will create in Postgres.
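
The effect of the datetime conversion on the generated DDL can be seen in a small sketch. To keep it runnable without a database server, it assumes an in-memory SQLite connection as a stand-in for the SQLAlchemy engine, so the type names are SQLite's rather than Postgres's. The table name `yellow_taxi_data` and the sample rows are hypothetical.

```python
import sqlite3

import pandas as pd

# Made-up sample rows; column names follow the yellow taxi CSV
df = pd.DataFrame(
    {
        "VendorID": [1, 2],
        "tpep_pickup_datetime": ["2021-01-01 00:30:10", "2021-01-01 00:51:20"],
        "tpep_dropoff_datetime": ["2021-01-01 00:36:12", "2021-01-01 00:52:19"],
        "total_amount": [11.8, 4.3],
    }
)

table_name = "yellow_taxi_data"  # hypothetical table name

# Stand-in connection (assumption). With Postgres you would instead pass
# a SQLAlchemy engine created from your own connection string.
conn = sqlite3.connect(":memory:")

# Before conversion: the datetime columns are plain strings
ddl_before = pd.io.sql.get_schema(df, name=table_name, con=conn)

# Convert the text columns to real datetimes
for col in ["tpep_pickup_datetime", "tpep_dropoff_datetime"]:
    df[col] = pd.to_datetime(df[col])

# After conversion: the same columns map to a timestamp type
ddl_after = pd.io.sql.get_schema(df, name=table_name, con=conn)

print(ddl_before)
print(ddl_after)
conn.close()
```

Note that `get_schema()` only returns the DDL string. It does not create the table.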

### Check that the table was created successfully

- Run pgcli to connect to your Postgres container:
  ```bash
  pgcli -h localhost -p 5432 -u root -d nyc_taxi
  ```
- Once connected, list the tables:
  ```sql
  \dt
  ```

## 🚕 Step 5: Batch Load Data into Postgres

### Batch Ingestion Setup
- Read CSV in chunks of 100,000 rows to avoid memory overload using `pd.read_csv` with `iterator=True` and `chunksize=100000`.
- Use `next(df_iter)` to fetch the first chunk.
- Print length of the chunk to confirm batch size.

### Table Creation and Data Ingestion
- Use `df.head(0)` (column headers only, no rows) with `to_sql(if_exists='replace')` to create a fresh, empty table in Postgres.
- Insert the first chunk of data with `to_sql(if_exists='append')`.
- Check the chunk for duplicate rows with `df.duplicated().sum()`. Note that this inspects the DataFrame in memory, not the Postgres table.
- Continue batch ingestion with a for loop over `df_iter` to append remaining chunks to the table.
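
The batch-loading steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the exact ingestion script: an in-memory SQLite connection stands in for the Postgres engine, the ten CSV rows are made up, the table name `yellow_taxi_data` is hypothetical, and `chunksize=4` replaces `100000` so the chunking is visible on tiny data.

```python
import io
import sqlite3

import pandas as pd

# Made-up stand-in for the taxi CSV: 10 distinct rows
csv_text = "VendorID,tpep_pickup_datetime,total_amount\n" + "".join(
    f"1,2021-01-01 00:{m:02d}:00,{m + 0.5}\n" for m in range(10)
)

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres engine (assumption)
table_name = "yellow_taxi_data"     # hypothetical table name

# iterator=True + chunksize returns a reader that yields DataFrames
df_iter = pd.read_csv(io.StringIO(csv_text), iterator=True, chunksize=4)

# Fetch the first chunk and confirm the batch size
df = next(df_iter)
print(len(df))

# Create an empty table from the column headers only
df.head(0).to_sql(name=table_name, con=conn, if_exists="replace", index=False)

# Insert the first chunk, then check it for duplicate rows
df.to_sql(name=table_name, con=conn, if_exists="append", index=False)
dup_count = int(df.duplicated().sum())

# Append the remaining chunks
for chunk in df_iter:
    chunk.to_sql(name=table_name, con=conn, if_exists="append", index=False)

row_count = conn.execute(f'SELECT COUNT(*) FROM "{table_name}"').fetchone()[0]
print(dup_count, row_count)
conn.close()
```

Comparing `row_count` against the number of rows in the source file is a simple way to confirm every chunk was appended exactly once.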


---
### Additional Notes
<br>

#### Key Docker Concepts
| **Concept**   | **Example**                        | **What It Means**                                         |
|---------------|------------------------------------|------------------------------------------------------------|
| **Image**     | `python:latest`, `postgres:15`     | A snapshot blueprint — like a cake recipe                  |
| **Container** | `docker run -it python:latest`     | A live instance of that image — like a baked cake          |
| **Dockerfile**| `FROM python:3.12 ...`             | A way to define your own image — ingredients + steps       |
| **Build**     | `docker build -t myimage .`        | Turn the Dockerfile into an image                          |
| **Run**       | `docker run -it myimage`           | Create a container from your image                         |
| **EntryPoint**| `--entrypoint=bash`                | Override the default "what to do when container starts"    |
| **Volume**    | `-v $(pwd):/app` (later)           | Mount your local files into the container (for persistence) |
