# Module 1 - Docker + Postgres
## Introduction to Docker

Docker is a set of platform-as-a-service products that use OS-level virtualization to deliver software in packages called **containers**. Containers are isolated from each other and bundle their own software, libraries, and configuration files, though they can communicate with each other through defined channels.

```{mermaid}
flowchart LR
  A("Source (csv file)") --> B[Data Pipeline]
  B --> C("Destination (database)")

  style A fill: snow, stroke: silver
  style B fill: lightskyblue, stroke: lightslategrey
  style C fill: snow, stroke: silver
```

Docker is useful in data engineering for:

- Reproducibility - pipelines can be reused across local and cloud environments.
- Local experiments - easier to set up local experiments.
- Integration tests (CI/CD) - allows for smoother integration tests.
- Serverless - easier to set up on serverless environments like AWS Lambda, Google Cloud Functions, etc.
- Spark - easier to set up dependencies for Spark.

### Installing Docker

Using `apt` from the [Docker docs](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository):

1. Set up Docker's `apt` repository.
```default
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
```

2. Install the latest Docker packages.
```default
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```

3. Verify that Docker is installed by running the `hello-world` image.
```default
sudo docker run hello-world
```

:::{.callout-important}
If the Docker daemon isn't running, start it with
```default
sudo service docker start
```
:::

### Docker Images
A Docker image is a read-only template containing a set of instructions for creating a container. Pre-defined images are available to download from container registries such as [Docker Hub](https://hub.docker.com).

To run a docker image:
```default
docker run -it ubuntu bash
```

- `-it` : runs the container in interactive mode with a terminal attached.
- `ubuntu` : the name of the image (if the image is not available locally, Docker pulls it from Docker Hub).
- `bash` : the command to run inside the container, in this case opening a bash shell.

To run a Python image:
```default
docker run -it --entrypoint=bash python:3.9
```

- `--entrypoint=bash` : overrides the image's default entrypoint, so the container opens a bash shell instead of the Python REPL.
- `:3.9` : everything after the `:` is a tag, in this case it indicates which version of the Python image to download and use.
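
To make the `name:tag` convention concrete, here is a minimal Python sketch that splits an image reference into its name and tag. The helper `split_image_reference` is hypothetical and written only for illustration; it assumes a simple reference without a registry port or digest, and falls back to `latest`, the tag Docker uses when none is given.

```python
def split_image_reference(reference: str) -> tuple[str, str]:
    """Split 'name:tag' into (name, tag), defaulting the tag to 'latest'.

    Simplified on purpose: references that include a registry port
    (e.g. 'localhost:5000/img') or a digest are not handled.
    """
    name, sep, tag = reference.partition(":")
    return (name, tag) if sep else (name, "latest")


print(split_image_reference("python:3.9"))  # ('python', '3.9')
print(split_image_reference("ubuntu"))      # ('ubuntu', 'latest')
```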

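Docker containers are **stateless** in the sense that each `docker run` starts a fresh container from the image, so changes made inside a previous container (such as installed packages) are not carried over to the next run.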

### Dockerfile
A Dockerfile is used to define the instructions used to create an image.

- Usually begins with `FROM <image name>:<image tag>` to base the new image on a template image.
- Use `RUN <command>` to execute commands.
- Use `ENTRYPOINT <command>` to define the command that runs when the container starts.
- Use `WORKDIR <path>` to define the container's working directory.
- Use `COPY <source> <destination>` to copy files from the local build directory into the image.

``` {.Dockerfile}
FROM python:3.11

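# Refresh the package index first; -y answers the install prompt non-interactively
RUN apt-get update && apt-get install -y wget
RUN pip install pandas sqlalchemy psycopg2

WORKDIR /app
COPY ingest_data.py ingest_data.py

# Open a shell on start; use ["python", "ingest_data.py"] to run the script instead
ENTRYPOINT ["bash"]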
```


### Building the image
Once the Dockerfile is complete, it is used to build the image with `docker build`.
```default
docker build -t test:pandas .
```

- `-t` : is used to set an image name and tag.
- `test` : the name of the image.
- `pandas` : is the tag (usually used for versioning).
- `.` : the build context, where Docker looks for a file named `Dockerfile` by default; `.` indicates the current directory.


### Running Postgres in Docker
Postgres is a popular SQL database management system. Running Postgres in Docker requires environment variables, volumes, and ports. **Volumes** map a directory on the host filesystem to a directory in the container filesystem so that data persists after the container stops. To communicate with the database inside the container, we need to map a **port** on the host machine to a port in the container. To run the Postgres image:

```default
docker run -it \
    -e POSTGRES_USER="root" \
    -e POSTGRES_PASSWORD="root" \
    -e POSTGRES_DB="ny_taxi" \
    -v c:/.../ny_taxi_postgres_data:/var/lib/postgresql/data \
    -p 5432:5432 \
    postgres:13
```

- `-e` : pass an environment variable to the container.
- `-v <local path>:<container path>` : mount a local directory into the container so that the data persists on the host.
- `-p <host port>:<container port>` : map a port on the host machine to a port in the container.
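
As a rough illustration of how these flags fit together, the following Python sketch assembles the same kind of `docker run` command from dictionaries. The function `build_docker_run` and the host path `/home/user/ny_taxi_postgres_data` are hypothetical, chosen only for the example; the sketch prints the command and does not run Docker.

```python
import shlex


def build_docker_run(image, env=None, volumes=None, ports=None):
    """Return the argument list for a `docker run -it` command."""
    args = ["docker", "run", "-it"]
    for key, value in (env or {}).items():
        args += ["-e", f"{key}={value}"]                  # environment variable
    for host_path, container_path in (volumes or {}).items():
        args += ["-v", f"{host_path}:{container_path}"]   # volume mapping
    for host_port, container_port in (ports or {}).items():
        args += ["-p", f"{host_port}:{container_port}"]   # port mapping
    args.append(image)                                    # the image goes last
    return args


command = build_docker_run(
    "postgres:13",
    env={"POSTGRES_USER": "root", "POSTGRES_PASSWORD": "root", "POSTGRES_DB": "ny_taxi"},
    # Hypothetical host directory used only for this example.
    volumes={"/home/user/ny_taxi_postgres_data": "/var/lib/postgresql/data"},
    ports={5432: 5432},
)
print(shlex.join(command))
```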

**pgcli** is a useful command line tool for working with Postgres databases. Install with Python:
```default
pip install pgcli
```

After installation is complete, we can connect to our Postgres container:
```default
pgcli -h localhost -p 5432 -u root -d ny_taxi
```
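
The same connection parameters can also be written as a database URL, which is the form SQLAlchemy's `create_engine` accepts. The sketch below builds one with the standard library only; `postgres_url` is a hypothetical helper, and percent-encoding the password is a precaution that matters only when it contains characters such as `@` or `/`.

```python
from urllib.parse import quote_plus


def postgres_url(user, password, host, port, db):
    """Build a postgresql:// URL, escaping special characters in the password."""
    return f"postgresql://{user}:{quote_plus(password)}@{host}:{port}/{db}"


print(postgres_url("root", "root", "localhost", 5432, "ny_taxi"))
# postgresql://root:root@localhost:5432/ny_taxi
```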

### Connecting pgAdmin
pgAdmin is a convenient GUI for interacting with Postgres databases. We can run pgAdmin in its own container:
```default
docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  dpage/pgadmin4
```

However, to communicate with our Postgres database running in a separate container, we need to connect the containers using a docker **network**:
```default
docker network create pg-network
```

We will need to rerun our previous containers with extra network arguments, so stop the currently running containers:
```default
docker stop <container id>
```

Then restart the Postgres and pgAdmin containers with the network arguments:
```default
docker run -it \
    -e POSTGRES_USER="root" \
    -e POSTGRES_PASSWORD="root" \
    -e POSTGRES_DB="ny_taxi" \
    -v c:/.../ny_taxi_postgres_data:/var/lib/postgresql/data \
    -p 5432:5432 \
    --network=pg-network \
    --name pg-database \
    postgres:13

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  --network=pg-network \
  --name pgadmin \
  dpage/pgadmin4
```

- `--network` : add the container to this network.
- `--name` : give the container a name; other containers on the same network can reach it by this name.

Now we can log into pgAdmin at `http://localhost:8080` and connect to our Postgres database, using `pg-database` as the host name.

### Reading data into the Postgres database
As an exercise, we will ingest [NY Taxi Trip Record](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) data for January 2021 into the database.

Create a folder and download the dataset:
```default
mkdir ny_taxi_data && cd ny_taxi_data
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet
```

Alternatively, you can download the gzipped CSV version of the dataset, which is the format the ingestion script in this section reads, from: `https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz`

We will read the data into the database using the Python libraries `pandas` and `sqlalchemy`:

```python
import os
import argparse
from time import time

import pandas as pd
from sqlalchemy import create_engine


def main(params: argparse.Namespace) -> None:
    user = params.user
    password = params.password
    host = params.host
    port = params.port
    db = params.db
    table_name = params.table_name
    url = params.url

    csv_name = 'output.csv.gz'

    # download the gzipped CSV file
    os.system(f"wget {url} -O {csv_name}")

    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')

    # read the file lazily in chunks of 100,000 rows to keep memory use bounded
    df_iter = pd.read_csv(csv_name, iterator=True, chunksize=100000, compression='gzip')

    for df_chunk in df_iter:
        t_start = time()

        # convert the timestamp columns from strings to datetimes
        df_chunk.tpep_pickup_datetime = pd.to_datetime(df_chunk.tpep_pickup_datetime)
        df_chunk.tpep_dropoff_datetime = pd.to_datetime(df_chunk.tpep_dropoff_datetime)

        # append the chunk to the table, creating the table on the first chunk
        df_chunk.to_sql(name=table_name, con=engine, if_exists='append')

        t_end = time()
        print(f'inserted chunk in {t_end - t_start:.3f} seconds.')


if __name__ == '__main__':
    # Parse the command line arguments and call the main program
    parser = argparse.ArgumentParser(description='Ingest CSV data to Postgres')

    parser.add_argument('--user', help='user name for postgres')
    parser.add_argument('--password', help='password for postgres')
    parser.add_argument('--host', help='host for postgres')
    parser.add_argument('--port', help='port for postgres')
    parser.add_argument('--db', help='database name for postgres')
    parser.add_argument('--table_name', help='name of the table where we will write the results to')
    parser.add_argument('--url', help='url of the csv file')

    args = parser.parse_args()

    main(args)
```
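
The chunked-insert pattern in the script can be tried without a running database. The sketch below swaps in the standard library's `csv` and `sqlite3` modules for pandas and Postgres, and uses a small in-memory CSV with made-up rows; the table name `trips`, the column names, and the helper `ingest_in_chunks` are all hypothetical.

```python
import csv
import io
import sqlite3
from itertools import islice

# Made-up sample data standing in for the downloaded file.
SAMPLE_CSV = "trip_id,fare\n1,7.5\n2,12.0\n3,3.25\n4,20.0\n5,9.0\n"


def ingest_in_chunks(csv_text, conn, chunk_size=2):
    """Insert CSV rows into the `trips` table chunk_size rows at a time."""
    reader = csv.DictReader(io.StringIO(csv_text))
    conn.execute("CREATE TABLE IF NOT EXISTS trips (trip_id INTEGER, fare REAL)")
    chunks = 0
    while True:
        chunk = list(islice(reader, chunk_size))  # take the next chunk_size rows
        if not chunk:
            break
        conn.executemany(
            "INSERT INTO trips (trip_id, fare) VALUES (:trip_id, :fare)", chunk
        )
        conn.commit()
        chunks += 1
    return chunks


conn = sqlite3.connect(":memory:")
n_chunks = ingest_in_chunks(SAMPLE_CSV, conn)
n_rows = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
print(f"inserted {n_rows} rows in {n_chunks} chunks")
```

Committing after every chunk mirrors the script's behaviour of writing each chunk as soon as it is read, so only one chunk is held in memory at a time.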

### Docker Compose
Managing several separate containers can quickly become cumbersome. Docker Compose is a tool for configuring and running multi-container Docker applications from a *single* YAML file.

In the YAML file, we define **services**, **networks**, and **volumes**.

#### Services
Services define the separate containers and their configurations. We first name the service (e.g. `pgdatabase`) and then either pull an image from Docker Hub or build an image from a Dockerfile.

```yaml
services:
    pgdatabase:
        image: postgres:13  # pull an image 
        ...
    custom-container:
        build: /path/to/dockerfile/  # build an image from a directory containing a Dockerfile
        image: custom-container  # name the image
        ...
    custom-online-container:
        build: https://github.com/custom/online/container.git  # build an image from a url
        image: custom-online-container  # name the image
```

#### Networks
By default, Compose creates a single network for all services defined in a compose file. A service can communicate with another service on the same network by referencing its service name and container port.

Ports are usually exposed within the images but to communicate with containers from the host machine, ports must be mapped in the compose file:

```yaml
services:
    pgdatabase:
        image: postgres:13
        ports:
            - "5432:5432"
```

We can now communicate with the `pgdatabase` container through port `5432`.
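
One quick way to confirm that a mapped port is reachable from the host is to attempt a TCP connection. The following is a minimal sketch using Python's `socket` module; `port_is_open` is a hypothetical helper, and it only checks that something accepts connections on the port, not that the listener is Postgres.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Check the port that the compose file maps to the host.
print(port_is_open("localhost", 5432))
```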

We can also define additional virtual networks to segregate our containers if needed:

```yaml
services:
    pgdatabase:
        image: postgres:13
        networks:
            - database-network
    otherservice:
        image: python:3.11
        networks:
            - python-network

networks:
    database-network: {}
    python-network: {}
```

#### Volumes
There are three types of volumes: *anonymous*, *named*, and *host*.

Docker manages anonymous and named volumes, automatically mounting them in self-generated directories in the host. Host volumes allow us to specify an existing folder in the host.

Host volumes are configured at the service level. Named volumes are declared under the top-level `volumes` key, which makes them available to several services.

```yaml
services:
    pgdatabase:
        image: postgres:13
        volumes:
            - named-global-volume:/volumes/global-volume  # no leading slash: refers to the named volume
            - /home:/volumes/read-write-volume:rw  # rw indicates read/write permissions
            - /home:/volumes/read-only-volume:ro  # ro indicates read only
        ...
    pgadmin:
        image: dpage/pgadmin4
        volumes:
            - named-global-volume:/volumes/another-volume
        ...
volumes:
    named-global-volume:
```

In this case, both containers will have read/write access to the `named-global-volume` shared folder, regardless of the path they've mapped it to in the container.

#### Dependencies
Often we need to create a dependency chain between services so that they start in a certain order (like starting up a Postgres database before pgAdmin). We can also specify conditions and requirements to control what happens when dependent services start or complete.

```yaml
services:
    pgdatabase:
        image: postgres:13
        ...
    pgadmin:
        image: dpage/pgadmin4
        depends_on:
            pgdatabase:
                condition: service_healthy  # wait until pgdatabase is "healthy" (requires a healthcheck on pgdatabase)
        ...
```
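
A `service_healthy` condition relies on a health check that is retried until it passes. The retry idea can be sketched in plain Python as below; `wait_until_healthy`, its parameters, and the fake check are hypothetical and only mimic the behaviour, they are not part of Compose.

```python
import time


def wait_until_healthy(check, retries=5, interval=0.1):
    """Call check() until it returns True or the retries run out."""
    for attempt in range(1, retries + 1):
        if check():
            return attempt  # number of attempts that were needed
        time.sleep(interval)
    raise TimeoutError(f"service not healthy after {retries} attempts")


# Fake check that starts passing on the third call.
calls = {"n": 0}


def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until_healthy(fake_check))  # 3
```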

Putting these pieces together, a compose file for the Postgres and pgAdmin containers looks like:

```yaml
services:
    pgdatabase:
        image: postgres:13
        environment:
            - POSTGRES_USER=root
            - POSTGRES_PASSWORD=root
            - POSTGRES_DB=ny_taxi
        volumes:
            - "./ny_taxi_data:/var/lib/postgresql/data:rw"
        ports:
            - "5432:5432"
    pgadmin:
        image: dpage/pgadmin4
        environment:
            - PGADMIN_DEFAULT_EMAIL=admin@admin.com
            - PGADMIN_DEFAULT_PASSWORD=root
        ports:
            - "8080:80"
```

#### Environment Variables
Compose supports both static values and variables written as `${NAME}`, which are substituted when the compose file is parsed. The values are typically supplied by a `.env` file in the project directory:
```yaml
services:
    pgdatabase:
        image: postgres:${POSTGRES_VERSION}
        environment:
            DB: ${DATABASE_NAME}
        env_file:
            - .env
```

Where the `.env` file contains:
```default
# .env
POSTGRES_VERSION=13
DATABASE_NAME=mydb
```
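
To illustrate what this substitution amounts to, the sketch below parses `.env`-style text and fills in `${...}` placeholders using Python's `string.Template`. It is a simplified stand-in, not Compose's actual implementation: `parse_env` is a hypothetical helper, and features such as default values or quoting are not handled.

```python
from string import Template

ENV_TEXT = """\
# .env
POSTGRES_VERSION=13
DATABASE_NAME=mydb
"""


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


env = parse_env(ENV_TEXT)
print(Template("postgres:${POSTGRES_VERSION}").substitute(env))  # postgres:13
print(Template("${DATABASE_NAME}").substitute(env))              # mydb
```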

Then we can run these containers with:
```default
docker compose -f compose.yml up -d --build
```

And to shut down these containers, we can use the following (`-v` also removes the volumes declared in the compose file):
```default
docker compose down -v
```

