# What is Docker?
Docker is a tool used to automate the deployment of software using lightweight packages called **containers**. Containers are similar to virtual machines: they are isolated from one another and bundle their own software, libraries, and configuration files, but they are more portable and resource-friendly.

In data engineering, containers are used to deploy **data pipelines**. A data pipeline is a process that takes in data, does something with it (processing, cleaning, transforming), and then outputs the result. A pipeline can include many different steps.

<div style="background: #f8f8f2; text-align: center; border-radius: 6px">

```{mermaid}
flowchart LR
    A("CSV <br>(source)") ==> B[Data Pipeline]
    B[Data Pipeline] ==> C[("Postgres table<br>(dest)")]

    style A fill:#f8f8f2,stroke:darkgray,stroke-width:2px
    style B fill:#f8f8f2,stroke:darkgray,stroke-width:2px
    style C fill:#f8f8f2,stroke:darkgray,stroke-width:2px
```

</div>
<br>
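
As a minimal sketch of that flow, assuming a tiny in-memory CSV in place of a real source file and a plain list in place of a Postgres table (the column names and the `run_pipeline` helper are made up for illustration):

```python
import csv
import io

# Hypothetical source data; a real pipeline would read a file or an API.
RAW_CSV = """trip_id,distance_miles
1,2.5
2,
3,7.1
"""


def run_pipeline(raw_text):
    """Read CSV text, drop incomplete rows, convert types, return clean rows."""
    reader = csv.DictReader(io.StringIO(raw_text))
    cleaned = []
    for row in reader:
        if not row["distance_miles"]:
            continue  # cleaning step: skip rows with a missing distance
        cleaned.append(
            {
                "trip_id": int(row["trip_id"]),
                # transforming step: convert miles to kilometres
                "distance_km": round(float(row["distance_miles"]) * 1.609344, 2),
            }
        )
    return cleaned


rows = run_pipeline(RAW_CSV)
print(rows)
```

The intake, cleaning, transforming, and output stages are all inside one function here; in a real pipeline each stage is often a separate step.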

Advantages of Docker:

- Pipelines and analyses are *reproducible*.
- Local experiments are easy to set up.
- Integration tests (CI/CD) are easy to set up.
- Containers run easily on different cloud services.

# Installing Docker
On Windows, it's recommended to use a Linux environment such as [WSL](https://learn.microsoft.com/en-us/windows/wsl/install) or MINGW.

:::{.callout-tip}
To ensure the latest version, [install Docker](https://docs.docker.com/engine/install/ubuntu/) from the official Docker repository.
:::

First, update the `apt` package index:
```default
sudo apt update
```
Then install packages to allow `apt` to use a repository over HTTPS:
```default
sudo apt install ca-certificates curl gnupg lsb-release
```

Add Docker's official GPG key (for signing packages):
```default
sudo mkdir -m 0755 -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
```
Set up the repository:
```default
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
```
Finally, update the package index again so that `apt` picks up the new repository:
```default
sudo apt-get update
```
And install Docker, containerd, and Docker Compose:
```default
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
If the Docker daemon isn't running, start it with:
```default
sudo service docker start
```
Verify the installation was successful by running the `hello-world` image:
```default
sudo docker run hello-world
```

And congrats -- Docker is now installed!

# Using Docker
A container **image** is a template that contains the instructions for creating a container. Images are defined in a **Dockerfile** and run with `docker run <image name>`.
```default
docker run hello-world
```
This will search Docker Hub for a predefined image called `hello-world`, download it, and run it.

We can also access an Ubuntu terminal within a container:
```default
docker run -it ubuntu bash
```
- The `-it` flags run the container with an interactive terminal (`-i` keeps STDIN open, `-t` allocates a pseudo-terminal).
- The `bash` argument is the **command** passed to the `ubuntu` container, in this case to start bash.

**Docker containers are stateless** - a container does not save its state (installed software, packages, libraries, etc.) back to the image. Every `docker run` starts a fresh container from the image, so anything installed in a container that was stopped with `docker kill <container name>` is absent from the next one.

We can run a container with Python pre-installed with:
```default
docker run -it --entrypoint=bash python:3.9
```
- The `:3.9` after `python` is a **tag** used to run specific versions of images.
- The `--entrypoint=bash` option overrides the container's entry point. Here we start with a `bash` prompt instead of the Python interpreter so that we can install Python packages.
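
To make the `name:tag` convention concrete, here is a small sketch that splits an image reference into its two parts. The `split_image_reference` helper is hypothetical and only illustrates the naming convention; it is not how Docker itself parses references, and it ignores digests (`@sha256:...`).

```python
def split_image_reference(reference):
    """Split 'name:tag' into (name, tag); the tag defaults to 'latest'."""
    # Only look for a tag in the last path segment, so that a registry
    # port such as 'localhost:5000/app' is not mistaken for a tag.
    last_segment = reference.rsplit("/", 1)[-1]
    if ":" in last_segment:
        name, tag = reference.rsplit(":", 1)
        return name, tag
    return reference, "latest"


print(split_image_reference("python:3.9"))
print(split_image_reference("ubuntu"))
```

When no tag is given, as in `docker run -it ubuntu bash`, Docker uses the `latest` tag.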

Because containers are stateless, we define the software, packages, libraries, etc. that we want to start with by creating a Dockerfile:
```dockerfile
# this is an example of a simple docker file.

FROM python:3.9

RUN pip install pandas

WORKDIR /app
COPY pipeline.py pipeline.py

ENTRYPOINT [ "bash" ]
```
- `FROM`: inherit from an existing image.
- `RUN`: run commands.
- `WORKDIR`: set the working directory.
- `COPY`: copy pipeline files from the current directory to the container's working directory.
- `ENTRYPOINT`: define a default entry point.

Next we need to build the docker image.
```default
docker build -t test:pandas .
```
- `test` is the name of our image.
- `:pandas` is a tag.
- `.` is the build context: the directory that contains the Dockerfile and the files it copies.

And then use the image to create a container with:
```default
docker run -it test:pandas
```

# Running Postgres in Docker
We will use the official Docker Hub image for Postgres:
```default
docker run -it \
    -e POSTGRES_USER="root" \
    -e POSTGRES_PASSWORD="root" \
    -e POSTGRES_DB="ny_taxi" \
    -v $(pwd)/data:/var/lib/postgresql/data \
    -p 5432:5432 \
postgres:13
```
- `-e`: defines an environment variable.
- `-v`: mounts a volume (maps a local directory to a directory in the container filesystem).
- `-p`: maps a host port to a container port (`host:container`) so we can send and receive database requests.
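
Once the container is running, a quick way to confirm that the published port is reachable from the host is a plain TCP connection attempt. Below is a minimal sketch using only the standard library. The `port_is_open` helper is hypothetical, and an open port only shows that something is listening, not that Postgres is ready for queries.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumes the Postgres container from above is publishing port 5432.
print(port_is_open("localhost", 5432))
```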

We will access the Postgres database in the container with `pgcli`, a Python-based command-line client, from another terminal window (we need a few packages first):
```default
pip install pgcli
pip install psycopg2
pip install psycopg[binary]
```
Start a connection with the database with:
```default
pgcli -h localhost -p 5432 -u root -d ny_taxi
```
- `-h`: the hostname of the database server.
- `-p`: the port.
- `-u`: the username.
- `-d`: the specific database to connect to.

Take a look at the tables with `\dt`.

# Loading NYC Taxi Data
We will use Python in a Jupyter notebook to load the data. We will need the following packages:
```default
pip install jupyterlab
pip install pandas
pip install sqlalchemy
```

```python
from sqlalchemy import create_engine
import pandas as pd
pd.__version__
```
```default
'1.5.3'
```

Find out more about the NYC taxi data at the [NYC Taxi and Limousine Commission site](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). The published files were recently converted to Parquet, so download the CSV version here: [NYC yellow taxi data](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/yellow).
The data dictionary for these files can be found here: [Yellow Trips Data Dictionary](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf).

```python
df = pd.read_csv(
    '../data/yellow_tripdata_2021-01.csv',
    nrows=100,
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime']
)
df.info()
```
```default
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   VendorID               100 non-null    int64
 1   tpep_pickup_datetime   100 non-null    datetime64[ns]
 2   tpep_dropoff_datetime  100 non-null    datetime64[ns]
 3   passenger_count        100 non-null    int64
 4   trip_distance          100 non-null    float64
 5   RatecodeID             100 non-null    int64
 6   store_and_fwd_flag     100 non-null    object
 7   PULocationID           100 non-null    int64
 8   DOLocationID           100 non-null    int64
 9   payment_type           100 non-null    int64
 10  fare_amount            100 non-null    float64
 11  extra                  100 non-null    float64
 12  mta_tax                100 non-null    float64
 13  tip_am
```

To load the data into Postgres, we first need to generate a schema.

```python
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
engine.connect()
```
```default
<sqlalchemy.engine.base.Connection at 0x227bd8f2c90>
```
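
The connection string follows the pattern `postgresql://user:password@host:port/database`. As a small sketch, the helper below assembles it from its parts. The `build_postgres_url` name is hypothetical. The credentials are percent-encoded so that a password containing characters such as `@` or `/` does not break the URL.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Assemble a postgresql:// URL, percent-encoding the credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


print(build_postgres_url("root", "root", "localhost", 5432, "ny_taxi"))
```

With the values used in this walkthrough, the result is the same string passed to `create_engine` above.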

```python
print(pd.io.sql.get_schema(df, name='yellow_taxi_data', con=engine))
```
```default
CREATE TABLE yellow_taxi_data (
    "VendorID" BIGINT,
    tpep_pickup_datetime TIMESTAMP WITHOUT TIME ZONE,
    tpep_dropoff_datetime TIMESTAMP WITHOUT TIME ZONE,
    passenger_count BIGINT,
    trip_distance FLOAT(53),
    "RatecodeID" BIGINT,
    store_and_fwd_flag TEXT,
    "PULocationID" BIGINT,
    "DOLocationID" BIGINT,
    payment_type BIGINT,
    fare_amount FLOAT(53),
    extra FLOAT(53),
    mta_tax FLOAT(53),
    tip_amount FLOAT(53),
    tolls_amount FLOAT(53),
    improvement_surcharge FLOAT(53),
    total_amount FLOAT(53),
    congestion_surcharge FLOAT(53)
)
```
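
To give a feel for what schema generation involves, here is a rough sketch that maps the Python type of each value in one sample row to a SQL column type. This is not how pandas implements `get_schema`. The `SQL_TYPES` mapping and the `sketch_schema` helper are assumptions made for illustration.

```python
from datetime import datetime

# Assumed mapping from Python value types to SQL column types; chosen to
# resemble the generated statement above, not taken from the pandas source.
SQL_TYPES = {int: "BIGINT", float: "FLOAT(53)", str: "TEXT", datetime: "TIMESTAMP"}


def sketch_schema(table_name, sample_row):
    """Build a CREATE TABLE statement from one sample row (a dict)."""
    columns = [
        f'    "{name}" {SQL_TYPES[type(value)]}' for name, value in sample_row.items()
    ]
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(columns) + "\n)"


sample = {
    "VendorID": 1,
    "tpep_pickup_datetime": datetime(2021, 1, 1, 0, 30),
    "trip_distance": 2.1,
    "store_and_fwd_flag": "N",
}
print(sketch_schema("yellow_taxi_data", sample))
```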




When loading large amounts of data into a database, it's a good idea to break the insert up into chunks and load those one by one.

```python
df_iter = pd.read_csv(
    '../data/yellow_tripdata_2021-01.csv',
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    iterator=True,
    chunksize=100000,
    low_memory=False
)
```
```python
df = next(df_iter)
```

Let's insert the header row of the dataframe to create the table:

```python
df.head(n=0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')
```
```default
0
```

Check that the table was created with `\d yellow_taxi_data`. Now let's insert the first chunk of data:

```python
df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
```
```default
1000
```

Check that the first chunk was inserted with `SELECT * FROM yellow_taxi_data`. Now let's insert the rest of the data:

```python
from time import time
try:
    while True:
        t_start = time()

        df = next(df_iter)
        df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')

        t_end = time()
        print(f'insert another chunk..., took {t_end - t_start:.3f} seconds.')
except StopIteration:
    print("Reached the end of the iterator.")
finally:
    del df_iter
```
```default
insert another chunk..., took 0.478 seconds.
insert another chunk..., took 0.636 seconds.
insert another chunk..., took 1.045 seconds.
insert another chunk..., took 0.601 seconds.
insert another chunk..., took 1.093 seconds.
insert another chunk..., took 0.820 seconds.
insert another chunk..., took 0.862 seconds.
insert another chunk..., took 0.642 seconds.
insert another chunk..., took 0.640 seconds.
insert another chunk..., took 1.017 seconds.
insert another chunk..., took 0.635 seconds.
insert another chunk..., took 0.884 seconds.
insert another chunk..., took 0.740 seconds.
insert another chunk..., took 0.520 seconds.
Reached the end of the iterator.
```
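
The chunked-loading pattern above can also be tried without Postgres or pandas. The sketch below swaps in an in-memory SQLite database and a generated ten-row CSV, both stand-ins chosen purely for illustration, and inserts the rows in batches. The `load_in_chunks` helper and the `trips` table are hypothetical.

```python
import csv
import io
import sqlite3
from itertools import islice

# Hypothetical stand-in data: ten rows of (trip_id, fare).
CSV_TEXT = "trip_id,fare\n" + "".join(f"{i},{i * 1.5}\n" for i in range(10))


def load_in_chunks(csv_text, connection, chunk_size):
    """Insert CSV rows in batches of chunk_size; return the batch sizes."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    connection.execute("CREATE TABLE trips (trip_id INTEGER, fare REAL)")
    batch_sizes = []
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break  # the reader is exhausted
        connection.executemany("INSERT INTO trips VALUES (?, ?)", chunk)
        connection.commit()  # one commit per chunk keeps transactions small
        batch_sizes.append(len(chunk))
    return batch_sizes


conn = sqlite3.connect(":memory:")
sizes = load_in_chunks(CSV_TEXT, conn, chunk_size=4)
total = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
print(sizes, total)
```

Only one chunk is held in memory at a time, which is the point of the technique when the source file is large.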


# Using pgAdmin and Docker Networks

**pgAdmin** is a web-based GUI tool for interacting with Postgres databases. We can run it in a separate container:
```default
docker run -it \
    -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
    -e PGADMIN_DEFAULT_PASSWORD="root" \
    -p 8080:80 \
    dpage/pgadmin4
```

To list the currently running containers we can use `docker ps`. We should have two containers running:
```default
docker ps
```

However, in order for the pgAdmin client to communicate with our Postgres server, we need to set up a [Docker network](https://docs.docker.com/engine/reference/commandline/network/) so the containers can reach each other.

First we need to create a docker network:
```default
docker network create pg-network
```

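Next we need to stop our previously running containers. They were started without `--name`, so look up their IDs (or auto-generated names) with `docker ps` and pass those to `docker kill`:
```default
docker kill <postgres container ID>
docker kill <pgadmin container ID>
```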
and restart them with the `--network` and `--name` options:
```default
docker run -it \
    -e POSTGRES_USER="root" \
    -e POSTGRES_PASSWORD="root" \
    -e POSTGRES_DB="ny_taxi" \
    -v $(pwd)/data:/var/lib/postgresql/data \
    -p 5432:5432 \
    --network=pg-network \
    --name pgdatabase \
postgres:13
```
```default
docker run -it \
    -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
    -e PGADMIN_DEFAULT_PASSWORD="root" \
    -p 8080:80 \
    --network=pg-network \
    --name pgadmin \
    dpage/pgadmin4
```

We can list the existing networks with:
```default
docker network ls
```

Now navigate to pgAdmin at [localhost:8080](http://localhost:8080) and log in with the email and password we set.

Next we will create a server in pgAdmin: `Servers` > `Create` > `Server...` and fill in the fields:

* Host name/address: `pgdatabase`
* Port: `5432`
* Maintenance database: `postgres`
* Username: `root`
* Password: `root`
 
Now we are connected to our Postgres database and can make queries!

# Dockerizing the data ingestion script
In order to run the NYC taxi data ingestion script inside a Docker container we need to convert the Jupyter notebook code into a script named `ingest_data.py`:

```python
import argparse
import os
from time import time

import pandas as pd
from sqlalchemy import create_engine


def main(params):
    """Download NYC Yellow Taxi data and load it into a Postgres table."""
    user = params.user
    password = params.password
    host = params.host
    port = params.port
    db = params.db
    table_name = params.table_name
    url = params.url
    csv_name = "output.csv.gz"

    # download the gzipped csv (pandas infers the compression from ".gz")
    os.system(f"wget {url} -O {csv_name}")

    # set up the engine and a chunked reader over the downloaded file
    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')

    df_iter = pd.read_csv(
        csv_name,
        parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
        iterator=True,
        chunksize=100000,
        low_memory=False
    )
    df = next(df_iter)

    # create the table from the header row, then insert the first chunk
    df.head(n=0).to_sql(name=table_name, con=engine, if_exists='replace')
    df.to_sql(name=table_name, con=engine, if_exists='append')

    # insert the remaining chunks until the reader is exhausted
    try:
        while True:
            t_start = time()

            df = next(df_iter)
            df.to_sql(name=table_name, con=engine, if_exists='append')

            t_end = time()
            print(f'insert another chunk..., took {t_end - t_start:.3f} seconds.')
    except StopIteration:
        print("Reached the end of the iterator.")
    finally:
        del df_iter


if __name__ == "__main__":
    # named flags, so the script can be called with --user=..., --password=..., etc.
    parser = argparse.ArgumentParser(description="Ingest csv data to Postgres")
    parser.add_argument('--user', required=True, help='user name for postgres')
    parser.add_argument('--password', required=True, help='password for postgres')
    parser.add_argument('--host', required=True, help='host for postgres')
    parser.add_argument('--port', required=True, help='port for postgres')
    parser.add_argument('--db', required=True, help='database name for postgres')
    parser.add_argument('--table_name', required=True, help='table to write results to')
    parser.add_argument('--url', required=True, help='url of the csv file')

    args = parser.parse_args()

    main(args)
```

Now to run this script in a container, let's:

1. Write the Dockerfile
```Dockerfile
FROM python:3.11

RUN apt-get update && apt-get install -y wget
RUN pip install pandas sqlalchemy psycopg2

WORKDIR /app
COPY ingest_data.py ingest_data.py

ENTRYPOINT [ "python", "ingest_data.py" ]
```

2. Build the Docker image
```default
docker build -t taxi_ingest:v001 .
```

3. Run the Docker image
```default
URL="https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz"

docker run -it \
  --network=pg-network \
  taxi_ingest:v001 \
    --user=root \
    --password=root \
    --host=pgdatabase \
    --port=5432 \
    --db=ny_taxi \
    --table_name=yellow_taxi_trips \
    --url=${URL}
```
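
The `docker run` command above passes its options in `--name=value` form. Below is a minimal sketch of how `argparse` handles that style, assuming each option is declared with a leading `--`. Only three of the options are shown, and the default port is an assumption made for the example.

```python
import argparse


def build_parser():
    """Declare the options as named flags so `--user=root` style works."""
    parser = argparse.ArgumentParser(description="Ingest CSV data into Postgres")
    parser.add_argument("--user", required=True, help="Postgres user name")
    parser.add_argument("--port", type=int, default=5432, help="Postgres port")
    parser.add_argument("--table_name", required=True, help="destination table")
    return parser


# parse_args accepts an explicit list, which is handy for trying flags out.
args = build_parser().parse_args(["--user=root", "--table_name=yellow_taxi_trips"])
print(args.user, args.port, args.table_name)
```

Arguments declared without the leading dashes are positional instead, and would have to be passed as bare values in a fixed order.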

# Using the Docker CLI
The [Docker CLI](https://docs.docker.com/engine/reference/run/) has several useful commands for managing containers, networks, and volumes.

:::{.callout-note}
If you receive the error: `Cannot connect to the Docker daemon`, try restarting the docker service with:
```default
sudo service docker start
```
:::

List containers
```default
docker ps -a --format "table {{.ID}}\t{{.Image}}\t{{.Names}}"
```
- `-a`: list all containers, omit to list only running containers.
- `--format`: format the output with a quoted Go template such as `{{.Names}}`; the `table` prefix adds column headers.

Enter a container (start an interactive shell within the container)
``` default
docker exec -it CONTAINER bash
```
Stop (or kill) a running container
```default
docker stop CONTAINER
docker kill CONTAINER
```
Remove containers
```default
docker rm CONTAINER
```
Remove all containers
```default
docker rm -f $(docker ps -aq)
```
Create networks
```default
docker network create NETWORK
```
List networks
```default
docker network ls
```
Connect a container to (or disconnect it from) a network
```default
docker network connect NETWORK CONTAINER
docker network disconnect NETWORK CONTAINER
```
Remove networks
```default
docker network rm NETWORK
```
Remove all unused networks (networks that are not connected to a container)
```default
docker network prune
```
Create volumes
```default
docker volume create VOLUME
```
Display information about the volume
```default
docker volume inspect VOLUME
```
List volumes
```default
docker volume ls
```
Remove volumes
```default
docker volume rm VOLUME
```
Remove all unused volumes (volumes that are not connected to a container)
```default
docker volume prune
```

# Docker Compose
[Docker Compose](https://docs.docker.com/compose/compose-file/compose-file-v2/) is a tool for configuring and running multiple Docker containers (within an automatically created Docker network) from one convenient `docker-compose.yaml` file. We just need to list each service along with its image, environment, volume, and port settings:
```yaml
services:
  pgdatabase:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - postgres_data:/var/lib/postgresql/data:rw
    ports:
      - "5432:5432"
    networks:
      - pg-network

  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    volumes:
      - type: volume
        source: pgadmin_data
        target: /var/lib/pgadmin
    ports:
      - "8080:80"
    networks:
      - pg-network
  
  taxi_ingest:
    container_name: taxi-data-ingest
    image: taxi_ingest:v001
    build:
      context: .
      dockerfile: nyc_ingest.Dockerfile
    command:
      - --user=root
      - --password=root
      - --host=pgdatabase
      - --port=5432
      - --db=ny_taxi
      - --table_name=yellow_taxi_trips
      - --url=https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
    networks:
      - pg-network
    depends_on:
      - pgdatabase

networks:
  pg-network:
    name: pg-network

volumes:
  postgres_data:
  pgadmin_data:

``` 

Now run the compose file with:
```default
docker compose up -d --build
```
- `-d`: runs the containers in detached mode so we can keep using the current terminal.
- `--build`: builds the images before starting the containers (needed the first time and after changing the Dockerfile or the script).

And shut down the containers with:
```default
docker compose down
```
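
The `depends_on` entry in the compose file tells Compose to start `pgdatabase` before `taxi_ingest`. The ordering idea can be sketched as a topological sort over the dependency map. This illustrates the concept only, not Compose's implementation. Note that in this short form `depends_on` controls start order and does not wait for the database to be ready to accept connections.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Service name -> services it depends on, mirroring the compose file above.
DEPENDS_ON = {
    "pgdatabase": [],
    "pgadmin": [],
    "taxi_ingest": ["pgdatabase"],
}


def start_order(depends_on):
    """Return one valid start order in which dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())


order = start_order(DEPENDS_ON)
print(order)
```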

# What is Terraform?
Terraform is an open source IaC (Infrastructure as Code) tool by HashiCorp that allows you to provision resources with declarative configuration files.

This lets you manage the infrastructure lifecycle in a set of configuration files that can be version controlled, reused, and shared.

This also lets us track resource changes across data pipelines.

Install Terraform (Windows) from [terraform.io/downloads](https://developer.hashicorp.com/terraform/downloads).

Install Terraform (Linux) with:
```default
wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform
```

# Google Cloud Platform Setup
[Google Cloud Platform (GCP)](https://console.cloud.google.com/) works in **projects**. We need to create a new project.

1. Create New Project:
    - Project Name: "dtc-de-2023".
    - Project ID: generate, needs to be unique across GCP.
    
2. Create a Service Account
    - Navigate: `IAM & Admin` > `Service Accounts`
    - Service account name: "dtc-de-2023-user".
    - Service account ID: generated from the name; it only needs to be unique within the project.
    - Service account description: "DTC DE course".
    - Grant this service account the "Viewer" role for now.

:::{.callout-note}
A **service account** provides a set of credentials tied to a particular service or server. This allows services to interact with one another without using a user or admin account, and lets security permissions be set up for each service (or set of services).
:::

3. Generate a key
    - Navigate to Manage Keys > Add Key > Create New Key.
    - Set key type to JSON.
    
4. Download GCP SDK
    - [Install GCP SDK](https://cloud.google.com/sdk/docs/install-sdk) and check installation with `gcloud -v`.
    - Set environment variable for GCP credentials:
    ```default
    export GOOGLE_APPLICATION_CREDENTIALS="<path>/<to>/<service-account-authkey>.json"
    ```
    - Refresh token, and verify authentication:
    ```default
    gcloud auth application-default login
    ```

5. Create GCP Resources

We will create 2 GCP resources in the Google Cloud web environment:

- **Google Cloud Storage (GCS)**: Data Lake (a "bucket" to store raw data in a directory of flat files such as .csv, .parquet, etc.)
- **BigQuery**: Data Warehouse (data is modeled into fact and dimension tables for querying)

But first we need to set up permissions for our service account:

- Navigate to IAM, select your service account and click **Edit Principal**
- Add the following roles:
    - "Storage Admin" (allows us to create buckets).
    - "Storage Object Admin" (allows us to write files to buckets).
    - "BigQuery Admin" (for querying).
    
6. Enable APIs

APIs enable communication between our local environment and the cloud. We need to enable the following APIs:

- [Identity and Access Management (IAM) API](https://console.cloud.google.com/apis/library/iam.googleapis.com)
- [IAM Service Account Credentials API](https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com)
    
:::{.callout-note}
In a production environment you would create custom roles to associate permissions on a per resource basis. For example, you would create one service account for Terraform with all of the "admin" roles and then a separate account for each data pipeline with separate permissions.
:::

We are now ready to go. Refresh the service account's auth token if needed:
```default
gcloud auth application-default login
```
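
Before moving on, it can help to sanity-check the credentials setup. The sketch below reads the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, which Google's client libraries look up, and checks that it points to a service account key file. The `check_credentials` helper is hypothetical. It relies on the `type` and `client_email` fields found in downloaded JSON keys and does not contact GCP.

```python
import json
import os


def check_credentials(env=os.environ):
    """Return the service account email if the key file looks valid."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    with open(path, encoding="utf-8") as handle:
        key = json.load(handle)
    if key.get("type") != "service_account":
        raise ValueError(f"{path} is not a service account key file")
    return key["client_email"]
```

Calling `check_credentials()` with no arguments reads the real environment. Passing a dictionary instead makes it easy to try out with a test file.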

# Terraform Setup
Terraform files (.tf) are written in the HashiCorp Configuration Language (HCL).

Create 3 files:

- `.terraform-version`: a simple file indicating the Terraform version.
- `main.tf`: the main configuration file (the filename can be anything; the convention is "main").
- `variables.tf`: declares the variables and local values referenced in `main.tf`.

Example .tf file:
```default
terraform {
    required_version = ">= 1.0"
    backend "local" {} # can change from "local" to "gcs" or "s3" depending on your provider.
    required_providers {
        google = {
            source = "hashicorp/google" # declaring a predefined public configuration for this service provider.
        }
    }
}
```

## Providers
Terraform works with plugins called **providers**, one for each platform or service it manages. These add a set of predefined resources and data types that Terraform can manage. The Terraform Registry is the main directory of publicly available providers for most major cloud infrastructure platforms. Providers are configured with a `provider` block:
```default
provider "google" {
    project = var.project
    region = var.region
    // credentials = file(var.credentials) # instead of setting env variables.
}
```

## Variables
Notice that certain values in `main.tf` are preceded by `var.`. These values come from the `variables.tf` file. We begin the `variables.tf` file by declaring `local` values:
```terraform
locals {
    data_lake_bucket = "dtc_data_lake"
}
```
Variables are generally passed at runtime. Variables with a default are *optional* runtime arguments; variables without a default are *mandatory* runtime arguments.
```default
# this variable is mandatory and its value will be entered at runtime
variable "project" {
    description = "GCP Project ID" 
}

# this variable is optional as it has a default "us-west1"
variable "region" {
    description = "Region for GCP resources."
    default = "us-west1"
    type = string
}
```
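
The difference between mandatory and optional variables can be sketched as a small resolution step. This is a conceptual model only. The `resolve_variables` helper is hypothetical, and where Terraform prompts for a missing value when run interactively, the sketch raises an error instead.

```python
# Declared variables, mirroring the variables.tf example above.
# A missing "default" key means the value must be supplied at runtime.
DECLARED = {
    "project": {"description": "GCP Project ID"},
    "region": {"description": "Region for GCP resources.", "default": "us-west1"},
}


def resolve_variables(declared, supplied):
    """Combine supplied values with defaults; fail on missing mandatory ones."""
    resolved = {}
    for name, spec in declared.items():
        if name in supplied:
            resolved[name] = supplied[name]
        elif "default" in spec:
            resolved[name] = spec["default"]
        else:
            raise ValueError(f"no value for required variable '{name}'")
    return resolved


print(resolve_variables(DECLARED, {"project": "my-project-id"}))
```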

## Execution
- `terraform init`: Initialize the working directory and install the required provider plugins.
- `terraform plan`: Preview the actions Terraform would take by comparing the configuration against the current state.
- `terraform apply`: Apply the changes to the cloud: create new resources or update existing ones.
- `terraform destroy`: Remove your stack and resources from the cloud (used to avoid paying for idle resources).
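
Conceptually, `terraform plan` compares the desired configuration with the recorded state and lists what would have to change. The sketch below models that comparison with plain dictionaries. It is a simplification for illustration only: the `plan` helper and the resource names are hypothetical, and Terraform's real planner also handles dependencies between resources and much more.

```python
def plan(current, desired):
    """Compare two {resource_name: settings} dicts and list the actions."""
    actions = []
    for name in sorted(desired.keys() - current.keys()):
        actions.append(("create", name))
    for name in sorted(desired.keys() & current.keys()):
        if desired[name] != current[name]:
            actions.append(("update", name))
    for name in sorted(current.keys() - desired.keys()):
        actions.append(("destroy", name))
    return actions


# Hypothetical resources, named for illustration only.
current_state = {"bucket": {"location": "us-west1"}, "old_dataset": {}}
desired_config = {"bucket": {"location": "us-east1"}, "dataset": {}}
print(plan(current_state, desired_config))
```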
