## Module 1 Homework

## Docker & SQL

In this homework we'll prepare the environment
and practice with Docker and SQL.

## Question 1. Knowing docker tags

### Answer 1 `--rm`

Run the command to get information on Docker:

```docker --help```

Now run the command to get help on the "docker build" command:

```docker build --help```

Do the same for "docker run".

Which tag has the following text? - *Automatically remove the container when it exits* 

- `--delete`
- `--rc`
- `--rmc`
- `--rm`
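Rather than reading the whole help page, the flag can be located by searching the help text for its description. This is a minimal sketch, assuming the help text is available as a string. `find_flag` and the `SAMPLE_HELP` excerpt are illustrative names, and only the `--rm` description is quoted from the question above.

```python
import re
from typing import Optional

# Hand-written stand-in for a few lines of `docker run --help` output.
# Only the --rm description is quoted from the question; the rest is illustrative.
SAMPLE_HELP = """\
  -d, --detach         Run container in background and print container ID
  -i, --interactive    Keep STDIN open even if not attached
      --rm             Automatically remove the container when it exits
  -t, --tty            Allocate a pseudo-TTY
"""


def find_flag(help_text: str, description: str) -> Optional[str]:
    """Return the long flag whose help line contains `description`, or None."""
    for line in help_text.splitlines():
        if description.lower() in line.lower():
            match = re.search(r"--[\w-]+", line)
            if match:
                return match.group(0)
    return None


print(find_flag(SAMPLE_HELP, "Automatically remove the container"))  # prints --rm
```

With Docker installed, the real help text could be passed in instead, for example from `subprocess.run(["docker", "run", "--help"], capture_output=True, text=True).stdout`.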


## Question 2. Understanding docker first run 

### Answer 2 `wheel      0.42.0`

Run Docker with the `python:3.9` image in interactive mode, with `bash` as the entrypoint.
Then check which Python packages are installed (use `pip list`).

What is the version of the package *wheel*?

- 0.42.0
- 1.0.0
- 23.0.1
- 58.1.0

```bash
docker run -it python:3.9 bash

pip list
Package    Version
---------- -------
pip        23.0.1
setuptools 58.1.0
wheel      0.42.0
```
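If the version needs to be read programmatically rather than by eye, the `pip list` table can be parsed. A minimal sketch: `parse_pip_list` is a hypothetical helper, and the text it parses is the output copied from the session above.

```python
# Output of `pip list` copied from the container session above.
PIP_LIST_OUTPUT = """\
Package    Version
---------- -------
pip        23.0.1
setuptools 58.1.0
wheel      0.42.0
"""


def parse_pip_list(output: str) -> dict:
    """Map package name to version, skipping the two header lines."""
    versions = {}
    for line in output.splitlines()[2:]:
        parts = line.split()
        if len(parts) == 2:
            name, version = parts
            versions[name] = version
    return versions


print(parse_pip_list(PIP_LIST_OUTPUT)["wheel"])  # prints 0.42.0
```

Inside the container, `importlib.metadata.version("wheel")` is another way to get the same value without parsing text.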


## Prepare Postgres

Run Postgres and load the data as shown in the videos.
We'll use the green taxi trips from September 2019:

```wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-09.csv.gz```

You will also need the dataset with zones:

```wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv```

Download this data and load it into Postgres (with Jupyter notebooks or with a pipeline).
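For the pipeline route, one common pattern is to stream the compressed CSV and insert it in batches instead of loading everything at once. The sketch below is only an illustration of that pattern: it uses the standard library's `sqlite3` as a stand-in for Postgres so that it runs anywhere, `load_csv_in_batches` is a hypothetical helper, and the sample rows are a tiny made-up extract rather than the real file.

```python
import csv
import gzip
import io
import sqlite3
from itertools import islice


def load_csv_in_batches(fileobj, conn, table, batch_size=1000):
    """Insert rows from a text-mode CSV file object into `table`, batch by batch."""
    reader = csv.reader(fileobj)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    # Table and column names are interpolated directly: acceptable for a sketch
    # with trusted input, not for untrusted names.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
    insert_sql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    total = 0
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        conn.executemany(insert_sql, batch)
        conn.commit()
        total += len(batch)
    return total


# Tiny made-up sample, gzip-compressed in memory to mimic a .csv.gz file.
sample_csv = "VendorID,trip_distance\n2,2.0\n2,3.2\n1,0.9\n"
compressed = gzip.compress(sample_csv.encode("utf-8"))

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres connection
with gzip.open(io.BytesIO(compressed), mode="rt", newline="") as f:
    inserted = load_csv_in_batches(f, conn, "green_taxi_data", batch_size=2)

row_count = conn.execute("SELECT COUNT(*) FROM green_taxi_data").fetchone()[0]
print(inserted, row_count)  # prints 3 3
```

Batching keeps memory use bounded by `batch_size` rows, which matters more for larger files than for this month of data.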

In [6]:
# !wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-09.csv.gz


In [7]:
import gzip
import polars as pl
from sqlalchemy import create_engine
from time import time

# create conn string
conn_string = 'postgresql://root:root@localhost:5432/ny_taxi'
engine = create_engine(conn_string)
engine.connect()

<sqlalchemy.engine.base.Connection at 0x7f6ceaf7feb0>

In [8]:
pl.__version__

'0.20.5'

In [9]:
!pwd

/mnt/c/Users/ellabelle/Github/data-engineering-zoomcamp/cohorts/2024/01-docker-terraform


In [25]:
file_path = 'green_tripdata_2019-09.csv.gz'
# Decompress in memory and let polars parse the datetime columns
with gzip.open(file_path, 'rb') as f:
    df = pl.read_csv(f.read(), try_parse_dates=True)

In [11]:
df.shape

(449063, 20)

In [12]:
df.schema

OrderedDict([('VendorID', Int64),
             ('lpep_pickup_datetime',
              Datetime(time_unit='us', time_zone=None)),
             ('lpep_dropoff_datetime',
              Datetime(time_unit='us', time_zone=None)),
             ('store_and_fwd_flag', String),
             ('RatecodeID', Int64),
             ('PULocationID', Int64),
             ('DOLocationID', Int64),
             ('passenger_count', Int64),
             ('trip_distance', Float64),
             ('fare_amount', Float64),
             ('extra', Float64),
             ('mta_tax', Float64),
             ('tip_amount', Float64),
             ('tolls_amount', Float64),
             ('ehail_fee', String),
             ('improvement_surcharge', Float64),
             ('total_amount', Float64),
             ('payment_type', Int64),
             ('trip_type', Int64),
             ('congestion_surcharge', Float64)])

In [13]:
df.head(2)

| VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | datetime[μs] | datetime[μs] | str | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | str | f64 | f64 | i64 | i64 | f64 |
| 2 | 2019-09-01 00:10:53 | 2019-09-01 00:23:46 | "N" | 1 | 65 | 189 | 5 | 2.0 | 10.5 | 0.5 | 0.5 | 2.36 | 0.0 | | 0.3 | 14.16 | 1 | 1 | 0.0 |
| 2 | 2019-09-01 00:31:22 | 2019-09-01 00:44:37 | "N" | 1 | 97 | 225 | 5 | 3.2 | 12.0 | 0.5 | 0.5 | 0.0 | 0.0 | | 0.3 | 13.3 | 2 | 1 | 0.0 |


In [14]:
df.write_database(table_name="green_taxi_data",  connection=conn_string, if_table_exists="replace")

63

In [15]:
query = """
SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND 
    schemaname != 'information_schema'
"""

pl.read_database_uri(query=query, uri=conn_string)

| schemaname | tablename | tableowner | tablespace | hasindexes | hasrules | hastriggers | rowsecurity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | bool | bool | bool | bool |
| "public" | "green_taxi_dat… | "root" | | False | False | False | False |
| "public" | "yellow_taxi_da… | "root" | | True | False | False | False |
| "public" | "zones" | "root" | | True | False | False | False |


In [16]:
query = """
SELECT COUNT(*) 
FROM green_taxi_data
"""

pl.read_database_uri(query=query, uri=conn_string)


| count |
| --- |
| i64 |
| 449063 |


In [28]:
!wc -l green_tripdata_2019-09.csv

449064 green_tripdata_2019-09.csv
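The `wc -l` figure is one higher than the `COUNT(*)` result because the first line of the CSV is the header. A small sketch of that check, assuming no field contains an embedded newline; `count_data_rows` is a hypothetical helper and the sample is a three-column cut of the two rows shown by `df.head(2)`.

```python
import io


def count_data_rows(fileobj) -> int:
    """Count lines in a text-mode CSV file object, excluding the header line."""
    line_count = sum(1 for _ in fileobj)
    return max(line_count - 1, 0)


# Two-row sample built from the `df.head(2)` output, trimmed to three columns.
sample = io.StringIO(
    "VendorID,PULocationID,DOLocationID\n"
    "2,65,189\n"
    "2,97,225\n"
)
data_rows = count_data_rows(sample)
print(data_rows)  # prints 2

# The same arithmetic applied to the numbers reported above.
lines_in_file = 449064
rows_in_table = 449063
print(lines_in_file - 1 == rows_in_table)  # prints True
```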


## Question 3. Count records 

### Answer 3 `15612`

SELECT
    CAST(lpep_dropoff_datetime AS DATE) AS "day",
    COUNT(1) AS "count"
FROM
    green_taxi_data
WHERE
    CAST(lpep_dropoff_datetime AS DATE) = '2019-09-18'
GROUP BY
    CAST(lpep_dropoff_datetime AS DATE)
ORDER BY "count" DESC;

SELECT COUNT(*) FROM green_taxi_data;

SELECT * FROM green_taxi_data LIMIT 10;

SELECT lpep_pickup_datetime::date, COUNT(*) AS count FROM green_taxi_data
-- WHERE lpep_pickup_datetime::date = '2019-09-18'
-- AND lpep_dropoff_datetime::date = '2019-09-18'
GROUP BY lpep_pickup_datetime::date
ORDER BY count DESC;

In [29]:
query = """
SELECT 
    lpep_pickup_datetime::date as "Date", 
    COUNT(*) as number_of_trips
FROM green_taxi_data
WHERE 
    lpep_pickup_datetime::date = '2019-09-18'
    AND lpep_dropoff_datetime::date = '2019-09-18'
GROUP BY 
    lpep_pickup_datetime::date
"""

pl.read_database_uri(query=query, uri=conn_string)

| Date | number_of_trips |
| --- | --- |
| date | i64 |
| 2019-09-18 | 15612 |


How many taxi trips were made in total on September 18th, 2019?

Tip: count only trips that both started and finished on 2019-09-18.

Remember that the `lpep_pickup_datetime` and `lpep_dropoff_datetime` columns are timestamps (date plus hour, minute, and second), not dates.

- 15767
- 15612
- 15859
- 89009
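The tip about timestamps matters because a timestamp only equals a bare date at exactly midnight, so the date part has to be extracted before comparing. A minimal sketch in plain Python, using hypothetical trips that are not taken from the dataset:

```python
from datetime import date, datetime

# Hypothetical (pickup, dropoff) timestamps; not taken from the dataset.
trips = [
    (datetime(2019, 9, 18, 0, 5, 0), datetime(2019, 9, 18, 0, 20, 0)),    # starts and ends on the 18th
    (datetime(2019, 9, 18, 23, 50, 0), datetime(2019, 9, 19, 0, 10, 0)),  # ends on the 19th
    (datetime(2019, 9, 17, 23, 55, 0), datetime(2019, 9, 18, 0, 15, 0)),  # starts on the 17th
]

target = date(2019, 9, 18)

# Compare the date part of both timestamps, like `::date` in the SQL query.
same_day = [
    (pickup, dropoff)
    for pickup, dropoff in trips
    if pickup.date() == target and dropoff.date() == target
]

# Comparing the full timestamp against midnight matches only 00:00:00 exactly.
exact_midnight = [trip for trip in trips if trip[0] == datetime(2019, 9, 18)]

print(len(same_day), len(exact_midnight))  # prints 1 0
```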


## Question 4. Largest trip for each day

### Answer 4 `2019-09-26`

In [18]:
query = """
SELECT 
    lpep_pickup_datetime::date as pickup_day, 
    -- COUNT(*) as count,
    MAX(trip_distance) as max_trip_distance
FROM green_taxi_data
GROUP BY pickup_day
ORDER BY max_trip_distance DESC
LIMIT 4
"""

pl.read_database_uri(query=query, uri=conn_string)

| pickup_day | max_trip_distance |
| --- | --- |
| date | f64 |
| 2019-09-26 | 341.64 |
| 2019-09-21 | 135.53 |
| 2019-09-16 | 114.3 |
| 2019-09-28 | 89.64 |


Which was the pick up day with the largest trip distance?
Use the pick up time for your calculations.

- 2019-09-18
- 2019-09-16
- 2019-09-26
- 2019-09-21
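The calculation amounts to grouping trips by pickup day, keeping the maximum distance per day, and sorting the days by that maximum. A plain-Python sketch of the same logic follows; the first two rows reuse the pickup times and distances from the `df.head(2)` output, and the other two are hypothetical.

```python
from datetime import datetime

# (pickup timestamp, trip_distance) pairs.
rows = [
    ("2019-09-01 00:10:53", 2.0),
    ("2019-09-01 00:31:22", 3.2),
    ("2019-09-02 08:15:00", 1.1),  # hypothetical
    ("2019-09-02 09:40:00", 7.5),  # hypothetical
]

# GROUP BY pickup day, MAX(trip_distance)
max_by_day = {}
for timestamp, distance in rows:
    day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
    max_by_day[day] = max(distance, max_by_day.get(day, float("-inf")))

# ORDER BY max_trip_distance DESC
ranked = sorted(max_by_day.items(), key=lambda item: item[1], reverse=True)

print(ranked[0][0].isoformat(), ranked[0][1])  # prints 2019-09-02 7.5
```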


## Question 5. The number of passengers

### Answer 5 `"Brooklyn" "Manhattan" "Queens"`

In [19]:
query = """
SELECT 
    date(g.lpep_pickup_datetime) AS pickup_day, 
    SUM(g.total_amount) AS sum_total_amount,
    zpu."Borough" AS "Borough"
FROM
    green_taxi_data g
JOIN zones zpu
    ON g."PULocationID" = zpu."LocationID"
WHERE
    date(g.lpep_pickup_datetime) = '2019-09-18'
    AND zpu."Borough" != 'Unknown'
GROUP BY 1, 3
HAVING 
    SUM(g.total_amount) > 50000
ORDER BY SUM(g.total_amount) DESC
"""

pl.read_database_uri(query=query, uri=conn_string)

| pickup_day | sum_total_amount | Borough |
| --- | --- | --- |
| date | f64 | str |
| 2019-09-18 | 96333.24 | "Brooklyn" |
| 2019-09-18 | 92271.3 | "Manhattan" |
| 2019-09-18 | 78671.71 | "Queens" |


Consider only trips with `lpep_pickup_datetime` on '2019-09-18', and ignore the Borough labeled Unknown.

Which 3 pick up Boroughs had a sum of `total_amount` greater than 50000?
 
- "Brooklyn" "Manhattan" "Queens"
- "Bronx" "Brooklyn" "Manhattan"
- "Bronx" "Manhattan" "Queens" 
- "Brooklyn" "Queens" "Staten Island"
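This question combines two kinds of filter: conditions on individual rows (the pickup date, the Unknown borough) belong in `WHERE`, while the condition on the per-borough sum belongs in `HAVING`. The sketch below shows the split on a few hypothetical rows. It uses `sqlite3` as a stand-in for Postgres, a simplified `trips` table that is not the real schema, and a threshold of 50 in place of 50000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
conn.executescript("""
CREATE TABLE trips (pickup TEXT, borough TEXT, total_amount REAL);
INSERT INTO trips VALUES
    ('2019-09-18 08:00:00', 'Brooklyn', 40.0),
    ('2019-09-18 09:30:00', 'Brooklyn', 35.0),
    ('2019-09-18 10:00:00', 'Bronx',    20.0),
    ('2019-09-18 11:00:00', 'Unknown',  90.0),
    ('2019-09-19 08:00:00', 'Bronx',    80.0);
""")

query = """
SELECT borough, SUM(total_amount) AS sum_total_amount
FROM trips
WHERE date(pickup) = '2019-09-18'   -- row-level filters run before grouping
  AND borough != 'Unknown'
GROUP BY borough
HAVING SUM(total_amount) > 50       -- group-level filter runs after aggregation
ORDER BY sum_total_amount DESC
"""

result = conn.execute(query).fetchall()
print(result)  # prints [('Brooklyn', 75.0)]
```

The Bronx row from 2019-09-19 never reaches the sum because `WHERE` drops it first, and the Bronx total for the 18th (20.0) is then removed by `HAVING`.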

## Question 6. Largest tip

### Answer 6 `JFK Airport`

In [20]:
query = """
SELECT COUNT(*) FROM zones
"""

pl.read_database_uri(query=query, uri=conn_string)

| count |
| --- |
| i64 |
| 265 |


In [21]:
query = """
SELECT * FROM zones
LIMIT 5
"""

pl.read_database_uri(query=query, uri=conn_string)

| index | LocationID | Borough | Zone | service_zone |
| --- | --- | --- | --- | --- |
| i64 | i64 | str | str | str |
| 0 | 1 | "EWR" | "Newark Airport… | "EWR" |
| 1 | 2 | "Queens" | "Jamaica Bay" | "Boro Zone" |
| 2 | 3 | "Bronx" | "Allerton/Pelha… | "Boro Zone" |
| 3 | 4 | "Manhattan" | "Alphabet City" | "Yellow Zone" |
| 4 | 5 | "Staten Island" | "Arden Heights" | "Boro Zone" |


In [22]:
query = """
SELECT 
    date(g.lpep_pickup_datetime) AS pickup_day,
    MAX(tip_amount) AS "max_tip_amount",
    zpu."Zone" AS "pickup_zone",
    zdo."Zone" AS "dropoff_zone"
FROM
    green_taxi_data g
JOIN zones zpu
    ON g."PULocationID" = zpu."LocationID"
JOIN zones zdo
    ON g."DOLocationID" = zdo."LocationID"
WHERE
    zpu."Zone" = 'Astoria'
GROUP BY
    date(g.lpep_pickup_datetime), zpu."Zone", zdo."Zone"
ORDER BY
    "max_tip_amount" DESC
LIMIT 4
"""

pl.read_database_uri(query=query, uri=conn_string)

| pickup_day | max_tip_amount | pickup_zone | dropoff_zone |
| --- | --- | --- | --- |
| date | f64 | str | str |
| 2019-09-08 | 62.31 | "Astoria" | "JFK Airport" |
| 2019-09-15 | 30.0 | "Astoria" | "Woodside" |
| 2019-09-25 | 28.0 | "Astoria" | "Kips Bay" |
| 2019-09-03 | 25.0 | "Astoria" | "NV" |


For passengers picked up in September 2019 in the zone named Astoria, which drop off zone had the largest tip?
We want the name of the zone, not the ID.

Note: it's not a typo, it's `tip`, not `trip`.

- Central Park
- Jamaica
- JFK Airport
- Long Island City/Queens Plaza
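Answering with a zone name rather than an ID means joining the zones table twice, once for the pickup location and once for the drop off location. A minimal sketch of that double join, using `sqlite3` as a stand-in for Postgres: the two Astoria tips are taken from the result table above, while the third trip, the simplified tables, and the location IDs are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
conn.executescript("""
CREATE TABLE zones (LocationID INTEGER PRIMARY KEY, Zone TEXT);
CREATE TABLE trips (PULocationID INTEGER, DOLocationID INTEGER, tip_amount REAL);
-- Location IDs are illustrative.
INSERT INTO zones VALUES (7, 'Astoria'), (132, 'JFK Airport'), (260, 'Woodside');
INSERT INTO trips VALUES
    (7, 132, 62.31),   -- Astoria -> JFK Airport
    (7, 260, 30.0),    -- Astoria -> Woodside
    (260, 7, 99.0);    -- hypothetical trip that does not start in Astoria
""")

query = """
SELECT zdo.Zone AS dropoff_zone, MAX(t.tip_amount) AS max_tip
FROM trips t
JOIN zones zpu ON t.PULocationID = zpu.LocationID   -- pickup zone name
JOIN zones zdo ON t.DOLocationID = zdo.LocationID   -- drop off zone name
WHERE zpu.Zone = 'Astoria'
GROUP BY zdo.Zone
ORDER BY max_tip DESC
LIMIT 1
"""

top = conn.execute(query).fetchone()
print(top)  # prints ('JFK Airport', 62.31)
```

The 99.0 tip is ignored because its pickup zone is not Astoria, which is why the pickup filter has to use the `zpu` alias rather than `zdo`.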

## Terraform

In this section of the homework we'll prepare the environment by creating resources in GCP with Terraform.

In your VM on GCP/Laptop/GitHub Codespace install Terraform. 
Copy the files from the course repo
[here](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup/1_terraform_gcp/terraform) to your VM/Laptop/GitHub Codespace.

Modify the files as necessary to create a GCP bucket and a BigQuery dataset.


## Question 7. Creating Resources

### Answer 7 `Apply complete! Resources: 2 added, 0 changed, 0 destroyed.`

After updating the `main.tf` and `variable.tf` files, run:

```
terraform apply
```

Paste the output of this command into the homework submission form.


```bash
➜  01-docker-terraform git:(module-01) ✗ terraform apply

Terraform used the selected providers to generate the following execution plan. Resource actions
are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # google_bigquery_dataset.demo_dataset will be created
  + resource "google_bigquery_dataset" "demo_dataset" {
      + creation_time              = (known after apply)
      + dataset_id                 = "demo_dataset"
      + default_collation          = (known after apply)
      + delete_contents_on_destroy = false
      + effective_labels           = (known after apply)
      + etag                       = (known after apply)
      + id                         = (known after apply)
      + is_case_insensitive        = (known after apply)
      + last_modified_time         = (known after apply)
      + location                   = "US"
      + max_time_travel_hours      = (known after apply)
      + project                    = "tidy-daylight-411205"
      + self_link                  = (known after apply)
      + storage_billing_model      = (known after apply)
      + terraform_labels           = (known after apply)
    }

  # google_storage_bucket.demo-bucket will be created
  + resource "google_storage_bucket" "demo-bucket" {
      + effective_labels            = (known after apply)
      + force_destroy               = true
      + id                          = (known after apply)
      + location                    = "US"
      + name                        = "tidy-daylight-411205-hmwk01-terra-bucket"
      + project                     = (known after apply)
      + public_access_prevention    = (known after apply)
      + self_link                   = (known after apply)
      + storage_class               = "STANDARD"
      + terraform_labels            = (known after apply)
      + uniform_bucket_level_access = (known after apply)
      + url                         = (known after apply)

      + lifecycle_rule {
          + action {
              + type = "AbortIncompleteMultipartUpload"
            }
          + condition {
              + age                   = 1
              + matches_prefix        = []
              + matches_storage_class = []
              + matches_suffix        = []
              + with_state            = (known after apply)
            }
        }
    }

Plan: 2 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

google_storage_bucket.demo-bucket: Creating...
google_bigquery_dataset.demo_dataset: Creating...
google_bigquery_dataset.demo_dataset: Creation complete after 1s [id=projects/tidy-daylight-411205/datasets/demo_dataset]
google_storage_bucket.demo-bucket: Creation complete after 2s [id=tidy-daylight-411205-hmwk01-terra-bucket]

Apply complete! Resources: 2 added, 0 changed, 0 destroyed.
```
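When the apply step runs inside a script, the final summary line can be checked automatically instead of by eye. A small sketch, assuming the summary line has the wording shown in the output above; `parse_apply_summary` is a hypothetical helper, not part of Terraform.

```python
import re

# Pattern based on the last line of the `terraform apply` output above.
SUMMARY_PATTERN = re.compile(
    r"Apply complete! Resources: (\d+) added, (\d+) changed, (\d+) destroyed\."
)


def parse_apply_summary(output: str):
    """Return the added/changed/destroyed counts, or None if no summary is found."""
    match = SUMMARY_PATTERN.search(output)
    if match is None:
        return None
    added, changed, destroyed = (int(group) for group in match.groups())
    return {"added": added, "changed": changed, "destroyed": destroyed}


log_tail = "Apply complete! Resources: 2 added, 0 changed, 0 destroyed."
summary = parse_apply_summary(log_tail)
print(summary)  # prints {'added': 2, 'changed': 0, 'destroyed': 0}
```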