## Homework

This notebook contains solutions for the [week 1 assignments](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup).

**Question 1. Knowing docker tags**

Run the command to get information on Docker: `docker --help`. Now run the command to get help on the `docker build` command - which tag has the following text: *Write the image ID to the file*?

Answer:

In [1]:
!docker build --help | grep "Write the image ID to the file"

      --iidfile string          Write the image ID to the file


**Question 2. Understanding docker first run**

Run docker with the *python:3.9* image in an interactive mode and the entrypoint of bash. Now check the python modules that are installed (use `pip list`). How many python packages/modules are installed?

Answer (commands you need to run in order):

    docker run -it --entrypoint=bash python:3.9

    <Once the container is running, you will be able to execute commands inside it using bash. Your bash prompt should look something like this: root@92ffd1826989:/#>
    
    pip list
    Package    Version
    ---------- -------
    pip        22.0.4
    setuptools 58.1.0
    wheel      0.38.4
    WARNING: You are using pip version 22.0.4; however, version 22.3.1 is available.
    You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.


**Database Setup**

Before proceeding with the tasks, we need to do some preparation. 

Firstly, we need to run a local instance of PostgreSQL - I have created a basic setup using Docker Compose [here](docker-compose.yaml). This setup uses [pgAdmin](https://www.pgadmin.org/) for convenience, although it is not necessary - we might as well interact with the database using other tools, like [pgcli](https://www.pgcli.com/) or simply the *pandas* library.

Secondly, we need to ingest some data into that database - the data we will be using is the [NYC TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). Before May 2022, that data was stored in CSV files, which DataTalksClub backed up and made available [here](https://github.com/DataTalksClub/nyc-tlc-data). As I found some minor discrepancies between the backup CSV files and the "current" Parquet files, I would recommend using the former so that the output of your queries is consistent with the answers to the questions.

To ingest the data, we need to create tables with the appropriate schema inside our database. Although for this week we are only using Green Taxi Trip Records from Jan 2019 and the Taxi Zone Lookup Table, we will create tables for each "type" of trip (except High Volume For-Hire Vehicle Trip Records, as the backup for these files is incomplete as of now).

In [2]:
from datetime import date

import pandas as pd
from sqlalchemy import create_engine

In [3]:
yellow_data = pd.read_csv("https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2019-01.csv.gz", nrows=100)
green_data = pd.read_csv("https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz", nrows=100)
fhv_data = pd.read_csv("https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-01.csv.gz", nrows=100)

# We are gonna load the entire Taxi Zone Lookup Table as we only have to ingest it once
zone_data = pd.read_csv("https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv")

We need to do some basic data cleanup, namely converting the appropriate columns in the datasets to *datetime* type (note, if you are working with Parquet files, you will not have this problem).

In [4]:
yellow_data["tpep_pickup_datetime"] = pd.to_datetime(yellow_data["tpep_pickup_datetime"])
yellow_data["tpep_dropoff_datetime"] = pd.to_datetime(yellow_data["tpep_dropoff_datetime"])

green_data["lpep_pickup_datetime"] = pd.to_datetime(green_data["lpep_pickup_datetime"])
green_data["lpep_dropoff_datetime"] = pd.to_datetime(green_data["lpep_dropoff_datetime"])

fhv_data["pickup_datetime"] = pd.to_datetime(fhv_data["pickup_datetime"])
fhv_data["dropOff_datetime"] = pd.to_datetime(fhv_data["dropOff_datetime"])

To extract the schema from the `pandas.DataFrame` and generate a DDL statement compatible with Postgres, we need an appropriate *SQLAlchemy Engine*. Note that the URL you pass to the `create_engine` function depends on the configuration you have specified in the *docker-compose* file.

In [5]:
engine = create_engine("postgresql://root:root@localhost:54320/nyc_tlc_trips")

With the engine in place (SQLAlchemy connects lazily, so the actual connection is only opened on first use), we can start creating tables and ingesting the data. For now, we are going to ingest the entire Taxi Zone Lookup data and only create tables for the remaining datasets.

In [6]:
# Ingesting Taxi Zone Lookup table
zone_data.to_sql(
    name="taxi_zone", con=engine, if_exists="replace", index=False
)

# Creating tables for Yellow Taxi Trips, Green Taxi Trips and For-Hire Vehicles Trips
yellow_data.head(0).to_sql(
    name="yellow_taxi_trip", con=engine, if_exists="replace", index=False
)
green_data.head(0).to_sql(
    name="green_taxi_trip", con=engine, if_exists="replace", index=False
)
fhv_data.head(0).to_sql(name="fhv_trip", con=engine, if_exists="replace", index=False)


0

If you want to see what kind of DDL was used to create the tables, you can see the exact queries through pgAdmin. You can also examine the data types assigned to your columns using pandas (note that this is not the **exact** query that was used to create the table).

In [7]:
print(pd.io.sql.get_schema(zone_data, con=engine, name="taxi_zone"))



CREATE TABLE taxi_zone (
	"LocationID" BIGINT, 
	"Borough" TEXT, 
	"Zone" TEXT, 
	service_zone TEXT
)




For the trip data ingestion - as I want to load more than a single month of data into the database - I have written a [dedicated script](ingest_data.py). To see the arguments you can pass to the script, run `python ingest_data.py --help`. As mentioned before, to complete this week's assignments, we have to ingest Green Taxi Trip Records from Jan 2019.

In [8]:
%run ingest_data.py green 2019 1 --db_name nyc_tlc_trips --db_port 54320
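
The script itself is not reproduced in this notebook. As a rough illustration only, a command-line interface that accepts the call above could be sketched with `argparse` as below. Every argument name, default and helper here is an assumption inferred from that single invocation and from the backup URLs used earlier; the real *ingest_data.py* may differ.

```python
import argparse


def build_parser():
    """Hypothetical CLI mirroring the call `green 2019 1 --db_name ... --db_port ...`."""
    parser = argparse.ArgumentParser(description="Ingest NYC TLC trip data into Postgres.")
    parser.add_argument("taxi_type", choices=["yellow", "green", "fhv"])
    parser.add_argument("year", type=int)
    parser.add_argument("month", type=int)
    # Defaults are assumptions; adjust them to your docker-compose configuration.
    parser.add_argument("--db_name", default="nyc_tlc_trips")
    parser.add_argument("--db_port", type=int, default=5432)
    return parser


def source_url(taxi_type, year, month):
    """Build the backup CSV URL, following the pattern of the URLs read earlier."""
    return (
        "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/"
        f"{taxi_type}/{taxi_type}_tripdata_{year}-{month:02d}.csv.gz"
    )


# Parse the same arguments that were passed to %run above.
args = build_parser().parse_args(
    ["green", "2019", "1", "--db_name", "nyc_tlc_trips", "--db_port", "54320"]
)
print(source_url(args.taxi_type, args.year, args.month))
```

Passing the argument list explicitly to `parse_args` keeps the sketch runnable inside a notebook, where `sys.argv` belongs to the kernel rather than to the script.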

***Note about data ingestion***

In the *ingest_data.py* script I am using the `pandas.DataFrame.to_sql` method to ingest data into the database. The volume of data we are working with here is relatively small, so this approach works just fine. With larger volumes, however, one may want to try something more efficient. I would recommend researching different methods of data ingestion, such as the one described [here](https://ellisvalentiner.com/post/a-fast-method-to-insert-a-pandas-dataframe-into-postgres/), before undertaking such tasks.
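
One commonly used alternative is to pass a custom callable to the `method` parameter of `to_sql`, so that rows are streamed through Postgres `COPY` instead of `INSERT` statements. The sketch below follows the pattern shown in the pandas documentation and assumes the engine uses the psycopg2 driver (whose cursor provides `copy_expert`). The helper `build_copy_payload` is a name introduced here for illustration; it is not part of pandas or of *ingest_data.py*.

```python
import csv
from io import StringIO


def build_copy_payload(table_name, schema, columns, rows):
    """Return a COPY statement and an in-memory CSV buffer holding the rows."""
    buffer = StringIO()
    csv.writer(buffer).writerows(rows)
    buffer.seek(0)
    quoted_columns = ", ".join(f'"{column}"' for column in columns)
    target = f"{schema}.{table_name}" if schema else table_name
    statement = f"COPY {target} ({quoted_columns}) FROM STDIN WITH CSV"
    return statement, buffer


def psql_insert_copy(table, conn, keys, data_iter):
    """Insertion callable for DataFrame.to_sql(method=...), using COPY."""
    statement, buffer = build_copy_payload(table.name, table.schema, keys, data_iter)
    # conn.connection is the underlying DBAPI connection (psycopg2 assumed).
    with conn.connection.cursor() as cursor:
        cursor.copy_expert(sql=statement, file=buffer)


# Usage, assuming the engine and DataFrame defined earlier in this notebook:
# green_data.to_sql("green_taxi_trip", con=engine, if_exists="append",
#                   index=False, method=psql_insert_copy)
```

Whether this is actually faster for a given dataset is worth measuring rather than assuming.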


Now that the data has been ingested, we can proceed with the tasks. The remaining questions are supposed to be answered by running SQL queries against the data in our database (for example through pgAdmin).

**Question 3. Count records**

How many taxi trips were made on January 15 in total (that is, started and finished on 2019-01-15)?

Answer:

    SELECT     COUNT(*)
    FROM       green_taxi_trip
    WHERE      CAST(lpep_pickup_datetime AS DATE) = '2019-01-15'
               AND CAST(lpep_dropoff_datetime AS DATE) = '2019-01-15';
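
To see which trips this filter keeps, the logic can be tried on a handful of made-up rows. The sketch below uses an in-memory SQLite database purely so that it runs without the Postgres setup; SQLite has no Postgres-style `CAST(... AS DATE)`, so `date()` stands in for it, and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE green_taxi_trip (lpep_pickup_datetime TEXT, lpep_dropoff_datetime TEXT)"
)
conn.executemany(
    "INSERT INTO green_taxi_trip VALUES (?, ?)",
    [
        ("2019-01-15 08:00:00", "2019-01-15 08:20:00"),  # counted
        ("2019-01-15 23:50:00", "2019-01-16 00:10:00"),  # ends on the next day
        ("2019-01-14 23:55:00", "2019-01-15 00:05:00"),  # starts on the previous day
        ("2019-01-15 12:00:00", "2019-01-15 12:30:00"),  # counted
    ],
)

# date() replaces CAST(... AS DATE) because this toy example runs on SQLite.
(same_day_trips,) = conn.execute(
    """
    SELECT COUNT(*)
    FROM   green_taxi_trip
    WHERE  date(lpep_pickup_datetime) = '2019-01-15'
           AND date(lpep_dropoff_datetime) = '2019-01-15'
    """
).fetchone()
print(same_day_trips)  # 2
```

Trips that cross midnight in either direction are excluded, which is what "started and finished on 2019-01-15" asks for.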


**Question 4. Largest trip for each day**

On which day was the trip with the largest distance made (use the pick up time for your calculations)?

Answer:

    SELECT     lpep_pickup_datetime
    FROM       green_taxi_trip
    ORDER BY   trip_distance DESC
    LIMIT      1;


**Question 5. The number of passengers**

How many trips had 2 and 3 passengers on 2019-01-01?

Answer:

    SELECT     COUNT(CASE WHEN passenger_count = 2 THEN 1 END) AS n_two_passengers
               , COUNT(CASE WHEN passenger_count = 3 THEN 1 END) AS n_three_passengers
    FROM       green_taxi_trip
    WHERE      CAST(lpep_pickup_datetime AS DATE) = '2019-01-01';
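
The query relies on conditional aggregation: `CASE` without an `ELSE` branch yields `NULL` for non-matching rows, and `COUNT` skips `NULL` values. A minimal sketch on invented rows, again using in-memory SQLite (with `date()` in place of the Postgres cast) so that it runs standalone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE green_taxi_trip (lpep_pickup_datetime TEXT, passenger_count INTEGER)"
)
conn.executemany(
    "INSERT INTO green_taxi_trip VALUES (?, ?)",
    [
        ("2019-01-01 00:10:00", 2),
        ("2019-01-01 09:30:00", 2),
        ("2019-01-01 18:45:00", 3),
        ("2019-01-01 20:00:00", 1),  # right date, but matches neither counter
        ("2019-01-02 07:00:00", 3),  # wrong date, removed by the WHERE clause
    ],
)

# Each COUNT only sees the rows for which its CASE expression is not NULL.
n_two, n_three = conn.execute(
    """
    SELECT COUNT(CASE WHEN passenger_count = 2 THEN 1 END) AS n_two_passengers
           , COUNT(CASE WHEN passenger_count = 3 THEN 1 END) AS n_three_passengers
    FROM   green_taxi_trip
    WHERE  date(lpep_pickup_datetime) = '2019-01-01'
    """
).fetchone()
print(n_two, n_three)  # 2 1
```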

**Question 6. Largest tip**

For the passengers picked up in the Astoria Zone, which drop off zone had the largest tip (you should get the name of the zone, not the id)?

Answer:

    SELECT     d_zone."Zone" AS largest_tip_zone
    FROM       green_taxi_trip AS trip
    INNER JOIN taxi_zone AS p_zone
    ON         trip."PULocationID" = p_zone."LocationID"
    INNER JOIN taxi_zone AS d_zone
    ON         trip."DOLocationID" = d_zone."LocationID"
    WHERE      p_zone."Zone" = 'Astoria'
    ORDER BY   trip.tip_amount DESC
    LIMIT      1;
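
The query joins `taxi_zone` twice under different aliases: once to filter on the pickup zone name and once to translate the drop-off ID into a name. The toy example below shows this on in-memory SQLite; the location IDs, the zone names other than 'Astoria', and the tip amounts are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE taxi_zone ("LocationID" INTEGER, "Zone" TEXT);
    CREATE TABLE green_taxi_trip (
        "PULocationID" INTEGER, "DOLocationID" INTEGER, tip_amount REAL
    );
    INSERT INTO taxi_zone VALUES (1, 'Astoria'), (2, 'Zone B'), (3, 'Zone C');
    INSERT INTO green_taxi_trip VALUES
        (1, 2, 5.0),   -- picked up in Astoria, dropped off in Zone B
        (1, 3, 12.5),  -- picked up in Astoria, largest tip among Astoria pickups
        (2, 1, 40.0);  -- larger tip overall, but not picked up in Astoria
    """
)

(largest_tip_zone,) = conn.execute(
    """
    SELECT     d_zone."Zone" AS largest_tip_zone
    FROM       green_taxi_trip AS trip
    INNER JOIN taxi_zone AS p_zone
    ON         trip."PULocationID" = p_zone."LocationID"
    INNER JOIN taxi_zone AS d_zone
    ON         trip."DOLocationID" = d_zone."LocationID"
    WHERE      p_zone."Zone" = 'Astoria'
    ORDER BY   trip.tip_amount DESC
    LIMIT      1
    """
).fetchone()
print(largest_tip_zone)  # Zone C
```

The 40.0 tip is ignored because its pickup zone is not Astoria, so the filter on `p_zone` is applied before the ordering picks a winner.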

## Homework Part B: Terraform

The Terraform configuration can be found in the [terraform directory](./terraform/) and is mostly based on the [course default configuration](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup/1_terraform_gcp/terraform). The notable differences between the two are:
1. I have used a [terraform.tfvars file](./terraform/terraform.tfvars) to define the variables. For alternative ways of handling input variables, please refer to [this article](https://developer.hashicorp.com/terraform/language/values/variables).
2. I have used the path to the service account key file as the authentication method instead of the application default credentials.