## Homework

This notebook contains solutions for the [week 1 assignments](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup).

**Question 1. Knowing docker tags**

Run the command to get information on Docker: `docker --help`. Now run the command to get help on the `docker build` command - which tag has the following text: *Write the image ID to the file*?

Answer:

In [4]:
!docker build --help | grep "Write the image ID to the file"

      --iidfile string          Write the image ID to the file


**Question 2. Understanding docker first run**

Run docker with the *python:3.9* image in an interactive mode and the entrypoint of bash. Now check the python modules that are installed (use `pip list`). How many python packages/modules are installed?

Answer: 3 packages (pip, setuptools and wheel). Commands you need to run, in order:

    docker run -it --entrypoint=bash python:3.9

    <After the container starts, you will be able to execute commands inside it using bash. Your bash prompt should look something like this: root@92ffd1826989:/#>
    
    pip list
    Package    Version
    ---------- -------
    pip        22.0.4
    setuptools 58.1.0
    wheel      0.38.4
    WARNING: You are using pip version 22.0.4; however, version 22.3.1 is available.
    You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
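
If you would rather count the packages than read them off the listing, a minimal sketch using only the standard library is shown below. It would have to be run with the Python interpreter inside the container, the helper name `installed_package_names` is made up for illustration, and the result may not match `pip list` exactly in every environment.

```python
from importlib import metadata


def installed_package_names():
    """Return the sorted, de-duplicated names of installed distributions."""
    names = set()
    for dist in metadata.distributions():
        # Skip entries whose metadata has no usable name.
        name = dist.metadata.get("Name")
        if name:
            names.add(name.lower())
    return sorted(names)


packages = installed_package_names()
print(len(packages), "packages installed")
```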


**Database Setup**

Before proceeding with the tasks, we need to do some preparation. 

Firstly, we need to run a local instance of PostgreSQL - I have created a basic setup using Docker Compose [here](./docker-compose.yaml). This setup uses [pgAdmin](https://www.pgadmin.org/) for convenience, although it is not necessary - we might as well interact with the database using other tools, like [pgcli](https://www.pgcli.com/) or simply the *pandas* library.

Secondly, we need to ingest some data into that database - the data we will be using is [NYC TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). Before May 2022, that data was stored in CSV files, which DataTalksClub backed up and made available [here](https://github.com/DataTalksClub/nyc-tlc-data). As I found some minor discrepancies between the backup CSV files and the current Parquet files, I'd recommend using the former to make sure the output of your queries is consistent with the answers to the questions.

To ingest the data, we need to create tables with the appropriate schema inside our database. Although for this week we are only using Green Taxi Trip Records from January 2019 and the Taxi Zone Lookup Table, we will create tables for each type of trip record (except High Volume For-Hire Vehicle Trip Records, as the backup for those is incomplete as of now).

In [5]:
from datetime import date

import pandas as pd
from sqlalchemy import create_engine

In [6]:
yellow_data = pd.read_csv("https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2019-01.csv.gz", nrows=100)
green_data = pd.read_csv("https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz", nrows=100)
fhv_data = pd.read_csv("https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-01.csv.gz", nrows=100)

# We are gonna load the entire Taxi Zone Lookup Table as we only have to ingest it once
zone_data = pd.read_csv("https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv")

We need to do some basic data cleanup, namely converting the appropriate columns in the datasets to the *datetime* type (note that if you are working with Parquet files, you will not have this problem).

In [18]:
yellow_data["tpep_pickup_datetime"] = pd.to_datetime(yellow_data["tpep_pickup_datetime"])
yellow_data["tpep_dropoff_datetime"] = pd.to_datetime(yellow_data["tpep_dropoff_datetime"])

green_data["lpep_pickup_datetime"] = pd.to_datetime(green_data["lpep_pickup_datetime"])
green_data["lpep_dropoff_datetime"] = pd.to_datetime(green_data["lpep_dropoff_datetime"])

fhv_data["pickup_datetime"] = pd.to_datetime(fhv_data["pickup_datetime"])
fhv_data["dropOff_datetime"] = pd.to_datetime(fhv_data["dropOff_datetime"])
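
As an alternative to converting the columns after loading, `pd.read_csv` can parse them while reading via its `parse_dates` parameter. The sketch below uses a tiny inline CSV with made-up values in place of the real trip files, so it runs without a download; only the column names are taken from the Green Taxi dataset.

```python
import io

import pandas as pd

# Tiny inline sample standing in for a trip-data CSV (values are made up).
sample_csv = io.StringIO(
    "lpep_pickup_datetime,lpep_dropoff_datetime,trip_distance\n"
    "2019-01-01 00:10:16,2019-01-01 00:16:32,0.86\n"
    "2019-01-01 00:27:11,2019-01-01 00:31:38,0.66\n"
)

# parse_dates converts the listed columns while the file is being read.
green_sample = pd.read_csv(
    sample_csv,
    parse_dates=["lpep_pickup_datetime", "lpep_dropoff_datetime"],
)

print(green_sample.dtypes)
```

The same argument should work with the `.csv.gz` URLs used earlier, which saves the separate `pd.to_datetime` calls.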

To extract the schema from the `pandas.DataFrame` and generate a DDL statement compatible with Postgres, we need an appropriate *SQLAlchemy Engine*. Note that the URL you pass to the `create_engine` function depends on the configuration you have specified in the *docker-compose* file.

In [23]:
engine = create_engine("postgresql://root:root@localhost:54320/nyc_tlc_trips")
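
Since the URL depends on your own configuration, it can help to assemble it from its parts instead of hard-coding the string. The following is a minimal sketch: `build_postgres_url` is a hypothetical helper, not part of SQLAlchemy, and the values are simply the ones from the connection string above.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Assemble a PostgreSQL connection URL from its parts."""
    # Percent-encode the credentials so characters such as "@" or "/" in a
    # password do not break the URL.
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Values taken from the connection string used above; adjust them to match
# the environment variables and port mapping in your own docker-compose file.
url = build_postgres_url("root", "root", "localhost", 54320, "nyc_tlc_trips")
print(url)
```

The resulting string can be passed to `create_engine` in place of the literal URL.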

Once we have successfully established a connection to the database, we can start creating tables and ingesting the data. For now, we are going to ingest the entire Taxi Zone Lookup table and only create tables for the remaining datasets.

In [24]:
# Ingesting Taxi Zone Lookup table
zone_data.to_sql(
    name="taxi_zone", con=engine, if_exists="replace", index=False
)

# Creating tables for Yellow Taxi Trips, Green Taxi Trips and For-Hire Vehicles Trips
yellow_data.to_sql(
    name="yellow_taxi_trip", con=engine, if_exists="replace", index=False
)
green_data.to_sql(
    name="green_taxi_trip", con=engine, if_exists="replace", index=False
)
fhv_data.to_sql(name="fhv_trip", con=engine, if_exists="replace", index=False)


265
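
Note that calling `to_sql` on the 100-row samples also inserts those rows into the trip tables. If you prefer to start from empty tables (for example, to avoid duplicating those rows during a later full ingestion), one option is to pass `df.head(0)`, which keeps the columns and dtypes but no rows. The sketch below uses an in-memory SQLite connection and a made-up two-row frame purely so it runs without the Postgres container; with the real setup you would pass `engine` as `con`.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the Postgres engine so the sketch runs
# without a database server; the sample values are made up.
connection = sqlite3.connect(":memory:")
sample = pd.DataFrame(
    {
        "PULocationID": [74, 41],
        "trip_distance": [0.86, 0.66],
    }
)

# head(0) keeps the column names and dtypes but drops every row, so to_sql
# creates the table without inserting any data.
sample.head(0).to_sql(name="green_taxi_trip", con=connection, index=False)

row_count = connection.execute(
    "SELECT COUNT(*) FROM green_taxi_trip"
).fetchone()[0]
columns = [
    row[1] for row in connection.execute("PRAGMA table_info(green_taxi_trip)")
]
print(row_count, columns)
```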

If you want to see what kind of DDL was used to create the tables, you can view the exact queries in pgAdmin. You can also examine the data types assigned to your columns using pandas (note that this is not the **exact** query that was used to create the table).

In [32]:
print(pd.io.sql.get_schema(zone_data, con=engine, name="taxi_zone"))



CREATE TABLE taxi_zone (
	"LocationID" BIGINT, 
	"Borough" TEXT, 
	"Zone" TEXT, 
	service_zone TEXT
)




NOTE ABOUT INGESTING