# Data Engineering Zoomcamp
### Maria Fisher 

## Module 1 Homework

## Docker & SQL

## Question 1. Knowing docker tags

Run the command to get information on Docker:

```docker --help```

Now run the command to get help on the "docker build" command:

```docker build --help```

Do the same for "docker run".

Which tag has the following text? - *Automatically remove the container when it exits* 

- `--delete`
- `--rc`
- `--rmc`
- `--rm` (x)


## Question 2. Understanding docker first run 

Run docker with the python:3.9 image in interactive mode with bash as the entrypoint.
Now check the Python modules that are installed (use ```pip list```).

What is the version of the package *wheel*?

- 0.42.0 (x)
- 1.0.0
- 23.0.1
- 58.1.0
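As a rough sketch of the invocation this question describes, the snippet below assembles the `docker run` arguments in Python and prints the resulting command line. The helper name `docker_run_command` is hypothetical. The flags it uses (`-it`, `--rm`, `--entrypoint`) are standard `docker run` options, and `--rm` is included only to tie back to Question 1.

```python
import shlex


def docker_run_command(image: str, entrypoint: str) -> list[str]:
    """Build the argument list for an interactive, self-cleaning container."""
    return [
        "docker", "run",
        "-it",                         # interactive session with a terminal
        "--rm",                        # remove the container when it exits
        f"--entrypoint={entrypoint}",  # override the image's default entrypoint
        image,
    ]


cmd = docker_run_command("python:3.9", "bash")
print(shlex.join(cmd))
```

Running the printed command in a shell should open a bash prompt inside the container, where `pip list` shows the installed packages.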


## Prepare Postgres

Run Postgres and load data as shown in the videos.
We'll use the green taxi trips from September 2019:

```wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-09.csv.gz```

You will also need the dataset with zones:

```wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv```

Download this data and put it into Postgres (with Jupyter notebooks or with a pipeline).
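One possible shape for such a pipeline is to read the gzipped CSV in small batches and insert each batch as it arrives. The sketch below is only an illustration under several assumptions: it uses the standard-library `sqlite3` module as a stand-in for Postgres so that it runs without a server, the helper `load_csv_in_batches` is hypothetical, and the three rows are made up. With `psycopg2` the placeholder would be `%s` rather than `?`, and the table would normally have typed columns.

```python
import csv
import gzip
import io
import sqlite3


def load_csv_in_batches(lines, conn, table, batch_size=2):
    """Insert CSV rows into `table` in batches and return the row count."""
    reader = csv.reader(lines)
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    insert = f'INSERT INTO "{table}" ({cols}) VALUES ({marks})'

    total, batch = 0, []
    for row in reader:
        batch.append(row)
        if len(batch) == batch_size:
            conn.executemany(insert, batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        conn.executemany(insert, batch)
        total += len(batch)
    conn.commit()
    return total


# Made-up rows standing in for green_tripdata_2019-09.csv.gz
raw = (
    "lpep_pickup_datetime,lpep_dropoff_datetime,trip_distance\n"
    "2019-09-18 08:00:00,2019-09-18 08:20:00,3.1\n"
    "2019-09-18 09:10:00,2019-09-18 09:25:00,1.4\n"
    "2019-09-19 07:45:00,2019-09-19 08:05:00,5.0\n"
)
gz_bytes = gzip.compress(raw.encode())

conn = sqlite3.connect(":memory:")
with gzip.open(io.BytesIO(gz_bytes), "rt", newline="") as f:
    loaded = load_csv_in_batches(f, conn, "green_taxi_data")
print(loaded)
```

Batching keeps memory use bounded, which matters more for the full trip file than for the small zones lookup table.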



## Question 3. Count records 

How many taxi trips were made in total on September 18th, 2019?

Tip: count only trips that both started and finished on 2019-09-18.

Remember that the `lpep_pickup_datetime` and `lpep_dropoff_datetime` columns are timestamps (date plus hour, minute, and second), not dates.

- 15767
- 15612 (x)
- 15859
- 89009

```python
import pandas as pd
import psycopg2

# Establish a connection to the database
conn = psycopg2.connect(
    host="localhost",
    dbname="ny_taxi",
    user="root",
    password="root"
)

# Create a cursor object to execute SQL queries
cur = conn.cursor()

# Count the trips that both started and finished on September 18th, 2019
cur.execute("""
    SELECT COUNT(*) FROM green_taxi_data
    WHERE
        CAST(lpep_pickup_datetime AS DATE) = '2019-09-18' AND
        CAST(lpep_dropoff_datetime AS DATE) = '2019-09-18'
""")

# Fetch the result of the query
result = cur.fetchone()[0]

# Print the number of taxi trips
print(f"Total taxi trips on September 18th, 2019: {result}")

# Close the cursor and connection
cur.close()
conn.close()
```

Total taxi trips on September 18th, 2019: 15612
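An alternative to casting each timestamp to a date is a half-open range filter: at or after midnight on the 18th and strictly before midnight on the 19th. The sketch below illustrates the idea on made-up rows, using `sqlite3` as a stand-in for Postgres. SQLite stores these timestamps as text, so the comparison there is lexicographic, whereas Postgres would compare real timestamps. The counting logic is the same either way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE green_taxi_data "
    "(lpep_pickup_datetime TEXT, lpep_dropoff_datetime TEXT)"
)
conn.executemany(
    "INSERT INTO green_taxi_data VALUES (?, ?)",
    [
        ("2019-09-18 00:05:00", "2019-09-18 00:25:00"),  # counted
        ("2019-09-18 23:50:00", "2019-09-19 00:10:00"),  # ends the next day
        ("2019-09-17 23:55:00", "2019-09-18 00:15:00"),  # starts the day before
        ("2019-09-18 12:00:00", "2019-09-18 12:30:00"),  # counted
    ],
)

query = """
    SELECT COUNT(*) FROM green_taxi_data
    WHERE lpep_pickup_datetime  >= '2019-09-18'
      AND lpep_pickup_datetime  <  '2019-09-19'
      AND lpep_dropoff_datetime >= '2019-09-18'
      AND lpep_dropoff_datetime <  '2019-09-19'
"""
same_day_trips = conn.execute(query).fetchone()[0]
print(same_day_trips)
```

Only the two trips that start and end on the 18th are counted. The range form also leaves the column unwrapped, which can help a database use an index on it.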


## Question 4. Largest trip for each day

Which was the pick up day with the largest trip distance?
Use the pick up time for your calculations.

- 2019-09-18
- 2019-09-16
- 2019-09-26 (x)
- 2019-09-21

```python
import pandas as pd
import psycopg2

# Connect to the Postgres database
conn = psycopg2.connect(
    host="localhost",
    dbname="ny_taxi",
    user="root",
    password="root"
)
# Create a cursor object
cur = conn.cursor()

# Among the four candidate days, find the trip with the largest distance
cur.execute("""
    SELECT
        lpep_pickup_datetime,
        trip_distance
    FROM
        green_taxi_data
    WHERE
        DATE(lpep_pickup_datetime) IN ('2019-09-18', '2019-09-16', '2019-09-26', '2019-09-21')
    ORDER BY trip_distance DESC
    LIMIT 1;
""")

# Fetch the single row returned by the query
result = cur.fetchone()

# Print the pick up timestamp of the longest trip
print("Pick up day with the largest trip distance:", result[0])

# Close the cursor and connection
cur.close()
conn.close()
```

Pick up day with the largest trip distance: 2019-09-26 19:32:52
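The query above only looks at the four candidate days. A more general variant groups by pick up day and takes the maximum distance per day, so no candidate list is needed. The sketch below shows that shape on made-up rows, with `sqlite3` standing in for Postgres. The dates and distances are purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE green_taxi_data "
    "(lpep_pickup_datetime TEXT, trip_distance REAL)"
)
conn.executemany(
    "INSERT INTO green_taxi_data VALUES (?, ?)",
    [
        ("2019-09-01 08:00:00", 2.5),
        ("2019-09-01 21:15:00", 11.0),
        ("2019-09-02 09:30:00", 30.2),
        ("2019-09-03 17:45:00", 7.8),
    ],
)

query = """
    SELECT
        DATE(lpep_pickup_datetime) AS pickup_day,
        MAX(trip_distance)         AS longest_trip
    FROM green_taxi_data
    GROUP BY DATE(lpep_pickup_datetime)
    ORDER BY longest_trip DESC
    LIMIT 1
"""
pickup_day, longest_trip = conn.execute(query).fetchone()
print(pickup_day, longest_trip)
```

Because the result is already a date, there is no time component to strip when comparing it with the answer options.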


## Join the tables 


```python
import psycopg2
import pandas as pd

# Connect to the Postgres database
conn = psycopg2.connect(
    host="localhost",
    dbname="ny_taxi",
    user="root",
    password="root"
)

# Join each trip to the zones table twice:
# once for the pick up zone (zpu) and once for the drop off zone (zdo)
query = '''
    SELECT
        CAST(lpep_pickup_datetime AS DATE),
        CAST(lpep_dropoff_datetime AS DATE),
        "total_amount",
        "tip_amount",
        zpu."Borough" AS "zpu_local",
        zpu."Zone"  AS "zpu_zone",
        zdo."Borough" AS "zdo_local",
        zdo."Zone"  AS "zdo_zone"
    FROM
        green_taxi_data t,
        zones zpu,
        zones zdo
    WHERE
        t."PULocationID" = zpu."LocationID" AND
        t."DOLocationID" = zdo."LocationID"
    ORDER BY
        "lpep_pickup_datetime" DESC;
'''

# Read the results into a pandas DataFrame
df = pd.read_sql_query(query, conn)

# Close the connection
conn.close()
```






```python
df.to_csv('ny_taxi_.csv', index=False)
```
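The same join can also be written with explicit `JOIN ... ON` clauses, which keep each join condition next to the table it belongs to. The sketch below illustrates this on two made-up trips, with `sqlite3` standing in for Postgres. The table and column names follow the query above, and the rows are assumptions for the example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE zones ("LocationID" INTEGER, "Borough" TEXT, "Zone" TEXT);
    CREATE TABLE green_taxi_data (
        "PULocationID" INTEGER, "DOLocationID" INTEGER, total_amount REAL
    );
    INSERT INTO zones VALUES
        (7, 'Queens', 'Astoria'),
        (41, 'Manhattan', 'Central Harlem');
    INSERT INTO green_taxi_data VALUES (7, 41, 19.5), (41, 7, 22.0);
""")

query = """
    SELECT
        zpu."Zone" AS zpu_zone,
        zdo."Zone" AS zdo_zone,
        t.total_amount
    FROM green_taxi_data AS t
    JOIN zones AS zpu ON t."PULocationID" = zpu."LocationID"
    JOIN zones AS zdo ON t."DOLocationID" = zdo."LocationID"
    ORDER BY t.total_amount
"""
joined = conn.execute(query).fetchall()
print(joined)
```

Both forms are inner joins, so a trip whose location ID has no match in `zones` is dropped either way.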

## Question 5. Three biggest pick up Boroughs

Consider trips with lpep_pickup_datetime on '2019-09-18', ignoring rows where the Borough is Unknown.

Which were the 3 pick up Boroughs that had a sum of total_amount greater than 50000?
 
- "Brooklyn" "Manhattan" "Queens" (x)
- "Bronx" "Brooklyn" "Manhattan"
- "Bronx" "Manhattan" "Queens" 
- "Brooklyn" "Queens" "Staten Island"


```python
import pandas as pd

df = pd.read_csv("ny_taxi_.csv")
df
```

| | lpep_pickup_datetime | lpep_dropoff_datetime | total_amount | tip_amount | zpu_local | zpu_zone | zdo_local | zdo_zone |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-03-07 | 2020-03-07 | 6.8 | 0.0 | Queens | Sunnyside | Queens | Sunnyside |
| 1 | 2020-02-15 | 2020-02-15 | 9.8 | 0.0 | Queens | Astoria | Queens | Long Island City/Queens Plaza |
| 2 | 2020-01-03 | 2020-01-03 | 10.8 | 0.0 | Queens | Sunnyside | Queens | Old Astoria |
| 3 | 2019-12-15 | 2019-12-15 | 7.3 | 2.0 | Queens | Sunnyside | Queens | Sunnyside |
| 4 | 2019-12-15 | 2019-12-15 | 5.8 | 0.0 | Queens | Astoria | Queens | Astoria |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 449058 | 2009-01-01 | 2009-01-01 | 15.3 | 0.0 | Queens | Jamaica | Queens | Saint Albans |
| 449059 | 2009-01-01 | 2009-01-01 | 4.8 | 0.0 | Queens | Jamaica | Queens | Jamaica |
| 449060 | 2009-01-01 | 2009-01-01 | 35.3 | 0.0 | Queens | Jamaica | Brooklyn | Clinton Hill |
| 449061 | 2009-01-01 | 2009-01-01 | 28.8 | 0.0 | Queens | Jamaica | Brooklyn | Brownsville |


```python
set0 = df.loc[df["lpep_pickup_datetime"] == "2019-09-18"]
set0
```

| | lpep_pickup_datetime | lpep_dropoff_datetime | total_amount | zpu_local | zpu_zone | zdo_local | zdo_zone |
|---|---|---|---|---|---|---|---|
| 183453 | 2019-09-18 | 2019-09-18 | 14.16 | Queens | Jamaica | Queens | Baisley Park |
| 183454 | 2019-09-18 | 2019-09-18 | 10.80 | Manhattan | Central Harlem | Bronx | Mott Haven/Port Morris |
| 183455 | 2019-09-18 | 2019-09-18 | 9.12 | Manhattan | East Harlem South | Manhattan | East Harlem North |
| 183456 | 2019-09-18 | 2019-09-18 | 9.30 | Brooklyn | Boerum Hill | Brooklyn | Red Hook |
| 183457 | 2019-09-18 | 2019-09-18 | 19.05 | Manhattan | Central Harlem | Manhattan | Lenox Hill East |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 199215 | 2019-09-18 | 2019-09-18 | 34.95 | Brooklyn | Gravesend | Brooklyn | Prospect-Lefferts Gardens |
| 199216 | 2019-09-18 | 2019-09-18 | 23.39 | Bronx | Mott Haven/Port Morris | Manhattan | Central Harlem North |
| 199217 | 2019-09-18 | 2019-09-18 | 23.98 | Brooklyn | Boerum Hill | Brooklyn | Prospect-Lefferts Gardens |
| 199218 | 2019-09-18 | 2019-09-18 | 31.08 | Brooklyn | Gowanus | Brooklyn | Crown Heights South |


```python
# Keep only the five named boroughs, which drops rows with an Unknown borough
set1 = set0[set0.zpu_local.isin(["Brooklyn", "Manhattan", "Queens", "Staten Island", "Bronx"])]
set1
```

| | lpep_pickup_datetime | lpep_dropoff_datetime | total_amount | zpu_local | zpu_zone | zdo_local | zdo_zone |
|---|---|---|---|---|---|---|---|
| 183453 | 2019-09-18 | 2019-09-18 | 14.16 | Queens | Jamaica | Queens | Baisley Park |
| 183454 | 2019-09-18 | 2019-09-18 | 10.80 | Manhattan | Central Harlem | Bronx | Mott Haven/Port Morris |
| 183455 | 2019-09-18 | 2019-09-18 | 9.12 | Manhattan | East Harlem South | Manhattan | East Harlem North |
| 183456 | 2019-09-18 | 2019-09-18 | 9.30 | Brooklyn | Boerum Hill | Brooklyn | Red Hook |
| 183457 | 2019-09-18 | 2019-09-18 | 19.05 | Manhattan | Central Harlem | Manhattan | Lenox Hill East |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 199215 | 2019-09-18 | 2019-09-18 | 34.95 | Brooklyn | Gravesend | Brooklyn | Prospect-Lefferts Gardens |
| 199216 | 2019-09-18 | 2019-09-18 | 23.39 | Bronx | Mott Haven/Port Morris | Manhattan | Central Harlem North |
| 199217 | 2019-09-18 | 2019-09-18 | 23.98 | Brooklyn | Boerum Hill | Brooklyn | Prospect-Lefferts Gardens |
| 199218 | 2019-09-18 | 2019-09-18 | 31.08 | Brooklyn | Gowanus | Brooklyn | Crown Heights South |


```python
# Sum total_amount per pick up borough
set1 = set1.groupby('zpu_local')['total_amount'].sum()
set1
```


| zpu_local | total_amount |
|---|---|
| Bronx | 32830.09 |
| Brooklyn | 96333.24 |
| Manhattan | 92271.30 |
| Queens | 78671.71 |
| Staten Island | 342.59 |
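The same filter, grouping, and threshold can be expressed in a single SQL query using `GROUP BY` with `HAVING`. The sketch below is an illustration only: it runs on made-up rows in `sqlite3`, uses a hypothetical pre-joined table named `trips`, and sets the threshold to 50 in place of the 50000 in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trips (pickup_day TEXT, borough TEXT, total_amount REAL);
    INSERT INTO trips VALUES
        ('2019-09-18', 'Brooklyn', 40.0),
        ('2019-09-18', 'Brooklyn', 35.0),
        ('2019-09-18', 'Queens',   60.0),
        ('2019-09-18', 'Bronx',    20.0),
        ('2019-09-18', 'Unknown',  90.0),
        ('2019-09-19', 'Bronx',    80.0);
""")

threshold = 50  # stands in for the 50000 cut-off in the question
query = """
    SELECT borough, SUM(total_amount) AS borough_total
    FROM trips
    WHERE pickup_day = '2019-09-18' AND borough <> 'Unknown'
    GROUP BY borough
    HAVING SUM(total_amount) > ?
    ORDER BY borough_total DESC
"""
big_boroughs = conn.execute(query, (threshold,)).fetchall()
print(big_boroughs)
```

`WHERE` removes rows before grouping (the wrong day and the Unknown borough), while `HAVING` filters the groups after the sums are computed.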

## Question 6. Largest tip

For passengers picked up in September 2019 in the zone named Astoria, which drop off zone had the largest tip?
We want the name of the zone, not the id.

Note: it's not a typo, it's `tip`, not `trip`.

- Central Park
- Jamaica
- JFK Airport
- Long Island City/Queens Plaza (x)

```python
# Keep pick ups in September 2019 (the month has 30 days)
set0 = df.loc[(df["lpep_pickup_datetime"] >= "2019-09-01") & (df["lpep_pickup_datetime"] <= "2019-09-30")]
set0
```


| | lpep_pickup_datetime | lpep_dropoff_datetime | total_amount | tip_amount | zpu_local | zpu_zone | zdo_local | zdo_zone |
|---|---|---|---|---|---|---|---|---|
| 31 | 2019-09-30 | 2019-09-30 | 11.77 | 2.72 | Manhattan | East Harlem South | Manhattan | Upper East Side North |
| 32 | 2019-09-30 | 2019-09-30 | 7.30 | 0.00 | Manhattan | East Harlem North | Manhattan | East Harlem South |
| 33 | 2019-09-30 | 2019-09-30 | 8.76 | 1.46 | Queens | Elmhurst | Queens | Maspeth |
| 34 | 2019-09-30 | 2019-09-30 | 10.80 | 0.00 | Queens | Middle Village | Queens | Middle Village |
| 35 | 2019-09-30 | 2019-09-30 | 13.30 | 0.00 | Queens | Middle Village | Brooklyn | Bushwick North |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 449025 | 2019-09-01 | 2019-09-01 | 16.80 | 0.00 | Brooklyn | Downtown Brooklyn/MetroTech | Brooklyn | Bushwick South |
| 449026 | 2019-09-01 | 2019-09-01 | 70.70 | 11.78 | Manhattan | Morningside Heights | Manhattan | Randalls Island |
| 449027 | 2019-09-01 | 2019-09-01 | 16.00 | 3.20 | Queens | Jamaica | Queens | Saint Albans |
| 449028 | 2019-09-01 | 2019-09-01 | 36.92 | 0.00 | Manhattan | Randalls Island | Brooklyn | Williamsburg (North Side) |


```python
# Keep only trips picked up in the Astoria zone
setb = set0[set0.zpu_zone.isin(["Astoria"])]
setb
```

| | lpep_pickup_datetime | lpep_dropoff_datetime | total_amount | tip_amount | zpu_local | zpu_zone | zdo_local | zdo_zone |
|---|---|---|---|---|---|---|---|---|
| 91 | 2019-09-30 | 2019-09-30 | 19.92 | 0.00 | Queens | Astoria | Manhattan | East Harlem North |
| 92 | 2019-09-30 | 2019-09-30 | 34.30 | 0.00 | Queens | Astoria | Queens | Jamaica |
| 197 | 2019-09-30 | 2019-09-30 | 5.80 | 0.00 | Queens | Astoria | Queens | Astoria |
| 198 | 2019-09-30 | 2019-09-30 | 6.80 | 0.00 | Queens | Astoria | Queens | Astoria |
| 204 | 2019-09-30 | 2019-09-30 | 10.80 | 1.00 | Queens | Astoria | Queens | Long Island City/Hunters Point |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 448880 | 2019-09-01 | 2019-09-01 | 4.30 | 0.00 | Queens | Astoria | Queens | Astoria |
| 448909 | 2019-09-01 | 2019-09-01 | 16.00 | 3.20 | Queens | Astoria | Queens | Long Island City/Hunters Point |
| 448923 | 2019-09-01 | 2019-09-01 | 36.36 | 6.06 | Queens | Astoria | Brooklyn | Fort Greene |
| 448959 | 2019-09-01 | 2019-09-01 | 9.80 | 0.00 | Queens | Astoria | Queens | Woodside |


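```python
# Distinct drop off zones reached from Astoria
seta = setb.zdo_zone.unique()

# Sum tips per drop off zone; select the column with a list so the result is a DataFrame
setb = setb.groupby('zdo_zone')[['tip_amount']].sum()
setb
```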

| zdo_zone | tip_amount |
|---|---|
| Allerton/Pelham Gardens | 0.00 |
| Alphabet City | 34.24 |
| Astoria | 2135.36 |
| Astoria Park | 3.62 |
| Auburndale | 7.06 |
| ... | ... |
| Woodlawn/Wakefield | 8.38 |
| Woodside | 491.30 |
| World Trade Center | 27.84 |
| Yorkville East | 46.51 |


```python
# Sort drop off zones by summed tip, largest first
df = setb.sort_values(by='tip_amount', ascending=False)
df
```

| zdo_zone | tip_amount |
|---|---|
| Astoria | 2135.36 |
| Steinway | 1349.23 |
| Old Astoria | 1147.13 |
| Long Island City/Queens Plaza | 709.03 |
| Sunnyside | 574.25 |
| ... | ... |
| Brighton Beach | 0.00 |
| Marine Park/Mill Basin | 0.00 |
| Longwood | 0.00 |
| Brownsville | 0.00 |
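The ranking above is based on the sum of tips per drop off zone. If "largest tip" is instead read as the single largest tip, `MAX` would take the place of `SUM`, and the two rankings need not agree. The sketch below illustrates the difference on made-up rows in `sqlite3`. The table name `astoria_trips`, the zone names, and the helper `top_zone` are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE astoria_trips (zdo_zone TEXT, tip_amount REAL);
    INSERT INTO astoria_trips VALUES
        ('Zone A', 2.0),
        ('Zone A', 3.0),
        ('Zone A', 4.0),
        ('Zone B', 8.0),
        ('Zone B', 0.0);
""")


def top_zone(aggregate: str) -> str:
    """Return the drop off zone ranked first under SUM or MAX of tip_amount."""
    if aggregate not in {"SUM", "MAX"}:
        raise ValueError("aggregate must be 'SUM' or 'MAX'")
    query = f"""
        SELECT zdo_zone
        FROM astoria_trips
        GROUP BY zdo_zone
        ORDER BY {aggregate}(tip_amount) DESC
        LIMIT 1
    """
    return conn.execute(query).fetchone()[0]


by_sum = top_zone("SUM")  # Zone A: 2 + 3 + 4 = 9 beats Zone B: 8 + 0 = 8
by_max = top_zone("MAX")  # Zone B: single tip of 8 beats Zone A's best of 4
print(by_sum, by_max)
```

In this toy data, many small tips put Zone A first by sum, while one large tip puts Zone B first by maximum.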


## Terraform

In this section of the homework we'll prepare the environment by creating resources in GCP with Terraform.

In your VM on GCP/Laptop/GitHub Codespace, install Terraform.
Copy the files from the course repo
[here](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/01-docker-terraform/1_terraform_gcp/terraform) to your VM/Laptop/GitHub Codespace.

Modify the files as necessary to create a GCP Bucket and BigQuery Dataset.


## Question 7. Creating Resources

After updating the main.tf and variable.tf files, run:

```
terraform apply
```

Paste the output of this command into the homework submission form.

![terraform apply output](./terraform/terraform_apply.png)

![terraform bucket](./terraform/terraform1_bucket.png)

![terraform BigQuery dataset](./terraform/terraform2_bigquery.png)