# Data Engineering Zoomcamp - Week 1: Upload Data to Postgres (Homework edition)

This notebook is part of the Data Engineering Zoomcamp 2023 and explores the dataset needed for the Week 1 homework, as specified [here](../homework.md).

In [29]:
import pandas as pd
import os
from sqlalchemy import create_engine
from time import time

In [14]:
pd.__version__

'1.5.2'

## Step 1: Import Data

We use pandas to import `green_tripdata_2019-01.csv.gz`, which we downloaded beforehand. For a first look we read only the first 100 rows (`nrows=100`).

In [15]:
df = pd.read_csv('green_tripdata_2019-01.csv.gz', nrows=100)

In [16]:
df.head()

| | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2018-12-21 15:17:29 | 2018-12-21 15:18:57 | N | 1 | 264 | 264 | 5 | 0.0 | 3.0 | 0.5 | 0.5 | 0.0 | 0.0 | | 0.3 | 4.3 | 2 | 1 | |
| 1 | 2 | 2019-01-01 00:10:16 | 2019-01-01 00:16:32 | N | 1 | 97 | 49 | 2 | 0.86 | 6.0 | 0.5 | 0.5 | 0.0 | 0.0 | | 0.3 | 7.3 | 2 | 1 | |
| 2 | 2 | 2019-01-01 00:27:11 | 2019-01-01 00:31:38 | N | 1 | 49 | 189 | 2 | 0.66 | 4.5 | 0.5 | 0.5 | 0.0 | 0.0 | | 0.3 | 5.8 | 1 | 1 | |
| 3 | 2 | 2019-01-01 00:46:20 | 2019-01-01 01:04:54 | N | 1 | 189 | 17 | 2 | 2.68 | 13.5 | 0.5 | 0.5 | 2.96 | 0.0 | | 0.3 | 19.71 | 1 | 1 | |
| 4 | 2 | 2019-01-01 00:19:06 | 2019-01-01 00:39:43 | N | 1 | 82 | 258 | 1 | 4.53 | 18.0 | 0.5 | 0.5 | 0.0 | 0.0 | | 0.3 | 19.3 | 2 | 1 | |


## Step 2: Explore and prepare data

Now we take a look at the column types and clean the data where necessary.

### explore column types

In [17]:
# Jupyter displays the returned string with escaped line breaks (\n), so we print() it to get readable output
print(pd.io.sql.get_schema(df, name = "green_taxi_data"))

CREATE TABLE "green_taxi_data" (
"VendorID" INTEGER,
  "lpep_pickup_datetime" TEXT,
  "lpep_dropoff_datetime" TEXT,
  "store_and_fwd_flag" TEXT,
  "RatecodeID" INTEGER,
  "PULocationID" INTEGER,
  "DOLocationID" INTEGER,
  "passenger_count" INTEGER,
  "trip_distance" REAL,
  "fare_amount" REAL,
  "extra" REAL,
  "mta_tax" REAL,
  "tip_amount" REAL,
  "tolls_amount" REAL,
  "ehail_fee" REAL,
  "improvement_surcharge" REAL,
  "total_amount" REAL,
  "payment_type" INTEGER,
  "trip_type" INTEGER,
  "congestion_surcharge" REAL
)


### convert datetime columns

In [18]:
# the inferred schema shows that the datetime columns were read as TEXT and not recognized as timestamps,
# so we convert them explicitly
df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)
df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)

In [19]:
# the two datetime columns are now mapped to TIMESTAMP
print(pd.io.sql.get_schema(df, name = "green_taxi_data"))

CREATE TABLE "green_taxi_data" (
"VendorID" INTEGER,
  "lpep_pickup_datetime" TIMESTAMP,
  "lpep_dropoff_datetime" TIMESTAMP,
  "store_and_fwd_flag" TEXT,
  "RatecodeID" INTEGER,
  "PULocationID" INTEGER,
  "DOLocationID" INTEGER,
  "passenger_count" INTEGER,
  "trip_distance" REAL,
  "fare_amount" REAL,
  "extra" REAL,
  "mta_tax" REAL,
  "tip_amount" REAL,
  "tolls_amount" REAL,
  "ehail_fee" REAL,
  "improvement_surcharge" REAL,
  "total_amount" REAL,
  "payment_type" INTEGER,
  "trip_type" INTEGER,
  "congestion_surcharge" REAL
)
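As a possible alternative to converting the columns after loading, `pandas.read_csv` can parse them while reading via its `parse_dates` argument. The sketch below is only an illustration: it uses two rows copied from the `head()` output above in an in-memory CSV (the names `sample_csv`, `datetime_columns` and `df_sample` are made up for this example) instead of the real file.

```python
import io

import pandas as pd

# Two rows copied from the head() output above, reduced to three columns (illustrative only).
sample_csv = io.StringIO(
    "VendorID,lpep_pickup_datetime,lpep_dropoff_datetime\n"
    "2,2019-01-01 00:10:16,2019-01-01 00:16:32\n"
    "2,2019-01-01 00:27:11,2019-01-01 00:31:38\n"
)

datetime_columns = ["lpep_pickup_datetime", "lpep_dropoff_datetime"]

# parse_dates asks read_csv to convert the listed columns to datetime64 while reading.
df_sample = pd.read_csv(sample_csv, parse_dates=datetime_columns)

print(df_sample.dtypes)
print(pd.io.sql.get_schema(df_sample, name="green_taxi_data"))
```

The same argument should also work together with `chunksize`, which would make the per-chunk `pd.to_datetime` calls unnecessary.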


## Step 3: Load taxi rides data into postgres database

### 3.1 connect to database

In [20]:
# create connection to postgres
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
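The connection string above hard-codes the credentials. A minimal sketch of one way to avoid that is to assemble the URL from environment variables. The helper `build_postgres_url` and the variable names `PG_USER`, `PG_PASSWORD`, `PG_HOST`, `PG_PORT` and `PG_DB` are assumptions made for this example, not part of the course setup; the defaults reproduce the values used in this notebook.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Assemble a postgresql:// URL from environment variables (hypothetical names)."""
    user = env.get("PG_USER", "root")
    password = env.get("PG_PASSWORD", "root")
    host = env.get("PG_HOST", "localhost")
    port = env.get("PG_PORT", "5432")
    database = env.get("PG_DB", "ny_taxi")
    # quote_plus escapes characters such as '@' or '/' that would otherwise break the URL.
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"


# With no variables set, the defaults give the URL used in this notebook.
print(build_postgres_url({}))  # postgresql://root:root@localhost:5432/ny_taxi
```

The returned string can then be passed to `create_engine` in place of the literal URL.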

### 3.2: writing the table schema into DB

In [44]:
# write the table schema into the DB
df.head(n=0).to_sql(name='green_taxi_data', con=engine, if_exists='replace')

0

*Now we can check in the terminal (with pgcli) or in pgAdmin whether the `green_taxi_data` table exists.*
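The same check can also be done from Python. The sketch below shows what `head(n=0).to_sql(...)` does: it creates the table from the column names and dtypes but inserts no rows. To stay runnable without a database server it uses an in-memory SQLite connection as a stand-in for the Postgres engine, and a tiny made-up frame `df_sample`; both are assumptions for illustration only.

```python
import sqlite3

import pandas as pd

# Small illustrative frame; the real notebook uses the green taxi DataFrame.
df_sample = pd.DataFrame({"VendorID": [2, 2], "trip_distance": [0.86, 0.66]})

# In-memory SQLite stands in for the Postgres engine (assumption for this sketch).
con = sqlite3.connect(":memory:")

# head(n=0) keeps columns and dtypes but drops all rows, so only the table is created.
df_sample.head(n=0).to_sql(name="green_taxi_data", con=con, if_exists="replace")

# List the tables and count the rows to confirm the table exists and is empty.
tables = [row[0] for row in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
row_count = con.execute("SELECT COUNT(*) FROM green_taxi_data").fetchone()[0]

print(tables, row_count)
```

Against Postgres, the table listing would need a different query (for example against `information_schema.tables`), since `sqlite_master` is specific to SQLite.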

### 3.3: writing the data into the database in batches

The dataset is potentially too big to write into the Postgres database in one piece, so we read it in chunks and load each chunk separately.

In [45]:
# passing chunksize makes pandas.read_csv return an iterator that yields DataFrames of up to 100,000 rows each
# tip: press shift+tab to see help
df_iter = pd.read_csv('green_tripdata_2019-01.csv.gz', iterator=True, chunksize=100000)

In [46]:
# next() returns the next chunk; once the iterator is exhausted it raises StopIteration, which ends the loop
while True:
    try:
        t_start = time()

        df = next(df_iter)

        df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)
        df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)

        df.to_sql(name='green_taxi_data', con=engine, if_exists='append')

        t_end = time()

        print('inserted another chunk, took %.3f seconds' % (t_end - t_start))

    except StopIteration:
        print("Finished ingesting data into the postgres database")
        break

inserted another chunk, took 30.223 seconds
inserted another chunk, took 28.626 seconds
inserted another chunk, took 30.076 seconds
inserted another chunk, took 28.380 seconds
inserted another chunk, took 30.697 seconds
inserted another chunk, took 38.527 seconds
inserted another chunk, took 16.371 seconds
Finished ingesting data into the postgres database
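The `while True` / `next()` / `StopIteration` pattern can also be written as a plain `for` loop, because the object returned by `read_csv` with `chunksize` is iterable. The following is a minimal sketch of that variant, not the notebook's actual run: it reads five rows copied from the `head()` output above from an in-memory CSV with `chunksize=2`, and it writes to an in-memory SQLite database as a stand-in for Postgres. The use of `index=False` is also a choice made for this example.

```python
import io
import sqlite3
from time import time

import pandas as pd

# Five rows copied from the head() output above, reduced to four columns (illustrative only).
csv_text = (
    "VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,trip_distance\n"
    "2,2018-12-21 15:17:29,2018-12-21 15:18:57,0.0\n"
    "2,2019-01-01 00:10:16,2019-01-01 00:16:32,0.86\n"
    "2,2019-01-01 00:27:11,2019-01-01 00:31:38,0.66\n"
    "2,2019-01-01 00:46:20,2019-01-01 01:04:54,2.68\n"
    "2,2019-01-01 00:19:06,2019-01-01 00:39:43,4.53\n"
)

# In-memory SQLite stands in for the Postgres engine (assumption for this sketch).
con = sqlite3.connect(":memory:")

total_rows = 0
# The for loop ends on its own when the reader is exhausted; no StopIteration handling needed.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    t_start = time()

    chunk.lpep_pickup_datetime = pd.to_datetime(chunk.lpep_pickup_datetime)
    chunk.lpep_dropoff_datetime = pd.to_datetime(chunk.lpep_dropoff_datetime)

    # if_exists="append" creates the table on the first chunk and appends afterwards.
    chunk.to_sql(name="green_taxi_data", con=con, if_exists="append", index=False)
    total_rows += len(chunk)

    print("inserted %d rows, took %.3f seconds" % (len(chunk), time() - t_start))

stored_rows = con.execute("SELECT COUNT(*) FROM green_taxi_data").fetchone()[0]
print("rows read:", total_rows, "rows stored:", stored_rows)
```

Comparing the number of rows read with the number stored is a cheap sanity check that no chunk was skipped.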


## Step 4: Load Taxi Zones Data into postgres database

This time we download the data only if the file does not already exist (wget's `--no-clobber` flag, short form `-nc`).

In [39]:
!wget -O ./taxi_zone_lookup.csv --no-clobber https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv

File ‘./taxi_zone_lookup.csv’ already there; not retrieving.
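If `wget` is not available, the same "download only if missing" behaviour can be approximated in Python with the standard library. This is a sketch under assumptions: the helper name `download_if_missing` and its `fetch` parameter are invented for this example, and the function only checks whether the target path exists, not whether the file is complete.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, path, fetch=urlretrieve):
    """Download url to path unless path already exists; return True if a download happened."""
    if os.path.exists(path):
        print(f"File '{path}' already there; not retrieving.")
        return False
    # fetch defaults to urllib's urlretrieve; it is a parameter so it can be swapped out in tests.
    fetch(url, path)
    return True


# Example call (commented out so the sketch does not hit the network when run):
# download_if_missing(
#     "https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv",
#     "taxi_zone_lookup.csv",
# )
```

Making the download function a parameter keeps the helper testable offline.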


In [40]:
df_zones = pd.read_csv('taxi_zone_lookup.csv')

In [41]:
df_zones.head()

| | LocationID | Borough | Zone | service_zone |
|---|---|---|---|---|
| 0 | 1 | EWR | Newark Airport | EWR |
| 1 | 2 | Queens | Jamaica Bay | Boro Zone |
| 2 | 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
| 3 | 4 | Manhattan | Alphabet City | Yellow Zone |
| 4 | 5 | Staten Island | Arden Heights | Boro Zone |


In [43]:
df_zones.to_sql(name='zones', con=engine, if_exists='replace')

265
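The value printed by `to_sql` here is the number of rows written. To double-check the load independently, the table can be read back with `pandas.read_sql_query`. The sketch below illustrates the idea with the five rows shown in the `head()` output above and an in-memory SQLite connection standing in for Postgres; the names `zones_sample` and `n_rows` are made up for this example.

```python
import sqlite3

import pandas as pd

# The five rows shown in the head() output above (illustrative subset of the zones file).
zones_sample = pd.DataFrame(
    {
        "LocationID": [1, 2, 3, 4, 5],
        "Borough": ["EWR", "Queens", "Bronx", "Manhattan", "Staten Island"],
        "Zone": [
            "Newark Airport",
            "Jamaica Bay",
            "Allerton/Pelham Gardens",
            "Alphabet City",
            "Arden Heights",
        ],
        "service_zone": ["EWR", "Boro Zone", "Boro Zone", "Yellow Zone", "Boro Zone"],
    }
)

# In-memory SQLite stands in for the Postgres engine (assumption for this sketch).
con = sqlite3.connect(":memory:")
zones_sample.to_sql(name="zones", con=con, if_exists="replace", index=False)

# Read the row count back from the database and compare it with the DataFrame length.
check = pd.read_sql_query("SELECT COUNT(*) AS n_rows FROM zones", con)
n_rows = int(check.loc[0, "n_rows"])
print(n_rows, len(zones_sample))
```

With the real `engine` and the full `df_zones`, the same query would be expected to report 265 rows, matching the output above.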