# Data Engineering Zoomcamp - Week 1: Upload Data to Postgres (Homework edition)

This notebook is part of the Data Engineering Zoomcamp 2023 and explores the dataset needed for the week 1 homework as specified [here](../homework.md).

In [4]:
import pandas as pd                             # to help with dataframes
import os                                       # to help with system calls
from sqlalchemy import create_engine            # to help with database connections
from time import time                           # to help with timing
import unittest                                 # to help with unit testing

In [5]:
pd.__version__

'1.5.2'

## Step 1: Import Data

We read the `green_tripdata_2019-01.csv.gz` file, which we downloaded beforehand, with pandas. For a first look we load only the first 100 rows.

In [12]:
peak_df = pd.read_csv('green_tripdata_2019-01.csv.gz', parse_dates=["lpep_pickup_datetime", "lpep_dropoff_datetime"], nrows=100)
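
The effect of `parse_dates` and `nrows` can be checked on a small in-memory sample. The sketch below is only an illustration: the three-row CSV text is a hypothetical stand-in for the real file, and the names `csv_text`, `sample_df` and `plain_df` are not part of the notebook.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file
csv_text = (
    "VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,trip_distance\n"
    "2,2019-01-01 00:10:16,2019-01-01 00:16:32,0.86\n"
    "2,2019-01-01 00:27:11,2019-01-01 00:31:38,0.66\n"
    "2,2019-01-01 00:46:20,2019-01-01 01:04:54,2.68\n"
)

date_cols = ["lpep_pickup_datetime", "lpep_dropoff_datetime"]

# nrows limits how many data rows are parsed; parse_dates converts the
# listed columns to datetime64 instead of leaving them as strings
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=date_cols, nrows=2)

# Same read without parse_dates, for comparison
plain_df = pd.read_csv(io.StringIO(csv_text), nrows=2)

print(sample_df.dtypes)
print(plain_df.dtypes)
```

Without `parse_dates` the two timestamp columns stay as strings, which would later be written to the database as text rather than as timestamps.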

## Step 2: Explore and prepare data

Now we take a look at the column types and clean the data where necessary.

In [13]:
peak_df.head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
0,2,2018-12-21 15:17:29,2018-12-21 15:18:57,N,1,264,264,5,0.0,3.0,0.5,0.5,0.0,0.0,,0.3,4.3,2,1,
1,2,2019-01-01 00:10:16,2019-01-01 00:16:32,N,1,97,49,2,0.86,6.0,0.5,0.5,0.0,0.0,,0.3,7.3,2,1,
2,2,2019-01-01 00:27:11,2019-01-01 00:31:38,N,1,49,189,2,0.66,4.5,0.5,0.5,0.0,0.0,,0.3,5.8,1,1,
3,2,2019-01-01 00:46:20,2019-01-01 01:04:54,N,1,189,17,2,2.68,13.5,0.5,0.5,2.96,0.0,,0.3,19.71,1,1,
4,2,2019-01-01 00:19:06,2019-01-01 00:39:43,N,1,82,258,1,4.53,18.0,0.5,0.5,0.0,0.0,,0.3,19.3,2,1,


### explore column types

In [14]:
# explore data
# peak_df.shape    # number of rows and columns
# peak_df.columns  # column names
# peak_df.dtypes   # column types
peak_df.info()   # index, column names, non-null counts and memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   VendorID               100 non-null    int64         
 1   lpep_pickup_datetime   100 non-null    datetime64[ns]
 2   lpep_dropoff_datetime  100 non-null    datetime64[ns]
 3   store_and_fwd_flag     100 non-null    object        
 4   RatecodeID             100 non-null    int64         
 5   PULocationID           100 non-null    int64         
 6   DOLocationID           100 non-null    int64         
 7   passenger_count        100 non-null    int64         
 8   trip_distance          100 non-null    float64       
 9   fare_amount            100 non-null    float64       
 10  extra                  100 non-null    float64       
 11  mta_tax                100 non-null    float64       
 12  tip_amount             100 non-null    float64       
 13  tolls_

## Step 3: Load taxi rides data into postgres database

### 3.1 connect to database

In [15]:
# create connection to postgres
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
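
The connection string above hard-codes the credentials. As a possible variation, the sketch below assembles the same URL from environment variables and falls back to the values used in this notebook. The variable names (`PG_USER`, `PG_PASSWORD`, `PG_HOST`, `PG_PORT`, `PG_DB`) and the helper `build_postgres_url` are assumptions made for illustration, not something the course prescribes.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Assemble a postgres URL from (hypothetical) environment variables."""
    user = env.get("PG_USER", "root")
    password = env.get("PG_PASSWORD", "root")
    host = env.get("PG_HOST", "localhost")
    port = env.get("PG_PORT", "5432")
    database = env.get("PG_DB", "ny_taxi")
    # quote_plus escapes characters such as '@' that would break the URL
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# With an empty mapping, only the defaults from this notebook are used
default_url = build_postgres_url(env={})
escaped_url = build_postgres_url(env={"PG_PASSWORD": "p@ss"})
print(default_url)
```

The resulting string can be passed to `create_engine` in place of the literal.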

### 3.2: importing data in chunks into a DataFrame

In [16]:
# for this purpose we can use the iterator argument from the pandas.read_csv function
# tip: press shift+tab to see help
df_iter = pd.read_csv('green_tripdata_2019-01.csv.gz', parse_dates=["lpep_pickup_datetime", "lpep_dropoff_datetime"], iterator=True, chunksize=100000)
df = next(df_iter)
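
To see how the iterator behaves, here is a minimal sketch on hypothetical in-memory data (10 rows, chunk size 4). The names `csv_text`, `chunk_iter`, `first_chunk` and `remaining_sizes` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical data: a header line plus 10 rows
csv_text = "trip_id,trip_distance\n" + "".join(
    f"{i},{i * 0.5}\n" for i in range(10)
)

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of a single DataFrame
chunk_iter = pd.read_csv(io.StringIO(csv_text), iterator=True, chunksize=4)

# next() pulls exactly one chunk from the iterator
first_chunk = next(chunk_iter)

# Iterating over the rest yields the remaining chunks; the last one is shorter
remaining_sizes = [len(chunk) for chunk in chunk_iter]

print(len(first_chunk), remaining_sizes)
```

Each chunk is an ordinary DataFrame, so anything that works on the full table (such as `to_sql`) also works per chunk.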

### 3.3: writing the data into the database in batches

The dataset is potentially too big to write into the postgres database in one piece. Therefore we split it into batches and load each chunk separately.

In [17]:
# create an empty table that carries only the column schema
df.head(n=0).to_sql(name='green_taxi_data', con=engine, if_exists='replace')
# then append the first chunk of rows to that table
df.to_sql(name='green_taxi_data', con=engine, if_exists='append')

1000
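
The two-step pattern (create the schema from an empty frame, then append rows) can be tried without a running postgres instance. The sketch below uses an in-memory SQLite database as a stand-in, which is an assumption made purely so that the example is self-contained; the two-row DataFrame is hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical two-row frame standing in for the first chunk
df = pd.DataFrame({"VendorID": [2, 2], "trip_distance": [0.86, 0.66]})

# In-memory SQLite stands in for the postgres engine here
conn = sqlite3.connect(":memory:")

# head(n=0) keeps the columns but no rows, so this only creates the table
df.head(n=0).to_sql(name="green_taxi_data", con=conn, if_exists="replace")
rows_after_schema = conn.execute(
    "SELECT COUNT(*) FROM green_taxi_data"
).fetchone()[0]

# Appending afterwards inserts the actual rows
df.to_sql(name="green_taxi_data", con=conn, if_exists="append")
rows_after_append = conn.execute(
    "SELECT COUNT(*) FROM green_taxi_data"
).fetchone()[0]

# Column names as stored in the database (the index becomes a column too)
columns = [row[1] for row in conn.execute("PRAGMA table_info(green_taxi_data)")]

print(rows_after_schema, rows_after_append, columns)
```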

*Now we can check in the terminal (with pgcli) or in pgAdmin whether the `green_taxi_data` table exists.*

In [18]:
# next() returns the next chunk; once the iterator is exhausted it raises StopIteration, which ends the loop
while True:
    try: 
        t_start = time()

        df = next(df_iter)

        #df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)
        #df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)
        
        df.to_sql(name='green_taxi_data', con=engine, if_exists='append')

        t_end = time()

        print('inserted another chunk, took %.3f seconds' % (t_end - t_start))
        
    except StopIteration:
        print("Finished ingesting data into the postgres database")
        break

inserted another chunk, took 15.591 seconds
inserted another chunk, took 15.564 seconds
inserted another chunk, took 15.485 seconds
inserted another chunk, took 14.941 seconds
inserted another chunk, took 15.135 seconds
inserted another chunk, took 5.144 seconds
Finished ingesting data into the postgres database
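
The same ingestion can also be written as a `for` loop, which ends on its own when the iterator is exhausted. This is an alternative sketch, not the notebook's method: the helper `ingest_in_chunks` is hypothetical, and an in-memory SQLite database with 10 made-up rows stands in for postgres and the real CSV.

```python
import io
import sqlite3
from time import time

import pandas as pd


def ingest_in_chunks(csv_source, table_name, con, chunksize):
    """Write a CSV to a table chunk by chunk and return the row count."""
    total_rows = 0
    for i, chunk in enumerate(pd.read_csv(csv_source, chunksize=chunksize)):
        t_start = time()
        # The first chunk (re)creates the table, later chunks are appended
        mode = "replace" if i == 0 else "append"
        chunk.to_sql(name=table_name, con=con, if_exists=mode, index=False)
        total_rows += len(chunk)
        print("inserted chunk %d, took %.3f seconds" % (i, time() - t_start))
    return total_rows


# Hypothetical data: a header line plus 10 rows
csv_text = "trip_id,trip_distance\n" + "".join(
    f"{i},{i * 0.5}\n" for i in range(10)
)
conn = sqlite3.connect(":memory:")

inserted = ingest_in_chunks(io.StringIO(csv_text), "green_taxi_data", conn, 4)
db_count = conn.execute("SELECT COUNT(*) FROM green_taxi_data").fetchone()[0]
print(inserted, db_count)
```

Returning the number of inserted rows makes it easy to compare against the database count afterwards.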


In [19]:
# check if the data was ingested correctly
print(pd.read_sql("SELECT COUNT(*) FROM green_taxi_data", engine))

    count
0  630918


In [20]:
# check the number of rows in the csv with pandas
print(len(pd.read_csv('green_tripdata_2019-01.csv.gz')))

630918
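
Calling `len(pd.read_csv(...))` loads the whole file into memory just to count rows. If that becomes a concern, the count can be accumulated chunk by chunk. The sketch below shows the idea on hypothetical in-memory data; the chunk size of 4 is arbitrary.

```python
import io

import pandas as pd

# Hypothetical data: a header line plus 10 rows
csv_text = "trip_id,trip_distance\n" + "".join(
    f"{i},{i * 0.5}\n" for i in range(10)
)

# Only one chunk is held in memory at a time while counting
row_count = sum(
    len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4)
)

# Reference value from reading everything at once
full_count = len(pd.read_csv(io.StringIO(csv_text)))

print(row_count, full_count)
```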


In [21]:
# write unit test
class TestIngestData(unittest.TestCase):
    def test_ingest_data(self):
        # check if the data was ingested correctly
        self.assertEqual(pd.read_sql("SELECT COUNT(*) FROM green_taxi_data", engine).values[0][0], len(pd.read_csv('green_tripdata_2019-01.csv.gz')))
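
Defining the test class does not run it. One way to execute a test case from inside a notebook is to load it into a suite and pass it to a text runner, as sketched below. To keep the example self-contained, an in-memory SQLite database and a hypothetical three-row CSV stand in for postgres and the real file; the class name `TestRowCounts` is illustrative.

```python
import io
import sqlite3
import unittest

import pandas as pd

# Hypothetical three-row CSV and an in-memory database standing in for postgres
csv_text = "trip_id,trip_distance\n0,0.86\n1,0.66\n2,2.68\n"
conn = sqlite3.connect(":memory:")
pd.read_csv(io.StringIO(csv_text)).to_sql(
    name="green_taxi_data", con=conn, index=False
)


class TestRowCounts(unittest.TestCase):
    def test_row_counts_match(self):
        # The table should hold exactly as many rows as the CSV
        db_count = conn.execute(
            "SELECT COUNT(*) FROM green_taxi_data"
        ).fetchone()[0]
        csv_count = len(pd.read_csv(io.StringIO(csv_text)))
        self.assertEqual(db_count, csv_count)


# Load and run the test case explicitly; this does not call sys.exit,
# so it is safe to use inside a notebook cell
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRowCounts)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

The returned `result` object exposes `wasSuccessful()` and `testsRun`, which can be inspected in a later cell.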

## Step 4: Load Taxi Zones Data into postgres database

This time we download the data only if the file does not already exist (`--no-clobber`, short form `-nc`).

In [22]:
!wget -O ./taxi_zone_lookup.csv --no-clobber https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv

File ‘./taxi_zone_lookup.csv’ already there; not retrieving.
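
A similar "download only if missing" check can be written in plain Python. The sketch below is an assumption-laden illustration: `download_if_missing` is a hypothetical helper, and the demonstration swaps in a fake fetch function and a placeholder URL so that no network access is needed. In real use the default `urlretrieve` would perform the download.

```python
import os
import tempfile
from urllib.request import urlretrieve


def download_if_missing(url, dest, fetch=urlretrieve):
    """Fetch url into dest unless dest already exists.

    Returns True if a download happened, False if the file was kept.
    """
    if os.path.exists(dest):
        return False
    fetch(url, dest)
    return True


# Demonstration with a fake fetch function (no network involved)
calls = []


def fake_fetch(url, dest):
    calls.append(url)
    with open(dest, "w") as handle:
        handle.write("LocationID,Borough,Zone,service_zone\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    dest_path = os.path.join(tmp_dir, "taxi_zone_lookup.csv")
    placeholder_url = "https://example.invalid/taxi_zone_lookup.csv"
    first = download_if_missing(placeholder_url, dest_path, fetch=fake_fetch)
    second = download_if_missing(placeholder_url, dest_path, fetch=fake_fetch)

print(first, second, len(calls))
```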


In [23]:
df_zones = pd.read_csv('taxi_zone_lookup.csv')

In [24]:
df_zones.head()

Unnamed: 0,LocationID,Borough,Zone,service_zone
0,1,EWR,Newark Airport,EWR
1,2,Queens,Jamaica Bay,Boro Zone
2,3,Bronx,Allerton/Pelham Gardens,Boro Zone
3,4,Manhattan,Alphabet City,Yellow Zone
4,5,Staten Island,Arden Heights,Boro Zone


In [25]:
df_zones.to_sql(name='zones', con=engine, if_exists='replace')

265
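
By default `to_sql` also writes the DataFrame index as an extra column named `index`. If that column is not wanted in the `zones` table, `index=False` leaves it out. The sketch below compares both variants, using an in-memory SQLite database as a stand-in for postgres and the first two rows of the zone lookup shown above.

```python
import sqlite3

import pandas as pd

# First two rows of the zone lookup, taken from the output above
df_zones = pd.DataFrame(
    {
        "LocationID": [1, 2],
        "Borough": ["EWR", "Queens"],
        "Zone": ["Newark Airport", "Jamaica Bay"],
        "service_zone": ["EWR", "Boro Zone"],
    }
)

# In-memory SQLite stands in for the postgres engine here
conn = sqlite3.connect(":memory:")

# Default behaviour: the index is stored as a column called "index"
df_zones.to_sql(name="zones_with_index", con=conn, if_exists="replace")

# index=False writes only the actual data columns
df_zones.to_sql(name="zones", con=conn, if_exists="replace", index=False)


def column_names(table):
    """Return the column names of a table in the SQLite stand-in."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


print(column_names("zones_with_index"))
print(column_names("zones"))
```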