## Upload data

In the original zoomcamp video, the NYC taxi dataset is downloaded as a CSV file. The dataset is currently available only in Parquet format, so we first convert it from Parquet to CSV (see the comments by Kyle A and taro.wp on the [zoomcamp video](https://www.youtube.com/watch?v=2JM-ziJt0WI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)).

In [1]:
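import pandas as pd

# Path to the downloaded Parquet file.
parquet_file = './yellow_tripdata_2021-01.parquet'
df = pd.read_parquet(parquet_file, engine='pyarrow')
# Write a gzip-compressed CSV next to the source file, without the DataFrame index.
df.to_csv(parquet_file.replace('parquet', 'csv.gz'), index=False, compression='gzip')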
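
The output path above is derived with `str.replace`, which substitutes every occurrence of `parquet` in the string, so it would also rewrite a directory name that happens to contain that word. As a minimal sketch of one alternative, `pathlib.Path.with_suffix` swaps only the final extension. The helper name `csv_gz_path` and the `parquet_data/` directory are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def csv_gz_path(parquet_path: str) -> str:
    """Return the path with its final extension replaced by '.csv.gz'."""
    return Path(parquet_path).with_suffix('.csv.gz').as_posix()


# For the file used in this section, the resulting file name is the same.
print(csv_gz_path('./yellow_tripdata_2021-01.parquet'))

# Hypothetical path whose directory name also contains 'parquet'.
tricky = 'parquet_data/trips.parquet'
print(tricky.replace('parquet', 'csv.gz'))  # the directory name is rewritten too
print(csv_gz_path(tricky))                  # only the extension changes
```

For the single file in the current directory both approaches name the same file, so this only matters once the path contains the word `parquet` somewhere other than the extension.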

Below we generate the SQL `CREATE TABLE` statement for the database table, with column types inferred from the first 100 rows of the CSV file.

In [3]:
df = pd.read_csv('./yellow_tripdata_2021-01.csv.gz', nrows=100, compression='gzip')
print(pd.io.sql.get_schema(df, name='yellow_taxi_data'))

CREATE TABLE "yellow_taxi_data" (
"VendorID" INTEGER,
  "tpep_pickup_datetime" TEXT,
  "tpep_dropoff_datetime" TEXT,
  "passenger_count" REAL,
  "trip_distance" REAL,
  "RatecodeID" REAL,
  "store_and_fwd_flag" TEXT,
  "PULocationID" INTEGER,
  "DOLocationID" INTEGER,
  "payment_type" INTEGER,
  "fare_amount" REAL,
  "extra" REAL,
  "mta_tax" REAL,
  "tip_amount" REAL,
  "tolls_amount" REAL,
  "improvement_surcharge" REAL,
  "total_amount" REAL,
  "congestion_surcharge" REAL,
  "airport_fee" REAL
)
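
The two datetime columns come out as `TEXT` above because the CSV is read without date parsing. As a rough sketch of how the generated schema changes once those columns are parsed, the example below uses a made-up two-row sample in place of the real file; the sample values are illustrative only. It assumes `get_schema` is called without a `con` argument, as in the cell above, in which case pandas emits generic DDL rather than Postgres-specific types.

```python
import io

import pandas as pd

# Made-up sample with the same column names as the taxi data (illustration only).
csv_text = (
    "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance\n"
    "1,2021-01-01 00:30:10,2021-01-01 00:36:12,2.1\n"
    "2,2021-01-01 00:51:20,2021-01-01 00:52:19,0.2\n"
)
datetime_cols = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]

# parse_dates converts the listed columns to datetime64 while reading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=datetime_cols)
schema_sql = pd.io.sql.get_schema(sample, name="yellow_taxi_data")
print(schema_sql)
```

With the columns parsed, the statement should declare them as `TIMESTAMP` instead of `TEXT`. The ingestion script in this section reaches the same result by calling `pd.to_datetime` on each chunk before inserting it.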


Next, we run a simple script to ingest the data into Postgres. Note that Postgres must be running; otherwise, the connection to the database will fail.

In [4]:
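from sqlalchemy import create_engine
from time import time

engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
# Fail early if Postgres is unreachable; the context manager closes the test connection.
with engine.connect():
    pass

# Read the CSV lazily in chunks of 100,000 rows to keep memory usage bounded.
df_iter = pd.read_csv('./yellow_tripdata_2021-01.csv.gz', iterator=True, chunksize=100000)
while True:
    try:
        t_start = time()
        df = next(df_iter)
        # The CSV stores timestamps as text; convert them before inserting.
        df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
        df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
        df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
        t_end = time()
        print(f'inserted another chunk, took {t_end-t_start:.3f} seconds')
    except StopIteration:
        # The iterator is exhausted: every chunk has been inserted.
        # Any other exception (for example a database error) now surfaces instead of being hidden.
        break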

inserted another chunk, took 3.692 seconds
inserted another chunk, took 3.603 seconds
inserted another chunk, took 3.637 seconds
inserted another chunk, took 3.767 seconds
inserted another chunk, took 3.614 seconds
inserted another chunk, took 3.631 seconds
inserted another chunk, took 3.632 seconds
inserted another chunk, took 3.555 seconds
inserted another chunk, took 3.593 seconds
inserted another chunk, took 3.603 seconds
inserted another chunk, took 3.615 seconds
inserted another chunk, took 3.757 seconds
inserted another chunk, took 3.989 seconds
inserted another chunk, took 2.360 seconds
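
The chunked ingestion above can also be written as a plain `for` loop over the chunk iterator, which ends on its own when the file is exhausted. The sketch below is an illustration under several assumptions, not the script used above: it writes to an in-memory SQLite database instead of Postgres so that it runs without a server, it uses a made-up five-row sample with a chunk size of 2, and it passes `index=False` so that the DataFrame index is not stored as an extra `index` column.

```python
import io
import sqlite3

import pandas as pd

# Tiny stand-in for the taxi CSV (made-up rows, for illustration only).
csv_text = (
    "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance\n"
    "1,2021-01-01 00:30:10,2021-01-01 00:36:12,2.1\n"
    "1,2021-01-01 00:51:20,2021-01-01 00:52:19,0.2\n"
    "2,2021-01-01 00:43:30,2021-01-01 01:11:06,14.7\n"
    "2,2021-01-01 00:15:48,2021-01-01 00:31:01,10.6\n"
    "1,2021-01-01 00:31:49,2021-01-01 00:48:21,4.9\n"
)
datetime_cols = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]

# In-memory SQLite stands in for the Postgres engine used in this section.
con = sqlite3.connect(":memory:")

# With chunksize set, read_csv returns an iterator of DataFrames.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2, parse_dates=datetime_cols)

rows_inserted = 0
for chunk in chunks:
    # The first chunk creates the table; later chunks are appended to it.
    chunk.to_sql("yellow_taxi_data", con, if_exists="append", index=False)
    rows_inserted += len(chunk)

row_count = con.execute("SELECT COUNT(*) FROM yellow_taxi_data").fetchone()[0]
column_names = [row[1] for row in con.execute("PRAGMA table_info(yellow_taxi_data)")]
print(rows_inserted, row_count, column_names)
con.close()
```

Against Postgres, the same loop would take the SQLAlchemy `engine` in place of `con`. Comparing the row count in the table with the number of rows read is a cheap check that no chunk was skipped.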
