## Transferring the flat-file database to PostgreSQL

Flat files are convenient for research, but for our recommender server we move the data to a locally hosted PostgreSQL database.

We first need to determine the database schema, so we start by inspecting the type of each column.

In [1]:
import pandas as pd

In [6]:
df = pd.read_parquet('../data/tr_data/2009-01.parquet')
df.dtypes

pickup_datetime      datetime64[ns]
dropoff_datetime     datetime64[ns]
passenger_count               int64
trip_distance               float64
payment_type                 object
fare_amount                 float64
tip_amount                  float64
trip_time_in_secs           float64
fare_per_sec                float64
day                          object
time                         object
pickup_longitude            float64
pickup_latitude             float64
dropoff_longitude           float64
dropoff_latitude            float64
dtype: object

Note that our analysis does not use `payment_type`, so we drop this column from every monthly file.

In [16]:
date_ptr = '2009-01'

while date_ptr != '2024-01':
    date_data = pd.read_parquet(f'../data/tr_data/{date_ptr}.parquet')
    if 'payment_type' in date_data:
        date_data = date_data.drop(columns=['payment_type'])
    date_data.to_parquet(f'../data/tr_data/{date_ptr}.parquet')
    date_ptr = (pd.to_datetime(date_ptr) + pd.DateOffset(months=1)).strftime('%Y-%m')
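The month-by-month loop above can also be written around a small generator. The following is only a sketch using the standard library; `month_range` is a hypothetical helper name, not something from the original notebook. It compares with `<` rather than `!=`, so it simply yields nothing if the start month is already past the end month.

```python
def month_range(start: str, end: str):
    """Yield 'YYYY-MM' strings from start (inclusive) to end (exclusive)."""
    year, month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))
    while (year, month) < (end_year, end_month):
        yield f"{year:04d}-{month:02d}"
        # Advance one month, rolling over into the next year after December.
        month += 1
        if month > 12:
            year, month = year + 1, 1


months = list(month_range("2009-01", "2024-01"))
print(months[0], months[-1], len(months))  # 2009-01 2023-12 180
```

Each yielded string can be substituted for `date_ptr` in the file path, for example `f'../data/tr_data/{month}.parquet'`.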

Now we read the dataframe again.

In [18]:
df = pd.read_parquet('../data/tr_data/2009-01.parquet')
df.dtypes

pickup_datetime      datetime64[ns]
dropoff_datetime     datetime64[ns]
passenger_count               int64
trip_distance               float64
fare_amount                 float64
tip_amount                  float64
trip_time_in_secs           float64
fare_per_sec                float64
day                          object
time                         object
pickup_longitude            float64
pickup_latitude             float64
dropoff_longitude           float64
dropoff_latitude            float64
dtype: object

These types map to PostgreSQL types as follows:
- `pickup_datetime`: `TIMESTAMP`
- `dropoff_datetime`: `TIMESTAMP`
- `passenger_count`: `INTEGER`
- `trip_distance`: `NUMERIC`
- `fare_amount`: `NUMERIC`
- `tip_amount`: `NUMERIC`
- `trip_time_in_secs`: `NUMERIC`
- `fare_per_sec`: `NUMERIC`
- `day`: `TEXT`
- `time`: `TIME`
- `pickup_longitude`: `NUMERIC`
- `pickup_latitude`: `NUMERIC`
- `dropoff_longitude`: `NUMERIC`
- `dropoff_latitude`: `NUMERIC`

Therefore, our schema becomes:

In [None]:
CREATE TABLE trips (
    pickup_datetime TIMESTAMP,
    dropoff_datetime TIMESTAMP,
    passenger_count INTEGER,
    trip_distance NUMERIC,
    fare_amount NUMERIC,
    tip_amount NUMERIC,
    trip_time_in_secs NUMERIC,
    fare_per_sec NUMERIC,
    day TEXT,
    "time" TIME,
    pickup_longitude NUMERIC,
    pickup_latitude NUMERIC,
    dropoff_longitude NUMERIC,
    dropoff_latitude NUMERIC
);
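A statement like the one above could also be generated from the dtype listing rather than typed by hand. This is a minimal sketch under stated assumptions: `DTYPE_TO_SQL` and `build_create_table` are hypothetical names, the dtype dictionary is copied from the `df.dtypes` output shown earlier (in the notebook it could come from `df.dtypes.astype(str).to_dict()`), and the `time` column is overridden by hand because its `object` dtype would otherwise map to `TEXT`. Every column name is double-quoted here, which is harmless for lowercase names in PostgreSQL.

```python
# Assumed mapping from pandas dtype names to PostgreSQL column types,
# mirroring the list above.
DTYPE_TO_SQL = {
    "datetime64[ns]": "TIMESTAMP",
    "int64": "INTEGER",
    "float64": "NUMERIC",
    "object": "TEXT",
}


def build_create_table(table, dtypes, overrides=None):
    """Return a CREATE TABLE statement for a {column: dtype name} mapping."""
    overrides = overrides or {}
    columns = []
    for name, dtype in dtypes.items():
        # A per-column override wins over the generic dtype mapping.
        sql_type = overrides.get(name, DTYPE_TO_SQL[dtype])
        columns.append(f'    "{name}" {sql_type}')
    return f"CREATE TABLE {table} (\n" + ",\n".join(columns) + "\n);"


# Column dtypes as printed by df.dtypes after dropping payment_type.
dtypes = {
    "pickup_datetime": "datetime64[ns]",
    "dropoff_datetime": "datetime64[ns]",
    "passenger_count": "int64",
    "trip_distance": "float64",
    "fare_amount": "float64",
    "tip_amount": "float64",
    "trip_time_in_secs": "float64",
    "fare_per_sec": "float64",
    "day": "object",
    "time": "object",
    "pickup_longitude": "float64",
    "pickup_latitude": "float64",
    "dropoff_longitude": "float64",
    "dropoff_latitude": "float64",
}

ddl = build_create_table("trips", dtypes, overrides={"time": "TIME"})
print(ddl)
```

One design choice worth weighing: `NUMERIC` stores exact decimals, while PostgreSQL's `DOUBLE PRECISION` is an 8-byte floating-point type that corresponds more directly to `float64`. Swapping the `"float64"` entry in the mapping would be enough to try that alternative.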

With the database `taxis_and_ubers` and its table `trips` created, we now insert the monthly files into the table.

In [34]:
from sqlalchemy import create_engine
engine = create_engine('postgresql://haekim:password@localhost:5432/taxis_and_ubers')
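The connection string above embeds the password directly in the notebook. One possible alternative, sketched here with assumptions, is to assemble the URL from environment variables. The helper name `postgres_url` is hypothetical, and the variable names (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`) are a convention chosen for this example, not something the original setup defines. The password is percent-encoded so that characters such as `@` or `/` do not break the URL.

```python
import os
from urllib.parse import quote_plus


def postgres_url(env=os.environ):
    """Build a PostgreSQL URL from a mapping of environment variables."""
    user = env["PGUSER"]
    # Percent-encode the password so special characters stay URL-safe.
    password = quote_plus(env["PGPASSWORD"])
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "taxis_and_ubers")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Demonstration with an explicit mapping instead of the real environment.
example_env = {"PGUSER": "haekim", "PGPASSWORD": "p@ss/word"}
url = postgres_url(example_env)
print(url)  # postgresql://haekim:p%40ss%2Fword@localhost:5432/taxis_and_ubers
```

In the notebook, the result would be passed on as `create_engine(postgres_url())`.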

In [38]:
date_ptr = '2009-01'

while date_ptr != '2024-01':
    date_data = pd.read_parquet(f'../data/tr_data/{date_ptr}.parquet')
    date_data.to_sql('trips', engine, if_exists='append', index=False)
    date_ptr = (pd.to_datetime(date_ptr) + pd.DateOffset(months=1)).strftime('%Y-%m')
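Because `if_exists='append'` adds rows to whatever is already in the table, running the loading cell a second time inserts every trip again. The sketch below illustrates this on a tiny, made-up dataframe, using an in-memory SQLite connection as a stand-in for the PostgreSQL engine (an assumption made only so the example runs without a server). It also passes `chunksize`, a `to_sql` parameter that writes rows in batches, which may help with large monthly files.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the PostgreSQL engine in this sketch.
conn = sqlite3.connect(":memory:")

# A small illustrative sample, not real trip data.
sample = pd.DataFrame(
    {
        "passenger_count": [1, 2, 1, 1, 2],
        "trip_distance": [9.97, 9.20, 8.00, 8.85, 8.60],
    }
)


def count_rows(connection):
    """Return the number of rows currently in the trips table."""
    return connection.execute("SELECT COUNT(*) FROM trips").fetchone()[0]


# First load: the table is created and five rows are written in batches of two.
sample.to_sql("trips", conn, if_exists="append", index=False, chunksize=2)
rows_after_first_load = count_rows(conn)

# Second load of the same data: the rows are appended again, not replaced.
sample.to_sql("trips", conn, if_exists="append", index=False, chunksize=2)
rows_after_second_load = count_rows(conn)

print(rows_after_first_load, rows_after_second_load)  # 5 10
```

A row count per month after loading is a cheap way to check that no file was inserted twice.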

Now let's try to read in all the data from January 2023 onward.

In [49]:
df = pd.read_sql_query(
    "SELECT * FROM trips WHERE pickup_datetime >= '2023-01-01'",
    con=engine
)
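The query above has no upper bound, so it returns every trip in the table with a pickup on or after 1 January 2023. To restrict the result to a single month, a half-open date range can be used. The following is a sketch under stated assumptions: `month_bounds` and `QUERY` are hypothetical names, and the `:start` / `:end` placeholders are intended for SQLAlchemy's `text()` construct.

```python
from datetime import date


def month_bounds(month: str):
    """Return (first day of month, first day of next month) for 'YYYY-MM'."""
    year, mon = (int(part) for part in month.split("-"))
    start = date(year, mon, 1)
    # December rolls over into January of the following year.
    end = date(year + 1, 1, 1) if mon == 12 else date(year, mon + 1, 1)
    return start, end


# Half-open interval: start is included, end is excluded.
QUERY = (
    "SELECT * FROM trips "
    "WHERE pickup_datetime >= :start AND pickup_datetime < :end"
)

start, end = month_bounds("2023-01")
params = {"start": start.isoformat(), "end": end.isoformat()}
print(params)  # {'start': '2023-01-01', 'end': '2023-02-01'}

# In the notebook, this could then be run as:
#   from sqlalchemy import text
#   df = pd.read_sql_query(text(QUERY), con=engine, params=params)
```

Passing the dates as parameters, rather than formatting them into the SQL string, keeps the query text constant across months.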

In [51]:
df

| | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | fare_amount | tip_amount | trip_time_in_secs | fare_per_sec | day | time | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-02 11:01:25 | 2023-01-02 11:24:52 | 1 | 9.97 | 41.5 | 11.16 | 1407.0 | 0.029495 | Monday | 11:01:25 | -73.873364 | 40.775714 | -73.950062 | 40.824145 |
| 1 | 2023-01-02 11:52:19 | 2023-01-02 12:13:08 | 2 | 9.20 | 38.0 | 10.95 | 1249.0 | 0.030424 | Monday | 11:52:19 | -73.873364 | 40.775714 | | |
| 2 | 2023-01-02 11:13:24 | 2023-01-02 11:27:01 | 1 | 8.00 | 31.7 | 8.45 | 817.0 | 0.038800 | Monday | 11:13:24 | -73.873364 | 40.775714 | -73.961292 | 40.776654 |
| 3 | 2023-01-02 11:33:54 | 2023-01-02 11:52:23 | 1 | 8.85 | 35.2 | 10.40 | 1109.0 | 0.031740 | Monday | 11:33:54 | -73.873364 | 40.775714 | | |
| 4 | 2023-01-02 11:25:33 | 2023-01-02 11:40:02 | 1 | 8.60 | 33.1 | 9.75 | 869.0 | 0.038090 | Monday | 11:25:33 | -73.967272 | 40.750214 | -73.873364 | 40.775714 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 717984 | 2023-01-01 13:23:13 | 2023-01-01 13:47:19 | 1 | 10.03 | 40.8 | 11.27 | 1446.0 | 0.028216 | Sunday | 13:23:13 | | | -73.873364 | 40.775714 |
| 717985 | 2023-01-01 00:56:04 | 2023-01-01 01:09:09 | 2 | 2.90 | 15.6 | 5.40 | 785.0 | 0.019873 | Sunday | 00:56:04 | -73.984079 | 40.735519 | -73.948499 | 40.745532 |
| 717986 | 2023-01-01 00:10:31 | 2023-01-01 00:21:58 | 2 | 6.50 | 26.8 | 6.32 | 687.0 | 0.039010 | Sunday | 00:10:31 | -73.873364 | 40.775714 | | |
| 717987 | 2023-01-01 00:14:36 | 2023-01-01 00:27:40 | 2 | 9.00 | 34.5 | 10.45 | 784.0 | 0.044005 | Sunday | 00:14:36 | -73.873364 | 40.775714 | -73.959017 | 40.766437 |
