# Test PG-Connection

In this notebook you can read about how
- to connect to a Postgres DB from Python and
- to interact with the DB via SQL queries from Python

## Setup

Install the necessary libraries with:

```bash
pip install sqlalchemy psycopg2
```

In [1]:
# import needed libraries

import pandas as pd                     # for working with data frames

from sqlalchemy import create_engine    # for connecting to databases

from pandas.io import sql               # for executing sql on a db

In [2]:
# create an engine for the DB (no connection is opened until it is first needed)
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
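
The connection string above hard-codes the user and password. As a minimal sketch, the same URL could instead be assembled from environment variables. The `PG_*` variable names and the `build_pg_url` helper are hypothetical and not part of this notebook's setup; the defaults mirror the values used above.

```python
import os
from urllib.parse import quote_plus


def build_pg_url(env=os.environ):
    """Assemble a postgresql:// URL from (hypothetical) PG_* variables."""
    # quote_plus escapes characters such as '@' or '/' in credentials
    user = quote_plus(env.get("PG_USER", "root"))
    password = quote_plus(env.get("PG_PASSWORD", "root"))
    host = env.get("PG_HOST", "localhost")
    port = env.get("PG_PORT", "5432")
    database = env.get("PG_DATABASE", "ny_taxi")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# with no variables set, this reproduces the URL used above
print(build_pg_url({}))
```

The result can be passed to `create_engine` in place of the literal string, e.g. `create_engine(build_pg_url())`.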

In [3]:
# open the connection
engine.connect()

<sqlalchemy.engine.base.Connection at 0x1373c0650>

## Step 1: send an SQL query to the DB via python

To execute SQL via Python, we use the `pandas.read_sql` function:

In [4]:
query = """
SELECT 1 as number;
"""

pd.read_sql(query, con=engine)

| | number |
|---|---|
| 0 | 1 |
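
When a query depends on a value from Python, `pandas.read_sql` accepts a `params` argument so the value does not have to be pasted into the SQL string. The sketch below uses an in-memory SQLite database (an assumption, chosen so that it runs without the Postgres server); the placeholder syntax depends on the driver: SQLite uses `?`, while psycopg2 uses `%s` or `%(name)s`.

```python
import sqlite3

import pandas as pd

# in-memory SQLite stands in for the Postgres DB in this sketch
conn = sqlite3.connect(":memory:")

# the value 1 is passed separately from the SQL text
result = pd.read_sql("SELECT ? AS number", con=conn, params=(1,))
print(result)

conn.close()
```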


Now we write a more useful query that lists all the tables we have created:

```sql
SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND 
    schemaname != 'information_schema';
```

Source: https://www.postgresqltutorial.com/postgresql-show-tables/

In [5]:
query = """
SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND 
    schemaname != 'information_schema';
"""

pd.read_sql(query, con=engine)

| | schemaname | tablename | tableowner | tablespace | hasindexes | hasrules | hastriggers | rowsecurity |
|---|---|---|---|---|---|---|---|---|
| 0 | public | yellow_taxi_data | root | | True | False | False | False |
| 1 | public | zones | root | | True | False | False | False |
| 2 | public | yellow_tripdata_test | root | | False | False | False | False |


## Step 2: writing data into a DB

To write data into the connected DB, we use the `DataFrame.to_sql` method:

In [6]:
# 2.1 import the data
df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=10)

In [7]:
# 2.2 data cleaning
df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
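
As an alternative to converting the columns after loading, `pd.read_csv` can parse them directly via its `parse_dates` argument. A small sketch, using an inline two-row sample (an assumption, so that it runs without the CSV file) instead of `yellow_tripdata_2021-01.csv`:

```python
import io

import pandas as pd

# two rows in the same layout as the taxi data, kept inline for the sketch
csv_text = """VendorID,tpep_pickup_datetime,tpep_dropoff_datetime
1,2021-01-01 00:30:10,2021-01-01 00:36:12
1,2021-01-01 00:51:20,2021-01-01 00:52:19
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)
print(sample.dtypes)
```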

In [8]:
# 2.3 data loading into the db
df.to_sql(name='yellow_tripdata_test', con=engine, index=False)

ValueError: Table 'yellow_tripdata_test' already exists.
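
The error above comes from the default `if_exists='fail'` of `to_sql`: the table was already present from an earlier run. The other two options are `'replace'` (drop and recreate the table) and `'append'` (add rows to the existing table). The sketch below shows all three against an in-memory SQLite database with a two-row toy frame; both are assumptions made so that it runs without the Postgres server.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({"VendorID": [1, 2], "trip_distance": [2.1, 4.94]})

# first call creates the table
toy.to_sql(name="yellow_tripdata_test", con=conn, index=False)

# second call with the default if_exists='fail' raises ValueError
try:
    toy.to_sql(name="yellow_tripdata_test", con=conn, index=False)
except ValueError as err:
    print(err)

# 'replace' drops and recreates the table, 'append' adds rows to it
toy.to_sql(name="yellow_tripdata_test", con=conn, index=False, if_exists="replace")
toy.to_sql(name="yellow_tripdata_test", con=conn, index=False, if_exists="append")

count_query = "SELECT COUNT(*) AS n FROM yellow_tripdata_test"
row_count = int(pd.read_sql(count_query, con=conn)["n"].iloc[0])
print(row_count)

conn.close()
```

After the replace and the append, the table holds the two rows twice.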

In [None]:
# 2.4 check if it worked
query = """
SELECT * FROM yellow_tripdata_test LIMIT 10
"""

pd.read_sql(query, con=engine)

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2021-01-01 00:30:10 | 2021-01-01 00:36:12 | 1 | 2.1 | 1 | N | 142 | 43 | 2 | 8.0 | 3.0 | 0.5 | 0.0 | 0 | 0.3 | 11.8 | 2.5 |
| 1 | 1 | 2021-01-01 00:51:20 | 2021-01-01 00:52:19 | 1 | 0.2 | 1 | N | 238 | 151 | 2 | 3.0 | 0.5 | 0.5 | 0.0 | 0 | 0.3 | 4.3 | 0.0 |
| 2 | 1 | 2021-01-01 00:43:30 | 2021-01-01 01:11:06 | 1 | 14.7 | 1 | N | 132 | 165 | 1 | 42.0 | 0.5 | 0.5 | 8.65 | 0 | 0.3 | 51.95 | 0.0 |
| 3 | 1 | 2021-01-01 00:15:48 | 2021-01-01 00:31:01 | 0 | 10.6 | 1 | N | 138 | 132 | 1 | 29.0 | 0.5 | 0.5 | 6.05 | 0 | 0.3 | 36.35 | 0.0 |
| 4 | 2 | 2021-01-01 00:31:49 | 2021-01-01 00:48:21 | 1 | 4.94 | 1 | N | 68 | 33 | 1 | 16.5 | 0.5 | 0.5 | 4.06 | 0 | 0.3 | 24.36 | 2.5 |
| 5 | 1 | 2021-01-01 00:16:29 | 2021-01-01 00:24:30 | 1 | 1.6 | 1 | N | 224 | 68 | 1 | 8.0 | 3.0 | 0.5 | 2.35 | 0 | 0.3 | 14.15 | 2.5 |
| 6 | 1 | 2021-01-01 00:00:28 | 2021-01-01 00:17:28 | 1 | 4.1 | 1 | N | 95 | 157 | 2 | 16.0 | 0.5 | 0.5 | 0.0 | 0 | 0.3 | 17.3 | 0.0 |
| 7 | 1 | 2021-01-01 00:12:29 | 2021-01-01 00:30:34 | 1 | 5.7 | 1 | N | 90 | 40 | 2 | 18.0 | 3.0 | 0.5 | 0.0 | 0 | 0.3 | 21.8 | 2.5 |
| 8 | 1 | 2021-01-01 00:39:16 | 2021-01-01 01:00:13 | 1 | 9.1 | 1 | N | 97 | 129 | 4 | 27.5 | 0.5 | 0.5 | 0.0 | 0 | 0.3 | 28.8 | 0.0 |
| 9 | 1 | 2021-01-01 00:26:12 | 2021-01-01 00:39:46 | 2 | 2.7 | 1 | N | 263 | 142 | 1 | 12.0 | 3.0 | 0.5 | 3.15 | 0 | 0.3 | 18.95 | 2.5 |


## Step 3: remove test data from DB

Since this is for testing purposes only, we delete our test table again.

In [None]:
query = """
DROP TABLE IF EXISTS yellow_tripdata_test
"""
sql.execute(query, engine)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x120b4d710>

## Excursus

From [Stack Overflow](https://stackoverflow.com/questions/38840208/cannot-drop-table-in-pandas-to-sql-using-sqlalchemy): an example of the best practice for handling DB connections, which is to open them in a `with` block:
> Because it ensures that your connection is always closed, even if your program exits with an error. This is important to prevent data corruption.

In [9]:

with engine.connect() as conn, conn.begin():
    query = """select * from some_table limit 1"""
    df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=10)
    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
    print(df.head())
    df.to_sql(name='yellow_tripdata_test', con=conn, index=False, if_exists='replace')

   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         1  2021-01-01 00:30:10   2021-01-01 00:36:12                1   
1         1  2021-01-01 00:51:20   2021-01-01 00:52:19                1   
2         1  2021-01-01 00:43:30   2021-01-01 01:11:06                1   
3         1  2021-01-01 00:15:48   2021-01-01 00:31:01                0   
4         2  2021-01-01 00:31:49   2021-01-01 00:48:21                1   

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0           2.10           1                  N           142            43   
1           0.20           1                  N           238           151   
2          14.70           1                  N           132           165   
3          10.60           1                  N           138           132   
4           4.94           1                  N            68            33   

   payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  \


In [10]:
query = """
    DROP TABLE IF EXISTS yellow_tripdata_test
    """
sql.execute(query, engine)
print("removed test data again")

removed test data again
