# From PostgreSQL to Pandas

*By Naysan Saran, May 2020.*

## 1 - Introduction

In this tutorial we go through the steps required to load the result of a SQL query into a pandas dataframe using Psycopg2. We assume that the SQL table **MonthlyTemp** is already populated; its first rows match the CSV preview shown in section 2.

In [1]:
import psycopg2
import pandas as pd
import sys

## 2 - From csv file to pandas dataframe

In [2]:
import pandas as pd
import sys

csv_file = "./global-temp-monthly.csv"
df = pd.read_csv(csv_file)
print("Total number of rows = %s" % len(df.index))
df.head(3)

Total number of rows = 3288


| | Source | Date | Mean |
|---|---|---|---|
| 0 | GCAG | 2016-12-06 | 0.7895 |
| 1 | GISTEMP | 2016-12-06 | 0.81 |
| 2 | GCAG | 2016-11-06 | 0.7504 |


In [3]:
df = df.rename(columns={
    "Source": "source", 
    "Date": "datetime",
    "Mean": "mean_temp"
})
df.head(3)

| | source | datetime | mean_temp |
|---|---|---|---|
| 0 | GCAG | 2016-12-06 | 0.7895 |
| 1 | GISTEMP | 2016-12-06 | 0.81 |
| 2 | GCAG | 2016-11-06 | 0.7504 |


First, let's specify the connection parameters as a Python dictionary.

In [4]:
param_dic = {
    "host"      : "localhost",
    "database"  : "globaldata",
    "user"      : "myuser",
    "password"  : "Passw0rd",
    "port"      : "5400"
}
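Hard-coding a password in a notebook is convenient for a tutorial but easy to leak. As a minimal sketch, the same dictionary could be built from environment variables. The helper name `params_from_env` is hypothetical, the `PG*` variable names follow the libpq convention, and the defaults simply mirror the values above.

```python
import os


def params_from_env(environ=os.environ):
    """Build psycopg2-style connection parameters from environment variables.

    Hypothetical helper: the defaults mirror the hard-coded dictionary
    and are only illustrative.
    """
    return {
        "host": environ.get("PGHOST", "localhost"),
        "database": environ.get("PGDATABASE", "globaldata"),
        "user": environ.get("PGUSER", "myuser"),
        # No default for the password: a missing variable raises KeyError.
        "password": environ["PGPASSWORD"],
        "port": environ.get("PGPORT", "5400"),
    }


# Example with an explicit mapping instead of the real environment
example_params = params_from_env({"PGPASSWORD": "Passw0rd", "PGPORT": "5432"})
print(example_params["port"])
```

The resulting dictionary can be passed to the connection function in the same way as `param_dic`.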

This function opens a connection to the database and exits the program if the connection fails.

In [5]:
def connect(params_dic):
    """ Connect to the PostgreSQL database server """
    conn = None
    try:
        # connect to the PostgreSQL server
        print('Connecting to the PostgreSQL database...')
        conn = psycopg2.connect(**params_dic)
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        sys.exit(1) 
    print("Connection successful")
    return conn


Connect to the database

In [6]:
conn = connect(param_dic)

Connecting to the PostgreSQL database...
Connection successful


In [7]:
def postgresql_to_dataframe(conn, select_query, column_names):
    """
    Transform a SELECT query into a pandas dataframe.
    Returns None if the query fails.
    """
    cursor = conn.cursor()
    try:
        cursor.execute(select_query)
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        cursor.close()
        # Roll back so the connection can be reused after a failed query
        conn.rollback()
        return None
    
    # fetchall() returns a list of tuples, one per row
    tuples = cursor.fetchall()
    cursor.close()
    
    # Turn the list of tuples into a pandas dataframe
    df = pd.DataFrame(tuples, columns=column_names)
    return df
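The `column_names` argument has to be kept in sync with the query by hand. As a sketch of an alternative, DB-API cursors expose the result column names through `cursor.description`, so they can be read from the cursor instead. To keep the example runnable without a PostgreSQL server, it uses Python's built-in `sqlite3` module as a stand-in driver; psycopg2 cursors provide the same attribute. The helper `fetch_with_column_names` and the two demo rows (copied from the preview above) are illustrative assumptions.

```python
import sqlite3


def fetch_with_column_names(conn, select_query):
    """Run a SELECT and return (column_names, rows) using only DB-API calls."""
    cursor = conn.cursor()
    try:
        cursor.execute(select_query)
        # cursor.description holds one entry per result column;
        # the first item of each entry is the column name.
        column_names = [col[0] for col in cursor.description]
        rows = cursor.fetchall()
    finally:
        cursor.close()
    return column_names, rows


# Stand-in database: sqlite3 in memory, NOT PostgreSQL
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE MonthlyTemp (source TEXT, datetime TEXT, mean_temp REAL)"
)
demo_conn.executemany(
    "INSERT INTO MonthlyTemp VALUES (?, ?, ?)",
    [("GCAG", "2016-12-06", 0.7895), ("GISTEMP", "2016-12-06", 0.81)],
)
names, rows = fetch_with_column_names(
    demo_conn, "SELECT datetime, mean_temp FROM MonthlyTemp ORDER BY mean_temp"
)
demo_conn.close()
print(names)
```

With pandas available, `pd.DataFrame(rows, columns=names)` would then build the dataframe without a separately maintained list of names.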

In [8]:
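# Iterator-based PostgreSQL to dataframe conversion using pandas.read_sql_query.
def efficient_postgresql_to_dataframe(conn, select_query, column_names, chunk_size=100):
    # column_names is kept for signature compatibility but is not used:
    # read_sql_query takes the column names from the query result.
    
    # With chunksize set, read_sql_query returns an iterator of dataframes.
    df_chunks = pd.read_sql_query(select_query, conn, chunksize=chunk_size)
    chunks = []
    for temp in df_chunks:
        chunks.append(temp)
    # Concatenate the chunks and rebuild a continuous 0..n-1 index
    return pd.concat(chunks, ignore_index=True)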
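To illustrate what reading a result in chunks means at the cursor level, here is a minimal sketch built on the DB-API method `fetchmany`. It again uses `sqlite3` as a stand-in so that it runs without a server, and the generator name `iter_row_batches` and the `numbers` table are hypothetical. Note that with psycopg2 a default cursor has already transferred the full result to the client once `execute` returns, so real streaming would additionally need a named (server-side) cursor.

```python
import sqlite3


def iter_row_batches(conn, select_query, batch_size=100):
    """Yield lists of at most batch_size rows until the result is exhausted."""
    cursor = conn.cursor()
    try:
        cursor.execute(select_query)
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            yield batch
    finally:
        cursor.close()


# Stand-in database: sqlite3 in memory, NOT PostgreSQL
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE numbers (n INTEGER)")
demo_conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(5)])

batch_sizes = [
    len(batch)
    for batch in iter_row_batches(
        demo_conn, "SELECT n FROM numbers ORDER BY n", batch_size=2
    )
]
demo_conn.close()
print(batch_sizes)  # [2, 2, 1]
```

Each batch could be turned into a small dataframe and processed on its own, which is where chunking pays off; concatenating all chunks at the end, as in the function above, still holds the full result in memory.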

### Example 1: keeping the original column names

In [9]:
column_names = ["id", "source", "datetime", "mean_temp"]
# Execute the "SELECT *" query
df = postgresql_to_dataframe(conn, "select * from MonthlyTemp", column_names)
df.head()

| | id | source | datetime | mean_temp |
|---|---|---|---|---|
| 0 | 733078 | GCAG | 2016-12-06 | 0.7895 |
| 1 | 733079 | GISTEMP | 2016-12-06 | 0.81 |
| 2 | 733080 | GCAG | 2016-11-06 | 0.7504 |
| 3 | 733081 | GISTEMP | 2016-11-06 | 0.93 |
| 4 | 733082 | GCAG | 2016-10-06 | 0.7292 |


### Example 2: changing the original column names

In [10]:
column_names = ["timestamp", "temperature"]
df = postgresql_to_dataframe(conn, "select datetime, mean_temp from MonthlyTemp", column_names)
df.head()

| | timestamp | temperature |
|---|---|---|
| 0 | 2016-12-06 | 0.7895 |
| 1 | 2016-12-06 | 0.81 |
| 2 | 2016-11-06 | 0.7504 |
| 3 | 2016-11-06 | 0.93 |
| 4 | 2016-10-06 | 0.7292 |


### Example 3: changing the original column names (iterator version)

In [11]:
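column_names = ["timestamp", "temperature"]
# Note: the iterator version does not apply column_names,
# so the dataframe keeps the column names returned by the query.
df = efficient_postgresql_to_dataframe(conn, "select datetime, mean_temp from MonthlyTemp", column_names)
df.head()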

| | datetime | mean_temp |
|---|---|---|
| 0 | 2016-12-06 | 0.7895 |
| 1 | 2016-12-06 | 0.81 |
| 2 | 2016-11-06 | 0.7504 |
| 3 | 2016-11-06 | 0.93 |
| 4 | 2016-10-06 | 0.7292 |


In [12]:
# Close the connection
conn.close()
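An explicit `conn.close()` at the end of the notebook is skipped if an earlier cell raises. One possible way to guarantee the close is `contextlib.closing`, sketched here with `sqlite3` as a stand-in so that it runs without a server; with psycopg2 the analogous form would be `closing(psycopg2.connect(**param_dic))`. Keep in mind that in psycopg2 2.x a plain `with conn:` block ends the transaction but does not close the connection.

```python
import sqlite3
from contextlib import closing

# Stand-in connection: sqlite3 in memory, NOT PostgreSQL
with closing(sqlite3.connect(":memory:")) as demo_conn:
    value = demo_conn.execute("SELECT 1").fetchone()[0]

# After the with block the connection is closed, even if the body had raised.
try:
    demo_conn.execute("SELECT 1")
    closed = False
except sqlite3.ProgrammingError:
    closed = True

print(value, closed)
```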