# SQLAlchemy - pandas

In the previous tutorial, we saw how to use `SQLAlchemy` to connect to a database. However, a select query returns a list of Python row objects. To use them in a data science
project, you have to convert them into pandas or Spark dataframes.

> For Spark, the Spark session can read directly from a database without using SQLAlchemy.

In this tutorial, we will learn how to combine SQLAlchemy and pandas to read data from a database and return a pandas dataframe.

In [11]:
import sqlalchemy as db
import pandas as pd
from sqlalchemy import Table, Column, Integer, SmallInteger, DateTime, String, MetaData,ForeignKey, Text, text

In [2]:
test_db_path="../../../data/test.db"
# the SQLite database is a local file, so the URL needs a third `/` before the relative path
# echo=True makes the engine log every SQL statement it runs to the console
engine = db.create_engine(f'sqlite:///{test_db_path}', echo=True)
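
The number of slashes in a SQLite URL is easy to get wrong, so here is a small sketch of the three common forms. It uses SQLAlchemy's `make_url` helper to parse the URLs without opening any file. The paths are hypothetical and only illustrate the syntax; they are not the paths used in this tutorial.

```python
from sqlalchemy.engine import make_url

# Hypothetical paths, chosen only to illustrate the URL forms.
# Three slashes: the path is relative to the current working directory.
relative_url = make_url("sqlite:///data/test.db")
# Four slashes: the path is absolute (here /tmp/test.db).
absolute_url = make_url("sqlite:////tmp/test.db")
# No path at all: an in-memory database.
memory_url = make_url("sqlite://")

for url in (relative_url, absolute_url, memory_url):
    print(url.drivername, repr(url.database))
```

The `database` attribute of each parsed URL shows which file path, if any, the engine would open.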

In [3]:
metadata=MetaData()

cohort=Table('cohort',
             metadata,
             Column('id',Integer,primary_key=True),
             Column('cname',String)
             )

dataset=Table('dataset',
              metadata,
              Column('id',Integer,primary_key=True),
              Column('cohort_id',Integer,ForeignKey("cohort.id")),
              Column('year',Integer),
              Column('name',String),
              Column('location',String),
              Column('status',SmallInteger)
              )
descriptor=Table('descriptor',
                 metadata,
                 Column('id',Integer,primary_key=True),
                 Column('name',String),
                 Column('location',String),
                 Column('dataset_id',Integer,ForeignKey("dataset.id"))
                 )

validation_rule=Table('validation_rule',
                      metadata,
                      Column('id',Integer,primary_key=True),
                      Column('name',String),
                      Column('description',Text),
                      Column('args',String),
                      Column('kwargs',String)
                      )

validation_task=Table('validation_task',
                      metadata,
                      Column('id',Integer,primary_key=True),
                      Column('starting_date',DateTime),
                      Column('ending_date',DateTime),
                      Column('dataset_id',Integer,ForeignKey("dataset.id")),
                      Column('validation_rule_id',Integer,ForeignKey("validation_rule.id")),
                      Column('task_status',SmallInteger),
                      Column('output',Text)
                      )

## Solution 1: SQLAlchemy rows to pandas df

pandas can build a dataframe from a list of SQLAlchemy rows. The example below:
1. Creates a mapping object of the target table
2. Builds a select query and executes it
3. Uses the pandas constructor to build a dataframe from the list of `LegacyRow` objects


In [4]:
# open a connection using the engine created above
connection = engine.connect()

In [7]:
# create a mapping object of the cohort table
cohort = db.Table('cohort', metadata, autoload=True, autoload_with=engine)
results=connection.execute(db.select([cohort])).fetchall()
print(type(results))

2023-02-13 11:13:14,625 INFO sqlalchemy.engine.Engine SELECT cohort.id, cohort.cname 
FROM cohort
2023-02-13 11:13:14,628 INFO sqlalchemy.engine.Engine [cached since 220.3s ago] ()
<class 'list'>


In [8]:
for item in results:
    print(type(item))
    print(f"{item.keys()}")

<class 'sqlalchemy.engine.row.LegacyRow'>
RMKeyView(['id', 'cname'])
<class 'sqlalchemy.engine.row.LegacyRow'>
RMKeyView(['id', 'cname'])
<class 'sqlalchemy.engine.row.LegacyRow'>
RMKeyView(['id', 'cname'])


In [9]:
df=pd.DataFrame(results)
df.head()

|   | id | cname |
|---|----|-------|
| 0 | 1  | casd  |
| 1 | 2  | toto  |
| 2 | 3  | titi  |

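The cells above use the legacy 1.x calling style (`autoload=True` and `select([cohort])`), which, to my knowledge, SQLAlchemy 2.0 no longer accepts. A minimal sketch of the same three steps in the newer style follows. It assumes a throwaway in-memory SQLite database filled with rows that mirror the `cohort` table, not the tutorial's `test.db`.

```python
import pandas as pd
from sqlalchemy import MetaData, Table, create_engine, select, text

# Assumption: an in-memory database stands in for test.db.
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE cohort (id INTEGER PRIMARY KEY, cname TEXT)"))
    conn.execute(
        text("INSERT INTO cohort (id, cname) VALUES (1, 'casd'), (2, 'toto'), (3, 'titi')")
    )

# Step 1: reflect the table definition from the database.
cohort = Table("cohort", MetaData(), autoload_with=engine)

# Step 2: build and execute the select query.
with engine.connect() as conn:
    result = conn.execute(select(cohort).order_by(cohort.c.id))
    columns = list(result.keys())  # column names, read before the rows are consumed
    rows = result.fetchall()

# Step 3: build the dataframe, passing the column names explicitly.
df = pd.DataFrame(rows, columns=columns)
print(df)
```

Passing `columns` explicitly avoids relying on pandas inferring the names from the row objects.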

## Solution 2: Use a raw SQL query

This solution is similar to Solution 1. The only difference is that we do not create a mapping object; instead, we wrap a raw SQL string with `text()` to turn it into an executable query.

In [15]:

sql = '''
    SELECT * FROM cohort;
'''

query = connection.execute(text(sql))
df1 = pd.DataFrame(query.fetchall())

2023-02-13 11:30:45,065 INFO sqlalchemy.engine.Engine 
    SELECT * FROM cohort;

2023-02-13 11:30:45,084 INFO sqlalchemy.engine.Engine [cached since 409.4s ago] ()


In [16]:
df1.head()

|   | id | cname |
|---|----|-------|
| 0 | 1  | casd  |
| 1 | 2  | toto  |
| 2 | 3  | titi  |

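When a raw query needs a filter value, `text()` accepts named bound parameters, which keeps the value out of the SQL string. The sketch below illustrates this under a few assumptions: the in-memory database, the `:min_id` parameter name, and the filter itself are made up for the example and do not come from the tutorial's data.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Assumption: an in-memory database stands in for test.db.
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE cohort (id INTEGER PRIMARY KEY, cname TEXT)"))
    conn.execute(
        text("INSERT INTO cohort (id, cname) VALUES (1, 'casd'), (2, 'toto'), (3, 'titi')")
    )

# :min_id is a bound parameter; its value is supplied separately at execution time.
stmt = text("SELECT id, cname FROM cohort WHERE id >= :min_id ORDER BY id")

with engine.connect() as conn:
    result = conn.execute(stmt, {"min_id": 2})
    columns = list(result.keys())
    filtered_df = pd.DataFrame(result.fetchall(), columns=columns)

print(filtered_df)
```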

## Solution 3: Use the pandas read_sql function

In the two solutions above, we called `fetchall()` to get the data first and then converted it into a pandas dataframe. Alternatively, `pandas.read_sql` does both in one step. Since pandas integrates with SQLAlchemy, we only need to provide a query and a SQLAlchemy connection.

See the example below.

In [17]:
df2 = pd.read_sql("select * from cohort;", con=connection)
df2.head()

2023-02-13 11:32:06,409 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("select * from cohort;")
2023-02-13 11:32:06,415 INFO sqlalchemy.engine.Engine [raw sql] ()
2023-02-13 11:32:06,417 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("select * from cohort;")
2023-02-13 11:32:06,418 INFO sqlalchemy.engine.Engine [raw sql] ()
2023-02-13 11:32:06,419 INFO sqlalchemy.engine.Engine select * from cohort;
2023-02-13 11:32:06,420 INFO sqlalchemy.engine.Engine [raw sql] ()


|   | id | cname |
|---|----|-------|
| 0 | 1  | casd  |
| 1 | 2  | toto  |
| 2 | 3  | titi  |


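pandas also provides two related functions, `read_sql_table` and `read_sql_query`. `read_sql` is a convenience wrapper around them: it routes a table name to `read_sql_table` and a SQL query to `read_sql_query`, which is why the log above shows pandas first checking whether the string is a table name.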
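
A short sketch of calling `read_sql_table` and `read_sql_query` directly is shown below. It assumes an in-memory SQLite database with rows mirroring the `cohort` table; the `:pattern` parameter and the `LIKE` filter are invented for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Assumption: an in-memory database stands in for test.db.
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE cohort (id INTEGER PRIMARY KEY, cname TEXT)"))
    conn.execute(
        text("INSERT INTO cohort (id, cname) VALUES (1, 'casd'), (2, 'toto'), (3, 'titi')")
    )

with engine.connect() as conn:
    # read_sql_table loads a table by name; `columns` optionally restricts the selection.
    table_df = pd.read_sql_table("cohort", con=conn, columns=["cname"])

    # read_sql_query runs a SELECT statement; `params` supplies the bound-parameter values.
    query_df = pd.read_sql_query(
        text("SELECT id, cname FROM cohort WHERE cname LIKE :pattern ORDER BY id"),
        con=conn,
        params={"pattern": "t%"},
    )

print(table_df)
print(query_df)
```

Calling the specific function makes the intent explicit and skips the table-name check that `read_sql` performs on a plain string.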

## Solution 4: Use the pandas.DataFrame constructor

If you need to process the rows before building the dataframe, you can build a list of transformed records and pass it to the `pandas.DataFrame` constructor.

In [18]:
# get the cohort list
cohort = db.Table('cohort', metadata, autoload=True, autoload_with=engine)
results=connection.execute(db.select([cohort])).fetchall()

2023-02-13 11:46:40,568 INFO sqlalchemy.engine.Engine SELECT cohort.id, cohort.cname 
FROM cohort
2023-02-13 11:46:40,569 INFO sqlalchemy.engine.Engine [cached since 2226s ago] ()


In [19]:
cohort_data_list=[]
for item in results:
    cohort_data=[item["id"]+1,"cohortName:"+item["cname"]]
    cohort_data_list.append(cohort_data)

df3=pd.DataFrame(cohort_data_list,columns=["id","cname"])

In [20]:
df3.head()

|   | id | cname            |
|---|----|------------------|
| 0 | 2  | cohortName:casd  |
| 1 | 3  | cohortName:toto  |
| 2 | 4  | cohortName:titi  |
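
For simple transformations like this one, an alternative is to load the rows unchanged and transform the columns in pandas. The sketch below assumes the raw dataframe has already been loaded (for example with one of the previous solutions) and recreates it by hand with the same three rows. It illustrates the design choice and is not a replacement for the loop when the per-row logic is more involved.

```python
import pandas as pd

# Assumption: the same three rows that the cohort table returns in this tutorial.
raw_df = pd.DataFrame({"id": [1, 2, 3], "cname": ["casd", "toto", "titi"]})

# Column-wise equivalents of the per-row loop: shift the ids by one and prefix the names.
df_vectorized = raw_df.assign(
    id=raw_df["id"] + 1,
    cname="cohortName:" + raw_df["cname"],
)

print(df_vectorized)
```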
