In [None]:
import pandas as pd
import sqlite3

# Pandas + SQL(Alchemy)

Pandas is a powerful tool for working with dataframes, and it can also talk to databases.
We can load tables from an existing database into dataframes, or create new tables from dataframes, without specifying a schema by hand.
For most database systems, Pandas relies on SQLAlchemy under the hood. For SQLite it also accepts a plain `sqlite3` connection, which is what we use in this notebook.

Reading from a DB with Pandas always gives us a dataframe. We do *not* get Python objects as we did with the ORM when using SQLAlchemy.

Be aware that dataframes do *not* know about any relations you might have established with SQLAlchemy!

**Be aware**: Working with SQL + Pandas is usually a convenient shortcut for simple use cases and "quick-and-dirty" approaches, e.g. when you need a simple lookup of some data. It is also fine if your amount of data fits comfortably into a dataframe and you plan to load the data once from a DB and do everything else in Pandas anyway.
For more complex tasks that involve joining, aggregating, grouping, or selecting from a large data volume, you might want to rely on the features of your (relational) DB itself and do these tasks using SQLAlchemy.
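
To make the trade-off more concrete, here is a minimal sketch that computes the same aggregate twice: once inside the database, so that only the result is transferred, and once in Pandas after loading the whole table. It uses plain SQL through `pd.read_sql` rather than SQLAlchemy, and the in-memory database, the name `demo_conn`, and the sample rows are assumptions made up for illustration. They are not the contents of the notebook's DB.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory demo data, not the notebook's actual database
demo_conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {
        "first_name": ["Ann", "Bob", "Cem"],
        "shoe_size": [38, 44, 42],
        "team_id": [1, 1, 2],
    }
).to_sql("runners", demo_conn, index=False)

# Variant 1: the database aggregates, only the small result is loaded
df_sql = pd.read_sql(
    """
    SELECT team_id, AVG(shoe_size) AS avg_shoe_size
    FROM runners
    GROUP BY team_id
    ORDER BY team_id
    """,
    demo_conn,
)

# Variant 2: load every row first, then aggregate in Pandas
df_all = pd.read_sql("SELECT * FROM runners", demo_conn)
df_pandas = df_all.groupby("team_id", as_index=False)["shoe_size"].mean()

print(df_sql)
print(df_pandas)
demo_conn.close()
```

Both variants give the same numbers. The difference is where the work happens and how much data has to leave the database.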

# Read from a DB with Pandas

## Open Connection

First, we have to establish a connection to the DB we have already filled.
In this example, we use the SQLite DB-API module `sqlite3` for this task.
For other systems, e.g. PostgreSQL, Pandas expects an SQLAlchemy engine or connection instead, because `sqlite3` is the only DB-API connection it supports directly.

In [None]:
connection = sqlite3.connect("data/firmenlauf_demo.db")

## Run SQL queries with Pandas

In [None]:
# Load the whole table "teams" into a dataframe
df_teams = pd.read_sql("SELECT * FROM teams", connection)

In [None]:
df_teams
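
If a query depends on a value, e.g. a color to filter on, it is usually safer to pass that value through the `params` argument of `pd.read_sql` than to paste it into the query string. The following sketch assumes an in-memory database with a small, made-up `teams` table; `demo_conn` and the sample rows are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory demo data, not the notebook's actual database
demo_conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"id": [1, 2, 3], "shoe_color": ["red", "blue", "red"]}
).to_sql("teams", demo_conn, index=False)

# "?" is the placeholder style of sqlite3; the value is passed separately
df_red = pd.read_sql(
    "SELECT * FROM teams WHERE shoe_color = ? ORDER BY id",
    demo_conn,
    params=("red",),
)

print(df_red)
demo_conn.close()
```

Note that the placeholder syntax depends on the database driver, so `?` is specific to `sqlite3` here.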

In [None]:
# For better readability, we define the query string separately
# Note that we have to JOIN the tables explicitly in SQL if we want to combine data from two tables
sql_query_runner_shoe_color = """
    SELECT runners.first_name, runners.shoe_size, teams.shoe_color 
    FROM runners
    JOIN teams
    ON runners.team_id = teams.id
"""

df_shoes = pd.read_sql(sql_query_runner_shoe_color, connection)

In [None]:
df_shoes
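
Since dataframes do not know about relations, the same combination can also be done on the Pandas side by loading both tables and merging them explicitly. This is a sketch under assumptions: the in-memory database, `demo_conn`, and the sample rows are made up, and only the column names follow the query above.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory demo data, not the notebook's actual database
demo_conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"id": [1, 2], "shoe_color": ["red", "blue"]}
).to_sql("teams", demo_conn, index=False)
pd.DataFrame(
    {
        "first_name": ["Ann", "Bob", "Cem"],
        "shoe_size": [38, 44, 42],
        "team_id": [1, 1, 2],
    }
).to_sql("runners", demo_conn, index=False)

# Load both tables completely ...
df_runners_demo = pd.read_sql("SELECT * FROM runners", demo_conn)
df_teams_demo = pd.read_sql("SELECT * FROM teams", demo_conn)

# ... and state the join condition ourselves, as in the SQL JOIN ... ON clause
df_merged = df_runners_demo.merge(
    df_teams_demo, left_on="team_id", right_on="id", how="inner"
)[["first_name", "shoe_size", "shoe_color"]]

print(df_merged)
demo_conn.close()
```

This is convenient for small tables, but it transfers both tables in full, so for large data the SQL `JOIN` is usually the better choice.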

# Add a Table to a DB with Pandas

Let's say we want to add a new table containing the ranking from the Firmenlauf itself and the prize money each team gets.
We first create a dataframe and then add this dataframe as a new table to the DB.
Note that we cannot add any relationships as we did when using SQLAlchemy, since we do not use an ORM here.

In [None]:
df_ranking = pd.DataFrame({"rank": [1, 2, 3, 4], "team_id": [4, 3, 2, 1], "prize": [5000, 2000, 1000, 500]})

In [None]:
df_ranking

In [None]:
# Write the dataframe to the DB as a table and replace the table if it already exists (this might cause data loss in the real world!).
df_ranking.to_sql("rankings", connection, if_exists="replace")

In [None]:
# Read out the newly added table as dataframe again
pd.read_sql("SELECT * FROM rankings", connection)
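
By default, `to_sql` also writes the dataframe index as an extra column, and `if_exists="replace"` drops the existing table first. The following sketch shows two alternatives, `index=False` and `if_exists="append"`. It runs against a made-up in-memory database; `demo_conn` and the table name `rankings_with_index` are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database, not the notebook's actual database
demo_conn = sqlite3.connect(":memory:")
df_first = pd.DataFrame({"rank": [1, 2], "team_id": [4, 3], "prize": [5000, 2000]})

# Default: the dataframe index is stored as an additional column
df_first.to_sql("rankings_with_index", demo_conn)
cols_with_index = pd.read_sql(
    "SELECT * FROM rankings_with_index", demo_conn
).columns.tolist()

# index=False stores only the actual data columns
df_first.to_sql("rankings", demo_conn, index=False)

# if_exists="append" adds rows to the existing table instead of replacing it
df_more = pd.DataFrame({"rank": [3], "team_id": [2], "prize": [1000]})
df_more.to_sql("rankings", demo_conn, index=False, if_exists="append")

df_back = pd.read_sql("SELECT * FROM rankings", demo_conn)
print(cols_with_index)
print(df_back)
demo_conn.close()
```

The third option, `if_exists="fail"`, is the default and raises an error if the table already exists, which protects against accidental overwrites.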

# Show DB schema information

The `sqlite_master` table contains the schema information of the DB.

We want a more structured overview of the available tables and their columns. Therefore, we define the following function:

In [None]:
def table_info(c, conn):
    """
    Print the column names of every table in the DB.

    c : cursor object
    conn : database connection object
    """
    tables = c.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
    for (table_name,) in tables:  # each row is a single-item tuple
        # LIMIT 0 returns no rows, so we only get the column names
        table = pd.read_sql_query('SELECT * FROM "{}" LIMIT 0'.format(table_name), conn)
        print(table_name)
        for col in table.columns:
            print("\t" + col)
        print()

In [None]:
cur = connection.cursor()
table_info(cur, connection)
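
If we also want the column types, SQLite offers the `PRAGMA table_info` statement, which can be read into a dataframe as well. This is a small sketch on a made-up in-memory table. `demo_conn` and the table definition are assumptions, and the `PRAGMA` statement is specific to SQLite.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory table, not the notebook's actual database
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE teams (id INTEGER PRIMARY KEY, shoe_color TEXT)")

# One row per column: cid, name, type, notnull, dflt_value, pk
df_cols = pd.read_sql_query("PRAGMA table_info('teams')", demo_conn)

print(df_cols[["name", "type", "pk"]])
demo_conn.close()
```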

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © [Point 8 GmbH](https://point-8.de)_