<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/05b-rdbms.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# 05b - RDBMS

Database issues

* open connections
* injection attacks

Database constraints

* table schema
* primary keys

## Create a table (review)

Creating a database table involves several steps:

1. Create a database connection
2. Use the connection to create a "cursor" for executing commands
3. Use the cursor to execute command(s)
4. Commit any changes
5. Close the connection

Because the database lives in a file on disk, the data you've committed is persistent and available in subsequent sessions.

* Use [cursor.executescript()](https://docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.executescript) to run multiple SQL commands in one call


In [None]:
import sqlite3
import pandas as pd

script = '''CREATE TABLE IF NOT EXISTS stocks
              (date text, trans text, symbol text, qty real, price real);
            INSERT INTO stocks VALUES ('2016-06-10','BUY','APPL',100,24.71)'''

con = sqlite3.connect('example.db')
cur = con.cursor()
cur.executescript(script)
con.commit()
con.close()

In [None]:
con = sqlite3.connect('example.db')
cursor = con.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())

# Multiple in-memory connections with SQLite

* Every database created with ":memory:" is distinct from every other.
* However, you can create multiple connections to a single in-memory database by using a shared-cache URI filename.
* The SQLite documentation for [In-Memory Databases](https://www.sqlite.org/inmemorydb.html) shows how to do it

In [None]:
# These 3 lines open connections to 3 different in-memory databases.
con1 = sqlite3.connect(":memory:")
con2 = sqlite3.connect(":memory:")
con3 = sqlite3.connect(":memory:")
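
By contrast, the shared-cache approach described in the SQLite documentation can be sketched roughly as follows. This is a minimal illustration, not part of the original notebook: the database name `memdb1` and the table `t` are made up for the example, and it assumes your SQLite build has shared-cache support enabled.

```python
import sqlite3

# Hypothetical database name "memdb1": mode=memory keeps it in RAM, and
# cache=shared lets several connections in this process see the same data.
uri = "file:memdb1?mode=memory&cache=shared"

con_a = sqlite3.connect(uri, uri=True)
con_b = sqlite3.connect(uri, uri=True)

con_a.execute("CREATE TABLE t (x int)")
con_a.execute("INSERT INTO t VALUES (42)")
con_a.commit()  # commit so the other connection can read the row

# The second connection sees the table created through the first one
shared_rows = con_b.execute("SELECT x FROM t").fetchall()
print(shared_rows)

# A plain ":memory:" connection is a separate, empty database
con_c = sqlite3.connect(":memory:")
private_tables = con_c.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(private_tables)

for con in (con_a, con_b, con_c):
    con.close()
```

The shared database disappears once its last connection is closed.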

In [None]:
# These 3 lines open 3 connections to the same file; you can verify this with lsof (see below)
con1 = sqlite3.connect('example.db')
con2 = sqlite3.connect('example.db')
con3 = sqlite3.connect("example.db")

In [None]:
# Each connection needs its own name. These 3 lines rebind con1, so only the last connection stays referenced.
con1 = sqlite3.connect('example.db')
con1 = sqlite3.connect('example.db')
con1 = sqlite3.connect("example.db")

# Count the open connections

* You can use the Linux system command `lsof` to count open connections to a file in Colab.
* Colab runs Ubuntu Linux, probably in a [Docker container](https://cloud.google.com/containers).
* You can run the next few cells to see the effect of opening multiple connections.
* SQLite has no server process, so the limit on open connections is ultimately set by the operating system, which plays the role that an RDBMS server would otherwise play.
* DBAs (Database Administrators) manage these kinds of issues with an enterprise RDBMS.

In [None]:
# Colab runs Linux in a container
! cat /etc/os-release

In [None]:
# Count the open connections. Close all connections by restarting the kernel.
!lsof example.db
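
Restarting the kernel is the blunt way to release connections. As a gentler alternative, here is a small sketch that closes a connection explicitly with `contextlib.closing` from the standard library. It uses an in-memory database and a made-up table `t` so that it doesn't touch `example.db`; treat it as one possible pattern rather than the notebook's own method.

```python
import sqlite3
from contextlib import closing

# closing() calls con.close() when the block exits, even after an error.
# (A bare "with sqlite3.connect(...) as con:" only commits or rolls back;
# it does not close the connection.)
with closing(sqlite3.connect(":memory:")) as con:
    con.execute("CREATE TABLE t (x int)")
    con.execute("INSERT INTO t VALUES (1)")
    con.commit()
    n_rows = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# Using a closed connection raises sqlite3.ProgrammingError
try:
    con.execute("SELECT 1")
    is_closed = False
except sqlite3.ProgrammingError:
    is_closed = True

print(n_rows, is_closed)
```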

# Beware of SQL injection attacks!!

* [sqlite3 reference docs](https://docs.python.org/3/library/sqlite3.html) describe safe parameter substitution in SQL
* See this [xkcd webcomic](https://xkcd.com/327/) for an explanation.

In [None]:
# Never build SQL with string formatting like this -- it's open to SQL injection
with sqlite3.connect('example.db') as con:
    cur = con.cursor()
    symbol = 'APPL'
    cur.execute("SELECT * FROM stocks WHERE symbol = '{}'".format(symbol))
    print(cur.fetchone())
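
For comparison, here is a sketch of the parameter substitution that the sqlite3 reference docs recommend. It rebuilds the `stocks` table in an in-memory database so it is self-contained, and the "malicious" input string is a hypothetical example chosen for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE stocks "
            "(date text, trans text, symbol text, qty real, price real)")
cur.execute("INSERT INTO stocks VALUES ('2016-06-10','BUY','APPL',100,24.71)")
con.commit()

# Hypothetical malicious input: it matches no symbol in the table
symbol = "nothing' OR '1'='1"

# Unsafe: string formatting lets the input rewrite the WHERE clause
unsafe_rows = cur.execute(
    "SELECT * FROM stocks WHERE symbol = '{}'".format(symbol)).fetchall()

# Safe: the ? placeholder passes the input as a value, never as SQL
safe_rows = cur.execute(
    "SELECT * FROM stocks WHERE symbol = ?", (symbol,)).fetchall()

print(len(unsafe_rows), len(safe_rows))
con.close()
```

The formatted query returns every row because the injected `OR '1'='1'` is always true, while the parameterized query treats the whole string as a symbol name and matches nothing.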

# Database constraints

* [Relational algebra with pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#brief-primer-on-merge-methods-relational-algebra)
* [Group by: split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)
* With all these capabilities, why bother with an RDBMS?
  * An RDBMS enforces constraints that help assure data integrity
  * An RDBMS can scale up to big data and enterprise applications.
  * Managing relational data is a science in itself.
* Database normalization and the normal forms 


In [None]:
import sqlite3
import pandas as pd

In [None]:
df = pd.DataFrame({'a': [1,2, 3], 'b': [3,4, 4], 'c': [5, 6, 7]})
df

In [None]:
df.groupby(['a', 'b']).size()

In [None]:
df[['a', 'b']].duplicated()

# Primary Key

* [`CREATE TABLE` reference docs](https://www.sqlite.org/lang_createtable.html) -- sqlite.org
* [SQL features that SQLite doesn't support](https://www.sqlite.org/omitted.html) -- sqlite.org
* [Foreign Key constraints](https://www.sqlite.org/foreignkeys.html) -- sqlite.org

In [None]:
# Create a table with PRIMARY KEY constraints
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS demo (
                 a int, 
                 b int, 
                 c int,
                 PRIMARY KEY (a, b)
               )''')
con.commit()

# Confirm the table schema

* You can recover the SQL command that created the table from `sqlite_master`
* `PRAGMA table_info` confirms the schema column by column, including which columns belong to the primary key

In [None]:
pd.read_sql_query("SELECT * FROM sqlite_master WHERE type='table'", con)

In [None]:
print(pd.read_sql_query("SELECT * FROM sqlite_master WHERE type='table'", con)['sql'][0])

### PRAGMA table_info

* [table_info pragma](https://learning.oreilly.com/library/view/using-sqlite/9781449394592/re205.html) -- from "Using SQLite"
* The `pk` column indicates whether the named column is the primary key or part of a multi-column primary key

In [None]:
# Note the "pk" column
pd.read_sql_query("PRAGMA table_info('demo')", con)
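
The same check can be sketched without pandas by reading the pragma rows directly. This standalone example recreates the `demo` table in its own in-memory database; the dictionary name `pk_position` is just an illustrative choice.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (a int, b int, c int, PRIMARY KEY (a, b))")

# Each row is (cid, name, type, notnull, dflt_value, pk)
info = con.execute("PRAGMA table_info('demo')").fetchall()

# pk is 0 for non-key columns; otherwise it is the column's
# 1-based position within the primary key
pk_position = {row[1]: row[5] for row in info}
print(pk_position)
con.close()
```

Here `a` and `b` get non-zero `pk` values because together they form the composite key, while `c` gets 0.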

In [None]:
# Create a dataframe with data to be inserted in the database
df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 4], 'c': [5, 6, 7]})

## With a `PRIMARY KEY` constraint

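* The next cell appends the dataframe to the `demo` table.
* It can only be run once: a second run would insert rows that repeat existing keys, which violates the primary-key constraint
* Primary-key values must be unique; here the key is the pair (a, b)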

In [None]:
# Add the data to the "demo" table -- you can only do this once
df.to_sql("demo", con, if_exists="append", index=False)

In [None]:
# Confirm the content of the database
pd.read_sql_query("SELECT * FROM 'demo'", con)
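
To see what "only once" means in practice, the following sketch repeats the insert using plain `sqlite3` instead of pandas (an assumption made to keep the example self-contained). The second attempt is rejected with `sqlite3.IntegrityError` and the table keeps its original three rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE demo (a int, b int, c int, PRIMARY KEY (a, b))")

# Same values as the dataframe above
rows = [(1, 3, 5), (2, 4, 6), (3, 4, 7)]
cur.executemany("INSERT INTO demo VALUES (?, ?, ?)", rows)
con.commit()

# Inserting the same (a, b) pairs again violates the primary key
try:
    cur.executemany("INSERT INTO demo VALUES (?, ?, ?)", rows)
    con.commit()
    second_insert_failed = False
except sqlite3.IntegrityError as err:
    con.rollback()
    second_insert_failed = True
    print("Rejected:", err)

row_count = cur.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(row_count)
con.close()
```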

## Without a `PRIMARY KEY` constraint

Without a PRIMARY KEY constraint, duplicates are allowed.

In [None]:
# Add a table without PRIMARY KEY constraints -- you can run this cell repeatedly
cur.execute('''CREATE TABLE IF NOT EXISTS demo2 (
                 a int, 
                 b int, 
                 c int
               )''')

df.to_sql("demo2", con, if_exists="append", index=False)
df.to_sql("demo2", con, if_exists="append", index=False)
pd.read_sql_query("SELECT * FROM 'demo2'", con)

# Outer join without primary key



In [None]:
# Clear out demo2 table
cur.execute("DELETE FROM demo2")
con.commit()

pd.read_sql_query("SELECT * FROM 'demo2'", con)

In [None]:
# Create demo3 table
cur.execute('''CREATE TABLE IF NOT EXISTS demo3 (
                 d int, 
                 e int
               )''')
con.commit()

In [None]:
# Create dataframes, df2 & df3, that we'll load into tables demo2 & demo3
df2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df3 = pd.DataFrame({'d': [21, 22], 'e': [23, 24]})

# Clear out the tables (in case we're running this cell more than once)
cur.execute("DELETE FROM demo2")
cur.execute("DELETE FROM demo3")
con.commit()

# Load the data
df2.to_sql("demo2", con, if_exists="append", index=False)
df3.to_sql("demo3", con, if_exists="append", index=False)

In [None]:
# No ON clause: SQLite accepts this and pairs every demo2 row with every demo3 row
sql = "SELECT * " \
      " FROM demo2 " \
      " LEFT JOIN demo3 "

pd.read_sql_query(sql, con) 

In [None]:
# Same join with the tables swapped; still no ON clause
sql = "SELECT * " \
      " FROM demo3 " \
      " LEFT JOIN demo2 "

pd.read_sql_query(sql, con) 
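
The joins above have no `ON` clause, so every row of one table is paired with every row of the other. The sketch below contrasts that with a join that does have a condition. The condition `demo2.a = demo3.d` is a hypothetical choice for illustration only (these tables have no declared key relationship), and the example uses plain `sqlite3` with its own in-memory copies of the tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo2 (a int, b int, c int)")
con.execute("CREATE TABLE demo3 (d int, e int)")
con.executemany("INSERT INTO demo2 VALUES (?, ?, ?)",
                [(1, 4, 7), (2, 5, 8), (3, 6, 9)])
con.executemany("INSERT INTO demo3 VALUES (?, ?)",
                [(21, 23), (22, 24)])
con.commit()

# No ON clause: 3 rows x 2 rows
no_on = con.execute("SELECT * FROM demo2 LEFT JOIN demo3").fetchall()

# Hypothetical join condition: no value of a equals any value of d, so each
# demo2 row appears once, padded with NULLs (None) for the demo3 columns
with_on = con.execute(
    "SELECT * FROM demo2 LEFT JOIN demo3 ON demo2.a = demo3.d").fetchall()

print(len(no_on), len(with_on))
con.close()
```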