# Exercises: Connecting to PostgreSQL with Python

There may be other ways to solve these exercises. They are meant to get you thinking about the interplay between Python and a SQL database, not to review the basic syntax of calling a database from Python, so they will require some thought and possibly a little more research and reading. Try them with a partner or group.

In [None]:
import psycopg2

## Exercise: Create and Populate Tables

Connect to a database where you have permission to create tables.

Create three tables with appropriate columns:

* `person`: at least an ID and name, maybe other characteristics of a person
* `relationship`: links two people together and labels the link with a relationship type
* `relationship_type`: a table defining the allowed set of relationship types in the `relationship` table

Populate the tables with information about your friends and/or family.  Hint: think about how you want to handle IDs for people so that you can use them in the relationship table.  Hint 2: think about how to make it clear in the relationship table what the direction of the relationship is (e.g. who is the child and who is the parent?).

Print out sentences describing the family relationships.

#### Solution

There are obviously multiple ways to do this.  Here is one.

In [None]:
conn = psycopg2.connect(dbname="", host="", user="", password="") # fill in details
cur = conn.cursor()

In [None]:
# create tables
cur.execute("create table person (id serial primary key, name text not null);") # have to create before relationship
cur.execute("""create table relationship_type (
               type text primary key);""") ## have to create before relationship below
cur.execute("""create table relationship (
            id serial primary key,
            subject int references person(id),   -- the person the relationship describes
            predicate int references person(id), -- the person the subject is related to
            relationship text references relationship_type(type));""")
conn.commit()

Populate the tables. One option is to create a dict that stores the auto-generated ID for each person for later use. This isn't very efficient (it takes two queries per person, and the lookup by name assumes names are unique), but it works fine for moderately sized databases where you want to keep all of this information in memory in Python anyway. You could also define the IDs yourself, but they can be tricky to keep track of across multiple sessions of working with a database.

In [None]:
family = {x:None for x in ['Christina','Casey','Henry','Jessica','Denise','Bob']}
for person in family:
    # insert
    cur.execute("insert into person (name) values (%s);", [person])
    # retrieve ID
    cur.execute("select id from person where name=%s;", [person])
    family[person] = cur.fetchone()[0]
    
# commit
conn.commit()
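
The second query per person can be avoided with PostgreSQL's `returning` clause, which hands back the generated ID from the insert itself and cannot pick the wrong row when two people share a name. A minimal sketch, assuming an open psycopg2-style cursor and the `person` table created above; `insert_person` and `insert_people` are hypothetical helper names, not part of psycopg2:

```python
def insert_person(cur, name):
    """Insert one person and return the ID that PostgreSQL generated.

    `cur` is assumed to be an open psycopg2 cursor; the caller still
    has to commit on the connection afterwards.
    """
    cur.execute("insert into person (name) values (%s) returning id;", [name])
    return cur.fetchone()[0]


def insert_people(cur, names):
    """Return a dict mapping each name to its newly generated ID."""
    return {name: insert_person(cur, name) for name in names}
```

For example, `family = insert_people(cur, ['Christina', 'Casey'])` followed by `conn.commit()` would build the same kind of dict as the loop above.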

Define relationship types.  There's no reason you have to use an underscore in the relationship type strings -- you could use a space.  The use of the underscore comes from experience with categorical variables in data analysis in other contexts.  

You have to populate the `relationship_type` table before `relationship`.

In [None]:
for rtype in ['spouse_of','parent_of','sibling_of','child_of']:
    cur.execute("insert into relationship_type values (%s);", [rtype])

# commit
conn.commit()

In [None]:
# not a complete set of relationships, but some are entered in both directions
relations = [(family['Christina'], family['Casey'], 'spouse_of'),
    (family['Christina'], family['Henry'], 'parent_of'),
    (family['Casey'], family['Henry'], 'parent_of'),
    (family['Henry'], family['Christina'], 'child_of'),
    (family['Henry'], family['Casey'], 'child_of'),
    (family['Christina'], family['Jessica'], 'sibling_of'),
    (family['Christina'], family['Denise'], 'child_of'),
    (family['Christina'], family['Bob'], 'child_of'),
    (family['Jessica'], family['Denise'], 'child_of'),
    (family['Jessica'], family['Bob'], 'child_of')]
for relation in relations:
    cur.execute("""insert into relationship (subject, predicate, relationship) 
                values (%s, %s, %s);""", relation)
conn.commit()

You could also use [`executemany()`](http://initd.org/psycopg/docs/cursor.html#cursor.executemany) above, but in psycopg2 it isn't faster than calling `execute()` in a loop.

Note that instead of manually entering each relationship both ways, you could set up [triggers](https://www.postgresql.org/docs/9.1/static/sql-createtrigger.html) in the database to do this.  This would take some work to set up (you'd need to define the opposite of each relationship type), but it's possible.
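
A lighter-weight alternative to triggers is to generate the reverse rows in Python before inserting them. This is a sketch of that idea, not of the trigger approach itself; the `INVERSE` mapping and the `with_inverses` helper are hypothetical names introduced for illustration:

```python
# Assumed mapping from each relationship type to its opposite.
INVERSE = {
    'spouse_of': 'spouse_of',
    'sibling_of': 'sibling_of',
    'parent_of': 'child_of',
    'child_of': 'parent_of',
}


def with_inverses(relations):
    """Return the given (subject, predicate, type) tuples plus their reverses.

    Duplicates are dropped and the original order is preserved, so a
    relationship that was already entered both ways is not added twice.
    """
    seen = set()
    result = []
    for subject, predicate, rtype in relations:
        for row in ((subject, predicate, rtype),
                    (predicate, subject, INVERSE[rtype])):
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result
```

Passing a list of relationship tuples through `with_inverses()` before the insert loop would fill in the missing direction for each one, at the cost of keeping the `INVERSE` mapping in sync with the `relationship_type` table by hand.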

Look at results.

In [None]:
cur.execute("""select a.name, b.name, r.relationship
                from relationship r
                join person a on a.id = r.subject
                join person b on b.id = r.predicate;""")
for subject, obj, rel in cur.fetchall():
    print("{} is the {} {}.".format(subject, rel.replace("_", " "), obj))

In [None]:
cur.close()
conn.close()

## Exercise: Selecting Random Data

One thing that isn't easy to do with SQL is selecting random rows. There are functions to generate random values, but generating a new random column on a large table and then sorting by that column (or computing the max value and then selecting an observation) is costly. This is one scenario where working with a database from Python is useful.

Use the code below to create a table in the database.  Then figure out how to select 3 random rows from that table (as if you didn't have access to the code or values that created the table).  Do this without reading the entire table into Python.  Hint: you'll probably want to use some combination of sorting the table, limiting the number of rows you retrieve, and offsetting results (which we probably didn't cover: learn more [here](http://www.postgresqltutorial.com/postgresql-limit/) or [here](https://www.tutorialspoint.com/postgresql/postgresql_limit_clause.htm)).

In [None]:
import string
import random

ids = random.sample(range(1000), 100)  # 100 distinct IDs between 0 and 999

conn = psycopg2.connect(dbname="", host="", user="", password="") ## connect to a database where you can write
cur = conn.cursor()
cur.execute("""create table patient (
                id int primary key,
                name text not null);""")
for i in ids:
    cur.execute("insert into patient values (%s, %s)", (i, ''.join(random.sample(string.ascii_letters, 5))))
conn.commit()
cur.close()
conn.close()

#### Solution

In [None]:
conn = psycopg2.connect(dbname="", host="", user="", password="") ## connect to a database where you can write
cur = conn.cursor()
cur.execute("select * from patient;")
for row in cur.fetchmany(5):
    print(row)

First get the number of rows in the table. Then select 3 distinct random values between 0 and the number of rows - 1. Then, for each value, execute a query that uses it as an offset to get that row from the database.

In [None]:
cur.execute("select count(*) from patient;")
count = cur.fetchone()[0]
selection = random.sample(range(count), 3)  # 3 distinct row offsets between 0 and count - 1
for val in selection:
    cur.execute("""select * from patient 
                    order by id  -- important so that we get rows in the same order each query
                    limit 1 -- we just need one row
                    offset %s;""", [val]) # use the offset to determine which row
    print(cur.fetchone())

An alternative approach, which could work well if the table isn't too big, is to retrieve all of the IDs, and then randomly sample the IDs, and retrieve just those rows.

In [None]:
cur.execute("select id from patient;")
ids = [x[0] for x in cur.fetchall()]
selection = random.sample(ids, 3)
for val in selection:
    cur.execute("""select * from patient 
                    where id = %s;""", [val]) # look up the row by its ID
    print(cur.fetchone())
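
The per-ID loop can also be collapsed into one round trip. psycopg2 adapts a Python list to a PostgreSQL array, so the sampled IDs can be passed to `= any(%s)` as a single parameter. A minimal sketch, assuming an open psycopg2 cursor and the `patient` table above; `fetch_patients` and `sample_patients` are hypothetical helper names:

```python
import random


def fetch_patients(cur, ids):
    """Fetch all patient rows whose ID is in `ids` with a single query.

    The IDs are passed as a list (not a tuple) so that psycopg2 sends
    them as an array, which is what `any(...)` expects.
    """
    cur.execute("select * from patient where id = any(%s);", (list(ids),))
    return cur.fetchall()


def sample_patients(cur, k):
    """Pick k random patients by sampling from the full list of IDs."""
    cur.execute("select id from patient;")
    all_ids = [row[0] for row in cur.fetchall()]
    return fetch_patients(cur, random.sample(all_ids, k))
```

Calling `sample_patients(cur, 3)` issues two queries in total. The rows come back in whatever order the database chooses, which is fine for a random sample.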

In [None]:
cur.close()
conn.close()