# Loading a Database

This lab examines two data files that can be organized into a database.
The design is implemented in PostgreSQL, and the data is then loaded using psycopg2.

#### There is a complete video walkthrough provided in the videos section of the schedule. 

We have linked the videos into this file as well.

For each section of the lab, 
 * Briefly read through the lab (no execution)
 * Then watch the video for that section
 * Then work through the lab section

## Section 1:  

**Video Link**: https://youtu.be/kp1vSoBAD2M

### Data Files:

 * game-of-thrones/battles.csv 
 * game-of-thrones/character-deaths.csv
 
Let's load each one up in a Pandas dataframe and inspect it.

In [None]:
import pandas as pd
battle_file = '/dsa/data/all_datasets/game-of-thrones/battles.csv'

battles = pd.read_csv(battle_file)
battles.head().transpose()

In [None]:
import pandas as pd
deaths_file = '/dsa/data/all_datasets/game-of-thrones/character-deaths.csv'

deaths = pd.read_csv(deaths_file)
deaths.head()


### Designing our tables

The data files, in this simple case, already define our Entities and their attributes:
 * Battle
 * Death
 
We will design the tables as such:
 * Note the schema name of SSO, **replace** with your PostgreSQL user/schema (your pawprint)
 * Note the change in field order to make battle_number the primary key
 * Note where some fields are NOT NULL, but others are allowing NULL
 
**NULL** - The SQL NULL is the term used to represent a missing value. A NULL value in a table is a value in a field that appears to be blank. A field with a NULL value is a field with no value. It is very important to understand that a NULL value is different than a zero value or a field that contains spaces.
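
To make the distinction concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL. That substitution is an assumption made only so the snippet runs without a server, and the `demo` table and its values are hypothetical. It stores a NULL, a zero, and a string of spaces, then asks the database which rows are NULL.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (label TEXT, value)")

# Python None becomes SQL NULL; 0 and '   ' are real values
rows = [("missing", None), ("zero", 0), ("spaces", "   ")]
conn.executemany("INSERT INTO demo (label, value) VALUES (?, ?)", rows)

# Only the row that was inserted as None matches IS NULL
null_labels = [r[0] for r in conn.execute("SELECT label FROM demo WHERE value IS NULL")]
print(null_labels)
conn.close()
```

Only the `missing` row is reported, which is why a zero or a blank string cannot be used to stand in for "no value".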


Note - the below code cells are Raw and not cells that you run within the notebook. They are to show the SQL commands. They are not creating the tables in our database.

## Section 2:

**Video Link**: https://youtu.be/Z7LrW7ooeJA
 * Please note, the database host name has changed since the video was made: `dbase` is now `pgsql.dsa.lan`, as you have been using all semester.

### Task: Log into your database and run your create table command for battle.

```BASH
$ psql -h pgsql.dsa.lan dsa_student
Password for user SSO:
Type "help" for help.

dsa_student=>
```
Once at this prompt, copy and paste the code cell above into the terminal and hit enter. 

After creating the table, we should see the following from the `\dt` command:

```
dsa_student=> \dt
              List of relations
 Schema |      Name       | Type  |  Owner
--------+-----------------+-------+----------
 SSO    | battle          | table | SSO
 public | spatial_ref_sys | table | postgres
(3 rows)
```

Examining the table structure with `\d SSO.`

```
dsa_student=> \d SSO.
                   Table "SSO.battle"
       Column        |          Type          | Modifiers
---------------------+------------------------+-----------
 battle_number       | integer                | not null
 name                | character varying(150) | not null
 year                | integer                |
 attacker_king       | character varying(50)  |
 defender_king       | character varying(50)  |
 attacker_1          | character varying(50)  | not null
 attacker_2          | character varying(50)  |
 attacker_3          | character varying(50)  |
 attacker_4          | character varying(50)  |
 defender_1          | character varying(50)  |
 defender_2          | character varying(50)  |
 defender_3          | character varying(50)  |
 defender_4          | character varying(50)  |
 attacker_outcome    | character varying(6)   |
 battle_type         | character varying(20)  |
 major_death         | integer                |
 major_capture       | integer                |
 attacker_size       | integer                |
 defender_size       | integer                |
 attacker_commanders | character varying(220) |
 defender_commanders | character varying(220) |
 summer              | integer                |
 location            | character varying(50)  |
 region              | character varying(50)  |
 note                | character varying(500) |
Indexes:
    "battle_pkey" PRIMARY KEY, btree (battle_number)
    
        Index "SSO.battle_pkey"
    Column     |  Type   |  Definition
---------------+---------+---------------
 battle_number | integer | battle_number
primary key, btree, for table "SSO.battle"
                                       
```
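
The column listing above can be turned back into a `CREATE TABLE` statement. The sketch below is a reconstruction from the `\d` output, not the original raw SQL cell, and the helper `build_create_table` is a hypothetical name introduced for illustration. It assembles the statement from (column, type, nullable) triples.

```python
# (column, type, nullable) triples copied from the \d listing above
columns = [
    ("battle_number", "INT", False),
    ("name", "VARCHAR(150)", False),
    ("year", "INT", True),
    ("attacker_king", "VARCHAR(50)", True),
    ("defender_king", "VARCHAR(50)", True),
    ("attacker_1", "VARCHAR(50)", False),
    ("attacker_2", "VARCHAR(50)", True),
    ("attacker_3", "VARCHAR(50)", True),
    ("attacker_4", "VARCHAR(50)", True),
    ("defender_1", "VARCHAR(50)", True),
    ("defender_2", "VARCHAR(50)", True),
    ("defender_3", "VARCHAR(50)", True),
    ("defender_4", "VARCHAR(50)", True),
    ("attacker_outcome", "VARCHAR(6)", True),
    ("battle_type", "VARCHAR(20)", True),
    ("major_death", "INT", True),
    ("major_capture", "INT", True),
    ("attacker_size", "INT", True),
    ("defender_size", "INT", True),
    ("attacker_commanders", "VARCHAR(220)", True),
    ("defender_commanders", "VARCHAR(220)", True),
    ("summer", "INT", True),
    ("location", "VARCHAR(50)", True),
    ("region", "VARCHAR(50)", True),
    ("note", "VARCHAR(500)", True),
]


def build_create_table(schema, table, columns, primary_key):
    """Assemble a CREATE TABLE statement (hypothetical helper)."""
    lines = []
    for name, sql_type, nullable in columns:
        suffix = "" if nullable else " NOT NULL"
        lines.append(f"    {name} {sql_type}{suffix}")
    lines.append(f"    PRIMARY KEY ({primary_key})")
    body = ",\n".join(lines)
    return f"CREATE TABLE {schema}.{table} (\n{body}\n);"


# Replace SSO with your own schema name (your pawprint)
ddl = build_create_table("SSO", "battle", columns, "battle_number")
print(ddl)
```

Comparing the printed statement with your own create table command is a quick way to check column order, sizes, and which fields allow NULL.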

## Section 3:

**Video Link**: https://youtu.be/RNSANXMePPg

### Task: Use psycopg2 to create the death table


In [None]:
import getpass
# This collects a masked password from the user
mypasswd = getpass.getpass()

**In standard practice, the credentials would be read from a configuration file** but for convenience we will be using the code below. 
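
As a rough illustration of the configuration-file approach, the sketch below uses the standard library `configparser` module. The section name `postgresql` and the file layout are assumptions, and the file is written to a temporary location only so the example is self-contained. The password is deliberately left out of the file.

```python
import configparser
import os
import tempfile

# Write a throwaway config file so the example runs on its own.
# In real use the file would already exist, with restrictive permissions.
config_text = """
[postgresql]
host = pgsql.dsa.lan
database = dsa_student
user = SSO
"""
with tempfile.NamedTemporaryFile("w", suffix=".ini", delete=False) as handle:
    handle.write(config_text)
    config_path = handle.name

# Read the connection settings back as a plain dictionary
parser = configparser.ConfigParser()
parser.read(config_path)
db_settings = dict(parser["postgresql"])
os.remove(config_path)

print(db_settings)
```

These settings could then be passed as keyword arguments, for example `psycopg2.connect(password=mypasswd, **db_settings)`.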

**NOTE** in the code below you will need to change the user to your SSO (pawprint)

In [None]:
import psycopg2
import numpy as np
from psycopg2.extensions import adapt, register_adapter, AsIs

# Then connects to the DB
connection = psycopg2.connect(database = 'dsa_student', 
                              user = 'SSO', 
                              host = 'pgsql.dsa.lan',
                              password = mypasswd)

cursor = connection.cursor()

In [None]:
# Then delete the password variable so it does not linger in the notebook session
del mypasswd

In [None]:
sqlCreateTable = """
CREATE TABLE IF NOT EXISTS SSO.death (
 death_id SERIAL PRIMARY KEY,
 name VARCHAR(50),
 allegiances VARCHAR(50),
 death_year INT,
 book_of_death INT,
 death_chapter INT,
 book_intro_chapter INT,
 gender INT,
 nobility INT,
 GoT INT,
 CoK INT,
 SoS INT,
 FfC INT,
 DwD INT);"""

cursor.execute(sqlCreateTable)

connection.commit()

Notice that we added `IF NOT EXISTS` to the create table command. 
This way, running the notebook cell multiple times will not raise an error saying the table already exists. 
 
Also notice that there are many ways to build SQL statements when working with PostgreSQL and psycopg2. 
Each example is written a little differently so that you can see several of them. 
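
The effect of `IF NOT EXISTS` can be seen in a small experiment. The sketch below uses the built-in `sqlite3` module instead of PostgreSQL, an assumption made so it runs without a server, and a shortened, hypothetical version of the death table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Running the guarded statement twice is harmless
guarded = "CREATE TABLE IF NOT EXISTS death (death_id INTEGER PRIMARY KEY, name VARCHAR(50))"
conn.execute(guarded)
conn.execute(guarded)  # second run does nothing because the table exists

# The unguarded statement fails once the table is already there
try:
    conn.execute("CREATE TABLE death (death_id INTEGER PRIMARY KEY, name VARCHAR(50))")
    second_plain_create_failed = False
except sqlite3.OperationalError as error:
    second_plain_create_failed = True
    print("Without IF NOT EXISTS:", error)

conn.close()
```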

### Task: Use psycopg2 to load the death table.

Now that we have both tables created, we want to load data into them. 
We are already connected to the server, so let's move on to getting the data into the shape we need. 

Below we print each of the columns in the deaths dataframe, along with the same number of `%s` placeholders. This helps us when we create the insert statement. See (https://en.wikipedia.org/wiki/SQL_injection) to understand why `%s` placeholders are used.


In [None]:
print(list(deaths))
s = ''
for i in list(deaths):
    s += '%s,'
print(s)
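
The loop above leaves a trailing comma that has to be trimmed by hand. As an alternative sketch, `str.join` builds the placeholder list without one. The column list here is shortened for illustration, and the column names are assumed to come from trusted code rather than user input.

```python
# Shortened column list for illustration
columns = ["name", "allegiances", "death_year"]

# One %s per column, separated by commas, with no trailing comma
placeholders = ", ".join(["%s"] * len(columns))

# Replace SSO with your own schema name (your pawprint)
insert_sql = "INSERT INTO SSO.death ({}) VALUES ({})".format(
    ", ".join(columns), placeholders
)
print(insert_sql)
```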

Below the insert statement is constructed,
and then it is printed along with the data. 
In order to execute the insert statement within PostgreSQL you need to uncomment the `cursor.execute` statement as noted in the comments of the code. 
It is good practice to verify your code using _prints_ to make sure the data is truly what you think it is before inserting it.  

**Note** replace SSO with your pawprint

In [None]:
import numpy as np
# Register adapters so psycopg2 can pass NumPy int64/float64 values (from Pandas) into SQL as-is
register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)

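# Convert our panda to have Null values (None) instead of NaN.
# Casting to object first keeps newer pandas versions from turning None back into NaN.
deaths = deaths.astype(object).where(pd.notnull(deaths), None)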

# Note, we leave out the sequential counter.  
# Review the table definition above for the default value
INSERT_SQL = 'INSERT INTO SSO.death '
INSERT_SQL += ' (name,allegiances,death_year,book_of_death,death_chapter, '
INSERT_SQL += '  book_intro_chapter,gender,nobility,got,cok,sos,ffc,dwd ) VALUES '

# this is a parameterized string for SQL, the %s are placeholders
# this prevents SQL-Injection attacks on the code
# https://en.wikipedia.org/wiki/SQL_injection
INSERT_SQL += '(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'


print(INSERT_SQL)

# Note: The Commit Will Be Automatic after this with clause
with connection, connection.cursor() as cursor:
    # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html
    for row in deaths.itertuples(index=False, name=None):  # pull each row as a tuple
        
        ##### TODO:
        ##### Review the print output, then comment out the print(row)
        #####         and un-comment the cursor.execute row.
        
        # This is an un-indexed, un-named Tuple
        print(row) 
        
        # Insert the row
        #cursor.execute(INSERT_SQL,row)


<span style='background:yellow'> Ensure you have uncommented the `cursor.execute` command.</span>
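
To see why the insert statement uses placeholders instead of pasting values into the SQL text, consider the sketch below. It uses the built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption, so it runs without a server). SQLite writes placeholders as `?` where psycopg2 uses `%s`, but the idea is the same. The `user_input` string is a hypothetical malicious value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE death (name TEXT)")
conn.executemany(
    "INSERT INTO death (name) VALUES (?)",
    [("Addam Marbrand",), ("Aegon Frey (Jinglebell)",)],
)

# Hypothetical hostile input that tries to rewrite the WHERE clause
user_input = "nobody' OR '1'='1"

# Unsafe: the input is pasted into the SQL text and changes the query
unsafe_sql = "SELECT name FROM death WHERE name = '%s'" % user_input
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the input is passed separately and treated purely as a value
safe_rows = conn.execute(
    "SELECT name FROM death WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
conn.close()
```

The unsafe query returns every row in the table, while the parameterized query returns none, because no character has that literal name.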

#### Now check in the DB: Connect via a terminal, then run the commands.

```SQL
dsa_student=> select count(*) from SSO.death;
 count
-------
   917
(1 row)

dsa_student=> \x
Expanded display is on.
dsa_student=> select * from SSO.death limit 2;
-[ RECORD 1 ]------+------------------------
death_id           | 1
name               | Addam Marbrand
allegiances        | Lannister
death_year         |
book_of_death      |
death_chapter      |
book_intro_chapter | 56
gender             | 1
nobility           | 1
got                | 1
cok                | 1
sos                | 1
ffc                | 1
dwd                | 0
-[ RECORD 2 ]------+------------------------
death_id           | 2
name               | Aegon Frey (Jinglebell)
allegiances        | None
death_year         | 299
book_of_death      | 3
death_chapter      | 51
book_intro_chapter | 49
gender             | 1
nobility           | 1
got                | 0
cok                | 0
sos                | 1
ffc                | 0
dwd                | 0
```
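
The blank `death_year`, `book_of_death`, and `death_chapter` fields in record 1 are NULLs: psycopg2 sends a Python `None` to the database as SQL NULL, which is why NaN values are converted to `None` before inserting. Below is a pandas-free sketch of that conversion for a single row. The helper name `nan_to_none` is hypothetical, and the row reuses values from record 1 above.

```python
import math


def nan_to_none(row):
    """Return a copy of the row with float NaN values replaced by None."""
    return tuple(
        None if isinstance(value, float) and math.isnan(value) else value
        for value in row
    )


# A shortened row as it might come out of the dataframe, with NaN for missing values
raw_row = ("Addam Marbrand", "Lannister", math.nan, math.nan, math.nan, 56.0, 1, 1)
clean_row = nan_to_none(raw_row)
print(clean_row)
```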

### Task: Use psycopg2 to load the battles table.


In [None]:
print(list(battles))
s = ''
for i in list(battles):
    s += '%s,'
print(s)

### Let's load this table, which will take a little more work:



In [None]:
# Construct a parameterized SQL statement
INSERT_SQL = 'INSERT INTO SSO.battle '
INSERT_SQL += ' (battle_number,name, year,attacker_king,defender_king, '
INSERT_SQL += '  attacker_1,attacker_2,attacker_3,attacker_4,defender_1, '
INSERT_SQL += '  defender_2,defender_3,defender_4,attacker_outcome, '
INSERT_SQL += '  battle_type,major_death,major_capture,attacker_size, '
INSERT_SQL += '  defender_size,attacker_commanders,defender_commanders, '
INSERT_SQL += '  summer,location,region,note) VALUES '
INSERT_SQL += '(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'

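# Convert our panda to have Null values (None) instead of NaN.
# Casting to object first keeps newer pandas versions from turning None back into NaN.
battles = battles.astype(object).where(pd.notnull(battles), None)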

# Note: The Commit Will Be Automatic after this with clause
with connection, connection.cursor() as cursor:
    # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html
    for row in battles.itertuples():  # pull each row as a tuple
        
        # This is needed to remove the index element and re-order the columns
        data = (row.battle_number,row.name,row.year,row.attacker_king,row.defender_king,
                row.attacker_1,row.attacker_2,row.attacker_3,row.attacker_4,row.defender_1,
                row.defender_2,row.defender_3,row.defender_4,row.attacker_outcome,
                row.battle_type,row.major_death,row.major_capture,row.attacker_size,
                row.defender_size,row.attacker_commander,row.defender_commander,
                row.summer,row.location,row.region,row.note)
      
        # Insert the row
        cursor.execute(INSERT_SQL,data)
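
Spelling out every `row.<field>` works, but the same reordering can be sketched more compactly with `operator.attrgetter`, which returns the named attributes as a tuple in the order requested. The `Row` namedtuple below is a hand-built, shortened stand-in for what `itertuples()` yields, so its field list is an assumption.

```python
from collections import namedtuple
from operator import attrgetter

# Shortened stand-in for a row from battles.itertuples()
Row = namedtuple("Row", ["Index", "name", "year", "battle_number"])
row = Row(Index=0, name="Battle of the Golden Tooth", year=298, battle_number=1)

# Pick the fields in table order, which also drops the Index element
pick = attrgetter("battle_number", "name", "year")
data = pick(row)
print(data)
```

With the full list of column names, `pick(row)` could be passed to `cursor.execute(INSERT_SQL, ...)` in place of the hand-written tuple.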


#### Checking the DB again!

```SQL
dsa_student=> \x
Expanded display is on.
dsa_student=> select * from SSO.battle limit 1;
-[ RECORD 1 ]-------+---------------------------
battle_number       | 1
name                | Battle of the Golden Tooth
year                | 298
attacker_king       | Joffrey/Tommen Baratheon
defender_king       | Robb Stark
attacker_1          | Lannister
attacker_2          |
attacker_3          |
attacker_4          |
defender_1          | Tully
defender_2          |
defender_3          |
defender_4          |
attacker_outcome    | win
battle_type         | pitched battle
major_death         | 1
major_capture       | 0
attacker_size       | 15000
defender_size       | 4000
attacker_commanders | Jaime Lannister
defender_commanders | Clement Piper, Vance
summer              | 1
location            | Golden Tooth
region              | The Westerlands
note                |
```


Finally, let's close our connection to PostgreSQL.

In [None]:
if connection:
    cursor.close()
    connection.close()
    print("PostgreSQL connection is closed")

# Save your notebook, then `File > Close and Halt`

---