# Data Loading

In this lab, we are going to focus on loading our data about baseball players into a database. 
Why use a database instead of files? 
Conceptually, we do this when we want to enforce rules on the structure of the data so that issues of cleanliness, 
inconsistency and missing data are identified prior to our attempts to do analysis. 

Database management systems provide well defined structure for data. 
They also have the advantage of giving us a standard mechanism for extracting data: 
"Structured Query Language", or "SQL". 
If you have not used SQL before, it will require a little adjustment. 
Once you are familiar with it, however, you will find SQL intuitive and portable.

As you learned in the previous modules, Data Carpentry is often required to transform your messy data into a usable structure. 
Using a database allows you to store your transformed and cleaned data in a reusable, structured, and semantically labelled format.

You can then access this data in the future using structured query language (SQL). 



## Procedure

1. Inspect data, develop semantically structured data storage (i.e., database schema)
2. Develop data transformations, cleaning, and re-organizations
3. Push data into the database




----

## Inspect

For this lab, we are going to use relatively clean data that is in nice comma separated values (CSV) format.
Typically, the data requires data carpentry activities, but to keep the samples and discussion simpler we are going to start with data that is already clean.

To work with our data files we are going to use Pandas and NumPy.


In [1]:
import pandas as pd
import numpy as np
players = pd.read_csv('/dsa/data/all_datasets/baseball-databank/data/Master.csv')
teams = pd.read_csv('/dsa/data/all_datasets/baseball-databank/data/Teams.csv')
batting = pd.read_csv('/dsa/data/all_datasets/baseball-databank/data/Batting.csv')

Now we have loaded our three files into the variables: *players*, *teams*, and *batting*.
Each of these variables is a Pandas **`data frame`**.
As you have often done before, we can preview the data with the *`head()`* method called on the **`data frame`** variable.

In [2]:
players.head()

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,...,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,...,Aardsma,David Allan,220.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934.0,2.0,5.0,USA,AL,Mobile,,,,...,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939.0,8.0,5.0,USA,AL,Mobile,1984.0,8.0,16.0,...,Aaron,Tommie Lee,190.0,75.0,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954.0,9.0,8.0,USA,CA,Orange,,,,...,Aase,Donald William,190.0,75.0,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972.0,8.0,25.0,USA,FL,Palm Beach,,,,...,Abad,Fausto Andres,184.0,73.0,L,L,2001-09-10,2006-04-13,abada001,abadan01


In [3]:
teams.head()

Unnamed: 0,yearID,lgID,teamID,franchID,divID,Rank,G,Ghome,W,L,...,DP,FP,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro
0,1871,,BS1,BNA,,3,31,,20,10,...,,0.83,Boston Red Stockings,South End Grounds I,,103,98,BOS,BS1,BS1
1,1871,,CH1,CNA,,2,28,,19,9,...,,0.82,Chicago White Stockings,Union Base-Ball Grounds,,104,102,CHI,CH1,CH1
2,1871,,CL1,CFC,,8,29,,10,19,...,,0.81,Cleveland Forest Citys,National Association Grounds,,96,100,CLE,CL1,CL1
3,1871,,FW1,KEK,,7,19,,7,12,...,,0.8,Fort Wayne Kekiongas,Hamilton Field,,101,107,KEK,FW1,FW1
4,1871,,NY2,NNA,,5,33,,16,17,...,,0.83,New York Mutuals,Union Grounds (Brooklyn),,90,88,NYU,NY2,NY2


In [4]:
batting.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abercda01,1871,1,TRO,,1,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,,,,
1,addybo01,1871,1,RC1,,25,118.0,30.0,32.0,6.0,...,13.0,8.0,1.0,4.0,0.0,,,,,
2,allisar01,1871,1,CL1,,29,137.0,28.0,40.0,4.0,...,19.0,3.0,1.0,2.0,5.0,,,,,
3,allisdo01,1871,1,WS3,,27,133.0,28.0,44.0,10.0,...,27.0,1.0,1.0,0.0,2.0,,,,,
4,ansonca01,1871,1,RC1,,25,120.0,29.0,39.0,11.0,...,16.0,6.0,2.0,2.0,1.0,,,,,


----
As we can see, the CSV files are tabular data files. 
Note, in each case the tables cannot fit within the display and the *ellipsis* (...) is used to denote columns that are removed from the display.
Do you recall how to view the columns of a dataframe?
Like all things `python`, there are a few ways to do this.
We will just use the list function to inspect the dataframe.

In [5]:
list(players)

['playerID',
 'birthYear',
 'birthMonth',
 'birthDay',
 'birthCountry',
 'birthState',
 'birthCity',
 'deathYear',
 'deathMonth',
 'deathDay',
 'deathCountry',
 'deathState',
 'deathCity',
 'nameFirst',
 'nameLast',
 'nameGiven',
 'weight',
 'height',
 'bats',
 'throws',
 'debut',
 'finalGame',
 'retroID',
 'bbrefID']

We get to see all the columns of the dataframe this way.
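For reference, here are a few equivalent ways of listing column names, sketched on a small made-up dataframe. The name `demo` and its two rows are illustrative only and are not part of the lab files.

```python
import pandas as pd

# A tiny illustrative dataframe standing in for the real players data
demo = pd.DataFrame({'playerID': ['aardsda01', 'aaronha01'],
                     'birthYear': [1981.0, 1934.0]})

viaList = list(demo)                 # iterating a dataframe yields its column labels
viaColumns = demo.columns.tolist()   # the columns attribute is an Index of labels
viaKeys = list(demo.keys())          # keys() returns the same Index

print(viaList)
```

All three produce the same list, so the choice is mostly a matter of taste.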

Now let's do it for the other two data frames, teams and batting.

In [7]:
print("teams data frame : \n {} \n".format(list(teams)))

print("batting data frame : \n {}".format(list(batting)))

teams data frame : 
 ['yearID', 'lgID', 'teamID', 'franchID', 'divID', 'Rank', 'G', 'Ghome', 'W', 'L', 'DivWin', 'WCWin', 'LgWin', 'WSWin', 'R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'CS', 'HBP', 'SF', 'RA', 'ER', 'ERA', 'CG', 'SHO', 'SV', 'IPouts', 'HA', 'HRA', 'BBA', 'SOA', 'E', 'DP', 'FP', 'name', 'park', 'attendance', 'BPF', 'PPF', 'teamIDBR', 'teamIDlahman45', 'teamIDretro'] 

batting data frame : 
 ['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'HBP', 'SH', 'SF', 'GIDP']


### We know the columns... now what?

## Database Design

Now that we know the columns, we need to contemplate how we will use a database to structure the data.
A relational database is an organized set of __tables__ (aka *Relations*).

__tables__ are a structured set of columns with semantic meaning, a particular data type, and constraints on validity.
Not everyone has domain knowledge of baseball and its column labels. If you need some assistance, please ask in mutual aid. 
For example, we can expect the *batting* column *RBI* to mean *Runs Batted In*. 

#### SQL : Create Table
```SQL
CREATE TABLE table_name (
  col_a_name col_a_datatype, 
  col_b_name col_b_datatype, 
  col_c_name col_c_datatype, 
  ...
  PRIMARY KEY(list_of_columns)
);
```
**REFERENCE LINK** [SQLite Create Table](https://www.sqlite.org/lang_createtable.html)
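
To make the template concrete, here is a minimal sketch that fills it in and shows the `PRIMARY KEY` rule being enforced. The table name `batting_demo`, the choice of key columns, and the inserted values are assumptions made for illustration; the lab itself does not define a key for batting.

```python
import sqlite3

connection = sqlite3.connect(':memory:')   # throwaway in-memory database
cursor = connection.cursor()

# Hypothetical table: one row per player, per season, per stint
cursor.execute("""
CREATE TABLE batting_demo (
  playerID TEXT,
  yearID INTEGER,
  stint INTEGER,
  HR INTEGER,
  PRIMARY KEY(playerID, yearID, stint)
);
""")

cursor.execute("INSERT INTO batting_demo VALUES (?,?,?,?)", ('ansonca01', 1871, 1, 0))

duplicateRejected = False
try:
    # Same key columns again: the PRIMARY KEY constraint refuses this row
    cursor.execute("INSERT INTO batting_demo VALUES (?,?,?,?)", ('ansonca01', 1871, 1, 5))
except sqlite3.IntegrityError as error:
    duplicateRejected = True
    print("Rejected:", error)

connection.close()
```

This is the kind of structural rule a plain CSV file cannot enforce on its own.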

### Column Data Types

Many databases enforce rigid data typing; SQLite, however, permits looser *type affinity* based on storage classes.
From the SQLite documentation: 
```
Each column in an SQLite 3 database is assigned one of the following type affinities:
    TEXT
    NUMERIC
    INTEGER
    REAL
    BLOB
```

For this activity, we will limit our storage columns to one of:
 1. **TEXT** - Character strings
 2. **INTEGER** - whole numbers, no decimal places
 3. **REAL** - floating point numbers with decimal places


**REFERENCE LINK** [SQLite Column Data Types](https://www.sqlite.org/datatype3.html)
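
The following sketch illustrates what "affinity" means in practice: SQLite converts a value toward the declared column type when it can, and stores it unchanged when it cannot. The table `affinity_demo` and its values are made up for this illustration.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute("CREATE TABLE affinity_demo (i INTEGER, t TEXT, r REAL)")

# Deliberately pass "mismatched" Python types
cursor.execute("INSERT INTO affinity_demo VALUES (?,?,?)", ('42', 7, 3))
# Text that does not look like a number stays text, even in an INTEGER column
cursor.execute("INSERT INTO affinity_demo VALUES (?,?,?)", ('abc', 'x', 1.5))

# typeof() reports the storage class actually used for each stored value
cursor.execute("SELECT typeof(i), typeof(t), typeof(r) FROM affinity_demo ORDER BY rowid")
storedTypes = cursor.fetchall()
print(storedTypes)
connection.close()
```

The second row shows why you should still check your data before loading: SQLite will not stop a stray string from landing in an INTEGER column.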

So, how can we examine in a programmatic way the data types as interpreted by Pandas?  
Recall that a dataframe provides column access via __dataframe__['*column_name*'], and the column is an object that holds a list of values and the data type.

In [8]:
# dtype is the Data Type of the column that is referenced by name in the square brackets
players['birthYear'].dtype

dtype('float64')

We see that birthYear is a floating point number, which above we refer to as a **REAL**.

Since we have `python` at our fingertips... let's programmatically inspect the columns and data types.

In [9]:
# Remember above, the list command did an inspection and 
# got a list of column names in the data frame!
for columnName in list(players):
    print("Column {} is a {}".format(columnName, players[columnName].dtype))

Column playerID is a object
Column birthYear is a float64
Column birthMonth is a float64
Column birthDay is a float64
Column birthCountry is a object
Column birthState is a object
Column birthCity is a object
Column deathYear is a float64
Column deathMonth is a float64
Column deathDay is a float64
Column deathCountry is a object
Column deathState is a object
Column deathCity is a object
Column nameFirst is a object
Column nameLast is a object
Column nameGiven is a object
Column weight is a float64
Column height is a float64
Column bats is a object
Column throws is a object
Column debut is a object
Column finalGame is a object
Column retroID is a object
Column bbrefID is a object


The output above should look similar to:
```
Column playerID is a object
Column birthYear is a float64
...
Column bbrefID is a object
```
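
As an alternative to looping over the column names, a dataframe exposes all of its column types at once through the `dtypes` attribute. A minimal sketch on a made-up dataframe follows; `demo` and its values are illustrative, and the text column is forced to `object` explicitly because newer pandas versions may report a dedicated string type instead.

```python
import pandas as pd

demo = pd.DataFrame({
    'playerID': pd.Series(['aardsda01', 'aaronha01'], dtype=object),
    'birthYear': [1981.0, None],   # a missing value forces a float column
    'G': [1, 25],
})

# dtypes is a Series indexed by column name; convert each dtype to its string name
dtypeNames = {columnName: str(dtype) for columnName, dtype in demo.dtypes.items()}
print(dtypeNames)
```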

The **object** datatype we will interpret as **TEXT** for the database.
Scroll back up to when we previewed the data files, does this seem reasonable?  
Let us check, just to be sure.

In [10]:
players['playerID'].head()

0    aardsda01
1    aaronha01
2    aaronto01
3     aasedo01
4     abadan01
Name: playerID, dtype: object

Seems OK!
You should typically check every column that you are going to load into the database.

Once you are ready to create a table, we can write the create table statement:

```SQL
CREATE TABLE players (
  playerID TEXT,
  birthYear REAL,
  ...
);
```

Can we automate this for a generic CSV to SQL Table?

----

This segment of code shows the use of Python to automate the generation of a SQL Create Table statement from a Pandas dataframe.

__Note:__ the special escape characters for *newline* (`\n`) and *tab* (`\t`) are used to generate visually pleasing SQL; they are not required.

In [11]:
# Begin the create table statement
createTableStmt = "CREATE TABLE players (\n"

# Build a translation table from pandas types to SQL types
dtype2SQL = {'object' : 'TEXT', 'float64' : 'REAL', 'int64' : "INTEGER"}
# Did you notice the int64 ?  That came from the teams and batting dataframes


columnList = list(players)

for columnName in columnList:
    pandaType = str(players[columnName].dtype) # Note, we need to force the conversion of the type name to a string
    sqlDataTypeStr = dtype2SQL[pandaType]      # Then we look up the SQL type Desired
    #
    #  Construct a Column Spec 
    #  col_name col_dtype , 
    createTableStmt += "\t{} {},\n".format(columnName, sqlDataTypeStr)
    #
    # NOTE:  string1 += string2 appends string2 to the end of string1, e.g., if s is "ABC" then s += "XYZ" results in "ABCXYZ"
    #
    
    
# Note, the last column has a trailing comma, so we can now add a Primary Key specification
# If this is not suitable for the data file you have, you will need to make adjustments 
# such as removing the last comma before closing off the table.
createTableStmt += "\tPRIMARY KEY({})\n".format(columnList[0])


# Close off the Create Table Statement
createTableStmt += ");"

print(len(createTableStmt))

440


Yes...this looks like a lot of code but it is actually making our lives a lot easier. Instead of having to write an entire statement ourselves, we can harness `python` to write our statement for us. We can break this down:

The first line introduces the `createTableStmt` variable. You will notice throughout this code that we update this variable. It starts with a string that begins our `SQL` table construction. 

The next thing that we need to do is add the column names to the table, along with the type of data that belongs in each column. To do this, we need to create a mapping between analogous data types of `pandas` and `SQLite`. We do so with a dictionary, `dtype2SQL`, which uses the `pandas` dtype as the key and the `SQLite` data type as the value.

Next, we create a variable called `columnList` so that we can iterate through the columns. The following `for` loop is responsible for creating the meat of the `createTableStmt`. This loop breaks down as follows:

```python
pandaType = str(players[columnName].dtype)
```
This line just stores a string version of the `pandas` data type.

```python
sqlDataTypeStr = dtype2SQL[pandaType]
```
This then maps the `pandas` data type to the `SQLite` data type. We store this in a variable called `sqlDataTypeStr`.


```python
createTableStmt += "\t{} {},\n".format(columnName, sqlDataTypeStr)
```
And this is the line that adds to the original `createTableStmt`. The `+=` updates and saves to this variable. 

It then goes through the rest of the list of column names and keeps updating until the last column.


After the `for` loop, the last thing to do is finish up the statement by updating the `createTableStmt`. Let's take a look at what this looks like by printing the statement...

In [12]:
print(createTableStmt)

CREATE TABLE players (
	playerID TEXT,
	birthYear REAL,
	birthMonth REAL,
	birthDay REAL,
	birthCountry TEXT,
	birthState TEXT,
	birthCity TEXT,
	deathYear REAL,
	deathMonth REAL,
	deathDay REAL,
	deathCountry TEXT,
	deathState TEXT,
	deathCity TEXT,
	nameFirst TEXT,
	nameLast TEXT,
	nameGiven TEXT,
	weight REAL,
	height REAL,
	bats TEXT,
	throws TEXT,
	debut TEXT,
	finalGame TEXT,
	retroID TEXT,
	bbrefID TEXT,
	PRIMARY KEY(playerID)
);


Notice how the `\t` and `\n` were not printed in this statement but instead were rendered as the intended tab and newline. 
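
One possible variation, sketched below, builds the column lines in a list and joins them, which avoids having to think about the trailing comma at all. The three `columnSpecs` pairs are a shortened, hypothetical stand-in for the full column list.

```python
# Hypothetical (column name, SQL type) pairs, standing in for the loop output above
columnSpecs = [('playerID', 'TEXT'), ('birthYear', 'REAL'), ('weight', 'REAL')]

lines = ["\t{} {}".format(name, sqlType) for name, sqlType in columnSpecs]
lines.append("\tPRIMARY KEY({})".format(columnSpecs[0][0]))

# join places ",\n" only between entries, so no trailing comma needs to be removed
joinedStmt = "CREATE TABLE players (\n" + ",\n".join(lines) + "\n);"
print(joinedStmt)
```

With this approach, leaving out the primary key simply means not appending that last line.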

**Let's modularize this**

Instead of writing out the whole block of code for every single table that we want to add into our database, we can create a function that takes only a couple arguments that will do all of the work for us.

In [13]:
def dataframe2CreateTable(dataFrame, tableName = "WHATS_MY_NAME",useFirstColumnAsPK=True):
    '''
    This function inspects a Panda Dataframe and converts it to 
    a SQL Create Table Statement String
    
    Arguments:
       dataFrame : a panda dataframe with column headers
       tableName : a valid SQL table name
       useFirstColumnAsPK : Use the first column as a primary key, default=True
    
    Returns : a Create Table tableName string
    '''
    createTableStmt = "CREATE TABLE {} (\n".format(tableName)  # used the format to splice in the table name 
    dtype2SQL = {'object' : 'TEXT', 'float64' : 'REAL', 'int64' : "INTEGER"}
    columnList = list(dataFrame)  # Replaced players from code with function variable
    
    for columnName in columnList:
        # NOTE: Some of the columns start with a number, this is not valid column naming
        # in most databases;  so the next four lines detect and fix
        if (columnName[0].isdigit()):
            sqlColumnName = "n"+columnName   # we will just prepend the letter 'n' (for number)
        else:
            sqlColumnName = columnName

        pandaType = str(dataFrame[columnName].dtype) # Note, we need to force the conversion of the type name to a string
        sqlDataTypeStr = dtype2SQL[pandaType]      # Then we look up the SQL type Desired
        createTableStmt += "\t{} {},\n".format(sqlColumnName, sqlDataTypeStr)
    # END OF FOR EACH COLUMN
    
    # Close off the Create Table Statement with the PK
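    if (useFirstColumnAsPK):
        # Apply the same leading-digit fix to the primary key column name
        pkColumnName = columnList[0]
        if (pkColumnName[0].isdigit()):
            pkColumnName = "n"+pkColumnName
        createTableStmt += "\tPRIMARY KEY({})\n".format(pkColumnName)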
    else: # replace last comma with a space, note it's minus 2 because -1 is the newline
                                          # This is the substring access 
                                          # see : https://docs.python.org/3/tutorial/introduction.html#strings
        createTableStmt = createTableStmt[:len(createTableStmt) -2] + "\n"
    createTableStmt += ");"
    
    return  createTableStmt
# ------- END OF dataframe2CreateTable


# Invoke
help(dataframe2CreateTable)

Help on function dataframe2CreateTable in module __main__:

dataframe2CreateTable(dataFrame, tableName='WHATS_MY_NAME', useFirstColumnAsPK=True)
    This function inspects a Panda Dataframe and converts it to 
    a SQL Create Table Statement String
    
    Arguments:
       dataFrame : a panda dataframe with column headers
       tableName : a valid SQL table name
       useFirstColumnAsPK : Use the first column as a primary key, default=True
    
    Returns : a Create Table tableName string
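
The function above renames columns such as `2B` to `n2B`. As a point of comparison, SQLite also accepts such names unchanged when they are wrapped in double quotes. The sketch below assumes a made-up table `quoted_demo`; whether quoting or renaming is preferable depends on how the columns will be queried later.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# Double quotes mark "2B" as an identifier, so the leading digit is accepted
cursor.execute('CREATE TABLE quoted_demo (playerID TEXT, "2B" INTEGER)')
cursor.execute('INSERT INTO quoted_demo VALUES (?,?)', ('addybo01', 6))

# The quotes are needed every time the column is referenced
cursor.execute('SELECT "2B" FROM quoted_demo')
doubles = cursor.fetchone()[0]
print(doubles)
connection.close()
```

The renaming approach trades a slightly altered name for queries that never need the extra quoting.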



### Putting our DB creation together

Now we can put the pieces together. First we are going to prepare our statements by using our newly developed `dataframe2CreateTable` function.

In [14]:
# We are going to write a SQLite DB
import sqlite3

playersCreateTableStmt = dataframe2CreateTable(dataFrame = players, tableName = 'players')
teamsCreateTableStmt = dataframe2CreateTable(dataFrame = teams, tableName = 'teams', useFirstColumnAsPK=False)
battingCreateTableStmt = dataframe2CreateTable(dataFrame = batting, tableName = 'batting', useFirstColumnAsPK=False)


print(playersCreateTableStmt)
print(teamsCreateTableStmt)
print(battingCreateTableStmt)

CREATE TABLE players (
	playerID TEXT,
	birthYear REAL,
	birthMonth REAL,
	birthDay REAL,
	birthCountry TEXT,
	birthState TEXT,
	birthCity TEXT,
	deathYear REAL,
	deathMonth REAL,
	deathDay REAL,
	deathCountry TEXT,
	deathState TEXT,
	deathCity TEXT,
	nameFirst TEXT,
	nameLast TEXT,
	nameGiven TEXT,
	weight REAL,
	height REAL,
	bats TEXT,
	throws TEXT,
	debut TEXT,
	finalGame TEXT,
	retroID TEXT,
	bbrefID TEXT,
	PRIMARY KEY(playerID)
);
CREATE TABLE teams (
	yearID INTEGER,
	lgID TEXT,
	teamID TEXT,
	franchID TEXT,
	divID TEXT,
	Rank INTEGER,
	G INTEGER,
	Ghome REAL,
	W INTEGER,
	L INTEGER,
	DivWin TEXT,
	WCWin TEXT,
	LgWin TEXT,
	WSWin TEXT,
	R INTEGER,
	AB INTEGER,
	H INTEGER,
	n2B INTEGER,
	n3B INTEGER,
	HR INTEGER,
	BB INTEGER,
	SO REAL,
	SB REAL,
	CS REAL,
	HBP REAL,
	SF REAL,
	RA INTEGER,
	ER INTEGER,
	ERA REAL,
	CG INTEGER,
	SHO INTEGER,
	SV INTEGER,
	IPouts INTEGER,
	HA INTEGER,
	HRA INTEGER,
	BBA INTEGER,
	SOA INTEGER,
	E INTEGER,
	DP REAL,
	FP REAL,
	name TEXT,
	park TEXT,
	a

**NOTE:** In reality, some of the column data types such as year and counting statistics should be INTEGER type.
However, for this example we will just move forward.
There are ways to manipulate the pandas dataframe to move the data into a better aligned column data type.
We will leave that as a thought exercise for now.
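
If you want a starting point for that exercise, here is one possible approach, offered as a sketch rather than the lab's method. It uses the pandas nullable integer type `Int64` on a made-up column, and a hypothetical helper `sqlTypeFor` that maps by dtype family instead of by exact dtype name.

```python
import pandas as pd

demo = pd.DataFrame({'birthYear': [1981.0, 1934.0, None]})

# 'Int64' (capital I) is the pandas nullable integer type; the gap is kept as <NA>
demo['birthYear'] = demo['birthYear'].astype('Int64')

def sqlTypeFor(series):
    '''Hypothetical helper: pick a SQL type by dtype family, defaulting to TEXT.'''
    if pd.api.types.is_integer_dtype(series):
        return 'INTEGER'
    if pd.api.types.is_float_dtype(series):
        return 'REAL'
    return 'TEXT'

birthYearSqlType = sqlTypeFor(demo['birthYear'])
print(demo['birthYear'].dtype, birthYearSqlType)
```

Note that the missing entries in such a column are `pd.NA` values, which may need converting to `None` before the rows are inserted into SQLite.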

__WARNING__ : 
Please note that when we first connect to a database using SQLite and the file does not exist, it gets created for us.  
This *friendly* behavior can be REALLY CONFUSING on that day in the future when you have a file path wrong on a database you previously have populated and it looks empty to your code.  

__REFERENCE__ : [Python SQLite3](https://docs.python.org/3/library/sqlite3.html)
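
One way to guard against the silent-creation surprise is to open the file through a SQLite URI with `mode=rw`, which opens an existing database for reading and writing but refuses to create a new one. The helper name `connectExisting` and the temporary path below are assumptions for this sketch.

```python
import os
import pathlib
import sqlite3
import tempfile

def connectExisting(path):
    '''Hypothetical helper: open a SQLite file only if it already exists.'''
    # mode=rw means read/write access without creating a missing file
    uri = pathlib.Path(path).resolve().as_uri() + '?mode=rw'
    return sqlite3.connect(uri, uri=True)

workDir = tempfile.mkdtemp()
missingPath = os.path.join(workDir, 'no_such.db')

openFailed = False
try:
    connectExisting(missingPath)
except sqlite3.OperationalError as error:
    openFailed = True
    print("Refused to create a new file:", error)

fileWasCreated = os.path.exists(missingPath)
```

With a wrong path you get an error right away instead of an empty database.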


In [15]:
import os

# Below is a pathname = ../baseball.db 
# The path is broken into elements around the '/' character (i.e., "forward slash" because it leans forward)
# The first path element is the '..'  which is interpreted as the parent directory/folder. 
#        Look at the URL.  This notebook file is named: --- module4/labs/database_loading.ipynb
#        This file is in a folder named labs, which is in a folder named module4.
#        The above path name is therefore for module4/labs/../baseball.db
#                   ... which is equivalent to module4/baseball.db
#        We are putting the file there so it is accessible during exercises
databaseFilename = '../baseball.db'

# Just because we are creating this file here
#  we will remove it in case you re-run the cell
if os.path.exists(databaseFilename):
    os.remove(databaseFilename)
    
# Open / Create the baseball.db database file.
connection = sqlite3.connect(databaseFilename)

# The sqlite3 module uses a cursor to execute SQL statements and fetch their results.
cursor = connection.cursor()
# Changes made through the cursor belong to a transaction on the connection.
# Until they are committed, they can be undone by cancelling the transaction
#  (i.e., ROLLBACK) with connection.rollback().

# Create tables
cursor.execute(playersCreateTableStmt)
cursor.execute(teamsCreateTableStmt)
cursor.execute(battingCreateTableStmt)

# Save (commit) the changes
connection.commit()

# We can also close the connection if we are done with it.
# Just be sure any changes have been committed or they will be lost.
connection.close()

What do we have here? Well, the first statement is going to create a string with the desired database file name. The next two lines ...

```python
if os.path.exists(databaseFilename):
    os.remove(databaseFilename)
```

... are going to check if that file exists on your operating system. If it does it will remove that database (because we are going to write a new one). 

Next we establish a connection to the database (take a look at the note above about if the database doesn't exist already).

After the connection is established, we create a `cursor`, the object through which we send SQL statements to the database and read back any results. Once we have a `cursor` object, we can `execute` the statements that we created above. Finally, we `commit` the changes on the connection and `close` it.
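
In Python's `sqlite3` module, `commit()` and `rollback()` are methods of the connection. The sketch below shows the difference between the two on a made-up table `rollback_demo`, assuming the module's default transaction handling.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute("CREATE TABLE rollback_demo (name TEXT)")
connection.commit()

cursor.execute("INSERT INTO rollback_demo VALUES (?)", ('Boston Red Stockings',))
connection.rollback()          # discard the uncommitted INSERT
cursor.execute("SELECT COUNT(*) FROM rollback_demo")
countAfterRollback = cursor.fetchone()[0]

cursor.execute("INSERT INTO rollback_demo VALUES (?)", ('Boston Red Stockings',))
connection.commit()            # make the INSERT permanent
cursor.execute("SELECT COUNT(*) FROM rollback_demo")
countAfterCommit = cursor.fetchone()[0]

print(countAfterRollback, countAfterCommit)
connection.close()
```

The first count is 0 because the insert was rolled back; the second is 1 because the insert was committed.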



### Did this work?
You can open the file using the command line ...

... or we can use SQL and Python.

In [16]:
## Did this actually work?
#  Open the DB file
databaseFilename = '../baseball.db'
connection = sqlite3.connect(databaseFilename)

# Select the list of tables from the SQLite Engine Catalog for the database file
cursor = connection.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()

# Iterate through all the rows returned, where the first column is the table name
for table_name in tables:
    print(table_name[0])

players
teams
batting
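
Beyond listing table names, you can also ask SQLite to describe a table's columns with `PRAGMA table_info`. The sketch below uses a small in-memory table, `players_demo`, as a stand-in for the real `players` table.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute("CREATE TABLE players_demo (playerID TEXT, birthYear REAL, PRIMARY KEY(playerID))")

# Each row describes one column: (cid, name, type, notnull, dflt_value, pk)
cursor.execute("PRAGMA table_info(players_demo)")
columnInfo = [(row[1], row[2], row[5]) for row in cursor.fetchall()]
print(columnInfo)
connection.close()
```

Running the same statement against `players` in `baseball.db` is a quick way to confirm that the column names, types, and primary key match the generated statement.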


### FINALLY ... we get to load the data

This entails iterating through the dataframe and inserting each row's values into the table.

In [17]:
databaseFilename = '../baseball.db'
connection = sqlite3.connect(databaseFilename)
cursor = connection.cursor()

for row in players.itertuples(index=False):
    cursor.execute('INSERT INTO players VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)',row)

# Save (commit) the changes
connection.commit()


__REFERENCE:__ [itertuples : Iterate Through Dataframe Rows, each row a tuple](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html#pandas.DataFrame.itertuples)
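
Typing out one `?` per column is error-prone for wide tables. A possible refinement, sketched here with a hypothetical helper `insertDataFrame` and a made-up two-column table, is to build the placeholder list from the number of columns.

```python
import sqlite3
import pandas as pd

def insertDataFrame(connection, tableName, dataFrame):
    '''Hypothetical helper: insert every row of dataFrame into tableName.'''
    placeholders = ",".join(["?"] * len(dataFrame.columns))   # e.g. "?,?"
    # tableName is spliced in directly, so it must come from trusted code, not user input
    insertStmt = "INSERT INTO {} VALUES({})".format(tableName, placeholders)
    connection.executemany(insertStmt, dataFrame.itertuples(index=False))
    connection.commit()
    return insertStmt

demo = pd.DataFrame({'playerID': ['aardsda01', 'aaronha01'], 'weight': [220.0, 180.0]})
connection = sqlite3.connect(':memory:')
connection.execute("CREATE TABLE players_demo (playerID TEXT, weight REAL)")
insertStmt = insertDataFrame(connection, 'players_demo', demo)
rowCount = connection.execute("SELECT COUNT(*) FROM players_demo").fetchone()[0]
print(insertStmt, rowCount)
```

The values themselves still travel through `?` placeholders, so only the table name and the placeholder count are formatted into the SQL text.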

__How to see the data from the database prompt__
![Select 5 players](../images/SQLite_baseball_select_5_players.png)

In [18]:
databaseFilename = '../baseball.db'
connection = sqlite3.connect(databaseFilename)
cursor = connection.cursor()

# Fill NaN values; fillna returns a new dataframe, so the result must be assigned
batting = batting.fillna(value=0)

# Or stand on the shoulders of giants
cursor.executemany('INSERT INTO batting VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)',
                   batting.itertuples(index=False))


# Save (commit) the changes
connection.commit()


__REFERENCE:__ Now that we did all this in a drawn out fashion, see [SQL Loading from Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html)  
See Also: [SQLAlchemy](http://www.sqlalchemy.org/)
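
For comparison, here is a minimal sketch of the `to_sql` shortcut on a made-up dataframe and an in-memory database. Note that in this form pandas chooses the column types itself and does not declare a primary key, so it is not a drop-in replacement for the hand-built tables above.

```python
import sqlite3
import pandas as pd

demo = pd.DataFrame({'playerID': ['aardsda01', 'aaronha01'], 'weight': [220.0, 180.0]})
connection = sqlite3.connect(':memory:')

# index=False keeps the dataframe's row index out of the table;
# if_exists='replace' drops and recreates the table when it already exists
demo.to_sql('players_demo', connection, if_exists='replace', index=False)

# Read the table back into a dataframe to confirm the load
roundTrip = pd.read_sql_query("SELECT * FROM players_demo", connection)
print(roundTrip)
connection.close()
```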


__REMEMBER__ : This lab used clean CSV files that were mostly straightforward. Often, data carpentry activities require the efforts of the previous lab as well as the current lab.

__ALSO REMEMBER__ : This is just an introduction to databases and we will come back to everything in the database course.

# Save your notebook, then `File > Close and Halt`