# Practicing Data Loading and Queries

## Data loading revisited

Recall from the data loading lab that we ingested three CSV files into data frames:

```
import pandas as pd
import numpy as np
players = pd.read_csv('../../../datasets/baseball-databank/data/Master.csv')
teams = pd.read_csv('../../../datasets/baseball-databank/data/Teams.csv')
batting = pd.read_csv('../../../datasets/baseball-databank/data/Batting.csv')

```

Then we created three tables in the database (code comments removed for brevity):

```
import sqlite3

connection = sqlite3.connect(databaseFilename)
cursor = connection.cursor()
cursor.execute(playersCreateTableStmt)
cursor.execute(teamsCreateTableStmt)
cursor.execute(battingCreateTableStmt)
connection.commit()
connection.close()
```

We then looked at two techniques to load the data.  

__First__, using a *row iterator* on the Players dataframe with looped insert statements.  
__Second__, using the *row iterator* of the Batting dataframe with a single *executemany* insert statement.
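
As a reminder of how the two techniques differ, here is a minimal sketch against an in-memory database. The `demo` table, its two columns, and the three sample rows are made up for illustration; they are not the lab's actual schema or data.

```python
import sqlite3
import pandas as pd

# Hypothetical two-column frame standing in for the real CSV data.
df = pd.DataFrame({"playerID": ["a01", "b02", "c03"], "HBP": [1.0, 2.0, 3.0]})

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE demo (playerID TEXT, HBP REAL)")

# Technique 1: iterate over the rows and issue one INSERT per row.
for row in df.itertuples(index=False, name=None):
    cursor.execute("INSERT INTO demo VALUES (?, ?)", row)

# Technique 2: hand the whole row iterator to a single executemany call.
cursor.executemany("INSERT INTO demo VALUES (?, ?)",
                   df.itertuples(index=False, name=None))

connection.commit()
row_count = cursor.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(row_count)  # each technique inserted the same three rows
connection.close()
```

Both techniques store the same rows; the second simply leaves the looping to the database driver.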

**Exercise 1**: Load the data from the '../../../datasets/baseball-databank/data/Teams.csv' file into the teams SQL table in the __../baseball.db__ database.

In [None]:
# Code for Exercise 1 goes here 
# -----------------------------
## Import libraries
import pandas as pd
import numpy as np
import sqlite3

# Read the data in from a .csv file
teams = pd.read_csv('../../../datasets/baseball-databank/data/Teams.csv')


databaseFilename = '../baseball.db'
connection = sqlite3.connect(databaseFilename)
cursor = connection.cursor()

teams = teams.fillna(value=0) # Fill NaN values; fillna returns a new frame, so keep the result
cursor.executemany('INSERT INTO teams VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)',
                   teams.itertuples(index=False, name=None))  # name=None yields plain tuples

# Save (commit) the changes
connection.commit()
connection.close()










__DON'T WORRY IF YOU CORRUPT YOUR DATABASE.__  
Just go back to the data loading lab and choose 'Cell > Run All' from the menu at the top of the notebook.

## Clean up some data in the database

In this section we show some database queries as you would run them in an SQL window rather than in Python.

One of the things we conveniently ignored during data loading was the handling of NaN (Not a Number) and missing (NULL) values.

For example:

```SQL
sqlite> select GIDP,SF,SH,HBP,IBB from batting limit 10;
||||
||||
||||
||||
||||
||||
||||
||||
||||
||||
sqlite> select count(*) from batting where SH is NULL;
11487
sqlite> select count(*) from batting where SF is NULL;
41181
sqlite> select count(*) from batting where GIDP is NULL;
31257
sqlite> select count(*) from batting where HBP is NULL;
7959

```
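
The same per-column NULL counts can be collected with one query from Python. The sketch below uses a made-up three-row `batting_demo` table, not the real `batting` data, and relies on SQLite evaluating `col IS NULL` to 1 or 0 so that `SUM` counts the NULLs.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical stand-in for the batting table; None is stored as NULL.
cursor.execute("CREATE TABLE batting_demo (playerID TEXT, SH REAL, HBP REAL)")
cursor.executemany("INSERT INTO batting_demo VALUES (?, ?, ?)",
                   [("a01", None, 2.0), ("b02", 1.0, None), ("c03", None, None)])

# One pass over the table, one NULL count per column.
sql = ("SELECT SUM(SH IS NULL) AS sh_nulls, SUM(HBP IS NULL) AS hbp_nulls "
       "FROM batting_demo")
null_counts = cursor.execute(sql).fetchone()
print(null_counts)
connection.close()
```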

Since these are numerical statistics, we need to choose a value to substitute for the missing ones. The proper substitute depends on the type of statistic, and in some cases we may prefer to leave the values as NULL.

For now, we will update one column, *Hit by Pitch* (HBP), to be zero instead of NULL.

```SQL
UPDATE batting
SET HBP = 0
WHERE HBP is NULL;
```

Ponder the statement above.  Now suppose we want to update the SH and GIDP columns where they are NULL. Why would the next statement corrupt our data?

```SQL
UPDATE batting
SET SH = 0, GIDP = 0
WHERE GIDP is NULL OR SH is NULL;
```

What alternative command(s) should be used?

__Before you get started, read about transactions [here](http://www.tutorialspoint.com/sqlite/sqlite_transactions.htm)__

Example rollback of data changes:
![example_transaction_rollback](../images/example_transaction_rollback.png)
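
The same rollback idea can be tried from Python. This is a minimal sketch on a made-up in-memory `batting_demo` table, assuming the `sqlite3` module's default transaction handling, in which an UPDATE implicitly opens a transaction that stays pending until `commit()` or `rollback()` is called.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical stand-in for the batting table.
cursor.execute("CREATE TABLE batting_demo (playerID TEXT, HBP REAL)")
cursor.executemany("INSERT INTO batting_demo VALUES (?, ?)",
                   [("a01", None), ("b02", 3.0)])
connection.commit()  # the starting state is now saved

# A careless UPDATE with no WHERE clause overwrites every row.
cursor.execute("UPDATE batting_demo SET HBP = 0")
after_update = cursor.execute(
    "SELECT HBP FROM batting_demo ORDER BY playerID").fetchall()

connection.rollback()  # discard the uncommitted UPDATE
after_rollback = cursor.execute(
    "SELECT HBP FROM batting_demo ORDER BY playerID").fetchall()

print(after_update)
print(after_rollback)
connection.close()
```

Checking the data before calling `commit()` gives you a chance to back out of a mistaken update.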


Although we show the SQL examples as plain SQL, please use the Python SQL interface for the exercises.

**Exercise 2**: Update the SH and GIDP columns to convert NULL values to 0.

In [None]:
# Code for Exercise 2 goes here 
# correct the example below 
# -----------------------------
import sqlite3
import pandas as pd

databaseFilename = '../baseball.db'
connection = sqlite3.connect(databaseFilename)
cursor = connection.cursor()

SQL = 'UPDATE batting '
SQL += 'SET SH = 0, GIDP = 0 '
SQL += ' WHERE GIDP is NULL OR SH is NULL; '

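result = cursor.execute(SQL)  # execute takes the SQL (plus optional parameters), not the connection

print(result.rowcount)  # number of rows changed by the UPDATE

connection.commit()  # save the changes before closing
connection.close()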






Once you have updated the missing HBP values to zero, we can compute some basic statistics. Can we determine which teams have the highest hit-by-pitch values, averaged across their players?  If so, show only the top 5!


### Queries with Python

Writing SQL queries and collecting the results into a pandas DataFrame is remarkably easy!

__REFERENCE:__ [Read SQL into a pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html)

__QUERY__ : Which three players were hit by the most pitches?  
__EXAMPLE SQL__ :  *select playerID, HBP from batting  order by 2 desc  limit 3;*

__pandas Query to DataFrame__

```
import sqlite3
import pandas as pd

databaseFilename = '../baseball.db'
connection = sqlite3.connect(databaseFilename)
cursor = connection.cursor()

SQL = 'select playerID, HBP from batting  order by 2 desc  limit 3;'

result = pd.read_sql_query(SQL, connection)

print(result)

connection.close()
```
__RESULT__
```
    playerID   HBP
0  jennihu01  51.0
1   huntro01  50.0
2  jennihu01  46.0
```

In the above example, the SQL is executed against the open database connection, and the results are stored in a pandas DataFrame.
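
`pd.read_sql_query` also accepts a `params` argument, so values such as the row limit do not have to be pasted into the SQL text. The sketch below assumes a made-up in-memory `batting_demo` table with four sample rows rather than the real database.

```python
import sqlite3
import pandas as pd

connection = sqlite3.connect(":memory:")

# Hypothetical stand-in for the batting table.
connection.execute("CREATE TABLE batting_demo (playerID TEXT, HBP REAL)")
connection.executemany("INSERT INTO batting_demo VALUES (?, ?)",
                       [("a01", 5.0), ("b02", 12.0), ("c03", 8.0), ("d04", 1.0)])

# The ? placeholder is filled in from params when the query runs.
SQL = "select playerID, HBP from batting_demo order by HBP desc limit ?"
result = pd.read_sql_query(SQL, connection, params=(2,))

print(result)
connection.close()
```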


**Exercise 3:** Complete the code cell below to show the summary HBP, *home run* (HR), and *runs batted in* (RBI)  statistics of the 100 players hit by the most pitches.

Expected output:
```
     count   mean        std   min    25%   50%    75%    max
HBP  100.0  25.48   6.173387  20.0  21.00  24.0  27.00   51.0
HR   100.0  12.72  12.647577   0.0   3.00   7.5  20.25   54.0
RBI  100.0  71.40  26.438035  18.0  53.75  70.0  87.25  156.0
```
[Layout Hint, Do not click yet](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.transpose.html)

In [None]:
import sqlite3
import pandas as pd

databaseFilename = '../baseball.db'
connection = sqlite3.connect(databaseFilename)
cursor = connection.cursor()

# Code for Exercise 3 goes here 
# -----------------------------

# Fill in the SQL here
SQL = ' '

result = pd.read_sql_query(SQL, connection)

print(result.describe().transpose())
# -----------------------------

connection.close()

**Exercise 4:** Write the SQL that finds the *names of players* with the best ratio of *runs* (R) to *at bats* (AB).

In [None]:
# Code for Exercise 4 goes here 
# -----------------------------
# Hint... build your query up one column at a time, 
# and ensure you provide a join condition.
import sqlite3
import pandas as pd

databaseFilename = '../baseball.db'
connection = sqlite3.connect(databaseFilename)
cursor = connection.cursor()

## Fill in the table name and column to join on below. 

SQL = 'select nameFirst, nameLast, sum(AB),sum(R), sum(R)/sum(AB) '
SQL += ' from players join [--tablename--] using ([--column--]) '
SQL += ' group by nameFirst, nameLast '
SQL += ' having sum(AB) > 100 '
SQL += ' order by sum(R)/sum(AB) desc, 4  limit 10'

result = pd.read_sql_query(SQL, connection)

print(result)

connection.close()






__EXPECTED RESULT__
```
  nameFirst   nameLast  sum(AB)  sum(R)  sum(R)/sum(AB)
0      Matt  Alexander    168.0   111.0        0.660714
1    Dickie    Flowers    120.0    40.0        0.333333
2     Jimmy       Wood    487.0   162.0        0.332649
3      Glen     Barker    164.0    53.0        0.323171
4      Jack       Reed    129.0    39.0        0.302326
5      Ross     Barnes   2392.0   698.0        0.291806
6     Steve       King    272.0    78.0        0.286765
7       Ced    Landrum    105.0    30.0        0.285714
8    Stuffy    Stewart    265.0    74.0        0.279245
9      Dave   Birdsall    240.0    66.0        0.275000
```
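
The float sums in the expected result suggest that `AB` and `R` were stored as floating-point values. If your table instead declares them as INTEGER (an assumption about your schema, not something this lab states), SQLite performs integer division and the ratio collapses to 0. A minimal sketch with a made-up `int_demo` table:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table whose statistics are declared as INTEGER.
cursor.execute("CREATE TABLE int_demo (AB INTEGER, R INTEGER)")
cursor.executemany("INSERT INTO int_demo VALUES (?, ?)", [(100, 30), (100, 20)])

# Integer / integer truncates toward zero in SQLite.
truncated = cursor.execute(
    "SELECT sum(R) / sum(AB) FROM int_demo").fetchone()[0]

# Casting one operand to REAL forces floating-point division.
as_real = cursor.execute(
    "SELECT CAST(sum(R) AS REAL) / sum(AB) FROM int_demo").fetchone()[0]

print(truncated, as_real)
connection.close()
```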

# SAVE YOUR NOTEBOOK