# Auxiliary Notes / Help on Database Practice

Some students have had a few challenges with the database practice for Module 3.
This notebook is a supplement that provides additional information.

## Practice Question 2

From the original practice question:

----
For now we will update a column, *Hit by Pitch* (HBP) to be zero instead of NULL.

```SQL
UPDATE batting
SET HBP = 0
WHERE HBP is NULL;
```

Ponder the statement above.  Now we want to update the SH and the GIDP columns where they are NULL. Why is this next statement going to corrupt our data?

```SQL
UPDATE batting
SET SH = 0, GIDP = 0
WHERE GIDP is NULL OR SH is NULL;
```

What alternative command(s) should be used?

----

Before we dive in, let's look at the NULL counts that were supplied via SQL.

```SQL
sqlite> select count(*) from batting where SH is NULL;
11487
sqlite> select count(*) from batting where SF is NULL;
41181
sqlite> select count(*) from batting where GIDP is NULL;
31257
sqlite> select count(*) from batting where HBP is NULL;
7959

```
Note that we have just counted the number of rows that are NULL for each column.
We will spend some time in Python / pandas to break the tabular data down without the database being involved.

Then we will re-visit the original practice exercise.


In [4]:
## Import libraries
import pandas as pd
import numpy as np
import sqlite3

# Read the data in from a .csv file
batting = pd.read_csv('../../../datasets/baseball-databank/data/Batting.csv')


print(batting['SH'].dtype)
print(batting['SF'].dtype)
print(batting['GIDP'].dtype)
print(batting['HBP'].dtype)
#batting[batting['SH']==' '].head()

float64
float64
float64
float64


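__Above__ we load the Batting file into a pandas DataFrame named *batting*. All four columns print as `float64` because pandas stores missing values as NaN, which forces integer-valued columns with gaps to floating point.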

You saw in the [*modules/module1/labs/intro_data_science_python*](../../module1/labs/intro_data_science_python.ipynb#Data-Filtering) notebook how to filter rows of a dataframe.

While you previously saw comparison-based filtering on values, we can also filter dataframes using functions that check for the presence or absence of a value.

__REFERENCE__ : [isnull()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.isnull.html) and [notnull()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.notnull.html#pandas.notnull)
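
As a minimal sketch of this kind of filtering, the cell below builds a small made-up dataframe (the `toy` name and its values are illustrative assumptions, not rows from Batting.csv) and shows that `isnull()` returns a true/false Series that can either select rows or be counted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the real Batting.csv
toy = pd.DataFrame({
    'playerID': ['a', 'b', 'c', 'd'],
    'SH':   [1.0, np.nan, 0.0, np.nan],
    'GIDP': [np.nan, 2.0, np.nan, np.nan],
})

# Boolean Series: True where SH is missing
sh_is_null = toy['SH'].isnull()
print(sh_is_null.tolist())              # [False, True, False, True]

# Using the Series as a filter keeps only the 'True' rows
print(len(toy[sh_is_null]))             # 2

# notnull() is the complement; summing booleans counts the True values
print(int(toy['SH'].notnull().sum()))   # 2
```

Note that a value of 0.0 is not NULL: row `c` has a real SH value of zero and is not selected by the filter.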

Let's compare pandas to our earlier DB answers.  **Please review the comments in the source below**

In [5]:
print ("Size of batting is {} rows".format( len(batting) ) )


 # Recall the len() function measures length or number of elements.
           #  |  # Recall the df[filter] generates a true/false list that selects the 'true' rows to filter
           #  |               |
           #  V               V
shNull =      len ( batting[  batting['SH'].isnull()   ] ) 
                    #      ^
                    #      |
                #  This df[] is the select operation on a data frame.
                    #         When a column name is supplied, we get the column as a Series
                    #         When a list of rows is supplied, we get a subset of the dataframe
    # So batting is filtered on the rows where the SH column has a null value
    # This is the equivalent of SQL statement part "where SH is NULL"
                
                
sfNull =  len ( batting[  batting['SF'].isnull()   ] ) 
gidpNull = len ( batting[ batting['GIDP'].isnull() ] )
hbpNull = len ( batting[  batting['HBP'].isnull()  ] )

print ("SH Nulls =", shNull)

print ("SF Nulls =", sfNull)

print ("GIDP Nulls =", gidpNull)

print ("HBP Nulls =", hbpNull)


Size of batting is 101332 rows
SH Nulls = 11487
SF Nulls = 41181
GIDP Nulls = 31257
HBP Nulls = 7959


Recall :
```SQL
sqlite> select count(*) from batting where SH is NULL;
11487
sqlite> select count(*) from batting where SF is NULL;
41181
sqlite> select count(*) from batting where GIDP is NULL;
31257
sqlite> select count(*) from batting where HBP is NULL;
7959
```

__Recall our challenge question__

```SQL
UPDATE batting
SET SH = 0, GIDP = 0
WHERE GIDP is NULL OR SH is NULL;
```

The goal is to update the SH and the GIDP columns where they are NULL. Why is the statement above going to corrupt our data?


__Issue__: the counts of NULL values for SH and GIDP are not the same, 11487 versus 31257 respectively.

Even if the counts were the same, we could not be sure that SH and GIDP are NULL in exactly the same rows.

To see the problem, imagine the extreme case where the 11487 rows with *SH is NULL* are completely different rows from the 31257 rows with *GIDP is NULL*.
This means that there are no rows that are simultaneously *SH is NULL* and *GIDP is NULL*.  
Then the update statement will match the where clause on 11487 + 31257 = 42744 rows, and every one of those rows holds a real (non-NULL) value in one of the two columns.
Therefore, the **OR** in the where clause of the SQL collects too many rows to update, because a row matches if either the SH or the GIDP column is NULL, and the SET clause then overwrites both columns.
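
The counting argument can be sketched on a small made-up dataframe (the `toy` values below are illustrative assumptions, not rows from Batting.csv). The OR count plus the AND count always equals the sum of the two per-column NULL counts, and the rows matched by OR but not by AND are the ones that hold a real value that would be overwritten.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the real Batting.csv
toy = pd.DataFrame({
    'SH':   [np.nan, np.nan, 3.0, 1.0, 0.0],
    'GIDP': [np.nan, 4.0, np.nan, np.nan, 2.0],
})

sh_null = toy['SH'].isnull()
gidp_null = toy['GIDP'].isnull()

or_rows = int((sh_null | gidp_null).sum())    # rows the OR where clause would match
and_rows = int((sh_null & gidp_null).sum())   # rows where both columns are NULL

# Rows matched by OR where exactly one column is NULL:
# the other column holds a real value that the update would overwrite
at_risk = int((sh_null ^ gidp_null).sum())

print(or_rows, and_rows, at_risk)             # 4 1 3
print(or_rows + and_rows == int(sh_null.sum()) + int(gidp_null.sum()))  # True
```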

The effect, given a row with SH = 1 and GIDP NULL, is that the database statement you were asked to ponder will erase the SH = 1 value and set it to SH = 0.
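
This effect can be reproduced without touching the real database. The sketch below uses an in-memory SQLite database with a tiny made-up `batting` table (the player IDs and values are assumptions for illustration only) and runs the problematic statement against it.

```python
import sqlite3

# In-memory database with a small, hypothetical batting table
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE batting (playerID TEXT, SH INTEGER, GIDP INTEGER)")
conn.executemany(
    "INSERT INTO batting VALUES (?, ?, ?)",
    [
        ('p1', 1, None),     # real SH value, GIDP is NULL
        ('p2', None, 5),     # SH is NULL, real GIDP value
        ('p3', None, None),  # both NULL
        ('p4', 2, 3),        # neither NULL
    ],
)

# The statement from the practice question
conn.execute("UPDATE batting SET SH = 0, GIDP = 0 WHERE GIDP IS NULL OR SH IS NULL")

rows = conn.execute(
    "SELECT playerID, SH, GIDP FROM batting ORDER BY playerID"
).fetchall()
print(rows)
# [('p1', 0, 0), ('p2', 0, 0), ('p3', 0, 0), ('p4', 2, 3)]
```

Player `p1` lost SH = 1 and player `p2` lost GIDP = 5; only `p4`, which had no NULLs at all, kept its data.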

____

Can we test this with Pandas?  Yep!

In [6]:
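#  SH OR GIDP is NULL 
shOrGidpNull =  len ( batting[  batting['SH'].isnull() | batting['GIDP'].isnull() ] ) 
print ("Number of OR rows = ", shOrGidpNull)

# SH AND GIDP are NULL
shAndGidpNull =  len ( batting[  batting['SH'].isnull() & batting['GIDP'].isnull() ] ) 
print ("Number of AND rows = ", shAndGidpNull)


For SH and GIDP, the two printed counts must add up to 11487 + 31257 = 42744, because a row that is NULL in both columns is counted once by the OR filter and once by the AND filter.

Running the same cell with SF in place of SH prints:

Number of OR rows =  41186
Number of AND rows =  31252

These sum to 41181 + 31257 = 72438, the SF and GIDP NULL counts.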


The key concepts to take away from this practice exercise are these:
  1. Be careful with the WHERE clause in SQL!  Especially during UPDATE and DELETE.
  1. When you update multiple columns in one statement, be sure the same WHERE condition is correct for every column being set.  
  
In our case, the UPDATE to set SH = 0 when it is NULL is **independent** of the UPDATE to set GIDP = 0 when it is NULL.
**Therefore, we should use two distinct UPDATE commands!**
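
A minimal sketch of the fix, again on an in-memory SQLite database with a made-up `batting` table (the `fresh_table` helper and its rows are hypothetical, for illustration only). Option 1 is the two independent UPDATE statements. Option 2 is one possible single-statement alternative using SQL's `COALESCE`, which returns its first non-NULL argument and so leaves existing values alone.

```python
import sqlite3

def fresh_table():
    """Return an in-memory database holding a small, made-up batting table."""
    conn = sqlite3.connect(':memory:')
    conn.execute("CREATE TABLE batting (playerID TEXT, SH INTEGER, GIDP INTEGER)")
    conn.executemany(
        "INSERT INTO batting VALUES (?, ?, ?)",
        [('p1', 1, None), ('p2', None, 5), ('p3', None, None), ('p4', 2, 3)],
    )
    return conn

query = "SELECT playerID, SH, GIDP FROM batting ORDER BY playerID"

# Option 1: two independent UPDATE statements, one per column
conn = fresh_table()
conn.execute("UPDATE batting SET SH = 0 WHERE SH IS NULL")
conn.execute("UPDATE batting SET GIDP = 0 WHERE GIDP IS NULL")
two_updates = conn.execute(query).fetchall()

# Option 2: one statement; COALESCE keeps any value that is already non-NULL
conn = fresh_table()
conn.execute(
    "UPDATE batting SET SH = COALESCE(SH, 0), GIDP = COALESCE(GIDP, 0) "
    "WHERE SH IS NULL OR GIDP IS NULL"
)
one_update = conn.execute(query).fetchall()

print(two_updates)                 # [('p1', 1, 0), ('p2', 0, 5), ('p3', 0, 0), ('p4', 2, 3)]
print(one_update == two_updates)   # True
```

In both options the real values SH = 1 for `p1` and GIDP = 5 for `p2` survive, and only the NULLs become 0.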