## Introduction to Data Validation

One of the most important parts of starting a data science project is understanding the data. This exercise walks through some techniques for exploring the data, checking assumptions, and validating it.

In [2]:
# Load the SQL magic extension
%load_ext sql
# Connect to the default database (using SQLAlchemy)
%sql postgresql://localhost/postgres
# Truncate output of your queries so that it's not blowing up the notebook
%config SqlMagic.displaylimit = 10

### Data and Table Schemas

The tables used in all of the exercises come from the PostgreSQL DVD rental sample database. The schema of the tables and how they link together is shown below. Under each table name, the fields marked with an asterisk make up the table's [primary key](#Primary-Key).

Maybe we'll learn what happened to Blockbuster through this exercise?
You can find more documentation on the data set [here](http://www.postgresqltutorial.com/postgresql-sample-database/).

<img src="http://www.postgresqltutorial.com/wp-content/uploads/2018/03/dvd-rental-sample-database-diagram.png">

### Primary Key

More often than not, a data table should have a primary key. This is typically one attribute, or a combination of attributes, that uniquely identifies a row. The primary key ultimately determines the granularity of the table.

In the schema above, notice that every table has a primary key (the tables `film_category` and `film_actor` each have a primary key made up of two fields).
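For a two-field key, "unique" means that no *pair* of values repeats, even though each field on its own can repeat. Here is a minimal sketch of that idea. It runs on a made-up table (`film_actor_toy`, with invented rows) in an in-memory SQLite database rather than on the Postgres data, so treat the table name and values as assumptions for illustration only.

```python
import sqlite3

# Hypothetical toy table standing in for film_actor; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("create table film_actor_toy (actor_id integer, film_id integer)")
conn.executemany(
    "insert into film_actor_toy values (?, ?)",
    [(1, 10), (1, 11), (2, 10), (2, 12)],
)

# Compare the row count with the number of distinct (actor_id, film_id) pairs.
number_of_rows, distinct_pairs = conn.execute(
    """
    select
      (select count(*) from film_actor_toy) as number_of_rows,
      (select count(*)
         from (select distinct actor_id, film_id from film_actor_toy) as pairs
      ) as distinct_pairs
    """
).fetchone()

# Each field alone repeats, so neither is a key by itself.
distinct_actor_ids = conn.execute(
    "select count(distinct actor_id) from film_actor_toy"
).fetchone()[0]

print(number_of_rows, distinct_pairs, distinct_actor_ids)
```

When the number of distinct pairs equals the number of rows, the pair behaves like a key for this table. The count of distinct `actor_id` values is smaller, which is why one field alone is not enough.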

However, how do you know that the documentation is accurate? For situations like these, it is best to verify for yourself before moving forward. To start validating, understand the dimensions of the table and what fields exist. Let's start by examining the rental table. 

In [None]:
%%sql
-- Select all fields from the table to look at each one
-- Does anything in particular look odd to you?
select * from rental limit 5;

How many rows does the rental data have? You can use two functions here:

[`sum(1)`](https://www.techonthenet.com/sql/sum.php) adds up the value 1 for each row in the data set <br>
[`count(*)`](https://www.techonthenet.com/sql/count.php) counts all of the rows
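The two functions agree whenever there is at least one row, but they differ when nothing matches: `count(*)` returns 0, while `sum(1)` has nothing to add up and returns NULL. A small sketch of that difference follows. It uses a made-up table (`toy_rental`) in an in-memory SQLite database, not the actual `rental` table, so the names and rows are assumptions.

```python
import sqlite3

# Hypothetical toy table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("create table toy_rental (rental_id integer, customer_id integer)")
conn.executemany(
    "insert into toy_rental values (?, ?)",
    [(1, 7), (2, 7), (3, 9)],
)

# With rows present, both approaches give the same number.
all_rows = conn.execute(
    "select sum(1), count(*) from toy_rental"
).fetchone()

# With no matching rows, sum(1) is NULL (None in Python) and count(*) is 0.
no_rows = conn.execute(
    "select sum(1), count(*) from toy_rental where customer_id = -1"
).fetchone()

print(all_rows)  # (3, 3)
print(no_rows)   # (None, 0)
```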

In [None]:
%%sql
-- How many rows does the rental data set have?
-- Enter your query here!

How do you count distinct values? Sometimes rows will have duplicate values and in that case we don't want to double count. We can use the clause [`distinct`](https://www.techonthenet.com/sql/distinct.php) here. Note that `distinct` is not a function but a keyword, like `select`, `where`, and `from`.

In [None]:
%%sql
-- Investigating distinct values for rental_id to confirm primary key
select
  sum(1) as number_of_rows,
  count(distinct rental_id) as rental_id_count
from
  rental

Awesome! So we checked whether `rental_id` is the correct granularity for the `rental` table: if the two counts match, every row has its own `rental_id`. Now that we know the granularity of the rental table, the other types of data validation will depend on the exact fields in the table. Let's look at the table one more time.

In [None]:
%%sql
select * from rental limit 5;

Note that for the first few rows, `rental_date` is much later than `return_date`. What does this mean? Check whether this happens in other rows as well. Should the manager be concerned that our data is wrong?

One of the keys to data validation and data cleaning is understanding the impact of the bugs we find. The table has roughly 16K rows in total. How many of them have a suspiciously early `return_date`? What percent of the total rows have this bug?

In [None]:
%%sql
-- How many rows contain the bug?
-- Enter your query here!

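SQL can also perform mathematical operations. See the example below, then try your hand at sizing the impact of the bug we found. Note that dividing one integer by another performs integer division, which is why `1 / 2` comes back as 0 in the output.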

In [6]:
%%sql
select
  5 * 7 as thirty_five,
  1 / 2 as one_half
from
  rental
limit 5

 * postgresql://localhost/postgres
5 rows affected.


thirty_five,one_half
35,0
35,0
35,0
35,0
35,0
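The `one_half` column is 0 because both operands are integers. Making one operand a decimal avoids the truncation, which matters when you compute a percentage from two counts. The sketch below shows this using an in-memory SQLite database as a stand-in; Postgres follows the same integer-division rule, though the exact numeric type it returns may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Integer / integer truncates; a decimal operand keeps the fraction.
integer_division, decimal_division, percent = conn.execute(
    """
    select
      1 / 2            as integer_division,
      1.0 / 2          as decimal_division,
      100.0 * 3 / 12   as percent
    """
).fetchone()

print(integer_division, decimal_division, percent)  # 0 0.5 25.0
```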


In [None]:
%%sql
-- What percent of total rows have this bug?
-- Enter your query here!

Do you think this bug will affect top-level metrics? As a rule of thumb, a bug that touches ~10% or more of the rows is pretty impactful. If the impact is <1%, don't worry about it unless you are filtering down to a small slice of the dataset where that proportion becomes much larger.


Fortunately (and unfortunately), the issue we found here is not a real bug. In this particular data table, `return_date` is set to a default date (in this case, in the 1900s) for rentals that have not been returned yet. Details like this are often not documented alongside the table itself. Many of our clients have an internal wiki page that details these pitfalls and traps. Always ask for context!
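Once you know a placeholder value exists, it helps to measure how many rows carry it before deciding how to treat them. The sketch below does this on a made-up table in an in-memory SQLite database. The table name `toy_rental`, the rows, and the placeholder date `1900-01-01` are all assumptions for illustration; the real default in your data may differ.

```python
import sqlite3

# Hypothetical toy table; dates are stored as ISO text so they sort correctly.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table toy_rental (rental_id integer, rental_date text, return_date text)"
)
conn.executemany(
    "insert into toy_rental values (?, ?, ?)",
    [
        (1, "2005-05-24", "2005-05-26"),
        (2, "2005-05-24", "1900-01-01"),  # assumed placeholder: not returned yet
        (3, "2005-05-25", "2005-06-01"),
        (4, "2005-05-25", "2005-05-28"),
    ],
)

# Count rows returned "before" they were rented and express them as a percent.
early_returns, number_of_rows, pct_early = conn.execute(
    """
    select
      sum(case when return_date < rental_date then 1 else 0 end) as early_returns,
      count(*) as number_of_rows,
      100.0 * sum(case when return_date < rental_date then 1 else 0 end)
        / count(*) as pct_early
    from toy_rental
    """
).fetchone()

print(early_returns, number_of_rows, pct_early)  # 1 4 25.0
```

Multiplying by `100.0` rather than `100` keeps the division from truncating to a whole number.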

### Distributions

In addition to understanding the granularity of the table, data validation combines spot checks of single rows through manual examination with sanity checks at a more aggregated level. This typically involves understanding the distribution of a variable across the entire data set. We will go over distributions when we get to the grouping chapter later on.