## DATA VALIDATION

This notebook loads every data set used in this SQL tool and runs quick QA checks on each one to confirm that the data is as dirty as expected, and no dirtier. Any unexpected irregularities should be noted and fixed.

In [11]:
%load_ext sql
%sql postgresql://localhost/postgres
%config SqlMagic.displaylimit=10

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


### TABLE OF CONTENTS
1. [ACTOR](#ACTOR)
2. [FILM_ACTOR](#FILM_ACTOR)
3. [RENTAL](#RENTAL)
4. [FILM](#FILM)
5. [LANGUAGE](#LANGUAGE)
6. [FILM_CATEGORY](#FILM_CATEGORY)
7. [CATEGORY](#CATEGORY)

### ACTOR
* Looks at overall data schema and performs basic distinct counts
* `last_update` values can all be identical or can differ (assume that actors are manually maintained by an intern, heh)
* Expect dirty data (duplicate actors in the system)

In [22]:
%%sql
select * from actor

 * postgresql://localhost/postgres
200 rows affected.


actor_id,first_name,last_name,last_update
1,Penelope,Guiness,2006-02-15 10:05:00
2,Nick,Wahlberg,2006-02-15 10:05:00
3,Ed,Chase,2006-02-15 10:05:00
4,Jennifer,Davis,2006-02-15 10:05:00
5,Johnny,Lollobrigida,2006-02-15 10:05:00
6,Bette,Nicholson,2006-02-15 10:05:00
7,Grace,Mostel,2006-02-15 10:05:00
8,Matthew,Johansson,2006-02-15 10:05:00
9,Joe,Swank,2006-02-15 10:05:00
10,Christian,Gable,2006-02-15 10:05:00


In [23]:
%%sql
select count(*) from actor

 * postgresql://localhost/postgres
1 rows affected.


count
200


In [24]:
%%sql
-- Would be good if we had more than one duplicate
select sum(1) from (select distinct first_name, last_name from actor) unique_actors

 * postgresql://localhost/postgres
1 rows affected.


sum
199


In [25]:
%%sql
select first_name, last_name from actor group by first_name, last_name having count(*) > 1

 * postgresql://localhost/postgres
1 rows affected.


first_name,last_name
Laura,VerHulst


In [26]:
%%sql
-- Looks like the actor_ids are messed up!
select * from actor where first_name = 'Laura' and last_name = 'VerHulst'

 * postgresql://localhost/postgres
2 rows affected.


actor_id,first_name,last_name,last_update
101,Laura,VerHulst,2006-02-15 10:05:00
110,Laura,VerHulst,2006-02-15 10:05:00
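
The duplicate check above can be folded into a single query that also reports which ids share a name. This is a minimal sketch, not part of the original notebook: it uses Python's built-in `sqlite3` with an in-memory toy table instead of the PostgreSQL connection, and the four rows are copied from the outputs above.

```python
import sqlite3

# Toy in-memory stand-in for the actor table (sqlite3, not the notebook's PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("create table actor (actor_id integer, first_name text, last_name text)")
conn.executemany(
    "insert into actor values (?, ?, ?)",
    [
        (1, "Penelope", "Guiness"),
        (2, "Nick", "Wahlberg"),
        (101, "Laura", "VerHulst"),
        (110, "Laura", "VerHulst"),
    ],
)

# One row per duplicated name, with the number of copies and the id range.
duplicate_names = conn.execute(
    """
    select first_name, last_name, count(*) as copies,
           min(actor_id) as lowest_id, max(actor_id) as highest_id
    from actor
    group by first_name, last_name
    having count(*) > 1
    """
).fetchall()
print(duplicate_names)  # [('Laura', 'VerHulst', 2, 101, 110)]
```

With only one expected duplicate, `min` and `max` are enough to show both ids; a name with three or more copies would need a follow-up query to list them all.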


### FILM_ACTOR
* Looks at overall data schema and performs basic distinct counts
* Make sure that the film-to-actor mapping includes our dirty actor (actor_ids 101 and 110)

In [3]:
%%sql
select * from film_actor

 * postgresql://localhost/postgres
5462 rows affected.


actor_id,film_id,last_update
1,1,2006-02-15 10:05:00
1,23,2006-02-15 10:05:00
1,25,2006-02-15 10:05:00
1,106,2006-02-15 10:05:00
1,140,2006-02-15 10:05:00
1,166,2006-02-15 10:05:00
1,277,2006-02-15 10:05:00
1,361,2006-02-15 10:05:00
1,438,2006-02-15 10:05:00
1,499,2006-02-15 10:05:00


In [12]:
%%sql
select 
  column_name, 
  data_type 
from 
  information_schema.columns
where 
  table_name = 'film_actor';

 * postgresql://localhost/postgres
3 rows affected.


column_name,data_type
actor_id,integer
film_id,integer
last_update,timestamp without time zone


In [28]:
%%sql
select sum(1) from film_actor

 * postgresql://localhost/postgres
1 rows affected.


sum
5462


In [29]:
%%sql
select count(distinct film_id) from film_actor

 * postgresql://localhost/postgres
1 rows affected.


count
997


In [30]:
%%sql
select count(distinct actor_id) from film_actor

 * postgresql://localhost/postgres
1 rows affected.


count
200


In [31]:
%%sql
select sum(1) from (select actor_id, film_id from film_actor group by actor_id, film_id) actor_pairs

 * postgresql://localhost/postgres
1 rows affected.


sum
5462


In [32]:
%%sql
select actor_id, film_id from film_actor group by actor_id, film_id having count(*) > 1

 * postgresql://localhost/postgres
0 rows affected.


actor_id,film_id


In [33]:
%%sql
select * from film_actor where actor_id in (110, 101) limit 10

 * postgresql://localhost/postgres
10 rows affected.


actor_id,film_id,last_update
101,60,2006-02-15 10:05:00
101,66,2006-02-15 10:05:00
101,85,2006-02-15 10:05:00
101,146,2006-02-15 10:05:00
101,189,2006-02-15 10:05:00
101,250,2006-02-15 10:05:00
101,255,2006-02-15 10:05:00
101,263,2006-02-15 10:05:00
101,275,2006-02-15 10:05:00
101,289,2006-02-15 10:05:00
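
Because of the `limit 10`, the output above only shows rows for actor 101, so it does not confirm that actor 110 is mapped to any films. Counting rows per id avoids that. The sketch below again assumes an in-memory `sqlite3` toy table; the rows for 101 come from the output above, while the rows for 110 are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table film_actor (actor_id integer, film_id integer)")
# Rows for actor 101 are taken from the output above; the rows for 110 are made up.
conn.executemany(
    "insert into film_actor values (?, ?)",
    [(101, 60), (101, 66), (101, 85), (110, 8), (110, 27), (1, 1)],
)

# One row per duplicate actor_id, so neither id can be hidden by a row limit.
films_per_duplicate_id = conn.execute(
    """
    select actor_id, count(*) as num_films
    from film_actor
    where actor_id in (101, 110)
    group by actor_id
    order by actor_id
    """
).fetchall()
print(films_per_duplicate_id)  # [(101, 3), (110, 2)]
```

If one of the two ids were missing from the real result, the duplicate actor would not be fully represented in the mapping.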


In [34]:
%%sql
select 
  min(num_actors) as min_num_actors,
  avg(num_actors) as avg_num_actors,
  max(num_actors) as max_num_actors
from
  (select film_id, count(distinct actor_id) as num_actors from film_actor group by film_id) actor_count

 * postgresql://localhost/postgres
1 rows affected.


min_num_actors,avg_num_actors,max_num_actors
1,5.478435305917753,15


### RENTAL
* Describes the rental activity of each DVD we have
* last_update values can be identical across rows (same intern assumption as in ACTOR)
* Looking for dirty data where return_date < rental_date

In [35]:
%%sql
select * from rental

 * postgresql://localhost/postgres
16044 rows affected.


rental_id,rental_date,inventory_id,customer_id,return_date,staff_id,last_update
1,2005-05-24 22:53:00,367,130,2005-05-26 22:04:00,1,2006-02-15 21:30:00
2,2005-05-24 22:54:00,1525,459,2005-05-28 19:40:00,1,2006-02-16 02:30:00
3,2005-05-24 23:03:00,1711,408,2005-06-01 22:12:00,1,2006-02-16 02:30:00
4,2005-05-24 23:04:00,2452,333,2005-06-03 01:43:00,2,2006-02-16 02:30:00
5,2005-05-24 23:05:00,2079,222,2005-06-02 04:33:00,1,2006-02-16 02:30:00
6,2005-05-24 23:08:00,2792,549,2005-05-27 01:32:00,1,2006-02-16 02:30:00
7,2005-05-24 23:11:00,3995,269,2005-05-29 20:34:00,2,2006-02-16 02:30:00
8,2005-05-24 23:31:00,2346,239,2005-05-27 23:33:00,2,2006-02-16 02:30:00
9,2005-05-25 00:00:00,2580,126,2005-05-28 00:22:00,1,2006-02-16 02:30:00
10,2005-05-25 00:02:00,1824,399,2005-05-31 22:44:00,2,2006-02-16 02:30:00


In [36]:
%%sql
select * from rental order by rental_id

 * postgresql://localhost/postgres
16044 rows affected.


rental_id,rental_date,inventory_id,customer_id,return_date,staff_id,last_update
1,2005-05-24 22:53:00,367,130,2005-05-26 22:04:00,1,2006-02-15 21:30:00
2,2005-05-24 22:54:00,1525,459,2005-05-28 19:40:00,1,2006-02-16 02:30:00
3,2005-05-24 23:03:00,1711,408,2005-06-01 22:12:00,1,2006-02-16 02:30:00
4,2005-05-24 23:04:00,2452,333,2005-06-03 01:43:00,2,2006-02-16 02:30:00
5,2005-05-24 23:05:00,2079,222,2005-06-02 04:33:00,1,2006-02-16 02:30:00
6,2005-05-24 23:08:00,2792,549,2005-05-27 01:32:00,1,2006-02-16 02:30:00
7,2005-05-24 23:11:00,3995,269,2005-05-29 20:34:00,2,2006-02-16 02:30:00
8,2005-05-24 23:31:00,2346,239,2005-05-27 23:33:00,2,2006-02-16 02:30:00
9,2005-05-25 00:00:00,2580,126,2005-05-28 00:22:00,1,2006-02-16 02:30:00
10,2005-05-25 00:02:00,1824,399,2005-05-31 22:44:00,2,2006-02-16 02:30:00


In [37]:
%%sql
select count(*) from rental

 * postgresql://localhost/postgres
1 rows affected.


count
16044


In [38]:
%%sql
select count(distinct rental_id) from rental

 * postgresql://localhost/postgres
1 rows affected.


count
16044


In [32]:
%%sql
-- Investigating distinct values to determine what identifies a unique row 
select 
    count(distinct rental_id) as rental_id_count,
    count(distinct rental_date) as rental_date_count,
    count(distinct inventory_id) as inventory_id_count, 
    count(distinct customer_id) as customer_id_count,
    count(distinct staff_id) as staff_id_count, 
    count(distinct last_update) as last_update_count
from rental

 * sqlite://
Done.


rental_id_count,rental_date_count,inventory_id_count,customer_id_count,staff_id_count,last_update_count
16044,13319,4580,599,2,3


In [44]:
%%sql
select * from rental where rental_date > return_date

 * postgresql://localhost/postgres
281 rows affected.


rental_id,rental_date,inventory_id,customer_id,return_date,staff_id,last_update
11611,2006-02-14 15:16:00,1857,192,1900-01-01 00:00:00,2,2006-02-16 02:30:00
11646,2006-02-14 15:16:00,478,11,1900-01-01 00:00:00,2,2006-02-16 02:30:00
11652,2006-02-14 15:16:00,1622,597,1900-01-01 00:00:00,2,2006-02-16 02:30:00
11657,2006-02-14 15:16:00,3043,53,1900-01-01 00:00:00,2,2006-02-16 02:30:00
11672,2006-02-14 15:16:00,3947,521,1900-01-01 00:00:00,2,2006-02-16 02:30:00
11676,2006-02-14 15:16:00,4496,216,1900-01-01 00:00:00,2,2006-02-16 02:30:00
11709,2006-02-14 15:16:00,1720,330,1900-01-01 00:00:00,1,2006-02-16 02:30:00
11739,2006-02-14 15:16:00,4568,373,1900-01-01 00:00:00,2,2006-02-16 02:30:00
11754,2006-02-14 15:16:00,3747,163,1900-01-01 00:00:00,2,2006-02-16 02:30:00
11757,2006-02-14 15:16:00,1295,550,1900-01-01 00:00:00,2,2006-02-16 02:30:00
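
The visible rows all share the return date `1900-01-01 00:00:00`, which looks like a placeholder rather than a real return. Grouping the suspect rows by `return_date` would show whether that holds for all 281 of them. This is a hedged sketch on an in-memory `sqlite3` toy table with three rows copied from the outputs above; timestamps are stored as text, which is an assumption of the sketch and not how the PostgreSQL table is typed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table rental (rental_id integer, rental_date text, return_date text)")
conn.executemany(
    "insert into rental values (?, ?, ?)",
    [
        (1, "2005-05-24 22:53:00", "2005-05-26 22:04:00"),
        (11611, "2006-02-14 15:16:00", "1900-01-01 00:00:00"),
        (11646, "2006-02-14 15:16:00", "1900-01-01 00:00:00"),
    ],
)

# ISO-formatted text timestamps sort chronologically, so string comparison works here.
suspect_return_dates = conn.execute(
    """
    select return_date, count(*) as num_rentals
    from rental
    where return_date < rental_date
    group by return_date
    order by num_rentals desc
    """
).fetchall()
print(suspect_return_dates)  # [('1900-01-01 00:00:00', 2)]
```

A single group in the real result would mean the dirty rows are all placeholder values; more than one group would point to genuinely inconsistent dates.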


In [48]:
%%sql
select 
  column_name, 
  data_type, 
  character_maximum_length
from 
  information_schema.columns 
where 
  table_name = 'rental';

 * postgresql://localhost/postgres
7 rows affected.


column_name,data_type,character_maximum_length
rental_id,integer,
rental_date,timestamp without time zone,
inventory_id,integer,
customer_id,integer,
return_date,timestamp without time zone,
staff_id,integer,
last_update,timestamp without time zone,


In [53]:
%%sql
select count(distinct rental_id) from rental

 * postgresql://localhost/postgres
1 rows affected.


count
16044


In [71]:
%%sql
-- Number of rentals per customer
select customer_id, count(*) as count
from rental
group by customer_id 
order by count desc

 * postgresql://localhost/postgres
599 rows affected.


customer_id,count
148,46
526,45
236,42
144,42
75,41
197,40
469,40
137,39
468,39
178,39


### FILM
* All of the films available from the DVD rental store
* Useful for playing with arrays and strings
* Wildcard matching is also a good topic for this data set

In [5]:
%%sql
select sum(case when upper(description) like '%MOOSE%' then 1 else 0 end) as moose_count, count(*) from film

 * postgresql://localhost/postgres
1 rows affected.


moose_count,count
80,1000


In [6]:
%%sql
select count(distinct film_id) from film

 * postgresql://localhost/postgres
1 rows affected.


count
1000


In [66]:
%%sql
select special_features[1] from film

 * postgresql://localhost/postgres
1000 rows affected.


special_features
Trailers
Behind the Scenes
Trailers
Trailers
Deleted Scenes
Trailers
Trailers
Commentaries
Deleted Scenes
Deleted Scenes


In [7]:
%%sql
select distinct special_features from film

 * postgresql://localhost/postgres
15 rows affected.


special_features
"['Commentaries', 'Deleted Scenes', 'Behind the Scenes']"
"['Trailers', 'Deleted Scenes']"
['Trailers']
"['Trailers', 'Commentaries', 'Deleted Scenes']"
"['Trailers', 'Commentaries', 'Behind the Scenes']"
['Commentaries']
"['Commentaries', 'Behind the Scenes']"
"['Trailers', 'Commentaries']"
"['Deleted Scenes', 'Behind the Scenes']"
"['Commentaries', 'Deleted Scenes']"


In [8]:
%%sql
select 
  special_features, 
  count(distinct film_id) as count 
from 
  (select film_id, unnest(special_features) as special_features from film) expand
group by 
  special_features
order by 
  count desc

 * postgresql://localhost/postgres
4 rows affected.


special_features,count
Commentaries,539
Behind the Scenes,538
Trailers,535
Deleted Scenes,503


In [9]:
%%sql
select rating, count(distinct film_id) from film group by rating

 * postgresql://localhost/postgres
5 rows affected.


rating,count
G,178
NC-17,210
PG,194
PG-13,223
R,195


In [10]:
%%sql
select rental_rate, count(distinct film_id) from film group by rental_rate

 * postgresql://localhost/postgres
3 rows affected.


rental_rate,count
0.99,341
2.99,323
4.99,336


### LANGUAGE
* Metadata table that lists all the languages of the films
* Joins onto the film data set
* This table shouldn't have any mistakes - the join might need to be looked at

In [82]:
%%sql
select * from language

 * postgresql://localhost/postgres
6 rows affected.


language_id,name,last_update
1,English,2006-02-15 10:02:00
2,Italian,2006-02-15 10:02:00
3,Japanese,2006-02-15 10:02:00
4,Mandarin,2006-02-15 10:02:00
5,French,2006-02-15 10:02:00
6,German,2006-02-15 10:02:00


In [90]:
%%sql
-- Why aren't there films in other languages?
select 
  name as film_language,
  count(distinct film_id)
from 
  film
left join
  language
on film.language_id = language.language_id
group by
  name

 * postgresql://localhost/postgres
1 rows affected.


film_language,count
English,1000


### FILM_CATEGORY
* Denotes what category each film belongs in
* Each film should have only one category (which is not necessarily true in the real world)
* Checks that the films and categories referenced here also appear in the other data sets

In [98]:
%%sql
select 
  cast(film_id as int) as film_id, 
  cast(category_id as int) as category_id 
from 
  film_category
group by 
  film_id, 
  category_id
having 
  count(*) > 1

 * postgresql://localhost/postgres
0 rows affected.


film_id,category_id
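
The query above rules out repeated (film_id, category_id) pairs, but a film could still map to two different categories. Grouping by `film_id` alone checks the one-category-per-film expectation directly. The sketch below assumes an in-memory `sqlite3` toy table with hypothetical rows; film 3 is deliberately given two categories so the check has something to find.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Columns are floating point here to mirror the double precision types in the real table.
conn.execute("create table film_category (film_id real, category_id real)")
# Hypothetical rows: film 3 is intentionally assigned two categories.
conn.executemany(
    "insert into film_category values (?, ?)",
    [(1.0, 6.0), (2.0, 11.0), (3.0, 6.0), (3.0, 9.0)],
)

# Films that appear with more than one distinct category.
multi_category_films = conn.execute(
    """
    select cast(film_id as integer) as film_id,
           count(distinct category_id) as num_categories
    from film_category
    group by film_id
    having count(distinct category_id) > 1
    """
).fetchall()
print(multi_category_films)  # [(3, 2)]
```

On the real table the expectation stated above is that this returns no rows.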


In [13]:
%%sql
select 
  column_name, 
  data_type 
from 
  information_schema.columns
where 
  table_name = 'film_category';

 * postgresql://localhost/postgres
3 rows affected.


column_name,data_type
film_id,double precision
category_id,double precision
last_update,timestamp without time zone


In [101]:
%%sql
select * from film_category where cast(film_id as int) not in (select distinct film_id from film)

 * postgresql://localhost/postgres
0 rows affected.


film_id,category_id,last_update


In [102]:
%%sql
select * from film_category where cast(category_id as int) not in (select distinct category_id from category)

 * postgresql://localhost/postgres
0 rows affected.


film_id,category_id,last_update
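
One caveat on the two `not in` checks above: if the subquery ever returns a NULL, `not in` yields no rows at all, so an empty result would not prove that every key matches. A `not exists` version is not affected. The sketch below demonstrates the difference on in-memory `sqlite3` toy tables with hypothetical rows, including a NULL category_id and an unmatched category 99 that are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table category (category_id integer)")
conn.execute("create table film_category (film_id real, category_id real)")
# Hypothetical rows: category contains a NULL id, and film 11 points at a missing category.
conn.executemany("insert into category values (?)", [(1,), (2,), (None,)])
conn.executemany("insert into film_category values (?, ?)", [(10.0, 1.0), (11.0, 99.0)])

# NOT IN: the NULL in the subquery makes the comparison unknown, so nothing is returned.
not_in_rows = conn.execute(
    "select film_id from film_category "
    "where cast(category_id as integer) not in (select category_id from category)"
).fetchall()

# NOT EXISTS: the unmatched row is reported regardless of NULLs in category.
not_exists_rows = conn.execute(
    """
    select fc.film_id
    from film_category fc
    where not exists (
        select 1 from category c
        where c.category_id = cast(fc.category_id as integer)
    )
    """
).fetchall()
print(not_in_rows)      # []
print(not_exists_rows)  # [(11.0,)]
```

If the key columns in `film` and `category` cannot be NULL, the `not in` checks above are fine as written.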


### CATEGORY
* Straightforward data set
* No data cleaning required
* last_update can be the same time

In [103]:
%%sql
select * from category

 * postgresql://localhost/postgres
16 rows affected.


category_id,name,last_update
1,Action,2006-02-15 09:46:00
2,Animation,2006-02-15 09:46:00
3,Children,2006-02-15 09:46:00
4,Classics,2006-02-15 09:46:00
5,Comedy,2006-02-15 09:46:00
6,Documentary,2006-02-15 09:46:00
7,Drama,2006-02-15 09:46:00
8,Family,2006-02-15 09:46:00
9,Foreign,2006-02-15 09:46:00
10,Games,2006-02-15 09:46:00


In [106]:
%%sql
select * from category where category_id not in (select distinct cast(category_id as int) from film_category)

 * postgresql://localhost/postgres
0 rows affected.


category_id,name,last_update
