# Data Quality Check
This notebook reports the following information about the tables:  
[1. Table info, including number of rows, diststyle, distkey, and sortkey](#1.-Table-Info)  
[2. The number of primary key values for each dimension table](#2.-The-number-of-primary-key-values-for-each-dimension-table)  
[3. The sample data of all tables](#3.-The-sample-data-of-all-tables)

In [1]:
import configparser
import psycopg2
from time import time
from iac import get_endpoint

In [2]:
%load_ext sql

In [3]:
# Read the cluster settings from the local config file
config = configparser.ConfigParser()
config.read('dwh.cfg')

DWH_DB = config.get("CLUSTER", "DB_NAME")
DWH_DB_USER = config.get("CLUSTER", "DB_USER")
DWH_DB_PASSWORD = config.get("CLUSTER", "DB_PASSWORD")
DWH_PORT = config.get("CLUSTER", "DB_PORT")

# Look up the cluster host (helper from the local iac module) and build the connection URL
host = get_endpoint()
conn_string = "postgresql://{}:{}@{}:{}/{}".format(DWH_DB_USER, DWH_DB_PASSWORD, host, DWH_PORT, DWH_DB)

# Point the %sql magic at the cluster
%sql $conn_string

'Connected: dwhuser@dwh'
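
The connection string above interpolates the user name and password directly into a URL. If the password ever contains characters such as `@`, `/`, or `:`, the URL may be parsed incorrectly. A minimal sketch of one way to guard against this, assuming percent-encoding with the standard library is acceptable here; `build_conn_string` is a hypothetical helper name and the sample values are made up:

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host, port, db):
    """Build a postgresql:// URL, percent-encoding the user and password."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, db
    )


# Made-up values for illustration only (not taken from dwh.cfg)
example = build_conn_string("dwhuser", "p@ss/word:1", "example-host", "5439", "dwh")
print(example)
```

In the notebook, the same helper could be called with `DWH_DB_USER`, `DWH_DB_PASSWORD`, `host`, `DWH_PORT`, and `DWH_DB` in place of the sample values.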

## 1. Table Info

In [4]:
%%sql
SELECT table_id,"table" tablename,schema schemaname,tbl_rows,unsorted,sortkey1,sortkey_num,diststyle 
FROM svv_table_info
ORDER BY schemaname, tablename

 * postgresql://dwhuser:***@dwhcluster.crymgwo1esz3.us-west-2.redshift.amazonaws.com:5439/dwh
7 rows affected.


table_id,tablename,schemaname,tbl_rows,unsorted,sortkey1,sortkey_num,diststyle
101379,artists,dist,9553,0.0,artist_id,1,ALL
101395,songplays,dist,309,0.0,start_time,1,KEY(song_id)
101383,songs,dist,14896,,,0,KEY(song_id)
101371,staging_events,dist,8056,,,0,AUTO(ALL)
101373,staging_songs,dist,14896,,,0,AUTO(EVEN)
101391,time,dist,6813,0.0,start_time,1,ALL
101375,users,dist,96,0.0,user_id,1,ALL


- **SONGPLAYS TABLE (FACT TABLE)**  

>SORTKEY : `start_time`  
>REASON : When a user's play list is loaded, rows are likely to be queried in descending order of `start_time`.
>
>DISTKEY : `song_id`    
>REASON : The `song_id` column has the highest cardinality among the fact table's foreign keys.

- **USERS TABLE**

> DISTSTYLE : all  
> REASON : Table size is small.

- **SONGS TABLE**

> DISTSTYLE : key  
> REASON : `song_id` is the DISTKEY, matching the DISTKEY of `songplays`, so joins on `song_id` can be collocated.

- **ARTISTS TABLE**

> DISTSTYLE : all  
> REASON : Table size is small.

- **TIME TABLE**

> DISTSTYLE : all  
> REASON : Table size is small.
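
The distribution choices listed above can be checked mechanically against the `svv_table_info` output. The following is a minimal sketch, not part of the original notebook: it copies the `diststyle` values of the fact and dimension tables from the output above, then confirms that `songplays` and `songs` share a distribution key and lists the tables replicated with `ALL`. The helper name `dist_key` is hypothetical.

```python
import re

# (table, diststyle) pairs copied from the svv_table_info output above
DISTSTYLES = {
    "artists": "ALL",
    "songplays": "KEY(song_id)",
    "songs": "KEY(song_id)",
    "time": "ALL",
    "users": "ALL",
}


def dist_key(diststyle):
    """Return the distribution key column for a KEY(...) style, else None."""
    match = re.fullmatch(r"KEY\((\w+)\)", diststyle)
    return match.group(1) if match else None


keys = {table: dist_key(style) for table, style in DISTSTYLES.items()}

# The fact table and the songs table should be distributed on the same column
collocated = keys["songplays"] is not None and keys["songplays"] == keys["songs"]

# Small dimension tables are expected to be replicated to every node
replicated = sorted(t for t, style in DISTSTYLES.items() if style == "ALL")

print(collocated, replicated)
```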

## 2. The number of primary key values for each dimension table
Amazon Redshift does not enforce primary key constraints, so duplicate values must be eliminated explicitly when inserting.  
It is therefore also necessary to **check that each primary key value is unique** after insertion.  
The following query compares the number of distinct key values in the staging tables (before insertion) with the number in the dimension tables (after insertion). A dimension table has no duplicate keys when its distinct count also equals its `tbl_rows` value in section 1.

In [5]:
%%sql
SET search_path TO dist;

(SELECT 'artists' as table, 'artist_id' as column, count(distinct artist_id)
FROM artists
UNION
SELECT 'staging_songs', 'artist_id', count(distinct artist_id)
FROM staging_songs
UNION
SELECT 'songs', 'song_id', count(distinct song_id)
FROM songs
UNION
SELECT 'staging_songs', 'song_id', count(distinct song_id)
FROM staging_songs
UNION
SELECT 'users', 'user_id', count(distinct user_id)
FROM users
UNION
SELECT 'staging_events', 'user_id', count(distinct userId)
FROM staging_events
WHERE page = 'NextSong'
UNION
SELECT 'staging_events', 'start_time', count(distinct ts)
FROM staging_events
WHERE page = 'NextSong'
UNION
SELECT 'time', 'start_time', count(distinct start_time)
FROM time)
ORDER BY 2

 * postgresql://dwhuser:***@dwhcluster.crymgwo1esz3.us-west-2.redshift.amazonaws.com:5439/dwh
Done.
8 rows affected.


table,column,count
staging_songs,artist_id,9553
artists,artist_id,9553
staging_songs,song_id,14896
songs,song_id,14896
time,start_time,6813
staging_events,start_time,6813
staging_events,user_id,96
users,user_id,96
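
The comparison can also be automated rather than read by eye. The sketch below is an illustration under stated assumptions: the counts are copied by hand from the query output above and from the `tbl_rows` column in section 1, and the function names are hypothetical. Note that `tbl_rows` in `svv_table_info` includes rows marked for deletion but not yet vacuumed, so a `COUNT(*)` query would be a stricter source for the row counts.

```python
# (table, column, distinct_count) rows copied from the query output above
DISTINCT_COUNTS = [
    ("staging_songs", "artist_id", 9553),
    ("artists", "artist_id", 9553),
    ("staging_songs", "song_id", 14896),
    ("songs", "song_id", 14896),
    ("time", "start_time", 6813),
    ("staging_events", "start_time", 6813),
    ("staging_events", "user_id", 96),
    ("users", "user_id", 96),
]

# tbl_rows of the dimension tables, copied from the svv_table_info output in section 1
TBL_ROWS = {"artists": 9553, "songs": 14896, "time": 6813, "users": 96}


def mismatched_columns(rows):
    """Group counts by key column and return the columns whose tables disagree."""
    by_column = {}
    for table, column, count in rows:
        by_column.setdefault(column, {})[table] = count
    return {
        column: counts
        for column, counts in by_column.items()
        if len(set(counts.values())) > 1
    }


def tables_with_duplicates(rows, tbl_rows):
    """Return dimension tables whose row count differs from their distinct key count."""
    return sorted(
        table for table, _, count in rows
        if table in tbl_rows and tbl_rows[table] != count
    )


print(mismatched_columns(DISTINCT_COUNTS))
print(tables_with_duplicates(DISTINCT_COUNTS, TBL_ROWS))
```

An empty result from both functions means that no key values were lost between staging and the dimension tables and that no dimension table holds duplicate keys.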


## 3. The sample data of all tables
The following queries show five sample rows from the fact table and from each dimension table.

In [6]:
%%sql
SET enable_result_cache_for_session TO OFF;
SET search_path TO dist;

SELECT * FROM songplays
LIMIT 5;

 * postgresql://dwhuser:***@dwhcluster.crymgwo1esz3.us-west-2.redshift.amazonaws.com:5439/dwh
Done.
5 rows affected.


songplay_id,start_time,user_id,level,song_id,artist_id,session_id,location,user_agent
64,2018-11-02 18:36:53,71,free,SOBBZPM12AB017DF4B,ARH6W4X1187B99274F,70,"Columbia, SC","""Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_1 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D201 Safari/9537.53"""
48,2018-11-03 18:19:10,95,paid,SOPANEB12A8C13E81E,ARSW5F51187FB4CFC9,152,"Winston-Salem, NC","""Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53"""
256,2018-11-05 01:58:24,44,paid,SOHMNPP12A58A7AE4B,ARKZ13R1187FB54FEE,237,"Waterloo-Cedar Falls, IA",Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Firefox/31.0
160,2018-11-06 20:12:11,97,paid,SODCQYZ12A6D4F9B26,ARYJ7KN1187B98CC73,293,"Lansing-East Lansing, MI","""Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"""
120,2018-11-07 15:41:10,15,paid,SOWEUOO12A6D4F6D0C,ARQUMH41187B9AF699,221,"Chicago-Naperville-Elgin, IL-IN-WI","""Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36"""


In [7]:
%%sql
SET enable_result_cache_for_session TO OFF;
SET search_path TO dist;

SELECT * FROM users
LIMIT 5;

 * postgresql://dwhuser:***@dwhcluster.crymgwo1esz3.us-west-2.redshift.amazonaws.com:5439/dwh
Done.
5 rows affected.


user_id,first_name,last_name,gender,level
10,Sylvie,Cruz,F,free
100,Adler,Barrera,M,free
101,Jayden,Fox,M,free
11,Christian,Porter,F,free
12,Austin,Rosales,M,free


In [8]:
%%sql
SET enable_result_cache_for_session TO OFF;
SET search_path TO dist;

SELECT * FROM artists
LIMIT 5;

 * postgresql://dwhuser:***@dwhcluster.crymgwo1esz3.us-west-2.redshift.amazonaws.com:5439/dwh
Done.
5 rows affected.


artist_id,name,location,latitude,longitude
AR00B1I1187FB433EB,Eagle-Eye Cherry,"Stockholm, Sweden",,
AR00DG71187B9B7FCB,Basslovers United,,,
AR00FVC1187FB5BE3E,Panda,"Monterrey, NL, México",25.0,
AR00JIO1187B9A5A15,Saigon,Brooklyn,40.0,
AR00LNI1187FB444A5,Bruce BecVar,,,


In [9]:
%%sql
SET enable_result_cache_for_session TO OFF;
SET search_path TO dist;

SELECT * FROM songs
LIMIT 5;

 * postgresql://dwhuser:***@dwhcluster.crymgwo1esz3.us-west-2.redshift.amazonaws.com:5439/dwh
Done.
5 rows affected.


song_id,title,artist_id,year,duration
SOYIJGH12AB018484E,Cosmic fusion,ARM3YXA1187B9B2598,0,447
SOQCVUD12A58A79372,I Live Off You,ARDXOWS1187FB5BAEE,1978,126
SOUBASN12AC468DB23,Income,ARCVOFZ1187FB58074,0,443
SOQKUVL12A8AE46636,Deep Black,ARA23XO1187B9AF18F,1987,176
SOFCGDW12A58A78012,Give It To Me (All Your Love),AR9VN011187B9ADD25,1968,135


In [10]:
%%sql
SET enable_result_cache_for_session TO OFF;
SET search_path TO dist;

SELECT * FROM time
LIMIT 5;

 * postgresql://dwhuser:***@dwhcluster.crymgwo1esz3.us-west-2.redshift.amazonaws.com:5439/dwh
Done.
5 rows affected.


start_time,hour,day,weekofyear,month,year,weekday
2018-11-01 21:01:46,21,1,44,11,2018,4
2018-11-01 21:05:52,21,1,44,11,2018,4
2018-11-01 21:08:16,21,1,44,11,2018,4
2018-11-01 21:11:13,21,1,44,11,2018,4
2018-11-01 21:17:33,21,1,44,11,2018,4
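
The derived columns of the `time` table can be spot-checked against `start_time`. This is a minimal sketch under two assumptions that the notebook does not confirm: that `weekofyear` is the ISO week number, and that `weekday` counts from Sunday = 0. The ETL that fills the table is not shown here, so adjust the function if it uses other conventions. `expected_time_fields` is a hypothetical helper name.

```python
from datetime import datetime


def expected_time_fields(start_time):
    """Derive (hour, day, weekofyear, month, year, weekday) from a timestamp string.

    Assumes ISO week numbering and a Sunday = 0 weekday convention.
    """
    ts = datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S")
    week = ts.isocalendar()[1]
    weekday = (ts.weekday() + 1) % 7  # Python uses Monday = 0; shift to Sunday = 0
    return (ts.hour, ts.day, week, ts.month, ts.year, weekday)


# First sample row copied from the output above
sample = ("2018-11-01 21:01:46", 21, 1, 44, 11, 2018, 4)

is_consistent = expected_time_fields(sample[0]) == sample[1:]
print(is_consistent)
```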
