# STEP 6: Repeat the computation using the fact and dimension tables

Note: You will not have to write any code in this notebook. Its only purpose is to illustrate the performance difference between the star and 3NF schemas.

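Start by creating and loading the `pagila_star` database if it does not exist yet (uncomment the commands in the first cell below), then run the following cell to connect to it.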

In [None]:
# !PGPASSWORD=student createdb -h 127.0.0.1 -U student pagila_star
# !PGPASSWORD=student psql -q -h 127.0.0.1 -U student -d pagila_star -f Data/pagila-data.sql

# !PGPASSWORD=sigeMund67 createdb -h 127.0.0.1 -U hynso pagila_star
# !PGPASSWORD=sigeMund67 psql -q -h 127.0.0.1 -U hynso -d pagila_star -f Data/pagila-data.sql

In [1]:
%load_ext sql

# DB_ENDPOINT = "127.0.0.1"
# DB = 'pagila_star'
# DB_USER = 'student'
# DB_PASSWORD = 'student'
# DB_PORT = '5432'

DB_ENDPOINT = '127.0.0.1'
DB = 'pagila_star'
DB_USER = 'hynso'
DB_PASSWORD = 'sigeMund67'
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
                        .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)

print(conn_string)
%sql $conn_string

postgresql://hynso:sigeMund67@127.0.0.1:5432/pagila_star


'Connected: hynso@pagila_star'

## 6.1 The fact table holds keys to all the needed dimensions, so no deep joins are required

In [2]:
%%time
%%sql
SELECT 
  movie_key, 
  date_key, 
  customer_key, 
  sales_amount
FROM factSales 
LIMIT 5;

 * postgresql://hynso:***@127.0.0.1:5432/pagila_star
5 rows affected.
CPU times: user 8.64 ms, sys: 132 µs, total: 8.78 ms
Wall time: 7.94 ms


movie_key,date_key,customer_key,sales_amount
870,20170124,269,1.99
651,20170125,269,0.99
818,20170128,269,6.99
249,20170129,269,0.99
159,20170129,269,4.99
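
To make the "one join away" idea concrete without touching the Postgres database, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names mirror `factSales` and `dimMovie` from this notebook, but the rows are made up for illustration and are not taken from Pagila.

```python
import sqlite3

# Toy in-memory tables: names mirror the notebook, rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimMovie (movie_key INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE factSales (movie_key INTEGER, date_key INTEGER,
                        customer_key INTEGER, sales_amount REAL);
INSERT INTO dimMovie VALUES (1, 'MOVIE A'), (2, 'MOVIE B');
INSERT INTO factSales VALUES (1, 20170124, 10, 1.99),
                             (1, 20170125, 11, 0.99),
                             (2, 20170125, 10, 4.99);
""")

# The fact row already carries movie_key, so the title is a single join away.
rows = conn.execute("""
SELECT m.title, ROUND(SUM(f.sales_amount), 2) AS revenue
FROM factSales f
JOIN dimMovie m ON m.movie_key = f.movie_key
GROUP BY m.title
ORDER BY m.title
""").fetchall()

for title, revenue in rows:
    print(title, revenue)
```

Every other dimension (date, customer) would attach the same way, with one `JOIN` on its key column.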


## 6.2 Join fact table with dimensions to replace keys with attributes

As you run each cell, pay attention to the time that is printed. Which schema do you think will run faster?
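
A single `%%time` reading can be noisy (caching, connection overhead), so one way to compare more fairly is to repeat each query and look at the median. The sketch below is only an illustration: `median_wall_time_ms` is a hypothetical helper, and the lambda is a stand-in workload rather than a real database call.

```python
import statistics
import time

def median_wall_time_ms(run_query, repeats=5):
    """Call run_query() `repeats` times and return the median wall time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stand-in workload; in practice this would be a function that executes one query.
elapsed = median_wall_time_ms(lambda: sum(range(100_000)))
print(f"median wall time: {elapsed:.2f} ms")
```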

##### Star Schema

In [4]:
%%time
%%sql
SELECT 
  dimMovie.title, 
  dimDate.month, 
  dimCustomer.city, 
  SUM(sales_amount) AS revenue
FROM factSales
JOIN dimMovie    ON ( dimMovie.movie_key = factSales.movie_key )
JOIN dimDate     ON ( dimDate.date_key = factSales.date_key )
JOIN dimCustomer ON ( dimCustomer.customer_key = factSales.customer_key )
GROUP BY
(
  dimMovie.title, 
  dimDate.month, 
  dimCustomer.city
)
ORDER BY
  dimMovie.title, 
  dimDate.month, 
  dimCustomer.city, 
  revenue DESC
LIMIT 10;

 * postgresql://hynso:***@127.0.0.1:5432/pagila_star
10 rows affected.
CPU times: user 3.58 ms, sys: 840 µs, total: 4.42 ms
Wall time: 30.1 ms


title,month,city,revenue
ACADEMY DINOSAUR,1,Celaya,0.99
ACADEMY DINOSAUR,1,Cianjur,1.99
ACADEMY DINOSAUR,2,San Lorenzo,0.99
ACADEMY DINOSAUR,2,Sullana,1.99
ACADEMY DINOSAUR,2,Udaipur,0.99
ACADEMY DINOSAUR,3,Almirante Brown,1.99
ACADEMY DINOSAUR,3,Goinia,0.99
ACADEMY DINOSAUR,3,Kaliningrad,0.99
ACADEMY DINOSAUR,3,Kurashiki,0.99
ACADEMY DINOSAUR,3,Livorno,0.99


##### 3NF Schema

In [6]:
%%time
%%sql
SELECT 
  f.title, 
  EXTRACT(month FROM p.payment_date) 
                 AS month, 
  ci.city, 
  SUM(p.amount)  AS revenue
FROM payment   p
JOIN rental    r  ON ( p.rental_id = r.rental_id )
JOIN inventory i  ON ( r.inventory_id = i.inventory_id )
JOIN film      f  ON ( i.film_id = f.film_id)
JOIN customer  c  ON ( p.customer_id = c.customer_id )
JOIN address   a  ON ( c.address_id = a.address_id )
JOIN city      ci ON ( a.city_id = ci.city_id )
GROUP BY
(
  f.title, 
  month, 
  ci.city
)
ORDER BY
  f.title, 
  month, 
  ci.city, 
  revenue DESC
LIMIT 10;

 * postgresql://hynso:***@127.0.0.1:5432/pagila_star
10 rows affected.
CPU times: user 11.4 ms, sys: 3.89 ms, total: 15.3 ms
Wall time: 139 ms


title,month,city,revenue
ACADEMY DINOSAUR,1.0,Celaya,1.98
ACADEMY DINOSAUR,1.0,Cianjur,3.98
ACADEMY DINOSAUR,2.0,San Lorenzo,1.98
ACADEMY DINOSAUR,2.0,Sullana,3.98
ACADEMY DINOSAUR,2.0,Udaipur,1.98
ACADEMY DINOSAUR,3.0,Almirante Brown,3.98
ACADEMY DINOSAUR,3.0,Goinia,1.98
ACADEMY DINOSAUR,3.0,Kaliningrad,1.98
ACADEMY DINOSAUR,3.0,Kurashiki,1.98
ACADEMY DINOSAUR,3.0,Livorno,1.98


# Conclusion

We were able to show that:
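* The star schema is easier to understand and write queries against: the star query needed three joins, while the 3NF query needed six.
* Queries against the star schema run faster: in the single runs above, the star query took 30.1 ms of wall time, compared with 139 ms for the 3NF query.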
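
As a quick check on the size of the difference, the ratio can be computed from the two wall times printed by `%%time` above. These are single-run measurements on one machine, so treat the result as a rough indication rather than a benchmark.

```python
# Wall times copied from the two %%time outputs in this notebook (single runs).
star_ms = 30.1
three_nf_ms = 139.0

speedup = three_nf_ms / star_ms
print(f"The 3NF query took about {speedup:.1f}x as long as the star-schema query")
```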