# Exercise 1 - Sakila Star Schema & ETL

All the database tables in this demo are based on public database samples and transformations:
- `Sakila` is a sample database created by `MySQL` [Link](https://dev.mysql.com/doc/sakila/en/sakila-structure.html)
- The PostgreSQL version of it is called `Pagila` [Link](https://github.com/devrimgunduz/pagila)
- The design of the fact and dimension tables is based on O'Reilly's public dimensional modelling tutorial schema [Link](http://archive.oreilly.com/oreillyschool/courses/dba3/index.html)

## 1.1 Create the pagila db and fill it with data
- Adding `"!"` at the beginning of a line in a Jupyter cell runs a command in a shell, i.e. we are not running Python code but the `createdb` and `psql` PostgreSQL command-line utilities

In [1]:
#!PGPASSWORD=123456 createdb -h 127.0.0.1 -p 5433 -U postgres pagila
#!PGPASSWORD=123456 psql -q -h 127.0.0.1 -p 5433 -U postgres -d pagila -f Data/pagila-schema.sql
#!PGPASSWORD=123456 psql -q -h 127.0.0.1 -p 5433 -U postgres -d pagila -f Data/pagila-data.sql

### Installation and loading of the database schema, tables, and data
> Run the following commands in CMD after adding the `bin` and `lib` folders of PostgreSQL to the system `PATH` variable.
In my case the commands did not run from an arbitrary location; they only worked from the user home folder (`C:\Users\Muhammad`).
So I had to move the three files (`pagila-data.sql`, `pagila-schema.sql`, `pagila-star.sql`) to the user folder and run the following commands from there.
* `createdb -h 127.0.0.1 -p 5432 -U postgres pagila`
* `psql -q -h 127.0.0.1 -p 5432 -U postgres -d pagila -f Data/pagila-schema.sql`
* `psql -q -h 127.0.0.1 -p 5432 -U postgres -d pagila -f Data/pagila-data.sql`
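
The same three commands can also be launched from Python instead of CMD or `!`. The sketch below is only an illustration: the helper names `build_load_commands` and `run_load_commands` are made up for this example, and the host, port, user and file paths are simply the ones used above. It relies on the standard `PGPASSWORD` environment variable so that the PostgreSQL client tools do not prompt for a password.

```python
import os
import subprocess


def build_load_commands(host="127.0.0.1", port=5432, user="postgres",
                        db="pagila", data_dir="Data"):
    """Return the createdb/psql argument lists used in this section (hypothetical helper)."""
    common = ["-h", host, "-p", str(port), "-U", user]
    return [
        ["createdb", *common, db],
        ["psql", "-q", *common, "-d", db, "-f", f"{data_dir}/pagila-schema.sql"],
        ["psql", "-q", *common, "-d", db, "-f", f"{data_dir}/pagila-data.sql"],
    ]


def run_load_commands(password, **kwargs):
    """Run each command in order and stop at the first failure (hypothetical helper)."""
    # PGPASSWORD is read by the PostgreSQL client tools, so no prompt appears.
    env = {**os.environ, "PGPASSWORD": password}
    for cmd in build_load_commands(**kwargs):
        subprocess.run(cmd, env=env, check=True)


# Only print the commands here; call run_load_commands("...") to execute them.
commands = build_load_commands()
for cmd in commands:
    print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failed `createdb` stop the run before the data files are loaded.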

In [2]:
# Install the needed libraries in case we don't have them

# psycopg2 is the PostgreSQL driver needed by the ipython-sql library
!pip install psycopg2-binary

# Install ipython-sql library
!pip install ipython-sql

# Install pandas in case we don't already have it
!pip install pandas

Defaulting to user installation because normal site-packages is not writeable


You should consider upgrading via the 'c:\program files\python38\python.exe -m pip install --upgrade pip' command.


# STEP0: Using ipython-sql

- Load ipython-sql: `%load_ext sql`

- To execute SQL queries, write one of the following at the top of your cell:
    - `%sql`
        - For a one-line SQL query
        - You can access a Python variable using `$`
    - `%%sql`
        - For a multi-line SQL query
        - You can **NOT** access a Python variable using `$`


- Passing a connection string such as `postgresql://postgres:postgres@db:5432/pagila` to `%sql` connects to the database
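
To see what each part of such a connection string means, it can be split with the standard library. This is only a sketch for reading the URL: ipython-sql does its own parsing, and the dictionary keys below are names chosen for this example.

```python
from urllib.parse import urlsplit

# The example connection string from the bullet above
example = "postgresql://postgres:postgres@db:5432/pagila"

parts = urlsplit(example)
components = {
    "dialect": parts.scheme,                  # database type
    "user": parts.username,                   # login role
    "password": parts.password,               # password of that role
    "host": parts.hostname,                   # server name or IP address
    "port": parts.port,                       # TCP port, parsed as an int
    "database": parts.path.lstrip("/"),       # database to connect to
}

for name, value in components.items():
    print(f"{name:9} = {value}")
```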


In [3]:
# Load ipython-sql extension
%load_ext sql

# STEP1: Connect to the local database where Pagila is loaded

## 1.2 Connect to the newly created db

In [4]:
# Set connection string variables
DB_ENDPOINT = "127.0.0.1"
DB = 'pagila'
DB_USER = 'postgres'
DB_PASSWORD = '123456'
DB_PORT = '5432'

# Build the connection string: postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
                        .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)

# Print the connection string
print(conn_string)

postgresql://postgres:123456@127.0.0.1:5432/pagila
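
Two caveats about the cell above: printing the connection string shows the password in clear text, and a password containing characters such as `@` or `/` would break the URL. A possible variant, sketched below with two hypothetical helpers (`build_conn_string` and `mask_password` are not part of the notebook or of ipython-sql), escapes the credentials and masks the password before printing. The password `p@ss/word` is a made-up value used only to show the escaping.

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host, port, database):
    """Build postgresql://user:password@host:port/database with escaped credentials."""
    # quote_plus escapes characters such as "@", ":" or "/" inside the credentials.
    return (f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{database}")


def mask_password(conn_string):
    """Return the connection string with the password replaced by ***."""
    head, _, tail = conn_string.rpartition("@")       # split off host:port/database
    scheme_user, _, _password = head.rpartition(":")  # split off the password
    return f"{scheme_user}:***@{tail}"


# Made-up password containing characters that need escaping
conn = build_conn_string("postgres", "p@ss/word", "127.0.0.1", "5432", "pagila")
print(mask_password(conn))
```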


In [5]:
# connect to the database using the connection string
%sql $conn_string

# STEP2: Explore the 3NF Schema

<img src="images/pagila-3nf.png">

## 2.1 How much? What data sizes are we looking at?

In [6]:
# The film table
film_rows_count = %sql SELECT COUNT(*) FROM film;
print("Number of rows in the film table = ", film_rows_count[0][0])

 * postgresql://postgres:***@127.0.0.1:5432/pagila
1 rows affected.
Number of rows in the film table =  1000


In [7]:
# The customer table
customer_rows_count = %sql SELECT COUNT(*) FROM customer;
print("Number of rows in the customer table = ", customer_rows_count[0][0])

 * postgresql://postgres:***@127.0.0.1:5432/pagila
1 rows affected.
Number of rows in the customer table =  599


In [8]:
# The rental table
rental_rows_count = %sql SELECT COUNT(*) FROM rental;
print("Number of rows in the rental table = ", rental_rows_count[0][0])

 * postgresql://postgres:***@127.0.0.1:5432/pagila
1 rows affected.
Number of rows in the rental table =  16044


In [9]:
# The payment table
payment_rows_count = %sql SELECT COUNT(*) FROM payment;
print("Number of rows in the payment table = ", payment_rows_count[0][0])

 * postgresql://postgres:***@127.0.0.1:5432/pagila
1 rows affected.
Number of rows in the payment table =  16049


In [10]:
# The staff table
staff_rows_count = %sql SELECT COUNT(*) FROM staff;
print("Number of rows in the staff table = ", staff_rows_count[0][0])

 * postgresql://postgres:***@127.0.0.1:5432/pagila
1 rows affected.
Number of rows in the staff table =  2


In [11]:
# The store table
store_rows_count = %sql SELECT COUNT(*) FROM store;
print("Number of rows in the store table = ", store_rows_count[0][0])

 * postgresql://postgres:***@127.0.0.1:5432/pagila
1 rows affected.
Number of rows in the store table =  2


In [12]:
# The city table
city_rows_count = %sql SELECT COUNT(*) FROM city;
print("Number of rows in the city table = ", city_rows_count[0][0])

 * postgresql://postgres:***@127.0.0.1:5432/pagila
1 rows affected.
Number of rows in the city table =  600


In [13]:
# The country table
country_rows_count = %sql SELECT COUNT(*) FROM country;
print("Number of rows in the country table = ", country_rows_count[0][0])

 * postgresql://postgres:***@127.0.0.1:5432/pagila
1 rows affected.
Number of rows in the country table =  109
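
The eight cells above repeat the same pattern, so the counts could also be collected in a loop. The sketch below assumes any DB-API connection; to stay runnable without a PostgreSQL server it uses an in-memory `sqlite3` database with made-up `staff` and `store` rows as a stand-in, and `count_rows` is a hypothetical helper. With Pagila, a `psycopg2` connection would be passed in instead.

```python
import sqlite3


def count_rows(conn, tables):
    """Return {table name: row count} for the given tables (hypothetical helper)."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        # Table names cannot be bound as query parameters, so pass trusted names only.
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    return counts


# Stand-in database with made-up rows (NOT the real Pagila data)
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE staff (staff_id INTEGER PRIMARY KEY);
    CREATE TABLE store (store_id INTEGER PRIMARY KEY);
    INSERT INTO staff VALUES (1), (2);
    INSERT INTO store VALUES (1), (2);
""")

row_counts = count_rows(demo, ["staff", "store"])
for table, n in row_counts.items():
    print(f"Number of rows in the {table} table = {n}")
```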


## 2.2 When? What time period are we talking about?

In [14]:
%%sql
SELECT MIN(rental_date) AS start, MAX(rental_date) as end
FROM rental;

 * postgresql://postgres:***@127.0.0.1:5432/pagila
1 rows affected.


start,end
2005-05-24 22:53:30+03:00,2017-02-14 15:16:03+02:00
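
To put a number on the period, the two timestamps returned above can be subtracted in Python. This is a small sketch that hard-codes the values from the result. Both timestamps carry a UTC offset, so the subtraction accounts for the differing `+03:00` and `+02:00` offsets.

```python
from datetime import datetime

# Values copied from the query result above
start = datetime.fromisoformat("2005-05-24 22:53:30+03:00")
end = datetime.fromisoformat("2017-02-14 15:16:03+02:00")

# Subtracting two timezone-aware datetimes gives the elapsed time
span = end - start
print(f"{span.days} days (about {span.days / 365.25:.1f} years)")
```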


## 2.3 Where? Where do events in this database occur?
TODO: Write a query that displays the number of addresses by district in the address table. Limit the table to the top 10 districts.

In [15]:
%%sql
SELECT district, COUNT(address_id) AS n
FROM address
GROUP BY district
ORDER BY n DESC
LIMIT 10;

 * postgresql://postgres:***@127.0.0.1:5432/pagila
10 rows affected.
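
Counting addresses per district means counting rows in each group; summing an id column such as `city_id` gives an unrelated number. The sketch below shows the difference on a tiny in-memory `sqlite3` table with made-up districts and ids (not the Pagila data). The `id_sum` column is included only for comparison.

```python
import sqlite3

# Stand-in table with made-up rows (NOT the real Pagila data)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        district   TEXT,
        city_id    INTEGER
    );
    INSERT INTO address (district, city_id) VALUES
        ('Alpha', 500), ('Alpha', 510), ('Alpha', 520),
        ('Beta', 900), ('Beta', 950),
        ('Gamma', 10);
""")

query = """
    SELECT district,
           COUNT(address_id) AS n,       -- number of addresses in the district
           SUM(city_id)      AS id_sum   -- sum of foreign keys, not a count
    FROM address
    GROUP BY district
    ORDER BY n DESC, district
    LIMIT 10
"""
rows = conn.execute(query).fetchall()
for district, n, id_sum in rows:
    print(district, n, id_sum)
```

In this toy data `Beta` has the largest `id_sum` but fewer addresses than `Alpha`, so ordering by a sum of ids would rank the districts differently from ordering by the address count.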
