# Exploring Queries Against The Sakila Database

The `sakila` database consists of tables, views, and triggers that simulate the operations of a DVD rental service.
For more details, see https://dev.mysql.com/doc/sakila/en/.

**Note**: We use `SQLAlchemy` with `mysqlclient` to connect to `MySQL`, since this combination is more robust
and better maintained than `mysql-connector-python`. `SQLAlchemy` is a powerful framework in which tables, rows, etc.
can be modelled as Python classes, and queries can be constructed from such objects. We do not use these
features here; all queries are written as plain SQL strings.

Extra dependencies required: `SQLAlchemy`, `mysqlclient`, `python-dotenv`, `pandas`.

In [1]:
import pandas as pd

from script_utils.sql import get_database_engine, run_sql_queries, run_sql_query, DatabaseMetaData

Create a `.env` file in this directory, containing the following information about the database and the user:

```text
MYSQL_DATABASE=<name-of-sakila-database>  [default: sakila]
MYSQL_USER=<user-name>
MYSQL_PASSWD=<password>
MYSQL_PORT=<port>                         [default: 3306]
```

For this notebook to work, the user must have `SELECT` privilege on all tables of the database.
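The implementation of `get_database_engine` is not shown in this notebook. As a rough sketch of how such a helper might turn the `.env` values into a connection URL, the snippet below parses `KEY=VALUE` lines with the standard library and builds a URL for SQLAlchemy's `mysql+mysqldb` dialect (the one backed by `mysqlclient`). The names `parse_env` and `build_url`, and the `localhost` host, are assumptions made for illustration; in practice `python-dotenv` would do the parsing.

```python
from urllib.parse import quote_plus

# Defaults as documented in the .env description above
DEFAULTS = {"MYSQL_DATABASE": "sakila", "MYSQL_PORT": "3306"}


def parse_env(text):
    """Parse KEY=VALUE lines; blank lines and '#' comments are skipped."""
    values = dict(DEFAULTS)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def build_url(env, host="localhost"):
    """Build a SQLAlchemy-style URL for MySQL via mysqlclient.

    The host is an assumption: the .env file above does not define one.
    """
    # URL-encode user and password so characters such as '@' or '/' are safe
    user = quote_plus(env["MYSQL_USER"])
    passwd = quote_plus(env["MYSQL_PASSWD"])
    return (
        f"mysql+mysqldb://{user}:{passwd}@{host}:"
        f"{env['MYSQL_PORT']}/{env['MYSQL_DATABASE']}"
    )


# Hypothetical .env content, with database and port left at their defaults
example = "MYSQL_USER=reader\nMYSQL_PASSWD=p@ss/word\n"
env = parse_env(example)
url = build_url(env)
print(url)
```

A URL of this form could then be passed to `sqlalchemy.create_engine`, which is presumably what the helper does internally.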

In [4]:
# Configuration parameters read from .env file

db_engine = get_database_engine()
db_metadata = DatabaseMetaData(db_engine)


In [5]:
query = """
SELECT customer_id, first_name, last_name
  FROM customer
"""

df = run_sql_query(query, engine=db_engine)
print(f"Number of rows: {df.shape[0]}")
df.head(20)

Number of rows: 599


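|    | customer_id | first_name | last_name |
|----|-------------|------------|-----------|
| 0  | 1           | MARY       | SMITH     |
| 1  | 2           | PATRICIA   | JOHNSON   |
| 2  | 3           | LINDA      | WILLIAMS  |
| 3  | 4           | BARBARA    | JONES     |
| 4  | 5           | ELIZABETH  | BROWN     |
| 5  | 6           | JENNIFER   | DAVIS     |
| 6  | 7           | MARIA      | MILLER    |
| 7  | 8           | SUSAN      | WILSON    |
| 8  | 9           | MARGARET   | MOORE     |
| 9  | 10          | DOROTHY    | TAYLOR    |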


In [15]:
# Determine the number of rows of all tables in the database

query_mask = "SELECT COUNT(*) AS size FROM {}"

table_names = db_metadata.table_names()
maxlen = max(len(table) for table in table_names)
queries = [query_mask.format(table) for table in table_names]
dfs = run_sql_queries(queries, db_engine)

for table, df in zip(table_names, dfs):
    print(f"{table:{maxlen}s}: {df.loc[0, 'size']}")

actor        : 200
address      : 603
city         : 600
country      : 109
category     : 16
customer     : 599
store        : 2
staff        : 2
film         : 1000
language     : 6
film_actor   : 5462
film_category: 1000
film_text    : 1000
inventory    : 4581
payment      : 16044
rental       : 16044


In [16]:
# Which films appear most often in the inventory of each store?

query = """
SELECT f.title as title, a.address as store, COUNT(*) as number
  FROM inventory as i
  JOIN film as f
    ON f.film_id = i.film_id
  JOIN store as s
    ON s.store_id = i.store_id
  JOIN address as a
    ON a.address_id = s.address_id
 GROUP BY f.film_id, s.store_id
 ORDER BY number DESC
"""

df = run_sql_query(query, db_engine)
df

|      | title            | store              | number |
|------|------------------|--------------------|--------|
| 0    | ACADEMY DINOSAUR | 28 MySQL Boulevard | 4      |
| 1    | ACADEMY DINOSAUR | 47 MySakila Drive  | 4      |
| 2    | ADAPTATION HOLES | 28 MySQL Boulevard | 4      |
| 3    | AFFAIR PREJUDICE | 47 MySakila Drive  | 4      |
| 4    | AIRPORT POLLOCK  | 28 MySQL Boulevard | 4      |
| ...  | ...              | ...                | ...    |
| 1516 | YENTL IDAHO      | 28 MySQL Boulevard | 2      |
| 1517 | YOUNG LANGUAGE   | 47 MySakila Drive  | 2      |
| 1518 | YOUTH KICK       | 47 MySakila Drive  | 2      |
| 1519 | ZHIVAGO CORE     | 28 MySQL Boulevard | 2      |


In [17]:
# What is the range of rental dates?

query1 = """
SELECT MIN(rental_date), MAX(rental_date)
  FROM rental
"""

query2 = """
SELECT MIN(return_date), MAX(return_date)
  FROM rental
 WHERE return_date IS NOT NULL
"""

dfs = run_sql_queries([query1, query2], db_engine)
print(dfs[0])
print(dfs[1])

     MIN(rental_date)    MAX(rental_date)
0 2005-05-24 22:53:30 2006-02-14 15:16:03
     MIN(return_date)    MAX(return_date)
0 2005-05-25 23:55:21 2005-09-02 02:35:22


In [21]:
query1 = """
SELECT rental_date, return_date
  FROM rental
 WHERE return_date IS NOT NULL
 ORDER BY return_date
"""

query2 = """
SELECT rental_date
  FROM rental
 WHERE return_date IS NULL
 ORDER BY rental_date
"""

dfs = run_sql_queries([query1, query2], db_engine)
dfs[0]

|       | rental_date         | return_date         |
|-------|---------------------|---------------------|
| 0     | 2005-05-25 04:06:21 | 2005-05-25 23:55:21 |
| 1     | 2005-05-25 01:59:46 | 2005-05-26 01:01:46 |
| 2     | 2005-05-25 00:31:15 | 2005-05-26 02:56:15 |
| 3     | 2005-05-25 00:43:11 | 2005-05-26 04:42:11 |
| 4     | 2005-05-25 02:19:23 | 2005-05-26 04:52:23 |
| ...   | ...                 | ...                 |
| 15856 | 2005-08-23 18:07:31 | 2005-09-01 22:27:31 |
| 15857 | 2005-08-23 18:23:24 | 2005-09-01 23:43:24 |
| 15858 | 2005-08-23 19:59:33 | 2005-09-02 01:28:33 |
| 15859 | 2005-08-23 22:19:33 | 2005-09-02 02:19:33 |


In [22]:
dfs[1]

|     | rental_date         |
|-----|---------------------|
| 0   | 2005-08-21 00:30:32 |
| 1   | 2006-02-14 15:16:03 |
| 2   | 2006-02-14 15:16:03 |
| 3   | 2006-02-14 15:16:03 |
| 4   | 2006-02-14 15:16:03 |
| ... | ...                 |
| 178 | 2006-02-14 15:16:03 |
| 179 | 2006-02-14 15:16:03 |
| 180 | 2006-02-14 15:16:03 |
| 181 | 2006-02-14 15:16:03 |


In [23]:
query = """
SELECT rental_date, inventory_id, customer_id
  FROM rental
 WHERE return_date IS NULL
 ORDER BY rental_date
"""
run_sql_query(query, db_engine)

|     | rental_date         | inventory_id | customer_id |
|-----|---------------------|--------------|-------------|
| 0   | 2005-08-21 00:30:32 | 6            | 554         |
| 1   | 2006-02-14 15:16:03 | 2047         | 155         |
| 2   | 2006-02-14 15:16:03 | 2026         | 335         |
| 3   | 2006-02-14 15:16:03 | 1545         | 83          |
| 4   | 2006-02-14 15:16:03 | 4106         | 219         |
| ... | ...                 | ...          | ...         |
| 178 | 2006-02-14 15:16:03 | 925          | 215         |
| 179 | 2006-02-14 15:16:03 | 837          | 505         |
| 180 | 2006-02-14 15:16:03 | 3611         | 41          |
| 181 | 2006-02-14 15:16:03 | 4416         | 168         |


The dates in `rental` are odd. Each record has a `rental_date` and a `return_date`. The latter can be missing, which probably means that the item has not been returned yet.

* For records with `return_date` present, `return_date` ranges from 2005-05-25 to 2005-09-02.
* For records with `return_date` missing, one `rental_date` is 2005-08-21, followed by 181 rows with `rental_date` equal to "2006-02-14 15:16:03". These records are for different items and customers.

It looks like `rental_date` has been corrupted for these 181 rows. They should be filtered out.

Once this is done, `rental` contains 15860 rows with `return_date` present, with return dates ranging from 2005-05-25 to 2005-09-02 (about 100 days, a little over three months), and one row with `return_date` missing and `rental_date` 2005-08-21.
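One way to express this filter is sketched below. To keep the example runnable without a MySQL server, it uses an in-memory SQLite database with a handful of made-up rows as a stand-in for `rental`; the toy table and the `SUSPECT` constant are assumptions for illustration only. The `WHERE` clause itself is plain SQL and should carry over to MySQL, where `mysqlclient` expects `%s` instead of `?` as the placeholder.

```python
import sqlite3

# Timestamp shared by the suspicious open rentals
SUSPECT = "2006-02-14 15:16:03"

# Toy stand-in for the rental table (not the real Sakila data)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rental ("
    " rental_id INTEGER PRIMARY KEY,"
    " rental_date TEXT NOT NULL,"
    " return_date TEXT)"
)
conn.executemany(
    "INSERT INTO rental (rental_date, return_date) VALUES (?, ?)",
    [
        ("2005-05-25 04:06:21", "2005-05-25 23:55:21"),  # returned
        ("2005-08-21 00:30:32", None),                   # plausible open rental
        (SUSPECT, None),                                 # suspicious
        (SUSPECT, None),                                 # suspicious
    ],
)

# Keep every row except open rentals carrying the suspicious timestamp
query = """
SELECT rental_id, rental_date, return_date
  FROM rental
 WHERE NOT (return_date IS NULL AND rental_date = ?)
 ORDER BY rental_date
"""
clean_rows = conn.execute(query, (SUSPECT,)).fetchall()
for row in clean_rows:
    print(row)
```

The condition only drops rows that are both unreturned and stamped with the suspicious time, so a returned rental that happened to share that timestamp would be kept.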

As a first, more interesting query: for a given timestamp, we would like a table of the films that are in stock at that time in one of the stores, together with the number of items of each film in stock.
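A possible shape for such a query is sketched here, under the assumption that an inventory item is "in stock" at a timestamp when no rental of that item started at or before the timestamp and is still open after it. The example runs against a tiny in-memory SQLite schema with invented rows rather than the real database, and the function name `films_in_stock` is hypothetical. It counts items per film and store, which is one reading of the requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY,
                        film_id INTEGER, store_id INTEGER);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER,
                     rental_date TEXT, return_date TEXT);

INSERT INTO film VALUES (1, 'ACADEMY DINOSAUR'), (2, 'ADAPTATION HOLES');
INSERT INTO inventory VALUES (1, 1, 1), (2, 1, 1), (3, 2, 2);
-- item 1 was out from June 1 to June 5; item 3 was never returned
INSERT INTO rental VALUES
  (1, 1, '2005-06-01 00:00:00', '2005-06-05 00:00:00'),
  (2, 3, '2005-06-02 00:00:00', NULL);
""")

QUERY = """
SELECT f.title, i.store_id, COUNT(*) AS in_stock
  FROM inventory AS i
  JOIN film AS f
    ON f.film_id = i.film_id
 WHERE NOT EXISTS (
       SELECT 1
         FROM rental AS r
        WHERE r.inventory_id = i.inventory_id
          AND r.rental_date <= :ts
          AND (r.return_date IS NULL OR r.return_date > :ts))
 GROUP BY f.film_id, f.title, i.store_id
 ORDER BY f.title, i.store_id
"""


def films_in_stock(connection, ts):
    """Return (title, store_id, in_stock) rows for items not rented out at ts."""
    return connection.execute(QUERY, {"ts": ts}).fetchall()


during = films_in_stock(conn, "2005-06-03 12:00:00")
after = films_in_stock(conn, "2005-06-10 00:00:00")
print(during)
print(after)
```

Films with no item in stock at the timestamp do not appear in the result at all. On the real data, the suspicious open rentals identified above would need to be excluded inside the subquery as well.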