# Exploring Queries Against The Sakila Database

The `sakila` database consists of tables, views, and triggers that simulate the operations of a DVD rental service.
For more details, see https://dev.mysql.com/doc/sakila/en/.

**Note**: We use `SQLAlchemy` with `mysqlclient` to connect to `MySQL`, since this combination is more robust
and better maintained than `mysql-connector-python`. `SQLAlchemy` is a powerful framework in which tables, rows, etc.
can be modelled as Python classes, and queries can be constructed from such objects. We do not use these
features here; all queries are written as plain SQL strings.

Extra dependencies required: `SQLAlchemy`, `mysqlclient`, `python-dotenv`, `pandas`.

In [1]:
import pandas as pd

from script_utils.sql import get_database_engine, run_sql_queries, run_sql_query, DatabaseMetaData

Create a `.env` file in this directory, containing the following information about the database and the user:

```text
MYSQL_DATABASE=<name-of-sakila-database>  [default: sakila]
MYSQL_USER=<user-name>
MYSQL_PASSWD=<password>
MYSQL_PORT=<port>                         [default: 3306]
```

For this notebook to work, the user must have the `SELECT` privilege on all tables of the database.
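The settings above have to end up in a connection URL at some point. How `get_database_engine` does this is not shown in this notebook, so the following is only a minimal sketch under stated assumptions: `build_mysql_url` is a hypothetical helper, `MYSQL_HOST` is an assumed extra variable (defaulting to `localhost`) that is not part of the `.env` layout above, and the environment mapping would normally be `os.environ` after `load_dotenv()` from `python-dotenv` has run. The `mysql+mysqldb://` scheme is the one `SQLAlchemy` uses for the `mysqlclient` driver.

```python
import os
from urllib.parse import quote


def build_mysql_url(env=None):
    """Assemble a SQLAlchemy-style URL for the mysqlclient driver.

    Hypothetical helper: the real get_database_engine may work differently.
    """
    env = os.environ if env is None else env
    user = env["MYSQL_USER"]
    # Percent-encode the password so characters like '@' or '/' do not break the URL
    password = quote(env["MYSQL_PASSWD"], safe="")
    host = env.get("MYSQL_HOST", "localhost")  # assumption: not in the .env layout above
    port = env.get("MYSQL_PORT", "3306")
    database = env.get("MYSQL_DATABASE", "sakila")
    return f"mysql+mysqldb://{user}:{password}@{host}:{port}/{database}"


# Illustrative values only, not real credentials
example_env = {"MYSQL_USER": "reader", "MYSQL_PASSWD": "p@ss/word"}
example_url = build_mysql_url(example_env)
print(example_url)
```

A URL of this form could then be passed to `sqlalchemy.create_engine`. Keeping the credentials in `.env` rather than in the notebook means they do not end up in version control together with the queries.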

In [4]:
# Configuration parameters read from .env file

db_engine = get_database_engine()
db_metadata = DatabaseMetaData(db_engine)



In [5]:
query = """
SELECT customer_id, first_name, last_name
  FROM customer
"""

df = run_sql_query(query, engine=db_engine)
print(f"Number of rows: {df.shape[0]}")
df.head(20)

Number of rows: 599


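|   | customer_id | first_name | last_name |
|---|-------------|------------|-----------|
| 0 | 1 | MARY | SMITH |
| 1 | 2 | PATRICIA | JOHNSON |
| 2 | 3 | LINDA | WILLIAMS |
| 3 | 4 | BARBARA | JONES |
| 4 | 5 | ELIZABETH | BROWN |
| 5 | 6 | JENNIFER | DAVIS |
| 6 | 7 | MARIA | MILLER |
| 7 | 8 | SUSAN | WILSON |
| 8 | 9 | MARGARET | MOORE |
| 9 | 10 | DOROTHY | TAYLOR |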


In [15]:
# Determine the number of rows of all tables in the database

query_mask = "SELECT COUNT(*) AS size FROM {}"

table_names = db_metadata.table_names()
maxlen = max(len(table) for table in table_names)
queries = [query_mask.format(table) for table in table_names]
dfs = run_sql_queries(queries, db_engine)

for table, df in zip(table_names, dfs):
    print(f"{table:{maxlen}s}: {df.loc[0, 'size']}")

actor        : 200
address      : 603
city         : 600
country      : 109
category     : 16
customer     : 599
store        : 2
staff        : 2
film         : 1000
language     : 6
film_actor   : 5462
film_category: 1000
film_text    : 1000
inventory    : 4581
payment      : 16044
rental       : 16044
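The counting pattern above can be tried without a MySQL server. The sketch below is an illustration only: it uses the standard-library `sqlite3` module with a tiny in-memory database as a stand-in (the two tables and their rows are made up, not the real `sakila` data), and it does not use the `script_utils.sql` helpers. Table names cannot be passed as bound parameters, which is why the query is built with string formatting; this is acceptable here only because the names come from the database catalog rather than from user input.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption: toy data only)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE store (store_id INTEGER PRIMARY KEY);
    INSERT INTO actor (last_name) VALUES ('GUINESS'), ('WAHLBERG'), ('CHASE');
    INSERT INTO store DEFAULT VALUES;
""")

# Read the table names from the catalog, mirroring db_metadata.table_names()
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# Same query template as above, with the identifier quoted
query_mask = 'SELECT COUNT(*) AS size FROM "{}"'
sizes = {
    table: conn.execute(query_mask.format(table)).fetchone()[0]
    for table in table_names
}

# Align the output on the longest table name
maxlen = max(len(table) for table in table_names)
for table, size in sizes.items():
    print(f"{table:{maxlen}s}: {size}")

conn.close()
```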


First, we would like to understand whether