In [1]:
%lsmagic
%load_ext sql
%sql postgresql://postgres@localhost:5432/

# Data Exploration

## 1. Identifying table relationships

Before jumping into solution mode for the business requirements, I need to look at the ERD (Entity-Relationship Diagram) to identify the relationships between the tables. The ERD for these datasets is shown below:
(image)
I’ve labeled each of our spotlight tables from 1 through to 7 so we can keep track of where we are as we explore the available data.

## 2. Identifying Key Columns
To generate each customer insight we need the following inputs:

* `category_name`: The name of the top or second ranking category

* `rental_count`: How many total films have they watched in this category?

* `average_comparison`: How many more films has the customer watched compared to the average DVD Rental Co customer?

* `percentile`: How does the customer rank in terms of the top X% compared to all other customers in this film category?

* `category_percentage`: What proportion of each customer’s total films watched does this count make?

The top category insight will use these inputs in the following text output:

> You’ve watched {`rental_count`} {`category_name`} films, that’s {`average_comparison`} more than the DVD Rental Co average and puts you in the top {`percentile`}% of {`category_name`} gurus!

The second category insight text output uses the fields in a similar way:

> You’ve watched {`rental_count`} {`category_name`} films making up {`category_percentage`}% of your entire viewing history!

The top actor information output will use the fields as follows:

> You’ve watched {`rental_count`} films featuring {`actor_name`}! Here are some other films {`first_name`} stars in that might interest you!
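
As a rough illustration of how these inputs slot into the text output, here is a minimal Python sketch for the top category insight. The values in `top_category` are made up for the example and are not taken from the dataset; the template constant name is also my own.

```python
# Hypothetical input values for one customer (not real data from dvd_rentals).
top_category = {
    "rental_count": 12,
    "category_name": "Classics",
    "average_comparison": 10,
    "percentile": 1,
}

# The top category template from above, written as a Python format string.
TOP_CATEGORY_TEMPLATE = (
    "You've watched {rental_count} {category_name} films, "
    "that's {average_comparison} more than the DVD Rental Co average "
    "and puts you in the top {percentile}% of {category_name} gurus!"
)

# Fill every placeholder from the dictionary of inputs.
top_insight = TOP_CATEGORY_TEMPLATE.format(**top_category)
print(top_insight)
```

The same pattern would apply to the second category and top actor templates, each with its own set of input fields.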


## 3. Identifying Start & End Points

To generate the datasets required to calculate `rental_count` at a `customer_id` level, I need the following information:

* `customer_id`
* `category_name`

However, going back to the ERD, I noticed that the `dvd_rentals.rental` table is the only place where the `customer_id` field exists, so it is the only place where I can identify how many films a customer has watched. Likewise, `dvd_rentals.category` is the only table that provides the `category_name` values.

## 4. Mapping the Joining Journey

After inspecting the ERD, I need to connect the data dots from `dvd_rentals.rental` (table number 1) all the way through to `dvd_rentals.category` (table number 5).

So here is the final version of my 4-part table joining journey itinerary:

| Join Journey Part | Start          | End            | Foreign Key   |
|-------------------|----------------|----------------|---------------|
| Part 1            | `rental`       | `inventory`    | `inventory_id`|
| Part 2            | `inventory`    | `film`         | `film_id`     |
| Part 3            | `film`         | `film_category`| `film_id`     |
| Part 4            | `film_category`| `category`     | `category_id` |

If I were to map out the different tables which I need to join as part of my journey - it would look something like this gif below:

<img src="images/joining-journey.gif">
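
The itinerary can also be written down as data and turned into a join chain. The sketch below is only an illustration: `JOIN_STEPS` and `build_join_sql` are hypothetical names I made up, and the function simply assembles the `FROM ... JOIN ... ON` text from the four rows of the table above.

```python
# Each step: (start table, end table, foreign key), copied from the itinerary table.
JOIN_STEPS = [
    ("rental", "inventory", "inventory_id"),
    ("inventory", "film", "film_id"),
    ("film", "film_category", "film_id"),
    ("film_category", "category", "category_id"),
]


def build_join_sql(steps, schema="dvd_rentals", join_type="INNER JOIN"):
    """Assemble the FROM/JOIN clauses for a chain of joins (illustrative helper)."""
    first_table = steps[0][0]
    lines = [f"FROM {schema}.{first_table}"]
    for start, end, key in steps:
        lines.append(f"{join_type} {schema}.{end}")
        lines.append(f"  ON {start}.{key} = {end}.{key}")
    return "\n".join(lines)


join_sql = build_join_sql(JOIN_STEPS)
print(join_sql)
```

Whether each step should be an `INNER JOIN` or a `LEFT JOIN` is exactly what the following sections investigate, which is why `join_type` is left as a parameter here.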

### 4.1. Join Journey Part 1

To answer the question *What type of table join should I use?*, I actually need to answer 3 additional questions.

* What is the purpose of joining these two tables?

* What is the distribution of foreign keys within each table?

* How many unique foreign key values exist in each table?

#### Question 1: What is the purpose of joining these two tables?

Going back to the insights from the Identifying Key Columns section, the important thing is to generate the `rental_count` calculation - the number of films that a customer has watched in a specific category.

In order to do this, I would need to:

Keep all of the customer rental records from `dvd_rentals.rental` and match up each record with its equivalent `film_id` value from the `dvd_rentals.inventory` table.

However, there are a few unknowns that I need to address as I match the `inventory_id` foreign key between the `rental` and `inventory` tables.

*1.1. How many records exist per inventory_id value in rental or inventory tables?*

*1.2. How many overlapping and missing unique foreign key values are there between the two tables?*

##### 1.1. How many records exist per inventory_id value in rental or inventory tables?

Since I know that the `rental` table contains every single rental for each customer - it makes sense logically that each valid rental record in the `rental` table should have a relevant `inventory_id` record as people need to physically hire some item in the store.

Additionally - it also makes sense that a specific item might be rented out by multiple customers at different times as customers return the DVD as shown by the `return_date` column in the dataset.

Now when I think about the `inventory` table - it should follow that every item should have a unique `inventory_id` but there may also be multiple copies of a specific film.

From these 2 key pieces of real life insight - I can generate 3 hypotheses about my datasets.

* The number of unique `inventory_id` records will be equal in both `dvd_rentals.rental` and `dvd_rentals.inventory` tables

* There will be multiple records per unique `inventory_id` in the `dvd_rentals.rental` table

* There will be multiple `inventory_id` records per unique `film_id` value in the `dvd_rentals.inventory` table

###### Hypothesis 1:

**The number of unique `inventory_id` records will be equal in both `dvd_rentals.rental` and `dvd_rentals.inventory` tables**

In [17]:
%%sql
(
SELECT
  'rental table' AS table_name,
  COUNT(DISTINCT inventory_id)
FROM dvd_rentals.rental
)
UNION
(
SELECT
  'inventory table' AS table_name,
  COUNT(DISTINCT inventory_id)
FROM dvd_rentals.inventory
)


 * postgresql://postgres@localhost:5432/
2 rows affected.


table_name,count
inventory table,4581
rental table,4580


**Findings:**
There seems to be 1 additional `inventory_id` value in the `dvd_rentals.inventory` table compared to the `dvd_rentals.rental` table.


###### Hypothesis 2: 

**There will be multiple records per unique `inventory_id` in the `dvd_rentals.rental` table**

In [19]:
%%sql
-- first generate group by counts on the target_column_values column
WITH counts_base AS (
SELECT
  inventory_id AS target_column_values,
  COUNT(*) AS row_counts
FROM dvd_rentals.rental
GROUP BY target_column_values
)
-- summarize the group by counts above by grouping again on the row_counts from counts_base CTE part
SELECT
  row_counts,
  COUNT(target_column_values) as count_of_target_values
FROM counts_base
GROUP BY row_counts
ORDER BY row_counts


 * postgresql://postgres@localhost:5432/
5 rows affected.


row_counts,count_of_target_values
1,4
2,1126
3,1151
4,1160
5,1139


**Findings:**
There are multiple rows per `inventory_id` value in the `dvd_rentals.rental` table.
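
The two-level group by used in the query above (count rows per key, then count keys per row count) can be sketched in plain Python. The `rental_inventory_ids` list is toy data I invented for illustration; it is not drawn from `dvd_rentals.rental`.

```python
from collections import Counter

# Toy inventory_id values, one entry per rental row (made-up data).
rental_inventory_ids = [1, 1, 2, 2, 2, 3, 4, 4]

# Step 1: rows per inventory_id, like the counts_base CTE.
rows_per_id = Counter(rental_inventory_ids)

# Step 2: how many inventory_id values share each row count,
# like the outer GROUP BY row_counts.
ids_per_row_count = Counter(rows_per_id.values())

# Sort by row count to mirror ORDER BY row_counts.
distribution = sorted(ids_per_row_count.items())
print(distribution)  # [(1, 1), (2, 2), (3, 1)]
```

Any row count greater than 1 in the first position of a pair means that at least one key appears on multiple rows, which is what Hypothesis 2 predicts for the rental table.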

###### Hypothesis 3:

**There will be multiple `inventory_id` records per unique `film_id` value in the `dvd_rentals.inventory` table**

In [16]:
%%sql
-- first generate group by counts on the target_column_values column
WITH counts_base AS (
SELECT
  film_id AS target_column_values,
  COUNT(DISTINCT inventory_id) AS unique_record_counts
FROM dvd_rentals.inventory
GROUP BY target_column_values
)
-- summarize the group by counts above by grouping again on unique_record_counts from the counts_base CTE
SELECT
  unique_record_counts,
  COUNT(target_column_values) as count_of_target_values
FROM counts_base
GROUP BY unique_record_counts
ORDER BY unique_record_counts;


 * postgresql://postgres@localhost:5432/
7 rows affected.


unique_record_counts,count_of_target_values
2,133
3,131
4,183
5,136
6,187
7,116
8,72


**Findings:**
There are indeed multiple unique `inventory_id` per `film_id` value in the `dvd_rentals.inventory` table.


##### 1.2. How many overlapping and missing unique foreign key values are there between the two tables?

A good place to start inspecting my datasets is the distribution of foreign key values in each of the `rental` and `inventory` tables used for my join.

**`rental` distribution analysis on `inventory_id` foreign key**

In [20]:
%%sql 
-- first generate group by counts on the foreign_key_values column
WITH counts_base AS (
SELECT
  inventory_id AS foreign_key_values,
  COUNT(*) AS row_counts
FROM dvd_rentals.rental
GROUP BY foreign_key_values
)
-- summarize the group by counts above by grouping again on the row_counts from counts_base CTE part
SELECT
  row_counts,
  COUNT(foreign_key_values) as count_of_foreign_keys
FROM counts_base
GROUP BY row_counts
ORDER BY row_counts

 * postgresql://postgres@localhost:5432/
5 rows affected.


row_counts,count_of_foreign_keys
1,4
2,1126
3,1151
4,1160
5,1139


**`inventory` distribution analysis on `inventory_id` foreign key**

In [22]:
%%sql
WITH counts_base AS (
SELECT
  inventory_id AS foreign_key_values,
  COUNT(*) AS row_counts
FROM dvd_rentals.inventory
GROUP BY foreign_key_values
)
SELECT
  row_counts,
  COUNT(foreign_key_values) as count_of_foreign_keys
FROM counts_base
GROUP BY row_counts
ORDER BY row_counts

 * postgresql://postgres@localhost:5432/
1 rows affected.


row_counts,count_of_foreign_keys
1,4581


**FINDING:**

* Rental table: There may be 1 or more records for each unique `inventory_id` value in this table - *a 1-to-many relationship* for the `inventory_id`

* Inventory table: For every unique `inventory_id` value in the inventory table there is exactly 1 row - *a 1-to-1 relationship*


**How many foreign keys exist only in the `inventory` table and not in the `rental` table?**

In [24]:
%%sql
SELECT
  COUNT(DISTINCT inventory.inventory_id)
FROM dvd_rentals.inventory
WHERE NOT EXISTS (
  SELECT inventory_id
  FROM dvd_rentals.rental
  WHERE rental.inventory_id = inventory.inventory_id
)


 * postgresql://postgres@localhost:5432/
1 rows affected.


count
1


In [29]:
%%sql 
-- Spot a single inventory_id record.
SELECT *
FROM dvd_rentals.inventory
WHERE NOT EXISTS (
  SELECT inventory_id
  FROM dvd_rentals.rental
  WHERE rental.inventory_id = inventory.inventory_id
)

 * postgresql://postgres@localhost:5432/
1 rows affected.


inventory_id,film_id,store_id,last_update
5,1,2,2006-02-15 05:09:17


**Get the count of unique foreign key values that are in the intersection.**

In [30]:
%%sql 
SELECT
  COUNT(DISTINCT rental.inventory_id)
FROM dvd_rentals.rental
WHERE EXISTS (
  SELECT inventory_id
  FROM dvd_rentals.inventory
  WHERE rental.inventory_id = inventory.inventory_id
)

 * postgresql://postgres@localhost:5432/
1 rows affected.


count
4580
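
The `NOT EXISTS` (anti join) and `EXISTS` (semi join) checks above boil down to set operations on the distinct key values. Here is a minimal Python sketch with invented key sets; the real tables hold thousands of values, so these numbers are purely illustrative.

```python
# Toy sets of distinct inventory_id values (made-up data).
rental_ids = {1, 2, 3, 4}
inventory_ids = {1, 2, 3, 4, 5}

# Anti join: keys in inventory with no matching rental (WHERE NOT EXISTS ...).
only_in_inventory = inventory_ids - rental_ids

# Anti join in the other direction: keys in rental missing from inventory.
only_in_rental = rental_ids - inventory_ids

# Semi join: keys in rental that also exist in inventory (WHERE EXISTS ...).
in_both = rental_ids & inventory_ids

print(only_in_inventory, only_in_rental, len(in_both))
```

If `only_in_rental` is empty, every row on the rental side finds a match, which is the condition that matters when `rental` is the left table of the join.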


**FINDING:**
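After performing this analysis I can conclude that there is in fact no difference between running a LEFT JOIN and an INNER JOIN with `dvd_rentals.rental` as the left table: every `inventory_id` in `rental` has a match in `inventory`.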

#### Implementing the join

**Check whether the INNER JOIN gives the same result as the LEFT JOIN in this case:**

In [31]:
%%sql
-- Create LEFT JOIN table
DROP TABLE IF EXISTS left_rental_join;
CREATE TEMP TABLE left_rental_join AS
SELECT
   rental.customer_id,
   rental.inventory_id,
   inventory.film_id
FROM dvd_rentals.rental
LEFT JOIN dvd_rentals.inventory
ON rental.inventory_id = inventory.inventory_id;

-- Create INNER JOIN table
DROP TABLE IF EXISTS inner_rental_join;
CREATE TEMP TABLE inner_rental_join AS
SELECT
   rental.customer_id,
   rental.inventory_id,
   inventory.film_id
FROM dvd_rentals.rental
INNER JOIN dvd_rentals.inventory
ON rental.inventory_id = inventory.inventory_id;

  -- Check the counts for each output
(
SELECT
  'left join' AS join_type,
  COUNT(*) AS record_count,
  COUNT (DISTINCT inventory_id) AS unique_key_values
FROM left_rental_join
)
UNION
(
SELECT
  'inner join' AS join_type,
  COUNT(*) AS record_count,
  COUNT (DISTINCT inventory_id) AS unique_key_values
FROM inner_rental_join
);


 * postgresql://postgres@localhost:5432/
Done.
16044 rows affected.
Done.
16044 rows affected.
2 rows affected.


join_type,record_count,unique_key_values
inner join,16044,4580
left join,16044,4580


**FINDING:**
There is no difference between an INNER JOIN and a LEFT JOIN for these datasets.
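
To see why the two joins agree here and when they would not, the following sketch uses Python's built-in `sqlite3` module with a tiny in-memory database. The tables and rows are toy data I made up, and SQLite stands in for PostgreSQL purely for convenience; the join semantics shown are the same.

```python
import sqlite3

# In-memory toy database (made-up rows, not the real dvd_rentals data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rental (rental_id INTEGER, inventory_id INTEGER);
CREATE TABLE inventory (inventory_id INTEGER, film_id INTEGER);
INSERT INTO rental VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO inventory VALUES (1, 10), (2, 20), (5, 10);
""")


def join_count(join_type):
    """Count the rows produced by joining rental to inventory."""
    sql = f"""
        SELECT COUNT(*)
        FROM rental
        {join_type} JOIN inventory
          ON rental.inventory_id = inventory.inventory_id
    """
    return conn.execute(sql).fetchone()[0]


# Every rental key exists in inventory, so both joins return the same rows.
same_left = join_count("LEFT")
same_inner = join_count("INNER")

# Hypothetical rental whose inventory_id has no match in inventory.
conn.execute("INSERT INTO rental VALUES (4, 99)")
orphan_left = join_count("LEFT")
orphan_inner = join_count("INNER")

print(same_left, same_inner, orphan_left, orphan_inner)  # 3 3 4 3
```

The unrented item (`inventory_id = 5` in the toy data, mirroring the finding above) never affects either join because it sits on the right side only.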

> **Summary**

I have now answered all 3 questions for this table join.

*1. What is the purpose of joining these two tables?*

I need to keep all of the customer rental records from `dvd_rentals.rental` and match up each record with its equivalent `film_id` value from the `dvd_rentals.inventory` table.

*2. What is the distribution of foreign keys within each table?*

There is a 1-to-many relationship between the `inventory_id` and the rows of the `dvd_rentals.rental` table.

There is a 1-to-1 relationship between the `inventory_id` and the rows of the `dvd_rentals.inventory` table.

*3. How many unique foreign key values exist in each table?*

All of the foreign key values in `dvd_rentals.rental` exist in `dvd_rentals.inventory`, and only 1 record (`inventory_id = 5`) exists solely in the `dvd_rentals.inventory` table.

There is an overlap of 4,580 unique `inventory_id` foreign key values, which will remain after the join is complete.


### 4.2. Join Journey Part 2

#### Question 1: What is the purpose of joining these two tables?

I want to match the films on `film_id` to obtain the title of each film.

##### a. What contextual hypotheses do I have about the data?

There should be a 1-to-many relationship between `film_id` and the rows of the `dvd_rentals.inventory` table, as one specific film might have multiple copies available to rent at the store.

There should be a 1-to-1 relationship between `film_id` and the rows of the `dvd_rentals.film` table, as it doesn’t make sense for there to be duplicate films in this table.

##### b. How can I validate these assumptions?

In [32]:
%%sql
WITH base_counts AS (
SELECT
  film_id,
  COUNT(*) AS record_count
FROM dvd_rentals.inventory
GROUP BY film_id
)
SELECT
  record_count,
  COUNT(DISTINCT film_id) as unique_film_id_values
FROM base_counts
GROUP BY record_count
ORDER BY record_count;


 * postgresql://postgres@localhost:5432/
7 rows affected.


record_count,unique_film_id_values
2,133
3,131
4,183
5,136
6,187
7,116
8,72


> I have a 1-to-many relationship for the `film_id` foreign key in our `dvd_rentals.inventory` table.

In [33]:
%%sql
SELECT
  film_id,
  COUNT(*) AS record_count
FROM dvd_rentals.film
GROUP BY film_id
ORDER BY record_count DESC
LIMIT 5;


 * postgresql://postgres@localhost:5432/
5 rows affected.


film_id,record_count
273,1
51,1
951,1
839,1
652,1


> I can now also confirm that there is a 1-to-1 relationship for `film_id` in the `dvd_rentals.film` table.

#### Question 2: What is the distribution of foreign keys within each table?

I already did this with my previous summarized group by count when confirming my hypotheses for both tables.

#### Question 3: How many unique foreign key values exist in each table?

**How many foreign keys exist only in the `inventory` table?**

In [34]:
%%sql
SELECT
  COUNT(DISTINCT inventory.film_id)
FROM dvd_rentals.inventory
WHERE NOT EXISTS (
  SELECT film_id
  FROM dvd_rentals.film
  WHERE film.film_id = inventory.film_id
);


 * postgresql://postgres@localhost:5432/
1 rows affected.


count
0


> I can conclude that all of the `film_id` records from the `dvd_rentals.inventory` table exist in the `dvd_rentals.film` table.

**How many foreign keys exist only in the `film` table?**

In [35]:
%%sql
SELECT
  COUNT(DISTINCT film.film_id)
FROM dvd_rentals.film
WHERE NOT EXISTS (
  SELECT film_id
  FROM dvd_rentals.inventory
  WHERE film.film_id = inventory.film_id
)

 * postgresql://postgres@localhost:5432/
1 rows affected.


count
42


**Check the total count of distinct foreign key values that will be generated when I use a left semi join with `dvd_rentals.inventory` as the base left table.**

In [36]:
%%sql
SELECT
  COUNT(DISTINCT film_id)
FROM dvd_rentals.inventory
-- note how the NOT is no longer here for a left semi join
-- compared to the anti join!
WHERE EXISTS (
  SELECT film_id
  FROM dvd_rentals.film
  WHERE film.film_id = inventory.film_id
)

 * postgresql://postgres@localhost:5432/
1 rows affected.


count
958


**Join implementation**

In [38]:
%%sql
DROP TABLE IF EXISTS left_join_part_2;
CREATE TEMP TABLE left_join_part_2 AS
SELECT
  inventory.inventory_id,
  inventory.film_id,
  film.title
FROM dvd_rentals.inventory
LEFT JOIN dvd_rentals.film
  ON film.film_id = inventory.film_id;

DROP TABLE IF EXISTS inner_join_part_2;
CREATE TEMP TABLE inner_join_part_2 AS
SELECT
  inventory.inventory_id,
  inventory.film_id,
  film.title
FROM dvd_rentals.inventory
INNER JOIN dvd_rentals.film
  ON film.film_id = inventory.film_id;
-- check the counts for each output (bonus UNION usage)
(
  SELECT
    'left join' AS join_type,
    COUNT(*) AS record_count,
    COUNT(DISTINCT film_id) AS unique_key_values
  FROM left_join_part_2
)
-- Use UNION ALL here because I do not need UNION for distinct values!
UNION ALL
(
  SELECT
    'inner join' AS join_type,
    COUNT(*) AS record_count,
    COUNT(DISTINCT film_id) AS unique_key_values
  FROM inner_join_part_2
)


 * postgresql://postgres@localhost:5432/
Done.
4581 rows affected.
Done.
4581 rows affected.
2 rows affected.


join_type,record_count,unique_key_values
left join,4581,958
inner join,4581,958


In [39]:
%%sql
DROP TABLE IF EXISTS join_parts_1_and_2;
CREATE TEMP TABLE join_parts_1_and_2 AS
SELECT
  rental.customer_id,
  inventory.film_id,
  film.title
FROM dvd_rentals.rental
INNER JOIN dvd_rentals.inventory
  ON rental.inventory_id = inventory.inventory_id
INNER JOIN dvd_rentals.film
  ON inventory.film_id = film.film_id;

SELECT * FROM join_parts_1_and_2 limit 10;

 * postgresql://postgres@localhost:5432/
Done.
16044 rows affected.
10 rows affected.


customer_id,film_id,title
130,80,BLANKET BEVERLY
459,333,FREAKY POCUS
408,373,GRADUATE LORD
333,535,LOVE SUICIDES
222,450,IDOLS SNATCHERS
549,613,MYSTIC TRUMAN
269,870,SWARM GOLD
239,510,LAWLESS VISION
126,565,MATRIX SNOWMAN
399,396,HANGING DEEP


### 4.3. Join Journey Parts 3 and 4

There is a 1-to-1 relationship for `film_id` in both the left and right tables, `dvd_rentals.film` and `dvd_rentals.film_category`, for part 3 of my table join journey.

For part 4, the same checks show a 1-to-many relationship between `category_id` and the rows of the left `dvd_rentals.film_category` table, and a 1-to-1 relationship for the `dvd_rentals.category` table.
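
The 1-to-1 relationship on the right-hand side matters because a duplicated key there would multiply rows in the joined output. The sketch below shows that effect with Python's built-in `sqlite3` module and an invented pair of tables; the extra category row is hypothetical and does not occur in the checks described above.

```python
import sqlite3

# In-memory toy tables (made-up rows, not the real dvd_rentals data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER, title TEXT);
CREATE TABLE film_category (film_id INTEGER, category_id INTEGER);
INSERT INTO film VALUES (1, 'FILM A'), (2, 'FILM B');
INSERT INTO film_category VALUES (1, 6), (2, 11);
""")

COUNT_SQL = """
    SELECT COUNT(*)
    FROM film
    INNER JOIN film_category
      ON film.film_id = film_category.film_id
"""

# With one category row per film, the join keeps the film row count.
one_to_one_rows = conn.execute(COUNT_SQL).fetchone()[0]

# Hypothetical: film 1 gets a second category row, breaking the 1-to-1 shape.
conn.execute("INSERT INTO film_category VALUES (1, 7)")
fanned_out_rows = conn.execute(COUNT_SQL).fetchone()[0]

print(one_to_one_rows, fanned_out_rows)  # 2 3
```

A stable record count across each additional join, as in the outputs of this section, is consistent with the right-hand tables having one row per key.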

In [40]:
%%sql
DROP TABLE IF EXISTS complete_joint_dataset;
CREATE TEMP TABLE complete_joint_dataset AS
SELECT
  rental.customer_id,
  inventory.film_id,
  film.title,
  film_category.category_id,
  category.name AS category_name
FROM dvd_rentals.rental
INNER JOIN dvd_rentals.inventory
  ON rental.inventory_id = inventory.inventory_id
INNER JOIN dvd_rentals.film
  ON inventory.film_id = film.film_id
INNER JOIN dvd_rentals.film_category
  ON film.film_id = film_category.film_id
INNER JOIN dvd_rentals.category
  ON film_category.category_id = category.category_id;

SELECT * FROM complete_joint_dataset limit 10;

 * postgresql://postgres@localhost:5432/
Done.
16044 rows affected.
10 rows affected.


customer_id,film_id,title,category_id,category_name
130,80,BLANKET BEVERLY,8,Family
459,333,FREAKY POCUS,12,Music
408,373,GRADUATE LORD,3,Children
333,535,LOVE SUICIDES,11,Horror
222,450,IDOLS SNATCHERS,3,Children
549,613,MYSTIC TRUMAN,5,Comedy
269,870,SWARM GOLD,11,Horror
239,510,LAWLESS VISION,2,Animation
126,565,MATRIX SNOWMAN,9,Foreign
399,396,HANGING DEEP,7,Drama


Compare the left join version with the inner join result:

In [41]:
%%sql
DROP TABLE IF EXISTS complete_left_join_dataset;
CREATE TEMP TABLE complete_left_join_dataset AS
SELECT
  rental.customer_id,
  inventory.film_id,
  film.title,
  category.name AS category_name
FROM dvd_rentals.rental
LEFT JOIN dvd_rentals.inventory
  ON rental.inventory_id = inventory.inventory_id
LEFT JOIN dvd_rentals.film
  ON inventory.film_id = film.film_id
LEFT JOIN dvd_rentals.film_category
  ON film.film_id = film_category.film_id
LEFT JOIN dvd_rentals.category
  ON film_category.category_id = category.category_id;

SELECT
  'left join' AS join_type,
  COUNT(*) AS final_record_count
FROM complete_left_join_dataset
UNION
SELECT
  'inner join' AS join_type,
  COUNT(*) AS final_record_count
FROM complete_joint_dataset;

 * postgresql://postgres@localhost:5432/
Done.
16044 rows affected.
2 rows affected.


join_type,final_record_count
inner join,16044
left join,16044
