Skip to content

Cleaned and filtered, grouped data in PostgreSQL, joined tables of data, created a data dictionary Wrote moderately complex commands in PostgreSQL to answer some basic business questions

Notifications You must be signed in to change notification settings

halinakryvanos/Rockbuster-Stealth-CareerFoundry-Practice

Repository files navigation

Rockbuster Stealth

Introduction.

Rockbuster Stealth LLC is a movie rental company that used to have stores all over the world. Facing stiff competition from streaming services like Netflix and Amazon Prime, Rockbuster Stealth planed to use its existing movie licenses to launch an online video rental service to stay competitive. As a Data Analyst in the Business Intelligence (BI) department of Rockbuster Stealth, I used SQL to analyze data and answer specific business questions that other departments had while developing a strategy for launching a new online video service.

Key Questions and Objectives

The Rockbuster Stealth Management Board has asked a series of business questions and they expected data-driven answers that they could use for their 2020 company strategy.
Here are the main questions they’d like to answer:

- Which movies contributed the most/least to revenue gain?
- What was the average rental duration for all videos?
- Which countries are Rockbuster customers based in?
- Where are customers with a high lifetime value based?
- Do sales figures vary between geographic regions?

Data Set

Data set that contains information about Rockbuster’s film inventory, customers, and payments, among other things.

Final Analysis Criteria

Final project should be evaluated on ability to:

  • Write moderately complex SQL queries to answer business questions.
  • Present SQL results to business managers by creating visualizations in Tableau.
  • Present SQL results to technical colleagues using Excel and by creating a data dictionary.

Developed skills:

  • Introduction to Relational Databases (online analytical processing databases (OLAP) and relational database management systems (RDBMS))

  • Data Storage & Structure (data storage structures, including keys and indexes, identify common data types in relational databases, the different types of relational database schemas and their components).

  • Extracting the ERD
    This is an image

  • Creating the Data Dictionary

  • Filtering Data (Filter and order data using the WHERE and HAVING clauses).

SELECT rating,
COUNT(title) AS number_of_film,
AVG(rental_rate) AS average_movie_rental_rate,
MAX(rental_duration) AS maximum_rental_duration,
MIN(rental_duration) AS minimum_rental_duration
FROM film
WHERE rating IN ('PG', 'G')
GROUP BY rating;

image

  • Check for and clean dirty data (Duplicate Data from Film)
    image
  • Summary for customer (descriptive statistics)
--descriptive statistics for customer table
SELECT MIN(customer_id) AS min_customer_id,
MAX(customer_id) AS max_customer_id,
AVG(customer_id) AS avg_customer_id,
MIN(store_id) AS min_store_id,
MAX(store_id) AS max_store_id,
AVG(store_id) AS avg_store_id,
MIN(address_id) AS min_address_id,
MAX(address_id) AS max_address_id,
AVG(address_id) AS avg_address_id,
MIN(create_date) AS min_create_date,
MAX(create_date) AS max_create_date,
MODE() WITHIN GROUP (ORDER BY create_date) AS create_date,
MIN(last_update) AS min_last_update,
MAX(last_update) AS max_last_update,
MODE() WITHIN GROUP (ORDER BY last_update) AS last_update,
MODE() WITHIN GROUP (ORDER BY first_name) AS first_name,
MODE() WITHIN GROUP (ORDER BY last_name) AS last_name,
MODE() WITHIN GROUP (ORDER BY email) AS email,
MODE() WITHIN GROUP (ORDER BY create_date) AS create_date,
MODE() WITHIN GROUP (ORDER BY active) AS mode_active
FROM customer;

image

  • Summary for film (descriptive statistics)
--descriptive statistics for film table
SELECT MIN(rental_rate) AS min_renatl_rate,
MAX(rental_rate) AS max_rental_rate,
AVG(rental_rate) AS avg_renatal_rate,
MIN(rental_duration) AS min_rental_duration,
MAX(rental_duration) AS max_rental_duration,
AVG(rental_duration) AS avg_rental_duration,
MIN(film_id) AS min_film,
MAX(film_id) AS max_film,
AVG(film_id) AS avg_film,
MIN(language_id) AS min_language,
MAX(language_id) AS max_language,
AVG(language_id) AS avg_language,
MIN(length) AS min_length,
MAX(length) AS max_length,
AVG(length) AS avg_length,
MIN(replacement_cost) AS min_replacement_cost,
MAX(replacement_cost) AS max_replacement_cost,
AVG(replacement_cost) AS avg_replacement_cost,
MODE() WITHIN GROUP (ORDER BY rating) AS rating_value,
MODE() WITHIN GROUP (ORDER BY special_features) AS feature_value,
MODE() WITHIN GROUP (ORDER BY release_year) AS release_year,
MODE() WITHIN GROUP (ORDER BY title) AS title_value,
MODE() WITHIN GROUP (ORDER BY fulltext) AS fulltext
FROM film

image

  • Joining Tables of Data
    a query to find the top 10 countries for Rockbuster in terms of customer numbers
    image
    a query to find the top 10 cities within the top 10 countries identified in step 1
    image
    a query to find the top 5 customers in the top 10 cities who have paid the highest total amounts to Rockbuster. The customer team would like to reward them for their loyalty!
    image
  • Performing Subqueries
SELECT AVG(payment) AS average
FROM
(SELECT A.customer_id, A.first_name, A.last_name, D.country, C.city,
SUM (E.amount) AS payment
FROM customer A
INNER JOIN address B ON A.address_id = B.address_id
INNER JOIN city C ON B.city_id = C.city_id
INNER JOIN country D ON C.country_id = D.country_id
INNER JOIN payment E ON A.customer_id = E.customer_id
WHERE city IN ('Aurora', 'Atlixco', 'Xintai', 'Adoni', 'Dhule(Dhulia)', 'Kurashiki',
'Pingxiang', 'Sivas', 'Celaya', 'So Leopoldo')
GROUP BY A.customer_id, first_name, last_name, city, country
ORDER BY payment DESC LIMIT 5) AS total_amount_paid;

image

Find out how many of the top 5 customers are based within each country

SELECT DISTINCT (A.country),
COUNT (DISTINCT D.customer_id) AS all_customer_count,
COUNT (DISTINCT A.country) AS top_customer_count
FROM country A
INNER JOIN city B ON A.country_id = B.country_id
INNER JOIN address C ON B.city_id = C.city_id
INNER JOIN customer D ON C.address_id = D.address_id
LEFT JOIN (SELECT A.customer_id, A.first_name, A.last_name, E.country, B.city,
SUM (C.amount) AS total_paid
FROM customer A
INNER JOIN address D ON A.address_id = D.address_id
INNER JOIN city B ON D.city_id = B.city_id
INNER JOIN country E ON B.country_id = E.country_id
INNER JOIN payment C ON A.customer_id = C.customer_id
WHERE E.country IN ('India', 'China', 'United States', 'Japan', 'Mexico', 'Brazil', 'Russian Federation',
'Philippines', 'Turkey', 'Indonesia')
AND B.city IN ('Aurora', 'Atlixco', 'Xintai', 'Adoni', 'Dhule(Dhulia)', 'Kurashiki',
'Pingxiang', 'Sivas', 'Celaya', 'So Leopoldo')
GROUP BY A.customer_id, E.country, B.city
ORDER BY total_paid DESC LIMIT 5) AS top_5_customers
ON A.country = top_5_customers.country
GROUP BY A.country, top_5_customers
ORDER BY all_customer_count DESC LIMIT 5;

image.

  • Common Table Expressions
    Answer the business questions from steps above using CTEs
WITH average_cte (customer_id, first_name, last_name,
country, city, sum_amount) AS
(SELECT A.customer_id, B.first_name, B.last_name, D.city, E.country,
SUM (A.amount) AS total_amount_paid
FROM payment A
INNER JOIN customer B ON A.customer_id = B.customer_id
INNER JOIN address C ON B.address_id = C.address_id
INNER JOIN city D ON C.city_id = D.city_id
INNER JOIN country E ON D.country_id = E.country_id
WHERE country IN ('India', 'China', 'United States', 'Japan', 'Mexico', 'Brazil', 'Russia Federation',
'Philippines', 'Turkey', 'Indonesia')
AND city IN ('Aurora', 'Atlixco', 'Xintai', 'Adoni', 'Dhule(Dhulla)', 'Kurashiki',
'Pingxiang', 'Sivas', 'Celaya', 'So Leopoldo')
GROUP BY A.customer_id, B.first_name, B.last_name, D.city, E.country
ORDER BY total_amount_paid DESC LIMIT 5)
SELECT AVG (sum_amount)
FROM average_cte

image

WITH top_5_cte (customer_id, first_name, last_name,
country, city, sum_amount) AS
(SELECT A.customer_id, B.first_name, B.last_name, D.city, E.country,
SUM (A.amount) AS total_amount_paid
FROM payment A
INNER JOIN customer B ON A.customer_id = B.customer_id
INNER JOIN address C ON B.address_id = C.address_id
INNER JOIN city D ON C.city_id = D.city_id
INNER JOIN country E ON D.country_id = E.country_id
WHERE country IN ('India', 'China', 'United States', 'Japan', 'Mexico', 'Brazil', 'Russia Federation',
'Philippines', 'Turkey', 'Indonesia')
AND city IN ('Aurora', 'Atlixco', 'Xintai', 'Adoni', 'Dhule(Dhulla)', 'Kurashiki',
'Pingxiang', 'Sivas', 'Celaya', 'So Leopoldo')
GROUP BY A.customer_id, B.first_name, B.last_name, D.city, E.country
ORDER BY total_amount_paid DESC LIMIT 5)
SELECT country.country,
COUNT (DISTINCT customer.customer_id) AS all_customer_count,
COUNT (DISTINCT country.country) AS top_customer_count
FROM top_5_cte
LEFT JOIN customer ON customer.customer_id = customer.customer_id
LEFT JOIN address ON customer.address_id = address.address_id
LEFT JOIN city ON address.city_id = city.city_id
LEFT JOIN country ON city.country_id = country.country_id
GROUP BY country.country
ORDER BY all_customer_count DESC LIMIT 5

image

Idefined the CTE with the ‘WITH’ clause and gave it an appropriate expression name. Then, I listed columns that will be listed in the CTE and used the AS keyword. Then used SELECT again to display the result.
When joining the customer table with city and country, only 2 columns are added each time, whereas when the payment table is joined with the customer data, it becomes much, much larger. Doing aggregations with a larger table is the biggest contributor to that high cost as well. Also, when the large table of customer data was compiled via JOINs, there was no filter.
Using subqueries instead of CTE is such a lower cost. I imagine it’s because each subquery is only identifying exactly what it needs and because the scope of each subquery is small, aggregations don’t cost as much. The time it took to run each query varied; this is because the database is stored locally and it’s dependent on the other processes being executed by my computer.

image

The problems I've run into include determining what is needed to create a CTE. I know that unlike subrequests, CTEs are defined at the beginning of the request. Also, writing an entirely new SELECT statement that queries a temporary table created with a CTE seems odd at first. However, if you spend more time reading the syntax, everything will be much easier to understand.

Screenshot 2022-11-27 at 13 28 19

About

Cleaned and filtered, grouped data in PostgreSQL, joined tables of data, created a data dictionary Wrote moderately complex commands in PostgreSQL to answer some basic business questions

Topics

Resources

Stars

Watchers

Forks