# Milestone 4

Your boss is excited that you now have the schema for the database and all the sales data is in one location.
Since you've done such a great job, they would like you to get some up-to-date metrics from the data.
The business can then start making more data-driven decisions and get a better understanding of its sales.
In this milestone, you will be tasked with answering business questions and extracting the data from the database using SQL.

1. How many stores does the business have and in which countries?

The Operations team would like to know which countries we currently operate in and which country now has the most stores. Perform a query on the database to get this information; it should return the following:

    +----------+-----------------+
    | country  | total_no_stores |
    +----------+-----------------+
    | GB       |             265 |
    | DE       |             141 |
    | US       |              34 |
    +----------+-----------------+
Note: DE is short for Deutschland (Germany).

The information for this is in the dim_stores_details table. We want to SELECT country_code and COUNT the number of stores, grouped by country_code.

In [None]:
SELECT 
    country_code AS country, COUNT(country_code) AS total_no_stores
FROM 
    dim_stores_details 
GROUP BY 
    country
ORDER BY total_no_stores DESC
LIMIT 3;

My result shows one more store in GB than expected. This must be the web store, whose row carried a GB country code. I have now amended this.
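One way to confirm where the extra GB store comes from is to look at the web store's row directly. This is only a minimal sketch: it assumes the web store's store_code starts with WEB (as described in question 4) and uses the store_type and country_code columns that appear in the later queries.

```sql
-- Inspect the web store row(s) to see which country_code they carry.
-- Assumption: web store codes begin with 'WEB'.
SELECT
    store_code,
    store_type,
    country_code
FROM
    dim_stores_details
WHERE
    store_code ILIKE 'WEB%';
```

If this returns a single row with country_code GB, that row accounts for the extra store in the count.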

2. Which locations have the most stores?

The business stakeholders would like to know which locations currently have the most stores.

They would like to close some stores before opening more in other locations.

Find out which locations have the most stores currently. The query should return the following:

In [None]:
SELECT 
    locality, 
    COUNT(locality) AS total_no_stores
FROM
    dim_stores_details
GROUP BY
    locality
ORDER BY 
    total_no_stores DESC
LIMIT 7;

3. Which months typically produce the highest cost of sales?

Query the database to find out which months typically have the most sales. Your query should return the following information:

    +-------------+-------+
    | total_sales | month |
    +-------------+-------+
    |   673295.68 |     8 |
    |   668041.45 |     1 |
    |   657335.84 |    10 |
    |   650321.43 |     5 |
    |   645741.70 |     7 |
    |   645463.00 |     3 |
    +-------------+-------+


We need to join orders_table to dim_date_times on date_uuid to get the dates, and to dim_products on product_code to get the product_price. Multiplying product_price by product_quantity gives the amount of each individual transaction, which we then sum per month.

We will use EXTRACT('MONTH' FROM date) to get the month from the date column.

In [None]:
-- Total sales per calendar month (summed across all years), highest first.
SELECT 
    -- the sum is cast to numeric because ROUND(value, 2) appears to need it
    ROUND(CAST(SUM(product_price * product_quantity) AS numeric), 2) AS total_sales, 
    EXTRACT('MONTH' FROM date) AS month
FROM
    orders_table 
    INNER JOIN dim_date_times ON orders_table.date_uuid = dim_date_times.date_uuid
    INNER JOIN dim_products ON orders_table.product_code = dim_products.product_code
GROUP BY EXTRACT('MONTH' FROM date)
ORDER BY total_sales DESC
LIMIT 6;

4. How many sales are coming from online?

The company is looking to increase its online sales.

They want to know how many sales are happening online vs offline.

Calculate how many products were sold and the amount of sales made for online and offline purchases.

You should get the following information:

    +------------------+-------------------------+----------+
    | numbers_of_sales | product_quantity_count  | location |
    +------------------+-------------------------+----------+
    |            26957 |                  107739 | Web      |
    |            93166 |                  374047 | Offline  |
    +------------------+-------------------------+----------+

Web is any transaction where the store_code begins with WEB.

A "sale" is an entry in the orders table, so we count date_uuid, which I believe is the only unique value in this table, and also sum the product quantity. The "location" can be derived directly from store_code with a CASE expression, so no join on the dim_stores_details table is needed.

In [None]:
SELECT 
    COUNT(date_uuid) AS number_of_sales, 
    SUM(product_quantity) AS product_quantity_count,
    CASE
        WHEN store_code ILIKE 'WEB%' THEN 'Web'
        ELSE 'Offline'
    END AS location
FROM orders_table
GROUP BY location
ORDER BY number_of_sales;


5. What percentage of sales comes through each type of store?

The sales team wants to know which of the different store types generates the most revenue so they know where to focus.

Find out the total and percentage of sales coming from each of the different store types.

The query should return:

    +-------------+-------------+---------------------+
    | store_type  | total_sales | percentage_total(%) |
    +-------------+-------------+---------------------+
    | Local       |  3440896.52 |               44.87 |
    | Web portal  |  1726547.05 |               22.44 |
    | Super Store |  1224293.65 |               15.63 |
    | Mall Kiosk  |   698791.61 |                8.96 |
    | Outlet      |   631804.81 |                8.10 |
    +-------------+-------------+---------------------+

So, store_type is in dim_stores_details, and total sales is product_quantity (from orders_table) multiplied by product_price (from dim_products), so all three tables need to be joined.

My percentage figures differ from the expected ones, which is odd because mine are consistent with each store type's share of the total. The expected table itself shows the same mismatch: Local's 3440896.52 out of a column total of 7722333.64 is about 44.56%, not 44.87%.


In [None]:
SELECT 
    store_type, 
    -- casting with :: rather than CAST(...); either way, a cast to numeric appears to be necessary for ROUND to work
    ROUND(SUM(product_quantity::numeric * product_price::numeric), 2) AS total_sales,
    ROUND(SUM(product_quantity::numeric * product_price::numeric) / (
        SELECT 
            SUM(product_quantity * product_price)
        FROM 
            orders_table
        INNER JOIN dim_products ON orders_table.product_code = dim_products.product_code
    )::numeric * 100, 2) AS "percentage_total(%)"
FROM 
    orders_table
INNER JOIN dim_stores_details ON orders_table.store_code = dim_stores_details.store_code 
INNER JOIN dim_products ON orders_table.product_code = dim_products.product_code
GROUP BY store_type
ORDER BY total_sales DESC;
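As an alternative to the scalar subquery, the grand total can be computed with a window function over the grouped rows. This is a sketch rather than a drop-in replacement. It takes the denominator from the same joined rows as the numerator, whereas the subquery above does not join dim_stores_details, so the two agree only if every order matches a store.

```sql
-- Share of total sales per store type, using a window function for the grand total.
SELECT
    store_type,
    ROUND(SUM(product_quantity * product_price)::numeric, 2) AS total_sales,
    ROUND(
        (100 * SUM(product_quantity * product_price)
             / SUM(SUM(product_quantity * product_price)) OVER ())::numeric,
        2
    ) AS "percentage_total(%)"
FROM
    orders_table
INNER JOIN dim_stores_details ON orders_table.store_code = dim_stores_details.store_code
INNER JOIN dim_products ON orders_table.product_code = dim_products.product_code
GROUP BY store_type
ORDER BY total_sales DESC;
```

The inner SUM is the per-group total, and the outer SUM(...) OVER () adds those totals across all groups.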

6. Which month in each year produced the highest cost of sales?

The company stakeholders want assurances that the company has been doing well recently.

Find which months in which years have had the most sales historically.

The query should return the following information:

    +-------------+------+-------+
    | total_sales | year | month |
    +-------------+------+-------+
    |    27936.77 | 1994 |     3 |
    |    27356.14 | 2019 |     1 |
    |    27091.67 | 2009 |     8 |
    |    26679.98 | 1997 |    11 |
    |    26310.97 | 2018 |    12 |
    |    26277.72 | 2019 |     8 |
    |    26236.67 | 2017 |     9 |
    |    25798.12 | 2010 |     5 |
    |    25648.29 | 1996 |     8 |
    |    25614.54 | 2000 |     1 |
    +-------------+------+-------+

In [None]:
-- total sales is the SUM of product_price (from dim_products) * product_quantity (from orders_table)
-- we get year and month by using EXTRACT on the date column of dim_date_times

SELECT 
    ROUND(SUM(product_price::numeric * product_quantity::numeric), 2) AS total_sales,
    EXTRACT('YEAR' FROM date) AS year,
    EXTRACT('MONTH' FROM date) AS month
FROM    
    dim_date_times 
    INNER JOIN orders_table ON orders_table.date_uuid = dim_date_times.date_uuid
    INNER JOIN dim_products ON orders_table.product_code = dim_products.product_code
GROUP BY EXTRACT('YEAR' FROM date), EXTRACT('MONTH' FROM date)
ORDER BY total_sales DESC
LIMIT 10;

7. What is our staff headcount?

The operations team would like to know the overall staff numbers in each location around the world. Perform a query to determine the staff numbers in each of the countries the company sells in.

The query should return the values:

    +---------------------+--------------+
    | total_staff_numbers | country_code |
    +---------------------+--------------+
    |               13307 | GB           |
    |                6123 | DE           |
    |                1384 | US           |
    +---------------------+--------------+

We need to sum staff_numbers from dim_stores_details, grouping by country_code. Rows with an N/A country_code are relabelled as GB first so that they are counted under GB.

In [None]:
-- SELECT 
--     SUM(staff_numbers),
--     country_code
-- FROM dim_stores_details
-- GROUP BY country_code;

In [None]:
-- Note: the UPDATE inside this CTE permanently changes dim_stores_details (N/A becomes GB)
WITH cte AS (
    UPDATE dim_stores_details
    SET country_code = REPLACE(country_code, 'N/A', 'GB')
    RETURNING *
)
SELECT 
    SUM(staff_numbers) AS total_staff_numbers,
    country_code
FROM cte
GROUP BY country_code
ORDER BY total_staff_numbers DESC;
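The query above updates the table as a side effect. If only the report is needed, the relabelling could instead be done in the SELECT itself. This is a sketch, assuming the affected rows hold the literal value 'N/A' in country_code, as the REPLACE call implies.

```sql
-- Read-only variant: treat 'N/A' as 'GB' for reporting without modifying the table.
SELECT
    SUM(staff_numbers) AS total_staff_numbers,
    CASE
        WHEN country_code = 'N/A' THEN 'GB'
        ELSE country_code
    END AS country_code
FROM
    dim_stores_details
GROUP BY 2  -- group by the second output column (the CASE expression), not the raw column
ORDER BY total_staff_numbers DESC;
```

GROUP BY 2 is used because the output alias country_code has the same name as the underlying column. Grouping by the name would pick the raw column and leave N/A as a separate group.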

8. Which German store type is selling the most?

The sales team is looking to expand their territory in Germany. Determine which type of store is generating the most sales in Germany.

The query will return:

    +--------------+-------------+--------------+
    | total_sales  | store_type  | country_code |
    +--------------+-------------+--------------+
    |   198373.57  | Outlet      | DE           |
    |   247634.20  | Mall Kiosk  | DE           |
    |   384625.03  | Super Store | DE           |
    |  1109909.59  | Local       | DE           |
    +--------------+-------------+--------------+

So we need to join dim_stores_details, orders_table and dim_products, and filter on the German country code.



In [None]:
SELECT 
    ROUND(SUM(product_price * product_quantity)::numeric, 2) AS total_sales, -- only need to cast output
    store_type,
    country_code
FROM
    orders_table 
    INNER JOIN dim_stores_details ON orders_table.store_code = dim_stores_details.store_code
    INNER JOIN dim_products ON orders_table.product_code = dim_products.product_code
WHERE 
    country_code = 'DE'
GROUP BY
    store_type, country_code
ORDER BY
    total_sales;

9. How quickly is the company making sales?

Sales would like to get an accurate metric for how quickly the company is making sales.

Determine the average time taken between each sale, grouped by year. The query should return the following information:

    +------+-------------------------------------------------------+
    | year |                           actual_time_taken           |
    +------+-------------------------------------------------------+
    | 2013 | "hours": 2, "minutes": 17, "seconds": 12, "millise... |
    | 1993 | "hours": 2, "minutes": 15, "seconds": 35, "millise... |
    | 2002 | "hours": 2, "minutes": 13, "seconds": 50, "millise... | 
    | 2022 | "hours": 2, "minutes": 13, "seconds": 6,  "millise... |
    | 2008 | "hours": 2, "minutes": 13, "seconds": 2,  "millise... |
    +------+-------------------------------------------------------+
 
Hint: You will need the SQL command LEAD.

For this I need to reinstate the timestamp. I cannot do this in place because the table now has primary and foreign key constraints, so to get something working I have uploaded a new table, dim_date_times2. In the final version this needs to be swapped back.

We can list the date of every transaction by joining the dim_date_times2 table with the orders table. Then we need to calculate the difference between each transaction and the next, and then group the results by year, which we will need to extract from the date.


In [None]:

-- Average gap between consecutive sales per year, longest first.
-- Subqueries in FROM are given aliases, which PostgreSQL versions before 16 require.
SELECT year, AVG(actual_time_taken) AS actual_time_taken
FROM (
    -- total gap per timestamp; rows sharing a timestamp collapse into one
    SELECT 
        year, 
        SUM(difference_between_timestamps) AS actual_time_taken
    FROM (
        -- gap between each timestamp and the previous one
        SELECT 
            EXTRACT('YEAR' FROM date) AS year, 
            date, 
            date - LAG(date) OVER (ORDER BY date) AS difference_between_timestamps
        FROM dim_date_times
    ) AS gaps
    GROUP BY year, date
) AS gaps_per_timestamp
GROUP BY year
ORDER BY actual_time_taken DESC
LIMIT 5;

-- TODO: can this be simplified? And why are the times slightly off?

Results are mostly correct, although a few milliseconds off.
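The hint suggests LEAD, whereas the query above uses LAG. A simpler LEAD-based version might look like the following. It is a sketch only, assuming the date column of dim_date_times holds a full timestamp so that subtracting two values yields an interval.

```sql
-- Average time from each sale to the next one, per year, longest first.
SELECT
    year,
    AVG(time_to_next_sale) AS actual_time_taken
FROM (
    SELECT
        EXTRACT('YEAR' FROM date) AS year,
        -- gap to the following timestamp; NULL for the last row, which AVG ignores
        LEAD(date) OVER (ORDER BY date) - date AS time_to_next_sale
    FROM dim_date_times
) AS gaps
GROUP BY year
ORDER BY actual_time_taken DESC
LIMIT 5;
```

LAG attributes each gap to the year of the later sale and LEAD to the year of the earlier one, so the two approaches differ only for gaps that span a year boundary. This version also averages over every row rather than over distinct timestamps. Whether either difference explains the few milliseconds of discrepancy is untested.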