<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/2_Date_Calculations/2_Date_Calculations.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Date Calculations

## Overview

### 🥅 Analysis Goals

Explore sales data using various PostgreSQL functions to derive insights about net_revenue trends, categories, and processing times. 

- Summarize net_revenue data by time dimensions (e.g., year, month, day).
- Analyze net_revenue by product categories.
- Understand order processing times and their trends over time.

This information can help identify trends, seasonal patterns, and potential bottlenecks in the supply chain, which are crucial for time series analysis and forecasting future performance.

### 📘 Concepts Covered

Date Calculations: 
- `INTERVAL`
- `AGE()`

---

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Switch from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (jupysql)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

### 💡 Note

This database includes a **date dimension** table: a static table with one row per day and attributes such as day of the week and month name. You could join to it to get the month or year of any date.

We **won't** use it here, because not every database you work with will have one. It's also important to understand how to calculate dates yourself for different types of analysis (as you'll see).

---
## INTERVAL

### 📝 Notes

`INTERVAL`

- **INTERVAL** represents a span of time, such as days, months, hours, or seconds.
- Used in date calculations (e.g., `CURRENT_DATE + INTERVAL '1 month'` adds one month to the current date).

**Note:** Similar to `CURRENT_DATE`, there's also `NOW()`, which returns the current date *and* time.
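
To make the month arithmetic concrete, here is a minimal Python sketch of what adding a month-based interval to a date involves. It is an analogue for illustration, not PostgreSQL code: `add_months` is a hypothetical helper, and it assumes the day is clamped to the end of a shorter target month, which is how PostgreSQL handles cases such as January 31 plus one month.

```python
import calendar
from datetime import date

def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the target month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))

today = date(2025, 1, 31)      # hypothetical stand-in for CURRENT_DATE
print(add_months(today, 1))    # like CURRENT_DATE + INTERVAL '1 month'
print(add_months(today, -60))  # like CURRENT_DATE - INTERVAL '5 years'
```

Day-based intervals are simpler, because they shift the date by a fixed number of days regardless of month length.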

### 💻 Final Result

- Restrict results to the last 5 years of sales, excluding the current year.

#### Filter Data by Time Intervals

**`INTERVAL`** and **`CURRENT_DATE`**

1. Starting from the previous query, return only orders from the 5 years before the current date.

In [6]:
%%sql

SELECT 
	DATE_PART('year', s.orderdate) AS order_year,
    p.categoryname, 
	SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM sales s
	LEFT JOIN product p ON s.productkey = p.productkey
WHERE
	s.orderdate >= CURRENT_DATE - INTERVAL '5 years' -- Added
GROUP BY
	order_year,
    p.categoryname
ORDER BY
	order_year,
    p.categoryname

Unnamed: 0,order_year,categoryname,net_revenue
0,2020.0,Audio,344725.7
1,2020.0,Cameras and camcorders,1253638.95
2,2020.0,Cell phones,1784693.27
3,2020.0,Computers,4846329.75
4,2020.0,Games and Toys,132404.4
5,2020.0,Home Appliances,707243.49
6,2020.0,"Music, Movies and Audio Books",640317.06
7,2020.0,TV and Video,906731.78
8,2021.0,Audio,393160.16
9,2021.0,Cameras and camcorders,1449672.87


2. Validate data by replacing `order_year` with `orderdate`:

    - Replace `DATE_PART('year', orderdate)` with `orderdate` in the `SELECT` clause.
    - Use the same `WHERE` clause and group the data by `orderdate`.

In [7]:
%%sql

SELECT 
	s.orderdate, -- Added
    p.categoryname, 
	SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM sales s
	LEFT JOIN product p ON s.productkey = p.productkey
WHERE
	s.orderdate >= CURRENT_DATE - INTERVAL '5 years' -- Added
GROUP BY
	s.orderdate,
    p.categoryname
ORDER BY
	s.orderdate,
    p.categoryname

Unnamed: 0,orderdate,categoryname,net_revenue
0,2020-01-08,Audio,4799.18
1,2020-01-08,Cameras and camcorders,6347.89
2,2020-01-08,Cell phones,17867.85
3,2020-01-08,Computers,29108.37
4,2020-01-08,Games and Toys,3358.19
...,...,...,...
11113,2024-04-20,Computers,58353.68
11114,2024-04-20,Games and Toys,1744.30
11115,2024-04-20,Home Appliances,1562.04
11116,2024-04-20,"Music, Movies and Audio Books",4949.43


3. Use `DATE_TRUNC` to calculate `start_date` and `end_date`:

    - Use `DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '5 years'` to find the start date (January 1 five years before the current year).
    - Subtract `INTERVAL '1 day'` from `DATE_TRUNC('year', CURRENT_DATE)` to find the end date (December 31 of the previous year).
    - Include these calculated dates in the `SELECT` clause for validation.
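
The two boundary expressions can be sketched in plain Python to show how they move with the run date. This is an illustrative analogue only: `full_year_window` and the run dates are hypothetical, and it works on plain dates, whereas `DATE_TRUNC` in PostgreSQL returns a timestamp.

```python
from datetime import date, timedelta

def full_year_window(today: date, years: int = 5) -> tuple[date, date]:
    """Return (start_date, end_date) covering the last `years` complete calendar years."""
    year_start = date(today.year, 1, 1)          # DATE_TRUNC('year', CURRENT_DATE)
    start_date = date(today.year - years, 1, 1)  # ... - INTERVAL '5 years'
    end_date = year_start - timedelta(days=1)    # ... - INTERVAL '1 day'
    return start_date, end_date

for run_date in (date(2025, 1, 8), date(2026, 6, 30)):  # hypothetical run dates
    print(run_date, full_year_window(run_date))
```

Running the same function a year later shifts both boundaries forward by one year, which is the behavior a hard-coded date range cannot provide.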


**💡 Note**

You could instead hard-code the dates in the `WHERE` clause:
```sql
s.orderdate::date BETWEEN '2019-01-01' AND '2023-12-31'
```
However, hard-coded dates don't update as time passes, so you'd have to remember to change them. A filter computed from `CURRENT_DATE` stays correct automatically.

In [8]:
%%sql

SELECT 
	s.orderdate,
    DATE_TRUNC('year', s.orderdate) AS order_year, -- Added
	DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '5 years' AS start_date, -- Added
	DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '1 day' AS end_date, -- Added
    p.categoryname, 
	SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM sales s
	LEFT JOIN product p ON s.productkey = p.productkey
WHERE 
    s.orderdate >= CURRENT_DATE - INTERVAL '5 years'
GROUP BY
	s.orderdate,
    p.categoryname
ORDER BY
	s.orderdate,
    p.categoryname

Unnamed: 0,orderdate,order_year,start_date,end_date,categoryname,net_revenue
0,2020-01-08,2020-01-01 00:00:00-08:00,2020-01-01 00:00:00-08:00,2024-12-31 00:00:00-08:00,Audio,4799.18
1,2020-01-08,2020-01-01 00:00:00-08:00,2020-01-01 00:00:00-08:00,2024-12-31 00:00:00-08:00,Cameras and camcorders,6347.89
2,2020-01-08,2020-01-01 00:00:00-08:00,2020-01-01 00:00:00-08:00,2024-12-31 00:00:00-08:00,Cell phones,17867.85
3,2020-01-08,2020-01-01 00:00:00-08:00,2020-01-01 00:00:00-08:00,2024-12-31 00:00:00-08:00,Computers,29108.37
4,2020-01-08,2020-01-01 00:00:00-08:00,2020-01-01 00:00:00-08:00,2024-12-31 00:00:00-08:00,Games and Toys,3358.19
...,...,...,...,...,...,...
11113,2024-04-20,2024-01-01 00:00:00-08:00,2020-01-01 00:00:00-08:00,2024-12-31 00:00:00-08:00,Computers,58353.68
11114,2024-04-20,2024-01-01 00:00:00-08:00,2020-01-01 00:00:00-08:00,2024-12-31 00:00:00-08:00,Games and Toys,1744.30
11115,2024-04-20,2024-01-01 00:00:00-08:00,2020-01-01 00:00:00-08:00,2024-12-31 00:00:00-08:00,Home Appliances,1562.04
11116,2024-04-20,2024-01-01 00:00:00-08:00,2020-01-01 00:00:00-08:00,2024-12-31 00:00:00-08:00,"Music, Movies and Audio Books",4949.43


4. Refine the `WHERE` clause to exclude partial years:

    - Move the calculated `start_date` and `end_date` expressions from the `SELECT` clause into the `WHERE` clause with `BETWEEN`.
    - Keep `orderdate` and `order_year` in the `SELECT` clause to confirm that the results now start on January 1.
    - Group by `orderdate` and `categoryname`, and order the results.
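
To see why the filter matters, the following Python sketch contrasts a rolling 5-year cutoff with the calendar-year `BETWEEN` window on a handful of made-up order dates. The run date and the `orders` list are assumptions for illustration; the comparison logic mirrors the two `WHERE` clauses used in this section.

```python
from datetime import date, timedelta

today = date(2025, 1, 8)  # hypothetical run date

# Rolling cutoff, like CURRENT_DATE - INTERVAL '5 years'
# (date.replace would need special handling if today were Feb 29)
rolling_start = today.replace(year=today.year - 5)

# Calendar-year bounds, like the DATE_TRUNC expressions in the WHERE clause
start_date = date(today.year - 5, 1, 1)
end_date = date(today.year, 1, 1) - timedelta(days=1)

# Made-up order dates around the window edges
orders = [date(2020, 1, 1), date(2020, 1, 7), date(2020, 1, 8),
          date(2024, 12, 31), date(2025, 1, 2)]

rolling = [d for d in orders if d >= rolling_start]
full_years = [d for d in orders if start_date <= d <= end_date]

print("rolling:   ", rolling)
print("full years:", full_years)
```

The rolling filter drops the first week of the earliest year and keeps current-year orders, while the calendar-year filter keeps whole years only.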

In [9]:
%%sql

SELECT 
	s.orderdate,
    DATE_TRUNC('year', s.orderdate) AS order_year,
    p.categoryname, 
	SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM sales s
	LEFT JOIN product p ON s.productkey = p.productkey
WHERE 
    s.orderdate BETWEEN DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '5 years' AND DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '1 day' --Updated
GROUP BY
	s.orderdate,
    p.categoryname
ORDER BY
	s.orderdate,
    p.categoryname

Unnamed: 0,orderdate,order_year,categoryname,net_revenue
0,2020-01-01,2020-01-01 00:00:00-08:00,Audio,5490.14
1,2020-01-01,2020-01-01 00:00:00-08:00,Cameras and camcorders,18880.06
2,2020-01-01,2020-01-01 00:00:00-08:00,Cell phones,22593.00
3,2020-01-01,2020-01-01 00:00:00-08:00,Computers,78554.54
4,2020-01-01,2020-01-01 00:00:00-08:00,Games and Toys,1476.43
...,...,...,...,...
11166,2024-04-20,2024-01-01 00:00:00-08:00,Computers,58353.68
11167,2024-04-20,2024-01-01 00:00:00-08:00,Games and Toys,1744.30
11168,2024-04-20,2024-01-01 00:00:00-08:00,Home Appliances,1562.04
11169,2024-04-20,2024-01-01 00:00:00-08:00,"Music, Movies and Audio Books",4949.43


---
## AGE

### 📝 Notes

`AGE()`

- **AGE()** calculates the interval between two dates or timestamps.
- Returns a human-readable interval (e.g., `1 year 2 mons 3 days`) when passed two arguments. With a single argument, it returns the difference from `CURRENT_DATE` (at midnight).
- Example: 
  ```sql
  AGE(deliverydate, orderdate) --gives the processing time.
  ``` 

### 💻 Final Result

- Compute average processing times and total sales, aggregated by time periods to understand the efficiency of the order fulfillment process and its impact on revenue.

#### Calculate Processing Time

**`AGE`**

1. Calculate the difference in time between the delivery date and order date using `AGE`:
    - Use `AGE(deliverydate, orderdate)` to compute the processing time for each order.
    - Exclude rows with `NULL` delivery dates in the `WHERE` clause.
    - Return the order date, processing time, and total sale amount for each transaction.

In [10]:
%%sql

SELECT 
    s.orderdate,
    AGE(s.deliverydate, s.orderdate) AS processing_time,
    s.quantity * s.netprice * s.exchangerate AS net_revenue
FROM 
    sales s
LEFT JOIN 
    product p ON s.productkey = p.productkey
WHERE 
    s.deliverydate IS NOT NULL
    AND s.orderdate BETWEEN DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '5 years' AND DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '1 day'
ORDER BY 
    s.orderdate;

Unnamed: 0,orderdate,processing_time,net_revenue
0,2020-01-01,1 days,99.47
1,2020-01-01,1 days,139.97
2,2020-01-01,1 days,669.39
3,2020-01-01,1 days,4090.60
4,2020-01-01,0 days,237.15
...,...,...,...
124446,2024-04-20,1 days,914.61
124447,2024-04-20,1 days,150.18
124448,2024-04-20,2 days,147.78
124449,2024-04-20,2 days,2019.62


2. Extract the DAY from the difference between delivery date and order date:

    - Use `EXTRACT(DAY FROM AGE(deliverydate, orderdate))` to extract the day component.
    - Display the `orderdate` as Month-Year using `TO_CHAR(orderdate, 'MM-YYYY')`.
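
One caveat is worth sketching: `EXTRACT(DAY FROM ...)` returns only the days field of the interval, so a span of a month or more loses its months. The Python below is an analogue, not PostgreSQL code. `age_fields` is a hypothetical helper that mimics the documented borrowing rule of `AGE()` (partial months use the month of the earlier date), assumes the later date is passed first, and ignores time of day. The sample dates are made up.

```python
import calendar
from datetime import date

def age_fields(later: date, earlier: date) -> tuple[int, int, int]:
    """Return (years, months, days) roughly the way AGE(later, earlier) reports them."""
    years = later.year - earlier.year
    months = later.month - earlier.month
    days = later.day - earlier.day
    if days < 0:
        # Borrow a month, using the length of the earlier date's month
        days += calendar.monthrange(earlier.year, earlier.month)[1]
        months -= 1
    if months < 0:
        months += 12
        years -= 1
    return years, months, days

order, delivery = date(2024, 1, 15), date(2024, 2, 18)  # made-up sample order
years, months, days = age_fields(delivery, order)
print("interval fields:", years, months, days)    # the day field alone is 3
print("elapsed days:", (delivery - order).days)   # the full span is 34 days
```

For orders delivered within a month, the day field and the elapsed days coincide. Assuming both columns are of type `date`, subtracting them directly in PostgreSQL (`deliverydate - orderdate`) returns the elapsed days as an integer and avoids the issue.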

In [11]:
%%sql

SELECT 
    TO_CHAR(s.orderdate, 'MM-YYYY') AS order_month,
    EXTRACT(DAY FROM AGE(s.deliverydate, s.orderdate)) AS processing_time, -- Update
    s.quantity * s.netprice * s.exchangerate AS net_revenue
FROM 
    sales s
LEFT JOIN 
    product p ON s.productkey = p.productkey
WHERE 
    s.deliverydate IS NOT NULL
    AND s.orderdate BETWEEN DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '5 years' AND DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '1 day'
ORDER BY 
    order_month;

Unnamed: 0,order_month,processing_time,net_revenue
0,01-2020,3,835.20
1,01-2020,0,776.57
2,01-2020,0,3593.57
3,01-2020,0,439.51
4,01-2020,0,66.11
...,...,...,...
124446,12-2023,0,27.56
124447,12-2023,0,20.45
124448,12-2023,0,459.00
124449,12-2023,0,257.50


3. Aggregate data by month to get total sales and average processing time:

    - Calculate the average processing time using `AVG(EXTRACT(DAY FROM AGE(...)))`.
    - Compute the total sales using `SUM(quantity * netprice * exchangerate)`.
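    - Group by `TO_CHAR(orderdate, 'MM-YYYY')` and order the results. The query below still groups by `s.orderdate`, so it returns one row per day with repeated month labels; grouping by `order_month` in the next step collapses these into one row per month.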
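
The effect of the grouping key can be sketched in Python with a few made-up rows. `aggregate` is a hypothetical helper, and the rows are invented for illustration; the point is only that grouping by the full date yields one group per day, while grouping by the month label yields one group per month.

```python
from collections import defaultdict
from datetime import date

# Made-up rows: (orderdate, processing_days, net_revenue)
rows = [
    (date(2020, 1, 1), 1, 100.0),
    (date(2020, 1, 2), 3, 50.0),
    (date(2020, 2, 5), 0, 20.0),
    (date(2021, 1, 9), 2, 30.0),
]

def aggregate(rows, key):
    """Group rows by key(orderdate); return {group: (avg_days, total_revenue)}."""
    groups = defaultdict(list)
    for orderdate, days, revenue in rows:
        groups[key(orderdate)].append((days, revenue))
    return {
        group: (sum(d for d, _ in items) / len(items), sum(r for _, r in items))
        for group, items in groups.items()
    }

by_day = aggregate(rows, key=lambda d: d)                      # like GROUP BY s.orderdate
by_month = aggregate(rows, key=lambda d: d.strftime("%m-%Y"))  # like GROUP BY order_month

print(len(by_day), "daily groups;", len(by_month), "monthly groups")
print(sorted(by_month))  # text labels sort month-first, not chronologically
```

Because `MM-YYYY` labels are text, sorting them places every January before any February. A `YYYY-MM` format would sort chronologically if that order is preferred.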

In [12]:
%%sql

SELECT 
    TO_CHAR(s.orderdate, 'MM-YYYY') AS order_month,
    AVG(EXTRACT(DAY FROM AGE(s.deliverydate, s.orderdate))) AS avg_processing_time, -- Update
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM 
    sales s
LEFT JOIN 
    product p ON s.productkey = p.productkey
WHERE 
    s.deliverydate IS NOT NULL
    AND s.orderdate BETWEEN DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '5 years' AND DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '1 day'
GROUP BY 
    s.orderdate
ORDER BY 
    order_month;

Unnamed: 0,order_month,avg_processing_time,net_revenue
0,01-2020,1.0090090090090090,105089.84
1,01-2020,1.2127659574468085,41287.89
2,01-2020,0.70370370370370370370,39548.56
3,01-2020,2.1250000000000000,13395.79
4,01-2020,0.11111111111111111111,17744.61
...,...,...,...
1527,12-2023,1.8811881188118812,86605.92
1528,12-2023,1.8689655172413793,117527.00
1529,12-2023,1.9448818897637795,119874.69
1530,12-2023,1.7090909090909091,141981.34


4. Reformat results:

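    - Use `ROUND()` to round the average processing time and net revenue to two decimal places.
    - Cast each value to `NUMERIC` first, because PostgreSQL's two-argument `ROUND()` accepts only `numeric` input.
    - Group by `order_month` so that each month appears once.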

In [13]:
%%sql

SELECT 
    TO_CHAR(s.orderdate, 'MM-YYYY') AS order_month,
    ROUND(CAST(AVG(EXTRACT(DAY FROM AGE(s.deliverydate, s.orderdate))) AS NUMERIC), 2) AS avg_processing_time, -- Update
    ROUND(CAST(SUM(s.quantity * s.netprice * s.exchangerate) AS NUMERIC), 2) AS net_revenue -- Update
FROM 
    sales s
LEFT JOIN 
    product p ON s.productkey = p.productkey
WHERE 
    s.deliverydate IS NOT NULL
    AND s.orderdate BETWEEN DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '5 years' AND DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '1 day'
GROUP BY 
    order_month
ORDER BY 
    order_month;

Unnamed: 0,order_month,avg_processing_time,net_revenue
0,01-2020,1.01,2132132.93
1,01-2021,0.97,669787.93
2,01-2022,1.46,3647525.92
3,01-2023,1.69,3664431.34
4,01-2024,1.75,2677498.55
5,02-2020,0.8,2713593.19
6,02-2021,1.12,1094980.88
7,02-2022,1.53,4840124.87
8,02-2023,1.73,4465204.57
9,02-2024,1.64,3542322.55


5. Look at the yearly data:

    - Replace the monthly grouping with a yearly grouping by changing `TO_CHAR(orderdate, 'MM-YYYY')` to `DATE_PART('year', orderdate)`.
    - Group data by `order_year` and order the results.

In [14]:
%%sql

SELECT 
    DATE_PART('year', s.orderdate) AS order_year, -- Update
    ROUND(CAST(AVG(EXTRACT(DAY FROM AGE(s.deliverydate, s.orderdate))) AS NUMERIC), 2) AS avg_processing_time,
    ROUND(CAST(SUM(s.quantity * s.netprice * s.exchangerate) AS NUMERIC), 2) AS net_revenue
FROM 
    sales s
LEFT JOIN 
    product p ON s.productkey = p.productkey
WHERE 
    s.deliverydate IS NOT NULL
    AND s.orderdate BETWEEN DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '5 years' AND DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '1 day'
GROUP BY 
    order_year -- Update
ORDER BY 
    order_year; -- Update

Unnamed: 0,order_year,avg_processing_time,net_revenue
0,2020.0,0.93,11218435.79
1,2021.0,1.36,21357976.66
2,2022.0,1.62,44864557.21
3,2023.0,1.75,33108565.51
4,2024.0,1.67,8396527.38
