<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/2_Date_Time/3_Date_Differences.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Date & Time Differences

## Overview

### 🥅 Analysis Goals

Continue the time series analysis by examining the relationship between sales (net revenue) and delivery processing times:

- **Analyze processing times**: Calculate the time difference between order dates and delivery dates to evaluate operational efficiency.  
- **Aggregate and summarize sales**: Group sales data by time intervals (month, year) to identify trends and patterns in revenue and processing times.

### 📘 Concepts Covered

- `INTERVAL`
- `AGE()`

[Source Documentation on Date/Time Functions.](https://www.postgresql.org/docs/current/functions-datetime.html)

---

In [20]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Switch from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format



---
## INTERVAL

### 📝 Notes

`INTERVAL`

- **INTERVAL** represents a span of time, such as days, months, hours, or seconds.
- Commonly used for date arithmetic (e.g., `CURRENT_DATE + INTERVAL '1 month'` adds one month to the current date).

- Syntax:
    ```sql
    SELECT INTERVAL 'value unit'
    ```
- Units:
    - years
    - months 
    - days
    - hours
    - minutes
    - seconds
    - microseconds
    - millenniums
    - centuries
    - decades
    - weeks
    - quarters

In [33]:
%%sql

SELECT INTERVAL '5 years'

Unnamed: 0,interval
0,1825 days
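The `1825 days` shown above is 5 × 365. It likely reflects the interval being converted to a fixed-length day count when the result is turned into a pandas DataFrame (an assumption about the Python-side conversion, not something this notebook states). In date arithmetic, PostgreSQL applies `INTERVAL '5 years'` as calendar years, as the filtered results later in this section show. A minimal Python sketch of the difference between the two readings, using the `CURRENT_DATE` from this notebook's outputs:

```python
from datetime import date, timedelta

today = date(2025, 2, 14)  # the CURRENT_DATE shown in this notebook's outputs

# Fixed-length reading: 5 years treated as 5 * 365 = 1825 days
fixed_cutoff = today - timedelta(days=5 * 365)

# Calendar reading: same month and day, five years earlier
# (works here because Feb 14 exists in every year; Feb 29 would need special handling)
calendar_cutoff = today.replace(year=today.year - 5)

print(fixed_cutoff)     # 2020-02-16
print(calendar_cutoff)  # 2020-02-14
```

The two cutoffs differ by two days because the span includes two leap days (2020-02-29 and 2024-02-29).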


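In [2_Date_Filtering.ipynb](2_Date_Filtering.ipynb), with only the functions covered up to that point, we had to use this more verbose filter to get orders from the last 5 years. Because it compares calendar years, it returns everything from January 1 of the year five years back.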

In [22]:
%%sql

SELECT 
	CURRENT_DATE,
	orderdate
FROM sales
WHERE
	EXTRACT(YEAR FROM orderdate) >= EXTRACT(YEAR FROM CURRENT_DATE) - 5  -- last 5 years

Unnamed: 0,current_date,orderdate
0,2025-02-14,2020-01-01
1,2025-02-14,2020-01-01
2,2025-02-14,2020-01-01
3,2025-02-14,2020-01-01
4,2025-02-14,2020-01-01
...,...,...
124446,2025-02-14,2024-04-20
124447,2025-02-14,2024-04-20
124448,2025-02-14,2024-04-20
124449,2025-02-14,2024-04-20


With `INTERVAL`, we can write the filter more succinctly and cut off at exactly 5 years before the current date.

In [23]:
%%sql

SELECT 
	CURRENT_DATE,
	orderdate
FROM sales
WHERE
	orderdate >= CURRENT_DATE - INTERVAL '5 years' -- Added

Unnamed: 0,current_date,orderdate
0,2025-02-14,2020-02-14
1,2025-02-14,2020-02-14
2,2025-02-14,2020-02-14
3,2025-02-14,2020-02-14
4,2025-02-14,2020-02-14
...,...,...
121320,2025-02-14,2024-04-20
121321,2025-02-14,2024-04-20
121322,2025-02-14,2024-04-20
121323,2025-02-14,2024-04-20
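The two filters above return different row counts because they draw the cutoff in different places. A small Python sketch of the two predicates, run against a handful of hypothetical order dates (the dates are illustrative and not taken from the `sales` table):

```python
from datetime import date

today = date(2025, 2, 14)  # CURRENT_DATE from the outputs above

# Hypothetical order dates, chosen to straddle both cutoffs
order_dates = [date(2019, 12, 31), date(2020, 1, 1), date(2020, 2, 13),
               date(2020, 2, 14), date(2024, 4, 20)]

# EXTRACT(YEAR FROM orderdate) >= EXTRACT(YEAR FROM CURRENT_DATE) - 5
by_year = [d for d in order_dates if d.year >= today.year - 5]

# orderdate >= CURRENT_DATE - INTERVAL '5 years'
cutoff = today.replace(year=today.year - 5)
by_interval = [d for d in order_dates if d >= cutoff]

print(len(by_year), min(by_year))          # 4 2020-01-01
print(len(by_interval), min(by_interval))  # 2 2020-02-14
```

The year-based filter keeps all of 2020, while the interval-based filter starts on the same month and day five years earlier.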


### 📈 Analysis

- Limit results to the last 5 years of sales, measured back from the current date. This makes the query's date filter update dynamically (instead of having to be edited manually).

#### Filter Data by Time Intervals

0. Start with the query from the last lesson, [2_Date_Filtering.ipynb](2_Date_Filtering.ipynb).

In [24]:
%%sql

SELECT 
	CURRENT_DATE,
	s.orderdate,
	p.categoryname, 
	SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM sales s
	LEFT JOIN product p ON s.productkey = p.productkey
WHERE
	EXTRACT(YEAR FROM s.orderdate) >= EXTRACT(YEAR FROM CURRENT_DATE) - 5  -- last 5 years
GROUP BY
	s.orderdate,
	p.categoryname
ORDER BY
	s.orderdate,
	p.categoryname

Unnamed: 0,current_date,orderdate,categoryname,net_revenue
0,2025-02-14,2020-01-01,Audio,5490.14
1,2025-02-14,2020-01-01,Cameras and camcorders,18880.06
2,2025-02-14,2020-01-01,Cell phones,22593.00
3,2025-02-14,2020-01-01,Computers,78554.54
4,2025-02-14,2020-01-01,Games and Toys,1476.43
...,...,...,...,...
11166,2025-02-14,2024-04-20,Computers,58353.68
11167,2025-02-14,2024-04-20,Games and Toys,1744.30
11168,2025-02-14,2024-04-20,Home Appliances,1562.04
11169,2025-02-14,2024-04-20,"Music, Movies and Audio Books",4949.43


**`INTERVAL`** and **`CURRENT_DATE`**

1. Modify the last query to return only orders placed within 5 years of the current date.
    - Use `CURRENT_DATE` to dynamically reference the current date.
    - Subtract `INTERVAL '5 years'` from `CURRENT_DATE` to calculate the start date for filtering.
    - Update the `WHERE` clause to include only rows where `orderdate` is greater than or equal to the calculated start date.

In [25]:
%%sql

SELECT 
	s.orderdate,
	p.categoryname, 
	SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM sales s
	LEFT JOIN product p ON s.productkey = p.productkey
WHERE
	s.orderdate >= CURRENT_DATE - INTERVAL '5 years' -- Added
GROUP BY
	s.orderdate,
	p.categoryname
ORDER BY
	s.orderdate,
	p.categoryname

Unnamed: 0,orderdate,categoryname,net_revenue
0,2020-02-14,Audio,893.92
1,2020-02-14,Cameras and camcorders,12265.49
2,2020-02-14,Cell phones,15050.12
3,2020-02-14,Computers,37146.48
4,2020-02-14,Games and Toys,705.07
...,...,...,...
10831,2024-04-20,Computers,58353.68
10832,2024-04-20,Games and Toys,1744.30
10833,2024-04-20,Home Appliances,1562.04
10834,2024-04-20,"Music, Movies and Audio Books",4949.43


<img src="../Resources/images/2.3_year_rev_category_filtered.png" alt="Rev & Category" width="50%">  

**NOTE:** The 2020 values shown here include only dates within the past 5 years, so 2020 is a partial year.

---
## AGE

### 📝 Notes

`AGE()`
- `AGE` calculates the difference between two dates and returns the result as an interval expressed in years, months, and days.  

- Syntax:
    ```sql
    AGE(end_date, start_date)
    ```


In [26]:
%%sql

SELECT AGE('2024-01-08', '2024-01-01')

Unnamed: 0,age
0,7 days


`EXTRACT`
- `EXTRACT` retrieves a specific component (e.g., day, month, year) from a timestamp or interval.

- Syntax:
    ```sql
    EXTRACT(unit FROM source)
    ```

In [27]:
%%sql

SELECT EXTRACT(DAY FROM AGE('2024-01-08', '2024-01-01'));

Unnamed: 0,extract
0,7
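One point worth keeping in mind: `EXTRACT(DAY FROM AGE(...))` returns only the days field of the interval, not the total number of elapsed days. The two agree when the dates are less than a month apart, as in the example above, but diverge once the gap reaches a month. The Python sketch below is a simplified stand-in for `AGE` (the `age` helper is hypothetical, handles only `end >= start`, and borrows from the earlier date's month, which is how the PostgreSQL documentation describes the behavior):

```python
import calendar
from datetime import date

def age(end: date, start: date) -> tuple:
    """Simplified AGE(end, start) for end >= start: (years, months, days)."""
    if end < start:
        raise ValueError("this sketch only handles end >= start")
    years = end.year - start.year
    months = end.month - start.month
    days = end.day - start.day
    if days < 0:
        # Borrow one month, using the length of the earlier date's month
        days += calendar.monthrange(start.year, start.month)[1]
        months -= 1
    if months < 0:
        # Borrow one year
        months += 12
        years -= 1
    return years, months, days

print(age(date(2024, 1, 8), date(2024, 1, 1)))    # (0, 0, 7)
print(age(date(2024, 2, 8), date(2024, 1, 1)))    # (0, 1, 7)
print((date(2024, 2, 8) - date(2024, 1, 1)).days) # 38
```

In the second case the days field is still 7 even though 38 days have elapsed. If total days are needed and both columns are of type `date`, subtracting them directly in PostgreSQL (`end_date - start_date`) yields a whole number of days.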


### 📈 Analysis

- Evaluate operational performance by calculating the average time taken between order and delivery dates.  
- Aggregate data by time intervals (month, year) to provide actionable insights into revenue and efficiency.

#### Calculate Processing Time

**`AGE`**

1. Calculate the difference in time between the delivery date and order date using `AGE`.
    - Use `AGE(deliverydate, orderdate)` to compute the processing time for each order.
    - Return the order date, processing time, and delivery date for each transaction.

In [43]:
%%sql

SELECT
    orderdate,
    deliverydate,
    EXTRACT(DAYS FROM AGE(deliverydate, orderdate)) AS processing_time
FROM
    sales
WHERE
	orderdate >= CURRENT_DATE - INTERVAL '5 years'
ORDER BY RANDOM()
LIMIT 10

Unnamed: 0,orderdate,deliverydate,processing_time
0,2022-02-25,2022-02-25,0
1,2023-08-05,2023-08-08,3
2,2021-09-11,2021-09-11,0
3,2023-10-11,2023-10-11,0
4,2021-10-25,2021-10-29,4
5,2023-06-08,2023-06-13,5
6,2023-05-11,2023-05-16,5
7,2022-11-24,2022-11-24,0
8,2020-10-08,2020-10-11,3
9,2023-12-02,2023-12-05,3


2. Calculate `order_month`, the average processing time, and `net_revenue`, then group by `order_month`:
     - Compute the total net revenue using `SUM(quantity * netprice * exchangerate)`.
     - Return the order month, average processing time, and total net revenue for each month.
     - Use `ROUND()` to round the average processing time to two decimal places, and `CAST` to convert the total net revenue to an integer.
     - `ROUND`: Rounds numeric values to a specified number of decimal places for better readability.
         - Syntax:
             ```sql
             ROUND(value, precision)
             ```
         - Example:
             ```sql
             SELECT ROUND(1234.56789, 2);
             ```
     - `CAST`: Converts a value from one data type to another.
         - Syntax:
             ```sql
             CAST(expression AS datatype)
             ```
         - Example:
             ```sql
             SELECT CAST(123.45 AS INTEGER);
             ```
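To make the two examples above concrete, here is a minimal Python sketch of the rounding involved. It assumes `numeric` input, where PostgreSQL rounds ties away from zero; `sql_round` is a hypothetical helper written for illustration, and floating-point columns may break ties differently. Note that casting a `numeric` to `INTEGER` rounds to the nearest whole number rather than truncating.

```python
from decimal import Decimal, ROUND_HALF_UP

def sql_round(value: str, precision: int) -> Decimal:
    """Round like ROUND(value, precision) on a numeric: ties go away from zero."""
    quantum = Decimal(1).scaleb(-precision)  # precision=2 -> Decimal('0.01')
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)

print(sql_round("1234.56789", 2))  # 1234.57  (ROUND(1234.56789, 2))
print(sql_round("123.45", 0))      # 123      (CAST(123.45 AS INTEGER))
print(sql_round("123.50", 0))      # 124      (a cast rounds, it does not truncate)
```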

In [39]:
%%sql

SELECT
    TO_CHAR(s.orderdate, 'YYYY-MM') AS order_month,
    ROUND(AVG(EXTRACT(DAYS FROM AGE(deliverydate, orderdate))), 2) AS avg_processing_time,
    CAST(SUM(quantity * netprice * exchangerate) AS INTEGER) AS net_revenue  --update
FROM 
    sales s
WHERE 
    orderdate >= CURRENT_DATE - INTERVAL '5 years'
GROUP BY 
    order_month
ORDER BY 
    order_month;

Unnamed: 0,order_month,avg_processing_time,net_revenue
0,2020-02,0.86,1764611
1,2020-03,0.97,1127543
2,2020-04,0.91,508320
3,2020-05,0.93,1215686
4,2020-06,0.85,799668
5,2020-07,0.89,619915
6,2020-08,1.08,524675
7,2020-09,1.02,328013
8,2020-10,1.1,381505
9,2020-11,1.14,342576
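For readers who think in Python, the grouping logic of the query above can be sketched with the standard library. The rows below are hypothetical and stand in for the `sales` table; processing time is computed as a plain day difference, which matches the days field of `AGE` for gaps shorter than a month.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (orderdate, deliverydate, net_revenue)
rows = [
    (date(2020, 2, 14), date(2020, 2, 14), 100.0),
    (date(2020, 2, 20), date(2020, 2, 23), 250.0),
    (date(2020, 3, 1), date(2020, 3, 5), 400.0),
]

groups = defaultdict(list)
for orderdate, deliverydate, revenue in rows:
    order_month = orderdate.strftime("%Y-%m")  # TO_CHAR(orderdate, 'YYYY-MM')
    processing_days = (deliverydate - orderdate).days
    groups[order_month].append((processing_days, revenue))

summary = {}
for order_month in sorted(groups):  # ORDER BY order_month
    days, revenue = zip(*groups[order_month])
    # (average processing time to 2 decimals, net revenue as an integer)
    summary[order_month] = (round(sum(days) / len(days), 2), round(sum(revenue)))

print(summary)  # {'2020-02': (1.5, 350), '2020-03': (4.0, 400)}
```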


<img src="../Resources/images/2.3_month_processing_rev.png" alt="Processing & Revenue" width="50%">

3. Evaluate the yearly data.
     - 🔔 Replace monthly grouping with yearly grouping by changing `TO_CHAR(s.orderdate, 'YYYY-MM')` to `DATE_PART('year', orderdate)`.
     - 🔔 Group data by `order_year` and order the results.


In [34]:
%%sql

SELECT
  DATE_PART('year', orderdate) AS order_year,
  ROUND(AVG(EXTRACT(DAYS FROM AGE(deliverydate, orderdate))), 2) AS avg_processing_time,
  CAST(SUM(quantity * netprice * exchangerate) AS INTEGER) AS net_revenue
FROM
  sales
WHERE
  orderdate >= CURRENT_DATE - INTERVAL '5 years'
GROUP BY
  order_year
ORDER BY
  order_year

Unnamed: 0,order_year,avg_processing_time,net_revenue
0,2020.0,0.94,8137321
1,2021.0,1.36,21357977
2,2022.0,1.62,44864557
3,2023.0,1.75,33108566
4,2024.0,1.67,8396527


<img src="../Resources/images/2.3_yearly_processing_rev.png" alt="Processing & Revenue" width="50%">