### Time Series Analysis

Time series analysis is one of the most common types of analysis done with SQL. A `time seires` is a sequence of measurements or data points recorded in time order, often at regularly spaced intervals.

Forecasting is a common goal of time series analysis. Since time only marches forward, future values can be expressed as a function of past values, while the reverse is not ture. However, it's important to note that the past doesn't perfectly predict the future.

Still, looking at historical data can lead to insights, and developing a range of plausible outcomes is useful for planning.

#### The Retail Sales Data Set

The examples use a data set of montly US retail sales from the [Montly Retail Trade Report: Retail and Food Services Sales: Excel (1992-2020)](https://www.census.gov/retail/index.html#mrts).

The data in this report is uesd as an economic indicator to understand trends in US consumer spending patterns.

While gross domestic product(GDP) figures are published quarterly, this retail sales data is published monthly, so it is also used to help predict GDP.

The data spans from 1992 to 2020 and includes both total sales as well as details for subcategories of retail sales.

Sales figures are in millions of US dollars.

In [1]:
import pandas as pd
import sqlite3

df = pd.read_csv('us_retail_sales.csv')

conn = sqlite3.connect("sql-analysis.db")
#df.to_sql('retail_sales', conn)

In [2]:
import matplotlib as mpl
import matplotlib.pyplot as plt

#### Simple Trends

In [3]:
query_01 = """
        SELECT sales_month, sales
        FROM retail_sales
        WHERE kind_of_business = 'Retail and food services sales, total'
        """

sql_01 = pd.read_sql(query_01, conn)
sql_01.head()

Unnamed: 0,sales_month,sales
0,1992-01-01,146376.0
1,1992-02-01,147079.0
2,1992-03-01,159336.0
3,1992-04-01,163669.0
4,1992-05-01,170068.0


<img align="left" width="437" alt="Screen Shot 2022-04-18 at 4 28 01 PM" src="https://user-images.githubusercontent.com/73784742/163780536-143cd751-373d-4d84-9e19-763575601ccf.png">

This data clearly has some patterns, but it also has some noise. Transforming the data and aggregating at the yearly level can help us gain a better understanding.

In [4]:
'''
plt.subplots(figsize=(12, 8))
plt.plot(sql_01['sales_month'], sql_01['sales'])

plt.title("Monthly Retail and Food Services Sales")
plt.xlabel("Month")
plt.ylabel("$MM Sales")

plt.xticks(ticks=sql_01['sales_month'], labels=sql_01['sales_month'], rotation=45)
plt.locator_params(axis='x', nbins=len(sql_01['sales_month'])/12)

plt.show();
'''

'\nplt.subplots(figsize=(12, 8))\nplt.plot(sql_01[\'sales_month\'], sql_01[\'sales\'])\n\nplt.title("Monthly Retail and Food Services Sales")\nplt.xlabel("Month")\nplt.ylabel("$MM Sales")\n\nplt.xticks(ticks=sql_01[\'sales_month\'], labels=sql_01[\'sales_month\'], rotation=45)\nplt.locator_params(axis=\'x\', nbins=len(sql_01[\'sales_month\'])/12)\n\nplt.show();\n'

In [5]:
query_02 = """
        SELECT strftime('%Y', sales_month) AS sales_year, SUM(sales) AS sales
        FROM retail_sales
        WHERE kind_of_business = 'Retail and food services sales, total'
        GROUP BY sales_year
        """

sql_02 = pd.read_sql(query_02, conn)
sql_02.head()

Unnamed: 0,sales_year,sales
0,1992,2014102.0
1,1993,2153095.0
2,1994,2330235.0
3,1995,2450628.0
4,1996,2603794.0


<img align="left" width="435" alt="Screen Shot 2022-04-18 at 4 31 29 PM" src="https://user-images.githubusercontent.com/73784742/163781000-f42f328b-51a9-4131-8819-1eb89433d8b5.png">

In [6]:
'''
plt.subplots(figsize=(12, 8))
plt.plot(sql_02['sales_year'], sql_02['sales'])

plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)

plt.title("Yearly Total Retail and Food Services Sales")
plt.xlabel("Year")
plt.ylabel("$MM Sales")
plt.xticks(ticks=sql_02['sales_year'], labels=sql_02['sales_year'], rotation=45)
plt.locator_params(axis='x', nbins=len(sql_02['sales_year']))

plt.show();
'''

'\nplt.subplots(figsize=(12, 8))\nplt.plot(sql_02[\'sales_year\'], sql_02[\'sales\'])\n\nplt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)\n\nplt.title("Yearly Total Retail and Food Services Sales")\nplt.xlabel("Year")\nplt.ylabel("$MM Sales")\nplt.xticks(ticks=sql_02[\'sales_year\'], labels=sql_02[\'sales_year\'], rotation=45)\nplt.locator_params(axis=\'x\', nbins=len(sql_02[\'sales_year\']))\n\nplt.show();\n'

After graphing this data, we now have a smoother time series that is generally incresing over time, as might be expected, since the sales values are not adjusted for inflation. Sales for all retail and food services fell in 2009, during the global financial crsis. After growing every year throughout the 2010s, sales were flat in 2020 compared to 2019, due to the impact of the COVID-19 pandemic.

#### Comparing Components

In [7]:
query_03 = """
        SELECT strftime('%Y', sales_month) AS sales_year, kind_of_business, SUM(sales) AS sales
        FROM retail_sales
        WHERE kind_of_business IN ('Book stores', 'Sporting goods stores', 'Hobby, toy, and game stores')
        GROUP BY sales_year, sales
        """

sql_03 = pd.read_sql(query_03, conn)
sql_03

Unnamed: 0,sales_year,kind_of_business,sales
0,1992,Book stores,523.0
1,1992,Book stores,535.0
2,1992,Book stores,539.0
3,1992,Book stores,552.0
4,1992,"Hobby, toy, and game stores",585.0
...,...,...,...
1023,2020,Sporting goods stores,4803.0
1024,2020,Sporting goods stores,5163.0
1025,2020,Sporting goods stores,5435.0
1026,2020,Sporting goods stores,5887.0


<img align="left" width="435" alt="Screen Shot 2022-04-18 at 4 40 33 PM" src="https://user-images.githubusercontent.com/73784742/163782175-0fc4e8a8-54a5-4356-8248-d27c7c5fed88.png"> 

Sales at sporting goods retailers started the highest among the three categories and grew much faster during the time period, and by the end of the time series, those sales were substantially higher. Sales at sporting goods sotres started declining in 2017 but had a big rebound in 2020. Sales at hobby, toy, and games stores were relatively flat over this time span, with a slight dip in the mid-2000s and another slight decline prior to a rebound 2020. Sales at book stores grew until the mid-2000s and have been on the decline since then.

In [8]:
query_04 = """
        SELECT strftime('%Y', sales_month) AS sales_year, kind_of_business, SUM(sales) AS sales
        FROM retail_sales
        WHERE kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores')
        GROUP BY sales_year, kind_of_business
        """

sql_04 = pd.read_sql(query_04, conn)
sql_04.head()

Unnamed: 0,sales_year,kind_of_business,sales
0,1992,Men's clothing stores,10179.0
1,1992,Women's clothing stores,31815.0
2,1993,Men's clothing stores,9962.0
3,1993,Women's clothing stores,32350.0
4,1994,Men's clothing stores,10032.0


<img align="left" width="435" alt="Screen Shot 2022-04-18 at 7 49 03 PM" src="https://user-images.githubusercontent.com/73784742/163804053-ee0df89b-f073-4720-be07-7c9cab0c414c.png">

The gap between men’s and women’s sales does not appear constant but rather was increasing during the early to mid-2000s. Women’s clothing sales in particular dipped during the global financial crisis of 2008–2009, and sales in both categories dropped a lot during the pandemic in 2020.

In [9]:
query_05 = """
        SELECT strftime('%Y', sales_month) AS sales_year,
            SUM(CASE WHEN kind_of_business = 'Women''s clothing stores'
                THEN sales
                END) AS womens_sales,
            SUM(CASE WHEN kind_of_business = 'Men''s clothing stores'
                THEN sales
                END) AS mens_sales
        FROM retail_sales
        WHERE kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores')
        GROUP BY sales_year
        """

sql_05 = pd.read_sql(query_05, conn)
sql_05.head()

Unnamed: 0,sales_year,womens_sales,mens_sales
0,1992,31815.0,10179.0
1,1993,32350.0,9962.0
2,1994,30585.0,10032.0
3,1995,28696.0,9315.0
4,1996,28238.0,9546.0


In [10]:
query_06 = """    
        SELECT sales_year,
            womens_sales - mens_sales AS womens_minus_mens
        FROM
        (
            SELECT strftime('%Y', sales_month) AS sales_year,
                SUM(CASE WHEN kind_of_business = 'Women''s clothing stores'
                    THEN sales
                    END) AS womens_sales,
                SUM(CASE WHEN kind_of_business = 'Men''s clothing stores'
                    THEN sales
                    END) AS mens_sales
                FROM retail_sales
                WHERE kind_of_business in ('Men''s clothing stores','Women''s clothing stores')
                AND sales_month <= '2019-12-01'
                GROUP BY 1
        )
        """
        
sql_06 = pd.read_sql(query_06, conn)
sql_06.head()       

Unnamed: 0,sales_year,womens_minus_mens
0,1992,21636.0
1,1993,22388.0
2,1994,20553.0
3,1995,19381.0
4,1996,18692.0


<img align="left" width="443" alt="Screen Shot 2022-04-18 at 8 09 16 PM" src="https://user-images.githubusercontent.com/73784742/163806248-93b1b5c3-692f-45f1-94b8-801e52d63858.png">

The gap decreased between 1992 and about 1997, began a long increase through about 2011 (with a brief dip in 2007), and then was more or less flat through 2019.

#### Percent of Total Calculations

When working with time series data that has multiple parts or attributes that constitute a whole, it’s often useful to analyze each part’s contribution to the whole and whether that has changed over time. 
Unless the data already contains a time series of the total values, we’ll need to calculate the overall total in order to calculate the percent of total for each row. 

This can be accomplished with a `self-JOIN`, or a `window function`.

- SELF JOIN

In [17]:
query_07 = """
        SELECT a.sales_month,
                a.kind_of_business,
                a.sales,
                SUM(b.sales) AS total_sales
        FROM retail_sales AS a
        JOIN retail_sales AS b
            ON a.sales_month = b.sales_month
            AND b.kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores')
        WHERE a.kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores')
        GROUP BY 1, 2
        """
sql_07 = pd.read_sql(query_07, conn)
sql_07 

Unnamed: 0,sales_month,kind_of_business,sales,total_sales
0,1992-01-01,Men's clothing stores,701.0,2574.0
1,1992-01-01,Women's clothing stores,1873.0,2574.0
2,1992-02-01,Men's clothing stores,658.0,2649.0
3,1992-02-01,Women's clothing stores,1991.0,2649.0
4,1992-03-01,Men's clothing stores,731.0,3134.0
...,...,...,...,...
691,2020-10-01,Women's clothing stores,2634.0,2634.0
692,2020-11-01,Men's clothing stores,,2726.0
693,2020-11-01,Women's clothing stores,2726.0,2726.0
694,2020-12-01,Men's clothing stores,604.0,4003.0


In [19]:
query_08 = """
        SELECT sales_month, kind_of_business, sales * 100 / total_sales AS pct_total_sales
        FROM
        (
            SELECT a.sales_month,
                a.kind_of_business,
                a.sales,
                SUM(b.sales) AS total_sales
            FROM retail_sales AS a
            JOIN retail_sales AS b
                ON a.sales_month = b.sales_month
                AND b.kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores')
            WHERE a.kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores')
            GROUP BY a.sales_month, a.kind_of_business, a.sales
        )
        """
        
sql_08 = pd.read_sql(query_08, conn)
sql_08

Unnamed: 0,sales_month,kind_of_business,pct_total_sales
0,1992-01-01,Men's clothing stores,27.233877
1,1992-01-01,Women's clothing stores,72.766123
2,1992-02-01,Men's clothing stores,24.839562
3,1992-02-01,Women's clothing stores,75.160438
4,1992-03-01,Men's clothing stores,23.324825
...,...,...,...
691,2020-10-01,Women's clothing stores,100.000000
692,2020-11-01,Men's clothing stores,
693,2020-11-01,Women's clothing stores,100.000000
694,2020-12-01,Men's clothing stores,15.088683


In [20]:
query_09 = """
        SELECT sales_month, kind_of_business, sales * 100 / total_sales AS pct_total_sales
        FROM
        (
            SELECT a.sales_month,
                a.kind_of_business,
                a.sales,
                SUM(b.sales) AS total_sales
            FROM retail_sales AS a
            JOIN retail_sales AS b
                ON a.sales_month = b.sales_month
                AND b.kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores')
            WHERE a.kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores')
            GROUP BY a.sales_month, a.kind_of_business, a.sales
        )
        """
        
sql_09 = pd.read_sql(query_09, conn)
sql_09

Unnamed: 0,sales_month,kind_of_business,pct_total_sales
0,1992-01-01,Men's clothing stores,27.233877
1,1992-01-01,Women's clothing stores,72.766123
2,1992-02-01,Men's clothing stores,24.839562
3,1992-02-01,Women's clothing stores,75.160438
4,1992-03-01,Men's clothing stores,23.324825
...,...,...,...
691,2020-10-01,Women's clothing stores,100.000000
692,2020-11-01,Men's clothing stores,
693,2020-11-01,Women's clothing stores,100.000000
694,2020-12-01,Men's clothing stores,15.088683


- WINDOW Function

In [22]:
query_10 = """
        SELECT sales_month,
            kind_of_business,
            sales,
            SUM(sales) OVER (PARTITION BY sales_month) AS total_sales,
            sales * 100 / SUM(sales) OVER (PARTITION BY sales_month) AS pct_total
        FROM retail_sales
        WHERE kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores')
        """
        
sql_10 = pd.read_sql(query_10, conn)
sql_10

Unnamed: 0,sales_month,kind_of_business,sales,total_sales,pct_total
0,1992-01-01,Men's clothing stores,701.0,2574.0,27.233877
1,1992-01-01,Women's clothing stores,1873.0,2574.0,72.766123
2,1992-02-01,Men's clothing stores,658.0,2649.0,24.839562
3,1992-02-01,Women's clothing stores,1991.0,2649.0,75.160438
4,1992-03-01,Men's clothing stores,731.0,3134.0,23.324825
...,...,...,...,...,...
691,2020-10-01,Women's clothing stores,2634.0,2634.0,100.000000
692,2020-11-01,Men's clothing stores,,2726.0,
693,2020-11-01,Women's clothing stores,2726.0,2726.0,100.000000
694,2020-12-01,Men's clothing stores,604.0,4003.0,15.088683


<img align="left" width="447" alt="Screen Shot 2022-04-19 at 11 36 23 AM" src="https://user-images.githubusercontent.com/73784742/163915154-aa81e83a-b772-493f-b5b9-4ea23dc7577e.png">

First, starting in the late 1990s, women’s clothing store sales became an increasing percentage of the total. 

Second, early in the series a seasonal pattern is evident, where men’s sales spike as a percent of total sales in December and January. 

In the first decade of the 21st century, two seasonal peaks appear, in the summer and the winter, but by the late 2010s, the seasonal patterns are dampened almost to the point of randomness.

In [24]:
query_11 = """
        SELECT sales_month,
            kind_of_business,
            sales,
            SUM(sales) OVER (PARTITION BY strftime('%Y',sales_month), kind_of_business) AS yearly_sales,
            sales * 100 / SUM(sales) OVER (PARTITION BY strftime('%Y',sales_month), kind_of_business) AS pct_yearly
        FROM retail_sales
        WHERE kind_of_business IN ('Men''s clothing stores','Women''s clothing stores')
        """

sql_11 = pd.read_sql(query_11, conn)
sql_11

Unnamed: 0,sales_month,kind_of_business,sales,yearly_sales,pct_yearly
0,1992-01-01,Men's clothing stores,701.0,10179.0,6.886728
1,1992-02-01,Men's clothing stores,658.0,10179.0,6.464289
2,1992-03-01,Men's clothing stores,731.0,10179.0,7.181452
3,1992-04-01,Men's clothing stores,816.0,10179.0,8.016505
4,1992-05-01,Men's clothing stores,856.0,10179.0,8.409470
...,...,...,...,...,...
691,2020-08-01,Women's clothing stores,2386.0,26526.0,8.994948
692,2020-09-01,Women's clothing stores,2494.0,26526.0,9.402096
693,2020-10-01,Women's clothing stores,2634.0,26526.0,9.929880
694,2020-11-01,Women's clothing stores,2726.0,26526.0,10.276710


<img align="left" width="442" alt="Screen Shot 2022-04-19 at 11 49 15 AM" src="https://user-images.githubusercontent.com/73784742/163916515-d51235cb-89b9-4bea-b93c-f7d1a46e4558.png">

Zoomed in to 2019, the two time series track fairly closely, but men’s stores had a greater percentage of their sales in January than did women’s stores. Men’s stores had a summer dip in July, while the corresponding dip in women’s store sales wasn’t until September.

#### Indexing to See Percent Change over Time

Indexing data is a way to understand the changes in a time series relative to a base period (starting point). Pick a base period and compute the percent change in value from that base period for each subsequent period.

In [26]:
query_12 = """
        SELECT sales_year,
            sales,
            FIRST_VALUE(sales) OVER (ORDER BY sales_year) AS index_sales
        FROM
        (
            SELECT strftime('%Y',sales_month) AS sales_year, 
                SUM(sales) AS sales
            FROM retail_sales
            WHERE kind_of_business = 'Women''s clothing stores'
            GROUP BY sales_year
        )
        """

sql_12 = pd.read_sql(query_12, conn)
sql_12.head()

Unnamed: 0,sales_year,sales,index_sales
0,1992,31815.0,31815.0
1,1993,32350.0,31815.0
2,1994,30585.0,31815.0
3,1995,28696.0,31815.0
4,1996,28238.0,31815.0


We index women’s clothing store sales to the first year in the series, 1992. The first step is to aggregate the sales by sales_year in a subquery, as we’ve done previously.

In the outer query, the `FIRST_VALUE` window function finds the value associated with the first row in the PARTITION BY clause, according to the sort in the ORDER BY clause. In this example, we can omit the PARTITION BY clause, because we want to return the sales value for the first row in the entire data set returned by the subquery

In [34]:
query_13 = """
        SELECT sales_year,
            sales,
            (sales / FIRST_VALUE(sales) OVER (ORDER BY sales_year) - 1) * 100 AS pct_from_index
        FROM
        (
            SELECT strftime('%Y',sales_month) AS sales_year, 
                SUM(sales) AS sales
            FROM retail_sales
            WHERE kind_of_business = 'Women''s clothing stores'
            GROUP BY sales_year
        )
        """

sql_13 = pd.read_sql(query_13, conn)
sql_13.head()

Unnamed: 0,sales_year,sales,pct_from_index
0,1992,31815.0,0.0
1,1993,32350.0,1.681597
2,1994,30585.0,-3.866101
3,1995,28696.0,-9.803552
4,1996,28238.0,-11.243124


In [36]:
query_14 = """
        SELECT sales_year,
            kind_of_business,
            sales,
            (sales / FIRST_VALUE(sales) OVER (
                PARTITION BY kind_of_business ORDER BY sales_year) - 1) * 100 AS pct_from_index
        FROM
        (
            SELECT strftime('%Y',sales_month) AS sales_year,
                kind_of_business,
                SUM(sales) AS sales
            FROM retail_sales
            WHERE kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores')
            GROUP BY sales_year, kind_of_business
        )
        """

sql_14 = pd.read_sql(query_14, conn)
sql_14

Unnamed: 0,sales_year,kind_of_business,sales,pct_from_index
0,1992,Men's clothing stores,10179.0,0.0
1,1993,Men's clothing stores,9962.0,-2.13184
2,1994,Men's clothing stores,10032.0,-1.44415
3,1995,Men's clothing stores,9315.0,-8.488064
4,1996,Men's clothing stores,9546.0,-6.218686
5,1997,Men's clothing stores,10069.0,-1.080656
6,1998,Men's clothing stores,10196.0,0.167011
7,1999,Men's clothing stores,9667.0,-5.029964
8,2000,Men's clothing stores,9507.0,-6.601827
9,2001,Men's clothing stores,8625.0,-15.266726


<img align="left" width="563" alt="Screen Shot 2022-04-19 at 12 16 33 PM" src="https://user-images.githubusercontent.com/73784742/163919082-5ea55f41-3a6b-4634-96c1-8026167377d5.png">

It’s apparent from this graph that 1992 was something of a high-water mark for sales at men’s clothing stores. After 1992 sales dropped, then returned briefly to the same level in 1998, and have been declining ever since. This is striking since the data set is not adjusted for inflation, the tendency for prices to rise over time. Sales at women’s clothing stores decreased from 1992 levels initially, but they returned to the 1992 level by 2003. They have increased since, with the exception of the drop during the financial crisis that decreased sales in 2009 and 2010.

One explanation for these trends is that men simply decreased spending on clothes over time, perhaps becoming less fashion conscious relative to women. Perhaps men’s clothing simply became less expensive as global supply chains decreased costs. 

Yet another explanation might be that men shifted their clothing purchases from retailers categorized as “men’s clothing stores” to other types of retailers, such as sporting goods stores or online retailers.

#### Rolling Time Windows

Another technique for smoothing data is `rolling time windows`, also known as moving calculations, that take into account multiple periods. Moving averages are probably the most common, but with the power of SQL, any aggregate function is available for analysis.

Rolling time windows are used in a wide variety of analysis areas, including stock markets, macro economic trends, and audience measurement. 

Some calculations are so commonly used that they have their own acronyms: last twelve months (LTM), trailing twelve months (TTM), and year-to-date (YTD).

Larger windows with more time periods have a greater smoothing effect, but at the risk of losing sensitivity to important short-term changes in the data. 

Shorter win‐ dows with fewer time periods do less smoothing and thus are more sensitive to short- term changes, but at the risk of too little noise reduction.

---

##### Measuring “Active Users”: DAU, WAU, and MAU

- **DAU** (Daily Active Users) helps companies with capacity planning, such as estimating how much load to expect on **servers**. Depending on the service, however, even more detailed data might be needed, such as peak hourly or even minute-by-minute concurrent user information.


- **MAU** (Monthly Active Users) is commonly used to estimate relative sizes of applications or services. It is useful for measuring fairly stable or growing user populations that have regular usage patterns that aren’t necessarily daily, such as higher use on the weekend for leisure products, or higher weekday use for work- or school-related products. MAU is not as well suited to detecting changes in underlying churn from users who stop using an application. Since it takes a user 30 days, the most common window, to pass through MAU, a user can have been absent from the product for 29 days before they trigger a drop in MAU.


- **WAU** (Weekly Active User), calculated over 7 days, can be a happy medium between DAU and MAU. WAU is more sensitive to short-term fluctuations, alerting teams to changes in churn more quickly than MAU while smoothing over day of week fluctuations that are tracked by DAU. A drawback to WAU is that it is still sensitive to short-term fluctuations driven by events such as holidays.

#### Calculating Rolling Time Windows

- SELF JOIN

In [51]:
query_15 = """
        SELECT a.sales_month,
            a.sales,
            b.sales_month AS rolling_sales_month,
            b.sales AS rolling_sales
        FROM retail_sales a
        JOIN retail_sales b ON a.kind_of_business = b.kind_of_business
            AND b.sales_month BETWEEN date(a.sales_month, '-11 months') AND a.sales_month
            AND b.kind_of_business = 'Women''s clothing stores'
        WHERE a.kind_of_business = 'Women''s clothing stores'
            AND a.sales_month = '2019-12-01'
        """

sql_15 = pd.read_sql(query_15, conn)
sql_15

Unnamed: 0,sales_month,sales,rolling_sales_month,rolling_sales
0,2019-12-01,4496.0,2019-01-01,2511.0
1,2019-12-01,4496.0,2019-02-01,2680.0
2,2019-12-01,4496.0,2019-03-01,3585.0
3,2019-12-01,4496.0,2019-04-01,3604.0
4,2019-12-01,4496.0,2019-05-01,3807.0
5,2019-12-01,4496.0,2019-06-01,3272.0
6,2019-12-01,4496.0,2019-07-01,3261.0
7,2019-12-01,4496.0,2019-08-01,3325.0
8,2019-12-01,4496.0,2019-09-01,3080.0
9,2019-12-01,4496.0,2019-10-01,3390.0


- WINDOW Function

Window functions are another way to calculate rolling time windows. 

To make a rolling window, we need to use another optional part of a window calculation: `the frame clause`.The frame clause allows you to specify which records to include in the window.

In [52]:
query_16 = """
        SELECT sales_month,
            sales,
            AVG(sales) OVER (ORDER BY sales_month ROWS BETWEEN 11 PRECEDING AND current row) AS moving_avg,
            COUNT(sales) over (ORDER BY sales_month ROWS BETWEEN 11 PRECEDING AND current row) AS records_count
        FROM retail_sales
        WHERE kind_of_business = 'Women''s clothing stores'
        """

sql_16 = pd.read_sql(query_16, conn)
sql_16

Unnamed: 0,sales_month,sales,moving_avg,records_count
0,1992-01-01,1873.0,1873.000000,1
1,1992-02-01,1991.0,1932.000000,2
2,1992-03-01,2403.0,2089.000000,3
3,1992-04-01,2665.0,2233.000000,4
4,1992-05-01,2752.0,2336.800000,5
...,...,...,...,...
343,2020-08-01,2386.0,2507.416667,12
344,2020-09-01,2494.0,2458.583333,12
345,2020-10-01,2634.0,2395.583333,12
346,2020-11-01,2726.0,2301.916667,12


<img align="left" width="574" alt="Screen Shot 2022-04-19 at 2 19 12 PM" src="https://user-images.githubusercontent.com/73784742/163938572-6afd9d81-2b68-459a-8d82-bc544d8fb373.png">

While the monthly trend is noisy, the smoothed moving average trend makes detecting changes such as the increase from 2003 to 2007 and the subsequent dip through 2011 easier to spot. Notice that the extreme drop in early 2020 pulls the moving average down even after sales start to rebound later in the year.

#### Rolling Time Windows with Sparse Data

In [55]:
query_17 = """
        SELECT a.sales_month, avg(b.sales) AS moving_avg
        FROM
        (
            SELECT DISTINCT sales_month
            FROM retail_sales
            WHERE sales_month BETWEEN '1993-01-01' AND '2020-12-01'
        )a
        JOIN retail_sales b ON b.sales_month BETWEEN date(a.sales_month, '-11 months') AND a.sales_month
             AND b.kind_of_business = 'Women''s clothing stores'
        GROUP BY 1
        """

sql_17 = pd.read_sql(query_17, conn)
sql_17

Unnamed: 0,sales_month,moving_avg
0,1993-01-01,2672.083333
1,1993-02-01,2673.250000
2,1993-03-01,2676.500000
3,1993-04-01,2684.583333
4,1993-05-01,2694.666667
...,...,...
331,2020-08-01,2507.416667
332,2020-09-01,2458.583333
333,2020-10-01,2395.583333
334,2020-11-01,2301.916667


#### Calculating Cumulative Values

YTD (Year-To-Date)

In [57]:
- WINDOW Function

SyntaxError: invalid syntax (<ipython-input-57-bf97851a6102>, line 1)

In [58]:
query_18 = """
        SELECT sales_month,
            sales,
            SUM(sales) OVER (PARTITION BY strftime('%Y',sales_month) ORDER BY sales_month) AS sales_ytd
        FROM retail_sales
        WHERE kind_of_business = 'Women''s clothing stores'
        """

sql_18 = pd.read_sql(query_18, conn)
sql_18

Unnamed: 0,sales_month,sales,sales_ytd
0,1992-01-01,1873.0,1873.0
1,1992-02-01,1991.0,3864.0
2,1992-03-01,2403.0,6267.0
3,1992-04-01,2665.0,8932.0
4,1992-05-01,2752.0,11684.0
...,...,...,...
343,2020-08-01,2386.0,15273.0
344,2020-09-01,2494.0,17767.0
345,2020-10-01,2634.0,20401.0
346,2020-11-01,2726.0,23127.0


- SELF JOIN

In [59]:
query_19 = """
        SELECT a.sales_month, 
            a.sales,
            SUM(b.sales) AS sales_ytd
        FROM retail_sales a
        JOIN retail_sales b ON
            strftime('%Y', a.sales_month) = strftime('%Y', b.sales_month)
            AND b.sales_month <= a.sales_month
            AND b.kind_of_business = 'Women''s clothing stores'
        WHERE a.kind_of_business = 'Women''s clothing stores'
        GROUP BY 1,2
        """

sql_19 = pd.read_sql(query_19, conn)
sql_19

Unnamed: 0,sales_month,sales,sales_ytd
0,1992-01-01,1873.0,1873.0
1,1992-02-01,1991.0,3864.0
2,1992-03-01,2403.0,6267.0
3,1992-04-01,2665.0,8932.0
4,1992-05-01,2752.0,11684.0
...,...,...,...
343,2020-08-01,2386.0,15273.0
344,2020-09-01,2494.0,17767.0
345,2020-10-01,2634.0,20401.0
346,2020-11-01,2726.0,23127.0


<img align="left" width="565" alt="Screen Shot 2022-04-19 at 3 21 42 PM" src="https://user-images.githubusercontent.com/73784742/163947968-347843b7-61eb-4960-bc1e-23711a437b63.png">

The query returns a record for each sales_month, the sales for that month, and the running total sales_ytd. 

The series starts in 1992 and then resets in January 1993, as it will for every year in the data set.

In the rusults fot years 2016 through 2020, the first four years show similar patterns through the year, but of course 2020 looks very different.

#### Analyzing with Seasonality

**Seasonality** is any pattern that repeats over regular intervals. Unlike other noise in the data, seasonality can be predicted. To understand whether seasonality exists in a time series, and at what scale, it’s useful to graph it and then visually inspect for patterns. Try aggregating at different levels, from hourly to daily, weekly, and monthly. You should also incorporate knowledge about the data set.



#### Period-over-Period Comparisons: YoY and MoM

Period-over-period comparisons can take multiple forms. The first one is to compare a time period to the previous value in the series, a practice so common in analysis that there are acronyms for the most often-used comparisons. 

Depending on the level of aggregation the comparison might be year-over-year (YoY), month-over-month (MoM), day-over-day (DoD), and so on.

For these calculations we’ll use the `lag` function, another one of the window functions. The lag function returns a previous or lagging value from a series.

In [61]:
query_20 = """
        SELECT kind_of_business, sales_month, sales,
            LAG(sales_month) OVER (PARTITION BY kind_of_business ORDER BY sales_month) AS prev_month,
            LAG(sales) OVER (PARTITION BY kind_of_business ORDER BY sales_month) AS prev_month_sales
        FROM retail_sales
        WHERE kind_of_business = 'Book stores'
        """

sql_20 = pd.read_sql(query_20, conn)
sql_20

Unnamed: 0,kind_of_business,sales_month,sales,prev_month,prev_month_sales
0,Book stores,1992-01-01,790.0,,
1,Book stores,1992-02-01,539.0,1992-01-01,790.0
2,Book stores,1992-03-01,535.0,1992-02-01,539.0
3,Book stores,1992-04-01,523.0,1992-03-01,535.0
4,Book stores,1992-05-01,552.0,1992-04-01,523.0
...,...,...,...,...,...
343,Book stores,2020-08-01,770.0,2020-07-01,437.0
344,Book stores,2020-09-01,620.0,2020-08-01,770.0
345,Book stores,2020-10-01,455.0,2020-09-01,620.0
346,Book stores,2020-11-01,496.0,2020-10-01,455.0


In [62]:
query_21 = """
        SELECT kind_of_business, sales_month, sales,
            (sales / LAG(sales) OVER (PARTITION BY kind_of_business ORDER BY sales_month) - 1) * 100 
            AS pct_growth_from_previous
        FROM retail_sales
        WHERE kind_of_business = 'Book stores'
        """

sql_21 = pd.read_sql(query_21, conn)
sql_21

Unnamed: 0,kind_of_business,sales_month,sales,pct_growth_from_previous
0,Book stores,1992-01-01,790.0,
1,Book stores,1992-02-01,539.0,-31.772152
2,Book stores,1992-03-01,535.0,-0.742115
3,Book stores,1992-04-01,523.0,-2.242991
4,Book stores,1992-05-01,552.0,5.544933
...,...,...,...,...
343,Book stores,2020-08-01,770.0,76.201373
344,Book stores,2020-09-01,620.0,-19.480519
345,Book stores,2020-10-01,455.0,-26.612903
346,Book stores,2020-11-01,496.0,9.010989


In [66]:
query_22 = """
        SELECT sales_year, yearly_sales,
            LAG(yearly_sales) OVER (ORDER BY sales_year) AS prev_year_sales,
            (yearly_sales / LAG(yearly_sales) OVER (ORDER BY sales_year) -1) * 100 
            AS pct_growth_from_previous
        FROM
        (
            SELECT strftime('%Y',sales_month) AS sales_year,
                SUM(sales) AS yearly_sales
            FROM retail_sales
            WHERE kind_of_business = 'Book stores'
            GROUP BY sales_year
        )
        """

sql_22 = pd.read_sql(query_22, conn)
sql_22.head()

Unnamed: 0,sales_year,yearly_sales,prev_year_sales,pct_growth_from_previous
0,1992,8327.0,,
1,1993,9108.0,8327.0,9.379128
2,1994,10107.0,9108.0,10.968379
3,1995,11196.0,10107.0,10.774711
4,1996,11905.0,11196.0,6.332619


<img alighn="left" width="573" alt="Screen Shot 2022-04-19 at 3 59 19 PM" src="https://user-images.githubusercontent.com/73784742/163954565-415a7757-0d5e-44bd-8e3d-ff6b4f1e7c63.png">

Sales grew more than 9.3% from 1992 to 1993, and almost 11% from 1993 to 1994. These period-over-period calculations are useful, but they don’t quite allow us to analyze the seasonality in the data set.

#### Period-over-Period Comparisosns: Same Month Versus Last Year

In [80]:
query_23 = """
        SELECT sales_month, sales,
            LAG(sales_month) OVER (PARTITION BY strftime('%m',sales_month) ORDER BY sales_month) 
                AS prev_year_month,
            LAG(sales) OVER (PARTITION BY strftime('%m',sales_month) ORDER BY sales_month) 
                AS prev_year_sales
        FROM retail_sales
        WHERE kind_of_business = 'Book stores'
        """

sql_23 = pd.read_sql(query_23, conn)
sql_23.head()

Unnamed: 0,sales_month,sales,prev_year_month,prev_year_sales
0,1992-01-01,790.0,,
1,1993-01-01,998.0,1992-01-01,790.0
2,1994-01-01,1053.0,1993-01-01,998.0
3,1995-01-01,1308.0,1994-01-01,1053.0
4,1996-01-01,1373.0,1995-01-01,1308.0


In [81]:
query_24 = """
        SELECT sales_month, sales,
            sales - LAG(sales) OVER (PARTITION BY strftime('%m',sales_month) ORDER BY sales_month) 
                AS absolute_diff,
            (sales / LAG(sales) OVER (PARTITION BY strftime('%m',sales_month) ORDER BY sales_month) - 1) * 100
                AS pct_diff
        FROM retail_sales
        WHERE kind_of_business = 'Book stores'
        """

sql_24 = pd.read_sql(query_24, conn)
sql_24.head()

Unnamed: 0,sales_month,sales,absolute_diff,pct_diff
0,1992-01-01,790.0,,
1,1993-01-01,998.0,208.0,26.329114
2,1994-01-01,1053.0,55.0,5.511022
3,1995-01-01,1308.0,255.0,24.216524
4,1996-01-01,1373.0,65.0,4.969419


<img align="left" width="567" alt="Screen Shot 2022-04-19 at 4 26 15 PM" src="https://user-images.githubusercontent.com/73784742/163959210-f0fc683b-b4e4-46c9-961c-b927475b7a50.png">

Growth was unusually high, such as January 2002, or unusually low, such as December 2001.

In [91]:
query_25 = """
        SELECT strftime('%m',sales_month) AS month_number,
            MAX(CASE WHEN strftime('%Y', sales_month) = '1992' THEN sales END) AS sales_1992,
            MAX(CASE WHEN strftime('%Y', sales_month) = '1993' THEN sales END) AS sales_1993,
            MAX(CASE WHEN strftime('%Y', sales_month) = '1994' THEN sales END) AS sales_1994
        FROM retail_sales
        WHERE kind_of_business = 'Book stores'
            AND sales_month BETWEEN '1992-01-01' AND '1994-12-01'
        GROUP BY 1
        """

sql_25 = pd.read_sql(query_25, conn)
sql_25

Unnamed: 0,month_number,sales_1992,sales_1993,sales_1994
0,1,790.0,998.0,1053.0
1,2,539.0,568.0,635.0
2,3,535.0,602.0,634.0
3,4,523.0,583.0,610.0
4,5,552.0,612.0,684.0
5,6,589.0,618.0,724.0
6,7,592.0,607.0,678.0
7,8,894.0,983.0,1154.0
8,9,861.0,903.0,1022.0
9,10,645.0,669.0,732.0


<img align="left" width="571" alt="Screen Shot 2022-04-19 at 4 50 56 PM" src="https://user-images.githubusercontent.com/73784742/163964983-91895066-3446-4ef1-80ba-f82602b56a84.png">

By lining the data up in this way, we can see some trends immediately. December sales are the highest monthly sales of the year. Sales in 1994 were higher every month than sales in 1992 and 1993. The August-to-September sales bump is visible, and particularly easy to spot in 1994.

#### Comparing to Multiple Prior Periods

We’ll take advantage of the optional offset value. Recall that when no offset is provided to lag, the function returns the immediate prior value according to the PARTITION BY and ORDER BY clauses. An `offset` value of 2 skips over the immediate prior value and returns the value prior to that, an offset value of 3 returns the value from 3 rows back, and so on.

In [93]:
query_26 = """
        SELECT sales_month, sales,
            LAG(sales, 1) OVER (PARTITION BY strftime('%m',sales_month) ORDER BY sales_month)
                AS prev_sales_1,
            LAG(sales, 2) OVER (PARTITION BY strftime('%m',sales_month) ORDER BY sales_month)
                AS prev_sales_2,
            LAG(sales, 3) OVER (PARTITION BY strftime('%m',sales_month) ORDER BY sales_month)
                AS prev_sales_3
        FROM retail_sales
        WHERE kind_of_business = 'Book stores'
        """

sql_26 = pd.read_sql(query_26, conn)
sql_26

Unnamed: 0,sales_month,sales,prev_sales_1,prev_sales_2,prev_sales_3
0,1992-01-01,790.0,,,
1,1993-01-01,998.0,790.0,,
2,1994-01-01,1053.0,998.0,790.0,
3,1995-01-01,1308.0,1053.0,998.0,790.0
4,1996-01-01,1373.0,1308.0,1053.0,998.0
...,...,...,...,...,...
343,2016-12-01,1249.0,1321.0,1332.0,1327.0
344,2017-12-01,1114.0,1249.0,1321.0,1332.0
345,2018-12-01,1122.0,1114.0,1249.0,1321.0
346,2019-12-01,1037.0,1122.0,1114.0,1249.0


In [112]:
query_27 = """
         SELECT sales_month, sales,
             sales / AVG(sales) OVER (PARTITION BY strftime('%m', sales_month) 
                 ORDER BY sales_month ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS pct_of_prev_3
            FROM retail_sales
            WHERE kind_of_business = 'Book stores'
        """

sql_27 = pd.read_sql(query_27, conn)
sql_27

Unnamed: 0,sales_month,sales,pct_of_prev_3
0,1992-01-01,790.0,
1,1993-01-01,998.0,1.263291
2,1994-01-01,1053.0,1.177852
3,1995-01-01,1308.0,1.381204
4,1996-01-01,1373.0,1.226258
...,...,...,...
343,2016-12-01,1249.0,0.941457
344,2017-12-01,1114.0,0.856484
345,2018-12-01,1122.0,0.913681
346,2019-12-01,1037.0,0.892683


Analyzing data with seasonality often involves trying to reduce noise in order to make clear conclusions about the underlying trends in the data. 

Comparing data points against multiple prior time periods can give us an even smoother trend to compare to and determine what is actually happening in the current time period. 

This does require that the data include enough history to make these comparisons, but when we have a long enough time series, it can be insightful.