### Time Series Analysis

Time series analysis is one of the most common types of analysis done with SQL. A `time series` is a sequence of measurements or data points recorded in time order, often at regularly spaced intervals.

Forecasting is a common goal of time series analysis. Since time only marches forward, future values can be expressed as a function of past values, while the reverse is not true. However, it's important to note that the past doesn't perfectly predict the future.

Still, looking at historical data can lead to insights, and developing a range of plausible outcomes is useful for planning.

#### The Retail Sales Data Set

The examples use a data set of monthly US retail sales from the [Monthly Retail Trade Report: Retail and Food Services Sales: Excel (1992-2020)](https://www.census.gov/retail/index.html#mrts).

The data in this report is used as an economic indicator to understand trends in US consumer spending patterns.

While gross domestic product (GDP) figures are published quarterly, this retail sales data is published monthly, so it is also used to help predict GDP.

The data spans from 1992 to 2020 and includes both total sales and details for subcategories of retail sales.

Sales figures are in millions of US dollars.

In [2]:
import pandas as pd
import sqlite3

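# Load the monthly retail sales CSV into a DataFrame.
df = pd.read_csv('us_retail_sales.csv')

conn = sqlite3.connect("sql-analysis.db")
# Uncomment on the first run to create the retail_sales table;
# to_sql raises a ValueError by default if the table already exists.
#df.to_sql('retail_sales', conn)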

In [3]:
import matplotlib as mpl
import matplotlib.pyplot as plt

#### Simple Trends

In [4]:
query_01 = """
        SELECT sales_month, sales
        FROM retail_sales
        WHERE kind_of_business = 'Retail and food services sales, total'
        """

sql_01 = pd.read_sql(query_01, conn)
sql_01.head()

| | sales_month | sales |
|---|---|---|
| 0 | 1992-01-01 | 146376.0 |
| 1 | 1992-02-01 | 147079.0 |
| 2 | 1992-03-01 | 159336.0 |
| 3 | 1992-04-01 | 163669.0 |
| 4 | 1992-05-01 | 170068.0 |


<img align="left" width="437" alt="Screen Shot 2022-04-18 at 4 28 01 PM" src="https://user-images.githubusercontent.com/73784742/163780536-143cd751-373d-4d84-9e19-763575601ccf.png">

This data clearly has some patterns, but it also has some noise. Transforming the data and aggregating at the yearly level can help us gain a better understanding.

In [5]:
# Plot the monthly series; the x-axis is thinned to roughly one label per year.
plt.subplots(figsize=(12, 8))
plt.plot(sql_01['sales_month'], sql_01['sales'])

plt.title("Monthly Retail and Food Services Sales")
plt.xlabel("Month")
plt.ylabel("$MM Sales")

plt.xticks(ticks=sql_01['sales_month'], labels=sql_01['sales_month'], rotation=45)
plt.locator_params(axis='x', nbins=len(sql_01['sales_month']) // 12)

plt.show()

In [6]:
query_02 = """
        SELECT strftime('%Y', sales_month) AS sales_year, SUM(sales) AS sales
        FROM retail_sales
        WHERE kind_of_business = 'Retail and food services sales, total'
        GROUP BY sales_year
        """

sql_02 = pd.read_sql(query_02, conn)
sql_02.head()

| | sales_year | sales |
|---|---|---|
| 0 | 1992 | 2014102.0 |
| 1 | 1993 | 2153095.0 |
| 2 | 1994 | 2330235.0 |
| 3 | 1995 | 2450628.0 |
| 4 | 1996 | 2603794.0 |


<img align="left" width="435" alt="Screen Shot 2022-04-18 at 4 31 29 PM" src="https://user-images.githubusercontent.com/73784742/163781000-f42f328b-51a9-4131-8819-1eb89433d8b5.png">

In [7]:
# Plot the yearly totals with one tick per year.
plt.subplots(figsize=(12, 8))
plt.plot(sql_02['sales_year'], sql_02['sales'])

# Show full numbers on the y-axis instead of scientific notation.
plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)

plt.title("Yearly Total Retail and Food Services Sales")
plt.xlabel("Year")
plt.ylabel("$MM Sales")
plt.xticks(ticks=sql_02['sales_year'], labels=sql_02['sales_year'], rotation=45)
plt.locator_params(axis='x', nbins=len(sql_02['sales_year']))

plt.show()

After graphing this data, we have a smoother time series that generally increases over time, as might be expected, since the sales values are not adjusted for inflation. Sales for all retail and food services fell in 2009, during the global financial crisis. After growing every year throughout the 2010s, sales were flat in 2020 compared to 2019, due to the impact of the COVID-19 pandemic.
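A statement such as "sales fell in 2009" can be checked by computing each year's percent change from the previous year. The following is a minimal sketch rather than part of the original notebook: it loads only the five yearly totals shown in the output above into an in-memory SQLite table, and the names `yearly_sales`, `demo_conn`, and `pct_change_from_prev_year` are assumptions made for illustration. Against the real data, the same self-join could be applied to the yearly aggregate of `retail_sales`.

```python
import sqlite3

# Yearly totals copied from the first rows of the yearly output above
# (millions of US dollars). The full series runs through 2020.
yearly_totals = [
    ("1992", 2014102.0),
    ("1993", 2153095.0),
    ("1994", 2330235.0),
    ("1995", 2450628.0),
    ("1996", 2603794.0),
]

# In-memory database so the sketch never touches sql-analysis.db.
demo_conn = sqlite3.connect(":memory:")
# Hypothetical table name, used only for this illustration.
demo_conn.execute("CREATE TABLE yearly_sales (sales_year TEXT, sales REAL)")
demo_conn.executemany("INSERT INTO yearly_sales VALUES (?, ?)", yearly_totals)

# Join each year to the year before it; the LEFT JOIN keeps the first
# year, whose percent change is NULL because it has no predecessor.
yoy_query = """
    SELECT cur.sales_year,
           cur.sales,
           ROUND((cur.sales / prev.sales - 1) * 100, 2) AS pct_change_from_prev_year
    FROM yearly_sales AS cur
    LEFT JOIN yearly_sales AS prev
           ON CAST(prev.sales_year AS INTEGER) = CAST(cur.sales_year AS INTEGER) - 1
    ORDER BY cur.sales_year
"""

yoy_rows = demo_conn.execute(yoy_query).fetchall()
for sales_year, sales, pct_change in yoy_rows:
    print(sales_year, sales, pct_change)
```

A negative value in the last column marks a year in which sales fell. Because the figures are not adjusted for inflation, these percentages describe nominal growth only.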

#### Comparing Components

In [8]:
query_03 = """
        SELECT strftime('%Y', sales_month) AS sales_year, kind_of_business, SUM(sales) AS sales
        FROM retail_sales
        WHERE kind_of_business IN ('Book stores', 'Sporting goods stores', 'Hobby, toy, and game stores')
        GROUP BY sales_year, kind_of_business
        """

sql_03 = pd.read_sql(query_03, conn)
sql_03

| | sales_year | kind_of_business | sales |
|---|---|---|---|
| 0 | 1992 | Book stores | 8327.0 |
| 1 | 1992 | Hobby, toy, and game stores | 11251.0 |
| 2 | 1992 | Sporting goods stores | 15583.0 |
| 3 | 1993 | Book stores | 9108.0 |
| 4 | 1993 | Hobby, toy, and game stores | 11651.0 |
| ... | ... | ... | ... |
| 82 | 2019 | Hobby, toy, and game stores | 16261.0 |
| 83 | 2019 | Sporting goods stores | 43808.0 |
| 84 | 2020 | Book stores | 6425.0 |
| 85 | 2020 | Hobby, toy, and game stores | 17287.0 |


<img align="left" width="435" alt="Screen Shot 2022-04-18 at 4 40 33 PM" src="https://user-images.githubusercontent.com/73784742/163782175-0fc4e8a8-54a5-4356-8248-d27c7c5fed88.png"> 

Sales at sporting goods retailers started out the highest of the three categories and grew much faster over the period, so by the end of the time series they were substantially higher. Sales at sporting goods stores started declining in 2017 but rebounded strongly in 2020. Sales at hobby, toy, and game stores were relatively flat over this time span, with a slight dip in the mid-2000s and another slight decline prior to a rebound in 2020. Sales at book stores grew until the mid-2000s and have been declining since then.
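One rough way to put numbers on "grew much faster" is to divide each category's sales in its last available year by its sales in its first year. The sketch below is an illustration under stated assumptions, not the notebook's own method: it uses only rows visible in the truncated output above, so Sporting goods stores ends in 2019 while the other two categories end in 2020, and the names `yearly_by_kind`, `demo_conn`, and `growth_multiple` are hypothetical.

```python
import sqlite3

# Rows copied from the truncated output above (millions of US dollars).
# That excerpt shows no 2020 row for Sporting goods stores, so its
# last year in this sketch is 2019.
visible_rows = [
    ("1992", "Book stores", 8327.0),
    ("1992", "Hobby, toy, and game stores", 11251.0),
    ("1992", "Sporting goods stores", 15583.0),
    ("2019", "Hobby, toy, and game stores", 16261.0),
    ("2019", "Sporting goods stores", 43808.0),
    ("2020", "Book stores", 6425.0),
    ("2020", "Hobby, toy, and game stores", 17287.0),
]

demo_conn = sqlite3.connect(":memory:")
# Hypothetical table holding the already-aggregated yearly rows.
demo_conn.execute(
    "CREATE TABLE yearly_by_kind (sales_year TEXT, kind_of_business TEXT, sales REAL)"
)
demo_conn.executemany("INSERT INTO yearly_by_kind VALUES (?, ?, ?)", visible_rows)

# Find each category's first and last year, then look up the sales in
# those two years and divide last by first.
growth_query = """
    WITH bounds AS (
        SELECT kind_of_business,
               MIN(sales_year) AS first_year,
               MAX(sales_year) AS last_year
        FROM yearly_by_kind
        GROUP BY kind_of_business
    )
    SELECT b.kind_of_business,
           b.first_year,
           f.sales AS first_sales,
           b.last_year,
           l.sales AS last_sales,
           ROUND(l.sales / f.sales, 2) AS growth_multiple
    FROM bounds AS b
    JOIN yearly_by_kind AS f
      ON f.kind_of_business = b.kind_of_business AND f.sales_year = b.first_year
    JOIN yearly_by_kind AS l
      ON l.kind_of_business = b.kind_of_business AND l.sales_year = b.last_year
    ORDER BY growth_multiple DESC
"""

growth_rows = demo_conn.execute(growth_query).fetchall()
for row in growth_rows:
    print(row)
```

A multiple below 1 means the category ended lower than it started. The multiples are in nominal dollars and compare only two endpoints, so they complement the chart and do not replace it.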

In [17]:
query_04 = """
        SELECT strftime('%Y', sales_month) AS sales_year, kind_of_business, SUM(sales) AS sales
        FROM retail_sales
        WHERE kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores')
        GROUP BY sales_year, kind_of_business
        """

sql_04 = pd.read_sql(query_04, conn)
sql_04.head()

| | sales_year | kind_of_business | sales |
|---|---|---|---|
| 0 | 1992 | Men's clothing stores | 10179.0 |
| 1 | 1992 | Women's clothing stores | 31815.0 |
| 2 | 1993 | Men's clothing stores | 9962.0 |
| 3 | 1993 | Women's clothing stores | 32350.0 |
| 4 | 1994 | Men's clothing stores | 10032.0 |


<img align="left" width="435" alt="Screen Shot 2022-04-18 at 7 49 03 PM" src="https://user-images.githubusercontent.com/73784742/163804053-ee0df89b-f073-4720-be07-7c9cab0c414c.png">

The gap between men’s and women’s sales does not appear constant; it widened during the early to mid-2000s. Women’s clothing sales in particular dipped during the global financial crisis of 2008–2009, and sales in both categories dropped sharply during the pandemic in 2020.

In [16]:
query_05 = """
        SELECT strftime('%Y', sales_month) AS sales_year,
            SUM(CASE WHEN kind_of_business = 'Women''s clothing stores'
                THEN sales
                END) AS womens_sales,
            SUM(CASE WHEN kind_of_business = 'Men''s clothing stores'
                THEN sales
                END) AS mens_sales
        FROM retail_sales
        WHERE kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores')
        GROUP BY sales_year
        """

sql_05 = pd.read_sql(query_05, conn)
sql_05.head()

| | sales_year | womens_sales | mens_sales |
|---|---|---|---|
| 0 | 1992 | 31815.0 | 10179.0 |
| 1 | 1993 | 32350.0 | 9962.0 |
| 2 | 1994 | 30585.0 | 10032.0 |
| 3 | 1995 | 28696.0 | 9315.0 |
| 4 | 1996 | 28238.0 | 9546.0 |


In [20]:
query_06 = """
        SELECT sales_year,
            womens_sales - mens_sales AS womens_minus_mens
        FROM
        (
            SELECT strftime('%Y', sales_month) AS sales_year,
                SUM(CASE WHEN kind_of_business = 'Women''s clothing stores'
                    THEN sales
                    END) AS womens_sales,
                SUM(CASE WHEN kind_of_business = 'Men''s clothing stores'
                    THEN sales
                    END) AS mens_sales
            FROM retail_sales
            WHERE kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores')
                AND sales_month <= '2019-12-01'
            GROUP BY sales_year
        )
        """

sql_06 = pd.read_sql(query_06, conn)
sql_06.head()

| | sales_year | womens_minus_mens |
|---|---|---|
| 0 | 1992 | 21636.0 |
| 1 | 1993 | 22388.0 |
| 2 | 1994 | 20553.0 |
| 3 | 1995 | 19381.0 |
| 4 | 1996 | 18692.0 |


<img align="left" width="443" alt="Screen Shot 2022-04-18 at 8 09 16 PM" src="https://user-images.githubusercontent.com/73784742/163806248-93b1b5c3-692f-45f1-94b8-801e52d63858.png">

The gap decreased between 1992 and about 1997, began a long increase through about 2011 (with a brief dip in 2007), and then was more or less flat through 2019.
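Whether the gap is widening or narrowing in a given year can also be read from the year-to-year change in `womens_minus_mens` instead of from the chart. The sketch below is one possible way to do that, not the notebook's own code. It assumes SQLite 3.25 or later for the `LAG` window function, uses only the five years of women's and men's sales shown in the earlier output, and introduces the hypothetical names `clothing_sales`, `demo_conn`, and `change_from_prev_year`.

```python
import sqlite3

# (sales_year, womens_sales, mens_sales) copied from the pivoted output
# shown earlier, in millions of US dollars.
clothing_by_year = [
    ("1992", 31815.0, 10179.0),
    ("1993", 32350.0, 9962.0),
    ("1994", 30585.0, 10032.0),
    ("1995", 28696.0, 9315.0),
    ("1996", 28238.0, 9546.0),
]

demo_conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the pivoted yearly result.
demo_conn.execute(
    "CREATE TABLE clothing_sales (sales_year TEXT, womens_sales REAL, mens_sales REAL)"
)
demo_conn.executemany("INSERT INTO clothing_sales VALUES (?, ?, ?)", clothing_by_year)

# Compute the gap first, then subtract the previous year's gap.
# LAG returns NULL for the first year, so its change is NULL as well.
gap_query = """
    WITH gaps AS (
        SELECT sales_year,
               womens_sales - mens_sales AS womens_minus_mens
        FROM clothing_sales
    )
    SELECT sales_year,
           womens_minus_mens,
           womens_minus_mens - LAG(womens_minus_mens) OVER (ORDER BY sales_year)
               AS change_from_prev_year
    FROM gaps
    ORDER BY sales_year
"""

gap_rows = demo_conn.execute(gap_query).fetchall()
for sales_year, gap, change in gap_rows:
    print(sales_year, gap, change)
```

A positive change means the gap widened that year and a negative change means it narrowed. In these five years the change is negative from 1994 onward, which agrees with the decline described above.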

#### Percent of Total Calculations