# Querying Sales Data with fString SQL
## Introduction
In this notebook, we will use fString queries to preview a mock dataset and then build a query that will help analysts get all of the information they need for exploratory analysis, standardized measurement studies, and data transfer.

This is a companion notebook to `QueryingSalesDataIbis.ipynb`.

## Setup

We have a basic star schema containing (mock) sales data for several retailers.  The tables are as follows:

* `fact_sale`: `fact_id` | `date_id` | `store_id` | `product_id` | `unit_id` | `value` | `created_at`
    * Sales data in aggregate by some time period (in this case, weeks.  See `dim_date`)
* `dim_date`: `date_id` | `date_value`
    * date dimensions (ID to value).  Just contains week values.
    * (join fact by `date_id`)
* `dim_store`: `store_id` | `store_name` | `retailer_id`
    * store dimensions (ID to names)
    * (join fact by `store_id`)
* `dim_retailer`: `retailer_id` | `retailer_name`
    * retailer dimensions (ID to names)
    * (join stores by `retailer_id`)
* `dim_product`: `product_id` | `product_name`
    * product dimensions (ID to names)
    * (join fact by `product_id`)
* `dim_unit`: `unit_id` | `unit_name`
    * unit dimensions (ID to names).  For now, just contains Units and Sales USD
    * (join fact by `unit_id`)
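
To make the join paths concrete, here is a minimal sketch of the sales schema as SQLite DDL. The table and column names come from the list above; the column types, the key constraints, and the throwaway in-memory database are assumptions for illustration only.

```python
import sqlite3

# Assumed column types and constraints; only table and column names come from the list above.
SCHEMA_DDL = """
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, date_value TEXT);
CREATE TABLE dim_retailer (retailer_id INTEGER PRIMARY KEY, retailer_name TEXT);
CREATE TABLE dim_store    (store_id INTEGER PRIMARY KEY, store_name TEXT,
                           retailer_id INTEGER REFERENCES dim_retailer (retailer_id));
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_unit     (unit_id INTEGER PRIMARY KEY, unit_name TEXT);
CREATE TABLE fact_sale (
    fact_id    INTEGER PRIMARY KEY,
    date_id    INTEGER REFERENCES dim_date (date_id),
    store_id   INTEGER REFERENCES dim_store (store_id),
    product_id INTEGER REFERENCES dim_product (product_id),
    unit_id    INTEGER REFERENCES dim_unit (unit_id),
    value      REAL,
    created_at TEXT
);
"""

demo_conn = sqlite3.connect(":memory:")  # throwaway database, not sales_data.db
demo_conn.executescript(SCHEMA_DDL)

# List the tables that were created, in alphabetical order.
schema_tables = sorted(
    row[0] for row in demo_conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(schema_tables)
```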

Our analysts will use this data to run some rudimentary analyses for some product set at some retailer.  A typical analysis will have the following parameters:

* dates:
    * A test period that we want to measure.  Typically 4 weeks with a 2-week lag
    * A control period that we will use to establish a baseline wrt time
* products:
    * An arbitrary binning of products
* store segments:
    * Test: treatment applied
    * Control: treatment not applied, to help establish a baseline wrt stores
    * Not Selected: n/a
* units
    * Typically Sales in Currency (USD in this example)

Store segmentations created in our measurement studies are held in another star schema by an analysis ID, which connects to our sales schema using `store_id`:

* `fact_segmentation`: `fact_id` | `analysis_id` | `segment_id` | `store_id`
    * (join `dim_store` by `store_id`)
* `dim_segment`: `segment_id` | `segment_name`
    * (join fact by `segment_id`)

In the event a store segmentation doesn't exist, we can simply create one segment for all interesting stores (some subset of the universe) and then upload it to `fact_segmentation`.
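
A minimal sketch of that fallback, assuming an in-memory SQLite database with simplified tables. The analysis ID `99`, the segment name `ALL`, and the rule for picking interesting stores (here, every store of one retailer) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for illustration
conn.executescript("""
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, store_name TEXT, retailer_id INTEGER);
CREATE TABLE dim_segment (segment_id INTEGER PRIMARY KEY, segment_name TEXT);
CREATE TABLE fact_segmentation (
    fact_id INTEGER PRIMARY KEY, analysis_id INTEGER, segment_id INTEGER, store_id INTEGER
);
INSERT INTO dim_store VALUES (1, '0', 1), (2, '1', 1), (3, '2', 2);
INSERT INTO dim_segment VALUES (1, 'ALL');
""")

NEW_ANALYSIS_ID = 99   # hypothetical analysis with no existing segmentation
ALL_SEGMENT_ID = 1     # hypothetical catch-all segment inserted above
RETAILER_ID = 1        # hypothetical definition of "interesting stores"

# One fact_segmentation row per interesting store, all assigned to the same segment.
conn.execute(
    """
    INSERT INTO fact_segmentation (analysis_id, segment_id, store_id)
    SELECT ?, ?, store_id FROM dim_store WHERE retailer_id = ?
    """,
    (NEW_ANALYSIS_ID, ALL_SEGMENT_ID, RETAILER_ID),
)

seg_rows = conn.execute(
    "SELECT analysis_id, segment_id, store_id FROM fact_segmentation ORDER BY store_id"
).fetchall()
print(seg_rows)
```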

## Things to Keep in Mind

At some point, an analyst or downstream vendor might want to look at the data and map values to real-world objects, so it helps to keep as much of the data human-readable as possible.  Therefore, we want to include enough detail that internal users can limit their `SELECT * FROM TABLE WHERE TABLE_ID = VALUE` calls and external users can limit their prying emails.

It might also be tempting to keep only the products an analysis calls for, but for exploratory purposes we should at least aggregate the remaining products into additional buckets.  We will create three additional groups that together cover all products:

1. `included_products`
2. `excluded_products`
3. `all_products`

# Sales Data Query

In this section, we will build a Sales Data Query using the mock sales data referenced above.  To generate a mock dataset and follow along, you can use the MockSalesData notebook provided in this repo.

The columns we need are chosen to help an analyst comb through the data quickly in the event of an anomaly or for exploratory analyses:

1. `analysis_id`: to keep track of what analysis ID we're using
1. `[ date_id | store_id | unit_id ]`: a set of ID references that act as a unique label for our product groups at a given store, during a given week, for a given unit
1. `[ retailer_name | store_name ]`: human-readable store label
1. `date_value`: human-readable date label
1. `unit_name`: human-readable unit label
1. product group names: we will group our products by product group label and then pivot our data so it takes up fewer rows (also so that a human can keep better track of the labels in (2))

We will begin by exploring the tables in our database and then construct our query to accept arbitrary analysis parameters:

* `ANALYSIS_ID`: used to filter store segments and to label the analysis
* `PRODUCT_GROUPS`: a mapping of product group names (group labels) to product names (`product_name`)
    * These are assumed to be mutually exclusive for the purposes of addition
* dates:
    * `TEST_PERIOD_START`: week of treatment start
    * `TEST_PERIOD_END`: week of treatment end
    * `LAG_PERIOD_END`: week of the lag period end
    * `CONTROL_PERIOD_START`: week of control start
    * `CONTROL_PERIOD_END`: week of control end
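
The mutual-exclusivity assumption on the product groups is easy to verify before building any SQL, so that the group columns can be added without double counting. A minimal sketch; the helper name `find_overlapping_products` is hypothetical and not part of this notebook.

```python
from collections import Counter


def find_overlapping_products(product_groups: dict) -> list:
    """Return product names listed more than once across (or within) the groups."""
    counts = Counter(name for names in product_groups.values() for name in names)
    return sorted(name for name, n in counts.items() if n > 1)


# Same shape as the PRODUCT_GROUPS mapping used in this notebook.
example_groups = {
    'group_1': ['PRODUCT_0', 'PRODUCT_4'],
    'group_2': ['PRODUCT_1'],
    'group_3': ['PRODUCT_3', 'PRODUCT_5', 'PRODUCT_10'],
}
overlaps = find_overlapping_products(example_groups)
print(overlaps)  # an empty list means the groups are mutually exclusive
```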

## Setup

Analysis vars for filtering and transforming:

In [1]:
ANALYSIS_ID = 3

# In a more sophisticated data warehouse, we would store analysis parameters as a configuration
# inside of our database and connect this configuration to our data using the configuration id
PRODUCT_GROUPS = {
    'group_1': ['PRODUCT_0', 'PRODUCT_4']
    ,'group_2': ['PRODUCT_1']
    ,'group_3': ['PRODUCT_3', 'PRODUCT_5', 'PRODUCT_10']
}

TEST_PERIOD_START = 'WEEK_52'
TEST_PERIOD_END = 'WEEK_55'

LAG_PERIOD_END = 'WEEK_57'

CONTROL_PERIOD_START = 'WEEK_00'
CONTROL_PERIOD_END = 'WEEK_51'

Misc vars for pointing to stuff:

In [2]:
SALES_DB = 'sales_data.db'

To get our data using SQL fString, we will use a sqlite3 connector and pandas:

In [3]:
import sqlite3
import pandas as pd

sconn = sqlite3.connect(SALES_DB)


# an ipynb classic
def read_sql(sql, con=sconn):
    return pd.read_sql(sql, con)

### Exploratory Functions

Here we are just counting the rows in each table of interest.

In [4]:
def table_ct_fstr(table_name: str):
    sql = f"select count(*) as ct from {table_name}"
    return read_sql(sql)['ct'][0]

for tbl in ['fact_sale', 'dim_date', 'dim_store', 'dim_retailer', 'dim_product', 'dim_unit', 'fact_segmentation', 'dim_segment']:
    print(tbl, "row count:", table_ct_fstr(tbl))

fact_sale row count: 26856660
dim_date row count: 58
dim_store row count: 8107
dim_retailer row count: 4
dim_product row count: 15
dim_unit row count: 2
fact_segmentation row count: 8107
dim_segment row count: 3


`fact_sale` is a bit big, so let's preview a few rows before moving on.

In [8]:
sql = "SELECT * FROM fact_sale LIMIT 5"
read_sql(sql)

Unnamed: 0,fact_id,date_id,store_id,product_id,unit_id,value,created_at
0,1,1,1,1,1,6.0,2022-06-03 14:20:22
1,2,1,1,1,2,31.56,2022-06-03 14:20:22
2,3,2,1,1,1,26.0,2022-06-03 14:20:22
3,4,2,1,1,2,136.76,2022-06-03 14:20:22
4,5,3,1,1,1,24.0,2022-06-03 14:20:22


Since we expect there to be only one row per set over the partition of `date_id` | `store_id` | `product_id` | `unit_id`, let's check this condition real quick before moving forward.

In [9]:
sql = """
SELECT
    date_id
    ,store_id
    ,product_id
    ,unit_id
    ,count(fact_id)
FROM
    fact_sale
GROUP BY
    1
    ,2
    ,3
    ,4
HAVING
    count(fact_id) > 1
LIMIT 5
"""

read_sql(sql)

Unnamed: 0,date_id,store_id,product_id,unit_id,count(fact_id)
0,1,1,1,1,2
1,1,1,1,2,2
2,1,1,2,1,2
3,1,1,2,2,2
4,1,1,3,1,2


Looks like there are duplicates, so let's grab one and see what a duplicate looks like:

In [13]:
def fact_sale_filter_fstr(**kwargs):
    sql = f"""
    SELECT
        *
    FROM
        fact_sale fs
    JOIN
        dim_date dd
        ON dd.date_id = fs.date_id
    JOIN
        dim_store ds
        ON ds.store_id = fs.store_id
    JOIN
        dim_retailer dr
        ON dr.retailer_id = ds.retailer_id
    JOIN
        dim_product dp
        ON dp.product_id = fs.product_id
    JOIN
        dim_unit du
        ON du.unit_id = fs.unit_id
    """
    wstm = "fs.{} = {}"
    wstms = [wstm.format(k, v) for k, v in kwargs.items()]

    if not wstms:
        raise RuntimeError("you should filter on something there bud")

    sql = sql + 'WHERE\n        ' + '\n    AND\n        '.join(wstms)
    return read_sql(sql)

fact_sale_filter_fstr(date_id=1, store_id=1, product_id=1, unit_id=1)

Unnamed: 0,fact_id,date_id,store_id,product_id,unit_id,value,created_at,date_id.1,date_value,store_id.1,store_name,retailer_id,retailer_id.1,retailer_name,product_id.1,product_name,unit_id.1,unit_name
0,1,1,1,1,1,6.0,2022-06-03 14:20:22,1,WEEK_00,1,0,1,1,RETAILER_0,1,PRODUCT_0,1,SALES_UNITS
1,13428331,1,1,1,1,3.0,2022-06-03 14:22:57,1,WEEK_00,1,0,1,1,RETAILER_0,1,PRODUCT_0,1,SALES_UNITS


In this dataset, the earlier row's `value` (6.0) is double the later row's (3.0).  It is likely that the data was doubled on upload and the error was quickly corrected with a re-upload.

Some core principles of this table are:

1. never delete data, including data uploaded in error
2. if something is uploaded in error, fix it somewhere else and then re-upload it
3. append the current timestamp to the data upon upload

So in this table, we use `created_at` to give us the latest version of the truth for each `date_id`, `store_id`, `product_id`, `unit_id` set.  That means we need to pick the rows where `created_at` is equal to `MAX(created_at) OVER (PARTITION BY date_id, store_id, product_id, unit_id)`.

Running this window function every time we want data can be expensive, so we should filter our data as much as possible before running it.
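
To see the latest-row rule in isolation, here is a minimal sketch on an in-memory SQLite table. It holds the two duplicate rows found above plus one row that has no duplicate in this sketch; the simplified table (no dimension joins) is an assumption for illustration. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, not sales_data.db

# fact_id 1 is the superseded upload, 13428331 is the corrected re-upload,
# and fact_id 3 stands in for a row that was only uploaded once.
conn.executescript("""
CREATE TABLE fact_sale (
    fact_id INTEGER, date_id INTEGER, store_id INTEGER, product_id INTEGER,
    unit_id INTEGER, value REAL, created_at TEXT
);
INSERT INTO fact_sale VALUES
    (1,        1, 1, 1, 1, 6.0,  '2022-06-03 14:20:22'),
    (13428331, 1, 1, 1, 1, 3.0,  '2022-06-03 14:22:57'),
    (3,        2, 1, 1, 1, 26.0, '2022-06-03 14:20:22');
""")

# Keep only the rows whose created_at equals the newest created_at in their partition.
latest_sql = """
SELECT fact_id, value
FROM (
    SELECT
        fs.*
        ,MAX(fs.created_at) OVER (
            PARTITION BY fs.date_id, fs.store_id, fs.product_id, fs.unit_id
        ) AS maxfact
    FROM fact_sale fs
) t
WHERE t.created_at = t.maxfact
ORDER BY fact_id
"""
latest_rows = conn.execute(latest_sql).fetchall()
print(latest_rows)
```

The comparison works on text here because the timestamps are stored in a fixed-width, most-significant-first format, so string order matches chronological order.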

To help filter down our data, we will join `fact_sale` on to filtered dimension tables for the values we care about.  Let's craft those filter queries now.

### Filtering `dim_date`

We are given 5 date points: test start and end, lag end, and control start and end.  We can smartly lump the test period and lag period together to create two ranges: the intervention+lag period, and the control period.

To filter dates, we'll find all weeks between both of those ranges:

* `BETWEEN TEST_PERIOD_START AND LAG_PERIOD_END`
* `BETWEEN CONTROL_PERIOD_START AND CONTROL_PERIOD_END`

We keep the two ranges separate because they need not be adjacent: an analysis might compare period over period rather than use a 52-week control immediately followed by a 6-week test and lag window.

In [14]:
def fstr_dim_date(control_period, test_period):
    sql = """
    SELECT
        date_id
        ,date_value
    FROM
        dim_date
    WHERE
        date_value BETWEEN '{}' AND '{}'
    OR
        date_value BETWEEN '{}' AND '{}'
    """.format(*control_period, *test_period)
    return sql

q_dim_date = fstr_dim_date([CONTROL_PERIOD_START, CONTROL_PERIOD_END], [TEST_PERIOD_START, LAG_PERIOD_END])
print(q_dim_date)


    SELECT
        date_id
        ,date_value
    FROM
        dim_date
    WHERE
        date_value BETWEEN 'WEEK_00' AND 'WEEK_51'
    OR
        date_value BETWEEN 'WEEK_52' AND 'WEEK_57'
    


Note that weeks are stored as strings in this database for generality.  This query _could_ actually work with date values.
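
One caveat worth spelling out: `BETWEEN` on strings compares text, not numbers, so the ranges only behave as intended because the week labels are zero-padded to the same width. A small sketch in plain Python, whose string ordering matches SQLite's default `BINARY` collation for ASCII text. The padded labels are assumed to mirror `dim_date`; the unpadded labels are hypothetical.

```python
padded = [f"WEEK_{i:02d}" for i in range(58)]   # WEEK_00 ... WEEK_57 (assumed dim_date contents)
unpadded = [f"WEEK_{i}" for i in range(58)]     # hypothetical labels without zero padding

# Equivalent of: date_value BETWEEN 'WEEK_52' AND 'WEEK_57'
in_range_padded = [w for w in padded if 'WEEK_52' <= w <= 'WEEK_57']

# Equivalent of: date_value BETWEEN 'WEEK_2' AND 'WEEK_10' on unpadded labels
in_range_unpadded = [w for w in unpadded if 'WEEK_2' <= w <= 'WEEK_10']

print(in_range_padded)    # the six weeks of the test + lag period
print(in_range_unpadded)  # empty, because 'WEEK_2' sorts after 'WEEK_10'
```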

### Filtering `dim_store`

We have an `analysis_id` (`ANALYSIS_ID`), and we have store segments for that analysis saved in `fact_segmentation`.  Let's filter `dim_store` and pull in store segments, retailer names, and store names at the same time:

In [15]:
def fstr_dim_store(analysis_ids: list):
    """Filter dim_store using an analysis store segmentation."""
    id_lst = ', '.join(str(e) for e in analysis_ids)

    sql = f"""
    SELECT
        ds.store_id
        ,ds.store_name
        ,dseg.segment_name
        ,dr.retailer_id
        ,dr.retailer_name
    FROM
        fact_segmentation fseg
    JOIN
        dim_segment dseg
        ON dseg.segment_id = fseg.segment_id
    JOIN
        dim_store ds
        ON ds.store_id = fseg.store_id
    JOIN
        dim_retailer dr
        ON dr.retailer_id = ds.retailer_id
    WHERE
        fseg.analysis_id IN ({id_lst})
    """
    return sql

q_dim_store = fstr_dim_store([ANALYSIS_ID])
print(q_dim_store)


    SELECT
        ds.store_id
        ,ds.store_name
        ,dseg.segment_name
        ,dr.retailer_id
        ,dr.retailer_name
    FROM
        fact_segmentation fseg
    JOIN
        dim_segment dseg
        ON dseg.segment_id = fseg.segment_id
    JOIN
        dim_store ds
        ON ds.store_id = fseg.store_id
    JOIN
        dim_retailer dr
        ON dr.retailer_id = ds.retailer_id
    WHERE
        fseg.analysis_id IN (3)
    


### Filtering `dim_product` and `dim_unit`

As mentioned before, it may be tempting to filter `dim_product`.  Our example analysis uses only 6 of the 15 products, so cutting the rest would drop 9 products cross however many stores cross however many dates cross however many units we care about.  Instead, we're going to bucket them and include them in our analysis.  Same deal with units.

For larger datasets, it may be a good idea to filter further--for example, `dim_unit` might contain multiple currencies or on hand units, and `dim_product` might contain products for multiple irrelevant companies that we would want to exclude from our analysis.
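
As a sketch of what such a filter could look like, the variant below restricts `dim_unit` to a list of unit names. `fstr_dim_unit_filtered` is a hypothetical helper, not part of this notebook; it returns `?` placeholders plus a parameter list so the values are bound by the driver rather than formatted into the SQL string. The extra unit `ON_HAND_UNITS` is made up for the demo.

```python
import sqlite3


def fstr_dim_unit_filtered(unit_names: list) -> tuple:
    """Hypothetical variant of fstr_dim_unit: returns (sql, params) for the given unit names."""
    if not unit_names:
        raise ValueError("unit_names must not be empty")
    # One '?' per value; the values themselves never enter the SQL text.
    placeholders = ', '.join('?' for _ in unit_names)
    sql = f"""
    SELECT
        unit_id
        ,unit_name
    FROM
        dim_unit
    WHERE
        unit_name IN ({placeholders})
    """
    return sql, list(unit_names)


# Demo on a throwaway in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_unit (unit_id INTEGER PRIMARY KEY, unit_name TEXT)")
conn.executemany(
    "INSERT INTO dim_unit VALUES (?, ?)",
    [(1, 'SALES_UNITS'), (2, 'SALES_USD'), (3, 'ON_HAND_UNITS')],
)

sql, params = fstr_dim_unit_filtered(['SALES_USD'])
unit_rows = conn.execute(sql, params).fetchall()
print(unit_rows)
```

Because the parameters travel separately from the SQL text, a subquery built this way could not be pasted into a larger f-string unchanged; the parameter lists would have to be collected in order and passed along, for example through the `params` argument of `pd.read_sql`.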

In [16]:
def fstr_dim_product():
    sql = """
    SELECT
        product_id
        ,product_name
    FROM
        dim_product
    """
    return sql


def fstr_dim_unit():
    sql = """
    SELECT
        unit_id
        ,unit_name
    FROM
        dim_unit
    """
    return sql

q_dim_product = fstr_dim_product()
print(q_dim_product)
q_dim_unit = fstr_dim_unit()
print(q_dim_unit)


    SELECT
        product_id
        ,product_name
    FROM
        dim_product
    

    SELECT
        unit_id
        ,unit_name
    FROM
        dim_unit
    


### Putting it All Together to Filter Fact Sale and Calculate `MAX(created_at)` (maxfact)

Now we'll combine all of our queries to filter `fact_sale` in preparation for our maxfact query:

In [17]:
def fstr_fact_sale(analysis_ids: list, control_period: iter, test_period: iter) -> str:
    dim_date = fstr_dim_date(control_period, test_period)
    dim_store = fstr_dim_store(analysis_ids)
    dim_product = fstr_dim_product()
    dim_unit = fstr_dim_unit()

    sql = f"""
    SELECT
        fs.fact_id
        ,fs.created_at
        ,MAX(fs.created_at) OVER (PARTITION BY fs.date_id, fs.store_id, fs.product_id, fs.unit_id) AS maxfact
        ,fs.date_id
        ,fs.store_id
        ,fs.product_id
        ,fs.unit_id
        ,fs.value
        ,dd.date_value
        ,ds.store_name
        ,ds.retailer_name
        ,ds.segment_name
        ,dp.product_name
        ,du.unit_name
    FROM
        fact_sale fs
    JOIN (
    {dim_date}
        ) dd
        ON dd.date_id = fs.date_id
    JOIN (
    {dim_store}
        ) ds
        ON ds.store_id = fs.store_id
    JOIN (
    {dim_product}
        ) dp
        ON dp.product_id = fs.product_id
    JOIN (
    {dim_unit}
        ) du
        ON du.unit_id = fs.unit_id
    """
    return sql

q_fact_sale = fstr_fact_sale([ANALYSIS_ID], [CONTROL_PERIOD_START, CONTROL_PERIOD_END], [TEST_PERIOD_START, LAG_PERIOD_END])
print(q_fact_sale)


    SELECT
        fs.fact_id
        ,fs.created_at
        ,MAX(fs.created_at) OVER (PARTITION BY fs.date_id, fs.store_id, fs.product_id, fs.unit_id) AS maxfact
        ,fs.date_id
        ,fs.store_id
        ,fs.product_id
        ,fs.unit_id
        ,fs.value
        ,dd.date_value
        ,ds.store_name
        ,ds.retailer_name
        ,ds.segment_name
        ,dp.product_name
        ,du.unit_name
    FROM
        fact_sale fs
    JOIN (
    
    SELECT
        date_id
        ,date_value
    FROM
        dim_date
    WHERE
        date_value BETWEEN 'WEEK_00' AND 'WEEK_51'
    OR
        date_value BETWEEN 'WEEK_52' AND 'WEEK_57'
    
        ) dd
        ON dd.date_id = fs.date_id
    JOIN (
    
    SELECT
        ds.store_id
        ,ds.store_name
        ,dseg.segment_name
        ,dr.retailer_id
        ,dr.retailer_name
    FROM
        fact_segmentation fseg
    JOIN
        dim_segment dseg
        ON dseg.segment_id = fseg.segment_id
    JOIN
        dim_store ds
   

### "Pivoting" using `SUM(CASE WHEN column IN (set) THEN value ELSE 0 END) AS name` on Arbitrary Sets

We now have one last thing to do: flatten our data by bucketing our products into their respective groups.  We can pivot our data by aggregating `fact_sale.value` if a product name is in the set.
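
Here is the idea on a toy table before the expressions are generated programmatically. The in-memory `sales` table, its rows, and the column names `group_a` and `other` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for illustration
conn.execute("CREATE TABLE sales (store_id INTEGER, product_name TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, 'PRODUCT_0', 2.0),
        (1, 'PRODUCT_4', 3.0),
        (1, 'PRODUCT_7', 5.0),
        (2, 'PRODUCT_0', 1.0),
        (2, 'PRODUCT_7', 4.0),
    ],
)

# Each SUM(CASE ...) only adds value for rows whose product is in its set,
# so one row per product collapses into one column per group.
pivot_sql = """
SELECT
    store_id
    ,SUM(CASE WHEN product_name IN ('PRODUCT_0', 'PRODUCT_4') THEN value ELSE 0 END) AS group_a
    ,SUM(CASE WHEN product_name NOT IN ('PRODUCT_0', 'PRODUCT_4') THEN value ELSE 0 END) AS other
    ,SUM(value) AS all_products
FROM sales
GROUP BY store_id
ORDER BY store_id
"""
pivot_rows = conn.execute(pivot_sql).fetchall()
print(pivot_rows)
```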

Let's create some functions to help us do that:

First, let's deal with our defined groups:

In [18]:
# return a list here so we can add up all of our case/when statement lists
# and then format a final case/when statement
def cw_def_groups(product_groups: dict) -> list:
    def fmt_set(s) -> str:
        return "'" + "', '".join(str(e) for e in s) + "'"

    cw_skel = 'SUM(CASE WHEN product_name IN ({}) THEN fs.value ELSE 0 END) AS "{}"'
    return [cw_skel.format(fmt_set(v), k) for k, v in product_groups.items()]

q_agg_cw_def_groups = cw_def_groups(PRODUCT_GROUPS)
print('\n'.join(q_agg_cw_def_groups))

SUM(CASE WHEN product_name IN ('PRODUCT_0', 'PRODUCT_4') THEN fs.value ELSE 0 END) AS "group_1"
SUM(CASE WHEN product_name IN ('PRODUCT_1') THEN fs.value ELSE 0 END) AS "group_2"
SUM(CASE WHEN product_name IN ('PRODUCT_3', 'PRODUCT_5', 'PRODUCT_10') THEN fs.value ELSE 0 END) AS "group_3"


Next, let's deal with the included, excluded, and all-products groups:

In [19]:
def cw_undef_groups(product_groups: dict) -> list:
    def fmt_set(s) -> str:
        return "'" + "', '".join(sorted(str(e) for e in s)) + "'"

    cw_skel = 'SUM(CASE WHEN product_name {} IN ({}) THEN fs.value ELSE 0 END) AS "{}"'

    pset = set()
    for pg in product_groups.values():
        pset = pset.union(set(pg))

    cws = [
        cw_skel.format('', fmt_set(pset), 'included_products')
        ,cw_skel.format('NOT', fmt_set(pset), 'excluded_products')
        ,'SUM(value) AS all_products'
    ]
    return cws

Finally, we will compile the aggregate expressions:

In [20]:
def fstr_cw_stms(product_groups: dict) -> str:
    grps = cw_def_groups(product_groups)
    misc = cw_undef_groups(product_groups)

    cws = '\n        ,'.join(grps + misc)
    return cws

stm_cws = fstr_cw_stms(PRODUCT_GROUPS)
print(stm_cws)

SUM(CASE WHEN product_name IN ('PRODUCT_0', 'PRODUCT_4') THEN fs.value ELSE 0 END) AS "group_1"
        ,SUM(CASE WHEN product_name IN ('PRODUCT_1') THEN fs.value ELSE 0 END) AS "group_2"
        ,SUM(CASE WHEN product_name IN ('PRODUCT_3', 'PRODUCT_5', 'PRODUCT_10') THEN fs.value ELSE 0 END) AS "group_3"
        ,SUM(CASE WHEN product_name  IN ('PRODUCT_0', 'PRODUCT_1', 'PRODUCT_10', 'PRODUCT_3', 'PRODUCT_4', 'PRODUCT_5') THEN fs.value ELSE 0 END) AS "included_products"
        ,SUM(CASE WHEN product_name NOT IN ('PRODUCT_0', 'PRODUCT_1', 'PRODUCT_10', 'PRODUCT_3', 'PRODUCT_4', 'PRODUCT_5') THEN fs.value ELSE 0 END) AS "excluded_products"
        ,SUM(value) AS all_products


### The Final Query

By putting this all together, we get our Sales Data Query:

In [21]:
def sales_data_query(
    analysis_ids: list
    ,product_groups: dict
    ,control_period: iter
    ,test_period: iter
) -> str:
    fact_sale = fstr_fact_sale(analysis_ids, control_period, test_period)
    cw_stms = fstr_cw_stms(product_groups)
    gbcols = """fs.date_id
        ,fs.store_id
        ,fs.unit_id
        ,fs.date_value
        ,fs.store_name
        ,fs.retailer_name
        ,fs.segment_name"""

    sql = f"""
    SELECT
        {gbcols}
        ,fs.unit_name
        ,{cw_stms}
    FROM (
    {fact_sale}
        ) fs
    WHERE
        fs.maxfact = fs.created_at
    GROUP BY
        {gbcols}
    """
    return sql

sdq = sales_data_query(
    [ANALYSIS_ID]
    ,PRODUCT_GROUPS
    ,[CONTROL_PERIOD_START, CONTROL_PERIOD_END]
    ,[TEST_PERIOD_START, LAG_PERIOD_END]
)
print(sdq)


    SELECT
        fs.date_id
        ,fs.store_id
        ,fs.unit_id
        ,fs.date_value
        ,fs.store_name
        ,fs.retailer_name
        ,fs.segment_name
        ,fs.unit_name
        ,SUM(CASE WHEN product_name IN ('PRODUCT_0', 'PRODUCT_4') THEN fs.value ELSE 0 END) AS "group_1"
        ,SUM(CASE WHEN product_name IN ('PRODUCT_1') THEN fs.value ELSE 0 END) AS "group_2"
        ,SUM(CASE WHEN product_name IN ('PRODUCT_3', 'PRODUCT_5', 'PRODUCT_10') THEN fs.value ELSE 0 END) AS "group_3"
        ,SUM(CASE WHEN product_name  IN ('PRODUCT_0', 'PRODUCT_1', 'PRODUCT_10', 'PRODUCT_3', 'PRODUCT_4', 'PRODUCT_5') THEN fs.value ELSE 0 END) AS "included_products"
        ,SUM(CASE WHEN product_name NOT IN ('PRODUCT_0', 'PRODUCT_1', 'PRODUCT_10', 'PRODUCT_3', 'PRODUCT_4', 'PRODUCT_5') THEN fs.value ELSE 0 END) AS "excluded_products"
        ,SUM(value) AS all_products
    FROM (
    
    SELECT
        fs.fact_id
        ,fs.created_at
        ,MAX(fs.created_at) OVER (PARTITION BY 

And, finally, running it:

In [25]:
sales = read_sql(sdq)

sales.shape

(198946, 14)

In [26]:
sales.head()

Unnamed: 0,date_id,store_id,unit_id,date_value,store_name,retailer_name,segment_name,unit_name,group_1,group_2,group_3,included_products,excluded_products,all_products
0,1,2305,1,WEEK_00,0,RETAILER_2,T,SALES_UNITS,20.0,6.0,10.0,36.0,21.0,57.0
1,1,2305,2,WEEK_00,0,RETAILER_2,T,SALES_USD,107.63,12.0,66.22,185.85,81.03,266.88
2,1,2306,1,WEEK_00,1,RETAILER_2,T,SALES_UNITS,11.0,5.0,7.0,23.0,28.0,51.0
3,1,2306,2,WEEK_00,1,RETAILER_2,T,SALES_USD,57.86,10.0,49.4,117.26,143.19,260.45
4,1,2307,1,WEEK_00,2,RETAILER_2,T,SALES_UNITS,17.0,2.0,10.0,29.0,33.0,62.0


Now that we have our data in a pandas DataFrame, we can export, transform, and filter as we (or our analysts) see fit.
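
One cheap check to run first is a consistency test on the pivoted columns: the defined groups should add up to `included_products`, and included plus excluded should equal `all_products`. A minimal sketch using three of the rows displayed above, re-typed into a small DataFrame so it runs without the database; the tolerance `TOL` is an arbitrary choice.

```python
import pandas as pd

# First three rows of sales.head() above, re-typed by hand (subset of columns).
sample = pd.DataFrame({
    'store_id': [2305, 2305, 2306],
    'unit_name': ['SALES_UNITS', 'SALES_USD', 'SALES_UNITS'],
    'group_1': [20.0, 107.63, 11.0],
    'group_2': [6.0, 12.0, 5.0],
    'group_3': [10.0, 66.22, 7.0],
    'included_products': [36.0, 185.85, 23.0],
    'excluded_products': [21.0, 81.03, 28.0],
    'all_products': [57.0, 266.88, 51.0],
})

TOL = 1e-6  # arbitrary tolerance for floating-point sums

# The defined groups should sum to included_products, row by row.
group_sum = sample[['group_1', 'group_2', 'group_3']].sum(axis=1)
groups_ok = (group_sum - sample['included_products']).abs() < TOL

# included_products + excluded_products should equal all_products, row by row.
total = sample['included_products'] + sample['excluded_products']
totals_ok = (total - sample['all_products']).abs() < TOL

print(bool(groups_ok.all()), bool(totals_ok.all()))
```

On the full result, the same two comparisons can be run against `sales` directly; any failing rows would point at overlapping product groups or a product name that matches no row in `dim_product`.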

### All Functions, Together

In [None]:
def fstr_dim_date(control_period, test_period):
    sql = """
    SELECT
        date_id
        ,date_value
    FROM
        dim_date
    WHERE
        date_value BETWEEN '{}' AND '{}'
    OR
        date_value BETWEEN '{}' AND '{}'
    """.format(*control_period, *test_period)
    return sql


def fstr_dim_store(analysis_ids: list):
    """Filter dim_store using an analysis store segmentation."""
    id_lst = ', '.join(str(e) for e in analysis_ids)

    sql = f"""
    SELECT
        ds.store_id
        ,ds.store_name
        ,dseg.segment_name
        ,dr.retailer_id
        ,dr.retailer_name
    FROM
        fact_segmentation fseg
    JOIN
        dim_segment dseg
        ON dseg.segment_id = fseg.segment_id
    JOIN
        dim_store ds
        ON ds.store_id = fseg.store_id
    JOIN
        dim_retailer dr
        ON dr.retailer_id = ds.retailer_id
    WHERE
        fseg.analysis_id IN ({id_lst})
    """
    return sql


def fstr_dim_product():
    sql = """
    SELECT
        product_id
        ,product_name
    FROM
        dim_product
    """
    return sql


def fstr_dim_unit():
    sql = """
    SELECT
        unit_id
        ,unit_name
    FROM
        dim_unit
    """
    return sql


def fstr_fact_sale(analysis_ids: list, control_period: iter, test_period: iter) -> str:
    dim_date = fstr_dim_date(control_period, test_period)
    dim_store = fstr_dim_store(analysis_ids)
    dim_product = fstr_dim_product()
    dim_unit = fstr_dim_unit()

    sql = f"""
    SELECT
        fs.fact_id
        ,fs.created_at
        ,MAX(fs.created_at) OVER (PARTITION BY fs.date_id, fs.store_id, fs.product_id, fs.unit_id) AS maxfact
        ,fs.date_id
        ,fs.store_id
        ,fs.product_id
        ,fs.unit_id
        ,fs.value
        ,dd.date_value
        ,ds.store_name
        ,ds.retailer_name
        ,ds.segment_name
        ,dp.product_name
        ,du.unit_name
    FROM
        fact_sale fs
    JOIN (
    {dim_date}
        ) dd
        ON dd.date_id = fs.date_id
    JOIN (
    {dim_store}
        ) ds
        ON ds.store_id = fs.store_id
    JOIN (
    {dim_product}
        ) dp
        ON dp.product_id = fs.product_id
    JOIN (
    {dim_unit}
        ) du
        ON du.unit_id = fs.unit_id
    """
    return sql


def cw_def_groups(product_groups: dict) -> list:
    def fmt_set(s) -> str:
        return "'" + "', '".join(str(e) for e in s) + "'"

    cw_skel = 'SUM(CASE WHEN product_name IN ({}) THEN fs.value ELSE 0 END) AS "{}"'
    return [cw_skel.format(fmt_set(v), k) for k, v in product_groups.items()]


def cw_undef_groups(product_groups: dict) -> list:
    def fmt_set(s) -> str:
        return "'" + "', '".join(sorted(str(e) for e in s)) + "'"

    cw_skel = 'SUM(CASE WHEN product_name {} IN ({}) THEN fs.value ELSE 0 END) AS "{}"'

    pset = set()
    for pg in product_groups.values():
        pset = pset.union(set(pg))

    cws = [
        cw_skel.format('', fmt_set(pset), 'included_products')
        ,cw_skel.format('NOT', fmt_set(pset), 'excluded_products')
        ,'SUM(value) AS all_products'
    ]
    return cws


def fstr_cw_stms(product_groups: dict) -> str:
    grps = cw_def_groups(product_groups)
    misc = cw_undef_groups(product_groups)

    cws = '\n        ,'.join(grps + misc)
    return cws


def sales_data_query(
    analysis_ids: list
    ,product_groups: dict
    ,control_period: iter
    ,test_period: iter
) -> str:
    fact_sale = fstr_fact_sale(analysis_ids, control_period, test_period)
    cw_stms = fstr_cw_stms(product_groups)
    gbcols = """fs.date_id
        ,fs.store_id
        ,fs.unit_id
        ,fs.date_value
        ,fs.store_name
        ,fs.retailer_name
        ,fs.segment_name"""

    sql = f"""
    SELECT
        {gbcols}
        ,fs.unit_name
        ,{cw_stms}
    FROM (
    {fact_sale}
        ) fs
    WHERE
        fs.maxfact = fs.created_at
    GROUP BY
        {gbcols}
    """
    return sql