# Querying Sales Data Ibis
## Introduction
In this notebook, we will use Ibis expressions to preview a mock dataset and then build a query that will help analysts get all of the information they need for exploratory analysis, standardized measurement studies, and data transfer.

This is a companion notebook to QueryingSalesDatafStringSQL.ipynb and will outline the advantages of using Ibis over formatted SQL Strings.  Most of the steps are exactly the same, but there is some discussion about the difference between the two methods.

## Setup

We have a basic star schema containing (mock) sales data for several retailers.  The tables are as follows:

* `fact_sale`: `fact_id` | `date_id` | `store_id` | `product_id` | `unit_id` | `value` | `created_at`
    * Sales data in aggregate by some time period (in this case, weeks.  See `dim_date`)
* `dim_date`: `date_id` | `date_value`
    * date dimensions (ID to value).  Just contains week values.
    * (join fact by `date_id`)
* `dim_store`: `store_id` | `store_name` | `retailer_id`
    * store dimensions (ID to names)
    * (join fact by `store_id`)
* `dim_retailer`: `retailer_id` | `retailer_name`
    * retailer dimensions (ID to names)
    * (join stores by `retailer_id`)
* `dim_product`: `product_id` | `product_name`
    * product dimensions (ID to names)
    * (join fact by `product_id`)
* `dim_unit`: `unit_id` | `unit_name`
    * unit dimensions (ID to names).  For now, just contains Units and Sales USD
    * (join fact by `unit_id`)

Our analysts will use this data to run some rudimentary analyses for some product set at some retailer.  A typical analysis will have the following parameters:

* dates:
    * A test period that we want to measure.  Typically 4 weeks with a 2 period lag
    * A control period that we will use to establish a baseline wrt time
* products:
    * An arbitrary binning of products
* store segments:
    * Test: treatment applied
    * Control: treatment not applied, to help establish a baseline wrt stores
    * Not Selected: n/a
* units
    * Typically Sales in Currency (USD in this example)

Store segmentations created in our measurement studies are held in another star schema by an analysis ID, which connects to our sales schema using `store_id`:

* `fact_segmentation`: `fact_id` | `analysis_id` | `segment_id` | `store_id`
    * (join `dim_store` by `store_id`)
* `dim_segment`: `segment_id` | `segment_name`
    * (join fact by `segment_id`)

In the event a store segmentation doesn't exist, we can simply create one segment for all interesting stores (some subset of the universe) and then upload it to `fact_segmentation`.

## Things to Keep in Mind

At some point, an analyst or downstream vendor might want to actually look at the data and map values to real world objects, so it will help to have as much of the data human-readable as possible.  Therefore, we want to include an acceptable amount of detail so internal users can limit our `SELECT * FROM TABLE WHERE TABLE_ID = VALUE` calls and external users can limit their prying emails.

Also, it might be tempting to filter only on those products that are called for, but for exploratory reasons we should at least aggregate them into additional buckets.  We will create three additional groups for all products:

1. `included_products`
2. `excluded_products`
3. `all_products`

# Sales Data Query

In this section, we will build a Sales Data Query using the mock sales data referenced above.  To generate a mock dataset and follow along, you can use the MockSalesData notebook provided in this repo.

The columns we need are chosen to help an analyst comb through the data quickly in the event of an anomaly or for exploratory analyses:

1. `analysis_id`: to keep track of what analysis ID we're using
1. `[ date_id | store_id | unit_id ]`: a set of ID references that act as a unique label for our product groups at a given store, during a given week, for a given unit
1. `[ retailer_name | store_name ]`: human-readable store label
1. `week_value`: human-readable date label
1. `unit_name`: human-readable unit label
1. product group names: we will group our products by product group label and then pivot our data so it takes up fewer rows (also so that a human can keep better track of the labels in (2))

We will begin by exploring the tables in our database and then construct our query to accept arbitrary analysis parameters:

* `ANALYSIS_ID`: filtering store segments and to label the analysis
* `PRODUCT GROUPS`: a mapping of product group names (group labels) to product names (`product_name`)
    * These are assumed to be mutually exclusive for the purposes of addition
* dates:
    * `TEST_PERIOD_START`: week of treatment start
    * `TEST_PERIOD_END`: week of treatment end
    * `LAG_PERIOD_END`: week of the lag period end
    * `CONTROL_PERIOD_START`: week of control start
    * `CONTROL_PERIOD_END`: week of control end

## Setup

Analysis vars for filtering and transforming.  Normally these would be kept in a set of tables on a database somewhere, but for now let's just make them variables we can play around with in-notebook.

In [1]:
ANALYSIS_ID = 3

# In a more sophisticated data warehouse, we would store analysis paramters as a configuration
# inside of our database and connect this configuration to our data using the configuration id
PRODUCT_GROUPS = {
    'group_1': ['PRODUCT_0', 'PRODUCT_4']
    ,'group_2': ['PRODUCT_1']
    ,'group_3': ['PRODUCT_3', 'PRODUCT_5', 'PRODUCT_9']
}

TEST_PERIOD_START = 'WEEK_52'
TEST_PERIOD_END = 'WEEK_55'

LAG_PERIOD_END = 'WEEK_57'

CONTROL_PERIOD_START = 'WEEK_00'
CONTROL_PERIOD_END = 'WEEK_51'

Misc vars for pointing to stuff:

In [2]:
SALES_DB = 'sales_data.db'

To get our data using ibis, we will use the sqlite backend connector.

The default limit on `.execute()` is 10,000 rows.  We will instead set this to 20,000,000 (please change this value depending on how hard you are willing to make your kernel work).

In [3]:
import ibis

iconn = ibis.sqlite.connect(SALES_DB)

ibis.options.sql.default_limit = 2e7

### Table Expressions

Table expressions allow us to pull in table metadata.  By having table metadata on hand, Ibis can typecheck our operations and referencecheck our calls.

Let's establish some table expressions before we get started:

In [4]:
table_list = [
    'fact_sale'
    ,'dim_date'
    ,'dim_store'
    ,'dim_retailer'
    ,'dim_product'
    ,'dim_unit'
    ,'fact_segmentation'
    ,'dim_segment'
]

TABLES = {
    key: iconn.table(key)
    for key in table_list
}

### Exploratory Functions

Here we are just counting the rows to each table of interest.  By using `count()` on a table expression we can get the count:

In [5]:
for tbl in TABLES:
    print(tbl, "row count:", TABLES[tbl].count().execute())

fact_sale row count: 26856660
dim_date row count: 58
dim_store row count: 8107
dim_retailer row count: 4
dim_product row count: 15
dim_unit row count: 2
fact_segmentation row count: 8107
dim_segment row count: 3


`fact_sale` is a bit big, so let's filter it down to preview it before moving on.

In [6]:
TABLES['fact_sale'].limit(5).execute()

Unnamed: 0,fact_id,date_id,store_id,product_id,unit_id,value,created_at
0,1,1,1,1,1,6.0,2022-06-03 14:20:22
1,2,1,1,1,2,31.56,2022-06-03 14:20:22
2,3,2,1,1,1,26.0,2022-06-03 14:20:22
3,4,2,1,1,2,136.76,2022-06-03 14:20:22
4,5,3,1,1,1,24.0,2022-06-03 14:20:22


Since we expect there to be only one row per set over the partition of `date_id` | `store_id` | `product_id` | `unit_id` (since, for this dataset, there _should_ only be one value for a given product at a given store for a given unit at a given time--it's total sales data for a week), let's check this condition real quick before moving forward.

In [7]:
gbcols = [
    'date_id'
    ,'store_id'
    ,'product_id'
    ,'unit_id'
]

(
    # fact sale in TABLES
    TABLES['fact_sale']
    # select only the id columns
    .select(gbcols)
    # group by the id columns
    .group_by(gbcols)
    # count occurance of set
    .size()
    # sort by count, descending (largest at top)
    .sort_by(('count', False))
    # pick top 5
    .limit(5)
    # execute
    .execute()
)

Unnamed: 0,date_id,store_id,product_id,unit_id,count
0,1,1,1,1,2
1,1,1,1,2,2
2,1,1,2,1,2
3,1,1,2,2,2
4,1,1,3,1,2


Looks like there are duplicates, so let's grab one and see what a duplicate looks like:

In [8]:
def filter_fact_sale(fact_sale=TABLES['fact_sale'], **kwargs):
    bix = None
    for k, v in kwargs.items():
        bix = bix & (fact_sale[k] == v) if bix is not None else (fact_sale[k] == v)

    return fact_sale.filter(bix).execute()

filter_fact_sale(date_id=1, store_id=1, product_id=1, unit_id=1)

Unnamed: 0,fact_id,date_id,store_id,product_id,unit_id,value,created_at
0,1,1,1,1,1,6.0,2022-06-03 14:20:22
1,13428331,1,1,1,1,3.0,2022-06-03 14:22:57


In this dataset, it looks like `value` is double the other entry.  It is likely that data was doubled for upload and this error was quickly corrected.

Some core principles of this table are:

1. never delete data, including data uploaded in error
2. if something is uploaded in error, fix it somewhere else and then re-upload it
3. append the current timestamp to the data upon upload

So in this table, we use `created_at` to give us the latest value of truth for each `date_id`, `store_id`, `product_id`, `unit_id` set.  So we need to pick those rows where `created_at` is equal to the `MAX(created_at OVER (PARTITION BY date_id, store_id, product_id, unit_id)`.

Running this window function every time we want data can be expensive, so we should filter our data as much as possible before running it.

To help filter down our data, we will join `fact_sale` on to filtered dimension tables for the values we care about.  Let's craft those filter queries now.

### Filtering `dim_date`

We are given 5 date points: test start and end, lag end, and control start and end.  We can smartly lump the test period and lag period together to create two ranges: the intervention+lag period, and the control period.

To filter dates, we'll find all weeks between both of those ranges:

* `BETWEEN TEST_PERIOD_START AND LAG_PERIOD_END`
* `BETWEEN CONTROL_PERIOD_START AND CONTROL_PERIOD_END`

We create two distinct ranges because it's possible that we do a period over period analysis instead of a 52-week lead control and following 6 week test back-to-back.

In [9]:
def filter_dim_date(control_period: iter, test_period: iter, dim_date=TABLES['dim_date']):
    return dim_date.filter(
        dim_date['date_value'].between(*control_period)
        | dim_date['date_value'].between(*test_period)
    )[dim_date]

filter_dim_date([CONTROL_PERIOD_START, CONTROL_PERIOD_END], [TEST_PERIOD_START, LAG_PERIOD_END]).limit(5).execute()

Unnamed: 0,date_id,date_value
0,1,WEEK_00
1,2,WEEK_01
2,3,WEEK_02
3,4,WEEK_03
4,5,WEEK_04


Note that weeks are stored as strings in this database for generality.  This query _could_ actually work with date values.

### Filtering `dim_store`

We have an `analysis_id` (`ANALYSIS_ID`), and we have store segments for that config saved in `fact_segmentation`.  Let's filter `dim_stores` and pull in store segments, retailer names, and store names at the same time:

In [10]:
def filter_dim_store(
    analysis_ids: iter=[1]
    ,dim_store=TABLES['dim_store']
    ,dim_segment=TABLES['dim_segment']
    ,fact_segmentation=TABLES['fact_segmentation']
    ,dim_retailer=TABLES['dim_retailer']
):
    fseg = fact_segmentation.filter(
        fact_segmentation['analysis_id'].isin(analysis_ids)
    )['store_id', 'segment_id', 'analysis_id']

    seg = fseg.inner_join(
        dim_segment
        ,predicates=dim_segment['segment_id'] == fseg['segment_id']
    # relabel to avoid suffixes in the next line
    )['store_id', 'segment_name', 'analysis_id'].relabel(
        {'store_id': 'seg_store_id'}
    )

    store_filt = seg.inner_join(
        dim_store
        ,predicates=dim_store['store_id'] == seg['seg_store_id']
    )['store_id', 'store_name', 'segment_name', 'retailer_id', 'analysis_id'].relabel(
        {'retailer_id': 'store_retailer_id'}
    )

    result = store_filt.inner_join(
        dim_retailer
        ,predicates=dim_retailer['retailer_id'] == store_filt['store_retailer_id']
    )
    return result['store_id', 'store_name', 'segment_name', 'retailer_name', 'analysis_id']

filter_dim_store().limit(5).execute()

Unnamed: 0,store_id,store_name,segment_name,retailer_name,analysis_id
0,1,0,T,RETAILER_0,1
1,2,1,T,RETAILER_0,1
2,3,2,T,RETAILER_0,1
3,4,3,C,RETAILER_0,1
4,5,4,NS,RETAILER_0,1


### Filtering `dim_product` and `dim_unit`

As mentioned before, it may be tempting to filter `dim_product`.  Our example analysis uses only 6 products, so if we cut out the rest that's 11 products cross however many stores cross however many dates cross however many units we care about.  We're instead going to bucket them and include them in our analysis.  Same deal with units.

For larger datasets, it may be a good idea to filter further--for example, `dim_unit` might contain multiple currencies or on hand units, and `dim_product` might contain products for multiple irrelevant companies that we would want to exclude from our analysis.

For Ibis, in this case, we don't need to do anything since we'll just pull in those table expressions.

### Putting it All Together to Filter Fact Sale and Calculate `MAX(created_at)` (maxfact)

Now we'll combine all of our queries to filter `fact_sale` in preparation for our maxfact query:

In [50]:
def get_table_reference(table_name, dict_tables=None, con=None):
    return (
        con.table(table_name)
        if not dict_tables and table_name not in dict_tables
        else dict_tables[table_name]
    )


def filter_fact_sale(analysis_ids: list, control_period:iter, test_period: iter, con=None, **tables):
    fact_sale = get_table_reference("fact_sale", dict_tables=tables, con=con)

    dim_store = filter_dim_store(
        analysis_ids=analysis_ids
        ,dim_store=get_table_reference("dim_store", dict_tables=tables, con=con)
        ,dim_segment=get_table_reference("dim_segment", dict_tables=tables, con=con)
        ,dim_retailer=get_table_reference("dim_retailer", dict_tables=tables, con=con)
        ,fact_segmentation=get_table_reference("fact_segmentation", dict_tables=tables, con=con)
    ).relabel(
        {'store_id': 'ds_store_id'}
    )

    dim_date = filter_dim_date(
        control_period
        ,test_period
        ,dim_date=get_table_reference('dim_date', dict_tables=tables, con=con)
    ).relabel(
        {'date_id': 'dd_date_id'}
    )

    dim_product = get_table_reference("dim_product", dict_tables=tables, con=con).relabel(
        {'product_id': 'dp_product_id'}
    )

    dim_unit = get_table_reference("dim_unit", dict_tables=tables, con=con).relabel(
        {'unit_id': 'du_unit_id'}
    )

    cols = [
        'analysis_id'
        ,'date_id'
        ,'store_id'
        ,'product_id'
        ,'unit_id'
        ,'created_at'
        ,'value'
        ,'date_value'
        ,'store_name'
        ,'retailer_name'
        ,'product_name'
        ,'unit_name'
        ,'segment_name'
    ]

    join = fact_sale.inner_join(
        dim_date
        ,predicates=dim_date['dd_date_id'] == fact_sale['date_id']
    ).inner_join(
        dim_store
        ,predicates=dim_store['ds_store_id'] == fact_sale['store_id']
    ).inner_join(
        dim_product
        ,predicates=dim_product['dp_product_id'] == fact_sale['product_id']
    ).inner_join(
        dim_unit
        ,predicates=dim_unit['du_unit_id'] == fact_sale['unit_id']
    ).select(cols)

    maxfactw = ibis.window(group_by=['date_id', 'store_id', 'product_id', 'unit_id'])
    maxfact = fact_sale['created_at'].max().over(maxfactw).name('maxfact')

    return join[join, maxfact]

In [51]:
filter_fact_sale([1], [CONTROL_PERIOD_START, CONTROL_PERIOD_END], [TEST_PERIOD_START, LAG_PERIOD_END], con=iconn).limit(5).execute()

Unnamed: 0,analysis_id,date_id,store_id,product_id,unit_id,created_at,value,date_value,store_name,retailer_name,product_name,unit_name,segment_name,maxfact
0,1,1,1,1,1,2022-06-03 14:20:22,6.0,WEEK_00,0,RETAILER_0,PRODUCT_0,SALES_UNITS,T,2022-06-03 14:22:57
1,1,1,1,1,1,2022-06-03 14:22:57,3.0,WEEK_00,0,RETAILER_0,PRODUCT_0,SALES_UNITS,T,2022-06-03 14:22:57
2,1,1,1,1,2,2022-06-03 14:20:22,31.56,WEEK_00,0,RETAILER_0,PRODUCT_0,SALES_USD,T,2022-06-03 14:22:57
3,1,1,1,1,2,2022-06-03 14:22:57,15.78,WEEK_00,0,RETAILER_0,PRODUCT_0,SALES_USD,T,2022-06-03 14:22:57
4,1,1,1,2,1,2022-06-03 14:20:22,10.0,WEEK_00,0,RETAILER_0,PRODUCT_1,SALES_UNITS,T,2022-06-03 14:22:57


### "Pivoting" using `SUM(CASE`/`WHEN IN set THEN value ELSE 0 END) AS name` on Arbitrary Sets

We now have one last thing to do: flatten our data by bucketing our products into their respective groups.  We can pivot our data by aggregating `fact_sale.value` if a product name is in the set.

Let's create some functions to help us do that:

First, let's deal with our defined groups:

In [52]:
# return a list here so we can add up all of our case/when statement lists
# and then format a final case/when statement
def cw_def_groups(product_groups: dict, maxfact_expr) -> list:
    cws = [
        maxfact_expr['product_name'].isin(v).ifelse(maxfact_expr['value'], 0).sum().name(k)
        for k, v in product_groups.items()
    ]
    return cws

Next, let's deal included, excluded, and all:

In [53]:
def cw_undef_groups(product_groups: dict, maxfact_expr) -> list:
    pset = set()
    for pg in product_groups.values():
        pset = pset.union(set(pg))

    cws = [
        maxfact_expr['product_name'].isin(pset).ifelse(maxfact_expr['value'], 0).sum().name('included_products')
        ,maxfact_expr['product_name'].notin(pset).ifelse(maxfact_expr['value'], 0).sum().name('excluded_products')
        ,maxfact_expr['value'].sum().name('all_products')
    ]
    return cws

Finally, we will compile the aggregate expressions:

In [54]:
def cw_stms(product_groups: dict, maxfact_expr):
    grps = cw_def_groups(product_groups, maxfact_expr)
    misc = cw_undef_groups(product_groups, maxfact_expr)
    return grps + misc

### The Final Query

By putting this all together, we get our Sales Data Query:

In [58]:
def sales_data_query(
    analysis_ids: list
    ,product_groups: dict
    ,control_period: iter
    ,test_period: iter
    ,con=iconn
    ,dict_tables: dict=None
) -> str:
    maxfact = filter_fact_sale(
        analysis_ids
        ,control_period
        ,test_period
        ,con=con
        ,**dict_tables
    )

    gbcols = [
        'analysis_id'
        ,'date_id'
        ,'store_id'
        ,'unit_id'
        ,'date_value'
        ,'store_name'
        ,'retailer_name'
        ,'segment_name'
        ,'unit_name'
    ]

    cws = cw_stms(product_groups, maxfact)

    result = (
        maxfact
        # Filter to latest rows
        .filter(maxfact['created_at'] == maxfact['created_at'])
        # select necessary columns
        .select(gbcols + ['product_name', 'value'])
        # Group by store, date, unit
        .groupby(gbcols)
        # aggregate over our groups
        .aggregate(cws)
    )
    return result

sdq = sales_data_query(
    [ANALYSIS_ID]
    ,PRODUCT_GROUPS
    ,[CONTROL_PERIOD_START, CONTROL_PERIOD_END]
    ,[TEST_PERIOD_START, LAG_PERIOD_END]
    ,iconn
    ,TABLES
)

And running it:

In [59]:
sales = sdq.execute()

sales.shape

(198946, 15)

In [60]:
sales.head()

Unnamed: 0,analysis_id,date_id,store_id,unit_id,date_value,store_name,retailer_name,segment_name,unit_name,group_1,group_2,group_3,included_products,excluded_products,all_products
0,3,1,2305,1,WEEK_00,0,RETAILER_2,T,SALES_UNITS,60.0,18.0,30.0,108.0,63.0,171.0
1,3,1,2305,2,WEEK_00,0,RETAILER_2,T,SALES_USD,322.89,36.0,198.66,557.55,243.09,800.64
2,3,1,2306,1,WEEK_00,1,RETAILER_2,T,SALES_UNITS,33.0,15.0,21.0,69.0,84.0,153.0
3,3,1,2306,2,WEEK_00,1,RETAILER_2,T,SALES_USD,173.58,30.0,148.2,351.78,429.57,781.35
4,3,1,2307,1,WEEK_00,2,RETAILER_2,T,SALES_UNITS,51.0,6.0,30.0,87.0,99.0,186.0


Now that we have our data in a pandas DataFrame, we can export, transform, and filter as we (or our analysts) see fit.

### All Functions, Together

In [None]:
def filter_dim_date(control_period: iter, test_period: iter, dim_date=TABLES['dim_date']):
    return dim_date.filter(
        dim_date['date_value'].between(*control_period)
        | dim_date['date_value'].between(*test_period)
    )[dim_date]


def filter_dim_store(
    analysis_ids: iter=[1]
    ,dim_store=TABLES['dim_store']
    ,dim_segment=TABLES['dim_segment']
    ,fact_segmentation=TABLES['fact_segmentation']
    ,dim_retailer=TABLES['dim_retailer']
):
    fseg = fact_segmentation.filter(
        fact_segmentation['analysis_id'].isin(analysis_ids)
    )['store_id', 'segment_id', 'analysis_id']

    seg = fseg.inner_join(
        dim_segment
        ,predicates=dim_segment['segment_id'] == fseg['segment_id']
    # relabel to avoid suffixes in the next line
    )['store_id', 'segment_name', 'analysis_id'].relabel(
        {'store_id': 'seg_store_id'}
    )

    store_filt = seg.inner_join(
        dim_store
        ,predicates=dim_store['store_id'] == seg['seg_store_id']
    )['store_id', 'store_name', 'segment_name', 'retailer_id', 'analysis_id'].relabel(
        {'retailer_id': 'store_retailer_id'}
    )

    result = store_filt.inner_join(
        dim_retailer
        ,predicates=dim_retailer['retailer_id'] == store_filt['store_retailer_id']
    )
    return result['store_id', 'store_name', 'segment_name', 'retailer_name', 'analysis_id']


def get_table_reference(table_name, dict_tables=None, con=None):
    return (
        con.table(table_name)
        if not dict_tables and table_name not in dict_tables
        else dict_tables[table_name]
    )


def filter_fact_sale(analysis_ids: list, control_period:iter, test_period: iter, con=None, **tables):
    fact_sale = get_table_reference("fact_sale", dict_tables=tables, con=con)

    dim_store = filter_dim_store(
        analysis_ids=analysis_ids
        ,dim_store=get_table_reference("dim_store", dict_tables=tables, con=con)
        ,dim_segment=get_table_reference("dim_segment", dict_tables=tables, con=con)
        ,dim_retailer=get_table_reference("dim_retailer", dict_tables=tables, con=con)
        ,fact_segmentation=get_table_reference("fact_segmentation", dict_tables=tables, con=con)
    ).relabel(
        {'store_id': 'ds_store_id'}
    )

    dim_date = filter_dim_date(
        control_period
        ,test_period
        ,dim_date=get_table_reference('dim_date', dict_tables=tables, con=con)
    ).relabel(
        {'date_id': 'dd_date_id'}
    )

    dim_product = get_table_reference("dim_product", dict_tables=tables, con=con).relabel(
        {'product_id': 'dp_product_id'}
    )

    dim_unit = get_table_reference("dim_unit", dict_tables=tables, con=con).relabel(
        {'unit_id': 'du_unit_id'}
    )

    cols = [
        'analysis_id'
        ,'date_id'
        ,'store_id'
        ,'product_id'
        ,'unit_id'
        ,'created_at'
        ,'value'
        ,'date_value'
        ,'store_name'
        ,'retailer_name'
        ,'product_name'
        ,'unit_name'
        ,'segment_name'
    ]

    join = fact_sale.inner_join(
        dim_date
        ,predicates=dim_date['dd_date_id'] == fact_sale['date_id']
    ).inner_join(
        dim_store
        ,predicates=dim_store['ds_store_id'] == fact_sale['store_id']
    ).inner_join(
        dim_product
        ,predicates=dim_product['dp_product_id'] == fact_sale['product_id']
    ).inner_join(
        dim_unit
        ,predicates=dim_unit['du_unit_id'] == fact_sale['unit_id']
    ).select(cols)

    maxfactw = ibis.window(group_by=['date_id', 'store_id', 'product_id', 'unit_id'])
    maxfact = fact_sale['created_at'].max().over(maxfactw).name('maxfact')

    return join[join, maxfact]


# return a list here so we can add up all of our case/when statement lists
# and then format a final case/when statement
def cw_def_groups(product_groups: dict, maxfact_expr) -> list:
    cws = [
        maxfact_expr['product_name'].isin(v).ifelse(maxfact_expr['value'], 0).sum().name(k)
        for k, v in product_groups.items()
    ]
    return cws


def cw_undef_groups(product_groups: dict, maxfact_expr) -> list:
    pset = set()
    for pg in product_groups.values():
        pset = pset.union(set(pg))

    cws = [
        maxfact_expr['product_name'].isin(pset).ifelse(maxfact_expr['value'], 0).sum().name('included_products')
        ,maxfact_expr['product_name'].notin(pset).ifelse(maxfact_expr['value'], 0).sum().name('excluded_products')
        ,maxfact_expr['value'].sum().name('all_products')
    ]
    return cws


def cw_stms(product_groups: dict, maxfact_expr):
    grps = cw_def_groups(product_groups, maxfact_expr)
    misc = cw_undef_groups(product_groups, maxfact_expr)
    return grps + misc


def sales_data_query(
    analysis_ids: list
    ,product_groups: dict
    ,control_period: iter
    ,test_period: iter
    ,con=iconn
    ,dict_tables: dict=None
) -> str:
    maxfact = filter_fact_sale(
        analysis_ids
        ,control_period
        ,test_period
        ,con=con
        ,**dict_tables
    )

    gbcols = [
        'analysis_id'
        ,'date_id'
        ,'store_id'
        ,'unit_id'
        ,'date_value'
        ,'store_name'
        ,'retailer_name'
        ,'segment_name'
        ,'unit_name'
    ]

    cws = cw_stms(product_groups, maxfact)

    result = (
        maxfact
        # Filter to latest rows
        .filter(maxfact['created_at'] == maxfact['created_at'])
        # select necessary columns
        .select(gbcols + ['product_name', 'value'])
        # Group by store, date, unit
        .groupby(gbcols)
        # aggregate over our groups
        .aggregate(cws)
    )
    return result
