## Booze R Us Model

Fitting a model (or two) based on our proposal.

- **Goal:** Build a model to predict sales in a month for any given store.
- **Response Variable:** Monthly Sales
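- **Possible Features:** store, month, county, county population measures (population, share over 21, median age), store proximity measures (stores within 5 miles, distance to nearest store), alcohol category shares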

In [42]:
import duckdb as db 
con = db.connect()
import pandas as pd 
pd.set_option('display.max_columns', None)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [43]:
# MAIN TABLE
con.execute("""
        DROP TABLE IF EXISTS sales;
        CREATE TABLE sales AS 
        SELECT EXTRACT(MONTH FROM date) AS month, EXTRACT(YEAR FROM date) AS year,
            store, city, county, 
            category_name AS category, sale_bottles AS bottles, sale_dollars AS dollars
        FROM read_parquet('../data/iowa_liquor_2023_2025.parquet');
""")
sales = con.execute("SELECT * FROM sales").df()

# POPULATION
con.execute(
"""
        DROP TABLE IF EXISTS population;
        CREATE TABLE population AS
        SELECT name AS county, year_1 AS year, popestimate AS population, over21, propOver21, median_age_tot AS median_age
        FROM read_csv_auto('../data/pop.csv');
"""
)
pop = con.execute("SELECT * FROM population").df()

# PROXIMITY
con.execute(
"""
        DROP TABLE IF EXISTS proximity;
        CREATE TABLE proximity AS
        SELECT *
        FROM read_csv_auto('../data/proximity.csv');
"""
)
prox = con.execute("SELECT * FROM proximity").df()

In [44]:
sales.columns

Index(['month', 'year', 'store', 'city', 'county', 'category', 'bottles',
       'dollars'],
      dtype='object')

## Creating the Dataset

First, I am going to engineer the category column a little bit to use as features. Knowing which alcohol sells the best could be useful for telling Booze R Us what they should buy in order to increase profits.

In [45]:
con.execute("""
    CREATE OR REPLACE TABLE sales AS
    SELECT *,
        CASE
            WHEN category ILIKE '%VODKA%' THEN 'Vodka'
            WHEN category ILIKE '%WHISK%' THEN 'Whiskey'
            WHEN category ILIKE '%TEQUILA%' OR category ILIKE '%MEZCAL%' THEN 'Tequila'
            WHEN category ILIKE '%RUM%' THEN 'Rum'
            ELSE 'Other'
        END AS super_category
    FROM sales
""")
sales = con.execute("SELECT * FROM sales").df()

In [46]:
sales.head()

| | month | year | store | city | county | category | bottles | dollars | super_category |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023 | 4829 | DES MOINES | POLK | 100% AGAVE TEQUILA | 12 | 261.0 | Tequila |
| 1 | 1 | 2023 | 4829 | DES MOINES | POLK | AMERICAN VODKAS | 60 | 418.8 | Vodka |
| 2 | 1 | 2023 | 4829 | DES MOINES | POLK | IMPORTED FLAVORED VODKA | 24 | 358.56 | Vodka |
| 3 | 1 | 2023 | 4829 | DES MOINES | POLK | CREAM LIQUEURS | 12 | 306.0 | Other |
| 4 | 1 | 2023 | 4829 | DES MOINES | POLK | SPICED RUM | 60 | 1124.4 | Rum |


Now I need to aggregate to create our observational units: monthly sales per store.

- Dollars (our response variable) will be summed. 
- Category will be made into new columns representing the distribution of category sales
    - e.g. 70% tequila, 20% vodkas, etc.
    - we will not use total bottles because bottle counts would be almost perfectly collinear with revenue, the response
    - answer questions like: 'what liquor should we sell more/less of?'
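
To make the category-share idea concrete, here is a minimal pandas sketch of the same aggregation on a toy table. The rows and dollar amounts are invented for illustration, and `toy`, `by_cat`, `revenue`, and `shares` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy line items for two store-months (values invented for illustration).
toy = pd.DataFrame({
    "store": [1, 1, 1, 2, 2],
    "month": [1, 1, 1, 1, 1],
    "super_category": ["Vodka", "Whiskey", "Vodka", "Rum", "Other"],
    "dollars": [100.0, 300.0, 100.0, 50.0, 150.0],
})

# Sum dollars per store-month and category, one column per category.
by_cat = toy.pivot_table(
    index=["store", "month"],
    columns="super_category",
    values="dollars",
    aggfunc="sum",
    fill_value=0.0,
)

# Revenue is the row total; each share is a category's dollars divided by it.
revenue = by_cat.sum(axis=1)
shares = by_cat.div(revenue, axis=0)

print(shares)
```

Each row of `shares` sums to 1, so the share columns are linearly dependent. A linear model with an intercept would usually need one share (for example `Other`) left out as the baseline.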

In [47]:
monthly_sales = con.execute(
""" 
    WITH month_totals AS (
        SELECT year, month, store, city, county,
            SUM(dollars) AS revenue
        FROM sales
        GROUP BY year, month, store, city, county
    ), category_totals AS (
        SELECT year, month, store, city, county,
            super_category,
            SUM(dollars) AS category_sales
        FROM sales
        GROUP BY year, month, store, city, county, super_category
    )
    SELECT mt.year, mt.month, mt.store, mt.city, mt.county,
        ROUND((SUM(CASE WHEN ct.super_category = 'Vodka' THEN ct.category_sales ELSE 0 END) / mt.revenue),2) AS vodka_ptc,
        ROUND((SUM(CASE WHEN ct.super_category = 'Whiskey' THEN ct.category_sales ELSE 0 END) / mt.revenue),2) AS whiskey_ptc,
        ROUND((SUM(CASE WHEN ct.super_category = 'Tequila' THEN ct.category_sales ELSE 0 END) / mt.revenue),2) AS tequila_ptc,
        ROUND((SUM(CASE WHEN ct.super_category = 'Rum' THEN ct.category_sales ELSE 0 END) / mt.revenue),2) AS rum_ptc,
        ROUND((SUM(CASE WHEN ct.super_category = 'Other' THEN ct.category_sales ELSE 0 END) / mt.revenue),2) AS other_ptc,
        mt.revenue
    FROM month_totals mt
    LEFT JOIN category_totals ct
        ON mt.year = ct.year AND mt.month = ct.month 
            AND mt.city = ct.city AND mt.county = ct.county
            AND mt.store = ct.store 
    GROUP BY mt.year, mt.month, mt.store, mt.city, mt.county, mt.revenue
    
"""
).fetchdf()

In [48]:
monthly_sales.head(2)

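| | year | month | store | city | county | vodka_ptc | whiskey_ptc | tequila_ptc | rum_ptc | other_ptc | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | 3 | 4873 | GRANGER | DALLAS | 0.2 | 0.58 | 0.05 | 0.12 | 0.05 | 6044.36 |
| 1 | 2023 | 3 | 4678 | ADEL | DALLAS | 0.22 | 0.44 | 0.02 | 0.18 | 0.14 | 44737.3 |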


Now I will join with our other datasets, proximity and population. I use an inner join, which drops store-months without a match in either table but still leaves plenty of complete data for modelling.
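
One way to see how many rows an inner join would discard is a left merge with an indicator column. This is only a sketch on toy data: the stores, counties, and population figures are invented, and `toy_sales`, `toy_pop`, and `county_key` are hypothetical names. The same check could be run against the real frames before joining.

```python
import pandas as pd

# Toy stand-ins for monthly_sales and the population table (values invented).
toy_sales = pd.DataFrame({
    "store": [1, 2, 3],
    "county": ["POLK", "DALLAS", "STORY"],
    "year": [2023, 2023, 2023],
    "revenue": [500.0, 200.0, 900.0],
})
toy_pop = pd.DataFrame({
    "county": ["Polk", "Dallas"],
    "year": [2023, 2023],
    "population": [1000, 400],
})

# Mirror the LOWER(county) join key used in the SQL.
toy_sales["county_key"] = toy_sales["county"].str.lower()
toy_pop["county_key"] = toy_pop["county"].str.lower()

# A left merge with indicator=True flags rows an inner join would drop.
# validate="many_to_one" raises if a county-year appears twice in toy_pop,
# which would otherwise duplicate sales rows.
checked = toy_sales.merge(
    toy_pop[["county_key", "year", "population"]],
    on=["county_key", "year"],
    how="left",
    indicator=True,
    validate="many_to_one",
)
dropped = checked[checked["_merge"] == "left_only"]
print(f"{len(dropped)} of {len(toy_sales)} rows have no population match")
```

In this toy example the STORY store has no population row, so it is the one an inner join would silently remove.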

In [59]:
df = con.execute(
    """
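        -- monthly_sales, pop, and prox are the pandas DataFrames created above;
        -- DuckDB resolves them by variable name (replacement scans).
        SELECT sales.*, 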
            pop.population, pop.over21, pop.propOver21, pop.median_age,
            prox."# of stores within 5 mile radius" AS stores_within_5_miles,
            prox."Nearest other store (mi)" AS nearest_store_miles
        FROM monthly_sales sales
        JOIN pop
            ON LOWER(sales.county) = LOWER(pop.county) AND sales.year = pop.year
        JOIN prox
            ON sales.store = prox.store
        
    """
    ).fetchdf()

In [62]:
df.to_csv('../data/brs_model_data.csv', index=False)