# Core Question Process
What are the overall trends in sales?
 --> This starts very broad, so we need to clarify it iteratively.

### Part 1
READ Framework for framing the project before querying anything

R - Representative Data: 
    do we have the right data to answer this? 
We don't have order dates or revenue, so we can't see change over calendar periods or measure true sales volume beyond quantities. We do have day of the week, reorder information, and time of day, so we can look at trends along those dimensions using order and product counts.

E - Exec questions: 
    transformation from vague to business-relevant
Metric clarifications: what are we measuring? 
    Before: What are the overall trends in sales?
    After: How did total order volume and reorder rate change?
Dimension: how do we slice it? 
    Department, time of day, day of the week.
    We can remove time of day for now. 
Deliverable: for whom, in what format? 
    Marketing and operations management would likely find insights useful. 
Clarified question: How do total order volume and reorder rate change across departments and between weekdays and weekends?
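The clarified question can be sketched as a single aggregate query. This is a rough illustration, not the final analysis: only the `orders` table is confirmed so far, so `order_products`, `products`, and `departments` (and their columns) are assumed names modeled on the public Instacart schema, and treating `order_dow` values 0 and 1 as the weekend is also an assumption. To stay runnable without `instacart.duckdb`, the sketch uses Python's built-in `sqlite3` with a few made-up rows; the SQL is plain enough that it should carry over to `con.execute(...)` once the names are checked.

```python
import sqlite3

# ASSUMPTIONS: order_products, products and departments are hypothetical
# table names; order_dow 0 and 1 are treated as the weekend.
VOLUME_REORDER_QUERY = """
SELECT
    d.department,
    CASE WHEN CAST(o.order_dow AS INTEGER) IN (0, 1)
         THEN 'weekend' ELSE 'weekday' END AS day_type,
    COUNT(DISTINCT o.order_id) AS order_count,
    AVG(CAST(op.reordered AS INTEGER)) AS reorder_rate
FROM orders o
JOIN order_products op ON op.order_id = o.order_id
JOIN products p ON p.product_id = op.product_id
JOIN departments d ON d.department_id = p.department_id
GROUP BY d.department, day_type
ORDER BY d.department, day_type
"""

# Toy in-memory data (made up) so the query can be run end to end.
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE orders (order_id INTEGER, order_dow INTEGER);
CREATE TABLE order_products (order_id INTEGER, product_id INTEGER, reordered INTEGER);
CREATE TABLE products (product_id INTEGER, department_id INTEGER);
CREATE TABLE departments (department_id INTEGER, department TEXT);
""")
con_demo.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 0), (2, 3), (3, 1)])
con_demo.executemany(
    "INSERT INTO order_products VALUES (?, ?, ?)",
    [(1, 10, 1), (1, 11, 0), (2, 10, 0), (2, 11, 1), (3, 10, 1)],
)
con_demo.executemany("INSERT INTO products VALUES (?, ?)", [(10, 1), (11, 2)])
con_demo.executemany(
    "INSERT INTO departments VALUES (?, ?)", [(1, "produce"), (2, "dairy")]
)

# One row per (department, day_type) with order volume and reorder rate.
rows = con_demo.execute(VOLUME_REORDER_QUERY).fetchall()
for row in rows:
    print(row)
```

Counting distinct `order_id` values keeps an order with several products in the same department from being counted more than once, while the reorder rate is averaged over product lines.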

A - Analytical Framework: 
    Time series - time of day and day of week
    Segmentation by department
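
As a rough sketch of the time-based half of this framework, order counts can be grouped by day of week and hour of day. The rows below are made up and loaded into Python's built-in `sqlite3` purely so the example runs on its own; `con_demo` and `DOW_HOUR_QUERY` are illustrative names, and the same SQL should work against the real `orders` table.

```python
import sqlite3

# Toy stand-in for the orders table; the rows are made up.
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE orders (order_id INTEGER, order_dow INTEGER, order_hour_of_day INTEGER)"
)
con_demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 0, 9), (2, 0, 9), (3, 0, 14), (4, 3, 9), (5, 6, 20)],
)

# Order volume per (day of week, hour of day) cell.
DOW_HOUR_QUERY = """
SELECT
    CAST(order_dow AS INTEGER) AS dow,
    CAST(order_hour_of_day AS INTEGER) AS hour_of_day,
    COUNT(*) AS order_count
FROM orders
GROUP BY dow, hour_of_day
ORDER BY dow, hour_of_day
"""

dow_hour_counts = con_demo.execute(DOW_HOUR_QUERY).fetchall()
for dow, hour_of_day, order_count in dow_hour_counts:
    print(dow, hour_of_day, order_count)
```

Dropping `hour_of_day` from the SELECT and GROUP BY gives the day-of-week-only view, which matches the decision to set time of day aside for now.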

D - Data best practices: 
    Check for nulls, odd ranges


import duckdb
import pandas as pd
import matplotlib.pyplot as plt
con = duckdb.connect("instacart.duckdb")

df_user_orders = con.execute("""
    SELECT 
    SUM(order_id IS NULL) AS null_id,
    SUM(user_id IS NULL) AS null_user,
    SUM(eval_set IS NULL) AS null_eval,
    SUM(order_number IS NULL) AS null_num,
    SUM(order_dow IS NULL) AS null_dow,
    SUM(order_hour_of_day IS NULL) AS null_hour,
    -- SUM(days_since_prior_order IS NULL) AS null_days_since,  -- skipped: NULL here marks a customer's first order, not missing data
    MIN(CAST(order_hour_of_day AS INT)) AS min_hour,
    MAX(CAST(order_hour_of_day AS INT)) AS max_hour
    FROM orders;
""").fetchdf()
print(df_user_orders)



   null_id  null_user  null_eval  null_num  null_dow  null_hour  min_hour  \
0      0.0        0.0        0.0       0.0       0.0        0.0         0   

   max_hour  
0        23  


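None of the checked columns contain nulls, and order_hour_of_day spans the expected 0 to 23. The data is already clean and consists mostly of IDs and integers.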
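
Two follow-up checks may be worth running: whether the nulls in `days_since_prior_order` really line up with first orders, and whether `order_dow` stays within a sensible range. The sketch below assumes that a first order is one with `order_number = 1`, and it runs against a made-up table in Python's built-in `sqlite3` so it works without the database file. `SANITY_QUERY` and `con_demo` are illustrative names.

```python
import sqlite3

# Toy stand-in for the orders table; the rows are made up.
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE orders (order_id INTEGER, user_id INTEGER, order_number INTEGER, "
    "order_dow INTEGER, days_since_prior_order REAL)"
)
con_demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, 0, None), (2, 1, 2, 3, 7.0), (3, 2, 1, 6, None)],
)

# ASSUMPTION: order_number = 1 identifies a customer's first order.
SANITY_QUERY = """
SELECT
    SUM(CASE WHEN days_since_prior_order IS NULL THEN 1 ELSE 0 END) AS null_days_since,
    SUM(CASE WHEN CAST(order_number AS INTEGER) = 1 THEN 1 ELSE 0 END) AS first_orders,
    MIN(CAST(order_dow AS INTEGER)) AS min_dow,
    MAX(CAST(order_dow AS INTEGER)) AS max_dow
FROM orders
"""

null_days_since, first_orders, min_dow, max_dow = con_demo.execute(SANITY_QUERY).fetchone()

# If the two counts differ, some nulls are genuinely missing values.
print("nulls match first orders:", null_days_since == first_orders)
print("order_dow range:", min_dow, "to", max_dow)
```

If the two counts agree on the real table, the nulls can be treated as a structural feature of the data rather than a quality problem.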

### Part 2: Mapping goals to data features

Stakeholder Goals  | what KPIs and dimensions matter the most? 
Columns and Coverage  | what data do we have available and how can I use it? 
Aggregates and Anomalies  | the high level metrics, outliers, and unexpected patterns 
Notable Segments  | slice by category, time, or other key dimensions to surface early insights 