# Cohorts analysis

## Problem

We are launching a recommendation tool that identifies vendor cohorts and suggests performance improvements based on peer comparisons within those cohorts. Our current cohort definition uses a six-level nested hierarchy (Country → City → Area → Price → Cuisine → Grade) that was designed primarily for explainability to account managers rather than analytical rigor.

This ad hoc approach creates two risks for our MVP rollout:

Weak statistical foundation: We haven't validated whether our cohorts actually group similar-performing vendors together Stakeholder confidence: Without a principled justification for cohort boundaries, leadership are questioning recommendation validity

Our SVP has specified that cohorts must be "sensible and comparable" - meaning they should be both statistically meaningful and intuitive to business stakeholders. We need a methodology that validates our current hierarchy against this standard and provides a framework for refinement.

## Solution
To identify which vendor characteristics create the most "comparable" cohorts, we'll use a three-step approach: 

1. **dimension ranking through regression analysis** to find the dimensions that best predict vendor performance 
2. **hierarchy optimization**, to find the hierarchy that best groups vendors into similar performing vendors. 
The measure will help identify better cohort separation - vendors within cohorts are similar while cohorts differ meaningfully 
3. **cluster validation and refinement** to ensure that the final hierarchy is sensible and comparable.

## Data 

Collected in `create_data.ipynb`

In [1]:
from datetime import date
from highlight_text import fig_text

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import bigframes.pandas as bpd
import statsmodels.api as sm

import warnings
warnings.filterwarnings(action='once')

%load_ext google.cloud.bigquery
bpd.options.bigquery.project = "dhh-ncr-stg"



In [2]:
%%bigquery df 
SELECT * FROM `dhh-ncr-stg.patrick_doupe.cohort_vendor_base`
TABLESAMPLE SYSTEM (1 PERCENT)
WHERE successful_orders IS NOT NULL

  query_result = wait_for_query(self, progress_bar_type, max_results=max_results)
  record_batch = self.to_arrow(


In [3]:
df.head()

Unnamed: 0,entity_id,vendor_id,chain_id,chain_name,entity,city,area,fixed_vendor_grade,fixed_is_new_vendor,cuisine,budget,key_account_sub_category,cuisine_ids,successful_orders,total_orders,new_customer_orders,retained_customers,successful_customers
0,FP_SG,j53g,,,fp_sg,Singapore,UNK,D,False,Chinese,"[1, NA]",Local Hero,"[73, 38, 39, 71]",33,33,18,6,21
1,FP_SG,iz2f,,,fp_sg,Singapore,UNK,D,False,UNK,"[2, NA, 1]",Local Hero,[53],102,113,37,39,71
2,FP_SG,x8qj,,,fp_sg,Singapore,UNK,D,False,UNK,"[1, 3, 2]",UNK,"[59, 58, 115, 67, 35, 44, 49, 64, 39]",98,106,19,38,52
3,FP_SG,mqoa,,,fp_sg,Singapore,UNK,D,False,Asian,"[2, NA]",UNK,"[53, 39]",23,25,16,4,16
4,FP_SG,pxr3,,,fp_sg,Singapore,UNK,D,False,Malaysian,"[1, NA]",UNK,"[39, 104, 87]",33,39,25,5,25


In [4]:
def get_adjusted_r2(df: pd.DataFrame, 
                    col_name: str, 
                    y_col: str = 'successful_orders',
                    continuous: bool = False) -> float:
    """
    wrapper function to extract adjusted r squared 
    args:
        df - dataframe with data
        col_name - string of potential cohort dimension column 
        y_col - string of the performance variable 
    returns:
        adjusted_r_squared - float
    """
    df_tmp = df[[y_col, col_name]].copy()

    if continuous:
        # Create quartile categories
        df_tmp.loc[:, 'QQ'] = pd.qcut(df_tmp[col_name], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

        # Create dummy variables
        df_X = pd.get_dummies(df_tmp['QQ'], dtype=int, drop_first=True)
    else:
        df_X = pd.get_dummies(df_tmp[col_name], dtype=int, drop_first=True)
    
    X = df_X.values
    y = df_tmp[y_col].values

    X_np = np.asarray(X, dtype=np.float64) 
    y_np = np.asarray(y, dtype=np.float64) 
    assert len(X) == len(y), "X and y have different number of observations"

    # OLS estimation
    beta = np.linalg.solve(X_np.T @ X_np, X_np.T @ y_np)
    
    N, K = X_np.shape

    # Calculate R2
    y_pred = X @ beta
    ss_residual = np.sum((y - y_pred) ** 2)
    ss_total = np.sum((y - np.mean(y)) ** 2)
    r_squared = 1 - ss_residual / ss_total
    # adjust it for the number of predictors
    adjusted_r_squared = 1 - ((1 - r_squared) * (N - 1) / (N - K - 1))
    
    return round(adjusted_r_squared.item(), 4)


In [5]:
cols_to_check = [
    'city',
    'area',
    'fixed_vendor_grade',
    #'fixed_is_new_vendor',
    'cuisine',
    #'budget',
    'key_account_sub_category'
]
# cuisine_ids << to fix up
continuous_cols = [
    'new_customer_orders',
    'retained_customers',
    'successful_customers'
]

In [6]:
results = {}
for col in cols_to_check:
    try:
        tmp = get_adjusted_r2(df, col)
    except LinAlgError:
        tmp = None
    results[col] = tmp 

for col in continuous_cols:
    try:
        tmp = get_adjusted_r2(df, col, continuous=True)
    except LinAlgError:
        tmp = None
    results[col] = tmp 


In [None]:
pd.DataFrame(list(results.items()), columns=['Cohort Dimension', 'Adjusted R2']).sort_values(by='Adjusted R2', ascending=False)

Unnamed: 0,Cohort Dimension,Adjusted R2
2,fixed_vendor_grade,0.3596
7,successful_customers,0.3138
6,retained_customers,0.3099
5,new_customer_orders,0.298
3,cuisine,0.0732
1,area,0.0709
0,city,0.0284
4,key_account_sub_category,-0.0384
