There are numerous ways to determine whether a member of a population is representative of that population. This is an approach using a [Chi-Squared Distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution). The samples could be things like items sold compared to one another, or whether an elected Congress is representative of the demographics of its constituents in terms of, say, age or gender (spoilers: they aren't).

The example presented here compares the number of items a company sold with the number of opportunities it had to sell them. These opportunities could include the number of days an item has been on sale or the number of times that item has been presented to a customer. They could be further scaled based on active versus passive sales (a salesperson versus window shopping) or position on a webpage (top of page 1 versus the middle of page 12).
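As a rough sketch of what such scaling might look like, the snippet below weights raw presentation counts by channel before summing them into a single opportunities figure per item. The impression log, the channel names, and the weights are all hypothetical and chosen only for illustration; sensible weights would have to come from the data at hand.

```python
import pandas as pd

# Hypothetical impression log; items, channels, and counts are made up
impressions = pd.DataFrame({
    'Item': ['A', 'A', 'B', 'B'],
    'Channel': ['salesperson', 'window', 'salesperson', 'window'],
    'Count': [20, 100, 5, 150],
})

# Assumed weights: an active pitch counts as a full opportunity,
# a passive window display as a quarter of one
weights = {'salesperson': 1.0, 'window': 0.25}

impressions['Weighted'] = impressions['Count'] * impressions['Channel'].map(weights)
weighted_opportunities = impressions.groupby('Item')['Weighted'].sum()
print(weighted_opportunities)
```

The resulting series could then stand in for the opportunities column used in the rest of this example.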

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import chisquare

In this example dataset, we have five items: A, B, C, D, and E. We want to compare them to one another to see whether each is overperforming or underperforming relative to what we would expect under the null hypothesis that every item sells at the same rate per opportunity.

In [2]:
myData = pd.DataFrame({'Item': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E'},
 'Observed': {0: 10, 1: 8, 2: 16, 3: 3, 4: 3},
 'Opportunities': {0: 177, 1: 185, 2: 201, 3: 164, 4: 58}})

myData

Unnamed: 0,Item,Observed,Opportunities
0,A,10,177
1,B,8,185
2,C,16,201
3,D,3,164
4,E,3,58


To analyze each item, we will compare it to everything that isn't that item. For example, for item A, we will compare A to the sum total of everything that isn't A.

1.   Calculate what fraction of the overall sales one might expect each item to have by dividing the number of opportunities each item has had by the total number of opportunities all items have had
2.   Calculate the expected number of sales each item should have by multiplying that fraction by the total number of sales
3.   Calculate the number of sales everything else had by subtracting the sales of each item from the total number of sales
4.   Calculate the expected number of sales everything else should have had by subtracting the expected sales of each item from the total number of sales
5.   Calculate the Chi-Squared test statistic and p-value based on these observed and expected values
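The steps above can be walked through by hand for a single item. The sketch below does this for item A using the counts from the example table; the variable names are illustrative and not part of the notebook's own function.

```python
from scipy.stats import chisquare

# Counts taken from the example dataset
observed = {'A': 10, 'B': 8, 'C': 16, 'D': 3, 'E': 3}
opportunities = {'A': 177, 'B': 185, 'C': 201, 'D': 164, 'E': 58}

total_sales = sum(observed.values())
total_opportunities = sum(opportunities.values())

# Step 1: expected share of sales for item A
expected_fraction_a = opportunities['A'] / total_opportunities
# Step 2: expected number of sales for item A
expected_a = expected_fraction_a * total_sales
# Step 3: observed sales for everything that is not A
other_observed_a = total_sales - observed['A']
# Step 4: expected sales for everything that is not A
other_expected_a = total_sales - expected_a
# Step 5: Chi-Squared test on the two categories (A, not A)
statistic_a, p_value_a = chisquare(
    f_obs=[observed['A'], other_observed_a],
    f_exp=[expected_a, other_expected_a],
)

print(f"expected={expected_a:.6f}, chi2={statistic_a:.6f}, p={p_value_a:.6f}")
```

With two categories the test has one degree of freedom, which is the default that `chisquare` uses here.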

Actually determining whether or not an item is representative of the population is subjective. The [p-value](https://en.wikipedia.org/wiki/P-value) thresholds would need to be set at values that make sense for the data being used. Large data sets can produce very small p-values even for small deviations. The underlying rates may also drift as the business scales (low sales early, more as the company continues to grow).
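To illustrate the point about large data sets, the sketch below runs the same test for item A twice: once with the counts from the example, and once with every count multiplied by a hypothetical factor of 100. The proportions are identical in both runs; only the amount of data changes.

```python
from scipy.stats import chisquare

# Item A versus everything else, using the counts from the example dataset
observed = [10, 30]
expected_a = 40 * 177 / 785
expected = [expected_a, 40 - expected_a]

small_stat, small_p = chisquare(f_obs=observed, f_exp=expected)

# Hypothetical: the same proportions with 100 times as much data
scale = 100
large_stat, large_p = chisquare(
    f_obs=[scale * value for value in observed],
    f_exp=[scale * value for value in expected],
)

print(f"original: chi2={small_stat:.4f}, p={small_p:.4f}")
print(f"scaled:   chi2={large_stat:.4f}, p={large_p:.6f}")
```

The test statistic grows in proportion to the scale factor, so a deviation that is well within the noise at 40 sales falls below a 0.05 threshold at 4,000 sales.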

It may make sense to use the [test statistic](https://en.wikipedia.org/wiki/Test_statistic) instead, to assign a rank order to each item. This too presents a problem, as the test statistic may make jumps based on clustering. The difference between the 4th and 5th ranked items may be bigger or smaller than the difference between the 5th and 6th ranked items, and cutoffs may then be subject to noise. Normalizing the test statistic in some way and looking for jumps may be the most rigorous way.
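One possible way to rank items by test statistic is sketched below. It attaches the sign of (observed minus expected) to the statistic so that overperformers and underperformers land at opposite ends, then lists the gap between neighbouring ranks. The signed statistic and the `Gap to Next` column are assumptions made for illustration, not a standard normalization.

```python
import numpy as np
import pandas as pd
from scipy.stats import chisquare

df = pd.DataFrame({'Item': list('ABCDE'),
                   'Observed': [10, 8, 16, 3, 3],
                   'Opportunities': [177, 185, 201, 164, 58]})

observed = df['Observed'].to_numpy()
total_sales = observed.sum()
expected = df['Opportunities'].to_numpy() / df['Opportunities'].sum() * total_sales

# One item-versus-rest test per column of the stacked 2 x n arrays
statistic, _ = chisquare(f_obs=np.vstack([observed, total_sales - observed]),
                         f_exp=np.vstack([expected, total_sales - expected]))

# Assumed convention: positive means overperforming, negative underperforming
df['Signed Chi-Squared'] = np.sign(observed - expected) * statistic

ranked = df.sort_values('Signed Chi-Squared', ascending=False).reset_index(drop=True)
# Difference between each item and the one ranked just below it
ranked['Gap to Next'] = ranked['Signed Chi-Squared'] - ranked['Signed Chi-Squared'].shift(-1)
print(ranked[['Item', 'Signed Chi-Squared', 'Gap to Next']])
```

Large values in the gap column mark the jumps where a cutoff would be least sensitive to noise.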

Since this is a toy example to illustrate the concept, we will use the p-value despite these caveats.


In [3]:
def ChiSquared(df, observedColumn, opportunitiesColumn, threshold=0.05):
    """Test each row against all other rows combined and flag significant deviations."""
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    totalSales = df[observedColumn].sum()
    totalOpportunities = df[opportunitiesColumn].sum()

    # Expected sales if every item sold at the same rate per opportunity
    df['Expected Fraction'] = df[opportunitiesColumn] / totalOpportunities
    df['Expected'] = df['Expected Fraction'] * totalSales
    # Observed and expected sales for everything that is not this item
    df['Other Observed'] = totalSales - df[observedColumn]
    df['Other Expected'] = totalSales - df['Expected']
    # Stack into 2 x n arrays; chisquare works down axis 0, giving one test per item
    df['Chi-Squared'], df['p-value'] = chisquare(
        f_obs=np.array([df[observedColumn], df['Other Observed']]),
        f_exp=np.array([df['Expected'], df['Other Expected']]))

    maskSignificance = df['p-value'] < threshold
    maskUnder = df['Expected'] > df[observedColumn]
    maskOver = df['Expected'] < df[observedColumn]

    # '+' overperforming, '-' underperforming, '0' not significantly different
    df['Representation'] = '0'
    df.loc[maskSignificance & maskUnder, 'Representation'] = '-'
    df.loc[maskSignificance & maskOver, 'Representation'] = '+'
    return df

In [4]:
myData = ChiSquared(myData, 'Observed', 'Opportunities')
myData

Unnamed: 0,Item,Observed,Opportunities,Expected Fraction,Expected,Other Observed,Other Expected,Chi-Squared,p-value,Representation
0,A,10,177,0.225478,9.019108,30,30.980892,0.137735,0.710543,0
1,B,8,185,0.235669,9.426752,32,30.573248,0.282523,0.595053,0
2,C,16,201,0.256051,10.242038,24,29.757962,4.351189,0.036983,+
3,D,3,164,0.208917,8.356688,37,31.643312,4.340468,0.037217,-
4,E,3,58,0.073885,2.955414,37,37.044586,0.000726,0.9785,0


With the p-value threshold set to 0.05, item C is overperforming and is one of our strongest sellers, while item D is underperforming and may be worth considering purging from inventory. Items A, B, and E are within the noise and are typical performers.

Also, although D and E have the same number of observed sales, we're basing decisions on the number of opportunities. Since D had about three times the opportunities, it's safer to call it an underperformer and E a typical item.
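The difference between D and E is easy to see by computing sales per opportunity directly. The short sketch below uses the counts from the example dataset; the variable names are illustrative.

```python
# Counts for items D and E from the example dataset
observed = {'D': 3, 'E': 3}
opportunities = {'D': 164, 'E': 58}

# Overall rate across all five items: 40 sales over 785 opportunities
overall_rate = 40 / 785

# Sales per opportunity for each item
rates = {item: observed[item] / opportunities[item] for item in observed}

for item, rate in rates.items():
    print(f"{item}: {rate:.4f} sales per opportunity (overall {overall_rate:.4f})")
```

Item E sells at close to the overall rate, while item D sells at roughly a third of it.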