# stage3_build_labelling
This notebook uses __Snorkel, a weak supervision package__, to generate labels. These labels will be used to train a simple classification model in the next step.

The core task in this notebook is to write a set of "labelling functions". Each one is simple, naive and noisy, and is based purely on my biased intuition of what customers of different genders may look like.

Snorkel's premise is that by _"observing when and where these different labeling functions agree or disagree with one another, you can automatically learn—in unsupervised ways—when, where, and how much to trust each of them. You can thus learn their areas of expertise, and the overall level of expertise, so that when you combine their votes you end up with the highest quality label possible for each data point."_

This should work better than any opinionated analysis or heuristics that I could come up with in a short time.
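
To make "combining votes" concrete, here is a minimal, dependency-free sketch of the simplest possible combiner: an unweighted majority vote. This is not what Snorkel's label model does (as the quote says, it learns how much to trust each function from their agreements and disagreements); it is only a baseline for intuition. The function name `majority_vote` and the toy votes are made up for this illustration.

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(votes, abstain=ABSTAIN):
    """Return the most common non-abstain vote; abstain on no votes or a tie."""
    counts = Counter(v for v in votes if v != abstain)
    if not counts:
        return abstain
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return abstain  # the two most common labels are tied
    return ranked[0][0]


# Each row holds the votes of three hypothetical labelling functions for one customer.
label_matrix = [
    [0, 0, -1],    # two votes for label 0
    [1, -1, 1],    # two votes for label 1
    [0, 1, -1],    # one vote each: a tie
    [-1, -1, -1],  # nobody voted
]
combined = [majority_vote(row) for row in label_matrix]
print(combined)
```

A majority vote treats every labelling function as equally reliable, which is exactly the assumption a learned label model is meant to relax.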

# Imports

In [155]:
from snorkel.labeling import labeling_function
from snorkel.labeling import PandasLFApplier
from snorkel.labeling import LFAnalysis
from snorkel.labeling.model import LabelModel


import pandas as pd
import numpy as np

ImportError: cannot import name 'gcd' from 'fractions' (/Users/muwang/opt/anaconda3/envs/challenge-ilikedata/lib/python3.9/fractions.py)

# Load data

In [3]:
features_data_path = "../data/processed/features.parquet"

features = pd.read_parquet(features_data_path)

In [4]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46279 entries, 0 to 46278
Data columns (total 44 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               46279 non-null  object 
 1   is_newsletter_subscriber  46279 non-null  float64
 2   cc_payments               46279 non-null  float64
 3   paypal_payments           46279 non-null  float64
 4   afterpay_payments         46279 non-null  float64
 5   apple_payments            46279 non-null  float64
 6   orders                    46279 non-null  float64
 7   items                     46279 non-null  float64
 8   cancels                   46279 non-null  float64
 9   returns                   46279 non-null  float64
 10  vouchers                  46279 non-null  float64
 11  female_items              46279 non-null  float64
 12  male_items                46279 non-null  float64
 13  unisex_items              46279 non-null  float64
 14  wapp_i

# Write labelling functions
I'll come up with as many ideas as I can while exploring the features.

Set up some "constants":

In [13]:
# ABSTAIN is a labelling function's way of saying "I don't know"
# UNKNOWN is for when there's not enough data, for example when a customer hasn't bought anything yet
# Sorry, LGBTQI+ community, I don't have enough data or time to account for everyone

MALE = 0
FEMALE = 1
UNKNOWN = 2
ABSTAIN = -1

Explore the data:

In [7]:
with pd.option_context('display.max_columns', 999):
    display(features.sample(10))

Unnamed: 0,customer_id,is_newsletter_subscriber,cc_payments,paypal_payments,afterpay_payments,apple_payments,orders,items,cancels,returns,vouchers,female_items,male_items,unisex_items,wapp_items,wftw_items,mapp_items,wacc_items,macc_items,mftw_items,wspt_items,mspt_items,curvy_items,sacc_items,msite_orders,desktop_orders,android_orders,ios_orders,other_device_orders,work_orders,home_orders,parcelpoint_orders,other_collection_orders,redpen_discount_used,coupon_discount_applied,revenue,days_since_first_order,days_since_last_order,tenure_months,different_addresses,shipping_addresses,devices,average_discount_onoffer,average_discount_used
22962,7e6b3e3a57ab92c06f79008a7c1eec69,1.0,1.0,0.0,0.0,0.0,0.218182,0.381818,0.0,0.0,0.018182,0.363636,0.0,0.018182,0.218182,0.090909,0.0,0.0,0.0,0.018182,0.054545,0.0,0.0,0.0,0.0,0.163636,0.0,0.054545,0.0,0.0,0.109091,0.0,0.109091,26.270182,0.991273,34.116909,1734.0,112.0,55.0,0.0,1.0,2.0,0.4627,0.469517
25369,3b32ed1d40800cb33affe2e8fa657b85,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,12.72,0.0,50.86,370.0,370.0,1.0,0.0,1.0,1.0,0.2,0.200031
44611,2f74aae2d6e670f04be7e0fb843566e1,1.0,1.0,0.0,0.0,0.0,0.40625,0.5625,0.0,0.09375,0.09375,0.5625,0.0,0.0,0.375,0.15625,0.0,0.03125,0.03125,0.0,0.0,0.0,0.0,0.0,0.1875,0.1875,0.0,0.03125,0.0,0.0,0.34375,0.0,0.0625,0.0,12.335938,32.87875,1219.0,278.0,32.0,0.0,2.0,3.0,0.0,0.209439
4246,facb50c3929e5d94e0a1da0006048e4a,1.0,1.0,1.0,1.0,0.0,0.350877,1.0,0.0,0.035088,0.087719,1.0,0.0,0.0,0.368421,0.245614,0.0,0.070175,0.070175,0.0,0.315789,0.0,0.0,0.0,0.017544,0.157895,0.0,0.175439,0.0,0.245614,0.0,0.0,0.105263,51.936316,11.322632,175.322982,1824.0,124.0,57.0,1.0,5.0,2.0,0.2098,0.25205
28274,1fd2335186932f78a197e523f8f1b42a,1.0,0.0,1.0,1.0,0.0,0.714286,1.0,0.142857,0.0,0.142857,0.857143,0.0,0.142857,0.142857,0.714286,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.285714,0.428571,0.0,0.0,0.0,0.0,0.714286,0.0,0.0,9.865714,2.075714,49.03,255.0,51.0,7.0,0.0,1.0,2.0,0.0667,0.078783
9424,07c92ea42367e32709aba581bc5b1501,0.0,1.0,1.0,0.0,0.0,0.285714,1.071429,0.0,0.142857,0.0,0.428571,0.571429,0.071429,0.0,0.357143,0.428571,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.071429,0.214286,0.0,0.0,0.0,0.0,0.0,0.0,0.285714,46.906429,0.0,183.770714,1758.0,1363.0,14.0,0.0,1.0,2.0,0.2559,0.255889
30063,310a64113cc54dc223984fd047a3cebb,1.0,1.0,0.0,0.0,0.0,2.0,4.0,0.0,2.0,1.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,185.24,36.32,214.36,603.0,600.0,1.0,0.0,1.0,1.0,0.4249,0.508165
45723,1e7f5328bd109c9f2eb5d9802e3d8025,1.0,1.0,0.0,0.0,0.0,0.235294,0.705882,0.0,0.0,0.058824,0.705882,0.0,0.0,0.470588,0.0,0.0,0.235294,0.235294,0.0,0.0,0.0,0.0,0.0,0.0,0.235294,0.0,0.0,0.0,0.0,0.0,0.0,0.235294,34.167059,7.689412,49.015882,1342.0,844.0,17.0,1.0,3.0,1.0,0.3263,0.383371
6256,d1a2b2303f4340b8f01c5897873e405b,1.0,1.0,0.0,0.0,0.0,0.163636,0.381818,0.0,0.145455,0.036364,0.381818,0.0,0.0,0.254545,0.127273,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054545,0.109091,0.0,0.0,0.0,0.0,0.127273,0.0,0.036364,10.183091,2.666909,98.973091,1804.0,166.0,55.0,0.0,1.0,3.0,0.1077,0.136521
27548,0a60c57abd9c412ab6b42a3b5faa8181,1.0,0.0,1.0,1.0,0.0,0.235294,0.294118,0.0,0.058824,0.0,0.294118,0.0,0.0,0.176471,0.029412,0.0,0.088235,0.088235,0.0,0.0,0.0,0.0,0.0,0.029412,0.205882,0.0,0.0,0.0,0.0,0.147059,0.0,0.088235,18.178235,0.0,30.960588,1058.0,55.0,34.0,1.0,2.0,2.0,0.25,0.249972


In [50]:
features['items'].describe()

count    46279.000000
mean         1.629684
std          2.770734
min          0.029851
25%          0.619048
50%          1.000000
75%          2.000000
max        232.000000
Name: items, dtype: float64

In [131]:
@labeling_function()
def bought_male_item(x):
    return (
        MALE 
        if x.male_items > 0
        else ABSTAIN
    )

@labeling_function()
def bought_more_male_item(x):
    return (
        MALE 
        if x.male_items > x.female_items
        else ABSTAIN
    )

@labeling_function()
def bought_more_female_item(x):
    return (
        FEMALE 
        if x.female_items > x.male_items
        else ABSTAIN
    )

@labeling_function()
def no_purchase(x):
    return (
        UNKNOWN
        if x['items'] == 0
        else ABSTAIN
    )

# Evaluate labelling functions on feature set - first pass

In [132]:
labelling_functions = [bought_male_item, bought_more_male_item, bought_more_female_item, no_purchase]
applier = PandasLFApplier(lfs=labelling_functions)
L_train = applier.apply(df=features)

100%|██████████████████████████████████| 46279/46279 [00:02<00:00, 21343.57it/s]


In [133]:
coverage_array = (pd.DataFrame(L_train) != ABSTAIN).mean(axis=0)

In [134]:
LFAnalysis(L=L_train, lfs=labelling_functions).lf_summary()

| | j | Polarity | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|---|
| bought_male_item | 0 | [0] | 0.371356 | 0.352817 | 0.106463 |
| bought_more_male_item | 1 | [0] | 0.246354 | 0.246354 | 0.0 |
| bought_more_female_item | 2 | [1] | 0.6896 | 0.106463 | 0.106463 |
| no_purchase | 3 | [] | 0.0 | 0.0 | 0.0 |


Oops, `no_purchase()` has 0 coverage, meaning every customer in this dataset has bought something. Moving on anyway...
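
For reference, this is how I read the three numeric columns of the summary: coverage is the share of rows where the LF votes at all, overlaps is the share where it votes and at least one other LF also votes, and conflicts is the share where it votes and at least one other LF votes differently. The sketch below recomputes them in plain Python on a made-up label matrix; the helper `lf_stats` is hypothetical and is not part of Snorkel.

```python
ABSTAIN = -1


def lf_stats(label_matrix, j, abstain=ABSTAIN):
    """Coverage, overlaps and conflicts for column j of a row-major label matrix."""
    n = len(label_matrix)
    covered = overlaps = conflicts = 0
    for row in label_matrix:
        if row[j] == abstain:
            continue  # this LF did not vote on this row
        covered += 1
        others = [v for k, v in enumerate(row) if k != j and v != abstain]
        if others:
            overlaps += 1  # at least one other LF voted too
        if any(v != row[j] for v in others):
            conflicts += 1  # at least one other LF disagreed
    return covered / n, overlaps / n, conflicts / n


# Toy matrix: 4 customers (rows) x 3 labelling functions (columns).
toy = [
    [0, 0, -1],
    [0, -1, 1],
    [-1, -1, 1],
    [0, -1, -1],
]
for j in range(3):
    print(j, lf_stats(toy, j))
```

Under this reading, `bought_more_male_item` has overlaps equal to its coverage in the table above because every customer it labels is also labelled by `bought_male_item`.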

# Keep writing more labelling functions

In [141]:
@labeling_function()
def unisex_only(x):
    return (
        UNKNOWN 
        if x.male_items == 0 
            and x.female_items == 0
            and x.unisex_items > 0
        else ABSTAIN
    )

@labeling_function()
def more_than_one_female_categories(x):
    # count how many distinct women's categories the customer has bought from
    n_categories = np.sum([
        x.wapp_items > 0,
        x.wftw_items > 0,
        x.wacc_items > 0,
        x.wspt_items > 0,
        x.curvy_items > 0,
    ])
    return FEMALE if n_categories > 1 else ABSTAIN

@labeling_function()
def more_than_one_male_categories(x):
    # count how many distinct men's categories the customer has bought from
    n_categories = np.sum([
        x.mapp_items > 0,
        x.macc_items > 0,
        x.mftw_items > 0,
        x.mspt_items > 0,
    ])
    return MALE if n_categories > 1 else ABSTAIN

# Evaluate labelling functions on feature set - second pass

In [142]:
labelling_functions = [bought_male_item, bought_more_male_item, bought_more_female_item, no_purchase,
                      unisex_only, more_than_one_female_categories, more_than_one_male_categories
                      ]
applier = PandasLFApplier(lfs=labelling_functions)
L_train = applier.apply(df=features)

100%|███████████████████████████████████| 46279/46279 [00:06<00:00, 7161.99it/s]


In [143]:
coverage_array = (pd.DataFrame(L_train) != ABSTAIN).mean(axis=0)

In [144]:
LFAnalysis(L=L_train, lfs=labelling_functions).lf_summary()

| | j | Polarity | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|---|
| bought_male_item | 0 | [0] | 0.371356 | 0.357873 | 0.120551 |
| bought_more_male_item | 1 | [0] | 0.246354 | 0.246354 | 0.011085 |
| bought_more_female_item | 2 | [1] | 0.6896 | 0.328464 | 0.114501 |
| no_purchase | 3 | [] | 0.0 | 0.0 | 0.0 |
| unisex_only | 4 | [2] | 0.045507 | 0.000324 | 0.000324 |
| more_than_one_female_categories | 5 | [1] | 0.323257 | 0.323257 | 0.109294 |
| more_than_one_male_categories | 6 | [0] | 0.139113 | 0.139113 | 0.072646 |


# Examine conflicts
Now that we are starting to see more conflicts between different LFs, it'd be good to dive in and see what's happening. It should give me more ideas. 

In [145]:
# replace ABSTAIN with NaN so that only the actual votes each example got remain:
multi_votes = pd.DataFrame(L_train).replace(ABSTAIN, np.nan)
multi_votes.columns = [x.name for x in labelling_functions]
multi_votes

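| | bought_male_item | bought_more_male_item | bought_more_female_item | no_purchase | unisex_only | more_than_one_female_categories | more_than_one_male_categories |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | NaN | 1.0 | NaN | NaN | 1.0 | 0.0 |
| 1 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN |
| 2 | 0.0 | NaN | 1.0 | NaN | NaN | 1.0 | 0.0 |
| 3 | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN |
| 4 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 46274 | 0.0 | NaN | 1.0 | NaN | NaN | 1.0 | 0.0 |
| 46275 | 0.0 | NaN | 1.0 | NaN | NaN | 1.0 | 0.0 |
| 46276 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN |
| 46277 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN |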


In [146]:
# keep the rows where the LFs cast more than one distinct (non-ABSTAIN) vote
multi_votes_filter = multi_votes.nunique(axis=1) > 1

In [148]:
with pd.option_context('display.max_columns', 999):
    display(
        pd.concat([
            multi_votes.loc[multi_votes_filter,:],
            features.loc[multi_votes_filter,:]
        ], axis=1).sample(10)
    )

Unnamed: 0,bought_male_item,bought_more_male_item,bought_more_female_item,no_purchase,unisex_only,more_than_one_female_categories,more_than_one_male_categories,customer_id,is_newsletter_subscriber,cc_payments,paypal_payments,afterpay_payments,apple_payments,orders,items,cancels,returns,vouchers,female_items,male_items,unisex_items,wapp_items,wftw_items,mapp_items,wacc_items,macc_items,mftw_items,wspt_items,mspt_items,curvy_items,sacc_items,msite_orders,desktop_orders,android_orders,ios_orders,other_device_orders,work_orders,home_orders,parcelpoint_orders,other_collection_orders,redpen_discount_used,coupon_discount_applied,revenue,days_since_first_order,days_since_last_order,tenure_months,different_addresses,shipping_addresses,devices,average_discount_onoffer,average_discount_used
23809,0.0,,1.0,,,1.0,,57511bf7cef5f8e6b9a9d3875793de34,0.0,1.0,1.0,0.0,0.0,0.267857,0.446429,0.017857,0.196429,0.107143,0.392857,0.035714,0.017857,0.285714,0.071429,0.0,0.017857,0.017857,0.0,0.017857,0.0,0.0,0.0,0.107143,0.160714,0.0,0.0,0.0,0.107143,0.053571,0.0,0.107143,21.776429,4.176607,53.785714,1678.0,12.0,56.0,1.0,9.0,2.0,0.2679,0.366561
18087,0.0,,,,,1.0,0.0,22d677c7f6e9cad8398328624c03638c,0.0,0.0,1.0,0.0,0.0,0.375,0.75,0.0,0.125,0.125,0.375,0.375,0.0,0.0,0.25,0.0,0.125,0.125,0.125,0.0,0.0,0.0,0.0,0.0,0.375,0.0,0.0,0.0,0.0,0.0,0.0,0.375,12.2625,3.4075,40.855,1744.0,1520.0,8.0,0.0,1.0,1.0,0.1929,0.250094
25131,0.0,,1.0,,,1.0,,0317e9e2d5e6c02ae3d45d49d09d5311,1.0,1.0,0.0,0.0,0.0,0.833333,1.666667,0.0,0.166667,0.166667,1.166667,0.333333,0.166667,0.666667,0.5,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.833333,0.0,0.0,0.0,0.0,0.833333,0.0,0.0,37.85,98.573333,358.463333,470.0,307.0,6.0,0.0,1.0,1.0,0.0781,0.279655
11262,0.0,,1.0,,,1.0,,5034edbcf46a1972c5886a024faac566,1.0,1.0,0.0,0.0,0.0,0.183673,0.346939,0.0,0.122449,0.040816,0.326531,0.020408,0.0,0.244898,0.061224,0.0,0.020408,0.020408,0.0,0.0,0.0,0.0,0.0,0.0,0.183673,0.0,0.0,0.0,0.020408,0.081633,0.0,0.081633,30.099592,4.063061,64.51551,1745.0,275.0,49.0,1.0,4.0,1.0,0.242,0.279925
12694,0.0,,1.0,,,1.0,,facab86622bee27c0d0548af75547782,1.0,0.0,1.0,0.0,0.0,0.173913,0.521739,0.0,0.0,0.043478,0.434783,0.086957,0.0,0.086957,0.043478,0.0,0.0,0.0,0.086957,0.0,0.0,0.0,0.0,0.0,0.173913,0.0,0.0,0.0,0.0,0.0,0.0,0.173913,27.357391,1.725652,23.306522,1754.0,1070.0,23.0,0.0,2.0,1.0,0.4992,0.561999
36774,0.0,,1.0,,,1.0,0.0,b7db819172445ee66d6118ad75c2f1d8,1.0,1.0,0.0,0.0,0.0,0.692308,1.846154,0.0,0.230769,0.461538,1.692308,0.076923,0.076923,0.153846,0.923077,0.0,0.153846,0.153846,0.076923,0.461538,0.0,0.0,0.076923,0.153846,0.230769,0.0,0.307692,0.0,0.384615,0.307692,0.0,0.0,74.371538,66.182308,205.053846,495.0,126.0,13.0,0.0,2.0,2.0,0.1868,0.390919
24560,0.0,0.0,,,,1.0,0.0,03d0a618eaf6cda1bfd0f54f76b69d35,1.0,1.0,1.0,0.0,0.0,0.25,0.583333,0.0,0.166667,0.0,0.145833,0.4375,0.0,0.083333,0.020833,0.395833,0.020833,0.020833,0.0,0.0,0.0,0.0,0.0,0.0625,0.1875,0.0,0.0,0.0,0.0,0.0625,0.0,0.1875,19.425833,0.0,72.2675,1687.0,247.0,48.0,1.0,2.0,2.0,0.1724,0.172443
2510,0.0,,1.0,,,1.0,,08655a58a87189cb6a8878e596227e12,1.0,1.0,0.0,1.0,0.0,0.163934,0.295082,0.04918,0.0,0.016393,0.262295,0.032787,0.0,0.147541,0.081967,0.0,0.016393,0.016393,0.0,0.016393,0.0,0.0,0.0,0.0,0.163934,0.0,0.0,0.0,0.0,0.114754,0.0,0.04918,7.156393,0.968689,34.388689,1862.0,35.0,61.0,1.0,2.0,1.0,0.18,0.185283
26544,0.0,,1.0,,,1.0,,3b2a5bcbd1fc634011152892cce852c7,1.0,0.0,1.0,0.0,0.0,0.5,1.166667,0.0,0.5,0.166667,1.0,0.166667,0.0,0.166667,0.0,0.0,0.0,0.0,0.166667,0.833333,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.5,0.0,0.0,27.23,4.54,80.213333,544.0,386.0,6.0,0.0,2.0,1.0,0.2367,0.283184
22300,0.0,,1.0,,,,0.0,d9bbb0c203a2c39f3158fe2593272a12,0.0,1.0,0.0,0.0,0.0,2.0,6.0,0.0,0.0,1.0,3.0,2.0,1.0,3.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,444.15,40.89,523.2,1626.0,1626.0,1.0,0.0,1.0,1.0,0.4332,0.474931


Here are some observations:
- Some customers bought equal numbers of male and female items.
- Some customers bought only unisex items but also bought men's footwear (is men's footwear unisex?).
- I found some bugs in the LFs.
- Some customers bought both male and female items, but lean heavily to one side.

These give me a few more ideas to try.
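
Besides eyeballing sampled rows, one way to see where the disagreement comes from is to count conflicts per pair of labelling functions. The sketch below does that in plain Python on a made-up label matrix; `pairwise_conflicts` and the `lf_a`/`lf_b`/`lf_c` names are hypothetical, and in this notebook the inputs would be the rows of `L_train` and the LF names.

```python
from collections import Counter
from itertools import combinations

ABSTAIN = -1


def pairwise_conflicts(label_matrix, names, abstain=ABSTAIN):
    """Count, per pair of labelling functions, the rows where both vote and disagree."""
    counts = Counter()
    for row in label_matrix:
        for (i, a), (j, b) in combinations(enumerate(row), 2):
            if a != abstain and b != abstain and a != b:
                counts[(names[i], names[j])] += 1
    return counts


names = ["lf_a", "lf_b", "lf_c"]
toy = [
    [0, 1, -1],
    [0, 1, 1],
    [0, -1, 1],
    [-1, 1, 1],
]
result = pairwise_conflicts(toy, names)
for pair, n in result.most_common():
    print(pair, n)
```

Pairs that never disagree simply do not appear in the result, so the printed list is already a shortlist of LF pairs worth inspecting.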

# The third round of LFs and evaluation

In [150]:
# Note: the `>=` / `<=` comparisons in the three LFs below also hold when both
# counts are 0 (e.g. 0 >= 2.0 * 0), so they also vote for customers who bought
# no gendered items at all.

@labeling_function()
def two_times_more_female_item(x):
    return (
        FEMALE 
        if x.female_items >= 2.0 * x.male_items
        else ABSTAIN
    )

@labeling_function()
def two_times_more_male_item(x):
    return (
        MALE 
        if x.male_items >= 2.0 * x.female_items
        else ABSTAIN
    )

@labeling_function()
def roughtly_equal_male_female_item(x):
    """My experience tells me this is still more likely female customer :) """
    return (
        FEMALE 
        if x.female_items >= 0.9 * x.male_items
            and x.female_items <= 1.1 * x.male_items
        else ABSTAIN
    )

In [151]:
labelling_functions = [bought_male_item, bought_more_male_item, bought_more_female_item, no_purchase,
                      unisex_only, more_than_one_female_categories, more_than_one_male_categories,
                       two_times_more_female_item, two_times_more_male_item, roughtly_equal_male_female_item
                      ]
applier = PandasLFApplier(lfs=labelling_functions)
L_train = applier.apply(df=features)

100%|███████████████████████████████████| 46279/46279 [00:08<00:00, 5429.18it/s]


In [152]:
coverage_array = (pd.DataFrame(L_train) != ABSTAIN).mean(axis=0)

In [153]:
LFAnalysis(L=L_train, lfs=labelling_functions).lf_summary()

| | j | Polarity | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|---|
| bought_male_item | 0 | [0] | 0.371356 | 0.371356 | 0.136088 |
| bought_more_male_item | 1 | [0] | 0.246354 | 0.246354 | 0.011085 |
| bought_more_female_item | 2 | [1] | 0.6896 | 0.6896 | 0.114501 |
| no_purchase | 3 | [] | 0.0 | 0.0 | 0.0 |
| unisex_only | 4 | [2] | 0.045507 | 0.045507 | 0.045507 |
| more_than_one_female_categories | 5 | [1] | 0.323257 | 0.323257 | 0.109294 |
| more_than_one_male_categories | 6 | [0] | 0.139113 | 0.139113 | 0.074699 |
| two_times_more_female_item | 7 | [1] | 0.724411 | 0.724411 | 0.149312 |
| two_times_more_male_item | 8 | [0] | 0.283347 | 0.283347 | 0.0516 |
| roughtly_equal_male_female_item | 9 | [1] | 0.064392 | 0.064392 | 0.064392 |
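
In this summary the conflicts of `unisex_only` jump to its full coverage (0.045507). A likely cause is that the ratio-based LFs use `>=`, which also holds when a customer has zero male and zero female items, so unisex-only customers receive MALE and FEMALE votes as well. The sketch below shows one possible guard. The `_guarded` names are hypothetical, the functions are plain Python so they run without Snorkel (in the notebook they would carry the `@labeling_function()` decorator), and the numbers above were not produced with them.

```python
from types import SimpleNamespace

MALE, FEMALE, ABSTAIN = 0, 1, -1


def two_times_more_female_item_guarded(x):
    # Require at least one female item so that 0 >= 2.0 * 0 is no longer a vote.
    if x.female_items > 0 and x.female_items >= 2.0 * x.male_items:
        return FEMALE
    return ABSTAIN


def two_times_more_male_item_guarded(x):
    # Same guard on the male side.
    if x.male_items > 0 and x.male_items >= 2.0 * x.female_items:
        return MALE
    return ABSTAIN


# Two made-up customers standing in for DataFrame rows.
unisex_only_customer = SimpleNamespace(female_items=0.0, male_items=0.0)
mostly_female_customer = SimpleNamespace(female_items=3.0, male_items=1.0)

print(two_times_more_female_item_guarded(unisex_only_customer))
print(two_times_more_male_item_guarded(unisex_only_customer))
print(two_times_more_female_item_guarded(mostly_female_customer))
```

With the guard, both functions abstain on the unisex-only customer and the mostly-female customer still gets a FEMALE vote.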


# Combine all LFs to generate a statistical labelling model
The LFs could get much more sophisticated than this, for example with input from subject-matter experts (SMEs). But I'm clearly no SME in this area, and thinking about it further would probably yield diminishing returns.

I think that's good enough given the time spent.

Time to generate Labels!