# stage3_build_labelling
This notebook uses __Snorkel, a weak supervision package,__ to generate labels. These labels will be used to train a simple classification model in the next step.

The core task in this notebook is for me to write a bunch of "labelling functions" (LFs). Each of them is simple, naive and noisy, and purely based on my biased intuition of what customers of different genders may look like.

Snorkel will then try _"observing when and where these different labeling functions agree or disagree with one another, you can automatically learn—in unsupervised ways—when, where, and how much to trust each of them. You can thus learn their areas of expertise, and the overall level of expertise, so that when you combine their votes you end up with the highest quality label possible for each data point."_

This should work better than the opinionated analysis or heuristics that I could come up with in a short time.

__The "Labelling Model" this produces at the end is a statistical combination of my crude intuitions, generalised to the entire data set. It doesn't use all features (only the ones I wrote into the labelling functions), so it wouldn't generalise well to unseen data. Hence we still need to train another ML classification model on top of these labels.__

# Imports

In [1]:
from snorkel.labeling import labeling_function
from snorkel.labeling import PandasLFApplier
from snorkel.labeling import LFAnalysis
from snorkel.labeling.model import LabelModel, MajorityLabelVoter


import pandas as pd
import numpy as np



# Load data

In [2]:
features_data_path = "../data/processed/features.parquet"

features = pd.read_parquet(features_data_path)

In [3]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46279 entries, 0 to 46278
Data columns (total 44 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               46279 non-null  object 
 1   is_newsletter_subscriber  46279 non-null  float64
 2   cc_payments               46279 non-null  float64
 3   paypal_payments           46279 non-null  float64
 4   afterpay_payments         46279 non-null  float64
 5   apple_payments            46279 non-null  float64
 6   orders                    46279 non-null  float64
 7   items                     46279 non-null  float64
 8   cancels                   46279 non-null  float64
 9   returns                   46279 non-null  float64
 10  vouchers                  46279 non-null  float64
 11  female_items              46279 non-null  float64
 12  male_items                46279 non-null  float64
 13  unisex_items              46279 non-null  float64
 14  wapp_i

# Write labelling functions
I'll come up with as many ideas as I can while exploring the features.

Set up some "constants":

In [4]:
# ABSTAIN is a labelling function's way of saying "I don't know"
# UNKNOWN is for when there's not enough data, for example customer hasn't bought anything yet
# Sorry, LGBTQI+ community, I don't have enough data or time to account for everyone

MALE = 0
FEMALE = 1
UNKNOWN = 2
ABSTAIN = -1

Explore the data:

In [5]:
with pd.option_context('display.max_columns', 999):
    display(features.sample(10))

Unnamed: 0,customer_id,is_newsletter_subscriber,cc_payments,paypal_payments,afterpay_payments,apple_payments,orders,items,cancels,returns,vouchers,female_items,male_items,unisex_items,wapp_items,wftw_items,mapp_items,wacc_items,macc_items,mftw_items,wspt_items,mspt_items,curvy_items,sacc_items,msite_orders,desktop_orders,android_orders,ios_orders,other_device_orders,work_orders,home_orders,parcelpoint_orders,other_collection_orders,redpen_discount_used,coupon_discount_applied,revenue,days_since_first_order,days_since_last_order,tenure_months,different_addresses,shipping_addresses,devices,average_discount_onoffer,average_discount_used
14824,eb9ab39054edcfa17ace7333a8688771,0.0,0.0,1.0,0.0,0.0,0.214286,0.357143,0.0,0.0,0.0,0.285714,0.0,0.071429,0.0,0.071429,0.0,0.285714,0.285714,0.0,0.0,0.0,0.0,0.0,0.0,0.214286,0.0,0.0,0.0,0.0,0.0,0.0,0.214286,10.189286,0.0,16.987857,1746.0,1351.0,14.0,0.0,2.0,1.0,0.3481,0.348096
8897,1ff645e7d2cc6499f850b3b4f2de26d2,1.0,1.0,0.0,0.0,0.0,0.666667,1.333333,0.0,0.0,0.0,1.333333,0.0,0.0,0.0,1.0,0.0,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,50.88,0.0,65.633333,1799.0,1719.0,3.0,0.0,1.0,1.0,0.4038,0.403804
31815,4ad7dac0baac299f4ddbcea9e2a46bbd,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,127.23,136.0,136.0,1.0,0.0,1.0,1.0,0.0,0.0
36928,5c7927a07d60a47f4bc634e0a5515407,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,45.41,6.0,6.0,1.0,0.0,1.0,1.0,0.0,0.0
1845,1c410049c89c83567f4a71a120d4b914,1.0,1.0,0.0,1.0,0.0,0.206349,0.539683,0.0,0.031746,0.015873,0.47619,0.031746,0.031746,0.412698,0.063492,0.031746,0.0,0.0,0.0,0.0,0.0,0.0,0.031746,0.0,0.206349,0.0,0.0,0.0,0.0,0.174603,0.0,0.031746,3.802857,4.021905,56.667619,1945.0,56.0,63.0,0.0,2.0,1.0,0.0559,0.103507
44143,eb54bb84a725e39c08c55824dbe0fa18,1.0,1.0,1.0,0.0,0.0,0.238095,0.619048,0.0,0.047619,0.142857,0.52381,0.047619,0.047619,0.428571,0.047619,0.047619,0.047619,0.047619,0.047619,0.0,0.0,0.0,0.0,0.0,0.238095,0.0,0.0,0.0,0.142857,0.047619,0.047619,0.0,52.249524,34.011429,98.310952,633.0,21.0,21.0,0.0,4.0,1.0,0.2634,0.441391
7144,c892a16cea982a657258b57c97463923,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,72.68,1769.0,1769.0,1.0,0.0,1.0,1.0,0.0,0.0
2439,9ef891fa20ac3a637e586cc19d3c1dab,0.0,1.0,0.0,0.0,0.0,0.08,0.2,0.0,0.08,0.0,0.06,0.1,0.04,0.0,0.04,0.14,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.02,0.06,0.0,0.0,0.0,0.0,0.02,0.0,0.06,26.1192,0.0,40.706,1846.0,364.0,50.0,0.0,1.0,2.0,0.2096,0.209599
7567,85c665052c207d43ef8a1601f4529772,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,72.68,1782.0,1782.0,1.0,0.0,1.0,1.0,0.0,0.0
31600,3ae158b59da2837b65651bb3a907137c,0.0,1.0,0.0,0.0,0.0,0.096774,0.290323,0.0,0.193548,0.064516,0.290323,0.0,0.0,0.225806,0.0,0.0,0.064516,0.064516,0.0,0.0,0.0,0.0,0.0,0.0,0.096774,0.0,0.0,0.0,0.0,0.032258,0.0,0.064516,60.859677,3.470968,104.539677,1179.0,277.0,31.0,0.0,2.0,1.0,0.3242,0.352693


In [6]:
features['items'].describe()

count    46279.000000
mean         1.629684
std          2.770734
min          0.029851
25%          0.619048
50%          1.000000
75%          2.000000
max        232.000000
Name: items, dtype: float64

In [33]:
@labeling_function()
def bought_male_item(x):
    return (
        MALE 
        if x.male_items > 0
        else ABSTAIN
    )

@labeling_function()
def bought_more_male_item(x):
    return (
        MALE 
        if x.male_items > x.female_items
        else ABSTAIN
    )

@labeling_function()
def bought_female_item(x):
    return (
        FEMALE 
        if x.female_items > 0
        else ABSTAIN
    )

@labeling_function()
def bought_more_female_item(x):
    return (
        FEMALE 
        if x.female_items > x.male_items
        else ABSTAIN
    )

@labeling_function()
def no_purchase(x):
    return (
        UNKNOWN
        if x['items'] == 0
        else ABSTAIN
    )

# Evaluate labelling functions on feature set - first pass

In [34]:
labelling_functions = [bought_male_item, bought_more_male_item, 
                       bought_female_item, bought_more_female_item, no_purchase]
applier = PandasLFApplier(lfs=labelling_functions)
L_train = applier.apply(df=features)

100%|██████████████████████████████| 46279/46279 [00:02<00:00, 19215.16it/s]


In [35]:
coverage_array = (pd.DataFrame(L_train) != ABSTAIN).mean(axis=0)

In [36]:
LFAnalysis(L=L_train, lfs=labelling_functions).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
bought_male_item,0,[0],0.371356,0.371356,0.162838
bought_more_male_item,1,[0],0.246354,0.246354,0.037836
bought_female_item,2,[1],0.745975,0.745975,0.162838
bought_more_female_item,3,[1],0.6896,0.6896,0.106463
no_purchase,4,[],0.0,0.0,0.0


Oops, no_purchase() has 0 coverage, meaning every customer in this dataset has bought something. Anyway...
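
For reference, the three columns in the summary can be reproduced by hand. The sketch below is a plain-Python approximation of how I understand Snorkel defines them: coverage is the share of rows where the LF votes, overlaps is the share where it votes and at least one other LF also votes, and conflicts is the share where it votes and at least one other LF votes differently. `lf_stats` and `L_toy` are names made up for this illustration.

```python
ABSTAIN = -1

def lf_stats(L):
    """Return per-LF (coverage, overlaps, conflicts) for a label matrix given as a list of rows."""
    n_rows, n_lfs = len(L), len(L[0])
    stats = []
    for j in range(n_lfs):
        covered = overlaps = conflicts = 0
        for row in L:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Votes from the other LFs on the same row, ignoring abstains
            others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
            if others:
                overlaps += 1
            if any(v != row[j] for v in others):
                conflicts += 1
        stats.append((covered / n_rows, overlaps / n_rows, conflicts / n_rows))
    return stats

# Toy label matrix: 4 customers, 3 LFs (the third LF never votes, like no_purchase)
L_toy = [
    [0, 1, -1],
    [0, -1, -1],
    [-1, 1, -1],
    [0, 0, -1],
]
toy_stats = lf_stats(L_toy)
```

An LF that never fires, like the third column here, ends up with zeros across the board, which matches the `no_purchase` row above.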

# Keep writing more labelling functions

In [37]:
@labeling_function()
def unisex_only(x):
    return (
        UNKNOWN 
        if x.male_items == 0 
            and x.female_items == 0
            and x.unisex_items > 0
        else ABSTAIN
    )

@labeling_function()
def more_than_one_female_categories(x):
    return (
        FEMALE
        if np.sum([
                x.wapp_items > 0, 
                  x.wftw_items > 0, 
                  x.wacc_items > 0,
                  x.wspt_items > 0,
        ]) > 1.001
        else ABSTAIN
    )

@labeling_function()
def more_than_one_male_categories(x):
    return (
        MALE
        if np.sum([x.mapp_items > 0, 
                  x.macc_items > 0, 
                  x.mftw_items > 0,
                  x.mspt_items > 0]
                 ) > 1.001
        else ABSTAIN
    )

# Evaluate labelling functions on feature set - second pass

In [39]:
labelling_functions = [bought_male_item, bought_more_male_item, 
                       bought_female_item, bought_more_female_item, no_purchase,
                      unisex_only, more_than_one_female_categories, more_than_one_male_categories
                      ]
applier = PandasLFApplier(lfs=labelling_functions)
L_train = applier.apply(df=features)

100%|███████████████████████████████| 46279/46279 [00:06<00:00, 7014.84it/s]


In [40]:
coverage_array = (pd.DataFrame(L_train) != ABSTAIN).mean(axis=0)

In [41]:
LFAnalysis(L=L_train, lfs=labelling_functions).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
bought_male_item,0,[0],0.371356,0.371356,0.163011
bought_more_male_item,1,[0],0.246354,0.246354,0.038009
bought_female_item,2,[1],0.745975,0.745975,0.170877
bought_more_female_item,3,[1],0.6896,0.6896,0.114501
no_purchase,4,[],0.0,0.0,0.0
unisex_only,5,[2],0.045507,0.000324,0.000324
more_than_one_female_categories,6,[1],0.319843,0.319843,0.10871
more_than_one_male_categories,7,[0],0.139113,0.139113,0.088204


# Examine conflicts
Now that we are starting to see more conflicts between different LFs, it'd be good to dive in and see what's happening. It should give me more ideas. 

In [42]:
# Replace ABSTAIN with NaN so that only real votes are counted for each example
multi_votes = pd.DataFrame(L_train).replace(ABSTAIN, np.nan)
multi_votes.columns = [x.name for x in labelling_functions]
multi_votes

Unnamed: 0,bought_male_item,bought_more_male_item,bought_female_item,bought_more_female_item,no_purchase,unisex_only,more_than_one_female_categories,more_than_one_male_categories
0,0.0,,1.0,1.0,,,1.0,0.0
1,,,1.0,1.0,,,1.0,
2,0.0,,1.0,1.0,,,1.0,0.0
3,,,,,,2.0,,
4,,,1.0,1.0,,,,
...,...,...,...,...,...,...,...,...
46274,0.0,,1.0,1.0,,,1.0,0.0
46275,0.0,,1.0,1.0,,,1.0,0.0
46276,,,1.0,1.0,,,,
46277,,,1.0,1.0,,,,


In [43]:
multi_votes_filter = (multi_votes.apply(lambda s: s.nunique(), axis=1) > 1)
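
The filter relies on `nunique()` skipping NaN by default, which is why ABSTAIN was replaced with NaN first. A dependency-free sketch of the same idea, using a made-up helper `has_conflict`:

```python
ABSTAIN = -1

def has_conflict(votes):
    """True when the non-abstaining votes in one row contain more than one distinct label."""
    distinct = {v for v in votes if v != ABSTAIN}
    return len(distinct) > 1

rows = [
    [0, -1, 1, 1],     # MALE and FEMALE votes: conflict
    [-1, -1, 1, 1],    # only FEMALE votes: no conflict
    [-1, -1, -1, -1],  # every LF abstained: no conflict
]
flags = [has_conflict(r) for r in rows]
```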

In [44]:
with pd.option_context('display.max_columns', 999):
    display(
        pd.concat([
            multi_votes.loc[multi_votes_filter,:],
            features.loc[multi_votes_filter,:]
        ], axis=1).sample(10)
    )

Unnamed: 0,bought_male_item,bought_more_male_item,bought_female_item,bought_more_female_item,no_purchase,unisex_only,more_than_one_female_categories,more_than_one_male_categories,customer_id,is_newsletter_subscriber,cc_payments,paypal_payments,afterpay_payments,apple_payments,orders,items,cancels,returns,vouchers,female_items,male_items,unisex_items,wapp_items,wftw_items,mapp_items,wacc_items,macc_items,mftw_items,wspt_items,mspt_items,curvy_items,sacc_items,msite_orders,desktop_orders,android_orders,ios_orders,other_device_orders,work_orders,home_orders,parcelpoint_orders,other_collection_orders,redpen_discount_used,coupon_discount_applied,revenue,days_since_first_order,days_since_last_order,tenure_months,different_addresses,shipping_addresses,devices,average_discount_onoffer,average_discount_used
552,0.0,,1.0,1.0,,,1.0,,54bb6c91418bcb19a54e93b7c2e0c388,0.0,1.0,1.0,0.0,0.0,0.166667,0.181818,0.030303,0.015152,0.015152,0.166667,0.015152,0.0,0.136364,0.030303,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.030303,0.136364,0.0,0.0,0.0,0.0,0.106061,0.0,0.060606,4.267727,0.275455,15.118485,1990.0,32.0,66.0,0.0,1.0,2.0,0.2099,0.233768
16677,0.0,,1.0,,,,,,5f4b090d8d3a8c8e671ef05934f9e090,0.0,0.0,1.0,0.0,0.0,1.0,2.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,65.4,0.0,187.22,1744.0,1744.0,1.0,0.0,1.0,1.0,0.2,0.199951
15736,0.0,,1.0,1.0,,,1.0,0.0,aa7b6f06598ed7bb0562df4ef3c5439f,1.0,1.0,0.0,0.0,0.0,0.188679,0.603774,0.0,0.301887,0.018868,0.566038,0.037736,0.0,0.396226,0.113208,0.037736,0.056604,0.056604,0.0,0.0,0.0,0.0,0.0,0.0,0.188679,0.0,0.0,0.0,0.0,0.018868,0.0,0.169811,41.001509,12.860377,192.451321,1734.0,159.0,53.0,1.0,4.0,1.0,0.158,0.196413
10611,0.0,,1.0,1.0,,,1.0,,32f5726de3be7477bd32acf2b36a2289,0.0,0.0,1.0,0.0,0.0,0.118644,0.186441,0.0,0.0,0.050847,0.135593,0.033898,0.016949,0.050847,0.084746,0.0,0.0,0.0,0.033898,0.0,0.0,0.0,0.0,0.050847,0.067797,0.0,0.0,0.0,0.016949,0.050847,0.0,0.050847,3.927458,2.427966,29.855593,1783.0,23.0,59.0,0.0,4.0,2.0,0.0826,0.246961
44115,0.0,,1.0,,,,1.0,0.0,a71dd9d7406030938cab15a3ac91e748,1.0,0.0,1.0,0.0,0.0,2.0,2.0,0.0,0.5,0.5,1.0,1.0,0.0,0.0,0.5,0.0,0.5,0.5,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,74.975,233.615,1093.0,1043.0,2.0,0.0,1.0,1.0,0.0,0.229155
31277,0.0,,1.0,,,,,,2278a2b8d7f0f0e6be1b140144c31782,1.0,1.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,54.5,0.0,145.32,822.0,822.0,1.0,0.0,1.0,1.0,0.25,0.25
19910,,,1.0,1.0,,,1.0,0.0,74366a19d1cbf671c9112ce950d7a58e,1.0,1.0,1.0,0.0,0.0,0.211538,0.519231,0.0,0.288462,0.019231,0.5,0.0,0.019231,0.346154,0.019231,0.019231,0.096154,0.096154,0.0,0.038462,0.0,0.0,0.0,0.115385,0.096154,0.0,0.0,0.0,0.019231,0.153846,0.0,0.038462,34.709038,2.026923,109.466154,1567.0,13.0,52.0,0.0,3.0,3.0,0.1623,0.171541
16331,0.0,,1.0,1.0,,,1.0,,aca726e296b01d48a43a17848b14da13,0.0,1.0,0.0,0.0,0.0,1.0,25.0,0.0,0.0,1.0,17.0,8.0,0.0,11.0,6.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2744.5,4670.75,0.0,1750.0,1750.0,1.0,0.0,1.0,1.0,0.048,0.223903
1625,0.0,,1.0,1.0,,,,,a1c15058b288fdc9f24fad72f3b37d22,0.0,1.0,1.0,0.0,0.0,0.1,0.15,0.0,0.0,0.05,0.1,0.05,0.0,0.1,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.05,0.0,0.0,0.0,0.0,0.1,0.0,5.452,17.4,2031.0,1454.0,20.0,0.0,2.0,2.0,0.0,0.199995
33501,0.0,0.0,1.0,,,,1.0,0.0,224b0b9fa671b4cfddc356319e9b2647,1.0,1.0,0.0,0.0,0.0,0.285714,0.5,0.0,0.035714,0.035714,0.178571,0.285714,0.035714,0.035714,0.107143,0.107143,0.035714,0.035714,0.035714,0.0,0.0,0.0,0.0,0.0,0.285714,0.0,0.0,0.0,0.107143,0.071429,0.0,0.107143,19.794286,0.486429,45.072143,977.0,160.0,28.0,0.0,3.0,1.0,0.2046,0.214202


Here are some observations:
- customers who bought equal numbers of male and female items
- customers who bought unisex only but also bought men's footwear (is men's footwear unisex?)
- found some bugs in the LFs
- customers who bought both male and female items, but lean heavily towards one side

These give me a few more ideas to try.

# The third round of LFs and evaluation

In [138]:
@labeling_function()
def two_times_more_female_item(x):
    return (
        FEMALE 
        if x.female_items >= 2.0 * x.male_items
        else ABSTAIN
    )

@labeling_function()
def two_times_more_male_item(x):
    return (
        MALE 
        if x.male_items >= 2.0 * x.female_items
        else ABSTAIN
    )

@labeling_function()
def roughtly_equal_male_female_item(x):
    """My experience tells me this is still more likely a female customer :) """
    return (
        FEMALE 
        if x.female_items >= 0.9 * x.male_items
            and x.female_items <= 1.1 * x.male_items
        else ABSTAIN
    )

@labeling_function()
def equal_or_more_female_categories(x):
    return (
        FEMALE
        if np.sum([
                x.wapp_items > 0, 
                  x.wftw_items > 0, 
                  x.wacc_items > 0,
                  x.wspt_items > 0,
        ]) >= np.sum([x.mapp_items > 0, 
                  x.macc_items > 0, 
                  x.mftw_items > 0,
                  x.mspt_items > 0]
                 )
        else ABSTAIN
    )

@labeling_function()
def more_male_categories(x):
    return (
        MALE
        if np.sum([x.mapp_items > 0, 
                  x.macc_items > 0, 
                  x.mftw_items > 0,
                  x.mspt_items > 0]
                 ) > np.sum([
                x.wapp_items > 0, 
                  x.wftw_items > 0, 
                  x.wacc_items > 0,
                  x.wspt_items > 0,
        ])
        else ABSTAIN
    )

@labeling_function()
def lot_more_female_item(x):
    return (
        FEMALE 
        if x.female_items >= 3.0 * x.male_items
        else ABSTAIN
    )

@labeling_function()
def lot_more_male_item(x):
    return (
        MALE 
        if x.male_items >= 3.0 * x.female_items
        else ABSTAIN
    )


@labeling_function()
def ten_times_more_female_item(x):
    return (
        FEMALE 
        if x.female_items >= 10.0 * x.male_items
        else ABSTAIN
    )

@labeling_function()
def ten_times_more_male_item(x):
    return (
        MALE 
        if x.male_items >= 10.0 * x.female_items
        else ABSTAIN
    )

@labeling_function()
def no_male_categories(x):
    return (
        FEMALE
        if np.sum([x.mapp_items > 0, 
                  x.macc_items > 0, 
                  x.mftw_items > 0,
                  x.mspt_items > 0]
                 ) == 0
        else ABSTAIN
    )

@labeling_function()
def no_female_categories(x):
    return (
        MALE
        if np.sum([
                x.wapp_items > 0, 
                  x.wftw_items > 0, 
                  x.wacc_items > 0,
                  x.wspt_items > 0,
        ]) == 0
        else ABSTAIN
    )

In [139]:
labelling_functions = [bought_male_item, bought_more_male_item, 
                       bought_female_item, bought_more_female_item, no_purchase,
                      unisex_only, more_than_one_female_categories, more_than_one_male_categories,
                       two_times_more_female_item, two_times_more_male_item, roughtly_equal_male_female_item,
                       equal_or_more_female_categories, more_male_categories,
                       lot_more_female_item, lot_more_male_item,
                       ten_times_more_female_item, ten_times_more_male_item,
                       no_male_categories,no_female_categories
                      ]
applier = PandasLFApplier(lfs=labelling_functions)
L_train = applier.apply(df=features)

100%|███████████████████████████████| 46279/46279 [00:20<00:00, 2264.22it/s]


In [140]:
coverage_array = (pd.DataFrame(L_train) != ABSTAIN).mean(axis=0)

In [141]:
LFAnalysis(L=L_train, lfs=labelling_functions).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
bought_male_item,0,[0],0.371356,0.371356,0.202446
bought_more_male_item,1,[0],0.246354,0.246354,0.077443
bought_female_item,2,[1],0.745975,0.745975,0.186197
bought_more_female_item,3,[1],0.6896,0.6896,0.129821
no_purchase,4,[],0.0,0.0,0.0
unisex_only,5,[2],0.045507,0.045507,0.045507
more_than_one_female_categories,6,[1],0.319843,0.319843,0.10871
more_than_one_male_categories,7,[0],0.139113,0.139113,0.088204
two_times_more_female_item,8,[1],0.724411,0.724411,0.164632
two_times_more_male_item,9,[0],0.283347,0.283347,0.114436
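
One thing worth flagging in this table: `unisex_only` now conflicts on every row it covers (0.045507), up from 0.000324 in the second pass. That is consistent with the `>=` comparisons in the ratio-based LFs also being true when both counts are zero, so a customer with only unisex items gets FEMALE and MALE votes at the same time. The sketch below reproduces this with plain functions (no Snorkel decorator) and shows one possible guard. The `_guarded` variant is hypothetical and is not what produced the numbers above.

```python
from types import SimpleNamespace

MALE, FEMALE, ABSTAIN = 0, 1, -1

def two_times_more_female_item(x):
    # Same condition as the LF above, without the Snorkel decorator
    return FEMALE if x.female_items >= 2.0 * x.male_items else ABSTAIN

def two_times_more_male_item(x):
    # Same condition as the LF above, without the Snorkel decorator
    return MALE if x.male_items >= 2.0 * x.female_items else ABSTAIN

def two_times_more_female_item_guarded(x):
    # Hypothetical variant: only vote when at least one female item was bought
    return (
        FEMALE
        if x.female_items > 0 and x.female_items >= 2.0 * x.male_items
        else ABSTAIN
    )

# A customer who bought unisex items only: both gendered counts are zero
unisex_customer = SimpleNamespace(female_items=0.0, male_items=0.0, unisex_items=1.0)

votes = (
    two_times_more_female_item(unisex_customer),
    two_times_more_male_item(unisex_customer),
    two_times_more_female_item_guarded(unisex_customer),
)
```

Here the unguarded functions vote FEMALE and MALE for the same customer, while the guarded one abstains. Whether to add such a guard is a judgment call, since it changes coverage and therefore every number downstream.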


# Combine all LFs to generate a statistical labelling model
The LFs could get much more sophisticated than this, for example with expert input from SMEs. But I'm clearly no SME in this area, and thinking about it even more would probably give diminishing returns.

I think that's good enough given the time spent.

Time to generate labels!

In [142]:
L_train.shape

(46279, 19)

## First build a simple majority vote labelling model

In [143]:
majority_model = MajorityLabelVoter(cardinality=3)
preds = majority_model.predict(L=L_train)

In [144]:
label_dict = {0:'MALE', 1:'FEMALE', 2:'UNKNOWN', -1:'ABSTAIN'}
pd.Series(preds).replace(label_dict).value_counts()

FEMALE     34143
MALE       11765
ABSTAIN      371
dtype: int64

Not too bad, it covered the vast majority of the data.

Still, I'll check the examples where the vote came out as ABSTAIN.

In [145]:
with pd.option_context('display.max_columns', 999):
    display(
        features.loc[preds == -1]
    )

Unnamed: 0,customer_id,is_newsletter_subscriber,cc_payments,paypal_payments,afterpay_payments,apple_payments,orders,items,cancels,returns,vouchers,female_items,male_items,unisex_items,wapp_items,wftw_items,mapp_items,wacc_items,macc_items,mftw_items,wspt_items,mspt_items,curvy_items,sacc_items,msite_orders,desktop_orders,android_orders,ios_orders,other_device_orders,work_orders,home_orders,parcelpoint_orders,other_collection_orders,redpen_discount_used,coupon_discount_applied,revenue,days_since_first_order,days_since_last_order,tenure_months,different_addresses,shipping_addresses,devices,average_discount_onoffer,average_discount_used
165,86774735323382706c516590772d1972,1.0,1.0,0.0,0.0,0.0,0.285714,0.619048,0.0,0.000000,0.047619,0.238095,0.285714,0.095238,0.000000,0.285714,0.000000,0.047619,0.047619,0.285714,0.000000,0.00,0.0,0.00,0.047619,0.238095,0.0,0.00,0.0,0.0,0.000000,0.0,0.285714,42.518571,0.000000,115.843810,1972.0,1367.0,21.0,0.0,2.0,2.0,0.3465,0.346467
241,57fc4e2f27066a39532751686c43ec2b,1.0,1.0,0.0,0.0,0.0,2.500000,19.500000,0.0,0.000000,0.000000,6.000000,10.500000,3.000000,0.000000,6.000000,0.000000,0.000000,0.000000,7.500000,0.000000,0.00,0.0,2.00,0.000000,2.500000,0.0,0.00,0.0,0.0,0.000000,0.0,2.500000,0.000000,0.000000,3273.130000,2011.0,1951.0,2.0,0.0,1.0,1.0,0.0000,0.000000
282,8533224f598d39f203964439c2b75280,1.0,1.0,1.0,0.0,0.0,0.133333,0.150000,0.0,0.000000,0.033333,0.050000,0.100000,0.000000,0.016667,0.033333,0.000000,0.000000,0.000000,0.000000,0.000000,0.10,0.0,0.00,0.000000,0.133333,0.0,0.00,0.0,0.0,0.016667,0.0,0.116667,10.120000,1.798167,5.485667,2064.0,278.0,60.0,0.0,3.0,1.0,0.4415,0.510553
502,a1b1c39eda866cd14190e5b4b406e790,1.0,1.0,0.0,0.0,0.0,0.434783,0.826087,0.0,0.086957,0.043478,0.130435,0.565217,0.130435,0.086957,0.043478,0.000000,0.000000,0.000000,0.000000,0.000000,0.00,0.0,0.00,0.391304,0.043478,0.0,0.00,0.0,0.0,0.000000,0.0,0.434783,22.496957,0.000000,38.599130,2004.0,1320.0,23.0,0.0,4.0,2.0,0.3859,0.385893
742,4b2a2e97f60c8dbdd72a41da0ff2fe16,0.0,0.0,1.0,0.0,0.0,2.000000,5.000000,0.0,3.000000,0.000000,2.000000,3.000000,0.000000,1.000000,1.000000,1.000000,0.000000,0.000000,2.000000,0.000000,0.00,0.0,0.00,0.000000,2.000000,0.0,0.00,0.0,0.0,0.000000,0.0,2.000000,0.000000,0.000000,926.830000,2016.0,2016.0,1.0,0.0,1.0,1.0,0.0000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45729,3e818b0a1b2d0590eadbbeeede9f8409,1.0,0.0,1.0,0.0,0.0,1.000000,4.250000,0.0,0.500000,0.000000,2.750000,1.000000,0.500000,2.750000,0.000000,1.000000,0.000000,0.000000,0.500000,0.000000,0.00,0.0,0.00,0.500000,0.500000,0.0,0.00,0.0,0.0,1.000000,0.0,0.000000,626.160000,0.000000,1085.400000,261.0,162.0,4.0,0.0,1.0,2.0,0.3523,0.352329
45781,22268bb02095239eeca22b4248ec6eb1,1.0,1.0,1.0,0.0,0.0,0.550000,1.200000,0.0,0.200000,0.000000,0.650000,0.450000,0.100000,0.350000,0.300000,0.050000,0.000000,0.000000,0.150000,0.000000,0.05,0.0,0.05,0.100000,0.000000,0.0,0.45,0.0,0.0,0.550000,0.0,0.000000,16.391000,0.000000,121.933500,787.0,197.0,20.0,0.0,1.0,1.0,0.0984,0.098399
45840,f42fc634ab794767015fdd183546df63,0.0,1.0,0.0,0.0,0.0,0.300000,0.400000,0.0,0.000000,0.100000,0.200000,0.100000,0.100000,0.000000,0.000000,0.000000,0.200000,0.200000,0.000000,0.000000,0.10,0.0,0.00,0.000000,0.300000,0.0,0.00,0.0,0.0,0.300000,0.0,0.000000,0.000000,0.817000,19.161000,659.0,378.0,10.0,0.0,1.0,1.0,0.0000,0.024985
46089,1a8a061c17524ea5e2933bff04805592,0.0,0.0,1.0,0.0,0.0,1.000000,1.500000,0.0,0.000000,0.500000,0.500000,1.000000,0.000000,0.000000,0.500000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00,0.0,0.00,0.000000,1.000000,0.0,0.00,0.0,0.0,0.000000,0.0,1.000000,31.800000,9.090000,77.170000,1879.0,1837.0,2.0,0.0,1.0,1.0,0.1000,0.233557


In [146]:
L_train[14,:]

array([ 0, -1,  1,  1, -1, -1,  1,  0, -1, -1, -1,  1, -1, -1, -1, -1, -1,
       -1, -1])

In [147]:
L_train[32,:]

array([ 0, -1,  1, -1, -1, -1, -1, -1, -1, -1,  1,  1, -1, -1, -1, -1, -1,
       -1, -1])

Looks like we got some ties in the voting. This is fine: a more sophisticated labelling model should resolve the ties by weighting its confidence in each LF.
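
To make the tie behaviour concrete, here is a minimal sketch of a majority vote that abstains on ties (which, as far as I know, is the default tie-break policy of `MajorityLabelVoter`) next to a toy weighted vote that breaks the tie. The weights are made up, and the weighted vote only illustrates the idea. It is not how `LabelModel` actually estimates or combines LF accuracies.

```python
from collections import Counter

MALE, FEMALE, UNKNOWN, ABSTAIN = 0, 1, 2, -1

def majority_vote(votes):
    """Most common non-abstain label; ABSTAIN when nobody votes or the top labels tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

def weighted_vote(votes, weights):
    """Sum a per-LF weight for each label and pick the label with the largest total."""
    totals = {}
    for vote, weight in zip(votes, weights):
        if vote != ABSTAIN:
            totals[vote] = totals.get(vote, 0.0) + weight
    if not totals:
        return ABSTAIN
    return max(totals, key=totals.get)

tied_row = [MALE, ABSTAIN, FEMALE, ABSTAIN]
hypothetical_weights = [0.6, 0.9, 0.8, 0.7]  # made-up trust in each LF

tie_result = majority_vote(tied_row)
weighted_result = weighted_vote(tied_row, hypothetical_weights)
```

With one MALE and one FEMALE vote the plain majority abstains, while the weighted vote goes with whichever LF is trusted more.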

## Build a probabilistic or confidence-weighted labelling model

In [148]:
label_model = LabelModel(cardinality=3)
label_model.fit(L_train=L_train, n_epochs=1000, log_freq=100, seed=273)

INFO:root:Computing O...
INFO:root:Estimating \mu...
  0%|                                           | 0/1000 [00:00<?, ?epoch/s]INFO:root:[0 epochs]: TRAIN:[loss=23.521]
INFO:root:[100 epochs]: TRAIN:[loss=0.121]
 16%|████▊                          | 156/1000 [00:00<00:00, 1557.06epoch/s]INFO:root:[200 epochs]: TRAIN:[loss=0.032]
INFO:root:[300 epochs]: TRAIN:[loss=0.031]
 31%|█████████▋                     | 312/1000 [00:00<00:00, 1512.39epoch/s]INFO:root:[400 epochs]: TRAIN:[loss=0.031]
 47%|██████████████▍                | 467/1000 [00:00<00:00, 1524.66epoch/s]INFO:root:[500 epochs]: TRAIN:[loss=0.031]
INFO:root:[600 epochs]: TRAIN:[loss=0.031]
 62%|███████████████████▏           | 620/1000 [00:00<00:00, 1520.03epoch/s]INFO:root:[700 epochs]: TRAIN:[loss=0.031]
 78%|████████████████████████▏      | 781/1000 [00:00<00:00, 1550.81epoch/s]INFO:root:[800 epochs]: TRAIN:[loss=0.031]
INFO:root:[900 epochs]: TRAIN:[loss=0.031]
100%|██████████████████████████████| 1000/1000 [00:00<00:00, 1

The loss stopped decreasing after roughly 300 epochs (it stays at 0.031), so 1000 epochs is enough.

In [149]:
adv_preds = label_model.predict(L_train)

In [150]:
pd.Series(adv_preds).replace(label_dict).value_counts()

FEMALE     25906
MALE       13000
UNKNOWN     7373
dtype: int64

Very good! Still more FEMALE than MALE, with some UNKNOWN as well, and no ABSTAIN left.

## Examine disagreement between voting and probabilistic model

In [151]:
diff_preds = (preds != adv_preds)

In [152]:
pd.Series(diff_preds).value_counts()

False    36695
True      9584
dtype: int64

In [153]:
pd.concat([
            pd.Series(preds).rename('majority_model'), 
            pd.Series(adv_preds).rename('proba_model'),
        ], axis=1)\
[lambda df: df.majority_model != df.proba_model]\
.groupby(['majority_model','proba_model']).size()

majority_model  proba_model
-1              0                70
                2               301
 0              2               976
 1              0              2141
                2              6096
dtype: int64

In [155]:
with pd.option_context('display.max_columns', 999):
    display(
        pd.concat([
            pd.Series(preds).rename('majority_model'), 
            pd.Series(adv_preds).rename('proba_model'),
            features
        ], axis=1).loc[diff_preds].sample(10)
    )

Unnamed: 0,majority_model,proba_model,customer_id,is_newsletter_subscriber,cc_payments,paypal_payments,afterpay_payments,apple_payments,orders,items,cancels,returns,vouchers,female_items,male_items,unisex_items,wapp_items,wftw_items,mapp_items,wacc_items,macc_items,mftw_items,wspt_items,mspt_items,curvy_items,sacc_items,msite_orders,desktop_orders,android_orders,ios_orders,other_device_orders,work_orders,home_orders,parcelpoint_orders,other_collection_orders,redpen_discount_used,coupon_discount_applied,revenue,days_since_first_order,days_since_last_order,tenure_months,different_addresses,shipping_addresses,devices,average_discount_onoffer,average_discount_used
43511,1,2,0f90b3f4737a18d4a8e9973a73f5236d,0.0,1.0,1.0,0.0,0.0,1.37931,4.172414,0.034483,0.344828,0.103448,3.896552,0.241379,0.034483,2.103448,0.689655,0.103448,0.206897,0.206897,0.103448,0.862069,0.0,0.0,0.034483,0.275862,1.103448,0.0,0.0,0.0,0.862069,0.034483,0.0,0.482759,134.516552,26.441034,856.001034,1230.0,382.0,29.0,1.0,4.0,3.0,0.111,0.127163
41895,1,0,a6cc94afb07c103608db0223072ea83b,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,90.91,44.0,44.0,1.0,0.0,1.0,1.0,0.0,0.0
30352,0,2,e3099477b36b3a7fa192e8597c83647f,1.0,1.0,1.0,0.0,0.0,0.153846,0.25641,0.0,0.0,0.0,0.076923,0.128205,0.051282,0.0,0.076923,0.076923,0.025641,0.025641,0.051282,0.0,0.0,0.0,0.0,0.0,0.153846,0.0,0.0,0.0,0.0,0.128205,0.0,0.025641,1.67641,0.0,37.435128,1594.0,448.0,39.0,0.0,3.0,1.0,0.0458,0.045819
4984,1,2,0ec630f389e952ffee4e154732754926,1.0,1.0,1.0,0.0,0.0,0.645161,1.903226,0.0,0.032258,0.129032,1.483871,0.354839,0.064516,1.193548,0.064516,0.322581,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.645161,0.0,0.0,0.0,0.0,0.0,0.0,0.645161,287.695161,8.476452,329.359677,1832.0,924.0,31.0,0.0,1.0,1.0,0.448,0.465569
28078,1,2,170d72192790ac0cdb46b7fb1c447910,1.0,1.0,1.0,0.0,0.0,0.347826,0.934783,0.0,0.043478,0.021739,0.652174,0.26087,0.021739,0.478261,0.130435,0.152174,0.043478,0.043478,0.043478,0.0,0.021739,0.0,0.0,0.086957,0.23913,0.021739,0.0,0.0,0.26087,0.0,0.0,0.086957,25.953913,4.52087,162.001087,1646.0,274.0,46.0,0.0,3.0,2.0,0.1117,0.129399
8572,1,2,a96f87b1e108449aa9d04462aebf995b,1.0,1.0,1.0,0.0,0.0,0.089286,0.089286,0.0,0.035714,0.017857,0.035714,0.017857,0.035714,0.0,0.017857,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.053571,0.035714,0.0,0.0,0.0,0.0,0.035714,0.0,0.053571,3.652143,0.243393,6.002857,1765.0,114.0,56.0,0.0,2.0,2.0,0.3,0.317687
19935,1,2,de47e9045e3a063a94b84707bd1511dd,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,22.7,0.0,22.71,1718.0,1718.0,1.0,0.0,1.0,1.0,0.5,0.5
10864,1,2,4515659f1e48f41de3c26ed170220231,1.0,0.0,1.0,0.0,0.0,0.090909,0.136364,0.0,0.022727,0.022727,0.113636,0.022727,0.0,0.090909,0.0,0.0,0.022727,0.022727,0.0,0.0,0.0,0.0,0.0,0.022727,0.068182,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,6.503182,1.95,8.765909,1769.0,461.0,44.0,0.0,1.0,2.0,0.2916,0.379063
21380,1,2,36b27ae2b25a6054698699d69b0659b0,0.0,0.0,1.0,0.0,0.0,1.0,4.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,246.8,0.0,297.84,1715.0,1715.0,1.0,0.0,1.0,1.0,0.4498,0.449805
33846,1,0,90fe3ca9ec5eba08a58cc00037e8bbbd,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,33.63,0.0,39.05,1960.0,1960.0,1.0,0.0,1.0,1.0,0.4627,0.462713


# Save labels and final thoughts
When the two models disagree, I found it hard to tell which model is better. Some of the examples seemed very interesting. If this were a real project with more time, it would be worth looking into further. For now I'll naively trust the probabilistic model.

Normally the labelling models should be tested against a hold-out ground truth set. But in this special scenario, there's no ground truth at all. Any label I came up with would not be much different from the LFs I wrote, biased in my personal way. So I'll skip ahead and save these labels for the next step.
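
If a small hand-labelled hold-out set did exist, the check could look roughly like the sketch below. Everything in it is hypothetical: the helper `score_against_gold`, the label lists, and the choice of reporting coverage plus accuracy on the rows that received a label.

```python
ABSTAIN = -1

def score_against_gold(predicted, gold):
    """Return (coverage, accuracy on covered rows) for predictions vs. hand labels."""
    covered = [(p, g) for p, g in zip(predicted, gold) if p != ABSTAIN]
    coverage = len(covered) / len(gold)
    if not covered:
        return coverage, None
    accuracy = sum(p == g for p, g in covered) / len(covered)
    return coverage, accuracy

# Entirely made-up values, only to show the call
hypothetical_preds = [1, 1, 0, -1, 2]
hypothetical_gold = [1, 0, 0, 1, 2]
coverage, accuracy = score_against_gold(hypothetical_preds, hypothetical_gold)
```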

In [158]:
training_set = pd.concat([
            pd.Series(adv_preds).rename('gender').replace(label_dict),
            features
        ], axis=1)

training_set

Unnamed: 0,gender,customer_id,is_newsletter_subscriber,cc_payments,paypal_payments,afterpay_payments,apple_payments,orders,items,cancels,...,coupon_discount_applied,revenue,days_since_first_order,days_since_last_order,tenure_months,different_addresses,shipping_addresses,devices,average_discount_onoffer,average_discount_used
0,UNKNOWN,64f7d7dd7a59bba7168cc9c960a5c60e,0.0,1.0,0.0,0.0,0.0,0.354167,1.041667,0.000000,...,5.180208,144.715417,2091.0,653.0,48.0,0.0,4.0,1.0,0.3364,0.358448
1,FEMALE,fa7c64efd5c037ff2abcce571f9c1712,1.0,0.0,1.0,0.0,0.0,0.188406,0.376812,0.000000,...,0.000000,77.235942,2082.0,22.0,69.0,0.0,4.0,2.0,0.1404,0.140410
2,UNKNOWN,18923c9361f27583d2320951435e4888,1.0,1.0,0.0,1.0,0.0,1.028986,2.202899,0.028986,...,1.564058,204.838696,2072.0,6.0,69.0,1.0,6.0,2.0,0.1851,0.189973
3,MALE,aa21f31def4edbdcead818afcdfc4d32,1.0,1.0,0.0,0.0,0.0,2.000000,2.000000,0.000000,...,90.900000,143.640000,2054.0,2050.0,1.0,0.0,1.0,1.0,0.0000,0.387567
4,FEMALE,668c6aac52ff54d4828ad379cdb38e7d,1.0,1.0,0.0,0.0,0.0,1.000000,1.000000,0.000000,...,0.000000,0.000000,2053.0,2053.0,1.0,0.0,1.0,1.0,0.0000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46274,UNKNOWN,5b34391ec6fbc0f189cb8d3d88806199,0.0,1.0,1.0,0.0,0.0,0.400000,0.888889,0.000000,...,39.807333,84.952000,1372.0,50.0,45.0,0.0,7.0,2.0,0.0091,0.352567
46275,UNKNOWN,198fd2f143f70b149344bcaf7eddee12,1.0,1.0,1.0,0.0,0.0,1.055556,1.055556,0.055556,...,13.367778,76.871111,646.0,124.0,18.0,1.0,2.0,2.0,0.1210,0.209202
46276,FEMALE,338b5c8ade4af1a562d55d4036710630,0.0,1.0,0.0,0.0,0.0,0.181818,0.181818,0.000000,...,0.000000,47.437273,1308.0,998.0,11.0,1.0,2.0,1.0,0.1500,0.150000
46277,FEMALE,2115c065bfc1f3b39e4c87c202e80fa5,1.0,1.0,0.0,0.0,0.0,2.800000,3.000000,0.000000,...,50.990000,142.458000,1410.0,1287.0,5.0,0.0,1.0,2.0,0.1824,0.320760


In [159]:
training_set_path = "../data/processed/training_set.parquet"

training_set.to_parquet(training_set_path)