In [1]:
%load_ext autoreload
%autoreload 2

# Tutorial: How to train a classifier using Weak Supervision?

##### by Anastasiia Sedova (GitHub: @anasedova, Email: anastasiia.sedova@univie.ac.at)

In this tutorial, we are going to train a spam detection classifier using weakly supervised data. 

The steps:
- Collect training data
- Annotate this data in a weakly supervised setting
    - Create labeling functions
    - *Match* the labeling functions to the data samples
    - Aggregate the labels with different label aggregation techniques
        - Majority Vote
        - FABLE 
- Train a logistic regression classifier using weak labels
- Train a logistic regression classifier with SepLL

In [None]:
# necessary imports
import sys

sys.path.append("..")

import logging
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)


from wrench.utils import set_seed
from wrench.endmodel import EndClassifierModel
from wrench._logging import LoggingHandler


from snorkel.utils import probs_to_preds
from utils import load_raw_spam_dataset


#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])

logger = logging.getLogger(__name__)

In [None]:
# the path to the folder where our data is stored

path_to_data = "data"

## Data

The dataset we will use for training is the YouTube comments spam detection dataset [3].

- The dataset consists of comments that YouTube users left under different videos.
- Each sample is a comment (i.e., a word, a sentence, or a couple of sentences).
- 1,586 train samples, 120 dev samples, 250 test samples
- There are 2 types of samples:
    - HAM: comments relevant to the video (even very simple ones), or
    - SPAM: irrelevant (often trying to advertise something) or inappropriate messages
    
<img src="../img/spam_detection.png" width="800"/>

**NB! The original dataset is manually labeled, but we won't use these gold labels for model training! We will treat the dataset as unlabeled (and label it in a weakly supervised fashion).** 

Let's first have a look at the dataset.

In [None]:
# load the YouTube dataset

df_train, df_dev, df_test = load_raw_spam_dataset(load_train_labels=True)
# Y_train = df_train["label"].values
# Y_test = df_test["label"].values

In [None]:
df_train[:10]

For each data sample in the original dataset (i.e., a YouTube comment), we know:
- comment's author,
- date when the corresponding comment was left,
- text of the sample,
- gold manual label,
- id of the YouTube video.

In [None]:
# some examples of positive (=non-spam) samples, label id 0

df_train.loc[df_train["label"]==0][:10]

In [None]:
# some examples of negative (=spam) samples, label id 1

df_train.loc[df_train["label"]==1][:10]

In [None]:
df_train[["text", "label"]][:20]

Now it is time to start weak supervision! So, let's imagine the gold labels disappeared... 

<img src="../img/poof.jpg" width="300"/>

... and here we are: there is some data we want to use for classifier training, but we have neither labels nor the capacity/time/money/... to hire annotators.

But we can label this data with **weak supervision** :)

<img src="../img/rainbow.png" width="500"/>

# Weak Supervision

A brief reminder of how weak supervision works:
1. We come up with some heuristic rules and transform these rules into labeling functions.
2. We apply these labeling functions to the data and obtain weak labels.
3. We use these weak labels to train a classifier. 

Let's have a closer look at the training samples we have:

In [None]:
list(df_train.text[100:120])

## Task: formulate the rules that could annotate the training samples

The questions that might help you: 

*What patterns are typical for spam YouTube comments? for non-spam comments?*

*What rules might help to distinguish between spam and not-spam YouTube comments?*

*What labeling functions do you think are productive and useful to annotate the YouTube comments?*

Rules: 

1. Keywords: subscribe, click, Nigerian prince, check out, channel, single mom
2. Pattern: regex-links
3. ...
4. ...
5. ...
6. ...
7. ...
8. ...
9. ...
10. ...



### What can be a rule?

- Keyword searches: looking for specific words in a sentence
- Pattern matching: looking for specific syntactical patterns
- Third-party models: using a pre-trained model (usually a model for a different task than the one at hand)
- ...
- Crowdworker labels: treating each crowdworker as a black-box function that assigns labels to subsets of the data
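
As a minimal sketch of the first two rule types, a keyword search and a pattern match can be written as plain Python predicates. The keyword list and the link regex below are illustrative assumptions, not the rules used later in this tutorial.

```python
import re

# Illustrative keyword list (an assumption, not the tutorial's final rule set)
SPAM_KEYWORDS = ("subscribe", "check out", "my channel")

# Simple pattern for links: "http://", "https://" or "www." followed by non-space characters
LINK_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def contains_spam_keyword(text):
    """Keyword search: True if any keyword occurs in the lower-cased text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in SPAM_KEYWORDS)


def contains_link(text):
    """Pattern matching: True if the text contains something that looks like a URL."""
    return LINK_PATTERN.search(text) is not None


print(contains_spam_keyword("Please SUBSCRIBE to my channel"))
print(contains_link("watch this www.example.com/video"))
print(contains_link("great song!"))
```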

### Rules into labeling functions

Once we have collected some rules, we transform them into labeling functions that *label* a data sample, that is, either assign it to one of the classes or abstain. 

In [None]:
# an example of LF based on a keyword "check out"

def check_out(x):
    return 1 if "check out" in x.text.lower() else -1

# meaning the sample will be assigned to class 1 (=SPAM) if there is a "check out" expression in the comment;
# otherwise the labeling function abstains (returns -1)

In [None]:
# an example of LF based on a keyword "please"

def check(x):
    return 1 if "please" in x.text.lower() else -1

# meaning the sample will be assigned to class 1 (=SPAM) if there is a "please" expression in the comment;
# otherwise the labeling function abstains (returns -1)

In [None]:
from snorkel.labeling import PandasLFApplier, labeling_function

@labeling_function()
def check_out(x):
    return 1 if "check out" in x.text.lower() else -1

@labeling_function()
def check(x):
    return 1 if "please" in x.text.lower() else -1

lfs = [check_out, check]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_train

In [None]:
df_train[12:13]

### Labeling functions we are going to use

In this tutorial, we are going to use the labeling functions created by the [Snorkel team](https://github.com/snorkel-team/snorkel-tutorials/blob/master/spam/01_spam_tutorial.ipynb), which are: 


1. keyword **"my"** (to detect spam comments like "my channel", "my video", etc)
2. keyword **"subscribe"** (to detect spam comments that ask users to subscribe to some channel)
3. keyword **"http"** (to detect spam comments that link to other channels)
4. keyword **"please"/"plz"** (to detect spam comments that make requests rather than commenting)
5. keyword **"song"** (to detect non-spam comments that actually talk about the video's content)
6. regex **"check_out"** (to detect spam comments like "check out this channel", etc)
7. **short comment** (non-spam comments are often short, such as 'cool video!')
8. **short comments mentioning specific people** (using the spaCy library; non-spam comments usually mention some people)
9. **polarity** (using TextBlob library; if polarity > 0.9, it is most probably a non-spam message)
10. **subjectivity** (using TextBlob library; if subjectivity >= 0.5, it is most probably a non-spam message)
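
To make the list above more concrete, here is a hedged sketch of how three of these labeling functions could look in plain Python. The constants `SPAM`, `HAM`, and `ABSTAIN`, the five-word threshold, and the regex are assumptions made for illustration; the original Snorkel implementations may differ in detail.

```python
import re
from types import SimpleNamespace

ABSTAIN, HAM, SPAM = -1, 0, 1


def keyword_subscribe(x):
    """LF 2: vote SPAM if the comment asks to subscribe, otherwise abstain."""
    return SPAM if "subscribe" in x.text.lower() else ABSTAIN


def regex_check_out(x):
    """LF 6: vote SPAM for phrases such as 'check out' or 'check this out'."""
    return SPAM if re.search(r"check.*out", x.text, flags=re.IGNORECASE) else ABSTAIN


def short_comment(x, max_words=5):
    """LF 7: vote HAM for very short comments (the threshold is an assumption)."""
    return HAM if len(x.text.split()) < max_words else ABSTAIN


# a stand-in for one row of the dataframe: any object with a `.text` attribute works
comment = SimpleNamespace(text="cool video!")
print([lf(comment) for lf in (keyword_subscribe, regex_check_out, short_comment)])
```

Note that every function either votes for exactly one class or abstains; none of them votes for the opposite class when its rule does not match.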

(You will hear more about the labeling process from my colleagues later).

### Processed data

The resulting annotations can be saved in the following format: 

In [None]:
import json
with open("data/youtube/train.json") as train_file:
    train_data = json.load(train_file)
train_data["1"]

The structure of the processed data is the following: 
- data.text: the text of the sample
- label: gold label obtained by manual annotation
- weak_labels: the results of annotation by labeling functions. 
    - -1: the corresponding labeling function did not match
    - 0: the labeling function matched and assigned this sample to class 0 (non-spam class in our case)
    - 1: the labeling function matched and assigned this sample to class 1 (spam class in our case)

So, for the sample #1:
(*if your like drones, plz subscribe to Kamal Tayara. He takes videos with  his drone that are absolutely beautiful.\ufeff*)

- labeling functions 1, 3, 5, 6, 7, 8, 9 did not match
- labeling functions 2 (the keyword *subscribe*) & 4 (the keyword *plz*) matched and assigned this sample to class 1
- labeling function 10 (subjectivity score >= 0.5) matched and assigned this sample to class 0
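
The reading of sample #1 above can be reproduced with a few lines of Python. The vector below is written out by hand from the description above (labeling functions numbered from 1), so treat it as an illustration rather than a value loaded from the file.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

# Weak labels of sample #1, one entry per labeling function (LF 1 ... LF 10)
weak_labels = [-1, 1, -1, 1, -1, -1, -1, -1, -1, 0]

# 1-based indices of the labeling functions, grouped by their vote
matched_spam = [i for i, vote in enumerate(weak_labels, start=1) if vote == SPAM]
matched_ham = [i for i, vote in enumerate(weak_labels, start=1) if vote == HAM]
abstained = [i for i, vote in enumerate(weak_labels, start=1) if vote == ABSTAIN]

# number of votes per class, ignoring abstentions
vote_counts = Counter(vote for vote in weak_labels if vote != ABSTAIN)

print("SPAM votes from LFs:", matched_spam)
print("HAM votes from LFs:", matched_ham)
print("Abstaining LFs:", abstained)
print("Votes per class:", dict(vote_counts))
```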

**Next step: how to turn these annotations into weak labels to train a classifier with them?**

## Weak labels

There are different *label models* that calculate the weak labels based on labeling functions annotations. In this tutorial, we are going to try two of them: 

- **Majority Vote** (intuitive and straightforward)
- **FABLE** [1] (most recent and well-performing)

For label calculation and model training we will use a weak supervision framework called [Wrench](https://github.com/JieyuZ2/wrench) [2].

### Wrench dataset

First, we transform our data into a Wrench-specific dataset.

We can encode the data with TF-IDF features... 

In [None]:
# TF-IDF features

from wrench.dataset import load_dataset

train_data_tfidf, valid_data_tfidf, test_data_tfidf = load_dataset(
    path_to_data,     # path to the folder where the dataset is stored
    "youtube",         # name of the dataset
    extract_feature=True,      # we want to encode our data ...
    extract_fn='tfidf'        # ... with TF-IDF features (other predefined options are 'sentence_transformer', 'bert')
)
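
For intuition about what TF-IDF features contain, here is a small pure-Python sketch. It uses one common variant of the formula (term frequency times `log(N / df)`) and whitespace tokenization; both are assumptions for illustration and not necessarily what Wrench does internally.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return the vocabulary and one TF-IDF vector (list of floats) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = max(len(tokens), 1)  # avoid division by zero for empty documents
        vector = [
            (counts[term] / length) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ]
        vectors.append(vector)
    return vocabulary, vectors


comments = ["subscribe to my channel", "love this song", "check out my channel"]
vocabulary, vectors = tfidf_vectors(comments)
for term, weight in zip(vocabulary, vectors[0]):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in fewer comments (such as *subscribe*) receive a higher weight than terms shared by several comments (such as *my*).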

... or with BERT features.

In [None]:
# BERT features

train_data, valid_data, test_data = load_dataset(
    path_to_data,       # path to the folder where the dataset is stored
    "youtube",    # name of the dataset
    extract_feature=True,      # we want to encode our data ...
    extract_fn='bert',        # ... with BERT embeddings
    model_name='bert-base-cased',      # the name of the BERT model
    cache_name='bert'     # load it from cache if there are cached files 
)

Let's have a look at what's inside. 

In [None]:
# the format of train_data, valid_data, and test_data is now: wrench.dataset.dataset.TextDataset

train_data

In [None]:
# how many classes are there in the dataset?

train_data.n_class

In [None]:
# how many labeling functions are there in the dataset?

train_data.n_lf

In [None]:
# what is the class_id to class correspondence?

train_data.id2label

In [None]:
# what do the samples look like?

train_data.examples[:10]

In [None]:
# what do the encoded samples look like?

print(type(train_data.features))
train_data.features[:10]

In [None]:
# what are the weak annotations produced by labeling functions?

train_data.weak_labels[3]

### Majority Vote

The simplest and most straightforward method to calculate labels from the noisy annotations is **majority voting** - a decision-making method where the option with the most votes is chosen. It's like asking a group of people to pick a movie, and the one that gets the most hands raised wins. 

In our case, each labeling function produces a *vote*; the most voted class is selected as a sample label. All ties are broken randomly.


# Task: write your own majority vote function
- Input: the weak annotations produced by labeling functions (stored in the `weak_labels` field of Wrench dataset objects)
- Output: labels

Before you start programming, think about possible edge cases: 
- what if several classes obtain an equal number of votes for a sample?
- what if there are no votes for a sample?

In [None]:
# todo
import numpy as np


def majority_vote(weak_annotations):
    #print(weak_annotations)
    labels = []
    # calculate labels with majority vote 
    # output should be a numpy array of size (number of training samples) x 1
    return np.array(labels)

labels_mv = majority_vote(train_data.weak_labels)
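
One possible solution to the task, as a sketch: count the non-abstaining votes per class, break ties randomly, and fall back to a random class when no labeling function matched. The fallback for uncovered samples and the `seed` parameter are design assumptions; returning -1 for such samples and filtering them out before training would be an equally valid choice. The function is named `majority_vote_sketch` so that it does not overwrite your own solution, and it returns a plain list that can be wrapped in `np.array(...)`.

```python
import random
from collections import Counter

ABSTAIN = -1


def majority_vote_sketch(weak_annotations, n_class=2, seed=42):
    """Aggregate weak annotations (one row of LF votes per sample) into labels."""
    rng = random.Random(seed)
    labels = []
    for votes in weak_annotations:
        # count the votes per class, ignoring abstentions
        counts = Counter(int(v) for v in votes if int(v) != ABSTAIN)
        if not counts:
            # no labeling function matched: pick a class at random (assumption)
            labels.append(rng.randrange(n_class))
            continue
        top = max(counts.values())
        # all classes sharing the highest vote count (more than one means a tie)
        winners = sorted(c for c, n in counts.items() if n == top)
        labels.append(rng.choice(winners))
    return labels


toy_annotations = [
    [-1, 1, -1, 1, -1, -1, -1, -1, -1, 0],   # two SPAM votes, one HAM vote
    [0, -1, -1, -1, 0, -1, -1, -1, -1, -1],  # two HAM votes
    [1, -1, -1, -1, 0, -1, -1, -1, -1, -1],  # tie, broken randomly
]
print(majority_vote_sketch(toy_annotations))
```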

A ready-made solution for aggregating the weak labels with majority vote is already included in the Wrench framework: the `MajorityVoting` label model. 

In [None]:
# initialize and fit the majority vote label model from the Wrench framework

from wrench.labelmodel import MajorityVoting

label_model = MajorityVoting()
label_model.fit(dataset_train=train_data, dataset_valid=valid_data)

In [None]:
# calculate weak labels 

soft_label_mv = label_model.predict_proba(train_data)    # soft label as probabilities across all classes
hard_label_mv = probs_to_preds(soft_label_mv)               # hard labels as the most probable classes 
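
The difference between soft and hard labels can be illustrated without any library. In this sketch, soft labels are the normalized vote counts per class and hard labels are the index of the largest probability; the uniform distribution for uncovered samples and the first-maximum tie handling are simplifying assumptions and may differ from what `MajorityVoting` and `probs_to_preds` do.

```python
def soft_labels_from_votes(votes, n_class=2, abstain=-1):
    """Turn one row of LF votes into a probability distribution over classes."""
    counts = [0] * n_class
    for vote in votes:
        if vote != abstain:
            counts[vote] += 1
    total = sum(counts)
    if total == 0:
        # no votes at all: assume a uniform distribution
        return [1.0 / n_class] * n_class
    return [count / total for count in counts]


def hard_label(probs):
    """Index of the most probable class (the first maximum wins on ties)."""
    return max(range(len(probs)), key=lambda i: probs[i])


votes = [-1, 1, -1, 1, -1, -1, -1, -1, -1, 0]
probs = soft_labels_from_votes(votes)
print(probs, hard_label(probs))
```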

In [None]:
hard_label_mv.shape

Let's look at the first 10 sentences, their weak annotations, and the weak labels obtained with majority voting. 

In [None]:
train_data.examples[:10]

In [None]:
train_data.weak_labels[:10]

In [None]:
soft_label_mv[:10]

In [None]:
hard_label_mv[:10]

### FABLE 

FABLE [1] is a label model that infers the noisy labels not only from the labeling functions' votes, but also from the instance features. 

In [None]:
# initialize and apply the fable model
from wrench.labelmodel import Fable

label_model = Fable(kernel_function=None, num_groups=10)
_ = label_model.fit(dataset_train=train_data, dataset_valid=valid_data)

In [None]:
# calculate labels
soft_label_fable = label_model.predict_proba(train_data)
hard_label_fable = probs_to_preds(soft_label_fable)

In [None]:
soft_label_fable[:10]

In [None]:
hard_label_fable[:10]

## Classifier training

In [None]:
batch_size = 32
test_batch_size = 32
lr = 0.01

Train a classifier with majority vote hard labels.

In [None]:
set_seed(42)

# initialize a classifier
model = EndClassifierModel(
    batch_size=batch_size, test_batch_size=test_batch_size
)

# fit it on the training data + majority vote hard labels
model.fit(
    dataset_train=train_data, 
    y_train=hard_label_mv, 
    dataset_valid=valid_data, 
    verbose=False
)

# test on the test set
model.test(dataset=test_data, metric_fn="acc")

Train a classifier with FABLE hard labels.

In [None]:
set_seed(42)

# initialize a classifier
model = EndClassifierModel(
    batch_size=batch_size, test_batch_size=test_batch_size
)

# fit it 
model.fit(
    dataset_train=train_data, 
    y_train=hard_label_fable, 
    dataset_valid=valid_data,
    verbose=False
)

# test on the test set
model.test(dataset=test_data, metric_fn="acc")

## End-2-End training with SepLL

In the following, we use a state-of-the-art method called SepLL [4] to train a classifier with weak labels. During training, the LF matches are the only training signal; predictions are later made from a latent state.

In [None]:
from wrench.classification.sepll import SepLL

set_seed(42)

bert_model_name = 'roberta-base'

#### Initialize SepLL
model = SepLL(
    batch_size=batch_size,
    test_batch_size=test_batch_size,
    backbone='MLP',
    backbone_model_name=bert_model_name,
    # 
    # SepLL specific
    add_unlabeled=False,
    class_noise=0.0,
    lf_l2_regularization=0.05,
)


model.fit(
    dataset_train=train_data,
    dataset_valid=valid_data,
    verbose=True
)

acc = model.test(test_data, 'acc')

logger.info(f'SepLL test acc: {acc}')

### GPU training

In case your environment has a GPU available, it is also possible to make use of the full strength of SepLL. 

In [None]:
from wrench.classification.sepll import SepLL

set_seed(42)

batch_size=16
bert_model_name = 'roberta-base'

#### Initialize SepLL
model = SepLL(
    batch_size=batch_size,
    real_batch_size=batch_size,
    test_batch_size=test_batch_size,
    # BERT specific parameters
    backbone='BERT',
    backbone_model_name=bert_model_name,
    optimizer='Adam',
    optimizer_lr=5e-5,
    optimizer_weight_decay=0.0,
    
    # SepLL specific
    add_unlabeled=False,
    class_noise=0.0,
    lf_l2_regularization=0.5,
)


model.fit(
    dataset_train=train_data,
    dataset_valid=valid_data,
    metric='acc',
    verbose=True
)

In [None]:
acc = model.test(test_data, 'acc')

logger.info(f'SepLL test acc: {acc}')

# References

1. Zhang et al. 2023. Leveraging Instance Features for Label Aggregation in Programmatic Weak Supervision. https://arxiv.org/abs/2210.02724 
2. Zhang et al. 2021. WRENCH: A Comprehensive Benchmark for Weak Supervision. https://arxiv.org/abs/2109.11377
3. Alberto et al. 2015. TubeSpam: Comment Spam Filtering on YouTube. https://ieeexplore.ieee.org/document/7424299
4. Stephan et al. 2022. SepLL: Separating Latent Class Labels from Weak Supervision Noise. https://arxiv.org/abs/2210.13898
