# Snorkel

## Introduction

The goal of this lab is to introduce students to the [Snorkel](http://www.snorkel.org) tool and to the possibilities of programmatic label generation using the weakly supervised learning paradigm.

In order to use weakly supervised learning to generate labels, it is necessary to create three datasets:

- **train set**: which does not have any labels
- **validation set**: used for hyperparameter optimization, has labels
- **test set**: used only for final model evaluation, has labels
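
The three-way split described above can be sketched in plain Python. This is only an illustration under assumptions: the function name `three_way_split`, the 10%/10% fractions, and the toy messages are hypothetical and not part of the lab, which works with a train set and a test set only.

```python
import random

ABSTAIN = -1  # same "no label" marker the lab uses later


def three_way_split(examples, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (text, label) pairs and split them into train/valid/test.

    Labels in the train part are replaced with ABSTAIN to mimic unlabeled data.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)

    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)

    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    # Everything that is left becomes the (unlabeled) train set
    train = [(text, ABSTAIN) for text, _ in shuffled[n_valid + n_test:]]
    return train, valid, test


# Hypothetical toy data: 20 messages with alternating labels
examples = [(f"message {i}", i % 2) for i in range(20)]
train, valid, test = three_way_split(examples)
print(len(train), len(valid), len(test))
```

Only the validation and test parts keep their labels; the train part is what the labeling functions will later annotate.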

## Labeling functions

The first step is to load the dataset and split it into a train set and a test set. Since every SMS message in our dataset already has a label, we will simulate a weakly supervised learning problem by randomly removing 80% of the labels. The unlabeled messages form the train set and the remaining 20% form the test set. Additionally, Snorkel requires numeric labels, so we need to recode the values.

In [None]:
!head data/smsspamcollection.csv

In [None]:
import pandas as pd
import numpy as np

pd.set_option('max_colwidth', 600)

SPAM = 1
HAM = 0
ABSTAIN = -1

df = pd.read_csv('./data/smsspamcollection.csv', 
                 sep='\t', 
                 header=None, 
                 names=['old_label', 'text'])

df['label'] = df.old_label.apply(lambda x: SPAM if x == 'spam' else HAM)

df.loc[df.sample(frac=0.8).index, 'label'] = ABSTAIN
df.drop(columns=['old_label'], inplace=True)

df.head()

In [None]:
abstain_idx = df.label == ABSTAIN

abstain_idx

In [None]:
df_train = df[abstain_idx]
df_test = df[~abstain_idx]

In [None]:
df_train.shape, df_test.shape

### Simple keyword search

As a first example, we will search for the words "check" and "free" in the SMS content.

In [None]:
from snorkel.labeling import labeling_function

@labeling_function()
def check(sms):
    return SPAM if "check" in sms.text.lower() else ABSTAIN

@labeling_function()
def free(sms):
    return SPAM if "free" in sms.text.lower() else ABSTAIN

The next step is to apply the labeling functions to the train set.

In [None]:
from snorkel.labeling import PandasLFApplier

lfs = [check, free]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

The result of applying the set of labeling functions to the train set is a matrix of size $m \times n$, where $m$ is the number of examples and $n$ is the number of labeling functions. The matrix contains the result of applying each function to each example.

In [None]:
L_train[200:250,:]

In [None]:
L_train.shape

In [None]:
df_train.iloc[200,:]
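
To make the structure of the label matrix concrete, here is a minimal plain-Python sketch of what applying labeling functions amounts to. The helper `apply_lfs` and the three sample messages are hypothetical; the real `PandasLFApplier` returns a NumPy array rather than nested lists.

```python
SPAM, HAM, ABSTAIN = 1, 0, -1


def lf_check(text):
    return SPAM if "check" in text.lower() else ABSTAIN


def lf_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN


def apply_lfs(lfs, texts):
    """Return an m x n nested list: one row per text, one column per function."""
    return [[lf(text) for lf in lfs] for text in texts]


# Hypothetical sample messages
texts = [
    "Check out this FREE offer",
    "Are you free tonight?",
    "See you at lunch",
]
L = apply_lfs([lf_check, lf_free], texts)
for text, row in zip(texts, L):
    print(row, text)
```

Each row holds the votes for one message, and each column holds the votes of one labeling function.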

The simplest way to analyze this matrix is to determine the coverage of each labeling function, i.e., the percentage of examples for which the function returned a result other than `ABSTAIN`.

In [None]:
coverage_check, coverage_free = (L_train != ABSTAIN).mean(axis=0)

print(f"Coverage for check(): {coverage_check * 100:.1f}%")
print(f"Coverage for free(): {coverage_free * 100:.1f}%")

Fortunately, Snorkel offers additional tools that allow for deeper analysis of the result of labeling functions.

In [None]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

The meaning of each column is as follows:
- `Polarity`: the set of labels returned by the function
- `Coverage`: the percentage of examples for which the function returns a value other than `ABSTAIN`
- `Overlaps`: the percentage of examples labeled by this function for which at least one other labeling function also returned a value other than `ABSTAIN`
- `Conflicts`: the percentage of examples labeled by this function for which at least one other labeling function returned a different value (other than `ABSTAIN`)

If the train set contained labels, the method would also return:
- `Correct`: the number of correct labels
- `Incorrect`: the number of incorrect labels
- `Empirical Accuracy`: the percentage of correct labels
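
The coverage, overlap, and conflict statistics can be reproduced by hand on a small matrix. The sketch below is an illustration written for this lab, not Snorkel's implementation; the function `lf_stats` and the 4 x 3 matrix are hypothetical.

```python
ABSTAIN = -1


def lf_stats(L, j):
    """Return (coverage, overlaps, conflicts) for column j of label matrix L."""
    m = len(L)
    covered = overlaps = conflicts = 0
    for row in L:
        if row[j] == ABSTAIN:
            continue  # this function did not vote on the example
        covered += 1
        # Non-abstaining votes of all other labeling functions
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlaps += 1
        if any(v != row[j] for v in others):
            conflicts += 1
    return covered / m, overlaps / m, conflicts / m


# Hypothetical label matrix: 4 examples, 3 labeling functions
L = [
    [1, 1, -1],
    [-1, 1, 0],
    [-1, -1, -1],
    [1, -1, -1],
]
for j in range(3):
    print(j, lf_stats(L, j))
```

Note that a conflict always implies an overlap, so the `Conflicts` value can never exceed the `Overlaps` value for the same function.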

Let's check the examples labeled as spam by the `free()` function.

In [None]:
df_train.iloc[L_train[:,1] == SPAM].sample(frac=0.1)

It seems that the phrase "call now" is also a good indicator for spam. So let's add one more labeling function.

In [None]:
@labeling_function()
def call_now(sms):
    return SPAM if "call now" in sms.text.lower() else ABSTAIN

lfs = [check, free, call_now]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

In [None]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Let's see which examples were labeled as spam by the `call_now()` function but omitted by `free()`.

In [None]:
from snorkel.analysis import get_label_buckets

buckets = get_label_buckets(L_train[:, 1], L_train[:, 2])
buckets

In [None]:
buckets.keys()

In [None]:
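# Bucket keys follow the argument order above: (free, call_now).
# free() abstained while call_now() voted SPAM.
df_train.iloc[buckets[(ABSTAIN, SPAM)]]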
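
The bucket keys are easy to mix up, so here is a plain-Python sketch of the grouping idea. The helper `label_buckets` is a hypothetical stand-in for `get_label_buckets`, which returns NumPy index arrays instead of lists; the vote vectors are made up.

```python
from collections import defaultdict

SPAM, ABSTAIN = 1, -1


def label_buckets(*columns):
    """Group example indices by the tuple of labels they received."""
    buckets = defaultdict(list)
    for i, labels in enumerate(zip(*columns)):
        buckets[labels].append(i)
    return dict(buckets)


# Hypothetical votes of two labeling functions on four examples
free_votes = [SPAM, SPAM, ABSTAIN, ABSTAIN]
call_now_votes = [SPAM, ABSTAIN, SPAM, ABSTAIN]

buckets = label_buckets(free_votes, call_now_votes)
print(buckets)
```

The position of a label inside a key matches the position of the corresponding argument, so `(SPAM, ABSTAIN)` and `(ABSTAIN, SPAM)` select different examples.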

In [None]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

#### Assignment

Write a labeling function that marks as spam all messages containing the word "HOT" written in capitals.

In [None]:
@labeling_function()
def hot(sms):
    ...

### Searching based on a regular expression

Another type of labeling function uses a regular expression to find specific phrases.

In [None]:
import re

@labeling_function()
def regex_I_am_free(sms):
    if re.search(r"\sI\s.*free", sms.text, flags=re.I):
        return HAM
    elif re.search(r"free", sms.text, flags=re.I):
        return SPAM
    else:
        return ABSTAIN

lfs = [check, free, call_now, regex_I_am_free]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Let's look at examples that the `free()` function labels as spam but the `regex_I_am_free()` function labels as ham.

In [None]:
buckets = get_label_buckets(L_train[:, 1], L_train[:, 3])
df_train.iloc[buckets[(SPAM, HAM)]].sample(10, random_state=1)
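
To see how the two patterns interact, the same branching can be tried on plain strings. This is a small sketch with hypothetical example messages; `classify_free` is a stand-in name that mirrors `regex_I_am_free()` without the Snorkel decorator.

```python
import re

SPAM, HAM, ABSTAIN = 1, 0, -1


def classify_free(text):
    """Same branching as regex_I_am_free(), applied to a plain string."""
    if re.search(r"\sI\s.*free", text, flags=re.I):
        return HAM
    if re.search(r"free", text, flags=re.I):
        return SPAM
    return ABSTAIN


# Hypothetical messages with the label each branch produces
examples = {
    "Sorry, I am free after six": HAM,   # " I " followed later by "free"
    "I am free after six": SPAM,         # leading "I" has no whitespace before it
    "Win a FREE phone today": SPAM,      # "free" without a standalone "I"
    "See you after six": ABSTAIN,        # no "free" at all
}

for text, expected in examples.items():
    print(classify_free(text), expected, text)
```

The second message shows a limitation of the pattern: because `\sI\s` requires whitespace before the "I", a message that starts with "I" falls through to the spam branch.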

#### Assignment

Write a labeling function that marks as spam all messages containing an amount of money given with a currency symbol (e.g., $99 or £1.50).

In [None]:
@labeling_function()
def contains_money(sms):
    ...


### Searching based on heuristics

A simple heuristic for finding spam is to assume that if more than 10% of the words in a message are written entirely in capitals, there is a good chance the message is spam.

In [None]:
@labeling_function()
def has_many_uppercase_words(sms):
    words = sms.text.split()
    # Guard against empty messages to avoid division by zero
    if not words:
        return ABSTAIN
    percentage_uppercase = sum(word.isupper() for word in words) / len(words)

    return SPAM if percentage_uppercase > 0.1 else ABSTAIN

lfs = [check, free, call_now, regex_I_am_free, has_many_uppercase_words]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

LFAnalysis(L=L_train, lfs=lfs).lf_summary()
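
It can help to check the heuristic on a few strings before trusting it. The sketch below uses a hypothetical helper, `uppercase_word_fraction`, and made-up messages; it computes the same ratio as `has_many_uppercase_words()`.

```python
def uppercase_word_fraction(text):
    """Fraction of whitespace-separated words written entirely in capitals."""
    words = text.split()
    if not words:
        return 0.0
    return sum(word.isupper() for word in words) / len(words)


# Hypothetical messages
for text in [
    "URGENT! You have WON a prize",
    "I will call you later",
    "see you at lunch",
]:
    print(f"{uppercase_word_fraction(text):.2f}  {text}")
```

The second message is worth noting: the pronoun "I" counts as an uppercase word, so a short, ordinary message can exceed the 10% threshold. This is one reason why a single heuristic should not be trusted on its own.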

#### Assignment

Write a labeling function that marks as ham those messages that are shorter than 10 words and do not contain any word written in capitals.

In [None]:
@labeling_function()
def short_and_no_uppercase(sms):
    ...

### Using an external statistical model

When labeling data, you can use external models whose output can be valuable information for deciding how to label an example. Snorkel has several built-in integrations in the form of the `Preprocessor` interface. In the example below we will use the `spaCy` library to perform additional linguistic analysis of the text. To do so, you first need to download the English language model.

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
!python -m spacy validate

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

In [None]:
_text = """England is a country that is part of the United Kingdom. 
It shares land borders with Wales to its west and Scotland to its north. 
The Irish Sea lies northwest of England and the Celtic Sea to the southwest. 
England is separated from continental Europe by the North Sea to the east and the 
English Channel to the south. The country covers five-eighths of the island of 
Great Britain, which lies in the North Atlantic, and includes over 100 smaller islands, 
such as the Isles of Scilly and the Isle of Wight."""

doc = nlp(_text)

for e in doc.ents:
    print(f"{e.text:<25} start:{e.start} end:{e.end} label: {e.label_}")

In [None]:
from snorkel.preprocess.nlp import SpacyPreprocessor

spacy_preprocessor = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

Assume that short text messages in which a reference to a specific person appears are not spam.

In [None]:
df_train.columns

In [None]:
@labeling_function(pre=[spacy_preprocessor])
def has_person(sms):
    if len(sms.doc) < 20 and any([ent.label_ == "PERSON" for ent in sms.doc.ents]):
        return HAM
    else:
        return ABSTAIN

In [None]:
lfs = [check, free, call_now, regex_I_am_free, has_many_uppercase_words, has_person]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Another example of pre-processing data for labeling would be determining the average word frequency of a document. Below we define a function that determines the average word frequency and we decorate it as an example of a pre-processor. When a text message is sent to the next labeling function, the pre-processor will populate the text message with the average word frequency and, based on that, the labeling function will make a decision (we assume that if the text message contains many rare words then it is spam).

In [None]:
from wordfreq import zipf_frequency
from snorkel.preprocess import preprocessor

@preprocessor(memoize=True)
def avg_word_freq(sms):
    words = sms.text.split()
    # max(..., 1) avoids division by zero; an empty message gets frequency 0.0
    sms.avg_word_freq = sum(zipf_frequency(word, 'en') for word in words) / max(len(words), 1)

    return sms

In [None]:
@labeling_function(pre=[avg_word_freq])
def many_rare_words(sms):
    return ABSTAIN if sms.avg_word_freq >= 4 else SPAM

In [None]:
lfs = [check, free, call_now, regex_I_am_free, has_many_uppercase_words, has_person, many_rare_words]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

In [None]:
df_train.iloc[L_train[:,6] == SPAM].sample(frac=0.1)
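
The threshold logic of `many_rare_words()` can be illustrated without the `wordfreq` package. In the sketch below, the `TOY_ZIPF` table holds hypothetical scores invented for illustration (they are not real `wordfreq` values), and unknown words are assumed to score 0.

```python
SPAM, ABSTAIN = 1, -1

# Hypothetical Zipf-style scores, for illustration only
TOY_ZIPF = {"the": 7.0, "you": 6.5, "call": 5.0, "prize": 4.0}


def avg_toy_freq(text):
    """Average toy frequency of the words in a message (unknown words count as 0)."""
    words = text.lower().split()
    return sum(TOY_ZIPF.get(word, 0.0) for word in words) / max(len(words), 1)


def many_rare_words_toy(text, threshold=4):
    """Vote SPAM when the average frequency falls below the threshold."""
    return ABSTAIN if avg_toy_freq(text) >= threshold else SPAM


print(avg_toy_freq("call you"), many_rare_words_toy("call you"))
print(avg_toy_freq("call txtwin prize"), many_rare_words_toy("call txtwin prize"))
```

A single out-of-vocabulary token pulls the average down sharply, which is why messages full of abbreviations or invented words tend to be flagged.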

#### Assignment

Write a labeling function that marks messages containing more than 3 adjectives as spam. Use the spaCy library for pre-processing.

__Hint__: the following example shows how to read the part-of-speech label for each token of the message being analyzed. For information on all token properties recognized by spaCy, see the [API documentation](https://spacy.io/api/token).

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

sms = "Yetunde, i'm sorry but moji and i seem too busy to be able to go shopping."

for token in nlp(sms):
    print(f"{token.text:<10} {token.pos_:<10} {token.tag_:<10} {token.lemma_:<10}")

## Combining labeling functions into a single model

An individual labeling function is not expected to achieve high coverage on its own. Labeling functions are inherently noisy and can make many individual errors. Their true utility becomes apparent when multiple functions are combined into a single model.

We will first build a simple model based on majority voting, and then build a more complex model. 

In [None]:
lfs = [check, free, call_now, regex_I_am_free, has_person, many_rare_words]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_test = applier.apply(df=df_test)

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

In [None]:
LFAnalysis(L=L_test, lfs=lfs).lf_summary()

In [None]:
from snorkel.labeling.model import MajorityLabelVoter

majority_model = MajorityLabelVoter()
preds_train = majority_model.predict(L=L_train)

In [None]:
preds_train

In [None]:
import numpy as np

labels, counts = np.unique(preds_train, return_counts=True)

for l, c in zip(labels, counts):
    print(f"LABEL: {l}, count: {c}")
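
The idea behind majority voting can be sketched in a few lines of plain Python. This is an illustration, not Snorkel's code: the function `majority_vote` and the sample matrix are hypothetical, and the sketch assumes that ties and rows without any vote are resolved by abstaining.

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(row):
    """Return the most frequent non-abstain label, or ABSTAIN on a tie or no votes."""
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # two labels share the top count
    return ranked[0][0]


# Hypothetical label matrix: 4 examples, 3 labeling functions
L = [
    [1, 1, -1],    # two SPAM votes
    [1, 0, -1],    # tie between SPAM and HAM
    [-1, -1, -1],  # nobody voted
    [0, 0, 1],     # HAM wins 2:1
]
preds = [majority_vote(row) for row in L]
print(preds)
```

This explains why `-1` can appear among the predicted labels: an example stays unlabeled whenever the votes give no clear winner.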

In [None]:
from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=42)

In [None]:
majority_acc = majority_model.score(L=L_test, Y=df_test.label, tie_break_policy="random")["accuracy"]
print(f"{'Majority voting accuracy:':<25} {majority_acc * 100:.1f}%")

label_model_acc = label_model.score(L=L_test, Y=df_test.label, tie_break_policy="random")["accuracy"]
print(f"{'Probabilistic model accuracy:':<25} {label_model_acc * 100:.1f}%")

Unfortunately, some data points will not receive any label. It is necessary to filter out these points before sending the labeling result for further processing.

In [None]:
from snorkel.labeling import filter_unlabeled_dataframe
from snorkel.utils import preds_to_probs, probs_to_preds

preds_train, probs_train = label_model.predict(L=L_train, return_probs=True)

df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(X=df_train, y=probs_train, L=L_train)
df_train.shape, df_train_filtered.shape

In [None]:
df_train_filtered
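
Conceptually, the filtering step keeps only those rows for which at least one labeling function voted. A minimal plain-Python sketch of that idea follows; the helper `filter_unlabeled` and the toy data are hypothetical and work on lists rather than dataframes.

```python
ABSTAIN = -1


def filter_unlabeled(rows, probs, L):
    """Keep rows (and their probabilities) that received at least one vote."""
    keep = [any(v != ABSTAIN for v in votes) for votes in L]
    kept_rows = [r for r, k in zip(rows, keep) if k]
    kept_probs = [p for p, k in zip(probs, keep) if k]
    return kept_rows, kept_probs


# Hypothetical messages, label probabilities [P(HAM), P(SPAM)], and votes
rows = ["win a free prize", "see you soon", "call now"]
probs = [[0.1, 0.9], [0.5, 0.5], [0.2, 0.8]]
L = [[1, -1], [-1, -1], [-1, 1]]

kept_rows, kept_probs = filter_unlabeled(rows, probs, L)
print(kept_rows)
```

The second message is dropped because every labeling function abstained on it, so its probabilities carry no information.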

As you can see, we were able to quickly prepare labels for about 650 examples (recall that initially no example in the `df_train` set had labels).

The next step uses the prepared labels as training data for the actual classifier. We will use simple [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), first pre-processing the input data. Since we are working with text, we will use a [bag-of-words representation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) built by `CountVectorizer` from single words (unigrams, `ngram_range=(1, 1)`).

In [None]:
from snorkel.utils import probs_to_preds
from sklearn.feature_extraction.text import CountVectorizer

preds_train_filtered = probs_to_preds(probs=probs_train_filtered)

vectorizer = CountVectorizer(ngram_range=(1, 1))

X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())

In [None]:
X_train.shape, X_test.shape
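
The reason `X_train` and `X_test` have the same number of columns is that the vocabulary is learned from the train set only. The following plain-Python sketch illustrates the idea; the helpers `tokenize`, `fit_vocabulary`, and `transform` are hypothetical simplifications of what `CountVectorizer` does, and the messages are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(texts):
    """Map every token seen in the train texts to a column index."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(vocab)}


def transform(texts, vocab):
    """Turn texts into count vectors; tokens outside the vocabulary are ignored."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


train_texts = ["free prize call now", "call me now now"]
test_texts = ["free lunch now"]

vocab = fit_vocabulary(train_texts)
X_train_toy = transform(train_texts, vocab)
X_test_toy = transform(test_texts, vocab)
print(list(vocab))
print(X_train_toy, X_test_toy)
```

The word "lunch" never appeared in the train texts, so it has no column and is silently dropped from the test vector.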

In [None]:
from sklearn.linear_model import LogisticRegression

sklearn_model = LogisticRegression(C=1e3, solver='lbfgs')
sklearn_model.fit(X=X_train, y=preds_train_filtered)

In [None]:
print(f"Logistic regression accuracy: {sklearn_model.score(X=X_test, y=df_test.label) * 100:.1f}%")

As can be seen, the final model improved the score over the majority vote and the `LabelModel` model.

#### Assignment

Complete the above calls with functions that you wrote yourself and check whether your functions improve the quality of the model.

## Slicing functions

The idea of a *slicing function* is to define a subset of the train/test dataset so that the quality of trained models can be evaluated independently on specific subsets of data. This is very important from the point of view of ethical and explainable AI, and it makes it possible to discover areas of the input feature space where the model underperforms or overperforms.

Let us define a simple slicing function and verify the performance of our model on it. But first we will define a few utility functions borrowed directly from one of Snorkel's tutorials.

In [None]:
import torch
import torch.nn as nn
from snorkel.classification.data import DictDataset, DictDataLoader

def get_pytorch_mlp(hidden_dim, num_layers):
    layers = []
    for _ in range(num_layers):
        layers.extend([nn.Linear(hidden_dim, hidden_dim), nn.ReLU()])
    return nn.Sequential(*layers)


def create_dict_dataloader(X, Y, split, **kwargs):
    ds = DictDataset.from_tensors(torch.FloatTensor(X), torch.LongTensor(Y), split)
    return DictDataLoader(ds, **kwargs)


def df_to_features(vectorizer, df, split):
    words = [row.text for i, row in df.iterrows()]

    if split == "train":
        feats = vectorizer.fit_transform(words)
    else:
        feats = vectorizer.transform(words)
    X = feats.todense()
    Y = df["label"].values
    return X, Y

In [None]:
from snorkel.slicing import slicing_function

@slicing_function()
def short_message(sms):
    return len(sms.text.split()) <= 5

The first step is to inspect the slice as a dataframe.

In [None]:
from snorkel.slicing import slice_dataframe

short_message_df = slice_dataframe(df_test, short_message)
short_message_df.head()

We will also introduce another slice: SMS messages containing the word *free*.

In [None]:
@slicing_function()
def message_has_free(sms):
    return 'free' in sms.text.lower()

The `S_test` array indicates, for each example, which of the defined slices it belongs to.

In [None]:
from snorkel.slicing import PandasSFApplier

sfs = [short_message, message_has_free]

applier = PandasSFApplier(sfs)
S_test = applier.apply(df_test)

S_test

Before we proceed further, we have to explicitly add the labels to the `df_train_filtered` dataframe, as we haven't done it before. We use the labels predicted by the logistic regression model trained above.

In [None]:
df_train_filtered["label"] = sklearn_model.predict(X_train)

Next, we vectorize all messages (turning them into word-count vectors).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 1))

x_train, y_train = df_to_features(vectorizer, df_train_filtered, "train")
x_test, y_test = df_to_features(vectorizer, df_test, "test")

Now we can use the vectorized inputs to train a logistic regression model to classify messages.

In [None]:
from sklearn.linear_model import LogisticRegression

sklearn_model = LogisticRegression(C=0.001, solver="lbfgs")
sklearn_model.fit(X=np.asarray(x_train), y=y_train)

Let us check the performance of our model on the entire test set and on the slices.

In [None]:
from snorkel.analysis import Scorer

scorer = Scorer(metrics=["accuracy", "precision", "recall", "f1"])

golds = df_test.label
preds = sklearn_model.predict(np.asarray(x_test))
probs = sklearn_model.predict_proba(np.asarray(x_test))

In [None]:
scorer.score_slices(
    S=S_test, 
    golds=golds, 
    preds=preds, 
    probs=probs, 
    as_dataframe=True
)
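
Scoring by slice boils down to computing the same metric on subsets of the test set. The sketch below shows this for accuracy only; the function `score_slices_toy`, the labels, and the slice masks are hypothetical, and the real `Scorer` also reports precision, recall, and F1.

```python
def accuracy(golds, preds):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)


def score_slices_toy(S, golds, preds):
    """Accuracy on the whole set and on each slice given as a 0/1 mask."""
    scores = {"overall": accuracy(golds, preds)}
    for name, mask in S.items():
        idx = [i for i, inside in enumerate(mask) if inside]
        if idx:  # skip empty slices
            scores[name] = accuracy([golds[i] for i in idx], [preds[i] for i in idx])
    return scores


# Hypothetical gold labels, predictions, and slice membership masks
golds = [1, 0, 0, 1, 0, 1]
preds = [1, 0, 1, 1, 0, 0]
S = {
    "short_message": [1, 1, 0, 0, 1, 0],
    "message_has_free": [0, 0, 1, 1, 0, 1],
}

scores = score_slices_toy(S, golds, preds)
print(scores)
```

In this toy case the overall accuracy hides the fact that the model does much worse on one slice than on the other, which is exactly what slicing is meant to reveal.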

#### Assignment

Create a new slicing function that identifies all text messages in which the majority of words are in uppercase.

In [None]:
@slicing_function()
def most_words_uppercase(sms):
    ...

sfs = sfs + [most_words_uppercase]

In [None]:
sfs = [short_message, message_has_free]

applier = PandasSFApplier(sfs)

S_train = applier.apply(df_train_filtered)
S_test = applier.apply(df_test)

In our last experiment we will train a model that is fine-tuned independently for each slice. This approach is known as *slice-based learning*. Intuitively, we'd like the model to learn representations that are better suited to handle data points in a given slice. In our approach, we model each slice as a separate “expert task” in the style of multi-task learning.

In [None]:
from snorkel.slicing import SliceAwareClassifier

# Use the dense features that are actually passed to the dataloaders below
bow_dim = x_train.shape[1]
hidden_dim = bow_dim
mlp = get_pytorch_mlp(hidden_dim=hidden_dim, num_layers=2)

slice_model = SliceAwareClassifier(
    base_architecture=mlp,
    head_dim=hidden_dim,
    slice_names=[sf.name for sf in sfs],
    scorer=scorer,
)

In [None]:
BATCH_SIZE = 64

train_dl = create_dict_dataloader(x_train, y_train, "train")
train_dl_slice = slice_model.make_slice_dataloader(
    train_dl.dataset, S_train, shuffle=True, batch_size=BATCH_SIZE
)

test_dl = create_dict_dataloader(x_test, y_test, "test")
test_dl_slice = slice_model.make_slice_dataloader(
    test_dl.dataset, S_test, shuffle=False, batch_size=BATCH_SIZE
)

In [None]:
from snorkel.classification import Trainer

trainer = Trainer(n_epochs=10, lr=1e-4, progress_bar=True)
trainer.fit(slice_model, [train_dl_slice])

Let us check if the slice-based learning allowed us to improve the model.

In [None]:
slice_model.score_slices([test_dl_slice], as_dataframe=True)

## Transforming functions

The idea of a transforming function is to perform an atomic transformation of an instance. For image data, typical transformations include cropping, rotating, and changing the color palette. For text data, you can replace words with synonyms, substitute named entities, cut random pieces of text, etc. In the following example we will find the types of named entities occurring in the text, and then prepare a simple transformer that randomly replaces occurrences of the `PERSON` entity.

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

for doc in nlp.pipe(df_train.text.sample(frac=0.05)):
    print(f"Entities: {[(e.text, e.label_) for e in doc.ents]}")

In [None]:
person_entities = []

for doc in nlp.pipe(df_train.text.sample(frac=0.05)):
    for e in doc.ents:
        if e.label_ == 'PERSON':
            person_entities.append(e.text)
        
person_entities[:10]

In [None]:
from snorkel.augmentation import transformation_function
from snorkel.preprocess.nlp import SpacyPreprocessor

# Named spacy_pre so that it does not shadow the imported `spacy` module
spacy_pre = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

@transformation_function(pre=[spacy_pre])
def random_person_ner(sms):
    person_ners = [e.text for e in sms.doc.ents if e.label_ == 'PERSON']
    
    if person_ners:
        # Swap one randomly chosen person name for another name seen in the corpus
        person_to_replace = np.random.choice(person_ners)
        person_to_add = np.random.choice(person_entities)
        sms.text = sms.text.replace(person_to_replace, person_to_add)
    return sms

Another example of a transformation is using WordNet to find synonyms for words. However, this requires downloading the WordNet corpus.

In [None]:
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet")

In [None]:
def get_synonym(word):
    
    synsets = wordnet.synsets(word)
    
    if synsets:
        words = [lemma.name() for lemma in synsets[0].lemmas()]
        
        return np.random.choice([w.replace("_", " ") for w in words])


In [None]:
@transformation_function()
def replace_words_with_synonym(sms, num_replacements=5):

    words = sms.text.split()
    
    for _ in range(num_replacements):
        word_idx = np.random.choice(range(len(words)))
        synonym = get_synonym(words[word_idx])
        if synonym:
            words[word_idx] = synonym
        
    sms.text = ' '.join(words)
    return sms

Let us now compare the original text message content with the transformed versions.

In [None]:
# source: https://github.com/snorkel-team/snorkel-tutorials/blob/master/spam/utils.py

from collections import OrderedDict

def preview_tfs(df, tfs):
    transformed_examples = []
    for f in tfs:
        for i, row in df.iterrows():
            transformed_or_none = f(row)
            # If TF returned a transformed example, record it in dict and move to next TF.
            if transformed_or_none is not None:
                transformed_examples.append(
                    OrderedDict(
                        {
                            "TF Name": f.name,
                            "Original Text": row.text,
                            "Transformed Text": transformed_or_none.text,
                        }
                    )
                )
                
    return pd.DataFrame(transformed_examples)


In [None]:
tfs = [random_person_ner, replace_words_with_synonym]

df_transformed = preview_tfs(df_train.sample(frac=0.1), tfs)

In [None]:
df_transformed[df_transformed['Original Text'] != df_transformed['Transformed Text']]

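Applying transforming functions requires a policy that defines the order and number of transformations. In the example below, the policy generates two sequences for each data point (`n_per_original=2`), each consisting of two transformation functions drawn at random (`sequence_length=2`). Each sequence yields one transformed copy, and because `keep_original=True` the original data point is kept as well. As a result, we triple the size of the train set.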

In [None]:
from snorkel.augmentation import RandomPolicy, PandasTFApplier

random_policy = RandomPolicy(len(tfs), sequence_length=2, n_per_original=2, keep_original=True)

tf_applier = PandasTFApplier(tfs, random_policy)

df_train_sample = df_train.sample(frac=0.1)
df_train_augmented = tf_applier.apply(df_train_sample)

In [None]:
df_train_sample.shape, df_train_augmented.shape
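
The effect of the policy on the dataset size can be illustrated with a plain-Python sketch. This is not Snorkel's implementation: `random_policy_toy` and `augment` are hypothetical helpers, and `str.upper` and `str.title` stand in for real transformation functions.

```python
import random


def random_policy_toy(n_tfs, sequence_length, n_per_original, keep_original, rng):
    """Generate the TF index sequences to apply to a single data point."""
    # An empty sequence means "keep the data point unchanged"
    sequences = [[]] if keep_original else []
    for _ in range(n_per_original):
        sequences.append([rng.randrange(n_tfs) for _ in range(sequence_length)])
    return sequences


def augment(texts, tfs, **policy_kwargs):
    """Apply every generated sequence of transformations to every text."""
    rng = random.Random(0)
    augmented = []
    for text in texts:
        for sequence in random_policy_toy(len(tfs), rng=rng, **policy_kwargs):
            transformed = text
            for idx in sequence:
                transformed = tfs[idx](transformed)
            augmented.append(transformed)
    return augmented


tfs = [str.upper, str.title]  # hypothetical stand-ins for real TFs
texts = ["call me later", "see you soon"]
augmented = augment(texts, tfs, sequence_length=2, n_per_original=2, keep_original=True)
print(len(texts), len(augmented))
```

Each original contributes itself plus `n_per_original` transformed copies, so the output holds `1 + n_per_original` times as many rows as the input.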

#### Assignment

Modify the transforming function ``replace_words_with_synonym()`` so that the replacement of words with synonyms can be restricted to specific parts of speech (e.g., replace only nouns or verbs).