In [18]:
import pandas as pd
from snorkel.utils import probs_to_preds
from utils import load_raw_spam_dataset
from wrench.dataset import load_dataset
from wrench.endmodel import EndClassifierModel
from wrench.labelmodel import MajorityVoting, Fable

path_to_data = "data/youtube"
# os.chdir("wrench/spam")

In [19]:
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', None)  # or 199

# Weak Supervision

todo: some intro, what is WS once again, what do we need to train a classifier

In this tutorial, we are going to train a spam detection classifier using weak supervision. The dataset we will use for training is the Spam Detection YouTube comments dataset.

Some info about the dataset:
- The dataset consists of comments that YouTube users left under different videos.
- Each sample is a comment (i.e., a word, a sentence, or a couple of sentences).
- 1,586 train samples, 120 dev samples, 250 test samples
- There are 2 types of samples:
    - HAM: comments relevant to the video (even very simple ones), or
    - SPAM: irrelevant (often trying to advertise something) or inappropriate messages
- The original dataset is labeled; we are going to treat it as unlabeled (and label it in a weakly supervised fashion).

Let's first have a look at the dataset, with the gold labels still included.

In [2]:
# load the YouTube dataset
df_train, df_dev, df_test = load_raw_spam_dataset(load_train_labels=True)
Y_train = df_train["label"].values
Y_test = df_test["label"].values

In [23]:
df_train[:10]

Unnamed: 0,author,date,text,label,video
0,Alessandro leite,2014-11-05T22:21:36,pls http://www10.vakinha.com.br/VaquinhaE.aspx?e=313327 help me get vip gun cross fire al﻿,1,1
1,Salim Tayara,2014-11-02T14:33:30,"if your like drones, plz subscribe to Kamal Tayara. He takes videos with his drone that are absolutely beautiful.﻿",1,1
2,Phuc Ly,2014-01-20T15:27:47,go here to check the views :3﻿,0,1
3,DropShotSk8r,2014-01-19T04:27:18,"Came here to check the views, goodbye.﻿",0,1
4,css403,2014-11-07T14:25:48,"i am 2,126,492,636 viewer :D﻿",0,1
5,Giang Nguyen,2014-11-06T04:55:41,https://www.facebook.com/teeLaLaLa﻿,1,1
6,Caius Ballad,2014-11-13T00:58:20,imagine if this guy put adsense on with all these views... u could pay ur morgage﻿,0,1
7,Holly,2014-11-06T13:41:30,Follow me on Twitter @mscalifornia95﻿,1,1
8,King uzzy,2014-11-07T23:19:08,Can we reach 3 billion views by December 2014? ﻿,0,1
9,iKap Taz,2014-11-08T13:34:27,Follow 4 Follow @ VaahidMustafic Like 4 Like ﻿,1,1


For each data sample (i.e., a YouTube comment), we know:
- the comment's author,
- the date when the comment was posted,
- the text of the comment,
- the label,
- the id of the YouTube video.

Examples of HAM messages (label = 0):
- "3:46 so cute!"
- "This is a weird video."

In [28]:
# some examples of HAM (non-spam, label = 0) samples

df_train.loc[df_train["label"]==0][:10]

Unnamed: 0,author,date,text,label,video
2,Phuc Ly,2014-01-20T15:27:47,go here to check the views :3﻿,0,1
3,DropShotSk8r,2014-01-19T04:27:18,"Came here to check the views, goodbye.﻿",0,1
4,css403,2014-11-07T14:25:48,"i am 2,126,492,636 viewer :D﻿",0,1
6,Caius Ballad,2014-11-13T00:58:20,imagine if this guy put adsense on with all these views... u could pay ur morgage﻿,0,1
8,King uzzy,2014-11-07T23:19:08,Can we reach 3 billion views by December 2014? ﻿,0,1
10,John Plaatt,2014-11-07T22:22:29,On 0:02 u can see the camera man on his glasses....﻿,0,1
11,Praise Samuel,2014-11-08T11:10:30,2 billion views wow not even baby by justin beibs has that much he doesn't deserve a capitalized name﻿,0,1
16,zhichao wang,2013-11-29T02:13:56,i think about 100 millions of the views come from people who only wanted to check the views﻿,0,1
19,Tedi Foto,2014-11-08T09:33:30,What my gangnam style﻿,0,1
20,Tee Tee,2014-11-07T20:16:51,Loool nice song funny how no one understands (me) and we love it﻿,0,1


Examples of SPAM messages (label = 1):
- "Please check out my vidios"
- "Subscribe to me and I'll subscribe back!!!"

In [29]:
# some examples of SPAM (label = 1) samples

df_train.loc[df_train["label"]==1][:10]

Unnamed: 0,author,date,text,label,video
0,Alessandro leite,2014-11-05T22:21:36,pls http://www10.vakinha.com.br/VaquinhaE.aspx?e=313327 help me get vip gun cross fire al﻿,1,1
1,Salim Tayara,2014-11-02T14:33:30,"if your like drones, plz subscribe to Kamal Tayara. He takes videos with his drone that are absolutely beautiful.﻿",1,1
5,Giang Nguyen,2014-11-06T04:55:41,https://www.facebook.com/teeLaLaLa﻿,1,1
7,Holly,2014-11-06T13:41:30,Follow me on Twitter @mscalifornia95﻿,1,1
9,iKap Taz,2014-11-08T13:34:27,Follow 4 Follow @ VaahidMustafic Like 4 Like ﻿,1,1
12,Malin Linford,2014-11-05T01:13:43,"Hey guys please check out my new Google+ page it has many funny pictures, FunnyTortsPics https://plus.google.com/112720997191206369631/post﻿",1,1
13,Lone Twistt,2013-11-28T17:34:55,Once you have started reading do not stop. If you do not subscribe to me within one day you and you're entire family will die so if you want to stay alive subscribe right now.﻿,1,1
14,Олег Пась,2014-11-03T23:29:00,Plizz withing my channel ﻿,1,1
15,JD COKE,2014-11-08T02:24:02,"It's so hard, sad :( iThat little child Actor HWANG MINOO dancing very active child is suffering from brain tumor, only 6 month left for him .Hard to believe .. Keep praying everyone for our future superstar. #StrongLittlePsY #Fighting SHARE EVERYONE PRAYING FOR HIM http://ygunited.com/2014/11/08/little-psy-from-the-has-brain-tumor-6-months-left-to-live/ ﻿",1,1
17,Rancy Gaming,2014-11-06T09:41:07,What free gift cards? Go here http://www.swagbucks.com/p/register?rb=13017194﻿,1,1


Now let's imagine the gold labels disappeared...

![poof](img/poof.png)

What can serve as a labeling function (LF)?

- Keyword searches: looking for specific words in a sentence
- Pattern matching: looking for specific syntactical patterns
- Third-party models: using a pre-trained model (usually a model for a different task than the one at hand)
- Crowdworker labels: treating each crowdworker as a black-box function that assigns labels to subsets of the data
- ...

For the YouTube comments dataset, for example, the phrase "check out" is a marker of a SPAM comment.

In [None]:
# An example LF based on the keyword "check out":
# vote SPAM (1) if the phrase occurs, otherwise abstain (-1).
def check_out(x):
    return 1 if "check out" in x.text.lower() else -1

# An example LF based on the keyword "check" (broader, so it fires more often).
def check(x):
    return 1 if "check" in x.text.lower() else -1
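
To make the other LF types from the list above more concrete, here is a minimal sketch that adds a pattern-matching LF and applies several LFs to a few comments, producing one row of votes per comment. The names `SPAM`, `HAM`, `ABSTAIN`, `has_url`, and `apply_lfs` are illustrative assumptions and not part of Snorkel or Wrench; the comments are shortened versions of training samples shown earlier.

```python
import re
from types import SimpleNamespace

# Label convention used in this tutorial: 1 = SPAM, 0 = HAM, -1 = no vote.
SPAM, HAM, ABSTAIN = 1, 0, -1

def check_out(x):
    """Keyword LF: "check out" is a marker of a SPAM comment."""
    return SPAM if "check out" in x.text.lower() else ABSTAIN

def has_url(x):
    """Pattern-matching LF (illustrative): a comment containing a link is voted SPAM."""
    return SPAM if re.search(r"https?://\S+", x.text) else ABSTAIN

def apply_lfs(lfs, samples):
    """Apply every LF to every sample; returns one list of votes per sample."""
    return [[lf(x) for lf in lfs] for x in samples]

comments = [
    "Hey guys please check out my new Google+ page it has many funny pictures",
    "https://www.facebook.com/teeLaLaLa",
    "Came here to check the views, goodbye.",
]
# The LFs only need an object with a `.text` attribute.
samples = [SimpleNamespace(text=t) for t in comments]
weak_labels = apply_lfs([check_out, has_url], samples)

for text, votes in zip(comments, weak_labels):
    print(votes, text)
```

Each row holds one vote per LF, which is the same layout as the `weak_labels` lists in the Wrench data format.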

todo: e.g., we created 10 LFs and applied them to each comment. The result can be saved in the following format:

In [13]:
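# Keep the raw JSON in its own variable so that it does not clash with the
# wrench dataset object that is stored in `train_data` later on.
with open("/Users/asedova/PycharmProjects/01_datasets/wrench/youtube/train.json") as train_file:
    train_df = pd.read_json(train_file)
print(train_df)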

Unnamed: 0,data,label,weak_labels
0,{'text': 'pls http://www10.vakinha.com.br/Vaqu...,1,"[-1, -1, 1, -1, -1, -1, -1, -1, -1, -1]"
1,"{'text': 'if your like drones, plz subscribe t...",1,"[-1, 1, -1, 1, -1, -1, -1, -1, -1, 0]"
2,{'text': 'go here to check the views :3﻿'},0,"[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
3,"{'text': 'Came here to check the views, goodby...",0,"[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
4,"{'text': 'i am 2,126,492,636 viewer :D﻿'}",0,"[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"


todo: explain what are the data, labels (=still gold ones), weak_labels (=the annotations after applying LFs)
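
As a rough illustration of what the `weak_labels` column holds, the sketch below takes the five rows printed above and computes how often each LF casts a vote (its coverage). It assumes that `-1` means the LF abstained; `lf_coverage` is a hypothetical helper, not a Wrench function.

```python
ABSTAIN = -1  # assumption: -1 marks "this LF did not fire"

# weak_labels of the five rows printed above (10 LFs per sample)
weak_labels = [
    [-1, -1, 1, -1, -1, -1, -1, -1, -1, -1],
    [-1, 1, -1, 1, -1, -1, -1, -1, -1, 0],
    [-1] * 10,
    [-1] * 10,
    [-1] * 10,
]

def lf_coverage(rows):
    """Fraction of samples on which each LF votes (i.e., does not abstain)."""
    n_lfs = len(rows[0])
    return [sum(row[j] != ABSTAIN for row in rows) / len(rows) for j in range(n_lfs)]

coverage = lf_coverage(weak_labels)

# Share of samples that received at least one vote from any LF.
labeled_share = sum(any(v != ABSTAIN for v in row) for row in weak_labels) / len(weak_labels)

print(coverage)
print(labeled_share)
```

Samples on which every LF abstains carry no weak signal at all, which is why coverage is worth checking before training on the weak labels.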

todo: what is majority vote?
todo: how to get the weak labels from the wrench data to train a classifier?
todo: example of classifier training
todo: other labeling model (e.g. Snorkel)

## How to obtain weak labels?

todo: different label models: 1) (simple) majority vote 2) (more advanced) FABLE

In [None]:
train_data, valid_data, test_data = load_dataset(
    path_to_data,
    "youtube",
    extract_feature=True,
    extract_fn='tfidf'
)

1. Majority Vote
(+ todo: a small description)
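
The idea behind majority voting can be sketched in a few lines: for each sample, count the non-abstaining LF votes and pick the class with the most votes. This is a simplified illustration, not Wrench's `MajorityVoting` implementation; in particular, returning a uniform distribution when every LF abstains and breaking ties toward the lower class index are assumptions made here.

```python
from collections import Counter

ABSTAIN = -1  # assumption: -1 marks an abstaining LF

def majority_vote_proba(weak_labels, n_classes=2):
    """Turn one row of LF votes into class probabilities by counting votes."""
    votes = [v for v in weak_labels if v != ABSTAIN]
    if not votes:
        # No LF fired: fall back to a uniform distribution (assumption).
        return [1.0 / n_classes] * n_classes
    counts = Counter(votes)
    return [counts.get(c, 0) / len(votes) for c in range(n_classes)]

def to_hard_label(proba):
    """Pick the most probable class; ties go to the lower class index."""
    return max(range(len(proba)), key=lambda c: proba[c])

# Second row of the weak-label table above: two SPAM votes, one HAM vote.
row = [-1, 1, -1, 1, -1, -1, -1, -1, -1, 0]
proba = majority_vote_proba(row)
print(proba, to_hard_label(proba))
```

The probabilities play the role of the soft labels and the argmax plays the role of the hard labels computed in the cells below.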

In [7]:
# initialize and apply the majority vote
label_model = MajorityVoting()
label_model.fit(dataset_train=train_data, dataset_valid=valid_data)

In [34]:
# calculate labels
soft_label_mv = label_model.predict_proba(train_data)
hard_label_mv = probs_to_preds(soft_label_mv)


2. Fable
(+ todo: a small description)

In [36]:
# initialize and apply the fable model
label_model = Fable(kernel_function=None, num_groups=3)
label_model.fit(dataset_train=train_data, dataset_valid=valid_data)

In [37]:
# calculate labels
soft_label_fable = label_model.predict_proba(train_data)
hard_label_fable = probs_to_preds(soft_label_fable)

array([1, 1, 0, ..., 1, 1, 1])

## How to train a classifier?

In [30]:
batch_size = 32
test_batch_size = 32
lr = 0.01

model = EndClassifierModel(
    batch_size=batch_size, test_batch_size=test_batch_size
)

AttributeError: 'DataFrame' object has no attribute 'weak_labels'
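
Independently of the Wrench API, the core idea of this step is that the end classifier is trained on the labels produced by the label model instead of on gold labels. The sketch below shows this with a tiny logistic regression trained on soft labels in plain Python; the feature vectors, the soft labels, and the helper names (`train_soft_logreg`, `predict_proba`) are made up for illustration and do not reflect how `EndClassifierModel` works internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """P(SPAM | x) under the logistic regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_soft_logreg(X, soft_y, lr=0.5, n_steps=500):
    """Batch gradient descent on cross-entropy with soft targets in [0, 1]."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(n_steps):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(X, soft_y):
            # Gradient of the cross-entropy loss w.r.t. the logit.
            err = predict_proba(weights, bias, x) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

# Toy features: [contains "subscribe", contains a URL] (hypothetical).
X = [[1, 0], [0, 1], [1, 1], [0, 0], [0, 0]]
# Soft labels P(SPAM) as a label model might produce them (made-up values).
soft_y = [0.8, 0.9, 1.0, 0.1, 0.2]

weights, bias = train_soft_logreg(X, soft_y)
print(predict_proba(weights, bias, [1, 1]))  # comment with both markers
print(predict_proba(weights, bias, [0, 0]))  # comment with neither marker
```

Training on soft labels lets the classifier take the label model's uncertainty into account; training on hard labels would amount to rounding `soft_y` to 0 or 1 first.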