# IMDB Dataset - Create Weak Supervision Sources

This notebook shows how to create labeling functions on the IMDB Movie Review dataset and apply them on text data.

The original dataset has gold labels, but we will use these labels only for evaluation purposes, since we want to test models under the weak supervision setting with Knodle. The idea is that you don't have a dataset labeled purely with strong (manual) supervision, and instead use heuristics (e.g. rules) to obtain weak labels. In this tutorial, we will look for certain keywords in the texts that help distinguish between the classes.

First, we load the dataset from a Knodle dataset collection. Then, we will create labeling functions from simple keywords and apply them to the IMDB reviews. Each labeling function will be associated with a target label (positive or negative). To estimate how well our weak labeling works on its own, we will use the resulting keyword matches together with a basic majority vote model. Finally, the preprocessed data will be saved in a Knodle-friendly format, so that other denoising models can be trained with the IMDB dataset.

The IMDB dataset available in the Knodle collection was downloaded from [Kaggle](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) in January 2021. 

## Imports

Let's make some basic imports.

In [1]:
import os
from tqdm import tqdm
from typing import List

import pandas as pd 
import numpy as np 

from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer
from snorkel.labeling import LabelingFunction, PandasLFApplier, filter_unlabeled_dataframe, LFAnalysis
from snorkel.labeling.model import MajorityLabelVoter, LabelModel

# client to access the dataset collection
from minio import Minio
client = Minio("knodle.dm.univie.ac.at", secure=False)

In [2]:
# Init
tqdm.pandas()
# show full cell contents; None replaces the deprecated value -1
pd.set_option('display.max_colwidth', None)



In [3]:
# Constants for Snorkel labeling functions
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1
COLUMN_WITH_TEXT = "reviews_preprocessed"

## Download the dataset

In [4]:
# define the path to the folder where the data will be stored
data_path = "../../../data_from_minio/imdb"

Together with the IMDB data, let us also collect the keywords that we will need later.

In [5]:
files = [("IMDB Dataset.csv",), ("keywords", "negative-words.txt"), ("keywords", "positive-words.txt")]

for file in tqdm(files):
    client.fget_object(
        bucket_name="knodle",
        object_name=os.path.join("datasets/imdb", *file),
        file_path=os.path.join(data_path, file[-1]),
    )

100%|██████████| 3/3 [00:01<00:00,  2.80it/s]


## Preview dataset

After downloading the dataset, we can take a first look at it.

In [6]:
imdb_dataset_raw = pd.read_csv(os.path.join(data_path, "IMDB Dataset.csv"))

In [7]:
imdb_dataset_raw.head(1)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive


In [8]:
imdb_dataset_raw.groupby('sentiment').count()


Unnamed: 0_level_0,review
sentiment,Unnamed: 1_level_1
negative,25000
positive,25000


In [9]:
imdb_dataset_raw.isna().sum()

review       0
sentiment    0
dtype: int64

## Preprocess dataset

Now let's take some basic preprocessing steps.

### Remove Stopwords

We begin by removing all common stop words. We use `scikit-learn`'s stop word list so that we don't have to install too many packages.

In [10]:
imdb_dataset_raw[COLUMN_WITH_TEXT] = imdb_dataset_raw['review'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in (ENGLISH_STOP_WORDS)]))
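Note that the filter above compares raw whitespace-separated tokens against a lowercase stop word list, so capitalized tokens such as "The" or "This" and tokens with attached punctuation are not removed. A minimal sketch of a case-insensitive variant follows. The tiny `STOP_WORDS` set and the function name `remove_stop_words` are hypothetical stand-ins for illustration; the notebook itself uses `ENGLISH_STOP_WORDS`.

```python
import string

# Hypothetical mini stop word list; the notebook uses sklearn's ENGLISH_STOP_WORDS.
STOP_WORDS = frozenset({"the", "this", "is", "a", "of", "was"})


def remove_stop_words(text: str, stop_words=STOP_WORDS) -> str:
    """Drop tokens whose lowercased, punctuation-stripped form is a stop word."""
    kept = []
    for token in text.split():
        normalized = token.lower().strip(string.punctuation)
        if normalized not in stop_words:
            kept.append(token)
    return " ".join(kept)


print(remove_stop_words("The plot of this film was a mess."))
```

The original tokens are kept unchanged in the output; only the comparison is normalized.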

In [11]:
imdb_dataset_raw.head(1)

Unnamed: 0,review,sentiment,reviews_preprocessed
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive,"One reviewers mentioned watching just 1 Oz episode you'll hooked. They right, exactly happened me.<br /><br />The thing struck Oz brutality unflinching scenes violence, set right word GO. Trust me, faint hearted timid. 
This pulls punches regards drugs, sex violence. Its hardcore, classic use word.<br /><br />It called OZ nickname given Oswald Maximum Security State Penitentary. It focuses mainly Emerald City, experimental section prison cells glass fronts face inwards, privacy high agenda. Em City home many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish more....so scuffles, death stares, dodgy dealings shady agreements far away.<br /><br />I say main appeal fact goes shows wouldn't dare. Forget pretty pictures painted mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The episode I saw struck nasty surreal, I couldn't say I ready it, I watched more, I developed taste Oz, got accustomed high levels graphic violence. Not just violence, injustice (crooked guards who'll sold nickel, inmates who'll kill order away it, mannered, middle class inmates turned prison bitches lack street skills prison experience) Watching Oz, comfortable uncomfortable viewing....thats touch darker side."


### Remove HTML Tags

The dataset contains many HTML tags. We'll remove them.

In [12]:
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

imdb_dataset_raw[COLUMN_WITH_TEXT] = imdb_dataset_raw[COLUMN_WITH_TEXT].apply(strip_html)

In [13]:
imdb_dataset_raw[COLUMN_WITH_TEXT].head(1)

0    One reviewers mentioned watching just 1 Oz episode you'll hooked. They right, exactly happened me.The thing struck Oz brutality unflinching scenes violence, set right word GO. Trust me, faint hearted timid. This pulls punches regards drugs, sex violence. Its hardcore, classic use word.It called OZ nickname given Oswald Maximum Security State Penitentary. It focuses mainly Emerald City, experimental section prison cells glass fronts face inwards, privacy high agenda. Em City home many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish more....so scuffles, death stares, dodgy dealings shady agreements far away.I say main appeal fact goes shows wouldn't dare. Forget pretty pictures painted mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The episode I saw struck nasty surreal, I couldn't say I ready it, I watched more, I developed taste Oz, got accustomed high levels graphic violence. Not just violence, injustice (crooked guards who'll sold n
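In the output above, sentences that were separated only by `<br />` tags are now glued together ("me.The", "word.It"). This does not affect plain substring matching, but it matters as soon as matching is done on whole tokens. One possible workaround is sketched here with a plain regular expression instead of BeautifulSoup (an assumption for illustration, not what the notebook does): replace each tag with a space and collapse repeated whitespace. The function name `strip_html_keep_spaces` is hypothetical.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html_keep_spaces(text: str) -> str:
    """Replace every HTML tag with a space, then collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


print(strip_html_keep_spaces("happened with me.<br /><br />The first thing"))
```

With BeautifulSoup, `soup.get_text(separator=" ")` has a similar effect.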

## Keywords

As weak supervision sources we use lists of positive and negative sentiment keywords, available at
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

We have already downloaded them from the Knodle collection, together with the IMDB dataset.

After parsing the keywords from the two separate files, we store them in a `pd.DataFrame` with the corresponding sentiment as "label".

In [14]:
# skip the file header; on pandas < 1.3 use error_bad_lines=False instead of on_bad_lines="skip"
positive_keywords = pd.read_csv(os.path.join(data_path, "positive-words.txt"),
                                sep=" ", header=None, on_bad_lines="skip", skiprows=30)
positive_keywords.columns = ["keyword"]
positive_keywords["label"] = "positive"
positive_keywords.head(2)

Unnamed: 0,keyword,label
0,a+,positive
1,abound,positive


In [15]:
# skip the file header; on pandas < 1.3 use error_bad_lines=False instead of on_bad_lines="skip"
negative_keywords = pd.read_csv(os.path.join(data_path, "negative-words.txt"),
                                sep=" ", header=None, on_bad_lines="skip", encoding='latin-1', skiprows=30)
negative_keywords.columns = ["keyword"]
negative_keywords["label"] = "negative"
negative_keywords.head(2)

Unnamed: 0,keyword,label
0,2-faced,negative
1,2-faces,negative
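The `skiprows=30` argument hard-codes the length of the file header. A sketch of a more tolerant parser is shown below, under the assumption that header lines start with `;` and are followed by blank lines (please verify this against the downloaded files). The sample content and the helper `read_keywords` are hypothetical.

```python
import io

# Hypothetical excerpt; the real files are assumed to start with ';' comment lines.
SAMPLE_FILE = io.StringIO(
    "; Opinion lexicon: positive words\n"
    ";\n"
    "\n"
    "a+\n"
    "abound\n"
)


def read_keywords(handle, comment_prefix=";"):
    """Return non-empty lines that do not start with the comment prefix."""
    keywords = []
    for line in handle:
        word = line.strip()
        if word and not word.startswith(comment_prefix):
            keywords.append(word)
    return keywords


keywords = read_keywords(SAMPLE_FILE)
print(keywords)
```

For the real files, the handle would come from `open(path, encoding="latin-1")`.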


In [16]:
all_keywords = pd.concat([positive_keywords, negative_keywords])
all_keywords.label.value_counts()

negative    4783
positive    2006
Name: label, dtype: int64

In [17]:
# remove keywords that occur in both lists, keeping the first occurrence (positive entries come first)
all_keywords.drop_duplicates('keyword',inplace=True)
all_keywords.label.value_counts()

negative    4780
positive    2006
Name: label, dtype: int64

## Labeling Functions

Now we start to build labeling functions with Snorkel from these keywords and check their coverage.

This is an iterative process, of course, so we will surely have to add more keywords and regular expressions ;-)

In [18]:
def keyword_lookup(x, keyword, label):
    return label if keyword in x[COLUMN_WITH_TEXT].lower() else ABSTAIN
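The lookup above is a plain substring test, so a keyword such as "bad" also fires on "badge". If that turns out to be too noisy, a whole-word variant could look like the following sketch. For self-containment it takes a plain string instead of a DataFrame row, and the name `keyword_lookup_whole_word` is hypothetical.

```python
import re

ABSTAIN = -1
NEGATIVE = 0


def keyword_lookup_whole_word(text: str, keyword: str, label: int) -> int:
    """Return label if keyword occurs as a standalone word in text, else ABSTAIN."""
    # Lookarounds instead of \b, so that keywords like "a+" also work.
    pattern = r"(?<!\w)" + re.escape(keyword.lower()) + r"(?!\w)"
    return label if re.search(pattern, text.lower()) else ABSTAIN


print(keyword_lookup_whole_word("He wore a badge.", "bad", NEGATIVE))
print(keyword_lookup_whole_word("A bad movie.", "bad", NEGATIVE))
```

The first call abstains, the second returns the negative label. Whether whole-word matching improves the weak labels on this dataset is not evaluated here.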


In [19]:
def make_keyword_lf(keyword: str, label: int) -> LabelingFunction:
    """
    Creates a labeling function based on a keyword.
    Args:
        keyword: the keyword to look for
        label: the label id that this keyword implies

    Returns: LabelingFunction object

    """
    return LabelingFunction(
        name=f"keyword_{keyword}",
        f=keyword_lookup,
        resources=dict(keyword=keyword, label=label),
    )

In [20]:
def create_labeling_functions(keywords: pd.DataFrame) -> np.ndarray:
    """
    Create labeling functions based on the columns keyword and label_id.

    Args:
        keywords: DataFrame with processed keywords

    Returns:
        All labeling functions. 1d array with shape: (number_of_lfs,)
    """
    keywords = keywords.assign(lf=keywords.progress_apply(
        lambda x:make_keyword_lf(x.keyword, x.label_id), axis=1
    ))
    lfs = keywords.lf.values
    return lfs

In [21]:
all_keywords["label_id"] = all_keywords.label.map({'positive':POSITIVE, 'negative':NEGATIVE})

In [22]:
all_keywords

Unnamed: 0,keyword,label,label_id
0,a+,positive,1
1,abound,positive,1
2,abounds,positive,1
3,abundance,positive,1
4,abundant,positive,1
...,...,...,...
4778,zaps,negative,0
4779,zealot,negative,0
4780,zealous,negative,0
4781,zealously,negative,0


In [23]:
labeling_functions = create_labeling_functions(all_keywords)

100%|██████████| 6786/6786 [00:00<00:00, 53479.31it/s]


In [24]:
labeling_functions

array([LabelingFunction keyword_a+, Preprocessors: [],
       LabelingFunction keyword_abound, Preprocessors: [],
       LabelingFunction keyword_abounds, Preprocessors: [], ...,
       LabelingFunction keyword_zealous, Preprocessors: [],
       LabelingFunction keyword_zealously, Preprocessors: [],
       LabelingFunction keyword_zombie, Preprocessors: []], dtype=object)

### Apply Labeling Functions

Now let's apply all labeling functions to our reviews and check some statistics.

In [25]:
applier = PandasLFApplier(lfs=labeling_functions)
applied_lfs = applier.apply(df=imdb_dataset_raw)

100%|██████████| 50000/50000 [35:42<00:00, 23.34it/s]   
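Applying 6786 labeling functions one by one to 50000 reviews took about 35 minutes here. If whole-token matching is acceptable (which is different from the substring test used above and cannot match multi-word keywords), a matrix of the same shape can be built with a single dictionary lookup per token. The following is only a sketch of that idea; `apply_keywords` and its arguments are hypothetical names, and the tokenization rule is an assumption.

```python
import re

import numpy as np

ABSTAIN = -1


def apply_keywords(texts, keywords, label_ids):
    """Build an (instances x keywords) label matrix via one token lookup per review."""
    column_of = {kw: j for j, kw in enumerate(keywords)}
    matrix = np.full((len(texts), len(keywords)), ABSTAIN, dtype=int)
    for i, text in enumerate(texts):
        # Tokens are lowercase runs of letters, digits, '+', '-' and apostrophes.
        for token in set(re.findall(r"[a-z0-9+'-]+", text.lower())):
            j = column_of.get(token)
            if j is not None:
                matrix[i, j] = label_ids[j]
    return matrix


toy_matrix = apply_keywords(
    ["A great, great film.", "Awful plot but great cast."],
    keywords=["great", "awful"],
    label_ids=[1, 0],
)
print(toy_matrix)
```

Because the matching rule differs, the resulting matrix would not be identical to `applied_lfs`.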


In [26]:
applied_lfs

array([[-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       ...,
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1]])

Now we have a matrix with all labeling functions applied. This matrix has the shape $(\text{instances} \times \text{labeling functions})$.

In [27]:
print("Shape of applied labeling functions: ", applied_lfs.shape)
print("Number of reviews", len(imdb_dataset_raw))
print("Number of labeling functions", len(labeling_functions))

Shape of applied labeling functions:  (50000, 6786)
Number of reviews 50000
Number of labeling functions 6786


### Analysis

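Now we can analyse some basic statistics about our labeling functions. The main figures are:

- Coverage: the fraction of reviews on which a labeling function matches at all
- Overlaps: the fraction of reviews on which a labeling function matches together with at least one other labeling function (e.g. awesome and amazing)
- Conflicts: the fraction of reviews on which a labeling function matches together with at least one labeling function of a different label (e.g. awesome and bad)
- Correct: the number of matches that agree with the gold label (only reported if gold labels are passed to `lf_summary`)
- Incorrect: the number of matches that disagree with the gold label (only reported if gold labels are passed to `lf_summary`)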
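To make the first three figures concrete, here is a small NumPy sketch that computes them by hand for a toy label matrix. It follows the definitions given above and is not Snorkel's implementation; the matrix `L` and all variable names are made up for illustration.

```python
import numpy as np

ABSTAIN = -1

# Toy label matrix: 4 instances x 3 labeling functions.
L = np.array([
    [1, -1, 0],
    [1, 1, -1],
    [-1, -1, -1],
    [-1, -1, 0],
])

labeled = L != ABSTAIN                         # where each LF voted
n_votes = labeled.sum(axis=1, keepdims=True)   # number of votes per instance

coverage = labeled.mean(axis=0)
overlaps = (labeled & (n_votes > 1)).mean(axis=0)

# An instance is conflicted if its non-abstain votes are not all identical.
# Abstains are replaced by the row maximum so they cannot lower the row minimum.
row_max = L.max(axis=1, keepdims=True)
row_min = np.where(labeled, L, row_max).min(axis=1, keepdims=True)
conflicted = row_max != row_min
conflicts = (labeled & conflicted).mean(axis=0)

print("coverage ", coverage)
print("overlaps ", overlaps)
print("conflicts", conflicts)
```

In the toy matrix only the first instance receives votes for both classes, so only the two functions voting there have a non-zero conflict rate.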

In [93]:
LFAnalysis(L=applied_lfs, lfs=labeling_functions).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
keyword_a+,0,[1],0.00102,0.00102,0.00102
keyword_abound,1,[1],0.00264,0.00264,0.00264
keyword_abounds,2,[1],0.00052,0.00052,0.00052
keyword_abundance,3,[1],0.00182,0.00182,0.00182
keyword_abundant,4,[1],0.00114,0.00114,0.00114
...,...,...,...,...,...
keyword_zaps,6781,[0],0.00012,0.00012,0.00012
keyword_zealot,6782,[0],0.00048,0.00048,0.00048
keyword_zealous,6783,[0],0.00092,0.00092,0.00092
keyword_zealously,6784,[0],0.00006,0.00006,0.00006


In [94]:
lf_analysis = LFAnalysis(L=applied_lfs, lfs=labeling_functions).lf_summary()

In [95]:
pd.DataFrame(lf_analysis.mean())

Unnamed: 0,0
j,3392.5
Coverage,0.005002
Overlaps,0.005002
Conflicts,0.004997


In [96]:
pd.DataFrame(lf_analysis.median())

Unnamed: 0,0
j,3392.5
Coverage,0.00052
Overlaps,0.00052
Conflicts,0.00052


Let's have a look at some examples that were labeled by a positive keyword. You can see that the true label for some of them is negative.

In [44]:
# consider the keyword at position 1110
all_keywords.iloc[1110]

keyword     loyalty 
label       positive
label_id    1       
Name: 1110, dtype: object

In [45]:
# sample 2 random examples where this LF (column 1110) assigned a positive label
imdb_dataset_raw.iloc[applied_lfs[:, 1110] == POSITIVE, :].sample(2, random_state=1).loc[:, [COLUMN_WITH_TEXT,'sentiment']]

Unnamed: 0,reviews_preprocessed,sentiment
31052,"I finally got wish cinema. I'd seen Fritz Lang's film video years ago. I'd hoping ideal screening conditions work magic.Conditions ideal Cinematheque Ontario. Pristine full-length print. Intertitles original Gothic-script German simultaneous English translation, accurate literal. Live piano accompaniment. Ideal.The film's magic sputtered little ultimately failed catch, me.This film bears real relation Wagner's Ring cycle I knew not. Wagner adapted 13th c. Niebelungenlied purposes. Part I Fritz Lang's epic -- ""Siegfried"" -- familiar listeners Wagner however.""Kriemhild's Revenge"" story Siegfried's wife Kriemhild, marriage King Etzel (Attila) Hun, desire revenge Hagen Gunther, rechristened Nibelungs, murder Siegfried. The spectacular conflagration film presumably evolved expanded Wagnerian mythos Götterdämmerung, Twilight Gods, end Valhalla. This film remains earthbound.Most film spectacular. The massive sets rival ""Cabiria"" (1914), inspired Griffith's ""Intolerance"" (1916). Their decoration sets new benchmark barbaric splendour. There's huge cast scarred, mangy Huns Art Deco Burgundians. And battles. Battles end fact.Kriemhild successful plan revenge. She manages destroy her. Her loyalty martyred Siegfried stem love, devotion, closer psychosis. Lady Macbeth cried out, ""Unsex here."" She knew emotionally unprepared needed do. But Kriemhild displays normal human emotions, certainly equates feminine principle. She ""top direst cruelty"", borrow Shakespeare's phrase, outset. Margarethe Schön director convey glower. I don't want exaggerate, glower virtually expression ""animate"" Kriemhild's face. It's ultimate one-note performances. It's clearly intentional however, simply case poor acting.What offer one-dimensional sketch avenging Fury. Some Kriemhild empowered heroine. I just film misogynistic.",positive
1798,"""Laugh, Clown Laugh"" released 1928, stars legendary Lon Chaney circus clown named Tito. Tito raised foundling (a young beautiful Loretta Young) adulthood names Simonetta. Tito raised girl circus life, accomplished ballerina. While Chaney gives usual great performance, I past fact Tito, middle age, hots young Simonetta. Although biological father, raised like daughter. That kind ""ick"" factor permeates film. Tito competes Simonetta's affections young handsome 'Count' Luigi (Nils Asther). Simonetta clearly falls young man, feels guilt abandoning Tito (out loyalty, romantic love). The premise film ridiculous, I amazing film tells Tito stupid old fool (until reveals end). The film noteworthy Loretta Young, great career. While I adore Chaney's brilliance actor, film just downright creepy.",negative


## Transform rule matches

To work with Knodle, the dataset needs to be transformed into a binary array $Z$

(shape: `#instances x #rules`).

A cell $Z_{ij}$ indicates for instance $i$:

    0 -> Rule j didn't match 
    1 -> Rule j matched

Furthermore, we need a matrix `mapping_rules_labels`, which maps every rule to its label and is stored in a binary manner as well

(shape `#rules x #labels`).
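The following sketch shows how these two matrices can be derived from a Snorkel-style label matrix on a toy example. It illustrates the data layout only and is not Knodle's implementation; in particular, leaving an all-zero row for a rule that never fires is an assumption made here.

```python
import numpy as np

ABSTAIN = -1

# Toy label matrix: 3 instances x 3 rules, two classes (0 = negative, 1 = positive).
applied = np.array([
    [1, -1, 0],
    [1, -1, -1],
    [-1, -1, -1],
])
n_classes = 2

# Z: 1 where a rule fired, regardless of which label it voted for.
Z = (applied != ABSTAIN).astype(int)

# T: one-hot label per rule, read from any non-abstain vote of that rule.
T = np.zeros((applied.shape[1], n_classes), dtype=int)
for j in range(applied.shape[1]):
    votes = applied[:, j][applied[:, j] != ABSTAIN]
    if votes.size:  # assumption: a rule that never fired keeps an all-zero row
        T[j, votes[0]] = 1

print(Z)
print(T)
```

Reading the label from the first vote is sufficient here because each keyword rule always votes for the same class.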

In [58]:
from knodle.data.labelling_fcts import transform_rule_class_matrix_to_z_t

2021-03-25 22:20:48,979 root         INFO     Initalized logger


In [59]:
rule_matches, mapping_rules_labels = transform_rule_class_matrix_to_z_t(applied_lfs)

### Majority Vote

Now we make a majority vote based on all rule matches. First we get the `rule_counts` by multiplying `rule_matches` with `mapping_rules_labels`, then we divide each row by its sum to get probability values. Finally, we counteract the division-by-zero issue by setting all NaN values to zero. All this happens in the `z_t_matrices_to_majority_vote_labels` function.
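The steps described above can be sketched in NumPy as follows. This mirrors the description and is not Knodle's code; the toy matrices, the fixed random seed and the helper `pick_label` are assumptions for illustration, including the choice to treat an instance without any match as a tie.

```python
import numpy as np

rng = np.random.default_rng(0)

Z = np.array([[1, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 0]])   # instances x rules
T = np.array([[0, 1], [0, 1], [1, 0], [1, 0]])             # rules x labels

rule_counts = Z @ T                                  # votes per class and instance
totals = rule_counts.sum(axis=1, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    probs = rule_counts / totals                     # rows without matches become nan
probs = np.nan_to_num(probs, nan=0.0)


def pick_label(row):
    """Argmax with ties broken uniformly at random."""
    winners = np.flatnonzero(row == row.max())
    return int(rng.choice(winners))


pred = np.array([pick_label(row) for row in probs])
print(rule_counts)
print(pred)
```

The first instance gets two negative votes and one positive vote, so it is labeled negative; the last instance has no match and receives a random label.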

In [60]:
from knodle.transformation.majority import z_t_matrices_to_majority_vote_labels

In [61]:
# the ties are resolved randomly internally, so the predictions might slightly vary
pred_labels = z_t_matrices_to_majority_vote_labels(rule_matches, mapping_rules_labels)



There are more negative labels predicted by the majority vote.

In [62]:
np.unique(pred_labels, return_counts=True)

(array([0, 1]), array([37063, 12937]))

In [63]:
# accuracy of the weak labels
(pred_labels == imdb_dataset_raw.sentiment.map({'positive':POSITIVE, 'negative':NEGATIVE})).mean()

0.6637

## Save Files


In [64]:
from joblib import dump
from scipy.sparse import csr_matrix

In [65]:
rule_matches_sparse = csr_matrix(rule_matches)

In [66]:
dump(rule_matches_sparse, os.path.join(data_path, "rule_matches.lib"))
dump(mapping_rules_labels,  os.path.join(data_path, "mapping_rules_labels.lib"))

['../../../data/data_from_minio/imdb/mapping_rules_labels.lib']

We also save the preprocessed texts with labels to use them later to evaluate a classifier.

In [67]:
imdb_dataset_raw['label_id'] = imdb_dataset_raw.sentiment.map({'positive':POSITIVE, 'negative':NEGATIVE})

In [68]:
imdb_dataset_raw.to_csv(os.path.join(data_path, 'imdb_data_preprocessed.csv'), index=False)

In [69]:
all_keywords.to_csv(os.path.join(data_path, 'keywords.csv'), index=False)

In [70]:
os.listdir(data_path)

['rule_matches.lib',
 'keywords.csv',
 'positive-words.txt',
 'imdb_data_preprocessed.csv',
 'mapping_rules_labels.lib',
 'IMDB Dataset.csv',
 'negative-words.txt']

# Finish

Now we have created a weak supervision dataset. Of course it is not perfect, but it gives us something on which we can compare the performance of different denoising methods. :-)