# IMDB Dataset - Create Weak Supervision Sources and Get the Weak Data Annotations

This notebook shows how to use keywords as a weak supervision source, using the well-known IMDB Movie Review dataset, which targets a binary sentiment analysis task.

The original dataset has gold labels, but we use them only for evaluation, since we want to test models in the weak supervision setting with Knodle. The idea is that you do not have a dataset labeled purely with strong (manual) supervision and instead use heuristics (e.g. rules) to obtain a weak labeling. In this tutorial, we look for keywords expressing positive and negative sentiment that can help to distinguish between the classes. Specifically, we use the [Opinion lexicon](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) of the University of Illinois at Chicago.

First, we load the dataset from a Knodle dataset collection. Then, we create [Snorkel](https://www.snorkel.org/) labeling functions from two sets of keywords and apply them to the IMDB reviews. Please keep in mind that keyword matching can be done without Snorkel; however, we use the library in this tutorial for its transparent annotation functionality.
Each labeling function (i.e. keyword) is then associated with its target label. This concludes the annotation step.

To estimate how well our weak labeling works on its own, we will use the resulting keyword matches together with a basic majority vote model. Finally, the preprocessed data will be saved in a Knodle-friendly format, so that other denoising models can be trained with the IMDB dataset.

The IMDB dataset available in the Knodle collection was downloaded from [Kaggle](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) in January 2021. 

## Imports

Let's make some basic imports.

In [2]:
import os
from tqdm import tqdm
from typing import List

import pandas as pd 
import numpy as np 

from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer
from snorkel.labeling import LabelingFunction, PandasLFApplier, filter_unlabeled_dataframe, LFAnalysis
from snorkel.labeling.model import MajorityLabelVoter, LabelModel

# client to access the dataset collection
from minio import Minio
client = Minio("knodle.dm.univie.ac.at", secure=False)

In [3]:
# Init
tqdm.pandas()
# None disables column-width truncation (-1 is deprecated in recent pandas versions)
pd.set_option('display.max_colwidth', None)



In [4]:
# Constants for Snorkel labeling functions
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1
COLUMN_WITH_TEXT = "reviews_preprocessed"

## Download the dataset

In [5]:
# define the path to the folder where the data will be stored
data_path = "../../../data_from_minio/imdb"

Together with the IMDB data, let us also collect the keywords that we will need later.

In [6]:
files = [("IMDB Dataset.csv",), ("keywords", "negative-words.txt"), ("keywords", "positive-words.txt")]

for file in tqdm(files):
    client.fget_object(
        bucket_name="knodle",
        object_name=os.path.join("datasets/imdb", *file),
        file_path=os.path.join(data_path, file[-1]),
    )

100%|██████████| 3/3 [00:04<00:00,  1.36s/it]


## Preview dataset

After downloading the dataset, we can take a first look at it.

In [7]:
imdb_dataset_raw = pd.read_csv(os.path.join(data_path, "IMDB Dataset.csv"))

In [8]:
imdb_dataset_raw.head(1)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive


In [9]:
imdb_dataset_raw.groupby('sentiment').count()


Unnamed: 0_level_0,review
sentiment,Unnamed: 1_level_1
negative,25000
positive,25000


In [10]:
imdb_dataset_raw.isna().sum()

review       0
sentiment    0
dtype: int64

## Preprocess dataset

Now let's take some basic preprocessing steps.

### Remove Stopwords

Removing stopwords can be a reasonable step for some classifiers, but since we use BERT among other approaches, we want to keep the sentence structure and hence do not remove stopwords just yet.

### Remove HTML Tags

The dataset contains many HTML tags. We'll remove them.

In [11]:
def strip_html(text):
    # parse the review as HTML and keep only the visible text
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

imdb_dataset_raw[COLUMN_WITH_TEXT] = imdb_dataset_raw["review"].apply(strip_html)

In [12]:
imdb_dataset_raw[COLUMN_WITH_TEXT].head(1)

0    One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other show
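
Note that in the output above some sentences are glued together (e.g. `me.The first`), because the `<br />` tags are removed without leaving any whitespace behind. If this matters for your model, one option is to replace line-break tags with a space before stripping the remaining HTML. The following is only a minimal sketch based on the standard library; `replace_line_breaks` is a hypothetical helper that is not part of this notebook's pipeline, and all outputs shown in this notebook were produced without it.

```python
import re

# matches <br>, <br/> and <br /> in any letter case
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def replace_line_breaks(text: str) -> str:
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    text = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "what happened with me.<br /><br />The first thing that struck me"
cleaned = replace_line_breaks(sample)
print(cleaned)
```

Such a helper could be applied to the `review` column before `strip_html`, at the cost of slightly different keyword matches than the ones reported below.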

## Keywords

As weak supervision sources, we use sentiment keyword lists of positive and negative words.

We downloaded them from the Knodle collection earlier, together with the IMDB dataset.

After parsing the keywords from the separate files, they are stored in a `pd.DataFrame` with the corresponding sentiment as "label".

In [13]:
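# skiprows=30 skips the lines before the word list;
# on_bad_lines="skip" replaces the deprecated error_bad_lines=False (pandas >= 1.3)
positive_keywords = pd.read_csv(os.path.join(data_path, "positive-words.txt"),
                                sep=" ", header=None, on_bad_lines="skip", skiprows=30)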
positive_keywords.columns = ["keyword"]
positive_keywords["label"] = "positive"
positive_keywords.head(2)

Unnamed: 0,keyword,label
0,a+,positive
1,abound,positive


In [14]:
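# same parsing as above; on_bad_lines="skip" replaces the deprecated error_bad_lines=False (pandas >= 1.3)
negative_keywords = pd.read_csv(os.path.join(data_path, "negative-words.txt"),
                                sep=" ", header=None, on_bad_lines="skip", encoding='latin-1', skiprows=30)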
negative_keywords.columns = ["keyword"]
negative_keywords["label"] = "negative"
negative_keywords.head(2)

Unnamed: 0,keyword,label
0,2-faced,negative
1,2-faces,negative
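
The fixed `skiprows=30` above depends on the exact layout of the lexicon files. As an alternative, the files could be parsed by content rather than by line position. The sketch below assumes that header lines start with `;` and that the word list contains one keyword per line; this is an assumption about the file format, and `parse_lexicon` is a hypothetical helper, not part of Knodle or pandas.

```python
from typing import Iterable, List


def parse_lexicon(lines: Iterable[str], comment_prefix: str = ";") -> List[str]:
    """Collect one keyword per line, skipping blank lines and comment lines."""
    keywords = []
    for line in lines:
        word = line.strip()
        # assumption: header/comment lines start with comment_prefix
        if word and not word.startswith(comment_prefix):
            keywords.append(word)
    return keywords


# illustrative stand-in for the first lines of a lexicon file
sample_file = [
    "; illustrative header line\n",
    ";\n",
    "\n",
    "a+\n",
    "abound\n",
]
parsed = parse_lexicon(sample_file)
print(parsed)
```

For a real file, the same function could be called on an opened file handle, e.g. `parse_lexicon(open(path, encoding="latin-1"))`.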


In [15]:
all_keywords = pd.concat([positive_keywords, negative_keywords])
all_keywords.label.value_counts()

negative    4783
positive    2006
Name: label, dtype: int64

In [16]:
# remove overlap of keywords between the two sentiments;
# drop_duplicates keeps the first occurrence, i.e. the positive entry
all_keywords.drop_duplicates('keyword',inplace=True)
all_keywords.label.value_counts()

negative    4780
positive    2006
Name: label, dtype: int64

## Labeling Functions

Now we build Snorkel labeling functions from these keywords and check their coverage.

This is, of course, an iterative process, so more keywords and regular expressions would surely have to be added over time ;-)

In [17]:
def keyword_lookup(x, keyword, label):
    return label if keyword in x[COLUMN_WITH_TEXT].lower() else ABSTAIN
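
Keep in mind that `keyword in text` is a plain substring test, so a short keyword also fires inside longer words. The sketch below contrasts this with a whole-word match; it uses simplified, hypothetical functions that take the text directly instead of a DataFrame row, and it is not the matching used for the results in this notebook.

```python
import re

ABSTAIN = -1
NEGATIVE = 0


def substring_lookup(text: str, keyword: str, label: int) -> int:
    """Same logic as keyword_lookup: fire if the keyword occurs anywhere in the text."""
    return label if keyword in text.lower() else ABSTAIN


def whole_word_lookup(text: str, keyword: str, label: int) -> int:
    """Fire only if the keyword is not directly surrounded by word characters."""
    # lookarounds instead of \b so that keywords such as "a+" are handled as well
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return label if re.search(pattern, text.lower()) else ABSTAIN


text = "The badge on his uniform looked great."
substring_result = substring_lookup(text, "bad", NEGATIVE)
whole_word_result = whole_word_lookup(text, "bad", NEGATIVE)
print(substring_result, whole_word_result)
```

Here the substring variant assigns the negative label because "bad" occurs inside "badge", while the whole-word variant abstains. Which behaviour is preferable depends on the keyword list; substring matching also catches inflected forms that are missing from the lexicon.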


In [18]:
def make_keyword_lf(keyword: str, label: int) -> LabelingFunction:
    """
    Creates a labeling function based on a keyword.
    Args:
        keyword: the keyword to look for
        label: the label id this keyword implies

    Returns: LabelingFunction object

    """
    return LabelingFunction(
        name=f"keyword_{keyword}",
        f=keyword_lookup,
        resources=dict(keyword=keyword, label=label),
    )

In [19]:
def create_labeling_functions(keywords: pd.DataFrame) -> np.ndarray:
    """
    Create labeling functions based on the columns keyword and label_id.

    Args:
        keywords: DataFrame with processed keywords (columns keyword and label_id)

    Returns:
        All labeling functions. 1d array with shape: (number_of_lfs,)
    """
    # build one LabelingFunction per row and collect them in a temporary column lf
    keywords = keywords.assign(lf=keywords.progress_apply(
        lambda x: make_keyword_lf(x.keyword, x.label_id), axis=1
    ))
    lfs = keywords.lf.values
    return lfs

In [20]:
all_keywords["label_id"] = all_keywords.label.map({'positive':POSITIVE, 'negative':NEGATIVE})

In [21]:
all_keywords

Unnamed: 0,keyword,label,label_id
0,a+,positive,1
1,abound,positive,1
2,abounds,positive,1
3,abundance,positive,1
4,abundant,positive,1
...,...,...,...
4778,zaps,negative,0
4779,zealot,negative,0
4780,zealous,negative,0
4781,zealously,negative,0


In [22]:
labeling_functions = create_labeling_functions(all_keywords)

100%|██████████| 6786/6786 [00:00<00:00, 37675.23it/s]


In [23]:
labeling_functions

array([LabelingFunction keyword_a+, Preprocessors: [],
       LabelingFunction keyword_abound, Preprocessors: [],
       LabelingFunction keyword_abounds, Preprocessors: [], ...,
       LabelingFunction keyword_zealous, Preprocessors: [],
       LabelingFunction keyword_zealously, Preprocessors: [],
       LabelingFunction keyword_zombie, Preprocessors: []], dtype=object)

### Apply Labeling Functions

Now let's apply all labeling functions to our reviews and check some statistics.

In [24]:
applier = PandasLFApplier(lfs=labeling_functions)
applied_lfs = applier.apply(df=imdb_dataset_raw)

100%|██████████| 50000/50000 [28:37<00:00, 29.11it/s]


In [25]:
applied_lfs

array([[-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       ...,
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1]])

Now we have a matrix with all labeling functions applied. This matrix has the shape $(\text{instances} \times \text{labeling functions})$.

In [26]:
print("Shape of applied labeling functions: ", applied_lfs.shape)
print("Number of reviews", len(imdb_dataset_raw))
print("Number of labeling functions", len(labeling_functions))

Shape of applied labeling functions:  (50000, 6786)
Number of reviews 50000
Number of labeling functions 6786


### Analysis

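Now we can analyse some basic statistics about our labeling functions. The main figures, each reported per labeling function, are:

- Coverage: the fraction of reviews to which the labeling function assigns a label
- Overlaps: the fraction of reviews labeled by this labeling function and at least one other one (e.g. awesome and amazing)
- Conflicts: the fraction of reviews where this labeling function and at least one other one assign different labels (e.g. awesome and bad)
- Correct: the number of reviews the labeling function labels correctly (reported only if gold labels are passed)
- Incorrect: the number of reviews the labeling function labels incorrectly (reported only if gold labels are passed)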
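
To make the first three figures concrete, here is a small NumPy sketch that recomputes them on a toy label matrix. It follows our reading of the definitions above and is not Snorkel's implementation; the matrix `L` and all variable names are made up for illustration.

```python
import numpy as np

ABSTAIN = -1

# toy label matrix: 4 instances (rows) x 3 labeling functions (columns)
L = np.array([
    [ 1, -1,  0],
    [ 1,  1, -1],
    [-1, -1, -1],
    [-1,  1, -1],
])

labeled = L != ABSTAIN

# coverage: fraction of instances each LF labels
coverage = labeled.mean(axis=0)

# overlaps: the LF labels the instance and at least one other LF does too
n_votes = labeled.sum(axis=1, keepdims=True)
overlaps = (labeled & (n_votes > 1)).mean(axis=0)

# conflicts: the LF labels the instance and another LF assigns a different label
conflicts = np.zeros(L.shape[1])
for j in range(L.shape[1]):
    others_differ = (labeled & (L != L[:, [j]])).any(axis=1)
    conflicts[j] = (labeled[:, j] & others_differ).mean()

print(coverage.tolist())   # [0.5, 0.5, 0.25]
print(overlaps.tolist())   # [0.5, 0.25, 0.25]
print(conflicts.tolist())  # [0.25, 0.0, 0.25]
```

With several thousand keyword rules, almost every review that matches one keyword also matches others, so coverage and overlaps tend to be nearly identical for each rule.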

In [27]:
LFAnalysis(L=applied_lfs, lfs=labeling_functions).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
keyword_a+,0,[1],0.00102,0.00102,0.00102
keyword_abound,1,[1],0.00264,0.00264,0.00264
keyword_abounds,2,[1],0.00052,0.00052,0.00052
keyword_abundance,3,[1],0.00182,0.00182,0.00182
keyword_abundant,4,[1],0.00114,0.00114,0.00114
...,...,...,...,...,...
keyword_zaps,6781,[0],0.00012,0.00012,0.00012
keyword_zealot,6782,[0],0.00048,0.00048,0.00048
keyword_zealous,6783,[0],0.00092,0.00092,0.00092
keyword_zealously,6784,[0],0.00006,0.00006,0.00006


In [28]:
lf_analysis = LFAnalysis(L=applied_lfs, lfs=labeling_functions).lf_summary()

In [29]:
pd.DataFrame(lf_analysis.mean())

Unnamed: 0,0
j,3392.5
Coverage,0.005163
Overlaps,0.005163
Conflicts,0.005159


In [30]:
pd.DataFrame(lf_analysis.median())

Unnamed: 0,0
j,3392.5
Coverage,0.00052
Overlaps,0.00052
Conflicts,0.00052


Let's have a look at some examples that were labeled by a positive keyword. You can see that the true label for some of them is negative.

In [31]:
# consider the keyword at position 1110
all_keywords.iloc[1110]

keyword     loyalty 
label       positive
label_id    1       
Name: 1110, dtype: object

In [32]:
# sample 2 random examples where the LF at position 1110 assigned a positive label
imdb_dataset_raw.iloc[applied_lfs[:, 1110] == POSITIVE, :].sample(2, random_state=1).loc[:, [COLUMN_WITH_TEXT,'sentiment']]

Unnamed: 0,reviews_preprocessed,sentiment
31052,"I finally got my wish to see this one in a cinema. I'd seen Fritz Lang's film on video some years ago. I'd been hoping that ideal screening conditions would work their magic.Conditions were ideal at Cinematheque Ontario. Pristine full-length print. Intertitles in the original Gothic-script German with simultaneous English translation, accurate without being too literal. Live piano accompaniment. Ideal.The film's magic sputtered for a little while but ultimately failed to catch, at least for me.This film bears no real relation to Wagner's Ring cycle as I already knew but some may not. Wagner had adapted the 13th c. Niebelungenlied to his own purposes. Part I of Fritz Lang's epic -- ""Siegfried"" -- has much that will be familiar to listeners of Wagner however.""Kriemhild's Revenge"" is the story of Siegfried's wife Kriemhild, her marriage to King Etzel (Attila) the Hun, and her desire for revenge against Hagen and Gunther, the rechristened Nibelungs, for the murder of Siegfried. The spectacular conflagration in this film presumably evolved and expanded in the Wagnerian mythos into his Götterdämmerung, his Twilight of the Gods, and the end of Valhalla. This film remains earthbound.Most of the film is spectacular. The massive sets rival those of ""Cabiria"" (1914), which inspired Griffith's ""Intolerance"" (1916). Their decoration sets a new benchmark in barbaric splendour. There's a huge cast of scarred, mangy Huns and Art Deco Burgundians. And battles. Battles that never seem to end in fact.Kriemhild is very successful in her plan of revenge. She manages to destroy all around her. Her loyalty to her martyred Siegfried seems not to stem so much from love, or devotion, but from something closer to psychosis. Lady Macbeth cried out, ""Unsex me here."" She knew she was emotionally unprepared for what she needed to do. But Kriemhild displays no normal human emotions, and certainly nothing one equates with the feminine principle. 
She is already ""top full of direst cruelty"", to borrow Shakespeare's phrase, from the outset. Margarethe Schön and her director convey this with a glower. I don't want to exaggerate, but that glower is virtually the only expression ever to ""animate"" Kriemhild's face. It's the ultimate in one-note performances. It's clearly intentional however, not simply a case of poor acting.What we have then on offer is a one-dimensional sketch of an avenging Fury. Some might see Kriemhild as an empowered heroine. I just see the film as misogynistic.",positive
1798,"""Laugh, Clown Laugh"" released in 1928, stars the legendary Lon Chaney as a circus clown named Tito. Tito has raised a foundling (a young and beautiful Loretta Young) to adulthood and names her Simonetta. Tito has raised the girl in the circus life, and she has become an accomplished ballerina. While Chaney gives his usual great performance, I could not get past the fact that Tito, now well into middle age, has the hots for the young Simonetta. Although he is not her biological father, he has raised her like a daughter. That kind of ""ick"" factor permeates throughout the film. Tito competes for Simonetta's affections with a young and handsome 'Count' Luigi (Nils Asther). Simonetta clearly falls for the young man, but feels guilt about abandoning Tito (out of loyalty, not romantic love). The whole premise of the film is ridiculous, and I find it amazing that no one in the film tells Tito what a stupid old fool he is being (until he reveals it himself at the end). The film is noteworthy only because of Loretta Young, who would go on to have a great career. While I adore Chaney's brilliance as an actor, this whole film seems off to me and just downright creepy.",negative


## Transform rule matches

To work with Knodle, the dataset needs to be transformed into a binary array $Z$ (shape: `#instances x #rules`), where a cell $Z_{ij}$ means that for instance $i$

    0 -> Rule j didn't match 
    1 -> Rule j matched

Furthermore, we need a matrix `mapping_rules_labels_t` that maps all rules to labels, stored in a binary manner as well (shape: `#rules x #labels`).

In [38]:
from knodle.transformation.rule_label_format import transform_snorkel_matrix_to_z_t

rule_matches_z, mapping_rules_labels_t = transform_snorkel_matrix_to_z_t(applied_lfs)
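
To illustrate the two target formats, the following NumPy sketch converts a tiny Snorkel-style matrix by hand. It is not Knodle's implementation of `transform_snorkel_matrix_to_z_t`; it assumes that every labeling function emits a single label, and the toy matrix `applied` is made up for illustration.

```python
import numpy as np

ABSTAIN = -1
N_LABELS = 2  # negative (0) and positive (1)

# toy Snorkel-style matrix: 3 instances x 3 labeling functions
applied = np.array([
    [ 1, -1, -1],
    [-1,  0, -1],
    [ 1,  0, -1],
])

# Z: 1 where a rule matched, 0 where it abstained
z = (applied != ABSTAIN).astype(int)

# T: one row per rule with a 1 in the column of the label the rule emits
t = np.zeros((applied.shape[1], N_LABELS), dtype=int)
for j in range(applied.shape[1]):
    emitted = np.unique(applied[applied[:, j] != ABSTAIN, j])
    # assumption of this sketch: rules that never fire keep an all-zero row
    if emitted.size:
        t[j, emitted[0]] = 1

print(z.tolist())  # [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
print(t.tolist())  # [[0, 1], [1, 0], [0, 0]]
```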

### Majority vote

Now we take a majority vote based on all rule matches. First, we get the `rule_counts` by multiplying `rule_matches_z` with `mapping_rules_labels_t`; then we divide each row by its sum to get probability values. Finally, we counteract the division-by-zero issue by setting all NaN values to zero. All this happens in the `z_t_matrices_to_majority_vote_probs` function.

In [39]:
from knodle.transformation.majority import z_t_matrices_to_majority_vote_probs

In [40]:
# the ties are resolved randomly internally, so the predictions might slightly vary
pred_labels = z_t_matrices_to_majority_vote_probs(rule_matches_z, mapping_rules_labels_t)



In [43]:
# accuracy of the weak labels
acc = (pred_labels == imdb_dataset_raw.sentiment.map({'positive':POSITIVE, 'negative':NEGATIVE})).mean()
f"Accuracy of majority vote on weak labeling: {acc}"

'Accuracy of majority vote on weak labeling: 0.66208'
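
The majority vote steps described in this section can be sketched in plain NumPy as follows. This is a simplified illustration under our own assumptions (toy matrices, random tie-breaking via a hypothetical `pick_label` helper) and not the code of the Knodle function used above.

```python
import numpy as np

# toy inputs: 3 instances x 3 rules, and 3 rules x 2 labels
z = np.array([
    [1, 0, 1],  # one positive and one negative rule match: a tie
    [1, 1, 0],  # two positive rules match
    [0, 0, 0],  # no rule matches
])
t = np.array([
    [0, 1],  # rule 0 implies positive
    [0, 1],  # rule 1 implies positive
    [1, 0],  # rule 2 implies negative
])

# votes per label for every instance
rule_counts = z @ t
row_sums = rule_counts.sum(axis=1, keepdims=True)

# normalise each row; rows without any match stay all-zero instead of NaN
probs = np.divide(
    rule_counts,
    row_sums,
    out=np.zeros(rule_counts.shape, dtype=float),
    where=row_sums > 0,
)

rng = np.random.default_rng(0)


def pick_label(p: np.ndarray) -> int:
    """Return the most probable label, breaking ties randomly."""
    best = np.flatnonzero(p == p.max())
    return int(rng.choice(best))


labels = [pick_label(p) for p in probs]
print(probs.tolist())  # [[0.5, 0.5], [0.0, 1.0], [0.0, 0.0]]
print(labels)
```

In this sketch, instances without any rule match also receive a random label; whether such instances should be labeled randomly, assigned a default class, or filtered out is a design choice that affects the reported accuracy.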

## Make splits

In [44]:
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split

In [45]:
# we want to save the matrix in a sparse format, so we convert it once before it gets split
rule_matches_sparse = csr_matrix(rule_matches_z)

In [47]:
# adjust DataFrame format to Knodle standard
imdb_dataset_formatted = pd.DataFrame({
    "sample": imdb_dataset_raw[COLUMN_WITH_TEXT].values,
    "label": imdb_dataset_raw.sentiment.map({'positive':POSITIVE, 'negative':NEGATIVE}).values
})

In [50]:
# make splits for samples and weak annotation matrix:
# first hold out 5000 test samples, then 5000 dev samples from the rest
rest_df, test_df, rest_rule_matches_sparse_z, test_rule_matches_sparse_z = train_test_split(
    imdb_dataset_formatted, rule_matches_sparse, test_size=5000, random_state=42
)
train_df, dev_df, train_rule_matches_sparse_z, dev_rule_matches_sparse_z = train_test_split(
    rest_df, rest_rule_matches_sparse_z, test_size=5000, random_state=42
)
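
The important property of the split is that the samples and the rows of the rule match matrix are shuffled with the same indices, so that row $i$ of each split's matrix still belongs to sample $i$ of that split. The sketch below shows the idea with a shared index permutation; `split_indices` is a hypothetical helper, and it does not reproduce the exact partition that `train_test_split` creates.

```python
import numpy as np


def split_indices(n_samples: int, n_dev: int, n_test: int, seed: int = 42):
    """Shuffle the row indices once and cut them into train, dev and test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    test_idx = shuffled[:n_test]
    dev_idx = shuffled[n_test:n_test + n_dev]
    train_idx = shuffled[n_test + n_dev:]
    return train_idx, dev_idx, test_idx


train_idx, dev_idx, test_idx = split_indices(50_000, n_dev=5_000, n_test=5_000)
print(len(train_idx), len(dev_idx), len(test_idx))  # 40000 5000 5000
```

The same index arrays can then be used to select rows from both the DataFrame (via `.iloc`) and the sparse matrix, which keeps the two aligned.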

In [52]:
# drop labels for the train split (reassigning avoids pandas' SettingWithCopyWarning)
train_df = train_df.drop(columns=["label"])


In [54]:
train_df.head(1)

Unnamed: 0,sample
23844,"I have no idea what these people were thinking when they made this film. No plot, very limited action, and what is with the 3rd person commentary throughout the film???? Instead of running around the planet to shoot on all of these locations, they should have spent some money on script writing and actors. What acting there was, was lousy. This was 90 minutes of my life I will never be able to get back. I should bill the director for the cost of renting this film. To the director and the writers of this film....please quit now. This film should have a tag on the front of it saying beware of boredom. The only good thing I can say about this film, is the computer generation. It's OK as generation is. This movie should never have a sequel....ever."


In [55]:
test_df.head(1)

Unnamed: 0,sample,label
33553,"I really liked this Summerslam due to the look of the arena, the curtains and just the look overall was interesting to me for some reason. Anyways, this could have been one of the best Summerslam's ever if the WWF didn't have Lex Luger in the main event against Yokozuna, now for it's time it was ok to have a huge fat man vs a strong man but I'm glad times have changed. It was a terrible main event just like every match Luger is in is terrible. Other matches on the card were Razor Ramon vs Ted Dibiase, Steiner Brothers vs Heavenly Bodies, Shawn Michaels vs Curt Hening, this was the event where Shawn named his big monster of a body guard Diesel, IRS vs 1-2-3 Kid, Bret Hart first takes on Doink then takes on Jerry Lawler and stuff with the Harts and Lawler was always very interesting, then Ludvig Borga destroyed Marty Jannetty, Undertaker took on Giant Gonzalez in another terrible match, The Smoking Gunns and Tatanka took on Bam Bam Bigelow and the Headshrinkers, and Yokozuna defended the world title against Lex Luger this match was boring and it has a terrible ending. However it deserves 8/10",1


## Save files


In [56]:
from joblib import dump

In [57]:
dump(train_rule_matches_sparse_z, os.path.join(data_path, "train_rule_matches_z.lib"))
dump(dev_rule_matches_sparse_z, os.path.join(data_path, "dev_rule_matches_z.lib"))
dump(test_rule_matches_sparse_z, os.path.join(data_path, "test_rule_matches_z.lib"))

dump(mapping_rules_labels_t,  os.path.join(data_path, "mapping_rules_labels_t.lib"))

['../../../data_from_minio/imdb/mapping_rules_labels_t.lib']

We also save the preprocessed texts with labels, so that they can be used later to evaluate a classifier (both as CSV and binary).

In [58]:
train_df.to_csv(os.path.join(data_path, 'df_train.csv'), index=False)
dev_df.to_csv(os.path.join(data_path, 'df_dev.csv'), index=False)
test_df.to_csv(os.path.join(data_path, 'df_test.csv'), index=False)

dump(train_df, os.path.join(data_path, "df_train.lib"))
dump(dev_df, os.path.join(data_path, "df_dev.lib"))
dump(test_df, os.path.join(data_path, "df_test.lib"))

['../../../data_from_minio/imdb/df_test.lib']

In [59]:
all_keywords.to_csv(os.path.join(data_path, 'keywords.csv'), index=False)

In [61]:
os.listdir(data_path)

['positive-words.txt',
 'dev_rule_matches_z.lib',
 'df_test.csv',
 'keywords.csv',
 'IMDB Dataset.csv',
 'negative-words.txt',
 'df_dev.csv',
 'df_train.csv',
 'df_train.lib',
 '.ipynb_checkpoints',
 'test_rule_matches_z.lib',
 'mapping_rules_labels_t.lib',
 'df_dev.lib',
 'train_rule_matches_z.lib',
 'df_test.lib']

# Finish

Now we have created a weak supervision dataset. Of course, it is not perfect, but it gives us something on which we can compare the performance of different denoising methods. :-)