# IMDB Dataset - Create Weak Supervision Sources

This notebook shows how to create labeling functions on the IMDB Movie Review dataset.

This dataset comes with gold labels, but they are used here only for evaluation. The idea behind weak supervision, and knodle in particular, is that you do not have a dataset labeled purely by hand (strong supervision) and instead label it with weak supervision sources.

First, we load the dataset from Kaggle. Then we look at keywords for both sentiments and pick well-matching ones to act as weak supervision sources. Finally, we try the keywords out with a basic majority vote model.

## Imports

Let's make some basic imports.

In [1]:
import pandas as pd 
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer
from bs4 import BeautifulSoup
from snorkel.labeling import LabelingFunction, PandasLFApplier, filter_unlabeled_dataframe, LFAnalysis
from snorkel.labeling.model import MajorityLabelVoter, LabelModel
import numpy as np 
from tqdm import tqdm

In [2]:
# Init
tqdm.pandas()
pd.set_option('display.max_colwidth', None)  # None = no truncation (pandas >= 1.0; -1 is deprecated)



In [3]:
# Constants
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1
COLUMN_WITH_TEXT = "reviews_preprocessed"

## Download the raw dataset

Now we download the dataset we need. For that you need to have the kaggle-cli installed and configured with your API key. Please have a look at the official [documentation](https://github.com/Kaggle/kaggle-api) for further instructions.

In [4]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
!tar -xvf imdb-dataset-of-50k-movie-reviews.zip
!rm imdb-dataset-of-50k-movie-reviews.zip

Downloading imdb-dataset-of-50k-movie-reviews.zip to /Users/sandro/repo/knodle/tutorials/ImdbDataset
100%|██████████████████████████████████████| 25.7M/25.7M [00:02<00:00, 9.98MB/s]
100%|██████████████████████████████████████| 25.7M/25.7M [00:02<00:00, 10.9MB/s]
x IMDB Dataset.csv


## Preview dataset

After downloading and unpacking the dataset we can have a first look at it and work with it.

In [5]:
imdb_dataset_raw = pd.read_csv('IMDB Dataset.csv')

In [6]:
imdb_dataset_raw.head()

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her ""sexy"" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than ""Devil Wears Prada"" and more interesting than ""Superman"" a great comedy to go see with friends.",positive
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them.",negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with most of the people we encounter.<br /><br />The acting is good under Mr. Mattei's direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, and the rest of the talented cast, make these characters come alive.<br /><br />We wish Mr. Mattei good luck and await anxiously for his next work.",positive


In [7]:
imdb_dataset_raw.groupby('sentiment').count()


Unnamed: 0_level_0,review
sentiment,Unnamed: 1_level_1
negative,25000
positive,25000


In [8]:
imdb_dataset_raw.isna().sum()

review       0
sentiment    0
dtype: int64

## Preprocess dataset

Now let's take some basic preprocessing steps.

### Remove Stopwords

We begin by removing all common stop words. We use `scikit-learn`'s stop word list so that we don't have to install too many packages.

In [9]:
imdb_dataset_raw['reviews_preprocessed'] = imdb_dataset_raw['review'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in (ENGLISH_STOP_WORDS)]))

In [10]:
imdb_dataset_raw.head()

Unnamed: 0,review,sentiment,reviews_preprocessed
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive,"One reviewers mentioned watching just 1 Oz episode you'll hooked. They right, exactly happened me.<br /><br />The thing struck Oz brutality unflinching scenes violence, set right word GO. Trust me, faint hearted timid. 
This pulls punches regards drugs, sex violence. Its hardcore, classic use word.<br /><br />It called OZ nickname given Oswald Maximum Security State Penitentary. It focuses mainly Emerald City, experimental section prison cells glass fronts face inwards, privacy high agenda. Em City home many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish more....so scuffles, death stares, dodgy dealings shady agreements far away.<br /><br />I say main appeal fact goes shows wouldn't dare. Forget pretty pictures painted mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The episode I saw struck nasty surreal, I couldn't say I ready it, I watched more, I developed taste Oz, got accustomed high levels graphic violence. Not just violence, injustice (crooked guards who'll sold nickel, inmates who'll kill order away it, mannered, middle class inmates turned prison bitches lack street skills prison experience) Watching Oz, comfortable uncomfortable viewing....thats touch darker side."
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive,"A wonderful little production. <br /><br />The filming technique unassuming- old-time-BBC fashion gives comforting, discomforting, sense realism entire piece. <br /><br />The actors extremely chosen- Michael Sheen ""has got polari"" voices pat too! You truly seamless editing guided references Williams' diary entries, worth watching terrificly written performed piece. A masterful production great master's comedy life. <br /><br />The realism really comes home little things: fantasy guard which, use traditional 'dream' techniques remains solid disappears. It plays knowledge senses, particularly scenes concerning Orton Halliwell sets (particularly flat Halliwell's murals decorating surface) terribly done."
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her ""sexy"" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than ""Devil Wears Prada"" and more interesting than ""Superman"" a great comedy to go see with friends.",positive,"I thought wonderful way spend time hot summer weekend, sitting air conditioned theater watching light-hearted comedy. The plot simplistic, dialogue witty characters likable (even bread suspected serial killer). While disappointed realize Match Point 2: Risk Addiction, I thought proof Woody Allen fully control style grown love.<br /><br />This I'd laughed Woody's comedies years (dare I say decade?). While I've impressed Scarlet Johanson, managed tone ""sexy"" image jumped right average, spirited young woman.<br /><br />This crown jewel career, wittier ""Devil Wears Prada"" interesting ""Superman"" great comedy friends."
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them.",negative,"Basically there's family little boy (Jake) thinks there's zombie closet & parents fighting time.<br /><br />This movie slower soap opera... suddenly, Jake decides Rambo kill zombie.<br /><br />OK, you're going make film Decide thriller drama! As drama movie watchable. Parents divorcing & arguing like real life. And Jake closet totally ruins film! I expected BOOGEYMAN similar movie, instead watched drama meaningless thriller spots.<br /><br />3 10 just playing parents & descent dialogs. As shots Jake: just ignore them."
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with most of the people we encounter.<br /><br />The acting is good under Mr. Mattei's direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, and the rest of the talented cast, make these characters come alive.<br /><br />We wish Mr. Mattei good luck and await anxiously for his next work.",positive,"Petter Mattei's ""Love Time Money"" visually stunning film watch. Mr. Mattei offers vivid portrait human relations. This movie telling money, power success people different situations encounter. <br /><br />This variation Arthur Schnitzler's play theme, director transfers action present time New York different characters meet connect. Each connected way, person, know previous point contact. Stylishly, film sophisticated luxurious look. We taken people live world live habitat.<br /><br />The thing gets souls picture different stages loneliness inhabits. 
A big city exactly best place human relations fulfillment, discerns case people encounter.<br /><br />The acting good Mr. Mattei's direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, rest talented cast, make characters come alive.<br /><br />We wish Mr. Mattei good luck await anxiously work."


### Remove HTML Tags

The dataset contains many HTML tags. We'll remove them.

In [11]:
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

imdb_dataset_raw['reviews_preprocessed'] = imdb_dataset_raw['reviews_preprocessed'].apply(strip_html)

In [12]:
imdb_dataset_raw['reviews_preprocessed'].head()

0    One reviewers mentioned watching just 1 Oz episode you'll hooked. They right, exactly happened me.The thing struck Oz brutality unflinching scenes violence, set right word GO. Trust me, faint hearted timid. This pulls punches regards drugs, sex violence. Its hardcore, classic use word.It called OZ nickname given Oswald Maximum Security State Penitentary. It focuses mainly Emerald City, experimental section prison cells glass fronts face inwards, privacy high agenda. Em City home many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish more....so scuffles, death stares, dodgy dealings shady agreements far away.I say main appeal fact goes shows wouldn't dare. Forget pretty pictures painted mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The episode I saw struck nasty surreal, I couldn't say I ready it, I watched more, I developed taste Oz, got accustomed high levels graphic violence. Not just violence, injustice (crooked guards who'll sold n
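
Note that in the output above sentences run together (`me.The`, `word.It`) because the `<br />` tags are removed without leaving a space, and capitalised stop words such as `The` or `They` survive because the stop word check is case-sensitive. Below is a minimal sketch of an alternative order: strip the markup first, then compare lowercased tokens. The regex is a simplified stand-in for BeautifulSoup (where `soup.get_text(separator=" ")` has a similar effect), and `STOP_WORDS` is a tiny hypothetical list rather than scikit-learn's `ENGLISH_STOP_WORDS`.

```python
import re

# Hypothetical mini stop word list; the notebook uses ENGLISH_STOP_WORDS.
STOP_WORDS = {"the", "they", "are", "is", "this", "with", "me", "as", "what"}

def strip_tags(text: str) -> str:
    """Replace every HTML tag with a space so neighbouring words stay apart."""
    return re.sub(r"<[^>]+>", " ", text)

def preprocess(text: str) -> str:
    """Strip markup first, then drop stop words case-insensitively."""
    tokens = strip_tags(text).split()
    return " ".join(t for t in tokens if t.lower() not in STOP_WORDS)

sample = "They are right, this happened with me.<br /><br />The first thing"
cleaned = preprocess(sample)
print(cleaned)
```

Tokens that carry punctuation (such as `me.`) are kept here because they do not match a stop word exactly; whether punctuation should be trimmed as well depends on how the keywords are matched later.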

## Keywords

As weak supervision sources we use sentiment keyword lists of positive and negative words, taken from the opinion lexicon available at
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

In [16]:
from minio import Minio


In [24]:
# Quick look at the raw file; the context manager closes it again.
with open("/Users/sandro/Downloads/opinion-lexicon-English/positive-words.txt", "r") as f:
    positive_keywords = f.readlines()

In [70]:
# on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3).
positive_keywords = pd.read_csv('/Users/sandro/Downloads/opinion-lexicon-English/positive-words.txt', sep=" ", header=None, on_bad_lines='skip')
positive_keywords.columns = ["keywords"]
positive_keywords = positive_keywords.drop_duplicates("keywords")

In [69]:
# on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3).
negative_keywords = pd.read_csv('/Users/sandro/Downloads/opinion-lexicon-English/negative-words.txt',
                                sep=" ", header=None, on_bad_lines='skip', encoding='latin-1')
print("negative keywords orig shape: ", negative_keywords.shape)
negative_keywords.columns = ["keywords"]
negative_keywords = negative_keywords.drop_duplicates("keywords")
print("negative keywords shape after dropping duplicates: ", negative_keywords.shape)

negative keywords orig shape:  (4783, 1)
negative keywords shape after dropping duplicates:  (4783, 1)
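
Reading the lexicon with `pd.read_csv` and skipping bad lines works, but it hides which lines are dropped. Below is a minimal sketch of a plain-Python loader. It assumes (not verified here) that the lexicon files hold one word per line and that header comment lines start with `;`. The function name `load_lexicon` and the sample text are made up for illustration.

```python
import io
from typing import Iterable, List

def load_lexicon(lines: Iterable[str]) -> List[str]:
    """Return unique, lowercased words, ignoring blank and ';' comment lines."""
    seen = set()
    words = []
    for line in lines:
        word = line.strip().lower()
        if not word or word.startswith(";"):
            continue
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words

# Stand-in for open(path, encoding="latin-1"); the content is invented.
sample_file = io.StringIO("; header comment\n\na+\nabound\nabounds\nabound\n")
lexicon = load_lexicon(sample_file)
print(lexicon)
```

The resulting list could then be wrapped in a DataFrame with a `keywords` column if the rest of the notebook should stay unchanged.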


In [82]:
positive_keywords

Unnamed: 0,keywords
0,a+
1,abound
2,abounds
3,abundance
4,abundant
...,...
2001,youthful
2002,zeal
2003,zenith
2004,zest


In [85]:
all_keywords = pd.DataFrame({'keyword':positive_keywords.keywords, 'sentiment':'positive'}, index=positive_keywords.index)
all_keywords.head()

Unnamed: 0,keyword,sentiment
0,a+,positive
1,abound,positive
2,abounds,positive
3,abundance,positive
4,abundant,positive


In [89]:
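# Build the negative frame in the same layout as all_keywords.
# Note: this overwrites negative_keywords, so running the cell a second time fails with
# AttributeError: 'DataFrame' object has no attribute 'keywords'.
negative_keywords = pd.DataFrame({'keyword':negative_keywords.keywords, 'sentiment':'negative'}, index=negative_keywords.index)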

In [91]:
all_keywords = pd.concat([all_keywords, negative_keywords])

In [92]:
all_keywords.shape

(6789, 2)

In [95]:
all_keywords.drop_duplicates('keyword').shape

(6786, 2)

In [96]:
all_keywords.drop_duplicates('keyword',inplace=True)

## Labeling Functions

Now we start to build labeling functions with Snorkel with these keywords and check the coverage.

This is an iterative process, of course, so we will surely have to add more keywords and regular expressions later ;-)

In [97]:
def keyword_lookup(x, keyword, label):
    return label if keyword in x[COLUMN_WITH_TEXT].lower() else ABSTAIN
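
`keyword in text` is a substring test, so `abound` also fires on `abounded`, and a short keyword can match inside unrelated words. Below is a hedged sketch of a whole-word variant using a regular expression. `keyword_lookup_whole_word` is a hypothetical name, and the data point is modelled as a plain dict instead of a DataFrame row.

```python
import re

POSITIVE, ABSTAIN = 1, -1
COLUMN_WITH_TEXT = "reviews_preprocessed"

def keyword_lookup_whole_word(x, keyword, label):
    """Return label only if keyword occurs as a standalone word in the text."""
    # Lookarounds instead of \b so keywords ending in symbols ("a+") still work.
    pattern = r"(?<!\w)" + re.escape(keyword.lower()) + r"(?!\w)"
    text = x[COLUMN_WITH_TEXT].lower()
    return label if re.search(pattern, text) else ABSTAIN

row_a = {COLUMN_WITH_TEXT: "Minor subplots abound, including the sheriff"}
row_b = {COLUMN_WITH_TEXT: "naked people in bed abounded"}
print(keyword_lookup_whole_word(row_a, "abound", POSITIVE))
print(keyword_lookup_whole_word(row_b, "abound", POSITIVE))
```

Whether whole-word matching is preferable depends on the lexicon: it lowers the number of accidental matches but also misses inflected forms that the substring test catches.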


In [98]:
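def make_keyword_lf(keyword: str, label: int) -> LabelingFunction:
    """
    Creates a labeling function based on a keyword.
    Args:
        keyword: Keyword to look for in the review text
        label: Label to assign when the keyword matches (POSITIVE or NEGATIVE)

    Returns:
        LabelingFunction named keyword_<keyword>
    """
    return LabelingFunction(
        name=f"keyword_{keyword}",
        f=keyword_lookup,
        resources=dict(keyword=keyword, label=label),
    )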

In [99]:
def create_labeling_functions(keywords: pd.DataFrame) -> np.ndarray:
    """
    Create labeling functions based on the columns keyword and label_id.

    Args:
        keywords: DataFrame with processed keywords

    Returns:
        All labeling functions. 1d array with shape: (number_of_lfs,)
    """
    # assign() returns a copy, so the caller's DataFrame is left unchanged.
    # label_id is stored as float (1.0 / 0.0), so cast it back to int.
    keywords = keywords.assign(lf=keywords.progress_apply(
        lambda x: make_keyword_lf(x.keyword, int(x.label_id)), axis=1
    ))
    lfs = keywords.lf.values
    return lfs

In [100]:
def make_keyword_df(positive_keywords: [str], negative_keywords: [str]) -> pd.DataFrame:
    positive = pd.DataFrame(
        pd.Series({x: 'positive' for (x) in positive_keywords}, name="label")).reset_index().rename(columns={"index":"keyword"})

    negative = pd.DataFrame(
        pd.Series({x: 'negative' for (x) in negative_keywords}, name="label")).reset_index().rename(columns={"index":"keyword"})
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
    keywords = pd.concat([positive, negative])

    assert len(positive) + len(negative) == len(keywords), "Shapes don't match"

    keywords.loc[keywords.label == 'positive', 'label_id'] = int(POSITIVE)
    keywords.loc[keywords.label == 'negative', 'label_id'] = int(NEGATIVE)
    keywords.reset_index(inplace=True, drop=True)
    return keywords


In [101]:
keywords = make_keyword_df(all_keywords.loc[all_keywords.sentiment == 'positive', 'keyword'].values,
                           all_keywords.loc[all_keywords.sentiment == 'negative', 'keyword'].values,)

In [102]:
keywords

Unnamed: 0,keyword,label,label_id
0,a+,positive,1.0
1,abound,positive,1.0
2,abounds,positive,1.0
3,abundance,positive,1.0
4,abundant,positive,1.0
...,...,...,...
6781,zaps,negative,0.0
6782,zealot,negative,0.0
6783,zealous,negative,0.0
6784,zealously,negative,0.0


In [103]:
labeling_functions = create_labeling_functions(keywords)

100%|██████████| 6786/6786 [00:00<00:00, 35806.49it/s]


In [104]:
labeling_functions

array([LabelingFunction keyword_a+, Preprocessors: [],
       LabelingFunction keyword_abound, Preprocessors: [],
       LabelingFunction keyword_abounds, Preprocessors: [], ...,
       LabelingFunction keyword_zealous, Preprocessors: [],
       LabelingFunction keyword_zealously, Preprocessors: [],
       LabelingFunction keyword_zombie, Preprocessors: []], dtype=object)

### Apply Labeling Functions

Now let's apply all labeling functions to our reviews and check some statistics.

In [105]:
applier = PandasLFApplier(lfs=labeling_functions)
applied_lfs = applier.apply(df=imdb_dataset_raw)

100%|██████████| 50000/50000 [34:44<00:00, 23.98it/s]
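
Applying 6786 keyword functions to 50000 reviews took about 35 minutes above, since every function scans every review. Below is a rough sketch of a faster route, under the assumption that whole-token matches are acceptable (this is not equivalent to the substring test in `keyword_lookup`): split each review into tokens once and look the tokens up in a keyword-to-column dictionary. All names here are hypothetical, and the result is a plain list of lists rather than the array returned by `PandasLFApplier`.

```python
ABSTAIN = -1

def apply_keywords(texts, keyword_labels):
    """Build a (len(texts) x len(keyword_labels)) vote matrix, one pass per text."""
    keywords = list(keyword_labels)
    column = {kw: j for j, kw in enumerate(keywords)}
    matrix = []
    for text in texts:
        row = [ABSTAIN] * len(keywords)
        # Tokens are whitespace-separated chunks with punctuation trimmed
        # from both ends, so "great!" becomes "great".
        for raw in text.lower().split():
            token = raw.strip(".,!?;:()\"'")
            j = column.get(token)
            if j is not None:
                row[j] = keyword_labels[token]
        matrix.append(row)
    return matrix

votes = apply_keywords(
    ["A wonderful little production.", "Awful plot, awful acting!"],
    {"wonderful": 1, "awful": 0, "zest": 1},
)
print(votes)
```

The work per review then grows with the number of tokens instead of the number of keywords, at the price of changing the matching rule.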


In [106]:
applied_lfs

array([[-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       ...,
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1]])

Now we have a matrix with all labeling functions applied. This matrix has the shape $(\text{instances} \times \text{labeling functions})$.
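
Each row of this matrix holds the votes of all labeling functions for one review, with `-1` meaning that the function abstained. As a minimal illustration of how such rows can be reduced to a single weak label, here is a plain-Python majority vote. It is a sketch, not Snorkel's `MajorityLabelVoter`, and it abstains on ties, which is an assumption made for this example.

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(row):
    """Return the most frequent non-abstain label in row, or ABSTAIN on a tie."""
    counts = Counter(label for label in row if label != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

toy_matrix = [
    [1, -1, 1, 0],     # two positive votes, one negative
    [-1, -1, -1, -1],  # no labeling function fired
    [0, 1, -1, -1],    # tie between negative and positive
]
weak_labels = [majority_vote(row) for row in toy_matrix]
print(weak_labels)
```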

In [107]:
print("Shape of applied labeling functions: ", applied_lfs.shape)
print("Number of reviews", len(imdb_dataset_raw))
print("Number of labeling functions", len(labeling_functions))

Shape of applied labeling functions:  (50000, 6786)
Number of reviews 50000
Number of labeling functions 6786


### Analysis

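Now we can analyse some basic statistics of our labeling functions. The main figures are:

- Coverage: fraction of reviews for which the labeling function assigns a label (does not abstain)
- Overlaps: fraction of reviews for which this and at least one other labeling function assign a label (e.g. awesome and amazing)
- Conflicts: fraction of reviews for which this and at least one other labeling function assign different labels (e.g. awesome and bad)
- Correct: number of reviews the labeling function labels correctly (only reported when gold labels are passed)
- Incorrect: number of reviews the labeling function labels incorrectly (only reported when gold labels are passed)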
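
To make the first three figures concrete, the following sketch computes them for a toy label matrix in plain Python. Coverage is taken as the fraction of rows a labeling function labels, overlaps as the fraction where at least one other function also labels, and conflicts as the fraction where another function gives a different label. It is meant as an illustration, not as a replacement for `LFAnalysis`; the helper name `lf_stats` is made up.

```python
ABSTAIN = -1

def lf_stats(matrix, j):
    """Return (coverage, overlaps, conflicts) of labeling function j as fractions of rows."""
    n = len(matrix)
    covered = overlaps = conflicts = 0
    for row in matrix:
        if row[j] == ABSTAIN:
            continue
        covered += 1
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlaps += 1
        if any(v != row[j] for v in others):
            conflicts += 1
    return covered / n, overlaps / n, conflicts / n

toy_matrix = [
    [1, 1, -1],   # LF 0 and LF 1 agree
    [1, -1, 0],   # LF 0 and LF 2 disagree
    [1, -1, -1],  # only LF 0 fires
    [-1, -1, 0],  # LF 0 abstains
]
print(lf_stats(toy_matrix, 0))
```

With thousands of keyword functions, almost every review matched by one keyword is also matched by some other keyword, which would explain why coverage and overlaps are nearly identical in the summary below.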

In [108]:
LFAnalysis(L=applied_lfs, lfs=labeling_functions).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
keyword_a+,0,[1],0.00102,0.00102,0.00102
keyword_abound,1,[1],0.00264,0.00264,0.00264
keyword_abounds,2,[1],0.00052,0.00052,0.00052
keyword_abundance,3,[1],0.00182,0.00182,0.00182
keyword_abundant,4,[1],0.00114,0.00114,0.00114
...,...,...,...,...,...
keyword_zaps,6781,[0],0.00012,0.00012,0.00012
keyword_zealot,6782,[0],0.00048,0.00048,0.00048
keyword_zealous,6783,[0],0.00092,0.00092,0.00092
keyword_zealously,6784,[0],0.00006,0.00006,0.00006


In [109]:
lf_analysis = LFAnalysis(L=applied_lfs, lfs=labeling_functions).lf_summary()

In [110]:
pd.DataFrame(lf_analysis.mean())

Unnamed: 0,0
j,3392.5
Coverage,0.005002
Overlaps,0.005002
Conflicts,0.004997


In [111]:
pd.DataFrame(lf_analysis.median())

Unnamed: 0,0
j,3392.5
Coverage,0.00052
Overlaps,0.00052
Conflicts,0.00052


Let's have a look at some examples: ten reviews matched by the labeling function with index 1 (`keyword_abound`).

In [112]:
imdb_dataset_raw.iloc[applied_lfs[:, 1] == POSITIVE, :].sample(10, random_state=1).loc[:, ['review','sentiment']]

Unnamed: 0,review,sentiment
30200,"The summary line above, spoken by James Cloud (Robert Preston) to his brother Tom (Robert Sterling) just about says it all. Jim, AKA Kid Wichita, has a way of making things happen, only trouble is, he usually leaves dead bodies where he's been. Not the sort of mentoring Tom envisions for younger brother Jeff, who likes what he sees in Jim, especially when defending their ranch against local Texas cattlemen.<br /><br />The opening credits state 'Introducing John Barrymore Jr. as the Younger Brother', in this his very first screen appearance. That seemed rather odd to me, particularly since he was addressed as Jeff almost immediately into the story. Approximately eighteen at the time of this movie, he bears a passing resemblance to Sean Penn. No stranger to personal and legal problems throughout his career as well as estrangement from his family, I was left wondering if his daughter Drew Barrymore might have ever seen this picture. I'm inclined to think not.<br /><br />On the subject of resemblances, I was also struck by the thought that the young Robert Sterling looked a bit like Roy Rogers early in his career. Knowing Sterling previously only from his role as George Kerby in the early 1950's TV series 'Topper', I thought he looked out of place in a Western, but that might just be me. His character becomes emboldened by his brother's resourcefulness at creating trouble, and provides some of the edginess to this not so typical story. Minor subplots abound, including the relationship rancher John Gall (John Litel) has with his son the Sheriff (who Kid Wichita kills), and the troubled marriage between Kathleen Boyce (Cathy Downs) and her husband Earl (who Kid Wichita kills). 
Chill Wills rounds out the main cast as one of Tom Cloud's hired hands, and figures in the somewhat predictable finale.<br /><br />What's not quite predictable is how things eventually wind up there, and for that reason, this Western earns points for following a less traveled, hence not quite as formulaic a plot as a lot of good brother/bad brother Westerns do. Combined with the eclectic casting of the principals, it's one I'd recommend, even if you have to endure some of the jump cuts and sloppy editing that I experienced with my copy.",positive
14151,"Bingo is the game, bullshit is the name. Rarely has the screen been smeared with such a blown-up hodgepodge of half-baked conspiracy theories, puritan prudery, and new-age gibberish. The bulk of the story is set at Viciente, a Cristian resort in the Peruvian jungle. Think Tolkien's Rivendell meets Star Trek's Planet Baku, inhabited by dimwitted followers of a not-so-mysterious, but surprisingly narrow-minded cult of love and peace. Thanks to gruesome acting and tacky production design (the rainbow-colored visualization of the mysterious all-healing ""energy"" is particularly hideous), ""The Celestine Prophecies"" looks and feels like a discarded 1980s ""Twilight Zone"" episode. Factual errors regarding church history and nomenclature abound. I can't believe Hector Elizondo agreed to be a part of this. Maybe it was made without his consent, Bowfinger style. May the Lord have mercy on the director, the screenwriter, the author of the novel, and the poor souls who see the movie or read the book.",negative
18866,"In modern day Eastern Europe life is hard and for young women prostitution is one of the only career options and one taken, reluctantly, by Melania. She attracts the attentions of an American, Seymour, who becomes obsessed with her, paying more and more money for time with her until he eventually wants to buy her outright. She has two pimps with differring emotional attachments to her and she is generally passed around like some piece of baggage with no feelings of her own. However, we are in ""modern art-house cinema"" territory, so conventions like narrative structure, lighting the subject so it can be seen, camera techniques that add to rather than distract from the action and a vaguely consistent plot can all be abandoned. Much of the time I had no idea what was supposed to be happening and very rarely did I care. People began leaving the screening almost before the last latecomers had arrived and I don't think I've ever seen so many people walk out.<br /><br />Images are important to the director - characters slowly emerge from or disappear into a dark screen, we get long lingering shots of nothing in particular and one sex scene takes place in infra-red. In fact for such an unconventional film the sex scenes were remarkably ordinary; missionary positions between naked people in bed abounded and there were no drugs or related weirdness. But perhaps these days being ordinary is unconventional.<br /><br />On the whole, almost entirely without merit.<br /><br />",negative
2123,"This film for me and my wife is more entertaining than all the bloc-buster violent thriller/mystery/murder movies that abound. It is about real people making the best of their lives. They just happen to be Indian and the main characters are in law enforcement. The realistic acting and the great scenery more than make up for the slightly implausible plot. The sound track is by BC Smith, who also did the soundtrack for Coyote Waits, and is great. Adam Beach plays a tribal policeman who is a little bit accident prone and Wes Studi is the stoic consummately professional detective. There are many other fine either supporting or cameo roles by Graham Greene, Tantoo Cardinal, etc. We have also seen Coyote Waits, another adaptation of a Hillerman novel, and we greatly enjoyed it too.",positive
14348,"This straight-to-video duffer is another nail in the coffin of Rick Moranis's career. As is the Disney tradition, quality is sacrificed in the name of a quick cash-in; this is a lazy retread with Moranis accidentally shrinking himself and a few relatives so they can repeat all the best scenes from the original movie. Instantly dated visual effects and crummy dialogue abound in this cheesy lamer, which did nothing but make me pine for the days of 'The Incredible Shrinking Man', when this kind of thing was done properly. Shockingly, this is directed by top cinematographer Dean Cundey, who should either stick to the day job or pick better material next time.",negative
44052,"I tried to finish this film three times, but it's god awful. Case in point: mom and daughter drive up to the bed and breakfast,mom stops for gas, crazy gas station weirdos mad at her hubby whose running the B&B try to rape her. She escapes, heads to B&B and instead of hubby going ballistic and she wanting to call the cops, story just continues with lukewarm behavior on both their parts. Wow.<br /><br />Other action logic deficits abound. Acting is also lukewarm, and the next door neighbor's warning is delivered in a really corny, badly acted moment.<br /><br />Moments of intense gore/death unevenly interwoven with lukewarm scenes of time-filler interplay between characters.<br /><br />Less focus on gore, more focus on mood and story would have been appreciated.",negative
20214,"This short was nominated for an Academy Award, losing to Anna and Bella. Not since Doctor Strangelove has nuclear war been so hilarious! Condie takes a situation and turns it on its ear and then gives it a spin for good measure. Visual gags abound in Condie's work and I always see something I missed before every time I watch this. On The World's Greatest Animation, well worth seeing for thisand many other shorts. Recommended.",positive
5912,"This is the most human and humane of movies that I have seen in a long time. The ironies abound, Susan Sarandon as a nun, Tim Robbins and Susan Sarandon in a movie that doesn't preach but neither does it condemn. It is cinema verite at it's best, and yet the story is fictionalized from several real events.<br /><br />Which of the two is more amazing, Sarandon or Penn? It is easy to say who is more likeable, but it is hard to say who is more convincing. they are simply magnificent.<br /><br />You may think that all killers should be killed or you may argue that life without parole is no life and that death is more merciful. whatever your personal feeling, this movie gives you a chance to pause and reconsider.<br /><br />At the end one simply wants to sit in silence and reflect. That is what great drama does, it gives catharsis, it creates a moment in time, a shared memory that touches our humanity.",positive
44091,"Well, this is about as good as they come. There are arguments about whether Hitchcock was only a ""master of commercial suspense"" or maybe a ""compulsive technician"" -- or was he really ""deep."" Nobody knows precisely what terms like that mean, but it's legitimate to ask if, at his best, he could not have been all three things at once.<br /><br />In this one he seems to be about at his peak. Hardly anything in it is accidental. It abounds with doubt, ambiguity, and wit. And the story is engrossing, Patricia Highsmith apparently having complexes similar to Hitchcock's own.<br /><br />I'm sure the plot has been thoroughly outlined elsewhere so I won't bother going into it. I'll just point out five on screen incidents that Hitchcock is undoubtedly responsible for.<br /><br />Bruno Antony (Walker) has followed Miriam (Laura Miller) and her two boyfriends to a carnival at night with the intent of murdering her. She's noticed his attention and is innocently flattered by it.<br /><br />1. Laura and her two friends rent an electric boat to go through The Tunnel of Love and then to an island in the center of the lake. Walker is right behind them, smiling, in his boat -- Pluto. ""Pluto."" It's not an allusion to the Walt Disney cartoon character. It's a reference to Pluto, also called Hades, a god of the underworld in Greek and Roman mythology. This tiny touch can't be an accident. And the ""underworld"" that Walker represents is not just a literal hell, but the underworld of the human mind. I hate to say he's a Jungian ""shadow"" but that's what he is. (Did Carl Jung see this movie? He was alive at the time of its release -- but probably not.) 2. Now, this is a deadly serious sequence, right? Walker is a lunatic who is about to murder a woman he doesn't even know. Imagine the way this would have been laid out by most directors. 
A night-time stalking through a crowded carnival, stealing from shadow to shadow, the killer peering from behind the canvas walls, and so forth. What does Hitchcock show us? When Walker first comes through the gates, concentrating on his victim, a little boy in a cowboy suit, holding a balloon, runs up and shouts, ""Bang! Bang! You're dead!"" Walker jerks his head back in surprise and glares down indignantly at the kid. And when the kid starts to walk away, Walker darts his cigarette at the balloon and pops it, then continues his pursuit without another glance. That's one way Hitchcock treats impending doom.<br /><br />3. The famous strangling reflected in the fallen eyeglasses, which has been aped innumerable times.<br /><br />4. Miriam and her friends stop at one of those devices that you pound with a big mallet, sending a kind of hockey puck up the shaft to measure the strength of your blow. One of her boyfriend whams it and the puck doesn't reach the bell at the top. Under Miriam's delighted and admiring gaze, Walker smiles and rubs his hands together, then picks up the mallet, slams it down, and the puck bangs against the bell. She's thrilled. He puts the mallet down, looks at her, grins, and WAGGLES HIS EYEBROWS at her like a ten-year-old showoff! 5. After the discovery of Miriam's body, while whistles blow and people shout, Walker leaves the carnival and encounters a blind man waiting at the curb. Walker takes the old fellow by the arm and leads him across the intersection, gravely holding up his hand to stop the traffic. A macabre joke.<br /><br />These incidents and others all take place during the ten or fifteen-minutes of the carnival visit. (Robert Walker's performance is exceptional throughout.) It's essentially a kind of invitation to be noirish. (Cf: ""Ride the Pink Horse"") But the menace of the scene is undercut by Hitchcock's insistence on irony and distance. None of the familiar noir techniques are employed. 
There's nothing really ""creepy"" about it. And the murder itself is hardly a savage one. I don't think that any director other than Hitchcock would have handled it the way he did. It would have been all menace and shadows, hiding places, abortive attempts, scowls instead of grins.<br /><br />Not that it's an entirely flawless movie. A flawless movie is not yet with us. Some of the middle section is a bit slow going and Farley Granger, although a nice guy, is stolid, dull, and rather stupid. His new girl friend is just dull.<br /><br />Hitchcock was to treat the misattribution of guilt with deadly seriousness in ""The Wrong Man."" I'm not sure Hitchcock ever thought about the difference between legal guilt and moral guilt. The latter was imposed on him at an early age by his Catholic education. ""Original sin"" -- you're BORN with it -- and all that. In filmed interviews, he always glibly explained away his fear of the police and of authorities generally by telling a story about his father taking him to the police station to put a scare into him after some peccadillo. We're justified in asking if that was only what psychoanalysis calls a ""screen memory."" I hope you get the pun. I know, I know. It's strained and inept but I spent a good deal of valuable time thinking it up.",positive
43608,"When a movie like ""The Dukes of Hazzard"" brings in over $75 million it makes some incredibly sad statements about the condition of our own society. Either we are collectively too stupid to stay away from trash like this or maybe I'm just not realizing how many people this kind of no-effort trash will appeal to.<br /><br />Hollywood has had no incentive to make good movies since if it puts out trash then people will see it anyways since there is nothing else on screen. This is that. I walked out despite getting a free movie pass. The dialogue could not be dumber. The stunts could not be more over-the-top and outrageous. Perhaps this ""bigger that big"" image appeals to Texans but it didn't appeal to me nor anyone else in the theater. None of the ""big names"" were in this career-ending flick, except for Burt Reynolds, which says all you need to hear. Jessica Simpson -- don't make me laugh.<br /><br />I wouldn't even recommend this film for video, even if you were desperate. This was all about fooling the public to make enough money after opening day to equal or do better than it cost through marketing. They did despite the public being forewarned. Stupidity abound.",negative


## Transform rule matches

To work with knodle, the output of the labeling functions needs to be transformed into a binary matrix `rule_matches`.

1 -> Rule matched
0 -> Rule didn't match

Furthermore, we need a matrix `mapping_rules_labels` which maps every rule to its label.
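
To make the two matrices concrete, here is a minimal sketch of such a transformation on a toy label matrix. It is not knodle's implementation: the helper name `label_matrix_to_z_t` is hypothetical, and it assumes that every labeling function either abstains (`-1`) or always emits the same single class, as keyword rules do.

```python
import numpy as np

ABSTAIN = -1  # same convention as the constants defined at the top of the notebook


def label_matrix_to_z_t(label_matrix, num_classes):
    """Hypothetical helper: split a labeling-function output matrix into Z and T."""
    num_rules = label_matrix.shape[1]
    # Z[i, j] = 1 if rule j fired on sample i (did not abstain), else 0.
    rule_matches = (label_matrix != ABSTAIN).astype(int)
    # T[j, c] = 1 if rule j votes for class c.
    mapping = np.zeros((num_rules, num_classes), dtype=int)
    for j in range(num_rules):
        votes = label_matrix[:, j]
        votes = votes[votes != ABSTAIN]
        if votes.size:
            # Assumption: a rule always emits the same class, so the first vote suffices.
            mapping[j, votes[0]] = 1
    return rule_matches, mapping


# Three samples, three rules: rule 0 is positive, rule 1 is negative, rule 2 never fires.
toy = np.array([[1, -1, -1],
                [-1, 0, -1],
                [1, 0, -1]])
z, t = label_matrix_to_z_t(toy, num_classes=2)
```

A rule that never fires keeps an all-zero row in the mapping, so it cannot contribute to any later vote.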

In [115]:
from knodle.data.labelling_fcts import transform_rule_class_matrix_to_z_t

In [118]:
rule_matches, mapping_rules_labels = transform_rule_class_matrix_to_z_t(applied_lfs)

In [119]:
# Count the votes per label: binary rule matches times the rule-to-label mapping
rule_label_counts = np.matmul(rule_matches, mapping_rules_labels)

### Majority Vote

Now we make a majority vote based on all rule matches. First we get the `rule_counts` by multiplying `rule_matches` with `mapping_rules_labels`; then we divide each row by its sum to get a probability per label. Samples without any rule match have a row sum of zero, so the division produces NaN values. We counteract this division-by-zero issue by setting all NaN values to zero.
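
The steps described above can be sketched in plain NumPy as follows. This illustrates the computation and is not the code of `get_majority_vote_probs`; the function name `majority_vote_probs` and the toy matrices are assumptions made for the example. Instead of dividing first and replacing NaN values afterwards, the sketch skips the division for rows whose sum is zero, which gives the same result without a runtime warning.

```python
import numpy as np


def majority_vote_probs(rule_matches, mapping_rules_labels):
    """Hypothetical re-implementation of the majority vote described in the text."""
    # Votes per label for every sample.
    rule_counts = rule_matches @ mapping_rules_labels
    row_sums = rule_counts.sum(axis=1, keepdims=True)
    # Divide each row by its sum; rows without any match stay all zeros.
    probs = np.divide(
        rule_counts,
        row_sums,
        out=np.zeros(rule_counts.shape, dtype=float),
        where=row_sums != 0,
    )
    return probs


toy_matches = np.array([[1, 1, 1],   # all three rules fire
                        [0, 1, 0],   # only the negative rule fires
                        [0, 0, 0]])  # no rule fires
toy_mapping = np.array([[0, 1],      # rule 0 -> positive
                        [1, 0],      # rule 1 -> negative
                        [0, 1]])     # rule 2 -> positive
toy_probs = majority_vote_probs(toy_matches, toy_mapping)
```

The last sample ends up with the probabilities `[0, 0]`, which is worth keeping in mind when taking an `argmax` afterwards.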

In [124]:
from knodle.trainer.utils.denoise import get_majority_vote_labels, get_majority_vote_probs

In [125]:
preds = get_majority_vote_probs(rule_matches, mapping_rules_labels)

  rule_counts_probs = rule_counts / rule_counts.sum(axis=1).reshape(-1, 1)


In [128]:
preds_label = np.argmax(preds, axis=1)

In [129]:
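# Negative (0) is heavily overrepresented: samples without any rule match
# have all-zero probabilities, and argmax then falls back to class 0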
np.unique(preds_label, return_counts=True)

(array([0, 1]), array([38103, 11897]))

In [130]:
imdb_dataset_raw['label_id'] = imdb_dataset_raw.sentiment.map({'positive':POSITIVE, 'negative':NEGATIVE})
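
Since the gold labels are only there for evaluation, one way to use them is to compare the majority-vote predictions with the `label_id` column. The sketch below is an assumption about how such a check could look, not a step of the original notebook; the function name `evaluate_majority_vote` and the toy arrays are hypothetical. It reports accuracy separately for samples that were matched by at least one rule, because unmatched samples default to class 0.

```python
import numpy as np


def evaluate_majority_vote(probs, gold):
    """Hypothetical helper: coverage and accuracy of majority-vote predictions."""
    # A sample is covered if at least one rule voted for it.
    covered = probs.sum(axis=1) > 0
    preds = probs.argmax(axis=1)
    coverage = covered.mean()
    # Accuracy on covered samples only (NaN if nothing is covered).
    if covered.any():
        accuracy_covered = (preds[covered] == gold[covered]).mean()
    else:
        accuracy_covered = float("nan")
    # Accuracy over all samples, including the argmax fallback to class 0.
    accuracy_all = (preds == gold).mean()
    return coverage, accuracy_covered, accuracy_all


toy_probs_eval = np.array([[0.0, 1.0],
                           [1.0, 0.0],
                           [0.0, 0.0],   # no rule matched
                           [0.5, 0.5]])  # tie, argmax picks class 0
toy_gold = np.array([1, 0, 1, 1])
coverage, acc_covered, acc_all = evaluate_majority_vote(toy_probs_eval, toy_gold)
```

In the notebook, the corresponding inputs would be `preds` and `imdb_dataset_raw['label_id'].values`.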

## Save Files


In [133]:
from joblib import dump
from scipy.sparse import csr_matrix

In [134]:
rule_matches_sparse = csr_matrix(rule_matches)

In [136]:
dump(rule_matches_sparse, 'rule_matches.lib')
dump(mapping_rules_labels, "mapping_rules_labels.lib")

['mapping_rules_labels.lib']
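
For reference, files written this way can be read back with `joblib.load`. The following round trip is a small sketch on a toy matrix in a temporary directory, so it does not touch the files created above; the variable names are illustrative.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from scipy.sparse import csr_matrix

toy_rule_matches = np.array([[1, 0, 1],
                             [0, 0, 0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rule_matches.lib")
    # Store the matrix in sparse form, as done for the real rule matches.
    dump(csr_matrix(toy_rule_matches), path)
    # load() returns the sparse matrix; toarray() converts it back to dense.
    restored = load(path).toarray()
```

Converting back with `toarray()` is only needed if the downstream code expects a dense NumPy array.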

In [72]:
!ls

IMDB Dataset.csv           keywords.csv
Prepare_IMDB_Data.ipynb    mapping_rules_labels.lib
applied_lfs.lib            requirements.txt
imdb_data_preprocessed.csv tfidf.lib


In [137]:
keywords.to_csv('keywords.csv', index=None)


In [68]:
imdb_dataset_raw.to_csv('imdb_data_preprocessed.csv', index=None)

# Finish

Now we have created a weak supervision dataset. Of course it is not perfect, but it gives us something on which we can compare the performance of different denoising methods. :-)