# IMDB Dataset - Create Weak Supervision Sources

This notebook shows how to create labeling functions on the IMDB Movie Review dataset.

This dataset has gold labels, but they are used here for evaluation only. The idea behind weak supervision, and knodle in particular, is that you do not need a dataset labeled purely by hand (strong supervision); instead, you label it with weak supervision sources.

First, we load the dataset from Kaggle. Then we look at frequent keywords for both sentiments and pick well-matching ones to act as weak supervision sources. Finally, we try out the keywords with a basic majority vote model.

## Imports

Let's make some basic imports.

In [1]:
import pandas as pd 
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer
from bs4 import BeautifulSoup
from snorkel.labeling import LabelingFunction, PandasLFApplier, filter_unlabeled_dataframe, LFAnalysis
from snorkel.labeling.model import MajorityLabelVoter, LabelModel
import numpy as np 
from tqdm import tqdm

In [2]:
# Init
tqdm.pandas()
# Show full column contents; `None` replaces the deprecated value -1
pd.set_option('display.max_colwidth', None)



In [3]:
# Constants
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1
COLUMN_WITH_TEXT = "reviews_preprocessed"
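
To make the role of these label constants concrete, here is a minimal sketch of a majority vote over the outputs of several labeling functions for a single review. It is plain Python and not Snorkel's implementation; the helper name `majority_vote` and the choice to abstain on ties are assumptions made for illustration.

```python
from collections import Counter

POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1


def majority_vote(votes):
    """Return the most common non-abstain label (hypothetical helper).

    Abstains if no labeling function voted or if the top labels are tied.
    """
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    # Assumption: a tie between the two most common labels yields ABSTAIN
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


print(majority_vote([POSITIVE, ABSTAIN, POSITIVE, NEGATIVE]))  # 1
print(majority_vote([POSITIVE, NEGATIVE, ABSTAIN]))            # -1
print(majority_vote([ABSTAIN, ABSTAIN]))                       # -1
```

Abstaining votes are ignored rather than counted against a label, which is why a review with a single matching keyword still receives a label.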

## Download the raw dataset

Now we download the dataset. This requires the Kaggle CLI (`kaggle`) to be installed and configured with your API key; see the official [documentation](https://github.com/Kaggle/kaggle-api) for instructions.

In [4]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
# bsdtar (the macOS default) can unpack zip archives; with GNU tar, use `unzip` instead
!tar -xvf imdb-dataset-of-50k-movie-reviews.zip
!rm imdb-dataset-of-50k-movie-reviews.zip

Downloading imdb-dataset-of-50k-movie-reviews.zip to /Users/sandro/repo/knodle/tutorials/ImdbDataset
100%|██████████████████████████████████████| 25.7M/25.7M [00:01<00:00, 23.3MB/s]
100%|██████████████████████████████████████| 25.7M/25.7M [00:01<00:00, 21.5MB/s]
x IMDB Dataset.csv


## Preview dataset

After downloading and unpacking the dataset, we can take a first look at it.

In [5]:
imdb_dataset_raw = pd.read_csv('IMDB Dataset.csv')

In [6]:
imdb_dataset_raw.head()

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her ""sexy"" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than ""Devil Wears Prada"" and more interesting than ""Superman"" a great comedy to go see with friends.",positive
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them.",negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with most of the people we encounter.<br /><br />The acting is good under Mr. Mattei's direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, and the rest of the talented cast, make these characters come alive.<br /><br />We wish Mr. Mattei good luck and await anxiously for his next work.",positive


In [7]:
imdb_dataset_raw.groupby('sentiment').count()


sentiment,review
negative,25000
positive,25000


In [8]:
imdb_dataset_raw.isna().sum()

review       0
sentiment    0
dtype: int64

## Preprocess dataset

Now let's take some basic preprocessing steps.

### Remove Stopwords

We begin by removing all common stop words. We use `scikit-learn`'s built-in stop word list so that we don't need to install additional packages.

In [9]:
imdb_dataset_raw['reviews_preprocessed'] = imdb_dataset_raw['review'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in (ENGLISH_STOP_WORDS)]))

In [10]:
imdb_dataset_raw.head()

Unnamed: 0,review,sentiment,reviews_preprocessed
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive,"One reviewers mentioned watching just 1 Oz episode you'll hooked. They right, exactly happened me.<br /><br />The thing struck Oz brutality unflinching scenes violence, set right word GO. Trust me, faint hearted timid. 
This pulls punches regards drugs, sex violence. Its hardcore, classic use word.<br /><br />It called OZ nickname given Oswald Maximum Security State Penitentary. It focuses mainly Emerald City, experimental section prison cells glass fronts face inwards, privacy high agenda. Em City home many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish more....so scuffles, death stares, dodgy dealings shady agreements far away.<br /><br />I say main appeal fact goes shows wouldn't dare. Forget pretty pictures painted mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The episode I saw struck nasty surreal, I couldn't say I ready it, I watched more, I developed taste Oz, got accustomed high levels graphic violence. Not just violence, injustice (crooked guards who'll sold nickel, inmates who'll kill order away it, mannered, middle class inmates turned prison bitches lack street skills prison experience) Watching Oz, comfortable uncomfortable viewing....thats touch darker side."
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive,"A wonderful little production. <br /><br />The filming technique unassuming- old-time-BBC fashion gives comforting, discomforting, sense realism entire piece. <br /><br />The actors extremely chosen- Michael Sheen ""has got polari"" voices pat too! You truly seamless editing guided references Williams' diary entries, worth watching terrificly written performed piece. A masterful production great master's comedy life. <br /><br />The realism really comes home little things: fantasy guard which, use traditional 'dream' techniques remains solid disappears. It plays knowledge senses, particularly scenes concerning Orton Halliwell sets (particularly flat Halliwell's murals decorating surface) terribly done."
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her ""sexy"" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than ""Devil Wears Prada"" and more interesting than ""Superman"" a great comedy to go see with friends.",positive,"I thought wonderful way spend time hot summer weekend, sitting air conditioned theater watching light-hearted comedy. The plot simplistic, dialogue witty characters likable (even bread suspected serial killer). While disappointed realize Match Point 2: Risk Addiction, I thought proof Woody Allen fully control style grown love.<br /><br />This I'd laughed Woody's comedies years (dare I say decade?). While I've impressed Scarlet Johanson, managed tone ""sexy"" image jumped right average, spirited young woman.<br /><br />This crown jewel career, wittier ""Devil Wears Prada"" interesting ""Superman"" great comedy friends."
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them.",negative,"Basically there's family little boy (Jake) thinks there's zombie closet & parents fighting time.<br /><br />This movie slower soap opera... suddenly, Jake decides Rambo kill zombie.<br /><br />OK, you're going make film Decide thriller drama! As drama movie watchable. Parents divorcing & arguing like real life. And Jake closet totally ruins film! I expected BOOGEYMAN similar movie, instead watched drama meaningless thriller spots.<br /><br />3 10 just playing parents & descent dialogs. As shots Jake: just ignore them."
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with most of the people we encounter.<br /><br />The acting is good under Mr. Mattei's direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, and the rest of the talented cast, make these characters come alive.<br /><br />We wish Mr. Mattei good luck and await anxiously for his next work.",positive,"Petter Mattei's ""Love Time Money"" visually stunning film watch. Mr. Mattei offers vivid portrait human relations. This movie telling money, power success people different situations encounter. <br /><br />This variation Arthur Schnitzler's play theme, director transfers action present time New York different characters meet connect. Each connected way, person, know previous point contact. Stylishly, film sophisticated luxurious look. We taken people live world live habitat.<br /><br />The thing gets souls picture different stages loneliness inhabits. 
A big city exactly best place human relations fulfillment, discerns case people encounter.<br /><br />The acting good Mr. Mattei's direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, rest talented cast, make characters come alive.<br /><br />We wish Mr. Mattei good luck await anxiously work."
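
The output above still contains capitalized stop words such as "The" and "They": the membership test is case-sensitive, while the stop word list is lowercase. A minimal sketch of a case-insensitive variant follows. The tiny `STOP_WORDS` set is a stand-in for `ENGLISH_STOP_WORDS` so that the snippet runs without scikit-learn, and `remove_stop_words` is a hypothetical helper.

```python
# Stand-in for sklearn's ENGLISH_STOP_WORDS (assumption: a small lowercase subset)
STOP_WORDS = frozenset({"the", "they", "are", "is", "this", "a", "of"})


def remove_stop_words(text, stop_words=STOP_WORDS, ignore_case=True):
    """Drop whitespace-separated tokens that appear in `stop_words`."""
    kept = []
    for word in text.split():
        # Compare the lowercased token when case should be ignored
        key = word.lower() if ignore_case else word
        if key not in stop_words:
            kept.append(word)
    return " ".join(kept)


sample = "They are right. The plot is simplistic"
print(remove_stop_words(sample, ignore_case=False))  # They right. The plot simplistic
print(remove_stop_words(sample))                     # right. plot simplistic
```

Tokens with attached punctuation (for example "me,") are still not matched by this approach; the word counting step further below filters stop words a second time after tokenization.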


### Remove HTML Tags

The dataset contains many HTML tags. We'll remove them.

In [11]:
def strip_html(text):
    # Parse the review as HTML and keep only its text content
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

imdb_dataset_raw['reviews_preprocessed'] = imdb_dataset_raw['reviews_preprocessed'].apply(strip_html)

In [12]:
imdb_dataset_raw['reviews_preprocessed'].head()

0    One reviewers mentioned watching just 1 Oz episode you'll hooked. They right, exactly happened me.The thing struck Oz brutality unflinching scenes violence, set right word GO. Trust me, faint hearted timid. This pulls punches regards drugs, sex violence. Its hardcore, classic use word.It called OZ nickname given Oswald Maximum Security State Penitentary. It focuses mainly Emerald City, experimental section prison cells glass fronts face inwards, privacy high agenda. Em City home many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish more....so scuffles, death stares, dodgy dealings shady agreements far away.I say main appeal fact goes shows wouldn't dare. Forget pretty pictures painted mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The episode I saw struck nasty surreal, I couldn't say I ready it, I watched more, I developed taste Oz, got accustomed high levels graphic violence. Not just violence, injustice (crooked guards who'll sold n
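
Note that removing `<br />` tags without a replacement glues sentences together ("me.The" and "word.It" in the output above). One way to avoid this is to put a space where a tag was removed. The sketch below does this with Python's standard-library `html.parser` instead of BeautifulSoup, so it is a swapped-in technique rather than the notebook's method; `strip_html_keep_spaces` and `_TextCollector` are hypothetical names. BeautifulSoup's `get_text` also accepts a `separator` argument that can serve the same purpose.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text between tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html_keep_spaces(text: str) -> str:
    collector = _TextCollector()
    collector.feed(text)
    collector.close()
    # Join text fragments with a space, then collapse repeated whitespace
    return " ".join(" ".join(collector.parts).split())


print(strip_html_keep_spaces("happened me.<br /><br />The thing struck Oz"))
# happened me. The thing struck Oz
```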

## Count Words

Now we count the words per sentiment to find well-matching keywords. We first split the dataset into positive and negative reviews and then count the words within each group.

In [13]:
positive_reviews = \
    imdb_dataset_raw.loc[imdb_dataset_raw.sentiment == 'positive', ['reviews_preprocessed', 'sentiment']]

In [14]:
negative_reviews = \
    imdb_dataset_raw.loc[imdb_dataset_raw.sentiment == 'negative', ['reviews_preprocessed', 'sentiment']]

In [15]:
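def get_word_count_list(reviews: pd.DataFrame, min_df: int = 10) -> pd.DataFrame:
    # Only keep words that occur in at least `min_df` reviews
    vectorizer = CountVectorizer(min_df=min_df)
    X = vectorizer.fit_transform(reviews.reviews_preprocessed.values)
    # Requires scikit-learn >= 1.0; older versions use get_feature_names()
    word_list = vectorizer.get_feature_names_out()
    # Sum the sparse matrix directly instead of converting it to a dense array first
    count_list = np.asarray(X.sum(axis=0)).ravel()
    word_count = pd.DataFrame({'count': count_list}, index=word_list).sort_values('count', ascending=False)
    # Remove stop words again: CountVectorizer lowercases tokens, so capitalized stop words are caught here
    word_count = word_count.loc[~word_count.index.isin(ENGLISH_STOP_WORDS)]
    return word_count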
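
The `min_df` argument filters by document frequency (the number of reviews a word appears in), while the returned `count` column holds the total number of occurrences. The following standard-library sketch mimics that behaviour on a toy corpus. It is not `CountVectorizer` itself: the function name `count_words` is hypothetical, and the token pattern is an assumption modelled on `CountVectorizer`'s default (lowercased tokens of at least two word characters).

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's default tokenization
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_words(texts, min_df=1):
    total = Counter()     # total occurrences per word
    doc_freq = Counter()  # number of texts that contain the word
    for text in texts:
        tokens = TOKEN_PATTERN.findall(text.lower())
        total.update(tokens)
        doc_freq.update(set(tokens))
    # Keep only words that appear in at least `min_df` texts
    kept = {word: count for word, count in total.items() if doc_freq[word] >= min_df}
    # Sort by descending count, then alphabetically for a stable order
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


reviews = ["Great film, great cast", "A great story", "Bad film"]
print(count_words(reviews, min_df=2))  # [('great', 3), ('film', 2)]
```

Here "great" is counted three times but only needs to appear in two reviews to pass `min_df=2`; "cast", "story" and "bad" each occur in a single review and are dropped.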

### Positive Reviews

First, let's look at the positive reviews and find keywords that might describe them.

In [16]:
word_count_positive = get_word_count_list(positive_reviews, min_df=100)

In [17]:
word_count_positive.describe()

Unnamed: 0,count
count,3627.0
mean,557.271574
std,1355.342072
min,100.0
25%,159.0
50%,253.0
75%,507.0
max,42093.0


In [18]:
# All words whose count lies above the 95th percentile
q95_positive_words = word_count_positive.loc[word_count_positive['count'] > word_count_positive['count'].quantile(0.95)]

In [19]:
q95_positive_words.index.values

array(['film', 'movie', 'like', 'good', 'just', 'great', 'story', 'time',
       'really', 'people', 'love', 'best', 'life', 'way', 'films',
       'think', 'characters', 'don', 'movies', 'character', 'seen', 'man',
       'watch', 'make', 'little', 'does', 'know', 'did', 'years', 'end',
       'scene', 'real', 'scenes', 'say', 'acting', 'plot', 'world',
       'makes', 'better', 'new', 've', '10', 'young', 'work', 'old',
       'lot', 'quite', 'cast', 'funny', 'series', 'director', 'actors',
       'music', 'role', 'watching', 'look', 'bad', 'doesn', 'family',
       'performance', 'things', 'comedy', 'times', 'going', 'big', 'saw',
       'long', 'thing', 'actually', 'excellent', 'didn', 'bit', 'fun',
       'right', 'action', 'thought', 'fact', 'feel', 'want', 'come',
       'played', 'especially', 'got', 'war', 'horror', 'beautiful', 'day',
       'pretty', 'dvd', 'different', 'shows', 'gets', 'tv', 'interesting',
       'true', 'job', 'll', 'woman', 'probably', 'far', 'wonderful',
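
The filter above keeps words whose count lies above the 95th percentile of all counts. As a rough illustration of what that threshold means, the sketch below computes a percentile with linear interpolation between neighbouring values (which is also the pandas default) on made-up counts. The numbers and the `percentile` helper are hypothetical.

```python
def percentile(values, q):
    """Linearly interpolated quantile of `values` for 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Made-up word counts for illustration only
toy_counts = {"film": 100, "movie": 90, "plot": 10, "cast": 8, "role": 6}
threshold = percentile(toy_counts.values(), 0.95)
frequent = [word for word, count in toy_counts.items() if count > threshold]
print(threshold, frequent)
```

As the real output shows, very frequent words are not necessarily sentiment-bearing ('film', 'movie' and even 'bad' rank high among the positive reviews), which is why the keyword list is then picked by hand.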

In [20]:
# Manually created list of keywords
positive_keywords = [
    'like','good','great','love', 'best', 'funny','excellent','fun','beautiful','interesting','wonderful',
    'original','perfect','classic','loved','recommend','amazing','favorite'
]
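
Each keyword can be turned into a weak supervision source: a function that votes POSITIVE when the keyword occurs and abstains otherwise. The notebook imports Snorkel's `LabelingFunction` for this purpose; the sketch below deliberately stays in plain Python so that it runs without Snorkel, and `make_keyword_lf` is a hypothetical helper, not part of any library.

```python
import re

POSITIVE = 1
ABSTAIN = -1


def make_keyword_lf(keyword, label):
    """Build a function that returns `label` if `keyword` occurs as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)

    def lf(text):
        return label if pattern.search(text) else ABSTAIN

    lf.__name__ = f"lf_{keyword}"
    return lf


# A subset of the keyword list, for brevity
keywords = ["great", "wonderful", "recommend"]
lfs = [make_keyword_lf(keyword, POSITIVE) for keyword in keywords]

review = "A wonderful little production."
print([lf(review) for lf in lfs])  # [-1, 1, -1]
```

Most keyword functions abstain on any given review, so the coverage of each keyword matters as much as its accuracy.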

In [21]:
imdb_dataset_raw.loc[imdb_dataset_raw.reviews_preprocessed.str.contains('like')]

Unnamed: 0,review,sentiment,reviews_preprocessed
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them.",negative,"Basically there's family little boy (Jake) thinks there's zombie closet & parents fighting time.This movie slower soap opera... suddenly, Jake decides Rambo kill zombie.OK, you're going make film Decide thriller drama! As drama movie watchable. Parents divorcing & arguing like real life. And Jake closet totally ruins film! I expected BOOGEYMAN similar movie, instead watched drama meaningless thriller spots.3 10 just playing parents & descent dialogs. As shots Jake: just ignore them."
5,"Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. It just never gets old, despite my having seen it some 15 or more times in the last 25 years. Paul Lukas' performance brings tears to my eyes, and Bette Davis, in one of her very few truly sympathetic roles, is a delight. The kids are, as grandma says, more like ""dressed-up midgets"" than children, but that only makes them more fun to watch. And the mother's slow awakening to what's happening in the world and under her own roof is believable and startling. If I had a dozen thumbs, they'd all be ""up"" for this movie.",positive,"Probably all-time favorite movie, story selflessness, sacrifice dedication noble cause, it's preachy boring. It just gets old, despite having seen 15 times 25 years. Paul Lukas' performance brings tears eyes, Bette Davis, truly sympathetic roles, delight. The kids are, grandma says, like ""dressed-up midgets"" children, makes fun watch. And mother's slow awakening what's happening world roof believable startling. If I dozen thumbs, they'd ""up"" movie."
6,"I sure would like to see a resurrection of a up dated Seahunt series with the tech they have today it would bring back the kid excitement in me.I grew up on black and white TV and Seahunt with Gunsmoke were my hero's every week.You have my vote for a comeback of a new sea hunt.We need a change of pace in TV and this would work for a world of under water adventure.Oh by the way thank you for an outlet like this to view many viewpoints about TV and the many movies.So any ole way I believe I've got what I wanna say.Would be nice to read some more plus points about sea hunt.If my rhymes would be 10 lines would you let me submit,or leave me out to be in doubt and have me to quit,If this is so then I must go so lets do it.",positive,"I sure like resurrection dated Seahunt series tech today bring kid excitement me.I grew black white TV Seahunt Gunsmoke hero's week.You vote comeback new sea hunt.We need change pace TV work world water adventure.Oh way thank outlet like view viewpoints TV movies.So ole way I believe I've got I wanna say.Would nice read plus points sea hunt.If rhymes 10 lines let submit,or leave doubt quit,If I lets it."
9,"If you like original gut wrenching laughter you will like this movie. If you are young or old then you will love this movie, hell even my mom liked it.<br /><br />Great Camp!!!",positive,"If like original gut wrenching laughter like movie. If young old love movie, hell mom liked it.Great Camp!!!"
12,"So im not a big fan of Boll's work but then again not many are. I enjoyed his movie Postal (maybe im the only one). Boll apparently bought the rights to use Far Cry long ago even before the game itself was even finsished. <br /><br />People who have enjoyed killing mercs and infiltrating secret research labs located on a tropical island should be warned, that this is not Far Cry... This is something Mr Boll have schemed together along with his legion of schmucks.. Feeling loneley on the set Mr Boll invites three of his countrymen to play with. These players go by the names of Til Schweiger, Udo Kier and Ralf Moeller.<br /><br />Three names that actually have made them selfs pretty big in the movie biz. So the tale goes like this, Jack Carver played by Til Schweiger (yes Carver is German all hail the bratwurst eating dudes!!) However I find that Tils acting in this movie is pretty badass.. People have complained about how he's not really staying true to the whole Carver agenda but we only saw carver in a first person perspective so we don't really know what he looked like when he was kicking a**.. <br /><br />However, the storyline in this film is beyond demented. We see the evil mad scientist Dr. Krieger played by Udo Kier, making Genetically-Mutated-soldiers or GMS as they are called. Performing his top-secret research on an island that reminds me of ""SPOILER"" Vancouver for some reason. Thats right no palm trees here. Instead we got some nice rich lumberjack-woods. We haven't even gone FAR before I started to CRY (mehehe) I cannot go on any more.. If you wanna stay true to Bolls shenanigans then go and see this movie you will not be disappointed it delivers the true Boll experience, meaning most of it will suck.<br /><br />There are some things worth mentioning that would imply that Boll did a good work on some areas of the film such as some nice boat and fighting scenes. 
Until the whole cromed/albino GMS squad enters the scene and everything just makes me laugh.. The movie Far Cry reeks of scheisse (that's poop for you simpletons) from a fa,r if you wanna take a wiff go ahead.. BTW Carver gets a very annoying sidekick who makes you wanna shoot him the first three minutes he's on screen.",negative,"So im big fan Boll's work are. I enjoyed movie Postal (maybe im one). Boll apparently bought rights use Far Cry long ago game finsished. People enjoyed killing mercs infiltrating secret research labs located tropical island warned, Far Cry... This Mr Boll schemed legion schmucks.. Feeling loneley set Mr Boll invites countrymen play with. These players names Til Schweiger, Udo Kier Ralf Moeller.Three names actually selfs pretty big movie biz. So tale goes like this, Jack Carver played Til Schweiger (yes Carver German hail bratwurst eating dudes!!) However I Tils acting movie pretty badass.. People complained he's really staying true Carver agenda saw carver person perspective don't really know looked like kicking a**.. However, storyline film demented. We evil mad scientist Dr. Krieger played Udo Kier, making Genetically-Mutated-soldiers GMS called. Performing top-secret research island reminds ""SPOILER"" Vancouver reason. Thats right palm trees here. Instead got nice rich lumberjack-woods. We haven't gone FAR I started CRY (mehehe) I more.. If wanna stay true Bolls shenanigans movie disappointed delivers true Boll experience, meaning suck.There things worth mentioning imply Boll did good work areas film nice boat fighting scenes. Until cromed/albino GMS squad enters scene just makes laugh.. The movie Far Cry reeks scheisse (that's poop simpletons) fa,r wanna wiff ahead.. BTW Carver gets annoying sidekick makes wanna shoot minutes he's screen."
...,...,...,...
49990,"Lame, lame, lame!!! A 90-minute cringe-fest that's 89 minutes too long. A setting ripe with atmosphere and possibility (an abandoned convent) is squandered by a stinker of a script filled with clunky, witless dialogue that's straining oh-so-hard to be hip. Mostly it's just embarrassing, and the attempts at gonzo horror fall flat (a sample of this movie's dialogue: after demonstrating her artillery, fast dolly shot to a closeup of Barbeau's vigilante charactershe: `any questions?' hyuck hyuck hyuck). Bad acting, idiotic, homophobic jokes and judging from the creature effects, it looks like the director's watched `The Evil Dead' way too many times. <br /><br />I owe my friends big time for renting this turkey and subjecting them to ninety wasted minutes they'll never get back. What a turd.",negative,"Lame, lame, lame!!! A 90-minute cringe-fest that's 89 minutes long. A setting ripe atmosphere possibility (an abandoned convent) squandered stinker script filled clunky, witless dialogue that's straining oh-so-hard hip. Mostly it's just embarrassing, attempts gonzo horror fall flat (a sample movie's dialogue: demonstrating artillery, fast dolly shot closeup Barbeau's vigilante character she: `any questions?' hyuck hyuck hyuck). Bad acting, idiotic, homophobic jokes judging creature effects, looks like director's watched `The Evil Dead' way times. I owe friends big time renting turkey subjecting ninety wasted minutes they'll back. What turd."
49991,"Les Visiteurs, the first movie about the medieval time travelers was actually funny. I like Jean Reno as an actor, but there was more. There were unexpected twists, funny situations and of course plain absurdness, that would remind you a little bit of Louis de Funes.<br /><br />Now this sequel has the same characters, the same actors in great part and the same time traveling. The plot changes a little, since the characters now are supposed to be experienced time travelers. So they jump up and down in history, without paying any attention to the fact that it keeps getting absurder as you advance in the movie. The duke, Jean Reno, tries to keep the whole thing together with his playing, but his character has been emptied, so there's not a lot he can do to save the film.<br /><br />Now the duke's slave/helper, he has really all the attention. The movie is merely about him and his being clumsy / annoying / stupid or whatever he was supposed to be. Fact is; this character tries to produce the laughter from the audience, but he does not succeed. It is as if someone was telling you a really very very bad joke, you already know, but he insists on telling that joke till the end, adding details, to make your suffering a little longer.<br /><br />If you liked Les Visiteurs, do not spoil the taste in your mouth with the sequel. If you didn't like Les Visiteurs, you would never consider seeing the sequel. If you liked this sequel... well, I suppose you still need to see a lot of movies.",negative,"Les Visiteurs, movie medieval time travelers actually funny. I like Jean Reno actor, more. There unexpected twists, funny situations course plain absurdness, remind little bit Louis Funes.Now sequel characters, actors great time traveling. The plot changes little, characters supposed experienced time travelers. So jump history, paying attention fact keeps getting absurder advance movie. 
The duke, Jean Reno, tries thing playing, character emptied, there's lot save film.Now duke's slave/helper, really attention. The movie merely clumsy / annoying / stupid supposed be. Fact is; character tries produce laughter audience, does succeed. It telling really bad joke, know, insists telling joke till end, adding details, make suffering little longer.If liked Les Visiteurs, spoil taste mouth sequel. If didn't like Les Visiteurs, consider seeing sequel. If liked sequel... well, I suppose need lot movies."
49993,"Robert Colomb has two full-time jobs. He's known throughout the world as a globetrotting TV reporter. Less well-known but equally effortful are his exploits as a full-time philanderer.<br /><br />I saw `Vivre pour Vivre' dubbed in English with the title 'Live for Life.' Some life! Robert seems to always have at least three women in his life: one mistress on her way out, one on her way in, and the cheated wife at home. It helps that Robert is a glib liar. Among his most useful lies are `I'll call you tomorrow' and `My work took longer than planned.' He spends a lot of time and money on planes, trains and hotel rooms for his succession of liaisons. You wonder when this guy will get caught with his pants down.<br /><br />Some may find his life exciting, but I thought it to be tedious. His companions, including his wife, Catherine, are all attractive and desirable women. But his lifestyle is so hectic and he is so deceitful, you wonder if he's enjoying all this.<br /><br />Adding to the tedium is considerable footage that doesn't further the plot. There are extended sections with no dialogue or French-only dialogue. We see documentaries of wars, torture, and troop training interspersed with the live action. When Robert's flight returns from Africa, we wait and wait for the plane to land and taxi to the airport terminal.<br /><br />Annie Girardot is the standout performer in this film. Hers was the most interesting character and she played it to perfection. It was also nice to see Candice Bergen at the beginning of her career. I can't find fault with Yves Montand's performance of what was basically an amoral bum.<br /><br />I enjoyed some of Claude Lelouch's novel techniques. In a hotel room scene, the camera pans around the room as Robert and his mistress argue. We catch sight of them briefly during each pass around the room. In another scene set on a sleeping car of a train, Robert is lying on the upper bunk while his wife is on the lower. 
Robert is giving his wife some important but distressing news, but we hear only parts of it because of the clatter of the train. I sensed that his wife was also unable to absorb every word due to the shocking nature of the news. I also liked the exciting safari scenes in Africa. The cinematography of those scenes and of those in Amsterdam was superb.<br /><br />I reviewed this movie as part of a project at the Library of Congress. I've named the project FIFTY: 50 Notable Films Forgotten Within 50 Years. As best I can determine, this film, like the other forty-nine I've identified, has not been on video, telecast, or distributed in the U.S. since its original release. In my opinion, it is worthy of being made available again.<br /><br />",negative,"Robert Colomb full-time jobs. He's known world globetrotting TV reporter. Less well-known equally effortful exploits full-time philanderer.I saw `Vivre pour Vivre' dubbed English title 'Live Life.' Some life! Robert women life: mistress way out, way in, cheated wife home. It helps Robert glib liar. Among useful lies `I'll tomorrow' `My work took longer planned.' He spends lot time money planes, trains hotel rooms succession liaisons. You wonder guy caught pants down.Some life exciting, I thought tedious. His companions, including wife, Catherine, attractive desirable women. But lifestyle hectic deceitful, wonder he's enjoying this.Adding tedium considerable footage doesn't plot. There extended sections dialogue French-only dialogue. We documentaries wars, torture, troop training interspersed live action. When Robert's flight returns Africa, wait wait plane land taxi airport terminal.Annie Girardot standout performer film. Hers interesting character played perfection. It nice Candice Bergen beginning career. I can't fault Yves Montand's performance basically amoral bum.I enjoyed Claude Lelouch's novel techniques. In hotel room scene, camera pans room Robert mistress argue. We catch sight briefly pass room. 
In scene set sleeping car train, Robert lying upper bunk wife lower. Robert giving wife important distressing news, hear parts clatter train. I sensed wife unable absorb word shocking nature news. I liked exciting safari scenes Africa. The cinematography scenes Amsterdam superb.I reviewed movie project Library Congress. I've named project FIFTY: 50 Notable Films Forgotten Within 50 Years. As best I determine, film, like forty-nine I've identified, video, telecast, distributed U.S. original release. In opinion, worthy available again."
49995,"I thought this movie did a down right good job. It wasn't as creative or original as the first, but who was expecting it to be. It was a whole lotta fun. the more i think about it the more i like it, and when it comes out on DVD I'm going to pay the money for it very proudly, every last cent. Sharon Stone is great, she always is, even if her movie is horrible(Catwoman), but this movie isn't, this is one of those movies that will be underrated for its lifetime, and it will probably become a classic in like 20 yrs. Don't wait for it to be a classic, watch it now and enjoy it. Don't expect a masterpiece, or something thats gripping and soul touching, just allow yourself to get out of your life and get yourself involved in theirs.<br /><br />All in all, this movie is entertaining and i recommend people who haven't seen it see it, because what the critics and box office say doesn't always count, see it for yourself, you never know, you might just enjoy it. I tip my hat to this movie<br /><br />8/10",positive,"I thought movie did right good job. It wasn't creative original first, expecting be. It lotta fun. think like it, comes DVD I'm going pay money proudly, cent. Sharon Stone great, is, movie horrible(Catwoman), movie isn't, movies underrated lifetime, probably classic like 20 yrs. Don't wait classic, watch enjoy it. Don't expect masterpiece, thats gripping soul touching, just allow life involved theirs.All all, movie entertaining recommend people haven't seen it, critics box office say doesn't count, yourself, know, just enjoy it. I tip hat movie8/10"


### Negative Reviews

Let's repeat the process for the negative reviews and find keywords that are typical of bad movie reviews.

In [22]:
word_count_negative = get_word_count_list(negative_reviews, min_df=100)

In [23]:
word_count_negative.describe()

Unnamed: 0,count
count,3356.0
mean,593.115018
std,1543.380762
min,100.0
25%,160.0
50%,252.0
75%,525.0
max,50090.0


In [24]:
# Keep all words whose count lies above the 95th percentile
q95_negative_words = word_count_negative.loc[
    word_count_negative['count'] > word_count_negative.quantile(0.95)['count']]

In [25]:
q95_negative_words.index.values

array(['movie', 'film', 'like', 'just', 'good', 'bad', 'time', 'really',
       'don', 'story', 'people', 'make', 'movies', 'plot', 'acting',
       'way', 'characters', 'watch', 'think', 'did', 'character', 'know',
       'better', 'seen', 'films', 'little', 'say', 'scene', 'thing',
       'end', 'does', 'scenes', 've', 'didn', 'watching', 'great',
       'doesn', 'actually', 'man', 'actors', 'worst', 'director', 'life',
       'funny', 'going', 'look', 'love', 'real', 'minutes', 'old',
       'pretty', 'horror', 'want', 'best', 'script', 'guy', 'work', '10',
       'got', 'lot', 'isn', 'things', 'original', 'fact', 'thought',
       'makes', 'point', 'new', 'big', 'long', 'years', 'gets', 'far',
       'interesting', 'cast', 'making', 'right', 'action', 'come',
       'awful', 'quite', 'money', 'll', 'kind', 'poor', 'comedy',
       'boring', 'trying', 'reason', 'stupid', 'probably', 'looking',
       'looks', 'instead', 'terrible', 'away', 'maybe', 'believe', 'saw',
       'girl', '

In [26]:
negative_keywords = [
    'bad', 'worst','horror','awful','poor','boring','stupid','terrible','waste','worse','horrible'
]

## Labeling Functions

Next, we turn these keywords into Snorkel labeling functions and check their coverage.

This is an iterative process, so we will most likely have to add more keywords and regular expressions later ;-)

In [27]:
def keyword_lookup(x, keyword, label):
    return label if keyword in x[COLUMN_WITH_TEXT].lower() else ABSTAIN
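
Note that `keyword in text` is a substring test, so the keyword `fun` also fires on `funny`. If that turns out to be too noisy, one option is a word-boundary regular expression. The following is a minimal sketch under that assumption: `keyword_lookup_regex` is a hypothetical name that is not part of the notebook or of Snorkel, and a plain dict stands in for a DataFrame row.

```python
import re

ABSTAIN = -1
POSITIVE = 1
COLUMN_WITH_TEXT = "reviews_preprocessed"


def keyword_lookup_regex(x, keyword, label):
    """Return `label` only if `keyword` occurs as a whole word, else ABSTAIN."""
    # \b anchors the match at word boundaries; re.escape guards against
    # keywords that contain regex metacharacters.
    pattern = rf"\b{re.escape(keyword)}\b"
    return label if re.search(pattern, x[COLUMN_WITH_TEXT].lower()) else ABSTAIN


row = {COLUMN_WITH_TEXT: "A funny little film"}
print("fun" in row[COLUMN_WITH_TEXT].lower())        # substring test matches
print(keyword_lookup_regex(row, "fun", POSITIVE))    # whole-word test abstains
print(keyword_lookup_regex(row, "funny", POSITIVE))  # whole-word test matches
```

Because the function keeps the `(x, keyword, label)` signature, it could presumably be passed as `f=` to `LabelingFunction` in place of `keyword_lookup`.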


In [28]:
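def make_keyword_lf(keyword: str, label: int) -> LabelingFunction:
    """
    Creates a labeling function based on a keyword.

    Args:
        keyword: Keyword to look for in the review text
        label: Label id to return if the keyword matches

    Returns:
        Labeling function named keyword_<keyword>
    """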
    return LabelingFunction(
        name=f"keyword_{keyword}",
        f=keyword_lookup,
        resources=dict(keyword=keyword, label=label),
    )

In [29]:
def create_labeling_functions(keywords: pd.DataFrame) -> np.ndarray:
    """
    Create one labeling function per row, based on the columns keyword and label_id.

    Args:
        keywords: DataFrame with processed keywords

    Returns:
        All labeling functions. 1d array with shape: (number_of_lfs,)
    """
    keywords = keywords.assign(lf=keywords.progress_apply(
        lambda x:make_keyword_lf(x.keyword, x.label_id), axis=1
    ))
    lfs = keywords.lf.values
    return lfs

In [30]:
def make_keyword_df(positive_keywords: [str], negative_keywords: [str]) -> pd.DataFrame:
    positive = pd.DataFrame(
        pd.Series({x: 'positive' for (x) in positive_keywords}, name="label")).reset_index().rename(columns={"index":"keyword"})
    
    negative = pd.DataFrame(
        pd.Series({x: 'negative' for (x) in negative_keywords}, name="label")).reset_index().rename(columns={"index":"keyword"})
    # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
    keywords = pd.concat([positive, negative])
    
    assert len(positive) + len(negative) == len(keywords), "Shapes don't match"
    
    keywords.loc[keywords.label == 'positive', 'label_id'] = int(POSITIVE)
    keywords.loc[keywords.label == 'negative', 'label_id'] = int(NEGATIVE)
    keywords.reset_index(inplace=True, drop=True)
    return keywords


In [31]:
keywords = make_keyword_df(positive_keywords, negative_keywords)

In [32]:
keywords

Unnamed: 0,keyword,label,label_id
0,like,positive,1.0
1,good,positive,1.0
2,great,positive,1.0
3,love,positive,1.0
4,best,positive,1.0
5,funny,positive,1.0
6,excellent,positive,1.0
7,fun,positive,1.0
8,beautiful,positive,1.0
9,interesting,positive,1.0


In [33]:
labeling_functions = create_labeling_functions(keywords)

100%|██████████| 29/29 [00:00<00:00, 6498.97it/s]


In [34]:
labeling_functions

array([LabelingFunction keyword_like, Preprocessors: [],
       LabelingFunction keyword_good, Preprocessors: [],
       LabelingFunction keyword_great, Preprocessors: [],
       LabelingFunction keyword_love, Preprocessors: [],
       LabelingFunction keyword_best, Preprocessors: [],
       LabelingFunction keyword_funny, Preprocessors: [],
       LabelingFunction keyword_excellent, Preprocessors: [],
       LabelingFunction keyword_fun, Preprocessors: [],
       LabelingFunction keyword_beautiful, Preprocessors: [],
       LabelingFunction keyword_interesting, Preprocessors: [],
       LabelingFunction keyword_wonderful, Preprocessors: [],
       LabelingFunction keyword_original, Preprocessors: [],
       LabelingFunction keyword_perfect, Preprocessors: [],
       LabelingFunction keyword_classic, Preprocessors: [],
       LabelingFunction keyword_loved, Preprocessors: [],
       LabelingFunction keyword_recommend, Preprocessors: [],
       LabelingFunction keyword_amazing, Preproce

### Apply Labeling Functions

Now let's apply all labeling functions to our reviews and check some statistics.

In [51]:
applier = PandasLFApplier(lfs=labeling_functions)
applied_lfs = applier.apply(df=imdb_dataset_raw)

100%|██████████| 50000/50000 [00:08<00:00, 5683.77it/s]


In [52]:
applied_lfs

array([[-1, -1, -1, ..., -1, -1, -1],
       [-1, -1,  1, ..., -1, -1, -1],
       [-1, -1,  1, ..., -1, -1, -1],
       ...,
       [-1,  1, -1, ..., -1, -1, -1],
       [ 1, -1, -1, ..., -1, -1, -1],
       [-1,  1, -1, ..., -1, -1, -1]])

Now we have a matrix with all labeling functions applied. This matrix has the shape $(\text{instances} \times \text{labeling functions})$.

In [37]:
print("Shape of applied labeling functions: ", applied_lfs.shape)
print("Number of reviews", len(imdb_dataset_raw))
print("Number of labeling functions", len(labeling_functions))

Shape of applied labeling functions:  (50000, 29)
Number of reviews 50000
Number of labeling functions 29


### Analysis

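Now we can analyse some basic statistics about our labeling functions. The main figures, each reported per labeling function, are:

- Coverage: Fraction of reviews on which the labeling function matches (does not abstain)
- Overlaps: Fraction of reviews on which the labeling function and at least one other labeling function both match (e.g. awesome and amazing)
- Conflicts: Fraction of reviews on which the labeling function matches and at least one other matching labeling function assigns a different label (e.g. awesome and bad)
- Correct: Number of reviews the labeling function labels correctly
- Incorrect: Number of reviews the labeling function labels incorrectly

Correct and Incorrect are only reported when gold labels are passed to `lf_summary`, which is why they do not appear in the table below.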
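
To make the first three figures concrete, here is a dependency-free sketch that computes them for a toy label matrix. It is an illustration of the definitions only, not Snorkel's implementation; `lf_stats` and the toy matrix `L` are made up for this example.

```python
ABSTAIN = -1

# Toy label matrix: 4 reviews (rows) x 3 labeling functions (columns).
# Columns 0 and 1 vote POSITIVE (1); column 2 votes NEGATIVE (0).
L = [
    [1, 1, -1],
    [1, -1, 0],
    [-1, -1, -1],
    [-1, -1, 0],
]


def lf_stats(L, j):
    """Coverage, overlaps and conflicts of labeling function j as fractions of rows."""
    covered = overlaps = conflicts = 0
    for row in L:
        if row[j] == ABSTAIN:
            continue
        covered += 1
        # Votes of all other labeling functions that did not abstain on this row.
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        overlaps += bool(others)
        conflicts += any(v != row[j] for v in others)
    n = len(L)
    return covered / n, overlaps / n, conflicts / n


for j in range(3):
    print(j, lf_stats(L, j))
```

In this toy matrix, labeling function 0 matches two of four reviews, overlaps with another function on both of them, and conflicts with the negative function on one.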

In [38]:
LFAnalysis(L=applied_lfs, lfs=labeling_functions).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
keyword_like,0,[1],0.51858,0.49426,0.29586
keyword_good,1,[1],0.38944,0.3767,0.22624
keyword_great,2,[1],0.27614,0.266,0.12862
keyword_love,3,[1],0.25364,0.24712,0.11644
keyword_best,4,[1],0.19712,0.191,0.09734
keyword_funny,5,[1],0.13242,0.13242,0.07612
keyword_excellent,6,[1],0.0725,0.06944,0.0278
keyword_fun,7,[1],0.23214,0.2289,0.12876
keyword_beautiful,8,[1],0.08422,0.0815,0.03512
keyword_interesting,9,[1],0.11144,0.10734,0.06326


In [39]:
lf_analysis = LFAnalysis(L=applied_lfs, lfs=labeling_functions).lf_summary()

In [40]:
pd.DataFrame(lf_analysis.mean())

Unnamed: 0,0
j,14.0
Coverage,0.12978
Overlaps,0.126032
Conflicts,0.079798


Let's have a look at some reviews matched by the second labeling function (`keyword_good`, column index 1)

In [41]:
imdb_dataset_raw.iloc[applied_lfs[:, 1] == POSITIVE, :].sample(10, random_state=1).loc[:, ['review','sentiment']]

Unnamed: 0,review,sentiment
14708,"Some of the greatest and most loved horror movies have a wicked sense of humour, but when a film comes along that isn't as original as the ""classics"" but just goes at it for laughs then a bunch of po-faced, wanna-be critics completely slag it off. This film made me laugh aloud several times, this is testament to the way this film was approached and it shows. The two main leads look natural and believable together and this really helps this film. You root for them the whole way and laugh along with them, everyone has friends like both of these guys. Another highlight for me was the monster truck, it's awesome, intimidating and really well shot. Taking inspiration (completely stealing) from loads of films, the most obvious being Duel, Jeepers Creepers and probably in reference to the Jack Black alike co-star Orange County. But really you can pick any road trip gone wrong movie and find a reference here. But so what, it's not trying to win any Oscars just give the viewer a good dose of frights and laughs and on that score it's a 10! Obviously It's not getting a 10, I give real sensible reviews and scores unlike 99% of the people on IMDb. There is no-way this movie can get a zero like so many lazy idiots give to too many films and as fun as it was it ain't getting a 10 either. It's just a good fun movie for anyone with a sense of humour and a liking for scares. You really can't get anymore simple than that.",negative
24770,"Maximum risk is quite surprising to a person that has seen more then on of his movies. Director Ringo Lam made an average action-movie, that can be compared with most of the other mid-quality action movies, what is a special predicate to a `Muscles from Brussels`movie. It has a quite classy style, an interesting atmosphere and, last but not least, the beautyful Natasha Henstridge. Even VanDamme doesn´t make you crying by his acting, he does a relatively good job. Of course you may not compare Maximum Risk (oh, what a creative title!) to `Ronin`, but after watching `Knock off` it´s the hell of a good movie... in special standards, of course.",negative
38047,"The way this story played out and the interaction between the 2 lead characters may lead me to believe that if the X-Files continues without Mulder and Scully, these would be a pretty good replacement duo.",positive
23499,"Paul Naschy as a ghostly security guard in this is scarier than most of his fur-and-shoe-polish werewolf guises. The story is not unfamiliar, a bunch of kids going to party at an abandoned school. The thing is, that one of these kid's fathers did the same thing years ago but he's now deceased, and the latest group of kids seem to be reliving an event from 23 years ago. This is fairly well done for films of this type, and there's an air of mystery to what's going on because apparently what happened to the kids before is somewhat of a mystery and perhaps the truth wasn't revealed. So no, not just your standard slice and dice. This moves along at a fairly good clip and doesn't let you lose interest like a lot of films do, and the oddball story is compelling enough to keep you interested too, and there's some suspense which is lacking in a lot of films these days. The ending is rather abrupt and I suppose is left mostly to your imagination, but then again it doesn't out-stay its welcome either. 7 out of 10, check it out.",positive
21515,"As a big fan of gorilla movies in general, I anticipated that this one would be great - and as for the gorilla effects, They were quite good, however - that is the only thing I can write about this flop. The film claims to be based on a true story but in effect, it does not even come close to what actually happened to ""Buddy"" - who in real life, was the famous Gargantua, sold to Ringling Bros. by our supposed ""heroic"" Gertrude Lintz, known by many animal enthusiasts as a woman who hardly had her animals' welfare in the best interest. As far as Buddy being portrayed as becoming aggressive, this was total fiction and at no time did the gorilla, in real life, resort to such behavior. buddy did, in fact, escape his wooden crate (not a plush cage room as depicted in movie) during a storm, to seek shelter and comfort in the house, which frightened Gertrude Lintz into selling him. No, Buddy was not released into a gorilla family surrounded by lush trees in a zoological paradise - he was abandoned in a wooden crate, deep in the back of a garage for some time with only a single light bulb for comfort and then sold to the circus - where he actually lived a better life having peanuts thrown at him until he died (historically the oldest living gorilla on record, by the way) before a show in Miami. Notice also, in the film, how Buddy grows older but the chimpanzees never age. (The chimps, by the way, were not raised simultaneously with other animals, including Buddy, as portrayed in the film)",negative
9933,"This is an absolutely true and faithful adaptation of 'The Hollow'. It could be argued that the actual mystery here is not one Christie's best but what makes 'The Hollow' special is the characterisation and I found the actors here, more or less without exception, were perfect in their parts. In such a uniformly good cast it's difficult to select stand out performances but I have to say that Sarah Miles is just perfect as Lucy Angkatell. What is extraordinary is that she not only conveys Lucy's dottiness, tactlessness and her more lovable qualities BUT she also manages to pull off the underlying truth that in fact, Lucy is not really all that nice! Megan Dodds is also very good as Henrietta and Claire Price very affecting as Gerda. John Christow is really quite an unlikeable character but Jonathan Cake nevertheless manages to make us see what his women see in him.<br /><br />As I said, the script follows the story quite faithfully. The only disconcerting thing I found was that Midge and Edward's relationship really comes out of nowhere and I do believe that some of it must have ended up on the cutting room floor! Theirs a secondary story however and the primary story is very well done. The whole thing looks beautiful as well, really capturing a perfect English autumn.<br /><br />Its a beautiful film in every respect and well worth seeing.",positive
45308,"This could have been a really good movie if someone would just have known how to finish the film.<br /><br />The story was going along just fine and heading towards that point in every movie like this where the ""gray"" characters turn ""good"" and the ""bad"" guys get their just desserts and *boom* ... it's like they ran out of script and the cast just started to make things up.<br /><br />Which wouldn't have been so bad ... if the cast had just continued with the character development they had already put in place. But such is not the case and the movie soon becomes a goofy mess.<br /><br />My advice is to watch this movie up to about the last 30 minutes ... and then shut it off. At this point, imagine how you think the next 30 minutes will look based on what you have seen so far.<br /><br />Believe me, the ending you come up with will look far better than how this film actually ends. Trust me on this.",negative
49387,"Murder in Mesopotamia, I have always considered one of the better Poirot books, as it is very creepy and has an ingenious ending. There is no doubt that the TV adaptation is visually striking, with some lovely photography and a very haunting music score. As always David Suchet is impeccable as Hercule Poirot, the comedic highlight of the episode being Poirot's battle with a mosquito in the middle of the night, and Hugh Fraser is good as the rather naive Captain Hastings. The remainder of the cast turn in decent performances, but are careful not to overshadow the two leads, a danger in some Christie adaptations. Some of the episode was quite creepy, a juxtaposition of an episode as tragic as Five Little Pigs, an episode that I enjoyed a lot more than this one. What made it creepy in particular, putting aside the music was when Louise Leidner sees the ghostly face through the window. About the adaptation, it was fairly faithful to the book, but I will say that there were three things I didn't like. The main problem was the pacing, it is rather slow, and there are some scenes where very little happens. I didn't like the fact also that they made Joseph Mercado a murderer. In the book, I see him as a rather nervous character, but the intervention of the idea of making him a murderer, and under-developing that, made him a less appealing character, though I am glad they didn't miss his drug addiction. (I also noticed that the writers left out the fact that Mrs Mercado in the book falls into hysteria when she believes she is the murderer's next victim.) The other thing that wasn't so impressive was that I felt that it may have been more effective if the adaptation had been in the viewpoint of Amy Leatheran, like it was in the book, Amy somehow seemed less sensitive in the adaptation. On the whole, despite some misjudgements on the writers' behalf, I liked Murder in Mesopotamia. 7/10 Bethany Cox.",positive
38691,"I've seen this film on Sky Cinema not too long ago.. I must admit, it was a really good Western which features 2 of the big names.. On one side, there's Charlton Heston, playing the infamous and retired lawman Samuel Burgade. On the other.. The late James Coburn playing the villainous Zach Provo.. seeking revenge on Burgade no matter what the cost..!<br /><br />The good thing about this film was there was some really good characters.. Most of the actors played it out really well.. Especially James Coburn, who I find that he was really mean in this film.. But that how it was..<br /><br />Christopher Mitchum, who I've seen everywhere in other films.. Playing Hal Brickman.. I felt his character was left out in the cold, but he manage to get himself back in by teaming up with Burgade, to bring down Provo's posse's!<br /><br />All in all, it was a great film.. Very good to watch.. Great score from the late Jerry Goldsmith..<br /><br />Wonderful piece of Western persona..! 8 out of 10.",positive
10216,"Finding this piece sandwiched between a stale prequel and a rehashed 80s machomovie on a UPN affiliate's midday Saturday program would be misleading. It deserves better and definitely uses its talented leads' best attributes to its maximum advantage. Bracco and Walken team to provide a movie that while perhaps predictable to those familiar with their genre, do the streetwise, 'troubled minds' routine that they are so good at portraying. For a chance to ride a psychological roller coaster a la Fuqua's ""Training Day,"" dive back into the world of early '90s TV movies to find ""Scam""!",positive


## Transform rule matches

To work with knodle, the labeling function output needs to be transformed into a binary matrix `rule_matches`:

- 1 -> Rule matched
- 0 -> Rule didn't match

Furthermore, we need a matrix `mapping_rules_labels`, which maps every rule to its label.
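
As a small illustration of the two matrices, the following dependency-free sketch builds both from a toy Snorkel-style output. The toy data and the `lf_labels` list are assumptions made for this example; the notebook itself uses NumPy and pandas for the same steps.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Snorkel-style output: 3 reviews x 3 labeling functions.
applied = [
    [POSITIVE, ABSTAIN, NEGATIVE],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [ABSTAIN, POSITIVE, ABSTAIN],
]
# Label each labeling function votes for when it fires.
lf_labels = [POSITIVE, POSITIVE, NEGATIVE]

# 1 where a rule fired (whatever label it emitted), 0 where it abstained.
rule_matches = [[int(v != ABSTAIN) for v in row] for row in applied]

# One-hot row per rule; the column index equals the label id.
num_classes = 2
mapping_rules_labels = [
    [int(label == c) for c in range(num_classes)] for label in lf_labels
]

print(rule_matches)          # [[1, 0, 1], [0, 0, 0], [0, 1, 0]]
print(mapping_rules_labels)  # [[0, 1], [0, 1], [1, 0]]
```

Note that the label a rule emitted is no longer visible in `rule_matches`; that information lives only in `mapping_rules_labels`.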

In [42]:
def transform_snorkel_to_knodle(applied_lfs: np.ndarray):
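    # Every value other than ABSTAIN (-1) is a rule match and maps to 1
    applied_lfs = np.where(applied_lfs != ABSTAIN, 1, applied_lfs)
    
    # ABSTAIN (-1) means the rule didn't match and maps to 0
    applied_lfs = np.where(applied_lfs == ABSTAIN, 0, applied_lfs)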
    
    np.testing.assert_array_equal(np.unique(applied_lfs), np.array([0,1]))
    
    return applied_lfs



In [53]:
rule_matches = transform_snorkel_to_knodle(applied_lfs)

In [54]:
def create_mapping_rules_labels(keyword_labels: pd.Series) -> np.ndarray:
    """
    This function converts a series of labels ([0,1,1]) into a one-hot encoded matrix.
    This can serve as the matrix mapping_rules_labels for knodle.
    Each row corresponds to a rule and each column to a label id.
    """
    return pd.get_dummies(keyword_labels, prefix='label').to_numpy()

In [55]:
mapping_rules_labels = create_mapping_rules_labels(keywords.label_id)

In [46]:
rule_label_counts = np.matmul(rule_matches, mapping_rules_labels)

### Majority Vote

Now we compute a majority vote based on all rule matches. First we get the `rule_counts` by multiplying `rule_matches` with `mapping_rules_labels`. Then we divide every row by its sum to get a probability per label. Finally, we handle the division-by-zero case for reviews without any rule match by setting all NaN values to zero.
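
The arithmetic can be traced on a toy example. The sketch below is a dependency-free illustration of the same computation, not the notebook's NumPy implementation; the function name `majority_vote_probs` and the toy matrices are made up for this example.

```python
def majority_vote_probs(rule_matches, mapping_rules_labels):
    """Row-normalised label counts; rows without any match stay all-zero."""
    num_classes = len(mapping_rules_labels[0])
    probs = []
    for row in rule_matches:
        # Count how many matching rules vote for each class.
        counts = [
            sum(z * t[c] for z, t in zip(row, mapping_rules_labels))
            for c in range(num_classes)
        ]
        total = sum(counts)
        # Guard against division by zero for reviews no rule matched.
        probs.append([c / total if total else 0.0 for c in counts])
    return probs


rule_matches = [[1, 1, 1], [0, 0, 0], [0, 0, 1]]
mapping_rules_labels = [[0, 1], [0, 1], [1, 0]]  # columns: NEGATIVE, POSITIVE
print(majority_vote_probs(rule_matches, mapping_rules_labels))
```

The first review is matched by two positive rules and one negative rule, so it gets one third negative and two thirds positive. The second review has no match and stays at zero for both labels.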

In [59]:
def get_majority_vote_probs(rule_matches: np.ndarray, mapping_rules_labels: np.ndarray): 
    rule_counts = np.matmul(rule_matches, mapping_rules_labels)
    rule_counts_probs = rule_counts / rule_counts.sum(axis=1).reshape(-1,1)
    
    # Set the probability to zero where we divided by zero (no rule matched)
    rule_counts_probs[np.isnan(rule_counts_probs)] = 0
    return rule_counts_probs

In [60]:
y_probs = get_majority_vote_probs(rule_matches, mapping_rules_labels)



In [61]:
# Positive is a bit overrepresented
np.unique(y_probs, return_counts=True)

(array([0.        , 0.06666667, 0.07142857, 0.07692308, 0.08333333,
        0.09090909, 0.1       , 0.11111111, 0.125     , 0.14285714,
        0.15384615, 0.16666667, 0.18181818, 0.2       , 0.21428571,
        0.22222222, 0.23076923, 0.25      , 0.26666667, 0.27272727,
        0.28571429, 0.3       , 0.30769231, 0.3125    , 0.33333333,
        0.35714286, 0.36363636, 0.375     , 0.38461538, 0.4       ,
        0.41666667, 0.42857143, 0.4375    , 0.44444444, 0.45454545,
        0.46153846, 0.46666667, 0.5       , 0.53333333, 0.53846154,
        0.54545455, 0.55555556, 0.5625    , 0.57142857, 0.58333333,
        0.6       , 0.61538462, 0.625     , 0.63636364, 0.64285714,
        0.66666667, 0.6875    , 0.69230769, 0.7       , 0.71428571,
        0.72727273, 0.73333333, 0.75      , 0.76923077, 0.77777778,
        0.78571429, 0.8       , 0.81818182, 0.83333333, 0.84615385,
        0.85714286, 0.875     , 0.88888889, 0.9       , 0.90909091,
        0.91666667, 0.92307692, 0.92857143, 0.93

In [62]:
y_probs.shape

(50000, 2)

In [63]:
imdb_dataset_raw['label_id'] = imdb_dataset_raw.sentiment.map({'positive':POSITIVE, 'negative':NEGATIVE})
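
With the gold labels mapped to ids, the majority vote could be scored against them. The following is a minimal sketch, assuming that reviews without any rule match and reviews with a tie are skipped rather than guessed; `majority_vote_accuracy` is a hypothetical helper, and it expects plain lists (for a NumPy array, something like `y_probs.tolist()` would be needed first).

```python
def majority_vote_accuracy(y_probs, gold):
    """Accuracy over rows with a unique highest-probability class.

    Returns (accuracy, number_of_rows_scored).
    """
    correct = scored = 0
    for probs, label in zip(y_probs, gold):
        best = max(probs)
        # Skip rows with no rule match (all zeros) or a tie between classes.
        if best == 0 or probs.count(best) > 1:
            continue
        scored += 1
        correct += probs.index(best) == label
    return (correct / scored if scored else 0.0), scored


y_probs = [[0.25, 0.75], [0.0, 0.0], [0.5, 0.5], [1.0, 0.0]]
gold = [1, 0, 1, 1]
print(majority_vote_accuracy(y_probs, gold))  # (0.5, 2)
```

Reporting the number of scored rows alongside the accuracy shows how many reviews the keywords could actually decide.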

## Save Files


In [65]:
from joblib import dump

In [73]:
dump(rule_matches, 'rule_matches.lib')
dump(mapping_rules_labels, "mapping_rules_labels.lib")

['mapping_rules_labels.lib']

In [72]:
!ls

IMDB Dataset.csv           keywords.csv
Prepare_IMDB_Data.ipynb    mapping_rules_labels.lib
applied_lfs.lib            requirements.txt
imdb_data_preprocessed.csv tfidf.lib


In [67]:
keywords.to_csv('keywords.csv', index=False)


In [68]:
imdb_dataset_raw.to_csv('imdb_data_preprocessed.csv', index=False)

# Finish

Now we have created a weak supervision dataset. Of course it is not perfect, but it gives us something against which we can compare the performance of different denoising methods. :-)