# IMDB Dataset - Create Weak Supervision Sources

This notebook shows how to create labeling functions on the IMDB Movie Review dataset.

This dataset has gold labels, which are used here for evaluation purposes only. The idea behind weak supervision, and knodle in particular, is that you do not have a dataset labeled purely with strong (manual) supervision and instead label it with weak supervision sources.

First, we load the dataset from Kaggle. Then we look at frequent words within both sentiments and pick well-matching keywords, which will act as weak supervision sources. Finally, we evaluate the keywords with a basic majority vote model.

## Imports

Let's make some basic imports.

In [253]:
import pandas as pd 
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer
from bs4 import BeautifulSoup
from snorkel.labeling import LabelingFunction, PandasLFApplier, filter_unlabeled_dataframe, LFAnalysis
from snorkel.labeling.model import MajorityLabelVoter, LabelModel
import numpy as np 
from tqdm import tqdm

In [228]:
# Init
tqdm.pandas()
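# None disables column truncation (the former value -1 is no longer accepted by recent pandas versions)
pd.set_option('display.max_colwidth', None)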


In [198]:
# Constants
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1
COLUMN_WITH_TEXT = "reviews_preprocessed"

## Download the raw dataset

Now we download the dataset. This requires the Kaggle CLI to be installed and configured with your API key. Please have a look at the official [documentation](https://github.com/Kaggle/kaggle-api) for further instructions.

In [2]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
!tar -xvf imdb-dataset-of-50k-movie-reviews.zip
!rm imdb-dataset-of-50k-movie-reviews.zip

Downloading imdb-dataset-of-50k-movie-reviews.zip to /Users/sandro/repo/knodle/tutorials/ImdbDataset
 97%|████████████████████████████████████▉ | 25.0M/25.7M [00:01<00:00, 24.1MB/s]
100%|██████████████████████████████████████| 25.7M/25.7M [00:01<00:00, 23.4MB/s]
x IMDB Dataset.csv


## Preview dataset

After downloading and unpacking the dataset we can have a first look at it and work with it.

In [3]:
imdb_dataset_raw = pd.read_csv('IMDB Dataset.csv')

In [4]:
imdb_dataset_raw.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
imdb_dataset_raw.groupby('sentiment').count()


sentiment,review
negative,25000
positive,25000


In [6]:
imdb_dataset_raw.isna().sum()

review       0
sentiment    0
dtype: int64

## Preprocess dataset

Now let's take some basic preprocessing steps.

### Remove Stopwords

We begin by removing all common stop words. We use `scikit-learn`'s stop word list so that we don't have to install too many packages.

In [11]:
imdb_dataset_raw['reviews_preprocessed'] = imdb_dataset_raw['review'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in (ENGLISH_STOP_WORDS)]))
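
Note that the filter above is case-sensitive while scikit-learn's stop word list is all lowercase, which is why capitalized tokens such as "One", "A" and "I" survive in the output below. A minimal sketch of a case-insensitive variant follows. `DEMO_STOP_WORDS` and `remove_stop_words` are hypothetical names used only for illustration; in the notebook you would pass `ENGLISH_STOP_WORDS` instead of the tiny stand-in set.

```python
# Hypothetical stand-in for sklearn's ENGLISH_STOP_WORDS (which is all lowercase)
DEMO_STOP_WORDS = frozenset({"a", "i", "one", "of", "the", "this", "was"})


def remove_stop_words(text: str, stop_words=DEMO_STOP_WORDS) -> str:
    """Drop stop words, comparing each token in lowercase."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


print(remove_stop_words("One of the best films I have seen"))
```

Whether capitalized stop words should be removed is a design choice. The word counts later in the notebook are unaffected, because `CountVectorizer` lowercases the text and the stop words are filtered again there.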

In [12]:
imdb_dataset_raw.head()

Unnamed: 0,review,sentiment,reviews_wo_stopwords,reviews_preprocessed
0,One of the other reviewers has mentioned that ...,positive,One reviewers mentioned watching just 1 Oz epi...,One reviewers mentioned watching just 1 Oz epi...
1,A wonderful little production. <br /><br />The...,positive,A wonderful little production. <br /><br />The...,A wonderful little production. <br /><br />The...
2,I thought this was a wonderful way to spend ti...,positive,I thought wonderful way spend time hot summer ...,I thought wonderful way spend time hot summer ...
3,Basically there's a family where a little boy ...,negative,Basically there's family little boy (Jake) thi...,Basically there's family little boy (Jake) thi...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"Petter Mattei's ""Love Time Money"" visually stu...","Petter Mattei's ""Love Time Money"" visually stu..."


### Remove HTML Tags

The dataset contains many HTML tags. We remove them with BeautifulSoup.

In [13]:
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

imdb_dataset_raw['reviews_preprocessed'] = imdb_dataset_raw['reviews_preprocessed'].apply(strip_html)

In [15]:
imdb_dataset_raw['reviews_preprocessed'].head()

0    One reviewers mentioned watching just 1 Oz epi...
1    A wonderful little production. The filming tec...
2    I thought wonderful way spend time hot summer ...
3    Basically there's family little boy (Jake) thi...
4    Petter Mattei's "Love Time Money" visually stu...
Name: reviews_preprocessed, dtype: object

## Count Words

Now we count the words per sentiment to find well-matching keywords.
We first split the dataset into positive and negative reviews and then count the words for each sentiment.

In [19]:
positive_reviews = \
    imdb_dataset_raw.loc[imdb_dataset_raw.sentiment == 'positive', ['reviews_preprocessed', 'sentiment']]

In [20]:
negative_reviews = \
    imdb_dataset_raw.loc[imdb_dataset_raw.sentiment == 'negative', ['reviews_preprocessed', 'sentiment']]

In [59]:
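def get_word_count_list(reviews: pd.DataFrame, min_df: int = 10) -> pd.DataFrame:
    """Count word occurrences over all reviews, sorted by descending count."""
    vectorizer = CountVectorizer(min_df=min_df)
    X = vectorizer.fit_transform(reviews.reviews_preprocessed.values)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    word_list = vectorizer.get_feature_names_out()
    # Sum the sparse matrix column-wise without converting it to a dense array
    count_list = np.asarray(X.sum(axis=0)).ravel()
    word_count = pd.DataFrame({'count': count_list}, index=word_list).sort_values('count', ascending=False)
    # Remove stop words again (CountVectorizer lowercases the text, so more tokens match now)
    word_count = word_count.loc[~word_count.index.isin(ENGLISH_STOP_WORDS)]
    return word_count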

### Positive Reviews

First, let's look at the positive reviews and find good keywords that may describe them.

In [63]:
word_count_positive = get_word_count_list(positive_reviews, min_df=100)

In [64]:
word_count_positive.describe()

Unnamed: 0,count
count,3627.0
mean,557.271574
std,1355.342072
min,100.0
25%,159.0
50%,253.0
75%,507.0
max,42093.0


In [65]:
# All words with a count above the 95th percentile
q95_positive_words = word_count_positive.loc[word_count_positive['count'] > word_count_positive.quantile(0.95)['count']]

In [70]:
q95_positive_words.index.values

array(['film', 'movie', 'like', 'good', 'just', 'great', 'story', 'time',
       'really', 'people', 'love', 'best', 'life', 'way', 'films',
       'think', 'characters', 'don', 'movies', 'character', 'seen', 'man',
       'watch', 'make', 'little', 'does', 'know', 'did', 'years', 'end',
       'scene', 'real', 'scenes', 'say', 'acting', 'plot', 'world',
       'makes', 'better', 'new', 've', '10', 'young', 'work', 'old',
       'lot', 'quite', 'cast', 'funny', 'series', 'director', 'actors',
       'music', 'role', 'watching', 'look', 'bad', 'doesn', 'family',
       'performance', 'things', 'comedy', 'times', 'going', 'big', 'saw',
       'long', 'thing', 'actually', 'excellent', 'didn', 'bit', 'fun',
       'right', 'action', 'thought', 'fact', 'feel', 'want', 'come',
       'played', 'especially', 'got', 'war', 'horror', 'beautiful', 'day',
       'pretty', 'dvd', 'different', 'shows', 'gets', 'tv', 'interesting',
       'true', 'job', 'll', 'woman', 'probably', 'far', 'wonderful',

In [71]:
# Manually created list of keywords
positive_keywords = [
    'like','good','great','love', 'best', 'funny','excellent','fun','beautiful','interesting','wonderful',
    'original','perfect','classic','loved','recommend','amazing','favorite'
]

### Negative Reviews

Let's repeat the same process with the negative reviews and find some keywords that describe bad movie reviews.

In [72]:
word_count_negative = get_word_count_list(negative_reviews, min_df=100)

In [73]:
word_count_negative.describe()

Unnamed: 0,count
count,3356.0
mean,593.115018
std,1543.380762
min,100.0
25%,160.0
50%,252.0
75%,525.0
max,50090.0


In [74]:
# All words with a count above the 95th percentile
q95_negative_words = word_count_negative.loc[
    word_count_negative['count'] > word_count_negative.quantile(0.95)['count']]

In [76]:
q95_negative_words.index.values

array(['movie', 'film', 'like', 'just', 'good', 'bad', 'time', 'really',
       'don', 'story', 'people', 'make', 'movies', 'plot', 'acting',
       'way', 'characters', 'watch', 'think', 'did', 'character', 'know',
       'better', 'seen', 'films', 'little', 'say', 'scene', 'thing',
       'end', 'does', 'scenes', 've', 'didn', 'watching', 'great',
       'doesn', 'actually', 'man', 'actors', 'worst', 'director', 'life',
       'funny', 'going', 'look', 'love', 'real', 'minutes', 'old',
       'pretty', 'horror', 'want', 'best', 'script', 'guy', 'work', '10',
       'got', 'lot', 'isn', 'things', 'original', 'fact', 'thought',
       'makes', 'point', 'new', 'big', 'long', 'years', 'gets', 'far',
       'interesting', 'cast', 'making', 'right', 'action', 'come',
       'awful', 'quite', 'money', 'll', 'kind', 'poor', 'comedy',
       'boring', 'trying', 'reason', 'stupid', 'probably', 'looking',
       'looks', 'instead', 'terrible', 'away', 'maybe', 'believe', 'saw',
       'girl', '

In [77]:
negative_keywords = [
    'bad', 'worst','horror','awful','poor','boring','stupid','terrible','waste','worse','horrible'
]
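
Many of the most frequent words ('film', 'movie', 'like', 'good') appear in both lists, so raw frequency alone does not single out sentiment-bearing keywords. One possible way to narrow down the candidates, sketched here on toy data, is to rank words by the ratio of their counts in the two classes. The helper `count_ratio` and its `smoothing` constant are assumptions for illustration and not part of the original workflow.

```python
from collections import Counter


def count_ratio(pos_docs, neg_docs, smoothing=1.0):
    """Return {word: (pos_count + s) / (neg_count + s)} for every word seen."""
    pos = Counter(w for d in pos_docs for w in d.lower().split())
    neg = Counter(w for d in neg_docs for w in d.lower().split())
    vocab = set(pos) | set(neg)
    return {w: (pos[w] + smoothing) / (neg[w] + smoothing) for w in vocab}


# Toy reviews, only to show the mechanics
pos_docs = ["great movie great cast", "wonderful movie"]
neg_docs = ["awful movie", "boring movie awful plot"]

ratios = count_ratio(pos_docs, neg_docs)
# Words with a high ratio lean positive, words with a low ratio lean negative
for word, r in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:10s} {r:.2f}")
```

A word such as "movie" ends up with a ratio close to 1 and would be discarded, while words far from 1 in either direction are keyword candidates.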

## Labeling Functions

Now we start to build labeling functions with Snorkel from these keywords and check their coverage.

This is of course an iterative process, so we will surely have to add more keywords and regular expressions. ;-)

In [200]:
def keyword_lookup(x, keyword, label):
    return label if keyword in x[COLUMN_WITH_TEXT].lower() else ABSTAIN
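
Because `keyword_lookup` tests for a substring, the keyword `fun` also fires on `funny`, and `like` fires on `unlike`. If whole-word matches are preferred, a regular expression with word boundaries is one option. The following is a sketch of an alternative and not the function used in the rest of this notebook. `keyword_lookup_whole_word` is a hypothetical name, and the constants are repeated so that the block runs on its own.

```python
import re

ABSTAIN = -1
COLUMN_WITH_TEXT = "reviews_preprocessed"


def keyword_lookup_whole_word(x, keyword, label):
    """Return `label` only if `keyword` occurs as a whole word, else ABSTAIN."""
    pattern = rf"\b{re.escape(keyword)}\b"
    if re.search(pattern, x[COLUMN_WITH_TEXT], flags=re.IGNORECASE):
        return label
    return ABSTAIN


# A plain dict stands in for a DataFrame row here
row = {COLUMN_WITH_TEXT: "Unlike the book, this was funny."}
print(keyword_lookup_whole_word(row, "fun", 1))    # abstains: "funny" is not "fun"
print(keyword_lookup_whole_word(row, "funny", 1))  # matches the whole word
```

Whole-word matching lowers coverage but should reduce accidental matches. Which variant works better is an empirical question for the dataset at hand.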


In [201]:
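def make_keyword_lf(keyword: str, label: int) -> LabelingFunction:
    """
    Creates a labeling function based on a keyword.
    Args:
        keyword: Keyword to search for in the review text
        label: Label id to return if the keyword matches

    Returns:
        Labeling function named keyword_<keyword>
    """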
    return LabelingFunction(
        name=f"keyword_{keyword}",
        f=keyword_lookup,
        resources=dict(keyword=keyword, label=label),
    )

In [202]:
def create_labeling_functions(keywords: pd.DataFrame) -> np.ndarray:
    """
    Create labeling functions based on the columns keyword and label_id.

    Args:
        keywords: DataFrame with processed keywords

    Returns:
        All labeling functions. 1d array with shape: (number_of_lfs,)
    """
    keywords = keywords.assign(lf=keywords.progress_apply(
        lambda x:make_keyword_lf(x.keyword, x.label_id), axis=1
    ))
    lfs = keywords.lf.values
    return lfs

In [203]:
# Quick check with a single keyword (the keywords DataFrame is only created further below)
make_keyword_lf('like', POSITIVE)

LabelingFunction keyword_like, Preprocessors: []

In [204]:
def make_keyword_df(positive_keywords: [str], negative_keywords: [str]) -> pd.DataFrame:
    positive = pd.DataFrame(
        pd.Series({x: 'positive' for (x) in positive_keywords}, name="label")).reset_index().rename(columns={"index":"keyword"})
    
    negative = pd.DataFrame(
        pd.Series({x: 'negative' for (x) in negative_keywords}, name="label")).reset_index().rename(columns={"index":"keyword"})
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    keywords = pd.concat([positive, negative])
    
    assert len(positive) + len(negative) == len(keywords), "Shapes don't match"
    
    keywords.loc[keywords.label == 'positive', 'label_id'] = int(POSITIVE)
    keywords.loc[keywords.label == 'negative', 'label_id'] = int(NEGATIVE)
    keywords.reset_index(inplace=True, drop=True)
    return keywords


In [205]:
keywords = make_keyword_df(positive_keywords, negative_keywords)

In [206]:
keywords

Unnamed: 0,keyword,label,label_id
0,like,positive,1.0
1,good,positive,1.0
2,great,positive,1.0
3,love,positive,1.0
4,best,positive,1.0
5,funny,positive,1.0
6,excellent,positive,1.0
7,fun,positive,1.0
8,beautiful,positive,1.0
9,interesting,positive,1.0


In [207]:
labeling_functions = create_labeling_functions(keywords)

100%|██████████| 29/29 [00:00<00:00, 12752.65it/s]


In [208]:
labeling_functions

array([LabelingFunction keyword_like, Preprocessors: [],
       LabelingFunction keyword_good, Preprocessors: [],
       LabelingFunction keyword_great, Preprocessors: [],
       LabelingFunction keyword_love, Preprocessors: [],
       LabelingFunction keyword_best, Preprocessors: [],
       LabelingFunction keyword_funny, Preprocessors: [],
       LabelingFunction keyword_excellent, Preprocessors: [],
       LabelingFunction keyword_fun, Preprocessors: [],
       LabelingFunction keyword_beautiful, Preprocessors: [],
       LabelingFunction keyword_interesting, Preprocessors: [],
       LabelingFunction keyword_wonderful, Preprocessors: [],
       LabelingFunction keyword_original, Preprocessors: [],
       LabelingFunction keyword_perfect, Preprocessors: [],
       LabelingFunction keyword_classic, Preprocessors: [],
       LabelingFunction keyword_loved, Preprocessors: [],
       LabelingFunction keyword_recommend, Preprocessors: [],
       LabelingFunction keyword_amazing, Preproce

### Apply Labeling Functions

Now let's apply all labeling functions to our reviews and check some statistics.

In [209]:
applier = PandasLFApplier(lfs=labeling_functions)
applied_lfs = applier.apply(df=imdb_dataset_raw)

100%|██████████| 50000/50000 [00:08<00:00, 5649.52it/s]


In [213]:
applied_lfs

array([[-1, -1, -1, ..., -1, -1, -1],
       [-1, -1,  1, ..., -1, -1, -1],
       [-1, -1,  1, ..., -1, -1, -1],
       ...,
       [-1,  1, -1, ..., -1, -1, -1],
       [ 1, -1, -1, ..., -1, -1, -1],
       [-1,  1, -1, ..., -1, -1, -1]])

Now we have a matrix with all labeling functions applied. This matrix has the shape $(\text{instances} \times \text{labeling functions})$.

In [220]:
print("Shape of applied labeling functions: ", applied_lfs.shape)
print("Number of reviews", len(imdb_dataset_raw))
print("Number of labeling functions", len(labeling_functions))

Shape of applied labeling functions:  (50000, 29)
Number of reviews 50000
Number of labeling functions 29


### Analysis

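Now we can analyse some basic statistics of our labeling functions. The main figures are:

- Coverage: Fraction of reviews on which the labeling function assigns a label (does not abstain)
- Overlaps: Fraction of reviews on which the labeling function and at least one other labeling function assign a label (e.g. great and amazing)
- Conflicts: Fraction of reviews on which the labeling function and at least one other labeling function assign different labels (e.g. great and bad)
- Correct: Number of reviews the labeling function labels correctly (only reported if gold labels are passed)
- Incorrect: Number of reviews the labeling function labels incorrectly (only reported if gold labels are passed)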
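
To make these figures concrete, the first three can be recomputed by hand from a label matrix. The following is a sketch on a toy matrix. It assumes the usual reading of the columns as fractions of data points and is not Snorkel's implementation.

```python
import numpy as np

ABSTAIN = -1

# Toy label matrix: 4 data points (rows) x 3 labeling functions (columns)
L = np.array([
    [ 1, -1,  0],
    [ 1,  1, -1],
    [-1, -1, -1],
    [ 1, -1, -1],
])

labeled = L != ABSTAIN                         # where each LF voted
n_votes = labeled.sum(axis=1, keepdims=True)   # number of votes per data point

# Coverage: fraction of data points on which an LF votes
coverage = labeled.mean(axis=0)

# Overlaps: the LF votes and at least one other LF votes as well
overlaps = (labeled & (n_votes > 1)).mean(axis=0)

# Conflicts: the LF votes and some other LF casts a different (non-abstain) vote
conflicts = np.zeros(L.shape[1])
for j in range(L.shape[1]):
    others = np.delete(L, j, axis=1)
    disagree = ((others != ABSTAIN) & (others != L[:, [j]])).any(axis=1)
    conflicts[j] = (labeled[:, j] & disagree).mean()

print("coverage: ", coverage)
print("overlaps: ", overlaps)
print("conflicts:", conflicts)
```

By construction, conflicts can never exceed overlaps, and overlaps can never exceed coverage, which matches the ordering of the columns in the summary table.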

In [223]:
LFAnalysis(L=applied_lfs, lfs=labeling_functions).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
keyword_like,0,[1],0.51858,0.49426,0.29586
keyword_good,1,[1],0.38944,0.3767,0.22624
keyword_great,2,[1],0.27614,0.266,0.12862
keyword_love,3,[1],0.25364,0.24712,0.11644
keyword_best,4,[1],0.19712,0.191,0.09734
keyword_funny,5,[1],0.13242,0.13242,0.07612
keyword_excellent,6,[1],0.0725,0.06944,0.0278
keyword_fun,7,[1],0.23214,0.2289,0.12876
keyword_beautiful,8,[1],0.08422,0.0815,0.03512
keyword_interesting,9,[1],0.11144,0.10734,0.06326


Let's have a look at some reviews that the second labeling function (`keyword_good`) labeled as positive.

In [229]:
imdb_dataset_raw.iloc[applied_lfs[:, 1] == POSITIVE, :].sample(10, random_state=1).loc[:, ['review','sentiment']]

Unnamed: 0,review,sentiment
14708,"Some of the greatest and most loved horror movies have a wicked sense of humour, but when a film comes along that isn't as original as the ""classics"" but just goes at it for laughs then a bunch of po-faced, wanna-be critics completely slag it off. This film made me laugh aloud several times, this is testament to the way this film was approached and it shows. The two main leads look natural and believable together and this really helps this film. You root for them the whole way and laugh along with them, everyone has friends like both of these guys. Another highlight for me was the monster truck, it's awesome, intimidating and really well shot. Taking inspiration (completely stealing) from loads of films, the most obvious being Duel, Jeepers Creepers and probably in reference to the Jack Black alike co-star Orange County. But really you can pick any road trip gone wrong movie and find a reference here. But so what, it's not trying to win any Oscars just give the viewer a good dose of frights and laughs and on that score it's a 10! Obviously It's not getting a 10, I give real sensible reviews and scores unlike 99% of the people on IMDb. There is no-way this movie can get a zero like so many lazy idiots give to too many films and as fun as it was it ain't getting a 10 either. It's just a good fun movie for anyone with a sense of humour and a liking for scares. You really can't get anymore simple than that.",negative
24770,"Maximum risk is quite surprising to a person that has seen more then on of his movies. Director Ringo Lam made an average action-movie, that can be compared with most of the other mid-quality action movies, what is a special predicate to a `Muscles from Brussels`movie. It has a quite classy style, an interesting atmosphere and, last but not least, the beautyful Natasha Henstridge. Even VanDamme doesn´t make you crying by his acting, he does a relatively good job. Of course you may not compare Maximum Risk (oh, what a creative title!) to `Ronin`, but after watching `Knock off` it´s the hell of a good movie... in special standards, of course.",negative
38047,"The way this story played out and the interaction between the 2 lead characters may lead me to believe that if the X-Files continues without Mulder and Scully, these would be a pretty good replacement duo.",positive
23499,"Paul Naschy as a ghostly security guard in this is scarier than most of his fur-and-shoe-polish werewolf guises. The story is not unfamiliar, a bunch of kids going to party at an abandoned school. The thing is, that one of these kid's fathers did the same thing years ago but he's now deceased, and the latest group of kids seem to be reliving an event from 23 years ago. This is fairly well done for films of this type, and there's an air of mystery to what's going on because apparently what happened to the kids before is somewhat of a mystery and perhaps the truth wasn't revealed. So no, not just your standard slice and dice. This moves along at a fairly good clip and doesn't let you lose interest like a lot of films do, and the oddball story is compelling enough to keep you interested too, and there's some suspense which is lacking in a lot of films these days. The ending is rather abrupt and I suppose is left mostly to your imagination, but then again it doesn't out-stay its welcome either. 7 out of 10, check it out.",positive
21515,"As a big fan of gorilla movies in general, I anticipated that this one would be great - and as for the gorilla effects, They were quite good, however - that is the only thing I can write about this flop. The film claims to be based on a true story but in effect, it does not even come close to what actually happened to ""Buddy"" - who in real life, was the famous Gargantua, sold to Ringling Bros. by our supposed ""heroic"" Gertrude Lintz, known by many animal enthusiasts as a woman who hardly had her animals' welfare in the best interest. As far as Buddy being portrayed as becoming aggressive, this was total fiction and at no time did the gorilla, in real life, resort to such behavior. buddy did, in fact, escape his wooden crate (not a plush cage room as depicted in movie) during a storm, to seek shelter and comfort in the house, which frightened Gertrude Lintz into selling him. No, Buddy was not released into a gorilla family surrounded by lush trees in a zoological paradise - he was abandoned in a wooden crate, deep in the back of a garage for some time with only a single light bulb for comfort and then sold to the circus - where he actually lived a better life having peanuts thrown at him until he died (historically the oldest living gorilla on record, by the way) before a show in Miami. Notice also, in the film, how Buddy grows older but the chimpanzees never age. (The chimps, by the way, were not raised simultaneously with other animals, including Buddy, as portrayed in the film)",negative
9933,"This is an absolutely true and faithful adaptation of 'The Hollow'. It could be argued that the actual mystery here is not one Christie's best but what makes 'The Hollow' special is the characterisation and I found the actors here, more or less without exception, were perfect in their parts. In such a uniformly good cast it's difficult to select stand out performances but I have to say that Sarah Miles is just perfect as Lucy Angkatell. What is extraordinary is that she not only conveys Lucy's dottiness, tactlessness and her more lovable qualities BUT she also manages to pull off the underlying truth that in fact, Lucy is not really all that nice! Megan Dodds is also very good as Henrietta and Claire Price very affecting as Gerda. John Christow is really quite an unlikeable character but Jonathan Cake nevertheless manages to make us see what his women see in him.<br /><br />As I said, the script follows the story quite faithfully. The only disconcerting thing I found was that Midge and Edward's relationship really comes out of nowhere and I do believe that some of it must have ended up on the cutting room floor! Theirs a secondary story however and the primary story is very well done. The whole thing looks beautiful as well, really capturing a perfect English autumn.<br /><br />Its a beautiful film in every respect and well worth seeing.",positive
45308,"This could have been a really good movie if someone would just have known how to finish the film.<br /><br />The story was going along just fine and heading towards that point in every movie like this where the ""gray"" characters turn ""good"" and the ""bad"" guys get their just desserts and *boom* ... it's like they ran out of script and the cast just started to make things up.<br /><br />Which wouldn't have been so bad ... if the cast had just continued with the character development they had already put in place. But such is not the case and the movie soon becomes a goofy mess.<br /><br />My advice is to watch this movie up to about the last 30 minutes ... and then shut it off. At this point, imagine how you think the next 30 minutes will look based on what you have seen so far.<br /><br />Believe me, the ending you come up with will look far better than how this film actually ends. Trust me on this.",negative
49387,"Murder in Mesopotamia, I have always considered one of the better Poirot books, as it is very creepy and has an ingenious ending. There is no doubt that the TV adaptation is visually striking, with some lovely photography and a very haunting music score. As always David Suchet is impeccable as Hercule Poirot, the comedic highlight of the episode being Poirot's battle with a mosquito in the middle of the night, and Hugh Fraser is good as the rather naive Captain Hastings. The remainder of the cast turn in decent performances, but are careful not to overshadow the two leads, a danger in some Christie adaptations. Some of the episode was quite creepy, a juxtaposition of an episode as tragic as Five Little Pigs, an episode that I enjoyed a lot more than this one. What made it creepy in particular, putting aside the music was when Louise Leidner sees the ghostly face through the window. About the adaptation, it was fairly faithful to the book, but I will say that there were three things I didn't like. The main problem was the pacing, it is rather slow, and there are some scenes where very little happens. I didn't like the fact also that they made Joseph Mercado a murderer. In the book, I see him as a rather nervous character, but the intervention of the idea of making him a murderer, and under-developing that, made him a less appealing character, though I am glad they didn't miss his drug addiction. (I also noticed that the writers left out the fact that Mrs Mercado in the book falls into hysteria when she believes she is the murderer's next victim.) The other thing that wasn't so impressive was that I felt that it may have been more effective if the adaptation had been in the viewpoint of Amy Leatheran, like it was in the book, Amy somehow seemed less sensitive in the adaptation. On the whole, despite some misjudgements on the writers' behalf, I liked Murder in Mesopotamia. 7/10 Bethany Cox.",positive
38691,"I've seen this film on Sky Cinema not too long ago.. I must admit, it was a really good Western which features 2 of the big names.. On one side, there's Charlton Heston, playing the infamous and retired lawman Samuel Burgade. On the other.. The late James Coburn playing the villainous Zach Provo.. seeking revenge on Burgade no matter what the cost..!<br /><br />The good thing about this film was there was some really good characters.. Most of the actors played it out really well.. Especially James Coburn, who I find that he was really mean in this film.. But that how it was..<br /><br />Christopher Mitchum, who I've seen everywhere in other films.. Playing Hal Brickman.. I felt his character was left out in the cold, but he manage to get himself back in by teaming up with Burgade, to bring down Provo's posse's!<br /><br />All in all, it was a great film.. Very good to watch.. Great score from the late Jerry Goldsmith..<br /><br />Wonderful piece of Western persona..! 8 out of 10.",positive
10216,"Finding this piece sandwiched between a stale prequel and a rehashed 80s machomovie on a UPN affiliate's midday Saturday program would be misleading. It deserves better and definitely uses its talented leads' best attributes to its maximum advantage. Bracco and Walken team to provide a movie that while perhaps predictable to those familiar with their genre, do the streetwise, 'troubled minds' routine that they are so good at portraying. For a chance to ride a psychological roller coaster a la Fuqua's ""Training Day,"" dive back into the world of early '90s TV movies to find ""Scam""!",positive


### Majority Vote

Now we make a majority vote based on all rule matches.

In [231]:
majority_model = MajorityLabelVoter(cardinality=2)
predictions = majority_model.predict(applied_lfs)
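
For intuition, a majority vote over a label matrix can be sketched in a few lines of NumPy: count the votes per class in each row, take the class with the most votes, and abstain when no function voted or the top classes tie. This toy function only mirrors that idea and is not Snorkel's code; `majority_vote` is a hypothetical name.

```python
import numpy as np

ABSTAIN = -1


def majority_vote(L: np.ndarray, cardinality: int = 2) -> np.ndarray:
    """Row-wise majority vote; returns ABSTAIN on ties and on rows without votes."""
    # counts[i, c] = number of labeling functions voting for class c on row i
    counts = np.stack([(L == c).sum(axis=1) for c in range(cardinality)], axis=1)
    best = counts.max(axis=1)
    is_tie = (counts == best[:, None]).sum(axis=1) > 1
    return np.where(is_tie, ABSTAIN, counts.argmax(axis=1))


L = np.array([
    [ 1,  1,  0],   # two votes for class 1, one for class 0
    [ 0, -1, -1],   # a single vote for class 0
    [ 1,  0, -1],   # tie
    [-1, -1, -1],   # no votes at all
])
print(majority_vote(L))
```

With 18 positive and 11 negative keywords, a plain vote count tends to favour the positive class, which is worth keeping in mind when reading the prediction counts.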

In [236]:
# Positive predictions are strongly overrepresented; -1 marks abstains
np.unique(predictions, return_counts=True)

(array([-1,  0,  1]), array([ 6817,  6141, 37042]))

In [242]:
predictions.shape

(50000,)

In [247]:
imdb_dataset_raw['label_id'] = imdb_dataset_raw.sentiment.map({'positive':POSITIVE, 'negative':NEGATIVE})

In [248]:
imdb_dataset_raw['prediction'] = predictions

In [252]:
# Abstained reviews (prediction == -1) never match a gold label, so they count as errors here
acc = np.sum(imdb_dataset_raw.label_id == imdb_dataset_raw.prediction) / len(imdb_dataset_raw)
print("Accuracy of Majority Vote model is: ", acc)

Accuracy of Majority Vote model is:  0.57028
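
This accuracy treats every abstained review (prediction `-1`) as an error. It can be informative to also report the coverage and the accuracy on the covered reviews only. A sketch on toy arrays follows; the helper name `coverage_and_accuracy` is an assumption.

```python
import numpy as np

ABSTAIN = -1


def coverage_and_accuracy(y_true: np.ndarray, y_pred: np.ndarray):
    """Return (coverage, accuracy over all rows, accuracy over non-abstained rows)."""
    covered = y_pred != ABSTAIN
    overall = float((y_true == y_pred).mean())
    if covered.any():
        on_covered = float((y_true[covered] == y_pred[covered]).mean())
    else:
        on_covered = float("nan")
    return float(covered.mean()), overall, on_covered


# Toy gold labels and predictions, two of which are abstains
y_true = np.array([1, 0, 1, 0, 1])
y_pred = np.array([1, 1, -1, 0, -1])
print(coverage_and_accuracy(y_true, y_pred))
```

In this notebook the call would be `coverage_and_accuracy(imdb_dataset_raw.label_id.values, predictions)`.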


# Finish

Now we have created a weakly supervised dataset. Of course it is not perfect, but it gives us something against which we can compare the performance of different denoising methods. :-)