# Stats for Human-Provided Rationales

In this notebook, we analyze how many times each of the human-provided rationales appears in the documents.

In [1]:
from glob import glob
import csv
import operator
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import re

In [2]:
def load_imdb_data(path_to_imdb):
    print("Loading the imdb reviews data")
    train_neg_files = glob(path_to_imdb + r"/train/neg/*.txt")
    train_pos_files = glob(path_to_imdb + r"/train/pos/*.txt")
    train_corpus = []
    y_train = []
    for tnf in train_neg_files:
        with open(tnf, 'r', errors='replace') as f:
            line = f.read()
            train_corpus.append(line)
            y_train.append(0)

    for tpf in train_pos_files:
        with open(tpf, 'r', errors='replace') as f:
            line = f.read()
            train_corpus.append(line)
            y_train.append(1)

    test_neg_files = glob(path_to_imdb + r"/test/neg/*.txt")
    test_pos_files = glob(path_to_imdb + r"/test/pos/*.txt")

    test_corpus = []

    y_test = []

    for tnf in test_neg_files:
        with open(tnf, 'r', errors='replace') as f:
            test_corpus.append(f.read())
            y_test.append(0)

    for tpf in test_pos_files:
        with open(tpf, 'r', errors='replace') as f:
            test_corpus.append(f.read())
            y_test.append(1)

    print("Data loaded.")
    return train_corpus, y_train, test_corpus, y_test

In [3]:
path_to_imdb = "C:/Users/mbilgic/Desktop/aclImdb"

In [4]:
train_corpus, y_train, test_corpus, y_test = load_imdb_data(path_to_imdb)

Loading the imdb reviews data
Data loaded.


In [5]:
train_corpus = np.array(train_corpus)
test_corpus = np.array(test_corpus)
y_train = np.array(y_train)
y_test = np.array(y_test)

In [6]:
def parse_human_phrases(phrase_file, to_lower=False):
    with open(phrase_file) as csvfile:
        reader=csv.reader(csvfile, delimiter='\t')
        phrases={} # key:phrase value:number of times appeared in the collection
        for row in reader:            
            for phrase in row:
                if phrase != "":  # compare by value; `is not` tests object identity
                    if to_lower:
                        phrase = phrase.lower()
                    if phrase not in phrases:
                        phrases[phrase] = 1
                    else:
                        phrases[phrase] += 1
    return phrases

First, we analyze how many times each rationale appears in the human-provided list. We can do this with or without lowercasing the rationales. Below are the counts after lowercasing the human-provided rationales.

In [7]:
neg_phrases_dict = parse_human_phrases('Negative.tsv', to_lower=True)
sorted(neg_phrases_dict.items(), key=operator.itemgetter(1), reverse=True)

[('acting was terrible', 3),
 ("don't waste your time", 2),
 ('acting sucks', 2),
 ('2 out of 10', 2),
 ('avoid this movie', 2),
 ('to many scenes out of context', 1),
 ('will be left a little lost', 1),
 ('they cuss in like every sentence', 1),
 ('did not like the idea of the female turtle', 1),
 ('cannibal scenes are poor', 1),
 ('violent mob by the crazy chantings', 1),
 ('is shockingly violent', 1),
 ('aggravating, boring and almost completely devoid of any redeeming virtue',
  1),
 ('happens pretty sluggishly', 1),
 ('flashy, camera shaking thing gave me a headache', 1),
 ('what does it mean for our society that the standard is so terribly low', 1),
 ('a gratuitous rape scene', 1),
 ('nothing worthy happens during the entire movie', 1),
 ('wishing i had a gun to blow any memory of this film out of my head', 1),
 ('this crosses the line of "bad"', 1),
 ("film's title is misleading", 1),
 ('non-sense style ', 1),
 ('void this title at all costs', 1),
 ('dragonlord dude was corny', 1

As can be seen in the above analysis, many of the rationales are unique.

In [8]:
pos_phrases_dict = parse_human_phrases('Positive.tsv', to_lower=True)
sorted(pos_phrases_dict.items(), key=operator.itemgetter(1), reverse=True)

[('highly recommended', 3),
 ('great movie', 2),
 ('very enjoyable movie', 2),
 ('absence of violence and porno sex was refreshing', 2),
 ('it was great', 2),
 ('one of the best', 2),
 ('i recommend this movie', 2),
 ('i highly recommend', 2),
 ('sends a powerful message', 2),
 ('i loved it', 2),
 ('good one', 2),
 ('characters were very convincing and felt like you could understand their feelings',
  2),
 ('i recommend', 2),
 ('story had a unique and interesting arrangement', 2),
 ('one of the best movies', 2),
 ('watch it', 2),
 ('excellent movie', 2),
 ('acting was very good', 2),
 ('liked stanley & iris very much', 2),
 ('great combination', 1),
 ('periodical setting point of view', 1),
 ('score courtesy of bruno nicolai is catchy', 1),
 ('have recommended this movie', 1),
 ('a haunting musical score makes for a most enjoyable and rewarding experience',
  1),
 ('my daughter was entranced', 1),
 ('spectacular magic tricks', 1),
 ('lot of fun', 1),
 ('if you can enjoy a story of real

The same holds for the positive phrases. Next, we count how many positive and negative documents in the training corpus contain each of these phrases (with `binary=True`, a document counts at most once per phrase).
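As a toy illustration of what is being counted, here is a minimal pure-Python sketch of per-class document frequency for multi-word phrases. The helper names (`ngrams`, `phrase_document_counts`) and the toy documents are made up for this example; the notebook itself relies on `CountVectorizer`, which this sketch only approximates.

```python
import re

def ngrams(tokens, n):
    """Return the set of contiguous n-token phrases in a token list."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def phrase_document_counts(phrases, docs, labels, token_pattern=r"(?u)\b\w\w+\b"):
    """Count, per class label, the documents containing each phrase at least once."""
    tokenizer = re.compile(token_pattern)
    counts = {p: {0: 0, 1: 0} for p in phrases}
    for doc, label in zip(docs, labels):
        tokens = tokenizer.findall(doc.lower())
        for p in phrases:
            n = len(p.split())
            if p in ngrams(tokens, n):
                counts[p][label] += 1  # one per document, mirroring binary=True
    return counts

# Hypothetical toy data: label 1 = positive, 0 = negative
toy_docs = ["Great movie. Really a great movie!", "Not a great movie", "Bad movie"]
toy_labels = [1, 0, 0]
counts = phrase_document_counts(["great movie", "bad movie"], toy_docs, toy_labels)
for phrase, by_label in counts.items():
    print("%s\t%d\t%d" % (phrase, by_label[0], by_label[1]))
```

Note that the first toy document mentions "great movie" twice but contributes only one to the positive count, which is the behavior to keep in mind when reading the tables that follow.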

In [9]:
def print_phrase_stats(human_vocab):
    vectorizer = CountVectorizer(vocabulary=human_vocab, lowercase=True, ngram_range=(2,20), binary=True)
    train_phrases_X = vectorizer.fit_transform(train_corpus)
    vocab = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
    all_counts = np.sum(train_phrases_X, axis=0)
    all_counts_array = all_counts.A1
    all_counts_sorted_indices = np.argsort(all_counts_array)[::-1]
    in_neg_counts = np.sum(train_phrases_X[y_train==0], axis=0)
    in_neg_counts = in_neg_counts.A1
    in_pos_counts = np.sum(train_phrases_X[y_train==1], axis=0)
    in_pos_counts = in_pos_counts.A1
    print("The Phrase\tneg_count\tpos_count\ttotal_count")
    for i in all_counts_sorted_indices:
        print("%s\t%d\t%d\t%d" %(vocab[i], in_neg_counts[i], in_pos_counts[i], all_counts_array[i]))

In [10]:
print_phrase_stats(list(neg_phrases_dict.keys()))

The Phrase	neg_count	pos_count	total_count
the worst	1660	164	1824
bad movie	319	47	366
worst movie	333	11	344
too long	184	115	299
very disappointed	80	13	93
terrible movie	67	0	67
save your money	58	1	59
skip this one	56	1	57
should be ashamed	53	3	56
avoid this movie	53	3	56
made no sense	46	4	50
the worst movie ever made	36	0	36
there is no plot	34	1	35
just horrible	32	3	35
acting was terrible	30	2	32
save your time	28	2	30
not in good way	29	0	29
movie was terrible	26	1	27
very obvious	17	9	26
movie sucked	24	2	26
weak plot	23	2	25
new low	22	3	25
rips off	19	5	24
movie was bad	20	3	23
bit boring	11	11	22
acting is wooden	18	2	20
worst one	17	3	20
not fun	17	3	20
my least favorite	11	9	20
plot is predictable	14	5	19
over dramatic	13	5	18
miss out	5	13	18
only good thing about this movie	16	1	17
pains me	15	1	16
its so bad	14	0	14
totally unnecessary	9	4	13
acting sucks	12	0	12
movie was awful	12	0	12
worse still	8	4	12
film is terrible	10	1	11
holy crap	7	4	11
this movie is awful

In [11]:
print_phrase_stats(list(pos_phrases_dict.keys()))

The Phrase	neg_count	pos_count	total_count
the best	911	2011	2922
watch it	527	681	1208
watch this	683	467	1150
very well	177	576	753
one of the best	86	614	700
my favorite	147	535	682
was good	250	242	492
was great	133	338	471
watch this movie	249	170	419
great movie	82	289	371
great film	55	241	296
highly recommend	29	226	255
enjoyed it	52	164	216
highly recommended	8	205	213
fun to watch	92	116	208
big fan	105	83	188
one of the greatest	31	129	160
really enjoyed	36	118	154
even better	39	106	145
very entertaining	20	116	136
her best	49	87	136
good one	65	71	136
was excellent	26	107	133
comic relief	58	65	123
lot of fun	26	92	118
it was great	23	91	114
see it again	20	84	104
great fun	11	77	88
great performances	9	77	86
best work	18	60	78
one of the best movies	4	71	75
great to see	9	66	75
excellent movie	2	64	66
oscar worthy	15	30	45
nice touch	18	19	37
another good	11	25	36
is also good	5	30	35
quite interesting	16	17	33
fantastic job	7	25	32
blew me away	2	29	31
real treat	1	29	30

As can be seen above, even though some phrases appear frequently, many of the phrases are quite infrequent. One thing, however, does not make sense: these rationales were copied from the training instances, and therefore, no phrase should have a zero count.

A closer look, however, reveals that the phrases with zero counts appear to have the following in common:
1. They are long (remember that the n-gram limit is set to 20).
2. They have punctuation in them. The default tokenizer in `CountVectorizer` removes punctuation.
3. They have special characters (e.g., / in '3/10'). The default tokenizer removes special characters.
4. They have single-character words (e.g., 'a'). The default tokenizer removes single-character words.

Let's change the default tokenizer to recognize single-character words and special characters. Moreover, let's preprocess the human-provided rationales through this new tokenizer.
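To make the difference between the candidate token patterns concrete, the following sketch runs each of them with Python's `re` module on a made-up sample phrase. The sample text and the dictionary keys are illustrative assumptions; only the three regular expressions come from this notebook.

```python
import re

# The three patterns considered in this notebook
patterns = {
    "default": r"(?u)\b\w\w+\b",       # CountVectorizer's default
    "single_chars": r"(?u)\b\w+\b",    # also keeps one-character words
    "non_whitespace": r"(?u)\b\S+\b",  # also keeps inner apostrophes and slashes
}

sample = "don't waste a 3/10 movie, really"  # made-up example phrase

tokens = {name: re.findall(pattern, sample) for name, pattern in patterns.items()}
for name, toks in tokens.items():
    print(name, toks)
```

With the default pattern, `don't` loses its `t`, `a` disappears, and `3/10` is reduced to `10`, so a rationale containing any of these can never match as an n-gram. The last pattern keeps all three intact and only drops the trailing comma.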

In [12]:
#token_pattern = r"(?u)\b\w\w+\b" # the default token pattern; does not match single character words, like 'a'
#token_pattern = r"(?u)\b\w+\b" # matches single character words, but removes apostrophes and slashes and so on
token_pattern = r"(?u)\b\S+\b" # keeps inner apostrophes, slashes, etc.; removes only leading/trailing punctuation

In [13]:
tokenizer = re.compile(token_pattern)

In [14]:
neg_phrases_list = list(neg_phrases_dict.keys())
neg_list=[]
for neg_phrase in neg_phrases_list:
    neg_list.append(" ".join(tokenizer.findall(neg_phrase)))

In [15]:
for i in range(len(neg_list)):
    print("%s\n%s\n\n" %(neg_phrases_list[i], neg_list[i]))

to many scenes out of context
to many scenes out of context


will be left a little lost
will be left a little lost


they cuss in like every sentence
they cuss in like every sentence


did not like the idea of the female turtle
did not like the idea of the female turtle


cannibal scenes are poor
cannibal scenes are poor


violent mob by the crazy chantings
violent mob by the crazy chantings


is shockingly violent
is shockingly violent


aggravating, boring and almost completely devoid of any redeeming virtue
aggravating boring and almost completely devoid of any redeeming virtue


happens pretty sluggishly
happens pretty sluggishly


flashy, camera shaking thing gave me a headache
flashy camera shaking thing gave me a headache


what does it mean for our society that the standard is so terribly low
what does it mean for our society that the standard is so terribly low


a gratuitous rape scene
a gratuitous rape scene


nothing worthy happens during the entire movie
nothing worthy ha

In [16]:
pos_phrases_list = list(pos_phrases_dict.keys())
pos_list = []
for pp in pos_phrases_list:
    pos_list.append(" ".join(tokenizer.findall(pp)))

In [17]:
def print_phrase_stats(human_vocab, token_pattern=r"(?u)\b\w\w+\b"):
    vectorizer = CountVectorizer(vocabulary=human_vocab, lowercase=True, ngram_range=(2,20), binary=True, token_pattern=token_pattern)
    train_phrases_X = vectorizer.fit_transform(train_corpus)
    vocab = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
    all_counts = np.sum(train_phrases_X, axis=0)
    all_counts_array = all_counts.A1
    all_counts_sorted_indices = np.argsort(all_counts_array)[::-1]
    in_neg_counts = np.sum(train_phrases_X[y_train==0], axis=0)
    in_neg_counts = in_neg_counts.A1
    in_pos_counts = np.sum(train_phrases_X[y_train==1], axis=0)
    in_pos_counts = in_pos_counts.A1
    print("The Phrase\tneg_count\tpos_count\ttotal_count")
    for i in all_counts_sorted_indices:
        print("%s\t%d\t%d\t%d" %(vocab[i], in_neg_counts[i], in_pos_counts[i], all_counts_array[i]))

In [18]:
neg_set = set(neg_list) # can't have duplicate entries

In [19]:
print_phrase_stats(list(neg_set), token_pattern=token_pattern)

The Phrase	neg_count	pos_count	total_count
the worst	1641	160	1801
worst movie	332	11	343
bad movie	292	40	332
too long	172	108	280
don't waste your time	171	6	177
don't bother	95	15	110
very disappointed	75	13	88
1 out of 10	80	1	81
2 out of 10	64	0	64
terrible movie	64	0	64
don't watch it	44	16	60
save your money	56	1	57
should be ashamed	51	3	54
avoid this movie	50	3	53
skip this one	51	1	52
made no sense	42	4	46
it's nothing	28	16	44
there is no plot	34	1	35
acting was terrible	30	2	32
the worst movie ever made	31	0	31
save your time	27	2	29
which is a shame	14	13	27
very obvious	17	9	26
just horrible	23	3	26
weak plot	23	2	25
movie was terrible	22	1	23
new low	20	3	23
rips off	17	5	22
movie sucked	20	2	22
bit boring	10	10	20
movie was bad	17	2	19
worst one	17	2	19
plot is predictable	14	4	18
acting is wooden	16	2	18
my least favorite	10	8	18
miss out	4	13	17
pains me	15	1	16
only good thing about this movie	15	1	16
not fun	14	1	15
is a terrible film	14	0	14
its so bad	14	0	14
acti

In [20]:
pos_set = set(pos_list) # can't have duplicate entries

In [21]:
print_phrase_stats(list(pos_set), token_pattern=token_pattern)

The Phrase	neg_count	pos_count	total_count
the best	897	1983	2880
watch it	505	647	1152
watch this	678	460	1138
very well	167	523	690
one of the best	85	601	686
my favorite	146	530	676
i loved	134	445	579
see this movie	220	187	407
watch this movie	237	166	403
i recommend	102	268	370
great movie	74	274	348
was good	179	150	329
was great	78	205	283
great film	53	224	277
highly recommend	29	224	253
enjoyed it	50	154	204
highly recommended	8	191	199
fun to watch	87	110	197
a masterpiece	47	141	188
big fan	105	81	186
i highly recommend	21	161	182
one of the greatest	31	127	158
really enjoyed	36	118	154
even better	37	98	135
her best	49	85	134
very entertaining	20	112	132
was excellent	26	104	130
good one	58	64	122
comic relief	53	63	116
i loved it	10	99	109
lot of fun	24	85	109
see it again	19	77	96
did a great job	17	72	89
great fun	10	73	83
great performances	9	73	82
one of the best movies	4	71	75
best work	17	58	75
great to see	9	66	75
excellent movie	2	62	64
i'd recommend	29	34	63
9 ou

These results make a bit more sense, but there are still phrases that have zero counts. Again, some are pretty long (possibly longer than 20 words), but even some short ones have zero counts. We need to investigate why this is happening.
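One possible starting point for that investigation is a small diagnostic that, for a single phrase, reports its token count, whether that count falls inside the n-gram range, and how token-level matching compares with a raw substring search. This is only a sketch: `diagnose_phrase` is a hypothetical helper, the toy documents are invented, and the two causes it surfaces (a rationale starting mid-token, or a one-word rationale falling below the minimum n-gram length of 2) are hypotheses rather than confirmed explanations.

```python
import re

def diagnose_phrase(phrase, docs, token_pattern=r"(?u)\b\S+\b", min_n=2, max_n=20):
    """Report why a phrase might get a zero count under n-gram matching."""
    tokenizer = re.compile(token_pattern)
    phrase_tokens = tokenizer.findall(phrase.lower())
    n = len(phrase_tokens)
    token_hits = 0  # documents where the phrase tokens appear contiguously
    raw_hits = 0    # documents where the raw phrase is a plain substring
    for doc in docs:
        lowered = doc.lower()
        doc_tokens = tokenizer.findall(lowered)
        if n > 0 and any(doc_tokens[i:i + n] == phrase_tokens
                         for i in range(len(doc_tokens) - n + 1)):
            token_hits += 1
        if phrase.lower() in lowered:
            raw_hits += 1
    return {
        "n_tokens": n,
        "within_ngram_range": min_n <= n <= max_n,
        "token_hits": token_hits,
        "raw_hits": raw_hits,
    }

# Hypothetical cases: a rationale that starts mid-token, and a one-word rationale
mid_token = diagnose_phrase("waste of time", ["A total-waste of time."])
one_word = diagnose_phrase("boring", ["So boring."])
print(mid_token)
print(one_word)
```

A phrase with `raw_hits > 0` but `token_hits == 0` points at a tokenization mismatch, while `within_ngram_range == False` means the vectorizer never generates an n-gram of that length in the first place.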

Next, let's analyze how many negative (positive) documents contain zero negative (positive) phrases.

In [25]:
def print_document_stats(human_vocab, corpus, token_pattern):
    vectorizer = CountVectorizer(vocabulary=human_vocab, lowercase=True, ngram_range=(2,20), binary=True, token_pattern=token_pattern)
    X=vectorizer.fit_transform(corpus)
    counts = np.sum(X, axis=1)
    ca = counts.A1
    print("Has zero phrases:\t%d" %(np.sum(ca==0)))
    print("Has one phrase:\t%d" %(np.sum(ca==1)))
    print("Has two or more phrases:\t%d" %(np.sum(ca>=2)))

In [26]:
print_document_stats(list(neg_set), train_corpus[y_train==0], token_pattern)

Has zero phrases:	9327
Has one phrase:	2402
Has two or more phrases:	771


Given that there are only 12500 negative documents, having 9327 of them (roughly 75%) contain zero human-provided rationales is quite concerning. We need to work on possible solutions.

In [27]:
print_document_stats(list(pos_set), train_corpus[y_train==1], token_pattern)

Has zero phrases:	6233
Has one phrase:	3497
Has two or more phrases:	2770


Given that there are only 12500 positive documents, having 6233 of them (roughly 50%) contain zero human-provided rationales is concerning. We need to work on possible solutions.
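To put the two breakdowns on the same scale, the short sketch below turns the counts printed above into shares of each class. The `coverage_summary` helper is a name introduced here for illustration; the numbers are copied from the outputs of cells 26 and 27.

```python
def coverage_summary(zero, one, two_or_more):
    """Convert document counts into shares of the class total."""
    total = zero + one + two_or_more
    return {
        "total": total,
        "zero_share": zero / total,
        "covered_share": (one + two_or_more) / total,
    }

# Counts taken from the outputs above
neg = coverage_summary(zero=9327, one=2402, two_or_more=771)
pos = coverage_summary(zero=6233, one=3497, two_or_more=2770)

for name, summary in (("negative", neg), ("positive", pos)):
    print("%s\ttotal=%d\tzero=%.1f%%\tcovered=%.1f%%" % (
        name, summary["total"],
        100 * summary["zero_share"], 100 * summary["covered_share"]))
```

Both classes sum to 12500 documents, and the share of documents with at least one rationale is about twice as high for the positive class as for the negative class.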