### Mike Ogrysko
### CS 766 Information Retrieval and Natural Language Processing

Parsing the IMDB movie reviews for sentiment
- IMDB movie review data
- Top 20 most frequent words in reviews grouped by sentiment
- Top 20 most frequent bigrams in reviews grouped by sentiment
- Top 20 most frequent bigrams in which both words are POS-tagged as nouns ('NN'), grouped by sentiment
- 4-grams with counts of 2 or more in reviews grouped by sentiment
- Probabilities of words that come after "worst film ever" and "best movie ever"

In [1]:
from collections import defaultdict
import csv
from string import punctuation
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
import nltk
import numpy as np
import operator
import re


In [3]:
Reviews, Sentiments = [], []

#read the CSV: column 0 is the review text, column 1 is the sentiment label (0 or 1)
with open('movie_data.csv','r', encoding='utf8') as fin:
    reader = csv.reader(fin, delimiter=',', quotechar='"')
    header = next(reader)  #skip the header row
    for line in reader:
        Reviews.append(line[0])
        Sentiments.append(int(line[1]))

N=len(Reviews)
M=len(Sentiments)
print('Total reviews loaded', N)
print('Total sentiments loaded', M)

Total reviews loaded 50000
Total sentiments loaded 50000


**Top 20 most frequent words in reviews grouped by sentiment**

In [4]:
#stop words plus punctuation; also drop 'br' (presumably left over from <br /> tags)
#and the capitalized 'The'/'This', since tokens are not lowercased
stop_words = stopwords.words('english') + list(punctuation)
stop_words_set = set(stop_words) | set(['br', 'The', 'This'])

#tokenizer: split, filter, then lemmatize
def tokenize(text):
    terms = word_tokenize(text)
    #filter stop words and pure-digit tokens
    terms = [w for w in terms if w not in stop_words_set and not w.isdigit()]
    #drop tokens that start with a non-word character, contain an internal apostrophe (contractions),
    #or end in an apostrophe suffix such as 's
    terms = [w for w in terms if not re.search(r'^\W+|\w\'\w+|\'\w+$', w)]
    #drop tokens that contain no lowercase letter (all-caps words, numbers, symbols)
    terms = [w for w in terms if not re.search(r'^[^a-z]+$', w)]
    #drop tokens whose leading run of word characters is one or two long, and standalone numbers
    terms = [w for w in terms if not re.search(r'^\b\w{1,2}\b|(?<!\S)\d+(?!\S)$', w)]
    #lemmatize every token as a noun
    lemmatizer = WordNetLemmatizer()
    #a POS-aware variant using a get_wordnet_pos() helper (not defined in this notebook)
    #was dropped because of memory issues
    #terms = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in terms]
    terms = [lemmatizer.lemmatize(w, 'n') for w in terms]
    return terms
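
The commented-out line in `tokenize` calls a `get_wordnet_pos()` helper that does not appear in this notebook. As a rough sketch only, such a helper might map a Penn Treebank tag to the single-letter POS codes that WordNet uses. The name, the signature (taking a tag rather than a word) and the toy input are assumptions for illustration, not the original implementation.

```python
def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    # WordNet POS codes: 'a' adjective, 'v' verb, 'n' noun, 'r' adverb.
    prefix_to_wordnet = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
    # Fall back to noun, the default POS of WordNetLemmatizer.lemmatize.
    return prefix_to_wordnet.get(treebank_tag[:1], 'n')

# Toy (token, tag) pairs standing in for the output of nltk.pos_tag.
tagged = [('movies', 'NNS'), ('watching', 'VBG'), ('better', 'JJR'), ('really', 'RB')]
wordnet_tags = [(word, get_wordnet_pos(tag)) for word, tag in tagged]
print(wordnet_tags)
```

With a helper of this shape, a review could be tagged once as a whole token list and each `(word, tag)` pair passed to the lemmatizer, which may be cheaper than tagging word by word.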



In [5]:
#split the reviews by sentiment label: Bag0 = sentiment 0, Bag1 = sentiment 1
Bag0 = [review for review, sentiment in zip(Reviews, Sentiments) if sentiment == 0]
Bag1 = [review for review, sentiment in zip(Reviews, Sentiments) if sentiment == 1]

In [6]:
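#count, for each term, the number of reviews that contain it (each review counts a term once)
#_reviews = list of review strings, _pos = 1 to count (term, POS tag) pairs, 0 to count plain terms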
def tokenize_dict(_reviews, _pos):
    my_dict = defaultdict(int)
    for review in _reviews:
        if _pos == 1:
            terms = set(nltk.pos_tag(tokenize(review)))
        else:
            terms = set(tokenize(review))
        for term in terms:
            my_dict[term] +=1
    return my_dict
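
Because `tokenize_dict` wraps each review's tokens in `set(...)`, the counts it returns are document frequencies (how many reviews contain a term), not total occurrences. The sketch below contrasts the two on toy, already-tokenized data using `collections.Counter`. The variable names and the toy reviews are illustrative assumptions.

```python
from collections import Counter

# Toy pre-tokenized reviews standing in for tokenize(review) output.
toy_reviews = [
    ['movie', 'bad', 'movie', 'plot'],
    ['movie', 'great', 'plot'],
    ['bad', 'bad', 'bad'],
]

doc_freq = Counter()   # number of reviews that contain the term
term_freq = Counter()  # total occurrences across all reviews
for tokens in toy_reviews:
    doc_freq.update(set(tokens))  # each review contributes at most 1 per term
    term_freq.update(tokens)      # every occurrence is counted

print(doc_freq['bad'], term_freq['bad'])
print(term_freq.most_common(2))
```

`Counter.most_common(20)` would also serve as a shorter alternative to sorting the dictionary items and slicing the first 20.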

In [7]:
#create count dictionaries for the bags - no POS
vocab_counts_bag0 = tokenize_dict(Bag0, 0)
vocab_counts_bag1 = tokenize_dict(Bag1, 0)

In [8]:
#sort the dictionaries and store the top 20
sort_vocab_counts_bag0 = dict(sorted(vocab_counts_bag0.items(), key=lambda kv:kv[1],reverse=True)[:20])
sort_vocab_counts_bag1 = dict(sorted(vocab_counts_bag1.items(), key=lambda kv:kv[1],reverse=True)[:20])


In [9]:
#print top 20 bag 0
print("Sentiment 0 - 20 most frequent")
for i in sort_vocab_counts_bag0:
    print(f"{sort_vocab_counts_bag0[i]} {i}")

Sentiment 0 - 20 most frequent
17151 movie
14096 film
13424 one
12300 like
9540 time
9461 would
9244 good
8975 even
8693 make
8566 get
8355 bad
8060 character
7908 could
7732 really
7700 see
7166 much
6784 scene
6704 story
6645 thing
6437 made


In [10]:
#print top 20 bag 1
print("Sentiment 1 - 20 most frequent")
for i in sort_vocab_counts_bag1:
    print(f"{sort_vocab_counts_bag1[i]} {i}")

Sentiment 1 - 20 most frequent
14649 movie
14533 film
13513 one
10283 like
9636 time
8872 good
8454 see
8278 story
8128 character
7962 great
7737 make
7490 get
7449 would
7197 well
6929 really
6693 also
6453 much
6208 way
6145 even
6121 scene


**Top 20 most frequent bigrams in reviews grouped by sentiment**

In [11]:
def grams_dict(_text, _n):
    grams_dict_counts = defaultdict(int)
    for review in _text:
        terms = tokenize(review)
        if len(terms) >= _n:
            for i in range(len(terms)-_n+1):
                #join the window of _n consecutive terms into one space-separated key
                gram = ' '.join(terms[i:i+_n])
                grams_dict_counts[gram] += 1
    return grams_dict_counts
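
The sliding window in `grams_dict` can also be written with `zip` over staggered slices of the token list. This is a minimal sketch on toy tokens; the helper name `ngrams` is an assumption, and the tokens stand in for the output of `tokenize`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the space-joined n-grams of a token list (illustrative helper)."""
    # zip over n staggered views of the list; zip stops at the shortest view,
    # so a list with fewer than n tokens yields no n-grams.
    return [' '.join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

toy_tokens = ['worst', 'movie', 'ever', 'seen', 'worst', 'movie']
bigram_counts = Counter(ngrams(toy_tokens, 2))
fourgrams = ngrams(toy_tokens, 4)
print(bigram_counts.most_common(1))
print(fourgrams)
```

Unlike `grams_dict`, this version counts n-grams within a single token list; counting across a corpus would mean calling it once per review and updating one shared `Counter`.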

In [12]:
#create bigram count dictionaries for the bags - no POS
bigram_counts_bag0 = grams_dict(Bag0, 2)
bigram_counts_bag1 = grams_dict(Bag1, 2)

In [13]:
#sort the dictionaries and store the top 20
sort_bigram_counts_bag0 = dict(sorted(bigram_counts_bag0.items(), key=lambda kv:kv[1],reverse=True)[:20])
sort_bigram_counts_bag1 = dict(sorted(bigram_counts_bag1.items(), key=lambda kv:kv[1],reverse=True)[:20])


In [14]:
#print top 20 bag 0
print("Sentiment 0 - 20 most frequent bigrams")
for i in sort_bigram_counts_bag0:
    print(f"{sort_bigram_counts_bag0[i]} {i}")

Sentiment 0 - 20 most frequent bigrams
2127 look like
1611 ever seen
1383 special effect
1363 waste time
1252 movie ever
1136 bad movie
1057 main character
1026 worst movie
980 movie like
949 horror movie
937 much better
867 year old
866 one worst
828 low budget
787 make movie
780 good movie
748 horror film
735 watch movie
729 see movie
710 bad guy


In [15]:
#print top 20 bag 1
print("Sentiment 1 - 20 most frequent bigrams")
for i in sort_bigram_counts_bag1:
    print(f"{sort_bigram_counts_bag1[i]} {i}")

Sentiment 1 - 20 most frequent bigrams
1488 one best
927 ever seen
893 first time
832 even though
821 New York
818 main character
765 special effect
744 year old
737 year ago
732 see movie
715 look like
706 movie like
678 good movie
670 great movie
635 film like
604 movie ever
585 horror film
568 great film
566 watch movie
561 well done


**Top 20 most frequent bigrams in which both words are POS-tagged as nouns ('NN'), grouped by sentiment**

In [16]:
def grams_dict_NN(_text, _n):
    grams_dict_counts = defaultdict(int)
    for review in _text:
        terms = nltk.pos_tag(tokenize(review))
        if len(terms) >= _n:
            for i in range(len(terms)-_n+1):
                count = 0
                gram_li, gram_pos = [], []
                for term in terms[i:i+_n]:
                    gram_li.append(term[0])
                    gram_pos.append(term[1])
                    #substring test, so NN, NNS, NNP and NNPS all count as nouns
                    if 'NN' in term[1]:
                        count += 1
                if count == _n:
                    key=""
                    for j, k in enumerate(gram_li):
                        key += k +" ("+ gram_pos[j]+") "
                    grams_dict_counts[key.strip()] += 1
    return grams_dict_counts
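
The all-nouns test in `grams_dict_NN` can be expressed more compactly with `all(...)`. The sketch below works on toy `(token, tag)` pairs in place of `nltk.pos_tag` output; the function name and the data are assumptions. It uses `startswith('NN')` where the notebook uses `'NN' in tag`, which should be equivalent for Penn Treebank tags.

```python
from collections import Counter

def noun_ngrams(tagged_tokens, n):
    """Count n-grams whose tokens are all tagged NN* (illustrative helper)."""
    counts = Counter()
    for i in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[i:i + n]
        # Keep the window only if every tag is a noun tag (NN, NNS, NNP, NNPS).
        if all(tag.startswith('NN') for _, tag in window):
            key = ' '.join(f'{word} ({tag})' for word, tag in window)
            counts[key] += 1
    return counts

# Toy tagged tokens standing in for nltk.pos_tag(tokenize(review)).
toy_tagged = [('horror', 'NN'), ('movie', 'NN'), ('was', 'VBD'),
              ('New', 'NNP'), ('York', 'NNP')]
noun_bigrams = noun_ngrams(toy_tagged, 2)
print(noun_bigrams)
```

Note that the tagger in the notebook sees text with stop words already removed and tokens lemmatized, which may explain tags such as `watch (NN)` in the output.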

In [17]:
#create noun-bigram count dictionary for bag 0 - with POS tags
bigramNN_counts_bag0 = grams_dict_NN(Bag0, 2)


In [18]:
#create noun-bigram count dictionary for bag 1 - with POS tags
bigramNN_counts_bag1 = grams_dict_NN(Bag1, 2)


In [19]:
#sort the dictionaries and store the top 20
sort_bigramNN_counts_bag0 = dict(sorted(bigramNN_counts_bag0.items(), key=lambda kv:kv[1],reverse=True)[:20])


In [20]:
#sort the dictionaries and store the top 20
sort_bigramNN_counts_bag1 = dict(sorted(bigramNN_counts_bag1.items(), key=lambda kv:kv[1],reverse=True)[:20])


In [21]:
#print top 20 bag 0
print("Sentiment 0 - 20 most frequent bigrams w NN")
for i in sort_bigramNN_counts_bag0:
    print(f"{sort_bigramNN_counts_bag0[i]} {i}")

Sentiment 0 - 20 most frequent bigrams w NN
1252 waste (NN) time (NN)
814 horror (NN) movie (NN)
733 horror (NN) film (NN)
478 story (NN) line (NN)
460 watch (NN) movie (NN)
436 New (NNP) York (NNP)
432 part (NN) movie (NN)
411 thing (NN) movie (NN)
397 production (NN) value (NN)
383 character (NN) development (NN)
372 movie (NN) movie (NN)
346 sex (NN) scene (NN)
299 camera (NN) work (NN)
293 action (NN) scene (NN)
287 action (NN) movie (NN)
284 time (NN) money (NN)
273 watch (NN) film (NN)
266 plot (NN) line (NN)
261 film (NN) maker (NN)
257 time (NN) movie (NN)


In [22]:
#print top 20 bag 1
print("Sentiment 1 - 20 most frequent bigrams w NN")
for i in sort_bigramNN_counts_bag1:
    print(f"{sort_bigramNN_counts_bag1[i]} {i}")

Sentiment 1 - 20 most frequent bigrams w NN
821 New (NNP) York (NNP)
571 horror (NN) film (NN)
454 story (NN) line (NN)
413 horror (NN) movie (NN)
349 watch (NN) movie (NN)
338 World (NNP) War (NNP)
284 movie (NN) movie (NN)
266 watch (NN) film (NN)
257 see (NN) film (NN)
256 part (NN) movie (NN)
255 movie (NN) time (NN)
251 end (NN) film (NN)
249 film (NN) film (NN)
245 action (NN) movie (NN)
245 movie (NN) watch (NN)
241 point (NN) view (NN)
237 film (NN) time (NN)
236 production (NN) value (NN)
231 part (NN) film (NN)
231 Film (NNP) Festival (NNP)


**4-grams that have counts 2 or more in reviews grouped by sentiment**

In [23]:
#generate 4grams for Bag0
bag0_4gram_dict = grams_dict(Bag0, 4)
bag0_4gram_dict = {k:v for k, v in bag0_4gram_dict.items() if v >= 2}


In [24]:
#generate 4grams for Bag1
bag1_4gram_dict = grams_dict(Bag1, 4)
bag1_4gram_dict = {k:v for k, v in bag1_4gram_dict.items() if v >= 2}


In [25]:
#sort the dictionaries by count, descending
sorted_bag0_4gram = sorted(bag0_4gram_dict.items(), key= lambda kv:kv[1], reverse=True)
sorted_bag1_4gram = sorted(bag1_4gram_dict.items(), key= lambda kv:kv[1], reverse=True)


In [26]:
print('Sentiment 0 4grams - all\n')
for i in sorted_bag0_4gram[:20]:
    print(f"{i[1]} {i[0]}")


Sentiment 0 4grams - all

392 worst movie ever seen
215 one worst movie ever
170 worst film ever seen
114 worst movie ever made
107 one worst film ever
69 worst film ever made
44 life never get back
42 movie ever seen life
28 movie seen long time
28 one worst movie seen
27 Plan From Outer Space
25 could done better job
24 movie complete waste time
24 really wanted like movie
23 trivialBoring trivialBoring trivialBoring trivialBoring
22 One worst movie ever
21 vote four. Title Brazil
20 ever seen entire life
19 left cutting room floor
19 possibly worst film ever


In [27]:
print('Sentiment 1 4grams - all\n')
for i in sorted_bag1_4gram[:20]:
    print(f"{i[1]} {i[0]}")


Sentiment 1 4grams - all

52 best movie ever seen
47 one best movie ever
37 Never Say Never Again
34 one best movie seen
29 vote eight. Title Brazil
29 movie seen long time
28 vote seven. Title Brazil
28 funniest movie ever seen
26 Best Years Our Lives
26 Tony Hawk Pro Skater
25 one best film ever
25 one best film seen
25 best movie ever made
24 greatest show ever mad
24 show ever mad full
23 good guy bad guy
23 ever mad full stop.OZ
23 mad full stop.OZ greatest
23 full stop.OZ greatest show
23 stop.OZ greatest show ever


**Probabilities of words that come after "worst film ever" and "best movie ever"**

In [28]:
#get full vocab dict
vocab_4gram_dict = grams_dict(Reviews, 4)

In [29]:
#get dictionary of 4-grams that start with 'worst film ever'
vocab_4gram_dict_worst = {k:vocab_4gram_dict[k] for k in vocab_4gram_dict if k.startswith('worst film ever ')}
sorted_vocab_4gram_dict_worst = dict( sorted(vocab_4gram_dict_worst.items(), key= lambda kv:kv[1], reverse=True))


In [30]:
#calculate probabilities of worst film ever
sum_worst = sum(sorted_vocab_4gram_dict_worst.values())
print(f"Count 'worst film ever': {sum_worst}\n")

#relative frequency of each 4-gram among all 4-grams that continue 'worst film ever'
sum_worst_prob = {k: v/sum_worst for k, v in sorted_vocab_4gram_dict_worst.items()}

print("Probabilities of 'worst film ever': \n")
for i in sum_worst_prob:
    print(f"{i} {sum_worst_prob[i]:.3f}")

Count 'worst film ever': 327

Probabilities of 'worst film ever': 

worst film ever seen 0.523
worst film ever made 0.223
worst film ever seen. 0.018
worst film ever misfortune 0.015
worst film ever saw 0.012
worst film ever see 0.009
worst film ever created 0.009
worst film ever paid 0.009
worst film ever watched 0.006
worst film ever viewed 0.006
worst film ever displeasure 0.006
worst film ever sat 0.006
worst film ever wasted 0.006
worst film ever made. 0.006
worst film ever witnessed 0.006
worst film ever acting 0.006
worst film ever bad 0.003
worst film ever well 0.003
worst film ever conceived 0.003
worst film ever Not 0.003
worst film ever successful 0.003
worst film ever Watchers 0.003
worst film ever though 0.003
worst film ever missfortune 0.003
worst film ever encountered 0.003
worst film ever mean 0.003
worst film ever like 0.003
worst film ever Hulk 0.003
worst film ever generally 0.003
worst film ever major 0.003
worst film ever character 0.003
worst film ever seen.And 0
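
The probabilities printed above are relative frequencies: the count of each 4-gram that continues the phrase, divided by the total count of all such 4-grams. A minimal sketch of that calculation on toy counts follows. The function name and the toy numbers are assumptions, and the prefix is matched with `str.startswith`, which is one way to keep a token that merely ends in "best" (for example "dumbest") out of a "best movie ever" tally.

```python
def next_word_probs(fourgram_counts, prefix):
    """Relative frequency of the fourth word given a three-word prefix."""
    lead = prefix + ' '
    # Keep only 4-grams whose first three words are exactly the prefix.
    matches = {gram: count for gram, count in fourgram_counts.items()
               if gram.startswith(lead)}
    total = sum(matches.values())
    if total == 0:
        return {}
    ranked = sorted(matches.items(), key=lambda kv: kv[1], reverse=True)
    # Strip the prefix so the keys are just the continuation words.
    return {gram[len(lead):]: count / total for gram, count in ranked}

# Toy 4-gram counts; the numbers are invented for illustration.
toy_counts = {
    'worst film ever seen': 6,
    'worst film ever made': 3,
    'possibly worst film ever': 4,
    'worst film ever saw': 1,
}
probs = next_word_probs(toy_counts, 'worst film ever')
for word, p in probs.items():
    print(f"{word} {p:.3f}")
```

Because the denominator only includes occurrences that are followed by another token, a phrase that ends a review does not contribute to the total.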

In [31]:
#get dictionary of 4-grams that start with 'best movie ever' (this also excludes 'dumbest movie ever')
vocab_4gram_dict_best = {k:vocab_4gram_dict[k] for k in vocab_4gram_dict if k.startswith('best movie ever ')}
sorted_vocab_4gram_dict_best = dict( sorted(vocab_4gram_dict_best.items(), key= lambda kv:kv[1], reverse=True))



In [32]:
#calculate probabilities of best movie ever
sum_best = sum(sorted_vocab_4gram_dict_best.values())
print(f"Count 'best movie ever': {sum_best}\n")

#relative frequency of each 4-gram among all 4-grams that continue 'best movie ever'
sum_best_prob = {k: v/sum_best for k, v in sorted_vocab_4gram_dict_best.items()}

print("Probabilities of 'best movie ever': \n")
for i in sum_best_prob:
    print(f"{i} {sum_best_prob[i]:.3f}")

Count 'best movie ever': 145

Probabilities of 'best movie ever': 

best movie ever seen 0.400
best movie ever made 0.241
best movie ever saw 0.028
best movie ever seen. 0.028
best movie ever watched 0.014
best movie ever perfect 0.007
best movie ever And 0.007
best movie ever nice 0.007
best movie ever saw.Overcoming 0.007
best movie ever sitting 0.007
best movie ever Mira 0.007
best movie ever got 0.007
best movie ever heart-stopping 0.007
best movie ever make 0.007
best movie ever put 0.007
best movie ever bunch 0.007
best movie ever hand 0.007
best movie ever care 0.007
best movie ever Holocaust 0.007
best movie ever according 0.007
best movie ever one 0.007
best movie ever still 0.007
best movie ever viewed. 0.007
best movie ever get 0.007
best movie ever Absolutely 0.007
best movie ever funny 0.007
best movie ever movie 0.007
best movie ever watch 0.007
best movie ever uncommon 0.007
best movie ever Fire 0.007
best movie ever opinion 0.007
best movie ever idea 0.007
best movie ev