## DS 7337 - Natural Language Processing

### Author: Brandon Croom

### Homework: 8

### Assignment Criteria:

Perform a vocabulary-based sentiment analysis of the movie reviews you used in homework 5 and homework 7, by doing the following:

1.	In Python, load one of the sentiment vocabularies referenced in the textbook, and run the sentiment analyzer as explained in the corresponding reference. Add words to the sentiment vocabulary, if you think you need to, to better fit your particular text collection.

2.	For each of the clusters you created in homework 7, compute the average, median, high, and low sentiment scores for each cluster. Explain whether you think this reveals anything interesting about the clusters.

3.	For extra credit, analyze sentiment of chunks as follows:

 - Take the chunks from homework 5, and in Python, run each chunk individually through your sentiment analyzer that you used in question 1. If the chunk registers a nonneutral sentiment, save it in a tabular format (the chunk, the sentiment score).

 - Now sort the table twice, once to show the highest negative-sentiment-scoring chunks at the top and again to show the highest positive-sentiment-scoring chunks at the top. Examine the upper portions of both sorted lists, to identify any trends, and explain what you see. 

Submit all of your inputs and outputs and your code for this assignment, along with a brief written explanation of your findings. 


NOTE: Code for scraping IMDB website modified from https://shravan-kuchkula.github.io/scrape_imdb_movie_reviews/#step-4-for-each-of-the-movie-reviews-link-get-a-positive-user-review-link-and-a-negative-movie-review-link


In [1]:
# import nltk and other items
import nltk
from nltk.tokenize import word_tokenize 
import numpy as np
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from bs4 import BeautifulSoup
import re
import pandas as pd
import itertools
from nltk.chunk import *
from nltk.chunk.util import *
from nltk.chunk.regexp import *

pd.options.display.max_colwidth=500

# define some static values: our source of information (IMDb), our output file, and the URL for our IMDb search
base_url = "https://www.imdb.com"
outfile = "DS_7337_NLP_HW5_CROOM.csv"

# API call to select:
#  -feature films
#  -which are rated at least 4.0
#  -having at least 50,000 votes
#  -in the Thriller genre
#  -sorted by user rating
#  -limit to 250 movies
# (adjacent string literals are concatenated, so the URL contains no line break)
url = ('https://www.imdb.com/search/title/?title_type=feature&user_rating=4.0,10.0'
       '&num_votes=50000,&genres=thriller&view=simple&sort=user_rating,desc&count=250')

In [2]:
# define helper functions for the work

# Utility function which takes a url and returns a Soup object.
def urlToSoup(url): 
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    return soup

# Utility function to filter out movie tag issues
def filterMovieTags(movieSoup):
    # find all a-tags with class:None
    movie_tags = movieSoup.find_all('a', attrs={'class': None})

    # filter the a-tags to get just the title links
    movie_tags = [tag.attrs['href'] for tag in movie_tags 
                  if tag.attrs['href'].startswith('/title') and tag.attrs['href'].endswith('/')]

    # remove duplicate links
    movie_tags = list(dict.fromkeys(movie_tags))

    return movie_tags

# function to build the list of movie review links
def buildReviewLinks(movie_tags):
    movie_links = [base_url + tag + 'reviews' for tag in movie_tags]
    return movie_links

# Function that returns the index of negative and positive review.
def minMax(a):   
    # get the index of least rated user review
    minpos = a.index(min(a))
    
    # get the index of highest rated user review
    maxpos = a.index(max(a))
    
    return minpos, maxpos

# Function returns a negative and positive review for each movie.
def getMovieReviews(soup):    
    # get a list of user ratings
    user_review_ratings = [tag.previous_element for tag in 
                           soup.find_all('span', attrs={'class': 'point-scale'})]
    
    # find the index of negative and positive review
    n_index, p_index = minMax(list(map(int, user_review_ratings)))
    
    # get the review tags
    user_review_list = soup.find_all('a', attrs={'class':'title'})
    
    # get the negative and positive review tags
    n_review_tag = user_review_list[n_index]
    p_review_tag = user_review_list[p_index]
    
    # return the negative and positive review link
    n_review_link = base_url + n_review_tag['href']
    p_review_link = base_url + p_review_tag['href']
    
    return n_review_link, p_review_link

# Returns the user review text given the review url.
def getMovieReviewText(review_url):   
    # get the review_url's soup
    soup = urlToSoup(review_url)
    
    # find div tags with class text show-more__control
    tag = soup.find('div', attrs={'class': 'text show-more__control'})
    
    return tag.getText()

# Returns the movie title from the review url.
def getMovieTitle(review_url):
    # get the review_url's soup
    soup = urlToSoup(review_url)
    
    # find h1 tag
    tag = soup.find('h1')
    
    return list(tag.children)[1].getText()

# Returns the NP chunked review results
def processReviewText(document):
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [getMovieReviewTags(sent) for sent in sentences]
    return sentences

# Returns the tagged version of the review
def getMovieReviewTags(reviewSentence):	
    nps = []

    # NOTE: For purposes of the exercise I'm treating
    # the grammar variable as a global so that the regular
    # expression patterns can be changed as needed
    cp = nltk.RegexpParser(grammar)
    tree = cp.parse(reviewSentence)

    # loop through the trees produced and pull out only the 
    # NP subtrees
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            t = subtree
            t = ' '.join(word for word, tag in t.leaves())
            nps.append(t)

    return nps

# Returns the k-means clusters for the documents
def getKMeans(reviews, kVal = 5):
    # build a TfidfVectorizer with the English stop words
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(reviews)

    # execute KMeans on the vectorized data
    # NOTE: no random_state is set, so the clusters can differ from run to run
    model = KMeans(n_clusters=kVal, init='k-means++', max_iter=100, n_init=1)
    model.fit(X)

    # print out the top terms per cluster for the user
    print("Top terms per cluster:")
    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    # get_feature_names() was removed in newer scikit-learn releases;
    # get_feature_names_out() is its replacement
    terms = vectorizer.get_feature_names_out()
    
    clusters = []

    for i in range(kVal):
        cluster_terms = []
        print("Cluster %d:" % i)

        # keep the ten highest-weighted terms of this cluster's centroid
        for ind in order_centroids[i, :10]:
            cluster_terms.append(terms[ind])
            print(' %s' % terms[ind])
        
        clusters.append(cluster_terms)

    print("\n")

    return clusters

def getClusterSentimentScores(termsList):
    all_scores = []

    for terms in termsList:
        scores = []

        for term in terms:
            score = sid.polarity_scores(term)
            scores.append(score['compound'])

        all_scores.append(np.array(scores))

    for i, scores in enumerate(all_scores):
        print('Cluster {}: '.format(i), end='')

        mean = scores.mean()
        print('mean: {0:.1}'.format(mean), end=' | ')

        median = np.median(scores)
        print('median: {}'.format(median), end=' | ')

        _max = scores.max()
        print('max: {0:.1}'.format(_max), end=' | ')

        _min = scores.min()
        print('min: {0:.1}'.format(_min))

def sentiment_scores(sentence,printScores=True): 
  
    # Create a SentimentIntensityAnalyzer object. 
    sid_obj = SentimentIntensityAnalyzer() 
  
    # The polarity_scores method of the SentimentIntensityAnalyzer
    # object returns a sentiment dictionary
    # which contains pos, neg, neu, and compound scores. 
    sentiment_dict = sid_obj.polarity_scores(sentence) 
    
    if (printScores):
        print("Overall sentiment dictionary is : ", sentiment_dict) 
        print("sentence was rated as ", sentiment_dict['neg']*100, "% Negative") 
        print("sentence was rated as ", sentiment_dict['neu']*100, "% Neutral") 
        print("sentence was rated as ", sentiment_dict['pos']*100, "% Positive") 
  
        print("Sentence Overall Rated As", end = " ") 
  
        # decide sentiment as positive, negative and neutral 
        if sentiment_dict['compound'] >= 0.05 : 
            print("Positive") 
  
        elif sentiment_dict['compound'] <= - 0.05 : 
            print("Negative") 
  
        else : 
            print("Neutral") 

    return sentiment_dict['compound']

def getChunkedSentimentScores(reviewChunk,printScores=False):

    termList = []
    scoreList = []

    for terms in reviewChunk:
        for term in terms:
            termStr = ' '.join(term)
            sent_score = sentiment_scores(termStr,printScores)

            # keep only non-neutral chunks (same +/-0.05 thresholds as sentiment_scores)
            if abs(sent_score) >= 0.05:
                termList.append(termStr)
                scoreList.append(sent_score)

    return termList, scoreList

First, let's connect to IMDb. We're going to pull films that are highly rated in the thriller genre, limited to 250 movies.

In [3]:
# API call to select:
## feature films
## which are rated at least 4.0
## having at least 50,000 votes
## in the Thriller genre
## sorted by user rating
## limit to 250 movies
# (adjacent string literals are concatenated, so the URL contains no line break)
url = ('https://www.imdb.com/search/title/?title_type=feature&user_rating=4.0,10.0'
       '&num_votes=50000,&genres=thriller&view=simple&sort=user_rating,desc&count=250')

# transform the movie url list to a BeautifulSoup object
movies_soup = urlToSoup(url)

# in looking at the returned list there's some filtering that needs to be done
# filter off tags with none and just cleanup the list in general
movie_tags = filterMovieTags(movies_soup)

# Print out the number of reviews we have and show the first 10 items
print("There are a total of " + str(len(movie_tags)) + " movie user reviews")
print("Displaying 10 user reviews links")
movie_tags[:10]

There are a total of 250 movie user reviews
Displaying 10 user reviews links


['/title/tt0468569/',
 '/title/tt1375666/',
 '/title/tt6751668/',
 '/title/tt0816692/',
 '/title/tt0114369/',
 '/title/tt0102926/',
 '/title/tt7286456/',
 '/title/tt0482571/',
 '/title/tt0407887/',
 '/title/tt0114814/']
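
Long query strings like the search URL above are easy to get wrong when assembled by hand. As a minimal sketch, assuming the same query parameters, the URL could instead be built from a dictionary with the standard library's `urllib.parse.urlencode`. The helper name `build_search_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

def build_search_url(base="https://www.imdb.com/search/title/", count=250):
    """Assemble the IMDb search URL from named parameters (hypothetical helper)."""
    params = {
        "title_type": "feature",
        "user_rating": "4.0,10.0",
        "num_votes": "50000,",
        "genres": "thriller",
        "view": "simple",
        "sort": "user_rating,desc",
        "count": count,
    }
    # safe="," keeps the commas readable instead of percent-encoding them
    return base + "?" + urlencode(params, safe=",")

search_url = build_search_url()
print(search_url)
```

Keeping the parameters in a dictionary makes it simple to change the genre or vote threshold without editing a long string.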

Now let's build out the list of specific movie review URLs to be able to parse the reviews

In [4]:
# build out the list of reviews
review_links = buildReviewLinks(movie_tags)

print("There are a total of " + str(len(review_links)) + " movie user reviews")
print("Displaying 10 user reviews links")
review_links[:10]

There are a total of 250 movie user reviews
Displaying 10 user reviews links


['https://www.imdb.com/title/tt0468569/reviews',
 'https://www.imdb.com/title/tt1375666/reviews',
 'https://www.imdb.com/title/tt6751668/reviews',
 'https://www.imdb.com/title/tt0816692/reviews',
 'https://www.imdb.com/title/tt0114369/reviews',
 'https://www.imdb.com/title/tt0102926/reviews',
 'https://www.imdb.com/title/tt7286456/reviews',
 'https://www.imdb.com/title/tt0482571/reviews',
 'https://www.imdb.com/title/tt0407887/reviews',
 'https://www.imdb.com/title/tt0114814/reviews']

Now go through each movie's review page and pick out its lowest-rated and highest-rated user review. This gives us two individual reviews for each movie: one treated as negative and one treated as positive.

In [5]:

# get a list of soup objects
movie_soups = [urlToSoup(link) for link in review_links]

# get all movie review links
movie_review_list = [getMovieReviews(movie_soup) for movie_soup in movie_soups]

movie_review_list = list(itertools.chain(*movie_review_list))

# For a check make a listing of the review URLs
print("There are a total of " + str(len(movie_review_list)) + " individual movie reviews")
print("Displaying 10 reviews")
movie_review_list[:10]

There are a total of 500 individual movie reviews
Displaying 10 reviews


['https://www.imdb.com/review/rw2599771/',
 'https://www.imdb.com/review/rw5478826/',
 'https://www.imdb.com/review/rw2286063/',
 'https://www.imdb.com/review/rw4692192/',
 'https://www.imdb.com/review/rw5389992/',
 'https://www.imdb.com/review/rw5195256/',
 'https://www.imdb.com/review/rw3119344/',
 'https://www.imdb.com/review/rw5145037/',
 'https://www.imdb.com/review/rw1136748/',
 'https://www.imdb.com/review/rw0370669/']
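
The labels assigned in the next step rely on this list alternating between the lowest-rated and the highest-rated review of each movie. The following sketch uses made-up ratings and link names (hypothetical data, not scraped) to show how the `minMax` logic picks the two indices and how `itertools.chain` flattens the pairs into that alternating order.

```python
import itertools

def min_max(ratings):
    """Return the index of the lowest and the highest rating (first occurrence on ties)."""
    return ratings.index(min(ratings)), ratings.index(max(ratings))

# Hypothetical ratings and review links found on two review pages
page_ratings = [[7, 10, 3, 10], [9, 9, 9]]
page_links = [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3"]]

pairs = []
for ratings, links in zip(page_ratings, page_links):
    low, high = min_max(ratings)
    pairs.append((links[low], links[high]))

# Flatten [(neg, pos), (neg, pos), ...] into [neg, pos, neg, pos, ...]
flat = list(itertools.chain(*pairs))
labels = ["negative", "positive"] * (len(flat) // 2)
print(list(zip(flat, labels)))
```

Two caveats follow from this logic. When every review on a page has the same rating, the same review is returned twice. And "negative" only means "lowest rated on the page", so a review labeled negative can still be a favorable one.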

To make things a bit easier to read, build a pandas data frame. This data frame will house the movie title, the link to the user review, the user review text, and the sentiment label of the review.

In [6]:
# get review text from the review link
review_texts = [getMovieReviewText(url) for url in movie_review_list]

# get movie name from the review link
movie_titles = [getMovieTitle(url) for url in movie_review_list]

# label each review with negative or positive
review_sentiment = np.array(['negative', 'positive'] * (len(movie_review_list)//2))

# construct a dataframe
df = pd.DataFrame({'movie': movie_titles, 'user_review_permalink': movie_review_list,
             'user_review': review_texts, 'sentiment': review_sentiment})

# show the dataframe head to show what we return back
df.head()

Unnamed: 0,movie,user_review_permalink,user_review,sentiment
0,The Dark Knight,https://www.imdb.com/review/rw2599771/,"Let's open this review with the fact that Batman is THE BEST superhero ever, with no other even coming close to his levels. So when a sequel is made for an unforgettable movie like ""Batman Begins"" expectations are EXTREMELY high. I think that if you could describe this movie in a few words it would be ""152 minutes of pure awesomeness."" Christopher Nolan probably created the best ever comic book adaptation of a movie. The movie itself has been adapted in a much more realistic fashion than the...",negative
1,The Dark Knight,https://www.imdb.com/review/rw5478826/,"Confidently directed, dark, brooding, and packed with impressive action sequences and a complex story, The Dark Knight includes a career-defining turn from Heath Ledger as well as other Oscar worthy performances, TDK remains not only the best Batman movie, but comic book movie ever created.",positive
2,Inception,https://www.imdb.com/review/rw2286063/,"I have to say to make such an impressive trailer and such an uninteresting film, takes some doing.Here you have most of the elements that would make a very good film. You have great special effects, a sci-fi conundrum, beautiful visuals and good sound. Yet the most important part of the film is missing. There is no plot, character or soul to this film. It's like having a beautiful building on the outside with no paint or decoration on the inside.It's an empty shell of a film. There is no ten...",negative
3,Inception,https://www.imdb.com/review/rw4692192/,"My 3rd time watching this movie! Yet, it still stunned my mind, kept me enjoyed its every moment and left me with many thoughts afterward.\nFor someone like me, who've rarely slept without dream, it's so exciting watching how Christopher Nolan had illustrated every single characteristic of dream on the big screen. As it's been done so sophisticatedly, I do believe the rumour that Nolan had spent 10 years to finish the script of Inception. In my opinion, it's been so far the greatest achievem...",positive
4,Parasite,https://www.imdb.com/review/rw5389992/,"The film is ok but that's as far as I go. It's worth a watch but only a score of 6 for me. It's not a masterpiece, it's not film of the decade it's not even film of the year. I don't understand how people can give it 10, 10 is a perfect score. 10 is unachievable in my opinion as nothing is perfect.",negative


Cluster using K-means for 5 clusters

In [7]:
# Execute the K-Means function on the reviews. We'll initially use the default number of clusters which is 5
top_terms_k5 = getKMeans(review_texts)

Top terms per cluster:
Cluster 0:
 caligari
 alan
 young
 film
 cabinet
 tale
 man
 beckert
 war
 movie
Cluster 1:
 movie
 film
 good
 just
 like
 story
 horror
 don
 people
 way
Cluster 2:
 police
 movie
 killer
 serial
 film
 make
 korea
 bank
 lousy
 corruption
Cluster 3:
 action
 film
 movie
 best
 love
 movies
 time
 bond
 great
 films
Cluster 4:
 film
 story
 movie
 best
 character
 time
 films
 characters
 like
 just




In [8]:
# Execute the K-Means function on the reviews, use a number of clusters equal to 10
top_terms_k10 = getKMeans(review_texts,10)

Top terms per cluster:
Cluster 0:
 film
 movie
 best
 warriors
 action
 time
 character
 jackman
 series
 franchise
Cluster 1:
 movie
 film
 scene
 just
 coraline
 kane
 town
 like
 children
 laughton
Cluster 2:
 film
 horror
 story
 movie
 time
 films
 great
 best
 action
 character
Cluster 3:
 movie
 russian
 like
 acting
 seen
 story
 time
 ve
 punk
 serial
Cluster 4:
 film
 com
 list
 imdb
 best
 http
 www
 noir
 favorite
 films
Cluster 5:
 movie
 film
 good
 like
 just
 oscar
 movies
 story
 really
 make
Cluster 6:
 film
 bourne
 movie
 action
 feel
 like
 just
 good
 time
 story
Cluster 7:
 hitchcock
 film
 grant
 hitch
 films
 like
 murder
 vertigo
 character
 perfect
Cluster 8:
 movie
 film
 story
 don
 great
 just
 good
 people
 movies
 like
Cluster 9:
 film
 films
 sci
 fi
 action
 life
 time
 story
 gattaca
 best




In [9]:
# Execute the K-Means function on the reviews, use a number of clusters equal to 20
top_terms_k20 = getKMeans(review_texts,20)

Top terms per cluster:
Cluster 0:
 niro
 connery
 scorsese
 bond
 craig
 daniel
 nonsense
 blair
 witch
 untolerable
Cluster 1:
 media
 social
 disconnect
 lou
 internet
 wednesday
 twists
 required
 technology
 pleased
Cluster 2:
 horror
 house
 wan
 conjuring
 amityville
 bope
 film
 insidious
 movie
 nascimento
Cluster 3:
 hard
 action
 die
 film
 bond
 movie
 love
 mcclane
 franchise
 john
Cluster 4:
 movie
 minutes
 sense
 just
 film
 lola
 make
 30
 direction
 time
Cluster 5:
 film
 com
 list
 imdb
 watch
 http
 www
 10
 ending
 favorite
Cluster 6:
 slevin
 movie
 people
 lesson
 time
 haneke
 criminal
 crash
 racism
 hackman
Cluster 7:
 german
 conditions
 submarine
 film
 caligari
 south
 modern
 stasi
 dreyman
 hundreds
Cluster 8:
 good
 movie
 film
 story
 like
 perfect
 bad
 guys
 just
 movies
Cluster 9:
 movie
 police
 end
 kong
 departed
 run
 don
 cop
 hong
 table
Cluster 10:
 coraline
 selick
 haneke
 world
 children
 parents
 voice
 acts
 movie
 neighbors
Cluster 11:
 b

1.	In Python, load one of the sentiment vocabularies referenced in the textbook, and run the sentiment analyzer as explained in the corresponding reference. Add words to the sentiment vocabulary, if you think you need to, to better fit your particular text collection.

In [10]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = None
try:
    # raises LookupError if the VADER lexicon has not been downloaded yet
    sid = SentimentIntensityAnalyzer()
except LookupError:
    nltk.download('vader_lexicon')
    sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\croomb\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [11]:
sid.polarity_scores('):{')

{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.5106}
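
The `compound` value is the one used throughout the rest of this notebook. In NLTK's VADER implementation it is the summed valence of the tokens, squashed into the range -1 to 1 with x / sqrt(x² + 15). The sketch below reimplements only that normalization step plus the ±0.05 labeling used in `sentiment_scores`. The valence sum of -2.3 is an assumption inferred from the output above, not a value read from the lexicon.

```python
import math

def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

def label(compound, threshold=0.05):
    """Map a compound score to a coarse label using the thresholds from sentiment_scores."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Assumed valence sum; -2.3 reproduces the compound score shown above
compound = round(normalize(-2.3), 4)
print(compound, label(compound))
```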

2.	For each of the clusters you created in homework 7, compute the average, median, high, and low sentiment scores for each cluster. Explain whether you think this reveals anything interesting about the clusters.

In [12]:
# get the cluster sentiment scores for the 5 clusters
getClusterSentimentScores(top_terms_k5)

Cluster 0: mean: -0.06 | median: 0.0 | max: 0e+00 | min: -0.6
Cluster 1: mean: 0.02 | median: 0.0 | max: 0.4 | min: -0.6
Cluster 2: mean: -0.1 | median: 0.0 | max: 0e+00 | min: -0.6
Cluster 3: mean: 0.2 | median: 0.0 | max: 0.6 | min: 0e+00
Cluster 4: mean: 0.1 | median: 0.0 | max: 0.6 | min: 0e+00


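For the five-cluster run, clusters 1, 3 and 4 have a mean above zero, which suggests mildly positive sentiment, while clusters 0 and 2 have a mean below zero. Every median is 0.0, so most of the ten top terms in each cluster are neutral to the analyzer and each mean is driven by only a few sentiment-bearing terms.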
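
The four statistics reported per cluster can also be computed with the standard library alone. The sketch below uses hypothetical scores for one cluster and additionally shows why some values above print as `0e+00`: the `{0:.1}` format keeps one significant digit and switches to exponent notation for zero, whereas `{:.2f}` always prints fixed-point.

```python
import statistics

def summarize(scores):
    """Return mean, median, max, and min for one cluster's term scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "max": max(scores),
        "min": min(scores),
    }

# Hypothetical compound scores for the ten top terms of one cluster
scores = [0.0, 0.0, 0.0, 0.0, -0.6, 0.0, 0.0, 0.0, 0.0, 0.0]
stats = summarize(scores)

# '{0:.1}' keeps one significant digit and can fall back to exponent notation
print("one significant digit:", "{0:.1}".format(stats["max"]))
# '{:.2f}' always prints two decimal places
print("fixed point:", " | ".join("{}: {:.2f}".format(k, v) for k, v in stats.items()))
```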

In [13]:
# get the cluster sentiment scores for the 10 clusters
getClusterSentimentScores(top_terms_k10)

Cluster 0: mean: 0.06 | median: 0.0 | max: 0.6 | min: 0e+00
Cluster 1: mean: 0.04 | median: 0.0 | max: 0.4 | min: 0e+00
Cluster 2: mean: 0.07 | median: 0.0 | max: 0.6 | min: -0.6
Cluster 3: mean: 0.04 | median: 0.0 | max: 0.4 | min: 0e+00
Cluster 4: mean: 0.1 | median: 0.0 | max: 0.6 | min: 0e+00
Cluster 5: mean: 0.08 | median: 0.0 | max: 0.4 | min: 0e+00
Cluster 6: mean: 0.08 | median: 0.0 | max: 0.4 | min: 0e+00
Cluster 7: mean: 0.06 | median: 0.0 | max: 0.6 | min: -0.7
Cluster 8: mean: 0.1 | median: 0.0 | max: 0.6 | min: 0e+00
Cluster 9: mean: 0.06 | median: 0.0 | max: 0.6 | min: 0e+00


For the ten-cluster run, every cluster has a small positive mean (between 0.04 and 0.1) and a median of 0.0. Clusters 2 and 7 each contain a strongly negative term (minimums of -0.6 and -0.7), but the positive terms in those clusters outweigh it. None of the ten clusters comes out negative on average.

In [14]:
# get the cluster sentiment for the 20 clusters
getClusterSentimentScores(top_terms_k20)

Cluster 0: mean: -0.08 | median: 0.0 | max: 0e+00 | min: -0.4
Cluster 1: mean: 0.04 | median: 0.0 | max: 0.4 | min: 0e+00
Cluster 2: mean: -0.06 | median: 0.0 | max: 0e+00 | min: -0.6
Cluster 3: mean: -0.007 | median: 0.0 | max: 0.6 | min: -0.6
Cluster 4: mean: 0e+00 | median: 0.0 | max: 0e+00 | min: 0e+00
Cluster 5: mean: 0.05 | median: 0.0 | max: 0.5 | min: 0e+00
Cluster 6: mean: -0.2 | median: 0.0 | max: 0e+00 | min: -0.6
Cluster 7: mean: 0e+00 | median: 0.0 | max: 0e+00 | min: 0e+00
Cluster 8: mean: 0.08 | median: 0.0 | max: 0.6 | min: -0.5
Cluster 9: mean: 0e+00 | median: 0.0 | max: 0e+00 | min: 0e+00
Cluster 10: mean: 0e+00 | median: 0.0 | max: 0e+00 | min: 0e+00
Cluster 11: mean: 0e+00 | median: 0.0 | max: 0e+00 | min: 0e+00
Cluster 12: mean: -0.02 | median: 0.0 | max: 0.4 | min: -0.6
Cluster 13: mean: 0.1 | median: 0.0 | max: 0.6 | min: 0e+00
Cluster 14: mean: 0e+00 | median: 0.0 | max: 0e+00 | min: 0e+00
Cluster 15: mean: -0.05 | median: 0.0 | max: 0e+00 | min: -0.5
Cluster 16

### Question 2 Analysis Summarization

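<u>Cluster Execution - k=5</u>

For the five-cluster run, clusters 1, 3 and 4 have a mean above zero, which suggests mildly positive sentiment, while clusters 0 and 2 have a mean below zero. Every median is 0.0, so most of the ten top terms in each cluster are neutral to the analyzer and each mean is driven by only a few sentiment-bearing terms.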

<u>Cluster Execution - k=10</u>

For the ten-cluster run, every cluster has a small positive mean (between 0.04 and 0.1) and a median of 0.0. Clusters 2 and 7 each contain a strongly negative term (minimums of -0.6 and -0.7), but the positive terms in those clusters outweigh it. None of the ten clusters comes out negative on average.

<u>Cluster Execution - k=20</u>

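For the twenty-cluster run, the output shown covers clusters 0 through 15; the results for clusters 16 through 19 are cut off. Of the clusters shown, 4, 7, 9, 10, 11 and 14 have a mean of zero, which indicates neutral top terms. Clusters 0, 2, 3, 6, 12 and 15 have negative means, and clusters 1, 5, 8 and 13 have positive means. With more clusters, the top terms appear to be increasingly proper names (actors, directors, titles) that the sentiment vocabulary does not cover, so more clusters come out neutral.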

3.	For extra credit, analyze sentiment of chunks as follows:

 - Take the chunks from homework 5, and in Python, run each chunk individually through your sentiment analyzer that you used in question 1. If the chunk registers a nonneutral sentiment, save it in a tabular format (the chunk, the sentiment score).

 - Now sort the table twice, once to show the highest negative-sentiment-scoring chunks at the top and again to show the highest positive-sentiment-scoring chunks at the top. Examine the upper portions of both sorted lists, to identify any trends, and explain what you see. 

In [15]:
# Use one of the chunking sets from HW5 to build out chunks. We'll add these to our data frame
grammar = """
    NP:    {<DT><WP><VBP>*<RB>*<VBN><IN><NN>}
           {<NN|NNS|NNP|NNPS><IN>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS><CC>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS>+}
    """   
df['proc_user_review2'] = df['user_review'].apply(processReviewText)

Now get the sentiment for each group of chunks and put the non-neutral ones into a data frame for later analysis.

In [16]:
# get the chunked sentiment analysis results. Get the score and the list of terms
termList, scoreList = getChunkedSentimentScores(df['proc_user_review2'].tolist())

# build out a data frame for easy analysis
df_sent = pd.DataFrame({'TermsStr': termList, 'Sentiment_Score': scoreList})

# double check the data frame to make sure we have results
df_sent.head()

Unnamed: 0,TermsStr,Sentiment_Score
0,review fact that Batman THE BEST superhero levels,0.7125
1,First movie PERFECT for people punches drama explosions,0.6633
2,anything BUT corny few important characters flawless,0.7684
3,Christian Bale amazing actor,0.5859
4,Michael Cane portrays Alfred Pennyworth love with Batman loyal butler,0.8074
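
The filtering step above and the two sorts that follow can be sketched without pandas. The chunks and scores below are made up for illustration, and `non_neutral` is a hypothetical helper name; the ±0.05 threshold is the one used in `getChunkedSentimentScores`.

```python
def non_neutral(scored_chunks, threshold=0.05):
    """Keep only (chunk, score) pairs whose score is at least `threshold` away from zero."""
    return [(chunk, score) for chunk, score in scored_chunks if abs(score) >= threshold]

# Hypothetical chunks with made-up compound scores
scored_chunks = [
    ("amazing actor", 0.59),
    ("the train station", 0.0),
    ("lousy plot", -0.54),
    ("second act", 0.02),
]

table = non_neutral(scored_chunks)

# Sort once with the most negative chunk first, once with the most positive first
most_negative = sorted(table, key=lambda row: row[1])
most_positive = sorted(table, key=lambda row: row[1], reverse=True)
print(most_negative[0], most_positive[0])
```

Using `sorted` returns a new list each time, so the unsorted table stays available for further analysis.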


Sort the list with the negative sentiments at the top

In [17]:
# Sort the data frame on sentiment score, ascending so we get the most negative chunks at the top
df_sent.sort_values(by=['Sentiment_Score'],ascending=True,inplace=True)

# Display the first 10 records
df_sent.head(10)

Unnamed: 0,TermsStr,Sentiment_Score
2229,EVIL DEAD II director Sam Raimi infamous sequel THE EVIL DEAD outrageous prequel ARMY OF DARKNESS Evil Dead II publicity materials Evil Dead Dawn horror comedy film Sam Raimi parody sequel film Evil Dead,-0.9956
1393,narrator failure with blood veins March of Time commentator loud film horses off track titles dramatic music tension characters while plot time mystery and tension characters Johnny Clay Sterling Hayden convict years in Alcatraz dollar hold-up bunch human beings jigsaw pieces intricacy plan Irish barman amiable old book-keeper crooked cop George Peatty Elisha Cook Jr  eyes mobile mouth incarnation of fear uncertainty countless Hollywood thrillers personification boy killer Kubrick tricks wi...,-0.9934
2225,Raimi director points for finesse manic breathless style outré black comedy Evil Dead franchise devil caution wind spirit Evil Dead trilogy Army of Darkness third film Raimi-Campbell remake Evil Dead cult sensation audiences films day,-0.9853
2227,Evil Dead Dawn sinister comedy blacker death consistent guffaws eyes blood and guts.Ash Bruce Campbell survivor of THE EVIL DEAD same cabin woods forces,-0.9768
55,line life 'lesser human beings passing unpleasant feelings mental shock 'proper human beings.However Ki-Taek family Mr. Park house Mr. Park family existence 'parasite presence basement room house.With such disillusionment and anger Ki-Taek Mr. Park parasite host parasite dead host body whole carcass nourishment offsprings.The tragedy something exceptional defect or vice Mr. Park Ki-Taek personalities,-0.9726
2818,Often rape reports deal violation scene home thought violation sheer terror victims scene men and women rape men men,-0.9702
2568,close-ups occasion without sound actors facial expressions key moments characters pain fear anger bitter anguish remorse or grief dialogue could.This breathtaking film critical smash box-office hit,-0.9661
2209,Evil Dead Dawn sequel cult classic film Evil Dead film prequel Army of Darkness years,-0.9657
3754,film Spielberg Munich intrigue point dramatic thriller nature national retaliation something Spielberg tries equilibrium sensitive issue attention consciousness Israeli assassins attempts injury innocent girl deplorable suffering Palestinians right of self-determination homeland Everyone story different opinion Munich Israel response murder Olympics Israeli athletes militants Spielberg disputable film Israel retaliation Israeli commando Mossad agent father assassination squad means mastermin...,-0.9652
2226,Evil Dead Dawn gruesome affair lot time Campbell arch delivery wide-eyed crazed performance style Raimi Scott Speigel reinvention Evil Dead premise,-0.9633


Sort the list with the positive sentiments at the top

In [18]:
# Sort the data frame on sentiment score, descending so we get the most positive chunks at the top
df_sent.sort_values(by=['Sentiment_Score'],ascending=False,inplace=True)

# Display the first 10 records
df_sent.head(10)

Unnamed: 0,TermsStr,Sentiment_Score
560,GREAT ESCAPE liberties in tone character portrayal key elements ignorance.BRIDGE ON THE RIVER KWAI great film liberties technical details events THE GREAT ESCAPE .... VON RYAN EXPRESS realistic alternative,0.9782
2542,Throw magnificent screenplay marvelous cinematography plot efficiency Swiss watch added bonus gorgeous actress result huge winner,0.9747
1106,level expectations city city chase scene CIA Bourne several key officers Agency covert operation Bourne beginnings.David Strathairn GOOD NIGHT AND GOOD LUCK Scott Glenn FREEDOM WRITERS star Agency higher-ups attempts Jason Bourne Bourne and someone own department.Female Joan Allen THE ICE STORM Julia Stiles THE OMEN pull okay performances Allen CIA agent superiors Jason Bourne Stiles love interest Jason danger Bourne,0.9739
2466,True link strong films themes pain and loss stories settings amazing attention share strong focus superb performances.Eastwood way honors for Penn Freeman Swank Hackman others Academy,0.9633
1593,great acting great characters great story great dialogue,0.9545
3212,hallmark masterpiece of art film making.It years courage affect on me.This engaging thought provoking loving movie incredible pacing lovable characters don´t sci fi horror movies love story sci fi tones.I years heart,0.9538
2634,truly masterpiece script-wise anything brilliant script excellent actors,0.9382
253,thing Tabu piece of Info Radhika case high profile info hand from Ayushman Tabu whereabouts Raghavan truth time- audience brilliant tale hearts suspense illusion for sure.Raghavan Ayushman cane Rabbit duck handle saying- that Ayushman illusion way presence of Rabbit story.It delightful way audience things original word movie title notice difference something validation,0.9371
2031,Nick Frost great second banana punchline Pegg straight line adolescent way relief sidekick.The film great surprise performance from Timothy Dalton,0.936
224,comedic talent fun film.If fine drama storyline twists little humor great acting and dialog classic film,0.9352


### Question 3 Analysis

Looking at the top ten items from the most negative and most positive lists, we see trends in the individual listings. In the negative list we see many negative words such as failure, fear, and anger. These terms all elicit a negative emotion, so the scores make sense. We can also see that a lot of negative sentiment is detected for the movie series "The Evil Dead"; five of the ten most negative entries mention it. Without context we could simply say reviewers dislike this movie series. However, the negative scores could also arise because these movies sit in the horror genre: many reviews could be positive, but the descriptive words used around horror movies, including the words "evil" and "dead" in the title itself, may unexpectedly turn the sentiment negative.

Looking at the positive list, we see many positive words such as great, fun, and magnificent. This list of positive sentiments doesn't seem to focus on a particular genre or a particular movie. The positive sentiment seems to be spread across many films.
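
One possible reason the "Evil Dead" entries score so low is that a vocabulary-based analyzer sums the valence of every sentiment-bearing token before normalizing, so a title that repeats negative words pushes the score toward -1 regardless of the reviewer's opinion. The sketch below illustrates that effect with a toy lexicon. The valences are hypothetical, and this is a simplified model: the real analyzer also handles negation, intensifiers, capitalization, and punctuation.

```python
import math

# Hypothetical valences for illustration; the real VADER lexicon values may differ
TOY_LEXICON = {"evil": -3.0, "dead": -3.0, "great": 3.0}

def toy_compound(text, alpha=15):
    """Sum toy valences over the tokens and apply VADER-style normalization."""
    total = sum(TOY_LEXICON.get(token, 0.0) for token in text.lower().split())
    return total / math.sqrt(total * total + alpha)

short_chunk = "evil dead sequel"
long_chunk = "evil dead sequel evil dead prequel evil dead trilogy"

# The longer chunk repeats the title, so its score moves closer to -1
for chunk in (short_chunk, long_chunk):
    print("{:.4f}  {}".format(toy_compound(chunk), chunk))
```

The same accumulation works in the positive direction, which would be consistent with a short entry such as "great acting great characters great story great dialogue" ranking near the top of the positive list.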