## DS 7337 - Natural Language Processing

### Author: Brandon Croom

### Homework: 7

### Assignment Criteria:

Cluster the reviews that you collected in homework 5, by doing the following:

1.	In Python, select any one of the clustering methods covered in this course. Run it over the collection of reviews, and show at least two different ways of clustering the reviews, e.g., changing k in k-Means clustering or changing where you “cut” in Agnes or Diana.  

2.	Try to write a short phrase to characterize (give a natural interpretation of) what each cluster is generally centered on semantically. Is this hard to do in some cases? If so, make note of that fact. 

3.	Explain which of the two clustering results from question 1 is preferable (if one of them is), and why. 

NOTE: Code for scraping IMDB website modified from https://shravan-kuchkula.github.io/scrape_imdb_movie_reviews/#step-4-for-each-of-the-movie-reviews-link-get-a-positive-user-review-link-and-a-negative-movie-review-link


In [1]:
# import nltk and other items
import nltk
from nltk.tokenize import word_tokenize 
import numpy as np
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from bs4 import BeautifulSoup
import re
import pandas as pd
import itertools
from nltk.chunk import *
from nltk.chunk.util import *
from nltk.chunk.regexp import *

pd.options.display.max_colwidth=500

# define some static values: the IMDB base URL, the output file name, and (below) the IMDB search URL
base_url = "https://www.imdb.com"
outfile = "DS_7337_NLP_HW5_CROOM.csv"

# API call to select:
#  -feature films
#  -which are rated at least 4.0
#  -having at least 50,000 votes
#  -in the Thriller genre
#  -sorted by user rating
#  -limit to 250 movies
url = '''https://www.imdb.com/search/title/?title_type=feature&user_rating=4.0,10.0
&num_votes=50000,&genres=thriller&view=simple&sort=user_rating,desc&count=250'''

In [2]:
# define helper functions for the work

# Utility function which takes a url and returns a Soup object.
def urlToSoup(url): 
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    return soup

# Utility function to filter out movie tag issues
def filterMovieTags(movieSoup):
    # find all a-tags with class:None
    movie_tags = movieSoup.find_all('a', attrs={'class': None})

    # filter the a-tags to get just the titles
    movie_tags = [tag.attrs['href'] for tag in movie_tags 
                  if tag.attrs['href'].startswith('/title') and tag.attrs['href'].endswith('/')]

    # remove duplicate links
    movie_tags = list(dict.fromkeys(movie_tags))

    return movie_tags

# function to build the list of movie review links
def buildReviewLinks(movie_tags):
    movie_links = [base_url + tag + 'reviews' for tag in movie_tags]
    return movie_links

# Function that returns the index of negative and positive review.
def minMax(a):   
    # get the index of least rated user review
    minpos = a.index(min(a))
    
    # get the index of highest rated user review
    maxpos = a.index(max(a))
    
    return minpos, maxpos

# Function returns a negative and positive review for each movie.
def getMovieReviews(soup):    
    # get a list of user ratings
    user_review_ratings = [tag.previous_element for tag in 
                           soup.find_all('span', attrs={'class': 'point-scale'})]
    
    # find the index of negative and positive review
    n_index, p_index = minMax(list(map(int, user_review_ratings)))
    
    # get the review tags
    user_review_list = soup.find_all('a', attrs={'class':'title'})
    
    # get the negative and positive review tags
    n_review_tag = user_review_list[n_index]
    p_review_tag = user_review_list[p_index]
    
    # return the negative and positive review link
    n_review_link = base_url + n_review_tag['href']
    p_review_link = base_url + p_review_tag['href']
    
    return n_review_link, p_review_link

# Returns the user review text given the review url.
def getMovieReviewText(review_url):   
    # get the review_url's soup
    soup = urlToSoup(review_url)
    
    # find div tags with class text show-more__control
    tag = soup.find('div', attrs={'class': 'text show-more__control'})
    
    return tag.getText()

# Returns the movie title from the review url.
def getMovieTitle(review_url):
    # get the review_url's soup
    soup = urlToSoup(review_url)
    
    # find h1 tag
    tag = soup.find('h1')
    
    return list(tag.children)[1].getText()

# Returns the NP chunked review results
def processReviewText(document):
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [getMovieReviewTags(sent) for sent in sentences]
    return sentences

# Returns the tagged version of the review
def getMovieReviewTags(reviewSentence):	
    nps = []

    # NOTE: For purposes of the exercise I'm treating
    # the grammar variable as a global so that the regular
    # expression patterns can be changed as needed. It must
    # be defined before this function is called.
    cp = nltk.RegexpParser(grammar)
    tree = cp.parse(reviewSentence)

    # loop through the trees produced and pull out only the 
    # NP subtrees
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            t = subtree
            t = ' '.join(word for word, tag in t.leaves())
            nps.append(t)

    return nps

# Returns the k-means clusters for the documents
def getKMeans(reviews, kVal = 5):
    # build a TfidfVectorizer with the English stop words
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(reviews)

    # execute KMeans on the vectorized data
    model = KMeans(n_clusters=kVal, init='k-means++', max_iter=100, n_init=1)
    model.fit(X)

    # print out the top terms per cluster for the user
    print("Top terms per cluster:")
    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    terms = vectorizer.get_feature_names_out()
    
    for i in range(kVal):
        print("Cluster %d:" % i)
        for ind in order_centroids[i, :10]:
            print(' %s' % terms[ind])

    print("\n")
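The `getMovieReviewTags` helper above reads a global `grammar` that is not defined in any of the cells shown here. As a minimal sketch, assuming a simple noun-phrase pattern (the grammar string and the `extract_noun_phrases` name are illustrative assumptions, not taken from the original notebook), the chunking step could look like this:

```python
import nltk

# Assumed grammar (not from the original notebook): an optional determiner,
# any number of adjectives, then one or more nouns form a noun phrase.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"


def extract_noun_phrases(tagged_sentence, grammar):
    """Return the noun phrases found in one POS-tagged sentence."""
    parser = nltk.RegexpParser(grammar)
    tree = parser.parse(tagged_sentence)
    # Keep only the NP subtrees and join their words back into phrases.
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == "NP"]


# A hand-tagged sentence, so the example needs no NLTK data downloads.
tagged = [("the", "DT"), ("dark", "JJ"), ("knight", "NN"), ("is", "VBZ"),
          ("a", "DT"), ("great", "JJ"), ("movie", "NN")]
noun_phrases = extract_noun_phrases(tagged, grammar)
print(noun_phrases)
```

Passing the grammar as an argument instead of reading a global keeps the function usable with several patterns in the same session.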

First, let's connect to IMDB. We pull feature films in the thriller genre that are rated at least 4.0 and have at least 50,000 votes, sorted by user rating and limited to 250 movies.

In [3]:
# API call to select:
## feature films
## which are rated at least 4.0
## having at least 50,000 votes
## in the Thriller genre
## sorted by user rating
## limit to 250 movies
url = '''https://www.imdb.com/search/title/?title_type=feature&user_rating=4.0,10.0
&num_votes=50000,&genres=thriller&view=simple&sort=user_rating,desc&count=250'''

# transform the movie url list to a BeautifulSoup object
movies_soup = urlToSoup(url)

# in looking at the returned list there's some filtering that needs to be done
# filter off tags with none and just cleanup the list in general
movie_tags = filterMovieTags(movies_soup)

# Print out the number of reviews we have and show the first 10 items
print("There are a total of " + str(len(movie_tags)) + " movie user reviews")
print("Displaying 10 user reviews links")
movie_tags[:10]

There are a total of 250 movie user reviews
Displaying 10 user reviews links


['/title/tt0468569/',
 '/title/tt1375666/',
 '/title/tt6751668/',
 '/title/tt0816692/',
 '/title/tt0114369/',
 '/title/tt0102926/',
 '/title/tt7286456/',
 '/title/tt0482571/',
 '/title/tt0407887/',
 '/title/tt0114814/']

Now let's build the list of review-page URLs, one per movie, so that the reviews can be parsed.

In [4]:
# build out the list of reviews
review_links = buildReviewLinks(movie_tags)

print("There are a total of " + str(len(review_links)) + " movie user reviews")
print("Displaying 10 user reviews links")
review_links[:10]

There are a total of 250 movie user reviews
Displaying 10 user reviews links


['https://www.imdb.com/title/tt0468569/reviews',
 'https://www.imdb.com/title/tt1375666/reviews',
 'https://www.imdb.com/title/tt6751668/reviews',
 'https://www.imdb.com/title/tt0816692/reviews',
 'https://www.imdb.com/title/tt0114369/reviews',
 'https://www.imdb.com/title/tt0102926/reviews',
 'https://www.imdb.com/title/tt7286456/reviews',
 'https://www.imdb.com/title/tt0482571/reviews',
 'https://www.imdb.com/title/tt0407887/reviews',
 'https://www.imdb.com/title/tt0114814/reviews']

Now go through each movie's review page and pick out its lowest-rated and highest-rated user review. This gives us two individual reviews per movie, one treated as negative and one as positive.

In [5]:

# get a list of soup objects
movie_soups = [urlToSoup(link) for link in review_links]

# get all movie review links
movie_review_list = [getMovieReviews(movie_soup) for movie_soup in movie_soups]

movie_review_list = list(itertools.chain(*movie_review_list))

# For a check make a listing of the review URLs
print("There are a total of " + str(len(movie_review_list)) + " individual movie reviews")
print("Displaying 10 reviews")
movie_review_list[:10]

There are a total of 500 individual movie reviews
Displaying 10 reviews


['https://www.imdb.com/review/rw2599771/',
 'https://www.imdb.com/review/rw5478826/',
 'https://www.imdb.com/review/rw2286063/',
 'https://www.imdb.com/review/rw4692192/',
 'https://www.imdb.com/review/rw5589331/',
 'https://www.imdb.com/review/rw5195256/',
 'https://www.imdb.com/review/rw3119344/',
 'https://www.imdb.com/review/rw5145037/',
 'https://www.imdb.com/review/rw1136748/',
 'https://www.imdb.com/review/rw0370669/']
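The selection above depends on how `minMax` picks indices. The sketch below restates that logic on made-up ratings (the `min_max_indices` name and the rating lists are hypothetical) to show two edge cases: ties resolve to the first occurrence, and when every rating on the page is equal the same review is returned as both the negative and the positive one.

```python
def min_max_indices(ratings):
    """Return (index of lowest rating, index of highest rating)."""
    lowest = ratings.index(min(ratings))    # first occurrence of the minimum
    highest = ratings.index(max(ratings))   # first occurrence of the maximum
    return lowest, highest


# Hypothetical ratings as they might appear on one review page.
mixed = min_max_indices([10, 7, 3, 10, 3])
all_equal = min_max_indices([10, 10, 10])

print(mixed)      # lowest rating is at index 2, highest at index 0
print(all_equal)  # both indices point at the same review
```

Note that "negative" here only means "lowest rated on the page", so for a well-liked movie the negative review may still read as favorable.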

To make things a bit easier to read, build a pandas data frame. This data frame holds the movie title, the link to the user review, the user review text, and the sentiment of the review.

In [6]:
# get review text from the review link
review_texts = [getMovieReviewText(url) for url in movie_review_list]

# get movie name from the review link
movie_titles = [getMovieTitle(url) for url in movie_review_list]

# label each review with negative or positive
review_sentiment = np.array(['negative', 'positive'] * (len(movie_review_list)//2))

# construct a dataframe
df = pd.DataFrame({'movie': movie_titles, 'user_review_permalink': movie_review_list,
             'user_review': review_texts, 'sentiment': review_sentiment})

# show the dataframe head to show what we return back
df.head()

Unnamed: 0,movie,user_review_permalink,user_review,sentiment
0,The Dark Knight,https://www.imdb.com/review/rw2599771/,"Let's open this review with the fact that Batman is THE BEST superhero ever, with no other even coming close to his levels. So when a sequel is made for an unforgettable movie like ""Batman Begins"" expectations are EXTREMELY high. I think that if you could describe this movie in a few words it would be ""152 minutes of pure awesomeness."" Christopher Nolan probably created the best ever comic book adaptation of a movie. The movie itself has been adapted in a much more realistic fashion than the...",negative
1,The Dark Knight,https://www.imdb.com/review/rw5478826/,"Confidently directed, dark, brooding, and packed with impressive action sequences and a complex story, The Dark Knight includes a career-defining turn from Heath Ledger as well as other Oscar worthy performances, TDK remains not only the best Batman movie, but comic book movie ever created.",positive
2,Inception,https://www.imdb.com/review/rw2286063/,"I have to say to make such an impressive trailer and such an uninteresting film, takes some doing.Here you have most of the elements that would make a very good film. You have great special effects, a sci-fi conundrum, beautiful visuals and good sound. Yet the most important part of the film is missing. There is no plot, character or soul to this film. It's like having a beautiful building on the outside with no paint or decoration on the inside.It's an empty shell of a film. There is no ten...",negative
3,Inception,https://www.imdb.com/review/rw4692192/,"My 3rd time watching this movie! Yet, it still stunned my mind, kept me enjoyed its every moment and left me with many thoughts afterward.\nFor someone like me, who've rarely slept without dream, it's so exciting watching how Christopher Nolan had illustrated every single characteristic of dream on the big screen. As it's been done so sophisticatedly, I do believe the rumour that Nolan had spent 10 years to finish the script of Inception. In my opinion, it's been so far the greatest achievem...",positive
4,Parasite,https://www.imdb.com/review/rw5589331/,"I find it somewhat odd that some people have interpreted this film as anticapitalist. It shines a very narrow light on poverty but it is all conveyed in a very surreal scenario which makes it quite difficult to emotionally engage with the situations that transpire. However, that's not the only reason.The main characters are not particularly likeable, which I'm sure defenders of the film would describe them as morally grey, given the prevailing cultural obsession with the moral ambiguity.It's...",negative
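The sentiment column relies on the order of the flattened link list: each movie contributes a (negative, positive) pair, so labels alternate. A minimal sketch with placeholder links (the `pairs` values are made up for illustration) shows why repeating `['negative', 'positive']` lines up with the flattened list:

```python
import itertools

# Hypothetical (negative_link, positive_link) pairs, one tuple per movie.
pairs = [("neg_a", "pos_a"), ("neg_b", "pos_b"), ("neg_c", "pos_c")]

# Flatten the tuples in order: neg_a, pos_a, neg_b, pos_b, ...
flat_links = list(itertools.chain(*pairs))

# Repeat the label pair once per movie so labels alternate in the same order.
labels = ["negative", "positive"] * (len(flat_links) // 2)

rows = list(zip(flat_links, labels))
for link, label in rows:
    print(link, label)
```

The alignment only holds while every movie contributes exactly two links in that order; dropping or reordering a link would shift every label after it.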


1.	In Python, select any one of the clustering methods covered in this course. Run it over the collection of reviews, and show at least two different ways of clustering the reviews, e.g., changing k in k-Means clustering or changing where you “cut” in Agnes or Diana. 


Run K-Means with 5 clusters, which is the default value of kVal in the getKMeans function.

In [7]:
# Execute the K-Means function on the reviews. We'll initially use the default number of clusters which is 5
getKMeans(review_texts)

Top terms per cluster:
Cluster 0:
 movie
 people
 film
 like
 good
 turkish
 think
 real
 story
 know
Cluster 1:
 film
 movie
 just
 story
 like
 best
 good
 time
 really
 don
Cluster 2:
 film
 hitchcock
 murder
 movie
 films
 best
 time
 just
 man
 grant
Cluster 3:
 movie
 film
 character
 good
 ending
 just
 great
 story
 like
 plot
Cluster 4:
 movie
 action
 film
 movies
 seen
 time
 best
 watch
 ve
 great




In [8]:
# Execute the K-Means function on the reviews, use a number of clusters equal to 10
getKMeans(review_texts,10)

Top terms per cluster:
Cluster 0:
 message
 media
 movie
 fred
 lou
 symbolism
 dead
 true
 film
 evil
Cluster 1:
 movie
 film
 just
 like
 story
 don
 good
 people
 know
 think
Cluster 2:
 film
 movie
 plot
 films
 park
 time
 acting
 really
 oldboy
 way
Cluster 3:
 movie
 good
 film
 10
 fans
 action
 eastwood
 best
 movies
 love
Cluster 4:
 action
 movie
 time
 film
 best
 alien
 fi
 sci
 love
 films
Cluster 5:
 film
 story
 movie
 list
 films
 com
 city
 imdb
 10
 best
Cluster 6:
 film
 movie
 police
 action
 watch
 time
 characters
 character
 end
 great
Cluster 7:
 hitchcock
 grant
 bourne
 cary
 hepburn
 film
 bond
 kennedy
 spy
 hannay
Cluster 8:
 bond
 craig
 daniel
 action
 mission
 007
 james
 movie
 series
 goldfinger
Cluster 9:
 bourne
 cruise
 max
 film
 vincent
 korea
 murder
 action
 spielberg
 insider




In [9]:
# Execute the K-Means function on the reviews, use a number of clusters equal to 20
getKMeans(review_texts,20)

Top terms per cluster:
Cluster 0:
 corrugated
 chevaux
 les
 deux
 shells
 diaboliques
 principal
 lover
 apartment
 school
Cluster 1:
 movie
 imdb
 action
 film
 movies
 just
 time
 good
 10
 best
Cluster 2:
 horror
 film
 dead
 zombies
 mall
 house
 wednesday
 lives
 evil
 concerns
Cluster 3:
 film
 good
 like
 movie
 really
 great
 just
 character
 perfect
 best
Cluster 4:
 bond
 goldfinger
 film
 max
 martial
 arts
 gibson
 teenagerdefinite
 ussexy
 actionmoralitya
Cluster 5:
 film
 movie
 best
 people
 like
 love
 story
 performance
 time
 just
Cluster 6:
 action
 film
 bourne
 man
 police
 movie
 films
 story
 just
 best
Cluster 7:
 film
 just
 time
 life
 way
 movie
 like
 good
 films
 carlito
Cluster 8:
 spanish
 cell
 experiment
 prisoners
 211
 prison
 escape
 alcatraz
 film
 alicia
Cluster 9:
 film
 city
 crowe
 corruption
 best
 character
 confidential
 police
 story
 la
Cluster 10:
 war
 film
 turing
 movie
 donovan
 like
 character
 elizabeth
 just
 alma
Cluster 11:
 park

2. Try to write a short phrase to characterize (give a natural interpretation of) what each cluster is generally centered on semantically. Is this hard to do in some cases? If so, make note of that fact. 

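Looking at the clusters defined above, we see the following. Note that getKMeans does not fix a random seed, so cluster numbers and contents change between runs, and the cluster numbers cited below may not match the output shown above.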

With K=5, the clusters have no clear semantic center. Every cluster is dominated by the words "film" and "movie", along with other generic review vocabulary, so it is hard to give any of them a natural interpretation. The clusters themselves don't make a whole lot of sense.

With K=10, the clusters begin to form around commonalities in specific movies. For example, Cluster 0 appears to pick up terms associated with the movie "Heat", starring Al Pacino and Robert De Niro. Similarly, Cluster 8 appears to be focused on the movie "The Machinist", starring Christian Bale. On the flip side, certain clusters don't look too good. Cluster 9 is one of these: it reads like a random sampling of words.

With K=20, the clusters continue to form around specific movies, and reviews of similar movies also start to be grouped together. For example, Cluster 18 seems to focus on the movie "The Hunt for Red October". There are still some "odd" clusters that don't provide key information. For example, Cluster 14 is just a collection of generic words that could describe any movie.

3.	Explain which of the two clustering results from question 1 is preferable (if one of them is), and why.

Of the clustering runs that were performed (K=5, K=10, K=20), the higher values of K are preferable. As described in question 2, the runs with K greater than 5 came closer to grouping the reviews by specific topic, and in some cases by specific movie. This provides a richer analysis that is easier to explain than the K=5 result, where the clusters didn't really have much meaning.
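The preference above is based on reading the top terms. As a complementary check, not part of the original analysis, a silhouette score can put a rough number on how well separated the clusters are for each K. The sketch below assumes a helper named `silhouette_by_k` and uses toy reviews in place of `review_texts`.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def silhouette_by_k(documents, k_values, seed=0):
    """Return {k: silhouette score} for K-Means on TF-IDF vectors."""
    X = TfidfVectorizer(stop_words="english").fit_transform(documents)
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(X)
        # Scores range from -1 to 1; higher means tighter, better separated clusters.
        scores[k] = float(silhouette_score(X, labels))
    return scores


# Toy reviews standing in for review_texts.
toy_reviews = [
    "bond spy mission agent",
    "spy agent bond gadget",
    "zombie horror mall scream",
    "horror zombie scream night",
    "prison escape alcatraz cell",
    "cell prison escape guard",
]
scores = silhouette_by_k(toy_reviews, k_values=(2, 3))
for k, score in scores.items():
    print("K=%d silhouette=%.3f" % (k, score))
```

A score like this measures geometric separation only, so it is best read alongside the top-term lists rather than as a replacement for interpreting them.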