## DS 7337 - Natural Language Processing

### Author: Brandon Croom

### Homework: 5

### Assignment Criteria:

1.	Compile a list of static links (permalinks) to individual user movie reviews from one particular website. This will be your working dataset for this assignment, as well as for assignments 7 and 8, which together will make up your semester project.   
* It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links. Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.
* Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.
* Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.  
* Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews. 

2.	Extract noun phrase (NP) chunks from your reviews using the following procedure:
* In Python, use BeautifulSoup to grab the main review text from each link.  
* Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser. 
* You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.

3.	Output all the chunks in a single list for each review, and submit that output for this assignment. Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews, how many you collected, and by what means).

NOTE: Code for scraping IMDB website modified from https://shravan-kuchkula.github.io/scrape_imdb_movie_reviews/#step-4-for-each-of-the-movie-reviews-link-get-a-positive-user-review-link-and-a-negative-movie-review-link


In [1]:
# import nltk and other items
import nltk
from nltk.tokenize import word_tokenize 
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import itertools
from nltk.chunk import *
from nltk.chunk.util import *
from nltk.chunk.regexp import *

pd.options.display.max_colwidth=500

# define some static values: our source of information (IMDB), our output file and the url for our IMDB search
base_url = "https://www.imdb.com"
outfile = "DS_7337_NLP_HW5_CROOM.csv"

# API call to select:
#  -feature films
#  -which are rated at least 4.0
#  -having at least 50,000 votes
#  -in the Thriller genre
#  -sorted by user rating
#  -limit to 250 movies
# (built from adjacent string literals so the URL contains no embedded newline)
url = ("https://www.imdb.com/search/title/?title_type=feature&user_rating=4.0,10.0"
       "&num_votes=50000,&genres=thriller&view=simple&sort=user_rating,desc&count=250")

In [2]:
# define helper functions for the work

# Utility function which takes a url and returns a Soup object.
def urlToSoup(url): 
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    return soup

# Utility function that extracts the unique title links from the search results page
def filterMovieTags(movieSoup):
    # find all a-tags with class:None
    movie_tags = movieSoup.find_all('a', attrs={'class': None})

    # keep only the hrefs that point at a title page, e.g. '/title/tt0468569/'
    # (tag.get avoids a KeyError on a-tags that have no href attribute)
    hrefs = (tag.get('href', '') for tag in movie_tags)
    movie_tags = [href for href in hrefs
                  if href.startswith('/title') and href.endswith('/')]

    # remove duplicate links
    movie_tags = list(dict.fromkeys(movie_tags))

    return movie_tags

# function to build the list of movie review links
def buildReviewLinks(movie_tags):
    movie_links = [base_url + tag + 'reviews' for tag in movie_tags]
    return movie_links

# Function that returns the index of negative and positive review.
def minMax(a):   
    # get the index of least rated user review
    minpos = a.index(min(a))
    
    # get the index of highest rated user review
    maxpos = a.index(max(a))
    
    return minpos, maxpos

# Function returns a negative and positive review for each movie.
def getMovieReviews(soup):    
    # get a list of user ratings
    user_review_ratings = [tag.previous_element for tag in 
                           soup.find_all('span', attrs={'class': 'point-scale'})]
    
    # find the index of negative and positive review
    n_index, p_index = minMax(list(map(int, user_review_ratings)))
    
    # get the review tags
    user_review_list = soup.find_all('a', attrs={'class':'title'})
    
    # get the negative and positive review tags
    n_review_tag = user_review_list[n_index]
    p_review_tag = user_review_list[p_index]
    
    # return the negative and positive review link
    n_review_link = base_url + n_review_tag['href']
    p_review_link = base_url + p_review_tag['href']
    
    return n_review_link, p_review_link

# Returns the user review text given the review url.
def getMovieReviewText(review_url):   
    # get the review_url's soup
    soup = urlToSoup(review_url)
    
    # find div tags with class text show-more__control
    tag = soup.find('div', attrs={'class': 'text show-more__control'})
    
    return tag.getText()

# Returns the movie title from the review url.
def getMovieTitle(review_url):
    # get the review_url's soup
    soup = urlToSoup(review_url)
    
    # find h1 tag
    tag = soup.find('h1')
    
    return list(tag.children)[1].getText()

# Returns the NP chunked review results
def processReviewText(document):
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [getMovieReviewTags(sent) for sent in sentences]
    return sentences

# Returns the list of NP chunk strings found in one POS-tagged sentence
def getMovieReviewTags(reviewSentence):
    nps = []

    # NOTE: For purposes of the exercise the grammar variable is treated
    # as a global so that the regular expression patterns can be
    # changed as needed between runs
    cp = nltk.RegexpParser(grammar)
    tree = cp.parse(reviewSentence)

    # loop through the trees produced and pull out only the 
    # NP subtrees
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            t = subtree
            t = ' '.join(word for word, tag in t.leaves())
            nps.append(t)

    return nps



1.	Compile a list of static links (permalinks) to individual user movie reviews from one particular website. This will be your working dataset for this assignment, as well as for assignments 7 and 8, which together will make up your semester project.   
* It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links. Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.
* Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.
* Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.  
* Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews.

First, let's connect to IMDB. We're going to pull highly rated feature films in the thriller genre, limiting the results to 250 movies.

In [3]:
# API call to select:
## feature films
## which are rated at least 4.0
## having at least 50,000 votes
## in the Thriller genre
## sorted by user rating
## limit to 250 movies
# (built from adjacent string literals so the URL contains no embedded newline)
url = ("https://www.imdb.com/search/title/?title_type=feature&user_rating=4.0,10.0"
       "&num_votes=50000,&genres=thriller&view=simple&sort=user_rating,desc&count=250")

# transform the movie url list to a BeautifulSoup object
movies_soup = urlToSoup(url)

# in looking at the returned list there's some filtering that needs to be done
# filter off tags with none and just cleanup the list in general
movie_tags = filterMovieTags(movies_soup)

# Print out the number of reviews we have and show the first 10 items
print("There are a total of " + str(len(movie_tags)) + " movie user reviews")
print("Displaying 10 user reviews links")
movie_tags[:10]

There are a total of 250 movie user reviews
Displaying 10 user reviews links


['/title/tt0468569/',
 '/title/tt1375666/',
 '/title/tt6751668/',
 '/title/tt0816692/',
 '/title/tt0114369/',
 '/title/tt0102926/',
 '/title/tt7286456/',
 '/title/tt0482571/',
 '/title/tt0407887/',
 '/title/tt0114814/']

Now let's build out the list of review page URLs, one per movie, so that we can parse the reviews.

In [4]:
# build out the list of reviews
review_links = buildReviewLinks(movie_tags)

print("There are a total of " + str(len(review_links)) + " movie user reviews")
print("Displaying 10 user reviews links")
review_links[:10]

There are a total of 250 movie user reviews
Displaying 10 user reviews links


['https://www.imdb.com/title/tt0468569/reviews',
 'https://www.imdb.com/title/tt1375666/reviews',
 'https://www.imdb.com/title/tt6751668/reviews',
 'https://www.imdb.com/title/tt0816692/reviews',
 'https://www.imdb.com/title/tt0114369/reviews',
 'https://www.imdb.com/title/tt0102926/reviews',
 'https://www.imdb.com/title/tt7286456/reviews',
 'https://www.imdb.com/title/tt0482571/reviews',
 'https://www.imdb.com/title/tt0407887/reviews',
 'https://www.imdb.com/title/tt0114814/reviews']

Now go through each movie's review page and pick out one negative and one positive review (the lowest- and highest-rated reviews on the page). This will give us two individual review permalinks for each movie.

In [5]:

# get a list of soup objects
movie_soups = [urlToSoup(link) for link in review_links]

# get all movie review links
movie_review_list = [getMovieReviews(movie_soup) for movie_soup in movie_soups]

movie_review_list = list(itertools.chain(*movie_review_list))

# For a check make a listing of the review URLs
print("There are a total of " + str(len(movie_review_list)) + " individual movie reviews")
print("Displaying 10 reviews")
movie_review_list[:10]

ProxyError: HTTPSConnectionPool(host='www.imdb.com', port=443): Max retries exceeded with url: /title/tt0114746/reviews (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001C87DBDE760>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond')))
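
The traceback above comes from a single failed connection aborting the whole list comprehension, which is also why the later cells raise `NameError`. A minimal sketch of a more forgiving fetch loop follows. It assumes two hypothetical helpers, `fetch_with_retries` and `fetch_all`, that are not part of the original notebook; they wrap any fetch callable (for example `urlToSoup`), retry on exceptions with a growing pause, and skip a URL rather than raise when every attempt fails.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying when it raises.

    Hypothetical helper (not in the original notebook). Returns the
    fetch result, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:  # in practice, narrow this to the network errors you expect
            print(f"attempt {attempt}/{attempts} failed for {url}: {exc}")
            if attempt < attempts:
                # wait 1x, 2x, 4x ... the base delay between attempts
                sleep(base_delay * 2 ** (attempt - 1))
    return None


def fetch_all(fetch, urls, **kwargs):
    """Fetch every URL and keep only the ones that succeeded, keyed by URL."""
    results = {}
    for url in urls:
        result = fetch_with_retries(fetch, url, **kwargs)
        if result is not None:
            results[url] = result
    return results
```

Under these assumptions, `fetch_all(urlToSoup, review_links)` would return a dictionary of the pages that could be retrieved, so one unreachable page no longer stops the run. Passing a `timeout` argument to `requests.get` inside `urlToSoup` would additionally keep a stalled connection from hanging indefinitely.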

To make things a bit easier to read, build a pandas data frame. This data frame will house the movie title, the link to the user review, the user review text and the sentiment of the review.

In [6]:
# get review text from the review link
review_texts = [getMovieReviewText(url) for url in movie_review_list]

# get movie name from the review link
movie_titles = [getMovieTitle(url) for url in movie_review_list]

# label each review with negative or positive
review_sentiment = np.array(['negative', 'positive'] * (len(movie_review_list)//2))

# construct a dataframe
df = pd.DataFrame({'movie': movie_titles, 'user_review_permalink': movie_review_list,
             'user_review': review_texts, 'sentiment': review_sentiment})

# show the dataframe head to show what we return back
df.head()

NameError: name 'movie_review_list' is not defined
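
The sentiment labels in the cell above are not read from the page; they rely on ordering. `getMovieReviews` returns a (negative, positive) pair per movie and `itertools.chain` flattens the pairs in order, so an alternating `['negative', 'positive']` pattern lines up with the links. The sketch below illustrates this with made-up ratings and placeholder links (all values are hypothetical), and also shows an edge case: when every review on a page has the same rating, the same review is picked for both labels.

```python
import itertools


def min_max(ratings):
    """Indices of the lowest and highest rating (first occurrence wins)."""
    return ratings.index(min(ratings)), ratings.index(max(ratings))


# Made-up ratings and placeholder links for two hypothetical movies.
pages = [
    {"ratings": [9, 2, 7], "links": ["a1", "a2", "a3"]},
    {"ratings": [5, 5, 5], "links": ["b1", "b2", "b3"]},
]

pairs = []
for page in pages:
    n_index, p_index = min_max(page["ratings"])
    pairs.append((page["links"][n_index], page["links"][p_index]))

# Flatten [(neg, pos), (neg, pos)] into [neg, pos, neg, pos]
flat = list(itertools.chain(*pairs))
labels = ["negative", "positive"] * (len(flat) // 2)

for link, label in zip(flat, labels):
    print(link, label)
```

For the second hypothetical movie the link `b1` appears under both labels, so a check that the two indices differ may be worth adding before treating a pair as one negative and one positive review.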

2.	Extract noun phrase (NP) chunks from your reviews using the following procedure:
* In Python, use BeautifulSoup to grab the main review text from each link.  
* Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser. 
* You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.


Now let's run the reviews through our processReviewText function. This function splits each review into sentences, tokenizes and POS-tags them, runs them through the NP chunker and then returns only the text of the NP subtrees. The initial grammar looks for three patterns: an optional determiner or possessive followed by any adjectives and a noun, sequences of proper nouns, and two nouns in a row. We'll add the results to a new column called proc_user_review.

In [7]:
# initial grammar: an optional determiner/possessive pronoun, any adjectives and a noun;
# a sequence of proper nouns; or two consecutive nouns.
# nltk.pos_tag emits Penn Treebank tags, where possessive pronouns are PRP$ (PP$ is the Brown corpus tag)
grammar = r"""
     NP: {<DT|PRP\$>?<JJ>*<NN>}
         {<NNP>+}
         {<NN><NN>}
    """

df['proc_user_review'] = df['user_review'].apply(processReviewText)
df.head()

NameError: name 'df' is not defined
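
Because the cell above did not run, here is a small sketch of what this kind of grammar produces. It assumes hand-tagged sentences (so no tagger model has to be downloaded), uses the Penn Treebank tag `PRP$` for possessive pronouns, and introduces a hypothetical helper `np_chunks` that mirrors the logic of `getMovieReviewTags` but takes the grammar as a parameter. The example sentences are invented for illustration.

```python
import nltk

# Same three NP patterns as the initial grammar, with Penn Treebank tags.
grammar = r"""
    NP: {<DT|PRP\$>?<JJ>*<NN>}
        {<NNP>+}
        {<NN><NN>}
"""


def np_chunks(tagged_sentence, grammar):
    """Return the text of every NP chunk in one POS-tagged sentence."""
    parser = nltk.RegexpParser(grammar)
    tree = parser.parse(tagged_sentence)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == "NP"]


# Hand-tagged, made-up sentences (tags are assumptions, not tagger output).
first = [("Christopher", "NNP"), ("Nolan", "NNP"), ("directed", "VBD"),
         ("the", "DT"), ("dark", "JJ"), ("thriller", "NN")]
second = [("his", "PRP$"), ("plan", "NN"), ("failed", "VBD")]

print(np_chunks(first, grammar))   # ['Christopher Nolan', 'the dark thriller']
print(np_chunks(second, grammar))  # ['his plan']
```

Passing the grammar as an argument instead of reading a global makes it easier to compare two grammars on the same sentence side by side.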

Now we'll change up the grammar value to add additional checks to see if we can get better or worse results. We'll create a new column in the data frame called proc_user_review2

In [8]:
grammar = """
    NP:    {<DT><WP><VBP>*<RB>*<VBN><IN><NN>}
           {<NN|NNS|NNP|NNPS><IN>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS><CC>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS>+}
    """   
df['proc_user_review2'] = df['user_review'].apply(processReviewText)
df.head()

NameError: name 'df' is not defined
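
The assignment also asks that the main names relevant to the movies be added to the working lexicon before re-running the chunker. A minimal sketch of one way to approach that follows, assuming a hypothetical helper `apply_name_lexicon` and a hand-picked set of names (neither appears in the original notebook). It retags any token found in the name set as a proper noun (`NNP`) after POS tagging, so that the `<NNP>` patterns in the grammars can pick it up. The tags in the example are invented to imitate a tagger that mislabels unfamiliar names.

```python
def apply_name_lexicon(tagged_sentence, names):
    """Retag tokens that appear in `names` as proper nouns (NNP).

    Hypothetical helper: `names` is a set of character, actor or place
    names collected by hand for the movies in the dataset.
    """
    return [(word, "NNP") if word in names else (word, tag)
            for word, tag in tagged_sentence]


# Illustrative name set and deliberately wrong tags (both are assumptions).
movie_names = {"Arthur", "Fleck", "Gotham"}
tagged = [("Arthur", "NNP"), ("Fleck", "VB"), ("scares", "VBZ"), ("Gotham", "NN")]

retagged = apply_name_lexicon(tagged, movie_names)
print(retagged)
```

The retagged sentence could then be handed to `getMovieReviewTags` in place of the raw `nltk.pos_tag` output.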

In [9]:
df.to_csv(outfile)

NameError: name 'df' is not defined

3.	Output all the chunks in a single list for each review, and submit that output for this assignment. Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews, how many you collected, and by what means).

### Analysis Results:

For this homework the following tasks were performed:
* Using IMDB as the source, we scraped a list of 250 movies from which to gather reviews. Because the list comes from a live search sorted by user rating, it could change over time.
* The movies were selected from the thriller genre. This genre provided a good selection of reviews and movies.
* For each movie we gathered one positive and one negative review based on the user-provided rating value. From this we were able to pull 500 total reviews to analyze.
* Each of the reviews was then run through an NP chunker using shallow parsing. We used the regular expression parser and ran each review through twice, with two grammars of differing strictness.

The parser results were interesting:
* For the most part, both parsers captured NP chunks reasonably well.
* Both parsers had issues where users did not correctly break sentences. We see some instances of periods not being followed by spaces, which caused the parsers to collapse words together. A pre-processing step that ensured more correctly formed sentences might have been beneficial.
* The stricter parser did remove determiners from the chunks. This may or may not be a good thing depending upon future use cases. In general, for this exercise it did not have a large impact on understandability.
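
The pre-processing step mentioned above could be sketched as follows. This is a heuristic under stated assumptions, not something that was run for this homework: the hypothetical helper `fix_sentence_spacing` inserts a space after `.`, `!` or `?` only when the punctuation sits between a lowercase letter and an uppercase letter, which leaves decimals such as `3.5` untouched.

```python
import re

# Sentence-ending punctuation glued to the next capitalised word, e.g. "good.The"
_MISSING_SPACE = re.compile(r"(?<=[a-z])([.!?])(?=[A-Z])")


def fix_sentence_spacing(text):
    """Insert the missing space after sentence-ending punctuation."""
    return _MISSING_SPACE.sub(r"\1 ", text)


print(fix_sentence_spacing("The plot was good.The ending was not."))
# The plot was good. The ending was not.
```

Applying this to each review before `nltk.sent_tokenize` would give the sentence splitter and tokenizer cleaner boundaries, at the cost of occasionally splitting an unusual token that follows the same lowercase, period, uppercase shape.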