# MSDS 7337 Homework 5

Author: Nathan Wall

Date: 7/2/2019

This notebook works through the questions for Homework 5, with comments along the way.

Notebook Sections:
- [Q1: Collect User Review Permalinks](#q1)
- [Q2: Extract Noun Phrase Chunks](#q2)
- [Q3: Output all Chunks & Summary](#q3)

In [15]:
import pandas as pd
import numpy as np
import requests
import json
import nltk
from bs4 import BeautifulSoup

## Q1: Collect User Review Links from IMDB
<a id='q1'></a>

Below we set up a method to scrape user reviews from IMDB's website. Per the question, we make sure the following criteria are met:

1) All the reviews are from the same genre

2) Capture reviews from several movies

3) Collect both positive & negative reviews

First, let's get the list of movies from the genre of interest.

In [2]:
def getMovies(genre):
    page = requests.get('https://www.imdb.com/search/title/?genres='+genre) 
    soup = BeautifulSoup(page.content, 'html.parser')
    #find the title headers; each one holds the link to a title page
    soupTitle = soup.find_all('h3', {"class": "lister-item-header"})

    links = []
    titles = []
    for s in soupTitle:
        links.append('https://www.imdb.com' + s.find('a').get('href') + 'reviews')
        titles.append(s.find('a').text)

    title_df = pd.DataFrame(list(zip(titles, links)), columns = ['title', 'link'])
    return title_df
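The link construction in getMovies() relies on each scraped href ending in a slash, so that appending 'reviews' yields a valid path. As a hedged alternative, the standard library's `urljoin` can resolve the href and also drop any query string it carries. The helper name `build_reviews_url` and the `?ref_=example` query string are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import urljoin

IMDB_BASE = "https://www.imdb.com"

def build_reviews_url(href, base=IMDB_BASE):
    """Turn a title href scraped from a listing page into its reviews URL.

    Assumes the path part of the href ends with '/', as in '/title/tt5753856/'.
    """
    title_url = urljoin(base, href)       # resolve the relative href against the site root
    return urljoin(title_url, "reviews")  # resolve 'reviews' under the title path; any query string is dropped

print(build_reviews_url("/title/tt5753856/"))
print(build_reviews_url("/title/tt5753856/?ref_=example"))
```

Both calls print the same reviews URL, which matches the form of the links shown in the table for "Dark".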

In [3]:
genre = 'crime'
title_df = getMovies(genre)

In [4]:
title_df.head()

Unnamed: 0,title,link
0,Dark,https://www.imdb.com/title/tt5753856/reviews
1,Murder Mystery,https://www.imdb.com/title/tt1618434/reviews
2,Big Little Lies,https://www.imdb.com/title/tt3920596/reviews
3,Lucifer,https://www.imdb.com/title/tt4052886/reviews
4,Jessica Jones,https://www.imdb.com/title/tt2357547/reviews


Using the links to the user review pages extracted above, we scrape all the user review permalinks for each movie & TV show.

In [5]:
def getReviewLinks(link):
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    reviewPage = soup.find_all('a', {"class": "title"})
    reviewLinks = []
    for r in reviewPage:
        reviewLinks.append('https://www.imdb.com' + r.get('href'))

    return reviewLinks

In [8]:
reviews = []
for link in title_df['link']:
    reviews.append(getReviewLinks(link))

title_df["reviewLinks"] = reviews

In [9]:
print("Scraped review links for {m} Movie & TV Shows for the {g} genre".format(m=len(title_df), g=genre))

Scraped review links for 50 Movie & TV Shows for the crime genre


In [10]:
allReviews = title_df.reviewLinks.apply(pd.Series) \
    .merge(title_df, right_index = True, left_index = True) \
    .drop(["reviewLinks"], axis = 1) \
    .melt(id_vars = ['title', 'link'], value_name = "reviewLink") \
    .drop("variable", axis = 1) \
    .dropna()
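The chain above turns the list-valued reviewLinks column into one row per review link. A minimal sketch of the same reshaping on toy data follows, using `DataFrame.explode` (available in pandas 0.25 and later) instead of `apply(pd.Series)` plus `melt`. The toy titles and links are made up for illustration, and this is an alternative technique rather than the notebook's own method.

```python
import pandas as pd

# Toy stand-in for title_df: one row per title, with a list of review links
toy = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "link": ["link-a", "link-b", "link-c"],
    "reviewLinks": [["r1", "r2"], ["r3"], []],
})

flat = (
    toy.explode("reviewLinks")                         # one row per list element; empty lists become NaN
       .rename(columns={"reviewLinks": "reviewLink"})
       .dropna(subset=["reviewLink"])                  # drop titles that had no review links
       .reset_index(drop=True)
)

print(flat)
```

One difference to keep in mind: `explode` keeps each title's reviews together, whereas the `melt` version orders rows by review position (the first review of every title, then the second, and so on).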

In [11]:
print("{r} Total Reviews for Movie & TV Shows from the {g} genre".format(r=len(allReviews), g=genre))

1219 Total Reviews for Movie & TV Shows from the crime genre


This ends the necessary steps for question 1. We now have a dataframe of all the user review links for the top 50 movies & TV shows from the crime genre. However, the function is set up to be able to perform this task on any of the genres defined on the IMDB website, https://www.imdb.com/feature/genre/.

In the next section we will pull out the actual review text.

## Q2: Extract User Review Noun Phrases (NP)
<a id='q2'></a>

Using the review permalinks extracted above, we pull out the review text.

In [12]:
def getReview(link):
    page = requests.get(link) 
    soup = BeautifulSoup(page.content, 'html.parser')
    review = soup.find_all('script', {"type": "application/ld+json"})
    reviewJSON = json.loads(review[0].get_text())
    try:
        reviewText = reviewJSON['reviewBody']
    except KeyError:
        # no review body in the JSON; fall back to the review title
        reviewText = reviewJSON['name']
    return reviewText
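The fallback logic in getReview() can be tried offline on hand-written JSON. The sketch below assumes the embedded JSON has the same 'reviewBody' and 'name' keys the function reads; the two sample payloads and the helper name `review_text_from_json` are made up for illustration. Note that `dict.get` with `or` also falls back when 'reviewBody' is present but empty, which is slightly broader than catching a missing key.

```python
import json

def review_text_from_json(raw):
    """Return the review body from a JSON string, or the review title if the body is missing or empty."""
    data = json.loads(raw)
    return data.get("reviewBody") or data.get("name")

# Hypothetical payloads shaped like the JSON embedded in a review page
sample_full = '{"@type": "Review", "name": "Great show", "reviewBody": "Insanely good."}'
sample_short = '{"@type": "Review", "name": "Great show"}'

print(review_text_from_json(sample_full))
print(review_text_from_json(sample_short))
```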

In [13]:
reviewText = []
for link in allReviews["reviewLink"]:
    reviewText.append(getReview(link))

allReviews["reviewText"] = reviewText

In [14]:
allReviews.head()

Unnamed: 0,title,link,reviewLink,reviewText
0,Dark,https://www.imdb.com/title/tt5753856/reviews,https://www.imdb.com/review/rw4949287/,"Insanely good, every episode shocks you in way..."
1,Murder Mystery,https://www.imdb.com/title/tt1618434/reviews,https://www.imdb.com/review/rw4934621/,A nice easy breezy murder mystery. Full of fun...
2,Big Little Lies,https://www.imdb.com/title/tt3920596/reviews,https://www.imdb.com/review/rw3649582/,Exquisite. 'Big Little Lies' takes us to an in...
3,Lucifer,https://www.imdb.com/title/tt4052886/reviews,https://www.imdb.com/review/rw4838256/,Thank Luci himself that fox cancelled this sho...
4,Jessica Jones,https://www.imdb.com/title/tt2357547/reviews,https://www.imdb.com/review/rw4086289/,"This review pains me to write, because I genui..."


We now have all the reviews for the movies & TV shows from our genre. We will run these through a chunker to extract all of the noun phrases.

For this initial process we will use the rudimentary rule-based chunker from the NLTK book, although we will likely use spaCy or something more intelligent in future assignments.

In [17]:
#define the regex rule for noun phrases
grammar = ('''
    NP: {<DT>?<JJ>*<NN>} # NP
    ''')
#create chunker
chunkParser = nltk.RegexpParser(grammar)
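To make the rule concrete, here is a dependency-free sketch of what the pattern `<DT>?<JJ>*<NN>` accepts: an optional determiner, any number of adjectives, then exactly one singular common noun. It is an illustration of the pattern only, not NLTK's implementation, and the part-of-speech tags in the sample sentence are assigned by hand rather than produced by `nltk.pos_tag`.

```python
def chunk_np(tagged):
    """Return the token runs matching DT? JJ* NN, scanning left to right."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "DT":      # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":   # any number of adjectives
            j += 1
        if j < n and tagged[j][1] == "NN":      # required singular common noun
            chunks.append(tagged[i:j + 1])
            i = j + 1
        else:
            i += 1                              # no chunk starts here; move on
    return chunks

# Hand-tagged sentence (the tags are an assumption for illustration)
sentence = [("This", "DT"), ("show", "NN"), ("is", "VBZ"), ("the", "DT"),
            ("best", "JJS"), ("thing", "NN"), ("Netflix", "NNP"),
            ("has", "VBZ"), ("done", "VBN")]

for chunk in chunk_np(sentence):
    print(" ".join(word for word, tag in chunk))
```

Because the rule matches the tags `JJ` and `NN` exactly, a superlative (`JJS`), a plural noun (`NNS`), or a proper noun (`NNP`) is not picked up. That is why "the best thing" is reduced to "thing" here, and why "Netflix" is not chunked.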

In [23]:
def chunkReview(review):
    sent = nltk.sent_tokenize(review)
    NP = []
    for s in sent:
        tagged = nltk.pos_tag(nltk.word_tokenize(s))
        tree = chunkParser.parse(tagged)
        for subtree in tree.subtrees():
            if subtree.label() == "NP":
                NP.append(subtree)
    return NP

In [26]:
allReviews['NP'] = allReviews["reviewText"].apply(chunkReview)
allReviews.head()

Unnamed: 0,title,link,reviewLink,reviewText,NP
0,Dark,https://www.imdb.com/title/tt5753856/reviews,https://www.imdb.com/review/rw4949287/,"Insanely good, every episode shocks you in way...","[[(every, DT), (episode, NN)], [(The, DT), (co..."
1,Murder Mystery,https://www.imdb.com/title/tt1618434/reviews,https://www.imdb.com/review/rw4934621/,A nice easy breezy murder mystery. Full of fun...,"[[(A, DT), (nice, JJ), (easy, JJ), (breezy, NN..."
2,Big Little Lies,https://www.imdb.com/title/tt3920596/reviews,https://www.imdb.com/review/rw3649582/,Exquisite. 'Big Little Lies' takes us to an in...,"[[(an, DT), (incredible, JJ), (journey, NN)], ..."
3,Lucifer,https://www.imdb.com/title/tt4052886/reviews,https://www.imdb.com/review/rw4838256/,Thank Luci himself that fox cancelled this sho...,"[[(fox, NN)], [(this, DT), (show, NN)], [(job,..."
4,Jessica Jones,https://www.imdb.com/title/tt2357547/reviews,https://www.imdb.com/review/rw4086289/,"This review pains me to write, because I genui...","[[(This, DT), (review, NN)], [(the, DT), (pinn..."


We now have the Noun Phrases for all the reviews in our sample. Below we show an example of several chunked reviews & corresponding noun phrases specific to the show "Dark" on Netflix.

In [48]:
dark = allReviews[allReviews['title'] == "Dark"]
for index, row in dark.head(3).iterrows():
    print('----------------------------------')
    print('Review Text')
    print('----------------------------------')
    print(row['reviewText'])
    print('----------------------------------')
    print('Noun Phrases')
    print('----------------------------------')
    for phrase in row['NP']:  # avoid the name np, which would shadow the numpy import
        print(phrase)

----------------------------------
Review Text
----------------------------------
Insanely good, every episode shocks you in ways you never thought was possible. The constant gripping revelations were so unexpected but tied the story so well together. Cannot wait for the next season
----------------------------------
Noun Phrases
----------------------------------
(NP every/DT episode/NN)
(NP The/DT constant/JJ gripping/NN)
(NP the/DT story/NN)
(NP the/DT next/JJ season/NN)
----------------------------------
Review Text
----------------------------------
This show is the best thing Netflix has done. It is an absolute masterpiece of story telling.
----------------------------------
Noun Phrases
----------------------------------
(NP This/DT show/NN)
(NP thing/NN)
(NP an/DT absolute/JJ masterpiece/NN)
(NP story/NN)
(NP telling/NN)
----------------------------------
Review Text
----------------------------------
One of the most successful series I've seen in recent years!
----------------

## Q3: Output all the chunks in a single list for each review
<a id='q3'></a>

Below we output the results for everything after a brief review of what we have done for this task.

For this project we selected movies & TV shows from the "crime" genre. However, this code can easily be used to pull any of the genres on IMDB's site using the first function, getMovies(). We then scraped the links to the user review pages for the top 50 movies & TV shows in that genre. Using each main user review page, we were able to extract the permalinks for the first 50 reviews of the corresponding movie or TV show. If fewer than 50 reviews were available, then all the review links were stored. Once we had all the review links, we were able to parse those pages & extract the actual review text from the JSON embedded in the HTML.

Once we had the review text from the 1219 reviews of 50 different movies & television shows, we were able to chunk the noun phrases in that text. The chunker we implemented was a simple rule-based regular expression chunker. After tokenizing the text and tagging it with parts of speech, we defined our NP chunk as a sequence that contains an optional determiner, then any number of adjectives, and finally a noun. The first two parts of the chunk are optional.

In [49]:
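# to_json returns a JSON string; parse it back into Python objects so json.dump does not encode it a second time
finalJSON = json.loads(allReviews[['title','reviewText','NP']].to_json(orient='records'))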

In [50]:
with open('/home/newall/MSDS/MSDS7337/HW5/msds7337_nwall_reviews.json', 'w', encoding='utf-8') as outfile:
    json.dump(finalJSON, outfile, ensure_ascii=False, indent=2)

The final JSON file for all of these reviews is almost 2 MB and will be submitted with the final file.
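If a more compact output is wanted, each chunk can be reduced to a plain string before writing, so that every review carries a single flat list of phrases. The sketch below assumes each chunk is a sequence of (word, tag) pairs, as in the printed NP output; the helper name `chunk_to_text` and the hand-written record are illustrative and not part of the submitted file.

```python
import json

def chunk_to_text(chunk):
    """Join the words of one chunk, given as (word, tag) pairs, into a plain string."""
    return " ".join(word for word, _tag in chunk)

# Hand-written record mirroring part of the first "Dark" review shown earlier
record = {
    "title": "Dark",
    "NP": [
        [("every", "DT"), ("episode", "NN")],
        [("the", "DT"), ("story", "NN")],
        [("the", "DT"), ("next", "JJ"), ("season", "NN")],
    ],
}

flat_record = {
    "title": record["title"],
    "NP": [chunk_to_text(chunk) for chunk in record["NP"]],
}

print(json.dumps(flat_record, indent=2))
```

This drops the part-of-speech tags, so it suits a summary output rather than a replacement for the tagged chunks.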