# Scraping Rotten Tomatoes for User Reviews of _The Host_

In this notebook, I scrape user reviews of _The Host_ from [Rotten Tomatoes](https://www.rottentomatoes.com/m/the_host_2007/reviews?type=user&intcmp=rt-scorecard_audience-score-reviews). After scraping, I filter the reviews by date to keep only those posted between 2007 and 2009 (the reasoning is explained further down).

In [1]:
import requests
from bs4 import BeautifulSoup
import time
import json
from nltk.corpus import stopwords
import random
import re
import os
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from nltk import tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

from collections import Counter

In [2]:
characters_to_strip = '().[]!,"'

In [3]:
%run functions.ipynb

In [4]:
url = 'https://www.rottentomatoes.com/m/the_host_2007/reviews?type=user&intcmp=rt-scorecard_audience-score-reviews'

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}


In [5]:
resp = requests.get(url, headers=headers)

In [6]:
html = resp.text

In [7]:
doc = BeautifulSoup(html, 'html.parser')

In [8]:
reviews = doc.find_all('li', attrs={'class':'audience-reviews__item'})

In [9]:
len(reviews)

10

In [10]:
reviews[0]

<li class="audience-reviews__item" data-qa="review-item">
<div class="audience-reviews__user-wrap">
<a href="/user/id/979050185">
<span class="audience-review__default-image"></span>
</a>
<div class="audience-reviews__name-wrap ">
<a class="audience-reviews__name" data-qa="review-name" href="/user/id/979050185">
                                woofy g
                            </a>
</div>
</div>
<div class="audience-reviews__review-wrap">
<span class="audience-reviews__score"><span class="star-display" data-qa="star-display"><span class="star-display__filled "></span><span class="star-display__filled "></span><span class="star-display__filled "></span><span class="star-display__filled "></span><span class="star-display__filled "></span></span></span>
<span class="audience-reviews__duration" data-qa="review-duration">Mar 16, 2021</span>
<p class="audience-reviews__review js-review-text clamp clamp-8 js-clamp" data-qa="review-text">Amazing monster movie about the strength of the famili

* Credit:
The code below, used to scrape audience reviews from Rotten Tomatoes, was adapted from: https://stackoverflow.com/questions/62386747/how-to-scrape-rottentomatoes-audience-reviews-using-python

In [30]:
# Fetch the review page and pull the embedded `movieReview` JSON out of the HTML
r = requests.get("https://www.rottentomatoes.com/m/the_host_2007/reviews?type=user")
data = json.loads(re.search(r'movieReview\s=\s(.*);', r.text).group(1))

movieId = data["movieId"]

def getReviews(endCursor):
    """Request one page of user reviews, starting after `endCursor`."""
    r = requests.get(
        f"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user",
        params={
            "direction": "next",
            "endCursor": endCursor,
            "startCursor": ""
        })
    return r.json()

reviews = []
result = {}
for i in range(0, 600):
    print(f"[{i}] request review")
    # The first request uses an empty cursor; later ones continue from the previous page
    result = getReviews(result["pageInfo"]["endCursor"] if i != 0 else "")
    reviews.extend(result["reviews"])
    time.sleep(0.1)

print(f"got {len(reviews)} reviews")

[0] request review
[1] request review
[2] request review
[3] request review
[4] request review
[5] request review
[6] request review
[7] request review
[8] request review
[9] request review
[10] request review
[11] request review
[12] request review
[13] request review
[14] request review
[15] request review
[16] request review
[17] request review
[18] request review
[19] request review
[20] request review
[21] request review
[22] request review
[23] request review
[24] request review
[25] request review
[26] request review
[27] request review
[28] request review
[29] request review
[30] request review
[31] request review
[32] request review
[33] request review
[34] request review
[35] request review
[36] request review
[37] request review
[38] request review
[39] request review
[40] request review
[41] request review
[42] request review
[43] request review
[44] request review
[45] request review
[46] request review
[47] request review
[48] request review
[49] request review
[50] reque

[396] request review
[397] request review
[398] request review
[399] request review
[400] request review
[401] request review
[402] request review
[403] request review
[404] request review
[405] request review
[406] request review
[407] request review
[408] request review
[409] request review
[410] request review
[411] request review
[412] request review
[413] request review
[414] request review
[415] request review
[416] request review
[417] request review
[418] request review
[419] request review
[420] request review
[421] request review
[422] request review
[423] request review
[424] request review
[425] request review
[426] request review
[427] request review
[428] request review
[429] request review
[430] request review
[431] request review
[432] request review
[433] request review
[434] request review
[435] request review
[436] request review
[437] request review
[438] request review
[439] request review
[440] request review
[441] request review
[442] request review
[443] request
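
The loop above always issues a fixed number of requests. As a minimal sketch of an alternative, the function below stops as soon as the response reports no further page. It assumes the response's `pageInfo` carries a `hasNextPage` flag next to `endCursor`, which the cell above does not confirm. `fetch_all_reviews`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and the fake pages stand in for `getReviews` so the sketch runs without network access.

```python
import time


def fetch_all_reviews(fetch_page, max_pages=600, delay=0.0):
    """Collect reviews page by page until no further page is reported."""
    collected = []
    cursor = ""  # an empty cursor requests the first page
    for _ in range(max_pages):
        page = fetch_page(cursor)
        collected.extend(page.get("reviews", []))
        page_info = page.get("pageInfo", {})
        cursor = page_info.get("endCursor")
        # Assumption: the last page is marked with hasNextPage=False
        # or comes without an endCursor.
        if not page_info.get("hasNextPage") or not cursor:
            break
        time.sleep(delay)
    return collected


# Hypothetical stand-in for getReviews: two fake pages keyed by cursor.
FAKE_PAGES = {
    "": {"reviews": [{"review": "a"}, {"review": "b"}],
         "pageInfo": {"endCursor": "c1", "hasNextPage": True}},
    "c1": {"reviews": [{"review": "c"}],
           "pageInfo": {"endCursor": "c2", "hasNextPage": False}},
}


def fake_fetch(cursor):
    return FAKE_PAGES[cursor]


all_reviews = fetch_all_reviews(fake_fetch)
print(f"got {len(all_reviews)} reviews")
```

With the real API, `getReviews` could be passed in place of `fake_fetch`, together with a non-zero `delay` to stay polite to the server.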

In [11]:
reviews[-1]

<li class="audience-reviews__item" data-qa="review-item">
<div class="audience-reviews__user-wrap">
<a href="/user/id/978895067">
<span class="audience-review__default-image"></span>
</a>
<div class="audience-reviews__name-wrap ">
<a class="audience-reviews__name" data-qa="review-name" href="/user/id/978895067">
                                Huizi P
                            </a>
</div>
</div>
<div class="audience-reviews__review-wrap">
<span class="audience-reviews__score"><span class="star-display" data-qa="star-display"><span class="star-display__filled "></span><span class="star-display__filled "></span><span class="star-display__filled "></span><span class="star-display__filled "></span><span class="star-display__empty"></span></span></span>
<span class="audience-reviews__duration" data-qa="review-duration">Nov 05, 2020</span>
<p class="audience-reviews__review js-review-text clamp clamp-8 js-clamp" data-qa="review-text">"The host" has the usual routine of South Korean disaste

In [75]:
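# Getting the info I want (date, text, score) from each review.
# Note: `reviews` must be the list of dictionaries returned by the API cell above.
# If the BeautifulSoup cell is re-run first, `reviews` holds HTML tags instead,
# which is the likely cause of the KeyError shown below.

host_reviews = []

for review in reviews:
    rev_dict = {
        'date' : review['timeFromCreation'],
        'text' : review['review'],
        'score' : review['score']
    }
    host_reviews.append(rev_dict)

KeyError: 'timeFromCreation'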
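
One way to make this extraction step more forgiving is to skip anything that is not a dictionary and to read the fields with `.get`. This is only a sketch: `to_record` is a hypothetical helper, and the sample items are illustrative stand-ins rather than real API output.

```python
def to_record(review):
    """Return a date/text/score dict, or None if the item is not a dict."""
    if not isinstance(review, dict):
        return None
    return {
        'date': review.get('timeFromCreation'),
        'text': review.get('review'),
        'score': review.get('score'),
    }


# Illustrative input: one API-style dict and one non-dict item
# (a string standing in for an HTML tag).
sample = [
    {'timeFromCreation': 'Mar 16, 2021', 'review': 'Illustrative text', 'score': 5},
    '<li class="audience-reviews__item">...</li>',
]

records = [rec for rec in map(to_record, sample) if rec is not None]
print(records)
```

Skipping silently hides the mix-up rather than fixing it, so in practice it may be clearer to give the HTML results and the API results different variable names.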

In [38]:
host_reviews[:3]

[{'date': 'Mar 16, 2021',
  'score': 5,
  'text': 'Amazing monster movie about the strength of the familial bind, as well as the cringey stupidity of authority figures and social hierarchies.'},
 {'date': 'Feb 14, 2021',
  'score': 3.5,
  'text': "Bong Joon Ho's humor just gets me. I love it. As a monster movie it isn't bad. I suggest just watching it for the great directing, the good acting, and the silliness that ensues."},
 {'date': 'Feb 05, 2021',
  'score': 0.5,
  'text': 'God Awful Horrendous Film Than Is An Insult To The Film Industry And The History Of Cinematography!!1!!!'}]

Next steps: write the list out as a JSON file and filter out words I don't want.

Target: at least 5,000 reviews for each movie.

Date ranges:
* _The Host_: 2007-2009
* _Parasite_: 2019-2021

Write out the list as a json file:

In [39]:
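# Left commented out, presumably so that re-running the notebook does not overwrite the saved file:
# with open('../data/user_reviews/tomatoes_host_user.json','w', encoding='UTF-8') as out:
#     out.write(json.dumps(host_reviews))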

Open the JSON file that I just saved:

In [76]:
with open('../data/user_reviews/tomatoes_host_user.json', encoding='UTF-8') as f:
    host_rt = json.load(f)
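
As a small self-contained illustration of the save-and-reload round trip, the sketch below writes a list of review dictionaries to a temporary file and reads it back. The file name `sample_reviews.json` and the sample record are placeholders, not the project's real data.

```python
import json
import os
import tempfile

# Placeholder record in the same shape as the scraped reviews.
sample_reviews = [{'date': 'Aug 30, 2007', 'score': 4, 'text': 'Illustrative text'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_reviews.json')

    # json.dump writes straight to the open file handle.
    with open(path, 'w', encoding='UTF-8') as out:
        json.dump(sample_reviews, out)

    # The with-block closes the file as soon as loading is done.
    with open(path, encoding='UTF-8') as src:
        loaded = json.load(src)

print(loaded == sample_reviews)
```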

Look at one example (index 4000) from the list of Rotten Tomatoes (rt) user reviews of _The Host_:

In [77]:
host_rt[4000]

{'date': 'Aug 30, 2007',
 'score': 4,
 'text': "All the buzz is well-deserved; it's just as good as people say. Only gripe being that it lulls for 20 min or so in the middle, but only because the rest is brilliant."}

Filter by date!

I only want reviews posted between 2007, when the film became available in the US, and 2009. The goal is to analyze audience reviews written within roughly the first two years of the movie's release, so that they are not affected by the more recent hype around Director Bong Joon Ho's work. Reviews posted closest to the release date should better reflect how audiences originally received the film. This also lines up well with _Parasite_, which was released in 2019: as I complete this project in 2021, it has been two years since that movie came out. Using these windows keeps the gap between each film's release date and the posting dates of its reviews comparable.

_The Host_ 
"The film was released on a limited basis in the United States on March 9, 2007, and on DVD, Blu-ray, and HD DVD formats on July 24, 2007"

In [78]:
host_rt[4000]['date']

'Aug 30, 2007'

Date format: `Mmm dd, yyyy`

In [79]:
host_rt[4000]['date'][8:12]

'2007'
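
Slicing by position works only while every date has exactly this layout. As a hedged alternative, the year can be parsed with `datetime.strptime`, which also copes with a day that is not zero-padded. `review_year` is a hypothetical helper, and the `%b` directive assumes an English locale for the month abbreviations.

```python
from datetime import datetime


def review_year(date_string):
    """Parse a 'Mmm dd, yyyy' date string and return the year as an int."""
    return datetime.strptime(date_string, '%b %d, %Y').year


print(review_year('Aug 30, 2007'))
print(review_year('Nov 05, 2020'))
```

A date such as `'Mar 9, 2007'` would shift the characters by one position and break the `[8:12]` slice, while the parser still returns 2007.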

In [80]:
# Creating a new key:value pair for the year that the review was posted.
# The slice [8:12] relies on the fixed 'Mmm dd, yyyy' layout with a zero-padded day.

years = []

for rev in host_rt:
    years.append(rev['date'][8:12])
    rev['year'] = int(rev['date'][8:12])

In [81]:
# Checking to see if this worked:

host_rt[4000]

{'date': 'Aug 30, 2007',
 'score': 4,
 'text': "All the buzz is well-deserved; it's just as good as people say. Only gripe being that it lulls for 20 min or so in the middle, but only because the rest is brilliant.",
 'year': 2007}

Yay! It worked! Each dictionary in our list now has a new key:value pair called `year`, which we can use to keep only the reviews that were posted between 2007 and 2009.

In [82]:
type(host_rt[4000]['year'])

int

In [84]:
# Filtering by year: keep only reviews posted from 2007 through 2009

rt_host_user = []

for rev in host_rt:
    if 2007 <= rev['year'] <= 2009:
        revs_dict = {
            'date' : rev['date'],
            'score' : rev['score'],
            'text' : rev['text']
        }
        rt_host_user.append(revs_dict)

In [87]:
len(rt_host_user)

5216

I have 5216 user reviews of _The Host_ to work with!
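
To see how the kept reviews spread across the three years, a quick per-year tally can help. The sketch below uses `Counter` on a handful of dates taken from the examples in this notebook; the variable names `sample`, `kept`, and `per_year` are illustrative, and the real list would be `host_rt`.

```python
from collections import Counter

# A few dates from the examples shown in this notebook.
sample = [
    {'date': 'Mar 16, 2021'},
    {'date': 'Dec 30, 2009'},
    {'date': 'Aug 30, 2007'},
    {'date': 'Jul 05, 2007'},
]


def year_of(rev):
    # The year is the last four characters of the 'Mmm dd, yyyy' string.
    return int(rev['date'][-4:])


# Keep only reviews posted from 2007 through 2009, then count them per year.
kept = [rev for rev in sample if 2007 <= year_of(rev) <= 2009]
per_year = Counter(year_of(rev) for rev in kept)

print(len(kept))
print(sorted(per_year.items()))
```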

In [85]:
rt_host_user[0]

{'date': 'Dec 30, 2009',
 'score': 3.5,
 'text': 'For a monster movie this was quite good. Decent special effects and good acting all around. I feel that the aspect of military involvement and the supposed virus scare was never really expanded upon properly, leaving lots of "why?" when the credits role. Overall a great overseas horror flick, and I gotta say it\'s good to see something from that part of te globe where te main antagonist is NOT a woman with long black hair hanging in front of her face :)'}

In [86]:
rt_host_user[-1]

{'date': 'Jul 05, 2007',
 'score': 4.5,
 'text': 'Forget every monster movie you have ever seen: the Host wins hands down. Funny, poignant and exciting, this movie is more about the dysfunctional family that combat the mutated fish thingy rather than the monster itself. Similar in style to Q: the Winged Serpent, this nevertheless has a unique style, especially with the colour-saturated tones and the unorthodox way they show the monster in the first ten minutes of the film. I did think they could have dissed the Yanks more for the pollutuon of their river though. Awesome: I insist that everyone see this fantastic movie immediately! ZFK maximum rating.'}

Saving this filtered list of dictionaries as a JSON file (this overwrites the unfiltered file saved earlier under the same name), so that I can analyze the user reviews of _The Host_ that were posted between 2007 and 2009:

In [88]:
with open('../data/user_reviews/tomatoes_host_user.json','w', encoding='UTF-8') as out:
    out.write(json.dumps(rt_host_user))

In [89]:
with open('../data/user_reviews/tomatoes_host_user.json', encoding='UTF-8') as f:
    rt_host_user = json.load(f)

In [90]:
len(rt_host_user)

5216

It looks like we have plenty of reviews (5,216) to work with!

## Conclusion

In this notebook, I scraped, cleaned, and extracted information from user reviews of _The Host_ on the Rotten Tomatoes site. Next, before I move on to analyzing these reviews, I will filter out the non-English reviews so that my analyses cover only the English ones. This is done in the `lang_detect` notebook; after that, please refer to `analysis_user_tomatoes1` and `analysis_user_tomatoes2` to see the analysis I carried out!

Thank you.