# Scraping Rotten Tomatoes for User Reviews of _Parasite_

In this notebook, I scrape, clean, and extract user reviews of _Parasite_ from [Rotten Tomatoes](https://www.rottentomatoes.com/m/parasite_2019/reviews?type=verified_audience&intcmp=rt-scorecard_audience-score-reviews). Following the same process as in the `tomatoes_host_user` notebook, I compile these user reviews so that I can analyze what audiences have said about these movies!

In [1]:
import requests
from bs4 import BeautifulSoup
import time
import json
from nltk.corpus import stopwords
import random
import re
import os
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from nltk import tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

from collections import Counter

In [2]:
chars_to_strip = '().[]!,"'

In [3]:
%run functions.ipynb

In [4]:
url = 'https://www.rottentomatoes.com/m/parasite_2019/reviews?type=verified_audience&intcmp=rt-scorecard_audience-score-reviews'

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}


In [5]:
resp = requests.get(url, headers=headers)

In [6]:
html = resp.text

In [7]:
doc = BeautifulSoup(html, 'html.parser')

In [8]:
reviews = doc.find_all('li', attrs={'class':'audience-reviews__item'})

In [9]:
len(reviews)

10

In [12]:
reviews[9]

<li class="audience-reviews__item" data-qa="review-item">
<div class="audience-reviews__user-wrap">
<span class="audience-review__default-image"></span>
<div class="audience-reviews__name-wrap ">
<span class="audience-reviews__name" data-qa="review-name">Mickey S</span>
</div>
</div>
<div class="audience-reviews__review-wrap">
<span class="audience-reviews__score"><span class="star-display" data-qa="star-display"><span class="star-display__filled "></span><span class="star-display__filled "></span><span class="star-display__empty"></span><span class="star-display__empty"></span><span class="star-display__empty"></span></span></span>
<span class="audience-reviews__verified hidden-md hidden-lg js-verified-review">Verified</span>
<span class="audience-reviews__verified js-verified-popover hidden-xs hidden-sm" data-container="body" data-placement="bottom" data-toggle="popover" data-trigger="hover" type="button">Verified</span>
<span class="audience-reviews__duration" data-qa="review-durati

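Parsing the page HTML only returns the 10 reviews shown on the first page, so the cell below requests the reviews from the site's JSON endpoint instead.

* Credit: The code below, used to scrape audience reviews from Rotten Tomatoes, was adapted from https://stackoverflow.com/questions/62386747/how-to-scrape-rottentomatoes-audience-reviews-using-python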

In [15]:
r = requests.get("https://www.rottentomatoes.com/m/parasite_2019/reviews?type=user")
# Raw string so that \s is read as a regex escape, not a Python string escape
data = json.loads(re.search(r'movieReview\s=\s(.*);', r.text).group(1))

movieId = data["movieId"]

def getReviews(endCursor):
    r = requests.get(
        f"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user",
        params={
            "direction": "next",
            "endCursor": endCursor,
            "startCursor": "",
        },
    )
    return r.json()

reviews = []
result = {}
for i in range(0, 468):
    print(f"[{i}] request review")
    # The first request sends an empty cursor; later ones continue from the previous page.
    result = getReviews(result["pageInfo"]["endCursor"] if i != 0 else "")
    reviews.extend(result["reviews"])
    time.sleep(0.1)

print(f"got {len(reviews)} reviews")

[0] request review
[1] request review
[2] request review
[3] request review
[4] request review
[5] request review
[6] request review
[7] request review
[8] request review
[9] request review
[10] request review
[11] request review
[12] request review
[13] request review
[14] request review
[15] request review
[16] request review
[17] request review
[18] request review
[19] request review
[20] request review
[21] request review
[22] request review
[23] request review
[24] request review
[25] request review
[26] request review
[27] request review
[28] request review
[29] request review
[30] request review
[31] request review
[32] request review
[33] request review
[34] request review
[35] request review
[36] request review
[37] request review
[38] request review
[39] request review
[40] request review
[41] request review
[42] request review
[43] request review
[44] request review
[45] request review
[46] request review
[47] request review
[48] request review
[49] request review
[50] reque

[396] request review
[397] request review
[398] request review
[399] request review
[400] request review
[401] request review
[402] request review
[403] request review
[404] request review
[405] request review
[406] request review
[407] request review
[408] request review
[409] request review
[410] request review
[411] request review
[412] request review
[413] request review
[414] request review
[415] request review
[416] request review
[417] request review
[418] request review
[419] request review
[420] request review
[421] request review
[422] request review
[423] request review
[424] request review
[425] request review
[426] request review
[427] request review
[428] request review
[429] request review
[430] request review
[431] request review
[432] request review
[433] request review
[434] request review
[435] request review
[436] request review
[437] request review
[438] request review
[439] request review
[440] request review
[441] request review
[442] request review
[443] request
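The first two lines of the scraping cell pull a JSON object out of the page source with a regular expression. Here is a minimal sketch of how that extraction works, run on a made-up page fragment. The `sample_html` string and its `movieId` value are hypothetical; the real page's markup is not shown in this notebook and may have changed since.

```python
import json
import re

# HYPOTHETICAL page fragment, only for illustration; not copied from the real site
sample_html = """
<script>
    movieReview = {"movieId": "hypothetical-id", "title": "Parasite"};
</script>
"""

# '.' does not match newlines, so the greedy (.*) captures everything between
# 'movieReview = ' and the last ';' on that same line
match = re.search(r'movieReview\s=\s(.*);', sample_html)
if match is None:
    # Fail with a readable message instead of an AttributeError on .group()
    raise ValueError("movieReview object not found; the page layout may have changed")

data = json.loads(match.group(1))
print(data["movieId"])
```

Checking for `None` before calling `.group(1)` makes it easier to tell a changed page layout apart from a network problem.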

In [16]:
len(reviews)

4672
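The loop above hard-codes 468 requests, which only works if you already know how many pages there are. A possible alternative is to keep following `endCursor` until the endpoint runs out of reviews. This is a sketch under assumptions: I have not confirmed how the endpoint signals the last page, so the stop conditions (an empty `reviews` list, or a missing or repeated `endCursor`) are guesses, and `collect_all_reviews` and `fake_fetch` are hypothetical names used only so the example runs offline.

```python
import time
from typing import Callable, Dict, List


def collect_all_reviews(fetch_page: Callable[[str], Dict],
                        delay: float = 0.1,
                        max_pages: int = 1000) -> List[Dict]:
    """Follow endCursor values until no more reviews come back."""
    collected: List[Dict] = []
    cursor = ""  # an empty cursor asks for the first page, as in the loop above
    for _ in range(max_pages):  # hard cap so a misbehaving API cannot loop forever
        page = fetch_page(cursor)
        batch = page.get("reviews", [])
        if not batch:
            break
        collected.extend(batch)
        next_cursor = page.get("pageInfo", {}).get("endCursor")
        # ASSUMPTION: a missing, empty, or unchanged cursor marks the last page
        if not next_cursor or next_cursor == cursor:
            break
        cursor = next_cursor
        time.sleep(delay)  # short pause between requests
    return collected


# Offline stand-in for getReviews (HYPOTHETICAL data, not real API responses)
_FAKE_PAGES = {
    "": {"reviews": [{"score": 5}, {"score": 4}], "pageInfo": {"endCursor": "a"}},
    "a": {"reviews": [{"score": 3}], "pageInfo": {"endCursor": "b"}},
    "b": {"reviews": [], "pageInfo": {}},
}


def fake_fetch(cursor: str) -> Dict:
    return _FAKE_PAGES[cursor]


demo = collect_all_reviews(fake_fetch, delay=0)
print(f"got {len(demo)} reviews")
```

In the notebook itself, the call would be `collect_all_reviews(getReviews)`, since `getReviews` already takes a cursor and returns the parsed JSON.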

In [56]:
reviews[:8]

[{'createDate': '2021-05-01T18:51:44.935Z',
  'displayImageUrl': None,
  'displayName': 'Sarah K',
  'hasProfanity': False,
  'hasSpoilers': False,
  'isSuperReviewer': False,
  'isVerified': False,
  'rating': 'STAR_5',
  'review': "Watch it!!!!!!! I don't need to say a word, just trust me.",
  'score': 5,
  'timeFromCreation': '3h ago',
  'updateDate': '2021-05-01T18:51:45.025Z',
  'user': {'accountLink': '/user/id/904364501',
   'displayName': 'Sarah K',
   'realm': 'RT',
   'userId': '904364501'}},
 {'createDate': '2021-04-29T00:36:11.591Z',
  'displayImageUrl': None,
  'displayName': 'Keelan M',
  'hasProfanity': False,
  'hasSpoilers': False,
  'isSuperReviewer': False,
  'isVerified': False,
  'rating': 'STAR_5',
  'review': 'unlike the justice league the snyder cut the parasite is not 4 hours long and actually good and balances its comedy, drama and horror elements with razor precision therefore it is pretentious therefore it is not kino',
  'score': 5,
  'timeFromCreation': '3

In [46]:
# Getting the info I want
# Making a new list of dictionaries with that info:

parasite_reviews = []

for rev in reviews:
    rev_dict = {
        'date' : rev['timeFromCreation'],
        'text' : rev['review'],
        'score' : rev['score']
    }
    parasite_reviews.append(rev_dict)

In [47]:
len(parasite_reviews)

4672

A pretty decent number of reviews! We have a few more user reviews for _The Host_ (5,216), but 4,672 is pretty close. This number might change after we filter out non-English reviews, which you can take a look at in the `lang_detect` notebook.

Now let's look at the first several reviews in this list:

In [48]:
parasite_reviews[:10]

[{'date': '3h ago',
  'score': 5,
  'text': "Watch it!!!!!!! I don't need to say a word, just trust me."},
 {'date': '3d ago',
  'score': 5,
  'text': 'unlike the justice league the snyder cut the parasite is not 4 hours long and actually good and balances its comedy, drama and horror elements with razor precision therefore it is pretentious therefore it is not kino'},
 {'date': '3d ago',
  'score': 5,
  'text': "One of the best films I've ever seen in my life, couldn't have been executed better.\nBrilliant actors, amazing script, great twist, just marvelous!"},
 {'date': '4d ago',
  'score': 5,
  'text': '100% fresh baby no puedes darle menos a esto'},
 {'date': '4d ago',
  'score': 5,
  'text': "This movie is a masterpiece in all senses. Parasite portrays the intimate true about the social classes. The movement of characters show how the difference between classes in life occurs. The colors and the photography speak without using words. The script is amazing, the film is of elegant s

Saving my list as a JSON file:

In [49]:
with open('../data/user_reviews/tomatoes_parasite_user.json','w', encoding='UTF-8') as out:
    out.write(json.dumps(parasite_reviews))

Loading the data back in to make some initial observations:

In [53]:
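# Use a context manager so the file is closed, and read with the encoding it was written in
with open('../data/user_reviews/tomatoes_parasite_user.json', encoding='UTF-8') as f:
    rt_parasite_user = json.load(f)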

In [54]:
rt_parasite_user[:10]

[{'date': '3h ago',
  'score': 5,
  'text': "Watch it!!!!!!! I don't need to say a word, just trust me."},
 {'date': '3d ago',
  'score': 5,
  'text': 'unlike the justice league the snyder cut the parasite is not 4 hours long and actually good and balances its comedy, drama and horror elements with razor precision therefore it is pretentious therefore it is not kino'},
 {'date': '3d ago',
  'score': 5,
  'text': "One of the best films I've ever seen in my life, couldn't have been executed better.\nBrilliant actors, amazing script, great twist, just marvelous!"},
 {'date': '4d ago',
  'score': 5,
  'text': '100% fresh baby no puedes darle menos a esto'},
 {'date': '4d ago',
  'score': 5,
  'text': "This movie is a masterpiece in all senses. Parasite portrays the intimate true about the social classes. The movement of characters show how the difference between classes in life occurs. The colors and the photography speak without using words. The script is amazing, the film is of elegant s

One thing to note: some of the most recent reviews, which appear at the top of the output above, have a different `date` format. This is because the field I extracted for my list, `timeFromCreation`, actually represents the time that has elapsed since the review was posted.

For example, I collected this data on May 1, 2021. The very first dictionary in our list is a review that was posted that same day, so its `date` shows up as `'3h ago'`. This is obviously not the format we want, so I will go ahead and manually change these entries in the JSON file.

Now it will have the same format as the rest of this list:

In [57]:
# Reload the manually edited file
with open('../data/user_reviews/tomatoes_parasite_user.json', encoding='UTF-8') as f:
    rt_parasite_user = json.load(f)

In [58]:
rt_parasite_user[:10]

[{'date': 'May 1, 2021',
  'score': 5,
  'text': "Watch it!!!!!!! I don't need to say a word, just trust me."},
 {'date': 'Apr 29, 2021',
  'score': 5,
  'text': 'unlike the justice league the snyder cut the parasite is not 4 hours long and actually good and balances its comedy, drama and horror elements with razor precision therefore it is pretentious therefore it is not kino'},
 {'date': 'Apr 28, 2021',
  'score': 5,
  'text': "One of the best films I've ever seen in my life, couldn't have been executed better.\nBrilliant actors, amazing script, great twist, just marvelous!"},
 {'date': 'Apr 28, 2021',
  'score': 5,
  'text': '100% fresh baby no puedes darle menos a esto'},
 {'date': 'Apr 27, 2021',
  'score': 5,
  'text': "This movie is a masterpiece in all senses. Parasite portrays the intimate true about the social classes. The movement of characters show how the difference between classes in life occurs. The colors and the photography speak without using words. The script is amaz

Great! Now the dates all follow the same format.
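The manual edit could probably be avoided, since each raw review also has a `createDate` field with a full timestamp. Here is a sketch of converting it to the `'May 1, 2021'` style used above. `format_create_date` is a hypothetical helper, not part of `functions.ipynb`, and I'm assuming the timestamps always look like the ones in the output above. They appear to be in UTC (the trailing `Z`), so a date could differ by a day from what the site displays.

```python
from datetime import datetime

# Explicit month abbreviations so the result does not depend on the system locale
_MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def format_create_date(create_date: str) -> str:
    """Turn '2021-05-01T18:51:44.935Z' into 'May 1, 2021'."""
    # ASSUMPTION: every createDate has this exact shape, with fractional seconds and 'Z'
    parsed = datetime.strptime(create_date, "%Y-%m-%dT%H:%M:%S.%fZ")
    return f"{_MONTHS[parsed.month - 1]} {parsed.day}, {parsed.year}"


# Two raw reviews, trimmed to the relevant fields from the output shown earlier
sample = [
    {"createDate": "2021-05-01T18:51:44.935Z", "timeFromCreation": "3h ago"},
    {"createDate": "2021-04-29T00:36:11.591Z", "timeFromCreation": "3d ago"},
]

dates = [format_create_date(rev["createDate"]) for rev in sample]
print(dates)
```

With this helper, the extraction loop could store `format_create_date(rev['createDate'])` as the `date` instead of `rev['timeFromCreation']`.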

## Conclusion

In this notebook, I went through the process of scraping, organizing, and extracting user reviews of _Parasite_ from Rotten Tomatoes. I was able to get 4,672 reviews, and now I'll move on to filtering out non-English reviews so that I can run my analyses. For this, please refer to the `lang_detect` notebook. And for the data analysis on these user reviews, refer to `analysis_user_tomatoes1` and `analysis_user_tomatoes2`!

See you!