# Scraping Metacritic for User Reviews of _Parasite_

In this notebook, I scrape [Metacritic](https://www.metacritic.com/movie/parasite/user-reviews?sort-by=date) for user reviews of _Parasite_. I also scrape critic reviews from the site, but my analysis ends up using critic reviews from Nexis Uni instead.

In [1]:
import requests
from bs4 import BeautifulSoup
import time
import json
from nltk.corpus import stopwords
import random
import re
import os
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from nltk import tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

from collections import Counter

In [2]:
characters_to_strip = '().[]!,"'

In [3]:
%run functions.ipynb

In [6]:
url = 'https://www.metacritic.com/movie/parasite/user-reviews?sort-by=date'
url2 = 'https://www.metacritic.com/movie/parasite/user-reviews?sort-by=date&page=1'
url3 = 'https://www.metacritic.com/movie/parasite/user-reviews?sort-by=date&page=2'
url4 = 'https://www.metacritic.com/movie/parasite/user-reviews?sort-by=date&page=3'

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}


In [43]:
resp = requests.get(url, headers=headers)
resp2 = requests.get(url2, headers=headers)
resp3 = requests.get(url3, headers=headers)
resp4 = requests.get(url4, headers=headers)

In [44]:
html = resp.text
html2 = resp2.text
html3 = resp3.text
html4 = resp4.text

In [45]:
doc = BeautifulSoup(html, 'html.parser')
doc2 = BeautifulSoup(html2, 'html.parser')
doc3 = BeautifulSoup(html3, 'html.parser')
doc4 = BeautifulSoup(html4, 'html.parser')
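The four near-identical URL, request, and parse lines above could also be produced in a loop. The following is a minimal sketch rather than the code used in this notebook: `page_urls` and `fetch_pages` are hypothetical helper names, the one-second delay is an arbitrary choice, and the `fetch` argument is passed in so the loop can be tried without touching the network.

```python
import time


def page_urls(base_url, n_pages):
    """Return the first-page URL followed by its `&page=N` variants."""
    return [base_url] + [f"{base_url}&page={i}" for i in range(1, n_pages)]


def fetch_pages(urls, fetch, delay=1.0):
    """Call `fetch(url)` for each URL, pausing `delay` seconds between calls."""
    pages = []
    for i, page_url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # short pause so requests are not sent back to back
        pages.append(fetch(page_url))
    return pages


base = "https://www.metacritic.com/movie/parasite/user-reviews?sort-by=date"
urls = page_urls(base, 4)

# In this notebook, `fetch` could wrap the existing requests call, e.g.:
#   htmls = fetch_pages(urls, lambda u: requests.get(u, headers=headers).text)
#   docs = [BeautifulSoup(h, "html.parser") for h in htmls]
```

Keeping the pages in a list means a fifth page only requires changing the `4`, and the later `find_all` step can run over `docs` instead of four separate variables.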

In [55]:
parasite_revs1 = doc.find_all('div', attrs={'class':"review"})
parasite_revs2 = doc2.find_all('div', attrs={'class':"review"})
parasite_revs3 = doc3.find_all('div', attrs={'class':"review"})
parasite_revs4 = doc4.find_all('div', attrs={'class':"review"})

In [56]:
print(len(parasite_revs1))
print(len(parasite_revs2))
print(len(parasite_revs3))
print(len(parasite_revs4))

100
100
100
26


In [57]:
# Combining the four per-page lists into one list:

parasite_revs = parasite_revs1 + parasite_revs2 + parasite_revs3 + parasite_revs4

In [58]:
len(parasite_revs)

326

In [59]:
parasite_revs[0]

<div class="review pad_top1">
<div class="left fl">
<div class="metascore_w user large movie positive indiv perfect">10</div>
</div>
<div class="right fl">
<div class="title pad_btm_half"><span class="author"><a href="/user/AkioTenebris">AkioTenebris</a></span><span class="date">Apr 30, 2021</span></div>
<div class="summary">
<div class="review_body">
<span>Director {Bong Joon-Ho} has done it again with a captivating and extraordinary film. Leaving me questioning if the real world is actually like this! Not only is the climax risen to a new level but the jaw dropping ending left me shook.</span>
</div>
</div>
<div class="interactions pad_top1">
<span class="helpful_wrapper">
<div class="helpful" data-mcrefid="13909291" data-mcreftype="230">
<div class="thumbs">
<span class="text"><span class="yes_count">0</span> of <span class="total_count">0</span> users found this helpful</span><span class="thumb_up"><i aria-hidden="true" class="fa fa-thumbs-up"></i><span class="count">0</span></span

Creating a function that builds a dictionary with the information I want from each review: the date the review was posted, the text of the review, and the score the reviewer gave the movie:

In [60]:
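def extract_review_data(review_div):
    """Return the date, text, and score of one user review as a dictionary."""
    review = {}

    review['date'] = review_div.find('span', class_='date').text.strip()
    # .text joins every nested span, so a long review yields both its
    # collapsed and expanded blurbs, followed by "… Expand"
    review['text'] = review_div.find('div', class_='review_body').text.strip()
    review['score'] = review_div.find('div', class_='metascore_w').text.strip()

    return review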


In [61]:
extract_review_data(parasite_revs[0])

{'date': 'Apr 30, 2021',
 'score': '10',
 'text': 'Director {Bong Joon-Ho} has done it again with a captivating and extraordinary film. Leaving me questioning if the real world is actually like this! Not only is the climax risen to a new level but the jaw dropping ending left me shook.'}

In [62]:
extract_review_data(parasite_revs[10])

{'date': 'Feb  6, 2021',
 'score': '8',
 'text': "Second half of this movie has a lot of brawling which I found tiresome. But, this is the type of movie that rewards viewers who pay close attention. While it is sympathetic towards the poor family, it doesn't get into what would happen if theSecond half of this movie has a lot of brawling which I found tiresome. But, this is the type of movie that rewards viewers who pay close attention. While it is sympathetic towards the poor family, it doesn't get into what would happen if the tables were turned. Would the formerly poor family behave like the rich family in the film? Parasite goes only so far as to suggest Maybe. It's a rambunctious movie with considerable underlying subtlety. Remarkable.… Expand"}

In [63]:
extract_review_data(parasite_revs[90])

{'date': 'Apr  1, 2020',
 'score': '5',
 'text': 'Parasite starts very well and really get\'s your attention but after the plot starts to "develop\', it goes from amusing to absurd. The ilogical, violent and disapointing ending does not match the clever plot of the beggining.The charactersParasite starts very well and really get\'s your attention but after the plot starts to "develop\', it goes from amusing to absurd. The ilogical, violent and disapointing ending does not match the clever plot of the beggining.The characters are shallowly presented and inconsistent at times. The sister, Da-hye (jung Ji-so) is the best written and also has the best performance, wich makes her the only character I really cared about. Ki-woo, or Kevin is presented to us as a smart guy at the beginning but suddenly, after the first hour or so, he seens extremely dull.the camera work is great and deserves a mention.… Expand'}

In [64]:
parasite_revs[90]

<div class="review pad_top1">
<div class="left fl">
<div class="metascore_w user large movie mixed indiv">5</div>
</div>
<div class="right fl">
<div class="title pad_btm_half"><span class="author"><a href="/user/FelpsGP">FelpsGP</a></span><span class="date">Apr  1, 2020</span></div>
<div class="summary">
<div class="review_body">
<span class="inline_expand_collapse inline_collapsed" id="review_blurb_11195568"><span class="blurb blurb_collapsed">Parasite starts very well and really get's your attention but after the plot starts to "develop', it goes from amusing to absurd. The ilogical, violent and disapointing ending does not match the clever plot of the beggining.<br/>The characters</span><span class="blurb blurb_expanded">Parasite starts very well and really get's your attention but after the plot starts to "develop', it goes from amusing to absurd. The ilogical, violent and disapointing ending does not match the clever plot of the beggining.<br/>The characters are shallowly presente
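The raw markup above shows why long reviews come out doubled: `review_body` holds both a `blurb_collapsed` and a `blurb_expanded` span, and `.text` joins them, followed by `… Expand`. One way to repair the already-extracted strings is sketched below. `clean_review_text` is a hypothetical helper, and it assumes the two copies are concatenated with nothing in between, as in the outputs shown here. The `min_overlap` threshold of 40 characters is an arbitrary guard against short accidental repeats.

```python
def clean_review_text(text, marker="… Expand", min_overlap=40):
    """Drop the trailing expand marker and the duplicated collapsed blurb."""
    text = text.strip()
    if text.endswith(marker):
        text = text[: -len(marker)].rstrip()
    # The collapsed blurb is a prefix of the expanded one, so a doubled review
    # looks like prefix + prefix + rest. Look for the longest such prefix.
    for k in range(len(text) // 2, min_overlap - 1, -1):
        if text.startswith(text[:k], k):
            return text[k:]
    return text


collapsed = "Collapsed part of a fairly long review that gets cut"
expanded = collapsed + " off here, and then continues to the end."
doubled = collapsed + expanded + "… Expand"

print(clean_review_text(doubled) == expanded)
```

Short reviews pass through unchanged because they contain no repeated prefix. Selecting only the `blurb_expanded` span at scrape time would be another option.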

Applying the function to every review and collecting the results in a new list:

In [65]:
parasite_data = []
for review in parasite_revs:
    parasite_data.append(extract_review_data(review))

In [66]:
parasite_data[:3]

[{'date': 'Apr 30, 2021',
  'score': '10',
  'text': 'Director {Bong Joon-Ho} has done it again with a captivating and extraordinary film. Leaving me questioning if the real world is actually like this! Not only is the climax risen to a new level but the jaw dropping ending left me shook.'},
 {'date': 'Apr 29, 2021',
  'score': '10',
  'text': 'phenomenal, involves all genres of cinema, and in a fantastic way; like I’ve never seen before.'},
 {'date': 'Apr 20, 2021',
  'score': '9',
  'text': "This review contains spoilers, click expand to view.\n        \nNormally I'm not much into Korean entertainment. The only Korean entertainment I have ever consumed is PUBG. So I didn't know what to expect when I begun watching this film. I saw it with a blank slate, with no preformed opinion whatsoever. The only time I had seen a Korean film before was The Train to Busan, and that too is made by Bong Joon-Ho. I had watched it from somewhere in the middle when it was coming on TV. So in a way we c
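The extracted values are all strings, and some dates contain a double space (`'Feb  6, 2021'`). If typed values are needed later, a conversion step along these lines could help. This is a sketch only: `normalize_review` is a hypothetical helper, the `%b` format code assumes English month abbreviations in the current locale, and the sample text is a placeholder.

```python
from datetime import datetime


def normalize_review(review):
    """Return a copy with `date` as an ISO string and `score` as an int."""
    cleaned = dict(review)
    # Collapse runs of whitespace such as the double space in 'Feb  6, 2021'.
    date_text = " ".join(review["date"].split())
    cleaned["date"] = datetime.strptime(date_text, "%b %d, %Y").date().isoformat()
    cleaned["score"] = int(review["score"])
    return cleaned


example = {"date": "Feb  6, 2021", "score": "8", "text": "..."}
print(normalize_review(example))
```

The date is stored as an ISO string rather than a `date` object so that the dictionary can still be written out with `json`.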

Writing the extracted reviews out as a JSON file:

In [69]:
with open('../data/user_reviews/metacritic_parasite_user.json','w', encoding='UTF-8') as out:
    out.write(json.dumps(parasite_data))

In [70]:
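# Using a context manager so the file handle is closed after loading
with open('../data/user_reviews/metacritic_parasite_user.json', encoding='UTF-8') as f:
    metacritic_parasite_user = json.load(f)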
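By default, `json.dumps` escapes non-ASCII characters, so a curly apostrophe in a review is stored as `\u2019`. The data loads back identically either way, but the file is harder to read by hand. A minimal sketch of the same round trip with `ensure_ascii=False` follows. The temporary directory and the `reviews.json` file name are stand-ins for the real paths in this notebook.

```python
import json
import os
import tempfile

sample = [{"date": "Apr 29, 2021", "score": "10",
           "text": "like I’ve never seen before."}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "reviews.json")

    with open(path, "w", encoding="utf-8") as out:
        # ensure_ascii=False writes ’ as-is instead of as \u2019
        json.dump(sample, out, ensure_ascii=False, indent=4)

    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

loaded = json.loads(raw_text)
print(loaded == sample)
```

With this option the file must be opened with an explicit UTF-8 encoding both when writing and when reading, as done here.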

In [73]:
len(metacritic_parasite_user)

326

We have 326 user reviews of _Parasite_. Here are the first few:

In [72]:
metacritic_parasite_user[:3]

[{'date': 'Apr 30, 2021',
  'score': '10',
  'text': 'Director {Bong Joon-Ho} has done it again with a captivating and extraordinary film. Leaving me questioning if the real world is actually like this! Not only is the climax risen to a new level but the jaw dropping ending left me shook.'},
 {'date': 'Apr 29, 2021',
  'score': '10',
  'text': 'phenomenal, involves all genres of cinema, and in a fantastic way; like I’ve never seen before.'},
 {'date': 'Apr 20, 2021',
  'score': '9',
  'text': "This review contains spoilers, click expand to view.\n        \nNormally I'm not much into Korean entertainment. The only Korean entertainment I have ever consumed is PUBG. So I didn't know what to expect when I begun watching this film. I saw it with a blank slate, with no preformed opinion whatsoever. The only time I had seen a Korean film before was The Train to Busan, and that too is made by Bong Joon-Ho. I had watched it from somewhere in the middle when it was coming on TV. So in a way we c

## Getting critic reviews of _Parasite_

Here I will try scraping critic reviews from the site:

In [4]:
critic_url = 'https://www.metacritic.com/movie/parasite/critic-reviews'

In [7]:
resp = requests.get(critic_url, headers=headers)

In [8]:
html = resp.text

In [9]:
doc = BeautifulSoup(html, 'html.parser')

In [10]:
review_divs = doc.find_all('div', attrs={'class':'review'})

In [11]:
len(review_divs)

53

In [12]:
review_divs[0]

<div class="review pad_top1 pad_btm1">
<div class="left fl">
<div class="metascore_w large movie positive indiv perfect">100</div>
</div>
<div class="right fl">
<div class="title pad_btm_half"><span class="source"><a href="/publication/the-observer-uk?filter=movies"><img alt="The Observer (UK)" class="pub-img" src="https://static.metacritic.com/images/publications/1567125862_2798_2064361283.png" title="The Observer (UK)"/></a></span><span class="author"><a href="/critic/mark-kermode?filter=movies">Mark Kermode</a></span><span class="date">Feb 10, 2020</span></div>
<div class="summary">
<a class="no_hover" href="https://www.theguardian.com/film/2020/feb/09/parasite-review-bong-joon-ho-tragicomic-masterpiece" rel="noopener" target="_blank">
                                                Thrillingly played by a flawless ensemble cast who hit every note and harmonic resonance of Bong and co-writer Han Jin-won’s multitonal script, it’s a tragicomic masterclass that will get under your skin

Writing a function to extract the data I want:

In [13]:
def extract_criticreview_data(review_div):
    """Return the rating, text, URL, source, and critic of one critic review."""
    review = {}

    review['rating'] = review_div.find('div', class_='metascore_w').text.strip()
    review['text'] = review_div.find('div', class_='summary').text.strip()
    review['review_url'] = review_div.find('a', class_='read_full').attrs['href']
    try:
        # Most sources are shown as a logo image whose alt text is the name
        review['source'] = review_div.find('span', class_='source').find('img').attrs['alt']
    except (AttributeError, KeyError):
        # No usable logo: fall back to the text of the source span
        review['source'] = review_div.find('span', class_='source').text
    review['critic'] = review_div.find('span', class_='author').text

    return review


In [15]:
extract_criticreview_data(review_divs[10])

{'critic': 'Ty Burr',
 'rating': '100',
 'review_url': 'https://www.bostonglobe.com/arts/movies/2019/10/17/family-ties-and-lies-dark-comic-masterpiece-parasite/Yq2AQEsAaZWS4IY56mQ3BM/story.html',
 'source': 'Boston Globe',
 'text': 'Parasite becomes a social satire of almost breathless audacity, a three-dimensional chess game of Darwinian one-upmanship that is by turns hilarious, terrifying, and brutal.\n                                    \nRead full review'}
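The `text` value above still ends with a run of whitespace and the `Read full review` link label, because `.text` on the `summary` div includes the link. A small cleanup step could look like the sketch below. `clean_critic_text` is a hypothetical helper, and the sample string is a shortened stand-in for a real summary.

```python
def clean_critic_text(text, link_label="Read full review"):
    """Remove the trailing link label and collapse all runs of whitespace."""
    text = text.strip()
    if text.endswith(link_label):
        text = text[: -len(link_label)]
    # split() with no arguments also discards newlines and indentation
    return " ".join(text.split())


raw = "A social satire that is by turns hilarious and brutal.\n        \nRead full review"
print(clean_critic_text(raw))
```

Note that this also collapses whitespace inside the summary, which is harmless for single-paragraph blurbs like the ones shown here.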

In [16]:
criticreview_data = []
for review in review_divs:
    try:
        criticreview_data.append(extract_criticreview_data(review))
    except (AttributeError, KeyError):
        # A div without the expected elements (such as an ad) ends up here
        print('Error trying to extract review data')
        print(review)
        print()

Error trying to extract review data
<div class="ad_unit review pad_top1 pad_btm1" id="native_top">
<script type="text/javascript">
            
                            pushToDisplay('native_top', null, 'top', false, 'NmEGjUCMQTmJR3rtbiq2vRTC'), false;
                    </script>
</div>



In [17]:
len(criticreview_data)

52
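The count drops from 53 to 52 because one matched `div` is an ad unit that lacks the review elements, so extraction fails on it. Instead of printing failures, they can be collected for later inspection. The sketch below uses a hypothetical `extract_all` helper and plain dictionaries as stand-ins for the review divs. The listed exception types are an assumption about what a failed lookup raises.

```python
def extract_all(items, extractor, expected=(AttributeError, KeyError, TypeError)):
    """Apply `extractor` to each item, separating successes from failures."""
    records, failures = [], []
    for item in items:
        try:
            records.append(extractor(item))
        except expected as err:
            failures.append((item, err))
    return records, failures


# Stand-in data: the second and third items are missing what the extractor needs.
items = [{"rating": " 100 "}, {"ad": True}, {"rating": None}]
records, failures = extract_all(items, lambda d: {"rating": d["rating"].strip()})

print(len(records), len(failures))
```

In this notebook the call could be `criticreview_data, skipped = extract_all(review_divs, extract_criticreview_data)`, after which `len(skipped)` would account for the missing review.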

In [18]:
with open('../data/critic_reviews/metacritic_parasite_critic.json','w', encoding='UTF-8') as out:
    out.write(json.dumps(criticreview_data, indent=4))

# adding the `indent=4` param to json.dumps will produce multiline output with indents that make
# it easier to read and edit the JSON file by hand

## Conclusion

In this notebook, I scraped user reviews of _Parasite_ from Metacritic and also scraped and extracted critic reviews from the site, ending up with 326 user reviews and 52 critic reviews. Next, I filter out non-English words in my `lang_detect` notebook, analyze the user reviews in `analysis_user_metacritic`, and explain my analysis of critic reviews in `analysis_critic_nexis`.

Thank you~