# Scraping IMDB reviews with Selenium

Here I'll use `selenium` to collect IMDB reviews for Star Wars: The Last Jedi.

In [13]:
import os, re
import time

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import WebDriverException

In [14]:
driver = webdriver.Chrome()
driver.get("http://www.imdb.com/title/tt2527336/reviews?ref_=tt_urv")

In [28]:
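from selenium.webdriver.common.by import By

# find_elements_by_xpath was removed in newer Selenium 4 releases; find_elements(By.XPATH, ...) is the supported form
text_elements = driver.find_elements(By.XPATH, '//*[@id="main"]/section/div[2]/div[2]/div/div[1]/div[1]/div[4]/div[1]')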

In [33]:
text_element = text_elements[0]
text_element

<selenium.webdriver.remote.webelement.WebElement (session="ad1220ade4419de158f967b3ca0b35b4", element="0.8578279281383991-2")>

In [55]:
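from selenium.webdriver.common.by import By

# the "load more" button that reveals the next batch of reviews
button = driver.find_element(By.XPATH, '//*[@id="load-more-trigger"]')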

In [59]:
last_height = driver.execute_script("return document.body.scrollHeight")

In [71]:
from selenium.common.exceptions import ElementNotVisibleException

In [72]:
ElementNotVisibleException

selenium.common.exceptions.ElementNotVisibleException

In [75]:
for i in range(100):
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        button.click()
    except ElementNotVisibleException:
        time.sleep(0.5)
        continue
    time.sleep(1.5)
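
The parsing further down reads the reviews from `../imdb/imdb_swtlj_reviews.html`, but the cell that wrote that file isn't shown. A minimal sketch of that step, assuming the fully expanded page was saved from `driver.page_source`; the helper name `save_html` is hypothetical and not part of the original notebook:

```python
from pathlib import Path


def save_html(html, path):
    """Write an HTML string to `path` as UTF-8, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html, encoding="utf-8")
    return target


# Assumed usage once the "load more" loop has finished:
# save_html(driver.page_source, "../imdb/imdb_swtlj_reviews.html")
```

Writing the file once means the slow browser session doesn't have to be repeated each time the parsing code changes.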

Now that I've scraped the reviews, I can extract them from the saved HTML file using `BeautifulSoup`.

In [77]:
import os
from bs4 import BeautifulSoup

In [81]:
with open("../imdb/imdb_swtlj_reviews.html", 'r') as f:
    imdb_body = f.read()

In [108]:
imdb_soup = BeautifulSoup(imdb_body, "html5lib")

In [125]:
review_title = imdb_soup.find(text = 'Lived up to the hype, and in some places, beyond.')

In [228]:
reviews = imdb_soup.find_all("div", {'class': 'review-container'})

In [144]:
review = reviews[0]

In [194]:
# get user rating
imdb_rating = int(review.find("div", {'class': 'ipl-ratings-bar'}).find('span').find('span').text)
imdb_rating

2

In [186]:
# get title
review.find("div", {'class': 'title'}).text

'I remember when critics used to give bad reviews for mediocre scripts'

In [179]:
# get review date
review.find("span", {"class": 'review-date'}).text

'18 December 2017'
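
The date comes back as a plain string. If it needs to be sorted or filtered later, it can be parsed with the standard library. A small sketch, assuming every date follows the day, month name, year pattern shown above and that month names are in English; `parse_review_date` is a hypothetical helper, not part of the original notebook:

```python
from datetime import datetime


def parse_review_date(raw):
    """Convert a string like '18 December 2017' into a datetime.date."""
    return datetime.strptime(raw.strip(), "%d %B %Y").date()


print(parse_review_date("18 December 2017"))  # 2017-12-18
```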

In [177]:
# user link
review.find("span", {'class': 'display-name-link'}).a['href']

# username
review.find("span", {'class': 'display-name-link'}).text

'zparadigm'

In [185]:
# get review text
print(review.find('div', {'class': 'text'}).text)

Lazy story writing and more plot-holes than an average movie experience. I just don't see anything bold about this story. Lets see bad guys have new tech that has a single weakpoint? CheckDeath Star related weapon? CheckNew Heroes in safety bubble while everyone around them dies? CheckCute aliens to sell toys? CheckHoth like battle on sand instead of snow? CheckSpace chase like episode 5... well not as exciting but CheckOk now explain bombs that fall in space. Leia surviving space long enough to force pull back to the cruiser. Dj learning about the shuttles. Rose being portrayed as a strong female character, yet throws the survival of the resistance away for a fangirl love of Finn. Phasma was a complete waste.
I enjoyed three moments in the film if I am being honest and the person next to me fell asleep in the middle and started snoring. When the movie ended no one clapped we all just sat there thinking WTF did we just watch.


In [206]:
import re

In [208]:
# get number of useful votes and total votes (commas stripped so a count like 1,065 is read as one number)
useful_count, vote_count = re.findall(r"\d+", review.find("div", {'class': 'actions text-muted'}).text.replace(",", ""))
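
The vote line yields two numbers, the upvote count and the total, and the downvotes follow by subtraction. A self-contained sketch of that step, using only `re`; the helper name `parse_votes` is hypothetical, and the wording of the sample strings is an assumption about what the page text looks like:

```python
import re


def parse_votes(text):
    """Return (upvotes, downvotes) from text holding an upvote count and a total count."""
    # Drop thousands separators so "1,065" is read as one number.
    numbers = re.findall(r"\d+", text.replace(",", ""))
    if len(numbers) < 2:
        raise ValueError("expected two counts in %r" % text)
    upvotes, total = int(numbers[0]), int(numbers[1])
    return upvotes, total - upvotes


# The wording of these sample strings is assumed, not taken from the page.
print(parse_votes("283 out of 345 found this helpful."))      # (283, 62)
print(parse_votes("1,065 out of 1,434 found this helpful."))  # (1065, 369)
```

Converting to `int` inside the helper keeps both columns numeric, which avoids mixing strings and integers later on.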

Cool, now I can wrap all of this together to build up the data I want.

In [242]:
reviews = imdb_soup.find_all("div", {'class': 'review-container'})

In [316]:
imdb_ratings = []
user_names = []
user_links = []
review_dates = []
review_text = []
upvotes = []
downvotes = []

skip_indexes = []
for i, review in enumerate(reviews):
    print("\rIteration %d out of %d" % (i + 1, len(reviews)), end="")

    try:
        # user rating (stays None when the review has no star rating)
        star_rating = review.find("div", {'class': 'ipl-ratings-bar'})
        if star_rating is not None:
            star_rating = int(star_rating.find('span').find('span').text)

        # username and user link
        user = review.find("span", {'class': 'display-name-link'})
        user_name = user.text
        user_link = user.a['href']

        # review date and review text
        review_date = review.find("span", {"class": 'review-date'}).text
        text = review.find('div', {'class': 'text'}).text

        # vote counts: strip thousands separators, then read upvotes and total
        vote_text = review.find("div", {'class': 'actions text-muted'}).text
        upvote_count, total_count = map(int, re.findall(r"\d+", vote_text.replace(",", "")))

    except AttributeError:
        print("\rskipped %-100d" % i)
        skip_indexes.append(i)
        continue

    # append only after every field has parsed, so all lists stay the same length
    imdb_ratings.append(star_rating)
    user_names.append(user_name)
    user_links.append(user_link)
    review_dates.append(review_date)
    review_text.append(text)
    upvotes.append(upvote_count)
    downvotes.append(total_count - upvote_count)

Iteration 3683 out of 3683

In [317]:
import pandas as pd

In [318]:
imdb_data = pd.DataFrame(
    dict(
        imdb_ratings = imdb_ratings,
        user_names = user_names,
        user_links = user_links,
        review_dates = review_dates,
        review_text = review_text,
        upvotes = upvotes,
        downvotes = downvotes))

In [319]:
imdb_data

| | downvotes | imdb_ratings | review_dates | review_text | upvotes | user_links | user_names |
|---|---|---|---|---|---|---|---|
| 0 | 62 | 2.0 | 18 December 2017 | Lazy story writing and more plot-holes than an... | 283 | http://www.imdb.com/user/ur83388943/?ref_=tt_urv | zparadigm |
| 1 | 13 | 4.0 | 18 December 2017 | I've read articles saying the THE LAST JEDI is... | 75 | http://www.imdb.com/user/ur6303192/?ref_=tt_urv | zvelf-1 |
| 2 | 56 | 1.0 | 20 December 2017 | As much as I wanted to love this movie, it utt... | 215 | http://www.imdb.com/user/ur83445313/?ref_=tt_urv | jonmit-88269 |
| 3 | 5 | 3.0 | 22 December 2017 | I can honestly say that i have never left a mo... | 34 | http://www.imdb.com/user/ur43330146/?ref_=tt_urv | gordonwright10-681-637828 |
| 4 | 135 | 4.0 | 16 December 2017 | To clarify before beginning my review, I'm a 3... | 447 | http://www.imdb.com/user/ur2978113/?ref_=tt_urv | penny514 |
| 5 | 369 | 2.0 | 14 December 2017 | After seeing the unbelievable rating of this m... | 1065 | http://www.imdb.com/user/ur24350271/?ref_=tt_urv | gotton2013 |
| 6 | 20 | 1.0 | 24 December 2017 | This is a reupload of my old review that gathe... | 83 | http://www.imdb.com/user/ur54742537/?ref_=tt_urv | realmuthaf |
| 7 | 118 | 2.0 | 21 December 2017 | I can't really get into what's bad with this m... | 371 | http://www.imdb.com/user/ur63423510/?ref_=tt_urv | nadblaster |
| 8 | 106 | 1.0 | 17 December 2017 | I was totally baffled during the movie and whe... | 327 | http://www.imdb.com/user/ur23034917/?ref_=tt_urv | tuanskie |
| 9 | 94 | 1.0 | 17 December 2017 | Wow this is getting hammered by reviews and no... | 293 | http://www.imdb.com/user/ur43600791/?ref_=tt_urv | Marcus James |


In [323]:
imdb_data.to_csv("../data/imdb_data.csv", index=False)

In [321]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
%matplotlib inline

In [322]:
(imdb_data.imdb_ratings
     .plot(kind = 'hist')
)

NameError: name '_converter' is not defined
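
The plotting call above ended in a `NameError`, so no histogram was produced. As a fallback that avoids plotting altogether, the rating distribution can be tallied with the standard library. This is a sketch rather than a fix for the error; the sample list holds the ratings of the ten rows displayed earlier, and `rating_counts` is a hypothetical helper:

```python
from collections import Counter


def rating_counts(ratings):
    """Count how often each star rating occurs, ignoring missing (None) ratings."""
    counts = Counter(r for r in ratings if r is not None)
    return dict(sorted(counts.items()))


sample = [2, 4, 1, 3, 4, 2, 1, 2, 1, 1]  # ratings from the ten rows shown earlier
for stars, n in rating_counts(sample).items():
    # one '#' per review with that rating
    print("%2d | %s" % (stars, "#" * n))
```

On the full data this could be fed from the `imdb_ratings` list built in the loop, where unrated reviews are stored as `None`.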