# IMDb Reviews Scraper
### Iordache Ioan-Bogdan
I chose to collect movie reviews from IMDb because they come with metadata that lets us build a labeled dataset and run different Machine Learning algorithms on it. We may try to detect the sentiment of a review by predicting its score, or estimate how useful a review was to other users. Both tasks can be framed as supervised learning, since the dataset is already "labeled" by this metadata: the score, and the number of users who marked a review as helpful or not helpful.

## Searching for a movie
All IMDb URLs for a given movie are based on a movie ID. To make scraping more convenient, I implemented a lookup that retrieves this ID from a movie title (not necessarily the exact title).

To do that, we use IMDb's search function by passing the search string in the search URL. The search string needs some preprocessing first: we replace non-alphanumeric characters with spaces and then replace each run of whitespace with a '+' character.

In [1]:
import re


def movie_title_to_search_string(title: str) -> str:
    """ Remove non-alphanumerical characters and convert movie
        title string to the format accepted by IMDB search link
        (i.e. replace whitespaces with +)
    """
    regex = re.compile("[^a-zA-Z 0-9]")
    title = regex.sub(" ", title)
    # raw string avoids the invalid "\s" escape warning in newer Pythons
    return re.sub(r"\s+", "+", title)
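
To make the conversion concrete, here is a minimal sketch that applies the same two substitutions to a few titles. The function name `to_search_string` is hypothetical, and the extra `strip()` call is my own addition (it is not in the cell above); it only avoids a leading or trailing '+' when a title starts or ends with punctuation.

```python
import re


def to_search_string(title: str) -> str:
    """Illustrative copy of the conversion described above (hypothetical name)."""
    # replace everything except ASCII letters, digits and spaces with a space
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", " ", title)
    # strip() is an addition for this sketch, not part of the notebook code
    return re.sub(r"\s+", "+", cleaned.strip())


print(to_search_string("Spider-Man 2"))  # Spider+Man+2
print(to_search_string("Se7en!"))        # Se7en
print(to_search_string("Amélie"))        # Am+lie
```

Note the last case: accented letters are not in the `a-zA-Z` range, so they are dropped as well. Since the search does not need the exact title, this is usually acceptable, but it is worth keeping in mind for non-English titles.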

By searching for the first <i>td</i> tag with the <i>.result_text</i> class, we get the ID of the first movie returned by the search query. If that tag cannot be found, we raise an exception. We also retrieve the official title of the movie, since the title we searched for need not be exact.

In [2]:
import bs4
import requests
from typing import Tuple

headers = {"Accept-Language": "en-US"}


class MovieNotFoundException(Exception):
    pass


def get_movie_true_title_and_id(movie_title: str) -> Tuple[str, str]:
    """ Get the official title and the IMDb ID of the first
        search result for the given movie title
    """
    search_string = movie_title_to_search_string(movie_title)
    response = requests.get(
        f"https://www.imdb.com/find?q={search_string}",
        headers = headers
    )
    souped_content = bs4.BeautifulSoup(response.content, "html.parser")
    try:
        first_result = souped_content.find("td", class_="result_text")
        movie_link = first_result.find("a")
        movie_title = movie_link.text
        movie_id = movie_link["href"].split('/')[2]
        return movie_title, movie_id
    except Exception as ex:
        # chain the original error so the root cause stays visible
        raise MovieNotFoundException(
            f"Movie with title '{movie_title}' could NOT be found"
        ) from ex


In [3]:
# Test get_movie_true_title_and_id
get_movie_true_title_and_id("spiderman 2")

('Spider-Man 2', 'tt0316654')
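
The ID extraction itself boils down to taking the third path segment of the result link. The sketch below shows that step in isolation on a sample `href`; the `ref_` query value in it is made up, and `imdb_id_from_href` is a hypothetical helper, not part of the notebook. It is a slightly stricter alternative that only accepts links of the form `/title/tt<digits>`.

```python
import re
from typing import Optional


def imdb_id_from_href(href: str) -> Optional[str]:
    """Return the 'tt...' ID from a title link, or None for other links."""
    match = re.search(r"/title/(tt\d+)", href)
    return match.group(1) if match else None


href = "/title/tt0316654/?ref_=fn_al_tt_1"  # sample link, query value is made up

# positional approach, as used in the function above
print(href.split("/")[2])                     # tt0316654
# pattern-based approach
print(imdb_id_from_href(href))                # tt0316654
print(imdb_id_from_href("/name/nm0000123/"))  # None
```

The positional split would happily return `nm0000123` for the last link, whereas the pattern-based variant rejects it. Whether that matters depends on what kinds of results the search page returns first.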

## Extracting reviews
### One review from its container
A single review appears on the reviews page of a movie encapsulated inside a <i>.review-container div</i> tag. Below we implement the logic for extracting information for a single review from its container. We extract:
 * review title (always present)
 * text, the content of the review (always present)
 * score, integer between 1 and 10 (can be missing)
 * user name (always present)
 * date of the review (in the format yyyy/mm/dd)
 * number of users that voted on the usefulness of this review
 * number of users that considered this review useful

Some processing has to be done on the review text. We remove all the tags but keep the text between them. Any URLs present in the text are replaced with an <b>`[URL]`</b> token. Finally, we collapse each run of consecutive whitespace into a single space character and strip whitespace from the beginning and the end of the text.

In [4]:
from datetime import datetime
from typing import List, NamedTuple, Optional


class Review(NamedTuple):
    title: str
    text: str
    score: Optional[int]
    user: str
    date: str
    helpfulness_votes: int
    positive_helpfulness_votes: int

    def __str__(self) -> str:
        return (
            f"{self.title} ({self.score}/10)\n" +
            f"by {self.user} on {self.date}\n" +
            f"{self.text}\n\n" +
            f"{self.positive_helpfulness_votes} out of " +
            f"{self.helpfulness_votes} found it useful"
        )


def process_review_text(text_div: bs4.element.Tag) -> str:
    """ Remove html tags and shrink consecutive whitespaces.
        Replace all URLs with a special token.
    """
    # remove all tags but keep the text data
    text = text_div.get_text(" ", strip=True)
    # replace links with [URL] token
    text = re.sub(r"https?://[^ ]+", "[URL]", text)
    # escape the dot so that only a literal "www." prefix matches
    text = re.sub(r"www\.[^ ]+", "[URL]", text)
    # change multiple consecutive whitespaces into a single space
    text = re.sub(r"\s+", " ", text)
    return text


def extract_review_from_div(tag: bs4.element.Tag) -> Review:
    """ Extract all review info from the review-container div
    """
    title = tag.find("a", class_="title").text.strip()
    user = tag.find("span", class_="display-name-link").find("a").text
    date = tag.find("span", class_="review-date").text
    # change date format to numerical one
    date = datetime.strptime(date, "%d %B %Y").strftime("%Y/%m/%d")
    try:
        score = int(tag.find("span", class_="point-scale").previous_sibling.text)
    except Exception:
        score = None  # no score is specified for this review
    help_text = tag.find(
        "div", class_="actions text-muted"
    ).text.strip().split(' ')
    # the text looks like "26 out of 37 found this helpful."
    positive_helpfulness_votes = int(help_text[0].replace(',', ''))
    helpfulness_votes = int(help_text[3].replace(',', ''))

    # get review text
    text_div = tag.find("div", class_="text show-more__control")
    text = process_review_text(text_div)

    return Review(
        title=title, text=text, score=score, user=user, date=date,
        helpfulness_votes=helpfulness_votes,
        positive_helpfulness_votes=positive_helpfulness_votes
    )
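
The vote counts above are read by position from the whitespace-split text. As a hedged alternative (not what the notebook uses), the same two numbers can be pulled out with a regular expression, which fails loudly if the wording of the line ever changes. The name `parse_helpfulness` and the exact pattern are assumptions based on the "26 out of 37 found this helpful." text seen in the review container.

```python
import re
from typing import Tuple

# pattern assumed from the text observed in the review container
HELPFUL_RE = re.compile(r"([\d,]+) out of ([\d,]+) found this helpful")


def parse_helpfulness(text: str) -> Tuple[int, int]:
    """Return (positive_votes, total_votes) parsed from the helpfulness line."""
    match = HELPFUL_RE.search(text)
    if match is None:
        raise ValueError(f"unexpected helpfulness text: {text!r}")
    # remove thousands separators before converting to int
    positive, total = (int(group.replace(",", "")) for group in match.groups())
    return positive, total


print(parse_helpfulness("26 out of 37 found this helpful."))        # (26, 37)
print(parse_helpfulness("1,234 out of 5,678 found this helpful."))  # (1234, 5678)
```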

We test the above functions on the HTML of a review container extracted from an IMDb page. The HTML was modified slightly to cover a few corner cases, such as links inside the review text.

In [5]:
# Test extract_review_from_div

div = """<div class="review-container">
        <div class="lister-item-content">
    <div class="ipl-ratings-bar">
            <span class="rating-other-user-rating">
            <svg class="ipl-icon ipl-star-icon  " xmlns="http://www.w3.org/2000/svg" fill="#000000" height="24" viewBox="0 0 24 24" width="24">
                <path d="M0 0h24v24H0z" fill="none"/>
                <path d="M12 17.27L18.18 21l-1.64-7.03L22 9.24l-7.19-.61L12 2 9.19 8.63 2 9.24l5.46 4.73L5.82 21z"/>
                <path d="M0 0h24v24H0z" fill="none"/>
            </svg>
                <span>9</span><span class="point-scale">/10</span>
            </span>
    </div>
<a href="/review/rw5638050/?ref_=tt_urv"
class="title" > Just rewatching in 2020
</a>            <div class="display-name-date">
                    <span class="display-name-link"><a href="/user/ur68951879/?ref_=tt_urv"
>koyushun</a></span><span class="review-date">14 April 2020</span>
            </div>
            <div class="content">
                <div class="text show-more__control">I was a kid when I watched this in cinema back in 2004
I just want to say after all these years after a few version of Spider-Mans and all the MCU movie.
<br>
<br>
This one is hands down the best Superhero movie <a href="google.com">text to keep</a>.
It has everything done within 2 hours. Perfectly <a href="https://google.com">https://google.com</a> caught up what it left of from the previous Spider-Man and Toby Maguire will always be my Spider-Man <a href="www.google.com">www.google.com</a>.</div>
                <div class="actions text-muted">
                    26 out of 37 found this helpful.
                        <span>
                            Was this review helpful? <a href="/registration/signin?ref_=urv"
> Sign in</a> to vote.
                        </span>
                        <br/>
                    <a href="/review/rw5638050/?ref_=tt_urv"
>Permalink</a>
                </div>
            </div>
        </div>
        <div class="clear"></div>
    </div>
"""

# name the parser explicitly, as in the other cells, to avoid a bs4 warning
div_soup = bs4.BeautifulSoup(div, "html.parser").find("div", class_="review-container")
print(extract_review_from_div(div_soup))


Just rewatching in 2020 (9/10)
by koyushun on 2020/04/14
I was a kid when I watched this in cinema back in 2004 I just want to say after all these years after a few version of Spider-Mans and all the MCU movie. This one is hands down the best Superhero movie text to keep . It has everything done within 2 hours. Perfectly [URL] caught up what it left of from the previous Spider-Man and Toby Maguire will always be my Spider-Man [URL] .

26 out of 37 found it useful


### Many reviews
Scraping all of the reviews for a movie is a little trickier than iterating through the review containers on the page. Pagination on IMDb is done by clicking the "Load more" button at the bottom of the page. This button does not change the URL of the page; instead, it issues an asynchronous request that fetches the new components to be rendered on the current page. To simulate that request we need a "pagination key", which is specified in an attribute of the load button. With that key we can retrieve more reviews, along with the pagination key for the next batch.

In principle this method lets us scrape every review, but we chose to stop requesting more pages once at least 50 reviews have been collected for a given movie (`MAX_REVIEW_COUNT` below).

In [6]:
MAX_REVIEW_COUNT = 50


def scrape_reviews(movie_id: str) -> List[Review]:
    """Get reviews for given movie id; stops requesting new pages
       once at least MAX_REVIEW_COUNT reviews were collected
    """
    request_link = f"https://www.imdb.com/title/{movie_id}/reviews"
    review_divs = []
    while len(review_divs) < MAX_REVIEW_COUNT:
        response = requests.get(
            request_link,
            headers = headers
        )
        souped_content = bs4.BeautifulSoup(response.content, "html.parser")
        review_divs.extend(
            souped_content.find_all("div", class_="review-container")
        )
        
        # load more link
        load_more_div = souped_content.find("div", class_="load-more-data")
        try:
            pagination_key = load_more_div["data-key"]
        except Exception:
            break
        request_link = (
            f"https://www.imdb.com/title/{movie_id}/reviews/_ajax?ref_=undefined&" +
            f"paginationKey={pagination_key}"
        )
    reviews = [
        extract_review_from_div(review_div) for review_div in review_divs
    ]

    return reviews
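
The control flow of the key-based pagination can be tried without touching the network. The sketch below replaces the HTTP request with a hypothetical `fetch` callable that maps a pagination key to a batch of items and the next key; the page contents and key names are made up. Unlike the function above, this sketch also truncates the result to the requested maximum.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (items on this page, pagination key of the next page or None)
Page = Tuple[List[str], Optional[str]]


def collect(fetch: Callable[[Optional[str]], Page], max_items: int) -> List[str]:
    """Follow pagination keys until max_items are collected or pages run out."""
    items: List[str] = []
    key: Optional[str] = None  # None stands for the first page
    while len(items) < max_items:
        batch, key = fetch(key)
        items.extend(batch)
        if key is None:  # no "load more" key, so this was the last page
            break
    return items[:max_items]


# made-up pages standing in for the responses of the reviews endpoint
FAKE_PAGES: Dict[Optional[str], Page] = {
    None: (["r1", "r2"], "k1"),
    "k1": (["r3", "r4"], "k2"),
    "k2": (["r5"], None),
}

print(collect(FAKE_PAGES.__getitem__, 3))    # ['r1', 'r2', 'r3']
print(collect(FAKE_PAGES.__getitem__, 100))  # ['r1', 'r2', 'r3', 'r4', 'r5']
```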

In [7]:
# Test scrape_reviews
import random


reviews = scrape_reviews("tt0316654")
print(len(reviews))
print(random.choice(reviews))

50
The best Spiderman movie (10/10)
by freemantle_uk on 2008/05/06
I'm a big fan of comic books and comic book conversation when done right. Spiderman is on of the best comic best to be made and is an important cultural figure and important to Marvel Comics. The first and the third films were both good, the first was an introduction and the third really underused Venom. Spiderman 2, like X-Men 2 is best in the film franchise. Spiderman is meant to be more family friendly and a little serious then say the X-Men films and Batman Begins, but is still very fun and avoids going down the camp or comic root which is easy to do. Spiderman 2 takes place after the first film and the credits give you a brief run down of what happened. Peter Parker (Tobey Maguire) is struggling in his life, he had rejected the love of his life, Mary-Jane Watson (Kirsten Durst), his friendship with Harry Osborn (James Franco) is strained because he believed Spiderman killed his father and Peter wouldn't tell him an

## Building a dataset
Now we can run these functions for a list of movie titles and retrieve the reviews for each of them. We store them in an in-memory DataFrame, and we also write them to disk as a CSV file for later use.

In [8]:
from tqdm import tqdm
import pandas as pd

MOVIES = pd.read_csv("movie_titles.csv", header=0)["Title"]
COLUMNS = ["movie", "title", "text", "score", "user", "date", "helpfulness_votes", "positive_helpfulness_votes"]


all_reviews = []
for movie in tqdm(MOVIES):
    try:
        movie_title, movie_id = get_movie_true_title_and_id(movie)
    except MovieNotFoundException as ex:
        print(ex)
        continue
    
    reviews = scrape_reviews(movie_id)
    all_reviews.extend(
        (
            (
                movie_title, review.title, review.text, review.score, review.user,
                review.date, review.helpfulness_votes, review.positive_helpfulness_votes
            ) 
            for review in reviews
        )
    )

 42%|████▏     | 417/1000 [16:47<18:03,  1.86s/it]Movie with title 'The Headhunter's Calling' could NOT be found
100%|██████████| 1000/1000 [40:13<00:00,  2.41s/it]
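
The row tuples in the loop above list every `Review` field by hand. As a small sketch of an alternative, the column names and rows can be derived from the named tuple itself. The sample reviews below are made up, `reviews_to_rows` is a hypothetical helper, and the standard `csv` module is used only to keep the example free of third-party dependencies; the same rows could be passed to `pd.DataFrame(rows, columns=COLUMNS)` as in the notebook.

```python
import csv
import io
from typing import List, NamedTuple, Optional


class Review(NamedTuple):  # same fields as the Review class defined earlier
    title: str
    text: str
    score: Optional[int]
    user: str
    date: str
    helpfulness_votes: int
    positive_helpfulness_votes: int


# column names derived from the tuple definition instead of typed by hand
COLUMNS = ("movie",) + Review._fields


def reviews_to_rows(movie_title: str, reviews: List[Review]) -> List[tuple]:
    """Prefix every review tuple with the movie title."""
    return [(movie_title,) + tuple(review) for review in reviews]


# made-up sample data, for illustration only
sample = [
    Review("Great", "Loved it", 9, "alice", "2020/04/14", 37, 26),
    Review("Meh", "Not for me", None, "bob", "2019/01/02", 5, 1),
]
rows = reviews_to_rows("Spider-Man 2", sample)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(COLUMNS)
writer.writerows(rows)  # a missing score (None) is written as an empty field
print(buffer.getvalue())
```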


In [9]:
import pandas as pd
import random


random.shuffle(all_reviews)
df = pd.DataFrame(all_reviews, columns=COLUMNS)
df.to_csv("imdb_reviews.csv")
print(len(df))
df.head(10)

46856


| | movie | title | text | score | user | date | helpfulness_votes | positive_helpfulness_votes |
|---|---|---|---|---|---|---|---|---|
| 0 | Couples Retreat | Worst movie I've seen in a while... | I watched this for free On Demand and still fe... | 5.0 | glyeakley | 2010/09/25 | 47 | 30 |
| 1 | Disaster Movie | The irony is very correct. | Since the name of the movie is "Disaster Movie... |  | mewte | 2008/08/29 | 411 | 345 |
| 2 | Seven Psychopaths | Skinny, toothless, and blind | Hmm, it's quite risky with all these movies th... | 6.0 | mircea-lungu | 2013/04/21 | 24 | 11 |
| 3 | The Counselor | A dark, bleak masterpiece about predators - 10/10 | Don't believe the bad reviews here: If you lov... | 10.0 | rockenrohl | 2013/10/26 | 139 | 86 |
| 4 | Scouts Guide to the Zombie Apocalypse | Scouts vs. Zombies gotta like that! | Comedy and Horror mix well in this coming of a... | 6.0 | philipmorrison-73118 | 2016/01/24 | 3 | 2 |
| 5 | Unbroken | Fail | The movie failed to inspire, motivate, or even... | 2.0 | rchevall | 2014/12/25 | 154 | 75 |
| 6 | Carol | Terribly slow and predictable drama | Mara plays Therese who falls in love with a mu... | 1.0 | CineCritic2517 | 2015/10/19 | 115 | 50 |
| 7 | 28 Weeks Later | Rip roaring, Genre explosion, takes the series... | 28 Weeks Later, the sequel to the Danny Boyle ... | 9.0 | myrkeyjones | 2007/05/15 | 296 | 149 |
| 8 | In Bruges | Superb. | It's rare to find a film that can go from hila... | 10.0 | imdb-19548 | 2008/04/01 | 37 | 21 |
| 9 | JFK | "I like a man who's not afraid of bad odds." | I've avoided watching this film intentionally ... | 10.0 | classicsoncall | 2018/04/26 | 7 | 4 |


In [10]:
from collections import Counter

# count the reviews per movie directly from the column
cnt = Counter(df["movie"])

cnt

4,
         'Planet Terror': 50,
         'The Belko Experiment': 50,
         'Tropic Thunder': 50,
         'Transformers: Age of Extinction': 50,
         'Tracktown': 10,
         'Handsome Devil': 37,
         'The Imitation Game': 50,
         'Popstar: Never Stop Never Stopping': 50,
         'No Strings Attached': 50,
         'Sex Tape': 50,
         'Masterminds': 50,
         'Nocturnal Animals': 50,
         'The Judge': 50,
         'Apocalypto': 50,
         'Love, Rosie': 50,
         'Max Steel': 50,
         'The Book of Life': 50,
         'Ex Machina': 50,
         'Across the Universe': 50,
         'Blue Valentine': 50,
         'The Odyssey': 74,
         'Pandorum': 50,
         'Arrival': 50,
         'Transformers: Dark of the Moon': 50,
         'The Boy': 50,
         'Collide': 50,
         'The Hunger Games: Mockingjay - Part 2': 50,
         'I Am the Pretty Thing That Lives in the House': 50,
         'The Ridiculous 6': 50,
         'Birth of the Dragon'