# IMDb Reviews Scraper
### Iordache Ioan-Bogdan
I chose to collect movie reviews from IMDb because they come with metadata that lets us build a dataset and run different machine learning algorithms on it. For example, we can try to detect the sentiment of a review by predicting its score, or estimate how useful a review was to other users. Both tasks can be framed as supervised learning, since the dataset is already "labeled" by this metadata: the score, and the number of users who marked a review as helpful or not helpful.

## Searching for a movie
All IMDb URLs for a given movie are based on a movie ID. For a better scraping experience, I chose to implement the logic for retrieving this ID by providing the movie title (not necessarily the exact title).

To do that, we pass a search string to the IMDb search URL. The search string needs some processing first: we remove non-alphanumeric characters (keeping spaces) and replace each run of whitespace with the '+' character.

In [1]:
import re


def movie_title_to_search_string(title: str) -> str:
    """ Remove non-alphanumerical characters and convert movie
        title string to the format accepted by IMDB search link
        (i.e. replace whitespaces with +)
    """
    regex = re.compile("[^a-zA-Z 0-9]")
    title = regex.sub(" ", title)
    # raw string avoids the invalid "\s" escape warning in newer Python versions
    return re.sub(r"\s+", "+", title)
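
As a quick check of the normalization described above, the sketch below restates the function and adds one assumption that is not in the original: it strips leading and trailing whitespace before joining, so a title ending in punctuation does not produce a trailing `+`. The name `to_search_string` is hypothetical.

```python
import re


def to_search_string(title: str) -> str:
    """Hypothetical variant of movie_title_to_search_string."""
    # replace every character that is not a letter, digit or space
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", " ", title)
    # assumption: trim the ends so no leading/trailing '+' is produced
    cleaned = cleaned.strip()
    # collapse each run of whitespace into a single '+'
    return re.sub(r"\s+", "+", cleaned)


print(to_search_string("Spider-Man 2"))  # Spider+Man+2
print(to_search_string("Mamma Mia!"))    # Mamma+Mia
```

Whether the trailing `+` matters to the search endpoint is not established here; the trimming only keeps the generated URL tidy.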

By searching for the first <i>td</i> tag with the <i>.result_text</i> class we are able to get the ID of the first movie returned by the search query. If that tag cannot be found we raise an exception. It is also a good idea to retrieve the official title of the movie.

In [2]:
import bs4
import requests
from typing import Tuple

headers = {"Accept-Language": "en-US"}


class MovieNotFoundException(Exception):
    pass


def get_movie_true_title_and_id(movie_title: str) -> Tuple[str, str]:
    """ Get IMDB id for given movie
    """
    search_string = movie_title_to_search_string(movie_title)
    response = requests.get(
        f"https://www.imdb.com/find?q={search_string}",
        headers = headers
    )
    souped_content = bs4.BeautifulSoup(response.content, "html.parser")
    try:
        first_result = souped_content.find("td", class_="result_text")
        movie_link = first_result.find("a")
        movie_title = movie_link.text
        movie_id = movie_link["href"].split('/')[2]
        return movie_title, movie_id
    except Exception as ex:
        # any lookup failure here means the search returned no usable result
        raise MovieNotFoundException(
            f"Movie with title '{movie_title}' could NOT be found"
        ) from ex


In [3]:
# Test get_movie_true_title_and_id
get_movie_true_title_and_id("spiderman 2")

('Spider-Man 2', 'tt0316654')
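
To make the `split('/')[2]` step concrete: assuming the search result links have the form `/title/<id>/...` (the example href below is illustrative, not captured from a live page), splitting on `/` yields an empty first element, then `title`, then the ID. The helper `movie_id_from_href` is a hypothetical name, and the check for `tt` followed by digits is an extra safeguard that is not part of the original code.

```python
import re


def movie_id_from_href(href: str) -> str:
    """Return the IMDb title ID from a link such as /title/tt0316654/."""
    parts = href.split("/")  # ['', 'title', 'tt0316654', ...]
    # hypothetical safeguard: reject links whose third part is not a title ID
    if len(parts) < 3 or not re.fullmatch(r"tt\d+", parts[2]):
        raise ValueError(f"no title ID found in {href!r}")
    return parts[2]


print(movie_id_from_href("/title/tt0316654/?ref_=fn_al_tt_1"))  # tt0316654
```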

## Extracting reviews
### One review from its container
On the reviews page of a movie, each review is wrapped in a <i>div</i> tag with the <i>.review-container</i> class. Below we implement the logic for extracting the information of a single review from its container. We extract:
 * review title (always present)
 * text, the content of the review (always present)
 * score, integer between 1 and 10 (can be missing)
 * user name (always present)
 * date of the review (in the format yyyy/mm/dd)
 * number of users that voted on the usefulness of this review
 * number of users that considered this review useful

Some processing has to be done on the review text. We remove all the tags but keep the text between them. If there are any URLs present in the text, we replace them with an <b>`[URL]`</b> token. Finally, we collapse consecutive white spaces into a single space character and strip the white spaces from the beginning and the end of the text string.

In [4]:
from datetime import datetime
from typing import List, NamedTuple, Optional


class Review(NamedTuple):
    title: str
    text: str
    score: Optional[int]
    user: str
    date: str
    helpfulness_votes: int
    positive_helpfulness_votes: int

    def __str__(self) -> str:
        return (
            f"{self.title} ({self.score}/10)\n" +
            f"by {self.user} on {self.date}\n" +
            f"{self.text}\n\n" +
            f"{self.positive_helpfulness_votes} out of " +
            f"{self.helpfulness_votes} found it useful"
        )


def process_review_text(text_div: bs4.element.Tag) -> str:
    """ Remove html tags and shrink consecutive whitespaces.
        Replace all URLs with a special token.
    """
    # remove all tags but keep the text data
    text = text_div.get_text(" ", strip=True)
    # replace links with [URL] token
    text = re.sub(r"https?://[^ ]+", "[URL]", text)
    # escape the dot so that only a literal "www." prefix is matched
    text = re.sub(r"www\.[^ ]+", "[URL]", text)
    # change multiple consecutive whitespaces into a single space
    text = re.sub(r"\s+", " ", text)
    return text


def extract_review_from_div(tag: bs4.element.Tag) -> Review:
    """ Extract all review info from the review-container div
    """
    title = tag.find("a", class_="title").text.strip()
    user = tag.find("span", class_="display-name-link").find("a").text
    date = tag.find("span", class_="review-date").text
    # change date format to numerical one
    date = datetime.strptime(date, "%d %B %Y").strftime("%Y/%m/%d")
    try:
        score = int(tag.find("span", class_="point-scale").previous_sibling.text)
    except Exception:
        score = None  # no score is specified for this review
    help_text = tag.find(
        "div", class_="actions text-muted"
    ).text.strip().split(' ')
    # the text starts with e.g. "26 out of 37 found this helpful."
    positive_helpfulness_votes = int(help_text[0].replace(',', ''))
    helpfulness_votes = int(help_text[3].replace(',', ''))

    # get review text
    text_div = tag.find("div", class_="text show-more__control")
    text = process_review_text(text_div)

    return Review(
        title=title, text=text, score=score, user=user, date=date,
        helpfulness_votes=helpfulness_votes,
        positive_helpfulness_votes=positive_helpfulness_votes
    )
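
The helpfulness line is parsed above by splitting on spaces and picking positions 0 and 3. A regular expression is a possible alternative that does not depend on word positions. The sketch below assumes the text always contains `N out of M`, as in the sample container used for testing; `parse_helpfulness` is a hypothetical helper name.

```python
import re
from typing import Tuple


def parse_helpfulness(text: str) -> Tuple[int, int]:
    """Return (positive_votes, total_votes) from the helpfulness line."""
    match = re.search(r"(\d[\d,]*)\s+out of\s+(\d[\d,]*)", text)
    if match is None:
        raise ValueError(f"unexpected helpfulness text: {text!r}")
    # drop thousands separators before converting, e.g. "1,234" -> 1234
    positive, total = (int(g.replace(",", "")) for g in match.groups())
    return positive, total


print(parse_helpfulness("26 out of 37 found this helpful."))        # (26, 37)
print(parse_helpfulness("1,234 out of 2,000 found this helpful."))  # (1234, 2000)
```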

We test the above methods on the code of a review container extracted from an IMDb page. Some changes were applied in order to catch some corner cases.

In [5]:
# Test extract_review_from_div

div = """<div class="review-container">
        <div class="lister-item-content">
    <div class="ipl-ratings-bar">
            <span class="rating-other-user-rating">
            <svg class="ipl-icon ipl-star-icon  " xmlns="http://www.w3.org/2000/svg" fill="#000000" height="24" viewBox="0 0 24 24" width="24">
                <path d="M0 0h24v24H0z" fill="none"/>
                <path d="M12 17.27L18.18 21l-1.64-7.03L22 9.24l-7.19-.61L12 2 9.19 8.63 2 9.24l5.46 4.73L5.82 21z"/>
                <path d="M0 0h24v24H0z" fill="none"/>
            </svg>
                <span>9</span><span class="point-scale">/10</span>
            </span>
    </div>
<a href="/review/rw5638050/?ref_=tt_urv"
class="title" > Just rewatching in 2020
</a>            <div class="display-name-date">
                    <span class="display-name-link"><a href="/user/ur68951879/?ref_=tt_urv"
>koyushun</a></span><span class="review-date">14 April 2020</span>
            </div>
            <div class="content">
                <div class="text show-more__control">I was a kid when I watched this in cinema back in 2004
I just want to say after all these years after a few version of Spider-Mans and all the MCU movie.
<br>
<br>
This one is hands down the best Superhero movie <a href="google.com">text to keep</a>.
It has everything done within 2 hours. Perfectly <a href="https://google.com">https://google.com</a> caught up what it left of from the previous Spider-Man and Toby Maguire will always be my Spider-Man <a href="www.google.com">www.google.com</a>.</div>
                <div class="actions text-muted">
                    26 out of 37 found this helpful.
                        <span>
                            Was this review helpful? <a href="/registration/signin?ref_=urv"
> Sign in</a> to vote.
                        </span>
                        <br/>
                    <a href="/review/rw5638050/?ref_=tt_urv"
>Permalink</a>
                </div>
            </div>
        </div>
        <div class="clear"></div>
    </div>
"""

# name the parser explicitly, as in the other cells, to avoid a bs4 warning
div_soup = bs4.BeautifulSoup(div, "html.parser").find("div", class_="review-container")
print(extract_review_from_div(div_soup))


Just rewatching in 2020 (9/10)
by koyushun on 2020/04/14
I was a kid when I watched this in cinema back in 2004 I just want to say after all these years after a few version of Spider-Mans and all the MCU movie. This one is hands down the best Superhero movie text to keep . It has everything done within 2 hours. Perfectly [URL] caught up what it left of from the previous Spider-Man and Toby Maguire will always be my Spider-Man [URL] .

26 out of 37 found it useful


### Many reviews
Scraping all of the reviews for a movie is a little trickier than iterating through the review containers on the page. Pagination on IMDb is done by clicking the "Load more" button at the bottom of the page. This button does not change the URL of the page; instead, it makes an asynchronous request that fetches the new components to be rendered on the current page. To simulate that request we need a "pagination key" that is specified in an attribute of the load button. Using that key we can retrieve more reviews, along with the pagination key for the next batch.

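Using this method we could scrape every review, but we stop requesting new pages once at least 200 reviews have been collected for a given movie. Because the limit is checked once per page, the final count can slightly exceed 200.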

In [35]:
MAX_REVIEW_COUNT = 200


def scrape_reviews(movie_id: str) -> List[Review]:
    """Get reviews for given movie id (stops paging at MAX_REVIEW_COUNT)
    """
    request_link = f"https://www.imdb.com/title/{movie_id}/reviews"
    review_divs = []
    while len(review_divs) < MAX_REVIEW_COUNT:
        response = requests.get(
            request_link,
            headers = headers
        )
        souped_content = bs4.BeautifulSoup(response.content, "html.parser")
        review_divs.extend(
            souped_content.find_all("div", class_="review-container")
        )
        
        # load more link
        load_more_div = souped_content.find("div", class_="load-more-data")
        try:
            pagination_key = load_more_div["data-key"]
        except Exception:
            break
        request_link = (
            f"https://www.imdb.com/title/{movie_id}/reviews/_ajax?ref_=undefined&" +
            f"paginationKey={pagination_key}"
        )
    reviews = []
    for review_div in review_divs:
        reviews.append(extract_review_from_div(review_div))

    return reviews
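
The key-following loop can be illustrated without network access. In the sketch below, `FAKE_PAGES` and `fetch_page` are stand-ins invented for the example (they are not IMDb's API): each page returns a batch of items and the key of the next page, or `None` when there is nothing more to load. Unlike `scrape_reviews` above, this version also truncates the result, so the cap is exact rather than checked once per page.

```python
from typing import Dict, List, Optional, Tuple

# hypothetical stand-in for the server: key -> (items, next key)
FAKE_PAGES: Dict[Optional[str], Tuple[List[str], Optional[str]]] = {
    None: (["r1", "r2", "r3"], "k1"),
    "k1": (["r4", "r5", "r6"], "k2"),
    "k2": (["r7"], None),
}


def fetch_page(key: Optional[str]) -> Tuple[List[str], Optional[str]]:
    """Pretend to request one batch; a real scraper would call the site."""
    return FAKE_PAGES[key]


def collect(max_items: int) -> List[str]:
    """Follow pagination keys until the cap or the last page is reached."""
    items: List[str] = []
    key: Optional[str] = None  # None means "first page"
    while len(items) < max_items:
        batch, key = fetch_page(key)
        items.extend(batch)
        if key is None:  # no "load more" key: last page reached
            break
    return items[:max_items]  # enforce the cap exactly


print(collect(5))    # ['r1', 'r2', 'r3', 'r4', 'r5']
print(collect(100))  # ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7']
```

Whether to truncate is a design choice: keeping the full last page wastes nothing that was already downloaded, while truncating gives every movie the same upper bound.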

In [25]:
# Test scrape_reviews
import random


reviews = scrape_reviews("tt0316654")
print(len(reviews))
print(random.choice(reviews))

25
~ MORE Web-Slinging Fun ~ (9/10)
by Aysen08 on 2005/01/06
Normally sequels are hugely disappointing in comparison to the original film. So it's hard to believe that they could even come close to topping this movies original, alas they do! Picking up where the first movie starts off and introducing several new comic book regulars, we bypass the history establishing story of the original and head straight into what life is like for Peter Parker being a college student, Spiderman, and trying to hold down a job. This time Spidey must face the likes of Doc Ock, portrayed by Alfred Molina (he's come a long way since Indiana Jones cameos) excellently. Once again they couldn't have picked a better actor to portray the villain Spiderman has to face. Spiderman must take out Doc, save the city, save the girl, and still get good grades. The action sequences once again are entrancing and the story is well written. Kudos once again to the actors involved and Sam Raimi on creating another hit. Loo

## Building a dataset
Now we are able to run these functions for a list of movie titles and retrieve the reviews for each of them. We store them in an in-memory DataFrame, which we can also write to disk as a CSV file for later use.

In [36]:
from tqdm import tqdm
import pandas as pd

MOVIES = pd.read_csv("movie_titles.csv", header=0)["Title"]
COLUMNS = ["movie", "title", "text", "score", "user", "date", "helpfulness_votes", "positive_helpfulness_votes"]


all_reviews = []
for movie in tqdm(MOVIES):
    try:
        movie_title, movie_id = get_movie_true_title_and_id(movie)
    except MovieNotFoundException as ex:
        print(ex)
        continue
    
    reviews = scrape_reviews(movie_id)
    all_reviews.extend(
        (
            (
                movie_title, review.title, review.text, review.score, review.user,
                review.date, review.helpfulness_votes, review.positive_helpfulness_votes
            ) 
            for review in reviews
        )
    )

 42%|████▏     | 417/1000 [48:03<57:47,  5.95s/it]  Movie with title 'The Headhunter's Calling' could NOT be found
100%|██████████| 1000/1000 [1:52:22<00:00,  6.74s/it]


In [37]:
import random

import pandas as pd


random.shuffle(all_reviews)
# build the DataFrame before reporting its length
df = pd.DataFrame(all_reviews, columns=COLUMNS)
print(len(df))
df.to_csv("imdb_reviews.csv")
df.head(10)

23683


Unnamed: 0,movie,title,text,score,user,date,helpfulness_votes,positive_helpfulness_votes
0,Knight of Cups,Existential journey of love and loss,This film is more of an exploration of emotion...,9.0,DesertMirage,2016/09/28,5,2
1,Hotel Transylvania 2,The is very good,Good movie good good good good good good good ...,10.0,suhailitwins,2018/08/31,0,0
2,Sausage Party,S**t Party,Here's the entire script Food 1: F**k Food 2: ...,1.0,davelaidlaw-37831,2016/10/30,257,136
3,2012,Oh dear...,I went into this with very low expectations fo...,1.0,hoju_31,2009/12/14,20,13
4,Mamma Mia!,Absolutely Fantastic!,"An excellent cast, beautiful location, and abs...",10.0,Jessicat_McGonagall,2008/07/20,5,4
5,Zombieland,Feelgood Zombie Flick,"This movie is an amiable, if rather aimless at...",,JoeytheBrit,2010/03/29,1,1
6,Pandorum,Powerful Story Marred By Mind Boggling Violence!,I'm started to get a bit concerned for lovers ...,5.0,liberalgems,2009/09/30,17,7
7,It's Only the End of the World,Life in a day,"Xavier Dolan, where have you been all my life?...",8.0,Morten_5,2017/06/07,2,0
8,Horrible Bosses 2,Uncessary fun!,Well The sequel might be unnecessary but I was...,7.0,unsalakgun,2019/04/21,5,2
9,The Pursuit of Happyness,"It's a sucky life, and just when you think it ...",I get the feeling that this was produced just ...,5.0,Howlin Wolf,2008/02/18,18,11


In [38]:
from collections import Counter

# count the number of scraped reviews per movie
cnt = Counter(df["movie"])

cnt

rankenstein': 164,
         'The Souvenir': 116,
         'Tusk': 200,
         'The Judge': 200,
         'Seven Pounds': 200,
         'Why Him?': 200,
         'Wanted': 200,
         'A Dark Song': 155,
         'Straight Outta Compton': 200,
         'Fantastic Four': 200,
         'Ghost Rider': 200,
         'Shin Godzilla': 201,
         'The Gift': 200,
         'Green Lantern': 200,
         'The Invisible Guest': 200,
         'Terminator Salvation': 200,
         'Regression': 118,
         'Brooklyn': 200,
         'Equals': 148,
         'Cinderella': 200,
         'Mechanic: Resurrection': 200,
         'Mr. Right': 200,
         'Alexander and the Terrible, Horrible, No Good, Very Bad Day': 112,
         'Frozen': 200,
         'Big Hero 6': 200,
         'Percy Jackson & the Olympians: The Lightning Thief': 200,
         'Denial': 114,
         'PK': 224,
         'Dope': 117,
         'The Great Gatsby': 200,
         'The Longest Ride': 153,
         'Kingsman: The S