# Scraping IMDb Reviews: A Walkthrough

A step-by-step guide to using my scrapers to collect movie reviews from IMDb.

In [1]:
import os
import sys

import pandas as pd

sys.path.append(os.path.abspath(".."))
from src.imdb_review_scrapers import scrape_reviews_all_movies
from src.imdb_url_scraper import get_imdb_url
from src.movie_data_scraper import clean_movie_data, scrape_movie_data

## Step 1: Scrape movie data from The Numbers

In [2]:
url = "https://www.the-numbers.com/market/2024/top-grossing-movies"
scraped_df = scrape_movie_data(url)

In [3]:
scraped_df

Unnamed: 0,Rank,Movie,Release Date,Distributor,Genre,2024 Gross,Tickets Sold
0,1,Inside Out 2,"Jun 14, 2024",Walt Disney,Adventure,"$652,980,194",57734765.0
1,2,Deadpool & Wolverine,"Jul 26, 2024",Walt Disney,Action,"$636,745,858",56299369.0
2,3,Wicked,"Nov 22, 2024",Universal,Musical,"$432,943,285",38279689.0
3,4,Moana 2,"Nov 27, 2024",Walt Disney,Musical,"$404,017,489",35722148.0
4,5,Despicable Me 4,"Jul 3, 2024",Universal,Adventure,"$361,004,205",31919028.0
...,...,...,...,...,...,...,...
695,696,After Death,"Oct 27, 2023",Angel Studios,Documentary,$705,62.0
696,697,50 km/h,"Jul 19, 2024",Sony Pictures,Comedy,$543,48.0
697,698,The Road Dance,"Oct 13, 2023",Music Box Films,Drama,$510,45.0
698,699,Alimañas,"Sep 13, 2024",Sony Pictures,Comedy,$485,43.0


## Step 2: Clean the movie data

Cleaning includes:
- adding ``movie_id`` column
- normalizing column names
- converting ``release_date`` to date type
- extracting ``release_year``
- converting values in ``gross_2024`` and ``tickets_sold`` to float
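
To make the cleaning steps listed above concrete, here is a minimal pandas sketch of what such a function could look like. It is not the actual code of ``clean_movie_data()``; the function name ``clean_movie_data_sketch`` and the exact renaming map are my assumptions, chosen to reproduce the column names shown in the output of this step.

```python
import pandas as pd


def clean_movie_data_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning routine; the real clean_movie_data() may differ."""
    # Normalize column names (assumed mapping, based on the tables in this notebook).
    out = df.rename(columns={
        "Rank": "rank",
        "Movie": "movie_title",
        "Release Date": "release_date",
        "Distributor": "distributor",
        "Genre": "genre",
        "2024 Gross": "gross_2024",
        "Tickets Sold": "tickets_sold",
    }).reset_index(drop=True)

    # Add a movie_id column that simply counts rows from 0.
    out.insert(0, "movie_id", range(len(out)))

    # Convert strings such as "Jun 14, 2024" to datetimes, then extract the year.
    out["release_date"] = pd.to_datetime(out["release_date"], format="%b %d, %Y")
    out["release_year"] = out["release_date"].dt.year

    # Strip "$" and "," from the gross column and cast both numeric columns to float.
    out["gross_2024"] = (
        out["gross_2024"].str.replace(r"[$,]", "", regex=True).astype(float)
    )
    out["tickets_sold"] = out["tickets_sold"].astype(float)
    return out


# Two rows copied from the scraped table above.
raw = pd.DataFrame({
    "Rank": [1, 5],
    "Movie": ["Inside Out 2", "Despicable Me 4"],
    "Release Date": ["Jun 14, 2024", "Jul 3, 2024"],
    "Distributor": ["Walt Disney", "Universal"],
    "Genre": ["Adventure", "Adventure"],
    "2024 Gross": ["$652,980,194", "$361,004,205"],
    "Tickets Sold": [57734765, 31919028],
})
cleaned = clean_movie_data_sketch(raw)
print(cleaned.dtypes)
```

In this sketch ``release_date`` stays a pandas datetime; the ``YYYY/MM/DD`` strings shown in the output of this step would need an extra formatting step.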

In [4]:
cleaned_df = clean_movie_data(scraped_df)

In [5]:
cleaned_df

Unnamed: 0,movie_id,rank,movie_title,release_date,distributor,genre,gross_2024,tickets_sold,release_year
0,0,1,Inside Out 2,2024/06/14,Walt Disney,Adventure,652980194.0,57734765.0,2024
1,1,2,Deadpool & Wolverine,2024/07/26,Walt Disney,Action,636745858.0,56299369.0,2024
2,2,3,Wicked,2024/11/22,Universal,Musical,432943285.0,38279689.0,2024
3,3,4,Moana 2,2024/11/27,Walt Disney,Musical,404017489.0,35722148.0,2024
4,4,5,Despicable Me 4,2024/07/03,Universal,Adventure,361004205.0,31919028.0,2024
...,...,...,...,...,...,...,...,...,...
695,695,696,After Death,2023/10/27,Angel Studios,Documentary,705.0,62.0,2023
696,696,697,50 km/h,2024/07/19,Sony Pictures,Comedy,543.0,48.0,2024
697,697,698,The Road Dance,2023/10/13,Music Box Films,Drama,510.0,45.0,2023
698,698,699,Alimañas,2024/09/13,Sony Pictures,Comedy,485.0,43.0,2024


In [6]:
pd.set_option('display.max_colwidth', None)
cleaned_df['movie_title'].head(20)

0                     Inside Out 2
1             Deadpool & Wolverine
2                           Wicked
3                          Moana 2
4                  Despicable Me 4
5          Beetlejuice Beetlejuice
6                   Dune: Part Two
7                         Twisters
8     Godzilla x Kong: The New Em…
9                  Kung Fu Panda 4
10           Bad Boys: Ride or Die
11    Kingdom of the Planet of th…
12                    Gladiator II
13            Sonic the Hedgehog 3
14                 It Ends With Us
15                  The Wild Robot
16           Venom: The Last Dance
17          A Quiet Place: Day One
18           Mufasa: The Lion King
19     Ghostbusters: Frozen Empire
Name: movie_title, dtype: object

⚠️ **Heads-up**: Long movie titles are not fully displayed in The Numbers table (see movies 8 and 11 above), so those titles need to be edited manually before the next step.
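
One possible way to handle the manual edit is to list the rows whose title ends with the ellipsis character and then apply hand-written corrections keyed by ``movie_id``. The snippet below is only a sketch on a toy dataframe: the titles and the ``full_titles`` mapping are made up for illustration and are not part of my scrapers.

```python
import pandas as pd

# Toy data with one hypothetical truncated title (ends with "…").
titles = pd.DataFrame({
    "movie_id": [0, 1, 2],
    "movie_title": [
        "Short Title",
        "A Very Long Hypothetical Movie T…",
        "Another Short One",
    ],
})

# Rows whose title was cut off in the source table.
truncated = titles[titles["movie_title"].str.endswith("…")]
print(truncated)

# Hand-written corrections keyed by movie_id (hypothetical values).
full_titles = {1: "A Very Long Hypothetical Movie Title"}

# Use the correction where one exists, otherwise keep the original title.
fixed = titles.copy()
fixed["movie_title"] = (
    fixed["movie_id"].map(full_titles).fillna(fixed["movie_title"])
)
print(fixed)
```

Keying the corrections by ``movie_id`` rather than by the truncated string avoids collisions if two long titles share the same visible prefix.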

ℹ️ Because this is a small example to illustrate the procedure, I will take only the first five movies and search for their review URLs on IMDb.

In [7]:
cleaned_subset = cleaned_df.head(5)
cleaned_subset

Unnamed: 0,movie_id,rank,movie_title,release_date,distributor,genre,gross_2024,tickets_sold,release_year
0,0,1,Inside Out 2,2024/06/14,Walt Disney,Adventure,652980194.0,57734765.0,2024
1,1,2,Deadpool & Wolverine,2024/07/26,Walt Disney,Action,636745858.0,56299369.0,2024
2,2,3,Wicked,2024/11/22,Universal,Musical,432943285.0,38279689.0,2024
3,3,4,Moana 2,2024/11/27,Walt Disney,Musical,404017489.0,35722148.0,2024
4,4,5,Despicable Me 4,2024/07/03,Universal,Adventure,361004205.0,31919028.0,2024


## Step 3: Search for user review URLs by movie title

ℹ️ Because several search results may share the same movie title, the function ``get_imdb_url()`` also tries to match the release year to get the correct movie link.
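
The title-plus-year matching idea can be sketched in a few lines of plain Python. This is not the code of ``get_imdb_url()``: the helper ``pick_match``, the candidate dictionaries, and the ``example.com`` URLs are all hypothetical stand-ins for whatever the search step returns.

```python
from typing import Optional


def pick_match(title: str, year: int, candidates: list[dict]) -> Optional[dict]:
    """Return the first candidate whose 'Title (Year)' label matches, else None."""
    target = f"{title} ({year})".casefold()
    for candidate in candidates:
        label = f"{candidate['title']} ({candidate['year']})".casefold()
        if label == target:
            return candidate
    return None


# Made-up search results that share a title but differ in year.
candidates = [
    {"title": "Wicked", "year": 2021, "url": "https://example.com/a"},
    {"title": "Wicked", "year": 2024, "url": "https://example.com/b"},
]

match = pick_match("Wicked", 2024, candidates)
print(match["url"] if match else "no match")
```

Returning ``None`` when nothing matches lets the caller flag the movie for manual review instead of silently taking the first search result.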

In [8]:
url_df = get_imdb_url(cleaned_subset)


         Title: Inside Out 2 (2024)
Trying match: Inside Out 2 (2024)

         Title: Deadpool & Wolverine (2024)
Trying match: Deadpool & Wolverine (2024)

         Title: Wicked (2024)
Trying match: Wicked (2024)

         Title: Moana 2 (2024)
Trying match: Moana 2 (2024)

         Title: Despicable Me 4 (2024)
Trying match: Despicable Me 4 (2024)


## Step 4: Scrape movie reviews

ℹ️ The function ``scrape_reviews_all_movies()`` scrapes IMDb reviews for every movie in the dataframe, from the earliest available review up to one week after the movie’s release. This enables downstream tasks such as predicting box office sales from pre- and post-release sentiment.
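
The one-week window boils down to simple date arithmetic. The sketch below shows the idea with the standard library; the helper names ``review_cutoff`` and ``within_window`` are mine and are not taken from ``scrape_reviews_all_movies()``.

```python
from datetime import date, timedelta


def review_cutoff(release_date: date, days_after: int = 7) -> date:
    """Last review date to keep: the release date plus a fixed number of days."""
    return release_date + timedelta(days=days_after)


def within_window(review_date: date, release_date: date) -> bool:
    """True if the review was posted on or before the cutoff date."""
    return review_date <= review_cutoff(release_date)


release = date(2024, 11, 22)                        # Wicked's release date
print(review_cutoff(release))                       # cutoff date for this release
print(within_window(date(2024, 11, 19), release))   # pre-release review
print(within_window(date(2024, 12, 5), release))    # review after the window
```

For a release on 2024/11/22 this gives a cutoff of 2024/11/29, the same value that appears in the scraping log of this step.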

For illustration, I will scrape reviews from one movie.

In [None]:
wicked_df = url_df.iloc[[2]]
wicked_reviews = scrape_reviews_all_movies(wicked_df)



Scraping reviews for movie: Wicked...
***Scraping review 1...***
Parsed review date: 2024/11/19 | Cutoff: 2024/11/29
Scraped review 1
***Scraping review 2...***
Parsed review date: 2024/11/19 | Cutoff: 2024/11/29
Scraped review 2
***Scraping review 3...***
Parsed review date: 2024/11/19 | Cutoff: 2024/11/29
Scraped review 3
***Scraping review 4...***
Parsed review date: 2024/11/19 | Cutoff: 2024/11/29
Scraped review 4
***Scraping review 5...***
Parsed review date: 2024/11/19 | Cutoff: 2024/11/29
Scraped review 5
***Scraping review 6...***
Parsed review date: 2024/11/19 | Cutoff: 2024/11/29
Scraped review 6
***Scraping review 7...***
Parsed review date: 2024/11/19 | Cutoff: 2024/11/29
Scraped review 7
***Scraping review 8...***
Parsed review date: 2024/11/19 | Cutoff: 2024/11/29
Scraped review 8
***Scraping review 9...***
Parsed review date: 2024/11/19 | Cutoff: 2024/11/29
Scraped review 9
***Scraping review 10...***
Parsed review date: 2024/11/19 | Cutoff: 2024/11/29
Scraped review 1

In [None]:
wicked_reviews.head()

Unnamed: 0,movie_id,movie_title,release_date,review_title,review_author,review_date,rating,review_content,helpful_count_up,helpful_count_down
0,2,Wicked,2024/11/22,WICKED IS A MASTERPIECE!!!,popcultureqveen,2024/11/19,10,"WOW!!! This film exceeded my expectations. Can we talk about how BLOWN AWAY I was by both Ariana and Cynthia's performances, and honestly the entire cast!\n\nAriana wasn't kidding when she said she put her entire heart and soul into preparing for the role of Glinda. This is a dream come true for her, and I couldn't think of anyone else for this role.\n\nCynthia's belting was out of this world. She stepped into the role of Elphaba with ease.\n\nThe singing was LIVE mind you.... including the stunts.\n\nI will be watching a million times in theaters (and hopefully soon on digital) in anticipation for Part 2!",358,465
1,2,Wicked,2024/11/22,Best movie of 2024!!!,darvindw,2024/11/19,10,This movie is by far one of the best movies of all time. The director is a genius and the actors did an amazong job!!! Ariana Grande and Cynthia Erivo are not only able to showcase big emotions but also are in a superior league when it comes to vocal performance. Arianas powerful vocals combined with extrodinary vocal control are able to get you to tears as well as the power and pureness of Cynthia Erivos voice. Both of them as well as the other actors are able to push you to tears but also get you to joy. Especially Grande has a great sense of humor and is going to make you laugh throughout the movie!,154,290
2,2,Wicked,2024/11/22,"NO JUST NO, NOT NEEDED",magiciancolin,2024/11/19,2,"This film's cinematography and sets is the only thing going for it. The story isn't anything new from the musical. The costuming is ugly, the acting is okay and Ariana Grande is at least trying to do anything with what she is given. There is nothing here and once again we have an over hyped dumpster fire the will probably be given award nominations because it's catering towards that. If you're a fan of films that are remakes, then this if the film for you. Fans of the musical. Then this is not the film for you and should stick the the musical and avoid the film as it'll ruin the musical for you. I still don't know how they will make a sequel. That's DEFINITELY NOT NEEDED.",105,176
3,2,Wicked,2024/11/22,a masterpiece,agustiniluis,2024/11/19,10,"Cinema is back on! 'Wicked' casts a spellbinding charm on the big screen! Director John M. Chu masterfully adapts Gregory Maguire's beloved novel and the stage show, bringing Oz's iconic witches to life. Cynthia Erivo shines as Elphaba, complexity and depth radiating from every note. Ariana Grande dazzles as Glinda, comedic timing and vulnerability perfectly balanced. Their chemistry ignites a poignant exploration of friendship, prejudice, and self-discovery. A must-see musical spectacle for the whole family, probably the biggest movie of the year and a big award contender. Could it give movies like 'Titanic' and 'Return of the King' a run for their money??",88,188
4,2,Wicked,2024/11/22,"Wow! Epic, stunning, genius movie musical!",mdebard,2024/11/19,10,"Advance screening of the new movie Wicked (part1): Wow! Spectacular, creative, wonderful dancing and singing and sets. Almost like old MGM massive musical movies from the '60's with CGI. I loved the brief cameo of Kristen Chenowith and Idina Menzel singing with Ariana Grande and Cynthia Erivo. The latter two leads have great chemistry together with wonderful voices and acting. A sweeping 2.5 hour epic that justifies its two parts. Incredible costumes. Michelle Yeoh is wonderful as Madam Morrible. Some changes from the musical are both necessary and wonderful additions to the movie. Make sure to see it when it opens this weekend!",76,174


In [None]:
wicked_reviews.shape

(100, 10)

## Step 5: Save the dataframes

In [None]:
cleaned_df.to_csv("../data/cleaned_movie_df.csv", index=False)
url_df.to_csv("../data/url_df.csv", index=False)
wicked_reviews.to_csv("../data/wicked_reviews.csv", index=False)
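
Review texts contain commas and line breaks, so it can be worth checking that a saved file reads back unchanged. The snippet below is a small round-trip sketch using an in-memory buffer instead of the real files; the one-row dataframe is made up for illustration.

```python
import io

import pandas as pd

# Made-up review row with a comma and a blank line inside the text.
reviews = pd.DataFrame({
    "movie_id": [2],
    "review_date": ["2024/11/19"],
    "review_content": ["First paragraph, with a comma.\n\nSecond paragraph."],
})

# Write to an in-memory buffer the same way as to a file (no index column).
buffer = io.StringIO()
reviews.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back, parsing the date column into datetimes.
restored = pd.read_csv(buffer, parse_dates=["review_date"])
print(restored.dtypes)
print(restored.loc[0, "review_content"] == reviews.loc[0, "review_content"])
```

Passing ``index=False`` when saving also keeps an extra unnamed index column from appearing when the file is loaded again.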