<h1>IMDB Scraper</h1>

<p>This is a scraper for gleaning data from imdb.com. We are mostly trying to get the budget and box office data for titles.</p>

In [352]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import json
import pyarrow.feather as feather

<h2>Running an advanced search</h2>
<p>First, we will run an advanced search on imdb.com.</p>
<p>We are filtering in the search:</p>
<ul>
    <li>feature films</li>
    <li>from 2010-2021</li>
    <li>made in USA</li>
    <li>longer than an 60 min</li>
   </ul>

In [279]:
#I need to hard code the url for imdb's specific advanced search fields that include commas. This function will return 
#the search results url for a page in the pagination of the search specified by count

PER_PAGE = 250

def get_search_url(count):
    date = '2010-01-01,2021-12-31'
    countries = 'us'
    runtime = '60'
    start = (count * PER_PAGE) + 1
    search_string = f'https://www.imdb.com/search/title/?title_type=feature&release_date={date}&countries={countries}&runtime={runtime},&count={PER_PAGE}&start={start}&ref=adv_nxt'
    
    return requests.get(search_string)

In [280]:
#get first 5000 in search

MAX_RESULTS = 5000
num_of_pages = MAX_RESULTS // PER_PAGE

results = []

for i in range(num_of_pages):
    r = get_search_url(i)
    soup = BeautifulSoup(r.text, 'html.parser')
    
    all_titles = soup.find_all('div', class_='lister-item-content')
    
    for title in all_titles:
        result = {
            'id' : title.a['href'][-10:-1],
            'url' : title.a['href'],
            'title' : title.a.text,
            'year' : title.find('span', class_="lister-item-year").text if title.find('span', class_="lister-item-year") != None else None,
            'genre' : title.find('span', class_='genre').text.strip().strip('\n').split(',') if title.find('span', class_='genre') != None else None,
            'certificate' : title.find('span', class_='certificate').text if title.find('span', class_='certificate') != None else None,
            'runtime' : title.find('span', class_='runtime').text if title.find('span', class_='runtime') != None else None,
            'imdb_rating' : title.find('div', class_='ratings-imdb-rating').find('strong').text if title.find('div', class_='ratings-imdb-rating') != None else None,
            'metascore' : title.find('span', class_="metascore mixed").text.strip() if title.find('span', class_="metascore mixed") != None else None,
        }
        
        results.append(result)

    time.sleep(2)
        
    print(f'collected {i+1} / {num_of_pages} pages', end="\r")



collected 20 / 20 pages

In [281]:
len(results)

5000

<h2>Getting budget and box office info</h2>

<p>We need to iterate to each title we just scraped to get the budget and box office info.</p>

In [305]:
#Some information isn't available from the details in the imdb search page. We need to go to each page
#to get budget and box office info, as well as cast and director

def get_data_from_dataid(dataid, final_data_holder, soup, isNumber=False):
    data_li = soup.select(dataid)
    data = [y.get_text() for x in data_li for y in x.find_all(final_data_holder, {'class':"ipc-metadata-list-item__list-content-item"})]
    
    if len(data) > 0:
        if isNumber:
            data = int(''.join([x for x in data[0] if x.isnumeric()]))
        return data
    else:
        return


for i, result in enumerate(results):
    r = requests.get('https://imdb.com' + result['url'])
    soup = BeautifulSoup(r.text, 'html.parser')
    
    
    metadata = soup.find('script')
    meta_json = json.loads(metadata.contents[0])
    meta_keys = meta_json.keys()
    result['principals'] = [{'name':x['name'], 'id':x['url'][-10:-1]} for x in meta_json['actor']] if 'actor' in meta_keys else None
    result['director'] = [{'name':x['name'], 'id':x['url'][-10:-1]} for x in meta_json['director']] if 'director' in meta_keys else None
    result['creator'] = [{'id':x['url'][-10:-1]} for x in meta_json['creator']] if 'creator' in meta_keys else None
    
    result['budget'] = get_data_from_dataid('li[data-testid="title-boxoffice-budget"]', 'span', soup, isNumber = True)
    result['domestic_box_office'] = get_data_from_dataid('li[data-testid="title-boxoffice-grossdomestic"]', 'span', soup, isNumber = True)
    result['worldwide_box_office'] = get_data_from_dataid('li[data-testid="title-boxoffice-cumulativeworldwidegross"]', 'span', soup, isNumber = True)
    result['origin'] = get_data_from_dataid('li[data-testid="title-details-origin"]', 'a', soup)
    
    
    print(f'getting more info for {i+1} / {len(results)} results', end="\r")

getting more info for 1784 / 5000 results

In [341]:
imdb_df = pd.DataFrame(results)

In [342]:
#clean numbers

def string_to_int(string):
    return int(''.join([x for x in string if x.isnumeric()])) if string != None else None

def string_to_float(string):
    return float(''.join([x for x in string if x.isnumeric() or x =='.'])) if string != None else None

imdb_df.year = imdb_df.year.map(string_to_int)
imdb_df.runtime = imdb_df.runtime.map(string_to_int)
imdb_df.imdb_rating = imdb_df.imdb_rating.map(string_to_float)
imdb_df.metascore = imdb_df.metascore.map(string_to_int)

In [347]:
for i in range(3):
    imdb_df[f'genre_{i+1}'] = imdb_df['genre'].map(lambda x: x[i].replace(' ','') if len(x) > i else None)

In [348]:
imdb_df.head()

Unnamed: 0,id,url,title,year,genre,certificate,runtime,imdb_rating,metascore,principals,director,creator,budget,domestic_box_office,worldwide_box_office,origin,genre_1,genre_2,genre_3
0,tt1477834,/title/tt1477834/,Aquaman,2018,"[Action, Adventure, Fantasy]",PG-13,143,6.8,55.0,"[{'name': 'Jason Momoa', 'id': 'nm0597388'}, {...","[{'name': 'James Wan', 'id': 'nm1490123'}]","[{'id': 'co0002663'}, {'id': 'co0283444'}, {'i...",160000000.0,335104314.0,1148528000.0,"[United States, Australia]",Action,Adventure,Fantasy
1,tt1879016,/title/tt1879016/,Operation Mincemeat,2021,"[Drama, War]",PG-13,128,6.7,,"[{'name': 'Colin Firth', 'id': 'nm0000147'}, {...","[{'name': 'John Madden', 'id': 'nm0006960'}]","[{'id': 'co0230132'}, {'id': 'co0243890'}, {'i...",,,12288590.0,"[United Kingdom, United States]",Drama,War,
2,tt4513678,/title/tt4513678/,Ghostbusters: Afterlife,2021,"[Adventure, Comedy, Fantasy]",PG-13,124,7.1,45.0,"[{'name': 'Carrie Coon', 'id': 'nm4689420'}, {...","[{'name': 'Jason Reitman', 'id': 'nm0718646'}]","[{'id': 'co0050868'}, {'id': 'co0309252'}, {'i...",75000000.0,129360575.0,197360600.0,"[United States, Canada]",Adventure,Comedy,Fantasy
3,t10954652,/title/tt10954652/,Old,2021,"[Drama, Horror, Mystery]",PG-13,108,5.8,55.0,"[{'name': 'Gael García Bernal', 'id': 'nm03055...","[{'name': 'M. Night Shyamalan', 'id': 'nm07961...","[{'id': 'co0005073'}, {'id': 'co0054054'}, {'i...",,48276510.0,90146510.0,"[United States, Japan]",Drama,Horror,Mystery
4,t10872600,/title/tt10872600/,Spider-Man: No Way Home,2021,"[Action, Adventure, Fantasy]",PG-13,148,8.3,,"[{'name': 'Tom Holland', 'id': 'nm4043618'}, {...","[{'name': 'Jon Watts', 'id': 'nm1218281'}]","[{'id': 'co0050868'}, {'id': 'co0532247'}, {'i...",200000000.0,804747988.0,1892748000.0,[United States],Action,Adventure,Fantasy


In [349]:
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    5000 non-null   object 
 1   url                   5000 non-null   object 
 2   title                 5000 non-null   object 
 3   year                  5000 non-null   int64  
 4   genre                 5000 non-null   object 
 5   certificate           4508 non-null   object 
 6   runtime               5000 non-null   int64  
 7   imdb_rating           4999 non-null   float64
 8   metascore             1560 non-null   float64
 9   principals            4999 non-null   object 
 10  director              5000 non-null   object 
 11  creator               4995 non-null   object 
 12  budget                2429 non-null   float64
 13  domestic_box_office   2646 non-null   float64
 14  worldwide_box_office  3384 non-null   float64
 15  origin               

<h2>Save to file</h2>

<p>Everything looks good for now, so we will save to file. We're using pyarrow to save to a feather file, so we can preserve the list and dictionary structures we scraped</p>

In [350]:
#export to feather to preserve the data structures (lists and dictionaries) we created in the scrape
#we will clean up the dataset in the next notebook

feather.write_feather(imdb_df, 'imdb_scrape_full.feather')