Scraping the top 250 movies by Metascore on IMDb using Beautiful Soup

The final DataFrame will contain the following columns:

* `name` - title of the movie
* `year` - release year of the movie
* `rating` - user score of the movie
* `m_score` - Metascore of the movie
* `votes` - number of votes

First, we import the required packages.

In [1]:
import bs4
import requests
import time
import random as ran
import sys
import pandas as pd

Since scraping the data is an iterative process, we define separate functions for each purpose.

First, we define a function that extracts the targeted elements from a single movie block (one entry in the list).

In [31]:
def scrape_mblock(movie_block):
    """Extract the targeted fields from one movie block; a field that cannot be parsed is set to None."""

    movieb_data = {}

    try:
        movieb_data['name'] = movie_block.find('a').get_text()  # name of the movie
    except Exception:
        movieb_data['name'] = None

    try:
        # release year; [1:-1] drops the characters wrapped around the year text
        movieb_data['year'] = str(movie_block.find('span', {'class': 'lister-item-year'}).contents[0][1:-1])
    except Exception:
        movieb_data['year'] = None

    try:
        #movieb_data['rating'] = float(movie_block.find('div',{'class':'inline-block ratings-imdb-rating'}).get('data-value')) #rating
        # NOTE: this selector reads the metascore span, so 'rating' mirrors the Metascore in the output
        movieb_data['rating'] = float(movie_block.find('span', {'class': 'metascore'}).get_text())
    except Exception:
        movieb_data['rating'] = None

    try:
        # Metascore; a multi-word class value matches only that exact class string
        movieb_data['m_score'] = float(movie_block.find('span', {'class': 'metascore favorable'}).contents[0].strip())
    except Exception:
        movieb_data['m_score'] = None

    try:
        movieb_data['votes'] = int(movie_block.find('span', {'name': 'nv'}).get('data-value'))  # number of votes
    except Exception:
        movieb_data['votes'] = None

    return movieb_data
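
One detail of the selectors above is worth illustrating. Beautiful Soup treats a multi-word `class` value such as `'metascore favorable'` as an exact string match, so a block whose score span carries a different modifier is skipped and `m_score` falls back to `None`. The sketch below uses a made-up HTML fragment (the markup and the `mixed` modifier are assumptions, not copied from IMDb) to compare that lookup with matching on the single class `metascore`. The helper names are hypothetical.

```python
import bs4

# Hypothetical markup, loosely modelled on the spans targeted above.
html = """
<div class="lister-item-content">
  <span class="metascore favorable">100 </span>
</div>
<div class="lister-item-content">
  <span class="metascore mixed">60 </span>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
blocks = soup.find_all("div", {"class": "lister-item-content"})


def exact_class_score(block):
    # Matches only when the class attribute is exactly "metascore favorable".
    tag = block.find("span", {"class": "metascore favorable"})
    return float(tag.get_text().strip()) if tag else None


def any_metascore(block):
    # Matches any span that has "metascore" among its classes.
    tag = block.find("span", {"class": "metascore"})
    return float(tag.get_text().strip()) if tag else None


exact = [exact_class_score(b) for b in blocks]
loose = [any_metascore(b) for b in blocks]
print(exact)  # [100.0, None]
print(loose)  # [100.0, 60.0]
```

If IMDb marks lower scores with a different modifier class, this would be consistent with a row that has a `rating` value but an empty `m_score`.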
    

Next, we create a function that scrapes all movie blocks on a single page of the list.

In [32]:
def scrape_m_page(movie_blocks):
    # scrape every movie block on the page and collect the resulting dicts
    return [scrape_mblock(block) for block in movie_blocks]

We now have functions that extract all movie data from a single page.

The next function applies them to successive pages of the list until we have scraped data for the targeted number of movies.

In [28]:
def scrape_this(link, t_count):
    """Scrape successive pages of the list at `link` until at least `t_count` movies are covered."""

    base_url = link
    target = t_count

    remaining_mcount = target
    new_page_number = 1
    movie_data = []

    while remaining_mcount > 0:

        url = base_url + str(new_page_number)

        source = requests.get(url).text
        soup = bs4.BeautifulSoup(source, 'html.parser')

        movie_blocks = soup.find_all('div', {'class': 'lister-item-content'})

        # whole pages are collected, so the result may hold more than t_count movies
        movie_data.extend(scrape_m_page(movie_blocks))

        # the pagination text gives the range covered by this page; keep the part before "of"
        page_range = soup.find("span", {"class": "pagination-range"}).get_text().split("of")[0]
        current_mcount_start = int(page_range.split("-")[0].strip())
        current_mcount_end = int(page_range.split("-")[1].strip())
        print(current_mcount_start)

        remaining_mcount = target - current_mcount_end

        print('\r' + "currently scraping movies from: " + str(current_mcount_start) + " - "+str(current_mcount_end), "| remaining count: " + str(remaining_mcount), flush=True, end ="")

        new_page_number += 1

        # pause for a random 0 to 10 seconds between requests
        time.sleep(ran.randint(0, 10))

    return movie_data
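
The loop's bookkeeping depends on the pagination text. As a minimal sketch, assuming that text looks like `1 - 100 of 250` (the exact wording is an assumption inferred from the `split` calls above), the hypothetical helper `parse_range` below extracts both ends of the range with a regular expression and computes the remaining count the same way the loop does.

```python
import re


def parse_range(text):
    """Return (start, end) from text such as '1 - 100 of 250' (assumed format)."""
    match = re.search(r"(\d[\d,]*)\s*-\s*(\d[\d,]*)", text)
    if match is None:
        raise ValueError(f"no range found in {text!r}")
    start, end = (int(part.replace(",", "")) for part in match.groups())
    return start, end


target = 150
ranges = [parse_range("1 - 100 of 250"), parse_range("101 - 200 of 250")]
remaining = [target - end for _, end in ranges]
print(ranges)     # [(1, 100), (101, 200)]
print(remaining)  # [50, -50]
```

With a target of 150 the remaining count drops to -50 after the second page, so the loop stops there, having collected two whole pages.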
    
    

Finally, we combine the functions created above to scrape the top 150 movies on the list.

In [33]:
base_scraping_link = "https://www.imdb.com/list/ls041970465/?sort=list_order,asc&st_dt=&mode=detail&page="

top_movies = 150 #input("How many movies do you want to scrape?")
films = []

films = scrape_this(base_scraping_link,int(top_movies))

print('\r'+"List of top " + str(top_movies) +" movies:" + "\n", end="\n")
pd.DataFrame(films)

1
currently scraping movies from: 1 - 100 | remaining count: 50
101
currently scraping movies from: 101 - 200 | remaining count: -50
List of top 150 movies:



| | name | year | rating | m_score | votes |
|---|---|---|---|---|---|
| 0 | The Godfather | 1972 | 100.0 | 100.0 | 1908453 |
| 1 | Casablanca | 1942 | 100.0 | 100.0 | 583285 |
| 2 | Rear Window | 1954 | 100.0 | 100.0 | 501749 |
| 3 | Citizen Kane | 1941 | 100.0 | 100.0 | 450527 |
| 4 | Vertigo | 1958 | 100.0 | 100.0 | 410772 |
| ... | ... | ... | ... | ... | ... |
| 195 | V for Vendetta | 2005 | 62.0 | 62.0 | 1140432 |
| 196 | The Help | 2011 | 62.0 | 62.0 | 472265 |
| 197 | The Green Mile | 1999 | 61.0 | 61.0 | 1334502 |
| 198 | Judgment at Nuremberg | 1961 | 60.0 | NaN | 79920 |
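
Because whole pages are collected, the result can hold more rows than requested: 150 movies were asked for, yet the index above runs to 198. A minimal sketch of trimming to the requested count follows, using a stand-in list rather than the scraped data (the names `records` and `trim_to_target` are hypothetical).

```python
def trim_to_target(records, target):
    """Keep at most `target` records, preserving list order."""
    return records[:max(target, 0)]


# Stand-in for the scraped list: 200 dummy records in list order.
records = [{"name": f"movie {i}", "votes": None} for i in range(1, 201)]
top = trim_to_target(records, 150)
print(len(top), top[0]["name"], top[-1]["name"])  # 150 movie 1 movie 150
```

With pandas, the equivalent on the real data would be `pd.DataFrame(films).head(top_movies)`.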


In [30]:
pd.DataFrame(films).to_csv("films.csv", index=False)