# <center> IMDb Top 500 Movies Scraping </center>

-- Portfolio Project by Samrat Kundu
- LinkedIn: https://www.linkedin.com/in/samratkundu97/

- This is Part 1 of the project

# Project Outline:

IMDb_link: https://www.imdb.com/list/ls050782187/?sort=list_order,asc&st_dt=&mode=detail&page=1
<p>Extracting the following information for the top 500 IMDb movies:</p>
<ul>
    <li>Title</li>
    <li>Release year</li>
    <li>Film Rating</li>
    <li>Runtime</li>
    <li>Genre</li>
    <li>IMDb rating</li>
    <li>Metascore</li>
    <li>Description</li>
    <li>Director name</li>
    <li>Votes</li>
    <li>Gross</li>
</ul>
Putting all the information in a CSV file

# Import Necessary Modules:

In [1]:
# requests
import requests

# Beautiful Soup
from bs4 import BeautifulSoup

# Pandas
import pandas as pd

# Numpy
import numpy as np

# sleep
from time import sleep

# randint
from random import randint

# Check
print('Imported Successfully')

Imported Successfully


# Getting Access

In [2]:
# store the list URL in a variable named imdb_url and set a browser-like User-Agent header
imdb_url = 'https://www.imdb.com/list/ls050782187/?sort=list_order,asc&st_dt=&mode=detail&page=1'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'}

# using requests to access the url
response = requests.get(imdb_url, headers=headers)

# checking the status code
response.status_code

200
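
The cell above checks the status code by eye. When the same request runs inside a loop, the check can be automated. Below is a minimal sketch, assuming a hypothetical helper named `fetch_page` (not part of the original notebook). The `get` argument is expected to behave like `requests.get`; it is passed in only so the helper can be tried without network access.

```python
def fetch_page(url, get, headers=None, timeout=10):
    """Return the page HTML, or raise if the server did not answer 200.

    `get` is any callable shaped like requests.get (hypothetical parameter,
    used here so the sketch does not need network access to be tried).
    """
    response = get(url, headers=headers, timeout=timeout)
    if response.status_code != 200:
        raise RuntimeError(f'{url} returned status {response.status_code}')
    return response.text
```

In the notebook this would be called as `html = fetch_page(imdb_url, requests.get, headers=headers)`, and the result passed to `BeautifulSoup`.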

# Using Beautiful Soup to start scraping

In [3]:
# parse the page HTML into a BeautifulSoup object named soup
soup = BeautifulSoup(response.text, 'lxml')

In [4]:
# Scrape the first five movie titles

# all_movies holds the content div of each movie
all_movies = soup.find_all('div', class_='lister-item-content')

# print the first five movie titles
for movie in all_movies[:5]:
    print(movie.find('h3', class_='lister-item-header').a.text.strip())

The Godfather
The Silence of the Lambs
Star Wars: Episode V - The Empire Strikes Back
The Shawshank Redemption
The Shining


- Now we have access to the main div tag of each movie
- Let's try to scrape all the necessary data for the first movie

In [5]:
# Scrape all the necessary data for the first movie

print('All the data for the first movie\n')

first_movie = all_movies[0]

# movie title
print('Movie Title : ',  first_movie.find('h3').a.text.strip())

# Release Year
print('Release Year : ', first_movie.find('span', class_='lister-item-year text-muted unbold').text.strip())

# Film Rating
print('Film Rating : ', first_movie.find('span', class_='certificate').text.strip())

# Runtime
print('Run Time : ', first_movie.find('span', class_='runtime').text.strip())

# Genre 
print('Genre : ', first_movie.find('span', class_='genre').text.strip())

# IMDb rating
print('IMDb Rating : ', first_movie.find('span', class_='ipl-rating-star__rating').text.strip())

# Metascore
print('Metascore : ', first_movie.find('div', class_='inline-block ratings-metascore').span.text)

# Description
print('Description : ', first_movie.find_all('p')[1].text.strip())

# Director 
print('Director : ', first_movie.find_all('p', class_='text-muted text-small')[1].find_all('a')[0].text)

# Votes
print('Votes : ', first_movie.find_all('p', class_='text-muted text-small')[2].find_all('span')[1].text)

#Gross
print('Gross : ', first_movie.find_all('p', class_='text-muted text-small')[2].find_all('span')[-1].text)


All the data for the first movie

Movie Title :  The Godfather
Release Year :  (1972)
Film Rating :  A
Run Time :  175 min
Genre :  Crime, Drama
IMDb Rating :  9.2
Metascore :  100        
Description :  Don Vito Corleone, head of a mafia family, decides to hand over his empire to his youngest son Michael. However, his decision unintentionally puts the lives of his loved ones in grave danger.
Director :  Francis Ford Coppola
Votes :  1,929,034
Gross :  $134.97M
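
The cell above calls `.text` directly on every lookup. That raises `AttributeError` when `find` returns `None` (for example, a movie without a certificate) and `IndexError` when `find_all` returns fewer items than expected. The following is a rough sketch of two small guards; the helper names `text_or_nan` and `item_or_none` are hypothetical and not part of the original notebook.

```python
import math


def text_or_nan(tag):
    """Return the stripped text of a tag, or NaN when the tag is None."""
    return tag.text.strip() if tag is not None else math.nan


def item_or_none(items, index):
    """Return items[index], or None when the list is too short."""
    try:
        return items[index]
    except IndexError:
        return None
```

With these, a lookup such as `text_or_nan(first_movie.find('span', class_='certificate'))` or `text_or_nan(item_or_none(first_movie.find_all('p'), 1))` yields NaN for a missing element instead of stopping the scrape.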


<h3>Snapshot of the first movie from the website</h3>

<img src='the_Godfather.jpg'>

<h4>Verdict for the first scrape</h4>

- Every value is correct except the film rating.
- It should be R, but the scrape returned A.
- 'R' (Restricted) is a US MPA rating, while 'A' (Adults Only) is an Indian CBFC certificate, so IMDb appears to have served the certificate for a different region.
- Both mark the film as unsuitable for younger audiences, so the scraped value is still valid.
- We will watch the other film ratings later to see whether more alternative values appear.

# Automation
- Now use a loop to automate the whole task by iterating over each page of the list
- Create a DataFrame
- Export the DataFrame to a CSV file
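
The list spans several pages that differ only in the `page` query parameter. As a small sketch, the page URLs can be built from their parts with the standard library instead of by string formatting. `LIST_URL` and `page_url` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

# Base address of the IMDb list used throughout this notebook.
LIST_URL = 'https://www.imdb.com/list/ls050782187/'


def page_url(page):
    """Build the detail-view URL for one page of the list."""
    params = {'sort': 'list_order,asc', 'st_dt': '', 'mode': 'detail', 'page': page}
    # safe=',' keeps the comma in 'list_order,asc' unescaped
    return f'{LIST_URL}?{urlencode(params, safe=",")}'


# one URL per page of the list
urls = [page_url(i) for i in range(1, 6)]
```

Keeping the parameters in a dictionary makes it easier to change the sort order or page range later without editing a long URL string.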

## Step 1. Using a for loop to collect all the data

In [6]:

imdb_data = []

for i in range(1,6):
    # headers value
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'}

    # URL of page i of the list
    url = f'https://www.imdb.com/list/ls050782187/?st_dt=&mode=detail&page={i}&sort=list_order,asc'
    
    # using requests to access the url
    response = requests.get(url, headers=headers)

    # parsing data 
    soup = BeautifulSoup(response.content, 'lxml')

    # pause for a random 0 to 5 seconds so the requests are not sent back to back
    sleep(randint(0,5))

    # all_movies holds the content div of each movie on this page
    all_movies = soup.find_all('div', class_='lister-item-content')

    # iterate over the movies and extract each field
    for movie in all_movies:

        # movie title
        title = movie.find('h3').a.text.strip()
        
        # Release Year
        rl = movie.find('span', class_='lister-item-year text-muted unbold')
        release_year = rl.text.strip() if rl else np.nan
        
        # Film Rating
        fr = movie.find('span', class_='certificate')
        film_rating = fr.text.strip() if fr else np.nan
        
        # Runtime
        rt = movie.find('span', class_='runtime')
        runtime = rt.text.strip() if rt else np.nan
        
        # Genre 
        gnr = movie.find('span', class_='genre')
        genre = [gnr.text.strip() if gnr else np.nan]
        
        # IMDb rating
        imdb = movie.find('span', class_='ipl-rating-star__rating')
        imdb_rating = imdb.text.strip() if imdb else np.nan
        
        # Metascore
        mscore = movie.find('div', class_='inline-block ratings-metascore')
        metascore = mscore.span.text if mscore else np.nan
        
        # Description (check the length first so a short list cannot raise IndexError)
        paragraphs = movie.find_all('p')
        description = paragraphs[1].text.strip() if len(paragraphs) > 1 else np.nan
        
        # Director 
        muted = movie.find_all('p', class_='text-muted text-small')
        dir_links = muted[1].find_all('a') if len(muted) > 1 else []
        director = [dir_links[0].text if dir_links else np.nan]
        
        # Votes
        spans = muted[2].find_all('span') if len(muted) > 2 else []
        votes = spans[1].text if len(spans) > 1 else np.nan
        
        # Gross
        # note: when a title has no gross figure, the last span appears to be the
        # votes span, so this line then falls back to the vote count
        gross = spans[-1].text if spans else np.nan

        
        # append this movie's fields as one row of imdb_data
        imdb_data.append([title, release_year, film_rating, runtime, genre, imdb_rating, metascore, description, director, votes, gross])

    # report progress
    print(f'Complete Page: {i}')

print('''
.
.
.
''')
print('All Done!')
    

Complete Page: 1
Complete Page: 2
Complete Page: 3
Complete Page: 4
Complete Page: 5

.
.
.

All Done!


In [7]:
# number of scraped rows
len(imdb_data)

500
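
A row count of 500 matches five pages of 100 movies. A few more checks on the raw rows can catch problems before the data goes into a DataFrame. The following is a rough sketch with a hypothetical helper `check_rows`. Note that a repeated title is only a hint, since two different films can share a name.

```python
from collections import Counter


def check_rows(rows, n_fields=11, expected=500):
    """Return a small report on the scraped rows."""
    # indexes of rows that do not have exactly n_fields values
    wrong_width = [i for i, row in enumerate(rows) if len(row) != n_fields]
    # titles (first field) that occur more than once
    title_counts = Counter(row[0] for row in rows if row)
    duplicates = sorted(t for t, c in title_counts.items() if c > 1)
    return {
        'count_ok': len(rows) == expected,
        'wrong_width': wrong_width,
        'duplicate_titles': duplicates,
    }
```

In the notebook this would be called as `check_rows(imdb_data)`.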

## Step 2. Insert the data into a pandas DataFrame

In [8]:
# list of column headers, in the same order as the fields of each row
columns = ['movie_title', 'release_year', 'film_rating', 'runtime', 'genre', 'imdb_rating', 'metascore', 'description', 'director', 'votes', 'gross']

# Create DataFrame (df_imdb) and insert imdb_data
df_imdb = pd.DataFrame(imdb_data, columns=columns)
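
In the scraping loop, `genre` wraps the whole comma-separated string in a one-element list, so a value such as `['Crime, Drama']` holds one string rather than two genres. If separate genres are wanted, the string can be split first. The following is a minimal sketch; `split_genres` is a hypothetical helper, not part of the original notebook.

```python
def split_genres(genre_text):
    """Split 'Crime, Drama' into ['Crime', 'Drama']; non-strings give []."""
    if not isinstance(genre_text, str):
        # covers a missing genre stored as NaN
        return []
    return [g.strip() for g in genre_text.split(',') if g.strip()]
```

Inside the loop this would replace the list wrapper, for example `genre = split_genres(gnr.text) if gnr else []`.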

In [9]:
# first and last five rows 
df_imdb

| | movie_title | release_year | film_rating | runtime | genre | imdb_rating | metascore | description | director | votes | gross |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Godfather | (1972) | A | 175 min | [Crime, Drama] | 9.2 | 100 | Don Vito Corleone, head of a mafia family, dec... | [Francis Ford Coppola] | 1929034 | $134.97M |
| 1 | The Silence of the Lambs | (1991) | A | 118 min | [Crime, Drama, Thriller] | 8.6 | 86 | A young F.B.I. cadet must receive the help of ... | [Jonathan Demme] | 1479099 | $130.74M |
| 2 | Star Wars: Episode V - The Empire Strikes Back | (1980) | UA | 124 min | [Action, Adventure, Fantasy] | 8.7 | 82 | After the Rebels are overpowered by the Empire... | [Irvin Kershner] | 1331089 | $290.48M |
| 3 | The Shawshank Redemption | (1994) | A | 142 min | [Drama] | 9.3 | 82 | Over the course of several years, two convicts... | [Frank Darabont] | 2771093 | $28.34M |
| 4 | The Shining | (1980) | A | 146 min | [Drama, Horror] | 8.4 | 66 | A family heads to an isolated hotel for the wi... | [Stanley Kubrick] | 1058395 | $44.02M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | Me, Myself & Irene | (2000) | A | 116 min | [Comedy] | 6.6 | 49 | A nice-guy cop with Dissociative Identity Diso... | [Bobby Farrelly] | 244713 | $90.57M |
| 496 | The Darjeeling Limited | (2007) | R | 91 min | [Adventure, Comedy, Drama] | 7.2 | 67 | A year after their father's funeral, three bro... | [Wes Anderson] | 209306 | $11.90M |
| 497 | Fear | (1996) | Not Rated | 97 min | [Drama, Thriller] | 6.2 | 51 | When Nicole met David; handsome, charming, aff... | [James Foley] | 51603 | $20.75M |
| 498 | Planet Terror | (2007) | A | 105 min | [Action, Comedy, Horror] | 7 | | After an experimental bio-weapon is released, ... | [Robert Rodriguez] | 217953 | 217953 |
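
In row 498 the `gross` value equals the `votes` value. This suggests that when a movie has no gross figure, the last `<span>` in the votes paragraph is the votes span itself, so taking index `[-1]` returns the vote count. The following is a rough sketch of a stricter rule that accepts the last span as gross only when its text starts with `$`. The helper `votes_and_gross` is hypothetical, and the span texts shown are an assumption based on the output above, not verified against the page markup.

```python
import math


def votes_and_gross(span_texts):
    """Pick votes and gross out of the texts of the votes-paragraph spans.

    Assumed layout (not verified): a label span, the votes span, and,
    only when a gross figure exists, further spans ending with a '$...' value.
    """
    texts = [t.strip() for t in span_texts]
    votes = texts[1] if len(texts) > 1 else math.nan
    # treat the last span as gross only if it looks like a dollar amount
    gross = texts[-1] if texts and texts[-1].startswith('$') else math.nan
    return votes, gross
```

In the loop, the input would be the span texts of the third `text-muted text-small` paragraph, for example `[s.text for s in movie.find_all('p', class_='text-muted text-small')[2].find_all('span')]`.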


## Step 3. Export the DataFrame to a CSV file 

In [15]:
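# index=False keeps the row index out of the file, so reading the CSV back does not add an 'Unnamed: 0' column
df_imdb.to_csv('top500_imdb_movies.csv', index=False)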

- See Part 2 of this project for exploratory data analysis and visualization