<img src="https://i.imgur.com/FoKB5Z5.png" align="left" width="300" height="250" title="source: imgur.com" />

## Program Code: J620-002-4:2020 

## Program Name: FRONT-END SOFTWARE DEVELOPMENT

## Title :  Case Study - IMDB Web Scraping

#### Name: Justin Chong

#### IC Number: 960327-07-5097

#### Date : 7/7/2023

#### Introduction : More practice with BeautifulSoup and Selenium, scraping IMDB's Top 1000 movies chart.



#### Conclusion : I have become much better at web scraping with BeautifulSoup and Selenium than I was before.






**Reference : https://medium.com/better-programming/the-only-step-by-step-guide-youll-need-to-build-a-web-scraper-with-python-e79066bd895a**

In [1]:
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time

## 1. Import Data Using Web Scraping
Open the URL with a Selenium webdriver (optionally headless) and parse the page source with BeautifulSoup.

In [2]:
url = 'https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prvt'
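A headless session, as mentioned above, could be started roughly as follows. This is a minimal sketch: the helper names `headless_arguments` and `make_headless_driver` are hypothetical, and the `--headless=new` switch assumes a recent Chrome build (older builds use `--headless`).

```python
def headless_arguments():
    """Command-line switches for a windowless Chrome session (assumed values)."""
    return ["--headless=new", "--window-size=1920,1080"]


def make_headless_driver():
    """Start Chrome without opening a visible browser window."""
    # Imported here so the helper can be defined even where Selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in headless_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Calling `driver = make_headless_driver()` would then take the place of `webdriver.Chrome()` in the scraping cell.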

Append the data found on each page to a list, one tuple per movie with a field for each category.

In [3]:
# driver = webdriver.Chrome('C:\\Users\ACER\Desktop\ChromeDriver\chromedriver')
driver = webdriver.Chrome()
driver.get(url)

data = []
while True:
    soup = BeautifulSoup(driver.page_source,'html.parser')
    soup_list = soup.find_all("div", attrs={"class":"lister-item-content"})
    for tr in soup_list:
        rank = tr.find('span', attrs={'class':'lister-item-index unbold text-primary'}).text.rstrip()
        title = tr.find('a').text.rstrip()
        year = tr.find('span', attrs={'class':'lister-item-year text-muted unbold'}).text.rstrip()
        
        certificate_span = tr.find('span', attrs={'class':'certificate'})
        certificate = ''
        if certificate_span:
            certificate = certificate_span.text.rstrip()        
        
        runtime = tr.find('span', attrs={'class':'runtime'}).text.rstrip()
        genre = tr.find('span', attrs={'class':'genre'}).text.rstrip().replace('\n', '')
        imdb_rating = tr.find('div', attrs={'class':'inline-block ratings-imdb-rating'}).find('strong').text.rstrip()
        
        metacritic_span = tr.find('div', attrs={'class':'inline-block ratings-metascore'})
        metacritic_score = ''
        if metacritic_span:
            metacritic_score = metacritic_span.find('span').text.rstrip()

        for p_tag in tr.find_all('p', attrs={'class': 'text-muted'}):
            synopsis = p_tag.text.rstrip().replace('\n', '')

        for crew in tr.find_all('p', attrs={'class':''}):
            crew_info = crew.text.rstrip().replace('\n', '')
        director_info, star_info = crew_info.split("|")
        directors = director_info.replace('Director:', '').replace('Directors:', '').lstrip()
        stars = star_info.replace('Star:', '').replace('Stars:', '').lstrip()

        votes_span = tr.find('span', string='Votes:')
        votes = ''
        if votes_span:
            votes = votes_span.find_next_sibling('span')['data-value']

        gross_span = tr.find('span', string='Gross:')
        gross = ''
        if gross_span:
            gross = gross_span.find_next_sibling('span')['data-value']

        top_chart_span = tr.find('span', attrs={'class':'text-muted top-chart-rank'})
        top_chart_rank = ''
        if top_chart_span:
            top_chart_rank = top_chart_span.find_next_sibling('span')['data-value']

        data.append((rank, title, year, certificate, runtime, genre, imdb_rating, metacritic_score, synopsis, directors, stars, votes, gross, top_chart_rank))

    try:
        button = driver.find_element(By.LINK_TEXT, 'Next »')
    except NoSuchElementException:
        # no "Next" link on the last page, so stop paginating
        break

    if not (button.is_displayed() and button.is_enabled()):
        # avoid looping forever on a link that cannot be clicked
        break
    button.click()
    # brief pause (assumed delay) so the next page loads before page_source is read again
    time.sleep(2)
        
driver.quit()


Check whether the data was scraped successfully

In [4]:
data

[('1.',
  'Oppenheimer',
  '(2023)',
  'R',
  '180 min',
  'Biography, Drama, History',
  '8.7',
  '88',
  'The story of American scientist, J. Robert Oppenheimer, and his role in the development of the atomic bomb.',
  'Christopher Nolan',
  'Cillian Murphy, Emily Blunt, Matt Damon, Robert Downey Jr.',
  '257156',
  '',
  '29'),
 ('2.',
  'Mission: Impossible - Dead Reckoning Part One',
  '(2023)',
  'PG-13',
  '163 min',
  'Action, Adventure, Thriller',
  '8.0',
  '81',
  'Ethan Hunt and his IMF team must track down a dangerous weapon before it falls into the wrong hands.',
  'Christopher McQuarrie',
  'Tom Cruise, Hayley Atwell, Ving Rhames, Simon Pegg',
  '108525',
  '',
  ''),
 ('3.',
  'Interstellar',
  '(2014)',
  'PG-13',
  '169 min',
  'Adventure, Drama, Sci-Fi',
  '8.7',
  '74',
  'When Earth becomes uninhabitable in the future, a farmer and ex-NASA pilot, Joseph Cooper, is tasked to pilot a spacecraft, along with a team of researchers, to find a new planet for humans.',
  ...
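The scraping loop splits the crew line on `|`, which assumes every entry lists both a director and stars. A more defensive variant might look like the sketch below; `split_crew` is a hypothetical helper, and returning an empty string for a missing part is an assumption.

```python
def split_crew(crew_info):
    """Split a 'Director:A | Stars:B, C' string into (directors, stars)."""
    directors, stars = "", ""
    for part in crew_info.split("|"):
        # Separate the label from the names at the first colon.
        label, _, names = part.partition(":")
        label = label.strip()
        if label in ("Director", "Directors"):
            directors = names.strip()
        elif label in ("Star", "Stars"):
            stars = names.strip()
    return directors, stars


print(split_crew("Director:Christopher Nolan| Stars:Cillian Murphy, Emily Blunt"))
print(split_crew("Stars:Irrfan Khan, Mahie Gill"))
```

Because each part is matched by its label, an entry without a `|` no longer raises an unpacking error.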

## 2. Building a DataFrame With pandas 
Put the data into a DataFrame with pandas, using the ranking as the index

In [5]:
imdb_columns = ['Ranking', 'Title', 'Year', 'Rating', 'Runtime', 'Genre', 'IMDB Rating', 'Metacritic Score', 'Synopsis', 
                'Directors', 'Stars', 'Votes', 'Gross', 'Top 250 Chart Ranking']
df = pd.DataFrame(data, columns = imdb_columns).set_index('Ranking')
df

Ranking,Title,Year,Rating,Runtime,Genre,IMDB Rating,Metacritic Score,Synopsis,Directors,Stars,Votes,Gross,Top 250 Chart Ranking
1.,Oppenheimer,(2023),R,180 min,"Biography, Drama, History",8.7,88,"The story of American scientist, J. Robert Opp...",Christopher Nolan,"Cillian Murphy, Emily Blunt, Matt Damon, Rober...",257156,,29
2.,Mission: Impossible - Dead Reckoning Part One,(2023),PG-13,163 min,"Action, Adventure, Thriller",8.0,81,Ethan Hunt and his IMF team must track down a ...,Christopher McQuarrie,"Tom Cruise, Hayley Atwell, Ving Rhames, Simon ...",108525,,
3.,Interstellar,(2014),PG-13,169 min,"Adventure, Drama, Sci-Fi",8.7,74,When Earth becomes uninhabitable in the future...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",1955135,188020017,24
4.,Guardians of the Galaxy Vol. 3,(2023),PG-13,150 min,"Action, Adventure, Comedy",8.1,64,"Still reeling from the loss of Gamora, Peter Q...",James Gunn,"Chris Pratt, Chukwudi Iwuji, Bradley Cooper, P...",238956,,
5.,Spider-Man: Across the Spider-Verse,(2023),PG,140 min,"Animation, Action, Adventure",8.9,86,"Miles Morales catapults across the Multiverse,...","Joaquim Dos Santos, Kemp Powers, Justin K. Tho...","Shameik Moore, Hailee Steinfeld, Brian Tyree H...",197390,,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...
996.,Paan Singh Tomar,(2012),Not Rated,135 min,"Action, Biography, Crime",8.2,,"The story of Paan Singh Tomar, an Indian athle...",Tigmanshu Dhulia,"Irrfan Khan, Mahie Gill, Rajesh Abhay, Hemendr...",37207,39567,
997.,The Breath,(2009),,128 min,"Action, Drama, Thriller",8.0,,Story of 40-man Turkish task force who must de...,Levent Semerci,"Mete Horozoglu, Ilker Kizmaz, Birce Akalay, Ib...",34549,,
998.,Anand,(1971),Not Rated,122 min,"Drama, Musical",8.1,,The story of a terminally ill man who wishes t...,Hrishikesh Mukherjee,"Rajesh Khanna, Amitabh Bachchan, Sumita Sanyal...",34604,,
999.,Andaz Apna Apna,(1994),Not Rated,160 min,"Action, Comedy, Romance",8.0,,Two slackers competing for the affections of a...,Rajkumar Santoshi,"Aamir Khan, Salman Khan, Raveena Tandon, Karis...",54394,,


## 3. Data Cleaning

Data cleaning - remove the parentheses from the Year values

In [6]:
df['Year'] = df['Year'].str.strip('()')
df

Ranking,Title,Year,Rating,Runtime,Genre,IMDB Rating,Metacritic Score,Synopsis,Directors,Stars,Votes,Gross,Top 250 Chart Ranking
1.,Oppenheimer,2023,R,180 min,"Biography, Drama, History",8.7,88,"The story of American scientist, J. Robert Opp...",Christopher Nolan,"Cillian Murphy, Emily Blunt, Matt Damon, Rober...",257156,,29
2.,Mission: Impossible - Dead Reckoning Part One,2023,PG-13,163 min,"Action, Adventure, Thriller",8.0,81,Ethan Hunt and his IMF team must track down a ...,Christopher McQuarrie,"Tom Cruise, Hayley Atwell, Ving Rhames, Simon ...",108525,,
3.,Interstellar,2014,PG-13,169 min,"Adventure, Drama, Sci-Fi",8.7,74,When Earth becomes uninhabitable in the future...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",1955135,188020017,24
4.,Guardians of the Galaxy Vol. 3,2023,PG-13,150 min,"Action, Adventure, Comedy",8.1,64,"Still reeling from the loss of Gamora, Peter Q...",James Gunn,"Chris Pratt, Chukwudi Iwuji, Bradley Cooper, P...",238956,,
5.,Spider-Man: Across the Spider-Verse,2023,PG,140 min,"Animation, Action, Adventure",8.9,86,"Miles Morales catapults across the Multiverse,...","Joaquim Dos Santos, Kemp Powers, Justin K. Tho...","Shameik Moore, Hailee Steinfeld, Brian Tyree H...",197390,,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...
996.,Paan Singh Tomar,2012,Not Rated,135 min,"Action, Biography, Crime",8.2,,"The story of Paan Singh Tomar, an Indian athle...",Tigmanshu Dhulia,"Irrfan Khan, Mahie Gill, Rajesh Abhay, Hemendr...",37207,39567,
997.,The Breath,2009,,128 min,"Action, Drama, Thriller",8.0,,Story of 40-man Turkish task force who must de...,Levent Semerci,"Mete Horozoglu, Ilker Kizmaz, Birce Akalay, Ib...",34549,,
998.,Anand,1971,Not Rated,122 min,"Drama, Musical",8.1,,The story of a terminally ill man who wishes t...,Hrishikesh Mukherjee,"Rajesh Khanna, Amitabh Bachchan, Sumita Sanyal...",34604,,
999.,Andaz Apna Apna,1994,Not Rated,160 min,"Action, Comedy, Romance",8.0,,Two slackers competing for the affections of a...,Rajkumar Santoshi,"Aamir Khan, Salman Khan, Raveena Tandon, Karis...",54394,,
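`str.strip('()')` only trims parentheses at the two ends of each string. If a year cell ever carried an extra marker, for example a hypothetical value such as `(I) (2004)`, a regular expression that pulls out the four digits would be more robust. A minimal sketch on sample values:

```python
import pandas as pd

# Sample values; the second one is a hypothetical edge case.
years = pd.Series(["(2023)", "(I) (2004)", "(1971)"])

# Trimming only removes characters at the ends of each string.
stripped = years.str.strip("()")

# Extract the first run of four digits instead.
extracted = years.str.extract(r"(\d{4})", expand=False)

print(stripped.tolist())
print(extracted.tolist())
```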


Data cleaning - remove the 'min' unit from the Runtime values

In [7]:
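# str.strip('min') would leave the space before "min", so remove the literal unit instead
df['Runtime'] = df['Runtime'].str.replace('min', '', regex=False).str.strip()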
df = df.rename(columns={'Runtime': 'Runtime (min)'})
df

Ranking,Title,Year,Rating,Runtime (min),Genre,IMDB Rating,Metacritic Score,Synopsis,Directors,Stars,Votes,Gross,Top 250 Chart Ranking
1.,Oppenheimer,2023,R,180,"Biography, Drama, History",8.7,88,"The story of American scientist, J. Robert Opp...",Christopher Nolan,"Cillian Murphy, Emily Blunt, Matt Damon, Rober...",257156,,29
2.,Mission: Impossible - Dead Reckoning Part One,2023,PG-13,163,"Action, Adventure, Thriller",8.0,81,Ethan Hunt and his IMF team must track down a ...,Christopher McQuarrie,"Tom Cruise, Hayley Atwell, Ving Rhames, Simon ...",108525,,
3.,Interstellar,2014,PG-13,169,"Adventure, Drama, Sci-Fi",8.7,74,When Earth becomes uninhabitable in the future...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",1955135,188020017,24
4.,Guardians of the Galaxy Vol. 3,2023,PG-13,150,"Action, Adventure, Comedy",8.1,64,"Still reeling from the loss of Gamora, Peter Q...",James Gunn,"Chris Pratt, Chukwudi Iwuji, Bradley Cooper, P...",238956,,
5.,Spider-Man: Across the Spider-Verse,2023,PG,140,"Animation, Action, Adventure",8.9,86,"Miles Morales catapults across the Multiverse,...","Joaquim Dos Santos, Kemp Powers, Justin K. Tho...","Shameik Moore, Hailee Steinfeld, Brian Tyree H...",197390,,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...
996.,Paan Singh Tomar,2012,Not Rated,135,"Action, Biography, Crime",8.2,,"The story of Paan Singh Tomar, an Indian athle...",Tigmanshu Dhulia,"Irrfan Khan, Mahie Gill, Rajesh Abhay, Hemendr...",37207,39567,
997.,The Breath,2009,,128,"Action, Drama, Thriller",8.0,,Story of 40-man Turkish task force who must de...,Levent Semerci,"Mete Horozoglu, Ilker Kizmaz, Birce Akalay, Ib...",34549,,
998.,Anand,1971,Not Rated,122,"Drama, Musical",8.1,,The story of a terminally ill man who wishes t...,Hrishikesh Mukherjee,"Rajesh Khanna, Amitabh Bachchan, Sumita Sanyal...",34604,,
999.,Andaz Apna Apna,1994,Not Rated,160,"Action, Comedy, Romance",8.0,,Two slackers competing for the affections of a...,Rajkumar Santoshi,"Aamir Khan, Salman Khan, Raveena Tandon, Karis...",54394,,
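One subtlety when removing a unit suffix: `str.strip` treats its argument as a set of characters rather than a literal suffix, so whitespace next to the unit can survive. The sketch below contrasts the two approaches on sample values and converts the cleaned result to integers; the variable names are illustrative.

```python
import pandas as pd

runtime = pd.Series(["180 min", "163 min"])

# strip('min') removes the characters m, i, n from both ends, but not the space.
char_stripped = runtime.str.strip("min")

# Replacing the literal ' min' suffix leaves only the digits.
minutes = runtime.str.replace(" min", "", regex=False).astype(int)

print(char_stripped.tolist())  # ['180 ', '163 ']
print(minutes.tolist())        # [180, 163]
```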


Data cleaning - remove the $ and M from the Gross values

In [8]:
# The gross shown on the page contains "$" and "M", so while scraping
# I read the data-value attribute of the span that follows the "Gross:" label.
# That attribute holds only the numeric value, so no further cleaning is needed.

# The scraping code is as follows:
# gross_span = tr.find('span', string='Gross:')
# gross = ''
# if gross_span:
#     gross = gross_span.find_next_sibling('span')['data-value']
df[['Title', 'Gross']]

Ranking,Title,Gross
1.,Oppenheimer,
2.,Mission: Impossible - Dead Reckoning Part One,
3.,Interstellar,188020017
4.,Guardians of the Galaxy Vol. 3,
5.,Spider-Man: Across the Spider-Verse,
...,...,...
996.,Paan Singh Tomar,39567
997.,The Breath,
998.,Anand,
999.,Andaz Apna Apna,


Data cleaning - remove the ',' from the Votes values

In [9]:
# The votes shown on the page contain "," separators, so while scraping
# I read the data-value attribute of the span that follows the "Votes:" label.
# That attribute holds only the numeric value, so no further cleaning is needed.

# The scraping code is as follows:
# votes_span = tr.find('span', string='Votes:')
# votes = ''
# if votes_span:
#     votes = votes_span.find_next_sibling('span')['data-value']
df[['Title', 'Votes']]

Ranking,Title,Votes
1.,Oppenheimer,257156
2.,Mission: Impossible - Dead Reckoning Part One,108525
3.,Interstellar,1955135
4.,Guardians of the Galaxy Vol. 3,238956
5.,Spider-Man: Across the Spider-Verse,197390
...,...,...
996.,Paan Singh Tomar,37207
997.,The Breath,34549
998.,Anand,34604
999.,Andaz Apna Apna,54394
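The approach described in the two cells above, reading the `data-value` attribute of the span that follows a label, can be tried on a small static snippet. The HTML below is a hypothetical fragment written to mirror the structure the scraper expects, not markup copied from IMDB, and `sibling_data_value` is an illustrative helper.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the structure the scraping loop relies on.
html = """
<p class="sort-num_votes-visible">
  <span class="text-muted">Votes:</span>
  <span name="nv" data-value="1955135">1,955,135</span>
  <span class="text-muted">Gross:</span>
  <span name="nv" data-value="188020017">$188.02M</span>
</p>
"""


def sibling_data_value(item, label):
    """Return the data-value of the span after `label`, or '' if absent."""
    label_span = item.find("span", string=label)
    if label_span is None:
        return ""
    value_span = label_span.find_next_sibling("span")
    if value_span is None:
        return ""
    return value_span.get("data-value", "")


item = BeautifulSoup(html, "html.parser")
print(sibling_data_value(item, "Votes:"))
print(sibling_data_value(item, "Gross:"))
print(repr(sibling_data_value(item, "Budget:")))
```

Returning an empty string for a missing label matches how the scraping loop fills optional fields.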


## 4. Display Cleaned and Converted Data in Pandas

In [10]:
df

Ranking,Title,Year,Rating,Runtime (min),Genre,IMDB Rating,Metacritic Score,Synopsis,Directors,Stars,Votes,Gross,Top 250 Chart Ranking
1.,Oppenheimer,2023,R,180,"Biography, Drama, History",8.7,88,"The story of American scientist, J. Robert Opp...",Christopher Nolan,"Cillian Murphy, Emily Blunt, Matt Damon, Rober...",257156,,29
2.,Mission: Impossible - Dead Reckoning Part One,2023,PG-13,163,"Action, Adventure, Thriller",8.0,81,Ethan Hunt and his IMF team must track down a ...,Christopher McQuarrie,"Tom Cruise, Hayley Atwell, Ving Rhames, Simon ...",108525,,
3.,Interstellar,2014,PG-13,169,"Adventure, Drama, Sci-Fi",8.7,74,When Earth becomes uninhabitable in the future...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",1955135,188020017,24
4.,Guardians of the Galaxy Vol. 3,2023,PG-13,150,"Action, Adventure, Comedy",8.1,64,"Still reeling from the loss of Gamora, Peter Q...",James Gunn,"Chris Pratt, Chukwudi Iwuji, Bradley Cooper, P...",238956,,
5.,Spider-Man: Across the Spider-Verse,2023,PG,140,"Animation, Action, Adventure",8.9,86,"Miles Morales catapults across the Multiverse,...","Joaquim Dos Santos, Kemp Powers, Justin K. Tho...","Shameik Moore, Hailee Steinfeld, Brian Tyree H...",197390,,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...
996.,Paan Singh Tomar,2012,Not Rated,135,"Action, Biography, Crime",8.2,,"The story of Paan Singh Tomar, an Indian athle...",Tigmanshu Dhulia,"Irrfan Khan, Mahie Gill, Rajesh Abhay, Hemendr...",37207,39567,
997.,The Breath,2009,,128,"Action, Drama, Thriller",8.0,,Story of 40-man Turkish task force who must de...,Levent Semerci,"Mete Horozoglu, Ilker Kizmaz, Birce Akalay, Ib...",34549,,
998.,Anand,1971,Not Rated,122,"Drama, Musical",8.1,,The story of a terminally ill man who wishes t...,Hrishikesh Mukherjee,"Rajesh Khanna, Amitabh Bachchan, Sumita Sanyal...",34604,,
999.,Andaz Apna Apna,1994,Not Rated,160,"Action, Comedy, Romance",8.0,,Two slackers competing for the affections of a...,Rajkumar Santoshi,"Aamir Khan, Salman Khan, Raveena Tandon, Karis...",54394,,
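The columns above are still stored as strings, with empty strings where a value was missing. If numeric analysis is needed, they could be converted along the lines of the sketch below. It uses a three-row sample copied from the table rather than the full scrape, and treating blanks as missing values (`NaN`) is an assumption about the desired behaviour.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Title": ["Oppenheimer", "Interstellar", "Paan Singh Tomar"],
        "Year": ["2023", "2014", "2012"],
        "IMDB Rating": ["8.7", "8.7", "8.2"],
        "Metacritic Score": ["88", "74", ""],
        "Votes": ["257156", "1955135", "37207"],
        "Gross": ["", "188020017", "39567"],
    }
)

numeric_columns = ["Year", "IMDB Rating", "Metacritic Score", "Votes", "Gross"]
for column in numeric_columns:
    # errors="coerce" yields NaN for any value that cannot be parsed as a number.
    sample[column] = pd.to_numeric(sample[column], errors="coerce")

print(sample.dtypes)
```

Columns that contain a missing value end up as floats, since `NaN` cannot be stored in a plain integer column.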


## 5. Saving Your Data to a CSV

In [11]:
df.to_csv('imdb_top_1000.csv')
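When the file is read back, pandas infers column types by default, so a ranking such as `1.` would be parsed as a number and empty cells as `NaN`. A sketch of a round trip that keeps every value as text is shown below; it writes to an in-memory buffer instead of `imdb_top_1000.csv`, and the two-row frame is a sample copied from the table.

```python
import io

import pandas as pd

sample = pd.DataFrame(
    {
        "Ranking": ["1.", "3."],
        "Title": ["Oppenheimer", "Interstellar"],
        "Gross": ["", "188020017"],
    }
).set_index("Ranking")

buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)

# dtype=str with keep_default_na=False keeps '1.' and '' exactly as written.
restored = pd.read_csv(buffer, dtype=str, keep_default_na=False).set_index("Ranking")

print(restored.index.tolist())
print(restored["Gross"].tolist())
```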

## 6. Conclusion
What have you learnt from this practice?

I found that more recent movies appear to be overtaking older movies in the top 1000 chart, and that they tend to earn a higher gross than older movies.
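One way such an observation could be checked against the scraped data is to group the titles by decade and compare counts and median gross. The sketch below is illustrative only: `summarize_by_decade` is a hypothetical helper, and the four-row sample copied from the table above is far too small to support or refute the conclusion.

```python
import pandas as pd


def summarize_by_decade(frame):
    """Count titles and take the median gross for each release decade."""
    year = pd.to_numeric(frame["Year"], errors="coerce")
    gross = pd.to_numeric(frame["Gross"], errors="coerce")
    table = pd.DataFrame({"Decade": year // 10 * 10, "Gross": gross})
    grouped = table.groupby("Decade")["Gross"]
    # size() counts every title; median() ignores titles with no gross.
    return pd.DataFrame({"Titles": grouped.size(), "MedianGross": grouped.median()})


sample = pd.DataFrame(
    {
        "Title": ["Oppenheimer", "Interstellar", "Paan Singh Tomar", "Anand"],
        "Year": ["2023", "2014", "2012", "1971"],
        "Gross": ["", "188020017", "39567", ""],
    }
)
summary = summarize_by_decade(sample)
print(summary)
```

Running the same function on the full `df` would give one row per decade, which makes the comparison between older and newer releases explicit.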