# Applying Web Scraping On IMDB & Metascore Movie Ratings. #
---

## Purpose of this project: ## 

The objective is to showcase my familiarity with web scraping techniques for collecting data from the web.

## Table of content: ##

* Find an efficient scraping framework and do a test run.
    * [Extracting HTML Text and locating the data of interest.](#Extracting-HTML-Text-and-locating-the-data-of-interest.)
    * [Finding scraping framework for each data of interest.](#Finding-scraping-framework-for-each-data-of-interest.)
    * [Handling 'NoneType' data.](#Handling-'NoneType'-data.)
    * [Compiling all the scraping frameworks and do a test run.](#Compiling-all-the-scraping-frameworks-and-do-a-test-run.)
* Implement an actual run with the scraping framework. 
    * [Strategising multi-webpage scraping.](#Strategising-multi-webpage-scraping.)
    * [Setting up URL parameters to make request over multi-webpage.](#Setting-up-URL-parameters-to-make-request-over-multi-webpage.)
    * [Setting up crawl-rate control to avoid site traffic.](#Setting-up-crawl-rate-control-to-avoid-site-traffic.)
    * [Setting up framework to monitor the crawl-rate.](#Setting-up-framework-to-monitor-the-crawl-rate.)
    * [Compiling all scraping frameworks and set-up to begin scraping.](#Compiling-all-scraping-frameworks-and-set-up-to-begin-scraping.)
* Save the dataset into CSV file. 
    * [Convert scraped data into dataframe object.](#Convert-scraped-data-into-dataframe-object.)
    * [Minor data cleaning and save the dataset into CSV file.](#Minor-data-cleaning-and-save-the-dataset-into-CSV-file.)
* [Notes for future reference.](#Notes-for-future-reference.)

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Extracting HTML Text and locating the data of interest. ##

Return to [Table of content:](#Table-of-content:)

---

Before scraping begins, the first step is to download the full HTML text from a URL. Then locate the HTML tags that contain the movies and their relevant data, and store them in a variable.

In [2]:
url = "https://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&count=100&page=1"

# Extract the HTML text from the URL.
# Send an `Accept-Language` header so that the page is returned in English.
response = requests.get(url, headers={"Accept-Language": "en-US, en;q=0.5"})
print(response.text[:200])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatibl


In [3]:
# Parse `response.text` and create a Beautiful Soup object.
html_soup = BeautifulSoup(response.text, "html.parser")
print(type(html_soup))

<class 'bs4.BeautifulSoup'>


In [4]:
# Find only the `div` tags that contain the movies and their relevant data.
movie_containers = html_soup.find_all("div", class_="lister-item mode-advanced")
print(type(movie_containers))

# Check whether the total list of movies matches the given URL link (...count=100).
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
100


## Finding scraping framework for each data of interest. ##

Return to [Table of content:](#Table-of-content:)

---

Select only the first movie and scrape each data point in turn to work out the scraping framework for it.

__Here's a list of relevant data:__

* Title of the movie.
* Year of release.
* IMDB rating.
* Metascore.
* Number of votes.

In [5]:
# Select the first movie and skim through the output.
first_movie = movie_containers[0]
print(first_movie)

<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt3315342"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt3315342/?ref_=adv_li_i"> <img alt="Logan" class="loadlate" data-tconst="tt3315342" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BYzc5MTU4N2EtYTkyMi00NjdhLTg3NWEtMTY4OTEyMzJhZTAzXkEyXkFqcGdeQXVyNjc1NTYyMjg@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB470041630_.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt3315342/?ref_=adv_li_tt">Logan</a>
<span class="lister-item-year text-muted unbold">(2017)</span>
</h3>
<p class="text-muted ">
<span class="certificate">R</span>
<span class="ghost">|</span>
<span class="runtime">137 min</span>
<span class="ghost">|</

---
__Scrape the title of the movie.__

In [6]:
# Scrape the movie title under `a` tag nested within `h3` tag with a `lister-item-header` class attribute.
first_title = first_movie.find("h3", class_="lister-item-header").a.text
print(first_title)

Logan


---
__Scrape the movie's year of release.__

In [7]:
# Scrape the year under `span` tag nested within `h3` tag.
first_year = first_movie.h3.find("span", class_="lister-item-year text-muted unbold").text
print(first_year)

(2017)


---
__Scrape the IMDB rating.__

In [8]:
# Scrape the IMDB rating under `strong` tag.
first_imdb = float(first_movie.strong.text)
print(type(first_imdb))
print(first_imdb)

<class 'float'>
8.1


---
__Scrape the Metascore rating.__

In [9]:
# Scrape the Metascore rating under `span` tag with a `metascore` class attribute.
first_mscore = int(first_movie.find("span", class_="metascore").text)
print(type(first_mscore))
print(first_mscore)

<class 'int'>
77


---
__Scrape the number of votes.__

In [10]:
# Find the votes number under `span` tag with a `nv` name attribute.
first_votes = first_movie.find("span", attrs = {"name": "nv"})
print(type(int(first_votes["data-value"])))

# Scrape the votes number under the `data-value` attributes.
print(int(first_votes["data-value"]))

<class 'int'>
510372


## Handling 'NoneType' data. ##

Return to [Table of content:](#Table-of-content:)

---

During the scraping process, if a specific HTML tag is missing because the data is unavailable, the lookup returns a null value (`None`). Hence, to prevent runtime errors, my approach is to first identify where the null value occurs, both on the web page and in the HTML.

In [11]:
# Check the object type for unavailable data.
movie_no_mscore = movie_containers[22].find("div", class_="ratings-metascore")
print(type(movie_no_mscore))

<class 'NoneType'>


---
As you can see, some movies do not have a Metascore, so the lookup returns a null value. My approach to this problem is to use an `if` statement to skip any movie without a Metascore.
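Skipping the movie is one option; another is to keep it and record the missing score as `None`. The following is a minimal sketch of that guard in plain Python, assuming the scraped text has already been pulled out of the tag (or is `None` when the tag is absent). The helper name `to_int_or_none` and the sample values are hypothetical and not part of this notebook.

```python
from typing import Optional


def to_int_or_none(text: Optional[str]) -> Optional[int]:
    """Convert scraped text to an int, passing a missing value through as None."""
    if text is None:
        return None
    text = text.strip()
    # An empty string is treated as missing as well.
    return int(text) if text else None


# Hypothetical scraped values: the second movie has no Metascore tag.
sample_texts = ["77", None, " 94 "]
sample_scores = [to_int_or_none(value) for value in sample_texts]
print(sample_scores)  # [77, None, 94]
```

Keeping the `None` entries preserves the row count, at the cost of handling missing values later during analysis.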

## Compiling all the scraping frameworks and do a test run. ##

Return to [Table of content:](#Table-of-content:)

---

Before proceeding to the next step, do a test run to ensure that all the scraping frameworks work correctly when combined.

In [12]:
# Create lists to store all the scraped data.
titles = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Repeat the scraping process over 10 movies.
for movie in movie_containers[0:10]: 
    
    # Check whether movie contains metascore.
    movie_no_mscore = movie.find("div", class_="ratings-metascore")

    # If movie contains metascore, then:
    if movie_no_mscore is not None:      
        # Scrape and append each data to the specific list.
        title = movie.find("h3", class_="lister-item-header").a.text
        titles.append(title)

        year = movie.h3.find("span", class_="lister-item-year text-muted unbold").text
        years.append(year)

        imdb = movie.strong.text
        imdb_ratings.append(float(imdb))

        mscore = movie.find("span", class_="metascore").text
        metascores.append(int(mscore))

        vote = movie.find("span", attrs = {"name": "nv"})["data-value"]
        votes.append(int(vote))

__Check the scraping result in dataframe object.__

In [13]:
test_df = pd.DataFrame({
    "movies": titles,
    "year": years,
    "imdb": imdb_ratings,
    "metascore": metascores,
    "votes": votes
    }
)

print(test_df.info())
test_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
movies       10 non-null object
year         10 non-null object
imdb         10 non-null float64
metascore    10 non-null int64
votes        10 non-null int64
dtypes: float64(1), int64(2), object(2)
memory usage: 480.0+ bytes
None


| | movies | year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Logan | (2017) | 8.1 | 77 | 510372 |
| 1 | Wonder Woman | (2017) | 7.5 | 76 | 437812 |
| 2 | Dunkirk | (2017) | 8.0 | 94 | 419985 |
| 3 | Star Wars: The Last Jedi | (2017) | 7.2 | 85 | 416710 |
| 4 | Guardians of the Galaxy Vol. 2 | (2017) | 7.7 | 67 | 412523 |
| 5 | Thor: Ragnarok | (2017) | 7.9 | 74 | 388962 |
| 6 | Spider-Man: Homecoming | (2017) | 7.5 | 73 | 359437 |
| 7 | Get Out | (I) (2017) | 7.7 | 84 | 333820 |
| 8 | Blade Runner 2049 | (2017) | 8.1 | 81 | 332841 |
| 9 | Baby Driver | (2017) | 7.6 | 86 | 325723 |


## Strategising multi-webpage scraping. ##

Return to [Table of content:](#Table-of-content:)

---

__After running a scraping test for a single URL link, the next step is to figure out a way to repeat the scraping framework over multiple URL links:__

* Each URL returns 100 movies, so 2 pages per year yield at most 200 movies per year. Scraping the 18 years from 2000 to 2017 therefore gives at most 3,600 movies. Even though movies without a Metascore are skipped, the run is still likely to collect at least 2,000 movies.

* Controlling the crawl rate is vital to keep the server from banning the IP address. It also avoids disrupting the website, leaving the server free to respond to other users' requests.
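The numbers in this plan can be checked with a few lines of arithmetic. This is only a sketch; the variable names are chosen for illustration and are not reused elsewhere in the notebook.

```python
year_range = range(2000, 2018)   # 2000 to 2017 inclusive, i.e. 18 years
pages_per_year = 2
movies_per_page = 100            # set by `count=100` in the URL

total_requests = len(year_range) * pages_per_year
max_movies = total_requests * movies_per_page

print(total_requests)  # 36
print(max_movies)      # 3600
```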

## Setting up URL parameters to make request over multi-webpage. ##

Return to [Table of content:](#Table-of-content:)

In [14]:
url_pages = [str(page) for page in range(1,3)]
url_years = [str(year) for year in range(2000,2018)]
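As an illustration of how these two parameters combine into request URLs, here is a small sketch that builds every URL up front with the standard library. The helper `build_url` and the constant `BASE_URL` are hypothetical names; the query parameters are the ones from the URL used earlier in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title"


def build_url(year, page, count=100):
    """Build the search URL for one year and one result page."""
    params = {
        "release_date": year,
        "sort": "num_votes,desc",
        "count": count,
        "page": page,
    }
    # `safe=","` keeps the comma in the sort value unescaped.
    return BASE_URL + "?" + urlencode(params, safe=",")


all_urls = [build_url(year, page)
            for year in range(2000, 2018)
            for page in range(1, 3)]

print(len(all_urls))  # 36
print(all_urls[0])
```

Building the list first makes it easy to check the number of requests before any of them is sent.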

## Setting up crawl-rate control to avoid site traffic. ##

Return to [Table of content:](#Table-of-content:)

In [15]:
from time import sleep, time
from random import randint

# Print "Blah" five times, pausing a random 1 to 4 seconds after each (`randint` includes both endpoints).
for _ in range(5):
    print("Blah")
    sleep(randint(1,4))

Blah
Blah
Blah
Blah
Blah


---
## Setting up framework to monitor the crawl-rate. ##

Return to [Table of content:](#Table-of-content:)

In [16]:
start_time = time()
request = 0

# For each new request, calculate the number of requests per second.
for _ in range(5):
    request += 1
    sleep(randint(1,3))
    elapsed_time = time() - start_time
    print("Request: {}; Frequency: {} requests/s".format(request, request / elapsed_time))

Request: 1; Frequency: 0.9974535149959144 requests/s
Request: 2; Frequency: 0.6652134218412246 requests/s
Request: 3; Frequency: 0.7478595987348288 requests/s
Request: 4; Frequency: 0.7979067973916388 requests/s
Request: 5; Frequency: 0.8314011364433754 requests/s
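The same bookkeeping can be wrapped in a small object so that the scraping loop only needs one call per request. This is a sketch under stated assumptions: the class name `CrawlRateMonitor` is hypothetical, and the clock is injectable so that the example can run with a fake clock instead of real pauses.

```python
from time import time


class CrawlRateMonitor:
    """Track how many requests were made and the average rate so far."""

    def __init__(self, clock=time):
        self._clock = clock
        self._start = clock()
        self.requests = 0

    def record(self):
        """Count one request and return the average requests per second."""
        self.requests += 1
        elapsed = self._clock() - self._start
        # Guard against a zero elapsed time on very fast clocks.
        return self.requests / elapsed if elapsed > 0 else float("inf")


# Demonstration with a fake clock: start at 0 s, then requests at 2 s, 4 s and 9 s.
fake_ticks = iter([0.0, 2.0, 4.0, 9.0])
monitor = CrawlRateMonitor(clock=lambda: next(fake_ticks))
for _ in range(3):
    frequency = monitor.record()
    print("Request: {}; Frequency: {:.2f} requests/s".format(monitor.requests, frequency))
```

In a real run the default `clock=time` would be used, with `sleep` called between requests as in the cell above.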


---
__Setting up `clear_output` to only display the latest output.__

In [17]:
from IPython.display import clear_output

start_time = time()
request = 0

for _ in range(5):
    request += 1
    sleep(randint(1,3))
    elapsed_time = time() - start_time
    print("Request: {}; Frequency: {} requests/s".format(request, request / elapsed_time))
    clear_output(wait=True)

Request: 5; Frequency: 0.4987569556331597 requests/s


---
__Setting up a warning to report unsuccessful status codes.__

In [18]:
from warnings import warn

warn("Warning simulation")



## Compiling all scraping frameworks and set-up to begin scraping. ##

Return to [Table of content:](#Table-of-content:)

---

In [19]:
# Create lists to store the scraped data.
titles = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Prepare for the crawl-rate monitoring.
start_time = time()
request = 0

# Loop through each year from 2000 to 2017.
for url_year in url_years:   
    
    # Loop through each page from 1 to 2.
    for url_page in url_pages:
        
        # Get the HTML text from the URL link.
        url = "https://www.imdb.com/search/title?release_date={year}&sort=num_votes,desc&count=100&page={page}".format(year=url_year, page=url_page)
        response = requests.get(url, headers={"Accept-Language": "en-US, en;q=0.5"})
        request += 1
        
        # Pause the loop for a random 8 to 15 seconds.
        sleep(randint(8,15))
        
        # Set up crawl-rate monitoring.
        elapsed_time = time() - start_time
        print("Response: {}; Frequency: {} requests/s".format(request, request / elapsed_time))
        clear_output(wait=True)
        
        # Warn about an unsuccessful status code.
        if response.status_code != 200:
            warn("Request: {}; Status code: {}".format(request, response.status_code))
            
        # Safety stop if more requests were made than planned (18 years x 2 pages = 36).
        # Note: `break` only exits the inner page loop, not the outer year loop.
        if request > 36:
            warn("Request number is greater than 36.")
            break
        
        # Parse `response.text` and create a Beautiful Soup object.
        page_html = BeautifulSoup(response.text, "html.parser")
        
        # Find only the `div` tags that contain the movies and their relevant data.
        mv_containers = page_html.find_all("div", class_="lister-item mode-advanced")

        # Loop through the list of movies to repeat the scraping process.
        for mv in mv_containers:     
            # Check whether movie contains metascore.
            mv_mscore = mv.find("div", class_="ratings-metascore")
            
            # If movie contains metascore, then:
            if mv_mscore is not None:
                # Scrape and append each data to the specific list.
                title = mv.find("h3", class_="lister-item-header").a.text
                titles.append(title)

                year = mv.h3.find("span", class_="lister-item-year text-muted unbold").text
                years.append(year)

                imdb = mv.strong.text
                imdb_ratings.append(float(imdb))

                mscore = mv.find("span", class_="metascore").text
                metascores.append(int(mscore))

                vote = mv.find("span", attrs = {"name": "nv"})["data-value"]
                votes.append(int(vote))

Response: 36; Frequency: 0.06802559686732557 requests/s
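One detail of nested loops is worth keeping in mind here: a `break` inside the inner page loop only leaves that loop, and the outer year loop then carries on. A minimal sketch of one way to stop both loops at once is to put them in a function and `return`. The function `crawl_plan` is hypothetical and makes no requests; it only records which (year, page) pairs would be visited within a request budget.

```python
def crawl_plan(years, pages, max_requests):
    """Return the (year, page) pairs that fit within the request budget."""
    visited = []
    for year in years:
        for page in pages:
            # Check the budget before the request would be made.
            if len(visited) >= max_requests:
                # `return` leaves both loops; a bare `break` would only
                # leave the page loop and continue with the next year.
                return visited
            visited.append((year, page))
    return visited


plan = crawl_plan(range(2000, 2018), range(1, 3), max_requests=5)
print(plan)  # [(2000, 1), (2000, 2), (2001, 1), (2001, 2), (2002, 1)]
```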


## Convert scraped data into dataframe object. ##

Return to [Table of content:](#Table-of-content:)

---

In [20]:
movie_ratings = pd.DataFrame({
    "movies": titles,
    "year": years,
    "imdb": imdb_ratings,
    "metascore": metascores,
    "votes": votes
    }
)

# Check the result.
print(movie_ratings.info())
movie_ratings.head(15)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2948 entries, 0 to 2947
Data columns (total 5 columns):
movies       2948 non-null object
year         2948 non-null object
imdb         2948 non-null float64
metascore    2948 non-null int64
votes        2948 non-null int64
dtypes: float64(1), int64(2), object(2)
memory usage: 115.2+ KB
None


| | movies | year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Gladiator | (2000) | 8.5 | 67 | 1155643 |
| 1 | Memento | (2000) | 8.5 | 80 | 991943 |
| 2 | Snatch | (2000) | 8.3 | 55 | 693236 |
| 3 | Requiem for a Dream | (2000) | 8.3 | 68 | 673397 |
| 4 | X-Men | (2000) | 7.4 | 64 | 516669 |
| 5 | Cast Away | (2000) | 7.8 | 73 | 454808 |
| 6 | American Psycho | (2000) | 7.6 | 64 | 413578 |
| 7 | Unbreakable | (2000) | 7.3 | 62 | 304556 |
| 8 | Meet the Parents | (2000) | 7.0 | 73 | 286175 |
| 9 | Mission: Impossible II | (2000) | 6.1 | 59 | 279510 |


## Minor data cleaning and save the dataset into CSV file. ##

Return to [Table of content:](#Table-of-content:)

---

__Sorting the order of the columns.__

In [21]:
movie_ratings = movie_ratings[["movies", "year", "imdb", "metascore", "votes"]]
movie_ratings.head()

| | movies | year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Gladiator | (2000) | 8.5 | 67 | 1155643 |
| 1 | Memento | (2000) | 8.5 | 80 | 991943 |
| 2 | Snatch | (2000) | 8.3 | 55 | 693236 |
| 3 | Requiem for a Dream | (2000) | 8.3 | 68 | 673397 |
| 4 | X-Men | (2000) | 7.4 | 64 | 516669 |


---
__Converting the `year` column to integers without brackets or other symbols.__ First, list all the unique values to find an efficient way to extract the year.

In [22]:
movie_ratings["year"].unique()

array(['(2000)', '(I) (2000)', '(2001)', '(I) (2001)', '(2002)',
       '(I) (2002)', '(2003)', '(I) (2003)', '(2004)', '(I) (2004)',
       '(2005)', '(I) (2005)', '(2006)', '(I) (2006)', '(2007)',
       '(I) (2007)', '(2008)', '(I) (2008)', '(2009)', '(I) (2009)',
       '(II) (2009)', '(2010)', '(I) (2010)', '(II) (2010)', '(2011)',
       '(I) (2011)', '(IV) (2011)', '(2012)', '(I) (2012)', '(II) (2012)',
       '(2013)', '(I) (2013)', '(II) (2013)', '(2014)', '(I) (2014)',
       '(II) (2014)', '(III) (2014)', '(2015)', '(I) (2015)',
       '(II) (2015)', '(VI) (2015)', '(III) (2015)', '(2016)',
       '(II) (2016)', '(I) (2016)', '(IX) (2016)', '(V) (2016)', '(2017)',
       '(I) (2017)', '(III) (2017)', '(II) (2017)'], dtype=object)

---
Since every year sits at the end of the string, after any Roman-numeral marker such as `(I)`, and all values share the same format, my approach is to slice the four digits from the end using pandas string slicing.

In [23]:
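# `assign` replaces the whole column, so the integer dtype is kept
# regardless of how `.loc` assignment behaves in the installed pandas version.
movie_ratings = movie_ratings.assign(year=movie_ratings["year"].str[-5:-1].astype(int))
movie_ratings.head()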

| | movies | year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Gladiator | 2000 | 8.5 | 67 | 1155643 |
| 1 | Memento | 2000 | 8.5 | 80 | 991943 |
| 2 | Snatch | 2000 | 8.3 | 55 | 693236 |
| 3 | Requiem for a Dream | 2000 | 8.3 | 68 | 673397 |
| 4 | X-Men | 2000 | 7.4 | 64 | 516669 |
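Slicing from the end works because every value ends in a four-digit year followed by a closing bracket. As a sketch of a stricter alternative, a regular expression can check that format explicitly and fail loudly on anything unexpected. The names `YEAR_PATTERN` and `parse_year` are hypothetical, and this version runs on plain strings rather than on the pandas column.

```python
import re

# A four-digit year in brackets at the very end of the string.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def parse_year(raw):
    """Extract the release year from strings such as '(I) (2001)'."""
    match = YEAR_PATTERN.search(raw)
    if match is None:
        raise ValueError("Unexpected year format: {!r}".format(raw))
    return int(match.group(1))


sample_years = ["(2000)", "(I) (2001)", "(III) (2017)"]
print([parse_year(value) for value in sample_years])  # [2000, 2001, 2017]
```

The trade-off is a little more code in exchange for an error instead of a silently wrong year if the format ever changes.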


---
__Check whether the `imdb` and `metascore` ratings fall within the expected intervals.__

In [24]:
movie_ratings.describe().loc[["min", "max"],["imdb", "metascore"]]

| | imdb | metascore |
|---|---|---|
| min | 1.6 | 7.0 |
| max | 9.0 | 100.0 |


---
A `metascore` ranges from 0 to 100, and an `imdb` rating from 1 to 10. No abnormal value is found, since all the values fall within the expected minimum and maximum.

__Rescale the `imdb` rating to the 0 to 100 range of `metascore` so that the two are easier to compare during analysis.__

In [25]:
# Round before casting so that floating-point error cannot truncate a value.
movie_ratings["n_imdb"] = (movie_ratings["imdb"] * 10).round().astype(int)
movie_ratings.head()

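| | movies | year | imdb | metascore | votes | n_imdb |
|---|---|---|---|---|---|---|
| 0 | Gladiator | 2000 | 8.5 | 67 | 1155643 | 85 |
| 1 | Memento | 2000 | 8.5 | 80 | 991943 | 85 |
| 2 | Snatch | 2000 | 8.3 | 55 | 693236 | 83 |
| 3 | Requiem for a Dream | 2000 | 8.3 | 68 | 673397 | 83 |
| 4 | X-Men | 2000 | 7.4 | 64 | 516669 | 74 |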


---
__Save the dataset as a CSV file.__

In [26]:
movie_ratings.to_csv("movie_ratings_2000_to_2017.csv")

## Notes for future reference. ##

__List of things to take note:__

* The URL and its parameters may change in the future.
* The structure of the HTML and the tag attributes may change, in which case the current scraping framework may no longer work.
* Under the section "Handling 'NoneType' data", the cell may need to be updated, because the hard-coded index `movie_containers[22]` may no longer point to a movie without a Metascore.
* The current `imdb`, `metascore`, and `votes` values will become outdated over time.
* The dataset only contains movies released from 2000 to 2017.

Return to [Table of content:](#Table-of-content:)