## Scraping Movie Data
Scrape data from the IMDb website for up to 200 movies (four result pages of 50) from each year from 2015 to 2019. Each record consists of `movie_name`, `release_year`, the `imdb` rating, the `metascore` and the number of user `votes`.

In [1]:
from requests import get
url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)
print(response.text[:500])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle"


In [2]:
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

bs4.BeautifulSoup

In [3]:
# html_soup

In [4]:
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
50


So, the number of movies listed on one page is 50.
<br/>
<br/>
Start by extracting the data for the first movie of 2017.
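As a minimal sketch of how the container lookup behaves, the snippet below runs `find_all` on a small hand-written HTML fragment. The fragment, the titles and the `mode-simple` class are made up for illustration and only mimic the structure seen above. When the `class_` value contains a space, Beautiful Soup compares it against the exact string of the `class` attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the structure of the search page
sample_html = """
<div class="lister-list">
  <div class="lister-item mode-advanced"><h3><a href="/title/tt0000001/">Movie A</a></h3></div>
  <div class="lister-item mode-advanced"><h3><a href="/title/tt0000002/">Movie B</a></h3></div>
  <div class="lister-item mode-simple"><h3><a href="/title/tt0000003/">Movie C</a></h3></div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Only divs whose class attribute is exactly 'lister-item mode-advanced' match
containers = sample_soup.find_all('div', class_='lister-item mode-advanced')
print(len(containers))

# Dotted access (container.h3.a) returns the first matching descendant tag
titles = [container.h3.a.text for container in containers]
print(titles)
```

On this fragment the lookup finds two containers and skips the third, whose class string differs.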

In [5]:
first_movie = movie_containers[0]

In [6]:
first_movie.div

<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt3315342"></div>
</div>

In [7]:
first_movie.a

<a href="/title/tt3315342/"> <img alt="Logan" class="loadlate" data-tconst="tt3315342" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BYzc5MTU4N2EtYTkyMi00NjdhLTg3NWEtMTY4OTEyMzJhZTAzXkEyXkFqcGdeQXVyNjc1NTYyMjg@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB470041630_.png" width="67"/>
</a>

In [8]:
first_movie.h3

<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt3315342/">Logan</a>
<span class="lister-item-year text-muted unbold">(2017)</span>
</h3>

In [9]:
first_movie.h3.a

<a href="/title/tt3315342/">Logan</a>

In [10]:
first_name = first_movie.h3.a.text
first_name

'Logan'

In [11]:
first_year = first_movie.h3.find('span', class_ = 'lister-item-year text-muted unbold')
first_year

<span class="lister-item-year text-muted unbold">(2017)</span>

In [12]:
first_year = first_year.text
first_year = first_year.replace('(','')
first_year = first_year.replace(')','')
first_year

'2017'

In [13]:
first_imdb = float(first_movie.strong.text)
first_imdb

8.1

In [14]:
first_mscore = first_movie.find('span', class_ = 'metascore favorable')
first_mscore = int(first_mscore.text)
print(first_mscore)

77


In [15]:
first_votes = first_movie.find('span', attrs = {'name':'nv'})
first_votes

<span data-value="595697" name="nv">595,697</span>

In [16]:
first_votes = int(first_votes['data-value'])
first_votes

595697

In [17]:
twentythird_movie_mscore = movie_containers[22].find('div', class_ = 'inline-block ratings-metascore')
type(twentythird_movie_mscore)
# The result is None because this movie has no Metascore rating.

NoneType
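One way to deal with a missing Metascore is to wrap the lookup in a small helper that returns `None` instead of raising an error. This is only a sketch: `extract_metascore` is a hypothetical helper name, and the two containers are hand-written stand-ins rather than real page content.

```python
from bs4 import BeautifulSoup

def extract_metascore(container):
    """Return the Metascore as an int, or None when the container has none."""
    tag = container.find('span', class_='metascore')
    return int(tag.text) if tag is not None else None

# Hypothetical containers: one with a Metascore, one without
with_score = BeautifulSoup(
    '<div class="lister-item mode-advanced">'
    '<div class="inline-block ratings-metascore">'
    '<span class="metascore favorable">77        </span></div></div>',
    'html.parser')
without_score = BeautifulSoup(
    '<div class="lister-item mode-advanced"></div>', 'html.parser')

print(extract_metascore(with_score))
print(extract_metascore(without_score))
```

Searching with `class_='metascore'` matches any span that has `metascore` among its classes, so the helper does not depend on the `favorable` suffix.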

Store the data for every movie on that page in lists.

In [18]:
# Lists to store the scraped data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Extract data from each individual movie container
for container in movie_containers:
    # The name
    name = container.h3.a.text
    names.append(name)
    # The year
    year = container.h3.find('span', class_ = 'lister-item-year').text
    years.append(year)
    # The IMDB rating
    imdb = float(container.strong.text)
    imdb_ratings.append(imdb)
    # The Metascore (None when the movie has no Metascore)
    m_score = container.find('span', class_ = 'metascore')
    if m_score is not None:
        metascores.append(int(m_score.text))
    else:
        metascores.append(None)
    # The number of votes
    vote = container.find('span', attrs = {'name':'nv'})['data-value']
    votes.append(int(vote))

In [19]:
import pandas as pd
test_df = pd.DataFrame({'movie_name': names,
'release_year': years,
'imdb': imdb_ratings,
'metascore': metascores,
'votes': votes
})
print(test_df.info())
test_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
movie_name      50 non-null object
release_year    50 non-null object
imdb            50 non-null float64
metascore       43 non-null float64
votes           50 non-null int64
dtypes: float64(2), int64(1), object(2)
memory usage: 2.1+ KB
None


| | movie_name | release_year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Logan | (2017) | 8.1 | 77.0 | 595697 |
| 1 | Thor: Ragnarok | (2017) | 7.9 | 74.0 | 528781 |
| 2 | Guardians of the Galaxy Vol. 2 | (2017) | 7.6 | 67.0 | 520118 |
| 3 | Star Wars: The Last Jedi | (2017) | 7.0 | 85.0 | 519053 |
| 4 | Wonder Woman | (2017) | 7.4 | 76.0 | 513709 |
| 5 | Dunkirk | (2017) | 7.9 | 94.0 | 496052 |
| 6 | Spider-Man: Homecoming | (2017) | 7.4 | 73.0 | 468118 |
| 7 | Get Out | (I) (2017) | 7.7 | 84.0 | 439164 |
| 8 | It | (I) (2017) | 7.3 | 69.0 | 417501 |
| 9 | Blade Runner 2049 | (2017) | 8.0 | 81.0 | 409215 |


## Scraping Data from multiple pages.

Here is the **URL** `http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1`.
<br/>
<br/>
It consists of three query parameters: `release_date` (the year only, e.g. `2017`), `sort` (the sort key `num_votes` followed by the order, `desc` or `asc`) and `page` (the page number, e.g. `1`). Each page contains 50 movies.
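The URL structure described above can be sketched with the standard library's `urlencode`. The helper name `build_search_url` and its `order` argument are assumptions made for this illustration; the notebook itself builds the URL by string concatenation.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.imdb.com/search/title'

def build_search_url(year, page, order='desc'):
    """Build the search URL for one release year and one result page."""
    params = {
        'release_date': year,
        'sort': 'num_votes,' + order,
        'page': page,
    }
    # safe=',' keeps the comma in the sort value unescaped
    return BASE_URL + '?' + urlencode(params, safe=',')

print(build_search_url(2017, 1))
```

Calling `build_search_url(2017, 1)` reproduces the URL shown above.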

In [20]:
pages = [str(i) for i in range(1,5)]
years_url = [str(i) for i in range(2015,2020)]
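As a quick check of how many requests these two lists imply, the sketch below enumerates every (year, page) combination with `itertools.product`. The variable name `combos` is introduced only for this illustration.

```python
from itertools import product

pages = [str(i) for i in range(1, 5)]
years_url = [str(i) for i in range(2015, 2020)]

# One request per (year, page) pair: 5 years x 4 pages
combos = list(product(years_url, pages))
print(len(combos))
print(combos[0], combos[-1])
```

The count comes to 20, which is the number of requests the scraping loop is expected to make.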

To avoid having our IP address blocked, we will use the `sleep()` function from Python’s `time` module. `sleep()` pauses the execution of the loop for a specified number of seconds.

In [21]:
from time import sleep
from random import randint

In [22]:
from time import time
start_time = time()
requests = 0
for _ in range(5):
    # A request would go here
    requests += 1
    sleep(randint(1,3))
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))

Request: 1; Frequency: 0.9981561389530639 requests/s
Request: 2; Frequency: 0.9974280160171429 requests/s
Request: 3; Frequency: 0.996722192420771 requests/s
Request: 4; Frequency: 0.6650218056720869 requests/s
Request: 5; Frequency: 0.7122746147084202 requests/s
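The random pause can also be wrapped in a small function, which makes the delay range explicit and lets it be tried out without actually waiting. This is a sketch only: `polite_pause` and its `sleep_fn` parameter are hypothetical names, not part of the `time` module.

```python
from random import randint
from time import sleep

def polite_pause(min_seconds, max_seconds, sleep_fn=sleep):
    """Pause for a random whole number of seconds and return the delay used."""
    delay = randint(min_seconds, max_seconds)  # both bounds are inclusive
    sleep_fn(delay)
    return delay

# For demonstration, record the delays instead of really sleeping
recorded = []
delays = [polite_pause(8, 15, sleep_fn=recorded.append) for _ in range(3)]
print(delays)
```

In the real loop the default `sleep_fn` would be used, so `polite_pause(8, 15)` blocks for 8 to 15 seconds.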


Since we’re going to make 20 requests (5 years × 4 pages), our work will look a bit untidy as the output accumulates. To avoid that, we’ll clear the output after each iteration and replace it with information about the most recent request. To do that we’ll use the `clear_output()` function from IPython’s `core.display` module. We’ll set the `wait` parameter of `clear_output()` to `True` so that the current output is replaced only once new output appears.

In [23]:
from IPython.core.display import clear_output
start_time = time()
requests = 0
for _ in range(5):
    # A request would go here
    requests += 1
    sleep(randint(1,3))
    current_time = time()
    elapsed_time = current_time - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    # Clear inside the loop so only the most recent request stays visible
    clear_output(wait = True)

Request: 5; Frequency: 0.4162640541708253 requests/s


To monitor the status code we’ll set the program to warn us if there’s something off. A successful request is indicated by a status code of 200. We’ll use the `warn()` function from the `warnings` module to throw a warning if the status code is not 200.
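The status check described above can be sketched without touching the network. Here `check_status` is a hypothetical helper, and the two `SimpleNamespace` objects are stand-ins that carry only a `status_code` attribute, like the response returned by `get()`.

```python
import warnings
from types import SimpleNamespace

def check_status(response, request_number):
    """Warn and return False when the status code is not 200."""
    if response.status_code != 200:
        warnings.warn('Request: {}; Status code: {}'.format(
            request_number, response.status_code))
        return False
    return True

# Stand-in responses for illustration only
ok = SimpleNamespace(status_code=200)
missing = SimpleNamespace(status_code=404)

# Capture the warnings so they can be inspected instead of printed to stderr
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    results = [check_status(ok, 1), check_status(missing, 2)]

print(results)
print([str(w.message) for w in caught])
```

A warning, unlike an exception, does not stop the loop, so the scraper keeps going after a failed request.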

In [24]:
from warnings import warn
warn("Warning Simulation")

  """Entry point for launching an IPython kernel.


In [25]:
# Redeclaring the lists to store data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Preparing the monitoring of the loop
start_time = time()
requests = 0

headers = {"Accept-Language": "en-US, en;q=0.5"}

# For every year in the interval 2015-2019
for year_url in years_url:

    # For every page in the interval 1-4
    for page in pages:

        # Make a get request
        response = get('http://www.imdb.com/search/title?release_date=' + year_url +
        '&sort=num_votes,desc&page=' + page, headers = headers)

        # Pause the loop
        sleep(randint(8,15))
        # Monitor the requests
        requests += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
        clear_output(wait = True)

        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(requests, response.status_code))

        # Break the loop if the number of requests is greater than expected (5 years x 4 pages)
        if requests > 20:
            warn('Number of requests was greater than expected.')
            break

        # Parse the content of the request with BeautifulSoup
        page_html = BeautifulSoup(response.text, 'html.parser')

        # Select all the 50 movie containers from a single page
        mv_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')

        # For every movie of these 50
        for container in mv_containers:
            # If the movie has a Metascore, then:
            if container.find('div', class_ = 'ratings-metascore') is not None:

                # Scrape the name
                name = container.h3.a.text
                names.append(name)
                
                # Scrape the year
                year = container.h3.find('span', class_ = 'lister-item-year').text
                years.append(year)

                # Scrape the IMDB rating
                imdb = float(container.strong.text)
                imdb_ratings.append(imdb)

                # Scrape the Metascore
                m_score = container.find('span', class_ = 'metascore').text
                metascores.append(int(m_score))

                # Scrape the number of votes
                vote = container.find('span', attrs = {'name':'nv'})['data-value']
                votes.append(int(vote))

Request:20; Frequency: 0.06291955952528032 requests/s


In [26]:
movie_ratings = pd.DataFrame({'movie_name': names,
'release_year': years,
'imdb': imdb_ratings,
'metascore': metascores,
'votes': votes
})
print(movie_ratings.info())
movie_ratings.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 872 entries, 0 to 871
Data columns (total 5 columns):
movie_name      872 non-null object
release_year    872 non-null object
imdb            872 non-null float64
metascore       872 non-null int64
votes           872 non-null int64
dtypes: float64(1), int64(2), object(2)
memory usage: 34.2+ KB
None


| | movie_name | release_year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Star Wars: Episode VII - The Force Awakens | (2015) | 7.9 | 81 | 822755 |
| 1 | Mad Max: Fury Road | (2015) | 8.1 | 90 | 815099 |
| 2 | The Martian | (2015) | 8.0 | 80 | 710212 |
| 3 | Avengers: Age of Ultron | (2015) | 7.3 | 66 | 696457 |
| 4 | The Revenant | (2015) | 8.0 | 76 | 652809 |
| 5 | Inside Out | (I) (2015) | 8.2 | 94 | 563734 |
| 6 | Jurassic World | (2015) | 7.0 | 59 | 554218 |
| 7 | Ant-Man | (2015) | 7.3 | 64 | 530340 |
| 8 | The Hateful Eight | (2015) | 7.8 | 68 | 471893 |
| 9 | Spotlight | (I) (2015) | 8.1 | 93 | 384001 |


In [27]:
movie_ratings['release_year'].value_counts()

(2018)          168
(2017)          152
(2016)          148
(2015)          140
(2019)          128
(I) (2015)       28
(I) (2016)       24
(I) (2017)       20
(I) (2018)       16
(II) (2016)      12
(I) (2019)       12
(II) (2015)       8
(II) (2019)       4
(III) (2019)      4
(III) (2018)      4
(IX) (2016)       4
Name: release_year, dtype: int64

In [28]:
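# Keep only the four-digit year, e.g. '(I) (2015)' -> 2015; plain column assignment lets the dtype change to int
movie_ratings['release_year'] = movie_ratings['release_year'].str[-5:-1].astype(int)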

In [29]:
movie_ratings['release_year'].value_counts()

2018    188
2016    188
2015    176
2017    172
2019    148
Name: release_year, dtype: int64
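The slice `str[-5:-1]` relies on every label ending in a four-digit year followed by a closing parenthesis. A more defensive variant, sketched here with the standard `re` module, looks for the year explicitly. `parse_year` is a hypothetical helper, and the last label is a made-up malformed value that does not occur in the scraped data.

```python
import re

# A four-digit year enclosed in parentheses, e.g. '(2017)'
YEAR_PATTERN = re.compile(r'\((\d{4})\)')

def parse_year(label):
    """Return the year in the label as an int, or None if no year is found."""
    match = YEAR_PATTERN.search(label)
    return int(match.group(1)) if match else None

# The first three labels appear in the output above; '(I)' is hypothetical
labels = ['(2017)', '(I) (2015)', '(IX) (2016)', '(I)']
parsed_years = [parse_year(label) for label in labels]
print(parsed_years)
```

Labels without a year come back as `None` here, whereas the slice would try to convert whatever characters it happens to cut out.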

In [30]:
movie_ratings.head()

| | movie_name | release_year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Star Wars: Episode VII - The Force Awakens | 2015 | 7.9 | 81 | 822755 |
| 1 | Mad Max: Fury Road | 2015 | 8.1 | 90 | 815099 |
| 2 | The Martian | 2015 | 8.0 | 80 | 710212 |
| 3 | Avengers: Age of Ultron | 2015 | 7.3 | 66 | 696457 |
| 4 | The Revenant | 2015 | 8.0 | 76 | 652809 |


In [31]:
movie_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 872 entries, 0 to 871
Data columns (total 5 columns):
movie_name      872 non-null object
release_year    872 non-null int32
imdb            872 non-null float64
metascore       872 non-null int64
votes           872 non-null int64
dtypes: float64(1), int32(1), int64(2), object(1)
memory usage: 30.8+ KB


In [32]:
movie_ratings.to_csv('movie_ratings.csv')

The `movie_ratings` DataFrame is now saved to `movie_ratings.csv`.
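By default `to_csv()` also writes the row index as an extra, unnamed first column. If that column is not wanted when the file is read back, `index=False` can be passed. The sketch below shows this on a one-row sample frame and writes to an in-memory buffer instead of a file; the sample frame and variable names are assumptions for illustration.

```python
import io
import pandas as pd

# One-row sample with the same columns as movie_ratings
sample = pd.DataFrame({'movie_name': ['Logan'], 'release_year': [2017],
                       'imdb': [8.1], 'metascore': [77], 'votes': [595697]})

# Write without the index, then read the CSV text back
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```

The reloaded frame then has exactly the five data columns.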