# IMDB scraping tutorial
[IMDB scraping](https://www.dataquest.io/blog/web-scraping-beautifulsoup/)

[Another great tutorial](https://www.dataquest.io/blog/web-scraping-tutorial-python/)

#### About HTML [tags](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)

In [1]:
from requests import get
from bs4 import BeautifulSoup

In [2]:
url = 'https://www.imdb.com/search/title/?release_date=1972&sort=num_votes,desc&page=1'
headers = {"Accept-Language": "en-US"}

response = get(url, headers)
soup = BeautifulSoup(response.content, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [3]:
movie_containers = soup.find_all('div', class_='lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
50


In [4]:
first_movie = movie_containers[0]
type(first_movie)

bs4.element.Tag

In [5]:
soup


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Released between 1972-01-01 and 1972-12-31
(Sorted by Number of Votes Descending) - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1

### Getting the title of the movie
**"first_movie" is a `Tag` object, and the various HTML tags within it are stored as its attributes. We can access them just like we would access any attribute of a Python object. However, using a tag name as an attribute will only select the first tag by that name. If we run first_movie.div, we only get the content of the first div tag:**

In [6]:
first_movie.div

<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt0068646"></div>
</div>

In [7]:
first_movie.a

<a href="/title/tt0068646/"> <img alt="El padrino" class="loadlate" data-tconst="tt0068646" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BM2MyNjYxNmUtYTAwNi00MTYxLWJmNWYtYzZlODY3ZTk3OTFlXkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UY98_CR1,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB466725069_.png" width="67"/>
</a>

In [8]:
first_movie.h3

<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0068646/">El padrino</a>
<span class="lister-item-year text-muted unbold">(1972)</span>
</h3>

We want the title of the first movie. And we're getting closer

In [9]:
first_movie.h3.a

<a href="/title/tt0068646/">El padrino</a>

In [10]:
# We got the title of the movie
first_movie.h3.a.text

'El padrino'

**Exercise** Get:
- The year of release
- The IMDB rating
- The Metascore
- The number of votes

### The year of release

In [11]:
first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year.text

'(1972)'

### The IMDB rating

In [12]:
first_imdb = first_movie.find('div', class_='ratings-bar').strong
first_imdb = float(first_imdb.text)
first_imdb

9.2

### The Metascore

In [13]:
first_metascore = first_movie.find('span', class_='metascore favorable')
first_metascore = int(first_metascore.text)
first_metascore

100

### The number of votes

In [14]:
first_votes = first_movie.find('span', attrs={'name':'nv'})
first_votes = int(first_votes['data-value'])
first_votes

1607743

That's it! We got the stats for a single movie. Now we can try to make a script to get the stats for all the movies in that single page. 

### The script for a single page
Some of the movies have no metascore rating. We want to avoid these in our script. We'll add an `if` statement to skip those.

In [15]:
# The fourth movie does not have a metascore rating, therefore is going to return a NoneType.
movie_containers[3].find('span', class_='metascore favorable')

In [16]:
# Creating empty list
names = []
years = []
imdbs = []
metascores = []
votes = []

for container in movie_containers:
    
    if container.find('div', class_ = 'ratings-metascore') is not None:
        
        # Get the title
        name = container.h3.a.text
        names.append(name)
        
        # Get the year
        year = container.h3.find('span', class_='lister-item-year text-muted unbold')
        year = year.text[1:-1]
        years.append(year)
        
        # Get the IMDB rating
        imdb = container.find('div', class_='ratings-bar').strong
        imdb = float(imdb.text)
        imdbs.append(imdb)
        
        # Get the Metascore
        mscore = container.find('span', class_ = 'metascore')
        mscore = int(mscore.text)
        metascores.append(mscore)
        
        # Get the number of votes
        vote = container.find('span', attrs={'name':'nv'})
        vote = int(vote['data-value'])
        votes.append(vote)
        
print(votes)

[1607743, 98352, 80377, 49553, 48064, 41276, 40793, 38375, 37956, 32703, 32018, 29989, 28517, 27782, 26620, 26443, 24522, 21903, 16610, 13042, 12535, 12144, 11351, 9841, 9208, 8489, 8175, 7644, 6510]


**Let's see what we got**

In [17]:
import pandas as pd

In [28]:
pd.DataFrame(
    {
        'Year': years,
        'Title': names,
        'IMDB rating': imdbs,
        'Metascore rating': metascores,
        'Number of votes': votes
    }
)

Unnamed: 0,Year,Title,IMDB rating,Metascore rating,Number of votes
0,1972,El padrino,9.2,100,1607743
1,1972,Defensa,7.7,80,98352
2,1972,Solaris,8.1,90,80377
3,1972,El último tango en París,7.0,77,49553
4,1972,Cabaret,7.8,80,48064
5,1972,Frenesí,7.4,92,41276
6,1972,La aventura del Poseidón,7.1,70,40793
7,1972,El discreto encanto de la burguesía,7.9,93,38375
8,1972,Todo lo que usted siempre quiso saber sobre el...,6.8,66,37956
9,1972,La última casa a la izquierda,6.0,68,32703


### Script for multiple pages

Controlling the rate of crawling is beneficial for us, and for the website we are scraping. If we avoid hammering the server with tens of requests per second, then we are much less likely to get our IP address banned. We also avoid disrupting the activity of the website we scrape by allowing the server to respond to other users’ requests too.

In [30]:
from time import sleep
from random import randint

for _ in range(0, 5):
    print('Hola')
    sleep(randint(1, 3))

Hola
Hola
Hola
Hola
Hola


Changing urls parameters

In [31]:
pages_url = [str(i) for i in range(1,5)]
years_url = [str(i) for i in range(2000,2018)]

In [32]:
from tqdm import tqdm

for i in tqdm(range(int(2e6))):
    pass

100%|███████████████████████████████████████████████████████████████████| 2000000/2000000 [00:01<00:00, 1584346.55it/s]


In [33]:
from warnings import warn

for i in range(9):
    if i > 5 and i < 7:
        warn('Number greater than expected')
    else:
        print(i)

0
1
2
3
4
5


  """


7
8


In [34]:
# Code

# Redefining empty lists
names = []
years = []
imdbs = []
metascores = []
votes = []
requests = 0

# for every year in the interval
for year_url in tqdm(years_url):
    
    # for every page in the interval
    for page_url in pages_url:
        
        # make a get request
        response = get('http://www.imdb.com/search/title?release_date=' + year_url + '&sort=num_votes,desc&page=' + page_url, headers = headers)
        
        # pause the loop
        sleep(randint(3, 7))
        
        # warn for non 200 status code
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(requests, response.status_code))
            
        # breaking the loop if there are too many requests
        if requests > 72:
            warn('Number of requests was greater than expected.')
            break
        
        # parse the content from response
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # select all 50 movie containers from each page
        mv_container = soup.find_all('div', class_='lister-item mode-advanced')
        
        # for every movie of these 50
        for container in mv_container:
            
            # if the movie has a metascore, then
            if container.find('div', class_ = 'ratings-metascore') is not None:
                
                # Get the title
                name = container.h3.a.text
                names.append(name)
                
                # Get the year
                year = container.h3.find('span', class_='lister-item-year text-muted unbold')
                year = year.text[1:-1]
                years.append(year)

                # Get the IMDB rating
                imdb = container.find('div', class_='ratings-bar').strong
                imdb = float(imdb.text)
                imdbs.append(imdb)

                # Get the Metascore
                mscore = container.find('span', class_ = 'metascore')
                mscore = int(mscore.text)
                metascores.append(mscore)

                # Get the number of votes
                vote = container.find('span', attrs={'name':'nv'})
                vote = int(vote['data-value'])
                votes.append(vote)

100%|██████████████████████████████████████████████████████████████████████████████████| 18/18 [08:22<00:00, 27.89s/it]


In [35]:
movies = pd.DataFrame(
    {
        'Year': years,
        'Title': names,
        'IMDB rating': imdbs,
        'Metascore rating': metascores,
        'Number of votes': votes
    }
)

movies

Unnamed: 0,Year,Title,IMDB rating,Metascore rating,Number of votes
0,2000,Gladiator,8.5,67,1333418
1,2000,Memento,8.4,80,1119517
2,2000,Snatch,8.3,55,778103
3,2000,Requiem for a Dream,8.3,68,762510
4,2000,X-Men,7.4,64,568629
...,...,...,...,...,...
3267,2017,Darkest Hour,7.4,75,170709
3268,I) (2017,Bright,6.3,29,167728
3269,2017,Valerian and the City of a Thousand Planets,6.5,51,161222
3270,2017,Baywatch,5.5,37,159673


In [46]:
movies.drop_duplicates().sort_values(by='IMDB rating', ascending=False).head(50)

Unnamed: 0,Year,Title,IMDB rating,Metascore rating,Number of votes
1468,2008,The Dark Knight,9.0,84,2288649
568,2003,The Lord of the Rings: The Return of the King,8.9,94,1633160
196,2001,The Lord of the Rings: The Fellowship of the Ring,8.8,92,1649927
1828,2010,Inception,8.8,74,2052275
376,2002,The Lord of the Rings: The Two Towers,8.7,87,1476592
2560,2014,Interstellar,8.6,74,1497168
379,2002,City of God,8.6,79,696110
202,2001,Spirited Away,8.6,96,645170
1113,2006,The Departed,8.5,85,1181953
1112,2006,The Prestige,8.5,66,1181974
