# Multipage scraping

So far, we have practiced web scraping when all the information sits in a single table on a single page of a site. What happens when we want to scrape information from multiple pages?

Go to https://www.imdb.com/search/title/ and enter the following parameters, leaving all other fields blank or with their default values:

- Title Type: Feature film

- Release date: From 1990 to 1992

- User Rating: 7.5 to "-"

The page you get should look familiar. There's a list of movies, and each movie has its title, release year, crew, etc. You could inspect the page and build the code to collect the data.

However, the search returned 631 movies, and each page only shows 50 of them (you can change the settings to show up to 250 movies per page, but that still won't cover all the results).

The way to automate web scraping in these cases is to look at the URLs. The one we've obtained is the following:

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,

If you scroll down and click on "Next", the URL is now: https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt

Click again on "Next" and here's the new URL: https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=101&ref_=adv_nxt

The pattern is clear: our search options are in the parameters title_type, release_date and user_rating. Then we have the start parameter, which increases in steps of 50, and the ref_ parameter, which takes the value "adv_nxt".
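To double-check the pattern, the query string of each URL can be split into its parameters. This is a minimal sketch using only the standard library; the helper name `query_params` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

first_page = ("https://www.imdb.com/search/title/?title_type=feature"
              "&release_date=1990-01-01,1992-12-31&user_rating=7.5,")
second_page = first_page + "&start=51&ref_=adv_nxt"
third_page = first_page + "&start=101&ref_=adv_nxt"

def query_params(url):
    """Return the query string of a URL as a {name: value} dictionary."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs returns a list of values per name; each name appears once here
    return {name: values[0] for name, values in parsed.items()}

for url in (first_page, second_page, third_page):
    # Only the pages reached through "Next" carry a start parameter
    print(query_params(url).get("start", "(no start parameter)"))
```

Only the start value changes from page to page, which is what makes the URLs easy to generate in a loop.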

Let's do some requests:

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=101&ref_=adv_nxt

In [1]:
# 1. import libraries
import requests
from bs4 import BeautifulSoup

In [2]:
# 2. url: we start with the 'second' page
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt"

In [3]:
# 3. download html with a get request
response = requests.get(url)
print(response.status_code)

200


In [4]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.text, 'html.parser')
# 4.2. check that the html code looks like it should
# (prettify is a method, so it needs parentheses to be called)
print(soup.prettify())

<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Feature Film,
Released between 1990-01-01 and 1992-12-31,
User Rating at least 7.5
(Sorted by Popularity Ascending) - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/search/title/?title_type=feat

Now we'll have to build a loop where we simply replace the 51 with each of the other values (jumping by 50) up to the end of the results. For simplicity, we will build this list of values manually before iterating through it:



In [None]:
base_url = "https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,"
base_url

'https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,'

## Activity

Create the code to build the corresponding urls to scrape.

In [8]:
# Your code here

start = 51
link  = 'https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,'
urls  = [link]  # the first page has no 'start' parameter

# 631 results at 50 per page means 13 pages: the first one plus 12 more
for i in range(12):
    urls.append(f'{link}&start={start}&ref_=adv_nxt')
    start = start + 50

urls

['https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,',
 'https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=101&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=151&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=201&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=251&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=301&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&us
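As an alternative sketch, the number of pages can be derived from the result count instead of being hard-coded. The variable names `total_results`, `page_size` and `page_urls` are made up for illustration; the values 631 and 50 are the ones seen on the search page.

```python
base_url = ("https://www.imdb.com/search/title/?title_type=feature"
            "&release_date=1990-01-01,1992-12-31&user_rating=7.5,")
total_results = 631  # number of movies reported by the search
page_size = 50       # movies shown per page

# The first page has no start parameter; the following ones start at 51, 101, ...
page_urls = [base_url] + [
    f"{base_url}&start={start}&ref_=adv_nxt"
    for start in range(page_size + 1, total_results + 1, page_size)
]

print(len(page_urls))   # number of pages to request
print(page_urls[-1])    # URL of the last page
```

With this approach, a different search only requires updating the base URL and the total number of results.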

Respectful scraping:

Before starting with the actual scraping, though, there's something we need to note when sending massive, automated requests to websites: it's rude.

We only have 13 requests to send, which is not many, but it's still good practice to let a few seconds pass between requests. Some sites don't like being scraped and will block your IP if they detect automated requests. Others run on a server that is small for the traffic they handle, and too many requests might crash the site.

The sleep function from Python's time module will help us with that: calling sleep(2) pauses the program for 2 seconds, so placing it inside a for loop spaces out the iterations.

In [None]:
# To make it more "human", we can randomize the waiting time:
from time import sleep
from random import randint
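A minimal sketch of a randomized pause inside a loop could look like this. The helper `polite_pause` is a made-up name, and it uses `random.uniform` (fractional seconds) rather than `randint`; the bounds in the demonstration are deliberately short so it finishes quickly.

```python
from random import uniform
from time import sleep

def polite_pause(min_seconds=2.0, max_seconds=7.0):
    """Sleep for a random time between the two bounds and return the time waited."""
    seconds = uniform(min_seconds, max_seconds)
    sleep(seconds)
    return seconds

# Short bounds here; for real scraping, a few seconds is more considerate
for i in range(3):
    waited = polite_pause(0.1, 0.3)
    print(f"Iteration {i}: waited {waited:.2f} seconds")
```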


We will now scrape all the pages and store the responses in a list, waiting a few seconds between requests:

In [None]:
pages = []  # the successful responses are stored here

for url in urls:
    response = requests.get(url)
    if response.status_code == 200:
        print("Webpage successfully retrieved!")
        pages.append(response)  # keep the response so we can parse it later
        seconds_to_wait = randint(2, 7)
        print("Waiting {} seconds before scraping the next website!".format(seconds_to_wait))
        sleep(seconds_to_wait)
    else:
        print("Unable to get the html code. Jumping to the next website")

Webpage successfully retrieved!
Waiting 3 seconds before scraping the next website!
Webpage successfully retrieved!
Waiting 4 seconds before scraping the next website!
Webpage successfully retrieved!
Waiting 3 seconds before scraping the next website!
Webpage successfully retrieved!
Waiting 7 seconds before scraping the next website!
Webpage successfully retrieved!
Waiting 4 seconds before scraping the next website!
Webpage successfully retrieved!
Waiting 3 seconds before scraping the next website!
Webpage successfully retrieved!
Waiting 3 seconds before scraping the next website!
Webpage successfully retrieved!
Waiting 7 seconds before scraping the next website!


Note that if you print the object pages after running the code above, you'll only see the response objects with their status codes (for example, `<Response [200]>`). The HTML code is still accessible through each response's text attribute, and you can parse it the same way we've always done:


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Feature Film,
Released between 1990-01-01 and 1992-12-31,
User Rating at least 7.5
(Sorted by Popularity Ascending) - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
 

Now it's time to build the code that collects all 631 movie titles and their synopses in a dataframe.

#### titles

In [None]:
# Parse just the first page, for testing purposes

# Paste the Selector from the first movie title copied from Chrome Dev Tools

# Trim the selection


[<a href="/title/tt0102926/">El silencio de los corderos</a>,
 <a href="/title/tt0099685/">Uno de los nuestros</a>,
 <a href="/title/tt0099487/">Eduardo Manostijeras</a>,
 <a href="/title/tt0099785/">Solo en casa</a>,
 <a href="/title/tt0103064/">Terminator 2: El juicio final</a>,
 <a href="/title/tt0104257/">Algunos hombres buenos</a>,
 <a href="/title/tt0105323/">Esencia de mujer</a>,
 <a href="/title/tt0099810/">La caza del Octubre Rojo</a>,
 <a href="/title/tt0105236/">Reservoir Dogs</a>,
 <a href="/title/tt0100802/">Desafío total</a>,
 <a href="/title/tt0100157/">Misery</a>,
 <a href="/title/tt0105695/">Sin perdón</a>,
 <a href="/title/tt0099348/">Bailando con lobos</a>,
 <a href="/title/tt0106308/">El ejército de las tinieblas</a>,
 <a href="/title/tt0099674/">El padrino: Parte III</a>,
 <a href="/title/tt0103939/">Chaplin</a>,
 <a href="/title/tt0101921/">Tomates verdes fritos</a>,
 <a href="/title/tt0101414/">La bella y la bestia</a>,
 <a href="/title/tt0104691/">El último mohi

#### synopsis

In [None]:
# Paste the Selector from the first movie synopsis copied from Chrome Dev Tools


[<p class="text-muted">
     A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.</p>]

In [None]:
# Trim the selection


[<p class="text-muted">
     A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.</p>,
 <p class="text-muted">
     The story of <a href="/name/nm1453737">Henry Hill</a> and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.</p>,
 <p class="text-muted">
     An artificial man, who was incompletely constructed and has scissors for hands, leads a solitary life. Then one day, a suburban lady meets him and introduces him to her world.</p>,
 <p class="text-muted">
     An eight-year-old troublemaker must protect his house from a pair of burglars when he is accidentally left home alone by his family during Christmas vacation.</p>,
 <p class="text-muted">
     A cyborg, identical to the one who failed to kill Sarah Connor, must now protect her teenage son, John Connor, fro

There are many ways to do this. The one we'll follow is:

- Loop through the pages we collected, parse them ("create the soup") and store the parsed pages in a list. 

- For each parsed page, select the "blocks of HTML elements" that contain all the information of each movie (the title, the synopsis and other stuff). 

- For each one of the "blocks" we collected in the previous step: 

    - Get the movie titles and store them in a list 

    - Get the synopses and store them in a list
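The steps above can be sketched as follows. To keep the example self-contained, it parses a tiny hand-written HTML fragment instead of the downloaded pages. The class name `lister-item-content` and the overall structure of the fragment are assumptions made for illustration, so check the real selectors with your browser's developer tools.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a results page (hypothetical markup)
sample_html = """
<div class="lister-item-content">
  <h3><a href="/title/tt0102926/">El silencio de los corderos</a></h3>
  <p class="text-muted">
     A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.</p>
</div>
<div class="lister-item-content">
  <h3><a href="/title/tt0099685/">Uno de los nuestros</a></h3>
  <p class="text-muted">
     The story of <a href="/name/nm1453737">Henry Hill</a> and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.</p>
</div>
"""
html_pages = [sample_html]  # in the notebook: [page.text for page in pages]

# Step 1: parse every page and keep the soups in a list
soups = [BeautifulSoup(html, "html.parser") for html in html_pages]

titles, synopses = [], []
for soup in soups:
    # Step 2: select the block that wraps all the information of each movie
    for block in soup.select("div.lister-item-content"):
        # Step 3: extract each field from the block, tolerating missing elements
        title_tag = block.select_one("h3 a")
        synopsis_tag = block.select_one("p.text-muted")
        titles.append(title_tag.get_text(strip=True) if title_tag else None)
        synopses.append(synopsis_tag.get_text(" ", strip=True) if synopsis_tag else None)

print(titles)
```

Extracting both fields from the same block keeps the two lists aligned even when a movie lacks a synopsis, so they can later be combined with something like `pd.DataFrame({"title": titles, "synopsis": synopses})`.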

## Activity

#### Scraping presidents

Our objective is to create a dataframe with information about the presidents of the United States. To do this, we will go through these steps:

1. Scrape this [list of presidents of the United States](https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States).


In [None]:
# 1. import libraries



# 2. find url and store it in a variable

# 3. download html with a get request


# 4.1. parse html (create the 'soup')

# 4.2. check that the html code looks like it should


2. Collect all the links to the Wikipedia page of each president.


In [None]:
# we can access the links searching for the attribute "href"
# in each element
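As a hedged illustration of reading the href attribute, the sketch below works on a hand-written fragment rather than the real Wikipedia page. The selector `table.wikitable b a` and the sample markup are assumptions; the actual page structure should be checked in the browser.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

list_url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

# Hand-written stand-in for a fragment of the list page (hypothetical markup)
sample_html = """
<table class="wikitable">
  <tr><td><b><a href="/wiki/George_Washington">George Washington</a></b></td></tr>
  <tr><td><b><a href="/wiki/John_Adams">John Adams</a></b></td></tr>
  <tr><td><b><a>Anchor without a link</a></b></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

president_links = []
for anchor in soup.select("table.wikitable b a"):
    href = anchor.get("href")  # returns None when the attribute is missing
    if href:
        # Links on the page are relative, so join them with the page URL
        president_links.append(urljoin(list_url, href))

print(president_links)
```

Using `anchor.get("href")` instead of `anchor["href"]` avoids a KeyError on anchors that have no link.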


In [None]:
# Now, we just assemble a new request to the link
# send request


# parse & store html


3. Scrape the Wikipedia page of each president.


In this step we could store the whole Wikipedia page for each president, or just the tiny, final pieces of information. Storing the infoboxes is a middle ground (we avoid most of the noise but retain the flexibility of deciding later which specific elements to extract).

When sending multiple requests, remember to be respectful by spacing the requests a few seconds apart. We will also print the status code to check that everything is going well:

In [None]:
# 2. find url and store it in a variable


    # send request
 
   
    # parse & store html
    
    # respectful nap:
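One possible shape for this loop is sketched below. The function name `scrape_infoboxes` and its parameters are made up for illustration, and it assumes that the box of interest is a table with the class `infobox`. The `fetch` argument is any callable that behaves like `requests.get`, which also makes the function easy to try without touching the network.

```python
from random import uniform
from time import sleep
from bs4 import BeautifulSoup

def scrape_infoboxes(links, fetch, min_wait=2.0, max_wait=5.0):
    """Request each link and keep its infobox (or None), pausing between requests.

    `fetch` takes a URL and returns an object with `status_code` and `text`
    attributes, as requests.get does.
    """
    infoboxes = []
    for link in links:
        response = fetch(link)
        print(link, response.status_code)  # monitor that everything goes well
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            # find returns None if the page has no such table
            infoboxes.append(soup.find("table", class_="infobox"))
        else:
            infoboxes.append(None)
        sleep(uniform(min_wait, max_wait))  # respectful nap
    return infoboxes
```

In the notebook, this could be called as `scrape_infoboxes(links, requests.get)`, where `links` is the list of president URLs collected in the previous step.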
 

4. Find and store information about each president.


We extracted the 'infoboxes': now it's time to extract specific pieces of information from them. Let's test what we can get from a single president and then assemble a loop for all of them, as usual.

Here, we will use [the string argument](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument) in the find function, since Wikipedia tags and classes are not always helpful for locating elements. The string argument allows us to locate elements by their actual content.
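A small sketch of the idea follows, run on a hand-written infobox with placeholder values (not real data). The helper name `infobox_value` is hypothetical, and real infoboxes are messier, so the labels and the sibling lookup may need adjusting.

```python
from bs4 import BeautifulSoup

# Hand-written, simplified infobox with placeholder values
sample_infobox = """
<table class="infobox">
  <tr><th>Born</th><td>January 1, 1800</td></tr>
  <tr><th>Political party</th><td>Example Party</td></tr>
  <tr><th>Children</th><td>5</td></tr>
</table>
"""
infobox = BeautifulSoup(sample_infobox, "html.parser")

def infobox_value(box, label):
    """Return the text of the cell next to the header whose text equals `label`."""
    header = box.find("th", string=label)  # match the header by its content
    if header is None:
        return None
    cell = header.find_next_sibling("td")
    return cell.get_text(" ", strip=True) if cell else None

print(infobox_value(infobox, "Born"))
print(infobox_value(infobox, "Political party"))
print(infobox_value(infobox, "Children"))
```

Returning None for a missing label lets the same function run over every president without stopping on incomplete infoboxes.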

In [None]:
#Birthday

#Political party

#Number of sons/daughters


5

5. Organize the information in a dataframe where each president is a row and each variable we collected is a column. Consider using the pandas function json_normalize().
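As a rough sketch, assuming the information for each president was collected as a (possibly nested) dictionary, `pd.json_normalize` turns a list of such dictionaries into a flat dataframe. The records below are placeholders, not real data.

```python
import pandas as pd

# Placeholder records: one dictionary per president
presidents = [
    {"name": "Example President A", "born": "January 1, 1800",
     "party": "Example Party", "family": {"children": 5}},
    {"name": "Example President B", "born": "February 2, 1850",
     "party": "Other Party", "family": {"children": 2}},
]

# Nested keys are flattened into dotted column names, e.g. "family.children"
df = pd.json_normalize(presidents)
print(df.columns.tolist())
print(df)
```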