# Web Scraping multiple pages 

We have practiced web scraping when all the information we wanted was on a single table of a site. What happens when we want to scrape information from multiple pages?

## First example - IMDB 

Go to https://www.imdb.com/search/title/ and enter the following parameters, leaving all other fields blank or at their default values:

- Title Type: Feature film

- Release date: From 1990 to 1992

- User Rating: 7.5 to "-"

The page you get should look familiar: a list of movies, each with its title, release year, crew, etc. You could inspect the page and build the code to collect the data.

Note that the query returns hundreds of movies, and each page shows only 50 of them (you can change the settings to show up to 250 movies per page, but that still won't be the complete list).

One way to automate multi-page web scraping is to look at the URLs. The URL of the first page of results is:

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,

Now scroll down and click on "Next". The URL becomes:

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt

Can you see the pattern?

Our search options are in the parameters title_type, release_date and user_rating. Then we have the start parameter, which increases in steps of 50 (1, 51, 101, ...), and the ref_ parameter, which takes the value "adv_nxt".
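To make the pattern easier to see, the second URL can be split into its query parameters. This is only a sketch for inspection: `urlsplit` and `parse_qs` come from the standard library's `urllib.parse` module and are not part of the original notebook.

```python
from urllib.parse import urlsplit, parse_qs

url = ("https://www.imdb.com/search/title/?title_type=feature"
       "&release_date=1990-01-01,1992-12-31&user_rating=7.5,"
       "&start=51&ref_=adv_nxt")

# parse_qs maps each parameter name to a list of its values
params = parse_qs(urlsplit(url).query)

for name, values in params.items():
    print(name, "=", values[0])
```

Only the start value changes from one page to the next, so that is the part of the URL we need to generate.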

In [1]:
#  import libraries
from bs4 import BeautifulSoup
import requests

In [2]:
#  url: this time, start with the 'second' page
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt"

In [3]:
# download html with a request, check response code 
response=requests.get(url)
response.status_code

200

In [None]:
#  parse html (create the 'soup')

# check that the html code looks as expected 


Now, we'll have to build a list of values which jumps by 50, up to the total number of movies we want to scrape.  
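A minimal sketch of that idea follows. The value of `total_movies` is made up for illustration (read the real total from the results page), and the sketch assumes that `start=1` returns the first page of results. `range` produces the start values and `str.format` fills them into the URL.

```python
base_url = ("https://www.imdb.com/search/title/?title_type=feature"
            "&release_date=1990-01-01,1992-12-31&user_rating=7.5,"
            "&start={start}&ref_=adv_nxt")

total_movies = 250  # assumption for illustration only
page_size = 50      # movies shown per page by default

# start values jump by page_size: 1, 51, 101, ...
iterations = range(1, total_movies + 1, page_size)

# one URL per page of results
urls = [base_url.format(start=start) for start in iterations]

for page_url in urls:
    print(page_url)
```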

In [None]:
# define iterations 


In [None]:
# check the iterations work


In [None]:
# create the url string for the page search, populate with the iterations


In [None]:
# test the urls 


### Respectful scraping:

Before starting with the actual scraping, though, there's something we need to note when sending automated requests to websites: it's good practice to let a few seconds pass in between requests. 

Some pages don't like being scraped and will block your IP if they detect you are sending automated requests. Others might have a small server for the traffic they handle, and sending too many requests might crash the site.

The sleep function from the time module will help us with that.

In [None]:
from time import sleep

#simple example 
for i in range(5):
    print(i)
    sleep(3)



In [None]:
# To make it more "human", we can randomize the waiting time:
from random import randint
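One possible way to complete this cell, as a sketch: draw a random whole number of seconds with `randint` and pass it to `sleep`. The helper name `polite_pause` and its default bounds are hypothetical; the `sleep_fn` parameter only exists so the helper can be tried without actually waiting.

```python
from random import randint
from time import sleep


def polite_pause(min_seconds=2, max_seconds=5, sleep_fn=sleep):
    """Wait a random whole number of seconds and return the wait used."""
    wait = randint(min_seconds, max_seconds)  # both ends are inclusive
    sleep_fn(wait)
    return wait
```

Calling `polite_pause()` between two requests then waits somewhere between 2 and 5 seconds.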



### Assembling the script to send and store multiple requests
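The loop below is a sketch of how the pieces could fit together, not a definitive implementation. The function name `fetch_pages` and its parameters are hypothetical. By default it uses `requests.get`; the `get` and `sleep_fn` parameters are there so the loop can be tried offline with stand-ins.

```python
from random import randint
from time import sleep


def fetch_pages(urls, get=None, min_wait=2, max_wait=5, sleep_fn=sleep):
    """Request each URL in turn and return the list of response objects."""
    if get is None:
        import requests  # imported lazily so the loop can be tried offline
        get = requests.get

    pages = []
    for url in urls:
        response = get(url)
        print("Status code:", response.status_code)  # monitor progress
        pages.append(response)

        wait = randint(min_wait, max_wait)  # respectful, randomized pause
        print(f"Sleeping for {wait} seconds")
        sleep_fn(wait)
    return pages
```

With the list of page URLs stored in a variable such as `urls`, the call would be `pages = fetch_pages(urls)`.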

Note: if you print the pages object after running the code above, you'll only see the response codes (for example, <Response [200]>), but the HTML is still accessible and you can parse it the same way as before.

### Build code to collect the relevant information from the Request 

This is what we need:

##### Parse just the first page, for testing purposes
- soup=BeautifulSoup(pages[0].content, "html.parser")

##### title and synopsis

- soup.select("div.lister-item-content > h3 > a")
- soup.select("div.lister-item-content > p:nth-child(4)")
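To see what these two selectors return without sending a request, here is a sketch that runs them on a made-up HTML fragment. The fragment only imitates the structure the selectors expect (an h3 with a link, and the synopsis as the fourth child of the block); the live page may use different markup or class names.

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped like the markup the selectors above expect
sample_html = """
<div class="lister-item-content">
  <h3><span>1.</span> <a href="/title/tt0000001/">First Movie</a></h3>
  <p class="text-muted">R | 120 min | Drama</p>
  <div class="ratings-bar">8.1</div>
  <p class="text-muted">
    A short synopsis of the first movie.</p>
</div>
<div class="lister-item-content">
  <h3><span>2.</span> <a href="/title/tt0000002/">Second Movie</a></h3>
  <p class="text-muted">PG | 95 min | Comedy</p>
  <div class="ratings-bar">7.6</div>
  <p class="text-muted">
    A short synopsis of the second movie.</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

title_tags = soup.select("div.lister-item-content > h3 > a")
synopsis_tags = soup.select("div.lister-item-content > p:nth-child(4)")

# "Trim the selection": keep only the text and strip the whitespace
titles = [tag.get_text() for tag in title_tags]
synopses = [tag.get_text().strip() for tag in synopsis_tags]

print(titles)
print(synopses)
```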

#### titles

In [None]:
# Parse just the first page, for testing purposes

# Paste the Selector from the first movie title copied from Chrome Dev Tools

# Trim the selection


#### synopsis

In [None]:
# Paste the Selector from the first movie synopsis copied from Chrome Dev Tools


In [None]:
# Trim the selection


### combine all the code 

There are many approaches to do this. The one we'll follow is: 

- Loop through the pages we collected, parse them ("create the soup") and store the parsed pages in a list. 

- For each parsed page, select the "blocks of HTML elements" that contain all the information of each movie (the title, the synopsis and other stuff). 

- For each one of the "blocks" we collected in the previous step: 

    - Get the movie titles and store them in a list 

    - Get the synopsis and store them in a list
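The steps above can be sketched as a single function. The name `extract_movies` is hypothetical, and the sketch assumes each element of `pages` has a `.content` attribute (as `requests` responses do) and that each movie sits in a `div.lister-item-content` block, as in the selectors used earlier.

```python
from bs4 import BeautifulSoup


def extract_movies(pages):
    """Return (titles, synopses) collected from a list of responses."""
    # 1. parse every page and keep the soups in a list
    soups = [BeautifulSoup(page.content, "html.parser") for page in pages]

    titles, synopses = [], []
    for soup in soups:
        # 2. one block of HTML per movie
        for block in soup.select("div.lister-item-content"):
            # 3. pull the pieces out of each block
            title_tag = block.select_one("h3 > a")
            synopsis_tag = block.select_one("p:nth-child(4)")
            # store None when a piece is missing so both lists stay aligned
            titles.append(title_tag.get_text(strip=True) if title_tag else None)
            synopses.append(synopsis_tag.get_text(strip=True) if synopsis_tag else None)
    return titles, synopses
```

Working block by block keeps the two lists the same length even when a movie has no synopsis, which matters if they are later combined into a table.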

In [None]:
# check the output and identify any wrangling steps we missed 

-----------

## 2nd example - Scraping presidents

Our objective is to create a dataframe with information about the presidents of the United States. To do this, we will go through 5 steps:

1. Scrape this [list of presidents of the United States](https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States).


In [None]:
# 1. import libraries



# 2. find url and store it in a variable

# 3. download html with a get request


# 4.1. parse html (create the 'soup')

# 4.2. check that the html code looks like it should


2. Collect all the links to the Wikipedia page of each president.


In [None]:
# we can access the links searching for the attribute "href"
# in each element
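As a sketch of reading the "href" attribute, the example below works on a made-up fragment rather than the real page. The selector `table.wikitable td b a` is an assumption about how the names are marked up and should be checked against the actual HTML. Wikipedia links are relative, so `urljoin` from the standard library turns them into full URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

# Made-up fragment standing in for two rows of the presidents table
sample_html = """
<table class="wikitable">
  <tr><td><b><a href="/wiki/George_Washington">George Washington</a></b></td></tr>
  <tr><td><b><a href="/wiki/John_Adams">John Adams</a></b></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select the anchors, then read the "href" attribute of each one
anchors = soup.select("table.wikitable td b a")
president_links = [urljoin(base, a["href"]) for a in anchors]

print(president_links)
```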


In [None]:
# Now, we just assemble a new request to the link
# send request


# parse & store html


3. Scrape the Wikipedia page of each president.


In this step we could store the whole Wikipedia page for each president, or just the small, final pieces of information. Storing only the infoboxes (the summary boxes shown alongside each article) is a middle ground: there is not too much noise, but we keep the flexibility to decide later which specific elements to extract.
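A sketch of keeping only the infobox from a parsed page is shown below. The HTML is made up, and the assumption that the box is a table carrying the class "infobox" should be verified on a real president's page.

```python
from bs4 import BeautifulSoup

# Made-up page: some article text plus one infobox table
sample_html = """
<html><body>
  <p>Lots of article text we do not need right now.</p>
  <table class="infobox vcard">
    <tr><th>Born</th><td>January 1, 1800</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches when "infobox" is one of the classes on the tag
infobox = soup.find("table", class_="infobox")

presidents_boxes = []
if infobox is not None:  # skip pages where no infobox was found
    presidents_boxes.append(infobox)

print(len(presidents_boxes))
```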

When sending multiple requests, remember to be respectful by spacing the requests a few seconds apart. We will also print the status code of each response to check that everything is going well:

In [None]:
# loop through the president links collected in step 2


    # send request
 
   
    # parse & store html
    
    # respectful nap:
 

4. Find and store information about each president.


We extracted the infoboxes; now it's time to extract specific information from them. First test what we can get from a single president, and then assemble a loop for all of them.

Here, we will use [the string argument](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument) of the find method, since Wikipedia tags and classes are not always helpful for locating elements. The string argument allows us to locate elements by their actual content.
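The sketch below shows the string argument on a made-up infobox. The row labels ("Born", "Political party", "Children") and the helper name `infobox_value` are assumptions for illustration; real infoboxes vary, so a label may be missing or worded differently.

```python
from bs4 import BeautifulSoup

# Made-up infobox with invented values
sample_box = BeautifulSoup("""
<table class="infobox">
  <tr><th>Born</th><td>January 1, 1800</td></tr>
  <tr><th>Political party</th><td>Example Party</td></tr>
  <tr><th>Children</th><td>3</td></tr>
</table>
""", "html.parser")


def infobox_value(box, label):
    """Return the text of the cell next to the header whose text is `label`."""
    header = box.find("th", string=label)  # match on content, not on class
    if header is None:
        return None  # this infobox has no such row
    cell = header.find_next_sibling("td")
    return cell.get_text(" ", strip=True) if cell else None


print(infobox_value(sample_box, "Born"))
print(infobox_value(sample_box, "Political party"))
print(infobox_value(sample_box, "Children"))
```

Returning None for a missing row lets a loop over all presidents continue instead of failing on the first incomplete infobox.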

In [None]:
#Birthday

#Political party

#Number of sons/daughters


# collect with a loop 

5. Organize the information in a dataframe where we have each president as a row and each variable we collected as a column.
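A minimal sketch of this last step, assuming the values from step 4 were collected as one dictionary per president. The records and column names below are made up; pandas turns a list of dictionaries into one row per dictionary and one column per key.

```python
import pandas as pd

# Made-up records; in practice these come from the loop in step 4
records = [
    {"name": "President A", "birthday": "January 1, 1800",
     "party": "Example Party", "children": "3"},
    {"name": "President B", "birthday": None,
     "party": "Other Party", "children": None},
]

presidents_df = pd.DataFrame(records)

print(presidents_df)
```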