# Lab | Web Scraping Multiple Pages


As you've seen, web scraping is a skill that lets you collect all sorts of information. Here are some small challenges to help you gain more experience with it. Open a new Jupyter notebook and scrape at least 3 of these sites.

- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: url = 'https://en.wikipedia.org/wiki/Python'
- Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'
- Create a Python list with the names on the FBI's Ten Most Wanted list: url = 'https://www.fbi.gov/wanted/topten'
- Display information on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'
- List all language names and their number of related articles in the order they appear on wikipedia.org: url = 'https://www.wikipedia.org/'
- Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'
- Display the top 10 languages by number of native speakers in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

### Retrieve the Wikipedia page for "Python" and create a list of the links on that page: url = 'https://en.wikipedia.org/wiki/Python'


In [2]:
url = 'https://en.wikipedia.org/wiki/Python'

In [3]:
response = requests.get(url)
response.status_code # 200 status code means OK!

200
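
Checking `status_code` by eye works in a notebook, but a script needs to react to failures on its own. The following is a minimal sketch of one way to do that with `raise_for_status()`. The helper name `check_response` is hypothetical, and the responses are built by hand so the example runs without network access. In real use the response would come from `requests.get(url, timeout=10)`.

```python
import requests

def check_response(response):
    """Hypothetical helper: return True for a successful response, False otherwise."""
    try:
        # Raises requests.HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.HTTPError as error:
        print(f'Request failed: {error}')
        return False
    return True

# Hand-built responses stand in for real ones (assumption: no network needed)
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(check_response(ok_response))
print(check_response(missing_response))
```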

In [4]:
soup = BeautifulSoup(response.text, 'html.parser')

In [5]:
# Restrict the search to the article body (the element with id "mw-content-text")
content_text = soup.find('div', {'id': 'mw-content-text'})
# Collect every <a> tag that has an href attribute
links = content_text.find_all('a', href=True)


In [6]:
# Collect the href attribute of each link in a list, then print each one
hrefs = [link.get('href') for link in links]
for href in hrefs:
    print(href)

https://en.wiktionary.org/wiki/Python
https://en.wiktionary.org/wiki/python
/w/index.php?title=Python&action=edit&section=1
/wiki/Pythonidae
/wiki/Python_(genus)
/wiki/Python_(mythology)
/w/index.php?title=Python&action=edit&section=2
/wiki/Python_(programming_language)
/wiki/CMU_Common_Lisp
/wiki/PERQ#PERQ_3
/w/index.php?title=Python&action=edit&section=3
/wiki/Python_of_Aenus
/wiki/Python_(painter)
/wiki/Python_of_Byzantium
/wiki/Python_of_Catana
/wiki/Python_Anghelo
/w/index.php?title=Python&action=edit&section=4
/wiki/Python_(Efteling)
/wiki/Python_(Busch_Gardens_Tampa_Bay)
/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
/w/index.php?title=Python&action=edit&section=5
/wiki/Python_(automobile_maker)
/wiki/Python_(Ford_prototype)
/w/index.php?title=Python&action=edit&section=6
/wiki/Python_(missile)
/wiki/Python_(nuclear_primary)
/wiki/Colt_Python
/w/index.php?title=Python&action=edit&section=7
/wiki/Python_(codename)
/wiki/Python_(film)
/wiki/Monty_Python
/wiki/Python_(Monty)_Picture
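
The output above mixes absolute URLs, site-relative paths, and section edit links (`action=edit`). Here is a minimal sketch of one way to clean such a list, using only `urllib.parse.urljoin` from the standard library. The helper name `normalize_links` is hypothetical, and the sample hrefs are copied from the output above rather than fetched live.

```python
from urllib.parse import urljoin

BASE_URL = 'https://en.wikipedia.org/wiki/Python'

def normalize_links(hrefs, base_url=BASE_URL):
    """Hypothetical helper: drop edit links and make the remaining links absolute."""
    cleaned = []
    for href in hrefs:
        # Section edit links are page chrome, not content links
        if 'action=edit' in href:
            continue
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        cleaned.append(urljoin(base_url, href))
    return cleaned

# A few hrefs copied from the output above
sample_hrefs = [
    'https://en.wiktionary.org/wiki/Python',
    '/w/index.php?title=Python&action=edit&section=1',
    '/wiki/Pythonidae',
    '/wiki/Monty_Python',
]

absolute_links = normalize_links(sample_hrefs)
for link in absolute_links:
    print(link)
```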

### Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'


In [7]:
url = 'https://uscode.house.gov/download/download.shtml'

In [8]:
response = requests.get(url)
response.status_code # 200 status code means OK!

200

In [9]:
soup = BeautifulSoup(response.text, 'html.parser')

In [10]:
# Find elements with the class 'usctitlechanged'
changed_elements = soup.select('.usctitlechanged')
changed_elements

[<div class="usctitlechanged" id="us/usc/t38">
 
           Title 38 - Veterans' Benefits <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>]

In [11]:
# Print the content of each matching element
for element in changed_elements:
    print(element.text)



          Title 38 - Veterans' Benefits ٭



In [12]:
# Count the number of matching elements
num_elements = len(changed_elements)
print(f'Total number of elements with class "usctitlechanged": {num_elements}')

Total number of elements with class "usctitlechanged": 1
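
The printed text above still carries surrounding whitespace and the footnote marker. If clean title names are wanted alongside the count, one possible approach is sketched below. It assumes the markup looks like the element shown in the earlier output, which is pasted in as `sample_html` instead of being fetched, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the cell output above (assumption: the live page keeps this structure)
sample_html = '''
<div class="usctitlechanged" id="us/usc/t38">
  Title 38 - Veterans' Benefits <span class="footnote"><a class="fn" href="#fn">٭</a></span>
</div>
'''

sample_soup = BeautifulSoup(sample_html, 'html.parser')

changed_titles = []
for element in sample_soup.select('.usctitlechanged'):
    # Remove the footnote marker so only the title text remains
    for footnote in element.select('span.footnote'):
        footnote.decompose()
    # strip=True trims the leading and trailing whitespace
    changed_titles.append(element.get_text(strip=True))

print(changed_titles)
print(len(changed_titles))
```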


### Display the top 10 languages by number of native speakers in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [18]:
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [19]:
response = requests.get(url)
response.status_code # 200 status code means OK!

200

In [20]:
soup = BeautifulSoup(response.text, 'html.parser')

In [21]:
# Find the table containing language data
table = soup.find('table', {'class': 'wikitable'})
table

<table class="wikitable sortable static-row-numbers">
<caption>Languages with at least 50 million first-language speakers<sup class="reference" id="cite_ref-e26_7-1"><a href="#cite_note-e26-7">[7]</a></sup>
</caption>
<tbody><tr>
<th>Language
</th>
<th data-sort-type="number">Native speakers<br/><small>(millions)</small>
</th>
<th>Language family
</th>
<th>Branch
</th></tr>
<tr>
<td><a class="mw-redirect" href="/wiki/ISO_639:cmn" title="ISO 639:cmn">Mandarin Chinese</a>
</td>
<td>939
</td>
<td><a href="/wiki/Sino-Tibetan_languages" title="Sino-Tibetan languages">Sino-Tibetan</a>
</td>
<td><a href="/wiki/Sinitic_languages" title="Sinitic languages">Sinitic</a>
</td></tr>
<tr>
<td><a class="mw-redirect" href="/wiki/ISO_639:spa" title="ISO 639:spa">Spanish</a>
</td>
<td>485
</td>
<td><a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>
</td>
<td><a href="/wiki/Romance_languages" title="Romance languages">Romance</a>
</td></tr>
<tr>
<td><a class="mw-red

In [22]:
language = []
numbers = []

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # The first table with class "wikitable" holds the ranking
    table = soup.find('table', class_='wikitable')

    # Skip the header row, then keep the first 10 data rows
    for row in table.find_all('tr')[1:11]:
        columns = row.find_all(['td', 'th'])
        language.append(columns[0].text.strip())
        numbers.append(columns[1].text.strip())

    languages = pd.DataFrame({'Language': language, 'Native Speakers': numbers})
    print(languages)

else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

           Language Native Speakers
0  Mandarin Chinese             939
1           Spanish             485
2           English             380
3             Hindi             345
4        Portuguese             236
5           Bengali             234
6           Russian             147
7          Japanese             123
8       Yue Chinese            86.1
9        Vietnamese            85.0
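
The `Native Speakers` column above holds strings, since scraped cell text is always `str`. Here is a minimal sketch of converting it to numbers (in millions, per the table header) so it can be sorted or summed. The dataframe is rebuilt from the values printed above so the example stands alone.

```python
import pandas as pd

# Values copied from the table printed above
languages = pd.DataFrame({
    'Language': ['Mandarin Chinese', 'Spanish', 'English', 'Hindi', 'Portuguese',
                 'Bengali', 'Russian', 'Japanese', 'Yue Chinese', 'Vietnamese'],
    'Native Speakers': ['939', '485', '380', '345', '236',
                        '234', '147', '123', '86.1', '85.0'],
})

# Convert the scraped strings to numbers so the column supports arithmetic
languages['Native Speakers'] = pd.to_numeric(languages['Native Speakers'])

print(languages.dtypes)
print(languages.nlargest(3, 'Native Speakers'))
```

If the live table contains footnote markers or thousands separators, the strings would need cleaning before conversion. Passing `errors='coerce'` to `pd.to_numeric` is one way to turn unparseable cells into `NaN` instead of raising an error.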
