

# Lab | Web Scraping Multiple Pages

#### Business goal:

- Check the `case_study_gnod.md` file.
- Make sure you've understood the big picture of your project:

  - the goal of the company (`Gnod`),
  - their current product (`Gnoosic`),
  - their strategy, and
  - how your project fits into this context.

  Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

#### Instructions 

#### Prioritize the MVP

In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

#### Expand the project

If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

#### Practice web scraping

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:

- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: `url = 'https://en.wikipedia.org/wiki/Python'`
- Find the number of titles that have changed in the United States Code since its last release point: `url = 'http://uscode.house.gov/download/download.shtml'`
- Create a Python list with the top ten FBI's Most Wanted names: `url = 'https://www.fbi.gov/wanted/topten'`
- Display information on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe: `url = 'https://www.emsc-csem.org/Earthquake/'`
- List all language names and the number of related articles in the order they appear on [wikipedia.org](https://www.wikipedia.org/): `url = 'https://www.wikipedia.org/'`
- Create a list of the different kinds of datasets available on [data.gov.uk](https://data.gov.uk/): `url = 'https://data.gov.uk/'`
- Display the top 10 languages by number of native speakers stored in a pandas dataframe: `url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'`




In [197]:
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup


## Python on Wikipedia

In [198]:
# Python on Wikipedia

url1 = 'https://en.wikipedia.org/wiki/Python'
response = requests.get(url1)
response

<Response [200]>

In [199]:
soup = BeautifulSoup(response.content, 'html.parser')

In [200]:
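# Extract the href of every <a> tag that has one
# (href=True skips anchors without an href, which would otherwise yield None)
hrefs = [link['href'] for link in soup.find_all('a', href=True)]

# Keep only the links that mention 'Python'
python_links = [item for item in hrefs if 'Python' in item]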

display(python_links)

['/w/index.php?title=Special:CreateAccount&returnto=Python',
 '/w/index.php?title=Special:UserLogin&returnto=Python',
 '/w/index.php?title=Special:CreateAccount&returnto=Python',
 '/w/index.php?title=Special:UserLogin&returnto=Python',
 'https://af.wikipedia.org/wiki/Python',
 'https://als.wikipedia.org/wiki/Python',
 'https://az.wikipedia.org/wiki/Python_(d%C9%99qiql%C9%99%C5%9Fdirm%C9%99)',
 'https://be.wikipedia.org/wiki/Python',
 'https://cs.wikipedia.org/wiki/Python_(rozcestn%C3%ADk)',
 'https://da.wikipedia.org/wiki/Python',
 'https://de.wikipedia.org/wiki/Python',
 'https://eu.wikipedia.org/wiki/Python_(argipena)',
 'https://fr.wikipedia.org/wiki/Python',
 'https://hr.wikipedia.org/wiki/Python_(razdvojba)',
 'https://id.wikipedia.org/wiki/Python',
 'https://ia.wikipedia.org/wiki/Python_(disambiguation)',
 'https://is.wikipedia.org/wiki/Python_(a%C3%B0greining)',
 'https://it.wikipedia.org/wiki/Python_(disambigua)',
 'https://la.wikipedia.org/wiki/Python_(discretiva)',
 'https://

In [201]:
# No trailing slash here: the hrefs already start with '/wiki/'
string_to_prepend = 'https://en.wikipedia.org'

python_final = [string_to_prepend + item for item in python_links if item.startswith('/wiki/')]

python_final


['https://en.wikipedia.org/wiki/Python',
 'https://en.wikipedia.org/wiki/Talk:Python',
 'https://en.wikipedia.org/wiki/Python',
 'https://en.wikipedia.org/wiki/Python',
 'https://en.wikipedia.org/wiki/Special:WhatLinksHere/Python',
 'https://en.wikipedia.org/wiki/Special:RecentChangesLinked/Python',
 'https://en.wikipedia.org/wiki/Pythonidae',
 'https://en.wikipedia.org/wiki/Python_(genus)',
 'https://en.wikipedia.org/wiki/Python_(mythology)',
 'https://en.wikipedia.org/wiki/Python_(programming_language)',
 'https://en.wikipedia.org/wiki/Python_of_Aenus',
 'https://en.wikipedia.org/wiki/Python_(painter)',
 'https://en.wikipedia.org/wiki/Python_of_Byzantium',
 'https://en.wikipedia.org/wiki/Python_of_Catana',
 'https://en.wikipedia.org/wiki/Python_Anghelo',
 'https://en.wikipedia.org/wiki/Python_(Efteling)',
 'https://en.wikipedia.org/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 'https://en.wikipedia.org/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)',
 'https://en.wikipedia
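An alternative to prepending a fixed string is to resolve each href against the page URL with the standard library's `urljoin`, which handles relative and absolute links alike. The sketch below is only an illustration: the helper name `absolutize` and the short `sample` list are assumptions, not part of the original notebook.

```python
from urllib.parse import urljoin

# The page that was scraped; relative hrefs are resolved against it.
BASE_URL = "https://en.wikipedia.org/wiki/Python"


def absolutize(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        full = urljoin(base_url, href)
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result


# Hypothetical sample: two relative links (one repeated) and one absolute link.
sample = [
    "/wiki/Python_(genus)",
    "/wiki/Pythonidae",
    "/wiki/Python_(genus)",
    "https://de.wikipedia.org/wiki/Python",
]
links = absolutize(sample)
print(links)
```

Because `urljoin` leaves absolute URLs untouched, the same helper could be applied to the whole `python_links` list without the `startswith('/wiki/')` filter.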

## Titles changed

In [202]:

url2 = 'https://uscode.house.gov/download/download.shtml'
response = requests.get(url2)
response

<Response [200]>

In [203]:
soup = BeautifulSoup(response.content, 'html.parser')

In [204]:
target_titles = soup.find_all('div', class_ = 'usctitlechanged')
target_titles

[<div class="usctitlechanged" id="us/usc/t25">
 
           Title 25 - Indians
 
         </div>,
 <div class="usctitlechanged" id="us/usc/t26">
 
           Title 26 - Internal Revenue Code
 
         </div>,
 <div class="usctitlechanged" id="us/usc/t49">
 
           Title 49 - Transportation <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>,
 <div class="usctitlechanged" id="us/usc/t51">
 
           Title 51 - National and Commercial Space Programs <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>]

In [205]:
target_titles = [title.text.strip() for title in target_titles]
target_titles

['Title 25 - Indians',
 'Title 26 - Internal Revenue Code',
 'Title 49 - Transportation ٭',
 'Title 51 - National and Commercial Space Programs ٭']

In [206]:
print(f'{len(target_titles)} titles have changed since the last release point.')

4 titles have changed since the last release point.
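The extracted names above still carry the footnote marker `٭` for titles 49 and 51. One possible way to drop it, sketched here on a trimmed copy of the markup shown in the output, is to remove the `span.footnote` elements before reading the text. The function name `changed_titles` is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of two of the <div> elements shown in the notebook output.
SAMPLE_HTML = """
<div class="usctitlechanged" id="us/usc/t26">
  Title 26 - Internal Revenue Code
</div>
<div class="usctitlechanged" id="us/usc/t49">
  Title 49 - Transportation <span class="footnote"><a class="fn" href="#fn">٭</a></span>
</div>
"""


def changed_titles(html):
    """Return the changed title names without their footnote markers."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for div in soup.find_all("div", class_="usctitlechanged"):
        # Remove footnote spans so their marker does not end up in the text.
        for note in div.find_all("span", class_="footnote"):
            note.decompose()
        titles.append(div.get_text(strip=True))
    return titles


titles = changed_titles(SAMPLE_HTML)
print(len(titles), titles)
```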


## 20 latest earthquakes info

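### Could not access the content

The HTML returned by `requests` contains no `<td class="tbdat">` element, so the parsing below finds nothing. Note that the URL requested here differs from the one given in the task.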

In [207]:
url3 = 'https://www.emsc-csem.org/Earthquake_information/'
response = requests.get(url3)
display(response)
soup = BeautifulSoup(response.content, 'html.parser')

<Response [200]>

In [208]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="srFzNKBTd0FbRhtnzP--Tjxl01NfbscjYwkp4yOWuQY" name="google-site-verification">
   <meta content="BCAA3C04C41AE6E6AFAF117B9469C66F" name="msvalidate.01"/>
   <meta content="43b36314ccb77957" name="y_key"/>
   <meta content="all" name="robots"/>
   <meta content="Get informed on the latest earthquakes occurred around the globe. earthquakes today - recent and latest earthquakes, earthquake map and earthquake information. Earthquake information for europe. EMSC (European Mediterranean Seismological Centre) provides real time earthquake information for seismic events with magnitude larger than 5 in the European Mediterranean area and larger than 7 in the rest of the world." lang="en" name="description"/>
   <meta content="705855916142039" property="fb:app_id"/>
   <meta content="en_FR" property="og:locale"/>
   <meta content="website" property="og:type"/>
   <meta content="EMSC - European-Mediterranean Seismo

In [209]:


# Parse the HTML
soup = BeautifulSoup(response.content, 'html.parser')

# Find the <td> tag with class 'tbdat'
td_tag = soup.find('td', class_='tbdat')

# Check if the <td> tag was found
if td_tag:
    # Find the <a> tag within the <td> tag
    a_tag = td_tag.find('a')
    
    # Check if the <a> tag was found
    if a_tag:
        # Extract the href attribute
        href = a_tag.get('href')
        
        # Extract the full text (date/time and "ago" text)
        full_text = a_tag.get_text(separator=" ", strip=True)
        
        # Alternatively, separate the date/time and the "ago" part
        # The first part of the content is the date/time, and the last part is the "ago" text
        date_time = a_tag.contents[0].strip()
        ago_text = a_tag.find('div', class_='tago').text.strip()
        
        print(f"URL: {href}")
        print(f"Full Text: {full_text}")
        print(f"Date/Time: {date_time}")
        print(f"Ago Text: {ago_text}")
    else:
        print("No <a> tag found within the <td> tag.")
else:
    print("No <td> tag with class 'tbdat' found.")



No <td> tag with class 'tbdat' found.
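Since the table could not be retrieved, the remaining step (turning table rows into a pandas dataframe) can only be sketched on stand-in data. Everything in `SAMPLE_HTML` below is hypothetical: the table id, the column order and the two rows are made up for illustration and do not reflect the real EMSC markup or real events.

```python
import pandas as pd
from bs4 import BeautifulSoup

COLUMNS = ["date", "time", "latitude", "longitude", "region"]

# Hypothetical markup and made-up rows; the real EMSC page may look different.
SAMPLE_HTML = """
<table id="quakes">
  <tr><th>Date</th><th>Time</th><th>Latitude</th><th>Longitude</th><th>Region</th></tr>
  <tr><td>2024-01-01</td><td>12:00:05</td><td>38.10</td><td>22.50</td><td>GREECE</td></tr>
  <tr><td>2024-01-01</td><td>11:42:17</td><td>-6.30</td><td>130.20</td><td>BANDA SEA</td></tr>
</table>
"""


def quakes_to_dataframe(html, limit=20):
    """Parse up to `limit` table rows into a dataframe; empty if no table is found."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="quakes")
    if table is None:
        return pd.DataFrame(columns=COLUMNS)

    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # The header row has <th> cells only, so it yields an empty list and is skipped.
        if len(cells) == len(COLUMNS):
            rows.append(cells)

    df = pd.DataFrame(rows[:limit], columns=COLUMNS)
    df[["latitude", "longitude"]] = df[["latitude", "longitude"]].astype(float)
    return df


quakes = quakes_to_dataframe(SAMPLE_HTML)
print(quakes)
```

Returning an empty dataframe when the table is missing keeps the caller's code the same whether or not the page delivered the data.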


## Wikipedia languages
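### Could not access the content

The request below targets `https://en.wikipedia.org/` (the English main page) rather than the portal `https://www.wikipedia.org/` named in the task, which likely explains why no `central-featured-lang` elements are found.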

In [247]:
# Wikipedia

url4 = 'https://en.wikipedia.org/'
response = requests.get(url4)
response

<Response [200]>

In [214]:
soup = BeautifulSoup(response.content, 'html.parser')

In [212]:
wiki_lang = soup.find_all('div', class_= 'central-featured-lang lang8')

In [215]:
soup

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-not-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Wikipedia, the free encyclopedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-l
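A possible approach for the portal page is sketched below. It assumes that each featured language on `https://www.wikipedia.org/` sits in a `div` with the class `central-featured-lang`, with the name in a `<strong>` tag and the article count in a `<small>` tag. That structure is an assumption, and the counts in `SAMPLE_HTML` are placeholders, not real figures. Matching on `central-featured-lang` alone (without `lang8`) selects every featured language instead of a single one.

```python
from bs4 import BeautifulSoup

# Assumed structure of the portal page; the article counts are placeholders.
SAMPLE_HTML = """
<div class="central-featured-lang lang1" lang="en">
  <a href="//en.wikipedia.org/" class="link-box">
    <strong>English</strong>
    <small><bdi dir="ltr">1&nbsp;000+</bdi> <span>articles</span></small>
  </a>
</div>
<div class="central-featured-lang lang2" lang="de">
  <a href="//de.wikipedia.org/" class="link-box">
    <strong>Deutsch</strong>
    <small><bdi dir="ltr">2&nbsp;000+</bdi> <span>Artikel</span></small>
  </a>
</div>
"""


def featured_languages(html):
    """Return (language name, article count text) pairs in page order."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for block in soup.find_all("div", class_="central-featured-lang"):
        name = block.find("strong")
        count = block.find("small")
        if name is None or count is None:
            continue
        # Collapse all whitespace, including non-breaking spaces, to single spaces.
        count_text = " ".join(count.get_text(" ", strip=True).split())
        results.append((name.get_text(strip=True), count_text))
    return results


languages = featured_languages(SAMPLE_HTML)
print(languages)
```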

## UK government dataset

In [218]:
# UK 

url5 = 'https://data.gov.uk/'
response = requests.get(url5)
response

<Response [200]>

In [219]:
soup = BeautifulSoup(response.content, 'html.parser')

In [220]:
datasets = soup.find_all('h3', class_ = 'govuk-heading-s dgu-topics__heading')
datasets

[<h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Defence">Defence</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Education">Education</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Environment">Environment</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Government">Government</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Government+spending">Government spending</a></

In [224]:
# Extract the topic name from each heading
topics = [heading.text.strip() for heading in datasets]

for topic in topics:
    print(f'UK Government dataset of: {topic}')

UK Government dataset of: Business and economy
UK Government dataset of: Crime and justice
UK Government dataset of: Defence
UK Government dataset of: Education
UK Government dataset of: Environment
UK Government dataset of: Government
UK Government dataset of: Government spending
UK Government dataset of: Health
UK Government dataset of: Mapping
UK Government dataset of: Society
UK Government dataset of: Towns and cities
UK Government dataset of: Transport
UK Government dataset of: Digital service performance
UK Government dataset of: Government reference data


## Top 10 languages


In [225]:

url6 = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
response = requests.get(url6)
response

<Response [200]>

In [226]:
soup = BeautifulSoup(response.content, 'html.parser')

In [230]:
langs = soup.find_all('a', class_ = "mw-redirect", limit = 13)
langs

[<a class="mw-redirect" href="/wiki/Native_speaker" title="Native speaker">native speakers</a>,
 <a class="mw-redirect" href="/wiki/Mutually_intelligible" title="Mutually intelligible">mutually intelligible</a>,
 <a class="mw-redirect" href="/wiki/Arabic_language" title="Arabic language">Arabic</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:cmn" title="ISO 639:cmn">Mandarin Chinese</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:spa" title="ISO 639:spa">Spanish</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:eng" title="ISO 639:eng">English</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:hin" title="ISO 639:hin">Hindi</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:por" title="ISO 639:por">Portuguese</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:ben" title="ISO 639:ben">Bengali</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:rus" title="ISO 639:rus">Russian</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:jpn" title="ISO 639:jpn">Japanese</a>,
 <a class="mw-redirect" href="/w

In [237]:
langs_most_spoken = [lang.text.strip() for lang in langs]

# Skip the first three links, which precede the ranking and are not part of it
langs_most_spoken = langs_most_spoken[3:]


for index, language in enumerate(langs_most_spoken, start=1):
    print(f'{index} - {language}')


1 - Mandarin Chinese
2 - Spanish
3 - English
4 - Hindi
5 - Portuguese
6 - Bengali
7 - Russian
8 - Japanese
9 - Yue Chinese
10 - Vietnamese
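The task asks for the result as a pandas dataframe, while the cell above only prints it. A minimal sketch of that last step follows; it hard-codes the ten names from the output above so that it runs on its own, and the column names `rank` and `language` are arbitrary choices.

```python
import pandas as pd

# The ten names printed above, copied here so the sketch is self-contained.
langs_most_spoken = [
    "Mandarin Chinese", "Spanish", "English", "Hindi", "Portuguese",
    "Bengali", "Russian", "Japanese", "Yue Chinese", "Vietnamese",
]

# One row per language, ranked by its position in the list.
top_languages = pd.DataFrame({
    "rank": range(1, len(langs_most_spoken) + 1),
    "language": langs_most_spoken,
})
print(top_languages)
```

In the notebook itself, the hard-coded list would simply be replaced by the `langs_most_spoken` variable built from the scraped links.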


## FBI Most-Wanted


In [239]:
url7 = 'https://www.fbi.gov/wanted/topten'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(url7, headers=headers)
response

<Response [200]>

In [240]:
soup = BeautifulSoup(response.content, 'html.parser')

In [241]:
wanted = soup.find_all('li', class_ = 'portal-type-person castle-grid-block-item')
wanted

[<li class="portal-type-person castle-grid-block-item">
 <a href="https://www.fbi.gov/wanted/topten/wilver-villegas-palomino">
 <div class="focuspoint" data-base-url="https://www.fbi.gov/wanted/topten/wilver-villegas-palomino/@@images/image/" data-focus-x="-0.04" data-focus-y="0.050375" data-h="692" data-scale="preview" data-scales-info='{"icon": {"w": 32, "h": 40}, "listing": {"w": 16, "h": 20}, "thumb": {"w": 128, "h": 160}, "preview": {"w": 400, "h": 500}, "high": {"w": 1400, "h": 1751}, "tile": {"w": 64, "h": 80}, "large": {"w": 768, "h": 961}, "mini": {"w": 200, "h": 250}}' data-w="553"><img alt="WILVER VILLEGAS-PALOMINO" class="" src="https://www.fbi.gov/wanted/topten/wilver-villegas-palomino/@@images/image/preview"/></div>
 </a>
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/wilver-villegas-palomino">WILVER VILLEGAS-PALOMINO</a>
 </h3>
 </li>,
 <li class="portal-type-person castle-grid-block-item">
 <a href="https://www.fbi.gov/wanted/topten/vitelhomme-innocent"

In [243]:
most_wanted = [w.text.strip() for w in wanted]
most_wanted

['WILVER VILLEGAS-PALOMINO',
 "VITEL'HOMME INNOCENT",
 'ARNOLDO JIMENEZ',
 'ALEXIS FLORES',
 'OMAR ALEXANDER CARDENAS',
 'YULAN ADONAY ARCHAGA CARIAS',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'DONALD EUGENE FIELDS II',
 'RUJA IGNATOVA',
 'ALEJANDRO ROSALES CASTILLO']

In [244]:
for index, bandit in enumerate(most_wanted, start=1):
    print(f'{index}. {bandit}')

1. WILVER VILLEGAS-PALOMINO
2. VITEL'HOMME INNOCENT
3. ARNOLDO JIMENEZ
4. ALEXIS FLORES
5. OMAR ALEXANDER CARDENAS
6. YULAN ADONAY ARCHAGA CARIAS
7. BHADRESHKUMAR CHETANBHAI PATEL
8. DONALD EUGENE FIELDS II
9. RUJA IGNATOVA
10. ALEJANDRO ROSALES CASTILLO
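Taking the text of the whole `<li>` works here because the name is its only text. A slightly more targeted variant, sketched below on a trimmed copy of the first entry from the output above, reads the name and the profile link from the `<h3 class="title">` element. The function name `wanted_entries` is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first <li> shown in the notebook output (image block omitted).
SAMPLE_HTML = """
<li class="portal-type-person castle-grid-block-item">
<h3 class="title">
<a href="https://www.fbi.gov/wanted/topten/wilver-villegas-palomino">WILVER VILLEGAS-PALOMINO</a>
</h3>
</li>
"""


def wanted_entries(html):
    """Return (name, profile URL) pairs taken from each entry's title link."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for item in soup.find_all("li", class_="portal-type-person"):
        title = item.find("h3", class_="title")
        link = title.find("a") if title else None
        if link is None:
            continue
        entries.append((link.get_text(strip=True), link.get("href")))
    return entries


entries = wanted_entries(SAMPLE_HTML)
for name, url in entries:
    print(f"{name}: {url}")
```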
