In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

from time import sleep
from random import randint

1. Retrieve the Wikipedia page for "Python" and create a list of the links on that page: `url = 'https://en.wikipedia.org/wiki/Python'`

In [2]:
# create request 
url = "https://en.wikipedia.org/wiki/Python"
response = requests.get(url)
response.status_code

200

In [3]:
# create soup
soup = BeautifulSoup(response.content, "html.parser")

In [4]:
# collect every non-empty href and show the first five
[link.get('href') for link in soup.find_all('a') if link.get('href')][:5]

['#bodyContent',
 '/wiki/Main_Page',
 '/wiki/Wikipedia:Contents',
 '/wiki/Portal:Current_events',
 '/wiki/Special:Random']
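
Most of the `href` values above are relative paths (`/wiki/...`) or in-page anchors (`#bodyContent`). A minimal sketch, assuming absolute URLs are wanted, resolves each one against the page URL with `urllib.parse.urljoin`. The helper name `absolutize` is hypothetical, and the sample list is copied from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Python"

def absolutize(hrefs, base=BASE_URL):
    """Resolve relative hrefs against the page URL, skipping in-page anchors."""
    return [urljoin(base, href) for href in hrefs if not href.startswith("#")]

# Sample taken from the output above; in the notebook, pass the full list of hrefs.
sample = ['#bodyContent', '/wiki/Main_Page', '/wiki/Wikipedia:Contents',
          '/wiki/Portal:Current_events', '/wiki/Special:Random']
absolute_links = absolutize(sample)
print(absolute_links)
```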

2. Find the number of titles that have changed in the United States Code since its last release point: `url = 'http://uscode.house.gov/download/download.shtml'`

*Each update of the United States Code is a release point. This page provides downloadable files for the current release point. All files are current through Public Law 118-51 (04/24/2024). Titles in **bold** have been changed since the last release point*

In [5]:
# setup
url = "http://uscode.house.gov/download/download.shtml"
response = requests.get(url)
response.raise_for_status()  # a bare status_code mid-cell shows nothing; fail on an HTTP error instead
soup = BeautifulSoup(response.content, "html.parser")

In [6]:
# titles changed since the last release point carry the class usctitlechanged
[title.text.strip() for title in soup.select('div.usctitlechanged')]

['Title 8 - Aliens and Nationality',
 'Title 10 - Armed Forces ٭',
 'Title 15 - Commerce and Trade',
 'Title 16 - Conservation',
 'Title 21 - Food and Drugs',
 'Title 22 - Foreign Relations and Intercourse',
 'Title 50 - War and National Defense']
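
The task asks for the number of changed titles, so the list still has to be counted. A small sketch, assuming the list above is the complete result: `clean_title` is a hypothetical helper that drops the trailing `٭` marker seen on Title 10, and the literal list is copied from the output above.

```python
# Titles copied from the output above; in the notebook, pass the list
# produced by the soup.select(...) comprehension instead.
changed_titles = [
    'Title 8 - Aliens and Nationality',
    'Title 10 - Armed Forces ٭',
    'Title 15 - Commerce and Trade',
    'Title 16 - Conservation',
    'Title 21 - Food and Drugs',
    'Title 22 - Foreign Relations and Intercourse',
    'Title 50 - War and National Defense',
]

def clean_title(title):
    """Drop trailing spaces and the trailing marker character some titles carry."""
    return title.rstrip(' ٭*')

cleaned = [clean_title(t) for t in changed_titles]
n_changed = len(cleaned)
print(n_changed)
```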

3. Create a Python list with the names on the FBI's Ten Most Wanted list: `url = 'https://www.fbi.gov/wanted/topten'`

In [7]:
# add headers to avoid being blocked
# https://medium.com/@dungwoong/pretending-im-a-human-while-web-scraping-d5464e36f24
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
    "Referer": "https://www.google.com",
}
# setup
url = "https://www.fbi.gov/wanted/topten"
response = requests.get(url, headers=headers)
display(response.status_code)
soup = BeautifulSoup(response.content, "html.parser")

200
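
Custom headers are one way to avoid being blocked; pausing between requests is another, and is presumably why `sleep` and `randint` are imported at the top. A hedged sketch of that idea follows. The function `fetch_politely` and its parameters are hypothetical, and `fetch` is injected so the helper can be tried without touching the network.

```python
from random import randint
from time import sleep

def fetch_politely(urls, fetch, min_wait=1, max_wait=3, pause=sleep):
    """Call fetch(url) for each URL, pausing a random number of seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # wait between consecutive requests, but not before the first one
            pause(randint(min_wait, max_wait))
        results.append(fetch(url))
    return results
```

In the notebook this could be called as `fetch_politely(urls, lambda u: requests.get(u, headers=headers))`, where `urls` is whatever list of pages needs to be scraped.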

In [8]:
fbi_soup = soup.select('ul > li > h3 > a')
fbi_soup

[<a href="https://www.fbi.gov/wanted/topten/alejandro-castillo">ALEJANDRO ROSALES CASTILLO</a>,
 <a href="https://www.fbi.gov/wanted/topten/ruja-ignatova">RUJA IGNATOVA</a>,
 <a href="https://www.fbi.gov/wanted/topten/donald-eugene-fields-ii">DONALD EUGENE FIELDS II</a>,
 <a href="https://www.fbi.gov/wanted/topten/wilver-villegas-palomino">WILVER VILLEGAS-PALOMINO</a>,
 <a href="https://www.fbi.gov/wanted/topten/vitelhomme-innocent">VITEL'HOMME INNOCENT</a>,
 <a href="https://www.fbi.gov/wanted/topten/arnoldo-jimenez">ARNOLDO JIMENEZ</a>,
 <a href="https://www.fbi.gov/wanted/topten/alexis-flores">ALEXIS FLORES</a>,
 <a href="https://www.fbi.gov/wanted/topten/omar-alexander-cardenas">OMAR ALEXANDER CARDENAS</a>,
 <a href="https://www.fbi.gov/wanted/topten/yulan-adonay-archaga-carias">YULAN ADONAY ARCHAGA CARIAS</a>,
 <a href="https://www.fbi.gov/wanted/topten/bhadreshkumar-chetanbhai-patel">BHADRESHKUMAR CHETANBHAI PATEL</a>]

In [9]:
[name.text for name in fbi_soup]

['ALEJANDRO ROSALES CASTILLO',
 'RUJA IGNATOVA',
 'DONALD EUGENE FIELDS II',
 'WILVER VILLEGAS-PALOMINO',
 "VITEL'HOMME INNOCENT",
 'ARNOLDO JIMENEZ',
 'ALEXIS FLORES',
 'OMAR ALEXANDER CARDENAS',
 'YULAN ADONAY ARCHAGA CARIAS',
 'BHADRESHKUMAR CHETANBHAI PATEL']

4. Display information on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas DataFrame: `url = 'https://www.emsc-csem.org/Earthquake/'`

In [10]:
# create request 
url = "https://www.emsc-csem.org/Earthquake_information/"
response = requests.get(url)
response.raise_for_status()  # a bare status_code mid-cell shows nothing; fail on an HTTP error instead
soup = BeautifulSoup(response.content, "html.parser")
soup.title  # inspect the page title rather than printing the whole document

5. List all language names and the number of related articles in the order they appear on [wikipedia.org](https://www.wikipedia.org/): `url = 'https://www.wikipedia.org/'`
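
One possible approach to this task, sketched under assumptions: the featured languages on the portal are assumed to sit in `div.central-featured-lang` blocks, with the name in `<strong>` and the article count in `<small>`. That structure should be checked against the live page. The sample markup below is illustrative only and its counts are placeholders; `featured_languages` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Illustrative markup only: it mimics the assumed structure of the portal's
# featured-language links, and the article counts are placeholders.
SAMPLE_HTML = """
<div class="central-featured">
  <div class="central-featured-lang lang1"><a href="//en.wikipedia.org/">
    <strong>English</strong><small>1 000 000+ articles</small></a></div>
  <div class="central-featured-lang lang2"><a href="//ja.wikipedia.org/">
    <strong>日本語</strong><small>500 000+ 記事</small></a></div>
</div>
"""

def featured_languages(html):
    """Return (language name, article count text) pairs in page order."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.select("div.central-featured-lang"):
        name = block.find("strong")
        count = block.find("small")
        if name is None or count is None:
            continue  # skip blocks that do not match the assumed layout
        # collapse any run of whitespace (including non-breaking spaces) to one space
        rows.append((name.get_text(strip=True), " ".join(count.get_text().split())))
    return rows

languages = featured_languages(SAMPLE_HTML)
print(languages)
```

Against the live page, `featured_languages(response.content)` would take the place of the sample string.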

6. Create a list of the different kinds of datasets available on [data.gov.uk](https://data.gov.uk/): `url = 'https://data.gov.uk/'`

7. Display the top 10 languages by number of native speakers in a pandas DataFrame: `url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'`
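
A hedged sketch of one way to approach this task: read the first matching table into a DataFrame, convert the speaker column to numbers, and keep the ten largest rows. The selector `table.wikitable`, the column names, and the helper `table_to_frame` are assumptions, and the sample rows are placeholders rather than real figures.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Placeholder table: the column names and values are illustrative, not real data.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Language</th><th>Native speakers (millions)</th></tr>
  <tr><td>Language A</td><td>120.5</td></tr>
  <tr><td>Language B</td><td>300</td></tr>
  <tr><td>Language C</td><td>45.2</td></tr>
</table>
"""

def table_to_frame(html, selector="table.wikitable"):
    """Turn the first table matching the selector into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(selector)
    rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in table.find_all("tr")]
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)

df = table_to_frame(SAMPLE_HTML)
speakers = "Native speakers (millions)"
df[speakers] = pd.to_numeric(df[speakers], errors="coerce")  # unparseable cells become NaN
top10 = df.nlargest(10, speakers).reset_index(drop=True)
print(top10)
```

The real page may use merged header cells or footnote markers in the numbers, so the cleaning step would likely need adjusting once the actual table has been inspected.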