###  <font color='#eb3483'> Webscraping </font>

Together, `BeautifulSoup` and `re` give us powerful tools for extracting information from webscraped data. Let's take a minute to see how this might work.  

Suppose that we want to scrape the price information from this test page:

https://webscraper.io/test-sites/e-commerce/allinone

Take a minute to visit the webpage with your web browser. If you are using Chrome, right-click anywhere on the page and select "Inspect". This will show you the underlying HTML code for the page.

Let's webscrape! ... which really means reading the HTML code into Python. We use the `requests` package for this.

In [1]:
!pip install requests
!pip install pandas
!pip install beautifulsoup4



####  <font color='#eb3483'> Approach using Regular Expressions </font>

In [2]:
import requests
from bs4 import BeautifulSoup
import re

url = 'https://webscraper.io/test-sites/e-commerce/allinone'
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early if the request failed (e.g. 404 or 500)
soup = BeautifulSoup(response.content, 'html.parser')

In [3]:
# Find all elements with the class 'price'
prices = soup.find_all(class_='price')

for price in prices:
    # Extract the text from each element
    price_text = price.get_text()
    # Use a regular expression to extract the price in the format $XX.XX
    match = re.search(r'\$\d+\.\d{2}', price_text)
    if match:
        print(match.group())

$379.95
$148.99
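
The same pattern can also be run directly on raw HTML text, which is a quick way to experiment with a regular expression before touching the live page. The block below is only a sketch: `sample_html` is a made-up snippet for illustration (it is not the site's real markup), and `price_pattern` and `price_values` are names chosen here. A capture group is added so that only the digits are returned and can be converted to `float`.

```python
import re

# Hypothetical snippet standing in for a downloaded page (not the site's real markup)
sample_html = """
<div class="caption">
  <h4 class="price">$379.95</h4>
  <h4 class="price">$148.99</h4>
  <p class="description">Ships in 2 days</p>
</div>
"""

# The parentheses capture the number without the leading dollar sign
price_pattern = re.compile(r'\$(\d+\.\d{2})')

# With one capture group, findall returns the captured strings only
price_values = [float(value) for value in price_pattern.findall(sample_html)]
print(price_values)
```

Running a regular expression over the whole page matches any dollar amount anywhere in the HTML, so selecting the `price` elements with `BeautifulSoup` first is usually the safer choice.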


####  <font color='#eb3483'> Extracting data from tables </font>

In [4]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [5]:
# Fetch the Web Page
url = "https://www.worldometers.info/world-population/population-by-country/"
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early if the request failed (e.g. 404 or 500)
html_content = response.content

# Parse the HTML Content
soup = BeautifulSoup(html_content, 'html.parser')

In [6]:
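# Find the Table
table = soup.find('table', id='example2')  # Find the table with the specific id
if table is None:
    # soup.find returns None when nothing matches, e.g. if the page layout changed
    raise ValueError("Could not find a table with id='example2'")

# Extract Table Headers
headers = [th.text.strip() for th in table.find_all('th')]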

print("Headers:", headers)

Headers: ['#', 'Country (or dependency)', 'Population (2023)', 'Yearly Change', 'Net Change', 'Density (P/Km²)', 'Land Area (Km²)', 'Migrants (net)', 'Fert. Rate', 'Med. Age', 'Urban Pop %', 'World Share']


In [7]:
# Extract Table Rows
rows = []
for tr in table.find_all('tr'):
    cells = tr.find_all('td')
    if len(cells) > 0:
        row = [cell.text.strip() for cell in cells]
        rows.append(row)

# Print the first 5 rows to check
for row in rows[:5]:
    print(row)

['1', 'India', '1,428,627,663', '0.81 %', '11,454,490', '481', '2,973,190', '-486,136', '2.0', '28', '36 %', '17.76 %']
['2', 'China', '1,425,671,352', '-0.02 %', '-215,985', '152', '9,388,211', '-310,220', '1.2', '39', '65 %', '17.72 %']
['3', 'United States', '339,996,563', '0.50 %', '1,706,706', '37', '9,147,420', '999,700', '1.7', '38', '83 %', '4.23 %']
['4', 'Indonesia', '277,534,122', '0.74 %', '2,032,783', '153', '1,811,570', '-49,997', '2.1', '30', '59 %', '3.45 %']
['5', 'Pakistan', '240,485,658', '1.98 %', '4,660,796', '312', '770,880', '-165,988', '3.3', '21', '35 %', '2.99 %']
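
Each row is a plain list whose positions line up with the headers, so a quick sanity check is to pair one row with the header names. The sketch below copies the headers and the first row from the outputs above; `sample_headers`, `sample_row` and `record` are names chosen for this illustration.

```python
# Headers and first row copied from the printed outputs above
sample_headers = ['#', 'Country (or dependency)', 'Population (2023)', 'Yearly Change',
                  'Net Change', 'Density (P/Km²)', 'Land Area (Km²)', 'Migrants (net)',
                  'Fert. Rate', 'Med. Age', 'Urban Pop %', 'World Share']
sample_row = ['1', 'India', '1,428,627,663', '0.81 %', '11,454,490', '481',
              '2,973,190', '-486,136', '2.0', '28', '36 %', '17.76 %']

# A mismatch in length would mean the headers and cells are out of step
assert len(sample_headers) == len(sample_row)

# Pair each header with the cell in the same position
record = dict(zip(sample_headers, sample_row))
print(record['Country (or dependency)'], record['Population (2023)'])
```

If the lengths differ, the headers or the rows were probably taken from the wrong part of the table, and building a DataFrame from them would fail.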


In [8]:
# Create DataFrame
df = pd.DataFrame(rows, columns=headers)
df.head()

|   | # | Country (or dependency) | Population (2023) | Yearly Change | Net Change | Density (P/Km²) | Land Area (Km²) | Migrants (net) | Fert. Rate | Med. Age | Urban Pop % | World Share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | India | 1,428,627,663 | 0.81 % | 11,454,490 | 481 | 2,973,190 | -486,136 | 2.0 | 28 | 36 % | 17.76 % |
| 1 | 2 | China | 1,425,671,352 | -0.02 % | -215,985 | 152 | 9,388,211 | -310,220 | 1.2 | 39 | 65 % | 17.72 % |
| 2 | 3 | United States | 339,996,563 | 0.50 % | 1,706,706 | 37 | 9,147,420 | 999,700 | 1.7 | 38 | 83 % | 4.23 % |
| 3 | 4 | Indonesia | 277,534,122 | 0.74 % | 2,032,783 | 153 | 1,811,570 | -49,997 | 2.1 | 30 | 59 % | 3.45 % |
| 4 | 5 | Pakistan | 240,485,658 | 1.98 % | 4,660,796 | 312 | 770,880 | -165,988 | 3.3 | 21 | 35 % | 2.99 % |
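
Every cell scraped this way is still a string (for example `'1,428,627,663'` or `'0.81 %'`), so numeric work needs a conversion step first. The helper below is a minimal sketch of one way to do that; `parse_number` is a hypothetical name introduced here, and returning `None` for cells that are not numeric is an assumption about how you might want to handle them.

```python
def parse_number(text):
    """Convert a scraped cell such as '1,428,627,663' or '0.81 %' to a number."""
    # Drop thousands separators, percent signs and surrounding whitespace
    cleaned = text.replace(',', '').replace('%', '').strip()
    try:
        # Keep whole numbers as int and anything with a decimal point as float
        return float(cleaned) if '.' in cleaned else int(cleaned)
    except ValueError:
        # Assumption: non-numeric cells are mapped to None rather than raising
        return None


print(parse_number('1,428,627,663'))
print(parse_number('-0.02 %'))
```

With the DataFrame built above, a column could then be converted with something like `df['Population (2023)'].map(parse_number)`.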
