## **#04: Data Collection Methods: Beautiful Soup**
- Instructor: [Jaeung Sim](https://www.business.uconn.edu/person/jaeung-sim/) (University of Connecticut)
- Course: OPIM 5512 Data Science Using Python
- Last updated: February 6, 2025

**Objectives**
1. Practice web scraping and HTML parsing using `requests` and `Beautiful Soup`.
2. Learn `Selenium` in a local Python environment (not Google Colab).

**References**
* [Tutorial: Web Scraping with Python Using Beautiful Soup](https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/)
* [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)

### **Basic Information**

**A. What is Web Scraping?**

>**Web scraping** is the process of gathering information from the Internet. Even copying and pasting the lyrics of your favorite song is a form of web scraping! However, the words “web scraping” usually refer to a process that involves automation.

**B. Reasons for Web Scraping**

>Some websites offer data sets that are downloadable in CSV format, or accessible via an Application Programming Interface (API). But many websites with useful data don't offer these convenient options.
>
>If we wanted to analyze this data, or download it for use in some other app, we wouldn't want to painstakingly copy-paste everything. Web scraping is a technique that lets us use programming to do the heavy lifting.
>
>Automated web scraping can be a solution to speed up the data collection process. You write your code once, and it will get the information you want many times and from many pages.

**C. Technical Challenges**

* Variety and durability of websites
>Every website is different. While you'll encounter general structures that repeat themselves, each website is unique and will need personal treatment if you want to extract the relevant information.
>
>Also, websites constantly change. Say you've built a shiny new web scraper that automatically cherry-picks what you want from your resource of interest. The first time you run your script, it works flawlessly. But when you run the same script only a short while later, you run into a discouraging and lengthy stack of tracebacks!

* Hidden websites
> Some pages contain information that's hidden behind a login. That means you'll need an account to be able to scrape anything from the page. The process to make an HTTP request from your Python script is different from how you access a page from your browser. Just because you can log in to the page through your browser doesn't mean you'll be able to scrape it with your Python script.
>
> Fortunately, some libraries come with their built-in capacity to handle authentication. With these techniques, you can log in to websites when making the HTTP request from your Python script and then scrape information that's hidden behind a login.

* Dynamic websites
> Unlike a static website, where the server sends you an HTML page that already contains all the page information in the response, with a dynamic website, the server might not send back any HTML at all. Instead, you could receive JavaScript code as a response, which will look completely different from what you saw when you inspected the page with your browser's developer tools. The only way to go from the JavaScript code you received to the content that you're interested in is to execute the code, just like your browser does.
>
>There are some solutions for this. For example, `requests-html` is a project created by the author of the `requests` library that allows you to render JavaScript using syntax that's similar to the syntax in `requests`. It also includes capabilities for parsing the data by using `Beautiful Soup` under the hood. Another popular choice for scraping dynamic content is `Selenium`, which automates a real web browser: the browser executes the JavaScript code, and your script then reads the rendered HTML.

**D. Legal Challenges**
>Unfortunately, there's no cut-and-dried answer on whether web scraping is legal. Some websites explicitly allow web scraping. Others explicitly forbid it. Many websites don't offer any clear guidance one way or the other.
>
>Before scraping any website, we should look for a terms and conditions page to see if there are explicit rules about scraping. If there are, we should follow them. If there are not, then it becomes more of a judgement call.
>
>If you're scraping a page respectfully for educational purposes, then you're usually okay. Still, it's a good idea to do some research on your own and make sure that you're not violating any terms of service before you start a large-scale project.
>
>Remember, though, that web scraping consumes server resources for the host website. If we're just scraping one page once, that isn't going to cause a problem. But if our code is scraping 1,000 pages once every ten minutes, that could quickly get expensive for the website owner.
>
>Thus, in addition to following any and all explicit rules about web scraping posted on the site, it's also a good idea to follow these best practices:
>* Never scrape more frequently than you need to.
>* Consider caching the content you scrape so that it's only downloaded once.
>* Build pauses into your code using functions like `time.sleep()` to keep from overwhelming servers with too many requests too quickly.
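
The last two practices can be sketched in a few lines. This is only a minimal illustration: the helper name `fetch_politely`, its `fetch` argument, and the in-memory dictionary cache are assumptions made for the example, not part of `requests` or of the referenced tutorials.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0, cache=None):
    """Download each URL at most once, pausing after every real download."""
    cache = {} if cache is None else cache
    for url in urls:
        if url in cache:
            continue  # already downloaded, so no new request is sent
        cache[url] = fetch(url)
        time.sleep(delay_seconds)  # pause so the server is not flooded
    return cache

# Stand-in for a real download function, so the example sends no traffic
calls = []
def fake_fetch(url):
    calls.append(url)
    return f"<html>{url}</html>"

pages = fetch_politely(["a", "b", "a"], fake_fetch, delay_seconds=0)
print(calls)  # the repeated "a" is served from the cache
```

In a real scraper, `fetch` could be a small function that calls `requests.get(url).text`, and `delay_seconds` would stay at one second or more.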

**E. Alternatives**
* Alternative web scrapers
> Several free scraping tools let you collect data without complex programming. For instance, browser extensions for Google Chrome (available from the Chrome Web Store) can scrape pages through a point-and-click interface.

* Application Programming Interfaces (APIs)
>Some website providers offer application programming interfaces (APIs) that allow you to access their data in a predefined manner. With APIs, you can avoid parsing HTML. Instead, you can access the data directly using formats like JSON and XML. HTML is primarily a way to present content to users visually.
>
>When you use an API, the process is generally more stable than gathering the data through web scraping. That’s because developers create APIs to be consumed by programs rather than by human eyes.
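
To see why structured formats are easier to consume than HTML, consider the sketch below. The JSON text is a made-up response body (no real API is being called, and the field names are assumptions); the point is only that the fields can be read by name, with no tag or class lookups.

```python
import json

# Hypothetical API response body; a real API defines its own field names
api_response = '{"jobs": [{"title": "Senior Python Developer", "location": "Stewartbury, AA"}]}'

data = json.loads(api_response)  # parse the JSON text into Python dicts and lists
for job in data["jobs"]:
    print(job["title"], "|", job["location"])
```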


### **Part 0. Basic Setup**

In [None]:
# Run this code to import the NumPy and Pandas modules
import numpy as np
import pandas as pd

### **Part 1. `requests` and `Beautiful Soup`**

In [None]:
# Install (if needed) and import the `requests` library
# !pip install requests # "Requirement already satisfied"
import requests

In [None]:
# Install (if needed) the `beautifulsoup4` package and import the `BeautifulSoup` class
# !pip install beautifulsoup4 # "Requirement already satisfied"
from bs4 import BeautifulSoup

**A. Scraping the Fake Python Job Site**

Objectives
* Step through a web scraping pipeline from start to finish.
* Inspect the HTML structure of your target site with your browser’s developer tools.
* Decipher the data encoded in URLs.
* Download the page’s HTML content using Python’s `requests` library.
* Parse the downloaded HTML with `Beautiful Soup` to extract relevant information.
* Convert the information into a DataFrame using `Pandas`.

Data source: https://realpython.github.io/fake-jobs/

**Step 1. Inspecting the Data Source**

1.1. Exploring the website

>Click through the site and interact with it just like any typical job searcher would. For example, you can scroll through the main page of the website.

1.2. Deciphering the information in URLs

>A programmer can encode a lot of information in a URL. Your web scraping journey will be much easier if you first become familiar with how URLs work and what they’re made of. For example, you might find yourself on a details page that has the following URL:
>
>>https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html
>
>You can deconstruct the above URL into two main parts:
>1. **The base URL** is the address of the site's main page, which lists every job posting. In the example above, the base URL is `https://realpython.github.io/fake-jobs/`.
>1. **The specific site location** that ends with `.html` is the path to the job description's unique resource.
>
>Any job posted on this website will use the same base URL. However, the unique resources’ location will be different depending on what specific job posting you’re viewing.
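
The same deconstruction can be done in code. The sketch below uses only the standard library (`urllib.parse` and `posixpath`) to split the example URL into its parts.

```python
from urllib.parse import urlsplit
import posixpath

url = "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
parts = urlsplit(url)

print(parts.scheme)   # protocol, here "https"
print(parts.netloc)   # host name of the site
print(parts.path)     # path to the resource on that host

# The last path segment identifies the individual job posting
resource = posixpath.basename(parts.path)
print(resource)
```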

1.3. Inspecting the site using developer tools

>**Developer tools** can help you understand the structure of a website. All modern browsers come with developer tools installed. In this section, you’ll see how to work with the developer tools in Chrome. The process will be very similar to other modern browsers.
>* **Chrome on Mac:** (*menu*) View → Developer → Developer Tools, (*keyboard shortcut*) Cmd + Option + I
>* **Chrome on Windows/Linux:** (*menu*) The top-right menu button (⋮) → More Tools → Developer Tools, (*keyboard shortcut*) Ctrl + Shift + I
>
>Developer tools allow you to interactively explore the site’s document object model (DOM) to better understand your source. To dig into your page’s DOM, select the Elements tab in developer tools. You’ll see a structure with clickable HTML elements. You can expand, collapse, and even edit elements right in your browser.

**Step 2. Scraping HTML Content from a Page with `requests`**

In [None]:
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL) # HTTP `get` request to get or retrieve data

print(page.text) # Print the HTML source that the server returned
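
A request can fail (for example, with a 404 or a timeout), so it can help to check the response before parsing it. The helper below is a sketch, not part of the original notebook: the name `download_html` and the 10-second timeout are assumptions, while `timeout=` and `raise_for_status()` are standard `requests` features.

```python
import requests

def download_html(url, timeout=10):
    """Return the page's HTML as text, or raise if the request failed."""
    response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text

# Example (sends a real request when uncommented):
# html = download_html("https://realpython.github.io/fake-jobs/")
```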

We are interested in the following classes:
* `class="title is-5"` contains the title of the job posting.
* `class="subtitle is-6 company"` contains the name of the company that offers the position.
* `class="location"` contains the location where you would be working.



**Step 3. Parsing HTML Code with `Beautiful Soup`**

In [None]:
# Create a Beautiful Soup object
soup = BeautifulSoup(page.content, "html.parser") # Pass the raw bytes ('page.content') rather than 'page.text' so Beautiful Soup can detect the character encoding itself
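
The difference between bytes and text can be tried out without any download. In this sketch, `raw_bytes` is a hand-written stand-in for `page.content`; Beautiful Soup decodes it on its own and records the encoding it chose in `original_encoding`.

```python
from bs4 import BeautifulSoup

# Stand-in for `page.content`: raw bytes that declare a UTF-8 encoding
raw_bytes = '<html><head><meta charset="utf-8"></head><body><p>café</p></body></html>'.encode("utf-8")

soup_demo = BeautifulSoup(raw_bytes, "html.parser")
print(soup_demo.p.text)             # the accented character is decoded correctly
print(soup_demo.original_encoding)  # encoding that Beautiful Soup detected
```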

3.1. Find elements by ID
> In an HTML web page, every element can have an `id` attribute assigned. As the name already suggests, that `id` attribute makes the element uniquely identifiable on the page.
>
> From the current HTML page, you can find a `<div>` with an `id` attribute that has the value `"ResultsContainer"` as shown below:
>```html
><div id="ResultsContainer" class="columns is-multiline">
>```

In [None]:
# Find the specific HTML element by its ID
results = soup.find(id="ResultsContainer")

In [None]:
# Show all the HTML contained within the <div>
print(results.prettify())
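
Note that `find()` returns `None` when nothing matches, and calling `.prettify()` on `None` raises an `AttributeError`. The sketch below shows both outcomes on a small hand-written snippet; the id `"NoSuchContainer"` is a hypothetical value chosen because it does not exist.

```python
from bs4 import BeautifulSoup

html = '<div id="ResultsContainer" class="columns is-multiline"><p>job cards go here</p></div>'
demo = BeautifulSoup(html, "html.parser")

found = demo.find(id="ResultsContainer")
missing = demo.find(id="NoSuchContainer")  # hypothetical id that is not on the page

print(found is not None)  # True: the element exists
print(missing is None)    # True: no match, so find() returned None
```

Checking for `None` before using the result makes a scraper fail with a clear message when a website changes its structure.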

3.2. Find elements by HTML class name

In [None]:
# Find all job-posting elements
job_elements = results.find_all("div", class_="card-content") # Wrapped in a <div> element with the class 'card-content'

In [None]:
# Take a look at all the job postings
for job_element in job_elements:
  print(job_element)
  print("--------------------------------------------------------------------------------")

In [None]:
# Pick out child elements with descriptive class names from each job posting
for job_element in job_elements:
  title_element = job_element.find("h2", class_="title is-5")
  company_element = job_element.find("h3", class_="subtitle is-6 company")
  location_element = job_element.find("p", class_="location")
  print(title_element)
  print(company_element)
  print(location_element)
  print()
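
One detail worth knowing: when `class_` is given a string with several class names, Beautiful Soup compares it with the exact value of the `class` attribute, so the order of the names matters. The sketch below illustrates this on a single hand-written element and shows a CSS selector as an order-independent alternative.

```python
from bs4 import BeautifulSoup

card = BeautifulSoup('<h2 class="title is-5">Senior Python Developer</h2>', "html.parser")

exact = card.find("h2", class_="title is-5")      # matches the full attribute string
reordered = card.find("h2", class_="is-5 title")  # same names, different order: no match
single = card.find("h2", class_="title")          # a single class name also matches
by_css = card.select_one("h2.is-5.title")         # CSS selectors ignore class order

print(exact is not None, reordered is None, single is not None, by_css is not None)
```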

3.3. Extract text from HTML elements

In [None]:
# .text to return the text content only
for job_element in job_elements:
  title_element = job_element.find("h2", class_="title is-5")
  company_element = job_element.find("h3", class_="subtitle is-6 company")
  location_element = job_element.find("p", class_="location")
  print(title_element.text)
  print(company_element.text)
  print(location_element.text)
  print()

In [None]:
# .text.strip() to remove extra whitespace
for job_element in job_elements:
  title_element = job_element.find("h2", class_="title is-5")
  company_element = job_element.find("h3", class_="subtitle is-6 company")
  location_element = job_element.find("p", class_="location")
  print(title_element.text.strip())
  print(company_element.text.strip())
  print(location_element.text.strip())
  print()
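
As an alternative to `.text.strip()`, Beautiful Soup offers `get_text(strip=True)`. The sketch below compares the three variants on a hand-written element that mimics the whitespace found on the job site.

```python
from bs4 import BeautifulSoup

snippet = '<h2 class="title is-5">\n   Senior Python Developer\n</h2>'
element = BeautifulSoup(snippet, "html.parser").h2

print(repr(element.text))                  # includes the surrounding newlines and spaces
print(repr(element.text.strip()))          # whitespace removed from both ends
print(repr(element.get_text(strip=True)))  # same result for this simple element
```

For elements with nested tags, `get_text(strip=True)` strips every piece of text separately before joining them, so the two approaches can differ there.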

3.4. Constructing a DataFrame from the HTML

In [None]:
# Define a DataFrame with empty columns
job_postings = pd.DataFrame(columns=["title", "company", "location"])

In [None]:
# Construct a DataFrame from 'job_elements'
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title is-5")
    company_element = job_element.find("h3", class_="subtitle is-6 company")
    location_element = job_element.find("p", class_="location")

    # Create a temporary DataFrame for the current job posting
    temp_df = pd.DataFrame([[title_element, company_element, location_element]], columns=["title", "company", "location"])

    # Use concat to add the new row to the DataFrame
    job_postings = pd.concat([job_postings, temp_df], ignore_index=True)

In [None]:
# Print the results
job_postings.head()

Revise the code so that each cell of the DataFrame contains clean text.

In [None]:
# Define a DataFrame with empty columns
job_postings = pd.DataFrame(columns=["title", "company", "location"])

In [None]:
# Revised loop: store the cleaned text of each element instead of the element itself
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title is-5")
    company_element = job_element.find("h3", class_="subtitle is-6 company")
    location_element = job_element.find("p", class_="location")
    temp_df = pd.DataFrame([[title_element.text.strip(), company_element.text.strip(), location_element.text.strip()]], columns=["title", "company", "location"]) # Updating this line
    job_postings = pd.concat([job_postings, temp_df], ignore_index=True)

In [None]:
# Print the results
job_postings.head()
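
Calling `pd.concat` inside a loop copies the whole DataFrame on every iteration. A common alternative, sketched below, is to collect the rows in a list of dictionaries and build the DataFrame once at the end. The HTML here is a small hand-written sample that imitates the job site's structure, so the example runs without a download; the names `sample_html`, `cards`, `rows`, and `jobs_df` are chosen for this illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written sample that imitates two job cards
sample_html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">Stewartbury, AA</p>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location">Christopherville, AA</p>
</div>
"""
cards = BeautifulSoup(sample_html, "html.parser").find_all("div", class_="card-content")

rows = []
for card in cards:
    rows.append({
        "title": card.find("h2", class_="title is-5").text.strip(),
        "company": card.find("h3", class_="subtitle is-6 company").text.strip(),
        "location": card.find("p", class_="location").text.strip(),
    })

# Build the DataFrame once, after all rows have been collected
jobs_df = pd.DataFrame(rows, columns=["title", "company", "location"])
print(jobs_df)
```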

**B. Scraping Actual Weather Data**

Objectives
* Download the web page containing the weather forecast using `requests`.
* Create a `BeautifulSoup` object to parse the page.
* Extract and print the extended forecast for the location encoded in the URL (near Stamford, Connecticut).
* Convert the information into a DataFrame using `Pandas`.

Data source: https://forecast.weather.gov/MapClick.php?lon=-73.53080749511717&lat=41.053272202459226

**Step 1. Inspecting the Data Source**

1.1. Exploring the website

1.2. Deciphering the information in URLs
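
For this data source, the information sits in the query string (the part after `?`) rather than in the path. As a starting point, the sketch below decodes it with the standard library; the variable names are chosen for this illustration.

```python
from urllib.parse import urlsplit, parse_qs

weather_url = "https://forecast.weather.gov/MapClick.php?lon=-73.53080749511717&lat=41.053272202459226"
query = parse_qs(urlsplit(weather_url).query)  # maps each parameter name to a list of values

longitude = float(query["lon"][0])
latitude = float(query["lat"][0])
print(latitude, longitude)
```

Changing `lat` and `lon` in the URL requests the forecast for a different location.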

1.3. Inspecting the site using developer tools

**Step 2. Scraping HTML Content from a Page with `requests`**

In [None]:
page = requests.get("https://forecast.weather.gov/MapClick.php?lon=-73.53080749511717&lat=41.053272202459226") # HTTP `get` request to get or retrieve data
print(page.text) # Print the HTML source that the server returned

**Step 3. Parsing HTML Code with `Beautiful Soup`**

In [None]:
# Create a Beautiful Soup object
soup = BeautifulSoup(page.content, "html.parser") # Pass the raw bytes ('page.content') rather than 'page.text' so Beautiful Soup can detect the character encoding itself

3.1. Find elements by ID

In [None]:
# Find the specific HTML element by its ID
results = soup.find(id="seven-day-forecast-container")

In [None]:
# Show all the HTML contained within the <div>
print(results.prettify())

3.2. Find elements by HTML class name

In [None]:
# Find all forecast elements
forecast_items = results.find_all("div", class_="tombstone-container")

In [None]:
# Take a look at all forecast items
for forecast_item in forecast_items:
  print(forecast_item)
  print("--------------------------------------------------------------------------------")

In [None]:
# Pick out child elements with descriptive class names from each forecast item
for forecast_item in forecast_items:
  period_element = forecast_item.find("p", class_="period-name")
  cloud_element = forecast_item.find("p", class_="short-desc")
  temperature_element = forecast_item.find("p", class_="temp")
  print(period_element)
  print(cloud_element)
  print(temperature_element)
  print()

Revise the code to 1) extract text from the HTML elements and 2) make each cell of the DataFrame contain clean text.

In [None]:
# Define a DataFrame with empty columns
forecasts = pd.DataFrame(columns=["period", "cloud", "temperature"])

In [None]:
# Construct a DataFrame from 'forecast_items'
for forecast_item in forecast_items:
  period_element = forecast_item.find("p", class_="period-name")
  cloud_element = forecast_item.find("p", class_="short-desc")
  temperature_element = forecast_item.find("p", class_="temp")
  temp_df = pd.DataFrame([[period_element.text.strip(), cloud_element.text.strip(), temperature_element.text.strip()]], columns=["period", "cloud", "temperature"])
  forecasts = pd.concat([forecasts, temp_df], ignore_index=True)

In [None]:
# Print the results
forecasts.head(10)
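
On a live page, a forecast item may lack one of the expected children, in which case `find()` returns `None` and `.text` raises an `AttributeError`. The sketch below guards against that. It assumes, for illustration only, that some items (such as a hazard notice) have no temperature; the helper `text_or_none` and the sample HTML are hand-written and not taken from the weather site.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first matching child, or None if absent."""
    element = parent.find(tag, class_=class_name)
    return element.text.strip() if element is not None else None

# Hand-written sample: the second item has no temperature element
sample = """
<div class="tombstone-container">
  <p class="period-name">Tonight</p>
  <p class="short-desc">Mostly Clear</p>
  <p class="temp temp-low">Low: 25 °F</p>
</div>
<div class="tombstone-container">
  <p class="period-name">Hazard</p>
  <p class="short-desc">Wind Advisory</p>
</div>
"""
items = BeautifulSoup(sample, "html.parser").find_all("div", class_="tombstone-container")

safe_rows = [
    {
        "period": text_or_none(item, "p", "period-name"),
        "cloud": text_or_none(item, "p", "short-desc"),
        "temperature": text_or_none(item, "p", "temp"),
    }
    for item in items
]
print(safe_rows)
```

Missing values then show up as `None` in the resulting rows instead of stopping the loop.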


### **Part 2. `Selenium`**

Please refer to the separate Jupyter notebook (`"05_Data_Collection_Methods_Selenium.ipynb"`).