# Web scraping


## Overview

- Web scraping is automated data collection from websites.
- Websites may prohibit scraping due to data protection or server overload concerns.
- **Always check a website's acceptable use policy before web scraping.**
- `Warning`: Scraping a website against its wishes can be illegal.
  - Wikipedia content is hosted under a Creative Commons license.
  - For that reason, we will ONLY use Wikipedia for scraping in the DSAN program.
  - **Please do not scrape any website other than Wikipedia in the DSAN program.**
  - Generally, it is better to use an API than to scrape the web.
- To begin, we obtain HTML code from a Wikipedia web page.
- **Source**: All content in these slides is adapted from this URL:
  - [https://realpython.com/python-web-scraping-practical-introduction/](https://realpython.com/python-web-scraping-practical-introduction/)


## Viewing source code

* Fun fact: you can view a webpage's **HTML source code** by right-clicking on the page and selecting "View Source"
  * On older websites, this means we can just request `https://www.page.com` and parse the returned HTML
* Less fun fact: modern web frameworks like **React** or **Next.js** generate pages dynamically using JavaScript, meaning that what you see on the page may not appear in the HTML source
  * Scraping is still possible for these websites, however, using browser automation tools like [https://www.selenium.dev/](https://www.selenium.dev/)

## Scraping Difficulty

The table below shows different levels of web-scraping difficulty:

| | How is data loaded? | Solution |
|-|-|-|
| **Easy** | Data in HTML source | "View Source" |
| **Medium** | Data loaded dynamically via API | "View Source", find API call, request programmatically |
| **Hard** | Data loaded dynamically via web framework | Use <a href="https://www.selenium.dev/" target="_blank">Selenium</a> |
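
As a rough illustration of the **Medium** row: when a page loads its data from an API, the response is often JSON, which can be parsed directly instead of scraping HTML. The sketch below assumes a hypothetical endpoint and a made-up payload; the URL, the field names `items`, `name`, and `price`, and the helper `parse_items` are all illustrative and not taken from any real site.

```python
import json

# Hypothetical API endpoint, as it might appear in the browser's network tab.
# In practice you would request it with urllib.request.urlopen(API_URL).
API_URL = "https://example.com/api/items?page=1"

# Made-up JSON payload standing in for the body of the API response.
SAMPLE_RESPONSE = """
{
  "page": 1,
  "items": [
    {"name": "Item A", "price": 9.99},
    {"name": "Item B", "price": 14.5}
  ]
}
"""

def parse_items(payload):
    """Return (name, price) pairs from a JSON payload shaped like SAMPLE_RESPONSE."""
    data = json.loads(payload)
    return [(item["name"], item["price"]) for item in data["items"]]

print(parse_items(SAMPLE_RESPONSE))
```

Because the data arrives already structured, no HTML parsing is needed for this case.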


## Example
* Consider the following website: [https://en.wikipedia.org/wiki/Georgetown_University](https://en.wikipedia.org/wiki/Georgetown_University)
![](images/2023-09-10-22-59-31.png)

## Getting started {.scrollable}

- Python's standard library includes the `urllib` package, which provides tools for working with URLs.
- The `urllib.request` module in `urllib` provides the `urlopen()` function for opening URLs.
- Import `urlopen()` from `urllib.request`, then pass it the URL of the page to open.

In [21]:
#| code-fold: false

from urllib.request import urlopen
page = urlopen("https://en.wikipedia.org/wiki/Georgetown_University")
print(page)

<http.client.HTTPResponse object at 0x105c4e560>


* To extract HTML from a webpage:
  - Use the `HTTPResponse` object's `.read()` method to obtain a sequence of bytes.
  - Apply `.decode()` with UTF-8 to convert the bytes into a string.

In [22]:
#| code-fold: false

html_bytes = page.read()
html_str = html_bytes.decode("utf-8")

print(type(html_str))
print(html_str[0:1000])

<class 'str'>
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Georgetown University - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled
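
The read-then-decode steps above can be wrapped in a small helper. This is a minimal sketch rather than a definitive implementation: the function names `decode_body` and `fetch_html` are made up for illustration, and the fallback to UTF-8 when the server declares no charset is an assumption.

```python
from typing import Optional
from urllib.request import urlopen

def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode response bytes, falling back to UTF-8 if no charset is given."""
    # errors="replace" substitutes undecodable bytes instead of raising.
    return raw.decode(charset or "utf-8", errors="replace")

def fetch_html(url: str) -> str:
    """Download a page and return its HTML as a string."""
    # The `with` block closes the connection even if reading fails.
    with urlopen(url) as response:
        # Charset declared in the Content-Type header, or None if absent.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)

# Usage (requires network access):
# html_str = fetch_html("https://en.wikipedia.org/wiki/Georgetown_University")
```

Separating the decoding step from the network call makes it possible to check the decoding logic without requesting a page.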

## Parsing HTML: String methods

- There are several ways to extract information from HTML; the simplest is to use string methods.
- For example, use `.find()` to locate tags like `<title>` and pull out the data between them.
- `.find()` returns the index of the first occurrence of a substring (here, the `<title>` tag), or `-1` if the substring is absent.
- String slicing then extracts the content between the start and end indices.
- This approach is quick for simple pages, but it breaks easily when the markup varies (for example, `<title >` with an extra space).

In [29]:
#| code-fold: false

# html_str is the decoded page source from the previous cell
title_index = html_str.find("<title>")
start_index = title_index + len("<title>")
end_index = html_str.find("</title>")
print(title_index, start_index, end_index)
print(html_str[start_index:end_index])

519 526 559
Georgetown University - Wikipedia
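
The `.find()` and slicing steps above can be collected into one function that also handles a missing tag. This is a sketch under stated assumptions: `extract_between` is a hypothetical helper name, and the HTML string is a made-up sample rather than the real page.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:  # .find() returns -1 when the substring is absent
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)  # search only after the start marker
    if end == -1:
        return None
    return text[start:end]

# Made-up HTML snippet for illustration.
sample_html = "<html><head><title>Georgetown University - Wikipedia</title></head></html>"

print(extract_between(sample_html, "<title>", "</title>"))  # Georgetown University - Wikipedia
print(extract_between(sample_html, "<h1>", "</h1>"))        # None
```

Without the `-1` checks, a missing tag would not raise an error; the slice would silently return the wrong text.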


# Understanding URLs

## URL: Uniform Resource Locator

![](images/2023-09-11-00-17-22.png)

- **URL**: Begins with the base URL (e.g., https://example.com).
- **Query Parameters**: Separated from the base URL by a question mark (?).
- **Parameter Key-Value Pairs**: Key and value pairs separated by an equal sign (=).
- **Multiple Parameters**: Multiple key-value pairs separated by ampersands (&).
- **Encoding**: Special characters may be URL-encoded (e.g., spaces become "%20", or "+" within a query string).
- **Example**: https://example.com/search?q=keyword&lang=en&page=1

## Details {.smaller}

A URL query typically consists of additional information or parameters appended to a base URL, allowing you to send data to a web server when making an HTTP request. Here are the details of a URL query:

- **Base URL**: The URL that represents the main web page or resource you want to access (e.g., https://example.com).

- **Query Parameters**: These are key-value pairs that provide additional information to the web server. Query parameters are separated from the base URL by a question mark (?).

- **Key-Value Pairs**: Each query parameter consists of a key and a value separated by an equal sign (=). For example, in the URL https://example.com/search?q=keyword&lang=en&page=1:

  - "q" is the key, and "keyword" is the value.
  - "lang" is the key, and "en" is the value.
  - "page" is the key, and "1" is the value.

- **Multiple Parameters**: Include multiple query parameters in a single URL by separating them with ampersands (&). For example, https://example.com/search?q=keyword&lang=en&page=1 contains three query parameters: "q", "lang", and "page".

- **URL Encoding**: Special characters, e.g., spaces or non-alphanumeric characters, may need to be URL-encoded to ensure they are transmitted correctly. For instance, a space is encoded as "%20" (or as "+" within a query string), and other characters have their own encodings.

- **Example**: In a search engine query, you might use a URL query to specify the search term ("q"), the language preference ("lang"), and the page number ("page"). The resulting URL could look like this:
  - `https://example.com/search?q=python%20tutorial&lang=en&page=2`

<!-- - **HTTP Method**: The HTTP method used to make the request (e.g., GET or POST) determines how the query parameters are sent to the server. In a GET request, query parameters are appended to the URL, while in a POST request, they are sent in the request body. -->

URL queries are used to pass data to the server for various purposes, e.g., searching, filtering, or customizing content.
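
The pieces of a URL query described above can be inspected and built with the standard library's `urllib.parse` module. The following is a minimal sketch using the `example.com` URL from this slide; nothing is requested over the network.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

url = "https://example.com/search?q=python%20tutorial&lang=en&page=2"

# Split the URL into its components.
parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path)  # https example.com /search

# Decode the query string into a dict mapping each key to a list of values.
params = parse_qs(parts.query)
print(params)  # {'q': ['python tutorial'], 'lang': ['en'], 'page': ['2']}

# Build a query string from key-value pairs; spaces become "+" by default.
query = urlencode({"q": "python tutorial", "lang": "en", "page": 2})
print(query)  # q=python+tutorial&lang=en&page=2

# Pass quote_via=quote to encode spaces as "%20" instead.
print(urlencode({"q": "python tutorial"}, quote_via=quote))  # q=python%20tutorial
```

Letting `urlencode()` handle the encoding avoids mistakes when a value contains spaces, ampersands, or other special characters.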

## Example

* `https://www.amazon.com/s?k=flip+flops`

![](images/2023-09-11-00-20-35.png)

## Example 

* `https://www.amazon.com/s?k=flip+flops&page=3`
![](images/2023-09-11-00-24-20.png)

# Regular Expressions

## Overview 

- **Regular Expressions (Regexes)**: Patterns for searching text in strings.
- **Python Support**: Available through Python's built-in `re` module.
- **General Use**: Regexes are supported by most programming languages, not just Python.
- **Importing `re`**: Start by importing the `re` module to access regex functionality.
- **Metacharacters**: Special characters that represent patterns rather than literal text.
- **Example**: `*` denotes zero or more instances of the preceding character.


## Getting started

- **`re.findall()`**: Finds all text matching a specified regex pattern within a string.
- The function takes two arguments:
  - First argument: The regex pattern to match.
  - Second argument: The string to test against the pattern.
- It returns a list of all non-overlapping matches of the pattern in the string. Each element of the list is one match; if no matches are found, the list is empty.
- Pattern matching is case sensitive by default; pass `re.IGNORECASE` as a third argument to ignore case.






In [85]:
import re
string="welcome to me home"

print(re.findall("me*", string))
print(re.findall("me* ", string))

print(re.findall("ab*c", "abcd"))
print(re.findall("ab*c", "acc"))
print(re.findall("ab*c", "abcac"))
print(re.findall("ab*c", "abdc"))
print(re.findall("ab*c", "ABC", re.IGNORECASE))



['me', 'me', 'me']
['me ', 'me ']
['abc']
['ac']
['abc', 'ac']
[]
['ABC']


## Wild-cards

* A period (`.`) stands for any single character in a regular expression. For instance, `a.c` matches the letters "a" and "c" separated by exactly one character.
  * In other words, `.` acts as a single-character wildcard.
* The regex pattern `.*` matches any character repeated any number of times. For example, `a.*c` finds substrings starting with "a" and ending with "c", with any content in between.
  * This is very similar to the `*` wildcard in Linux bash scripting.


<!-- UNFINSHED STOPPED HERE -->
<!-- Often, you use re.search() to search for a particular pattern inside a string. This function is somewhat more complicated than re.findall() because it returns an object called MatchObject that stores different groups of data. This is because there might be matches inside other matches, and re.search() returns every possible result. -->







In [96]:
# "." matches exactly one character (a single-character wildcard)
print(re.findall("miss.ssippi", "mississippi"))
print(re.findall("miss.sippi", "mississippi"))
print(re.findall("mis.*pi", "mississippi"))

['mississippi']
[]
['mississippi']
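
One consequence of `.*` worth seeing in code is that it is greedy: it matches as much text as it can. The sketch below goes slightly beyond the slide by using the non-greedy form `.*?`, `re.search()`, and a capture group; the HTML string is a made-up sample, not the real page.

```python
import re

# Greedy: ".*" grabs as much as possible, so both words end up in one match.
print(re.findall("a.*c", "abc abc"))   # ['abc abc']

# Non-greedy: ".*?" stops at the first place the rest of the pattern can match.
print(re.findall("a.*?c", "abc abc"))  # ['abc', 'abc']

# Made-up HTML snippet for illustration.
sample_html = "<html><head><title>Georgetown University - Wikipedia</title></head></html>"

# The parentheses capture only the text between the opening and closing tags.
match = re.search("<title.*?>(.*?)</title>", sample_html, re.IGNORECASE | re.DOTALL)
title = match.group(1) if match else None
print(title)  # Georgetown University - Wikipedia
```

`re.search()` returns `None` when nothing matches, which is why the example checks `match` before calling `.group(1)`.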


# Beautiful Soup

## Overview

- Regular expressions are effective for pattern matching.
- However, it is often better to use a dedicated HTML parser.
- Python offers many tools for this, and `Beautiful Soup` is a popular choice.
- **Beautiful Soup**: Python library for parsing HTML and XML documents.
- **Purpose**: Extracts data from web pages efficiently.
- **Features**: Navigates and searches the document tree, handles malformed markup.
- **Benefits**: Simplifies HTML parsing, ideal for web scraping.
- **Integration**: Compatible with various parsers like lxml and html5lib.
- **Community**: Well-documented and widely used in web scraping and data extraction tasks.
- **Installation**: `python -m pip install beautifulsoup4`
- If Beautiful Soup lacks the functionality you need, consider the `lxml` library. It is more versatile but has a steeper learning curve.

## Important BeautifulSoup methods

- **BeautifulSoup()**: Creates a BeautifulSoup object for parsing HTML or XML.
- **.find() and .find_all()**: Locate tags or elements based on criteria.
- **.select()**: Uses CSS selectors to find elements.
- **.get_text()**: Extracts text content from tags, removing HTML.
- **.prettify()**: Formats the parsed document for readability.
- **.parent and .children**: Access parent and child elements.
- **.next_sibling and .previous_sibling**: Navigate to neighboring elements.
- **.find_parent() and .find_next_sibling()**: Locate specific parent or sibling elements.
- **.attrs**: Access tag attributes as a dictionary.
- **.append() and .insert()**: Add new elements to the parsed document.
- **.replace_with()**: Replace a tag or element with new content.
- **.decompose()**: Remove a tag or element from the document.
- **.encode() and .decode()**: Convert between Unicode and byte strings.
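
A few of the methods listed above can be tried on a small inline document. This is a minimal sketch that assumes `beautifulsoup4` is installed; the HTML snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet for illustration.
sample_html = """
<div class="container">
  <h1>Header</h1>
  <p id="page-content">Content with a <a href="/wiki/Link">link</a>.</p>
  <img class="footer-image m-5" src="footer.png">
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# .find() returns the first matching tag; .find_all() returns every match.
heading = soup.find("h1")
paragraphs = soup.find_all("p")
print(heading.get_text())  # Header
print(len(paragraphs))     # 1

# .select() takes a CSS selector; here, <img> tags with class "footer-image".
images = soup.select("img.footer-image")
print(images[0].attrs["src"])  # footer.png

# Attributes behave like a dictionary; multi-valued ones such as class are lists.
print(images[0]["class"])  # ['footer-image', 'm-5']

# .parent moves up the tree; .name is the tag's name.
link = soup.find("a")
print(link["href"], link.parent.name)  # /wiki/Link p
```

Working on a short string like this is a convenient way to test selectors before running them against a full page.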

## Example {.scrollable}

- Open the URL using `urlopen()`, then read and decode the response into a string named `html`.
- Create a `BeautifulSoup` object named `soup` from two arguments: the HTML to be parsed and the parser to use. Here `"html.parser"` selects Python's built-in HTML parser.
- Access specific HTML tags using properties of the `soup` object.
- For instance, retrieve the `<title>` tag using the `.title` property.

In [62]:
#| code-fold: false

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://en.wikipedia.org/wiki/Georgetown_University"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
print(type(soup))
print(soup.title)
print(soup.title.string)  # text inside the <title> tag, without the markup


<class 'bs4.BeautifulSoup'>
<title>Georgetown University - Wikipedia</title>
Georgetown University - Wikipedia


## Extracting text {.scrollable}

- The `soup` object can be used interactively to explore the parsed HTML.
- Use the `.get_text()` method to extract the text and remove the HTML tags.

In [57]:
#| code-fold: false
import os
html_text=soup.get_text()
print(type(html_text))

# Remove empty lines (lines containing only whitespace are kept)
html_text = os.linesep.join([s for s in html_text.splitlines() if s])
print(html_text[0:300])

<class 'str'>
Georgetown University - Wikipedia
Jump to content
Main menu
Main menu
move to sidebar
hide
		Navigation
	
Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate
		Contribute
	
HelpLearn to editCommunity portalRecent changesUpload file
Languages
Language links are at the top of 
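
In the output above, menu entries such as "Main page" and "Contents" run together because `.get_text()` joins text fragments with no separator by default. The sketch below shows the method's `separator` and `strip` arguments on a made-up snippet that resembles a navigation menu; it assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup

# Made-up snippet resembling a navigation menu.
sample_html = "<ul><li>Main page</li><li>Contents</li><li>Current events</li></ul>"
menu_soup = BeautifulSoup(sample_html, "html.parser")

# Default: text from adjacent tags is concatenated with no separator.
print(menu_soup.get_text())  # Main pageContentsCurrent events

# separator inserts a string between text fragments; strip=True trims whitespace
# around each fragment and drops fragments that are empty.
menu_text = menu_soup.get_text(separator="\n", strip=True)
print(menu_text.splitlines())  # ['Main page', 'Contents', 'Current events']
```

With `strip=True`, the separate step of filtering out empty lines is often unnecessary.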


## Extracting images 

- HTML tags can directly indicate the data to retrieve.
- Example: image URLs are stored in the `src` attribute of `<img>` tags.
- `soup.find_all("img")` retrieves a list of all `<img>` tags in the HTML document.
- The list contains `Tag` objects, not plain strings, which provide a user-friendly interface for working with tag information.

<!-- unfinished : render some of the images -->

In [63]:
#| code-fold: false

soup.find_all("img")

[<img alt="" aria-hidden="true" class="mw-logo-icon" height="50" src="/static/images/icons/wikipedia.png" width="50"/>,
 <img alt="Wikipedia" class="mw-logo-wordmark" src="/static/images/mobile/copyright/wikipedia-wordmark-en.svg" style="width: 7.5em; height: 1.125em;"/>,
 <img alt="The Free Encyclopedia" class="mw-logo-tagline" height="13" src="/static/images/mobile/copyright/wikipedia-tagline-en.svg" style="width: 7.3125em; height: 0.8125em;" width="117"/>,
 <img class="mw-file-element" data-file-height="305" data-file-width="260" decoding="async" height="176" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Georgetown_University_seal.svg/150px-Georgetown_University_seal.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Georgetown_University_seal.svg/225px-Georgetown_University_seal.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Georgetown_University_seal.svg/300px-Georgetown_University_seal.svg.png 2x" width="150"/>,
 <img class="mw-file
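
The `src` values in the output above are relative (`/static/...`) or protocol-relative (`//upload.wikimedia.org/...`), so they need to be resolved against the page URL before they can be requested. The following is a sketch on a made-up snippet; the second image path is illustrative rather than a real file, and `beautifulsoup4` is assumed to be installed.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/Georgetown_University"

# Made-up snippet with a site-relative and a protocol-relative src.
sample_html = """
<img alt="" src="/static/images/icons/wikipedia.png">
<img src="//upload.wikimedia.org/example.png">
<img alt="no src attribute">
"""
img_soup = BeautifulSoup(sample_html, "html.parser")

# .get("src") returns None instead of raising KeyError when src is missing.
image_urls = [
    urljoin(base_url, img.get("src"))
    for img in img_soup.find_all("img")
    if img.get("src")
]
for image_url in image_urls:
    print(image_url)
```

`urljoin()` fills in the scheme and host from `base_url`, so both kinds of relative path become complete URLs.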

<!-- 


## Working with HTML

* <a href="https://requests.readthedocs.io/en/latest/" target="_blank">`requests` Documentation</a>
* <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank">BeautifulSoup (`bs4`) Documentation</a>
#| label: bs4-example
#| echo: true
#| code-fold: show
# Get HTML
import pandas as pd
import requests
from bs4 import BeautifulSoup
response = requests.get("https://en.wikipedia.org/wiki/Data_science")
# Parse HTML
soup = BeautifulSoup(response.text, 'html.parser')
header_elts = soup.select('h2 > span.mw-headline')
header_df = pd.DataFrame({'section_title': [elt.text for elt in header_elts]})
header_df
::: {.aside}

Note: `httr2` is a re-written version of the original `httr` package, which is now deprecated. You'll still see lots of code using `httr`, however, so it's good to know how both versions work. <a href="https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html" target="_blank">Click here for a helpful vignette</a> on the original `httr` library.

:::

## Navigating HTML with CSS selectors

## Navigating HTML with XPath

<a href="https://devhints.io/xpath" target="_blank">XPath Cheatsheet</a>

* Notice the last line of the previous code example[^r-version-xpath]:

[^r-version-xpath]: See the R code examples in the appendix, for the R version of this code, which uses XPath strings in the same way as `BeautifulSoup` does.

* The string passed to `xml_find_all()` is an **XPath selector**

::: {.aside}

XPath selectors are used by many different libraries, including **Selenium** (which we'll look at very soon) and **jQuery** (a standard extension to plain JavaScript allowing easy searching/manipulation of the DOM), so it's good to learn it now!

:::

## XPath I: Selecting Elements

```html {filename="mypage.html"}
<div class="container">
  <h1>Header</h1>
  <p id="page-content">Content</p>
  <img class="footer-image m-5" src="footer.png">
</div>
```

* `'//div'` matches all elements `<div>` in the document:

    ```html
    <div class="container">
      <h1>Header</h1>
      <p id="page-content">Content</p>
      <img class="footer-image m-5" src="footer.png">
    </div>
    ```
* `'//div//img'` matches `<img>` elements which are **descendants of** `<div>` elements:

    ```html
    <img class="footer-image m-5" src="footer.png">
    ```


## XPath II: Filtering by Attributes {.smaller}

```html {filename="mypage.html"}
<div class="container">
  <h1>Header</h1>
  <p id="page-content">Content</p>
  <img class="footer-image m-5" src="footer.png">
</div>
```

* `'//p[@id="page-content"]'` matches all `<p>` elements with id `page-content`[^unique-id]:

    ```html
    <p id="page-content">Content</p>
    ```
* Matching **classes** is a bit trickier:

    [`'//img[contains(concat(" ", normalize-space(@class), " "), " footer-image ")]'`]{.small-codeblock}

    matches all `<img>` elements with `footer-image` as one of their classes[^multi-class]

    ```html
    <img class="footer-image m-5" src="footer.png">
    ```

[^unique-id]: In HTML, `id`s are required to be **unique** to particular elements (and elements cannot have more than one `id`), meaning that this should only return a **single** element, for valid HTML code (not followed by all webpages!). Also note the **double-quotes** after `id=`, which are required in XPath.

[^multi-class]: Your intuition may be to just use `'//img[@class="footer-image"]'`. Sadly, however, this will match only elements with `footer-image` as their **only** class. i.e., it will match `<img class="footer-image">` but not `<img class="footer-image another-class">`. This will usually fail, since most elements on modern webpages have several classes. For example, if the site is using <a href="https://getbootstrap.com/docs/5.3/getting-started/introduction/" target="_blank">Bootstrap</a>, `<p class="p-5 m-3"></p>` creates a paragraph element with a padding of 5 pixels and a margin of 3 pixels. -->