# Tutorial for dsci524_group29_webscraping

This package provides tools for web scraping, including:

- Fetching HTML content from a URL.
- Parsing specific elements from the HTML content.
- Saving the extracted data to a file.

This tutorial walks through each function in the package in turn, using https://example.com as the target page.

In [1]:
# Import the required functions from the package
from dsci524_group29_webscraping.fetch_html import fetch_html
from dsci524_group29_webscraping.parse_content import parse_content
from dsci524_group29_webscraping.save_data import save_data

In [2]:
# Fetch HTML content from a website
url = "https://example.com"
html_content = fetch_html(url)

# Display the first 500 characters of the HTML content
print(html_content[:500])

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
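
The tutorial does not show how `fetch_html` behaves when a request fails, so the following is only a rough standard-library sketch of a fetch step with a timeout and basic error handling. `fetch_text` is a hypothetical helper and is not the package's implementation; the `data:` URL is an assumption used to keep the demonstration offline.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10.0):
    """Hypothetical helper: return the decoded body of `url`, or None on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            # Fall back to UTF-8 when the response does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError) as exc:
        # OSError covers URLError, HTTPError, and socket timeouts;
        # ValueError covers malformed URLs such as a missing scheme.
        print(f"Could not fetch {url!r}: {exc}")
        return None


# A data: URL stands in for a real page so the sketch runs without network access.
demo_html = fetch_text("data:text/html;charset=utf-8,%3Ch1%3EExample%20Domain%3C%2Fh1%3E")
print(demo_html[:500])
```

Returning `None` on failure is one possible design; raising the exception to the caller is equally reasonable when the script cannot continue without the page.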
    


In [3]:
# Parse paragraph tags from the HTML content using a CSS selector
parsed_data = parse_content(html_content, selector="p", selector_type="css")

# Display the first 5 parsed elements
print(parsed_data[:5])

[{'value': 'This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.'}, {'value': None}]
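
The second entry in the output above has a `value` of `None`, presumably because the matching paragraph on the page wraps only a link and carries no direct text. Before saving, it can help to drop such entries and collapse the embedded newlines. The following is a minimal sketch, assuming the list-of-dicts shape printed above; `clean_values` is a hypothetical helper that is not part of the package, and `sample` is copied from the output rather than fetched.

```python
def clean_values(records):
    """Hypothetical helper: drop None values and collapse runs of whitespace."""
    cleaned = []
    for record in records:
        value = record.get("value")
        if value is None:
            continue  # Skip elements that had no text of their own.
        # split() with no arguments breaks on any whitespace, including "\n".
        cleaned.append({"value": " ".join(value.split())})
    return cleaned


# Copied from the output above so the sketch runs without the package.
sample = [
    {"value": "This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission."},
    {"value": None},
]

paragraphs = clean_values(sample)
print(paragraphs)
```

The result keeps the same list-of-dicts shape, so it could be passed along in place of `parsed_data`.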


In [4]:
# Parse headings from the HTML content using an XPath selector
parsed_headings = parse_content(html_content, selector="//h1", selector_type="xpath")

# Display the headings
print(parsed_headings)

[{'value': 'Example Domain'}]
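
To see what an expression like `//h1` selects without running the package, here is a small standard-library sketch. It is an illustration only and not how `parse_content` works internally: `xml.etree.ElementTree` supports just a subset of XPath and needs well-formed markup, so the sketch uses the relative form `.//h1` on a hand-written snippet (an assumption, not the real page).

```python
import xml.etree.ElementTree as ET

# A tiny well-formed stand-in for the fetched page.
snippet = """<html><body><div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents.</p>
</div></body></html>"""

root = ET.fromstring(snippet)

# ".//h1" matches every <h1> descendant of the root, at any depth.
headings = [{"value": h1.text} for h1 in root.findall(".//h1")]
print(headings)
```

Real-world HTML is often not well-formed XML, which is one reason a dedicated HTML parser is normally preferred over `ElementTree` for scraping.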


In [5]:
# Save the parsed paragraphs to a CSV file
save_data(parsed_data, format="csv", destination="output_paragraphs.csv")
print("Paragraphs saved to 'output_paragraphs.csv'.")

# Save the parsed headings to a CSV file
save_data(parsed_headings, format="csv", destination="output_headings.csv")
print("Headings saved to 'output_headings.csv'.")

Paragraphs saved to 'output_paragraphs.csv'.
Headings saved to 'output_headings.csv'.
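
After saving, reading the file back is a quick way to confirm the contents. The tutorial does not show the layout `save_data` writes, so this sketch assumes a header row with a single `value` column and uses `csv.DictWriter` as a stand-in for the save step; the temporary path is also an assumption, chosen so the example does not touch the files created above.

```python
import csv
import os
import tempfile

records = [{"value": "Example Domain"}]

# Stand-in for save_data: write a header row plus one row per record.
path = os.path.join(tempfile.mkdtemp(), "output_headings.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["value"])
    writer.writeheader()
    writer.writerows(records)

# Read the file back; DictReader maps each row to the header names.
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(rows)
```

If the real file uses a different header or has none, the `DictReader` step would need adjusting to match.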
