# Tutorial

This package provides tools for web scraping, including:

- Fetching HTML content from a URL.
- Parsing specific elements from the HTML content.
- Saving the extracted data to a file.

In this tutorial, you will learn how to use each function in the package with real-life examples.

## Imports

_First, import the three functions so that you can use them in your own pipeline. The cell below shows the import statements._

In [None]:
from dsci524_group29_webscraping.fetch_html import fetch_html
from dsci524_group29_webscraping.parse_content import parse_content
from dsci524_group29_webscraping.save_data import save_data

## Fetch HTML content from a website

_Next, call the `fetch_html` function with the URL of the website you want to scrape. The example below uses the [IANA Example Domain](https://example.com). The HTML returned by this website is simple and can be printed as illustrated below._

_You can try it with another website of your choosing. However, you might want to first check the length of the response (`len(html_content)`) to see if you can print all of it in your notebook or to the console._

In [None]:
url = "https://example.com"
html_content = fetch_html(url)
print(html_content)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai
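
_The length check suggested above can be sketched as a small helper. The `preview` function and its `limit` parameter are hypothetical names, not part of the package, and the `sample_html` string is only a stand-in for the text returned by `fetch_html`:_

```python
def preview(text, limit=300):
    """Return `text` unchanged if it is short, else its first `limit` characters plus a note."""
    if len(text) <= limit:
        return text
    omitted = len(text) - limit
    return f"{text[:limit]}\n... [{omitted} more characters not shown]"


# Stand-in for the string returned by fetch_html(url); replace with your own response.
sample_html = "<html><body>" + "<p>Hello</p>" * 100 + "</body></html>"

print(len(sample_html))                # check the size before printing everything
print(preview(sample_html, limit=80))  # print only the first 80 characters
```

_Printing a bounded preview keeps a notebook readable when a page turns out to be much larger than the example domain._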

## Parse HTML content using different selectors

_Now you can parse the HTML text to extract specific elements from it. For this, you will need to have some basic understanding of HTML, which you can review [here](https://www.w3schools.com/html/html_basic.asp)._

_For example, from the [example HTML](https://example.com) retrieved in the previous step, we might want to parse the paragraph tags (`<p>`) using a CSS selector. The code below shows how you can do that:_

In [None]:
parsed_data = parse_content(html_content, selector="p", selector_type="css")
print(parsed_data)

[{'value': 'This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.'}, {'value': None}]
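
_The output above contains an entry whose value is `None` and a paragraph with an embedded line break. If you would rather tidy such results before using them, one possible approach is sketched below. The `clean_values` helper is hypothetical and not part of the package, and `sample_parsed` simply copies the output shown above:_

```python
def clean_values(records):
    """Drop records whose 'value' is None and collapse runs of whitespace."""
    cleaned = []
    for record in records:
        value = record.get("value")
        if value is None:
            continue  # skip elements that yielded no text
        # split() with no arguments splits on any whitespace, including newlines
        cleaned.append({"value": " ".join(value.split())})
    return cleaned


# Copy of the parsed paragraphs printed above.
sample_parsed = [
    {"value": "This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission."},
    {"value": None},
]

cleaned_paragraphs = clean_values(sample_parsed)
print(cleaned_paragraphs)
```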


_From the same sample HTML, we might want to parse the HTML Heading 1 tags (`<h1>`) using an XPath selector, as shown in the code below:_

In [None]:
parsed_headings = parse_content(html_content, selector="//h1", selector_type="xpath")
print(parsed_headings)

[{'value': 'Example Domain'}]


## Save parsed data to CSV file

_Finally, you can save just the parts you extracted from the HTML to a file. The examples above each returned only a short list, but a web page will typically have many elements that match a selector. For instance, a page might have several `<h1>` or `<p>` tags. The `save_data` function saves all the retrieved elements to a file._

_The example below saves the `<p>` tags retrieved from the example above (in the `parsed_data` variable) to a CSV file `output_paragraphs.csv`:_

In [None]:
save_data(parsed_data, format="csv", destination="output_paragraphs.csv")
print("Paragraphs saved to 'output_paragraphs.csv'.")

Paragraphs saved to 'output_paragraphs.csv'.


_And the one below saves the `<h1>` tags retrieved from the example above (in the `parsed_headings` variable) to a CSV file `output_headings.csv`:_

In [None]:
save_data(parsed_headings, format="csv", destination="output_headings.csv")
print("Headings saved to 'output_headings.csv'.")

Headings saved to 'output_headings.csv'.
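
_To confirm what ended up in a CSV file, you can read it back with Python's standard `csv` module. The sketch below is self-contained: it writes a stand-in file with `csv.DictWriter` instead of calling `save_data`, and it assumes a layout of one column per dictionary key with a header row, which may differ from what the package actually writes:_

```python
import csv
import os
import tempfile

# Same shape as the parsed headings shown earlier.
records = [{"value": "Example Domain"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "output_headings.csv")

    # Stand-in for save_data(...): header row plus one row per record (assumed layout).
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["value"])
        writer.writeheader()
        writer.writerows(records)

    # Read the file back to check its contents.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows_read_back = list(csv.DictReader(handle))

print(rows_read_back)
```

_To inspect a file produced by `save_data`, point `csv_path` at that file and keep only the reading step._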


_Now you can use these examples to try out many other websites! Here is an easy suggestion: how many `<h2>` tags are on the [UBC MDS homepage](https://masterdatascience.ubc.ca/)?_
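
_If you want to cross-check a tag count independently of the package, a minimal sketch using only Python's standard `html.parser` module is shown below. `TagCounter`, `count_tags`, and the `sample_html` string are hypothetical illustrations. With the package itself, the length of the list returned by `parse_content` should presumably give the same number:_

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times one tag is opened in an HTML document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == self.tag:
            self.count += 1


def count_tags(html_text, tag):
    """Return the number of opening `tag` elements found in `html_text`."""
    counter = TagCounter(tag.lower())
    counter.feed(html_text)
    counter.close()
    return counter.count


# Small stand-in document; replace with the text returned by fetch_html(url).
sample_html = "<html><body><h1>Title</h1><h2>First</h2><p>Text</p><h2>Second</h2></body></html>"

h2_count = count_tags(sample_html, "h2")
print(h2_count)
```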