# Basic Web Scraper project


## Summary
This is a **simple** web scraping program that aims to troubleshoot and solve another issue I am having with Obsidian Publish. 
See the [project charter](https://github.com/nvbergen/basic_web_scraper/blob/main/charter%20-%20basic_web_scraper.md) or go directly to the main branch of the [repo](https://github.com/nvbergen/basic_web_scraper/tree/main) for details.

The successful code is at the end of the file. 

### Prerequisites
If you are copying this code, make sure a subdirectory named `output` exists in your working directory; the example calls write their files there.
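
The prerequisite can also be handled in code. The following is a minimal sketch, not part of the original session; the helper name `ensure_output_dir` and its parameters are hypothetical and only illustrate one way to create the folder before any file is written.

```python
from pathlib import Path


def ensure_output_dir(base=".", name="output"):
    """Create the output subdirectory under `base` if it is missing.

    `ensure_output_dir`, `base`, and `name` are illustrative names,
    not something the original project defines.
    """
    path = Path(base) / name
    # exist_ok=True makes the call safe to repeat on every run
    path.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    print(f"Output directory ready: {ensure_output_dir()}")
```

Calling it once at the top of the notebook means the later `open(file_path, 'w', ...)` calls should not fail just because the folder is missing.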

## GPT stream


### Prompt 1

> _help me write a web scraping tool that will capture all of html from a webpage._

#### Generated code output

In [None]:
import requests

def fetch_html(url):
    try:
        # Send a GET request to the specified URL
        response = requests.get(url)
        
        # Raise an error if the request was unsuccessful
        response.raise_for_status()
        
        # Return the HTML content of the page
        return response.text
    except requests.RequestException as e:
        print(f"An error occurred: {e}")
        return None

# Example usage:
# url = "https://example.com"
# html_content = fetch_html(url)
# print(html_content)


#### Results
The program was not executed because I failed to specify a key requirement: **output of the file to a local directory**.

### Prompt 2
> _I would also like my output to be saved locally_

#### Generated code output

In [None]:
import requests

def fetch_html(url):
    try:
        # Send a GET request to the specified URL
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
        }
        response = requests.get(url, headers=headers)
        
        # Raise an error if the request was unsuccessful
        response.raise_for_status()
        
        # Return the HTML content of the page
        return response.text
    except requests.RequestException as e:
        print(f"An error occurred: {e}")
        return None

def save_html_to_file(html_content, file_path):
    try:
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(html_content)
        print(f"HTML content saved to {file_path}")
    except Exception as e:
        print(f"An error occurred while saving the file: {e}")

# Example usage:
# url = "https://example.com"
# html_content = fetch_html(url)
# if html_content:
#     save_html_to_file(html_content, "example_page.html")


#### Results
The program was **not successful**.
<br>
The code and accompanying explanation were not useful, specifically because I had failed to specify my requirements for the program. I also thought the script was very verbose, so I attempted another prompt in the same session.

### Prompt 3
> _Let's start again with a different version. please make a script which will meet the following requirements._
1. [sic] _takes a single url and gets the entire HTML tree and exports a text file to a predefined path._

#### Generated code output

In [None]:
import requests

def fetch_and_save_html(url, file_path):
    """
    Fetches the entire HTML content from the given URL and saves it to a text file.
    
    Parameters:
    - url (str): The URL of the webpage to fetch.
    - file_path (str): The predefined path where the HTML content should be saved.
    """
    try:
        # Set up a User-Agent to mimic a real browser request
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
        }
        
        # Send a GET request to the URL
        response = requests.get(url, headers=headers)
        response.raise_for_status()  # Raise an exception for HTTP errors
        
        # Save the HTML content to the specified file path
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(response.text)
        
        print(f"HTML content successfully saved to: {file_path}")
    
    except requests.RequestException as e:
        print(f"Failed to fetch the HTML content. Error: {e}")
    except Exception as e:
        print(f"Failed to save the HTML content. Error: {e}")

# Example usage:
# url = "https://example.com"
# file_path = "output/example_page.txt"
# fetch_and_save_html(url, file_path)


#### Results
The code looked more in line with what I was expecting, so the program was executed.
<br>
The result was **not successful** because the `requests` library only downloads the initial HTML response; it does not execute JavaScript or wait for the rest of the page to load. I was expecting a long document of HTML in a tree structure, which was not delivered. <br> Excerpt of the results from `test1.txt`:
<br>
<br>
```html
<!doctype html><html lang="en"><head><meta charset="utf-8"/><meta name="viewport" content="width=device-width,initial-scale=1"/><style class="preload">html,body{margin:0;height:100%}body.theme-light{background-color:#fff;color:#222}body.theme-dark{background-color:#1e1e1e;color:#dadada}.preload{padding:20px;white-space:pre-wrap;overflow-wrap:break-word}@keyframes rotate{from{transform:rotate(0)}to{transform:rotate(360deg)}}</style><base href="https://publish.obsidian.md"><script defer="defer" 

```
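
A quick way to spot this kind of result is to measure how much visible text the saved HTML actually contains. The sketch below was not part of the original session: the names `visible_text_length` and `looks_unrendered` and the 200-character threshold are assumptions chosen for illustration. It uses only the standard library's `html.parser` and ignores text inside `script`, `style`, and similar tags.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that is not inside script/style-like tags."""

    SKIP = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def visible_text_length(html):
    """Return the number of visible text characters in an HTML string."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return sum(len(chunk) for chunk in parser.chunks)


def looks_unrendered(html, min_chars=200):
    """Heuristic: very little visible text suggests a JavaScript shell page.

    The threshold is an arbitrary assumption; tune it for the site at hand.
    """
    return visible_text_length(html) < min_chars
```

A page that consists mostly of `<style>` and `<script>` tags, like the excerpt above, would score close to zero, which hints that the content is filled in later by JavaScript.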

### Prompt 4
>_well the output was not what I had expected. can we rewrite using beautiful soup and prettify the output to match all the correct indentations etc._

#### Generated code output

In [6]:
import requests
from bs4 import BeautifulSoup

def fetch_and_save_prettified_html(url, file_path):
    """
    Fetches the entire HTML content from the given URL, formats it with BeautifulSoup,
    and saves it to a text file with proper indentation.
    
    Parameters:
    - url (str): The URL of the webpage to fetch.
    - file_path (str): The predefined path where the formatted HTML content should be saved.
    """
    try:
        # Set up a User-Agent to mimic a real browser request
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
        }
        
        # Send a GET request to the URL
        response = requests.get(url, headers=headers)
        response.raise_for_status()  # Raise an exception for HTTP errors
        
        # Parse the HTML content with BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Use BeautifulSoup's prettify function to format the HTML
        prettified_html = soup.prettify()
        
        # Save the prettified HTML content to the specified file path
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(prettified_html)
        
        print(f"Prettified HTML content successfully saved to: {file_path}")
    
    except requests.RequestException as e:
        print(f"Failed to fetch the HTML content. Error: {e}")
    except Exception as e:
        print(f"Failed to save the HTML content. Error: {e}")

# Example usage:
# url = "https://example.com"
# file_path = "output/example_page.txt"
# fetch_and_save_prettified_html(url, file_path)


#### Results
The results of this test looked better and had all of the proper human-readable indentation (as far as I can tell). However, the output still does not contain any HTML for the content that actually gets loaded.

### Final Prompt
The GPT suggested using `selenium` with Chrome for browser automation. I am familiar with this tool, but for various reasons I will not have any browser available other than Safari. <br><br>

To my delight, Safari allows remote automation; it is enabled by a setting in the advanced developer menus of the Safari settings.
<br><br>
> _this file seems to be getting html that is incomplete. when I open inspect element on my browser I see a big HTML structure the file I get contains like 4 lines why the difference?_

#### Generated code output

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

def fetch_and_save_prettified_html_safari(url, file_path):
    """
    Fetches the complete HTML content from the given URL using Selenium with Safari (rendering JavaScript),
    formats it with BeautifulSoup, and saves it to a text file with proper indentation.
    
    Parameters:
    - url (str): The URL of the webpage to fetch.
    - file_path (str): The predefined path where the formatted HTML content should be saved.
    """
    # Define the name up front so the finally block is safe even if
    # the Safari WebDriver fails to start
    driver = None
    try:
        # Initialize the Safari browser using the WebDriver
        driver = webdriver.Safari()
        
        # Navigate to the URL
        driver.get(url)
        
        # Wait for a few seconds to allow JavaScript to load (adjust this as needed)
        time.sleep(5)
        
        # Get the full HTML content after JavaScript has been executed
        html_content = driver.page_source
        
        # Parse the HTML content with BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Use BeautifulSoup's prettify function to format the HTML
        prettified_html = soup.prettify()
        
        # Save the prettified HTML content to the specified file path
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(prettified_html)
        
        print(f"Prettified HTML content successfully saved to: {file_path}")
    
    except Exception as e:
        print(f"An error occurred: {e}")
    
    finally:
        # Ensure the browser is closed, but only if it was actually started
        if driver is not None:
            driver.quit()

# Example usage:
# url = "https://example.com"
# file_path = "output/example_page.txt"
# fetch_and_save_prettified_html_safari(url, file_path)



In [5]:
url = "https://notes.nickvanbergen.com/Cornell+Notes+in+Obsidian"
file_path = "output/standard_function.txt"
fetch_and_save_prettified_html_safari(url, file_path)


Prettified HTML content successfully saved to: output/standard_function.txt


#### Results
This program was successful. The resulting HTML is over 700 lines, which assures me that I have the detail I need.
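
The fixed `time.sleep(5)` in the function worked for this page, but the right delay is a guess. As a possible refinement, here is a minimal sketch that polls `driver.page_source` until it stops changing or a timeout expires. The helper `wait_for_stable_source` and its parameters are hypothetical and were not part of the generated code; it relies only on the `page_source` attribute already used above. When a specific element is known to appear on the page, Selenium's own `WebDriverWait` is the more usual tool.

```python
import time


def wait_for_stable_source(driver, timeout=15.0, interval=0.5,
                           sleep=time.sleep, clock=time.monotonic):
    """Poll driver.page_source until two consecutive reads are identical.

    Returns the last page source seen. `sleep` and `clock` are injectable
    so the helper can be exercised without a real browser. This is a
    heuristic: a page that pauses between updates can still look stable.
    """
    deadline = clock() + timeout
    previous = driver.page_source
    while clock() < deadline:
        sleep(interval)
        current = driver.page_source
        if current == previous:
            # No change since the last read: treat the page as rendered
            return current
        previous = current
    # Timed out: return whatever was captured last
    return previous
```

In the Safari function, `html_content = wait_for_stable_source(driver)` could stand in for the `time.sleep(5)` and `driver.page_source` pair; the defaults of 15 seconds and 0.5 seconds are arbitrary starting points.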

