# Webpage to Markdown Converter

- This Jupyter notebook downloads a webpage, converts it to Markdown, and saves both the original HTML and the Markdown result locally in the "downloads" folder.
- The notebook is modular, with reusable functions for downloading, verifying, saving, and converting content.
- Each function logs its progress and handles errors, returning `None` (or `False` for verification) on failure.


## Import Libraries

- We will use the following libraries:
  - `requests`: To download webpage content.
  - `BeautifulSoup` (from the `bs4` package): To parse the HTML.
  - `markdownify`: To convert HTML to Markdown.
  - `os`: To manage file and directory operations.
  - `logging`: For providing feedback and managing errors.


In [1]:
# Import necessary libraries
import os
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md
import logging

# Configure logging for user feedback and error handling
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

# Create the 'downloads' folder if it doesn't already exist
os.makedirs('downloads', exist_ok=True)


## Function to Download Webpage

- This function takes a URL as input and attempts to download its HTML content using the `requests` library.
- It catches invalid URLs, network failures, timeouts (after 10 seconds), and HTTP error responses.
- The HTML text is returned if the download succeeds; otherwise the error is logged and `None` is returned.


In [2]:
def download_webpage(url):
    """
    Downloads the webpage content from the given URL.

    Parameters:
    - url (str): The URL of the webpage to download.

    Returns:
    - str or None: HTML content of the downloaded webpage, or None if the download failed.
    """
    try:
        logging.info("Attempting to download webpage...")
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise HTTPError for bad responses
        logging.info("Webpage downloaded successfully!")
        return response.text
    except requests.exceptions.RequestException as e:
        logging.error(f"Error downloading webpage: {e}")
        return None
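
A quick format check before calling `download_webpage` can catch obviously malformed input without making a network request. The following is a minimal sketch using the standard library's `urllib.parse`; the helper name `is_valid_http_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlparse


def is_valid_http_url(url):
    """Return True if url has an http(s) scheme and a host (hypothetical helper)."""
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        # urlparse can raise ValueError for some malformed inputs
        return False
    # Require both a web scheme and a non-empty network location
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A caller could skip the download and log a warning when this returns `False`. The check only validates the shape of the URL, so `download_webpage` still needs its own error handling for unreachable hosts and HTTP errors.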


## Function to Save HTML Content

- This function saves content to a file within the "downloads" folder; the main workflow uses it for both the HTML and the converted Markdown.
- It catches file system errors (`OSError`), such as missing permissions, logs them, and returns `None`.


In [3]:
def save_html(content, filename):
    """
    Saves the HTML content to a file in the 'downloads' folder.

    Parameters:
    - content (str): HTML content to be saved.
    - filename (str): The name of the file to save the content as.

    Returns:
    - str or None: Full path of the saved file, or None if saving failed.
    """
    try:
        filepath = os.path.join('downloads', filename)
        with open(filepath, 'w', encoding='utf-8') as file:
            file.write(content)
        logging.info(f"HTML content saved to {filepath}")
        return filepath
    except OSError as e:
        logging.error(f"Error saving file: {e}")
        return None
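
Because `save_html` joins the filename directly onto the "downloads" folder, a name containing path separators or unusual characters could write outside that folder or fail on some file systems. The sketch below shows one way to reduce a user-supplied name to a safe base filename first. The helper `sanitize_filename` and its character whitelist are assumptions for illustration, not part of the notebook.

```python
import os
import re


def sanitize_filename(name, default="default_webpage"):
    """Reduce a user-supplied name to a safe base filename (hypothetical helper)."""
    # Drop any directory components so the file stays inside 'downloads'
    base = os.path.basename(name.replace("\\", "/"))
    # Replace runs of characters outside a conservative whitelist with '_'
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", base).strip("._")
    # Fall back to the default when nothing usable remains
    return cleaned or default
```

The result could be passed to `save_html` in place of the raw input, for example `save_html(content, sanitize_filename(filename))`.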


## Function to Convert HTML to Markdown

- This function uses the `markdownify` library to convert the downloaded HTML content into Markdown format, returning `None` if the conversion fails.


In [4]:
def convert_to_markdown(html_content):
    """
    Converts HTML content to Markdown format using Markdownify.

    Parameters:
    - html_content (str): HTML content to be converted.

    Returns:
    - str or None: Converted Markdown content, or None if the conversion failed.
    """
    try:
        logging.info("Converting HTML content to Markdown...")
        markdown_content = md(html_content)
        logging.info("Conversion to Markdown successful!")
        return markdown_content
    except Exception as e:
        logging.error(f"Error converting to Markdown: {e}")
        return None


## Function to Verify HTML Content

- This function checks that the HTML contains both a `<title>` tag and at least one `<meta>` tag.
- If either tag is missing, the function logs a warning and returns `False`.


In [5]:
def verify_html_content(content):
    """
    Verifies that the HTML content contains both a <title> and a <meta> tag.

    Parameters:
    - content (str): HTML content to be verified.

    Returns:
    - bool: True if essential tags are present, False otherwise.
    """
    try:
        soup = BeautifulSoup(content, 'html.parser')
        if soup.title and soup.meta:
            logging.info("HTML content verification successful!")
            return True
        else:
            logging.warning("Essential HTML tags are missing (e.g., <title>, <meta>)")
            return False
    except Exception as e:
        logging.error(f"Error verifying HTML content: {e}")
        return False


## User Input

- Prompt the user for a URL and an optional filename.
- If no filename is provided, the fixed default name `default_webpage.html` is used; otherwise the `.html` extension is appended when it is missing.


In [6]:
# Prompt the user for a URL
url = input("Enter the URL of the webpage: ")

# Optional custom filename
filename = input("Enter a custom filename (leave blank for default): ").strip()

# If no filename is provided, fall back to a fixed default name
if not filename:
    filename = 'default_webpage.html'
else:
    # Ensure the provided filename has the correct extension for the HTML file
    if not filename.endswith('.html'):
        filename += '.html'


## Main Workflow

- Here we bring everything together.
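- The workflow downloads the webpage, verifies the content, saves the HTML, converts it to Markdown, and saves the Markdown under the same name with an `.md` extension.
- Feedback is logged at each stage; if the download or verification fails, nothing is saved.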


In [7]:
# Main workflow
html_content = download_webpage(url)

if html_content:
    if verify_html_content(html_content):
        # Save HTML file (with correct .html extension)
        saved_path = save_html(html_content, filename)

        if saved_path:
            # Convert HTML to Markdown and save as .md file
            markdown_content = convert_to_markdown(html_content)
            if markdown_content:
                markdown_filename = os.path.splitext(filename)[0] + '.md'
                save_html(markdown_content, markdown_filename)


INFO: Attempting to download webpage...
INFO: Webpage downloaded successfully!
INFO: HTML content verification successful!
INFO: HTML content saved to downloads/eu_ai_act-full_text.html
INFO: Converting HTML content to Markdown...
INFO: Conversion to Markdown successful!
INFO: HTML content saved to downloads/eu_ai_act-full_text.md
