## Link Extractor

This program extracts links from static webpages (pages whose links are present in the HTML the server returns, not added later by JavaScript) and writes them to a Markdown file named `links.md`.

If you run this in Colab, open the file browser in the left pane. You may need to refresh it after the program finishes before `links.md` appears.

Scroll down past the code for a full description of the program's operation.

In [5]:
import requests
from bs4 import BeautifulSoup
from pathlib import Path

# The URL you want to scrape
url = 'https://www.google.com/'


def running_in_notebook():
    try:
        from IPython import get_ipython
        if get_ipython() is None:
            return False  # Not in an IPython environment
        if 'IPKernelApp' not in get_ipython().config:
            return False  # IPython is running, but not under a kernel
    except (ImportError, AttributeError):
        return False  # IPython is not installed or other attribute errors
    return True

# Fetch the web page (the timeout keeps a stalled server from hanging the script)
response = requests.get(url, timeout=10)
response.raise_for_status()  # Raise an exception on 4xx/5xx responses

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Initialize a dictionary to store links with href as key to avoid duplicates
links_dict = {}

# Extract all links and store in the dictionary to ensure uniqueness
for link in soup.find_all('a'):
    href = link.get('href')
    text = link.text.strip()
    if href and text and href not in links_dict:  # Check if href is already included
        links_dict[href] = text

# Sort links by the link text (title)
sorted_links = sorted(links_dict.items(), key=lambda item: item[1].lower())

# Choose the output location based on the environment
if running_in_notebook():
    # Notebook: a relative path resolves against the kernel's working directory
    file_path = Path('links.md')
else:
    # Standalone script: save next to the script file
    script_dir = Path(__file__).resolve().parent
    file_path = script_dir / 'links.md'

# Write the links to a Markdown file
redirect_prefix = 'https://www.google.com/url?q='
with open(file_path, 'w', encoding='utf-8') as file:
    for href, text in sorted_links:
        # Unwrap Google redirect links; leave all other URLs untouched
        if href.startswith(redirect_prefix):
            href = href[len(redirect_prefix):].split('&')[0]
        file.write(f"- {text} - {href}\n")

print("Links have been saved to links.md, sorted alphabetically by title without duplicates.")


Links have been saved to links.md, sorted alphabetically by title without duplicates.


Here's a fully commented breakdown of the code, including explanations of its components:

**Imports**

* **requests:** A powerful library for making HTTP requests to websites and fetching their content.
* **bs4 (BeautifulSoup):** A library for parsing HTML and XML documents. It makes it easy to navigate and extract data from web pages.
* **pathlib:** A module providing classes to work with file system paths in a platform-independent way.

**Function: running_in_notebook()**

* **Purpose:** Detects whether the code is running inside a Jupyter notebook. The result decides where the output file is saved.
* **How it works:**
    * Tries to import `get_ipython` from `IPython` (the interactive shell that Jupyter's Python kernel is built on).
    * Returns `False` if the import fails or `get_ipython()` returns `None`, meaning no IPython session is active.
    * Returns `False` if `'IPKernelApp'` is missing from the session's config, meaning IPython is running but not under a kernel.
    * Returns `True` otherwise.

**Web Scraping: the main logic**

* **url:** The target website to scrape (in this case, Google's homepage).

* **requests.get(url):**
    * Sends an HTTP GET request to the specified `url`.
    * Stores the server's response in the `response` object.

* **response.raise_for_status():**
    * Raises `requests.HTTPError` if the response has a 4xx or 5xx status code (e.g., 404 Not Found); otherwise it does nothing.

* **BeautifulSoup(response.text, 'html.parser'):**
    * Creates a `BeautifulSoup` object, parsing the HTML content of the response using the `html.parser`.
    * `soup` now represents a structured way to navigate the HTML.

* **links_dict = {}:**
    * Initializes an empty dictionary to store unique links (key = `href` attribute,  value = link text).

* **for link in soup.find_all('a'): ...**
    * Iterates over all anchor tags (`<a>`) within the HTML.
    * **href = link.get('href'):** Extracts the `href` attribute, or `None` if the tag has none.
    * **text = link.text.strip():** Extracts the link's text content and removes leading and trailing whitespace.
    * **if href and text and href not in links_dict:** Skips anchors with a missing `href` or empty text, and skips any `href` already stored, so the first text seen for each URL is the one kept.
        * **links_dict[href] = text:** Adds the link to the dictionary.

* **sorted_links = ...:**
    * Converts `links_dict` into a list of `(href, text)` tuples sorted by link text.
    * The sort is case-insensitive because each text is lowercased before comparison.
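
The deduplicate-then-sort steps above can be tried without any network access. The following is a minimal sketch, assuming the anchors have already been reduced to `(href, text)` pairs; the function name `unique_sorted_links` and the sample data are hypothetical and not part of the original program.

```python
def unique_sorted_links(pairs):
    """Keep the first text seen for each href, then sort by text, ignoring case."""
    links = {}
    for href, text in pairs:
        text = (text or "").strip()
        # Skip anchors with no href or no visible text, and hrefs already stored
        if href and text and href not in links:
            links[href] = text
    return sorted(links.items(), key=lambda item: item[1].lower())


# Hypothetical sample data standing in for the anchors found on a page
sample = [
    ("/maps", "Maps"),
    ("/about", "about"),
    ("/maps", "Google Maps"),  # duplicate href: the first text ("Maps") is kept
    (None, "No target"),       # no href: skipped
    ("/images", "   "),        # whitespace-only text: skipped
]

result = unique_sorted_links(sample)
print(result)
```

Because the dictionary is keyed by `href`, two different URLs that share the same link text both survive; only repeated URLs are dropped.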

**Environment-Based Behavior**

* **if running_in_notebook(): ... else: ...**
    * Determines the output path based on whether the code is running in a notebook or as a standalone script.
    * **Jupyter Notebook:** Uses the relative path `links.md`, which resolves against the kernel's current working directory (usually the notebook's directory).
    * **Standalone Script:** Saves `links.md` in the same directory as the script, regardless of where the script is launched from.
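
The path choice can be isolated in a small helper, which makes the two branches easier to see. This is a sketch only: `choose_output_path` and its parameters are hypothetical names, and the notebook check is passed in as a boolean rather than detected.

```python
from pathlib import Path


def choose_output_path(in_notebook, script_file=None, filename="links.md"):
    """Return where the output file should be written."""
    if in_notebook or script_file is None:
        # Notebooks have no __file__, so the relative name resolves
        # against the current working directory when the file is opened
        return Path(filename)
    # Standalone script: place the file next to the script itself
    return Path(script_file).resolve().parent / filename


print(choose_output_path(True))
print(choose_output_path(False, script_file="tools/extract_links.py"))
```

In a real script the second call would pass `__file__` as `script_file`; the path used here is only a placeholder.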

**Saving Links to Markdown**

* **with open(file_path, 'w', encoding='utf-8') as file:** ...
    * Opens the specified file in write mode (`'w'`) with UTF-8 encoding.
    * **for href, text in sorted_links:** Iterates through the sorted links.
        * **href modification:** Removes the `https://www.google.com/url?q=` redirect wrapper that Google places around outbound links, along with the tracking parameters after the first `&`. You may not need this step for other sites.
        * **file.write(f"- {text} - {href}\n"):** Writes each link as a Markdown list item of the form `- text - href`.

* **print(...):**  Provides confirmation of success.
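
The href cleanup can also be expressed with the standard library's URL parser, which decodes percent-escapes in the target as a side effect. The sketch below is an alternative under stated assumptions, not the program's own method: `unwrap_google_redirect` is a hypothetical name, and it assumes redirect links have the form `https://www.google.com/url?q=<target>&...`.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_google_redirect(href):
    """Return the target of a Google redirect link, or href unchanged."""
    parts = urlsplit(href)
    # Assumption: redirect links use host www.google.com and path /url
    if parts.netloc == "www.google.com" and parts.path == "/url":
        target = parse_qs(parts.query).get("q")
        if target:
            return target[0]  # parse_qs already percent-decodes the value
    return href


wrapped = "https://www.google.com/url?q=https://example.com/page%3Fid%3D1&sa=D"
print(unwrap_google_redirect(wrapped))
print(unwrap_google_redirect("https://example.com/search?a=1&b=2"))
```

Unlike splitting on the first `&`, this leaves ordinary links with multi-parameter query strings untouched.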

