### **Overview of the code:**

This module provides functions to scrape web pages for links, images, and text content.

#### **Components and Functionalities**

1. **Imports**:
   - Imports `requests` for HTTP requests, `BeautifulSoup` (from `bs4`) for HTML parsing, `urljoin` and `urlparse` (from `urllib.parse`) for URL handling, and `re` for text cleaning.

2. **Function: `get_all_links(url, domain)`**:
   - Retrieves all links from a webpage that belong to a specified domain.
   - **Args**:
     - `url (str)`: The URL of the webpage.
     - `domain (str)`: The domain to filter links.
   - **Returns**:
     - `set`: A set of links that belong to the specified domain.

3. **Function: `scrape_page(url)`**:
   - Scrapes images and text content from a webpage.
   - **Args**:
     - `url (str)`: The URL of the webpage.
   - **Returns**:
     - `tuple`: A tuple containing a list of image URLs and the text content.
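
The domain filter that `get_all_links` applies can be illustrated without any network access. The sketch below is not the module's code: `filter_links_by_domain` and the example URLs are hypothetical, and it assumes that "belongs to the domain" means an exact match on the URL's network location (`netloc`).

```python
from urllib.parse import urljoin, urlparse

def filter_links_by_domain(base_url, hrefs, domain):
    """Resolve each href against base_url and keep those hosted on domain."""
    links = set()
    for href in hrefs:
        # urljoin turns relative hrefs into absolute URLs
        full_url = urljoin(base_url, href)
        # Hypothetical rule: exact host match (subdomains are excluded)
        if urlparse(full_url).netloc == domain:
            links.add(full_url)
    return links

hrefs = [
    "/about",
    "contact.html",
    "https://other.example.org/page",
    "https://example.com/blog",
    "mailto:someone@example.com",
]
result = filter_links_by_domain("https://example.com/docs/", hrefs, "example.com")
print(sorted(result))
```

With an exact match, links on subdomains such as `www.example.com` are dropped; a looser rule (for example, `netloc.endswith(domain)`) would be a different design choice.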

#### **Detailed Method Descriptions**

- **`get_all_links(url, domain)`**:
  - Makes an HTTP GET request to `url` using `requests.get`.
  - Parses the HTML content using `BeautifulSoup`.
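  - Finds all `<a>` tags that have an `href` attribute and resolves each `href` to an absolute URL using `urljoin`.
  - Keeps only the URLs whose network location (`urlparse(...).netloc`) matches the specified `domain` and adds them to a set of links.
  - Returns an empty set if the request fails.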

- **`scrape_page(url)`**:
  - Makes an HTTP GET request to `url` using `requests.get`.
  - Parses the HTML content using `BeautifulSoup`.
  - **Image Scraping**:
    - Finds all `<img>` tags with `src` attributes and filters out data URLs.
    - Constructs absolute image URLs using `urljoin`.
  - **Text Scraping**:
    - Extracts all text content from the webpage, excluding content from `<header>` and `<footer>` tags.
    - Cleans the extracted text with regular expressions (`re.sub`), stripping punctuation and special symbols.
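
The image filtering and text cleaning steps above can be sketched on plain strings, without fetching a page. This is an illustration under stated assumptions, not the module's implementation: `absolute_image_urls` and `clean_text` are hypothetical helpers, and `clean_text` uses the simpler pattern `[^\w\s]` (drop anything that is not a word character or whitespace) in place of the explicit symbol list used by `scrape_page`.

```python
import re
from urllib.parse import urljoin

def absolute_image_urls(page_url, srcs):
    """Skip inline data: URLs and resolve the rest against page_url."""
    return [urljoin(page_url, src) for src in srcs if not src.startswith("data:")]

def clean_text(text):
    """Strip punctuation, then collapse runs of whitespace to one space."""
    text = re.sub(r"[^\w\s]", "", text)       # assumption: simplified pattern
    return re.sub(r"\s+", " ", text).strip()

images = absolute_image_urls(
    "https://example.com/gallery/",
    ["cat.png", "data:image/png;base64,AAAA", "/static/logo.svg"],
)
cleaned = clean_text("Hello,   world!\n(c) 2024 -- Example")
print(images)
print(cleaned)
```

Note that `[^\w\s]` keeps underscores, whereas the module's pattern removes them, so the two are not interchangeable.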


In [None]:
"""
Module for scraping web pages.
"""
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import re

def get_all_links(url, domain):
    """
    Get all links from a webpage that belong to a given domain.

    Args:
        url (str): The URL of the webpage.
        domain (str): The domain to filter links.

    Returns:
        set: A set of links that belong to the specified domain.
    """
    try:
        # A timeout keeps the call from hanging on an unresponsive server
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.content, 'html.parser')
        links = set()

        # Resolve each href to an absolute URL and keep only links on `domain`
        for tag in soup.find_all('a', href=True):
            full_url = urljoin(url, tag['href'])
            if urlparse(full_url).netloc == domain:
                links.add(full_url)

        return links
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return set()

def scrape_page(url):
    """
    Scrape images and text from a webpage.

    Args:
        url (str): The URL of the webpage.

    Returns:
        tuple: A tuple containing a list of image URLs and the text content.
    """


    
    try:
        # A timeout keeps the call from hanging on an unresponsive server
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Collect absolute image URLs, skipping inline data: URLs
        images = [
            urljoin(url, img['src'])
            for img in soup.find_all('img', src=True)
            if not img['src'].startswith('data:')
        ]

        # Drop <header> and <footer> so their text is excluded
        for tag in soup(['header', 'footer']):
            tag.decompose()

        # Extract text from the remaining content
        texts = ' '.join(soup.stripped_strings)

        # Pre-process the text: strip punctuation and special symbols,
        # then collapse the runs of whitespace left behind
        texts = re.sub(r'[@#$%^&*()_+={}\[\]|\\:;"\'<>,.?/~\-⇒©®™§¶•…¡¿÷×°‰‡†€£]', '', texts)
        texts = re.sub(r'\s+', ' ', texts).strip()

        return images, texts
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return [], ""
