### **Overview of the code:**

This project implements a web scraping process: it collects the URLs of a domain, scrapes image URLs and text from each page, saves the results to a SQLite database, and optionally downloads images.

#### **Components and Functionalities**

1. **Imports and Setup**:
   - Imports the standard-library modules `sys`, `urllib.parse`, and `sqlite3`, the `import_ipynb` helper, and the project's own `scrape`, `database`, and `download` modules.
   - Inserts a custom path into `sys.path` so that the additional Python notebooks can be imported.

2. **Main Functionality**:
   - **User Input**: Prompts the user to input a domain URL and extracts the domain using `urlparse`.
   - **URL Scraping**: Retrieves all links from the domain using `scrape.get_all_links` and adds the main URL to the resulting set.
   - **Data Scraping**: Iterates through each URL, scraping image URLs and text content using `scrape.scrape_page`, storing results in `scraped_data`.
   - **Database Interaction**: Saves scraped data to a SQLite database using `database.save_to_database`.
   - **Optional Image Download**: Asks the user whether to download images. If the answer is `yes`, prompts for a download path and for the URL of a scraped page stored in the database, looks up that page's image URLs, and downloads them using `download.download_images`.
   - **Completion Message**: Displays a completion message when the script finishes execution.
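
The domain extraction in the User Input step can be sketched as follows. This is a minimal illustration using only the standard library; `extract_domain` and `is_same_domain` are hypothetical helpers introduced here, not functions from the project's `scrape` module, and they only show one way links might be restricted to the entered domain.

```python
from urllib.parse import urlparse


def extract_domain(url):
    """Return the network location (host) part of a URL."""
    return urlparse(url).netloc


def is_same_domain(link, domain):
    """Hypothetical helper: True if `link` points at `domain`."""
    return urlparse(link).netloc == domain


main_url = "https://oth-aw.de/"
domain = extract_domain(main_url)
print(domain)                                               # oth-aw.de
print(is_same_domain("https://oth-aw.de/studium/", domain))  # True
print(is_same_domain("https://example.com/", domain))        # False
```

Note that `urlparse` returns an empty `netloc` when the input has no scheme (for example `oth-aw.de`), so the URL should be entered with its `https://` prefix.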

#### **Detailed Method Descriptions**

- **User Input and URL Handling**:
  - **main_url**: Takes user input for the main domain URL.
  - **domain**: Extracts the domain from the main URL.
  - **db_name**: Derives the database name using `database.get_db_name_from_url`.

- **Scraping and Database Operations**:
  - **urls_to_scrape**: Retrieves all relevant links from the domain.
  - **scraped_data**: A list of tuples, one per page, holding the page URL, its image URLs joined into a comma-separated string, and its text content.

- **Database Interaction**:
  - **SQLite Connectivity**: Connects to the SQLite database specified by `db_name`.
  - **Data Storage**: Saves scraped data into the database table `scraped_data`.

- **Optional Image Download**:
  - **User Confirmation**: Asks if the user wants to download images from scraped URLs.
  - **Image Retrieval**: Looks up the comma-separated image URLs for the chosen page with a parameterized `SELECT` query and splits them back into a list.
  - **Image Download**: Downloads images to the specified path if URLs are found.

- **Execution Flow**:
  - **Main Execution**: Executes the `main()` function if the script is run directly (`__name__ == "__main__"`).
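
The storage and retrieval steps described above can be sketched as a round trip through SQLite. This is an illustration under stated assumptions, not the project's `database.save_to_database`: the helpers `save_rows` and `load_image_urls` are hypothetical, the table name `scraped_data` and the columns `url` and `image_urls` come from the script's `SELECT` query, and the `text_content` column name is a guess.

```python
import sqlite3


def save_rows(conn, rows):
    """Hypothetical helper: store (url, image_urls, text_content) tuples."""
    # Assumed schema; only `url` and `image_urls` are confirmed by the script.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scraped_data "
        "(url TEXT, image_urls TEXT, text_content TEXT)"
    )
    conn.executemany(
        "INSERT INTO scraped_data (url, image_urls, text_content) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def load_image_urls(conn, page_url):
    """Hypothetical helper: return the image URLs stored for one page."""
    row = conn.execute(
        "SELECT image_urls FROM scraped_data WHERE url=?", (page_url,)
    ).fetchone()
    if row is None:
        return []
    # An empty string splits into [''], so empty entries are dropped.
    return [u for u in row[0].split(",") if u]


conn = sqlite3.connect(":memory:")  # in-memory database for the example
images = ["https://example.com/a.png", "https://example.com/b.jpg"]
save_rows(conn, [
    ("https://example.com/", ",".join(images), "Home page text"),
    ("https://example.com/empty/", "", "No images here"),
])
print(load_image_urls(conn, "https://example.com/"))
print(load_image_urls(conn, "https://example.com/empty/"))
print(load_image_urls(conn, "https://example.com/missing/"))
conn.close()
```

One design consequence of the comma-joined format is that an image URL containing a comma would be split into two entries on retrieval; a separate table with one row per image would avoid this.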



In [6]:
# Install these packages so that .ipynb notebooks can be imported as modules
!pip install import-ipynb
!pip install nbimporter

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [7]:
"""
Main script to orchestrate the web scraping process.
"""
import sys
import import_ipynb
sys.path.insert(0, '/home/d356/AI_Project/python_notebooks')
from urllib.parse import urlparse
import sqlite3
import scrape
import database
import download


def main():
    """
    Main function to orchestrate the web scraping process.
    Prompts the user for input, scrapes the website, saves data to the database,
    and optionally downloads images.
    """
    main_url = input("Enter the Domain URL: ")
    domain = urlparse(main_url).netloc
    db_name = database.get_db_name_from_url(main_url)

    urls_to_scrape = scrape.get_all_links(main_url, domain)
    urls_to_scrape.add(main_url)  # Add the main URL to the list of URLs to scrape

    scraped_data = []

    # Scrape each URL for images and text content
    for url in urls_to_scrape:
        print(f"Scraping {url}...")
        image_urls, text_content = scrape.scrape_page(url)
        scraped_data.append((url, ','.join(image_urls), text_content))

    # Save the scraped data to the database
    database.save_to_database(scraped_data, db_name)
    print(f"Data saved in {db_name}")

    # Prompt the user to download images if desired
    download_images_option = input("Do you want to download images from the scraped URLs? (yes/no): ")
    if download_images_option.lower() == 'yes':
        save_path = input("Provide the path where you want to save the images: ")
        download_image_url = input("Provide the URL from the database to scrape images: ")
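        # Folder name is the second-to-last '/'-separated part of the URL:
        # the last path segment for URLs ending in '/', or the domain for the root URL
        domain_folder = '-'.join(download_image_url.split('/')[-2:-1])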

        # Retrieve image URLs from the database
        conn = sqlite3.connect(db_name)
        cursor = conn.cursor()
        cursor.execute("SELECT image_urls FROM scraped_data WHERE url=?", (download_image_url,))
        result = cursor.fetchone()
        conn.close()

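        # An empty image_urls column would split into [''], so drop empty entries
        image_urls = [u for u in result[0].split(',') if u] if result else []
        if image_urls:
            # Download images
            download.download_images(image_urls, save_path, domain_folder)
        else:
            print("No image URL available for this URL.")

    # Printed on every path, whether or not images were downloaded
    print("Script execution complete.")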

if __name__ == "__main__":
    main()


Enter the Domain URL: https://oth-aw.de/
Scraping https://oth-aw.de/hochschule/ueber-uns/einrichtungen/zentrum-fuer-gender-und-diversity/familiengerechte-hochschule/...
Scraping https://oth-aw.de/hochschule/kooperationen/denkmax-stadtlabor-weiden/denkmax-stadtlabor-weiden/...
Scraping https://oth-aw.de/studium/im-studium/organisatorisches/stunden-und-pruefungsplaene/...
Scraping https://oth-aw.de/studium/campus-und-leben/fundbuero/...
Scraping https://oth-aw.de/international/wege-ins-ausland/auslandsblog-zugvoegel/...
Scraping https://oth-aw.de/weiterbildung/oth-professional/projekte/...
Scraping https://oth-aw.de/rechtsgrundlagen/...
Scraping https://oth-aw.de/weiterbildung/berufsbegleitendes-studium/bachelorstudium/...
Scraping https://oth-aw.de/hochschule/kooperationen/makerspace/ueber-den-makerspace/...
Scraping https://oth-aw.de/forschung/forschungsprofil/hrk-forschungslandkarte/...
Scraping https://oth-aw.de/hochschule/aktuelles/veranstaltungen/veranstaltungsliste/infoveranstaltu

Scraping https://oth-aw.de/forschung/forschungsprofil/news/...
Scraping https://oth-aw.de/forschung/forschungsprofil/publikationen/...
Scraping https://oth-aw.de/studium/vor-dem-studium/reinschnuppern/reinschnuppern/...
Scraping https://oth-aw.de/international/internationales-profil/faq/...
Scraping https://oth-aw.de/studium/engagement/talentfoerderung-preise-und-stipendien/aktuelles/...
Scraping https://oth-aw.de/hochschule/aktuelles/news/oth-amberg-weiden-sammelt-ueber-4000-brillen-fuer-guten-zweck/...
Scraping https://oth-aw.de/studium/engagement/studierendenvertretung/hochschulwahlen/...
Scraping https://oth-aw.de/international/wege-zu-uns/internationale-vollzeitstudierende/...
Scraping https://oth-aw.de/hochschule/kooperationen/digitale-gruenderinitiative-oberpfalz-dgo/...
Scraping https://oth-aw.de/international/wege-ins-ausland/studium-im-ausland/...
Scraping https://oth-aw.de/hochschule/aktuelles/veranstaltungen/veranstaltungsliste/verabschiedung-der-absolventinnen-und-absolven