# Web Scraping and Data Extraction Tool with PostgreSQL Integration

## Introduction

This Jupyter Notebook extends the web scraping tool by incorporating a database-driven approach. In addition to crawling web pages and saving HTML content, the tool now also extracts relevant data (e.g., titles, authors, categories, summaries) from each webpage and stores this data in a PostgreSQL database. The database is structured to handle information about articles, their authors, and categories, ensuring efficient data management and retrieval.

### Objectives

- **Crawl Web Pages**: Start from an initial URL and navigate through discovered links using a breadth-first search (BFS) approach.
- **Download HTML Content**: Save the HTML content of crawled pages into files for later use.
- **Extract Data**: Parse the HTML content to extract key information like article title, author, category, and summary.
- **Save Data to Database**: Store the extracted data in a PostgreSQL database using pre-defined tables for articles, authors, and categories.
- **Parallel Processing**: Use threading and parallel processing to efficiently crawl and extract data from multiple URLs simultaneously.
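
As a rough illustration of the breadth-first strategy listed above, the sketch below walks a small in-memory link graph instead of live pages. The `LINKS` dictionary and the `bfs_crawl` function are hypothetical names used only for this example, and the `seen` set is an addition that keeps a page from being queued twice.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it
LINKS = {
    "home": ["news", "sport"],
    "news": ["politics", "home"],
    "sport": ["news"],
    "politics": [],
}

def bfs_crawl(start, max_pages):
    """Return pages in the order a breadth-first crawl would visit them."""
    queue = deque([start])   # FIFO queue gives breadth-first order
    seen = {start}           # Pages already queued, to avoid revisiting them
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(bfs_crawl("home", max_pages=10))
```

Pages closest to the start page are visited first, and `max_pages` caps the crawl in the same spirit as the `MAX_PAGES` setting.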

### PostgreSQL Integration

The tool creates a PostgreSQL database and sets up tables required to store the extracted data:
- **Author Table**: Stores information about the authors of articles.
- **Category Table**: Stores the categories to which each article belongs.
- **Article Table**: Stores article data, linking each article to its respective author and category via foreign key constraints.

### Dependencies

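To run this notebook, install the following third-party libraries:
- `selenium` (plus a local Chrome installation for `webdriver.Chrome()`)
- `beautifulsoup4` (imported as `bs4`)
- `psycopg2` (for interacting with PostgreSQL; its `pool` module provides the connection pool)
- `joblib`

The `os` and `threading` modules are part of the Python standard library and need no installation.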

### Database Setup

Before starting the web crawling and data extraction process, the script:
1. **Creates the PostgreSQL Database**: Initializes a PostgreSQL database (if it doesn’t already exist).
2. **Creates a Connection Pool**: Opens a pool of connections to that database so that connections are reused rather than reopened for each database operation.
3. **Creates the Necessary Tables**: Sets up tables for storing article data, author information, and categories.
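
One way to make borrowing from the pool harder to get wrong is a small context manager, sketched below. The `pooled_connection` helper and the `FakePool` class are hypothetical: the helper relies only on the `getconn()` and `putconn()` methods that psycopg2 pools expose, and the fake pool exists solely so the sketch runs without a database server.

```python
from contextlib import contextmanager

@contextmanager
def pooled_connection(connection_pool):
    """Borrow a connection from a pool and always hand it back.

    Works with any object exposing getconn() and putconn(conn), which is
    the interface psycopg2's connection pools provide.
    """
    conn = connection_pool.getconn()
    try:
        yield conn
    finally:
        connection_pool.putconn(conn)  # Returned even if the body raised

class FakePool:
    """Stand-in pool (an assumption for this sketch) so it runs without a server."""
    def __init__(self):
        self.borrowed = 0

    def getconn(self):
        self.borrowed += 1
        return object()

    def putconn(self, conn):
        self.borrowed -= 1

demo_pool = FakePool()
with pooled_connection(demo_pool) as conn:
    print("borrowed inside the block:", demo_pool.borrowed)
print("borrowed after the block:", demo_pool.borrowed)
```

With a real pool, the body of the `with` block would create a cursor and run queries, and the connection would go back to the pool whether or not the queries succeed.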

### Data Extraction and Saving

For each webpage crawled, the tool extracts key information such as the title, author, category, and summary. This data is then saved in the PostgreSQL database.

### Usage

To start the web crawling and data extraction process:
1. Set up your PostgreSQL database credentials.
2. Modify the `start_url`, `MAX_PAGES`, and `NUM_WORKERS` variables as needed.
3. Execute the cells sequentially to define the necessary functions and initiate the crawling, data extraction, and saving process.
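
For step 1, a possible alternative to typing the password into the notebook is to read the credentials from environment variables. The sketch below is only an illustration: `load_db_settings` is a hypothetical helper, the variable names follow the `PGHOST`, `PGPORT`, `PGUSER`, `PGPASSWORD`, and `PGDATABASE` convention used by PostgreSQL client tools, and the defaults mirror the values configured in this notebook.

```python
import os

def load_db_settings(environ=os.environ):
    """Build connection settings from environment variables, with local defaults."""
    return {
        "host": environ.get("PGHOST", "localhost"),
        "port": environ.get("PGPORT", "5432"),
        "user": environ.get("PGUSER", "postgres"),
        "password": environ.get("PGPASSWORD", ""),  # No password stored in the notebook
        "database": environ.get("PGDATABASE", "section2_db"),
    }

# A plain dict stands in for the real environment in this demonstration
settings = load_db_settings({"PGPASSWORD": "example-secret"})
print({key: value for key, value in settings.items() if key != "password"})
```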


##### Imported libraries

In [None]:
# Standard library imports
import os  # For file system operations
import threading  # For thread-safe operations on shared variables

# Third-party imports
import psycopg2  # PostgreSQL database adapter
from psycopg2 import errors, pool  # Error classes (DuplicateTable, DuplicateDatabase) and connection pooling
from bs4 import BeautifulSoup  # For parsing HTML content
from joblib import Parallel, delayed  # For parallel processing
from selenium import webdriver  # For sending requests to URLs


##### Initialise variables

In [None]:
# Define and initialise variables

# Constants
MAX_PAGES = 100 # Maximum number of pages to crawl
NUM_WORKERS = 50 # Number of parallel workers (overrides the default of 5 in run_parallel_crawling); each worker launches its own Chrome instance

# Configuration Variables
start_url = "https://www.news24.com/" # Initial URL to start crawling from 
folder = 'crawled_webpages' # Folder to save crawled pages
data_batch = []
sql_scripts = {'Author':'Author_Table.sql', 
               'Category':'Category_Table.sql',
               'Article':'Article_Table.sql'
              } # SQL scripts to be called for table creation

url_start = ['https://www.news24.com/news24',
             'https://www.news24.com/fin24',
             'https://www.news24.com/sport',
             'https://www.news24.com/news24/investigations',
             'https://www.news24.com/news24/politics',
             'https://www.news24.com/news24/opinions',
             'https://www.news24.com/life',
             'https://www.news24.com/fin24/consumer-lookout',
             'https://www.news24.com/fin24/climate_future'
             ] # All possible beginnings of news24 webpages - according to website home page tabs
               # Ensure that only news24 articles are crawled to extract data conveniently 
               # (uniform data tag convention)

# Runtime Variables
urls_to_crawl = [start_url]  # Initialise a list of URLs to crawl
crawled_count = 0  # Counter to track the number of pages crawled
crawled_lock = threading.Lock()  # Lock for safely incrementing the counter

# Postgres server connection credentials
minconn = 1
maxconn = 5
host = "localhost" # Hostname of the PostgreSQL server
port = "5432" # Port number on which the PostgreSQL server is listening
user = "postgres" # Username to authenticate with the PostgreSQL server
password = "postgres1" # Password corresponding to Username
db_name = "section2_db" # Database name

##### Defined functions

In [None]:
def create_connection_pool(minconn, maxconn, host, port, user, password, database):
    """
    Create a connection pool for connecting to a specific PostgreSQL database.

    This function initializes a connection pool that can manage multiple connections 
    to the specified PostgreSQL database, allowing for efficient resource usage 
    and improved performance in applications that require frequent database access.

    Parameters:
        minconn (int): The minimum number of connections to maintain in the pool.
        maxconn (int): The maximum number of connections allowed in the pool.
        host (str): The hostname or IP address of the PostgreSQL server.
        port (int): The port number on which the PostgreSQL server is listening.
        user (str): The username used to authenticate with the PostgreSQL server.
        password (str): The password corresponding to the provided username.
        database (str): The name of the database to which the connections will be made.

    Returns:
        connection_pool: A connection pool object for the specified database if successful,
                         or None if an error occurs during pool creation. Errors are caught
                         and printed rather than raised.
    """
    try:
        # Create a connection pool for the specified database
        connection_pool = psycopg2.pool.SimpleConnectionPool(
            minconn,
            maxconn,
            host=host,
            port=port,
            user=user,
            password=password,
            database=database 
        )
        
        print("Connection pool to PostgreSQL server created successfully.")
        return connection_pool  # Return the connection pool object

    except Exception as e:
        print(f"Error creating connection pool: {e}")
        return None  # Return None if the connection pool creation fails


In [None]:
def add_data_to_tables(connection_pool, data_batch):
    """Add data from a batch of crawled webpages to the PostgreSQL database.

    Args:
        connection_pool: The psycopg2 connection pool to borrow a connection from.
        data_batch (list): A list of data rows, where each row contains the
            title, list of author names, date, list of category names, summary, and URL.

    Note:
        Errors during insertion are caught and printed, and the transaction is
        rolled back; they are not re-raised.
    """
    global db_name
    
    conn = None
    cur = None
    
    try:
        print(f"Saving collected data to the database '{db_name}'...")
        # Get a connection from the connection pool
        conn = connection_pool.getconn()
        cur = conn.cursor()

        for row in data_batch:
            title, authors, publication_date, categories, summary, url = row
            
            author_ids = []
            # Insert into author table
            for author in authors:
                # DO UPDATE (rewriting name with itself) makes RETURNING yield the id even
                # when the author already exists; DO NOTHING would return no row in that case
                cur.execute(
                    "INSERT INTO Author (name) VALUES (%s) "
                    "ON CONFLICT (name) DO UPDATE SET name = EXCLUDED.name RETURNING id",
                    (author,)
                )
                author_ids.append(cur.fetchone()[0])

            category_ids = []
            # Insert into category table
            for category in categories:
                # DO UPDATE (rewriting name with itself) makes RETURNING yield the id even
                # when the category already exists; DO NOTHING would return no row in that case
                cur.execute(
                    "INSERT INTO Category (name) VALUES (%s) "
                    "ON CONFLICT (name) DO UPDATE SET name = EXCLUDED.name RETURNING id",
                    (category,)
                )
                category_ids.append(cur.fetchone()[0])

            # Insert into article table
            # Assuming you want to link to the first author and category for simplicity
            # Adjust this logic as needed based on your requirements
            author_id = author_ids[0] if author_ids else None
            category_id = category_ids[0] if category_ids else None

            cur.execute(
                """
                INSERT INTO Article (title, author_id, publication_date, category_id, summary, url)
                VALUES (%s, %s, %s, %s, %s, %s)
                """,
                (title, author_id, publication_date, category_id, summary, url)
            )

        conn.commit()
        # Report success only after the commit has actually gone through
        print(f"Data saving operation completed! All records have been stored in the database '{db_name}'.")

    except Exception as e:
        # If any error occurs, roll back the transaction to avoid partial insertion
        print(f"Error inserting data: {e}")
        if conn:
            conn.rollback()

    finally:
        # Close the cursor
        if cur:
            cur.close()

        # Release the connection back to the pool
        if conn:
            connection_pool.putconn(conn)


In [None]:
def extract_webpage_data(soup):
    """Extract data from a webpage using BeautifulSoup.

    Args:
        soup (BeautifulSoup): A BeautifulSoup object representing the parsed HTML of the webpage.

    Returns:
        list: A list containing the extracted title, authors, date, categories, and summaries.
    """
    # Extract the title
    title_tag = soup.find('h1', class_='article__title')
    title = title_tag.get_text(strip=True) if title_tag else "N/A"

    # Extract the authors - as a list
    author_tag = soup.find('div', class_='article__author')
    authors_text = author_tag.get_text(strip=True) if author_tag else "N/A"

    # Initialize an empty list for authors
    authors = []

    if authors_text != "N/A":
        # Remove "written by" phrase if present
        authors_text = authors_text.replace('written by', '').strip()

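        # Split the authors on commas and on the standalone word "and" (surrounded by spaces,
        # so names such as "Sandra" stay intact), then drop empty entries
        authors = [author.strip() for author in authors_text.replace(' and ', ',').split(',') if author.strip()]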
    else:
        authors = ["N/A"]

    # Extract the date
    date_tag = soup.find('p', class_='article__date') 
    date = date_tag.get_text(strip=True) if date_tag else "N/A"

    # Extract the categories - as a list
    category_tags = soup.find_all('a', attrs={'data-tag': True})  
    categories = [category_tag.get_text() for category_tag in category_tags] if category_tags else ["N/A"]

    # Extract summaries
    summary_tags = soup.find_all('strong')  # Finds all <strong> tags (assuming each point is in <strong>)
    summaries = " ".join([tag.get_text(strip=True) for tag in summary_tags]) if summary_tags else "N/A"

    # List of data collected from the webpage
    data = [title, authors, date, categories, summaries]

    return data
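
The byline handling in `extract_webpage_data` can also be written with regular expressions, which makes the separators explicit. The sketch below is an illustration rather than the notebook's method: `split_byline` is a hypothetical helper, the sample byline is invented, and treating `&` as a separator is an assumption.

```python
import re

def split_byline(text):
    """Split an author byline into individual names."""
    # Drop a leading "written by", ignoring case
    text = re.sub(r"^\s*written by\s*", "", text, flags=re.IGNORECASE)
    # Split on commas, ampersands, or the standalone word "and"
    parts = re.split(r"\s*,\s*|\s*&\s*|\s+and\s+", text)
    return [part.strip() for part in parts if part.strip()]

print(split_byline("Written by Sandra Brand, Andile Khumalo and Jan Anders"))
```

Because "and" must be surrounded by whitespace to count as a separator, names that merely contain those letters are left untouched.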


In [None]:
def create_tables(connection_pool, sql_scripts):
    """Create tables in the PostgreSQL database using provided SQL scripts.

    Args:
        connection_pool: The psycopg2 connection pool to borrow a connection from.
        sql_scripts (dict): A dictionary where keys are table names and values are SQL script file names to execute for table creation.

    Prints:
        A success message for each script executed, or an error message if any errors occur during execution.
    """
    conn = None
    cur = None
    
    try:
        print("Creating database tables...")
        
        # Get a connection from the connection pool
        conn = connection_pool.getconn()
        
        # Enable autocommit so each script is committed on its own and a failed script does not abort the ones after it
        conn.autocommit = True

        # Create a cursor object
        cur = conn.cursor()
        
        # Iterate through the SQL scripts to create tables
        for table, script in sql_scripts.items():
            try:
                # Open the specified SQL script in read mode
                with open(script, 'r') as file:
                    sql_command = file.read()

                # Execute the SQL command
                cur.execute(sql_command)
                print(f"Table '{table}' created successfully.")
                
            except psycopg2.errors.DuplicateTable:
                print(f"Table '{table}' already exists.")
            except Exception as e:
                print(f"Error occurred while processing the script for table '{table}': {e}")

    except Exception as e:
        print(f"Error occurred while creating tables: {e}")
    finally:
        # Close the cursor if it was created
        if cur:
            cur.close()
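        # Restore transactional mode and release the connection back to the pool,
        # so later users of this pooled connection can still commit and roll back
        if conn:
            conn.autocommit = False
            connection_pool.putconn(conn)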
            

In [None]:
def create_database(host, port, user, password, database):
    """Create a PostgreSQL database.

    Args:
        host (str): The host where the PostgreSQL server is running.
        port (int): The port number for the PostgreSQL server.
        user (str): The PostgreSQL username.
        password (str): The PostgreSQL password.
        database (str): The name of the database to create.

    Prints:
        A success message if the database is created, or an error message if it already exists,
        if a connection to the server fails, or if another error occurs during database creation.
    """
  
    conn = None
    cur = None  # Defined up front so the finally block can test it even if connect() fails
    try:
        # Connect to PostgreSQL server without specifying a database
        conn = psycopg2.connect(
            host=host,
            port=port,
            user=user,
            password=password
        )
        conn.autocommit = True  # Enable autocommit mode to allow CREATE DATABASE

        # Create a cursor
        cur = conn.cursor()

        # Create the database
        cur.execute(f"CREATE DATABASE {database};")
        print(f"Database '{database}' created successfully.")
        
    except psycopg2.errors.DuplicateDatabase:
        print(f"Database '{database}' already exists.")
    except Exception as e:
        print(f"Error occurred while creating the database: {e}")
    finally:
        # Close cursor and connection
        if cur:
            cur.close()
        if conn:
            conn.close()  # Close connection to the server

In [None]:
def crawl(url, folder, max_pages):
    """Crawl a webpage to download HTML content and extract data.

    This function checks whether the maximum number of pages to crawl
    has been reached. If not, it retrieves the HTML content from the
    specified URL, saves it to a file, extracts additional links for
    further crawling, and gathers webpage data.

    Args:
        url (str): The URL of the webpage to crawl.
        folder (str): The directory where HTML files will be saved.
        max_pages (int): The maximum number of pages to crawl.
    """
    
    global crawled_count  # Reassigned below, so it must be declared global

    # Reserve a page number under the lock so that the limit check and the
    # increment happen together and workers cannot overshoot max_pages.
    # A page that later fails to load still uses up its reserved number.
    with crawled_lock:
        if crawled_count >= max_pages:
            return
        crawled_count += 1
        page_number = crawled_count

    driver = None  # Defined before the try block so the finally clause can test it
    try:
        # Each thread creates its own WebDriver instance
        driver = webdriver.Chrome()
        driver.get(url)
        page_source = driver.page_source

        # Save the HTML content to a file
        filename = os.path.join(folder, f"webpage_{page_number}.html")
        with open(filename, 'w', encoding='utf-8') as file:
            file.write(page_source)

        # Parse HTML to find links for further crawling
        soup = BeautifulSoup(page_source, 'html.parser')
        links = soup.find_all('a', href=True)  # Find all anchor tags with href attributes

        # Extract article data from webpage
        data = extract_webpage_data(soup)  # Call extract_webpage_data
        data.append(url)  # Append webpage URL to data list
        data_batch.append(data)  # Append data list to data_batch list

        # Add new links to the list of URLs to crawl
        with crawled_lock:
            for link in links:
                new_url = link['href']
                if new_url.startswith(tuple(url_start)):  # Keep only URLs that begin with one of the allowed prefixes
                    urls_to_crawl.append(new_url)  # Add the link to the list of URLs

    except Exception as e:
        print(f"Error crawling {url}: {e}")

    finally:
        if driver:
            driver.quit()  # Close the browser after each crawl
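
The `crawl` function filters raw `href` values with `startswith`, which skips relative links and lets the same address be queued more than once. A possible refinement, sketched here under stated assumptions, resolves each link against the page URL, strips the fragment, and tracks which URLs have been seen. `normalise_links`, `ALLOWED_PREFIXES`, and the sample links are hypothetical and not part of the notebook.

```python
from urllib.parse import urldefrag, urljoin

ALLOWED_PREFIXES = (
    "https://www.news24.com/news24",
    "https://www.news24.com/sport",
)  # Shortened copy of url_start for the example

def normalise_links(page_url, hrefs, seen):
    """Resolve hrefs against page_url and return new, allowed URLs."""
    new_urls = []
    for href in hrefs:
        # Build an absolute URL and drop any "#fragment" part
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute.startswith(ALLOWED_PREFIXES) and absolute not in seen:
            seen.add(absolute)
            new_urls.append(absolute)
    return new_urls

seen = set()
hrefs = [
    "/sport/rugby",
    "/sport/rugby#comments",
    "https://www.news24.com/news24/politics",
    "https://example.com/",
]
print(normalise_links("https://www.news24.com/", hrefs, seen))
```

In a threaded crawler the `seen` set would need to be updated under the same lock as the URL list.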


In [None]:
def worker(folder, MAX_PAGES):
    """Process URLs from the crawl queue in a worker thread.

    This function retrieves the next URL from the shared `urls_to_crawl` list
    and calls the `crawl` function to process it. It runs until either there
    are no more URLs or the maximum number of pages has been crawled.

    Args:
        folder (str): The folder where HTML files will be saved.
        MAX_PAGES (int): The maximum number of pages to crawl.

    Returns:
        None
    """
    while True:
        with crawled_lock:
            if len(urls_to_crawl) == 0 or crawled_count >= MAX_PAGES:
                break
            url = urls_to_crawl.pop(0)  # Get the next URL from the list

        crawl(url, folder, MAX_PAGES)  # Crawl the page and download content
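
In the `worker` function above, a worker stops as soon as it finds `urls_to_crawl` empty. Since the list starts with a single URL, most workers may exit before the first page has contributed any links. The sketch below shows one possible alternative that swaps the shared list for `queue.Queue`, whose `join()` waits until every queued item has been processed. It uses plain `threading.Thread` objects instead of joblib, and `crawl_all`, `fetch_links`, and `GRAPH` are hypothetical names for illustration.

```python
import queue
import threading

def crawl_all(start, fetch_links, max_pages, num_workers=4):
    """Visit pages starting from `start` using worker threads; return visited pages."""
    tasks = queue.Queue()
    lock = threading.Lock()
    seen = {start}      # Pages already queued
    visited = []        # Pages actually processed
    tasks.put(start)

    def worker():
        while True:
            page = tasks.get()              # Blocks until work arrives
            if page is None:                # Sentinel: time to stop
                tasks.task_done()
                return
            try:
                with lock:
                    allowed = len(visited) < max_pages
                    if allowed:
                        visited.append(page)
                if allowed:
                    for link in fetch_links(page):
                        with lock:
                            if link not in seen:
                                seen.add(link)
                                tasks.put(link)
            finally:
                tasks.task_done()           # Marks this page as fully handled

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    tasks.join()                            # Returns once every queued page is processed
    for _ in threads:
        tasks.put(None)                     # One sentinel per worker
    for thread in threads:
        thread.join()
    return visited

# Hypothetical link graph standing in for real pages
GRAPH = {"home": ["news", "sport"], "news": ["politics"], "sport": ["news"], "politics": []}
pages = crawl_all("home", lambda page: GRAPH.get(page, []), max_pages=10)
print(sorted(pages))
```

Idle workers block on `tasks.get()` instead of exiting, so they pick up links as soon as another worker discovers them. The visiting order depends on thread scheduling, which is why the example sorts the result before printing.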


In [None]:
def run_parallel_crawling(folder, MAX_PAGES, NUM_WORKERS=5):
    """Execute parallel crawling using multiple worker threads.

    This function creates a specified folder for saving crawled pages and
    initiates parallel crawling by launching multiple worker threads.

    Args:
        folder (str): The folder to save crawled pages.
        MAX_PAGES (int): The maximum number of pages to crawl.
        NUM_WORKERS (int, optional): The number of parallel workers. Default is 5.

    Returns:
        None
    """
    global start_url # Call global variable
    
    # Create folder for saving crawled pages
    os.makedirs(folder, exist_ok=True)
    
    # Use joblib's Parallel to run multiple threads
    print(f"Starting the web scraping process for url: '{start_url}'. Collecting data... Please wait.")
    Parallel(n_jobs = NUM_WORKERS, backend = "threading")(
        delayed(worker)(folder, MAX_PAGES) for _ in range(NUM_WORKERS)
    )

    

In [None]:
def main():
    """Main function to create the database, initialize the connection pool, create the
    tables, run the web crawler, and add collected data to the database.

    This function orchestrates the entire workflow by creating the database if needed,
    initializing the connection pool, creating the necessary tables, performing the
    web crawling operation, and finally inserting the gathered data into the
    appropriate tables in the PostgreSQL database.
    """
    
    global folder, MAX_PAGES, NUM_WORKERS

    # Call the create_database function
    create_database(host, port, user, password, db_name)
    
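    # Initialise connection pool to Postgres server
    connection_pool = create_connection_pool(minconn, maxconn, host, port, user, password, db_name)
    if connection_pool is None:
        # create_connection_pool returns None on failure; nothing below can run without it
        print("Could not create the connection pool. Aborting.")
        return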
    
    # Call the create_tables function
    create_tables(connection_pool, sql_scripts)
 
    # Start web crawling
    run_parallel_crawling(folder, MAX_PAGES, NUM_WORKERS)
    print("Web scraping operation completed successfully. Data has been gathered.")
    
    # Add the batch of collected data from webpages to the tables in the database
    add_data_to_tables(connection_pool, data_batch)

    # Close the connection pool after all operations are complete
    connection_pool.closeall()
    print("All operations completed successfully. The program has finished running.")


##### Execute web crawling and data extraction program

In [None]:
if __name__ == "__main__":
    main()