# Crawler prerequisites

## Database initialization

Before running the web crawler, start a Docker container with a PostgreSQL database, which the crawler uses as its storage.

To start the container, run the following command:
```
docker run --name postgresql-wier -e POSTGRES_PASSWORD=SecretPassword -e POSTGRES_USER=user -v $PWD/pgdata:/var/lib/postgresql/data -v $PWD/init-scripts:/docker-entrypoint-initdb.d -p 5432:5432 -d postgres:12.2
```

Then run the `database.sql` script to initialize the `crawldb` database with all the necessary tables and relations.

## Seminar imports

These are the imports needed for Seminar 1.

In [80]:
# Parsing HTML
from bs4 import BeautifulSoup

# Regular expressions
import re

# Visualization library
# import later

# Database, concurrent execution
import concurrent.futures
import threading
import psycopg2

lock = threading.Lock()

# Time
from datetime import datetime

# For getting the response code
import requests

## Parameters

These parameters determine how the web crawler runs.
* `number_of_workers`: the number of concurrent workers running in parallel, which speeds up page retrieval

In [81]:
number_of_workers = 4

## URL seeds
These pages will be used as a starting point for our crawler.

In [82]:
# Used as starting pages for the crawler
web_page_seeds = [
    "http://gov.si",
    "http://evem.gov.si",
    "http://e-uprava.gov.si",
    "http://e-prostor.gov.si"
]

# Temporary starting page
WEB_PAGE_ADDRESS = "http://evem.gov.si"

# Also put these pages into the frontier
frontier = web_page_seeds.copy()
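
Several workers will later pull from the frontier at the same time, and a plain list offers no protection against two of them taking the same URL. The following is a minimal sketch of a lock-protected frontier. The `Frontier` class, its method names and the `frontier_sketch` variable are hypothetical and not part of the crawler above.

```python
import threading
from collections import deque


class Frontier:
    """FIFO queue of URLs that is safe to share between worker threads."""

    def __init__(self, seeds):
        self._lock = threading.Lock()
        self._queue = deque()
        self._seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        """Queue a URL unless it was queued before; return True if it was added."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            self._queue.append(url)
            return True

    def next(self):
        """Return the oldest queued URL, or None when the frontier is empty."""
        with self._lock:
            return self._queue.popleft() if self._queue else None


# The duplicate seed is queued only once
frontier_sketch = Frontier(["http://gov.si", "http://evem.gov.si", "http://gov.si"])
first_url = frontier_sketch.next()
```

A `deque` makes taking from the front cheap, whereas `list.pop(0)` shifts every remaining element.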

# Web crawler

Below is the structure of the web crawler, as well as the code to run it.

### Setup

Change `WEB_DRIVER_LOCATION` according to your setup.

In [83]:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# !!! CHANGE THIS DEPENDING ON YOUR MACHINE !!!
WEB_DRIVER_LOCATION = "C:/ManualInstalls/ChromeDriver/chromedriver.exe"
TIMEOUT = 5

chrome_options = Options()
# If you comment out the following line, a browser window will show ...
#chrome_options.add_argument("--headless")
# Adding a specific user agent
chrome_options.add_argument("user-agent=fri-wier-kj_lk_tm")

## URL frontier

The frontier is the list of URLs waiting to be crawled; the crawler keeps running while it is non-empty.

In [85]:
def run_web_crawler():
    while True:
        # Take a URL from the front of the frontier; the lock keeps concurrent workers from racing on the list
        with lock:
            if not frontier:
                break
            url = frontier.pop(0)
        
        
        
        # ----------------
        # CHECK DUPLICATES
        # ----------------
        
        # Check if it's a duplicate
        # If it's a duplicate, skip to the next element, since we already did work on this url
        if check_page_duplicates(url):
            continue
            
        # Set site access time and convert it to an SQL-appropriate format
        accessed_time = datetime.now().isoformat()
        
        
        
        
        
        # --------------------
        # PAGE INFO EXTRACTION
        # --------------------
        
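        try:
            # HEAD is enough for the status code; the timeout keeps a dead host from blocking the worker
            http_status_code = requests.head(url, timeout=TIMEOUT).status_code
        except requests.RequestException:
            # Determine what to do here; currently continue with the next item
            continue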
        
        # Get HTML from the url
        html = download_and_render_page(url)
        
        # Parse the content of the webpage, to extract links, images etc.
        urls, images_urls = extract_data(html)
        
        # Add the extracted URLs to the frontier without overwriting `url`, which still names the current page
        with lock:
            frontier.extend(urls)
        
        
        
        
        
        # ----
        # SITE
        # ----
        
        # Check if the site is already in the database; if it is, reuse its id, otherwise create it
        domain = extract_domain(url)
        site_id = find_site(domain)
        if site_id == -1:
            robots_content = get_robots_content(domain)
            sitemap = get_sitemap_content(domain)
            
            site_id = insert_site(domain, robots_content, sitemap)
        
        
        
        
        
        # ----
        # PAGE
        # ----

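        # Every rendered page is stored as HTML for now; "HTML" is assumed to be a valid code in crawldb.page_type
        page_type_code = "HTML"
        page_id = insert_page(site_id, page_type_code, url, html, http_status_code, accessed_time)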
        
        
        
        
        
        # ----
        # LINK
        # ----
        
        # Insert links into the database
        # TODO - figure out how to store a link to a page that isn't in the database yet; find_page returns -1 for it so far
        for linked_url in urls:
            insert_link(page_id, find_page(linked_url))
        
        
        
        
        
        # ---------
        # PAGE DATA
        # ---------
        
        # TODO finish; the data type code and the payload below are placeholders
        data_type_code = "TEMP"
        page_data = "HTML"
        insert_page_data(page_id, data_type_code, page_data)
        
        
        
        
        
        # -----
        # IMAGE
        # -----
        
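        # Insert images into the database; TODO finish the image data extraction
        # The content type and data below are placeholders until that is done
        image_data = "TEMP"
        content_type = "TEMP"
        for image_url in images_urls:
            filename = image_url.split("/")[-1]
            insert_image(page_id, filename, content_type, image_data, accessed_time)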
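
The image step above still needs a filename and a content type for each image. One way to derive both from the image URL alone is sketched below, using only the standard library. The `describe_image` helper and the example URL are hypothetical; a more reliable content type would come from the `Content-Type` header of the actual image response.

```python
import mimetypes
import posixpath
from urllib.parse import urlparse


def describe_image(image_url):
    """Return (filename, content_type) guessed from the image URL alone."""
    # Only the path matters; the query string and fragment are ignored
    path = urlparse(image_url).path
    filename = posixpath.basename(path)
    # guess_type looks only at the file extension and yields None for unknown ones
    content_type, _ = mimetypes.guess_type(filename)
    return filename, content_type


example = describe_image("http://evem.gov.si/images/logo.png?v=2")
```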

## HTTP downloader and renderer

Retrieves and renders a web page.

### Downloading and rendering a page

In [86]:
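def download_and_render_page(url):
    # chrome_options (including the user agent) and TIMEOUT are configured once in the Setup cell
    driver = webdriver.Chrome(WEB_DRIVER_LOCATION, options=chrome_options)
    try:
        driver.get(url)
        
        # Give client-side scripts time to finish rendering
        time.sleep(TIMEOUT)
        
        html = driver.page_source
    finally:
        # quit() also ends the chromedriver process, even if loading the page failed
        driver.quit()
    
    return html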

## Data extractor

In [88]:
def extract_data(html):
    soup = BeautifulSoup(html, 'html.parser')
    
    urls_array = []
    images_array = []
    
    # href=True skips tags without an href attribute, which would otherwise raise a KeyError
    for link in soup.find_all('link', href=True):
        urls_array.append(link['href'])

    for link in soup.find_all('a', href=True):
        urls_array.append(link['href'])
    
    # TODO do the extraction for the elements with an onclick attribute as well
    
    # Images carry their address in the src attribute, not href
    for image in soup.find_all('img', src=True):
        images_array.append(image['src'])
    
    return urls_array, images_array
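
The `href` values returned by `extract_data` are often relative paths or non-web schemes such as `mailto:`, so they cannot go into the frontier as they are. Here is a minimal sketch of resolving them against the URL of the page they came from. The `resolve_links` helper and the example paths are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(page_url, hrefs):
    """Turn raw href values into absolute http(s) URLs relative to page_url."""
    resolved = []
    for href in hrefs:
        absolute = urljoin(page_url, href.strip())
        # Drop mailto:, javascript:, tel: and similar non-crawlable schemes
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


links = resolve_links(
    "http://evem.gov.si/a/b.html",
    ["/info/", "c.html", "mailto:info@gov.si", "https://e-uprava.gov.si/"],
)
```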

### Site data extraction

In [87]:
# Extract domain name from the link
def extract_domain(link):
    split_link = link.split("/")
    return split_link[0] + "//" + split_link[2]



# Get robots.txt content from the given domain
def get_robots_content(domain):
    # The user agent is already set on chrome_options in the Setup cell
    driver = webdriver.Chrome(WEB_DRIVER_LOCATION, options=chrome_options)
    try:
        driver.get(domain + "/robots.txt")
        time.sleep(TIMEOUT)
        html = driver.page_source
    finally:
        driver.quit()

    return html


def get_sitemap_content(domain):
    # TODO - fix this, as the sitemap doesn't necessarily reside at this address
    driver = webdriver.Chrome(WEB_DRIVER_LOCATION, options=chrome_options)
    try:
        driver.get(domain + "/sitemap.xml")
        time.sleep(TIMEOUT)
        html = driver.page_source
    finally:
        driver.quit()
    
    # Finding all the links in the page; href=True skips tags without that attribute
    soup = BeautifulSoup(html, 'html.parser')
    content = []
    
    for link in soup.find_all('link', href=True):
        content.append(link['href'])

    for link in soup.find_all('a', href=True):
        content.append(link['href'])

    return ", ".join(content)
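
The sitemap does not have to live at `/sitemap.xml`; a site can announce its location in `robots.txt` with a `Sitemap:` line. The sketch below reads both the crawl rules and the sitemap locations with the standard library's `urllib.robotparser`. It assumes `robots_text` is the plain text of the file (a browser's `page_source` may wrap it in HTML), that Python 3.8 or newer is available for `site_maps()`, and the `example.gov.si` host and `parse_robots` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "fri-wier-kj_lk_tm"


def parse_robots(robots_text):
    """Build a parser from robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


sample_robots = """User-agent: *
Disallow: /private/

Sitemap: http://example.gov.si/sitemap-index.xml
"""

robots = parse_robots(sample_robots)
allowed = robots.can_fetch(USER_AGENT, "http://example.gov.si/public/page.html")
blocked = robots.can_fetch(USER_AGENT, "http://example.gov.si/private/page.html")
# site_maps() returns None when robots.txt lists no sitemap
sitemaps = robots.site_maps() or []
```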

## Duplicate detector

Detects already parsed pages.

In [89]:
def check_page_duplicates(url):
    # find_page returns -1 when the URL is not in the database yet
    return find_page(url) != -1
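
An exact string comparison treats `HTTP://GOV.SI` and `http://gov.si/#top` as different pages. A minimal sketch of normalizing a URL before the duplicate lookup follows. The `canonicalize_url` helper is hypothetical, and the chosen rules (lowercase scheme and host, drop default ports, fragments and user info) are one reasonable set of assumptions rather than a complete canonicalization.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize_url(url):
    """Normalize a URL so that trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = f"{host}:{parts.port}"
    # An empty path and "/" name the same resource
    path = parts.path or "/"
    # The fragment never reaches the server, so it is dropped
    return urlunsplit((scheme, host, path, parts.query, ""))


canonical = canonicalize_url("HTTP://GOV.SI")
```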

## Datastore

Stores the data and additional metadata used by the crawler. This section holds the logic for accessing the database and performing actions such as reading a table's content or inserting into it.
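
Page content and `robots.txt` files routinely contain quote characters, which break SQL built by string concatenation. The sketch below shows the find-or-insert pattern with parameterized queries instead. To stay runnable without the Docker container it uses an in-memory `sqlite3` database as a stand-in: the simplified `site` table, the helper names and the `?` placeholder are assumptions of this sketch, and with `psycopg2` the placeholder would be `%s`.

```python
import sqlite3


def lookup_site_id(conn, domain):
    """Return the id of the site row for domain, or -1 if there is none."""
    row = conn.execute("SELECT id FROM site WHERE domain = ?", (domain,)).fetchone()
    return row[0] if row is not None else -1


def get_or_create_site(conn, domain, robots_content, sitemap_content):
    """Reuse the existing row for domain or insert a new one; return its id."""
    site_id = lookup_site_id(conn, domain)
    if site_id != -1:
        return site_id
    # The driver binds the values, so quotes inside them cannot alter the statement
    cur = conn.execute(
        "INSERT INTO site (domain, robots_content, sitemap_content) VALUES (?, ?, ?)",
        (domain, robots_content, sitemap_content),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE site (id INTEGER PRIMARY KEY, domain TEXT, "
    "robots_content TEXT, sitemap_content TEXT)"
)
first_id = get_or_create_site(conn, "http://gov.si", "Disallow: /it's-private/", "")
second_id = get_or_create_site(conn, "http://gov.si", "", "")
```

Note that the lookup calls `fetchone()` once and checks the result for `None`; calling `fetchall()` first would consume the rows and leave nothing for a later `fetchone()`.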

### Sites

In [90]:
def insert_site(domain, robots_content, sitemap_content):
    conn = psycopg2.connect(host="localhost", user="user", password="SecretPassword")
    conn.autocommit = True
    
    cur = conn.cursor()
    # Parameterized query: psycopg2 quotes the values, so quotes in robots.txt or the sitemap cannot break the SQL
    cur.execute("INSERT INTO crawldb.site (domain, robots_content, sitemap_content) "
                "VALUES (%s, %s, %s) RETURNING id;",
                (domain, robots_content, sitemap_content))
    
    site_id = cur.fetchone()[0]
    
    cur.close()
    conn.close()
    return site_id





# Look up an existing site so the same domain isn't stored twice (the instructions don't mention this explicitly)
def find_site(domain):
    conn = psycopg2.connect(host="localhost", user="user", password="SecretPassword")
    conn.autocommit = True
    
    cur = conn.cursor()
    cur.execute("SELECT id FROM crawldb.site WHERE domain = %s;", (domain,))
    
    # fetchone() returns None when the site isn't present in the table yet
    row = cur.fetchone()
    site_id = row[0] if row is not None else -1
    
    cur.close()
    conn.close()
    
    return site_id

### Image

In [91]:
def insert_image(page_id, filename, content_type, data, accessed_time):
    conn = psycopg2.connect(host="localhost", user="user", password="SecretPassword")
    conn.autocommit = True
    
    cur = conn.cursor()
    # page_id is passed as a plain value; the foreign key is enforced by the table definition
    cur.execute("INSERT INTO crawldb.image (page_id, filename, content_type, data, accessed_time) "
                "VALUES (%s, %s, %s, %s, %s) RETURNING id;",
                (page_id, filename, content_type, data, accessed_time))
    
    image_id = cur.fetchone()[0]
    
    cur.close()
    conn.close()
    return image_id






def find_image(page_id, filename):
    conn = psycopg2.connect(host="localhost", user="user", password="SecretPassword")
    conn.autocommit = True
    
    cur = conn.cursor()
    cur.execute("SELECT id FROM crawldb.image WHERE page_id = %s AND filename = %s;",
                (page_id, filename))
    
    # fetchone() returns None when the image isn't present in the table yet
    row = cur.fetchone()
    image_id = row[0] if row is not None else -1
    
    cur.close()
    conn.close()
    
    return image_id

### Page

In [92]:
def insert_page(site_id, page_type_code, url, html_content, http_status_code, accessed_time):
    conn = psycopg2.connect(host="localhost", user="user", password="SecretPassword")
    conn.autocommit = True
    
    cur = conn.cursor()
    # Bind all six columns as parameters and return the new row's id, which the caller needs for links and images
    cur.execute("INSERT INTO crawldb.page (site_id, page_type_code, url, html_content, http_status_code, accessed_time) "
                "VALUES (%s, %s, %s, %s, %s, %s) RETURNING id;",
                (site_id, page_type_code, url, html_content, http_status_code, accessed_time))
    
    page_id = cur.fetchone()[0]
    
    cur.close()
    conn.close()
    return page_id





def find_page(url):
    conn = psycopg2.connect(host="localhost", user="user", password="SecretPassword")
    conn.autocommit = True
    
    cur = conn.cursor()
    cur.execute("SELECT id FROM crawldb.page WHERE url = %s;", (url,))
    
    # fetchone() returns None when the page isn't present in the table yet
    row = cur.fetchone()
    page_id = row[0] if row is not None else -1
    
    cur.close()
    conn.close()
    
    return page_id

### Page data

In [93]:
def insert_page_data(page_id, data_type_code, data):
    conn = psycopg2.connect(host="localhost", user="user", password="SecretPassword")
    conn.autocommit = True
    
    cur = conn.cursor()
    # Parameters are bound by psycopg2, so an integer page_id needs no string conversion
    cur.execute("INSERT INTO crawldb.page_data (page_id, data_type_code, data) VALUES (%s, %s, %s);",
                (page_id, data_type_code, data))
    
    cur.close()
    conn.close()
    return True

### Data type

In [94]:
def insert_data_type(code):
    conn = psycopg2.connect(host="localhost", user="user", password="SecretPassword")
    conn.autocommit = True
    
    cur = conn.cursor()
    # The table name crawldb.data_type is assumed by analogy with crawldb.page_type
    cur.execute("INSERT INTO crawldb.data_type (code) VALUES (%s);", (code,))
    
    cur.close()
    conn.close()
    return

### Link

In [95]:
def insert_link(from_page, to_page):
    conn = psycopg2.connect(host="localhost", user="user", password="SecretPassword")
    conn.autocommit = True
    
    cur = conn.cursor()
    # Column names are assumed to match the function parameters; both values are page ids
    cur.execute("INSERT INTO crawldb.link (from_page, to_page) VALUES (%s, %s);",
                (from_page, to_page))
    
    cur.close()
    conn.close()
    return

### Page type

In [96]:
def insert_page_type(code):
    conn = psycopg2.connect(host="localhost", user="user", password="SecretPassword")
    conn.autocommit = True
    
    cur = conn.cursor()
    cur.execute("INSERT INTO crawldb.page_type (code) VALUES (%s);", (code,))
    
    cur.close()
    conn.close()
    return

## Execution

Concurrent execution of the web crawler using multiple workers, as given by the `number_of_workers` variable. Stop it by interrupting the running cell.

In [97]:
# Workers
with concurrent.futures.ThreadPoolExecutor(max_workers=number_of_workers) as executor:
    print("\n ... executing workers ...\n")
    for _ in range(number_of_workers):
        executor.submit(run_web_crawler)


 ... executing workers ...



NameError: name 'initialize_web_crawler' is not defined
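
`executor.submit` stores any exception a worker raises inside the returned future instead of printing it, so a crashing worker can go unnoticed. A minimal sketch of collecting each worker's outcome follows. The `run_workers` helper and the deliberately failing `broken_worker` are hypothetical and stand in for `run_web_crawler`.

```python
import concurrent.futures


def run_workers(worker, number_of_workers):
    """Run worker() in number_of_workers threads and report how each one ended."""
    outcomes = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=number_of_workers) as executor:
        futures = [executor.submit(worker) for _ in range(number_of_workers)]
        for future in concurrent.futures.as_completed(futures):
            try:
                outcomes.append(("ok", future.result()))
            except Exception as error:
                # result() re-raises whatever the worker raised inside its thread
                outcomes.append(("failed", repr(error)))
    return outcomes


def broken_worker():
    raise RuntimeError("frontier unavailable")


outcomes = run_workers(broken_worker, 2)
```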

# Other

In [None]:
# Potentially insert possibly otherwise useful code (like debugging code) later.

This is here just because I hate how `.ipynb` creates an empty cell after the last run one.