# INFO 4271 - Exercise 1 - Web Crawling

Issued: April 16, 2024

Due: April 22, 2024

Please submit this filled sheet via Ilias by the due date.

---

# 1. Duplicate Detection
When crawling large numbers of Web pages we are likely to encounter a considerable number of duplicate documents. To not flood our index with replicas of the same documents, we need a duplicate detection scheme.

a) Using python's built-in hash() function, process the following documents in order of appearance and flag up any exact duplicates.

- **D1** "This is just some document"
- **D2** "This is another piece of text"
- **D3** "This is another piece of text"
- **D4** "This is just some documents"
- **D5** "Totally different stuff"

In [37]:
#Check a single document against an existing collection of previsouly seen documents for exact duplicates.
def check_exct(doc, docs):
    docHash = hash(doc[1])
    for x in docs:
        if hash(x[1]) == docHash:
            return True
    return False

b) Going beyond exact duplicates, we want to also identify any near-duplicates that are very similar but not identical to previously seen content. Implement the SimHash method discussed in class and again process the five documents, this time flagging up exact and near duplicates.

In [34]:
import hashlib
NUM_HASH_BITS = 128

#Check a single document against an existing collection of previsouly seen documents for near duplicates
def check_simhash(doc, docs):
    checkFingerprint = computeFingerprint(doc)
    for x in docs:
        difference = calculateDifference(checkFingerprint, computeFingerprint(x))
        if difference < 7:
            return True
    return False

def computeFingerprint(doc):
    weightedWords = weight(doc[1])
    addedColumns = [0] * NUM_HASH_BITS
    for word in weightedWords:
        hashFnk = hashlib.md5()
        hashFnk.update(word.encode('utf-8'))
        hash = hashFnk.hexdigest()
        binaryWord = str(int(hash, 16))
        for y in range(len(binaryWord)):
            if binaryWord[y] == "0":
                addedColumns[y] = -1 * weightedWords[word]
            else:
                addedColumns[y] = int(binaryWord[y]) * weightedWords[word]
    fingerprint = ""
    for x in addedColumns:
        if x > 0:
            fingerprint += "1"
        else:
            fingerprint += "0"
    return fingerprint

def calculateDifference(hash1, hash2):
    return sum(bit1 != bit2 for bit1, bit2 in zip(hash1, hash2))

def weight(doc):
    weightedWords = {}
    arr = doc.split()
    for word in arr:
        if word in weightedWords:
            weightedWords[word] = weightedWords[word] + 1
        else:
            weightedWords[word] = 1
    return weightedWords


In [39]:
crawl = [['D1', 'This is just some document'], ['D2', 'This is another piece of text'], ['D3', 'This is another piece of text'], ['D4', 'This is just some documents'], ['D5', 'Totally different stuff']]

#Process raw crawled website content
def process(crawl):
    docs = []
    for doc in crawl:
        if check_simhash(doc, docs): #Can be exchanged for check_simhash()
            print('DUPLICATE: '+doc[0])
        else:
            docs.append(doc)

process(crawl)

DUPLICATE: D3
DUPLICATE: D4


# 2. Focused Search Engines
Suppose you were to build a COVID-19 Web search engine for which you want to collect and eventually serve only COVID-19 information. The general web crawling process follows this scheme:

1. Create a seed set of known URLs (a.k.a the frontier)
2. Pull a URL from the frontier and visit it
3. Save the page content for our search engine (indexing)
4. Once on the page, note down all URLs linked there
5. Put all encountered URLs in the queue
6. Repeat from Step 2 until the queue is empty

In this particular setting, how should the generic step-by-step crawling process be modified/extended? Discuss all relevant considerations:

The crawling process need to be extended by the following steps:
- In step 1 we should put URLs into the frontier, that are related to COVID-19, starting with the most relevant ones.
- In step 4 the linked URLs should only be put down if the content of the page or the URL itself is indicating that the URLs are relevant to COVID-19 in any way. This could be done by checking for existing COVID-19 related keywords.
- In step 5 the encountered URLs should be put in the queue by prioritizing URLs based on 
    - Relevancy to COVID-19
    - Freshness of the website (Prioritize if there was recent change)
    - Quality of the source (to get more reliable information)
    while we also want to filter duplicates/near duplicates and don't put them in the queue.
- In step 3 we could also add another layer of checking if the content we are saving for indexing is really COVID-19 related and discard the page URL otherwise.
- We should also recrawl sites that are frequently changing like news sites or governmental information/warning sites more often.