# Information Retrieval
## Lesson 1: Comprehensive Guide to Web Crawling and Document Processing



Information Retrieval (IR) is the process of obtaining relevant information from a large repository, such as a database or the internet, in response to a specific query. The primary goal of IR is to provide users with accurate and useful results based on their search criteria. This field encompasses a range of activities, from simple keyword searches to complex algorithms that understand and interpret user intent, context, and semantics.

### Key Components of Information Retrieval:

1. **Indexing**: The process of organizing data to enable efficient retrieval. This often involves creating an index that maps keywords to their locations in documents.

2. **Query Processing**: Interpreting and refining user queries to match relevant documents. This may include techniques such as stemming, stop-word removal, and query expansion.

3. **Ranking**: Ordering the search results based on their relevance to the query. Common ranking algorithms include TF-IDF, PageRank, and various machine learning models.

4. **Evaluation**: Measuring the effectiveness of the retrieval system using metrics such as precision, recall, and F1-score.
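
The indexing component above can be illustrated with a toy inverted index. This is only a minimal sketch: the function name `build_inverted_index` and the two-document collection are made up for illustration, and splitting on whitespace stands in for real tokenization.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical toy collection, used for illustration only
docs = {
    1: "web crawling and indexing",
    2: "ranking web documents",
}
index = build_inverted_index(docs)
print(index["web"])      # [1, 2]
print(index["ranking"])  # [2]
```

Looking up a query word then becomes a dictionary access instead of a scan over all documents.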

Information Retrieval is a critical component of many applications, including search engines, digital libraries, and recommendation systems, helping users find the information they need quickly and accurately.
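
The evaluation metrics listed among the key components can be computed from two sets of document ids. The sketch below assumes binary relevance; the function name `precision_recall_f1` and the sample ids are illustrative only.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for one query with binary relevance."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical result list and relevance judgments
p, r, f1 = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, round(r, 3), round(f1, 3))  # 0.5 0.667 0.571
```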

In this lesson, we will mainly deal with web crawling and document processing.

# 1. Crawler

A crawler is a computer program that automatically searches documents on the Web. Crawlers are primarily programmed for repetitive actions so that browsing is automated. Search engines use crawlers most frequently to browse the internet and build an index.


### 1.0.1. How to parse a page?
Parsing a webpage involves retrieving and analyzing its HTML content to extract specific data or information. This process typically includes:

1. **Fetching the Page**: Using a tool or library to download the HTML content of a webpage.
2. **Processing the HTML**: Using a library such as BeautifulSoup (Python) to navigate and manipulate the HTML structure.
3. **Extracting Data**: Identifying and isolating the required elements, such as text, images, links, or other data, based on tags, classes, or attributes.


If you build a crawler, you can follow one of two approaches:
1. Search for URLs anywhere in the page, treating it as plain text.
2. Search for URLs only where they are expected to appear: `<a href=...`, `<img src=...`, `<iframe src=...`, and so on.

To follow the first approach, you can rely on a good regular expression, [like this one](https://stackoverflow.com/a/3809435).

To follow the second approach, read one of these: [short answer](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) or [exhaustive explanation](https://hackersandslackers.com/scraping-urls-with-beautifulsoup/).
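
Both approaches can be compared side by side using only the standard library. This is a rough sketch: `URL_RE` is a deliberately simple pattern (much weaker than the regular expression linked above), and `LinkCollector` is a hypothetical helper built on `html.parser.HTMLParser` instead of BeautifulSoup.

```python
import re
from html.parser import HTMLParser

# Approach 1: treat the page as plain text and look for absolute URLs
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")


# Approach 2: look only at attributes where URLs are expected
class LinkCollector(HTMLParser):
    """Collect URL-bearing attributes from a few known tags."""

    URL_ATTRS = {"a": "href", "img": "src", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


html = '<p>See http://example.com/plain</p><a href="/about">About</a><img src="logo.png">'

print(URL_RE.findall(html))  # ['http://example.com/plain']

collector = LinkCollector()
collector.feed(html)
print(collector.links)       # ['/about', 'logo.png']
```

Note how the two approaches find different things here: the text search sees only absolute URLs, while the tag-based search also finds relative links that still have to be resolved against the page URL.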

## 1.1. Download and persist #
Please complete the code for the `load()`, `download()` and `persist()` methods of the `Document` class. What they do:
- `download()` downloads the binary data for a given URL and stores it in `self.content`. It returns `True` on success and `False` otherwise.
- `persist()` saves `self.content` somewhere in the file system. We do this for caching, to avoid repeated downloads.
- `load()` loads the data from the hard drive. It returns `True` on success.

The tests only check that your code works in a basic way.

**NB Passing the tests doesn't mean you completed the task correctly.** These are the **criteria that have to be fulfilled**:
1. A URL is a unique identifier (URLs are a subset of URIs). Thus, documents with different URLs should be stored in different files. Typical errors: documents from the same domain overwrite the same file, URLs with similar endings are saved to the same file, etc.
2. A document can be not only a text file but also a binary one. Make sure that if you download an `mp3` file, it can still be played. Hint: don't hurry to convert everything to text.
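
One possible way to satisfy both criteria is sketched below. It is an assumption, not the required solution: the helper names `cache_path`, `save_bytes` and `load_bytes` are hypothetical, and hashing the URL is just one option for getting a distinct, short file name per URL.

```python
import hashlib
import os
import tempfile


def cache_path(url, cache_dir):
    """Derive a file name from the full URL, so different URLs get different files."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)


def save_bytes(path, content):
    # 'wb' keeps the payload byte-for-byte, which matters for mp3, pdf, images
    with open(path, "wb") as f:
        f.write(content)


def load_bytes(path):
    with open(path, "rb") as f:
        return f.read()


cache_dir = tempfile.mkdtemp()
a = cache_path("http://example.com/a/index.html", cache_dir)
b = cache_path("http://example.com/b/index.html", cache_dir)

payload = bytes([0, 255, 16, 128])  # not valid UTF-8 text
save_bytes(a, payload)
print(a != b, load_bytes(a) == payload)  # True True
```

A hash also keeps the name at a fixed length, whereas a name built from a percent-encoded URL can grow past the file name limit of many file systems.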

In [None]:
import requests
import os
from urllib.parse import quote

class Document:

    def __init__(self, url):
        self.url = url

    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()

    def download(self):
        #TODO download self.url content, store it in self.content and return True in case of success
        return False

    def persist(self):
        #TODO write document content to hard drive
        pass

    def load(self):
        #TODO load content from hard drive, store it in self.content and return True in case of success
        return False

### 1.1.1. Tests ###

In [None]:
doc = Document('http://sprotasov.ru/data/iu.txt')

doc.get()
assert doc.content, "Document download failed"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document content error"

doc.get()
assert doc.load(), "Load should return true for saved document"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document load from disk error"

## 1.2. Parse HTML
The `BeautifulSoup` library is a de facto standard for parsing XML and HTML documents in Python. Use it to complete the `parse()` method that extracts document contents. You should initialize:
1. `self.anchors`: a list of `('text', 'url')` tuples for the links found in the document. Be aware that links can be relative (e.g. `../content/pic.jpg`). Use `urllib.parse.urljoin()` to resolve them.
2. `self.images`: a list of the images found in the document. Again, links can be relative to the current page.
3. `self.text`: the plain text of the document without scripts, tags, comments and so on. You can refer to [this stackoverflow answer](https://stackoverflow.com/a/1983219) for details.

**NB All three criteria must be fulfilled to get full points for the task.**
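
Two of the ideas above, resolving relative links and keeping only visible text, can be tried without BeautifulSoup. The sketch below uses the standard library as a stand-in; the class name `VisibleText` and the sample page are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data; they go to handle_comment instead
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


base = "http://example.com/docs/page.html"
print(urljoin(base, "../content/pic.jpg"))  # http://example.com/content/pic.jpg

page = "<body><script>var x = 1;</script><!-- note --><p>Hello <b>world</b></p></body>"
parser = VisibleText()
parser.feed(page)
text = " ".join(parser.chunks)
print(text)  # Hello world
```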

In [None]:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.parse


class HtmlDocument(Document):

    def parse(self):
        #TODO extract plain text, images and links from the document
        self.anchors = [("fake link text", "http://fake.url/")]
        self.images = ["http://image.com/fake.jpg"]
        self.text = "fake text and some other text"

### 1.2.1. Tests

In [None]:
doc = HtmlDocument("http://sprotasov.ru")
doc.get()
doc.parse()

assert "just few links" in doc.text, "Error parsing text"
assert "http://sprotasov.ru/images/gb.svg" in doc.images, "Error parsing images"
assert any(p[1] == "https://twitter.com/07C3" for p in doc.anchors), "Error parsing links"

## 1.3. Document analysis ##
Complete the code for the `HtmlDocumentTextData` class. Implement word and sentence splitting (use any method you can propose).

**Criteria to succeed in the task**:
1. Your `get_word_stats()` method should return a `Counter` object.
2. Don't forget to lowercase your words for counting.
3. Sentences should be obtained from inside the `<body>` tag only.
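
As a starting point, word and sentence splitting can be approximated with regular expressions. This is a naive sketch with hypothetical helper names; it will mis-split abbreviations such as "e.g." and knows nothing about HTML, so it is a baseline rather than a finished solution.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def word_stats(text):
    """Count lowercased runs of word characters."""
    return Counter(re.findall(r"\w+", text.lower()))


text = "Hello world. Hello again! Is it working?"
sentences = split_sentences(text)
stats = word_stats(text)
print(sentences)             # ['Hello world.', 'Hello again!', 'Is it working?']
print(stats.most_common(1))  # [('hello', 2)]
```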

In [None]:
from collections import Counter

class HtmlDocumentTextData:

    def __init__(self, url):
        self.doc = HtmlDocument(url)
        self.doc.get()
        self.doc.parse()

    def get_sentences(self):
        #TODO implement sentence parser
        result = []
        return result

    def get_word_stats(self):
        #TODO return Counter object of the document, containing mapping {`word` -> count_in_doc}
        return Counter()

### 1.3.1. Tests ###

In [None]:
doc = HtmlDocumentTextData("https://innopolis.university/")

print(doc.get_word_stats().most_common(10))
assert [x for x in doc.get_word_stats().most_common(10) if x[0] == 'иннополис'], 'иннополис should be among most common'

## 1.4. Crawling ##

The `crawl_generator()` method is given a starting URL (`source`) and the maximum search depth. It should return a **generator** of `HtmlDocumentTextData` objects, yielding each document as soon as it is downloaded and parsed. You can benefit from Python's `yield obj_name` construct. Use the `doc.anchors` field of each `HtmlDocumentTextData` object to go deeper.
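
The traversal itself can be sketched independently of downloading. The example below is a breadth-first crawl over a hypothetical in-memory link graph: `crawl_bfs`, `get_links` and `graph` are illustrative names, and treating the source page as depth 0 is an assumption about how depth is counted.

```python
from collections import deque


def crawl_bfs(source, depth, get_links):
    """Yield (url, level) pairs breadth-first, visiting each URL at most once."""
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        url, level = queue.popleft()
        try:
            links = get_links(url)
        except Exception:
            continue  # skip pages that fail to download or parse
        yield url, level
        if level >= depth:
            continue  # do not go deeper than the requested depth
        for link in links:
            if link not in visited:
                visited.add(link)
                queue.append((link, level + 1))


# Hypothetical link graph standing in for real pages
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}


def get_links(url):
    return graph[url]  # raises KeyError for unknown pages, like a failed download


print(list(crawl_bfs("a", 1, get_links)))
# [('a', 0), ('b', 1), ('c', 1)]
```

The `visited` set is what prevents the crawler from looping forever on pages that link back to each other.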

In [None]:
from queue import Queue

class Crawler:

    def crawl_generator(self, source, depth=1):
        #TODO return real crawling results. Don't forget to handle failures,
        # exceptions, and 3xx/4xx status codes
        for i in range(3):
            yield HtmlDocumentTextData(source)

### 1.4.1. Tests ###

In [None]:
crawler = Crawler()
counter = Counter()

for c in crawler.crawl_generator("https://innopolis.university/en/", 2):
    print(c.doc.url)
    if c.doc.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt'):
        print("Skipping", c.doc.url)
        continue
    counter.update(c.get_word_stats())
    print(len(counter), "distinct word(s) so far")

print("Done")

print(counter.most_common(20))
assert [x for x in counter.most_common(20) if x[0] == 'innopolis'], 'innopolis should be among most common'


## 1.5. Account for the caching policy

Sometimes remote documents (especially static content such as `js` or `gif` files) can promise that they will not change for some time. This is done by setting the [Cache-Control response header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control).
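
To get a feeling for what the header contains, here is a small sketch that splits a `Cache-Control` value into directives and decides whether a cached copy of a given age is still fresh. The helpers `parse_cache_control` and `is_fresh` are hypothetical, and the parsing is simplified: it does not handle quoted values that contain commas.

```python
def parse_cache_control(header):
    """Turn 'public, max-age=3600' into {'public': True, 'max-age': '3600'}."""
    directives = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') if value else True
    return directives


def is_fresh(directives, age_seconds):
    """Treat a copy as fresh only if max-age allows it and caching is not forbidden."""
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age")
    if max_age is None or max_age is True or not max_age.isdigit():
        return False
    return age_seconds <= int(max_age)


directives = parse_cache_control("public, max-age=3600")
print(directives)                  # {'public': True, 'max-age': '3600'}
print(is_fresh(directives, 10))    # True
print(is_fresh(directives, 4000))  # False
```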

In [None]:
import os
import time
import requests
from urllib.parse import quote

class Document:

    def __init__(self, url):
        self.url = url
        self.content = None
        self.cache_file = quote(url, safe='')

    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()

    def download(self):
        try:
            response = requests.get(self.url, allow_redirects=True)
            if response.status_code == 200:
                self.content = response.content
                self.headers = response.headers
                return True
            else:
                print(f"Failed to download {self.url}: {response.status_code}")
                return False
        except requests.RequestException as e:
            print(f"Request failed: {e}")
            return False

    def persist(self):
        try:
            with open(self.cache_file, 'wb') as f:
                f.write(self.content)
            with open(self.cache_file + '.meta', 'w') as f:
                f.write(self.headers.get('Cache-Control', ''))
            print(f"File saved as {self.cache_file}")
        except Exception as e:
            print(f"Failed to persist {self.url}: {e}")

    def load(self):
        if not os.path.exists(self.cache_file):
            return False
        try:
            with open(self.cache_file, 'rb') as f:
                self.content = f.read()
            return True
        except Exception as e:
            print(f"Failed to load {self.url}: {e}")
            return False



We have implemented a descendant of the `Document` class that refreshes the document when its cache has expired, even if the file is already on the hard drive.

In [None]:
import re

class CachedDocument(Document):
    def is_cache_expired(self):
        if not os.path.exists(self.cache_file + '.meta'):
            return True

        with open(self.cache_file + '.meta', 'r') as f:
            cache_control = f.read()

        # A missing or malformed max-age directive is treated as an expired cache
        match = re.search(r'max-age=(\d+)', cache_control)
        if match is None:
            return True
        file_age = time.time() - os.path.getmtime(self.cache_file)
        return file_age > int(match.group(1))

    def get(self):
        if not os.path.exists(self.cache_file) or self.is_cache_expired():
            print(f"Cache expired or not found for {self.url}")
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()
        else:
            print(f"Loading from cache for {self.url}")
            self.load()


### Run


In [None]:
import time

doc = CachedDocument('https://yandex.ru/')
doc.get()
time.sleep(2)
doc.get()
time.sleep(2)
doc.get()

## 1.6. Languages
You may have heard that there are multiple languages in the world. European languages, like Russian and English, use similar punctuation, but even in this family there is ¡Spanish!

Other languages, like **Arabic or [Thai](http://www.thai-language.com/ref/breaking-words)**, can use different punctuation rules.

Your task is to support tokenization for (at least) three languages (English, Arabic, and Thai) in a descendant of your `HtmlDocumentTextData` class.

What should you do (acceptance criteria):
1. Use any language detection technique, e.g. [langdetect](https://pypi.org/project/langdetect/).
2. Use language-specific tokenization tools, e.g. for [Thai](https://pythainlp.github.io/tutorials/notebooks/pythainlp_get_started.html#Tokenization-and-Segmentation) and [Arabic](https://github.com/CAMeL-Lab/camel_tools).
3. Use these pages to test your code: [1](https://www.bangkokair.com/tha/baggage-allowance) and [2](https://alfajr-news.net/details/%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%AF%D9%8A%D9%85%D9%88%D9%82%D8%B1%D8%A7%D8%B7%D9%8A-%D9%81%D9%8A-%D8%A7%D9%84%D9%83%D9%88%D9%86%D8%BA%D8%B1%D8%B3-%D8%A7%D9%84%D8%A3%D9%85%D8%B1%D9%8A%D9%83%D9%8A-%D9%84%D9%85%D8%B9%D8%A7%D9%82%D8%A8%D8%A9-%D8%A8%D9%88%D8%AA%D9%8A%D9%86).
4. Pass the tests.
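
Before wiring in a real detector, it can help to see why Arabic and Thai are easy to tell apart from English: they live in different Unicode blocks. The sketch below is only a crude stand-in for a library such as langdetect; `guess_script` is a hypothetical helper that counts characters in the Arabic (U+0600 to U+06FF) and Thai (U+0E00 to U+0E7F) blocks.

```python
def guess_script(text):
    """Return 'ar', 'th' or 'other' based on which Unicode block dominates."""
    counts = {"ar": 0, "th": 0}
    for ch in text:
        code = ord(ch)
        if 0x0600 <= code <= 0x06FF:
            counts["ar"] += 1
        elif 0x0E00 <= code <= 0x0E7F:
            counts["th"] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "other"


print(guess_script("สวัสดี"))   # th
print(guess_script("مرحبا"))    # ar
print(guess_script("hello"))    # other
```

One detail worth knowing when counting words: `str.isalpha()` returns `False` for a Thai word such as `"สวัสดี"`, because its vowel marks are combining characters rather than letters.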

In [None]:
!pip install langdetect

In [None]:
!pip install pythainlp

In [None]:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.parse
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from langdetect import detect
import pythainlp
from collections import Counter
import re

In [None]:
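class HtmlDocument(Document):

    def parse(self):
        soup = BeautifulSoup(self.content, 'html.parser')

        self.anchors = [(a.get_text(), urllib.parse.urljoin(self.url, a.get('href', '')))
                        for a in soup.find_all('a', href=True)]
        self.images = [urllib.parse.urljoin(self.url, img.get('src', ''))
                       for img in soup.find_all('img', src=True)]
        # Remove scripts, styles and comments explicitly so that only visible text remains
        for node in soup(['script', 'style']):
            node.decompose()
        for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
            comment.extract()
        self.text = ' '.join(soup.stripped_strings)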

class HtmlDocumentTextData:
    def __init__(self, url):
        self.doc = HtmlDocument(url)
        self.doc.get()
        self.doc.parse()

    def get_sentences(self):
        return sent_tokenize(self.doc.text)

    def get_word_stats(self):
        words = word_tokenize(self.doc.text)
        return Counter(word.lower() for word in words if word.isalpha())


In [None]:
class MultilingualHtmlDocumentTextData(HtmlDocumentTextData):
    def __init__(self, url):
        super().__init__(url)
        self.language = self.detect_language()

    def detect_language(self):
        try:
            return detect(self.doc.text)
        except Exception:
            return 'en'  # Default to English if detection fails

    def tokenize_text(self):
        if self.language == 'ar':
            tokens = ...  # Use arabic-specific tokenization here
        elif self.language == 'th':
            tokens = ...  # Use Thai-specific tokenization here
        else:  # Assuming English or other languages
            tokens = word_tokenize(self.doc.text)
        return tokens

    def get_sentences(self):
        if self.language == 'ar':
            # Use Arabic-specific sentence tokenizer here if available
            pass
        elif self.language == 'th':
            # Use Thai-specific sentence tokenizer here if available
            pass
        return super().get_sentences()

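    def get_word_stats(self):
        tokens = self.tokenize_text()
        # str.isalpha() is False for tokens with combining marks (common in Thai),
        # so keep every token that contains at least one letter
        return Counter(token.lower() for token in tokens
                       if any(ch.isalpha() for ch in token))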

### Tests

In [None]:
doc = MultilingualHtmlDocumentTextData("https://www.bangkokair.com/tha/baggage-allowance")
print(doc.get_word_stats().most_common(10))

doc = MultilingualHtmlDocumentTextData("https://alfajr-news.net/details/%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%AF%D9%8A%D9%85%D9%88%D9%82%D8%B1%D8%A7%D8%B7%D9%8A-%D9%81%D9%8A-%D8%A7%D9%84%D9%83%D9%88%D9%86%D8%BA%D8%B1%D8%B3-%D8%A7%D9%84%D8%A3%D9%85%D8%B1%D9%8A%D9%83%D9%8A-%D9%84%D9%85%D8%B9%D8%A7%D9%82%D8%A8%D8%A9-%D8%A8%D9%88%D8%AA%D9%8A%D9%86")
print(doc.get_word_stats().most_common(10))