# 1. Crawler

## 1.0. Related example

This code shows `wget`-like tool written in python. Run it from console (`python wget.py`), make it work. Check the code, reuse, and modify for your needs.

In [1]:
import argparse
import os
import re

import pythainlp
import requests


def wget(url, filename):
    # allow redirects - in case file is relocated
    resp = requests.get(url, allow_redirects=True)
    # this can also be 2xx, but for simplicity now we stick to 200
    # you can also check for `resp.ok`
    if resp.status_code != 200:
        print(resp.status_code, resp.reason, 'for', url)
        return
    
    # just to be cool and print something
    print(*[f"{key}: {value}" for key, value in resp.headers.items()], sep='\n')
    print()
    
    # try to extract filename from url
    if filename is None:
        # start with http*, ends if ? or # appears (or none of)
        m = re.search("^http.*/([^/\?#]*)[\?#]?", url)
        filename = m.group(1)
        if not filename:
            raise NameError(f"Filename neither given, nor found for {url}")

    # what will you do in case 2 websites store file with the same name?
    if os.path.exists(filename):
        raise OSError(f"File {filename} already exists")
    
    with open(filename, 'wb') as f:
        f.write(resp.content)
        print(f"File saved as {filename}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='download file.')
    parser.add_argument("-O", type=str, default=None, dest='filename', help="output file name. Default -- taken from resource")
    parser.add_argument("url", type=str, default=None, help="Provide URL here")
    args = parser.parse_args()
    wget(args.url, args.filename)

usage: ipykernel_launcher.py [-h] [-O FILENAME] url
ipykernel_launcher.py: error: unrecognized arguments: -f


SystemExit: 2

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


### 1.0.1. How to parse a page?

If you build a crawler, you might follow one of the approaches:
1. search for URLs in the page, assuming this is just a text.
2. search for URLs in the places where URLs should appear: `<a href=..`, `<img src=...`, `<iframe src=...` and so on.

To follow the first approach you can rely on some good regular expression. [Like this](https://stackoverflow.com/a/3809435).

To follow the second approach just read one of these: [short answer](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) or [exhaustive explanation](https://hackersandslackers.com/scraping-urls-with-beautifulsoup/).

## 1.1. [15] Download and persist #
Please complete a code for `load()`, `download()` and `persist()` methods of `Document` class. What they do:
- for a given URL `download()` method downloads binary data and stores in `self.content`. It returns `True` for success, else `False`.
- `persist()` method saves `self.content` somewhere in file system. We do it to avoid multiple downloads (for caching in other words).
- `load()` method loads data from hard drive. Returns `True` for success.

Tests checks that your code somehow works.

**NB Passing the test doesn't mean you correctly completed the task.** These are **criteria, which have to be fullfilled**:
1. URL is a unique identifier (as it is a subset of URI). Thus, documents with different URLs should be stored in different files. Typical errors: documents from the same domain are overwritten to the same file, URLs with similar endings are downloaded to the same file, etc.
2. The document can be not only a text file, but also a binary. Pay attention that if you download `mp3` file, it still can be played. Hint: don't hurry to convert everything to text.

In [67]:
import requests
import hashlib
from urllib.parse import quote

class Document:
    
    def __init__(self, url):
        self.url = url
        self.content=None
        self.name=None
        
    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()
    
    def download(self):
        #TODO download self.url content, store it in self.content and return True in case of success
        url = self.url
        r = requests.get(url)
        self.content = r.content

        if r.status_code !=200:
            return False
        return True
    
    def persist(self):
        #TODO write document content to hard drive
        url = self.url
        self.name = quote(url).replace("/","_")
        r = requests.get(url)
        if len(self.name)>127:
            self.name = self.name[:127]
        open(self.name,'w').write(r.content.decode('utf-8'))
            
    def load(self):
        #TODO load content from hard drive, store it in self.content and return True in case of success
        try:
            self.content = open(self.name,'r').read()
            if self.content!=None:
                return True
        except:
            return False

### 1.1.1. Tests ###

In [68]:
doc = Document('http://sprotasov.ru/data/iu.txt')

doc.get()
assert doc.content, "Document download failed"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document content error"

doc.get()
assert doc.load(), "Load should return true for saved document"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document load from disk error"

## 1.2. [M][15] Account the caching policy

Sometimes remote documents (especially when we speak about static content like `js` or `gif`) can swear that they will not change for some time. This is done by setting [Cache-Control response header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control).

In [69]:
import requests
requests.get('https://polyfill.io/v3/polyfill.min.js').headers['Cache-Control']

'public, s-maxage=31536000, max-age=604800, stale-while-revalidate=604800, stale-if-error=604800'

Please study the documentation and implement a descendant to a `Document` class, which will refresh the document in case of expired cache even if the file is already on the hard drive.

In [70]:
class CachedDocument(Document):
    def __init__(self, url):
        Document.__init__(self,url)
        self.cash_max_age= 0
        self.time=0 #time of last version
    def load(self):
        #TODO download self.url content, store it in self.content and return True in case of success
        url = self.url
        r = requests.get(url)
        if self.cash_max_age<=time.time()-self.time:
            self.content = r.content
            #find max-age
            headers=r.headers['Cache-Control'].split(',')
            for header in headers:
                if 'max-age' in header:
                    self.cash_max_age=int(header.split('=')[1])
            self.time=time.time()
            print('Document was updated ',self.name)
            try:
                self.content = open(self.name,'r').read()
                if self.content!=None:
                    return True
            except:
                return False
    # TODO your code here


### 1.2.1. Tests

Add logging in your code and show that your code behaves differently for documents with different caching policy.

In [71]:
import time

doc = CachedDocument('https://polyfill.io/v3/polyfill.min.js')
doc.get()
time.sleep(2)
doc.get()
time.sleep(2)
doc.get()

doc = CachedDocument('https://yandex.ru/')
doc.get()
time.sleep(2)
doc.get()
time.sleep(2)
doc.get()

Document was updated  None
Document was updated  None
Document was updated  https%3A__yandex.ru_
Document was updated  https%3A__yandex.ru_


The first document showed message about updating 1 time
yandex.ru was updated 3 times



## 1.3. [10] Parse HTML ##
`BeautifulSoap` library is a de facto standard to parse XML and HTML documents in python. Use it to complete `parse()` method that extracts document contents. You should initialize:
- `self.anchors` list of tuples `('text', 'url')` met in a document. Be aware, there exist relative links (e.g. `../content/pic.jpg`). Use `urllib.parse.urljoin()` to fix this issue.
- `self.images` list of images met in a document. Again, links can be relative to current page.
- `self.text` should keep plain text of the document without scripts, tags, comments and so on. You can refer to [this stackoverflow answer](https://stackoverflow.com/a/1983219) for details.

**NB All these 3 criteria must be fulfilled to get full point for the task.**

In [72]:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.parse


class HtmlDocument(Document):
    
    def parse(self):
        #TODO extract plain text, images and links from the document
        soup = BeautifulSoup(self.content)
        self.anchors = [tuple([link.getText(), urllib.parse.urljoin(self.url,link.get('href'))])for link in soup.find_all('a')]

        self.images = [urllib.parse.urljoin(self.url,img.get('src')) for img in soup.find_all('img')]
        #find_all('a')= [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
        #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
        # img.get('src') returns link to the image
        self.text = self.text_from_html(self.content)

    def tag_visible(self,element):
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True


    def text_from_html(self,body):
        soup = BeautifulSoup(body, 'html.parser')
        texts = soup.findAll(text=True)
        visible_texts = filter(self.tag_visible, texts)
        return u" ".join(t.strip() for t in visible_texts)



### 1.3.1. Tests ###

In [73]:
doc = HtmlDocument("http://sprotasov.ru")
doc.get()
doc.parse()

assert "just few links" in doc.text, "Error parsing text"
assert "http://sprotasov.ru/images/gb.svg" in doc.images, "Error parsing images"
assert any(p[1] == "https://twitter.com/07C3" for p in doc.anchors), "Error parsing links"

In [74]:
doc.images

['https://mc.yandex.ru/watch/53482672',
 'http://sprotasov.ru/images/gb.svg',
 'http://sprotasov.ru/images/gb.svg',
 'http://sprotasov.ru/images/gb.svg',
 'http://sprotasov.ru/images/ru.svg',
 'http://sprotasov.ru/images/ru.svg',
 'http://sprotasov.ru/images/gb.svg',
 'http://sprotasov.ru/images/gb.svg',
 'http://sprotasov.ru/images/ru.svg',
 'http://sprotasov.ru/images/gb.svg',
 'http://sprotasov.ru/images/ru.svg',
 'http://sprotasov.ru/images/ru.svg',
 'http://sprotasov.ru/images/gb.svg']

## 1.4. [10] Document analysis ##
Complete the code for `HtmlDocumentTextData` class. Implement word and sentence splitting (use any method you can propose). 

**Criteria of success**: 
1. Your `get_word_stats()` method should return `Counter` object.
2. Don't forget to lowercase your words for counting.
3. Sentences should be obtained inside `<body>` tag only.

In [75]:
#pip install pythainlp
#pip install camel-tools

In [76]:
from collections import Counter
import re
import nltk
nltk.download('punkt')
class HtmlDocumentTextData:
    
    def __init__(self, url):
        self.doc = HtmlDocument(url)
        self.doc.get()
        self.doc.parse()
    
    def get_sentences(self):
        #TODO implement sentence parser
        text=self.doc.text
        sent_text = nltk.sent_tokenize(text)
        return sent_text
    
    def get_word_stats(self):
        text=self.doc.text.lower()


        #TODO return Counter object of the document, containing mapping {`word` -> count_in_doc}

        return Counter(re.split(r'\W+',text))
        #return Counter(nltk.word_tokenize(text))

[nltk_data] Downloading package punkt to /home/guzel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 1.4.1. Tests ###

In [77]:
doc = HtmlDocumentTextData("https://innopolis.university/")

print(doc.get_word_stats().most_common(10))
print(doc.get_sentences())
assert [x for x in doc.get_word_stats().most_common(10) if x[0] == 'иннополис'], 'иннополис should be among most common'

[('и', 59), ('в', 30), ('иннополис', 20), ('по', 17), ('2022', 16), ('на', 14), ('университет', 12), ('области', 10), ('ит', 10), ('с', 10)]
['               Все медиа  Facebook Вконтакте Youtube Twitter Instagram habr      Абитуриентам  Бизнесу  Медиа   Личный кабинет      Университет      Об университете     Органы управления     Учредители    Наблюдательный совет      Команда университета     Организационная структура    Образовательные и научные подразделения    Технологические центры      Преподавательский состав     Профессорско-преподавательский состав    Вакантные должности ППС      Работа в университете     Карьера в университете    Корпоративная жизнь    Релокация в Иннополис    Вакансии      Кампус     Кампус  Информация о жилом, учебном и спортивном комплексах, медцентре, питании и досуге на территории города и Университета Иннополис.', 'Ответы на часто задаваемые вопросы      Сведения об образовательной организации     Сведения об образовательной организации  Информация об

## 1.5. [M][35] Languages
Maybe you heard, that there are multiple languages in the world. European languages, like Russian and English, use similar puctuation, but even in this family there is ¡Spanish!

Other languages can use different punctiation rules, like **Arabic or [Thai](http://www.thai-language.com/ref/breaking-words)**.

Your task is to support (at least) three languages (English, Arabic, and Thai) tokenization in your `HtmlDocumentTextData` class descendant.

What should you do:
1. Use any language dection techniques, e.g. [langdetect](https://pypi.org/project/langdetect/).
2. Use language-specific tokenization tools, e.g. for [Thai](https://pythainlp.github.io/tutorials/notebooks/pythainlp_get_started.html#Tokenization-and-Segmentation) and [Arabic](https://github.com/CAMeL-Lab/camel_tools).
3. Use these pages to test your code: [1](https://www.bangkokair.com/tha/baggage-allowance) and [2](https://alfajr-news.net/details/%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%AF%D9%8A%D9%85%D9%88%D9%82%D8%B1%D8%A7%D8%B7%D9%8A-%D9%81%D9%8A-%D8%A7%D9%84%D9%83%D9%88%D9%86%D8%BA%D8%B1%D8%B3-%D8%A7%D9%84%D8%A3%D9%85%D8%B1%D9%8A%D9%83%D9%8A-%D9%84%D9%85%D8%B9%D8%A7%D9%82%D8%A8%D8%A9-%D8%A8%D9%88%D8%AA%D9%8A%D9%86).

In [80]:
from langdetect import detect
from pythainlp import word_tokenize as thaitokenize
import camel_tools.tokenizers.word
class MultilingualHtmlDocumentTextData(HtmlDocumentTextData):
    
    #TODO your code here
    def __init__(self,url):
        HtmlDocumentTextData.__init__(self,url)
        self.language=detect(self.doc.text) #detecting language of the text


    def get_word_stats(self):
        text=self.doc.text
        if self.language=='en': #eng
            return Counter(nltk.word_tokenize(text))
        if self.language=='th': #thai
            return Counter(thaitokenize(text))
        if self.language=='ar':
            return Counter(camel_tools.tokenizers.word.simple_word_tokenize(text))


        #TODO return Counter object of the document, containing mapping {`word` -> count_in_doc}







### 1.5.1. Tests

In [81]:
doc = MultilingualHtmlDocumentTextData("https://www.bangkokair.com/tha/baggage-allowance")
print(doc.get_word_stats().most_common(10))

doc = MultilingualHtmlDocumentTextData("https://alfajr-news.net/details/%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%AF%D9%8A%D9%85%D9%88%D9%82%D8%B1%D8%A7%D8%B7%D9%8A-%D9%81%D9%8A-%D8%A7%D9%84%D9%83%D9%88%D9%86%D8%BA%D8%B1%D8%B3-%D8%A7%D9%84%D8%A3%D9%85%D8%B1%D9%8A%D9%83%D9%8A-%D9%84%D9%85%D8%B9%D8%A7%D9%82")
print(doc.get_word_stats().most_common(10))

[(' ', 217), ('  ', 122), ('    ', 69), ('สัมภาระ', 34), ('   ', 27), ('การ', 25), ('เรา', 24), ('     ', 23), ('และ', 22), ('ของ', 21)]
[('تعليق', 12), ('مشاهده', 10), ('.', 5), ('الإمارات', 5), ('-', 4), ('بن', 4), ('زايد', 4), ('في', 4), ('الفجر', 4), ('و', 4)]


## 1.5. [15] Crawling ##

Method `crawl_generator()` is given starting url (`source`) and max depth of search. It should return a **generator** of `HtmlDocumentTextData` objects (return a document as soon as it is downloaded and parsed). You can benefit from `yield obj_name` python construction. Use `HtmlDocumentTextData.anchors` field to go deeper.

In [82]:
from queue import Queue

class Crawler:
    def crawl_generator(self, source, depth=1):
        queue = [(source, 1)]
        visited = [source]
        while len(queue) > 0:
            cur_url, cur_depth = queue.pop(0)
            if cur_depth > depth:
                return
            try:
                HtmlDocument = HtmlDocumentTextData(cur_url)
                for l in HtmlDocument.doc.anchors:
                    link = l[1]
                    if link not in visited:
                        visited.append(link)
                        queue.append((link, cur_depth+1))
                yield HtmlDocument
            except:
                continue




### 1.5. Tests ###

In [83]:
crawler = Crawler()
counter = Counter()

for c in crawler.crawl_generator("https://innopolis.university/en/", 2):
    print(c.doc.url)
    if c.doc.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt'):
        print("Skipping", c.doc.url)
        continue
    counter.update(c.get_word_stats())
    print(len(counter), "distinct word(s) so far")
    
print("Done")

print(counter.most_common(20))
assert [x for x in counter.most_common(20) if x[0] == 'innopolis'], 'innopolis sould be among most common'

https://innopolis.university/en/
345 distinct word(s) so far
https://apply.innopolis.university/en
1064 distinct word(s) so far
https://corporate.innopolis.university/en
1221 distinct word(s) so far
https://media.innopolis.university/en
1271 distinct word(s) so far
https://innopolis.university/lk/
1628 distinct word(s) so far
https://innopolis.university/en/about/
1755 distinct word(s) so far
https://innopolis.university/en/board/
1842 distinct word(s) so far
https://innopolis.university/en/team/
1843 distinct word(s) so far
https://innopolis.university/en/team-structure/
1846 distinct word(s) so far
https://innopolis.university/en/team-structure/education-academics/
1850 distinct word(s) so far
https://innopolis.university/en/team-structure/techcenters/
1852 distinct word(s) so far
https://innopolis.university/en/faculty/
2709 distinct word(s) so far
https://career.innopolis.university/en/job/
3120 distinct word(s) so far
https://career.innopolis.university/en/
3360 distinct word(s) s