# 1. Crawler

## 1.0. Related example

This code shows a `wget`-like tool written in Python. Run it from the console (`python wget.py`) and make sure it works. Then read the code, reuse it, and modify it for your needs.

In [1]:
# installing required libraries 
import sys

# argparse and urllib are part of the Python standard library and need no pip install
!{sys.executable} -m pip install bs4
!{sys.executable} -m pip install regex
!{sys.executable} -m pip install requests
!{sys.executable} -m pip install nltk

In [2]:
# downloading stopwords
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

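# fetch only what this notebook uses instead of opening the interactive downloader:
# stopwords for word statistics, punkt (punkt_tab on recent NLTK releases) for sent_tokenize
for resource in ('stopwords', 'punkt', 'punkt_tab'):
    nltk.download(resource)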

In [3]:
# create temp directory for storing downloaded html files
import os

# exist_ok=True avoids an error when the cell is re-run
os.makedirs('temp', exist_ok=True)

In [1]:
import argparse
import os
import re
import requests

def wget(url, filename):
    # allow redirects - in case file is relocated
    resp = requests.get(url, allow_redirects=True)
    # this can also be 2xx, but for simplicity now we stick to 200
    # you can also check for `resp.ok`
    if resp.status_code != 200:
        print(resp.status_code, resp.reason, 'for', url)
        return
    
    # just to be cool and print something
    print(*[f"{key}: {value}" for key, value in resp.headers.items()], sep='\n')
    print()
    
    # try to extract filename from url
    if filename is None:
        # starts with http*, takes the last path segment, stops at ? or #
        m = re.search(r"^http[^?#]*/([^/?#]*)", url)
        filename = m.group(1) if m else None
        if not filename:
            raise NameError(f"Filename neither given, nor found for {url}")

    # what will you do in case 2 websites store file with the same name?
    if os.path.exists(filename):
        raise OSError(f"File {filename} already exists")
    
    with open(filename, 'wb') as f:
        f.write(resp.content)
        print(f"File saved as {filename}")


# if __name__ == "__main__":
#     parser = argparse.ArgumentParser(description='download file.')
#     parser.add_argument("-O", type=str, default=None, dest='filename', help="output file name. Default -- taken from resource")
#     parser.add_argument("url", type=str, default=None, help="Provide URL here")
#     args = parser.parse_args()
#     wget(args.url, args.filename)

### 1.0.1. How to parse a page?

When you build a crawler, you can follow one of two approaches:
1. Search for URLs anywhere in the page, treating it as plain text.
2. Search for URLs only in the places where they are expected to appear: `<a href=...`, `<img src=...`, `<iframe src=...` and so on.

For the first approach you can rely on a good regular expression, [like this one](https://stackoverflow.com/a/3809435).

For the second approach, read one of these: [short answer](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) or [exhaustive explanation](https://hackersandslackers.com/scraping-urls-with-beautifulsoup/).
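
The two approaches can be compared on a small in-memory page. The sketch below is only an illustration: the sample HTML, the deliberately simplified regular expression, and the names `urls_from_text()`, `urls_from_tags()` and `TagUrlCollector` are assumptions made for this example. It uses the standard library's `html.parser` so that it runs without extra packages; BeautifulSoup does the same job with less code.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page used only for this illustration.
HTML = """
<html><body>
  <p>Plain-text mention: https://example.com/docs</p>
  <a href="https://example.com/about">About</a>
  <img src="/static/logo.png">
</body></html>
"""

def urls_from_text(text):
    # Approach 1: treat the page as plain text and match absolute URLs.
    # The pattern is simplified on purpose and misses relative links.
    return re.findall(r"https?://[^\s\"'<>]+", text)

class TagUrlCollector(HTMLParser):
    # Approach 2: look only at attributes that are expected to hold URLs.
    URL_ATTRS = {"a": "href", "img": "src", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)

def urls_from_tags(html):
    collector = TagUrlCollector()
    collector.feed(html)
    return collector.urls

print(urls_from_text(HTML))  # finds the plain-text URL, misses the relative image
print(urls_from_tags(HTML))  # finds the image, misses the plain-text URL
```

Neither approach finds everything here, which is why real crawlers often combine tag-based extraction with link normalization.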

## 1.1. [15] Download and persist ##
Complete the code for the `load()`, `download()` and `persist()` methods of the `Document` class. What they do:
- `download()` downloads binary data from the given URL and stores it in `self.content`. It returns `True` on success and `False` otherwise.
- `persist()` saves `self.content` somewhere in the file system. This avoids repeated downloads (in other words, it is a cache).
- `load()` loads the data from the hard drive. It returns `True` on success.

The tests check only that your code basically works.

**NB Passing the test doesn't mean you completed the task correctly.** These are the **criteria that have to be fulfilled**:
1. A URL is a unique identifier (URLs are a subset of URIs). Thus, documents with different URLs should be stored in different files. Typical errors: documents from the same domain overwrite the same file, URLs with similar endings are downloaded to the same file, etc.
2. The document can be a binary file, not only text. Make sure that if you download an `mp3` file, it can still be played. Hint: don't hurry to convert everything to text.
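
One possible way to satisfy both criteria is to derive the file name from a hash of the full URL and to read and write in binary mode. This is a sketch under stated assumptions, not the required solution: `cache_path()`, `save_bytes()` and `load_bytes()` are hypothetical helpers, and the temporary directory and URLs are placeholders.

```python
import hashlib
import os
import tempfile

def cache_path(url, cache_dir):
    # Hypothetical helper: the SHA-256 digest of the full URL gives a
    # fixed-length, filesystem-safe name that differs for different URLs.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)

def save_bytes(path, content):
    # "wb" stores the payload byte for byte, so mp3, pdf or images stay intact.
    with open(path, "wb") as f:
        f.write(content)

def load_bytes(path):
    with open(path, "rb") as f:
        return f.read()

cache_dir = tempfile.mkdtemp()
payload = bytes(range(256))  # stand-in for arbitrary binary data

# Same file name at the end of the URL, but different documents.
path_a = cache_path("http://example.com/a/index.html", cache_dir)
path_b = cache_path("http://example.com/b/index.html", cache_dir)

save_bytes(path_a, payload)
print(path_a != path_b, load_bytes(path_a) == payload)
```

A hashed name is not human-readable; percent-encoding the whole URL is an alternative when readable file names matter more than a fixed length.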

In [2]:
import os
import requests
from urllib.parse import quote

class Document:
    
    def __init__(self, url):
        self.url = url
        self.content = None

    def _path(self):
        # safe="" percent-encodes "/" as well, so every distinct URL maps to
        # its own file name (replacing "/" with "_" could merge different URLs)
        return os.path.join('temp', quote(self.url, safe=""))
        
    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            self.persist()
    
    def download(self):
        try:
            resp = requests.get(self.url, allow_redirects=True)
        except requests.RequestException:
            # network-level failure (DNS, connection, invalid URL, ...)
            return False
        if resp.status_code != 200:
            return False
        self.content = resp.content
        return True
    
    def persist(self):
        # binary mode keeps non-text documents (mp3, pdf, ...) intact
        with open(self._path(), "wb") as f:
            f.write(self.content)
            
    def load(self):
        try:
            with open(self._path(), "rb") as f:
                self.content = f.read()
        except OSError:
            return False
        return True

In [3]:
quote("http://sprotasov.ru/data/iu.txt", safe="")

'http%3A%2F%2Fsprotasov.ru%2Fdata%2Fiu.txt'

### 1.1.1. Tests ###

In [4]:
doc = Document('http://sprotasov.ru/data/iu.txt')

doc.get()
assert doc.content, "Document download failed"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document content error"

doc.get()
assert doc.load(), "Load should return true for saved document"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document load from disk error"

## 1.3. [10] Parse HTML ##
The `BeautifulSoup` library is a de facto standard for parsing XML and HTML documents in Python. Use it to complete the `parse()` method, which extracts the document contents. You should initialize:
- `self.anchors`: a list of `('text', 'url')` tuples for the links found in the document. Be aware that some links are relative (e.g. `../content/pic.jpg`); use `urllib.parse.urljoin()` to resolve them.
- `self.images`: a list of images found in the document. Again, links can be relative to the current page.
- `self.text`: the plain text of the document without scripts, tags, comments and so on. You can refer to [this stackoverflow answer](https://stackoverflow.com/a/1983219) for details.

**NB All three criteria must be fulfilled to get full points for the task.**
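
The behaviour of `urllib.parse.urljoin()` on relative and absolute links can be checked in isolation. The base URL and the links below are made-up placeholders for illustration.

```python
from urllib.parse import urljoin

base = "http://example.com/a/b/page.html"  # hypothetical page URL

# Relative link that climbs one directory up.
up_one = urljoin(base, "../content/pic.jpg")
# Link relative to the site root.
from_root = urljoin(base, "/images/logo.svg")
# Link relative to the current directory.
sibling = urljoin(base, "details.html")
# Absolute links are returned unchanged.
absolute = urljoin(base, "https://other.org/x")

print(up_one)     # http://example.com/a/content/pic.jpg
print(from_root)  # http://example.com/images/logo.svg
print(sibling)    # http://example.com/a/b/details.html
print(absolute)   # https://other.org/x
```

Because absolute links pass through untouched, the same call can be applied to every `href` and `src` value without checking its form first.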

In [5]:
from bs4 import BeautifulSoup, Doctype
from bs4.element import Comment
import urllib.parse

class HtmlDocument(Document):

    def _visible_text(self, element):
        # drop comments and text that lives in non-rendered containers
        if isinstance(element, Comment):
            return False
        if element.parent.name in ['style', 'title', 'head', 'script', '[document]']:
            return False
        return True
    
    def _normalize_link(self, link):
        # urljoin() resolves relative links against the page URL
        # and returns absolute links unchanged
        return urllib.parse.urljoin(self.url, link)
    
    def _get_anchors(self):
        # name the parser explicitly so the result does not depend on which parsers are installed
        soup = BeautifulSoup(self.content, 'html.parser')
        self.anchors = []
        # href=True skips <a> tags without an href attribute
        for a in soup.find_all("a", href=True):
            self.anchors.append((a.get_text(), self._normalize_link(a['href'])))
    
    def _get_images(self):
        soup = BeautifulSoup(self.content, 'html.parser')
        self.images = []
        # src=True skips <img> tags without a src attribute
        for img in soup.find_all("img", src=True):
            self.images.append(self._normalize_link(img['src']))
            
    def _get_text(self):
        soup = BeautifulSoup(self.content, 'html.parser')
        # iterate over a copy, because extract() modifies soup.contents
        for item in list(soup.contents):
            if isinstance(item, Doctype):
                item.extract()
                
        text_list = filter(self._visible_text, soup.find_all(string=True))
        self.text = " ".join([word.strip() for word in text_list]).replace("\n", "").strip()

        # self.text = soup.find("body").getText().replace("\n", "").strip()
    
    def parse(self):
        self.get()
        self._get_images()
        self._get_anchors()
        self._get_text()
        
        

### 1.3.1. Tests ###

In [6]:
doc = HtmlDocument("http://sprotasov.ru")
doc.get()
doc.parse()
print(doc.text)

assert "just few links" in doc.text, "Error parsing text"
assert "http://sprotasov.ru/images/gb.svg" in doc.images, "Error parsing images"
assert any(p[1] == "https://twitter.com/07C3" for p in doc.anchors), "Error parsing links"

Stanislav Protasov in just few links:  Curriculum vitae  Google Scholar  GitHub  ResearchGate  Публикации в eLibrary  Facebook  LinkedIn  Research with Stas telegram channel  Подкаст "Происхождение видов": telegram , iTunes , RSS  Automatic testing system ( source code )  Книга "Давайте объясню: или зачем программисту математика"  Материалы на ПостНауке  Twitter


## 1.4. [10] Document analysis ##
Complete the code for the `HtmlDocumentTextData` class. Implement word and sentence splitting (use any method you like, e.g. [NLTK](https://www.nltk.org/api/nltk.tokenize.html)).

**Criteria of success**: 
1. Your `get_word_stats()` method should return `Counter` object.
2. Don't forget to lowercase your words for counting.
3. Sentences should be taken from inside the `<body>` tag only.
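
The sketch below shows the idea behind these criteria on a made-up sample string. It uses plain regular expressions as a simplified stand-in for the NLTK tokenizers, so its sentence splitting is naive (it would break on abbreviations such as "e.g."); the sample text is an assumption for illustration.

```python
import re
from collections import Counter

# Made-up sample text, not taken from any real page.
text = "Innopolis is a city. Innopolis University is in Innopolis!"

# Naive sentence split: break after ., ! or ? when whitespace follows.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Lowercase before tokenizing so "Innopolis" and "innopolis" count as one word.
words = re.findall(r"\w+", text.lower())
stats = Counter(words)

print(sentences)
print(stats.most_common(2))
```

With a real tokenizer only the two splitting lines change; lowercasing and `Counter` stay the same.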

In [7]:
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, RegexpTokenizer

stop_words = set(stopwords.words('russian')) | set(stopwords.words('english'))

class HtmlDocumentTextData:
    
    def __init__(self, url):
        self.doc = HtmlDocument(url)
        # parse() downloads the page (or loads it from the cache) before parsing
        self.doc.parse()
    
    def get_sentences(self):
        # self.doc.text already excludes <head>, <script> and <style> content;
        # sent_tokenize requires the NLTK punkt data to be downloaded
        return sent_tokenize(self.doc.text)

    def get_words(self):
        # lowercase first, then keep runs of word characters
        return RegexpTokenizer(r'\w+').tokenize(self.doc.text.lower())
    
    def get_word_stats(self):
        # count alphabetic, non-stopword tokens
        words = [w for w in self.get_words() if w not in stop_words and w.isalpha()]
        return Counter(words)

### 1.4.1. Tests ###

In [8]:
doc = HtmlDocumentTextData("https://innopolis.university")
# doc.get_sentences()
print(doc.get_word_stats().most_common(10))
assert [x for x in doc.get_word_stats().most_common(10) if x[0] == 'иннополис'], 'иннополис should be among most common'

[('иннополис', 20), ('университет', 12), ('области', 10), ('ит', 10), ('лаборатория', 10), ('университета', 9), ('центр', 9), ('технологий', 7), ('робототехники', 7), ('образование', 7)]


## 1.5. [15] Crawling ##

The `crawl_generator()` method is given a starting URL (`source`) and a maximum search depth. It should return a **generator** of `HtmlDocumentTextData` objects, yielding each document as soon as it is downloaded and parsed. The Python `yield obj_name` construction helps here. Use the `anchors` list of each parsed document (`doc.anchors` on an `HtmlDocumentTextData` object) to go deeper.
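
The traversal itself can be tried without any network access. The following is a minimal sketch of a depth-limited breadth-first generator over a hypothetical in-memory link graph; `LINKS` and `crawl()` are illustrative names, and a real crawler would download and parse each page where this sketch only looks it up in a dictionary.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> outgoing links.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "A"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}

def crawl(source, depth=1):
    visited = {source}
    queue = deque([(source, depth)])
    while queue:
        page, remaining = queue.popleft()
        # Hand the page to the caller before exploring its links.
        yield page
        if remaining == 0:
            continue
        for link in LINKS.get(page, []):
            if link not in visited:
                visited.add(link)
                queue.append((link, remaining - 1))

print(list(crawl("A", depth=1)))  # ['A', 'B', 'C']
print(list(crawl("A", depth=2)))  # ['A', 'B', 'C', 'D']
```

Marking a link as visited when it is queued, rather than when it is processed, keeps the same page from being queued twice.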

In [9]:
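from collections import deque

class Crawler:
    
    def crawl_generator(self, source, depth=1):
        # breadth-first traversal; visited holds every URL that was ever queued
        visited = {source}
        queue = deque([(source, depth)])
        while queue:
            url, dep = queue.popleft()
            try:
                site = HtmlDocumentTextData(url)
            except Exception:
                # skip pages that fail to download or parse
                continue
            # yield outside the try block: a bare except around yield would
            # swallow GeneratorExit when the caller closes the generator
            yield site
                
            if dep == 0:
                continue
            for _, new_url in site.doc.anchors:
                if new_url not in visited:
                    visited.add(new_url)
                    queue.append((new_url, dep - 1))
        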
        

### 1.5.1. Tests ###

In [10]:
crawler = Crawler()
counter = Counter()

for c in crawler.crawl_generator("https://innopolis.university/en/", 2):
    print(c.doc.url)
    if c.doc.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt'):
        print("Skipping", c.doc.url)
        continue
    counter.update(c.get_word_stats())
    print(len(counter), "distinct word(s) so far")
    
print("Done")

print(counter.most_common(20))
assert [x for x in counter.most_common(20) if x[0] == 'innopolis'], 'innopolis should be among most common'

https://innopolis.university/en/
279 distinct word(s) so far
https://apply.innopolis.university/en
906 distinct word(s) so far
https://corporate.innopolis.university/en
1053 distinct word(s) so far
https://media.innopolis.university/en
1097 distinct word(s) so far
https://innopolis.university/lk/
1435 distinct word(s) so far
https://innopolis.university/en/about/
1534 distinct word(s) so far
https://innopolis.university/en/board/
1608 distinct word(s) so far
https://innopolis.university/en/team/
1609 distinct word(s) so far
https://innopolis.university/en/team-structure/
1612 distinct word(s) so far
https://innopolis.university/en/team-structure/education-academics/
1616 distinct word(s) so far
https://innopolis.university/en/team-structure/techcenters/
1618 distinct word(s) so far
https://innopolis.university/en/faculty/
2420 distinct word(s) so far
https://career.innopolis.university/en/job/
2819 distinct word(s) so far
https://career.innopolis.university/en/
3046 distinct word(s) so

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
https://apply.innopolis.university/en/?special=Y
6516 distinct word(s) so far
https://apply.innopolis.university/
6516 distinct word(s) so far
https://apply.innopolis.ru/en/get-in/
6730 distinct word(s) so far
https://apply.innopolis.university/en/faq/
6733 distinct word(s) so far
https://apply.innopolis.university/en/#block5327
6733 distinct word(s) so far
https://apply.innopolis.university/en/olympiad-bonus/
6754 distinct word(s) so far
https://www.ets.org/s/cv/toefl/at-home/
6754 distinct word(s) so far
https://englishtest.duolingo.com/
6754 distinct word(s) so far
http://nic.gov.ru/en/proc/nic/legalize
6828 distinct word(s) so far
https://www.hcch.net/en/instruments/conventions/authorities1/?cid=41
6896 distinct word(s) so far
https://drive.google.com/file/d/10spw7SYKomHSyWuaNMWMl3-t_OiiR-qg/view
6896 dist

https://apply.innopolis.ru/get-in/tests/
7397 distinct word(s) so far
https://innopolis.university/en/campus/
7397 distinct word(s) so far
https://innopolis.com/en/
7499 distinct word(s) so far
https://apply.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://apply.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
http://www.campuslife.innopolis.ru/main
7499 distinct word(s) so far
https://corporate.innopolis.university/en/outsourcingrd/
7510 distinct word(s) so far
https://corporate.innopolis.university/en/technologicalaudit/
7558 distinct word(s) so far
https://corporate.innopolis.university/en/stratsession/
7602 distinct word(s) so far
https://corporate.innopolis.university/en/Bootcamp/
7645 distinct word(s) so far
https://corporate.innopolis.university/en/education/
7651 distinct word(s) so far
https://corporate.innopolis.university/
7814 distinct word(s) so far


https://corporate.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://corporate.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
https://media.innopolis.university/en/events
7814 distinct word(s) so far
https://media.innopolis.university/
7848 distinct word(s) so far
https://media.innopolis.university/en?TAGS=*
7848 distinct word(s) so far
https://media.innopolis.university/en?TAGS=Industry
7884 distinct word(s) so far
https://media.innopolis.university/en?TAGS=Research
7912 distinct word(s) so far
https://media.innopolis.university/en?TAGS=Education
7930 distinct word(s) so far
https://media.innopolis.university/en?TAGS=Students life
7957 distinct word(s) so far
https://media.innopolis.university/en?TAGS=Global
7970 distinct word(s) so far
https://media.innopolis.university/en?TAGS=Robotics
7970 distinct word(s) so far
https://media.innopolis.university/en?TAGS=Blockchain
7970 distinct word(s) so far
https://media.i

https://media.innopolis.university/en/?PAGEN_2=2
8064 distinct word(s) so far
https://media.innopolis.university/en/?PAGEN_2=3
8069 distinct word(s) so far
https://media.innopolis.university/en/?PAGEN_2=38
8097 distinct word(s) so far
https://media.innopolis.university/en/?PAGEN_2=39
8112 distinct word(s) so far
https://spec.innopolis.university/
8162 distinct word(s) so far
https://www.youtube.com/channel/UCZNo9zTHZNZOW4fSFcCKh_A
8162 distinct word(s) so far
https://media.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://media.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
https://innopolis.university/about/
8335 distinct word(s) so far
https://innopolis.university/board/
8435 distinct word(s) so far
https://innopolis.university/team/
8436 distinct word(s) so far
https://innopolis.university/team-structure/
8442 distinct word(s) so far
https://innopolis.university/team-structure/education-academics/
8447 distinc

https://innopolis.university/lk/?special=Y
18115 distinct word(s) so far
https://innopolis.university/en/lk/
18116 distinct word(s) so far
https://mail.innopolis.ru/
18122 distinct word(s) so far
https://portal.university.innopolis.ru
18122 distinct word(s) so far
https://it.university.innopolis.ru/portal
18137 distinct word(s) so far
https://apply.innopolis.ru/get-in/
18137 distinct word(s) so far
https://my.university.innopolis.ru/
18137 distinct word(s) so far
https://dovuz.innopolis.university/login/
18154 distinct word(s) so far


https://innopolis.university/public/files/Согласие на обработку ПДн для УИ.pdf
Skipping https://innopolis.university/public/files/Согласие на обработку ПДн для УИ.pdf
https://innopolis.university/en/about/?special=Y
18154 distinct word(s) so far
https://corporate.innopolis.university/en/
18154 distinct word(s) so far
https://robotics.innopolis.university/en/
18158 distinct word(s) so far


https://innopolis.university/upload/iblock/c68/AR_2016.pdf
Skipping https://innopolis.university/upload/iblock/c68/AR_2016.pdf


https://innopolis.university/upload/iblock/26c/AR_2017.pdf
Skipping https://innopolis.university/upload/iblock/26c/AR_2017.pdf


https://innopolis.university/upload/iblock/59f/AR_2018.pdf
Skipping https://innopolis.university/upload/iblock/59f/AR_2018.pdf


https://innopolis.university/upload/iblock/0b2/AR_2019.pdf
Skipping https://innopolis.university/upload/iblock/0b2/AR_2019.pdf


https://innopolis.university/upload/iblock/189/AR2020(3).pdf
Skipping https://innopolis.university/upload/iblock/189/AR2020(3).pdf
https://innopolis.university/en/board/?special=Y
18158 distinct word(s) so far
https://digital.gov.ru/ru/
18316 distinct word(s) so far
http://mzio.tatarstan.ru
18465 distinct word(s) so far
https://innopolis.university/en/team/?special=Y
18465 distinct word(s) so far
https://innopolis.university/en/team-structure/?special=Y
18465 distinct word(s) so far
https://innopolis.university/en/team-director/
18480 distinct word(s) so far
https://innopolis.university/en/team-rector/
18485 distinct word(s) so far
https://innopolis.university/en/team-structure/education-academics/?special=Y
18485 distinct word(s) so far
https://innopolis.university/en/team-structure/team-faculty/
18487 distinct word(s) so far
https://innopolis.university/en/team-structure/team-faculty2/
18494 distinct word(s) so far
https://innopolis.university/en/team-academicpolicy/
18501 distinct w

https://career.innopolis.university/en/job/#career-feedback__form
18748 distinct word(s) so far
https://career.innopolis.university/public/files/career_personal_data.docx
19143 distinct word(s) so far
https://career.innopolis.university/en/corporate-life/
19153 distinct word(s) so far
https://career.innopolis.university/en/relocation/
19219 distinct word(s) so far
https://career.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://career.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
https://career.innopolis.university/
19219 distinct word(s) so far
https://career.innopolis.university/success-stories/julia-kazaeva/
19220 distinct word(s) so far
https://career.innopolis.university/success-stories/farid-gainullin/
19220 distinct word(s) so far
https://career.innopolis.university/success-stories/salimzhan-gafurov-en/
19220 distinct word(s) so far
https://innopolis.university/en/campus?special=Y
19220 distinct word(s) s

https://apply.innopolis.university/en/bachelor/?special=Y
19302 distinct word(s) so far
https://innopolis.university/filespublic/2020MathTestEng.pdf
Skipping https://innopolis.university/filespublic/2020MathTestEng.pdf
https://innopolis.university/filespublic/Computer%20science.pdf
Skipping https://innopolis.university/filespublic/Computer%20science.pdf
https://innopolis.university/filespublic/English.pdf
Skipping https://innopolis.university/filespublic/English.pdf
https://innopolis.university/filespublic/contest-26104-en.pdf
Skipping https://innopolis.university/filespublic/contest-26104-en.pdf
https://innopolis.university/filespublic/Overview%20of%20the%20BS%20Program1.pdf
Skipping https://innopolis.university/filespublic/Overview%20of%20the%20BS%20Program1.pdf
https://apply.innopolis.ru/en/get-in/tests/
19302 distinct word(s) so far
https://apply.innopolis.university/en/bachelor/CE/?special=Y
19302 distinct word(s) so far
https://apply.innopolis.university/en/bachelor/DS-AI/?specia

https://apply.innopolis.university/en/grant/
19310 distinct word(s) so far
https://apply.innopolis.university/en/master/datascience/?special=Y
19310 distinct word(s) so far
https://innopolis.university/filespublic/DSAI_2021_ENG_.pdf
Skipping https://innopolis.university/filespublic/DSAI_2021_ENG_.pdf
https://innopolis.university/files/Master%60s%20program%20Data%20Analysis%20and%20Artificial%20Intelligence.pdf
Skipping https://innopolis.university/files/Master%60s%20program%20Data%20Analysis%20and%20Artificial%20Intelligence.pdf


https://apply.innopolis.university/en/master/securityandnetworkengineering/?special=Y
19310 distinct word(s) so far
https://innopolis.university/filespublic/MS%20SNE_2021.pdf
Skipping https://innopolis.university/filespublic/MS%20SNE_2021.pdf
https://innopolis.university/files/Master%60s%20program%20Security%20and%20Network%20Engineering.pdf
Skipping https://innopolis.university/files/Master%60s%20program%20Security%20and%20Network%20Engineering.pdf
https://apply.innopolis.university/en/master/development/?special=Y
19310 distinct word(s) so far
https://apply.innopolis.university/en/master/development/#block5931
19310 distinct word(s) so far
https://innopolis.university/files/Master%60s%20program%20Software%20Engineering.pdf
Skipping https://innopolis.university/files/Master%60s%20program%20Software%20Engineering.pdf
https://apps.apple.com/ru/app/id1447056625
19528 distinct word(s) so far


https://apps.apple.com/ru/app/id1138728895
19579 distinct word(s) so far
https://kazanfirst.ru/news/534089
19656 distinct word(s) so far
https://profit.kz/news/37024/Opredelilis-pobediteli-nFactorial-Challenge/
19741 distinct word(s) so far
https://apply.innopolis.university/en/master/robotics/?special=Y
19741 distinct word(s) so far


https://innopolis.university/filespublic/IU%20MS%20Robotics%20Presentation_2021.pdf
Skipping https://innopolis.university/filespublic/IU%20MS%20Robotics%20Presentation_2021.pdf
https://innopolis.university/files/Master%60s%20program%20Robotics%20and%20Computer%20Vision.pdf
Skipping https://innopolis.university/files/Master%60s%20program%20Robotics%20and%20Computer%20Vision.pdf
https://apply.innopolis.university/en/master/technological-entrepreneurship/?special=Y
19741 distinct word(s) so far
https://innopolis.university/filespublic/photo_2021-12-07_17-00-50.jpg
20383 distinct word(s) so far
https://minobrnauki.gov.ru/press-center/news/?ELEMENT_ID=25900
20791 distinct word(s) so far
https://innopolis.university/files/Master%60s%20program%20Technological%20Entrepreneurship.pdf
Skipping https://innopolis.university/files/Master%60s%20program%20Technological%20Entrepreneurship.pdf
http://startupfairy.ru
21031 distinct word(s) so far
https://startupfairy.ru/idea
21245 distinct word(s) so fa

https://alumni.innopolis.university/registration/
21971 distinct word(s) so far
https://alumni.innopolis.university/#block4478
21971 distinct word(s) so far
https://alumni.innopolis.university/#block4481
21971 distinct word(s) so far
https://kazanexpress.ru/
21973 distinct word(s) so far
https://www.remyrobotics.com/
21973 distinct word(s) so far
https://yorso.com/
22196 distinct word(s) so far


https://alumni.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://alumni.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
https://www.facebook.com/innopolisU
22196 distinct word(s) so far
https://alumni.innopolis.university/public/files/Согласие на обработку ПДн для УИ.pdf
Skipping https://alumni.innopolis.university/public/files/Согласие на обработку ПДн для УИ.pdf
https://innopolis.university/en/research/?special=Y
22196 distinct word(s) so far
https://innopolis.university/en/lab-operating-systems/?special=Y
22196 distinct word(s) so far
https://innopolis.university/en/lab-software-service-engineering/?special=Y
22196 distinct word(s) so far
https://innopolis.university/en/lab-industrializing-software/?special=Y
22196 distinct word(s) so far
https://innopolis.university/en/lab-bioinformatics/?special=Y
22196 distinct word(s) so far
https://innopolis.university/en/lab-game-development/?special=Y
22196 distinct word

https://innopolis.university/files/politicacookies.pdf
Skipping https://innopolis.university/files/politicacookies.pdf
https://www.minobrnauki.gov.ru/action/situational_center/
92606 distinct word(s) so far
https://innopolis.university/antiterror/
92623 distinct word(s) so far
https://innopolis.university/en/ido/?special=Y
92623 distinct word(s) so far
https://innopolis.university/en/ido/#block5358
92623 distinct word(s) so far
https://university.innopolis.ru/
92623 distinct word(s) so far
https://dovuz.innopolis.university/informatika-i-programmirovanie/
92643 distinct word(s) so far
https://dovuz.innopolis.university/informacionnaya-bezopasnost/
92665 distinct word(s) so far
https://dovuz.innopolis.university/matematika/
92676 distinct word(s) so far
https://dovuz.innopolis.university/proektnaya-deyatelnost/
92700 distinct word(s) so far
https://dovuz.innopolis.university/robotics/
92723 distinct word(s) so far
https://dovuz.innopolis.university/finansovye-tekhnologii/
92733 distinct

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://dovuz.innopolis.university/intensiv-ege-informatika/?utm_source=site&utm_medium=news&utm_campaign=dovuz&utm_content=3feb
93045 distinct word(s) so far
https://dovuz.innopolis.university/news/startoval-nabor-na-vesennie-intensivy-po-programmirovaniyu-i-podgotovke-k-ege/
93045 distinct word(s) so far
https://dovuz.innopolis.university/news/vebinar-dlya-abiturientov-2022/
93056 distinct word(s) so far
https://dovuz.innopolis.university/news/otkryta-registratsiya-na-kursy-po-programmirovaniyu-i-geometrii-/
93065 distinct word(s) so far
http://dovuz.innopolis.university/about/
93065 distinct word(s) so far
http://dovuz.innopolis.university/events/
93065 distinct word(s) so far
http://dovuz.innopolis.university/news/
93065 distinct word(s) so far


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://dovuz.innopolis.university/public/files/Согласие на обработку ПДн для УИ.pdf
Skipping https://dovuz.innopolis.university/public/files/Согласие на обработку ПДн для УИ.pdf
https://university.innopolis.ru/en/about/?special=Y
93065 distinct word(s) so far
https://university.innopolis.ru/search/
93065 distinct word(s) so far
https://university.innopolis.ru/about/
93065 distinct word(s) so far


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://university.innopolis.ru/upload/iblock/c68/AR_2016.pdf
Skipping https://university.innopolis.ru/upload/iblock/c68/AR_2016.pdf


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://university.innopolis.ru/upload/iblock/26c/AR_2017.pdf
Skipping https://university.innopolis.ru/upload/iblock/26c/AR_2017.pdf


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://university.innopolis.ru/upload/iblock/59f/AR_2018.pdf
Skipping https://university.innopolis.ru/upload/iblock/59f/AR_2018.pdf


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://university.innopolis.ru/upload/iblock/0b2/AR_2019.pdf
Skipping https://university.innopolis.ru/upload/iblock/0b2/AR_2019.pdf


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://university.innopolis.ru/upload/iblock/189/AR2020(3).pdf
Skipping https://university.innopolis.ru/upload/iblock/189/AR2020(3).pdf
https://university.innopolis.ru/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://university.innopolis.ru/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
http://www.campuslife.innopolis.ru/handbook2021
93216 distinct word(s) so far
http://www.campuslife.innopolis.ru/services
93218 distinct word(s) so far
https://bit.ly/2IX3Nx9
93218 distinct word(s) so far
http://campuslife.innopolis.ru/innostudents
93230 distinct word(s) so far
http://campuslife.innopolis.ru/battle_rage
93254 distinct word(s) so far
https://t.me/joinchat/DjhyZkBN-FmZStxTB40qwQ
93256 distinct word(s) so far
https://docs.google.com/document/d/1D23NYEsWNRexYgeYT8Om3oaNVyGealNYfqOaYlqV4xs/edit
93263 distinct word(s) so far
http://www.campuslife.innopolis.ru/volunteering
93265 distinct word(s) so far
http://www.campuslife.innopolis.ru/aboutsu
93273 distinct 

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


http://globalai.innopolis.university/
93681 distinct word(s) so far
https://media.innopolis.university/news/grant-na-obuchenie-v-universitete-innopolis-detali-konkursnogo-otbora/
93709 distinct word(s) so far
https://media.innopolis.university/news/blockchain-platform-iu/
93731 distinct word(s) so far
https://media.innopolis.university/public/files/Согласие на обработку ПДн для УИ.pdf
Skipping https://media.innopolis.university/public/files/Согласие на обработку ПДн для УИ.pdf
https://us02web.zoom.us/j/9660990045?pwd=haAZQtiWaT-BVj-FamfQPy-KdnNBvQ
93731 distinct word(s) so far
https://us02web.zoom.us/j/86924417470?pwd=cWpHUlk4OG5BcVNjek94Qk9mMEFjQT09
93731 distinct word(s) so far
https://us02web.zoom.us/j/82115193290?pwd=aStDSVpXUnV5Q0JwZXB3Lys4YmJQUT09
93731 distinct word(s) so far
https://us02web.zoom.us/j/81023939550?pwd=ZHZka0hGTVh4RWFoOUZ3V3lESjVYUT09
93731 distinct word(s) so far
https://us02web.zoom.us/j/84204806112?pwd=Q0ROZkw3bmVGUTB2Q3pYRjFPRWk0UT09
93731 distinct word(s) so 