# 1. Crawler

## 1.0. Related example

This code shows a `wget`-like tool written in Python. Run it from the console (`python wget.py`) and make it work. Study the code, then reuse and modify it for your needs.

In [None]:
import argparse
import os
import re
import requests


def wget(url, filename):
    # allow redirects - in case file is relocated
    resp = requests.get(url, allow_redirects=True)
    # this can also be 2xx, but for simplicity now we stick to 200
    # you can also check for `resp.ok`
    if resp.status_code != 200:
        print(resp.status_code, resp.reason, 'for', url)
        return
    
    # just to be cool and print something
    print(*[f"{key}: {value}" for key, value in resp.headers.items()], sep='\n')
    print()
    
    # try to extract filename from url
    if filename is None:
        # take the last path segment; the query (?...) and fragment (#...) are left out
        m = re.search(r"^https?://[^?#]*/([^/?#]*)", url)
        # no match (e.g. a URL without a path) is handled like an empty name
        filename = m.group(1) if m else None
        if not filename:
            raise NameError(f"Filename neither given, nor found for {url}")

    # what will you do in case 2 websites store file with the same name?
    if os.path.exists(filename):
        raise OSError(f"File {filename} already exists")
    
    with open(filename, 'wb') as f:
        f.write(resp.content)
        print(f"File saved as {filename}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='download file.')
    parser.add_argument("-O", type=str, default=None, dest='filename', help="output file name. Default -- taken from resource")
    parser.add_argument("url", type=str, default=None, help="Provide URL here")
    args = parser.parse_args()
    wget(args.url, args.filename)

### 1.0.1. How to parse a page?

If you build a crawler, you can follow one of two approaches:
1. Search for URLs anywhere in the page, treating it as plain text.
2. Search for URLs only where they are expected to appear: `<a href=...`, `<img src=...`, `<iframe src=...`, and so on.

For the first approach, you can rely on a good regular expression, [like this one](https://stackoverflow.com/a/3809435).

For the second approach, read one of these: [short answer](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) or [exhaustive explanation](https://hackersandslackers.com/scraping-urls-with-beautifulsoup/).
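
The two approaches can be compared on a small inline page. This is a minimal sketch that uses only the standard library: the regular expression is a deliberately simplified stand-in for the linked one, and `html.parser.HTMLParser` stands in for BeautifulSoup. `SAMPLE_HTML`, `urls_from_text`, `urls_from_tags` and `LinkCollector` are illustrative names, not part of the assignment.

```python
import re
from html.parser import HTMLParser

# Hypothetical page used only for this illustration.
SAMPLE_HTML = """
<html><body>
<p>Plain mention: http://example.com/plain.txt</p>
<a href="/docs/index.html">Docs</a>
<img src="logo.png">
</body></html>
"""

# Approach 1: treat the page as text. Simplified pattern, absolute URLs only.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def urls_from_text(page):
    return URL_PATTERN.findall(page)


# Approach 2: look only at attributes that are expected to hold URLs.
class LinkCollector(HTMLParser):
    URL_ATTRS = {("a", "href"), ("img", "src"), ("iframe", "src")}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in self.URL_ATTRS and value:
                self.links.append(value)


def urls_from_tags(page):
    collector = LinkCollector()
    collector.feed(page)
    return collector.links


print(urls_from_text(SAMPLE_HTML))
print(urls_from_tags(SAMPLE_HTML))
```

On this sample the text search finds only the URL written in the paragraph, while the tag search finds only the two relative links. Neither approach sees everything, which is why a crawler may want to combine them.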

## 1.1. [15] Download and persist #
Please complete the code for the `load()`, `download()` and `persist()` methods of the `Document` class. What they do:
- `download()` downloads binary data for the given URL and stores it in `self.content`. It returns `True` on success, `False` otherwise.
- `persist()` saves `self.content` somewhere in the file system. We do this to avoid repeated downloads (in other words, for caching).
- `load()` loads the data from the hard drive. It returns `True` on success.

The tests check that your code at least works.

**NB Passing the tests doesn't mean you completed the task correctly.** These **criteria have to be fulfilled**:
1. A URL is a unique identifier (URLs are a subset of URIs). Thus, documents with different URLs should be stored in different files. Typical errors: documents from the same domain overwrite the same file, URLs with similar endings are saved to the same file, etc.
2. The document can be a binary file, not only text. Make sure that if you download an `mp3` file, it can still be played. Hint: don't hurry to convert everything to text.
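
One possible way to satisfy the first criterion is to derive the file name from a hash of the whole URL. The sketch below is only an illustration under that assumption; `url_to_filename` is a hypothetical helper, and any other injective mapping from URLs to file names would do as well.

```python
import hashlib
import os
from urllib.parse import urlparse


def url_to_filename(url):
    # Hash the full URL so that every distinct URL gets a distinct, filesystem-safe name.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Keep the extension from the URL path (if any) so the file type stays recognizable.
    extension = os.path.splitext(urlparse(url).path)[1]
    return digest + extension


print(url_to_filename("http://a.example/music/song.mp3"))
print(url_to_filename("http://b.example/music/song.mp3"))
```

The two URLs above end identically but produce different names, and the `.mp3` extension survives. The price is that the file name is no longer human-readable.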

In [1]:
import os
import requests
from urllib.parse import quote


class Document:

    def __init__(self, url):
        self.url = url
        # the name is known up front, so a file saved by an earlier run can be found again;
        # safe='' also escapes '/', which keeps different URLs in different files
        ext = self.url[-4:] if self.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt') else '.txt'
        self.filename = quote(self.url, safe='') + ext
        self.content = b''

    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()

    def download(self):
        # download self.url content, store it in self.content and return True in case of success
        try:
            resp = requests.get(self.url, allow_redirects=True)
        except requests.RequestException:
            return False
        if not resp.ok:
            return False
        # keep raw bytes: binary files (e.g. mp3) must not be converted to text
        self.content = resp.content
        return True

    def persist(self):
        with open(self.filename, 'wb') as f:
            f.write(self.content)

    def load(self):
        # load content from hard drive, store it in self.content and return True in case of success
        if not self.filename or not os.path.exists(self.filename):
            return False
        with open(self.filename, 'rb') as f:
            # replace the content instead of appending to whatever was there before
            self.content = f.read()
        return True

### 1.1.1. Tests ###

In [2]:
doc = Document('http://sprotasov.ru/data/iu.txt')

doc.get()
assert doc.content, "Document download failed"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document content error"

doc.get()
assert doc.load(), "Load should return true for saved document"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document load from disk error"

## 1.2. [M][15] Account for the caching policy

Sometimes remote documents (especially static content like `js` or `gif`) promise that they will not change for some time. This is done by setting the [Cache-Control response header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control).

In [3]:
import requests
requests.get('https://polyfill.io/v3/polyfill.min.js').headers['Cache-Control']


'public, s-maxage=31536000, max-age=604800, stale-while-revalidate=604800, stale-if-error=604800'

Please study the documentation and implement a descendant of the `Document` class that refreshes the document when the cached copy has expired, even if the file is already on the hard drive.
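
To make the header format concrete, here is a minimal sketch of parsing a `Cache-Control` value and checking freshness. It is a simplification under stated assumptions: only `max-age`, `no-cache` and `no-store` are taken into account, everything else (`s-maxage`, `stale-while-revalidate`, the `Expires` header) is ignored, and `parse_cache_control` and `is_fresh` are illustrative names.

```python
def parse_cache_control(header):
    """Split a Cache-Control value into a {directive: value} dict."""
    directives = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        # Directives without a value (e.g. "public") are stored as None.
        directives[name.strip().lower()] = value.strip().strip('"') or None
    return directives


def is_fresh(age_seconds, header):
    """Return True if a copy that is `age_seconds` old may still be reused."""
    directives = parse_cache_control(header)
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age")
    # A missing or malformed max-age is treated as "already expired" in this sketch.
    if max_age is None or not max_age.isdigit():
        return False
    return age_seconds < int(max_age)


header = ("public, s-maxage=31536000, max-age=604800, "
          "stale-while-revalidate=604800, stale-if-error=604800")
print(parse_cache_control(header)["max-age"])
print(is_fresh(10296, header), is_fresh(700000, header))
```

Looking directives up by name avoids depending on their position in the header, which servers are free to change.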

In [4]:
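import os
import re
import time
import requests
from urllib.parse import quote


class CachedDocument(Document):
    # rule: while current_time - download_time <= max-age, load from disk; otherwise refresh

    def __init__(self, url):
        Document.__init__(self, url)
        # the file name must be known before the first download, otherwise the cache cannot be checked
        if self.filename is None:
            ext = self.url[-4:] if self.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt') else '.txt'
            self.filename = quote(self.url).replace("/", "_") + ext

    def get_max_age(self):
        # the header may be missing and the directive order is not fixed, so search for max-age by name;
        # `s-maxage` (meant for shared caches) does not match this pattern
        cache_control = requests.get(self.url).headers.get('Cache-Control', '')
        m = re.search(r'max-age=(\d+)', cache_control)
        return int(m.group(1)) if m else 0

    def get(self):
        max_age = self.get_max_age()
        if not os.path.exists(self.filename):
            print("page is downloaded")
            self.download()
            self.persist()
        else:
            # the file modification time serves as the download time
            download_time = round(os.path.getmtime(self.filename))
            current_time = round(time.time())
            dur = current_time - download_time
            print('dur: ', dur)
            print('max age: ', max_age)
            if dur > max_age:
                print("page is refresed")
                self.download()
                self.persist()
            elif self.load():
                print('page is loaded ')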

### 1.2.1. Tests

Add logging in your code and show that your code behaves differently for documents with different caching policy.

In [5]:

doc = CachedDocument('https://polyfill.io/v3/polyfill.min.js')
doc.get()
time.sleep(2)
doc.get()
time.sleep(2)
doc.get()

doc = CachedDocument('https://yandex.ru/')
doc.get()
time.sleep(2)
doc.get()
time.sleep(2)
doc.get()

dur:  10296
max age:  604800
page is loaded 
dur:  10298
max age:  604800
page is loaded 
dur:  10301
max age:  604800
page is loaded 
dur:  8673
max age:  0
page is refresed
dur:  3
max age:  0
page is refresed
dur:  2
max age:  0
page is refresed


## 1.3. [10] Parse HTML ##
The `BeautifulSoup` library is the de facto standard for parsing XML and HTML documents in Python. Use it to complete the `parse()` method, which extracts the document contents. You should initialize:
- `self.anchors`: a list of `('text', 'url')` tuples found in the document. Be aware that links can be relative (e.g. `../content/pic.jpg`). Use `urllib.parse.urljoin()` to resolve them.
- `self.images`: a list of images found in the document. Again, links can be relative to the current page.
- `self.text`: the plain text of the document without scripts, tags, comments and so on. You can refer to [this stackoverflow answer](https://stackoverflow.com/a/1983219) for details.

**NB All 3 criteria must be fulfilled to get full points for the task.**
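
How `urllib.parse.urljoin()` resolves different kinds of links can be seen in a short sketch. The page URL and the link values below are made up for illustration, and `resolve_links` is a hypothetical helper.

```python
from urllib.parse import urljoin

# Hypothetical page address and raw attribute values as they could appear in HTML.
PAGE_URL = "http://example.com/a/b/page.html"
RAW_LINKS = [
    "../content/pic.jpg",        # relative to the parent directory
    "/img/logo.svg",             # relative to the site root
    "next.html",                 # relative to the current directory
    "https://other.org/x",       # already absolute, stays unchanged
    "//cdn.example.com/lib.js",  # scheme-relative, inherits the page scheme
]


def resolve_links(page_url, raw_links):
    # Skip empty values: tags such as <a name="top"> have no href at all.
    return [urljoin(page_url, link) for link in raw_links if link]


for raw, absolute in zip(RAW_LINKS, resolve_links(PAGE_URL, RAW_LINKS)):
    print(raw, "->", absolute)
```

The base must be the URL of the page the link was found on, not just the site root, otherwise `../` and bare file names resolve to the wrong place.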

In [6]:
import requests
from bs4 import BeautifulSoup
from bs4.element import Comment
from urllib.parse import urljoin


class HtmlDocument(Document):

    def tag_visible(self, element):
        # text inside these tags is never rendered on the page
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True

    def text_from_html(self, body):
        soup = BeautifulSoup(body, 'html.parser')
        texts = soup.find_all(string=True)
        visible_texts = filter(self.tag_visible, texts)
        return " ".join(t.strip() for t in visible_texts)

    def parse(self):
        # extract plain text, images and links from the document;
        # reuse the bytes fetched by get() and download only if nothing is there yet
        body = getattr(self, 'content', None) or requests.get(self.url).content
        soup = BeautifulSoup(body, "html.parser")

        # anchors: (text, absolute url) pairs; <a> tags without href are skipped
        self.anchors = []
        for a in soup.find_all('a', href=True):
            text = a.get_text().replace("\n", "").replace("\r", "")
            self.anchors.append((text, urljoin(self.url, a['href'])))

        # images: absolute urls; <img> tags without src are skipped
        self.images = [urljoin(self.url, img['src']) for img in soup.find_all('img', src=True)]

        self.text = self.text_from_html(body)

### 1.3.1. Tests

In [8]:
doc = HtmlDocument("http://sprotasov.ru")
doc.get()
doc.parse()

assert "just few links" in doc.text, "Error parsing text"
assert "http://sprotasov.ru/images/gb.svg" in doc.images, "Error parsing images"
assert any(p[1] == "https://twitter.com/07C3" for p in doc.anchors), "Error parsing links"

## 1.4. [10] Document analysis ##
Complete the code for `HtmlDocumentTextData` class. Implement word and sentence splitting (use any method you can propose). 

**Criteria of success**: 
1. Your `get_word_stats()` method should return `Counter` object.
2. Don't forget to lowercase your words for counting.
3. Sentences should be obtained inside `<body>` tag only.
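
As a starting point, splitting can be done with two regular expressions. This is a rough sketch, not a complete solution: it assumes European-style punctuation, breaks on abbreviations such as "e.g.", and `split_sentences` and `word_stats` are illustrative names.

```python
import re
from collections import Counter

# A sentence ends with '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A word is a run of letters, digits or underscores.
WORD = re.compile(r"\w+")


def split_sentences(text):
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def word_stats(text):
    # Lowercase before counting so that "The" and "the" are the same word.
    return Counter(w.lower() for w in WORD.findall(text))


sample = "Hello world. How are you? Fine!"
print(split_sentences(sample))
print(word_stats("The cat. the CAT!"))
```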

In [9]:
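import requests
from bs4 import BeautifulSoup
from collections import Counter


class HtmlDocumentTextData(HtmlDocument):

    def __init__(self, url):
        super().__init__(url)
        # kept so that callers can still reach the document as `obj.doc`
        self.doc = HtmlDocument(url)

    def get_sentences(self):
        #TODO split the text into separate sentences;
        # for now the visible text inside <body> is returned as one string
        html = requests.get(self.url).text
        soup = BeautifulSoup(html, 'html.parser')
        body = soup.body if soup.body is not None else soup
        return self.text_from_html(str(body))

    def get_word_stats(self):
        # return Counter object of the document, containing mapping {`word` -> count_in_doc}
        try:
            text = self.get_sentences()
        except Exception:
            # unreachable pages and non-HTTP links (e.g. mailto:) contribute no words
            return Counter()
        # whitespace tokenization, lowercased; words of 3 characters or fewer are dropped
        words = (word.lower() for word in text.split())
        return Counter(word for word in words if len(word) > 3)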

### 1.4.1. Tests ###

In [10]:
doc = HtmlDocumentTextData("https://innopolis.university/")
print(doc.get_word_stats().most_common(10))
assert [x for x in doc.get_word_stats().most_common(10) if x[0] == 'иннополис'], 'иннополис should be among most common'

[('иннополис', 19), ('университет', 12), ('области', 10), ('лаборатория', 10), ('университета', 9), ('центр', 9), ('разработки', 7), ('2022', 7), ('образовательной', 6), ('технологий', 6)]


## 1.5. [M][35] Languages
You may have heard that there are multiple languages in the world. European languages, like Russian and English, use similar punctuation, but even in this family there is ¡Spanish!

Other languages can use different punctuation rules, like **Arabic or [Thai](http://www.thai-language.com/ref/breaking-words)**.

Your task is to support tokenization for (at least) three languages (English, Arabic, and Thai) in a descendant of your `HtmlDocumentTextData` class.

What should you do:
1. Use any language detection technique, e.g. [langdetect](https://pypi.org/project/langdetect/).
2. Use language-specific tokenization tools, e.g. for [Thai](https://pythainlp.github.io/tutorials/notebooks/pythainlp_get_started.html#Tokenization-and-Segmentation) and [Arabic](https://github.com/CAMeL-Lab/camel_tools).
3. Use these pages to test your code: [1](https://www.bangkokair.com/tha/baggage-allowance) and [2](https://alfajr-news.net/details/%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%AF%D9%8A%D9%85%D9%88%D9%82%D8%B1%D8%A7%D8%B7%D9%8A-%D9%81%D9%8A-%D8%A7%D9%84%D9%83%D9%88%D9%86%D8%BA%D8%B1%D8%B3-%D8%A7%D9%84%D8%A3%D9%85%D8%B1%D9%8A%D9%83%D9%8A-%D9%84%D9%85%D8%B9%D8%A7%D9%82%D8%A8%D8%A9-%D8%A8%D9%88%D8%AA%D9%8A%D9%86).
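
To see why language-specific tools are needed, consider the sketch below. It uses a crude stand-in for language detection that relies only on the standard library: it guesses the dominant writing system from Unicode character names. This is an assumption made for illustration, not a replacement for `langdetect`, since one script can serve several languages (Arabic script is also used for Persian and Urdu). `dominant_script` is a hypothetical helper.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most frequent script name among the letters of `text`, or None."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "THAI CHARACTER SO SUA".
            name = unicodedata.name(ch, "")
            if name:
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None


print(dominant_script("baggage allowance"))
print(dominant_script("الإمارات"))
print(dominant_script("สัมภาระ"))

# Thai is written without spaces between words, so str.split() cannot segment it:
# these two words glued together come back as a single token.
print("น้ำหนักสัมภาระ".split())
```

Once the script or language is known, the text can be routed to a tokenizer that understands it, with whitespace splitting as the fallback.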

In [11]:
!pip install langdetect
!pip install spacy
!pip install pythainlp



In [12]:
#requirements
#install langdetect, spacy, pythainlp
from collections import Counter
from langdetect import detect
from spacy.lang.en import English
from spacy.lang.ar import Arabic
from spacy.lang.th import Thai  # word segmentation is delegated to pythainlp


class MultilingualHtmlDocumentTextData(HtmlDocumentTextData):

    # blank spaCy pipelines: only the tokenizer is used, no trained model is required
    TOKENIZERS = {'en': English, 'ar': Arabic, 'th': Thai}

    def get_word_stats(self):
        # return Counter object of the document, containing mapping {`word` -> count_in_doc}
        text = self.get_sentences()
        lang = detect(text)
        print(lang)
        if lang in self.TOKENIZERS:
            nlp = self.TOKENIZERS[lang]()
            tokens = [token.text for token in nlp(text)]
        else:
            # any other language falls back to whitespace tokenization
            tokens = text.split()

        # drop whitespace-only tokens and words of 3 characters or fewer, then lowercase
        return Counter(t.lower() for t in tokens if t.strip() and len(t) > 3)

### 1.5.1. Tests

In [16]:
doc = MultilingualHtmlDocumentTextData("https://www.bangkokair.com/tha/baggage-allowance")
print(doc.get_word_stats().most_common(10))

doc = MultilingualHtmlDocumentTextData("https://alfajr-news.net/details/%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%AF%D9%8A%D9%85%D9%88%D9%82%D8%B1%D8%A7%D8%B7%D9%8A-%D9%81%D9%8A-%D8%A7%D9%84%D9%83%D9%88%D9%86%D8%BA%D8%B1%D8%B3-%D8%A7%D9%84%D8%A3%D9%85%D8%B1%D9%8A%D9%83%D9%8A-%D9%84%D9%85%D8%B9%D8%A7%D9%82")
print(doc.get_word_stats().most_common(10))

th
[('สัมภาระ', 34), ('กิโลกรัม', 21), ('เดินทาง', 17), ('เที่ยวบิน', 16), ('บริการ', 16), ('ข้อมูล', 13), ('น้ำหนัก', 13), ('เกี่ยวกับ', 11), ('พิเศษ', 11), ('ผู้โดยสาร', 10)]
ar
[('تعليق', 12), ('مشاهده', 10), ('الإمارات', 6), ('الفجر', 4), ('حملة', 3), ('أخبار', 3), ('أغسطس', 3), ('2020', 3), ('javascript', 2), ('your', 2)]


In [17]:
#Just for testing the BFS
from collections import deque

graph = {
  'A' : ['B','C'],
  'B' : ['D', 'E'],
  'C' : ['F'],
  'D' : [],
  'E' : ['F'],
  'F' : []
}

def bfs(graph, node):
    # the level travels with each node in the queue, so it equals the real distance from the start
    visited = [node]
    queue = deque([(node, 0)])
    while queue:
        current, level = queue.popleft()
        print([current, level])
        for neighbour in graph[current]:
            if neighbour not in visited:
                visited.append(neighbour)
                queue.append((neighbour, level + 1))
    return visited

print(bfs(graph, 'A'))

['A', 0]
['B', 1]
['C', 1]
['D', 2]
['E', 2]
['F', 2]
['A', 'B', 'C', 'D', 'E', 'F']


## 1.6. [15] Crawling ##

The `crawl_generator()` method is given a starting URL (`source`) and the maximum search depth. It should return a **generator** of `HtmlDocumentTextData` objects (yield each document as soon as it is downloaded and parsed). You can benefit from the `yield obj_name` Python construct. Use the `HtmlDocumentTextData.anchors` field to go deeper.
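
The interplay of `yield` and the depth limit can be tried out without any network access. The sketch below walks a made-up in-memory link graph breadth-first; `LINKS` and `crawl` are illustrative names, and a real crawler would download and parse a page where this code looks it up in a dict.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": ["F"],
    "F": ["A"],
}


def crawl(source, depth=1):
    """Yield (page, level) pairs breadth-first, at most `depth` links away from `source`."""
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        page, level = queue.popleft()
        # The caller receives this page before the next one is touched.
        yield page, level
        if level < depth:
            for target in LINKS.get(page, []):
                if target not in visited:
                    visited.add(target)
                    queue.append((target, level + 1))


for page, level in crawl("A", depth=2):
    print(page, level)
```

Because the function is a generator, the consumer can stop iterating at any moment and no further pages are processed. The `visited` set prevents the cycle `F -> A` from causing an endless loop.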

In [18]:
from collections import deque


class Crawler:

    def crawl_generator(self, source, depth=1):
        # breadth-first traversal; `depth` is the maximum link distance from `source`
        visited = {source}
        queue = deque([(source, 0)])
        while queue:
            url, level = queue.popleft()
            doc = HtmlDocumentTextData(url)
            # pages at the maximum depth are yielded, but their links are not followed
            if level < depth:
                try:
                    doc.parse()
                    links = [link for _, link in doc.anchors]
                except Exception:
                    # unreachable pages and non-HTTP targets (e.g. mailto:) contribute no links
                    links = []
                for link in links:
                    if link not in visited:
                        visited.add(link)
                        queue.append((link, level + 1))
            # hand the document over before the next page is fetched
            yield doc

### 1.6.1. Tests ###

In [19]:
crawler = Crawler()
counter = Counter()

for c in crawler.crawl_generator("https://innopolis.university/en/", 2):
    print(c.doc.url)
    if c.doc.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt'):
        print("Skipping", c.doc.url)
        continue
    counter.update(c.get_word_stats())
    print(len(counter), "distinct word(s) so far")
    
print("Done")

print(counter.most_common(20))
assert [x for x in counter.most_common(20) if x[0] == 'innopolis'], 'innopolis should be among most common'

https://innopolis.university/en/
303 distinct word(s) so far
https://apply.innopolis.university/en
1152 distinct word(s) so far
https://corporate.innopolis.university/en
1327 distinct word(s) so far
https://media.innopolis.university/en
1386 distinct word(s) so far
https://innopolis.university/lk/
1732 distinct word(s) so far
https://innopolis.university/en/about/
1867 distinct word(s) so far
https://innopolis.university/en/board/
1958 distinct word(s) so far
https://innopolis.university/en/team/
1960 distinct word(s) so far
https://innopolis.university/en/team-structure/
1963 distinct word(s) so far
https://innopolis.university/en/team-structure/education-academics/
1966 distinct word(s) so far
https://innopolis.university/en/team-structure/techcenters/
1968 distinct word(s) so far
https://innopolis.university/en/faculty/
3111 distinct word(s) so far
https://career.innopolis.university/en/job/
3723 distinct word(s) so far
https://career.innopolis.university/en/
4049 distinct word(s) s

10320 distinct word(s) so far
mailto:admissions@innopolis.ru
10320 distinct word(s) so far
https://apply.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://apply.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
http://www.campuslife.innopolis.ru/main
10320 distinct word(s) so far
https://corporate.innopolis.university/en/outsourcingrd/
10339 distinct word(s) so far
https://corporate.innopolis.university/en/technologicalaudit/
10427 distinct word(s) so far
https://corporate.innopolis.university/en/stratsession/
10531 distinct word(s) so far
https://corporate.innopolis.university/en/Bootcamp/
10609 distinct word(s) so far
https://corporate.innopolis.university/en/education/
10620 distinct word(s) so far
https://corporate.innopolis.university/
10811 distinct word(s) so far
https://corporate.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://corporate.innopolis.university/publi

23022 distinct word(s) so far
https://ai.innopolis.university/
23071 distinct word(s) so far
https://innopolis.university/center-blockchain/
23221 distinct word(s) so far
https://innopolis.university/centergis/
23534 distinct word(s) so far
https://innopolis.university/center-cybersecurity/
23768 distinct word(s) so far
https://innopolis.university/center-oil/
23887 distinct word(s) so far
https://innopolis.university/center-robotics/
24375 distinct word(s) so far
https://innopolis.university/proekty/podderzhka-innovacionnoj-deyatelnosti/
24410 distinct word(s) so far
https://innopolis.university/digital-economy
24526 distinct word(s) so far
https://innopolis.university/startupstudio/
24830 distinct word(s) so far
https://innopolis.university/organizatsiya-i-provedenie-meropriyatiy/
25010 distinct word(s) so far
https://innopolis.university/sponsorship/
25043 distinct word(s) so far
https://innopolis.university/contacts/
25046 distinct word(s) so far
https://innopolis.university/lk/?sp

26577 distinct word(s) so far
https://career.innopolis.university/en/job/?section_id=22891
26577 distinct word(s) so far
https://hh.ru/vacancy/50499917
26578 distinct word(s) so far
https://career.innopolis.university/en/job/#career-feedback__form
26578 distinct word(s) so far
mailto:faculty@innopolis.ru
26578 distinct word(s) so far
https://career.innopolis.university/public/files/career_personal_data.docx
26950 distinct word(s) so far
https://career.innopolis.university/en/corporate-life/
26968 distinct word(s) so far
https://career.innopolis.university/en/relocation/
27078 distinct word(s) so far
https://career.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://career.innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
https://career.innopolis.university/
27078 distinct word(s) so far
https://career.innopolis.university/success-stories/julia-kazaeva/
27094 distinct word(s) so far
https://career.innopolis.university/

30560 distinct word(s) so far
https://innopolis.university/en/teachingexcellencecenter/trainings/
30614 distinct word(s) so far
https://innopolis.university/en/teachingexcellencecenter/#block4504
30614 distinct word(s) so far
https://www.iswnetwork.ca
30651 distinct word(s) so far
mailto:o.zhirosh@innopolis.ru
30651 distinct word(s) so far
https://innopolis.university/en/writinghubhome/?special=Y
30651 distinct word(s) so far
https://innopolis.university/writinghubhome/
30651 distinct word(s) so far
https://innopolis.university/en/writinghubhome/#block4697
30651 distinct word(s) so far
https://innopolis.university/en/writinghubhome/whoweworkfor/
30654 distinct word(s) so far
https://innopolis.university/en/writinghubhome/writinghubtutors/
30655 distinct word(s) so far
https://innopolis.university/en/writinghubhome/signupforconsultation/
30666 distinct word(s) so far
https://innopolis.university/en/writinghubhome/academicstylereminder/
30865 distinct word(s) so far
https://innopolis.uni

101275 distinct word(s) so far
https://media.innopolis.university/events/olimpiada-innopolis-open-po-informatike/
101277 distinct word(s) so far
https://media.innopolis.university/events/
101279 distinct word(s) so far
https://corporate.innopolis.university
101279 distinct word(s) so far
http://www.innopolis.com/city/how-to-get/
101353 distinct word(s) so far
tel:+7 (937) 586-43-17
101353 distinct word(s) so far
https://minobrnauki.gov.ru
101416 distinct word(s) so far
https://innopolis.university/files/politicacookies.pdf
Skipping https://innopolis.university/files/politicacookies.pdf
https://www.minobrnauki.gov.ru/action/situational_center/
101436 distinct word(s) so far
https://innopolis.university/antiterror/
101467 distinct word(s) so far
https://innopolis.university/en/ido/?special=Y
101467 distinct word(s) so far
https://innopolis.university/en/ido/#block5358
101467 distinct word(s) so far
tel:+7 (843) 239-24-52
101467 distinct word(s) so far
https://university.innopolis.ru/
101

103241 distinct word(s) so far
https://us02web.zoom.us/j/86924417470?pwd=cWpHUlk4OG5BcVNjek94Qk9mMEFjQT09
103241 distinct word(s) so far
https://us02web.zoom.us/j/82115193290?pwd=aStDSVpXUnV5Q0JwZXB3Lys4YmJQUT09
103241 distinct word(s) so far
https://us02web.zoom.us/j/81023939550?pwd=ZHZka0hGTVh4RWFoOUZ3V3lESjVYUT09
103241 distinct word(s) so far
https://us02web.zoom.us/j/84204806112?pwd=Q0ROZkw3bmVGUTB2Q3pYRjFPRWk0UT09
103241 distinct word(s) so far
https://media.innopolis.university/?TAGS=Наука
103352 distinct word(s) so far
https://media.innopolis.university/?TAGS= Международное сотрудничество
103415 distinct word(s) so far
https://cs.gssi.it/devops2020/
103475 distinct word(s) so far
https://media.innopolis.university/news/Indie-GameDev-hack/
103516 distinct word(s) so far
https://media.innopolis.university/news/davos-iu/
103549 distinct word(s) so far
https://media.innopolis.university/news/digital-operating-room/
103643 distinct word(s) so far
https://media.innopolis.university/n

105228 distinct word(s) so far
https://www.youtube.com/
105228 distinct word(s) so far
https://www.youtube.com/about/
105239 distinct word(s) so far
https://www.youtube.com/about/press/
105290 distinct word(s) so far
https://www.youtube.com/about/copyright/
105885 distinct word(s) so far
https://www.youtube.com/t/contact_us/
105939 distinct word(s) so far
https://www.youtube.com/creators/
106071 distinct word(s) so far
https://www.youtube.com/ads/
106391 distinct word(s) so far
https://developers.google.com/youtube
106437 distinct word(s) so far
https://www.youtube.com/t/terms
107089 distinct word(s) so far
https://www.youtube.com/t/privacy
108092 distinct word(s) so far
https://www.youtube.com/about/policies/
108360 distinct word(s) so far
https://www.youtube.com/howyoutubeworks?utm_campaign=ytgen&utm_source=ythp&utm_medium=LeftNav&utm_content=txt&u=https%3A%2F%2Fwww.youtube.com%2Fhowyoutubeworks%3Futm_source%3Dythp%26utm_medium%3DLeftNav%26utm_campaign%3Dytgen
108454 distinct word(s)