# 1. HTML structure

HTML stands for HyperText Markup Language.

Unlike a scripting or programming language that uses scripts to perform functions, a markup language uses tags to identify content.
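
To make the idea of tags identifying content concrete, here is a minimal sketch that uses the standard library's `html.parser` to list the tags and the text of a small document. The `TagCollector` class and the `sample` string are made up for this illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names and the non-empty text between tags."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, e.g. <p> or <b>
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only chunks
        if data.strip():
            self.texts.append(data.strip())


# Hypothetical sample document
sample = "<html><body><h1>Title</h1><p>Some <b>bold</b> text</p></body></html>"

collector = TagCollector()
collector.feed(sample)
collector.close()

print(collector.tags)   # ['html', 'body', 'h1', 'p', 'b']
print(collector.texts)  # ['Title', 'Some', 'bold', 'text']
```

The tags describe what each piece of content is (a heading, a paragraph, bold text); they do not perform any computation themselves.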

**The Web Structure**

The ability to code using HTML is essential for any web professional. Acquiring this skill should be the starting point for anyone who is learning how to create content for the web.

Modern Web Design

- HTML: Structure
- CSS: Presentation
- JavaScript: Behavior
- PHP or similar: Backend
- CMS: Content Management


# 2. Regular expressions

https://regex101.com/


## 2.1 Main methods

https://tproger.ru/translations/regular-expression-python/

https://habr.com/ru/post/349860/?utm_source=habrahabr&utm_medium=rss&utm_campaign=349860


- re.match(pattern, string) - looks for a match only at the beginning of the string

- re.search(pattern, string) - scans the whole string, but returns ONLY the first match

- re.findall(pattern, string) - returns ALL non-overlapping matches as a list

- re.split(pattern, string, maxsplit=0) - splits the string at every match of the pattern (maxsplit=0 means no limit)

- re.sub(pattern, repl, string) - replaces every match of the pattern with repl

- re.compile(pattern, flags=0) - compiles a regular expression into a pattern object whose methods (match, search, findall, ...) can be reused

In [18]:
import re
## re.match
result_true = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result_true)
print(result_true.group(0))

result_false = re.match(r'Analytics', 'AV Analytics Vidhya AV')
print(result_false)

# Start position
print(result_true.start())
# End position
print(result_true.end())

## re.search
result_search = re.search(r'Analytics', 'AV Analytics Vidhya AV')
print(result_search.group(0))

## re.findall 
result_all = re.findall(r'AV', 'AV Analytics Vidhya AV')
print(result_all)

## re.split [maxsplit=0]
result_split = re.split(r'y', 'Analytics')
print(result_split)

result_split_1 = re.split(r'i', 'Analytics Vidhya', maxsplit=1)
print(result_split_1)


result_split_2 = re.split(r'i', 'Analytics Vidhya', maxsplit=2)
print(result_split_2)
## re.sub
result_sub = re.sub(r'India', 
                    'the World', 
                    'AV is largest Analytics community of India')
print(result_sub)
## re.compile
pattern = re.compile('AV')
result_compile1 = pattern.findall('AV Analytics Vidhya AV')
print(result_compile1)
result_compile2 = pattern.findall('AV is largest analytics community of India')
print(result_compile2)

<re.Match object; span=(0, 2), match='AV'>
AV
None
0
2
Analytics
['AV', 'AV']
['Anal', 'tics']
['Analyt', 'cs Vidhya']
['Analyt', 'cs V', 'dhya']
AV is largest Analytics community of the World
['AV', 'AV']
['AV']


## 2.2 Ways to build a pattern

### 2.2.0 Flags modifiers

- g - Global

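Tells the engine not to stop after the first match has been found, but rather to continue until no more matches can be found. Python's re module has no g flag: re.findall and re.finditer already return every match, while re.search stops at the first one.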

- m - Multiline

The ^ and $ anchors now match at the beginning/end of each line respectively, instead of beginning/end of the entire string.

- i - Case insensitive

A case insensitive match is performed, meaning capital letters will be matched by non-capital letters and vice versa.

- x - ignore whitespace 

This flag tells the engine to ignore all whitespace and allow for comments in the regex. Comments are indicated by a starting "#"-character. If you need to include a space character in your regex, it must now be escaped '\ '.

- s - single line

This enables the dot (.) metacharacter to also match new lines. The string could be visualised as a single line input.

- u - enable unicode support
- a - restrict matches to ASCII only 
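
In Python the single-letter modifiers above are passed as constants of the `re` module (`re.IGNORECASE`, `re.MULTILINE`, `re.DOTALL`, `re.VERBOSE`, `re.ASCII`). The following is a small sketch of how each of them changes a match; the sample strings are invented for the illustration.

```python
import re

text = "First line\nsecond LINE"

# i -> re.IGNORECASE: letters match regardless of case
case_insensitive = re.findall(r"line", text, flags=re.IGNORECASE)
print(case_insensitive)  # ['line', 'LINE']

# m -> re.MULTILINE: ^ also matches right after every newline
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(line_starts)  # ['First', 'second']

# s -> re.DOTALL: the dot also matches the newline character
dot_all = re.search(r"First.*LINE", text, flags=re.DOTALL)
no_dot_all = re.search(r"First.*LINE", text)
print(dot_all is not None, no_dot_all is None)  # True True

# x -> re.VERBOSE: whitespace and #comments inside the pattern are ignored,
# so a literal space has to be escaped
verbose = re.compile(r"""
    \d{3} \ \d{2}   # three digits, an escaped space, two digits
""", re.VERBOSE)
print(verbose.findall("call 123 45 now"))  # ['123 45']

# a -> re.ASCII: \w, \d, \s and \b only consider ASCII characters
ascii_words = re.findall(r"\w+", "naïve", flags=re.ASCII)
unicode_words = re.findall(r"\w+", "naïve")
print(ascii_words, unicode_words)  # ['na', 've'] ['naïve']
```

Several flags can be combined with `|`, for example `re.IGNORECASE | re.MULTILINE`.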

### 2.2.1 General tokens

- \n - Matches a newline character

- \r - Matches a carriage return character (U+000D), most often visually represented in unicode using U+240D.

- \t - Matches a tab character. Historically, tab stops happen every 8 characters.

- \0 - Matches a null character, most often visually represented in unicode using U+2400.
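
As a quick illustration of these tokens, the sketch below finds tabs and line breaks in a made-up tab-separated string and splits it into fields. The `sample` string is an assumption for the example only.

```python
import re

# Hypothetical input: two tab-separated rows, one ending in \r\n and one in \n
sample = "name\tvalue\r\nnext\tline\n"

tabs = re.findall(r"\t", sample)
print(tabs)  # ['\t', '\t']

# \r\n is tried first so a Windows line ending is reported as one match
line_breaks = re.findall(r"\r\n|\n", sample)
print(line_breaks)  # ['\r\n', '\n']

# Split on tabs and on either kind of line ending
fields = re.split(r"\t|\r?\n", sample)
print(fields)  # ['name', 'value', 'next', 'line', '']

# \0 matches the null character
nulls = re.findall(r"\0", "a\0b")
print(nulls)  # ['\x00']
```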

### 2.2.2 Character classes

In [28]:
# [abc] - Matches either an a, b or c character
print(re.findall(r"[abc]", "alpha"))

# [^abc] - Matches any character except for an a, b or c
print(re.findall(r"[^abc]", "alpha"))

# [a-z] Matches any characters between a and z, including a and z.
print(re.findall(r"[a-z]", "alpha12314"))

# [^a-z] Matches any characters except those in the range a-z.
print(re.findall(r"[^a-z]", "alpha12314"))

# [a-zA-Z] Matches any characters between a-z or A-Z. 
#You can combine as much as you please.
print(re.findall(r"[a-zA-Z]", "alpha12314ALPHA"))

['a', 'a']
['l', 'p', 'h']
['a', 'l', 'p', 'h', 'a']
['1', '2', '3', '1', '4']
['a', 'l', 'p', 'h', 'a', 'A', 'L', 'P', 'H', 'A']


### 2.2.3 Anchors

In [108]:
# ^ Matches the start of a string without consuming any characters. 
#If multiline mode is used, this will also match immediately 
#after a newline character.
print(re.findall(r"^alpha 123", "alpha 12314 ALPHA\-/ aaaaaaaa"))

# $ - Matches the end of a string without consuming any characters. 
#If multiline mode is used, this will also match immediately 
#before a newline character.
print(re.findall(r" end$", "alpha 12314 ALPHA\-/ end"))

# \b - word boundary
print(re.findall(r"\b[Xyu]", "Xyuxuyyj"))

# \B - non-word boundary
print(re.findall(r"\B[Xyu]", "Xyuxuyyj"))

# \A - Matches the start of a string only. 
#Unlike ^, this is not affected by multiline mode.
print(re.findall(r"\A\w+", "start of string"))

# \Z - Matches the end of a string only. 
#Unlike $, this is not affected by multiline mode.
print(re.findall(r"\w+\Z", "start of string"))


['alpha 123']
[' end']
['X']
['y', 'u', 'u', 'y', 'y']
['start']
['string']
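
The difference between `^`/`$` and `\A`/`\Z` only shows up on multi-line input. A short sketch with an invented three-line string:

```python
import re

text = "first line\nsecond line\nthird line"

# Without re.MULTILINE, ^ matches only at the very start of the string
start_default = re.findall(r"^\w+", text)
print(start_default)  # ['first']

# With re.MULTILINE, ^ matches at the start of every line
start_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(start_multiline)  # ['first', 'second', 'third']

# \A ignores the flag and still matches only at the start of the string
start_absolute = re.findall(r"\A\w+", text, flags=re.MULTILINE)
print(start_absolute)  # ['first']

# The same holds for $ versus \Z at the end
end_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)
end_absolute = re.findall(r"\w+\Z", text, flags=re.MULTILINE)
print(end_multiline)  # ['line', 'line', 'line']
print(end_absolute)   # ['line']
```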


### 2.2.4 Quantifiers

In [144]:
# a? - Matches an `a` character or nothing.
print(re.findall(r"a?", "alpha 12314 ALPHA\-/ aaaa"))

# a* - Matches zero or more consecutive `a` characters.
print(re.findall(r"a*", "alpha 12314 ALPHA\-/ aaaaaaaa"))

# a*? - Lazy version of a*: matches as few characters as possible.
# Any quantifier can be made lazy by appending ?, e.g. \w*? below.
print(re.findall(r"\w*?", "r re regex"))

# a+ - Matches one or more consecutive `a` characters.
print(re.findall(r"a+", "alpha 12314 ALPHA\-/ aaaaaaaa"))

# a{3} - Matches exactly 3 consecutive `a` characters.
print(re.findall(r"a{3}", "alpha 12314 ALPHA\-/ aaaaaaaa"))

# a{3,} - Matches at least 3 consecutive `a` characters.
print(re.findall(r"a{3,}", "alpha 12314 ALPHA\-/ aaaaaaaa"))

# a{3,6} - Matches between 3 and 6 (inclusive) consecutive `a` characters.
print(re.findall(r"a{3,6}", "alpha 12314 ALPHA\-/ aaaaaaaa"))

['a', '', '', '', 'a', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'a', 'a', 'a', 'a', '']
['a', '', '', '', 'a', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'aaaaaaaa', '']
['', 'r', '', '', 'r', '', 'e', '', '', 'r', '', 'e', '', 'g', '', 'e', '', 'x', '']
['a', 'a', 'aaaaaaaa']
['aaa', 'aaa']
['aaaaaaaa']
['aaaaaa']
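
The effect of a lazy quantifier is easier to see next to its greedy counterpart. A minimal sketch on a made-up HTML fragment:

```python
import re

# Hypothetical fragment used only for the comparison
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so the match runs to the last ">"
greedy = re.findall(r"<.+>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: .+? stops at the first ">" that lets the pattern succeed
lazy = re.findall(r"<.+?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']

# The same applies to counted quantifiers: {2,3}? prefers two characters
lazy_counted = re.findall(r"a{2,3}?", "aaaaaa")
greedy_counted = re.findall(r"a{2,3}", "aaaaaa")
print(lazy_counted)    # ['aa', 'aa', 'aa']
print(greedy_counted)  # ['aaa', 'aaa']
```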


### 2.2.5 Meta sequences

In [143]:
# . - Matches any character other than newline (or including newline with the s flag, re.DOTALL in Python)
print(re.findall(r".", "alpha12314ALPHA"))

# \s - Matches any space, tab or newline character.
print(re.findall(r"\s", "alpha 12314 ALPHA"))

# \S - Matches anything other than a space, tab or newline.
print(re.findall(r"\S", "alpha 12314 ALPHA"))

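# \d - Matches any decimal digit. Equivalent to [0-9] when the ASCII flag is set;
# for str patterns in Python 3 it also matches other Unicode decimal digits.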
print(re.findall(r"\d", "alpha 12314 ALPHA"))

# \D - Matches anything other than a decimal digit.
print(re.findall(r"\D", "alpha 12314 ALPHA"))

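# \w - Matches any letter, digit or underscore. Equivalent to [a-zA-Z0-9_] when the ASCII flag is set;
# for str patterns in Python 3 it also matches non-ASCII letters and digits.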
print(re.findall(r"\w", "alpha 12314 ALPHA-/-"))

# \W - Matches anything other than a letter, digit or underscore.
print(re.findall(r"\W", "alpha 12314 ALPHA\-/"))

# \v - Matches a vertical tab character (U+000B). In Python's re module it does not
# match newlines; in PCRE-based flavors \v matches any vertical whitespace instead.
print(re.findall(r"\v", "line one\vline two"))

# [\b] - Matches the backspace control character (only inside a character class;
# outside of one, \b is a word boundary).
print(re.findall(r"[\b]", "abc\bdef"))


['a', 'l', 'p', 'h', 'a', '1', '2', '3', '1', '4', 'A', 'L', 'P', 'H', 'A']
[' ', ' ']
['a', 'l', 'p', 'h', 'a', '1', '2', '3', '1', '4', 'A', 'L', 'P', 'H', 'A']
['1', '2', '3', '1', '4']
['a', 'l', 'p', 'h', 'a', ' ', ' ', 'A', 'L', 'P', 'H', 'A']
['a', 'l', 'p', 'h', 'a', '1', '2', '3', '1', '4', 'A', 'L', 'P', 'H', 'A']
[' ', ' ', '\\', '-', '/']
['\x0b']
['\x08']


### 2.2.6 Group constructs

In [30]:
# (...) - Parts of the regex enclosed in parentheses may be referred 
#to later in the expression or extracted from the results of a successful match.
print(re.findall(r"alpha\s(...)", "alpha 12+14 ALPHA\-/"))

# (a|b) - Matches the a or the b part of the subexpression.
print(re.findall(r"(alpha...| ALPHA...)", "alpha 12314 ALPHA\-/"))

# (?:...) - This construct is similar to (...), but won't create a capture group.
print(re.findall(r"(?:he)+", "heheh he heh"))

# (?#...) - Any text appearing in this group is ignored in the regex. 
#Another option is enabling the x flag to allow #comments. 
#This flag will also cause regex to ignore spaces.
print(re.findall(r"Not(?# .* <-- that should match all)",
                "Nothing else matches."))

# (?P<name>...) - This capturing group can be referred to using the given name 
#instead of a number. Alternative notation for (?<name>...) and (?'name'...) 
#when using a PCRE flavor.
print(re.findall(r"(?P<name>Sally)", "Call me Sally."))

# (?P=name) - Matches the text matched by a previously named capture group. 
#This is the python specific notation.
print(re.findall(r"(?P<named_group>cool)[a-z ]+(?P=named_group)", 
                 "cool is cool"))

# (?!...) - Starting at the current position in the expression, 
#ensures that the given pattern will not match. Does not consume characters.
print(re.findall(r"foo(?!bar)", 
                 "foobar foobaz"))

#(?=...) - Asserts that the given subpattern can be matched here, 
#without consuming characters

print(re.findall(r"foo(?=bar)", "foobar foobaz"))


['12+']
['alpha 12', ' ALPHA\\-/']
['hehe', 'he', 'he']
['Not']
['Sally']
['cool']
['foo']
['foo']
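
`re.findall` only returns lists of strings, so named groups are more useful on a match object. The sketch below uses an invented date pattern to show `group`, `groupdict` and a named backreference in `re.sub`.

```python
import re

# Hypothetical pattern: an ISO-style date with three named groups
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

match = pattern.search("Released on 2020-10-21.")
print(match.group("year"))  # 2020
print(match.groups())       # ('2020', '10', '21')
print(match.groupdict())    # {'year': '2020', 'month': '10', 'day': '21'}

# With several groups, findall returns one tuple per match
all_dates = pattern.findall("2020-10-21 and 2021-01-05")
print(all_dates)  # [('2020', '10', '21'), ('2021', '01', '05')]

# \g<name> refers to a named group inside the replacement string
reordered = pattern.sub(r"\g<day>.\g<month>.\g<year>", "2020-10-21")
print(reordered)  # 21.10.2020
```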


# 3. Requests and BeautifulSoup

In [3]:
from bs4 import BeautifulSoup # for extracting data from web pages
import re # for regular expressions
import requests # get access to web page info
def get_html(url):
    ''' Take a URL and return the raw content of the page '''
    r = requests.get(url)
    return r.content
url = "https://www.avito.ru/moskva"
# Name the parser explicitly so the result does not depend on which parsers are installed
bs = BeautifulSoup(get_html(url), "html.parser")
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/

## 3.1 Navigate across web pages

In [8]:
print(bs.title)
print(bs.title.name)
print(bs.title.string)
print(bs.title.parent.name)

<title>Авито — объявления в Москве — Объявления на сайте Авито</title>
title
Авито — объявления в Москве — Объявления на сайте Авито
head


In [35]:
bs.title.text

'Авито — объявления в\xa0Москве — Объявления на\xa0сайте Авито'

In [14]:
print(bs.p)
print(bs.p['class'])
print(bs.a)

<p class="sendout-banner-note-1nFyo">Вы сразу узнаете, если придёт сообщение, появится новое предложение в избранном или кто-то купит ваш товар с доставкой.</p>
['sendout-banner-note-1nFyo']
<a class="link-link-39EVK link-design-default-2sPEv top-banner-action-KNoaa" data-marker="title" href="/dostavka/bezopasnost" rel="noopener" target="_blank" title="советы о безопасных сделках">советы о безопасных сделках</a>


In [17]:
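# Calling a tag like a function is shorthand for .find_all(),
# so this returns every tag nested inside the parent of the second link
bs.find_all('a')[1].parent()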

[<a class="header-link-nCPL_ header-nav-link-3yDcQ" href="/business" rel="" target="" title="">Для бизнеса</a>]

In [17]:
bs.find(class_="header-link-nCPL_ header-nav-link-3yDcQ")

<a class="header-link-nCPL_ header-nav-link-3yDcQ" href="/business" rel="" target="" title="">Для бизнеса</a>

In [None]:
# Find with a custom filter: the first <tr> whose text contains "Статус" ("Status")
bs.find(lambda tag: tag.name == "tr" and "Статус" in tag.text)

In [1]:
# .find_all()

# .parent
# .find_parent()

# .parents
# .find_parents()

# .find_next_sibling()
# .find_previous_sibling()
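
The navigation methods listed above can be tried on a small static document instead of a live page. This is a sketch under the assumption that `bs4` is installed; the HTML table is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table used only for this illustration
html = """
<table>
  <tr><td>Name</td><td>Alice</td></tr>
  <tr><td>Status</td><td>Active</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# .find_all(): every matching tag
all_rows = soup.find_all("tr")
print(len(all_rows))  # 2

# Locate a cell by its exact text, then move sideways and upwards from it
status_cell = soup.find("td", string="Status")
value_cell = status_cell.find_next_sibling("td")
first_cell = value_cell.find_previous_sibling("td")
row = status_cell.find_parent("tr")

print(value_cell.text)  # Active
print(first_cell.text)  # Status
print(row.name)         # tr

# .parents walks up through all ancestors
parent_names = [parent.name for parent in status_cell.parents]
print(parent_names)  # ['tr', 'table', '[document]']
```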

In [None]:
# All <div> tags whose text contains a number of 1 to 9 digits
salary = bs.find_all('div', string=re.compile(r'\d{1,9}'))

## 3.2 Extracting data

In [18]:
# URLs found in the href attribute of every <a> tag on the page
for link in bs.find_all('a'):
    print(link.get('href'))

/dostavka/bezopasnost
/business
/moskva
/shops/moskva
//support.avito.ru
/favorites
#login?authsrc=h
#login?next=%2Fadditem&authsrc=ca
None
/moskva/transport?cd=1
/moskva/nedvizhimost?cd=1
/moskva/rabota?cd=1
/moskva/predlozheniya_uslug?cd=1
None
/moskva/lichnye_veschi
/moskva/transport
/moskva/hobbi_i_otdyh
/moskva/dlya_doma_i_dachi
/moskva/bytovaya_elektronika
/moskva/rabota
/moskva/predlozheniya_uslug
/moskva/nedvizhimost
/moskva/dlya_biznesa
/moskva/zhivotnye
/moskva/mebel_i_interer/krovat_iz_metalla_pod_matras_2000h1800mm_1047406539
/moskva/mebel_i_interer/krovat_iz_metalla_pod_matras_2000h1800mm_1047406539
/moskva/odezhda_obuv_aksessuary/mayka_top_goluboy_1764910737
/moskva/odezhda_obuv_aksessuary/mayka_top_goluboy_1764910737
/moskva/avtomobili/chevrolet_niva_2004_1964103712
/moskva/avtomobili/chevrolet_niva_2004_1964103712
/moskva/avtomobili/mercedes-benz_cls-klass_2013_1824402064
/moskva/avtomobili/mercedes-benz_cls-klass_2013_1824402064
/moskva/odezhda_obuv_aksessuary/plaschi_
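
Most of the extracted links are relative (`/business`), protocol-relative (`//support.avito.ru`), fragments (`#login...`) or missing (`None`). Before they can be requested they have to be resolved against the page URL. A minimal sketch with the standard library's `urljoin`, using a handful of the values printed above:

```python
from urllib.parse import urljoin

base_url = "https://www.avito.ru/moskva"

# A few hrefs copied from the output above; None stands for an <a> without href
hrefs = [
    "/business",
    "//support.avito.ru",
    "#login?authsrc=h",
    None,
    "/moskva/transport?cd=1",
]

# Skip missing hrefs and resolve the rest against the page URL
absolute_links = [urljoin(base_url, href) for href in hrefs if href is not None]

for link in absolute_links:
    print(link)
# https://www.avito.ru/business
# https://support.avito.ru
# https://www.avito.ru/moskva#login?authsrc=h
# https://www.avito.ru/moskva/transport?cd=1
```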

In [19]:
# Text from a page
print(bs.get_text())

  


  Авито — объявления в Москве — Объявления на сайте Авито

Не дайте себя обмануть — почитайтесоветы о безопасных сделкахДля бизнесаОбъявленияМагазиныПомощьВход и регистрацияПодать объявлениеtimingАвтоНедвижимостьРаботаУслугиещёЛюбая категорияТранспортАвтомобилиМотоциклы и мототехникаГрузовики и спецтехникаВодный транспортЗапчасти и аксессуарыНедвижимостьКвартирыКомнатыДома, дачи, коттеджиЗемельные участкиГаражи и машиноместаКоммерческая недвижимостьНедвижимость за рубежомРаботаВакансииРезюмеУслугиЛичные вещиОдежда, обувь, аксессуарыДетская одежда и обувьТовары для детей и игрушкиЧасы и украшенияКрасота и здоровьеДля дома и дачиБытовая техникаМебель и интерьерПосуда и товары для кухниПродукты питанияРемонт и строительствоРастенияБытовая электроникаАудио и видеоИгры, приставки и программыНастольные компьютерыНоутбукиОргтехника и расходникиПланшеты и электронные книгиТелефоныТовары для компьютераФототехникаХобби и отдыхБилеты и путешествияВелоси

In [None]:
# !pip3 install newspaper3k
from newspaper import Article
article = Article("https://www.rbc.ru/spb_sz/21/10/2020/5f9044cb9a79474426d4d5c9")
article.download()
article.parse()
article.text.replace("\n\n"," ")

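The newspaper3k package used above extracts article text from news pages. Its authors describe it as inspired by requests for its simplicity and powered by lxml for its speed: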

https://pypi.org/project/newspaper3k/

# 4. Selenium 

## 4.0 Getting started

Here we will use Google Chrome as the browser to automate parsing, because it is a fast and simple way to do it.

First of all, download and unpack the chromedriver that is compatible with your Chrome version: https://chromedriver.chromium.org/

Full documentation of selenium:

https://selenium-python.readthedocs.io/

In [43]:
#!pip install selenium
from selenium import webdriver
# Chrome driver path
chrome_driver_path = "/Users/iakubovskii/PythonR/anaconda3/chromedriver"
driver = webdriver.Chrome(executable_path = chrome_driver_path)
href = "https://www.google.ru/"
driver.get(href)


In [46]:
driver.find_element_by_xpath('//*[@id="tsf"]/div[2]/div[1]/div[3]/center/input[2]').click()

In [48]:
driver.find_element_by_xpath('//*[@id="tsf"]/div[2]/div[1]/div[1]/div/div[2]/input').clear()

In [50]:
driver.forward()

## 4.1 Browser options

In [44]:
# Main options of Google Chrome

from selenium.webdriver.chrome.options import Options

## 4.2 Key methods in selenium

In [51]:
# Find elements

# element = driver.find_element_by_id("passwd-id")
# element = driver.find_element_by_name("passwd")
# element = driver.find_element_by_xpath("//input[@id='passwd-id']")
# element = driver.find_element_by_class_name("passwd-class")  # "passwd-class" is a hypothetical class name

# All methods

# find_element_by_id
# find_element_by_name
# find_element_by_xpath
# find_element_by_link_text
# find_element_by_partial_link_text
# find_element_by_tag_name
# find_element_by_class_name
# find_element_by_css_selector

# Methods

# element.click() # click the element
# element.clear() # clear the text of the element
# element.send_keys("some text") # type text into the element

# Imitating Back and forward command in browser

# driver.forward()
# driver.back()

# Imitating Keys

from selenium.webdriver.common.keys import Keys
# Keys.ADD
# Keys.ARROW_DOWN, Keys.ARROW_LEFT, Keys.ARROW_UP, Keys.ARROW_RIGHT
# Keys.BACKSPACE
# Keys.ENTER
# ... Type "Keys." and press Tab to list all available keys




## 4.3 Select elements

One of the main uses of Selenium is selecting elements on web pages. The Select helper below works with `<select>` drop-down lists.

In [53]:
from selenium.webdriver.support.ui import Select

# Select element
# select = Select(driver.find_element_by_name('name'))

# Methods
# select.select_by_index(index)
# select.select_by_visible_text("text")
# select.select_by_value(value)



## 4.4 Moving between windows, frames and popups

In [None]:
# driver.window_handles - list of handles of all open windows and tabs

# Switch to another window
# driver.switch_to.window("windowName")
# Switch to another frame
# driver.switch_to.frame("frameName")
# Switch to a popup (alert)
# driver.switch_to.alert

In [None]:
# Go back one page in the browser history
# driver.execute_script("window.history.go(-1)")
# Maximize window 
# driver.maximize_window()
# driver.execute_script("window.resizeTo(1920,1080)")



## 4.5 Exceptions, waits

In [None]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

## 4.6 XPATH structure

In [None]:
# driver.find_element_by_xpath("//input[@id='passwd-id']")
# //          - search the whole document, ignoring the structure above the element
# input       - name of the tag to find
# [...]       - predicate: a condition the tag has to satisfy
# @id         - attribute to test (here: id)
# 'passwd-id' - required value of that attribute

# Select an element by its text via XPath
# driver.find_element_by_xpath("//div[text()='Smth_text']")
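
The structure of such expressions can be explored without a browser. The sketch below runs the same kind of queries with the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs a relative path (`.//` instead of `//`); Selenium itself accepts full XPath. The small document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page used only for this illustration
page = ET.fromstring("""
<html>
  <body>
    <form>
      <input id="user-id" type="text" />
      <input id="passwd-id" type="password" />
    </form>
    <div>Smth_text</div>
    <div>Other text</div>
  </body>
</html>
""")

# .//input            - any <input> below the root, at any depth
all_inputs = page.findall(".//input")
print(len(all_inputs))  # 2

# [@id='passwd-id']   - keep only the tag whose id attribute has this value
password_input = page.find(".//input[@id='passwd-id']")
print(password_input.get("type"))  # password

# [.='Smth_text']     - keep only the tag whose text equals the given string
div_by_text = page.find(".//div[.='Smth_text']")
print(div_by_text.text)  # Smth_text
```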

## 4.9 Solutions to common problems in Selenium

# 5. User Agent & Proxy  

In [None]:
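import requests
from user_agent import generate_user_agent
# `post` and `proxy` have to be defined before this call
info = requests.get("https://www.alta.ru/tam/" + post + "/",
                    headers={"User-Agent": generate_user_agent()},
                    proxies={"http": proxy})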
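
To see what such a request carries without sending anything, the sketch below builds an equivalent request with the standard library's `urllib` instead of `requests`. The user-agent string, the proxy address and the URL are placeholders chosen for the example, not values from this notebook.

```python
import urllib.request

# Hypothetical values, for illustration only
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/0.1"
PROXY = "http://127.0.0.1:8080"
URL = "https://example.com/"

# The User-Agent header tells the server which client is asking
request = urllib.request.Request(URL, headers={"User-Agent": USER_AGENT})

# The proxy mapping says which proxy to use for each URL scheme
proxy_handler = urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
opener = urllib.request.build_opener(proxy_handler)

# urllib stores header names capitalized as "User-agent"
print(request.get_header("User-agent"))
print(proxy_handler.proxies)

# Nothing has been sent so far; the request would go out through the proxy with:
# response = opener.open(request, timeout=10)
```

Rotating the user agent and the proxy between requests makes consecutive requests look less alike to the server.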

In [19]:
'''
Versions at the time of development:
Python - version 3.8.3
Selenium - version 3.141.0
Tor - version Tor-Browser 4.0
Chrome - version 86.0.4240.75 (Official Build) (64-bit)

The login function starts Tor and opens Chrome through it,
which avoids any hassle with IP masquerading
https://tor.en.uptodown.com/windows/versions
'''

import glob
import os
import random
import re
import shutil
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import (NoSuchElementException,
                                        StaleElementReferenceException)
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait
from tqdm import notebook as tqdm_notebook
from tqdm.notebook import tqdm
from user_agent import generate_user_agent
#from fp.fp import FreeProxy
#from hyperdash import monitor_cell



In [23]:
torexe = os.popen("/Users/iakubovskii/Applications/Tor Browser.app")

In [26]:
torexe

<os._wrap_close at 0x7fc561e96e80>

In [22]:
def option_chrome(download_directory, proxy):  # configure the browser
    chromeOptions = webdriver.ChromeOptions()
    prefs = {"download.default_directory": download_directory}
    chromeOptions.add_argument('--proxy-server=%s' % proxy)
    chromeOptions.add_argument("user-agent=" + generate_user_agent())
    chromeOptions.add_experimental_option("prefs", prefs)
    chromeOptions.add_argument("--start-fullscreen")
    return chromeOptions

def login(download_directory):  # logs in and returns to the specified country
    global browser

    # Start Tor so that Chrome can be routed through its SOCKS proxy
    torexe = os.popen("/Users/iakubovskii/Applications/Tor Browser.app")
    PROXY = "socks5://localhost:9050"  # IP:PORT or HOST:PORT
    options = webdriver.ChromeOptions()
    options.add_argument('--proxy-server=%s' % PROXY)
    prefs = {"download.default_directory": download_directory}

    options.add_argument("user-agent=" + generate_user_agent())
    options.add_experimental_option("prefs", prefs)

    # Pass options=; the older chrome_options= keyword is deprecated
    browser = webdriver.Chrome(options=options,
                               executable_path="/Users/iakubovskii/PythonR/anaconda3/chromedriver")

    browser.implicitly_wait(10)
    browser.get("https://www.trademap.org/")
    return browser

br = login("/Users/iakubovskii/Machine_Learning/Datasets")



WebDriverException: Message: unknown error: net::ERR_PROXY_CONNECTION_FAILED
  (Session info: chrome=86.0.4240.111)

ERR_PROXY_CONNECTION_FAILED means that Chrome could not reach the proxy at localhost:9050, for example because Tor was not running yet or was listening on another port (Tor Browser typically exposes its SOCKS proxy on port 9150, while the standalone tor service uses 9050).


# 6. API

## 6.1 Yandex 

## 6.2 Google