# Lesson 7: Web Automation with Selenium





Version 1.1. Prepared by [Makzan](https://makzan.net). Updated January 2021.

In this series, we spend three lectures learning how to fetch data online. The series covers:

- Finding patterns in URLs
- Opening web URLs
- Downloading files in Python
- Fetching data with APIs
- Web scraping with Requests and BeautifulSoup
- **Web automation with Selenium**
- **Converting Wikipedia tabular data into CSV**

We use Selenium when:
- Requests and BeautifulSoup do not work.
- The page requires JavaScript to render the data.

Pros:
- It launches and automates a real browser.
- Better compatibility with JavaScript-heavy pages, since a real browser renders them.

Cons:
- It is slow because it launches a real browser.
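To check whether a page needs JavaScript to render its data, we can look at the raw HTML that Requests would receive: if the elements we want are missing there, they are probably added by a script later. The sketch below does that check with the standard library only. The helper name `static_html_has_class` and the two sample pages are made up for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count the tags in static HTML that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def static_html_has_class(html, css_class):
    """Return True if any tag in the raw HTML has the CSS class."""
    parser = ClassCounter(css_class)
    parser.feed(html)
    return parser.count > 0


# Hypothetical pages: one rendered on the server, one filled in by JavaScript.
static_page = '<div class="topic">Phones</div>'
js_page = '<div id="app"></div><script>render()</script>'

print(static_html_has_class(static_page, "topic"))  # True
print(static_html_has_class(js_page, "topic"))      # False
```

If the check returns `False` for HTML downloaded with Requests while the element is visible in a normal browser, the page is a likely candidate for Selenium.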


## Downloading browser driver

We need a web browser driver to use Selenium. Download the one that matches your browser:

- [Gecko Driver for Firefox](https://github.com/mozilla/geckodriver/releases)
- [Chrome Driver](https://chromedriver.chromium.org/)

In [1]:
pip install selenium

Collecting selenium
  Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

In [3]:
options = Options()
# Uncomment to run Chrome without a visible window.
# options.add_argument('--headless')

browser = webdriver.Chrome(options=options)
browser.quit()

WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home


If Selenium raises an error saying that the `chromedriver` executable needs to be in PATH, we can specify the path to the driver when creating the browser instance:

In [7]:
browser = webdriver.Chrome('./chromedriver', options=options)
browser.quit()
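Before creating the browser, it can help to find out where the driver actually is. The following is a minimal sketch, assuming the driver is either on PATH or saved next to the notebook as `./chromedriver`. The helper name `locate_driver` is made up for illustration.

```python
import os
import shutil


def locate_driver(name="chromedriver", local_path="./chromedriver"):
    """Return a usable driver path, or None if no driver is found."""
    on_path = shutil.which(name)  # search the directories listed in PATH
    if on_path:
        return on_path
    if os.path.isfile(local_path):  # fall back to a file next to the notebook
        return os.path.abspath(local_path)
    return None


driver_path = locate_driver()
if driver_path is None:
    print("chromedriver not found; download it and put it on PATH or next to the notebook")
else:
    print("Using driver at", driver_path)
```

The returned path could then be passed to `webdriver.Chrome(driver_path, options=options)` in the same way as `'./chromedriver'` in the cell above.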

In [22]:
import time

options = Options()
options.add_argument('--headless')
# options.add_argument('--lang=zh-Hant')

browser = webdriver.Chrome('./chromedriver', options=options)

url = "https://www.kakafun.com/mo/app/webapp/2hand.html?isdomaintest=&main=1&devicetype=fixpc&dataid="

browser.get(url)

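# time.sleep(4)

# Poll until the topic elements have been rendered by JavaScript.
while True:
    elements = browser.find_elements_by_css_selector(".topic")
    if len(elements) > 0:
        break
    time.sleep(0.5)  # pause between polls instead of spinning the CPU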

# print(elements)
print("Loaded", "Found topics count:", len(elements))
        
# elements[1].find_element_by_css_selector("img").click()

# Alternatively, run the script defined in the element's 'onclick' attribute.
browser.execute_script("enterDataList('digital');")
time.sleep(3)

for page in range(1, 5):
    print("----")
    print(f"Loading Page {page}")
    
    items = browser.find_elements_by_css_selector(".kkpage-content .item-content")

    print("Found item content: ", len(items))

    for item in items:
        print(item.text)
        print("----")
        
    browser.execute_script("apps_datalist_navcontrol=true;apps_datalist_pg=1;getDataList('digital');")
    time.sleep(1)


Loaded Found topics count: 33
----
Loading Page 1
Found item content:  14
ÂêÑÁ±ªÂÆ∂ÁîµÊ∏ÖÊ¥óÊúçÂä°ÔºöÂÜ∑Ê∞îÔºåÊ¥óË°£Êú∫ÔºåÂÜ∞ÁÆ±ÔºåÈ•ÆÊ∞¥Á≠âÔºÅÔºÅ Èõ∂ÂîÆÔºöËí∏È¶èÊ∞¥/‰∏≠ÂçóÂ±±ÁüøÊ≥âÊ∞¥33 ÂÖÉÊ°∂ ÊåâÊ°∂50ÂÖÉ‰∏™ Ë¥≠Ê∞¥Á•®10ÈÄÅ1 20ÈÄÅ3 ...
Price: 2**
----
Ê∏Ö‰ΩçÁΩÆÔºåÂá∫ÂîÆÈñíÁΩÆÂÖÖÈõªÈÖç‰ª∂„ÄÇ
Price: Ëá™Âá∫ÂÉπ
Face to face: Macau,Taipa
----
Âπ≥ÂîÆ intel i5 ÂõõÊ†∏ GTX1060 ÊâìÊ©üÈ£üÈõû‰∏ªÊ©ü ÔºåÂèØÁé©LOLÔºåPUBGÔºåGTA5ÔºåÊâìÊ©üÂçÅÂàÜÊµÅÊö¢ÔºåÂêàÊâìÊ©üÔºåÊñáÊõ∏‰∏äÁ∂≤ÔºåÂΩ±Èü≥Â®õ...
Price: 3980
----
(Ë∂ÖÊñ∞!) NespressoÂÆ∂Áî®/officeÂ∞èÂûãÂíñÂï°Ê©üÔºåÁ∞°ÂñÆÊòìÁî®ÔºåÊìç‰ΩúÊ≠£Â∏∏ÔºåÂÆòÁ∂≤...
Price: 699
----
Êñ∞Âπ¥ÂÅáÊúüÂú®ÂÆ∂Ê≤ñËøîÊùØÂíñÂï°Â∞±ÊúÄÊ≠£‰∫Ü. È£õÂà©Êµ¶ÂíñÂï°Ê©üÂÆ∂Áî®‰∏ÄÈ´îÊ©ü. Âè™Áî®ÂπæÊ¨°. 9Êàê9Êñ∞,...
Price: 800
Face to face: Macau
----
ÊîæÂÖ®Êñ∞ÔºåÁÑ°ÈñãÂ∞ÅÈÅéÔºåJBLÂñáÂè≠
Price: 730
Face to face: Macau
----
Êï¥Â∑¶ÂπæÊ¨°ÈªëËíúÂ∞±ÁÑ°Êï¥‰∫Ü ÂéüÂÉπÁ¥Ñ$7xx, Âá∫ËÆìÊæ≥ÈñÄÂπ£$300 ÂüπÊ≠£‰∫§Êî∂
Price: 250
Face to face: ,
----
Touch Ëø∑‰Ω†Á†¥Â£ÅË±ÜÊºøÊ©üÔºå‰∏çË≠∞ÂÉπ„ÄÇ ‚úÖ ÂÖçÈÅéÊøæ„ÄÅÂÖçÊ≥°Ë±Ü
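The `while True` loop in the cell above keeps polling until the `.topic` elements appear, so it never ends if the page fails to load. A minimal sketch of the same polling idea with a timeout follows. The helper name `wait_until` and the fake probe are made up for illustration. Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for this purpose.

```python
import time


def wait_until(probe, timeout=10.0, interval=0.5):
    """Call probe() until it returns a truthy value or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)  # wait a little before asking again


attempts = {"count": 0}


def fake_probe():
    """Stand-in for browser.find_elements_by_css_selector('.topic')."""
    attempts["count"] += 1
    # Pretend the elements only show up on the third poll.
    return ["topic"] * 3 if attempts["count"] >= 3 else []


elements = wait_until(fake_probe, timeout=5, interval=0.01)
print("Found topics count:", len(elements))  # Found topics count: 3
```

With a real browser, the probe could be `lambda: browser.find_elements_by_css_selector(".topic")`, which returns an empty list (falsy) until the elements exist.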

## Selenium Cheat Sheet

https://codoid.com/selenium-webdriver-python-cheat-sheet/

Here are some essential commands for controlling a web browser through Selenium:

In [31]:
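browser = webdriver.Chrome()                  # launch Chrome
browser.maximize_window()                     # maximize the browser window
browser.get('https://example.com')            # open a URL
browser.find_element_by_css_selector('a')     # first element matching the selector
browser.find_elements_by_css_selector('a')    # list of all matching elements
browser.quit()                                # close the browser and end the session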
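Every example in this lesson ends with `browser.quit()`, but that line is skipped if an earlier command raises an error, which leaves a browser process running. One possible way to guarantee the cleanup is a small context manager, sketched below. `managed_browser` and `FakeBrowser` are hypothetical names, and the fake class only exists so that the sketch runs without a real driver.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a browser with factory() and always quit it afterwards."""
    browser = factory()
    try:
        yield browser
    finally:
        browser.quit()  # runs even when the body raises an exception


class FakeBrowser:
    """Hypothetical stand-in so the sketch runs without a real driver."""

    def __init__(self):
        self.closed = False

    def get(self, url):
        raise RuntimeError(f"could not load {url}")

    def quit(self):
        self.closed = True


fake = FakeBrowser()
try:
    with managed_browser(lambda: fake) as browser:
        browser.get("https://example.com")
except RuntimeError as error:
    print("Error:", error)

print("Browser closed:", fake.closed)  # Browser closed: True
```

With Selenium, the factory could be `lambda: webdriver.Chrome(options=options)`.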

## Taking screenshot

In [7]:
'''Capture the screenshot of a website via Headless Browser.'''

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')

browser = webdriver.Chrome(options=options)
browser.maximize_window()
browser.get('http://macaodaily.com')
browser.save_screenshot('MacaoDaily.png')
browser.quit()

## Example: Fetching stock data from AASTOCKS

Let's fetch a stock quote from aastocks.com. If we access the stock page directly, the data may not load. Instead, we load any page on aastocks.com, type the stock number into the search box, and press Enter. This automation simulates how a person browses the site in a normal web browser.

In [16]:
'''Fetch current stock from aastock.'''

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
import time

stock_number = '0011'

options = Options()
# options.add_argument('--headless')

browser = webdriver.Chrome('./chromedriver', options=options)
browser.maximize_window()

browser.get('http://www.aastocks.com/tc/stocks/aboutus/companyinfo.aspx')
element = browser.find_element_by_css_selector('#sb-txtSymbol-aa')
element.send_keys(stock_number)
element.send_keys(Keys.RETURN)

time.sleep(3)

element = browser.find_element_by_css_selector('.lastBox')
print(element.text)


browser.quit()

現價(港元)
(指數|行業)
波幅
144.700 - 146.800
▲
146.200
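The text printed above is one string with a value on each line, so it still needs parsing before the numbers can be used. The following is a rough sketch, assuming the `.lastBox` text keeps the layout shown in the output: one line with the day's range as `low - high` and a later line with only the last price. The function name `parse_last_box` is made up for illustration.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"


def parse_last_box(text):
    """Extract the day's range and the last price from the printed text."""
    low = high = last = None
    for line in (raw.strip() for raw in text.splitlines()):
        range_match = re.fullmatch(rf"({NUMBER})\s*-\s*({NUMBER})", line)
        if range_match:
            low, high = float(range_match.group(1)), float(range_match.group(2))
        elif re.fullmatch(NUMBER, line):
            last = float(line)  # a line that holds a single number
    return {"low": low, "high": high, "last": last}


# The text printed by the cell above.
sample = "現價(港元)\n(指數|行業)\n波幅\n144.700 - 146.800\n▲\n146.200"
print(parse_last_box(sample))
# {'low': 144.7, 'high': 146.8, 'last': 146.2}
```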


## Example: Fetch DICJ data with Selenium

We previously used an API to fetch DICJ data. This example shows an alternative way to fetch the same data with Selenium.

In [12]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

options = Options()
options.add_argument('--headless')

browser = webdriver.Chrome(options=options)

browser.get('http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html')

time.sleep(5)

element = browser.find_element_by_css_selector("#report #table1")

rows = element.find_elements_by_css_selector("tr")
print(rows[0].text)
for row in rows[3:]:
    print(row.text)


2020年及2019年每月幸運博彩毛收入
一月份 22,126 24,942 -11.3% 22,126 24,942 -11.3%
二月份 3,104 25,370 -87.8% 25,229 50,312 -49.9%
三月份 5,257 25,840 -79.7% 30,486 76,152 -60.0%
四月份 754 23,588 -96.8% 31,240 99,739 -68.7%
五月份 1,764 25,952 -93.2% 33,004 125,691 -73.7%
六月份 - - - - - -
七月份 - - - - - -
八月份 - - - - - -
九月份 - - - - - -
十月份 - - - - - -
十一月份 - - - - - -
十二月份 - - - - - -
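Each printed row is a space-separated string, so it can be turned into CSV with the standard `csv` module. This is a minimal sketch, assuming every data row has a month label followed by six values, as in the output above. The column names are my own guess from the numbers (monthly figures, then running totals) and are not taken from the DICJ page.

```python
import csv
import io

# Hypothetical column names, inferred from the printed rows.
COLUMNS = ["month", "monthly_2020", "monthly_2019", "monthly_change",
           "accumulated_2020", "accumulated_2019", "accumulated_change"]


def rows_to_csv(lines):
    """Convert row.text strings into CSV text, skipping non-data rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    for line in lines:
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # title or header row, not a month of data
        month, *values = fields
        # Drop thousands separators and turn the "-" placeholder into an empty cell.
        cleaned = ["" if value == "-" else value.replace(",", "") for value in values]
        writer.writerow([month] + cleaned)
    return buffer.getvalue()


sample = [
    "2020年及2019年每月幸運博彩毛收入",
    "一月份 22,126 24,942 -11.3% 22,126 24,942 -11.3%",
    "六月份 - - - - - -",
]
csv_text = rows_to_csv(sample)
print(csv_text)
```

In the cell above, the input could be `[row.text for row in rows]`.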


## Example: Fetch flight prices from Ctrip

In this example, we query flights.ctrip.com for round-trip flights with 4 parameters: outbound date, return date, departure airport, and arrival airport.

In [10]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import datetime

In [11]:
today = datetime.date.today()
five_days_later = today + datetime.timedelta(days=5)

print(today.isoformat())
print(five_days_later.isoformat())


2021-01-14
2021-01-19


In [14]:
options = Options()
# options.add_argument('--headless')

from_city = "hkg"
to_city = "hel"

url = f"https://flights.ctrip.com/international/search/round-{from_city}-{to_city}?depdate={today}_{five_days_later}&cabin=y_s&adult=1&child=0&infant=0"

print(url)

browser = webdriver.Chrome('./chromedriver', options=options)
browser.maximize_window()
browser.get(url)

time.sleep(3)

elements = browser.find_elements_by_css_selector(".flight-item")

print(f"Found {len(elements)} results.")

print(from_city.upper())
print(to_city.upper())
for row in elements:
    airline = row.find_element_by_css_selector(".airline-name")
    print(airline.text)
    price = row.find_element_by_css_selector(".price")
    print(price.text)
    
    
browser.quit()

https://flights.ctrip.com/international/search/round-hkg-hel?depdate=2021-01-14_2021-01-19&cabin=y_s&adult=1&child=0&infant=0
Found 3 results.
HKG
HEL
汉莎航空
¥9246起
汉莎航空
¥26673起
荷兰皇家航空
¥6249起
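The search URL above is assembled from the four parameters with an f-string. As a sketch, the same URL can be built in a small function with `urllib.parse.urlencode`, which makes it easier to reuse for other cities and dates. The function name `build_ctrip_url` is made up for illustration, and the fixed query values are copied from the URL printed above.

```python
import datetime
from urllib.parse import urlencode


def build_ctrip_url(from_city, to_city, outbound, inbound):
    """Build the round-trip search URL used in the cell above."""
    query = urlencode({
        "depdate": f"{outbound.isoformat()}_{inbound.isoformat()}",
        "cabin": "y_s",
        "adult": 1,
        "child": 0,
        "infant": 0,
    })
    base = "https://flights.ctrip.com/international/search"
    return f"{base}/round-{from_city}-{to_city}?{query}"


outbound = datetime.date(2021, 1, 14)
inbound = outbound + datetime.timedelta(days=5)
print(build_ctrip_url("hkg", "hel", outbound, inbound))
# https://flights.ctrip.com/international/search/round-hkg-hel?depdate=2021-01-14_2021-01-19&cabin=y_s&adult=1&child=0&infant=0
```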


## Example: Use Mailgun to send the result to yourself

In [2]:
DOMAIN = None   # your Mailgun sending domain
API_KEY = None  # your Mailgun API key
FROM = "mak@makzan.net"
TO = ["mak@makzan.net"]
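`DOMAIN` and `API_KEY` are left as `None` above so that no secret ends up in the notebook. One possible way to fill them in without typing the key into a cell is to read them from environment variables, as in the sketch below. The variable names `MAILGUN_DOMAIN` and `MAILGUN_API_KEY` and the helper `load_mailgun_config` are assumptions for illustration, not names required by Mailgun.

```python
import os


def load_mailgun_config(env=os.environ):
    """Read Mailgun settings from environment variables (names are assumptions)."""
    domain = env.get("MAILGUN_DOMAIN")
    api_key = env.get("MAILGUN_API_KEY")
    settings = (("MAILGUN_DOMAIN", domain), ("MAILGUN_API_KEY", api_key))
    missing = [name for name, value in settings if not value]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return domain, api_key


# Demo with a plain dict and placeholder values instead of the real environment.
demo_env = {"MAILGUN_DOMAIN": "mg.example.com", "MAILGUN_API_KEY": "key-placeholder"}
domain, api_key = load_mailgun_config(demo_env)
print(domain)  # mg.example.com
```

Failing early with a clear message is easier to debug than an HTTP error from a request sent to a URL containing `None`.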

In [27]:
from bs4 import BeautifulSoup
import requests
import datetime

def send_simple_message(content, subject="Yeah"):
    """Send a plain-text email through the Mailgun messages API."""
    return requests.post(
        f"https://api.mailgun.net/v3/{DOMAIN}/messages",
        auth=("api", API_KEY),
        data={
            "from": FROM,
            "to": TO,
            "subject": subject,
            "text": content,
        },
    )

# Keywords to look for in news titles: "創業" (entrepreneurship), "科技" (technology)
keywords = ["創業", "科技"]

# Today's date, zero-padded to match the URL pattern
today = datetime.datetime.today()
year = str(today.year)
month = str(today.month).zfill(2)
day = str(today.day).zfill(2)

res = requests.get(f"http://www.macaodaily.com/html/{year}-{month}/{day}/node_1.htm")

res.encoding = "utf-8"

soup = BeautifulSoup(res.text, "html5lib")

results = []

links = soup.select("#all_article_list a")
for link in links:
    news_title = link.getText()

    # Add each title once, even if it matches more than one keyword.
    if any(keyword in news_title for keyword in keywords):
        results.append(f"{year}-{month}-{day}: {news_title}")

content = "\n".join(results)
subject = f"今日有{len(results)}篇新聞您可能感興趣"  # "Today, {n} news articles may interest you"
# send_simple_message(content, subject=subject)
print(subject)
print(content)

今日有1篇新聞您可能感興趣
2020-06-29: 粵打造婦女助農創業就業基地
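The cell above builds the dated URL from three zero-padded strings. As a sketch, the same URL can be produced in one step with date format codes inside an f-string. The function name `macaodaily_index_url` is made up for illustration, and the URL pattern is the one used in the cell above.

```python
import datetime


def macaodaily_index_url(day):
    """Build the index URL of the given day's paper with zero-padded parts."""
    # %Y is the 4-digit year; %m and %d are the zero-padded month and day.
    return f"http://www.macaodaily.com/html/{day:%Y-%m}/{day:%d}/node_1.htm"


print(macaodaily_index_url(datetime.date(2020, 6, 29)))
# http://www.macaodaily.com/html/2020-06/29/node_1.htm
```

Taking the date as a parameter also makes it easy to fetch an earlier day's paper, for example `datetime.date.today() - datetime.timedelta(days=1)`.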
