# Lesson 7—Web Automation with Selenium





In this series, we spend three lectures learning how to fetch data online. The topics are:

- Finding patterns in URLs
- Opening web URLs
- Downloading files in Python
- Fetching data with APIs
- Web scraping with Requests and BeautifulSoup
- **Web automation with Selenium**
- **Converting Wikipedia tabular data into CSV**

We use Selenium when:
- Requests and BeautifulSoup do not work.
- The page requires JavaScript to render the data.

Pros:
- It launches and automates a real browser.
- Better compatibility: pages render as they do in a normal browser.

Cons:
- Slow, because it launches a real browser.


## Downloading a browser driver

Selenium needs a browser driver to control the browser. Download the driver that matches your browser:

- [Gecko Driver for Firefox](https://github.com/mozilla/geckodriver/releases)
- [Chrome Driver](https://chromedriver.chromium.org/)

In [3]:
pip install selenium

Collecting selenium
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0
Note: you may need to restart the kernel to use updated packages.


In [4]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

In [6]:
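options = Options()
# Uncomment to run Chrome without opening a visible window.
# options.add_argument('--headless')

# Launch Chrome and close it again, just to confirm that the driver works.
browser = webdriver.Chrome(options=options)
browser.quit()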

If Selenium raises an error saying that the Chrome driver is not in the PATH, specify the path to the driver when creating the browser instance:

In [None]:
browser = webdriver.Chrome('./chromedriver', options=options)
browser.quit()
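
One way to decide which path to pass is to look on the `PATH` first and fall back to a local copy of the driver. The sketch below uses only the standard library; the helper name `find_driver` and the fallback `./chromedriver` are illustrative assumptions, not part of Selenium.

```python
import os
import shutil


def find_driver(name="chromedriver", fallback="./chromedriver"):
    """Return a path to the driver executable, or None if it cannot be found.

    `find_driver` is a hypothetical helper, not part of Selenium.
    """
    # shutil.which searches the directories listed in the PATH environment variable.
    on_path = shutil.which(name)
    if on_path:
        return on_path
    # Otherwise fall back to a local copy, if that file exists.
    if os.path.isfile(fallback):
        return fallback
    return None


print(find_driver())
```

If the result is not `None`, it can be passed as the first argument to `webdriver.Chrome(...)`, as in the cell above.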

## Selenium Cheat Sheet

https://codoid.com/selenium-webdriver-python-cheat-sheet/

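Here are some essential commands for controlling a web browser through Selenium. They use the Selenium 3 API (version 3.141.0 was installed above). Selenium 4.3 and later removed the `find_element_by_*` helpers; there, use `find_element(By.CSS_SELECTOR, ...)` with `By` imported from `selenium.webdriver.common.by`.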

In [31]:
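browser = webdriver.Chrome()                 # launch Chrome
browser.maximize_window()                    # maximize the browser window
browser.get('https://example.com')           # navigate to a URL
browser.find_element_by_css_selector('a')    # first element matching the selector
browser.find_elements_by_css_selector('a')   # list of all matching elements
browser.quit()                               # close the browser and end the session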

## Taking a screenshot

In [7]:
'''Capture a screenshot of a website with a headless browser.'''

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run without a visible window

browser = webdriver.Chrome(options=options)
browser.maximize_window()
browser.get('http://macaodaily.com')
browser.save_screenshot('MacaoDaily.png')  # saved in the current working directory
browser.quit()

## Example: Fetching stock data from AASTOCKS

Let's fetch a stock quote from aastocks.com. If we open the stock page directly, the data may not load. Instead, we load another page on the site, type the stock number into the quote box, and press Enter. This automation simulates how a person browses the site in a normal web browser.

In [26]:
'''Fetch the current quote of a stock from AASTOCKS.'''

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
import time

stock_number = '0001'

options = Options()
options.add_argument('--headless')

browser = webdriver.Chrome(options=options)
browser.maximize_window()

# Load a regular page first, then type the stock number into the quote box
# and press Enter, as a human visitor would.
browser.get('http://www.aastocks.com/tc/stocks/aboutus/companyinfo.aspx')
element = browser.find_element_by_css_selector('#txtHKQuote')
element.send_keys(stock_number)
element.send_keys(Keys.RETURN)

# Give the quote page time to load.
time.sleep(3)

# Print the text of the element with class "lastBox", which holds the quote.
element = browser.find_element_by_css_selector('.lastBox')
print(element.text)

browser.quit()

收市價(港元)
(指數|行業)
波幅
48.800 - 50.000
▼
49.250
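
The fixed `time.sleep(3)` above is either longer than necessary or too short on a slow connection. A common alternative is to poll until the element appears, which is the idea behind Selenium's own `WebDriverWait` in `selenium.webdriver.support.ui`. The sketch below shows that polling idea in plain Python; `wait_until` is a hypothetical helper, and the commented usage line assumes the `browser` object from the cell above.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    `wait_until` is a hypothetical helper written for this lesson.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
        except Exception:
            # Treat errors (for example, element not found yet) as "not ready".
            result = None
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Usage with the browser from the cell above (not run here):
# element = wait_until(lambda: browser.find_element_by_css_selector('.lastBox'))
```

The helper returns as soon as the condition succeeds, so a fast page costs only as long as it takes to load.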


## Example: Fetching DICJ data with Selenium

We previously used an API to fetch DICJ data. This example fetches the same data with Selenium instead.

In [12]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

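options = Options()
options.add_argument('--headless')

browser = webdriver.Chrome(options=options)

browser.get('http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html')

# Give the page time to render the table.
time.sleep(5)

element = browser.find_element_by_css_selector("#report #table1")

rows = element.find_elements_by_css_selector("tr")
print(rows[0].text)   # table title
for row in rows[3:]:  # rows 1 and 2 (presumably the column headers) are skipped
    print(row.text)

# Close the browser so the headless Chrome process does not keep running.
browser.quit()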


2020年及2019年每月幸運博彩毛收入
一月份 22,126 24,942 -11.3% 22,126 24,942 -11.3%
二月份 3,104 25,370 -87.8% 25,229 50,312 -49.9%
三月份 5,257 25,840 -79.7% 30,486 76,152 -60.0%
四月份 754 23,588 -96.8% 31,240 99,739 -68.7%
五月份 1,764 25,952 -93.2% 33,004 125,691 -73.7%
六月份 - - - - - -
七月份 - - - - - -
八月份 - - - - - -
九月份 - - - - - -
十月份 - - - - - -
十一月份 - - - - - -
十二月份 - - - - - -
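
Each `row.text` printed above is a single space-separated string, so it can be split into fields and written out as CSV. The sketch below works on two sample lines copied from the output above. The function name `parse_row` and the treatment of `-` as a missing value are assumptions, and no header row is written because the column titles are not shown here.

```python
import csv
import io


def parse_row(text):
    """Split one table row into a label followed by numbers (None if missing)."""
    label, *cells = text.split()
    values = []
    for cell in cells:
        if cell == "-":
            values.append(None)  # assumption: "-" means no data yet
        else:
            # Drop thousands separators and percent signs before converting.
            values.append(float(cell.replace(",", "").rstrip("%")))
    return [label] + values


sample_rows = [
    "二月份 3,104 25,370 -87.8% 25,229 50,312 -49.9%",
    "六月份 - - - - - -",
]

buffer = io.StringIO()  # stands in for open("dicj.csv", "w", newline="")
writer = csv.writer(buffer)
for text in sample_rows:
    writer.writerow(parse_row(text))

print(buffer.getvalue())
```

`csv.writer` writes `None` as an empty field, so months without data become empty columns.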


## Example: Fetching flight prices from Ctrip

In this example, we fetch round-trip flight results by querying flights.ctrip.com with four parameters: departure date, return date, departure airport, and arrival airport.

In [13]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import datetime

In [16]:
today = datetime.date.today()
five_days_later = today + datetime.timedelta(days=5)

print(today.isoformat())
print(five_days_later.isoformat())


2020-06-29
2020-07-04


In [23]:
options = Options()
# options.add_argument('--headless')

# IATA airport codes: Hong Kong to Helsinki
from_city = "hkg"
to_city = "hel"

# Dates inside an f-string are formatted as ISO dates (YYYY-MM-DD).
url = f"https://flights.ctrip.com/international/search/round-{from_city}-{to_city}?depdate={today}_{five_days_later}&cabin=y_s&adult=1&child=0&infant=0"

print(url)

browser = webdriver.Chrome(options=options)
browser.maximize_window()
browser.get(url)

time.sleep(3)

elements = browser.find_elements_by_css_selector(".flight-item")

print(f"Found {len(elements)} results.")

print(from_city.upper())
print(to_city.upper())
for row in elements:
    # Each result holds an airline name and a price.
    airline = row.find_element_by_css_selector(".airline-name")
    print(airline.text)
    price = row.find_element_by_css_selector(".price")
    print(price.text)

browser.quit()

https://flights.ctrip.com/international/search/round-hkg-hel?depdate=2020-06-29_2020-07-04&cabin=y_s&adult=1&child=0&infant=0
Found 3 results.
HKG
HEL
英国航空
¥10072起
英国航空
¥22167起
法国航空
¥24908起
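
The prices above are strings such as `¥10072起`, where the trailing 起 means "from". To compare them, the digits have to be extracted first. The sketch below does that with a regular expression on the values copied from the output above; `parse_price` is a hypothetical helper.

```python
import re


def parse_price(text):
    """Extract the integer amount from a price label, or None if there is none."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


# (airline, price label) pairs copied from the output above.
results = [("英国航空", "¥10072起"), ("英国航空", "¥22167起"), ("法国航空", "¥24908起")]

# Pick the pair with the lowest numeric price.
cheapest = min(results, key=lambda item: parse_price(item[1]))
print(cheapest[0], parse_price(cheapest[1]))
```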


## Example: Using Mailgun to send the results to yourself

In [2]:
DOMAIN = None   # your Mailgun sending domain
API_KEY = None  # your Mailgun API key
FROM = "mak@makzan.net"
TO = ["mak@makzan.net"]
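
Rather than typing the API key into the notebook, the credentials can be read from environment variables so that they are not saved with the file. A minimal sketch follows; the variable names `MAILGUN_DOMAIN` and `MAILGUN_API_KEY` are assumptions chosen for this example, not names that Mailgun requires.

```python
import os


def load_mailgun_config(env=os.environ):
    """Return (domain, api_key) from the environment, or raise if either is missing."""
    domain = env.get("MAILGUN_DOMAIN")    # hypothetical variable name
    api_key = env.get("MAILGUN_API_KEY")  # hypothetical variable name
    if not domain or not api_key:
        raise RuntimeError("Set MAILGUN_DOMAIN and MAILGUN_API_KEY first")
    return domain, api_key


# Example with a plain dict standing in for os.environ:
print(load_mailgun_config({"MAILGUN_DOMAIN": "mg.example.com", "MAILGUN_API_KEY": "key-123"}))
```

In the notebook, `DOMAIN, API_KEY = load_mailgun_config()` would then replace the two `None` placeholders.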

In [27]:
from bs4 import BeautifulSoup
import requests
import datetime

def send_simple_message(content, subject="Yeah"):
    """Send a plain-text email through the Mailgun HTTP API."""
    return requests.post(
        f"https://api.mailgun.net/v3/{DOMAIN}/messages",
        auth=("api", API_KEY),
        data={
            "from": FROM,
            "to": TO,
            "subject": subject,
            "text": content,
        },
    )

# Keywords to look for in the news titles
keywords = ["創業", "科技"]

# Today's date, zero-padded to match the URL pattern
today = datetime.datetime.today()
year = str(today.year).zfill(4)
month = str(today.month).zfill(2)
day = str(today.day).zfill(2)

res = requests.get(f"http://www.macaodaily.com/html/{year}-{month}/{day}/node_1.htm")

res.encoding = "utf-8"

soup = BeautifulSoup(res.text, "html5lib")

results = []

links = soup.select("#all_article_list a")
for link in links:
    news_title = link.getText()

    # Add each title once, even if it matches several keywords.
    if any(keyword in news_title for keyword in keywords):
        results.append(f"{year}-{month}-{day}: {news_title}")

content = "\n".join(results)
subject = f"今日有{len(results)}篇新聞您可能感興趣"
# Uncomment once DOMAIN and API_KEY are set:
# send_simple_message(content, subject=subject)
print(subject)
print(content)

今日有1篇新聞您可能感興趣
2020-06-29: 粵打造婦女助農創業就業基地
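
The cell above builds the date part of the URL from separately zero-padded year, month, and day strings. `strftime` can produce the same zero-padded pieces in one step. A small sketch, assuming the URL pattern used in the cell above; `front_page_url` is a hypothetical helper name.

```python
import datetime


def front_page_url(day):
    """Build the Macao Daily front-page URL for a given date."""
    # %Y-%m/%d yields e.g. 2020-06/29, matching the {year}-{month}/{day} pattern.
    return f"http://www.macaodaily.com/html/{day.strftime('%Y-%m/%d')}/node_1.htm"


print(front_page_url(datetime.date(2020, 6, 29)))
```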
