# Lesson 7—Web Automation with Selenium





In this series, we spend three lectures learning how to fetch data online. The topics are:

- Finding patterns in URLs
- Opening web URLs
- Downloading files in Python
- Fetching data with an API
- Web scraping with Requests and BeautifulSoup
- **Web automation with Selenium**
- **Converting Wikipedia tabular data into CSV**

We use Selenium when:
- Requests and BeautifulSoup do not work.
- The page requires JavaScript to render the data.
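
One rough way to tell which case applies is to check whether the data is already present in the HTML that the server returns. The sketch below uses only the standard library. The helper `element_has_text`, the sample markup and the `price` id are all made up for illustration, and the parser does not handle unclosed tags such as `<br>` inside the target element.

```python
from html.parser import HTMLParser


class _TextInElement(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def element_has_text(html, target_id):
    """Return True if the element with id=target_id already contains text."""
    parser = _TextInElement(target_id)
    parser.feed(html)
    return bool("".join(parser.chunks).strip())


# Hypothetical pages: the first ships the data in the HTML, the second
# leaves an empty placeholder for JavaScript to fill in later.
static_page = '<div id="price"><span>12.5</span></div>'
dynamic_page = '<div id="price"></div><script>loadPrice()</script>'

print(element_has_text(static_page, "price"))   # True
print(element_has_text(dynamic_page, "price"))  # False
```

If the element is empty in the raw HTML but filled in when viewed in a browser, the page probably renders its data with JavaScript and Selenium is a reasonable choice.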

Pros:
- It launches and automates a real browser.
- Better compatibility with pages that rely on JavaScript.

Cons:
- It is slow because it launches a real browser.


## Downloading browser driver

We need a web browser driver to use Selenium.

- [Gecko Driver for Firefox](https://github.com/mozilla/geckodriver/releases)
- [Chrome Driver](https://chromedriver.chromium.org/)

In [None]:
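from selenium import webdriver
# Launch Chrome to confirm the driver works, then close it.
browser = webdriver.Chrome(options=webdriver.ChromeOptions())
browser.quit()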

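If Selenium raises an error about the chromedriver executable missing from PATH, we may need to specify its path when creating the browser instance. In Selenium 4 the path is passed through a `Service` object; Selenium 3 accepted it as the first positional argument.

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

browser = webdriver.Chrome(service=Service('./chromedriver'), options=webdriver.ChromeOptions())
browser.quit()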
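
Before launching a browser, it can help to check whether a driver executable is on PATH at all. This is a minimal sketch using the standard library; the helper name `find_driver` is made up for illustration.

```python
import shutil


def find_driver(name):
    """Return the full path of a driver executable found on PATH, or None."""
    return shutil.which(name)


for name in ("chromedriver", "geckodriver"):
    path = find_driver(name)
    print(name, "->", path if path else "not found on PATH")
```

If a driver is reported as not found, either add its folder to PATH or pass its location explicitly when creating the browser.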

## Selenium Cheat Sheet

https://codoid.com/selenium-webdriver-python-cheat-sheet/

## Taking screenshot

In [3]:
'''Capture a screenshot of a website via headless Chrome.'''

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')

browser = webdriver.Chrome(options=options)
browser.maximize_window()
browser.get('https://makclass.com/')
browser.save_screenshot('makclass.png')
browser.quit()

## Example: Fetching stock data from AASTOCKS

In [None]:
'''Fetch the current stock quote from AASTOCKS.'''

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
import time

stock_number = '0001'

options = Options()
options.add_argument('-headless')

browser = webdriver.Firefox(options=options)
browser.maximize_window()

browser.get('http://www.aastocks.com/tc/stocks/aboutus/companyinfo.aspx')

# Type the stock number into the quote box, then call the page's own search function.
element = browser.find_element(By.CSS_SELECTOR, '#txtHKQuote')
element.send_keys(stock_number)
browser.execute_script("shhkquote($('#txtHKQuote').val(), 'quote', mainPageMarket)")

# Give the result page time to load.
time.sleep(3)

# Read the quote from the first element with class "font28".
element = browser.find_element(By.CSS_SELECTOR, '.font28')
print(element.text)

browser.quit()
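
A fixed `time.sleep(3)` either waits too long or not long enough. An alternative is to poll until a condition holds or a timeout passes. The sketch below shows that idea with the standard library only; `wait_until` and `fake_page_ready` are made-up names, and the fake condition stands in for a real check against the browser.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page that becomes ready on the third check.
attempts = {"count": 0}


def fake_page_ready():
    attempts["count"] += 1
    return "ready" if attempts["count"] >= 3 else None


print(wait_until(fake_page_ready, timeout=2.0, interval=0.01))
```

Selenium ships the same pattern as `WebDriverWait` in `selenium.webdriver.support.ui`, which polls a condition against the browser until a timeout is reached.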

## Example: Fetch DICJ data with Selenium

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
import time

options = Options()
options.add_argument('-headless')

browser = webdriver.Firefox(options=options)

browser.get('http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2019/index.html')

# Give the page time to render the report table.
time.sleep(5)

element = browser.find_element(By.CSS_SELECTOR, "#report #table1")

# Print the first row, skip rows 1 and 2, then print the remaining rows.
rows = element.find_elements(By.CSS_SELECTOR, "tr")
print(rows[0].text)
for row in rows[3:]:
    print(row.text)

input("Press enter to dismiss.")
browser.quit()
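
The rows above are only printed. To keep them, one option is to collect the cell texts of each row into a list and write the lists out as CSV. This is a sketch with the standard library; `rows_to_csv` is a made-up helper and the sample rows are placeholders, not real DICJ figures.

```python
import csv
import io

# Made-up rows standing in for the cell texts of each table row.
rows = [
    ["Month", "Value A", "Value B"],
    ["Jan", "100", "200"],
    ["Feb", "110", "210"],
]


def rows_to_csv(rows):
    """Serialise a list of rows (each a list of strings) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(rows))
```

The `csv` module quotes any cell that contains a comma, which is safer than joining the cells by hand.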

## Example: Fetch flight prices from Ctrip

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
import time

options = Options()
# Uncomment the next line to run without opening a browser window.
#options.add_argument('-headless')

from_city = "hkg"
to_city = "hel"

url = f"https://flights.ctrip.com/international/search/round-{from_city}-{to_city}?depdate=2019-10-01_2019-10-05&cabin=y_s&adult=1&child=0&infant=0"

browser = webdriver.Firefox(options=options)
browser.maximize_window()
browser.get(url)

# Give the search results time to load.
time.sleep(3)

elements = browser.find_elements(By.CSS_SELECTOR, ".flight-operate")

print(from_city.upper())
print(to_city.upper())
# Each ".flight-operate" block holds one result; print its price.
for row in elements:
    price = row.find_element(By.CSS_SELECTOR, ".price")
    print(price.text)

browser.quit()
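
The search URL above follows a pattern, so it can be built from parameters instead of being edited by hand. This is a sketch that assumes the URL shape stays as shown in this example; `build_search_url` is a made-up helper name.

```python
from urllib.parse import urlencode


def build_search_url(from_city, to_city, depart, back):
    """Build a round-trip search URL in the same shape as the one used above."""
    base = f"https://flights.ctrip.com/international/search/round-{from_city}-{to_city}"
    query = urlencode({
        "depdate": f"{depart}_{back}",
        "cabin": "y_s",
        "adult": 1,
        "child": 0,
        "infant": 0,
    })
    return f"{base}?{query}"


print(build_search_url("hkg", "hel", "2019-10-01", "2019-10-05"))
```

Using `urlencode` also escapes any characters that are not safe in a query string.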

## Example: Use Mailgun to send the result to yourself

In [2]:
# Fill in your own Mailgun domain and API key before sending.
DOMAIN = None
API_KEY = None
FROM = "mak@makzan.net"
TO = ["mak@makzan.net"]

In [1]:
from bs4 import BeautifulSoup
import requests
import datetime

def send_simple_message(content, subject="Yeah"):
    """Send a plain-text email through the Mailgun HTTP API."""
    return requests.post(
        f"https://api.mailgun.net/v3/{DOMAIN}/messages",
        auth=("api", API_KEY),
        data={
            "from": FROM,
            "to": TO,
            "subject": subject,
            "text": content,
        },
    )

# keywords
keywords = ["脫歐", "賀一誠", "創業", "科技"]

# today
today = datetime.datetime.today()
year = str(today.year)
month = str(today.month).zfill(2)
day = str(today.day).zfill(2)

res = requests.get(f"http://www.macaodaily.com/html/{year}-{month}/{day}/node_1.htm")

res.encoding = "utf-8"

soup = BeautifulSoup(res.text, "html5lib")

results = []

links = soup.select("#all_article_list a")
for link in links:
    news_title = link.getText()

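    # Task 2: Read the keywords from user input instead of hard-coding them.
    for keyword in keywords:
        if keyword in news_title:
            # Task 3: Save the result in TXT instead of printing out.
            results.append(f"{year}-{month}-{day}: {news_title}")
            break  # one match is enough; avoids adding the same title twice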

content = "\n".join(results)
subject = f"今日有{len(results)}篇新聞您可能感興趣"
# send_simple_message(content, subject=subject)
print(content)

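Note: the cell above fails with `ModuleNotFoundError: No module named 'bs4'` when BeautifulSoup is not installed. Installing the `beautifulsoup4` and `html5lib` packages (for example with `pip install beautifulsoup4 html5lib`) resolves it.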
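
The date-based URL and the keyword filter in the cell above can be tried without any network access. The sketch below assumes the URL pattern used in that cell; the helper names `front_page_url` and `matching_titles` and the sample titles are made up for illustration.

```python
import datetime


def front_page_url(day):
    """Return the Macao Daily index URL for the given date."""
    return day.strftime("http://www.macaodaily.com/html/%Y-%m/%d/node_1.htm")


def matching_titles(titles, keywords):
    """Keep each title that contains at least one keyword, listed once."""
    return [title for title in titles if any(k in title for k in keywords)]


print(front_page_url(datetime.date(2019, 9, 5)))

# Made-up titles for illustration only.
titles = ["科技創業新趨勢", "天氣預報", "脫歐談判"]
print(matching_titles(titles, ["脫歐", "創業", "科技"]))
```

`strftime` zero-pads the month and day, so the separate `zfill` calls are not needed in this form.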