# Lecture 5—Fetching data online

- Downloading files in Python
- Opening web URLs
- Fetching data with an API
- Web scraping with Requests and BeautifulSoup
- Web automation with Selenium
- Converting Wikipedia tabular data into CSV


To download a file or scrape a web page, we first need to know its URL. Usually this means finding the variable parts of the URL pattern.
For example, each of the following URLs contains a variable part, such as a search query, a date, or a route name.


- https://docs.python.org/3/search.html?q=namedtuple&check_keywords=yes&area=default
- https://duckduckgo.com/?q=python+doc
- https://www.google.com/maps/search/Libraries/@22.1612464,113.5303786,13z
- http://macaodaily.com/html/2020-05/04/node_2.htm
- http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html
- https://bis.dsat.gov.mo:37812/macauweb/routeLine.html?routeName=3&direction=0&language=zh-tw&ver=3.5.12
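
To see which part of a URL varies, the query string can be split into its parameters. The following is a minimal sketch using the standard library's `urllib.parse` on the first URL in the list above. It only inspects the text of the URL and does not contact any server.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://docs.python.org/3/search.html?q=namedtuple&check_keywords=yes&area=default"

parts = urlsplit(url)

# parse_qs returns a dict that maps each parameter name to a list of values.
params = parse_qs(parts.query)

print(parts.netloc)   # the host part of the URL
print(parts.path)     # the page path
print(params["q"])    # the search keyword, the part that varies
```

Here `q` is the parameter to change when searching for a different keyword.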

Let’s take a closer look at the DSAT.gov.mo bus route page. If we switch between bus routes, we can observe that the page URL doesn’t change. There may be two reasons:

1. The page changes are generated via JavaScript rendering.
2. The page is inside an iframe so that page changes do not change the top-level URL.


If it is the first reason, we will need a more advanced browser driver technique. If it is the second reason, we can get the URL by opening the link in a new tab, or simply copying the link location via right-click.

Now we can observe that the URL for each route has the following pattern:

https://bis.dsat.gov.mo:37812/macauweb/routeLine.html?routeName=3&direction=0&language=zh-tw&ver=3.5.12

![](dsat-bus.png)
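
Once the pattern is known, the URL can be rebuilt for any route. The sketch below assumes that only `routeName` and `direction` need to change and that the other parameters can be copied unchanged from the URL above. The function name `route_url` is made up for illustration.

```python
from urllib.parse import urlencode


def route_url(route_name, direction=0):
    """Build the DSAT route page URL for a given route and direction."""
    base = "https://bis.dsat.gov.mo:37812/macauweb/routeLine.html"
    params = {
        "routeName": route_name,
        "direction": direction,
        "language": "zh-tw",
        "ver": "3.5.12",  # copied from the URL above; it may change over time
    }
    return f"{base}?{urlencode(params)}"


print(route_url("3"))
print(route_url("3", direction=1))  # assumption: 1 selects the other direction
```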

Take DICJ.gov.mo as another example. The page URL is:

http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html

If we inspect the network requests, we can find the XML URL that the page loads behind the scenes:

http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/report_cn.xml?id=10

![](dicj.png)

## Example: Web Search Python Documentation Site

In [2]:
import webbrowser

query = "webbrowser"

url = f"https://docs.python.org/3/search.html?q={query}&check_keywords=yes&area=default"

webbrowser.open(url)


True

In [None]:
import webbrowser

query = input("Please input search query to search Python doc. ")

url = f"https://docs.python.org/3/search.html?q={query}&check_keywords=yes&area=default"

webbrowser.open(url)


## Example: Searching DuckDuckGo search engine

The DuckDuckGo search engine can jump directly to the first search result when an exclamation mark (!) is added to the query string. We will use this feature to create a Python script.


In [4]:
import webbrowser

query = "Python history"

url = f"https://duckduckgo.com?q=!+{query}"

webbrowser.open(url)

Please input search query to search DuckDuckGo and go to first result. python history


True

In [None]:
import webbrowser

query = input("Please input search query to search DuckDuckGo and go to first result. ")

url = f"https://duckduckgo.com?q=!+{query}"

webbrowser.open(url)
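
A query typed by the user may contain spaces or characters such as `&` and `#`, which have a special meaning inside a URL. A minimal sketch of escaping the query with `urllib.parse.quote_plus` before inserting it into the URL follows. The helper name `first_result_url` is made up for illustration.

```python
from urllib.parse import quote_plus


def first_result_url(query):
    """Build a DuckDuckGo URL that jumps to the first result for `query`."""
    # quote_plus turns spaces into "+" and escapes characters such as "&" or "#".
    return f"https://duckduckgo.com/?q=!+{quote_plus(query)}"


print(first_result_url("Python history"))
print(first_result_url("C# & .NET"))
```

The returned string can be passed to `webbrowser.open` in the same way as in the cells above.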

## Example: Google Maps search near Macao

In [5]:
import webbrowser

query = "Book store"

# A map search in Macao.
url = f"https://www.google.com/maps/search/{query}/@22.1612464,113.5303786,13z"

webbrowser.open(url)

True

![](google-map-result.png)

**Exercise Time**

Try changing the map location to Shanghai.

In [None]:
import webbrowser

query = input("Please input search query to search near-by Macao. ")

# A map search in Macao.
url = f"https://www.google.com/maps/search/{query}/@22.1612464,113.5303786,13z"

webbrowser.open(url)

## URL for iOS apps


On iOS, we can use x-callback-url to interact with other apps from Python by using Pythonista.

There is a website that collects the x-callback-url schemes of iOS apps:

http://x-callback-url.com/apps/

For example, Things, a task manager, provides an x-callback-url API:

https://culturedcode.com/things/support/articles/2803573/

Another example is Bear, a note-taking iOS app, which also provides an x-callback-url API:

https://bear.app/faq/X-callback-url%20Scheme%20documentation/
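
An x-callback-url generally has the form `scheme://x-callback-url/action?parameters`. The sketch below only builds such a URL as a string. The scheme `exampleapp`, the action `create` and the parameter names are hypothetical placeholders, so the real values must be taken from each app's documentation linked above.

```python
from urllib.parse import urlencode, quote


def x_callback_url(scheme, action, **params):
    """Build a URL of the form scheme://x-callback-url/action?key=value&..."""
    # quote_via=quote encodes spaces as %20 rather than "+".
    query = urlencode(params, quote_via=quote)
    return f"{scheme}://x-callback-url/{action}?{query}"


# Hypothetical example values; check the app's documentation for the real
# scheme, action names and parameters.
url = x_callback_url("exampleapp", "create", title="Shopping list", text="Milk")
print(url)
```

On iOS, the resulting URL would then be opened from Pythonista, for instance with `webbrowser.open(url)`. That step is not shown here.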


## Downloading files

We can use `urlretrieve` from the `urllib.request` module to download files.

For example, the following code downloads a chart image from the AAStocks chart server for each stock number in the list.

In [7]:
'''Download chart from AAStock server with given stock numbers.'''

from urllib.request import urlretrieve

stock_numbers = ['0011','0005','0001','0700','3333','0002','0012']

for stock_number in stock_numbers:
    url = "http://charts.aastocks.com/servlet/Charts?fontsize=12&15MinDelay=T&lang=1&titlestyle=1&vol=1&Indicator=1&indpara1=10&indpara2=20&indpara3=50&indpara4=100&indpara5=150&subChart1=2&ref1para1=14&ref1para2=0&ref1para3=0&subChart2=3&ref2para1=12&ref2para2=26&ref2para3=9&subChart3=12&ref3para1=0&ref3para2=0&ref3para3=0&scheme=3&com=100&chartwidth=660&chartheight=855&stockid=00{}.HK&period=6&type=1&logoStyle=1".format(stock_number)
    urlretrieve(url, '{}-chart.gif'.format(stock_number))
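
A download can fail, for example when the server is unreachable or the URL is malformed. The following is a minimal sketch of wrapping `urlretrieve` so that one failed download does not stop the whole loop. The helper name `download` is made up for illustration.

```python
from urllib.request import urlretrieve


def download(url, filename):
    """Save `url` to `filename`. Return True on success, False on failure."""
    try:
        urlretrieve(url, filename)
    except (OSError, ValueError) as error:
        # URLError and HTTPError are subclasses of OSError;
        # ValueError is raised for a URL with an unknown type.
        print(f"Could not download {url}: {error}")
        return False
    return True
```

Inside the loop above, `download(url, '{}-chart.gif'.format(stock_number))` could take the place of the bare `urlretrieve` call.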



## Fetching XML

In [10]:
pip install untangle

Collecting untangle
  Downloading untangle-1.1.1.tar.gz (3.1 kB)
Building wheels for collected packages: untangle
  Building wheel for untangle (setup.py) ... done
  Created wheel for untangle: filename=untangle-1.1.1-py3-none-any.whl size=3410 sha256=678ed047367a6d024ab37d3d424ef606a5d3de48f1d2aa254c5acdb9da946713
  Stored in directory: /Users/makzan/Library/Caches/pip/wheels/b9/a9/9c/45580c8b7a00e3e79b889e8e78a4f3427fff5a4d48f1cfea0a
Successfully built untangle
Installing collected packages: untangle
Successfully installed untangle-1.1.1
Note: you may need to restart the kernel to use updated packages.


## Example: SMG.gov.mo

In [11]:
import untangle
import datetime


def fetch():
    obj = untangle.parse('http://xml.smg.gov.mo/c_actual_brief.xml')

    temperature = obj.ActualWeatherBrief.Custom.Temperature.Value.cdata
    humidity = obj.ActualWeatherBrief.Custom.Humidity.Value.cdata

    print("Current temperature in Macao: " + temperature + " degrees, humidity: " + humidity + "%.")

fetch()

Current temperature in Macao: 30 degrees, humidity: 81%.
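
The same element path can also be followed with the standard library's `xml.etree.ElementTree`, which avoids an extra package. The sketch below runs on a small hypothetical sample, not on the real feed. The sample only mirrors the element names used in the cell above, and the real XML may contain more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that only mirrors the element path used above.
sample = """
<ActualWeatherBrief>
  <Custom>
    <Temperature><Value>30</Value></Temperature>
    <Humidity><Value>81</Value></Humidity>
  </Custom>
</ActualWeatherBrief>
"""

root = ET.fromstring(sample)  # root is the <ActualWeatherBrief> element

# Paths are relative to the root element.
temperature = root.findtext("Custom/Temperature/Value")
humidity = root.findtext("Custom/Humidity/Value")

print(f"Temperature: {temperature}, humidity: {humidity}%")
```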


## Example: Monthly gross gaming revenue

In [81]:
import untangle
import datetime

year = datetime.date.today().year

# The list begins at 0, and we look for the previous month.
month = datetime.date.today().month - 1 - 1

if month < 0:
    year = year - 1
    month = 11  # December of the previous year; the list begins at 0.

url = f"http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/{year}/report_cn.xml?id=8"

data = untangle.parse(url)

month_data = data.STATISTICS.REPORT.DATA.RECORD[month]

net_income = month_data.DATA[1].cdata
last_net_income = month_data.DATA[2].cdata
change_rate = month_data.DATA[3].cdata
acc_net_income = month_data.DATA[4].cdata
acc_last_net_income = month_data.DATA[5].cdata
acc_change_rate = month_data.DATA[6].cdata

print(f"{year}/{month+1} gross revenue: {net_income} ({year-1}: {last_net_income}), {change_rate}")
print(f"{year}/{month+1} accumulated gross revenue: {acc_net_income} ({year-1}: {acc_last_net_income}), {acc_change_rate}")

2020/5 gross revenue: 1,764 (2019: 25,952), -93.2%
2020/5 accumulated gross revenue: 33,004 (2019: 125,691), -73.7%
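
The month arithmetic in this example (the previous month as a 0-based list index, wrapping to December of the previous year in January) can be isolated in a small function and checked without any network access. This is a sketch, and the function name `previous_month_index` is made up for illustration.

```python
import datetime


def previous_month_index(today):
    """Return (year, index) for the month before `today`; index 0 is January."""
    index = today.month - 2  # -1 for the previous month, -1 for 0-based lists
    if index < 0:
        return today.year - 1, 11  # December of the previous year
    return today.year, index


print(previous_month_index(datetime.date(2020, 6, 18)))
print(previous_month_index(datetime.date(2020, 1, 15)))
```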


## Monthly gross gaming revenue for the past 12 months

In [82]:
def fetch_and_print_dicj_year_month(year, month):
    url = f"http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/{year}/report_cn.xml?id=8"

    data = untangle.parse(url)

    month_data = data.STATISTICS.REPORT.DATA.RECORD[month]

    net_income = month_data.DATA[1].cdata
    last_net_income = month_data.DATA[2].cdata
    change_rate = month_data.DATA[3].cdata
    acc_net_income = month_data.DATA[4].cdata
    acc_last_net_income = month_data.DATA[5].cdata
    acc_change_rate = month_data.DATA[6].cdata

    print(f"{year}/{month+1:02d} gross revenue: {net_income:>6} ({year-1}: {last_net_income}), {change_rate}")
#     print(f"{year}/{month+1:02d} accumulated gross revenue: {acc_net_income} ({year-1}: {acc_last_net_income}), {acc_change_rate}")

In [83]:
import untangle
import datetime

today = datetime.date.today()

# Walk through the past 12 months, from 12 months ago up to last month.
# Counting whole months avoids the drift of a 30-day approximation,
# which can skip or repeat a month near the end of a month.
for i in range(-12, 0):
    months = today.year * 12 + (today.month - 1) + i
    year, month = divmod(months, 12)  # month is a 0-based index
    fetch_and_print_dicj_year_month(year, month)

2019/06 gross revenue: 23,812 (2018: 22,490), 5.9%
2019/07 gross revenue: 24,453 (2018: 25,327), -3.5%
2019/08 gross revenue: 24,262 (2018: 26,559), -8.6%
2019/09 gross revenue: 22,079 (2018: 21,952), 0.6%
2019/10 gross revenue: 26,443 (2018: 27,328), -3.2%
2019/11 gross revenue: 22,877 (2018: 24,995), -8.5%
2019/12 gross revenue: 22,838 (2018: 26,468), -13.7%
2020/01 gross revenue: 22,126 (2019: 24,942), -11.3%
2020/02 gross revenue:  3,104 (2019: 25,370), -87.8%
2020/03 gross revenue:  5,257 (2019: 25,840), -79.7%
2020/04 gross revenue:    754 (2019: 23,588), -96.8%
2020/05 gross revenue:  1,764 (2019: 25,952), -93.2%


## Example: Fetching JSON data with an API

In [84]:
import json
import requests

url = "https://api.exchangeratesapi.io/latest?symbols=HKD&base=CNY"

response = requests.get(url)
data = json.loads(response.text)
print(data)

print(data['rates']['HKD'])


{'rates': {'HKD': 1.0935529258}, 'base': 'CNY', 'date': '2020-06-17'}
1.0935529258
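
The parsed response is an ordinary Python dictionary, so further calculations are plain dictionary lookups. The sketch below reuses the response body printed above as a string, so it runs without calling the API. The helper name `convert` is made up for illustration.

```python
import json

# The response body printed above, written as JSON text.
body = '{"rates": {"HKD": 1.0935529258}, "base": "CNY", "date": "2020-06-17"}'


def convert(amount, payload, symbol):
    """Convert `amount` of the base currency into `symbol`."""
    return amount * payload["rates"][symbol]


payload = json.loads(body)
hkd = convert(100, payload, "HKD")
print(f"100 {payload['base']} = {hkd:.2f} HKD on {payload['date']}")
```

With a live response, `response.json()` from Requests returns the same kind of dictionary as `json.loads(response.text)`.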


## Web Scraping

1. Query the web page
1. Parse the DOM tree
1. Get the data we want from the HTML code
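
The steps can be tried offline before touching a real site. The sketch below deliberately uses the standard library's `html.parser` instead of BeautifulSoup, so it needs no extra package. Unlike BeautifulSoup, `HTMLParser` is event-based and does not build a tree, so the "parse" step here means reacting to tags as they are read. The HTML snippet and the class name `HeadingCollector` are made up for illustration.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text inside every <h5> element."""

    def __init__(self):
        super().__init__()
        self.in_h5 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h5":
            self.in_h5 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h5":
            self.in_h5 = False

    def handle_data(self, data):
        if self.in_h5:
            self.headings[-1] += data


# Step 1 would normally fetch the page; a hypothetical snippet stands in here.
html = "<div><h5> First headline </h5><p>Body</p><h5>Second headline</h5></div>"

# Step 2: parse the HTML.
collector = HeadingCollector()
collector.feed(html)

# Step 3: keep only the data we want.
headings = [text.strip() for text in collector.headings]
print(headings)
```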

In [87]:
from bs4 import BeautifulSoup
import requests

try:
    res = requests.get("https://news.gov.mo/home/zh-hant")
except requests.exceptions.ConnectionError:
    print("Error: Could not connect to the server")
    exit()


soup = BeautifulSoup(res.text, "html.parser")

for h5 in soup.select("h5"):
    print(h5.getText().strip())


新型冠狀病毒感染應變協調中心查詢熱線統計數字 (6月17日08H00 至06月18日08H00)
明起重新開放澳門居民入境珠海豁免隔離預約系統        獲批人士可 7天內入境珠海獲豁免隔離
101X路線繁忙時間增密班次
中國經典《四書》之《大學》、《中庸》葡譯本在中國文化周面世
自助辦理身份證續期服務延伸至全日24小時
HUSH!!首場網上音樂會周日舉行　工作坊及音樂短片比賽接受報名
市政署繼續檢疫進口冷藏水產　再檢測22樣本結果未見異常
文化傳播大使呈獻領航計劃 穿越崗頂劇院之旅6月下旬登場
澳大線上華人芯片設計技術研討會萬人參與
澳大學生奪商業精英國際賽季軍
經香港國際機場返澳的居民今日下午乘坐特別渡輪服務抵達氹仔北安碼頭
新城A區B4地段公共房屋建造工程 - 基礎及地庫公開開標
“環松山步行系統設計連建造工程”之公開開標
澳門節能週2020系列活動－齊熄燈，一小時
行政長官賀一誠出席植樹活動
荷香飄飄（攝影：盧錦烈）
【新聞局】經香港國際機場返澳的居民乘搭首班特別渡輪抵達氹仔北安碼頭
【新聞局】新型冠狀病毒最新疫情及本澳各項防控措施新聞發佈會(17-06)
【新聞局】心出發-遊澳門新聞發佈會16-06
【新聞局】新型冠狀病毒最新疫情及本澳各項防控措施新聞發佈會(15-06)
【新聞局】行政長官 賀一誠 栽種幼樹 宣揚珍惜大自然
【新聞局】新型冠狀病毒最新疫情及本澳各項防控措施新聞發佈會(12-06)
【新聞局】黃少澤：爭取推動盡快恢復正常通關
【新聞局】新型冠狀病毒最新疫情及本澳各項防控措施新聞發佈會(10-06)
【新聞局】新型冠狀病毒最新疫情及本澳各項防控措施新聞發佈會(08-06)
【新聞局】新型冠狀病毒最新疫情及本澳各項防控措施新聞發佈會(05-06)




焯公亭 記華商 抗疫貢獻
夜香行業七十年代式微 垃圾處理三十年前變天
病疫影響城市規劃
澳門藝術界 為抗疫打氣
2020暑期活動開始網上報名
非強制央積金2020年度預算盈餘特別分配款項名單公佈
“澳門居民入境珠海暫不實施集中隔離醫學觀察”申請獲審批後須預約核酸檢測　查詢請電應變中心熱線
“澳門居民入境珠海暫不實施集中隔離醫學觀察”申請系統首日運作暢順
本澳連續65日無新增確診病例 “豁免澳門居民前往珠海十四日隔離醫學觀察”措施明再小額開放申請
－記者會快訊（“心出發‧遊澳門”本地遊活動的

In [90]:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.gov.mo/zh-hant/public-holidays/year-2020/")
soup = BeautifulSoup(response.text, "html.parser")

tables = soup.select(".table")

for row in tables[0].select("tr"):
    if len(row.select("td")) > 0:
        date = row.select("td")[1].text
        name = row.select("td")[3].text
        print(f"{date}: {name}")
  

1月1日: 元旦
1月25日: 農曆正月初一
1月26日: 農曆正月初二
1月27日: 農曆正月初三
4月4日: 清明節
4月10日: 耶穌受難日
4月11日: 復活節前日
4月30日: 佛誕節
5月1日: 勞動節
6月25日: 端午節
10月1日: 中華人民共和國國慶日
10月2日: 中華人民共和國國慶日翌日
10月2日: 中秋節翌日
10月25日: 重陽節
11月2日: 追思節
12月8日: 聖母無原罪瞻禮
12月20日: 澳門特別行政區成立紀念日
12月21日: 冬至
12月24日: 聖誕節前日
12月25日: 聖誕節
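
Once the rows are extracted, they can be saved as CSV with the standard library's `csv` module. The sketch below uses a few (date, name) pairs copied from the output above and writes into an in-memory buffer, so it does not create a file. The column names `date` and `name` are a choice made for this example.

```python
import csv
import io

# A few (date, name) pairs copied from the output above.
holidays = [
    ("1月1日", "元旦"),
    ("5月1日", "勞動節"),
    ("12月25日", "聖誕節"),
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["date", "name"])  # header row
writer.writerows(holidays)

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, the buffer would be replaced by something like `open("holidays.csv", "w", newline="", encoding="utf-8")`, where the file name is only an example.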


In [89]:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.gov.mo/zh-hant/public-holidays/year-2020/")
soup = BeautifulSoup(response.text, "html.parser")

tables = soup.select(".table")

for row in tables[0].select("tr"):
    if len(row.select("td")) > 0:
        is_obligatory = (row.select("td")[0].text == "*")
        if is_obligatory:
            date = row.select("td")[1].text
            name = row.select("td")[3].text
            print(f"{date}: {name}")
  

1月1日: 元旦
1月25日: 農曆正月初一
1月26日: 農曆正月初二
1月27日: 農曆正月初三
4月4日: 清明節
5月1日: 勞動節
10月1日: 中華人民共和國國慶日
10月2日: 中秋節翌日
10月25日: 重陽節
12月20日: 澳門特別行政區成立紀念日


## Is today a government holiday?

In [92]:
import requests
from bs4 import BeautifulSoup
import datetime

# Get today's year, month and day
today = datetime.date.today()
year = today.year
month = today.month
day = today.day
today_weekday = today.weekday()
today_date = f"{month}月{day}日"


# Fetch gov.mo
url = f"https://www.gov.mo/zh-hant/public-holidays/year-{year}/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

tables = soup.select(".table")

holidays = {}

for table in tables:
    for row in table.select("tr"):
        if len(row.select("td")) > 0:    
            date = row.select("td")[1].text
            weekday = row.select("td")[2].text
            name = row.select("td")[3].text
            holidays[date] = name


# Query holidays
print(today_date)
if today_date in holidays:
    holiday = holidays[today_date]
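    print(f"Today is a public holiday: {holiday}")
elif today_weekday == 6:  # date.weekday(): Monday is 0, Sunday is 6
    print("Today is Sunday, but not a public holiday.")
elif today_weekday == 5:
    print("Today is Saturday, but not a public holiday.")
else:
    print("Today is not a public holiday.")

6月18日
Today is not a public holiday.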
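
Weekend checks of this kind depend on how `date.weekday()` numbers the days: it counts from Monday, so Monday is 0, Saturday is 5 and Sunday is 6. A minimal sketch that makes the mapping explicit follows. The helper name `weekend_name` is made up for illustration.

```python
import datetime


def weekend_name(day):
    """Return "Saturday" or "Sunday" for weekend dates, otherwise None."""
    # date.weekday() counts from Monday: Monday is 0, Saturday is 5, Sunday is 6.
    return {5: "Saturday", 6: "Sunday"}.get(day.weekday())


print(weekend_name(datetime.date(2020, 6, 18)))  # a Thursday
print(weekend_name(datetime.date(2020, 6, 20)))
print(weekend_name(datetime.date(2020, 6, 21)))
```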


In [104]:
def is_macao_holiday(query_date):    
    # Fetch gov.mo
    url = f"https://www.gov.mo/zh-hant/public-holidays/year-{query_date.year}/"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    tables = soup.select(".table")

    holidays = {}

    for table in tables:
        for row in table.select("tr"):
            if len(row.select("td")) > 0:    
                date = row.select("td")[1].text
                weekday = row.select("td")[2].text
                name = row.select("td")[3].text
                holidays[date] = name


    # Query holidays
    date_key = f"{query_date.month}月{query_date.day}日"

    if date_key in holidays:        
        holiday = holidays[date_key]
        print(f"{date_key} is a public holiday: {holiday}")
    elif query_date.weekday() == 6:  # date.weekday(): Monday is 0, Sunday is 6
        print(f"{date_key} is Sunday, but not a public holiday.")
    elif query_date.weekday() == 5:
        print(f"{date_key} is Saturday, but not a public holiday.")
    else:
        print(f"{date_key} is not a public holiday.")

In [105]:
is_macao_holiday(datetime.date.today())

6月18日 is not a public holiday.


In [106]:
import dateutil.parser
date = dateutil.parser.parse("2020-01-01")
is_macao_holiday(date)

1月1日 is a public holiday: 元旦


In [None]:
import dateutil.parser
date = dateutil.parser.parse("2020-10-01")
is_macao_holiday(date)
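
For date strings in the `YYYY-MM-DD` form used above, the standard library can do the parsing as well. The following is a minimal sketch with `datetime.date.fromisoformat` (available since Python 3.7), which also builds the same lookup key that `is_macao_holiday` uses.

```python
import datetime

# date.fromisoformat accepts the "YYYY-MM-DD" format used in the cells above.
query_date = datetime.date.fromisoformat("2020-10-01")

# Build the same lookup key that is_macao_holiday uses.
date_key = f"{query_date.month}月{query_date.day}日"
print(query_date, date_key)
```

The resulting `query_date` can be passed to `is_macao_holiday` in place of the value returned by `dateutil`.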