# Lecture 6—Fetching data online

Version 1.0, by [Makzan](https://makzan.net). Last updated 2021 March.

In this series, we will spend 3 lectures learning how to fetch data online. This includes:

- **Finding patterns in URL**
- **Open web URL**
- **Downloading files in Python**
- **Fetch data with API**
- Web scraping with Requests and BeautifulSoup
- Web automation with Selenium
- Converting Wikipedia tabular data into CSV

## Finding patterns in URL


To download files or scrape a web page, we first need to know its URL. Usually this means finding the parts of the URL that vary from page to page.
For example, from the following URL, we can find the pattern of the search query.


- https://docs.python.org/3/search.html?q=namedtuple&check_keywords=yes&area=default
- https://duckduckgo.com/?q=python+doc
- https://www.google.com/maps/search/Libraries/@22.1612464,113.5303786,13z
- http://macaodaily.com/html/2020-05/04/node_2.htm
- http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html
- https://bis.dsat.gov.mo:37812/macauweb/routeLine.html?routeName=3&direction=0&language=zh-tw&ver=3.5.12
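
To see the variable parts of a URL more clearly, we can split it into its components. The following is a minimal sketch using the standard `urllib.parse` module on the first URL from the list above; the variable names are only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://docs.python.org/3/search.html?q=namedtuple&check_keywords=yes&area=default"

# Split the URL into scheme, host, path and query string.
parts = urlsplit(url)

# Turn the query string into a dict that maps each key to a list of values.
params = parse_qs(parts.query)

print(parts.netloc)     # the host name
print(parts.path)       # the path on the server
print(params)           # all query parameters
print(params["q"][0])   # the search keyword, the part we want to vary
```

Here the `q` parameter is the pattern we are looking for: changing its value changes the search.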

Let’s take a closer look at the DSAT.gov.mo bus route page. If we switch between bus routes, we can observe that the page URL doesn’t change. There may be 2 reasons:

1. The page changes are generated via JavaScript rendering.
2. The page is inside an iframe so that page changes do not change the top-level URL.


If it is the first reason, we will need a more advanced browser driver technique. If it is the second reason, we can get the URL by opening the link in a new tab, or simply copying the link location via right-click.

Now we can observe the URL for each route has the following pattern.

https://bis.dsat.gov.mo:37812/macauweb/routeLine.html?routeName=3&direction=0&language=zh-tw&ver=3.5.12

![](dsat-bus.png)
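
Once the pattern is known, we can generate the URL for any route. The sketch below assumes that only `routeName` and `direction` need to change and that the other parameters can stay as they appear in the URL above. The function name `build_route_url` and the sample route names are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://bis.dsat.gov.mo:37812/macauweb/routeLine.html"


def build_route_url(route_name, direction=0, language="zh-tw", version="3.5.12"):
    """Build a route page URL following the pattern observed above."""
    params = {
        "routeName": route_name,
        "direction": direction,
        "language": language,
        "ver": version,
    }
    # urlencode joins the pairs with "&" and escapes unsafe characters.
    return f"{BASE_URL}?{urlencode(params)}"


# Sample route names, used only for illustration.
for route in ["3", "3X", "33"]:
    print(build_route_url(route))
```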

Take DICJ.gov.mo as another example. The page URL is:

http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html

If we inspect the network requests, we can find the behind-the-scenes XML URL:

http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/report_cn.xml?id=10

![](dicj.png)
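
The year appears in the path of the XML URL, so we can generate one URL per year. This is a small sketch that assumes other years follow the same pattern as the 2020 URL above; `build_report_url` is a hypothetical helper name.

```python
def build_report_url(year, report_id):
    """Build the XML report URL for a given year and report id."""
    return (
        "http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/"
        f"{year}/report_cn.xml?id={report_id}"
    )


# The year range is chosen only for illustration.
urls = [build_report_url(year, 10) for year in range(2018, 2021)]

for url in urls:
    print(url)
```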

## Example: Web Search Python Documentation Site

Sometimes, we can speed up our daily operation just by automatically opening the URL that we need. We can use `webbrowser` to do so.



In [2]:
import webbrowser

query = "webbrowser"

url = f"https://docs.python.org/3/search.html?q={query}&check_keywords=yes&area=default"

webbrowser.open(url)


True

**✏️ Exercise Time**
    
Please try to turn the query into an input asking for the search query:

In [133]:
import webbrowser

### Start writing your code here
None
### End writing your code

webbrowser.open(url)


True

|Expected question to ask|
|---|
|Please input search query to search Python doc: |

## Example: Searching DuckDuckGo search engine

The DuckDuckGo search engine jumps straight to the first search result when we add an exclamation mark (!) to the query string. We will use this feature to create a Python script.


In [130]:
import webbrowser

query = "Python history"

url = f"https://duckduckgo.com?q=!+{query}"

webbrowser.open(url)

True
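
The query above contains a space, and real queries may also contain characters such as `&` or `+` that have a special meaning in a URL. A minimal sketch of a safer way to build the URL is to escape the query with `quote_plus` from the standard `urllib.parse` module. The helper name `build_search_url` is hypothetical.

```python
from urllib.parse import quote_plus


def build_search_url(query):
    """Build a DuckDuckGo URL that jumps to the first result for query."""
    # quote_plus turns spaces into "+" and escapes characters like "&".
    return f"https://duckduckgo.com/?q=!+{quote_plus(query)}"


print(build_search_url("Python history"))
print(build_search_url("C++ & Python"))
```

Without escaping, the `&` in the second query would be read as the start of a new URL parameter and the search would be cut short.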

**✏️ Exercise Time**

Please try to turn the query into an input asking for the search query:

In [129]:
import webbrowser

### Start writing your code here
None
### End writing your code

webbrowser.open(url)

True

|Expected question to ask|
|---|
|Please input search query : |

## Example: Google Maps search near Macao

In [5]:
import webbrowser

query = "Book store"

# A map search in Macao.
url = f"https://www.google.com/maps/search/{query}/@22.1612464,113.5303786,13z"

webbrowser.open(url)

True

![](google-map-result.png)

**✏️ Exercise Time**

Try changing the map location to Shanghai.

In [None]:
import webbrowser

query = "Book store"

# Start writing your code here
latitude = None
longitude = None
zoom_level = 13
url = f"https://www.google.com/maps/search/{query}/@{latitude},{longitude},{zoom_level}z"

webbrowser.open(url)

## URL for iOS apps


On iOS, we can use x-callback-url to interact with other apps from Python code running in Pythonista.

There is a website that collects the x-callback-url schemes of iOS apps:

http://x-callback-url.com/apps/

For example, Things, a task manager, provides an x-callback-url API:

https://culturedcode.com/things/support/articles/2803573/

Another example is Bear, a note-taking iOS app, which provides an x-callback-url API too:

https://bear.app/faq/X-callback-url%20Scheme%20documentation/
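
An x-callback-url is built like any other URL: an app scheme, an action, and a list of parameters. The following sketch only builds the URL string. The `create` action and the `title` and `text` parameters are assumptions about Bear's scheme, so please check the documentation linked above before relying on them. `build_x_callback_url` is a hypothetical helper name.

```python
from urllib.parse import quote, urlencode


def build_x_callback_url(scheme, action, **params):
    """Build a URL of the form scheme://x-callback-url/action?key=value."""
    # quote_via=quote encodes spaces as %20 instead of "+".
    query = urlencode(params, quote_via=quote)
    return f"{scheme}://x-callback-url/{action}?{query}"


# Assumed action and parameter names, used only for illustration.
url = build_x_callback_url("bear", "create", title="Shopping list", text="Milk and eggs")
print(url)
```

In Pythonista, passing such a URL to `webbrowser.open(url)` should hand it over to the target app, in the same way we opened web pages earlier.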


## Downloading files

We can use `urlretrieve` from the `urllib.request` module to download a file.

For example, the following code downloads a chart image from the AAStocks server for each stock number in a list.

In [7]:
'''Download chart from AAStock server with given stock numbers.'''

from urllib.request import urlretrieve

stock_numbers = ['0001','0005','0011','0700','3333','0002','0012']

for stock_number in stock_numbers:
    url = "http://charts.aastocks.com/servlet/Charts?fontsize=12&15MinDelay=T&lang=1&titlestyle=1&vol=1&Indicator=1&indpara1=10&indpara2=20&indpara3=50&indpara4=100&indpara5=150&subChart1=2&ref1para1=14&ref1para2=0&ref1para3=0&subChart2=3&ref2para1=12&ref2para2=26&ref2para3=9&subChart3=12&ref3para1=0&ref3para2=0&ref3para3=0&scheme=3&com=100&chartwidth=660&chartheight=855&stockid=00{}.HK&period=6&type=1&logoStyle=1".format(stock_number)
    urlretrieve(url, '{}-chart.gif'.format(stock_number))

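
When downloading many files, one failed download stops the whole loop. The following is a minimal sketch of a loop that skips failed downloads and reports which files were saved. The helper name `download_files` and its arguments are hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_files(targets, folder="."):
    """Download each (url, filename) pair into folder and return the saved paths."""
    saved = []
    for url, filename in targets:
        destination = Path(folder) / filename
        try:
            urlretrieve(url, destination)
        except (OSError, ValueError) as error:
            # URLError and HTTPError are subclasses of OSError;
            # ValueError is raised for a malformed URL.
            print(f"Skipped {url}: {error}")
            continue
        saved.append(destination)
    return saved
```

We could call it with a list such as `[(url, '0001-chart.gif'), ...]` built from the stock numbers above, and then check the returned list to see which charts were saved.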


## Fetching XML

In [10]:
pip install untangle

Collecting untangle
  Downloading untangle-1.1.1.tar.gz (3.1 kB)
Building wheels for collected packages: untangle
  Building wheel for untangle (setup.py) ... done
  Created wheel for untangle: filename=untangle-1.1.1-py3-none-any.whl size=3410 sha256=678ed047367a6d024ab37d3d424ef606a5d3de48f1d2aa254c5acdb9da946713
  Stored in directory: /Users/makzan/Library/Caches/pip/wheels/b9/a9/9c/45580c8b7a00e3e79b889e8e78a4f3427fff5a4d48f1cfea0a
Successfully built untangle
Installing collected packages: untangle
Successfully installed untangle-1.1.1
Note: you may need to restart the kernel to use updated packages.


## Example: SMG.gov.mo

The data comes from xml.smg.gov.mo.

![](xml.smg.gov.mo.png)

In [134]:
import untangle

# untangle turns each XML element into an attribute of the parsed object.
obj = untangle.parse('https://xml.smg.gov.mo/c_actual_brief.xml')

temperature = obj.ActualWeatherBrief.Custom.Temperature.Value.cdata
humidity = obj.ActualWeatherBrief.Custom.Humidity.Value.cdata

print("Current temperature in Macao: " + temperature + " degrees, humidity " + humidity + "%.")



Current temperature in Macao: 30 degrees, humidity 81%.


The code above may raise an error, depending on how many `Temperature` elements SMG.gov.mo returns.

If there is only one `Temperature` element, we can access it directly. If there is more than one, `Temperature` becomes a list. We can tell the two cases apart by checking `type(target) == list`.

In [138]:
type([]) == list

True
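
The same check is often written with `isinstance`, and it can be wrapped in a small helper so that the rest of the code always works with a list. This is a minimal sketch; `as_list` is a hypothetical helper name and the sample values are made up.

```python
def as_list(value):
    """Return value unchanged if it is a list, otherwise wrap it in a list."""
    return value if isinstance(value, list) else [value]


# A single item and a list of items both come back as a list.
print(as_list("30"))
print(as_list(["30", "29"]))
```

With such a helper, `as_list(obj.ActualWeatherBrief.Custom.Temperature)[0]` would give the first reading in both cases.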

In [139]:
import untangle

obj = untangle.parse('https://xml.smg.gov.mo/c_actual_brief.xml')

humidity = obj.ActualWeatherBrief.Custom.Humidity.Value.cdata

# Temperature is a list when the feed contains more than one reading.
if type(obj.ActualWeatherBrief.Custom.Temperature) == list:
    temperature = obj.ActualWeatherBrief.Custom.Temperature[0].Value.cdata
else:
    temperature = obj.ActualWeatherBrief.Custom.Temperature.Value.cdata


print("Current temperature in Macao: " + temperature + " degrees, humidity " + humidity + "%.")



Current temperature in Macao: 30 degrees, humidity 81%.
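
As an alternative to `untangle`, the standard library's `xml.etree.ElementTree` can parse the same kind of document, and its `findall` always returns a list, so the one-or-many check is not needed. The sketch below swaps in that technique and runs on a made-up XML string. The sample only mirrors the element names used in the code above; the real feed may contain more elements.

```python
import xml.etree.ElementTree as ET

# Made-up sample that mirrors the element names used above.
SAMPLE_XML = """
<ActualWeatherBrief>
  <Custom>
    <Temperature><Value>30</Value></Temperature>
    <Temperature><Value>29</Value></Temperature>
    <Humidity><Value>81</Value></Humidity>
  </Custom>
</ActualWeatherBrief>
""".strip()

root = ET.fromstring(SAMPLE_XML)

# findall returns a list, whether there are zero, one or many matches.
temperatures = [node.findtext("Value") for node in root.findall("Custom/Temperature")]
humidity = root.findtext("Custom/Humidity/Value")

print(temperatures)
print(humidity)
```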


## Example: Monthly gross gaming revenue

http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/index.html

In [81]:
import untangle
import datetime

today = datetime.date.today()
year = today.year

# The record list begins at 0, and we look for the previous month.
month = today.month - 1 - 1

# In January the previous month is December of the previous year.
if month < 0:
    year = year - 1
    month = 11 # list begins at 0.

url = f"http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/{year}/report_cn.xml?id=8"

data = untangle.parse(url)

month_data = data.STATISTICS.REPORT.DATA.RECORD[month]

net_income = month_data.DATA[1].cdata
last_net_income = month_data.DATA[2].cdata
change_rate = month_data.DATA[3].cdata
acc_net_income = month_data.DATA[4].cdata
acc_last_net_income = month_data.DATA[5].cdata
acc_change_rate = month_data.DATA[6].cdata

print(f"{year}-{month+1:02d} gross revenue: {net_income} ({year-1}: {last_net_income}), {change_rate}")
print(f"{year}-{month+1:02d} accumulated gross revenue: {acc_net_income} ({year-1}: {acc_last_net_income}), {acc_change_rate}")

2020-05 gross revenue: 1,764 (2019: 25,952), -93.2%
2020-05 accumulated gross revenue: 33,004 (2019: 125,691), -73.7%
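
The previous-month calculation is easy to get wrong around January, so it can help to move it into a function that we can check with fixed dates. This is a minimal sketch; `previous_month` is a hypothetical helper name, and it returns the month as a zero-based index to match the record list above.

```python
import datetime


def previous_month(today):
    """Return (year, zero_based_month_index) for the month before today."""
    if today.month == 1:
        # January: the previous month is December of the previous year.
        return today.year - 1, 11
    return today.year, today.month - 2


print(previous_month(datetime.date(2020, 6, 17)))  # (2020, 4), which is May 2020
print(previous_month(datetime.date(2020, 1, 5)))   # (2019, 11), which is December 2019
```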


## Monthly gross gaming revenue for the past 12 months

In [82]:
def fetch_and_print_dicj_year_month(year, month):
    '''Fetch the report of the given year and print one month (0 = January).'''
    url = f"http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/{year}/report_cn.xml?id=8"

    data = untangle.parse(url)

    month_data = data.STATISTICS.REPORT.DATA.RECORD[month]

    net_income = month_data.DATA[1].cdata
    last_net_income = month_data.DATA[2].cdata
    change_rate = month_data.DATA[3].cdata
    acc_net_income = month_data.DATA[4].cdata
    acc_last_net_income = month_data.DATA[5].cdata
    acc_change_rate = month_data.DATA[6].cdata

    print(f"{year}-{month+1:02d} gross revenue: {net_income} ({year-1}: {last_net_income}), {change_rate}")
#     print(f"{year}-{month+1:02d} accumulated gross revenue: {acc_net_income} ({year-1}: {acc_last_net_income}), {acc_change_rate}")

In [83]:
import untangle
import datetime

today = datetime.date.today()

# Stepping back by 30 days can skip or repeat a month, so we count whole months instead.
for i in range(12, 0, -1):
    # Number of months since year 0, moved back by i months.
    index = today.year * 12 + (today.month - 1) - i
    fetch_and_print_dicj_year_month(index // 12, index % 12)

2019-06 gross revenue: 23,812 (2018: 22,490), 5.9%
2019-07 gross revenue: 24,453 (2018: 25,327), -3.5%
2019-08 gross revenue: 24,262 (2018: 26,559), -8.6%
2019-09 gross revenue: 22,079 (2018: 21,952), 0.6%
2019-10 gross revenue: 26,443 (2018: 27,328), -3.2%
2019-11 gross revenue: 22,877 (2018: 24,995), -8.5%
2019-12 gross revenue: 22,838 (2018: 26,468), -13.7%
2020-01 gross revenue: 22,126 (2019: 24,942), -11.3%
2020-02 gross revenue: 3,104 (2019: 25,370), -87.8%
2020-03 gross revenue: 5,257 (2019: 25,840), -79.7%
2020-04 gross revenue: 754 (2019: 23,588), -96.8%
2020-05 gross revenue: 1,764 (2019: 25,952), -93.2%


## Example: Exchange Rate API

https://exchangeratesapi.io

In [84]:
import requests

url = "https://api.exchangeratesapi.io/latest?symbols=HKD&base=CNY"

response = requests.get(url)
# Parse the JSON body of the response into a Python dict.
data = response.json()
print(data)

# Look up the HKD rate for 1 CNY.
print(data['rates']['HKD'])


{'rates': {'HKD': 1.0935529258}, 'base': 'CNY', 'date': '2020-06-17'}
1.0935529258
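
Once the response is parsed into a dict, the rate can be used in calculations. The following sketch converts an amount using the response printed above as sample data, so it runs without a network request. `convert` is a hypothetical helper name.

```python
def convert(amount, data, symbol):
    """Convert amount from the base currency of data into symbol."""
    return amount * data["rates"][symbol]


# The response printed above, reused as sample data.
sample = {"rates": {"HKD": 1.0935529258}, "base": "CNY", "date": "2020-06-17"}

# 100 CNY in HKD, rounded to 2 decimal places.
print(round(convert(100, sample, "HKD"), 2))  # 109.36
```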
