<h1><b>Exercise 2: Web Scraping with Beautiful Soup (Part 1)</b></h1>

The document you are reading is not a static web page, but an interactive environment called a **Colab notebook** that lets you write and execute code.

In this exercise, you will:
+ Retrieve HTML source code from the web with the urllib module
+ Extract elements from the HTML source code with the BeautifulSoup module
+ Change the date format with the datetime module
+ Build a CSV file to save the retrieved information

## Retrieve HTML Source Code from the Web with the urllib module
Retrieve the HTML source code from the specified URL and decode the response bytes as UTF-8.

In [None]:
from urllib import request
url = "https://www.eitp.gov.hk/en/programme-document.php"
with request.urlopen(url) as response:
  data = response.read().decode("utf-8")
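
The request above assumes that the server responds, that the response arrives promptly, and that the page is encoded as UTF-8. The following is a minimal sketch of a more defensive variant, not part of the original exercise. The function name `fetch_html` and its parameters `timeout` and `default_charset` are hypothetical. It reads the charset from the response headers when one is declared and returns `None` if the request fails.

```python
from urllib import request
from urllib.error import HTTPError, URLError


def fetch_html(url, timeout=10, default_charset="utf-8"):
    """Return the decoded body of `url`, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            # Use the charset declared by the server, if any.
            charset = response.headers.get_content_charset() or default_charset
            return response.read().decode(charset, errors="replace")
    except HTTPError as err:
        # The server answered, but with an error status such as 404.
        print(f"HTTP error {err.code} for {url}")
    except URLError as err:
        # The request could not be completed (for example, a DNS failure).
        print(f"Could not reach {url}: {err.reason}")
    return None
```

With this helper, the cell above could be written as `data = fetch_html(url)`, followed by a check that `data` is not `None` before parsing.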

Display the result.

In [None]:
print(data)

## Extract Elements from HTML Source Code with the BeautifulSoup module

Import the bs4 package and parse the HTML source code into a BeautifulSoup object.

In [None]:
import bs4
root = bs4.BeautifulSoup(data, "html.parser")

Display the parsed document, re-indented by the `prettify()` method of BeautifulSoup.

In [None]:
print(root.prettify())

Find all elements with class "programme-item".

In [None]:
items = root.find_all(class_='programme-item')
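
To see what `find_all(class_=...)` returns without depending on the live page, the sketch below runs the same query on a small inline document. The markup, the `featured` class, and the `../docs/...` paths are hypothetical; they only mimic the structure that this exercise assumes.

```python
import bs4

# Hypothetical markup that mimics the structure assumed in this exercise.
sample_html = """
<div class="programme-item featured">
  <a href="../docs/guide-a.pdf"><div class="title">Guide A</div></a>
  <div class="date">Date : 05/03/2021</div>
</div>
<div class="programme-item">
  <a href="../docs/guide-b.pdf"><div class="title">Guide B</div></a>
  <div class="date">Date : 17/11/2020</div>
</div>
<div class="other">Not a programme item</div>
"""

sample_root = bs4.BeautifulSoup(sample_html, "html.parser")

# class_ matches any tag whose class list contains the given value,
# so the tag with class "programme-item featured" is matched as well.
sample_items = sample_root.find_all(class_="programme-item")
print(len(sample_items))

# The same query written as a CSS selector.
same_items = sample_root.select(".programme-item")
print(sample_items == same_items)
```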

Display the elements of class "programme-item".

In [None]:
print(items)

Find the first `div` element with class "title". Unlike `find_all()`, `find()` returns a single element rather than a list.

In [None]:
titles = root.find("div", class_="title")

Display the first element of class "title".

In [None]:
print(titles)

Display the text of the first element of class "title".

In [None]:
print(titles.string)
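
Because `find()` returns a single element, it returns `None` when nothing matches, and accessing `.string` on `None` raises an `AttributeError`. The sketch below shows one way to guard against that. The inline markup and the class name `subtitle` are hypothetical.

```python
import bs4

snippet = bs4.BeautifulSoup('<div class="title">Guide A</div>', "html.parser")

first_title = snippet.find("div", class_="title")
missing = snippet.find("div", class_="subtitle")   # no such element here

print(first_title.string)   # text of the first match
print(missing)              # find() returns None when nothing matches

# Guard before touching attributes of a possibly missing element.
subtitle_text = missing.string if missing is not None else ""
```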

Find all elements with class "date".

In [None]:
dates = root.find_all("div", class_="date")

Display the elements of class "date".

In [None]:
print(dates)

Display the text of each element with class "date", skipping elements whose `.string` is `None`.

In [None]:
for date in dates:
  if date.string is not None:
    print(date.string)
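
The `None` check is needed because `.string` only yields a value when a tag has exactly one child. The sketch below compares `.string` with `get_text()` on three hypothetical fragments: plain text, text with nested markup, and an empty tag.

```python
import bs4

# Hypothetical fragments: plain text, nested markup, and an empty tag.
fragment = bs4.BeautifulSoup(
    '<div class="date">Date : 05/03/2021</div>'
    '<div class="date">Date : <b>17/11/2020</b></div>'
    '<div class="date"></div>',
    "html.parser",
)

date_tags = fragment.find_all("div", class_="date")

# .string is None when the tag has no children or more than one child;
# get_text() always returns a str that joins all nested text.
strings = [tag.string for tag in date_tags]
texts = [tag.get_text() for tag in date_tags]

for string, text in zip(strings, texts):
    print(repr(string), repr(text))
```

If the dates on a page are wrapped in nested tags, `get_text()` may be the safer choice, at the cost of also returning empty strings for empty elements.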

Display the document dates with the "Date : " prefix removed.

In [None]:
for date in dates:
  if date.string is not None:
    DocDate = date.string.replace("Date : ", "")
    print(DocDate)
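
Removing the prefix with `replace()` works only while the page uses exactly the text "Date : ". As an alternative sketch, a regular expression can pull out the date itself regardless of the surrounding text. The names `DATE_PATTERN` and `extract_date` are hypothetical, and the pattern assumes dates that look like dd/mm/yyyy.

```python
import re

# One or two digits, a slash, one or two digits, a slash, four digits.
DATE_PATTERN = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b")


def extract_date(text):
    """Return the first dd/mm/yyyy-looking substring of `text`, or None."""
    match = DATE_PATTERN.search(text)
    return match.group(1) if match else None


print(extract_date("Date : 05/03/2021"))
print(extract_date("To be confirmed"))
```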

## Change the Date Format with the datetime module

Import datetime module for date formatting.

In [None]:
from datetime import datetime

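Convert the date format from dd/mm/yyyy to yyyy-mm-dd: `strptime()` parses the text into a `datetime` object and `strftime()` formats it back into text.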

In [None]:
for date in dates:
  if date.string is not None:
    DocDate = date.string.replace("Date : ", "")
    DocDateFormat = datetime.strptime(DocDate, "%d/%m/%Y").strftime('%Y-%m-%d')
    print(DocDateFormat)
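
`strptime()` raises a `ValueError` when the text does not match the format or names an impossible date, which would stop the loop at the first bad entry. The sketch below wraps the conversion in a hypothetical helper, `to_iso_date`, that returns `None` instead.

```python
from datetime import datetime


def to_iso_date(text, prefix="Date : "):
    """Convert 'Date : dd/mm/yyyy' to 'yyyy-mm-dd'; return None if it does not parse."""
    raw = text.replace(prefix, "").strip()
    try:
        return datetime.strptime(raw, "%d/%m/%Y").strftime("%Y-%m-%d")
    except ValueError:
        # The text is not a valid dd/mm/yyyy date.
        return None


print(to_iso_date("Date : 05/03/2021"))
print(to_iso_date("Date : 31/02/2021"))   # 31 February does not exist
```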

Display the name, original date, reformatted date, and URL of each element with class "programme-item".

In [None]:
for item in items:
  DocName = item.select('.title')[0].text
  DocDate = item.select('.date')[0].text.replace("Date : ", "")
  DocDateFormat = datetime.strptime(DocDate, "%d/%m/%Y").strftime('%Y-%m-%d')
  DocURL = 'https://www.eitp.gov.hk' + item.a.get('href').replace("..", "")
  print(DocName, DocDate, DocDateFormat, DocURL)
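
Building the URL by deleting ".." from the `href` works for links of the form `../path`, but it also removes any ".." that appears elsewhere in the link and does not handle links that are already absolute. As a sketch of an alternative, `urljoin()` from the standard library resolves a link against the page URL. The `href` values below are hypothetical; the real page may use different paths.

```python
from urllib.parse import urljoin

page_url = "https://www.eitp.gov.hk/en/programme-document.php"

# Hypothetical href values; the real page may use different paths.
hrefs = [
    "../filemanager/en/sample.pdf",        # relative to the parent directory
    "/filemanager/en/sample.pdf",          # relative to the site root
    "https://example.com/elsewhere.pdf",   # already absolute
]

resolved = [urljoin(page_url, href) for href in hrefs]
for doc_url in resolved:
    print(doc_url)
```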

## Build a CSV File to Save the Retrieved Information

Import the csv module.

In [None]:
import csv

Create the CSV file and write the header row with the column names.

In [None]:
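# Keep a reference to the file so that it can be closed later.
# newline='' prevents extra blank lines between rows on Windows.
csv_file = open('doc-list.csv', 'w', newline='', encoding='utf-8')
f = csv.writer(csv_file)
f.writerow(['DocName', 'DocDate(dd/mm/yyyy)', 'DocDate(yyyy-mm-dd)', 'DocURL'])

Write the records into the CSV file, then close the file so that all rows are saved to disk.

In [None]:
for item in items:
  DocName = item.select('.title')[0].text
  DocDate = item.select('.date')[0].text.replace("Date : ", "")
  DocDateFormat = datetime.strptime(DocDate, "%d/%m/%Y").strftime('%Y-%m-%d')
  DocURL = 'https://www.eitp.gov.hk' + item.a.get('href').replace("..", "")
  f.writerow([DocName, DocDate, DocDateFormat, DocURL])
csv_file.close()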
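
To check that rows really reach the disk, the sketch below writes a few records inside a `with` block, which closes the file automatically, and then reads them back. The records, the `docs/...` URLs, and the file name `doc-list-sample.csv` are hypothetical stand-ins for the scraped values.

```python
import csv
import os
import tempfile

# Hypothetical records standing in for the scraped values.
header = ["DocName", "DocDate(dd/mm/yyyy)", "DocDate(yyyy-mm-dd)", "DocURL"]
records = [
    ["Guide A", "05/03/2021", "2021-03-05", "https://www.eitp.gov.hk/docs/guide-a.pdf"],
    ["Guide B, revised", "17/11/2020", "2020-11-17", "https://www.eitp.gov.hk/docs/guide-b.pdf"],
]

# A temporary directory keeps the sketch from overwriting doc-list.csv.
csv_path = os.path.join(tempfile.mkdtemp(), "doc-list-sample.csv")

# newline="" is recommended by the csv module documentation;
# the with-block closes the file when it ends.
with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(header)
    writer.writerows(records)

# Read the file back to confirm what was written.
with open(csv_path, newline="", encoding="utf-8") as csv_file:
    rows_read = list(csv.reader(csv_file))

print(rows_read)
```

Note that the name containing a comma survives the round trip, because `csv.writer` quotes such fields automatically.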

## Full Version

In [None]:
from urllib import request
import bs4
import csv
from datetime import datetime

url = "https://www.eitp.gov.hk/en/programme-document.php"
with request.urlopen(url) as response:
  data = response.read().decode("utf-8")

root = bs4.BeautifulSoup(data, "html.parser")

items = root.find_all(class_='programme-item')

# The with-block closes the file when it ends, so all rows are saved to disk.
# newline='' prevents extra blank lines between rows on Windows.
with open('doc-list.csv', 'w', newline='', encoding='utf-8') as csv_file:
  f = csv.writer(csv_file)
  f.writerow(['DocName', 'DocDate(dd/mm/yyyy)', 'DocDate(yyyy-mm-dd)', 'DocURL'])

  for item in items:
    DocName = item.select('.title')[0].text
    DocDate = item.select('.date')[0].text.replace("Date : ", "")
    DocDateFormat = datetime.strptime(DocDate, "%d/%m/%Y").strftime('%Y-%m-%d')
    DocURL = 'https://www.eitp.gov.hk' + item.a.get('href').replace("..", "")
    f.writerow([DocName, DocDate, DocDateFormat, DocURL])
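
The full version mixes downloading, parsing, and writing in one script, so it can only be tried against the live page. As a sketch of one possible restructuring, the parsing step can be moved into a function that takes HTML text and returns rows, which makes it possible to try it on inline markup. The function name `extract_records` and the sample markup are hypothetical. Unlike the code above, this variant skips items that lack a title, date, or link, or whose date does not parse, instead of raising an error.

```python
from datetime import datetime
from urllib.parse import urljoin

import bs4


def extract_records(html, base_url):
    """Return [name, dd/mm/yyyy, yyyy-mm-dd, url] rows parsed from `html`."""
    root = bs4.BeautifulSoup(html, "html.parser")
    records = []
    for item in root.find_all(class_="programme-item"):
        title = item.select_one(".title")
        date = item.select_one(".date")
        link = item.find("a", href=True)
        if title is None or date is None or link is None:
            continue  # skip items that lack one of the expected parts
        raw_date = date.get_text().replace("Date : ", "").strip()
        try:
            iso_date = datetime.strptime(raw_date, "%d/%m/%Y").strftime("%Y-%m-%d")
        except ValueError:
            continue  # skip items whose date is not dd/mm/yyyy
        records.append([
            title.get_text(strip=True),
            raw_date,
            iso_date,
            urljoin(base_url, link["href"]),
        ])
    return records


# Hypothetical markup used only to exercise the function offline.
sample = """
<div class="programme-item">
  <a href="../docs/guide-a.pdf"><div class="title">Guide A</div></a>
  <div class="date">Date : 05/03/2021</div>
</div>
<div class="programme-item"><div class="title">No link or date</div></div>
"""

sample_records = extract_records(
    sample, "https://www.eitp.gov.hk/en/programme-document.php"
)
print(sample_records)
```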