## The requests module: fetching a page's HTML source

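- Start a local web server: cd into the directory that holds the HTML files, then run python3 -m http.server (it listens on port 8000 by default)
- Open http://localhost:8000/demo1.html in a browser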

In [None]:
import requests
url = "http://localhost:8000/demo1.html"
response = requests.get(url)
# Check whether the HTTP status code is 200 (requests.codes.ok)
print(response.status_code)
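
Comparing against `requests.codes.ok` reads better than a bare 200, and `raise_for_status()` turns any 4xx or 5xx reply into an exception. The following is only a minimal sketch: `is_ok` is a hypothetical helper name, and the hand-built `requests.Response` objects are stand-ins for real replies so that the example runs without the local server.

```python
import requests


def is_ok(response):
    """Return True when the reply carries HTTP status 200."""
    return response.status_code == requests.codes.ok


# Hand-built Response objects stand in for real replies (assumption:
# in the notebook these would come from requests.get(url)).
found = requests.Response()
found.status_code = 200

missing = requests.Response()
missing.status_code = 404

print(is_ok(found))
print(is_ok(missing))

# raise_for_status() raises HTTPError for 4xx and 5xx status codes.
try:
    missing.raise_for_status()
    error_message = None
except requests.exceptions.HTTPError as exc:
    error_message = str(exc)
print(error_message)
```

In the notebook itself, calling `response.raise_for_status()` right after `requests.get(url)` would stop the cell early when the page cannot be fetched.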

## The BeautifulSoup module: parsing web pages

In [None]:
# Get to know the structure of the page
# /Users/jacky/Documents/Lecture Python/files/html/demo1.html
'''
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>我是網頁標題</title>
</head>
<body>
    <h1 class="large">我是標題</h1>
    <div>
      <p>我是段落</p>
      <img src="https://www.nycu.edu.tw/userfiles/nycuch/images/20230915173911063.png" alt="我是圖片"><br>
      <a href="https://steam.nycu.edu.tw">學士後電子與光子學士學位學程</a>
    </div>
</body>
</html>
'''

## Printing the content of a specific tag

In [None]:
import requests
from bs4 import BeautifulSoup
url = "http://localhost:8000/demo1.html"
response = requests.get(url)
response.encoding = 'UTF-8'
sp = BeautifulSoup(response.text, 'html.parser')
print(sp.title)
print(sp.title.text)

In [None]:
print(sp.h1)
print(sp.h1.text)
print(sp.p)
print(sp.p.text)
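
The cells above rely on three behaviours: `sp.tag_name` returns the first matching tag, `.text` returns the string inside it, and a tag that does not exist comes back as `None`. The sketch below illustrates them on an inline page, which is an assumed, cut-down English stand-in for demo1.html rather than the real file.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for demo1.html (assumption: the real page is similar).
html = """
<html>
<head><title>Page title</title></head>
<body>
  <h1 class="large">Heading</h1>
  <div><p>First paragraph</p><p>Second paragraph</p></div>
</body>
</html>
"""

sp = BeautifulSoup(html, "html.parser")

title_tag = sp.title          # a Tag object, printed with its markup
title_text = sp.title.text    # only the string inside the tag
first_p = sp.p.text           # attribute access returns the first <p> only
missing = sp.table            # no <table> on the page, so this is None

print(title_tag)
print(title_text)
print(first_p)

# Guard before using .text; calling .text on None raises AttributeError.
table_text = missing.text if missing is not None else "(no table found)"
print(table_text)
```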

In [None]:
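# Download the image
from pathlib import Path
print(sp.img)
print(sp.img["src"])
url = sp.img["src"]
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
img_path = Path.cwd() / ".." / "files" / "image" / "nycu_logo.png"
img_path.parent.mkdir(parents=True, exist_ok=True)  # make sure the target folder exists

with open(img_path, "wb") as img_file:
    img_file.write(response.content)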
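
The download cell hard-codes the file name. One possible alternative is to derive the name from the image URL and to create the target folder on demand. This is a sketch under assumptions: `filename_from_url` and `save_bytes` are hypothetical helpers, and placeholder bytes are written instead of a real image so that no network access is needed.

```python
import tempfile
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return PurePosixPath(urlparse(url).path).name


def save_bytes(content, target):
    """Write bytes to target, creating missing parent folders first."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return target


url = "https://www.nycu.edu.tw/userfiles/nycuch/images/20230915173911063.png"
name = filename_from_url(url)
print(name)

# Placeholder bytes stand in for response.content (assumption).
with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(b"placeholder bytes", Path(tmp) / "files" / "image" / name)
    saved_size = saved.stat().st_size
    print(saved.name, saved_size)
```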

In [None]:
# Get the URL and open the page in a browser
print(sp.a)
print(sp.a.text)
print(sp.a["href"])

import webbrowser
url = sp.a["href"]
webbrowser.open(url)
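
The link in demo1.html is absolute, but an `href` can also be relative to the page it appears on, in which case it has to be resolved before it can be opened. A minimal sketch using `urllib.parse.urljoin` follows; the inline markup and its relative link to demo2.html are assumptions made for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

page_url = "http://localhost:8000/demo1.html"

# Stand-in markup (assumption): one absolute link and one relative link.
html = """
<div>
  <a href="https://steam.nycu.edu.tw">absolute link</a>
  <a href="demo2.html">relative link</a>
</div>
"""

sp = BeautifulSoup(html, "html.parser")

# urljoin leaves the absolute URL alone and resolves the relative one
# against page_url.
absolute_links = [urljoin(page_url, a["href"]) for a in sp.find_all("a")]
for link in absolute_links:
    print(link)
```

Any entry of `absolute_links` can then be passed to `webbrowser.open()`.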

## Finding specific tags: find() and find_all()

In [None]:
# Structure of the second demo page
# /Users/jacky/Documents/Lecture Python/files/html/demo2.html
'''
<!DOCTYPE html>
<html>
<head>
    <!--TAG meta to set encoding, description, keywords, author-->
    <!--TAG title to set title of browser tab--> 
    <meta charset="UTF-8">
    <title>爬蟲練習二</title>   
</head>

<body>
    <!--h1~h6 tags define headings-->
    <h1>我是 h1 大小的標籤</h1>

    <!--the div tag defines a block-level section of the document and can be treated as a container-->
    <div>
        <p id="x01" style="color:brown">我是文字一在段落內</p>
        <p id="x02" style="font-size:16pt">我是文字二在段落內</p>
        <p id="x03" class="normal">我是文字三在段落內</p>
        <hr>
        <ul>
            <li class="nycu"><a href="http://www.nycu.edu.tw" target="_blank">連結陽明交大</a></li>
            <li class="nthu"><a href="http://www.nthu.edu.tw" target="_blank">連結清華大學</a></li>
        </ul>
        <img src="https://joycat.org/images/web_design/HTML5.png" alt="HTML5示例圖像" width="90" height="60">
        <p>點擊連結至 <a href="https://zh.wikipedia.org/zh-tw/HTML" target="_blank">wiki HTML的說明</a></p>
    </div>
</body>
</html>        
'''

In [None]:
import requests
from bs4 import BeautifulSoup
url = "http://localhost:8000/demo2.html"
response = requests.get(url)
response.encoding = 'UTF-8'
sp = BeautifulSoup(response.text, 'html.parser')

In [None]:
# sp.tag_name returns only the first matching tag
print(sp.title.text)
print(sp.p.text)
print(sp.a.text)


In [None]:
# sp.find(tag_name) also returns only the first match; use find_all(tag_name) to get every match
print(sp.find('p').text)
print(sp.find_all('p'))
print('there are', len(sp.find_all('p')), 'p tags')
# Iterate over the results directly instead of hard-coding how many there are
for p in sp.find_all('p'):
    print(p.text)
for a in sp.find_all('a'):
    print(a.text)

In [None]:
print(sp.find('p', style='color:brown').text)
print(sp.find('p', id='x02').text)
print(sp.find('p', class_='normal').text)
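
Besides keyword filters such as `id=` and `class_=`, `find()` and `find_all()` accept an `attrs` dictionary, a `limit`, and a list of tag names. The sketch below shows these on an inline page, which is an assumed, cut-down English stand-in for demo2.html.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for demo2.html (assumption).
html = """
<p id="x01" style="color:brown">text one</p>
<p id="x02" style="font-size:16pt">text two</p>
<p id="x03" class="normal">text three</p>
<ul>
  <li class="nycu"><a href="http://www.nycu.edu.tw">NYCU</a></li>
  <li class="nthu"><a href="http://www.nthu.edu.tw">NTHU</a></li>
</ul>
"""

sp = BeautifulSoup(html, "html.parser")

# An attrs dict works for any attribute name, including names that
# cannot be written as Python keyword arguments.
by_id = sp.find("p", attrs={"id": "x02"}).text

# limit stops the search after the given number of matches.
first_two = [p.text for p in sp.find_all("p", limit=2)]

# A list of tag names matches any of them, in document order.
mixed = [tag.name for tag in sp.find_all(["p", "li"])]

# find() returns None when nothing matches; find_all() returns an empty list.
no_match = sp.find("p", id="x99")
no_matches = sp.find_all("p", id="x99")

print(by_id, first_two, mixed, no_match, len(no_matches))
```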

## Finding content with CSS selectors: select()

In [None]:
# print(sp.select('title')[0].text)
# for i in range(4):
#     print(sp.select('p')[i].text)
# for p in sp.select('p'):
#     print(p.text)

print(sp.select("#x01")[0].text)
print(sp.select('.normal')[0].text)
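
`select()` always returns a list, which is why the cell above indexes with `[0]`. When only the first match is wanted, `select_one()` returns it directly, and selectors can be combined much as in a stylesheet. This is a sketch on an assumed, cut-down English stand-in for demo2.html.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for demo2.html (assumption).
html = """
<p id="x01" style="color:brown">text one</p>
<p id="x03" class="normal">text three</p>
<ul>
  <li class="nycu"><a href="http://www.nycu.edu.tw">NYCU</a></li>
  <li class="nthu"><a href="http://www.nthu.edu.tw">NTHU</a></li>
</ul>
<p>See <a href="https://zh.wikipedia.org/zh-tw/HTML">wiki</a></p>
"""

sp = BeautifulSoup(html, "html.parser")

# select_one() returns the first match (or None) instead of a list.
first = sp.select_one("#x01").text

# Tag name and class combined: a <p> with class "normal".
normal = sp.select_one("p.normal").text

# Child combinator: the <a> directly inside <li class="nycu">.
nycu_href = sp.select_one("li.nycu > a")["href"]

# Attribute selector: every <a> whose href starts with "http://".
http_links = [a.text for a in sp.select('a[href^="http://"]')]

print(first, normal, nycu_href, http_links)
```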

## *Getting a tag's attribute values*

In [None]:
# sp.tag_name[attribute_name] vs sp.tag_name.get(attribute_name)
print(sp.img['src'],'\t', sp.img.get('src'), '\n')
# sp.tag_name vs sp.find(tag_name) vs sp.find_all(tag_name)
print(sp.img['src'], '\t', sp.find('img')['src'], '\t', sp.find_all('img')[0]['src'], '\n')
# sp.tag_name vs sp.select(tag_name)
print(sp.img['src'], '\t', sp.select('img')[0]['src'], '\n')

print(sp.select('a')[0].get('href'), '\t', sp.select('a')[0]['href'], '\n')
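
The two access styles compared above differ when the attribute is missing: `tag[name]` raises `KeyError`, while `tag.get(name)` returns `None` or a default. The sketch below illustrates this, together with `tag.attrs` and the list returned for `class`; the inline markup is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in markup (assumption): an <img> without alt and a link with two classes.
html = '<img src="logo.png" width="90"><a class="btn primary" href="/home">Home</a>'
sp = BeautifulSoup(html, "html.parser")

img = sp.img
src = img["src"]

# .get() returns None (or the given default) when the attribute is absent.
alt_or_none = img.get("alt")
alt_default = img.get("alt", "(no alt text)")

# Square-bracket access raises KeyError for a missing attribute.
try:
    img["alt"]
    raised = False
except KeyError:
    raised = True

# .attrs exposes all attributes of the tag as a dictionary.
attrs = img.attrs

# class is multi-valued, so it comes back as a list of names.
classes = sp.a["class"]

print(src, alt_or_none, alt_default, raised, attrs, classes)
```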

## Lab

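1. Analyze the page http://ehappy.tw/bsdemo1.htm and extract the source URL of the HTML 5 image on it.
2. Analyze the page http://ehappy.tw/bsdemo2.htm and scrape the text '我是段落一' from it.
3. Analyze the page http://ehappy.tw/bsdemo2.htm and extract the URL of the link '我是超連結2'.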