## Stock catch

Open the Headers tab and check the Request URL and Request Method under General, plus the Content-Type under Response Headers. They tell us the following:
https://ithelp.ithome.com.tw/upload/images/20211006/20139358GVxM0ylTC2.png

- URL: https://www.twse.com.tw/exchangeReport/STOCK_DAY
- Required query: response=json&date=20211006&stockNo=2330. You can substitute whatever values you need for response, date, and stockNo.
- The HTTP method is GET.
- The Content-Type header is application/json;charset=UTF-8, so the format is JSON and the encoding is UTF-8.
- Note also that whatever day the date query specifies, the server returns the daily trading data for that entire month.
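
The query string described above can also be assembled programmatically. The following is a minimal sketch using only the standard library. The helper name `build_stock_day_url` is hypothetical, and fixing the day to 01 is just one way to reflect that the server returns the whole month anyway.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.twse.com.tw/exchangeReport/STOCK_DAY"


def build_stock_day_url(year, month, stock_no):
    """Hypothetical helper: build the GET URL for one stock and one month."""
    # The server returns the whole month, so the day part is fixed to 01.
    date = "{:04d}{:02d}01".format(year, month)
    query = urlencode({"response": "json", "date": date, "stockNo": stock_no})
    return "{}?{}".format(BASE_URL, query)


print(build_stock_day_url(2021, 10, "2330"))
```

Passing the same dictionary as `params=` to `requests.get()` produces the same query string, so this helper is mainly useful for logging or debugging the final URL.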

![image.png](attachment:image.png)

In [1]:
# https://www.twse.com.tw/zh/page/trading/exchange/STOCK_DAY.html

In [2]:
import requests

url = "https://www.twse.com.tw/exchangeReport/STOCK_DAY"

# Send the GET request with the query parameters
res = requests.get(url, params={
    "response": "json",
    "date": "20211006",
    "stockNo": "2330"
})

# Parse the JSON body into Python objects
res_json = res.json()

# The daily trading data is in the "data" field
daily_price_list = res_json.get("data", [])

# The field is a list, so print it row by row with a for loop
for daily_price in daily_price_list:
    print(daily_price)

['110/10/01', '39,006,124', '22,368,447,571', '579.00', '579.00', '571.00', '574.00', '-6.00', '78,088']
['110/10/04', '22,288,138', '12,741,914,919', '574.00', '575.00', '569.00', '572.00', '-2.00', '42,147']
['110/10/05', '36,063,650', '20,406,526,124', '562.00', '572.00', '560.00', '572.00', ' 0.00', '56,678']
['110/10/06', '33,989,379', '19,345,468,774', '573.00', '574.00', '565.00', '571.00', '-1.00', '53,606']
['110/10/07', '28,480,400', '16,458,019,782', '575.00', '582.00', '572.00', '580.00', '+9.00', '23,304']
['110/10/08', '22,601,904', '13,051,944,932', '582.00', '583.00', '573.00', '575.00', '-5.00', '24,157']
['110/10/12', '27,126,088', '15,455,194,493', '570.00', '575.00', '564.00', '575.00', ' 0.00', '32,210']
['110/10/13', '19,267,437', '11,006,807,040', '572.00', '575.00', '570.00', '571.00', '-4.00', '18,050']
['110/10/14', '14,270,641', '8,213,170,376', '579.00', '579.00', '573.00', '573.00', '+2.00', '14,814']
['110/10/15', '55,305,252', '32,878,638,999', '592.00', 
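
Every cell in the rows above is a string: the date uses the ROC calendar (year 110 is 2021) and the numbers carry thousands separators. The following is a minimal sketch of converting one row into typed values. The helper names `parse_roc_date` and `parse_number` are hypothetical, and the sketch assumes every non-date cell is a plain number, which may not hold for every stock or day.

```python
from datetime import date


def parse_roc_date(text):
    """Convert an ROC-calendar date such as '110/10/01' to datetime.date."""
    year, month, day = (int(part) for part in text.split("/"))
    return date(year + 1911, month, day)  # ROC year 110 is 2021


def parse_number(text):
    """Turn '39,006,124', '579.00', '+9.00' or ' 0.00' into a float."""
    return float(text.replace(",", "").strip())


# First row of the output above
row = ['110/10/01', '39,006,124', '22,368,447,571', '579.00', '579.00',
       '571.00', '574.00', '-6.00', '78,088']

trade_date = parse_roc_date(row[0])
values = [parse_number(cell) for cell in row[1:]]
print(trade_date, values)
```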

## Stock catch 2

Based on the information gathered in the previous post:

- URL: https://isin.twse.com.tw/isin/class_main.jsp
- Required query: market=1&issuetype=1&Page=1&chklike=Y.
- You can substitute your own values for market, issuetype, Page, and chklike as needed.
- The HTTP method is GET.
- Content-Type: text/html;charset=MS950, so the format is HTML and the encoding is MS950.
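
To see why the declared charset matters, the sketch below encodes a short string as MS950 and decodes it both ways. It runs offline; the sample text is only an illustration and does not come from the server.

```python
text = "台積電"

# Simulate the raw bytes a server would send with charset=MS950.
raw = text.encode("ms950")

# Decoding with the wrong codec yields replacement characters.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the declared codec restores the original text.
right = raw.decode("ms950")

print(repr(wrong))
print(right)
```

Python treats `ms950` as an alias of its `cp950` codec, which is why the charset name from the header can be used directly.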

In [3]:
# project2
# pip install beautifulsoup4
# pip install lxml

In [4]:
from bs4 import BeautifulSoup
html = "<html>test test test</html>"
soup = BeautifulSoup(html, "lxml")
soup

<html><body><p>test test test</p></body></html>

In [5]:
from bs4 import BeautifulSoup
html = "<title>example1</title><title>example2</title>"
soup = BeautifulSoup(html, "lxml")

# find_all() searches the whole document and returns every match
print(soup.find_all("title"))
# [<title>example1</title>, <title>example2</title>]

# find() returns only the first match
print(soup.find("title"))
# <title>example1</title>

# Get the text between the tags
print(soup.find("title").text)
# example1


[<title>example1</title>, <title>example2</title>]
<title>example1</title>
example1


In [6]:
import json
import requests
from bs4 import BeautifulSoup
# Index constants: each number is the position of the data we want in a list
TARGET_TABLE_INDEX = 1
STOCK_NO_INDEX = 2
STOCK_NAME_INDEX = 3
STOCK_INDUSTRY_INDEX = 6

# JSON settings
TITLE = "stock"
JSON_INDENT = 4

# HTTP Request
url = "https://isin.twse.com.tw/isin/class_main.jsp"
res = requests.get(url, params={
    "market": "1",
    "issuetype": "1",
    "Page": "1",
    "chklike": "Y"
})

# Set the encoding explicitly; decoded as UTF-8, res.text would come out garbled
res.encoding = "MS950"
res_html = res.text

# Parse
soup = BeautifulSoup(res_html, "lxml")

# The HTML contains two tables, so find_all("table") returns a list of length 2
tr_list = soup.find_all("table")[TARGET_TABLE_INDEX].find_all("tr")
tr_list.pop(0)  # the first row holds the column names, which we do not need

# Process the rows
result = []
for tr in tr_list:
    td_list = tr.find_all("td")
    stock_no_val = td_list[STOCK_NO_INDEX].text  # stock number
    stock_name_val = td_list[STOCK_NAME_INDEX].text  # stock name
    stock_industry_val = td_list[STOCK_INDUSTRY_INDEX].text  # industry

    # Save to dict
    result.append({
        "stockNo": stock_no_val,
        "stockName": stock_name_val,
        "stockIndustry": stock_industry_val
    })

# Write the result to a JSON file
stock_list_dict = {TITLE: result}
with open("stock_info_list.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(stock_list_dict, indent=JSON_INDENT, ensure_ascii=False))
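
Once the file is written, it can be read back and regrouped with the standard library alone. The following sketch assumes the same shape as `stock_info_list.json`; the three entries are illustrative sample data embedded inline so the snippet runs without the file.

```python
import json
from collections import defaultdict

# Illustrative sample in the same shape as stock_info_list.json.
sample = json.loads("""
{"stock": [
    {"stockNo": "2330", "stockName": "台積電", "stockIndustry": "半導體業"},
    {"stockNo": "6531", "stockName": "愛普*", "stockIndustry": "半導體業"},
    {"stockNo": "8070", "stockName": "長華*", "stockIndustry": "電子通路業"}
]}
""")

# Group stock numbers by industry.
by_industry = defaultdict(list)
for stock in sample.get("stock", []):
    by_industry[stock["stockIndustry"]].append(stock["stockNo"])

for industry, stock_nos in by_industry.items():
    print(industry, stock_nos)
```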

## Spoofing the User-Agent

F12 -> Network -> Request -> Headers -> User Agent

In [7]:
import requests
res = requests.get("https://www.google.com/")  # no custom headers, so requests sends its default User-Agent
print(res.request.headers)

{'User-Agent': 'python-requests/2.26.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}


In [8]:
pip install fake-useragent

Note: you may need to restart the kernel to use updated packages.


In [9]:
# Generate a User-Agent string to send with the request
from fake_useragent import UserAgent
ua = UserAgent()

# User-Agent strings for specific browsers, or a random one
ua.ie
ua.google
ua.firefox
ua.safari
ua.random

'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'

In [10]:
import requests
from fake_useragent import UserAgent

ua = UserAgent()
res = requests.get("https://www.google.com/", headers={
    "User-Agent": ua.random
})

print(res.request.headers)

{'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}
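
If installing an extra package is not an option, a similar effect can be approximated with a fixed pool and the standard library. This is only a sketch: the helper name `random_headers` is hypothetical, and the two strings in the pool are simply copied from the outputs above.

```python
import random

# Fixed pool copied from the outputs shown above (assumption: any realistic
# browser string would serve equally well here).
USER_AGENT_POOL = [
    "Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/36.0.1985.125 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/37.0.2049.0 Safari/537.36",
]


def random_headers():
    """Hypothetical helper: headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENT_POOL)}


headers = random_headers()
print(headers)
```

The resulting dictionary can be passed as `headers=headers` to `requests.get()`, exactly like the `ua.random` version.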


## Multithreading

In [11]:
import time

def req1():
    print("req1: sending request")
    time.sleep(3)
    print("req1: response received")

def req2():
    print("req2: sending request")
    time.sleep(3)
    print("req2: response received")

if __name__ == '__main__':
    req1()
    req2()

req1: sending request
req1: response received
req2: sending request
req2: response received


In [12]:
# Import Python's threading module
import time
from threading import Thread

def req1():
    print("req1: sending request")
    time.sleep(3)
    print("req1: response received")

def req2():
    print("req2: sending request")
    time.sleep(3)
    print("req2: response received")

if __name__ == '__main__':
    # Wrap each function in a Thread object, then start it
    Thread(target=req1).start()
    Thread(target=req2).start()

req1: sending request
req2: sending request
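
The benefit of the threaded version shows up in elapsed time. The following sketch measures both approaches with a shortened delay. `fake_request` is a hypothetical stand-in that only sleeps, and `join()` is used here just to wait for both threads before the clock is stopped.

```python
import time
from threading import Thread


def fake_request(name, delay=0.2):
    """Stand-in for a network call; sleeps instead of sending a request."""
    time.sleep(delay)


# Sequential: the second call starts only after the first returns.
start = time.perf_counter()
fake_request("req1")
fake_request("req2")
sequential = time.perf_counter() - start

# Threaded: both calls wait at the same time.
start = time.perf_counter()
threads = [Thread(target=fake_request, args=(name,)) for name in ("req1", "req2")]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for completion so the timing is meaningful
threaded = time.perf_counter() - start

print("sequential: {:.2f}s, threaded: {:.2f}s".format(sequential, threaded))
```

The sequential run takes roughly the sum of the two delays, while the threaded run takes roughly the longer of the two.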


In [13]:
import time
import requests
from threading import Thread

# Wrap the workflow in a function first so that it can be handed to a thread
def daily_price_req(date, stock_no):
    res = requests.get("https://www.twse.com.tw/exchangeReport/STOCK_DAY",
                       params={
                           "response": "json",
                           "date": date,
                           "stockNo": stock_no
                       })

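    # Parse the JSON body into Python objects
    res_json = res.json()

    # The daily trading data is in the "data" field; fall back to an empty
    # list when the field is missing
    daily_price_list = res_json.get("data", [])

    # Print the data (the label 每日成交資訊 means "daily trading data")
    print("{} 每日成交資訊: {}".format(stock_no, daily_price_list))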


if __name__ == '__main__':
    # Set the dates and stock numbers as needed
    req_info_list = [
        {
            "date": "20211011",
            "stockNo": "2330"
        },
        {
            "date": "20211011",
            "stockNo": "2603"
        },
        {
            "date": "20211011",
            "stockNo": "2609"
        }
    ]

    for req_info in req_info_list:
        req_date = req_info.get("date")
        req_stock_no = req_info.get("stockNo")
        if req_date and req_stock_no:
            # args holds the arguments passed to the function
            req_thread = Thread(target=daily_price_req,
                                args=(req_date, req_stock_no))
            req_thread.start()

            # Keep at least 3 seconds between requests, or TWSE may block your IP for a while
            time.sleep(3)

2330 每日成交資訊: [['110/10/01', '39,006,124', '22,368,447,571', '579.00', '579.00', '571.00', '574.00', '-6.00', '78,088'], ['110/10/04', '22,288,138', '12,741,914,919', '574.00', '575.00', '569.00', '572.00', '-2.00', '42,147'], ['110/10/05', '36,063,650', '20,406,526,124', '562.00', '572.00', '560.00', '572.00', ' 0.00', '56,678'], ['110/10/06', '33,989,379', '19,345,468,774', '573.00', '574.00', '565.00', '571.00', '-1.00', '53,606'], ['110/10/07', '28,480,400', '16,458,019,782', '575.00', '582.00', '572.00', '580.00', '+9.00', '23,304'], ['110/10/08', '22,601,904', '13,051,944,932', '582.00', '583.00', '573.00', '575.00', '-5.00', '24,157'], ['110/10/12', '27,126,088', '15,455,194,493', '570.00', '575.00', '564.00', '575.00', ' 0.00', '32,210'], ['110/10/13', '19,267,437', '11,006,807,040', '572.00', '575.00', '570.00', '571.00', '-4.00', '18,050'], ['110/10/14', '14,270,641', '8,213,170,376', '579.00', '579.00', '573.00', '573.00', '+2.00', '14,814'], ['110/10/15', '55,305,252', '32,8

In [14]:
req_thread_list = []
for req_info in req_info_list:
    req_date = req_info.get("date")
    req_stock_no = req_info.get("stockNo")
    if req_date and req_stock_no:
        req_thread = Thread(target=daily_price_req,
                            args=(req_date, req_stock_no))
        req_thread.start()
        req_thread_list.append(req_thread)

        # Keep at least 3 seconds between requests, or TWSE may block your IP for a while
        time.sleep(3)

for req_thread in req_thread_list:
    # join() makes the program wait until that thread has finished before moving on.
    # Here it means: wait until every request thread (the work inside
    # daily_price_req) is done, then continue.
    req_thread.join()

# do something...
print("Ending...")

2330 每日成交資訊: [['110/10/01', '39,006,124', '22,368,447,571', '579.00', '579.00', '571.00', '574.00', '-6.00', '78,088'], ['110/10/04', '22,288,138', '12,741,914,919', '574.00', '575.00', '569.00', '572.00', '-2.00', '42,147'], ['110/10/05', '36,063,650', '20,406,526,124', '562.00', '572.00', '560.00', '572.00', ' 0.00', '56,678'], ['110/10/06', '33,989,379', '19,345,468,774', '573.00', '574.00', '565.00', '571.00', '-1.00', '53,606'], ['110/10/07', '28,480,400', '16,458,019,782', '575.00', '582.00', '572.00', '580.00', '+9.00', '23,304'], ['110/10/08', '22,601,904', '13,051,944,932', '582.00', '583.00', '573.00', '575.00', '-5.00', '24,157'], ['110/10/12', '27,126,088', '15,455,194,493', '570.00', '575.00', '564.00', '575.00', ' 0.00', '32,210'], ['110/10/13', '19,267,437', '11,006,807,040', '572.00', '575.00', '570.00', '571.00', '-4.00', '18,050'], ['110/10/14', '14,270,641', '8,213,170,376', '579.00', '579.00', '573.00', '573.00', '+2.00', '14,814'], ['110/10/15', '55,305,252', '32,8
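
Printing inside the thread is fine for a demo, but later steps usually need the data itself. The following sketch shows one way to collect per-thread results into a shared dictionary guarded by a lock. `fetch_and_store` is a hypothetical worker that fakes the response instead of calling TWSE, so it can run offline.

```python
import time
from threading import Lock, Thread

results = {}
results_lock = Lock()


def fetch_and_store(stock_no):
    """Hypothetical worker: fakes a response and stores it under stock_no."""
    time.sleep(0.1)  # stands in for the HTTP round trip
    fake_rows = [["110/10/01", stock_no]]  # placeholder, not real data
    with results_lock:  # guard the shared dict against concurrent writes
        results[stock_no] = fake_rows


threads = []
for stock_no in ["2330", "2603", "2609"]:
    thread = Thread(target=fetch_and_store, args=(stock_no,))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()  # results is complete only after every thread has finished

print(sorted(results))
```

Reading `results` before the `join()` loop finishes could miss entries, which is the same reason the cell above joins every thread before printing "Ending...".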

## Project: automatically updating daily per-stock trading data

In [15]:
# Progress bar

class ProgressBar:
    bar_string_fmt = '\rProgress: [{}{}] {:.2%} {}/{}'
    cnt = 0

    def __init__(self, total, bar_total=20):
        self.total = total  # total number of tasks
        self.bar_total = bar_total  # width of the progress bar in characters

    def update(self, step=1):
        total = self.total
        self.cnt += step  # advance the progress counter
        bar_cnt = int((self.cnt / total) * self.bar_total)  # number of filled cells
        space_cnt = self.bar_total - bar_cnt  # number of blank cells

        # Draw the bar. The leading '\r' is a carriage return: it moves the cursor
        # back to the start of the line, so each print overwrites the previous one
        # instead of starting a new line.
        # {:.2%} renders the value as a percentage with two decimal places.
        progress = self.bar_string_fmt.format(
            '█' * bar_cnt,
            ' ' * space_cnt,
            self.cnt/total,
            self.cnt,
            total
        )
        print(progress, end='')

        # Finish with a line break at 100%
        if self.cnt/total == 1:
            print('\n')
            
if __name__ == '__main__':
    total = 10000
    progress_bar = ProgressBar(total)
    for _ in range(total):
        progress_bar.update()

Progress: [                    ] 0.01% 1/10000
Progress: [██████              ] 33.59% 3359/10000
Progress: [████████████        ] 64.44% 6444/10000
Progress: [███████████████████ ] 98.58% 9858/10000
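
The arithmetic behind one frame of the bar can be checked in isolation. The following sketch pulls the formatting out into a pure function. `render_bar` is a hypothetical name, and the leading `'\r'` is left out so the result is easy to compare.

```python
BAR_FMT = 'Progress: [{}{}] {:.2%} {}/{}'


def render_bar(cnt, total, bar_total=20):
    """Return the text of the bar for cnt finished tasks out of total."""
    bar_cnt = int((cnt / total) * bar_total)  # filled cells, rounded down
    space_cnt = bar_total - bar_cnt           # empty cells
    return BAR_FMT.format('█' * bar_cnt, ' ' * space_cnt, cnt / total, cnt, total)


print(render_bar(3359, 10000))
```

For 3359 of 10000 tasks, `int(0.3359 * 20)` is 6, so the bar shows 6 filled cells and 14 blanks, matching the 33.59% frame in the output above.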

In [16]:
import json
import time
import requests
from pathlib import Path
from threading import Thread
from datetime import datetime
from fake_useragent import UserAgent

class DailyPriceSpider:
    def __init__(self, progress_bar=None):

        # Reuse the TCP connection across requests
        self.req = requests.Session()
        self.url = "https://www.twse.com.tw/exchangeReport/STOCK_DAY"
        self.headers = self.req.headers

        # Spoof the User-Agent
        ua = UserAgent()
        self.headers["User-Agent"] = ua.random

        # Attach the progress bar; always set the attribute so that scrape()
        # can test it even when no bar was passed in
        self.progress_bar = progress_bar
 
    # For internal use
    def __get(self, date, stock_no):
        res = self.req.get(self.url,
                           headers=self.headers,
                           params={
                               "response": "csv",  # this time we fetch the CSV format
                               "date": date,
                               "stockNo": stock_no
                           })
        return res.text
    
    # For internal use
    def __save_file(self, res_text, path):
        # Drop the empty lines in res_text
        res_text = '\n'.join(
            filter(None, res_text.splitlines())
        )
        path = Path(path)

        # parents=True creates missing parent directories as well
        # exist_ok=True suppresses the exception when the directory already exists
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, 'w', encoding="utf-8") as file:
            file.write(res_text)
 
    
    def scrape(self, date, stock_no, save_path=""):
        # Combine the two helper methods above
        res_text = self.__get(date, stock_no)
        if save_path:
            self.__save_file(res_text, save_path)

        # Update the progress bar
        if self.progress_bar:
            self.progress_bar.update()


if __name__ == '__main__':
    SAVE_PATH_ROOT = "./daily_stock_price/"
    SLEEP_TIME = 3
    stock_info_list_file = {}
    with open("./stock_info_list.json", "r", encoding="utf-8") as f:
        stock_info_list_file = json.load(f)

    # Whatever day we ask for, the server returns the whole month, so the day is fixed to the 1st
    # strftime() formats the datetime the way we need
    today_date = "{}01".format(datetime.now().strftime("%Y%m"))

    stock_info_list = stock_info_list_file.get("stock", [])

    # Create the progress bar
    progress_bar = ProgressBar(len(stock_info_list))
    dps = DailyPriceSpider(progress_bar=progress_bar)
    req_thread_list = []

    for stock_info in stock_info_list:

        stock_no = stock_info.get("stockNo")
        stock_name = stock_info.get("stockName")
        stock_industry = stock_info.get("stockIndustry")
        file_name = "{}_{}_daily_price.csv".format(
            today_date[:-2],  # only the year and month are needed
            stock_no+stock_name)

        save_path = "{}/{}/{}/{}".format(SAVE_PATH_ROOT,
                                         stock_industry,
                                         stock_no+stock_name,
                                         file_name)

        if stock_no and stock_name and stock_industry:
            req_thread = Thread(target=dps.scrape,
                                args=(today_date, stock_no, save_path),
                                daemon=True)
            req_thread.start()
            req_thread_list.append(req_thread)
            time.sleep(SLEEP_TIME)

    for req_thread in req_thread_list:
        req_thread.join()

    print("Finished")


Progress: [█████████████████   ] 85.92% 830/966

Exception in thread Thread-849:
Traceback (most recent call last):
  File "C:\Users\cti110016\Anaconda3\lib\threading.py", line 973, in _bootstrap_inner
    self.run()
  File "C:\Users\cti110016\Anaconda3\lib\threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\CTI110~1\AppData\Local\Temp/ipykernel_24864/16292542.py", line 55, in scrape
  File "C:\Users\CTI110~1\AppData\Local\Temp/ipykernel_24864/16292542.py", line 45, in __save_file
  File "C:\Users\cti110016\Anaconda3\lib\pathlib.py", line 1323, in mkdir
    self._accessor.mkdir(self, mode)
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'daily_stock_price\\半導體業\\6531愛普*'


Progress: [██████████████████  ] 91.10% 880/966

Exception in thread Thread-900:
Traceback (most recent call last):
  File "C:\Users\cti110016\Anaconda3\lib\threading.py", line 973, in _bootstrap_inner
    self.run()
  File "C:\Users\cti110016\Anaconda3\lib\threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\CTI110~1\AppData\Local\Temp/ipykernel_24864/16292542.py", line 55, in scrape
  File "C:\Users\CTI110~1\AppData\Local\Temp/ipykernel_24864/16292542.py", line 45, in __save_file
  File "C:\Users\cti110016\Anaconda3\lib\pathlib.py", line 1323, in mkdir
    self._accessor.mkdir(self, mode)
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'daily_stock_price\\電子通路業\\8070長華*'


Progress: [███████████████████ ] 99.79% 964/966Finished
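
The two `WinError 123` tracebacks above both involve a stock name that ends in `*`, a character Windows does not allow in file or directory names. The following sketch shows one possible workaround: replacing such characters before the path is built. The helper name `sanitize_path_component` and the `_` replacement are assumptions, not part of the original script.

```python
import re

# Characters that Windows rejects in file and directory names.
WINDOWS_INVALID_CHARS = r'[<>:"/\\|?*]'


def sanitize_path_component(name, replacement="_"):
    """Hypothetical helper: make one path component safe to create on Windows."""
    return re.sub(WINDOWS_INVALID_CHARS, replacement, name)


print(sanitize_path_component("6531愛普*"))
```

Applying this to `stock_no+stock_name` and `stock_industry` before formatting `save_path` should let the two failing stocks be saved. Note that the helper must be applied to each component separately, since it also replaces path separators.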
