# Scraping Stock Market Data with Requests
> Requests is an elegant and simple HTTP library for Python, built for human beings.

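1. Requests provides functions for working with `HTTP`. Compared with the older `urllib` module, it is more concise and easier to use.
2. Only basic web data scraping is covered here. For more advanced crawling techniques, make good use of the tutorials available online.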

### References:
* https://requests.readthedocs.io/en/master/#
* https://www.crummy.com/software/BeautifulSoup/bs4/doc/

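# Overview of Common File Formats

Storage type|File type|Description
:---|:---|:---
text|CSV|Comma-Separated Values. Stores a table using commas and line breaks; the drawback is poor flexibility
text|JSON|Stores data as objects. More flexible than CSV, but takes more space
binary|Python Pickle Format|Binary file produced by Python serialization. Very fast to read and write
binary|HDF5 Format|Hierarchical Data Format. Designed for large datasets and supports access to very large volumes of data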
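
To make the Pickle row of the table concrete, here is a minimal sketch of an in-memory round trip with the standard-library `pickle` module. The `record` dictionary is a made-up example, not data from this notebook.

```python
import pickle

# Hypothetical record, used only for illustration.
record = {"name": "Robert", "scores": (99.9, 60)}

payload = pickle.dumps(record)      # serialize to bytes
restored = pickle.loads(payload)    # deserialize back to a Python object

print(type(payload).__name__)       # bytes
print(restored == record)           # True
print(type(restored["scores"]).__name__)  # tuple: Python-specific types survive
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from sources you trust.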

## CSV
Stores a table using commas and line breaks.
```csv
Name,Department,Number,Grade
"Robert","CSIE",1,99.9
"Bob","CSIE",2,60
"Bobby","CSIE",3,59.9
```
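
As a small sketch of how such a file can be read, the table above can be parsed with the standard-library `csv` module. The text is held in a string here so the example runs without a file on disk.

```python
import csv
import io

csv_text = '''Name,Department,Number,Grade
"Robert","CSIE",1,99.9
"Bob","CSIE",2,60
"Bobby","CSIE",3,59.9
'''

# DictReader uses the first row as the field names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Every value is read as a string, so numeric columns need explicit conversion.
grades = [float(row["Grade"]) for row in rows]

print(rows[0]["Name"])   # Robert
print(grades)            # [99.9, 60.0, 59.9]
```

CSV carries no type information, which is one reason the format is considered inflexible.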

## JSON
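Data is stored as objects (everything enclosed in a pair of curly braces is an object), and each object has its own properties. Once loaded in Python, it is used just like a dictionary.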
```json
{
    "Robert": {
        "Department": "CSIE",
        "Number": 1,
        "Grade": 99.9
    },
    "Bob": {
        "Department": "CSIE",
        "Number": 2,
        "Grade": 60
    },
    "Bobby": {
        "Department": "CSIE",
        "Number": 3,
        "Grade": 59.9
    }
}
```
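
The dictionary-like behavior can be sketched with the standard-library `json` module, using a compact copy of the document above:

```python
import json

json_text = '''
{
    "Robert": {"Department": "CSIE", "Number": 1, "Grade": 99.9},
    "Bob": {"Department": "CSIE", "Number": 2, "Grade": 60},
    "Bobby": {"Department": "CSIE", "Number": 3, "Grade": 59.9}
}
'''

students = json.loads(json_text)    # JSON objects become Python dicts

print(type(students).__name__)      # dict
print(students["Bob"]["Grade"])     # 60, nested access like any dictionary

# Names of the students whose grade is at least 60
passed = [name for name, info in students.items() if info["Grade"] >= 60]
print(passed)                       # ['Robert', 'Bob']
```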

In [14]:
import requests
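date = "2023031"  # NOTE: TWSE dates are normally written as YYYYMMDD (8 digits); this 7-digit value looks truncated

# A string prefixed with f (an f-string) is Python's string formatting syntax, also known as a formatted string literal.
# It is more concise and faster than the older str.format and %-formatting approaches.
# An f-string marks replacement fields with curly braces {} and fills them in directly, for example with the chosen date.
url = f'https://www.twse.com.tw/exchangeReport/MI_INDEX?response=json&date={date}&type=ALLBUT0999'
# Send a GET request to this URL; the response body is JSON, which parses into a dict.
# requests has JSON parsing built in as the json() method of the Response class.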
response = requests.get(url)
response_json = response.json()
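
To compare the three formatting styles mentioned in the comments, the sketch below builds the same URL with each of them. The date value is a hypothetical placeholder chosen for illustration, and no request is sent.

```python
date = "20240102"  # hypothetical example date in YYYYMMDD form
base = "https://www.twse.com.tw/exchangeReport/MI_INDEX"

url_fstring = f"{base}?response=json&date={date}&type=ALLBUT0999"
url_format = "{}?response=json&date={}&type=ALLBUT0999".format(base, date)
url_percent = "%s?response=json&date=%s&type=ALLBUT0999" % (base, date)

# All three styles produce the same string; the f-string keeps the values inline.
print(url_fstring == url_format == url_percent)   # True
```

`requests.get` also accepts a `params` dictionary, which builds the query string for you instead of formatting it by hand.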

In [15]:
# The text attribute of the Response class shows the response body as text
print(response.text)

{"data4":[["寶島股價報酬指數","24,939.36","<p style ='color:green'>-<\u002fp>","413.95","-1.63",""],["發行量加權股價報酬指數","32,275.79","<p style ='color:green'>-<\u002fp>","508.19","-1.55",""],["臺灣公司治理100報酬指數","12,155.96","<p style ='color:green'>-<\u002fp>","176.32","-1.43",""],["臺灣50報酬指數","24,811.55","<p style ='color:green'>-<\u002fp>","381.07","-1.51",""],["臺灣50權重上限30%報酬指數","23,698.78","<p style ='color:green'>-<\u002fp>","347.72","-1.45",""],["臺灣中型100報酬指數","26,985.66","<p style ='color:green'>-<\u002fp>","439.82","-1.60",""],["臺灣資訊科技股報酬指數","40,587.58","<p style ='color:green'>-<\u002fp>","660.75","-1.60",""],["臺灣發達報酬指數","19,285.22","<p style ='color:green'>-<\u002fp>","254.65","-1.30",""],["臺灣高股息報酬指數","18,004.17","<p style ='color:green'>-<\u002fp>","199.94","-1.10",""],["臺灣就業99報酬指數","14,915.33","<p style ='color:green'>-<\u002fp>","245.08","-1.62",""],["臺灣高薪100報酬指數","12,236.27","<p style ='color:green'>-<\u002fp>","161.85","-1.31",""],["未含金融電子報酬指數","31,904.78","<p style ='color:green'>-<\u002fp>

In [16]:
# data9 holds the trading information for each stock
response_json['data9'][0:3]

[['0050',
  '元大台灣50',
  '15,018,299',
  '36,674',
  '1,778,354,150',
  '118.85',
  '118.95',
  '118.00',
  '118.30',
  '<p style= color:green>-</p>',
  '1.90',
  '118.25',
  '22',
  '118.30',
  '44',
  '0.00'],
 ['0051',
  '元大中型100',
  '58,321',
  '266',
  '3,285,800',
  '56.55',
  '56.55',
  '56.00',
  '56.10',
  '<p style= color:green>-</p>',
  '1.00',
  '56.00',
  '2',
  '56.45',
  '2',
  '0.00'],
 ['0052',
  '富邦科技',
  '547,515',
  '841',
  '57,448,418',
  '105.60',
  '105.60',
  '104.65',
  '104.90',
  '<p style= color:green>-</p>',
  '1.70',
  '104.75',
  '29',
  '104.90',
  '2',
  '0.00']]

In [17]:
# fields9 holds the corresponding column names
response_json['fields9']

['證券代號',
 '證券名稱',
 '成交股數',
 '成交筆數',
 '成交金額',
 '開盤價',
 '最高價',
 '最低價',
 '收盤價',
 '漲跌(+/-)',
 '漲跌價差',
 '最後揭示買價',
 '最後揭示買量',
 '最後揭示賣價',
 '最後揭示賣量',
 '本益比']
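
Since `fields9` and each row of `data9` line up position by position, one row can be turned into a dictionary with `zip`. The sketch below copies the first nine names and values from the outputs above so that it runs without a network request.

```python
fields = ['證券代號', '證券名稱', '成交股數', '成交筆數', '成交金額',
          '開盤價', '最高價', '最低價', '收盤價']
row = ['0050', '元大台灣50', '15,018,299', '36,674', '1,778,354,150',
       '118.85', '118.95', '118.00', '118.30']

# Pair each column name with the value in the same position.
record = dict(zip(fields, row))

# Values arrive as strings with thousands separators, so strip the commas before converting.
volume = int(record['成交股數'].replace(',', ''))
close = float(record['收盤價'])

print(record['證券名稱'], volume, close)   # 元大台灣50 15018299 118.3
```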

## Introduction to NumPy
> NumPy is the fundamental package for scientific computing with Python. It contains among other things:
> 1. a powerful N-dimensional array object
> 2. sophisticated (broadcasting) functions
> 3. tools for integrating C/C++ and Fortran codes
> 4. useful linear algebra, Fourier transform, and random number capabilities

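Before discussing `Pandas`, it helps to have a basic understanding of `NumPy`. `NumPy` is designed for operations on multidimensional arrays and is heavily optimized for performance: its core loops run in compiled code, so array operations can approach the speed of C. Many of Python's powerful data science packages are also built on top of `NumPy`, such as `Pandas` (covered later) and `Scikit-learn`. The following briefly covers some uses of the NumPy array.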
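
As a rough illustration of where the speed comes from, the sketch below computes the same sum twice: once with an explicit Python loop and once with a single NumPy call. The array size is an arbitrary choice for the example, and only the results are compared, not the timings.

```python
import numpy as np

values = list(range(100_000))
arr = np.array(values, dtype=np.int64)

# Pure Python: the interpreter handles one element at a time.
loop_total = 0
for v in values:
    loop_total += v

# NumPy: the loop runs inside compiled code.
vector_total = int(arr.sum())

print(loop_total == vector_total)   # True
```

In a notebook, prefixing each variant with `%timeit` gives a feel for the difference on your own machine.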

In [18]:
import numpy as np

In [19]:
np.__version__

'1.19.2'

# Data Types in Python

While a statically-typed language like C or Java requires each variable to be explicitly declared, a dynamically-typed language like Python skips this specification. For example, in C you might specify a particular operation as follows:

```C
/* C code */
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}
```

While in Python the equivalent operation could be written this way:

```python
# Python code
result = 0
for i in range(100):
    result += i
```

## Python allows...

```python
# Python code
x = 4
x = "four"
```
## But it does not allow
```python
# Python code
i = 1
i++
  ^
SyntaxError: invalid syntax
```
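
A short runnable sketch of both points: a name can be rebound to a value of another type, and the increment is written with `+=` because Python has no `++` operator.

```python
x = 4
print(type(x).__name__)   # int
x = "four"                # the same name now refers to a str
print(type(x).__name__)   # str

i = 1
i += 1                    # use += instead of i++
print(i)                  # 2
```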


## Creating an Array Manually
`np.array` accepts inputs such as tuple, list, range, and so on.

In [24]:
L1 = [True, "2", 3.0, 4]
print(type(L1))
[type(item) for item in L1]

<class 'list'>


[bool, str, float, int]

In [26]:
# Create an array from a list
array_input = [0, 2, 2, 7, 7, 1, 2, 1, 7, 1]  # list elements may have different types; printed with commas
print(array_input)
print(type(array_input))
np.array(array_input)     # array elements share one type; printed without commas
print(np.array(array_input))
print(type(np.array(array_input)))

[0, 2, 2, 7, 7, 1, 2, 1, 7, 1]
<class 'list'>
[0 2 2 7 7 1 2 1 7 1]
<class 'numpy.ndarray'>
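
Because every element of an array shares one type, NumPy converts mixed inputs to a common dtype. A minimal sketch of that behavior, with made-up input lists:

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])       # int and float together become float64
mixed_with_text = np.array([1, "2", 3.0])   # with a string present, every element becomes a string

print(ints.dtype.kind)              # 'i' (integer; the exact width depends on the platform)
print(mixed_numbers.dtype)          # float64
print(mixed_with_text.dtype.kind)   # 'U' (unicode string)
```

This is the main practical difference from a list such as `L1` above, which keeps each element's original type.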


In [27]:
# Create an array from a range
array_input = range(100)
print(type(array_input))
print(array_input)
a=np.array(array_input)
print(a)

<class 'range'>
range(0, 100)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]


In [27]:
# Create a two-dimensional array (nested lists)
array_input = [range(3) for i in range(3)]
print(np.array(array_input))

[[0 1 2]
 [0 1 2]
 [0 1 2]]


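## Other Ways to Create an Array

### np.arange(a, b, c) produces an array from a up to but not including b with step c; the default integer dtype depends on the platform (int32 in the environment used here). np.linspace(a, b, c), by contrast, returns c evenly spaced values from a to b, including b.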

In [92]:
# Similar to range()
np.arange(0, 20)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [34]:
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [37]:
np.linspace(0, 20, 21)

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       13., 14., 15., 16., 17., 18., 19., 20.])

In [38]:
np.linspace(0, 20, 21, dtype=int) # the dtype can be specified at creation time

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20])

In [117]:
np.linspace(0, 1,5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])
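
The difference between the two functions comes down to whether the count or the step is fixed, and whether the end point is included. A small sketch using the `endpoint` parameter of `np.linspace`:

```python
import numpy as np

with_end = np.linspace(0, 1, 5)                      # 5 points, spacing 1/4, includes 1
without_end = np.linspace(0, 1, 5, endpoint=False)   # 5 points, spacing 1/5, excludes 1
stepped = np.arange(0, 1, 0.2)                       # fixed step 0.2, excludes 1

print(with_end)
print(without_end)
print(np.allclose(without_end, stepped))             # True
```

With a floating-point step, `np.linspace` is usually the safer choice, since the number of elements is stated explicitly instead of being derived from a rounded division.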

In [39]:
# Create an array of the given shape, filled with 0
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [40]:
np.zeros((5,2))

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [41]:
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [28]:
np.full((2,3,4), fill_value=2) # returns a new array of the given shape and type, filled with fill_value

array([[[2, 2, 2, 2],
        [2, 2, 2, 2],
        [2, 2, 2, 2]],

       [[2, 2, 2, 2],
        [2, 2, 2, 2],
        [2, 2, 2, 2]]])

In [29]:
# Generate uniformly distributed random values in the half-open interval [0, 1)
np.random.random(10)

array([0.92883646, 0.09276367, 0.04040192, 0.76605625, 0.50862489,
       0.67455132, 0.88919732, 0.69528826, 0.0114339 , 0.49891244])

In [31]:
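# Draw uniformly distributed floats from the half-open interval [a, b), equivalent to a + (b-a) * random.random()
# Note: the output shown below lies in [1, 2), so it appears to come from a run with bounds other than (2, 5)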
np.random.uniform(2,5,20)

array([1.51534075, 1.73794494, 1.71125201, 1.23234806, 1.94204938,
       1.52843344, 1.66341698, 1.43786834, 1.19340466, 1.99151375,
       1.55938534, 1.71723972, 1.71180004, 1.18220878, 1.03152026,
       1.20110759, 1.41553435, 1.17737938, 1.45335444, 1.82400866])
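
The scaling formula in the comment can be checked directly. The sketch below uses `np.random.default_rng`, NumPy's newer generator interface, with a fixed seed so the run is reproducible; this is a choice made for the example, since the notebook itself calls the legacy `np.random` functions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded so the run is reproducible
a, b = 2, 5

direct = rng.uniform(a, b, 20)
scaled = a + (b - a) * rng.random(20)  # same distribution, built from [0, 1) samples

# Both sets of samples stay inside the requested bounds.
print(direct.min() >= a, direct.max() < b)
print(scaled.min() >= a, scaled.max() < b)
```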

In [50]:
# Generate a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

array([[ 1.07692923, -0.01499588, -0.03898253],
       [ 0.92812342,  1.31294581,  0.26788041],
       [-0.21516662, -0.44305266, -1.87544569]])

In [32]:
# Create an identity matrix
np.eye(100)

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

## Inspecting Array Attributes

In [34]:
array = np.random.randint(10, size=(3, 4, 5))
array

array([[[0, 6, 1, 1, 4],
        [4, 2, 1, 2, 9],
        [9, 5, 2, 2, 7],
        [2, 0, 2, 9, 5]],

       [[6, 6, 4, 0, 0],
        [6, 4, 2, 4, 1],
        [3, 8, 1, 9, 7],
        [8, 3, 6, 0, 4]],

       [[7, 0, 4, 1, 5],
        [1, 7, 0, 9, 5],
        [0, 2, 8, 0, 0],
        [9, 4, 7, 8, 9]]])

In [57]:
# Number of dimensions
array.ndim

3

In [142]:
array.shape

(3, 4, 5)

In [58]:
array.size

60

In [144]:
array.dtype

dtype('int32')
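
The integer dtype reported above depends on the platform. Passing `dtype` explicitly makes it predictable, and `itemsize` and `nbytes` then show how much memory the array uses. A minimal sketch with an array of the same shape:

```python
import numpy as np

arr = np.zeros((3, 4, 5), dtype=np.int64)   # dtype given explicitly, so it is the same everywhere

print(arr.dtype)      # int64
print(arr.itemsize)   # 8 bytes per element
print(arr.nbytes)     # 480 = 60 elements * 8 bytes
```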

## Array Slicing
``` python
res = array[start:end:step]
res = array[y_start:y_end:y_step, x_start:x_end:x_step]
```

In [52]:
a = np.array(range(10))
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [53]:
a[3:9]

array([3, 4, 5, 6, 7, 8])

In [37]:
array = np.random.randint(10, size=(3, 4, 5))
array

array([[[2, 4, 4, 1, 9],
        [7, 4, 0, 8, 4],
        [4, 4, 3, 2, 3],
        [9, 7, 6, 0, 5]],

       [[6, 6, 6, 8, 2],
        [9, 1, 1, 2, 5],
        [0, 4, 6, 6, 6],
        [6, 4, 5, 2, 3]],

       [[0, 0, 2, 4, 1],
        [0, 7, 7, 4, 2],
        [0, 4, 1, 5, 7],
        [7, 8, 9, 5, 4]]])

In [38]:
array[::2]

array([[[2, 4, 4, 1, 9],
        [7, 4, 0, 8, 4],
        [4, 4, 3, 2, 3],
        [9, 7, 6, 0, 5]],

       [[0, 0, 2, 4, 1],
        [0, 7, 7, 4, 2],
        [0, 4, 1, 5, 7],
        [7, 8, 9, 5, 4]]])

In [39]:
array[1::2]

array([[[6, 6, 6, 8, 2],
        [9, 1, 1, 2, 5],
        [0, 4, 6, 6, 6],
        [6, 4, 5, 2, 3]]])

In [40]:
array[::-1]

array([[[0, 0, 2, 4, 1],
        [0, 7, 7, 4, 2],
        [0, 4, 1, 5, 7],
        [7, 8, 9, 5, 4]],

       [[6, 6, 6, 8, 2],
        [9, 1, 1, 2, 5],
        [0, 4, 6, 6, 6],
        [6, 4, 5, 2, 3]],

       [[2, 4, 4, 1, 9],
        [7, 4, 0, 8, 4],
        [4, 4, 3, 2, 3],
        [9, 7, 6, 0, 5]]])

In [41]:
array = np.array(range(9)).reshape(3, 3)
array

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [42]:
array[1:3, 1:3]

array([[4, 5],
       [7, 8]])
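
One detail worth knowing about the slices above: basic slicing returns a view onto the original data, not a copy. The sketch below shows the effect and how `.copy()` avoids it, then combines a step and a reversal on two axes.

```python
import numpy as np

a = np.arange(10)
view = a[3:9]             # a view onto the same data
view[0] = 100
print(a[3])               # 100: the original array changed too

b = np.arange(10)
copied = b[3:9].copy()    # .copy() makes an independent array
copied[0] = 100
print(b[3])               # 3: the original is untouched

grid = np.arange(9).reshape(3, 3)
print(grid[::2, ::-1])    # rows 0 and 2, columns reversed
```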

## Array arithmetic

In [43]:
x = np.arange(4)
print("x     =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2)  # floor division
print("-x     = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2  = ", x % 2)

x     = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
x / 2 = [0.  0.5 1.  1.5]
x // 2 = [0 0 1 1]
-x     =  [ 0 -1 -2 -3]
x ** 2 =  [0 1 4 9]
x % 2  =  [0 1 0 1]
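
Each operator above is shorthand for a NumPy universal function (ufunc), and the same element-wise rules extend to arrays of different shapes through broadcasting. A brief sketch:

```python
import numpy as np

x = np.arange(4)

# Operators and their ufunc equivalents give identical results.
print(np.array_equal(x + 5, np.add(x, 5)))             # True
print(np.array_equal(x // 2, np.floor_divide(x, 2)))   # True
print(np.array_equal(x ** 2, np.power(x, 2)))          # True

# Broadcasting: a (3, 1) column combined with a (4,) row gives a (3, 4) table.
column = np.arange(3).reshape(3, 1)
table = column * 10 + x
print(table)
```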


In [44]:
theta = np.linspace(0, np.pi, 3)
print("theta      = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))

theta      =  [0.         1.57079633 3.14159265]
sin(theta) =  [0.0000000e+00 1.0000000e+00 1.2246468e-16]
cos(theta) =  [ 1.000000e+00  6.123234e-17 -1.000000e+00]
tan(theta) =  [ 0.00000000e+00  1.63312394e+16 -1.22464680e-16]
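
Values such as `1.2246468e-16` in the output above are floating-point rounding error, not exact zeros, which is also why `tan` of the middle angle comes out as a huge number instead of being undefined. When comparing such results, a tolerance-based check is more reliable than `==`, as this small sketch shows:

```python
import numpy as np

sin_pi = np.sin(np.pi)

print(sin_pi == 0.0)             # False: the result is about 1.2e-16
print(np.isclose(sin_pi, 0.0))   # True: equal within a small tolerance
```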


In [45]:
x = [-1, 0, 1]
print("x         = ", x)
print("arcsin(x) = ", np.arcsin(x))
print("arccos(x) = ", np.arccos(x))
print("arctan(x) = ", np.arctan(x))

x         =  [-1, 0, 1]
arcsin(x) =  [-1.57079633  0.          1.57079633]
arccos(x) =  [3.14159265 1.57079633 0.        ]
arctan(x) =  [-0.78539816  0.          0.78539816]


In [46]:
x = [1, 2, 4, 10]
print("x        =", x)
print("ln(x)    =", np.log(x))
print("log2(x)  =", np.log2(x))
print("log10(x) =", np.log10(x))

x        = [1, 2, 4, 10]
ln(x)    = [0.         0.69314718 1.38629436 2.30258509]
log2(x)  = [0.         1.         2.         3.32192809]
log10(x) = [0.         0.30103    0.60205999 1.        ]


## Aggregations: Minimum and Maximum

Python has built-in ``min`` and ``max`` functions, used to find the minimum value and maximum value of any given array:

In [48]:
big_array = np.random.rand(1000)

In [49]:
min(big_array), max(big_array)

(0.0008984801022676736, 0.9996069141002208)

In [50]:
big_array.max()

0.9996069141002208

In [51]:
big_array.min()

0.0008984801022676736

In [52]:
big_array.mean()

0.4997258464990301

In [53]:
big_array.std()

0.29198729782690824
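
The array methods are generally preferable to the built-in functions: they run in compiled code, and they accept an `axis` argument for multidimensional arrays. A short sketch with a small made-up matrix:

```python
import numpy as np

M = np.arange(12).reshape(3, 4)

print(M.sum())         # 66, over all elements
print(M.sum(axis=0))   # [12 15 18 21], one sum per column
print(M.min(axis=1))   # [0 4 8], one minimum per row

# On a flattened array the built-in and NumPy versions agree.
print(np.max(M) == max(M.ravel()))   # True
```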

## Storing the Scraped Data in Pandas Containers

### Introduction to Pandas
> An open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

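1. Provides two main data containers, Series and DataFrame, for handling one-dimensional and two-dimensional data respectively
2. Provides a large number of methods, including vectorized operations, missing-value handling, table merging, and more
3. Pandas is built on NumPy, so it processes data very quickly and is well suited to large datasets
4. Supports IO for many file formats, covering most of the commonly used ones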
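
As a minimal sketch of the two containers, the example below reuses the student records from the CSV and JSON examples earlier in this notebook:

```python
import pandas as pd

names = ["Robert", "Bob", "Bobby"]

# Series: one-dimensional values with an index of labels.
grades = pd.Series([99.9, 60, 59.9], index=names, name="Grade")

# DataFrame: a two-dimensional table whose columns are Series.
students = pd.DataFrame({
    "Department": ["CSIE", "CSIE", "CSIE"],
    "Number": [1, 2, 3],
    "Grade": [99.9, 60, 59.9],
}, index=names)

print(grades["Bob"])                       # label-based access
print(students.shape)                      # (3, 3)
print(type(students["Grade"]).__name__)    # Series
```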

The official Pandas documentation provides very thorough usage instructions; see:
https://pandas.pydata.org/pandas-docs/stable/index.html  

In [56]:
import pandas as pd

In [57]:
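# Use the DataFrame constructor to convert the data directly into a DataFrame.
# pd.DataFrame() can parse data of type ndarray, Iterable, dict, or DataFrame,
# and the columns parameter specifies the column names of the table.
stock = pd.DataFrame(response_json['data9'], columns=response_json['fields9'])
# Show the first few rows to get a feel for what the data looks like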
stock.head(10)

,證券代號,證券名稱,成交股數,成交筆數,成交金額,開盤價,最高價,最低價,收盤價,漲跌(+/-),漲跌價差,最後揭示買價,最後揭示買量,最後揭示賣價,最後揭示賣量,本益比
0,50,元大台灣50,15018299,36674,1778354150,118.85,118.95,118.0,118.3,<p style= color:green>-</p>,1.9,118.25,22,118.3,44,0.0
1,51,元大中型100,58321,266,3285800,56.55,56.55,56.0,56.1,<p style= color:green>-</p>,1.0,56.0,2,56.45,2,0.0
2,52,富邦科技,547515,841,57448418,105.6,105.6,104.65,104.9,<p style= color:green>-</p>,1.7,104.75,29,104.9,2,0.0
3,53,元大電子,4097,1002,239541,58.6,58.65,58.6,58.65,<p style= color:green>-</p>,1.35,58.65,1,58.95,25,0.0
4,55,元大MSCI金融,241454,522,5368178,22.45,22.45,22.15,22.19,<p style= color:green>-</p>,0.29,22.19,1,22.2,5,0.0
5,56,元大高股息,41886499,34293,1169191727,28.0,28.01,27.83,27.86,<p style= color:green>-</p>,0.38,27.86,238,27.87,83,0.0
6,57,富邦摩台,3089,1003,265168,86.25,86.25,85.95,85.95,<p style= color:green>-</p>,1.8,85.85,2,86.1,20,0.0
7,61,元大寶滬深,184620,879,3507589,19.16,19.16,18.91,18.95,<p style= color:green>-</p>,0.26,18.95,12,19.0,3,0.0
8,6203,元大MSCI台灣,6921,920,395268,57.0,57.25,57.0,57.25,<p style= color:green>-</p>,1.25,57.0,2,57.6,21,0.0
9,6204,永豐臺灣加權,23570,1010,1830686,77.7,77.7,77.7,77.7,<p style= color:green>-</p>,1.5,77.65,10,77.9,12,0.0


In [58]:
# Check the size of the table
stock.shape

(1181, 16)

In [59]:
# DataFrame.info() shows more detail, including column names, data types, and memory usage.
stock.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1181 entries, 0 to 1180
Data columns (total 16 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   證券代號     1181 non-null   object
 1   證券名稱     1181 non-null   object
 2   成交股數     1181 non-null   object
 3   成交筆數     1181 non-null   object
 4   成交金額     1181 non-null   object
 5   開盤價      1181 non-null   object
 6   最高價      1181 non-null   object
 7   最低價      1181 non-null   object
 8   收盤價      1181 non-null   object
 9   漲跌(+/-)  1181 non-null   object
 10  漲跌價差     1181 non-null   object
 11  最後揭示買價   1181 non-null   object
 12  最後揭示買量   1181 non-null   object
 13  最後揭示賣價   1181 non-null   object
 14  最後揭示賣量   1181 non-null   object
 15  本益比      1181 non-null   object
dtypes: object(16)
memory usage: 147.8+ KB
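
Every column above has dtype `object` because the API returns all values as strings. One possible way to convert them, sketched on a three-row sample copied from the `data9` output earlier, is to strip the thousands separators and then call `pd.to_numeric`. The choice of columns and of `errors='coerce'` is an assumption made for this example.

```python
import pandas as pd

sample = pd.DataFrame(
    [['0050', '元大台灣50', '15,018,299', '118.30'],
     ['0051', '元大中型100', '58,321', '56.10'],
     ['0052', '富邦科技', '547,515', '104.90']],
    columns=['證券代號', '證券名稱', '成交股數', '收盤價'],
)

for col in ['成交股數', '收盤價']:
    cleaned = sample[col].str.replace(',', '', regex=False)   # drop thousands separators
    sample[col] = pd.to_numeric(cleaned, errors='coerce')     # unparseable values become NaN

print(sample.dtypes)
print(sample['成交股數'].sum())   # 15624135
```

The stock code column is left as text on purpose, so codes such as `0050` keep their leading zeros.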


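## Pandas Input and Output
Pandas supports common formats such as CSV and JSON, binary formats such as Pickle and HDF5, and can also read from and write to databases directly. For the full list of supported formats, see: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Since usage is much the same across formats, CSV is used here to demonstrate saving and loading files.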

In [60]:
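# Save the DataFrame built earlier to disk (the two Pandas containers are explained in detail later).
# index=False means the automatically assigned DataFrame index (the leftmost column of numbers) is not written out.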
stock.to_csv('stock.csv', index=False)

In [61]:
# Reading is just as simple: use pd.read_csv(filepath)
pd.read_csv('stock.csv').head()

,證券代號,證券名稱,成交股數,成交筆數,成交金額,開盤價,最高價,最低價,收盤價,漲跌(+/-),漲跌價差,最後揭示買價,最後揭示買量,最後揭示賣價,最後揭示賣量,本益比
0,50,元大台灣50,15018299,36674,1778354150,118.85,118.95,118.0,118.3,<p style= color:green>-</p>,1.9,118.25,22,118.3,44,0.0
1,51,元大中型100,58321,266,3285800,56.55,56.55,56.0,56.1,<p style= color:green>-</p>,1.0,56.0,2,56.45,2,0.0
2,52,富邦科技,547515,841,57448418,105.6,105.6,104.65,104.9,<p style= color:green>-</p>,1.7,104.75,29,104.9,2,0.0
3,53,元大電子,4097,1002,239541,58.6,58.65,58.6,58.65,<p style= color:green>-</p>,1.35,58.65,1,58.95,25,0.0
4,55,元大MSCI金融,241454,522,5368178,22.45,22.45,22.15,22.19,<p style= color:green>-</p>,0.29,22.19,1,22.2,5,0.0
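
Notice that the stock code `0050` comes back as `50`: `pd.read_csv` infers the column as integers and the leading zeros are lost. A minimal sketch of one way to keep them, using the `dtype` parameter and an in-memory buffer in place of `stock.csv`:

```python
import io
import pandas as pd

sample = pd.DataFrame({'證券代號': ['0050', '0051'], '收盤價': [118.30, 56.10]})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)

buffer.seek(0)
inferred = pd.read_csv(buffer)                           # code column inferred as integers
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'證券代號': str})    # code column kept as text

print(inferred['證券代號'].tolist())   # [50, 51]
print(as_text['證券代號'].tolist())    # ['0050', '0051']
```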
