# Web Scraping
- [Slides](https://docs.google.com/presentation/d/1gn4d5gzXzgyEAIz_AOdM3pmJ0-6iiN7EbTjKqCQlgPo/edit?usp=sharing) on getting JSON data online
- Finding the data URL
- Scraping 104.com job info
- Understanding how to send a request and get back a response
- Sending requests with Cookie, payload, Referer, ...

# The requests library for sending web requests
* http://docs.python-requests.org/en/master/ 
* Quickstart http://docs.python-requests.org/en/master/user/quickstart/

In [None]:
import requests
import json
response = requests.get('https://tcgbusfs.blob.core.windows.net/blobyoubike/YouBikeTP.gz')
print(response.json())

## (Practice) Find and traverse the data behind URLs



In [77]:
url_AQX = "https://taqm.epa.gov.tw/taqm/aqs.ashx?lang=tw&act=aqi-epa&ts=1538961147679"
url_dcard = "https://www.dcard.tw/_api/forums/relationship/posts?popular=true"
url_pchome = "http://ecshweb.pchome.com.tw/search/v3.3/all/results?q=X100F&page=1&sort=rnk/dc"
url_cnyes = "https://news.cnyes.com/api/v3/news/category/headline?startAt=1588262400&endAt=1589212799&limit=30"

# Need Referer = https://www.104.com.tw/
url_104 = "https://www.104.com.tw/jobs/search/list?ro=0&kwop=7&keyword=data%20scientist&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&order=14&asc=0&page=2&mode=s&jobsource=2018indexpoc"

# POST
# url_sinyi = 'https://pixel-api.scupio.com/v0/event?cb=0.3497148401321104'

- (Hint) Use `requests.get(url, timeout=(x, y))` to limit how long a request may wait
- `timeout=(x, y)`: wait at most x seconds to connect to the server and at most y seconds for the server to send a response
- https://data.moi.gov.tw/moiod/System/Principle.aspx?Sample=2

In [79]:
res = requests.get(url_pchome, timeout=(3, 5)).json()
print(type(res))
# print(res.keys())

<class 'dict'>


## (Optional) Write a function to load JSON data

In [68]:
def get_web_json(url):
    response = requests.get(url, timeout=(3, 5))
    print("Response Code:", response.status_code)
    if not response.ok:
        return None
    data = response.json()
    return data


## (Optional) Check the data type (list or dict)

In [74]:
import pandas as pd

data = get_web_json(url_dcard)
if isinstance(data, dict):
    print("A dict with", data.keys())
elif isinstance(data, list):
    # print("A list with the first data entry\n", data[0])
    print(pd.DataFrame(data).head())

Response Code: 200
          id           title  \
0  235615372  大學時期的情侶會走到最後嗎？   
1  235619462     比基尼踩到男友媽媽的雷   
2  235614768            幫男友弄   
3  235615817   #更 長得漂亮講話卻很靠北   
4  235614624          女友都不讀書   

                                             excerpt  anonymousSchool  \
0  大學時期的情侶到底會不會走到紅毯呢？最近看到很多朋友快畢業就一堆情侶分手，難道畢業就一定分手...             True   
1  和男友在一起一年多，中間也去過不少玩水的地方，因為我本身蠻愛拍照的所以泳衣都穿比基尼居多，上...             True   
2  男友有時候要我幫他弄，但是我第一次交男朋友，也是第一次幫別人用，所以我也不會...，他說他慢...             True   
3  呃 我先聲明一下 這男的真的不是我，是這個女生跟一個男同事講，然後那個同事再跟我說，但我們兩...             True   
4  女友是科大的 我是國立大學，一樣的科系，說好一起考證照國考，我兩個都上了，他都沒上，我罵他不...             True   

   anonymousDepartment  pinned                               forumId replyId  \
0                 True   False  42851318-b9e2-4a75-8a05-9fe180becefe    None   
1                 True   False  42851318-b9e2-4a75-8a05-9fe180becefe    None   
2                 True   False  42851318-b9e2-4a75-8a05-9fe180becefe    None   
3                 True   Fals

### (Practice) Adding type-checking code to the function

In [27]:
# Practice: Adding the type-checking code to the function
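
One possible way to approach this practice, as a sketch only: the helper name `summarize_json` and its return strings are illustrative assumptions, not part of the original notebook. It takes already-parsed JSON (for example the return value of `get_web_json`) and describes it according to its type.

```python
def summarize_json(data):
    """Return a short description of parsed JSON data (hypothetical helper)."""
    if data is None:
        # get_web_json returns None when the response is not OK
        return "No data (request failed)"
    if isinstance(data, dict):
        return "A dict with keys: " + ", ".join(str(k) for k in data.keys())
    if isinstance(data, list):
        first = data[0] if data else None
        return "A list of {} entries; first entry: {!r}".format(len(data), first)
    return "Unexpected type: " + type(data).__name__


# Small hand-made inputs standing in for real responses
print(summarize_json({"status": 200, "data": []}))
print(summarize_json([{"id": 1}, {"id": 2}]))
print(summarize_json(None))
```

The same checks could instead be placed inside `get_web_json` right before it returns `data`.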




### (Deprecated) Problematic URL? See the requests docs
* Try to fetch the URL https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1
* You will get a 404 `status_code`
```
url_591 = "https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1&section=12&firstRow=30&totalRows=495"
response = requests.get(url_591)
print(response.ok)
print(response.status_code)
```
* Solution: send a `User-Agent` header with the request; see https://stackoverflow.com/questions/10606133/sending-user-agent-using-requests-library-in-python

# Scraping 104.com
* Slide: https://docs.google.com/presentation/d/e/2PACX-1vRW84XoB5sFRT1Eg-GrK4smX23qoNkFffz_h8oRU4AIvJAgrrxBn8059_0UeHv_pFBks_Z37vNbLGai/pub?start=false&loop=false&delayms=3000
```
https://www.104.com.tw/jobs/search/list?ro=0&kwop=7&keyword=data%20scientist&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&order=14&asc=0&page=2&mode=s&jobsource=2018indexpoc
```

## 1. Get the first page (this attempt fails)

In [5]:
import requests
import json
url_104 = 'https://www.104.com.tw/jobs/search/list?ro=0&kwop=7&keyword=data%20scientist&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&order=14&asc=0&page=2&mode=s&jobsource=2018indexpoc'
response = requests.get(url_104)

print(response.status_code)
print(response.text)
print(response.headers)

200
<html xmlns="http://www.w3.org/1999/xhtml"><head><META HTTP-EQUIV="CONTENT-TYPE" CONTENT="TEXT/HTML; CHARSET=utf-8"/><title></title></head>
<body>
<SCRIPT LANGUAGE="JavaScript">
window.location="https://www.104.com.tw/jobs/main/syserr?eid=13584908623485849121";
</script>
</body>
</html>
{'Cache-Control': 'no-cache', 'Connection': 'close', 'Content-Type': 'text/html; charset=utf-8', 'Pragma': 'no-cache', 'Content-Length': '293'}


### (Optional) Write the HTML to a file

In [93]:
with open('temp_output.html', 'w', encoding='utf-8') as fout:
    fout.write(response.text)
# No explicit close() is needed: the with block closes the file

# Opening the file with webbrowser does not work, but why?
import webbrowser
webbrowser.open_new_tab('temp_output.html')

True

### (Practice) Observing the YouBike data headers
Are they different from 104.com's?

In [158]:
import requests
import json
response = requests.get('https://tcgbusfs.blob.core.windows.net/blobyoubike/YouBikeTP.gz')
print(response)
print(response.status_code)
print(type(response)) # <class 'requests.models.Response'>
print(type(response.text)) # <class 'str'>

<Response [200]>
200
<class 'requests.models.Response'>
<class 'str'>


In [159]:
import pandas as pd
print(response.headers)
print(pd.DataFrame.from_dict(response.headers, orient='index'))

{'Content-Length': '32883', 'Content-Type': 'application/octet-stream', 'Content-Encoding': 'gzip', 'Content-MD5': '9hEmcORs/yynWSuu8p2Shg==', 'Last-Modified': 'Thu, 25 Mar 2021 03:57:01 GMT', 'ETag': '0x8D8EF420B809958', 'Server': 'Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0', 'x-ms-request-id': 'e41b4929-401e-011a-482a-2193e9000000', 'x-ms-version': '2009-09-19', 'x-ms-lease-status': 'unlocked', 'x-ms-blob-type': 'BlockBlob', 'Access-Control-Allow-Origin': '*', 'Date': 'Thu, 25 Mar 2021 03:57:27 GMT'}
                                                                        0
Content-Length                                                      32883
Content-Type                                     application/octet-stream
Content-Encoding                                                     gzip
Content-MD5                                      9hEmcORs/yynWSuu8p2Shg==
Last-Modified                               Thu, 25 Mar 2021 03:57:01 GMT
ETag                                          

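## 2. Add a Referer header to get the 104 data
* User-Agent: which browser or operating system you are using
* Referer: which page you clicked through or were redirected from
* Cookies: information the server gave you after the connection was established, so that you can stay on the page.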
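
As a rough sketch of how such a header can be attached, the helper below passes a `Referer` header through the `headers` argument of `requests.get`. The function name `fetch_json` and its `get` parameter are assumptions made for illustration (the `get` hook only exists so the helper can be exercised with a stub instead of a live request).

```python
def fetch_json(url, referer, get=None, timeout=(3, 5)):
    """Fetch `url` with a Referer header; return parsed JSON or None.

    `get` is a hypothetical hook for testing; by default it is requests.get.
    """
    if get is None:
        import requests  # imported lazily so a stub can replace it in tests
        get = requests.get
    headers = {"Referer": referer}
    response = get(url, headers=headers, timeout=timeout)
    if not response.ok:
        return None
    try:
        return response.json()
    except ValueError:
        # The body was not JSON, e.g. the HTML error page shown in step 1
        return None
```

With the URL from this notebook, a call might look like `fetch_json(url_104, referer='https://www.104.com.tw/')`. A `User-Agent` or `Cookie` entry could be added to the same `headers` dict if a site requires it.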

In [112]:
url_104 = 'https://www.104.com.tw/jobs/search/list?ro=0&kwop=7&keyword=data%20scientist&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&order=14&asc=0&page=2&mode=s&jobsource=2018indexpoc'




<class 'dict'>


## 3. Traverse data to get the data block

dict_keys(['status', 'action', 'data', 'statusMsg', 'errorMsg'])
<class 'dict'>
dict_keys(['query', 'filterDesc', 'queryDesc', 'list', 'count', 'pageNo', 'totalPage', 'totalCount'])
<class 'list'>
<class 'dict'>
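
The cell that produced the output above is not shown. One traversal consistent with it might look like the sketch below, which uses a hand-made `raw` dict: only the key names follow the printed `dict_keys`, and every value is invented for illustration.

```python
# Hand-made stand-in for the parsed response; the values are invented and
# only the key names follow the dict_keys output in the notebook.
raw = {
    "status": 200,
    "action": "",
    "data": {
        "list": [
            {"jobNo": 1, "jobName": "Data Scientist"},
            {"jobNo": 2, "jobName": "Data Analyst"},
        ],
        "pageNo": 1,
        "totalPage": 3,
        "totalCount": 50,
    },
    "statusMsg": "",
    "errorMsg": "",
}

print(raw.keys())                     # top level: a dict
print(type(raw["data"]))              # second level: a dict
print(raw["data"].keys())
print(type(raw["data"]["list"]))      # the data block: a list ...
print(type(raw["data"]["list"][0]))   # ... of dicts, one per job

# Once the data block is found, each job is an ordinary dict
jobs = raw["data"]["list"]
job_names = [job["jobName"] for job in jobs]
print(job_names)
```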


Unnamed: 0,jobType,jobNo,jobName,jobNameSnippet,jobRole,jobRo,jobAddrNo,jobAddrNoDesc,jobAddress,description,...,isSave,descSnippet,tags,landmark,link,jobsource,jobNameRaw,custNameRaw,lon,lat
0,0,12025874,Data Scientist,<em class='b-txt--highlight'>Data</em> <em cla...,1,1,6001001007,台北市信義區,信義路四段415號5F之4,尋找有應用程式大數據分析開發經驗的夥伴加入，開發腦科學研究資料分析軟體。有軟體架構經驗，能分...,...,0,尋找有應用程式大數據分析開發經驗的夥伴加入，開發腦科學研究資料分析軟體。有軟體架構經驗，能分...,[],距捷運台北101/世貿站340公尺,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,jolist_a_relevance,Data Scientist,康泰智能科技股份有限公司,121.559395,25.033408
1,0,11496984,派遣至LINE_Insight planner (data scientist、資料科學家、...,派遣至LINE_Insight planner (<em class='b-txt--hig...,1,1,6001001010,台北市內湖區,,[Team/Position Description]\nInsight planner u...,...,0,[Team/Position Description]\nInsight planner u...,[],,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,jolist_a_relevance,派遣至LINE_Insight planner (data scientist、資料科學家、...,萬寶華企業管理顧問股份有限公司,121.5909027,25.0689422
2,0,12136731,【A】Data Scientist,【A】<em class='b-txt--highlight'>Data</em> <em ...,1,1,6001002016,新北市土城區,民生街4號,•Build production ready deep learning models\n...,...,0,•Build production ready deep learning models\n...,[],,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,jolist_a_relevance,【A】Data Scientist,富智康國際股份有限公司(鴻海集團),121.4143447,24.9613226
3,0,10589685,Data Scientist 資料科學家_數據分析師,<em class='b-txt--highlight'>Data</em> <em cla...,1,1,6001002020,新北市三重區,三和路四段111之32號7樓,分群、推薦、迴歸預測等演算法理論基礎，並有實作經驗\r\n2. 熟悉程式與分析應用，如 Py...,...,0,分群、推薦、迴歸預測等演算法理論基礎，並有實作經驗\r\n2. 熟悉程式與分析應用，如 Py...,[員工300人],距捷運三和國中站110公尺,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,jolist_a_relevance,Data Scientist 資料科學家_數據分析師,伊雲谷數位科技股份有限公司,121.4856118,25.0774577
4,2,8229285,[KKLab] Data Scientist,[KKLab] <em class='b-txt--highlight'>Data</em>...,1,1,6001001011,台北市南港區,三重路19-3號一樓,[[[data]]] management solutions.\n● We innova...,...,0,<em class='b-txt--highlight'>data</em> manage...,[員工600人],距捷運南港展覽館站290公尺,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,jolist_a_relevance,[KKLab] Data Scientist,願境網訊股份有限公司_KKBOX,121.6132482,25.0565476
5,0,12049575,資料科學家 Data Scientist (1-10),<em class='b-txt--highlight'>資料科學家</em> <em cl...,1,1,6001001004,台北市松山區,南京東路五段89號9樓,《工作內容》\n於專案中進行分析建模開發規劃及設計工作，與系統分析師溝通確認如何實作需求，並...,...,0,《工作內容》\n於專案中進行分析建模開發規劃及設計工作，與系統分析師溝通確認如何實作需求，並...,[員工150人],距捷運南京三民站380公尺,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,jolist_a_relevance,資料科學家 Data Scientist (1-10),偉康科技股份有限公司,121.5601021,25.0518161
6,2,11864977,Data Scientist,<em class='b-txt--highlight'>Data</em> <em cla...,1,1,6001001005,台北市大安區,,"Engineering, or [[[Data]]] Science\n· 5+...",...,0,"Engineering, or <em class='b-txt--highlight'>...",[員工80人],距捷運科技大樓站130公尺,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,jolist_a_relevance,Data Scientist,雲端互動股份有限公司,121.5433783,25.0249441
7,2,12135639,Data Scientist (Network Science),<em class='b-txt--highlight'>Data</em> <em cla...,1,1,6001001010,台北市內湖區,內湖路一段300號6樓之3,NetBase Quid is searching for a talented and p...,...,0,NetBase Quid is searching for a talented and p...,[外商公司],距捷運西湖站150公尺,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,jolist_a_relevance,Data Scientist (Network Science),美商網基股份有限公司台灣分公司,121.5686754,25.0818595
8,2,12135742,Senior Data Scientist (Machine Learning),Senior <em class='b-txt--highlight'>Data</em> ...,1,1,6001001010,台北市內湖區,內湖路一段300號6樓之3,NetBase Quid is searching for a talented and p...,...,0,NetBase Quid is searching for a talented and p...,[外商公司],距捷運西湖站150公尺,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,jolist_a_relevance,Senior Data Scientist (Machine Learning),美商網基股份有限公司台灣分公司,121.5686754,25.0818595
9,0,9597346,Data Scientist,<em class='b-txt--highlight'>Data</em> <em cla...,1,1,6001001007,台北市信義區,基隆路一段200號9樓之1,"with efficient and accurate tutor matching, c...",...,0,"with efficient and accurate tutor matching, c...",[員工70人],距捷運市政府站300公尺,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,jolist_a_relevance,Data Scientist,知之有限公司,121.563553,25.040211


## 4. Get the next page: compare the URLs of the 2nd, 1st, 3rd, ... pages
```
https://www.104.com.tw/jobs/search/list?ro=0&kwop=7&keyword=data%20scientist&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&order=14&asc=0&page=2&mode=s&jobsource=2018indexpoc

https://www.104.com.tw/jobs/search/list?ro=0&kwop=7&keyword=data%20scientist&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&order=14&asc=0&page=1&mode=s&jobsource=2018indexpoc

https://www.104.com.tw/jobs/search/list?ro=0&kwop=7&keyword=data%20scientist&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&order=14&asc=0&page=3&mode=s&jobsource=2018indexpoc
```
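
The three URLs above differ only in the `page` parameter, so the URL for any page can be generated. The sketch below does this with the standard library's `urllib.parse.urlencode`; the helper name `build_104_url` is an assumption for illustration, and the parameter values are copied from the URLs above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.104.com.tw/jobs/search/list"


def build_104_url(keyword, page):
    """Build the search URL for one page (hypothetical helper)."""
    params = {
        "ro": 0,
        "kwop": 7,
        "keyword": keyword,
        "expansionType": "area,spec,com,job,wf,wktm",
        "order": 14,
        "asc": 0,
        "page": page,
        "mode": "s",
        "jobsource": "2018indexpoc",
    }
    # quote_via=quote encodes spaces as %20 (the default would use '+')
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


for page in range(1, 4):
    print(build_104_url("data scientist", page))
```

Building the query from a dict avoids hand-editing the long string and also takes care of percent-encoding the keyword.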

In [123]:
for page in range(1, 6):
    # Your code should be here

    
    
    

30
20
20
20
20


In [None]:
all_data = []
for page in range(1, 6):
    # Your code should be here
    
    
    
    print(len(all_data))

### Convert to pandas DataFrame

In [131]:
df = pd.DataFrame(all_data)
print(df.shape)
print(len(set(df.jobNo)))

(110, 38)
96
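
The output above shows 110 rows but only 96 distinct `jobNo` values, so some jobs were returned on more than one page. A minimal sketch of removing such repeats from a list of dicts follows; the helper name `dedupe_by_key` and the sample rows are invented for illustration.

```python
def dedupe_by_key(records, key="jobNo"):
    """Keep the first record seen for each value of `key` (hypothetical helper)."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


# Invented sample rows; real rows come from raw['data']['list']
sample = [
    {"jobNo": 101, "jobName": "Data Scientist"},
    {"jobNo": 102, "jobName": "Data Analyst"},
    {"jobNo": 101, "jobName": "Data Scientist"},
]
unique_jobs = dedupe_by_key(sample)
print(len(sample), len(unique_jobs))
```

On the DataFrame itself, `df.drop_duplicates(subset='jobNo')` should give a comparable result.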


## 5. Detect the ending condition

dict_keys(['status', 'action', 'data', 'statusMsg', 'errorMsg'])
<class 'dict'>
dict_keys(['query', 'filterDesc', 'queryDesc', 'list', 'count', 'pageNo', 'totalPage', 'totalCount'])
2
150
['4544', '4320', '108', '14', '102', '0', '0']
4544
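
Since the response carries `totalPage`, the loop can stop after that page instead of using a fixed range. The sketch below assumes a callable `fetch_page(page)` that returns the parsed JSON for one page; both `collect_all_pages` and `fake_fetch_page` are hypothetical names, and the fake data is invented so the loop can run without a network connection.

```python
def collect_all_pages(fetch_page):
    """Collect raw['data']['list'] from page 1 through totalPage.

    `fetch_page(page)` is a hypothetical callable that returns the parsed
    JSON for one page, e.g. a wrapper around requests.get(...).json().
    """
    first = fetch_page(1)
    total_pages = first["data"]["totalPage"]
    all_data = list(first["data"]["list"])
    # Pages are 1-indexed, so the range must include total_pages itself
    for page in range(2, total_pages + 1):
        raw = fetch_page(page)
        all_data.extend(raw["data"]["list"])
    return all_data


def fake_fetch_page(page):
    """Stand-in for a real request: 3 pages with 2 invented jobs each."""
    jobs = [{"jobNo": page * 10 + i} for i in range(2)]
    return {"data": {"list": jobs, "pageNo": page, "totalPage": 3}}


all_data = collect_all_pages(fake_fetch_page)
print(len(all_data))
```

When scraping a real site, adding a short `time.sleep` between pages is a common courtesy to the server.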


In [None]:
all_data = []
totalPage = 
for p in range(1, totalPage + 1):  # pages are 1-indexed, so include the last page

    
    
    
    print(len(all_data))

## 6. Convert to a DataFrame

In [136]:
df = pd.DataFrame(all_data)
print(df.shape)
df.head()

(1150, 38)


Unnamed: 0,jobType,jobNo,jobName,jobNameSnippet,jobRole,jobRo,jobAddrNo,jobAddrNoDesc,jobAddress,description,...,isSave,descSnippet,tags,landmark,link,jobsource,jobNameRaw,custNameRaw,lon,lat
0,1,10956171,高階主管秘書,高階主管秘書,1,1,6001001010,台北市內湖區,新湖二路8號5樓,此職務需要有高度敏銳、解決問題之能力，橫向溝通佳且善於表達。\n1.處理與主管相關的行政事務...,...,0,此職務需要有高度敏銳、解決問題之能力，橫向溝通佳且善於表達。\n1.處理與主管相關的行政事務...,[],,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,hotjob_chr,高階主管秘書,偉太健康科技有限公司,121.5743893,25.0607141
1,1,8925036,設備工程師 - 竹科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),設備工程師 - 竹科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),1,1,6001006001,新竹市,,Technical\r\n1. Responsible for providing qual...,...,0,Technical\r\n1. Responsible for providing qual...,"[外商公司, 員工850人]",,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,hotjob_chr,設備工程師 - 竹科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),美商_科林研發股份有限公司_Lam Research,120.9674798,24.8138287
2,1,11977925,Product Marketing Specialist,Product Marketing Specialist,1,1,6001001002,台北市大同區,近捷運中山站,"Roles &amp; Responsibilities\n• Maintain, impl...",...,0,"Roles &amp; Responsibilities\n• Maintain, impl...",[],距捷運中山站80公尺,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,hotjob_chr,Product Marketing Specialist,ThunderCore_閃電核心科技有限公司,121.5195662,25.0527276
3,1,9937464,設備工程師 - 南科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),設備工程師 - 南科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),1,1,6001014036,台南市新市區,南科三路25號2樓之2,Technical\r\n 1. Responsible for providing qua...,...,0,Technical\r\n 1. Responsible for providing qua...,"[外商公司, 員工850人]",,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,hotjob_chr,設備工程師 - 南科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),美商_科林研發股份有限公司_Lam Research,120.2759224,23.0977104
4,1,9226901,製程工程師 - 竹科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),製程工程師 - 竹科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),1,1,6001006001,新竹市,,*請檢附英文履歷**請載明多益成績*\n\nTechnical\n1.Responsible...,...,0,*請檢附英文履歷**請載明多益成績*\n\nTechnical\n1.Responsible...,"[外商公司, 員工850人]",,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,hotjob_chr,製程工程師 - 竹科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),美商_科林研發股份有限公司_Lam Research,120.9674798,24.8138287


# Dump files for backup



## M1. Dump one variable to JSON with the json library
* https://docs.python.org/3/library/json.html

In [47]:
import json
with open('104_list.json', 'w') as outfile:
    json.dump(all_data, outfile)
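
To check that the backup can be read again, the file can be loaded with `json.load`. The sketch below round-trips a small invented list through a temporary folder; `ensure_ascii=False` and the explicit UTF-8 encoding are optional choices that keep Chinese text readable in the file.

```python
import json
import os
import tempfile

# Invented sample standing in for all_data
all_data = [{"jobNo": 1, "jobAddrNoDesc": "台北市信義區"}]

# A temporary folder keeps this example from touching the working directory
path = os.path.join(tempfile.mkdtemp(), "104_list.json")

# ensure_ascii=False writes Chinese characters as-is instead of \uXXXX escapes
with open(path, "w", encoding="utf-8") as outfile:
    json.dump(all_data, outfile, ensure_ascii=False)

# Load the file back and compare with the original list
with open(path, encoding="utf-8") as infile:
    loaded = json.load(infile)

print(loaded == all_data)
```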

## M2. Dump and load JSON with the pandas library

In [137]:
with open('104_df.json', 'w') as f:
    f.write(df.to_json())

In [138]:
with open("104_df.json") as fin:
    data2 = pd.read_json(fin)
data2.head()

Unnamed: 0,jobType,jobNo,jobName,jobNameSnippet,jobRole,jobRo,jobAddrNo,jobAddrNoDesc,jobAddress,description,...,isSave,descSnippet,tags,landmark,link,jobsource,jobNameRaw,custNameRaw,lon,lat
0,1,10956171,高階主管秘書,高階主管秘書,1,1,6001001010,台北市內湖區,新湖二路8號5樓,此職務需要有高度敏銳、解決問題之能力，橫向溝通佳且善於表達。\n1.處理與主管相關的行政事務...,...,0,此職務需要有高度敏銳、解決問題之能力，橫向溝通佳且善於表達。\n1.處理與主管相關的行政事務...,[],,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,hotjob_chr,高階主管秘書,偉太健康科技有限公司,121.574389,25.060714
1,1,8925036,設備工程師 - 竹科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),設備工程師 - 竹科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),1,1,6001006001,新竹市,,Technical\r\n1. Responsible for providing qual...,...,0,Technical\r\n1. Responsible for providing qual...,"[外商公司, 員工850人]",,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,hotjob_chr,設備工程師 - 竹科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),美商_科林研發股份有限公司_Lam Research,120.96748,24.813829
2,1,11977925,Product Marketing Specialist,Product Marketing Specialist,1,1,6001001002,台北市大同區,近捷運中山站,"Roles &amp; Responsibilities\n• Maintain, impl...",...,0,"Roles &amp; Responsibilities\n• Maintain, impl...",[],距捷運中山站80公尺,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,hotjob_chr,Product Marketing Specialist,ThunderCore_閃電核心科技有限公司,121.519566,25.052728
3,1,9937464,設備工程師 - 南科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),設備工程師 - 南科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),1,1,6001014036,台南市新市區,南科三路25號2樓之2,Technical\r\n 1. Responsible for providing qua...,...,0,Technical\r\n 1. Responsible for providing qua...,"[外商公司, 員工850人]",,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,hotjob_chr,設備工程師 - 南科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),美商_科林研發股份有限公司_Lam Research,120.275922,23.09771
4,1,9226901,製程工程師 - 竹科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),製程工程師 - 竹科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),1,1,6001006001,新竹市,,*請檢附英文履歷**請載明多益成績*\n\nTechnical\n1.Responsible...,...,0,*請檢附英文履歷**請載明多益成績*\n\nTechnical\n1.Responsible...,"[外商公司, 員工850人]",,{'applyAnalyze': '//www.104.com.tw/jobs/apply/...,hotjob_chr,製程工程師 - 竹科 (濕蝕刻 Clean、蝕刻 Etch、薄膜沉積 Thin Film),美商_科林研發股份有限公司_Lam Research,120.96748,24.813829


## M3. Dump multiple variables to pickle

In [139]:
import pickle
with open('data/104.pkl', 'wb') as fout:  # Python 3: open(..., 'wb')
    pickle.dump([all_data, df], fout)

### Load multiple variables back to objects

In [142]:
with open('data/104.pkl', "rb") as fin:  # Python 3: open(..., 'rb')
    test = pickle.load(fin)
    print(type(test[0]))
    print(type(test[1]))

<class 'list'>
<class 'pandas.core.frame.DataFrame'>


In [144]:
test[0][0].keys()

dict_keys(['jobType', 'jobNo', 'jobName', 'jobNameSnippet', 'jobRole', 'jobRo', 'jobAddrNo', 'jobAddrNoDesc', 'jobAddress', 'description', 'optionEdu', 'period', 'periodDesc', 'applyCnt', 'applyDesc', 'custNo', 'custName', 'coIndustry', 'coIndustryDesc', 'salaryLow', 'salaryHigh', 'salaryDesc', 's10', 'appearDate', 'appearDateDesc', 'optionZone', 'isApply', 'applyDate', 'isSave', 'descSnippet', 'tags', 'landmark', 'link', 'jobsource', 'jobNameRaw', 'custNameRaw', 'lon', 'lat'])

## Using timestamp as file name

In [155]:
from datetime import datetime

now = datetime.now().strftime("%Y%m%d%H%M%S")
print("Current Time =", now)
with open('data/104_%s.pkl'%(now), 'wb') as fout:  # Python 3: open(..., 'wb')
    pickle.dump([all_data, df], fout)

Current Time = 20210325112321
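
The cells above assume the `data/` folder already exists; `open` raises an error otherwise. A small sketch that creates the folder first and builds the timestamped file name is shown below. The helper name `timestamped_path` is an assumption, and a temporary folder is used so the example does not write into the working directory.

```python
import os
import pickle
import tempfile
from datetime import datetime


def timestamped_path(directory, prefix, ext="pkl"):
    """Return directory/prefix_YYYYmmddHHMMSS.ext, creating the directory."""
    os.makedirs(directory, exist_ok=True)  # open() fails if the folder is missing
    now = datetime.now().strftime("%Y%m%d%H%M%S")
    return os.path.join(directory, "{}_{}.{}".format(prefix, now, ext))


# A temporary folder is used here; the notebook itself writes to data/
out_dir = os.path.join(tempfile.mkdtemp(), "data")
path = timestamped_path(out_dir, "104")
with open(path, "wb") as fout:
    pickle.dump([{"jobNo": 1}], fout)

# Read the file back to confirm the backup is usable
with open(path, "rb") as fin:
    restored = pickle.load(fin)
print(os.path.basename(path), restored)
```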


# Clean version: 104


In [None]:
import requests
import pandas as pd

headers = {'referer': 'https://www.104.com.tw/'}
search_str = '數據分析'

# Detecting totalPage
url_104 = 'https://www.104.com.tw/jobs/search/list?ro=0&kwop=7&keyword=' + search_str + '&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&order=14&asc=0&page=2&mode=s&jobsource=2018indexpoc'
raw = requests.get(url_104, headers=headers).json()
totalPage = raw['data']['totalPage']
print(totalPage)

# Getting data by loop
all_data = []
for p in range(1, totalPage + 1):  # pages are 1-indexed, so include the last page
    url = 'https://www.104.com.tw/jobs/search/list?ro=0&kwop=7&keyword=' + search_str + '&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&order=14&asc=0&page=' + str(p) + '&mode=s&jobsource=2018indexpoc'
    raw = requests.get(url, headers=headers).json()
    all_data.extend(raw['data']['list'])
    print(len(all_data))
df = pd.DataFrame(all_data)


# Saving and Backing up data
import pickle
from datetime import datetime

now = datetime.now().strftime("%Y%m%d%H%M%S")
print("Current Time =", now)
with open('data/104_%s_%s.pkl'%(search_str, now), 'wb') as fout:  # Python 3: open(..., 'wb')
    pickle.dump([all_data, df], fout)