### 爬蟲思路和概念 :
1. 先去要爬取的網站，打開F12 DevTool查看先關資訊
2. 觀察自己想要爬取的資料，可以複製一段文字，到Elements中查看，放在HTML檔案種中的哪個地方
3. 觀察Html Tag 和 css屬性，搭配 BS4進行爬取資訊

#### 範例
**爬取CakeResume的職缺資訊**
1. 可以觀察到所有的職缺資訊都是使用div 把資料包起來的
2. 會發現資料被包在一個叫做`JobSearchPage_searchResults__*`的裡面

    ![](https://i.imgur.com/iYC1npE.png)
3. 接著可以發現，所有的div container都叫做`JobSearchItem_Wrapper__*`，我們可以利用這個資訊進行資料的爬取

    ![](https://i.imgur.com/tdhjgBG.png)

爬取過程參照下列作法

In [4]:
import requests
from bs4 import BeautifulSoup
import re
import json
import time


In [5]:
url = 'https://www.cakeresume.com/jobs/categories/it/data-engineer?order=latest'

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

In [6]:
searchResult = soup.find_all('div', class_ = re.compile(r'JobSearchPage_searchResults__*'))  
searchList = soup.find_all('div', class_ = re.compile(r'JobSearchHits_list__*'))
searchItem = soup.find_all('div', class_ = re.compile(r'JobSearchItem_wrapper__*'))
len(searchItem)  # 表示這頁中有13個職缺資料，後續須對這些資料進行清洗

10

### 查看div 中的資料樣態
1. 職稱: 放在`JobSearchItem_headerTitle__*`
2. 公司名稱: 放在`JobSearchItem_headerSubtitle__*`
3. 公司簡介: 放在`JobSearchItem_description__*`
4. Job tag: 放在`JobSearchItem_tags__*`
5. 職缺性質: 放在`JobSearchItem_features__*`


In [7]:
processing = searchItem[0]
url_domain = 'https://www.cakeresume.com'
job_link = url_domain + (processing.find('div', class_ = re.compile(r'JobSearchItem_headerTitle__*')).find('a', href = True))['href']
job_title = processing.find('div', class_ = re.compile(r'JobSearchItem_headerTitle__*')).getText()
company = processing.find('div', class_ = re.compile(r'JobSearchItem_headerSubtitle__*')).getText()
important_skill = []
tags = processing.find('div', class_ = re.compile(r'JobSearchItem_tags__*'))
if tags:
    tags = tags.find_all('div', class_ = re.compile(r'Tags_item__*'))
    for tag in tags:
        if tag.text != '':
            important_skill.append(tag.text)

    jon_feature_soup = processing.find('div', class_ = re.compile(r'JobSearchItem_features__*'))


In [8]:
print(job_link)
print(job_title)
print(company)
print(important_skill)

https://www.cakeresume.com/companies/realtek-semiconductor/jobs/b8b624
PC Camera演算法開發工程師/專案技術主管(臺北大直)
Realtek 瑞昱半導體
['C/C++', 'Matlab', 'Camera']


In [9]:
job_feature_soup = processing.find('div', class_ = re.compile(r'JobSearchItem_features__*'))
inline_messages_type = job_feature_soup.find_all('div', class_ = re.compile(r'InlineMessage_icon__*'))
inline_messages = job_feature_soup.find_all('div', class_ = re.compile(r'InlineMessage_label__*'))
print(f'tpye: {len(inline_messages_type)} label: {len(inline_messages)}')
print(inline_messages)

tpye: 5 label: 5
[<div class="InlineMessage_label__hP3Fk">1<div class="SeparatorDot_dot__rrEYX">・</div><div class="JobSearchItem_featureSegments__I1Csc"><div><a class="Button_button__N4TAn Button_buttonLinkGreen__i7ru_ Button_buttonSmall__PsSMK Button_buttonNoPadding___8Dm0" href="/jobs?profession%5B0%5D=it_data-engineer&amp;job_type%5B0%5D=full_time&amp;order=latest"><div class="Button_buttonContent__arBxx"><div class="Button_buttonContentMain__afmnj">Full-time</div></div></a></div><div class="SeparatorDot_dot__rrEYX">・</div><div><a class="Button_button__N4TAn Button_buttonLinkGreen__i7ru_ Button_buttonSmall__PsSMK Button_buttonNoPadding___8Dm0" href="/jobs?profession%5B0%5D=it_data-engineer&amp;seniority_level%5B0%5D=mid_senior_level&amp;order=latest"><div class="Button_buttonContent__arBxx"><div class="Button_buttonContentMain__afmnj">Mid-Senior level</div></div></a></div></div></div>, <div class="InlineMessage_label__hP3Fk"><div class="JobSearchItem_featureSegments__I1Csc"><a class

In [10]:
for inline_message in inline_messages_type:
    class_type = inline_message.find('div', class_ = re.compile(r'Tooltip_handle__*')).find('i', class_ = True)['class']
    print(class_type[1][3:])

user
map-marker-alt
dollar-sign
business-time
sitemap


In [11]:
for inline_message in inline_messages:
    print(inline_message.text)

1・Full-time・Mid-Senior level
台灣台北
90K ~ 150K TWD / month
3 years of experience required
Managing 1-5 staff


In [12]:
feature = {}
if len(inline_messages) == len(inline_messages_type):
    for i in range(len(inline_messages)):
        type = (inline_messages_type[i].find('div', class_ = re.compile(r'Tooltip_handle__*')).find('i', class_ = True)['class'])[1][3:]
        label = inline_messages[i].text
        feature[type] = label

feature

{'user': '1・Full-time・Mid-Senior level',
 'map-marker-alt': '台灣台北',
 'dollar-sign': '90K ~ 150K TWD / month',
 'business-time': '3 years of experience required',
 'sitemap': 'Managing 1-5 staff'}

爬蟲程式設計，建立爬蟲Pipline以及資料清洗及整理的方式

In [13]:
def get_soup(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    if soup:
        print("Success")
        return soup
    else:
        print("Fail")
        return None

def get_searchItem(soup):
    searchItem = soup.find_all('div', class_ = re.compile(r'JobSearchItem_wrapper__*'))
    print(f"There are {len(searchItem)} jobs in this page.")
    return searchItem

def get_job_link_and_title(item):
    url_domain = 'https://www.cakeresume.com'
    job_title = item.find('div', class_ = re.compile(r'JobSearchItem_headerTitle__*')).getText()
    job_link = url_domain + (item.find('div', class_ = re.compile(r'JobSearchItem_headerTitle__*')).find('a', href = True))['href']
    return job_link, job_title

def get_company_name(item):
    company_name = item.find('div', class_ = re.compile(r'JobSearchItem_headerSubtitle__*')).getText()
    return company_name

def get_skill_set(item):
    if item.find('div', class_ = re.compile(r'JobSearchItem_tags__*')):
        skill_set = []
        tags = item.find('div', class_ = re.compile(r'JobSearchItem_tags__*')).find_all('div', class_ = re.compile(r'Tags_item__*'))
        for tag in tags:
            if tag.text != '':
                skill_set.append(tag.text)
        return skill_set
    else:
        return ""

def get_feature(item):
    job_features = item.find('div', class_ = re.compile(r'JobSearchItem_features__*'))
    inline_messages_type = job_feature_soup.find_all('div', class_ = re.compile(r'InlineMessage_icon__*'))
    inline_messages = job_feature_soup.find_all('div', class_ = re.compile(r'InlineMessage_label__*'))  
    print(f'type: {len(inline_messages_type)}  label: {len(inline_messages)}')
    feature = {}
    if len(inline_messages) == len(inline_messages_type):
        for i in range(len(inline_messages)):
            type = (inline_messages_type[i].find('div', class_ = re.compile(r'Tooltip_handle__*')).find('i', class_ = True)['class'])[1][3:]
            label = inline_messages[i].text
            feature[type] = label
    print(feature)
    return feature

def arrange_the_data(job_link, job_title, company_name, skill_set, feature):
    job_info = []
    if job_link != "":
        job_info.append(job_link)
    else:
        job_info.append("")

    if job_title != "":
        job_info.append(job_title)
    else:
        job_info.append("")   

    if company_name != "":
        job_info.append(company_name)
    else:
        job_info.append("")
        
    if skill_set != "":
        job_info.append(skill_set)
    else:
        job_info.append("")
        
    if "user" in feature.keys():
        job_info.append(feature['user'])
    else:
        job_info.append("")
        
    if "map-marker-alt" in feature.keys():
        job_info.append(feature['map-marker-alt'])
    else:
        job_info.append("")     
           
    if "dollar-sign" in feature.keys():
        job_info.append(feature['dollar-sign'])
    else:
        job_info.append("")        

    if "business-time" in feature.keys():
        job_info.append(feature['business-time'])
    else:
        job_info.append("")        

    if "sitemap" in feature.keys():
        job_info.append(feature['sitemap'])
    else:
        job_info.append("")        

    return job_info

In [36]:
# url = "https://www.cakeresume.com/jobs/categories/it/data-engineer?order=latest"


def get_job_info(url):
    soup = get_soup(url)
    if soup != "":
        searchItems = get_searchItem(soup)
        job_infos = []
        for search_item in searchItems:
            job_link, job_title = get_job_link_and_title(search_item)
            company_name = get_company_name(search_item)
            skill_set = get_skill_set(search_item)
            feature = get_feature(search_item)
            job_info = arrange_the_data(job_link, job_title, company_name, skill_set, feature)
            job_infos.append(job_info)
            print(job_info)


設計非日常爬蟲，初始化時該如何找到下一頁的function
我們的思路是每次進到一個爬蟲頁面後，找到下一頁的網址，直到找不到為止。
一樣借由開發人員工具(F12)來檢視網頁的規律，會發現頁碼的資訊放在`JobSearchPage_searchPagination__*`，

![](https://i.imgur.com/ldb3F5v.png)

接著我們可以發現，`>`下一頁符號的資訊放在`Pagination_itemNavigation__*`中href中。

![](https://i.imgur.com/VF6PebC.png)


In [15]:
url = 'https://www.cakeresume.com/jobs/categories/it/data-engineer?order=latest'

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

next_page = soup.find('div', class_ = re.compile(r'Pagination_wrapper__*'))
next_page

<div class="Pagination_wrapper__AEWI_"><div class="Pagination_itemNavigation__wHk0M Pagination_itemDisabled__jrhUA"><i class="fa fa-chevron-left"></i></div><a class="Pagination_itemNumber__5L1fV Pagination_itemActive___SZIW" href="https://www.cakeresume.com/jobs/categories/it/data-engineer?order=latest">1</a><a class="Pagination_itemNumber__5L1fV" href="https://www.cakeresume.com/jobs/categories/it/data-engineer?order=latest&amp;page=2">2</a><a class="Pagination_itemNumber__5L1fV" href="https://www.cakeresume.com/jobs/categories/it/data-engineer?order=latest&amp;page=3">3</a><a class="Pagination_itemNumber__5L1fV" href="https://www.cakeresume.com/jobs/categories/it/data-engineer?order=latest&amp;page=4">4</a><a class="Pagination_itemNumber__5L1fV" href="https://www.cakeresume.com/jobs/categories/it/data-engineer?order=latest&amp;page=5">5</a><a class="Pagination_itemNumber__5L1fV" href="https://www.cakeresume.com/jobs/categories/it/data-engineer?order=latest&amp;page=6">6</a><a class="

In [16]:
next_page_link = next_page.find_all('a', class_ = re.compile(r'Pagination_itemNavigation__*'), href = True)
next_page_link[0]['href']

'https://www.cakeresume.com/jobs/categories/it/data-engineer?order=latest&page=2'

In [17]:
# Check if there is no next page
url = 'https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=2'

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

next_page = soup.find('div', class_ = re.compile(r'Pagination_wrapper__*'))
next_page

<div class="Pagination_wrapper__AEWI_"><a class="Pagination_itemNavigation__wHk0M" href="https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&amp;order=latest"><i class="fa fa-chevron-left"></i></a><a class="Pagination_itemNumber__5L1fV" href="https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&amp;order=latest">1</a><a class="Pagination_itemNumber__5L1fV Pagination_itemActive___SZIW" href="https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&amp;order=latest&amp;page=2">2</a><a class="Pagination_itemNumber__5L1fV" href="https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&amp;order=latest&amp;page=3">3</a><a class="Pagination_itemNumber__5L1fV" href="https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&amp;order=latest&amp;page=4">4</a><a class="Pagination_itemNumber__5L1fV" href="https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&amp

In [18]:
next_page_link = next_page.find_all('a', class_ = re.compile(r'Pagination_itemNavigation__*'))
next_page_link[1]['href']

'https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=3'

設計初始化DB的爬蟲


In [34]:
def find_next_page(soup):
    next_page = soup.find('div', class_ = re.compile(r'Pagination_wrapper__*'))
    next_page_link = next_page.find_all('a', class_ = re.compile(r'Pagination_itemNavigation__*'), href = True)
    if next_page_link:
        if len(next_page_link) == 2:
            return next_page_link[1]['href']
        else :
            return next_page_link[0]['href']
    else:
        next_page_link = next_page.find_all('a', class_ = re.compile(r'Pagination_itemNumber__*'), href = True)
        return next_page_link[0]['href']

In [38]:
job_types = ['it_back-end-engineer', 'it_data-engineer', 'it_data-scientist', 'it_machine-learning-engineer', 'it_qa-test-engineer', 'management-business_data-analyst']
location = 'latest'
url_data = []

for job_type in job_types:
    # time.sleep(10)
    url = f'https://www.cakeresume.com/jobs/python?profession%5B0%5D={job_type}&order={location}'
    url_data.append(url)
    soup = get_soup(url)
    while True:
        next_page_link = find_next_page(soup)
        if next_page_link in url_data:
            break
        else:
            url_data.append(next_page_link)
            soup = get_soup(next_page_link)
        

Success
https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=2
Success
https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=3
Success
https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=4
Success
https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=5
Success
https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=6
Success
https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=7
Success
https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=8
Success
https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=9
Success
https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=10
Success
https://www.cakeresume.com/jobs/pytho

In [41]:
for url in url_data:
    print(url)
    get_job_info(url)

https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest
Success
There are 10 jobs in this page.
type: 5  label: 5
{'user': '1・Full-time・Mid-Senior level', 'map-marker-alt': '台灣台北', 'dollar-sign': '90K ~ 150K TWD / month', 'business-time': '3 years of experience required', 'sitemap': 'Managing 1-5 staff'}
['https://www.cakeresume.com/companies/aha/jobs/backend-software-engineer-43ce76', '🔥  [資深] 後端軟體工程師 (全遠端)  🛠 Senior Back-End Engineer (Remote) 🚀', 'Aha', ['JavaScript', 'Node.js', 'Express.js'], '1・Full-time・Mid-Senior level', '台灣台北', '90K ~ 150K TWD / month', '3 years of experience required', 'Managing 1-5 staff']
type: 5  label: 5
{'user': '1・Full-time・Mid-Senior level', 'map-marker-alt': '台灣台北', 'dollar-sign': '90K ~ 150K TWD / month', 'business-time': '3 years of experience required', 'sitemap': 'Managing 1-5 staff'}
['https://www.cakeresume.com/companies/devcore/jobs/ffbb1e', 'Senior Developer 紅隊平台開發工程師', 'DEVCORE 戴夫寇爾', ['Ruby', 'Python', 'Web D

In [40]:
url_data

['https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest',
 'https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=2',
 'https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=3',
 'https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=4',
 'https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=5',
 'https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=6',
 'https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=7',
 'https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=8',
 'https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=latest&page=9',
 'https://www.cakeresume.com/jobs/python?profession%5B0%5D=it_back-end-engineer&order=lat