## Assignment goal

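* Compare an asynchronous crawler with a multithreaded crawler: how do they differ, and what are the advantages and disadvantages of each?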

In [1]:
# normal

import requests 
import time

URL = "https://www.bookwalker.com.tw/"
NUM = 10

def normal(URL, NUM):
    for i in range(NUM):
        response = requests.get(url=URL)
        result = response.url
    
    print("crawl target: {}".format(result))

t1 = time.time()
normal(URL, NUM)
print("Normal total time:", time.time()-t1)

crawl target: https://www.bookwalker.com.tw/
Normal total time: 6.5338990688323975


In [2]:
# asynchronous

import asyncio
import aiohttp

# Workaround for "RuntimeError: This event loop is already running" (raised when run inside a notebook):
# https://stackoverflow.com/questions/46827007/runtimeerror-this-event-loop-is-already-running-in-python
import nest_asyncio
nest_asyncio.apply()

async def crawl_by_asyncio(session):
    # "async with" releases the connection back to the pool once we are done with the response.
    async with session.get(URL) as response:
        return str(response.url)

async def main(loop, NUM):
    async with aiohttp.ClientSession() as session:
        tasks = [loop.create_task(crawl_by_asyncio(session)) for i in range(NUM)]
        finished, unfinished = await asyncio.wait(tasks)  # see the note below
        all_results = [r.result() for r in finished]
        print(all_results)
# Note: asyncio.wait() returns two sets, the "done" tasks and the "pending" tasks.
# https://stackoverflow.com/questions/42231161/asyncio-gather-vs-asyncio-wait

t1 = time.time()
loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop, NUM))
#loop.close()
print("total time: ", time.time()-t1)

['https://www.bookwalker.com.tw/', 'https://www.bookwalker.com.tw/', 'https://www.bookwalker.com.tw/', 'https://www.bookwalker.com.tw/', 'https://www.bookwalker.com.tw/', 'https://www.bookwalker.com.tw/', 'https://www.bookwalker.com.tw/', 'https://www.bookwalker.com.tw/', 'https://www.bookwalker.com.tw/', 'https://www.bookwalker.com.tw/']
total time:  0.9972765445709229
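
The Stack Overflow thread linked in the cell above contrasts `asyncio.wait()` with `asyncio.gather()`. Below is a minimal sketch of the `gather` variant. It is not the code that produced the timing above: `fake_fetch`, `gather_all` and `run_gather` are hypothetical names, and `asyncio.sleep` stands in for the aiohttp request so the example runs without network access.

```python
import asyncio
import time

async def fake_fetch(i, delay=0.1):
    # Hypothetical stand-in for session.get(URL): sleeps instead of doing network I/O.
    await asyncio.sleep(delay)
    return i

async def gather_all(num):
    # gather() returns results in the order the awaitables were passed in,
    # whereas the "done" set returned by asyncio.wait() is unordered.
    return await asyncio.gather(*(fake_fetch(i) for i in range(num)))

def run_gather(num=10):
    t1 = time.perf_counter()
    results = asyncio.run(gather_all(num))
    return results, time.perf_counter() - t1

if __name__ == "__main__":
    results, elapsed = run_gather()
    print(results)
    print("total time:", elapsed)
```

Because all ten simulated requests wait at the same time, the elapsed time should be close to one delay rather than ten. Inside a notebook, `await gather_all(10)` in a cell is likely the simpler way to call it.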


In [5]:
# multithreading

import _thread

start = time.time()

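# Note: start_new_thread() returns immediately and nothing here waits for the threads,
# so the time printed below covers only starting the threads, not the requests themselves.
for i in range(NUM):
    _thread.start_new_thread( requests.get, (URL, ))
    print(URL)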

finish = time.time()
print(finish - start)

https://www.bookwalker.com.tw/
https://www.bookwalker.com.tw/
https://www.bookwalker.com.tw/
https://www.bookwalker.com.tw/
https://www.bookwalker.com.tw/
https://www.bookwalker.com.tw/
https://www.bookwalker.com.tw/
https://www.bookwalker.com.tw/
https://www.bookwalker.com.tw/
https://www.bookwalker.com.tw/
0.0029294490814208984
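
The figure above is the time needed to start the threads, since the cell never waits for the requests to complete, so it is not directly comparable with the other two timings. A minimal sketch of a measurement that does wait, using `threading.Thread` and `join()`; `fake_request` and `timed_threads` are hypothetical names, and `time.sleep` stands in for `requests.get(URL)` so the example runs offline.

```python
import threading
import time

def fake_request(delay=0.1):
    # Hypothetical stand-in for requests.get(URL): blocks like a network call would.
    time.sleep(delay)

def timed_threads(num, delay=0.1):
    threads = [threading.Thread(target=fake_request, args=(delay,)) for _ in range(num)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    started = time.perf_counter() - start    # time to launch the threads only
    for t in threads:
        t.join()                              # block until every worker has finished
    finished = time.perf_counter() - start   # time until all requests are done
    return started, finished

if __name__ == "__main__":
    started, finished = timed_threads(10)
    print("threads started after:", started)
    print("threads finished after:", finished)
```

Only the second number is a fair counterpart to the sequential and asynchronous totals.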


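* Asynchronous crawling runs in a single thread. While one request is waiting for its response, the event loop moves on to another task, so the thread spends little time idle; with multithreading, each thread blocks on its own request and the program has to wait until all of them have finished.
* As one [Stack Overflow](https://stackoverflow.com/questions/34680985/what-is-the-difference-between-asynchronous-programming-and-multithreading/34681101#34681101) answer puts it:
* > Threading is about workers; asynchrony is about tasks.
* Multithreading is about putting several threads to work; asynchrony is about making efficient use of a single thread.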
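
The "workers versus tasks" distinction can be made visible by recording which thread each job runs on. This is a rough Python-specific sketch: `blocking_job`, `async_job`, `run_async_jobs` and `thread_ids` are hypothetical names, and the sleeps stand in for network waits.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_job(_):
    # Simulated blocking request; reports the id of the worker thread that ran it.
    time.sleep(0.05)
    return threading.get_ident()

async def async_job():
    # Simulated non-blocking request; reports the id of the thread running the event loop.
    await asyncio.sleep(0.05)
    return threading.get_ident()

async def run_async_jobs(num):
    return await asyncio.gather(*(async_job() for _ in range(num)))

def thread_ids(num=5):
    # Multithreading: the jobs are spread over several worker threads.
    with ThreadPoolExecutor(max_workers=num) as pool:
        pool_ids = set(pool.map(blocking_job, range(num)))
    # Asynchrony: every task runs on the one thread that drives the event loop.
    async_ids = set(asyncio.run(run_async_jobs(num)))
    return pool_ids, async_ids

if __name__ == "__main__":
    pool_ids, async_ids = thread_ids()
    print("thread pool used", len(pool_ids), "thread(s)")
    print("asyncio used", len(async_ids), "thread(s)")
```

The asyncio tasks all report the same thread id, while the pool reports several: the same amount of waiting is overlapped either by adding workers or by interleaving tasks on one worker.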