# Asynchronously crawling all the Model pages. Depends on the Manufacturer crawling from notebook 05

In [1]:
import pickle
with open('mfct_urls.pickle', 'rb') as f:
    mfct_urls = pickle.load(f)
with open('expanded_urls.pickle', 'rb') as f:
    expanded_urls = pickle.load(f)
with open('all_models.pickle', 'rb') as f:
    all_models = pickle.load(f)

# We need to fetch about 11,000 webpages

Fetching pages one by one, synchronously, over keep-alive sessions takes about 0.8 seconds per page.
At that rate the full crawl would take well over two hours.
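As a quick sanity check on that estimate, the arithmetic can be written out. The constant names are only illustrative; the two numbers are the ones quoted above.

```python
# Rough estimate of the synchronous crawl time, using the figures quoted above.
NUM_PAGES = 11_000          # approximate number of model pages
SECONDS_PER_PAGE = 0.8      # measured with keep-alive sessions, one request at a time

total_seconds = NUM_PAGES * SECONDS_PER_PAGE
total_hours = total_seconds / 3600

print(f"{total_seconds:.0f} seconds, about {total_hours:.1f} hours")
```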

So we need to issue the requests concurrently instead of one at a time. Almost all of the time per page is spent waiting on the network, so one CPU core is enough; we just need many requests in flight at once.

If we need to go even faster, we can use multicore processing.

Here are some links:

https://skillshats.com/blogs/optimize-python-requests-for-faster-performance/
https://stackoverflow.com/questions/57098886/how-to-send-10-000-http-requests-concurrently-using-python
https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
https://plainenglish.io/blog/send-http-requests-as-fast-as-possible-in-python-304134d46604


## How I'd like to do it.

I'd like to do this from a thread/worker perspective.
Initialize x number of workers. They all grab URLs off a queue and fetch them. 

I'd prefer to NOT do this from a URLs perspective. I don't want to take a batch of URLs and turn them into threads.
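A minimal sketch of that worker model, using only the standard library. `fake_fetch`, `worker`, `NUM_WORKERS`, and the `example.invalid` URLs are hypothetical names for illustration; a real worker would issue an HTTP request where `fake_fetch` is called.

```python
import queue
import threading

def fake_fetch(url):
    # Hypothetical stand-in for an HTTP GET, so the sketch runs without network access.
    return f"<html>{url}</html>"

def worker(url_queue, results, lock):
    # Each worker keeps pulling URLs until the queue is empty, then exits.
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        page = fake_fetch(url)
        with lock:
            results[url] = page

NUM_WORKERS = 4
demo_urls = [f"https://example.invalid/page{i}" for i in range(10)]

# Fill the queue completely before any worker starts, so an empty queue means "done".
url_queue = queue.Queue()
for u in demo_urls:
    url_queue.put(u)

results = {}
lock = threading.Lock()
threads = [
    threading.Thread(target=worker, args=(url_queue, results, lock))
    for _ in range(NUM_WORKERS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results), "pages fetched by", NUM_WORKERS, "workers")
```

The number of threads is fixed by `NUM_WORKERS` and does not depend on how many URLs are queued.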

A few ways:
1. Make x threads using the threading library.
2. Use a ThreadPoolExecutor.
3. Use asyncio with aiohttp and a custom TCPConnector that limits the number of TCP connections.
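For comparison, here is a rough sketch of the shape option 3 could take. To keep it runnable without aiohttp or network access, it swaps in an `asyncio.Semaphore` for the connector limit and a hypothetical `fake_fetch` coroutine for the real request, so it only illustrates the bounded-concurrency idea, not the aiohttp API.

```python
import asyncio

async def fake_fetch(url):
    # Hypothetical stand-in for an aiohttp request; sleeps instead of using the network.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

async def crawl(urls, limit):
    # The semaphore plays the role of the connection limit (an assumption of this sketch).
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def bounded_fetch(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_fetch(url)
            finally:
                in_flight -= 1

    # gather returns results in the same order as the input URLs.
    pages = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return pages, peak

demo_urls = [f"https://example.invalid/page{i}" for i in range(20)]
pages, peak = asyncio.run(crawl(demo_urls, limit=5))
print(len(pages), "pages, at most", peak, "requests in flight")
```

Inside a Jupyter cell, which already runs an event loop, `await crawl(demo_urls, limit=5)` would take the place of the `asyncio.run` call.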

We're going to use (2), a ThreadPoolExecutor, following these guides:
* https://www.digitalocean.com/community/tutorials/how-to-use-threadpoolexecutor-in-python-3 
* https://superfastpython.com/threadpoolexecutor-in-python/

In [64]:
import requests
from bs4 import BeautifulSoup
import re
import os
import os.path
import urllib.parse
import time

cache_dir = os.path.expanduser("cache/")
def ensure_cache_dir():
    if not os.path.exists(cache_dir):
        print("Initializing cache directory", cache_dir)
        os.makedirs(cache_dir)
    else:
        print("Cache dir exists at", cache_dir)
        
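def retrieve_to_cache(local, url, verbose=False, session=None):
    # Percent-encode the whole URL (including ':' and '/') so it becomes a flat filename.
    filename = urllib.parse.quote(url, '')
    filepath = os.path.join(cache_dir, filename)

    if os.path.exists(filepath):
        return "Cache Hit " + filepath

    # Reuse the keep-alive session that belongs to the current worker thread.
    response = local.session.get(url)
    # Mode 'x' raises if the file already exists, so a cached page is never overwritten.
    with open(filepath, 'x', encoding='utf-8') as file:
        file.write(response.text)
    return "File Cached " + filepath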

ensure_cache_dir()

Cache dir exists at cache/
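To illustrate how the cache filenames relate to the URLs, here is a small round-trip check. The filename is one that appears in this notebook's crawl output; the variable names are only for illustration. Because `quote` is called with an empty `safe` string, every `/`, `:`, and `%` is escaped, which is why a `%20` in a URL shows up as `%2520` in the filename.

```python
import urllib.parse

# A cache filename taken from the crawl output in this notebook.
filename = "https%3A%2F%2Fwww.motorcyclespecs.co.za%2Fmodel%2Fktm%2Fktm_1190_rc8r%252012.htm"

# unquote() undoes one level of percent-encoding, recovering the original URL.
url = urllib.parse.unquote(filename)
print(url)

# quote() with safe='' maps the URL back to exactly the same filename.
roundtrip = urllib.parse.quote(url, '')
print(roundtrip == filename)
```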


### WARNING: THE CODE BELOW FETCHES 11,000 HTML PAGES AND TAKES ABOUT 35 MINUTES!

In [67]:
import concurrent.futures
import threading

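# For a trial run, replace the `urls = all_models` line with the two commented lines below.
#NUM_URLS = 1000
#urls = all_models[:NUM_URLS]
urls = all_models
# Size of the thread pool, i.e. the maximum number of requests in flight at once.
CONCURRENT_REQUESTS = 200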

## We assert that there are no duplicates in the URLs.
## This way every URL maps to its own cache file, so workers never write to the same file.
import collections
assert(len([item for item, count in collections.Counter(urls).items() if count > 1]) == 0)

def initialize_worker(local):
    local.session = requests.Session()
    

local = threading.local()
a = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENT_REQUESTS, initializer=initialize_worker, initargs=(local,)) as executor:
    futures = []
    for u in urls:
        # Each worker thread already has its own Session, set up by initialize_worker on `local`.
        futures.append(executor.submit(retrieve_to_cache, local, u))
    for future in concurrent.futures.as_completed(futures):
        print(future.result())

elapsed = time.time() - a

print("We did",len(urls),"requests in",elapsed,"seconds, or an average per-request duration of",elapsed/len(urls))

Cache Hit cache/https%3A%2F%2Fwww.motorcyclespecs.co.za%2Fmodel%2Fktm%2Fktm_1190_rc8r%252012.htm
Cache Hit cache/https%3A%2F%2Fwww.motorcyclespecs.co.za%2Fmodel%2Fsuzu%2Fsuzuki_gz250_marauder%252006.htm
Cache Hit cache/https%3A%2F%2Fwww.motorcyclespecs.co.za%2Fmodel%2Fktm%2Fktm-890-smt-2023.html
Cache Hit cache/https%3A%2F%2Fwww.motorcyclespecs.co.za%2Fmodel%2Fktm%2Fktm_890_adventure_r_rally_21.html
Cache Hit cache/https%3A%2F%2Fwww.motorcyclespecs.co.za%2Fmodel%2Fkawasaki%2Fkawasaki_75T_76.html
Cache Hit cache/https%3A%2F%2Fwww.motorcyclespecs.co.za%2Fmodel%2Fsuzu%2Fsuzuki_gz250_marauder%252003.htm
Cache Hit cache/https%3A%2F%2Fwww.motorcyclespecs.co.za%2Fmodel%2FAdler%2Fadler-favorit.html
Cache Hit cache/https%3A%2F%2Fwww.motorcyclespecs.co.za%2Fmodel%2FAC%2520Schnitzer%2Fac_schnitzer_s_1000rr.htm
Cache Hit cache/https%3A%2F%2Fwww.motorcyclespecs.co.za%2Fmodel%2Fktm%2Fktm_1290_super%2520adventure%252015.htm
Cache Hit cache/https%3A%2F%2Fwww.motorcyclespecs.co.za%2Fmodel%2FAdler%2Fadl
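The per-thread session setup used above can be checked in isolation. This is a sketch under assumptions: `FakeSession`, `init_worker`, and `task` are hypothetical stand-ins, so nothing touches the network. It records which session object each task saw and confirms that every worker thread only ever used its own.

```python
import concurrent.futures
import threading

class FakeSession:
    """Hypothetical stand-in for requests.Session, so the sketch needs no network."""

def init_worker(local):
    # Runs once in each worker thread; the attribute is visible only to that thread.
    local.session = FakeSession()

def task(local, n):
    # Report which session object this task saw and which thread ran it.
    return id(local.session), threading.get_ident()

MAX_WORKERS = 4
local = threading.local()
with concurrent.futures.ThreadPoolExecutor(
    max_workers=MAX_WORKERS, initializer=init_worker, initargs=(local,)
) as executor:
    futures = [executor.submit(task, local, n) for n in range(50)]
    results = [f.result() for f in futures]

# Group the observed session ids by the thread that saw them.
sessions_by_thread = {}
for session_id, thread_id in results:
    sessions_by_thread.setdefault(thread_id, set()).add(session_id)

print(len(sessions_by_thread), "worker threads, one session each")
```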

In [68]:
urls

['https://www.motorcyclespecs.co.za/model/AC%20Schnitzer/ac_schnitzer_f_650cs.htm',
 'https://www.motorcyclespecs.co.za/model/AC%20Schnitzer/ac_schnitzer_F800R.htm',
 'https://www.motorcyclespecs.co.za/model/AC%20Schnitzer/ac_schnitzer_800s_twinstar.htm',
 'https://www.motorcyclespecs.co.za/model/AC%20Schnitzer/ac_schnitzer_hp2.htm',
 'https://www.motorcyclespecs.co.za/model/AC%20Schnitzer/ac_schnitzer_k_1200rGT.htm',
 'https://www.motorcyclespecs.co.za/model/AC%20Schnitzer/ac_schnitzer_k1200r.htm',
 'https://www.motorcyclespecs.co.za/model/AC%20Schnitzer/ac_schnitzer_k_1200rs.htm',
 'https://www.motorcyclespecs.co.za/model/AC%20Schnitzer/ac_schnitzer_k_1200s.htm',
 'https://www.motorcyclespecs.co.za/model/AC%20Schnitzer/ac_schnitzer_k1300s.htm',
 'https://www.motorcyclespecs.co.za/model/AC%20Schnitzer/ac_schnitzer_bmw_k1300r.htm',
 'https://www.motorcyclespecs.co.za/model/AC%20Schnitzer/ac_schnitzer_k_1200r%20sport%2007.htm',
 'https://www.motorcyclespecs.co.za/model/AC%20Schnitzer/ac