# Web Crawler

Author: Luana Bezerra Batista

Date: 10-07-2020

This application uses the following technologies/libraries (among others):
* Python, as programming language
* Beautiful Soup, as HTML parser
* concurrent.futures and multiprocessing, for parallel execution
* Flask, as a micro web framework

It implements almost all of the [requirements](https://drive.google.com/file/d/1qOyuU69wUDDcjnUcdB6qjca10i9ZgDe7/view?usp=sharing). However, I had to make some choices that would lead to a "quick" implementation.

Because I was working full time while implementing this solution (using my current employer's resources, oops! :) and wanted to deliver it within 1 week, I decided to skip the Docker part and use Colab instead.

Please see my **comments** in each code cell.


---



In order to run this application, go to the **Runtime** menu and select **Restart and run all**.

When you reach the main function, please open a shell window and run **curl** commands using the ngrok http tunnel.


---


Available **curl** commands:

```
* Posting URLs using 1 single task:
 * curl -X POST http://b4098711e066.ngrok.io/1 -H "Content-Type: application/json" -d "[\"http://4chan.org/\", \"https://golang.org/\"]"

* Posting URLs using 2 parallel tasks (you can use one task per URL):
 * curl -X POST http://b4098711e066.ngrok.io/2 -H "Content-Type: application/json" -d "[\"http://4chan.org/\", \"https://golang.org/\"]"

* Getting the status of a task:
 * curl -X GET http://96ef6a5a14e0.ngrok.io/status/c426926b-64df-4417-8cb8-59f719c41ef1

* Getting the result of a task:
 * curl -X GET http://96ef6a5a14e0.ngrok.io/result/c426926b-64df-4417-8cb8-59f719c41ef1
```

Note that the ngrok tunnel address changes at every execution. You'll find the current address in the output of the last cell of this file, just after http://127.0.0.1.
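
The same calls can be scripted instead of typed by hand. Below is a minimal sketch using only the standard library. `BASE_URL` is a placeholder for whatever tunnel address the last cell prints, and the helper names (`build_post`, `build_get`, `send`) are hypothetical, not part of this notebook.

```python
import json
import urllib.request

# Placeholder: replace with the ngrok address printed by the last cell.
BASE_URL = "http://b4098711e066.ngrok.io"


def build_post(base_url, n_tasks, urls):
    """Build the POST request that submits `urls` using `n_tasks` tasks."""
    body = json.dumps(urls).encode("utf-8")
    return urllib.request.Request(
        "{}/{}".format(base_url, n_tasks),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_get(base_url, kind, job_id):
    """Build the GET request for a job; `kind` is "status" or "result"."""
    return urllib.request.Request(
        "{}/{}/{}".format(base_url, kind, job_id), method="GET"
    )


def send(req):
    """Send a prepared request and decode the JSON body of the response."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `send(build_post(BASE_URL, 2, ["http://4chan.org/", "https://golang.org/"]))` would submit both URLs. As the POST handler is written, the response is a JSON list of JSON-encoded strings, so each element needs a second `json.loads` to read its `job_id`.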

---



References: 
* https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* https://pymotw.com/3/concurrent.futures/
* https://docs.python.org/3/library/concurrent.futures.html
* https://stackoverflow.com/questions/62130227/how-do-i-get-child-process-pids-when-using-processpoolexecuter
* https://www.thepythoncode.com/article/download-web-page-images-python 
* https://beckernick.github.io/faster-web-scraping-python/
* https://blog.miguelgrinberg.com/post/designing-a-restful-api-with-python-and-flask
* https://pypi.org/project/Flask-UUID/

In [1]:
pip install requests beautifulsoup4 flask Flask-UUID



In [2]:
import concurrent.futures
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import os
import json
import uuid
import multiprocessing
import queue
import sys
from flask import Flask, jsonify, make_response, abort, request
from flask_uuid import FlaskUUID

In [3]:
#To run a Flask app in Colab:
#https://github.com/gstaff/flask-ngrok
!pip install flask-ngrok
from flask_ngrok import run_with_ngrok



In [4]:
#Flask constructor 
app = Flask(__name__)

#Flask extension that registers a UUID converter for urls on a Flask application
FlaskUUID(app)

<flask_uuid.FlaskUUID at 0x7f3446bf29b0>

In [5]:
#Use ngrok http tunnel
run_with_ngrok(app) 

In [6]:
#Queue accessible to different workers
q = multiprocessing.Manager().Queue()
pid_dict = {}

In [7]:
#This function extracts image URLs from a list of base URLs and from their
#immediate children (i.e., down to the second level of websites).
#It could be improved by taking the number of levels to crawl as a parameter.
def crawl_image_urls(base_url_list, uuid, queue): 

    #setting the pid    
    pid = os.getpid()
    #print("Executing task on process {}".format(pid))       
    queue.put((uuid, pid))    

    #this will store the image urls
    img_url_list = []    

    #extracting all the images found in base_url
    for base_url in base_url_list:       

        resp = requests.get(base_url)

        #using Beautiful Soup as HTML parser
        soup = BeautifulSoup(resp.content, "html.parser")
        
        #extracting all img links from base_url
        BASE_IMAGES = soup.find_all("img")  

        for img in BASE_IMAGES:    
            #making the URL absolute by joining base_url with img_url
            img_url = urljoin(base_url, img.attrs.get("src")) 
            img_url_list.append(img_url)                   

        #getting children_img_urls 
        children_img_urls = crawl_children_image_urls(base_url)
        
        #updating the list with children_img_urls
        #(extend keeps the list flat; append would nest a list inside it)
        img_url_list.extend(children_img_urls or [])

    return (img_url_list)
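
The comment above this function suggests making the number of levels a parameter. A minimal sketch of that idea follows. It is not this notebook's implementation: it swaps Beautiful Soup for the standard library's `html.parser` so it runs without dependencies, and it takes a `fetch` callable so the page source can be faked in a test. The names `PageParser`, `crawl_images` and `fetch` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects <img src> and <a href> values from one page."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


def crawl_images(start_urls, fetch, number_of_levels=2):
    """Return image URLs found within `number_of_levels` of `start_urls`.

    `fetch(url)` must return the page HTML as a string (assumption: the
    caller wraps requests.get or similar). Level 1 is the start pages.
    """
    seen = set()
    images = []
    frontier = list(start_urls)
    for _ in range(number_of_levels):
        next_frontier = []
        for url in frontier:
            if url in seen:
                continue  # do not fetch the same page twice
            seen.add(url)
            parser = PageParser()
            parser.feed(fetch(url))
            # make every URL absolute relative to the page it was found on
            images.extend(urljoin(url, src) for src in parser.images)
            next_frontier.extend(urljoin(url, href) for href in parser.links)
        frontier = next_frontier
    return images
```

With `number_of_levels=2` this covers the same two levels as the functions in this notebook: the base pages and their immediate children.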

In [8]:
#Given a base_url, this function finds immediate children websites
#and extracts image URLs from them.
#It returns a list of image URLs.
def crawl_children_image_urls(base_url): 

    #extracting all the URLs found in base_url, that is, its immediate children
    resp = requests.get(base_url)
    soup = BeautifulSoup(resp.content, "html.parser")
    CHILDREN_URLS = soup.find_all("a")

    #this will store the image urls
    img_url_list = []

    #not employed in this application
    #next_level_url_list = []

    #extracting all the images found in child_url
    for child_url in CHILDREN_URLS:
       
        try:
            url = child_url.get('href')
            
            #this request can fail due to MissingSchema (e.g., a relative href)
            child_resp = requests.get(url)
            
            #making a child soup! :D
            child_soup = BeautifulSoup(child_resp.content, "html.parser")      
            CHILD_IMAGES = child_soup.find_all("img")      

            for img in CHILD_IMAGES:    
                #making the URL absolute by joining the child url with img_url
                img_url = urljoin(url, img.attrs.get("src"))
                img_url_list.append(img_url)   
     
            #getting the next level of websites
            #NEXT_LEVEL_URLS = child_soup.find_all("a")
            #next_level_url_list.append(NEXT_LEVEL_URLS)
           
        except ValueError:
            pass          
       
    #returning only after every child has been visited
    #return (img_url_list, next_level_url_list)
    return (img_url_list)
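
The `MissingSchema` failure mentioned in this function typically comes from passing a relative `href` (for example `/about`) straight to `requests.get`. One way to avoid it, sketched here under the assumption that only `http` and `https` links are worth following, is to resolve each `href` against the parent page first. The helper name `resolve_child_urls` is hypothetical.

```python
from urllib.parse import urljoin, urlparse


def resolve_child_urls(base_url, hrefs):
    """Turn raw href values into absolute http(s) URLs, without duplicates."""
    resolved = []
    for href in hrefs:
        if not href:
            continue  # <a> tags without an href yield None
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, and similar links
        if absolute not in resolved:
            resolved.append(absolute)
    return resolved
```

In the function above, this could be fed with `[a.get('href') for a in CHILDREN_URLS]`, and the loop would then iterate over the returned list.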

In [9]:
#This function supports max_threads = 1 or max_threads = len(url_list).
#(Any max_threads > 1 is treated as len(url_list).)
#I'm using the word *thread*, but I'm actually using a ProcessPoolExecutor
#from the concurrent.futures library.
#A ThreadPoolExecutor is also available in concurrent.futures; however, I found
#ProcessPoolExecutor easier, and I had to make a quick choice.
#When max_threads = len(url_list), it submits each url to a different task.
#When max_threads = 1, it submits all urls to the same task.
#curl -X POST http://93373074171c.ngrok.io/2 -H "Content-Type: application/json"
#--data "[\"http://4chan.org/\", \"https://golang.org/\"]"
@app.route('/<int:max_threads>', methods=['POST'])
def crawl_image_urls_concurrent(max_threads):    
 
    url_list = request.json

    if max_threads > 1:
        n_tasks = len(url_list)
    else:
        n_tasks = 1 

    futures = [] 
    json_dumps = []
    
    with concurrent.futures.ProcessPoolExecutor(max_workers=n_tasks) as executor:
        
        #submitting each url to a different task        
        if n_tasks == len(url_list):
            t = 1
            for url in url_list:
                u = uuid.uuid4()            
                f = executor.submit(crawl_image_urls, [url], u, q)            
                futures.append(f)   
                pid_dict[u] = [f, None] # PID not known here      
                jd = json.dumps({"job_id": str(u), "task": str(t), "url": url})
                print(jd)
                json_dumps.append(jd)
                t = t + 1  

                try:
                    rcv_uuid, rcv_pid = q.get(block=True, timeout=1)
                    pid_dict[rcv_uuid] = [f, rcv_pid] # store PID
                except queue.Empty as e:
                    print('Queue is empty', e)         
        
        #submitting all urls at once (all to the same task)
        elif n_tasks == 1:
            u = uuid.uuid4()                                
            f = executor.submit(crawl_image_urls, url_list, u, q)            
            futures.append(f) 
            pid_dict[u] = [f, None] # PID not known here       
            jd = json.dumps({"job_id": str(u), "task": "1", "urls": url_list})
            print(jd)
            json_dumps.append(jd)
            try:
                rcv_uuid, rcv_pid = q.get(block=True, timeout=1)
                pid_dict[rcv_uuid] = [f, rcv_pid] # store PID
            except queue.Empty as e:
                print('Queue is empty', e)  

    return jsonify(json_dumps), 200
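
The branching in this function boils down to one rule: `max_threads` greater than 1 gives every URL its own task, and anything else puts all URLs into a single task. A small sketch of that rule as a pure function (the name `plan_tasks` is hypothetical) makes it easy to check in isolation, without starting any worker processes:

```python
import uuid


def plan_tasks(url_list, max_threads):
    """Return a list of (job_id, urls) pairs, one pair per task."""
    if max_threads > 1:
        # one task per URL
        batches = [[url] for url in url_list]
    else:
        # a single task that receives every URL
        batches = [list(url_list)]
    return [(str(uuid.uuid4()), batch) for batch in batches]
```

Each pair could then be handed to `executor.submit(crawl_image_urls, urls, job_id, q)`, which would collapse the two near-identical branches into one loop.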

In [10]:
#This function outputs the status of a task, given its job_id (uuid).
#Because of my previous choices, if max_threads = 1 was used to crawl
#multiple URLs, we are unable to see the crawling progress of each URL separately.
#The status is given for the whole process.
#curl -X GET http://93373074171c.ngrok.io/status/0d7fbd8d-2d19-401b-920d-859735c4499a
@app.route('/status/<uuid(strict=False):u>', methods=['GET'])
def get_job_status(u):    
    try: 
        #_uuid_ = uuid.UUID(u) 
        _uuid_ = u      
        [futures, pid] = pid_dict[_uuid_]
        if futures.running():
            return json.dumps({"job_id": str(_uuid_), "status": "inprogress"}) 
        elif futures.done():  
            return json.dumps({"job_id": str(_uuid_), "status": "completed"})
        else:
            return json.dumps({"job_id": str(_uuid_), "status": str(futures)})  
    except KeyError:
        #unknown job_id: answer with 404 instead of returning None
        print('Key not found')
        abort(404)
    except ValueError:
        print('UUID not found')
        abort(400)
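
The status check relies on the state methods of `concurrent.futures.Future`. A minimal sketch of the mapping follows, with a hypothetical `future_status` helper and an extra `pending` label that this notebook does not use (it returns `str(futures)` in that case):

```python
from concurrent.futures import Future


def future_status(future):
    """Map a Future's state to a short status string."""
    if future.running():
        return "inprogress"
    if future.done():
        # done() is also True when the task raised or was cancelled
        return "completed"
    return "pending"  # submitted but not yet picked up by a worker
```

Because `done()` is also true for a task that raised an exception, a caller who needs to tell success from failure would additionally have to inspect `future.exception()`.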

In [11]:
#Given a job_id (uuid), this function returns its corresponding crawled image URLs
#curl -X GET http://93373074171c.ngrok.io/result/0d7fbd8d-2d19-401b-920d-859735c4499a
@app.route('/result/<uuid(strict=False):u>', methods=['GET'])
def get_results(u): 
    try: 
        #_uuid_ = uuid.UUID(u) 
        _uuid_ = u      
        [futures, pid] = pid_dict[_uuid_]
        return json.dumps({"job_id": str(_uuid_), "result": futures.result()})        
    except KeyError:
        #unknown job_id: answer with 404 instead of returning None
        print('Key not found')
        abort(404)
    except ValueError:
        print('UUID not found')
        abort(400)

In [12]:
#Flask error handlers
@app.errorhandler(404)
def not_found(error):
    return make_response(jsonify({'error': 'Not found'}), 404)
@app.errorhandler(400)
def bad_request(error):
    return make_response(jsonify({'error': 'Bad Request'}), 400)
@app.errorhandler(500)
def internal_server_error(error):
    return make_response(jsonify({'error': 'Internal Server Error'}), 500)

In [13]:
if __name__ == '__main__':    
    app.run() 
#Then, open a shell window and run curl commands using the ngrok http tunnel that will appear below http://127.0.0.1
#Examples:
#Posting URLs using 1 single task:
#curl -X POST http://b4098711e066.ngrok.io/1 -H "Content-Type: application/json" -d "[\"http://4chan.org/\", \"https://golang.org/\"]"
#
#Posting URLs using 2 parallel tasks:
#curl -X POST http://b4098711e066.ngrok.io/2 -H "Content-Type: application/json" -d "[\"http://4chan.org/\", \"https://golang.org/\"]"
#
#Getting the status of a task:
#curl -X GET http://96ef6a5a14e0.ngrok.io/status/c426926b-64df-4417-8cb8-59f719c41ef1
#
#Getting the result of a task:
#curl -X GET http://96ef6a5a14e0.ngrok.io/result/c426926b-64df-4417-8cb8-59f719c41ef1

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)


 * Running on http://514c610f2191.ngrok.io
 * Traffic stats available on http://127.0.0.1:4040


127.0.0.1 - - [07/Oct/2020 16:11:09] "POST /2 HTTP/1.1" 400 -


{"job_id": "687b4488-63c8-4eca-ab14-6bcef301fbb8", "task": "1", "url": "http://4chan.org/"}
{"job_id": "c6660667-89d9-40c7-a807-e12a1c958757", "task": "2", "url": "https://golang.org/"}


127.0.0.1 - - [07/Oct/2020 16:13:17] "POST /2 HTTP/1.1" 200 -


{"job_id": "229995d9-d42f-495d-8ee4-f2e62442a53a", "task": "1", "urls": ["http://4chan.org/", "https://golang.org/"]}


127.0.0.1 - - [07/Oct/2020 16:13:27] "POST /1 HTTP/1.1" 200 -
127.0.0.1 - - [07/Oct/2020 16:14:28] "GET /status/229995d9-d42f-495d-8ee4-f2e62442a53a HTTP/1.1" 200 -
127.0.0.1 - - [07/Oct/2020 16:14:35] "GET /result/229995d9-d42f-495d-8ee4-f2e62442a53a HTTP/1.1" 200 -
