## Callysto Course Link Checker 

**Description**: This notebook walks through the files in a directory tree and validates the links contained in each Jupyter notebook it finds. It was written for Callysto internal use.

**Usage**: Run this notebook from the directory that contains the notebooks to check, or from any parent of it. The search starts in the current directory and includes all subdirectories.

**Notes**: This notebook takes time to run; you will know it is done when it prints its termination statement (`.. CHECK COMPLETE`). It reports only broken links, not working ones. It can only handle conventional URLs, such as those starting with https:// or www.

Last Edited: June 16, 2020

Author: LNC

Contact: lisa.cao@cybera.ca

In [None]:
# run only once if needed
# !pip3 install urllib3

In [None]:
import os
import json
import re
import urllib3

In [None]:
## function to parse urls (from geeksforgeeks)
def url_parse(string): 
    regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    url = re.findall(regex,string)       
    return [x[0] for x in url] 
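
The `x[0]` in `url_parse` is needed because the pattern contains capturing groups, so `re.findall` returns one tuple per match rather than one string. The sketch below shows that behaviour with a much simpler pattern; `demo_regex` and the sample text are hypothetical stand-ins for illustration, not the regex used above.

```python
import re

# Hypothetical, simplified pattern for illustration only (not the regex above):
# group 1 captures the whole URL, group 2 captures only the "http(s)://" or "www." prefix.
demo_regex = r"((https?://|www\.)[^\s()<>]+)"

text = "See https://callysto.ca and www.cybera.ca for details"

# With more than one capturing group, findall returns a list of tuples,
# one tuple per match and one element per group.
matches = re.findall(demo_regex, text)
print(matches)  # [('https://callysto.ca', 'https://'), ('www.cybera.ca', 'www.')]

# Keeping element 0 of each tuple recovers the full URL, as url_parse does.
urls = [m[0] for m in matches]
print(urls)     # ['https://callysto.ca', 'www.cybera.ca']
```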

In [None]:
## search through all directories and parse cells
def url_check():
    # one connection pool and one list of accepted statuses are shared by every request
    http = urllib3.PoolManager()
    valid_status = [200, 301, 302]
    for root, dirs, files in os.walk("."):
        for filename in files:
            if filename.endswith('.ipynb'):
                notebook_name = filename[:-6]
                file = os.path.join(root, filename)
                # close the file as soon as the JSON has been read
                with open(file, encoding="utf-8") as f:
                    notebook = json.load(f)
                cell_number = 0
                for cell in notebook['cells']:
                    cell_number += 1
                    # 'source' is normally a list of lines; join them so the whole cell is scanned
                    cell_contents = "".join(cell['source'])
                    cell_urls = url_parse(cell_contents)
                    for url in cell_urls:
                        # the request sits inside the loop so that every URL in the cell is checked
                        try:
                            req = http.request('GET', url, timeout = 5.0, retries = False)
                            if req.status not in valid_status:
                                print("BROKEN URL in",notebook_name, ": Cell", cell_number, url, "\n    reason:", req.status, "\n")
                        except Exception as e:
                            print("BROKEN URL in",notebook_name, ": Cell", cell_number, url, "\n    reason:", e, "\n")
    print(".. CHECK COMPLETE")
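
Jupyter keeps autosaved copies of notebooks in `.ipynb_checkpoints` folders, and `os.walk(".")` visits those folders too, so a broken link may be reported for the checkpoint copy as well as for the notebook itself. If that is not wanted, one possible approach is to prune those folders during the walk. The helper name `find_notebooks` below is hypothetical and not part of the original notebook; this is a minimal sketch, not a drop-in replacement for `url_check`.

```python
import os

def find_notebooks(start_dir="."):
    """Yield paths of .ipynb files under start_dir, skipping checkpoint folders."""
    for root, dirs, files in os.walk(start_dir):
        # Editing dirs in place tells os.walk not to descend into these folders.
        dirs[:] = [d for d in dirs if d != ".ipynb_checkpoints"]
        for filename in files:
            if filename.endswith(".ipynb"):
                yield os.path.join(root, filename)
```

Calling `list(find_notebooks("."))` from the directory being checked would then list only the working notebooks, which could be fed to the same cell-by-cell link check.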

In [166]:
url_check() 

BROKEN URL in link butcher-checkpoint : Cell 1 https://docs.python.org/3orial/errors.html 
    reason: 404 

BROKEN URL in link butcher-checkpoint : Cell 2 https://ww.oogle.com/search?client=firefox-b-d&ei=a0jpXqDSBs-S0PEPr4i4yA4&q=does+try+block+require+except+python&oq=does+try+block+require+ex&gs_lcp=CgZwc3ktYWIQAxgAMgUIIRCgATIFCCEQoAEyBQghEKABOgQIABBHOgUIABCRAjoFCAAQgwE6BQgAELEDOgIIADoECAAQQzoECAAQCjoGCAAQFhAeOggIABAWEAoQHjoICAAQCBANEB46CAghEBYQHRAeOgQIIRAKUJQ8WLxhYOttaAJwA3gAgAF9iAHfEZIBBDI1LjKYAQCgAQGqAQdnd3Mtd2l6&sclient=psy-ab 
    reason: (<urllib3.connection.VerifiedHTTPSConnection object at 0x7f945f3a3c50>, 'Connection to ww.oogle.com timed out. (connect timeout=5.0)') 

BROKEN URL in link butcher-checkpoint : Cell 3 https://www.google&cd=&cad=rja&uact=8&ved=2ahUKEwjD15vxsYfqAhX9GDQIHeY2AXQQFjABegQIChAD&url=https%3A%2F%2Fwww.w3schools.com%2Fpython%2Fpython_try_except.asp&usg=AOvVaw0qFy-Jqf3Q5fKVHY-6plJr 
    reason: encoding with 'idna' codec failed (UnicodeError: label empt