
Add health check for urls #148

Merged

merged 1 commit into mattmakai:master from huangsam:feature/check-url-health on Jan 13, 2018

@huangsam
Contributor

huangsam commented Jan 11, 2018

Here is the output from the check_urls.py script: urlout.txt

This solution uses ThreadPoolExecutor to resolve the inherent I/O bottleneck of URL requests. It also uses a fairly comprehensive regular expression for matching URLs; the pattern can be tweaked in the future if needed.
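
A minimal sketch of that approach, assuming requests for the HTTP calls (get_url_status is named in the commit message below; the other names and parameters are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def get_url_status(url):
    """Return the HTTP status code for a URL, or -1 when the request fails."""
    try:
        return requests.get(url, timeout=10).status_code
    except requests.RequestException:
        return -1

def check_urls(urls, workers=8):
    """Fan the checks out across a thread pool so the I/O waits overlap."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return dict(zip(urls, executor.map(get_url_status, urls)))
```

Threads suit this workload because the time is spent waiting on the network, not holding the GIL.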

@mattmakai


Owner

mattmakai commented Jan 11, 2018

thanks @huangsam, this is super useful! Looks like there may be a bug, though, because URLs that end in .html do not resolve correctly in this program. Any ideas there?

For example, on the Flask page it says "http://blog.startifact.com/posts/older/what-is-pythonic" is a 404 but the actual URL is "http://blog.startifact.com/posts/older/what-is-pythonic.html" which resolves fine.

@huangsam


Contributor

huangsam commented Jan 12, 2018

Thanks for pointing out an example case. The regular expression is good at detecting URLs, but it's not perfect at capturing them in their entirety. Separate parsing for Markdown and HTML might be necessary to capture the URLs fully. As for the core logic of verifying a URL, that works just fine.
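
For the HTML side, a sketch of that kind of parsing might use the standard library's html.parser (the class name here is illustrative, not from the actual script):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect absolute hrefs from anchor tags instead of regex-scanning raw text."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.startswith('http'):
                    self.links.append(value)

extractor = LinkExtractor()
extractor.feed('<a href="http://blog.startifact.com/posts/older/what-is-pythonic.html">post</a>')
print(extractor.links)
# ['http://blog.startifact.com/posts/older/what-is-pythonic.html']
```

A structural parser sidesteps the boundary problem entirely, since the href attribute delimits the URL for us.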

```python
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Configurable variables
URL_MATCH = 'https?:\/\/[a-zA-Z0-9\.\-]+(html|\/)[=a-zA-Z0-9\_\/\?\&\-]+'
```


@huangsam

huangsam Jan 12, 2018

Contributor

The last portion of the regex, `[=a-zA-Z0-9\_\/\?\&\-]+`, should be `[=a-zA-Z0-9\_\/\?\&\.\-]+`: it was missing the `.`, so it cut off URLs that end with .html.
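
A quick standalone check shows the corrected pattern capturing the full URL from the report above:

```python
import re

# Corrected pattern: the final character class now includes the dot.
URL_MATCH = r'https?:\/\/[a-zA-Z0-9\.\-]+(html|\/)[=a-zA-Z0-9\_\/\?\&\.\-]+'

text = 'the actual URL is "http://blog.startifact.com/posts/older/what-is-pythonic.html"'
print(re.search(URL_MATCH, text).group(0))
# http://blog.startifact.com/posts/older/what-is-pythonic.html
```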

@huangsam


Contributor

huangsam commented Jan 12, 2018

The output has been reduced significantly, down to the following:

```
http://intermediatepythonista.com/python-comprehensions: -1
http://intermediatepythonista.com/python-generators: -1
http://learntocodewith.me/: -1
http://learntocodewith.me/getting-started/: -1
http://testdriven.io/part-five-intro/: 404
http://packetbeat.com/: -1
http://w3techs.com/technologies/details/ws-cherrypy/all/all: -1
https://c6c6d4e8.ngrok.io: 404
https://wiki.jenkins-ci.org/display/JENKINS/Securing: 404
```
Add health check for urls
- Add url collection algorithm
- Optimize regex + config for clarity
- Handle exceptions in get_url_status

@huangsam force-pushed the huangsam:feature/check-url-health branch from 0cdd819 to b56bb5e on Jan 12, 2018

@huangsam


Contributor

huangsam commented Jan 13, 2018

Timeout errors are now showing up as 504 instead of -1 in urlout.txt. Let me know if there's anything else that needs to be done to get this merged in.
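
Presumably the revised handling looks something like this sketch (the timeout value and exact structure are assumptions):

```python
import requests

def get_url_status(url):
    """As before, but a timed-out request now reports 504 instead of -1."""
    try:
        return requests.get(url, timeout=10).status_code
    except requests.exceptions.Timeout:
        # Report client-side timeouts like a gateway timeout.
        return 504
    except requests.RequestException:
        return -1
```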

@mattmakai


Owner

mattmakai commented Jan 13, 2018

I'm happy to merge this now because it's super helpful. I guess the only other bit is that it's picking up non-URLs like "https://c6c6d4e8.ngrok.io", which are embedded in the code but don't actually link to live sites. It's not a huge deal, but filtering those out would be a nice improvement if you want to keep refining the script.

@mattmakai merged commit 24737bc into mattmakai:master on Jan 13, 2018

mattmakai added a commit that referenced this pull request Jan 13, 2018

@mattmakai


Owner

mattmakai commented Jan 13, 2018

Updated the change log with a shout-out for the new health check script. Thanks again @huangsam!

@huangsam deleted the huangsam:feature/check-url-health branch on Jan 13, 2018

@huangsam


Contributor

huangsam commented Jan 13, 2018

Thanks for the reference @mattmakai!

I do understand that non-URLs are being picked up. That's not a fault of the regex itself, but rather of the context of the content surrounding the URLs. As a workaround, I created this line to ignore some of the obvious ones; a sketch of the idea is below.

To provide an "authentic" solution, I imagine the one-line Bash command I invoke at the start of main won't be sufficient to cover this case. I'd be open to suggestions on how to proceed.
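
For illustration, the ignore workaround could be as simple as a substring blocklist; the pattern below is an example rather than the script's actual list:

```python
# Hostnames that appear inside code samples rather than as real links.
IGNORED_PATTERNS = ('ngrok.io',)

def is_ignored(url):
    return any(pattern in url for pattern in IGNORED_PATTERNS)

urls = ['https://c6c6d4e8.ngrok.io', 'http://testdriven.io/part-five-intro/']
print([url for url in urls if not is_ignored(url)])
# ['http://testdriven.io/part-five-intro/']
```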
