Improve url detection #166

Merged
merged 1 commit into from May 26, 2018

2 participants
@huangsam
Contributor

huangsam commented May 26, 2018

The previous implementation used Linux regular expressions. This was sufficient as an MVP, but it was not accurate enough for corner cases. After some more research, it seemed that an HTML parsing library would be more efficient for this purpose. As such, the shell command has been scrapped in favor of a more elaborate approach for detecting URLs.

  • Implement skeleton for extract_urls
  • Detect html and markdown files
  • Use bs4 for parsing html
  • Convert markdown for bs4 parsing
  • Remove use of urlin.txt and urlout.txt
  • Remove unnecessary global vars
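
As a rough illustration of the approach listed above (not the code merged in this PR), the sketch below detects HTML and Markdown files by extension, converts Markdown to HTML, and collects absolute links with bs4. The helper `urls_from_html`, the suffix sets, and the body of `extract_urls` are assumptions made for this example; only the name `extract_urls` and the use of bs4 come from the description.

```python
from pathlib import Path

import markdown
from bs4 import BeautifulSoup

# Assumed extension sets; the real script may detect file types differently.
HTML_SUFFIXES = {".html", ".htm"}
MARKDOWN_SUFFIXES = {".md", ".markdown"}


def urls_from_html(html):
    """Return the set of absolute http(s) links found in <a href=...> tags."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        anchor["href"]
        for anchor in soup.find_all("a", href=True)
        if anchor["href"].startswith(("http://", "https://"))
    }


def extract_urls(path):
    """Extract URLs from one HTML or Markdown file; skip anything else."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in HTML_SUFFIXES | MARKDOWN_SUFFIXES:
        return set()  # e.g. PDFs or images are ignored without being read
    text = path.read_text(encoding="utf-8")
    if suffix in MARKDOWN_SUFFIXES:
        # Render Markdown to HTML first so one parser handles both formats.
        text = markdown.markdown(text)
    return urls_from_html(text)
```

Because links are read from parsed anchor tags rather than matched as raw text, a URL that only appears inside a Markdown code span is not reported, which is the kind of corner case a plain regular expression tends to get wrong.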
@huangsam

Contributor

huangsam commented May 26, 2018

Preview of script output:

Extract urls...
Currently checking: file=full-stack-python-map.pdf                                  
Check urls...
Currently checking: id=2409 host=joaoventura.net                              
Bad urls: {
    "http://www.machinalis.com/blog/jwt-django-channels/": 404,
    "https://www.continuum.io/blog/developer-blog/using-bokeh-nist": 404,
    "http://erik.io/blog/2013/06/08/a-basic-guide-to-when-and-how-to-deploy-https/": 404,
    "http://flask.pocoo.org/docs/0.10/patterns/sqlite3/": 404,
    "http://articles.slicehost.com/nginx": 504,
    "https://github.com/fullstackpython/blog-code-examples/monitor-aws-lambda-python-3-6": 404,
    "http://flask.pocoo.org/docs/0.10/blueprints/": 404,
    "http://flask.pocoo.org/docs/0.10/tutorial/introduction/": 404,
    "http://www.machinalis.com/blog/offloading-work-using-django-channels/": 404,
    "https://storify.com/samnewman/in-which-i-discuss-monorepos": 410,
    "http://blog.yjl.im/2016/01/pymux-tmux-clone-in-python.html": -1,
    "http://blog.yjl.im/2016/01/pymux-and-tmux-performance-comparison.html": -1,
    "https://russ.garrett.co.uk/talks/postgres-gds/": 404,
    "http://flask.pocoo.org/docs/0.10/deploying/wsgi-standalone/": 404,
    "http://www.machinalis.com/blog/full-text-search-on-django-with-database-back-ends/": 404,
    "http://manpages.ubuntu.com/manpages/zesty/man1/ssh-agent.1.html": 404,
    "https://github.com/mapbox/mapboxgl-jupyter/blob/master/docs-markdown/viz.md": 404,
    "http://blog.ashnab.com/task-queues-and-python-rq/": 504,
    "http://www.machinalis.com/blog/pandas-django-rest-framework-bokeh/": 404
}
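
A mapping like the one above could be produced along the following lines. This is a sketch using only the standard library, not the script's actual checker: the function names `fetch_status` and `find_bad_urls` are hypothetical, and treating `-1` as "no HTTP response was received" is an assumption inferred from the output.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or -1 if no response arrived."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered with an error status such as 404 or 504.
        return error.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        # Assumption: this is what -1 stands for in the output above.
        return -1


def find_bad_urls(urls, fetch=fetch_status):
    """Map every URL whose status is not 200 to the status that was seen."""
    bad = {}
    for url in urls:
        status = fetch(url)
        if status != 200:
            bad[url] = status
    return bad
```

Passing the fetcher as a parameter keeps the reporting logic testable without network access; some servers reject `HEAD` requests, so a real checker may need to fall back to `GET`.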

@huangsam huangsam referenced this pull request May 26, 2018

Merged

Fix bad urls #167

@mattmakai

Owner

mattmakai commented May 26, 2018

Wow, this looks like a huge improvement. Thanks again for doing all this work @huangsam.

@mattmakai mattmakai merged commit a5274df into mattmakai:master May 26, 2018

@huangsam huangsam deleted the huangsam:bugfix/url-discovery branch May 26, 2018

@huangsam

Contributor

huangsam commented May 26, 2018

My pleasure @mattmakai. It was great meeting you in person at PyCon 2018. I recall that you sent me an invitation to share my creation via http://twiliovoices.com - is it still possible?

@mattmakai

Owner

mattmakai commented May 27, 2018

Yes! Go ahead and submit the form that's linked to from that website. I'll respond back via my Twilio email next week. Have a great weekend.
