Improve url detection #166
Merged
The previous implementation used Linux regular expressions. This was sufficient as an MVP, but it was not accurate enough for corner cases. After some more research, it seemed that an HTML parsing library would be more efficient for this purpose. As such, the shell command has been scrapped in favor of a more elaborate approach for detecting URLs.

- Implement skeleton for extract_urls
- Detect HTML and Markdown files
- Use bs4 for parsing HTML
- Convert Markdown for bs4 parsing
- Remove use of urlin.txt and urlout.txt
- Remove unnecessary global vars
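For readers who want a feel for the parser-based approach described above, here is a minimal sketch. It is not the code from this PR: it swaps in the standard library's `html.parser` for bs4 so that it runs without third-party packages, and the signature of `extract_urls`, the helper `extract_urls_from_html`, the extension sets, and the `md_to_html` parameter are all assumptions made for illustration. Markdown is handled by passing in a converter (for example, one backed by a Markdown library) and then parsing the resulting HTML.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, List, Optional

# Hypothetical extension sets; the PR's actual detection rules are not shown.
HTML_EXTENSIONS = {".html", ".htm"}
MARKDOWN_EXTENSIONS = {".md", ".markdown"}


class _LinkCollector(HTMLParser):
    """Collects absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.urls.append(value)


def extract_urls_from_html(html: str) -> List[str]:
    """Parse HTML and return the link targets in document order."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


def extract_urls(
    filename: str,
    text: str,
    md_to_html: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Dispatch on file extension (assumed signature, for illustration only)."""
    suffix = Path(filename).suffix.lower()
    if suffix in HTML_EXTENSIONS:
        return extract_urls_from_html(text)
    if suffix in MARKDOWN_EXTENSIONS:
        if md_to_html is None:
            raise ValueError("a Markdown-to-HTML converter is required")
        # Convert Markdown to HTML first, then reuse the HTML path.
        return extract_urls_from_html(md_to_html(text))
    return []  # other file types are skipped
```

Compared with a shell regex, a parser only reports URLs that actually appear as link targets, so relative anchors such as `#top` and URL-like text inside other attributes are ignored.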
Preview of script output:
Wow, this looks like a huge improvement. Thanks again for doing all this work @huangsam.
My pleasure @mattmakai. It was great meeting you in person at PyCon 2018. I recall that you sent me an invitation to share my creation via http://twiliovoices.com - is it still possible?
Yes! Go ahead and submit the form that's linked to from that website. I'll respond back via my Twilio email next week. Have a great weekend.