-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds a tool to extract resources from freeze-dry'd html files #109
Conversation
For reference, so it can be linked back to the issue, this addresses #102 |
# Conflicts: # cli/setup.py
Okay, as far as I can tell, this does everything it needs to do now. Note that you may want to use #117 to serve the pages so you can see that they load correctly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this runs so fast and actually didn't crash on my login-forms corpus. Crazypants! I jotted down a few things, but the most important one is the observation on the location of the resources dir.
Good work!
cli/fathom_web/extract.py
Outdated
from click import argument, command, option, Path | ||
|
||
|
||
BASE64_DATA_PATTERN = re.compile(r'(data:(?P<mime>[a-zA-Z0-9]+?/[a-zA-Z0-9\-.+]+?);(\s?charset=utf-8;)?base64,(?P<string>[a-zA-Z0-9+/=]+))') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing some of the non-greedy quantifiers would make this more efficient, which might even be noticeable at the scales we're talking about.
For instance,
?P<mime>[a-zA-Z0-9]+?/
could become ?P<mime>[a-zA-Z0-9]+/
+?
makes the regex engine return as few repeats as possible while still letting the overall pattern match. That means, if it encounters "foo", it first tries "f/", then "fo/", then "foo/" before proceeding. Getting rid of the question mark saves compares.
Granted, I started this comment while I was still expecting to find a non-greedy quantifier on the actual base64 string. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just tested it. Boy, it's really fast even as you have it!
cli/fathom_web/extract.py
Outdated
|
||
|
||
BASE64_DATA_PATTERN = re.compile(r'(data:(?P<mime>[a-zA-Z0-9]+?/[a-zA-Z0-9\-.+]+?);(\s?charset=utf-8;)?base64,(?P<string>[a-zA-Z0-9+/=]+))') | ||
BASE_TAG_PATTERN = re.compile(r'<base.*?>') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<base [^>]*
, for similar reasons, should be faster without changing the meaning too much. Too bad it'll still mess up on <base foo=">" goo="bah">
. This really needs a parser. Does it matter for us in practice? Are we dealing only with freeze-dry-added tags?
Yes, I had to change my default, too. :-)
|
Couldn't figure out for 5 minutes why it wasn't working! :-)
|
# Conflicts: # cli/setup.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good except for what I think was a pathname miscommunication.
|
||
html = extract_base64_data_from_html_page(file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better factoring. Yay for fewer, shallower side effects! :-)
cli/fathom_web/extract.py
Outdated
@@ -83,7 +83,7 @@ def extract_base64_data_from_html_page(file: pathlib.Path): | |||
html = fp.read() | |||
|
|||
# Make the subresources directory | |||
subresources_directory = file.parent / 'resources' / f'{file.stem}_resources' | |||
subresources_directory = file.parent / f'{file.stem}_resources' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I think we had a miscommunication. I thought the outcome of yesterday's conversation was that we'd keep the samples/negatives/resources/45/2.png
-like layout and just remove the "_resources" suffix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed we did. I think this stems out of how we think fathom-pick
will work now. Will fathom-pick
create a resources
directory in your destination directory and move the page-specific resource directories for the randomly picked items into that new resources
directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. And if the dir is already there, it'll just add stuff to it. But if the sample-specific subdir is already there, it should error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. I'm on it!
Thanks, Daniel! :-D
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the backticks and land the sucker! Woooo! :-D
cli/fathom_web/extract.py
Outdated
For example, the resources for `example.html` would be stored in | ||
`example_resources/`. This tool is used to prepare your samples for a | ||
git-LFS enabled repository. | ||
`resources/example_resources/`. This tool is used to prepare your |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out you need 2 backticks on either side in ReStructured Text!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes. Thank you for your great attention to detail!
This PR adds a tool called
fathom-extract
which will extract all base64 strings out of a page saved by freezeDry, save them in a new directory, and point the html to those new files.This tool will allow us to save our corpora in git (and git LFS).
There is an option, on by default, to preserve the original files which I added because I'm still a little scared =P