Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds a tool to extract resources from freeze-dry'd html files #109

Merged
merged 12 commits into from
Jul 30, 2019

Conversation

danielhertenstein
Copy link
Collaborator

@danielhertenstein danielhertenstein commented Jul 12, 2019

This PR adds a tool called fathom-extract which will extract all base64 strings out of a page saved by freezeDry, save them in a new directory, and point the html to those new files.

This tool will allow us to save our corpora in git (and git LFS).

There is an option, on by default, to preserve the original files which I added because I'm still a little scared =P

cli/fathom_web/extract.py Outdated Show resolved Hide resolved
@danielhertenstein
Copy link
Collaborator Author

For reference, so it can be linked back to the issue, this addresses #102

@danielhertenstein danielhertenstein changed the title [DO NOT MERGE] Adds a tool to extract resources from freeze-dry'd html files Adds a tool to extract resources from freeze-dry'd html files Jul 19, 2019
# Conflicts:
#	cli/setup.py
cli/setup.py Outdated Show resolved Hide resolved
@danielhertenstein
Copy link
Collaborator Author

Okay, as far as I can tell, this does everything it needs to do now. Note that you may want to use #117 to serve the pages so you can see that they load correctly.

Copy link
Contributor

@erikrose erikrose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this runs so fast and actually didn't crash on my login-forms corpus. Crazypants! I jotted down a few things, but the most important one is the observation on the location of the resources dir.

Good work!

cli/fathom_web/extract.py Outdated Show resolved Hide resolved
from click import argument, command, option, Path


BASE64_DATA_PATTERN = re.compile(r'(data:(?P<mime>[a-zA-Z0-9]+?/[a-zA-Z0-9\-.+]+?);(\s?charset=utf-8;)?base64,(?P<string>[a-zA-Z0-9+/=]+))')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing some of the non-greedy quantifiers would make this more efficient, which might even be noticeable at the scales we're talking about.

For instance,
?P<mime>[a-zA-Z0-9]+?/ could become ?P<mime>[a-zA-Z0-9]+/

+? makes the regex engine return as few repeats as possible while still letting the overall pattern match. That means, if it encounters "foo", it first tries "f/", then "fo/", then "foo/" before proceeding. Getting rid of the question mark saves compares.

Granted, I started this comment while I was still expecting to find a non-greedy quantifier on the actual base64 string. :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tested it. Boy, it's really fast even as you have it!



BASE64_DATA_PATTERN = re.compile(r'(data:(?P<mime>[a-zA-Z0-9]+?/[a-zA-Z0-9\-.+]+?);(\s?charset=utf-8;)?base64,(?P<string>[a-zA-Z0-9+/=]+))')
BASE_TAG_PATTERN = re.compile(r'<base.*?>')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<base [^>]*, for similar reasons, should be faster without changing the meaning too much. Too bad it'll still mess up on <base foo=">" goo="bah">. This really needs a parser. Does it matter for us in practice? Are we dealing only with freeze-dry-added tags?

cli/fathom_web/extract.py Outdated Show resolved Hide resolved
cli/setup.py Outdated Show resolved Hide resolved
cli/fathom_web/extract.py Outdated Show resolved Hide resolved
cli/fathom_web/extract.py Outdated Show resolved Hide resolved
cli/fathom_web/extract.py Outdated Show resolved Hide resolved
cli/fathom_web/extract.py Outdated Show resolved Hide resolved
cli/fathom_web/extract.py Outdated Show resolved Hide resolved
@erikrose
Copy link
Contributor

erikrose commented Jul 26, 2019 via email

@erikrose
Copy link
Contributor

erikrose commented Jul 26, 2019 via email

@danielhertenstein danielhertenstein requested review from erikrose and removed request for biancadanforth July 29, 2019 20:49
# Conflicts:
#	cli/setup.py
Copy link
Contributor

@erikrose erikrose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good except for what I think was a pathname miscommunication.


html = extract_base64_data_from_html_page(file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better factoring. Yay for fewer, shallower side effects! :-)

@@ -83,7 +83,7 @@ def extract_base64_data_from_html_page(file: pathlib.Path):
html = fp.read()

# Make the subresources directory
subresources_directory = file.parent / 'resources' / f'{file.stem}_resources'
subresources_directory = file.parent / f'{file.stem}_resources'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think we had a miscommunication. I thought the outcome of yesterday's conversation was that we'd keep the samples/negatives/resources/45/2.png-like layout and just remove the "_resources" suffix.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed we did. I think this stems out of how we think fathom-pick will work now. Will fathom-pick create a resources directory in your destination directory and move the page-specific resource directories for the randomly picked items into that new resources directory?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. And if the dir is already there, it'll just add stuff to it. But if the sample-specific subdir is already there, it should error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. I'm on it!

@erikrose
Copy link
Contributor

erikrose commented Jul 30, 2019 via email

Copy link
Contributor

@erikrose erikrose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix the backticks and land the sucker! Woooo! :-D

For example, the resources for `example.html` would be stored in
`example_resources/`. This tool is used to prepare your samples for a
git-LFS enabled repository.
`resources/example_resources/`. This tool is used to prepare your
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out you need 2 backticks on either side in ReStructured Text!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes. Thank you for your great attention to detail!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants