Adds a tool to extract resources from freeze-dry'd html files #109

danielhertenstein · 2019-07-12T20:33:16Z

This PR adds a tool called fathom-extract, which extracts all base64-encoded data URLs from a page saved by freezeDry, saves them as separate files in a new directory, and points the HTML at those new files.
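To make the description concrete, here is a minimal sketch of the extract-and-rewrite step. It reuses the pattern quoted later in this review, but the function name `extract_data_urls`, the numbered file names, and the decision to decode the payload before writing it are assumptions for illustration, not the PR's actual implementation.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Pattern as quoted in the review of cli/fathom_web/extract.py.
BASE64_DATA_PATTERN = re.compile(
    r'(data:(?P<mime>[a-zA-Z0-9]+?/[a-zA-Z0-9\-.+]+?);'
    r'(\s?charset=utf-8;)?base64,(?P<string>[a-zA-Z0-9+/=]+))'
)


def extract_data_urls(html: str, resources_dir: Path) -> str:
    """Write each base64 data URL to its own file and return rewritten HTML.

    Hypothetical helper: naming files 0, 1, 2, ... is an assumption.
    """
    resources_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def save(match):
        nonlocal counter
        # Fall back to a generic extension for MIME types Python does not know.
        extension = mimetypes.guess_extension(match.group('mime')) or '.bin'
        target = resources_dir / f'{counter}{extension}'
        target.write_bytes(base64.b64decode(match.group('string')))
        counter += 1
        # Point the HTML at the new file, relative to the page.
        return f'{resources_dir.name}/{target.name}'

    return BASE64_DATA_PATTERN.sub(save, html)
```

Passing a callable to `sub` keeps the file writing and the HTML rewriting in a single pass over the page.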

This tool will allow us to save our corpora in git (and git LFS).

There is an option, on by default, to preserve the original files, which I added because I'm still a little scared =P

danielhertenstein · 2019-07-17T19:27:12Z

For reference, so it can be linked back to the issue, this addresses #102

danielhertenstein · 2019-07-19T14:53:28Z

Okay, as far as I can tell, this does everything it needs to do now. Note that you may want to use the fathom-serve tool from #117 to serve the pages so you can see that they load correctly.

erikrose

Wow, this runs so fast and actually didn't crash on my login-forms corpus. Crazypants! I jotted down a few things, but the most important one is the observation on the location of the resources dir.

Good work!

erikrose · 2019-07-19T17:51:20Z

cli/fathom_web/extract.py

+from click import argument, command, option, Path
+
+
+BASE64_DATA_PATTERN = re.compile(r'(data:(?P<mime>[a-zA-Z0-9]+?/[a-zA-Z0-9\-.+]+?);(\s?charset=utf-8;)?base64,(?P<string>[a-zA-Z0-9+/=]+))')


Removing some of the non-greedy quantifiers would make this more efficient, which might even be noticeable at the scales we're talking about.

For instance,
?P<mime>[a-zA-Z0-9]+?/ could become ?P<mime>[a-zA-Z0-9]+/

+? makes the regex engine match as few repeats as possible while still letting the overall pattern match. That means, if it encounters "foo/", it first tries "f/", then "fo/", then "foo/" before proceeding. Because "/" is not in the character class, the greedy version ends up matching exactly the same text, so getting rid of the question mark just saves comparisons.
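A small sketch of the point about quantifiers, using only the mime-type fragment of the pattern. The names `LAZY`, `GREEDY`, and `sample` are made up for this example.

```python
import re

# The fragment as written in the PR (lazy) and as suggested (greedy).
LAZY = re.compile(r'(?P<mime>[a-zA-Z0-9]+?)/')
GREEDY = re.compile(r'(?P<mime>[a-zA-Z0-9]+)/')

sample = 'data:image/png;base64,AAAA'

lazy_match = LAZY.search(sample)
greedy_match = GREEDY.search(sample)

# Both must consume the whole run of alphanumerics before the slash,
# so they capture the same text; only the number of attempts differs.
same_result = lazy_match.group('mime') == greedy_match.group('mime')
```

The lazy form grows the match one character at a time and retries the slash after each step, while the greedy form takes the whole run first and finds the slash immediately.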

Granted, I started this comment while I was still expecting to find a non-greedy quantifier on the actual base64 string. :-)

Just tested it. Boy, it's really fast even as you have it!

erikrose · 2019-07-19T17:53:33Z

cli/fathom_web/extract.py

+
+
+BASE64_DATA_PATTERN = re.compile(r'(data:(?P<mime>[a-zA-Z0-9]+?/[a-zA-Z0-9\-.+]+?);(\s?charset=utf-8;)?base64,(?P<string>[a-zA-Z0-9+/=]+))')
+BASE_TAG_PATTERN = re.compile(r'<base.*?>')


<base [^>]*>, for similar reasons, should be faster without changing the meaning too much. Too bad it'll still mess up on <base foo=">" goo="bah">. This really needs a parser. Does it matter for us in practice? Are we dealing only with freeze-dry-added tags?
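The failure case can be reproduced in a few lines. The sketch below also shows one possible parser-based alternative built on the standard library's `html.parser`; that alternative is an assumption for illustration and is not what the PR does. `BaseTagFinder` and `strip_base_tags` are hypothetical names.

```python
import re
from html.parser import HTMLParser

LAZY_BASE = re.compile(r'<base.*?>')       # pattern from the PR
NEGATED_BASE = re.compile(r'<base [^>]*>')  # suggested alternative

tricky = '<base foo=">" goo="bah">'

# Both regexes stop at the ">" inside the quoted attribute value.
lazy_hit = LAZY_BASE.search(tricky).group()
negated_hit = NEGATED_BASE.search(tricky).group()


class BaseTagFinder(HTMLParser):
    """Collect the raw source text of every <base> start tag."""

    def __init__(self):
        super().__init__()
        self.base_tags = []

    def handle_starttag(self, tag, attrs):
        if tag == 'base':
            self.base_tags.append(self.get_starttag_text())


def strip_base_tags(html: str) -> str:
    """Remove <base> tags found by the parser (simplistic: uses str.replace)."""
    finder = BaseTagFinder()
    finder.feed(html)
    finder.close()
    for raw in finder.base_tags:
        html = html.replace(raw, '')
    return html
```

If the only `<base>` tags in a corpus are the ones freeze-dry adds, the regex is probably enough; the parser route only matters for attribute values that contain `>`.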

erikrose · 2019-07-26T19:54:13Z

Yes, I had to change my default, too. :-)

erikrose · 2019-07-26T20:04:11Z

Couldn't figure out for 5 minutes why it wasn't working! :-)

erikrose

All looks good except for what I think was a pathname miscommunication.

erikrose · 2019-07-30T17:07:54Z

cli/fathom_web/extract.py


+        html = extract_base64_data_from_html_page(file)


Much better factoring. Yay for fewer, shallower side effects! :-)

erikrose · 2019-07-30T17:10:12Z

cli/fathom_web/extract.py

@@ -83,7 +83,7 @@ def extract_base64_data_from_html_page(file: pathlib.Path):
        html = fp.read()

    # Make the subresources directory
-    subresources_directory = file.parent / 'resources' / f'{file.stem}_resources'
+    subresources_directory = file.parent / f'{file.stem}_resources'


Oh, I think we had a miscommunication. I thought the outcome of yesterday's conversation was that we'd keep the samples/negatives/resources/45/2.png-like layout and just remove the "_resources" suffix.

danielhertenstein · 2019-07-30

Indeed we did. I think this stems from how we think fathom-pick will work now. Will fathom-pick create a resources directory in your destination directory and move the page-specific resource directories for the randomly picked items into that new resources directory?

erikrose · 2019-07-30

Yep. And if the dir is already there, it'll just add stuff to it. But if the sample-specific subdir is already there, it should error.

danielhertenstein · 2019-07-30

Gotcha. I'm on it!
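The directory behavior agreed on in this thread could look roughly like the following. This is a sketch only: `make_sample_resources_dir` is a hypothetical helper, and fathom-pick's real code may differ.

```python
from pathlib import Path


def make_sample_resources_dir(destination: Path, sample_name: str) -> Path:
    """Create destination/resources/<sample_name>/ for one sample.

    The shared resources directory is reused if it already exists, but an
    existing sample-specific subdirectory is treated as an error.
    """
    resources = destination / 'resources'
    resources.mkdir(parents=True, exist_ok=True)

    sample_dir = resources / sample_name
    # No exist_ok here, so this raises FileExistsError on a name clash.
    sample_dir.mkdir()
    return sample_dir
```

With a sample named `45`, this yields a `resources/45/` directory next to the pages, matching the samples/negatives/resources/45/2.png layout mentioned above.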

erikrose · 2019-07-30T17:23:16Z

Thanks, Daniel! :-D

erikrose

Fix the backticks and land the sucker! Woooo! :-D

erikrose · 2019-07-30T17:36:00Z

cli/fathom_web/extract.py

    For example, the resources for `example.html` would be stored in
-    `example_resources/`. This tool is used to prepare your samples for a
-    git-LFS enabled repository.
+    `resources/example_resources/`. This tool is used to prepare your


Turns out you need 2 backticks on either side in reStructuredText!

danielhertenstein · 2019-07-30

Ah yes. Thank you for your great attention to detail!
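For readers unfamiliar with the distinction, a hypothetical docstring shows both forms side by side. The function name and wording are made up and are not the PR's help text.

```python
def extract_help():
    """Hypothetical help text, not the PR's actual docstring.

    Single backticks, as in `example.html`, are reStructuredText
    "interpreted text" and go through the default role.

    Double backticks, as in ``example.html``, mark an inline literal,
    which is what file names and paths usually want.
    """
```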

Adds a tool to extract resources from freeze-dry'd html files

edc09f5

danielhertenstein requested a review from erikrose July 12, 2019 20:33

erikrose reviewed Jul 12, 2019

danielhertenstein mentioned this pull request Jul 17, 2019

Store page subresources in separate files #102

Closed

danielhertenstein added 2 commits July 19, 2019 10:20

Adds removal of <base> tags and corrects CSP issues

1f4b3a6

Extracting overwrites existing files and preserves original in new dir

6bccb27

danielhertenstein changed the title ~~[DO NOT MERGE] Adds a tool to extract resources from freeze-dry'd html files~~ Adds a tool to extract resources from freeze-dry'd html files Jul 19, 2019

Merge branch 'master' into fathom-extract

84aa840

# Conflicts: # cli/setup.py

danielhertenstein commented Jul 19, 2019

danielhertenstein requested review from erikrose and biancadanforth July 19, 2019 14:51

danielhertenstein mentioned this pull request Jul 19, 2019

Adds fathom-serve tool #117

Merged

erikrose suggested changes Jul 26, 2019

danielhertenstein added 5 commits July 29, 2019 09:43

Addresses all comments before starting to mess with the regex again

d1e90cf

Regex improvements and original saving switches from copies to moves

70a31fb

Plus original preservation logic back out of the extractor method

d188e20

Changes resource directory to coordinate with upcoming fathom-serve changes

15bf71a

Update main help message to align with new storage structure

667953c

danielhertenstein requested review from erikrose and removed request for biancadanforth July 29, 2019 20:49

Merge branch 'master' into fathom-extract

0ba95a4

# Conflicts: # cli/setup.py

erikrose suggested changes Jul 30, 2019

Puts resources back into a resources directory

f7cbc09

erikrose approved these changes Jul 30, 2019

Fixes backtick formatting

9faa753

danielhertenstein merged commit 9daa455 into master Jul 30, 2019

biancadanforth mentioned this pull request Oct 4, 2019

Make the HTTPS certificate warning less of a stumbling block mozilla/fathom-fox#56

Closed

danielhertenstein deleted the fathom-extract branch November 1, 2019 14:37
