Skip to content

Commit

Permalink
Updated docs and codebase in general to use save_content instead of s…
Browse files Browse the repository at this point in the history
…ave_all_content and save_javascript
  • Loading branch information
motin committed Aug 14, 2019
1 parent 0ff1975 commit 2e99d25
Show file tree
Hide file tree
Showing 5 changed files with 13 additions and 12 deletions.
14 changes: 8 additions & 6 deletions README.md
Expand Up @@ -96,13 +96,16 @@ available [below](#output-format).
* Response body content
* Saves all files encountered during the crawl to a `LevelDB`
database de-duplicated by the md5 hash of the content.
* Set `browser_params['save_all_content'] = True`
* Set `browser_params['save_content'] = True`
* The `content_hash` column of the `http_responses` table contains the md5
hash for each script, and can be used to do content lookups in the
LevelDB content database.
* NOTE: this instrumentation may lead to performance issues when a large
number of browsers are in use.
* Set `browser_params['save_javascript'] = True` to save only Javascript
* Set `browser_params['save_content']` to a comma-separated list of
[resource_types](https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest/ResourceType)
to save only specific types of files, for instance
`browser_params['save_content'] = "script"` to save only Javascript
files. This will lessen the performance impact of this instrumentation
when a large number of browsers are used in parallel.
* Flash Cookies
Expand Down Expand Up @@ -190,10 +193,9 @@ inline by sending a `create_table` message to the data aggregator.
#### Parquet on Amazon S3 **Experimental**
As an option, OpenWPM can save data directly to an Amazon S3 bucket as a
Parquet Dataset. This is currently experimental and hasn't been thoroughly
tested. Response body content (both `save_javascript` and `save_all_content`),
screenshots, and page source saving is not currently supported and will still
be stored in local databases and directories. To enable S3 saving specify the
following configuration parameters in `manager_params`:
tested. Screenshots, and page source saving is not currently supported and
will still be stored in local databases and directories. To enable S3
saving specify the following configuration parameters in `manager_params`:
* Output format: `manager_params['output_format'] = 's3'`
* S3 bucket name: `manager_params['s3_bucket'] = 'openwpm-test-crawl'`
* Directory within S3 bucket: `manager_params['s3_directory'] = '2018-09-09_test-crawl-new'`
Expand Down
2 changes: 1 addition & 1 deletion automation/DataAggregator/LocalAggregator.py
Expand Up @@ -196,7 +196,7 @@ def __init__(self, manager_params, browser_params):
# (if content saving is enabled on any browser)
self.ldb_enabled = False
for params in browser_params:
if params['save_javascript'] or params['save_all_content']:
if params['save_content']:
self.ldb_enabled = True
break

Expand Down
3 changes: 1 addition & 2 deletions automation/default_browser_params.json
Expand Up @@ -4,8 +4,7 @@
"js_instrument": false,
"http_instrument": false,
"navigation_instrument": false,
"save_javascript": false,
"save_all_content": false,
"save_content": false,

"random_attributes": false,
"bot_mitigation": false,
Expand Down
4 changes: 2 additions & 2 deletions test/test_http_instrumentation.py
Expand Up @@ -425,7 +425,7 @@ class TestHTTPInstrument(OpenWPMTest):
def get_config(self, data_dir=""):
manager_params, browser_params = self.get_test_config(data_dir)
browser_params[0]['http_instrument'] = True
browser_params[0]['save_javascript'] = True
browser_params[0]['save_content'] = "script"
return manager_params, browser_params

def test_page_visit(self):
Expand Down Expand Up @@ -602,7 +602,7 @@ def test_content_saving(self, tmpdir):
test_url = utilities.BASE_TEST_URL + '/http_test_page.html'
manager_params, browser_params = self.get_test_config(str(tmpdir))
browser_params[0]['http_instrument'] = True
browser_params[0]['save_all_content'] = True
browser_params[0]['save_content'] = True
manager = TaskManager.TaskManager(manager_params, browser_params)
manager.get(url=test_url, sleep=1)
manager.close()
Expand Down
2 changes: 1 addition & 1 deletion test/test_profile.py
Expand Up @@ -66,7 +66,7 @@ def test_profile_error(self):
def test_profile_saved_when_launch_crashes(self):
manager_params, browser_params = self.get_config()
browser_params[0]['proxy'] = True
browser_params[0]['save_javascript'] = True
browser_params[0]['save_content'] = "script"
manager = TaskManager.TaskManager(manager_params, browser_params)
manager.get('http://example.com')

Expand Down

0 comments on commit 2e99d25

Please sign in to comment.