From 0a588928c65a7ed8b94e5dd8883c750ae50993f6 Mon Sep 17 00:00:00 2001
From: Joel Purra
Date: Thu, 19 Feb 2015 11:51:38 +0100
Subject: [PATCH] Add usage to readme

---
 README.md | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 100 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 628a70d..46f4af0 100644
--- a/README.md
+++ b/README.md
@@ -3,11 +3,110 @@

Using [har-heedless](https://github.com/joelpurra/har-heedless/) to download and [har-dulcify](https://github.com/joelpurra/har-dulcify/) to analyze web pages in aggregate.

- [Downloads the front web page of all domains](https://github.com/joelpurra/har-heedless/) in a dataset.
  - Input is a text file with one domain name per line.
  - Downloads `n` domains in parallel.
  - Tested with over 100 parallel requests on a single machine of moderate speed and memory. YMMV.
  - Machine load depends heavily on the complexity and response rate of the average domain in the dataset.
  - Shows progress as well as the expected time to finish the downloads.
  - Downloads domains with different prefixes as separate dataset variations (see the directory sketch after the usage examples below).
    - Default prefixes:
      - `http://`
      - `https://`
      - `http://www.`
      - `https://www.`
  - Retries failed domains twice to reduce the effect of intermittent problems (see the retry sketch after the example output below).
  - Increases domain timeouts for failed domains.
  - Saves screenshots of all web pages.
- [Runs an analysis](https://github.com/joelpurra/har-dulcify/) on each dataset variation.
  - Outputs JSON files for analysis.
  - Prepared for aggregate dataset analysis, outputting tables (TSV/CSV) which in turn are prepared for graph creation.

Directory structure:

```bash
# $PWD/$(date -u +%F)/$(basename "$domainlist")-$prefix/hars
```

## Usage

```bash
# Create directory structure and download all domains in the given domain lists with a single prefix/variation.
# ./src/domains/download-and-analyze-https-www-combos.sh <prefix> <parallelism> <domain list> [<domain list> ...]
./src/domains/download-and-analyze-https-www-combos.sh 'https://www.' 10 many-domains.txt more-domains.txt 100k-se-domains.txt

# Create directory structure and download all domains in the given domain lists with all four prefix variations.
# ./src/domains/download-and-analyze-https-www-combos.sh <parallelism> <domain list> [<domain list> ...]
./src/domains/download-and-analyze-https-www-combos.sh 10 many-domains.txt more-domains.txt 100k-se-domains.txt
```

Other usage:

```bash
# Re-run the question step in each dataset.
~/path/to/har-dulcify/src/util/dataset-foreach.sh $(find . -mindepth 2 -maxdepth 2 -type d) -- echo "--- Entering {} ---" '&&' ~/path/to/har-dulcify/src/one-shot/questions.sh

# Re-run the aggregate and question steps in each dataset.
~/path/to/har-dulcify/src/util/dataset-foreach.sh $(find . -mindepth 2 -maxdepth 2 -type d) -- echo "--- Entering {} ---" '&&' ~/path/to/har-dulcify/src/one-shot/aggregate.sh '&&' ~/path/to/har-dulcify/src/one-shot/questions.sh

# Copy selected files from each dataset.
OUTPUT="$HOME/path/to/output/analysis/$(date -u +%F)" ~/path/to/har-dulcify/src/util/dataset-query.sh $(find . -mindepth 2 -maxdepth 2 -type d) -- echo "--- Entering {} ---" '&&' 'T="$OUTPUT/$(basename "{}")"' '&&' echo '$T' '&&' mkdir -p '$T/' '&&' cp aggregate.disconnect.categories.organizations.json aggregates.analysis.json '*.log' 'failed*' google-gtm-ga-dc.aggregate.json origin-redirects.aggregate.json ratio-buckets.aggregate.json prepared.disconnect.services.analysis.json '$T/'
```
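The `find . -mindepth 2 -maxdepth 2 -type d` expressions above enumerate the per-dataset directories created by the download step, two levels below the working directory per the documented directory structure pattern. The sketch below shows how one such directory per default prefix variation could be derived from that pattern. It is illustrative only, not the repository's code; in particular, making the prefix filesystem-safe with `tr` is an assumption.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: derive one dataset directory per default prefix
# variation, following the documented pattern
#   $PWD/$(date -u +%F)/$(basename "$domainlist")-$prefix/hars
# How the real scripts make the prefix filesystem-safe is an assumption here.
set -o errexit

domainlist="$1"

for prefix in 'http://' 'https://' 'http://www.' 'https://www.'; do
    # Assumed sanitization: drop slashes, colons and dots so the prefix can be
    # used inside a directory name ('https://www.' becomes 'httpswww').
    safeprefix="$(printf '%s' "$prefix" | tr -d '/:.')"
    datasetdir="$PWD/$(date -u +%F)/$(basename "$domainlist")-$safeprefix/hars"
    mkdir -p "$datasetdir"
    echo "$datasetdir"
done
```

For `domains.txt` this would create dataset directories such as `2015-02-19/domains.txt-httpswww` (each with a `hars` subdirectory), which is the level the `find` expressions above operate on.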
### Example output

Downloading two domains, one of which doesn't have HTTPS:

```text
$ ~/path/to/har-portent/src/domains/download-and-analyze-https-www-combos.sh 10 only-two-domains.txt
2015-01-31T113812Z start http://
 2 /Users/joelpurra/analyze/the/web/only-two-domains.txt
 in #1: 2 0:00:00 [26.7k/s] [=================================================================>] 100%
 out #1: 2 0:00:12 [ 157m/s] [=================================================================>] 100%
Downloading https://services.disconnect.me/disconnect-plaintext.json
Downloading https://publicsuffix.org/list/effective_tld_names.dat
2015-01-31T113855Z done http://
2015-01-31T113855Z start http://www.
 2 /Users/joelpurra/analyze/the/web/only-two-domains.txt
 in #1: 2 0:00:00 [23.3k/s] [=================================================================>] 100%
 out #1: 2 0:00:12 [ 164m/s] [=================================================================>] 100%
Downloading https://services.disconnect.me/disconnect-plaintext.json
Downloading https://publicsuffix.org/list/effective_tld_names.dat
2015-01-31T113937Z done http://www.
2015-01-31T113937Z start https://
 2 /Users/joelpurra/analyze/the/web/only-two-domains.txt
 in #1: 2 0:00:00 [26.3k/s] [=================================================================>] 100%
 out #1: 2 0:00:12 [ 157m/s] [=================================================================>] 100%
Downloading 1 domains, up to 30 at a time
 in #2: 1 0:00:00 [21.1k/s] [=================================================================>] 100%
 out #2: 1 0:00:12 [ 163m/s] [=================================================================>] 100%
Downloading 1 domains, up to 50 at a time
 in #3: 1 0:00:00 [29.9k/s] [=================================================================>] 100%
 out #3: 1 0:00:12 [ 163m/s] [=================================================================>] 100%
Downloading https://services.disconnect.me/disconnect-plaintext.json
Downloading https://publicsuffix.org/list/effective_tld_names.dat
2015-01-31T114019Z done https://
2015-01-31T114019Z start https://www.
 2 /Users/joelpurra/analyze/the/web/only-two-domains.txt
 in #1: 2 0:00:00 [30.3k/s] [=================================================================>] 100%
 out #1: 2 0:00:12 [ 163m/s] [=================================================================>] 100%
Downloading 1 domains, up to 30 at a time
 in #2: 1 0:00:00 [32.3k/s] [=================================================================>] 100%
 out #2: 1 0:00:12 [ 163m/s] [=================================================================>] 100%
Downloading 1 domains, up to 50 at a time
 in #3: 1 0:00:00 [31.7k/s] [=================================================================>] 100%
 out #3: 1 0:00:12 [ 163m/s] [=================================================================>] 100%
Downloading https://services.disconnect.me/disconnect-plaintext.json
Downloading https://publicsuffix.org/list/effective_tld_names.dat
2015-01-31T114101Z done https://www.
```
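In the output above, the `#2` and `#3` passes are the two retries for the one domain that failed over HTTPS; the scripts also raise the parallelism cap between passes (`up to 30 at a time`, then `up to 50 at a time`). Below is a minimal sketch of the retry-with-increasing-timeouts behavior described in the feature list. It is illustrative only, not the repository's code: `download_domain` is a hypothetical stand-in for the real HAR/screenshot capture step, and the concrete timeout values are assumptions.

```bash
#!/usr/bin/env bash
# Illustrative sketch of "retry failed domains twice, increasing timeouts".
# Only the three-pass structure follows the behavior described in this README;
# everything else below is an assumption.

download_domain() {
    # Hypothetical stand-in: a plain fetch with a timeout instead of the real
    # HAR/screenshot capture.
    curl --silent --output /dev/null --max-time "$2" "$1"
}

failed="$1"   # text file with one domain (or URL) per line

for pass in 1 2 3; do
    timeout=$(( pass * 30 ))   # assumed values: 30s, then 60s, then 90s
    stillfailing="$(mktemp)"

    while read -r domain; do
        download_domain "$domain" "$timeout" || echo "$domain" >>"$stillfailing"
    done <"$failed"

    failed="$stillfailing"
    [[ -s "$failed" ]] || break   # stop early when nothing failed this pass
done
```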
## Original purpose

Built as a component in [Joel Purra's master's thesis](http://joelpurra.com/projects/masters-thesis/) research, where downloading lots of front pages in the .se top level domain zone was required to analyze their content and use of internal/external resources.

---

Copyright (c) 2014, 2015 [Joel Purra](http://joelpurra.com/). Released under [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl.html).