Add usage to readme

joelpurra · Feb 19, 2015 · 0a58892
1 parent 16735e1
commit 0a58892
Showing 1 changed file with 100 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -3,11 +3,110 @@
 Using [har-heedless](https://github.com/joelpurra/har-heedless/) to download and [har-dulcify](https://github.com/joelpurra/har-dulcify/) to analyze web pages in aggregate.
 
 
+- [Downloads the front web page of all domains](https://github.com/joelpurra/har-heedless/) in a dataset.
+  - Input is a text file with one domain name per line.
+  - Downloads `n` domains in parallel.
+    - Tested with over 100 parallel requests on a single machine of moderate speed and memory. YMMV.
+    - Machine load heavily depends on the complexity and response rate of the average domain in the dataset.
+  - Shows progress as well as expected time to finish downloads.
+  - Download domains with different prefixes as separate dataset variations.
+    - Default prefixes:
+      - `http://`
+      - `https://`
+      - `http://www.`
+      - `https://www.`
+  - Retries failed domains twice to reduce the effect of intermittent problems.
+    - Increases domain timeouts for failed domains.
+  - Saves screenshots of all webpages.
+- [Runs an analysis](https://github.com/joelpurra/har-dulcify/) on each dataset variation.
+  - Outputs JSON files for analysis.
+  - Prepared for aggregate dataset analysis to output tables (TSV/CSV), which in turn are prepared for graph creation.
+
+Directory structure
+
+```bash
+# $PWD/$(date -u +%F)/$(basename "$domainlist")-$prefix/hars
+```
+
+## Usage
+
+```bash
+# Create directory structure and download all domains in the given domain lists with a single prefix/variation.
+# ./src/domains/download-and-analyze-https-www-combos.sh <prefix> <parallelism> <domainlists>
+./src/domains/download-and-analyze-https-www-combos.sh 'https://www.' 10 many-domains.txt more-domains.txt 100k-se-domains.txt
+
+# Create directory structure and download all domains in the given domain lists with all four variations.
+# ./src/domains/download-and-analyze-https-www-combos.sh <parallelism> <domainlists>
+./src/domains/download-and-analyze-https-www-combos.sh 10 many-domains.txt more-domains.txt 100k-se-domains.txt
+```
+
+Other usage:
+
+```bash
+# Re-run question step in each dataset.
+~/path/to/har-dulcify/src/util/dataset-foreach.sh $(find . -mindepth 2 -maxdepth 2 -type d) -- echo "--- Entering {} ---" '&&' ~/path/to/har-dulcify/src/one-shot/questions.sh
+
+# Re-run aggregate and question step in each dataset.
+~/path/to/har-dulcify/src/util/dataset-foreach.sh $(find . -mindepth 2 -maxdepth 2 -type d) -- echo "--- Entering {} ---" '&&' ~/path/to/har-dulcify/src/one-shot/aggregate.sh '&&' ~/path/to/har-dulcify/src/one-shot/questions.sh
+
+# Copy selected files from each dataset.
+OUTPUT="$HOME/path/to/output/analysis/$(date -u +%F)" ~/path/to/har-dulcify/src/util/dataset-query.sh $(find . -mindepth 2 -maxdepth 2 -type d) -- echo "--- Entering {} ---" '&&' 'T="$OUTPUT/$(basename "{}")"' '&&' echo '$T' '&&' mkdir -p '$T/' '&&' cp aggregate.disconnect.categories.organizations.json aggregates.analysis.json '*.log' 'failed*' google-gtm-ga-dc.aggregate.json origin-redirects.aggregate.json ratio-buckets.aggregate.json prepared.disconnect.services.analysis.json '$T/'
+```
+
+### Example output
+
+Downloading two domains, one of which doesn't have HTTPS.
+
+```text
+$ ~/path/to/har-portent/src/domains/download-and-analyze-https-www-combos.sh 10 only-two-domains.txt
+2015-01-31T113812Z start http://
+       2 /Users/joelpurra/analyze/the/web/only-two-domains.txt
+    in #1:    2  0:00:00 [26.7k/s] [=================================================================>] 100%
+   out #1:    2  0:00:12 [ 157m/s] [=================================================================>] 100%
+Downloading https://services.disconnect.me/disconnect-plaintext.json
+Downloading https://publicsuffix.org/list/effective_tld_names.dat
+2015-01-31T113855Z done http://
+2015-01-31T113855Z start http://www.
+       2 /Users/joelpurra/analyze/the/web/only-two-domains.txt
+    in #1:    2  0:00:00 [23.3k/s] [=================================================================>] 100%
+   out #1:    2  0:00:12 [ 164m/s] [=================================================================>] 100%
+Downloading https://services.disconnect.me/disconnect-plaintext.json
+Downloading https://publicsuffix.org/list/effective_tld_names.dat
+2015-01-31T113937Z done http://www.
+2015-01-31T113937Z start https://
+       2 /Users/joelpurra/analyze/the/web/only-two-domains.txt
+    in #1:    2  0:00:00 [26.3k/s] [=================================================================>] 100%
+   out #1:    2  0:00:12 [ 157m/s] [=================================================================>] 100%
+Downloading 1 domains, up to 30 at a time
+    in #2:    1  0:00:00 [21.1k/s] [=================================================================>] 100%
+   out #2:    1  0:00:12 [ 163m/s] [=================================================================>] 100%
+Downloading 1 domains, up to 50 at a time
+    in #3:    1  0:00:00 [29.9k/s] [=================================================================>] 100%
+   out #3:    1  0:00:12 [ 163m/s] [=================================================================>] 100%
+Downloading https://services.disconnect.me/disconnect-plaintext.json
+Downloading https://publicsuffix.org/list/effective_tld_names.dat
+2015-01-31T114019Z done https://
+2015-01-31T114019Z start https://www.
+       2 /Users/joelpurra/analyze/the/web/only-two-domains.txt
+    in #1:    2  0:00:00 [30.3k/s] [=================================================================>] 100%
+   out #1:    2  0:00:12 [ 163m/s] [=================================================================>] 100%
+Downloading 1 domains, up to 30 at a time
+    in #2:    1  0:00:00 [32.3k/s] [=================================================================>] 100%
+   out #2:    1  0:00:12 [ 163m/s] [=================================================================>] 100%
+Downloading 1 domains, up to 50 at a time
+    in #3:    1  0:00:00 [31.7k/s] [=================================================================>] 100%
+   out #3:    1  0:00:12 [ 163m/s] [=================================================================>] 100%
+Downloading https://services.disconnect.me/disconnect-plaintext.json
+Downloading https://publicsuffix.org/list/effective_tld_names.dat
+2015-01-31T114101Z done https://www.
+```
+
 
 ## Original purpose
 
 Built as a component in [Joel Purra's master's thesis](http://joelpurra.com/projects/masters-thesis/) research, where downloading lots of front pages in the .se top level domain zone was required to analyze their content and use of internal/external resources.
 
 
+---
 
-Copyright (c) 2014 [Joel Purra](http://joelpurra.com/). Released under [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl.html).
+Copyright (c) 2014, 2015 [Joel Purra](http://joelpurra.com/). Released under [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl.html).
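
The README text added in this commit describes the input as a text file with one domain name per line, downloaded once per prefix variation. A minimal sketch of how such a list and the four default prefixes could combine into the URLs to download is shown here. It is an illustration only, not the code in `download-and-analyze-https-www-combos.sh`; the helper names `urls_for_prefix` and `all_variations` and the example domains are hypothetical.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not part of har-portent): print one URL per domain
# for a single prefix. $1 is the prefix, $2 a file with one domain per line.
urls_for_prefix() (
  prefix="$1"
  domainlist="$2"
  # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
  while IFS= read -r domain || [ -n "$domain" ]; do
    # Skip empty lines.
    [ -n "$domain" ] || continue
    printf '%s%s\n' "$prefix" "$domain"
  done < "$domainlist"
)

# Hypothetical helper: print the URLs for all four default prefixes
# listed in the README, one variation after the other.
all_variations() (
  for prefix in 'http://' 'https://' 'http://www.' 'https://www.'; do
    urls_for_prefix "$prefix" "$1"
  done
)

# Demo with a temporary two-domain list (the domains are placeholders).
demo_list="$(mktemp)"
printf 'example.com\nexample.org\n' > "$demo_list"
all_variations "$demo_list"
rm -f "$demo_list"
```

With a two-domain list this prints eight URLs, two per variation, which matches the four `start`/`done` pairs in the example output of the README.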