This repository has been archived by the owner on Jun 1, 2021. It is now read-only.
Extract data from HTTP Archive (HAR) files, quite possibly downloaded by har-heedless, for some aggregate analysis.

joelpurra/har-dulcify

Extract data from HTTP Archive (HAR) files, quite possibly downloaded by har-heedless, for some aggregate analysis. You might want to use har-portent, which both downloads multiple dataset variations using har-heedless and analyzes them with har-dulcify in a single step.

⚠️ This project has been archived

No future updates are planned. Feel free to continue using it, but expect no support.

Usage

# TODO: describe relevant scripts.
# Start with src/one-shot/*.sh

$ tree src/
# src/
# ├── aggregate
# │   ├── all.sh
# │   ├── analysis.sh
# │   ├── merge.sh
# │   ├── prepare.sh
# │   └── prepare2.sh
# ├── classification
# │   ├── basic.sh
# │   ├── disconnect
# │   │   ├── add.sh
# │   │   ├── analysis.sh
# │   │   └── prepare-service-list.sh
# │   └── public-suffix
# │       ├── add.sh
# │       └── prepare-list.sh
# ├── domains
# │   └── latest
# │       ├── all.sh
# │       └── single.sh
# ├── extract
# │   ├── errors
# │   │   ├── all.sh
# │   │   ├── failed-page-loads.sh
# │   │   ├── page.sh
# │   │   └── successful-page-loads.sh
# │   └── request
# │       ├── expand-parts.sh
# │       └── parts.sh
# ├── multiset
# │   ├── download-retries.sh
# │   ├── non-failed.classification.disconnect.coverage.sh
# │   ├── non-failed.classification.domain-scope.coverage.sh
# │   ├── non-failed.classification.secure.coverage.sh
# │   ├── non-failed.disconnect.categories.coverage.external.sh
# │   ├── non-failed.disconnect.counts.sh
# │   ├── non-failed.disconnect.domains.coverage.external.google.sh
# │   ├── non-failed.disconnect.domains.coverage.external.sh
# │   ├── non-failed.disconnect.organizations.coverage.external.sh
# │   ├── non-failed.mime-types.groups.coverage.external.sh
# │   ├── non-failed.mime-types.groups.coverage.internal.sh
# │   ├── non-failed.mime-types.groups.coverage.origin.sh
# │   ├── non-failed.public-suffix.coverage.external.sh
# │   ├── non-failed.requests.counts.sh
# │   ├── origin-redirects.sh
# │   ├── ratio-buckets.sh
# │   └── request-status.codes.coverage.origin.sh
# ├── one-shot
# │   ├── aggregate.sh
# │   ├── all.sh
# │   ├── data.sh
# │   ├── multiset.sh
# │   ├── preparations.sh
# │   └── questions.sh
# ├── questions
# │   ├── disconnect.categories.organizations.sh
# │   ├── google-gtm-ga-dc.aggregate.sh
# │   ├── google-gtm-ga-dc.sh
# │   ├── origin-redirects.aggregate.sh
# │   ├── origin-redirects.sh
# │   ├── ratio-buckets.aggregate.analysis.sh
# │   ├── ratio-buckets.aggregate.sh
# │   └── ratio-buckets.sh
# └── util
#     ├── array-of-objects-to-csv.sh
#     ├── array-of-objects-to-tsv.sh
#     ├── cat-path.sh
#     ├── clean-csv-sorted-header.sh
#     ├── clean-tsv-sorted-header.sh
#     ├── concat.sh
#     ├── dataset-foreach.sh
#     ├── dataset-query.sh
#     ├── malformed-har.sh
#     ├── parallel-chunks.sh
#     ├── parallel-n-2.sh
#     ├── prepare-alexa-domain-lists.sh
#     ├── prepare-domain-lists.sh
#     ├── prepare-zone-file-domain-lists.sh
#     ├── reduce-merge-deep-add.sh
#     ├── structure.sh
#     ├── take.sh
#     ├── to-array.sh
#     └── unwrap-array.sh
#
# 13 directories, 69 files
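
Since the scripts are not yet described here, one way to get oriented is to read the comment lines at the top of each entry point. The following is a minimal sketch only: `list_script_headers` is a hypothetical helper that is not part of har-dulcify, and it assumes that the scripts carry descriptive comments near the top, which may not be the case.

```shell
# Hypothetical helper, not part of har-dulcify: list the .sh files in a
# directory and print the comment lines found near the top of each one.
list_script_headers() {
  dir="${1:-src/one-shot}"
  for script in "$dir"/*.sh; do
    # Skip the unexpanded glob when the directory has no .sh files.
    [ -e "$script" ] || continue
    printf '== %s\n' "$script"
    # Print comment lines among the first 20 lines, shebang excluded.
    head -n 20 "$script" | grep '^#' | grep -v '^#!' || true
  done
}
```

Calling `list_script_headers src/one-shot` from the repository root would print each one-shot script's path followed by whatever leading comments it contains. Pass another directory, such as `src/util`, to inspect the other groups.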

Original purpose

Photo of Joel Purra presenting his master's thesis, named Swedes Online: You Are More Tracked Than You Think

Built as a component in Joel Purra's master's thesis research, where downloading lots of front pages in the .se top level domain zone was required to analyze their content and use of internal/external resources.

Citations

If you use, like, reference, or base work on the thesis report Swedes Online: You Are More Tracked Than You Think, the IEEE LCN 2016 paper Third-party Tracking on the Web: A Swedish Perspective, open source code, or open data, please add at least one of the following two citations with a link to the project website: https://joelpurra.com/projects/masters-thesis/

Master's thesis citation:

Joel Purra. 2015. Swedes Online: You Are More Tracked Than You Think. Master's thesis. Linköping University (LiU), Linköping, Sweden. https://joelpurra.com/projects/masters-thesis/

IEEE LCN 2016 paper citation:

J. Purra, N. Carlsson, Third-party Tracking on the Web: A Swedish Perspective, Proc. IEEE Conference on Local Computer Networks (LCN), Dubai, UAE, Nov. 2016. https://joelpurra.com/projects/masters-thesis/


Copyright (c) 2014, 2015, 2016, 2017 Joel Purra. Released under GNU General Public License version 3.0 (GPL-3.0).
