Add counters for hostnames in Alexa #179

Closed
teor2345 opened this Issue Feb 2, 2017 · 3 comments

teor2345 commented Feb 2, 2017

No description provided.

@teor2345 teor2345 added the enhancement label Feb 2, 2017

@teor2345 teor2345 added this to the v0.2.0 milestone Feb 2, 2017

Collaborator

teor2345 commented Feb 5, 2017

How are we going to treat subdomains?
How are we going to deal with country-specific domains?
Does Alexa have a canonicalisation algorithm?
What do we do about sites where the client requests a different domain when it connects via Tor?
(Or when JavaScript is turned off, or because of other factors.)

@robgjansen robgjansen modified the milestones: v0.2.1, v0.2.0 Feb 9, 2017

@teor2345 teor2345 modified the milestones: Counters, 1.1.0 - Alexa Hostnames Jun 21, 2017

Collaborator

teor2345 commented Jun 21, 2017

We are deferring this.

@teor2345 teor2345 modified the milestones: 1.5.0 - Alexa Stream Hostname Statistics, Counters Jul 6, 2017

Collaborator

teor2345 commented Jul 7, 2017

This is Tor Trac ticket https://trac.torproject.org/projects/tor/ticket/18268, originally requested by OONI.

We would have a secure counter for each popular (top-level) hostname listed by Alexa. We would not collect any subdomains. We should probably also have two catch-all counters: one for Web traffic to hosts that Alexa does not list, and one for Tor traffic that is not Web traffic.

We should count streams (connections) and stream bytes (usage).
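
A minimal sketch of the bucketing described above, for illustration only. It assumes that "Web" means ports 80 and 443, that hostnames are compared exactly (so subdomains are not collected), and it uses plain integers where PrivCount would use its secure counters. The names `counter_key`, `tally_streams`, `WebNotListed`, and `NonWeb` are hypothetical and do not come from the PrivCount code.

```python
from collections import defaultdict

# Assumption for this sketch: "Web" traffic means HTTP/HTTPS ports.
WEB_PORTS = {80, 443}


def counter_key(hostname, port, listed_hosts):
    """Return the bucket that one exit stream would be counted in."""
    if port not in WEB_PORTS:
        return "NonWeb"            # hypothetical "not-Web" catch-all
    host = hostname.lower().rstrip(".")
    if host in listed_hosts:
        return host                # one bucket per listed hostname
    return "WebNotListed"          # hypothetical "not-Alexa Web" catch-all


def tally_streams(streams, listed_hosts):
    """Count streams (connections) and stream bytes (usage) per bucket.

    `streams` is an iterable of (hostname, port, byte_count) tuples.
    """
    tallies = defaultdict(lambda: {"streams": 0, "bytes": 0})
    for hostname, port, byte_count in streams:
        key = counter_key(hostname, port, listed_hosts)
        tallies[key]["streams"] += 1
        tallies[key]["bytes"] += byte_count
    return dict(tallies)
```

With exact comparison, a stream to `www.example.com` lands in the catch-all bucket even when `example.com` is listed, which matches the "no subdomains" proposal.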

@teor2345 teor2345 modified the milestones: 1.5.0 - Alexa Stream Hostname Statistics, 1.2.0 - HSDir Fetch and Evict Statistics Oct 31, 2017

teor2345 added a commit to teor2345/privcount that referenced this issue Nov 9, 2017

Refactor common stream event code
* Make _classify_port() return the actual counter name components
* Use these components for dict keys
* Use these components to create counter names programmatically
* Refactor circuit history checks into is_circ_known()

Preparation for #179.
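
The refactoring idea in this commit message can be sketched roughly as follows. This is not the PrivCount code: the function names, the two port classes, and the `Count` / `ByteCount` endings are assumptions chosen to show how name components can serve both as dict keys and as building blocks for counter names.

```python
def classify_port(port):
    """Return the counter-name component for a port.

    Assumption: only two classes exist in this sketch.
    """
    return "Web" if port in (80, 443) else "NonWeb"


def stream_components(port, is_first_on_circuit):
    """Return the name components for a stream as a tuple.

    The tuple is hashable, so it can be used directly as a dict key.
    """
    position = "Initial" if is_first_on_circuit else "Subsequent"
    return (classify_port(port), position)


def stream_counter_names(port, is_first_on_circuit,
                         measurements=("Count", "ByteCount")):
    """Build counter names from the components instead of hard-coding them."""
    middle = "".join(stream_components(port, is_first_on_circuit))
    return ["Exit" + middle + "Stream" + m for m in measurements]
```

Because the classifier returns the actual name components, adding a new port class only requires changing `classify_port()`; the keys and the generated names follow automatically.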

teor2345 added a commit to teor2345/privcount that referenced this issue Nov 9, 2017

Add counters for Exit Stream Hostnames and IPv4 / IPv6
Adds 126 new counters, 9 types of counter * 14 variants.

Closes #452.
Preparation for #179.

teor2345 added a commit to teor2345/privcount that referenced this issue Nov 17, 2017

Implement domain exact and suffix matches on TS and DCs
A domain list is loaded by the TS and sent to the DCs.
Adds 108 new counters, 9 types of counter * 12 variants.

Part of #179.

teor2345 added a commit to teor2345/privcount that referenced this issue Nov 17, 2017

Make sure domains are unique, and strip periods
But don't do the expensive uniqueness tests, because they take a long time
every time the TS reloads the config.

Also use case-insensitive matching, just in case that ever matters.

Part of #179.
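
A minimal sketch of the normalisation this commit message describes, under stated assumptions: the function name `normalise_domains` is hypothetical, duplicates are rejected rather than silently dropped, and case-insensitivity is obtained by lower-casing at load time. The commit does not say which of these choices PrivCount makes.

```python
def normalise_domains(lines):
    """Strip surrounding periods, lower-case, and check exact uniqueness.

    The exact-duplicate check uses a set, so it stays cheap even for
    large lists. Checking whether one entry is a suffix of another is
    deliberately left out, mirroring the commit's decision to skip the
    expensive uniqueness tests on every config reload.
    """
    seen = set()
    domains = []
    for line in lines:
        domain = line.strip().strip(".").lower()
        if not domain:
            continue  # ignore blank lines
        if domain in seen:
            # Assumption: a duplicate is treated as a config error.
            raise ValueError("duplicate domain: {}".format(domain))
        seen.add(domain)
        domains.append(domain)
    return domains
```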

teor2345 added a commit to teor2345/privcount that referenced this issue Nov 20, 2017

Implement domain exact and suffix matches on TS and DCs
A domain list is loaded by the TS and sent to the DCs.
Adds 108 new counters, 9 types of counter * 12 variants.

The TS makes sure domains are unique, and strips periods.

The TS and DCs don't do expensive suffix uniqueness tests, because they take
a long time every time the TS reloads the config.

The DCs use case-insensitive matching, just in case that ever matters.

Put a summary of the domain lists in the outcomes file.
Don't send the full domain list from DCs to TS.
Don't put the full domain list from the TS in the outcomes file.

Add test configs and the Alexa Top 1000 from November 2017 as test data.

Increase protocol MAX_LENGTH to 100MB, so we can send 15MB domain lists
without any warnings.

Part of #179.
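
The difference between an exact match and a suffix match can be sketched as below. This is an illustration, not the PrivCount implementation: the function names are hypothetical, `domains` is assumed to be a set of lower-case names without leading or trailing periods, and suffixes are assumed to match only at label boundaries.

```python
def exact_match(hostname, domains):
    """True if the hostname itself is in the domain set (case-insensitive)."""
    return hostname.lower().strip(".") in domains


def suffix_match(hostname, domains):
    """True if the hostname, or any parent domain of it, is in the set.

    Suffixes are taken at label boundaries, so "notexample.com" does not
    match "example.com".
    """
    labels = hostname.lower().strip(".").split(".")
    return any(".".join(labels[i:]) in domains for i in range(len(labels)))


domains = {"example.com", "torproject.org"}
print(exact_match("www.example.com", domains))   # subdomain: no exact match
print(suffix_match("www.example.com", domains))  # subdomain: suffix match
```

Each lookup costs one set membership test per label in the hostname, so matching does not slow down as the domain list grows.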

@teor2345 teor2345 closed this in 56b7a19 Nov 21, 2017

teor2345 added a commit to teor2345/privcount that referenced this issue Nov 29, 2017

Stop ignoring counter names containing 'name' in test_counter_match.sh
This affects the configs for the Hostname counters.

Also exclude a spurious counter name from counter matching.

Also avoid future instances of this issue by making pattern matches more
specific.

Part of #455, fix on #179.
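
The kind of over-matching this fix addresses can be shown with a small sketch. The actual patterns in test_counter_match.sh are not quoted in this thread, so both patterns and all three sample lines below are assumptions made for illustration.

```python
import re

# Hypothetical input lines: one config field and two counter names.
lines = [
    "name: example-config",
    "ExitHostnameWebInitialStreamCount",
    "ExitStreamCount",
]

# Hypothetical over-broad filter: ignores anything containing "name",
# which also hits every counter with "Hostname" in its name.
loose = re.compile(r"name")

# Hypothetical narrower filter: only ignores a "name:" field at line start.
specific = re.compile(r"^name\s*:")


def kept(candidates, ignore):
    """Return the lines that the ignore pattern does not match."""
    return [line for line in candidates if not ignore.search(line)]


print(kept(lines, loose))
print(kept(lines, specific))
```

With the loose pattern the Hostname counter is dropped from the comparison without any warning; anchoring the pattern keeps it in.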

teor2345 added a commit to teor2345/privcount that referenced this issue Dec 6, 2017

Modify the domain lists based on initial results
Adds the following domain lists for testing:
  - domain-google.txt     : Google domains in the Alexa Top 1 million, Nov 2017
  - domain-torproject.txt : torproject.org, used for checks and updates

Comment out the following domain lists, as they have negligible traffic:
  - domain-example.txt    : IANA example domains
  - domain-arpa.txt       : IANA arpa infrastructure domain
  - domain-onion.txt      : Tor's onion special use domain
  - domain-i2p.txt        : I2P's i2p special use domain

Update to #179.

teor2345 added a commit to teor2345/privcount that referenced this issue Dec 6, 2017

Remove all the MatchWebSubsequent counters
We don't want to collect them, and they're slowing down processing.

Bugfix on #179.

teor2345 added a commit to teor2345/privcount that referenced this issue Dec 7, 2017

Add ExitHostnameNonWebInitial/SubsequentStream counters
And copy their bins from similar counters.

Bugfix on #179.

teor2345 added a commit to teor2345/privcount that referenced this issue Dec 18, 2017

Remove all MatchWebStream counters
Since we removed the Subsequent counters in f525366 (which was itself
a fix on #179), these counters have been duplicates of the
MatchWebInitialStream counters.

This will improve performance, but it might not be noticeable.

Bugfix on #179.
Closes #469.