Skip to content
This repository has been archived by the owner on May 4, 2021. It is now read-only.

#16559, bwauth code needs to be smarter about failed circuits #160

Closed
juga0 opened this issue May 14, 2018 · 13 comments
Closed

#16559, bwauth code needs to be smarter about failed circuits #160

juga0 opened this issue May 14, 2018 · 13 comments
Assignees
Labels
bug Something isn't working

Comments

@juga0
Copy link
Contributor

juga0 commented May 14, 2018

@pastly this is for working on https://trac.torproject.org/projects/tor/ticket/16559.
Since we changed to use http and not helper relay (or now is optional, i dunno), could you update us on the status of this and/or point to the code that checks when a circuit fail or a download fail?.
If there're not such checks i'd create them, and if can't know so far the reason for failure, i'd try to include it so we can debug better

@juga0 juga0 added the bug Something isn't working label May 14, 2018
@juga0 juga0 added this to the SoP milestone May 14, 2018
@juga0 juga0 self-assigned this May 14, 2018
@pastly
Copy link
Member

pastly commented May 14, 2018

https://github.com/pastly/simple-bw-scanner/blob/2727e20ed54df9d37401035b97a6a67fd4ec60a6/sbws/core/scanner.py#L131 is measure_relay, which does all the circuit building and measuring. See where it calls measure_rtt_to_server and measure_bandwidth_to_server and see where it returns ResultError types early if something goes wrong.

@pastly
Copy link
Member

pastly commented May 14, 2018

From trac#16559 ...

it will still be counted as a circuit that went through all of the nodes [...] the code will consider failures from unstable nodes to be relevant for nodes that are perfectly stable

sbws has the same limitation.

@juga0
Copy link
Contributor Author

juga0 commented May 15, 2018

hmm, not sure what "relevant" means here.
Maybe we could check whether the relays have the stable flag?
After looking the code, i found the possible errors:

  • unable to build circuit between relay and exit: ResultErrorCircuit
    • PathLengthException
    • is_controller_okay: have to look what this means
    • InvalidRequest, CircuitExtensionFailed, ProtocolError: this errors can tell us more about what failed?
    • is c.new_circuit able to tell us more?
  • web server usable: ResultErrorStream: have to look what this means
  • measure bw and rtt: ResultErrorStream
    • requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout: this errors can tell us more about what failed?
    • other possible errors with requests?

@pastly
Copy link
Member

pastly commented May 15, 2018

In this comment I'm saying that if one relay in a circuit causes or experiences a problem, all relays in the circuit will get a ResultError* given to it.

A ResutlErrorCircuit is created if the circuit can't be built. As far as I know, there is no easy way to learn from stem where the circuit failed. For example, you can't easily tell if you failed to connect to the first relay or if the first relay failed to connect to the second. I think you could technically figure this out by watching all circuit events while building a circuit and seeing where you get EXTENDED events; however, I don't think this is useful. If relayA can't extend to relayB, is it really relayA's fault? Or is it relayB's fault? How can you know?

A ResultErrorStreamis created when something went wrong with the TCP stream between the scanner and the destination web server. For example, the destination suddenly stopped being "usable" (one reason would be that the file suddenly became too small), or the stream was closed during the RTT or bandwidth measurements.

is_controller_okay makes sure our control connection to Tor is still good.

Regarding c.new_circuit and the exceptions it throws, you can double check me here to see if we can learn more information from the exceptions. I don't think we really can.

Regarding the Requests exceptions, it doesn't know anything about Tor so it only says it failed to connect to the web server or timed out.

In short, I don't think there's much productive work to be done for this ticket.

@juga0
Copy link
Contributor Author

juga0 commented May 15, 2018

Thanks for all the useful comments.
Do you or @teor2345 still think that maybe we should collect/analyze all those different errors (even the lack of reasons for them) in logs or even put them in the bandwidth files so hopefully we can figure out something about them?.

@teor2345
Copy link
Contributor

teor2345 commented May 17, 2018 via email

@juga0
Copy link
Contributor Author

juga0 commented May 21, 2018

We can count failures in the bandwidth file.

ok, so far the intermediate files contain success, error-X, X = {misc, circ, stream, auth}, i'm not sure if we'd like to have a key_value for each type of error with the total number of them, or just one key_value called something like errors_measuring.

But there's no room for detailed information.

Currently intermediate files contain a msg with the reason for failure, though that message could be improved.

We can log failures.

most of them are being logged already

But we should let operators disabled failure logs, because they can be large

So far that could be managed with logging config. All those errors are currently shown as warning, so logging should be configure to error.

@teor2345
Copy link
Contributor

i'm not sure if we'd like to have a key_value for each type of error with the total number of them, or just one key_value called something like errors_measuring.

Maybe we should include success counts as well?

8 key_values per relay would take up ~150 bytes per relay. That's well below the 512 byte line limit. And that's only an extra megabyte for 6000 relays.

So let's have the detailed information.

And let's patch the spec to document the new fields.

@juga0
Copy link
Contributor Author

juga0 commented May 21, 2018

ok, though if it's not a problem, i'd start 1st with sbws, we can leave it in a branch until spec merged

@teor2345
Copy link
Contributor

I'm not sure what you mean.

The bandwidth file spec was merged in https://bugs.torproject.org/25869 to https://gitweb.torproject.org/torspec.git/tree/bandwidth-file-spec.txt

You can do a sbws and torspec branch with these new fields, and get them both reviewed at the same time.

@juga0
Copy link
Contributor Author

juga0 commented May 21, 2018

What i mean is that i need to add new key_value pairs to the bw lines to:

  1. sbws
  2. bandwidth file spec

And i would do it in that other, but leaving the changes in a branch, so that the spec branch can be merged before the sbws branch, in case it's needed.

I'm not sure we want to patch Torflow, but i think we can decide later.

@teor2345
Copy link
Contributor

No, torflow isn't getting any new features

@juga0
Copy link
Contributor Author

juga0 commented May 25, 2018

Ticket for changing the spec: https://trac.torproject.org/projects/tor/ticket/26200

@juga0 juga0 mentioned this issue May 28, 2018
@juga0 juga0 modified the milestones: SoP, 1.0 (MVP must) Sep 5, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants