New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pt pollution detection and handling #393

Merged
merged 15 commits into from Jan 26, 2018

Conversation

Projects
None yet
3 participants
@yachang
Contributor

yachang commented Jan 24, 2018

This change is Reviewable

yachang added some commits Jan 24, 2018

@coveralls

This comment has been minimized.

coveralls commented Jan 24, 2018

Coverage Status

Coverage increased (+0.4%) to 64.414% when pulling 80212d5 on pt-pollution into ed3f897 on integration.

yachang added some commits Jan 24, 2018

@yachang yachang requested a review from pboothe Jan 24, 2018

@yachang yachang requested a review from gfr10598 Jan 25, 2018

@gfr10598

This comment has been minimized.

Contributor

gfr10598 commented Jan 25, 2018

Reviewed 1 of 3 files at r1, 7 of 8 files at r2.
Review status: 8 of 9 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


parser/pt.go, line 39 at r2 (raw file):

	testID           string
	hops             []*schema.ParisTracerouteHop
	logTime          time.Time

Could you add a parse_time field? We should make that consistent across all the parsers, and now seems like a good time to add it. It should be straightforward to add it to the existing tables, but do be careful as you do it.

BTW, we should discuss sometime reprocessing all of the PT and SS data now that the annotator has been updated.


parser/pt_test.go, line 57 at r2 (raw file):

	if lastLine != "ReachExpectedDestIP" {
		fmt.Println(lastLine)
		t.Fatalf("Do not get last valid hop line correctly.")

"Did not reach expected destination."


parser/pt_test.go, line 84 at r2 (raw file):

	if lastLine != "ReachExpectedDestIP" {
		fmt.Println(lastLine)
		t.Fatalf("Do not get last valid hop line correctly.")

"Did not reach expected destination"


parser/pt_test.go, line 202 at r2 (raw file):

	}
	// expectedBufferedTest[2] == 0 means pollution detected and test removed.
	expectedBufferedTest := [6]int{1, 1, 0, 1, 2, 1}

There is a standard test pattern that is used, e.g. in the annotator tests. Please use that here.
https://github.com/m-lab/annotation-service/blob/9e690bc3f05786b70086ce8e91842afdc499c98c/handler/handler_test.go#L65


Comments from Reviewable

@gfr10598

This comment has been minimized.

Contributor

gfr10598 commented Jan 25, 2018

Review status: 8 of 9 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


parser/pt.go, line 36 at r2 (raw file):

// The data that will be inserted intp BQ tables
type ParsedPTData struct {

Can we keep this in schema.PT or is there a reason it should move here?


parser/pt.go, line 47 at r2 (raw file):

	inserter etl.Inserter
	etl.RowStats
	previousTests []ParsedPTData

Add a comment like
// Care should be taken to ensure this does not accumulate many rows and lead to OOM problems.


parser/pt.go, line 203 at r2 (raw file):

		err := pt.inserter.InsertRow(ptTest)
		if err != nil {
			metrics.ErrorCount.WithLabelValues(

With buffering, does this properly count errors? I think you might get no error for first 99, then a single error that actually means all 100 inserts failed.


parser/pt.go, line 265 at r2 (raw file):

		return err
	}

I will review this section after lunch.


Comments from Reviewable

@gfr10598

This comment has been minimized.

Contributor

gfr10598 commented Jan 25, 2018

Review status: 8 of 9 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


parser/pt.go, line 203 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

With buffering, does this properly count errors? I think you might get no error for first 99, then a single error that actually means all 100 inserts failed.

FYI This is an aspect of the inserter that I am not very happy with, but I haven't found time to make it better. 8-(


Comments from Reviewable

@yachang

This comment has been minimized.

Contributor

yachang commented Jan 25, 2018

Review status: 8 of 9 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


parser/pt.go, line 36 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Can we keep this in schema.PT or is there a reason it should move here?

discussed offline. Will add more documents to clarify. This data structure is only used to buffer temp parsed results.


parser/pt.go, line 39 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Could you add a parse_time field? We should make that consistent across all the parsers, and now seems like a good time to add it. It should be straightforward to add it to the existing tables, but do be careful as you do it.

BTW, we should discuss sometime reprocessing all of the PT and SS data now that the annotator has been updated.

parse_time is irrelevant in this case.


parser/pt.go, line 47 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Add a comment like
// Care should be taken to ensure this does not accumulate many rows and lead to OOM problems.

Done.


parser/pt.go, line 203 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

FYI This is an aspect of the inserter that I am not very happy with, but I haven't found time to make it better. 8-(

Discussed offline. In brief,

  1. This PR did not change the code logic.

  2. There is a long-term concern for BQ table inserter error handling. Some fixes has been done inside insert.go, func HandleInsertErrors()

  3. The buffer size for PT inserter (defined in global.go) has been tuned when the first version of PT parser was written.

  4. According to Prometheus Metrics, the daily PT inserter errors are single digits given hundreds of thousands tests, and should not worry us too much.


parser/pt_test.go, line 57 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

"Did not reach expected destination."

Done.


Comments from Reviewable

@yachang

This comment has been minimized.

Contributor

yachang commented Jan 25, 2018

Review status: 7 of 9 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


parser/pt_test.go, line 84 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

"Did not reach expected destination"

Done.


Comments from Reviewable

@yachang

This comment has been minimized.

Contributor

yachang commented Jan 25, 2018

Review status: 7 of 9 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


parser/pt_test.go, line 202 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

There is a standard test pattern that is used, e.g. in the annotator tests. Please use that here.
https://github.com/m-lab/annotation-service/blob/9e690bc3f05786b70086ce8e91842afdc499c98c/handler/handler_test.go#L65

Done.


Comments from Reviewable

@gfr10598

This comment has been minimized.

Contributor

gfr10598 commented Jan 25, 2018

Review status: 7 of 9 files reviewed at latest revision, 18 unresolved discussions, some commit checks failed.


parser/pt.go, line 203 at r2 (raw file):

Previously, yachang wrote…

Discussed offline. In brief,

  1. This PR did not change the code logic.

  2. There is a long-term concern for BQ table inserter error handling. Some fixes has been done inside insert.go, func HandleInsertErrors()

  3. The buffer size for PT inserter (defined in global.go) has been tuned when the first version of PT parser was written.

  4. According to Prometheus Metrics, the daily PT inserter errors are single digits given hundreds of thousands tests, and should not worry us too much.

Ok. We might want to revisit this for all the parsers in the future. I'll check if I have an issue for this already.


parser/pt.go, line 268 at r2 (raw file):

	// Check Client_ip in connSpec to check all buffered PT tests.
	// If pollution detected, then remove the test from buffer.
	// If no pollution detected, insert the oldest test into BitQuery table.

s/BitQuery/BigQuery/


parser/pt.go, line 271 at r2 (raw file):

	destIP := connSpec.Client_ip
	for index, PTTest := range pt.previousTests {
		// array of hops was built reversely from list of nodes (in func ProcessAllNodes)

"reversely"? previously?


parser/pt.go, line 37 at r3 (raw file):

// The data structure is used to store the parsed results temporarily before it is verified
// not polluted and can be inserted into BQ tables
type ParsedPTData struct {

Please make this unexported.
How about cachedPTData ?


parser/pt.go, line 259 at r3 (raw file):

	}

	hops, logTime, connSpec, lastValidHopLine, err := Parse(meta, testName, rawContent, pt.TableName())

Hmm. Can Parse() returned (cachedPTData, error)?


parser/pt.go, line 269 at r3 (raw file):

	}

	// Check Client_ip in connSpec to check all buffered PT tests.

?? This line isn't clear. This is getting the client_ip from the latest test, and looking for it in the earlier tests?


parser/pt.go, line 277 at r3 (raw file):

		hop := PTTest.hops[0]
		if PTTest.connSpec.Client_ip != destIP && (hop.Dest_ip == destIP || strings.Contains(PTTest.lastValidHopLine, destIP)) {
			// remove pt.previousTest[index]

s/remove/discard/

Ah - I see that on each new test, you are filtering all the older tests.


parser/pt.go, line 282 at r3 (raw file):

		}
	}

Could you comment the logic more thoroughly here? IIUC, a test that ends at the expected DestIP is not at risk of being polluted, so we don't have to wait to check against further tests - we can just go ahead an insert it.

I'm inclined to think this is an unnecessary optimization. Less than half of tests make it to destIP, and if you don't check here, the test will be inserted later anyway. The code would be simpler and easier to understand.

Is there another reason I'm missing?


parser/pt.go, line 283 at r3 (raw file):

	}

	// if lastValidHopLine is "ReachExpectedDestIP", we need not buffer this test.

Just "ExpectedDestIP" would make this more readable/easier to understand.


parser/pt.go, line 284 at r3 (raw file):

	// if lastValidHopLine is "ReachExpectedDestIP", we need not buffer this test.
	// We can insert it to BigQuery table directly.

Ok - I guess we don't care about order, since there are other workers inserting other blocks of hops concurrently. Maybe worth commenting that.


parser/pt.go, line 296 at r3 (raw file):

	}

	if len(pt.previousTests) >= 5 {

Please define a constant at the top of the file (or top of function) instead of using literal "5". Feel free to make me do the same, as I know I don't always follow this rule myself.


parser/pt.go, line 296 at r3 (raw file):

	}

	if len(pt.previousTests) >= 5 {

FYI, it is possible, though unlikely, that this code might unintentionally impact performance. Please do some kind of sanity check (average tests/CPU-sec), ideally before deploying to prod. I think the impact is likely to be minimal.


parser/pt.go, line 564 at r3 (raw file):

		return PTHops, t, connSpec, "ReachExpectedDestIP", nil
	}
	return PTHops, t, connSpec, lastValidHopLine, nil

I think it is time to create a struct for this return value, so this would become something like:
return foobar{PTHops, logTime, connSpec, lastValidHopLine}, nil


Comments from Reviewable

@yachang

This comment has been minimized.

Contributor

yachang commented Jan 26, 2018

Review status: 7 of 9 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


parser/pt.go, line 37 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Please make this unexported.
How about cachedPTData ?

Done.


parser/pt.go, line 259 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Hmm. Can Parse() returned (cachedPTData, error)?

Done.


parser/pt.go, line 269 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

?? This line isn't clear. This is getting the client_ip from the latest test, and looking for it in the earlier tests?

Done.


Comments from Reviewable

@yachang

This comment has been minimized.

Contributor

yachang commented Jan 26, 2018

Review status: 7 of 9 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


parser/pt.go, line 268 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

s/BitQuery/BigQuery/

Done.


parser/pt.go, line 271 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

"reversely"? previously?

Done.


parser/pt.go, line 277 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

s/remove/discard/

Ah - I see that on each new test, you are filtering all the older tests.

Done.


parser/pt.go, line 282 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Could you comment the logic more thoroughly here? IIUC, a test that ends at the expected DestIP is not at risk of being polluted, so we don't have to wait to check against further tests - we can just go ahead an insert it.

I'm inclined to think this is an unnecessary optimization. Less than half of tests make it to destIP, and if you don't check here, the test will be inserted later anyway. The code would be simpler and easier to understand.

Is there another reason I'm missing?

discussed offline. This optimization makes the pollution check more effective by removing those tests (reach expected dest_IP) from the buffer.


parser/pt.go, line 283 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Just "ExpectedDestIP" would make this more readable/easier to understand.

Done.


parser/pt.go, line 284 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Ok - I guess we don't care about order, since there are other workers inserting other blocks of hops concurrently. Maybe worth commenting that.

Done.


parser/pt.go, line 296 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Please define a constant at the top of the file (or top of function) instead of using literal "5". Feel free to make me do the same, as I know I don't always follow this rule myself.

Done.


parser/pt.go, line 564 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

I think it is time to create a struct for this return value, so this would become something like:
return foobar{PTHops, logTime, connSpec, lastValidHopLine}, nil

Done.


Comments from Reviewable

@yachang

This comment has been minimized.

Contributor

yachang commented Jan 26, 2018

Review status: 7 of 9 files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.


parser/pt.go, line 296 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

FYI, it is possible, though unlikely, that this code might unintentionally impact performance. Please do some kind of sanity check (average tests/CPU-sec), ideally before deploying to prod. I think the impact is likely to be minimal.

I will check the throughput when deploy this to staging and production.


Comments from Reviewable

@gfr10598

This comment has been minimized.

Contributor

gfr10598 commented Jan 26, 2018

Review status: 7 of 9 files reviewed at latest revision, 15 unresolved discussions, some commit checks failed.


parser/pt.go, line 271 at r2 (raw file):

Previously, yachang wrote…

Done.

Ah
// so the final parsed hop is Hops[0]


parser/pt.go, line 269 at r3 (raw file):

Previously, yachang wrote…

Done.

One more step in the explanation would help.

// If it does (or doesn't) then ... so ...


parser/pt.go, line 296 at r3 (raw file):

Previously, yachang wrote…

I will check the throughput when deploy this to staging and production.

Thanks!


parser/pt.go, line 277 at r4 (raw file):

	for index, PTTest := range pt.previousTests {
		// array of hops was built in reverse order from list of nodes (in func ProcessAllNodes())
		hop := PTTest.Hops[0]

final := PTTest.Hops[0]


parser/pt.go, line 285 at r4 (raw file):

	}

	// If a test that ends at the expected DestIP, it is not at risk of being polluted,

// If a test ends at ...


parser/pt.go, line 438 at r4 (raw file):

	// We do not need to get destIP and serverIP from file name, since they are at the first line
	// of test content as well.
	t, err := GetLogtime(fn)

logTime, err := ...


parser/pt.go, line 440 at r4 (raw file):

	t, err := GetLogtime(fn)
	if err != nil {
		return emptyTest, err

I think this should be
return cachedPTData{}, err

That way the test isn't constructed unless needed.


parser/pt.go, line 443 at r4 (raw file):

	}

	// The filename contains 5-tuple like 20170320T23:53:10Z-98.162.212.214-53849-64.86.132.75-42677.paris

Is this comment still accurate and relevant?


parser/pt.go, line 471 at r4 (raw file):

				metrics.ErrorCount.WithLabelValues(tableName, "pt", "corrupted first line").Inc()
				metrics.TestCount.WithLabelValues(tableName, "pt", "corrupted first line").Inc()
				return emptyTest, err

Same as above.


parser/pt.go, line 493 at r4 (raw file):

				if err != nil {
					metrics.PTHopCount.WithLabelValues(tableName, "pt", "discarded").Add(float64(len(allNodes)))
					return emptyTest, err

Same as above.


parser/pt.go, line 561 at r4 (raw file):

	}
	if reachedDest {
		return cachedPTData{

Is this identical to the block below? Is the if clause needed?

Or just
lastValidHopLine = "ExpectedDestIP"
and then fall through?


parser/pt_test.go, line 202 at r2 (raw file):

Previously, yachang wrote…

Done.

Thanks!


Comments from Reviewable

@gfr10598

This comment has been minimized.

Contributor

gfr10598 commented Jan 26, 2018

Review status: 7 of 9 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


parser/pt.go, line 282 at r3 (raw file):

Previously, yachang wrote…

discussed offline. This optimization makes the pollution check more effective by removing those tests (reach expected dest_IP) from the buffer.

Please add that to code comments, too, so next person to look at the code will know.


Comments from Reviewable

@gfr10598

This comment has been minimized.

Contributor

gfr10598 commented Jan 26, 2018

Just a few more minor things.


Review status: 7 of 9 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


Comments from Reviewable

@yachang

This comment has been minimized.

Contributor

yachang commented Jan 26, 2018

Review status: 7 of 9 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


parser/pt.go, line 271 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Ah
// so the final parsed hop is Hops[0]

Done.


parser/pt.go, line 269 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

One more step in the explanation would help.

// If it does (or doesn't) then ... so ...

Done.


parser/pt.go, line 282 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Please add that to code comments, too, so next person to look at the code will know.

Done.


parser/pt.go, line 296 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Thanks!

Done.


parser/pt.go, line 277 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

final := PTTest.Hops[0]

Done.


parser/pt.go, line 285 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

// If a test ends at ...

Done.


Comments from Reviewable

@yachang

This comment has been minimized.

Contributor

yachang commented Jan 26, 2018

Review status: 7 of 9 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


parser/pt.go, line 443 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Is this comment still accurate and relevant?

Thanks for detecting this! Updated now.


parser/pt.go, line 471 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Same as above.

Done.


parser/pt.go, line 493 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Same as above.

Done.


parser/pt.go, line 561 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Is this identical to the block below? Is the if clause needed?

Or just
lastValidHopLine = "ExpectedDestIP"
and then fall through?

LastValidHopLine field are different.


Comments from Reviewable

@gfr10598

This comment has been minimized.

Contributor

gfr10598 commented Jan 26, 2018

LGTM
Couple more minor things.


Review status: 7 of 9 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


parser/pt.go, line 561 at r4 (raw file):

Previously, yachang wrote…

LastValidHopLine field are different.

Got it. How about my second suggestion?


parser/pt.go, line 274 at r5 (raw file):

	// If it does appear, then the buffered test was polluted, and it will
	// be discarded from buffer.
	// If it does not appear, then no pollution detected. If buffer is full,

If buffer full ... should go down below where you check against PTBufferSize.


Comments from Reviewable

@gfr10598

This comment has been minimized.

Contributor

gfr10598 commented Jan 26, 2018

:lgtm:


Review status: 7 of 9 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


Comments from Reviewable

@yachang

This comment has been minimized.

Contributor

yachang commented Jan 26, 2018

Review status: 7 of 9 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


parser/pt.go, line 438 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

logTime, err := ...

Done.


parser/pt.go, line 561 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Got it. How about my second suggestion?

Done.


parser/pt.go, line 274 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

If buffer full ... should go down below where you check against PTBufferSize.

Done.


Comments from Reviewable

@yachang

This comment has been minimized.

Contributor

yachang commented Jan 26, 2018

Review status: 7 of 9 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


parser/pt.go, line 440 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

I think this should be
return cachedPTData{}, err

That way the test isn't constructed unless needed.

Done.


Comments from Reviewable

yachang added some commits Jan 26, 2018

@yachang yachang merged commit 42a6bcf into integration Jan 26, 2018

3 of 4 checks passed

coverage/coveralls Coverage decreased (-0.09%) to 64.414%
Details
code-review/reviewable 1/1 LGTMs
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@pboothe pboothe removed their request for review Jan 26, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment