InfluxDB should do a partial write on mismatched type errors #7814

Closed
sparrc opened this issue Jan 10, 2017 · 7 comments · Fixed by #7836

@sparrc
Contributor

sparrc commented Jan 10, 2017

Bug report

When a point is written with a mismatched type, it is handled differently than a malformed point.

With a malformed point, a partial write is done of all well-formed points that are provided in the batch, and a "partial write" error is returned.

With a mismatched point, no points are written and a "mismatched type" error is returned.

Both cases return HTTP status code 400.

I think that we should treat mismatched types the same as malformed points, performing a partial write of the well-formed and valid points in the batch.
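The proposed behavior can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not InfluxDB's implementation: `field_type`, `parse_point`, and `write_batch` are hypothetical names, the parser only understands `measurement field=value` lines, and a trailing `i` is taken to mean an integer field. Points that are malformed or that conflict with a known field type are skipped, everything else is stored, and a 400 "partial write" result is returned if anything was rejected.

```python
from typing import Dict, List, Tuple


def field_type(raw: str) -> str:
    """Classify a field value: a trailing 'i' means integer, otherwise float."""
    if raw.endswith("i"):
        int(raw[:-1])  # raises ValueError if the value is not a valid integer
        return "integer"
    float(raw)  # raises ValueError if the value is not a valid float
    return "float"


def parse_point(line: str) -> Tuple[str, str, str]:
    """Parse 'measurement field=value'; raise ValueError if malformed."""
    parts = line.split(" ")
    if len(parts) != 2 or "=" not in parts[1]:
        raise ValueError("malformed point: %r" % line)
    field, raw = parts[1].split("=", 1)
    if not parts[0] or not field or not raw:
        raise ValueError("malformed point: %r" % line)
    return parts[0], field, field_type(raw)


def write_batch(
    schema: Dict[Tuple[str, str], str], stored: List[str], batch: str
) -> Tuple[int, str]:
    """Store every acceptable point; report rejected ones as a partial write."""
    rejected = []
    for line in batch.splitlines():
        if not line.strip():
            continue
        try:
            measurement, field, ftype = parse_point(line)
        except ValueError as exc:
            rejected.append(str(exc))
            continue
        # The first write of a field fixes its type (assumed model behavior).
        known = schema.setdefault((measurement, field), ftype)
        if known != ftype:
            rejected.append(
                "mismatched type for %s.%s: %s vs %s"
                % (measurement, field, ftype, known)
            )
            continue
        stored.append(line)
    if rejected:
        return 400, "partial write: " + "; ".join(rejected)
    return 204, ""
```

In this model, a batch containing `test value=1i` after `test` was first written as a float still stores `test2 value=1`, which is the outcome requested above.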

System info:

InfluxDB master, commit 8c2cfd1 (master some ways after 1.1.0)

Steps to reproduce:

  1. Write a metric called test as a float:
curl -i "http://localhost:8086/write?db=telegraf" --data-binary "test value=1"
  2. Write a batch of two metrics, first with test as an integer and then a new metric called test2:
curl -i "http://localhost:8086/write?db=telegraf" --data-binary "test value=1i
test2 value=1
"
  3. Observe that the test2 metric does not get created, even though it is valid.
  4. Now write a batch of two metrics, first with test completely malformed and then a new metric called test2:
curl -i "http://localhost:8086/write?db=telegraf" --data-binary "test value
test2 value=1
"
  5. Now observe that test2 does get created, though a 400 status code and "partial write" is still returned.
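For anyone who wants to replay the steps above without retyping the curl commands, here is a minimal sketch using only the Python standard library. The URL and payloads are copied from the report; the helper names (`build_request`, `send`, `run_repro`) and the batch labels are made up for illustration. It assumes an InfluxDB instance listening on localhost:8086 with a `telegraf` database, and nothing is sent until `run_repro()` is called.

```python
import urllib.error
import urllib.request

# Write endpoint used in the report.
WRITE_URL = "http://localhost:8086/write?db=telegraf"

# (label, line-protocol body) pairs matching the three curl commands.
BATCHES = [
    ("float baseline", "test value=1"),
    ("type conflict plus valid point", "test value=1i\ntest2 value=1\n"),
    ("malformed plus valid point", "test value\ntest2 value=1\n"),
]


def build_request(body: str) -> urllib.request.Request:
    """Build the POST request that mirrors `curl --data-binary`."""
    return urllib.request.Request(
        WRITE_URL, data=body.encode("utf-8"), method="POST"
    )


def send(body: str):
    """Send one batch and return (status code, response text)."""
    try:
        with urllib.request.urlopen(build_request(body)) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # 4xx responses arrive as exceptions; the body holds the error message.
        return err.code, err.read().decode("utf-8")


def run_repro() -> None:
    """Replay all three batches in order and print each response."""
    for label, body in BATCHES:
        status, text = send(body)
        print(label, status, text.strip())
```

After calling `run_repro()`, running SHOW MEASUREMENTS between batches is how the presence or absence of test2 would be checked.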

Not sure, but this might be a dupe of #4856

@sparrc
Contributor Author

sparrc commented Jan 10, 2017

The reason for this is that telegraf (and any system writing batches to InfluxDB) currently can get into a situation where a mismatched type can cause the entire batch to be dropped.

If the client keeps the batch and retries it instead of dropping it, every subsequent write fails indefinitely, because the mismatched point causes the whole retried batch to be rejected each time.
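The retry problem can be shown with a small simulation. This is a sketch under assumptions, not telegraf's actual buffering code: `all_or_nothing_server` is a hypothetical stand-in for a server that rejects the entire batch when any point is bad, a `BAD` prefix is an invented marker for the mismatched point, and `flush_with_retry` models a client that keeps unsent points and retries them together with new ones.

```python
from typing import Callable, List


def all_or_nothing_server(batch: List[str]) -> int:
    """Hypothetical server: reject the whole batch if any point is marked BAD."""
    return 400 if any(point.startswith("BAD") for point in batch) else 204


def flush_with_retry(
    buffer: List[str],
    send: Callable[[List[str]], int],
    rounds: List[List[str]],
) -> List[int]:
    """Each round, add new points to the buffer and try to send everything.

    The buffer is cleared only on success, so a rejected batch is retried
    in full on the next round. Returns the status code of every attempt.
    """
    statuses = []
    for new_points in rounds:
        buffer.extend(new_points)
        status = send(buffer)
        if status == 204:
            buffer.clear()
        statuses.append(status)
    return statuses
```

Feeding this client one round containing a `BAD` point followed by rounds of valid points yields a 400 on every attempt while the buffer keeps growing, which is the stuck state described above.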

@phemmer
Contributor

phemmer commented Jan 10, 2017

> With a mismatched point, no points are written and a "mismatched type" error is returned.

Actually, this isn't quite true. I touched on this, without fully explaining it, in my comment on #4856.

With a mismatched type, all the points in the batch prior to the mismatch get written. The ones after it do not.

@sparrc
Contributor Author

sparrc commented Jan 10, 2017

@phemmer I think there might be a separate but related bug. I also initially see all prior points get "written" (they appear in SHOW MEASUREMENTS), but on restarting the DB they are not present.

I'm not sure exactly why, but I believe they're getting registered in the index but not fully committed.

I'm working on opening a separate issue.

@sparrc
Contributor Author

sparrc commented Jan 10, 2017

@phemmer #7815

@sparrc
Contributor Author

sparrc commented Jan 11, 2017

@jwilder should we throw this into the 1.2 milestone?

@abadyan-vonage

We have seen this phenomenon in an even weirder scenario: when there is a mismatched type error, other unrelated points being processed at the same time may be dropped.
We are using UDP and had one of the scripts constantly send mismatched metrics. We also saw many dropped points coming from other scripts and hosts. To our surprise, fixing the problematic script also resolved the other dropped points.
Again, this occurred for different machines reporting different metrics to different measurements while using UDP.
Perhaps we should open a separate issue?

@e-dard
Copy link
Contributor

e-dard commented Jan 17, 2017

Should be fixed by #7836.
