[0.9.1] restarting process irrevocably BREAKS measurements with spaces #3319

Closed
greglook opened this issue Jul 13, 2015 · 13 comments

@greglook

Since upgrading from 0.9.0 to 0.9.1 earlier, all measurements with spaces in their names have been broken, including all historical data. The package was upgraded by downloading the .deb and installing in-place over 0.9.0. Once done, the server booted up without any issues, but I immediately noticed that most (but not all) of the graphs in Grafana were broken.

A bit more investigating revealed that all metrics with a single-word name (e.g. "cpu", "load", etc.) were working fine, but anything with spaces (e.g. "disk /boot", "riemann streams rate") was not returning any results to queries. The queries hadn't changed, and look like this:

SELECT mean(value) FROM "riemann streams rate" WHERE host='$host' AND $timeFilter GROUP BY time($interval) ORDER BY asc

When I logged in using the influx CLI, I confirmed that querying for data from these measurements did not return any data. However, after running SHOW MEASUREMENTS I noticed that every measurement with spaces shows up twice - once as usual, and once with the spaces escaped:

> SHOW MEASUREMENTS
http response 200
http response 201
http response 206
...
http\ response\ 200
http\ response\ 201
http\ response\ 206
...
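
For anyone checking whether their own instance is affected, here is a minimal Python sketch that scans a list of measurement names (for example, copied from SHOW MEASUREMENTS output) and reports names that appear both plain and with backslash-escaped spaces. The helper names `escape_spaces` and `find_escaped_duplicates` are made up for illustration and are not part of InfluxDB or any client library.

```python
def escape_spaces(name: str) -> str:
    """Backslash-escape spaces, matching the duplicated names in the listing."""
    return name.replace(" ", "\\ ")


def find_escaped_duplicates(measurements):
    """Return (plain, escaped) pairs where both forms appear in the list."""
    names = set(measurements)
    pairs = []
    for name in sorted(names):
        escaped = escape_spaces(name)
        # Only names that contain a space change when escaped.
        if escaped != name and escaped in names:
            pairs.append((name, escaped))
    return pairs


# Hypothetical input: a few names in the shape of the listing above.
listing = [
    "cpu",
    "http response 200",
    "http response 201",
    "http\\ response\\ 200",
    "http\\ response\\ 201",
]

for plain, escaped in find_escaped_duplicates(listing):
    print(f"{plain!r} also appears as {escaped!r}")
```

Single-word names such as "cpu" are never reported, which is consistent with those measurements continuing to work.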

I note that SHOW SERIES does not return any measurements without escaped names, which is bizarre. As I said, querying for data from the original (unescaped) measurement returns no data; querying for data from the escaped version (by double-escaping) complains that there are no fields in that measurement:

> SELECT * FROM "http\\ response\\ 200"
ERR: select statement must include at least one field or function call
> SELECT value FROM "http\\ response\\ 301" LIMIT 5
ERR: unknown field or tag name in select clause: value

This seems to have changed something at the data storage level, because downgrading InfluxDB to 0.9.0 has not fixed the issue. So, the current state of things is that the majority of our data cannot be accessed right now, despite no errors of any kind from Riemann, InfluxDB, or Grafana. Needless to say, this is a HUGE REGRESSION and we're currently facing loss of all our old data to try to get this back into a working state.

I posted in IRC but there was little to no response.

@beckettsean
Contributor

This doesn't appear to be an issue with newly created measurements on 0.9.1:

> insert hello\ spaces\ my\ old\ friend,foo=bar value=1
> insert hello\ spaces\ my\ old\ friend,foo=bar value=2
> insert hello\ spaces\ my\ old\ friend,foo=bar value=3
> insert hello\ spaces\ my\ oldfriend,foo=bar value=3
> show measurements
name: measurements
------------------
name
field3
hello spaces my old friend
hello spaces my oldfriend

> select * from /hello.*/
name: hello spaces my old friend
tags: foo=bar
time                value
----                -----
2015-07-14T00:27:09.125308449Z  1
2015-07-14T00:27:11.461603474Z  2
2015-07-14T00:27:13.085517614Z  3


name: hello spaces my oldfriend
tags: foo=bar
time                value
----                -----
2015-07-14T00:27:16.15776629Z   3
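
The `insert` lines above use the line protocol, where spaces in the measurement name are backslash-escaped. A rough Python sketch of producing such a line programmatically is below. It assumes numeric field values only and omits the timestamp; `to_line` and `_escape` are hypothetical helpers, not a client-library API.

```python
def _escape(text: str, chars: str) -> str:
    """Prefix each character listed in `chars` with a backslash."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line(measurement, tags, fields):
    """Format one point as `measurement,tag=value field=value` (no timestamp)."""
    # Measurement names escape commas and spaces.
    key = _escape(measurement, ", ")
    # Tag keys and values escape commas, equals signs, and spaces.
    for k in sorted(tags):
        key += "," + _escape(k, ",= ") + "=" + _escape(tags[k], ",= ")
    # Assumption: field values are plain numbers, so str() is enough here.
    field_part = ",".join(
        _escape(k, ",= ") + "=" + str(v) for k, v in sorted(fields.items())
    )
    return key + " " + field_part


print(to_line("hello spaces my old friend", {"foo": "bar"}, {"value": 1}))
```

The printed line has the same shape as the first `insert` argument in the session above.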

@greglook I don't know if it will help with the grafana issue, but you can try selecting with a regex. For instance, does SELECT * FROM /http.*/ return data from the series with spaces and/or the series with escaped spaces? You could also try SELECT * FROM /http .*/ vs SELECT * FROM /http\\.*/.

Agreed this sounds like a terrible regression. I'll see if I can repro with an 0.9.0 to 0.9.1 upgrade.

@greglook
Author

Selecting via regex does not return any results; most of our graphs already use regex-based queries. Sounds like the path forward here is to move our old DB aside and recreate it so the measurements get recreated.

@beckettsean
Contributor

@greglook that's very odd that there's no way to get the original data back out. I don't quite follow what you're saying, but creating a new database and new measurements should solve the immediate issue. As for recovering the old data, we'll need help from the core team.

@otoolep or @dgnorton did any of your recent changes possibly contribute to this?

@greglook
Author

It's not the end of the world if we lose the old data, but this is definitely not what I expected out of a minor version upgrade being touted for stability improvements.

@beckettsean
Contributor

Further investigation. Spun up bare Ubuntu 14.04 box on DO. Installed 0.9.0.

root@sean2:~# wget http://influxdb.s3.amazonaws.com/influxdb_0.9.0_amd64.deb
<snip>
2015-07-13 22:00:39 (17.3 MB/s) - ‘influxdb_0.9.0_amd64.deb’ saved [7858284/7858284]

root@sean2:~# dpkg -i influxdb_0.9.0_amd64.deb 
Selecting previously unselected package influxdb.
<snip>
root@sean2:~# /etc/init.d/influxdb start
Starting the process influxd [ OK ]
influxd process was started [ OK ]

From the CLI:

root@sean2:/opt/influxdb# ./influx
Connected to http://localhost:8086 version FIXME
InfluxDB shell 0.9.0
> create database mydb
> use mydb
Using database mydb
<snip inserts>

> show measurements
name: measurements
------------------
name
foo with spaces
test

> show series
name: foo with spaces
---------------------
_key                bar baz
foo\ with\ spaces,bar=baz   baz 
foo\ with\ spaces           
foo\ with\ spaces,baz=foobar        foobar


name: test
----------
_key        foo
test,foo=bar    bar

> select * from "foo with spaces"
name: foo with spaces
tags: bar=baz, baz=
time                value
----                -----
2015-07-14T02:02:44.707759305Z  1
2015-07-14T02:02:46.14659743Z   2
2015-07-14T02:02:47.217675249Z  3


name: foo with spaces
tags: bar=, baz=
time                value
----                -----
2015-07-14T02:02:51.619625883Z  4
2015-07-14T02:02:53.459373288Z  5
2015-07-14T02:02:54.890904558Z  6


name: foo with spaces
tags: bar=, baz=foobar
time                value
----                -----
2015-07-14T02:03:01.667745737Z  6
2015-07-14T02:03:04.796348367Z  7
2015-07-14T02:03:05.875798916Z  8
2015-07-14T02:03:13.196690972Z  9

Upgraded to 0.9.1:

root@sean2:/opt/influxdb# /etc/init.d/influxdb stop
influxd process was stopped [ OK ]
root@sean2:/opt/influxdb# wget http://influxdb.s3.amazonaws.com/influxdb_0.9.1_amd64.deb
<snip>
2015-07-13 22:08:01 (15.9 MB/s) - ‘influxdb_0.9.1_amd64.deb’ saved [7076932/7076932]

root@sean2:/opt/influxdb# dpkg -i influxdb_0.9.1_amd64.deb
(Reading database ... 86977 files and directories currently installed.)
Preparing to unpack influxdb_0.9.1_amd64.deb ...
Unpacking influxdb (0.9.1) over (0.9.0) ...
Setting up influxdb (0.9.1) ...
Installing new version of config file /etc/opt/influxdb/influxdb.conf ...
 Removing any system startup links for /etc/init.d/influxdb ...
   /etc/rc0.d/K20influxdb
   /etc/rc1.d/K20influxdb
   /etc/rc2.d/S20influxdb
   /etc/rc3.d/S20influxdb
   /etc/rc4.d/S20influxdb
   /etc/rc5.d/S20influxdb
   /etc/rc6.d/K20influxdb
 Adding system startup for /etc/init.d/influxdb ...
   /etc/rc0.d/K20influxdb -> ../init.d/influxdb
   /etc/rc1.d/K20influxdb -> ../init.d/influxdb
   /etc/rc6.d/K20influxdb -> ../init.d/influxdb
   /etc/rc2.d/S20influxdb -> ../init.d/influxdb
   /etc/rc3.d/S20influxdb -> ../init.d/influxdb
   /etc/rc4.d/S20influxdb -> ../init.d/influxdb
   /etc/rc5.d/S20influxdb -> ../init.d/influxdb
root@sean2:/opt/influxdb# /etc/init.d/influxdb start
Starting the process influxd [ OK ]
influxd process was started [ OK ]

Now back to the CLI:

root@sean2:/opt/influxdb# ./influx
Connected to http://localhost:8086 version 0.9.1
InfluxDB shell 0.9.1
> use mydb
Using database mydb
> show measurements
name: measurements
------------------
name
foo with spaces
foo\ with\ spaces
test

> show series
name: foo\ with\ spaces
-----------------------
_key                bar baz
foo\ with\ spaces           
foo\ with\ spaces,bar=baz   baz 
foo\ with\ spaces,baz=foobar        foobar


name: test
----------
_key        foo
test,foo=bar    bar

> select * from /foo.*/
ERR: can not use field in group by clause: bar
> select * from "foo with spaces"
> select * from "foo\ with\ spaces"
ERR: error parsing query: found \ , expected identifier at line 1, char 20
> select * from /foo with spaces/
> select * from /foo\ with\ spaces/
> select * from /foo\\ with\\ spaces/
ERR: select statement must include at least one field or function call
> 
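
The query attempts above mix two conventions: plain double-quoted identifiers, and backslash escapes inside the quotes. As a sketch of how a client might quote a measurement name consistently, the Python below wraps the name in double quotes and escapes embedded backslashes and double quotes. That escaping rule is an assumption inferred from the double-escaped queries earlier in this thread; `quote_ident` and `select_all` are hypothetical helpers.

```python
def quote_ident(name: str) -> str:
    """Double-quote an identifier, escaping backslashes and double quotes."""
    # Escape backslashes first so the escapes added for quotes stay intact.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def select_all(measurement: str) -> str:
    """Build a SELECT * statement for a single measurement."""
    return "SELECT * FROM " + quote_ident(measurement)


# A name with plain spaces needs no escaping inside the quotes.
print(select_all("foo with spaces"))
# A name that literally contains backslashes gets them doubled.
print(select_all("foo\\ with\\ spaces"))
```

Under this assumption, a measurement whose stored name really contains backslashes is addressed with doubled backslashes, which is the form tried in the `"http\\ response\\ 200"` queries.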

@greglook My apologies for this bad regression issue on upgrade. Clearly there's a gap in our automated testing and we'll take a look.

@toddboom Let's make sure the soak/load tester uses some identifiers with spaces.

@otoolep, @jwilder, @dgnorton seems like this regression is probably somewhere in one of your PRs, since @corylanou was on vacation for 0.9.1. Any ideas if @greglook can recover the data?

@beckettsean added this to the 0.9.2 milestone Jul 14, 2015
@greglook
Author

For the moment I've just yanked the database data directory out from under InfluxDB and I can confirm we're seeing the newly-created series show up and query fine. If it turns out there's an easy way to recover the data in the next day or two, I'm all ears! If it takes longer than that it may not be worth it unless we can merge the old data into the new database, since the fresh data is more useful than older history with a gap up to the present.

@greglook
Author

So it turns out that just restarting the service is enough to break all series with spaces in them. We lost our data again. Downgrading to 0.9.0 until this is sorted out... not cool.

@beckettsean
Contributor

@greglook I just confirmed that restarting the process somehow corrupts the metastore indices for measurements with spaces. Core devs are already looking at it. There's no estimate yet on when this will get fixed, but it's a top priority as it involves data loss.
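
To make the symptom concrete, here is a toy Python model of one way an index rebuilt at startup could end up with escaped measurement names. This is purely illustrative: it is not InfluxDB code, the thread does not identify the actual root cause, and `series_key` and `measurement_from_key` are invented names. It assumes measurement names contain no commas.

```python
def series_key(measurement, tags):
    """Build an escaped series key such as `foo\\ with\\ spaces,bar=baz`."""
    key = measurement.replace(" ", "\\ ")
    for k in sorted(tags):
        key += f",{k}={tags[k]}"
    return key


def measurement_from_key(key, unescape=True):
    """Recover the measurement name from a stored series key."""
    # Assumption: the measurement name itself contains no commas.
    name = key.split(",", 1)[0]
    return name.replace("\\ ", " ") if unescape else name


keys = [
    series_key("foo with spaces", {"bar": "baz"}),
    series_key("test", {"foo": "bar"}),
]

# Rebuilding the index with unescaping yields the names users query by.
good = sorted({measurement_from_key(k) for k in keys})
# Skipping the unescape step yields names that no longer match those queries.
bad = sorted({measurement_from_key(k, unescape=False) for k in keys})

print(good)
print(bad)
```

In this model the single-word measurement "test" is identical in both lists, while the name with spaces differs, which mirrors the pattern seen in the SHOW SERIES output after the upgrade.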

@beckettsean changed the title from "0.9.1 upgrade BREAKS measurements with spaces" to "[0.9.1] restarting process irrevocably BREAKS measurements with spaces" Jul 15, 2015
@greglook
Author

I understand. Sorry for the agitated posts, but as you can imagine this is not a great experience for our core monitoring system to be built on. :(

@greglook
Author

Any progress on this issue? A restart while running 0.9.0 causes the same behavior (using the metadata store from the 0.9.1 version). I cleaned out the whole data folder this time and hopefully won't see a recurrence while we're on the old version.

@beckettsean
Contributor

@greglook I've certainly raised awareness on the core team. I suspect this fix will get cherry-picked back into the 0.9.2 RCs once it's been solved. I don't have any further updates beyond that.

@jwilder
Contributor

jwilder commented Jul 22, 2015

@greglook A fix for this is in 0.9.2 branch and current master.

@jwilder removed their assignment Jul 22, 2015
@beckettsean
Contributor

Verified with @jwilder that the fix will restore prior functionality, so if you had an 0.9.0 datastore with measurements that had spaces, upgrading to 0.9.2 will make those measurements available for queries again.

3 participants