client.WriteSeries returns: Server returned (400): IO error: /opt/influxdb/shared/data/db/shard_db_v2/00190/MANIFEST-000006: No such file or directory #985

Merged
merged 1 commit into master from fix-985 on Oct 24, 2014

Conversation

jvshahid
Contributor

I'm using https://github.com/vimeo/whisper-to-influxdb/, which invokes influxClient.WriteSeriesWithTimePrecision(toCommit, client.Second) (sketched at the end of this comment)
to write a series called "servers.dfvimeostatsd1.diskspace.root.inodes_free" with 60643 records in (time, sequence_number, value) format to my graphite database, which I recreated from scratch yesterday after I upgraded.

influx> list shardspaces
Database    Name Regex Retention Duration RF Split
graphite default  /.*/      365d       7d  1     1
influx> 

I got this response:

Server returned (400): IO error: /opt/influxdb/shared/data/db/shard_db_v2/00190/MANIFEST-000006: No such file or directory

My InfluxDB is 0.8.3 and has debug logging enabled,
but the log only contains messages matching
(GraphiteServer committing|Executing leader loop|Dumping the cluster config|Testing if we should|Checking for shards to drop), no other messages.
I also checked dmesg, no errors there. Ditto for /var/log/messages, nothing useful there.
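
For context, here is a rough sketch of the kind of write whisper-to-influxdb issues through the 0.8-era Go client named above. The connection details are placeholders and the config/struct field names are my best recollection of that client, so treat this as illustrative rather than authoritative:

// Illustrative only: roughly what one of the importer's writes looks like.
package main

import (
	"log"

	"github.com/influxdb/influxdb/client"
)

func main() {
	c, err := client.NewClient(&client.ClientConfig{
		Host:     "localhost:8086", // placeholder, not from the report
		Username: "root",           // placeholder
		Password: "root",           // placeholder
		Database: "graphite",
	})
	if err != nil {
		log.Fatal(err)
	}

	// One series using the (time, sequence_number, value) column layout from
	// the report; the real import batches tens of thousands of points.
	toCommit := []*client.Series{{
		Name:    "servers.dfvimeostatsd1.diskspace.root.inodes_free",
		Columns: []string{"time", "sequence_number", "value"},
		Points: [][]interface{}{
			{1349222400, 1, 1234567.0}, // unix seconds, sequence number, value
		},
	}}

	// Timestamps are plain unix seconds, hence client.Second precision.
	if err := c.WriteSeriesWithTimePrecision(toCommit, client.Second); err != nil {
		log.Printf("write failed: %v", err) // e.g. the 400 IO error above
	}
}

With roughly 60k points per series spanning two years of history, many of these batches target shards whose retention window has already expired, which is relevant to the race discussed later in this thread.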

@Dieterbe
Contributor Author

This might also be useful:

[root@dfvimeographite1 ~]# df -hT
Filesystem           Type   Size  Used Avail Use% Mounted on
/dev/mapper/vg0-lv0  ext4    20G  9.1G  9.7G  49% /
tmpfs                tmpfs   48G     0   48G   0% /dev/shm
/dev/sda1            ext4   194M   61M  123M  34% /boot
/dev/mapper/vg0-lv3  ext4   1.6T  255G  1.3T  17% /data
/dev/mapper/vg0-lv1  ext4    20G  8.1G   11G  43% /var
[root@dfvimeographite1 ~]# df -i
Filesystem              Inodes  IUsed     IFree IUse% Mounted on
/dev/mapper/vg0-lv0    1310720  88453   1222267    7% /
tmpfs                 12381474      1  12381473    1% /dev/shm
/dev/sda1                51200     46     51154    1% /boot
/dev/mapper/vg0-lv3  106889216 182482 106706734    1% /data
/dev/mapper/vg0-lv1    1310720  28957   1281763    3% /var

@Dieterbe
Contributor Author

When I manually retry the same write later, it works fine. So maybe it uses another dir by then, or was it a race condition between creating the dir and trying to use it?

@Dieterbe
Contributor Author

Dieterbe commented Oct 3, 2014

Today I got another, slightly more exotic variant of this:

Failed to write InfluxDB series 'servers.dfvimeodfs5.iostat.sdf1.iops' (159429 points): Server returned (400): Corruption: Can't access /000013.sst: IO error: /opt/influxdb/shared/data/db/shard_db_v2/00222//000013.sst: No such file or directory
Can't access /000015.sst: IO error: /opt/influxdb/shared/data/db/shard_db_v2/00222//000015.sst: No such file or directory
 (operation took 2.183744668s)

Nothing in particular in /var/log/messages or dmesg, and plenty of space and inodes available.

Again, resuming my program where it left off (i.e. retrying the same write that failed) seems to work fine.

@jvshahid
Contributor

Similar to #1009 and #1013. This is caused by a shard being closed and opened concurrently; that operation needs to be goroutine-safe. I'm not sure why the shards are being dropped, though. @Dieterbe, are you trying to write points in the past, or is the data collection lagging behind?
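
For readers following along, here is a minimal sketch, not the actual InfluxDB code, of what goroutine-safe open/close could look like: shard lookup/open and drop are serialized behind one lock, so a drop can't remove the LevelDB files (MANIFEST, *.sst) out from under an in-flight open. All names below are made up for illustration.

package shards

import "sync"

// Shard stands in for an open LevelDB-backed shard; openShard and
// removeShardDir are hypothetical helpers used only for illustration.
type Shard struct{ id uint32 }

func (s *Shard) Close()                   {}
func openShard(id uint32) (*Shard, error) { return &Shard{id: id}, nil }
func removeShardDir(id uint32) error      { return nil }

// ShardDatastore serializes shard open and drop behind one mutex.
type ShardDatastore struct {
	mu     sync.Mutex
	shards map[uint32]*Shard // currently open shards keyed by id
}

func NewShardDatastore() *ShardDatastore {
	return &ShardDatastore{shards: make(map[uint32]*Shard)}
}

// GetOrCreateShard opens the shard directory while holding the lock, so it
// can't race with DropShard removing that same directory.
func (ds *ShardDatastore) GetOrCreateShard(id uint32) (*Shard, error) {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	if s, ok := ds.shards[id]; ok {
		return s, nil
	}
	s, err := openShard(id)
	if err != nil {
		return nil, err
	}
	ds.shards[id] = s
	return s, nil
}

// DropShard closes the shard and removes its directory under the same lock,
// so no writer can be halfway through opening it.
func (ds *ShardDatastore) DropShard(id uint32) error {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	if s, ok := ds.shards[id]; ok {
		s.Close()
		delete(ds.shards, id)
	}
	return removeShardDir(id)
}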

@Dieterbe
Contributor Author

Yes, this is an import of old data, with timestamps anywhere between 2 years ago and now.

@jvshahid
Contributor

What are the retention and duration of those shards?

@Dieterbe
Contributor Author

Not sure, I have recreated the db a couple of times in the meantime. I think I've usually kept the shard duration at 7d; retention was probably 365 or 730 days. (It's possible that some of the points being written have timestamps older than what the shard cares about.)

@jvshahid
Contributor

Cool, just wanted to make sure my guess makes sense.

@jvshahid jvshahid self-assigned this Oct 21, 2014
* shard_datastore.go(DeleteShard): Check the reference count of the
  shard and mark it for deletion if there are still references
  outstanding. Otherwise, delete the shard immediately. Also refactor the
  deletion code into deleteShard(), see below.
* shard_datastore.go(ReturnShard): Check whether the shard is marked
  for deletion.
* shard_datastore.go(deleteShard): Refactor the code that used to be in
  DeleteShard into its own method. Use `closeShard` instead of doing the
  cleanup ourselves.
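
A hedged reading of those bullets, reworking the earlier sketch rather than quoting the real shard_datastore.go (the refcount/deleteOnClose names are assumptions): DeleteShard deletes immediately only when no references are outstanding, otherwise it marks the shard, and ReturnShard runs the deferred delete when the last reference comes back.

package shards

import "sync"

// Illustrative rework of the earlier sketch with reference counting.
type Shard struct{ id uint32 }

func (s *Shard) Close()              {}
func removeShardDir(id uint32) error { return nil }

type shardEntry struct {
	shard         *Shard
	refcount      int  // outstanding references handed to writers/queries
	deleteOnClose bool // set when DeleteShard ran while refcount > 0
}

type ShardDatastore struct {
	mu      sync.Mutex
	entries map[uint32]*shardEntry
}

// A GetShard counterpart would look up or open the shard and increment
// e.refcount under ds.mu; it is omitted here for brevity.

// DeleteShard deletes immediately if nobody holds the shard; otherwise it
// only marks the shard so the last ReturnShard can finish the job.
func (ds *ShardDatastore) DeleteShard(id uint32) error {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	if e, ok := ds.entries[id]; ok && e.refcount > 0 {
		e.deleteOnClose = true
		return nil
	}
	return ds.deleteShard(id)
}

// ReturnShard drops one reference and runs the deferred delete if requested.
func (ds *ShardDatastore) ReturnShard(id uint32) error {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	e, ok := ds.entries[id]
	if !ok {
		return nil
	}
	e.refcount--
	if e.refcount == 0 && e.deleteOnClose {
		return ds.deleteShard(id)
	}
	return nil
}

// deleteShard holds the cleanup that used to live in DeleteShard: close the
// shard (if still open) and remove its directory. Caller must hold ds.mu.
func (ds *ShardDatastore) deleteShard(id uint32) error {
	if e, ok := ds.entries[id]; ok {
		e.shard.Close()
		delete(ds.entries, id)
	}
	return removeShardDir(id)
}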
@jvshahid
Contributor

/cc @toddboom @dgnorton review please

@toddboom
Contributor

Looks good to me.

@dgnorton
Contributor

lgtm

jvshahid added a commit that referenced this pull request Oct 24, 2014
client.WriteSeries returns: Server returned (400):  IO error: /opt/influxdb/shared/data/db/shard_db_v2/00190/MANIFEST-000006: No such file or directory
@jvshahid jvshahid merged commit 97cd03c into master Oct 24, 2014
@jvshahid jvshahid deleted the fix-985 branch October 24, 2014 21:17
@jvshahid jvshahid removed the review label Oct 24, 2014
jvshahid added a commit that referenced this pull request Oct 31, 2014
Background of the bug: prior to this patch we actually tried writing
points that were older than the retention period of the shard. This
caused a race condition when writing points to a shard that is being
dropped, which will happen frequently if the user is (accidentally)
loading old data. This is demonstrated in the test in this commit. This
bug was previously addressed in #985, but it turns out the fix for #985
wasn't enough: a user reported in #1078 that some shards are left behind
and not deleted.

It turns out that while a shard is being dropped, more write
requests can come in and end up on line `cluster/shard.go:195`, which
causes the datastore to create a shard on disk that isn't tracked
anywhere in the metadata. That shard will live forever and never get
deleted. This fix addresses the issue by not writing old points at all.
There are still some edge cases with the current implementation, but at
least it's not as bad as current master.
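
A rough sketch of the guard that commit message describes, using made-up types rather than the real cluster code: points older than the retention period are dropped before any shard is looked up or created for them, so a backfill of two-year-old data can no longer race with (or resurrect) an expiring shard.

package cluster

import "time"

// Point and the retention handling here are stand-ins for illustration,
// not InfluxDB's actual cluster types.
type Point struct {
	Timestamp int64 // unix seconds, as in the writes above
	Value     float64
}

// filterExpired keeps only points newer than now minus the retention period.
// With the shard space in this thread (retention 365d), two-year-old points
// are dropped here instead of racing with the drop of an expired shard.
func filterExpired(points []Point, retention time.Duration, now time.Time) []Point {
	cutoff := now.Add(-retention).Unix()
	kept := points[:0] // filter in place, reusing the backing array
	for _, p := range points {
		if p.Timestamp >= cutoff {
			kept = append(kept, p)
		}
	}
	return kept
}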
jvshahid added a commit that referenced this pull request Nov 3, 2014

Close #1078