Unclear error when there is not enough space left #2255
Comments
Uhm, this error is probably triggered when a file is being written to disk; alas, the only thing we can know is that a write has failed, so I'm not so sure how we could improve that... @meilisearch/core-team any idea?
@MarinPostma This error is returned by LMDB when the library tried to get a page from the free list and failed. I think that it is the OS that returns that error from the memory-mapping system. I don't really know what we can do, but maybe add this input/output error to the list of errors in the documentation for now?
This is an OS error; don't you think it could also be thrown when writing to a grenad file?
Indeed, it seems like it could be thrown by grenad or anything that is trying to write to a disk that is full. I thought the error could have been more specific, so I don't know if we can catch the error code and recategorize the message.
It seems to me that this could also be thrown on reads, and on other kinds of I/O errors, and it would be even worse if we miscategorized the error.
Without creating a new error category, since the consequences for debugging could be bad, maybe we could just add a sentence to the error message like "This might be due to a lack of space on the device"?
@curquiza but it could also send them in the wrong direction altogether. This is an error reported by the OS; we don't know what happened. Maybe the memory was corrupted, maybe they were using a network disk and the network failed... I don't think trying to guess what happened will be helpful :(
It's not sending them in the wrong direction, it's giving a clue to the users. We say "this might be due to..." and not "this is definitely...".
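The hint-only approach suggested above could be sketched like this; the `augment_io_message` helper is hypothetical and not Meilisearch's actual code, it just shows the idea of keeping the original OS error text and appending a hedged hint instead of guessing a new category:

```rust
use std::io;

// Hypothetical helper: keep the original OS error text, but append a
// hedged hint. The OS does not tell us *why* the write failed, only
// that it did, so the hint is phrased as "might be" on purpose.
fn augment_io_message(err: &io::Error) -> String {
    let base = format!("I/O error ({err})");
    format!("{base}. This might be due to a lack of space on the device.")
}

fn main() {
    // ENOSPC (28) on Linux; the same helper works for any io::Error.
    let err = io::Error::from_raw_os_error(28);
    println!("{}", augment_io_message(&err));
}
```

This keeps the error category untouched, which addresses the concern about miscategorization while still giving users a clue.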
Won't the fact that LMDB can't shrink the space previously used when emptying an index (without deleting it) cause problems on the cloud side when displaying the used space? @nicolasvienot
@gmourier Yes, totally, and it is already a problem. We display the space used in the volume, and the value does not seem accurate. We need to investigate more, but I think this should be a separate GitHub issue.
As discussed with @gmourier, here is my suggestion to improve this specific user message:

```json
"error": {
    "code": "internal",
    "link": "https://docs.meilisearch.com/errors#internal",
    "message": "I/O error (os error 5). This might be due to a lack of space on the device. If not, please contact us.",
    "type": "internal"
}
```

@Kerollmops is it technically doable? @gmourier do you validate this solution? If I get 2 yes, I will mark this issue accordingly.
I validate your message suggestion!
I would say yes, we must be able to retrieve the OS error code (5) and change that.
Hey @meilisearch/cloud-team, are you sure it is an os error 5 that you get and not an os error 28? Error 5 seems to be access denied, while 28 is related to a lack of space on the device.
Hey @Kerollmops,

```json
{
    "uid": 24,
    "indexUid": "movies20",
    "status": "failed",
    "type": "documentAddition",
    "details": {
        "receivedDocuments": 31968,
        "indexedDocuments": 0
    },
    "error": {
        "message": "An internal error has occurred. `I/O error (os error 5)`.",
        "code": "internal",
        "type": "internal",
        "link": "https://docs.meilisearch.com/errors#internal"
    },
    "duration": "PT92.749877984S",
    "enqueuedAt": "2022-07-04T22:57:33.046582055Z",
    "startedAt": "2022-07-04T23:17:53.517704393Z",
    "finishedAt": "2022-07-04T23:19:26.267582377Z"
}
```

There should not be any access denied, as the previous task went well.
I can reproduce the issue as well; I created a really smol partition on my Linux machine and started indexing documents. Here's the result I got:

```json
{
    "details": {
        "indexedDocuments": 0,
        "receivedDocuments": 19547
    },
    "duration": "PT11.489335705S",
    "enqueuedAt": "2022-07-05T13:25:58.967517187Z",
    "error": {
        "code": "internal",
        "link": "https://docs.meilisearch.com/errors#internal",
        "message": "An internal error has occurred. `Input/output error (os error 5)`.",
        "type": "internal"
    },
    "finishedAt": "2022-07-05T13:26:10.463413698Z",
    "indexUid": "mieli",
    "startedAt": "2022-07-05T13:25:58.974077993Z",
    "status": "failed",
    "type": "documentAdditionOrUpdate",
    "uid": 1
}
```

I'm running under Linux 5.18.8.
@Kerollmops could we catch error 5 AND 28 in this case to return this custom error?
Note that for the same os error 5, both systems return a different text: the Tamo one says Input/output error while the Nico one says I/O error. However, it shouldn't be an issue if we are able to directly catch the os error code.
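Matching on the raw OS error code, as suggested, sidesteps the differing message strings entirely. A minimal sketch; the `is_no_space` helper is an assumption for illustration, not the actual implementation:

```rust
use std::io;

// Hypothetical check: EIO (5) and ENOSPC (28) both show up in the
// reports above, so match on the raw OS error code rather than on the
// platform-dependent text ("Input/output error" vs "I/O error").
fn is_no_space(err: &io::Error) -> bool {
    matches!(err.raw_os_error(), Some(5) | Some(28))
}

fn main() {
    let enospc = io::Error::from_raw_os_error(28);
    let eacces = io::Error::from_raw_os_error(13); // permission denied
    assert!(is_no_space(&enospc));
    assert!(!is_no_space(&eacces));
    println!("raw-code matching works");
}
```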
Lol, I wanted to reproduce the issue for @Kerollmops. I did the EXACT same thing and it threw another error:

```json
{
    "details": {
        "indexedDocuments": 0,
        "receivedDocuments": 10271
    },
    "duration": "PT9.279810746S",
    "enqueuedAt": "2022-07-05T17:30:30.714754378Z",
    "error": {
        "code": "internal",
        "link": "https://docs.meilisearch.com/errors#internal",
        "message": "No space left on device (os error 28)",
        "type": "internal"
    },
    "finishedAt": "2022-07-05T17:30:40.009485687Z",
    "indexUid": "mieli",
    "startedAt": "2022-07-05T17:30:30.729674941Z",
    "status": "failed",
    "type": "documentAdditionOrUpdate",
    "uid": 1
}
```

And here is what we get from the system:

```
Err(
    Milli(
        IoError(
            Os {
                code: 28,
                kind: StorageFull,
                message: "No space left on device",
            },
        ),
    ),
)
```

I then redid the same thing and got the first error again:

```json
{
    "details": {
        "indexedDocuments": 0,
        "receivedDocuments": 19547
    },
    "duration": "PT11.699438861S",
    "enqueuedAt": "2022-07-05T17:32:36.427322701Z",
    "error": {
        "code": "internal",
        "link": "https://docs.meilisearch.com/errors#internal",
        "message": "An internal error has occurred. `Input/output error (os error 5)`.",
        "type": "internal"
    },
    "finishedAt": "2022-07-05T17:32:48.14102773Z",
    "indexUid": "mieli",
    "startedAt": "2022-07-05T17:32:36.441588869Z",
    "status": "failed",
    "type": "documentAdditionOrUpdate",
    "uid": 1
}
```

```
Err(
    Internal(
        Io(
            Os {
                code: 5,
                kind: Uncategorized,
                message: "Input/output error",
            },
        ),
    ),
)
```
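The two debug outputs above show why the raw code matters: on Linux, ENOSPC (28) maps to the `StorageFull` error kind, while EIO (5) only maps to `Uncategorized`, which cannot even be named in stable Rust. A small sketch of that asymmetry (assuming a Linux target):

```rust
use std::io;

fn main() {
    // Code 28 (ENOSPC) has a well-known error kind on Linux...
    let full = io::Error::from_raw_os_error(28);
    println!("os error 28 kind: {:?}", full.kind()); // StorageFull

    // ...while code 5 (EIO) only has `Uncategorized`, a kind that stable
    // code cannot name, so matching on the raw code is the reliable path.
    let eio = io::Error::from_raw_os_error(5);
    println!("os error 5 kind: {:?}", eio.kind());
    println!("os error 5 raw code: {:?}", eio.raw_os_error());
}
```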
OK SO BIG NEWS, it's worse than I thought. It's actually reproducible, but I have NO IDEA how we could figure out what makes the first or the second error be thrown.
@nicolasvienot is it possible that the system you were testing had many incremental updates (a huge /tasks backlog) to the index? I'm running into the same issue, where my index should be just a few MB big but the task queue is cluttered with old processed tasks (hitting the /tasks route will also kill my process because the response is too big), as we currently have many small updates to the index. (As you wrote, this should be a separate GH issue; I just wanted to point it out here.)
Hey @mmachatschek!
We have released v0.28, which finally brings task pagination and filtering (by index, type, and status). You should not have this problem anymore if you upgrade to this version!
@gmourier 👍 already working with that version.
What I observed (with Meilisearch v0.27 and v0.28) is that the
It is strange that Meilisearch crashes when only 20 task statuses are returned; the engine doesn't even deserialize the whole data from the tasks, only the metadata. Is the engine being killed by the OS, or is it crashing with a message?
Task data (the update content) should be removed from the disk when processed; only the task statuses (metadata) should be kept in the data.ms/data.mdb LMDB environment. LMDB grows to a maximum size; this size will stabilize at a point and should not grow more. You should probably read the linked blog post on our documentation page; it was written by the maintainer of LMDB.
The crash happened when using v0.27.2. Tasks are returned as soon as I use v0.28.0.
The updates folder in |
Hello here! Some updates after discussing with @Kerollmops: indeed, a PR has been open for a long time for this issue on the Milli side: meilisearch/milli#580. What this PR does: for the OS error codes 5 and 28, the following Meilisearch error will be returned:

```json
{
    "message": "There is no more space left on the device. Consider increasing the size of the disk/partition.",
    "code": "no_space_left_on_device",
    "type": "internal",
    "link": "https://docs.meilisearch.com/errors#no_space_left_on_device"
}
```

instead of the current internal error code. If we don't see any internal problem with this PR, this could be integrated in Meilisearch v0.30.0.
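The shape of the planned error could be sketched as a dedicated variant with the code and link wired in. This is an illustrative sketch only; `SketchError` and its methods are hypothetical names, not the actual types from milli PR #580:

```rust
use std::fmt;

// Hypothetical error variant mirroring the JSON above; the real PR's
// types and names may differ.
#[derive(Debug)]
enum SketchError {
    NoSpaceLeftOnDevice,
}

impl SketchError {
    fn code(&self) -> &'static str {
        match self {
            SketchError::NoSpaceLeftOnDevice => "no_space_left_on_device",
        }
    }

    // The docs link anchor is derived from the code, matching the
    // pattern https://docs.meilisearch.com/errors#<code>.
    fn link(&self) -> String {
        format!("https://docs.meilisearch.com/errors#{}", self.code())
    }
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::NoSpaceLeftOnDevice => write!(
                f,
                "There is no more space left on the device. \
                 Consider increasing the size of the disk/partition."
            ),
        }
    }
}

fn main() {
    let err = SketchError::NoSpaceLeftOnDevice;
    println!("{} ({})", err, err.link());
}
```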
We investigated how to fix this; this is not an easy improvement to make. We will try to do it for v1, but it is impossible for v0.30.0, sorry! 😢
For people watching this issue, investigations and first work have been done on the milli side: meilisearch/milli#580
Ok, so after a meeting with @dureuill and another with @nicolasvienot, we realized that entirely overriding the
Implementation:
@meilisearch/docs-team You might be interested in this change; it introduces error changes:
Describe the bug
When trying to index documents in a Kubernetes volume that does not have enough space left, Meilisearch returns the following error:
This error might also happen outside of a Kubernetes environment; not tested.
To Reproduce
Steps to reproduce the behavior:
data.ms
Expected behavior
A clear, documented error should be returned.
Meilisearch version:
v0.26.0
EDIT from @curquiza
How?
When Meilisearch does not have enough space left on the machine, you get the following error:
We want to replace the `internal` code by `no_space_left_on_device`, and the `#internal` link anchor by `#no_space_left_on_device`.
Impacted teams
Since this error is already in the spec (and so in the docs) and supposed to exist, there is no ping to do.
https://docs.meilisearch.com/reference/errors/error_codes.html#no-space-left-on-device