Unclear error when there is not enough space left #2255

nicolasvienot · 2022-03-21T17:17:06Z

Describe the bug
When trying to index documents in a kubernetes volume that does not have enough space left, Meilisearch returns the following error:

{
  "details": {
    "indexedDocuments": 0,
    "receivedDocuments": 46206
  },
  "duration": "PT512.665928885S",
  "enqueuedAt": "2022-03-17T14:58:24.465745303Z",
  "error": {
    "code": "internal",
    "link": "https://docs.meilisearch.com/errors#internal",
    "message": "I/O error (os error 5)",
    "type": "internal"
  },
  "finishedAt": "2022-03-17T15:06:57.168374834Z",
  "indexUid": "bgg",
  "startedAt": "2022-03-17T14:58:24.502445949Z",
  "status": "failed",
  "type": "documentAddition",
  "uid": 3
}

This error might also happen outside of a Kubernetes environment (not tested).

To Reproduce
Steps to reproduce the behavior:

Create a Meilisearch instance in a Kubernetes cluster and use a persistent volume to store the data.ms
Index documents until the volume is full

Expected behavior
A clear error that is documented should be returned

Meilisearch version:
v0.26.0

EDIT from @curquiza

How?

When Meilisearch does not have enough space on the machine, you get the following error:

{
    "message": "I/O error (os error 5).",
    "code": "internal",
    "type": "internal",
    "link": "https://docs.meilisearch.com/errors#internal"
}

We want to replace:

the code (not the type) internal with no_space_left_on_device
the link #internal with #no_space_left_on_device

⚠️ This is what the specification already mentioned and it looks like Meilisearch does not follow it yet: https://github.com/meilisearch/specifications/blob/main/text/0061-error-format-and-definitions.md#no_space_left_on_device

Impacted teams

Since this error is already in the spec (and so in the docs) and is supposed to exist, no one needs to be pinged.
https://docs.meilisearch.com/reference/errors/error_codes.html#no-space-left-on-device


MarinPostma · 2022-03-21T18:43:27Z

Uhm, this error is probably triggered when a file is being written to disk. Alas, the only thing we can know is that a write has failed, so I'm not so sure how we could improve that... @meilisearch/core-team any idea?

Kerollmops · 2022-03-22T17:29:08Z

@MarinPostma This error is returned by LMDB when the library tried to get a page from the free list and failed. I think that it is the OS that returns that error from the memory mapping system. I don't really know what we can do, but maybe add this input/output error to the list of errors in the documentation for now?

MarinPostma · 2022-03-22T17:35:19Z

This is an os error, don't you think that it could also be thrown when writing to a grenad file?

Kerollmops · 2022-03-23T09:45:08Z

Indeed, it seems like it could be thrown by grenad or anything that is trying to write on a disk that is full. I thought that the error could have been StorageFull or OutOfMemory.

So, I don't know if we can catch the error code and categorize the message as a NoMoreSpaceOnDevice user error?

MarinPostma · 2022-03-23T10:18:27Z

Seems to me that this could also be thrown on reads, and on other kinds of I/O errors, and it would be even worse if we miscategorized the error.

curquiza · 2022-03-23T10:21:16Z

Without creating a new error category, since the consequences for debugging could be bad, maybe we could just add a sentence to the error message like "This might be due to a lack of space on the device"?
This would orient the user without impacting us.
WDYT @gmourier?

MarinPostma · 2022-03-23T10:26:07Z

@curquiza but it could also send them in the wrong direction altogether. This is an error reported by the os, we don't know what happened, maybe the memory was corrupted, maybe they were using a network disk and the network failed... I don't think trying to guess what happened will be helpful :(

curquiza · 2022-03-23T10:46:33Z

It's not sending them in the wrong direction, it's giving the users a clue. We say "this might be due to..." and not "this is definitely..."
When reading this error:

the users will check their available space. On the SaaS, you can check it easily, and I personally ran into the problem (missing space) quickly when using it. So this clue can definitely help the SaaS users.
users who don't use the SaaS would also check the available space: this is a first investigation for them. Users would rather fix the problem alone than contact the support and wait for an answer.
users who still have enough space will report the error by saying "I also checked that I have enough space on my device, which is the case, but I still have the issue". That is even valuable for us, since we will not have to ask the question "do you have enough space on your device?"

gmourier · 2022-03-23T11:02:56Z

Won't the fact that LMDB can't shrink the space previously used when emptying an index (without deleting it) cause problems on the cloud side when displaying the used space? @nicolasvienot

nicolasvienot · 2022-03-24T13:55:38Z

@gmourier Yes, totally, and it is already a problem. We display the space used in the volume, and the value does not seem accurate. We need to investigate more, but I think this should be a separate GitHub issue.

curquiza · 2022-06-16T16:33:44Z

As discussed with @gmourier, here is my suggestion to improve this specific user message

"error": {
     "code": "internal",
     "link": "https://docs.meilisearch.com/errors#internal",
     "message": "I/O error (os error 5). This might be due to a lack of space on the device. If not, please contact us.",
     "type": "internal"
},
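A minimal sketch of how such a hint could be appended, assuming the underlying failure is available as a Rust `std::io::Error`. The helper name `message_with_hint` and the constant `SPACE_HINT` are hypothetical and are not taken from the Meilisearch code base; the error code and type would stay `internal` in this variant.

```rust
use std::io;

/// Hint text taken from the suggestion above.
const SPACE_HINT: &str =
    "This might be due to a lack of space on the device. If not, please contact us.";

/// Hypothetical helper: builds the user-facing message for an I/O error.
/// The hint is only appended for OS error 5, the code seen in the reports;
/// every other error keeps its original text.
fn message_with_hint(err: &io::Error) -> String {
    match err.raw_os_error() {
        Some(5) => format!("{}. {}", err, SPACE_HINT),
        _ => err.to_string(),
    }
}
```

Calling `message_with_hint(&io::Error::from_raw_os_error(5))` yields the OS text followed by the hint. The exact OS text in front of `(os error 5)` depends on the platform.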

@Kerollmops is it technically doable?

@gmourier do you validate this solution?

If I get 2 yeses, I will mark this issue as a good first issue and will update the spec once it's fixed 😇

gmourier · 2022-06-16T16:38:00Z

I validate your message suggestion!

Kerollmops · 2022-06-30T15:09:19Z

@Kerollmops is it technically doable?

I would say yes, we should be able to retrieve the os error code (5) and change that.

Kerollmops · 2022-06-30T15:51:58Z

Hey @meilisearch/cloud-team,

Are you sure it is an os error 5 that you get and not an os error 28? Error 5 seems to be access denied, whereas 28 is related to a lack of space on the device.

nicolasvienot · 2022-07-05T05:48:02Z

Hey @Kerollmops,
I just did the test again, trying to index documents when there is no space left on the volume.
Here is the failed task with Meilisearch v0.27.2:

        {
            "uid": 24,
            "indexUid": "movies20",
            "status": "failed",
            "type": "documentAddition",
            "details": {
                "receivedDocuments": 31968,
                "indexedDocuments": 0
            },
            "error": {
                "message": "An internal error has occurred. `I/O error (os error 5)`.",
                "code": "internal",
                "type": "internal",
                "link": "https://docs.meilisearch.com/errors#internal"
            },
            "duration": "PT92.749877984S",
            "enqueuedAt": "2022-07-04T22:57:33.046582055Z",
            "startedAt": "2022-07-04T23:17:53.517704393Z",
            "finishedAt": "2022-07-04T23:19:26.267582377Z"
        },

It should not be an access-denied problem, as the previous task went well.

irevoire · 2022-07-05T13:28:38Z

I can reproduce the issue as well; I created a really smol partition on my Linux machine and started indexing documents; here's the result I got;

{
  "details": {
    "indexedDocuments": 0,
    "receivedDocuments": 19547
  },
  "duration": "PT11.489335705S",
  "enqueuedAt": "2022-07-05T13:25:58.967517187Z",
  "error": {
    "code": "internal",
    "link": "https://docs.meilisearch.com/errors#internal",
    "message": "An internal error has occurred. `Input/output error (os error 5)`.",
    "type": "internal"
  },
  "finishedAt": "2022-07-05T13:26:10.463413698Z",
  "indexUid": "mieli",
  "startedAt": "2022-07-05T13:25:58.974077993Z",
  "status": "failed",
  "type": "documentAdditionOrUpdate",
  "uid": 1
}

I'm running under Linux 5.18.8.

curquiza · 2022-07-05T13:35:03Z

@Kerollmops could we catch error 5 AND 28 in this case to return this custom error?

Kerollmops · 2022-07-05T15:01:24Z

Note that for the same os error 5, the two systems return different text: Tamo's says Input/output error while Nico's says I/O error. However, it shouldn't be an issue if we are able to directly catch the io::Error raw number.
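To illustrate matching on the raw number instead of the text, here is a small sketch using `std::io::Error::raw_os_error`. The helper name `may_be_disk_full` and the two constants are hypothetical; the numbers are the ones reported in this thread on Linux.

```rust
use std::io;

/// OS error numbers observed in this thread when the volume was full (Linux).
const EIO: i32 = 5; // reported as "I/O error" or "Input/output error"
const ENOSPC: i32 = 28; // reported as "No space left on device"

/// Hypothetical helper: true when the raw OS code is one of the two codes
/// reported here. The human-readable text is never inspected, so differing
/// wording between systems does not matter.
fn may_be_disk_full(err: &io::Error) -> bool {
    matches!(err.raw_os_error(), Some(EIO) | Some(ENOSPC))
}
```

An error built with `io::Error::new` carries no raw OS code, so it never matches, even if its message happens to contain the text "os error 5".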

irevoire · 2022-07-05T17:33:46Z

Lol, I wanted to reproduce the issue for @Kerollmops. I did the EXACT same thing and it threw another error;

{
  "details": {
    "indexedDocuments": 0,
    "receivedDocuments": 10271
  },
  "duration": "PT9.279810746S",
  "enqueuedAt": "2022-07-05T17:30:30.714754378Z",
  "error": {
    "code": "internal",
    "link": "https://docs.meilisearch.com/errors#internal",
    "message": "No space left on device (os error 28)",
    "type": "internal"
  },
  "finishedAt": "2022-07-05T17:30:40.009485687Z",
  "indexUid": "mieli",
  "startedAt": "2022-07-05T17:30:30.729674941Z",
  "status": "failed",
  "type": "documentAdditionOrUpdate",
  "uid": 1
}

And here is what we get from the system;

    Err(
        Milli(
            IoError(
                Os {
                    code: 28,
                    kind: StorageFull,
                    message: "No space left on device",
                },
            ),
        ),
    )

I then redid the same thing and got the first error again;

{
  "details": {
    "indexedDocuments": 0,
    "receivedDocuments": 19547
  },
  "duration": "PT11.699438861S",
  "enqueuedAt": "2022-07-05T17:32:36.427322701Z",
  "error": {
    "code": "internal",
    "link": "https://docs.meilisearch.com/errors#internal",
    "message": "An internal error has occurred. `Input/output error (os error 5)`.",
    "type": "internal"
  },
  "finishedAt": "2022-07-05T17:32:48.14102773Z",
  "indexUid": "mieli",
  "startedAt": "2022-07-05T17:32:36.441588869Z",
  "status": "failed",
  "type": "documentAdditionOrUpdate",
  "uid": 1
}

    Err(
        Internal(
            Io(
                Os {
                    code: 5,
                    kind: Uncategorized,
                    message: "Input/output error",
                },
            ),
        ),
    )

irevoire · 2022-07-05T17:37:46Z

OK SO BIG NEWS, it's worse than I thought.

It's actually reproducible, but I have NO IDEA how we could determine what causes the first or the second error to be thrown.
What I know is that, if I index the movies.json datasets first and then the nested_movies.json I get os error 28.
But if I index the nested_movies.json and then the movies.json then I get the os error 5.

mmachatschek · 2022-07-13T06:43:56Z

gmourier Yes, totally, and it is already a problem. We display the space used in the volume, and the value does not seem accurate. We need to investigate more, but I think this should be a separate GitHub issue.

@nicolasvienot is it possible that the system you were testing had many incremental updates (huge /tasks backlog) to the index? I'm running into the same issue, where my index should be just a few MB big but the task queue is cluttered with old processed tasks (hitting the /tasks route will also kill my process because the size is too big) as we currently have many small updates to the index.

(as you wrote, this should be a separate GH issue, I just wanted to point it out here)

gmourier · 2022-07-13T08:05:16Z

Hey @mmachatschek !

(hitting the /tasks route will also kill my process because the size is too big)

We have released v0.28, which finally brings task pagination and filtering (by index, type, and status). You should not have this problem anymore if you upgrade to this version!

mmachatschek · 2022-07-13T08:34:38Z

@gmourier 👍 already working with that version.

Won't the fact that LMDB can't shrink the space previously used when emptying an index (without deleting it) cause problems on the cloud side when displaying the used space? nicolasvienot

What I observed (with meilisearch v0.27 and v0.28) is that the data.ms/data.mdb file in the root folder stays at the same file size even if all indexes were removed (the index files live in the data.ms/indexes folder anyway).
As I have a huge amount of tasks in meilisearch that are not cleared automatically by meilisearch, my guess was/is that the tasks table will take up all the space in that file.

Kerollmops · 2022-07-13T08:57:29Z

@gmourier 👍 already working with that version.

That is strange that Meilisearch is crashing when only 20 task statuses are returned, the engine doesn't even deserialize the whole data from the tasks, only the metadata. Is the engine being killed by the OS or is it crashing with a message?

As I have a huge amount of tasks in meilisearch that are not cleared automatically by meilisearch, my guess was/is that the tasks table will take up all the space in that file.

Tasks data (the update content) should be removed from the disk when processed, only the task statuses (metadata) should be kept in the data.ms/data.mdb LMDB environment. LMDB grows to a maximum size, this size will stabilize at a point and should not grow more, you should probably read the linked blog post on our documentation page, it has been written by the maintainer of LMDB.

mmachatschek · 2022-07-19T18:39:33Z

That is strange that Meilisearch is crashing when only 20 task statuses are returned, the engine doesn't even deserialize the whole data from the tasks, only the metadata. Is the engine being killed by the OS or is it crashing with a message?

The crash happened when using the v0.27.2 version. Tasks are returned as soon as I use v0.28.0

Tasks data (the update content) should be removed from the disk when processed, only the task statuses (metadata) should be kept in the data.ms/data.mdb LMDB environment. LMDB grows to a maximum size, this size will stabilize at a point and should not grow more, you should probably read the linked blog post on our documentation page, it has been written by the maintainer of LMDB.

The updates folder in data.ms was only 10MB big, I had problems with the data.ms/data.mdb file, which also saves the task db entries. I created an issue report here: #2628

curquiza · 2022-09-08T15:39:10Z

Hello here! Some updates after discussing with @Kerollmops: a PR has indeed been open for a long time for this issue on the Milli side: meilisearch/milli#580

What this PR does: for the os error codes 5 and 28, the following Meilisearch error will be returned

{
    "message": "There is no more space left on the device. Consider increasing the size of the disk/partition.",
    "code": "no_space_left_on_device",
    "type": "internal",
    "link": "https://docs.meilisearch.com/errors#no_space_left_on_device"
}

instead of the error code internal (note the link has also changed)
According to the spec and the docs, this is a bug fix
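As a rough illustration of the payload shape, the sketch below models the four fields of the error object in Rust. The struct `ResponseError` and the constructor `internal_error` are hypothetical names for this example. Deriving the link from the code is an assumption based only on the examples in this thread, where the link anchor always equals the error code.

```rust
/// Fields of the error object shown in this thread.
#[derive(Debug, PartialEq)]
struct ResponseError {
    message: String,
    code: &'static str,
    error_type: &'static str, // appears as "type" in the JSON payload
    link: String,
}

/// Hypothetical constructor for an error of type "internal".
/// Assumption: the link anchor is always the error code itself.
fn internal_error(code: &'static str, message: &str) -> ResponseError {
    ResponseError {
        message: message.to_string(),
        code,
        error_type: "internal",
        link: format!("https://docs.meilisearch.com/errors#{}", code),
    }
}
```

With this shape, switching the code from `internal` to `no_space_left_on_device` changes the link at the same time, which matches the two differences noted above.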

If we don't see any internal problem with this PR, this could be integrated in Meilisearch v0.30.0

curquiza · 2022-10-17T10:20:10Z

We investigated how to fix this, and it is not an easy improvement to make; we will try to do it for v1, but it is impossible for v0.30.0, sorry! 😢

curquiza · 2022-11-29T17:40:31Z

For people watching this issue: investigations and initial work have been done on the milli side: meilisearch/milli#580

irevoire · 2022-12-19T16:55:11Z

Ok, so after a meeting with @dureuill and another with @nicolasvienot, we realized that entirely overriding the os error 5 could have huge drawbacks, so the final proposal is to:

Create a no_space_left_on_device error code for the os error 28 (which happens sometimes)
Create an io_error error code for the os error 5
Still let the kernel(?) generate the first part of the error message for the os error 5, and append some extra info at the end of the message:
Input/output error (os error 5). This error generally happens when you have no space left on device or when your database doesn't have read or write right.
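The proposal above can be sketched as a small mapping from the raw OS error number to an error code and message. This is only an illustration under stated assumptions, not the code of the merged PR: the function name `classify` and the constant `IO_HINT` are hypothetical, and leaving every other error as `internal` is a simplification for this sketch.

```rust
use std::io;

/// Extra text proposed above for OS error 5.
const IO_HINT: &str = "This error generally happens when you have no space left on device or when your database doesn't have read or write right.";

/// Hypothetical mapping from an I/O error to (error code, message).
fn classify(err: &io::Error) -> (&'static str, String) {
    match err.raw_os_error() {
        // OS error 28: dedicated code, OS message kept as is.
        Some(28) => ("no_space_left_on_device", err.to_string()),
        // OS error 5: ambiguous, so keep the OS text and append the hint.
        Some(5) => ("io_error", format!("{}. {}", err, IO_HINT)),
        // Anything else stays an internal error in this sketch.
        _ => ("internal", err.to_string()),
    }
}
```

For example, `classify(&io::Error::from_raw_os_error(5))` returns the code `io_error` together with the OS message followed by the hint, while the OS-provided part of the message is left untouched.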

Implementation:

Create and merge the PR
Update the spec

@meilisearch/docs-team You might be interested in this change; it introduces error changes: no_space_left_on_device is already in your docs, but io_error will be a newly created one. The spec will be updated accordingly.
Also, you might want to review the error message 👀

curquiza added enhancement New feature or improvement feature request & feedback Go to https://github.com/meilisearch/product/ labels Mar 23, 2022

curquiza added the milli Related to the milli workspace label Mar 24, 2022

curquiza added error handler Issues related to the returned errors in Meilisearch and removed feature request & feedback Go to https://github.com/meilisearch/product/ labels May 18, 2022

curquiza changed the title ~~Unclear error when there is not enough space left on kubernetes persistent volume~~ Unclear error when there is not enough space left Jun 23, 2022

Kerollmops mentioned this issue Jul 6, 2022

Categorize the os error codes 5 and 28 as storage full user errors meilisearch/milli#580

Closed

curquiza added bug Something isn't working as expected and removed enhancement New feature or improvement labels Aug 31, 2022

curquiza added breaking change The related changes are breaking for the users and removed breaking change The related changes are breaking for the users labels Sep 8, 2022

curquiza added this to the v0.30.0 milestone Sep 8, 2022

curquiza modified the milestones: v0.30.0, v1.0.0 Oct 17, 2022

curquiza mentioned this issue Nov 21, 2022

Improve error handler for v1 #3095

Closed

irevoire mentioned this issue Dec 19, 2022

Handle most io error instead of tagging everything as an internal #3263

Merged

3 tasks

bors bot closed this as completed in 9925309 Dec 20, 2022

meili-bot added the v1.0.0 PRs/issues solved in v1.0.0 released on 2023-02-06 label Feb 7, 2023
