Unclear error when there is not enough space left #2255

Closed
nicolasvienot opened this issue Mar 21, 2022 · 29 comments · Fixed by #3263
Labels
bug (Something isn't working as expected) · error handler (Issues related to the returned errors in Meilisearch) · impacts cloud (This issue involves changes for the Meilisearch's cloud team) · milli (Related to the milli workspace) · v1.0.0 (PRs/issues solved in v1.0.0 released on 2023-02-06)
@nicolasvienot
Member

nicolasvienot commented Mar 21, 2022

Describe the bug
When trying to index documents in a kubernetes volume that does not have enough space left, Meilisearch returns the following error:

{
      "details": {
        "indexedDocuments": 0,
        "receivedDocuments": 46206
      },
      "duration": "PT512.665928885S",
      "enqueuedAt": "2022-03-17T14:58:24.465745303Z",
      "error": {
        "code": "internal",
        "link": "https://docs.meilisearch.com/errors#internal",
        "message": "I/O error (os error 5)",
        "type": "internal"
      },
      "finishedAt": "2022-03-17T15:06:57.168374834Z",
      "indexUid": "bgg",
      "startedAt": "2022-03-17T14:58:24.502445949Z",
      "status": "failed",
      "type": "documentAddition",
      "uid": 3
}

This error might also happen outside of a kubernetes environment, not tested.

To Reproduce
Steps to reproduce the behavior:

  1. Create a Meilisearch instance in a Kubernetes cluster and use a persistent volume to store the data.ms
  2. Index documents until the volume is full

Expected behavior
A clear error that is documented should be returned

Meilisearch version:
v0.26.0


EDIT from @curquiza

How?

When Meilisearch does not have enough space on the machine, you get the following error:

{
    "message": "I/O error (os error 5).",
    "code": "internal",
    "type": "internal",
    "link": "https://docs.meilisearch.com/errors#internal"
}

We want to replace

  • the code (not type) internal by no_space_left_on_device
  • the link #internal by #no_space_left_on_device

⚠️ This is what the specification already mentioned and it looks like Meilisearch does not follow it yet: https://github.com/meilisearch/specifications/blob/main/text/0061-error-format-and-definitions.md#no_space_left_on_device
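
As an illustration of the requested change, here is a minimal Rust sketch, not the actual Meilisearch code. It assumes the documentation link is always derived from the error code, following the `https://docs.meilisearch.com/errors#<code>` pattern visible in the payloads of this issue. The `ResponseError` struct and both function names are hypothetical.

```rust
/// Hypothetical mirror of the four fields shown in the error payload.
#[derive(Debug, PartialEq)]
struct ResponseError {
    message: String,
    code: &'static str,
    /// Appears as "type" in the JSON payload.
    error_type: &'static str,
    link: String,
}

/// Derives the documentation link from the error code (assumed pattern).
fn error_link(code: &str) -> String {
    format!("https://docs.meilisearch.com/errors#{}", code)
}

/// Builds the error with the new code; the type stays "internal".
fn no_space_left_on_device(message: &str) -> ResponseError {
    let code = "no_space_left_on_device";
    ResponseError {
        message: message.to_string(),
        code,
        error_type: "internal",
        link: error_link(code),
    }
}
```

Deriving the link from the code in one place means the two cannot drift apart, which is the mismatch this issue describes.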

Impacted teams

Since this error is already in the spec (and so in the docs) and is supposed to exist, nobody needs to be pinged.
https://docs.meilisearch.com/reference/errors/error_codes.html#no-space-left-on-device

@MarinPostma
Contributor

Uhm, this error is probably triggered when a file is being written to disk. Alas, the only thing we can know is that a write has failed, so I'm not sure how we could improve that... @meilisearch/core-team any idea?

@Kerollmops
Member

@MarinPostma This error is returned by LMDB when the library tried to get a page from the free list and failed. I think that it is the OS that returns that error from the memory mapping system. I don't really know what we can do, but maybe we could add this input/output error to the list of errors in the documentation for now?

@MarinPostma
Contributor

This is an os error, don't you think that it could also be thrown when writing to a grenad file?

@Kerollmops
Member

Kerollmops commented Mar 23, 2022

Indeed, it seems like it could be thrown by grenad or anything that is trying to write on a disk that is full. I thought that the error could have been StorageFull or OutOfMemory.

So, I don't know if we can catch the error code and categorize the message as a NoMoreSpaceOnDevice user error?

@curquiza curquiza added enhancement New feature or improvement feature request & feedback Go to https://github.com/meilisearch/product/ labels Mar 23, 2022
@MarinPostma
Contributor

Seems to me that this could also be thrown on reads, and on other kinds of I/O errors, and it would be even worse if we miscategorized the error

@curquiza
Member

curquiza commented Mar 23, 2022

Without creating a new error category, since the consequences for debugging could be bad, maybe we could just add a sentence in the error message like "This might be due to a lack of space on the device"?
This would orient the user without impacting us.
WDYT @gmourier?

@MarinPostma
Contributor

@curquiza but it could also send them in the wrong direction altogether. This is an error reported by the os, we don't know what happened, maybe the memory was corrupted, maybe they were using a network disk and the network failed... I don't think trying to guess what happened will be helpful :(

@curquiza
Member

curquiza commented Mar 23, 2022

It's not sending them in the wrong direction, it's giving a clue to the users. We say "this might be due to..." and not "this is definitely..."
When reading this error:

  • the users will check their available space. On the SaaS, you can check it easily, and I personally ran into the problem (missing space) quickly when using it. So this clue can definitely help the SaaS users.
    Users who don't use the SaaS would also check the available space: this is a first investigation for them. Users would rather fix the problem alone than contact the support and wait for an answer.
  • the users who still have enough space will report the error by saying "I also checked I have enough space on my device, which is the case, but I still have the issue". It's even worthwhile for us since we will not even have to ask the question "do you have enough space on your device?"

@gmourier
Member

Won't the fact that LMDB can't shrink the space previously used when emptying an index (without deleting it) cause problems on the cloud side when displaying the used space? @nicolasvienot

@nicolasvienot
Member Author

nicolasvienot commented Mar 24, 2022

@gmourier Yes, totally, and it is already a problem. We display the space used in the volume, and the value does not seem accurate. We need to investigate more, but I think this should be a separate GitHub issue.

@curquiza curquiza added the milli Related to the milli workspace label Mar 24, 2022
@curquiza curquiza added error handler Issues related to the returned errors in Meilisearch and removed feature request & feedback Go to https://github.com/meilisearch/product/ labels May 18, 2022
@curquiza
Member

curquiza commented Jun 16, 2022

As discussed with @gmourier, here is my suggestion to improve this specific user message

"error": {
     "code": "internal",
     "link": "https://docs.meilisearch.com/errors#internal",
     "message": "I/O error (os error 5). This might be due to a lack of space on the device. If not, please contact us.",
     "type": "internal"
},

@Kerollmops is it technically doable?

@gmourier do you validate this solution?

If I get 2 yeses, I will mark this issue as a good first issue and will update the spec once it's fixed 😇

@gmourier
Member

I validate your message suggestion!

@curquiza curquiza changed the title from "Unclear error when there is not enough space left on kubernetes persistent volume" to "Unclear error when there is not enough space left" Jun 23, 2022
@Kerollmops
Member

Kerollmops commented Jun 30, 2022

@Kerollmops is it technically doable?

I would say yes, we must be able to retrieve the os error code (5) and change that.

@Kerollmops
Member

Hey @meilisearch/cloud-team,

Are you sure it is an os error 5 that you get and not an os error 28? Error 5 seems to be access denied, whereas error 28 is related to a lack of space on the device.

@nicolasvienot
Member Author

nicolasvienot commented Jul 5, 2022

Hey @Kerollmops,
I just did the test again, trying to index documents when there is no space left on the volume.
Here is the failed task with Meilisearch v0.27.2:

{
    "uid": 24,
    "indexUid": "movies20",
    "status": "failed",
    "type": "documentAddition",
    "details": {
        "receivedDocuments": 31968,
        "indexedDocuments": 0
    },
    "error": {
        "message": "An internal error has occurred. `I/O error (os error 5)`.",
        "code": "internal",
        "type": "internal",
        "link": "https://docs.meilisearch.com/errors#internal"
    },
    "duration": "PT92.749877984S",
    "enqueuedAt": "2022-07-04T22:57:33.046582055Z",
    "startedAt": "2022-07-04T23:17:53.517704393Z",
    "finishedAt": "2022-07-04T23:19:26.267582377Z"
}

There should not be any access denied as the previous task went well.

[Screenshot, 2022-07-05 at 01:51:59]

@irevoire
Member

irevoire commented Jul 5, 2022

I can reproduce the issue as well. I created a really smol partition on my Linux machine and started indexing documents; here's the result I got:

{
  "details": {
    "indexedDocuments": 0,
    "receivedDocuments": 19547
  },
  "duration": "PT11.489335705S",
  "enqueuedAt": "2022-07-05T13:25:58.967517187Z",
  "error": {
    "code": "internal",
    "link": "https://docs.meilisearch.com/errors#internal",
    "message": "An internal error has occurred. `Input/output error (os error 5)`.",
    "type": "internal"
  },
  "finishedAt": "2022-07-05T13:26:10.463413698Z",
  "indexUid": "mieli",
  "startedAt": "2022-07-05T13:25:58.974077993Z",
  "status": "failed",
  "type": "documentAdditionOrUpdate",
  "uid": 1
}

I'm running under Linux 5.18.8.

@curquiza
Member

curquiza commented Jul 5, 2022

@Kerollmops could we catch error 5 AND 28 in this case to return this custom error?

@Kerollmops
Member

Note that for the same os error 5, both systems return a different text, the Tamo one says Input/Output error when the Nico one says I/O error. However, it shouldn't be an issue if we are able to directly catch the io::Error raw number.
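
To make the point about the raw number concrete, here is a minimal Rust sketch using only `std::io::Error::raw_os_error`, which returns the OS code regardless of the text the platform attaches to it. The function name and the two constants are assumptions for illustration. The values 5 and 28 are the Linux errno numbers (EIO and ENOSPC); on Windows, code 5 means access denied, so a real check would likely need to be platform-aware.

```rust
use std::io;

// Linux errno values seen in this issue (assumed Linux-only here).
const EIO: i32 = 5; // "Input/output error" or "I/O error", depending on the system
const ENOSPC: i32 = 28; // "No space left on device"

/// Hypothetical helper: true when the error carries one of the two OS codes
/// reported when the volume is full. It never inspects the message text.
fn may_be_out_of_space(err: &io::Error) -> bool {
    matches!(err.raw_os_error(), Some(EIO) | Some(ENOSPC))
}
```

An error built without an OS code (for example with `io::Error::new`) has `raw_os_error() == None` and is therefore never matched.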

@irevoire
Member

irevoire commented Jul 5, 2022

Lol, I wanted to reproduce the issue for @Kerollmops. I did the EXACT same thing and it threw another error:

{
  "details": {
    "indexedDocuments": 0,
    "receivedDocuments": 10271
  },
  "duration": "PT9.279810746S",
  "enqueuedAt": "2022-07-05T17:30:30.714754378Z",
  "error": {
    "code": "internal",
    "link": "https://docs.meilisearch.com/errors#internal",
    "message": "No space left on device (os error 28)",
    "type": "internal"
  },
  "finishedAt": "2022-07-05T17:30:40.009485687Z",
  "indexUid": "mieli",
  "startedAt": "2022-07-05T17:30:30.729674941Z",
  "status": "failed",
  "type": "documentAdditionOrUpdate",
  "uid": 1
}

And here is what we get from the system;

    Err(
        Milli(
            IoError(
                Os {
                    code: 28,
                    kind: StorageFull,
                    message: "No space left on device",
                },
            ),
        ),
    )

I then redid the same thing and got the first error again;

{
  "details": {
    "indexedDocuments": 0,
    "receivedDocuments": 19547
  },
  "duration": "PT11.699438861S",
  "enqueuedAt": "2022-07-05T17:32:36.427322701Z",
  "error": {
    "code": "internal",
    "link": "https://docs.meilisearch.com/errors#internal",
    "message": "An internal error has occurred. `Input/output error (os error 5)`.",
    "type": "internal"
  },
  "finishedAt": "2022-07-05T17:32:48.14102773Z",
  "indexUid": "mieli",
  "startedAt": "2022-07-05T17:32:36.441588869Z",
  "status": "failed",
  "type": "documentAdditionOrUpdate",
  "uid": 1
}
    Err(
        Internal(
            Io(
                Os {
                    code: 5,
                    kind: Uncategorized,
                    message: "Input/output error",
                },
            ),
        ),
    )

@irevoire
Member

irevoire commented Jul 5, 2022

OK SO BIG NEWS, it's worse than I thought.

It's actually reproducible, but I have NO IDEA how we could understand what causes the first or the second error to be thrown.
What I know is that if I index the movies.json dataset first and then nested_movies.json, I get os error 28.
But if I index nested_movies.json first and then movies.json, I get os error 5.

@mmachatschek

mmachatschek commented Jul 13, 2022

gmourier Yes, totally, and it is already a problem. We display the space used in the volume, and the value does not seem accurate. We need to investigate more, but I think this should be a separate GitHub issue.

@nicolasvienot is it possible that the system you were testing had many incremental updates (huge /tasks backlog) to the index? I'm running into the same issue, where my index should be just a few MB big but the task queue is cluttered with old processed tasks (hitting the /tasks route will also kill my process because the size is too big), as we currently have many small updates to the index.

(as you wrote, this should be a separate GH issue, I just wanted to point it out here)

@gmourier
Member

Hey @mmachatschek !

(hitting the /tasks route will also kill my process because the size is too big)

We have released v0.28, which finally brings task pagination and filtering (by index, type, and status). You should not have this problem anymore if you upgrade to this version!

@mmachatschek

mmachatschek commented Jul 13, 2022

@gmourier 👍 already working with that version.

Won't the fact that LMDB can't shrink the space previously used when emptying an index (without deleting it) cause problems on the cloud side when displaying the used space? nicolasvienot

What I observed (with meilisearch v0.27 and v0.28) is that the data.ms/data.mdb file in the root folder stays at the same file size even if all indexes were removed (the index files live in the data.ms/indexes folder anyway).
As I have a huge amount of tasks in meilisearch that are not cleared automatically by meilisearch, my guess was/is that the tasks table will take up all the space in that file.
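
For anyone who wants to repeat this observation, a small Rust sketch for reading the size of the file before and after removing indexes is given below. It only uses `std::fs::metadata`; the function name is made up for illustration, and the path would be whatever your own `data.ms/data.mdb` location is. Note that `len()` reports the apparent file size, not the blocks actually allocated on disk.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: returns the apparent size of a file in bytes,
/// e.g. for `data.ms/data.mdb`. Fails if the file does not exist.
fn file_size_bytes(path: &Path) -> io::Result<u64> {
    Ok(fs::metadata(path)?.len())
}
```

Calling it once before and once after deleting the indexes, for example with `file_size_bytes(Path::new("data.ms/data.mdb"))`, is enough to compare the two values.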

@Kerollmops
Member

@gmourier 👍 already working with that version.

It is strange that Meilisearch is crashing when only 20 task statuses are returned; the engine doesn't even deserialize the whole data from the tasks, only the metadata. Is the engine being killed by the OS or is it crashing with a message?

As I have a huge amount of tasks in meilisearch that are not cleared automatically by meilisearch, my guess was/is that the tasks table will take up all the space in that file.

Tasks data (the update content) should be removed from the disk when processed; only the task statuses (metadata) should be kept in the data.ms/data.mdb LMDB environment. LMDB grows to a maximum size; this size will stabilize at some point and should not grow further. You should probably read the blog post linked on our documentation page, which was written by the maintainer of LMDB.

@mmachatschek

mmachatschek commented Jul 19, 2022

It is strange that Meilisearch is crashing when only 20 task statuses are returned; the engine doesn't even deserialize the whole data from the tasks, only the metadata. Is the engine being killed by the OS or is it crashing with a message?

The crash happened when using the v0.27.2 version. Tasks are returned as soon as I use v0.28.0

Tasks data (the update content) should be removed from the disk when processed; only the task statuses (metadata) should be kept in the data.ms/data.mdb LMDB environment. LMDB grows to a maximum size; this size will stabilize at some point and should not grow further. You should probably read the blog post linked on our documentation page, which was written by the maintainer of LMDB.

The updates folder in data.ms was only 10MB big, I had problems with the data.ms/data.mdb file, which also saves the task db entries. I created an issue report here: #2628

@curquiza curquiza added bug Something isn't working as expected and removed enhancement New feature or improvement labels Aug 31, 2022
@curquiza curquiza added breaking change The related changes are breaking for the users and removed breaking change The related changes are breaking for the users labels Sep 8, 2022
@curquiza
Member

curquiza commented Sep 8, 2022

Hello here! Some updates after discussing with @Kerollmops: a PR has indeed been open for a long time for this issue on the Milli side: meilisearch/milli#580

What this PR does: for the os error codes 5 and 28, the following Meilisearch error will be returned

{
    "message": "There is no more space left on the device. Consider increasing the size of the disk/partition.",
    "code": "no_space_left_on_device",
    "type": "internal",
    "link": "https://docs.meilisearch.com/errors#no_space_left_on_device"
}

instead of the error code internal (note the link has also changed)
According to the spec and the docs, this is a bug fix

If we don't see any internal problem with this PR, this could be integrated in Meilisearch v0.30.0

@curquiza curquiza added this to the v0.30.0 milestone Sep 8, 2022
@curquiza curquiza modified the milestones: v0.30.0, v1.0.0 Oct 17, 2022
@curquiza
Member

We are investigating how to fix this; it is not an easy improvement to make. We will try to do it for v1, but it is impossible for v0.30.0, sorry! 😢

@curquiza
Member

For people watching this issue, investigations and initial work have been done on the milli side: meilisearch/milli#580

@curquiza curquiza added impacts cloud This issue involves changes for the Meilisearch's cloud team impacts docs This issue involves changes in the Meilisearch's documentation impacts integrations This issue involves changes in the Meilisearch's integrations and removed impacts docs This issue involves changes in the Meilisearch's documentation impacts integrations This issue involves changes in the Meilisearch's integrations labels Nov 29, 2022
@irevoire
Member

irevoire commented Dec 19, 2022

Ok, so after a meeting with @dureuill and another with @nicolasvienot, we realized that entirely overriding the os error 5 could have huge drawbacks, so the final proposal is to:

  1. Create a no_space_left_on_device error code for the os error 28 (which happens sometimes)
  2. Create an io_error error code for the os error 5
  3. Still let the kernel (?) generate the initial error message for the os error 5, and append some extra info at the end of the message:
    Input/output error (os error 5). This error generally happens when you have no space left on device or when your database doesn't have read or write right.

Implementation:

  • Create and merge the PR
  • Update the spec

@meilisearch/docs-team You might be interested in this change; it introduces error changes: no_space_left_on_device is already in your docs, but io_error will be a newly created one. The spec will be updated accordingly.
Also, you might want to review the error message 👀
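
The three points of the proposal above could be sketched in Rust roughly as follows. This is an illustration under assumptions, not the merged implementation: the function name and the tuple return type are hypothetical, the fallback to `internal` for every other code is assumed from the earlier payloads, and 5 and 28 are taken as Linux errno values.

```rust
use std::io;

/// Hypothetical mapping from an I/O error to (error code, user-facing message).
fn describe_io_error(err: &io::Error) -> (&'static str, String) {
    match err.raw_os_error() {
        // os error 28: dedicated code, system message kept as is.
        Some(28) => ("no_space_left_on_device", err.to_string()),
        // os error 5: new code, system message first, then the extra hint.
        Some(5) => (
            "io_error",
            format!(
                "{}. This error generally happens when you have no space left on device or when your database doesn't have read or write right.",
                err
            ),
        ),
        // Anything else keeps the generic code (assumed fallback).
        _ => ("internal", err.to_string()),
    }
}
```

Keeping the system-generated text at the start of the message preserves the original diagnostic, while the appended sentence stays a hint rather than a claim about the cause.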

@bors bors bot closed this as completed in 9925309 Dec 20, 2022
@meili-bot meili-bot added the v1.0.0 PRs/issues solved in v1.0.0 released on 2023-02-06 label Feb 7, 2023
8 participants