
Meilisearch stops responding when getting "enough" concurrent traffic #4654

Closed
savikko opened this issue May 27, 2024 · 9 comments
Labels
bug Something isn't working as expected v1.8.2 PRs/issues solved in v1.8.2
Milestone

Comments

@savikko

savikko commented May 27, 2024

Issue:
When running tests with k6 and going over a specific number of VUs (virtual users), Meilisearch just stops responding to HTTP requests.

How to replicate:

  1. Start Meilisearch with an empty database
  2. Create a test index there:
curl --request POST \
  --url http://localhost:7700/indexes \
  -H 'Content-Type: application/json' \
  --data '{
  "uid": "test",
  "primaryKey": "id"
}'
  3. Install k6 (https://k6.io/docs/get-started/installation/)
  4. Save the content below as meilisearch-test.js
  5. Run
k6 run --vus 10 --duration 10s meilisearch-test.js

It probably passes nicely; here it completes something like 200k requests.

With 50 VUs it also completes, about 170k requests.

But then, with 100 VUs:

k6 run --vus 100 --duration 10s meilisearch-test.js

It completes about 2700 requests and then Meilisearch on port 7700 does not answer anymore.

Having checked the TRACE and DEBUG logs, I cannot see anything that sheds more light on it, so I have no idea what actually crashes there. Actually, nothing crashes; it just stops responding.

The machine should have enough memory and CPU, but CPU/RAM usage is not even spiking.

I also tried playing with the --experimental-search-queue-size setting, but it made no noticeable difference.

Edit: it seems that setting it to 0 keeps Meilisearch alive.
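
For illustration, disabling the queue at startup would look like this (a sketch assuming a local binary; the database path is just an example):

# Start Meilisearch with the experimental search queue disabled,
# so search requests are not gated behind permits.
meilisearch --db-path ./meilidata --experimental-search-queue-size 0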

meilisearch-test.js:

import http from "k6/http";
import {
  randomString,
  randomIntBetween,
} from "https://jslib.k6.io/k6-utils/1.4.0/index.js";

export default function () {
  const params = {
    headers: {
      "content-type": "application/json",
    },
  };
  const url = "http://localhost:7700/indexes/test/search";
  const post = {
    q: randomString(randomIntBetween(2, 6)),
    limit: 10,
  };
  http.post(url, JSON.stringify(post), params);
}
@dureuill
Contributor

dureuill commented Jun 4, 2024

Hello!

Thank you for your report!

I tried to reproduce the issue following the procedure you detailed, but I wasn't successful. Here is what I observe:

  • with 10 VUs
     data_received..................: 41 MB  4.1 MB/s
     data_sent......................: 26 MB  2.6 MB/s
     http_req_blocked...............: avg=923ns    min=0s       med=1µs      max=1.49ms  p(90)=1µs     p(95)=2µs
     http_req_connecting............: avg=26ns     min=0s       med=0s       max=422µs   p(90)=0s      p(95)=0s
     http_req_duration..............: avg=635.95µs min=104µs    med=442µs    max=81.65ms p(90)=912µs   p(95)=1.24ms
       { expected_response:true }...: avg=635.95µs min=104µs    med=442µs    max=81.65ms p(90)=912µs   p(95)=1.24ms
     http_req_failed................: 0.00%  ✓ 0            ✗ 147918
     http_req_receiving.............: avg=14.02µs  min=3µs      med=8µs      max=27.97ms p(90)=23µs    p(95)=34µs
     http_req_sending...............: avg=4.59µs   min=1µs      med=3µs      max=2ms     p(90)=8µs     p(95)=12µs
     http_req_tls_handshaking.......: avg=0s       min=0s       med=0s       max=0s      p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=617.32µs min=94µs     med=425µs    max=81.64ms p(90)=890µs   p(95)=1.22ms
     http_reqs......................: 147918 14788.762388/s
     iteration_duration.............: avg=671.4µs  min=122.41µs med=476.47µs max=81.68ms p(90)=954.7µs p(95)=1.29ms
     iterations.....................: 147918 14788.762388/s
     vus............................: 10     min=10         max=10
     vus_max........................: 10     min=10         max=10
  • with 100 VUs
     data_received..................: 42 MB  4.2 MB/s
     data_sent......................: 26 MB  2.6 MB/s
     http_req_blocked...............: avg=2.46µs  min=0s       med=1µs    max=5.56ms  p(90)=2µs     p(95)=2µs
     http_req_connecting............: avg=852ns   min=0s       med=0s     max=1.99ms  p(90)=0s      p(95)=0s
     http_req_duration..............: avg=6.69ms  min=157µs    med=5.6ms  max=91.61ms p(90)=11.91ms p(95)=15.18ms
       { expected_response:true }...: avg=6.89ms  min=157µs    med=5.79ms max=91.61ms p(90)=12.12ms p(95)=15.45ms
     http_req_failed................: 5.90%  ✓ 8736         ✗ 139280
     http_req_receiving.............: avg=23.76µs min=3µs      med=9µs    max=15.94ms p(90)=31µs    p(95)=56µs
     http_req_sending...............: avg=7.37µs  min=1µs      med=3µs    max=7.37ms  p(90)=9µs     p(95)=16µs
     http_req_tls_handshaking.......: avg=0s      min=0s       med=0s     max=0s      p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=6.65ms  min=142µs    med=5.57ms max=91.5ms  p(90)=11.87ms p(95)=15.13ms
     http_reqs......................: 148016 14792.532178/s
     iteration_duration.............: avg=6.74ms  min=194.41µs med=5.65ms max=91.64ms p(90)=11.97ms p(95)=15.26ms
     iterations.....................: 148016 14792.532178/s
     vus............................: 100    min=100        max=100
     vus_max........................: 100    min=100        max=100
  • with 1000 VUs
     data_received..................: 41 MB  4.1 MB/s
     data_sent......................: 25 MB  2.5 MB/s
     http_req_blocked...............: avg=51.83µs min=0s     med=1µs     max=101.65ms p(90)=2µs      p(95)=3µs
     http_req_connecting............: avg=36.97µs min=0s     med=0s      max=101.63ms p(90)=0s       p(95)=0s
     http_req_duration..............: avg=69.03ms min=0s     med=66.17ms max=199.3ms  p(90)=101.5ms  p(95)=117.69ms
       { expected_response:true }...: avg=69.36ms min=5.93ms med=66.44ms max=199.3ms  p(90)=101.81ms p(95)=118.05ms
     http_req_failed................: 6.41%  ✓ 9278         ✗ 135364
     http_req_receiving.............: avg=25.05µs min=0s     med=9µs     max=26.62ms  p(90)=32µs     p(95)=57µs
     http_req_sending...............: avg=11.43µs min=0s     med=3µs     max=22.99ms  p(90)=10µs     p(95)=18µs
     http_req_tls_handshaking.......: avg=0s      min=0s     med=0s      max=0s       p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=68.99ms min=0s     med=66.15ms max=199.29ms p(90)=101.42ms p(95)=117.64ms
     http_reqs......................: 144642 14371.752266/s
     iteration_duration.............: avg=69.15ms min=3.47ms med=66.22ms max=212.48ms p(90)=101.63ms p(95)=117.87ms
     iterations.....................: 144642 14371.752266/s
     vus............................: 1000   min=1000       max=1000
     vus_max........................: 1000   min=1000       max=1000
  • with 10_000 VUs
     data_received..................: 17 MB  1.6 MB/s
     data_sent......................: 11 MB  961 kB/s
     http_req_blocked...............: avg=61.03ms  min=0s        med=1µs   max=903.89ms p(90)=402.87ms p(95)=464.05ms
     http_req_connecting............: avg=60.93ms  min=0s        med=0s    max=820.7ms  p(90)=402.7ms  p(95)=462.81ms
     http_req_duration..............: avg=1.5s     min=0s        med=1.49s max=2.81s    p(90)=2.32s    p(95)=2.48s
       { expected_response:true }...: avg=1.58s    min=57.9ms    med=1.52s max=2.81s    p(90)=2.34s    p(95)=2.49s
     http_req_failed................: 10.26% ✓ 6564        ✗ 57374
     http_req_receiving.............: avg=110.17µs min=0s        med=26µs  max=32.8ms   p(90)=176µs    p(95)=420µs
     http_req_sending...............: avg=957.9µs  min=-213000ns med=6µs   max=80.53ms  p(90)=1ms      p(95)=4.36ms
     http_req_tls_handshaking.......: avg=0s       min=0s        med=0s    max=0s       p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=1.5s     min=0s        med=1.49s max=2.81s    p(90)=2.32s    p(95)=2.48s
     http_reqs......................: 63938  5766.021004/s
     iteration_duration.............: avg=1.6s     min=334.38ms  med=1.53s max=2.81s    p(90)=2.32s    p(95)=2.49s
     iterations.....................: 63938  5766.021004/s
     vus............................: 6970   min=6970      max=10000
     vus_max........................: 10000  min=10000     max=10000

Following this, a:

 curl --request GET  \
  --url http://localhost:7700/indexes/test/search

still answers

and so does:

curl --request POST -H 'Content-Type: application/json' \
  --url http://localhost:7700/indexes/test/search \
  --data '{}'
{"hits":[],"query":"","processingTimeMs":0,"limit":20,"offset":0,"estimatedTotalHits":0}
To investigate further, could you share:

  1. What is your Meilisearch version?
  2. What is your Meilisearch instance running on? OS, specs, etc.
  3. Can you confirm that the faulty behavior you experience is Meilisearch no longer responding at all following a k6 test with 100 VUs?

@dureuill dureuill added the needs more info This issue needs a minimal complete and verifiable example label Jun 4, 2024
@savikko
Author

savikko commented Jun 5, 2024

Thanks for digging into this.

So I tried this again.

The computer I am running this on is a MacBook Air, M2, 16 GB RAM, macOS Sonoma 14.5. This also happened on a fly.io instance (using the Docker image, x64).

I am in an empty directory and I started Meilisearch with this:

meilisearch --db-path ./meilidata

Output (so the version is 1.8.1, installed via Homebrew):

Config file path:	"none"
Database path:		"./meilidata"
Server listening on:	"http://localhost:7700"
Environment:		"development"
Commit SHA:		"unknown"
Commit date:		"unknown"
Package version:	"1.8.1"

Then I created test index:

curl --request POST \
  --url http://localhost:7700/indexes \
  -H 'Content-Type: application/json' \
  --data '{
  "uid": "test",
  "primaryKey": "id"
}'

Then, running the meilisearch-test.js:

k6 run --vus 10 --duration 10s meilisearch-test.js

Everything goes fine:

     data_received..................: 5.8 MB 577 kB/s
     data_sent......................: 3.6 MB 365 kB/s
     http_req_blocked...............: avg=1.23µs min=0s       med=0s     max=1.76ms  p(90)=1µs    p(95)=1µs    
     http_req_connecting............: avg=328ns  min=0s       med=0s     max=806µs   p(90)=0s     p(95)=0s     
     http_req_duration..............: avg=4.77ms min=412µs    med=2.83ms max=88.34ms p(90)=7.72ms p(95)=24.83ms
       { expected_response:true }...: avg=4.77ms min=412µs    med=2.83ms max=88.34ms p(90)=7.72ms p(95)=24.83ms
     http_req_failed................: 0.00%  ✓ 0         ✗ 20846
     http_req_receiving.............: avg=7.5µs  min=3µs      med=6µs    max=744µs   p(90)=9µs    p(95)=12µs   
     http_req_sending...............: avg=3.06µs min=1µs      med=2µs    max=2.97ms  p(90)=3µs    p(95)=4µs    
     http_req_tls_handshaking.......: avg=0s     min=0s       med=0s     max=0s      p(90)=0s     p(95)=0s     
     http_req_waiting...............: avg=4.76ms min=404µs    med=2.82ms max=88.3ms  p(90)=7.71ms p(95)=24.82ms
     http_reqs......................: 20846  2084.1567/s
     iteration_duration.............: avg=4.79ms min=429.25µs med=2.85ms max=88.39ms p(90)=7.74ms p(95)=24.85ms
     iterations.....................: 20846  2084.1567/s
     vus............................: 10     min=10      max=10 
     vus_max........................: 10     min=10      max=10 


running (10.0s), 00/10 VUs, 20846 complete and 0 interrupted iterations
default ✓ [======================================] 10 VUs  10s

Then with 100 VUs:

     data_received..................: 663 kB 17 kB/s
     data_sent......................: 436 kB 11 kB/s
     http_req_blocked...............: avg=181.61µs min=0s       med=1µs    max=5.82ms  p(90)=2µs     p(95)=4µs    
     http_req_connecting............: avg=114.77µs min=0s       med=0s     max=3.89ms  p(90)=0s      p(95)=0s     
     http_req_duration..............: avg=6.96ms   min=204µs    med=2.91ms max=76.45ms p(90)=15.08ms p(95)=31.97ms
       { expected_response:true }...: avg=6.96ms   min=204µs    med=2.91ms max=76.45ms p(90)=15.08ms p(95)=31.97ms
     http_req_failed................: 0.00%  ✓ 0         ✗ 2392 
     http_req_receiving.............: avg=12.13µs  min=3µs      med=7µs    max=636µs   p(90)=23µs    p(95)=35µs   
     http_req_sending...............: avg=5.79µs   min=1µs      med=2µs    max=1.57ms  p(90)=9µs     p(95)=13µs   
     http_req_tls_handshaking.......: avg=0s       min=0s       med=0s     max=0s      p(90)=0s      p(95)=0s     
     http_req_waiting...............: avg=6.95ms   min=192µs    med=2.89ms max=76.39ms p(90)=15.06ms p(95)=31.96ms
     http_reqs......................: 2392   59.792883/s
     iteration_duration.............: avg=7.19ms   min=222.79µs med=2.94ms max=81.73ms p(90)=15.11ms p(95)=32ms   
     iterations.....................: 2392   59.792883/s
     vus............................: 100    min=100     max=100
     vus_max........................: 100    min=100     max=100


running (40.0s), 000/100 VUs, 2392 complete and 100 interrupted iterations
default ✓ [======================================] 100 VUs  10s

So it stopped at 2392 requests.

After that it did not respond at all. It still listens on port 7700, but nothing happens. Only restarting Meilisearch resolves this.

It seems that if I add --no-analytics --log-level=TRACE it stays alive longer.

If I add --experimental-search-queue-size 0, it works flawlessly. Of course, at some point it starts responding with 503 or so.

Hopefully this helps with reproducing the issue.

@dureuill
Contributor

dureuill commented Jun 5, 2024

OK, with this detailed report I have now reproduced the issue (but only with the Meilisearch from Homebrew so far).

I also think I understand a part of what is happening.
v1.8 introduced the search queue, which allows up to N concurrent search requests.

That means that before performing a search request, Meilisearch checks for a permit to do so. I think the sequence of events is roughly the following:

  1. All permits are exhausted by the numerous requests
  2. A permit is somehow leaked
  3. All request handler threads are busy waiting for a permit to perform a search query, so none of them is ready to accept a new connection
  4. Meilisearch does not answer anymore

This explains why you're not seeing the freeze with --experimental-search-queue-size 0: setting it to 0 disables the search queue.
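
To illustrate the suspected mechanism, here is a minimal sketch, not Meilisearch's actual code, of a semaphore-backed search queue in which a leaked permit wedges every later request. It assumes tokio, and uses forget() to stand in for whatever leaks the permit in practice:

// Minimal sketch, assuming tokio: the "search queue" is a bounded semaphore.
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::time::{timeout, Duration};

#[tokio::main]
async fn main() {
    // Pretend the queue allows 2 concurrent searches.
    let queue = Arc::new(Semaphore::new(2));

    // Two requests take the only permits and never return them;
    // forget() models the leak, whatever causes it in practice.
    queue.acquire().await.unwrap().forget();
    queue.acquire().await.unwrap().forget();

    // Every later request now waits here: the process stays alive and keeps
    // listening on its port, but no search ever completes -- the observed freeze.
    match timeout(Duration::from_secs(1), queue.acquire()).await {
        Ok(_) => println!("got a permit"),
        Err(_) => println!("timed out waiting for a permit: the queue is wedged"),
    }
}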

I'll try to reproduce with a debug build, but it might be tricky since it looks timing-related. I'll keep you informed.

@dureuill dureuill added bug Something isn't working as expected and removed needs more info This issue needs a minimal complete and verifiable example labels Jun 5, 2024
@dureuill dureuill added this to the v1.9.0 milestone Jun 5, 2024
@dureuill
Contributor

dureuill commented Jun 5, 2024

Opened #4681 with a fix. Waiting to decide whether it lands in a v1.8.2 or a v1.9.0.

Thank you again for your report ❤️

Meanwhile, as described, there is the workaround of setting --experimental-search-queue-size=0.

meili-bors bot added a commit that referenced this issue Jun 10, 2024
4681: Fix concurrency issue r=irevoire a=dureuill

# Pull Request

## Related issue
Fixes #4654 

## What does this PR do?
- Asynchronously drop permits


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
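
To give an idea of what "asynchronously drop permits" can mean, here is a hypothetical sketch (not the actual PR code): the permit hands itself back to the queue through a channel from its Drop impl, so it is returned even if the request future is cancelled mid-search, and the give-back never blocks the handler:

// Hypothetical sketch of the "asynchronously drop permits" idea, assuming tokio.
use tokio::sync::mpsc;

struct Permit {
    // Channel back to the search queue; Some until the permit has been returned.
    giver_back: Option<mpsc::UnboundedSender<()>>,
}

impl Drop for Permit {
    fn drop(&mut self) {
        // An unbounded send never blocks, so it is safe from Drop; the queue
        // task receives the message asynchronously and wakes the next search.
        if let Some(tx) = self.giver_back.take() {
            let _ = tx.send(());
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel();
    {
        let _permit = Permit { giver_back: Some(tx) };
        // Dropped here, e.g. because the client aborted the request.
    }
    // The queue side still gets the permit back.
    assert!(rx.recv().await.is_some());
    println!("permit returned to the queue despite the early drop");
}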
@curquiza curquiza modified the milestones: v1.9.0, v1.8.2 Jun 11, 2024
@curquiza curquiza added the v1.8.2 PRs/issues solved in v1.8.2 label Jun 11, 2024
@curquiza
Member

Closed by #4681; the fix is now in v1.8.2.

@martinnj

I am unsure whether I should open a new issue or post here, so please tell me if this isn't the place.

We're still experiencing this issue, or one very similar, on 1.8.2. Meilisearch will randomly stop processing HTTP requests.
The process is still alive, there are no errors in any of the logs, and the sockets are still open; we just never get a response.

This also happens even with --experimental-search-queue-size 0; it just seems to have decreased the frequency of occurrences.

@dureuill
Contributor

Alright this needs another investigation then :-)

@dureuill dureuill reopened this Jun 20, 2024
@curquiza curquiza removed this from the v1.8.2 milestone Jun 20, 2024
@martinnj

Alright this needs another investigation then :-)

Our current suspicion is that it's related to embeddings, but we can no longer recreate it.
Either way, I think you can safely close this ticket again; I'll create a new ticket if we figure out how to recreate it.

Sorry about the noise.

@irevoire
Member

Thanks a lot!
Closing

@curquiza curquiza added this to the v1.8.2 milestone Jun 24, 2024