Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Elasticsearch: Request Entity Too Large #28117 #29062

Merged
merged 10 commits into from Feb 7, 2024

Conversation

inferno-umar
Copy link
Contributor

Fix for gitea putting everything into one request without batching and sending it to Elasticsearch for indexing as issued in #28117

This issue occured in large repositories while Gitea tries to
index the code using ElasticSearch.

I've applied necessary changes that takes batch length from below config (app.ini)

[queue.code_indexer]
BATCH_LENGTH=<length_int>

and batches all requests to Elasticsearch in chunks as configured in the above config

Fix for gitea putting everything into one request and send it to elasticsearch as issued in #28117
@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Feb 5, 2024
@pull-request-size pull-request-size bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 5, 2024
@lunny
Copy link
Member

lunny commented Feb 6, 2024

This batch length is for queue but no for elastic search.
For ealstic search I think we can have a hardcode batch length. So that you code could be simpler.

@inferno-umar
Copy link
Contributor Author

inferno-umar commented Feb 6, 2024

This batch length is for queue but no for elastic search. For ealstic search I think we can have a hardcode batch length. So that you code could be simpler.

I've 2 points to mention here:

  1. At the end the code_indexer's job is added to queue which Gitea then runs in the background in batches or as a whole for indexing which thereby takes the above config and it just works. Kindly refer this documentation -> it has clearly mentioned
    Gitea creates the following non-unique queues:
    code_indexer

  2. If I'm still wrong, I could introduce new setting in [indexer] as INDEXING_BATCH_LENGTH that'll take the length of batch just for code indexing, no need for hardcoding

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Feb 6, 2024

Thank your for your PR! (It is really fast 👍)

Two suggestions in my mind:

  1. BATCH_LENGTH is only for the "queue dispatching", it shouldn't be used here. I think we do not need to fine-tune this value for elasticsearch batching, so you could just hard-code a reasonable length into code at the moment.
    • The document says: BATCH_LENGTH: 20: Batch data before passing to the handler.
    • The flow is "task publisher -> queue -> task handler -> elasticsearch"
    • So the "BATCH_LENGTH" is for "queue -> task handler" dispatching, not for "task handler -> elasticsearch" batching.
    • Why not adding a new option? There are too many options now .... no user really needs it at the moment, a reasonable value is good enough.
  2. There is a builtin min function in Go now 🤣

@lunny lunny added type/bug backport/v1.21 This PR should be backported to Gitea 1.21 labels Feb 6, 2024
@inferno-umar
Copy link
Contributor Author

Thanks a lot @lunny @wxiaoguang for clearing me out.
I've created a hardcoded value that pretty much batches all our indexing needs without errors to ElasticSearch. Kindly review

@pull-request-size pull-request-size bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 7, 2024
@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Feb 7, 2024
@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Feb 7, 2024
@wxiaoguang
Copy link
Contributor

wxiaoguang commented Feb 7, 2024

Ideally Gitea should also read the contents into batches (but not read everything into reqs in one time), to avoid OOM.

As a quick fix, this PR is better than before.

@lunny lunny added the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Feb 7, 2024
@lunny lunny enabled auto-merge (squash) February 7, 2024 08:08
@lunny lunny merged commit 5c0fc90 into go-gitea:main Feb 7, 2024
25 checks passed
@GiteaBot GiteaBot added this to the 1.22.0 milestone Feb 7, 2024
GiteaBot pushed a commit to GiteaBot/gitea that referenced this pull request Feb 7, 2024
…#29062)

Fix for gitea putting everything into one request without batching and
sending it to Elasticsearch for indexing as issued in go-gitea#28117

This issue occured in large repositories while Gitea tries to 
index the code using ElasticSearch.

I've applied necessary changes that takes batch length from below config
(app.ini)
```
[queue.code_indexer]
BATCH_LENGTH=<length_int>
```
and batches all requests to Elasticsearch in chunks as configured in the
above config
@GiteaBot GiteaBot added backport/done All backports for this PR have been created and removed reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. labels Feb 7, 2024
@wxiaoguang
Copy link
Contributor

@lunny the commit message is wrong (again). Please manually fix the PR description before merging.

wxiaoguang pushed a commit that referenced this pull request Feb 7, 2024
Backport #29062 by @inferno-umar

Fix for gitea putting everything into one request without batching and
sending it to Elasticsearch for indexing as issued in #28117

This issue occured in large repositories while Gitea tries to 
index the code using ElasticSearch.

Co-authored-by: dark-angel <70754989+inferno-umar@users.noreply.github.com>
silverwind pushed a commit to silverwind/gitea that referenced this pull request Feb 20, 2024
…#29062)

Fix for gitea putting everything into one request without batching and
sending it to Elasticsearch for indexing as issued in go-gitea#28117

This issue occured in large repositories while Gitea tries to 
index the code using ElasticSearch.

I've applied necessary changes that takes batch length from below config
(app.ini)
```
[queue.code_indexer]
BATCH_LENGTH=<length_int>
```
and batches all requests to Elasticsearch in chunks as configured in the
above config
6543 pushed a commit to 6543-forks/gitea that referenced this pull request Feb 26, 2024
…o-gitea#29062)

Fix for gitea putting everything into one request without batching and
sending it to Elasticsearch for indexing as issued in go-gitea#28117

This issue occured in large repositories while Gitea tries to
index the code using ElasticSearch.

I've applied necessary changes that takes batch length from below config
(app.ini)
```
[queue.code_indexer]
BATCH_LENGTH=<length_int>
```
and batches all requests to Elasticsearch in chunks as configured in the
above config

(cherry picked from commit 5c0fc90)
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/done All backports for this PR have been created backport/v1.21 This PR should be backported to Gitea 1.21 lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants