
Content-Encoding: gzip #136

Open
andrew-at-rise opened this issue Mar 14, 2024 · 7 comments

Comments

@andrew-at-rise

I wonder if it would make sense to support compressed requests, especially for /rerank, where the query and document list could be many 1 kB or 2 kB chunks of text? The incoming request could easily exceed 20 or 30 kB. The HTTP server does not appear to handle gzipped request bodies, if present.

@michaelfeil
Owner

Have you considered the gRPC protocol? If you fork the project and start building, that's something I would potentially consider pulling in.

Questions:

  • I have never heard of gzip requests - how does validation of requests (error 422 handling) work?
  • What kind of issues are you experiencing when sending e.g. 2 kB requests? Why is this feature needed?
  • Is sending 20-30 kB a good paradigm? When do you need it? Even with a GPU you can encode around 200-1000 texts per second. I think this encourages a bad workload?

@peebles

peebles commented Mar 15, 2024

Does your FastAPI server accept gRPC? I am using your docker container, behind nginx terminating TLS as a reverse proxy. Nginx apparently can proxy gRPC.

  • content-encoding: gzip is pretty common. All browsers will try to compress their request bodies if the server accepts.
  • I am not experiencing any issues with large requests. They can be slower is all.
  • My RAG text chunks are about 1 kB. My prompt, coming from Continue in VS Code, can be quite large (like an entire file.js). I fetch ~20 chunks from my vector database, then I want to re-rank. Think this is too much data?
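To make those numbers concrete, here is a rough back-of-the-envelope sketch. The payload shape and sizes below are illustrative, not Infinity's exact /rerank schema:

```python
import gzip
import json

# Illustrative /rerank-style payload: a large query (a whole source file
# as prompt) plus ~20 retrieved chunks of roughly 1 kB each
payload = {
    "query": "def main():\n    pass\n" * 100,
    "documents": ["lorem ipsum " * 90] * 20,
}
raw = json.dumps(payload).encode()
compressed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
```

Repetitive text compresses unrealistically well, but even on real prose the gzipped body is typically a fraction of the raw JSON.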

Here is an example of decompression middleware for FastAPI:

from fastapi import FastAPI, Request
from starlette.middleware.base import BaseHTTPMiddleware
import gzip

class GZipRequestMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        if request.headers.get('content-encoding') == 'gzip':
            # Decompress the request body
            body = await request.body()
            decompressed_body = gzip.decompress(body)

            # Rebuild the request around a receive channel that yields the
            # decompressed body (mutating scope['body'] has no effect)
            async def receive():
                return {
                    'type': 'http.request',
                    'body': decompressed_body,
                    'more_body': False,
                }

            request = Request(request.scope, receive)

        response = await call_next(request)
        return response

app = FastAPI()

# Add the middleware to the app
app.add_middleware(GZipRequestMiddleware)

After that, request.body is used just as before.
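On the client side, sending a gzipped body is just one extra header plus gzip.compress. A sketch with stdlib urllib (the URL and port are placeholders for wherever the server is running):

```python
import gzip
import json
import urllib.request

payload = {"query": "what is gzip?", "documents": ["doc one", "doc two"]}
body = gzip.compress(json.dumps(payload).encode())

# Build the request; decompression middleware on the server side would
# transparently inflate the body before validation
req = urllib.request.Request(
    "http://localhost:7997/rerank",
    data=body,
    headers={"Content-Type": "application/json", "Content-Encoding": "gzip"},
    method="POST",
)
```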

I'll look into gRPC. I need speed.

@michaelfeil
Owner

@peebles Thanks for the extensive example. Per https://stackoverflow.com/questions/43628605/does-the-zlib-module-release-the-global-interpreter-lock-gil-in-python-3, I assume decompressed_body = gzip.decompress(body) will not affect the GIL or performance.
starlette integration seems elegant and without any extra dependencies at first glance!

Thoughts:

  1. Could you do routing based on the JSON content?
  2. Are you sure the performance bottleneck is in sending/receiving the request? I think validation, tokenization, and especially the forward pass of the model will be much more compute-heavy.
  3. The response (embedding) should be all unique floats with little pattern - JSON is kind of lossy, but I would consider adding a gRPC server to be more elegant, and it has more traction in the embedding community (https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#grpc). gRPC is not supported by FastAPI.

@peebles

peebles commented Mar 15, 2024

  • "routing based on json content" sounds intriguing but I am not sure I understand ...
  • I am not sure the performance bottleneck is network transfer, although the machine hosting Infinity is a big iron monster and executes reranking almost instantaneously. Network transfer is probably my longer term concern.
  • Typically, client compression is not turned on unless the request payload goes above some threshold, like 1 kB or so, where the cost of transfer exceeds the cost of compression/decompression.

I am doing /rerank, where the input (to you) is a potentially large amount of text, and the output is a very small summary ... no floats, all text. In /rerank, it may make sense to compress the input but not the output ... the output is too small.

As for "I assume this will not affect the GIL or performance. decompressed_body = gzip.decompress(body)", I don't know. I come from more of a NodeJS background where everything is async.
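For what it's worth, on the Python side the decompression can be pushed off the event loop onto a worker thread, similar to how Node keeps its loop free. A sketch using the stdlib asyncio.to_thread (Python 3.9+); since zlib releases the GIL on large buffers, the worker thread actually runs in parallel:

```python
import asyncio
import gzip

async def decompress_off_loop(body: bytes) -> bytes:
    # CPU-bound work runs in a worker thread so the event loop stays free
    return await asyncio.to_thread(gzip.decompress, body)

data = gzip.compress(b"example chunk " * 1000)
result = asyncio.run(decompress_off_loop(data))
```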

I have seen significant performance improvements on past projects when I started compressing large network requests, for example between clients on AWS and MongoDB servers at Atlas. That is why I looked into this for Infinity in the first place.

@peebles

peebles commented Mar 15, 2024

What is the difference between Infinity and https://github.com/huggingface/text-embeddings-inference?

@michaelfeil
Owner

michaelfeil commented Mar 15, 2024

@peebles TEI is the most similar project out there - I think it is an exciting project showcasing a new framework in Rust (I like Rust). Here are a couple of key differences.

Re: routing: gzip compression/decompression can also be handled in front of the server, e.g. via AWS API Gateway and similar (https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gzip-compression-decompression.html).

@peebles Feel free to PR the gzip compression, I can add a unit test if needed.

@peebles

peebles commented Mar 15, 2024

I'll look into doing the PR.
