AWS Lambda - Read Only #1412

Closed
Joepetey opened this issue Mar 3, 2023 · 13 comments

Comments

@Joepetey

Joepetey commented Mar 3, 2023

LangChain tries to download GPT2TokenizerFast when I run a chain. In a Lambda function this doesn't work because the Lambda filesystem is read-only. Has anyone run into this, or does anyone know how to fix it?

@ellisonbg

@3coins may be able to help with this.

@Joepetey
Author

Joepetey commented Mar 5, 2023

That would be great if I could get some help with this, or some documentation!

Thank you!

@juankysoriano
Contributor

juankysoriano commented Mar 5, 2023

Ah, I have also seen this issue. As I mentioned on the LangChain Discord:

OpenAIChat uses GPT2TokenizerFast.from_pretrained("gpt2") to calculate the number of tokens, while OpenAI uses tiktoken via tiktoken.get_encoding("gpt2").

The first is not usable when going serverless (AWS Lambda or Google Cloud Functions), which can be a deal breaker for a more than valid use case; I'd even say the serverless scenario is a desirable one.

So far I have found a workaround, but it's not ideal and is likely to break as langchain keeps being updated (which happens very frequently).

The workaround is to create a custom implementation of OpenAIChat that overrides the problematic method:

from langchain.llms import OpenAIChat  # import path may vary with your langchain version
import tiktoken


class TikOpenAIChat(OpenAIChat):
    def get_num_tokens(self, text: str) -> int:
        # Count tokens with tiktoken instead of GPT2TokenizerFast, which would
        # try to download tokenizer files onto Lambda's read-only filesystem.
        enc = tiktoken.get_encoding("gpt2")
        return len(enc.encode(text))
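
A hypothetical usage sketch (the model name is just an example, and OPENAI_API_KEY is assumed to be set in the Lambda environment; get_num_tokens itself makes no API call):

llm = TikOpenAIChat(model_name="gpt-3.5-turbo")
print(llm.get_num_tokens("How many tokens is this?"))  # counted with tiktoken, nothing downloaded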

I hope this helps in your scenario, @Joepetey, and let's aim for an official solution from @hwchase17.

@juankysoriano
Contributor

Alternatively, we could continue using GPT2TokenizerFast, but we'd need to provide a way of selecting where those models are stored; serverless functions typically have some writable space available on /tmp (see the sketch below).

That said, having to download the model every time (serverless is stateless) is going to make things very slow, I believe.

I'm not an expert on this, by the way.
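
A rough sketch of that idea (not something langchain does out of the box; the /tmp/hf_cache path is just an example): point the Hugging Face cache at Lambda's writable /tmp before loading the tokenizer.

import os

# /tmp is the only writable path inside a Lambda function
os.environ["TRANSFORMERS_CACHE"] = "/tmp/hf_cache"

from transformers import GPT2TokenizerFast

# cache_dir can also be passed explicitly instead of relying on the env variable
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2", cache_dir="/tmp/hf_cache")
print(len(tokenizer.encode("How many tokens is this?")))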

@juankysoriano
Contributor

I have opened a PR that would solve this issue.

#1457

@Joepetey
Author

Joepetey commented Mar 5, 2023

Wow thank you @juankysoriano this is so helpful!

@hwchase17
Contributor

Thanks @juankysoriano! Merging this in now.

hwchase17 pushed a commit that referenced this issue Mar 6, 2023
Solves #1412

Currently `OpenAIChat` inherits the way it calculates the number of
tokens, `get_num_tokens`, from `BaseLLM`.
On the other hand, `OpenAI` inherits from `BaseOpenAI`.

`BaseOpenAI` and `BaseLLM` use different methodologies for this:
the first relies on `tiktoken`, while the second relies on `GPT2TokenizerFast`.

The motivation of this PR is to:

1. Bring consistency to the way the number of tokens (`get_num_tokens`) is
calculated across the `OpenAI` family, regardless of `Chat` vs. `non-Chat`
scenarios.
2. Give preference to the `tiktoken` method, as it's serverless friendly:
it doesn't require downloading models, which would be incompatible with
read-only filesystems.
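
A small sketch of what the change means in practice (assumes OPENAI_API_KEY is set so the wrappers can be constructed; get_num_tokens makes no API call):

from langchain.llms import OpenAI, OpenAIChat

text = "How many tokens is this?"
# After this change both wrappers should count tokens via tiktoken,
# so the call works on a read-only filesystem such as AWS Lambda's.
print(OpenAI().get_num_tokens(text))
print(OpenAIChat(model_name="gpt-3.5-turbo").get_num_tokens(text))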
@juankysoriano
Contributor

solved by #1457

@Joepetey
Author

Joepetey commented Mar 6, 2023

Quick question, when can we expect this to be available in a release?

@juankysoriano
Contributor

I don't know, but the maintainer is typically very fast with releases; they're very frequent. In my experience it shouldn't take more than a couple of days. We'll see.

@juankysoriano
Contributor

@Joepetey if the recent release, 0.0.102, solved your problem, the issue can be closed.

@gmpetrov
Contributor

gmpetrov commented Mar 7, 2023

> Alternatively, we could continue using GPT2TokenizerFast, but we'd need to provide a way of selecting where those models are stored; serverless functions typically have some writable space available on /tmp.
>
> That said, having to download the model every time (serverless is stateless) is going to make things very slow, I believe.
>
> I'm not an expert on this, by the way.

With AWS you can mount an EFS volume on a Lambda to cache a pre-trained model.
Check out this example: https://github.com/aws-samples/zero-administration-inference-with-aws-lambda-for-hugging-face

Also, if you're using HuggingFaceEmbeddings (which uses sentence_transformers.SentenceTransformer), you need to use the SENTENCE_TRANSFORMERS_HOME env variable to download the model to a specific location.
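
A minimal sketch of that setup (the /tmp/sentence_transformers path and the handler name are just examples): set the variable before the embeddings class loads its model, so it is cached under Lambda's writable /tmp on cold start.

import os
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "/tmp/sentence_transformers"

from langchain.embeddings import HuggingFaceEmbeddings

# The model is downloaded and cached under /tmp the first time it is loaded.
embeddings = HuggingFaceEmbeddings()

def handler(event, context):
    vector = embeddings.embed_query(event.get("text", ""))
    return {"dimensions": len(vector)}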

@Joepetey
Author

Joepetey commented Mar 7, 2023

Thank you Juanky, it worked!

@Joepetey Joepetey closed this as completed Mar 7, 2023
zachschillaci27 pushed a commit to zachschillaci27/langchain that referenced this issue Mar 8, 2023

mikeknoop pushed a commit to zapier/langchain-nla-util that referenced this issue Mar 9, 2023