Make numpy and pandas optional for ~7 times smaller deps #153

Merged: 15 commits, Jan 6, 2023

@jkbrzt jkbrzt commented Dec 18, 2022

This PR makes data libraries like numpy and pandas optional dependencies. These libraries add up to 146MB, which makes it challenging to deploy applications using this library in environments with code size constraints, such as AWS Lambda.

Since the primary use case of this library (talking to the OpenAI API) doesn’t generally require data libraries, it’s safe to make them optional. The rare case when the data libraries are needed in the API client is handled through assertions with instructive error messages.

Requirements before

Installing openai-python requires the numpy, pandas, and openpyxl data libraries that add up to 146MB:

$ pip install -e .
$ du -sh $VIRTUAL_ENV/lib/python*/site-packages/
167M	/Users/jakub/.virtualenvs/openai-python/lib/python3.11/site-packages/

Requirements after

Installing openai-python doesn’t require the data libraries by default, resulting in ~7 times smaller aggregate size of dependencies:

$ pip install -e .
$ du -sh $VIRTUAL_ENV/lib/python*/site-packages/
23M	/Users/jakub/.virtualenvs/openai-python/lib/python3.11/site-packages/

Data libraries can be installed manually using the new datalib extras, if needed:

$ pip install -e .[datalib]
$ du -sh $VIRTUAL_ENV/lib/python*/site-packages/
167M	/Users/jakub/.virtualenvs/openai-python/lib/python3.11/site-packages/

And they are now also included in the existing embeddings and wandb extras:

$ pip install -e .[embeddings]
$ pip install -e .[wandb]
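
For readers unfamiliar with extras, the layout described above could look roughly like the following in setup.py. This is a sketch, not the PR's actual code: the name DATA_LIBRARIES and the exact package lists are assumptions for illustration, based only on the libraries named in this thread (numpy, pandas, openpyxl, sklearn). In a real setup.py the dictionary would be passed to setuptools' setup() as the extras_require argument.

```python
# Hypothetical sketch of the extras layout; names and package lists are
# assumptions, not copied from the PR's setup.py.
DATA_LIBRARIES = ["numpy", "pandas", "openpyxl"]

extras_require = {
    # New extra: only the data libraries.
    "datalib": DATA_LIBRARIES,
    # Existing extras now pull in the data libraries as well.
    "embeddings": ["scikit-learn", *DATA_LIBRARIES],
    "wandb": ["wandb", *DATA_LIBRARIES],
}
```

With a layout like this, a plain install pulls in none of the data libraries, while any of the three extras brings them along.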

@jkbrzt jkbrzt changed the title Make numpy and pandas optional dependencies Make numpy and pandas optional for ~7 times smaller dependencies Dec 18, 2022
@jkbrzt jkbrzt changed the title Make numpy and pandas optional for ~7 times smaller dependencies Make numpy and pandas optional for ~7 times smaller deps Dec 18, 2022

@ddeville ddeville left a comment

Thanks @jakubroztocil!

Comment on lines +13 to +14
from openai.datalib import numpy as np
from openai.datalib import pandas as pd

I wonder if we should call assert_has_numpy and assert_has_pandas in each function where these modules are used, so that it's very clear to users what to do to fix the issue (rather than getting a generic "'NoneType' object has no attribute" Python exception).

@jkbrzt jkbrzt Dec 20, 2022

The embeddings_utils.py file is not imported from anywhere and it’s the only module that imports sklearn and other libraries listed in the openai[embeddings] extra. I couldn’t find any docs, but its usage implies pip install openai[embeddings] (which now also ensures numpy/pandas/etc.), so the experience of using embeddings_utils.py should be unchanged.

https://github.com/jakubroztocil/openai-python/blob/jakub/data-libraries-optional/setup.py#L46-L53

It could be improved, though. I think each optional extra — embeddings, wandb, and the new datalib — would deserve mention in the README. I’ll add a section on the new one, and if you can give me some context on the other two, I’ll be happy to mention them too.

I wasn't sure whether you’d be interested in the PR, but it looks like you are, so I’ll polish it a bit: I’m thinking maybe throwing an ImportError instead of just Exception from the assert_has_* functions, ensuring the error messages are clear, etc.

To a degree, it’s a backward-incompatible change (for existing users who don’t install openai[embeddings] and hit this line or use read_any_format() via the CLI), so it might also be worth bumping the major version.

Oh you're right this is an embeddings file so it will have the right dependencies.

Regarding the backward-incompatibility, yes it's unfortunate but personally I think it's probably ok as long as the error is clear and explains how to resolve the problem. Also the line in read_any_format is specific to embeddings so it's fine to assume that the embedding deps were installed.

See #124 for some historical context about how deps have been handled.

jkbrzt commented Dec 21, 2022

@ddeville I’ve added a new subsection, “Optional dependencies,” under “Installation.”

I also tweaked the errors and instructions. This is what the user gets when trying to use a feature that needs one of the libraries:

Traceback (most recent call last):
  File "fail.py", line 2, in <module>
    datalib.assert_has_numpy()
  File "openai-python/openai/datalib.py", line 51, in assert_has_numpy
    raise MissingDependencyError(NUMPY_INSTRUCTIONS)
datalib.MissingDependencyError:

OpenAI error:

    missing `numpy`

This feature requires additional dependencies:

    $ pip install openai[datalib]
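
The pattern behind a traceback like this can be sketched as follows. It is a minimal illustration, not the contents of openai/datalib.py: the helper names optional_import and assert_has, the INSTRUCTIONS template, and the example function mean are all hypothetical, and making MissingDependencyError a subclass of ImportError is a choice of this sketch, following the idea floated earlier in the thread.

```python
import importlib


class MissingDependencyError(ImportError):
    """Raised when an optional data library is needed but not installed."""


# Hypothetical message template modeled on the output shown above.
INSTRUCTIONS = (
    "missing `{library}`\n\n"
    "This feature requires additional dependencies:\n\n"
    "    $ pip install openai[datalib]\n"
)


def optional_import(name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def assert_has(module, library):
    """Raise a descriptive error instead of a later NoneType failure."""
    if module is None:
        raise MissingDependencyError(INSTRUCTIONS.format(library=library))


# Importing this module never fails, even when numpy is absent.
numpy = optional_import("numpy")


def mean(values):
    """Example feature that needs numpy; checks for it at call time."""
    assert_has(numpy, "numpy")
    return float(numpy.mean(values))
```

Because the check runs inside the function that needs the library, users who never touch data features pay no cost, and those who do get installation instructions rather than an attribute error on None.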

@ddeville ddeville left a comment

This is great, thank you so much!

@@ -25,6 +25,26 @@ Install from source with:
python setup.py install
```

### Optional dependencies

Very nice

adieuadieu commented Jan 6, 2023

@jakubroztocil Nice work! I saw this PR via your blog post.

I'm sure you're aware of this, but thought it might help anyone else who lands here to point it out:

> These libraries add up to 146MB, which makes it challenging to deploy applications using this library in environments with code size constraints, such as AWS Lambda.

With AWS Lambda supporting container images, it's fairly trivial to deploy heavy libraries and large ML models to run in Lambda with little-to-no impact on performance (other than the initial pull from ECR after a fresh deployment). It also has the nice side benefit of making it easier to test the Lambda locally in a similar runtime environment.

https://docs.aws.amazon.com/lambda/latest/dg/images-create.html

(I realize it probably sounds like it, but no, I don't work for AWS. Just a Lambda & OpenAI fanboi. 😛)

@asciidiego

best pr i've read the whole day. amazing work guys!


5 participants