docs: revise option of TFIDFRetriever.load_local() in tf_idf.ipynb#30294
ArrayPD wants to merge 1 commit into langchain-ai:master
**Description:** Revise option of TFIDFRetriever.load_local()
**Issue:** ValueError: The de-serialization of this retriever is based on .joblib and .pkl files. Such files can be modified to deliver a malicious payload that results in execution of arbitrary code on your machine. You will need to set `allow_dangerous_deserialization` to `True` to load this retriever. If you do this, make sure you trust the source of the file, and you are responsible for validating the file came from a trusted source.
**Dependencies:** N/A
Thanks for pointing this out. I would prefer an additional note like your explanation in the PR header to communicate clearly why this option is needed. WDYT?
Thanks for your advice. Exactly: we should explain why the allow_dangerous_deserialization option is needed, since it addresses a real security risk. Updated the comment.
Sorry if I wasn’t clear enough. I understand that this is important, but please don’t just add this parameter; also add a note for the reader explaining why it makes sense to set this value. It is described in the pull request, but in my opinion, that is not sufficient. Thanks
The docstring of TFIDFRetriever.load_local in libs/community/langchain_community/retrievers/tfidf.py
Hi ArrayPD,
Would you mind documenting this directly in the example? Documentation in the PR is fine, but the full documentation should be in the example code, as that's what users are going to be looking at.
Alternatively, we can keep the code as is, which should raise a runtime exception on usage with a detailed explanation on enabling the option.
Going to close the PR for now, but feel free to re-open if you'd like with an inline explanation for the example.
Eugene
Description:
Revise option of TFIDFRetriever.load_local()
Why do we need the allow_dangerous_deserialization option?
Because this retriever is de-serialized from .joblib and .pkl files. Such files can be modified to deliver a malicious payload that results in execution of arbitrary code on your machine. The allow_dangerous_deserialization option acts as a guard: it has to be set explicitly, which makes the real security risk visible to whoever loads the file. Here is a simple example that simulates a malicious payload. Please check the following references for details.
Example: simulate a malicious payload
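A minimal sketch of such a payload, using only the standard library `pickle` module. The class name `Payload` and the evaluated expression are illustrative assumptions, not the PR's original code. The expression here is harmless, but an attacker could substitute any callable, such as a shell command.

```python
import pickle


class Payload:
    """Illustrative stand-in for a tampered object inside a .pkl file."""

    def __reduce__(self):
        # pickle calls the returned callable with the given arguments
        # while loading. Here it is a harmless expression; a real
        # attack would run something destructive instead.
        return (eval, ("sum(range(5))",))


# The "attacker" serializes the payload ...
blob = pickle.dumps(Payload())

# ... and the "victim" merely loads it. The expression is evaluated
# during loading, and the result replaces the expected object.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

Loading the bytes never yields a `Payload` instance: the loader executes the callable chosen by whoever produced the file, which is why the source of a .pkl or .joblib file has to be trusted.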
References:
pickle — Python object serialization
https://docs.python.org/3/library/pickle.html
Insecurity and Python pickles
https://lwn.net/Articles/964392/
Pickles are for delis
https://lwn.net/Articles/595352
Issue:
ValueError: The de-serialization of this retriever is based on .joblib and .pkl files. Such files can be modified to deliver a malicious payload that results in execution of arbitrary code on your machine. You will need to set `allow_dangerous_deserialization` to `True` to load this retriever. If you do this, make sure you trust the source of the file, and you are responsible for validating the file came from a trusted source.
Dependencies:
N/A
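For illustration, the opt-in guard behind the ValueError above can be sketched as follows. This is a hypothetical stand-in written against the standard library, not the actual TFIDFRetriever.load_local() implementation; the function body, file name, and stored data are assumptions. Only the `allow_dangerous_deserialization` keyword mirrors the option discussed in this PR.

```python
import pickle
import tempfile
from pathlib import Path


def load_local(path, *, allow_dangerous_deserialization=False):
    """Hypothetical loader that refuses to unpickle unless the caller opts in."""
    if not allow_dangerous_deserialization:
        raise ValueError(
            "Loading relies on pickle, which can execute arbitrary code. "
            "Set allow_dangerous_deserialization=True only for trusted files."
        )
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    # A file we created ourselves, so it is safe to load.
    path = Path(tmp) / "retriever.pkl"
    path.write_bytes(pickle.dumps({"docs": ["foo", "bar"]}))

    # Default call: the guard refuses before any unpickling happens.
    try:
        load_local(path)
        refused = False
    except ValueError:
        refused = True

    # Explicit opt-in: the caller accepts responsibility for the file.
    restored = load_local(path, allow_dangerous_deserialization=True)
```

Making the flag keyword-only and defaulting it to `False` means the unsafe path cannot be taken by accident; a call in the notebook would pass the flag explicitly, ideally with a comment stating that the file was produced locally.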