Skip to content

docs: revise option of TFIDFRetriever.load_local() in tf_idf.ipynb#30294

Closed
ArrayPD wants to merge 1 commit intolangchain-ai:masterfrom
ArrayPD:ArrayPD-patch-1
Closed

docs: revise option of TFIDFRetriever.load_local() in tf_idf.ipynb#30294
ArrayPD wants to merge 1 commit intolangchain-ai:masterfrom
ArrayPD:ArrayPD-patch-1

Conversation

@ArrayPD
Copy link
Contributor

@ArrayPD ArrayPD commented Mar 15, 2025

Description:
Revise option of TFIDFRetriever.load_local()

Why we need option allow_dangerous_deserialization?
Because the de-serialization of this retriever is based on .joblib and .pkl files. Such files can be modified to deliver a malicious payload that results in execution of arbitrary code on your machine. So we would set allow_dangerous_deserialization as a guard, to emphasize the real security risk. Here is a simple example to simulate a malicious payload. Please check following references for detail.

Example: simulate a malicious payload

>>> import pickle, pickletools
>>> pickle.loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
hello world
0
>>> pickletools.dis(b"cos\nsystem\n(S'echo hello world'\ntR.")
    0: c    GLOBAL     'os system'
   11: (    MARK
   12: S        STRING     'echo hello world'
   32: t        TUPLE      (MARK at 11)
   33: R    REDUCE
   34: .    STOP
highest protocol among opcodes = 0
>>>

References:
pickle — Python object serialization
https://docs.python.org/3/library/pickle.html
Insecurity and Python pickles
https://lwn.net/Articles/964392/
Pickles are for delis
https://lwn.net/Articles/595352

Issue:
ValueError: The de-serialization of this retriever is based on .joblib and .pkl files.Such files can be modified to deliver a malicious payload that results in execution of arbitrary code on your machine.You will need to set allow_dangerous_deserialization to True to load this retriever. If you do this, make sure you trust the source of the file, and you are responsible for validating the file came from a trusted source.

Dependencies:
N/A

**Description:**
Revise option of TFIDFRetriever.load_local()

**Issue:**
ValueError: The de-serialization of this retriever is based on .joblib and .pkl files.Such files can be modified to deliver a malicious payload that results in execution of arbitrary code on your machine.You will need to set `allow_dangerous_deserialization` to `True` to load this retriever. If you do this, make sure you trust the source of the file, and you are responsible for validating the file came from a trusted source.

**Dependencies:**
N/A
@vercel
Copy link

vercel bot commented Mar 15, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Mar 15, 2025 8:32am

@OskarStark
Copy link
Contributor

Thanks for pointing this out. I would prefer an additional note like your explanation in the PR header to communicate clearly why this option is needed. WDYT?

@ArrayPD
Copy link
Contributor Author

ArrayPD commented Mar 18, 2025

Thanks for pointing this out. I would prefer an additional note like your explanation in the PR header to communicate clearly why this option is needed. WDYT?

Thanks for your advice. Exactly, explain the necessity of why we need option allow_dangerous_deserialization is needed. Since this hits a real security risk. Updated comment.

@OskarStark
Copy link
Contributor

Thanks for your advice. Exactly, explain the necessity of why we need option allow_dangerous_deserialization is needed. Since this hits a real security risk. Updated comment.

Sorry if I wasn’t clear enough. I understand that this is important, but please don’t just add this parameter—also add a note for the reader explaining why it makes sense to set this value. It is described in the pull request, but in my opinion, that is not sufficient. Thanks

@ArrayPD
Copy link
Contributor Author

ArrayPD commented Mar 19, 2025

Thanks for your advice. Exactly, explain the necessity of why we need option allow_dangerous_deserialization is needed. Since this hits a real security risk. Updated comment.

Sorry if I wasn’t clear enough. I understand that this is important, but please don’t just add this parameter—also add a note for the reader explaining why it makes sense to set this value. It is described in the pull request, but in my opinion, that is not sufficient. Thanks

The docstring of TFIDFRetriever.load_local in libs/community/langchain_community/retrievers/tfidf.py
does mention "allow_dangerous_deserialization" and has a brief statement about the risk. I agree that the statement is insufficient.
Would you mind adding your note directly, please?
Thanks.

@eyurtsev
Copy link
Collaborator

Hi ArrayPD,

Would you mind documenting this directly in the example? Documentation in the PR is fine, but the full documentation should be in the example code as that's what users are going to be looking at.

https://github.com/langchain-ai/langchain/pull/30294/files#diff-0235289620f813af2c512a189e03af20b89e867fa078dca3d8ff041bfd9b7017R164

Alternatively, we can keep the code as is, which should raise a run time exception on usage with a detailed explanation on enabling allow_adangerous_deserilization.

Going to close the PR for now, but feel free to re-open if you'd like with an inline explanation for the example.

Eugene

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments