Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Deny/permit configuration for read_pickle to address security concerns #48049

Closed
1 of 3 tasks
yilun11 opened this issue Aug 11, 2022 · 7 comments
Closed
1 of 3 tasks
Labels
Enhancement IO Pickle read_pickle, to_pickle Needs Discussion Requires discussion from core team before further action Upstream issue Issue related to pandas dependency

Comments

@yilun11
Copy link

yilun11 commented Aug 11, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Pandas.read_pickle() has a security concern, and today we recommend people not to load untrusted pickles. See the disputed CVE-2020-13091 for more background. Many commercial scanning software flag pandas due to the CVE and makes it challenging to adopt/use pandas in security conscious environments.

Feature Description

Implemented at https://github.com/yilun11/fanniemae-pandas/tree/CVE-2020-13091

I added a pickle option for off/deny/permit with a default to deny. Under deny, it blocks calls to os.system/posix.system and addresses the CVE concern. I also added restrictions on builtins like eval(). It passes all the pandas tests. I think this is backwards compatible assuming pickles don't make these type of calls in the wild.

Under permit, it will only allow calls to classes in the permit list. This fails the pandas tests. This is the original Python restricted globals example, and may be of limited use (or at least will need to add wildcard support). Under off, it does not do any security checks and reverts to current behavior.

In addition to the default settings, users can configure via a YML file or set_options().

Alternative Solutions

Since read_pickle() relies on pickle.load(), it may be possible to make the changes there. I think sandboxing Python is another way to reduce the potential for bad pickles to cause damage.

Additional Context

This is a long outstanding issue with pickle, and I've tried to read the history behind it. We don't flag Python yet for vulnerabilities, yet we do so for Pandas, hence I have the changes here. My goal is to make a backwards compatible change that addressed the disputed CVE and hope this does so. I welcome any feedback on this matter.

@yilun11 yilun11 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 11, 2022
@mroeschke
Copy link
Member

Definitely understand the security concern (hence why the read_pickle docstring has a warning regarding untrusted pickles) https://pandas.pydata.org/docs/reference/api/pandas.read_pickle.html

But as you mentioned, since read_pickle is essentially a wrapper of pickle.load, IMO it would be more maintainable (and benefit the wider Python ecosystem) if these behaviors were a feature of pickle.load

@mroeschke mroeschke added IO Pickle read_pickle, to_pickle Upstream issue Issue related to pandas dependency Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 11, 2022
@yilun11
Copy link
Author

yilun11 commented Aug 12, 2022

Another idea is leaving the default option to off, so there is full backwards compatibility. I assume this is a key concern against adoption.

Many orgs interested in security will be installing packages from an internal repo/mirror. At the repo level, I can probably enable support so it installs under deny mode by default for orgs that wish it. This could apply to other packages as well where a subset wants higher security. If I went upstream to CPython, I wouldn't have this repo setting option. I'd need to take this up with Sonatype (my repo vendor) but I feel like this could work for all parties.

@mroeschke
Copy link
Member

I think one of the concerns for integration into pandas is maintenance and scope of the issue. read_pickle is just calling pickle.load so the solution would not be targeting the key issue.

@Delengowski
Copy link
Contributor

@yilun11 I work on air gapped systems where I deal with similar issues. You need to find a security poc who also has software experience and can freethink rather than just reading a CVE.

Going to nvd and performing a search of "python pickle" reveals this issue to be rampant and not as you pointed out not tied the library but the pickle protocol itself. https://nvd.nist.gov/vuln/search/results?form_type=Basic&results_type=overview&query=Python+pickle&search_type=all&isCpeNameSearch=false

Does your work pay for Matlab and use it? The same thing can happen with .mat files for classes that implement saveobj https://www.mathworks.com/help/matlab/ref/saveobj.html

What I am suggesting is you need to point out the inconsistencies that your security may or may not have. You may gain ground here depending on how you approach it. This has worked best for me. Focus should be on not allowing 3rd party pickles on the system.

I see you are aware of this http://docs.python.org/3.4/library/pickle.html#restricting-globals given your branch. As others suggested this should be done more broadly or else you'd be performing this update for any library that has similar cve.

@yilun11
Copy link
Author

yilun11 commented Aug 13, 2022

I can look for another group to maintain the deny/permit lists to address the maintenance question. This will just leave the logic, which I hope will be agreeable. In terms of scope, I can implement it in cpython pickle.py but the blocker is I have no way of setting sitewide Python configurations at the org level, assuming default to off for compatibility. With packages, this is easier since I can ask private repos to do file injection to configure packages for the org that want the added security. This is the best that I could come up with, absent a PEP proposal to allow package installation configuration by repos.

I will do outreach to various groups to get more feedback. @Delengowski In some ways, this is a fool's errand. I do see audit hooks introduced in Py3.8, so there is progress on these issues.

@jreback
Copy link
Contributor

jreback commented Aug 13, 2022

-1 on changes in pandas to expose these types of options

this is clearly a cpython request

@mroeschke
Copy link
Member

Thanks for the suggestion, but as discussed, pandas isn't really the correct scope to address and fix this issue so closing.

If pickle.load ever exposes a security feature through a keyword argument, it would definitely make sense to also include it in read_pickle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Pickle read_pickle, to_pickle Needs Discussion Requires discussion from core team before further action Upstream issue Issue related to pandas dependency
Projects
None yet
Development

No branches or pull requests

4 participants