Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Presto ranger integration with row-level filters and column masking #11640

Closed
wants to merge 5 commits into from

Conversation

cryptoe
Copy link

@cryptoe cryptoe commented Oct 4, 2018

This PR includes #8980 and #10996, Also includes row-level filtering.

@cryptoe
Copy link
Author

cryptoe commented Oct 9, 2018

@stagraqubole Does this solve your use case ?

@shubhamtagra
Copy link
Contributor

@stagraqubole Does this solve your use case ?

Yet to try this out but the main usecase for us is using Ranger policies already defined via Hive. From code it looks like with config "ranger-service-types" it can do catalog level authorization but it would be good to have examples/doc.

@RameshByndoor
Copy link
Contributor

Following steps will help enable

  1. Add plugin jar to plugin folder
    presto/plugin/ranger-presto-plugin/presto-ranger-plugin-0.version-shaded.jar

  2. Enable ranger-access-control in etc/access-control.properties

access-control.name=ranger-access-control
#Comma separated list of service-types from ranger 
ranger-service-types=hive,mysql
#Comma separated list of app-ids from ranger 
ranger-app-ids=hive,mysql
ranger-security-config-hive=path_to_/ranger-hive-security.xml,path_to_/ranger-mysql-security.xml
ranger-audit-config-hive=path_to/ranger-hive-audit.xml,path_to_ranger-mysql-audit.xml
ranger.username=XXXX
ranger.password=XXXX

@cryptoe
Copy link
Author

cryptoe commented Oct 14, 2018

@tooptoop4
Copy link

ship it!

@dain
Copy link
Contributor

dain commented Oct 24, 2018

@cryptoe can you post a link to the ranger policy language (everything I can find is a sales pitch or pictures of UIs)? I'm curious what the limitations of the language are, or is anything allowed in the row level filter?

@cryptoe
Copy link
Author

cryptoe commented Oct 25, 2018

@dain Are you referring to language model of the row level filtering expression?

May be you are referring to https://cwiki.apache.org/confluence/display/RANGER/Row-level+filtering+and+column-masking+using+Apache+Ranger+policies+in+Apache+Hive .
"Services that need row-filter or data-masking feature should populate these attributes with appropriate values"

I think you can put any non empty string in the row level filter.
We are chaining multiple sql expressions in our internal ranger policies. For eg .
((city='abc' or city='def') and (user_type='xxx'))
Ranger client just returns that string if the (user/group) has a row level filtering policy defined against that table/view.

I am doing the row level filter string to presto sql conversion here.

@dain
Copy link
Contributor

dain commented Oct 25, 2018

@cryptoe That is whatI was afraid of. That leads to many questions:

  • What language is the string in? HQL or ANSI SQL?
  • What is the "environment" of the evaluation (e.g., timezone, character set, function path, default catalog/schema, etc.)
  • What a happens when the expression contains column references not present in the outer table, which makes it basically become a correlated subquery?

If there was a small spec for these expressions which contained the syntax for the language and semantics of functions/operations, I see the translation as doable, but short of that, I have concerns about this feature both producing wrong answers and breaking security, both of which are core values of this project.

@bolkedebruin
Copy link

bolkedebruin commented Oct 28, 2018

@dain (I have contributed some work to Ranger in the past)

Ranger works as a policy management system. It let's you define policies and store them by means of plugins. E.g. hive is a plugin, hdfs is a plugin and potentially presto could be a plugin. Policy checking happens entirely within the client (e.g. Presto). The Domain Specific Language is set by the plugin. Policies are refreshed by the client every 30s by default.

So the questions you are raising are highly relevant, but cannot not be answered in the domain of Ranger and should be answered in Presto's domain. So for example "what language is the string in?" should be up to Presto to define in context of its policy evaluation and supported by the Ranger plugin. Either as part of enforced templating of input or by definition. In the first then it would make sense to have the plugin as part of Presto (although it would need to be copied over to Ranger).

The nature of Presto being a kind of layer on top of other data requesting engines brings a practical challenge: ideally you would like to reuse the policies applied to the underlying databases and data (e.g. hive, mysql, postgres etc). Obviously you then become dependent on what those plugins support as DSL.

Thus my suggestion would be to have a "policy enforcement specification" in Presto that specifies what the DSL for Presto is and next to that a specification how to handle the "sub database policy definitions".

My 2 cents. Let me know how we could contribute if required. We would love to have this in.

@dain
Copy link
Contributor

dain commented Oct 29, 2018

@bolkedebruin That makes sense. The problem is that most people likely want to use existing Hive rules in Presto, but the super lax policy system in Ranger makes that almost impossible without implementing HQL in Presto. Looking at the docs for Ranger my guess is 99% of users only use the UI to define policy. Is the UI limited in scope? Specifically, does it only support a reasonable/portable language? Or is there a simple portable language that can be used from the UI and understandable from multiple systems.

@bolkedebruin
Copy link

bolkedebruin commented Oct 29, 2018

@dain maybe lax is another word for very flexible ;-). So in many cases the UI will be used for setting policies and they will be slow changing. However, I do know of deployments that do use the Rest API for defining policies and some deployments that have faster changing ones. For the language that doesn't really matter as it is verfied by the plugin in Ranger. For hive that means no verification (as far as I could determine) and the language is HQL. E.g. for row level masking the query is basically added to the statement created by the user. This also means that the query will fail if an error is made by the administrator while defining policies (probably tough to debug for the end user).

So I would argue that leaves us with a couple of scenarios

  1. Do our best to support the query for row level filtering / masking and fail if we cannot parse properly. If we assume we do not fully support HQL but do a best effort translation we are equal to behavior of Hive, but we might get different errors which could lead to confusion. However, in this case HQL might not be too far off from ANSI SQL (not sure though). <-- this is where the current patch is)
  2. We do not support row level filtering/masking with custom queries with HQL the Hive plugin, but instead stack a "presto" plugin on top that allows setting row level filtering / masking in the DSL of Presto. This removes the issue for the user at the expense of potentially requiring kepping policies in sync. I think this should be possible, but might look a bit hackish.
  3. Do not support custom row level filtering / masking and fail
  4. Do not use the Hive specification but use our own plugin entirely. This is cleaner from an implementation point of view and gives full control, however it would complicate matters for the policy manager to keep them in sync.
  5. ?

@dain
Copy link
Contributor

dain commented Oct 30, 2018

@bolkedebruin, to clarify, Ranger internally just stores strings? I ask because looking at the docs, the system seems to have a graphical UI to create and update the policies. To implement such a UI, I would assume that you either need to store the filter as an abstract syntax tree, or you need to be able to parse the native language of ever supported system and produce an AST for the UI. Actually even in that case, the AST supported by the UI would be limited, so maybe we can support only that subset?

@bolkedebruin
Copy link

bolkedebruin commented Oct 30, 2018

Ah gotcha. Ranger indeed stores an AST and renders this in the UI. That's easy to support and already done by this patch. Only supporting this is my scenario 3.

The Hive plugin in Ranger also adds support for row level filter expressions or masks. These are defined in a HQL subset and more or less directly added to the query a user creates (in the hive UI that is!, Eg a select query)

The REST API and UI both generate the same AST. They also both support the custom row level filters which are just an array of strings in the AST.

The patch as it stands supports the custom row level filters to the extend that the expressions aRe verified by prestos parser. Hence there might be minor caveats. This is scenario 1.

Why not make row level filtering and masking support optional (I think it is in this patch on a phone though). Also make make configurable if the query needs to fail if presto then encounters a requirement for those filters in a particular policy preventing data leakage. Maybe mark it as experimental.

This would put it in the hands of an operator and we could see what the real world experience is?

@RameshByndoor
Copy link
Contributor

@bolkedebruin How did you arrive at conclusion that ranger stores AST..?
AFAIK, ranger stores row filter_expression as just string, It neither validate SQL expression on save from UI/API nor parses expressions further.
(Look at Ranger DB table:x_policy_item_rowfilter column:filter_expr and related api for more info)
For making it optional, Ranger has switch at policy level, with this policy, can be enabled/disabled from UI/API.

@dain you noted about "Ranger makes that almost impossible without implementing HQL in Presto"
Yes this PR aimed to make use of the existing policies defined in ranger-hive-service. that said expression written in row filter section, which are exclusive HQL expression can fail in presto-ANSI.
Again i don't see any big difference between HQL and ANSI in that aspect with boolean expression.

@bolkedebruin
Copy link

bolkedebruin commented Oct 30, 2018

@RameshByndoor because it does. This is an example of a policy AST in Json (row level filter not being part of it)

    {
       "policyName":"name_of_policy",
       "databases":"db1,db2",
       "tables":"mytable,yourtable",
       "columns":"",
       "udfs":"",
       "description":"",
       "repositoryName":"",
       "repositoryType":"hive",
       "tableType":"Inclusion",
       "columnType":"Inclusion",
       "isEnabled":"true",
       "isAuditEnabled":"true",
       "permMapList": [{
    		"groupList": ["somegroup"],
    		"permList": ["Select"]
    	}]
    }

So the row level expression is stored as an arbitrary string within the AST of the policy (probably something like "rowLevelFilters" : [{"filterName": "my_name", "expression", "cust.name = 'bla'"}]). And indeed the Hive Plugin of Ranger does not validate the expression this is what I explained above.

In other words the AST of a Hive policy is more than just the row level expression. It is the whole Policy you define in Ranger, which can include row level filters and masks but not neccesarily. The client (presto in this case) downloads the whole Policy and then does verification on its own. Verification happens by parsing the AST (which is already done for you by the Ranger framework) and applying it.

Note: in this case the DB schema of Ranger is irrelevant. It reassembles the AST from the DB and provides that as an interface to clients.

@RameshByndoor
Copy link
Contributor

@bolkedebruin ln simple words Ranger Row filters are NOT parsed to Abstract syntax Tree(AST) anywhere in ranger, instead it propagates as just expression string, I don't see any code which is doing in apache/ranger,please do post code if it does.

@bolkedebruin
Copy link

@RameshByndoor that is exactly what I said and I am not trying to make the point that it does. However, the AST of Ranger 'wraps' (if you will) the row level expressions. Please note that we are not just talking about the row level expressions here, but that @dain was, in my opinion, asking for a more holistic view on how policies work in Ranger and how these could be integrated into Presto. A 'row level' expression is just a mere sub item of a policy.

@RameshByndoor
Copy link
Contributor

@bolkedebruin Thanks for raising that concern. You are right, Ranger is much more capable in that way, We can consume whole of presto resources and define everything over there. I have tried that too. @bolkedebruin @dain have look at here. Hope this helps. :)

@dain
Copy link
Contributor

dain commented Oct 30, 2018

@bolkedebruin and @RameshByndoor, my main concern is row level filtering (I think the rest is pretty straight forward). Specifically, for the example above, "rowLevelFilters" : [{"filterName": "my_name", "expression", "cust.name = 'bla'"}] what language is expression in? If it is HQL, I'm super concerned. As you all mention, the if the language is not understood the query could fail, and I agree that that is an acceptable outcome. I'm concerned about the case where we think we can run the expression, and the results of running it are not exactally the same as Hive, so the query returns too many or too few rows, both of which will produce incorrect answers and my leak sensitive information.

@dain
Copy link
Contributor

dain commented Nov 1, 2018

@cryptoe my personal suggestion is we split this PR into two parts. In the first part, we add support for column Ranger checks, but fail if there are row filters (I believe this is uncontroversial). Then for the second part, we develop a spec for a limited subset of the HQL filter language, that has tight definitions of the semantics the language and each supported operator, function and type. The spec would be our commitment to correctness. This second part might be controversial, so I'd wait to hear feedback from @martint, @electrum, @kokosing, and others.

@bolkedebruin
Copy link

Sounds fair. Can we add support for "just" the presto filter language (too)? (Eg. Decoupling it from hive?)

@dain
Copy link
Contributor

dain commented Nov 1, 2018

Can we add support for "just" the presto filter language (too)? (Eg. Decoupling it from hive?)

@bolkedebruin, is that something people want? I think there is a much bigger discussion to be had around how this is actually integrated into the engine. I expect that the current "string expression" API isn't what the optimizer folks will want for this integration point, but we should wait for them to comment.

@Phelodas
Copy link

Phelodas commented Feb 4, 2019

Hi everyone,

Thanks for putting this PR up and it looks great!

I am part of the security team at FB and we really have a high priority to merge the row level security piece (column masking is not as high a priority for us). However, it seems like there is still more flushing out needed in terms of how the filters are propagated. A couple of questions:
i) Do we need a more JSON like spec for how to transmit the filters? This will ensure a more concrete API rather than failing at parsing. This part of the design does not seem to be addressed yet and seems to be one of the questions @dain was posing. If we are going down this path, I would propose by starting with simple column expressions and then extending to correlated subqueries.
ii) @cryptoe are you looking into i) at the moment or looking into finishing up the column masking PR?

Thanks!

CC: @elonazoulay, @cemcayiroglu, @martint, @electrum

@cryptoe
Copy link
Author

cryptoe commented Feb 5, 2019

@Phelodas I was splitting the PR into two as @dain suggested. One would be the ranger piece containing only basic ACL's on tables/schemas/catalogs.
I have not started working on the second part. As of now

@bolkedebruin Presto sever uses jersey 2.xx while ranger uses jersey version 1.xx . We got around the dependency issues by shading the jar in the released artifact. Solving the dependency issues required for test case addition is a little tricky. Currently stuck there. Working on having a specific classloader for test cases to load the ranger jars in TestingPrestoServer .

@tooptoop4
Copy link

gentle ping

@WalkerTR
Copy link

Hi all,

I am preparing a PR for the Ranger 1.2.0 integration. This will only include some simple functionalities, row-level filters and column masking are not included.

I hope to submit the PR for review by the end of this week.

@Phelodas
Copy link

Hey guys,

I am going to make a new PR for the row level security changes. I think it's important to explore a more structured approach to passing the filters to Presto but it would be great to vet if this design works for Ranger as well. I'll have a design doc up by tomorrow.

Thanks!

@WalkerTR
Copy link

@Phelodas What do you mean by "passing the filters to Presto"? All the policies should be defined in the ranger UI. The policy file is then cached and evaluated within Presto using the RangerBasePlugin (or a subclass of it).

@cryptoe
Copy link
Author

cryptoe commented Mar 6, 2019

gentle ping

I have sent the pr in the prestosql repo. Please check if it solves your usecase

@bolkedebruin
Copy link

bolkedebruin commented Mar 16, 2019

Update: shading jersey when integrating with Ranger 1.2 proofs hard, due to transitive dependencies. We are working (incl @WalkerTR ) on an intermediate solution.

We would love to know how you shaded Jersey @cryptoe and would like to see if it applies to a recent ranger. We think Ranger changes are required and that will take some time to have those merged upstream.

@cryptoe
Copy link
Author

cryptoe commented Mar 18, 2019

@bolkedebruin Please checkout the ranger pom.xml https://github.com/prestodb/presto/blob/5bdc5106f2ecaf7068fdf5b54045deccd8bc0136/presto-ranger-plugin/pom.xml .

I think the pretty much the same pom should work for your usecase no ?

@bolkedebruin
Copy link

We reached out to the Ranger ppl. They have a shim pattern that solves the issues, most likely, without the need for sharing. Diving in today.

@Phelodas
Copy link

Opened a new issue that is focused solely on Row Level Security : #12491 . I've attached a design doc there as well. Would love to hear your comments/feedback.

@WalkerTR Sorry, I missed your comment. I am mentioning the format for the filter in the design doc I mentioned above. The idea would be to conform the SQL (HQL) to that format in the connector implementation. Please let me know your thoughts in the issue.

@bolkedebruin
Copy link

bolkedebruin commented Mar 27, 2019

@cryptoe @Phelodas I have made a plugin that works with latest Ranger and PrestoSql. It is part of the Ranger source code and I will push it upstream there. It does not do "control" yet, it is the boilerplate.

https://github.com/bolkedebruin/ranger/tree/presto_plugin

@bolkedebruin
Copy link

@Phelodas @cryptoe I have submitted the first iteration of the patch to Ranger: https://issues.apache.org/jira/browse/RANGER-2395 . We are testing it in our production environment.

Couple of points:

  • Does not use private APIs and thus doesn't require any user synchronization. It does uses standard Hadoop user integration (local system (ipa, nss), LDAP etc).
  • Builds against Ranger master
  • No row masking etc yet as Presto lacks support
  • Supports Kerberos
  • Standard auditing, to be seen if this needs to be extended

@radd3
Copy link

radd3 commented May 8, 2019

Following steps will help enable

  1. Add plugin jar to plugin folder
    presto/plugin/ranger-presto-plugin/presto-ranger-plugin-0.version-shaded.jar
  2. Enable ranger-access-control in etc/access-control.properties
access-control.name=ranger-access-control
#Comma separated list of service-types from ranger 
ranger-service-types=hive,mysql
#Comma separated list of app-ids from ranger 
ranger-app-ids=hive,mysql
ranger-security-config-hive=path_to_/ranger-hive-security.xml,path_to_/ranger-mysql-security.xml
ranger-audit-config-hive=path_to/ranger-hive-audit.xml,path_to_ranger-mysql-audit.xml
ranger.username=XXXX
ranger.password=XXXX

its throwing me error stating plugin not registered when i restart the presto.. here are my access control properties and jar is kept under plugin/ranger-presto/ranger-presto.jar
note: i pulled up only ranger code from https://github.com/cryptoe/presto/tree/origin/ranger_plugin
access-control.name=ranger-access-control
ranger-service-types=hive
ranger-app-ids=ranger_dev
ranger.username=xxxxx
ranger.password=xxxxx
ranger-security-config-hive=path_to_/ranger-hive-security.xml
ranger-audit-config-hive=path_to/ranger-hive-audit.xml

did i miss anything.

@shubhamtagra
Copy link
Contributor

@radd3 any chance you copied non-shaded jar? Shaded jar from presto-ranger-plugin/target/shaded directory needs to be copied to presto installation.

@radd3
Copy link

radd3 commented May 14, 2019

@radd3 any chance you copied non-shaded jar? Shaded jar from presto-ranger-plugin/target/shaded directory needs to be copied to presto installation.

i checked and it was shaded one.. after some debugging i am getting this error.. i also checked whether port was reused for something.. without this plugin the interaction in the cli works normal. with this plugin i get the below error.
java.lang.RuntimeException: Error fetching next at http://xxxxxxxxxx: 8084/v1/statement/20190514_183514_00003_qin73/1 returned an invalid response: JsonResponse{statusCode=500, statusMessage=Internal Server Error, headers={content-length=[1300], content-type=[text/plain], date=[Tue, 14 May 2019 18:35:14 GMT]}, hasValue=false} [Error: java.lang.AbstractMethodError: Method com/facebook/presto/ranger/RangerSystemAccessControl.checkCanSetUser(Ljava/util/Optional;Ljava/lang/String;)V is abstract

@bb786112
Copy link

Hi

I am not able to download the below jar. Github repo URL has compile time error in ranger plugin.
So, I am not able to build the project.

presto/plugin/ranger-presto-plugin/presto-ranger-plugin-0.version-shaded.jar

@andy12383
Copy link

Bumping PR.
@martint @dain

Design doc : https://docs.google.com/document/d/1Jtapmwkp1Up_w6w_3dUeOXfLPLRAsglbqoIWNRG-NJM/edit#

Presto-spi changes : https://github.com/prestodb/presto/blob/21f1f0932ac395a6a574596bc2f12a39ac5750df/presto-spi/src/main/java/com/facebook/presto/spi/security/SystemAccessControl.java

Presto-main changes https://github.com/prestodb/presto/blob/21f1f0932ac395a6a574596bc2f12a39ac5750df/presto-main/src/main/java/com/facebook/presto/sql/rewrite/RowFilterColumnMaskingRewrite.java

Rest all is the presto-ranger plugin

@cryptoe
hello, I have a question: in your code above , a sql with "WITH...AS", for example, "with a as (select colX from tableX) select * from a" in presto will fail, cause in the method visitTable in RowFilterColumnMaskingRewrite, it need a tablehandle which is get from metadata, but WITH...AS will not be added to metadata.

do you think this is a problem? how to deal with it?

thank you very much.

@stale
Copy link

stale bot commented May 26, 2020

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the task, make sure you've addressed reviewer comments, and rebase on the latest master. Thank you for your contributions!

@stale stale bot added the stale label May 26, 2020
@stale stale bot closed this Jun 2, 2020
@rohanpednekar rohanpednekar reopened this Jun 20, 2021
@stale stale bot removed the stale label Jun 20, 2021
@linux-foundation-easycla
Copy link

CLA Not Signed

@rohanpednekar rohanpednekar added the Roadmap A top level roadmap item label Jun 20, 2021
@ajaygeorge
Copy link
Contributor

ajaygeorge commented Sep 16, 2021

Hey @rohanpednekar , I see that you reopened this PR.
Is someone planning to work on this PR as this looks like a stale one which hasn't seen any updates from a long time.
cc @jbapple

@rohanpednekar rohanpednekar removed the Roadmap A top level roadmap item label Oct 1, 2021
@rohanpednekar
Copy link
Contributor

Yes, this is stale PR we have another PR #15519 for Ranger integration, Closing this for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet