Data security and governance SIG

Roman V Shaposhnik edited this page Feb 27, 2017 · 12 revisions

How to reach us

This SIG uses the regular ODPi technical mailing list for all communications.

We also hold weekly meetings every Tuesday at 4pm PST:

https://global.gotomeeting.com/join/664693301

United States (Toll Free): 1 866 899 4679

United States: +1 (224) 501-3318

Access Code: 664-693-301

Click here to see past meeting notes

Introduction

The Data security and governance SIG provides a place for industry experts to collaborate on a set of best practices for running multi-tenant big data lakes securely, with consideration for the control points demanded by enterprise regulatory environments and compliance policies. We recognize that the security implications of data lakes are far-reaching and that effective Hadoop security depends on a holistic approach touching multiple software components in the stack. Given how flexible Hadoop components are in their configuration and deployment, and how costly potential data breaches can be, we feel that a more prescriptive, industry-validated approach to security and data governance will prove superior to the "roll your own security model" mentality that the open source, DIY nature of the Hadoop ecosystem invites. To that end, we plan to produce a series of whitepapers and validation test suites addressing both platform considerations and the solutions practitioners may need to augment their platform practices.

At the platform level we plan on giving advice and validation capabilities for:

  • Administration
    • setting up and monitoring cluster-wide, overall policies
  • Authentication
    • Hadoop-level authentication and user proxying
    • system-level extensible authentication such as PAM, Kerberos and AD
  • Authorization
    • RBAC considerations
    • access control expressions
  • Audit
    • complete logging
    • querying, audit and BI capabilities that may be required to effectively make use of the logging data
    • integration with best-of-breed security information and event management (SIEM) systems
  • Data Protection
    • effective end-to-end (at rest and over-the-wire) data encryption practices
    • secure Key Management
      • private keys for various certificates
      • pre-shared secrets such as passwords, etc.
      • Java keystores
    • integration with best-of-breed industry recommendations on encryption, including hardware-assisted encryption
    • semantic data protection (HDFS, HBase, Hive, logs, etc.)

At the solutions level we plan to mainly address the areas of:

  • Information lifecycle management, as it relates to setting policies for data sets around:
    • ETL
    • archiving (including glacial storage)
    • purging and eviction
    • secure disaster recovery and guaranteeing business continuity
  • Data lineage
  • Data privacy
    • monitoring, alerting and de-identifying sensitive and PII datasets
    • masking, redacting and maintaining common data definitions
  • Integrated anomaly detection and compliance monitoring
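
As an illustration of the data privacy items above, the following is a minimal Python sketch of de-identifying, masking and redacting fields in a single record. It is not a SIG recommendation or part of any Hadoop component: the field names, the `PSEUDONYM_KEY` value, the email pattern and all function names are hypothetical, and a real deployment would fetch the key from a key management system.

```python
import hashlib
import hmac
import re

# Hypothetical secret for pseudonymization; assumed to come from a key
# management system in practice, never from source code.
PSEUDONYM_KEY = b"example-key-not-for-production"

# Deliberately simple email pattern, good enough for a sketch only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value: str) -> str:
    """De-identify a value with a keyed hash so equal inputs still match."""
    digest = hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask(value: str, visible: int = 4) -> str:
    """Mask all but the last `visible` characters of a value."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def redact_emails(text: str) -> str:
    """Redact email addresses embedded in free text such as log lines."""
    return EMAIL_RE.sub("[REDACTED]", text)


def protect_record(record: dict) -> dict:
    """Apply a hypothetical per-field policy to one record."""
    return {
        "user_id": pseudonymize(record["user_id"]),
        "card_number": mask(record["card_number"]),
        "note": redact_emails(record["note"]),
    }
```

Using a keyed hash rather than a plain hash is a design choice in this sketch: identifiers stay joinable across data sets, while reversing them by hashing guessed values requires the key.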

We plan to produce recommendations requiring little to no awareness at the ISV application level. Our goal is to achieve a truly transparent security model for Hadoop data lakes.

SIG membership

  • Roman Shaposhnik, Linux Foundation (SIG Champion)
  • Vineet Goel, Pivotal
  • Alan Gates, Hortonworks
  • Selvamohan Neethiraj, Hortonworks
  • Larry McCay, Hortonworks
  • Raj Desai, IBM
  • Sampada Basarkar, IBM
  • Pierre Regazzoni, IBM
  • Nisanth Simon, IBM
  • Mandy Chessell, IBM
  • Nigel L Jones, IBM
  • David Radley, IBM
  • Tanping Wang, IBM

Deliverables

We are working on a comprehensive HOWTO guide on Hadoop security.

Upstream Apache projects to consider

  • Apache Ranger
  • Apache Knox
  • Apache Sentry
  • Apache Spot
  • Apache Atlas
  • Apache Eagle