# 🔐 Data Security, Compliance, and Auditing

Objective: implement security controls (IAM, encryption, masking), comply with regulations (GDPR, HIPAA, SOC2), and establish auditing of access and lineage.

- Duration: 120–150 min
- Difficulty: High
- Prerequisites: Governance (Senior 01), experience with cloud IAM

### 🔒 **Defense in Depth: Multi-Layer Security Architecture**

**Layered Security Model for Data Platforms**

```
┌─────────────────────────────────────────────────────────┐
│  Layer 1: NETWORK SECURITY                              │
│  • VPC isolation, private subnets                       │
│  • Security Groups (stateful firewall)                  │
│  • NACLs (Network ACLs - stateless)                     │
│  • VPN/PrivateLink for connectivity                     │
│  • DDoS protection (AWS Shield, CloudFlare)             │
├─────────────────────────────────────────────────────────┤
│  Layer 2: IDENTITY & ACCESS MANAGEMENT (IAM)            │
│  • Least privilege principle                            │
│  • Role-Based Access Control (RBAC)                     │
│  • Multi-Factor Authentication (MFA)                    │
│  • Service accounts with minimal permissions            │
│  • Temporary credentials (STS AssumeRole)               │
├─────────────────────────────────────────────────────────┤
│  Layer 3: DATA ENCRYPTION                               │
│  • At-rest: KMS, customer-managed keys                  │
│  • In-transit: TLS 1.3, mTLS                            │
│  • Application-level: Field-level encryption            │
│  • Key rotation (automatic, 90 days)                    │
├─────────────────────────────────────────────────────────┤
│  Layer 4: APPLICATION SECURITY                          │
│  • Input validation (SQL injection, XSS)                │
│  • API authentication (JWT, OAuth2)                     │
│  • Rate limiting, DDoS mitigation                       │
│  • Security headers (HSTS, CSP)                         │
├─────────────────────────────────────────────────────────┤
│  Layer 5: DATA GOVERNANCE                               │
│  • PII masking/tokenization                             │
│  • Data classification (public, internal, confidential) │
│  • Access logs and auditing                             │
│  • Data lineage tracking                                │
├─────────────────────────────────────────────────────────┤
│  Layer 6: MONITORING & INCIDENT RESPONSE                │
│  • SIEM (Splunk, Datadog Security)                      │
│  • Anomaly detection (ML-based)                         │
│  • Security alerts (Slack, PagerDuty)                   │
│  • Incident response playbooks                          │
└─────────────────────────────────────────────────────────┘
```

**AWS IAM: Least Privilege Implementation**

```python
import boto3
import json

# 1. ROLE-BASED ACCESS CONTROL (RBAC)

# Data Engineer (Read raw, Write curated)
data_engineer_policy = {
    "Version": "2012-10-17",
    "Statement": [
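        {
            "Sid": "ReadRawData",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::data-lake/raw/*"
        },
        {
            # s3:ListBucket is a bucket-level action: it needs the bucket ARN,
            # and the s3:prefix condition limits listing to raw/
            "Sid": "ListRawPrefix",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::data-lake",
            "Condition": {"StringLike": {"s3:prefix": "raw/*"}}
        },
        {
            "Sid": "WriteCuratedData",
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::data-lake/curated/*",
            "Condition": {
                "StringEquals": {
                    "s3:x-amz-server-side-encryption": "aws:kms"  # Enforce encryption
                }
            }
        },
        {
            # Kept separate: delete requests carry no encryption header,
            # so the condition above would never match them
            "Sid": "DeleteCuratedData",
            "Effect": "Allow",
            "Action": "s3:DeleteObject",
            "Resource": "arn:aws:s3:::data-lake/curated/*"
        },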
        {
            "Sid": "GlueJobExecution",
            "Effect": "Allow",
            "Action": [
                "glue:StartJobRun",
                "glue:GetJobRun",
                "glue:GetJobRuns"
            ],
            "Resource": "arn:aws:glue:us-east-1:123456789012:job/etl-*"
        },
        {
            "Sid": "DenyProductionDelete",
            "Effect": "Deny",
            "Action": "s3:DeleteObject",
            "Resource": "arn:aws:s3:::data-lake/prod/*"
        }
    ]
}

# Data Scientist (Read-only curated + analytics)
data_scientist_policy = {
    "Version": "2012-10-17",
    "Statement": [
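        {
            "Sid": "ReadCuratedAndAnalytics",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": [
                "arn:aws:s3:::data-lake/curated/*",
                "arn:aws:s3:::data-lake/analytics/*"
            ]
        },
        {
            # s3:ListBucket applies to the bucket ARN, scoped by prefix
            "Sid": "ListCuratedAndAnalytics",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::data-lake",
            "Condition": {"StringLike": {"s3:prefix": ["curated/*", "analytics/*"]}}
        },
        {
            "Sid": "QueryInWorkgroup",
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryExecution",
                "athena:GetQueryResults"
            ],
            # Athena scopes these actions by workgroup ARN
            "Resource": "arn:aws:athena:us-east-1:123456789012:workgroup/data-science-workgroup"
        },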
        {
            "Effect": "Deny",
            "Action": ["s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::data-lake/*"
        }
    ]
}

# 2. SERVICE ACCOUNT (EMR Cluster)
emr_service_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::data-lake/processing/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": "arn:aws:kms:us-east-1:123456789012:key/data-lake-key"
        }
    ]
}

# Create IAM role with policy
iam = boto3.client('iam')

def create_data_role(role_name, policy_document, description):
    """Create IAM role with inline policy"""
    
    # Trust policy (who can assume this role)
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "glue.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
    
    # Create role
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy),
        Description=description,
        MaxSessionDuration=3600,  # 1 hour
        Tags=[
            {'Key': 'Team', 'Value': 'DataEngineering'},
            {'Key': 'Environment', 'Value': 'Production'}
        ]
    )
    
    # Attach inline policy
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=f'{role_name}-policy',
        PolicyDocument=json.dumps(policy_document)
    )
    
    return role['Role']['Arn']

# 3. TEMPORARY CREDENTIALS (STS AssumeRole)
sts = boto3.client('sts')

def assume_data_engineer_role(role_arn, session_name):
    """Get temporary credentials"""
    response = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=3600,  # 1 hour
        # Session tags require sts:TagSession in the role's trust policy
        Tags=[
            {'Key': 'User', 'Value': session_name}
        ]
    )
    
    credentials = response['Credentials']
    
    # Use temporary credentials
    s3 = boto3.client(
        's3',
        aws_access_key_id=credentials['AccessKeyId'],
        aws_secret_access_key=credentials['SecretAccessKey'],
        aws_session_token=credentials['SessionToken']
    )
    
    return s3

# 4. MFA ENFORCEMENT
mfa_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowListActions",
            "Effect": "Allow",
            "Action": [
                "iam:ListUsers",
                "iam:ListVirtualMFADevices"
            ],
            "Resource": "*"
        },
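        {
            "Sid": "AllowIndividualUserToManageTheirOwnMFA",
            "Effect": "Allow",
            "Action": [
                "iam:EnableMFADevice",
                "iam:CreateVirtualMFADevice"
            ],
            # CreateVirtualMFADevice targets the mfa resource, EnableMFADevice the user
            "Resource": [
                "arn:aws:iam::*:mfa/${aws:username}",
                "arn:aws:iam::*:user/${aws:username}"
            ]
        },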
        {
            "Sid": "DenyAllExceptListedIfNoMFA",
            "Effect": "Deny",
            "NotAction": [
                "iam:CreateVirtualMFADevice",
                "iam:EnableMFADevice",
                "iam:ListMFADevices",
                "iam:ListUsers",
                "iam:GetUser"
            ],
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {
                    "aws:MultiFactorAuthPresent": "false"
                }
            }
        }
    ]
}
```
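
The policies above rely on the rule that an explicit `Deny` always overrides an `Allow`, and that anything not allowed is implicitly denied. The following is a minimal sketch of that rule, not the real IAM engine: the `is_allowed` helper is hypothetical, it ignores `Condition`, `NotAction`, and principals, and it uses shell-style wildcard matching as an approximation of IAM's matching.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single string or a list for Action/Resource."""
    return value if isinstance(value, list) else [value]


def _matches(patterns, value):
    """True if any wildcard pattern matches the value."""
    return any(fnmatchcase(value, pattern) for pattern in _as_list(patterns))


def is_allowed(policy: dict, action: str, resource: str) -> bool:
    """Simplified evaluation: explicit Deny > Allow > implicit deny."""
    allowed = False
    for stmt in policy["Statement"]:
        if not (_matches(stmt["Action"], action) and _matches(stmt["Resource"], resource)):
            continue
        if stmt["Effect"] == "Deny":
            return False  # An explicit deny wins immediately
        allowed = True
    return allowed  # Stays False when no statement matched (implicit deny)


# Illustrative policy: broad write access with a production delete guard
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:PutObject", "s3:DeleteObject"],
         "Resource": "arn:aws:s3:::data-lake/*"},
        {"Effect": "Deny", "Action": "s3:DeleteObject",
         "Resource": "arn:aws:s3:::data-lake/prod/*"},
    ],
}

can_delete_prod = is_allowed(policy, "s3:DeleteObject", "arn:aws:s3:::data-lake/prod/x.parquet")
can_delete_curated = is_allowed(policy, "s3:DeleteObject", "arn:aws:s3:::data-lake/curated/x.parquet")
```

This is why a guard such as `DenyProductionDelete` keeps working even if a broader `Allow` is attached to the same role later.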

**Secrets Management: Automatic Rotation**

```python
import boto3
from datetime import datetime, timedelta

secretsmanager = boto3.client('secretsmanager')

# 1. Create secret with automatic rotation
def create_rotating_secret(secret_name, secret_value):
    """Create a secret with automatic rotation every 30 days"""
    
    response = secretsmanager.create_secret(
        Name=secret_name,
        Description='Database password with auto-rotation',
        SecretString=secret_value,
        Tags=[
            {'Key': 'Environment', 'Value': 'Production'},
            {'Key': 'AutoRotate', 'Value': 'true'}
        ]
    )
    
    # Enable automatic rotation (Lambda-based)
    secretsmanager.rotate_secret(
        SecretId=secret_name,
        RotationLambdaARN='arn:aws:lambda:us-east-1:123456789012:function:rotate-db-password',
        # Use AutomaticallyAfterDays or ScheduleExpression, not both
        RotationRules={
            'AutomaticallyAfterDays': 30  # Rotate every 30 days
        }
    )
    
    return response['ARN']

# 2. Retrieve secret in application
def get_database_credentials(secret_name):
    """Get current secret value"""
    response = secretsmanager.get_secret_value(SecretId=secret_name)
    
    import json
    secret = json.loads(response['SecretString'])
    
    return {
        'username': secret['username'],
        'password': secret['password'],
        'host': secret['host'],
        'port': secret['port']
    }

# 3. Lambda function for rotation (deployed separately; shown here as a string)
"""
import boto3
import psycopg2
import json

def _connect(secret):
    '''Open a connection from a secret dict (username, password, host, port)'''
    return psycopg2.connect(
        user=secret['username'], password=secret['password'],
        host=secret['host'], port=secret['port']
    )

def lambda_handler(event, context):
    '''Rotate database password (createSecret, setSecret, testSecret, finishSecret)'''
    
    service_client = boto3.client('secretsmanager')
    arn = event['SecretId']
    token = event['ClientRequestToken']
    step = event['Step']
    
    def get_secret(stage):
        response = service_client.get_secret_value(SecretId=arn, VersionStage=stage)
        return response['VersionId'], json.loads(response['SecretString'])
    
    if step == 'createSecret':
        # Copy the current secret and swap in a freshly generated password
        _, pending = get_secret('AWSCURRENT')
        pending['password'] = service_client.get_random_password(
            PasswordLength=32, ExcludePunctuation=True
        )['RandomPassword']
        
        # Store pending secret
        service_client.put_secret_value(
            SecretId=arn,
            ClientRequestToken=token,
            SecretString=json.dumps(pending),
            VersionStages=['AWSPENDING']
        )
    
    elif step == 'setSecret':
        # Update database with new password, connecting with current credentials
        _, current = get_secret('AWSCURRENT')
        _, pending = get_secret('AWSPENDING')
        
        conn = _connect(current)
        with conn, conn.cursor() as cursor:
            # psycopg2 quotes the value client-side; never build this with an f-string
            cursor.execute("ALTER USER myuser WITH PASSWORD %s", (pending['password'],))
        conn.close()
    
    elif step == 'testSecret':
        # Verify new credentials work
        _, pending = get_secret('AWSPENDING')
        _connect(pending).close()
    
    elif step == 'finishSecret':
        # Move AWSCURRENT from the old version to the pending one
        current_version, _ = get_secret('AWSCURRENT')
        service_client.update_secret_version_stage(
            SecretId=arn,
            VersionStage='AWSCURRENT',
            MoveToVersionId=token,
            RemoveFromVersionId=current_version
        )
"""
```
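
The four rotation steps revolve around staging labels: readers always resolve `AWSCURRENT`, while the new value sits under `AWSPENDING` until it has been tested. The sketch below models only that label movement in memory. `SecretVersions` is a hypothetical class for illustration and does not call AWS; dropping `AWSPENDING` at the end is a simplification of what the service does.

```python
class SecretVersions:
    """In-memory model of Secrets Manager staging labels (illustration only)."""

    def __init__(self, version_id: str, value: str):
        self.values = {version_id: value}
        self.labels = {"AWSCURRENT": version_id}

    def create_pending(self, version_id: str, value: str) -> None:
        """createSecret: store the new value without exposing it to readers."""
        self.values[version_id] = value
        self.labels["AWSPENDING"] = version_id

    def finish(self) -> None:
        """finishSecret: promote the pending version, keep the old one as previous."""
        pending = self.labels.pop("AWSPENDING")
        self.labels["AWSPREVIOUS"] = self.labels["AWSCURRENT"]
        self.labels["AWSCURRENT"] = pending

    def get(self, stage: str = "AWSCURRENT") -> str:
        """Resolve a staging label to its secret value."""
        return self.values[self.labels[stage]]


secret = SecretVersions("v1", "old-password")
secret.create_pending("v2", "new-password")
value_during_rotation = secret.get()   # Readers still see the old value
secret.finish()
value_after_rotation = secret.get()    # Readers now see the new value
```

Because applications read by label rather than by version, they pick up the new password on their next fetch without a redeploy.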

**Real-World Case: Capital One Breach (2019)**

**Vulnerability**: A misconfigured web application firewall allowed a server-side request forgery (SSRF) attack from the Internet to reach the instance metadata service (169.254.169.254), which exposed IAM credentials.

**Impact**: 100M customer records compromised, $80M fine.

**Lessons**:
1. ✅ **Network Segmentation**: Servers in private subnets
2. ✅ **IMDSv2**: Require a session token for metadata requests (AWS)
3. ✅ **Metadata access**: Block 169.254.169.254 from application processes with host-level firewall rules (security groups are allow-only and do not filter metadata traffic)
4. ✅ **WAF**: Web Application Firewall with OWASP rules
5. ✅ **Least Privilege**: Limit the scope of IAM roles

```python
import boto3

# IMDSv2 enforcement (require token)
ec2 = boto3.client('ec2')

ec2.modify_instance_metadata_options(
    InstanceId='i-1234567890abcdef0',
    HttpTokens='required',  # Require IMDSv2
    HttpPutResponseHopLimit=1  # Token response cannot travel more than one network hop
)
```
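
With `HttpTokens='required'`, every metadata read becomes a two-step exchange: a `PUT` that obtains a session token, then a `GET` that presents it. The sketch below only builds the two requests with the standard library and sends nothing; the helper names are hypothetical, while the endpoint paths and header names are the ones IMDSv2 uses.

```python
from urllib.request import Request

IMDS = "http://169.254.169.254"


def build_token_request(ttl_seconds: int = 21600) -> Request:
    """Step 1: PUT request that asks IMDSv2 for a session token."""
    return Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_metadata_request(path: str, token: str) -> Request:
    """Step 2: GET request that presents the token from step 1."""
    return Request(
        f"{IMDS}/latest/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


token_request = build_token_request(ttl_seconds=60)
metadata_request = build_metadata_request("iam/security-credentials/", token="example-token")
```

A typical SSRF flaw lets an attacker trigger a plain `GET` to a chosen URL, but not a `PUT` with a custom header, which is why requiring the token closes the path used in this breach.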

---
**Author:** Luis J. Raigoso V. (LJRV)

### 🔐 **Encryption: At-Rest, In-Transit, and Application-Level**

**Encryption Hierarchy**

```
┌──────────────────────────────────────────────────────┐
│  1. AT-REST ENCRYPTION                               │
│     • Storage: S3 (SSE-KMS), EBS, RDS                │
│     • Databases: TDE (Transparent Data Encryption)   │
│     • Backups: Encrypted snapshots                   │
│     • Key Management: KMS, CloudHSM                  │
├──────────────────────────────────────────────────────┤
│  2. IN-TRANSIT ENCRYPTION                            │
│     • TLS 1.3 (HTTPS, gRPC)                          │
│     • mTLS (mutual TLS) for service-to-service       │
│     • VPN/PrivateLink for inter-VPC                  │
│     • IPSec for site-to-site                         │
├──────────────────────────────────────────────────────┤
│  3. APPLICATION-LEVEL ENCRYPTION                     │
│     • Field-level: Encrypt specific columns (PII)    │
│     • Envelope encryption (DEK + KEK)                │
│     • Client-side encryption before upload           │
│     • Tokenization (irreversible)                    │
└──────────────────────────────────────────────────────┘
```

**AWS KMS: Customer-Managed Keys**

```python
import base64
import json  # Needed for json.dumps on the key and bucket policies

import boto3
from cryptography.fernet import Fernet

kms = boto3.client('kms')

# 1. CREATE CUSTOMER-MANAGED KEY (CMK)
def create_data_encryption_key():
    """Create a KMS key with automatic rotation"""
    
    response = kms.create_key(
        Description='Data Lake encryption key',
        KeyUsage='ENCRYPT_DECRYPT',
        Origin='AWS_KMS',
        MultiRegion=False,
        Tags=[
            {'TagKey': 'Environment', 'TagValue': 'Production'},
            {'TagKey': 'Purpose', 'TagValue': 'DataLake'}
        ]
    )
    
    key_id = response['KeyMetadata']['KeyId']
    
    # Create alias
    kms.create_alias(
        AliasName='alias/data-lake-key',
        TargetKeyId=key_id
    )
    
    # Enable automatic key rotation (every 365 days by default)
    kms.enable_key_rotation(KeyId=key_id)
    
    # Set key policy
    key_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Enable IAM User Permissions",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::123456789012:root"
                },
                "Action": "kms:*",
                "Resource": "*"
            },
            {
                "Sid": "Allow use of the key for encryption",
                "Effect": "Allow",
                "Principal": {
                    "AWS": [
                        "arn:aws:iam::123456789012:role/DataEngineerRole",
                        "arn:aws:iam::123456789012:role/EMRServiceRole"
                    ]
                },
                "Action": [
                    "kms:Encrypt",
                    "kms:Decrypt",
                    "kms:GenerateDataKey",
                    "kms:DescribeKey"
                ],
                "Resource": "*"
            },
            {
                "Sid": "Allow CloudWatch Logs",
                "Effect": "Allow",
                "Principal": {
                    "Service": "logs.amazonaws.com"
                },
                "Action": [
                    "kms:Encrypt",
                    "kms:Decrypt",
                    "kms:GenerateDataKey"
                ],
                "Resource": "*"
            }
        ]
    }
    
    kms.put_key_policy(
        KeyId=key_id,
        PolicyName='default',
        Policy=json.dumps(key_policy)
    )
    
    return key_id

# 2. S3 ENCRYPTION (SSE-KMS)
s3 = boto3.client('s3')

def upload_encrypted_file(bucket, key, data, kms_key_id):
    """Upload a file with KMS encryption"""
    
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=data,
        ServerSideEncryption='aws:kms',
        SSEKMSKeyId=kms_key_id,
        BucketKeyEnabled=True  # Reduce KMS API calls (cost)
    )

# Enforce encryption on bucket
def enforce_bucket_encryption(bucket_name, kms_key_id):
    """Bucket policy: deny unencrypted uploads"""
    
    bucket_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedObjectUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                }
            },
            {
                "Sid": "DenyIncorrectKMSKey",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption-aws-kms-key-id": kms_key_id
                    }
                }
            }
        ]
    }
    
    s3.put_bucket_policy(
        Bucket=bucket_name,
        Policy=json.dumps(bucket_policy)
    )

# 3. RDS ENCRYPTION
rds = boto3.client('rds')

def create_encrypted_rds_instance():
    """Create an RDS instance with encryption at rest"""
    
    response = rds.create_db_instance(
        DBInstanceIdentifier='data-warehouse',
        DBInstanceClass='db.r5.2xlarge',
        Engine='postgres',
        MasterUsername='dw_admin',  # 'admin' is a reserved word for the postgres engine
        ManageMasterUserPassword=True,  # RDS generates the password and keeps it in Secrets Manager
        AllocatedStorage=1000,
        StorageType='gp3',
        StorageEncrypted=True,  # Enable encryption
        KmsKeyId='arn:aws:kms:us-east-1:123456789012:key/data-lake-key',
        BackupRetentionPeriod=30,
        CopyTagsToSnapshot=True,
        EnableCloudwatchLogsExports=['postgresql'],
        Tags=[
            {'Key': 'Encrypted', 'Value': 'true'}
        ]
    )
    
    return response['DBInstance']['DBInstanceArn']

# 4. ENVELOPE ENCRYPTION (Data Encryption Key)
def encrypt_large_file_with_envelope(file_path, kms_key_id):
    """Encrypt a file using the envelope encryption pattern"""
    
    # Generate Data Encryption Key (DEK)
    dek_response = kms.generate_data_key(
        KeyId=kms_key_id,
        KeySpec='AES_256'
    )
    
    plaintext_dek = dek_response['Plaintext']  # Use this to encrypt
    encrypted_dek = dek_response['CiphertextBlob']  # Store this
    
    # Encrypt file with DEK (Fernet splits the 32 bytes into an
    # AES-128 encryption key and an HMAC signing key)
    cipher = Fernet(base64.urlsafe_b64encode(plaintext_dek[:32]))
    
    with open(file_path, 'rb') as f:
        plaintext_data = f.read()
    
    encrypted_data = cipher.encrypt(plaintext_data)
    
    # Upload encrypted file + encrypted DEK
    s3.put_object(
        Bucket='data-lake',
        Key=f'encrypted/{file_path}',
        Body=encrypted_data,
        Metadata={
            'x-amz-key': base64.b64encode(encrypted_dek).decode('utf-8')
        }
    )
    
    return encrypted_data

def decrypt_envelope_encrypted_file(s3_key):
    """Decrypt a file encrypted with the envelope pattern"""
    
    # Get encrypted file + metadata
    response = s3.get_object(Bucket='data-lake', Key=s3_key)
    encrypted_data = response['Body'].read()
    encrypted_dek = base64.b64decode(response['Metadata']['x-amz-key'])
    
    # Decrypt DEK using KMS
    dek_response = kms.decrypt(CiphertextBlob=encrypted_dek)
    plaintext_dek = dek_response['Plaintext']
    
    # Decrypt file data using DEK
    cipher = Fernet(base64.urlsafe_b64encode(plaintext_dek[:32]))
    plaintext_data = cipher.decrypt(encrypted_data)
    
    return plaintext_data
```

**TLS/mTLS Configuration**

```python
# FastAPI with TLS (HTTPS)
import ssl

import uvicorn
from fastapi import FastAPI

app = FastAPI()

# Run with TLS
if __name__ == "__main__":
    uvicorn.run(
        app,
        host="0.0.0.0",
        port=8443,
        ssl_keyfile="/path/to/private.key",
        ssl_certfile="/path/to/certificate.crt",
        ssl_ca_certs="/path/to/ca-bundle.crt",  # CA that signs client certs (mTLS)
        ssl_cert_reqs=ssl.CERT_REQUIRED  # Enforce client certs
    )

# Nginx reverse proxy with TLS termination
"""
server {
    listen 443 ssl http2;
    server_name api.example.com;
    
    # TLS 1.3 only
    ssl_protocols TLSv1.3;
    ssl_certificate /etc/nginx/ssl/certificate.crt;
    ssl_certificate_key /etc/nginx/ssl/private.key;
    
    # ssl_ciphers only affects TLS 1.2 and below, so it is omitted here;
    # TLS 1.3 uses the OpenSSL default cipher suites
    
    # HSTS (force HTTPS)
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    
    # OCSP Stapling
    ssl_stapling on;
    ssl_stapling_verify on;
    
    location / {
        proxy_pass http://backend:8000;
        proxy_set_header X-Forwarded-Proto https;
    }
}
"""

# Python requests with client certificate (mTLS)
import requests

response = requests.get(
    'https://api.example.com/data',
    cert=('/path/to/client.crt', '/path/to/client.key'),
    verify='/path/to/ca-bundle.crt'  # Verify server cert
)
```
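
The same two requirements, TLS 1.3 only and mandatory client certificates, can also be expressed directly with Python's standard `ssl` module. This is a minimal sketch under the assumption that certificate paths are supplied by the caller; `build_mtls_server_context` is a hypothetical helper, and it skips loading files when no paths are given so the settings can be inspected without real certificates.

```python
import ssl
from typing import Optional


def build_mtls_server_context(
    certfile: Optional[str] = None,
    keyfile: Optional[str] = None,
    client_ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Server-side context that accepts only TLS 1.3 and requires client certs."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3   # Refuse TLS 1.2 and older
    ctx.verify_mode = ssl.CERT_REQUIRED            # Handshake fails without a client cert
    if certfile:
        ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    if client_ca_file:
        # CA bundle used to validate the certificates presented by clients
        ctx.load_verify_locations(cafile=client_ca_file)
    return ctx


ctx = build_mtls_server_context()
```

Using the named constant `ssl.CERT_REQUIRED` instead of the literal `2` keeps the intent readable in configuration reviews.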

**Field-Level Encryption (PII Protection)**

```python
from cryptography.fernet import Fernet
import hashlib
import hmac

class PIIEncryptor:
    """Encrypt/decrypt PII fields"""
    
    def __init__(self, encryption_key):
        self.cipher = Fernet(encryption_key)
    
    def encrypt_field(self, plaintext: str) -> str:
        """Encrypt single field"""
        return self.cipher.encrypt(plaintext.encode()).decode()
    
    def decrypt_field(self, ciphertext: str) -> str:
        """Decrypt single field"""
        return self.cipher.decrypt(ciphertext.encode()).decode()
    
    def tokenize(self, value: str, secret: str) -> str:
        """One-way tokenization (irreversible)"""
        return hmac.new(
            secret.encode(),
            value.encode(),
            hashlib.sha256
        ).hexdigest()

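# Spark UDF for field-level encryption
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Placeholder key for illustration; in production, load a Fernet key
# (32 url-safe base64-encoded bytes) from KMS or Secrets Manager
encryptor = PIIEncryptor(encryption_key=Fernet.generate_key())
TOKEN_SECRET = 'load-from-secrets-manager'  # Placeholder HMAC secret

def encrypt_pii_udf(value):
    if value:
        return encryptor.encrypt_field(value)
    return None

def tokenize_pii_udf(value):
    if value:
        return encryptor.tokenize(value, TOKEN_SECRET)
    return None

encrypt_udf = F.udf(encrypt_pii_udf, StringType())
tokenize_udf = F.udf(tokenize_pii_udf, StringType())

# Apply to DataFrame
df_encrypted = df.withColumn(
    'email_encrypted',
    encrypt_udf(F.col('email'))
).withColumn(
    'ssn_tokenized',
    tokenize_udf(F.col('ssn'))  # Keyed HMAC: an unkeyed SHA-256 of a 9-digit SSN can be brute-forced
)

# Save partitioned by date
df_encrypted.write.partitionBy('date').parquet('s3://bucket/encrypted-data/')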
```
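
The difference between a plain hash and a keyed token matters most for low-entropy fields such as SSNs or phone numbers, where every possible value can be enumerated. The sketch below illustrates this with 4-digit codes standing in for the real value space; the function names and the key are hypothetical.

```python
import hashlib
import hmac
from typing import Iterable, Optional


def plain_hash(value: str) -> str:
    """Unkeyed SHA-256: anyone can recompute it for a guessed value."""
    return hashlib.sha256(value.encode()).hexdigest()


def keyed_token(value: str, key: bytes) -> str:
    """HMAC-SHA256: recomputing it requires the secret key."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()


def brute_force(target_digest: str, candidates: Iterable[str]) -> Optional[str]:
    """Attacker's view: hash every candidate and compare with the leaked digest."""
    for candidate in candidates:
        if plain_hash(candidate) == target_digest:
            return candidate
    return None


# A small value space stands in for SSNs (10^9 values, still enumerable)
candidates = [f"{n:04d}" for n in range(10_000)]
secret_value = "4821"

leaked_hash = plain_hash(secret_value)
leaked_token = keyed_token(secret_value, key=b"hypothetical-key-from-a-secrets-manager")

recovered_from_hash = brute_force(leaked_hash, candidates)    # Recovers the value
recovered_from_token = brute_force(leaked_token, candidates)  # Fails without the key
```

Both functions are deterministic, so either output can still serve as a join key; only the keyed version resists enumeration, provided the key is stored apart from the data.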

**Real-World Case: Equifax Breach (2017)**

**Vulnerability**: Unpatched Apache Struts, plus an expired SSL certificate on a traffic-inspection device that went unnoticed, so the exfiltration was not detected.

**Impact**: 147M people affected, $700M settlement.

**Protection**:
```python
# 1. Automated vulnerability scanning
"""
# Trivy (container scanning)
trivy image my-data-pipeline:latest --severity HIGH,CRITICAL

# Snyk (dependency scanning)
snyk test --all-projects
"""

# 2. Certificate monitoring
import ssl
import socket
from datetime import datetime, timezone

def send_alert(message):
    """Placeholder: wire this to Slack, PagerDuty, etc."""
    print(message)

def check_ssl_expiry(hostname, port=443):
    """Alert if the certificate expires in fewer than 30 days"""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as ssock:
            cert = ssock.getpeercert()
    
    # notAfter is expressed in GMT; compare against an aware UTC timestamp
    not_after = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert['notAfter']), tz=timezone.utc
    )
    days_remaining = (not_after - datetime.now(timezone.utc)).days
    
    if days_remaining < 30:
        send_alert(f"⚠️ SSL cert expires in {days_remaining} days")
    
    return days_remaining
```
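
The threshold logic of a certificate monitor can be separated from the network call so that it is testable offline. The following sketch assumes a `notAfter` string in the format returned by `getpeercert()`; the helper names and the 30-day and 7-day thresholds are illustrative choices, not part of the original monitor.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter timestamp (GMT)."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def expiry_status(days_remaining: int, warn_days: int = 30, critical_days: int = 7) -> str:
    """Map remaining days to an alert level."""
    if days_remaining < 0:
        return "expired"
    if days_remaining < critical_days:
        return "critical"
    if days_remaining < warn_days:
        return "warning"
    return "ok"


# Fixed clock so the example is reproducible
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
days = days_until_expiry("Jan 21 00:00:00 2025 GMT", now)
status = expiry_status(days)
```

Passing `now` as a parameter avoids a hidden dependency on the system clock, which makes it practical to run this check in CI against a list of known certificates.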

---

### 🎭 **PII Protection: Masking, Tokenization & Anonymization**

**PII Classification & Protection Strategies**

```
┌────────────────────────────────────────────────────────┐
│  PII Category            Strategy                      │
├────────────────────────────────────────────────────────┤
│  DIRECT IDENTIFIERS                                    │
│  • SSN, Passport         → Tokenization (irreversible) │
│  • Email                 → Masking (u***@domain.com)   │
│  • Phone                 → Partial mask (***-***-1234) │
│  • Name                  → Pseudonymization            │
│                                                        │
│  QUASI-IDENTIFIERS (combination = identity)            │
│  • ZIP + Age + Gender    → k-anonymity (generalize)    │
│  • IP Address            → Truncate last octet         │
│  • Timestamp             → Round to hour/day           │
│                                                        │
│  SENSITIVE ATTRIBUTES                                  │
│  • Health data           → Encryption + access control │
│  • Financial data        → Field-level encryption      │
│  • Biometrics            → Hash + salt                 │
└────────────────────────────────────────────────────────┘
```

**Masking Techniques**

```python
import re
import hashlib
import hmac
from typing import Optional

class PIIMasker:
    """Comprehensive PII masking library"""
    
    def __init__(self, tokenization_secret: str):
        self.secret = tokenization_secret.encode()
    
    # 1. EMAIL MASKING
    def mask_email(self, email: str) -> str:
        """user@example.com → u***@example.com"""
        if not email or '@' not in email:
            return email
        
        user, domain = email.split('@', 1)
        if not user:
            return email
        # Keep only the first character; pad so short names still show a mask
        masked_user = user[0] + '*' * max(len(user) - 1, 1)
        
        return f"{masked_user}@{domain}"
    
    # 2. PHONE MASKING
    def mask_phone(self, phone: str) -> str:
        """(555) 123-4567 → ***-***-4567"""
        digits = re.sub(r'\D', '', phone)
        if len(digits) >= 10:
            return f"***-***-{digits[-4:]}"
        return "***-****"
    
    # 3. SSN/ID MASKING
    def mask_ssn(self, ssn: str) -> str:
        """123-45-6789 → ***-**-6789"""
        digits = re.sub(r'\D', '', ssn)
        if len(digits) == 9:
            return f"***-**-{digits[-4:]}"
        return "***-**-****"
    
    # 4. CREDIT CARD MASKING
    def mask_credit_card(self, cc: str) -> str:
        """4532-1234-5678-9010 → ****-****-****-9010"""
        digits = re.sub(r'\D', '', cc)
        if len(digits) >= 12:
            return f"****-****-****-{digits[-4:]}"
        return "****-****-****-****"
    
    # 5. NAME MASKING
    def mask_name(self, name: str) -> str:
        """John Smith → J*** S*****"""
        words = name.split()
        masked = []
        for word in words:
            if len(word) > 1:
                masked.append(word[0] + '*' * (len(word) - 1))
            else:
                masked.append(word)
        return ' '.join(masked)
    
    # 6. TOKENIZATION (deterministic, irreversible)
    def tokenize(self, value: str) -> str:
        """Generate consistent token for same input"""
        return hmac.new(
            self.secret,
            value.encode(),
            hashlib.sha256
        ).hexdigest()
    
    # 7. PSEUDONYMIZATION (re-identifiable only via a separately stored mapping)
    def pseudonymize(self, value: str, salt: str) -> str:
        """Create a pseudonym from hash + salt (the hash itself cannot be reversed)"""
        return hashlib.sha256(f"{value}{salt}".encode()).hexdigest()[:16]
    
    # 8. IP ADDRESS MASKING
    def mask_ip(self, ip: str) -> str:
        """192.168.1.100 → 192.168.1.0"""
        parts = ip.split('.')
        if len(parts) == 4:
            parts[-1] = '0'
            return '.'.join(parts)
        return ip
    
    # 9. DATE GENERALIZATION
    def generalize_date(self, date_str: str, precision: str = 'month') -> str:
        """2025-10-15 → 2025-10-01 (month precision)"""
        from datetime import datetime, timedelta
        dt = datetime.fromisoformat(date_str)
        
        if precision == 'year':
            return dt.strftime('%Y-01-01')
        elif precision == 'month':
            return dt.strftime('%Y-%m-01')
        elif precision == 'week':
            # Round to Monday
            days_since_monday = dt.weekday()
            week_start = dt - timedelta(days=days_since_monday)
            return week_start.strftime('%Y-%m-%d')
        return date_str

# Spark UDF Implementation
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

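masker = PIIMasker(tokenization_secret='your-secret-key')  # Placeholder: load from a secrets manager

def null_safe(fn):
    """Pass nulls through unchanged so UDFs do not fail on missing values"""
    return lambda value: None if value is None else fn(value)

# Register UDFs
mask_email_udf = F.udf(null_safe(masker.mask_email), StringType())
mask_phone_udf = F.udf(null_safe(masker.mask_phone), StringType())
mask_name_udf = F.udf(null_safe(masker.mask_name), StringType())  # Used by create_dev_dataset
tokenize_udf = F.udf(null_safe(masker.tokenize), StringType())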

# Apply to DataFrame
df_masked = (
    df
    .withColumn('email_masked', mask_email_udf(F.col('email')))
    .withColumn('phone_masked', mask_phone_udf(F.col('phone')))
    .withColumn('ssn_token', tokenize_udf(F.col('ssn')))
    .drop('email', 'phone', 'ssn')  # Remove original PII
)

# Example: Production vs Development masking
def create_dev_dataset(prod_df):
    """Create a dev dataset with PII masked"""
    return (
        prod_df
        .withColumn('email', mask_email_udf(F.col('email')))
        .withColumn('phone', F.lit('***-***-****'))  # Static mask
        .withColumn('ssn', tokenize_udf(F.col('ssn')))  # Consistent token
        .withColumn('name', mask_name_udf(F.col('name')))
        .withColumn('created_at', F.date_trunc('month', F.col('created_at')))
    )
```

**K-Anonymity: Quasi-Identifier Protection**

```python
import pandas as pd
import numpy as np

def k_anonymize(df: pd.DataFrame, quasi_identifiers: list, k: int = 5):
    """
    Generalize quasi-identifiers to achieve k-anonymity
    (every combination appears at least k times)
    """
    
    def generalize_age(age):
        """Age → age range"""
        if age < 18:
            return '<18'
        elif age < 30:
            return '18-29'
        elif age < 40:
            return '30-39'
        elif age < 50:
            return '40-49'
        elif age < 60:
            return '50-59'
        else:
            return '60+'
    
    def generalize_zipcode(zipcode):
        """12345 → 123**"""
        return str(zipcode)[:3] + '**'
    
    # Apply generalizations
    df_anon = df.copy()
    
    if 'age' in quasi_identifiers:
        df_anon['age'] = df_anon['age'].apply(generalize_age)
    
    if 'zipcode' in quasi_identifiers:
        df_anon['zipcode'] = df_anon['zipcode'].apply(generalize_zipcode)
    
    # Check k-anonymity: size of each row's quasi-identifier group
    group_sizes = df_anon.groupby(quasi_identifiers)[quasi_identifiers[0]].transform('size')
    
    # Suppress rows whose group is smaller than k (column order is preserved)
    df_anon = df_anon[group_sizes >= k].reset_index(drop=True)
    
    return df_anon

# Example
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 27, 35, 42, 55],
    'zipcode': [12345, 12346, 12347, 54321, 54322],
    'disease': ['flu', 'diabetes', 'flu', 'hypertension', 'diabetes']
}

df = pd.DataFrame(data)

# Apply k-anonymity (k=2)
df_k_anon = k_anonymize(df, quasi_identifiers=['age', 'zipcode'], k=2)

"""
Before:
name      age  zipcode    disease
Alice     25   12345      flu
Bob       27   12346      diabetes

After (k=2):
name      age     zipcode    disease
Alice     18-29   123**      flu
Bob       18-29   123**      diabetes
"""
```
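To confirm that a generalized dataset actually reaches the target k, one option is to measure its smallest equivalence class. The following is a minimal sketch using only the standard library; `k_of` is a hypothetical helper name, and rows are assumed to be plain dictionaries rather than a DataFrame.

```python
from collections import Counter

def k_of(rows, quasi_identifiers):
    """Return the size of the smallest equivalence class (0 for an empty dataset)."""
    if not rows:
        return 0
    # One key per row: the tuple of its quasi-identifier values
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(classes.values())

rows = [
    {"age": "18-29", "zipcode": "123**", "disease": "flu"},
    {"age": "18-29", "zipcode": "123**", "disease": "diabetes"},
    {"age": "30-39", "zipcode": "123**", "disease": "flu"},
]

print(k_of(rows, ["age", "zipcode"]))      # the 30-39 row is alone in its class
print(k_of(rows[:2], ["age", "zipcode"]))  # both rows share one class
```

A dataset satisfies k-anonymity for a given k when this value is at least k.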

**Differential Privacy: Statistical Noise**

```python
import numpy as np

class DifferentialPrivacy:
    """Add calibrated noise to preserve privacy"""
    
    def __init__(self, epsilon: float = 1.0):
        """
        epsilon: privacy budget (lower = more private, less accurate)
        - 0.1: Very private (high noise)
        - 1.0: Reasonable trade-off
        - 10: Minimal privacy (low noise)
        """
        self.epsilon = epsilon
    
    def add_laplace_noise(self, value: float, sensitivity: float = 1.0) -> float:
        """Add Laplacian noise"""
        scale = sensitivity / self.epsilon
        noise = np.random.laplace(0, scale)
        return value + noise
    
    def private_count(self, count: int) -> int:
        """Noisy count query"""
        noisy = self.add_laplace_noise(count, sensitivity=1.0)
        return max(0, int(round(noisy)))  # Non-negative
    
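    def private_mean(self, values: list, lower_bound: float, upper_bound: float) -> float:
        """Noisy mean; values are clipped to the bounds so the sensitivity holds"""
        if len(values) == 0:
            raise ValueError("values must be non-empty")
        clipped = np.clip(values, lower_bound, upper_bound)
        sensitivity = (upper_bound - lower_bound) / len(clipped)
        true_mean = float(np.mean(clipped))
        return self.add_laplace_noise(true_mean, sensitivity)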

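# Spark implementation
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

dp = DifferentialPrivacy(epsilon=1.0)

def add_noise_udf(value):
    """UDF to add differential privacy noise"""
    if value is None:
        return None
    return float(dp.add_laplace_noise(float(value), sensitivity=1.0))

# Mark the UDF as nondeterministic so Spark does not assume repeated calls return the same value
add_noise = F.udf(add_noise_udf, DoubleType()).asNondeterministic()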

# Apply to aggregations
df_private = (
    df.groupBy('city')
    .agg(F.count('*').alias('count'))
    .withColumn('noisy_count', add_noise(F.col('count')))
)
```
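Each noisy query spends part of the privacy budget: under basic sequential composition, the epsilons of successive queries on the same data add up. The sketch below illustrates that accounting with the standard library only. `PrivacyBudget` is a hypothetical class, not part of any library, and it samples Laplace noise as the difference of two exponential draws.

```python
import random

class PrivacyBudget:
    """Track cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon: float, seed=None):
        self.total_epsilon = total_epsilon
        self.spent = 0.0
        self._rng = random.Random(seed)

    def _laplace(self, scale: float) -> float:
        # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale)
        return self._rng.expovariate(1 / scale) - self._rng.expovariate(1 / scale)

    def noisy_count(self, true_count: int, epsilon: float) -> float:
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        # A count has sensitivity 1, so the noise scale is 1 / epsilon
        return true_count + self._laplace(scale=1.0 / epsilon)

budget = PrivacyBudget(total_epsilon=1.0, seed=42)
first = budget.noisy_count(100, epsilon=0.5)
second = budget.noisy_count(100, epsilon=0.5)
print(budget.spent)  # both queries together used the whole budget
```

Any further call to `noisy_count` on this budget raises `RuntimeError`, which is the point: without a cap, repeated queries would let an analyst average the noise away.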

**GDPR Right to Erasure (Right to Be Forgotten)**

```python
from datetime import datetime

from pyspark.sql import functions as F
from delta.tables import DeltaTable

def gdpr_delete_user_data(user_id: str, tables: list):
    """
    Delete all data for a user (GDPR compliance).
    Note: Delta Lake time travel keeps the deleted rows in older table
    versions until VACUUM removes the underlying files.
    """
    
    deletion_log = []
    
    for table_path in tables:
        # Load Delta table
        delta_table = DeltaTable.forPath(spark, table_path)
        
        # Count before
        before_count = spark.read.format('delta').load(table_path) \
            .filter(F.col('user_id') == user_id).count()
        
        # Delete (a Column predicate avoids interpolating user input into SQL)
        delta_table.delete(F.col('user_id') == user_id)
        
        # Verify
        after_count = spark.read.format('delta').load(table_path) \
            .filter(F.col('user_id') == user_id).count()
        
        deletion_log.append({
            'table': table_path,
            'user_id': user_id,
            'records_deleted': before_count,
            'records_remaining': after_count,
            'timestamp': datetime.now().isoformat()
        })
    
    # Log deletion for audit
    log_df = spark.createDataFrame(deletion_log)
    log_df.write.mode('append').format('delta').save('s3://audit/gdpr-deletions/')
    
    return deletion_log

# Anonymize instead of delete (alternative)
def gdpr_anonymize_user_data(user_id: str):
    """Replace PII with anonymized values"""
    
    masker = PIIMasker('secret-key')
    
    delta_table = DeltaTable.forPath(spark, 's3://data-lake/users/')
    
    delta_table.update(
        condition=F.col('user_id') == user_id,
        set={
            'email': F.lit(masker.tokenize(user_id)),
            'name': F.lit('ANONYMIZED'),
            'phone': F.lit('***-***-****'),
            'address': F.lit(None),
            'anonymized_at': F.current_timestamp()
        }
    )
```
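When a delete or update predicate has to be passed as a SQL string, the identifier should be validated before it is interpolated, since a crafted `user_id` could otherwise widen the condition. A minimal sketch follows, assuming user IDs are short alphanumeric strings; the pattern and the helper names `safe_user_id` and `build_condition` are illustrative assumptions.

```python
import re

# Assumed ID format: letters, digits, underscore or hyphen, 1 to 64 characters
_USER_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")

def safe_user_id(user_id: str) -> str:
    """Return user_id unchanged if it matches the expected format, else raise."""
    if not isinstance(user_id, str) or not _USER_ID_RE.fullmatch(user_id):
        raise ValueError("user_id has an unexpected format")
    return user_id

def build_condition(user_id: str) -> str:
    """Build a SQL predicate string from a validated identifier."""
    return f"user_id = '{safe_user_id(user_id)}'"

print(build_condition("u_123"))

try:
    build_condition("x' OR '1'='1")
except ValueError as exc:
    print("rejected:", exc)
```

Where the API accepts a column expression instead of a string, passing the value as data sidesteps the problem entirely.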

**Real-World Case: Facebook-Cambridge Analytica (2018)**

**Problem**: Data from up to 87M profiles was used for political targeting without user consent.

**GDPR Protections**:
```python
# 1. Consent management
class ConsentManager:
    def check_consent(self, user_id: str, purpose: str) -> bool:
        """Verify user consent before processing"""
        consent = get_user_consent(user_id)
        return purpose in consent.get('purposes', [])
    
    def process_with_consent(self, user_id: str, purpose: str, data):
        if not self.check_consent(user_id, purpose):
            raise ValueError(f"No consent for {purpose}")
        
        # Process data
        return transform(data)

# 2. Data minimization
def minimize_data_collection(user_data: dict, purpose: str):
    """Collect only necessary fields"""
    
    purpose_fields = {
        'analytics': ['user_id', 'timestamp', 'event_type'],
        'marketing': ['user_id', 'email', 'opt_in'],
        'personalization': ['user_id', 'preferences']
    }
    
    allowed_fields = purpose_fields.get(purpose, [])
    return {k: v for k, v in user_data.items() if k in allowed_fields}
```
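The consent check above depends on a lookup (`get_user_consent`) that is not shown. As a rough illustration of what such a store needs to track, here is an in-memory sketch; `InMemoryConsentStore` is a hypothetical stand-in for a real consent database, and consent is recorded per user and per purpose so that withdrawing one purpose leaves the others intact.

```python
from datetime import datetime, timezone

class InMemoryConsentStore:
    """Hypothetical stand-in for a consent database, keyed by (user_id, purpose)."""

    def __init__(self):
        self._records = {}

    def grant(self, user_id: str, purpose: str) -> None:
        self._records[(user_id, purpose)] = {
            "granted_at": datetime.now(timezone.utc),
            "withdrawn_at": None,
        }

    def withdraw(self, user_id: str, purpose: str) -> None:
        record = self._records.get((user_id, purpose))
        if record is not None:
            # Keep the record for auditability; only mark it withdrawn
            record["withdrawn_at"] = datetime.now(timezone.utc)

    def has_consent(self, user_id: str, purpose: str) -> bool:
        record = self._records.get((user_id, purpose))
        return record is not None and record["withdrawn_at"] is None

store = InMemoryConsentStore()
store.grant("u1", "analytics")
print(store.has_consent("u1", "analytics"))  # consent is on file
print(store.has_consent("u1", "marketing"))  # never granted for this purpose
store.withdraw("u1", "analytics")
print(store.has_consent("u1", "analytics"))  # withdrawn
```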

---
**Author:** Luis J. Raigoso V. (LJRV)

### 📋 **Compliance Frameworks: GDPR, HIPAA, SOC2 & Audit**

**Compliance Requirements Matrix**

| Framework | Scope | Key Requirements | Data Engineer Responsibilities |
|-----------|-------|------------------|--------------------------------|
| **GDPR** | Individuals in the EU (data subjects, regardless of citizenship) | • Consent<br>• Right to erasure<br>• Data portability<br>• Breach notification (72h) | • Delete/anonymize on request<br>• Data residency (EU)<br>• Encryption at-rest<br>• Access logs |
| **CCPA** | California residents | • Opt-out of sale<br>• Access to data<br>• Deletion rights | • Implement do-not-sell flag<br>• Data export API<br>• 45-day deletion window |
| **HIPAA** | Healthcare (US) | • PHI encryption<br>• Audit trails<br>• Business Associate Agreements | • Encrypt PHI at-rest/in-transit<br>• Implement access controls<br>• Audit all PHI access |
| **SOC2** | Service organizations | • Security controls<br>• Availability<br>• Confidentiality | • Implement monitoring<br>• Incident response<br>• Change management |
| **PCI-DSS** | Payment cards | • Cardholder data encryption<br>• Secure networks<br>• Regular testing | • Tokenize cards<br>• Network segmentation<br>• Vulnerability scanning |

**GDPR Implementation**

```python
import json
from datetime import datetime, timedelta
from pyspark.sql import functions as F
from delta.tables import DeltaTable

class GDPRCompliance:
    """GDPR compliance toolkit"""
    
    def __init__(self, spark_session):
        self.spark = spark_session
    
    # 1. RIGHT TO ACCESS (Article 15)
    def export_user_data(self, user_id: str, output_path: str):
        """
        Export all of a user's data in a readable format (JSON).
        Users can request a copy of their data.
        """
        
        tables = [
            's3://data-lake/users/',
            's3://data-lake/transactions/',
            's3://data-lake/events/',
            's3://data-lake/preferences/'
        ]
        
        user_data = {}
        
        for table_path in tables:
            table_name = table_path.split('/')[-2]
            
            df = self.spark.read.format('delta').load(table_path) \
                .filter(F.col('user_id') == user_id)
            
            # Convert to JSON
            records = df.toJSON().collect()
            user_data[table_name] = [json.loads(r) for r in records]
        
        # Add metadata
        export_package = {
            'user_id': user_id,
            'export_date': datetime.now().isoformat(),
            'data': user_data,
            'retention_policy': '7 years',
            'data_controller': 'YourCompany Inc.'
        }
        
        # Write to a local path; uploading to S3 for download is a separate step
        with open(output_path, 'w') as f:
            json.dump(export_package, f, indent=2)
        
        # Log export request
        self.log_gdpr_request('access', user_id)
        
        return export_package
    
    # 2. RIGHT TO ERASURE (Article 17)
    def delete_user_data(self, user_id: str, reason: str = 'user_request'):
        """
        Delete all user data (Right to be Forgotten).
        Must be completed without undue delay, normally within one month of the request.
        """
        
        tables = [
            's3://data-lake/users/',
            's3://data-lake/transactions/',
            's3://data-lake/events/',
            's3://data-lake/preferences/'
        ]
        
        deletion_report = {
            'user_id': user_id,
            'request_date': datetime.now(),
            'reason': reason,
            'tables_processed': []
        }
        
        for table_path in tables:
            delta_table = DeltaTable.forPath(self.spark, table_path)
            
            # Count before deletion
            before = self.spark.read.format('delta').load(table_path) \
                .filter(F.col('user_id') == user_id).count()
            
            # Delete (a Column predicate avoids interpolating user input into SQL)
            delta_table.delete(F.col('user_id') == user_id)
            
            # Verify deletion
            after = self.spark.read.format('delta').load(table_path) \
                .filter(F.col('user_id') == user_id).count()
            
            deletion_report['tables_processed'].append({
                'table': table_path,
                'records_deleted': before,
                'records_remaining': after,
                'status': 'complete' if after == 0 else 'failed'
            })
        
        # Purge from backups (async job)
        self.schedule_backup_purge(user_id)
        
        # Log deletion
        self.log_gdpr_request('erasure', user_id, deletion_report)
        
        return deletion_report
    
    # 3. DATA PORTABILITY (Article 20)
    def export_user_data_structured(self, user_id: str) -> dict:
        """Export in machine-readable formats (JSON, CSV, Parquet)"""
        
        # Export to multiple formats
        formats = ['json', 'csv', 'parquet']
        export_urls = {}
        
        for fmt in formats:
            output_path = f's3://exports/{user_id}/data.{fmt}'
            
            df = self.spark.read.format('delta').load('s3://data-lake/users/') \
                .filter(F.col('user_id') == user_id)
            
            if fmt == 'parquet':
                df.write.mode('overwrite').parquet(output_path)
            elif fmt == 'csv':
                df.write.mode('overwrite').option('header', True).csv(output_path)
            else:  # json
                df.write.mode('overwrite').json(output_path)
            
            export_urls[fmt] = output_path
        
        return export_urls
    
    # 4. CONSENT MANAGEMENT (Article 7)
    def check_consent(self, user_id: str, purpose: str) -> bool:
        """Verify explicit consent for data processing"""
        
        consent_df = self.spark.read.format('delta').load('s3://data-lake/consents/') \
            .filter(
                (F.col('user_id') == user_id) &
                (F.col('purpose') == purpose) &
                (F.col('consent_given') == True) &
                (F.col('consent_withdrawn_at').isNull())
            )
        
        return consent_df.count() > 0
    
    def withdraw_consent(self, user_id: str, purpose: str):
        """User withdraws consent"""
        
        delta_table = DeltaTable.forPath(self.spark, 's3://data-lake/consents/')
        
        delta_table.update(
            condition=(F.col('user_id') == user_id) & (F.col('purpose') == purpose),
            set={'consent_withdrawn_at': F.current_timestamp()}
        )
        
        # Stop processing data for this purpose
        self.log_gdpr_request('consent_withdrawal', user_id, {'purpose': purpose})
    
    # 5. BREACH NOTIFICATION (Article 33)
    def detect_data_breach(self):
        """Detect suspicious access patterns"""
        
        # Query access logs
        access_logs = self.spark.read.format('delta').load('s3://logs/access/')
        
        # Anomaly detection: unusual access volume
        anomalies = access_logs.groupBy('user_id', 'date') \
            .agg(F.count('*').alias('access_count')) \
            .filter(F.col('access_count') > 1000)  # Threshold
        
        if anomalies.count() > 0:
            self.trigger_breach_response(anomalies)
    
    def trigger_breach_response(self, anomalies):
        """72-hour notification requirement"""
        
        breach_report = {
            'detected_at': datetime.now(),
            'notification_deadline': datetime.now() + timedelta(hours=72),
            'affected_users': anomalies.select('user_id').distinct().count(),
            'actions_taken': [
                'Blocked suspicious IPs',
                'Reset affected user sessions',
                'Notified security team'
            ]
        }
        
        # Alert compliance team
        send_alert(
            title='⚠️ Potential Data Breach Detected',
            message=f"Affected users: {breach_report['affected_users']}\n"
                    f"Notification deadline: {breach_report['notification_deadline']}",
            channel='#security-incidents'
        )
        
        # Log breach
        self.log_gdpr_request('breach', 'system', breach_report)
    
    # 6. DATA RETENTION (Article 5)
    def enforce_retention_policy(self, table_path: str, retention_days: int = 2555):
        """
        Delete data older than retention period
        GDPR: Keep data no longer than necessary
        """
        
        delta_table = DeltaTable.forPath(self.spark, table_path)
        
        cutoff_date = datetime.now() - timedelta(days=retention_days)
        
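        # Delete old data
        delta_table.delete(F.col('created_at') < F.lit(cutoff_date))
        
        # Vacuum to physically delete files. Delta rejects a retention below 168 hours
        # unless spark.databricks.delta.retentionDurationCheck.enabled is set to false,
        # and a 0-hour vacuum can break concurrent readers and removes time travel history.
        delta_table.vacuum(retentionHours=0)  # Immediate deletion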
    
    # 7. AUDIT LOGGING
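    def log_gdpr_request(self, request_type: str, user_id: str, details: dict = None):
        """Log all GDPR-related operations"""
        import json  # local import keeps this method self-contained
        
        log_entry = {
            'timestamp': datetime.now().isoformat(),
            'request_type': request_type,
            'user_id': user_id,
            # Serialize nested details (dicts, lists, datetimes) to a JSON string
            # so Spark can infer a stable schema for the audit table
            'details': json.dumps(details or {}, default=str),
            'processed_by': 'gdpr_compliance_service'
        }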
        
        log_df = self.spark.createDataFrame([log_entry])
        log_df.write.mode('append').format('delta').save('s3://audit/gdpr-requests/')
```
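The 72-hour window in `trigger_breach_response` is computed from a naive local timestamp, which can drift across time zones and daylight saving changes. A small sketch of the same calculation with timezone-aware UTC datetimes is shown below. The helper names are hypothetical, and `detected_at` is assumed to be the moment the organization became aware of the breach.

```python
from datetime import datetime, timedelta, timezone

NOTIFICATION_WINDOW = timedelta(hours=72)  # GDPR Article 33

def breach_notification_deadline(detected_at: datetime) -> datetime:
    """Return the UTC deadline for notifying the supervisory authority."""
    if detected_at.tzinfo is None:
        raise ValueError("detected_at must be timezone-aware")
    return detected_at.astimezone(timezone.utc) + NOTIFICATION_WINDOW

def hours_remaining(detected_at: datetime, now: datetime) -> float:
    """Hours left before the deadline (negative once it has passed)."""
    remaining = breach_notification_deadline(detected_at) - now
    return remaining.total_seconds() / 3600

detected = datetime(2024, 3, 1, 10, 0, tzinfo=timezone.utc)
print(breach_notification_deadline(detected).isoformat())
print(hours_remaining(detected, datetime(2024, 3, 2, 10, 0, tzinfo=timezone.utc)))
```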

**HIPAA Compliance (Healthcare)**

```python
from datetime import datetime
from pyspark.sql import functions as F

class HIPAACompliance:
    """Health Insurance Portability and Accountability Act"""
    
    # PHI = Protected Health Information
    PHI_FIELDS = [
        'name', 'address', 'ssn', 'medical_record_number',
        'email', 'phone', 'ip_address', 'device_id',
        'diagnosis', 'prescription', 'lab_results'
    ]
    
    def encrypt_phi(self, df):
        """Encrypt all PHI fields"""
        
        for field in self.PHI_FIELDS:
            if field in df.columns:
                df = df.withColumn(
                    field,
                    encrypt_udf(F.col(field))
                )
        
        return df
    
    def audit_phi_access(self, user_id: str, patient_id: str, purpose: str):
        """Log every PHI access (HIPAA requirement)"""
        
        audit_entry = {
            'timestamp': datetime.now(),
            'user_id': user_id,
            'patient_id': patient_id,
            'purpose': purpose,
            'ip_address': get_client_ip(),
            'action': 'view_phi'
        }
        
        # Write to an append-only audit log
        spark.createDataFrame([audit_entry]).write \
            .mode('append') \
            .format('delta') \
            .option('mergeSchema', False) \
            .save('s3://audit/phi-access/')
    
    def de_identify_dataset(self, df):
        """
        Remove direct identifiers to create a de-identified dataset,
        following the Safe Harbor method (18 identifier categories).
        The column names below are illustrative.
        """
        
        identifiers = [
            'name', 'address', 'city', 'zip', 'phone', 'fax', 'email',
            'ssn', 'medical_record_number', 'health_plan_number',
            'account_number', 'certificate_number', 'vehicle_id',
            'device_id', 'url', 'ip_address', 'biometric_id',
            'photo', 'any_unique_code'
        ]
        
        # Remove identifiers
        df_deidentified = df.drop(*[col for col in identifiers if col in df.columns])
        
        # Generalize dates (only year)
        date_columns = [col for col, dtype in df.dtypes if 'date' in dtype or 'timestamp' in dtype]
        for col in date_columns:
            df_deidentified = df_deidentified.withColumn(
                col,
                F.year(F.col(col))
            )
        
        # Generalize age (>89 → 90+)
        if 'age' in df_deidentified.columns:
            df_deidentified = df_deidentified.withColumn(
                'age',
                F.when(F.col('age') > 89, 90).otherwise(F.col('age'))
            )
        
        return df_deidentified
```
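Besides dropping columns, Safe Harbor also asks for generalization of some values: ZIP codes are reduced to their first three digits (replaced with `000` when that three-digit area has 20,000 or fewer residents), ages above 89 are grouped into a single 90+ category, and dates are reduced to the year. The sketch below shows these three rules on plain Python values. The function names are hypothetical, and the set of low-population prefixes is a made-up placeholder that would have to come from census data.

```python
from datetime import date

# Placeholder only: the real list of low-population prefixes comes from census data
LOW_POPULATION_PREFIXES = frozenset({"999"})

def generalize_zip(zipcode, low_population_prefixes=LOW_POPULATION_PREFIXES) -> str:
    """Keep the first three digits, or return '000' for low-population areas."""
    prefix = str(zipcode).zfill(5)[:3]
    return "000" if prefix in low_population_prefixes else prefix

def generalize_age(age: int) -> str:
    """Group ages above 89 into a single '90+' category."""
    return "90+" if age > 89 else str(age)

def generalize_date(value: date) -> int:
    """Reduce a date to its year."""
    return value.year

print(generalize_zip("12345"), generalize_zip("99901"))
print(generalize_age(45), generalize_age(92))
print(generalize_date(date(2021, 5, 17)))
```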

**SOC2 Audit Evidence Collection**

```python
import boto3

# SOC2 Trust Services Criteria
class SOC2Compliance:
    """System and Organization Controls 2 (SOC 2) evidence collection"""
    
    def collect_audit_evidence(self, period_days: int = 90):
        """Gather evidence for SOC2 audit"""
        
        evidence = {
            # CC6.1: Logical access controls
            'access_controls': self.verify_access_controls(),
            
            # CC6.6: Vulnerability management
            'vulnerability_scans': self.collect_vulnerability_scans(period_days),
            
            # CC7.2: System monitoring
            'monitoring_alerts': self.collect_monitoring_data(period_days),
            
            # CC8.1: Change management
            'code_changes': self.collect_git_commits(period_days),
            
            # A1.2: System availability
            'uptime_metrics': self.calculate_uptime(period_days),
        }
        
        return evidence
    
    def verify_access_controls(self):
        """Verify IAM policies follow least privilege"""
        
        iam = boto3.client('iam')
        
        # Check MFA enforcement
        users = iam.list_users()['Users']
        
        mfa_status = []
        for user in users:
            mfa_devices = iam.list_mfa_devices(UserName=user['UserName'])
            
            mfa_status.append({
                'user': user['UserName'],
                'mfa_enabled': len(mfa_devices['MFADevices']) > 0
            })
        
        # Check overly permissive policies
        policies = iam.list_policies(Scope='Local')['Policies']
        
        risky_policies = []
        for policy in policies:
            policy_version = iam.get_policy_version(
                PolicyArn=policy['Arn'],
                VersionId=policy['DefaultVersionId']
            )
            
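            doc = policy_version['PolicyVersion']['Document']
            
            # Statement and Action may each be a single value or a list
            statements = doc.get('Statement', [])
            if isinstance(statements, dict):
                statements = [statements]
            
            # Check for wildcard permissions ('*' or 'service:*')
            for statement in statements:
                actions = statement.get('Action', [])
                if isinstance(actions, str):
                    actions = [actions]
                if statement.get('Effect') == 'Allow' and any(
                    a == '*' or a.endswith(':*') for a in actions
                ):
                    risky_policies.append(policy['PolicyName'])
                    break  # one finding per policy is enough
        
        return {
            # Share of users with MFA; None when the account has no IAM users
            'mfa_compliance': (
                sum(1 for u in mfa_status if u['mfa_enabled']) / len(mfa_status)
                if mfa_status else None
            ),
            'risky_policies': risky_policies
        }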
```
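`calculate_uptime` is referenced in the evidence dictionary but not shown. One plausible shape, assuming outages are recorded as `(start, end)` datetime pairs, is sketched below; `uptime_percent` is a hypothetical helper that clips outages to the audit period and merges overlapping ones so that downtime is not counted twice.

```python
from datetime import datetime

def uptime_percent(period_start, period_end, outages):
    """Percentage of the period during which no outage was in progress."""
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")

    # Clip each outage to the period and process them in start order
    clipped = sorted((max(s, period_start), min(e, period_end)) for s, e in outages)

    downtime = 0.0
    current_start = current_end = None
    for start, end in clipped:
        if start >= end:
            continue  # outage lies entirely outside the period
        if current_end is None or start > current_end:
            if current_end is not None:
                downtime += (current_end - current_start).total_seconds()
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)  # overlapping outage: extend
    if current_end is not None:
        downtime += (current_end - current_start).total_seconds()

    return 100.0 * (1 - downtime / total)

outages = [
    (datetime(2024, 1, 2, 0, 0), datetime(2024, 1, 2, 2, 0)),
    (datetime(2024, 1, 2, 1, 0), datetime(2024, 1, 2, 3, 0)),  # overlaps the first
]
# Three hours of merged downtime over a ten-day (240-hour) period
print(uptime_percent(datetime(2024, 1, 1), datetime(2024, 1, 11), outages))
```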

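**Real-World Case: Uber 2016 Breach (Concealed)**

**Problem**: A breach exposed data on 57M users; Uber paid the hackers $100K to keep it quiet and did not notify regulators or affected users, in violation of U.S. state breach-notification laws (the incident predates GDPR's 2018 application date).

**Outcome**: $148M settlement with U.S. state attorneys general; the former Chief Security Officer was criminally charged and later convicted.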

**Compliance Checklist**:
```python
compliance_checklist = """
✅ Data Classification (public, internal, confidential, restricted)
✅ Access Controls (RBAC, MFA)
✅ Encryption (at-rest: AES-256, in-transit: TLS 1.3)
✅ Data Retention Policy (automated deletion)
✅ Breach Detection (SIEM, anomaly detection)
✅ Incident Response Plan (72h notification for GDPR)
✅ Audit Logs (immutable, centralized)
✅ Third-Party Audits (annual SOC2, penetration testing)
✅ Employee Training (security awareness, phishing)
✅ Disaster Recovery (RTO < 4h, RPO < 1h)
"""
```

---

## 1. IAM and the principle of least privilege

- Roles by function: data-engineer-ro, data-scientist, admin.
- Granular policies: read access to a specific bucket, write access to a specific table.
- Mandatory MFA for sensitive operations (production, deletion).
- Automatic rotation of credentials (secrets, keys).
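Credential rotation is easier to enforce when stale keys are flagged automatically. The sketch below assumes key metadata is available as dictionaries with `AccessKeyId` and `CreateDate` fields (the field names mirror IAM access key metadata, but the input here is hand-built), and uses a 90-day maximum age as an example threshold. `keys_needing_rotation` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

def keys_needing_rotation(keys, max_age_days=90, now=None):
    """Return the IDs of keys created more than max_age_days ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [k["AccessKeyId"] for k in keys if k["CreateDate"] < cutoff]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
keys = [
    {"AccessKeyId": "KEY-OLD", "CreateDate": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"AccessKeyId": "KEY-NEW", "CreateDate": datetime(2024, 5, 1, tzinfo=timezone.utc)},
]
print(keys_needing_rotation(keys, now=now))  # only the January key is past 90 days
```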

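```python
# AWS IAM policy for a data engineer: read access to raw, write access to curated.
# s3:ListBucket is a bucket-level action, so it targets the bucket ARN and is
# restricted to the raw/ prefix through a condition.
iam_policy_example = r'''
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::data-lake"],
      "Condition": {"StringLike": {"s3:prefix": ["raw/*"]}}
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::data-lake/raw/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::data-lake/curated/*"]
    }
  ]
}
'''
print(iam_policy_example)
```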

## 2. Encryption in transit and at rest

- **At-rest**: S3 SSE-KMS, RDS encryption, disk encryption (EBS, GCP Persistent Disk).
- **In-transit**: TLS 1.2+ for all APIs, VPN/PrivateLink for internal connectivity.
- Key management: AWS KMS, GCP Cloud KMS, Azure Key Vault with automatic rotation.
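On the client side, the TLS 1.2+ requirement can be enforced in code rather than left to defaults. A minimal sketch with Python's standard `ssl` module follows; `strict_tls_context` is a hypothetical helper name, and the resulting context can be passed to clients that accept an `ssl.SSLContext`.

```python
import ssl

def strict_tls_context() -> ssl.SSLContext:
    """Client context that verifies certificates and refuses anything below TLS 1.2."""
    ctx = ssl.create_default_context()  # certificate and hostname checks are on by default
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx

ctx = strict_tls_context()
print(ctx.minimum_version.name, ctx.verify_mode.name, ctx.check_hostname)
```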

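## 3. PII masking and anonymization

```python
import hashlib

def mask_email(email: str) -> str:
    """Keep the first character of the local part and the full domain"""
    user, _, domain = email.rpartition('@')
    if not user or not domain:
        raise ValueError('Not a valid email address')
    return f'{user[0]}***@{domain}'

def hash_pii(value: str) -> str:
    # Unsalted SHA-256 is deterministic: low-entropy values (IDs, phone numbers)
    # can be recovered by brute force, so prefer a keyed hash for real data
    return hashlib.sha256(value.encode()).hexdigest()

mask_email('usuario@ejemplo.com'), hash_pii('12345678A')
```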
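A plain SHA-256 of an identifier such as a national ID number can be reversed by hashing every possible value, because the input space is small. A keyed hash (HMAC) avoids this as long as the key stays secret, while still mapping equal inputs to equal tokens so joins keep working. The sketch below uses the standard library; `pseudonymize` is a hypothetical helper, and the literal key is for demonstration only (in practice it would come from a secrets manager).

```python
import hashlib
import hmac

def pseudonymize(value: str, key: bytes) -> str:
    """Return a deterministic keyed token for value (HMAC-SHA256, hex-encoded)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

demo_key = b"demo-only-key"  # assumption: real keys are loaded from a secrets manager

token_a = pseudonymize("12345678A", demo_key)
token_b = pseudonymize("12345678A", demo_key)
token_other_key = pseudonymize("12345678A", b"another-key")

print(token_a == token_b)          # same input and key give the same token
print(token_a == token_other_key)  # a different key gives a different token
```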

## 4. Regulatory compliance

### GDPR (Europe)
- Right to be forgotten: implement cascading DELETE and purge from backups.
- Explicit, auditable consent.
- Data residency: store data in an EU region.

### HIPAA (healthcare, USA)
- Encrypted PHI, auditable access logs.
- Business Associate Agreements with cloud providers.

### SOC2 (organizational security)
- Access controls, monitoring, incident response.
- Annual third-party audits.

## 5. Access auditing and lineage

- CloudTrail (AWS), Cloud Audit Logs (GCP), Activity Log (Azure).
- Record who accessed which data, when, and from where.
- Data lineage: OpenLineage, DataHub, Marquez → trace transformations and usage.
- Alerts on anomalous access (SIEM: Splunk, Datadog Security).
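The who, what, when and where of an access can be captured as one structured record per event, which also makes simple volume-based alerts straightforward. The following is a minimal sketch with the standard library; the `AccessEvent` fields and the threshold are illustrative assumptions, not a fixed schema.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class AccessEvent:
    actor: str      # who
    resource: str   # what
    action: str
    source_ip: str  # from where
    timestamp: str  # when (ISO 8601, UTC)

def to_log_line(event: AccessEvent) -> str:
    """Serialize an event as one JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)

def heavy_readers(events, threshold: int) -> dict:
    """Return actors whose event count exceeds the threshold."""
    counts = Counter(e.actor for e in events)
    return {actor: n for actor, n in counts.items() if n > threshold}

events = [
    AccessEvent("alice", "s3://data-lake/users/", "read", "10.0.0.5", "2024-03-01T10:00:00Z"),
    AccessEvent("bob", "s3://data-lake/users/", "read", "10.0.0.9", "2024-03-01T10:01:00Z"),
    AccessEvent("bob", "s3://data-lake/events/", "read", "10.0.0.9", "2024-03-01T10:02:00Z"),
    AccessEvent("bob", "s3://data-lake/events/", "read", "10.0.0.9", "2024-03-01T10:03:00Z"),
]

print(to_log_line(events[0]))
print(heavy_readers(events, threshold=2))  # bob has three events, above the threshold
```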

## 6. Security checklist for pipelines

```python
checklist = '''
☑ IAM with least privilege and MFA
☑ Encryption at-rest (KMS) and in-transit (TLS)
☑ PII masking in logs and dev datasets
☑ Automatic rotation of secrets (API keys, DB passwords)
☑ Auditing enabled (CloudTrail, centralized logs)
☑ Container vulnerability scanning (Trivy, Clair)
☑ Network segmentation (VPCs, private subnets)
☑ Incident response plan documented and tested
'''
print(checklist)
```