# 1. GDPR overview
Focus on lawful processing and the key principles with concrete examples.

In [1]:
# Imports
from datetime import datetime, timedelta
import hashlib
import secrets
from pprint import pprint

## 1.1 Learning goals
- Identify key GDPR actors (data subject, controller, processor) and records (RoPA, DPIA).
- Know the six lawful bases and valid consent criteria.
- Apply the seven principles to simple processing steps.
- Identify data subject rights and response timelines.
- Practice data minimization and purpose limitation in code.

Use the prompts to reason about lawful processing, minimization, retention, and accountability.

## 1.2 Quick reference
- Actors: data subject (individual), controller (decides purposes/means), processor (acts for the controller), DPO (independent advisor).
- Records: RoPA (record of processing activities), DPIA (impact assessment for higher-risk processing), DSR log (data subject requests).
- International transfers: rely on adequacy, SCCs, or BCRs; document the transfer risk assessment.
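
A RoPA is easier to review when each processing activity is stored as structured data. The sketch below is a minimal illustration: the `RopaEntry` class, its field names, and the `missing_fields` helper are assumptions made for this example, not the complete set of items a record of processing must contain.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RopaEntry:
    """Hypothetical, simplified record of one processing activity."""
    activity: str
    purpose: str
    lawful_basis: str
    data_categories: List[str]
    data_subjects: List[str]
    recipients: List[str] = field(default_factory=list)
    transfer_mechanism: str = 'none'  # 'none', 'adequacy', 'SCCs', or 'BCRs'
    retention_days: Optional[int] = None


def missing_fields(entry: RopaEntry) -> List[str]:
    """List the fields left empty so gaps are visible during review."""
    return [name for name, value in asdict(entry).items() if value in (None, '', [])]


entry = RopaEntry(
    activity='order_fulfilment',
    purpose='fulfill_order',
    lawful_basis='contract',
    data_categories=['name', 'shipping address'],
    data_subjects=['customers'],
    recipients=['courier'],
)
print(missing_fields(entry))  # ['retention_days']
```

Flagging empty fields rather than rejecting the entry keeps drafts usable while still showing what remains to be documented.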

## 1.3 Lawful bases (Article 6)
1. Consent (freely given, specific, informed, unambiguous; easy to withdraw).
2. Contract (necessary to perform a contract with the data subject).
3. Legal obligation (required by law).
4. Vital interests (protects someone's life).
5. Public task (public authority or official task).
6. Legitimate interests (documented balancing test; perform an LIA).

Pick one lawful basis per purpose and document why it applies.

## 1.4 Lawful bases — quick reminder
- Consent: freely given, specific, informed, unambiguous; withdrawal as easy as acceptance.
- Contract: necessary to perform or prepare a contract with the data subject.
- Legal obligation: required to comply with a legal duty.
- Vital interests: necessary to protect someone's life.
- Public task: carried out in the public interest or under official authority.
- Legitimate interests: controller's interests outweigh risks without overriding rights; document the assessment.
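
The consent criteria above imply that a system has to remember when consent was given, for which purpose, and when it was withdrawn. The following is a minimal sketch of such a record; `ConsentRecord`, its fields, and the `notice_version` value are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ConsentRecord:
    """Hypothetical consent record for one subject and one purpose."""
    subject_id: str
    purpose: str
    granted_at: datetime
    notice_version: str  # which notice text the subject saw (assumed field)
    withdrawn_at: Optional[datetime] = None

    def withdraw(self, when: datetime) -> None:
        # Withdrawal is a single call, mirroring how easy it was to grant consent.
        if self.withdrawn_at is None:
            self.withdrawn_at = when

    def is_active(self, at: datetime) -> bool:
        # Consent covers processing only between grant and withdrawal.
        if at < self.granted_at:
            return False
        return self.withdrawn_at is None or at < self.withdrawn_at


consent = ConsentRecord('subj-001', 'newsletter', datetime(2024, 1, 10, 9, 0), 'v3')
active_before = consent.is_active(datetime(2024, 2, 1))
consent.withdraw(datetime(2024, 3, 1))
active_after = consent.is_active(datetime(2024, 3, 2))
print(active_before, active_after)  # True False
```

Keeping the withdrawal timestamp instead of deleting the record preserves evidence that earlier processing was covered by consent.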

## 2. Article 5 principles (summary)
- Lawfulness, fairness, transparency
- Purpose limitation
- Data minimization
- Accuracy
- Storage limitation
- Integrity and confidentiality (security)
- Accountability (demonstrate compliance)

The snippets below translate these principles into concrete controls.

### 2.1 Key articles and sources
- Lawfulness, fairness, transparency — Article 5(1)(a): [Eur-Lex](https://eur-lex.europa.eu/eli/reg/2016/679/oj)
- Purpose limitation — Article 5(1)(b): [Eur-Lex](https://eur-lex.europa.eu/eli/reg/2016/679/oj)
- Data minimization — Article 5(1)(c): [Eur-Lex](https://eur-lex.europa.eu/eli/reg/2016/679/oj)
- Accuracy — Article 5(1)(d): [Eur-Lex](https://eur-lex.europa.eu/eli/reg/2016/679/oj)
- Storage limitation — Article 5(1)(e): [Eur-Lex](https://eur-lex.europa.eu/eli/reg/2016/679/oj)
- Integrity and confidentiality — Article 5(1)(f): [Eur-Lex](https://eur-lex.europa.eu/eli/reg/2016/679/oj)
- Accountability — Article 5(2): [Eur-Lex](https://eur-lex.europa.eu/eli/reg/2016/679/oj)

### 2.2 Article 5 in plain language
- Lawfulness, fairness, transparency: process data legally, fairly, and in an understandable way for the data subject.
- Purpose limitation: collect for specific, explicit, legitimate purposes; avoid incompatible reuse.
- Data minimization: keep only data that is adequate, relevant, and necessary.
- Accuracy: keep data accurate and up to date; rectify or erase without delay when needed.
- Storage limitation: keep data identifiable only as long as necessary for the purposes.
- Integrity and confidentiality: protect against unauthorized access, loss, or disclosure.
- Accountability: be able to prove compliance with the principles.

## 3. Applying the principles in code
Each sub-section exercises a principle with multiple cases so you can see the outputs.

### 3.1 Lawfulness, fairness, transparency
Four scenarios combine valid and invalid lawful bases with notices that were or were not sent; the output shows which ones may proceed.

In [2]:
# Ensures a lawful basis is selected and a notice was sent before processing proceeds.
# Lawfulness, fairness, transparency: require a lawful basis and a clear notice before processing
lawful_bases = {"consent", "contract", "legal_obligation", "vital_interests", "public_task", "legitimate_interests"}
def can_process(lawful_basis: str, notice_sent: bool) -> bool:
    return lawful_basis in lawful_bases and notice_sent
scenarios = [
    {'purpose': 'account_signup', 'lawful_basis': 'consent', 'notice_sent': True},
    {'purpose': 'order_shipping', 'lawful_basis': 'contract', 'notice_sent': False},
    {'purpose': 'fraud_monitoring', 'lawful_basis': 'legitimate_interests', 'notice_sent': True},
    {'purpose': 'ad_targeting', 'lawful_basis': 'unknown', 'notice_sent': True},
]
decisions = [
    {**scenario, 'can_process': can_process(scenario['lawful_basis'], scenario['notice_sent'])}
    for scenario in scenarios
]
pprint(decisions)

[{'can_process': True,
  'lawful_basis': 'consent',
  'notice_sent': True,
  'purpose': 'account_signup'},
 {'can_process': False,
  'lawful_basis': 'contract',
  'notice_sent': False,
  'purpose': 'order_shipping'},
 {'can_process': True,
  'lawful_basis': 'legitimate_interests',
  'notice_sent': True,
  'purpose': 'fraud_monitoring'},
 {'can_process': False,
  'lawful_basis': 'unknown',
  'notice_sent': True,
  'purpose': 'ad_targeting'}]


### 3.2 Purpose limitation
Three processing requests: two match the declared purposes, one is rejected with an explicit message.

In [3]:
# Allows data use only for declared purposes; raises an error if a new purpose is not covered.
# Purpose limitation: reject uses outside the documented purpose
allowed_purposes = {'fulfill_order', 'fraud_detection'}
def use_data(purpose: str) -> str:
    if purpose not in allowed_purposes:
        raise ValueError('Purpose not allowed by the privacy notice')
    return f'Using data only for {purpose}'
requests = [
    {'purpose': 'fulfill_order', 'data': 'shipping address'},
    {'purpose': 'fraud_detection', 'data': 'device fingerprint'},
    {'purpose': 'ad_targeting', 'data': 'page views'},
]
decisions = []
for req in requests:
    try:
        status = use_data(req['purpose'])
    except ValueError as err:
        status = f'Rejected: {err}'
    decisions.append({**req, 'decision': status})
pprint(decisions)

[{'data': 'shipping address',
  'decision': 'Using data only for fulfill_order',
  'purpose': 'fulfill_order'},
 {'data': 'device fingerprint',
  'decision': 'Using data only for fraud_detection',
  'purpose': 'fraud_detection'},
 {'data': 'page views',
  'decision': 'Rejected: Purpose not allowed by the privacy notice',
  'purpose': 'ad_targeting'}]
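
Purpose limitation can also be enforced at field level, so that each declared purpose only ever sees the fields it was declared for. The sketch below assumes a hypothetical `PURPOSE_FIELDS` mapping; the field choices are illustrative and would normally come from the privacy notice or the RoPA.

```python
# Hypothetical mapping from declared purpose to the fields it may read.
PURPOSE_FIELDS = {
    'fulfill_order': {'name', 'address'},
    'fraud_detection': {'user_id', 'country'},
}


def view_for_purpose(record: dict, purpose: str) -> dict:
    """Return only the fields declared for this purpose; reject undeclared purposes."""
    if purpose not in PURPOSE_FIELDS:
        raise ValueError(f'Purpose not declared: {purpose}')
    allowed = PURPOSE_FIELDS[purpose]
    return {key: value for key, value in record.items() if key in allowed}


record = {'user_id': 'u-101', 'name': 'Ana', 'address': '12 Rue Bleue', 'country': 'FR', 'age': 34}
print(view_for_purpose(record, 'fulfill_order'))
# {'name': 'Ana', 'address': '12 Rue Bleue'}
```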


### 3.3 Data minimization
Three user profiles with extra fields; the analytics view keeps only the fields needed.


In [4]:
# Builds a reduced view of a record that keeps only fields needed for the analytics purpose.
# Data minimization: keep only fields needed for analytics
users = [
    {'user_id': 'u-101', 'name': 'Ana', 'email': 'ana@example.com', 'age': 34, 'country': 'FR', 'address': '12 Rue Bleue', 'marketing_opt_in': True},
    {'user_id': 'u-102', 'name': 'Lee', 'email': 'lee@example.com', 'age': 28, 'country': 'DE', 'address': '9 Hauptstrasse', 'marketing_opt_in': False},
    {'user_id': 'u-103', 'name': 'Sam', 'email': 'sam@example.com', 'age': 41, 'country': 'ES', 'address': 'Calle Verde 3', 'marketing_opt_in': True},
]
needed = {'user_id', 'country', 'age'}
analytics_view = [{k: v for k, v in user.items() if k in needed} for user in users]
pprint(analytics_view)


[{'age': 34, 'country': 'FR', 'user_id': 'u-101'},
 {'age': 28, 'country': 'DE', 'user_id': 'u-102'},
 {'age': 41, 'country': 'ES', 'user_id': 'u-103'}]


### 3.4 Accuracy
Three incoming updates: the first is newer and applied, the second is older and ignored, and the third is newer again and applied.


In [5]:
# Updates a record only when the incoming data is newer to keep information accurate.
# Accuracy: accept updates only when the incoming record is newer
current = {'email': 'ana@old.com', 'updated_at': datetime.fromisoformat('2024-01-01T12:00:00')}
incoming_updates = [
    {'email': 'ana@example.com', 'updated_at': datetime.fromisoformat('2024-02-01T09:00:00')},
    {'email': 'ana@typo.com', 'updated_at': datetime.fromisoformat('2023-12-31T09:00:00')},
    {'email': 'ana@work.com', 'updated_at': datetime.fromisoformat('2024-03-15T18:30:00')},
]
def update_if_newer(current_record, new_record):
    if new_record['updated_at'] > current_record['updated_at']:
        current_record.update(new_record)
    return current_record
history = []
for update in incoming_updates:
    before = current.copy()
    update_if_newer(current, update)
    history.append({
        'incoming_at': update['updated_at'].isoformat(),
        'incoming_email': update['email'],
        'updated': current != before,
        'current_email': current['email'],
    })
pprint(history)


[{'current_email': 'ana@example.com',
  'incoming_at': '2024-02-01T09:00:00',
  'incoming_email': 'ana@example.com',
  'updated': True},
 {'current_email': 'ana@example.com',
  'incoming_at': '2023-12-31T09:00:00',
  'incoming_email': 'ana@typo.com',
  'updated': False},
 {'current_email': 'ana@work.com',
  'incoming_at': '2024-03-15T18:30:00',
  'incoming_email': 'ana@work.com',
  'updated': True}]


### 3.5 Storage limitation
Five records of different ages; the report shows which ones are kept under a 30-day retention window. The record that is exactly 30 days old has reached the limit and is dropped.


In [6]:
# Storage limitation: apply a retention window and drop rows that have reached the cutoff.
now = datetime.utcnow()
rows = [
    {'id': 1, 'created_at': now - timedelta(days=10), 'source': 'consent_form'},
    {'id': 2, 'created_at': now - timedelta(days=45), 'source': 'support_ticket'},
    {'id': 3, 'created_at': now - timedelta(days=30), 'source': 'checkout'},
    {'id': 4, 'created_at': now - timedelta(days=75), 'source': 'legacy_migration'},
    {'id': 5, 'created_at': now - timedelta(days=5), 'source': 'signup'},
]
def apply_retention(data, reference_time, days: int = 30):
    # Reuse one reference time for the cutoff and the report; calling utcnow() again
    # here would shift the cutoff by a few microseconds and blur the 30-day boundary.
    cutoff = reference_time - timedelta(days=days)
    # Strict comparison: a row exactly `days` old has reached the limit and is dropped.
    return [row for row in data if row['created_at'] > cutoff]
kept = apply_retention(rows, now, days=30)
kept_ids = {row['id'] for row in kept}
report = [
    {'id': row['id'], 'age_days': (now - row['created_at']).days, 'kept': row['id'] in kept_ids}
    for row in rows
]
pprint(report)


[{'age_days': 10, 'id': 1, 'kept': True},
 {'age_days': 45, 'id': 2, 'kept': False},
 {'age_days': 30, 'id': 3, 'kept': False},
 {'age_days': 75, 'id': 4, 'kept': False},
 {'age_days': 5, 'id': 5, 'kept': True}]
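
A single 30-day window is rarely enough in practice, because the retention period follows from the purpose. The sketch below varies the window by record source; the periods in `RETENTION_BY_SOURCE` and the default are hypothetical values for illustration and would come from a documented retention schedule.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods per record source.
RETENTION_BY_SOURCE = {
    'support_ticket': timedelta(days=90),
    'checkout': timedelta(days=365),
}
DEFAULT_RETENTION = timedelta(days=30)


def split_by_retention(rows, reference_time):
    """Return (kept_ids, expired_ids) using the retention period of each row's source."""
    kept, expired = [], []
    for row in rows:
        limit = RETENTION_BY_SOURCE.get(row['source'], DEFAULT_RETENTION)
        age = reference_time - row['created_at']
        (kept if age < limit else expired).append(row['id'])
    return kept, expired


reference = datetime(2024, 6, 1)
sample = [
    {'id': 1, 'created_at': reference - timedelta(days=45), 'source': 'support_ticket'},
    {'id': 2, 'created_at': reference - timedelta(days=45), 'source': 'signup'},
    {'id': 3, 'created_at': reference - timedelta(days=400), 'source': 'checkout'},
]
print(split_by_retention(sample, reference))  # ([1], [2, 3])
```

Passing the reference time explicitly keeps the result reproducible in tests and audits.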


### 3.6 Integrity and confidentiality
Pseudonymize several emails and test raw access by role to illustrate access control.


In [7]:
# Integrity and confidentiality: pseudonymize identifiers with a salted hash and gate raw access by role.
# One secret salt per session keeps the pseudonym stable for the same input;
# a fresh salt on every call would produce unlinkable random tokens instead.
PSEUDONYM_SALT = secrets.token_hex(16)
def pseudonymize(value: str) -> str:
    return hashlib.sha256((PSEUDONYM_SALT + value).encode()).hexdigest()[:16]
def view_raw(email: str, role: str) -> str:
    if role not in {'dpo', 'security_admin'}:
        raise PermissionError('Raw access denied')
    return email
emails = ['ana@example.com', 'lee@example.com', 'sam@example.com']
pseudonyms = {email: pseudonymize(email) for email in emails}
access_checks = []
for role in ['analyst', 'security_admin']:
    try:
        access_checks.append({'role': role, 'raw_view': view_raw(emails[0], role)})
    except PermissionError as err:
        access_checks.append({'role': role, 'raw_view': str(err)})
pprint({'pseudonyms': pseudonyms, 'access_checks': access_checks})


{'access_checks': [{'raw_view': 'Raw access denied', 'role': 'analyst'},
                   {'raw_view': 'ana@example.com', 'role': 'security_admin'}],
 'pseudonyms': {'ana@example.com': 'f70553670e7b4832',
                'lee@example.com': '7aa61b3f4296d84b',
                'sam@example.com': '6997304d54b32e54'}}


### 3.7 Accountability
Three actions are logged with lawful basis and subject id to keep usable audit evidence.


In [8]:
# Adds an auditable log entry with timestamp, action, lawful basis, and subject id.
# Accountability: keep an auditable log of processing actions
audit_log = []
def log_action(action: str, lawful_basis: str, subject_id: str):
    audit_log.append({
        'timestamp': datetime.utcnow().isoformat() + 'Z',
        'action': action,
        'lawful_basis': lawful_basis,
        'subject_id': subject_id,
    })
events = [
    ('respond_access_request', 'consent', 'subj-001'),
    ('delete_expired_records', 'legal_obligation', 'batch-2024-02'),
    ('correct_email', 'contract', 'subj-002'),
]
for action, basis, subject in events:
    log_action(action, basis, subject_id=subject)
pprint(audit_log)


[{'action': 'respond_access_request',
  'lawful_basis': 'consent',
  'subject_id': 'subj-001',
  'timestamp': '2025-12-01T08:45:33.222073Z'},
 {'action': 'delete_expired_records',
  'lawful_basis': 'legal_obligation',
  'subject_id': 'batch-2024-02',
  'timestamp': '2025-12-01T08:45:33.222077Z'},
 {'action': 'correct_email',
  'lawful_basis': 'contract',
  'subject_id': 'subj-002',
  'timestamp': '2025-12-01T08:45:33.222078Z'}]
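
An audit log is stronger evidence when later edits can be detected. One common way to get that property is to chain entries by hash, so each entry commits to the one before it. The sketch below is an illustration of that idea, not part of the log above; `append_chained`, `verify_chain`, and `GENESIS_HASH` are hypothetical names, and entries are assumed to be JSON-serializable.

```python
import hashlib
import json

GENESIS_HASH = '0' * 64  # placeholder hash for the first entry


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same content always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_chained(log: list, entry: dict) -> None:
    """Append an entry that records the hash of the previous entry."""
    prev_hash = log[-1]['entry_hash'] if log else GENESIS_HASH
    body = {**entry, 'prev_hash': prev_hash}
    log.append({**body, 'entry_hash': _digest(body)})


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for item in log:
        body = {k: v for k, v in item.items() if k != 'entry_hash'}
        if body['prev_hash'] != prev_hash or _digest(body) != item['entry_hash']:
            return False
        prev_hash = item['entry_hash']
    return True


chained_log = []
append_chained(chained_log, {'action': 'correct_email', 'lawful_basis': 'contract', 'subject_id': 'subj-002'})
append_chained(chained_log, {'action': 'respond_access_request', 'lawful_basis': 'consent', 'subject_id': 'subj-001'})
intact = verify_chain(chained_log)
chained_log[0]['lawful_basis'] = 'consent'  # simulate tampering with an old entry
print(intact, verify_chain(chained_log))  # True False
```

A hash chain detects changes but does not prevent them; storing the latest hash somewhere the log writer cannot modify is what makes the check meaningful.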


## 4. Data subject rights
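The rights covered here are access, rectification, erasure, restriction, portability, objection, and human review of automated decisions. Respond within one month of receiving the request; Article 12(3) allows an extension of up to two further months for complex or numerous requests, provided the data subject is informed within the first month. Tie each dataset to an owner, a contact path, and the fields you can produce for access or portability.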
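
The response timeline can be computed when a request is logged. The sketch below is a simplified calendar calculation using only the standard library: it adds whole months and clamps to the last day of shorter months, and it ignores weekends and public holidays. `add_months` and `response_deadlines` are hypothetical helper names.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of shorter months."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def response_deadlines(received: date) -> dict:
    """Standard deadline (one month) and latest deadline with the two-month extension."""
    return {
        'standard_deadline': add_months(received, 1),
        'extended_deadline': add_months(received, 3),
    }


print(response_deadlines(date(2024, 1, 31)))
# {'standard_deadline': datetime.date(2024, 2, 29), 'extended_deadline': datetime.date(2024, 4, 30)}
```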


### 4.1 Mini-case: is this lawful?
- Identify controller and processor roles.
- Select and justify the lawful basis.
- Check for special-category data (Article 9) and applicable conditions.
- Decide if a DPIA is needed (systematic monitoring, large-scale sensitive data, or new tech).
- Note any limits to rights (e.g., erasure blocked by a legal obligation).

Document your reasoning in the next cell.


In [9]:
# Capture roles, lawful basis, and rights limits for the mini-case.
analysis = {
    'controller': '',
    'processor': '',
    'lawful_basis': '',
    'special_category': False,
    'dpia_needed': 'maybe',
    'rights_limits': []
}
analysis


{'controller': '',
 'processor': '',
 'lawful_basis': '',
 'special_category': False,
 'dpia_needed': 'maybe',
 'rights_limits': []}
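
The DPIA question in the mini-case can be turned into a small screening helper that records which trigger applied. This is a sketch of a screening aid, not a legal determination; `dpia_screening` and its parameter names are hypothetical and only cover the three triggers listed in the checklist.

```python
def dpia_screening(systematic_monitoring: bool,
                   large_scale_sensitive_data: bool,
                   new_technology: bool) -> dict:
    """Flag a DPIA as needed when any listed trigger applies, and say which ones."""
    flags = {
        'systematic_monitoring': systematic_monitoring,
        'large_scale_sensitive_data': large_scale_sensitive_data,
        'new_technology': new_technology,
    }
    triggers = [name for name, hit in flags.items() if hit]
    return {'dpia_needed': bool(triggers), 'triggers': triggers}


print(dpia_screening(systematic_monitoring=True,
                     large_scale_sensitive_data=False,
                     new_technology=True))
# {'dpia_needed': True, 'triggers': ['systematic_monitoring', 'new_technology']}
```

Returning the triggers alongside the decision gives a ready-made justification to store with the analysis.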

## 5. Worked example: pseudonymization + minimization
The next cell shows a simple pattern you can reuse: derive a stable subject identifier from an email address, then keep only the fields required for the stated analytics purpose. In real systems, prefer keyed hashing (HMAC) or a dedicated pseudonymization service: an unkeyed hash of an email address can be reversed by hashing candidate addresses, and it produces the same identifier in every dataset, which makes records linkable across them.

In [10]:
# Demonstrates pseudonymizing emails and keeping only the minimal fields for age analytics.
raw_users = [
    {'name': 'Ana', 'email': 'ana@example.com', 'age': 34, 'city': 'Paris', 'signup_date': datetime(2024, 1, 15), 'marketing_opt_in': True},
    {'name': 'Lee', 'email': 'lee@example.com', 'age': 28, 'city': 'Berlin', 'signup_date': datetime(2024, 2, 3), 'marketing_opt_in': False},
    {'name': 'Sam', 'email': 'sam@example.com', 'age': 41, 'city': 'Lyon', 'signup_date': datetime(2023, 12, 20), 'marketing_opt_in': True},
    {'name': 'Maya', 'email': 'maya@example.com', 'age': 22, 'city': 'Madrid', 'signup_date': datetime(2024, 3, 8), 'marketing_opt_in': False},
]
# Suppose the purpose is age-distribution analytics (no need for direct identifiers).
def pseudonymize_email(email: str) -> str:
    return hashlib.sha256(email.encode()).hexdigest()[:12]
processed = [
    {
        'subject_id': pseudonymize_email(u['email']),
        'age': u['age'],
        # city retained only if needed for segmentation; drop otherwise
    }
    for u in raw_users
]
pprint(processed)
# Compute the statistics from the minimized view; the analytics purpose never needs the raw records.
avg_age = sum(u['age'] for u in processed) / len(processed)
summary = {
    'count': len(processed),
    'min_age': min(u['age'] for u in processed),
    'max_age': max(u['age'] for u in processed),
    'avg_age': round(avg_age, 1),
}
pprint(summary)


[{'age': 34, 'subject_id': '8e43ca377012'},
 {'age': 28, 'subject_id': '556740ed46f0'},
 {'age': 41, 'subject_id': 'cd25a6171969'},
 {'age': 22, 'subject_id': 'a813d5642c0c'}]
{'avg_age': 31.2, 'count': 4, 'max_age': 41, 'min_age': 22}
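
The keyed-hashing recommendation from the introduction to this example can be sketched with the standard library's `hmac` module. The function name `pseudonymize_keyed`, the 16-character truncation, and the lowercase normalization of the email are assumptions for illustration; in a real system the key would come from a secrets manager rather than being generated in the notebook.

```python
import hashlib
import hmac
import secrets


def pseudonymize_keyed(value: str, key: bytes) -> str:
    """Keyed pseudonym: stable for one key, not reproducible without it."""
    normalized = value.strip().lower()  # assumed normalization so variants map together
    return hmac.new(key, normalized.encode(), hashlib.sha256).hexdigest()[:16]


# Hypothetical per-dataset keys; generated here only to keep the example self-contained.
analytics_key = secrets.token_bytes(32)
support_key = secrets.token_bytes(32)

same_dataset = (pseudonymize_keyed('ana@example.com', analytics_key)
                == pseudonymize_keyed(' Ana@Example.com', analytics_key))
across_datasets = (pseudonymize_keyed('ana@example.com', analytics_key)
                   == pseudonymize_keyed('ana@example.com', support_key))
print(same_dataset, across_datasets)  # True False
```

Using a different key per dataset keeps identifiers stable inside each dataset while preventing a join between them by anyone who lacks both keys.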


## 6. Guided exercises
1) Rewrite the code to remove the city and keep only the minimum fields for an age histogram.
2) Add a retention control: remove entries older than 30 days (simulate with a `timestamp` field).
3) Create a checklist to answer an access request for this dataset (which fields, timeline, contact point).

Capture your answers in new cells to make review easy.


### 6.1 Solution — Minimization: drop city
Keep only the pseudonymized identifier and age because an age histogram does not need the city or name. Keep city only if you truly segment by geography; otherwise drop it to reduce exposure.


In [11]:
# Solution exercise 1: minimal view for an age histogram
raw_users_solution = [
    {'name': 'Ana', 'email': 'ana@example.com', 'age': 34, 'city': 'Paris'},
    {'name': 'Lee', 'email': 'lee@example.com', 'age': 28, 'city': 'Berlin'},
    {'name': 'Sam', 'email': 'sam@example.com', 'age': 41, 'city': 'Lyon'},
    {'name': 'Maya', 'email': 'maya@example.com', 'age': 22, 'city': 'Madrid'},
    {'name': 'Omar', 'email': 'omar@example.com', 'age': 30, 'city': 'Brussels'},
]
def pseudonymize_email(email: str) -> str:
    return hashlib.sha256(email.encode()).hexdigest()[:12]
minimal_age_view = [
    {
        'subject_id': pseudonymize_email(u['email']),
        'age': u['age'],
    }
    for u in raw_users_solution
]
pprint(minimal_age_view)


[{'age': 34, 'subject_id': '8e43ca377012'},
 {'age': 28, 'subject_id': '556740ed46f0'},
 {'age': 41, 'subject_id': 'cd25a6171969'},
 {'age': 22, 'subject_id': 'a813d5642c0c'},
 {'age': 30, 'subject_id': '0de61b111046'}]


### 6.2 Solution — Retention control (30 days)
Add a timestamp to each record, then cut off at 30 days. The report shows what is kept or dropped for easy auditing.


In [12]:
# Solution exercise 2: apply a 30-day retention window
now = datetime.utcnow()
raw_users_with_ts = [
    {'email': 'ana@example.com', 'age': 34, 'last_event_at': now - timedelta(days=5)},
    {'email': 'lee@example.com', 'age': 28, 'last_event_at': now - timedelta(days=12)},
    {'email': 'sam@example.com', 'age': 41, 'last_event_at': now - timedelta(days=35)},
    {'email': 'maya@example.com', 'age': 22, 'last_event_at': now - timedelta(days=60)},
    {'email': 'omar@example.com', 'age': 30, 'last_event_at': now - timedelta(days=1)},
]
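def apply_retention_minimal(records, reference_time, days: int = 30):
    # Derive the cutoff from the same reference time used to build the records,
    # so the result does not depend on the moment the function happens to run.
    cutoff = reference_time - timedelta(days=days)
    kept, dropped = [], []
    for record in records:
        # Strict comparison: a record exactly `days` old has reached the limit.
        if record['last_event_at'] > cutoff:
            kept.append(record)
        else:
            dropped.append(record)
    return kept, dropped
kept, dropped = apply_retention_minimal(raw_users_with_ts, now, days=30)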
report = [
    {
        'email': r['email'],
        'age_days': (now - r['last_event_at']).days,
        'kept': r in kept,
    }
    for r in raw_users_with_ts
]
pprint({'kept_records': kept, 'dropped_records': dropped, 'report': report})


{'dropped_records': [{'age': 41,
                      'email': 'sam@example.com',
                      'last_event_at': datetime.datetime(2025, 10, 27, 8, 45, 33, 244719)},
                     {'age': 22,
                      'email': 'maya@example.com',
                      'last_event_at': datetime.datetime(2025, 10, 2, 8, 45, 33, 244719)}],
 'kept_records': [{'age': 34,
                   'email': 'ana@example.com',
                   'last_event_at': datetime.datetime(2025, 11, 26, 8, 45, 33, 244719)},
                  {'age': 28,
                   'email': 'lee@example.com',
                   'last_event_at': datetime.datetime(2025, 11, 19, 8, 45, 33, 244719)},
                  {'age': 30,
                   'email': 'omar@example.com',
                   'last_event_at': datetime.datetime(2025, 11, 30, 8, 45, 33, 244719)}],
 'report': [{'age_days': 5, 'email': 'ana@example.com', 'kept': True},
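            {'age_days': 12, 'email': 'lee@example.com', 'kept': True},
            {'age_days': 35, 'email': 'sam@example.com', 'kept': False},
            {'age_days': 60, 'email': 'maya@example.com', 'kept': False},
            {'age_days': 1, 'email': 'omar@example.com', 'kept': True}]}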

### 6.3 Solution — Access request checklist
- **Contact point**: dedicated privacy inbox (e.g., privacy@example.com) or DPO; log receipt of the request.
- **Identity verification**: request proportionate proof (avoid excessive collection) and note validation date.
- **Data scope**: for this dataset, return the fields actually held about the requester: `subject_id` and `age`, plus city only if it was retained for segmentation; exclude salted hashes and internal debug logs.
- **Timeline**: respond within 1 month; if complex, notify an extension (up to 2 additional months) with justification.
- **Format**: provide in readable CSV/JSON; explain purpose and lawful bases linked to the processing.
- **Logging**: record request date, scope provided, exceptions invoked, and closure date.
- **Follow-up rights**: remind about rectification or erasure (subject to retention duties) and the right to complain to the authority.
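
The scope and format items of the checklist can be combined into a small export step. The sketch below is a minimal illustration: `build_access_export` is a hypothetical helper, and the purpose and lawful basis passed in are placeholder values, not a statement of which basis applies to this dataset.

```python
import json
from datetime import date


def build_access_export(records, subject_id, purposes, generated_on):
    """Collect one subject's records and the processing context into readable JSON."""
    subject_records = [r for r in records if r['subject_id'] == subject_id]
    package = {
        'subject_id': subject_id,
        'generated_on': generated_on.isoformat(),
        'purposes': purposes,
        'records': subject_records,
    }
    return json.dumps(package, indent=2, sort_keys=True)


dataset = [
    {'subject_id': '8e43ca377012', 'age': 34},
    {'subject_id': '556740ed46f0', 'age': 28},
]
export = build_access_export(
    dataset,
    subject_id='8e43ca377012',
    purposes=[{'purpose': 'age analytics', 'lawful_basis': 'legitimate_interests'}],  # placeholder
    generated_on=date(2025, 12, 1),
)
print(export)
```

Filtering by `subject_id` before serializing keeps other people's records out of the response, which matters as much as completeness.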
