# Day 12: Data Ethics & Privacy - Starter Notebook

Welcome to Day 12! This notebook covers data ethics and privacy considerations.

## Learning Objectives
- Understand the importance of ethics and privacy in data analysis
- Identify personally identifiable information (PII) and other sensitive data
- Learn about data privacy regulations (GDPR, CCPA)
- Recognize and mitigate bias in data

## Instructions
Complete each exercise section below. Refer to `docs/day_12_data_ethics_privacy.md` for detailed guidance.

---
## Setup

In [None]:
# Import required libraries
import pandas as pd
import numpy as np

print("Libraries imported successfully!")

---
## Exercise 1: Identifying PII

**Deliverables:**
1. Review a dataset and identify columns containing PII.

**Success Criteria:**
- All PII columns are correctly identified

**Dataset:** Use the sample privacy dataset at `../data/sample_privacy.csv`

### What is PII (Personally Identifiable Information)?

PII is any data that can be used to identify an individual, either directly or in combination with other data:

**Direct Identifiers:**
- Full name
- Email address
- Phone number
- Social Security Number
- Driver's license number
- Passport number
- Credit card number
- Home address

**Indirect Identifiers (Quasi-identifiers):**
- Date of birth
- ZIP code
- Gender
- Race/ethnicity
- Occupation
- Education level
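
To see why quasi-identifiers matter, one rough check is to count how many rows share each combination of their values. The sketch below uses a small made-up table (the column names `zip_code`, `birth_year`, and `gender` are assumptions, not the columns of `sample_privacy.csv`) and reports the smallest group size, which is the "k" in k-anonymity for the chosen columns.

```python
import pandas as pd

# Hypothetical toy records; the column names are illustrative only.
records = pd.DataFrame({
    "zip_code": ["02139", "02139", "02139", "94105", "94105"],
    "birth_year": [1985, 1985, 1990, 1972, 1972],
    "gender": ["F", "F", "M", "M", "M"],
})

quasi_identifiers = ["zip_code", "birth_year", "gender"]

# Count how many rows share each combination of quasi-identifier values.
group_sizes = records.groupby(quasi_identifiers).size()

# The smallest group size is the "k" in k-anonymity for these columns.
k = int(group_sizes.min())

# A combination that occurs only once points to a single person.
unique_combinations = group_sizes[group_sizes == 1]

print(f"k = {k}")
print(unique_combinations)
```

A result of `k = 1` means at least one person is unique on these columns alone, even though no direct identifier is present.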

In [None]:
# Load the sample privacy dataset
df = pd.read_csv('../data/sample_privacy.csv')
df.head()

In [None]:
# TODO: Identify PII columns in the dataset
# List all columns and classify each as:
# - Direct PII
# - Indirect PII (quasi-identifier)
# - Non-PII

print("Columns in dataset:")
print(df.columns.tolist())

In [None]:
# TODO: Document your PII classification
pii_classification = {
    'direct_pii': [],      # List columns here
    'indirect_pii': [],    # List columns here
    'non_pii': []          # List columns here
}

# Print your classification
for category, columns in pii_classification.items():
    print(f"\n{category.upper()}:")
    for col in columns:
        print(f"  - {col}")
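
If a dataset has many columns, a name-based heuristic can suggest a first-pass classification to review by hand. The sketch below is only an assumption about how such a helper might look: the keyword lists and the function `suggest_pii_category` are hypothetical, and the example column names are not taken from `sample_privacy.csv`.

```python
# Hypothetical keyword lists; extend them to match your own data.
DIRECT_KEYWORDS = ("name", "email", "phone", "ssn", "passport",
                   "license", "credit_card", "address")
INDIRECT_KEYWORDS = ("birth", "dob", "zip", "postal", "gender",
                     "race", "ethnicity", "occupation", "education")


def suggest_pii_category(column_name):
    """Return a suggested category for a column based on its name alone."""
    lowered = column_name.lower()
    if any(keyword in lowered for keyword in DIRECT_KEYWORDS):
        return "direct_pii"
    if any(keyword in lowered for keyword in INDIRECT_KEYWORDS):
        return "indirect_pii"
    return "non_pii"


# Hypothetical column names used only to demonstrate the helper.
example_columns = ["full_name", "email", "zip_code",
                   "date_of_birth", "purchase_amount"]
suggestions = {col: suggest_pii_category(col) for col in example_columns}

for col, category in suggestions.items():
    print(f"{col}: {category}")
```

Substring matching produces false positives (for example, `filename` matches `name`) and misses PII stored under uninformative column names, so treat the output as a starting point and inspect the actual values before finalizing `pii_classification`.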

---
## Exercise 2: Data Privacy Regulations

**Deliverables:**
1. Summarize key points of GDPR or CCPA relevant to data analysts.

**Success Criteria:**
- Summary covers main requirements and analyst responsibilities

### GDPR Key Principles

1. **Lawfulness, fairness, and transparency**
2. **Purpose limitation** - collect for specified purposes only
3. **Data minimization** - collect only what's necessary
4. **Accuracy** - keep data accurate and up to date
5. **Storage limitation** - don't keep data longer than needed
6. **Integrity and confidentiality** - ensure appropriate security
7. **Accountability** - demonstrate compliance
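
For an analyst, data minimization and confidentiality often translate into two habits: keep only the columns an analysis needs, and replace direct identifiers with tokens. The sketch below illustrates both on a made-up table. The column names, the `SALT` value, and the `pseudonymize` helper are assumptions for illustration, and a salted hash is only one possible pseudonymization approach.

```python
import hashlib

import pandas as pd

# Hypothetical customer records; column names are illustrative only.
customers = pd.DataFrame({
    "email": ["ana@example.com", "ben@example.com"],
    "full_name": ["Ana Ruiz", "Ben Cho"],
    "phone": ["555-0100", "555-0101"],
    "order_total": [42.50, 17.25],
})

# Assumption: a real salt would be a secret stored outside the notebook.
SALT = "replace-with-a-secret-value"


def pseudonymize(value, salt=SALT):
    """Return a salted SHA-256 hex digest that stands in for the raw value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Data minimization: keep only the columns the analysis needs.
analysis_df = customers[["email", "order_total"]].copy()

# Pseudonymization: replace the direct identifier with a stable token.
analysis_df["customer_token"] = analysis_df["email"].map(pseudonymize)
analysis_df = analysis_df.drop(columns=["email"])

print(analysis_df)
```

Pseudonymized data is still personal data under GDPR, because anyone holding the salt and the original values can recompute the tokens. Treat this as a way to reduce exposure, not as anonymization.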

### CCPA Key Rights

1. **Right to know** - what personal info is collected
2. **Right to delete** - request deletion of personal info
3. **Right to opt out** - of the sale of personal information
4. **Right to non-discrimination** - for exercising privacy rights
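
As a rough illustration of what the right to delete can mean for an analysis dataset, the sketch below removes the rows for customers who have requested deletion. The table, the `customer_id` column, and the `apply_deletion_requests` helper are hypothetical.

```python
import pandas as pd

# Hypothetical table keyed by a customer identifier.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "email": ["ana@example.com", "ben@example.com", "cy@example.com"],
    "order_total": [42.50, 17.25, 8.00],
})


def apply_deletion_requests(df, ids_to_delete, id_column="customer_id"):
    """Return a copy of df without the rows whose id is in ids_to_delete."""
    keep_mask = ~df[id_column].isin(ids_to_delete)
    return df[keep_mask].reset_index(drop=True)


# Customer 102 has asked for their personal information to be deleted.
remaining = apply_deletion_requests(customers, [102])

print(remaining)
```

Filtering one DataFrame is only part of the job. In practice a deletion request would likely also need to reach saved extracts, derived tables, and any other copies of the data.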

### TODO: Your Summary

*Write your summary of key privacy regulation points relevant to data analysts here:*

1. **Data Collection:**
   - ...

2. **Data Storage:**
   - ...

3. **Data Processing:**
   - ...

4. **User Rights:**
   - ...

5. **Analyst Responsibilities:**
   - ...

---
## Exercise 3: Bias and Fairness

**Deliverables:**
1. Analyze a dataset for potential sources of bias and suggest mitigation strategies.

**Success Criteria:**
- Biases are identified and solutions proposed

### Types of Bias in Data

1. **Selection Bias** - Non-representative sampling
2. **Measurement Bias** - Systematic errors in data collection
3. **Confirmation Bias** - Interpreting data in a way that confirms existing beliefs
4. **Historical Bias** - Past discrimination reflected in data
5. **Representation Bias** - Some groups are underrepresented in the data
6. **Aggregation Bias** - Combining groups hides patterns that differ between them
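
Two quick checks help surface representation bias and differences between groups: the share of rows each group contributes, and the outcome rate within each group. The sketch below uses a small made-up table (the `group` and `approved` columns are assumptions, not the Titanic data) to show the pattern with `value_counts()` and `groupby()`.

```python
import pandas as pd

# Hypothetical toy data; not the Titanic dataset.
applicants = pd.DataFrame({
    "group": ["A", "A", "A", "A", "A", "A", "B", "B"],
    "approved": [1, 1, 0, 1, 1, 0, 0, 1],
})

# Representation: share of rows that belong to each group.
representation = applicants["group"].value_counts(normalize=True)

# Outcome rate: mean of the binary outcome within each group.
approval_rate = applicants.groupby("group")["approved"].mean()

# Combine both views; pandas aligns the two Series on the group label.
summary = pd.DataFrame({
    "share_of_rows": representation,
    "approval_rate": approval_rate,
})

print(summary)
```

Here group B contributes only a quarter of the rows, so its approval rate rests on two observations. A gap between groups in a table like this is a prompt to investigate, not proof of bias on its own.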

In [None]:
# Load the Titanic dataset for bias analysis
df_titanic = pd.read_csv('../data/titanic.csv')
df_titanic.head()

In [None]:
# TODO: Analyze the dataset for potential biases
# Consider: representation of different groups, survival rates by demographics


In [None]:
# TODO: Check representation across different groups
# Hint: Use value_counts() and groupby()


In [None]:
# TODO: Identify potential sources of bias
# Document your findings below


### TODO: Bias Mitigation Strategies

*Based on your analysis, propose strategies to mitigate identified biases:*

1. **Identified Bias:**
   - Description: ...
   - Mitigation: ...

2. **Identified Bias:**
   - Description: ...
   - Mitigation: ...

3. **General Best Practices:**
   - ...

---
## Validation Checklist

Before completing the bootcamp, verify that you can:
- [ ] Identify PII in datasets
- [ ] Explain key privacy regulations
- [ ] Recognize and address bias in data

---

## Congratulations!

You have completed the Data Analysis path of the Full Stack Development Bootcamp!