**1. Data Preparation**

**Data Cleaning:**
- **Remove PII Columns:** Columns like `Name`, `SSN`, and `DOB` are direct identifiers and are removed to protect privacy.
- **Handle Missing Values:**
  - Numerical features (e.g., `Income`, `Heart Rate`) are imputed with the median.
  - Categorical features (e.g., `Marital Status`) are imputed with the mode.
- **Encode Categorical Variables:** Use one-hot encoding for `Country`, `Marital Status`, `House Status`, and `Blood Type`. Label-encode the target variable `Tumor Condition`; `LabelEncoder` assigns codes in sorted class order, so Abnormal=0 and Normal=1.
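
Because the integer assigned to each target class determines which class `f1_score` treats as positive (it defaults to label 1), it can help to check the encoding explicitly. The following is a minimal sketch, assuming the target holds the strings `Normal` and `Abnormal`; the toy labels and `explicit_map` are illustrative and not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'Tumor Condition' column (assumption).
labels = pd.Series(["Normal", "Abnormal", "Normal", "Abnormal"])

# LabelEncoder sorts the classes, so the code is the position in classes_.
le = LabelEncoder()
codes = le.fit_transform(labels)
label_to_code = {str(c): int(i) for i, c in enumerate(le.classes_)}
print(label_to_code)

# An explicit mapping pins down which class is the positive one (label 1).
explicit_map = {"Normal": 0, "Abnormal": 1}  # hypothetical choice
y_explicit = labels.map(explicit_map).to_numpy()
print(y_explicit)
```

With an explicit mapping, metrics such as precision, recall, and F1 are reported for the class chosen as 1 rather than for whichever class sorts last.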

**2. Apply Differential Privacy**
- **Add Laplace Noise:** For numerical features (`Income`, `Heart Rate`, `Oxygen Level`), add noise using the Laplace mechanism. The noise scale is b = Δ/ε, where Δ is the sensitivity (here, the observed range of each feature) and ε is the privacy budget:


In [None]:
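def add_laplace_noise(data, epsilon=1.0):
    """Add Laplace noise with scale = sensitivity / epsilon to each value."""
    # Sensitivity is the observed range (per column when data is a DataFrame).
    # Note: deriving it from the data itself weakens the formal DP guarantee.
    sensitivity = data.max() - data.min()
    scale = sensitivity / epsilon
    noise = np.random.laplace(0, scale, data.shape)
    return data + noise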
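
The function above derives the sensitivity from the data's own minimum and maximum. A common alternative is to clip each feature to bounds fixed in advance, so that the sensitivity does not depend on the records being protected. The sketch below assumes hypothetical bounds of 40 to 200 for heart rate; `add_laplace_noise_bounded`, `lower`, and `upper` are illustrative names and not part of the notebook.

```python
import numpy as np

def add_laplace_noise_bounded(values, lower, upper, epsilon=1.0, rng=None):
    """Clip values to [lower, upper], then add Laplace noise.

    The sensitivity is the width of the fixed interval, so it is
    independent of the data. The bounds are assumed to be chosen from
    domain knowledge, not from the dataset.
    """
    rng = np.random.default_rng() if rng is None else rng
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    scale = (upper - lower) / epsilon
    return clipped + rng.laplace(loc=0.0, scale=scale, size=clipped.shape)

# Toy heart-rate readings (made up for illustration).
heart_rate = np.array([58.0, 72.0, 95.0, 210.0])
noisy = add_laplace_noise_bounded(
    heart_rate, lower=40.0, upper=200.0, epsilon=1.0,
    rng=np.random.default_rng(0),
)
print(noisy)
```

Clipping trades a small bias on out-of-range values for a sensitivity that can be stated without looking at the data.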

**3. Model Training**
- **Baseline Model (Original Data):** Train a logistic regression model and evaluate performance.
- **Private Model (Noisy Data):** Train the same model on data with Laplace noise and compare metrics.
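
Logistic regression tends to converge more reliably when features share a similar scale, and a feature like `Income` is orders of magnitude larger than `Heart Rate`. The following is a minimal sketch, on synthetic data rather than the assignment dataset, of wrapping the model in a scaling pipeline; all variable names and the toy distributions are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for two numerical features on very different scales.
income = rng.normal(50_000, 15_000, size=300)
heart_rate = rng.normal(75, 12, size=300)
X_toy = np.column_stack([income, heart_rate])
# Toy label that depends on heart rate plus some noise.
y_toy = (heart_rate + rng.normal(0, 5, size=300) > 75).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The scaler is fitted on the training split only, then applied to the test split.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
test_accuracy = clf.score(X_te, y_te)
print(f"Toy test accuracy: {test_accuracy:.2f}")
```

Using the same pipeline for the baseline and the private model keeps the comparison between them focused on the effect of the noise.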

**4. Evaluation**
- **Performance Metrics:** Compare accuracy, precision, recall, and F1-score between baseline and private models.
- **Privacy vs. Utility Trade-off:** Discuss how ε impacts privacy and model performance.
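
One way to examine the trade-off is to repeat the comparison for several values of ε. The sketch below uses synthetic data from `make_classification` rather than the assignment dataset, and mirrors the notebook's choice of the observed feature range as sensitivity; the `evaluate` helper and the ε grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the assignment dataset).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
rng = np.random.default_rng(0)

def evaluate(features, labels):
    """Train logistic regression on a fixed split and return test metrics."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=42
    )
    pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
    return {
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
        "f1": f1_score(y_te, pred, zero_division=0),
    }

results = {"baseline": evaluate(X_syn, y_syn)}

# Per-feature sensitivity taken as the observed range, as in the notebook.
sensitivity = X_syn.max(axis=0) - X_syn.min(axis=0)
for epsilon in (0.1, 1.0, 10.0):
    noise = rng.laplace(0.0, sensitivity / epsilon, size=X_syn.shape)
    results[f"epsilon={epsilon}"] = evaluate(X_syn + noise, y_syn)

for name, metrics in results.items():
    print(name, {k: round(v, 2) for k, v in metrics.items()})
```

Because the noise is random, a single run per ε gives only a rough picture; averaging over several seeds would give a steadier curve.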

**5. Legal and Ethical Implications**
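- **GDPR Compliance:** Techniques like differential privacy lower re-identification risk and can support compliance, although noised data is not automatically anonymous in the legal sense.
- **Data Ownership:** Users retain ownership of their data; privacy-preserving processing helps respect their rights.
- **Ethical Considerations:** Ensure transparency about how data is processed and minimize the risk of re-identification.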


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load data
# Make sure that the dataset is uploaded in the session storage
df = pd.read_csv('Assignment2Dataset-1.csv')

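# Remove PII columns (direct identifiers)
df_clean = df.drop(columns=['Name', 'SSN', 'DOB'])
# Strip the '%' sign from 'Oxygen Level' and convert the column to float
df_clean['Oxygen Level'] = df_clean['Oxygen Level'].str.replace('%', '', regex=False).astype(float)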

# Column groups used below for imputation and encoding
num_cols = ['Income', 'Heart Rate', 'Oxygen Level']
cat_cols = ['Country', 'Marital Status', 'House Status', 'Blood Type']


In [None]:

# Impute numerical features
num_imputer = SimpleImputer(strategy='median')
df_clean[num_cols] = num_imputer.fit_transform(df_clean[num_cols])

# Impute categorical features
cat_imputer = SimpleImputer(strategy='most_frequent')
df_clean[cat_cols] = cat_imputer.fit_transform(df_clean[cat_cols])


In [None]:
# Encode categorical variables
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_cats = encoder.fit_transform(df_clean[cat_cols])
# Reuse df_clean's index so the concat below aligns rows correctly
encoded_cat_df = pd.DataFrame(encoded_cats, columns=encoder.get_feature_names_out(cat_cols), index=df_clean.index)

# Combine encoded data with numerical features
X = pd.concat([df_clean[num_cols], encoded_cat_df], axis=1)
# LabelEncoder assigns codes in sorted class order (e.g. 'Abnormal' -> 0, 'Normal' -> 1)
y = LabelEncoder().fit_transform(df_clean['Tumor Condition'])


In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline model
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Baseline Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"Baseline F1: {f1_score(y_test, y_pred):.2f}")

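# Apply differential privacy to numerical features
def add_laplace_noise(data, epsilon=1.0):
    """Add Laplace noise with scale = sensitivity / epsilon to each value."""
    # Sensitivity is the observed range (per column when data is a DataFrame).
    # Note: deriving it from the data itself weakens the formal DP guarantee.
    sensitivity = data.max() - data.min()
    scale = sensitivity / epsilon
    noise = np.random.laplace(0, scale, data.shape)
    return data + noise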


Baseline Accuracy: 1.00
Baseline F1: 1.00


In [None]:

X_private = X.copy()
X_private[num_cols] = add_laplace_noise(X_private[num_cols], epsilon=1.0)

# Split private data
Xp_train, Xp_test, yp_train, yp_test = train_test_split(X_private, y, test_size=0.2, random_state=42)

# Private model
model_private = LogisticRegression()
model_private.fit(Xp_train, yp_train)
yp_pred = model_private.predict(Xp_test)
print(f"Differential Privacy Accuracy: {accuracy_score(yp_test, yp_pred):.2f}")
print(f"Differential Privacy F1: {f1_score(yp_test, yp_pred):.2f}")

# Save the noised dataset (this is noise addition for privacy, not encryption)
df_clean[num_cols] = X_private[num_cols]
df_clean.to_csv('encrypted_dataset.csv', index=False)

Differential Privacy Accuracy: 0.92
Differential Privacy F1: 0.96



**Output Analysis:**
- **Baseline Model:** Accuracy=1.0, F1=1.0
- **Private Model (ε=1.0):** Accuracy=0.92, F1=0.96

**Discussion:**
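- At ε=1.0 the private model’s accuracy fell from 1.00 to 0.92 and its F1 from 1.00 to 0.96, a modest loss that suggests a workable privacy-utility balance. The perfect baseline score is worth double-checking (for example, for label leakage or an unusually easy test split) before drawing strong conclusions.
- Larger ε values mean less noise, so utility improves but privacy weakens; ε should be tuned to the acceptable level of risk.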
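
To make the effect of ε concrete, the Laplace scale is b = Δ/ε and the standard deviation of Laplace noise is √2·b, so a tenfold increase in ε shrinks the noise tenfold. The following is a small sketch, assuming a hypothetical sensitivity of 100 units; the helper names are illustrative.

```python
import math

def laplace_scale(sensitivity, epsilon):
    """Scale parameter b of the Laplace mechanism: b = sensitivity / epsilon."""
    return sensitivity / epsilon

def laplace_std(scale):
    """Standard deviation of a Laplace distribution with scale b is sqrt(2) * b."""
    return math.sqrt(2.0) * scale

sensitivity = 100.0  # hypothetical feature range, not taken from the dataset
for epsilon in (0.1, 1.0, 10.0):
    b = laplace_scale(sensitivity, epsilon)
    print(f"epsilon={epsilon}: scale={b:.1f}, std={laplace_std(b):.1f}")
```

When noise is added independently to several features of the same record, basic sequential composition adds the per-feature budgets together, so the overall ε is larger than the value used for any single feature.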

**Legal & Ethical Evaluation:**
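- Differential privacy supports GDPR principles such as data minimization and data protection by design, and can contribute to anonymization, but it does not guarantee compliance by itself.
- It supports ethical AI by reducing re-identification risk; user consent and transparency still have to be addressed separately.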
