
# Bias Experiment: Titanic Survival (Break & Fix)
**Goal**: Deliberately build a biased dataset to see how it distorts model performance (Break), then try to recover from it (Fix).


In [1]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

# Load data and fill missing values
# (assign the result back instead of chained inplace=True, which pandas warns will stop working in 3.0)
df = pd.read_csv('../data/titanic_train.csv')
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna('S')
df['Cabin'] = df['Cabin'].fillna('N')
df = df.drop(columns=['PassengerId', 'Name', 'Ticket'])
# Encode Sex as 1 for female, 0 otherwise
df['Sex'] = (df['Sex'] == 'female').astype(int)

# Simple Preprocessing
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
X = df[features]
y = df['Survived']

print("Original Data Shape:", X.shape)


Original Data Shape: (891, 6)




## 1. Break: Selection Bias Simulation
**Scenario**: Assume that most third-class (Pclass=3) passenger records were lost during data collection, a collection bias that favors wealthier passengers.


In [2]:

# Break: drop 90% of the Pclass=3 passengers to create a biased dataset
# (work on a copy so the original data is preserved)
df_biased = df.copy()

# Find the indices of the Pclass=3 rows
p3_indices = df_biased[df_biased['Pclass'] == 3].index

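# Randomly pick 90% of them to drop
# (np.random is not seeded here, so the dropped rows and the scores below vary between runs)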
drop_indices = np.random.choice(p3_indices, int(len(p3_indices) * 0.9), replace=False)
df_biased = df_biased.drop(drop_indices)

print("Biased Data Shape:", df_biased.shape)
print("Pclass Distribution in Biased Data:\n", df_biased['Pclass'].value_counts(normalize=True))

# Train
X_biased = df_biased[features]
y_biased = df_biased['Survived']
X_train, X_test, y_train, y_test = train_test_split(X_biased, y_biased, test_size=0.2, random_state=42)

model_biased = RandomForestClassifier(random_state=42)
model_biased.fit(X_train, y_train)

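# Evaluate (caution: the test set should reflect the original data distribution to measure real-world performance.
# Here X_test is drawn from the biased data, so this only characterizes the biased model.)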
pred = model_biased.predict(X_test)
print(f"Biased Model Accuracy: {accuracy_score(y_test, pred):.4f}")


Biased Data Shape: (450, 9)
Pclass Distribution in Biased Data:
 Pclass
1    0.480000
2    0.408889
3    0.111111
Name: proportion, dtype: float64
Biased Model Accuracy: 0.7778
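
The caution in the evaluation comment can be made concrete by holding out the test set before introducing the bias, so that the test rows keep the original Pclass mix. The following is a minimal sketch under that assumption, not the notebook's own procedure; `split_then_bias` and its parameters are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_then_bias(df, group_col="Pclass", group_value=3, drop_frac=0.9,
                    test_size=0.2, seed=42):
    """Hold out an unbiased test set, then thin out one group in the training part only."""
    # Split first, so the test rows are never touched by the biasing step
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=seed)

    # Seeded generator keeps the dropped rows reproducible
    rng = np.random.default_rng(seed)
    group_idx = train_df.index[train_df[group_col] == group_value].to_numpy()
    n_drop = int(len(group_idx) * drop_frac)
    drop_idx = rng.choice(group_idx, size=n_drop, replace=False)

    return train_df.drop(index=drop_idx), test_df
```

With the notebook's `df`, `train_df, test_df = split_then_bias(df)` would give a biased training frame and an untouched test frame; a model fit on `train_df[features]` could then be scored on `test_df[features]`.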



## 2. Fix: Re-weighting / Oversampling
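**Solution**: Increase the importance of the under-represented group (Pclass=3), or augment its data. This cell uses the `class_weight` parameter to steer the model toward the minority class. Note that `class_weight` weights the target labels (`Survived`), not `Pclass` groups, so it does not directly compensate for the missing third-class rows.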
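
To make clear what `'balanced'` actually weights, the short sketch below computes the weights scikit-learn would derive from a label vector. The toy labels are illustrative only and are not taken from the Titanic data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (illustrative only): three non-survivors, one survivor
y_toy = np.array([0, 0, 0, 1])
classes = np.unique(y_toy)

# 'balanced' assigns n_samples / (n_classes * count_per_class) to each label
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
label_weights = dict(zip(classes.tolist(), weights.tolist()))
print(label_weights)
```

The weights are keyed by the values of `y`, so a feature such as `Pclass` plays no part in them.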


In [3]:

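# Fix Attempt: Class Weight Adjustment
# Pclass=3 rows are scarce, so the intent is to give those samples more weight
# (RandomForestClassifier supports both sample_weight and class_weight)

# For simplicity, use 'balanced' mode (this balances the Survived labels, not Pclass)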
model_fixed = RandomForestClassifier(random_state=42, class_weight='balanced')
model_fixed.fit(X_train, y_train)

pred_fixed = model_fixed.predict(X_test)
print(f"Fixed Model Accuracy: {accuracy_score(y_test, pred_fixed):.4f}")

# Compare recall (how many of the actual survivors did each model find?)
print(f"Biased Recall: {recall_score(y_test, pred):.4f}")
print(f"Fixed Recall: {recall_score(y_test, pred_fixed):.4f}")


Fixed Model Accuracy: 0.7778
Biased Recall: 0.7391
Fixed Recall: 0.7391
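
Both models report identical scores here. One plausible reason is that `class_weight='balanced'` acts on the `Survived` labels rather than on `Pclass`. A way to weight the under-represented group directly is to pass per-row `sample_weight` values to `fit`. The sketch below is an illustration under assumptions: `group_weights` and `fit_with_group_weights` are hypothetical helpers, and the target shares would have to come from a trusted reference such as the original data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def group_weights(groups, target_share):
    """Return per-row weights so that weighted group shares match target_share.

    groups: pd.Series of group labels (e.g. Pclass) for the training rows.
    target_share: mapping from group label to its desired share (should sum to 1).
    """
    observed = groups.value_counts(normalize=True)
    # Rows from groups that are rarer than desired get a ratio above 1
    ratio = pd.Series(target_share, dtype=float) / observed
    return groups.map(ratio).to_numpy()


def fit_with_group_weights(X, y, groups, target_share, seed=42):
    """Fit a random forest with group-based sample weights (hypothetical helper)."""
    weights = group_weights(groups, target_share)
    model = RandomForestClassifier(random_state=seed)
    model.fit(X, y, sample_weight=weights)
    return model
```

With the notebook's variables, `fit_with_group_weights(X_train, y_train, X_train['Pclass'], df['Pclass'].value_counts(normalize=True))` would upweight the remaining third-class rows toward their original share. Whether this improves the scores is not shown here and would need to be checked on a test set that follows the original distribution.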
