# **Titanic Classifier**

## 1. Import libraries
Import the seaborn, scikit-learn, matplotlib, and pandas libraries used in this exercise.

In [17]:
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import matplotlib.pyplot as plt
import pandas as pd

## 2. Load the data
Load the Titanic dataset through the seaborn library.

In [2]:
titanic = sns.load_dataset("titanic")

## 3. Inspect the DataFrame structure and preprocess
- DataFrame structure

In [3]:
print(titanic.info())

print(titanic.describe())  # call the method; the bare attribute only prints a bound-method repr

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
None

- Preprocessing  
Before preprocessing, count the missing values in each column.

In [4]:
print(titanic.isnull().sum())

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
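Raw counts are easier to judge as fractions: in the output above, `deck` is missing in 688 of 891 rows (about 77%) and `age` in 177 of 891 (about 20%). The sketch below shows one way to compute such ratios. It uses a small made-up frame named `toy` rather than the real dataset, and the 0.5 cutoff is an arbitrary assumption for illustration, not a rule from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull().mean() gives the fraction of missing values per column.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)

# Hypothetical cutoff: flag columns where more than half of the values are missing.
threshold = 0.5
to_drop = missing_ratio[missing_ratio > threshold].index.tolist()
print(to_drop)
```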


In [5]:
# 1. Drop the deck column (too many missing values: 688 of 891)
titanic.drop(columns='deck', inplace=True)

# 2. Drop redundant columns
titanic.drop(columns='alive', inplace=True)       # duplicates the target, survived
titanic.drop(columns='class', inplace=True)       # duplicates pclass
titanic.drop(columns='adult_male', inplace=True)  # derivable from who
titanic.drop(columns='embarked', inplace=True)    # duplicates embark_town

# 3. age has 177 missing values, too many to ignore, so fill them with the mean
titanic['age'] = titanic['age'].fillna(titanic['age'].mean())

# 4. embark_town has only 2 missing values, so fill them with the mode (most frequent value)
titanic['embark_town'] = titanic['embark_town'].fillna(titanic['embark_town'].mode()[0])
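Calling `fillna(..., inplace=True)` on a column selected with `df[col]` operates on an intermediate object, which is what the pandas warning about version 3.0 describes. A minimal sketch of the two alternatives that warning suggests, on a made-up frame named `toy` (not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (values are made up).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "embark_town": ["Southampton", None, "Southampton"],
})

# Option 1: assign the filled column back to the frame.
toy["age"] = toy["age"].fillna(toy["age"].mean())

# Option 2: fill on the frame itself with a {column: value} mapping.
toy = toy.fillna({"embark_town": toy["embark_town"].mode()[0]})

print(toy)
```

Both forms write to the frame itself, so they do not depend on whether the selected column is a view or a copy.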


In [6]:
# Re-check missing values to confirm the preprocessing worked
print(titanic.isnull().sum())

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
who            0
embark_town    0
alone          0
dtype: int64


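## 4. Split X and y, then split into training and test sets
- The passenger's survival, **survived**, is the **prediction target** of this problem.  
→ The task is therefore to **predict** whether a passenger **survived** from the other features.  

* X (features): input → pclass, sex, age, sibsp, etc. // the data used for prediction
* y (label): target → survived // the outcome to predict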

In [7]:
X = titanic.drop(columns="survived")
y = titanic["survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
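With 891 rows and `test_size=0.2`, the split yields 712 training rows and 179 test rows. The sketch below checks that arithmetic on placeholder arrays and also passes `stratify`, which is an optional addition not used in the cell above; it keeps the class ratio similar in both parts. The names `X_demo` and `y_demo` and the labels are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the Titanic frame (891).
X_demo = np.arange(891).reshape(-1, 1)
y_demo = np.array([0, 1, 1] * 297)  # made-up labels, two thirds positive

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))      # 712 179
print(y_tr.mean(), y_te.mean())  # both close to the overall positive rate of 2/3
```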

## 5. Separate columns and design the pipeline

In [8]:
# Separate numeric and categorical columns
number_features = ["age", "fare", "sibsp", "parch"]
category_features = ["sex", "who", "alone", "pclass", "embark_town"]

In [9]:
# Preprocessing for numeric columns
number_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("scaler", StandardScaler())
])

# Preprocessing for categorical columns
category_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# Combine the column preprocessing steps
preprocessor = ColumnTransformer([
    ("number", number_pipeline, number_features),
    ("category", category_pipeline, category_features)
])

## 6. Connect several classification models to the Pipeline

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

models = {
    "LogisticRegression": LogisticRegression(),
    "RandomForest": RandomForestClassifier(),
    "DecisionTree": DecisionTreeClassifier(),
    "SVC": SVC(),
    "NaiveBayes": GaussianNB()
}

## 7. Train and evaluate

In [11]:
for name, model in models.items():
    pipe = Pipeline([
        ("preprocessing", preprocessor),
        ("classifier", model)
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {acc:.3f}")

LogisticRegression Accuracy: 0.810
RandomForest Accuracy: 0.821
DecisionTree Accuracy: 0.771
SVC Accuracy: 0.816
NaiveBayes Accuracy: 0.793
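These accuracies come from a single test split of 179 rows, so a gap such as 0.821 versus 0.816 corresponds to roughly one test passenger. One common way to get a steadier comparison is k-fold cross-validation. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` and a reduced model set (`demo_models`), not the Titanic frame or the full dictionary above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real notebook would pass X_train and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

demo_models = {
    "LogisticRegression": LogisticRegression(),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}

cv_results = {}
for name, model in demo_models.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("classifier", model)])
    # Five accuracy scores, one per held-out fold.
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Fixing `random_state` on the tree-based models also makes repeated runs comparable.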


## 8. Find the best parameters (GridSearchCV)

In [12]:
from sklearn.model_selection import GridSearchCV

# Build the pipeline

# Apply RandomForest only
pipe_RF = Pipeline([
    ("preprocessing", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42))
])

In [13]:
# Define the hyperparameter search ranges
param_grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [None, 5, 10],
    "classifier__min_samples_split": [2, 5],
    "classifier__min_samples_leaf": [1, 2, 4]
}
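The grid size is the product of the list lengths: 3 × 3 × 2 × 3 = 54 combinations, which is the candidate count the fit log reports, and with 5 folds that means 270 fits. A quick check with `ParameterGrid` (the name `param_grid_demo` is just a copy of the grid for this sketch):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above, copied so the snippet runs on its own.
param_grid_demo = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [None, 5, 10],
    "classifier__min_samples_split": [2, 5],
    "classifier__min_samples_leaf": [1, 2, 4],
}

n_candidates = len(ParameterGrid(param_grid_demo))
cv_folds = 5

print(n_candidates)             # 54
print(n_candidates * cv_folds)  # 270
```

Adding one more value to any list multiplies the total, so it is worth checking the count before launching a long search.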

In [14]:
# Configure GridSearch

# RandomForest
grid_search_RF = GridSearchCV(
    pipe_RF,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

In [None]:
# Train and search for the best parameters

# Fit the RandomForest grid search
grid_search_RF.fit(X_train, y_train)

# Print the results
print("Best parameters:", grid_search_RF.best_params_)
print("Best cross-validation accuracy: {:.3f}".format(grid_search_RF.best_score_))

Fitting 5 folds for each of 54 candidates, totalling 270 fits
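`best_score_` is a cross-validation average over the training data, so a natural follow-up is to score the refitted best model on the held-out test set. The sketch below illustrates that step under assumptions: it uses synthetic data and a deliberately small grid (`demo_search`) so it runs quickly, and it is not the notebook's actual search or result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the notebook would reuse X_train, X_test, y_train, y_test.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Small illustrative grid (4 candidates) to keep the run short.
demo_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [20, 50], "max_depth": [None, 5]},
    cv=3,
    scoring="accuracy",
)
demo_search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all of X_tr.
best_model = demo_search.best_estimator_
test_acc = accuracy_score(y_te, best_model.predict(X_te))

print("Best parameters:", demo_search.best_params_)
print(f"Best cross-validation accuracy: {demo_search.best_score_:.3f}")
print(f"Held-out test accuracy: {test_acc:.3f}")
```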
