### 1. Import modules and data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sb
import warnings 
warnings.filterwarnings('ignore')
from matplotlib import pyplot as plt

In [None]:
train = pd.read_csv("./dataset/train.csv")
test = pd.read_csv("./dataset/test.csv")

### 2. Data Overview

* Data Dictionary
    1. Survival : survived or not (0 : No, 1 : Yes)  
    2. pclass : Ticket class  
    3. sex : Sex  
    4. Age : Age in years  
    5. sibsp : # of siblings / spouses aboard the Titanic  
    6. parch : # of parents / children aboard the Titanic  
    7. ticket : Ticket number  
    8. fare : Passenger fare  
    9. cabin : Cabin number  
    10. embarked : Port of embarkation (C : Cherbourg, Q : Queenstown, S : Southampton)  

**Train data**

The training set has 891 rows and 12 columns.

In [None]:
print(train.shape)

Age has 177 missing values, Cabin has 687, and Embarked has 2.

In [None]:
print(train.info())

In [None]:
train.isnull().sum()

The individual records look like this.

In [None]:
train.head()

The distribution of each column is shown below.

In [None]:
train.describe()

**Test data**

The test set has 418 rows and 11 columns.

In [None]:
print(test.shape)

In the test data, Age has 86 missing values, Cabin has 327, and Fare has 1.

In [None]:
print(test.info())

In [None]:
test.isnull().sum()

The distribution of the test data is shown below.

In [None]:
test.describe()

### 3. Data Analyzation

For the Name column, the prefix (title) is extracted with a regular expression.

In [None]:
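# Raw strings avoid the invalid "\." escape; expand=False returns a Series
train["Prefix"] = train["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)
test["Prefix"] = test["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)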
train["Prefix"].value_counts()

In [None]:
test["Prefix"].value_counts()
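To illustrate what the regular expression captures, here is a minimal sketch on a few hypothetical names. The sample names and the `names` variable are made up for illustration and only assume the "Surname, Title. Given names" layout.

```python
import pandas as pd

# Hypothetical names in the "Surname, Title. Given names" layout
names = pd.Series(["Smith, Mr. John", "Doe, Miss. Jane", "Brown, Master. Tom"])

# Capture the run of letters directly before the first period
prefix = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(prefix.tolist())  # ['Mr', 'Miss', 'Master']
```

The surname is skipped because it is followed by a comma rather than a period.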

Convert the extracted prefix to integers so it can be used for training.

In [None]:
pre_dict = {"Mr":0, "Miss":1, "Mrs":2, "Master":3, "Dr":4, "Rev":4, "Major":4, "Col":4, "Mlle":4, "Ms":4, "Countess":4, "Jonkheer":4, "Lady":4, "Capt":4, "Don":4, "Sir":4, "Mme":4, "Dona":4}
train["Prefix"] = train["Prefix"].map(pre_dict)
test["Prefix"] = test["Prefix"].map(pre_dict)

In [None]:
train["Prefix"].value_counts()

In [None]:
test["Prefix"].value_counts()
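One thing to watch with `.map()` is that any title missing from the dictionary becomes NaN. The sketch below shows one possible way to guard against that by sending unseen titles to the "other" code. The `RARE_CODE` name, the shortened dictionary, and the sample titles are assumptions for illustration.

```python
import pandas as pd

# Subset of the title-to-code mapping, for illustration only
pre_dict = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3, "Dr": 4}
RARE_CODE = 4  # assumed fallback: treat any unseen title as "other"

titles = pd.Series(["Mr", "Mrs", "Baron"])  # "Baron" is a made-up unseen title

# .map() yields NaN for keys missing from the dict; fill them, then cast to int
codes = titles.map(pre_dict).fillna(RARE_CODE).astype("int64")
print(codes.tolist())  # [0, 2, 4]
```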

Fill the missing values in the Embarked column. Embarked is filled with "S", which is by far the most common value. The test set has no missing values in Embarked, so this is applied to the train data only.

In [None]:
# Assign the result back instead of using inplace=True on a column selection
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])

In [None]:
plt.figure(figsize=(15,15))
# numeric_only=True skips the string columns (needed on pandas >= 2.0)
sb.heatmap(data=train.corr(numeric_only=True), annot=True, fmt='.2f', linewidths=.5, cmap='Blues')

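Age correlates most strongly with Pclass, so missing Age values are filled with the mean Age of each Pclass.  
For the test set, the means computed on the train set are used.  
The missing Fare value in the test set is handled the same way: Fare correlates most strongly with Pclass, so it is filled with the train-set mean Fare of the corresponding Pclass.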

In [None]:
sb.barplot(x="Pclass", y="Age", ci=None, data= train)

In [None]:
sb.barplot(x="Pclass", y="Fare", ci=None, data=train)

In [None]:
# Fill missing Age in train with the mean Age of the passenger's Pclass
train["Age"] = train["Age"].fillna(train.groupby("Pclass")["Age"].transform("mean"))

# for test: reuse the per-Pclass means computed on train
age_mean = train.groupby("Pclass")["Age"].mean()
fare_mean = train.groupby("Pclass")["Fare"].mean()
test["Age"] = test["Age"].fillna(test["Pclass"].map(age_mean))
test["Fare"] = test["Fare"].fillna(test["Pclass"].map(fare_mean))
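The per-class imputation described above can be checked on a toy example. This is a minimal sketch with made-up values; `toy_train`, `toy_test`, and `class_mean` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train and test (values are made up)
toy_train = pd.DataFrame({"Pclass": [1, 1, 3, 3], "Age": [40.0, np.nan, 20.0, 30.0]})
toy_test = pd.DataFrame({"Pclass": [1, 3], "Age": [np.nan, np.nan]})

# Per-class means are computed from the training frame only (NaN is skipped)
class_mean = toy_train.groupby("Pclass")["Age"].mean()

# Look up each row's class mean and use it only where Age is missing
toy_train["Age"] = toy_train["Age"].fillna(toy_train["Pclass"].map(class_mean))
toy_test["Age"] = toy_test["Age"].fillna(toy_test["Pclass"].map(class_mean))
print(toy_test["Age"].tolist())  # [40.0, 25.0]
```

Computing the means on the training frame alone keeps test-set statistics out of the preprocessing.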

The Sex and Embarked columns, which are currently strings, are also converted to integers so the model can use them.

In [None]:
s_dict = {"male":0, "female":1}
train["Sex"] = train["Sex"].map(s_dict)
test["Sex"] = test["Sex"].map(s_dict)

In [None]:
ebk_dict = {"S": 0, "C": 1, "Q": 2}
train["Embarked"] = train["Embarked"].map(ebk_dict)
test["Embarked"] = test["Embarked"].map(ebk_dict)

In [None]:
sib_dict = {0:0, 1:1, 2:2, 3:2, 4:2, 5:2, 8:2}
train["SibSp"] = train["SibSp"].map(sib_dict)
test["SibSp"] = test["SibSp"].map(sib_dict)

In [None]:
parch_dict = {0:0, 1:1, 2:2, 3:2, 4:2, 5:2, 6:2, 9:2}
train["Parch"] = train["Parch"].map(parch_dict)
test["Parch"] = test["Parch"].map(parch_dict)

Next, drop the columns that are not needed.

In [None]:
train.drop(["PassengerId", "Name", "Cabin", "Ticket"], axis=1, inplace=True)
test.drop(["Name", "Cabin", "Ticket"], axis=1, inplace=True)

Check each column for outliers.

In [None]:
sb.boxplot(x=train["Age"])

In [None]:
sb.boxplot(x=train["Fare"])

In [None]:
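def change_outlier(data, col):
    """Clip data[col] in place to the 1.5 * IQR fences computed from train[col]."""
    # Quartiles always come from the global train frame, so test is clipped
    # with the same bounds as train
    q3 = train[col].quantile(q=0.75)
    q1 = train[col].quantile(q=0.25)
    margin = 1.5 * (q3 - q1)
    data.loc[data[col] < q1 - margin, col] = q1 - margin
    data.loc[data[col] > q3 + margin, col] = q3 + margin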

In [None]:
change_outlier(train, "Age")
change_outlier(test, "Age")
change_outlier(train, "Fare")
change_outlier(test, "Fare")
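The same IQR-based clipping can be sketched without a global frame by passing the reference column explicitly. This is one possible variant, not the notebook's own function; `clip_by_iqr` and the toy values are hypothetical.

```python
import pandas as pd

def clip_by_iqr(values: pd.Series, reference: pd.Series) -> pd.Series:
    """Clip values to [Q1 - 1.5*IQR, Q3 + 1.5*IQR] computed on reference."""
    q1 = reference.quantile(0.25)
    q3 = reference.quantile(0.75)
    margin = 1.5 * (q3 - q1)
    return values.clip(lower=q1 - margin, upper=q3 + margin)

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is an obvious outlier
print(clip_by_iqr(toy, toy).tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

Here Q1 is 2 and Q3 is 4, so the upper fence is 4 + 1.5 * 2 = 7 and the outlier is pulled down to it. Passing the train column as `reference` when clipping the test column would reproduce the train-derived bounds.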

In [None]:
train.describe()

Age and Fare span a wide range, so each is grouped into 4 bins.

In [None]:
train.loc[train["Age"] <=22.0, "Age"] = 0
train.loc[(train["Age"] >22.0) & (train["Age"] <= 26.0), "Age"] = 1
train.loc[(train["Age"] >26.0) & (train["Age"] <= 37.0), "Age"] = 2
train.loc[train["Age"] > 37.0, "Age"] = 3

train.loc[train["Fare"] <= 7.91, "Fare"] = 0
train.loc[(train["Fare"] > 7.91) & (train["Fare"] <= 14.4542), "Fare"] = 1
train.loc[(train["Fare"] > 14.4542) & (train["Fare"] <= 31), "Fare"] = 2
train.loc[train["Fare"] > 31, "Fare"] = 3
train["Age"] = train["Age"].astype("int64")
train["Fare"] = train["Fare"].astype("int64")


test.loc[test["Age"] <= 22.0, "Age"] = 0
test.loc[(test["Age"] > 22.0) & (test["Age"] <= 26.0), "Age"] = 1
test.loc[(test["Age"] > 26.0) & (test["Age"] <= 37.0), "Age"] = 2
test.loc[test["Age"] > 37, "Age"] = 3

test.loc[test["Fare"] <= 7.91, "Fare"] = 0
test.loc[(test["Fare"] > 7.91) & (test["Fare"] <= 14.4542), "Fare"] = 1
test.loc[(test["Fare"] > 14.4542) & (test["Fare"] <= 31), "Fare"] = 2
test.loc[test["Fare"] > 31, "Fare"] = 3

test["Age"] = test["Age"].astype("int64")
test["Fare"] = test["Fare"].astype("int64")
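As an alternative sketch, the same right-closed bins could be expressed with `pd.cut`, which avoids repeating each condition for train and test. The sample ages and the `age_bins` name are made up for illustration; the bin edges are the ones used above.

```python
import numpy as np
import pandas as pd

ages = pd.Series([5.0, 22.0, 24.5, 30.0, 54.0])  # made-up ages

# Right-closed bins: (-inf, 22], (22, 26], (26, 37], (37, inf)
age_bins = [-np.inf, 22.0, 26.0, 37.0, np.inf]

# labels=False returns the integer bin index instead of an interval label
age_group = pd.cut(ages, bins=age_bins, labels=False).astype("int64")
print(age_group.tolist())  # [0, 0, 1, 2, 3]
```

The Fare column would work the same way with the edges 7.91, 14.4542, and 31.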

In [None]:
train.info()

In [None]:
test.info()

Check the correlations again after preprocessing.

In [None]:
plt.figure(figsize=(15,15))
sb.heatmap(data=train.corr(), annot=True, fmt='.2f', linewidths=.5, cmap='Blues')

Keep the columns with a correlation of 0.1 or higher and drop the rest.

In [None]:
drop_list = ['SibSp', 'Age']
# drop() returns a new frame, so assign the result back
train = train.drop(drop_list, axis=1)
test = test.drop(drop_list, axis=1)
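Rather than reading the weak columns off the heatmap, the list could be derived programmatically. The sketch below assumes the 0.1 threshold refers to the absolute correlation with `Survived`, which the text does not state explicitly. The toy frame and the names `target_corr` and `weak_cols` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Survived": [0, 1, 0, 1, 0, 1],
    "Sex":      [0, 1, 0, 1, 0, 1],   # perfectly correlated with the target
    "Noise":    [1, 1, 2, 2, 3, 3],   # uncorrelated with the target
})

# Absolute correlation of every feature with the target
target_corr = toy.corr()["Survived"].drop("Survived").abs()

# Columns whose |correlation| falls below the 0.1 threshold
weak_cols = target_corr[target_corr < 0.1].index.tolist()
print(weak_cols)  # ['Noise']
```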

### 4. Train and Predict

In [None]:
# Classification module
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

In [None]:
y_train = train["Survived"]
x_train = train.drop(['Survived'], axis=1)

In [None]:
# Logistic Regression
logis = LogisticRegression()
logis.fit(x_train, y_train)
logis_score = logis.score(x_train, y_train)
print(f"Logistic regression Score: {logis_score}")

In [None]:
# Support Vector Machine(SVM)
svm = SVC()
svm.fit(x_train, y_train)
svm_score = svm.score(x_train, y_train)
print(f"SVM score: {svm_score}")

In [None]:
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(x_train, y_train)
knn_score = knn.score(x_train, y_train)
print(f"KNN score: {knn_score}")

In [None]:
tree = DecisionTreeClassifier()
tree.fit(x_train, y_train)
tree_score = tree.score(x_train, y_train)
print(f"Decision Tree Score: {tree_score}")

In [None]:
forest = RandomForestClassifier()
forest.fit(x_train, y_train)
forest_score = forest.score(x_train, y_train)
print(f"Random Forest score: {forest_score}")

In [None]:
lda = LDA()
lda.fit(x_train, y_train)
lda_score = lda.score(x_train, y_train)
print(f"LDA score: {lda_score}")

In [None]:
ridge = RidgeClassifier()
ridge.fit(x_train, y_train)
ridge_score = ridge.score(x_train, y_train)
print(f"Ridge score: {ridge_score}")

In [None]:
test_pred = test.copy()
test_pred = test_pred.drop("PassengerId", axis= 1)

In [None]:
y_pred = forest.predict(test_pred)
submission = pd.DataFrame({"PassengerId" : test["PassengerId"], "Survived":y_pred})
submission.to_csv("submission.csv", index=False)

### 5. Train with K-Fold

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [None]:
# AutoML
from pycaret import classification

In [None]:
auto_ml = classification.setup(data=train, target="Survived", fold_strategy='stratifiedkfold')

In [None]:
auto_ml_result = classification.compare_models(fold=5, round=3, sort="Accuracy", n_select=5)

In [None]:
kfold = KFold(n_splits=20, shuffle=True, random_state=42)

In [None]:
print("---- Logistic Regression ----")
logis_k = LogisticRegression()
logis_k_score = cross_val_score(logis_k, x_train, y_train, cv=kfold, n_jobs=1, scoring="accuracy")
print(logis_k_score)
print(f"Average logis_k score: {np.mean(logis_k_score)} \n")

print("---- Support Vector Machine ----")
svm_k = SVC()
svm_k_score = cross_val_score(svm_k, x_train, y_train, cv=kfold, n_jobs=1, scoring='accuracy')
print(svm_k_score)
print(f"Average svm_k score: {np.mean(svm_k_score)} \n")

print("---- K-Nearest Neighbor ----")
knn_k = KNeighborsClassifier(n_neighbors=5)
knn_k_score = cross_val_score(knn_k, x_train, y_train, cv=kfold, n_jobs=1, scoring="accuracy")
print(knn_k_score)
print(f"Average knn_k score: {np.mean(knn_k_score)} \n")

print("---- Decision Tree ----")
tree_k = DecisionTreeClassifier()
tree_k_score = cross_val_score(tree_k, x_train, y_train, cv=kfold, n_jobs=1, scoring="accuracy")
print(tree_k_score)
print(f"Average tree_k score: {np.mean(tree_k_score)} \n")

print("---- Random Forest ----")
forest_k = RandomForestClassifier()
forest_k_score = cross_val_score(forest_k, x_train, y_train, cv=kfold, n_jobs=1, scoring="accuracy")
print(forest_k_score)
print(f"Average forest_k score: {np.mean(forest_k_score)} \n")

print("---- LDA ----")
lda_k = LDA()
lda_k_score = cross_val_score(lda_k, x_train, y_train, cv=kfold, n_jobs=1, scoring="accuracy")
print(lda_k_score)
print(f"Average lda_k score: {np.mean(lda_k_score)} \n")

print("---- Ridge ----")
ridge_k = RidgeClassifier()
ridge_k_score = cross_val_score(ridge_k, x_train, y_train, cv=kfold, n_jobs=1, scoring="accuracy")
print(ridge_k_score)
print(f"Average ridge_k score: {np.mean(ridge_k_score)} \n")
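The cross-validation blocks above all follow one pattern, so they could be collapsed into a loop over a dictionary of models. This is a minimal sketch on synthetic data from `make_classification`, used here only as a stand-in for `x_train` / `y_train`. The `models` and `mean_scores` names, the two-model selection, and the 5-fold setting are assumptions for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x_train / y_train
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
mean_scores = {}
for name, model in models.items():
    # One accuracy value per fold, averaged into a single score per model
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_scores[name] = float(np.mean(scores))
    print(f"{name}: {mean_scores[name]:.3f}")
```

Collecting the averages in a dictionary makes it easy to pick the best model with `max(mean_scores, key=mean_scores.get)`.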

In [None]:
# The final model here is an SVC, so the variable is named accordingly
svm_final = SVC()
svm_final.fit(x_train, y_train)
y_pred = svm_final.predict(test_pred)
submission = pd.DataFrame({"PassengerId" : test["PassengerId"], "Survived":y_pred})
submission.to_csv("submission_kfold.csv", index=False)

### 6. CatBoost

In [None]:
import optuna
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier
from optuna import Trial, visualization
from optuna.samplers import TPESampler

In [None]:
def optimizerCAT(trial, data, target):
    param = {
        'random_state':42,
        'n_estimators': trial.suggest_int('n_estimators', 300, 3500),
        'depth': trial.suggest_int('depth', 6, 14),
        'fold_permutation_block': trial.suggest_int('fold_permutation_block', 1, 256),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 1),
        'od_pval': trial.suggest_float('od_pval', 0, 1),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 0, 4),
    }

    x_train, x_valid, y_train, y_valid = train_test_split(data, target, test_size=0.2)

    model = CatBoostClassifier(**param)
    model.fit(x_train, y_train, verbose= True)
    # accuracy_score expects (y_true, y_pred)
    score = accuracy_score(y_valid, model.predict(x_valid))

    return score

In [None]:
study = optuna.create_study(direction="maximize", sampler=TPESampler())
x_train = train.drop(["Survived"], axis=1)
y_train = train["Survived"]
study.optimize(lambda trial: optimizerCAT(trial, x_train, y_train), n_trials=5)

print('Best trial : score {}, \nparams {}'.format(study.best_trial.value, study.best_trial.params))

In [None]:
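# Each list holds a single value, so the search below evaluates exactly one
# configuration and mainly serves as a cross-validated fit with fixed parameters
cat_parameter = {'n_estimators': [3472], 'depth': [11], 'fold_permutation_block': [201], 'learning_rate': [0.7135660214466232], 'od_pval': [0.14156911916283843], 'l2_leaf_reg': [1.3590814127726265]}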
cat = CatBoostClassifier(random_state=42, verbose=False)
model = RandomizedSearchCV(cat, cat_parameter, cv=kfold, n_jobs=1)
model.fit(x_train, y_train)

In [None]:
y_pred = model.predict(test_pred)
submission = pd.DataFrame({"PassengerId" : test["PassengerId"], "Survived":y_pred})
submission.to_csv("submission_catboost.csv", index=False)