# 1. Get Data

## 1.1 Requirements
- See requirements.txt for the libraries used in this project.

## 1.2 Download Data
- Data
    - gender_submission.csv (submission)
    - test.csv (test data)
    - train.csv (train data)
- Download
    - using Kaggle API

In [None]:
!kaggle competitions download -c titanic
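
The loader used later in this notebook reads `data/train.csv` and `data/test.csv`, while the Kaggle CLI normally delivers a single zip archive. A minimal sketch of the unpacking step, assuming the archive is named `titanic.zip` and sits in the working directory (both are assumptions; `extract_archive` is a hypothetical helper, not part of this project):

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, out_dir="data"):
    """Unpack a downloaded archive into out_dir and return the sorted member names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)
        return sorted(archive.namelist())


if __name__ == "__main__":
    # Assumed archive name; adjust it if the CLI saved the file elsewhere.
    if Path("titanic.zip").exists():
        print(extract_archive("titanic.zip"))
```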

# 2. Data Analysis

## 2.1 Data Overview
- Label
    - Survived
- Columns with null values -> need processing
    - Age
    - Cabin
    - Embarked
- Categorical columns -> need encoding
    - Pclass: already ordinal encoded
    - Sex
    - Embarked
- Correlation of columns
    - positively related to the label ('Survived'):
        - Fare
        - Pclass: the correlation coefficient is negative, but a higher class has a lower numeric value (1st class = 1), so a higher class actually goes with higher survival. Fare and Pclass are related in the same way: a higher class goes with a higher fare.
    - inversely related:
        - Parch: it looks positively related in the correlation matrix because of outliers; the scatter matrix shows that this is not the case
        - SibSp
        - Age
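
The sign issue with `Pclass` can be seen on a toy example. The frame below is made up for illustration only (it is not the Titanic data): first class is coded as 1 and third class as 3, so a negative Pearson coefficient goes together with a higher survival rate in the higher class.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from train.csv.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0, 1, 0],
})

# Pearson correlation between the class code and the label.
corr = toy["Pclass"].corr(toy["Survived"])

# Survival rate per class makes the direction explicit.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(f"corr(Pclass, Survived) = {corr:.2f}")
print(rate_by_class)
```

Reading the per-class survival rate next to the coefficient avoids misinterpreting the sign of an ordinal code.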


In [1]:
import pandas as pd
import os
import numpy as np
from load_data import load_titanic_data
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

ModuleNotFoundError: No module named 'load_data'

In [2]:
titanic = load_titanic_data('train')
titanic.head()

NameError: name 'load_titanic_data' is not defined

In [3]:
titanic.info()

NameError: name 'titanic' is not defined

In [4]:
titanic['Sex'].value_counts()

NameError: name 'titanic' is not defined

In [5]:
titanic['Embarked'].value_counts()

NameError: name 'titanic' is not defined

In [6]:
titanic.describe()

NameError: name 'titanic' is not defined

In [7]:
titanic[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].hist(bins=50, figsize=(20, 15))
plt.show()

NameError: name 'titanic' is not defined

In [8]:
titanic['Age_cat'] = pd.cut(titanic['Age'],
                               bins=[0., 10., 20., 30., 40., 50., np.inf],
                               labels=[1, 2, 3, 4, 5, 6])
titanic['Age_cat'].hist()
titanic.drop('Age_cat', axis=1, inplace=True)

NameError: name 'titanic' is not defined

In [9]:
# Restrict to numeric columns; pandas 2.x raises on text columns such as Name.
corr_matrix = titanic.corr(numeric_only=True)

NameError: name 'titanic' is not defined

In [10]:
corr_matrix['Survived'].sort_values(ascending=False)

NameError: name 'corr_matrix' is not defined

In [1]:
from pandas.plotting import scatter_matrix

attributes = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch','Fare']
scatter_matrix(titanic[attributes], figsize=(12, 8), alpha=0.1)

NameError: name 'titanic' is not defined

# 3. Preprocessing Data (preprocessing.py)

## 3.1 Data Cleaning
- Shuffle the data
- Delete unnecessary columns
    - Name: not useful for training a good model
    - Ticket: too many distinct text values to encode
    - Cabin: too many null and text values
    - Embarked: I don't think it is strongly related to survival
- Process null values
    - Age: fill nulls with the median
- Standardize the numeric columns
- Encode categorical columns
    - Sex: female: 0, male: 1 (using OrdinalEncoder)
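
Two of the steps above, median imputation and ordinal encoding, can be illustrated on a few made-up rows. This is only a sketch on hypothetical toy data, not the project's pipeline; it shows why `OrdinalEncoder` yields female: 0 and male: 1 (categories are sorted before codes are assigned).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy rows, not the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 30.0],
    "Sex": ["male", "female", "female", "male"],
})

# Median imputation: the missing Age becomes the median of 22, 30 and 38.
age_filled = SimpleImputer(strategy="median").fit_transform(toy[["Age"]])

# OrdinalEncoder sorts the categories, so 'female' -> 0 and 'male' -> 1.
encoder = OrdinalEncoder()
sex_encoded = encoder.fit_transform(toy[["Sex"]])

print(age_filled.ravel())    # the missing value is replaced by 30.0
print(encoder.categories_)   # categories in sorted order
print(sex_encoded.ravel())
```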

In [2]:
import os
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


In [3]:
def load_titanic_data(mode: str) -> pd.DataFrame:
    """Load data/train.csv or data/test.csv, depending on mode ('train' or 'test')."""
    TRAIN_PATH = os.path.join('data', 'train.csv')
    TEST_PATH = os.path.join('data', 'test.csv')
    modes = {'train': TRAIN_PATH, 'test': TEST_PATH}
    csv_path = modes[mode]  # raises KeyError for any other mode
    return pd.read_csv(csv_path)

In [4]:
def load_preprocessed_data(mode):
    titanic = load_titanic_data(mode)
    if mode == 'train':
        shuffled_titanic = titanic.sample(frac=1)
        shuffled_titanic = shuffled_titanic.drop('PassengerId', axis=1)
        # print(shuffled_titanic.head())
        x_data, t_data = shuffled_titanic.drop('Survived', axis=1), \
                        shuffled_titanic['Survived'].copy()

        x_data_num = x_data.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)
        num_attribs = list(x_data_num)
        cat_attribs = ['Sex']
        # print('attribs:', num_attribs + cat_attribs)

        num_pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('std_scaler', StandardScaler()),
        ])
            
        full_pipeline = ColumnTransformer([
            ('num', num_pipeline, num_attribs),
            ('cat', OrdinalEncoder(), cat_attribs),
        ])

        x_data_prepared = full_pipeline.fit_transform(x_data)
        t_data = np.array(t_data, dtype=np.uint8)
        
        return x_data_prepared, t_data
    elif mode == 'test':
        titanic_num = titanic.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)
        num_attribs = list(titanic_num)
        cat_attribs = ['Sex']
        # print(titanic_num.head())

        num_pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('std_scaler', StandardScaler()),
        ])
            
        full_pipeline = ColumnTransformer([
            ('num', num_pipeline, num_attribs),
            ('cat', OrdinalEncoder(), cat_attribs),
        ])

        titanic_prepared = full_pipeline.fit_transform(titanic)

        return titanic_prepared
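
Note that `load_preprocessed_data('test')` fits a fresh imputer, scaler and encoder on the test file, so the train and test sets are scaled with different statistics. A common alternative is to fit the transformer once on the training frame and reuse it for the test frame. A minimal sketch under that assumption follows; `build_preprocessor`, `prepare`, `NUM_ATTRIBS` and `CAT_ATTRIBS` are hypothetical names, and the column lists are inferred from the drop lists above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Columns kept by the preprocessing above (assumed from its drop lists).
NUM_ATTRIBS = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
CAT_ATTRIBS = ["Sex"]


def build_preprocessor():
    """Return an unfitted transformer with the same steps as full_pipeline above."""
    num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("std_scaler", StandardScaler()),
    ])
    return ColumnTransformer([
        ("num", num_pipeline, NUM_ATTRIBS),
        ("cat", OrdinalEncoder(), CAT_ATTRIBS),
    ])


def prepare(train_df, test_df):
    """Fit on the training frame only, then apply the same fit to both frames."""
    preprocessor = build_preprocessor()
    x_train = preprocessor.fit_transform(train_df)
    x_test = preprocessor.transform(test_df)  # reuses training medians and scales
    return x_train, x_test
```

With this layout, a missing `Age` or `Fare` in the test file is filled with the training median, and both sets share one scaling.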

# 4. Training Model

- Validation strategy: stratified k-fold, which keeps the class ratio the same in every fold and helps detect overfitting

- (model) SGD Classifier
    - Review: Regularization improved performance. The non-regularized and lasso SGD Classifiers showed overfitting.
    - score example:
        - non-regularization:
            - accuracy(5-fold validation): \[0.68, 0.78, 0.79, 0.76, 0.76]
            - confusion matrix: \[\[429 120]
                                \[ 78 264]]
            - precision score: 0.69
            - recall score: 0.77
            - f1 score: 0.72
            - kaggle submission score(accuracy): 0.71
        - lasso regularization:
            - accuracy(5-fold validation): \[0.68, 0.78, 0.79, 0.76, 0.77]
            - confusion matrix: \[\[442 107]
                                \[ 95 247]]
            - precision score: 0.70
            - recall score: 0.72
            - f1 score: 0.71
            - kaggle submission score(accuracy): 0.72
        - ridge regularization:
            - accuracy(5-fold validation): \[0.75, 0.73, 0.78, 0.79, 0.73]
            - confusion matrix: \[\[397 152]
                                \[ 71 271]]
            - precision score: 0.64
            - recall score: 0.79
            - f1 score: 0.71
            - kaggle submission score(accuracy): 0.76
        - elastic regularization:
            - accuracy(5-fold validation): \[0.75, 0.73, 0.78, 0.79, 0.73]
            - confusion matrix: \[\[486  63]
                                \[125 217]]
            - precision score: 0.78
            - recall score: 0.63
            - f1 score: 0.70
            - kaggle submission score(accuracy): 0.76

- (model) Logistic Regression
    - Review: Regularization did not improve performance. The model showed slight overfitting.
    - score example:
        - non-regularization:
            - accuracy(5-fold validation): \[0.79, 0.75, 0.80, 0.81, 0.81]
            - confusion matrix: \[\[471  78]
                                \[101 241]]
            - precision score: 0.76
            - recall score: 0.70
            - f1 score: 0.73
            - kaggle submission score(accuracy): 0.76
        - ridge regularization(default):
            - accuracy(5-fold validation): \[0.79, 0.75, 0.80, 0.82, 0.81]
            - confusion matrix: \[\[474 75]
                                \[101 241]]
            - precision score: 0.76
            - recall score: 0.70
            - f1 score: 0.73
            - kaggle submission score(accuracy): 0.76

- (model) Support Vector Classifier
    - Review: This model performed better than SGDClassifier and LogisticRegression, but there was overfitting.
    - score example:
        - default parameters (`SVC()`, which uses C=1.0)
            - accuracy(5-fold validation): \[0.82, 0.81, 0.87, 0.80, 0.82]
            - confusion matrix: \[\[499  50]
                                \[ 99 243]]
            - precision score: 0.83
            - recall score: 0.71
            - f1 score: 0.77
            - kaggle submission score(accuracy): 0.78

- (model) Random Forest Classifier
    - Review: The first ensemble model. It performed better than the single models, but there was slight overfitting.
    - score example:
        - accuracy(5-fold validation): \[0.78, 0.79, 0.90, 0.80, 0.80]
        - confusion matrix: \[\[527  22]
                            \[ 26 316]]
        - precision score: 0.93
        - recall score: 0.92
        - f1 score: 0.93
        - kaggle submission score(accuracy): 0.77

- (model) Voting Classifier
    - Review: An ensemble model whose estimators are the models used above. It also performed better than the single models, and there was no critical overfitting.
    - score example:
        - accuracy(5-fold validation): \[0.80, 0.77, 0.83, 0.81, 0.83]
        - confusion matrix: \[\[501  48]
                            \[112 230]]
        - precision score: 0.83
        - recall score: 0.67
        - f1 score: 0.74
        - kaggle submission score(accuracy): 0.78

- (model) Bagging Classifier: not used, because it is known to perform poorly on small datasets.

- (model) AdaBoost Classifier
    - Review: An ensemble model based on Decision Trees. It performed worse than the single models, and there was overfitting.
    - score example:
        - accuracy(5-fold validation): \[0.85, 0.81, 0.78, 0.81, 0.77]
        - confusion matrix: \[\[478  71]
                            \[ 79 263]]
        - precision score: 0.79
        - recall score: 0.77
        - f1 score: 0.78
        - kaggle submission score(accuracy): 0.75
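
The precision, recall and F1 values listed above follow directly from each confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. As a worked check, the sketch below recomputes them for the non-regularized SGD Classifier matrix; `scores_from_confusion` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np


def scores_from_confusion(matrix):
    """Return (precision, recall, f1) for a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = np.asarray(matrix)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Non-regularized SGD Classifier matrix from the list above.
precision, recall, f1 = scores_from_confusion([[429, 120], [78, 264]])
print(f"precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}")
```

Here precision is 264 / (264 + 120) and recall is 264 / (264 + 78), which round to the listed 0.69 and 0.77.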


In [5]:
import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score

In [6]:
X_data, y_data = load_preprocessed_data('train')
test_data = load_preprocessed_data('test')
submit_data = pd.read_csv('data/gender_submission.csv')

NameError: name 'Pipeline' is not defined

In [7]:
# penalty=None disables regularization; for LogisticRegression this spelling
# needs scikit-learn >= 1.2 (older releases use penalty='none')
sgd_clf = SGDClassifier(penalty=None)
lasso_sgd = SGDClassifier(penalty='l1')
ridge_sgd = SGDClassifier(penalty='l2')
elastic_sgd = SGDClassifier(penalty='elasticnet')
log_clf = LogisticRegression(penalty=None)
ridge_log = LogisticRegression()  # ridge (L2) regularization is the default
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

# The string keys double as estimator names for the VotingClassifier below
models = {
    '0': sgd_clf,
    '1': lasso_sgd,
    '2': ridge_sgd,
    '3': elastic_sgd,
    '4': log_clf,
    '5': ridge_log,
    '6': rnd_clf,
    '7': svm_clf
    }

voting_clf = VotingClassifier(
    estimators=[(key, model) for key, model in models.items()],
    voting='hard'
)
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm='SAMME.R', learning_rate=0.5
)

models.update({
    '8': voting_clf,
    '9': ada_clf
    })

In [9]:
skfolds = StratifiedKFold(n_splits=5, shuffle=False)

for clf in models.values():
    scores = []
    for train_index, test_index in skfolds.split(X_data, y_data):
        X_train_folds = X_data[train_index]
        y_train_folds = y_data[train_index]
        X_test_fold = X_data[test_index]
        y_test_fold = y_data[test_index]

        clf.fit(X_train_folds, y_train_folds)
        y_pred = clf.predict(X_test_fold)
        # fold accuracy: share of correct predictions on the held-out fold
        score = accuracy_score(y_test_fold, y_pred)
        scores.append(score)

    print('clf:', clf, ', scores:', scores)

    # clf is the model fitted on the last fold split, so the metrics below
    # are computed mostly on rows it was trained on
    y_pred = clf.predict(X_data)
    print('confusion matrix:\n', confusion_matrix(y_data, y_pred))
    print('precision_score:', precision_score(y_data, y_pred),
          ', recall score:', recall_score(y_data, y_pred),
          ', f1 score:', f1_score(y_data, y_pred), '\n')

    prediction = clf.predict(test_data)

    submission = pd.DataFrame({
        'PassengerId': submit_data['PassengerId'],
        'Survived': prediction
    })

    # Short names for the two estimators whose repr is long
    if clf is voting_clf:
        name = 'VotingClassifier'
    elif clf is ada_clf:
        name = 'AdaBoostClassifier'
    else:
        name = clf  # the estimator repr becomes part of the file name

    submission.to_csv(f'data/{name}_submission.csv', index=False)

NameError: name 'X_data' is not defined
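
The manual fold loop above can also be expressed with scikit-learn's helpers. The sketch below uses `cross_val_score` for per-fold accuracy and `cross_val_predict` for out-of-fold predictions, so the confusion matrix is computed only on rows the model did not see during fitting. It runs on synthetic data from `make_classification` as a stand-in for the prepared Titanic arrays (an assumption made to keep it self-contained), so the numbers it prints are not comparable to the scores reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

# Synthetic stand-in for X_data / y_data (assumption: 6 features, binary label).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

skfolds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression()

# One accuracy value per fold.
fold_scores = cross_val_score(clf, X_demo, y_demo, cv=skfolds, scoring="accuracy")

# Each row is predicted by a model that was not trained on it.
oof_pred = cross_val_predict(clf, X_demo, y_demo, cv=skfolds)
cm = confusion_matrix(y_demo, oof_pred)

print("fold accuracy:", fold_scores.round(2))
print("confusion matrix:\n", cm)
print("f1 score:", round(f1_score(y_demo, oof_pred), 2))
```

Comparing out-of-fold metrics with the Kaggle score gives a cleaner view of overfitting than metrics computed on rows the model has already seen.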

In [15]:
!kaggle competitions submit -c titanic -f data\gender_submission.csv -m "test"

Successfully submitted to Titanic - Machine Learning from Disaster

  0%|          | 0.00/3.18k [00:00<?, ?B/s]
100%|██████████| 3.18k/3.18k [00:00<00:00, 24.1kB/s]
100%|██████████| 3.18k/3.18k [00:04<00:00, 717B/s]  


### Conclusion

Ensemble models usually performed better than single models. Among the single models, however, the Support Vector Classifier performed about as well as the ensembles. Regularized models also generally scored better on the Kaggle submission, which predicts the unlabeled 'test.csv' data, because regularization helps prevent overfitting.