# Modeling for Titanic Dataset

This notebook builds and evaluates machine learning models for predicting survival on the Titanic dataset. We first compare a range of classification models with PyCaret, then train a Support Vector Classifier with scikit-learn and generate predictions on the test data for submission.



In [15]:
!pip install pycaret
!pip install catboost



In [16]:
import pandas as pd
import numpy as np

from pycaret.classification import setup, compare_models
from sklearn.svm import SVC

## Loading Data for Modeling
- Loaded the preprocessed training features (`train_x`), target variable (`train_y`), and test features (`test_x`) from CSV files, along with Kaggle's sample submission file `gender_submission.csv` (`submission_data`).
- Combined the training features and target variable into a single DataFrame (`combined_data`) so they can be explored, validated, and passed to PyCaret together.


In [17]:
train_x = pd.read_csv('train_x.csv')
train_y = pd.read_csv('train_y.csv')
test_x = pd.read_csv('test_x.csv')

submission_data = pd.read_csv('gender_submission.csv')

In [18]:
train_x.shape, train_y.shape, test_x.shape

((891, 9), (891, 1), (418, 9))

In [19]:
combined_data = pd.concat([train_x, train_y], axis=1)
combined_data.shape

(891, 10)
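
`pd.concat(..., axis=1)` aligns rows by index label rather than by position, so the combined shape is only meaningful if both frames share the same index. A minimal sketch of a sanity check, using small made-up frames in place of the real CSVs (the column names `Age` and `Fare` are illustrative assumptions):

```python
import pandas as pd

# Toy stand-ins for train_x and train_y; the real files are not read here.
features = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]})
target = pd.DataFrame({"Survived": [0, 1, 1]})

# concat(axis=1) joins on index labels, so verify they match first.
assert features.index.equals(target.index), "feature/target indices differ"

combined = pd.concat([features, target], axis=1)

# A correct join keeps the row count and introduces no missing values.
assert combined.shape == (len(features), features.shape[1] + 1)
assert not combined.isna().any().any()
print(combined.shape)
```

If the indices do differ (for example after filtering rows), calling `reset_index(drop=True)` on both frames before concatenating is one common way to restore positional alignment.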

### Model Comparison with PyCaret
- Set up the PyCaret classification environment using the training data (`train_data_with_target`) with the target variable `Survived`.
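- Enabled `normalize=True` so that all features are rescaled to a common scale, which mainly helps scale-sensitive models such as logistic regression and k-nearest neighbors; tree-based models are largely unaffected.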
- Used PyCaret's `compare_models()` function to quickly evaluate and compare multiple machine learning models based on their performance.
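
With `normalize=True` and no `normalize_method` given, PyCaret applies z-score scaling by default. A rough illustration of that transformation using scikit-learn's `StandardScaler` on made-up values (a sketch of the idea, not PyCaret's internal code):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with columns on very different scales (e.g. age, fare).
X = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After z-scoring, each column has mean ~0 and unit (population) standard deviation.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```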


In [20]:
train_data_with_target = combined_data
# Setup PyCaret classification
clf_setup = setup(
    data=train_data_with_target,
    target='Survived',
    normalize=True,
    session_id=42
)

|   | Description | Value |
|---|---|---|
| 0 | Session id | 42 |
| 1 | Target | Survived |
| 2 | Target type | Binary |
| 3 | Original data shape | (891, 10) |
| 4 | Transformed data shape | (891, 10) |
| 5 | Transformed train set shape | (623, 10) |
| 6 | Transformed test set shape | (268, 10) |
| 7 | Numeric features | 9 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |


In [21]:
compare_models()

|   | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| rf | Random Forest Classifier | 0.8379 | 0.8615 | 0.7453 | 0.8227 | 0.7786 | 0.6516 | 0.6569 | 0.2 |
| xgboost | Extreme Gradient Boosting | 0.8346 | 0.862 | 0.7453 | 0.815 | 0.7754 | 0.6453 | 0.6501 | 0.113 |
| dt | Decision Tree Classifier | 0.8267 | 0.8285 | 0.7286 | 0.8088 | 0.7634 | 0.6275 | 0.6328 | 0.058 |
| catboost | CatBoost Classifier | 0.8267 | 0.8725 | 0.7161 | 0.8227 | 0.7596 | 0.6256 | 0.6349 | 1.716 |
| gbc | Gradient Boosting Classifier | 0.825 | 0.8708 | 0.7036 | 0.8232 | 0.754 | 0.6201 | 0.6291 | 0.141 |
| et | Extra Trees Classifier | 0.825 | 0.8476 | 0.7245 | 0.8095 | 0.7608 | 0.6239 | 0.6299 | 0.235 |
| ada | Ada Boost Classifier | 0.8201 | 0.8663 | 0.7286 | 0.788 | 0.7549 | 0.6137 | 0.6169 | 0.119 |
| lightgbm | Light Gradient Boosting Machine | 0.8186 | 0.8661 | 0.7118 | 0.7975 | 0.7492 | 0.6083 | 0.6134 | 0.454 |
| lr | Logistic Regression | 0.8074 | 0.8618 | 0.7203 | 0.7691 | 0.7409 | 0.5883 | 0.592 | 0.686 |
| knn | K Neighbors Classifier | 0.7994 | 0.8451 | 0.6868 | 0.7787 | 0.7231 | 0.5676 | 0.5763 | 0.08 |
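
`compare_models()` cross-validates each candidate estimator and ranks the results by a chosen metric. The loop below is a hand-written sketch of that idea with plain scikit-learn on synthetic data from `make_classification`; the three candidate models and the 5-fold setting are arbitrary choices for illustration, not what PyCaret runs internally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data standing in for the Titanic features.
X, y = make_classification(n_samples=300, n_features=9, random_state=42)

candidates = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=42),
    "lr": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Mean cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Rank from best to worst, in the spirit of the leaderboard above.
leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, acc in leaderboard:
    print(f"{name}: {acc:.4f}")
```

Reusing one `StratifiedKFold` object for every model means all candidates are scored on identical splits, which keeps the comparison fair.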



### Support Vector Classifier (SVC) Model
- Trained a Support Vector Classifier (SVC) with a radial basis function (RBF) kernel on the training data.
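- Made predictions on the test set (`test_x`) and measured how often they match the `Survived` column of `submission_data`. Because `gender_submission.csv` is Kaggle's sample submission (a baseline that predicts survival from sex alone), this figure is agreement with that baseline, not true test accuracy.
- Prepared the results for export to a CSV file for submission; the export line is left commented out.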

In [22]:
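sv = SVC(kernel='rbf', shrinking=True, gamma='auto', C=1)
# Pass the target as a 1-D array; a single-column DataFrame triggers a DataConversionWarning.
sv.fit(train_x, train_y.values.ravel())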

In [23]:
output = submission_data.copy()

output['Survived'] = sv.predict(test_x)

# Agreement with Kaggle's sample submission (a sex-based baseline), not with true labels.
correct_predictions = (output['Survived'] == submission_data['Survived']).sum()

accuracy = correct_predictions / len(submission_data) * 100

print(f'Accuracy: {accuracy:.2f}%')

Accuracy: 95.93%
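
Because `gender_submission.csv` does not contain ground-truth labels, an estimate of real accuracy has to come from the labelled training data. One option, sketched here on synthetic data, is to cross-validate the SVC. Wrapping it in a pipeline with `StandardScaler` is an added assumption, motivated by the RBF kernel's sensitivity to feature scale; the notebook itself fits the SVC on the features as loaded.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for train_x / train_y.
X, y = make_classification(n_samples=300, n_features=9, random_state=42)

# Scaling lives inside the pipeline, so each fold is scaled using only its own training split.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="auto", C=1))

# 5-fold cross-validated accuracy (for classifiers, an integer cv uses stratified folds).
fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean CV accuracy: {fold_scores.mean():.4f} (+/- {fold_scores.std():.4f})")
```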


In [24]:
#output.to_csv('submission_.csv', index=False)
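
Kaggle's Titanic competition expects a two-column CSV with `PassengerId` and `Survived`. A minimal sketch of writing and re-checking such a file, using made-up IDs and predictions and a temporary path (`submission_example.csv` is a hypothetical name) rather than the notebook's real output:

```python
import os
import tempfile

import pandas as pd

# Made-up IDs and predictions standing in for submission_data and sv.predict(test_x).
submission = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})

# Write without the index so the file has exactly two columns.
path = os.path.join(tempfile.mkdtemp(), "submission_example.csv")
submission.to_csv(path, index=False)

# Read it back to confirm the layout before uploading.
check = pd.read_csv(path)
print(list(check.columns), len(check))
```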