<a href="https://www.kaggle.com/code/pachecopacheco4/survival-titanic-2-0?scriptVersionId=148633423" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Trying to use more information 

On this occasion we try to extract more information by analyzing the 'Ticket', 'Name' and 'Cabin' variables. We omit some steps already covered in Survival Titanic (the previous notebook).

In [None]:
import pandas as pd
from sklearn.metrics import classification_report
import seaborn as sns
titanic = pd.read_csv("/kaggle/input/titanic/train.csv")
titanic.head()

In [None]:
titanic['Survived'].value_counts()

We are going to select features that help predict 'Survived'.
We suspect that some families had a higher survival rate than others. For this reason we create a 'Family' variable that stores the last name of each passenger.

In [None]:
titanic2=titanic.copy()
titanic2['Family'] = [c.split(',')[0] for c in titanic2.Name]
titanic2.head()
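As a side note, the same surname extraction can be written with pandas string methods. This is only a sketch on a few names in the "Surname, Title. Given names" format of the 'Name' column; the variable names `names` and `family` are illustrative and not part of the notebook.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" format of the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Take everything before the first comma and strip stray whitespace
family = names.str.split(",", n=1).str[0].str.strip()
print(family.tolist())
```

Note that a shared surname does not guarantee that two passengers belong to the same family, so this feature is only a rough proxy.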

In [None]:
a=titanic.pivot_table(index='Cabin', columns='Survived', aggfunc='size', fill_value=0)
a.columns=['NotSurvived','Survived']
a2=a[a['Survived']>0].copy()
a2['SurvivedPercentage'] = (a2['Survived'] / a2.sum(axis=1)) * 100
a2

Although 'Cabin' probably carries some information, we are going to omit this variable again.

In [None]:
b=titanic2.pivot_table(index='Family', columns='Survived', aggfunc='size', fill_value=0)
b.columns=['NotSurvived','Survived']
b2=b[b.Survived>2].copy()
b2['SurvivedPercentage'] = (b2['Survived'] / b2.sum(axis=1)) * 100
print(b2)
mostSurvivedFamilies=b2.reset_index()['Family']

Here we exclude the Brown and Johnson families.

In [None]:
excluded_families = ['Brown','Johnson']

mostSurvivedFamilies = set(b2.index).difference(excluded_families)

mostSurvivedFamilies
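The per-group counts and percentages built with `pivot_table` above can also be obtained with `groupby`. The sketch below uses made-up toy data (not the competition files), and the threshold of 2 survivors is chosen only to fit this tiny example.

```python
import pandas as pd

# Toy data, not taken from the competition files
df = pd.DataFrame({
    "Family":   ["A", "A", "A", "B", "B", "C"],
    "Survived": [1,   1,   0,   0,   0,   1],
})

# Survivors and group size per family
rates = df.groupby("Family")["Survived"].agg(Survived="sum", Total="count")
rates["NotSurvived"] = rates["Total"] - rates["Survived"]
rates["SurvivedPercentage"] = 100 * rates["Survived"] / rates["Total"]

# Same kind of filter as in the notebook: keep groups with enough survivors
selected = rates[rates["Survived"] >= 2]
print(selected)
```

Filtering on the raw number of survivors favors large groups, which is why looking at the percentage column alongside it is useful.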

In [None]:
c=titanic.pivot_table(index='Ticket', columns='Survived', aggfunc='size', fill_value=0)
c.columns=['NotSurvived','Survived']
c2=c[c.Survived>=3].copy()
c2['SurvivedPercentage'] = (c2['Survived'] / c2.sum(axis=1)) * 100
print(c2)
mostSurvivedTickets=c2.index

Here we omit ticket 1601 (we would like to be strict when classifying).

In [None]:
excluded_tickets = ['1601']

mostSurvivedTickets = set(c2.index).difference(excluded_tickets)

mostSurvivedTickets

In [None]:
titanic2['NewFamily'] = titanic2['Family'].apply(lambda x: x if x in mostSurvivedFamilies else 'otherFamily')
titanic2['NewFamily'].unique()

In [None]:
titanic2['NewTicket'] = titanic2['Ticket'].apply(lambda x: x if x in mostSurvivedTickets else 'otherTicket')
titanic2['NewTicket'].unique()
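The two cells above collapse every level outside a chosen set into a single "other" level. A vectorized way to express the same idea is `Series.where` combined with `isin`. This is a sketch on toy values; the set `keep` is hypothetical and stands in for `mostSurvivedTickets`.

```python
import pandas as pd

tickets = pd.Series(["347077", "1601", "A/5 21171", "347077"])
keep = {"347077"}  # hypothetical set of levels to keep

# Keep the value where it is in the set, otherwise replace it with a common label
new_ticket = tickets.where(tickets.isin(keep), "otherTicket")
print(new_ticket.tolist())
```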

In [None]:
titanic2.head()

In [None]:
y=titanic.Survived
X=titanic2[['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','NewFamily','NewTicket']]
X.head()

### Imputation and one hot encoding 

In [None]:
X.isnull().sum()

**Numerical imputation**

In [None]:
from sklearn.impute import SimpleImputer
X2 = X.copy()
my_imputer = SimpleImputer()
numerical_cols2 = [cname for cname in X2.columns if X2[cname].dtype in ['int64', 'float64']]
print(numerical_cols2)

# Assign the imputed array back; X2 keeps its own column names and index
X2[numerical_cols2] = my_imputer.fit_transform(X2[numerical_cols2])

print(X2[numerical_cols2].isnull().sum())


**Categorical imputation**

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
categorical_cols2 = [cname for cname in X2.columns if X2[cname].dtype == "object"]

In [None]:
my_imputer = SimpleImputer(strategy='most_frequent')
# Assign the imputed array back; X2 keeps its own column names and index
X2[categorical_cols2] = my_imputer.fit_transform(X2[categorical_cols2])
X2.head()

**One hot encoding**

In [None]:
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

OH_cols = pd.DataFrame(OH_encoder.fit_transform(X2[categorical_cols2]))

# The encoder output has a fresh index, so restore the one from X2
OH_cols.index = X2.index

X2.drop(categorical_cols2, axis=1, inplace=True)

X2 = pd.concat([X2, OH_cols], axis=1)

X2.columns = X2.columns.astype(str)

In [None]:
X2.head()

#### Train and validation sets

In [None]:
from sklearn.model_selection import train_test_split
train_X2, val_X2, train_y2, val_y2 = train_test_split(X2, y, random_state = 35643419)

**XGBOOST**

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

#xgbm = XGBClassifier(n_estimators=1000,max_leaves=50,random_state=35643419,early_stopping_rounds=5,learning_rate=0.05)
xgbm = XGBClassifier(n_estimators=150,max_leaves=50,random_state=35643419,learning_rate=0.05)

#xgbm.fit(train_X2, train_y2, eval_set=[(val_X2, val_y2)],verbose=False)
xgbm.fit(train_X2, train_y2)

preds_xgb2 = xgbm.predict(val_X2)

**Learning rate** (learning_rate | validation accuracy)
- With early_stopping_rounds
1. default | 0.8340807174887892
2. 0.05 | 0.8430493273542601

- Without early_stopping_rounds
1. 0.05 | 0.8475336322869955

**max_leaves**

**lambda**

**gamma**
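To put the accuracy values above in perspective, they can be converted back into counts of correctly classified passengers. The sketch assumes the standard 891-row train.csv and the default 25% validation split of `train_test_split`, which gives 223 validation rows; `n_val` and `scores` are names introduced here for illustration.

```python
n_val = 223  # assumption: 891 training rows with the default 25% validation split

scores = {
    "early stopping, default learning rate": 0.8340807174887892,
    "early stopping, learning_rate=0.05": 0.8430493273542601,
    "no early stopping, learning_rate=0.05": 0.8475336322869955,
}

correct_counts = {}
for label, score in scores.items():
    # Accuracy is correct / n_val, so multiplying back recovers the count
    correct_counts[label] = round(score * n_val)
    print(f"{label}: {correct_counts[label]}/{n_val} correct")
```

Under that assumption the three settings differ by only one to three passengers, so small differences in accuracy on this validation set should be read with caution.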


In [None]:
score = accuracy_score(val_y2, preds_xgb2)
print('Score:', score)
print(classification_report(val_y2, preds_xgb2))

In [None]:
def get_accuracy(max_leaves, train_X, val_X, train_y, val_y):
    xgbm = XGBClassifier(n_estimators=1000,max_leaves=max_leaves,random_state=35643419,early_stopping_rounds=5)
    xgbm.fit(train_X, train_y, eval_set=[(val_X, val_y)],verbose=False)
    preds_xgb = xgbm.predict(val_X)
    accuracy = accuracy_score(val_y, preds_xgb)
    return(accuracy)

for max_leaves in [5, 50, 500, 5000]:
    print('max_leaves:',max_leaves,' accuracy:',get_accuracy(max_leaves, train_X2, val_X2, train_y2, val_y2))

**DECISION TREE**

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtm = DecisionTreeClassifier(max_leaf_nodes=50,min_samples_split=4,random_state=35643419)

from sklearn.metrics import accuracy_score

dtm.fit(train_X2, train_y2)

# Get predictions on the (already preprocessed) validation data
preds = dtm.predict(val_X2)

**max_depth**
1. 0.8385650224215246

**min_samples_split**
1. 5 | 0.8385650224215246
2. 4 | 0.8430493273542601 

**min_samples_leaf** 

In [None]:
score = accuracy_score(val_y2, preds)
print('Score:', score)
print(classification_report(val_y2, preds))

In [None]:
def get_accuracy(max_leaf_nodes, train_X, val_X, train_y, val_y):
    dtm = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=35643419)
    dtm.fit(train_X, train_y)
    preds = dtm.predict(val_X)
    accuracy = accuracy_score(val_y, preds)
    return(accuracy)

for max_leaf_nodes in [5, 50, 500, 5000]:
    print('max_leaf_nodes:',max_leaf_nodes,' accuracy:',get_accuracy(max_leaf_nodes, train_X2, val_X2, train_y2, val_y2))

**RANDOM FORESTS**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rfm = RandomForestClassifier(n_estimators=400,max_leaf_nodes=50,max_depth=10,min_samples_split=3,random_state=35643419)

rfm.fit(train_X2, train_y2)

preds_rf = rfm.predict(val_X2)

In [None]:
score = accuracy_score(val_y2, preds_rf)
print('Score:', score)
print(classification_report(val_y2, preds_rf))

**max_depth**
1. default | 0.8340807174887892
2. 5 | 0.8385650224215246
3. 10 | 0.8475336322869955

**min_samples_split**
1. 3 | 0.852017937219731
2. 5 | 0.8340807174887892

**min_samples_leaf**

In [None]:
def get_accuracy(max_leaf_nodes, train_X, val_X, train_y, val_y):
    rfm = RandomForestClassifier(n_estimators=5,max_leaf_nodes=max_leaf_nodes,random_state=35643419)
    rfm.fit(train_X, train_y)
    preds_rf = rfm.predict(val_X)
    accuracy = accuracy_score(val_y, preds_rf)
    return(accuracy)

for max_leaf_nodes in [5, 50, 500, 5000]:
    print('max_leaf_nodes:',max_leaf_nodes,' accuracy:',get_accuracy(max_leaf_nodes, train_X2, val_X2, train_y2, val_y2))

**NEURAL NETWORK**

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set seeds for reproducibility
seed_value = 35643419
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

model = keras.Sequential([
    # The input shape is declared on the first layer of the model
    layers.BatchNormalization(input_shape=[train_X2.shape[1]]),
    layers.Dense(32, activation='relu'),
    #layers.Dropout(rate=0.1),
    layers.Dense(16, activation='relu'),
    #layers.Dropout(rate=0.1),
    layers.Dense(1, activation='sigmoid'),
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['binary_accuracy'],
)

early_stopping = keras.callbacks.EarlyStopping(
    patience=10,
    min_delta=0.001,
    restore_best_weights=True,
)

history = model.fit(
    train_X2, train_y2,
    validation_data=(val_X2, val_y2),
    batch_size=512,
    epochs=1000,
    callbacks=[early_stopping],
    verbose=0, # hide the output because we have so many epochs
)

history_df = pd.DataFrame(history.history)
# Start the plot at epoch 5
history_df.loc[5:, ['loss', 'val_loss']].plot()
history_df.loc[5:, ['binary_accuracy', 'val_binary_accuracy']].plot()

print(("Best Validation Loss: {:0.4f}" +\
      "\nBest Validation Accuracy: {:0.4f}")\
      .format(history_df['val_loss'].min(), 
              history_df['val_binary_accuracy'].max()))

In [None]:
predictions = model.predict(val_X2)

# If you want binary predictions (0 or 1), you can round the predictions
binary_predictions = np.round(predictions)
score = accuracy_score(val_y2, binary_predictions)
print(score)
print(classification_report(val_y2, binary_predictions))
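A small detail about the rounding step above: `np.round` rounds halves to the nearest even number, so a predicted probability of exactly 0.5 becomes 0. An explicit threshold makes the decision rule visible. The sketch below uses made-up probabilities with the `(n, 1)` shape that `model.predict` returns for a single sigmoid output.

```python
import numpy as np

# Made-up sigmoid outputs with shape (n, 1)
probs = np.array([[0.2], [0.5], [0.51], [0.97]])

# np.round uses "round half to even", so 0.5 becomes 0
rounded = np.round(probs).ravel().astype(int)

# Explicit threshold: 0.5 and above counts as class 1
thresholded = (probs.ravel() >= 0.5).astype(int)

print(rounded)
print(thresholded)
```

In practice an output of exactly 0.5 is rare, so both versions usually agree; the explicit threshold is mainly easier to adjust.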


**LOGISTIC REGRESSION**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(train_X2)
X_test_scaled = scaler.transform(val_X2)

logreg = LogisticRegression()
logreg.fit(X_train_scaled, train_y2)

y_pred = logreg.predict(X_test_scaled)

In [None]:
print(accuracy_score(val_y2, y_pred))
print(classification_report(val_y2, y_pred))

**TODO**: see what happens when the categorical variables are instead built from the levels with the fewest survivors.

# New submission to competition 

On this occasion we ran into a problem: the levels of the 'Ticket' and 'Family' variables differ between the test set and the train set. For this reason, these features may not bring a large improvement.

In [None]:
import pandas as pd
titanic_train = pd.read_csv('/kaggle/input/titanic/train.csv')
titanic_test = pd.read_csv('/kaggle/input/titanic/test.csv')
titanic_train['Family'] = [c.split(',')[0] for c in titanic_train.Name]
titanic_test['Family'] = [c.split(',')[0] for c in titanic_test.Name]

In [None]:
titanic_train['NewFamily'] = titanic_train['Family'].apply(lambda x: x if x in mostSurvivedFamilies else 'otherFamily')
titanic_test['NewFamily'] = titanic_test['Family'].apply(lambda x: x if x in mostSurvivedFamilies else 'otherFamily')

In [None]:
print(titanic_train['NewFamily'].unique())
print(titanic_test['NewFamily'].unique())
print(set(titanic_train['NewFamily'].unique())-set(titanic_test['NewFamily'].unique()))

We have to modify the titanic_train NewFamily feature so that it only uses levels that also appear in the test set.

In [None]:
# Families kept as their own level; everything else uses the same label as in the test set
mostSurvivedFamilies2 = ['Kelly','Asplund']
titanic_train['NewFamily'] = titanic_train['Family'].apply(lambda x: x if x in mostSurvivedFamilies2 else 'otherFamily')

In [None]:
titanic_train['NewTicket'] = titanic_train['Ticket'].apply(lambda x: x if x in mostSurvivedTickets else 'otherTicket')
titanic_test['NewTicket'] = titanic_test['Ticket'].apply(lambda x: x if x in mostSurvivedTickets else 'otherTicket')

In [None]:
print(titanic_train['NewTicket'].unique())
print(titanic_test['NewTicket'].unique())
print(set(titanic_train['NewTicket'].unique())-set(titanic_test['NewTicket'].unique()))

We have to modify the titanic_train NewTicket feature so that it only uses levels that also appear in the test set.

In [None]:
mostSurvivedTickets2=['347077','PC 17757','24160','PC 17755']
titanic_train['NewTicket'] = titanic_train['Ticket'].apply(lambda x: x if x in mostSurvivedTickets2 else 'otherTicket')
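The lists of shared levels above are written by hand. As a sketch of an alternative, the levels present in both sets can be computed with a set intersection. The frames `train` and `test` and the set `candidates` below are toy stand-ins, with `candidates` playing the role of `mostSurvivedTickets`.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (values are made up)
train = pd.DataFrame({"Ticket": ["T1", "T1", "T2", "T3"]})
test = pd.DataFrame({"Ticket": ["T1", "T4", "T4"]})
candidates = {"T1", "T2"}  # hypothetical set of high-survival tickets

# Keep only candidate levels that actually occur in the test set
shared = candidates & set(test["Ticket"])

# Apply the same mapping to both frames so their levels match
for frame in (train, test):
    frame["NewTicket"] = frame["Ticket"].where(frame["Ticket"].isin(shared), "otherTicket")

print(sorted(train["NewTicket"].unique()))
print(sorted(test["NewTicket"].unique()))
```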

In [None]:
y_train=titanic_train.Survived

X_train=titanic_train[['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','NewFamily','NewTicket']]
X_test=titanic_test[['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','NewFamily','NewTicket']]

X_train2=X_train.copy()
X_test2=X_test.copy()

In [None]:
my_imputer = SimpleImputer()
numerical_cols2 = [cname for cname in X_train2.columns if X_train2[cname].dtype in ['int64', 'float64']]

X_train2[numerical_cols2] = my_imputer.fit_transform(X_train2[numerical_cols2])

# Reuse the means learned on the training set instead of refitting on the test set
X_test2[numerical_cols2] = my_imputer.transform(X_test2[numerical_cols2])


In [None]:
categorical_cols2 = [cname for cname in X_train2.columns if X_train2[cname].dtype == "object"]
my_imputer = SimpleImputer(strategy='most_frequent')

X_train2[categorical_cols2] = my_imputer.fit_transform(X_train2[categorical_cols2])

# Reuse the most frequent levels learned on the training set
X_test2[categorical_cols2] = my_imputer.transform(X_test2[categorical_cols2])

In [None]:
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

OH_cols = pd.DataFrame(OH_encoder.fit_transform(X_train2[categorical_cols2]))

OH_cols.index = X_train2.index

X_train2.drop(categorical_cols2, axis=1, inplace=True)

X_train2 = pd.concat([X_train2, OH_cols], axis=1)

X_train2.columns = X_train2.columns.astype(str)

In [None]:
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

OH_cols = pd.DataFrame(OH_encoder.fit_transform(X_test2[categorical_cols2]))

OH_cols.index = X_test2.index

X_test2.drop(categorical_cols2, axis=1, inplace=True)

X_test2 = pd.concat([X_test2, OH_cols], axis=1)

X_test2.columns = X_test2.columns.astype(str)

In [None]:
print(X_train2.shape)
print(X_test2.shape)

In [None]:
dtm = DecisionTreeClassifier(max_leaf_nodes=50,random_state=35643419)

dtm.fit(X_train2, y_train)

preds = dtm.predict(X_test2)

In [None]:
output = pd.DataFrame({'PassengerId': titanic_test.PassengerId, 'Survived': preds})
output.to_csv('submission2.csv', index=False)

**COMMENT**: this improvement was obtained by tuning different parameters to get a better accuracy.

In [None]:
rfm = RandomForestClassifier(n_estimators=400,max_leaf_nodes=50,max_depth=10,min_samples_split=3,random_state=35643419)

rfm.fit(X_train2, y_train)

preds_rf = rfm.predict(X_test2)

In [None]:
output = pd.DataFrame({'PassengerId': titanic_test.PassengerId, 'Survived': preds_rf})
output.to_csv('submission3.csv', index=False)

**COMMENT**: from here on, the improvements were obtained using 5-fold cross-validation on the training set.
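The cross-validation code itself is not shown in this notebook. A minimal sketch of what a 5-fold comparison of two `n_estimators` values could look like is given below. It runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo` and `cv_means` as well as `random_state=0` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the preprocessed training features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

cv_means = {}
for n_estimators in [50, 150]:
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=10, random_state=0)
    # One accuracy value per fold, each computed on data the model did not see
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[n_estimators] = scores.mean()
    print(n_estimators, scores.mean())
```

Averaging over five folds gives a less noisy estimate than a single validation split, which helps when the candidate settings differ by only a few predictions.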

In [None]:
rfm = RandomForestClassifier(n_estimators=150,max_leaf_nodes=50,max_depth=10,min_samples_split=3,random_state=35643419)

rfm.fit(X_train2, y_train)

preds_rf2 = rfm.predict(X_test2)

In [None]:
output = pd.DataFrame({'PassengerId': titanic_test.PassengerId, 'Survived': preds_rf2})
output.to_csv('submission6.csv', index=False)