<a href="https://www.kaggle.com/code/kelixirr/titanic-survivors-end-to-end-project?scriptVersionId=135780648" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Titanic Survivors Prediction Using Machine Learning
Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. Your job is to predict whether or not a passenger survived the sinking of the Titanic. For each passenger in the test set, you must predict a 0 or 1 value for the Survived variable.

0 = No, 1 = Yes

### Metric
Your score is the percentage of passengers you correctly predict. This is known as accuracy.
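
As a quick illustration of the metric, accuracy is simply the share of predictions that match the true labels. The arrays below are made-up values, not competition data.

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

# Accuracy = number of correct predictions / total number of predictions
accuracy = (y_true == y_hat).mean()
print(f"Accuracy: {accuracy:.2f}")
```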

### Data 
*pclass*: A proxy for socio-economic status (SES)

1st = Upper
2nd = Middle
3rd = Lower

*age*: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

*sibsp*: The dataset defines family relations in this way...

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

*parch*: The dataset defines family relations in this way...

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

*embarked*: Port of embarkation. C = Cherbourg, Q = Queenstown, S = Southampton


In [None]:
# Importing the important libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

We already have the test set and training set so we don't need to create them. 

Let's analyse the training set and see what we can do. 

In [None]:
df = pd.read_csv("/kaggle/input/titanic/train.csv")

In [None]:
df.head()

In [None]:
# let's check for dtypes
df.dtypes

In [None]:
# let's check for shape
df.shape

In [None]:
# let's check for missing values
df.isna().sum()

We seem to have several missing values in our data. 
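
Raw counts are easier to judge when expressed as a share of the rows. The sketch below shows the idea on a small made-up frame; the column names mirror the dataset, but the values are placeholders.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isna() gives booleans, so the column mean is the fraction of missing values
missing_pct = toy.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

On the notebook's data, the same expression would be `df.isna().mean() * 100`.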

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
# checking for duplicate values 
df.duplicated().sum()

We don't have duplicate values

In [None]:
df.tail()

We can remove the columns "Ticket" and "Cabin" because they do not seem to have any relationship with the accident unless other features are involved.

In [None]:
new_df = df.drop(["Ticket", "Cabin"], axis=1)

In [None]:
new_df.head()

## Exploratory Data Analysis (EDA)

In [None]:
# let's check the data distribution 
new_df.hist(bins=50, figsize=(20,10))
plt.show()

In [None]:
# let's check for the correlation
corr = new_df.corrwith(new_df["Survived"], numeric_only=True)

In [None]:
corr.sort_values(ascending=False)

As we can see, Fare has the strongest positive correlation with survival among the numeric features. Pclass may not give us a clear idea unless we compare the individual classes. Let's plot the graph and see.

In [None]:
new_df.columns

In [None]:
pd.crosstab(new_df.Pclass, new_df.Survived).plot(kind="bar", figsize=(10,5),  color=["red", "blue"])

plt.title("Relationship between Pclass And Survivors")
plt.ylabel("Numbers")
plt.xlabel("Pclass 1, 2, 3")
plt.xticks(rotation=0);

In [None]:
pd.crosstab(new_df.Pclass, new_df.Survived)

As we can see, people from 3rd class did not survive as often as people from 1st class. Notice that we also have a higher number of individuals in 3rd class.
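
Because the classes differ in size, survival rates are easier to compare than raw counts. Here is a minimal sketch on made-up rows, using the `normalize` parameter of `pd.crosstab`.

```python
import pandas as pd

# Toy data (made up) with more passengers in 3rd class
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# normalize="index" divides each row by its total, giving per-class rates
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rates)
```

On the notebook's data, the equivalent call would be `pd.crosstab(new_df.Pclass, new_df.Survived, normalize="index")`.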

In [None]:
pd.crosstab(new_df.Parch, new_df.Survived)

In [None]:
pd.crosstab(new_df.SibSp, new_df.Survived)

In [None]:
pd.crosstab(new_df.Sex, new_df.Survived)

As we can see, women survived more often than men.

Let's create a new feature from the Name feature, as the titles in these names could indicate important individuals and their survival may show some correlation with the titles.
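
To see what the extraction step does, here is a small sketch on made-up names written in the dataset's "Surname, Title. Given names" format. The regular expression captures the word that sits between a space and a period.

```python
import pandas as pd

# Hypothetical names in the same format as the Name column
names = pd.Series([
    "Smith, Mr. John",
    "Jones, Miss. Anna",
    "Brown, Dr. Alex",
])

# Capture the run of letters that follows a space and precedes a period
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```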

In [None]:
new_df.head()

In [None]:
def title_feature(df):
    """Add an ordinal "Title" column extracted from "Name" (modifies df in place)."""

    # Creating new feature: the word that precedes a period in the name
    df["Title"] = df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

    # Replacing the rare titles with "Rare" so that they match map_dict below
    df["Title"] = df["Title"].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], "Rare")

    # Replacing title variants with the common ones
    df["Title"] = df["Title"].replace("Mlle", "Miss")
    df["Title"] = df["Title"].replace("Ms", "Miss")
    df["Title"] = df["Title"].replace("Mme", "Mrs")

    # Converting to ordinal form; any title not in map_dict becomes 0
    map_dict = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    df['Title'] = df['Title'].map(map_dict)
    df['Title'] = df['Title'].fillna(0)


title_feature(new_df)

Based on the names shown here, though, we don't see an obvious relationship between title and survival; no title bias stands out.
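
One way to check this more directly is to compare survival rates per title group. The sketch below uses made-up rows; the integer codes follow the mapping in `title_feature` (1 = Mr, 2 = Miss, 3 = Mrs, 4 = Master).

```python
import pandas as pd

# Toy data (made up); Title uses the same integer codes as title_feature
toy = pd.DataFrame({
    "Title":    [1, 1, 1, 2, 2, 3, 3, 4],
    "Survived": [0, 0, 1, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 column is the survival rate within each title group
title_rates = toy.groupby("Title")["Survived"].agg(["mean", "count"])
print(title_rates)
```

On the notebook's data, this would be `new_df.groupby("Title")["Survived"].agg(["mean", "count"])`.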

In [None]:
# Removing other unnecessary features
train_df = new_df.drop(["Name", "PassengerId"], axis=1)
train_df.head()

In [None]:
train_df.shape

In [None]:
train_df.isna().sum()

In [None]:
train_df.dtypes

In [None]:
# converting the Sex feature to integers with LabelEncoder (label encoding, not one-hot)
from sklearn.preprocessing import LabelEncoder
train_df["Sex"] = LabelEncoder().fit_transform(train_df["Sex"])

In [None]:
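# let's convert the other categorical variable into numbers as well
# Embarked has a few missing values; filling them with the most frequent port
# first is an assumption made here so that LabelEncoder never sees NaN
train_df["Embarked"] = train_df["Embarked"].fillna(train_df["Embarked"].mode()[0])
train_df["Embarked"] = LabelEncoder().fit_transform(train_df["Embarked"])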


In [None]:
train_df.isna().sum()

In [None]:
# We can now fill the missing values in Age 
train_df.fillna({"Age": df["Age"].median()}, inplace=True)

In [None]:
train_df.isna().sum()

In [None]:
train_df.shape

We have now removed our missing values, so we can start creating our models. Before we do that, let's create a training set and a validation set so that we can evaluate our models; once we are done, we can test the chosen model on the test data. This is done to prevent data snooping.
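
To get a feel for the split sizes, here is a minimal sketch on a placeholder array with 891 rows, the number of rows in the Kaggle training file. With `test_size=0.05`, only a few dozen rows end up in the validation set, so validation scores can shift noticeably from one split to another.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels with 891 rows (values are arbitrary)
X_demo = np.arange(891).reshape(-1, 1)
y_demo = np.arange(891) % 2

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.05, random_state=42
)

# The two sets are disjoint and together cover every row
print(X_tr.shape, X_va.shape)
```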

In [None]:
X = train_df.drop("Survived", axis=1)

In [None]:
y = train_df["Survived"]

In [None]:
# creating training and validation set 
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.05, random_state = 42)

In [None]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape

## Training Our Models

Please note that this is a classification problem, not a regression problem: the target only takes the values 0 and 1. One could frame it as predicting a continuous score, but here we will be using classification models.

### Logistic Regression

In [None]:
# Logistic Regression 
from sklearn.linear_model import LogisticRegression
model_1 = LogisticRegression(max_iter=200)
model_1.fit(X_train, y_train)

In [None]:
y_pred = model_1.predict(X_val)
y_pred

In [None]:
y_predict_prob = model_1.predict_proba(X_val)
y_predict_prob

In [None]:
score_1 = model_1.score(X_val, y_val)
round(score_1*100, 2)

##### Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
log_reg = LogisticRegression(max_iter=200, random_state = 42)
scores = cross_val_score(log_reg, X, y, cv=5) 
scores.mean()

### Support Vector Machines

In [None]:
from sklearn.svm import SVC

model_2 = SVC().fit(X_train, y_train)

In [None]:
y_pred_2 = model_2.predict(X_val)
y_pred_2

In [None]:
score_2 = model_2.score(X_val, y_val)
score_2

##### Cross Val Score

In [None]:
SVM = SVC(random_state=42)
scores = cross_val_score(SVM, X, y, cv=5)
scores.mean()

### SGD

In [None]:
from sklearn.linear_model import SGDClassifier
model_3 = SGDClassifier().fit(X_train, y_train)
y_pred_3 = model_3.predict(X_val)
y_pred_3

In [None]:
score_3 = model_3.score(X_val, y_val)
round(score_3*100, 2)

You can also try different loss functions here.

In [None]:
model_3_1 = SGDClassifier(loss = "log_loss").fit(X_train, y_train)
y_pred_3_1 = model_3_1.predict(X_val)
score_3_1 = model_3_1.score(X_val, y_val)
score_3_1

##### Cross Val Score

In [None]:
sgd = SGDClassifier(random_state = 42)
scores = cross_val_score(sgd, X, y, cv=5)
scores.mean()

### Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
model_4 = DecisionTreeClassifier().fit(X_train, y_train)
y_pred_4 = model_4.predict(X_val)
y_pred_4

In [None]:
score_4 = model_4.score(X_val, y_val)
round(score_4 *100, 2)

##### Cross Val Score

In [None]:
dt = DecisionTreeClassifier(random_state = 42)
scores = cross_val_score(dt, X, y, cv = 5)
scores.mean()

### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

model_5 = RandomForestClassifier().fit(X_train, y_train)
y_pred_5 = model_5.predict(X_val)

In [None]:
y_pred_5

In [None]:
score_5 = model_5.score(X_val, y_val)
score_5

##### Cross Validation

In [None]:
rfc = RandomForestClassifier()
scores = cross_val_score(rfc, X, y, cv=5)
scores.mean()

### K-Nearest Neighbors 

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model_6 = KNeighborsClassifier().fit(X_train, y_train)
y_pred_6 = model_6.predict(X_val)
y_pred_6

In [None]:
score_6 = model_6.score(X_val, y_val)
score_6

Checking with different parameters:

In [None]:
# try tuning the n_neighbors parameter to see if the performance improves

train_scores = []
val_scores = []
neighbors = range(1, 21)

knn = KNeighborsClassifier()

for i in neighbors:
    knn.set_params(n_neighbors = i)
    knn.fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    val_scores.append(knn.score(X_val, y_val))
    


In [None]:
train_scores

In [None]:
val_scores

In [None]:
# let's create a graph to see how our model performed on train and val data

plt.plot(neighbors, train_scores, label ="Train Score")
plt.plot(neighbors, val_scores, label = "Val Score")
plt.xticks(np.arange(1, 21, 1))
plt.xlabel("Number of Neighbors")
plt.ylabel("Model Score")
plt.legend()

print(f"Max score on validation set is: {max(val_scores) * 100:.2f}%")

##### Cross Val Score

In [None]:
knc = KNeighborsClassifier()
scores = cross_val_score(knc, X, y, cv = 5)
scores.mean()

As we can see, the maximum accuracy is 80% without cross-validation and 71% with cross-validation, which is fine but not as good as our Random Forest. So, we will discard this model.

Let's select the Random Forest Classifier and tune its hyperparameters to see what we can achieve.

## Hyperparameter Tuning 

In [None]:
from sklearn.model_selection import GridSearchCV 
from sklearn.metrics import accuracy_score

param_grid = {
    
    "n_estimators": np.arange(10, 100, 50),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2)
}

clf = RandomForestClassifier()

grid_search = GridSearchCV(clf, param_grid, cv = 5, verbose = 2)
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_params_

In [None]:
# score of the best estimator on the validation set
grid_search.best_estimator_.score(X_val, y_val)

In [None]:
# let's try Randomized Search CV
from sklearn.model_selection import RandomizedSearchCV

rs_rfc = RandomizedSearchCV(RandomForestClassifier(),
                           param_distributions = param_grid,
                           cv=5,
                           n_iter = 20,
                           verbose = 2) 
rs_rfc.fit(X_train, y_train)

In [None]:
rs_rfc.best_params_

In [None]:
rs_rfc.score(X_val, y_val)  

Well, it seems we have improved a lot. We can select this model and now evaluate it on the actual data, that is, our test data, but first we need to make the test data similar to the training data.
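
One detail to watch when preparing the test data: a `LabelEncoder` that is refit on the test set only produces the same integer codes if both sets contain exactly the same categories. A more defensive option, sketched here on made-up values, is to fit the encoder once on the training column and reuse it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up stand-ins for the training and test columns
train_col = pd.Series(["S", "C", "Q", "S"])
test_col = pd.Series(["S", "S", "C"])  # "Q" does not appear here

# Fit once on the training data, then reuse the same mapping
encoder = LabelEncoder().fit(train_col)
train_codes = encoder.transform(train_col)
test_codes = encoder.transform(test_col)

# Refitting on the test column alone assigns different codes
refit_codes = LabelEncoder().fit_transform(test_col)

print(list(test_codes), list(refit_codes))
```

In this toy case "S" is encoded as 2 by the shared encoder but as 1 by the refit one, which would silently change the meaning of the feature.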

In [None]:
X_train

In [None]:
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

In [None]:
# we can get rid of unnecessary columns

# function to drop columns
def remove_col(data):
    
    data.drop(["PassengerId", "Ticket", "Cabin", "Name"], axis = 1, inplace = True)
    

# function to modify our data for testing or evaluation
def modify_data(data):
    
    title_feature(data)  # we created this function earlier
    remove_col(data)     # once we have title we can remove cols 
    
    # Label Encoding the categorical data 
    data["Sex"] = LabelEncoder().fit_transform(data["Sex"])        
    data["Embarked"] = LabelEncoder().fit_transform(data["Embarked"])
    
    return data 

In [None]:
test_df = test_data.copy()

In [None]:
test_df = modify_data(test_df)
test_df.head(5)

In [None]:
test_df.isna().sum()

In [None]:
test_df.fillna({  
    "Age": test_df["Age"].median(),
    "Fare": test_df["Fare"].median()
}, inplace = True)

In [None]:
test_df.isna().sum()

## Evaluating Our Best Model On Test Data

In [None]:
best_model = rs_rfc.best_estimator_
y_pred = best_model.predict(test_df)
y_pred

In [None]:
survivors_report = pd.DataFrame({
    
    "PassengerId": test_data["PassengerId"],
    "Survived": y_pred
})

survivors_report.head()

In [None]:
# exporting our data 
survivors_report.to_csv("submission.csv", index = False)