In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('Titanic-Dataset.csv')

### These columns are unlikely to affect whether a person survived, so we drop them.

In [6]:
df.drop(columns=['PassengerId', 'Name', 'Cabin', 'Ticket'], inplace=True)

### Check for missing values

In [7]:
df.isna().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

We can fill the missing values in the `Age` column with the `mean`, and the missing values in the `Embarked` column with the `mode` (`'S'`).

In [9]:
# Assign back: inplace fillna on a selected column may not update df in newer pandas
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna('S')
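
The `'S'` above is hard-coded. A minimal sketch of deriving both fill values from the data instead, using a small made-up frame (the `toy` values are illustrative and not taken from the Titanic file):

```python
import pandas as pd

# Hypothetical toy data standing in for the Age and Embarked columns.
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0, 38.0],
    "Embarked": ["S", "C", None, "S"],
})

age_mean = toy["Age"].mean()               # mean of the non-missing ages
embarked_mode = toy["Embarked"].mode()[0]  # most frequent non-missing value

# Assign the filled column back rather than filling a selected column in place.
toy["Age"] = toy["Age"].fillna(age_mean)
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)

print(toy)
```

`mode()` returns a Series because several values can tie for most frequent; `[0]` picks the first of them.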

Are there any missing values left?

In [13]:
df.isna().any().any()

False

## Data preprocessing and feature engineering

In [16]:
df = pd.get_dummies(df, columns=['Embarked'])  # one-hot encoding

In [17]:
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})  # label encoding
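
To make the two encodings concrete, here is a small sketch on a made-up three-row frame (the `toy` data is hypothetical; only the column names mirror the dataset):

```python
import pandas as pd

# Hypothetical rows; only the column names match the Titanic data.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One-hot encoding: one indicator column per category of Embarked.
encoded = pd.get_dummies(toy, columns=["Embarked"])

# Label encoding of a two-valued column via an explicit mapping.
encoded["Sex"] = encoded["Sex"].map({"female": 0, "male": 1})

print(encoded.columns.tolist())
print(encoded)
```

Note that `map` yields `NaN` for any value missing from the dictionary, so it is worth re-checking for missing values after this step.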

Let's separate the features from the labels.

In [20]:
X = df.drop(columns=['Survived'])
y = df.Survived

Let's split the data into training and test sets.

In [18]:
from sklearn.model_selection import train_test_split

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
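
The `stratify=y` argument keeps the class proportions the same in both splits. A minimal sketch on made-up labels (the arrays `X_toy` and `y_toy` are hypothetical, and `random_state=0` is an addition for repeatability that the call above does not use):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 zeros and 40 ones.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Share of the positive class in each split.
print(y_tr.mean(), y_te.mean())
```

Both splits keep the original 40% share of the positive class. Without a fixed `random_state`, the rows that land in each split change from run to run, and so do the accuracy figures computed on them.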

Now let's build and train a model. Let's try logistic regression, since this is a binary classification problem: we are predicting one of two possible outcomes (survived or did not survive), much like classifying an email as spam or not spam.

In [139]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
l_model = LogisticRegression(max_iter=200, solver='liblinear', random_state=42)

In [140]:
l_model.fit(X_train, y_train)  # train the model

In [141]:
pred = l_model.predict(X_test)  # make predictions
accuracy_score(y_true=y_test, y_pred=pred)  # check accuracy of the predictions

0.776536312849162
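
`StandardScaler` is imported above but never used. One way it could be combined with the model is a pipeline, sketched here on synthetic data from `make_classification` (an assumption, standing in for the Titanic features). Whether scaling changes the accuracy on the real data is not something this sketch shows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and labels (not the Titanic data).
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scaled_model.fit(X_tr, y_tr)

acc = accuracy_score(y_te, scaled_model.predict(X_te))
print(acc)
```

Wrapping both steps in one pipeline avoids leaking test-set statistics into the scaler, which can happen if the whole dataset is scaled before splitting.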

An accuracy of about 78% is not bad for a learning project, but it is not ideal for real-world use. Let's see if we can do better with a decision tree.

In [142]:
from sklearn.tree import DecisionTreeClassifier
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)  # training
pred = tree_model.predict(X_test)  # testing/prediction
accuracy_score(y_true=y_test, y_pred=pred)

0.7653631284916201

At about 77%, this is close to the logistic regression result. Let's try a random forest.

In [157]:
from sklearn.ensemble import RandomForestClassifier
forest_model = RandomForestClassifier(random_state=42, max_depth=14)
forest_model.fit(X_train, y_train)
pred = forest_model.predict(X_test)
accuracy_score(y_true=y_test, y_pred=pred)

0.8435754189944135

An accuracy of about 84%. That's reasonably good for the task at hand.
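
Each accuracy above comes from a single train/test split made without a fixed `random_state`, so the numbers can shift between runs. A possible way to compare the three models more steadily is k-fold cross-validation, sketched here on synthetic data from `make_classification` (an assumption; the scores it prints say nothing about the Titanic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (not the Titanic data).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# Same model settings as in the cells above.
models = {
    "logistic": LogisticRegression(max_iter=200, solver="liblinear", random_state=42),
    "tree": DecisionTreeClassifier(random_state=42),
    "forest": RandomForestClassifier(random_state=42, max_depth=14),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validation: train on 4 folds, score on the held-out fold.
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On the real data, passing `X` and `y` in place of `X_syn` and `y_syn` would give a mean accuracy and spread per model instead of one number from one split.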