### Problem Statement: Titanic Survival Prediction

#### Context
In 1912, the RMS Titanic, a luxury British steamship, sank after hitting an iceberg during its maiden voyage from Southampton to New York City. Of the 2,224 passengers and crew aboard, more than 1,500 perished, making it one of the deadliest maritime disasters in history. The disaster has since become a pivotal case study in safety, regulations, and human factors. In this context, we aim to develop a machine learning model to predict the likelihood of survival for passengers aboard the Titanic based on various features.

#### Business Objective
The primary goal is to create a machine learning pipeline that accurately predicts whether a passenger survived or not based on the available data. The insights derived from this model can help in understanding the factors that influenced survival rates and can be applied to improve safety measures in modern maritime and other transportation industries.

#### Scope and Deliverables
1. **Data Understanding and Exploration**:
   - Explore the dataset to understand the distribution and relationship of features.
   - Identify and handle missing values.
   - Perform exploratory data analysis (EDA) to uncover patterns and insights.

2. **Feature Engineering and Data Preprocessing**:
   - Clean the data by handling missing values and removing irrelevant features.
   - Transform categorical features into numerical values using techniques like one-hot encoding.
   - Scale numerical features to ensure consistency across data inputs.

3. **Model Training and Evaluation**:
   - Train multiple machine learning models, including Logistic Regression, Decision Tree, and K-Nearest Neighbors (KNN) classifiers.
   - Evaluate models using appropriate metrics such as accuracy, precision, recall, and F1-score.
   - Compare model performance and select the best model based on evaluation metrics.

4. **Machine Learning Pipeline**:
   - Develop a robust machine learning pipeline that integrates data preprocessing, feature engineering, and model training.
   - Ensure that the pipeline is modular, scalable, and easy to maintain.
   - Provide detailed documentation of each step within the pipeline.

#### Detailed Steps and Justification

1. **Data Collection**:
   - Load the Titanic dataset from a reliable source (e.g., seaborn library) to ensure data consistency and accuracy.

2. **Data Understanding and Exploration**:
   - **Exploratory Data Analysis (EDA)**: Perform initial data analysis to understand the structure, distribution, and relationships within the data. This helps in identifying important features and potential challenges like missing values and data imbalance.
   - **Handling Missing Values**: Missing values can lead to inaccurate model predictions. We will impute missing values for numerical features using mean/median values and for categorical features using the most frequent values.

3. **Feature Engineering**:
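   - **Dropping Irrelevant Columns**: The columns 'deck', 'embark_town', 'alive', 'class', 'who', 'adult_male', and 'alone' will be removed to simplify the dataset. 'deck' is mostly missing, 'class' and 'embark_town' duplicate 'pclass' and 'embarked', 'who', 'adult_male', and 'alone' can be derived from other features, and 'alive' restates the target, so keeping it would leak the answer into the model.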
   - **One-Hot Encoding**: Convert categorical features such as 'sex' and 'embarked' into numerical format using one-hot encoding. This ensures that the machine learning algorithms can process these features effectively.
   - **Scaling Numerical Features**: Scale features like 'age' and 'fare' to standardize the data. This step is crucial for models that rely on distance calculations, like KNN.

4. **Model Training and Evaluation**:
   - **Logistic Regression**: A simple yet effective linear model that provides a probabilistic approach to binary classification problems.
   - **Decision Tree**: A non-linear model that splits data based on feature values, providing an intuitive understanding of feature importance and decision-making process.
   - **K-Nearest Neighbors (KNN)**: A distance-based algorithm that classifies data points based on their proximity to other points in the dataset.
   - **Evaluation Metrics**: Use accuracy, precision, recall, and F1-score to evaluate model performance. Accuracy alone can be misleading when one class dominates, so precision, recall, and F1-score are reported per class to show how well survivors and non-survivors are each identified.

5. **Machine Learning Pipeline**:
   - **ColumnTransformer**: Combine preprocessing steps for numerical and categorical features into a single transformer. This ensures that all preprocessing steps are applied consistently.
   - **Pipeline Integration**: Create a pipeline that includes both the preprocessing steps and the machine learning model. This modular approach simplifies model training, testing, and maintenance.
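
The exploration step described above (counting missing values and comparing survival rates across groups) could be sketched as follows. The `sample` frame is a hypothetical mini-dataset shaped like the Titanic data, not real passenger records, so the sketch runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample shaped like the Titanic data (not real passenger records).
sample = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0, 0],
    "sex": ["male", "female", "female", "female", "male", "male"],
    "age": [22.0, 38.0, 26.0, np.nan, 35.0, np.nan],
    "embarked": ["S", "C", "S", "S", None, "Q"],
})

# Count missing values per column to decide which columns need imputation.
missing_counts = sample.isna().sum()

# The mean of a 0/1 target within a group is that group's survival rate.
survival_by_sex = sample.groupby("sex")["survived"].mean()

print(missing_counts)
print(survival_by_sex)
```

On the real dataset the same two calls would be applied to the `titanic` DataFrame loaded later in the notebook.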

#### Ask
Create a comprehensive machine learning pipeline for the Titanic dataset to predict passenger survival. The pipeline should include data cleaning, feature engineering, and model training using Logistic Regression, Decision Tree, and K-Nearest Neighbors classifiers. Provide detailed documentation of each step, including the justification for chosen techniques and models. The final deliverable should be a modular, scalable, and maintainable pipeline that can be used for further analysis and model refinement.

By building this pipeline, we aim to derive valuable insights into the factors affecting survival rates and apply these learnings to improve safety protocols in modern transportation systems.

A machine learning pipeline is a sequence of steps that you follow to build and deploy a machine learning model. Think of it like an assembly line in a factory, where each step in the line adds something new to the product until it's complete. Here are the basic steps in a machine learning pipeline:

1. **Data Collection**: Gathering the data you need to train your model.
2. **Data Cleaning**: Fixing or removing any incorrect or missing parts of the data.
3. **Feature Engineering**: Creating new features or selecting important ones to help the model make better predictions.
4. **Model Training**: Using the cleaned data to teach the model to recognize patterns.
5. **Model Evaluation**: Checking how well the model performs on new, unseen data.
6. **Model Deployment**: Putting the model into use so it can make predictions on new data in real time.

Each step in the pipeline ensures that the process is organized, repeatable, and efficient, making it easier to build accurate and reliable machine learning models.
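
As a rough illustration of the assembly-line idea, the sketch below chains a cleaning step, a scaling step, and a model with scikit-learn's `Pipeline`. The toy arrays and the step names (`clean`, `scale`, `model`) are assumptions made for this example only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature with a missing value and a binary target.
X_toy = np.array([[1.0], [2.0], [np.nan], [8.0], [9.0], [10.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Each step's output feeds the next step, like stations on an assembly line.
toy_pipeline = Pipeline(steps=[
    ("clean", SimpleImputer(strategy="mean")),  # data cleaning: fill the missing value
    ("scale", StandardScaler()),                # feature preparation: zero mean, unit variance
    ("model", LogisticRegression()),            # model training
])

# One call fits every step in order; predict() replays the same steps on new data.
toy_pipeline.fit(X_toy, y_toy)
toy_predictions = toy_pipeline.predict(np.array([[1.5], [9.5]]))
print(toy_predictions)
```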

Creating a machine learning pipeline has several advantages:

1. **Automation**: A pipeline automates the repetitive tasks involved in the machine learning process, like data cleaning, feature engineering, model training, and evaluation. This saves time and reduces the chances of human error.

2. **Consistency**: By using a pipeline, the same steps are applied every time you run the process. This ensures that your results are consistent and reliable.

3. **Efficiency**: Pipelines streamline the workflow by chaining different processes together. This makes it easier to manage and modify steps as needed, improving overall efficiency.

4. **Reproducibility**: Pipelines make it easy to reproduce results. If someone else needs to replicate your work, they can run the same pipeline on the same data, with the same random seeds, to achieve the same outcomes.

5. **Modularity**: Pipelines break down the machine learning process into smaller, manageable parts (modules). Each part can be developed, tested, and improved independently, making the whole system more flexible and easier to maintain.

6. **Scalability**: Pipelines can handle large datasets and complex processes more effectively. As your data grows or your model becomes more sophisticated, the pipeline can scale to meet these needs without a complete overhaul.

7. **Ease of Experimentation**: With a pipeline, you can easily tweak different parts of the process to see how changes affect the outcome. This makes experimenting with different models and techniques simpler and faster.
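
To make the experimentation point concrete, here is a small sketch in which only the final step of a pipeline is swapped while the preprocessing stays fixed. The toy data and the step names are assumptions for illustration, and the scores are training accuracies on that toy data, not Titanic results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two well-separated groups.
X_toy = np.array([[1.0, 10.0], [2.0, 12.0], [1.5, 11.0],
                  [8.0, 30.0], [9.0, 32.0], [8.5, 31.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

experiment = Pipeline(steps=[("scale", StandardScaler()), ("model", LogisticRegression())])

candidates = (
    LogisticRegression(),
    DecisionTreeClassifier(random_state=0),
    KNeighborsClassifier(n_neighbors=3),
)

scores = {}
for candidate in candidates:
    # Replace only the 'model' step; the scaling step is reused unchanged.
    experiment.set_params(model=candidate)
    experiment.fit(X_toy, y_toy)
    scores[type(candidate).__name__] = experiment.score(X_toy, y_toy)

print(scores)
```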

In [1]:
# Importing necessary libraries
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations
import seaborn as sns  # For data visualization and loading the Titanic dataset
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets
from sklearn.impute import SimpleImputer  # For handling missing values
from sklearn.preprocessing import StandardScaler, OneHotEncoder  # For scaling numerical data and encoding categorical data
from sklearn.compose import ColumnTransformer  # For applying different preprocessing steps to different feature types
from sklearn.pipeline import Pipeline  # For creating a machine learning pipeline
from sklearn.ensemble import RandomForestClassifier  # For the Random Forest classification model
from sklearn.metrics import accuracy_score, classification_report  # For evaluating the model performance

# Load the Titanic dataset from seaborn
titanic = sns.load_dataset('titanic')
titanic.shape

(891, 15)

In [2]:
titanic.head()

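|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |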


### Column Descriptions:

1. **survived**:
   - **Description**: Indicates whether the passenger survived the Titanic disaster.
   - **Type**: Integer (0 = No, 1 = Yes)
   - **Example**: 0 (did not survive), 1 (survived)

2. **pclass**:
   - **Description**: Passenger class, representing socio-economic status.
   - **Type**: Integer (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
   - **Example**: 1 (1st class), 2 (2nd class), 3 (3rd class)

3. **sex**:
   - **Description**: Gender of the passenger.
   - **Type**: String
   - **Example**: 'male', 'female'

4. **age**:
   - **Description**: Age of the passenger.
   - **Type**: Float
   - **Example**: 22.0, 38.0

5. **sibsp**:
   - **Description**: Number of siblings or spouses aboard the Titanic.
   - **Type**: Integer
   - **Example**: 1 (one sibling or spouse), 0 (no siblings or spouses)

6. **parch**:
   - **Description**: Number of parents or children aboard the Titanic.
   - **Type**: Integer
   - **Example**: 0 (no parents or children), 1 (one parent or child)

7. **fare**:
   - **Description**: Passenger fare (ticket price).
   - **Type**: Float
   - **Example**: 7.2500, 71.2833

8. **embarked**:
   - **Description**: Port of embarkation.
   - **Type**: String (C = Cherbourg, Q = Queenstown, S = Southampton)
   - **Example**: 'S' (Southampton), 'C' (Cherbourg)

9. **class**:
   - **Description**: Class of the passenger (alternative representation of pclass).
   - **Type**: String (First, Second, Third)
   - **Example**: 'First', 'Third'

10. **who**:
    - **Description**: Person category derived from age and sex.
    - **Type**: String (man, woman, child)
    - **Example**: 'man', 'woman'

11. **adult_male**:
    - **Description**: Indicates if the passenger is an adult male.
    - **Type**: Boolean (True = Yes, False = No)
    - **Example**: True (adult male), False (not an adult male)

12. **deck**:
    - **Description**: Deck where the passenger's cabin was located.
    - **Type**: String
    - **Example**: 'C', NaN (missing)

13. **embark_town**:
    - **Description**: Town where the passenger embarked.
    - **Type**: String
    - **Example**: 'Southampton', 'Cherbourg'

14. **alive**:
    - **Description**: Indicates if the passenger survived (alternative representation of survived).
    - **Type**: String (yes, no)
    - **Example**: 'yes', 'no'

15. **alone**:
    - **Description**: Indicates if the passenger was alone (no family aboard).
    - **Type**: Boolean (True = Yes, False = No)
    - **Example**: True (alone), False (not alone)

Each column in the Titanic dataset provides specific information about the passengers and their circumstances on the Titanic, which can be used for predictive modeling to understand the factors influencing survival rates.
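
Several of these columns restate one another. The sketch below checks this on the five rows printed by `titanic.head()` above; the mapping dictionaries are assumptions based on the column descriptions, and five rows can only illustrate the overlap, not prove it for the full dataset.

```python
import pandas as pd

# The five rows shown by titanic.head(), restricted to the columns being compared.
head_rows = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "embarked": ["S", "C", "S", "S", "S"],
    "class": ["Third", "First", "Third", "First", "Third"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton", "Southampton", "Southampton"],
    "alive": ["no", "yes", "yes", "yes", "no"],
})

# 'alive' is the target written as text, so it must never be used as a feature.
alive_matches_survived = bool(
    (head_rows["alive"].map({"no": 0, "yes": 1}) == head_rows["survived"]).all()
)

# 'class' is 'pclass' spelled out.
class_matches_pclass = bool(
    (head_rows["class"].map({"First": 1, "Second": 2, "Third": 3}) == head_rows["pclass"]).all()
)

# 'embarked' is the first letter of 'embark_town'.
town_matches_port = bool((head_rows["embark_town"].str[0] == head_rows["embarked"]).all())

print(alive_matches_survived, class_matches_pclass, town_matches_port)
```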

In [3]:
# Drop irrelevant columns
titanic.drop(['deck', 'embark_town', 'alive', 'class', 'who', 'adult_male', 'alone'], axis=1, inplace=True)


In [4]:
titanic.head()

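|   | survived | pclass | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |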


In [5]:
# Separate target variable (Survived) and features
X = titanic.drop(['survived'], axis=1)
y = titanic['survived']

In [6]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
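
The split above is purely random. One possible variation, shown here on hypothetical toy labels, is to pass `stratify=y` so that the training and test sets keep the same class proportions. This is an optional alternative, not what the notebook does, and adopting it would change the accuracy figures reported below.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60% class 0 and 40% class 1.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy keeps the 60/40 ratio in both the training and the test portion.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Share of class 1 in train: {y_tr.mean():.2f}, in test: {y_te.mean():.2f}")
```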

In [7]:
# Define preprocessing for numerical columns (impute missing values and scale)
numerical_features = ['age', 'fare']
numerical_features

['age', 'fare']

In [8]:
# https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
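
To see what this numerical transformer does, the sketch below runs the same two steps on a handful of hypothetical ages (the values are made up for illustration). The missing entry is replaced by the column mean, and the column is then rescaled to zero mean and unit variance.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Hypothetical ages with one missing value.
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])
scaled = demo_transformer.fit_transform(ages)

# The imputer learned the mean of the observed values (20, 30, 40).
learned_mean = demo_transformer.named_steps["imputer"].statistics_[0]

print(learned_mean)
print(scaled.ravel())
```

Because the imputed value equals the mean, the formerly missing row lands exactly at 0 after scaling.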

In [11]:
# Define preprocessing for categorical columns (impute missing values and one-hot encode)
categorical_features = ['sex', 'embarked']
categorical_features

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
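
The following sketch shows the behaviour of this categorical transformer on hypothetical port codes: the missing value is filled with the most frequent category, and a category that was never seen during fitting is encoded as a row of zeros instead of raising an error, which is what `handle_unknown='ignore'` requests.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

demo_categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Hypothetical training column: 'S' is most frequent, one entry is missing.
ports_train = np.array([["S"], ["S"], ["C"], [np.nan]], dtype=object)
encoded_train = demo_categorical.fit_transform(ports_train).toarray()

# 'Q' was not present during fitting, so every one-hot column is 0 for it.
ports_new = np.array([["Q"]], dtype=object)
encoded_new = demo_categorical.transform(ports_new).toarray()

print(encoded_train)  # columns are ordered as the learned categories: C, S
print(encoded_new)
```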

In [12]:
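# Combine preprocessing steps into a single ColumnTransformer.
# Note: ColumnTransformer defaults to remainder='drop', so columns not listed here
# ('pclass', 'sibsp', 'parch') are discarded and never reach the model.

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])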

In [13]:
# Define the machine learning pipeline with preprocessing and model training steps
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

In [14]:
# Train the pipeline on the training data

pipeline.fit(X_train, y_train)

In [15]:
# Make predictions on the testing data
y_pred = pipeline.predict(X_test)

In [17]:
# Evaluate the model's performance
print(f'Train Accuracy: {accuracy_score(y_train, pipeline.predict(X_train))}')
print(f'Test Accuracy: {accuracy_score(y_test, y_pred)}')
print(classification_report(y_test, y_pred))

Train Accuracy: 0.9789325842696629
Test Accuracy: 0.770949720670391
              precision    recall  f1-score   support

           0       0.79      0.83      0.81       105
           1       0.74      0.69      0.71        74

    accuracy                           0.77       179
   macro avg       0.77      0.76      0.76       179
weighted avg       0.77      0.77      0.77       179
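
The figures in this report are linked by simple formulas. As a quick sanity check, the sketch below recomputes the F1-scores from the printed precision and recall, and recovers the number of correct test predictions from the accuracy. The inputs are the rounded values shown above, so agreement is only expected to two decimals.

```python
# Precision and recall as printed in the classification report (rounded).
precision_0, recall_0 = 0.79, 0.83
precision_1, recall_1 = 0.74, 0.69

# F1 is the harmonic mean of precision and recall.
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Accuracy is correct predictions divided by the number of test rows (105 + 74).
support_total = 105 + 74
test_accuracy = 0.770949720670391
correct_predictions = round(test_accuracy * support_total)

print(round(f1_0, 2), round(f1_1, 2), correct_predictions)
```

The large gap between the training accuracy (about 0.98) and the test accuracy (about 0.77) suggests that the random forest is overfitting the training data.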



In [18]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the Titanic dataset from seaborn
titanic = sns.load_dataset('titanic')


In [19]:
print(titanic.head())
print(titanic.info())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-nu

In [20]:
# Drop irrelevant columns
titanic.drop(['deck', 'embark_town', 'alive', 'class', 'who', 'adult_male', 'alone'], axis=1, inplace=True)

# Separate target variable (Survived) and features
X = titanic.drop(['survived'], axis=1)
y = titanic['survived']

In [21]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [22]:
# Define preprocessing for numerical columns (impute missing values and scale)
numerical_features = ['age', 'fare']
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

In [23]:
# Define preprocessing for categorical columns (impute missing values and one-hot encode)
categorical_features = ['sex', 'embarked']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [24]:
# Combine preprocessing steps into a single ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [25]:
# Define a function to create, train, and evaluate the model pipeline
def train_and_evaluate(model):
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print(f'Using {model.__class__.__name__}:')
    print(f'Train Accuracy: {accuracy_score(y_train, pipeline.predict(X_train))}')
    print(f'Test Accuracy: {accuracy_score(y_test, y_pred)}')
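
The helper above prints its results and discards the fitted pipeline when it returns. A possible variant, sketched below under the hypothetical name `evaluate_model`, returns the scores and the fitted pipeline so that they can be compared or saved later. The toy data stands in for the Titanic split and is not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_model(preprocessor, model, X_train, X_test, y_train, y_test):
    """Fit preprocessor + model and return the scores and the fitted pipeline."""
    candidate = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", model)])
    candidate.fit(X_train, y_train)
    return {
        "model": type(model).__name__,
        "train_accuracy": accuracy_score(y_train, candidate.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, candidate.predict(X_test)),
        "pipeline": candidate,  # kept so the caller can reuse or persist it
    }


# Hypothetical, clearly separable toy data in place of the Titanic split.
X_tr = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y_tr = np.array([0, 0, 0, 1, 1, 1])
X_te = np.array([[1.5], [9.5]])
y_te = np.array([0, 1])

result = evaluate_model(StandardScaler(), LogisticRegression(), X_tr, X_te, y_tr, y_te)
print(result["model"], result["train_accuracy"], result["test_accuracy"])
```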

In [26]:
# Logistic Regression
logreg = LogisticRegression(max_iter=1000, random_state=42)
train_and_evaluate(logreg)

Using LogisticRegression:
Train Accuracy: 0.7823033707865169
Test Accuracy: 0.776536312849162


In [27]:
# Decision Tree
train_and_evaluate(DecisionTreeClassifier(random_state=42))

Using DecisionTreeClassifier:
Train Accuracy: 0.9789325842696629
Test Accuracy: 0.7262569832402235


In [28]:
# K-Nearest Neighbors
train_and_evaluate(KNeighborsClassifier())

Using KNeighborsClassifier:
Train Accuracy: 0.8469101123595506
Test Accuracy: 0.7486033519553073


In [30]:
# XGBoost
# https://xgboost.readthedocs.io/en/stable/parameter.html

from xgboost import XGBClassifier
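# Survival is a binary target, so the binary 'logloss' metric is used;
# `use_label_encoder` is deprecated in newer XGBoost releases and is omitted.
train_and_evaluate(XGBClassifier(eval_metric='logloss', random_state=42))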

Using XGBClassifier:
Train Accuracy: 0.9620786516853933
Test Accuracy: 0.8044692737430168


In [31]:
# Extra Trees Classifier

from sklearn.ensemble import ExtraTreesClassifier
train_and_evaluate(ExtraTreesClassifier(random_state=42))

Using ExtraTreesClassifier:
Train Accuracy: 0.9789325842696629
Test Accuracy: 0.7262569832402235


In [32]:
# Neural Network (MLPClassifier)

from sklearn.neural_network import MLPClassifier
train_and_evaluate(MLPClassifier(max_iter=1000, random_state=42))

Using MLPClassifier:
Train Accuracy: 0.8075842696629213
Test Accuracy: 0.770949720670391
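
To compare the six models trained with `train_and_evaluate`, the sketch below collects the accuracies printed above (rounded to four decimals) and ranks them by test accuracy. It also reports the train/test gap as a rough overfitting indicator. With a single 179-row test split, small differences between models should be read with caution.

```python
# (train accuracy, test accuracy) copied from the cell outputs above, rounded.
reported = {
    "LogisticRegression": (0.7823, 0.7765),
    "DecisionTreeClassifier": (0.9789, 0.7263),
    "KNeighborsClassifier": (0.8469, 0.7486),
    "XGBClassifier": (0.9621, 0.8045),
    "ExtraTreesClassifier": (0.9789, 0.7263),
    "MLPClassifier": (0.8076, 0.7709),
}

# Rank by test accuracy, highest first.
ranked = sorted(reported.items(), key=lambda item: item[1][1], reverse=True)
for name, (train_acc, test_acc) in ranked:
    print(f"{name:<24} train={train_acc:.4f} test={test_acc:.4f} gap={train_acc - test_acc:.4f}")

best_model = ranked[0][0]
smallest_gap_model = min(reported, key=lambda name: reported[name][0] - reported[name][1])
print(best_model, smallest_gap_model)
```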


In [34]:
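import joblib

# train_and_evaluate keeps its pipeline local, and the global `pipeline` from In [13]
# wraps the RandomForestClassifier, so the MLP pipeline is rebuilt and fitted here.
mlp_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', MLPClassifier(max_iter=1000, random_state=42))
])
mlp_pipeline.fit(X_train, y_train)

# Save the trained model to disk using joblib.
# This serializes the entire pipeline (preprocessing steps plus the trained MLPClassifier),
# so it can be reloaded and used later without retraining.
joblib.dump(mlp_pipeline, 'titanic_mlp_model.pkl')

# Print a confirmation message to indicate that the model has been saved to disk.
print("Model saved to disk as 'titanic_mlp_model.pkl'")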

Model saved to disk as 'titanic_mlp_model.pkl'
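
A saved pipeline is only useful if it can be loaded again. The sketch below shows the round trip with `joblib.load` on a small stand-in pipeline. The stand-in data, the temporary file name, and the `new_passenger` row are all hypothetical, chosen so the example runs without the Titanic download.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the trained Titanic pipeline.
train = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
    "sex": ["male", "female", "female", "female", "male", "male"],
})
labels = [0, 1, 1, 1, 0, 0]

stand_in = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["age", "fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ])),
    ("classifier", LogisticRegression()),
])
stand_in.fit(train, labels)

# Save to a temporary file, then load it back as a new object.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "stand_in_model.pkl")
    joblib.dump(stand_in, model_path)
    reloaded = joblib.load(model_path)

# The reloaded pipeline accepts raw columns and applies the stored preprocessing itself.
new_passenger = pd.DataFrame({"age": [29.0], "fare": [30.0], "sex": ["female"]})
prediction = reloaded.predict(new_passenger)
same_predictions = bool((reloaded.predict(train) == stand_in.predict(train)).all())
print(prediction, same_predictions)
```

Files written by joblib are pickle-based, so they should only be loaded from sources you trust.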


# Happy Learning