In [1]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, cross_val_predict, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np


### Problem Overview

The Titanic – Machine Learning from Disaster dataset from Kaggle is one of the most popular beginner-friendly challenges in data science. The goal is to build a model that can predict whether a passenger survived the Titanic disaster based on features such as age, gender, ticket class, number of family members on board, and other related information.

Each row in the dataset represents one passenger, and the target column — "Survived" — indicates the outcome:

1 → the passenger survived

0 → the passenger did not survive

By analyzing these features and training machine learning models, we aim to uncover patterns that influenced survival and evaluate how accurately our model can predict the outcome for unseen test data.

### Dataset

In machine learning projects, understanding your dataset is a crucial step toward building an effective model. The file **"data.csv"** contains the prepared and optimized data used for training. All essential steps of feature engineering have already been completed, so you can focus entirely on the core aspects — learning how machine learning works, implementing various models, and optimizing their performance.

### Data Loading and Scaling

The first step is to load the data from the provided .csv file and select the numerical feature columns. We then split the dataset into training and testing subsets, using 80% of the data for training and 20% for testing. Next, we apply StandardScaler() to standardize the features so that each has a mean of 0 and a standard deviation of 1, because distance-based and regularized algorithms such as KNN, SVM, and Logistic Regression perform better on standardized inputs. The scaler is fit on the training subset only and then applied to both subsets, so no information from the test set leaks into preprocessing. The models in the following sections wrap the scaler in a Pipeline and are fit on the unscaled X_train, so X_scaled_train and X_scaled_test serve only to illustrate the step. Each model learns patterns from the training data and is later evaluated on the test set to assess how accurately it predicts unseen data.

In [None]:
# Import the data from the CSV file
# The file "data.csv" contains the complete dataset that we’ll split
# into training and testing parts - 80% for training and 20% for testing.
data = pd.read_csv("data.csv")

# Select the columns (features) that the model will use to learn
# These include numeric values describing each passenger,
# such as ticket class, age, number of relatives, ticket fare, and cabin information.
# Additional columns come from previous feature engineering
# (for example, encoded titles and embarkation ports).
X = data[[
    "Pclass", "Age", "SibSp", "Parch",
    "Fare", "Cabin_quantity", "Binary",
    "Mr", "Miss", "Mrs", "Master", "Rev", "Dr", "Col", "Major", "Mlle", "Ms",
    "Mme", "Don", "Sir", "Lady", "Capt", "the Countess", "Jonkheer", "Dona",
    "Cherbourg", "Queenstown", "Southampton", "A", "B", "C", "D", "E", "F", "G", "T"
]]

# Select the target column ("Survived"), which represents the value the model aims to predict
# Based on the input features in X, the model will learn to determine whether a passenger survived (1) or not (0)
y = data["Survived"]

# Split the dataset into training (80%) and testing (20%) parts
# This allows us to train the model on one portion and test it on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a StandardScaler object to standardize numeric data
# Scaling ensures that features are on a similar scale,
# which helps many algorithms train faster and perform better.
scaler = StandardScaler()

# Fit the scaler on the training data and apply it to both training and test sets
X_scaled_train = scaler.fit_transform(X_train)
X_scaled_test = scaler.transform(X_test)
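
To make the standardization step concrete, here is a minimal sketch on a small made-up array. The values and the names `train` and `test` are illustrative assumptions and do not come from data.csv. It shows that the scaler learns the mean and standard deviation from the training rows only and reuses them for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, four training rows and two test rows
train = np.array([[10.0], [20.0], [30.0], [40.0]])
test = np.array([[25.0], [50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training rows only
test_scaled = scaler.transform(test)        # reuses the training statistics

print("Learned mean:", scaler.mean_[0])
print("Learned std: ", scaler.scale_[0])
print("Train mean after scaling:", train_scaled.mean())
print("Train std after scaling: ", train_scaled.std())
print("Scaled test rows:", test_scaled.ravel())
```

The scaled test rows are generally not centered at 0, because they are expressed in units of the training distribution.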


### Model Training

With the data properly loaded and scaled, we can now move on to training our first model — the K-Nearest Neighbors (KNN) classifier.
If you’re not familiar with the specific models used in this guide, you can find short descriptions and documentation references in the accompanying [ML.md](ML.md) file.

Each machine learning algorithm has its own hyperparameters — settings that must be chosen before training. The choice of both the model type and its hyperparameters is one of the key factors affecting prediction accuracy.
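
As a small illustration of what "hyperparameters" means in scikit-learn, the sketch below creates a KNN classifier and lists a few of its settings with `get_params()`. The variable names `knn` and `params` are chosen here for illustration only.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hyperparameters are passed to the constructor, before any data is seen
knn = KNeighborsClassifier(n_neighbors=3, weights="distance", metric="manhattan")

# get_params() returns every hyperparameter, including those left at their defaults
params = knn.get_params()
for name in ("n_neighbors", "weights", "metric"):
    print(f"{name} = {params[name]}")

print("All available hyperparameters:", sorted(params))
```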

While there are advanced libraries such as Optuna that can automatically optimize these parameters, in this tutorial we will focus on the built-in tools provided by scikit-learn.
To find the best settings, we’ll use GridSearchCV, and to evaluate model accuracy we’ll rely on standard scikit-learn performance metrics and validation functions.

### K-Nearest Neighbors (KNN) Classifier

In [290]:
# Define a Pipeline that bundles preprocessing (scaling) and the model
# The scaler is fit ONLY on the training portion inside each CV fold,
# then applied to the validation fold — this prevents data leakage.
# We pick arbitrary hyperparameters for now; they are tuned in the next step
pipe_knn = Pipeline(steps=[
    ("scaler", StandardScaler()),     # Standardize features inside each CV fold
    ("model", KNeighborsClassifier(   # K-Nearest Neighbors classifier
        n_neighbors=3,                # Number of neighbors used for prediction
        weights='distance',           # Closer neighbors get more influence on the prediction
        metric='manhattan'            # Distance formula; see the documentation for alternatives
    ))
])

# Fit the Pipeline on the training data
# Internally, the scaler will be fit to X_train and the model will learn from the scaled data.
pipe_knn.fit(X_train, y_train)

# Evaluate the model using cross-validation WITHOUT leakage
# For each fold, the scaler is fit on the fold’s training split and applied to its validation split.
cv_score = np.round(cross_val_score(pipe_knn, X_train, y_train), 2)

# Display detailed results
# Shows accuracy for each fold, the mean accuracy, and the standard deviation.
# Lower standard deviation indicates more consistent performance across folds.
print("Scores of training data cross-validation (each fold):")
for score in cv_score:
    print(score)
print(f"\nCross-validation mean score: {np.mean(cv_score):.3f}")
print(f"Standard deviation of CV score: {np.std(cv_score):.3f}")

Scores of training data cross-validation (each fold):
0.8
0.82
0.82
0.83
0.8

Cross-validation mean score: 0.814
Standard deviation of CV score: 0.012
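
What `cross_val_score` does with a Pipeline can be written out by hand. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` in place of the Titanic features, and the names `pipe`, `manual_scores`, and `auto_scores` are introduced here. For a classifier and an integer `cv`, scikit-learn uses an unshuffled `StratifiedKFold`, so the manual loop and the helper should agree.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the Titanic dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=3)),
])

cv = StratifiedKFold(n_splits=5)  # the splitter used for classifiers when cv=5
manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    fold_pipe = clone(pipe)                    # fresh, unfitted copy for each fold
    fold_pipe.fit(X[train_idx], y[train_idx])  # scaler statistics come from this fold's training rows only
    manual_scores.append(fold_pipe.score(X[val_idx], y[val_idx]))

auto_scores = cross_val_score(pipe, X, y, cv=5)
print("Manual loop:    ", np.round(manual_scores, 3))
print("cross_val_score:", np.round(auto_scores, 3))
```

Because the scaler is refit inside every fold, the validation rows never influence the mean and standard deviation used to scale them.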


A mean accuracy of about 81% is a reasonable starting point, but the hyperparameters used so far were chosen arbitrarily, without any systematic optimization. Let's improve that now.

To achieve more reliable results, we'll use GridSearchCV for hyperparameter tuning, combined with RepeatedStratifiedKFold.
A single K-fold run depends on how the rows happen to be partitioned, so a different shuffle of the same data can produce a slightly different accuracy score for the same model configuration.
By using Repeated Stratified K-Fold, we repeat the cross-validation process several times and average the results, leading to a more stable and consistent accuracy estimate.
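
The following sketch shows what RepeatedStratifiedKFold produces, using made-up labels (60 negatives and 40 positives, an assumption for illustration) rather than the Titanic target. It counts the splits and checks the class balance of every validation fold.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical labels: 60 negatives and 40 positives (not the Titanic data)
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))  # feature values are irrelevant for splitting

rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

positive_rates = []
for train_idx, val_idx in rskf.split(X, y):
    positive_rates.append(y[val_idx].mean())  # share of positives in this validation fold

print("Total train/validation splits:", rskf.get_n_splits(X, y))  # n_splits * n_repeats
print("Positive rate in each validation fold:", np.round(positive_rates, 2))
```

Every validation fold keeps the overall 40% positive rate, and each repetition reshuffles which rows land in which fold.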

In [232]:
# Define a grid of hyperparameters to test
# Each combination of these parameters will be evaluated to find the best-performing model.
param_grid = {
    "model__n_neighbors": [3, 5, 7, 9, 11, 15],  # Number of neighbors to consider
    "model__weights": ["uniform", "distance"],   # How neighbors contribute to the prediction
    "model__metric": ["minkowski", "manhattan", "euclidean", "chebyshev"],  # Distance metrics to test ("minkowski" with the default p=2 is the same as "euclidean")
}

# Set up the cross-validation strategy
# RepeatedStratifiedKFold splits the data into several folds while preserving class balance.
# The process is repeated multiple times (100 here) to get more stable results.
rskf = RepeatedStratifiedKFold(
    n_splits=5,      # Number of folds per repetition
    n_repeats=100,   # Number of times to repeat the process
    random_state=None  # Random seed (None = random each run)
)

# Define a Pipeline combining the scaler and the model
# This ensures that scaling happens *within* each CV fold, preventing data leakage.
pipe_knn = Pipeline(steps=[
    ("scaler", StandardScaler()),           # Standardize data inside each CV fold
    ("model", KNeighborsClassifier())       # The KNN model to be optimized
])

# Initialize the Grid Search for the KNN model
# GridSearchCV will train and evaluate a model for every combination of parameters in param_grid,
# using the defined cross-validation strategy.
grid_search = GridSearchCV(
    estimator=pipe_knn,          # The pipeline (scaler + model)
    param_grid=param_grid,       # The grid of parameters to test
    scoring="accuracy",          # Metric used to evaluate model performance
    cv=rskf,                     # Cross-validation strategy
    verbose=1,                   # Display progress in the console
    n_jobs=-1                    # Use all available CPU cores for faster processing
)

# Train (fit) the grid search on the raw training data
# The scaler will be fit automatically within each fold.
grid_search.fit(X_train, y_train)

# Display the best results
# best_params_ shows which combination performed the best,
# best_score_ shows the corresponding average accuracy.
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best accuracy (averaged CV): {grid_search.best_score_:.4f}")

Fitting 500 folds for each of 48 candidates, totalling 24000 fits
Best parameters: {'model__metric': 'manhattan', 'model__n_neighbors': 7, 'model__weights': 'uniform'}
Best accuracy (averaged CV): 0.8367
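
Besides `best_params_` and `best_score_`, a fitted GridSearchCV exposes the full results table in `cv_results_`. The sketch below is a minimal example on synthetic data (an assumption standing in for the Titanic features) with a deliberately small grid called `small_grid`, a name introduced here. It also shows the naming rule for pipeline parameters: the step name, two underscores, then the parameter name, as in `model__n_neighbors`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the Titanic dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier()),
])

# Deliberately small grid so the sketch runs in seconds
small_grid = {
    "model__n_neighbors": [3, 7, 11],
    "model__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, small_grid, scoring="accuracy", cv=5)
search.fit(X, y)

# cv_results_ holds one row per parameter combination
results = pd.DataFrame(search.cv_results_)
columns = ["param_model__n_neighbors", "param_model__weights",
           "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").to_string(index=False))
```

Looking at the whole table, rather than only the winner, shows how close the runner-up configurations are and whether the best score sits at the edge of the grid.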


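The search confirms that the initial hyperparameters were not the best choice: it keeps the Manhattan metric but prefers 7 neighbors with uniform weights over our initial 3 neighbors with distance weights.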
In the KNN example, we’ll go through each step of the optimization process using GridSearchCV to demonstrate how it works in detail.
For the upcoming models, however, we’ll simply provide the already optimized parameters to keep the focus on comparing their performance rather than repeating the full tuning process.

In [None]:
# Refine the hyperparameter grid based on previous search results
# We now focus on narrower ranges around the most promising values to fine-tune the model.

param_grid = {
    "model__n_neighbors": [6, 7, 8],  
    "model__weights": ["uniform"],  
    "model__metric": ["manhattan"], 
}

rskf = RepeatedStratifiedKFold(
    n_splits=5,      
    n_repeats=100,   
    random_state=None  
)

pipe_knn = Pipeline(steps=[
    ("scaler", StandardScaler()),           
    ("model", KNeighborsClassifier())     
])

grid_search = GridSearchCV(
    estimator=pipe_knn,         
    param_grid=param_grid,     
    scoring="accuracy",         
    cv=rskf,                 
    verbose=1,                 
    n_jobs=-1                    
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best accuracy (averaged CV): {grid_search.best_score_:.4f}")

Fitting 50 folds for each of 144 candidates, totalling 7200 fits
Best parameters: {'model__metric': 'manhattan', 'model__n_neighbors': 7, 'model__p': 1, 'model__weights': 'uniform'}
Best accuracy (averaged CV): 0.8381


In [292]:
# We now plug the optimized parameters into our model and check whether the accuracy score improves

pipe_knn = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(
        n_neighbors=7,
        weights='uniform',
        metric='manhattan'
    ))
])

pipe_knn.fit(X_train, y_train)

cv_score = np.round(cross_val_score(pipe_knn, X_train, y_train), 2)

print("Scores of training data cross-validation (each fold):")
for score in cv_score:
    print(score)
print(f"\nCross-validation mean score: {np.mean(cv_score):.3f}")
print(f"Standard deviation of CV score: {np.std(cv_score):.3f}")

Scores of training data cross-validation (each fold):
0.84
0.85
0.86
0.86
0.8

Cross-validation mean score: 0.842
Standard deviation of CV score: 0.022


Tuning raised the mean cross-validation accuracy from 0.814 to 0.842, an improvement of almost 3 percentage points, so our KNN model is now tuned within the grid we searched.
With this baseline established, we can move on to testing other models to see which one delivers the best overall performance.

### Decision Tree

In [235]:
# We now move on to the Decision Tree.
# Decision Trees are not sensitive to feature scaling, so we skip standardization and do not use a Pipeline.

clf_tree = DecisionTreeClassifier(
    max_depth=7,            # Limits how deep the tree can grow (to prevent overfitting)
    criterion='log_loss',    # Measures the quality of a split using information gain
    min_samples_split=7,    # Minimum number of samples required to split an internal node
    min_samples_leaf=5,     # Minimum number of samples required to be at a leaf node
    class_weight=None       # No weighting — all classes are treated equally
)

clf_tree.fit(X_train, y_train)

cv_score = np.round(cross_val_score(clf_tree, X_train, y_train), 2)

print("Scores of training data cross-validation (each fold):")
for score in cv_score:
    print(score)
print(f"\nCross-validation mean score: {np.mean(cv_score):.3f}")
print(f"Standard deviation of CV score: {np.std(cv_score):.3f}")


Scores of training data cross-validation (each fold):
0.79
0.83
0.84
0.85
0.72

Cross-validation mean score: 0.806
Standard deviation of CV score: 0.048
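
The claim that Decision Trees do not depend on feature scaling can be checked directly. The sketch below is an illustration on synthetic data (an assumption standing in for the Titanic features): it trains two trees with the same hyperparameters and seed, one on raw and one on standardized features, and compares their predictions. The names `tree_raw`, `tree_scaled`, and `agreement` are introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (not the real dataset)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_tr)

# Same hyperparameters and seed; only the feature scale differs
tree_raw = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(max_depth=5, random_state=0).fit(scaler.transform(X_tr), y_tr)

pred_raw = tree_raw.predict(X_te)
pred_scaled = tree_scaled.predict(scaler.transform(X_te))

agreement = np.mean(pred_raw == pred_scaled)
print(f"Share of identical predictions: {agreement:.3f}")
```

A tree splits on thresholds of single features, and standardization shifts and rescales each feature without changing the order of its values, so the same partitions of the data remain available.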


### Support Vector Machine (SVM)

In [None]:
# We now move on to the Support Vector Classifier (SVC).
# Since SVMs are sensitive to feature scaling, we include standardization
# directly in a Pipeline to ensure proper preprocessing and prevent any data leakage during cross-validation.

pipe_svc = Pipeline(steps=[
    ("scaler", StandardScaler()),   
    ("model", SVC(                  # Support Vector Classifier
        kernel="rbf",               # RBF kernel captures non-linear decision boundaries
        C=3,                        # Regularization parameter (higher C = weaker regularization, tighter fit to training data)
        gamma="scale",              # Kernel width; 'scale' adapts to data variance
        class_weight=None           # Treat classes equally (no re-weighting)
    ))
])

pipe_svc.fit(X_train, y_train)

cv_score = np.round(cross_val_score(pipe_svc, X_train, y_train), 2)

print("Scores of training data cross-validation (each fold):")
for score in cv_score:
    print(score)
print(f"\nCross-validation mean score: {cv_score.mean():.3f}")
print(f"Standard deviation of CV score: {cv_score.std():.3f}")

Scores of training data cross-validation (each fold):
0.81
0.85
0.85
0.85
0.8

Cross-validation mean score: 0.832
Standard deviation of CV score: 0.022
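
To see why scaling matters for an SVM, the sketch below compares the same SVC with and without a scaler. It is an illustration under stated assumptions: the data is synthetic, and one pure-noise column is multiplied by 1000 on purpose so that it dominates the distances used by the RBF kernel. The names `svc_raw` and `svc_scaled` are introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: with shuffle=False the first 3 columns are informative
# and the remaining 3 are pure noise (not the Titanic dataset)
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X[:, 5] *= 1000.0  # blow up the scale of one noise column

svc_raw = SVC(kernel="rbf", C=3, gamma="scale")
svc_scaled = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", SVC(kernel="rbf", C=3, gamma="scale")),
])

raw_mean = cross_val_score(svc_raw, X, y, cv=5).mean()
scaled_mean = cross_val_score(svc_scaled, X, y, cv=5).mean()
print(f"Without scaling: {raw_mean:.3f}")
print(f"With scaling:    {scaled_mean:.3f}")
```

Without the scaler, the inflated noise column drowns out the informative ones. With the scaler, all columns contribute on a comparable scale.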


### Logistic Regression

In [247]:
# Logistic Regression is also sensitive to feature scaling, so the scaler again goes inside a Pipeline

pipe_log = Pipeline(steps=[
    ("scaler", StandardScaler()),           # Standardize numeric features per fold
    ("model", LogisticRegression(           # Logistic Regression for binary classification
        C=1,                                # Inverse regularization strength (higher = weaker regularization)
        penalty="l1",                       # L1 regularization (can drive some coefficients to exactly zero)
        solver="liblinear",                 # Solver compatible with L1 penalty
        max_iter=1000,                      # Allow more iterations so the solver can converge
        class_weight=None                   # Treat classes equally
    ))
])

pipe_log.fit(X_train, y_train)

cv_score = np.round(cross_val_score(pipe_log, X_train, y_train), 2)

print("Scores of training data cross-validation (each fold):")
for score in cv_score:
    print(score)
print(f"\nCross-validation mean score: {cv_score.mean():.3f}")
print(f"Standard deviation of CV score: {cv_score.std():.3f}")

Scores of training data cross-validation (each fold):
0.81
0.83
0.86
0.85
0.78

Cross-validation mean score: 0.826
Standard deviation of CV score: 0.029
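
The comment about L1 regularization driving coefficients to zero can be illustrated with a short comparison. This is a sketch on synthetic data with mostly noise features (an assumption, not the Titanic dataset), and `C=0.05` is an arbitrary value chosen here to make the effect visible. It counts exactly-zero coefficients under the L1 and L2 penalties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with 3 informative and 9 noise features (not the Titanic dataset)
X, y = make_classification(n_samples=300, n_features=12, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# Same solver and C; only the penalty differs
log_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
log_l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X, y)

n_zero_l1 = int(np.sum(log_l1.coef_ == 0))
n_zero_l2 = int(np.sum(log_l2.coef_ == 0))
print(f"Zero coefficients with L1: {n_zero_l1} of {X.shape[1]}")
print(f"Zero coefficients with L2: {n_zero_l2} of {X.shape[1]}")
```

Features whose coefficient is exactly zero are ignored by the model, which is why L1 is often described as performing a form of feature selection.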


### Random Forest Classifier

In [None]:
# Finally, the Random Forest. Like a single Decision Tree, it does not need feature scaling, so no Pipeline is used.

clf_rf = RandomForestClassifier(
    max_depth=6,                     # Limit the depth of each tree to prevent overfitting
    min_samples_split=6,             # Minimum number of samples required to split a node
    n_estimators=125,                # Number of trees in the forest
    min_samples_leaf=2,              # Minimum number of samples required to be at a leaf node
    max_features='sqrt'              # Number of features to consider when looking for the best split
)

clf_rf.fit(X_train, y_train)

cv_score = np.round(cross_val_score(clf_rf, X_train, y_train), 2)

print("Scores of training data cross-validation (each fold):")
for score in cv_score:
    print(score)
print(f"\nCross-validation mean score: {np.mean(cv_score):.3f}")
print(f"Standard deviation of CV score: {np.std(cv_score):.3f}")


Scores of training data cross-validation (each fold):
0.84
0.85
0.85
0.85
0.83

Cross-validation mean score: 0.844
Standard deviation of CV score: 0.008


### Model Scores

| Classifier           | Mean CV accuracy |
|----------------------|------------------|
| KNN                  | 84.2%            |
| Decision Tree        | 80.6%            |
| SVM                  | 83.2%            |
| Logistic Regression  | 82.6%            |
| Random Forest        | 84.4%            |

The Random Forest achieved the highest cross-validation accuracy (84.4%), closely followed by KNN (84.2%). Next, we apply all five models to the test dataset to see how well they perform on unseen data.

### Running the Model on the Test Dataset

In [None]:
# Predictions: use raw X_test
# (Pipelines include preprocessing, so no manual scaling of X_test)
y_pred_knn  = pipe_knn.predict(X_test)   # not X_scaled_test
y_pred_svc  = pipe_svc.predict(X_test)   # not X_scaled_test
y_pred_log  = pipe_log.predict(X_test)   # not X_scaled_test
y_pred_tree = clf_tree.predict(X_test)
y_pred_rf   = clf_rf.predict(X_test)

# Model evaluation: calculate accuracy for each model separately
acc_knn  = accuracy_score(y_test, y_pred_knn)
acc_tree = accuracy_score(y_test, y_pred_tree)
acc_svc  = accuracy_score(y_test, y_pred_svc)
acc_log  = accuracy_score(y_test, y_pred_log)
acc_rf   = accuracy_score(y_test, y_pred_rf)

print("Accuracy on test set:")
print(f"- Accuracy of KNN Classifier model on test dataset:              {acc_knn:.4f}")
print(f"- Accuracy of Decision Tree model on test dataset:    {acc_tree:.4f}")
print(f"- Accuracy of SVC model on test dataset:              {acc_svc:.4f}")
print(f"- Accuracy of Logistic Regression model on test dataset:    {acc_log:.4f}")
print(f"- Accuracy of Random Forest model on test dataset:    {acc_rf:.4f}")

Accuracy on test set:
- Accuracy of KNN Classifier model on test dataset:              0.7584
- Accuracy of Decision Tree model on test dataset:    0.8258
- Accuracy of SVC model on test dataset:              0.8146
- Accuracy of Logistic Regression model on test dataset:    0.8202
- Accuracy of Random Forest model on test dataset:    0.8090


### Final remarks

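As we can see, the results on the test dataset differ from the cross-validation estimates. KNN, which ranked second in cross-validation (84.2%), dropped to 75.8% on the test set, while the Decision Tree rose from 80.6% to 82.6% and the other three models lost between 0.6 and 3.5 percentage points. During the GridSearch and optimization process we worked exclusively with the training dataset, so the cross-validation scores of the tuned models tend to be somewhat optimistic, and the test set holds only 20% of the data, which makes each test score noisy. The gap reflects how well each model generalizes to new examples, which is the ultimate goal of machine learning.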
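
One rough way to judge how much of such a gap could be sampling noise is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). The sketch below assumes a test set of 178 passengers. That number is inferred from the reported accuracies (for example, 0.7584 is approximately 135/178) and is not stated in the tutorial, so in practice `len(y_test)` should be used instead.

```python
import math

n_test = 178  # assumption: inferred from the reported accuracies, not stated in the tutorial
test_accuracy = {
    "KNN": 0.7584,
    "Decision Tree": 0.8258,
    "SVM": 0.8146,
    "Logistic Regression": 0.8202,
    "Random Forest": 0.8090,
}

standard_errors = {}
for name, acc in test_accuracy.items():
    se = math.sqrt(acc * (1 - acc) / n_test)  # binomial standard error of the accuracy
    standard_errors[name] = se
    print(f"{name:<20} {acc:.4f} +/- {se:.3f}")
```

Under this assumption each test accuracy carries a standard error of about three percentage points, so the differences among the Decision Tree, SVM, Logistic Regression, and Random Forest are well within that margin.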

Despite not using more advanced tools such as a detailed confusion matrix analysis or feature importance evaluation, we still achieved reasonably strong results. The outcome demonstrates that even classical machine learning methods can perform well when the data is properly prepared and the model is tuned thoughtfully.
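
For readers who want to try the confusion matrix analysis mentioned above, here is a minimal sketch using `confusion_matrix` and `classification_report`, both already imported at the top of the notebook. It runs on synthetic data (an assumption standing in for the Titanic dataset), and the class labels passed to `target_names` are chosen here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic data (not the real dataset)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall and F1 score per class
print(classification_report(y_te, y_pred, target_names=["did not survive", "survived"]))
```

In the notebook itself, the same two calls can be applied to `y_test` and any of the `y_pred_*` arrays to see which kind of error each model makes more often.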

Of course, this is not the best possible solution to the problem. As stated at the beginning, this tutorial was created solely for the purpose of a recruitment task — the goal was to present a clear, comprehensible approach rather than to maximize performance or use overly complex methods that might confuse ML beginners.

If you wish to improve the results further, you can explore more advanced techniques such as model ensembling, feature selection, or hyperparameter optimization with libraries like Optuna or Bayesian Optimization.
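
As one possible starting point for model ensembling, the sketch below combines three of the model types from this tutorial with `VotingClassifier`, which is already imported at the top of the notebook. It is an illustration, not a tuned solution: the data is synthetic, and the names `svc`, `logreg`, `forest`, and `ensemble` are introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data (not the Titanic dataset)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Scale-sensitive models get their own scaler inside a Pipeline
svc = Pipeline([("scaler", StandardScaler()), ("model", SVC(C=3, gamma="scale"))])
logreg = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=1000))])
forest = RandomForestClassifier(n_estimators=125, max_depth=6, random_state=0)

# Hard voting: each model casts one vote and the majority class wins
ensemble = VotingClassifier(
    estimators=[("svc", svc), ("logreg", logreg), ("forest", forest)],
    voting="hard",
)

scores = cross_val_score(ensemble, X, y, cv=5)
print(f"Ensemble CV accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

An ensemble tends to help most when its members make different mistakes, so whether it beats the best single model on the Titanic data has to be checked with cross-validation.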

That concludes this tutorial — good luck with your recruitment task, and keep exploring machine learning!