# NBS Titanic Challenge

## Purpose
The purpose of this notebook is to work with the [FALL Detection Dataset](http://fenix.univ.rzeszow.pl/~mkepski/ds/uf.html) using machine learning algorithms.

## Methodology

The goal of this experiment is to predict whether a fall has occurred, using a machine learning model.

To build the machine learning classifier, we have to create some features based on statistics such as the maximum, average, and standard deviation.
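
As a rough sketch of what such summary features could look like, the snippet below collapses each recording into one row holding the maximum, mean, and standard deviation of a signal. The `recording_id` column and the toy values are assumptions made for illustration; the notebook itself does not build these features.

```python
import pandas as pd

# Toy signal: two hypothetical recordings, three samples each.
signal = pd.DataFrame({
    "recording_id": [0, 0, 0, 1, 1, 1],  # assumed grouping key, not part of the dataset
    "feat_2": [1.0, 2.0, 3.0, 2.0, 2.0, 5.0],
})

# One row per recording with the max, mean, and standard deviation of the signal.
summary = signal.groupby("recording_id")["feat_2"].agg(["max", "mean", "std"])
print(summary)
```

Each row of `summary` could then serve as one training example instead of the individual samples.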

## Results

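The results show that tree-based ensemble methods (bagging and boosting) perform considerably better than Logistic Regression and SVM: Random Forest, Gradient Boosting, and XGBoost reach 93-94% test accuracy and 86-87% F1-score, compared with 81% accuracy and 54% F1-score for Logistic Regression. The table below shows the results of each method: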

|                     | **Train Set** | **Test Set** | **Test Set** |
|:-------------------:|:-------------:|:------------:|:------------:|
|   **Classifiers**   |    Accuracy   |   Accuracy   |   F1-score   |
| Logistic Regression |      81%      |      81%     |      54%     |
|         SVM         |      49%      |      34%     |      32%     |
|    Random Forest    |      94%      |      94%     |      87%     |
|  Gradient Boosting  |      94%      |      94%     |      87%     |
|       XGBoost       |      94%      |      93%     |      86%     |



## Suggested next steps
...

# Setup

## Library import
We import all the required Python libraries

In [18]:
import pandas as pd
import os
import numpy as np
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, roc_curve, roc_auc_score


import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
%matplotlib inline

# Parameter definition
We set all relevant parameters for our notebook. By convention, parameters are uppercase, while all the 
other variables follow Python's guidelines.

In [2]:
TRAIN_DATA_PATH_ADL = "../data/ADL/Train"
TRAIN_DATA_PATH_FALL = "../data/Fall/Train"


# Data import
We retrieve all the required data for the analysis.

In [3]:
def load_dataframes(root_path: str):
    """Load every CSV file in a directory and concatenate them into one DataFrame.

    Args:
        root_path (str): Directory where the CSV files are located.

    Returns:
        pd.DataFrame: All files concatenated row-wise, with columns feat_1 to feat_5.
    """
    df = pd.DataFrame()
    names = [f"feat_{i+1}" for i in range(5)]

    for file_name in os.listdir(root_path):
        file_path = os.path.join(root_path, file_name)
        df = pd.concat([df,
                        pd.read_csv(file_path, names=names)], axis=0)
        
    return df

In [4]:
adl_df = load_dataframes(TRAIN_DATA_PATH_ADL)
adl_df = adl_df.reset_index(drop=True)
adl_df["Fall"] = 0  # label every ADL sample as "no fall"
display(adl_df.head(), adl_df.tail())

Unnamed: 0,feat_1,feat_2,feat_3,feat_4,feat_5,Fall
0,-16,1.09928,-0.203149,1.067084,-0.168758,0
1,16,1.063184,-0.217888,1.029009,-0.155002,0
2,47,1.016192,-0.234346,0.975704,-0.160406,0
3,62,0.974664,-0.228696,0.933944,-0.159424,0
4,94,0.954256,-0.210518,0.914538,-0.172934,0


Unnamed: 0,feat_1,feat_2,feat_3,feat_4,feat_5,Fall
13519,9204,1.150714,1.12088,0.059201,-0.253506,0
13520,9240,1.151249,1.126284,0.118647,-0.206833,0
13521,9274,1.113213,1.089438,0.025056,-0.227468,0
13522,9301,1.201963,1.154779,-0.08745,-0.321796,0
13523,9330,1.215942,1.187696,-0.059201,-0.253752,0


In [5]:
fall_df = load_dataframes(TRAIN_DATA_PATH_FALL)
fall_df = fall_df.reset_index(drop=True)
fall_df["Fall"] = 1  # label every fall sample as "fall"
display(fall_df.head(), fall_df.tail())

Unnamed: 0,feat_1,feat_2,feat_3,feat_4,feat_5,Fall
0,-9,0.926322,-0.253997,0.815543,0.358397,1
1,8,0.866156,-0.252278,0.768625,0.309513,1
2,26,0.838198,-0.278562,0.742586,0.271193,1
3,43,0.920216,-0.350045,0.810384,0.259893,1
4,59,0.948437,-0.394261,0.822912,0.258665,1


Unnamed: 0,feat_1,feat_2,feat_3,feat_4,feat_5,Fall
4641,2652,1.086798,0.258419,0.897097,0.556387,1
4642,2662,1.053838,0.25891,0.879411,0.519786,1
4643,2676,1.117649,0.227468,0.924364,0.585619,1
4644,2685,1.071484,0.257191,0.880393,0.55393,1
4645,2700,1.133348,0.279545,0.954087,0.544105,1


### Merge the two dataframes

In [6]:
df = pd.concat([adl_df, fall_df], axis=0)
df = df.reset_index(drop=True)
display(df.head(), df.tail())

Unnamed: 0,feat_1,feat_2,feat_3,feat_4,feat_5,Fall
0,-16,1.09928,-0.203149,1.067084,-0.168758,0
1,16,1.063184,-0.217888,1.029009,-0.155002,0
2,47,1.016192,-0.234346,0.975704,-0.160406,0
3,62,0.974664,-0.228696,0.933944,-0.159424,0
4,94,0.954256,-0.210518,0.914538,-0.172934,0


Unnamed: 0,feat_1,feat_2,feat_3,feat_4,feat_5,Fall
18165,2652,1.086798,0.258419,0.897097,0.556387,1
18166,2662,1.053838,0.25891,0.879411,0.519786,1
18167,2676,1.117649,0.227468,0.924364,0.585619,1
18168,2685,1.071484,0.257191,0.880393,0.55393,1
18169,2700,1.133348,0.279545,0.954087,0.544105,1


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18170 entries, 0 to 18169
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   feat_1  18170 non-null  int64  
 1   feat_2  18170 non-null  float64
 2   feat_3  18170 non-null  float64
 3   feat_4  18170 non-null  float64
 4   feat_5  18170 non-null  float64
 5   Fall    18170 non-null  int64  
dtypes: float64(4), int64(2)
memory usage: 851.8 KB


# Data Exploration

## Look at the outcome/ground truth distribution

In [8]:
def outcome_distribution(df, outcome):
    fig = px.pie(df, names=outcome, title="Ground Truth Distribution", hole=0.3)
    fig.show()

In [9]:
outcome_distribution(df, outcome="Fall")

We have an imbalanced dataset problem.

In [10]:
class_proportion = df["Fall"].value_counts(normalize=True)
print(f"Proportion: ")
print(f"Not Fall: {class_proportion[0]:.2f}")
print(f"Fall: {class_proportion[1]:.2f}")

Proportion: 
Not Fall: 0.74
Fall: 0.26
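
With an imbalance like this, one common precaution is to stratify the train/test split so both parts keep the same class ratio. The sketch below uses synthetic labels with roughly the same 74/26 ratio; the `*_demo` names are placeholders, and this is not the split used later in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same 74/26 imbalance as above.
y_demo = np.array([0] * 740 + [1] * 260)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks the splitter to preserve the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo
)
print(f"Train fall ratio: {y_tr.mean():.2f}, Test fall ratio: {y_te.mean():.2f}")
```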


## Univariate Analysis

In [11]:
def feature_distribution(df: pd.DataFrame, feature_name: str, title: str):
    """Plot the distribution of an independent feature.

    Args:
        df (pd.DataFrame): DataFrame.
        feature_name (str): Feature that will be visualized.
        title (str): Title to show on the plot.
    """
    fig = px.histogram(df, x=feature_name, marginal='box',
                       histnorm="probability density", title=title)
    fig.show()

In [12]:
for feat_name in df.columns.tolist()[:-1]:
    feature_distribution(df, feature_name=feat_name, title=f"{feat_name.capitalize()} distribution")

These plots show that the features contain some outliers; we can bring the features to a common scale using standardization.
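
For reference, standardization subtracts the mean of each column and divides by its standard deviation. The sketch below checks this on a toy column (the values are made up) against `StandardScaler`. Note that this rescales the data but does not remove the outlier itself.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[1.0], [2.0], [3.0], [50.0]])  # toy column with one large value

# z = (x - mean) / std, using the population standard deviation (ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0)
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
```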

# Preprocessing

In [13]:
X = df.drop(["Fall"], axis=1)
y = df[["Fall"]]

print(f"Number of samples: {X.shape[0]}, Number of features: {X.shape[1]}")


Number of samples: 18170, Number of features: 5


# Split the data into Train and Test

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42,
                                                    test_size=0.1)
print(f"Number of Train Samples: {X_train.shape[0]}")
print(f"Number of Test Samples: {X_test.shape[0]}")


Number of Train Samples: 16353
Number of Test Samples: 1817


# Feature Scaling

In [15]:
scaler = StandardScaler()
scaler.fit(X_train)

In [16]:
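# The scaler was fitted on the training set only; reuse it for both splits
# so that no test-set statistics leak into the preprocessing.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)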
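
An alternative to handling the scaler by hand is to place it inside a `Pipeline`, so that it is refitted on the training portion of every cross-validation fold. The following is a minimal sketch on synthetic data; the dataset and the choice of `LogisticRegression` are only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five features of this notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)

# The scaler is a pipeline step, so each fold fits it on its own training part.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```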

# Modeling

## Random Forest

**Build GridSearchCV Pipeline for Hyperparameter Tuning**

In [17]:
rf_pipeline = Pipeline([
    ("rf", RandomForestClassifier(random_state=42))
])
rf_params = {
    "rf__n_estimators": [100, 200, 300],
    "rf__max_depth": [None, 5, 10]
}
rf_grid = GridSearchCV(rf_pipeline, rf_params, cv=10)
rf_grid.fit(X_train, y_train.values.reshape(-1,))
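
By default `GridSearchCV` ranks classifier candidates by accuracy. Since the classes here are imbalanced, one option is to pass `scoring="f1"` so that the search optimizes the F1-score instead. The sketch below shows the idea on synthetic data with a deliberately small grid; it is an illustration, not the configuration used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 75% / 25%).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, weights=[0.75, 0.25], random_state=42
)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [20, 50], "max_depth": [None, 5]},
    scoring="f1",  # rank candidates by F1-score instead of accuracy
    cv=3,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```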

In [19]:
def plot_learning_curve(train_sizes, train_mean, train_std, test_mean, test_std):
    """Plot the learning curve to check whether the classifier is underfitting or overfitting.

    Args:
        train_sizes (np.array): Number of training examples at each step.
        train_mean (np.array): Mean training accuracy across folds.
        train_std (np.array): Standard deviation of the training accuracy across folds.
        test_mean (np.array): Mean validation accuracy across folds.
        test_std (np.array): Standard deviation of the validation accuracy across folds.

    """
    # Create the figure
    fig = go.Figure()

    # Add the training accuracy trace
    fig.add_trace(go.Scatter(
        x=train_sizes,
        y=train_mean,
        mode='markers+lines',
        marker=dict(color='blue', size=5),
        name='Training accuracy'
    ))

    # Add the shaded band (mean +/- one std) around the training accuracy.
    # The arrays are NumPy arrays, so they must be concatenated (forward, then reversed)
    # to trace a closed polygon; "+" would add them element-wise.
    band_x = np.concatenate([train_sizes, train_sizes[::-1]])
    fig.add_trace(go.Scatter(
        x=band_x,
        y=np.concatenate([train_mean + train_std, (train_mean - train_std)[::-1]]),
        fill='toself',
        mode='none',
        fillcolor='rgba(0, 0, 255, 0.15)',
        line=dict(color='rgba(255, 255, 255, 0)'),
        name='Training accuracy range'
    ))

    # Add the validation accuracy trace
    fig.add_trace(go.Scatter(
        x=train_sizes,
        y=test_mean,
        mode='markers+lines',
        marker=dict(color='green', size=5, symbol='square-open-dot'),
        line=dict(dash='dash'),
        name='Validation accuracy'
    ))

    # Add the shaded band (mean +/- one std) around the validation accuracy
    fig.add_trace(go.Scatter(
        x=band_x,
        y=np.concatenate([test_mean + test_std, (test_mean - test_std)[::-1]]),
        fill='toself',
        mode='none',
        fillcolor='rgba(0, 128, 0, 0.15)',
        line=dict(color='rgba(255, 255, 255, 0)'),
        name='Validation accuracy range'
    ))

    # Set layout properties
    fig.update_layout(
        title='Accuracy vs. Number of training examples',
        xaxis_title='Number of training examples',
        yaxis_title='Accuracy',
        legend=dict(x=1.0, y=0.1),
        yaxis=dict(range=[0.4, 1.03]),
        template='plotly_white'
    )
    fig.show()


In [21]:
train_sizes, train_scores, test_scores = learning_curve(estimator=RandomForestClassifier(random_state=42,
                                                                      max_depth=None, n_estimators=300), 
                                                        X=X_train, y=y_train.values.reshape(-1, ), 
                                                        train_sizes=np.linspace(0.1, 1.0, 10), 
                                                        cv=10, n_jobs=-1)
train_mean, train_std = np.mean(train_scores, axis=1), np.std(train_scores, axis=1)
test_mean, test_std = np.mean(test_scores, axis=1), np.std(test_scores, axis=1)

In [29]:
plot_learning_curve(train_sizes, train_mean, train_std, test_mean, test_std)

In [20]:
rf_grid.best_params_

{'rf__max_depth': None, 'rf__n_estimators': 300}

In [22]:
rf_grid.best_score_

0.943436740614462

Feature Importance

In [23]:
def plot_feature_importance(feature_importance):
    fig = go.Figure()

    fig.add_trace(go.Bar(x=feature_importance["feature"], 
                         y=feature_importance["importance"]))

    # Add annotations: importances are fractions in [0, 1], so format them as
    # percentages and place each label in the middle of its bar.
    for feat, imp in zip(feature_importance["feature"], feature_importance["importance"]):
        fig.add_annotation(
            x=feat,
            y=imp / 2,
            text=f"{imp:.1%}",
            showarrow=False,
            font=dict(color="white", size=12)
        )

    fig.update_layout(
        title="Feature Importance",
        xaxis_title="Features",
        yaxis_title="Importance (fraction)",
        width=900,
        height=500,
    )

    fig.show()

In [24]:
rf = RandomForestClassifier(max_depth=None, n_estimators=300)
rf.fit(X_train, y_train.values.reshape(-1, ))

In [25]:
features_importances = pd.DataFrame(rf.feature_importances_,
                                    index=X.columns,
                                    columns=["importance"]).sort_values("importance", ascending=False)
features_importances = features_importances.reset_index()
features_importances = features_importances.rename(columns={"index": "feature"})
display(features_importances)

plot_feature_importance(features_importances)     

Unnamed: 0,feature,importance
0,feat_1,0.272319
1,feat_4,0.256113
2,feat_5,0.198946
3,feat_3,0.174424
4,feat_2,0.098199


The importances show that feat_2 contributes the least information (about 0.10), so it is the most likely candidate to remove from the classifier.
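
Impurity-based importances from a random forest can be biased toward high-cardinality features, so one way to cross-check a ranking like the one above before dropping a feature is permutation importance. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; it is a swapped-in technique for illustration, not something this notebook runs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in with five features.
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the mean drop in accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42)
for i, mean_drop in enumerate(result.importances_mean):
    print(f"feature {i}: {mean_drop:.3f}")
```

A feature whose shuffling barely changes the score is a safer candidate for removal than one judged by impurity alone.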

Evaluate the Performance of Test Set

In [26]:
rf_pred = rf_grid.predict(X_test)
acc = accuracy_score(y_test.values.reshape(-1, ), rf_pred)
f1 = f1_score(y_test.values.reshape(-1, ), rf_pred)
recall = recall_score(y_test.values.reshape(-1, ), rf_pred)
precision = precision_score(y_test.values.reshape(-1, ), rf_pred)

print(f"Accuracy: {acc:.4f}, F1-score: {f1:.4f}, \n\
      Recall: {recall:.4f}, Precision: {precision:.4f}")

Accuracy: 0.9389, F1-score: 0.8702, 
      Recall: 0.8194, Precision: 0.9277


With Random Forest the model reaches an F1-score of 87%. F1 is a suitable metric here because it is the harmonic mean of precision and recall, so it penalizes both false positives and false negatives.
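
As a quick check of the harmonic-mean relationship, the F1-score can be recomputed from the precision and recall printed above. The variable names are chosen for this example only.

```python
precision_rf = 0.9277  # precision reported for Random Forest above
recall_rf = 0.8194     # recall reported for Random Forest above

# F1 is the harmonic mean of precision and recall.
f1_rf = 2 * precision_rf * recall_rf / (precision_rf + recall_rf)
print(f"F1-score: {f1_rf:.4f}")  # F1-score: 0.8702
```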

In [27]:
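def plot_roc_curve():
    """Plot the ROC curve from the notebook-level variables `fpr`, `tpr`, and `auc_score`,
    which must be computed before calling this function."""
    # Create the ROC curve trace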
    roc_trace = go.Scatter(
        x=fpr,
        y=tpr,
        mode='lines',
        line=dict(color='blue', width=2),
        name=f'ROC curve (AUC = {auc_score:.2f})'
    )

    # Create the diagonal trace (the gray dashed line)
    diagonal_trace = go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode='lines',
        line=dict(color='gray', dash='dash'),
        showlegend=False
    )

    # Create the layout
    layout = go.Layout(
        title='Receiver Operating Characteristic (ROC) Curve',
        xaxis=dict(title='False Positive Rate'),
        yaxis=dict(title='True Positive Rate'),
        width=800,
        height=600,
        legend=dict(x=0.02, y=0.98, bordercolor='Black', borderwidth=1)
    )

    # Create the figure and add the traces
    fig = go.Figure(data=[roc_trace, diagonal_trace], 
                    layout=layout
    )

    # Show the plot
    fig.show()

y_pred_prob = rf_grid.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test.values.reshape(-1, ), y_pred_prob)
auc_score = roc_auc_score(y_test.values.reshape(-1,), y_pred_prob)
plot_roc_curve()
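
To make the two calls above more concrete, here is a minimal sketch of `roc_curve` and `roc_auc_score` on four hand-made scores. Three of the four positive/negative pairs are ranked correctly, which gives an AUC of 0.75. The `*_demo` names are placeholders.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

# roc_curve sweeps the decision threshold and returns one (FPR, TPR) point per step.
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true, scores)
auc_demo = roc_auc_score(y_true, scores)
print(f"AUC = {auc_demo:.2f}")  # AUC = 0.75
```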

## GradientBoosting

Build a GridSearchCV for Hyperparameter Tuning

In [30]:
grad_boost_pipeline = Pipeline([
    ("grad_boost", GradientBoostingClassifier(random_state=42))
])
grad_boost_params = {
    "grad_boost__n_estimators": [50, 100, 200, 500, 1000],
    "grad_boost__learning_rate": [0.1, 1, 10],
    "grad_boost__max_depth": [2, 4, 8, 16, 32]
}
grad_boost_grid = GridSearchCV(grad_boost_pipeline, grad_boost_params, cv=10)
grad_boost_grid.fit(X_train, y_train.values.reshape(-1,))

In [32]:
train_sizes, train_scores, test_scores = learning_curve(estimator=GradientBoostingClassifier(random_state=42,
                                                                      learning_rate=0.1, max_depth=16, n_estimators=1000), 
                                                        X=X_train, y=y_train.values.reshape(-1, ), 
                                                        train_sizes=np.linspace(0.1, 1.0, 10), 
                                                        cv=10, n_jobs=-1)
train_mean, train_std = np.mean(train_scores, axis=1), np.std(train_scores, axis=1)
test_mean, test_std = np.mean(test_scores, axis=1), np.std(test_scores, axis=1)

In [33]:
plot_learning_curve(train_sizes, train_mean, train_std, test_mean, test_std)

In [97]:
grad_boost_grid.best_params_

{'grad_boost__learning_rate': 0.1,
 'grad_boost__max_depth': 16,
 'grad_boost__n_estimators': 1000}

In [98]:
grad_boost_grid.best_score_

0.9458210523167567

Evaluate the Performance of Test Set

In [99]:
grad_boost_pred = grad_boost_grid.predict(X_test)
acc = accuracy_score(y_test.values.reshape(-1, ), grad_boost_pred)
f1 = f1_score(y_test.values.reshape(-1, ), grad_boost_pred)
recall = recall_score(y_test.values.reshape(-1, ), grad_boost_pred)
precision = precision_score(y_test.values.reshape(-1, ), grad_boost_pred)

print(f"Accuracy: {acc:.4f}, F1-score: {f1:.4f}, \n\
      Recall: {recall:.4f}, Precision: {precision:.4f}")

Accuracy: 0.9411, F1-score: 0.8746, 
      Recall: 0.8216, Precision: 0.9348


The results are consistent with, and slightly better than, those of the Random Forest classifier.

In [34]:
y_pred_prob = grad_boost_grid.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test.values.reshape(-1, ), y_pred_prob)
auc_score = roc_auc_score(y_test.values.reshape(-1,), y_pred_prob)
plot_roc_curve()

## XGBoost

Build a GridSearchCV for Hyperparameter Tuning

In [31]:
xgboost_pipeline = Pipeline([
    ("xgboost", XGBClassifier(random_state=42))
])
xgboost_params = {
    "xgboost__n_estimators": [50, 100, 200, 500, 1000],
    "xgboost__learning_rate": [0.001, 0.01, 0.05, 0.1],
    "xgboost__max_depth": [3, 6, 10],
    "xgboost__colsample_bytree": [0.3, 0.7]
}
xgboost_grid = GridSearchCV(xgboost_pipeline, xgboost_params, cv=10)
xgboost_grid.fit(X_train, y_train.values.reshape(-1, ))

In [35]:
xgboost_grid.best_params_

{'xgboost__colsample_bytree': 0.7,
 'xgboost__learning_rate': 0.1,
 'xgboost__max_depth': 10,
 'xgboost__n_estimators': 1000}

In [36]:
train_sizes, train_scores, test_scores = learning_curve(estimator=XGBClassifier(random_state=42,
                                                                      colsample_bytree=0.7, learning_rate=0.1, max_depth=10, n_estimators=1000), 
                                                        X=X_train, y=y_train.values.reshape(-1, ), 
                                                        train_sizes=np.linspace(0.1, 1.0, 10), 
                                                        cv=10, n_jobs=-1)
train_mean, train_std = np.mean(train_scores, axis=1), np.std(train_scores, axis=1)
test_mean, test_std = np.mean(test_scores, axis=1), np.std(test_scores, axis=1)

In [37]:
plot_learning_curve(train_sizes, train_mean, train_std, test_mean, test_std)

Evaluate the performance of Test Set

In [40]:
xgboost_pred = xgboost_grid.predict(X_test)
acc = accuracy_score(y_test.values.reshape(-1, ), xgboost_pred)
f1 = f1_score(y_test.values.reshape(-1, ), xgboost_pred)
recall = recall_score(y_test.values.reshape(-1, ), xgboost_pred)
precision = precision_score(y_test.values.reshape(-1, ), xgboost_pred)

print(f"Accuracy: {acc:.4f}, F1-score: {f1:.4f}, \n\
      Recall: {recall:.4f}, Precision: {precision:.4f}")

Accuracy: 0.9345, F1-score: 0.8640, 
      Recall: 0.8326, Precision: 0.8979


With XGBoost, accuracy and F1-score are slightly lower than with the Gradient Boosting algorithm.

In [39]:
y_pred_prob = xgboost_grid.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test.values.reshape(-1, ), y_pred_prob)
auc_score = roc_auc_score(y_test.values.reshape(-1,), y_pred_prob)
plot_roc_curve()