# Executive Summary

This report details the end-to-end development of a machine learning model to predict gender from footprint landmarks, fulfilling a request from a local police department. The project follows the CRISP-DM framework, encompassing data cleaning, extensive feature engineering, systematic model evaluation, and hyperparameter tuning.

*   **Problem:** Predict gender with >90% accuracy from footprint data for forensic analysis.
*   **Key Technique:** Advanced feature engineering based on podiatric research was the most critical factor for success.
*   **Final Model:** An XGBoost classifier, tuned with Bayesian Optimization.
*   **Result:** The final model achieved a **Kaggle private score of 0.9067**, successfully exceeding the 90% accuracy target and demonstrating a robust, explainable solution.

## Business Understanding

The local police department has requested an automated binary classifier that determines the sex of an individual from footprints left at a crime scene. The model must predict with reasonably high accuracy within a limited time, because it will run on a new device and help the investigation team narrow down suspects in the early stages of an investigation.

To meet these targets, we were given a dataset of 18 landmarks per footprint, each recorded as an x and y coordinate. The report below examines the data and its findings, explains the decisions made at each step, and gives recommendations for improvements and future work.

## step 0: Preparing

In step zero, we set up the components the rest of the notebook needs to run without errors.

### local RUN setup

In [None]:
import zipfile

In [None]:
%pip install --upgrade kaggle pandas joblib numpy matplotlib seaborn xgboost scipy statsmodels scikit-learn shap imbalanced-learn scikit-optimize

### Import List

In [None]:
import kaggle
import pandas as pd
from joblib import dump, load
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBClassifier, plot_importance
from scipy import stats
from scipy.stats import spearmanr
from scipy.stats.mstats import winsorize
from sklearn import linear_model, metrics
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, make_scorer,
    classification_report, confusion_matrix,
)
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # must be imported before IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler, normalize
from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB
from sklearn.ensemble import (
    GradientBoostingClassifier, AdaBoostClassifier, RandomForestClassifier, IsolationForest,
)
from sklearn.neighbors import KNeighborsClassifier, LocalOutlierFactor
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

### function list

The function list holds the helper functions used later for feature engineering and outlier handling.

In [None]:
X_train_variants = {}
X_test_variants = {}
y_train_variants = {}
y_test_variants = {}

In [None]:
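def my_plot_importance(booster, figsize, **kwargs):
    # Draw on an axis of the requested size and forward options such as importance_type
    fig, ax = plt.subplots(figsize=figsize)
    return plot_importance(booster=booster, ax=ax, **kwargs)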

In [None]:
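def plot_footprint(footprint_row, title, width=2240, height=3200):
    # footprint_row holds normalised (0-1) coordinates; scale them back to pixels
    x_values = [footprint_row[f'x{i}'] * width for i in range(18)]
    y_values = [footprint_row[f'y{i}'] * height for i in range(18)]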

    plt.figure(figsize=(8, 12))
    plt.scatter(x_values, y_values, color='red')

    for i, (x, y) in enumerate(zip(x_values, y_values)):
        plt.text(x, y, str(i), fontsize=12, color='black', ha='right', va='bottom')

    plt.xlabel('Width (pixels)')
    plt.ylabel('Height (pixels)')
    plt.title(title)
    plt.gca().invert_yaxis()
    plt.show()

distance calculation

In [None]:
def euclidean_distance(df, x1, y1, x2, y2):
    return np.sqrt((df[x1] - df[x2])**2 + (df[y1] - df[y2])**2)

lengths and widths

In [None]:
def lengths_widths_calculation(df):
    df_lengths_widths = df.copy()
    df_lengths_widths['lengths'] = euclidean_distance(df_lengths_widths, 'x1', 'y1', 'x9', 'y9')
    df_lengths_widths['widths'] = euclidean_distance(df_lengths_widths, 'x5', 'y5', 'x15', 'y15')
    return df_lengths_widths

Seven-point footprint dimensions

In [None]:
def point7_calculation(df):
    df_7_point_footprints = df.copy()
    df_7_point_footprints['T1'] = euclidean_distance(df_7_point_footprints, 'x0', 'y0', 'x9', 'y9')
    df_7_point_footprints['T2'] = euclidean_distance(df_7_point_footprints, 'x1', 'y1', 'x9', 'y9')
    df_7_point_footprints['T3'] = euclidean_distance(df_7_point_footprints, 'x2', 'y2', 'x9', 'y9')
    df_7_point_footprints['T4'] = euclidean_distance(df_7_point_footprints, 'x3', 'y3', 'x9', 'y9')
    df_7_point_footprints['T5'] = euclidean_distance(df_7_point_footprints, 'x4', 'y4', 'x9', 'y9')
    df_7_point_footprints['BAB'] = euclidean_distance(df_7_point_footprints, 'x5', 'y5', 'x15', 'y15')
    df_7_point_footprints['BAH'] = euclidean_distance(df_7_point_footprints, 'x8', 'y8', 'x10', 'y10')

    return df_7_point_footprints

IQR outlier capping function

In [None]:
def IQR(df):
    df_outlier_IQR = df.copy()
    for column in df_outlier_IQR.columns:
        Q1 = df_outlier_IQR[column].quantile(0.25)
        Q3 = df_outlier_IQR[column].quantile(0.75)
        iqr = Q3 - Q1  # local name kept distinct from the IQR() function itself
        lower_bound = Q1 - 3 * iqr
        upper_bound = Q3 + 3 * iqr

        # Cap values outside the bounds rather than dropping rows
        df_outlier_IQR[column] = df_outlier_IQR[column].clip(lower_bound, upper_bound)

    return df_outlier_IQR

Cap Outliers and Apply Robust and standard Scaling

In [None]:
def cap_outliers_and_scale(df):
    df_outlier_capped_scale = df.copy()

    for column in df_outlier_capped_scale:
        Q1 = df_outlier_capped_scale[column].quantile(0.10)
        Q3 = df_outlier_capped_scale[column].quantile(0.80)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df_outlier_capped_scale[column] = np.clip(df_outlier_capped_scale[column], lower_bound, upper_bound)

    robust_scaler = RobustScaler()
    df_outlier_capped_scale = robust_scaler.fit_transform(df_outlier_capped_scale)

    standard_scaler = StandardScaler()
    df_outlier_capped_scale = standard_scaler.fit_transform(df_outlier_capped_scale)

    # Use the input frame's own columns instead of relying on the global X
    return pd.DataFrame(df_outlier_capped_scale, columns=df.columns)

Winsorization

In [None]:
def Winsorization(df):
    df_winsorized = df.copy()
    for column in df_winsorized.columns:
        df_winsorized[column] = winsorize(df_winsorized[column], limits=[0.003, 0.004])
    return df_winsorized

z_score

In [None]:
def z_score(df):
    df_z_score = df.copy()
    z_threshold = 4
    for column in df_z_score.columns:
        # nan_policy='omit' stops missing values from turning every z-score into NaN
        z_scores = stats.zscore(df_z_score[column], nan_policy='omit')
        median_value = df_z_score[column].median()
        # Replace values more than z_threshold standard deviations from the mean
        df_z_score[column] = np.where(np.abs(z_scores) > z_threshold, median_value, df_z_score[column])
    return df_z_score

isolation_forest

In [None]:
def isolation_forest(df, contamination=0.05):
    df_isolation = df.copy()
    model = IsolationForest(contamination=contamination, random_state=42)

    model.fit(df_isolation)

    outlier_predictions = model.predict(df_isolation)

    for column in df_isolation.columns:
        median_value = df_isolation[column].median()
        df_isolation[column] = np.where(outlier_predictions == -1, median_value, df_isolation[column])

    return df_isolation

## step 1: Understanding the data


Although all landmarks are provided, not every one of them will necessarily help the model. We therefore apply feature engineering, which covers both adding new features and selecting features to improve learning; the details are discussed in a later section.

The coordinates have been normalized to the range 0 to 1. If needed, we can recover the original pixel values by multiplying x by 2240 and y by 3200, which makes the data easier to interpret.

The dataset contains 2,000 entries for training the model. Columns x1 to y17 each have between 6 and 17 missing values that must be handled to ensure data quality; we experiment with different imputation methods in step 3.

In [None]:
footprints_data = pd.read_csv('SexLandmarks-train.csv')
print(footprints_data.info())
footprints_data.head()

In [None]:
footprints_data.isnull().sum()

The first box plot, "Box Plots for Outliers", shows that outliers are present in the dataset, but it is hard to draw meaningful information from it directly. For early outlier handling, we therefore scale the normalized data back to pixels and calculate a basic length and width for each footprint. This makes extreme outliers easier to identify and, if needed, to correct manually. It also lets us clean the data more consistently, since leaving unrealistic extreme outliers in place would most likely hurt the robustness of the dataset and the effectiveness of the deployed model.

In [None]:
plt.figure(figsize=(15, 10))
sns.boxplot(data=footprints_data)
plt.xticks(rotation=90)
plt.title("Box Plots for Outliers")
plt.show()

In [None]:
width, height = 2240, 3200

original_scaled_data = footprints_data.copy()

for column in original_scaled_data.columns:
    if column.startswith('x'):
        original_scaled_data[column] = original_scaled_data[column] * width
    elif column.startswith('y'):
        original_scaled_data[column] = original_scaled_data[column] * height

print(original_scaled_data.head())

In [None]:
plt.figure(figsize=(15, 10))
sns.boxplot(data=original_scaled_data)
plt.xticks(rotation=90)
plt.title("Box Plots for Outliers")
plt.show()

In [None]:
original_scaled_data_with_lengths_widths = lengths_widths_calculation(original_scaled_data)

In [None]:
for name, group in original_scaled_data_with_lengths_widths.groupby('sex'):
    plt.plot(group.lengths, group.widths, '.', label=name)
plt.legend()

The length-versus-width scatter plots show the size of each footprint, and extreme outliers are visible. We now check whether these outliers should be removed or corrected, based on whether their landmarks look realistic.

In [None]:
lengths_upper_threshold = original_scaled_data_with_lengths_widths['lengths'].quantile(0.95)
lengths_lower_threshold = original_scaled_data_with_lengths_widths['lengths'].quantile(0.05)
widths_upper_threshold = original_scaled_data_with_lengths_widths['widths'].quantile(0.95)
widths_lower_threshold = original_scaled_data_with_lengths_widths['widths'].quantile(0.05)


big_feet = original_scaled_data_with_lengths_widths[
    (original_scaled_data_with_lengths_widths['lengths'] > lengths_upper_threshold) |
    (original_scaled_data_with_lengths_widths['widths'] > widths_upper_threshold)
]

small_feet = original_scaled_data_with_lengths_widths[
    (original_scaled_data_with_lengths_widths['lengths'] < lengths_lower_threshold) |
    (original_scaled_data_with_lengths_widths['widths'] < widths_lower_threshold)
]


print("Big Feet Data Points:")
print(big_feet)

print("\nSmall Feet Data Points:")
print(small_feet)


import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))


plt.scatter(original_scaled_data_with_lengths_widths['lengths'], original_scaled_data_with_lengths_widths['widths'], alpha=0.5, label='Normal Data')


plt.scatter(big_feet['lengths'], big_feet['widths'], color='red', label='Big Feet Outliers', edgecolor='black')
plt.scatter(small_feet['lengths'], small_feet['widths'], color='yellow', label='Small Feet Outliers', edgecolor='black')

plt.xlabel('Lengths')
plt.ylabel('Widths')
plt.legend()
plt.show()

In [None]:
small_foot_1 = small_feet[
    (small_feet['lengths'] > 1600) & (small_feet['lengths'] < 1800) &
    (small_feet['widths'] > 0) & (small_feet['widths'] < 500)
]

small_foot_2 = small_feet[
    (small_feet['lengths'] > 2000) & (small_feet['lengths'] < 2150) &
    (small_feet['widths'] > 400) & (small_feet['widths'] < 600)
]

big_foot_1 = big_feet[
    (big_feet['lengths'] > 3100) & (big_feet['lengths'] < 3300) &
    (big_feet['widths'] > 1900) & (big_feet['widths'] < 2100)
]

big_foot_2 = big_feet[
    (big_feet['lengths'] > 2200) & (big_feet['lengths'] < 2300) &
    (big_feet['widths'] > 100) & (big_feet['widths'] < 1100)
]


As the plots below show, the landmarks of the two small feet are spread in a way that can hardly be recognized as a human footprint, so dropping these data points should improve the dataset.

Both big feet, on the other hand, show a normal spread, so they are kept.

In [None]:
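# The rows in small_feet / big_feet are already in pixels, while plot_footprint rescales
# normalised coordinates itself, so the normalised rows from footprints_data are plotted
plot_footprint(footprints_data.loc[small_foot_1.index[0]], 'Small Foot 1 (Length ~ 1780, Width ~ 200)')

plot_footprint(footprints_data.loc[small_foot_2.index[0]], 'Small Foot 2 (Length ~ 2080, Width ~ 500)')

plot_footprint(footprints_data.loc[big_foot_1.index[0]], 'Big Foot 1 (Length ~ 3200, Width ~ 2000)')

plot_footprint(footprints_data.loc[big_foot_2.index[0]], 'Big Foot 2 (Length ~ 2230, Width ~ 1100)')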

In [None]:
indices_to_drop = [small_foot_1.index[0], small_foot_2.index[0]]

footprints_data = footprints_data.drop(index=indices_to_drop)

In [None]:
footprints_data.describe().T

The count plot below shows a class imbalance in the dataset. It is best practice to address this with the Synthetic Minority Over-sampling Technique (SMOTE) to reduce model bias: SMOTE generates synthetic samples for the minority class, which balances the dataset and improves the model's ability to generalize across both classes.
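The interpolation idea behind SMOTE can be sketched with NumPy alone. This is a minimal illustration, not the notebook's method: the function name `smote_sketch` and the toy array are assumptions made for the example. In practice a library implementation such as `imblearn.over_sampling.SMOTE` would normally be used, and it should be fitted on the training split only so that synthetic points never reach the test set.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances within the minority class
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(n)                  # random minority sample
        j = neighbours[i, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                   # position along the segment i -> j
        synthetic[row] = X_minority[i] + gap * (X_minority[j] - X_minority[i])
    return synthetic

# Toy minority class with two features (hypothetical values, not project data)
X_min = np.array([[0.20, 0.30], [0.25, 0.35], [0.30, 0.32], [0.22, 0.40]])
X_new = smote_sketch(X_min, n_new=6, k=2)
print(X_new.shape)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies.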

In [None]:
barplot=(sns.countplot(data= footprints_data, x='sex',hue='sex', palette=['b', 'g']))
plt.title('0 v/s 1\n')

In [None]:
corr = footprints_data.corr(method='spearman')

triangle = np.triu(corr)

plt.figure(figsize=(16, 7))
sns.heatmap(data=corr, annot=True, mask=triangle, vmin=-1, vmax=1, cmap='RdBu_r', linewidths=.5, fmt= '.1f') # with 1 decimal precision

In [None]:
plt.figure(figsize=(20,12))
sns.set_context('notebook',font_scale = 1)
sns.heatmap(footprints_data.corr(),annot=True,linewidth =0.5)
plt.tight_layout()

In [None]:
ax = sns.heatmap(
    corr,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);
ax


The dataset contains no duplicated rows, so no action is needed.

In [None]:
footprints_data.duplicated().value_counts()

## step 2: data processing







### outliers handling

In [None]:
footprints_data_df = footprints_data.copy()
footprints_data_df.describe().T

We use four methods to handle outliers. We do not drop outliers because, as seen above, they contain meaningful data and dropping them could lose important patterns.

1. Basic IQR method:
* The interquartile range (IQR) is a standard technique for identifying outliers. Values outside the bounds are capped to the bounds to limit their range.

2. Cap outliers and apply robust scaling:
* Similar to the IQR method, but followed by robust and standard scaling to put the features on a common scale.

3. Winsorization:
* Limits extreme values by capping them at specified percentiles.

4. Z-score:
* Uses the standard deviation to identify outliers, which are then replaced with the median to reduce their effect.
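As a small worked example of IQR capping, the sketch below learns the bounds on a training frame and reuses them on a test frame. This is a variation, offered as an assumption rather than the notebook's approach: the notebook's `IQR` helper computes bounds on whichever frame it receives. The helper names `fit_iqr_bounds` and `apply_iqr_bounds` and the toy values are hypothetical.

```python
import pandas as pd

def fit_iqr_bounds(train_df, factor=3.0):
    """Compute per-column capping bounds from the training data only."""
    q1 = train_df.quantile(0.25)
    q3 = train_df.quantile(0.75)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

def apply_iqr_bounds(df, lower, upper):
    """Clip every column to the bounds learned on the training data."""
    return df.clip(lower=lower, upper=upper, axis=1)

train = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0, 100.0]})
test = pd.DataFrame({"x1": [2.5, -50.0, 60.0]})

# Q1 = 2, Q3 = 4, IQR = 2, so the bounds are 2 - 3*2 = -4 and 4 + 3*2 = 10
lower, upper = fit_iqr_bounds(train)
train_capped = apply_iqr_bounds(train, lower, upper)
test_capped = apply_iqr_bounds(test, lower, upper)

print(lower["x1"], upper["x1"])       # -4.0 10.0
print(train_capped["x1"].tolist())    # [1.0, 2.0, 3.0, 4.0, 10.0]
print(test_capped["x1"].tolist())     # [2.5, -4.0, 10.0]
```

Reusing the training bounds keeps the test split from influencing its own preprocessing, which is the same reasoning the notebook applies to imputation in step 3.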

In [None]:
footprints_data_df = footprints_data.copy()

#### use IQR for outliers

In [None]:
footprints_data_df = footprints_data.copy()
footprints_data_df.describe().T

In [None]:
X = footprints_data_df.drop('sex', axis=1)
y = footprints_data_df['sex']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train_iqr = IQR(X_train)
X_test_iqr = IQR(X_test)

y_train_iqr = y_train.loc[X_train_iqr.index]
y_test_iqr = y_test.loc[X_test_iqr.index]

In [None]:
X_train_variants['IQR'] = X_train_iqr
X_test_variants['IQR'] = X_test_iqr
y_train_variants['IQR'] = y_train_iqr
y_test_variants['IQR'] = y_test_iqr

In [None]:
import matplotlib.pyplot as plt
lengths_widths_df_iqr = pd.concat([X_train_iqr, y_train_iqr], axis=1)

for name, group in lengths_widths_df_iqr.groupby('sex'):
    plt.plot(group.x1, group.x14, '.', label=name)
plt.legend()

In [None]:
X_train_iqr.describe()

#### Cap Outliers and Apply Robust and standard Scaling

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train_robust = cap_outliers_and_scale(X_train)
X_test_robust = cap_outliers_and_scale(X_test)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

y_train_robust = y_train.loc[X_train_robust.index]
y_test_robust = y_test.loc[X_test_robust.index]

In [None]:
X_train_variants['RobustScaling'] = X_train_robust
X_test_variants['RobustScaling'] = X_test_robust
y_train_variants['RobustScaling'] = y_train_robust
y_test_variants['RobustScaling'] = y_test_robust

In [None]:
lengths_widths_df_robust = pd.concat([X_train_robust, y_train_robust], axis=1)

for name, group in lengths_widths_df_robust.groupby('sex'):
    plt.plot(group.x1, group.x14, '.', label=name)
plt.legend()

In [None]:
lengths_widths_df_robust.describe()

So far we have repeated the earlier grouped plots and applied two of the four outlier-handling methods; the remaining two follow.

#### Winsorization


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train_Winsorization = Winsorization(X_train)
X_test_Winsorization = Winsorization(X_test)

y_train_Winsorization = y_train.loc[X_train_Winsorization.index]
y_test_Winsorization = y_test.loc[X_test_Winsorization.index]

In [None]:
X_train_variants['Winsorization'] = X_train_Winsorization
X_test_variants['Winsorization'] = X_test_Winsorization
y_train_variants['Winsorization'] = y_train_Winsorization
y_test_variants['Winsorization'] = y_test_Winsorization

In [None]:
lengths_widths_df_Winsorization = pd.concat([X_train_Winsorization, y_train_Winsorization], axis=1)

for name, group in lengths_widths_df_Winsorization.groupby('sex'):
    plt.plot(group.x1, group.x14, '.', label=name)
plt.legend()

In [None]:
lengths_widths_df_Winsorization.describe()

#### Z-score


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train_z_score = z_score(X_train)
X_test_z_score = z_score(X_test)

# z_score() keeps the original row index, so the labels are aligned on it directly
y_train_z_score = y_train.loc[X_train_z_score.index]
y_test_z_score = y_test.loc[X_test_z_score.index]

In [None]:
X_train_variants['z_score'] = X_train_z_score
X_test_variants['z_score'] = X_test_z_score
y_train_variants['z_score'] = y_train_z_score
y_test_variants['z_score'] = y_test_z_score

In [None]:
lengths_widths_df_z_score = pd.concat([X_train_z_score, y_train_z_score], axis=1)

for name, group in lengths_widths_df_z_score.groupby('sex'):
    plt.plot(group.x1, group.x14, '.', label=name)
plt.legend()

In [None]:
lengths_widths_df_z_score.describe()

## step 3: missing value handle

In this step, we group the datasets produced by each outlier-handling method and assign each a key for easier management. We then apply the KNN imputer and the iterative imputer. Each imputer is fitted on the training split and then applied to the test split, which avoids data leakage while staying efficient.
These two imputation methods were chosen because:

1. KNN Imputer:
* Fills each missing value with the average of the values from the nearest neighbours, which fills the gaps while keeping the patterns shared with those neighbours.

2. Iterative Imputer:
* Predicts each missing value from the other features by running a regression repeatedly over several rounds.
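The fit-on-train, transform-on-test pattern for both imputers can be sketched on a toy frame. The column names and values below are made up for illustration, and `n_neighbors=2` is chosen only because the toy frame is tiny; the other settings mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

# Hypothetical normalised coordinates with a few gaps
train = pd.DataFrame({"x1": [0.10, 0.20, np.nan, 0.40, 0.50],
                      "y1": [0.15, 0.25, 0.35, np.nan, 0.55]})
test = pd.DataFrame({"x1": [np.nan, 0.30], "y1": [0.20, np.nan]})

imputers = {
    "knn": KNNImputer(n_neighbors=2),
    "iterative": IterativeImputer(max_iter=6, random_state=0),
}

imputed = {}
for name, imputer in imputers.items():
    # Fit on the training split only, then reuse the fitted imputer on the test split
    train_filled = pd.DataFrame(imputer.fit_transform(train),
                                columns=train.columns, index=train.index)
    test_filled = pd.DataFrame(imputer.transform(test),
                               columns=test.columns, index=test.index)
    imputed[name] = (train_filled, test_filled)
    print(name, "missing values left in test:", int(test_filled.isna().sum().sum()))
```

Passing `index=` when rebuilding the frame keeps the rows aligned with their labels, which matters whenever features and labels are later joined with `pd.concat`.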



In [None]:
datasets = {
    'IQR': (X_train_iqr, X_test_iqr, y_train_iqr, y_test_iqr),
    'RobustScaling': (X_train_robust, X_test_robust, y_train_robust, y_test_robust),
    'Winsorization': (X_train_Winsorization, X_test_Winsorization, y_train_Winsorization, y_test_Winsorization),
    'Zscore': (X_train_z_score, X_test_z_score, y_train_z_score, y_test_z_score),
}

# Instance names differ from the class names so the cell can be re-run safely
knn_imputer = KNNImputer(n_neighbors=4)
iterative_imputer = IterativeImputer(max_iter=6, random_state=0)

imputed_variants = {}

for variant_name, (X_train, X_test, y_train, y_test) in datasets.items():

    # Fit on the training split only, then reuse the fitted imputer on the test split
    X_train_imputed = pd.DataFrame(iterative_imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
    X_test_imputed = pd.DataFrame(iterative_imputer.transform(X_test), columns=X_test.columns, index=X_test.index)

    key = f"{variant_name}_Iterative"
    imputed_variants[key] = (X_train_imputed, X_test_imputed, y_train, y_test)

for variant_name, (X_train, X_test, y_train, y_test) in datasets.items():

    X_train_imputed = pd.DataFrame(knn_imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
    X_test_imputed = pd.DataFrame(knn_imputer.transform(X_test), columns=X_test.columns, index=X_test.index)

    key = f"{variant_name}_knn"
    imputed_variants[key] = (X_train_imputed, X_test_imputed, y_train, y_test)

print(imputed_variants.keys())

## step 4: baseline test before features engineering

In this step, we run a baseline test of model performance. The data has now been cleaned with the basic methods covered above and is ready for a baseline test, from which we choose the best-performing models for further development.

In [None]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=500),
    "Gaussian Naïve Bayes": GaussianNB(),
    "Support Vector Machine": SVC(kernel='rbf', degree=3, gamma='scale', max_iter=1000),
    "KNN Classifier": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
    # SGDOneClassSVM is left out: it is an unsupervised outlier detector that predicts +1/-1,
    # so its accuracy against the class labels is not a meaningful baseline
}

The baseline results show acceptable performance, considering that the data has only been cleaned with basic methods.

The table below lists the results in descending order of accuracy, along with the model, dataset variant, outlier-handling method, and imputation method. The best-performing model so far is XGBoost on the IQR_Iterative variant, which achieves an accuracy of 0.8250, suggesting that further development of XGBoost is worthwhile. Random Forest and Gradient Boosting follow, each with a different variant. The results do not yet show clearly which variant is best for reaching our goal, so more tests are needed in later steps.

In [None]:
results = []

for model_name, model in models.items():
    for dataset_name, (X_train, X_test, y_train, y_test) in imputed_variants.items():

        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        accuracy = accuracy_score(y_test, y_pred)

        results.append({
            "Model": model_name,
            "Variant": dataset_name,
            "Outlier Handling Method": dataset_name.split('_')[0],
            "Imputation Method": dataset_name.split('_')[1],
            "Accuracy": accuracy
        })

results_df = pd.DataFrame(results)

results_df = results_df.sort_values(by="Accuracy", ascending=False)

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
print(results_df)

## step 5: features engineering

In this step we implement two kinds of feature engineering. First, we calculate the basic length and width of each footprint to obtain extra measurements that describe its size.

The second approach uses a more specialized feature-extraction technique based on the research of Abledu et al. (2015), published in PLoS ONE and available through the NIH National Library of Medicine. They calculated seven dimensions: the length from each toe to the heel (T1 to T5), the breadth at ball (BAB), and the breadth at heel (BAH). With this approach they achieved high accuracy on a similar task, so we implement it alongside the basic length and width calculation.

Reference

Abledu, J. K., Abledu, G. K., Offei, E. B., and Antwi, E. M., 2015. Determination of sex from footprint dimensions in a Ghanaian population [online]. PloS one. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC4596846/ [Accessed 5 Nov 2024].
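To make the distance-based features concrete, the sketch below computes T1 and BAB for one made-up footprint. The landmark pairs (0 and 9 for T1, 5 and 15 for BAB) follow the `point7_calculation` helper in this notebook; the coordinate values are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
import numpy as np
import pandas as pd

def euclidean_distance(df, x1, y1, x2, y2):
    """Row-wise distance between landmark (x1, y1) and landmark (x2, y2)."""
    return np.sqrt((df[x1] - df[x2]) ** 2 + (df[y1] - df[y2]) ** 2)

# One made-up footprint in pixel units; only the landmarks needed here are filled in
toy = pd.DataFrame({
    "x0": [300.0], "y0": [100.0],    # landmark 0, the toe end of T1 in the helper
    "x9": [0.0], "y9": [500.0],      # landmark 9, the shared end point of T1 to T5
    "x5": [0.0], "y5": [200.0],      # landmark 5, one end of BAB
    "x15": [120.0], "y15": [200.0],  # landmark 15, the other end of BAB
})

toy["T1"] = euclidean_distance(toy, "x0", "y0", "x9", "y9")     # sqrt(300^2 + 400^2) = 500
toy["BAB"] = euclidean_distance(toy, "x5", "y5", "x15", "y15")  # horizontal gap of 120
print(toy[["T1", "BAB"]])
```

Because the notebook's coordinates are normalised separately on each axis, distances computed on the 0 to 1 values are not in pixel units; rescaling x and y first, as in step 1, gives physically comparable lengths.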

In [None]:
feature_engineered_variants = {}

for variant_name, (X_train, X_test, y_train, y_test) in imputed_variants.items():

    X_train_lengths = lengths_widths_calculation(X_train)
    X_test_lengths = lengths_widths_calculation(X_test)

    key = f"{variant_name}_lengths_widths"
    feature_engineered_variants[key] = (X_train_lengths, X_test_lengths, y_train, y_test)

for variant_name, (X_train, X_test, y_train, y_test) in imputed_variants.items():

    X_train_point7 = point7_calculation(X_train)
    X_test_point7 = point7_calculation(X_test)

    key = f"{variant_name}_point7"
    feature_engineered_variants[key] = (X_train_point7, X_test_point7, y_train, y_test)

print(feature_engineered_variants.keys())

## step 6: Testing all the model after features engineering (baseline)

Now that feature engineering is implemented, it is useful to run another baseline test to understand whether the features we created have a positive or negative impact on model learning.

In [None]:
X_train_point7, X_test_point7, y_train_point7, y_test_point7 = feature_engineered_variants['Winsorization_Iterative_lengths_widths']

In [None]:
engineered_results = []

for model_name, model in models.items():
    for dataset_name, (X_train, X_test, y_train, y_test) in feature_engineered_variants.items():

        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        accuracy = accuracy_score(y_test, y_pred)

        engineered_results.append({
            "Model": model_name,
            "Variant": dataset_name,
            "Outlier Handling Method": dataset_name.split('_')[0],
            "Imputation Method": dataset_name.split('_')[1],
            # maxsplit=2 keeps 'lengths_widths' whole instead of truncating it to 'lengths'
            "Feature Engineering": dataset_name.split('_', 2)[2],
            "Accuracy": accuracy
        })

engineered_results_df = pd.DataFrame(engineered_results)

engineered_results_df = engineered_results_df.sort_values(by="Accuracy", ascending=False)

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
print(engineered_results_df)

## step 7: Hyperparameter Tuning For model

In [None]:
print("Imputed Variants:")
print(imputed_variants.keys())

print("Feature-Engineered Variants:")
print(feature_engineered_variants.keys())

print("Holdout Imputed Variants:")
print(imputed_variants_holdout.keys())

print("Holdout Feature-Engineered Variants:")
print(holdout_feature_engineered_variants.keys())

#### Hyperparameter Tuning for XGBoost (kaggle 0.8334)

In [None]:
from sklearn.model_selection import RandomizedSearchCV

best_variant_name = 'Winsorization_knn_point7'
X_train_best, X_test_best, y_train_best, y_test_best = feature_engineered_variants[best_variant_name]

param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7, 9, 11],
    'n_estimators': [50, 100, 200, 300],
    'subsample': [0.5, 0.7, 0.9, 1],
    'colsample_bytree': [0.5, 0.7, 0.9, 1],
    'colsample_bylevel': [0.5, 0.7, 0.9, 1],
    'colsample_bynode': [0.5, 0.7, 0.9, 1],
    'min_child_weight': [1, 3, 5, 7, 10],
    'gamma': [0, 0.1, 0.3, 0.5, 1],
    'reg_lambda': [0.5, 1, 1.5],
    'reg_alpha': [0, 0.5, 1, 1.5],
    'booster': ['gbtree', 'dart'],
    'tree_method': ['auto', 'exact', 'hist']
}

xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

grid_search = RandomizedSearchCV(
    xgb, param_grid, n_iter=50, cv=3, scoring='accuracy', n_jobs=-1, verbose=2, random_state=42
)
grid_search.fit(X_train_best, y_train_best)

print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy from Randomized Search:", grid_search.best_score_)

In [None]:
best_variant_name = 'Winsorization_Iterative_lengths_widths'
X_train_best, X_test_best, y_train_best, y_test_best = feature_engineered_variants[best_variant_name]

param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.05, 0.1, 0.3],
    'max_depth': [3, 5, 7],
    'subsample': [0.5, 0.6, 1.0],
    'colsample_bytree': [0.5, 0.7, 0.9, 1.0],
    'gamma': [0, 0.1, 0.3, 0.5, 1],
    'min_child_weight': [1, 3, 5, 7],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [0.5, 1, 2, 5]
}

xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

grid_search = GridSearchCV(xgb, param_grid, cv=3, scoring='accuracy', n_jobs=-1, verbose=2)
grid_search.fit(X_train_best, y_train_best)

print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy from Grid Search:", grid_search.best_score_)

best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test_best)
accuracy_best = accuracy_score(y_test_best, y_pred_best)
print(f"Test Accuracy for Best Model: {accuracy_best:.2f}")

We run RandomizedSearchCV first and GridSearchCV second to save time: the randomized search tries far fewer combinations and narrows down the candidates before the exhaustive search.
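A quick count shows how much work each search implies. The sketch below multiplies out the GridSearchCV grid defined above and compares it with the 50 iterations given to RandomizedSearchCV, both with 3-fold cross-validation. It only counts model fits and makes no claim about run time.

```python
from math import prod

# The grid passed to GridSearchCV above
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.05, 0.1, 0.3],
    'max_depth': [3, 5, 7],
    'subsample': [0.5, 0.6, 1.0],
    'colsample_bytree': [0.5, 0.7, 0.9, 1.0],
    'gamma': [0, 0.1, 0.3, 0.5, 1],
    'min_child_weight': [1, 3, 5, 7],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [0.5, 1, 2, 5],
}

cv_folds = 3
n_combinations = prod(len(values) for values in param_grid.values())
grid_fits = n_combinations * cv_folds   # every combination is fitted once per fold
random_fits = 50 * cv_folds             # n_iter=50 sampled combinations

print(n_combinations)   # 77760
print(grid_fits)        # 233280
print(random_fits)      # 150
```

With this grid the exhaustive search needs 233,280 fits against 150 for the randomized search, which is why narrowing the grid around the randomized search's best candidates before running GridSearchCV is worthwhile.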

In [None]:
best_variant_name = 'Winsorization_knn_point7'
X_train, X_test, y_train, y_test = feature_engineered_variants[best_variant_name]
print(f"Running model for variant: {best_variant_name}")

all_accuracies = []
num_runs = 10

for i in range(num_runs):
    X_train_best, X_test_best, y_train_best, y_test_best = train_test_split(
        X_train, y_train, test_size=0.2, random_state=i
    )

    print(f"Features before fitting (run {i + 1}): {X_train_best.columns}")

    model = XGBClassifier(
        learning_rate=0.05,
        max_depth=5,
        n_estimators=100,
        subsample=0.5,
        eval_metric='logloss',
        reg_lambda=0.5,
        reg_alpha=1,
        min_child_weight=5,
        gamma=0.1,
        colsample_bytree=0.9,
        random_state=i,
    )

    model.fit(X_train_best, y_train_best)
    y_pred = model.predict(X_test_best)

    accuracy = accuracy_score(y_test_best, y_pred)
    all_accuracies.append(accuracy)

    scores = cross_validate(
        model, X_train_best, y_train_best, cv=5, return_train_score=True, return_estimator=True
    )

    print(f"Run {i + 1}:")
    print(f"Accuracy (Testing): {accuracy:.2f}")
    print(f"Accuracy (CV Mean): {np.mean(scores['test_score']):.2f} (+/- {np.std(scores['test_score']) * 2:.2f})")

conf_matrix = confusion_matrix(y_test_best, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

print("\nSummary of accuracies across runs:")
print(f"Mean accuracy over {num_runs} runs: {np.mean(all_accuracies):.2f} (+/- {np.std(all_accuracies):.2f})")

print(classification_report(y_test_best, y_pred))

This repeated training tests the robustness of the process: we get about 85% test accuracy with almost the same CV mean, which suggests the model is generalising well.

In [None]:
best_xgb = XGBClassifier(
    learning_rate=0.05,
    max_depth=5,
    n_estimators=100,
    subsample=0.5,
    eval_metric='logloss',
    reg_lambda=0.5,
    reg_alpha=1,
    min_child_weight=5,
    gamma=0.1,
    colsample_bytree=0.9,
)

best_xgb.fit(X_train_best, y_train_best)

The fitted model is kept for further analysis of the data below.

##### XGBoost playground

In [None]:
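def my_plot_importance(booster, figsize, **kwargs):
    # Draw on an axis of the requested size and forward options such as importance_type.
    fig, ax = plt.subplots(figsize=figsize)
    return plot_importance(booster=booster, ax=ax, **kwargs)

my_plot_importance(best_xgb, figsize=(10,10), importance_type='gain')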

In [None]:
import shap

explainer = shap.TreeExplainer(best_xgb)
shap_values = explainer.shap_values(X_test_best)
shap.summary_plot(shap_values, X_test_best)

In [None]:
from sklearn.inspection import permutation_importance

result = permutation_importance(best_xgb, X_test_best, y_test_best, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    print(f"{X_test_best.columns[i]}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")

At this point we can see which features are more important and which are not, which allows us to perform feature selection based on the information above.
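
One way to turn permutation-importance output into a feature shortlist is to keep only features whose mean importance sits clearly above its noise level. The sketch below assumes a rule of `mean - 2 * std > 0`; that rule, the helper name, the numbers, and the feature `noise_feature` are all illustrative assumptions (only `BAB` and `HB_index` appear in this notebook).

```python
def select_by_permutation_importance(importances, k=2.0):
    """Keep features whose mean importance exceeds k standard deviations.

    `importances` maps feature name -> (mean, std), mirroring
    result.importances_mean / result.importances_std.
    """
    kept = [
        name
        for name, (mean_imp, std_imp) in importances.items()
        if mean_imp - k * std_imp > 0
    ]
    # Most important first.
    return sorted(kept, key=lambda name: importances[name][0], reverse=True)


# Hypothetical values for illustration only.
example = {
    "HB_index": (0.040, 0.010),
    "BAB": (0.065, 0.012),
    "noise_feature": (0.002, 0.004),
}

selected = select_by_permutation_importance(example)
print(selected)
```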

In [None]:
data_imbalance = pd.concat([X_train_best, y_train_best], axis=1)

for name, group in data_imbalance.groupby('sex'):
    plt.plot(group.BAB, group.HB_index, '.', label=name)
plt.legend()

In [None]:
barplot=(sns.countplot(data= data_imbalance, x='sex',hue='sex', palette=['b', 'g']))
plt.title('0 v/s 1\n')

As shown above, the classes are heavily imbalanced and the engineered features contain outliers. To improve XGBoost performance we will apply three different outlier-handling methods (IQR, z-score and isolation forest) and several SMOTE-based resampling methods for the class imbalance, then perform feature selection and compare the performance of the resulting variants.
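
The `IQR` helper used in the next cell is defined earlier in the notebook and is not shown here. As a rough illustration of the idea, the sketch below applies the common 1.5 × IQR rule to a single column and keeps the labels aligned with the surviving rows. The function name and the data are hypothetical, and the notebook's helper may differ in detail.

```python
from statistics import quantiles


def iqr_keep_indices(values, k=1.5):
    """Return positions of values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if low <= v <= high]


feature = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # 100 is an obvious outlier
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

kept = iqr_keep_indices(feature)
feature_clean = [feature[i] for i in kept]
labels_clean = [labels[i] for i in kept]  # labels follow the same rows
print(feature_clean, labels_clean)
```

Selecting the labels with the same kept positions mirrors what `y_train.loc[X_filtered.index]` does in the notebook cells: the target must be filtered with the index of the filtered feature frame.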

In [None]:
X_train, X_test, y_train, y_test = feature_engineered_variants['Winsorization_knn_point7']

X_train_point7_iqr = IQR(X_train)
X_test_point7_iqr = IQR(X_test)

# Keep only the labels of the rows that survived the IQR filter.
y_train_point7_iqr = y_train.loc[X_train_point7_iqr.index]
y_test_point7_iqr = y_test.loc[X_test_point7_iqr.index]


In [None]:
point7_df_iqr = pd.concat([X_train_point7_iqr, y_train_point7_iqr], axis=1)

for name, group in point7_df_iqr.groupby('sex'):
    plt.plot(group.BAB, group.HB_index, '.', label=name)
plt.legend()

In [None]:
X_train_point7_z_score = z_score(X_train)
X_test_point7_z_score = z_score(X_test)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

# Keep only the labels of the rows that survived the z-score filter.
y_train_point7_z_score = y_train.loc[X_train_point7_z_score.index]
y_test_point7_z_score = y_test.loc[X_test_point7_z_score.index]

In [None]:
point7_df_z_score = pd.concat([X_train_point7_z_score, y_train_point7_z_score], axis=1)

for name, group in point7_df_z_score.groupby('sex'):
    plt.plot(group.BAB, group.HB_index, '.', label=name)
plt.legend()

In [None]:
X_train_point7_isolation_forest = isolation_forest(X_train)
X_test_point7_isolation_forest = isolation_forest(X_test)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

y_train_point7_isolation_forest = y_train.loc[X_train_point7_isolation_forest.index]
y_test_point7_isolation_forest = y_test.loc[X_test_point7_isolation_forest.index]

In [None]:
point7_df_isolation_forest = pd.concat([X_train_point7_isolation_forest, y_train_point7_isolation_forest], axis=1)

for name, group in point7_df_isolation_forest.groupby('sex'):
    plt.plot(group.BAB, group.HB_index, '.', label=name)
plt.legend()

The same steps should be repeated for the hold-out set.

We have now applied all three outlier-handling methods discussed above. Each behaves a little differently, which gives a useful range of variants to compare.
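
To illustrate why the methods can disagree, here is a minimal z-score filter on a small made-up column. The helper name, the data and the thresholds are assumptions for illustration; the notebook's own `z_score` function is defined elsewhere and may use different settings.

```python
from statistics import mean, pstdev


def z_score_keep_indices(values, threshold=3.0):
    """Return positions of values whose |z-score| is at most `threshold`."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(range(len(values)))  # constant column: nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma <= threshold]


feature = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

kept_strict = z_score_keep_indices(feature, threshold=2.0)
kept_default = z_score_keep_indices(feature, threshold=3.0)
print(len(kept_strict), len(kept_default))
```

In a sample this small, the extreme value inflates the standard deviation enough that it stays within a threshold of 3, whereas a stricter threshold drops it. Differences of this kind are one reason the outlier methods produce different training sets.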

In [None]:
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, RandomOverSampler
from imblearn.combine import SMOTETomek

point7_datasets = {
    'point7_IQR': (X_train_point7_iqr, X_test_point7_iqr, y_train_point7_iqr, y_test_point7_iqr),
    'point7_Zscore': (X_train_point7_z_score, X_test_point7_z_score, y_train_point7_z_score, y_test_point7_z_score),
    'point7_isolationforest': (X_train_point7_isolation_forest, X_test_point7_isolation_forest, y_train_point7_isolation_forest, y_test_point7_isolation_forest)
}

XGBoost_outliers_variants = {}

# One entry per resampling strategy; the key becomes the variant-name suffix.
# Note: 'RandomSMOTE' is plain random oversampling (RandomOverSampler), not a SMOTE variant.
samplers = {
    'SMOTE': SMOTE(),
    'BorderlineSMOTE': BorderlineSMOTE(),
    'SVMSMOTE': SVMSMOTE(),
    'RandomSMOTE': RandomOverSampler(random_state=42),
    'SMOTETomek': SMOTETomek(random_state=42),
}

for sampler_name, sampler in samplers.items():
    for dataset_name, (X_train, X_test, y_train, y_test) in point7_datasets.items():
        # Resample the training split only; the test split is left untouched.
        X_train_resampled, y_train_resampled = sampler.fit_resample(X_train, y_train)

        key = f"{dataset_name}_{sampler_name}"
        XGBoost_outliers_variants[key] = (X_train_resampled, X_test, y_train_resampled, y_test)

print(XGBoost_outliers_variants.keys())

In [None]:
X_train_point7_IQR_SMOTE, X_test_point7_IQR_SMOTE, y_train_point7_IQR_SMOTE, y_test_point7_IQR_SMOTE = XGBoost_outliers_variants['point7_IQR_SMOTE']

train_data_point7_IQR_SMOTE = pd.concat([X_train_point7_IQR_SMOTE, y_train_point7_IQR_SMOTE], axis=1)
test_data_point7_IQR_SMOTE = pd.concat([X_test_point7_IQR_SMOTE, y_test_point7_IQR_SMOTE], axis=1)

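# Plot train and test class counts side by side instead of on the same axes.
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.countplot(data=train_data_point7_IQR_SMOTE, x='sex', hue='sex', palette=['b', 'g'], ax=axes[0])
axes[0].set_title('Train (resampled): 0 v/s 1')
sns.countplot(data=test_data_point7_IQR_SMOTE, x='sex', hue='sex', palette=['b', 'g'], ax=axes[1])
axes[1].set_title('Test (not resampled): 0 v/s 1')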

As shown, SMOTE is applied only to the training set to avoid data leakage; the test set keeps its original class distribution.
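
The idea behind resampling only the training split can be sketched without any library. The function below is a heavily simplified, SMOTE-like interpolation (one nearest neighbour, binary labels, at least two minority rows assumed); it is not imblearn's implementation, and all names and data are hypothetical.

```python
import random


def smote_like_oversample(X, y, minority_label, seed=0):
    """Add synthetic minority rows until both classes are the same size.

    Each synthetic row lies on the segment between a random minority row and
    its nearest minority neighbour (the core idea of SMOTE, simplified to k=1).
    """
    rng = random.Random(seed)
    minority = [row for row, label in zip(X, y) if label == minority_label]
    n_needed = (len(y) - len(minority)) - len(minority)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority)
        neighbour = min(
            (row for row in minority if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )
        t = rng.random()  # position along the segment base -> neighbour
        X_new.append([a + t * (b - a) for a, b in zip(base, neighbour)])
        y_new.append(minority_label)
    return X_new, y_new


X_train = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [4.9, 5.2], [5.3, 5.1]]
y_train = [1, 1, 0, 0, 0, 0]
X_test, y_test = [[1.1, 1.0], [5.0, 5.1]], [1, 0]

# Only the training split is resampled; the test split is never passed in.
X_res, y_res = smote_like_oversample(X_train, y_train, minority_label=1)
print(len(X_res), y_res.count(0), y_res.count(1), len(X_test))
```

Because synthetic rows are built from existing training rows, generating them before the split (or on the test set) would let information about evaluation rows leak into training.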

In [None]:
point7_smote_engineered_results_df = []

for model_name, model in models.items():
    for dataset_name, (X_train, X_test, y_train, y_test) in XGBoost_outliers_variants.items():

        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        accuracy = accuracy_score(y_test, y_pred)

        # Variant names look like 'point7_IQR_SMOTE': feature set, outlier method, resampler.
        point7_smote_engineered_results_df.append({
            "Model": model_name,
            "Variant": dataset_name,
            "Feature Engineering": dataset_name.split('_')[0],
            "Outlier Method": dataset_name.split('_')[1],
            "Smote Method": dataset_name.split('_')[2],
            "Accuracy": accuracy
        })

point7_smote_engineered_df = pd.DataFrame(point7_smote_engineered_results_df)

point7_smote_engineered_results_df = point7_smote_engineered_df.sort_values(by="Accuracy", ascending=False)

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

print(point7_smote_engineered_results_df)

For now the accuracy looks about the same as before, so we next run XGBoost on these variants to learn more about its performance.

In [None]:
from sklearn.metrics import classification_report

best_variant_name = 'point7_Zscore_SMOTE'
X_train, X_test, y_train, y_test = XGBoost_outliers_variants[best_variant_name]
print(f"Running model for variant: {best_variant_name}")

all_accuracies = []
num_runs = 10

for i in range(num_runs):
    X_train_best, X_test_best, y_train_best, y_test_best = train_test_split(
        X_train, y_train, test_size=0.2, random_state=i
    )

    print(f"Features before fitting (run {i + 1}): {X_train_best.columns}")

    model = XGBClassifier(
        learning_rate=0.05,
        max_depth=5,
        n_estimators=100,
        subsample=0.5,
        eval_metric='logloss',
        reg_lambda=0.5,
        reg_alpha=1,
        min_child_weight=5,
        gamma=0.1,
        colsample_bytree=0.9,
        random_state=43,
    )

    model.fit(X_train_best, y_train_best)
    y_pred = model.predict(X_test_best)

    accuracy = accuracy_score(y_test_best, y_pred)
    all_accuracies.append(accuracy)

    scores = cross_validate(
        model, X_train_best, y_train_best, cv=5, return_train_score=True, return_estimator=True
    )

    print(f"Run {i + 1}:")
    print(f"Accuracy (Testing): {accuracy:.2f}")
    print(f"Accuracy (CV Mean): {np.mean(scores['test_score']):.2f} (+/- {np.std(scores['test_score']) * 2:.2f})")

conf_matrix = confusion_matrix(y_test_best, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

print("\nSummary of accuracies across runs:")
print(f"Mean accuracy over {num_runs} runs: {np.mean(all_accuracies):.2f} (+/- {np.std(all_accuracies):.2f})")
print(classification_report(y_test_best, y_pred))

Performance has improved markedly on several metrics even without feature selection; the next step is feature selection.

In [None]:
smote_best_xgb = XGBClassifier(
    learning_rate=0.05,
    max_depth=5,
    n_estimators=100,
    subsample=0.5,
    eval_metric='logloss',
    reg_lambda=0.5,
    reg_alpha=1,
    min_child_weight=5,
    gamma=0.1,
    colsample_bytree=0.9,
)

smote_best_xgb.fit(X_train_best, y_train_best)

In [None]:
explainer = shap.TreeExplainer(smote_best_xgb)
shap_values = explainer.shap_values(X_test_best)
shap.summary_plot(shap_values, X_test_best)

Even after outlier removal and SMOTE, the SHAP summary stays much the same as before (the reason is to be covered in the SMOTE discussion).

In [None]:
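def my_plot_importance(booster, figsize, **kwargs):
    # Draw on an axis of the requested size and forward options such as importance_type.
    fig, ax = plt.subplots(figsize=figsize)
    return plot_importance(booster=booster, ax=ax, **kwargs)

my_plot_importance(smote_best_xgb, figsize=(10,10), importance_type='gain')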

In [None]:
from sklearn.feature_selection import SelectFromModel, RFE, RFECV, SequentialFeatureSelector
from sklearn.model_selection import StratifiedKFold

XGBoost_outliers_variants_features_selected = {}

threshold_importance = 0.9
n_features_to_select = 25
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for dataset_name, (X_train, X_test, y_train, y_test) in XGBoost_outliers_variants.items():
    model = XGBClassifier()
    model.fit(X_train, y_train)

    importance_scores = model.get_booster().get_score(importance_type='gain')

    importance_df = pd.DataFrame(list(importance_scores.items()), columns=['Feature', 'Importance'])

    selected_features = importance_df[importance_df['Importance'] > threshold_importance]['Feature'].tolist()

    feature_indices = [list(model.get_booster().feature_names).index(f) for f in selected_features]
    X_train_selected = X_train.iloc[:, feature_indices]
    X_test_selected = X_test.iloc[:, feature_indices]

    key = f"{dataset_name}_importanceScore"
    XGBoost_outliers_variants_features_selected[key] = (X_train_selected, X_test_selected, y_train, y_test)


for dataset_name, (X_train, X_test, y_train, y_test) in XGBoost_outliers_variants.items():
    model = XGBClassifier()
    model.fit(X_train, y_train)

    importance_scores = model.get_booster().get_score(importance_type='gain')
    importance_df = pd.DataFrame(list(importance_scores.items()), columns=['Feature', 'Importance'])
    selected_features = importance_df[importance_df['Importance'] > threshold_importance]['Feature'].tolist()

    feature_indices = [list(model.get_booster().feature_names).index(f) for f in selected_features]
    X_train_filtered = X_train.iloc[:, feature_indices]
    X_test_filtered = X_test.iloc[:, feature_indices]

    estimator = XGBClassifier()
    rfe = RFE(estimator, n_features_to_select=n_features_to_select)
    rfe.fit(X_train_filtered, y_train)

    rfe_selected_features_mask = X_train_filtered.columns[rfe.support_]

    X_train_rfe = X_train_filtered.loc[:, rfe_selected_features_mask]
    X_test_rfe = X_test_filtered.loc[:, rfe_selected_features_mask]

    key = f"{dataset_name}_rfeAfterImportanceFiltered"
    XGBoost_outliers_variants_features_selected[key] = (X_train_rfe, X_test_rfe, y_train, y_test)

for dataset_name, (X_train, X_test, y_train, y_test) in XGBoost_outliers_variants.items():

    estimator = XGBClassifier()
    sfs = SequentialFeatureSelector(estimator, n_features_to_select=n_features_to_select, direction='forward', n_jobs=-1)
    sfs.fit(X_train, y_train)

    X_train_sfs = X_train.loc[:, sfs.get_support()]
    X_test_sfs = X_test.loc[:, sfs.get_support()]

    key = f"{dataset_name}_sfsforward"
    XGBoost_outliers_variants_features_selected[key] = (X_train_sfs, X_test_sfs, y_train, y_test)


for dataset_name, (X_train, X_test, y_train, y_test) in XGBoost_outliers_variants.items():
    estimator = XGBClassifier(random_state=42)

    rfecv = RFECV(
        estimator=estimator,
        step=1,
        cv=cv,
        scoring='accuracy',
        min_features_to_select=1
    )

    rfecv.fit(X_train, y_train)

    optimal_feature_count = rfecv.n_features_
    feature_ranking = rfecv.ranking_
    total_mean_score = np.mean(rfecv.cv_results_['mean_test_score'])

    X_train_rfecv = X_train.loc[:, rfecv.support_]
    X_test_rfecv = X_test.loc[:, rfecv.support_]

    key = f"{dataset_name}_rfecv"
    XGBoost_outliers_variants_features_selected[key] = (X_train_rfecv, X_test_rfecv, y_train, y_test)

    print(f"Dataset: {dataset_name}")
    print(f"Optimal number of features: {optimal_feature_count}")
    print(f"Cross-validation scores for each iteration: {rfecv.cv_results_['mean_test_score']}")
    print(f"Total mean score: {total_mean_score:.4f}")


print(XGBoost_outliers_variants_features_selected.keys())
print(importance_df)


In [None]:
print(imputed_variants.keys())
print(feature_engineered_variants.keys())
print(XGBoost_outliers_variants.keys())
print(XGBoost_outliers_variants_features_selected.keys())

In [None]:
X_train_test, X_test_test, y_train_test, y_test_test = XGBoost_outliers_variants_features_selected['point7_Zscore_SVMSMOTE_importanceScore']

data_check = pd.concat([X_train_test, y_train_test], axis=1)

data_check.describe().T

In [None]:
X_train_test, X_test_test, y_train_test, y_test_test = XGBoost_outliers_variants_features_selected['point7_IQR_SMOTETomek_rfecv']

X_train_test = pd.DataFrame(X_train_test)

y_train_test = y_train_test.reset_index(drop=True)

data_check = pd.concat([X_train_test, y_train_test], axis=1)

data_check.describe().T

In [None]:
model_results = []

for dataset_name, (X_train, X_test, y_train, y_test) in XGBoost_outliers_variants_features_selected.items():

    smote_best_xgb.fit(X_train, y_train)
    y_pred = smote_best_xgb.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)

    model_results.append({
        "Model": "XGBoost",  # smote_best_xgb is the only model evaluated here
        "Variant": dataset_name,
        "Feature Engineering": dataset_name.split('_')[0],
        "Outlier Method": dataset_name.split('_')[1],
        "Smote Method": dataset_name.split('_')[2],
        "Features Selection Method": dataset_name.split('_')[3],
        "Accuracy": accuracy
    })

model_results_df = pd.DataFrame(model_results)

model_results_df = model_results_df.sort_values(by="Accuracy", ascending=False)

pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 10000)
print(model_results_df)



Notable results:

*   `point7_Zscore_SMOTETomek_importanceScore`: 88%
*   `point7_IQR_SMOTETomek_importanceScore`: 88%
*   `point7_IQR_SMOTETomek_rfecv`: 89% (the best overall so far)
*   `point7_IQR_SMOTETomek_rfeAfterImportanceFiltered`: 89%
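
The variant names encode the whole preprocessing pipeline, separated by underscores, which is what the `dataset_name.split('_')` calls in the results tables rely on. A small sketch of that convention follows; the `Variant` type and its field names are hypothetical labels chosen for illustration.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    feature_engineering: str
    outlier_method: str
    resampling: str
    feature_selection: str


def parse_variant(name: str) -> Variant:
    """Split a key such as 'point7_IQR_SMOTETomek_rfecv' into its four stages."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated parts, got {len(parts)}: {name!r}")
    return Variant(*parts)


variant = parse_variant("point7_IQR_SMOTETomek_rfecv")
print(variant)
```

This only works while no stage name itself contains an underscore, which holds for the keys built in this notebook (for example `rfeAfterImportanceFiltered` and `sfsforward`).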

In [None]:
best_variant_name = 'point7_Zscore_SMOTETomek_rfecv'
X_train, X_test, y_train, y_test = XGBoost_outliers_variants_features_selected[best_variant_name]
print(f"Running model for variant: {best_variant_name}")

all_accuracies = []
num_runs = 10

for i in range(num_runs):
    X_train_best, X_test_best, y_train_best, y_test_best = train_test_split(
        X_train, y_train, test_size=0.2, random_state=i
    )

    print(f"Features before fitting (run {i + 1}): {X_train_best.columns}")

    model = XGBClassifier(
        learning_rate=0.05,
        max_depth=11,
        n_estimators=200,
        subsample=0.5,
        eval_metric='logloss',
        reg_lambda=1.5,
        reg_alpha=0.5,
        min_child_weight=10,
        gamma=0.5,
        colsample_bytree=1,
        colsample_bynode=1,
        colsample_bylevel=0.5,
        booster='gbtree',
        random_state=42,
    )

    model.fit(X_train_best, y_train_best)
    y_pred = model.predict(X_test_best)

    accuracy = accuracy_score(y_test_best, y_pred)
    all_accuracies.append(accuracy)

    scores = cross_validate(
        model, X_train_best, y_train_best, cv=5, return_train_score=True, return_estimator=True
    )

    print(f"Run {i + 1}:")
    print(f"Accuracy (Testing): {accuracy:.2f}")
    print(f"Accuracy (CV Mean): {np.mean(scores['test_score']):.2f} (+/- {np.std(scores['test_score']) * 2:.2f})")

conf_matrix = confusion_matrix(y_test_best, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

print("\nSummary of accuracies across runs:")
print(f"Mean accuracy over {num_runs} runs: {np.mean(all_accuracies):.2f} (+/- {np.std(all_accuracies):.2f})")

print(classification_report(y_test_best, y_pred))

In [None]:
all_runs_results = []

num_runs = 10

for dataset_name, (X_train, X_test, y_train, y_test) in XGBoost_outliers_variants_features_selected.items():

    all_accuracies = []
    for i in range(num_runs):

        X_train_best, X_test_best, y_train_best, y_test_best = train_test_split(
            X_train, y_train, test_size=0.2, random_state=i
        )

        print(f"Features before fitting (run {i + 1}): {X_train_best.columns}")
        print(f"Running model for variant: {dataset_name}")

        model = XGBClassifier(
            learning_rate=0.05,
            max_depth=10,
            n_estimators=200,
            subsample=0.8,
            objective='binary:logistic',  # classification objective; 'reg:squarederror' is meant for regression
            reg_lambda=1.5,
            reg_alpha=1.5,
            min_child_weight=10,
            gamma=0.5,
            colsample_bytree=1,
            colsample_bynode=1,
            colsample_bylevel=0.5,
            booster='gbtree',
            random_state=i
        )

        model.fit(X_train_best, y_train_best)

        y_pred = model.predict(X_test_best)
        accuracy = accuracy_score(y_test_best, y_pred)
        all_accuracies.append(accuracy)

        scores = cross_validate(
            model, X_train_best, y_train_best, cv=5, return_train_score=True, return_estimator=True
        )

        print(f"\nRun {i + 1}:")
        print(f"Accuracy (Testing): {accuracy:.2f}")
        print(f"Accuracy (CV Mean): {np.mean(scores['test_score']):.2f} (+/- {np.std(scores['test_score']) * 2:.2f})")

    mean_accuracy = np.mean(all_accuracies)
    std_accuracy = np.std(all_accuracies)
    all_runs_results.append({
        "Dataset": dataset_name,
        "Mean Accuracy": mean_accuracy,
        "Std Accuracy": std_accuracy,
        "Details": X_train.columns.tolist()
    })

print("\nSummary of accuracies across runs:")
for result in all_runs_results:
    print(f"Dataset: {result['Dataset']}, Mean Accuracy: {result['Mean Accuracy']:.2f} (+/- {result['Std Accuracy']:.2f})")

results_df = pd.DataFrame(all_runs_results)
sorted_results_df = results_df.sort_values(by="Mean Accuracy", ascending=False)

print("\nSorted Results by Accuracy:")
print(sorted_results_df)


##### Output model and save work state

In [None]:
best_variant_name = 'point7_Zscore_SMOTETomek_rfecv'
XGB_finial_X_train, X_test, XGB_finial_y_train, y_test = XGBoost_outliers_variants_features_selected[best_variant_name]  # unpack the value

best_XGB_After_proccess = XGBClassifier(
        learning_rate=0.05,
        max_depth=11,
        n_estimators=200,
        subsample=0.5,
        eval_metric='logloss',
        reg_lambda=1.5,
        reg_alpha=0.5,
        min_child_weight=10,
        gamma=0.5,
        colsample_bytree=1,
        colsample_bynode=1,
        colsample_bylevel=0.5,
        booster='gbtree',
        random_state=42,
    )

best_XGB_After_proccess.fit(XGB_finial_X_train, XGB_finial_y_train)

In [None]:
import joblib
joblib.dump(best_XGB_After_proccess, 'best_xgboost_model.pkl')

In [None]:
best_XGB_After_process = joblib.load('best_xgboost_model.pkl')
y_pred = best_XGB_After_process.predict(X_test)

In [None]:
y_pred = best_XGB_After_process.predict(X_test)
y_pred

In [None]:
joblib.dump(best_XGB_After_process, 'best_xgboost_model.pkl')

#### Hyperparameter Tuning for Gradient Boosting (Kaggle 0.8505)

In [None]:
from skopt import BayesSearchCV

best_variant_name = 'point7_IQR_SMOTETomek_rfecv'
X_train, X_test, y_train, y_test = XGBoost_outliers_variants_features_selected[best_variant_name]

param_space = {
    'n_estimators': (50, 100, 200, 300),
    'learning_rate': (0.01, 0.05, 0.1, 0.2),
    'max_depth': (3, 5, 7),
    'min_samples_split': (2, 5, 10),
    'min_samples_leaf': (1, 2, 5, 10),
    'max_features': ['sqrt', 'log2', None],
    'subsample': (0.7, 0.8, 1.0),
    'loss': ['log_loss', 'exponential'],
    'min_impurity_decrease': (0.001, 0.01, 0.1),
    'warm_start': [True, False],
    'max_leaf_nodes': [None, 10, 20, 30, 50],
    'n_iter_no_change': [None, 5, 10, 15],
    'tol': (0.0001, 0.001, 0.01)
}

model = GradientBoostingClassifier(random_state=42)

bayes_opt = BayesSearchCV(
    estimator=model,
    search_spaces=param_space,
    n_iter=50,
    scoring='accuracy',
    n_jobs=-1,
    cv=3,
    verbose=1,
    random_state=42
)

bayes_opt.fit(X_train, y_train)

print("Best Parameters: ", bayes_opt.best_params_)
print("Best Accuracy from Bayesian Search: ", bayes_opt.best_score_)

In [None]:
best_variant_name = 'point7_IQR_SMOTETomek_rfecv'
X_train, X_test, y_train, y_test = XGBoost_outliers_variants_features_selected[best_variant_name]

param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10],
    'max_features': ['sqrt', 'log2', None],
    'subsample': [0.7, 0.8, 1.0]
}


gb_model = GradientBoostingClassifier()

grid_search = GridSearchCV(estimator=gb_model,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=3,
                           n_jobs=-1,
                           verbose=2)

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

In [None]:
all_runs_results = []

num_runs = 10

for dataset_name, (X_train, X_test, y_train, y_test) in XGBoost_outliers_variants_features_selected.items():

    all_accuracies = []
    for i in range(num_runs):

        X_train_best, X_test_best, y_train_best, y_test_best = train_test_split(
            X_train, y_train, test_size=0.2, random_state=i
        )

        print(f"Features before fitting (run {i + 1}): {X_train_best.columns}")
        print(f"Running model for variant: {dataset_name}")

        model = GradientBoostingClassifier(
            learning_rate=0.2,
            max_depth=7,
            n_estimators=50,
            subsample=1.0,
            max_features='log2',
            min_samples_leaf=1,
            min_samples_split=10,
            random_state=42,
            warm_start=False,
            tol=0.001,
            min_impurity_decrease=0.001,
            max_leaf_nodes=None,
            loss='exponential',
            n_iter_no_change=None
        )

        model.fit(X_train_best, y_train_best)

        y_pred = model.predict(X_test_best)
        accuracy = accuracy_score(y_test_best, y_pred)
        all_accuracies.append(accuracy)

        scores = cross_validate(
            model, X_train_best, y_train_best, cv=5, return_train_score=True, return_estimator=True
        )

        print(f"\nRun {i + 1}:")
        print(f"Accuracy (Testing): {accuracy:.2f}")
        print(f"Accuracy (CV Mean): {np.mean(scores['test_score']):.2f} (+/- {np.std(scores['test_score']) * 2:.2f})")

    mean_accuracy = np.mean(all_accuracies)
    std_accuracy = np.std(all_accuracies)
    all_runs_results.append({
        "Dataset": dataset_name,
        "Mean Accuracy": mean_accuracy,
        "Std Accuracy": std_accuracy,
        "Details": X_train.columns.tolist()
    })

print("\nSummary of accuracies across runs:")
for result in all_runs_results:
    print(f"Dataset: {result['Dataset']}, Mean Accuracy: {result['Mean Accuracy']:.2f} (+/- {result['Std Accuracy']:.2f})")

results_df = pd.DataFrame(all_runs_results)
sorted_results_df = results_df.sort_values(by="Mean Accuracy", ascending=False)

print("\nSorted Results by Accuracy:")
print(sorted_results_df)


In [None]:
best_variant_name = 'point7_Zscore_SMOTETomek_rfecv'
X_train, X_test, y_train, y_test = XGBoost_outliers_variants_features_selected[best_variant_name]
print(f"Running model for variant: {best_variant_name}")

all_accuracies = []
num_runs = 10

for i in range(num_runs):
    X_train_best, X_test_best, y_train_best, y_test_best = train_test_split(
        X_train, y_train, test_size=0.2, random_state=i
    )

    print(f"Features before fitting (run {i + 1}): {X_train_best.columns}")

    model = GradientBoostingClassifier(
        learning_rate=0.2,
        max_depth=7,
        n_estimators=50,
        subsample=1.0,
        max_features='log2',
        min_samples_leaf=1,
        min_samples_split=10,
        random_state=42,
        warm_start=False,
        tol=0.001,
        min_impurity_decrease=0.001,
        max_leaf_nodes=None,
        loss='exponential',
        n_iter_no_change=None
    )

    model.fit(X_train_best, y_train_best)
    y_pred = model.predict(X_test_best)

    accuracy = accuracy_score(y_test_best, y_pred)
    all_accuracies.append(accuracy)

    scores = cross_validate(
        model, X_train_best, y_train_best, cv=5, return_train_score=True, return_estimator=True
    )

    print(f"Run {i + 1}:")
    print(f"Accuracy (Testing): {accuracy:.2f}")
    print(f"Accuracy (CV Mean): {np.mean(scores['test_score']):.2f} (+/- {np.std(scores['test_score']) * 2:.2f})")

conf_matrix = confusion_matrix(y_test_best, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

print("\nSummary of accuracies across runs:")
print(f"Mean accuracy over {num_runs} runs: {np.mean(all_accuracies):.2f} (+/- {np.std(all_accuracies):.2f})")

print(classification_report(y_test_best, y_pred))

In [None]:
best_variant_name = 'point7_Zscore_SMOTETomek_rfecv'
GB_finial_X_train, X_test, GB_finial_y_train, y_test = XGBoost_outliers_variants_features_selected[best_variant_name]

best_GB_After_proccess = GradientBoostingClassifier(
        learning_rate=0.2,
        max_depth=7,
        n_estimators=50,
        subsample=1.0,
        max_features='log2',
        min_samples_leaf=1,
        min_samples_split=10,
        random_state=42,
        warm_start=False,
        tol=0.001,
        min_impurity_decrease=0.001,
        max_leaf_nodes=None,
        loss='exponential',
        n_iter_no_change=None
    )

best_GB_After_proccess.fit(GB_finial_X_train, GB_finial_y_train)

In [None]:
joblib.dump(best_GB_After_proccess, 'best_GB_After_proccess.pkl')

In [None]:
# Reload the Gradient Boosting model saved above and predict on the test split.
best_GB_After_proccess = joblib.load('best_GB_After_proccess.pkl')
y_pred = best_GB_After_proccess.predict(X_test)
y_pred

##### Gradient Boosting playground

In [None]:
X = p7_point_footprints_df_iqr_knn_blance.drop('sex', axis=1)
y = p7_point_footprints_df_iqr_knn_blance['sex']
x = 0
count = 0
num_runs = 1

for x in range (num_runs):

    count += 1
    model = GradientBoostingClassifier(n_estimators=150,
                                       learning_rate=0.5,
                                       max_depth=7,
                                       min_samples_split=5,
                                       min_samples_leaf=2,)


    SMOTE_iso_X_train, SMOTE_iso_X_test, SMOTE_iso_y_train, SMOTE_iso_y_test = train_test_split(X,
                                                        y,
                                                        test_size=0.2,
                                                        random_state=1,
                                                       )

    model.fit(SMOTE_iso_X_train, SMOTE_iso_y_train)
    SMOTE_iso_y_pred = model.predict(SMOTE_iso_X_test)

scores = cross_validate(model, X, y, cv=3, return_train_score=True, return_estimator=True)
precision = precision_score(SMOTE_iso_y_test, SMOTE_iso_y_pred)
recall = recall_score(SMOTE_iso_y_test, SMOTE_iso_y_pred)

print(metrics.confusion_matrix(SMOTE_iso_y_test, SMOTE_iso_y_pred))
print("\nAccuracy (Testing):  %0.2f " % (metrics.accuracy_score(SMOTE_iso_y_test, SMOTE_iso_y_pred)))
print("Accuracy (CV Mean):  %0.2f (+/- %0.2f)" % (scores['test_score'].mean(), scores['test_score'].std() * 2))
print("count:" , count)
print("Precision: %.2f" % precision)
print("recall: %.2f" % recall)

from sklearn.metrics import confusion_matrix
print(confusion_matrix(SMOTE_iso_y_test,SMOTE_iso_y_pred))
sns.heatmap(confusion_matrix(SMOTE_iso_y_test,SMOTE_iso_y_pred),annot=True)

In [None]:
SMOTE_iso_best_gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=2,
)

SMOTE_iso_best_gb_model.fit(SMOTE_iso_X_train, SMOTE_iso_y_train)

#### Hyperparameter Tuning for Support Vector Machines

SVMs generally perform better on standardised features, so we apply StandardScaler to improve the SVM score.
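
An RBF-kernel SVM works on distances between samples, so a feature with a much larger numeric range can dominate the others unless the features are put on a common scale. The sketch below shows the standardisation step by hand, with statistics learned on the training column only and then reused for the test column. Function names and data are illustrative; in the notebook this job is done by `StandardScaler`.

```python
from statistics import mean, pstdev


def fit_standardiser(column):
    """Learn mean and (population) std from the training column only."""
    mu, sigma = mean(column), pstdev(column)
    return mu, (sigma if sigma > 0 else 1.0)  # avoid dividing by zero


def standardise(column, mu, sigma):
    """Apply z = (x - mu) / sigma with previously fitted statistics."""
    return [(v - mu) / sigma for v in column]


train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [14.0, 20.0]

mu, sigma = fit_standardiser(train)          # fit on train only
train_scaled = standardise(train, mu, sigma)
test_scaled = standardise(test, mu, sigma)   # reuse the train statistics
print(train_scaled, test_scaled)
```

Fitting on the training split and only transforming the test split matches the `fit_transform` / `transform` pattern used in the cell below and keeps test information out of the scaler.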

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from skopt import BayesSearchCV

best_variant_name = 'point7_IQR_SMOTETomek_rfecv'
X_train, X_test, y_train, y_test = XGBoost_outliers_variants_features_selected[best_variant_name]


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

param_grid = {
    'C': (0.1, 1000, 'log-uniform'),
    'gamma': (0.001, 10, 'log-uniform'),
    'kernel': ['rbf'],
    'tol': (1e-4, 1e-2, 'log-uniform'),
    'max_iter': (1000, 10000),
    'class_weight': [None, 'balanced']
}

model = SVC(random_state=42)

bayes_opt = BayesSearchCV(
    estimator=model,
    search_spaces=param_grid,
    n_iter=50,
    scoring='accuracy',
    n_jobs=-1,
    cv=3,
    verbose=1,
    random_state=42
)

bayes_opt.fit(X_train_pca, y_train)

print("Best Parameters:", bayes_opt.best_params_)
print("Best Score:", bayes_opt.best_score_)
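
In the cell above, the scaler and PCA are fit on the full training split before cross-validation, so each validation fold has already influenced the preprocessing. A minimal sketch of an alternative, assuming synthetic data from `make_classification` in place of the footprint features and illustrative SVC settings (not the tuned values), wraps the three steps in a scikit-learn `Pipeline` so they are refit inside every fold:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the footprint features (assumption: not the real data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=6, random_state=42
)

# Scaling and PCA are refit on the training part of every fold,
# so the validation part never influences the preprocessing.
svm_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("svc", SVC(kernel="rbf", C=1.0, gamma="scale", random_state=42)),
])

cv_scores = cross_val_score(svm_pipeline, X_demo, y_demo, cv=3, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std() * 2:.2f})")
```

A pipeline like this can usually be passed to a search object as the estimator, with search-space keys prefixed by the step name (for example `svc__C`), but that wiring is not shown in this notebook.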

In [None]:
XGBoost_outliers_features_selected_scaled_variants = {}

scalers = {
    'StandardScaler': StandardScaler(),
    'RobustScaler': RobustScaler()
}

for scaler_name, scaler in scalers.items():
    for variant_name, (X_train, X_test, y_train, y_test) in XGBoost_outliers_variants_features_selected.items():

        X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
        X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

        key = f"{variant_name}_{scaler_name}"
        XGBoost_outliers_features_selected_scaled_variants[key] = (X_train_scaled, X_test_scaled, y_train, y_test)

print(XGBoost_outliers_features_selected_scaled_variants.keys())

In [None]:
X_train_test, X_test_test, y_train_test, y_test_test = XGBoost_outliers_features_selected_scaled_variants['point7_Zscore_SMOTETomek_rfecv_StandardScaler']

X_train_test = pd.DataFrame(X_train_test)

y_train_test = y_train_test.reset_index(drop=True)

data_check = pd.concat([X_train_test, y_train_test], axis=1)

data_check.describe().T

In [None]:
X_train_test, X_test_test, y_train_test, y_test_test = XGBoost_outliers_variants_features_selected['point7_Zscore_SMOTETomek_rfecv']

X_train_test = pd.DataFrame(X_train_test)

y_train_test = y_train_test.reset_index(drop=True)

data_check = pd.concat([X_train_test, y_train_test], axis=1)

data_check.describe().T

In [None]:

all_runs_results = []

num_runs = 10

best_params = {
    'C': 1.1930801848463657,
    'class_weight': 'balanced',
    'gamma': 0.120601154417892,
    'kernel': 'rbf',
    'max_iter': 1000,
    'tol': 0.0001
}

for dataset_name, (X_train, X_test, y_train, y_test) in XGBoost_outliers_features_selected_scaled_variants.items():

    all_accuracies = []
    for i in range(num_runs):

        X_train_best, X_test_best, y_train_best, y_test_best = train_test_split(
            X_train, y_train, test_size=0.2, random_state=i
        )

        print(f"Features before fitting (run {i + 1}): {X_train_best.columns}")
        print(f"Running model for variant: {dataset_name}")

        model = SVC(
            C=best_params['C'],
            kernel=best_params['kernel'],
            gamma=best_params['gamma'],
            class_weight=best_params['class_weight'],
            max_iter=best_params['max_iter'],
            tol=best_params['tol'],
            random_state=42
        )

        model.fit(X_train_best, y_train_best)

        y_pred = model.predict(X_test_best)
        accuracy = accuracy_score(y_test_best, y_pred)
        all_accuracies.append(accuracy)

        scores = cross_validate(
            model, X_train_best, y_train_best, cv=5, return_train_score=True, return_estimator=True
        )

        print(f"\nRun {i + 1}:")
        print(f"Accuracy (Testing): {accuracy:.2f}")
        print(f"Accuracy (CV Mean): {np.mean(scores['test_score']):.2f} (+/- {np.std(scores['test_score']) * 2:.2f})")

    mean_accuracy = np.mean(all_accuracies)
    std_accuracy = np.std(all_accuracies)
    all_runs_results.append({
        "Dataset": dataset_name,
        "Mean Accuracy": mean_accuracy,
        "Std Accuracy": std_accuracy,
        "Details": X_train.columns.tolist()
    })

print("\nSummary of accuracies across runs:")
for result in all_runs_results:
    print(f"Dataset: {result['Dataset']}, Mean Accuracy: {result['Mean Accuracy']:.2f} (+/- {result['Std Accuracy']:.2f})")

results_df = pd.DataFrame(all_runs_results)
sorted_results_df = results_df.sort_values(by="Mean Accuracy", ascending=False)

print("\nSorted Results by Accuracy:")
print(sorted_results_df)

display(sorted_results_df)

In [None]:
best_variant_name = 'point7_Zscore_SMOTETomek_rfecv_StandardScaler'
X_train, X_test, y_train, y_test = XGBoost_outliers_features_selected_scaled_variants[best_variant_name]
print(f"Running model for variant: {best_variant_name}")

all_accuracies = []
num_runs = 10

for i in range(num_runs):
    X_train_best, X_test_best, y_train_best, y_test_best = train_test_split(
        X_train, y_train, test_size=0.2, random_state=i
    )

    print(f"Features before fitting (run {i + 1}): {X_train_best.columns}")

    model = SVC(
        C =1.9650743261576813,
        class_weight = 'balanced',
        gamma = 0.09007054559274681,
        kernel = 'rbf',
        max_iter = 4979,
        tol = 0.01,
        random_state=42
    )

    model.fit(X_train_best, y_train_best)
    y_pred = model.predict(X_test_best)

    accuracy = accuracy_score(y_test_best, y_pred)
    all_accuracies.append(accuracy)

    scores = cross_validate(
        model, X_train_best, y_train_best, cv=5, return_train_score=True, return_estimator=True
    )

    print(f"Run {i + 1}:")
    print(f"Accuracy (Testing): {accuracy:.2f}")
    print(f"Accuracy (CV Mean): {np.mean(scores['test_score']):.2f} (+/- {np.std(scores['test_score']) * 2:.2f})")

conf_matrix = confusion_matrix(y_test_best, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

print("\nSummary of accuracies across runs:")
print(f"Mean accuracy over {num_runs} runs: {np.mean(all_accuracies):.2f} (+/- {np.std(all_accuracies):.2f})")

print(classification_report(y_test_best, y_pred))
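
In the cell above, the confusion matrix and classification report are computed after the loop, so they describe only the last of the repeated splits. A minimal sketch of accumulating the confusion matrix over all runs follows; the labels are made up for illustration, and `confusion_counts` is a hypothetical helper rather than part of the notebook:

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes=2):
    """Return a matrix of counts; rows are true labels, columns are predictions."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label, pred_label] += 1
    return matrix

# Made-up (y_true, y_pred) pairs for three repeated splits (illustration only).
runs = [
    ([0, 0, 1, 1], [0, 1, 1, 1]),
    ([0, 1, 1, 0], [0, 1, 0, 0]),
    ([1, 1, 0, 0], [1, 1, 0, 0]),
]

total_matrix = np.zeros((2, 2), dtype=int)
run_accuracies = []
for y_true, y_pred in runs:
    run_matrix = confusion_counts(y_true, y_pred)
    total_matrix += run_matrix  # accumulate counts across runs
    run_accuracies.append(np.trace(run_matrix) / run_matrix.sum())

print(total_matrix)
print(f"Mean accuracy: {np.mean(run_accuracies):.2f} (+/- {np.std(run_accuracies):.2f})")
```

The summed matrix reflects every split, which makes it consistent with the mean accuracy reported next to it.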

##### Support Vector Machines playground

In [None]:
best_variant_name = 'point7_Zscore_SMOTETomek_rfecv_StandardScaler'
SCV_finial_X_train, X_test, SCV_finial_y_train, y_test = XGBoost_outliers_features_selected_scaled_variants[best_variant_name]

best_SVC_After_proccess = SVC(
        C =1.9650743261576813,
        class_weight = 'balanced',
        gamma = 0.09007054559274681,
        kernel = 'rbf',
        max_iter = 4979,
        tol = 0.01,
        random_state=42,
        probability=True
    )

best_SVC_After_proccess.fit(SCV_finial_X_train, SCV_finial_y_train)

In [None]:
joblib.dump(best_SVC_After_proccess, 'best_SVC_After_proccess.pkl')

In [None]:
best_SVC_After_proccess = joblib.load('best_SVC_After_proccess.pkl')
y_pred = best_SVC_After_proccess.predict(X_test)

In [None]:
print(y_pred)

#### Hyperparameter Tuning for Random Forest

In [None]:
best_variant_name = 'point7_Zscore_SMOTETomek_rfecv_StandardScaler'
X_train, X_test, y_train, y_test = XGBoost_outliers_features_selected_scaled_variants[best_variant_name]

param_grid = {
    'n_estimators': [50, 100, 200, 500],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 5, 10],
    'max_features': ['sqrt', 'log2', None, 0.5],
    'bootstrap': [True],
    'max_leaf_nodes': [10, 20, 50, 100, None],
    'min_impurity_decrease': [0.0, 0.001, 0.01],
    'criterion': ['gini', 'entropy'],
    'class_weight': [None, 'balanced', 'balanced_subsample'],
    'oob_score': [True, False]
}

model = RandomForestClassifier(random_state=42)

bayes_opt = BayesSearchCV(
    estimator=model,
    search_spaces=param_grid,
    n_iter=50,
    scoring='accuracy',
    n_jobs=-1,
    cv=3,
    verbose=1,
    random_state=42
)

# Fit on the features loaded in this cell; X_train_pca belongs to the SVM experiment above.
bayes_opt.fit(X_train, y_train)

print("Best Parameters:", bayes_opt.best_params_)
print("Best Score:", bayes_opt.best_score_)

In [None]:
best_variant_name = 'point7_Zscore_SMOTETomek_rfecv_StandardScaler'
X_train, X_test, y_train, y_test = XGBoost_outliers_features_selected_scaled_variants[best_variant_name]
print(f"Running model for variant: {best_variant_name}")

all_accuracies = []
num_runs = 10

for i in range(num_runs):
    X_train_best, X_test_best, y_train_best, y_test_best = train_test_split(
        X_train, y_train, test_size=0.2, random_state=i
    )

    print(f"Features before fitting (run {i + 1}): {X_train_best.columns}")

    model = RandomForestClassifier(
        class_weight =None,
        criterion = 'entropy',
        max_depth = None,
        max_features = None,
        max_leaf_nodes = 100,
        min_impurity_decrease = 0.001,
        min_samples_leaf = 2,
        min_samples_split = 2,
        n_estimators = 200,
        oob_score = False,
        random_state=42
    )

    model.fit(X_train_best, y_train_best)
    y_pred = model.predict(X_test_best)

    accuracy = accuracy_score(y_test_best, y_pred)
    all_accuracies.append(accuracy)

    scores = cross_validate(
        model, X_train_best, y_train_best, cv=5, return_train_score=True, return_estimator=True
    )

    print(f"Run {i + 1}:")
    print(f"Accuracy (Testing): {accuracy:.2f}")
    print(f"Accuracy (CV Mean): {np.mean(scores['test_score']):.2f} (+/- {np.std(scores['test_score']) * 2:.2f})")

conf_matrix = confusion_matrix(y_test_best, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

print("\nSummary of accuracies across runs:")
print(f"Mean accuracy over {num_runs} runs: {np.mean(all_accuracies):.2f} (+/- {np.std(all_accuracies):.2f})")

print(classification_report(y_test_best, y_pred))

In [None]:
best_variant_name = 'point7_Zscore_SMOTETomek_rfecv_StandardScaler'
RF_finial_X_train, X_test, RF_finial_y_train, y_test = XGBoost_outliers_features_selected_scaled_variants[best_variant_name]  # unpack the value

best_RF_After_proccess = RandomForestClassifier(
        class_weight =None,
        criterion = 'entropy',
        max_depth = None,
        max_features = None,
        max_leaf_nodes = 100,
        min_impurity_decrease = 0.001,
        min_samples_leaf = 2,
        min_samples_split = 2,
        n_estimators = 200,
        oob_score = False,
        random_state=42
    )

best_RF_After_proccess.fit(RF_finial_X_train, RF_finial_y_train)

## step 8: Ensemble Learning

In [None]:
from sklearn.base import clone

# clone() copies each tuned estimator's hyperparameters into a fresh, unfitted model.
XGB_model = clone(best_XGB_After_proccess)
GBC_model = clone(best_GB_After_proccess)
SVM_model = clone(best_SVC_After_proccess).set_params(probability=True)
RF_model = clone(best_RF_After_proccess)

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

voting_clf = VotingClassifier(
    estimators=[
        ('xgb', best_XGB_After_proccess),
        ('gbc', best_GB_After_proccess),
        ('svm', best_SVC_After_proccess),
        ('rf', best_RF_After_proccess)
    ],
    voting='soft'
)

voting_clf.fit(X_train, y_train)

y_pred = voting_clf.predict(X_test)

print("Ensemble Model Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
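
For reference, `voting='soft'` averages the class probabilities predicted by the base models and picks the class with the highest mean, whereas hard voting counts each model's own predicted label. A minimal sketch with made-up probabilities for three hypothetical models (not outputs of the models above) shows where the two can disagree:

```python
import numpy as np

# Hypothetical class probabilities from three models for four samples
# (last axis: P(class 0), P(class 1)); values are made up for illustration.
proba_by_model = np.array([
    [[0.90, 0.10], [0.40, 0.60], [0.30, 0.70], [0.45, 0.55]],
    [[0.80, 0.20], [0.60, 0.40], [0.20, 0.80], [0.45, 0.55]],
    [[0.70, 0.30], [0.30, 0.70], [0.40, 0.60], [0.95, 0.05]],
])

# Soft voting: average the probabilities across models, then take the argmax.
mean_proba = proba_by_model.mean(axis=0)
soft_votes = mean_proba.argmax(axis=1)

# Hard voting for comparison: each model votes with its own argmax,
# and class 1 wins when more than half of the models choose it.
votes_per_model = proba_by_model.argmax(axis=2)
hard_votes = (votes_per_model.sum(axis=0) > proba_by_model.shape[0] / 2).astype(int)

print("Soft voting:", soft_votes)
print("Hard voting:", hard_votes)
```

On the last sample, two models lean weakly towards class 1 while one is very confident in class 0, so soft voting returns class 0 and hard voting returns class 1. Soft voting therefore requires every base model to expose calibrated-enough probabilities, which is why the SVC is built with `probability=True`.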

In [None]:
print(imputed_variants.keys())
print(feature_engineered_variants.keys())
print(XGBoost_outliers_variants.keys())
print(XGBoost_outliers_features_selected_scaled_variants.keys())
print(XGBoost_outliers_variants_features_selected.keys())

In [None]:
XGBoost_outliers_variants = {'point7_IQR_SMOTE': (X_train, X_test, y_train, y_test)}

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

stacking_clf = StackingClassifier(
    estimators=[
        ('xgb', best_XGB_After_proccess),
        ('gbc', best_GB_After_proccess),
        ('svm', best_SVC_After_proccess),
        ('rf', best_RF_After_proccess)
    ],
    final_estimator=LogisticRegression()
)
stacking_clf.fit(X_train, y_train)
y_pred_stack = stacking_clf.predict(X_test)
print("Stacking Ensemble Accuracy:", accuracy_score(y_test, y_pred_stack))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_stack))
print("Classification Report:\n", classification_report(y_test, y_pred_stack))

In [None]:
X_train.describe().T

In [None]:
from sklearn.ensemble import VotingClassifier

ensemble_model = VotingClassifier(
    estimators=[
        ('xgb', best_xgb),
        ('gb', best_gb_model),

    ],
    voting='soft'
)

ensemble_model.fit(X_train, y_train)

y_pred_ensemble = ensemble_model.predict(X_test)

accuracy_ensemble = metrics.accuracy_score(y_test, y_pred_ensemble)
print("Ensemble Model Accuracy:", accuracy_ensemble)

print("Confusion Matrix:", metrics.confusion_matrix(y_test, y_pred_ensemble))
print("Classification Report:", metrics.classification_report(y_test, y_pred_ensemble))

In [None]:
best_ensemble_Voting_model =VotingClassifier(
          estimators=[
            ('xgb', best_XGB_After_proccess),
            ('gbc', best_GB_After_proccess),
            ('svm', best_SVC_After_proccess),
            ('rf', best_RF_After_proccess)
          ],
          voting='soft'
      )



# Assume X_train and y_train are the full training data (after preprocessing and scaling)
best_ensemble_Voting_model.fit(X_train, y_train)

In [None]:
from sklearn.ensemble import StackingClassifier

base_models = [
      ('xgb', best_XGB_After_proccess),
      ('gbc', best_GB_After_proccess),
      ('svm', best_SVC_After_proccess),
      ('rf', best_RF_After_proccess)
]

meta_model = LogisticRegression()

stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)
stacking_model.fit(X_train, y_train)
y_pred = stacking_model.predict(X_test)

accuracy_ensemble = metrics.accuracy_score(y_test, y_pred)
print("Ensemble Model Accuracy:", accuracy_ensemble)

print("Confusion Matrix:", metrics.confusion_matrix(y_test, y_pred))
print("Classification Report:", metrics.classification_report(y_test, y_pred))


In [None]:
best_ensemble_Stacking_model =StackingClassifier(
          estimators=[
            ('xgb', best_XGB_After_proccess),
            ('gbc', best_GB_After_proccess),
            ('svm', best_SVC_After_proccess),
            ('rf', best_RF_After_proccess)
          ],
      )

best_ensemble_Stacking_model.fit(X_train, y_train)

## Hold-out set preprocessing

In [None]:
hold_out = pd.read_csv('SexLandmarks-test.csv')

In [None]:
hold_out_data_df = hold_out.copy()

In [None]:
hold_out_data_df = IQR(hold_out_data_df)

In [None]:
hold_out_scaled_data = hold_out.copy()

for column in hold_out_scaled_data.columns:
    if column.startswith('x'):
        hold_out_scaled_data[column] = hold_out_scaled_data[column] * width
    elif column.startswith('y'):
        hold_out_scaled_data[column] = hold_out_scaled_data[column] * height

print(hold_out_scaled_data.head())

In [None]:
lengths_upper_threshold = hold_out_scaled_data_with_lengths_widths['lengths'].quantile(0.95)
lengths_lower_threshold = hold_out_scaled_data_with_lengths_widths['lengths'].quantile(0.05)
widths_upper_threshold = hold_out_scaled_data_with_lengths_widths['widths'].quantile(0.95)
widths_lower_threshold = hold_out_scaled_data_with_lengths_widths['widths'].quantile(0.05)

big_feet = hold_out_scaled_data_with_lengths_widths[
    (hold_out_scaled_data_with_lengths_widths['lengths'] > lengths_upper_threshold) |
    (hold_out_scaled_data_with_lengths_widths['widths'] > widths_upper_threshold)
]

small_feet = hold_out_scaled_data_with_lengths_widths[
    (hold_out_scaled_data_with_lengths_widths['lengths'] < lengths_lower_threshold) |
    (hold_out_scaled_data_with_lengths_widths['widths'] < widths_lower_threshold)
]

print("Big Feet Data Points:")
print(big_feet)

print("\nSmall Feet Data Points:")
print(small_feet)

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))

plt.scatter(hold_out_scaled_data_with_lengths_widths['lengths'], hold_out_scaled_data_with_lengths_widths['widths'], alpha=0.5, label='Normal Data')

plt.scatter(big_feet['lengths'], big_feet['widths'], color='red', label='Big Feet Outliers', edgecolor='black')
plt.scatter(small_feet['lengths'], small_feet['widths'], color='yellow', label='Small Feet Outliers', edgecolor='black')

plt.xlabel('Lengths')
plt.ylabel('Widths')
plt.legend()
plt.show()

In [None]:
big_foot_1 = big_feet[
    (big_feet['lengths'] > 2200) & (big_feet['lengths'] < 2300) &
    (big_feet['widths'] > 1000) & (big_feet['widths'] < 1100)
]

In [None]:
if not small_foot_1.empty:
    plot_footprint(small_foot_1.iloc[0], 'Visual Representation of Small Foot 1 (Length ~ 1780, Width ~ 200)')

In [None]:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer, KNNImputer

holdout_datasets = {
    'IQR': IQR(hold_out_data_df),
    'RobustScaling': cap_outliers_and_scale(hold_out_data_df),
    'Winsorization': Winsorization(hold_out_data_df),
    'Zscore': z_score(hold_out_data_df),
}

imputed_variants_holdout = {}

# fit_transform must be called on an imputer instance, not on the class itself.
# Default settings are used here; match the training-time settings if they differ.
for variant_name, hold_out_data in holdout_datasets.items():
    hold_out_imputed = pd.DataFrame(IterativeImputer().fit_transform(hold_out_data), columns=hold_out_data.columns)
    key = f"{variant_name}_Iterative"
    imputed_variants_holdout[key] = hold_out_imputed

for variant_name, hold_out_data in holdout_datasets.items():
    hold_out_imputed = pd.DataFrame(KNNImputer().fit_transform(hold_out_data), columns=hold_out_data.columns)
    key = f"{variant_name}_knn"
    imputed_variants_holdout[key] = hold_out_imputed

print(imputed_variants_holdout.keys())

In [None]:
holdout_feature_engineered_variants = {}

lengths_widths_temp_dict = {}

for variant_name, hold_out_data in imputed_variants_holdout.items():

    hold_out_lengths = lengths_widths_calculation(hold_out_data)

    key = f"{variant_name}_lengths_widths"

    lengths_widths_temp_dict[key] = hold_out_lengths

holdout_feature_engineered_variants.update(lengths_widths_temp_dict)

point7_temp_dict = {}

for variant_name, hold_out_data in imputed_variants_holdout.items():

    hold_out_point7 = point7_calculation(hold_out_data)
    key = f"{variant_name}_point7"
    point7_temp_dict[key] = hold_out_point7

holdout_feature_engineered_variants.update(point7_temp_dict)

print(holdout_feature_engineered_variants.keys())

In [None]:
print(holdout_feature_engineered_variants.keys())
print(imputed_variants_holdout.keys())

In [None]:
hold_out_XGBoost_outliers_variants = {}

# Hold-out variant: Winsorization + iterative imputation + point7 features.
hold_out_data = holdout_feature_engineered_variants['Winsorization_Iterative_point7']

# Apply each outlier treatment to the hold-out variant selected above.
hold_out_point7_datasets = {
    'point7_IQR': IQR(hold_out_data),
    'point7_Zscore': z_score(hold_out_data),
    'point7_isolationforest': isolation_forest(hold_out_data)
}

print(hold_out_point7_datasets.keys())

In [None]:
if 'point7_Zscore' in hold_out_point7_datasets:

    display(hold_out_point7_datasets['point7_Zscore'].describe())
else:
    print("The key 'point7_Zscore' does not exist in the dictionary.")

In [None]:
if 'point7_IQR' in hold_out_point7_datasets:

    display(hold_out_point7_datasets['point7_IQR'].describe())
else:
    print("The key 'point7_IQR' does not exist in the dictionary.")

In [None]:
if 'point7_isolationforest' in hold_out_point7_datasets:

    display(hold_out_point7_datasets['point7_isolationforest'].describe())
else:
    print("The key 'point7_isolationforest' does not exist in the dictionary.")

In [None]:
# Features selected by RFECV for 'point7_Zscore_SMOTETomek_rfecv'; update this list if another variant is used.
selected_features = [
    'x0', 'y0', 'x1', 'y2', 'x3', 'y3', 'x4', 'x5', 'y5', 'x6', 'y6',
    'x7', 'y7', 'x8', 'y8', 'x10', 'y10', 'x11', 'y11', 'x12', 'y12',
    'x13', 'y13', 'y14', 'x17', 'T1', 'T2', 'T3', 'T4', 'T5', 'BAB',
    'BAH', 'HB_index'
]


In [None]:
hold_out_data_filtered_unscaled = hold_out_point7_datasets['point7_Zscore'][selected_features]

In [None]:
hold_out_data_filtered_unscaled.describe().T

In [None]:
Used_in_model = 'point7_Zscore_SMOTETomek_rfecv'  # change this if a different data variant is used
x_train_scale, x_test_scale, y_train_scale, y_test_scale = XGBoost_outliers_variants_features_selected[Used_in_model]

In [None]:
scaler = StandardScaler()
# Fit on the training split so the hold-out set is scaled with training statistics.
scaler.fit(x_train_scale)

hold_out_data_filtered_unscaled = hold_out_point7_datasets['point7_Zscore'][selected_features]
hold_out_data_filtered_unscaled = hold_out_data_filtered_unscaled.reindex(columns=x_train_scale.columns)

try:
    hold_out_data_filtered_scaled = pd.DataFrame(
        scaler.transform(hold_out_data_filtered_unscaled),
        columns=hold_out_data_filtered_unscaled.columns
    )
    print("Scaling successful.")
except ValueError as e:
    print("Error during scaling:", e)

In [None]:
print(hold_out_data_filtered_unscaled.head())
print(hold_out_data_filtered_scaled.head())

In [None]:
print("Mean used by scaler: ", scaler.mean_)
print("Scale used by scaler: ", scaler.scale_)

In [None]:
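scaler = StandardScaler()
# X_train_best comes from an already-scaled variant, so fit on the unscaled training features instead.
scaler.fit(x_train_scale)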

hold_out_data_filtered_scaled = pd.DataFrame(
    scaler.transform(hold_out_data_filtered_unscaled),
    columns=hold_out_data_filtered_unscaled.columns
)

print(hold_out_data_filtered_scaled.describe().T)

In [None]:
print(hold_out_data_filtered_unscaled.head())
print(hold_out_data_filtered_scaled.head())
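
The hold-out features should be standardized with the mean and standard deviation learned from the training features, never with statistics computed from the hold-out or test rows themselves. A minimal sketch of that arithmetic, using made-up numbers rather than the footprint data:

```python
import numpy as np

# Made-up feature values; the notebook would use its training split and hold-out set.
train_features = np.array([[10.0, 200.0], [12.0, 220.0], [14.0, 240.0]])
holdout_features = np.array([[11.0, 210.0], [16.0, 260.0]])

# Learn the statistics from the training rows only.
train_mean = train_features.mean(axis=0)
train_std = train_features.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same shift and scale to both sets.
train_scaled = (train_features - train_mean) / train_std
holdout_scaled = (holdout_features - train_mean) / train_std

print(train_scaled.mean(axis=0))
print(holdout_scaled)
```

The scaled training columns have mean 0 by construction, while the hold-out columns generally do not. A hold-out `describe()` showing means far from 0 is expected when the two distributions differ, but means of exactly 0 would suggest the scaler was fit on the hold-out rows.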

# Submitting to Kaggle

In [None]:
Used_in_model = 'point7_Zscore_SMOTETomek_rfecv'
x_train_submit, x_test_submit, y_train_submit, y_test_submit = XGBoost_outliers_variants_features_selected[Used_in_model]

In [None]:
print(x_train_submit.shape)

In [None]:
scaler = StandardScaler()
scaler.fit(x_train_submit)

In [None]:
x_train_submit_scaled = pd.DataFrame(scaler.transform(x_train_submit), columns=x_train_submit.columns)
x_train_submit_unscaled = x_train_submit.copy()

In [None]:
print(x_train_submit.shape)
print(y_train_submit.shape)

In [None]:
y_train_submit = pd.Series(y_train_submit)
y_train_submit.reset_index(drop=True, inplace=True)
y_train_submit = y_train_submit.values

In [None]:
best_XGB_After_proccess.fit(x_train_submit_unscaled, y_train_submit)
best_GB_After_proccess.fit(x_train_submit_unscaled, y_train_submit)

best_RF_After_proccess.fit(x_train_submit_scaled, y_train_submit)
best_SVC_After_proccess.fit(x_train_submit_scaled, y_train_submit)

best_ensemble_Voting_model submit

In [None]:
training_columns = x_train_submit.columns
hold_out_data_filtered_unscaled = hold_out_data_filtered_unscaled.reindex(columns=training_columns, fill_value=0)

hold_out_data_filtered_scaled = pd.DataFrame(
    scaler.transform(hold_out_data_filtered_unscaled),
    columns=hold_out_data_filtered_unscaled.columns
)

In [None]:
xgb_pred = best_XGB_After_proccess.predict(hold_out_data_filtered_unscaled)
gbc_pred = best_GB_After_proccess.predict(hold_out_data_filtered_unscaled)

# The random forest was fit on scaled features, so it must predict on scaled features too.
rf_pred = best_RF_After_proccess.predict(hold_out_data_filtered_scaled)
svm_pred = best_SVC_After_proccess.predict(hold_out_data_filtered_scaled)

In [None]:
hold_out_pred = best_ensemble_Voting_model.predict_proba(hold_out_data_filtered_scaled)[:, 1]

In [None]:
RowID = np.array(hold_out_data_filtered_unscaled.index)

In [None]:
results = pd.DataFrame({'RowID': RowID, 'sex': hold_out_pred})

In [None]:
print(results)

In [None]:
results.to_csv('results.csv', index=False)
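
The cells above write positive-class probabilities into the submission file. If the competition is scored on accuracy and expects class labels (an assumption, since the required submission format is not stated here), the probabilities can be thresholded first. A minimal sketch with made-up probabilities follows; the 0.5 threshold and the `Sex` column name are also assumptions:

```python
import numpy as np
import pandas as pd

# Made-up positive-class probabilities standing in for predict_proba(...)[:, 1].
demo_proba = np.array([0.91, 0.12, 0.50, 0.49])

# Threshold at 0.5 to obtain hard labels (the threshold is an assumption).
demo_labels = (demo_proba >= 0.5).astype(int)

demo_submission = pd.DataFrame({
    "RowID": np.arange(len(demo_labels)),
    "Sex": demo_labels,  # assumed column name; check the competition's sample submission
})
print(demo_submission)
```

Whichever format applies, the column names and the row identifiers should be checked against the competition's sample submission before uploading.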

In [None]:
# !kaggle competitions submit -c budm-24 -f results.csv -m 'test'

best_GB_After_proccess submit

In [None]:
best_GB_After_proccess.fit(x_train_submit_scaled, y_train_submit)

gb_pred = best_GB_After_proccess.predict_proba(hold_out_data_filtered_scaled)[:, 1]

results = pd.DataFrame({
    'RowID': np.array(hold_out_data_filtered_scaled.index),
    'Sex': gb_pred
})

results.to_csv('best_GB_After_proccess.csv', index=False)
print(results.head())

In [None]:
results.to_csv('best_GB_After_proccess.csv', index=False)

In [None]:
!kaggle competitions submit -c budm-24 -f best_GB_After_proccess.csv -m 'best_GB_After_proccess_test'

best_XGB_After_proccess submit

In [None]:
best_XGB_After_proccess.fit(x_train_submit_scaled, y_train_submit)

xgb_proba = best_XGB_After_proccess.predict_proba(hold_out_data_filtered_scaled)[:, 1]

results = pd.DataFrame({
    'RowID': np.array(hold_out_data_filtered_scaled.index),
    'Sex': xgb_proba
})

results.to_csv('best_XGB_After_proccess.csv', index=False)
print(results.head())

In [None]:
results.to_csv('best_XGB_After_proccess.csv', index=False)

In [None]:
!kaggle competitions submit -c budm-24 -f best_XGB_After_proccess.csv -m 'best_XGB_After_proccess_test'

best_RF_After_proccess submit

In [None]:
best_RF_After_proccess.fit(x_train_submit_scaled, y_train_submit)

rf_proba = best_RF_After_proccess.predict_proba(hold_out_data_filtered_scaled)[:, 1]

results = pd.DataFrame({
    'RowID': np.array(hold_out_data_filtered_scaled.index),
    'Sex': rf_proba
})

results.to_csv('best_RF_After_proccess.csv', index=False)
print(results.head())

In [None]:
results.to_csv('best_RF_After_proccess.csv', index=False)

In [None]:
!kaggle competitions submit -c budm-24 -f best_RF_After_proccess.csv -m 'best_RF_After_proccess_test'

best_SVC_After_proccess submit

In [None]:
best_SVC_After_proccess.fit(x_train_submit_scaled, y_train_submit)

svc_proba = best_SVC_After_proccess.predict_proba(hold_out_data_filtered_scaled)[:, 1]

results = pd.DataFrame({
    'RowID': np.array(hold_out_data_filtered_scaled.index),
    'Sex': svc_proba
})

results.to_csv('best_SVC_After_proccess.csv', index=False)
print(results.head())

In [None]:
results.to_csv('best_SVC_After_proccess.csv', index=False)

In [None]:
!kaggle competitions submit -c budm-24 -f best_SVC_After_proccess.csv -m 'best_SVC_After_proccess_test'