<a href="https://colab.research.google.com/github/mauriciomau0/ML-Based-Performance-Monitoring-Approach-for-Athlete-Performance-Attenuation-Prediction/blob/main/Test_ML_Paper_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Initial Data Loading and Feature Dropping

The code reads data from an Excel file (pd.read_excel) and displays the first few rows (data.head()), which are standard initial steps in preparing data for analysis.

## Importing the appropriate packages

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Loading our data

In [None]:
data = pd.read_excel("/content/Data_PhD.xlsx")
print(type(data))
data.head()

## Feature Selection

In [None]:
df2 = data.drop(['Ranking_PERC','Ranking_CK','Ranking_CMJ','Ranking_DJ','Ranking_DJCont','Ranking_RSI'], axis=1)
X = df2.values

## Inspecting and choosing the right data

In [None]:
#correlation
correlation=df2.corr()
print(correlation)

In [None]:
df2.corr().unstack().sort_values().drop_duplicates()

In [None]:
#heatmap
import seaborn as sns
sns.heatmap(correlation, annot=True,fmt=".1f",linewidths=.1, linecolor='#ffffff',
            cmap='YlGnBu', xticklabels=1, yticklabels=1)

'Age' has a low average correlation with other features and is not strongly correlated with any single feature.

The decision to remove 'Age' is based on specific goals of the modeling process, rather than purely on the correlation analysis.

In [None]:
# Drop variables whose average absolute correlation with the others is low.
# df2 is the DataFrame created in the feature-selection cell above.

# Calculate correlation matrix
correlation_matrix = df2.corr()

# Calculate average absolute correlation for each variable
# (the mean includes the diagonal, i.e. each variable's correlation of 1 with itself)
average_correlation = correlation_matrix.abs().mean()

# Set a threshold for average correlation
correlation_threshold = 0.25 # Adjust this threshold as needed

# Identify variables with average correlations below the threshold
less_correlated_variables = average_correlation[average_correlation < correlation_threshold].index.tolist()

# Remove less correlated variables from the DataFrame
df_filtered = df2.drop(columns=less_correlated_variables)

# Optional: Print the removed variables
print("Removed Variables:", less_correlated_variables)


## **Let's begin the analysis | Factor analysis and principal component analysis (PCA)**

The code aims to perform **factor analysis and principal component analysis** (PCA) on the standardized dataset. It first checks the suitability of the data for factor analysis using **Bartlett's test of sphericity**. Then, it performs PCA and factor analysis with varimax rotation, displaying the loadings. Finally, it visualizes the **factor loadings** and shows the **explained variance ratio** for each principal component.

Now that we have our data, let's begin the analysis. First, we rescale the features to the [0, 1] range with `MinMaxScaler`; standardization with `StandardScaler` is applied later, before PCA.
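
The notebook uses two different rescalings: `MinMaxScaler` before Bartlett's test and `StandardScaler` before PCA. The sketch below shows the difference on a made-up two-column array; `X_toy` and its values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np

# Toy data: two features on very different scales (hypothetical values)
X_toy = np.array([[60.0, 1.10],
                  [75.0, 1.25],
                  [90.0, 1.40]])

# Min-max scaling: maps each column to the range [0, 1]
X_minmax = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

# Standardization (z-score): zero mean and unit variance per column
# (population standard deviation, ddof=0, as StandardScaler uses)
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_minmax)
print(X_std)
```

Both are per-column affine transforms with a positive slope, so the correlation matrix of the data is the same under either one. PCA is different: it works on variances, so the choice of scaling changes its result.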

In [None]:
X = df_filtered.values

In [None]:
# Rescale each feature to the [0, 1] range (min-max scaling, not standardization)
from sklearn.preprocessing import MinMaxScaler

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Fit the scaler and transform the data
data_new = scaler.fit_transform(X)


In [None]:
print(data_new)

In [None]:
!pip install factor-analyzer

## **Bartlett's Test**

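Bartlett's test of sphericity assesses the suitability of the data for factor analysis by testing whether the correlation matrix differs significantly from an identity matrix; a small p-value indicates that the variables are sufficiently correlated. The heatmap above provides a clear overview of the relationships between variables.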
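
To make the test less of a black box, here is a minimal sketch of the statistic under the usual chi-square approximation. The helper `bartlett_sphericity` and the synthetic array `X_corr` are hypothetical names introduced for illustration; `factor_analyzer`'s own implementation may differ in detail.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test of sphericity for an (n_samples, n_features) array."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)            # p x p correlation matrix
    _, logdet = np.linalg.slogdet(R)            # log|R|, numerically stable
    statistic = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) / 2                       # degrees of freedom
    p_value = chi2.sf(statistic, dof)           # upper-tail probability
    return statistic, p_value

# Synthetic example: four columns sharing one latent factor (assumed data)
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 1))
X_corr = latent + 0.5 * rng.normal(size=(50, 4))

stat, p = bartlett_sphericity(X_corr)
print(f"chi2 = {stat:.2f}, p = {p:.3g}")
```

Because the p-value is often extremely small, printing it with an integer format would display it as 0; a float or general format keeps it readable.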

In [None]:
# @title
# Importing required libraries
import pandas as pd
import numpy as np
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt

In [None]:
# Import the module that performs the Bartlett test
import factor_analyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value, p_value = calculate_bartlett_sphericity(data_new)
print('Bartlett Sphericity Test: chi² = %.2f,  p_value = %.3g' % (chi_square_value, p_value))

Now we continue with PCA. Following <u> <i> <b> Kaiser's </b> </i> </u> criterion, which retains the components whose eigenvalue is greater than 1, we choose the first 4 principal components. With PCA we thus explain 82% of the variance in our data <u> using only a four-dimensional space. </u>
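
As an illustration of how Kaiser's criterion is applied, the sketch below counts the eigenvalues of a correlation matrix that exceed 1. It runs on synthetic data with two latent factors; `X_demo`, `weights` and the noise level are assumptions for the example and unrelated to the athlete dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 6 variables driven by 2 latent factors plus noise
latent = rng.normal(size=(200, 2))
weights = np.array([[1, 1, 1, 0, 0, 0],
                    [0, 0, 0, 1, 1, 1]], dtype=float)
X_demo = latent @ weights + 0.5 * rng.normal(size=(200, 6))

# Eigenvalues of the correlation matrix, sorted in descending order
R = np.corrcoef(X_demo, rowvar=False)
eigenvalues = np.linalg.eigvalsh(R)[::-1]

# Kaiser's criterion: keep components with eigenvalue > 1
n_keep = int((eigenvalues > 1).sum())
explained = eigenvalues[:n_keep].sum() / eigenvalues.sum()

print("Eigenvalues:", eigenvalues.round(2))
print("Components kept:", n_keep)
print("Share of variance explained:", round(explained, 2))
```

The eigenvalues of a correlation matrix sum to the number of variables, so an eigenvalue above 1 means the component carries more variance than a single standardized variable.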

In [None]:
feature_names = df_filtered.columns
feature_names

In [None]:
X = df_filtered.values

In [None]:
from sklearn import preprocessing

std_scale = preprocessing.StandardScaler().fit(X)
X_scaled = std_scale.transform(X)

## **Number of Principal Components - PCA**

In [None]:
from sklearn import decomposition

pca = decomposition.PCA(n_components=4)
pca.fit(X_scaled)

In [None]:
print (pca.explained_variance_ratio_)
print (pca.explained_variance_ratio_.sum())

### **VARIMAX Method**

VARIMAX, short for "variance maximizing rotation," is a type of orthogonal rotation method. Orthogonal rotations keep the factors uncorrelated (independent) from each other, which simplifies interpretation.
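
For intuition, a common SVD-based iteration for varimax can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions (the function name `varimax`, its defaults, and the random loading matrix `L` are hypothetical); it is not the code `factor_analyzer` runs internally.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate a (n_variables, n_factors) loading matrix with the varimax criterion."""
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion with respect to the rotation matrix
        col_ss = (rotated ** 2).sum(axis=0)
        grad = loadings.T @ (rotated ** 3 - rotated @ np.diag(col_ss) / p)
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt                  # orthogonal matrix closest to the gradient
        new_objective = s.sum()
        if new_objective < objective * (1 + tol):
            break                          # no meaningful improvement
        objective = new_objective
    return loadings @ rotation, rotation

rng = np.random.default_rng(0)
L = rng.normal(size=(8, 3))                # hypothetical unrotated loadings
L_rot, R_mat = varimax(L)

print(np.round(R_mat.T @ R_mat, 6))        # identity: the rotation is orthogonal
```

Since the rotation is orthogonal, each variable's communality (the row sum of squared loadings) is the same before and after rotation; only the distribution of loadings across factors changes.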





In [None]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from factor_analyzer import FactorAnalyzer
from sklearn.decomposition import FactorAnalysis

n_comps = 4

methods = [
    ("PCA", PCA(n_components=n_comps)),
    ("Unrotated Loadings", FactorAnalysis(n_components=n_comps)),
    ("Varimax-Rotated Loadings", FactorAnalyzer(n_factors=n_comps, rotation="varimax"))
]

# Assuming X_scaled is your standardized dataset
for method, model in methods:
    model.fit(X_scaled)
    if method == "PCA":
        loadings = model.components_.T
    elif method == "Varimax-Rotated Loadings":
        loadings = model.loadings_
    else:  # For Unrotated FA
        loadings = model.components_.T

    communalities = np.sum(loadings ** 2, axis=1)

    # Create DataFrame
    table = pd.DataFrame(data=loadings, columns=[f'Factor {i+1}' for i in range(n_comps)])
    table['Communalities'] = communalities
    table.index = feature_names  # Assuming feature_names is defined

    print(f"\n\n {method} :\n")
    print(table)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, FactorAnalysis
from factor_analyzer import FactorAnalyzer

n_comps = 4

methods = [
    ("PCA", PCA(n_components=n_comps)),
    ("Unrotated FA", FactorAnalysis(n_components=n_comps)),
    ("Varimax FA", FactorAnalyzer(n_factors=n_comps, rotation="varimax"))
]

fig, axes = plt.subplots(ncols=len(methods), figsize=(15, 8), sharey=True)

for ax, (method, model) in zip(axes, methods):
    model.fit(X_scaled)  # Assuming X_scaled is your standardized dataset

    if method == "Varimax FA":
        components = model.loadings_
    else:  # PCA and unrotated FA both expose components_
        components = model.components_.T

    print("\n\n %s :\n" % method)
    print(components)

    vmax = np.abs(components).max()
    ax.imshow(components, cmap="RdBu", vmax=vmax, vmin=-vmax)
    ax.set_yticks(np.arange(len(feature_names)))
    ax.set_yticklabels(feature_names)
    ax.set_title(str(method))
    ax.set_xticks(np.arange(n_comps))
    ax.set_xticklabels(["Comp. {}".format(i+1) for i in range(n_comps)])

fig.suptitle("Factors")
plt.tight_layout()
plt.savefig('factors_PCA_Varimax.png', dpi=300, bbox_inches='tight')
plt.show()

The Varimax-rotated factor analysis reveals four distinct factors underlying the relationships between the variables in the dataset:

**Factor 1**: Strength: This factor is primarily characterized by high positive loadings on variables related to body composition (Weight (kg), %Body Fat) and measures of strength (1RM Hip Thrust (kg), Relative 1RM HT(kg), Relative 1RM Squat (kg)).

**Factor 2**: Lower Body Power and Explosiveness: This factor is dominated by high positive loadings on Baseline Drop Jump (cm) and Baseline CMJ (cm), both of which are measures of lower body power and explosiveness.

**Factor 3**: Speed and Agility: This factor is defined by high positive loadings on Speed 5m (s) and Speed 20m (s).

**Factor 4**: This factor has high positive loadings on Weight (kg) and %Body Fat.

These interpretations provide insights into the underlying structure of the data and how the variables relate to each other. Each factor represents a distinct dimension of physical fitness, and individuals can be characterized by their scores on these factors.

## **Cumulative Explained Variance**

This section visualizes the cumulative explained variance ratio for the principal components. It shows how much of the total variance in the data is captured by increasing the number of components. This information helps assess the trade-off between dimensionality reduction and information retention.
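
The sketch below shows how a cumulative explained-variance curve can be used to pick the number of components. The data are synthetic and the 80% target (`threshold`) is a hypothetical choice for illustration, not a value taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic correlated data, standardized column by column
X_demo = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Eigenvalues of the covariance matrix are the PCA explained variances
eigvals = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]
ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative ratio reaches the target
threshold = 0.80
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print("Cumulative explained variance:", cumulative.round(3))
print("Components needed:", n_components)
```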

In [None]:
#making the pca
from sklearn.decomposition import PCA
pca = PCA(n_components=13,svd_solver="full")
pca.fit(X_scaled)
#taking the explained variance (eigenvalue) for each component
variance=pca.explained_variance_
#explained variance ratio
variance_ratio=(pca.explained_variance_ratio_)*100
#cumulative explained variance ratio
cum=np.cumsum(variance_ratio)
#building a dataframe (each row is a principal component, not an original variable)
names=[f'PC{i+1}' for i in range(len(variance))]
table=pd.DataFrame({'Component':names,'Eigenvalue':variance,'% of explained variance ratio':variance_ratio,'% of cumulative explained variance ratio':cum})
print(table)

Let's draw the <b> scree plot </b>, which is another way of judging how many components to keep.

In [None]:
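# Project the same standardized data the PCA was fitted on
# (data_new is min-max scaled and does not match the fitted PCA)
X_projected = pca.transform(X_scaled)
print(X_projected.shape)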

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Explained variance obtained from your code
explained_variance = variance_ratio
eigenvalues = variance

fig, ax1 = plt.subplots(figsize=(10, 6))

# Plot percentage of variance explained
ax1.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', color='skyblue', label='% Variance Explained')
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('% of Variance Explained', color='black')
ax1.tick_params(axis='y', labelcolor='black')
# ax1.legend(loc='upper left')

# Plot eigenvalues on the secondary y-axis
ax2 = ax1.twinx()
ax2.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker='o', color='orange', label='Eigenvalues')
ax2.set_ylabel('Eigenvalues', color='black')
ax2.tick_params(axis='y', labelcolor='black')
# ax2.legend(loc='upper right')

# Annotate explained variance points
for i, var in enumerate(explained_variance):
    ax1.annotate(f'{var:.2f}%', (i + 1, var), textcoords="offset points", xytext=(0, 5), ha='center', fontweight='bold')

# Set x-axis ticks and labels for every component
ax1.set_xticks(np.arange(1, len(explained_variance) + 1))
ax1.set_xticklabels(np.arange(1, len(explained_variance) + 1))

#plt.title('Scree Plot')
plt.tight_layout()
sns.despine(left=True)
sns.axes_style("whitegrid")

plt.show()


In [None]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

## Balancing Method

### Number of New Samples Generated

In [None]:
!pip install imbalanced-learn

In [None]:
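from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from sklearn.model_selection import train_test_split
import numpy as np

# X_projected comes from the PCA cell above; y_encoded is defined in the
# rank-specific preprocessing cells further down, which must be run first
x_train_full, x_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)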

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(x_train_full, y_train_full)

# Print the number of new samples generated by SMOTE
print(f"Original training data shape: {x_train_full.shape[0]} samples, {x_train_full.shape[1]} features")
print(f"Resampled training data shape: {X_resampled.shape[0]} samples, {X_resampled.shape[1]} features")
print(f"Number of new samples generated by SMOTE: {X_resampled.shape[0] - x_train_full.shape[0]}")
print(f"Percentage increase in data points: {(X_resampled.shape[0] - x_train_full.shape[0]) / x_train_full.shape[0] * 100:.2f}%")

adasyn = ADASYN(random_state=42)
X_resampled_adasyn, y_resampled_adasyn = adasyn.fit_resample(x_train_full, y_train_full)

# Number of minority-class samples to add (fixed by hand)
minority_class_samples = 8

# Draw that many minority-class rows (label 1) from the ADASYN output;
# the pool contains both original and synthetic samples, and the draw is not seeded
minority_class_indices = np.where(y_resampled_adasyn == 1)[0]
selected_minority_class_indices = np.random.choice(minority_class_indices, size=minority_class_samples, replace=True)

# Create the final resampled dataset
X_resampled_adasyn_final = np.concatenate((x_train_full, X_resampled_adasyn[selected_minority_class_indices]))
y_resampled_adasyn_final = np.concatenate((y_train_full, y_resampled_adasyn[selected_minority_class_indices]))

# Print the number of new samples generated by ADASYN
print(f"Original training data shape: {x_train_full.shape[0]} samples, {x_train_full.shape[1]} features")
print(f"Resampled training data shape: {X_resampled_adasyn_final.shape[0]} samples, {X_resampled_adasyn_final.shape[1]} features")
print(f"Number of new samples generated by ADASYN: {X_resampled_adasyn_final.shape[0] - x_train_full.shape[0]}")
print(f"Percentage increase in data points: {(X_resampled_adasyn_final.shape[0] - x_train_full.shape[0]) / x_train_full.shape[0] * 100:.2f}%")

ros = RandomOverSampler(random_state=42)
X_resampled_ros, y_resampled_ros = ros.fit_resample(x_train_full, y_train_full)

# Print the number of new samples generated by ROS
print(f"Original training data shape: {x_train_full.shape[0]} samples, {x_train_full.shape[1]} features")
print(f"Resampled training data shape: {X_resampled_ros.shape[0]} samples, {X_resampled_ros.shape[1]} features")
print(f"Number of new samples generated by ROS: {X_resampled_ros.shape[0] - x_train_full.shape[0]}")
print(f"Percentage increase in data points: {(X_resampled_ros.shape[0] - x_train_full.shape[0]) / x_train_full.shape[0] * 100:.2f}%")

> **Note: the rank-specific cells (e.g. the Perception rank cells below) must be re-run every time before running the models, because the data have to be set up according to the rank that will be predicted.**

# **Ranking Perception Prediction: Data Preprocessing and Model Training**

The code starts by creating a binary target variable (Reviews_Perc) based on the Ranking_PERC column. Ranks 1 to 21 are labeled '0' (the top group), and ranks 22 to 41 are labeled '1' (the bottom group).

The PCA-transformed data (X_projected) is used as the feature set.
The data is split into training and testing sets using a 75%/25% split.
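
A quick check of where the boundary rank lands may help here. The sketch applies `pd.cut` with the cut points 0, 21 and 41 to a handful of hypothetical rank values (`ranks` is made up for illustration):

```python
import pandas as pd

# Hypothetical ranks around the class boundary
ranks = pd.Series([1, 20, 21, 22, 41])

# Intervals are right-closed by default: (0, 21] and (21, 41]
labels = pd.cut(ranks, bins=[0, 21, 41], labels=['0', '1'], include_lowest=True)

print(labels.tolist())
```

Rank 21 falls in the first bin, so the first class covers ranks 1 to 21 and the second covers ranks 22 to 41.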



In [None]:
import pandas as pd

# Define the bin edges: (0, 21] and (21, 41], i.e. ranks 1-21 and ranks 22-41
bin_edges = [0, 21, 41]

# Define the bin labels
bin_labels = ['top_20', 'next_20']

# Use the cut function to create categorical labels
# (the next cell overwrites this column with '0'/'1' string labels)
reviews = data['Reviews_Perc'] = pd.cut(data['Ranking_PERC'], bins=bin_edges, labels=bin_labels, include_lowest=True)

In [None]:
reviews = []
for i in data['Ranking_PERC']:
    if i >= 1 and i <= 21:
        reviews.append('0')
    elif i >= 22 and i <= 41:
        reviews.append('1')

data['Reviews_Perc'] = reviews

In [None]:
y = data['Reviews_Perc']

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_projected, y, test_size = 0.25, random_state=42)

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

## **ROSE Balancing Method**

The code starts by encoding the target variable (y) into numerical format using LabelEncoder.

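In the cross-validated cell below, 20% of the data is held out for testing and the remainder is split again 80%/20% into training and validation sets; the later cells reserve 25% for testing.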

Several classifier models are initialized: Random Forest, Naive Bayes, SVM, AdaBoost, and XGBoost.

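Random over-sampling (called ROSE in this notebook and implemented with imbalanced-learn's `RandomOverSampler`) is applied to the training data. With its default settings, this technique balances the class distribution by duplicating randomly selected minority-class samples rather than generating synthetic ones, addressing the issue of imbalanced data.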
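
The idea behind random over-sampling can be sketched in plain NumPy. The helper `random_oversample` and the toy arrays are hypothetical and written only to illustrate the mechanism; they are not imbalanced-learn's implementation.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority-class rows until all classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = [np.arange(len(y))]              # keep every original row
    for cls, count in zip(classes, counts):
        if count < target:
            pool = np.flatnonzero(y == cls)      # rows belonging to this class
            idx_parts.append(rng.choice(pool, size=target - count, replace=True))
    idx = np.concatenate(idx_parts)
    return X[idx], y[idx]

# Toy imbalanced data: 7 samples of class 0, 3 of class 1
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 7 + [1] * 3)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))
```

Every added row is an exact copy of an existing minority-class row, which is why resampling should be applied to training data only: copies that leak into a validation or test set would inflate the scores.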

**Explanation**:

The code iterates through each classifier model.
Each model is trained on the **oversampled data** (***x_resampled, y_resampled***).

The model is used to predict the target variable for the test set (x_test).
Model performance is evaluated using various metrics (**accuracy, precision, recall, F1-score**) and a **confusion matrix** is printed.

For the Random Forest model specifically:
**Feature importances** are calculated and displayed.
A horizontal bar plot is created to visualize the relative importance of each feature in the model.

In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Set a random seed for reproducibility
RANDOM_SEED = 42

# Encode labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split the data into 80% training and 20% testing
x_train_full, x_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.2, random_state=RANDOM_SEED)

# Further split the training data into 80% training and 20% validation
x_train, x_val, y_train, y_val = train_test_split(x_train_full, y_train_full, test_size=0.2, random_state=RANDOM_SEED)

# Initialize classifiers with random_state
classifiers = {
    'Random Forest': RandomForestClassifier(random_state=RANDOM_SEED),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(random_state=RANDOM_SEED),
    'AdaBoost': AdaBoostClassifier(random_state=RANDOM_SEED),
    'XGBoost': XGBClassifier(random_state=RANDOM_SEED)
}

# Initialize 5-fold cross-validation within the training set
kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)

# To store overall metrics and confusion matrices
metrics = {name: {'accuracy': [], 'precision': [], 'recall': [], 'f1': [], 'conf_matrix': []} for name in classifiers.keys()}

# Cross-validation loop on the training set
for train_index, val_index in kf.split(x_train):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]

    # Apply oversampling to the training fold
    ros = RandomOverSampler(random_state=RANDOM_SEED)
    x_resampled, y_resampled = ros.fit_resample(x_train_kf, y_train_kf)

    # Train and evaluate classifiers on validation fold
    for name, clf in classifiers.items():
        clf.fit(x_resampled, y_resampled)
        y_pred = clf.predict(x_val_kf)

        # Calculate metrics
        acc = accuracy_score(y_val_kf, y_pred)
        precision = precision_score(y_val_kf, y_pred, pos_label=1)
        recall = recall_score(y_val_kf, y_pred, pos_label=1)
        f1 = f1_score(y_val_kf, y_pred, pos_label=1)
        conf_matrix = confusion_matrix(y_val_kf, y_pred)

        # Store metrics
        metrics[name]['accuracy'].append(acc)
        metrics[name]['precision'].append(precision)
        metrics[name]['recall'].append(recall)
        metrics[name]['f1'].append(f1)
        metrics[name]['conf_matrix'].append(conf_matrix)

# Print the average results for each classifier on the validation set
for name in classifiers.keys():
    print(f"\nClassifier: {name}")
    print(f"Average Validation Accuracy: {np.mean(metrics[name]['accuracy'])}")
    print(f"Average Validation Precision: {np.mean(metrics[name]['precision'])}")
    print(f"Average Validation Recall: {np.mean(metrics[name]['recall'])}")
    print(f"Average Validation F1-score: {np.mean(metrics[name]['f1'])}")


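# Final evaluation on the test set
print("\nFinal evaluation on the test set:")
# Oversample the full training set once, as was done inside each CV fold
ros_final = RandomOverSampler(random_state=RANDOM_SEED)
x_train_bal, y_train_bal = ros_final.fit_resample(x_train_full, y_train_full)
for name, clf in classifiers.items():
    clf.fit(x_train_bal, y_train_bal)  # Train on the full (oversampled) training set
    y_pred_test = clf.predict(x_test)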

    # Calculate metrics on the test set
    acc_test = accuracy_score(y_test, y_pred_test)
    precision_test = precision_score(y_test, y_pred_test, pos_label=1)
    recall_test = recall_score(y_test, y_pred_test, pos_label=1)
    f1_test = f1_score(y_test, y_pred_test, pos_label=1)
    conf_matrix_test = confusion_matrix(y_test, y_pred_test)

    # Print test set metrics
    print(f"\nClassifier: {name}")
    print(f"Test Accuracy: {acc_test}")
    print(f"Test Precision: {precision_test}")
    print(f"Test Recall: {recall_test}")
    print(f"Test F1-score: {f1_test}")
    print(f"Test Confusion Matrix:\n{conf_matrix_test}")




### Hyper-parameter Tuning with Cross-Validation


### Leave-One-Out

In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import LabelEncoder
import numpy as np

In [None]:
!pip install shap

In [None]:
from sklearn.model_selection import LeaveOneOut, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import RandomOverSampler
import matplotlib.pyplot as plt
import numpy as np

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

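# Define function to compute per-fold train/validation accuracies under LOOCV with ROSE
def compute_learning_curves(model, X, y):
    from sklearn.base import clone
    model = clone(model)  # refit a copy so the caller's fitted model is left untouched
    train_accuracies = []
    val_accuracies = []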

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply ROSE to the training data
        ros = RandomOverSampler(random_state=42)
        X_resampled, y_resampled = ros.fit_resample(X_train_cv, y_train_cv)

        model.fit(X_resampled, y_resampled)
        train_pred = model.predict(X_resampled)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_resampled, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if param_grids.get(name):  # skip grid search when the grid is empty (e.g. Naive Bayes)
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=loo,
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply ROSE to the full training data and refit the best model
        ros = RandomOverSampler(random_state=42)
        X_resampled_full, y_resampled_full = ros.fit_resample(X_train_full, y_train_full)
        best_model.fit(X_resampled_full, y_resampled_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply ROSE to the full training data
        ros = RandomOverSampler(random_state=42)
        X_resampled_full, y_resampled_full = ros.fit_resample(X_train_full, y_train_full)
        model.fit(X_resampled_full, y_resampled_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()



In [None]:
import shap
import pandas as pd
import numpy as np

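# best_model is whatever the grid-search loop assigned last (XGBoost, given the order
# of the classifiers dict); to explain the AdaBoost model, store it separately in that loop.
# X_resampled_full is the resampled training data from the last loop iteration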

# Define a function to get model predictions
def predict_fn(X):
    return best_model.predict_proba(X)

# Initialize SHAP Explainer
explainer = shap.KernelExplainer(predict_fn, X_resampled_full)

# Compute SHAP values for the test set
shap_values = explainer.shap_values(X_test)

# Print the shape of SHAP values for debugging
# (np.shape also works when shap_values is a list of arrays)
print(f"SHAP values shape: {np.shape(shap_values)}")

# Check if SHAP values are a list (binary classification scenario)
if isinstance(shap_values, list):
    # Extract SHAP values for the positive class (usually the second class)
    shap_values_class = shap_values[1]
else:
    # If not a list, assume it's already for the positive class
    shap_values_class = shap_values

# Ensure SHAP values are 2D with shape (num_samples, num_features)
if shap_values_class.ndim == 3:
    # If 3D, handle the case appropriately
    shap_values_class = shap_values_class[:, :, 1]  # Select the SHAP values for the positive class if 3D
elif shap_values_class.ndim != 2:
    raise ValueError(f"Unexpected SHAP values shape: {shap_values_class.shape}")

# Calculate mean absolute SHAP values for feature importance
mean_shap_values = np.mean(np.abs(shap_values_class), axis=0)

# The model was trained on PCA projections, so each input column is a principal component
feature_names = [f"PC{i+1}" for i in range(X_test.shape[1])]

# Ensure lengths match
if len(feature_names) != len(mean_shap_values):
    print(f"Length mismatch: feature_names ({len(feature_names)}) != mean_shap_values ({len(mean_shap_values)})")
    # Optionally handle the mismatch here

# Ensure 1D arrays
feature_names = np.array(feature_names).flatten()
mean_shap_values = np.array(mean_shap_values).flatten()

# Create a DataFrame to show feature importances
try:
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': mean_shap_values
    })
    # Print feature importances
    print("\nFeature Importances using SHAP for Best AdaBoost Model:")
    print(importance_df)
except ValueError as e:
    print(f"Error creating DataFrame: {e}")

# Plot SHAP summary plot
shap.summary_plot(shap_values_class, X_test)


### Without Cross-Validation (optional, for testing)

In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply ROSE to training data
ros = RandomOverSampler(random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost':xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")

The **only difference** in the code below is that **feature importances** are calculated and displayed for the Random Forest model.

**This pattern recurs throughout the notebook.**



In [None]:
import matplotlib.pyplot as plt
import numpy as np

from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply ROSE (RandomOverSampler) to training data
ros = RandomOverSampler(random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost': xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")

    # Get variable importance for Random Forest and create a plot
    if name == 'Random Forest':
        variable_importance = clf.feature_importances_
        print(f"\nVariable Importance for {name}:")
        for idx, importance in enumerate(variable_importance):
            print(f"Feature {idx}: {importance}")

        # Get feature names
        feature_names = df_filtered.columns.tolist()  # Assumes df_filtered columns line up with the columns of X_projected

        # Plot variable importances with feature names
        plt.figure(figsize=(10, 6))
        bars = plt.barh(feature_names, variable_importance, color='skyblue')
        plt.xlabel('Importance')
        #plt.ylabel('Feature')
        #plt.title('Variable Importance for Random Forest Classifier')

        # Add text annotations
        '''for idx, bar in enumerate(bars):
            plt.text(bar.get_width(), bar.get_y() + bar.get_height() / 2, round(variable_importance[idx], 2),
                     va='center', ha='left')'''

        plt.tight_layout()
        sns.axes_style("whitegrid")
        sns.despine(left=True, bottom=True)
        plt.show()


## **SMOTE Balancing Method**
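
SMOTE differs from plain random oversampling in that it creates new minority points by interpolating between an existing minority sample and one of its nearest minority neighbours. The library-free sketch below illustrates only that interpolation idea; the function name `smote_like_sample`, the toy `minority` points, and `k=2` are assumptions made for illustration and are not part of the notebook or of imbalanced-learn.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point between a random minority sample
    and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # Rank the remaining minority points by squared Euclidean distance to the base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical minority-class points (two features each)
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 1.8]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Every synthetic point lies on a segment between two real minority points, so it stays inside the range already covered by the minority class.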

In [None]:
from imblearn.over_sampling import ADASYN, SMOTE

### Leave-One-Out
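
Leave-one-out cross-validation holds out exactly one sample per fold and trains on all the others, so a training set of n samples produces n folds. The sketch below generates these index splits by hand to make the fold structure visible; `leave_one_out_indices` is a hypothetical helper, not the scikit-learn `LeaveOneOut` class the notebook uses.

```python
def leave_one_out_indices(n_samples):
    """Yield (train_indices, val_index) pairs, holding out one sample per fold."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out

folds = list(leave_one_out_indices(5))
for train, val in folds:
    # Any oversampling should be fitted on `train` only, never on the held-out sample.
    print(f"train={train} val={val}")
```

Because each validation fold contains a single sample, per-fold accuracy is always either 0 or 1; only the average across folds is informative.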

In [None]:
x_train_full, x_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Oversample the training split only; the test split stays untouched
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(x_train_full, y_train_full)

# Print the number of new samples generated by SMOTE
print(f"Original training data shape: {x_train_full.shape[0]} samples, {x_train_full.shape[1]} features")
print(f"Resampled training data shape: {X_resampled.shape[0]} samples, {X_resampled.shape[1]} features")
print(f"Number of new samples generated by SMOTE: {X_resampled.shape[0] - x_train_full.shape[0]}")
print(f"Percentage increase in data points: {(X_resampled.shape[0] - x_train_full.shape[0]) / x_train_full.shape[0] * 100:.2f}%")

In [None]:
from sklearn.model_selection import LeaveOneOut, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import numpy as np

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Define function to compute learning curves with SMOTE
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply SMOTE to the training data
        smote = SMOTE(random_state=42)
        # Resample only this fold's training part so the held-out sample never leaks in
        x_resampled, y_resampled = smote.fit_resample(X_train_cv, y_train_cv)

        model.fit(x_resampled, y_resampled)
        train_pred = model.predict(x_resampled)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_resampled, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=loo,
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply SMOTE to the full training data and refit the best model
        smote = SMOTE(random_state=42)
        X_resampled_full, y_resampled_full = smote.fit_resample(X_train_full, y_train_full)
        best_model.fit(X_resampled_full, y_resampled_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply SMOTE to the full training data
        smote = SMOTE(random_state=42)
        X_resampled_full, y_resampled_full = smote.fit_resample(X_train_full, y_train_full)
        model.fit(X_resampled_full, y_resampled_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()



In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Assuming the best model parameters from GridSearchCV
best_params = {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 100}

# Initialize the Random Forest with the best parameters
rf_best = RandomForestClassifier(**best_params, random_state=42)

# Fit the model to the entire training data (if you're using the entire dataset)
rf_best.fit(X_train_full, y_train_full)

# Extract feature importances
importances = rf_best.feature_importances_
features = df_filtered.columns  # Assumes df_filtered columns line up with the columns of X_train_full

# Create a DataFrame for visualization
feature_importances = pd.DataFrame({
    'Feature': features,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Plot the feature importances
plt.figure(figsize=(10, 6))
plt.barh(feature_importances['Feature'], feature_importances['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importances from Random Forest')
plt.gca().invert_yaxis()  # To display the most important features at the top
plt.show()

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Encode target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply SMOTE to training data
smote = SMOTE(random_state=42)
x_resampled, y_resampled = smote.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost':xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")



## **ADASYN Balancing Method**
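
ADASYN generates more synthetic samples around minority points that are surrounded by majority-class neighbours. The sketch below covers only that allocation step: for each minority point it computes the share of majority samples among its k nearest neighbours, normalises those shares, and distributes the number of samples needed for full balance. The helper name `adasyn_allocation`, the one-dimensional toy data, and `k=2` are assumptions for illustration; the notebook itself relies on `imblearn.over_sampling.ADASYN`.

```python
def adasyn_allocation(X, y, minority_label, k=2):
    """Return how many synthetic samples each minority point would receive."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_minority = len(minority_idx)
    n_majority = len(y) - n_minority
    n_to_generate = n_majority - n_minority  # aim for fully balanced classes

    ratios = []
    for i in minority_idx:
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: sq_dist(X[i], X[j]))[:k]
        # Share of majority-class samples among the k nearest neighbours
        ratios.append(sum(y[j] != minority_label for j in nearest) / k)

    total = sum(ratios)
    if total == 0:
        return [0] * n_minority  # no minority point has majority neighbours
    return [round(r / total * n_to_generate) for r in ratios]

# Hypothetical data: seven majority points (0) and three minority points (1)
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [2.1], [9.0], [9.1]]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
allocation = adasyn_allocation(X, y, minority_label=1, k=2)
print(allocation)
```

The minority point at 2.1 sits among majority samples and therefore receives the largest share, while the two isolated minority points at 9.0 and 9.1 receive fewer.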

### Leave-One-Out

In [None]:
from sklearn.model_selection import LeaveOneOut, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import ADASYN
import matplotlib.pyplot as plt
import numpy as np

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Define function to compute learning curves with ADASYN
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply ADASYN to training data
        adasyn = ADASYN(random_state=42)
        # Resample only this fold's training part so the held-out sample never leaks in
        x_resampled, y_resampled = adasyn.fit_resample(X_train_cv, y_train_cv)

        model.fit(x_resampled, y_resampled)
        train_pred = model.predict(x_resampled)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_resampled, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=loo,
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply ADASYN to the full training data and refit the best model
        adasyn = ADASYN(random_state=42)
        X_resampled_full, y_resampled_full = adasyn.fit_resample(X_train_full, y_train_full)
        best_model.fit(X_resampled_full, y_resampled_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply ADASYN to the full training data
        adasyn = ADASYN(random_state=42)
        X_resampled_full, y_resampled_full = adasyn.fit_resample(X_train_full, y_train_full)
        model.fit(X_resampled_full, y_resampled_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()



In [None]:
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()


# Apply ADASYN to training data
adasyn = ADASYN(random_state=42)
x_resampled, y_resampled = adasyn.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost': xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")



# **Ranking CK Prediction**

This section searches for the best combination of balancing method and classifier for predicting the CK ranking class.

In [None]:
bin_edges = [0, 21, 41]  # With include_lowest=True the bins are [0, 21] and (21, 41], i.e. ranks 0-21 and 22-41
bin_labels = ['top_20', 'next_20']

# Apply pd.cut to create the 'Reviews_CK' column
reviews2 = data['Reviews_CK'] = pd.cut(data['Ranking_CK'], bins=bin_edges, labels=bin_labels, include_lowest=True)

In [None]:
reviews = []
for i in data['Ranking_CK']:
    if i >= 1 and i <= 21:
        reviews.append('0')
    elif i >= 22 and i <= 41:
        reviews.append('1')

data['Reviews_CK'] = reviews
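
The loop above silently skips any ranking outside 1 to 41, which would leave `reviews` shorter than the DataFrame and make the column assignment fail. As a hedged alternative, the sketch below maps a single rank to its class and raises on out-of-range values instead; `rank_to_class`, `cutoff`, and `max_rank` are hypothetical names, with the thresholds copied from the loop.

```python
def rank_to_class(rank, cutoff=21, max_rank=41):
    """Map a 1-based ranking to '0' (ranks 1..cutoff) or '1' (cutoff+1..max_rank)."""
    if not 1 <= rank <= max_rank:
        raise ValueError(f"rank {rank} is outside 1..{max_rank}")
    return '0' if rank <= cutoff else '1'

# Boundary ranks: 21 still belongs to class '0', 22 is the first rank of class '1'
labels = [rank_to_class(r) for r in [1, 20, 21, 22, 41]]
print(labels)
```

In the notebook this could be applied with something like `data['Ranking_CK'].map(rank_to_class)`, assuming the column holds integer ranks.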

In [None]:
y2 = data['Reviews_CK']


In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

## **Rose Balancing Method**
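
The cells in this section balance the classes with `RandomOverSampler`, which duplicates randomly chosen minority samples until every class has as many samples as the largest one. The sketch below reproduces that idea without any library; `random_oversample` and the toy data are assumptions for illustration, not the imbalanced-learn implementation.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly drawn samples of the smaller classes
    until all classes match the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))  # sampling with replacement
            y_out.append(label)
    return X_out, y_out

# Hypothetical imbalanced data: four samples of class 0, one of class 1
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Since no new values are created, the added rows are exact copies of existing minority samples, which is the main contrast with SMOTE and ADASYN.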

In [None]:
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import RandomOverSampler

In [None]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y2)

### Leave-One-Out

In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split, LeaveOneOut, GridSearchCV
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Define function to compute learning curves with ROSE
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply ROSE to the training data
        ros = RandomOverSampler(random_state=42)
        X_resampled, y_resampled = ros.fit_resample(X_train_cv, y_train_cv)

        model.fit(X_resampled, y_resampled)
        train_pred = model.predict(X_resampled)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_resampled, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=loo,
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply ROSE to the full training data and refit the best model
        ros = RandomOverSampler(random_state=42)
        X_resampled_full, y_resampled_full = ros.fit_resample(X_train_full, y_train_full)
        best_model.fit(X_resampled_full, y_resampled_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply ROSE to the full training data
        ros = RandomOverSampler(random_state=42)
        X_resampled_full, y_resampled_full = ros.fit_resample(X_train_full, y_train_full)
        model.fit(X_resampled_full, y_resampled_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()



In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y2)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply ROSE (RandomOverSampler) to training data
ros = RandomOverSampler(random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost':xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")

## **SMOTE Balancing Method**

Here you can find the best combination for CK: **SMOTE and XGBoost**
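
Combinations in this notebook are compared on weighted metrics (`scoring='f1_weighted'` and `average='weighted'`). As a reminder of what that number means, the sketch below computes a support-weighted F1 directly from a confusion matrix laid out like scikit-learn's (rows are true classes, columns are predicted classes). The function `weighted_f1` and the example matrix are made up for illustration and are not results from this study.

```python
def weighted_f1(conf_matrix):
    """Support-weighted mean of per-class F1 scores.
    conf_matrix[i][j] = number of samples of true class i predicted as class j."""
    n = len(conf_matrix)
    total = sum(sum(row) for row in conf_matrix)
    score = 0.0
    for c in range(n):
        tp = conf_matrix[c][c]
        support = sum(conf_matrix[c])                      # true samples of class c
        predicted = sum(conf_matrix[r][c] for r in range(n))  # samples predicted as c
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * support / total
    return score

# Hypothetical 2x2 confusion matrix
cm = [[5, 1],
      [2, 3]]
print(round(weighted_f1(cm), 4))
```

Weighting by support means the larger class dominates the score, which is worth keeping in mind when the test split is itself imbalanced.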

In [None]:
from imblearn.over_sampling import ADASYN, SMOTE

### Leave-One-Out

In [None]:
from sklearn.model_selection import LeaveOneOut, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import numpy as np

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Define function to compute learning curves with SMOTE
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    loo = LeaveOneOut()

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply SMOTE to the training data
        smote = SMOTE(random_state=42)
        X_resampled, y_resampled = smote.fit_resample(X_train_cv, y_train_cv)

        model.fit(X_resampled, y_resampled)
        train_pred = model.predict(X_resampled)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_resampled, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=loo,
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply SMOTE to the full training data and refit the best model
        smote = SMOTE(random_state=42)
        X_resampled_full, y_resampled_full = smote.fit_resample(X_train_full, y_train_full)
        best_model.fit(X_resampled_full, y_resampled_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply SMOTE to the full training data
        smote = SMOTE(random_state=42)
        X_resampled_full, y_resampled_full = smote.fit_resample(X_train_full, y_train_full)
        model.fit(X_resampled_full, y_resampled_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()




In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier  # needed for XGBClassifier() below

# Encode target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y2)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply SMOTE to training data
smote = SMOTE(random_state=42)
x_resampled, y_resampled = smote.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost':xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")
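
The hold-out cells call `precision_score`, `recall_score` and `f1_score` with scikit-learn's default `average='binary'`, which scores only the positive class (label 1), whereas the leave-one-out cells pass `average='weighted'`. The two sets of numbers are therefore not directly comparable. The NumPy sketch below illustrates the difference for F1 on made-up labels; `per_class_f1` is a hypothetical helper written for this note, not part of scikit-learn.

```python
import numpy as np

def per_class_f1(y_true, y_pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

# 'binary' style: positive class (label 1) only
f1_binary = per_class_f1(y_true, y_pred, cls=1)

# 'weighted' style: per-class F1 averaged with class support as weights
classes, support = np.unique(y_true, return_counts=True)
f1_weighted = sum(
    per_class_f1(y_true, y_pred, c) * n for c, n in zip(classes, support)
) / support.sum()

print(f1_binary, f1_weighted)
```

When comparing the hold-out and leave-one-out results, it may be worth passing the same `average` argument in both places.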



In [None]:
import matplotlib.pyplot as plt
import numpy as np

from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y2)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply random oversampling (RandomOverSampler) to training data
ros = RandomOverSampler(random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost': xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")

    # Get variable importance for XGBoost and create a plot
    if name == 'XGBoost':
        variable_importance = clf.feature_importances_
        print(f"\nVariable Importance for {name}:")
        for idx, importance in enumerate(variable_importance):
            print(f"Feature {idx}: {importance}")

        # Get feature names
        feature_names = df_filtered.columns.tolist()  # Assumes df_filtered's columns correspond, in order, to the columns of X_projected

        # Plot variable importances with feature names
        plt.figure(figsize=(10, 6))
        bars = plt.barh(feature_names, variable_importance, color='skyblue')
        plt.xlabel('Importance')
        #plt.ylabel('Feature')
        #plt.title('Variable Importance for Random Forest Classifier')

        # Add text annotations
        '''for idx, bar in enumerate(bars):
            plt.text(bar.get_width(), bar.get_y() + bar.get_height() / 2, round(variable_importance[idx], 2),
                     va='center', ha='left')'''

        plt.tight_layout()
        sns.axes_style("whitegrid")
        sns.despine(left=True, bottom=True)
        plt.show()


## **ADASYN Balancing Method**
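
ADASYN differs from SMOTE mainly in where it generates synthetic samples: minority points with more majority-class neighbours receive a larger share of the new samples. The NumPy sketch below computes only that allocation step on toy data. `adasyn_allocation` and its parameters are hypothetical names for illustration, not imbalanced-learn's implementation.

```python
import numpy as np

def adasyn_allocation(X, y, minority=1, k=3):
    """Share of synthetic samples each minority point would receive under ADASYN-style weighting."""
    min_idx = np.flatnonzero(y == minority)
    ratios = np.empty(len(min_idx))
    for pos, i in enumerate(min_idx):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                               # exclude the point itself
        nn = np.argsort(d)[:k]                      # k nearest neighbours in the full data
        ratios[pos] = np.mean(y[nn] != minority)    # fraction of majority neighbours
    total = ratios.sum()
    # If no minority point has majority neighbours, fall back to equal shares
    if total == 0:
        return np.full(len(min_idx), 1 / len(min_idx))
    return ratios / total

# Toy 1-D data: one minority point (0.35) sits inside the majority cluster
X_toy = np.array([[0.0], [0.1], [0.2], [0.3], [0.35], [3.0], [3.1], [3.2]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

shares = adasyn_allocation(X_toy, y_toy)
print(shares)
```

In this toy case the minority point surrounded by majority samples gets the whole allocation, which is the behaviour that distinguishes ADASYN from uniform oversampling.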

### Leave-One-Out

In [None]:
from sklearn.model_selection import train_test_split, LeaveOneOut, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import ADASYN
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter

def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    loo = LeaveOneOut()

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        model.fit(X_train_cv, y_train_cv)
        train_pred = model.predict(X_train_cv)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_train_cv, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies


# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=LeaveOneOut(),
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_resampled_adasyn_final, y_resampled_adasyn_final)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Evaluate on the test set
        y_test_pred = best_model.predict(x_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_resampled_adasyn_final, y_resampled_adasyn_final)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        model.fit(X_resampled_adasyn_final, y_resampled_adasyn_final)

        y_test_pred = model.predict(x_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")


        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_resampled_adasyn_final, y_resampled_adasyn_final)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()


In [None]:
import shap
import pandas as pd
import numpy as np

# Assuming best_model is the best XGBoost model obtained from GridSearchCV
# and X_resampled_full is the resampled training data used for fitting best_model

# Define SHAP explainer for XGBoost
explainer = shap.Explainer(best_model)

# Compute SHAP values for the test set
shap_values = explainer(x_test)

# Print the shape of SHAP values for debugging
print(f"SHAP values shape: {shap_values.shape}")

# Take feature names from df_filtered (assumed to match the columns of x_test)
feature_names = df_filtered.columns.tolist()

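# Mean absolute SHAP value per feature, divided by the total absolute SHAP value over all
# samples and features. The results sum to 1 / n_samples, so they rank features but are not shares of 1.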
mean_shap_values = np.mean(np.abs(shap_values.values), axis=0) / np.sum(np.abs(shap_values.values))

# Ensure lengths match
if len(feature_names) != len(mean_shap_values):
    print(f"Length mismatch: feature_names ({len(feature_names)}) != mean_shap_values ({len(mean_shap_values)})")
    raise ValueError("Length mismatch")

# Ensure 1D arrays
feature_names = np.array(feature_names).flatten()
mean_shap_values = np.array(mean_shap_values).flatten()

# Create a DataFrame to show feature importances
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': mean_shap_values
})

# Sort the DataFrame by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print feature importances
print("\nFeature Importances using SHAP for Best XGBoost Model:")
print(importance_df)

# Save the importance DataFrame to a file
importance_df.to_csv('feature_importances_CK.csv', index=False)

# Plot SHAP summary plot
shap.summary_plot(shap_values, x_test)
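
If the goal is a relative importance that sums to 1 across features, one option is to divide each feature's mean absolute SHAP value by the sum of those means. The sketch below uses a small made-up array (`toy_shap`, a hypothetical stand-in for `shap_values.values`) and contrasts that normalization with the one used in the cell above; the numbers are illustrative only.

```python
import numpy as np

# Hypothetical SHAP matrix: 3 samples x 2 features
toy_shap = np.array([[0.2, -0.1],
                     [-0.4, 0.1],
                     [0.3, -0.1]])

mean_abs = np.abs(toy_shap).mean(axis=0)    # mean |SHAP| per feature
relative = mean_abs / mean_abs.sum()        # shares that sum to 1

# Normalization used in the cell above: divide by the grand total of |SHAP|
cell_style = mean_abs / np.abs(toy_shap).sum()

print(relative, cell_style)
```

Both versions rank the features identically; they differ only by a constant factor equal to the number of samples.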


In [None]:
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier  # needed for XGBClassifier() below

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y2)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=4)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()


# Apply ADASYN to training data
adasyn = ADASYN(random_state=4)
x_resampled, y_resampled = adasyn.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost': xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")



# **Ranking CMJ Prediction**

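This section repeats the pipeline for the CMJ ranking target: it bins `Ranking_CMJ` into two classes and compares each classifier under each balancing method to find the best combination for CMJ.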

In [None]:
bin_edges = [0, 21, 41]  # with include_lowest=True: [0, 21] -> 'top_20', (21, 41] -> 'next_20'

# Define the bin labels
bin_labels = ['top_20', 'next_20']

reviews3 = data['Reviews_CMJ'] = pd.cut(data['Ranking_CMJ'], bins=bin_edges, labels=bin_labels, include_lowest=True)
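
To see which ranks land in each bin, the sketch below applies the same `bin_edges` and `bin_labels` to a handful of made-up rank values (the `ranks` series is hypothetical, not taken from the dataset). Because `pd.cut` uses right-closed intervals by default, rank 21 falls in the first bin and rank 22 in the second.

```python
import pandas as pd

bin_edges = [0, 21, 41]
bin_labels = ['top_20', 'next_20']

# Hypothetical rank values chosen to probe the bin boundaries
ranks = pd.Series([1, 20, 21, 22, 41])

binned = pd.cut(ranks, bins=bin_edges, labels=bin_labels, include_lowest=True)
boundary_check = dict(zip(ranks.tolist(), binned.astype(str).tolist()))
print(boundary_check)
```

This matches the 1 to 21 and 22 to 41 ranges used in the explicit loop in the next cell, so the label `top_20` actually covers 21 ranks.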

In [None]:
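# Encode the CMJ ranking as string class labels: '0' for ranks 1-21, '1' for ranks 22-41.
# Ranks outside 1-41 are skipped, which would make the list shorter than the DataFrame.
reviews = []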
for i in data['Ranking_CMJ']:
    if i >= 1 and i <= 21:
        reviews.append('0')
    elif i >= 22 and i <= 41:
        reviews.append('1')

data['Reviews_CMJ'] = reviews

In [None]:
y3 = data['Reviews_CMJ']
y3

In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

In [None]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y3)

## **ROSE Balancing Method**
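
The cells in this section use imbalanced-learn's `RandomOverSampler`, which balances the classes by resampling minority rows with replacement. As a rough illustration of that idea only, here is a NumPy sketch; `random_oversample` is a hypothetical helper written for this note, not the library's implementation.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate rows of smaller classes (with replacement) until every class matches the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            extra = rng.choice(idx, size=target - count, replace=True)
            X_parts.append(X[extra])
            y_parts.append(y[extra])
    return np.vstack(X_parts), np.concatenate(y_parts)

# Toy data: 4 samples of class 0, 2 samples of class 1
X_toy = np.arange(12, dtype=float).reshape(6, 2)
y_toy = np.array([0, 0, 0, 0, 1, 1])

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))
```

Because the added rows are exact copies, oversampling should be applied to training folds only, as the leave-one-out function in this section does.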

### Leave-One-Out

In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split, LeaveOneOut, GridSearchCV
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import numpy as np

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Define function to compute learning curves with ROSE
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply ROSE to the training data
        ros = RandomOverSampler(random_state=42)
        X_resampled, y_resampled = ros.fit_resample(X_train_cv, y_train_cv)

        model.fit(X_resampled, y_resampled)
        train_pred = model.predict(X_resampled)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_resampled, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=loo,
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply ROSE to the full training data and refit the best model
        ros = RandomOverSampler(random_state=42)
        X_resampled_full, y_resampled_full = ros.fit_resample(X_train_full, y_train_full)
        best_model.fit(X_resampled_full, y_resampled_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply ROSE to the full training data
        ros = RandomOverSampler(random_state=42)
        X_resampled_full, y_resampled_full = ros.fit_resample(X_train_full, y_train_full)
        model.fit(X_resampled_full, y_resampled_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()



In [None]:
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import RandomOverSampler

In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier  # needed for XGBClassifier() below

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y3)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply random oversampling (RandomOverSampler) to training data
ros = RandomOverSampler(random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost':xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y3)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply random oversampling (RandomOverSampler) to training data
ros = RandomOverSampler(random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost': xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")

    # Get variable importance for Random Forest and create a plot
    if name == 'Random Forest':
        variable_importance = clf.feature_importances_
        print(f"\nVariable Importance for {name}:")
        for idx, importance in enumerate(variable_importance):
            print(f"Feature {idx}: {importance}")

        # Get feature names
        feature_names = df_filtered.columns.tolist()  # Assumes df_filtered's columns correspond, in order, to the columns of X_projected

        # Plot variable importances with feature names
        plt.figure(figsize=(10, 6))
        bars = plt.barh(feature_names, variable_importance, color='skyblue')
        plt.xlabel('Importance')
        #plt.ylabel('Feature')
        #plt.title('Variable Importance for Random Forest Classifier')

        # Add text annotations
        '''for idx, bar in enumerate(bars):
            plt.text(bar.get_width(), bar.get_y() + bar.get_height() / 2, round(variable_importance[idx], 2),
                     va='center', ha='left')'''

        plt.tight_layout()
        sns.axes_style("whitegrid")
        sns.despine(left=True, bottom=True)
        plt.show()


## **SMOTE Balancing Method**
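
SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that single step in NumPy. `smote_like_samples` and its parameters are hypothetical and much simpler than imbalanced-learn's `SMOTE`.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=42):
    """Generate n_new points on segments between minority samples and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X_min))              # pick a base sample
        nb = X_min[rng.choice(neighbours[i])]     # pick one of its neighbours
        gap = rng.random()                        # interpolation factor in [0, 1)
        synthetic[j] = X_min[i] + gap * (nb - X_min[i])
    return synthetic

# Toy minority class with three points
X_minority = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 1.0]])
new_points = smote_like_samples(X_minority, n_new=4)
print(new_points)
```

Every synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.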

### Leave-One-Out

In [None]:
from sklearn.model_selection import train_test_split, LeaveOneOut, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import numpy as np

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Define function to compute learning curves with SMOTE
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    loo = LeaveOneOut()

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply SMOTE to the training data
        smote = SMOTE(random_state=42)
        X_resampled, y_resampled = smote.fit_resample(X_train_cv, y_train_cv)

        model.fit(X_resampled, y_resampled)
        train_pred = model.predict(X_resampled)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_resampled, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=loo,
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply SMOTE to the full training data and refit the best model
        smote = SMOTE(random_state=42)
        X_resampled_full, y_resampled_full = smote.fit_resample(X_train_full, y_train_full)
        best_model.fit(X_resampled_full, y_resampled_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply SMOTE to the full training data
        smote = SMOTE(random_state=42)
        X_resampled_full, y_resampled_full = smote.fit_resample(X_train_full, y_train_full)
        model.fit(X_resampled_full, y_resampled_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()




In [None]:
from imblearn.over_sampling import ADASYN, SMOTE

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier  # needed for XGBClassifier() below

# Encode target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y3)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply SMOTE to training data
smote = SMOTE(random_state=42)
x_resampled, y_resampled = smote.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost':xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")



## **ADASYN Balancing Method**

### Leave-One-Out
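
In leave-one-out cross-validation each sample serves once as the validation set, so every per-fold validation accuracy is either 0 or 1 and the overall estimate is their mean. The sketch below shows this on toy data, with a hypothetical nearest-centroid rule (`nearest_centroid_predict`) standing in for the classifiers used in this notebook.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x):
    """Predict the class whose training centroid is closest to x."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(centroids - x, axis=1))]

# Toy, well-separated data (illustrative only)
X_toy = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

fold_hits = []
for i in range(len(X_toy)):
    train_mask = np.arange(len(X_toy)) != i       # hold out sample i
    pred = nearest_centroid_predict(X_toy[train_mask], y_toy[train_mask], X_toy[i])
    fold_hits.append(int(pred == y_toy[i]))       # 0 or 1 for this fold

loo_accuracy = float(np.mean(fold_hits))
print(fold_hits, loo_accuracy)
```

Because each fold's validation score is binary, a line plot of per-fold values mostly shows which individual samples were misclassified; the mean is the usual summary.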

In [None]:
# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

RANDOM_SEED = 42
classifiers = {
    'Random Forest': RandomForestClassifier(random_state=RANDOM_SEED),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(random_state=RANDOM_SEED),
    'AdaBoost': AdaBoostClassifier(random_state=RANDOM_SEED),
    'XGBoost': XGBClassifier(random_state=RANDOM_SEED)
}

In [None]:
from sklearn.model_selection import train_test_split, LeaveOneOut, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import ADASYN
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter

def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    loo = LeaveOneOut()

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        model.fit(X_train_cv, y_train_cv)
        train_pred = model.predict(X_train_cv)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_train_cv, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies
# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=LeaveOneOut(),
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_resampled_adasyn_final, y_resampled_adasyn_final)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Evaluate on the test set
        y_test_pred = best_model.predict(x_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_resampled_adasyn_final, y_resampled_adasyn_final)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        model.fit(X_resampled_adasyn_final, y_resampled_adasyn_final)

        y_test_pred = model.predict(x_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")


        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_resampled_adasyn_final, y_resampled_adasyn_final)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

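With `cv=LeaveOneOut()`, every validation fold holds a single sample, so the per-fold `f1_weighted` score that `GridSearchCV` averages can only be 0 or 1, and its mean coincides with plain accuracy. A minimal sketch on synthetic data follows; the dataset and the `LogisticRegression` model are stand-ins chosen for speed and are not part of this notebook. Pooling the out-of-fold predictions with `cross_val_predict` is one possible way to obtain an F1 score computed over all samples at once.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict, cross_val_score

# Synthetic stand-in data (assumption: not the athlete dataset)
X_demo, y_demo = make_classification(n_samples=40, n_features=5,
                                     weights=[0.7, 0.3], random_state=0)
clf_demo = LogisticRegression(max_iter=1000)
loo_demo = LeaveOneOut()

with warnings.catch_warnings():
    # Single-sample folds trigger "ill-defined F-score" warnings; silence them here
    warnings.simplefilter("ignore")
    f1_per_fold = cross_val_score(clf_demo, X_demo, y_demo, cv=loo_demo, scoring='f1_weighted')
    acc_per_fold = cross_val_score(clf_demo, X_demo, y_demo, cv=loo_demo, scoring='accuracy')

# Pool the held-out predictions, then compute one F1 score over all samples
pooled_pred = cross_val_predict(clf_demo, X_demo, y_demo, cv=loo_demo)
pooled_f1 = f1_score(y_demo, pooled_pred, average='weighted')

print("Distinct per-fold F1 values:", np.unique(f1_per_fold))
print("Mean per-fold F1:", f1_per_fold.mean(), "| Mean accuracy:", acc_per_fold.mean())
print("Pooled weighted F1:", pooled_f1)
```

The same reasoning applies to the "learning curve" plots above: each validation point is the accuracy on one sample, so the validation line alternates between 0 and 1.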

In [None]:
import shap
import pandas as pd
import numpy as np

# Assuming best_model is the best XGBoost model obtained from GridSearchCV
# and X_resampled_full is the resampled training data used for fitting best_model

# Define SHAP explainer for XGBoost
explainer = shap.Explainer(best_model)

# Compute SHAP values for the test set
shap_values = explainer(x_test)

# Print the shape of SHAP values for debugging
print(f"SHAP values shape: {shap_values.shape}")

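# Extract feature names from df_filtered (assumed to hold the columns of x_test, in the same order)
feature_names = df_filtered.columns.tolist()

# Mean absolute SHAP value per feature, divided by the total absolute SHAP value
# over all samples and features (the resulting importances sum to 1 / n_samples)
mean_shap_values = np.mean(np.abs(shap_values.values), axis=0) / np.sum(np.abs(shap_values.values))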

# Ensure lengths match
if len(feature_names) != len(mean_shap_values):
    print(f"Length mismatch: feature_names ({len(feature_names)}) != mean_shap_values ({len(mean_shap_values)})")
    raise ValueError("Length mismatch")

# Ensure 1D arrays
feature_names = np.array(feature_names).flatten()
mean_shap_values = np.array(mean_shap_values).flatten()

# Create a DataFrame to show feature importances
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': mean_shap_values
})

# Sort the DataFrame by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print feature importances
print("\nFeature Importances using SHAP for Best XGBoost Model:")
print(importance_df)

# Save the importance DataFrame to a file
importance_df.to_csv('feature_importances_CMJ.csv', index=False)

# Plot SHAP summary plot
shap.summary_plot(shap_values, x_test)

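The importance column above divides each feature's mean absolute SHAP value by the sum of absolute SHAP values over all samples and features, so the column sums to 1 / n_samples rather than 1. The sketch below reproduces that arithmetic on a small made-up array (`demo_values` stands in for `shap_values.values`) and shows one possible alternative that yields shares summing to 1. Which scaling is preferable depends on how the importances are reported.

```python
import numpy as np

# Made-up SHAP-like matrix: 2 samples x 3 features (stand-in for shap_values.values)
demo_values = np.array([[0.2, -0.1, 0.7],
                        [-0.4, 0.3, 0.3]])

# Per-feature mean of the absolute values
mean_abs = np.mean(np.abs(demo_values), axis=0)

# Scaling used in the cell above: divide by the total absolute SHAP value
scaled_as_above = mean_abs / np.sum(np.abs(demo_values))

# Alternative (assumption, not the notebook's choice): divide by the sum of the means
shares = mean_abs / mean_abs.sum()

print("mean |SHAP|:     ", mean_abs)
print("scaled as above: ", scaled_as_above, "sum =", scaled_as_above.sum())
print("shares:          ", shares, "sum =", shares.sum())
```

Both scalings preserve the ranking of the features; only the absolute numbers differ.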

In [None]:
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y3)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=4)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()


# Apply ADASYN to training data
adasyn = ADASYN(random_state=4)
x_resampled, y_resampled = adasyn.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost': xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")



# **Ranking DJ Prediction**

In [None]:
bin_edges = [0, 21, 41]  # [0, 21] -> 'top_20' (ranks 1-21), (21, 41] -> 'next_20' (ranks 22-41)

# Define the bin labels
bin_labels = ['top_20', 'next_20']

reviews3 = data['Reviews_DJ'] = pd.cut(data['Ranking_DJ'], bins=bin_edges, labels=bin_labels, include_lowest=True)

In [None]:
reviews = []
for i in data['Ranking_DJ']:
    if i >= 1 and i <= 21:
        reviews.append('0')
    elif i >= 22 and i <= 41:
        reviews.append('1')

data['Reviews_DJ'] = reviews

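The loop above assigns '0' to ranks 1 to 21 and '1' to ranks 22 to 41. As a hedged sketch, the same labels can be produced in one vectorized call with `pd.cut`, assuming every rank is an integer between 1 and 41. The `demo` frame below is illustrative; the real column is `data['Ranking_DJ']`.

```python
import pandas as pd

# Illustrative ranks 1..41 (assumption: the real column holds integer ranks in this range)
demo = pd.DataFrame({'Ranking_DJ': range(1, 42)})

# Loop version, as in the cell above
loop_labels = []
for rank in demo['Ranking_DJ']:
    if 1 <= rank <= 21:
        loop_labels.append('0')
    elif 22 <= rank <= 41:
        loop_labels.append('1')

# Vectorized version: bins (0, 21] -> '0' and (21, 41] -> '1'
cut_labels = pd.cut(demo['Ranking_DJ'], bins=[0, 21, 41], labels=['0', '1'])

demo['Reviews_DJ'] = cut_labels.astype(str)
print(demo['Reviews_DJ'].value_counts().sort_index())
```

Unlike the loop, `pd.cut` returns NaN for out-of-range values instead of skipping them, so the result always has one entry per row and a length mismatch cannot occur on assignment.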
In [None]:
y4 = data['Reviews_DJ']

In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

In [None]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y4)

## **ROSE Balancing Method**

### Leave-One-Out

In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split, LeaveOneOut, GridSearchCV
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import numpy as np

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Define function to compute learning curves with ROSE
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply random oversampling to the training fold only (RandomOverSampler stands in for ROSE here)
        ros = RandomOverSampler(random_state=42)
        X_resampled, y_resampled = ros.fit_resample(X_train_cv, y_train_cv)

        model.fit(X_resampled, y_resampled)
        train_pred = model.predict(X_resampled)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_resampled, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=loo,
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply ROSE to the full training data and refit the best model
        ros = RandomOverSampler(random_state=42)
        X_resampled_full, y_resampled_full = ros.fit_resample(X_train_full, y_train_full)
        best_model.fit(X_resampled_full, y_resampled_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply ROSE to the full training data
        ros = RandomOverSampler(random_state=42)
        X_resampled_full, y_resampled_full = ros.fit_resample(X_train_full, y_train_full)
        model.fit(X_resampled_full, y_resampled_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

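The cell above uses imblearn's `RandomOverSampler` with default settings under the ROSE label. Plain random oversampling repeats existing minority rows, whereas ROSE (Random Over-Sampling Examples) draws new points from a smoothed bootstrap around them. The NumPy sketch below illustrates that difference by hand; it is a simplified illustration with a hypothetical fixed noise scale `noise_scale` and toy data, not imblearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced data: 6 majority rows (class 0), 2 minority rows (class 1)
X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
                  [0.2, 0.4], [0.4, 0.1], [2.0, 2.1], [2.2, 1.9]])
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])

minority_rows = X_toy[y_toy == 1]
n_needed = np.sum(y_toy == 0) - np.sum(y_toy == 1)  # rows needed to balance the classes

# Plain random oversampling: draw existing minority rows with replacement
picks = rng.integers(0, len(minority_rows), size=n_needed)
duplicated = minority_rows[picks]

# Smoothed bootstrap (ROSE-like): same draws plus small Gaussian noise
noise_scale = 0.05  # hypothetical smoothing width chosen for illustration
smoothed = duplicated + rng.normal(0.0, noise_scale, size=duplicated.shape)

X_plain = np.vstack([X_toy, duplicated])
X_smooth = np.vstack([X_toy, smoothed])
y_balanced = np.concatenate([y_toy, np.ones(n_needed, dtype=int)])

print("Class counts after balancing:", np.bincount(y_balanced))
print("Unique rows (plain):   ", len(np.unique(X_plain, axis=0)))
print("Unique rows (smoothed):", len(np.unique(X_smooth, axis=0)))
```

Recent imblearn releases expose a `shrinkage` parameter on `RandomOverSampler` for a smoothed bootstrap; check the installed version's documentation before relying on it.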


In [None]:
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import RandomOverSampler

In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y4)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply random oversampling to training data
ros = RandomOverSampler(random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost':xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y4)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply random oversampling to training data
ros = RandomOverSampler(random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost': xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")

    # Get variable importance for AdaBoost and create a plot
    if name == 'AdaBoost':
        variable_importance = clf.feature_importances_
        print(f"\nVariable Importance for {name}:")
        for idx, importance in enumerate(variable_importance):
            print(f"Feature {idx}: {importance}")

        # Get feature names
        feature_names = df_filtered.columns.tolist()  # Assuming df_filtered holds the feature columns behind X_projected

        # Plot variable importances with feature names
        plt.figure(figsize=(10, 6))
        bars = plt.barh(feature_names, variable_importance, color='skyblue')
        plt.xlabel('Importance')
        #plt.ylabel('Feature')
        #plt.title('Variable Importance for Random Forest Classifier')

        # Add text annotations
        '''for idx, bar in enumerate(bars):
            plt.text(bar.get_width(), bar.get_y() + bar.get_height() / 2, round(variable_importance[idx], 2),
                     va='center', ha='left')'''

        plt.tight_layout()
        sns.axes_style("whitegrid")
        sns.despine(left=True, bottom=True)
        plt.show()


## **SMOTE Balancing Method**

### Leave-One-Out

In [None]:
from sklearn.model_selection import LeaveOneOut, GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import SMOTE, RandomOverSampler  # Import SMOTE and RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
import numpy as np

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Define function to compute learning curves with SMOTE
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    loo = LeaveOneOut()

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply SMOTE to the training data
        smote = SMOTE(random_state=42)
        X_resampled, y_resampled = smote.fit_resample(X_train_cv, y_train_cv)

        model.fit(X_resampled, y_resampled)
        train_pred = model.predict(X_resampled)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_resampled, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=loo,
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply SMOTE to the full training data and refit the best model
        smote = SMOTE(random_state=42)
        X_resampled_full, y_resampled_full = smote.fit_resample(X_train_full, y_train_full)
        best_model.fit(X_resampled_full, y_resampled_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply SMOTE to the full training data
        smote = SMOTE(random_state=42)
        X_resampled_full, y_resampled_full = smote.fit_resample(X_train_full, y_train_full)
        model.fit(X_resampled_full, y_resampled_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()




In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.inspection import permutation_importance
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Check for negative values and, if any are found, Min-Max scale both sets.
# The scaler is fitted on the training data only and reused for the test data,
# so both sets share one scale and no test information leaks into training.
def preprocess_data(X_train, X_test):
    from sklearn.preprocessing import MinMaxScaler
    if (np.asarray(X_train) < 0).any() or (np.asarray(X_test) < 0).any():
        print("Warning: Negative values found. Applying Min-Max scaling.")
        scaler = MinMaxScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)
    return X_train, X_test

# X_train_full, X_test, y_train_full and y_test are expected from an earlier split, for example:
# X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Define GaussianNB model
model = GaussianNB()

# Preprocess the data
X_train_full, X_test = preprocess_data(X_train_full, X_test)

# Train the model
model.fit(X_train_full, y_train_full)

# Compute permutation importance
def compute_permutation_importance(model, X_test, y_test):
    # Compute permutation importance
    results = permutation_importance(model, X_test, y_test, scoring='f1_weighted', n_repeats=10, random_state=42)
    importance_scores = results.importances_mean

    # Create a DataFrame to show feature importances
    feature_names = df_filtered.columns.tolist()
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importance_scores
    })

    return importance_df

# Compute and print feature importances
importance_df = compute_permutation_importance(model, X_test, y_test)

print("\nFeature Importances using Permutation Importance for GaussianNB:")
print(importance_df)

# Plot feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Permutation Importance for GaussianNB')
plt.show()

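Permutation importance measures how much a score drops when one feature column is shuffled, which makes it usable for models such as `GaussianNB` that expose no `feature_importances_`. A minimal sketch on synthetic data follows; the dataset, the column names `f0` to `f3`, and the split are illustrative assumptions. It also sorts the result so that a bar plot would list the most influential features first.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data with hypothetical feature names
X_syn, y_syn = make_classification(n_samples=200, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=0)
names_syn = ['f0', 'f1', 'f2', 'f3']
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

nb_syn = GaussianNB().fit(X_tr, y_tr)

# Shuffle each column of the held-out set 10 times and record the score drop
result = permutation_importance(nb_syn, X_te, y_te, scoring='f1_weighted',
                                n_repeats=10, random_state=42)

importance_syn = pd.DataFrame({
    'Feature': names_syn,
    'Importance': result.importances_mean,
    'Std': result.importances_std,
}).sort_values(by='Importance', ascending=False)

print(importance_syn)
```

Reporting `importances_std` next to the mean helps judge whether a small importance is distinguishable from shuffling noise, which matters on a test set as small as the one used here.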

In [None]:
from imblearn.over_sampling import ADASYN, SMOTE

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Encode target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y4)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply SMOTE to training data
smote = SMOTE(random_state=42)
x_resampled, y_resampled = smote.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost':xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")



## **ADASYN Balancing Method**

### Leave-One-Out

In [None]:
from sklearn.model_selection import train_test_split, LeaveOneOut, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import ADASYN
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter

# Check the class distribution in the dataset
print("Original class distribution:", Counter(y_train_full))

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Define function to compute learning curves with ADASYN
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    loo = LeaveOneOut()

    for fold, (train_index, val_index) in enumerate(loo.split(X), start=1):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Default to the original fold data; swap in resampled data only if ADASYN succeeds
        X_fit, y_fit = X_train_cv, y_train_cv
        if np.unique(y_train_cv).size > 1:  # ADASYN needs more than one class
            adasyn = ADASYN(random_state=42, sampling_strategy='auto')
            try:
                X_resampled, y_resampled = adasyn.fit_resample(X_train_cv, y_train_cv)
                if len(y_resampled) > len(y_train_cv):  # Ensure resampling occurred
                    X_fit, y_fit = X_resampled, y_resampled
                else:
                    print(f"Warning: No samples generated for fold {fold}, using original data")
            except ValueError as e:
                print(f"Warning: {e} for fold {fold}, using original data")

        model.fit(X_fit, y_fit)
        train_pred = model.predict(X_fit)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_fit, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=LeaveOneOut(),
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply ADASYN to the full training data and refit the best model
        adasyn = ADASYN(random_state=42, sampling_strategy='auto')  # Set sampling_strategy to 'auto'
        try:
            X_resampled_full, y_resampled_full = adasyn.fit_resample(X_train_full, y_train_full)
            if len(y_resampled_full) > len(y_train_full):  # Ensure resampling occurred
                best_model.fit(X_resampled_full, y_resampled_full)
            else:
                print(f"Warning: No samples generated, using original data")
                best_model.fit(X_train_full, y_train_full)
        except ValueError as e:
            print(f"Warning: {e}, using original data")
            best_model.fit(X_train_full, y_train_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply ADASYN to the full training data
        adasyn = ADASYN(random_state=42, sampling_strategy='auto')  # Set sampling_strategy to 'auto'
        try:
            X_resampled_full, y_resampled_full = adasyn.fit_resample(X_train_full, y_train_full)
            if len(y_resampled_full) > len(y_train_full):  # Ensure resampling occurred
                model.fit(X_resampled_full, y_resampled_full)
            else:
                print(f"Warning: No samples generated, using original data")
                model.fit(X_train_full, y_train_full)
        except ValueError as e:
            print(f"Warning: {e}, using original data")
            model.fit(X_train_full, y_train_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()


In [None]:
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier  # needed for XGBClassifier() below

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y4)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=12)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()


# Apply ADASYN to training data
adasyn = ADASYN(random_state=12)
x_resampled, y_resampled = adasyn.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost': xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")



# **Ranking DJ Contact Time Prediction**

This section looks for the best combination of classifier and balancing method for predicting the DJ contact time ranking class.
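
The ranking is turned into a two-class target by cutting it at fixed edges. As a minimal sketch of that rule, assuming right-closed bins (the default behaviour of `pd.cut`, with `include_lowest=True` also closing the first bin on the left), the standard-library helper below maps a rank to its label. The name `label_for_rank` is hypothetical and is not part of the notebook.

```python
from bisect import bisect_left

# Edges and labels mirror the binning cell of this section;
# label_for_rank is a hypothetical helper used only for illustration.
BIN_EDGES = [0, 21, 41]
BIN_LABELS = ['top_20', 'next_20']

def label_for_rank(rank):
    """Return the label of the right-closed bin containing rank, or None if it is out of range."""
    if rank < BIN_EDGES[0] or rank > BIN_EDGES[-1]:
        return None
    # Index of the first upper edge that is >= rank
    idx = bisect_left(BIN_EDGES[1:], rank)
    return BIN_LABELS[idx]

print(label_for_rank(1), label_for_rank(21), label_for_rank(22), label_for_rank(41))
```

Under these edges rank 21 falls in `top_20` and rank 22 in `next_20`, so the first class spans 21 ranks when rankings start at 1.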

In [None]:
bin_edges = [0, 21, 41]  # right-closed bins: ranks 0-21 -> 'top_20', ranks 22-41 -> 'next_20'

# Define the bin labels
bin_labels = ['top_20', 'next_20']

reviews3 = data['Reviews_DJCont'] = pd.cut(data['Ranking_DJCont'], bins=bin_edges, labels=bin_labels, include_lowest=True)

In [None]:
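# Re-label the rankings as string classes: '0' for ranks 1-21, '1' for ranks 22-41.
# This overwrites the categorical 'Reviews_DJCont' column created in the previous cell.
reviews = []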
for i in data['Ranking_DJCont']:
    if i >= 1 and i <= 21:
        reviews.append('0')
    elif i >= 22 and i <= 41:
        reviews.append('1')

data['Reviews_DJCont'] = reviews

In [None]:
y5 = data['Reviews_DJCont']

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_projected, y5, test_size = 0.25, random_state=42)

In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

In [None]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y5)

## **Rose Balancing Method**

### Leave-One-Out
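
Leave-one-out cross-validation holds out one sample per fold and trains on the rest. Any oversampling has to happen inside the fold, on the training part only, so that the held-out sample never influences the resampled data. The sketch below illustrates that order of operations with the standard library only. `loo_indices` and `oversample_minority` are hypothetical names, the toy values are made up, and plain duplication of minority samples stands in for the imbalanced-learn sampler used in this notebook.

```python
import random
from collections import Counter

def loo_indices(n):
    """Yield (train_indices, val_index) pairs, one per sample."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i

def oversample_minority(X, y, seed=42):
    """Duplicate randomly chosen samples of the smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

# Toy data: four samples of class 0 and two of class 1 (illustrative values only)
X_toy = [[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]]
y_toy = [0, 0, 0, 0, 1, 1]

fold_counts = []
for train_idx, val_idx in loo_indices(len(y_toy)):
    X_tr = [X_toy[j] for j in train_idx]
    y_tr = [y_toy[j] for j in train_idx]
    # Resample the training fold only; the held-out sample is left untouched
    X_bal, y_bal = oversample_minority(X_tr, y_tr)
    assert X_toy[val_idx] not in X_bal
    fold_counts.append(Counter(y_bal))

print(fold_counts)
```

Each fold ends up with balanced classes, while the single validation sample keeps its original label and features.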

In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split, LeaveOneOut, GridSearchCV
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Define function to compute learning curves with ROSE
# (implemented here with RandomOverSampler, which by default duplicates minority samples)
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply ROSE to the training data
        ros = RandomOverSampler(random_state=42)
        X_resampled, y_resampled = ros.fit_resample(X_train_cv, y_train_cv)

        model.fit(X_resampled, y_resampled)
        train_pred = model.predict(X_resampled)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_resampled, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=loo,
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply ROSE to the full training data and refit the best model
        ros = RandomOverSampler(random_state=42)
        X_resampled_full, y_resampled_full = ros.fit_resample(X_train_full, y_train_full)
        best_model.fit(X_resampled_full, y_resampled_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply ROSE to the full training data
        ros = RandomOverSampler(random_state=42)
        X_resampled_full, y_resampled_full = ros.fit_resample(X_train_full, y_train_full)
        model.fit(X_resampled_full, y_resampled_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()



In [None]:
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import RandomOverSampler

In [None]:
from imblearn.over_sampling import RandomOverSampler  # the sampler actually used in this cell
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y5)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply random oversampling to the training data
ros = RandomOverSampler(random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost':xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.inspection import permutation_importance  # Added

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y5)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply random oversampling to the training data
ros = RandomOverSampler(random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost': xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")

    # Compute permutation importance for the SVM classifier and plot it
    if name == 'SVM':
        # Calculate permutation importance
        result = permutation_importance(clf, x_train, y_train, scoring='accuracy', n_repeats=10, random_state=42)
        importances = result.importances_mean

        print(f"\nPermutation Importance for {name}:")
        for idx, importance in enumerate(importances):
            print(f"Feature {idx}: {importance}")

        # Get feature names
        feature_names = df_filtered.columns.tolist()  # Assumes df_filtered has one column per feature of X_projected, in the same order

        # Plot permutation importances with feature names
        plt.figure(figsize=(10, 6))
        bars = plt.barh(feature_names, importances, color='skyblue')
        plt.xlabel('Importance')
        plt.ylabel('Feature')
        plt.title('Permutation Importance for SVM Classifier')

        # Add text annotations
        for idx, bar in enumerate(bars):
            plt.text(bar.get_width(), bar.get_y() + bar.get_height() / 2, round(importances[idx], 2),
                     va='center', ha='left')

        plt.tight_layout()
        sns.axes_style("whitegrid")
        sns.despine(left=True, bottom=True)
        plt.show()


## **SMOTE Balancing Method**

### Leave-One-Out
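
SMOTE balances the classes by creating synthetic minority samples rather than duplicating existing ones: each new point lies on the segment between a minority sample and one of its nearest minority neighbours. The following is a minimal standard-library sketch of that interpolation step, not the imbalanced-learn implementation. `nearest_neighbours`, `smote_point` and the toy coordinates are assumptions made for illustration.

```python
import math
import random

def nearest_neighbours(point, others, k):
    """Return the k points in others closest to point (Euclidean distance)."""
    return sorted(others, key=lambda o: math.dist(point, o))[:k]

def smote_point(sample, minority, k=2, rng=None):
    """Create one synthetic point between sample and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(42)
    others = [m for m in minority if m is not sample]
    neighbour = rng.choice(nearest_neighbours(sample, others, k))
    gap = rng.random()  # interpolation factor in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

# Toy minority class: three nearby points and one distant point
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]]
synthetic = smote_point(minority[0], minority, k=2, rng=random.Random(0))
print(synthetic)
```

Because the distant point is not among the two nearest neighbours of the first sample, the synthetic point stays inside the small cluster.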

In [None]:
from sklearn.model_selection import train_test_split, LeaveOneOut, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import SMOTE  # the sampler actually used in this cell
import matplotlib.pyplot as plt
import numpy as np

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Define function to compute learning curves with SMOTE
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    loo = LeaveOneOut()

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply SMOTE to the training data
        smote = SMOTE(random_state=42)
        X_resampled, y_resampled = smote.fit_resample(X_train_cv, y_train_cv)

        model.fit(X_resampled, y_resampled)
        train_pred = model.predict(X_resampled)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_resampled, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=loo,
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply SMOTE to the full training data and refit the best model
        smote = SMOTE(random_state=42)
        X_resampled_full, y_resampled_full = smote.fit_resample(X_train_full, y_train_full)
        best_model.fit(X_resampled_full, y_resampled_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply SMOTE to the full training data
        smote = SMOTE(random_state=42)
        X_resampled_full, y_resampled_full = smote.fit_resample(X_train_full, y_train_full)
        model.fit(X_resampled_full, y_resampled_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()




In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.inspection import permutation_importance
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

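# Define a function to check for negative values and rescale train and test consistently
def preprocess_data(X_train, X_test):
    # If the training data contains negative values, fit Min-Max scaling on the
    # training set only and apply the same fitted transformation to the test set
    if (np.asarray(X_train) < 0).any():
        print("Warning: Negative values found. Applying Min-Max scaling.")
        scaler = MinMaxScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)
    return X_train, X_test

# This cell reuses the split made in the earlier cells:
# X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Scale the training and test sets with the same fitted scaler
X_train_full, X_test = preprocess_data(X_train_full, X_test)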

# Define SVM model
model = SVC(probability=True, kernel='rbf', random_state=42)  # RBF-kernel SVM; probability=True enables probability estimates

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train_full, y_train_full)

# Train the model
model.fit(X_resampled, y_resampled)

# Compute permutation importance on the held-out test set
def compute_permutation_importance(model, X_test, y_test):
    # Shuffle each feature n_repeats times and record the mean drop in weighted F1
    results = permutation_importance(model, X_test, y_test, scoring='f1_weighted', n_repeats=10, random_state=42)
    importance_scores = results.importances_mean

    # Create a DataFrame to show feature importances
    feature_names = df_filtered.columns.tolist()
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importance_scores
    })

    return importance_df

# Compute and print feature importances
importance_df = compute_permutation_importance(model, X_test, y_test)

print(f"\nFeature Importances using Permutation Importance for SVM:")
print(importance_df)

# Plot feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Permutation Importance for SVM')
plt.show()


In [None]:
from imblearn.over_sampling import ADASYN, SMOTE

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier  # needed for XGBClassifier() below

# Encode target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y5)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply SMOTE to training data
smote = SMOTE(random_state=42)
x_resampled, y_resampled = smote.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost':xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")

## **ADASYN Balancing Method**

### Leave-One-Out
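
ADASYN differs from SMOTE in how it distributes the synthetic samples: minority points whose neighbourhoods contain more majority-class samples receive a larger share. The sketch below computes those shares with the standard library, assuming Euclidean distance and a small `k`. `adasyn_shares` is a hypothetical helper and the toy values are made up; this is not the imbalanced-learn API.

```python
import math

def adasyn_shares(X, y, minority_label, k=2):
    """Return, per minority sample, the fraction of synthetic samples it should receive."""
    ratios = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # Distances from xi to every other sample, paired with that sample's label
        others = [(math.dist(xi, xj), yj)
                  for j, (xj, yj) in enumerate(zip(X, y)) if j != i]
        nearest = sorted(others, key=lambda t: t[0])[:k]
        majority_count = sum(1 for _, lab in nearest if lab != minority_label)
        ratios.append(majority_count / k)
    total = sum(ratios)
    if total == 0:
        return None  # no minority sample has a majority-class neighbour
    return [r / total for r in ratios]

# Toy data: the minority point at 2.0 sits among majority points,
# while the three minority points near 9 only have minority neighbours
X_toy = [[1.0], [1.5], [2.5], [3.0], [3.5], [2.0], [9.0], [9.2], [9.4]]
y_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1]
shares = adasyn_shares(X_toy, y_toy, minority_label=1, k=2)
print(shares)
```

In this toy set the isolated minority point at 2.0 receives the whole share, because the three clustered points have no majority neighbours. The `None` branch corresponds to the situation in which no synthetic samples can be allocated at all.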

In [None]:
from sklearn.model_selection import train_test_split, LeaveOneOut, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import ADASYN
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter

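# Check the class distribution of the training labels
# (y_train_full comes from the split made in an earlier cell; the same split is recreated below)
print("Original class distribution:", Counter(y_train_full))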

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Define function to compute learning curves with ADASYN
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    loo = LeaveOneOut()

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply ADASYN to the training data
        adasyn = ADASYN(random_state=42, sampling_strategy='auto')  # Set sampling_strategy to 'auto'
        if np.unique(y_train_cv).size > 1:  # Check if there is more than one class
            try:
                X_resampled, y_resampled = adasyn.fit_resample(X_train_cv, y_train_cv)
                if len(y_resampled) > len(y_train_cv):  # Ensure resampling occurred
                    model.fit(X_resampled, y_resampled)
                    train_pred = model.predict(X_resampled)
                    val_pred = model.predict(X_val_cv)

                    # Store metrics
                    train_accuracies.append(accuracy_score(y_resampled, train_pred))
                    val_accuracies.append(accuracy_score(y_val_cv, val_pred))
                else:
                    print(f"Warning: No samples generated for fold {len(train_accuracies)+1}, using original data")
                    model.fit(X_train_cv, y_train_cv)
                    train_pred = model.predict(X_train_cv)
                    val_pred = model.predict(X_val_cv)
                    train_accuracies.append(accuracy_score(y_train_cv, train_pred))
                    val_accuracies.append(accuracy_score(y_val_cv, val_pred))
            except ValueError as e:
                print(f"Warning: {e} for fold {len(train_accuracies)+1}, using original data")
                model.fit(X_train_cv, y_train_cv)
                train_pred = model.predict(X_train_cv)
                val_pred = model.predict(X_val_cv)
                train_accuracies.append(accuracy_score(y_train_cv, train_pred))
                val_accuracies.append(accuracy_score(y_val_cv, val_pred))
        else:
            # If only one class is present, no resampling needed
            model.fit(X_train_cv, y_train_cv)
            train_pred = model.predict(X_train_cv)
            val_pred = model.predict(X_val_cv)
            train_accuracies.append(accuracy_score(y_train_cv, train_pred))
            val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=LeaveOneOut(),
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply ADASYN to the full training data and refit the best model
        adasyn = ADASYN(random_state=42, sampling_strategy='auto')  # Set sampling_strategy to 'auto'
        try:
            X_resampled_full, y_resampled_full = adasyn.fit_resample(X_train_full, y_train_full)
            if len(y_resampled_full) > len(y_train_full):  # Ensure resampling occurred
                best_model.fit(X_resampled_full, y_resampled_full)
            else:
                print(f"Warning: No samples generated, using original data")
                best_model.fit(X_train_full, y_train_full)
        except ValueError as e:
            print(f"Warning: {e}, using original data")
            best_model.fit(X_train_full, y_train_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply ADASYN to the full training data
        adasyn = ADASYN(random_state=42, sampling_strategy='auto')  # Set sampling_strategy to 'auto'
        try:
            X_resampled_full, y_resampled_full = adasyn.fit_resample(X_train_full, y_train_full)
            if len(y_resampled_full) > len(y_train_full):  # Ensure resampling occurred
                model.fit(X_resampled_full, y_resampled_full)
            else:
                print(f"Warning: No samples generated, using original data")
                model.fit(X_train_full, y_train_full)
        except ValueError as e:
            print(f"Warning: {e}, using original data")
            model.fit(X_train_full, y_train_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()


In [None]:
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier  # needed for XGBClassifier() below

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y5)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=4)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()


# Apply ADASYN to training data
adasyn = ADASYN(random_state=4)
x_resampled, y_resampled = adasyn.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost': xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")



# **Ranking RSI Prediction**

This section searches for the best combination of balancing method and classifier for predicting the RSI ranking class.

In [None]:
bin_edges = [0, 21, 41]  # pd.cut bins are right-inclusive: [0, 21] -> ranks 1-21, (21, 41] -> ranks 22-41

# Define the bin labels
bin_labels = ['top_20', 'next_20']

reviews3 = data['Reviews_RSI'] = pd.cut(data['Ranking_RSI'], bins=bin_edges, labels=bin_labels, include_lowest=True)

In [None]:
reviews = []
for i in data['Ranking_RSI']:
    if i >= 1 and i <= 21:
        reviews.append('0')
    elif i >= 22 and i <= 41:
        reviews.append('1')

data['Reviews_RSI'] = reviews
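
The two cells above encode the same split in two different ways. As a sanity check, the sketch below uses a made-up series of rankings 1 to 41 (a stand-in for `data['Ranking_RSI']`, whose real values are not shown here) and compares the loop-based labels with a `pd.cut` call on the same edges. The names `rankings`, `loop_labels` and `cut_labels` are illustrative only.

```python
import pandas as pd

# Hypothetical rankings 1..41 standing in for data['Ranking_RSI'].
rankings = pd.Series(range(1, 42))

# Loop-style labelling, mirroring the cell above: 1-21 -> '0', 22-41 -> '1'.
loop_labels = ['0' if 1 <= r <= 21 else '1' for r in rankings]

# pd.cut with right-inclusive bins: [0, 21] and (21, 41].
cut_labels = pd.cut(
    rankings, bins=[0, 21, 41], labels=['0', '1'], include_lowest=True
).astype(str).tolist()

# Both approaches should assign every ranking to the same class.
print("Same labels:", loop_labels == cut_labels)
print("Class sizes:", pd.Series(loop_labels).value_counts().to_dict())
```

If the real column contained values outside 1 to 41, the loop would skip them and `reviews` would end up shorter than `data`, so it may be worth checking the range first.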

In [None]:
y6 = data['Reviews_RSI']

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_projected, y6, test_size = 0.25, random_state=42)

In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

In [None]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y6)

## **ROSE Balancing Method**
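
The cells in this section balance the training set with imblearn's `RandomOverSampler`, which duplicates randomly chosen minority-class rows. The sketch below reproduces that idea in plain NumPy on made-up data. The helper `random_oversample` is hypothetical and is not imblearn's implementation. ROSE in the strict sense also perturbs the duplicated rows (a smoothed bootstrap), which `RandomOverSampler` only does when its `shrinkage` parameter is set.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate random rows of each smaller class until all classes match the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            # Sample existing rows of this class with replacement.
            extra = rng.choice(idx, size=target - count, replace=True)
            X_parts.append(X[extra])
            y_parts.append(y[extra])
    return np.vstack(X_parts), np.concatenate(y_parts)

# Made-up data: 7 samples of class 0 and 3 samples of class 1.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([0] * 7 + [1] * 3)

X_bal, y_bal = random_oversample(X_demo, y_demo)
print("Class counts after oversampling:", np.bincount(y_bal))
```

Because the added rows are exact copies, tree-based models can reach very high training accuracy on the resampled data, so the training curve in the plots below should be read with that in mind.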

### Leave-One-Out

In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split, LeaveOneOut, GridSearchCV
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Define function to compute learning curves with ROSE
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply ROSE to the training data
        ros = RandomOverSampler(random_state=42)
        X_resampled, y_resampled = ros.fit_resample(X_train_cv, y_train_cv)

        model.fit(X_resampled, y_resampled)
        train_pred = model.predict(X_resampled)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_resampled, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=loo,
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply ROSE to the full training data and refit the best model
        ros = RandomOverSampler(random_state=42)
        X_resampled_full, y_resampled_full = ros.fit_resample(X_train_full, y_train_full)
        best_model.fit(X_resampled_full, y_resampled_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply ROSE to the full training data
        ros = RandomOverSampler(random_state=42)
        X_resampled_full, y_resampled_full = ros.fit_resample(X_train_full, y_train_full)
        model.fit(X_resampled_full, y_resampled_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()
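
A note on the "Best F1 Score" printed by the grid search: with leave-one-out, each validation fold holds a single sample, so the weighted F1 of a fold can only be 0 or 1. The sketch below checks the four possible outcomes for a binary target with scikit-learn's own metrics (`zero_division=0` is set only to silence the undefined-metric warning). Under that assumption the averaged score behaves like mean accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score

# All four (true label, predicted label) outcomes for a one-sample fold.
outcomes = [([1], [1]), ([1], [0]), ([0], [0]), ([0], [1])]

results = []
for y_true, y_pred in outcomes:
    acc = accuracy_score(y_true, y_pred)
    f1w = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    results.append((acc, f1w))
    print(f"true={y_true[0]} pred={y_pred[0]} accuracy={acc} f1_weighted={f1w}")
```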



In [None]:
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import RandomOverSampler

In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y6)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply random oversampling (ROSE) to training data
ros = RandomOverSampler(random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost':xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y6)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply random oversampling (ROSE) to training data
ros = RandomOverSampler(random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost': xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")

    # Get variable importance for XGBoost and create a plot
    if name == 'XGBoost':
        variable_importance = clf.feature_importances_
        print(f"\nVariable Importance for {name}:")
        for idx, importance in enumerate(variable_importance):
            print(f"Feature {idx}: {importance}")

        # Get feature names
        feature_names = df_filtered.columns.tolist()  # Assumes df_filtered has one column per feature in X_projected

        # Plot variable importances with feature names
        plt.figure(figsize=(10, 6))
        bars = plt.barh(feature_names, variable_importance, color='skyblue')
        plt.xlabel('Importance')
        #plt.ylabel('Feature')
        #plt.title('Variable Importance for Random Forest Classifier')

        # Add text annotations
        '''for idx, bar in enumerate(bars):
            plt.text(bar.get_width(), bar.get_y() + bar.get_height() / 2, round(variable_importance[idx], 2),
                     va='center', ha='left')'''

        plt.tight_layout()
        sns.axes_style("whitegrid")
        sns.despine(left=True, bottom=True)
        plt.show()


## **SMOTE Balancing Method**
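
SMOTE creates new minority samples by interpolating between a minority point and one of its nearest minority neighbours, rather than copying rows. The NumPy sketch below illustrates only that interpolation step on made-up points. The helper `smote_like_samples` and its parameters are hypothetical, and imblearn's `SMOTE` handles more cases than this.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=42):
    """Interpolate between minority points and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority points; ignore self-distances.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a base point and one of its neighbours for every new sample.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Random position on the segment between the two points.
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Made-up minority points inside the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_samples(X_minority, n_new=4)
print(synthetic.shape)
```

Every synthetic point lies on a segment between two existing minority points, so it stays inside the region the minority class already occupies.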

### Leave-One-Out

In [None]:
from sklearn.model_selection import train_test_split, LeaveOneOut, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import numpy as np

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Define function to compute learning curves with SMOTE
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    loo = LeaveOneOut()

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Apply SMOTE to the training data
        smote = SMOTE(random_state=42)
        X_resampled, y_resampled = smote.fit_resample(X_train_cv, y_train_cv)

        model.fit(X_resampled, y_resampled)
        train_pred = model.predict(X_resampled)
        val_pred = model.predict(X_val_cv)

        # Store metrics
        train_accuracies.append(accuracy_score(y_resampled, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
    if name in param_grids:
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=loo,
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply SMOTE to the full training data and refit the best model
        smote = SMOTE(random_state=42)
        X_resampled_full, y_resampled_full = smote.fit_resample(X_train_full, y_train_full)
        best_model.fit(X_resampled_full, y_resampled_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply SMOTE to the full training data
        smote = SMOTE(random_state=42)
        X_resampled_full, y_resampled_full = smote.fit_resample(X_train_full, y_train_full)
        model.fit(X_resampled_full, y_resampled_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()




In [None]:
import shap
import pandas as pd
import numpy as np

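# best_model is whichever estimator the grid-search loop above assigned last.
# As written, the loop ends with XGBoost, so select the AdaBoost estimator explicitly
# if that is the model to explain. X_resampled_full is the resampled training data
# used for fitting best_model.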

# Define a function to get model predictions
def predict_fn(X):
    return best_model.predict_proba(X)

# Initialize SHAP Explainer
explainer = shap.KernelExplainer(predict_fn, X_resampled_full)

# Compute SHAP values for the test set
shap_values = explainer.shap_values(X_test)

# Print the shape of SHAP values for debugging (np.shape also works if a list is returned)
print(f"SHAP values shape: {np.shape(shap_values)}")

# Check if SHAP values are a list (binary classification scenario)
if isinstance(shap_values, list):
    # Extract SHAP values for the positive class (usually the second class)
    shap_values_class = shap_values[1]
else:
    # If not a list, assume it's already for the positive class
    shap_values_class = shap_values

# Ensure SHAP values are 2D with shape (num_samples, num_features)
if shap_values_class.ndim == 3:
    # If 3D, handle the case appropriately
    shap_values_class = shap_values_class[:, :, 1]  # Select the SHAP values for the positive class if 3D
elif shap_values_class.ndim != 2:
    raise ValueError(f"Unexpected SHAP values shape: {shap_values_class.shape}")

# Calculate mean absolute SHAP values for feature importance
mean_shap_values = np.mean(np.abs(shap_values_class), axis=0)

# Get feature names directly from the DataFrame
feature_names = df_filtered.columns.tolist()

# Ensure lengths match
if len(feature_names) != len(mean_shap_values):
    print(f"Length mismatch: feature_names ({len(feature_names)}) != mean_shap_values ({len(mean_shap_values)})")
    # Optionally handle the mismatch here

# Ensure 1D arrays
feature_names = np.array(feature_names).flatten()
mean_shap_values = np.array(mean_shap_values).flatten()

# Create a DataFrame to show feature importances
try:
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': mean_shap_values
    })
    # Print feature importances
    print("\nFeature Importances using SHAP for Best AdaBoost Model:")
    print(importance_df)
except ValueError as e:
    print(f"Error creating DataFrame: {e}")

# Plot SHAP summary plot
shap.summary_plot(shap_values_class, X_test)
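
The shape handling in the cell above can be isolated in a small helper. Depending on the installed SHAP version, `shap_values` for a binary classifier may come back as a list with one array per class or as a single 3D array of shape (samples, features, classes). The sketch below uses made-up arrays instead of real SHAP output, and `positive_class_shap` is a hypothetical helper name.

```python
import numpy as np

def positive_class_shap(shap_values, class_index=1):
    """Return a 2D (samples, features) array for one class from either output layout."""
    if isinstance(shap_values, list):
        values = np.asarray(shap_values[class_index])
    else:
        values = np.asarray(shap_values)
        if values.ndim == 3:
            values = values[:, :, class_index]
    if values.ndim != 2:
        raise ValueError(f"Unexpected SHAP values shape: {values.shape}")
    return values

# Made-up values: 4 samples, 3 features, 2 classes.
as_list = [np.zeros((4, 3)), np.ones((4, 3))]
as_3d = np.stack(as_list, axis=2)  # shape (4, 3, 2)

from_list = positive_class_shap(as_list)
from_3d = positive_class_shap(as_3d)
print(from_list.shape, from_3d.shape, np.array_equal(from_list, from_3d))
```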


In [None]:
import shap
import pandas as pd
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import accuracy_score

# Define a function to get model predictions
def predict_fn(X, model):
    return model.predict_proba(X)

# Initialize LOO Cross-Validation
loo = LeaveOneOut()

# Initialize list to collect SHAP values for each fold
all_shap_values = []

# Loop through each train-test split
for train_index, test_index in loo.split(X_resampled_full):
    # Split data
    X_train, X_test_fold = X_resampled_full[train_index], X_resampled_full[test_index]
    y_train, y_test_fold = y_resampled_full[train_index], y_resampled_full[test_index]

    # Train the model on the training set
    best_model.fit(X_train, y_train)

    # Initialize SHAP Explainer for the current fold
    explainer = shap.KernelExplainer(lambda x: predict_fn(x, best_model), X_train)

    # Compute SHAP values for the test set in the current fold
    shap_values_fold = explainer.shap_values(X_test_fold)

    # Check if SHAP values are a list (binary classification scenario)
    if isinstance(shap_values_fold, list):
        # Extract SHAP values for the positive class (usually the second class)
        shap_values_fold_class = shap_values_fold[1]
    else:
        # If not a list, assume it's already for the positive class
        shap_values_fold_class = shap_values_fold

    # Ensure SHAP values are 2D with shape (num_samples, num_features)
    if shap_values_fold_class.ndim == 3:
        # If 3D, handle the case appropriately
        shap_values_fold_class = shap_values_fold_class[:, :, 1]  # Select the SHAP values for the positive class if 3D
    elif shap_values_fold_class.ndim != 2:
        raise ValueError(f"Unexpected SHAP values shape: {shap_values_fold_class.shape}")

    # Append SHAP values for the current fold
    all_shap_values.append(shap_values_fold_class)

# Concatenate all SHAP values across folds
all_shap_values = np.vstack(all_shap_values)

# Calculate mean absolute SHAP values for feature importance
mean_shap_values = np.mean(np.abs(all_shap_values), axis=0)

# Get feature names directly from the DataFrame
feature_names = df_filtered.columns.tolist()

# Ensure lengths match
if len(feature_names) != len(mean_shap_values):
    print(f"Length mismatch: feature_names ({len(feature_names)}) != mean_shap_values ({len(mean_shap_values)})")
    # Optionally handle the mismatch here

# Ensure 1D arrays
feature_names = np.array(feature_names).flatten()
mean_shap_values = np.array(mean_shap_values).flatten()

# Create a DataFrame to show feature importances
try:
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': mean_shap_values
    }).sort_values(by='Importance', ascending=False)

    # Print feature importances
    print("\nFeature Importances using SHAP for Best AdaBoost Model:")
    print(importance_df)
except ValueError as e:
    print(f"Error creating DataFrame: {e}")

# Plot SHAP summary plot for the entire dataset (combine all data used)
explainer_full = shap.KernelExplainer(lambda x: predict_fn(x, best_model), X_resampled_full)
shap_values_full = explainer_full.shap_values(X_resampled_full)

# Check if SHAP values are a list (binary classification scenario)
if isinstance(shap_values_full, list):
    # Extract SHAP values for the positive class (usually the second class)
    shap_values_class_full = shap_values_full[1]
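else:
    # If not a list, assume it's already for the positive class
    shap_values_class_full = shap_values_full

# If a 3D array (samples, features, classes) was returned, keep the positive class
if shap_values_class_full.ndim == 3:
    shap_values_class_full = shap_values_class_full[:, :, 1]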

# Plot SHAP summary plot
shap.summary_plot(shap_values_class_full, X_resampled_full)


In [None]:
from imblearn.over_sampling import ADASYN, SMOTE

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Encode target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y6)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()

# Apply SMOTE to training data
smote = SMOTE(random_state=42)
x_resampled, y_resampled = smote.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost':xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")



## **ADASYN Balancing Method**
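
ADASYN differs from SMOTE in where it places synthetic samples: minority points with more majority-class neighbours receive more of them. The NumPy sketch below illustrates only that allocation step on made-up data. The helper `adasyn_allocation` and its parameters are hypothetical, and this is not imblearn's implementation.

```python
import numpy as np

def adasyn_allocation(X, y, minority=1, k=3, n_new=10):
    """Split n_new synthetic samples across minority points by local difficulty."""
    X_min = X[y == minority]
    # Distances from every minority point to every sample in the data set.
    dists = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    # Column 0 of the sorted neighbours is the point itself (distance 0,
    # assuming no duplicate rows), so take columns 1..k.
    nn = np.argsort(dists, axis=1)[:, 1:k + 1]
    # Share of majority-class points among the k nearest neighbours.
    r = (y[nn] != minority).mean(axis=1)
    if r.sum() == 0:
        # No minority point has a majority neighbour: nothing to allocate.
        return np.zeros(len(X_min), dtype=int)
    return np.rint(r / r.sum() * n_new).astype(int)

# Made-up data: six majority points near the origin, one minority point
# among them, and four minority points far away.
X_demo = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.2, 0.1], [0.1, 0.2],
    [0.15, 0.15],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
])
y_demo = np.array([0] * 6 + [1] * 5)

allocation = adasyn_allocation(X_demo, y_demo)
print(allocation)
```

In this toy case all synthetic samples go to the one minority point surrounded by majority points. When no minority point has majority neighbours, nothing can be allocated, which is the kind of situation the fallback branches in the cell below are guarding against.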

### Leave-One-Out

In [None]:
from sklearn.model_selection import train_test_split, LeaveOneOut, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from imblearn.over_sampling import ADASYN
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter

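# Check the class distribution of the training labels
# (y_train_full here comes from the split in the previous section; the same split is recreated below)
print("Original class distribution:", Counter(y_train_full))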

# Define parameter grids for each classifier
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Naive Bayes': {},
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.001, 0.01, 0.1, 1]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }
}

X_train_full, X_test, y_train_full, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=42)

# Initialize classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier()
}

# Define function to compute learning curves with ADASYN
def compute_learning_curves(model, X, y):
    train_accuracies = []
    val_accuracies = []

    loo = LeaveOneOut()

    for train_index, val_index in loo.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index]
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        # Fit on the original fold unless ADASYN actually generates extra samples
        X_fit, y_fit = X_train_cv, y_train_cv
        if np.unique(y_train_cv).size > 1:  # ADASYN needs more than one class
            adasyn = ADASYN(random_state=42, sampling_strategy='auto')
            try:
                X_resampled, y_resampled = adasyn.fit_resample(X_train_cv, y_train_cv)
                if len(y_resampled) > len(y_train_cv):  # Ensure resampling occurred
                    X_fit, y_fit = X_resampled, y_resampled
                else:
                    print(f"Warning: No samples generated for fold {len(train_accuracies)+1}, using original data")
            except ValueError as e:
                print(f"Warning: {e} for fold {len(train_accuracies)+1}, using original data")

        model.fit(X_fit, y_fit)
        train_pred = model.predict(X_fit)
        val_pred = model.predict(X_val_cv)

        # Store metrics (each validation fold holds a single sample, so its accuracy is 0 or 1)
        train_accuracies.append(accuracy_score(y_fit, train_pred))
        val_accuracies.append(accuracy_score(y_val_cv, val_pred))

    return train_accuracies, val_accuracies

# Perform Grid Search with LOOCV
for name, model in classifiers.items():
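    if param_grids.get(name):  # Tune only when the grid is non-empty; an empty grid (Naive Bayes) uses the untuned branch below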
        # Configure GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=LeaveOneOut(),
            n_jobs=-1,
            verbose=1
        )
        # Fit GridSearchCV with LOOCV
        grid_search.fit(X_train_full, y_train_full)

        print(f"\nClassifier: {name}")
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best F1 Score: {grid_search.best_score_}")

        # Retrieve the best model
        best_model = grid_search.best_estimator_

        # Apply ADASYN to the full training data and refit the best model
        adasyn = ADASYN(random_state=42, sampling_strategy='auto')  # Set sampling_strategy to 'auto'
        try:
            X_resampled_full, y_resampled_full = adasyn.fit_resample(X_train_full, y_train_full)
            if len(y_resampled_full) > len(y_train_full):  # Ensure resampling occurred
                best_model.fit(X_resampled_full, y_resampled_full)
            else:
                print(f"Warning: No samples generated, using original data")
                best_model.fit(X_train_full, y_train_full)
        except ValueError as e:
            print(f"Warning: {e}, using original data")
            best_model.fit(X_train_full, y_train_full)

        # Evaluate on the test set
        y_test_pred = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(best_model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()

    else:
        # For classifiers without hyperparameter tuning
        # Apply ADASYN to the full training data
        adasyn = ADASYN(random_state=42, sampling_strategy='auto')  # Set sampling_strategy to 'auto'
        try:
            X_resampled_full, y_resampled_full = adasyn.fit_resample(X_train_full, y_train_full)
            if len(y_resampled_full) > len(y_train_full):  # Ensure resampling occurred
                model.fit(X_resampled_full, y_resampled_full)
            else:
                print(f"Warning: No samples generated, using original data")
                model.fit(X_train_full, y_train_full)
        except ValueError as e:
            print(f"Warning: {e}, using original data")
            model.fit(X_train_full, y_train_full)

        y_test_pred = model.predict(X_test)

        acc_test = accuracy_score(y_test, y_test_pred)
        precision_test = precision_score(y_test, y_test_pred, average='weighted')
        recall_test = recall_score(y_test, y_test_pred, average='weighted')
        f1_test = f1_score(y_test, y_test_pred, average='weighted')
        conf_matrix_test = confusion_matrix(y_test, y_test_pred)

        print(f"\nClassifier: {name}")
        print(f"Test Accuracy: {acc_test}")
        print(f"Test Precision: {precision_test}")
        print(f"Test Recall: {recall_test}")
        print(f"Test F1-score: {f1_test}")
        print(f"Test Confusion Matrix:\n{conf_matrix_test}")

        # Compute learning curves
        train_accuracies, val_accuracies = compute_learning_curves(model, X_train_full, y_train_full)

        plt.figure(figsize=(12, 6))
        plt.plot(train_accuracies, label='Training Accuracy')
        plt.plot(val_accuracies, label='Validation Accuracy')
        plt.xlabel('Leave-One-Out Fold')
        plt.ylabel('Accuracy')
        plt.title(f'Learning Curves for {name}')
        plt.legend()
        plt.show()


In [None]:
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier  # needed for the XGBoost classifier created below

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y6)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_projected, y_encoded, test_size=0.25, random_state=4)

# Initialize classifiers
rf = RandomForestClassifier()
nb = GaussianNB()
svm = SVC()
ada = AdaBoostClassifier()
xgb = XGBClassifier()


# Apply ADASYN to training data
adasyn = ADASYN(random_state=4)
x_resampled, y_resampled = adasyn.fit_resample(x_train, y_train)

# Train and evaluate classifiers
classifiers = {'Random Forest': rf, 'Naive Bayes': nb, 'SVM': svm, 'AdaBoost': ada, 'XGBoost': xgb}
for name, clf in classifiers.items():
    clf.fit(x_resampled, y_resampled)
    y_pred = clf.predict(x_test)

    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {acc}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"Confusion Matrix:\n{conf_matrix}")



# Testing SVM and Naive Bayes Permutation Importance

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Define feature names
feature_names = [
    'Weight (kg)', '%Body Fat', '1RM Hip Thrust (kg)', '1RM Back Squat (kg)',
    'Relative 1RM HT(kg)', 'Relative 1RM Squat (kg)', 'Speed 5m (s)',
    'Speed 20m (s)', 'v02 Max (ml.kg-1.min-1)', 'Baseline Drop Jump (cm)',
    'Baseline Contact time (s)', 'Baseline RSI', 'Baseline CMJ (cm)'
]

# Define the data
data3 = {
    'PR\n(ROSE and \nAdaBoost)': [
        0.000, 0.000, 0.056, 0.308, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.075, 0.000, 0.000
    ],
    'CK\n(ADASYN and \nRandom Forest)': [
        0.331, 0.479, 0.158, 0.688, 0.790, 0.809, 0.391, 0.451, 0.925, 0.493, 0.412, 0.193, 0.000
    ],
    'CMJ\n(ADASYN and \nXGBoost)': [
        0.040, 0.061, 0.523, 0.095, 0.204, 0.218, 0.048, 0.191, 0.169, 0.810, 0.532, 0.066, 0.011
    ],
    'RSI\n(SMOTE and \nAdaBoost)': [
        0.000, 0.002, 0.000, 0.000, 0.000, 0.013, 0.001, 0.025, 0.071, 0.004, 0.000, 0.018, 0.000
    ]
}

palette = sns.light_palette("seagreen", 16)

# Create DataFrame
df_filtered2 = pd.DataFrame(data3, index=feature_names)

# Create the heatmap (rows: outcome/model pairs, columns: features) with the sequential green palette
plt.figure(figsize=(14, 8))
ax = sns.heatmap(df_filtered2.T, annot=True, cmap=palette,
                 annot_kws={"size": 13}, center=0, cbar_kws={"shrink": .8, "aspect": 10, "pad": 0.05})

# Set y-axis labels and title
ax.set_yticklabels(df_filtered2.columns, rotation=0,ha='center', fontsize=14)
ax.tick_params(axis='y', labelsize=11, pad=45)
ax.tick_params(axis='x', labelsize=11)
ax.set_xticklabels(df_filtered2.index,  ha='center', fontsize=13)  # Manually set xticklabels

# Add title
#plt.title('Feature Importances for Different Models and Variables', size=20)

# Adjust layout to fit labels
plt.tight_layout()

# Save and show the plot
plt.savefig('feature_importances.png', dpi=300, bbox_inches='tight')
plt.show()


### SVM and Naive Bayes Permutation Importance
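
The values plotted in this section are labelled as a decrease in accuracy, which is how permutation importance is usually reported: one feature column is shuffled, the fitted model is re-scored, and the drop relative to the unshuffled score is averaged over several repeats. A negative value means the score happened to improve after shuffling, which tends to indicate a feature the model does not rely on. The sketch below illustrates the idea with scikit-learn's `permutation_importance` on synthetic data; the toy dataset, the `GaussianNB` model, and names such as `X_toy` and `result` are assumptions for illustration and do not reproduce the hard-coded numbers used in this section.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Toy data (assumption): only the first feature carries signal, the other two are noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

model = GaussianNB().fit(X_tr, y_tr)

# Shuffle each feature 30 times on the held-out set and record the drop in accuracy
result = permutation_importance(
    model, X_te, y_te, scoring='accuracy', n_repeats=30, random_state=0
)

for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature {i}: mean decrease in accuracy = {mean:.3f} (std {std:.3f})")
```

The `importances_mean` array has one entry per feature, in column order, so it can be placed directly into a dictionary like `data3` alongside the feature names.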

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Assuming data3 is your feature importance dictionary
data3 = {
    'PR\n(ROSE and \nAdaBoost)': [0.000, 0.000, 0.056, 0.308, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.075, 0.000, 0.000],
    'CK\n(ADASYN and \nRandom Forest)': [0.331, 0.479, 0.158, 0.688, 0.790, 0.809, 0.391, 0.451, 0.925, 0.493, 0.412, 0.193, 0.000],
    'CMJ\n(ADASYN and \nXGBoost)': [0.040, 0.061, 0.523, 0.095, 0.204, 0.218, 0.048, 0.191, 0.169, 0.810, 0.532, 0.066, 0.011],
    'DJ\n(SMOTE and GB Naive Bayes)': [0.000, 0.003, -0.145, -0.147, -0.098, 0.027, 0.000, 0.000, -0.080, 0.065, -0.134, -0.094, -0.151],
    'DJCT\n(SMOTE and SVM)': [0.025, 0.028, 0.065, 0.075, 0.082, -0.048, -0.177, 0.082, 0.082, 0.018, 0.025, 0.076, 0.011],
    'RSI\n(SMOTE and \nAdaBoost)': [0, 0.002, 0, 0, 0, 0.013, 0.001, 0.025, 0.071, 0.004, 0, 0.018, 0]
}

# Define feature names
feature_names = [
    'Weight (kg)', '%Body Fat', '1RM Hip Thrust (kg)', '1RM Back Squat (kg)',
    'Relative 1RM HT(kg)', 'Relative 1RM Squat (kg)', 'Speed 5m (s)',
    'Speed 20m (s)', 'v02 Max (ml.kg-1.min-1)', 'Baseline Drop Jump (cm)',
    'Baseline Contact time (s)', 'Baseline RSI', 'Baseline CMJ (cm)'
]

# Create DataFrame for SVM and Naive Bayes
svm_nb_data = {
    'DJ\n(SMOTE and GB Naive Bayes)': data3['DJ\n(SMOTE and GB Naive Bayes)'],
    'DJCT\n(SMOTE and SVM)': data3['DJCT\n(SMOTE and SVM)']
}

df_svm_nb = pd.DataFrame(svm_nb_data, index=feature_names)

# Convert DataFrame to long format for easier plotting
df_long = df_svm_nb.melt(var_name='Model', value_name='Decrease in Accuracy', ignore_index=False)
df_long.reset_index(inplace=True)
df_long.rename(columns={'index': 'Feature'}, inplace=True)

# Set the seaborn style to a minimalistic theme
sns.set(style="white")

# Define colors
color_naive_bayes = '#8de5a1'  # Light green
color_svm = '#a1c9f4'          # Light blue

# Create a function to plot individual models
def plot_model(df, model_name, color, filename):
    plt.figure(figsize=(10, 8))
    ax = sns.barplot(x='Decrease in Accuracy', y='Feature', data=df[df['Model'] == model_name], color=color)
    ax.set_title(f'{model_name}', fontsize=16, fontweight='bold')
    ax.set_xlabel('Decrease in Accuracy Score')
    ax.set_ylabel('', fontsize=14)

    # Add value annotations at the end of each bar (on the left side for negative values)
    for p in ax.patches:
        width = p.get_width()
        ax.annotate(f'{width:.3f}', (width, p.get_y() + p.get_height() / 2),
                    xytext=(5 if width >= 0 else -5, 0), textcoords='offset points',
                    ha='left' if width >= 0 else 'right', va='center', fontsize=12, color='black')

    # Remove box and x-axis
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.tick_params(axis='x', which='both', length=0)
    ax.tick_params(axis='y', which='both', length=0)
    ax.tick_params(axis='y', labelsize=12)  # Increase y-axis label size

    plt.tight_layout()
    plt.savefig(filename, dpi=300, bbox_inches='tight')
    plt.show()

# Plot for Naive Bayes
plot_model(df_long, 'DJ\n(SMOTE and GB Naive Bayes)', color_naive_bayes, 'naive_bayes_feature_importance.png')

# Plot for SVM
plot_model(df_long, 'DJCT\n(SMOTE and SVM)', color_svm, 'svm_feature_importance.png')
