<h1 style="color: green; font-family: Arial, Helvetica, sans-serif; text-align: center;"> Problem - 1: Meta-Feature Based Dynamic Selection of Machine Learning Algorithms for Datasets</h1>
<h1 style="color: green; font-family: Arial, Helvetica, sans-serif; text-align: center;"> Problem - 2: A Comparison of Meta-Feature-Driven Algorithm Selection and Conventional Model Selection Techniques</h1>
<h1 style="color: green; font-family: Arial, Helvetica, sans-serif; text-align: center;"> Problem - 3: Dynamically ML Framework Selection</h1>
<h3 style = "text-align: center;">Hasan H. Rahman</h3>

<p>In the field of machine learning, selecting the most suitable algorithm for a given dataset can be challenging, as it requires comprehensive knowledge of various algorithms and their performance characteristics. Traditional approaches to model selection often involve evaluating multiple algorithms on a dataset and comparing their performance metrics, a process that can be time-consuming and computationally expensive.</p>

<p>This experiment aims to address this issue by developing a dynamic machine learning algorithm selection system that leverages meta-features extracted from a dataset. By utilizing these meta-features, the system can automatically identify and select the most appropriate algorithm for the dataset. The selected model is then compared with models selected through traditional algorithm evaluation-based methods to analyze the efficiency and effectiveness of the meta-feature-driven approach. The goal is to determine whether meta-feature-based selection can achieve comparable or superior results in a more efficient manner.</p>

<h1> Steps to address this problem </h1>
<ol> 
    <li>Dynamically prepare Dataset</li>  
    <li>Detect Problem Type</li>  
    <li>Meta-Feature Extraction</li> 
    <li>Meta-Feature-Based Model Selection</li>
    <li>Meta-Feature Model Evaluation</li>
    <li>Select Best Model from Traditional Algorithms Evaluation</li>
    <li>Compare both selected Models</li>
    <li>Final Algorithm Suggestion</li>
    <li>Save Results in our MetaFeatures.csv file</li>
</ol>

### Import necessary packages

In [1]:
import pandas as pd
import numpy as np
import os
import warnings
import time
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import (
    RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor,
    RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier
)
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix,
    mean_absolute_error, mean_squared_error, r2_score
)
from sklearn.svm import SVR, SVC
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from scipy import stats

### Dataset

In [2]:
# Suppress convergence warnings globally
warnings.filterwarnings("ignore", category=ConvergenceWarning)

#Dataset
#Customers.csv (Income), Boston.csv(medv), cars.csv(sales), iris.csv (variety), Dye.csv (Epsilon), dyeDesign.csv (Class)
dataset_file_path = 'dye.csv'
try:
    df = pd.read_csv(dataset_file_path)
    print(f"Dataset loaded successfully. Shape: {df.shape}")

    target_column = 'Epsilon'
    independent_features = [col for col in df.columns if col != target_column]
    dataset_name = dataset_file_path.split('/')[-1]
    print(f"Dataset Name: {dataset_name}")
    print(f"Target Feature: {target_column}")
    
    print("Independent Features:")
    for feature in independent_features:
        print(f"- {feature}")
    
except FileNotFoundError:
    print(f"Error: The file {dataset_file_path} does not exist.")

Dataset loaded successfully. Shape: (8802, 248)
Dataset Name: dye.csv
Target Feature: Epsilon
Independent Features:
- C Atom Count
- Total Atom Count
- H Atom Count
- Longest Carbon Chain
- Aromatic Atom Count
- Max Distance
- Bonds Count
- Rdkit Descriptor Chi0
- Rdkit Descriptor Chi0N
- Rdkit Descriptor Chi0V
- Rdkit Descriptor Chi1
- Rdkit Descriptor Chi1N
- Rdkit Descriptor Chi1V
- Rdkit Descriptor Chi2N
- Rdkit Descriptor Chi2V
- Rdkit Descriptor Chi3N
- Rdkit Descriptor Chi3V
- Rdkit Descriptor Chi4N
- Rdkit Descriptor Chi4V
- Rdkit Descriptor Heavyatomcount
- Rdkit Descriptor Maxabsestateindex
- Rdkit Descriptor Maxabspartialcharge
- Rdkit Descriptor Maxestateindex
- Rdkit Descriptor Maxpartialcharge
- Rdkit Descriptor Minabsestateindex
- Rdkit Descriptor Minabspartialcharge
- Rdkit Descriptor Minestateindex
- Rdkit Descriptor Minpartialcharge
- Rdkit Descriptor Mollogp
- Rdkit Descriptor Molmr
- Rdkit Descriptor Molwt
- Rdkit Descriptor Nhohcount
- Rdkit Descriptor Nocount
- Rd

### Identify Problem Type: Classification, Regression, or Clustering  

In [3]:
def detect_problem_type(df, target_column):
    target = df[target_column]
    if target.dtype in ['object', 'category'] or len(target.unique()) <= 10:
        return 'classification'
    elif target.dtype in ['int64', 'float64']:
        return 'regression'
    else:
        return 'clustering'

problem_type = detect_problem_type(df, target_column)
print(f"Problem Type: {problem_type}")

Problem Type: regression


### Preprocess data

In [4]:
# Data preprocessing function
def preprocess_data(df, target_column):
    X = df.drop(target_column, axis=1)
    y = df[target_column]
    le = LabelEncoder()
    y = le.fit_transform(y)

    numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = X.select_dtypes(include=['object']).columns

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ]
    )

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    return preprocessor, X_train, X_test, y_train, y_test


In [5]:
# Preprocess the data
#preprocessor, X_train, X_test, y_train, y_test = preprocess_data(df, target_column)
#X_train = preprocessor.fit_transform(X_train)
#X_test = preprocessor.transform(X_test)

### Dynamically Extract Meta Features of a Dataset

In [6]:
def extract_meta_features(df):
    meta_features = {}
    
    meta_features['num_instances'] = len(df)
    meta_features['num_features'] = len(df.columns) - 1  # Assuming last column is target
    meta_features['instance_feature_ratio'] = meta_features['num_instances'] / meta_features['num_features']
    meta_features['missing_value_percentage'] = df.isnull().sum().sum() / (df.shape[0] * df.shape[1]) * 100

    meta_features['num_categorical'] = len(df.select_dtypes(include=['object', 'category']).columns)
    meta_features['num_numerical'] = len(df.select_dtypes(include=['int64', 'float64']).columns) - 1
    
    numerical_features = df.select_dtypes(include=['int64', 'float64']).columns[:-1]
    if len(numerical_features) > 1:
        meta_features['mean_correlation'] = df[numerical_features].corr().abs().mean().mean()
    else:
        meta_features['mean_correlation'] = 0

    target = df.iloc[:, -1]
    is_classification = target.dtype in ['object', 'category'] or len(target.unique()) <= 10

    if is_classification:
        meta_features['num_classes'] = len(target.unique())
        meta_features['class_imbalance'] = target.value_counts().max() / len(target)
    else:
        meta_features['target_mean'] = target.mean()
        meta_features['target_std'] = target.std()
        meta_features['target_skew'] = stats.skew(target)  # Calculating skewness only for regression
        meta_features['target_kurtosis'] = stats.kurtosis(target)

    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    if is_classification:
        le = LabelEncoder()
        encoded_target = le.fit_transform(y)
        meta_features['target_entropy'] = stats.entropy(np.bincount(encoded_target))
        try:
            mi_scores = mutual_info_classif(X, encoded_target)
            meta_features['mean_mutual_information'] = np.mean(mi_scores)
        except:
            meta_features['mean_mutual_information'] = 0
    else:
        meta_features['target_entropy'] = stats.entropy(pd.cut(y, bins=10).value_counts())
        try:
            mi_scores = mutual_info_regression(X, y)
            meta_features['mean_mutual_information'] = np.mean(mi_scores)
        except:
            meta_features['mean_mutual_information'] = 0

    for col in numerical_features:
        meta_features[f'{col}_skew'] = stats.skew(df[col])
        meta_features[f'{col}_kurtosis'] = stats.kurtosis(df[col])
        meta_features[f'{col}_outliers'] = np.sum(np.abs(stats.zscore(df[col])) > 3) / len(df[col])

    for feature, value in meta_features.items():
        print(f"{feature}: {value}")

    return meta_features

meta_features = extract_meta_features(df)

num_instances: 8802
num_features: 247
instance_feature_ratio: 35.635627530364374
missing_value_percentage: 0.0
num_categorical: 0
num_numerical: 247
mean_correlation: 0.11041338280018773
num_classes: 2
class_imbalance: 0.9995455578277664
target_entropy: 0.003951925758117492
mean_mutual_information: 0.00042020759198939085
Epsilon_skew: 3.6037817719942353
Epsilon_kurtosis: 18.924715478151636
Epsilon_outliers: 0.020790729379686436
C Atom Count_skew: 2.5423248935361147
C Atom Count_kurtosis: 10.606236528274842
C Atom Count_outliers: 0.015791865485117018
Total Atom Count_skew: 2.9449663279838907
Total Atom Count_kurtosis: 15.043263672339524
Total Atom Count_outliers: 0.016927970915700977
H Atom Count_skew: 3.849354896233442
H Atom Count_kurtosis: 26.232981983627283
H Atom Count_outliers: 0.017155192001817768
Longest Carbon Chain_skew: 2.581229871862874
Longest Carbon Chain_kurtosis: 10.895985870059294
Longest Carbon Chain_outliers: 0.02578959327425585
Aromatic Atom Count_skew: 2.15645580514

### Meta Feature Based Model Selection

In [7]:
def select_model_based_on_meta_features(meta_features, problem_type):
    reasoning = []

    if problem_type == 'classification':
        if meta_features.get('class_imbalance', 0) > 0.7:
            reasoning.append(f"High class imbalance detected ({meta_features['class_imbalance'] * 100:.2f}%).")
            return 'RandomForestClassifier', 'Chosen due to high class imbalance (RandomForest is robust to class imbalance). ' + ' '.join(reasoning)

        elif meta_features.get('mean_mutual_information', 0) > 0.2:
            reasoning.append(f"High mutual information between features and target ({meta_features['mean_mutual_information']:.2f}).")
            return 'GradientBoostingClassifier', 'Chosen due to high mutual information (GradientBoosting is effective for informative features). ' + ' '.join(reasoning)

        elif meta_features.get('num_features', 0) > 50:
            reasoning.append(f"Large number of features detected ({meta_features['num_features']}).")
            return 'SVC', 'Chosen due to large number of features (SVM works well with many features). ' + ' '.join(reasoning)

        elif meta_features.get('num_instances', 0) < 100:
            reasoning.append(f"Small number of instances detected ({meta_features['num_instances']}).")
            return 'KNeighborsClassifier', 'Chosen due to small dataset size (KNN is effective with small datasets). ' + ' '.join(reasoning)

        else:
            reasoning.append("No significant class imbalance, high mutual information, or large number of features detected.")
            return 'LogisticRegression', 'Chosen as a baseline classifier for moderate dataset characteristics. ' + ' '.join(reasoning)

    else:  # For regression
        if meta_features.get('target_skew', 0) > 1:
            reasoning.append(f"High skewness in target distribution (skew={meta_features['target_skew']:.2f}).")
            return 'GradientBoostingRegressor', 'Chosen due to high target skew (GradientBoosting handles non-normal distributions well). ' + ' '.join(reasoning)

        elif meta_features.get('mean_correlation', 0) > 0.5:
            reasoning.append(f"High correlation between features (mean correlation={meta_features['mean_correlation']:.2f}).")
            return 'Ridge', 'Chosen due to high feature correlation (Ridge regression is effective for correlated features). ' + ' '.join(reasoning)

        elif meta_features.get('num_features', 0) > 50:
            reasoning.append(f"Large number of features detected ({meta_features['num_features']}).")
            return 'SVM', 'Chosen due to large number of features (SVM is effective with high-dimensional data). ' + ' '.join(reasoning)

        elif meta_features.get('num_instances', 0) < 100:
            reasoning.append(f"Small number of instances detected ({meta_features['num_instances']}).")
            return 'KNeighborsRegressor', 'Chosen due to small dataset size (KNN works well with small datasets). ' + ' '.join(reasoning)

        else:
            reasoning.append("No significant skewness, high correlation, or large number of features detected.")
            return 'LinearRegression', 'Chosen as a baseline regressor for moderate dataset characteristics. ' + ' '.join(reasoning)

model, reasoning = select_model_based_on_meta_features(meta_features, problem_type)
print(f"Selected Model: {model}")
print(f"Reason: {reasoning}")

Selected Model: SVM
Reason: Chosen due to large number of features (SVM is effective with high-dimensional data). Large number of features detected (247).


### Classification and Regression Metrices for Model Evaluation

In [8]:
def evaluate_classification_metrics(y_true, y_pred):
    metrics = {}
    metrics['accuracy'] = accuracy_score(y_true, y_pred)
    metrics['precision'] = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    metrics['recall'] = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    metrics['f1_score'] = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    return metrics

def evaluate_regression_metrics(y_true, y_pred):
    metrics = {}
    metrics['mean_absolute_error'] = mean_absolute_error(y_true, y_pred)
    metrics['mean_squared_error'] = mean_squared_error(y_true, y_pred)
    metrics['root_mean_squared_error'] = np.sqrt(mean_squared_error(y_true, y_pred))
    metrics['r2_score'] = r2_score(y_true, y_pred)
    return metrics

### Evaluate Selected Meta Features based Algorithm for a Dataset and Create a Model

In [9]:
def evaluate_best_model_based_on_meta_features(X_train, X_test, y_train, y_test, best_model_name, problem_type):
    model_map = {
        'AdaBoost': 'AdaBoostClassifier' if problem_type == 'classification' else 'AdaBoostRegressor',
        'RandomForest': 'RandomForestClassifier' if problem_type == 'classification' else 'RandomForestRegressor',
        'GradientBoosting': 'GradientBoostingClassifier' if problem_type == 'classification' else 'GradientBoostingRegressor',
        'LogisticRegression': 'LogisticRegression',
        'KNeighbors': 'KNeighborsClassifier' if problem_type == 'classification' else 'KNeighborsRegressor',
        'SVM': 'SVC' if problem_type == 'classification' else 'SVR',
        'MLP': 'MLPClassifier' if problem_type == 'classification' else 'MLPRegressor',
        'Bagging': 'BaggingClassifier' if problem_type == 'classification' else 'BaggingRegressor'
    }

    # Convert the general model name to the specific sklearn model
    best_model_name = model_map.get(best_model_name, best_model_name)

    if problem_type == 'classification':
        models = {
            'RandomForestClassifier': RandomForestClassifier(),
            'GradientBoostingClassifier': GradientBoostingClassifier(),
            'LogisticRegression': LogisticRegression(),
            'KNeighborsClassifier': KNeighborsClassifier(),
            'SVC': SVC(),
            'MLPClassifier': MLPClassifier(max_iter=2000),
            'AdaBoostClassifier': AdaBoostClassifier(),
            'BaggingClassifier': BaggingClassifier()
        }
    else:  
        models = {
            'RandomForestRegressor': RandomForestRegressor(),
            'GradientBoostingRegressor': GradientBoostingRegressor(),
            'LinearRegression': LinearRegression(),
            'Ridge': Ridge(),
            'KNeighborsRegressor': KNeighborsRegressor(),
            'MLPRegressor': MLPRegressor(max_iter=2000),
            'AdaBoostRegressor': AdaBoostRegressor(),  
            'SVR': SVR(), 
            'BaggingRegressor': BaggingRegressor()
        }

    best_model = models.get(best_model_name)
    print(f"Evaluating the selected model: {best_model_name}")

    if best_model is None:
        raise KeyError(f"The model '{best_model_name}' is not recognized. Please check the model name and try again.")

    best_model.fit(X_train, y_train)
    y_pred = best_model.predict(X_test)

    if problem_type == 'classification':
        metrics = evaluate_classification_metrics(y_test, y_pred)
    else:
        metrics = evaluate_regression_metrics(y_test, y_pred)
    print(f"Model Evaluation Metrics for {best_model_name}:")
    for metric_name, metric_value in metrics.items():
        print(f"{metric_name}: {metric_value:.16f}")    

    return metrics

preprocessor, X_train, X_test, y_train, y_test = preprocess_data(df, target_column)

X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

best_model, reasoning = select_model_based_on_meta_features(meta_features, problem_type)
print(f"Selected Model: {best_model}")
print(f"Reason: {reasoning}")

metrics = evaluate_best_model_based_on_meta_features(X_train, X_test, y_train, y_test, best_model, problem_type)

Selected Model: SVM
Reason: Chosen due to large number of features (SVM is effective with high-dimensional data). Large number of features detected (247).
Evaluating the selected model: SVR
Model Evaluation Metrics for SVR:
mean_absolute_error: 718.6344343282174805
mean_squared_error: 712422.9815153812523931
root_mean_squared_error: 844.0515277608241149
r2_score: 0.1325695959104841


### Using Traditional Process Dynamically Evaluate Model to Select Best Model  

In [10]:
def evaluate_all_algorithms(X_train, X_test, y_train, y_test, problem_type):
    if problem_type == 'classification':
        models = {
            'RandomForest': RandomForestClassifier(),
            'GradientBoostingClassifier': GradientBoostingClassifier(),
            'AdaBoost': AdaBoostClassifier(algorithm='SAMME'),
            'Bagging': BaggingClassifier(),
            'SVM': SVC(probability=False),  # Ensure we're not getting probabilities here
            'KNeighbors': KNeighborsClassifier(),
            'LogisticRegression': LogisticRegression(),
            'DecisionTree': DecisionTreeClassifier(),
            'MLPClassifier': MLPClassifier(max_iter=2000),
            'NaiveBayes': GaussianNB(),
            'MultinomialNB': MultinomialNB(),
        }
        metric_function = evaluate_classification_metrics
        le = LabelEncoder()
        y_train = le.fit_transform(y_train)
        y_test = le.transform(y_test)
        
        # Check for negative values, and handle MultinomialNB
        has_negative_values = np.any(X_train < 0) or np.any(X_test < 0)
        if has_negative_values:
            print("Negative values detected. Skipping MultinomialNB.")
            models.pop('MultinomialNB')

    else:
        models = {
            'RandomForest': RandomForestRegressor(),
            'GradientBoosting': GradientBoostingRegressor(),
            'AdaBoost': AdaBoostRegressor(),
            'Bagging': BaggingRegressor(),
            'SVM': SVR(),
            'KNeighbors': KNeighborsRegressor(),
            'LinearRegression': LinearRegression(),
            'Ridge': Ridge(),
            'Lasso': Lasso(max_iter=30000, alpha=0.1),
            'ElasticNet': ElasticNet(max_iter=30000, alpha=0.1),
            'DecisionTree': DecisionTreeRegressor(),
            'MLPRegressor': MLPRegressor(max_iter=2000),
        }
        metric_function = evaluate_regression_metrics

    all_algorithm_metrics = []
    best_model = None
    best_algorithm = ""
    best_score = np.inf if problem_type == 'regression' else -np.inf

    for algo_name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        
        if problem_type == 'classification':
            if hasattr(model, "predict_proba"):
                y_pred = np.argmax(model.predict_proba(X_test), axis=1)  

            score = accuracy_score(y_test, y_pred)
            metrics = {
                'accuracy': accuracy_score(y_test, y_pred),
                'precision': precision_score(y_test, y_pred, average='weighted', zero_division=0),
                'recall': recall_score(y_test, y_pred, average='weighted', zero_division=0),
                'f1_score': f1_score(y_test, y_pred, average='weighted', zero_division=0)
            }
        else:
            score = mean_squared_error(y_test, y_pred)
            metrics = {
                'mean_absolute_error': mean_absolute_error(y_test, y_pred),
                'mean_squared_error': mean_squared_error(y_test, y_pred),
                'root_mean_squared_error': np.sqrt(mean_squared_error(y_test, y_pred)),
                'r2_score': r2_score(y_test, y_pred)
            }
        
        metrics['Model'] = algo_name
        all_algorithm_metrics.append(metrics)

        if (problem_type == 'regression' and score < best_score) or \
           (problem_type == 'classification' and score > best_score):
            best_score = score
            best_model = model
            best_algorithm = algo_name

    df_results = pd.DataFrame(all_algorithm_metrics)
    print(f"Best Algorithm: {best_algorithm}\n")
    print(df_results)
    
    return df_results, best_algorithm, best_model

evaluate_all_algorithms(X_train, X_test, y_train, y_test, problem_type)

Best Algorithm: RandomForest

    mean_absolute_error  mean_squared_error  root_mean_squared_error  \
0          2.219254e+02        1.282751e+05             3.581551e+02   
1          4.115110e+02        2.866122e+05             5.353618e+02   
2          5.717762e+02        4.661605e+05             6.827595e+02   
3          2.326309e+02        1.399667e+05             3.741213e+02   
4          7.186344e+02        7.124230e+05             8.440515e+02   
5          2.678342e+02        1.743084e+05             4.175025e+02   
6          1.958402e+06        6.750793e+15             8.216321e+07   
7          4.720061e+02        3.754899e+05             6.127723e+02   
8          4.724899e+02        3.756524e+05             6.129049e+02   
9          4.949967e+02        4.018188e+05             6.338918e+02   
10         2.566133e+02        2.160053e+05             4.647637e+02   
11         2.617022e+02        1.698649e+05             4.121467e+02   

        r2_score             Mode

(    mean_absolute_error  mean_squared_error  root_mean_squared_error  \
 0          2.219254e+02        1.282751e+05             3.581551e+02   
 1          4.115110e+02        2.866122e+05             5.353618e+02   
 2          5.717762e+02        4.661605e+05             6.827595e+02   
 3          2.326309e+02        1.399667e+05             3.741213e+02   
 4          7.186344e+02        7.124230e+05             8.440515e+02   
 5          2.678342e+02        1.743084e+05             4.175025e+02   
 6          1.958402e+06        6.750793e+15             8.216321e+07   
 7          4.720061e+02        3.754899e+05             6.127723e+02   
 8          4.724899e+02        3.756524e+05             6.129049e+02   
 9          4.949967e+02        4.018188e+05             6.338918e+02   
 10         2.566133e+02        2.160053e+05             4.647637e+02   
 11         2.617022e+02        1.698649e+05             4.121467e+02   
 
         r2_score             Model  
 0   8.4381

### Compare Selected Meta Features based Model and Best Performing Model of Traditional Process 

In [11]:
# Function to analyze and compare results
def analyze_results(meta_model_metrics, best_algorithm_metrics, meta_model_name, best_algorithm_name, meta_selection_time, eval_selection_time, problem_type):
    # Adjust metrics based on problem type
    if problem_type == 'classification':
        metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'Selection Time']
        
        meta_metrics_values = [
            meta_model_metrics.get('accuracy', 'N/A'),
            meta_model_metrics.get('precision', 'N/A'),
            meta_model_metrics.get('recall', 'N/A'),
            meta_model_metrics.get('f1_score', 'N/A'),
            f"{meta_selection_time:.4f} seconds"
        ]
        
        best_algo_metrics_values = [
            best_algorithm_metrics.get('accuracy', 'N/A'),
            best_algorithm_metrics.get('precision', 'N/A'),
            best_algorithm_metrics.get('recall', 'N/A'),
            best_algorithm_metrics.get('f1_score', 'N/A'),
            f"{eval_selection_time:.4f} seconds"
        ]
    else:  # regression
        metrics = ['Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 
                   'Root Mean Squared Error (RMSE)', 'R² Score', 'Selection Time']
        
        meta_metrics_values = [
            meta_model_metrics.get('mean_absolute_error', 'N/A'),
            meta_model_metrics.get('mean_squared_error', 'N/A'),
            meta_model_metrics.get('root_mean_squared_error', 'N/A'),
            meta_model_metrics.get('r2_score', 'N/A'),
            f"{meta_selection_time:.4f} seconds"
        ]
        
        best_algo_metrics_values = [
            best_algorithm_metrics.get('mean_absolute_error', 'N/A'),
            best_algorithm_metrics.get('mean_squared_error', 'N/A'),
            best_algorithm_metrics.get('root_mean_squared_error', 'N/A'),
            best_algorithm_metrics.get('r2_score', 'N/A'),
            f"{eval_selection_time:.4f} seconds"
        ]
    
    # Combine results into a DataFrame
    results_df = pd.DataFrame({
        'Metric': metrics,
        f'{meta_model_name} (Meta-features)': meta_metrics_values,
        f'{best_algorithm_name} (Best Algorithm)': best_algo_metrics_values
    })
    
    return results_df

### Final Suggestion for Users 

In [12]:
def suggest_algorithm_combined(meta_metrics, best_algo_metrics, meta_time, algo_time, problem_type):
    if problem_type == 'classification':
        if meta_metrics['accuracy'] >= best_algo_metrics['accuracy']:
            return f"Meta-Feature Based Model is suggested due to better or equal accuracy ({meta_metrics['accuracy']:.4f}) and shorter selection time ({meta_time:.4f} seconds)."
        else:
            return f"Best Performing Algorithm is suggested due to better accuracy ({best_algo_metrics['accuracy']:.4f})."
    else:  # regression
        if meta_metrics['mean_squared_error'] <= best_algo_metrics['mean_squared_error']:
            return f"Meta-Feature Based Model is suggested due to lower MSE ({meta_metrics['mean_squared_error']:.4f}) and shorter selection time ({meta_time:.4f} seconds)."
        else:
            return f"Best Performing Algorithm is suggested due to lower MSE ({best_algo_metrics['mean_squared_error']:.4f})."

### Save all meta data, algorithm evaluation result, reasoning to choose algorithm for this dataset and others 

In [13]:
def save_meta_features_and_results(dataset_name, problem_type, meta_features, algorithm_results, best_algorithm, meta_selection_time, eval_selection_time, best_model_metrics, best_algorithm_metrics, all_algorithm_metrics): 
    file_path = 'metaFeature.csv'
    
    if isinstance(algorithm_results, str):  
        meta_features['model_selection_reasoning'] = algorithm_results
    else:
        meta_features.update(algorithm_results)

    meta_features['best_algorithm'] = best_algorithm
    meta_features['dataset_name'] = dataset_name
    meta_features['problem_type'] = problem_type
    
    meta_features['meta_model_selection_time'] = meta_selection_time
    meta_features['best_algorithm_selection_time'] = eval_selection_time

    meta_features.update({f"meta_model_{k}": v for k, v in best_model_metrics.items()})
    
    meta_features.update({f"best_algorithm_{k}": v for k, v in best_algorithm_metrics.items()})
    
    for algo_name, metrics in all_algorithm_metrics.items():
        for k, v in metrics.items():
            meta_features[f"{algo_name}_{k}"] = v
    
    if os.path.exists(file_path):
        if os.stat(file_path).st_size > 0:  
            try:
                meta_features_df = pd.read_csv(file_path)
            except pd.errors.EmptyDataError:
                meta_features_df = pd.DataFrame()  
        else:
            meta_features_df = pd.DataFrame()  
    else:
        meta_features_df = pd.DataFrame()  
    
    meta_features_df_temp = pd.DataFrame([meta_features])
    duplicate_record = False
    
    if not meta_features_df.empty:
        meta_features_df_aligned, meta_features_df_temp_aligned = meta_features_df.align(meta_features_df_temp, axis=1, fill_value=np.nan)
        if (meta_features_df_aligned == meta_features_df_temp_aligned.iloc[0]).all(axis=1).any():
            duplicate_record = True
            print("Warning: All values of the current record are the same as an existing record.")
    
    if not duplicate_record:
        meta_features_df = pd.concat([meta_features_df, meta_features_df_temp], ignore_index=True)
    
    meta_features_df.to_csv(file_path, index=False)
    
    print(f"Meta-features, algorithm results, problem type, best algorithm, all metrics, and execution times saved to {file_path}!")

### Complete workflow of our application

In [14]:
preprocessor, X_train, X_test, y_train, y_test = preprocess_data(df, target_column)

X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)


meta_features = extract_meta_features(df)
meta_model_metrics = evaluate_best_model_based_on_meta_features(X_train, X_test, y_train, y_test, best_model, problem_type)
df_results, best_algorithm_name, best_algorithm = evaluate_all_algorithms(X_train, X_test, y_train, y_test, problem_type)

all_algorithm_metrics = df_results.set_index('Model').to_dict(orient='index')

### Meta-Features Based Model Selection ###
# Record time for meta-feature-based model selection
start_time = time.time()
best_model, reasoning = select_model_based_on_meta_features(meta_features, problem_type)
meta_selection_time = time.time() - start_time

### Evaluation-Based Model Selection (Across All Algorithms) ###
# Record time for all algorithms evaluation
start_time = time.time()
eval_selection_time = time.time() - start_time

best_algorithm_metrics = df_results[df_results['Model'] == best_algorithm_name].iloc[0].to_dict()


# Now you can call the analyze_results function
analysis_results = analyze_results(
    meta_model_metrics,         
    best_algorithm_metrics,     
    best_model,                 
    best_algorithm_name,        
    meta_selection_time,        
    eval_selection_time,        
    problem_type                
)

print("\n--- Analysis Results ---")
print(analysis_results)

print("\n--- Comparison Between Meta-Feature Based and Best Performing Algorithm ---")
print(f"Meta-Feature Based Model: {best_model}")
print(f"Best Performing Algorithm: {best_algorithm_name}")

print("\nPerformance Comparison:")
print(f"Meta-Feature Model Metrics: {meta_model_metrics}")
print(f"Best Performing Algorithm Metrics: {best_algorithm_metrics}")
print(f"Selection Time (Meta-Feature Based): {meta_selection_time:.4f} seconds")
print(f"Selection Time (Best Performing): {eval_selection_time:.4f} seconds")

final_suggestion = suggest_algorithm_combined(meta_model_metrics, best_algorithm_metrics, meta_selection_time, eval_selection_time, problem_type)
print(f"\nFinal Suggestion: {final_suggestion}")


# Call the function to save the results
save_meta_features_and_results(
    dataset_name=dataset_name,
    problem_type=problem_type,
    meta_features=meta_features,
    algorithm_results=reasoning,
    best_algorithm=best_algorithm_name,
    meta_selection_time=meta_selection_time,
    eval_selection_time=eval_selection_time,
    best_model_metrics=meta_model_metrics,
    best_algorithm_metrics=best_algorithm_metrics,
    all_algorithm_metrics=all_algorithm_metrics
)


num_instances: 8802
num_features: 247
instance_feature_ratio: 35.635627530364374
missing_value_percentage: 0.0
num_categorical: 0
num_numerical: 247
mean_correlation: 0.11041338280018773
num_classes: 2
class_imbalance: 0.9995455578277664
target_entropy: 0.003951925758117492
mean_mutual_information: 0.0004050893893398374
Epsilon_skew: 3.6037817719942353
Epsilon_kurtosis: 18.924715478151636
Epsilon_outliers: 0.020790729379686436
C Atom Count_skew: 2.5423248935361147
C Atom Count_kurtosis: 10.606236528274842
C Atom Count_outliers: 0.015791865485117018
Total Atom Count_skew: 2.9449663279838907
Total Atom Count_kurtosis: 15.043263672339524
Total Atom Count_outliers: 0.016927970915700977
H Atom Count_skew: 3.849354896233442
H Atom Count_kurtosis: 26.232981983627283
H Atom Count_outliers: 0.017155192001817768
Longest Carbon Chain_skew: 2.581229871862874
Longest Carbon Chain_kurtosis: 10.895985870059294
Longest Carbon Chain_outliers: 0.02578959327425585
Aromatic Atom Count_skew: 2.156455805142

### Framework selection

In [15]:
import pandas as pd
import numpy as np

def identify_framework(dataset_name, dataset_type, dataset_size, task_type):
    # Define thresholds and conditions based on the table provided
    small_size_threshold = 1000
    moderate_size_threshold = 50000
    
    # Conditions for framework selection
    if dataset_size <= small_size_threshold:
        # Small datasets typically handled by Scikit-Learn
        return "Scikit-Learn"
    elif dataset_type in ["Numerical", "Categorical", "Mixed"] and dataset_size <= moderate_size_threshold:
        # Moderate size and traditional ML tasks (numerical or categorical data)
        return "Scikit-Learn"
    elif dataset_type == "Image" or dataset_type == "Text" or dataset_type == "Image + Text":
        # Deep Learning Frameworks for complex data types (Image, Text)
        if task_type == "Classification" or task_type == "Sentiment Analysis":
            return "TensorFlow" if dataset_name != "MNIST" else "PyTorch"
        elif task_type == "Image Captioning" or task_type == "Multimodal":
            return "PyTorch"
    elif dataset_size > moderate_size_threshold:
        # Larger datasets benefit from deep learning frameworks for scalability
        return "TensorFlow"
    
    # Default to Scikit-Learn if conditions do not match any above
    return "Scikit-Learn"

# Example usage
datasets = [
    {"name": "Iris", "type": "Numerical", "size": 150, "task": "Classification"},
    {"name": "Boston Housing", "type": "Numerical", "size": 506, "task": "Regression"},
    {"name": "CIFAR-10", "type": "Image", "size": 60000, "task": "Classification"},
    {"name": "MNIST", "type": "Image", "size": 70000, "task": "Classification"},
    {"name": "UCI Adult Income", "type": "Categorical", "size": 48842, "task": "Classification"},
    {"name": "IMDb Reviews", "type": "Text", "size": 50000, "task": "Sentiment Analysis"},
    {"name": "COCO Dataset", "type": "Image + Text", "size": 200000, "task": "Image Captioning"},
    {"name": "Airbnb Listings", "type": "Mixed", "size": 500000, "task": "Price Prediction"},
    {"name": "Dye Design", "type": "Numerical", "size": 8802, "task": "Regression/Prediction"},
    {"name": "Cars Dataset", "type": "Mixed", "size": 963, "task": "Regression/Classification"},
    {"name": "Fashion MNIST", "type": "Image", "size": 70000, "task": "Classification"},
    {"name": "YouTube Comments", "type": "Text", "size": 100000, "task": "Sentiment Analysis"},
    {"name": "Weather Data", "type": "Numerical", "size": 10000, "task": "Regression"},
    {"name": "NYC Taxi Trips", "type": "Mixed", "size": 1000000, "task": "Price Prediction"},
    {"name": "Caltech-256", "type": "Image", "size": 30607, "task": "Classification"}
]

for dataset in datasets:
    framework = identify_framework(dataset["name"], dataset["type"], dataset["size"], dataset["task"])
    print(f"Dataset: {dataset['name']}, Suggested Framework: {framework}")

Dataset: Iris, Suggested Framework: Scikit-Learn
Dataset: Boston Housing, Suggested Framework: Scikit-Learn
Dataset: CIFAR-10, Suggested Framework: TensorFlow
Dataset: MNIST, Suggested Framework: PyTorch
Dataset: UCI Adult Income, Suggested Framework: Scikit-Learn
Dataset: IMDb Reviews, Suggested Framework: TensorFlow
Dataset: COCO Dataset, Suggested Framework: PyTorch
Dataset: Airbnb Listings, Suggested Framework: TensorFlow
Dataset: Dye Design, Suggested Framework: Scikit-Learn
Dataset: Cars Dataset, Suggested Framework: Scikit-Learn
Dataset: Fashion MNIST, Suggested Framework: TensorFlow
Dataset: YouTube Comments, Suggested Framework: TensorFlow
Dataset: Weather Data, Suggested Framework: Scikit-Learn
Dataset: NYC Taxi Trips, Suggested Framework: TensorFlow
Dataset: Caltech-256, Suggested Framework: TensorFlow
