## Distance Metric Recommendation for $k$-Means Clustering: A Meta-Learning Approach

**Mark Edward M. Gonzales<sup>1</sup>, Lorene C. Uy<sup>1</sup>, Jacob Adrianne L. Sy<sup>1</sup>, Macario O. Cordel, II<sup>2</sup>**

<sup>1</sup> Department of Software Technology, College of Computer Studies, De La Salle University <br>
<sup>2</sup> Department of Computer Technology, College of Computer Studies, De La Salle University 

{mark_gonzales, lorene_c_uy, jacob_adrianne_l_sy, macario.cordel}@dlsu.edu.ph

<hr>

# PART I: Preliminaries

The following libraries and modules — most of which are automatically bundled with an Anaconda installation — were used in this notebook:

Library/Module | Description | License
:-- | :-- | :--
<a href = "https://docs.python.org/3/library/os.html">`os`</a> | Provides miscellaneous operating system interfaces | Python Software Foundation License
<a href = "https://pandas.pydata.org/">`pandas`</a> | Provides functions for data analysis and manipulation | BSD 3-Clause "New" or "Revised" License
<a href = "https://numpy.org/">`numpy`</a> | Provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays | BSD 3-Clause "New" or "Revised" License
<a href = "https://scikit-learn.org/stable/">`scikit-learn`</a> | Python module for machine learning and predictive data analysis | BSD 3-Clause "New" or "Revised" License
<a href = "https://pymfe.readthedocs.io/en/latest/index.html">`pymfe`</a> | Provides functions for extracting meta-features described in the literature | MIT License

*The descriptions were lifted from their respective websites.*
<br><br>

<div class="alert alert-block alert-info">
<b>Note:</b>  The <a href = "https://pymfe.readthedocs.io/en/latest/index.html"><code>pymfe</code></a> library is not included in Anaconda by default. The fastest way to install this library is to use pip and run the following command on the terminal: <br>

**`pip install -U pymfe`**
</div>

In [None]:
from os import listdir

import numpy as np
import pandas as pd
import re

from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from pymfe.mfe import MFE

<hr>

# PART II: Meta-feature Extraction

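Meta-features belonging to the **`general`**, **`statistical`**, **`information-theoretic`**, **`complexity`**, and **`structural`** categories were extracted. In `pymfe`, the information-theoretic and structural categories correspond to the `info-theory` and `itemset` groups, respectively.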

**General** <br>
> These meta-features describe the dimensionality and size of the dataset

**Statistical** <br>
> These meta-features capture characteristics related to feature interdependence, normality, degree of discreteness, and noisiness

**Information-Theoretic** <br>
> These meta-features quantify feature informativeness and interdependence

**Complexity** <br>
> These meta-features relate the dimensionality of the dataset to its principal component analysis (PCA) dimensions

**Structural** <br>
> These meta-features capture patterns, statistics, and correlation information from the frequencies of *k*-itemsets
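
To make the **General** category concrete, the following is a minimal sketch that computes a few such measures by hand on a toy matrix. The dictionary keys mirror the `pymfe` feature names used later in this notebook, but the data are hypothetical and the definitions (for example, treating an attribute as binary when it takes exactly two distinct values) are assumptions made for illustration, not a reproduction of the `pymfe` implementation.

```python
import numpy as np

# Toy feature matrix (hypothetical data): 4 instances, 3 attributes.
X = np.array([
    [0.0, 1.0, 0.2],
    [1.0, 0.0, 0.4],
    [0.0, 1.0, 0.6],
    [1.0, 1.0, 0.8],
])

nr_inst, nr_attr = X.shape

# Assumption: an attribute is binary if it takes exactly two distinct values.
nr_bin = sum(len(np.unique(X[:, j])) == 2 for j in range(nr_attr))

general = {
    "nr_inst": nr_inst,                 # number of instances (rows)
    "nr_attr": nr_attr,                 # number of attributes (columns)
    "nr_bin": nr_bin,                   # number of binary attributes
    "attr_to_inst": nr_attr / nr_inst,  # attributes per instance
    "inst_to_attr": nr_inst / nr_attr,  # instances per attribute
}
print(general)
```

These measures depend only on the shape and the distinct values of the feature matrix, so the class labels are not needed to compute them.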

In [None]:
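# Pattern found in the filenames of datasets that lack a header row
NO_HEADER = r'noheader'

# Folder containing the dataset files
folder = 'final_datasets'
datasets = listdir(folder)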

### General, Statistical, Information-Theoretic & Complexity Meta-Features

In [None]:
columns = None
row_data = []

for dataset in datasets:
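    # Some of the datasets are not encoded in the default UTF-8, so every file is read as Latin-1.
    # Files matching NO_HEADER are read without a header row.
    if re.search(NO_HEADER, dataset):
        data_raw = pd.read_csv(f"{folder}/{dataset}", encoding='latin-1', header=None)
    else:
        data_raw = pd.read_csv(f"{folder}/{dataset}", encoding='latin-1')

    # The last column holds the labels; the remaining columns are the features.
    data = data_raw.to_numpy()
    X, y = np.split(data, [-1], axis=1)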

    # Data imputation
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X = imp.fit_transform(X)
    # Min-max normalization to [0, 1]
    scaler = MinMaxScaler()
    X = scaler.fit_transform(X)

    # Extract the selected general, statistical, information-theoretic, and complexity meta-features.
    mfe = MFE(
        groups=["general", "statistical", "info-theory", "complexity"],
        features=[
            "attr_to_inst", "inst_to_attr", "nr_attr", "nr_bin", "nr_inst",
            "attr_conc", "attr_ent", "t2", "t3", "t4", "can_cor", "cor", "cov",
            "eigenvalues", "iq_range", "kurtosis", "mad", "mean", "median",
            "nr_cor_attr", "nr_outliers", "sd", "skewness", "sparsity",
            "t_mean", "var",
        ],
    )
    mfe.fit(X, y)
    ft = mfe.extract()
 
    # ft[0] holds the meta-feature names; store them once to use as column headers.
    if columns is None:
        columns = ft[0]
        
    # Strip the file extension and prepend the dataset name to the row of meta-feature values.
    filename, ext = dataset.rsplit('.', 1)
    ft[1].insert(0, filename)
    row_data.append(ft[1])
    
    print(dataset)
    
# Add the header for the dataset-name column.
columns.insert(0, "dataset")
metafeatures_df = pd.DataFrame(data=row_data, columns=columns)
filename = "./metafeatures_gsic.csv"

# Save data to CSV.
metafeatures_df.to_csv(filename, index=False)

### Structural Meta-Features

In [None]:
columns = None
row_data = []

for dataset in datasets:        
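    # Some of the datasets are not encoded in the default UTF-8, so every file is read as Latin-1.
    # Files matching NO_HEADER are read without a header row.
    if re.search(NO_HEADER, dataset):
        data_raw = pd.read_csv(f"{folder}/{dataset}", encoding='latin-1', header=None)
    else:
        data_raw = pd.read_csv(f"{folder}/{dataset}", encoding='latin-1')

    # The last column holds the labels; the remaining columns are the features.
    data = data_raw.to_numpy()
    X, y = np.split(data, [-1], axis=1)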

    # Data imputation
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X = imp.fit_transform(X)
    # Min-max normalization to [0, 1]
    scaler = MinMaxScaler()
    X = scaler.fit_transform(X)
    
    # Extract the structural (itemset) meta-features, summarized by their quantiles.
    mfe = MFE(
        groups="all",
        features=["one_itemset", "two_itemset"],
        summary=["quantiles"],
    )
    mfe.fit(X, y)
    ft = mfe.extract()

    # ft[0] holds the meta-feature names; store them once to use as column headers.
    if columns is None:
        columns = ft[0]
        
    # Strip the file extension and prepend the dataset name to the row of meta-feature values.
    filename, ext = dataset.rsplit('.', 1)
    ft[1].insert(0, filename)
    row_data.append(ft[1])
    
    print(dataset)

# Add the header for the dataset-name column.
columns.insert(0, "dataset")
metafeatures_df = pd.DataFrame(data=row_data, columns=columns)
filename = "./metafeatures_itemset.csv"

# Save data to CSV.
metafeatures_df.to_csv(filename, index=False)

<hr>

# PART III: Consolidation of Meta-Features


The consolidated CSV output contains the meta-features of each dataset together with the labels obtained in the dataset labeling stage, namely the best-performing distance metrics recorded in the `best_dist_metric_ari` and `best_dist_metric_dbs` columns.
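
The cells in this part combine the tables column-wise with `pd.concat(..., axis=1)`, which aligns rows by index position and therefore relies on all three tables listing the datasets in the same order. As a hedged alternative, the sketch below joins on the `dataset` column instead, so that differing row orders cannot misalign meta-features and labels. The table contents, dataset names, and distance-metric names are hypothetical stand-ins, not values from this study.

```python
import pandas as pd

# Hypothetical stand-ins for the three tables; all values are made up.
gsic = pd.DataFrame({"dataset": ["iris", "wine"], "nr_attr": [4, 13]})
structural = pd.DataFrame({
    "dataset": ["wine", "iris"],  # deliberately in a different order
    "one_itemset.quantiles.0": [0.1, 0.2],
})
metrics = pd.DataFrame({
    "dataset": ["iris", "wine"],
    "best_dist_metric_ari": ["euclidean", "manhattan"],
    "best_dist_metric_dbs": ["euclidean", "chebyshev"],
})

# Join on the dataset name; validate= raises an error if a name is duplicated.
merged = gsic.merge(structural, on="dataset", how="inner", validate="one_to_one")
merged = merged.merge(metrics, on="dataset", how="inner", validate="one_to_one")
print(merged)
```

A key-based join also leaves a single `dataset` column in the result, whereas a positional concatenation keeps one copy per input table.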

In [None]:
# data containing information on general, statistical, information-theoretic, and complexity meta-features
gsic = pd.read_csv('metafeatures_gsic.csv')
gsic

In [None]:
# data containing information on structural meta-features
structural = pd.read_csv('metafeatures_itemset.csv')
structural

In [None]:
# consolidated meta-features
mf_no_label = pd.concat([gsic, structural], axis=1, join='inner')
mf_no_label

In [None]:
# evaluation metrics derived from data labeling stage
metrics =  pd.read_csv('dataset_labels/csv_collated/results_top_metrics.csv')
metrics = metrics[['dataset', 'best_dist_metric_ari', 'best_dist_metric_dbs']]
metrics

In [None]:
mf = pd.concat([mf_no_label, metrics], axis=1, join='inner')
mf

In [None]:
# The last two columns are the labels (best_dist_metric_ari and best_dist_metric_dbs).
mf_data = mf.to_numpy()
X, y = np.split(mf_data, [-2], axis=1)

# Data imputation
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imp.fit_transform(X)

In [None]:
header = list(mf.columns.values)

mf_final = pd.DataFrame(np.concatenate((X, y), axis=1))
mf_final.columns = header
mf_final

In [None]:
mf_final.to_csv('dataset_labels/metafeatures.csv', index=False)

In [None]:
mf_final = pd.read_csv('dataset_labels/metafeatures.csv')
mf_final

A copy of `metafeatures.csv` with the abbreviated headers expanded is also created, under the filename `metafeatures_readable_header.csv`.
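
One way such a copy could be produced is sketched below with `DataFrame.rename`. The `READABLE` mapping and the expanded names in it are hypothetical and shown only for illustration; they are not the mapping actually used for `metafeatures_readable_header.csv`.

```python
import pandas as pd

# Hypothetical mapping from abbreviated to expanded headers (illustration only).
READABLE = {
    "nr_inst": "number_of_instances",
    "nr_attr": "number_of_attributes",
}

# Toy table standing in for the contents of metafeatures.csv.
df = pd.DataFrame({"dataset": ["toy"], "nr_inst": [150], "nr_attr": [4]})

# Columns that are absent from the mapping keep their original names.
readable_df = df.rename(columns=READABLE)
print(list(readable_df.columns))
```

Calling `readable_df.to_csv(..., index=False)` would then write the renamed table in the same way as the other CSV files in this notebook.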