# Meta-feature Extraction

**Mark Edward M. Gonzales, Lorene C. Uy, and Jacob Adrianne L. Sy (CSC713M)**<br>
mark_gonzales@dlsu.edu.ph, lorene_c_uy@dlsu.edu.ph, jacob_adrianne_l_sy@dlsu.edu.ph

In partial fulfillment of the requirements for the Machine Learning graduate class (CSC713M) under **Dr. Macario O. Cordel, II** of the Department of Computer Technology, College of Computer Studies, De La Salle University, this notebook details the process and presents the code for the **meta-feature extraction** stage of the investigatory project titled "Automatic Recommendation of Distance Metric for $k$-Means Clustering: A Meta-Learning Approach."

## PART I: Preliminaries

The following libraries and modules — most of which are automatically bundled with an Anaconda installation — were used in this notebook:

Library/Module | Description | License
:-- | :-- | :--
<a href = "https://docs.python.org/3/library/os.html">`os`</a> | Provides miscellaneous operating system interfaces | Python Software Foundation License
<a href = "https://pandas.pydata.org/">`pandas`</a> | Provides functions for data analysis and manipulation | BSD 3-Clause "New" or "Revised" License
<a href = "https://numpy.org/">`numpy`</a> | Provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays | BSD 3-Clause "New" or "Revised" License
<a href = "https://scikit-learn.org/stable/">`scikit-learn`</a> | Python module for machine learning and predictive data analysis | BSD 3-Clause "New" or "Revised" License
<a href = "https://pymfe.readthedocs.io/en/latest/index.html">`pymfe`</a> | Provides functions for extracting meta-features described in the literature | MIT License

*The descriptions were lifted from their respective websites.*
<br><br>

<div class="alert alert-block alert-info">
<b>Note:</b>  The pymfe library is not included in Anaconda by default. The simplest way to install the library is to use pip by running the following command on the command prompt: <br>

**`pip install -U pymfe`**
</div>

In [2]:
import re
from os import listdir

import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from pymfe.mfe import MFE

## PART II: Meta-feature Extraction

Meta-features belonging to the following categories were extracted: **`general`**, **`statistical`**, **`information-theoretic`**, **`complexity`**, and **`structural`**.

**general** <br>
> These meta-features describe the dimensionality and size of the dataset.

**statistical** <br>
> These meta-features capture characteristics related to feature interdependence, normality, degree of discreteness, and noisiness.

**information-theoretic** <br>
> These meta-features quantify feature informativeness and interdependence.

**complexity** <br>
> These meta-features describe the dataset in terms of its principal component analysis (PCA) dimensions.

**structural** <br>
> These meta-features capture patterns, statistics, and correlation information from the frequencies of *k*-itemsets.
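
To make the **general** category concrete, the sketch below computes a few such measures by hand on a made-up array. The definitions used here (ratios between the number of attributes and instances, and a count of two-valued attributes) are simplified assumptions for illustration; the notebook itself obtains these values from `pymfe`.

```python
import numpy as np

# Toy dataset: 4 instances (rows) and 3 attributes (columns); values are made up.
X = np.array([
    [0.0, 1.0, 5.2],
    [1.0, 0.0, 3.1],
    [0.0, 1.0, 4.8],
    [1.0, 1.0, 2.9],
])

nr_inst, nr_attr = X.shape

# Assumption: an attribute counts as binary if it takes exactly two distinct values.
nr_bin = sum(len(np.unique(X[:, j])) == 2 for j in range(nr_attr))

general = {
    "nr_inst": nr_inst,
    "nr_attr": nr_attr,
    "nr_bin": nr_bin,
    "attr_to_inst": nr_attr / nr_inst,  # attributes per instance
    "inst_to_attr": nr_inst / nr_attr,  # instances per attribute
}
print(general)
```

The dictionary keys reuse the meta-feature names requested from `pymfe` later in this notebook, so the toy values can be compared against the corresponding columns of the extracted table.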

In [3]:
NO_HEADER = r'noheader'

folder = 'final_datasets'
datasets = listdir(folder)
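
The extraction loops below use the `NO_HEADER` pattern to decide whether a file has a header row, and they use the file name without its extension as the dataset identifier. A minimal sketch of that logic follows; the helper `describe` and the two file names are hypothetical and serve only as an illustration.

```python
import re

NO_HEADER = r'noheader'


def describe(dataset):
    """Return (name without extension, whether the file has a header row)."""
    # Split on the last dot only, so dots inside the name are preserved.
    stem, _ext = dataset.rsplit('.', 1)
    # A file is treated as headerless if "noheader" appears anywhere in its name.
    has_header = re.search(NO_HEADER, dataset) is None
    return stem, has_header


# Hypothetical file names, not taken from the actual dataset folder.
for name in ["iris.csv", "glass_noheader.csv"]:
    print(describe(name))
```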

### General, Statistical, Information-Theoretic & Complexity Meta-Features

In [None]:
columns = None
row_data = []

for dataset in datasets:
    # Some of the datasets are not encoded in UTF-8, so latin-1 is used for all files.
    # Files whose names contain "noheader" have no header row.
    if re.search(NO_HEADER, dataset):
        data_raw = pd.read_csv(f"{folder}/{dataset}", encoding='latin-1', header=None)
    else:
        data_raw = pd.read_csv(f"{folder}/{dataset}", encoding='latin-1')
    
    data = data_raw.to_numpy()    
    X, y = np.split(data, [-1], axis=1)

    # Data imputation
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X = imp.fit_transform(X)
    # Min-max normalization to [0, 1]
    scaler = MinMaxScaler()
    X = scaler.fit_transform(X)

    # Extract general, statistical, information-theoretic, and complexity measures.
    mfe = MFE(
        groups=["general", "statistical", "info-theory", "complexity"],
        features=[
            "attr_to_inst", "inst_to_attr", "nr_attr", "nr_bin",
            "nr_inst", "attr_conc", "attr_ent", "t2", "t3", "t4", "can_cor",
            "cor", "cov", "eigenvalues", "iq_range", "kurtosis", "mad",
            "mean", "median", "nr_cor_attr", "nr_outliers", "sd",
            "skewness", "sparsity", "t_mean", "var",
        ],
    )
    mfe.fit(X, y)
    ft = mfe.extract()
 
    # ft[0] holds the meta-feature names; store them once.
    if columns is None:
        columns = ft[0]

    # Use the file name without its extension as the dataset identifier.
    filename, ext = dataset.rsplit('.', 1)
    ft[1].insert(0, filename)
    row_data.append(ft[1])
    
    print(dataset)
    
# Prepend the "dataset" column name to the header.
columns.insert(0, "dataset")
metafeatures_df = pd.DataFrame(data=row_data, columns=columns)
filename = "./metafeatures_gsic.csv"

# Save data to CSV.
metafeatures_df.to_csv(filename, index=False)
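
The preprocessing applied inside the loop above (splitting off the last column as the label, mean imputation, and min-max normalization) can be illustrated on a toy array. This is a NumPy-only sketch of what `SimpleImputer(strategy='mean')` and `MinMaxScaler()` compute in this simple case; the data are made up, and the sketch assumes no column is constant.

```python
import numpy as np

# Toy table: the last column is the class label, and one feature value is missing.
data = np.array([
    [1.0, 10.0, 0.0],
    [2.0, np.nan, 1.0],
    [3.0, 30.0, 0.0],
])

# Split off the last column as y; np.split keeps y two-dimensional.
X, y = np.split(data, [-1], axis=1)

# Mean imputation: replace each NaN with the mean of the observed values in its column.
col_means = np.nanmean(X, axis=0)
X = np.where(np.isnan(X), col_means, X)

# Min-max normalization of each column to [0, 1].
# Assumption: no column is constant, so the denominator is never zero.
col_min, col_max = X.min(axis=0), X.max(axis=0)
X = (X - col_min) / (col_max - col_min)

print(X)
print(y.ravel())
```

Imputing before scaling, as the notebook does, means the filled-in value (20 here) is itself mapped into [0, 1] along with the observed ones.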

### Structural Meta-Features

In [None]:
columns = None
row_data = []

for dataset in datasets:        
    # Some of the datasets are not encoded in UTF-8, so latin-1 is used for all files.
    # Files whose names contain "noheader" have no header row.
    if re.search(NO_HEADER, dataset):
        data_raw = pd.read_csv(f"{folder}/{dataset}", encoding='latin-1', header=None)
    else:
        data_raw = pd.read_csv(f"{folder}/{dataset}", encoding='latin-1')
        
    data = data_raw.to_numpy()    
    X, y = np.split(data, [-1], axis=1)

    # Data imputation
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X = imp.fit_transform(X)
    # Min-max normalization to [0, 1]
    scaler = MinMaxScaler()
    X = scaler.fit_transform(X)
    
    # Extract itemset-based (structural) meta-features, summarized by quantiles.
    mfe = MFE(
        groups="all",
        features=["one_itemset", "two_itemset"],
        summary=["quantiles"],
    )
    mfe.fit(X, y)
    ft = mfe.extract()

    # ft[0] holds the meta-feature names; store them once.
    if columns is None:
        columns = ft[0]

    # Use the file name without its extension as the dataset identifier.
    filename, ext = dataset.rsplit('.', 1)
    ft[1].insert(0, filename)
    row_data.append(ft[1])
    
    print(dataset)

# Prepend the "dataset" column name to the header.
columns.insert(0, "dataset")
metafeatures_df = pd.DataFrame(data=row_data, columns=columns)
filename = "./metafeatures_itemset.csv"

# Save data to CSV.
metafeatures_df.to_csv(filename, index=False)

## PART III: Consolidation of Meta-Features

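The consolidated CSV output contains the meta-features of each dataset together with its labels from the dataset labeling stage, namely the best-performing distance metric according to each evaluation metric (`best_dist_metric_ari` and `best_dist_metric_dbs`).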

In [None]:
# data containing information on general, statistical, information-theoretic, and complexity meta-features
gsic = pd.read_csv('metafeatures_gsic.csv')
gsic

In [None]:
# data containing information on structural meta-features
structural = pd.read_csv('metafeatures_itemset.csv')
structural

In [None]:
# consolidated meta-features
mf_no_label = pd.concat([gsic, structural], axis=1, join='inner')
mf_no_label

In [9]:
# evaluation metrics derived from data labeling stage
metrics =  pd.read_csv('dataset_labels/csv_collated/results_top_metrics.csv')
metrics = metrics[['dataset', 'best_dist_metric_ari', 'best_dist_metric_dbs']]
metrics

Unnamed: 0,dataset,best_dist_metric_ari,best_dist_metric_dbs
0,a1_raw,manhattan,euclidean
1,a1_va3,euclidean,euclidean
2,a2_raw,euclidean,chebyshev
3,a2_va3,euclidean,manhattan
4,a3_raw,euclidean,euclidean
...,...,...,...
335,winequality-white,euclidean,euclidean
336,wine,euclidean,euclidean
337,wisconsin,mahalanobis,manhattan
338,Zemberek-Stemmed,euclidean,manhattan


In [None]:
mf = pd.concat([mf_no_label, metrics], axis=1, join='inner')
mf
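
`pd.concat(..., axis=1)` aligns rows on their index, so the step above relies on every frame listing the datasets in the same order. As a hedged alternative, the sketch below joins two made-up frames on the shared `dataset` key instead; only the column names come from this notebook, and the values are illustrative.

```python
import pandas as pd

# Made-up frames; only the column names are taken from the notebook.
features = pd.DataFrame({
    "dataset": ["wine", "a1_raw", "iris"],
    "nr_attr": [13, 2, 4],
})
labels = pd.DataFrame({
    "dataset": ["a1_raw", "wine"],
    "best_dist_metric_ari": ["manhattan", "euclidean"],
})

# Inner join on the key: rows are matched by name regardless of their order,
# and datasets missing from either frame are dropped.
merged = features.merge(labels, on="dataset", how="inner")
print(merged)
```

A key-based join also leaves a single `dataset` column in the result, whereas a column-wise concatenation keeps one copy per input frame.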

In [None]:
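# Exclude the "dataset" identifier columns (one per concatenated frame) so that
# only the numeric meta-features and the two label columns remain.
mf = mf.loc[:, mf.columns != 'dataset']
mf_data = mf.to_numpy()
X, y = np.split(mf_data, [-2], axis=1)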

# Data imputation
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imp.fit_transform(X)

In [None]:
header = list(mf.columns.values)

mf_final = pd.DataFrame(np.concatenate((X, y), axis=1))
mf_final.columns = header
mf_final

In [None]:
mf_final.to_csv('dataset_labels/metafeatures.csv', index=False)
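
The imputation step above converts the table to a NumPy array and back. The same idea can be sketched directly in pandas, which keeps the column names and numeric dtypes. The frame below is made up, and `mf_toy`, `label_cols`, and `feature_cols` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up values; the column names mimic those in the consolidated table.
mf_toy = pd.DataFrame({
    "attr_to_inst": [0.01, np.nan, 0.03],
    "cor.mean": [0.2, 0.4, np.nan],
    "best_dist_metric_ari": ["manhattan", "euclidean", "euclidean"],
    "best_dist_metric_dbs": ["euclidean", "euclidean", "chebyshev"],
})

label_cols = ["best_dist_metric_ari", "best_dist_metric_dbs"]
feature_cols = [c for c in mf_toy.columns if c not in label_cols]

# Fill each missing meta-feature with its column mean; the labels are left untouched.
mf_toy[feature_cols] = mf_toy[feature_cols].fillna(mf_toy[feature_cols].mean())
print(mf_toy)
```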

In [10]:
mm = pd.read_csv('dataset_labels/metafeatures.csv')
mm['best_dist_metric_dbs'] = metrics['best_dist_metric_dbs']
mm

Unnamed: 0,attr_conc.mean,attr_conc.sd,attr_ent.mean,attr_ent.sd,attr_to_inst,can_cor.mean,can_cor.sd,cor.mean,cor.sd,cov.mean,...,one_itemset.quantiles.2,one_itemset.quantiles.3,one_itemset.quantiles.4,two_itemset.quantiles.0,two_itemset.quantiles.1,two_itemset.quantiles.2,two_itemset.quantiles.3,two_itemset.quantiles.4,best_dist_metric_ari,best_dist_metric_dbs
0,0.083137,0.074754,3.584954,2.680000e-06,0.010303,0.433018,0.253195,0.242233,0.228001,0.006188,...,0.083572,0.083572,0.084144,0.010303,0.147682,0.155695,0.161420,0.167716,manhattan,euclidean
1,0.037992,0.036260,3.584956,1.400000e-06,0.018359,0.217562,0.399361,0.125801,0.172071,0.000761,...,0.083190,0.083764,0.083764,0.027539,0.148021,0.153758,0.158921,0.167527,euclidean,euclidean
2,0.086459,0.064692,3.321917,4.570000e-16,0.014241,0.477469,0.210690,0.220614,0.217854,0.005883,...,0.099684,0.100475,0.100475,0.011076,0.172468,0.183544,0.193038,0.200949,euclidean,chebyshev
3,0.046616,0.036637,3.321928,0.000000e+00,0.025397,-0.014967,0.440451,0.125810,0.176909,0.000887,...,0.100000,0.100000,0.100000,0.034921,0.173016,0.182540,0.190476,0.200000,euclidean,manhattan
4,0.101427,0.026079,3.584958,2.430000e-06,0.009815,0.316617,0.442044,0.272287,0.228004,0.008253,...,0.083424,0.083424,0.083969,0.002181,0.147219,0.157579,0.163577,0.167394,euclidean,euclidean
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
335,0.023432,0.021466,3.444974,1.128371e-02,0.006879,0.369102,0.550237,0.199823,0.188264,0.003710,...,0.090056,0.097561,0.151970,0.081301,0.153221,0.164478,0.175735,0.249531,euclidean,euclidean
336,0.014634,0.017273,3.973324,2.244442e-02,0.002246,0.133160,0.283442,0.178266,0.189298,0.001608,...,0.062372,0.067630,0.109024,0.041854,0.107595,0.116987,0.125970,0.192119,euclidean,euclidean
337,0.080498,0.076914,2.320541,7.082206e-03,0.159794,0.492791,0.322741,0.314237,0.259968,0.009200,...,0.201031,0.201031,0.257732,0.010309,0.298969,0.319588,0.345361,0.412371,mahalanobis,manhattan
338,0.000976,0.006221,0.066647,1.487167e-01,1.581550,0.217634,0.305870,0.007252,0.029346,0.000010,...,0.067797,0.998055,0.999722,0.000000,0.006946,0.393998,0.993054,1.000000,euclidean,manhattan


In [11]:
mm.to_csv('dataset_labels/metafeatures1.csv', index=False)