# Data Collection and Preprocessing

**Mark Edward M. Gonzales, Lorene C. Uy, and Jacob Adrianne L. Sy (CSC713M)**<br>
mark_gonzales@dlsu.edu.ph, lorene_c_uy@dlsu.edu.ph, jacob_adrianne_l_sy@dlsu.edu.ph

In partial fulfillment of the requirements for the Machine Learning graduate class (CSC713M) under **Dr. Macario O. Cordel, II** of the Department of Computer Technology, College of Computer Studies, De La Salle University, this notebook details the process and presents the code for the **data collection and preprocessing** stage of the investigatory project titled "Automatic Recommendation of Distance Metric for $k$-Means Clustering: A Meta-Learning Approach."

<hr>

## PART I: Preliminaries

The following libraries and modules — all of which are automatically bundled with an Anaconda installation — were used in this notebook:

Library/Module | Description | License
:-- | :-- | :--
<a href = "https://docs.python.org/3/library/os.html">`os`</a> | Provides miscellaneous operating system interfaces | Python Software Foundation License
<a href = "https://docs.python.org/3/library/shutil.html">`shutil`</a> | Provides high-level operations on files and collections of files | Python Software Foundation License
<a href = "https://pandas.pydata.org/">`pandas`</a> | Provides functions for data analysis and manipulation | BSD 3-Clause "New" or "Revised" License
<a href = "https://numpy.org/">`numpy`</a> | Provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays | BSD 3-Clause "New" or "Revised" License

*The descriptions were lifted from their respective websites.*

In [None]:
import os
import shutil

import pandas as pd
import numpy as np

<hr>

## PART II: Data Collection

A total of **340 datasets** were collected from three sources:
- **195 datasets** were taken from the collection published by Pimentel [1] in OpenML [2]. The suitability of this collection of datasets for meta-learning studies related to clustering is supported by its use in several studies [3, 4, 5, 6].
- **60 datasets** were taken from the University of California, Irvine (UCI) Machine Learning Repository [7]. From among the datasets tagged as suitable for clustering tasks, a subset of those with ground-truth labels was chosen. The presence of ground-truth labels is necessary since one of the cluster validity indices used in this study (i.e., the adjusted Rand index) is an *external* measure of cluster quality.
- **85 datasets** were taken from Kaggle [8]. From among the datasets tagged as suitable for multi-class or binary classification, a subset of those with ground-truth labels was chosen.

<hr>

## PART III: Data Wrangling

The data wrangling process starts with a manual inspection of the datasets. In particular,
1. The string `_noheader` is appended to the filenames of those without header rows.
2. For uniformity, some of the files are restructured such that <br>
   a. The ground-truth labels are in the last column <br>
   b. The categorical features are in consecutive columns after the numerical features.
3. All the datasets are converted to CSV (comma-separated values) format.
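
The restructuring described above was done by hand. Purely as an illustration, the column reordering in step 2 could be scripted along the lines below. The helper `reorder_columns` and the toy data are hypothetical and are not part of the project's code.

```python
import pandas as pd


def reorder_columns(df, categorical_cols, label_col):
    """Return a view with numerical columns first, then categorical, then the label."""
    numerical_cols = [
        col for col in df.columns
        if col not in categorical_cols and col != label_col
    ]
    return df[numerical_cols + list(categorical_cols) + [label_col]]


# Toy data (hypothetical; not one of the collected datasets).
toy = pd.DataFrame({
    "species": ["a", "b", "a"],       # ground-truth label
    "color": ["red", "blue", "red"],  # categorical feature
    "length": [1.2, 3.4, 5.6],        # numerical feature
})

reordered = reorder_columns(toy, categorical_cols=["color"], label_col="species")
print(list(reordered.columns))  # ['length', 'color', 'species']
```

With this layout, the last column holds the label and the categorical features sit immediately before it, which is the arrangement the later one-hot encoding step relies on.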

The datasets at the end of this stage of manual data wrangling can be accessed through these links: 
- [OpenML Datasets](https://drive.google.com/drive/folders/1CxUsyiKmCpgNlS9pePh_tXjsrhgzHb_U?usp=sharing) <br>
- [UCI Machine Learning Repository Datasets](https://drive.google.com/drive/folders/1kU2SHzluFAwNG3sMdBLT-0D9RFtcwq88?usp=sharing)
- [Kaggle Datasets](https://drive.google.com/drive/folders/17VyMgsNNayMv0yc1X86MkaxrLrPUK8QN?usp=sharing)

However, since some of the datasets, namely *(i)* the Gas Sensor Array Drift Dataset and *(ii)* the User Identification From Walking Activity Dataset, are distributed across multiple files, scripts are written to consolidate each of them into a single CSV file. These scripts assume that the files are stored inside the directory `datasets/for_cleaning`.

### Gas Sensor Array Drift Dataset

The code below converts the [DAT files](https://drive.google.com/drive/folders/1SQHuLomJePQ6N8kGX84SYlOaS3-2aYtn?usp=sharing) into CSV files and consolidates them into a single CSV file. It assumes that the files are stored inside the directory `datasets/for_cleaning/gas_sensor` and that the resulting CSV file will be stored in the directory `datasets/uci_datasets` with the filename `gas_sensor_noheader.csv`.

In [None]:
for j in range(1, 11):
    with open(f'datasets/for_cleaning/gas_sensor/batch{j}.dat', 'r') as file:
        f = file.read()

    # Append the batch number to the end of every line.
    f = f.replace('\n', str(j) + '\n')

    # The entries are stored in the .dat file in the form x:y or x;y, where x acts as an index.
    # Therefore, in the conversion to CSV file, x: and x; are removed.
    # Iterating in descending order removes multi-digit indices (e.g., 128:) before
    # shorter ones (e.g., 8:), so no stray digits are left behind.
    for i in range(129, 0, -1):
        f = f.replace(str(i) + ';', '')
        f = f.replace(str(i) + ':', '')

    # Convert the space-delimited entries to comma-delimited ones.
    f = f.replace(' ', ',')

    with open(f'datasets/for_cleaning/gas_sensor/batch{j}_noheader.csv', 'w') as file:
        file.write(f)
        
src_files = []

for i in range(1, 11):
    src_files.append(f'datasets/for_cleaning/gas_sensor/batch{i}_noheader.csv')
    
with open(f'datasets/uci_datasets/gas_sensor_noheader.csv', 'w') as dest_file:
    for src_file in src_files:
        with open(src_file) as file:
            for line in file:
                dest_file.write(line)
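
As a hedged alternative to the chained string replacements, the index prefixes can also be stripped token by token with a regular expression anchored at the start of each token, which makes it explicit that only the leading `x:` or `x;` is removed. The function name `strip_index_prefixes` and the sample line are hypothetical. Unlike the code above, this sketch does not append the batch number, and `split()` discards any trailing whitespace instead of turning it into a comma.

```python
import re


def strip_index_prefixes(line):
    """Drop the leading 'x:' or 'x;' index from every space-separated token."""
    tokens = line.split()
    values = [re.sub(r'^\d+[:;]', '', token) for token in tokens]
    return ','.join(values)


# Hypothetical line in the x:y / x;y form described above.
sample = "1;10.5 1:15596.16 2:1.87 3:2.65"
print(strip_index_prefixes(sample))  # 10.5,15596.16,1.87,2.65
```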

### User Identification From Walking Activity Dataset

The code below consolidates the [CSV files](https://drive.google.com/drive/folders/1amW3Y8XrWwMfqYrsygcAAsnGyM69G3ct?usp=sharing) into a single CSV file. It assumes that the files are stored inside the directory `datasets/for_cleaning/user_walk` and that the resulting CSV file will be stored in the directory `datasets/uci_datasets` with the filename `user_walk_noheader.csv`.

In [None]:
for j in range(1, 23):
    f = None
    
    with open(f'datasets/for_cleaning/user_walk/{j}.csv', 'r') as file:
        f = file.read()
        f = f.replace('\n', ',' + str(j) + '\n')
            
    with open(f'datasets/for_cleaning/user_walk/user_walk{j}_noheader.csv', 'w') as file:
        file.write(f)
        
src_files = []

# Include all 22 per-participant files produced by the loop above.
for i in range(1, 23):
    src_files.append(f'datasets/for_cleaning/user_walk/user_walk{i}_noheader.csv')
    
with open(f'datasets/uci_datasets/user_walk_noheader.csv', 'w') as dest_file:
    for src_file in src_files:
        with open(src_file) as file:
            for line in file:
                dest_file.write(line)
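
The same consolidation can be sketched with `pandas`, which adds the identifier as a proper column and does not depend on whether each file ends with a newline. This is only an illustration under stated assumptions: the in-memory strings in `raw_files` stand in for the per-participant CSV files, and the column name `participant` is hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two per-participant files (no header rows).
raw_files = {
    1: "0.0,0.1,0.2,0.3\n0.5,0.2,0.1,0.4\n",
    2: "0.0,0.9,0.8,0.7\n",
}

frames = []
for participant_id, text in raw_files.items():
    frame = pd.read_csv(io.StringIO(text), header=None)
    frame['participant'] = participant_id  # the label goes in the last column
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)
print(combined.shape)  # (3, 5)
```

Writing the result with `combined.to_csv(path, index=False, header=False)` would produce a header-less CSV file comparable to the one created above.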

<hr>

## PART IV: One-Hot Encoding

One-hot encoding is performed to convert categorical features into a form that can be fed into machine learning classifiers.

The function below creates a CSV file after performing one-hot encoding on the categorical features of the given dataset.

**Parameters**:
- `filename`: Filename (together with the file extension) of the dataset
- `num_categorical`: Number of categorical features in the dataset
- `output`: Filename of the output CSV file

**Preconditions**:
- The ground-truth labels are in the last column.
- The categorical features are in consecutive columns after the numerical features. This implies that the last `num_categorical` columns (before the last column, which is reserved for the ground-truth labels) correspond to the categorical features.

In [None]:
def convert_orig_to_one_hot(filename, num_categorical, output):
    has_header = "noheader" not in filename
    data_df = pd.read_csv(filename, header=None if not has_header else 'infer')
    data_header = data_df.columns.values

    # Assumes that the label is at the last column and that the categorical features 
    # are at consecutive columns after the numerical features.
    start_col = len(data_header) - num_categorical - 1
    end_col = len(data_header) - 1

    non_categorical = data_df.iloc[:, 0:start_col]
    categorical = data_df.iloc[:, start_col:end_col]
    label = data_df.iloc[:, end_col:]

    categorical_headers = categorical.columns.values

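    # One-hot encoding for categorical values.
    # dtype=int keeps the indicator columns as 0/1 (pandas 2.0 and later default to True/False).
    one_hot = pd.get_dummies(categorical, columns=categorical_headers, dtype=int)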
        
    # Insert the one hot encoded categorical values into the categorical data's position.
    new_df = pd.concat([non_categorical, one_hot, label], axis=1)

    # Save data to CSV.
    new_df.to_csv(output, index=False, header=has_header)
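
As a quick check of the column arithmetic used in `convert_orig_to_one_hot`, the sketch below applies the same slicing to a small in-memory DataFrame instead of a CSV file. The toy data and column names are hypothetical; the layout follows the preconditions stated above (numerical features, then categorical features, then the label).

```python
import pandas as pd

# Hypothetical toy dataset: 1 numerical feature, 2 categorical features, 1 label.
toy = pd.DataFrame({
    "length": [1.2, 3.4, 5.6],
    "color": ["red", "blue", "red"],
    "size": ["S", "L", "S"],
    "label": [0, 1, 0],
})
num_categorical = 2

start_col = toy.shape[1] - num_categorical - 1  # 4 - 2 - 1 = 1
end_col = toy.shape[1] - 1                      # 3

numerical = toy.iloc[:, :start_col]
categorical = toy.iloc[:, start_col:end_col]
label = toy.iloc[:, end_col:]

# dtype=int yields 0/1 indicator columns regardless of the pandas version.
one_hot = pd.get_dummies(categorical, columns=list(categorical.columns), dtype=int)
encoded = pd.concat([numerical, one_hot, label], axis=1)

print(list(encoded.columns))
# ['length', 'color_blue', 'color_red', 'size_L', 'size_S', 'label']
```

Each categorical column is replaced by one indicator column per distinct value, while the numerical features and the label keep their positions at the start and end of the table.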

This is a manually constructed dictionary that stores the number of categorical features in the datasets with categorical features.

In [None]:
# datasets to be one hot encoded
datasets_to_be_processed = {
    "analcatdata_apnea1":	3,
    "analcatdata_apnea2":	3,
    "analcatdata_apnea3":	3,
    "analcatdata_boxing1":	3,
    "analcatdata_boxing2":	3,
    "analcatdata_chlamydia":	3,
    "analcatdata_creditscore":	2,
    "analcatdata_germangss":	5,
    "analcatdata_impeach":	10,
    "analcatdata_lawsuit":	1,
    "analcatdata_michiganacc":	2,
    "analcatdata_neavote":	3,
    "analcatdata_seropositive":	1,
    "analcatdata_vineyard":	2,
    "analcatdata_wildcat_noheader":	2,
    "backache":	26,
    "badges2":	3,
    "calendarDOW":	20,
    "cars1":	2,
    "chscase_vine2":	2,
    "cleve":	7,
    "cloud":	1,
    "cpu":	1,
    "dataset_10_lymph":	15,
    "dataset_48_tae":	4,
    "dataset_106_molecular":	57,
    "flags":	26,
    "fruitfly":	2,
    "grub":	8,
    "hayes":	4,
    "lowbwt":	7,
    "mu284":	6,
    "openml_phpZNNasq":	15,
    "php7gmqTJ":	5,
    "phpJ1rDu3":	5,
    "phpnYQXoc":	7,
    "phpO72JYX":	6,
    "phpOAYun7":	5,
    "phppZkQRw":	8,
    "phpRql5hp":	4,
    "phpSj3fWL":	7,
    "phpSNaed2":	9,
    "phpswpP3r":	1,
    "phpTXWrKb":	4,
    "phpXxoe1Q":	22,
    "plasma_retinol":	3,
    "PopularKids":	9,
    "pwLinear":	10,
    "servo":	4,
    "sleuth_case2002":	6,
    "solar-raw":	12,
    "teachingAssistant":	4,
    "transplant":	1,
    "veteran":	4,
    "vinnie":	2,
    "visualizing_livestock":	2,
    "zoo":	15,
    "sobar-72":	19,
    "online_shoppers_intention":	10,
    "dress":	10,
    "obesity":	13,
    "hcvdat0":	1,
    "heart_failure_clinical_records_dataset":	5,
    "Data_Cortex_Nuclear":	3,
    "nonverbal_tourists":	21
}

The code below iterates through all the pertinent datasets to perform one-hot encoding. It assumes that the datasets are stored in three folders, `openml_datasets`, `uci_datasets`, and `kaggle_datasets`, depending on their source.

In [None]:
directory = os.getcwd()
dataset_folders = ['openml_datasets', 'uci_datasets', 'kaggle_datasets']

for folder in dataset_folders:
    # get datasets 
    datasets = os.listdir(f'./{folder}')
    # mkdir for one hot datasets
    one_hot_folder = f"one_hot_{folder}"

    if(os.path.isdir(one_hot_folder)):
        shutil.rmtree(f"./{one_hot_folder}")
    os.mkdir(f'./{one_hot_folder}')

    for dataset in datasets:
        filename, ext = dataset.rsplit('.', 1)
        num_categorical = datasets_to_be_processed.get(filename)
        if(num_categorical is not None):
            convert_orig_to_one_hot(f"{directory}/{folder}/{dataset}", num_categorical, f"{directory}/{one_hot_folder}/{filename}_one_hot.{ext}")
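
The filename handling in the loop above can be illustrated in isolation. The sketch below is not the project's code: the helper `one_hot_output_name` and the two-entry dictionary are hypothetical, and `os.path.splitext` is swapped in for `rsplit` because it returns an empty extension for a name without a dot, whereas unpacking the result of `rsplit('.', 1)` would raise a `ValueError` in that case.

```python
import os

# Hypothetical two-entry subset of the dictionary of categorical-feature counts.
num_categorical_by_name = {"analcatdata_wildcat_noheader": 2, "zoo": 15}


def one_hot_output_name(dataset):
    """Return the one-hot output filename, or None if the dataset is not listed."""
    stem, ext = os.path.splitext(dataset)
    if stem not in num_categorical_by_name:
        return None
    return f"{stem}_one_hot{ext}"


print(one_hot_output_name("zoo.csv"))   # zoo_one_hot.csv
print(one_hot_output_name("iris.csv"))  # None
```

Because the `_one_hot` suffix is appended after the original stem, an output such as `analcatdata_wildcat_noheader_one_hot.csv` still contains the string `noheader`, so the header check in `convert_orig_to_one_hot` would treat it consistently.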

**Lastly, all 340 preprocessed datasets are transferred to the directory [`final_datasets`](https://github.com/memgonzales/meta-learning-clustering/tree/master/final_datasets).**

<hr>

## References

[1] B. Pimentel, "Datasets," *OpenML*, 2017. [Online]. Available: https://www.openml.org/s/88/data.
[Accessed: Apr. 2, 2022]

[2] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo, "OpenML: Networked science in machine
learning," *ACM SIGKDD Explorations Newsletter*, vol. 15, no. 2, pp. 49-60, June 2014.

[3] X. Zhu, Y. Li, J. Wang, T. Zheng, and J. Fu, "Automatic recommendation of a distance measure for
clustering algorithms," *ACM Transactions on Knowledge Discovery from Data*, vol. 15, no. 1, pp. 7-22,
December 2020.

[4] B. A. Pimentel and A. C. P. L. F. de Carvalho, "Statistical versus distance-based meta-features for
clustering algorithm recommendation using meta-learning," in Proc. 2018 International Joint Conference
on Neural Networks (IJCNN), 2018, pp. 1-8.

[5] B. A. Pimentel and A. C. P. L. F. de Carvalho, "A Meta-learning approach for recommending the
number of clusters for clustering algorithms," *Knowledge-Based Systems*, vol. 195, May 2020.

[6] A. Jilling and M. Alvarez, "Optimizing recommendations for clustering algorithms using
meta-learning," in Proc. 2020 International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1-10.

[7] D. Dua and C. Graff, "UCI Machine Learning Repository," *University of California, School of Information and Computer Science*, 2019. [Online]. Available: http://archive.ics.uci.edu/ml. [Accessed: Apr. 22, 2022]

[8] "Kaggle." [Online]. Available: https://www.kaggle.com. [Accessed: Jul. 17, 2022]