In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

In [2]:
import openml
import os
import pandas as pd
import math

## For loading data
from pathlib import Path
from typing import Union

In [3]:
# Functions to read data
def load_dataset(path: Union[Path, str]) -> pd.DataFrame:
    return pd.read_csv(path, index_col=0)


def load_rankings(path: Union[Path, str]) -> pd.DataFrame:
    out = pd.read_csv(path, index_col=0, header=[0, 1, 2, 3])
    # A MultiIndex stores its level labels in `names`; assigning to `name` leaves the levels unnamed
    out.columns.names = ["dataset", "model", "tuning", "scoring"]
    return out

In [4]:
dir_data = '../../data/raw/'

# File names
filename_dataset = 'dataset.csv'

# Create paths for given files
filepath_dataset = os.path.join(dir_data, filename_dataset)

# Load data
dataset = load_dataset(filepath_dataset)

In [5]:
unique_datasets = dataset.dataset.unique()

# Idea explained

The basic idea is to generate a broad set of candidate features. 
A feature selection algorithm such as RFECV or MRMR will later select the best of them.
The starting point is the ```dataset``` feature, which holds the id of the dataset on [openml.org](https://www.openml.org/). 
The [openml API](https://openml.github.io/openml-python/main/api.html#) is used to look up the metadata for each id. 

The first concept for creating the features is: 

![image](../../data/dataset_FE.svg)

### ToDos
[ ] Research and test openMLStudy

[ ] Research and test openMLTask

[ ] Research and test openMLRun

[ ] Research and test openml.datasets.list_qualities

## dataset_agg

### Get lists of possible attributes and an intersection

In [6]:
# Get intersection of keys which are in all datasets
list_of_keys = [set(openml.datasets.get_dataset(dataset_id=int(dataset_id)).qualities.keys()) for dataset_id in unique_datasets]
intersection = set.intersection(*list_of_keys)
intersection

Could not download file from http://openml1.win.tue.nl/dataset41224/dataset_41224.pq: Bucket does not exist or is private.


{'AutoCorrelation',
 'Dimensionality',
 'MajorityClassPercentage',
 'MajorityClassSize',
 'MinorityClassPercentage',
 'MinorityClassSize',
 'NumberOfBinaryFeatures',
 'NumberOfClasses',
 'NumberOfFeatures',
 'NumberOfInstances',
 'NumberOfInstancesWithMissingValues',
 'NumberOfMissingValues',
 'NumberOfNumericFeatures',
 'NumberOfSymbolicFeatures',
 'PercentageOfBinaryFeatures',
 'PercentageOfInstancesWithMissingValues',
 'PercentageOfMissingValues',
 'PercentageOfNumericFeatures',
 'PercentageOfSymbolicFeatures'}

In [7]:
# Set of all attributes given by openml
attribute_set = set()
for dataset_id in unique_datasets:
    attribute_list = list(openml.datasets.get_dataset(dataset_id=int(dataset_id)).qualities.keys())
    attribute_set.update(attribute_list)
attribute_set

Could not download file from http://openml1.win.tue.nl/dataset41224/dataset_41224.pq: Bucket does not exist or is private.


{'AutoCorrelation',
 'CfsSubsetEval_DecisionStumpAUC',
 'CfsSubsetEval_DecisionStumpErrRate',
 'CfsSubsetEval_DecisionStumpKappa',
 'CfsSubsetEval_NaiveBayesAUC',
 'CfsSubsetEval_NaiveBayesErrRate',
 'CfsSubsetEval_NaiveBayesKappa',
 'CfsSubsetEval_kNN1NAUC',
 'CfsSubsetEval_kNN1NErrRate',
 'CfsSubsetEval_kNN1NKappa',
 'ClassEntropy',
 'DecisionStumpAUC',
 'DecisionStumpErrRate',
 'DecisionStumpKappa',
 'Dimensionality',
 'EquivalentNumberOfAtts',
 'J48.00001.AUC',
 'J48.00001.ErrRate',
 'J48.00001.Kappa',
 'J48.0001.AUC',
 'J48.0001.ErrRate',
 'J48.0001.Kappa',
 'J48.001.AUC',
 'J48.001.ErrRate',
 'J48.001.Kappa',
 'MajorityClassPercentage',
 'MajorityClassSize',
 'MaxAttributeEntropy',
 'MaxKurtosisOfNumericAtts',
 'MaxMeansOfNumericAtts',
 'MaxMutualInformation',
 'MaxNominalAttDistinctValues',
 'MaxSkewnessOfNumericAtts',
 'MaxStdDevOfNumericAtts',
 'MeanAttributeEntropy',
 'MeanKurtosisOfNumericAtts',
 'MeanMeansOfNumericAtts',
 'MeanMutualInformation',
 'MeanNoiseToSignalRatio',


At first, I will keep __all__ features, not just the ones that are present in every dataset. 
Therefore, I will create additional features in the dataset_agg table. 
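
To make the difference between the intersection and the full set concrete, here is a minimal sketch that uses plain dictionaries as stand-ins for `dataset.qualities`. The dataset ids, quality values and variable names are invented for illustration. A quality that a dataset lacks comes back as `None` from `.get`, which is how gaps end up in the aggregated table.

```python
# Hypothetical stand-ins for `dataset.qualities` of two datasets (values invented).
qualities_by_id = {
    1: {"NumberOfInstances": 100.0, "NumberOfClasses": 2.0, "ClassEntropy": 0.97},
    2: {"NumberOfInstances": 250.0, "NumberOfClasses": 3.0},
}

key_sets = [set(q) for q in qualities_by_id.values()]
common_keys = set.intersection(*key_sets)  # qualities present in every dataset
all_keys = set.union(*key_sets)            # qualities present in at least one dataset

# Build one column per quality; a missing quality becomes None.
columns = {
    key: [qualities_by_id[i].get(key) for i in sorted(qualities_by_id)]
    for key in sorted(all_keys)
}

print(sorted(common_keys))
print(columns["ClassEntropy"])
```

Keeping the union preserves more information, at the cost of missing values that the later feature selection step has to handle.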

The following table maps the features I want to create to the ones provided by openml.

| My feature idea | Related feature from openml | Description |
| :- | :- | :- |
| row_count | NumberOfInstances | The number of instances = The number of rows in the dataset |
| column_count | NumberOfFeatures | The total number of features + targets |
| null_value_count | NumberOfMissingValues | Number of occurring null values |
| rows_with_null_values_count | NumberOfInstancesWithMissingValues | Number of rows with null values |
| columns_with_null_values_count |  | Number of features containing null values |
| ratio_of_null_values_to_all |  | $ = \dfrac{\text{null_value_count}}{\text{row_count} \times \text{total_feature_count}}$ |
| categorical_features_count |  | Self-explanatory. A __suggestion__ is computed automatically, but it has to be checked manually, since numerical features can also encode category names (e.g. *geo_level_1_id* in the earthquake dataset). |
| non_categorical_features_count |  | Self-explanatory, but it also has to be checked manually. |
| ratio_of_categorical_features_to_all |  | $ = \dfrac{\text{categorical_features_count}}{\text{total_feature_count}} $ |
| sum_of_all_categories |  | Sum of the number of categories over all categorical features. Has to be checked manually. |
| categorical_target_variables_count |  | The number of classification tasks |
| non_categorical_target_variables_count |  | The number of regression tasks |
| categorical_target_values_sum | NumberOfClasses | The sum of classes to predict over all target variables |
| total_feature_count |  | The number of features to predict the target(s) |
| min_number_of_categories_per_cat_feature |  | Min number of categories in a categorical feature |
| max_number_of_categories_per_cat_feature |  | Max number of categories in a categorical feature |
| avg_number_of_categories_per_cat_feature |  | Avg number of categories per categorical feature |

### Create dataset and save it

In [8]:
# Init empty lists for feature values
list_dataset_id = []
list_row_count = []
list_column_count = []
list_null_value_count = []
list_rows_with_null_values_count = []
list_columns_with_null_values_count = []
list_ratio_of_null_values_to_all = []
list_categorical_features_count = []
list_non_categorical_features_count = []
list_ratio_of_categorical_features_to_all = []
list_sum_of_all_categories = []
list_categorical_target_variables_count = []
list_non_categorical_target_variables_count = []
list_categorical_target_values_sum = []
list_total_feature_count = []
list_min_number_of_categories_per_cat_feature = []
list_max_number_of_categories_per_cat_feature = []
list_avg_number_of_categories_per_cat_feature = []

In [9]:
# Remove the features already used above
attributes_to_remove_from_feature_set = {"NumberOfInstances", "NumberOfMissingValues", "NumberOfInstancesWithMissingValues", "NumberOfClasses", "NumberOfFeatures"}

# Keep only the attributes that are not already covered by the features above
add_feature_list = attribute_set - attributes_to_remove_from_feature_set

# Create dict with lists for the features to add
feature_list_dict = {}
for feature_name in add_feature_list:
    feature_list_dict[feature_name] = []

In [10]:
def row_count(dataset):
    """
    Returns the count of rows in the provided dataset.

            Parameters:
                    dataset (openml.datasets.OpenMLDataset): A dataset object from openml.org

            Returns:
                    row_count (int): The number of rows in the provided dataset object 
    """
    return dataset.qualities.get('NumberOfInstances')

In [11]:
def column_count(dataset):
    return dataset.qualities.get('NumberOfFeatures')

In [12]:
def null_value_count(dataset):
    return dataset.qualities.get('NumberOfMissingValues')

In [13]:
def rows_with_null_values_count(dataset):
    return dataset.qualities.get('NumberOfInstancesWithMissingValues')

In [14]:
def columns_with_null_values_count(X):
    return sum(X.isna().any())

In [15]:
def ratio_of_null_values_to_all(dataset, X):
    return (null_value_count(dataset)) / (total_feature_count(X) * row_count(dataset))

In [16]:
def categorical_features_count(dataset):
    categorical_features_count = 0
    
    for k in dataset.features:
        # Operations on features
        if dataset.features[k].name not in dataset.default_target_attribute.split(','):
            if dataset.features[k].data_type in ['nominal', 'string']:
                categorical_features_count += 1
    
    #return sum(categorical_indicator)
    return categorical_features_count

In [17]:
def non_categorical_features_count(X, dataset):
    return total_feature_count(X) - categorical_features_count(dataset)

In [18]:
def ratio_of_categorical_features_to_all(X, dataset):
    return categorical_features_count(dataset) / total_feature_count(X)

In [19]:
def sum_of_all_categories(dataset, attribute_names, X=None):
    # ToDo: Maybe use the categorical indicator map
    # X was previously read from the global scope; load it here if it is not passed in
    if X is None:
        X, _, _, _ = dataset.get_data(
            target=dataset.default_target_attribute, dataset_format="dataframe"
        )
    sum_of_categories = 0
    
    for k in dataset.features:
        # Operations on features
        if dataset.features[k].name not in dataset.default_target_attribute.split(','):
            # Add the number of categories of each categorical feature
            if dataset.features[k].data_type == 'nominal':
                sum_of_categories += len(dataset.features[k].nominal_values)
            if dataset.features[k].data_type == 'string':
                if dataset.features[k].name in attribute_names:
                    tmp = X[dataset.features[k].name].unique()
                    sum_of_categories += len(tmp)
    
    return sum_of_categories

In [20]:
def categorical_target_variables_count(dataset):
    count_of_cat_targets = 0
    
    for k in dataset.features:
        # Operations on features
        if dataset.features[k].name in dataset.default_target_attribute.split(','):
            if dataset.features[k].data_type in ['nominal', 'string']:
                count_of_cat_targets += 1
    
    return count_of_cat_targets

In [21]:
def non_categorical_target_variables_count(dataset):
    count_of_non_cat_targets = 0
    
    for k in dataset.features:
        # Operations on features
        if dataset.features[k].name in dataset.default_target_attribute.split(','):
            if dataset.features[k].data_type not in ['nominal', 'string']:
                count_of_non_cat_targets += 1
    
    return count_of_non_cat_targets

In [22]:
def categorical_target_values_sum(dataset):
    return dataset.qualities.get('NumberOfClasses')

In [23]:
def total_feature_count(X):
    return X.shape[1]

In [24]:
def min_number_of_categories_per_cat_feature(dataset, X, attribute_names):
    min_number_of_categories = math.inf
    
    for k in dataset.features:
        # Operations on features
        if dataset.features[k].name not in dataset.default_target_attribute.split(','):
            # Update min and max number of categories per features
            if dataset.features[k].data_type == 'nominal':
                if len(dataset.features[k].nominal_values) < min_number_of_categories:
                    min_number_of_categories = len(dataset.features[k].nominal_values)
            if dataset.features[k].data_type == 'string':
                if dataset.features[k].name in attribute_names:
                    tmp = X[dataset.features[k].name].unique()
                    if len(tmp) < min_number_of_categories:
                        min_number_of_categories = len(tmp)
    
    return min_number_of_categories

In [25]:
def max_number_of_categories_per_cat_feature(dataset, X, attribute_names):
    max_number_of_categories = -math.inf
    
    for k in dataset.features:
        # Operations on features
        if dataset.features[k].name not in dataset.default_target_attribute.split(','):
            # Update min and max number of categories per features
            if dataset.features[k].data_type == 'nominal':
                if len(dataset.features[k].nominal_values) > max_number_of_categories:
                    max_number_of_categories = len(dataset.features[k].nominal_values)
            if dataset.features[k].data_type == 'string':
                if dataset.features[k].name in attribute_names:
                    tmp = X[dataset.features[k].name].unique()
                    if len(tmp) > max_number_of_categories:
                        max_number_of_categories = len(tmp)
    
    return max_number_of_categories

In [26]:
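def avg_number_of_categories_per_cat_feature(dataset, categorical_indicator, attribute_names):
    n_categorical = categorical_features_count(dataset)
    # Avoid a ZeroDivisionError for datasets without categorical features
    return sum_of_all_categories(dataset, attribute_names) / n_categorical if n_categorical else math.nan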
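
The sum, min, max and average functions above all walk over the same features. As a possible simplification, the sketch below collects the category counts once and derives all four statistics from that list. It is only an illustration: the feature objects are `SimpleNamespace` stand-ins that mimic the `name`, `data_type` and `nominal_values` attributes used above, the helper names are hypothetical, and `string` features are left out for brevity.

```python
import math
from types import SimpleNamespace

# Hypothetical stand-ins for the entries of `dataset.features` (values invented).
toy_features = {
    0: SimpleNamespace(name="color", data_type="nominal", nominal_values=["r", "g", "b"]),
    1: SimpleNamespace(name="size", data_type="nominal", nominal_values=["s", "l"]),
    2: SimpleNamespace(name="weight", data_type="numeric", nominal_values=None),
    3: SimpleNamespace(name="label", data_type="nominal", nominal_values=["yes", "no"]),
}
toy_targets = {"label"}


def nominal_category_counts(features, target_names):
    """Return the number of categories of every nominal non-target feature."""
    return [
        len(f.nominal_values)
        for f in features.values()
        if f.name not in target_names and f.data_type == "nominal"
    ]


def summarize_category_counts(counts):
    """Derive sum, min, max and average; NaN where no categorical feature exists."""
    if not counts:
        return {"sum": 0, "min": math.nan, "max": math.nan, "avg": math.nan}
    return {
        "sum": sum(counts),
        "min": min(counts),
        "max": max(counts),
        "avg": sum(counts) / len(counts),
    }


summary = summarize_category_counts(nominal_category_counts(toy_features, toy_targets))
print(summary)
```

Returning NaN for datasets without categorical features is a design choice of this sketch; it keeps such rows recognizable as missing rather than filling them with infinite sentinel values.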

In [27]:
def get_predefined_feature(dataset, feature_name):
    return dataset.qualities.get(feature_name)

In [28]:
# Traverse all unique datasets, call the functions and collect the information
for dataset_id in unique_datasets:
    print(dataset_id)
    
    # Get openml dataset object with the current id
    dataset = openml.datasets.get_dataset(dataset_id=int(dataset_id))
    
    # Get dataset
    X, y, categorical_indicator, attribute_names = dataset.get_data(
        target=dataset.default_target_attribute, dataset_format="dataframe"
    )
    
    # Apply functions
    list_dataset_id.append(dataset_id)
    list_row_count.append(row_count(dataset))
    list_column_count.append(column_count(dataset))
    list_null_value_count.append(null_value_count(dataset))
    list_rows_with_null_values_count.append(rows_with_null_values_count(dataset))
    list_columns_with_null_values_count.append(columns_with_null_values_count(X))
    list_ratio_of_null_values_to_all.append(ratio_of_null_values_to_all(dataset, X))
    list_categorical_features_count.append(categorical_features_count(dataset))
    list_non_categorical_features_count.append(non_categorical_features_count(X, dataset))
    list_ratio_of_categorical_features_to_all.append(ratio_of_categorical_features_to_all(X, dataset))
    list_sum_of_all_categories.append(sum_of_all_categories(dataset, attribute_names))
    list_categorical_target_variables_count.append(categorical_target_variables_count(dataset))
    list_non_categorical_target_variables_count.append(non_categorical_target_variables_count(dataset))
    list_categorical_target_values_sum.append(categorical_target_values_sum(dataset))
    list_total_feature_count.append(total_feature_count(X))
    list_min_number_of_categories_per_cat_feature.append(min_number_of_categories_per_cat_feature(dataset, X, attribute_names))
    list_max_number_of_categories_per_cat_feature.append(max_number_of_categories_per_cat_feature(dataset, X, attribute_names))
    list_avg_number_of_categories_per_cat_feature.append(avg_number_of_categories_per_cat_feature(dataset, categorical_indicator, attribute_names))
    
    # Iterate over the attributes in qualities
    for feature_name in add_feature_list:
        feature_list_dict[feature_name].append(get_predefined_feature(dataset, feature_name))

3
29
31
38
50
51
56
333
334
451
470
881
956
959
981
1037
1111
1112
1114
1169
1235
1461
1463
1486
1506
1511
1590
6332
23381
40536
40945
40981
40999
41005
41007
41162


Could not download file from http://openml1.win.tue.nl/dataset41224/dataset_41224.pq: Bucket does not exist or is private.


41224
42178
42343
42344
42738
42750
43098
43607
43890
43892
43896
43897
43900
43922


In [30]:
# Create a pandas dataframe and save it
feature_list_dict['dataset_id'] = list_dataset_id
feature_list_dict['row_count'] = list_row_count
feature_list_dict['column_count'] = list_column_count
feature_list_dict['null_value_count'] = list_null_value_count
feature_list_dict['rows_with_null_values_count'] = list_rows_with_null_values_count
feature_list_dict['columns_with_null_values_count'] = list_columns_with_null_values_count
feature_list_dict['ratio_of_null_values_to_all'] = list_ratio_of_null_values_to_all
feature_list_dict['categorical_features_count'] = list_categorical_features_count
feature_list_dict['non_categorical_features_count'] = list_non_categorical_features_count
feature_list_dict['ratio_of_categorical_features_to_all'] = list_ratio_of_categorical_features_to_all
feature_list_dict['sum_of_all_categories'] = list_sum_of_all_categories
feature_list_dict['categorical_target_variables_count'] = list_categorical_target_variables_count
feature_list_dict['non_categorical_target_variables_count'] = list_non_categorical_target_variables_count
feature_list_dict['categorical_target_values_sum'] = list_categorical_target_values_sum
feature_list_dict['total_feature_count'] = list_total_feature_count
feature_list_dict['min_number_of_categories_per_cat_feature'] = list_min_number_of_categories_per_cat_feature
feature_list_dict['max_number_of_categories_per_cat_feature'] = list_max_number_of_categories_per_cat_feature
feature_list_dict['avg_number_of_categories_per_cat_feature'] = list_avg_number_of_categories_per_cat_feature

dataset_agg = pd.DataFrame(feature_list_dict)
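
An alternative to keeping one list per feature is to collect one dictionary per dataset and build the frame from those records. The sketch below shows the pattern on invented values; `records` and `toy_agg` are hypothetical names and are not part of the notebook.

```python
import pandas as pd

# One dictionary per dataset; ids and counts are invented for illustration.
records = []
for dataset_id, n_rows, n_features in [(1, 10, 3), (2, 20, 5)]:
    records.append(
        {
            "dataset_id": dataset_id,
            "row_count": n_rows,
            "total_feature_count": n_features,
        }
    )

# Each record becomes one row; the dictionary keys become the columns.
toy_agg = pd.DataFrame.from_records(records)
print(toy_agg)
```

With this layout, adding a feature means adding one key to the record instead of a new list plus a new assignment.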

In [31]:
dataset_agg.head(50)

Unnamed: 0,Quartile1SkewnessOfNumericAtts,NumberOfBinaryFeatures,J48.00001.ErrRate,MinAttributeEntropy,NumberOfSymbolicFeatures,MeanNominalAttDistinctValues,MeanStdDevOfNumericAtts,J48.0001.Kappa,NumberOfNumericFeatures,MajorityClassPercentage,...,non_categorical_features_count,ratio_of_categorical_features_to_all,sum_of_all_categories,categorical_target_variables_count,non_categorical_target_variables_count,categorical_target_values_sum,total_feature_count,min_number_of_categories_per_cat_feature,max_number_of_categories_per_cat_feature,avg_number_of_categories_per_cat_feature
0,,35.0,0.007822,0.004094,37.0,2.027027,,0.984326,0.0,52.221527,...,0,1.0,74,1,0,2.0,36,2,3,2.055556
1,1.403083,5.0,0.16087,0.50104,10.0,4.2,901.509141,0.673994,6.0,55.507246,...,6,0.6,41,1,0,2.0,15,2,14,4.555556
2,-0.27257,3.0,0.279,0.228364,14.0,4.0,407.047619,0.244312,7.0,70.0,...,7,0.65,56,1,0,2.0,20,2,11,4.307692
3,1.258947,21.0,0.015642,-0.0,23.0,2.086957,19.053878,0.857927,7.0,93.875928,...,7,0.758621,46,1,0,2.0,29,1,5,2.090909
4,,1.0,0.185804,1.470628,10.0,2.9,,0.577195,0.0,65.344468,...,0,1.0,27,1,0,2.0,9,3,3,3.0
5,-0.18559,3.0,0.207483,0.391312,8.0,2.625,19.599081,0.549102,6.0,63.945578,...,6,0.538462,19,1,0,2.0,13,2,4,2.714286
6,,17.0,0.045977,0.929364,17.0,2.0,,0.903872,0.0,61.37931,...,0,1.0,32,1,0,2.0,16,2,2,2.0
7,,3.0,0.104317,0.999664,7.0,2.714286,,0.791367,0.0,50.0,...,0,1.0,17,1,0,2.0,6,2,4,2.833333
8,,3.0,0.356073,0.999982,7.0,2.714286,,0.029328,0.0,65.723794,...,0,1.0,17,1,0,2.0,6,2,4,2.833333
9,-0.080037,2.0,0.0,1.0,4.0,4.25,15.395027,1.0,2.0,55.6,...,2,0.6,15,1,0,2.0,5,2,10,5.0


In [None]:
dataset_agg.to_csv('../../data/preprocessed/dataset_agg.csv')