<a href="https://colab.research.google.com/github/krikorantranik/Private/blob/main/DataSelectionFromOpenML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, install the OpenML library

In [None]:
!pip install --quiet openml

Load the necessary packages

In [18]:
import openml
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml

Download a list of datasets from OpenML

In [12]:
availDatasets = openml.datasets.list_datasets(output_format='dataframe')
availDatasets

Unnamed: 0,did,name,version,uploader,status,format,MajorityClassSize,MaxNominalAttDistinctValues,MinorityClassSize,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures
2,2,anneal,1,1,active,ARFF,684.0,7.0,8.0,5.0,39.0,898.0,898.0,22175.0,6.0,33.0
3,3,kr-vs-kp,1,1,active,ARFF,1669.0,3.0,1527.0,2.0,37.0,3196.0,0.0,0.0,0.0,37.0
4,4,labor,1,1,active,ARFF,37.0,3.0,20.0,2.0,17.0,57.0,56.0,326.0,8.0,9.0
5,5,arrhythmia,1,1,active,ARFF,245.0,13.0,2.0,13.0,280.0,452.0,384.0,408.0,206.0,74.0
6,6,letter,1,1,active,ARFF,813.0,26.0,734.0,26.0,17.0,20000.0,0.0,0.0,16.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45674,45674,a,2,30127,active,arff,,,,,5.0,150.0,0.0,0.0,4.0,1.0
45675,45675,a,3,30127,active,arff,50.0,,50.0,3.0,4.0,150.0,0.0,0.0,3.0,1.0
45681,45681,iris,54,30127,active,arff,,,,,5.0,150.0,0.0,0.0,4.0,1.0
45683,45683,iris,55,30127,active,arff,,,,,5.0,150.0,0.0,0.0,4.0,1.0
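
The listing contains several versions of the same dataset name (for example `iris` versions 54 and 55). The following is a minimal sketch, not part of the original workflow, of keeping only the highest version per name. The helper `latest_versions` is a hypothetical name introduced here, and the sketch assumes that rows sharing a `name` really are versions of the same dataset, which OpenML does not guarantee across uploaders.

```python
import pandas as pd

def latest_versions(meta):
    """Keep one row per dataset name: the one with the highest version number."""
    return (
        meta.sort_values(["name", "version"])
        .drop_duplicates(subset="name", keep="last")
        .sort_index()
    )

# Rows copied from the listing above (did, name and version columns only)
toy = pd.DataFrame(
    {
        "did": [45674, 45675, 45681, 45683],
        "name": ["a", "a", "iris", "iris"],
        "version": [2, 3, 54, 55],
    },
    index=[45674, 45675, 45681, 45683],
)
print(latest_versions(toy))
```

With the real listing, the same call would be `latest_versions(availDatasets)`.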


This task needs a dataset with mostly numeric features, so I compute the fraction of numeric features for each dataset, along with the fraction of instances that have no missing values. I keep datasets in which at least 90% of the instances are clean (no missing values) and at least 80% of the features are numeric.

I also don't want datasets that are too small or too big, so I limit the size to between 500 and 5000 instances and 10 to 50 features.

In [13]:
# fraction of numeric features and fraction of instances without missing values
availDatasets["FracNumeric"] = availDatasets['NumberOfNumericFeatures']/availDatasets['NumberOfFeatures']
availDatasets["FracClean"] = 1-availDatasets['NumberOfInstancesWithMissingValues']/availDatasets['NumberOfInstances']
# quality filters, then size filters
availDatasets = availDatasets[(availDatasets['FracNumeric']>=0.8) & (availDatasets['FracClean']>=0.9)]
availDatasets = availDatasets[(availDatasets['NumberOfInstances']>=500) & (availDatasets['NumberOfInstances']<=5000)]
availDatasets = availDatasets[(availDatasets['NumberOfFeatures']>=10) & (availDatasets['NumberOfFeatures']<=50)]
# save the shortlist for offline review
availDatasets.to_csv('Datasetlist.csv')
availDatasets

Unnamed: 0,did,name,version,uploader,status,format,MajorityClassSize,MaxNominalAttDistinctValues,MinorityClassSize,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures,FracNumeric,FracClean
15,15,breast-w,1,1,active,ARFF,458.0,2.0,241.0,2.0,10.0,699.0,16.0,16.0,9.0,1.0,0.900000,0.97711
22,22,mfeat-zernike,1,1,active,ARFF,200.0,10.0,200.0,10.0,48.0,2000.0,0.0,0.0,47.0,1.0,0.979167,1.00000
36,36,segment,1,1,active,ARFF,330.0,7.0,330.0,7.0,20.0,2310.0,0.0,0.0,19.0,1.0,0.950000,1.00000
54,54,vehicle,1,1,active,ARFF,218.0,4.0,199.0,4.0,19.0,846.0,0.0,0.0,18.0,1.0,0.947368,1.00000
60,60,waveform-5000,1,1,active,ARFF,1692.0,3.0,1653.0,3.0,41.0,5000.0,0.0,0.0,40.0,1.0,0.975610,1.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45536,45536,Contaminant-detection-in-packaged-cocoa-hazeln...,1,33136,active,arff,,,,0.0,31.0,2400.0,0.0,0.0,31.0,0.0,1.000000,1.00000
45537,45537,Contaminant-detection-in-packaged-cocoa-hazeln...,1,33136,active,arff,,,,0.0,31.0,2400.0,0.0,0.0,31.0,0.0,1.000000,1.00000
45538,45538,Contaminant-detection-in-packaged-cocoa-hazeln...,1,33136,active,arff,,,,0.0,31.0,2400.0,0.0,0.0,31.0,0.0,1.000000,1.00000
45539,45539,Contaminant-detection-in-packaged-cocoa-hazeln...,1,33136,active,arff,,,,0.0,31.0,2400.0,0.0,0.0,31.0,0.0,1.000000,1.00000
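
To make the selection criteria easier to check and adjust, the same filter can be sketched as a small function and run on a few rows copied from the listings above. The function name `filter_candidates` and its parameter names are hypothetical, introduced only for this illustration. The thresholds default to the values used in the cell above.

```python
import pandas as pd

def filter_candidates(meta, min_frac_numeric=0.8, min_frac_clean=0.9,
                      instances=(500, 5000), features=(10, 50)):
    """Return the rows of an OpenML-style metadata table that meet all limits."""
    meta = meta.copy()  # leave the caller's table untouched
    meta["FracNumeric"] = meta["NumberOfNumericFeatures"] / meta["NumberOfFeatures"]
    meta["FracClean"] = 1 - meta["NumberOfInstancesWithMissingValues"] / meta["NumberOfInstances"]
    keep = (
        (meta["FracNumeric"] >= min_frac_numeric)
        & (meta["FracClean"] >= min_frac_clean)
        & meta["NumberOfInstances"].between(*instances)  # inclusive bounds
        & meta["NumberOfFeatures"].between(*features)    # inclusive bounds
    )
    return meta[keep]

# Three rows copied from the listings above
toy = pd.DataFrame(
    {
        "name": ["anneal", "breast-w", "letter"],
        "NumberOfFeatures": [39.0, 10.0, 17.0],
        "NumberOfInstances": [898.0, 699.0, 20000.0],
        "NumberOfInstancesWithMissingValues": [898.0, 16.0, 0.0],
        "NumberOfNumericFeatures": [6.0, 9.0, 16.0],
    },
    index=[2, 15, 6],
)
selected = filter_candidates(toy)
print(selected[["name", "FracNumeric", "FracClean"]])
```

Here `anneal` is rejected because only 6 of its 39 features are numeric and every instance has missing values, `letter` is rejected because it has 20000 instances, and `breast-w` passes all four conditions.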


The most practical next step is to download the CSV file and review the candidate datasets. To explore a particular dataset further, use the OpenML website or the following code

In [23]:
# data_id is the value from the did column of the dataset list
queriesDS = fetch_openml(data_id=41701, as_frame=True, parser='auto')
print(queriesDS.DESCR)
print(queriesDS.feature_names)
print(queriesDS.target_names)


source: An Algorithm Selection Benchmark for the Container Pre-Marshalling Problem (CPMP)
authors: K. Tierney and Y. Malitsky (features) / K. Tierney and D. Pacino and S. Voss (algorithms)
translator in coseal format: K. Tierney

This is an extension of the 2013 premarshalling dataset that includes more features and a set of test instances. 

There are three sets of features:

feature_values.arff contains the full set of features from iteration 2 of our latent feature analysis (LFA) process (see paper)
feature_values_itr1.arff contains only the features after iteration 1 of LFA
feature_values_orig.arff containers the features used in PREMARHSALLING-ASTAR-2013

We also provide test data with an identical naming scheme (see _test). 

The features for the pre-marshalling problem are all extremely easy and fast to
compute, thus the feature_costs.arff file has been omitted, as it would be time
0 for every feature (regardless of using original, iteration 1 or iteration 2
features).

The feat

If the dataset looks suitable, retrieve it as a pandas DataFrame

In [21]:
dataset = queriesDS.data
dataset

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,y2
0,0.98,514.5,294.0,110.25,7.0,2.0,0.0,0.0,11
1,0.98,514.5,294.0,110.25,7.0,3.0,0.0,0.0,11
2,0.98,514.5,294.0,110.25,7.0,4.0,0.0,0.0,11
3,0.98,514.5,294.0,110.25,7.0,5.0,0.0,0.0,11
4,0.90,563.5,318.5,122.50,7.0,2.0,0.0,0.0,18
...,...,...,...,...,...,...,...,...,...
763,0.64,784.0,343.0,220.50,3.5,5.0,0.4,5.0,11
764,0.62,808.5,367.5,220.50,3.5,2.0,0.4,5.0,7
765,0.62,808.5,367.5,220.50,3.5,3.0,0.4,5.0,7
766,0.62,808.5,367.5,220.50,3.5,4.0,0.4,5.0,7
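
Once the table is loaded, the two quality criteria used for the metadata filter can be re-checked on the actual data. This is a minimal sketch under stated assumptions: the helper `frame_quality` is a hypothetical name, and the toy frame below is made-up illustration data, not taken from any OpenML dataset.

```python
import numpy as np
import pandas as pd

def frame_quality(df):
    """Return (fraction of numeric columns, fraction of rows without missing values)."""
    frac_numeric = df.select_dtypes(include="number").shape[1] / df.shape[1]
    frac_clean = 1 - df.isna().any(axis=1).mean()
    return frac_numeric, frac_clean

# Hypothetical toy frame: two numeric columns, one text column, one row with a gap
toy = pd.DataFrame(
    {
        "V1": [0.98, 0.98, np.nan, 0.90],
        "V2": [514.5, 514.5, 514.5, 563.5],
        "label": ["a", "b", "a", "b"],
    }
)
frac_numeric, frac_clean = frame_quality(toy)
print(frac_numeric, frac_clean)
```

With real data the call would be `frame_quality(dataset)`. The result may differ slightly from the `FracNumeric` value in the listing, since the OpenML feature counts appear to include the target column while `queriesDS.data` holds only the feature columns.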
