<a href="https://colab.research.google.com/github/krikorantranik/Private/blob/main/DataSelectionFromOpenML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, install the OpenML Python library.

In [None]:
!pip install --quiet openml

Load the necessary packages

In [3]:
import openml
import numpy as np
import pandas as pd

Download a list of datasets from OpenML

In [9]:
availDatasets = openml.datasets.list_datasets(output_format='dataframe')
availDatasets

Unnamed: 0,did,name,version,uploader,status,format,MajorityClassSize,MaxNominalAttDistinctValues,MinorityClassSize,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures
2,2,anneal,1,1,active,ARFF,684.0,7.0,8.0,5.0,39.0,898.0,898.0,22175.0,6.0,33.0
3,3,kr-vs-kp,1,1,active,ARFF,1669.0,3.0,1527.0,2.0,37.0,3196.0,0.0,0.0,0.0,37.0
4,4,labor,1,1,active,ARFF,37.0,3.0,20.0,2.0,17.0,57.0,56.0,326.0,8.0,9.0
5,5,arrhythmia,1,1,active,ARFF,245.0,13.0,2.0,13.0,280.0,452.0,384.0,408.0,206.0,74.0
6,6,letter,1,1,active,ARFF,813.0,26.0,734.0,26.0,17.0,20000.0,0.0,0.0,16.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45674,45674,a,2,30127,active,arff,,,,,5.0,150.0,0.0,0.0,4.0,1.0
45675,45675,a,3,30127,active,arff,50.0,,50.0,3.0,4.0,150.0,0.0,0.0,3.0,1.0
45681,45681,iris,54,30127,active,arff,,,,,5.0,150.0,0.0,0.0,4.0,1.0
45683,45683,iris,55,30127,active,arff,,,,,5.0,150.0,0.0,0.0,4.0,1.0


This task needs a dataset with mostly numeric features, so I compute the fraction of features that are numeric (`FracNumeric`). I also compute the fraction of instances that have no missing values (`FracClean`). The target is datasets in which more than 90% of instances are clean and more than 80% of features are numeric.

I also don't want datasets that are too small or too large, so I limit the size to between 500 and 5000 instances and between 10 and 50 features. The cell below adds both fractions as columns and applies the size limits only; the 90% and 80% thresholds are not applied in this cell.

In [11]:
# Fraction of features that are numeric
availDatasets["FracNumeric"] = availDatasets['NumberOfNumericFeatures']/availDatasets['NumberOfFeatures']
# Fraction of instances without missing values
availDatasets["FracClean"] = 1-availDatasets['NumberOfInstancesWithMissingValues']/availDatasets['NumberOfInstances']
# Keep datasets with 500 to 5000 instances
availDatasets = availDatasets[(availDatasets['NumberOfInstances']>=500) & (availDatasets['NumberOfInstances']<=5000)]
# Keep datasets with 10 to 50 features
availDatasets = availDatasets[(availDatasets['NumberOfFeatures']>=10) & (availDatasets['NumberOfFeatures']<=50)]
# Save the size-filtered list for later inspection
availDatasets.to_csv('Datasetlist.csv')
availDatasets

Unnamed: 0,did,name,version,uploader,status,format,MajorityClassSize,MaxNominalAttDistinctValues,MinorityClassSize,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures,FracNumeric,FracClean
2,2,anneal,1,1,active,ARFF,684.0,7.0,8.0,5.0,39.0,898.0,898.0,22175.0,6.0,33.0,0.153846,0.00000
3,3,kr-vs-kp,1,1,active,ARFF,1669.0,3.0,1527.0,2.0,37.0,3196.0,0.0,0.0,0.0,37.0,0.000000,1.00000
15,15,breast-w,1,1,active,ARFF,458.0,2.0,241.0,2.0,10.0,699.0,16.0,16.0,9.0,1.0,0.900000,0.97711
22,22,mfeat-zernike,1,1,active,ARFF,200.0,10.0,200.0,10.0,48.0,2000.0,0.0,0.0,47.0,1.0,0.979167,1.00000
23,23,cmc,1,1,active,ARFF,629.0,4.0,333.0,3.0,10.0,1473.0,0.0,0.0,2.0,8.0,0.200000,1.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45537,45537,Contaminant-detection-in-packaged-cocoa-hazeln...,1,33136,active,arff,,,,0.0,31.0,2400.0,0.0,0.0,31.0,0.0,1.000000,1.00000
45538,45538,Contaminant-detection-in-packaged-cocoa-hazeln...,1,33136,active,arff,,,,0.0,31.0,2400.0,0.0,0.0,31.0,0.0,1.000000,1.00000
45539,45539,Contaminant-detection-in-packaged-cocoa-hazeln...,1,33136,active,arff,,,,0.0,31.0,2400.0,0.0,0.0,31.0,0.0,1.000000,1.00000
45540,45540,Contaminant-detection-in-packaged-cocoa-hazeln...,1,33136,active,arff,,,,0.0,31.0,2400.0,0.0,0.0,31.0,0.0,1.000000,1.00000
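
The table above still contains rows such as anneal (`FracClean` of 0) and kr-vs-kp (`FracNumeric` of 0), so the 90% clean and 80% numeric targets have not been enforced yet. A minimal sketch of that last step follows, assuming the thresholds are meant as strict "greater than" comparisons. To keep it runnable without a network call, it rebuilds a small frame from five rows of the listing; `sample`, `selected`, `MIN_FRAC_CLEAN` and `MIN_FRAC_NUMERIC` are hypothetical names introduced for illustration, not part of the original notebook.

```python
import pandas as pd

# Five rows copied from the dataset listing, standing in for availDatasets
sample = pd.DataFrame(
    {
        "did": [2, 3, 15, 22, 23],
        "name": ["anneal", "kr-vs-kp", "breast-w", "mfeat-zernike", "cmc"],
        "NumberOfFeatures": [39.0, 37.0, 10.0, 48.0, 10.0],
        "NumberOfInstances": [898.0, 3196.0, 699.0, 2000.0, 1473.0],
        "NumberOfInstancesWithMissingValues": [898.0, 0.0, 16.0, 0.0, 0.0],
        "NumberOfNumericFeatures": [6.0, 0.0, 9.0, 47.0, 2.0],
    }
)

# Same derived columns as in the notebook cell
sample["FracNumeric"] = sample["NumberOfNumericFeatures"] / sample["NumberOfFeatures"]
sample["FracClean"] = 1 - sample["NumberOfInstancesWithMissingValues"] / sample["NumberOfInstances"]

# Assumed thresholds, taken from the prose (over 90% clean, over 80% numeric)
MIN_FRAC_CLEAN = 0.90
MIN_FRAC_NUMERIC = 0.80

# Keep only rows that pass both thresholds
selected = sample[
    (sample["FracClean"] > MIN_FRAC_CLEAN)
    & (sample["FracNumeric"] > MIN_FRAC_NUMERIC)
]
print(selected[["did", "name", "FracNumeric", "FracClean"]])
```

The same boolean mask could be applied to `availDatasets` directly. Of the five sample rows, only breast-w and mfeat-zernike pass both thresholds.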
