# openml

[openml.org](https://openml.org) is an open platform for sharing datasets, algorithms, and machine learning experiments on tabular data. It is built around four main concepts:
* **Dataset:** $\;$ a set of tabular data
* **Task:** $\;$ a dataset together with the learning task to perform and the evaluation method
* **Flow:** $\;$ a machine learning pipeline, with details on the software to use and the hyperparameters to tune
* **Run:** $\;$ an experiment that evaluates a flow on a task

Datasets can be browsed and selected in the [datasets](https://openml.org/search?type=data) section. The chosen datasets can be downloaded directly or with sklearn's [fetch_openml](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml) function. In general, however, it is preferable to pick datasets that other users have already selected (according to some specific criterion) and published in the [benchmarks](https://openml.org/search?type=benchmark) section. In particular, three recent "benchmark suites" for comparing and evaluating classification techniques stand out:
* **OpenML-CC18 Curated Classification benchmark:** $\;$ $72$ datasets, from [Bahri et al, 2022](https://arxiv.org/abs/2106.15147)
* **Tabular benchmark categorical classification:** $\;$ $7$ datasets, from [Grinsztajn et al, 2022](https://arxiv.org/abs/2207.08815)
* **AutoML Benchmark All Classification:** $\;$ $71$ datasets, from [Gijsbers et al, 2019](https://arxiv.org/abs/1907.00909)
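
The `fetch_openml` route mentioned above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the helper name `load_openml_dataset` is hypothetical, and it assumes that the dataset declares a default target column on OpenML and that pandas is installed (required by `as_frame=True`).

```python
def load_openml_dataset(data_id):
    """Download one OpenML dataset by numeric ID and return (X, y).

    Hypothetical helper for illustration; assumes the dataset declares
    a default target column on OpenML.
    """
    # Imported here so that defining the helper does not itself require
    # scikit-learn; the import happens on the first call.
    from sklearn.datasets import fetch_openml

    # as_frame=True asks for pandas objects; return_X_y=True splits the
    # default target column off from the features.
    X, y = fetch_openml(data_id=data_id, as_frame=True, return_X_y=True)
    return X, y
```

For instance, `X, y = load_openml_dataset(44156)` would request the `electricity` dataset that belongs to suite 334. The first call needs network access; scikit-learn caches the download locally by default.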



In [5]:
import openml
# Suite IDs: OpenML-CC18 = 99; Tabular benchmark = 334; AutoML Benchmark = 271
benchmark_suite = openml.study.get_suite(suite_id=334)
benchmark_suite

OpenML Benchmark Suite
ID..............: 334
Name............: Tabular benchmark categorical classification
Status..........: in_preparation
Main Entity Type: task
Study URL.......: https://www.openml.org/s/334
# of Data.......: 7
# of Tasks......: 7
Creator.........: https://www.openml.org/u/26324
Upload Time.....: 2023-01-16 03:22:41

In [6]:
openml.datasets.list_datasets(data_id=benchmark_suite.data, output_format='dataframe')

| did | name | version | uploader | status | format | MajorityClassSize | MinorityClassSize | NumberOfClasses | NumberOfFeatures | NumberOfInstances | NumberOfInstancesWithMissingValues | NumberOfMissingValues | NumberOfNumericFeatures | NumberOfSymbolicFeatures |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44156 | electricity | 13 | 26324 | active | arff | 19237.0 | 19237.0 | 2.0 | 9.0 | 38474.0 | 0.0 | 0.0 | 7.0 | 2.0 |
| 44157 | eye_movements | 8 | 26324 | active | arff | 3804.0 | 3804.0 | 2.0 | 24.0 | 7608.0 | 0.0 | 0.0 | 20.0 | 4.0 |
| 44159 | covertype | 13 | 26324 | active | arff | 211840.0 | 211840.0 | 2.0 | 55.0 | 423680.0 | 0.0 | 0.0 | 10.0 | 45.0 |
| 45035 | albert | 2 | 26324 | active | arff | 29126.0 | 29126.0 | 2.0 | 32.0 | 58252.0 | 0.0 | 0.0 | 21.0 | 11.0 |
| 45036 | default-of-credit-card-clients | 4 | 26324 | active | arff | 6636.0 | 6636.0 | 2.0 | 22.0 | 13272.0 | 0.0 | 0.0 | 20.0 | 2.0 |
| 45038 | road-safety | 7 | 26324 | active | arff | 55881.0 | 55881.0 | 2.0 | 33.0 | 111762.0 | 0.0 | 0.0 | 29.0 | 4.0 |
| 45039 | compas-two-years | 5 | 26324 | active | arff | 2483.0 | 2483.0 | 2.0 | 12.0 | 4966.0 | 0.0 | 0.0 | 3.0 | 9.0 |
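
The dataset listing above can be summarized offline to check two properties that matter when comparing classifiers: whether the classes are balanced and how large the share of symbolic (categorical) features is. The sketch below is an illustration rather than part of the original notebook. It copies a subset of the columns from the listing into a string, and the names `LISTING` and `summarize` are hypothetical.

```python
import csv
import io

# Subset of the columns from the dataset listing above, copied verbatim
# (sizes written as integers).
LISTING = """\
did,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfSymbolicFeatures
44156,electricity,19237,19237,9,38474,2
44157,eye_movements,3804,3804,24,7608,4
44159,covertype,211840,211840,55,423680,45
45035,albert,29126,29126,32,58252,11
45036,default-of-credit-card-clients,6636,6636,22,13272,2
45038,road-safety,55881,55881,33,111762,4
45039,compas-two-years,2483,2483,12,4966,9
"""


def summarize(listing):
    """Return one dict per dataset with size, class balance and symbolic share."""
    summary = []
    for row in csv.DictReader(io.StringIO(listing)):
        summary.append({
            "name": row["name"],
            "instances": int(row["NumberOfInstances"]),
            # Balanced means both classes have the same number of instances.
            "balanced": int(row["MajorityClassSize"]) == int(row["MinorityClassSize"]),
            # Fraction of the reported features that are symbolic.
            "symbolic_share": int(row["NumberOfSymbolicFeatures"]) / int(row["NumberOfFeatures"]),
        })
    return summary


summary = summarize(LISTING)
# Print the datasets from smallest to largest.
for entry in sorted(summary, key=lambda e: e["instances"]):
    print(f"{entry['name']:32s} {entry['instances']:7d} "
          f"balanced={entry['balanced']} symbolic={entry['symbolic_share']:.2f}")
```

Every dataset in this suite reports equal majority and minority class sizes, so a classifier that always predicts the same class would score 50% accuracy on each of them.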


In [7]:
openml.tasks.list_tasks(task_id=benchmark_suite.tasks, output_format="dataframe")

| tid | ttid | did | name | task_type | status | estimation_procedure | evaluation_measures | source_data | target_feature | MajorityClassSize | MinorityClassSize | NumberOfClasses | NumberOfFeatures | NumberOfInstances | NumberOfInstancesWithMissingValues | NumberOfMissingValues | NumberOfNumericFeatures | NumberOfSymbolicFeatures |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 361110 | TaskType.SUPERVISED_CLASSIFICATION | 44156 | electricity | Supervised Classification | active | 10-fold Crossvalidation | predictive_accuracy | 44156 | class | 19237 | 19237 | 2 | 9 | 38474 | 0 | 0 | 7 | 2 |
| 361111 | TaskType.SUPERVISED_CLASSIFICATION | 44157 | eye_movements | Supervised Classification | active | 10-fold Crossvalidation | predictive_accuracy | 44157 | label | 3804 | 3804 | 2 | 24 | 7608 | 0 | 0 | 20 | 4 |
| 361113 | TaskType.SUPERVISED_CLASSIFICATION | 44159 | covertype | Supervised Classification | active | 10-fold Crossvalidation | predictive_accuracy | 44159 | class | 211840 | 211840 | 2 | 55 | 423680 | 0 | 0 | 10 | 45 |
| 361282 | TaskType.SUPERVISED_CLASSIFICATION | 45035 | albert | Supervised Classification | active | 10-fold Crossvalidation | predictive_accuracy | 45035 | class | 29126 | 29126 | 2 | 32 | 58252 | 0 | 0 | 21 | 11 |
| 361283 | TaskType.SUPERVISED_CLASSIFICATION | 45036 | default-of-credit-card-clients | Supervised Classification | active | 10-fold Crossvalidation | predictive_accuracy | 45036 | y | 6636 | 6636 | 2 | 22 | 13272 | 0 | 0 | 20 | 2 |
| 361285 | TaskType.SUPERVISED_CLASSIFICATION | 45038 | road-safety | Supervised Classification | active | 10-fold Crossvalidation | predictive_accuracy | 45038 | SexofDriver | 55881 | 55881 | 2 | 33 | 111762 | 0 | 0 | 29 | 4 |
| 361286 | TaskType.SUPERVISED_CLASSIFICATION | 45039 | compas-two-years | Supervised Classification | active | 10-fold Crossvalidation | predictive_accuracy | 45039 | twoyearrecid | 2483 | 2483 | 2 | 12 | 4966 | 0 | 0 | 3 | 9 |
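
Once the tasks of a suite are known, each one can be resolved to its dataset and target column. The generator below is a minimal sketch of that step and not part of the original notebook: the name `iter_suite_tasks` is hypothetical, and it assumes that every task in the suite is a supervised task, so that `task.target_name` is defined (as is the case for the seven tasks listed above).

```python
def iter_suite_tasks(suite_id):
    """Yield (task_id, dataset_name, X, y) for every task in an OpenML suite.

    Hypothetical helper for illustration; assumes all tasks are supervised.
    """
    # Imported lazily: openml is only needed once the generator is consumed.
    import openml

    suite = openml.study.get_suite(suite_id)
    for task_id in suite.tasks:
        task = openml.tasks.get_task(task_id)
        dataset = task.get_dataset()
        # get_data returns the features, the target, a per-column categorical
        # mask and the feature names; only the first two are passed on here.
        X, y, _categorical, _names = dataset.get_data(target=task.target_name)
        yield task_id, dataset.name, X, y
```

Looping with `for task_id, name, X, y in iter_suite_tasks(334): print(task_id, name, X.shape)` would download the seven datasets one at a time. Because this is a generator, only one dataset needs to be held in memory at once, which helps with the larger ones such as `covertype` (423680 instances).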
