# openml

[openml.org](https://openml.org) is an open platform for sharing datasets, algorithms, and machine learning experiments on tabular data. It is built on four main concepts:
* **Dataset:** $\;$ a tabular dataset
* **Task:** $\;$ a dataset together with the learning task to perform and the evaluation method
* **Flow:** $\;$ a machine learning pipeline, with details of the software to use and the hyperparameters to tune
* **Run:** $\;$ an experiment that evaluates a flow on a task
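
To make the relationships between these four concepts concrete, the sketch below models them as plain Python dataclasses. It is a toy data model, not the classes of the `openml` package: the class and field names are assumptions chosen for illustration, and the flow name and its hyperparameter are hypothetical. The dataset and task identifiers are those of `electricity` as listed in the tables further down.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    did: int   # OpenML dataset ID
    name: str


@dataclass
class Task:
    tid: int                    # OpenML task ID
    dataset: Dataset            # the data the task is defined on
    target_feature: str         # column to predict
    estimation_procedure: str   # how train/test splits are made
    evaluation_measure: str     # how predictions are scored


@dataclass
class Flow:
    name: str                                       # software component (hypothetical here)
    hyperparameters: dict = field(default_factory=dict)


@dataclass
class Run:
    task: Task
    flow: Flow
    score: Optional[float] = None   # filled in only after the flow has been evaluated


electricity = Dataset(did=44156, name="electricity")
task = Task(
    tid=361110,
    dataset=electricity,
    target_feature="class",
    estimation_procedure="10-fold Crossvalidation",
    evaluation_measure="predictive_accuracy",
)
# Hypothetical flow: any pipeline description would do.
flow = Flow(name="decision-tree-pipeline", hyperparameters={"max_depth": 5})
run = Run(task=task, flow=flow)  # no score yet: nothing has been executed
print(run.task.dataset.name, run.task.evaluation_measure, run.score)
```

The nesting mirrors the definitions above: a run points to one task and one flow, and a task points to one dataset.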

Datasets can be browsed and selected in the [datasets](https://openml.org/search?type=data) section. The chosen datasets can be downloaded directly or with sklearn's [fetch_openml](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml) function. In general, however, it is preferable to pick datasets that other users have already selected (according to some specific criterion) and published in the [benchmarks](https://openml.org/search?type=benchmark) section. In particular, three recent benchmark suites for comparing and evaluating classification techniques stand out:
* **OpenML-CC18 Curated Classification benchmark:** $\;$ $72$ datasets, from [Bahri et al., 2022](https://arxiv.org/abs/2106.15147)
* **Tabular benchmark categorical classification:** $\;$ $7$ datasets, from [Grinsztajn et al., 2022](https://arxiv.org/abs/2207.08815)
* **AutoML Benchmark All Classification:** $\;$ $71$ datasets, from [Gijsbers et al., 2019](https://arxiv.org/abs/1907.00909)
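
The suite IDs used in the code cell further down (99, 334 and 271) can be kept next to the dataset counts quoted in the list above. The following is a minimal sketch: `SUITES` and `fetch_suite` are hypothetical names, not part of the `openml` package, and `fetch_suite` needs `openml` installed plus network access, so it is defined here but not called.

```python
# Suite IDs and dataset counts exactly as quoted in this notebook.
SUITES = {
    "OpenML-CC18": {"suite_id": 99, "n_datasets": 72},
    "Tabular benchmark categorical classification": {"suite_id": 334, "n_datasets": 7},
    "AutoML Benchmark All Classification": {"suite_id": 271, "n_datasets": 71},
}


def fetch_suite(name):
    """Download the metadata of one suite listed in SUITES (hypothetical helper)."""
    import openml  # imported lazily so the dictionary above stays usable offline

    return openml.study.get_suite(suite_id=SUITES[name]["suite_id"])


# Print a small overview without contacting the server.
for name, info in SUITES.items():
    print(f"{info['suite_id']:>4}  {info['n_datasets']:>3} datasets  {name}")
```

Calling `fetch_suite("Tabular benchmark categorical classification")` would then issue the same `openml.study.get_suite(suite_id=334)` request that the notebook makes below.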



In [None]:
# Note: only if you are running this notebook in Google Colab, uncomment the following line and run it:
# !pip install openml

In [3]:
import openml
# Suite IDs: OpenML-CC18 = 99; Tabular benchmark = 334; AutoML Benchmark = 271
benchmark_suite = openml.study.get_suite(suite_id=334)
benchmark_suite

OpenML Benchmark Suite
ID..............: 334
Name............: Tabular benchmark categorical classification
Status..........: in_preparation
Main Entity Type: task
Study URL.......: https://www.openml.org/s/334
# of Data.......: 7
# of Tasks......: 7
Creator.........: https://www.openml.org/u/26324
Upload Time.....: 2023-01-16 03:22:41

In [4]:
openml.datasets.list_datasets(data_id=benchmark_suite.data, output_format='dataframe')

Unnamed: 0,did,name,version,uploader,status,format,MajorityClassSize,MinorityClassSize,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures
44156,44156,electricity,13,26324,active,arff,19237.0,19237.0,2.0,9.0,38474.0,0.0,0.0,7.0,2.0
44157,44157,eye_movements,8,26324,active,arff,3804.0,3804.0,2.0,24.0,7608.0,0.0,0.0,20.0,4.0
44159,44159,covertype,13,26324,active,arff,211840.0,211840.0,2.0,55.0,423680.0,0.0,0.0,10.0,45.0
45035,45035,albert,2,26324,active,arff,29126.0,29126.0,2.0,32.0,58252.0,0.0,0.0,21.0,11.0
45036,45036,default-of-credit-card-clients,4,26324,active,arff,6636.0,6636.0,2.0,22.0,13272.0,0.0,0.0,20.0,2.0
45038,45038,road-safety,7,26324,active,arff,55881.0,55881.0,2.0,33.0,111762.0,0.0,0.0,29.0,4.0
45039,45039,compas-two-years,5,26324,active,arff,2483.0,2483.0,2.0,12.0,4966.0,0.0,0.0,3.0,9.0
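
In every row of the table above, `MajorityClassSize` equals `MinorityClassSize`, so the two classes are perfectly balanced. One consequence is that a classifier that always predicts the same class reaches 50% accuracy, which is a useful reference point when reading results on this suite. The sketch below checks this from the numbers in the table, copied by hand into a small CSV string; the variable names are illustrative.

```python
import csv
import io

# Columns copied from the dataset table above (only those needed here).
TABLE = """did,name,MajorityClassSize,NumberOfInstances
44156,electricity,19237,38474
44157,eye_movements,3804,7608
44159,covertype,211840,423680
45035,albert,29126,58252
45036,default-of-credit-card-clients,6636,13272
45038,road-safety,55881,111762
45039,compas-two-years,2483,4966
"""

rows = list(csv.DictReader(io.StringIO(TABLE)))

# Accuracy of always predicting the majority class = majority size / total size.
baseline = {
    row["name"]: int(row["MajorityClassSize"]) / int(row["NumberOfInstances"])
    for row in rows
}

for name, acc in baseline.items():
    print(f"{name:32s} {acc:.3f}")
```

Any model evaluated on these tasks should therefore be compared against 0.5 rather than against 0.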


In [5]:
openml.tasks.list_tasks(task_id=benchmark_suite.tasks, output_format="dataframe")

Unnamed: 0,tid,ttid,did,name,task_type,status,estimation_procedure,evaluation_measures,source_data,target_feature,MajorityClassSize,MinorityClassSize,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures
361110,361110,TaskType.SUPERVISED_CLASSIFICATION,44156,electricity,Supervised Classification,active,10-fold Crossvalidation,predictive_accuracy,44156,class,19237,19237,2,9,38474,0,0,7,2
361111,361111,TaskType.SUPERVISED_CLASSIFICATION,44157,eye_movements,Supervised Classification,active,10-fold Crossvalidation,predictive_accuracy,44157,label,3804,3804,2,24,7608,0,0,20,4
361113,361113,TaskType.SUPERVISED_CLASSIFICATION,44159,covertype,Supervised Classification,active,10-fold Crossvalidation,predictive_accuracy,44159,class,211840,211840,2,55,423680,0,0,10,45
361282,361282,TaskType.SUPERVISED_CLASSIFICATION,45035,albert,Supervised Classification,active,10-fold Crossvalidation,predictive_accuracy,45035,class,29126,29126,2,32,58252,0,0,21,11
361283,361283,TaskType.SUPERVISED_CLASSIFICATION,45036,default-of-credit-card-clients,Supervised Classification,active,10-fold Crossvalidation,predictive_accuracy,45036,y,6636,6636,2,22,13272,0,0,20,2
361285,361285,TaskType.SUPERVISED_CLASSIFICATION,45038,road-safety,Supervised Classification,active,10-fold Crossvalidation,predictive_accuracy,45038,SexofDriver,55881,55881,2,33,111762,0,0,29,4
361286,361286,TaskType.SUPERVISED_CLASSIFICATION,45039,compas-two-years,Supervised Classification,active,10-fold Crossvalidation,predictive_accuracy,45039,twoyearrecid,2483,2483,2,12,4966,0,0,3,9
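
Every task in the table above uses `10-fold Crossvalidation` as its estimation procedure and `predictive_accuracy` as its evaluation measure. As a rough illustration of what those two terms mean, the pure-Python sketch below splits sample indices into 10 contiguous folds and computes accuracy as the fraction of correct predictions. The function names `k_fold_indices` and `accuracy` are hypothetical, the folds are neither shuffled nor stratified, the splits that OpenML stores for each task are not reproduced, and the labels in the accuracy example are made up.

```python
def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds.

    The first (n_samples % n_folds) folds get one extra sample so that
    every sample lands in exactly one test fold.
    """
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Fold sizes for compas-two-years, which has 4966 instances in the table.
fold_sizes = [len(test_idx) for _, test_idx in k_fold_indices(4966, n_folds=10)]
print(fold_sizes)

# Made-up labels: three of the four predictions are correct.
toy_accuracy = accuracy(["UP", "DOWN", "UP", "DOWN"], ["UP", "UP", "UP", "DOWN"])
print(toy_accuracy)
```

In a full evaluation the model is trained on each `train_idx`, scored on the matching `test_idx`, and the ten accuracies are averaged.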


## Downloading and displaying openml datasets

To access and retrieve any of the datasets listed in the table above (e.g. [`electricity`](https://openml.org/search?type=data&id=44156&sort=runs&status=active)):

In [6]:
from sklearn.datasets import fetch_openml
# data_id is the OpenML ID of the "electricity" dataset, which has 8 features plus the class column (9 columns in total) and 2 classes to predict.
data_id = 44156
ret = fetch_openml(data_id=data_id, parser="auto")
print(ret.DESCR)

Dataset used in the tabular data benchmark https://github.com/LeoGrin/tabular-benchmark,  
                          transformed in the same way. This dataset belongs to the "classification on categorical and
                          numerical features" benchmark. Original description: 
 
**Author**: M. Harries, J. Gama, A. Bifet  
**Source**: [Joao Gama](http://www.inescporto.pt/~jgama/ales/ales_5.html) - 2009  
**Please cite**: None  

**Electricity** is a widely used dataset described by M. Harries and analyzed by J. Gama (see papers below). This data was collected from the Australian New South Wales Electricity Market. In this market, prices are not fixed and are affected by demand and supply of the market. They are set every five minutes. Electricity transfers to/from the neighboring state of Victoria were done to alleviate fluctuations.

The dataset (originally named ELEC2) contains 45,312 instances dated from 7 May 1996 to 5 December 1998. Each example of the dataset refers to 

We display the dataset with the `pandas` library:

In [7]:
import pandas as pd
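# Build a DataFrame from the feature matrix, then append the target as a categorical column.
df = pd.DataFrame(data=ret.data, columns=ret.feature_names)
# The target column is labelled 'species' here, although for this dataset it holds the class label (e.g. DOWN).
df['species'] = pd.Series(ret.target, dtype='category')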
print(df)

           date day    period  nswprice  nswdemand  vicprice  vicdemand  \
0      0.898987   2  0.957447  0.068632   0.568283  0.004456   0.456499   
1      0.867616   5  0.234043  0.033716   0.337102  0.001672   0.329622   
2      0.009159   6  0.255319  0.059175   0.185808  0.003467   0.422915   
3      0.898987   2  0.531915  0.087577   0.539572  0.004936   0.637752   
4      0.868280   6  0.085106  0.027021   0.165129  0.001271   0.265924   
...         ...  ..       ...       ...        ...       ...        ...   
38469  0.915800   6  0.404255  0.077549   0.456263  0.005332   0.378560   
38470  0.915800   6  0.425532  0.074397   0.444213  0.005110   0.377525   
38471  0.915800   6  0.468085  0.072835   0.423386  0.005019   0.354480   
38472  0.915800   6  0.829787  0.065420   0.353913  0.004508   0.319524   
38473  0.915800   6  0.978723  0.066651   0.329366  0.004630   0.345417   

       transfer species  
0      0.644737    DOWN  
1      0.846930    DOWN  
2      0.414912    DO