Type of datasets supported by LAC

bugfoo edited this page Nov 20, 2019 · 1 revision

LAC supports multiple input dataset formats, which eases integration with existing tools that impose a specific format. The current version of LAC works with the CSV, ARFF, and KEEL formats without any pre-processing step to convert between them. No configuration is required to choose one input format or another: LAC detects the dataset format automatically from the file extension. Each format is briefly described below, together with sources of datasets in that format.
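
As a rough illustration of extension-based detection, the following Python sketch maps a file name to one of the three formats. It is not LAC's own code: the names `detect_format` and `FORMAT_BY_EXTENSION` are hypothetical, and matching the extension case-insensitively is an assumption.

```python
from pathlib import Path

# Extension-to-format table taken from the descriptions on this page.
# The names here are illustrative and are not LAC identifiers.
FORMAT_BY_EXTENSION = {
    ".csv": "CSV",
    ".arff": "ARFF",
    ".dat": "KEEL",
}


def detect_format(path: str) -> str:
    """Return the dataset format implied by the file extension."""
    # Assumption: extensions are compared case-insensitively.
    extension = Path(path).suffix.lower()
    try:
        return FORMAT_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(f"unsupported dataset extension: {extension!r}") from None


print(detect_format("weather.arff"))  # ARFF
```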

  • CSV format. A Comma-Separated Values (CSV) file is a delimited plain-text file that stores tabular data (numbers and text). Each line of the file is a data record, and each record consists of one or more fields separated by commas, which is where the format gets its name. At the moment, LAC assumes that the very first row is the header of the dataset and does not treat it as data. The type of each attribute is inferred from all of its values. The class is identified through the header: the attribute named class is used, and when no attribute has that name, the last column is used as the class. The file extension for this format should be .csv. An example follows.

     outlook, humidity, windy, temperature, play
     sunny, high, low, 25, yes
     overcast, low, low, 14, no
     rainy, mild, high, 0, no
    

    In this example, since no attribute is named class, the last column, play, is used as the class. Although this format is not widely used in the research community, it is one of the most common when approaching real-world problems because it is very easy to generate. Datasets in this format can be downloaded from https://www.mldata.io/datasets/. Additionally, the LAC repository provides a small example of this kind of file at https://github.com/kdis-lab/lac/tree/v0.2.0/doc/examples/dataset.csv.

  • ARFF format. An ARFF file is a text file that describes a list of instances sharing a set of attributes. ARFF files have two distinct sections: the header, followed by the data. The header contains the name of the relation (that is, the name of the dataset) and a list of attributes with their respective types. Each line of the header must start with the '@' character. The header ends at the @data line (written @DATA in the example below), after which all the data is found. Each data line represents a different instance, with attribute values separated by commas. This format is well known thanks to the Weka tool. The file extension for this format should be .arff. An example follows.

     @RELATION weather.tennis
     @ATTRIBUTE outlook {sunny, overcast, rainy}
     @ATTRIBUTE humidity {high, low, mild}
     @ATTRIBUTE windy {low, high}
     @ATTRIBUTE temperature NUMERIC
     @ATTRIBUTE play {yes, no}
     @DATA
     sunny, high, low, 25, yes
     overcast, low, low, 14, no
     rainy, mild, high, 0, no
    

    One of the best sources for this kind of dataset is the UCI repository, https://archive.ics.uci.edu/ml/datasets.php. This repository shares more than 400 datasets, many of which are well known in the research community, and many AC algorithms have been run on them with very good results. Additionally, the LAC repository provides a small example of this kind of file at https://github.com/kdis-lab/lac/tree/v0.2.0/doc/examples/dataset.arff.

  • KEEL format. A KEEL file is a text file that describes a list of instances. The format is based on ARFF with a few modifications. First, the numeric type is replaced by integer or real. Second, the interval of valid values is given for attributes of those types. Finally, it requires specifying which attributes are inputs and which are outputs. As in ARFF, files have two distinct sections: the header, followed by the data. Likewise, each line of the header must start with the '@' character, and the header ends at the @data line (written @DATA in the example below), after which all the data is found. Each data line represents a different instance, with attribute values separated by commas. The file extension for this format should be .dat. An example follows.

     @RELATION weather.tennis
     @ATTRIBUTE outlook {sunny, overcast, rainy}
     @ATTRIBUTE humidity {high, low, mild}
     @ATTRIBUTE windy {low, high}
     @ATTRIBUTE temperature integer [0,25]
     @ATTRIBUTE play {yes, no}
     @INPUTS outlook, humidity, windy, temperature
     @OUTPUTS play
     @DATA
     sunny, high, low, 25, yes
     overcast, low, low, 14, no
     rainy, mild, high, 0, no
    

    The best source for this kind of dataset is the KEEL repository, http://keel.es/datasets.php. It has almost 100 datasets for the classification task. Additionally, the LAC repository provides a small example of this kind of file at https://github.com/kdis-lab/lac/tree/v0.2.0/doc/examples/dataset.dat.
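
The CSV conventions described above (header in the first row, types inferred from all values, class taken from the column named class or else the last column) can be sketched in Python as follows. This is only an approximation under stated assumptions, not LAC's implementation: the function names and the "numeric"/"nominal" labels are hypothetical, the match on the name class is assumed to be exact, and a value counts as numeric here if Python's `float` accepts it.

```python
import csv
import io


def _is_number(value: str) -> bool:
    """Assumption: a value is numeric if float() can parse it."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def read_csv_dataset(text: str):
    """Parse CSV text: header first, class column by name or last column."""
    rows = [
        [field.strip() for field in row]
        for row in csv.reader(io.StringIO(text))
        if any(field.strip() for field in row)  # skip blank lines
    ]
    header, records = rows[0], rows[1:]

    # Use the column named "class" when present, otherwise the last column.
    class_index = header.index("class") if "class" in header else len(header) - 1

    # An attribute is numeric only if every one of its values is numeric.
    types = {}
    for index, name in enumerate(header):
        values = [record[index] for record in records]
        types[name] = "numeric" if all(_is_number(v) for v in values) else "nominal"

    return header, records, header[class_index], types


sample = """outlook, humidity, windy, temperature, play
sunny, high, low, 25, yes
overcast, low, low, 14, no
rainy, mild, high, 0, no
"""
header, records, class_name, types = read_csv_dataset(sample)
print(class_name)            # play
print(types["temperature"])  # numeric
```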

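The header layout shared by ARFF and KEEL can be read with a small line-oriented parser. The Python sketch below is illustrative only and is not how LAC necessarily parses these files: `parse_header` and `numeric_range` are hypothetical names, keywords are matched case-insensitively as an assumption, and quoted attribute names containing spaces are not handled.

```python
import re


def parse_header(text: str):
    """Collect attribute declarations and data rows from ARFF or KEEL text."""
    attributes = {}           # attribute name -> declared type string
    inputs, outputs = [], []  # only filled for KEEL files
    data = []
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if in_data:
            data.append([value.strip() for value in line.split(",")])
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()  # assumption: keywords are case-insensitive
        rest = rest.strip()
        if keyword == "@attribute":
            name, _, declared_type = rest.partition(" ")
            attributes[name] = declared_type.strip()
        elif keyword == "@inputs":
            inputs = [name.strip() for name in rest.split(",")]
        elif keyword == "@outputs":
            outputs = [name.strip() for name in rest.split(",")]
        elif keyword == "@data":
            in_data = True
    return attributes, inputs, outputs, data


def numeric_range(declared_type: str):
    """Return (kind, low, high) for a KEEL type such as 'integer [0,25]'."""
    pattern = r"(integer|real)\s*\[\s*(\S+?)\s*,\s*(\S+?)\s*\]"
    match = re.fullmatch(pattern, declared_type)
    if match is None:
        return None  # nominal attribute or plain ARFF type
    kind, low, high = match.groups()
    return kind, float(low), float(high)


keel_sample = """@RELATION weather.tennis
@ATTRIBUTE outlook {sunny, overcast, rainy}
@ATTRIBUTE temperature integer [0,25]
@ATTRIBUTE play {yes, no}
@INPUTS outlook, temperature
@OUTPUTS play
@DATA
sunny, 25, yes
rainy, 0, no
"""
attributes, inputs, outputs, data = parse_header(keel_sample)
print(outputs)                                    # ['play']
print(numeric_range(attributes["temperature"]))   # ('integer', 0.0, 25.0)
```

For an ARFF file the same function returns empty `inputs` and `outputs`, since those declarations exist only in KEEL.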