OPF file format for datasets

Gustavo Rosa edited this page Mar 15, 2017 · 5 revisions

LibDEEP uses the same dataset format as LibOPF, so the tools shipped with LibOPF also work for LibDEEP datasets. The LibOPF package contains a directory, LibOPF/tools, with the following tools:

  • txt2opf: a program to convert OPF files written in ASCII format to binary format.

  • opf2txt: a program to convert OPF files written in binary format to ASCII format.

  • opf_check: a program to check whether a file is in the OPF required format.

  • opf2svm: a program to convert binary OPF files to LibSVM format.

  • svm2opf: a program to convert LibSVM files to binary OPF format.


The original dataset and its parts (the training, evaluation and test sets) must be in the following BINARY file format:

<# of samples> <# of labels> <# of features>
<0> <label> <feature 1 from element 0> <feature 2 from element 0> ...
<1> <label> <feature 1 from element 1> <feature 2 from element 1> ...
.
.
<i> <label> <feature 1 from element i> <feature 2 from element i> ...
<i+1> <label> <feature 1 from element i+1> <feature 2 from element i+1> ...
.
.
<n-1> <label> <feature 1 from element n-1> <feature 2 from element n-1> ... 

The first number of each line, <0>, <1>, ... <n-1>, is a sample identifier (for n samples in the dataset). It is only used in the case of precomputed distances, but it must be specified in every file regardless. For unlabeled datasets (unsupervised OPF), please use label 0 for all samples.
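The rules above can be checked on the ASCII representation before converting it. The following Python sketch is only an illustration and is not the opf_check tool: the function name check_opf_text is hypothetical, and it assumes that identifiers appear in increasing order, as in the layout shown above.

```python
def check_opf_text(text):
    """Return a list of problems found in an ASCII OPF listing (empty if none)."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    if not lines or len(lines[0]) != 3:
        return ["header must hold three integers: samples, labels, features"]
    try:
        n_samples, n_labels, n_features = (int(tok) for tok in lines[0])
    except ValueError:
        return ["header values must be integers"]

    problems = []
    rows = lines[1:]
    if len(rows) != n_samples:
        problems.append(f"expected {n_samples} samples, found {len(rows)}")

    for index, row in enumerate(rows):
        # Each row holds: identifier, label, then n_features feature values.
        if len(row) != 2 + n_features:
            found = max(len(row) - 2, 0)
            problems.append(
                f"sample {index}: expected {n_features} features, found {found}"
            )
            continue
        try:
            sample_id = int(row[0])
            int(row[1])                      # label must be an integer
            [float(tok) for tok in row[2:]]  # features must be numeric
        except ValueError:
            problems.append(f"sample {index}: non-numeric value")
            continue
        # Assumption: identifiers run 0..n-1 in file order.
        if sample_id != index:
            problems.append(
                f"sample {index}: identifier is {sample_id}, expected {index}"
            )
    return problems
```

An empty list means the listing is consistent with its header; the number of labels is read but not otherwise checked here.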

Example: Suppose that you have a dataset with 5 samples distributed into 3 classes, with 2 samples from label 1, 2 samples from label 2 and 1 sample from label 3. Each sample is represented by a feature vector of size 2. The OPF file should then look as follows:

5 3 2
0 1 0.21 0.45
1 1 0.22 0.43
2 2 0.67 1.12
3 2 0.60 1.11
4 3 0.79 0.04

Comment #1: Note that the file must be binary, with no blank spaces. The ASCII representation above is for illustration only.

Comment #2: The first line of the file, 5 3 2, contains, respectively, the dataset size, the number of labels (classes) and the number of features in each feature vector. Each remaining line contains the sample identifier (an integer from 0 to n-1, where n is the dataset size), the sample's label and its feature values.
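To make the binary layout concrete, the Python sketch below packs the example dataset into bytes and reads it back. This page does not state the field widths or byte order, so the code assumes 32-bit little-endian integers for the header, identifiers and labels, and 32-bit floats for the features. The function names pack_opf and unpack_opf are hypothetical; for real datasets, txt2opf and opf2txt remain the reference.

```python
import struct


def pack_opf(n_labels, samples):
    """Pack (label, features) tuples into bytes; identifiers are assigned 0..n-1."""
    n_features = len(samples[0][1]) if samples else 0
    # Header: number of samples, number of labels, number of features.
    out = bytearray(struct.pack("<3i", len(samples), n_labels, n_features))
    for sample_id, (label, features) in enumerate(samples):
        if len(features) != n_features:
            raise ValueError(f"sample {sample_id} has {len(features)} features")
        # Assumed record layout: int32 id, int32 label, float32 features.
        out += struct.pack(f"<2i{n_features}f", sample_id, label, *features)
    return bytes(out)


def unpack_opf(data):
    """Inverse of pack_opf: return (n_labels, [(id, label, features), ...])."""
    n_samples, n_labels, n_features = struct.unpack_from("<3i", data, 0)
    record = struct.Struct(f"<2i{n_features}f")
    offset = struct.calcsize("<3i")
    samples = []
    for _ in range(n_samples):
        sample_id, label, *features = record.unpack_from(data, offset)
        samples.append((sample_id, label, features))
        offset += record.size
    return n_labels, samples


# The example dataset from this page: 5 samples, 3 labels, 2 features.
example = [
    (1, [0.21, 0.45]),
    (1, [0.22, 0.43]),
    (2, [0.67, 1.12]),
    (2, [0.60, 1.11]),
    (3, [0.79, 0.04]),
]
blob = pack_opf(3, example)
```

Under these assumptions the example occupies 92 bytes: 12 for the header plus 16 for each of the 5 samples, with no separators between values.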