Gustavo Rosa edited this page Aug 4, 2016 · 4 revisions

In this package, we have the following programs:

SUPERVISED OPF
  • opf_split: This is a program to randomly split the dataset into training, evaluation and test sets.

  • opf_train: This is a program to execute the training phase considering the OPF proposed by [PapaIJIST09,PapaPR12].

  • opf_learn: This is a program to execute the learning phase from classification errors in the evaluation set considering the OPF proposed by [PapaIJIST09,PapaPR12]. It replaces opf_train when an evaluation set is available.

  • opf_classify: This is a program to execute the test phase by classifying the test set considering the OPF proposed by [PapaIJIST09,PapaPR12].

  • opfknn_train: This is a program to execute the training phase considering the OPF proposed by [PapaISVC08].

  • opfknn_classify: This is a program to execute the test phase by classifying the test set considering the OPF proposed by [PapaISVC08].

  • opf_accuracy: This is a program to compute the accuracy over training and/or test set.

  • opf_accuracy4label: This is a program to compute the accuracy over training and/or test set for each label.

UNSUPERVISED OPF
  • opf_cluster: This is a program to compute clusters by OPF. It assigns a consecutive number starting from 1 to N for N clusters, when the training set is unlabeled. Otherwise, it propagates the true labels of the roots to the labels of the nodes in their respective trees in order to evaluate the quality of the clustering. The resulting classifier is written in classifier.opf.

COMMON (Auxiliary functions)
  • opf_distance: This is a program to compute distance functions and store them into a precomputed distance file.

  • opf_normalize: This is a program to normalize datasets.

  • opf_info: This is a program that retrieves basic information about OPF files, such as the dataset size, number of labels and features.

  • opf_fold: This program partitions the dataset into k folds.

  • opf_merge: This program merges the folds, and it can be used together with the opf_fold program.


opf_split Usage

Usage: opf_split <P1> <P2> <P3> <P4> <P5>

P1: dataset in the OPF file format
P2: percentage of the training set size [0,1]
P3: percentage of the evaluation set size [0,1] (leave 0 in the case of no learning)
P4: percentage of the test set size [0,1]
P5: normalize features? 1 - Yes 0 - No

The sum P2 + P3 + P4 must be 1.

The features are normalized with the following equation:

N_i = (F_i - M_i)/S_i,

where F_i, M_i and S_i are, respectively, the feature i, the average of F_i and the standard deviation of F_i in the dataset.
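The normalization equation above can be sketched in a few lines of Python. This is only an illustration, not the code used by opf_split: the function name `normalize_features`, the use of the population standard deviation, and the handling of constant features (mapped to 0) are assumptions that the source does not specify.

```python
from statistics import mean, pstdev


def normalize_features(samples):
    """Standardize each feature column: N_i = (F_i - M_i) / S_i."""
    columns = list(zip(*samples))
    means = [mean(col) for col in columns]
    # Assumption: population standard deviation; the tool may use the sample one.
    stds = [pstdev(col) for col in columns]
    normalized = []
    for row in samples:
        normalized.append([
            # Assumption: a constant feature (S_i = 0) is mapped to 0.0.
            (value - m) / s if s > 0 else 0.0
            for value, m, s in zip(row, means, stds)
        ])
    return normalized


# Hypothetical toy data: three samples, two features (the second is constant).
data = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
result = normalize_features(data)
```

After normalization, each non-constant feature has zero mean and unit standard deviation over the dataset.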

The program splits the dataset into two new files, training.opf and testing.opf, when P3 = 0, and it splits the dataset into three files, training.opf, evaluating.opf and testing.opf, otherwise.
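The splitting rule described above can be illustrated with a short Python sketch. It is not the opf_split implementation: the function name `split_dataset` and the `seed` parameter are hypothetical, and the sketch shuffles all samples together, whereas the real program may preserve class proportions.

```python
import random


def split_dataset(samples, train_frac, eval_frac, test_frac, seed=None):
    """Randomly split samples into training, evaluation and test lists."""
    # The three fractions play the role of P2, P3 and P4 and must sum to 1.
    if abs(train_frac + eval_frac + test_frac - 1.0) > 1e-9:
        raise ValueError("P2 + P3 + P4 must be 1")
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(round(train_frac * len(shuffled)))
    n_eval = int(round(eval_frac * len(shuffled)))
    training = shuffled[:n_train]
    evaluating = shuffled[n_train:n_train + n_eval]  # empty when eval_frac is 0
    testing = shuffled[n_train + n_eval:]
    return training, evaluating, testing


# Hypothetical usage: 10 samples split 50% / 20% / 30%.
training, evaluating, testing = split_dataset(range(10), 0.5, 0.2, 0.3, seed=0)
```

With an evaluation fraction of 0, the evaluation list is empty, which mirrors the two-file case described above.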

opf_train Usage

Usage: opf_train <P1> <P2>

P1: training set in the OPF file format
P2: precomputed distance file (leave it blank if you are not using this resource)

The program designs a classifier from training.opf and outputs it in a file named classifier.opf, which is used by opf_classify for testing.

The opf_train also outputs the following files:

  • .out: it contains the predicted labels (training phase)
  • .time: it contains the execution time in seconds (training phase)
  • .acc: it contains the accuracy (training phase)

opf_learn Usage

Usage: opf_learn <P1> <P2> <P3>

P1: training set in the OPF file format
P2: evaluation set in the OPF file format
P3: precomputed distance file (leave it blank if you are not using this resource)

The program replaces opf_train when there is an evaluation set. It learns from the classification errors in the evaluation set without increasing the training set size, and outputs a final classifier in a file named classifier.opf, which is used for testing by the program opf_classify.

The opf_learn program outputs the following file:

  • .time: it contains the execution time in seconds (learning phase)

opf_classify Usage

Usage: opf_classify <P1> <P2>

P1: test/training set in the OPF file format
P2: precomputed distance file (leave it blank if you are not using this resource)

The opf_classify outputs the following files:

  • .out: it contains the predicted labels (test phase)
  • .time: it contains the execution time in seconds (test phase)

opfknn_train Usage

Usage: opfknn_train <P1> <P2> <P3>

P1: training set in the OPF file format
P2: kmax (maximum value for the k-neighborhood)
P3: precomputed distance file (leave it blank if you are not using this resource)

The program designs a classifier from training.opf and outputs it in a file named classifier.opf, which is used by opfknn_classify for testing.

The opfknn_train program also outputs the following files:

  • .out: it contains the predicted labels (training phase)
  • .time: it contains the execution time in seconds (training phase)
  • .acc: it contains the accuracy (training phase)

opfknn_classify Usage

Usage: opfknn_classify <P1> <P2>

P1: test/training set in the OPF file format
P2: precomputed distance file (leave it blank if you are not using this resource)

The opfknn_classify program outputs the following files:

  • .out: it contains the predicted labels (test phase)
  • .time: it contains the execution time in seconds (test phase)

opf_accuracy Usage

Usage: opf_accuracy <P1>

P1: data set in the OPF file format

The opf_accuracy program looks for a classified file with the same name as the data set file in P1 and extension ".out" in order to compute the accuracy of that classification. It outputs a text file with the same name and extension ".acc".
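The programs described so far are typically chained: split, train, classify, then measure accuracy. The Python sketch below builds that command sequence from the usage lines above. It assumes the binaries are on the PATH and that the input is a hypothetical file named data.opf; the helper names `build_pipeline` and `run_pipeline` are illustrative and not part of the package.

```python
import subprocess


def build_pipeline(dataset, train_frac=0.5, test_frac=0.5, normalize=False):
    """Return the argv lists for a split/train/classify/accuracy run."""
    return [
        # P3 = 0: no evaluation set, so only training.opf and testing.opf are created.
        ["opf_split", dataset, str(train_frac), "0", str(test_frac),
         "1" if normalize else "0"],
        ["opf_train", "training.opf"],     # writes classifier.opf
        ["opf_classify", "testing.opf"],   # writes the ".out" predictions
        ["opf_accuracy", "testing.opf"],   # reads ".out", writes ".acc"
    ]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Hypothetical dataset name; call run_pipeline(commands) to actually execute.
commands = build_pipeline("data.opf")
```

Keeping command construction separate from execution makes it easy to inspect or log the exact calls before running them.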

opf_accuracy4label Usage

Usage: opf_accuracy4label <P1>

P1: data set in the OPF file format

The opf_accuracy4label program looks for a classified file with the same name as the data set file in P1 and extension ".out" in order to compute the accuracy of that classification. It outputs a text file with the same name and extension ".acc".

opf_cluster Usage

Usage: opf_cluster <P1> <P2> <P3> <P4> <P5>

P1: unlabeled data set in the OPF file format
P2: kmax (maximum degree for the knn graph)
P3: 0 (height), 1 (area) and 2 (volume)
P4: value of parameter P3 (integer) in (0-1)
P5: precomputed distance file (leave it blank if you are not using this resource)

P3 selects the criterion (height, area or volume) used to remove maxima from the pdf (probability density function).

Note: the opf_cluster outputs the k value that minimized the cut in the graph as well as the number of obtained clusters and a classifier written in a file classifier.opf. The labeled samples (predicted) are also output in a ".out" file.

opf_knn_classify Usage

Usage: opf_knn_classify <P1> <P2>

P1: test/training set in the OPF file format
P2: precomputed distance file (leave it blank if you are not using this resource)

The opf_knn_classify outputs the following files:

  • .out: it contains the predicted labels (test phase)
  • .time: it contains the execution time in seconds (test phase)

opf_distance Usage

One of the most important characteristics of the OPF classifier is the possibility of working with any distance function. Its default is the Euclidean metric. The user can execute the program opf_distance with the following options of distance functions.

Usage: opf_distance <P1> <P2> <P3>

P1: Dataset in the OPF file format
P2: Distance ID

1 - Euclidean
2 - Chi-Square
3 - Manhattan (L1)
4 - Canberra
5 - Squared Chord
6 - Squared Chi-Squared
7 - BrayCurtis

P3: Distance normalization? 1 - Yes 0 - No
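For readers unfamiliar with some of the measures listed under P2, the following Python sketch gives the textbook definitions of four of them. These are standard formulas, not code taken from opf_distance, whose implementations may differ in detail (for example in scaling or zero handling). The `DISTANCES` mapping from distance ID to function is an illustrative assumption.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:  # convention: terms with 0/0 contribute nothing
            total += abs(x - y) / denom
    return total


def bray_curtis(a, b):
    denom = sum(abs(x + y) for x, y in zip(a, b))
    return sum(abs(x - y) for x, y in zip(a, b)) / denom if denom > 0 else 0.0


# Hypothetical lookup keyed by the distance IDs listed above.
DISTANCES = {1: euclidean, 3: manhattan, 4: canberra, 7: bray_curtis}

a, b = [1.0, 2.0], [4.0, 6.0]
values = {dist_id: fn(a, b) for dist_id, fn in DISTANCES.items()}
```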

The program computes the selected distance function between every pair of samples in the dataset and outputs a precomputed distance file (distances.dat). The sample identifier in the dataset is used here. The distance values may or may not be normalized, depending on P3. Users can also create their own distance file. The file BINARY format is:

<# of samples>
<Distance from sample 0 to sample 0> <Distance from sample 0 to sample 1> ...
<Distance from sample 1 to sample 0> <Distance from sample 1 to sample 1> ...
.
.
<Distance from sample n-1 to sample 0> <Distance from sample n-1 to sample 1> ...

Comment #1: Note that the file is an N x N matrix of distance values. It must be binary with no blank spaces. The ASCII representation above is just for illustration.
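A custom distance file following this layout could be written as in the Python sketch below. The source does not state the binary types, so the sketch assumes a little-endian 32-bit integer for the sample count followed by 32-bit floats in row-major order; check these assumptions against the library source before relying on them. The helper names are hypothetical.

```python
import struct


def write_distance_file(path, matrix):
    """Write an N x N distance matrix: sample count first, then the rows."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("distance matrix must be N x N")
    with open(path, "wb") as f:
        f.write(struct.pack("<i", n))  # assumption: 32-bit little-endian int
        for row in matrix:
            f.write(struct.pack("<%df" % n, *row))  # assumption: 32-bit floats


def read_distance_file(path):
    """Read back a file produced by write_distance_file."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<i", f.read(4))
        return [list(struct.unpack("<%df" % n, f.read(4 * n))) for _ in range(n)]
```

Under these assumptions, a file for N samples occupies 4 + 4·N·N bytes, which offers a quick sanity check on a generated file.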

opf_normalize Usage

Users who have their own datasets and do not need opf_split may still need to normalize them. For that purpose, the opf_normalize program employs the same normalization process used by opf_split.

Usage: opf_normalize <P1> <P2>

P1: input dataset in the OPF file format
P2: normalized output dataset in the OPF file format

opf_info Usage

It retrieves basic information about OPF files, such as dataset size, and number of labels and features.

Usage: opf_info <P1>

P1: OPF file

opf_fold Usage

Users who need k-fold cross-validation can use the opf_fold program, which partitions the dataset into k folds. The folds can then be merged with the opf_merge program.

Usage: opf_fold <P1> <P2> <P3>

P1: input dataset in the OPF file format
P2: k
P3: normalize features? 1 - Yes 0 - No
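The idea behind combining opf_fold and opf_merge for cross-validation can be sketched in Python as follows. This is not how the programs are implemented: the function names and the `seed` parameter are hypothetical, and the sketch ignores class labels, whereas the real tool may keep class proportions within each fold.

```python
import random


def k_folds(samples, k, seed=None):
    """Shuffle the samples and deal them into k folds of near-equal size."""
    if k < 2 or k > len(samples):
        raise ValueError("k must be between 2 and the number of samples")
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def train_test_rounds(folds):
    """Yield (training, test) pairs: one fold is held out, the rest are merged."""
    for i, test_fold in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield training, test_fold


# Hypothetical usage: 10 samples, 3 folds.
folds = k_folds(list(range(10)), 3, seed=1)
rounds = list(train_test_rounds(folds))
```

Each round merges k - 1 folds into a training set and keeps the remaining fold for testing, which is the role opf_merge plays on the command line.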

opf_merge Usage

It merges n folds for a k-fold cross-validation.

Usage: opf_merge <P1> <P2> ... <Pn>

P1: input dataset 1 in the OPF file format
P2: input dataset 2 in the OPF file format
Pn: input dataset n in the OPF file format