
Sonar

Dataset

The original dataset contains 208 examples, each described by 60 features:

Pandas dataframe obtained from sonar.csv (you can take a look at sonar.md for additional information about the dataset)

The last column holds the label / class of the examples (values "R" and "M", which stand for "Rock" and "Mine" observations). We will try to predict it based on columns 0 to 59.

We can look at the class balance:

sonar.csv class balance

concluding that there isn't a class imbalance problem.
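
If you want to double-check the balance yourself, a throwaway sketch along these lines will do (plain C++, independent of ultra; the file name and the roughly 111 "M" / 97 "R" counts refer to the standard UCI sonar data):

#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main()
{
  std::ifstream csv("sonar.csv");
  std::map<std::string, unsigned> count;

  // The label is whatever follows the last comma of each row.
  for (std::string line; std::getline(csv, line); )
  {
    const auto comma(line.find_last_of(','));
    if (comma != std::string::npos)
      ++count[line.substr(comma + 1)];
  }

  for (const auto &[label, n] : count)
    std::cout << label << ": " << n << '\n';
}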

Let's start coding

#include "kernel/ultra.h"

int main()
{
  using namespace ultra;

  // READING INPUT DATA
  src::dataframe::params params;
  params.output_index = src::dataframe::params::index::back;

  src::problem prob("sonar.csv", params);

  // ...
}

Data preprocessing isn't required (take a look at dataset format for details):

  • all attributes / features are numerical;
  • there isn't a header row (when one is present, it's usually detected automatically).

All we have to do is specify the column containing the labels:

params.output_index = src::dataframe::params::index::back;

There's only a little left before the first attempt, so let's add the missing code:

#include <iostream>

#include "kernel/ultra.h"

int main()
{
  using namespace ultra;

  // READING INPUT DATA
  src::dataframe::params params;
  params.output_index = src::dataframe::params::index::back;

  src::problem prob("sonar.csv", params);

  // SETTING UP SYMBOLS
  prob.setup_symbols();

  // SEARCHING
  src::search s(prob);
  s.validation_strategy<src::holdout_validation>(prob, 70);

  const auto result(s.run());

  std::cout << "\nCANDIDATE SOLUTION\n"
            << out::c_language << result.best_individual
            << "\n\nACCURACY\n" << *result.best_measurements.accuracy * 100.0
            << '%'
            << "\n\nFITNESS\n" << *result.best_measurements.fitness << '\n';
}

All the classes and functions are placed in the ultra namespace.

Now compile and execute the code (for convenience, the above code is available in the examples/sonar.cc file):

$ cd build/examples
$ ./sonar

you should see something like (actual values will differ):

[INFO] Importing dataset...
[INFO] ...dataset imported
[INFO] Examples: 208, features: 60, classes: 2
[INFO] Setting up terminals...
[INFO] Category 0 variables: `X1` `X2` `X3` `X4` `X5` `X6` `X7` `X8` `X9` `X10` `X11` `X12` `X13` `X14` `X15` `X16` `X17` `X18` `X19` `X20` `X21` `X22` `X23` `X24` `X25` `X26` `X27` `X28` `X29` `X30` `X31` `X32` `X33` `X34` `X35` `X36` `X37` `X38` `X39` `X40` `X41` `X42` `X43` `X44` `X45` `X46` `X47` `X48` `X49` `X50` `X51` `X52` `X53` `X54` `X55` `X56` `X57` `X58` `X59` `X60`
[INFO] ...terminals ready
[INFO] Automatically setting up symbol set...
[INFO] Category 0 symbols: `FABS` `FADD` `FDIV` `FLN` `FMUL` `FMOD` `FSUB`
[INFO] ...symbol set ready
[INFO] Holdout validation settings: 70% training (145), 15% validation (31), 15% test (32)
[INFO] Number of layers set to 1
[INFO] Population size set to 740
   2.001       0:      -51.308
   6.004       0:      -50.317
  14.010       0:     -46.4347
   01:08       3:      -45.186
01:14  3: -45.186  [1x 145gph]

At last we have a program to classify training examples! Not too difficult, but what's going on under the hood?

Line by line description

src::problem prob("sonar.csv", params);

The src::problem class is a specialization of the ultra::problem class for symbolic regression and classification tasks. The constructor reads and stores the dataset (the "sonar.csv" file) along with other useful information.


Classification (and symbolic regression) tasks are tackled through genetic programming. In genetic programming every candidate solution / program needs an alphabet (the primitive set) to be expressed.

The command:

prob.setup_symbols();

sets the default primitive set. Part of this set (the terminal set) is derived from the input data:

[INFO] Setting up terminals...
[INFO] Category 0 variables: `X1` `X2` `X3` `X4` `X5` `X6` `X7` `X8` `X9` `X10` `X11` `X12` `X13` `X14` `X15` `X16` `X17` `X18` `X19` `X20` `X21` `X22` `X23` `X24` `X25` `X26` `X27` `X28` `X29` `X30` `X31` `X32` `X33` `X34` `X35` `X36` `X37` `X38` `X39` `X40` `X41` `X42` `X43` `X44` `X45` `X46` `X47` `X48` `X49` `X50` `X51` `X52` `X53` `X54` `X55` `X56` `X57` `X58` `X59` `X60`
[INFO] ...terminals ready

and part (the function set) is arbitrarily chosen:

[INFO] Automatically setting up symbol set...
[INFO] Category 0 symbols: `FABS` `FADD` `FDIV` `FLN` `FMUL` `FMOD` `FSUB`
[INFO] ...symbol set ready
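
To make the idea concrete: an individual is just an expression built from those terminals and functions. The following hand-written snippet (plain C++, not ultra's internal representation) shows what evaluating one such expression amounts to:

#include <cmath>
#include <vector>

// Stand-in for an evolved individual, e.g. FADD(X3, FMUL(X10, FABS(X42))).
// Remember that terminal X1 maps to feature index 0.
double example_individual(const std::vector<double> &x)
{
  return x[2] + x[9] * std::fabs(x[41]);
}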

src::search s(prob);

The search for solutions is entirely driven by the src::search class. It uses prob to access the data and the task-specific parameters/constraints.

src::search is a template class whose template parameters allow choosing among various search algorithms. For now we stick with the default.


Training and testing a model on the same dataset can lead to overfitting, where the model memorizes the training data rather than learning underlying patterns, resulting in poor generalization to new, unseen data.

For this reason the available data are split into:

  • a training set, used to train the model;
  • a validation set, used to measure the model's generalization (i.e. to provide an unbiased evaluation of the model fit);
  • a test set, kept aside for a final, unbiased assessment of the chosen model.

(see Training, validation, and test data sets for a good, simple introduction to the topic)
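
Conceptually, a holdout split just partitions shuffled row indices into slices. Here is a rough standard-library sketch (not ultra's actual implementation; the 70/15/15 proportions mirror the log shown earlier):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

int main()
{
  const std::size_t examples(208);

  // Shuffle the row indices, then carve out the slices.
  std::vector<std::size_t> index(examples);
  std::iota(index.begin(), index.end(), 0);
  std::shuffle(index.begin(), index.end(),
               std::mt19937(std::random_device{}()));

  const std::size_t n_train(examples * 70 / 100);      // 145 rows
  const std::size_t n_val((examples - n_train) / 2);   // 31 rows
  // training:   index[0, n_train)
  // validation: index[n_train, n_train + n_val)
  // test:       the remaining 32 rows
}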

The instruction:

s.validation_strategy<src::holdout_validation>(prob, 70);

performs the split (and, by default, also shuffling and stratified sampling); hence the output message:

[INFO] Holdout validation settings: 70% training (145), 15% validation (31), 15% test (32)

const auto result(s.run());

The run method executes the search and returns a summary that includes the best program found. Since we don't specify any environment parameters (e.g. population size, number of runs, ...), they're tuned automatically (see the [INFO] Number of layers set to 1 and [INFO] Population size set to 740 lines above).


std::cout << "\nCANDIDATE SOLUTION\n"
          << out::c_language << result.best_individual
          << "\n\nACCURACY\n" << *result.best_measurements.accuracy * 100.0
          << '%'
          << "\n\nFITNESS\n" << *result.best_measurements.fitness << '\n';

The interesting part is the accuracy, which is calculated on the validation set. The candidate solution / individual isn't very useful on its own: it acts as a discriminant function, so it's only one part of the classification model.
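
In other words, the printed expression only produces a real number; a decision rule on top of it turns that number into a class label. Below is a purely illustrative sketch of that last step (how ultra actually derives its decision rule isn't shown here):

#include <string>

// The evolved expression yields a real number; a complete classifier also
// needs a decision rule mapping that number to a label.  A single threshold
// is assumed here purely for illustration.
std::string classify(double discriminant, double threshold)
{
  return discriminant >= threshold ? "M" : "R";
}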

PROCEED TO PART 2→