terminology

Manlio Morini edited this page May 30, 2024 · 10 revisions

The following terms will come up repeatedly in other documents about Ultra.

Class imbalance

The class imbalance problem typically occurs when there are many more instances of some classes than others. In such cases, standard classifiers tend to be overwhelmed by the large classes and ignore the small ones.
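One simple way to quantify the problem is the ratio between the size of the largest class and that of the smallest. The following is a minimal sketch, not part of Ultra: `class_counts` and `imbalance_ratio` are hypothetical helper names, and class labels are assumed to be plain integers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Counts how many instances belong to each class label.
std::map<int, std::size_t> class_counts(const std::vector<int> &labels)
{
  std::map<int, std::size_t> counts;
  for (int l : labels)
    ++counts[l];
  return counts;
}

// Ratio between the size of the largest class and the size of the
// smallest one (1.0 means perfectly balanced).
double imbalance_ratio(const std::vector<int> &labels)
{
  assert(!labels.empty());

  const auto counts(class_counts(labels));
  const auto [min_it, max_it] = std::minmax_element(
    counts.begin(), counts.end(),
    [](const auto &a, const auto &b) { return a.second < b.second; });

  return static_cast<double>(max_it->second)
         / static_cast<double>(min_it->second);
}
```

With nine instances of class 0 and one of class 1 the ratio is 9; the further the value is from 1, the more likely a standard classifier is to ignore the small class.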

Example

An instance (with its features) and a label.

In Ultra the class representing an example is dataframe::example (see dataframe.h and dataframe.cc).

Instance

The thing about which you want to make a prediction. For example, the instance might be an image that you want to classify as either hotdog or not hotdog (Silicon Valley: Season 4 Episode 4).

Feature

A property of an instance used in a prediction task. For example, a car might have a feature Mileage.

NOTE: in machine learning the term feature has several meanings depending on the context. It can denote a kind of data (e.g. Mileage) or a kind of data together with its value. Many people use the words attribute and feature interchangeably (this wiki does too).

Label

An answer for a prediction task, either the answer produced by the machine learning system, or the right answer supplied in training data.

For example, the label for an image might be hotdog.

Consider that:

  • symbolic regression tasks have a numeric label, which can be accessed via the label_as function;

  • although classification tasks might encode classes with different, problem-specific data types, Ultra always uses its own scheme, encoding each class as an integer (class_t). This allows simpler, uniform manipulation.

    The integer label can be accessed via the label(example) function and the original label via the dataframe::class_name method.
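The idea of encoding problem-specific class names as integers can be sketched as follows. This is an illustration only and does not reproduce Ultra's implementation: `label_encoder` and `class_id` are hypothetical names, and `class_id` is merely assumed to be an unsigned integer (Ultra's own class_t may be defined differently).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Assumption: a stand-in for an integer class identifier.
using class_id = std::size_t;

// Hypothetical helper mapping class names to consecutive integers.
class label_encoder
{
public:
  // Returns the integer code for `name`, assigning the next free code
  // the first time a name is seen.
  class_id encode(const std::string &name)
  {
    const auto it(codes_.find(name));
    if (it != codes_.end())
      return it->second;

    const class_id id(names_.size());
    codes_.emplace(name, id);
    names_.push_back(name);
    return id;
  }

  // Maps an integer code back to the original class name.
  const std::string &class_name(class_id id) const
  {
    assert(id < names_.size());
    return names_[id];
  }

  // Number of distinct classes seen so far.
  std::size_t classes() const { return names_.size(); }

private:
  std::unordered_map<std::string, class_id> codes_;
  std::vector<std::string> names_;
};
```

Encoding hotdog and then not hotdog yields the codes 0 and 1; calling `class_name(1)` recovers not hotdog. Keeping both directions of the mapping lets the learning code work on integers only while still reporting results with the original names.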

Metric

A number that you care about. May or may not be directly optimized.

Model

A statistical representation of a prediction task. You train a model on examples then use the model to make predictions.

Objective

A metric that your algorithm is trying to optimize.

Pipeline

The infrastructure surrounding a machine learning algorithm. Includes gathering the data from the front end, putting it into training data files, training one or more models, and exporting the models to production.

Primitive set

The primitive set is the alphabet of a genetic programming program. It consists of a terminal set (variables, constants and functions with no arguments) and a function set (functions and constructs driven by the nature of the problem domain).
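The split between the two sets can be sketched with a small data structure that files each primitive by its number of arguments. The types `primitive` and `primitive_set` below are hypothetical and are not Ultra's classes; a primitive is reduced to a name and an arity for the sake of illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A primitive is identified here only by its name and number of arguments.
struct primitive
{
  std::string name;
  std::size_t arity;
};

struct primitive_set
{
  std::vector<primitive> terminals;  // variables, constants, zero-argument functions
  std::vector<primitive> functions;  // primitives taking one or more arguments

  // Files a primitive under the terminal set or the function set,
  // depending on its arity.
  void insert(primitive p)
  {
    assert(!p.name.empty());

    if (p.arity == 0)
      terminals.push_back(std::move(p));
    else
      functions.push_back(std::move(p));
  }
};
```

For an arithmetic problem one might insert a variable X, a constant 1.0 and a zero-argument RAND as terminals, and ADD (arity 2) and SIN (arity 1) as functions.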

Stratified sampling

Stratified sampling is a sampling technique used in statistics and machine learning to ensure that the distribution of samples across different classes or categories remains representative of the population.

The population is divided into distinct groups (strata) based on certain characteristics, such as age, gender or income level. Samples are then randomly selected from each group in proportion to the group's share of the population. This ensures that each subgroup is adequately represented in the sample, thereby reducing the potential for bias in the analysis.

The method is particularly useful when dealing with imbalanced datasets, where certain classes or categories are significantly more prevalent than others. The goal of stratified sampling is to keep the proportions of the different classes in the sample close to their proportions in the entire population.
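A minimal sketch of proportional stratified sampling follows. It is not Ultra's implementation: `stratified_sample` is a hypothetical function, class labels are assumed to be plain integers, and the same fraction is drawn from every class, rounded to the nearest whole instance.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// Returns the indices of a stratified sample: from each class, about
// `fraction` of its instances are drawn at random without replacement.
std::vector<std::size_t> stratified_sample(const std::vector<int> &labels,
                                           double fraction,
                                           std::mt19937 &rng)
{
  assert(0.0 <= fraction && fraction <= 1.0);

  // Group instance indices by class: each group is a stratum.
  std::map<int, std::vector<std::size_t>> strata;
  for (std::size_t i(0); i < labels.size(); ++i)
    strata[labels[i]].push_back(i);

  std::vector<std::size_t> sample;
  for (auto &stratum : strata)
  {
    auto &indices(stratum.second);

    // Shuffle the stratum, then keep its first `n` indices.
    std::shuffle(indices.begin(), indices.end(), rng);
    const auto n(static_cast<std::ptrdiff_t>(
      std::llround(fraction * static_cast<double>(indices.size()))));

    sample.insert(sample.end(), indices.begin(), indices.begin() + n);
  }

  return sample;
}
```

With 90 instances of class 0, 10 of class 1 and a fraction of 0.2, the sample holds 18 instances of class 0 and 2 of class 1, preserving the 9:1 proportion. Note that with this rounding rule a very small class can end up with no instances at all; a real implementation may want to enforce a minimum of one per class.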