Schema comprehension in ML.NET

This document describes in detail the mechanism that ML.NET uses under the hood to automate the creation of an IDataView schema. The goal is to make schema creation as convenient as possible for the end user without incurring extra computational cost.

For a better understanding of IDataView principles and the type system, please refer to the IDataView design principles and IDataView type system documents.

Introduction

Every dataset in ML.NET is represented as an IDataView, which is, for the purposes of this document, a collection of rows that share the same columns. The set of columns, together with their names, types, and other annotations, is known as the schema of the IDataView, and it is represented as a DataViewSchema object.

In this document, we use the terms data view and IDataView interchangeably, and likewise schema and DataViewSchema.

Before any new data enters ML.NET, the user needs to define what the schema of the data will look like. To do this, the following questions need to be answered:

  • What are the column names?
  • What are their types?
  • What other annotations are associated with the columns?

The items above are very similar to the definition of fields in a C# class: names and types of columns correspond to names and types of fields, and annotations can correspond to field attributes. Because of this similarity, ML.NET offers a convenient common mechanism for creating a schema: defining a C# class.

For example, the class definition below can be used to define a data view with 5 float columns:

public class IrisData
{
    public float Label;
    public float SepalLength;
    public float SepalWidth;
    public float PetalLength;
    public float PetalWidth;
}

Using schema comprehension to make a data view and to read a data view

The first obvious benefit of schema comprehension is that we can now create IDataViews out of in-memory enumerables of user-defined 'data types', without having to define the schema. It works in the other direction too: you can take an IDataView, and read it as an IEnumerable of user-defined 'data type' (which will fail if the user-provided schema does not match the real schema).

Let's see how we can create a new IDataView out of an in-memory array, run some operations on it, and then read it back into the array.

using System.Linq;
using Microsoft.ML;

public class IrisData
{
    public float Label;
    public float SepalLength;
    public float SepalWidth;
    public float PetalLength;
    public float PetalWidth;
}

public class IrisVectorData
{
    public float Label;
    public float[] Features;
}

static void Main(string[] args)
{
    // Here's a data array that we want to work on.
    var dataArray = new[]
    {
        new IrisData { Label = 1, PetalLength = 1, SepalLength = 1, PetalWidth = 1, SepalWidth = 1 },
        new IrisData { Label = 0, PetalLength = 2, SepalLength = 2, PetalWidth = 2, SepalWidth = 2 }
    };

    // Create the ML.NET environment.
    var context = new MLContext();

    // Create the data view.
    // This method will use the definition of IrisData to understand what columns there are in the
    // data view.
    var dataView = context.Data.LoadFromEnumerable(dataArray);

    // Now let's do something to the data view. For example, concatenate all four non-label columns
    // into 'Features' column.
    var pipeline = context.Transforms.Concatenate("Features",
        "SepalLength", "SepalWidth", "PetalLength", "PetalWidth");

    // Next, let's fit and transform the data so the concatenation goes through the data view.
    var transformedData = pipeline.Fit(dataView).Transform(dataView);

    // Read the data into an IEnumerable.
    // This method will use the definition of IrisVectorData to understand which columns and of which
    // types are expected to be present in the input data.
    var data = context.Data.CreateEnumerable<IrisVectorData>(transformedData, reuseRowObject: false).ToList();
}

After this code runs, data will contain two IrisVectorData objects, each having Features filled with the actual values of the features (the 4 concatenated columns).

Streaming data views

What if the original data doesn't support seeking, for example, if it is some form of IEnumerable<IrisData> rather than IList<IrisData>? In that case, we can use another helper function:

var streamingDv = env.CreateStreamingDataView<IrisData>(dataEnumerable);

The only subtle difference is that the resulting streamingDv will not support shuffling (a property that is useful to some ML applications).

AsCursorable and reuseRowObject parameter

When you read a data view with CreateEnumerable<OutType>, ML.NET creates and populates one object per row. If you do not need multiple row objects to exist in memory (for example, you are writing them to disk one by one as you scan through the IEnumerable), you may want to set reuseRowObject to true. ML.NET will then create only one row object for the entire data view when you enumerate it, and just re-populate its values for every row.

In the example above, this would lead to incorrect behavior: the data list would hold two references to the same IrisVectorData object. Please consider carefully whether you want to reuse the row object: it is more efficient, but can lead to hard-to-find issues.
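
The effect of reuseRowObject on a materialized list can be observed directly. Below is a minimal sketch, assuming the Microsoft.ML NuGet package is referenced and using the same MLContext-based calls as the first example; the Row class and the ReuseRowObjectDemo helper are hypothetical names introduced only for this illustration.

```csharp
using System.Linq;
using Microsoft.ML;

// Hypothetical single-column row type, used only for this illustration.
public class Row
{
    public float Value;
}

public static class ReuseRowObjectDemo
{
    // Returns how many distinct row objects end up in a materialized list,
    // first with reuseRowObject: true, then with reuseRowObject: false.
    public static (int WithReuse, int WithoutReuse) CountDistinctRowObjects()
    {
        var context = new MLContext();
        var dataView = context.Data.LoadFromEnumerable(new[]
        {
            new Row { Value = 1 },
            new Row { Value = 2 }
        });

        var reused = context.Data.CreateEnumerable<Row>(dataView, reuseRowObject: true).ToList();
        var fresh = context.Data.CreateEnumerable<Row>(dataView, reuseRowObject: false).ToList();

        // Row does not override Equals, so Distinct() compares object references.
        return (reused.Distinct().Count(), fresh.Distinct().Count());
    }
}
```

With reuse enabled, the list should contain the same object twice, whereas without reuse every entry is a separate object. Reuse is therefore only safe when each row is fully consumed before the enumeration advances.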

Sometimes, we don't even want to populate the row object for every row. For example, if we only want to see every 100th row of the data, there is no need to populate row objects for the remaining 99% of rows. In this case, you can use the AsCursorable<OutType> method:

var cursorable = dv.AsCursorable<IrisVectorData>(env);
// You can create as many simultaneous cursors as you like, they are independent.
using (var cursor = cursorable.GetCursor())
{
    // We are now in charge of creating the row object.
    var myRow = new IrisVectorData();
    while (cursor.MoveNext())
    {
        if (cursor.Position % 100 == 99)
        {
            // Populate the values of the row object.
            cursor.FillValues(myRow);
            // Do something to the row.
        }
    }
}

Please note that cursors are not thread-safe: they have mutable state inside, and they are meant to be used by one thread. If you want to read the data in parallel, use multiple cursors.

PredictionEngine and PredictorModel

ML.NET's PredictionEngine turns a sequence of data transforms (maybe capped by a predictor, but not necessarily) into a 'black box' that takes strongly typed inputs and returns strongly typed outputs. The name is a little misleading: the PredictionEngine object doesn't require a predictor to be present in the pipeline. It can be just a sequence of transforms, as in the example below:

var engine = env.CreatePredictionEngine<IrisData, IrisVectorData>(dv);
var output = engine.Predict(new IrisData { Label = 1, PetalLength = 1, SepalLength = 1, PetalWidth = 1, SepalWidth = 1 });

It is important to note that the PredictionEngine actually validates that the 'pipeline' conforms to the input and output schema requirements when it is created.

The same can be said about the PredictorModel<InputType, OutputType>. This is a somewhat more restricted version of PredictionEngine that is created by LearningPipeline.Train.

Please note that PredictionEngine and PredictorModel are not thread-safe: they hold an internal cursor object, and therefore cannot be used in a re-entrant fashion. If you ever see the error message that says: An attempt was made to keep iterating after the pipe has been reset, it most likely means that ML.NET has detected a race condition on the PredictionEngine.
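
One way to avoid sharing an engine between threads is to give every thread its own instance. The sketch below assumes the MLContext-based API used in the first example (context.Model.CreatePredictionEngine over a fitted transformer) rather than the env-based call shown above; PerThreadEngineDemo and PredictInParallel are hypothetical names, and the pipeline is the same Concatenate transform as before.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ML;

public class IrisData
{
    public float Label;
    public float SepalLength;
    public float SepalWidth;
    public float PetalLength;
    public float PetalWidth;
}

public class IrisVectorData
{
    public float Label;
    public float[] Features;
}

public static class PerThreadEngineDemo
{
    // Runs the Concatenate pipeline over the inputs in parallel and returns
    // the resulting feature vectors, in input order.
    public static float[][] PredictInParallel(IrisData[] inputs)
    {
        var context = new MLContext();
        var dataView = context.Data.LoadFromEnumerable(inputs);
        var model = context.Transforms.Concatenate("Features",
            "SepalLength", "SepalWidth", "PetalLength", "PetalWidth").Fit(dataView);

        // Each thread lazily creates its own engine; an engine is never shared.
        using (var engines = new ThreadLocal<PredictionEngine<IrisData, IrisVectorData>>(
            () => context.Model.CreatePredictionEngine<IrisData, IrisVectorData>(model)))
        {
            var results = new float[inputs.Length][];
            Parallel.For(0, inputs.Length, i =>
            {
                // Every iteration writes to its own slot, so no locking is needed.
                results[i] = engines.Value.Predict(inputs[i]).Features;
            });
            return results;
        }
    }
}
```

The trade-off is memory: every thread that takes part holds a separate engine, so this pattern suits a small, bounded number of worker threads.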

Type system mapping

The IDataView type system differs slightly from the C# type system, so a 1-1 mapping between column types and C# types is not always feasible. Below are the most notable differences:

  • IDataView vector columns often have a fixed (and known) size. The C# array type best corresponds to a 'variable-size' vector: one that can have a different number of slots on every row. You can apply the [VectorType(N)] attribute to an array field to specify that the column is a vector of fixed size N. This is often necessary: most ML components don't work with variable-size vectors and require fixed-size ones.
  • IDataView's key types don't have a natural underlying C# type either. To declare a key-type column, make your field a uint and decorate it with [KeyType] to denote that the field is a key and not a regular unsigned integer.
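
The two cases above can be sketched as follows. This is an illustration only, assuming the VectorTypeAttribute and KeyTypeAttribute from the Microsoft.ML.Data namespace; the IrisKeyedVectorData class and the TypeMappingDemo helper are hypothetical, and the key count of 3 passed to [KeyType] is an arbitrary choice for the example.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type: a key-typed label and a fixed-size feature vector.
public class IrisKeyedVectorData
{
    // A key with 3 possible values, not a regular unsigned integer.
    [KeyType(3)]
    public uint Label;

    // A vector of known size 4; without the attribute the column would be variable-size.
    [VectorType(4)]
    public float[] Features;
}

public static class TypeMappingDemo
{
    // Builds a data view from the class above and reports what the schema says
    // about the two columns.
    public static (bool LabelIsKey, int FeaturesSize) InspectSchema()
    {
        var context = new MLContext();
        var dataView = context.Data.LoadFromEnumerable(new[]
        {
            new IrisKeyedVectorData { Label = 1, Features = new float[] { 1, 1, 1, 1 } }
        });

        var labelType = dataView.Schema["Label"].Type;
        var featuresType = (VectorDataViewType)dataView.Schema["Features"].Type;
        return (labelType is KeyDataViewType, featuresType.Size);
    }
}
```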

Full list of type mappings

The table below shows which C# types are mapped to which IDataView types:

| IDataView type | C# type | C# type with extra conversion |
|---|---|---|
| I1 | DvInt1 | sbyte, sbyte? |
| I2 | DvInt2 | short, short? |
| I4 | DvInt4 | int, int? |
| I8 | DvInt8 | long, long? |
| U1 | byte | byte? |
| U2 | ushort | ushort? |
| U4 | uint | uint? |
| U8 | ulong | ulong? |
| UG | UInt128 | |
| R4 | float | float? |
| R8 | double | double? |
| TX | DvText, string | |
| BL | DvBool | bool, bool? |
| TS | DvTimeSpan | |
| DT | DvDateTime | |
| DZ | DvDateTimeZone | |
| Variable-size vector | VBuffer<T> | T[], and the vector is always dense |
| Fixed-size vector | VBuffer<T> with [VectorType(N)] | T[] with [VectorType(N)], and the vector is always dense |
| Key type | uint with [KeyType] | |

Additional attributes to affect type mapping

There are two more attributes that can affect the way ML.NET conducts schema comprehension:

  • [ColumnName] lets you choose a different name for the IDataView column. By default, it is the same as the field name.
    • This is a way to create or read back an IDataView column with a name containing 'invalid' characters (like whitespace).
  • [NoColumn] denotes that the field it decorates should not be mapped to a column.
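
Both attributes can be seen at work in the short sketch below. It assumes the ColumnNameAttribute and NoColumnAttribute from the Microsoft.ML.Data namespace; the AnnotatedIrisData class, its Notes field, and the ColumnAttributesDemo helper are hypothetical names used only for illustration.

```csharp
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type that renames one column and hides one field.
public class AnnotatedIrisData
{
    // The column is named "Sepal Length" (with a space) instead of "SepalLength".
    [ColumnName("Sepal Length")]
    public float SepalLength;

    // This field is not mapped to any column.
    [NoColumn]
    public string Notes;

    public float Label;
}

public static class ColumnAttributesDemo
{
    // Returns the names of the columns that schema comprehension produced.
    public static string[] GetColumnNames()
    {
        var context = new MLContext();
        var dataView = context.Data.LoadFromEnumerable(new[]
        {
            new AnnotatedIrisData { SepalLength = 5.1f, Notes = "not a column", Label = 1 }
        });

        return dataView.Schema.Select(column => column.Name).ToArray();
    }
}
```

The returned names should include "Sepal Length" and "Label", but neither "SepalLength" nor "Notes".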

Using SchemaDefinition for run-time type mapping hints

As you can see from the table and notes above, certain IDataView types can only be denoted with an additional field attribute. If the type parameters are not known at compile time (like the size of the fixed-size vector), this is tricky.

You can use a SchemaDefinition object to re-map a type to an IDataView schema programmatically. It gives you the same powers as the attributes, but at runtime, as the example below shows.

// Vector size is only known at runtime.
int numberOfFeatures = 4;

// Create the default schema definition.
var schemaDef = SchemaDefinition.Create(typeof(IrisVectorData));

// Specify the right vector size.
schemaDef["Features"].ColumnType = new VectorDataViewType(NumberDataViewType.Single, numberOfFeatures);

// Create a data view.
var dataView = env.CreateDataView<IrisVectorData>(arr, schemaDef);

// Create a prediction engine. You can add custom input and output schema definitions there.
var predictionEngine = env.CreatePredictionEngine<IrisData, IrisVectorData>(dv, outputSchemaDefinition: schemaDef);

In addition to the above, you can use SchemaDefinition to add per-column annotations:

// Add column annotation.
schemaDef["Label"].AddAnnotation(MetadataUtils.Kinds.HasMissingValues, false);

Limitations

Certain things cannot be done at all using schema comprehension, but are possible via the native IDataView programmatic interface. It was our design decision not to allow these scenarios, thus simplifying the other, more common scenarios.

Here is the list of things that are only possible via the low-level interface:

  • Creating or reading a data view, where even column types are not known at compile time (so you cannot create a C# class to define the schema)
    • This can happen if you write a general-purpose machine learning tool that can ingest different kinds of datasets.
  • Reading a subset of columns that differs from one row to another: the cursor always populates the entire row object.
  • Reading column annotations from the data view.
  • Accessing the 'hidden' data view columns by index.
    • Hidden columns are those that have the same name as other columns and a smaller index. They are not accessible by name.
  • Creating 'cursor sets': this is a feature that lets you iterate over data in multiple parallel threads by splitting the data between multiple 'sibling' cursors.