# Using ML.NET for Machine Learning

In this notebook we'll use the ML.NET framework to estimate water consumption and refill (measured in grams) from historical data: aggregated accelerometer measures collected from the glass. In the dataset, water consumption is represented by a negative weight delta, while water refill is represented by a positive weight delta.

In [51]:
//Importing all the packages needed to load, explore and model the data in .NET
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json"
#r "nuget:Microsoft.Data.Analysis,0.20.0-preview.22514.1"
#r "nuget: DataView.InteractiveExtension, 1.0.69"
#r "nuget: SandDance.InteractiveExtension, 1.0.69"
#r "nuget: Microsoft.ML.AutoML, 0.20.0-preview.22514.1"
#r "nuget: Plotly.NET.Interactive"
#r "nuget: Plotly.NET.CSharp"

### Importing the data in the environment
Let's start by importing the data from a CSV file that contains the aggregates computed from the historical measures of acceleration and weight.

In [52]:
//Loading and storing the dataset of aggregated measures of weight and acceleration of the glass 
#!value --name datasource --from-file "./waterConsumptionDataset.csv"

In [53]:
//Accessing the dataset previously stored
#!share --from value datasource
datasource

ActionId,Time,WindowDuration,Weight,AvgAccX,AvgAccY,AvgAccZ,RangeAccX,RangeAccY,RangeAccZ
1,10/14/2022 2:47:36 AM,00:00:03.1220000,-0.20000000298023224,-0.0441176477162277,0,1,0.10000000149011612,0,0
2,10/14/2022 2:48:06 AM,00:00:06.6790000,-0.7999999970197678,-0.008333333043588532,-0.07361111189756128,0.9833333326710595,0.9000000357627869,1.0000000298023224,0.5999999642372131
3,10/14/2022 2:48:30 AM,00:00:05.0860000,-0.6000000238418579,-0.0036363634196194734,-0.04000000086697665,0.9909090898253701,0.5000000149011612,0.4000000134110451,0.40000003576278687
4,10/14/2022 2:48:52 AM,00:00:07.8860000,0.6000000238418579,0.05294117725947324,-0.11882352864041049,0.90117647227119,1.699999988079071,1.4000000357627869,0.7000000476837158
5,10/14/2022 2:49:17 AM,00:00:11.6280000,-0.20000001788139343,0.029600001037120818,-0.10080000025033951,0.9343999986648559,1.100000023841858,1.600000023841858,1.199999988079071
6,10/14/2022 2:50:20 AM,00:00:04.4280000,-0.699999988079071,-0.0437500006519258,0

In [54]:
using Microsoft.Data.Analysis;
var dataframe = DataFrame.LoadCsvFromString(datasource);
dataframe = dataframe.OrderBy("Time");
dataframe

index,ActionId,Time,WindowDuration,Weight,AvgAccX,AvgAccY,AvgAccZ,RangeAccX,RangeAccY,RangeAccZ


### Exploring the data with data visualization
Data visualization is an effective way to understand your data, spot possible relationships between variables, and explore the distribution of specific columns. It's an important step before jumping into data modeling.

In [55]:
//Keeping only rows with a negative or zero weight delta (water consumption)
PrimitiveDataFrameColumn<bool> waterConsumptionFilter = dataframe["Weight"].ElementwiseLessThanOrEqual(0);
var waterConsumptionData = dataframe.Filter(waterConsumptionFilter);

In [56]:
//Keeping only rows with a positive weight delta (water refill)
PrimitiveDataFrameColumn<bool> waterRefillFilter = dataframe["Weight"].ElementwiseGreaterThan(0);
var waterRefillData = dataframe.Filter(waterRefillFilter);

In [57]:
using Plotly.NET.CSharp;

Chart.Combine(new [] {
    Chart.Point<DateTime, float, string>(
        x: waterConsumptionData.Columns["Time"].Cast<DateTime>(),
        y: waterConsumptionData.Columns["Weight"].Cast<float>(),
        "Water Consumption"
    ),
    Chart.Point<DateTime, float, string>(
        x: waterRefillData.Columns["Time"].Cast<DateTime>(),
        y: waterRefillData.Columns["Weight"].Cast<float>(),
        "Water Refill"
    )})
.WithXAxisStyle<double, double, string>(Title: Plotly.NET.Title.init("Time"))
.WithYAxisStyle<double, double, string>(Title: Plotly.NET.Title.init("Weight Deltas"))


In [58]:
Chart.Combine(new [] {
    Chart.Point<DateTime, float, string>(
        x: waterConsumptionData.Columns["WindowDuration"].Cast<DateTime>(),
        y: waterConsumptionData.Columns["Weight"].Cast<float>(),
        "Water Consumption"
    ),
    Chart.Point<DateTime, float, string>(
        x: waterRefillData.Columns["WindowDuration"].Cast<DateTime>(),
        y: waterRefillData.Columns["Weight"].Cast<float>(),
        "Water Refill"
    )})
.WithXAxisStyle<double, double, string>(Title: Plotly.NET.Title.init("Window Duration"))
.WithYAxisStyle<double, double, string>(Title: Plotly.NET.Title.init("Weight Deltas"))

In [59]:
Chart.BoxPlot<string, float, string>(
    waterConsumptionFilter.Cast<bool>().Select(x => x ? "Water Consumption" : "Water Refill").ToArray(),
    dataframe.Columns["AvgAccX"].Cast<float>().ToArray())
.WithXAxisStyle<string, float, string>(Title: Plotly.NET.Title.init("Water Consumption"))
.WithYAxisStyle<string, float, string>(Title: Plotly.NET.Title.init("Avg AccX"))

In [60]:
Chart.BoxPlot<string, float, string>(
    waterConsumptionFilter.Cast<bool>().Select(x => x ? "Water Consumption" : "Water Refill").ToArray(),
    dataframe.Columns["AvgAccY"].Cast<float>().ToArray())
.WithXAxisStyle<string, float, string>(Title: Plotly.NET.Title.init("Water Consumption"))
.WithYAxisStyle<string, float, string>(Title: Plotly.NET.Title.init("Avg AccY"))

In [61]:
Chart.BoxPlot<string, float, string>(
    waterConsumptionFilter.Cast<bool>().Select(x => x ? "Water Consumption" : "Water Refill").ToArray(),
    dataframe.Columns["AvgAccZ"].Cast<float>().ToArray())
.WithXAxisStyle<string, float, string>(Title: Plotly.NET.Title.init("Water Consumption"))
.WithYAxisStyle<string, float, string>(Title: Plotly.NET.Title.init("Avg AccZ"))

### Transform the data and prepare it for modeling
Visualizing the data showed which features might be relevant for estimating our label (the weight delta), so we can now transform the dataset to prepare it for training. We also need to hold back part of the dataset (30% in the split below) for evaluation purposes.

In [64]:
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

//initialize ML context
var ctx = new MLContext();

In [None]:
DataFrame RemoveColumns(DataFrame df, params string[] columns)
{
    var new_df = df.Clone();
    foreach (var column in columns)
    {
        new_df.Columns.Remove(column);
    }
    return new_df;
}

In [65]:
//transforming WindowDuration into a numeric feature (total seconds, so durations of a minute or more are not truncated)
var windowDurationFloat = dataframe.Columns["WindowDuration"].Cast<DateTime>().Select(x => (float)x.TimeOfDay.TotalSeconds);
var newColumn = new PrimitiveDataFrameColumn<float>("WindowDurationFloat", windowDurationFloat);
var dataToModel = dataframe.Clone();
dataToModel.Columns.Add(newColumn); 

//removing DateTime columns, since they are not helpful to the model
dataToModel = RemoveColumns(dataToModel, "Time", "WindowDuration");
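
The conversion in the cell above depends on how the CSV loader typed the `WindowDuration` column. As a standalone sketch, assuming the raw values are available as strings in the `hh:mm:ss.fffffff` form shown in the CSV preview, the same feature could be derived with `TimeSpan`. The helper name `DurationToSeconds` is hypothetical and not part of the notebook.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (an assumption, not part of the notebook):
// converts a duration string such as "00:00:03.1220000" into seconds.
float DurationToSeconds(string text) =>
    (float)TimeSpan.Parse(text, CultureInfo.InvariantCulture).TotalSeconds;

// Sample inputs: the first is taken from the CSV preview,
// the second is made up to show a duration longer than one minute.
Console.WriteLine(DurationToSeconds("00:00:06.6790000"));
Console.WriteLine(DurationToSeconds("00:01:02.5000000"));
```

Using `TotalSeconds` rather than the `Seconds` component keeps minutes and hours in the result, which matters if a window ever lasts longer than 59 seconds.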

In [66]:
dataToModel

index,ActionId,Weight,AvgAccX,AvgAccY,AvgAccZ,RangeAccX,RangeAccY,RangeAccZ,WindowDurationFloat


In [67]:
//splitting dataset into test and train
var split = ctx.Data
                .TrainTestSplit(dataToModel, testFraction: 0.3, seed:3);

var trainData = split.TrainSet;
var testData = split.TestSet;

### Create a machine learning experiment
We are now ready to define our training pipeline for a regression model, then create and run our machine learning experiment, relying on AutoML to discover which model fits best. At the end of the experiment, we'll also evaluate the best model's performance using the test dataset.

In [68]:
var features = dataToModel.Columns.Select(x => x.Name).Where(name => name != "ActionId" && name != "Weight").ToArray();
features

index,value
0,AvgAccX
1,AvgAccY
2,AvgAccZ
3,RangeAccX
4,RangeAccY
5,RangeAccZ
6,WindowDurationFloat


In [69]:
//define training pipeline (reuses the `features` array computed in the previous cell)
var pipeline = 
    ctx.Auto().Featurizer(trainData,numericColumns:features, excludeColumns: new string[]{"Weight","ActionId"})
        .Append(ctx.Auto().Regression(labelColumnName:"Weight",  useLgbm:false));

In [70]:
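//configure experiment: search for up to 60 seconds, ranking trials by R-squared on the second dataset passed to SetDataset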
var experiment = ctx.Auto().CreateExperiment();

experiment
	.SetPipeline(pipeline)
	.SetTrainingTimeInSeconds(60)
	.SetRegressionMetric(RegressionMetric.RSquared, labelColumn: "Weight")
	.SetDataset(trainData, testData);

In [71]:
//run the experiment
var result = await experiment.RunAsync();

In [72]:
ITransformer bestModel = result.Model;
string modelType = pipeline.ToString(result.TrialSettings.Parameter).Split("=>").Last();
var predictions = bestModel.Transform(testData);

//best model
modelType.Display()

FastForestRegression

In [73]:
// Comparing actual and predicted weight deltas: the 20 test rows with the smallest absolute error
using System;
var actual = predictions.GetColumn<float>("Weight");
var predicted = predictions.GetColumn<float>("Score");

var compare = 
	actual
		.Zip(predicted, (a, p) => new {Actual=a, Predicted=p, Difference=a-p})
		.OrderBy(x => Math.Abs(x.Difference))
		.Take(20);

compare

index,Actual,Predicted,Difference
0,-87.899994,-85.52446,-2.375534
1,-45.7,-43.128693,-2.5713081
2,-44.600006,-41.80694,-2.793068
3,-50.300003,-47.36378,-2.936222
4,-37.100002,-33.224003,-3.8759995
5,-38.199997,-34.105515,-4.0944824
6,-36.5,-32.241657,-4.2583427
7,-68.3,-60.63809,-7.661915
8,-38.6,-47.833504,9.233505
9,-51.1,-41.0212,-10.0788
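
Because the rows above are sorted by absolute error, they show the best predictions rather than a summary of all of them. A minimal sketch of how such pairs could be condensed into a mean absolute error and a root mean squared error follows. The helper names are hypothetical (they are not ML.NET APIs), and the sample arrays hold only the first three rows of the table, so the printed numbers are illustrative rather than a measure of the model.

```csharp
using System;
using System.Linq;

// Illustrative values only: the first three Actual/Predicted pairs from the table above.
float[] actualSample = { -87.899994f, -45.7f, -44.600006f };
float[] predictedSample = { -85.52446f, -43.128693f, -41.80694f };

// Hypothetical helper: average of |actual - predicted| over all pairs.
double MeanAbsoluteError(float[] actual, float[] predicted) =>
    actual.Zip(predicted, (a, p) => Math.Abs((double)a - p)).Average();

// Hypothetical helper: square root of the average squared difference.
double RootMeanSquaredError(float[] actual, float[] predicted) =>
    Math.Sqrt(actual.Zip(predicted, (a, p) => Math.Pow((double)a - p, 2)).Average());

Console.WriteLine($"MAE:  {MeanAbsoluteError(actualSample, predictedSample):F3}");
Console.WriteLine($"RMSE: {RootMeanSquaredError(actualSample, predictedSample):F3}");
```

In the notebook, the same helpers could be applied to the full `actual` and `predicted` columns (converted to arrays) instead of the sample values.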


In [74]:
//Displaying the R-squared metric of the best model
result.Metric
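
For reference, R-squared is conventionally defined as the coefficient of determination. The symbols below are assumptions introduced for illustration: $y_i$ is the actual weight delta of row $i$, $\hat{y}_i$ the predicted one, and $\bar{y}$ the mean of the actual values over the $n$ evaluated rows.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

Under this definition a value of 1 means the predictions match the actual values exactly, 0 is what a model that always predicts the mean would score, and negative values are possible for models that do worse than that baseline.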

In [75]:
// Save model as a zip, to make it consumable from other apps
IDataView dvData = (IDataView)(RemoveColumns(dataToModel,"ActionId"));
ctx.Model.Save(result.Model, dvData.Schema, "model.zip");