# Assignment: Automatically find the best model for the taxi dataset

In this assignment you're going to build an app that can predict taxi fares in New York.

You already did that in the first lesson in this course, but the twist this time is that you are going to let ML.NET automatically pick the best machine learning algorithm for you!

## Get started

To start, please run the following code block to install the required NuGet packages:

In [1]:
#r nuget:Microsoft.ML
#r nuget:Microsoft.ML.AutoML

Note the **AutoML** package. This is the ML.NET experimentation engine that can automatically discover the best machine learning algorithm for any given dataset. We're going to use AutoML in this assignment to discover the best possible learner to use.

We're ready to add code. Let's start with a bunch of using statements:

In [2]:
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.AutoML;
using XPlot.Plotly;

Now we're ready to add classes. We’ll need one to hold a taxi trip, and one to hold your model predictions.

Run the following code:

In [3]:
public class TaxiTrip
{
    [LoadColumn(0)] public string VendorId;
    [LoadColumn(5)] public string RateCode;
    [LoadColumn(3)] public float PassengerCount;
    [LoadColumn(4)] public float TripDistance;
    [LoadColumn(9)] public string PaymentType;
    [LoadColumn(10)] public float FareAmount;
}

public class TaxiTripFarePrediction
{
    [ColumnName("Score")]
    public float FareAmount;
}

The **TaxiTrip** class holds one single taxi trip. Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from.

You're also declaring a **TaxiTripFarePrediction** class which will hold a single fare prediction.

Now you need to load the training data into memory:

In [4]:
// file path to data file
static readonly string dataPath = Path.Combine(Environment.CurrentDirectory, "yellow_tripdata_2018-12_small.csv");

// create the machine learning context
var context = new MLContext();

// load the data 
Console.Write("Loading training data....");
var data = context.Data.LoadFromTextFile<TaxiTrip>(path: dataPath, hasHeader:true, separatorChar: ',');
Console.WriteLine("done");

Loading training data....done


This code calls **LoadFromTextFile** to load the CSV data into memory. Note how the **LoadColumn** attributes on the **TaxiTrip** fields tell the method which CSV column to load into each field.
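To check that the **LoadColumn** mapping did what you expect, you can print the schema of the loaded data. This is a small optional sketch that assumes the `data` variable from the cell above; nothing in it is required for the rest of the assignment.

```csharp
// optional: list the columns that LoadFromTextFile mapped from the CSV file
// (assumes the 'data' variable created in the previous cell)
foreach (var column in data.Schema)
{
    // Index is the column position in the loaded data, not the CSV column number
    Console.WriteLine($"{column.Index}: {column.Name} ({column.Type})");
}
```

You should see one line for each of the six fields declared in **TaxiTrip**.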

Let's see what the data looks like. We're going to deserialize the training data into an enumeration of TaxiTrip instances and do a quick visual check of the data:

In [5]:
// get an array of taxi trips
var trips = context.Data.CreateEnumerable<TaxiTrip>(data, reuseRowObject: false).ToArray();

// display the result
display(trips.Take(10));

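| index | VendorId | RateCode | PassengerCount | TripDistance | PaymentType | FareAmount |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 2 | 2.5 | 1 | 12.0 |
| 1 | 1 | 1 | 3 | 2.3 | 1 | 13.0 |
| 2 | 2 | 1 | 1 | 0.0 | 2 | 2.5 |
| 3 | 1 | 1 | 1 | 3.9 | 1 | 12.5 |
| 4 | 1 | 1 | 1 | 12.8 | 1 | 45.0 |
| 5 | 1 | 1 | 1 | 18.8 | 1 | 50.5 |
| 6 | 1 | 1 | 1 | 1.0 | 1 | 7.5 |
| 7 | 1 | 1 | 1 | 0.3 | 4 | 4.0 |
| 8 | 1 | 1 | 1 | 3.3 | 2 | 17.5 |
| 9 | 1 | 1 | 1 | 5.7 | 1 | 26.5 |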


We are not going to clean the data any further. Instead, we're going to use the AutoML engine to automatically process the dataset, build the machine learning pipeline, and select the optimal learning algorithm to predict the taxi fares.

## Automatically training the model

Here's how to set up AutoML:

In [6]:
// automatically discover the optimal model
uint cutoff = 90; // time limit for the experiment, in seconds
Console.Write("Automatically discovering best model...");
var results = context.Auto()
              .CreateRegressionExperiment(cutoff)
              .Execute(data, "FareAmount");
Console.WriteLine("done");

Automatically discovering best model...done


The **Auto** method gives you access to the AutoML features, and **CreateRegressionExperiment** sets up an experiment that searches for the best regression model within the 90-second time limit. The **Execute** method then runs the experiment: it processes the training data and tries out several pipelines and learning algorithms to find the one that predicts the **FareAmount** column best. It returns an experiment result that holds the details of every run, including the best one.
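If you want more control over the experiment, **CreateRegressionExperiment** also accepts a settings object instead of a bare time limit. The following is a minimal sketch, not part of the original assignment: it assumes the `context` and `data` variables from the cells above, and the names `settings` and `tunedResults` are made up for illustration. Note that running it starts a second 90-second experiment.

```csharp
// a sketch of the same experiment with explicit settings
// (assumes the 'context' and 'data' variables from the cells above)
var settings = new RegressionExperimentSettings
{
    // same time limit as the experiment above, in seconds
    MaxExperimentTimeInSeconds = 90,

    // ask AutoML to compare models by their root mean squared error
    OptimizingMetric = RegressionMetric.RootMeanSquaredError
};

// run the experiment with these settings
var tunedResults = context.Auto()
    .CreateRegressionExperiment(settings)
    .Execute(data, labelColumnName: "FareAmount");
```

Setting the optimizing metric explicitly is useful when you rank the runs yourself afterwards, because the experiment's own choice of best run should then be based on the same metric you sort by.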

Let's take a look at the top models discovered by the experiment:

In [7]:
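// rank all runs by RMSE, lowest error first
// runs without validation metrics get a placeholder RMSE of 99 so they sort last
var models = from r in results.RunDetails
             let rmse = r.ValidationMetrics?.RootMeanSquaredError ?? 99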
             orderby rmse ascending
             select new 
             { 
                 Trainer = r.TrainerName, 
                 RMSE = rmse,
                 MAE = r.ValidationMetrics?.MeanAbsoluteError 
             };
display(models);

| index | Trainer | RMSE | MAE |
| --- | --- | --- | --- |
| 0 | LightGbmRegression | 3.2411236213028 | 1.5046073263005693 |
| 1 | FastTreeRegression | 3.328601627622368 | 1.4387104523017327 |
| 2 | LightGbmRegression | 3.330734551334208 | 1.462082858581309 |
| 3 | LightGbmRegression | 3.344453618649537 | 1.7371664287507995 |
| 4 | FastTreeRegression | 3.4329769703848827 | 1.437959974640971 |
| 5 | FastTreeTweedieRegression | 3.570389069136981 | 1.4163964356813994 |
| 6 | FastTreeRegression | 3.5829041817448664 | 1.5312732305651584 |
| 7 | SdcaRegression | 3.58446707667788 | 1.7303408644830833 |
| 8 | SdcaRegression | 3.652401174159612 | 1.8192336069493136 |
| 9 | SdcaRegression | 3.6549345544862426 | 1.8131337257070244 |
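The table reports two error metrics, both expressed in dollars. MAE (mean absolute error) is the average size of the prediction error, and RMSE (root mean squared error) is the square root of the average squared error, which penalizes large misses more heavily. That is why the two columns do not rank the models in the same order: the FastTreeTweedieRegression run at index 5 has the lowest MAE in the table but only the sixth-best RMSE. Here is a small self-contained sketch of how both metrics are computed. The fare values are made up purely for illustration and do not come from the dataset.

```csharp
using System;
using System.Linq;

// hypothetical fares, made up for illustration (not taken from the dataset)
var actual    = new double[] { 10.0, 20.0, 30.0 };
var predicted = new double[] { 12.0, 18.0, 33.0 };

// per-trip prediction errors
var errors = actual.Zip(predicted, (a, p) => p - a).ToArray();

// MAE: the average absolute error
var mae = errors.Average(e => Math.Abs(e));

// RMSE: the square root of the average squared error
var rmse = Math.Sqrt(errors.Average(e => e * e));

Console.WriteLine($"MAE:  {mae:0.###}");
Console.WriteLine($"RMSE: {rmse:0.###}");
```

In this example the RMSE comes out slightly higher than the MAE, because the single $3 miss counts for more once the errors are squared.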


## Making a prediction

Let's wrap up by grabbing the top model and using it to make a prediction:

In [8]:
// get the best model from the experiment results
var model = results.BestRun.Model;

// create a prediction engine for one single prediction
var engine = context.Model.CreatePredictionEngine<TaxiTrip, TaxiTripFarePrediction>(model);

// prep a single taxi trip
var taxiTripSample = new TaxiTrip()
{
    VendorId = "2",
    RateCode = "1",
    PassengerCount = 1,
    TripDistance = 3.75f,
    PaymentType = "1",
    FareAmount = 0 // the model will predict the actual fare for this trip
};

// make the prediction
var prediction = engine.Predict(taxiTripSample);

// show the prediction
Console.WriteLine("Single prediction:");
Console.WriteLine($"  Predicted fare: {prediction.FareAmount:0.####}");

Single prediction:
  Predicted fare: 15.6025


You use the **CreatePredictionEngine** method to set up a prediction engine. The two type arguments are the input data class and the class that holds the prediction. Once the prediction engine is set up, you can call **Predict** to make a single prediction.
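The prediction engine is meant for one trip at a time. To score many trips in one go, you can instead pass a whole data view through the model with **Transform**. The sketch below is an optional addition that assumes the `context`, `data` and `model` variables from the cells above; the names `scored` and `firstPredictions` are made up for illustration. It scores the training data only to show the API, so these predictions say nothing about how well the model handles unseen trips.

```csharp
// a sketch of batch scoring
// (assumes the 'context', 'data' and 'model' variables from the cells above)
var scored = model.Transform(data);

// read the first five predicted fares back as TaxiTripFarePrediction instances
var firstPredictions = context.Data
    .CreateEnumerable<TaxiTripFarePrediction>(scored, reuseRowObject: false)
    .Take(5)
    .ToArray();

// show the predictions
foreach (var p in firstPredictions)
{
    Console.WriteLine($"Predicted fare: {p.FareAmount:0.##}");
}
```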

As you can see, the model predicts that this taxi trip should cost around $15.60.

## Further improvements

Can you think of ways to improve the accuracy of the model predictions?