# Assignment: Predict taxi fares in New York

In this assignment you're going to build an app that can predict taxi fares in New York.

The first thing you'll need is a data file with records of New York taxi rides. The [NYC Taxi & Limousine Commission](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) provides yearly TLC Trip Record Data files which have exactly what you need.

We're going to use the Yellow Taxi trip dataset from December 2018. This is a CSV file that looks like this:

![Data File](./assets/data.png)


There are a lot of columns with interesting information in this data file, but you will only train on the following:

* Column 0: The data provider vendor ID
* Column 3: Number of passengers
* Column 4: Trip distance
* Column 5: The rate code (standard, JFK, Newark, …)
* Column 9: Payment type (credit card, cash, …)
* Column 10: Fare amount

You are going to build a machine learning model in C# that will use columns 0, 3, 4, 5, and 9 as input, and use them to predict the taxi fare for every trip. Then you’ll compare the predicted fares with the actual taxi fares in column 10, and evaluate the accuracy of your model.

## Get started

We will start by installing the NuGet package for ML.NET and a support package for fast decision trees. 

Run the following code block:

In [1]:
#r nuget:Microsoft.ML
#r nuget:Microsoft.ML.FastTree

Now we're ready to add some code. Let's start with a bunch of using statements:

In [2]:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using XPlot.Plotly;

Note the **XPlot.Plotly** namespace. This is the awesome XPlot plotting library that Jupyter loads by default. We'll use it in this assignment to plot some cool graphs from the taxi dataset.

Now we're ready to add some classes. We’ll need one to hold a taxi trip, and one to hold your model predictions.

Run the following code:

In [3]:
public class TaxiTrip
{
    [LoadColumn(0)] public string VendorId;
    [LoadColumn(5)] public string RateCode;
    [LoadColumn(3)] public float PassengerCount;
    [LoadColumn(4)] public float TripDistance;
    [LoadColumn(9)] public string PaymentType;
    [LoadColumn(10)] public float FareAmount;
}

public class TaxiTripFarePrediction
{
    [ColumnName("Score")]
    public float FareAmount;
}

The **TaxiTrip** class holds one single taxi trip. Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from.

You're also declaring a **TaxiTripFarePrediction** class which will hold a single fare prediction.

Now you need to load the training data in memory:

In [4]:
// file path to data file
static readonly string dataPath = Path.Combine(Environment.CurrentDirectory, "yellow_tripdata_2018-12_small.csv");

// create the machine learning context
var mlContext = new MLContext();

// set up the text loader 
var textLoader = mlContext.Data.CreateTextLoader(
    new TextLoader.Options() 
    {
        Separators = new[] { ',' },
        HasHeader = true,
        Columns = new[] 
        {
            new TextLoader.Column("VendorId", DataKind.String, 0),
            new TextLoader.Column("RateCode", DataKind.String, 5),
            new TextLoader.Column("PassengerCount", DataKind.Single, 3),
            new TextLoader.Column("TripDistance", DataKind.Single, 4),
            new TextLoader.Column("PaymentType", DataKind.String, 9),
            new TextLoader.Column("FareAmount", DataKind.Single, 10)
        }
    }
);

// load the data 
Console.Write("Loading training data....");
var dataView = textLoader.Load(dataPath);
Console.WriteLine("done");

// split into a training and test partition
var partitions = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);

Loading training data....done


This code sets up a **TextLoader** to load the CSV data into memory. Note how the information in the **TextLoader.Options.Columns** field tells the **Load** method how to load each CSV column in memory. And finally the **TrainTestSplit** method splits the data into a training set consisting of 80% of the data and a testing set consisting of 20% of the data.
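
To make the 80/20 split concrete, here is a minimal sketch of the idea in plain C#. It is not how ML.NET implements **TrainTestSplit**. The row indices, the seed value, and the variable names are all assumptions made up for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in data: 10,000 row indices instead of real taxi trips.
var rows = Enumerable.Range(0, 10_000).ToArray();

// Fixed seed (an arbitrary choice) so the split is repeatable.
var random = new Random(42);

// Give every row a random key in [0, 1). ToArray() pins the keys down once.
var keyed = rows.Select(row => (Row: row, Key: random.NextDouble())).ToArray();

// Rows whose key falls below the test fraction go to the test set.
const double testFraction = 0.2;
var testSet = keyed.Where(r => r.Key < testFraction).Select(r => r.Row).ToArray();
var trainSet = keyed.Where(r => r.Key >= testFraction).Select(r => r.Row).ToArray();

Console.WriteLine($"Training rows: {trainSet.Length}");
Console.WriteLine($"Test rows:     {testSet.Length}");
```

Every row lands in exactly one partition, and roughly one in five rows ends up in the test set. Because the assignment is random, the test fraction is approximate rather than exact.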

## Cleaning the data

Let's see if that worked. We're going to deserialize the training data into an enumeration of TaxiTrip instances and do a quick visual check of the data:

In [8]:
// get an array of taxi trips
var trips = mlContext.Data.CreateEnumerable<TaxiTrip>(partitions.TrainSet, reuseRowObject: false).ToArray();

// display the result
display(trips.Take(10));

index,VendorId,RateCode,PassengerCount,TripDistance,PaymentType,FareAmount
0,1,1,2,2.5,1,12.0
1,1,1,3,2.3,1,13.0
2,2,1,1,0.0,2,2.5
3,1,1,1,3.9,1,12.5
4,1,1,1,12.8,1,45.0
5,1,1,1,0.3,4,4.0
6,1,1,1,3.3,2,17.5
7,1,1,1,5.7,1,26.5
8,1,2,1,17.3,1,52.0
9,1,1,1,0.3,1,5.0


Even though every column contains a number, you might have noticed that we are loading **VendorId**, **RateCode**, and **PaymentType** as strings. We do this because these columns hold codes, not quantities. RateCode, for example, is an enumeration with the following values:

* 1 = standard
* 2 = JFK
* 3 = Newark
* 4 = Nassau
* 5 = negotiated
* 6 = group

And PaymentType is defined as follows:

* 1 = Credit card
* 2 = Cash
* 3 = No charge
* 4 = Dispute
* 5 = Unknown
* 6 = Voided trip

These actual numbers don’t mean anything in this context. And we certainly don’t want the machine learning model to start believing that a trip to Newark is three times as important as a standard fare.

So converting these values to strings is a perfect trick to show the model that **RateCode** and **PaymentType** are just labels, and the underlying numbers have no meaning.

Let's do a quick plot. We're going to plot a graph of the fare of a taxi trip as a function of distance:

In [7]:
// plot fare amount by trip distance
var chart = Chart.Plot(
    new Graph.Scattergl()
    {
        x = trips.Select(v => v.TripDistance),
        y = trips.Select(v => v.FareAmount),
        mode = "markers"
    }
);
chart.WithXTitle("Trip Distance");
chart.WithYTitle("Fare Amount");
chart.WithTitle("Fare amount by trip distance");
chart.Width = 600;
chart.Height = 600;
display(chart);

The fare scales linearly with distance as we would expect. But check out those outliers! There's one $350 trip that covers no distance, and another trip with a negative fare amount.

This shows how important it is to clean a dataset before we train our machine learning models. There are always crazy outliers in any dataset, and we don't want our machine learning models to get the wrong ideas.

Let's get rid of the fare outliers:

In [16]:
var cleanedData = mlContext.Data.FilterRowsByColumn(
    input: partitions.TrainSet,
    columnName: "FareAmount", 
    lowerBound: 0, 
    upperBound: 200);

Now the **cleanedData** variable contains only training data where the fare amount is at least 0 and below 200 (the lower bound is inclusive, the upper bound is exclusive). The fare outliers have been removed from the dataset.
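
The range filter can be sketched in plain C# with LINQ. This is only an illustration of the condition, assuming an inclusive lower bound and an exclusive upper bound, which is how ML.NET documents **FilterRowsByColumn**. The first four fares come from the preview above; the last three are made-up boundary cases.

```csharp
using System;
using System.Linq;

// Fare amounts: four from the preview, then three hypothetical edge cases.
var fares = new float[] { 12.0f, 13.0f, 2.5f, 45.0f, -5.0f, 200.0f, 350.0f };

const float lowerBound = 0f;    // inclusive
const float upperBound = 200f;  // exclusive

// Keep a fare only if lowerBound <= fare < upperBound.
var kept = fares.Where(f => f >= lowerBound && f < upperBound).ToArray();

Console.WriteLine($"Kept {kept.Length} of {fares.Length} fares");
```

The negative fare, the fare of exactly 200, and the fare of 350 are all dropped, while the four ordinary fares remain.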

## Training the model

Next, we'll set up a machine learning training pipeline:

In [20]:
// set up a learning pipeline
var pipeline = 
    
    // copy the fare amount to a new label column
    mlContext.Transforms.CopyColumns(
        inputColumnName:"FareAmount", 
        outputColumnName:"Label")

    // one-hot encode all text features
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("VendorId"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("RateCode"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("PaymentType"))

    // combine all input features into a single column 
    .Append(mlContext.Transforms.Concatenate(
        "Features", 
        "VendorId", 
        "RateCode", 
        "PassengerCount", 
        "TripDistance", 
        "PaymentType"))

    // cache the data to speed up training
    .AppendCacheCheckpoint(mlContext)

    // use the fast tree learner 
    .Append(mlContext.Regression.Trainers.FastTree());

// train the model
Console.Write("Training the model....");
var model = pipeline.Fit(cleanedData);
Console.WriteLine("done");

Training the model....done


Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.

This pipeline has the following components:

* **CopyColumns** which copies the FareAmount column to a new column called Label. This Label column holds the actual taxi fare that the model has to predict.
* **OneHotEncoding** to perform one-hot encoding on the three columns that contain enumerative data: VendorId, RateCode, and PaymentType. This is a required step because machine learning models cannot handle enumerative data directly. The one-hot encoding transformation converts each value to a vector with one slot per distinct value, holding a 1 in the slot for that value and 0 everywhere else, which is suitable for machine learning models to train on.
* **Concatenate** which combines all input data columns into a single column called Features. This is a required step because ML.NET can only train on a single input column.
* **AppendCacheCheckpoint** which caches all data in memory to speed up the training process.
* A final **FastTree** regression learner which will train the model to make accurate predictions.
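
To show what one-hot encoding produces, here is a minimal hand-rolled sketch in plain C#. It does not use ML.NET and makes no claim about how **OneHotEncoding** is implemented internally. The sample rate codes and the `OneHot` helper are assumptions made up for illustration.

```csharp
using System;
using System.Linq;

// A few rate codes as they might appear in the RateCode column.
var rateCodes = new[] { "1", "1", "2", "1", "3" };

// Build the vocabulary: each distinct label gets a slot, in order of first appearance.
var vocabulary = rateCodes.Distinct().ToArray();

// Encode a label as a vector with a 1 in its own slot and 0 everywhere else.
float[] OneHot(string label) =>
    vocabulary.Select(v => v == label ? 1f : 0f).ToArray();

foreach (var code in rateCodes)
{
    Console.WriteLine($"{code} -> [{string.Join(", ", OneHot(code))}]");
}
```

With this vocabulary, rate code "1" becomes [1, 0, 0], "2" becomes [0, 1, 0], and "3" becomes [0, 0, 1]. No vector is larger than another, so the model cannot read an ordering into the codes.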

Let's use the model to make predictions for every row in the training set:

In [34]:
// get a set of predictions 
Console.Write("Creating training predictions....");
var predictions = model.Transform(cleanedData);
Console.WriteLine("done");

Creating training predictions....done


The **predictions** variable now holds a data view with a predicted fare amount for every taxi trip in the cleaned training set. Unfortunately, we cannot display these predictions directly because the Jupyter server doesn't have built-in support to render the output correctly.

However, we can easily fix that by registering a custom formatter:

In [35]:
using Microsoft.AspNetCore.Html;
Formatter<DataDebuggerPreview>.Register((preview, writer) =>
{
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(preview.ColumnView.Select(c => (IHtmlContent) th(c.Column.Name)));
    var rows = new List<List<IHtmlContent>>();
    var count = 0;
    foreach (var row in preview.RowView)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(count));
        foreach (var obj in row.Values)
        {
            cells.Add(td(obj.Value));
        }
        rows.Add(cells);
        count++;
    }
    
    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));
    
    writer.Write(t);
}, "text/html");

With the helper method fully set up, we can now render the output of the machine learning pipeline:

In [36]:
// show a 5-record preview of the output of the machine learning pipeline
var preview = predictions.Preview(maxRows: 5);
display(preview);

index,VendorId,VendorId.1,VendorId.2,RateCode,RateCode.1,RateCode.2,PassengerCount,TripDistance,PaymentType,PaymentType.1,PaymentType.2,FareAmount,SamplingKeyColumn,Label,Features,Score
0,1,1,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 3 }",1,1,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 4 }",2,2.5,1,1,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 4 }",12.0,0.5956414,12.0,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 13 }",11.690246
1,1,1,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 3 }",1,1,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 4 }",3,2.3,1,1,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 4 }",13.0,0.58837676,13.0,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 13 }",11.314795
2,2,2,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 3 }",1,1,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 4 }",1,0.0,2,2,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 4 }",2.5,0.7536782,2.5,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 13 }",3.465709
3,1,1,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 3 }",1,1,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 4 }",1,3.9,1,1,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 4 }",12.5,0.96748567,12.5,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 13 }",16.118864
4,1,1,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 3 }",1,1,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 4 }",1,12.8,1,1,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 4 }",45.0,0.9295975,45.0,"{ Microsoft.ML.Data.VBuffer<System.Single>: IsDense: False, Length: 13 }",39.54979


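Note how **VendorId**, **RateCode**, and **PaymentType** are present three times in the output. This is because we used the **OneHotEncoding** transformation on these columns: the preview shows the original text value, an intermediate key value, and the final one-hot encoded vector.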

If you scroll to the right, you'll see the **Features** concatenated column which contains all individual columns we're training on. The actual fare amount is in the **FareAmount** column, and the model predictions are in the final column at the end called **Score**.

Note how the model generates predictions which are fairly close to the real fare amounts.
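
To put a number on "close", here is a small sketch that compares the five previewed rows by hand. The values are copied from the FareAmount and Score columns of the preview above; the variable names are made up for illustration. Keep in mind that these are training rows, so this says little about how the model does on unseen data.

```csharp
using System;
using System.Linq;

// Actual fares (FareAmount) and predictions (Score) from the five-row preview.
var actual    = new double[] { 12.0, 13.0, 2.5, 12.5, 45.0 };
var predicted = new double[] { 11.690246, 11.314795, 3.465709, 16.118864, 39.54979 };

// Absolute difference between the actual fare and the prediction, per trip.
var absoluteErrors = actual.Zip(predicted, (y, yHat) => Math.Abs(y - yHat)).ToArray();

for (var i = 0; i < absoluteErrors.Length; i++)
{
    Console.WriteLine($"Trip {i}: off by {absoluteErrors[i]:0.00}");
}
Console.WriteLine($"Average: {absoluteErrors.Average():0.00}");
```

The per-trip differences vary quite a bit, with the longest trip showing the largest miss, which is why a single summary metric over many rows is more informative than eyeballing five of them.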

## Evaluating the model

But are the predictions close enough? Let's find out by calculating the RMSE, MSE, and MAE metrics:

In [37]:
// get a set of predictions 
Console.Write("Evaluating the model....");
var predictions = model.Transform(partitions.TestSet);

// get regression metrics to score the model
var metrics = mlContext.Regression.Evaluate(predictions, "Label", "Score");
Console.WriteLine("done");

// show the metrics
Console.WriteLine();
Console.WriteLine($"Model metrics:");
Console.WriteLine($"  RMSE:{metrics.RootMeanSquaredError:#.##}");
Console.WriteLine($"  MSE: {metrics.MeanSquaredError:#.##}");
Console.WriteLine($"  MAE: {metrics.MeanAbsoluteError:#.##}");
Console.WriteLine();

Evaluating the model....done

Model metrics:
  RMSE:2.55
  MSE: 6.51
  MAE: 1.4



This code calls **Transform** to set up predictions for every single taxi trip in the test partition. The **Evaluate**(…) method then compares these predictions to the actual taxi fares and automatically calculates these metrics:

* **RootMeanSquaredError**: this is the root mean squared error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate models and rate their accuracy. Geometrically, RMSE is the length of the vector of individual prediction errors in n-dimensional space, divided by the square root of n.
* **MeanAbsoluteError**: this is the mean absolute prediction error or MAE value, expressed in dollars.
* **MeanSquaredError**: this is the mean squared error, or MSE value. Note that RMSE and MSE are related: RMSE is the square root of MSE.
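
Written out as formulas, the three metrics look like this. The notation is an assumption chosen for this sketch: $y_i$ is the actual fare of trip $i$ (the Label column), $\hat{y}_i$ is the predicted fare (the Score column), and $n$ is the number of trips in the test partition.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\,y_i - \hat{y}_i\,\right|
$$

As a quick check against the output above, $\sqrt{6.51} \approx 2.55$, which matches the reported RMSE.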

The RMSE is 2.55. Is that good or bad? It is hard to say from this number alone: RMSE is expressed in dollars, just like the fare, but because every error is squared before averaging, a few large misses weigh heavily on the result.

A more intuitive metric is MAE, which stands for the mean absolute error over all individual predictions. We get a value of $1.40, which means that a taxi fare prediction is off by one dollar and forty cents on average.

What do you think about these results? Are you happy with this level of accuracy?

## Making a prediction

To wrap up, let’s use the model to make a prediction.

Imagine that I'm going to take a standard taxi trip, I cover a distance of 3.75 miles, I am the only passenger, and I pay by credit card. What would my fare be? 

Here’s how to make that prediction:

In [38]:
// create a prediction engine for one single prediction
var predictionFunction = mlContext.Model.CreatePredictionEngine<TaxiTrip, TaxiTripFarePrediction>(model);

// prep a single taxi trip
var taxiTripSample = new TaxiTrip()
{
    VendorId = "2",
    RateCode = "1",
    PassengerCount = 1,
    TripDistance = 3.75f,
    PaymentType = "1",
    FareAmount = 0 // the model will predict the actual fare for this trip
};

// make the prediction
var prediction = predictionFunction.Predict(taxiTripSample);

// show the prediction
Console.WriteLine($"Single prediction:");
Console.WriteLine($"  Predicted fare: {prediction.FareAmount:0.####}");

Single prediction:
  Predicted fare: 15.6474


You use the **CreatePredictionEngine** method to set up a prediction engine. The two type arguments are the input data class and the class to hold the prediction. And once the prediction engine is set up, you can simply call **Predict** to make a single prediction.

As you can see, the model predicts that my taxi trip should cost $15.65. 

## Further improvements

Can you think of ways to improve the accuracy of the model predictions?