# Predictions
## Intro

In this last phase of the take-home test I'm going to use H2O for convenience. On the plus side, it easily handles tests with different variable sets and parameter changes, and it provides graphics showing the evolution of the training and the importance of the variables.

Once we have the data processed and enriched from the previous notebook, we upload it to the H2O session and parse the variables. We have to take care that the code-numbered fields are marked as categorical. Parsing in H2O not only identifies the field types but also compresses the data in memory, so even though the processed files take ~2.5GB each, the three of them together occupy around 2.4GB in memory.
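
As a minimal sketch of that parsing step, the helper below builds the `col_types` mapping that `h2o.import_file` accepts, so that numerically coded fields are read as categoricals (`enum`). The column list is an assumption: `RatecodeID`, `PULocationID` and `DOLocationID` appear in this notebook, while `payment_type` is a guess at the name of the payment field, and `load_frame` is a hypothetical helper that expects an H2O session already started with `h2o.init()`.

```python
# Code-numbered fields to parse as categoricals.
# Assumption: "payment_type" is a guessed column name; adjust to the real schema.
CATEGORICAL_CODES = ["RatecodeID", "payment_type", "PULocationID", "DOLocationID"]


def enum_col_types(columns, categorical=CATEGORICAL_CODES):
    """Map every code-numbered column present in `columns` to H2O's 'enum' type."""
    return {name: "enum" for name in columns if name in categorical}


def load_frame(path, columns):
    """Import a file into a running H2O session, forcing categorical parsing.

    Hypothetical helper: it assumes h2o.init() has already been called.
    """
    import h2o  # imported lazily so enum_col_types works without H2O installed

    return h2o.import_file(path, col_types=enum_col_types(columns))
```

Columns not listed in the mapping are left to H2O's own type detection.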

It can be argued that the learning algorithms available in H2O are a very limited set, but I do not concur. For example, SVM is missing. It is a powerful algorithm, but it does not always scale well (training cost grows as R³, with R the number of free support vectors, and as nS, the product of the number of samples n and the number of support vectors S), it is dependent on the kernel used, and it can be difficult to parallelize. The algorithms we have at our disposal are the Generalized Linear Models family, Distributed Random Forest, Gradient Boosting Machine and Deep Learning. Leaving Deep Learning aside, the good thing about these algorithms is that there are efficient implementations for a high number of samples, which makes them very suitable for production solutions.

It is true that there are no methods for clustering, which could be good for profiling, nor for dimensionality reduction (and this action will be approximated somehow).

So, for a basic test, I'll consider GLM as a baseline, discard Deep Learning (it can be difficult to tune and we are not handling unstructured data such as audio or images) and see the results with the other two. I expect better results with the other regressors, as there might be non-linear characteristics that the GLM model cannot easily fit.

## Considerations

Besides parsing the data coherently with regard to the semantics (numerically coded categorical fields), there are some other considerations:
* Fields to ignore
 - the index field exported from pandas
 - Location_PU and Location_DO, which are subproducts of the merging operations and are the same as PULocationID and DOLocationID
 - total_amount: it is a linear combination of other fields, including the one we want to predict, so it is not a valid predictor. In fact, in the real world we would not have this value until the tip has been given.
 - With that selection we end up with 30 features
* Memory constraints: despite compressing the data, the dataset for three months is quite big (as noted above, around 30 million entries). So most of the models have been trained on a smaller subset (half of the entries, homogeneously selected).
* Train and validation set: a random 80/20 partition of the data has been used
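
The considerations above can be sketched in plain Python as follows. This is only an illustration under assumptions: `index` is a guessed name for the field exported from pandas, the seed is arbitrary, and the split assigns each row independently with probability 0.8, so the partition is approximately (not exactly) 80/20.

```python
import random

# Fields to leave out of the predictors.
# Assumption: "index" is a guessed name for the index column exported from pandas.
IGNORED = {"index", "Location_PU", "Location_DO", "total_amount"}
RESPONSE = "tip_amount"


def predictor_columns(columns):
    """Keep every column that is neither ignored nor the response."""
    return [c for c in columns if c not in IGNORED and c != RESPONSE]


def split_rows(n_rows, train_ratio=0.8, seed=42):
    """Randomly assign each row index to the training or the validation set."""
    rng = random.Random(seed)
    train, valid = [], []
    for i in range(n_rows):
        (train if rng.random() < train_ratio else valid).append(i)
    return train, valid
```

Fixing the seed keeps the partition reproducible, which matters when comparing models trained in different runs.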

## First GLM models

First let's build a basic model, with a Gaussian family and Lasso (L1) regularization.
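
For reference, a Gaussian GLM with a Lasso penalty minimises a squared-error loss plus an L1 term on the coefficients. The formula below is the usual textbook form, assuming the elastic-net parameterisation with the mixing parameter set to pure L1; here $y_i$ is the tip amount of sample $i$, $x_i$ its (one-hot expanded) feature vector, $N$ the number of samples and $\lambda$ the regularization strength.

$$
\min_{\beta_0,\,\beta}\;\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 \;+\; \lambda\,\lVert \beta \rVert_1
$$

The L1 term is what drives some coefficients exactly to zero, which is why the fitted model doubles as a rough feature selector.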

We obtain an MSE of 2.91 for the training set and an MSE of 2.97 for the validation set. Unless stated otherwise, the tables show the values for the validation set.

| Metric | Value |
| :--    |   --: |
|MSE	| 2.970440 |
|RMSE	| 1.723496 |
|r2	|     0.561798 |
|mean_residual_deviance	 | 2.970440 |
|mae	| 0.899362 |
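
The metrics in the table are related in simple ways, which the following sketch makes explicit (a plain-Python illustration, not the H2O implementation). For a Gaussian model the mean residual deviance coincides with the MSE, which is why both rows show the same value.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, R^2 and MAE for paired lists of values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    # Total sum of squares: the error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "r2": r2, "mae": mae}
```

Since the RMSE is the square root of the MSE, it is expressed in dollars, like the MAE.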

The model also gives a coarse reference of the importance of the variables and their incidence in the model. Nevertheless, it is not advisable to use this list as an attribution model: these predictors have been selected by the regressor training, but 1) the selection mostly captures correlations and 2) during the process one predictor may be picked over another at random.


![variable importance Lasso](h2o/GLMLasso50varsel/varimport.png)

It seems that for a simple model the payment type, the rate, the locations and the fare and charges amounts are representative, whereas the time or date are not.

It is interesting to note that the categorical variables have been transformed into one-hot encoding, so each value is considered a separate feature.

To note the main insights from this simple model: payment.1 is the usage of a credit card, RatecodeID.2 is the JFK rate, LocationID 138 is LaGuardia airport, 132 is JFK airport... It seems coherent with the insights found in the first data analyses, even if those were limited to just the 30 most likely locations in March. Also, the predictor suggests that dropping off in Brooklyn is better than in Manhattan.

If we keep a limited number of features according to the calculated importance (the absolute value of their coefficients, summed over the levels of each categorical), and only those above one-thousandth of the *relative* importance, we end up with 10 features (one third of the total). Training with those, we obtain an MSE of 2.98 for the training set and 2.97 for the validation set. A bit worse on the training set, but by almost nothing.

| Metric | Value |
| :--    |   --: |
|MSE |	2.970004|
|RMSE |	1.723370 |
|r2 |	0.561862 |
|mean_residual_deviance |	2.970004 |
|mae |	0.900110 |
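
The selection rule used here can be sketched as below. It is an illustration under assumptions: coefficient names follow the `field.level` pattern seen above (for example `RatecodeID.2`), raw field names contain no dot, and "relative importance" is taken as the aggregated absolute coefficient divided by the largest one. The input values in the usage note are made up.

```python
def select_features(coefficients, threshold=0.001):
    """Aggregate |coefficient| per original field and keep the important ones.

    `coefficients` maps names such as 'RatecodeID.2' (one-hot level) or
    'fare_amount' (numeric field) to their fitted GLM coefficient.
    """
    totals = {}
    for name, coef in coefficients.items():
        base = name.split(".", 1)[0]  # 'RatecodeID.2' -> 'RatecodeID'
        totals[base] = totals.get(base, 0.0) + abs(coef)
    top = max(totals.values())
    relative = {base: value / top for base, value in totals.items()}
    kept = [base for base, rel in relative.items() if rel > threshold]
    # Most important field first.
    return sorted(kept, key=lambda base: -relative[base])
```

For example, `select_features({"payment_type.1": 2.0, "payment_type.2": -1.0, "fare_amount": 0.5, "pickup_hour": 0.0001})` keeps `payment_type` and `fare_amount` and drops `pickup_hour`.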

## GBM models

Now we are going to test the results with GBM. The basic configuration I am going to use is 5-level trees, running at most up to 50 trees, a learning rate of 0.1, and sampling rate of 1 for samples and columns.
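
In the H2O Python API this configuration would look roughly like the sketch below. The parameter names are those of `H2OGradientBoostingEstimator`; `train_gbm`, the seed and the frame arguments are hypothetical, and `tip_amount` is assumed to be the response column.

```python
# Basic GBM configuration described in the text.
GBM_PARAMS = {
    "ntrees": 50,            # run at most up to 50 trees
    "max_depth": 5,          # 5-level trees
    "learn_rate": 0.1,       # learning rate
    "sample_rate": 1.0,      # use every row for each tree
    "col_sample_rate": 1.0,  # use every column at each split
}


def train_gbm(train, valid, predictors, response="tip_amount", seed=42):
    """Train a GBM on H2O frames (hypothetical helper; needs a running H2O session)."""
    from h2o.estimators import H2OGradientBoostingEstimator  # lazy import

    model = H2OGradientBoostingEstimator(seed=seed, **GBM_PARAMS)
    model.train(x=predictors, y=response,
                training_frame=train, validation_frame=valid)
    return model
```

Passing a validation frame is what makes the score history report both training and validation errors per number of trees.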

Using all the data, the results are better than with GLM (considering 50 trees): an MSE of 1.96 for the validation set and 1.78 for the training set:

![Score history](h2o/XGB_50_80_20_tot/evol.png)

| Metric | Value |
| :--    |   --: |
|number_of_trees |	50 |
|number_of_internal_trees |	50 |
|model_size_in_bytes |	44651 |
|min_depth |	5 |
|max_depth |	5 |
|mean_depth |	5.0 |
|min_leaves |	25 |
|max_leaves |	32 |
|mean_leaves |	29.4400 |
|MSE	 |1.957453 |
|RMSE	 |1.399090 |
|r2	 |0.711235 |
|mean_residual_deviance	 |1.957453 |
|mae	 |0.508201 |

The mean absolute error is almost half that of the GLM (0.90): about $0.51.

The GBM model also offers an estimate of the variable importance.

![Variable importance GBM](h2o/XGB_50_80_20_tot/importance.png)

The model clearly suggests just 7 variables out of 30, all related to charges, distance and **area** locations for the pick-up and drop-off. Again, not a single variable for date or time is regarded as valuable.

I will perform another test with this feature selection, and I will also add "extra", since it should be considered if we follow the same criterion used above for GLM (over one-thousandth of the relative importance). The results barely change:

| Metric | Value |
| :--    |   --: |
| MSE |	1.957158 |
| RMSE |	1.398985 |
| r2 |	0.711278 |
| mean_residual_deviance |	1.957158 |
| mae |	0.508364 |

We have reduced our model from 30 to 8 variables with reasonable results. This also helps with training times.


## DRF Models

Now let's test a DRF training. At this point I am comparing the trainings with the subset of 8 variables. It takes more time and is slightly worse than GBM. Also, the convergence is more unstable, and the overfitting seems to be a bit more serious (2.03 for the validation set, 1.90 for the training set).

![Score metrics DRF](h2o/DRF8/evol.png)

| Metric | Value |
| :--    |   --: |
| number_of_trees |	50 |
| number_of_internal_trees |	50 |
| min_depth |	20 |
| max_depth |	20 |
| mean_depth |	20.0 |
| min_leaves |	3729 |
| max_leaves |	28008 |
| mean_leaves	 |12250.4400 |
| MSE |	2.029719 |
| RMSE |	1.424682 |
| r2 |	0.700574 |
| mean_residual_deviance |	2.029719 |
| mae |	0.585851 |
| rmsle |	0.303854 |

We can try to improve the result with a different sampling rate (say 0.8 instead of the generic 0.63 provided by the tool), although at the risk of overfitting:

![Score metrics DRF](h2o/DRF8/evol_08.png)

The result is not good, although not that bad either: we have to bear in mind that the first result, with just one tree, already matches the GLM result. In any case DRF tends to overfit more, and even its error on the training set only approaches the GBM result without improving on it. So it does not seem likely that we can easily improve the results with this algorithm.
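
As a side note on the 0.63 default sampling rate: it is close to the fraction of distinct rows that a classic bootstrap sample (drawing n rows with replacement from n) is expected to contain, which tends to 1 - 1/e ≈ 0.632. Whether the tool's default was chosen for that reason is my assumption; the sketch below only checks the arithmetic.

```python
import math


def expected_unique_fraction(n_rows):
    """Expected share of distinct rows in a bootstrap sample of size n_rows.

    Each row is missed by a single draw with probability 1 - 1/n, hence it is
    missed by all n draws with probability (1 - 1/n) ** n.
    """
    return 1.0 - (1.0 - 1.0 / n_rows) ** n_rows


# Limit for large n.
LIMIT = 1.0 - math.exp(-1.0)
```

Raising the rate to 0.8 therefore gives each tree more rows than a bootstrap would, which reduces the diversity between trees.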



# Summary


## Results
We have seen three different machine learning algorithms and their results:

1. Accuracy: GBM obtained the best MSE scores with a small set of features.
2. Speed: GLM is the fastest (41s for the full set of features), although GBM almost halves the mean absolute error when trained for 329s on the reduced set (it surpasses the accuracy of GLM after 61s, with 10 trees). DRF is the slowest of the bunch, with 16min for the reduced set of 8 features.
3. Generalization: although no algorithm overfitted very much (about 5% score worsening in the validation set for GBM and DRF, 0.3% for GLM), DRF had the worst stability behaviour.
4. Potential improvements:
 - Test more hyperparameter tuning: in this basic training I haven't searched for better accuracy values, nor for much regularization (in fact this should be considered once a real model is done, with more data, as the results may vary)
   + we could use a selection criterion (Mutual Information, for example) to preselect promising features: we should not base our feature selection on the variable importance of a regressor trained in one go on the whole set (how would we deal with collinearity, for example?)
   + we could use feature extraction/reduction techniques (such as PCA), although the models may become more difficult to explain
   + we could use other approaches to feature selection (such as testing for incremental inclusion, etc.)
   + everything is subject to the requirements of the project (maybe a $0.50 MAE is good enough)
 - Spatio-temporal strategies: the model predicts tip_amount given the rest of the data, but we are interested in having trips to charge for in the first place. So, as we saw in the data exploration phase, we need to know where and when we have to be in order to pick up a passenger.
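
As an illustration of the Mutual Information criterion mentioned above, here is a minimal estimator for two discrete variables (in nats). It is a sketch, not a tested feature-selection pipeline: a continuous target such as tip_amount would first have to be binned, and how to bin it is left open here.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with counts to avoid extra divisions.
        ratio = count * n / (count_x[x] * count_y[y])
        mi += p_xy * math.log(ratio)
    return mi
```

Features would then be ranked by their score against the binned target, keeping in mind that this scores each feature separately and does not by itself account for redundancy between features.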
 
## Turning into an API

Let's consider that we choose GBM for our solution.

We have to distinguish two different phases when taking this kind of system to production.

* One is the creation of the models. The model has to be kept up to date, maybe by including more data, maybe by reassessing its validity as time goes by (for example, it could be trained only on a given time window so the samples are the most recent). The creation of the model should therefore be automated, and it should include more data than there is now. Fortunately, for GBM there are parallel implementations (XGBoost, for instance) that can process huge quantities of samples and are also available on cloud solutions, so the training seems feasible even for lots of data.
* The other is taking the model and turning it into a service. Many different options are available here. Once a backend interface has been selected (REST, GraphQL, proprietary...) and it has been decided how it is going to be offered (usage, clients, authorization, licensing, etc.), deploying the model should not be difficult.
 - The model and the resources needed to run it are orders of magnitude smaller than in the training phase: it is just a set of decision trees, so the execution is fast. In fact many implementations can provide, for instance, a small Python code snippet implementing the decision logic. H2O provides a Java implementation and a MOJO deployable package (one is provided for the selected regressor). The only caveat is that any transformation used to create features must be replicated when we want to apply the model to a new sample.
 - The interface definition does not need to be complex, as long as the data sample to use for prediction can be handled (which any modern data API can do nowadays).
 - As in this simple case the model does not need any history of the data and all calls can be considered independent, load balancing under high load is easy because no session needs to be kept.
 - And if some application needs to work on top of a service like this, most frontend frameworks natively support access to this kind of interface (for instance, REST interfaces from Angular). It may still be a good idea to send the full data available and not just the features we have selected: we don't know if we are going to change the model in the future, and changing an interface once an environment has been established can be more expensive than sending a few more bytes per request.
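
To make the service side concrete, here is a minimal sketch of a stateless JSON endpoint using only the Python standard library. Everything specific in it is an assumption for illustration: the feature names in `MODEL_FEATURES`, the `stub_model` scorer (a placeholder standing in for the exported GBM, not the trained model) and the port. The request carries the full trip record and the service picks the features it needs, as suggested above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical feature names; the real list is whatever the model was trained on.
MODEL_FEATURES = ["fare_amount", "trip_distance", "extra"]


def stub_model(features):
    """Placeholder scorer standing in for the exported GBM."""
    return 0.15 * features["fare_amount"]


def handle_request(body, model=stub_model):
    """Stateless handling: full trip record in, predicted tip out."""
    try:
        record = json.loads(body)
        # Extra fields in the record are accepted and simply ignored.
        features = {name: float(record[name]) for name in MODEL_FEATURES}
    except (ValueError, KeyError, TypeError) as error:
        return 400, json.dumps({"error": str(error)}).encode()
    return 200, json.dumps({"tip_amount": model(features)}).encode()


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the service; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Because `handle_request` keeps no state between calls, any number of identical instances can sit behind a load balancer, and swapping the model only means changing `MODEL_FEATURES` and the scorer, not the interface.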