These notebooks are part of Kaggle’s [Practical Model Evaluation](https://www.kaggle.com/practical-model-evaluation) event, which ran from December 3-5 2019. You can find [the livestreams for this event here](https://youtu.be/7RdKnACscjA?list=PLqFaTIg4myu-HA1VGJi_7IGFkKRoZeOFt).

***

During a [Kaggle competition](https://www.kaggle.com/competitions), we evaluate your models in a specific way: the predictions you make are compared to a ground truth using a specific metric. (Which metric depends on the competition and the question we’re asking.) Whichever model achieves the highest score on the final validation dataset wins. Pretty simple, right?

If you’re working on building machine learning models in a work setting, however, things may be a bit more complicated. Achieving a good score on your metric of choice is important, of course, but it’s only part of the problem. When picking the best model to use for a particular problem, some of the things you need to consider include: 

* Your time
* Computation time and cost for training and inference
* Model performance (and not just your loss function!)

Let’s talk about each of these points in turn and then how to balance them when picking what type of model to work on.

## Your Time

It’s easy to forget that **your time is a limited and valuable resource**. It can be hard to predict how long a data science project will take, but here are some things that can take more time than you might anticipate.

* **Scoping projects.** What’s feasible? What do you currently have enough data to be able to do? What would be the most valuable type of model? What timelines are you working with? Figuring out the answers to these questions, and making sure that your stakeholders agree, can be extremely time-consuming. (If you're lucky enough to be working with a project or product manager, they should be able to help you here.)
* **Setting up your environment and installing dependencies.** True story: one year in grad school I ended up spending an entire summer wrestling with different audio codecs and their dependencies! It can be easy to underestimate how much time it will take to get a new modelling framework and all its dependencies set up. Especially for very new frameworks, you may end up finding brand new problems or bugs that you’ll need to solve before you can even start training your model.
* **Data collection and preparation.** If you’re lucky, the data you need will already exist. If you’re *really* lucky it will already be in a form that you can use for modelling. Probably at least one of these things will not be true and correcting it will inevitably end up taking far, far longer than you think it should. According to the [2018 Kaggle machine learning and data science survey](https://www.kaggle.com/kaggle/kaggle-survey-2018), data scientists spend over half of their time gathering, cleaning and visualizing data.
* **Communicating with stakeholders.** If you build a model that no one ever uses, what’s the point? 🤷 Working with stakeholders (i.e. people who have an interest in your work) is really, really important to make sure that 1) you’re building a model that addresses a real need, 2) your stakeholders understand the strengths and limitations of your model and also machine learning in general and 3) your model actually gets used.

With these things in mind, here are some tips you can use to reduce the amount of time it takes you to get your model up and training:

* It’s often a good idea to start with an established, older model family and implementation. There will likely be more example code for you to work from and, hopefully, fewer bugs. This is especially true if you already have it installed correctly to train a model with whatever compute you’re using. 
* If you can, try to find a container (like a Docker container) that has all the dependencies with the correct versions already set up for the packages you want to use. This will help you save on setup time.
* For data cleaning, before you start working, spend some time writing out what you need your data to look like before it goes in the model. Then list out all the steps you need to go through to get your data to that form. This will keep you from getting too sidetracked during data cleaning and give you a nice checklist to work through and track your progress on.
* If your data is tabular and in a SQL database, do as much of your data cleaning as possible in SQL. A well written SQL query is generally *much* faster than running the equivalent data manipulations in Python or R.
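
As a rough illustration of pushing cleaning into SQL, here is a minimal sketch using Python's built-in `sqlite3` module. The `responses` table, its columns and its rows are all made up for this example; your real database and schema will look different.

```python
import sqlite3

# Hypothetical in-memory table standing in for a real survey database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (job_title TEXT, years_experience REAL)")
conn.executemany(
    "INSERT INTO responses VALUES (?, ?)",
    [
        (" Data Scientist ", 3.0),
        ("data scientist", None),  # missing value we want to drop
        ("Data Analyst", 1.5),
        ("", 2.0),                 # empty title we want to drop
    ],
)

# Trim whitespace, normalize case and filter unusable rows inside the
# database, so only clean rows ever reach Python.
cleaned = conn.execute(
    """
    SELECT LOWER(TRIM(job_title)) AS clean_title, years_experience
    FROM responses
    WHERE TRIM(job_title) != ''
      AND years_experience IS NOT NULL
    ORDER BY clean_title
    """
).fetchall()
conn.close()

print(cleaned)
```

Only the rows that survive the `WHERE` clause are transferred out of the database, which is a big part of why this approach tends to be faster on large tables.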

## Compute time

If you’ve trained larger models before, this is a limitation you’re very familiar with. The more trainable parameters a model has and the more data you’re using to train it, the longer it will take. And since computing time costs money (either in electricity if you’re working on your own machines or in being billed for the time if you’re using somebody else’s), longer training also means a bigger bill. The initial training time isn’t the only factor to consider, however. Here are some other things you might not have considered:

* **How long will it take to update your model?** Depending on your specific project, you may need to regularly update your model or retrain a new one from scratch. Some models, like neural networks, can generally be updated. For other models, especially random forests, it’s often easiest to retrain from scratch. If you’ll need to retrain your model often, you might consider favoring a model that is quick to train or that can be updated in place.
* **How long does it take your model to make predictions?** This is often called “inference time” and it’s really easy to forget to check in the excitement of training models! If your model is very accurate but so slow that everyone who tries to use it quits before they get their results back, it’s probably not actually a very good model.

Depending on your specific problem, **you’ll have to choose how you want to balance the time it takes to train your model, update your trained model and use your model to make predictions**.
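
One simple way to put numbers on these trade-offs is to time training and inference separately. The sketch below uses only the standard library; `train` and `predict` are hypothetical stand-ins for whatever model you are actually working with, and `timed` is a helper invented for this example.

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical stand-in "model": it just remembers the most common label.
def train(labels):
    return max(set(labels), key=labels.count)

def predict(model, n_rows):
    return [model] * n_rows

model, train_seconds = timed(train, ["analyst", "scientist", "scientist"])
preds, inference_seconds = timed(predict, model, 1000)

print(f"training:  {train_seconds:.6f} s")
print(f"inference: {inference_seconds:.6f} s for {len(preds)} rows")
```

Timing a retraining run works the same way: wrap whatever function you use to update the model and compare the three numbers side by side.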

## Model performance

The most common way to measure a given model is by its loss metric. (I personally generally go with cross entropy for classification and, as long as I care about outliers, mean squared error for regression.) However, while these metrics are very useful for training, they can pretty easily hide important differences between individual models.
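
To make that concrete, here is a small sketch with made-up numbers: two hypothetical regression models that get exactly the same mean squared error while making very different kinds of mistakes.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up targets and predictions, chosen so the arithmetic is easy to check.
y_true = [10.0, 10.0, 10.0, 10.0]
model_a = [11.0, 11.0, 11.0, 11.0]  # off by 1 on every example
model_b = [10.0, 10.0, 10.0, 12.0]  # perfect except for one error of 2

mse_a = mean_squared_error(y_true, model_a)
mse_b = mean_squared_error(y_true, model_b)

print(mse_a, mse_b)  # both are 1.0
```

Looking only at the loss, you couldn't tell these two models apart, even though one is slightly wrong everywhere and the other is badly wrong in one place.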

### Error analysis

This is where *error analysis* comes into play. In machine learning, people generally use “error analysis” to mean looking at how many and what sorts of errors a model made. This can be helpful during model training and tuning to help you identify places where your model can be improved, usually through additional feature engineering or data preprocessing. 

Including a discussion of error analysis with your final model can also help build trust. 

> All machine learning models make errors. It’s important to be able to clearly communicate the types of errors your model is likely to make and consider that when selecting which model to implement.

For this workshop, we’ll be looking at multi-class classification problems and using confusion matrices to quickly summarize errors. 
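
If you haven't worked with confusion matrices before, here is a minimal sketch of how one can be built by hand. The labels and predictions are invented for illustration, and `confusion_matrix` here is a small helper written for this example (not a library function): rows are the true class and columns are the predicted class.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix where entry [i][j] counts true labels[i] predicted as labels[j]."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Made-up labels and predictions for a three-class problem.
labels = ["analyst", "engineer", "scientist"]
y_true = ["analyst", "analyst", "engineer", "scientist", "scientist", "scientist"]
y_pred = ["analyst", "scientist", "engineer", "scientist", "analyst", "scientist"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>10}: {row}")
```

The diagonal holds the correct predictions. Everything off the diagonal is an error, and its position tells you which classes are being confused with each other.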

### Interpretability

Another important thing to consider when evaluating model performance is *interpretability*. Specifically, *why* did a given model output a specific decision for a specific class? When you’re working with stakeholders who have a lot of knowledge about the data you’re working with, being able to offer an answer to this question can help build trust in your model. 

For this particular workshop, we’ll be using counterfactuals as a way to interpret model decisions. Counterfactuals let you ask “what feature would I need to change and in what way in order to get a different output?” or, in the easier-to-compute case, “how would my model output change if I changed a single feature in a specific way?”. 

> Counterfactuals have two main advantages: you can use them for any class of model and they’re easy for someone without much of a machine learning background to understand. Which is important; not everyone on your team is going to have--or need--a deep background in machine learning.
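
Here is a minimal sketch of the easier-to-compute case, where we change a single feature and see how the output changes. Everything in it is hypothetical: `toy_model` is a hand-written rule standing in for a trained model, and the feature names are made up.

```python
def toy_model(features):
    """Hypothetical rule-based classifier standing in for any trained model."""
    if features["builds_ml_models"] and features["years_coding"] >= 2:
        return "data scientist"
    return "data analyst"

def single_feature_counterfactual(model, features, name, new_value):
    """Return the model output before and after changing one feature."""
    changed = {**features, name: new_value}  # copy, so the original is untouched
    return model(features), model(changed)

original = {"builds_ml_models": False, "years_coding": 3}
before, after = single_feature_counterfactual(
    toy_model, original, "builds_ml_models", True
)

print(f"before: {before}, after: {after}")
```

Because the helper only ever calls `model(features)`, the same approach works for any model you can get a prediction from, whatever is going on inside it.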

We’ll talk more about them later on, but for now, if you’re curious, you can check out [this chapter in Christoph Molnar’s book “Interpretable Machine Learning: A Guide for Making Black Box Models Explainable” for more details](https://christophm.github.io/interpretable-ml-book/counterfactual.html#generating-counterfactual-explanations).

# Exercise!

For the purposes of this workshop, we’ll be working on a project to predict what job title a data-science-related role will have given the responsibilities of the role using data from the 2018 & 2019 Kaggle Data Science Survey.

Let’s pretend that you’re working for a consulting company that helps other companies hire data scientists. We’ll be building this tool as an automated first step for people interested in consulting with the company to narrow down what type of role they’re looking for, and that information will be used to help match them with the most relevant consultant on the team.

With this in mind, take some time to think about and answer the following questions. If you’d like to share, you can post your answers in the comments below.

* Take a look [at the 2019 survey data](https://www.kaggle.com/c/kaggle-survey-2019). Which particular fields might make good features? How much cleaning will this data need for you to be ready to train a classification model on it? (I'll be providing one version of the cleaned data, but if you have time you can do your own data cleaning and feature engineering.)
* Data science is a fast-moving field and job titles change quickly. On the other hand, hiring events are relatively rare. Given that you’ll need to retrain your model fairly often but won’t need to run it very often, how would you balance training time, retraining time and inference time?
* What sorts of errors should you (or your imaginary stakeholders) be particularly worried about given the subject matter of the model?