%%HTML
<link rel="stylesheet" type="text/css" href="css/custom.css">

<img src='../images/gdd-logo.png' width='300px' align='right' style="padding: 15px">


# Choosing a Machine Learning Algorithm

There is no simple answer to the question of which machine learning algorithm is best for your task, since every task is different. The choice of algorithm comes down to many factors. We are going to cover some of these in this notebook:

- [The target feature](#target)
- [The business need](#buss)
- [The computational time](#time)
- [Number and type of features](#number)

This is not an exhaustive list, but a good place to start when deciding on the right algorithm. Let's dive into this in more detail.

<img src='../images/05_Choosing_Models/target.png' width=300px align=right>

<a id='target'></a>

# The Target Feature

First off we need to find out if we have a target feature. What kind of problems would you have where there is no existing target to predict?

- Customer segmentation - To build marketing or other business strategies
- Genetics - E.g. clustering DNA patterns to analyze evolutionary biology
- Recommender systems - Grouping together users with similar patterns to recommend similar content
- Anomaly detection - Fraud detection or detecting defective mechanical parts

These kinds of problems require an **unsupervised** solution.

<mark>Question: How would you evaluate the success of your models if you do not have a target? </mark> 

# The Target Feature

In this training we have focused on learning models to predict a target vector. This is known as **supervised** learning. 

When the target vector consists of categorical values it is called **classification** and when it consists of continuous values it is called **regression**.
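As a rough illustration of the distinction above, the sketch below guesses the problem type from the values in a target. The function name `suggest_problem_type` and the `max_classes` cut-off are hypothetical choices made for this example, not a standard rule; in practice the decision also depends on what the values mean.

```python
from numbers import Real


def suggest_problem_type(target, max_classes=10):
    """Guess whether a target calls for classification or regression.

    `max_classes` is an arbitrary cut-off chosen for this illustration.
    """
    distinct = set(target)
    # Strings, booleans, or other non-numeric labels are categorical.
    if any(isinstance(v, bool) or not isinstance(v, Real) for v in distinct):
        return "classification"
    # Whole numbers with only a few distinct values are usually class labels.
    if all(float(v).is_integer() for v in distinct) and len(distinct) <= max_classes:
        return "classification"
    return "regression"


print(suggest_problem_type(["spam", "ham", "spam"]))     # classification
print(suggest_problem_type([0, 1, 1, 0]))                # classification
print(suggest_problem_type([1520.50, 310.75, 9875.10]))  # regression
```

A heuristic like this is only a starting point: an integer target such as "number of purchases" may still be better treated as a regression problem.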

## <mark>Exercise: Decide on the Target Feature</mark>

What would the target feature be for the following cases:

- Predict next year's savings balances
- Determine the cure rate for a disease
- Create an automated filter for spam emails
- Probability of a transaction being fraudulent

Questions to ask:
- Where does this field come from?
- Do you need to create it?
- If you are predicting a probability, what is it a probability of? (E.g. when estimating the likelihood that a customer will not repay a loan, the target feature could be derived from the number of missed payments in the last 6 months)
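When the target does not exist yet, it has to be derived from other fields. The sketch below builds a binary target for the loan example from a count of missed payments. The records, the field name `missed_payments_6m`, and the threshold of two missed payments are all made up for illustration; the real definition should come from the business.

```python
# Hypothetical loan records; the field names and values are made up.
loans = [
    {"customer_id": 1, "missed_payments_6m": 0},
    {"customer_id": 2, "missed_payments_6m": 3},
    {"customer_id": 3, "missed_payments_6m": 1},
]

# Assumed business rule: two or more missed payments counts as a default.
DEFAULT_THRESHOLD = 2


def add_default_target(records, threshold=DEFAULT_THRESHOLD):
    """Return copies of the records with a binary `defaulted` target added."""
    return [
        {**row, "defaulted": int(row["missed_payments_6m"] >= threshold)}
        for row in records
    ]


labelled = add_default_target(loans)
print([row["defaulted"] for row in labelled])  # [0, 1, 0]
```

A classifier trained on `defaulted` can then output a probability of default, and it is clear what that probability refers to.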

<img src='../images/05_Choosing_Models/target.png' width=300px align=right>

## Target Conclusion

The target variable tells us a lot about where to start when looking for an algorithm.

Questions to ask:
- Do I have a target variable?
- Can I create a target variable? 
- What does my target look like? (continuous vs categorical)

<a id='buss'></a>

<img src='../images/05_Choosing_Models/business-need.jpeg' width=300px align=right>


# The Business Need

## Accuracy vs Interpretability

The **accuracy** of a model means that the predictions are close to the true responses for each observation. 

**Interpretability** is the degree to which a model can be understood in human terms.

A highly interpretable algorithm lets us understand which key factors go into calculating its predictions. More flexible models tend to give higher accuracy, but at the cost of lower interpretability.
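To make the contrast concrete, here is a minimal sketch that fits a restrictive model (logistic regression) and a flexible one (random forest) on scikit-learn's bundled breast cancer dataset. The dataset and hyperparameters are choices made for this example only. The logistic regression exposes one coefficient per feature, while the forest offers no single formula to inspect.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

# Restrictive model: one coefficient per (scaled) feature, easy to read off.
linear = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
linear.fit(X_train, y_train)
coefs = linear.named_steps["logisticregression"].coef_[0]

# Show the three features with the largest absolute coefficients.
top = sorted(zip(data.feature_names, coefs), key=lambda p: abs(p[1]), reverse=True)[:3]
for name, coef in top:
    print(f"{name}: {coef:+.2f}")

# Flexible model: an ensemble of 200 trees, much harder to explain.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

linear_accuracy = linear.score(X_test, y_test)
forest_accuracy = forest.score(X_test, y_test)
print("logistic regression accuracy:", linear_accuracy)
print("random forest accuracy:", forest_accuracy)
```

On a small tabular dataset like this one, the simpler model may well be just as accurate. The trade-off is a tendency, not a rule, so it is worth measuring both.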

<img src='../images/05_Choosing_Models/ml-tradeoff.png'>

<img src='../images/05_Choosing_Models/business-need.jpeg' width=300px align=right>


## The Business Need Conclusion

So now the choice of algorithm depends on the objective of the business problem. 

- If interpretability is a high priority, restrictive models are the better choice because they are much easier to interpret. 
- If higher accuracy is the goal and interpretability is not needed, a more flexible model would suit. As the flexibility of a method increases, its interpretability generally decreases.

Can you think of two example business cases for each of these priorities? One for interpretability and one for accuracy?

### Examples

Accuracy: 
- Handwriting recognition: we don't need to know how the model knows whose handwriting it is, but we want to be as accurate as possible

Interpretability:
- Customer churn: can we identify why customers leave, and therefore stop them leaving in the future?

<img src='../images/05_Choosing_Models/computer.png' width=300px align=right>

<a id='time'></a>

# Computational Time

Higher accuracy typically comes at the cost of longer training time, and algorithms take longer to train as the training set grows. In real-world applications, these two factors often drive the choice of algorithm.

Quick to implement/easy to run:
- Linear regression
- Logistic regression
- Simple decision tree

Slower to train:
- SVM 
- Neural networks
- Random forests
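Training time is easy to measure directly. The sketch below times `fit` for three of the models listed above on a synthetic dataset; the dataset size and hyperparameters are arbitrary choices for illustration, and the timings will vary with your hardware and data.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic data; 2000 samples and 20 features are arbitrary choices.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)": SVC(),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Record the wall-clock time each model needs to fit the same data.
fit_seconds = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds[name] = time.perf_counter() - start

for name, seconds in fit_seconds.items():
    print(f"{name}: {seconds:.3f} s")
```

Rerunning this with a larger `n_samples` gives a feel for how each algorithm scales before you commit to one.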

<img src='../images/05_Choosing_Models/numbers.jpeg' width=300px align=right>

<a id='number'></a>

# Number of Features

The dataset may have a large number of features that may not all be relevant and significant. For certain types of data, such as genetic or textual, the number of features can be very large compared to the number of data points.

Algorithms that are heavily impacted by a large number of features (plus irrelevant or highly correlated features):
- K-Nearest Neighbour
- Logistic Regression

A large number of features can bog down some learning algorithms and make training slow. An SVM is often better suited to data with a large feature space and relatively few observations. 
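One way to see the effect of irrelevant features is to add pure noise columns and compare cross-validated scores. The sketch below does this for K-Nearest Neighbour on synthetic data; the sample size, the 200 noise columns, and `n_neighbors=5` are arbitrary choices for this example. You would typically expect the noisy score to be lower, because the noise dominates the distance calculation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Five features, all of them informative about the class label.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=5, n_redundant=0, random_state=0
)

# Append 200 columns of pure noise that carry no signal about y.
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 200))])

knn = KNeighborsClassifier(n_neighbors=5)
score_clean = cross_val_score(knn, X, y, cv=5).mean()
score_noisy = cross_val_score(knn, X_noisy, y, cv=5).mean()

print(f"5 informative features: {score_clean:.2f}")
print(f"plus 200 noise features: {score_noisy:.2f}")
```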

### Dimensionality Reduction

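If you have too many features, you can reduce dimensionality with principal component analysis (PCA), which combines features into a smaller set of components, or with feature selection techniques, which keep only the most important features. Both aim to retain as much of the useful information as possible.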
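A minimal PCA sketch using scikit-learn's bundled digits dataset, chosen here only for illustration. Passing a fraction as `n_components` asks PCA to keep just enough components to explain that share of the variance; the 0.95 used below is an arbitrary choice, and some information is always discarded.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 1797 samples, 64 pixel features

# A float n_components keeps enough components to explain that
# fraction of the variance; 0.95 is an arbitrary choice here.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print("original features:", X.shape[1])
print("components kept:", X_reduced.shape[1])
print("variance retained:", pca.explained_variance_ratio_.sum())
```

The reduced matrix `X_reduced` can be passed to any downstream model in place of `X`, at the price of components that are harder to interpret than the original features.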

# Pros and Cons of Algorithms

<img src='../images/05_Choosing_Models/pros-cons.png'>

# Cheatsheet

A cheatsheet can be helpful, but really it's down to you, your understanding of the models and of the subject matter to make the final choice.

<img src='../images/05_Choosing_Models/cheatsheet.png'>

## <mark>Exercise - Making a plan for a Use Case</mark>

In the assignment folder you'll find a [Choosing Models assignment](assignments/05_Choosing_Models/). In teams, you will come up with a plan for a given use case. You must decide on the type of problem and the algorithm you will use to solve it. In particular, you will answer the following questions:

- Identify your target variable
- What kind of problem do you have? A regression or classification problem?
- Is it more important to be accurate or to have interpretability?
- What type of model do you think would be good to use?
- What features (besides the ones in the dataset above) might be important to include?
- How would you determine the performance of your model? 