# Week 1 Notes

## Improving Model Performance

Why is having a disciplined machine learning strategy important? Suppose we train a computer algorithm for a machine learning task. Its performance is not acceptable - we need to improve. There are many changes we can try making to the system, including:

* Collect more data

* Collect more diverse training data (balance)

* Change convergence criteria 
    - train for more iterations (if using iterative approach)
    - train until change is within epsilon
 
* Try different initial parameters

* Try different algorithms

* Try bigger network (more parameters/complexity)

* Try smaller network (fewer parameters/complexity)

* Try dropout

* Add regularization

* Change Network Architecture
    - number of activation units in different layers
    - number of hidden layers
    - different activation function

* Data Augmentation
    - artificial data
    - similar data
    
* Bootstrapping

* Bagging
 
* Boosting 
 
* Early Stopping

Changing these controls essentially produces a new solution each time. We are then trying to select the best solution by altering the settings for how it is trained on which kind of data.

Almost all of these are good ideas - having different effects depending on the reason underlying unacceptable performance. However, without a strategy we could go in circles trying one thing and then another - possibly wasting time and resources and not getting optimal results. 

A principled way of diagnosing problems and applying remedies gets us to better performance much sooner, and in a way that can be reasoned about.

The difference might be seen as akin to a well trained doctor with good diagnostic tools vs someone who has seen many kinds of illnesses but has no logical and sound framework for deciding what to try next.

An important part of applying the right changes is doing them with a sense of fairness - that is that if we change one aspect of the algorithm or data then it should have just one measurable effect on the performance - ideally something that we can reason about.

## Orthogonalization

In many problems it is convenient to find a set of controls where changing one control does not change the others, only the quantity (for example a performance metric) that we are interested in.

![](orthogonal_vectors.gif)

One example is a co-ordinate system. If we have a set of orthogonal basis vectors then changing the size of one of them does not change the others - the basis vectors are perpendicular to each other. If we worked in a non-orthogonal set of basis vectors then changing the size of one vector impacts the co-ordinates of the others as well as changing distance.

![](non_orthogonal_vectors.png)
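The co-ordinate example can be checked numerically. The sketch below is only an illustration under assumed values: the vectors `b1`, `b2` and the point `p` are made up for the demonstration, and the "reading" along an axis is taken to be the (unnormalized) dot product with that axis.

```python
def dot(u, v):
    """Dot product of two 2-D vectors."""
    return u[0] * v[0] + u[1] * v[1]


def step(point, direction, amount):
    """Move `point` by `amount` along `direction`."""
    return (point[0] + amount * direction[0],
            point[1] + amount * direction[1])


# Orthogonal pair: the standard x and y axes.
e1, e2 = (1.0, 0.0), (0.0, 1.0)
# Non-orthogonal pair (hypothetical values): b2 leans towards b1.
b1, b2 = (1.0, 0.0), (1.0, 1.0)

p = (2.0, 3.0)  # arbitrary starting point

# Moving along e1 leaves the reading along e2 untouched.
ortho_before = dot(p, e2)
ortho_after = dot(step(p, e1, 5.0), e2)

# Moving along b1 also changes the reading along b2.
skew_before = dot(p, b2)
skew_after = dot(step(p, b1, 5.0), b2)

print(ortho_before, ortho_after)  # 3.0 3.0
print(skew_before, skew_after)    # 5.0 10.0
```

In the orthogonal case one "knob" moves one reading; in the skewed case a single knob moves two readings at once, which is the situation we want to avoid when tuning a model.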

Another is the controls of a vehicle. Speed and direction are on separate controls (pedals and steering wheel). Changing the speed influences the distance traveled, but it does not change the direction. Direction also influences the distance, but doesn't affect speed. It is conceivable to create controls that change both direction and speed at the same time, making it much harder to understand and achieve the desired action.

![car navigation example](steering_wheel_accelerator_brake.jpg)

Another example is an old CRT TV. These televisions had controls or knobs that change height, width, vertical center and horizontal center. There are 4 controls so that each one impacts just one axis. These were carefully set. If instead we had controls that moved both the horizontal and vertical center in one knob - it would be a lot more work.

![Old TV](crt_controls.jpg)

Similarly, with respect to machine learning performance, we want to have controls that influence performance but not the other controls that we may want to change. Some controls such as "early stopping" are not orthogonal to the others and so we don't consider them. We also suggest that one control at a time is changed to see its effect on performance.

We will spend the rest of the course understanding how we can identify the reasons for poor performance and what changes can be expected from changing certain controls.

## Single Number Evaluation Metric

In some cases it might be natural that our performance measure is a vector of numbers. In that case we need to choose one or create one from the vector. It becomes difficult if not impossible to order between a collection of vectors if some go up and others down as we change a control.

If we need to optimize a vector of metrics the best suggestion is to compromise and approximate this with a function that takes the vector of metrics as input and outputs a scalar. It can become too difficult to select which control setting to choose otherwise.

![single number metric](single_metric_f1.png)

One example is wanting a binary classifier to have high precision and high recall. For this, the recommended strategy is to combine precision $P$ and recall $R$ into their harmonic mean, $F_1 = 2PR/(P+R)$ - which is called the F1 score. Then this F1 score should be improved by changing the controls.

For the example above we can order by the f1 score and choose the logistic classifier as the best of them. 
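As a rough sketch of how such an ordering can be computed, the snippet below ranks classifiers by F1 score. The precision and recall values are hypothetical placeholders chosen for illustration; they are not the numbers from the figure.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs, NOT the values from the figure.
classifiers = {
    "logistic": (0.90, 0.88),
    "svm": (0.85, 0.80),
    "random_forest": (0.95, 0.75),
    "decision_tree": (0.92, 0.70),
}

# One scalar per classifier, so the candidates can be totally ordered.
f1 = {name: f1_score(p, r) for name, (p, r) in classifiers.items()}
ranking = sorted(f1, key=f1.get, reverse=True)
best = ranking[0]

for name in ranking:
    print(f"{name}: F1 = {f1[name]:.3f}")
print("best:", best)  # best: logistic
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a classifier cannot score well by being excellent on one metric and poor on the other.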

If we just tried comparing precision and recall, we could only discount the SVM and the decision tree, as they have strictly worse recall and precision than the logistic classifier and the random forest respectively. However, each of the other classifiers is better than the rest on one of recall or precision, so we don't have a preference relation over them.
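The "strictly worse on both metrics" comparison can be sketched as a dominance check. Again the (precision, recall) pairs are hypothetical, not taken from the figure, and `dominates` uses the strict notion described in the text (better on every metric), which is slightly narrower than the usual Pareto definition.

```python
def dominates(a, b):
    """True if metric vector `a` is strictly better than `b` on every metric."""
    return all(x > y for x, y in zip(a, b))


# Hypothetical (precision, recall) pairs, NOT the values from the figure.
scores = {
    "logistic": (0.90, 0.88),
    "svm": (0.85, 0.80),
    "random_forest": (0.95, 0.75),
    "decision_tree": (0.92, 0.70),
}

# Keep only the classifiers that no other classifier beats on both metrics.
non_dominated = [
    name for name in scores
    if not any(dominates(scores[other], scores[name])
               for other in scores if other != name)
]

print(non_dominated)  # ['logistic', 'random_forest']
```

The check can throw away clearly inferior candidates, but it leaves more than one survivor, so a single scalar metric is still needed to pick between them.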

By having a single metric over a vector we effectively create a utility function that values the different vectors by utility.

## Satisficing And Optimizing Metrics

Sometimes we might not have a good utility function over the metrics we monitor for the problem in mind.

### One Optimizing Metric And Many Satisficing Metrics

If we have $N$ metrics on a solution family and no natural utility function over them, we might still have conditions on the metric vector, to help us choose between them.

Often we can set one metric as that to be optimized and the others as having some constraint to be satisfied. A good solution is then to put constraints to be satisfied on $N-1$ metrics and retain one to be optimized.

By doing this, we can focus on a single number metric, while still chasing desirable features.

### Example: Accuracy and Runtime

For some time-critical applications (such as web or mobile apps) we can only deploy an application that has a fast enough runtime to be commercially relevant. A solution might reach state of the art performance and still have an unacceptable runtime. (This is akin to being offered an amazing meal four hours after sitting down in a restaurant.)

In this case we could threshold our maximum runtime, optimizing accuracy (performance) that meets this satisficing constraint. Any time a solution fails this constraint on the runtime metric we reject it instantly and focus on those that will be at or below the runtime.

If we have a family of solutions, then we just filter out those that don't make the threshold cut and then choose the one with the optimal accuracy (optimizing performance) metric.
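A minimal sketch of this filter-then-optimize step is given below. The candidate names, accuracies, runtimes and the 100 ms threshold are all made-up values for illustration.

```python
MAX_RUNTIME_MS = 100  # hypothetical satisficing threshold

# Hypothetical family of solutions with their measured metrics.
candidates = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]


def pick_best(candidates, max_runtime_ms):
    """Reject solutions that miss the runtime constraint (satisficing metric),
    then return the most accurate of the rest (optimizing metric)."""
    feasible = [c for c in candidates if c["runtime_ms"] <= max_runtime_ms]
    if not feasible:
        return None  # nothing meets the constraint
    return max(feasible, key=lambda c: c["accuracy"])


best = pick_best(candidates, MAX_RUNTIME_MS)
print(best["name"])  # B
```

Solution C has the highest accuracy but is rejected outright; among the solutions that satisfy the constraint, a little extra runtime headroom earns nothing, so only accuracy decides.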

### Example: Wake Word

Here we give another example, wake words for devices such as Amazon's Alexa, Google's Home, Apple's Siri and Baidu's Razor. A number of metrics are needed for this problem:

* fast runtime
* high accuracy
* low false positives

In industry, fast runtime is not a concern for most models: the hardware and network speeds are fast enough to accommodate most current state of the art algorithms at commercially relevant speeds. Instead, we need to keep the false positives to a reasonable level, say 1 every 24 hours - the satisficing constraint on the metric. Then we need to optimize accuracy while maintaining this constraint.

## Train/Dev/Test Distributions

Assuming we have enough data - which is often the case in the age of big data - our splitting should be random. We also assume that the data is fairly balanced for classification tasks (the proportions of the classes are not excessively different), otherwise we may need over or undersampling.


### Selecting Training, Dev And Test Sets 

Splitting the data into training, dev and test sets is needed because the hyperparameters can't be cleanly optimized before learning the parameters, and a fair assessment of a solution's performance needs to be done on unseen data.

* training set is used to learn the parameters for some fixed hyperparameters.

* dev (also called cross-validation, validation or hold-out) set is used to tune the hyperparameters so that the best parameters can be learned.

* test set is used to get an unbiased estimate of the performance of the solution on future data.

### Learning vs Tuning

On the training set we *optimize* the parameters with respect to some performance measure (cost function) - usually via an iterative procedure such as gradient descent.

On the dev set, we *tune* the hyperparameters usually using one of:

* random search

* grid search

* Bayesian hyperparameter optimization

Dev sets are needed because the hyperparameters can be too difficult (mathematically intractable and/or too computationally expensive) to learn in the iterative optimization process. Using dev sets is akin to performing an empirical Bayes procedure where we learn the best priors using the data itself. 
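The split between learning and tuning can be sketched on a toy problem. Everything here is an assumption made for illustration: the synthetic data, the one-parameter ridge model $y \approx wx$, the grid of `lam` values, and the use of a closed-form fit in place of gradient descent to keep the example short.

```python
import random


def fit_slope(xs, ys, lam):
    """Learn the parameter w on the training set by minimising
    sum((y - w*x)^2) + lam * w^2, which has a closed-form solution."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(xs, ys, w):
    """Mean squared error of the fitted line on a data set."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(rng, n):
    """Hypothetical data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(rng, 20)
dev_x, dev_y = make_data(rng, 20)

# Grid search: for each hyperparameter value, learn w on the training set,
# then score that w on the dev set.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
dev_error = {}
for lam in grid:
    w = fit_slope(train_x, train_y, lam)
    dev_error[lam] = mse(dev_x, dev_y, w)

best_lam = min(dev_error, key=dev_error.get)
print("best lam:", best_lam)

# Random search differs only in how the candidate values are proposed.
random_candidates = [10 ** rng.uniform(-3, 2) for _ in range(5)]
```

The parameter `w` is only ever fitted on the training data and the hyperparameter `lam` is only ever chosen on the dev data, which is the division of labour described above.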

The effect tends to be to smooth out how strongly the parameters of a solution stick to the features that were realized in the training data. The estimates (such as class probabilities, or statistical error and statistical bias parameters) will be flatter and less opinionated, that is, less confident in their predictions. In this sense, tuning on a dev set acts as a form of regularization.

Test sets are needed because if we try to estimate the expected performance using our training or dev sets, then we will overestimate since we have optimized and tuned over this seen data. No fitting takes place using the test set - it is only to anticipate performance.

### Importance Of Same Distribution In Splits

In order to tune hyperparameters and get an unbiased estimate of the likely performance on future data, we need to split our dataset into training, dev and test sets. This needs to be done randomly so that the underlying distribution of the splits is the same. In this way our performance measure will be learning and improving for the same type of data.

Usually the best strategy is to sample at random - although stratified sampling may be justified in certain instances.
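Both sampling strategies can be sketched with the standard library alone. The function names, the 60/20/20 proportions and the toy labelled data are assumptions for illustration, not recommendations from the course.

```python
import random
from collections import defaultdict


def random_split(examples, dev_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into train/dev/test so that all three
    splits are drawn from the same underlying distribution."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_dev = round(len(items) * dev_frac)
    n_test = round(len(items) * test_frac)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test


def stratified_split(examples, label_of, **kwargs):
    """Split each class separately so that class proportions are
    (approximately) preserved in every split."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[label_of(example)].append(example)
    train, dev, test = [], [], []
    for group in by_label.values():
        tr, dv, te = random_split(group, **kwargs)
        train.extend(tr)
        dev.extend(dv)
        test.extend(te)
    return train, dev, test


# Hypothetical data: (id, label) pairs with 25% positives.
data = [(i, i % 4 == 0) for i in range(100)]

train, dev, test = random_split(data)
s_train, s_dev, s_test = stratified_split(data, label_of=lambda ex: ex[1])
print(len(train), len(dev), len(test))  # 60 20 20
```

With a plain random split the share of positives in a small dev set can drift from 25% by chance; the stratified version pins it, which is one of the instances where stratification may be justified.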

### Example: International Application Feature



### Example: Loan Applications

## Proportions Of Train/Dev/Test Sets

## When To Change Dev/Test Sets

## Why Human Level Performance

## Avoidable Bias

## Understanding Human Level Performance

## Surpassing Human Level Performance

## Improving Model Performance

## Carrying Out Error Analysis

## Cleaning Up Incorrectly Labeled Data

## Build First System Quickly - Then Iterate

## Training And Testing On Different Distributions

## Bias And Variance With Mismatched Data

## Addressing Data Mismatch

## Transfer Learning

## Multitask Learning

## What is End-To-End Deep Learning

## Whether To Use End-To-End Deep Learning
