# Supervised learning
Train the learning algorithm with inputs and corresponding correct output labels

#### Regression
Predict a number out of infinitely many numbers

#### Classification
Predict a result out of a small number of possible categories 

# Unsupervised learning
Train the learning algorithm with inputs that have no output labels, so that it finds patterns or structure in the data

#### Clustering
Take input data without labels and group them into clusters

#### Anomaly detection
Detect unusual data points

#### Dimensionality reduction
Compress a large data set into fewer dimensions with only a small loss of information

# Debugging a learning algorithm
Diagnostics: tests that reveal what is and is not working in a learning algorithm

## Evaluate a learning algorithm
1. Split all data into a training set, a validation set, and a test set
2. Determine the weights and bias using the training set with regularization
3. Use the weights and bias to calculate the cost on the training set and on the validation set, then compare the two. For classification problems, the model can also be evaluated by calculating the fraction of examples it has misclassified
4. Use the validation set cost to evaluate the performance of each model for fine-tuning
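
Step 1 can be sketched in a few lines of plain Python. This is only an illustration: the function name `split_dataset`, the 60/20/20 default split, and the fixed seed are assumptions, not part of the notes.

```python
import random

def split_dataset(examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the examples and split them into training, validation, and test sets."""
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("validation and test fractions must sum to less than 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val_set = shuffled[:n_val]
    test_set = shuffled[n_val:n_val + n_test]
    train_set = shuffled[n_val + n_test:]  # everything left over is used for training
    return train_set, val_set, test_set

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```

Shuffling before splitting is a simple way to make the three sets look alike; it does not help if the data itself comes from different sources.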

To find a better prediction model, we can train different models (e.g. different polynomial degrees or neural network structures) and pick the model with the lowest validation set cost after fine-tuning

As the total number of training examples grows, the percentage of data allocated to the validation and test sets can shrink, since they are only used to evaluate the model

Do not use the test set until a final model has been chosen, so that the test error remains a relatively accurate estimate of the model's performance on new data

Note: the validation and test set should come from the same distribution

If the training set and the validation set do not come from the same distribution, a training-dev set, which contains data from the same distribution as the training set but is not used for training, is added for bias and variance evaluation


## Bias and variance
For data with lots of features, it is difficult to visualize whether the model has high bias or high variance. Instead, we can compare the cost on the training set with the cost on the validation set to determine whether the model overfits or underfits.
* Overfitting (high variance): low cost for the training set, but significantly higher cost for the validation set
* Underfitting (high bias): high cost for both the training and the validation sets
* High bias and variance: high cost for the training set and even higher cost for the validation set
* Just right: low cost for the training set and a slightly higher cost for the validation set

As the degree of the fitted polynomial increases, the cost for the training set decreases, while the cost for the validation set first decreases, then increases

## Regularization
When training a model with regularization, if $\lambda$ is too large, the model will underfit (high cost for both the training and the validation sets), and if $\lambda$ is too small, the model will overfit (low cost for the training set but significantly higher cost for the validation set)

As $\lambda$ increases, the cost for the training set increases (towards underfitting), and the cost for the validation set first decreases, then increases
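
The training-set half of this statement can be checked on a toy problem. The sketch below assumes a one-parameter model $y = wx$ with an L2 penalty and made-up data; the names `fit_ridge_slope` and `mse_cost` are illustrative only. With a single parameter there is little room to overfit, so the example mainly shows $w$ shrinking and the training cost rising as $\lambda$ grows, not the full U-shaped validation curve.

```python
def fit_ridge_slope(xs, ys, lam):
    """Closed-form minimizer of (1/2m) * sum((w*x - y)^2) + (lam/2m) * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse_cost(w, xs, ys):
    """Squared-error cost without the regularization term, used for evaluation."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Made-up data, roughly y = 2x plus noise
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
val_x, val_y = [1.5, 2.5, 3.5], [3.0, 5.1, 6.9]

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
train_costs = []
for lam in lambdas:
    w = fit_ridge_slope(train_x, train_y, lam)
    train_costs.append(mse_cost(w, train_x, train_y))
    print(f"lambda={lam:6.1f}  w={w:.3f}  "
          f"train={train_costs[-1]:.4f}  val={mse_cost(w, val_x, val_y):.4f}")
```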

## Setting baseline performance
To judge whether the trained model performs well, we can set a baseline by measuring how well humans or another comparable algorithm perform on the same data set. If the gap between the baseline error and the training error is large, the algorithm may have high bias (underfit); if the gap between the training error and the validation error is large, the algorithm may have high variance (overfit)
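
The two comparisons above can be written as a small helper. This is a sketch only: the function name `diagnose` and the `gap_threshold` cutoff of 0.05 are assumptions, and what counts as a "high" gap depends on the task.

```python
def diagnose(baseline_error, train_error, val_error, gap_threshold=0.05):
    """Return the problems suggested by the gaps between the three error rates.

    gap_threshold is a hypothetical cutoff for calling a gap "high".
    """
    problems = []
    if train_error - baseline_error > gap_threshold:
        problems.append("high bias")      # training error far above the baseline
    if val_error - train_error > gap_threshold:
        problems.append("high variance")  # validation error far above training error
    return problems or ["no obvious problem"]

print(diagnose(0.10, 0.11, 0.20))  # ['high variance']
print(diagnose(0.10, 0.25, 0.27))  # ['high bias']
print(diagnose(0.10, 0.25, 0.40))  # ['high bias', 'high variance']
```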

## Learning curve
In general, as the number of training examples increases, the cost for the training set increases, since it becomes harder for the model to fit every example perfectly, and the cost for the validation set decreases, since the model makes more accurate predictions on new data

* High bias: the training and validation errors plateau as the number of training examples increases, since the model simply cannot fit the data well, so increasing the number of training examples will not help

<img src="https://assets-global.website-files.com/63f902d79a33f7ff016cde0b/63f902d89a33f72d236ce685_plot_bias_variance_trainingsize-1024x550.png" width=500>

* High variance: the training error increases and the validation error decreases as the number of training examples increases, so collecting more training data may help

<img src="https://miro.medium.com/v2/resize:fit:1400/0*wIaoQ-vXhW-Oxbf9.png" width = 500>

## Addressing high bias and variance
### High bias
* Use additional features
* Use polynomial features
* Decrease $\lambda$

### High variance
* Get more training data
* Use a smaller set of features
* Increase $\lambda$

## Neural network

### High bias
* Use a larger neural network (can be computationally expensive)
* Train the network with more iterations of gradient descent
* Use a better neural network architecture

A large neural network almost always fits the training set well and can usually do as well as or better than a smaller network, provided the regularization is chosen appropriately

### High variance
* Get more training data and retrain the model

## Error analysis
Manually examine the examples that the model predicted incorrectly and group them into categories based on common traits. Then devise a fix for each category (e.g. getting more data for a specific category). In general, we start by fixing the categories that cause the most errors

## Adding data
Collecting data can be difficult and time-consuming, so instead of adding more data of every kind, we add data for the categories that error analysis flagged, which reduces the effort needed to improve the model

## Data mismatch
When the training set and the validation set do not come from the same distribution, we evaluate the model with the help of a training-dev set

* If the difference between the baseline performance and training set error is large, the model has high bias
* If the difference between the training set error and training-dev set error is large, the model has high variance
* If the difference between the training-dev set error and validation set error is large, the model has a data mismatch problem, which means it performs well on data similar to the training set but not as well on data from a different distribution
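
The three comparisons above amount to splitting the validation error into consecutive gaps. A minimal sketch, assuming made-up error rates and the hypothetical helper names `error_gaps` and `largest_problem`:

```python
def error_gaps(baseline, train, train_dev, validation):
    """Break the path from baseline error to validation error into three gaps."""
    return {
        "bias": train - baseline,                  # baseline vs. training set
        "variance": train_dev - train,             # training set vs. training-dev set
        "data mismatch": validation - train_dev,   # training-dev set vs. validation set
    }

def largest_problem(gaps):
    """Name of the gap that contributes most to the validation error."""
    return max(gaps, key=gaps.get)

# Made-up error rates for illustration
gaps = error_gaps(baseline=0.01, train=0.02, train_dev=0.03, validation=0.12)
print(largest_problem(gaps))  # data mismatch
```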

## Addressing data mismatch
To address data mismatch, we can first perform error analysis to understand how the training set differs from the validation set, then make the training set more similar to the validation set, either by collecting more training data that resembles the validation set or through artificial data synthesis. When performing data synthesis, be cautious about overfitting to a small subset of all possible data

* Data augmentation (mostly used for images and audio): modifying an existing training example to create new training examples (e.g. rotating or mirroring images, or adding noise to audio)
* Data synthesis (mostly used for computer vision): creating brand new data
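
A minimal sketch of data augmentation in plain Python, assuming an image stored as a list of pixel rows and an audio clip stored as a list of samples. The function names and the noise scale are illustrative; real pipelines would normally use an image or audio library.

```python
import random

def mirror_horizontal(image):
    """Flip each row left-to-right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]

def add_noise(signal, scale=0.01, seed=0):
    """Add Gaussian noise to every sample (scale is an arbitrary choice)."""
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, scale) for s in signal]

image = [[1, 2, 3],
         [4, 5, 6]]
print(mirror_horizontal(image))    # [[3, 2, 1], [6, 5, 4]]
print(rotate_90_clockwise(image))  # [[4, 1], [5, 2], [6, 3]]
noisy_clip = add_noise([0.0, 0.5, -0.5, 0.25])
```

Each transformed copy keeps the label of the original example, which is what makes augmentation cheap compared with collecting new data.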

## Transfer learning
Transfer learning is used when the data set is not large enough to train a large neural network. Instead, we take a pretrained model that performs a similar task and fine-tune the parameters of the output layer, or of the entire network, to obtain a better model. This works because during pretraining the network learns features of inputs of the same type, and those features carry over to the fine-tuning process.

Steps:
1. Supervised pretraining: train a network's parameters, or obtain the parameters of an existing model, on a large data set with the same input type as the actual data
2. Fine-tuning: further train the network with the actual data

## Multi-task learning
Multi-task learning trains one model to predict multiple labels for each training example at once, i.e. to perform multiple tasks at the same time. It is used when the tasks share lower-level features

## End to end deep learning
End-to-end deep learning refers to training a single neural network for a complex task, taking the raw data directly as input without any manual feature extraction. Alternatively, we can accomplish the same task by building multiple neural networks, each completing a smaller sub-task in sequence

* End-to-end learning requires a lot more data than a sequential sub-task model and also limits the use of hand-designed components


## Skewed dataset
The ratio of positive to negative cases is significantly skewed

## Accuracy
In classification problems, accuracy measures the proportion of correctly classified data points:  

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$  

While accuracy is an important metric, it can be misleading, especially for imbalanced datasets. For example, in a dataset with 1 positive sample and 99 negative samples, a model that always predicts “negative” will achieve 99% accuracy but completely fail to detect the positive class. This is why precision and recall are needed to provide deeper insight into model performance.
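
The 1-versus-99 example can be reproduced directly. The helper name `accuracy` is just for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1] + [0] * 99   # 1 positive sample, 99 negative samples
y_pred = [0] * 100        # a model that always predicts "negative"

print(accuracy(y_true, y_pred))  # 0.99
positives_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
print(positives_found)           # 0
```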

## Precision and Recall  
In a binary classification problem, predictions fall into one of four categories:  

1. True Positive (TP): Actual class = 1, predicted class = 1  
2. False Positive (FP): Actual class = 0, predicted class = 1  
3. True Negative (TN): Actual class = 0, predicted class = 0  
4. False Negative (FN): Actual class = 1, predicted class = 0

These 4 categories are usually displayed in a confusion matrix.

<img src="https://miro.medium.com/v2/resize:fit:1400/1*ZPTFqlhFPUvLg8P6q62ttg.png" width=500>

We can use these 4 categories to calculate the precision and recall

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$  

* Precision: Measures the accuracy of the model’s positive predictions, which is the proportion of predicted positives that are actually positive. High precision is important when the cost of misclassifying a negative case as positive is high. In other words, we aim for high precision when we only want to assign a positive label if the model is very confident, even if that means missing some actual positives.

* Recall: Measures the model’s ability to detect all actual positive cases, which is the proportion of actual positives that are correctly identified by the model. High recall is important when the cost of missing a positive case is high. In other words, we aim for high recall when we want to avoid missing any positive cases, even if it means incorrectly labeling some negatives as positive.
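
A small sketch that counts the four categories and applies the two formulas. The labels below are made up, and returning 0.0 when a denominator is zero is an assumed convention, not something the formulas define.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN); 0.0 if undefined (assumption)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)                # 3 1 3 1
print(precision_recall(tp, fp, fn))  # (0.75, 0.75)
```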

### Precision–Recall Trade-off  
Neither precision nor recall should be used alone, because each score can be gamed. If we only want high precision, we can set the threshold for assigning a positive label very high, so the model only predicts positive when it is very confident, which yields a high precision score. However, this is problematic because many reasonably clear positive examples will then be classified as negative.

If we only want high recall, we can set the threshold for assigning a positive label very low, so the model predicts positive even when it is not confident and therefore rarely misses a positive example, which yields a high recall score. However, many negative examples will then be misclassified as positive.

When we want to achieve high precision and recall at the same time, the tradeoff occurs because increasing the threshold for classification will result in fewer false positives (increasing precision) but also more false negatives (decreasing recall), and vice versa. The balance between the two is often determined based on the problem requirements or by optimizing for a combined metric such as the F1 score.
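
The effect of moving the threshold can be seen on a handful of made-up model scores. In this sketch the function name `precision_recall_at` is hypothetical, and precision is reported as 1.0 when nothing is predicted positive, which is an assumed convention.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting positive for scores >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up predicted probabilities and true labels
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    0,    0]

for threshold in [0.1, 0.5, 0.9]:
    print(threshold, precision_recall_at(threshold, scores, labels))
# 0.1 (0.5, 1.0)
# 0.5 (0.75, 0.75)
# 0.9 (1.0, 0.25)
```

Raising the threshold from 0.1 to 0.9 moves precision from 0.5 to 1.0 while recall falls from 1.0 to 0.25.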

<img src="https://jamesmccaffrey.wordpress.com/wp-content/uploads/2014/11/precisionrecallgraph.jpg" width=500>


## F1 Score  
The F1 score is a performance metric that evaluates a classification model by combining precision and recall into a single number. It lies between 0 and 1; because it is the harmonic mean of precision and recall, it is high only when both precision and recall are high

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
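
A direct translation of the formula, with a few worked values. Returning 0.0 when both inputs are zero is an assumed convention to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """F1 = 2 * P * R / (P + R); defined here as 0.0 when P + R is zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.5, 0.5))            # 0.5
print(f1_score(1.0, 0.0))            # 0.0
print(round(f1_score(0.9, 0.1), 2))  # 0.18
```

The last case shows why F1 is harder to game than a plain average: precision 0.9 and recall 0.1 average to 0.5, but the F1 score is only 0.18.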

## Multiclass Classification
In multiclass classification, we cannot directly compute a single precision, recall, or F1 score for the entire model. Instead, we calculate these metrics per class by treating one class as the positive class and all other classes as negative. This effectively transforms the multiclass problem into a binary classification problem for that specific class, allowing us to compute precision, recall, and F1 as usual.

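To obtain an overall evaluation of the model, we can compute the scores for each class and combine them into a global metric: the macro-average (unweighted mean of the per-class F1 scores), the weighted average (mean of the per-class F1 scores weighted by the number of examples in each class), or the micro-average (F1 computed from the TP, FP, and FN counts pooled over all classes).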
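
The one-vs-rest counting and the three averages can be sketched as follows. The class names and predictions are made up, and the helper names are illustrative. The code uses $F_1 = \frac{2TP}{2TP + FP + FN}$, which is algebraically the same as the precision/recall form.

```python
def per_class_counts(y_true, y_pred, classes):
    """One-vs-rest (TP, FP, FN) for each class."""
    counts = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        counts[c] = (tp, fp, fn)
    return counts

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred, classes):
    counts = per_class_counts(y_true, y_pred, classes)
    f1s = {c: f1_from_counts(*counts[c]) for c in classes}
    support = {c: sum(t == c for t in y_true) for c in classes}
    macro = sum(f1s.values()) / len(classes)
    weighted = sum(f1s[c] * support[c] for c in classes) / len(y_true)
    # Micro-average: pool the counts over all classes, then compute one F1
    pooled = [sum(counts[c][i] for c in classes) for i in range(3)]
    micro = f1_from_counts(*pooled)
    return {"macro": macro, "weighted": weighted, "micro": micro}

y_true = ["cat", "cat", "cat", "dog", "dog", "bird"]
y_pred = ["cat", "cat", "dog", "dog", "dog", "cat"]
result = averaged_f1(y_true, y_pred, ["cat", "dog", "bird"])
print({k: round(v, 3) for k, v in result.items()})
# {'macro': 0.489, 'weighted': 0.6, 'micro': 0.667}
```

The macro-average is pulled down by the rare "bird" class, which the model never predicts, while the weighted and micro averages are dominated by the larger classes.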

<img src="https://www.researchgate.net/publication/361092131/figure/fig1/AS:1166908405612546@1655224052772/S-1-multiclass-confusion-matrix-TP-true-positive-FP-false-positive-FN-false-negative.png" width=200>

# ML development cycle 
1. Scope the project: define the problem that the project will solve
2. Collect data
3. Development cycle:
    1. Choose architecture (model, data, etc)
    2. Train model
    3. Run diagnostics (bias, variance, error analysis, etc)
    4. Iterate from step 1 of the development cycle if the model does not work well enough
    5. Collect more data if needed
4. Deploy the model (implement the model on an inference server), then monitor and maintain the system
5. Improve the model with data collected during application


# Orthogonalization
Orthogonalization is a system design property that ensures that modification of an instruction or an algorithm component does not create or propagate side effects to other system components. Orthogonalization makes it easier to independently verify the algorithms, thus reducing the time required for testing and development

# Single number evaluation
A single-number metric assesses the performance of a model and allows faster evaluation and comparison among algorithms

When several metrics matter, we can divide them into satisficing and optimizing metrics to help us evaluate the models. In general, we have one optimizing metric and multiple satisficing metrics

Satisficing metric: a metric that only has to meet a threshold for an algorithm to be considered (e.g. runtime no longer than a fixed limit)

Optimizing metric: a metric that we want the algorithm to optimize as much as possible (e.g. accuracy of predictions)
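
In code, this selection rule is a filter followed by a maximum. The sketch below assumes accuracy as the optimizing metric and a runtime limit as the threshold metric; the candidate models, their numbers, and the name `pick_model` are made up for illustration.

```python
def pick_model(candidates, max_runtime_ms):
    """Keep models that meet the runtime threshold, then maximize accuracy."""
    eligible = [m for m in candidates if m["runtime_ms"] <= max_runtime_ms]
    if not eligible:
        return None  # no candidate meets the threshold
    return max(eligible, key=lambda m: m["accuracy"])

# Hypothetical candidates
candidates = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]
print(pick_model(candidates, max_runtime_ms=100)["name"])  # B
```

Model C has the best accuracy but is excluded because it misses the runtime threshold, so B is chosen.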