# 1. Types of ML Systems

<span style="font-size:14pt; color:black; font-family:Garamond">

- Supervised vs unsupervised learning
- Batch or online:
    - Whenever the data and the model fit in main memory, we can apply batch learning.
- Instance-based vs. model-based

</span>

# 2. Main Challenges of Machine Learning

<span style="font-size:14pt; color:black; font-family:Garamond">

Since the main task is to select a learning algorithm and train it on some data, the two things that can go wrong are **"bad algorithms"** and **"bad data"**. Let's start with examples of bad data.
</span>

## 2.1. Bad Data

### 2.1.1 Insufficient Quantity of Training Data

<span style="font-size:14pt; color:black; font-family:Garamond">

Machine learning requires a lot of data for most algorithms to work properly. Even for very simple algorithms you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples (unless you can reuse parts of an existing model).

**The Unreasonable Effectiveness of Data**: It was shown that very different machine learning algorithms, including fairly simple ones, performed almost identically well on a complex problem of natural language disambiguation once they were given enough data. This suggests that data matters more than algorithms for complex problems. However, small and medium-sized datasets are still very common, and it is not always easy or cheap to get extra training data, so we can't abandon algorithms just yet.
</span>

### 2.1.2 Nonrepresentative Training Data

<span style="font-size:14pt; color:black; font-family:Garamond">

In order to generalize well, it is critical that your training data be representative of new cases you want to generalize to. This is true whether you use instance-based or model-based learning.
    
If you train on a nonrepresentative training set, the resulting model is unlikely to make accurate predictions on new data. Hence, it is crucial to use a training set that is representative of the cases you want to generalize to. This is often harder than it sounds: <br>
- ***Sampling noise***: This is the case if the sample is too small; i.e., nonrepresentative data as a result of chance.
- ***Sampling bias***: A very large dataset can be nonrepresentative if the sampling method is flawed.
</span>
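
The difference between sampling noise and sampling bias can be illustrated with a small simulation. This is a minimal sketch using only Python's standard library; the population, the 30% share of positives, and the "positives are twice as easy to reach" sampling flaw are all made-up assumptions for illustration.

```python
import random

def share_of_positives(sample):
    """Fraction of instances in the sample whose label is 1."""
    return sum(sample) / len(sample)

rng = random.Random(0)

# Hypothetical population: 30% of 100,000 people have label 1.
population = [1] * 30_000 + [0] * 70_000
true_share = share_of_positives(population)

# Sampling noise: small random samples give estimates that jump around.
small_estimates = [share_of_positives(rng.sample(population, 20)) for _ in range(5)]

# A large random sample gives an estimate close to the true share.
large_estimate = share_of_positives(rng.sample(population, 10_000))

# Sampling bias: a flawed method that reaches positives twice as easily.
# The sample is large, yet the estimate is systematically off.
weights = [2 if label == 1 else 1 for label in population]
biased_sample = rng.choices(population, weights=weights, k=10_000)
biased_estimate = share_of_positives(biased_sample)

print(f"true share:              {true_share:.3f}")
print("small random samples:   ", small_estimates)
print(f"large random sample:     {large_estimate:.3f}")
print(f"large but biased sample: {biased_estimate:.3f}")
```

Collecting more data shrinks the noise in the random samples, but it does nothing for the biased sample: only fixing the sampling method helps there.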

### 2.1.3 Poor Quality Data

<span style="font-size:14pt; color:black; font-family:Garamond">

Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that. The following are examples of when you would want to clean up training data:
    
- If some instances are clearly **outliers**, it may help to simply discard them or try to fix the errors manually.
- If some instances are **missing a few features** (e.g., 5% of your customers did not specify their age), you must decide whether you want to:
    - Ignore this attribute altogether
    - Ignore only these instances
    - Fill in the missing values (e.g., with the median age). 
    - Or train one model with the feature and one model without it.
</span>
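
The first three options for missing values can be sketched in a few lines. The customer records below are hypothetical, and plain Python lists of dictionaries stand in for whatever data structure you actually use.

```python
from statistics import median

# Hypothetical customer records; None marks a missing age.
customers = [
    {"age": 25, "spend": 120.0},
    {"age": None, "spend": 80.0},
    {"age": 40, "spend": 200.0},
    {"age": 31, "spend": 150.0},
]

# Option 1: ignore the attribute altogether.
without_age = [{k: v for k, v in c.items() if k != "age"} for c in customers]

# Option 2: ignore only the instances where the value is missing.
complete_only = [c for c in customers if c["age"] is not None]

# Option 3: fill in the missing values with the median of the known ages.
median_age = median(c["age"] for c in customers if c["age"] is not None)
imputed = [
    {**c, "age": median_age if c["age"] is None else c["age"]}
    for c in customers
]

print(median_age)
print(imputed[1])
```

If you go with option 3, compute the median on the training set only and reuse that same value for new instances, so that no information leaks from the data you evaluate on.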

### 2.1.4 Irrelevant Features

<span style="font-size:14pt; color:black; font-family:Garamond">

As the saying goes: garbage in, garbage out. Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a machine learning project is coming up with a good set of features to train on. This process is called ***feature engineering***. It involves the following steps:

1- ***Feature selection***: Selecting the most useful features to train on among existing features.<br>
2- ***Feature extraction***: Combining existing features to produce a more useful one (*dimensionality reduction can help here*).<br>
3- ***Creating new features***: Creating new features by gathering new data.
    
</span>

### 2.1.5 How Adding more Features could Enhance the Model's Performance

<span style="font-size:14pt; color:black; font-family:Garamond">
Imagine you have a set of points on a 2D plane (x, y coordinates), representing two classes of data (say, red and blue). If these points are not linearly separable on this plane, it is difficult to find a straight line that can separate the red points from the blue points.

Now, consider that you have access to a third dimension (z coordinate) for each point. By plotting the data points in a 3D space (x, y, z coordinates), you may be able to find a plane that separates the red points from the blue points more effectively than in the 2D space. This is because the additional dimension provides more flexibility in finding a decision boundary that can separate the data points belonging to different classes.

Adding more dimensions (attributes) can enhance the performance of a machine learning model by providing it with more information to learn from, allowing it to better capture the underlying patterns and relationships in the data. In turn, this can lead to improved performance in terms of accuracy, precision, recall, or other relevant metrics.

However, adding more dimensions can also have potential downsides, such as overfitting and increased computational cost. Therefore, it is crucial to carefully select the most relevant and informative features for your model and apply feature engineering techniques to make the most of the additional information while minimizing the risks associated with increased complexity.
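
As a rough illustration of the 2D versus 3D argument, the sketch below builds two classes that no straight line can separate and then adds a third coordinate. The ring-shaped data and the choice of $z = x^2 + y^2$ are assumptions made for this example; here the extra dimension is computed from the existing ones rather than measured.

```python
import math

def ring(radius, n):
    """n points evenly spaced on a circle of the given radius."""
    return [
        (radius * math.cos(2 * math.pi * k / n), radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]

# Two classes that no straight line can separate in the (x, y) plane:
# blue points on an inner circle, red points on a surrounding outer circle.
blue = ring(1.0, 12)
red = ring(3.0, 12)

def lift(point):
    """Add a third coordinate z = x^2 + y^2 (squared distance to the origin)."""
    x, y = point
    return (x, y, x * x + y * y)

blue_3d = [lift(p) for p in blue]
red_3d = [lift(p) for p in red]

# In 3D the horizontal plane z = 5 separates the classes:
# every blue point has z close to 1, every red point has z close to 9.
threshold = 5.0
blue_below = all(z < threshold for _, _, z in blue_3d)
red_above = all(z > threshold for _, _, z in red_3d)
print(blue_below, red_above)  # prints: True True
```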

### Example:

Imagine you're trying to predict the price of a house based on its features. You start with a simple model using only two features: square footage and the number of bedrooms. In this case, you can visualize the relationship between these features and the house price in a 3D space (2D for features, 1D for the target variable - price).

Now, consider that you want to improve the model's performance by adding more features, such as the number of bathrooms, the age of the house, proximity to public transportation, the quality of local schools, and so on. As you add more features, the number of dimensions in your feature space increases.

With each additional dimension, the volume of the feature space grows exponentially. For instance, if you have a dataset with 1,000 data points and two features, these points might be relatively evenly distributed across the 2D feature space. However, when you increase the number of features to, say, 10, the same 1,000 data points are now spread across a much larger, 10-dimensional space.

As a consequence, the data becomes sparser in this higher-dimensional space, meaning that the average distance between data points increases, and the model has fewer examples to learn from in each region of the space. This can lead to overfitting, as the model is more likely to memorize the training data instead of learning the underlying patterns that would help it generalize to new, unseen data.
</span>
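
The sparsity effect described above can be checked numerically. The sketch below assumes points drawn uniformly at random from the unit hypercube, which is a simplification of any real feature space, and measures the average distance between them as the number of dimensions grows.

```python
import math
import random

def mean_pairwise_distance(n_points, n_dims, rng):
    """Average Euclidean distance between random points in the unit hypercube."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total, pairs = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs

rng = random.Random(42)
distances = {d: mean_pairwise_distance(200, d, rng) for d in (2, 10, 100)}
for d, dist in distances.items():
    print(f"{d:>3} dimensions: mean distance = {dist:.2f}")
```

For uniform points, the root-mean-square distance is exactly $\sqrt{d/6}$ in $d$ dimensions, so the same number of points ends up further and further apart as features are added.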

## 2.2. Bad Algorithms

### 2.2.1 Overfitting the Training Data

<span style="font-size:14pt; color:black; font-family:Garamond">

***Overfitting*** occurs when the model performs well on the training data but does not generalize well. E.g., a high-degree polynomial may overfit data that looks simply linear. Complex models such as deep neural networks can detect subtle patterns in the data, but if the training data is noisy, or if it is too small (which introduces ***sampling noise***), then the model is likely to detect patterns in the noise itself. Obviously, these patterns will not generalize to new instances.

For instance, suppose we feed a model the countries' names as an attribute. A complex model may then detect patterns like the fact that all countries in the training data with a $w$ in their name have a life satisfaction greater than 7: New Zealand (7.3), Norway (7.4), Sweden (7.2), and Switzerland (7.5). However, how confident are you that the $w$-satisfaction rule generalizes to Rwanda or Zimbabwe?

Some patterns occur in the training data by pure chance, but the model has no way to tell whether a pattern is real or simply the result of noise in the data.

**Definition:** Overfitting happens when the model is too complex relative to the amount and noisiness of the training data.
</span>
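
The high-degree polynomial case can be reproduced with a tiny experiment. This is a sketch under stated assumptions: the true relationship is taken to be $y = 2x + 1$, and a fixed $\pm 0.5$ perturbation stands in for noise so that the result is deterministic. The degree-7 polynomial is built by Lagrange interpolation, so it passes through all eight training points.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = theta0 + theta1 * x."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = y_mean - theta1 * x_mean
    return lambda x: theta0 + theta1 * x

def fit_interpolating_polynomial(xs, ys):
    """Degree n-1 polynomial through all n points (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared error of the model's predictions."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def true_f(x):
    """Assumed underlying relationship."""
    return 2 * x + 1

# Training set: the true line plus a fixed +/-0.5 perturbation standing in for noise.
train_x = [float(i) for i in range(8)]
train_y = [true_f(x) + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(train_x)]

# New instances: points between the training inputs, without perturbation.
test_x = [x + 0.5 for x in train_x[:-1]]
test_y = [true_f(x) for x in test_x]

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

for name, model in (("linear", line), ("degree-7 polynomial", poly)):
    print(f"{name:>20}: train MSE = {mse(model, train_x, train_y):.3f}, "
          f"test MSE = {mse(model, test_x, test_y):.3f}")
```

The polynomial fits the training set perfectly because it has memorized the perturbations, and that is exactly why it does worse than the straight line on the new points.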

#### Possible Solutions:

<span style="font-size:14pt; color:black; font-family:Garamond">

- **Simplify the model** by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model) or by constraining it. Constraining a model to make it simpler and reduce the risk of overfitting is called *regularization*. For example, a linear model has two parameters: the intercept $\theta_0$ and the slope $\theta_1$. This gives the learning algorithm two degrees of freedom to adapt the model to the training data: it can tweak both the height $(\theta_0)$ and the slope $(\theta_1)$ of the line. If we force $\theta_1 = 0$, the algorithm has only one degree of freedom and has a much harder time fitting the data properly: all it can do is move the line up and down to get as close as possible to the training instances, so it ends up around the mean. A very simple model indeed! If we allow the algorithm to modify $\theta_1$ but force it to keep it small, then the learning algorithm effectively has somewhere between one and two degrees of freedom. It will produce a model that is simpler than the one with two degrees of freedom, but more complex than the one with one degree of freedom. <br>**You want to find the right balance between fitting the training data perfectly and keeping the model simple enough to ensure that it will generalize well.** <br>The amount of regularization to apply during training is controlled by a *hyperparameter*. A hyperparameter is a parameter of the learning algorithm, not of the model. For example, it may appear as the weight of a penalty term added to the loss function. As such, it is not affected by the learning algorithm itself: it must be set prior to training and remains constant during training.
- **Reduce the number of attributes** in the training data. E.g., one-hot encoding a categorical attribute with many categories adds many sparse columns, which can degrade a model's performance. This is one face of the curse of dimensionality (more on this later).
- **Gather more data**.
- **Reduce the noise** in the training data (e.g., fix data errors and remove outliers).
</span>
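
To make the "between one and two degrees of freedom" idea concrete, here is a minimal sketch of a line fit with a penalty on the slope. The penalty form $\alpha\,\theta_1^2$ (an L2 penalty on $\theta_1$ only) and the data are assumptions chosen for illustration; they are one possible way to keep $\theta_1$ small, not the only one.

```python
def fit_regularized_line(xs, ys, alpha):
    """Minimize sum((y - theta0 - theta1*x)^2) + alpha * theta1^2.

    alpha is the regularization hyperparameter: it is fixed before training.
    Only the slope theta1 is penalized, not the intercept theta0.
    """
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = sxy / (sxx + alpha)
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

# Hypothetical training data, roughly y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

for alpha in (0.0, 10.0, 1e9):
    theta0, theta1 = fit_regularized_line(xs, ys, alpha)
    print(f"alpha = {alpha:>12}: theta0 = {theta0:.2f}, theta1 = {theta1:.2f}")
```

With `alpha = 0` the fit is ordinary least squares (two degrees of freedom). As `alpha` grows, the slope is squeezed toward zero, and with a huge `alpha` the model collapses to a horizontal line at the mean of the targets (one degree of freedom).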

#### Examples

<span style="font-size:14pt; color:black; font-family:Garamond">
Gathering more data can help reduce overfitting because it provides a more diverse and representative sample of the true underlying patterns in the data. A larger dataset can make it harder for the model to memorize specific instances or learn irrelevant patterns present in a smaller dataset.

Let's consider a real-world example with an e-commerce website where you want to build a model to predict whether a customer will make a purchase based on their browsing behavior.

Suppose you have a small dataset of 100 customers. In this small dataset, you might find that 90% of customers who visit a specific product page (e.g., Product A) make a purchase. Your model could learn that visiting Product A's page is a strong indicator of making a purchase, resulting in a decision boundary that overemphasizes this feature.

Now imagine you collect more data, expanding your dataset to 10,000 customers. With this larger dataset, you find that only 10% of customers who visit Product A's page actually make a purchase. The new data helps the model understand that visiting Product A's page is not as strong an indicator of making a purchase as it initially seemed. Consequently, the model learns a more nuanced decision boundary that is less likely to overfit.

In this example, the larger dataset provided a more accurate representation of the true relationship between the features (browsing behavior) and the target variable (making a purchase). The model trained on more data is less likely to overfit because it learned a more generalized pattern instead of relying on specific instances or features that were overemphasized in the smaller dataset.

However, it is important to note that collecting more data alone might not always guarantee better generalization. The quality of the data, the presence of noise, and the relevance of the features also play crucial roles in determining the model's performance.
</span>

### 2.2.2 Underfitting the Training Data

<span style="font-size:14pt; color:black; font-family:Garamond">

This is the opposite of overfitting. It occurs when your model is too simple to learn the underlying structure of the data.
</span>

#### Possible Solutions

<span style="font-size:14pt; color:black; font-family:Garamond">

- Select a more powerful model, with **more parameters**.
- Feed **better features** to the learning algorithm (feature engineering).
- **Reduce the constraints** on the model (e.g., reduce the regularization hyperparameter).
</span>

# 3. Testing and Validating

<span style="font-size:14pt; color:black; font-family:Garamond">

A good idea is to split your data into two sets: the *training set* and the *test set*. As these names imply, you train your model using the training set, and you test it using the test set. The error rate on new cases is called the *generalization error* (or *out-of-sample* error), and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before.

If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means that the model is overfitting the training set.

**Note:** It is common to use 80% of the data for training and hold out 20% for testing. However, this depends on the size of the dataset. If it contains 10 million instances, then holding out 1% means your test set will contain 100,000 instances, probably more than enough to get a good estimate of the generalization error.
</span>
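
A train/test split and the resulting error estimate can be sketched as follows. The helper names `train_test_split` and `error_rate`, the toy dataset, and the stand-in `predict` function are all hypothetical; a real model would be fitted on `train_set` only.

```python
import random

def train_test_split(instances, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and hold out a test_ratio share of it."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(predict, labelled_instances):
    """Share of (x, y) pairs for which predict(x) differs from y."""
    mistakes = sum(1 for x, y in labelled_instances if predict(x) != y)
    return mistakes / len(labelled_instances)

# Hypothetical dataset: integers labelled by whether they are >= 50.
data = [(x, x >= 50) for x in range(100)]
train_set, test_set = train_test_split(data)

# Stand-in for a trained model; a real one would be fitted on train_set only.
def predict(x):
    return x >= 45

print(len(train_set), len(test_set))
print("estimated generalization error:", error_rate(predict, test_set))
```

Fixing the seed keeps the split identical from run to run, so the same instances stay in the test set every time you retrain.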

# 4. Hyperparameter Tuning and Model Selection

<span style="font-size:14pt; color:black; font-family:Garamond">

Evaluating a model is simple enough: just use a test set. But what if you are hesitating between two models (say, a linear model and a polynomial model)? How do you decide between them? One option is to train both and compare how well they generalize using the test set.

Now suppose the linear model generalizes better, but you want to apply some regularization to avoid overfitting. The question is, how do you choose the value of the regularization hyperparameter? One option is to train 100 different models using 100 different values of this hyperparameter. Suppose you find the hyperparameter value that produces the model with the lowest generalization error, say 5% error. You put this model into production, but it then produces an error of 15% on unseen data. What just happened?

The problem is that you measured the generalization error multiple times on the test set, and you adapted the model and hyperparameters to produce the best model for that particular set. This means that the model is unlikely to perform as well on new data.
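
The effect is easy to reproduce in an extreme setting. In the sketch below, which is a made-up experiment rather than a real tuning run, the labels are pure coin flips, so no candidate can truly do better than 50% accuracy. Each of the 100 hypothetical candidates simply predicts one random binary feature, and we keep the one that scores best on a small test set.

```python
import random

rng = random.Random(0)
N_CANDIDATES = 100

def make_dataset(n_instances):
    """Random binary features and labels that are independent of them."""
    xs = [[rng.randint(0, 1) for _ in range(N_CANDIDATES)] for _ in range(n_instances)]
    ys = [rng.randint(0, 1) for _ in range(n_instances)]
    return xs, ys

def accuracy(k, xs, ys):
    """Accuracy of candidate k, which simply predicts feature k."""
    return sum(1 for x, y in zip(xs, ys) if x[k] == y) / len(ys)

test_x, test_y = make_dataset(50)
fresh_x, fresh_y = make_dataset(2_000)

# Pick the candidate that looks best on the test set ...
best = max(range(N_CANDIDATES), key=lambda k: accuracy(k, test_x, test_y))
test_accuracy = accuracy(best, test_x, test_y)
# ... then measure the same candidate on data it was not selected on.
fresh_accuracy = accuracy(best, fresh_x, fresh_y)

print(f"accuracy on the reused test set: {test_accuracy:.2f}")
print(f"accuracy on fresh data:          {fresh_accuracy:.2f}")
```

The winner's test-set score is optimistic purely because it was chosen for that score; on fresh data it falls back to chance level.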

A common solution to this problem is called *holdout validation*: you simply hold out part of the training set to evaluate several candidate models and select the best one. The new held-out set is called the *validation set* (or sometimes the *development set*, or *dev set*). More specifically, you train multiple models with various hyperparameters on the reduced training set (i.e., the full training set minus the validation set), and then select the model that performs best on the validation set. After this holdout validation process, you train the best model again on the full training set (including the validation set), and this gives you the final model. Lastly, you evaluate this final model on the test set to get an estimate of the generalization error.

This solution usually works quite well. However, if the validation set is too small, then model evaluations will be imprecise: you may end up selecting a suboptimal model by mistake. Conversely, if the validation set is too large, then the remaining training set will be much smaller than the full training set. Why is this bad? Well, since the final model will be trained on the full training set, it is not ideal to compare candidate models trained on a much smaller training set. It would be like selecting the fastest sprinter to participate in a marathon. One way to solve this problem is to perform repeated *cross-validation*, using many small validation sets. Each model is evaluated once per validation set, after it is trained on the rest of the data. By averaging out all the evaluations of a model, we get a much more accurate measure of its performance. However, there is a drawback: the training time is multiplied by the number of validation sets.
</span>
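
The whole procedure (select with cross-validation, retrain on the full training set, evaluate once on the test set) might look like the sketch below. The model is a line fit with an assumed L2 penalty `alpha` on the slope, the data is synthetic, and the helper names are hypothetical.

```python
import random

def fit(xs, ys, alpha):
    """Line fit with an L2 penalty alpha on the slope (alpha is the hyperparameter)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = sxy / (sxx + alpha)
    return y_mean - theta1 * x_mean, theta1

def mse(params, xs, ys):
    """Mean squared error of the line described by params = (theta0, theta1)."""
    theta0, theta1 = params
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_indices(n, k):
    """Split range(n) into k validation folds; each index lands in exactly one fold."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_val_mse(xs, ys, alpha, k=5):
    """Average validation MSE over k folds, training on the other k-1 folds each time."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        tr_x = [x for i, x in enumerate(xs) if i not in held_out]
        tr_y = [y for i, y in enumerate(ys) if i not in held_out]
        va_x = [xs[i] for i in fold]
        va_y = [ys[i] for i in fold]
        scores.append(mse(fit(tr_x, tr_y, alpha), va_x, va_y))
    return sum(scores) / k

# Hypothetical noisy linear data, generated in random order and split 80/20.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
train_x, train_y, test_x, test_y = xs[:80], ys[:80], xs[80:], ys[80:]

# Model selection uses only the training set (via cross-validation).
candidates = [0.0, 1.0, 10.0, 100.0, 1000.0]
best_alpha = min(candidates, key=lambda a: cross_val_mse(train_x, train_y, a))

# Retrain on the full training set, then touch the test set exactly once.
final_model = fit(train_x, train_y, best_alpha)
print("selected alpha:", best_alpha)
print(f"estimated generalization error (MSE): {mse(final_model, test_x, test_y):.3f}")
```

Each candidate is trained `k` times here, which is the extra cost mentioned above: training time is multiplied by the number of validation sets.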

# 5. Data Mismatch

<span style="font-size:14pt; color:black; font-family:Garamond">

In some cases, it is easy to get a large amount of data for training, but it is not perfectly representative of the data that will be used in production. For example, suppose you want to create a mobile app to take pictures of flowers and automatically determine their species. You can easily download millions of pictures of flowers on the web, but they won’t be perfectly representative of the pictures that will actually be taken using the app on a mobile device. Perhaps you only have 10,000 representative pictures (i.e., actually taken with the app).

In this case, the most important rule to remember is that the validation set and the test set must be as representative as possible of the data you expect to use in production, so they should be composed exclusively of representative pictures: you can shuffle them and put half in the validation set and half in the test set (making sure that no duplicates or near-duplicates end up in both sets).

After training your model on the web pictures, if you observe that the performance of your model on the validation set is disappointing, you will not know whether this is because your model has overfit the training set, or whether this is just due to the mismatch between the web pictures and the mobile app pictures. One solution is to hold out part of the training pictures (from the web) in yet another set that Andrew Ng calls the *train-dev set*. After the model is trained (on the training set, not on the train-dev set), you can evaluate it on the train-dev set: if it performs well, then the model is not overfitting the training set, so if it performs poorly on the validation set, the problem must come from the data mismatch. You can try to tackle this problem by preprocessing the web images to make them look more like the pictures that will be taken by the mobile app, and then retraining the model.

Conversely, if the model performs poorly on the train-dev set, then the model must have overfit the training set, so you should try to simplify or regularize the model, get more training data, and clean up the training data, as discussed earlier.

</span>
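
The train-dev reasoning boils down to comparing three error rates. The function below is only a rule-of-thumb sketch: the name `diagnose`, the 0.05 `gap` threshold, and the idea of judging "performs well" by comparison with the training error are assumptions made for illustration.

```python
def diagnose(train_error, train_dev_error, validation_error, gap=0.05):
    """Rule-of-thumb diagnosis from three error rates.

    gap is a hypothetical threshold for calling a jump in error "large";
    pick it to suit your task.
    """
    if train_dev_error - train_error > gap:
        # Poor on held-out web pictures: the model does not even generalize
        # within the training distribution.
        return "overfitting the training set"
    if validation_error - train_dev_error > gap:
        # Fine on held-out web pictures, poor on app pictures.
        return "data mismatch"
    return "no large gap detected"

print(diagnose(train_error=0.02, train_dev_error=0.03, validation_error=0.15))
print(diagnose(train_error=0.02, train_dev_error=0.14, validation_error=0.15))
```

The first call reports a data mismatch (good on train-dev, poor on validation), while the second reports overfitting (already poor on train-dev).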

# 6. Effect of Learning Rate on the Training Algorithm

<span style="font-size:14pt; color:black; font-family:Garamond">

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate. If you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data (you don’t want a spam filter to flag only the latest kinds of spam it was shown). Conversely, if you set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points (outliers).

In the presence of changing data distribution, the loss function's shape will also change. This is because the optimal model parameters for the old data distribution may not be optimal for the new data distribution.

Let's break down the learning rate's role in this scenario:

1. **High learning rate**: When the learning rate is high, the model quickly adapts to the new data distribution. It takes larger steps in the parameter space to minimize the loss function. This is beneficial when the data distribution is genuinely changing, as the model can swiftly adapt to the new patterns. However, a high learning rate can also make the model sensitive to noise or fluctuations in the data. It might overshoot the optimal parameters, causing the model to oscillate around the minimum of the loss function and potentially to diverge.
   
2. **Low learning rate**: With a low learning rate, the model takes smaller steps in the parameter space, which means it learns more slowly. This can be advantageous when the data has noise or fluctuations, as the model is less likely to be influenced by non-representative data points. However, if the data distribution changes significantly, a low learning rate may not allow the model to adapt quickly enough, leading to poor performance on the new data.
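
The trade-off can be seen with the simplest possible online learner: a single parameter that tracks the level of a stream of numbers. This is a sketch under assumptions; the stream values, the two learning rates, and the helper name `online_mean` are all made up for illustration.

```python
def online_mean(stream, learning_rate, start=0.0):
    """Track a running estimate with theta <- theta + eta * (x - theta).

    This is one SGD step per instance on the squared error (x - theta)^2 / 2.
    Returns the estimate after each instance.
    """
    theta, history = start, []
    for x in stream:
        theta += learning_rate * (x - theta)
        history.append(theta)
    return history

# Hypothetical stream: values near 1, one outlier, then a genuine shift to 5.
stream = [1.0] * 10 + [50.0] + [1.0] * 10 + [5.0] * 10

fast = online_mean(stream, learning_rate=0.9, start=1.0)
slow = online_mean(stream, learning_rate=0.05, start=1.0)

print(f"right after the outlier:   fast = {fast[10]:.2f}, slow = {slow[10]:.2f}")
print(f"3 steps after the shift:   fast = {fast[23]:.2f}, slow = {slow[23]:.2f}")
```

The fast learner is thrown far off by the single outlier but catches up with the genuine shift almost immediately; the slow learner barely reacts to the outlier but is still far from the new level three steps after the shift.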

### How outliers affect the loss function:

The loss curve (the loss function's value over time) also changes when the data distribution changes. When the distribution shifts, the loss value may increase while the model adapts to the new distribution. The learning rate then influences how quickly the model can minimize the new loss function and how stable the loss curve is throughout the training process.

A model's complexity also affects how it reacts to anomalies. If the model is very complex, it might be more prone to overfitting to the anomalies, even when it's close to the global minimum. A simpler model may be more robust to the presence of anomalies.

The MSE loss function is particularly sensitive to outliers. This is because the squared error term grows quadratically with the magnitude of the error, causing outliers to have a disproportionately large impact on the overall loss value.

In the presence of outliers, the bowl-shaped loss surface can become distorted. The global minimum of the loss function might shift towards the outliers, as the model tries to minimize the overall error by fitting these points more closely. As a result, the model parameters may be biased towards the outliers, leading to a less accurate representation of the underlying data distribution.
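
A constant model makes this shift easy to see, because the constant that minimizes the MSE is simply the mean of the targets. The numbers below are made up, and the median (which minimizes the mean absolute error) is included only as a point of comparison.

```python
from statistics import mean, median

def mse(prediction, values):
    """Mean squared error of a constant prediction."""
    return mean((v - prediction) ** 2 for v in values)

clean = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = clean + [100.0]

# For a constant model, the MSE is minimized by the mean of the targets,
# so a single outlier drags the optimum a long way.
print("MSE-optimal constant:", mean(clean), "->", mean(with_outlier))
# The median (which minimizes the mean absolute error) barely moves.
print("MAE-optimal constant:", median(clean), "->", median(with_outlier))

# Share of the total squared error at the clean optimum due to the outlier.
errors = [(v - mean(clean)) ** 2 for v in with_outlier]
outlier_share = errors[-1] / sum(errors)
print(f"outlier share of squared error: {outlier_share:.1%}")
```

One point out of six accounts for almost all of the squared error, which is why the minimum of the MSE moves toward it.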

In summary, the learning rate plays a crucial role in determining how a model adapts to changing data distributions. A high learning rate allows for faster adaptation but can lead to instability, while a low learning rate provides more stability but may not adapt quickly enough to significant changes. The loss curve will also be affected by the changing data distribution and the learning rate, reflecting the model's ability to optimize the loss function under varying conditions.

</span>

## 6.1. The Effect of a High Learning Rate in the Presence of Anomalies

<span style="font-size:14pt; color:black; font-family:Garamond">
In the presence of anomalies or outliers in the data, a high learning rate can lead to several challenges for a machine learning model. Here's what can happen:<br>

1. **Oversensitivity to noise**: A high learning rate means that the model will take larger steps in the parameter space during optimization. Consequently, the model will be more sensitive to noise or anomalies in the data. When an anomaly is encountered, the model may treat it as a significant change in the data distribution and adjust its parameters accordingly. This can lead to the model learning incorrect patterns or fitting the noise instead of the underlying true patterns in the data.

2. **Loss function oscillation**: When the learning rate is high, the model may overshoot the optimal parameters while trying to minimize the loss function. This can cause the model to oscillate around the minimum of the loss function, making it harder to converge. In the presence of anomalies, this oscillation can be exacerbated, as the model keeps adjusting its parameters back and forth in response to the noise.

3. **Difficulty in convergence**: Due to the larger steps taken in the parameter space, a high learning rate can make it challenging for the model to converge to the optimal parameters, especially when there are anomalies in the data. The model may keep bouncing between different regions of the parameter space, influenced by the noise or outliers, leading to poor convergence or even divergence.

4. **Reduced generalization**: When the model is influenced by anomalies or noise, it may overfit to the training data, capturing patterns specific to the anomalies rather than the overall data distribution. This can reduce the model's ability to generalize well to new, unseen data, leading to lower performance on validation or test datasets.

In summary, a high learning rate can cause several issues when there are anomalies in the data, such as oversensitivity to noise, oscillations in the loss function, difficulty in convergence, and reduced generalization. It is essential to choose an appropriate learning rate that balances the model's ability to adapt to new data while maintaining stability and minimizing the influence of noise or anomalies. In many cases, using techniques like learning rate scheduling or adaptive learning rates can help mitigate the effects of a high learning rate and improve model performance.
</span>
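
Overshooting, oscillation, and divergence can all be observed on a one-parameter toy problem. The sketch assumes the loss $\theta^2$ (gradient $2\theta$) and three arbitrary learning rates; it leaves out anomalies and only isolates the step-size effect.

```python
def gradient_descent(learning_rate, start=1.0, steps=20):
    """Minimize loss(theta) = theta^2, whose gradient is 2 * theta."""
    theta, path = start, [start]
    for _ in range(steps):
        theta -= learning_rate * 2 * theta
        path.append(theta)
    return path

for eta in (0.1, 0.9, 1.1):
    path = gradient_descent(eta)
    flips = sum(1 for a, b in zip(path, path[1:]) if a * b < 0)
    print(f"eta = {eta}: final |theta| = {abs(path[-1]):.3g}, sign flips = {flips}")
```

Each update multiplies $\theta$ by $(1 - 2\eta)$. With $\eta = 0.1$ the factor is $0.8$ and $\theta$ shrinks steadily; with $\eta = 0.9$ the factor is $-0.8$, so $\theta$ jumps across the minimum at every step while still converging; with $\eta = 1.1$ the factor is $-1.2$ and the iterates diverge.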

# 7. No Free Lunch Theorem

<span style="font-size:14pt; color:black; font-family:Garamond">

A model is a simplified version of the observations. The simplifications are meant to discard the superfluous details that are unlikely to generalize to new instances. However, to decide what data to discard and what data to keep, you must make assumptions. For example, a linear model makes the assumption that the data is fundamentally linear and that the distance between the instances and the straight line is just noise, which can safely be ignored.

In a famous 1996 paper, David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the No Free Lunch (NFL) theorem. For some datasets the best model is a linear model, while for other datasets it is a neural network. There is no model that is a priori guaranteed to work better (hence the name of the theorem). The only way to know for sure which model is best is to evaluate them all. Since this is not possible, in practice you make some reasonable assumptions about the data and you evaluate only a few reasonable models. For example, for simple tasks you may evaluate linear models with various levels of regularization, and for a complex problem you may evaluate various neural networks.
</span>