# Chapter 1: The Machine Learning Landscape

## What is Machine Learning?

Machine Learning (ML) is the science/art of programming computers so that they can learn from data. 

A classic example is the spam filter in your email. Given a large sample of emails (the training data), the filter learns from them (builds a model) and then determines whether an incoming email is spam or not (makes a prediction).

This differs from simply downloading data, such as a copy of Wikipedia. The computer holds more data, but it has not learned anything: it gives us no insight into the data, nor does it make predictions.

## Why Use Machine Learning?

Machine Learning can be particularly useful when:
* Existing solutions are rules-based (long lists of hand-written rules are tedious to write and maintain).
* The data is complex and large.
* Environments are fluctuating and changes need to be made rapidly.

## Types of Machine Learning Systems

There are so many different types of ML systems that it can be useful to classify them into broad categories, based on the following criteria:
* How they are supervised during training.
* Whether or not they can learn incrementally on the fly.
* Whether they work by simply comparing new data points to known data points, or by detecting patterns in the training data and building a predictive model.

#### Training Supervision

Supervision during training can be broken down into five categories:
* Supervised Learning
* Unsupervised Learning
* Semi-Supervised Learning
* Self-Supervised Learning
* Reinforcement Learning

**Supervised Learning**

Supervised learning involves a dataset that is fully labeled. The task is typically regression or classification. The purpose is to detect patterns in a set of labeled training data and develop a model that can predict the labels of unseen datapoints.

Algorithms include:
* Regression - linear regression, decision trees, SVR, neural networks
* Classification - logistic regression, decision trees, SVC, neural networks

**Unsupervised Learning**

Unsupervised learning involves a dataset that does not need to contain labels. Rather than learning to predict a known target, the system tries to find patterns and structure in the data on its own, which makes it useful for gaining insight into a dataset.

Algorithms and purposes include:
* Clustering - KMeans, DBSCAN
* Visualization - KMeans, PCA, t-SNE
* Dimensionality Reduction - PCA, t-SNE
* Anomaly Detection - Isolation Forest, One-Class SVM
* Association Rules Learning - Apriori, FP-Growth
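
As a rough illustration of clustering, here is a minimal sketch of k-means on one-dimensional data, written with the standard library only. The function name `kmeans_1d`, the sample points, and the starting centroids are all made up for this example; a real project would normally use a library implementation.

```python
def kmeans_1d(points, centroids, n_iters=10):
    """Cluster 1-D points with plain k-means (assign, then update)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data with two obvious groups.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are supplied anywhere: the two groups emerge from the distances between points alone.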

**Semi-Supervised Learning**

Gathering labels for training data is often costly and time-consuming, so datasets are frequently only partially labeled. Semi-supervised learning involves training a model on partially labeled data. These algorithms are often a combination of supervised and unsupervised learning.

Example: Let's say we have a bunch of photos. We can cluster these photos such that each cluster contains one person. Then, we can apply a label to the cluster. Now we can train any supervised model that we'd like. When a new photo comes in, it can predict which cluster the photo belongs to and which person it may be.

**Self-Supervised Learning**

Self-Supervised learning techniques involve training a model from fully unlabeled data by creating its own labels through predefined tasks (also called pretext tasks). These tasks help the model extract meaningful features from the data without requiring manual annotations.

Example: Let's say we again have a bunch of photos, this time of different animals. 

Step 1: Pretext Task (Mask Repair)
* You design a pretext task: mask out parts of the images and train a neural network to predict the missing content.
* During this process, the model learns about textures, edges, shapes, and relationships between different parts of the image.

Step 2: Transfer to Downstream Task (Animal Classification)
* After training on the masking repair task, the model has developed an internal representation of the data (e.g., knowing where eyes and ears typically go on an animal).
* Now, you modify the model for the downstream task: predicting the species of the animal.
* The model’s prior knowledge (from the masking task) helps it distinguish between animals because it has learned key features such as fur patterns, body shapes, and general anatomy.

Transferring knowledge from one task to another is called ***transfer learning***.
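
To make "creating its own labels" concrete, here is a small sketch of how a masking pretext task turns unlabeled data into (input, target) training pairs. A short list of strings stands in for image patches, and `make_masked_pairs` and the `<MASK>` token are hypothetical names chosen for this illustration. The network that would learn from these pairs is not shown.

```python
def make_masked_pairs(sequence, mask_token="<MASK>"):
    """Build (masked_input, target) pairs from one unlabeled sequence."""
    pairs = []
    for i, target in enumerate(sequence):
        masked = list(sequence)      # copy, so the original stays untouched
        masked[i] = mask_token       # hide one element
        pairs.append((masked, target))  # the hidden element becomes the label
    return pairs


# Hypothetical stand-in for the patches of one unlabeled animal photo.
patches = ["fur", "ear", "eye", "nose"]
pairs = make_masked_pairs(patches)
for masked_input, target in pairs:
    print(masked_input, "->", target)
```

Every label here comes from the data itself, which is why no manual annotation is needed for the pretext task.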


**Reinforcement Learning**

Reinforcement Learning is a different beast from the categories above. It involves a learning system, called an agent, that observes an environment, selects and performs actions, and gets rewards or punishments in return. The agent must learn the best strategy (called a policy) to maximize its reward over time.
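
The agent, action, and reward loop can be sketched with a very small example: an epsilon-greedy agent choosing between slot-machine-style actions. This is only one simple strategy, picked here for illustration. The function `run_bandit`, the reward probabilities, and the default settings are assumptions made for this sketch.

```python
import random


def run_bandit(reward_probs, n_steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-looking action, sometimes explore."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    counts = [0] * n_actions     # how often each action was taken
    values = [0.0] * n_actions   # running average reward per action
    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: random action
        else:
            action = max(range(n_actions), key=lambda a: values[a])  # exploit
        # The environment returns a reward of 1 or 0.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental update of the average reward for this action.
        values[action] += (reward - values[action]) / counts[action]
    return values, counts


# Hypothetical environment: action 1 pays off far more often than action 0.
values, counts = run_bandit([0.2, 0.8])
print(values, counts)
```

The learned policy here is simply "pick the action with the highest estimated value", and the estimates improve as rewards come in.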

#### Batch vs Online Learning

Another criterion used to classify ML systems is whether or not the system can learn incrementally from a stream of incoming data.

***Batch Learning***

In this type of learning the model is trained offline - meaning that the system is incapable of learning incrementally. The system must be trained on the entirety of the data, rather than in mini-batches. 

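This can be computationally expensive and time-consuming: to incorporate new data, the model must be retrained from scratch on the full dataset (old and new data together), re-tuned, and redeployed. Thankfully, much of this process can be automated in a pipeline so that it moves faster.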

***Online Learning***

In this type of learning the model is trained online with mini-batches of data. We can feed in a small amount of data, update the model, and discard the data. 

This can be useful, as the model can adapt quickly to new observations and new patterns. Additionally, we can use this to train models on very large datasets that do not fit in memory. This is called ***out-of-core*** learning.

The rate at which the model adapts to new datapoints is called the ***learning rate***. A high learning rate means that the model adapts quickly to new data, but it also tends to forget old data quickly. A low learning rate means that there is more inertia: the model is slower to adapt, but it is less sensitive to noise and outliers in the new data.
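
The feed, update, discard cycle can be sketched as follows. This is a minimal illustration assuming a one-feature linear model updated by gradient steps on each mini-batch. The names `online_update` and `stream_batches`, the synthetic line `y = 2x + 1`, and the learning rate of 0.1 are all assumptions for this example.

```python
import random


def online_update(weight, bias, batch, learning_rate):
    """One gradient step on the mini-batch MSE of y ~ weight * x + bias."""
    n = len(batch)
    grad_w = sum(2 * (weight * x + bias - y) * x for x, y in batch) / n
    grad_b = sum(2 * (weight * x + bias - y) for x, y in batch) / n
    return weight - learning_rate * grad_w, bias - learning_rate * grad_b


def stream_batches(n_batches, batch_size=8, seed=0):
    """Hypothetical data stream: noiseless points on the line y = 2x + 1."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        xs = [rng.uniform(-1, 1) for _ in range(batch_size)]
        yield [(x, 2 * x + 1) for x in xs]


weight, bias = 0.0, 0.0
for batch in stream_batches(n_batches=500):
    weight, bias = online_update(weight, bias, batch, learning_rate=0.1)
    # The batch is never stored: once it has been used, it can be discarded.

print(weight, bias)
```

Only one mini-batch is held in memory at a time, which is the property that out-of-core learning relies on.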

Regardless of the system - whether it is a batch or online learning system - it is important to monitor your model's performance after it is launched into production. Models can suffer from ***model rot***, a decay in performance over time, often caused by ***data drift***, where the live data gradually stops resembling the data the model was trained on. It is important to recognize this early and make the necessary adjustments to the system as soon as possible.

#### Instance-Based vs Model-Based Learning

***Instance Based***

Instance-based models store the training examples and rely on a measure of similarity to make predictions for new datapoints. In other words, the system compares each new, unseen datapoint to the previously examined datapoints from the training set.

Examples:
* K-Nearest Neighbors
* Support Vectors
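
As a sketch of the instance-based idea, here is a minimal k-nearest-neighbors classifier using Euclidean distance as the similarity measure. The function name `knn_predict` and the toy training points are invented for this example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the stored examples by distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labeled examples: (features, label).
train = [
    ((1.0, 1.0), "cat"), ((1.2, 0.8), "cat"), ((0.9, 1.1), "cat"),
    ((5.0, 5.0), "dog"), ((5.2, 4.9), "dog"), ((4.8, 5.1), "dog"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

There is no training step at all: the "model" is the stored data plus the distance function.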

***Model Based***

Model based learning involves detecting patterns in the training data and using those to build a model to make predictions. 

When we construct the model, we adjust the model's ***parameters*** with the intention of maximizing a utility function or minimizing a cost function. This is referred to as ***training*** the model.

Example: Let's say we are building a very simple linear regression model that predicts the weight of a person based on their height and their age.

In this model, our equation is:

$$
\hat{y} = \beta_0 + \beta_1 \cdot \text{height} + \beta_2 \cdot \text{age}
$$

Therefore, we have three parameters that we can tune: $\beta_0$, $\beta_1$, and $\beta_2$. Let's use the Mean Squared Error (MSE) as the cost function. During training, the model systematically adjusts these parameters (using methods like the closed-form solution or gradient descent) to minimize the MSE.
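
Here is a minimal sketch of that training loop, using batch gradient descent to minimize the MSE of the three-parameter model. The data is made up: the features are assumed to be already standardized (centered and scaled), and the weights were generated from `70 + 8 * height + 3 * age` so that the result is easy to check. The function names are hypothetical.

```python
def mse(betas, rows):
    """Mean squared error of y_hat = b0 + b1 * height + b2 * age."""
    b0, b1, b2 = betas
    return sum((b0 + b1 * h + b2 * a - y) ** 2 for h, a, y in rows) / len(rows)


def fit_gradient_descent(rows, learning_rate=0.1, n_iters=500):
    """Adjust the three parameters step by step to reduce the MSE."""
    b0 = b1 = b2 = 0.0
    n = len(rows)
    for _ in range(n_iters):
        # Prediction error for every row, kept alongside its features.
        errors = [(b0 + b1 * h + b2 * a - y, h, a) for h, a, y in rows]
        # Partial derivatives of the MSE with respect to each parameter.
        g0 = 2 * sum(e for e, _, _ in errors) / n
        g1 = 2 * sum(e * h for e, h, _ in errors) / n
        g2 = 2 * sum(e * a for e, _, a in errors) / n
        b0 -= learning_rate * g0
        b1 -= learning_rate * g1
        b2 -= learning_rate * g2
    return b0, b1, b2


# Hypothetical rows: (standardized height, standardized age, weight in kg).
rows = [(-1, -1, 59.0), (-1, 1, 65.0), (1, -1, 75.0), (1, 1, 81.0), (0, 0, 70.0)]
betas = fit_gradient_descent(rows)
print(betas, mse(betas, rows))
```

With raw, unscaled heights and ages, the same loop would typically need a much smaller learning rate or feature scaling to converge.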

## Main Challenges of Machine Learning

#### Insufficient Quantity of Training Data

Unfortunately, ML systems need a lot of data to detect patterns and to construct accurate predictive models. Even simple tasks can take thousands of datapoints, whereas more complex tasks, like image recognition, can take millions.

#### Non-Representative Training Data

In order to have a model that generalizes well to unseen data, the training data must be representative of the true population. If the ML system learns on a set of data that is not representative, it will likely perform poorly. 

We need to avoid ***sampling noise*** (a small sample that is non-representative purely by chance) by gathering an adequate amount of data. We also need to be conscious of how our data is sampled and collected: even a very large dataset can be non-representative if the sampling method is flawed, which introduces ***sampling bias***.

#### Poor-Quality Data

Errors, outliers, and noise in the data make it harder to detect patterns, and therefore harder to build accurate ML systems. We will almost always have to spend time cleaning up the data.

#### Irrelevant Features

As the saying goes: "garbage in, garbage out". We need enough relevant features to train on, while making sure not to have too many irrelevant ones. The process of coming up with a good set of features is called ***feature engineering***. The steps of feature engineering are:
* *feature selection*: selecting the most useful set of features
* *feature extraction*: creating/combining new features from pre-existing features
* creating new features from external data sources

#### Overfitting the Training Data

Whereas the previous sections focus on "bad data", this and the next section focus on a "bad model".

Overfitting happens when the model fits the training data too closely, such that it does not generalize well to new, unseen data. This can happen when the model is too complex relative to the amount of training data, or when the training data is noisy. The ML system may detect small, meaningless patterns in the noise itself, and it won't generalize very well when it's time to make predictions.

Some solutions to avoid overfitting are to:
* simplify the model (select one with fewer parameters)
* reduce the number of features
* constrain the model
* gather more data
* reduce noise in the training data

Constraining the model to make it simpler is called ***regularization***. The amount of regularization is controlled by a ***hyperparameter***: a parameter of the learning algorithm itself rather than of the model, set before training and held constant while the model trains. When the regularization hyperparameter is set high, the ML system fits the training data less closely and is less likely to overfit. However, if it is set too high, the model may be too "flat" (for a linear model, a slope close to zero) and may underfit. Our job is to find the balance: enough regularization to avoid overfitting, while still capturing the patterns needed to generalize well to new, unseen data.
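
The effect of constraining a model can be sketched with a one-feature linear model and an L2 penalty (the ridge regression form of regularization, chosen here only as an illustration). The name `alpha` for the hyperparameter that sets the regularization strength, the function `ridge_slope`, and the data are assumptions for this example. The model has no intercept, to keep the closed form short.

```python
def ridge_slope(xs, ys, alpha):
    """Slope minimizing sum((y - slope * x)**2) + alpha * slope**2."""
    # Setting the derivative to zero gives slope = sum(x*y) / (sum(x*x) + alpha).
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + alpha
    return numerator / denominator


# Hypothetical training data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

for alpha in (0.0, 1.0, 10.0, 100.0):
    print(alpha, ridge_slope(xs, ys, alpha))
```

As `alpha` grows, the slope shrinks toward zero: the model becomes flatter and follows the training data less closely.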

#### Underfitting the Training Data

The opposite of overfitting, underfitting occurs when the model is too simple to learn the underlying structure of the data. Solutions to underfitting are:
* choosing a more powerful model (with more parameters)
* feeding the model better features
* reducing the constraints on the model (for example, lowering the regularization hyperparameter)

## Testing + Validating

Now that the model has been fitted, we want a way to tell how well it will generalize to new instances. The simplest way to handle this is to split the data into two sets: a training set and a test set. The model is trained on the training set, and its error on the test set gives an estimate of how it will perform on unseen data (the generalization error).
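
A minimal sketch of such a split, using only the standard library. The function name, the 20% test ratio, and the fixed seed are choices made for this illustration, not something the text prescribes.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows, then hold out a fraction as the test set."""
    shuffled = list(rows)
    # A fixed seed makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # hypothetical dataset of 10 instances
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before splitting matters when the data is stored in some meaningful order (by date or by class, for example), since a plain slice would then give a non-representative test set.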

#### Hyperparameter Tuning + Model Selection

Now let's say we want to tune the hyperparameters of our model. We could evaluate each hyperparameter option on the test set, but we run the risk of overfitting the hyperparameters to the test set. In other words, the set of hyperparameters that is returned may be the best for that particular test set but not for new data, and the error measured on the test set becomes an overly optimistic estimate.

To avoid this, we can introduce ***cross validation***. The training data is split into several subsets (folds), and the model is fitted multiple times, each time on all folds but one. Each fitted model is then evaluated on the fold that was held out, and the metrics are averaged across folds. The test set is kept aside and used only once, for the final evaluation.
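
The procedure can be sketched in a few lines. This is a simplified k-fold scheme without shuffling, and the helper names (`k_fold_indices`, `cross_validated_score`, `fit_mean`, `mse_of_mean`) are hypothetical. The "model" used below just predicts the mean of its training values.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validated_score(rows, k, fit, score):
    """Fit on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx])
        scores.append(score(model, [rows[i] for i in val_idx]))
    return sum(scores) / len(scores)


def fit_mean(values):
    """Toy model: always predict the mean of the training values."""
    return sum(values) / len(values)


def mse_of_mean(model, values):
    """Mean squared error of the constant prediction on held-out values."""
    return sum((v - model) ** 2 for v in values) / len(values)


rows = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical target values
print(cross_validated_score(rows, k=3, fit=fit_mean, score=mse_of_mean))
```

To compare hyperparameter options, the same function would be called once per option and the option with the best averaged score kept.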

***NOTE***: It is very important that the training set and the testing set (and the live data) are representative of the true population. 