# What is Machine Learning?

The science of programming computers so that they can learn from data.



# Why use Machine Learning?


ML is great for:
* Problems for which existing solutions demand a lot of hand-tuning or long lists of rules.
* Complex problems for which the traditional approach yields no good solution.
* Fluctuating environments: ML can adapt to new data.
* Getting insights about complex problems and large amounts of data.



# Types of ML Systems

* Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and reinforcement learning)

* Whether or not they can learn incrementally on the fly (online vs batch learning)

* Whether they work by simply comparing new data points to known ones, or instead detect patterns in the training data and build a predictive model (instance-based vs model-based)

## Supervision

### Supervised Learning
The training data you feed to the algorithm includes the desired solutions (labels)

![image.png](attachment:image.png)

Typical supervised learning tasks:

- predict a class (category) for each instance: _classification_
- predict a target numeric value given a set of predictors: _regression_

![image.png](attachment:image.png)

Note: some regression algorithms can also be used for classification (logistic regression, for example).

#### Some classic examples of supervised learning algorithms:

* k-Nearest Neighbors
* Linear Regression
* Logistic Regression
* Support Vector Machines (SVMs)
* Decision Trees and Random Forests
* Neural Networks (some of them)

### Unsupervised Learning

The training data is unlabeled. The system tries to learn without a teacher.

![image.png](attachment:image.png)

#### Some important unsupervised learning algorithms:
<b>Clustering</b>

![image-4.png](attachment:image-4.png)
* K-Means
* DBSCAN
* Hierarchical Cluster Analysis (HCA)
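
To make the idea of clustering unlabeled data concrete, here is a minimal pure-Python sketch of the K-Means loop (assignment step, then update step) on one-dimensional points. The function name `k_means_1d`, the toy points, and the initial centroids are all assumptions made for illustration; a real project would more likely rely on a library implementation.

```python
def k_means_1d(points, centroids, n_iters=10):
    """Cluster 1-D points around the given initial centroids (illustrative sketch)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centroids, clusters = k_means_1d(points, centroids=[0.0, 6.0])
print(centroids, clusters)
```

No labels are ever given to the algorithm: the two groups emerge only from the distances between points.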

<b>Anomaly detection and novelty detection</b>

![image-3.png](attachment:image-3.png)
* One-class SVM
* Isolation Forest

<b>Visualization and dimensionality reduction</b>

Dimensionality reduction aims to simplify the data without losing too much information, for example by merging several correlated features into one &rarr; _feature extraction_

It is often a good idea to reduce the dimension of the training data using a dimensionality reduction algorithm before feeding it to another Machine Learning algorithm.

![image-2.png](attachment:image-2.png)


* Principal Component Analysis (PCA)
* Kernel PCA
* Locally-Linear Embedding (LLE)
* t-distributed Stochastic Neighbor Embedding (t-SNE)

<b>Association rule learning</b>

Dig into large amounts of data and discover interesting relations between attributes.
* Apriori
* Eclat




### Semisupervised Learning

Partially labeled training data &rarr; usually a large amount of unlabeled data and a small amount of labeled data.

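Example: Google Photos groups photos of the same person automatically (clustering), then needs only one label per person to name that person in every photo.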

Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms.

### Reinforcement Learning

The learning system (agent) can observe the environment, select and perform actions, and get rewards/penalties in return. It must learn by itself the best strategy (policy) to get the most reward over time.

![image.png](attachment:image.png)

## Batch and Online Learning

Whether or not a system can learn incrementally from a stream of incoming data.


### Batch Learning

The system is incapable of learning incrementally: it must be trained using all the available data. First the system is trained, and then it is launched into production and runs without learning anymore.

If you want a batch learning system to know about new data, you need to train a new version of the system from scratch on the full dataset (old and new data), then stop the old system and replace it with the new one.

Training on the full dataset requires a lot of computing resources. If you have a lot of data and
you automate your system to train from scratch every day, it will end up costing you a
lot of money.

### Online Learning

![image.png](attachment:image.png)

The system is trained incrementally by feeding it data instances sequentially, either individually or by small groups (_mini-batches_). Each learning step is fast and cheap.

Online learning is great for systems that receive data as a continuous flow and need to adapt to change rapidly while using fewer computing resources.

It is also useful when the datasets used for training cannot fit in one machine's main memory (_out-of-core_ learning). The algorithm loads part of the data, runs a training step on it, and repeats the process until it has run on all of the data.

Note: _out-of-core_ learning is usually done offline, so _online learning_ can be a confusing name. Think of it as _incremental learning_.

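- One important parameter is the _learning rate_: the higher the learning rate, the more rapidly the system will adapt to new data, but it will also tend to forget the old data. Conversely, a low learning rate makes the system learn more slowly, but it is less sensitive to noise in the new data.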
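
The incremental training loop described in this section can be sketched as follows: a linear model is updated one mini-batch at a time with a gradient step, and each mini-batch is discarded after use. This is only a minimal illustration. The helper names `sgd_step` and `stream_of_mini_batches`, the synthetic noiseless stream $y = 3x + 1$, and the learning rate of 0.1 are assumptions, not part of any particular library.

```python
def sgd_step(w, b, batch, learning_rate):
    """One incremental update of the linear model y = w*x + b on a mini-batch."""
    n = len(batch)
    # Gradients of the mean squared error over this mini-batch only.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b


def stream_of_mini_batches(n_batches, batch_size=4):
    """Hypothetical data stream: noiseless samples of y = 3x + 1."""
    for i in range(n_batches):
        xs = [((i * batch_size + j) % 10) / 10 for j in range(batch_size)]
        yield [(x, 3 * x + 1) for x in xs]


w, b = 0.0, 0.0
for batch in stream_of_mini_batches(n_batches=5000):
    # Each mini-batch is used for one cheap update and then thrown away.
    w, b = sgd_step(w, b, batch, learning_rate=0.1)

print(w, b)  # should end up close to the true values 3 and 1
```

Raising `learning_rate` makes every new mini-batch pull the parameters harder, which is exactly the trade-off between fast adaptation and forgetting old data.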

## Instance-Based vs Model-Based Learning

How the ML system generalizes to examples it has never seen before.


### Instance-Based Learning

The system learns the examples by heart, then generalizes to new ones by comparing them to the learned examples using a _similarity measure_.

![image.png](attachment:image.png)
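
As a minimal sketch of instance-based learning, the k-Nearest Neighbors classifier below simply stores the labeled examples and classifies a new point by majority vote among its closest neighbors. Euclidean distance is assumed as the similarity measure, and the function name `knn_predict` and the toy "ham"/"spam" points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(examples, new_point, k=3):
    """Classify new_point by majority vote among its k nearest stored examples."""
    # Similarity measure: Euclidean distance (smaller means more similar).
    by_distance = sorted(examples, key=lambda ex: math.dist(ex[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# "Training" is just memorizing the examples (hypothetical 2-D feature vectors).
examples = [
    ((1.0, 1.0), "ham"), ((1.2, 0.8), "ham"), ((0.9, 1.1), "ham"),
    ((4.0, 4.2), "spam"), ((4.1, 3.9), "spam"), ((3.8, 4.0), "spam"),
]
prediction = knn_predict(examples, (3.5, 3.7))
print(prediction)
```

There is no model to fit here: all the work happens at prediction time, when the new point is compared to the memorized examples.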

### Model-Based Learning

Build a model of the examples, then use it to make _predictions_.

![image.png](attachment:image.png)


After selecting a model with parameters $\theta$, we need to evaluate its performance. 

To measure performance, we can either define a _utility function_ (_fitness function_) that measures how <b>good</b> the model is, or a _cost function_ that measures how <b>bad</b> it is. For instance, linear models tend to use a cost function that measures the distance between the linear model's predictions and the training examples, with the objective of minimizing this distance.

We _train_ the model so that the learning algorithm finds the parameter values that best fit our data.


![image.png](attachment:image.png)
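
A small sketch of model-based learning, assuming a linear model $\hat{y} = \theta_0 + \theta_1 x$ and the mean squared error as cost function. The closed-form least-squares fit is one possible way of training, chosen here for brevity, and the data points are made up for illustration.

```python
def mse_cost(theta0, theta1, xs, ys):
    """Cost function: mean squared distance between predictions and targets."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_linear(xs, ys):
    """Training: return the (theta0, theta1) that minimize mse_cost (least squares)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    theta1 = cov_xy / var_x
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


# Hypothetical training examples, roughly following y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

theta0, theta1 = fit_linear(xs, ys)
prediction = theta0 + theta1 * 5.0  # use the trained model on a new instance
print(theta0, theta1, prediction)
print(mse_cost(theta0, theta1, xs, ys), mse_cost(1.0, 2.0, xs, ys))
```

The fitted parameters give a cost no higher than any other choice of $\theta_0, \theta_1$, including the hand-picked guess `(1.0, 2.0)`.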



# Main Challenges of Machine Learning

Since the main task is to select a learning algorithm and train it on some data, the two main things that can go wrong are "bad algorithm" and "bad data".

## Bad Data 

### Insufficient Quantity of Training Data

It takes a lot of data for most Machine Learning algorithms to work properly.

### Nonrepresentative Training Data

In order to generalize well, it is crucial that the training data be representative of the new cases we want to generalize to.

- If the sample is too small, we have _sampling noise_ (nonrepresentative data as a result of chance).

- If the sampling method is flawed, we can have _sampling bias_, even with a very large sample.

### Poor-Quality Data

Training data full of errors, outliers, and noise makes it harder for the system to detect the underlying patterns, so the system is less likely to perform well.

### Irrelevant Features

The system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones.

<b>A critical part of the success of a Machine Learning project is coming up with a good set of features to train on &rarr; _feature engineering_</b>


- Feature selection: selecting the most useful features to train on among the existing ones.
- Feature extraction: combining existing features to produce a more useful one (dimensionality reduction algorithms can help).
- Creating new features by gathering new data.

## Bad Algorithms

### Overfitting the Training Data

_Overfitting_: the model performs well on the training data, but does not generalize well.

![image.png](attachment:image.png)

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. Some of the solutions are:

- Simplify the model by selecting one with fewer parameters, by reducing the number of attributes in the training data, or by constraining the model (_regularization_: for example, fixing a parameter to a specific value, or restricting it to a small range of values).
- Gather more training data.
- Reduce noise in the training data (fix data errors and remove outliers).

![image-2.png](attachment:image-2.png)

The amount of regularization to apply during learning can be controlled by a _hyperparameter_. A hyperparameter is a parameter of a learning algorithm (not of the model): it must be set prior to training and remains constant during it.
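
To illustrate how a regularization hyperparameter constrains a model, the sketch below fits a one-parameter model $\hat{y} = \theta x$ by minimizing $\sum_i (\theta x_i - y_i)^2 + \alpha \theta^2$, whose minimizer is $\theta = \sum_i x_i y_i / (\sum_i x_i^2 + \alpha)$. This ridge-style penalty is just one possible form of regularization, assumed here for simplicity; the name `fit_ridge_slope` and the data are hypothetical.

```python
def fit_ridge_slope(xs, ys, alpha):
    """Fit y = theta * x by minimizing sum((theta*x - y)^2) + alpha * theta^2."""
    # alpha is a hyperparameter: it is fixed before training and never learned.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


# Hypothetical training data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

slopes = {alpha: fit_ridge_slope(xs, ys, alpha) for alpha in (0.0, 1.0, 10.0, 100.0)}
for alpha, theta in slopes.items():
    print(f"alpha={alpha:>5}: theta={theta:.3f}")
```

With `alpha=0.0` the fit is unconstrained; as `alpha` grows, the learned slope is pulled toward zero, giving a simpler (flatter) model that fits the training data less closely.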

### Underfitting the Training Data

The opposite of overfitting: the model is too simple to learn the underlying structure of the data. The main options to fix it are:

- Selecting a more powerful model, with more parameters.
- Feeding better features to the learning algorithm (feature engineering).
- Reducing the constraints on the model (for example, reducing the regularization hyperparameter).




## Testing and Validating

Split the data into two sets: the _training set_ and the _test set_ (usually 80/20). Train the model on the training set and test it on the test set. The error rate on new cases is the _generalization error_ (_out-of-sample error_); evaluating the model on the test set gives an estimate of it, which tells us how the model will perform on instances it has never seen before.

If training error is low, but the generalization error is high, the model is overfitting the training data.
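
The split-train-evaluate procedure can be sketched in a few lines of plain Python. Everything here is illustrative: the helper names (`train_test_split`, `fit_line`, `mse`), the synthetic noisy data, and the fixed random seeds are assumptions, and the error on the test set is only an estimate of the generalization error.

```python
import random


def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and split it into (training set, test set)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(data):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    return mean_y - slope * mean_x, slope


def mse(model, data):
    """Mean squared error of the (intercept, slope) model on the given data."""
    intercept, slope = model
    return sum((intercept + slope * x - y) ** 2 for x, y in data) / len(data)


# Hypothetical dataset: y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
data = [(x, 2 * x + 1 + rng.gauss(0, 0.5)) for x in range(50)]

train_set, test_set = train_test_split(data)   # 80/20 split
model = fit_line(train_set)                    # the test set is never used for training
training_error = mse(model, train_set)
generalization_error = mse(model, test_set)    # estimate on unseen instances
print(training_error, generalization_error)
```

A training error much lower than the estimated generalization error would be the overfitting signal described above.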

### Hyperparameter Tuning and Model Selection

How do you decide which model to use when you are hesitating between two of them? One option is to train both and compare how well they generalize using the test set.

However, since you are going to iterate through different models and different hyperparameters, repeatedly measuring on the test set would overfit the choice to that particular set. To avoid this, use _holdout validation_:

Hold out part of the training set to evaluate several candidate models and select the best one. This held-out set is the _validation set_. Train multiple models with various hyperparameters on the reduced training set (the full training set minus the validation set) and select the one that performs best on the validation set. After this, train the best model on the full training set (including the validation set), giving you the final model. Lastly, evaluate the final model on the test set.
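
The holdout procedure can be sketched as below, using a k-nearest-neighbors regressor whose number of neighbors `k` plays the role of the hyperparameter being tuned. The model choice, the candidate values, the synthetic data, and the 20/20/60 split are assumptions made only to keep the example short.

```python
import random


def knn_regress(train, x, k):
    """Predict y at x as the mean target of the k nearest training examples."""
    nearest = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mse(train, k, eval_set):
    """Mean squared error on eval_set of a k-NN model that memorized `train`."""
    return sum((knn_regress(train, x, k) - y) ** 2 for x, y in eval_set) / len(eval_set)


# Hypothetical noisy data: y = x^2 plus Gaussian noise.
rng = random.Random(0)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 1.0)) for x in range(100)]
rng.shuffle(data)

test_set, validation_set, reduced_train = data[:20], data[20:40], data[40:]
full_train = validation_set + reduced_train

# 1. Train each candidate on the reduced training set, score it on the validation set.
candidates = [1, 3, 5, 9, 15]
validation_errors = {k: mse(reduced_train, k, validation_set) for k in candidates}
best_k = min(validation_errors, key=validation_errors.get)

# 2. Retrain the winner on the full training set, then evaluate once on the test set.
final_error = mse(full_train, best_k, test_set)
print(best_k, final_error)
```

The test set is touched exactly once, after the hyperparameter has been chosen, so the final number is not biased by the selection process.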

However, if the validation set is too small, the model evaluations will be imprecise; if it is too large, the remaining training set will be much smaller than the full training set, so the candidates are compared under conditions that differ from the final training. One way to solve this is to perform repeated <b>_cross-validation_</b>, using many small validation sets. Each model is evaluated once per validation set, after being trained on the rest of the data. Averaging all the evaluations of a model gives a more accurate measure of its performance. The drawback is that the training time is multiplied by the number of validation sets.
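
A minimal sketch of cross-validation, assuming a simple k-fold scheme in which each fold serves as the validation set exactly once. The k-nearest-neighbors regressor, the helper names (`k_fold_splits`, `cross_val_mse`), and the synthetic data are illustrative assumptions.

```python
import random


def knn_regress(train, x, k):
    """Predict y at x as the mean target of the k nearest training examples."""
    nearest = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mse(train, k, eval_set):
    """Mean squared error on eval_set of a k-NN model that memorized `train`."""
    return sum((knn_regress(train, x, k) - y) ** 2 for x, y in eval_set) / len(eval_set)


def k_fold_splits(data, n_folds):
    """Yield (train, validation) pairs; each fold is the validation set once."""
    for i in range(n_folds):
        validation = data[i::n_folds]
        train = [ex for j, ex in enumerate(data) if j % n_folds != i]
        yield train, validation


def cross_val_mse(data, k, n_folds=5):
    """Average the validation error of the k-NN model over all folds."""
    scores = [mse(train, k, val) for train, val in k_fold_splits(data, n_folds)]
    return sum(scores) / len(scores)


# Hypothetical noisy data: y = x^2 plus Gaussian noise.
rng = random.Random(0)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 1.0)) for x in range(100)]
rng.shuffle(data)

cv_errors = {k: cross_val_mse(data, k) for k in (1, 5, 15)}
print(cv_errors)
```

Each candidate is trained `n_folds` times here, which is the extra training cost mentioned above.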

### Data Mismatch

The validation set and the test set must be as representative as possible of the data you expect to use in production.