# Chapter 1: The Machine Learning Landscape

## What is Machine Learning?
Machine learning is the science and art of programming a computer to learn from data.

Example: A spam filter is an ML program that learns to flag spam given examples of spam emails and regular ("ham") emails.

The **training set** is the collection (dataset) of examples the computer learns on.

Each training example is called a __training instance__ or __sample__.

## Why use Machine Learning?
 - Some problems require a long list of hand-written rules and constant updates; Machine Learning finds the rules mathematically from the data instead of relying on explicit programming.
 - Some problems are too complex (or too long) to solve by hand.
 - Machine Learning systems can be automated to retrain on new data and catch new patterns.
 - Machine Learning algorithms can be inspected to teach humans about trends in the data.
 
**Machine Learning can help humans learn**.
 
**Data Mining** is uncovering patterns in big datasets that were not apparent without the help of a Machine Learning model.

## Types of Machine Learning Systems
Machine Learning *systems* are categorized based on:
 - Amount of human supervision
  - Supervised
  - Unsupervised
  - Semi-supervised
  - Reinforcement
 - Whether it can learn incrementally on the fly
  - Online
  - Batch
 - Whether it compares new data points to known data points or builds a predictive model
  - Instance-based
  - Model-based
  
These criteria are combinable.

### Supervised/Unsupervised Learning
ML systems are classified based on the amount of human supervision given.
The four major categories (Supervised, Unsupervised, etc) are given above.

#### Supervised Learning
In Supervised Learning problems, the training data includes the **labels**.

Definition: **Labels** are the desired solutions attached to training instances. A label is the *target* value and can be numerical or categorical. In a classification problem, an example of a label is the "ham" or "spam" class of an email. In a regression problem, an example is the median housing price for a district in California (a continuous value).

Typical supervised learning tasks are Regression and Classification.

Note: The words: regressor, feature, attribute, and predictor are almost synonymous in Data Science and Machine Learning.

Note: The book mentions Logistic Regression. Recall that, despite its name, Logistic Regression (LR) is a classification technique: it outputs the probability that an instance belongs to a given class.
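
To make the note above concrete, here is a minimal sketch of how a logistic regression model can turn a linear score into a class probability. The parameter values `theta0` and `theta1` and the single feature are made up for illustration; they are not fitted to any data.

```python
import math

def sigmoid(t):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_proba(x, theta0, theta1):
    """Estimated probability that an instance with feature x is in the positive class."""
    return sigmoid(theta0 + theta1 * x)

# Hypothetical parameters, chosen by hand for illustration only.
theta0, theta1 = -4.0, 2.0

p_spam = predict_proba(3.0, theta0, theta1)   # linear score = -4 + 2 * 3 = 2
label = "spam" if p_spam >= 0.5 else "ham"    # threshold the probability at 0.5
print(round(p_spam, 3), label)
```

The model itself only produces a probability; the 0.5 threshold that turns it into a class is a separate choice.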

Some very important Supervised Learning algorithms:
 - K-Nearest Neighbors (KNN)
 - Linear Regression
 - Logistic Regression (LRC)
 - Support Vector Machines (SVM)
 - Decision Trees and Random Forests
 - Neural Networks
 
#### Unsupervised Learning
Training data is unlabelled.

Important Unsupervised Learning algorithms:
- Clustering
 - K-means
 - DBSCAN
 - Hierarchical Cluster Analysis (HCA)
- Anomaly detection and novelty detection
 - One-class SVM
 - Isolation Forest
- Visualization and dimensionality reduction
 - Principal Component Analysis (PCA)
 - Kernel PCA
 - Locally-Linear Embedding (LLE)
 - t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning
 - Apriori
 - Eclat
 
Example: You can run a *clustering* algorithm to detect groups of similar visitors to a website. You don't tell the algorithm in advance which group each visitor belongs to; it finds the groups on its own. "It might notice that 40% of visitors are comic-book-loving males who read every evening, and 20% are sci-fi lovers who only read on weekends. Running a *hierarchical clustering* algorithm may help divide each group into even smaller subgroups to target your traffic."
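
As a rough illustration of clustering, here is a bare-bones K-means sketch in plain Python. The visitor features (visits per week, minutes per visit) and their values are invented, and the centroids are initialized from the first `k` points to keep the run deterministic; real implementations usually use random or smarter initialization.

```python
def kmeans(points, k, n_iters=10):
    """Plain k-means on 2D points. Centroids start at the first k points."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(n_iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, cluster in enumerate(clusters):
            if cluster:  # an empty cluster keeps its previous centroid
                centroids[j] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centroids, clusters

# Made-up visitors: (visits per week, average minutes per visit).
visitors = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 10.0)]
centroids, clusters = kmeans(visitors, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No labels are passed in anywhere: the two groups emerge only from the distances between points.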

Visualization algorithms take complex, high-dimensional data and produce a 2D or 3D representation that can be plotted. They try to preserve as much structure as possible (for example, keeping separate clusters from overlapping) so that you can understand how the data is organized and spot unsuspected patterns.

Related to visualization is *dimensionality reduction*: simplifying the data without losing too much information. One way to do this is to merge several correlated features into one, which is called **feature extraction**.

Example of **feature extraction**: Merging a car's mileage with its age, $\dfrac{\text{mileage}}{\text{age}}$, to get `miles_per_year`.
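
The merge above could be sketched as follows. The dictionary keys and the sample values are hypothetical, and the function assumes the age is greater than zero.

```python
def add_miles_per_year(car):
    """Return a copy of `car` with mileage and age merged into one feature."""
    merged = {k: v for k, v in car.items() if k not in ("mileage", "age_years")}
    merged["miles_per_year"] = car["mileage"] / car["age_years"]  # assumes age_years > 0
    return merged

# Made-up cars for illustration.
cars = [
    {"mileage": 60000, "age_years": 5},
    {"mileage": 12000, "age_years": 1},
]
print([add_miles_per_year(car) for car in cars])
```

Two correlated columns go in and one comes out, so the dataset has one dimension fewer.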

Remark: Dimensionality reduction can make your ML system run **faster**, use less memory, and in some cases perform **better**.

Anomaly detection can be used to detect unusual credit card transactions to prevent fraud, to catch manufacturing defects, or to automatically remove outliers from a dataset before feeding it to another learning algorithm.
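
As a very simple stand-in for the anomaly detection algorithms listed earlier (this is not One-class SVM or Isolation Forest), the sketch below flags values that lie far from the mean in units of standard deviation. The transaction amounts and the threshold of 2 are arbitrary choices for illustration.

```python
import statistics

def flag_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]

# Made-up transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 23, 20, 500]
print(flag_anomalies(amounts))
```

A rule this crude is easily distorted by the outliers themselves, which is one reason dedicated algorithms exist.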

*Association rule learning*: discovering relationships among attributes in large amounts of data.

#### Semisupervised Learning
Algorithms that can work with partially labelled data: usually a lot of unlabelled data and a little labelled data.

Google Photos is an example. Its algorithm can recognize that the same person appears in multiple pictures, so if you label that person in one photo, it can name that person in every photo.

Most semisupervised algorithms are combinations of supervised and unsupervised learning algorithms.


#### Reinforcement Learning
Very different from the other types of learning.

The reinforcement learning system, called an *agent*, observes the environment, selects and performs actions, and gets rewards or penalties (negative rewards) in return. It must learn by itself the best strategy, called a *policy*, to get the most reward over time.
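
The loop of acting, receiving a reward, and updating a policy can be sketched with a toy agent. This is only an illustration under strong assumptions: the environment has a single state, the reward for each action is fixed, and the epsilon-greedy rule used here is one simple strategy among many.

```python
import random

def run_agent(rewards, n_steps=100, epsilon=0.1, seed=0):
    """Epsilon-greedy agent in a toy one-state environment.

    rewards[a] is the fixed reward for taking action a.
    Returns the learned policy (index of the best action) and the value estimates.
    """
    rng = random.Random(seed)
    n_actions = len(rewards)
    estimates = [0.0] * n_actions   # running average reward per action
    counts = [0] * n_actions

    for step in range(n_steps):
        if step < n_actions:
            action = step                              # try every action once first
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)          # explore: random action
        else:
            action = estimates.index(max(estimates))   # exploit: best action so far
        reward = rewards[action]                       # feedback from the environment
        counts[action] += 1
        # Incremental average: nudge the estimate toward the observed reward.
        estimates[action] += (reward - estimates[action]) / counts[action]

    policy = estimates.index(max(estimates))
    return policy, estimates

policy, estimates = run_agent(rewards=[-1.0, 0.5, 2.0])
print(policy, estimates)
```

Nobody tells the agent which action is correct; it only sees rewards and penalties and settles on the action that pays best.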

### Batch and Online Learning
Whether or not the system can learn incrementally from a stream of incoming data.

#### Batch Learning (Offline Learning)
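The system is trained on all available data at once and cannot learn incrementally. To learn from new data, it must be retrained from scratch on the full dataset, which may take too long, use too many resources, or be impossible if the data is huge.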

#### Online Learning (aka Incremental Learning)
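Great for systems that receive a continuous flow of data. The system is trained incrementally, and its behavior depends on the *learning rate*: if it is too high, the system adapts quickly but also forgets old data quickly; if it is too low, the system reacts slowly to new data.

Online learning systems adapt to change faster than batch systems, but they must be monitored closely: if bad data is fed to the system, its performance gradually declines.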
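
The effect of the learning rate can be seen in a toy incremental learner that tracks a single number. This is a deliberately simplified sketch, not a full online learning algorithm; the data stream and the two learning rates are invented.

```python
def online_estimate(stream, learning_rate):
    """Track a running estimate, updated one observation at a time."""
    estimate = 0.0
    for x in stream:
        # Move the estimate a fraction of the way toward the new observation.
        estimate += learning_rate * (x - estimate)
    return estimate

# Hypothetical stream: the data shifts from 0 to 10 partway through.
stream = [0.0] * 20 + [10.0] * 5

fast = online_estimate(stream, learning_rate=0.9)   # adapts quickly, forgets old data
slow = online_estimate(stream, learning_rate=0.01)  # barely reacts to the shift
print(fast, slow)
```

After the shift, the high learning rate has almost entirely forgotten the old values, while the low one is still dominated by them.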

### Instance Based vs Model Based
Most ML tasks are about making predictions, so this criterion categorizes ML systems by how they *generalize* from training examples to new cases.

#### Instance-based learning
The most trivial way of learning. The system learns the examples "by heart", then generalizes to new cases by using a similarity measure to compare them to the learned examples.
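
K-Nearest Neighbors is a typical instance-based method. Below is a minimal sketch; the two features per email (number of links, number of exclamation marks) and the training examples are invented for illustration.

```python
from collections import Counter

def knn_predict(training_set, new_instance, k=3):
    """Classify new_instance by majority vote among its k nearest training instances."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Rank the memorized examples by similarity (distance) to the new instance.
    neighbors = sorted(
        training_set, key=lambda example: squared_distance(example[0], new_instance)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set of (features, label) pairs.
training_set = [
    ((0, 0), "ham"), ((1, 0), "ham"), ((0, 1), "ham"),
    ((7, 9), "spam"), ((8, 6), "spam"), ((9, 8), "spam"),
]
print(knn_predict(training_set, (8, 8)))
```

There is no training step at all: the "model" is the stored training set plus the similarity measure.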

#### Model-based learning
Build a model from the examples, then use the model to make predictions. This requires *model selection*: choosing the type of model to use.

Suppose you decide to use a linear model. How do you decide which parameter values $\theta_0, \dots, \theta_m$ make the model fit the data best? Use either a *utility (fitness) function* to measure how good the model is, or a *cost function* to measure how bad it is.

Note: For linear regression it is typical to use a cost function that measures the distance between the linear model's predictions ($\vec{\hat{y}}$, the line of best fit) and the true values ($\vec{y}$).
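
One common choice of cost function is the mean squared error (MSE). As a sketch, writing $n$ for the number of training instances (a symbol these notes do not otherwise use) and $\hat{y}^{(i)}$, $y^{(i)}$ for the $i$-th components of $\vec{\hat{y}}$ and $\vec{y}$:

$$
\text{MSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}^{(i)} - y^{(i)} \right)^2
$$

The closer the predictions are to the true values, the smaller the cost.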

Now train your model (uncover the parameters that best fit the data).
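
For a linear model with a single feature, training can be sketched with the closed-form least-squares solution, which minimizes the mean squared error. The data below is synthetic, generated from $y = 1 + 2x$ so that the best parameters are known in advance.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = theta0 + theta1 * x (minimizes the MSE)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = s_xy / s_xx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

def mse(xs, ys, theta0, theta1):
    """Cost function: mean squared distance between predictions and true values."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = fit_line(xs, ys)
print(theta0, theta1, mse(xs, ys, theta0, theta1))
```

With noisy real data the cost would not reach zero; the fitted parameters are simply the ones that make it as small as possible.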

## Main Challenges of Machine Learning
Two things can go wrong: a bad algorithm or bad data.

The following 4 headers are bad data challenges.
### Insufficient Quantity of Training Data
Researchers at Microsoft showed in a famous paper that very different algorithms, including fairly simple ones, performed almost equally well on a complex problem once they were given enough training data.

### Nonrepresentative Training Data
It is crucial that training data be representative of new cases you want to generalize to.

Note: In the book's example about GDP and the happiness of a country, the very poor and very rich countries were left out of the training data, and adding them changed the fitted model. Using representative training data is difficult because of:

**Sampling Noise** - the training sample is too small, so it is nonrepresentative by chance and leads to bad predictions.
Example: Using Mexico, Brazil, and other [poor but happy] countries and Belgium and Luxembourg [rich but unhappy] countries to try to predict Cyprus's happiness level.

**Sampling Bias** - Even very large samples can be unrepresentative if the sampling method is flawed, for example if the sample was not taken randomly.
Famous example of sampling bias: In 1936, the US magazine *Literary Digest* polled a very large but non-random sample (including its own readers) and predicted that the presidential candidate Landon would get 57% of the votes, but Roosevelt went on to win by a landslide. The poll also did not account for people who did not respond to the survey (*nonresponse bias*).

### Poor Quality Data
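Most data scientists spend a lot of time cleaning data: discarding or fixing outliers, and handling missing values by ignoring the attribute or the affected instances, filling in the values (e.g., with the median), or training one model with the feature and one without it.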

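### Irrelevant Features
**Feature Engineering**: Garbage in, garbage out. A system can only learn if the training data contains enough relevant features and not too many irrelevant ones. Without feature engineering, predictions will be poor. It involves:
 - Feature selection: selecting the most useful features
 - Feature extraction: combining existing features into a more useful one
 - Gathering new data to create new features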

The following headers are bad algorithm challenges.
### Overfitting the Training Data
The model performs well on the training data, but poorly on new data. It does not generalize well. The model tries hard to capture the pattern in the training data and predicts poorly on unseen data. We do not want to detect the noise in the data, just generalize. 

Do not feed uninformative attributes into a model, like a country's name. A complex model may pick up a pattern in the names that occurred purely by chance.

Possible solutions to overfitting:
- Select a simpler model (e.g., a linear model rather than a high-degree polynomial model)
- Gather more training data
- Reduce noise in the training data (fix data errors, remove outliers)

**Regularization** - *constraining a model to make it simpler and reduce the risk of overfitting.*
The amount of regularization is controlled by a hyperparameter.
A hyperparameter is a parameter of the learning algorithm, not of the model itself. It is set before training and stays constant during training.
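
One way to picture regularization is a linear fit that penalizes large slopes (the idea behind ridge regression). The sketch below is a single-feature illustration under that assumption; `alpha` is the regularization hyperparameter, and the data is synthetic.

```python
def fit_regularized_line(xs, ys, alpha):
    """Fit y = theta0 + theta1 * x while penalizing large slopes.

    Minimizes sum((theta0 + theta1 * x - y)^2) + alpha * theta1^2.
    alpha = 0 gives plain least squares; a larger alpha forces a flatter line.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = s_xy / (s_xx + alpha)     # the penalty shrinks the slope toward 0
    theta0 = mean_y - theta1 * mean_x  # the intercept is not penalized
    return theta0, theta1

# Synthetic data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

for alpha in (0.0, 5.0, 50.0):
    print(alpha, fit_regularized_line(xs, ys, alpha))
```

`alpha` is chosen before training and never updated by the fit, while `theta0` and `theta1` are the model parameters that training produces.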

### Underfitting the Training Data
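The model is too simple to learn the underlying structure of the data.

Possible fixes:
- Select a more powerful model, with more parameters
- Feed better features to the learning algorithm
- Reduce the constraints on the model (e.g., reduce the regularization hyperparameter)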

## Testing and Validation
The only way to know how well a model generalizes is to try it on new cases.

We could just launch the system into production, or we could split our data into training and testing sets and evaluate before launching!

If the training error is low but the generalization error (estimated by the error on the test set) is high, the model is overfitting.

If the training error and the generalization error are both high, the model is underfitting.
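
Splitting the data into a training set and a test set can be sketched as below. The 20% test ratio and the fixed seed are assumptions made for this example, not values prescribed by these notes.

```python
import random

def train_test_split(instances, test_ratio=0.2, seed=42):
    """Shuffle the instances and hold out a fraction of them as the test set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split on every run
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

train_set, test_set = train_test_split(range(10))
print(len(train_set), len(test_set))
```

The model is trained only on `train_set`; its error on `test_set` then serves as the estimate of the generalization error.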

### Hyperparameter Tuning and Model Selection
*Holdout validation* - Tuning hyperparameters against the test set makes the test error an overly optimistic estimate, so hold out part of the training set as a **validation set** instead. Train several candidate models (for example, with different hyperparameters) on the reduced training set and evaluate them on the validation set. Choose the model that performs best on the validation set, retrain it on the *entire* training set (including the validation set), and evaluate this final model on the test set.
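
The holdout procedure can be sketched generically. Everything here is illustrative: `fit` and `error` are caller-supplied functions, the 25% validation ratio is arbitrary, and the toy "model" is just a mean shrunk toward zero by a hyperparameter `alpha`.

```python
def holdout_select(train_set, candidates, fit, error, val_ratio=0.25):
    """Choose a hyperparameter value with holdout validation.

    fit(data, hp) returns a model; error(model, data) returns a number
    (lower is better). train_set is assumed to be shuffled already.
    """
    n_val = int(len(train_set) * val_ratio)
    validation_set, reduced_train_set = train_set[:n_val], train_set[n_val:]
    # Train one model per candidate on the reduced training set,
    # then score each of them on the validation set.
    scores = {hp: error(fit(reduced_train_set, hp), validation_set) for hp in candidates}
    best_hp = min(scores, key=scores.get)
    # Retrain the winner on the full training set (validation set included).
    final_model = fit(train_set, best_hp)
    return best_hp, final_model

def fit_shrunk_mean(data, alpha):
    """Toy model: the mean of the data, shrunk toward zero by alpha."""
    return sum(data) / (len(data) + alpha)

def squared_error(model, data):
    """Mean squared error of a constant prediction."""
    return sum((model - y) ** 2 for y in data) / len(data)

train_set = [4.0, 6.0, 5.0, 5.0, 4.0, 6.0, 5.0, 5.0]  # made-up target values
best_alpha, model = holdout_select(train_set, [0, 1, 10], fit_shrunk_mean, squared_error)
print(best_alpha, model)
```

The test set plays no part in this function; it is touched only once, after the final model has been retrained.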

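**Cross-validation** can be used in place of a single holdout set to make the evaluation more precise: each model is evaluated once per small validation set after being trained on the rest of the data, and the evaluations are averaged. The drawback is that training time is multiplied by the number of validation sets.

Note: The validation and test sets should be as representative as possible of the data you expect to see in production.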
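
A k-fold version of cross-validation could look like the sketch below. The function names are made up for this example, `fit` and `error` are supplied by the caller, and the data is assumed to be shuffled beforehand.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    indices = list(range(n_instances))
    base, extra = divmod(n_instances, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        validation_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, validation_idx
        start += size

def cross_val_error(data, k, fit, error):
    """Average validation error over k folds."""
    fold_errors = []
    for train_idx, validation_idx in k_fold_splits(len(data), k):
        model = fit([data[i] for i in train_idx])
        fold_errors.append(error(model, [data[i] for i in validation_idx]))
    return sum(fold_errors) / k

folds = list(k_fold_splits(7, 3))
for train_idx, validation_idx in folds:
    print(train_idx, validation_idx)
```

Every instance lands in a validation fold exactly once, so the averaged error depends less on one lucky or unlucky split.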

### Data Mismatch
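When the training data is not perfectly representative of the data seen in production, hold out part of the training set as another validation set, called the *train-dev set*. After training on the rest of the training set, evaluate the model on the train-dev set. If it does well there but poorly on the validation set, the problem is data mismatch; if it does poorly on the train-dev set, the model has overfit the training set.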

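No Free Lunch Theorem: if you make absolutely no assumptions about the data, there is no reason to prefer one model over another. The only way to know for sure which model is best would be to evaluate them all. Since that is not possible in practice, you make reasonable assumptions about the data and evaluate only a few reasonable models.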