# 1 The Machine Learning Landscape

Machine Learning (ML) is making a computer accomplish something without explicitly programming it. Instead of giving the program inputs and logic to get outputs, we give it inputs and the matching outputs (the data) and let it work out the logic.
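As a minimal sketch of that contrast, the snippet below first writes a rule by hand and then recovers the same rule from example pairs alone. The temperature conversion, the function names, and the sample values are illustrative assumptions, and ordinary least squares stands in for "learning" here.

```python
def fahrenheit_rule(celsius):
    """Traditional programming: a human writes the logic."""
    return 1.8 * celsius + 32.0


def learn_line(xs, ys):
    """Learn slope and intercept from (input, output) pairs by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Inputs and the outputs observed for them (the "data"); values are made up.
celsius = [-40.0, 0.0, 10.0, 25.0, 37.0, 100.0]
fahrenheit = [fahrenheit_rule(c) for c in celsius]

# The program recovers the logic instead of being given it.
slope, intercept = learn_line(celsius, fahrenheit)
print(f"learned rule: F = {slope:.2f} * C + {intercept:.2f}")
```

Real data is noisy, so the learned rule is usually an approximation rather than an exact recovery.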

> **Remark:** Before starting any machine learning project, the most important questions to ask are:
> - Can we do this without ML?
> - What is the business objective, and how do we measure performance?
> - What would a performance improvement look like?


## 1.1 Types of ML

We can roughly divide ML methods into categories using three criteria:
- Supervised and Unsupervised Learning
- Batch and Online Learning
- Instance or Model Based Learning


### Supervised and Unsupervised Learning

- **Supervised Learning**: Classification, regression.
- **Semi-supervised Learning**: 
- **Unsupervised Learning**: Clustering, density estimation, dimensionality reduction.
- **Reinforcement Learning**:




During dimensionality reduction, we can combine highly correlated features into a single one, which is called feature extraction. For example, the `age` and `mileage` of a car are strongly correlated, so we can merge them into one new feature called `wear_and_tear`.
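A minimal sketch of this idea, assuming made-up numbers for seven hypothetical cars: both features are standardized and then summed. For two standardized, positively correlated features, this sum (scaled by 1/√2) is the direction PCA would pick as its first principal component, so it is one reasonable way to build the combined feature, not the only one.

```python
import math
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (sample) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical cars: age in years and mileage in km (made-up numbers).
age = [1, 2, 3, 5, 8, 10, 12]
mileage = [12_000, 30_000, 41_000, 70_000, 115_000, 140_000, 180_000]

z_age = standardize(age)
z_mileage = standardize(mileage)

# Pearson correlation of the two features.
n = len(age)
correlation = sum(a * m for a, m in zip(z_age, z_mileage)) / (n - 1)

# One combined feature replaces the two correlated ones.
wear_and_tear = [(a + m) / math.sqrt(2) for a, m in zip(z_age, z_mileage)]

print(f"correlation(age, mileage) = {correlation:.3f}")
print([round(w, 2) for w in wear_and_tear])
```

Standardizing first matters: without it, the feature with the larger numeric range (here the mileage) would dominate the combined value.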

### Batch and Online Learning

**Batch learning (offline learning)** trains on the whole dataset at once, while **online learning** processes data instances as they arrive and adjusts the model incrementally. Online learning algorithms can also be used to train on datasets too large to fit in memory, which is called out-of-core learning. Since this is usually done offline, incremental learning would be a better name for it.
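The incremental update can be sketched as follows. This is a toy setup under stated assumptions: the generator `stream_batches` is a hypothetical stand-in for a live stream or a file read chunk by chunk, the hidden rule `y = 3x + 0.5` and the learning rate are made up, and the model is a plain linear one updated by one gradient step per mini-batch.

```python
import random


def stream_batches(n_batches, batch_size, seed=0):
    """Yield mini-batches one at a time, as if they arrived from a stream
    or were read chunk by chunk from a file too large for memory."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            x = rng.uniform(-1.0, 1.0)
            y = 3.0 * x + 0.5 + rng.gauss(0.0, 0.05)  # hidden rule plus noise
            batch.append((x, y))
        yield batch


def online_fit(batches, learning_rate=0.1):
    """Update a linear model y = w * x + b with one gradient step per batch."""
    w, b = 0.0, 0.0
    for batch in batches:
        grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        # The batch can be discarded now; only w and b are kept.
    return w, b


w, b = online_fit(stream_batches(n_batches=2000, batch_size=16))
print(f"w = {w:.2f}, b = {b:.2f}")
```

The learning rate controls how quickly the model adapts to new data: a high value follows changes fast but forgets old data quickly, a low value does the opposite.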


### Instance or Model Based Learning

Another important distinction is whether a learning algorithm is instance-based or model-based. An instance-based approach keeps the training data and compares each new instance with it through some similarity metric to decide the result. A model-based approach learns patterns from the data and constructs a model; the decision is then made through the model.
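The two approaches can be contrasted on the same toy data. In this sketch the training pairs are made up, a 1-nearest-neighbor lookup stands in for the instance-based approach, and a least-squares line stands in for the model-based one; both choices are assumptions for illustration.

```python
def nearest_neighbor_predict(train, x_new):
    """Instance-based: keep the training set and copy the target of the
    most similar stored instance (similarity = absolute distance in x)."""
    _, closest_y = min(train, key=lambda pair: abs(pair[0] - x_new))
    return closest_y


def fit_line(train):
    """Model-based: summarize the training set as a slope and an intercept."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    return slope, mean_y - slope * mean_x


# Made-up training data that roughly follows y = 2x.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
x_new = 3.4

# Instance-based: the answer comes straight from a stored example.
knn_prediction = nearest_neighbor_predict(train, x_new)

# Model-based: the answer comes from the fitted parameters only.
slope, intercept = fit_line(train)
model_prediction = slope * x_new + intercept

print(f"instance-based: {knn_prediction:.2f}")
print(f"model-based:    {model_prediction:.2f}")
```

Note that the instance-based predictor needs the whole training set at prediction time, while the model-based one only needs `slope` and `intercept`.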


## 1.2 Main Challenges of ML

Humans are remarkably good at pattern recognition, whereas machines require lots of data. The main problems with learning systems are therefore mostly related to data. The main issues are:

- Insufficient quantity of data
- Nonrepresentative data (sampling bias)
- Poor-quality data (cleaning it makes up the bulk of the data engineering and preprocessing work)
- Irrelevant features
- Overfitting or underfitting the model

Deep learning models in particular require gigantic amounts of data. This is expected, since they also automate representation learning (slightly off topic, but worth a reminder).

> **Remark:** *The Unreasonable effectiveness of data* [[1]]() is a high-level paper arguing that more data often improves performance metrics more than clever new algorithms do. It was revisited in 2017 for deep learning methods by the paper *Revisiting unreasonable effectiveness of data in deep learning era* [[2]](), released by Google. That paper uses an enormous internal (not public) dataset called JFT. Its main findings are as follows:
> - **a)** Large data improves [representation learning](https://en.wikipedia.org/wiki/Feature_learning)
> - **b)** Performance gain is logarithmic in the amount of training data
> - **c)** Model capacity needs to be adjusted to the size of the data
> - **d)** Even with long-tail data, performance is not severely affected

## 1.3 Testing and Validating

This is the most important step for measuring the performance of an ML approach. We test the model on data it has never seen before and check how correct the results are. This is called holdout testing. However, in addition to the model parameters, we also need to adjust the hyper-parameters that are used to choose the model itself. For this step, cross-validation is used. In short, we divide the training set into k parts, train the model on k-1 of them, and evaluate it on the part that was left out. Repeating this process k times, so that each part is left out once, and averaging the scores gives an estimate for one hyper-parameter setting, which lets us compare settings without touching the test set.
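The procedure can be sketched in plain Python. Everything specific here is an assumption for illustration: the noisy `y = x^2` data is made up, a k-nearest-neighbors regressor is the model, its `n_neighbors` is the hyper-parameter being tuned, and the helper names are hypothetical.

```python
import random
import statistics


def knn_regress(train, x_new, n_neighbors):
    """Predict y as the mean target of the n_neighbors closest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x_new))[:n_neighbors]
    return statistics.fmean([y for _, y in nearest])


def k_fold_mse(data, n_neighbors, k=5):
    """Average validation MSE over k folds for one hyper-parameter value."""
    folds = [data[i::k] for i in range(k)]  # k disjoint parts
    scores = []
    for i, held_out in enumerate(folds):
        # Train on the other k-1 parts, validate on the part left out.
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        errors = [(knn_regress(train, x, n_neighbors) - y) ** 2
                  for x, y in held_out]
        scores.append(statistics.fmean(errors))
    return statistics.fmean(scores)


# Made-up noisy data from y = x^2, shuffled once with a fixed seed.
rng = random.Random(42)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0.0, 0.3)) for x in range(-30, 31)]
rng.shuffle(data)

# Hold out a final test set that cross-validation never sees.
test_set, train_set = data[:12], data[12:]

# Try each candidate hyper-parameter value and keep the best CV score.
candidates = [1, 3, 5, 9, 15]
cv_scores = {n: k_fold_mse(train_set, n) for n in candidates}
best_n = min(cv_scores, key=cv_scores.get)

# Only the chosen setting is evaluated on the held-out test set.
test_mse = statistics.fmean(
    [(knn_regress(train_set, x, best_n) - y) ** 2 for x, y in test_set]
)
print(cv_scores)
print(f"best n_neighbors = {best_n}, test MSE = {test_mse:.3f}")
```

The test set is used exactly once, after the hyper-parameter has been chosen, so the final score is not biased by the selection process.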

<img src="fig/chapter1/validation.svg" width="750">

## 1.4 References

[[1]](#alon2009unreasonable) Halevy, Alon, Peter Norvig, and Fernando Pereira. *"The unreasonable effectiveness of data."* IEEE Intelligent Systems 24.2 (2009): 8-12.

[[2]](#alon2017revisiting) Sun, Chen, et al. *"Revisiting unreasonable effectiveness of data in deep learning era."* Proceedings of the IEEE international conference on computer vision. 2017.
