# The Machine Learning Landscape

## What is Machine Learning?

Machine Learning is the science (and art) of programming computers so that they can _learn from data_.

Here is a more general definition:

`[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.`

And a more engineering-oriented one:

`A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.`

## Why use Machine Learning?

Machine Learning is best suited to tasks whose logic would be nearly unmaintainable to program by hand (think of all the different formats spam can come in). A hand-coded spam filter would contain a long list of complex rules, whereas an ML approach automatically learns which words and phrases are good predictors.
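
As a rough illustration of "learning which words are good predictors", here is a minimal sketch. It assumes a tiny hand-made labeled dataset (the messages below are made up) and a hypothetical helper `spam_scores` that simply counts how much more often each word appears in spam than in normal mail. Real spam filters use far more robust statistics.

```python
from collections import Counter

# Hypothetical toy dataset of (message, is_spam) pairs. Not real data.
messages = [
    ("win free money now", True),
    ("free credit card offer", True),
    ("meeting notes for monday", False),
    ("lunch on monday", False),
]

def spam_scores(labeled_messages):
    """Score each word by (spam count - ham count); higher means more spam-like."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in labeled_messages:
        target = spam_counts if is_spam else ham_counts
        target.update(text.lower().split())
    vocabulary = set(spam_counts) | set(ham_counts)
    return {word: spam_counts[word] - ham_counts[word] for word in vocabulary}

scores = spam_scores(messages)
top_spam_word = max(scores, key=scores.get)
print(top_spam_word, scores[top_spam_word])
```

On this toy data the word "free" gets the highest score, without anyone writing a rule for it by hand.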

What if the rules change later down the line? With a hand-coded approach, you would potentially have to rewrite large chunks of rules. An ML system, however, can be configured to automatically detect changes in the patterns of a task and start incorporating them into its model.

Machine Learning is also good for problems that are too complex for traditional approaches or that have no known algorithm, such as speech recognition.

### Machine Learning can help humans learn
ML algorithms can be used to help humans learn as well: they may discover easily missed patterns in data. Using ML algorithms to discover patterns in a large dataset is called **Data Mining**.

## Examples of Machine Learning
* image classification - recognizing objects in an image
* semantic segmentation - each pixel in the image is classified to determine the exact location of (possibly multiple) objects in the image
* Natural Language Processing (NLP)
    * automatically classifying articles and text
    * text summarization
    * chatbot creation
* prediction and regression
* speech recognition - processing audio samples, which are potentially long and complex sequences of data
* anomaly detection
    * detecting credit card fraud
* clustering - grouping different entities (data, customers, etc.) into groups
* data visualization - represent complex, high-dimensional data in a clear and insightful diagram
    * dimensionality reduction - reducing the number of attributes in a dataset (see curse of dimensionality)
* recommendation systems
    * association mining
* reinforcement learning
    * building an intelligent robot for a game

## Types of Machine Learning Systems

ML systems are generally categorized based on whether they are trained with human supervision, whether they can learn incrementally on the fly, and whether they work by simply comparing new data points to known data points or instead by detecting patterns in the training data and building a predictive model.

*   Supervised learning - the training set you feed into the model includes the desired solutions (the data is labeled). Typical tasks include:
    * classification - predict which of a set of labels applies to a given instance
    * regression  - predict a numeric value given a set of features
        * some regression algorithms can be used for classification, such as **logistic regression**
    
*   Unsupervised Learning - the training data is unlabeled. The system tries to infer patterns present in the data. Typical unsupervised tasks include:
    * Clustering
    * Anomaly and novelty detection
    * Visualization (outputting a 2D or 3D representation of complex, unlabeled data)
    * Dimensionality Reduction
    * Association Rule Learning
    
*   Semisupervised Learning - labeling data is usually time-consuming and costly. Some algorithms can deal with partially labeled data. Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms. For example:
    * deep belief networks (DBNs) are based on unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of each other. RBMs are initially trained in an unsupervised manner, and then the whole system is fine-tuned with supervised learning techniques
    
*   Reinforcement Learning - very different from supervised/unsupervised learning.
    * the learning system, also called an **agent**, can observe its **environment** and select and perform actions. It gets **rewards** or **penalties** based on the results of those actions. Through this it learns the best strategy, called a **policy**, to get the most reward over time
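
To make the agent / reward / policy vocabulary concrete, here is a minimal sketch of an epsilon-greedy agent on a multi-armed bandit. This is a deliberately simplified setting (there is no state to observe), and the reward probabilities, `epsilon`, and the function name `run_bandit` are all assumptions chosen for illustration. The learned "policy" is simply "pick the action with the highest estimated value".

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent interacting with a multi-armed bandit environment."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [0] * n_arms      # how many times each action was taken
    values = [0.0] * n_arms    # running average reward observed per action
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_arms)                       # explore
        else:
            action = max(range(n_arms), key=values.__getitem__)  # exploit
        # The environment answers with a reward of 1 or 0 (0 plays the role of a penalty).
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
        total_reward += reward
    return values, total_reward

values, total_reward = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(values)), key=values.__getitem__)
print(best_action, [round(v, 2) for v in values])
```

The agent is never told which action is best; it finds out by trying actions and keeping track of the rewards they produce.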

ML systems can also be categorized based on **how** they learn:
*   Batch Learning - trains using the entire dataset at one time
    * it must be trained using all available data. A model is typically trained and then launched into production, where it no longer learns. This is called *offline learning* since it is typically done offline.
    * if you want a batch learning system to know about new data, you need to train a new version of the system from scratch (on the old and new data) to replace the old one.
    * a system using batch learning is incapable of learning incrementally
*   *online learning* - you train the system incrementally by feeding it data instances sequentially:
    * either individually, or in small groups called *mini-batches*.
    * each learning step is fast and cheap, and the system can learn about new data on the fly.
    * online learning algorithms can also be used to train systems on huge datasets that cannot fit into one machine's main memory.
    * **learning rate** determines how fast a model adapts to changing data: a high learning rate makes a model react to new data faster, but it will tend to quickly forget the old data; a low learning rate makes a model learn more slowly, but it will be less sensitive to noise or outliers.
    * when using online learning, one must be careful about bad data, which can gradually decrease the effectiveness of the model. One must constantly watch the input stream.
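
A minimal sketch of online learning, assuming a one-feature linear model updated by stochastic gradient descent. The class name `OnlineLinearRegressor`, the method `learn_one`, and the synthetic stream following y = 2x + 1 are all hypothetical choices for illustration.

```python
class OnlineLinearRegressor:
    """1-D linear model updated one instance at a time with SGD."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weight = 0.0
        self.bias = 0.0

    def predict(self, x):
        return self.weight * x + self.bias

    def learn_one(self, x, y):
        """One fast, cheap update step from a single (x, y) instance."""
        error = self.predict(x) - y
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error

# Hypothetical data stream following y = 2x + 1, replayed 500 times.
stream = [(i / 10, 2 * (i / 10) + 1) for i in range(11)] * 500

model = OnlineLinearRegressor(learning_rate=0.1)
for x, y in stream:
    model.learn_one(x, y)  # the model never needs the whole dataset at once

print(round(model.weight, 3), round(model.bias, 3))
```

Each instance is used once and can then be discarded, which is why this style of training also works for datasets that do not fit in memory. Raising `learning_rate` makes the model chase recent instances more aggressively; lowering it makes updates smoother but slower.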
    
One more way to classify Machine Learning models is by how they generalize. There are two main approaches:
* **instance-based learning** - learning the dataset by heart
    * keeps every observation in the training set
    * compares new observations to the training set via a *measure of similarity*
* **Model Based Learning** - makes a model from the dataset to make predictions
    * the training data is used to build a **model** of the problem
    * to measure how good the model is, you can define a **fitness function** (how good it is) or a **loss function** (how bad it is)
    * e.g. for linear regression, people typically define a cost function that measures the distance between the linear model's predictions and the training examples, with the objective of minimizing this distance
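
The two approaches can be contrasted in a small sketch. It assumes one numeric feature, absolute distance as the similarity measure, and mean squared error as the cost function; the function names and the toy data are illustrative only.

```python
def knn_predict(train, x_new, k=1):
    """Instance-based: average the targets of the k most similar stored instances."""
    # Similarity measure (an assumption here): absolute distance on one feature.
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in neighbors) / k

def fit_line(train):
    """Model-based: fit y = slope * x + intercept by least squares."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in train)
    var = sum((x - mean_x) ** 2 for x, _ in train)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def mse_cost(train, slope, intercept):
    """Cost function: mean squared distance between predictions and targets."""
    return sum((slope * x + intercept - y) ** 2 for x, y in train) / len(train)

train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # toy data, y = 2x
slope, intercept = fit_line(train)
print(knn_predict(train, 2.4, k=1))       # answer copied from the nearest stored instance
print(slope * 2.4 + intercept)            # answer computed from the fitted model
print(mse_cost(train, slope, intercept))  # cost of the fitted model on the training set
```

The instance-based predictor has to keep the whole training set around, while the model-based one only keeps two numbers (`slope` and `intercept`) once training is done.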

## Main Challenges of Machine Learning
The two main things that can go wrong are "bad algorithms" and "bad data".

### Insufficient Quantity of Data
It takes a lot of data for most ML models to work properly. You may need thousands of examples even for simple problems, and millions for complex ones such as image recognition, speech recognition, or NLP. Given enough of it, the data can matter more than the choice of algorithm used to process it.

NOTE: read "the unreasonable effectiveness of data"

### Nonrepresentative training data
Your training data should be representative of the cases you want to generalize to. If a sample is too small, you may have **sampling noise**, which is nonrepresentative data as a result of chance. On the opposite end, even a very large dataset can be nonrepresentative if the sampling method is flawed, which is known as *sampling bias*.

NOTE: *nonresponse bias* is a special type of sampling bias that happens when some observations are never collected (e.g. people asked to answer a poll don't care to respond)

### Poor Data Quality
If your training data is full of errors, outliers, or noise, then it will be harder for a system to detect the underlying patterns. It is often well worth the effort to clean your data.

### Irrelevant Features
Garbage in, garbage out. A system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones.

E.g.: imagine that you are predicting a country's average happiness level from its GDP, and decide to include the country names as a feature. A model may decide that the name of a country, or its first letter, is a factor in the happiness rating (if the names of many happy countries happened to start with the same letter, for example).

To keep irrelevant features out, you use *feature engineering*, which consists of:
* feature selection - select the most useful features
* feature extraction - combining existing features to get more useful ones (e.g. dimensionality reduction)
* creating new features by gathering new data
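
As one simple way to do feature selection, the sketch below ranks features by the absolute Pearson correlation with the target and keeps the top ones. This correlation filter is only one of many selection methods and is an assumption of this example, as are the helper names and the made-up numbers, which loosely echo the happiness example above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (std_x * std_y)

def select_features(features, target, top_k):
    """Keep the top_k feature names most correlated (in absolute value) with target."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical toy data: the values are made up for illustration.
features = {
    "gdp_per_capita": [10.0, 20.0, 30.0, 40.0, 50.0],
    "name_length": [7.0, 5.0, 7.0, 5.0, 6.0],
}
happiness = [5.0, 5.5, 6.1, 6.4, 7.0]

print(select_features(features, happiness, top_k=1))
```

A filter like this only catches linear relationships with the target, so it is a starting point rather than a complete answer.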

### Overfitting the Training Data
Overfitting is when a model performs well on the training data but poorly on new data: it fits the training data too closely and so fails to generalize. It can occur when the training set is too noisy or too small, or when the model is too complex relative to the amount and noisiness of the data.

Constraining a model to make it simpler is called *regularization*. A model's parameters give it *degrees of freedom* that it can use to fit itself to the data. The amount of regularization, and thus the effective degrees of freedom of a model, can be controlled through a *hyperparameter*, which is a parameter of the learning algorithm, not of the model, and is set by a human before training to affect how the model learns.

to fix overfitting:
* simplify the model by selecting one with fewer parameters
* gather more data
* reduce the noise in the training data
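
To see what a regularization hyperparameter does, here is a minimal sketch using ridge (L2) regularization on a one-feature linear model. Ridge is just one regularization technique, picked here because it has a simple closed form; the name `alpha` for the hyperparameter and the toy data points are assumptions of this example.

```python
def fit_ridge_line(points, alpha):
    """Fit y = slope * x + intercept while penalizing alpha * slope**2."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    slope = cov / (var + alpha)  # alpha = 0 gives ordinary least squares
    return slope, mean_y - slope * mean_x

points = [(0.0, 0.1), (1.0, 2.3), (2.0, 3.8), (3.0, 6.4)]  # toy noisy data

for alpha in (0.0, 1.0, 10.0):
    slope, intercept = fit_ridge_line(points, alpha)
    print(alpha, round(slope, 3), round(intercept, 3))
```

As `alpha` grows, the slope is pulled toward zero: the model has less freedom to follow the training points, which trades some training accuracy for a simpler model. Note that `alpha` is never learned from the data by `fit_ridge_line`; it has to be chosen from outside.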

### Underfitting the Training Data
**Underfitting** occurs when your model is too simple to learn the underlying structure of the data.

to fix underfitting:
* select a more powerful (complex) model, with more parameters
* feed better features to the learning algorithm (feature engineering)
* Reduce the constraints on the model (reduce the regularization hyperparameter)
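
A small sketch of underfitting and of the "better features" fix, assuming toy data with a curved structure (y = x²). A straight line in `x` cannot capture it, while the same linear model fed the engineered feature `x ** 2` can. The helper names are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def mse(xs, ys, slope, intercept):
    """Mean squared error of the line on the given data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]  # toy data with a curved structure

# Too simple: a straight line in x underfits, even on the training data.
slope, intercept = fit_line(xs, ys)
underfit_error = mse(xs, ys, slope, intercept)

# Better feature: the same kind of model, fed x**2 instead of x.
squared = [x ** 2 for x in xs]
slope_sq, intercept_sq = fit_line(squared, ys)
better_error = mse(squared, ys, slope_sq, intercept_sq)

print(underfit_error, better_error)
```

The telltale sign of underfitting is visible here: the error is already high on the training data itself, not just on new data.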

### Data Mismatch
In some cases it is easy to get a large dataset, but the **data probably won't be perfectly representative of the data used in production**. The problem is that the training data and the production data may look different because they come from two different sources.

E.g.: when developing a flower identification app, pictures from the web may look different from ones taken with a phone camera (which is what the app is intended for). Suppose you have millions of web photos of flowers, but only about 10,000 representative pictures (taken with the app). **The validation and test sets must be as representative as possible of the data you expect to use in production**, so the representative app pictures should go primarily into the validation and test sets. Now suppose that your model performs poorly on the validation set: you would usually not be able to tell whether this is due to overfitting or to data mismatch.

One solution is to set aside an additional **train dev set**, held out from the training set. You then **train the model on the rest of the training set** and evaluate its performance on the train dev set. If it performs well there, you can be confident that it is not overfitting the training data, so poor performance on the validation set most likely comes from data mismatch. Conversely, if it does poorly on the train dev set, then the problem is probably overfitting.

to solve data mismatch:
* you can attempt to preprocess the data to make it look more like it was taken from the intended source of data
* first rule out overfitting using the train dev set, since poor validation performance may simply be another case of it
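
The train dev reasoning can be written down as a rough rule of thumb. This is only a sketch: the function name `diagnose` and the `tolerance` threshold are arbitrary assumptions, and in practice "performs well" is a judgment call rather than a fixed cutoff.

```python
def diagnose(train_error, train_dev_error, validation_error, tolerance=0.05):
    """Guess the likely problem from error rates on the three sets.

    `tolerance` is an assumed, arbitrary gap above which a jump in error
    between two sets is considered significant.
    """
    if train_dev_error - train_error > tolerance:
        # Same data source as training, yet much worse: the model memorized the training set.
        return "overfitting"
    if validation_error - train_dev_error > tolerance:
        # Fine on held-out training-like data, worse on production-like data.
        return "data mismatch"
    return "no obvious problem"

print(diagnose(train_error=0.02, train_dev_error=0.15, validation_error=0.16))
print(diagnose(train_error=0.02, train_dev_error=0.03, validation_error=0.20))
```

The first call describes a model that already struggles on the train dev set (overfitting); the second describes one that only struggles once the data source changes (data mismatch).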

**No Free Lunch Theorem**: A model is a simplified version of the observations. The simplification is meant to discard the superfluous details that are unlikely to generalize to new instances. To decide what data to discard, you must make *assumptions*. The NFL theorem demonstrates that if you make absolutely no assumptions about the data, then there is no reason to prefer one model over another. For some datasets the best model is a linear model, while for others it is a neural network. <u>There is no model that is guaranteed in advance to work better</u>.

## Testing and Validating data
You can split your data into a **training set** and a **testing set**. You train your model on the training set and evaluate it on the testing set, which it hasn't seen before. The error rate on new cases is called the **generalization error** or **out-of-sample** error. If your training error is low but your generalization error is high, then <u>your model is probably overfitting the training set</u>.
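
A minimal sketch of a train/test split and of estimating the generalization error. The 80/20 ratio, the seed, the synthetic data, and the one-parameter threshold classifier are all assumptions made for illustration.

```python
import random

def train_test_split(instances, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and split it into (train_set, test_set)."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def train_threshold(train_set):
    """'Train' a one-parameter classifier that predicts True when x >= threshold."""
    largest_negative = max(x for x, label in train_set if not label)
    smallest_positive = min(x for x, label in train_set if label)
    return (largest_negative + smallest_positive) / 2

def error_rate(threshold, instances):
    """Fraction of instances the threshold classifier gets wrong."""
    wrong = sum(1 for x, label in instances if (x >= threshold) != label)
    return wrong / len(instances)

# Hypothetical labeled data: the label is True when x is at least 50.
data = [(x, x >= 50) for x in range(100)]
train_set, test_set = train_test_split(data)

threshold = train_threshold(train_set)
training_error = error_rate(threshold, train_set)
generalization_error = error_rate(threshold, test_set)  # estimated on unseen data
print(training_error, generalization_error)
```

The important point is that `test_set` plays no part in `train_threshold`, so the error measured on it is an honest estimate of performance on new cases.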

## Hyperparameter Tuning and Model Selection 
It's simple to evaluate a single Machine Learning model (with training and testing sets), but how do you compare different models? One option is to train each of them and compare their performance on the test set.

Now suppose you want to find the best hyperparameters for a particular model. Be careful when testing many different hyperparameter values against the same test set: if you measure the generalization error multiple times on a **particular** test set, and adapt the hyperparameters to produce the best model for that **particular** set, then the model is unlikely to perform as well on new data. You chose the best hyperparameters for that set of data, but in doing so, you *indirectly* let the model train on it.

One method of solving this is **holdout validation**, where you hold out a part of the training set to evaluate several candidate models and select the best one. This new held-out set is called the **validation set**, also known as the **development set**. Holdout validation goes as follows:
1. you train multiple models, with different hyperparameters, on the reduced training set (the training set minus the validation set)
2. you then test each model on the validation set, and select the model that performs best on it
3. next, you get the final model by retraining the best model on the full training set (reduced training set + validation set)
4. finally, you evaluate this final model on the test set to get an estimate of the generalization error
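
The four steps above can be sketched as follows. The example assumes a one-feature ridge model whose hyperparameter is called `alpha`, synthetic noisy data around y = 3x + 2, and arbitrary split sizes; none of these come from the notes and any model family would do.

```python
import random

def fit_ridge_line(points, alpha):
    """Fit y = slope * x + intercept while penalizing alpha * slope**2."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    slope = cov / (var + alpha)
    return slope, mean_y - slope * mean_x

def mse(points, model):
    """Mean squared error of a (slope, intercept) model on the given points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)

rng = random.Random(0)
# Hypothetical noisy data around y = 3x + 2.
data = [(i / 10, 3 * (i / 10) + 2 + rng.gauss(0, 0.5)) for i in range(60)]
rng.shuffle(data)
test_set, validation_set, reduced_train = data[:12], data[12:24], data[24:]

candidate_alphas = [0.0, 0.1, 1.0, 10.0, 100.0]

# Steps 1-2: train each candidate on the reduced training set, score it on the validation set.
validation_errors = {
    alpha: mse(validation_set, fit_ridge_line(reduced_train, alpha))
    for alpha in candidate_alphas
}
best_alpha = min(validation_errors, key=validation_errors.get)

# Step 3: retrain the winner on the full training set (reduced + validation).
final_model = fit_ridge_line(reduced_train + validation_set, best_alpha)

# Step 4: estimate the generalization error once, on the untouched test set.
test_error = mse(test_set, final_model)
print(best_alpha, test_error)
```

The test set is touched exactly once, at the very end, so the hyperparameter choice cannot leak information from it.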

this generally works well, but two problems can occur:
* if the **validation set is too small** then the model evaluations will be imprecise and you may select a suboptimal model
* if the **validation set is too large** then the remaining training set will be much smaller than the full training set. This is not ideal, since the final model is trained on the full training set, so comparing candidate models trained on a much smaller set **may not be representative** ("it would be like selecting the fastest sprinter to run a marathon").

An improvement (a way to get reliable validation without making the candidate training set too small) is to perform repeated **cross validation**. In cross validation you use many small validation sets: each model is evaluated once per validation set after it is trained on the rest of the data (portions of the training set take turns validating the model). By **averaging all evaluations of a model, you get a more accurate measure of its performance**.
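
A minimal sketch of k-fold cross validation, assuming the data has already been shuffled. The helper names, the `fit` / `score` callback signatures, and the trivial "predict the training mean" model are all hypothetical choices to keep the example short.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_instances))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_val_score(data, k, fit, score):
    """Average the validation score of a model over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), k):
        model = fit([data[i] for i in train_idx])
        scores.append(score(model, [data[i] for i in val_idx]))
    return sum(scores) / len(scores)

def fit_mean(train_values):
    """Toy 'model': always predict the mean of the training targets."""
    return sum(train_values) / len(train_values)

def mean_squared_error(prediction, validation_values):
    return sum((y - prediction) ** 2 for y in validation_values) / len(validation_values)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # toy targets, assumed already shuffled
average_error = cross_val_score(data, k=3, fit=fit_mean, score=mean_squared_error)
print(average_error)
```

Every instance is used for validation exactly once and for training in all the other folds, at the cost of training the model `k` times.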