### What is Machine Learning (ML)?
* Teaching computers to perform a task by learning from past experience (data).
* A computer program is said to learn from experience E with respect to some task T and performance measure P if its performance on T, as measured by P, improves with experience E.

### Applications of ML
* Email spam filtering
* Text and voice recognition
* Web search engines (ranking)
* Self-driving cars
* Photo tagging

<h1><center>Types of ML</center></h1>

## Supervised Learning
- Labeled data
- Direct feedback
- Predict outcome/future
- Learn a model from labeled training data that allows us to make predictions about unseen or future data.
![](images/supervised.PNG)

### Classification
* Discrete labels (spam vs. non-spam, benign vs. malignant)
* Examples:
    - Email spam filtering
    - Handwritten digit recognition

### Regression
* Predict a continuous value.
* We are given predictor (explanatory) variables and a continuous response (target/outcome) variable, and we try to find the relationship between them.
* Examples:
    - Predicting students' SAT scores (from time spent studying, mock exam scores)
    - Finding the price of a house given a set of features (predictors)
* In linear regression, we want to minimize the distance (average squared distance) between the sample points and the fitted line; we can then use the line's intercept and slope to predict future data.
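
A minimal sketch of the least-squares fit described above, using NumPy's `polyfit`. The study-hours and score values are made up for illustration and are not from any real dataset.

```python
import numpy as np

# Hypothetical data: hours spent studying and the resulting exam score.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 54.0, 56.0, 58.0, 60.0])

# Fit a degree-1 polynomial (a line) by minimizing the squared error.
slope, intercept = np.polyfit(hours, scores, deg=1)

# Average squared distance between the sample points and the fitted line.
mse = np.mean((scores - (slope * hours + intercept)) ** 2)

# Use the slope and intercept to predict the score for an unseen input.
predicted = slope * 6.0 + intercept
print(slope, intercept, mse, predicted)
```

Because the toy data lies exactly on a line, the fitted error is essentially zero; real data would leave a nonzero residual.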

### Common Algorithms

##### K-nearest neighbors

##### Linear Regression

##### Logistic Regression
* Can be used for classification.
* It can output a value that corresponds to the probability of belonging to a given class.
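
As a rough illustration of how a logistic regression model turns features into a class probability, the sketch below applies the sigmoid function to a weighted sum. The weights, bias, and feature values are hypothetical; in practice they would be learned from training data.

```python
import math

def predict_proba(features, weights, bias):
    """Return the estimated probability of the positive class."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)

# Hypothetical, hand-picked parameters (not learned from data).
weights = [1.5, -0.5]
bias = -1.0

p = predict_proba([2.0, 1.0], weights, bias)
label = int(p >= 0.5)  # classify as positive when the probability is at least 0.5
print(p, label)
```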

##### Support vector machine (SVM)

##### Decision tree and random forest

##### Neural networks

## Unsupervised Learning
- No labels
- No feedback
- Find hidden structure in data

### Clustering
- Organize a pile of information into meaningful subgroups (clusters).
- Members of each group share a certain degree of similarity.
- Example
    - Discovering customer groups
![](images/clustring.PNG)
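
To make the idea of grouping by similarity concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in NumPy. The points and the choice of initial centroids are assumptions made for this example; the sketch also assumes no cluster ever becomes empty.

```python
import numpy as np

def k_means(points, centroids, n_iter=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # index of the nearest centroid
        # Move each centroid to the mean of its assigned points.
        # Assumption: every cluster keeps at least one point.
        centroids = np.array(
            [points[labels == j].mean(axis=0) for j in range(len(centroids))]
        )
    return labels, centroids

# Hypothetical 2-D data with two visibly separate groups.
points = np.array(
    [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
)
labels, centroids = k_means(points, centroids=points[[0, 3]].copy())
print(labels, centroids)
```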

### Common clustering algorithms

##### K-means

##### Hierarchical cluster analysis (HCA)
* Subdivides each group into smaller groups

##### Expectation maximization

### Dimensionality reduction
* Data is often high-dimensional, which costs storage and CPU time.
* We want to compress the data onto a lower-dimensional subspace while retaining most of the relevant information.
![](images/dimenstionality_reduction.PNG)
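
One way to sketch this compression is PCA computed through a singular value decomposition. The data below is invented: five 2-D points lying almost on a line, which are projected onto a single dimension.

```python
import numpy as np

# Hypothetical 2-D data whose two features are almost perfectly correlated.
X = np.array([[2.0, 1.9], [4.0, 4.1], [6.0, 6.0], [8.0, 8.1], [10.0, 9.9]])

# Center the data, then decompose it; rows of Vt are the principal directions.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Fraction of the total variance captured by each principal component.
explained = S**2 / np.sum(S**2)

# Project onto the first principal component: 2 dimensions become 1.
Z = X_centered @ Vt[:1].T
print(Z.shape, explained)
```

Here almost all of the variance lies along the first component, so dropping the second dimension loses very little information.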

### Visualization and dimensionality reduction algorithms

##### Principal component analysis (PCA)

##### Kernel PCA

##### Locally linear embedding (LLE)

##### t-distributed stochastic neighbor embedding (t-SNE)

### Anomaly detection
* Detecting abnormal and unusual cases
* Example: detecting unusual credit card transactions
* Automatically removing outliers from a dataset
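
A very simple way to flag unusual values is a z-score rule, sketched below. The transaction amounts and the threshold of 2 standard deviations are assumptions for this example; real systems typically use more robust methods.

```python
import numpy as np

# Hypothetical credit card transaction amounts; the last one looks unusual.
amounts = np.array([20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 23.0, 20.0, 22.0, 500.0])

# How many standard deviations each amount is from the mean.
z_scores = (amounts - amounts.mean()) / amounts.std()

threshold = 2.0  # assumed cut-off, chosen for this small sample
is_outlier = np.abs(z_scores) > threshold

outliers = amounts[is_outlier]   # flagged as anomalies
cleaned = amounts[~is_outlier]   # dataset with the outliers removed
print(outliers, cleaned)
```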

### Association rule mining
* Discover interesting relationships among attributes.

### Association rule learning algorithms

##### Apriori
##### Eclat

## Semisupervised Learning
* Partially labeled data.
* Algorithms
    - Deep belief networks
    

## Reinforcement Learning
* An agent observes the environment, selects and performs an action, and gets a reward in return. It learns by itself to get the most reward over time.
- Decision process
- Reward system
- Learns a series of actions
- Improves performance via interaction with the environment
- The goal is to maximize the reward given by the reward function
- Example
    - Chess engine (the reward is winning or losing)
![](images/reinforcement.PNG)
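
The observe/act/reward loop can be sketched with tabular Q-learning, which is one reinforcement learning algorithm among many and is chosen here only for illustration. The corridor environment, the reward of 1 at the goal, and the values of `ALPHA` and `GAMMA` are all assumptions of this example.

```python
import random

N_STATES = 5           # corridor with states 0..4; state 4 is the goal
ACTIONS = [-1, +1]     # index 0 moves left, index 1 moves right
ALPHA, GAMMA = 0.5, 0.9  # assumed learning rate and discount factor

def step(state, action):
    """Environment: move within the corridor, reward 1 on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

rng = random.Random(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]  # estimated value of each (state, action)

for _ in range(500):  # episodes
    state = 0
    while state != N_STATES - 1:
        a = rng.randrange(2)  # explore with purely random actions
        next_state, reward = step(state, ACTIONS[a])
        # Move the estimate toward reward + discounted best future value.
        target = reward + GAMMA * max(q[next_state])
        q[state][a] += ALPHA * (target - q[state][a])
        state = next_state

# Greedy policy: in each non-goal state, pick the action with the higher value.
policy = ["left" if q[s][0] > q[s][1] else "right" for s in range(N_STATES - 1)]
print(policy)
```

The agent is never told that moving right is correct; it infers this from the rewards it collects while interacting with the environment.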

### ML Process
![](images/process.PNG)

* Selected features should be on the same scale. Transform them to the range [0, 1], or standardize them to mean 0 and unit variance.
* Some features are highly correlated and therefore redundant to a certain degree. For example, a car's mileage is correlated with its age.
* Randomly divide the dataset into training and test data.
* Train different models and compare their performance.

* **Hyperparameters** are parameters that are not learned from the data; they are the knobs of a model that we can turn to improve its performance.

* Applying ML techniques to dig into large amounts of data can help discover patterns that were not immediately apparent. This is called **data mining**.

* Sequence mining predicts the next events in a sequence, e.g. in click streams.
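
The feature scaling and random train/test split steps from the process above can be sketched in plain NumPy as follows. The feature matrix, the 20% test fraction, and the seed are assumptions for this example.

```python
import numpy as np

# Hypothetical feature matrix: two features on very different scales.
X = np.array(
    [[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0], [5.0, 600.0]]
)

# Min-max scaling: every feature ends up in the range [0, 1].
min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: every feature ends up with mean 0 and unit variance.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Random split: shuffle the row indices, then hold out 20% for testing.
rng = np.random.default_rng(seed=0)
indices = rng.permutation(len(X))
n_test = int(0.2 * len(X))
test_idx, train_idx = indices[:n_test], indices[n_test:]
X_train, X_test = X[train_idx], X[test_idx]
print(min_max, standardized, X_train.shape, X_test.shape)
```

In a real project the scaling statistics would normally be computed on the training set only and then applied to the test set.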

#### Supervised vs unsupervised
* Labeled vs. unlabeled data
* Output is known vs. not known
* Accuracy can be measured against the known labels vs. fewer ways to measure accuracy
* Controlled environment vs. uncontrolled

----------------

### Batch Learning (Offline learning)
* Incapable of learning incrementally; it must be trained on all available data.
* The system is trained and then launched into production, after which it does not learn anymore.
* Requires a lot of computing resources

### Online Learning
* Train the system incrementally by feeding it data instances sequentially.
* Learns from new data on the fly
* Can work with restricted resources
* Useful when the dataset is huge and cannot be loaded into memory all at once. We train the model on one part of the data at a time.
* If the learning rate is high, the system adapts to new data quickly but also forgets old data quickly.
* With a low learning rate, the system is less sensitive to noise in new data (unrepresentative data points), but it also learns more slowly.
* We have to monitor for performance degradation caused by bad data.
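
The effect of the learning rate can be seen in a toy online learner that keeps a running estimate of a stream's level and updates it one instance at a time. The stream values and the two learning rates are assumptions for this example.

```python
def online_estimate(stream, learning_rate):
    """Update a running estimate one data instance at a time."""
    estimate = 0.0
    for x in stream:
        # Move the estimate a fraction of the way toward the new instance.
        estimate += learning_rate * (x - estimate)
    return estimate

# Hypothetical stream: a long run of old data around 1, then new data around 5.
stream = [1.0] * 50 + [5.0] * 5

fast = online_estimate(stream, learning_rate=0.9)   # adapts quickly, forgets old data
slow = online_estimate(stream, learning_rate=0.05)  # changes slowly, remembers old data
print(fast, slow)
```

With the high learning rate the estimate ends close to 5, the level of the newest data; with the low one it is still much closer to the old level of 1.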


### Instance-based Learning
* Example: flag as spam all emails that are identical or similar to known spam emails.
* This requires a measure of similarity between two emails.
* The system learns the examples by heart and generalizes to new cases using the similarity measure.
![](images/instance.PNG)
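
A minimal sketch of the spam example: the "model" is just the stored spam emails plus a similarity measure. Jaccard similarity over word sets, the example emails, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def similarity(email_a, email_b):
    """Jaccard similarity: shared words divided by all distinct words."""
    words_a = set(email_a.lower().split())
    words_b = set(email_b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

# The examples the system has "learned by heart" (hypothetical).
known_spam = ["win a free prize now", "cheap loans approved instantly"]

def is_spam(email, threshold=0.5):
    """Flag an email if it is similar enough to any known spam email."""
    return any(similarity(email, spam) >= threshold for spam in known_spam)

print(is_spam("win a free prize today"))   # shares 4 of 6 distinct words with known spam
print(is_spam("meeting moved to friday"))  # shares no words with known spam
```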

### Model-based Learning
* Build a model from the examples, then use the model to predict outcomes.

### Challenges

#### Insufficient quantity of training data

#### Nonrepresentative training data
* Training data has to be representative of the new cases we want to generalize to.
* If the sample is too small, there will be sampling noise. Even a very large sample can be nonrepresentative if the sampling method is flawed; this is called sampling bias.

#### Poor Quality Data
* Data with errors, outliers, and noise
* Fix or discard outliers
* For missing values: fill them in, ignore the attribute, or train one model with the feature and one without it.
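
Filling in missing values can be sketched as below, using the median of the observed values. The choice of the median and the example ages are assumptions; the mean or a constant are other common choices.

```python
import numpy as np

# Hypothetical feature column with missing entries encoded as NaN.
ages = np.array([25.0, np.nan, 31.0, 40.0, np.nan])

# Median of the observed (non-missing) values only.
median = np.nanmedian(ages)

# Replace each missing entry with the median, keep the rest unchanged.
filled = np.where(np.isnan(ages), median, ages)
print(median, filled)
```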

#### Irrelevant features
* Feature selection: select the most useful features.
* Feature extraction: combine existing features to produce a more useful one (dimensionality reduction can help).
* Create new features by gathering new data.

#### Overfitting the training data
* The model performs well on the training data but does not generalize well.
* Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. Possible fixes:
    - Simplify the model by choosing one with fewer parameters, e.g. a linear model rather than a high-degree polynomial
    - Gather more training data
    - Reduce noise in the training data (fix data errors, remove outliers)

* Constraining a model to make it simpler and reduce the risk of overfitting is called **regularization**.
    - $\theta_0$ and $\theta_1$ are the 2 parameters of a linear model, which gives the model 2 degrees of freedom. If we force $\theta_1 = 0$, the model has only 1 degree of freedom and it is harder to fit the data well: we can only move the line up or down, so it ends up around the mean.
    - We want to find the right balance between fitting the data perfectly and keeping the model simple enough to generalize well.
    - The amount of regularization can be controlled by a hyperparameter.
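
The $\theta_0$, $\theta_1$ argument can be checked numerically. The sketch below uses an L2 (ridge) penalty on the slope only, which is one common form of regularization and an assumption of this example; `alpha` is the regularization hyperparameter and the data is made up.

```python
import numpy as np

def fit_line(x, y, alpha):
    """Least-squares line fit with an L2 penalty of strength alpha on the slope."""
    X = np.column_stack([np.ones_like(x), x])  # columns: intercept, slope
    penalty = alpha * np.diag([0.0, 1.0])      # penalize theta_1 only
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# Hypothetical data lying exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

theta_free = fit_line(x, y, alpha=0.0)         # no regularization: 2 degrees of freedom
theta_constrained = fit_line(x, y, alpha=1e6)  # slope forced toward 0
print(theta_free, theta_constrained, y.mean())
```

With a very large `alpha` the slope is pushed to almost 0 and the intercept settles near the mean of `y`, which matches the "move the line up or down" picture above.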

#### Underfitting the training data
* The model is too simple to learn the underlying structure of the data.
* To overcome it:
    - Select a more powerful model
    - Feed better features to the learning algorithm
    - Reduce the constraints on the model (e.g. reduce the regularization hyperparameter)

* A sequence of data processing steps is called a **data pipeline**.

### Distance Measure (norms)

* A norm measures the total size or length of a vector (or matrix) in a vector space.

#### L0 Norm
* Total number of nonzero elements in a vector.
* The L0 norm of (0,0) is 0 and the L0 norm of (0,2) is 1, so the two vectors differ in exactly one element.


#### L1 norm
* MAE (mean absolute error, or mean absolute distance) corresponds to the L1 norm. It is also known as the Manhattan norm (taxicab norm), as it measures the distance between 2 points in a city if you can only travel along orthogonal city blocks.
* Sum of the magnitudes of the vector's components.
* All components of the vector are weighted equally.
* For (3,4), the L1 norm is |3| + |4| = 7
* In the image we can see that the taxicab travels along Manhattan blocks from (0,0) to (3,4).
![](images/l1.jpg)

#### L2 norm
* Euclidean norm. The shortest distance from one point to another.
* RMSE (root mean square error) corresponds to the L2 norm.
* L2 norm for (3,4) = $\sqrt{|3|^2 + |4|^2}$ = 5
![](images/l2.jpg)
* Each component of the vector is squared, so outliers carry more weight and can skew the result.

#### Lk norm
* Lk norm of (3,4) = $\sqrt[k]{|3|^k + |4|^k}$

#### L$\infty$ norm
* Max absolute value in vector.


* The higher the norm index, the more it focuses on large values and neglects small ones. So RMSE is more sensitive to outliers than MAE. When outliers are exponentially rare (as in a bell-shaped curve), RMSE performs very well.
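
The norms of the running example (3,4) can be checked with `numpy.linalg.norm`, and the outlier sensitivity of RMSE versus MAE with a small comparison. The error values below are made up for illustration.

```python
import numpy as np

v = np.array([3.0, 4.0])

l0 = np.linalg.norm(v, ord=0)         # number of nonzero elements: 2
l1 = np.linalg.norm(v, ord=1)         # |3| + |4| = 7
l2 = np.linalg.norm(v, ord=2)         # sqrt(9 + 16) = 5
l3 = np.linalg.norm(v, ord=3)         # Lk norm with k = 3
linf = np.linalg.norm(v, ord=np.inf)  # max absolute value: 4

def mae(errors):
    """Mean absolute error (related to the L1 norm)."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Root mean square error (related to the L2 norm)."""
    return np.sqrt(np.mean(errors**2))

# Hypothetical prediction errors, without and with a single outlier.
errors = np.array([1.0, 1.0, 1.0, 1.0])
errors_outlier = np.array([1.0, 1.0, 1.0, 10.0])

print(l0, l1, l2, l3, linf)
print(mae(errors), rmse(errors))
print(mae(errors_outlier), rmse(errors_outlier))
```

Without the outlier both measures agree; with it, RMSE grows noticeably more than MAE because the large error is squared before averaging.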