# Machine Learning concepts 2

* ***Supervised learning***
    * **regression** (predict a **value**)
    * **classification** (predict a **label**)
* ***Unsupervised learning***
    * **clustering** (find out **groups**)
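
The three task types above can be sketched with scikit-learn as follows. This is a minimal illustration, not a recommended workflow: the toy data and the choice of `LinearRegression`, `KNeighborsClassifier` and `KMeans` are assumptions made only for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Toy one-feature dataset (made up for illustration)
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Regression: the target is a continuous value
y_value = np.array([2.0, 4.0, 6.0, 20.0, 22.0, 24.0])
reg = LinearRegression().fit(X, y_value)
predicted_value = reg.predict([[5.0]])[0]

# Classification: the target is a label
y_label = np.array(["small", "small", "small", "large", "large", "large"])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_label)
predicted_label = clf.predict([[2.5]])[0]

# Clustering: no target is given, the algorithm finds the groups itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
groups = km.labels_

print(predicted_value, predicted_label, groups)
```

Note that the supervised models are fitted with both `X` and a target, while the clustering model is fitted with `X` only.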

<img src='training.png' width=700>

### Model Parameters vs Hyperparameters


* Model parameters
    - properties that are **learned from data** and describe the model
* Hyperparameters
    - values that control the learning process
    - you must set the hyperparameters before running the ML algorithm
    - e.g. the number of clusters K in the K-means algorithm
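
To make the distinction concrete, here is a small sketch using scikit-learn's `KMeans`, assuming a made-up two-dimensional dataset. `n_clusters` is the hyperparameter K that we choose up front, while `cluster_centers_` holds the model parameters learned by `fit()`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data with two visually obvious groups (made up for illustration)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

# Hyperparameter: K (n_clusters) is set by us before training
model = KMeans(n_clusters=2, n_init=10, random_state=0)

# Model parameters: the cluster centers are learned from the data
model.fit(X)
centers = model.cluster_centers_

print("hyperparameter K:", model.get_params()["n_clusters"])
print("learned centers:\n", centers)
```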
    

### Generalization in Machine Learning
* ML is not a memorization process, but a **generalization process**
* Generalization is how well the trained model performs on **new data** (i.e. its prediction performance on unseen data)

### Main Challenges of Machine Learning
* ***Data***
    * Not enough data
    * Data is not representative
    * Low quality
        * Missing data, noise
        * Incorrect data, irrelevant features, etc.
* **Overfitting**
    * The model fits the training data too closely and is unable to generalize
    * Performs well on training data, but does not generalize well to unseen data
    * The model has learned from noise in the data, which is irrelevant
    * Solutions:
        * Simplify the model
            * use fewer parameters
            * use fewer attributes
            * regularization (constrain the model)
        * Reduce the noise
        * Use more training data
* **Underfitting**
    * Selected model is too simple for the data
    * Does not perform well on either training data or testing data
    * Solutions:
        * Use a more powerful model
        * Reduce constraints on model





<img src="https://upload.wikimedia.org/wikipedia/commons/1/19/Overfitting.svg" height=200 width=200>
<br>
CC BY-SA 4.0: https://en.wikipedia.org/wiki/File:Overfitting.svg 
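
Underfitting and overfitting can be explored by comparing training and testing error for models of different complexity. The sketch below is an assumption-laden illustration: the synthetic sine data, the hypothetical helper `make_data`, and the polynomial degrees 1, 4 and 15 are all chosen only for this example, and the exact numbers depend on the random seed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical helper: noisy samples of sin(2*pi*x) on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=n)
    return x.reshape(-1, 1), y

X_train, y_train = make_data(20)
X_test, y_test = make_data(200)

train_mse, test_mse = {}, {}
for degree in (1, 4, 15):
    # A higher degree means more parameters, i.e. a more flexible model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  "
          f"train MSE={train_mse[degree]:.4f}  "
          f"test MSE={test_mse[degree]:.4f}")
```

A model that is too simple (degree 1) has a high error on both sets. A very flexible model (degree 15) drives the training error down, and a testing error that is much larger than the training error is the usual sign of overfitting.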

---

## Model evaluation

1. Train & Test on the same data
    1. the model may **overfit** the training data without this being detected, and then not work well in a real-world environment
1. **Train and Test on different sets of data (train_test_split)**
    1. Split the dataset into a training set and a testing set. Train on the training set, then test on the testing set
    1. sklearn.model_selection.train_test_split()
1. **K-Fold Cross-Validation (K-Fold CV)**
    1. Split the data into K equal parts for K runs
    1. Select 1 part as the validation set and use the remaining parts for training
    1. Repeat K times, using a different part as the validation set each time

<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" width="400">
    
Source: https://scikit-learn.org/stable/modules/cross_validation.html
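
The three evaluation strategies listed above can be compared side by side. The following is a minimal sketch, assuming the bundled iris dataset and a `KNeighborsClassifier` as stand-ins for your own data and model. The split ratio, `K=5` and the random seeds are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# 1. Train and test on the same data (optimistic estimate)
same_data_accuracy = model.fit(X, y).score(X, y)

# 2. Train and test on different sets of data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
holdout_accuracy = model.fit(X_train, y_train).score(X_test, y_test)

# 3. K-fold cross-validation: K runs, each with a different validation part
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=kfold)

print("same data :", same_data_accuracy)
print("hold-out  :", holdout_accuracy)
print("5-fold CV :", cv_scores, "mean =", cv_scores.mean())
```

`cross_val_score` refits a fresh copy of the model for every fold, so the scores are always measured on data that copy has not seen during training.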


### Model evaluation metrics

* **Regression:**
    * Minimize the error

    Mean Absolute Error $(MAE) = \frac{1}{n}\sum_{i=1}^{n} \lvert (actual_{i}-predicted_{i}) \rvert$

    Mean Square Error $(MSE) = \frac{1}{n}\sum_{i=1}^{n} (actual_{i}-predicted_{i})^2$

    Root Mean Square Error $(RMSE) = \sqrt{ \frac{1}{n}\sum_{i=1}^{n} (actual_{i}-predicted_{i})^2}$
    
    <img src="regression_error.png" width="200">

* **Classification:**
    1. accuracy
        1. **metrics.accuracy_score(y_actual, y_predict)**
        1. compare with baseline accuracy:
            * always predict the most frequent class
            * baseline accuracy = number of samples in the most frequent class / total number of samples
    1. Confusion matrix: a table / diagram that shows the performance of a classifier
        * **metrics.confusion_matrix(y_actual, y_predict)**
  
    <p><img src="confusion_matrix.png" width="220"></p>


* $\textbf{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

* $\textbf{Precision} = \frac{TP}{TP + FP}$

* $\textbf{Recall} = \frac{TP}{TP + FN}$
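
The regression errors and classification metrics defined in this section can be checked by hand on small arrays. The values below are made up for illustration. The sketch assumes binary labels with 1 as the positive class, which is how `confusion_matrix(...).ravel()` yields TN, FP, FN, TP in that order.

```python
import numpy as np
from sklearn import metrics

# --- Regression: MAE, MSE, RMSE ---
actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.0, 5.0, 4.0, 8.0])

mae = metrics.mean_absolute_error(actual, predicted)   # (1+0+2+1)/4
mse = metrics.mean_squared_error(actual, predicted)    # (1+0+4+1)/4
rmse = np.sqrt(mse)

# --- Classification: accuracy, baseline, confusion matrix ---
y_actual = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_predict = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

accuracy = metrics.accuracy_score(y_actual, y_predict)

# Baseline: always predict the most frequent class
baseline = np.bincount(y_actual).max() / len(y_actual)

# For binary labels {0, 1}, rows are actual and columns are predicted
tn, fp, fn, tp = metrics.confusion_matrix(y_actual, y_predict).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.4f}")
print(f"accuracy={accuracy}, baseline={baseline}")
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"precision={precision}, recall={recall}")
```

Here the accuracy of 0.8 is meaningful only because it is above the baseline of 0.6 obtained by always predicting class 0. The hand-computed precision and recall agree with `metrics.precision_score` and `metrics.recall_score`.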


