# Intro to Supervised Learning

* Supervised learning algorithms are trained using <strong>labeled</strong> examples, that is, inputs for which the desired output is known.

* The algorithm receives a set of <strong>inputs</strong> along with the corresponding correct <strong>outputs</strong>, and it <strong>learns by comparing its actual output with the correct outputs</strong> to find errors. It then modifies the model accordingly.

* Supervised learning is commonly used in applications where historical data is used to predict likely future events.

## Machine Learning Process:
1. Data Acquisition
2. Data Cleaning
3. Split into training and validation data
4. Build & train model 
5. Validate the model (if needed, refine model parameters)
6. Deploy the model

### Problem
We can keep adjusting the hyperparameters until the performance score on the validation data is very high. Because the hyperparameters were tuned against that same data, this score is not representative of the model's performance on new, unseen data.

### Solution
To fix this issue we split the data into 3 sets:
1. Training data -- Used to train model parameters 
2. Validation data -- Used to determine what model hyperparameters to adjust
3. Test data -- Used once, at the end, to get a final performance metric on data the model has never seen
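The three-way split can be sketched as follows. This is a minimal illustration using only the standard library; the function name `three_way_split`, the 60/20/20 ratios, and the fixed seed are assumptions made for this example. In practice, a similar split is often obtained by calling scikit-learn's `train_test_split` twice.

```python
import random


def three_way_split(samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the samples and split them into train, validation and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # held out until the very end
    val = shuffled[n_test:n_test + n_val]    # used to tune hyperparameters
    train = shuffled[n_test + n_val:]        # used to fit model parameters
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Every sample lands in exactly one of the three sets, so the test score is not influenced by any tuning decision.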

## Model Evaluation 

After we finish training and validating our model, we need to identify key error metrics to determine how well our model performs.

### Classification
A classification task is one where the model predicts a categorical value (a class label).

The key classification model performance metrics are:
* Accuracy 
* Recall
* Precision 
* F1-Score

In any classification task, each individual prediction has one of two outcomes:
   * Correct Prediction
   * Incorrect Prediction

We compare the prediction with the true label for every example in the test data. At the end, we have a count of all the correct and incorrect predictions.

<small>Note: The key realization we need to make is that in the real world, not all incorrect or correct predictions hold equal value.</small>

#### 1. Accuracy

Accuracy in classification problems is the number of correct predictions made by the model divided by the total number of predictions.

Example: If the model had to classify 100 images and got 80 right, the accuracy is 0.8 or 80%.

* Accuracy is very useful when the target classes are <strong>well balanced</strong> (near equal number of examples for each target class).
* Accuracy is <strong>not</strong> useful when the target classes are <strong>unbalanced</strong> (big difference in number of examples for each target class), because a model that always predicts the majority class still scores high.
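To make the point about unbalanced classes concrete, here is a small sketch in plain Python. The dataset (95 negatives, 5 positives) and the always-negative "model" are hypothetical and chosen only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical unbalanced dataset: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
print(acc)  # 0.95, even though not a single positive case was found
```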

#### 2. Recall

* Recall = Proportion of actually positive cases that the model correctly labels as positive.

Recall is the <strong>number of true positives</strong> divided by the <strong>sum of true positives and false negatives</strong>.

#### 3. Precision

* Precision = Proportion of cases labeled as positive that are actually positive.

Precision is the <strong>number of true positives</strong> divided by the <strong>sum of true positives and false positives</strong>.




Note: There is often a trade-off between recall and precision. Labeling more cases as positive tends to raise recall but lower precision, and vice versa.
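The trade-off can be illustrated with a short sketch. The scores, labels, and thresholds below are made-up values, and returning 0.0 when a metric is undefined is a convention assumed for this example.

```python
def precision_recall(y_true, y_pred):
    """Compute (precision, recall) for binary labels, 1 = positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention: 0.0 if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical true labels and model scores (higher score = more likely positive).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

results = {}
for threshold in (0.75, 0.35):
    # Label a case positive when its score reaches the threshold.
    y_pred = [1 if s >= threshold else 0 for s in scores]
    results[threshold] = precision_recall(y_true, y_pred)

print(results[0.75])  # (1.0, 0.5): strict threshold, high precision, low recall
print(results[0.35])  # (0.8, 1.0): lenient threshold, lower precision, high recall
```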



#### 4. F1-score

In cases where we want to find the optimal blend of precision and recall, we can combine the two metrics using the F1 score.

The F1 score is the harmonic mean of precision and recall, so it takes both metrics into account.

F1 = 2 * (precision * recall) / (precision + recall)

We use the harmonic mean instead of the simple mean because it punishes extreme values. For instance, a classifier with 1.0 precision and 0.0 recall has a simple mean of 0.5, but a harmonic mean (and therefore F1 score) of 0.
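The difference between the two means can be checked directly. This is a minimal sketch; the helper names `f1` and `simple_mean` are illustrative, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


def simple_mean(precision, recall):
    """Arithmetic mean, shown only for comparison."""
    return (precision + recall) / 2


# Extreme case from the text: perfect precision, zero recall.
print(simple_mean(1.0, 0.0), f1(1.0, 0.0))  # 0.5 0.0

# Balanced case: both means are (almost exactly) the same.
print(simple_mean(0.8, 0.8), f1(0.8, 0.8))
```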

#### Confusion Matrix
We can also view all correctly and incorrectly classified data in the form of a confusion matrix. 
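As a rough sketch, the four cells of a binary confusion matrix can be counted in plain Python. The labels below are hypothetical, and the layout (rows = actual class, columns = predicted class) is one common convention rather than the only one.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative)."""
    pairs = Counter(zip(y_true, y_pred))  # keys are (actual, predicted)
    return {
        "TP": pairs[(1, 1)],
        "FN": pairs[(1, 0)],
        "FP": pairs[(0, 1)],
        "TN": pairs[(0, 0)],
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)

# Rows are the actual class, columns are the predicted class.
print("            pred=1  pred=0")
print(f"actual=1    {counts['TP']:>6}  {counts['FN']:>6}")
print(f"actual=0    {counts['FP']:>6}  {counts['TN']:>6}")
```

Accuracy, recall, and precision can all be computed from these four counts.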

In a general sense, all 4 metrics are ways of comparing predicted values with labeled values. What constitutes a good metric depends on the situation.

For instance, in disease diagnostics it is crucial to minimize false negatives, even at the cost of more false positives. In other words, we accept lower precision in exchange for higher recall.

All in all, machine learning is a collaborative process in which we should consult with domain experts.

### Regression 

A regression task is one where the model predicts a continuous value.

The key regression model performance metrics are:
* Mean Absolute Error (MAE)
* Mean Squared Error (MSE)
* Root Mean Squared Error (RMSE)

#### 1. Mean Absolute Error

The mean of the absolute value of errors. 

MAE = sum( | yPred - yTrue | ) / n  

The issue with MAE is that it weights all errors linearly, so a few very large errors are not punished more than many small ones.

#### 2. Mean Squared Error

The mean of the squared value of errors. 

MSE = sum((yPred - yTrue)^2) / n  

MSE punishes large errors more heavily than MAE, which makes it more popular.

The issue with MSE is that it squares the units as well. For example, when we square a price difference, the error is expressed in $^2 rather than $.

#### 3. Root Mean Squared Error

The square root of the mean of the squared value of errors. 

RMSE = sqrt(sum((yPred - yTrue)^2) / n)  

RMSE punishes large errors and preserves the original units. It is one of the most popular metrics for expressing a regression model's performance.

To answer "Is this RMSE good?", compare the RMSE to the average value of the label in your dataset. This will give you an intuition of the model's overall performance. Domain knowledge is important here too.
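The three regression metrics can be compared side by side with a small sketch. The values below are hypothetical: both sets of predictions have the same total absolute error, but one spreads it over four small errors and the other concentrates it in a single large error.

```python
import math


def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the label."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [100, 200, 300, 400]
small_errors = [110, 190, 310, 390]     # four errors of 10
one_large_error = [100, 200, 300, 440]  # a single error of 40

print(mae(y_true, small_errors), mae(y_true, one_large_error))    # 10.0 10.0
print(mse(y_true, small_errors), mse(y_true, one_large_error))    # 100.0 400.0
print(rmse(y_true, small_errors), rmse(y_true, one_large_error))  # 10.0 20.0

# Put the RMSE in context by comparing it with the average label value.
mean_label = sum(y_true) / len(y_true)  # 250.0
print(rmse(y_true, one_large_error) / mean_label)  # 0.08
```

MAE cannot tell the two cases apart, while MSE and RMSE both flag the single large error.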

## Machine Learning with Python

We will be using `scikit-learn`.

Every model in scikit-learn is exposed via an <b>Estimator</b>.

#### Step 1: Import the Estimator

General import statement:
`from sklearn.family import Estimator`

In [10]:
# import statement example
from sklearn.linear_model import LinearRegression

#### Step 2: Instantiate the Estimator

All the parameters of an Estimator can be set when you instantiate it, and most have sensible default values.

In [14]:
# instantiate the Linear Regression estimator
# (`normalize` is omitted: it was deprecated in scikit-learn 1.0 and removed in 1.2)
model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None)
print(model)

LinearRegression()


#### Step 3: Split data into train and test

In [22]:
# Create fake data
import numpy as np
X, y = np.arange(10).reshape((5, 2)), np.arange(5)
print(X)
y

[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]


array([0, 1, 2, 3, 4])

In [23]:
# Split data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(3, 2) (2, 2) (3,) (2,)


#### Step 4: Fit the model to the data

In [24]:
model.fit(X_train, y_train)

LinearRegression()

#### Step 5: Use the model to predict y values for X_test

In [26]:
p = model.predict(X_test)
p

array([4., 2.])

#### Step 6: Evaluate model performance

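In [28]:
# Get the model's RMSE
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, p)
# take the square root; note that mse**1/2 parses as (mse**1)/2, i.e. half the MSE
rmse = mse ** 0.5
print(f"{rmse:.3e}")

3.544e-08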
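The six steps above can be collected into one short script. This is a sketch under a few assumptions: `random_state=0` is added only to make the split reproducible, and the estimator is created with its default parameters. It requires `numpy` and `scikit-learn` to be installed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Fake data: y is an exact linear function of X, so a perfect fit is possible.
X, y = np.arange(10).reshape((5, 2)), np.arange(5)

# Steps 2-3: instantiate the estimator and split the data.
model = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0  # random_state is an assumption for reproducibility
)

# Steps 4-5: fit on the training data, predict on the test data.
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Step 6: RMSE is the square root of the MSE.
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(rmse)  # effectively zero, up to floating-point error
```

Because the fake data is perfectly linear, the RMSE here only reflects floating-point rounding. On real data it should be compared with the average label value, as described in the regression section.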