#### Supervised Machine Learning

- Data instances/samples/examples (features $X$)
- Target value $y$
- Training and test sets
- Model/Estimator
    - Model fitting produces a 'trained model'
    - Training is the process of estimating model parameters
- Evaluation method
- Both classification and regression take a set of training instances and learn a mapping to a target value
- For classification, the target value is a discrete class value
    - Binary: target value is 0 (negative class) or 1 (positive class)
    - Multi-class: Target value is one of a set of discrete values
    - Multi-label: There are multiple target values (labels)
- For regression, the target value is continuous (floating point/real-valued)
- Looking at the target value's type will guide you on what supervised learning method to use.
- Many supervised learning methods have 'flavors' for both classification and regression
- Simple but powerful prediction algorithms:
    - K-nearest neighbors
    - Linear model fit using least-squares
- K-nearest neighbors makes few assumptions about the structure of the data and gives potentially accurate but sometimes unstable predictions (sensitive to small changes in the training data)
- Linear models make strong assumptions about the structure of the data and give stable but potentially inaccurate predictions
- Generalization ability: refers to an algorithm's ability to give accurate predictions for new, previously unseen data.
- Assumptions:
    - Future unseen data (test set) will have the same properties as the current training set.
    - Thus, models that are accurate on the training set are expected to be accurate on the test set.
    - But that may not happen if the trained model is tuned too specifically to the training set.
- Models that are too complex for the amount of training data available are said to *overfit* and are not likely to generalize well to new examples.
- Models that are too simple, that don't even do well on the training data, are said to *underfit* and are also not likely to generalize well.
- Model complexity (K-nearest neighbors)
    - n_neighbors: number of nearest neighbors (k) to consider
        - Default = 5
- Model fitting (K-nearest neighbors)
    - metric: distance function between data points
        - Default: Minkowski distance with power parameter p=2 (Euclidean)
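
The K-nearest neighbors parameters above can be illustrated with a small from-scratch sketch. This is not a library implementation: the function name `knn_predict`, the toy data, and the tie-breaking rule are assumptions made for illustration. Only the parameter names `n_neighbors` and `p` mirror the notes.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, n_neighbors=5, p=2):
    """Predict a class label for each query point by majority vote among
    its n_neighbors nearest training points (Minkowski distance)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for q in np.asarray(X_query, dtype=float):
        # Minkowski distance; p=2 gives the Euclidean distance
        dists = (np.abs(X_train - q) ** p).sum(axis=1) ** (1.0 / p)
        nearest = np.argsort(dists, kind="stable")[:n_neighbors]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        # Vote ties go to the smallest label (an arbitrary choice here)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Toy data (made up for illustration): two well-separated clusters
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

knn_preds = knn_predict(X_train, y_train, [[0.5, 0.5], [5.5, 5.5]], n_neighbors=3)
print(knn_preds.tolist())  # [0, 1]

# With n_neighbors=1 every training point is its own nearest neighbor,
# so the training set is reproduced exactly (the most complex setting).
train_preds = knn_predict(X_train, y_train, X_train, n_neighbors=1)
```

Small values of `n_neighbors` give a more complex model that follows the training data closely; larger values smooth the prediction over more neighbors.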

#### Linear Regression: Least-Squares

- A linear model is a sum of weighted variables that predicts a target output value given an input data instance
- Ex: Predicting house prices from taxes per year ($X_{TAX}$) and age in years ($X_{AGE}$)
$$\widehat{Y_{PRICE}}=212000+109X_{TAX}-2000X_{AGE}$$
- A house with feature values $(X_{TAX},X_{AGE})$ of (10000,75) would have a predicted selling price of $$\widehat{Y_{PRICE}}=212000+109(10000)-2000(75)=1152000$$
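
The same arithmetic as a minimal sketch in code. The function name `predict_price` and its argument names are hypothetical; the coefficients are the ones from the example above.

```python
def predict_price(tax, age, w_tax=109, w_age=-2000, b=212000):
    """Linear prediction: b + w_tax * tax + w_age * age."""
    return b + w_tax * tax + w_age * age

price = predict_price(tax=10000, age=75)
print(price)  # 1152000
```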

- input instance - feature vector: $x=(x_0,x_1,...,x_n)$
- predicted output:

$$\hat{y}=\widehat{w_{0}}x_0+\widehat{w_{1}}x_1+...+\widehat{w_{n}}x_n+\widehat{b}$$

- Parameters to estimate:
    - $\widehat{w}=(\widehat{w_0},...,\widehat{w_n}):$ feature weights/model coefficients
    - $\hat{b}:$ constant bias term/intercept

- A linear regression model with one variable (feature)
    - Input instance: $x=(x_0)$
    - Predicted output: $\hat{y}=\hat{w_0}x_0+\hat{b}$
    - Parameters to estimate:
        - $\hat{w_0}$ (slope)
        - $\hat{b}$ (y intercept)

- There are many different ways to estimate $w$ and $b$:
    - Different methods correspond to different "fit" criteria and goals, and to different ways of controlling model complexity
- The learning algorithm finds the parameters that optimize an objective function, typically by minimizing some kind of loss function of the predicted target values vs. the actual target values
- Least-squares linear regression finds the $w$ and $b$ that minimize the sum of squared differences (RSS) between predicted and actual target values over the training data
- Minimizing the RSS is equivalent to minimizing the mean squared error of the linear model, since the two differ only by the factor $1/N$
- No parameters to control model complexity

$$RSS(w,b)= \sum_{i=1}^N{(y_i-(w \cdot x_i+b))^2}$$
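
A minimal sketch of least-squares fitting, assuming NumPy is available. The toy data and the helper name `rss` are made up for illustration; `np.linalg.lstsq` is used here as one way to minimize the sum of squared residuals, not as the only way.

```python
import numpy as np

# Toy data generated exactly from y = 2*x + 1 (made up for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Append a column of ones so the intercept b is estimated with the weights
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve min ||X_aug @ params - y||^2
params, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = params[:-1], params[-1]

def rss(w, b, X, y):
    """Sum of squared differences between actual and predicted targets."""
    residuals = y - (X @ w + b)
    return float(np.sum(residuals ** 2))

print(w, b, rss(w, b, X, y))
```

Because the toy targets lie exactly on a line, the fitted slope and intercept recover 2 and 1 up to floating-point error, and any other choice of parameters gives a larger RSS.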

#### Ridge Regression

- Ridge regression learns $w,b$ using the same least-squares criterion but adds a penalty on large values of the $w$ parameters

$$RSS_{RIDGE}(w,b)=\sum_{i=1}^N{(y_i-(w \cdot x_i+b))^2}+\alpha\sum_{j=1}^p{w_j^2}$$

- Once the parameters are learned, the ridge regression prediction formula is the same as ordinary least-squares.
- The addition of a parameter penalty is called regularization. Regularization prevents overfitting by restricting the model, typically to reduce its complexity.
- Ridge regression uses *L2 regularization*: minimize the sum of squares of the $w$ entries
- The influence of the regularization term is controlled by the $\alpha$ parameter.
- Higher alpha means more regularization and simpler models.
- For some machine learning methods it is important that all features are on the same scale (e.g. faster convergence in learning, more uniform or 'fair' influence for all weights)
- The best scaling can also depend on the data. For now, we do MinMax scaling of the features:
    - For each feature $x_i$: compute the min value $x_i^{MIN}$ and the max value $x_{i}^{MAX}$ achieved across all instances in the training set.
    - For each feature: transform a given feature $x_i$ value to a scaled version $x_{i}^{'}$ using the formula
    $$x_{i}^{'}=(x_i-x_i^{MIN})/(x_i^{MAX}-x_i^{MIN})$$
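
A minimal sketch of the MinMax formula, assuming NumPy. The data and the helper name `min_max_scale` are made up for illustration. The minimum and maximum are computed from the training data only and then reused for any other data.

```python
import numpy as np

# Made-up data: rows are instances, columns are features
X_train = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 1000.0]])
X_test = np.array([[2.0, 600.0], [6.0, 100.0]])

# Per-feature min and max come from the training set only
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)

def min_max_scale(X):
    """Apply x' = (x - min) / (max - min) with the training-set min and max.
    Assumes no feature is constant (max > min for every column)."""
    return (X - x_min) / (x_max - x_min)

X_train_scaled = min_max_scale(X_train)  # every column lies in [0, 1]
X_test_scaled = min_max_scale(X_test)    # may fall outside [0, 1]
print(X_train_scaled)
print(X_test_scaled)
```

Scaled values for new data can fall below 0 or above 1 when a feature value lies outside the range seen in training. scikit-learn's `MinMaxScaler` packages the same pattern as `fit` on the training data followed by `transform`.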

Feature Normalization: The test set must use identical scaling to the training set

- Fit the scaler using the training set, then apply the same scaler to transform the test set.
- Do not scale the training and test sets using different scalers: this could lead to random skew in the data
- Do not fit the scaler using any part of the test data: referencing the test data can lead to a form of data leakage. More on this issue later in the course.
- Ridge regression coefficient pathway as regularization (alpha) is increased
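
The effect of increasing alpha on the ridge coefficients can be checked numerically. The sketch below assumes NumPy and solves the ridge objective in closed form on centered data, leaving the intercept unpenalized. The function name `fit_ridge` and the toy data are made up for illustration.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Minimize sum (y_i - (w . x_i + b))^2 + alpha * sum w_j^2.
    The intercept b is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    # Normal equations with alpha added to the diagonal
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Toy data generated exactly from y = 2*x0 + 1*x1 (made up for illustration)
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]])
y = np.array([1.0, 2.0, 7.0, 8.0])

ridge_norms = []
for alpha in [0.0, 1.0, 10.0, 100.0]:
    w, b = fit_ridge(X, y, alpha)
    ridge_norms.append(float(np.linalg.norm(w)))
    print(alpha, w, b)

w_unregularized, _ = fit_ridge(X, y, 0.0)
```

With alpha = 0 the fit reduces to ordinary least squares. As alpha grows, the size of the weight vector shrinks toward zero, but individual weights generally do not become exactly zero.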

#### Lasso Regression

Lasso regression is another form of regularized linear regression that uses an *L1 regularization* penalty for training (instead of ridge's L2 penalty)

- L1 penalty: Minimize the sum of the *absolute values* of the coefficients

$$RSS_{LASSO}(w,b)=\sum_{i=1}^N{(y_i-(w \cdot x_i+b))^2}+\alpha\sum_{j=1}^p{|w_j|}$$
- This has the effect of setting parameter weights in $w$ to *zero* for the least influential variables. This is called a sparse solution: a kind of feature selection
- The parameter $\alpha$ controls the amount of L1 regularization (default=1.0)
- The prediction formula is the same as ordinary least-squares
- When to use ridge vs lasso regression:
    - Many small/medium sized effects: use ridge
    - Only a few variables with medium/large effect: use lasso
- Lasso regression coefficient pathway as regularization (alpha) is increased
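
The sparsity effect can be seen in a small from-scratch sketch. It assumes NumPy and uses coordinate descent with soft-thresholding, which is one common way to minimize an L1-penalized objective. The function names, the toy data, and the fixed number of sweeps are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def fit_lasso(X, y, alpha, n_sweeps=100):
    """Coordinate descent for sum (y_i - (w . x_i + b))^2 + alpha * sum |w_j|.
    Assumes no feature column is all zeros."""
    n_features = X.shape[1]
    w = np.zeros(n_features)
    b = y.mean()
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with feature j's current contribution removed
            partial_residual = y - (X @ w + b) + w[j] * X[:, j]
            rho = X[:, j] @ partial_residual
            z = X[:, j] @ X[:, j]
            w[j] = soft_threshold(rho, alpha / 2.0) / z
        b = np.mean(y - X @ w)  # the intercept is not penalized
    return w, b

# Made-up data: y depends strongly on feature 0 and only weakly on feature 1
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + 5.0

w_small, b_small = fit_lasso(X, y, alpha=0.0)
w_large, b_large = fit_lasso(X, y, alpha=2.0)
print(w_small, w_large)
```

With alpha = 0 both weights are recovered. With alpha = 2 the weak feature's weight becomes exactly zero while the strong feature's weight is only shrunk. Library implementations may scale the objective differently, so alpha values are not directly comparable across implementations.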

#### Polynomial Features with Linear Regression

$$x=(x_0,x_1) \to x^{'}=(x_0,x_1,x_0^2,x_0x_1,x_1^2)$$
$$\hat{y}=\hat{w}_0 x_0+\hat{w}_1 x_1+\hat{w}_{00} x_0^2+\hat{w}_{01}x_0x_1+\hat{w}_{11}x_1^2+\hat{b}$$

- Generate new features consisting of all polynomial combinations of the original two features $(x_0,x_1)$
- The degree of the polynomial specifies how many variables participate at a time in each new feature (above example: degree 2)
- This is still a weighted linear combination of features, so it's still a linear model, and can use the same least-squares estimation method for $w$ and $b$
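
A minimal sketch of the feature expansion, using only the Python standard library. The function name `polynomial_features` is hypothetical, and the sketch omits a constant (bias) feature so that its output matches the expansion shown above.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_features(x, degree=2):
    """Return all products of 1 to `degree` input features (no constant term)."""
    features = []
    for d in range(1, degree + 1):
        # Each index tuple picks d features, with repetition allowed
        for idx in combinations_with_replacement(range(len(x)), d):
            features.append(prod(x[i] for i in idx))
    return features

expanded = polynomial_features([2, 3], degree=2)
print(expanded)  # [2, 3, 4, 6, 9], i.e. (x0, x1, x0^2, x0*x1, x1^2)
```

The expanded vector can then be passed to any linear regression method unchanged, since the model remains linear in the new features.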


#### Logistic Regression

$$\hat{y}=\text{logistic}(\hat{b}+\hat{w}_1x_1+...+\hat{w}_nx_n)$$
$$\hat{y}=\frac{1}{1+e^{-(\hat{b}+\hat{w}_1x_1+...+\hat{w}_nx_n)}}$$
$$\hat{y} \in (0,1)$$

- The logistic function transforms a real-valued input to an output number $\hat{y}$ between 0 and 1, interpreted as the probability that the input object belongs to the positive class, given its input features $(x_1,...,x_n)$
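
A minimal sketch of the prediction formula, using only the standard library. The weights, bias, and input values are hypothetical numbers chosen for illustration, and the function names are not taken from any library.

```python
import math

def logistic(z):
    """Map a real number z to (0, 1). Very large negative z would overflow
    math.exp; this sketch does not guard against that."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Estimated probability of the positive class for one instance x."""
    z = b + sum(w_j * x_j for w_j, x_j in zip(w, x))
    return logistic(z)

# Hypothetical weights and bias
w_hat = [0.8, -0.5]
b_hat = 0.1

p = predict_proba([1.0, 2.0], w_hat, b_hat)  # z = 0.1 + 0.8 - 1.0 = -0.1
print(p)  # about 0.475
```

A linear score of 0 maps to a probability of exactly 0.5, so thresholding the probability at 0.5 is the same as checking the sign of the linear score. In this example the score is negative, so the instance would be assigned to the negative class.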