## Linear Model Selection
### Notes from Scott's lectures
### Notes from ISLR chapter 6


Purpose is to find alternatives to least squares:
1. Increase prediction accuracy
2. Increase model interpretability by removing irrelevant features
    * Avoids overfitting
3. These can be applied to logistic and other types of models.




Three method classes:
Subset selection
1. Best subset selection (the exhaustive version of stepwise selection)
    1.  Fits every combination of features rather than searching in a forward-only direction:
        * M0: the intercept-only (null) model, which predicts the sample mean for every observation.
        * M1: fit all p models that have exactly 1 predictor and keep the best one.
        * M2: fit all models with exactly 2 features and keep the best one.
        * … continue up to Mp, the model with all p features.
        * Within each size k, the model with the smallest RSS (equivalently, the largest R^2) wins.
        * Last: select the single best model among M0, ..., Mp using cross-validation, Cp (AIC), BIC, or adjusted R^2.
    2. Limit this process to about 10, or at most 20, features because fitting every subset gets very expensive even with software. There are 2^p subsets: at p = 40 that is roughly 1 trillion models, and data sets with hundreds of thousands of features are common.
    3. The number of candidate models of size k is p choose k.
    4. Adding features never makes the training R^2 worse, so training RSS and R^2 always favor the largest (overfit) model and cannot be used to choose between model sizes.

2. Forward Stepwise Selection:
    * Same process as the exhaustive search above, except that at each step we add the single best remaining feature without going back to retest the features already chosen.
        * So we pick M1, then add the next best feature to get M2, and so on, in one direction only.
    * Number of models fitted is now about p^2 / 2 (exactly 1 + p(p+1)/2), a much smaller number than the 2^p above.

3. Backward stepwise selection:
    * Same idea as forward stepwise, but in reverse.
    * Start with the full model containing all p features.
    * At each step, try removing each of the k remaining features in turn and keep the (k-1)-feature model with the smallest RSS or highest R^2.
    * Repeat down to 0 features, leaving only the intercept.
    * Choose among the resulting models by estimated test error, using cross-validation etc.
    * ~p^2 / 2 models fitted
    * Like forward stepwise, not guaranteed to find the “best” model
    * Requires that the number of samples n is larger than the number of variables p, so that the full model can be fit.
        * Otherwise, use forward stepwise: it still works when n < p, so it is the only viable option.
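
The two search strategies above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the helper names (`solve`, `rss_of_fit`, `best_subset`, `forward_stepwise`) and the toy data are made up for this example, and least squares is solved through the normal equations, which is only reasonable for tiny, well-conditioned problems.

```python
from itertools import combinations

def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def rss_of_fit(X, y, cols):
    """Training RSS of least squares with an intercept plus the chosen feature columns."""
    rows = [[1.0] + [x[j] for j in cols] for x in X]
    k = len(cols) + 1
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))

def best_subset(X, y):
    """Exhaustive search: {k: (best feature tuple of size k, training RSS)} for k = 0..p."""
    p = len(X[0])
    return {
        k: min(((cols, rss_of_fit(X, y, cols)) for cols in combinations(range(p), k)),
               key=lambda pair: pair[1])
        for k in range(p + 1)
    }

def forward_stepwise(X, y):
    """Greedy search: at each step add the one feature that lowers training RSS the most."""
    p = len(X[0])
    chosen, path = (), {0: ((), rss_of_fit(X, y, ()))}
    for k in range(1, p + 1):
        candidates = [chosen + (j,) for j in range(p) if j not in chosen]
        chosen = min(candidates, key=lambda cols: rss_of_fit(X, y, cols))
        path[k] = (chosen, rss_of_fit(X, y, chosen))
    return path

# Toy data (made up): y depends only on feature 0.
X = [[1, 5, 2], [2, 3, 9], [3, 8, 4], [4, 1, 7], [5, 6, 1], [6, 2, 8]]
y = [2 * row[0] + 1 for row in X]
best = best_subset(X, y)
forward = forward_stepwise(X, y)
print(best[1][0], forward[1][0])  # the 1-feature model picked by each search
```

Both functions return only the training RSS per model size. Choosing among the sizes still needs cross-validation or an adjusted criterion, since training RSS never increases as features are added. With p = 3 the exhaustive search fits 2^3 = 8 models and the greedy search fits 1 + 3 + 2 + 1 = 7; the gap grows very quickly with p.
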

Estimating Test error:
* Training RSS and R^2 are poor predictors of test error: when we run test data and recalculate RSS and R^2, the results are usually worse.

1. Adjust Training Error:
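    * Cp (AIC), BIC
    * Adjusted R^2
        * Plain R^2 cannot be used to compare 2 models that have different numbers of features.
            * Training R^2 never decreases when a feature is added, whether or not the feature helps.
        * The adjustment accounts for this.
        * Penalizes models with a larger number of features.
        * Does this by dividing both parts by their degrees of freedom: RSS by (n - d - 1) and TSS by (n - 1), where d is the number of features.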

![image.png](attachment:image.png)

        * Adjusted R^2 does not carry over to logistic regression
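
A small sketch of the adjustment described above, using the standard formula 1 - [RSS / (n - d - 1)] / [TSS / (n - 1)] with d the number of features. The function names and the numbers are hypothetical, chosen only to show plain R^2 rising while adjusted R^2 falls.

```python
def plain_r2(rss, tss):
    """Ordinary R^2 = 1 - RSS / TSS."""
    return 1.0 - rss / tss

def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2 for a least-squares model with d features fit on n observations."""
    return 1.0 - (rss / (n - d - 1)) / (tss / (n - 1))

# Hypothetical numbers: adding a 3rd feature barely lowers RSS.
n, tss = 11, 100.0
print(plain_r2(20.0, tss), adjusted_r2(20.0, tss, n, d=2))  # 2-feature model
print(plain_r2(19.9, tss), adjusted_r2(19.9, tss, n, d=3))  # 3-feature model
```

The extra feature buys a tiny drop in RSS, so plain R^2 goes up, but the lost degree of freedom outweighs it and adjusted R^2 goes down.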


2. Directly estimate test error: validation approach or cross-validation:
    * Best practice
    * Hastie et al. like this better than the adjustments above because it can be used in all cases and with all models, whether n < p, etc.
    * Provides a direct estimate of the test error. Does not require an estimate of the error variance.
    * Common rule of thumb (the one-standard-error rule): pick the simplest model whose estimated test error is within 1 standard error of the lowest point on the curve.
        * If cross-validation says pick 6 features, we might move over to 4 features

![image.png](attachment:image.png)
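
The rule of thumb above (often called the one-standard-error rule) can be sketched as follows. The function name and the cross-validation numbers are invented for illustration; in practice the means and standard errors would come from the folds of an actual cross-validation run.

```python
def one_standard_error_choice(sizes, cv_error, cv_se):
    """Pick the smallest model size whose CV error is within one SE of the minimum."""
    best = min(range(len(sizes)), key=lambda i: cv_error[i])
    threshold = cv_error[best] + cv_se[best]
    eligible = [s for s, err in zip(sizes, cv_error) if err <= threshold]
    return min(eligible)

# Made-up cross-validation results for models with 1..8 features.
sizes = [1, 2, 3, 4, 5, 6, 7, 8]
cv_error = [9.0, 6.5, 5.1, 4.4, 4.3, 4.2, 4.25, 4.3]
cv_se = [0.5, 0.4, 0.3, 0.3, 0.3, 0.25, 0.3, 0.3]
chosen = one_standard_error_choice(sizes, cv_error, cv_se)
print(chosen)
```

Here the lowest error is at 6 features, but the 4-feature model is within one standard error of it, so the simpler model is chosen.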

### Shrinkage
1. Ridge
    * Trades off fit against the size of the coefficients.
    * Takes RSS and adds a penalty for the coefficient size.
    * This extra term is called the shrinkage penalty; lambda, which scales it, is the tuning parameter.
    * Scale of the features matters.
        * Important to standardize features before applying ridge regression.
    * Called an L2 penalty

2. Lasso
    * Uses the sum of absolute values of the coefficients as the penalty vs. Ridge, which uses the sum of squares.
    * Called an L1 penalty
    * Tends to set coefficients exactly = 0 when lambda is large enough
    * Yields sparse models
    * Similar to best subset selection, lasso will select important variables

* Cross-validation can be used to determine which technique is better for a given data set.
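
To see why lasso produces exact zeros while ridge does not, consider the simple special case treated in ISLR section 6.2: n = p, the design matrix is the identity, and there is no intercept, so each least-squares coefficient equals y_j. Under those assumptions both penalized problems have closed forms, sketched below. The function names and the example coefficients are made up for illustration.

```python
def ridge_shrink(y_j, lam):
    """Ridge in the special case: proportional shrinkage, never exactly 0 unless y_j is 0."""
    return y_j / (1.0 + lam)

def lasso_shrink(y_j, lam):
    """Lasso in the special case: soft-thresholding at lam / 2."""
    if y_j > lam / 2:
        return y_j - lam / 2
    if y_j < -lam / 2:
        return y_j + lam / 2
    return 0.0

ols = [3.0, -1.5, 0.4, -0.2]  # made-up least-squares coefficients
lam = 1.0
ridge = [ridge_shrink(b, lam) for b in ols]  # every coefficient is scaled by 1 / (1 + lam)
lasso = [lasso_shrink(b, lam) for b in ols]  # coefficients with |y_j| <= lam / 2 become 0
print(ridge)
print(lasso)
```

Ridge scales every coefficient by the same factor, so none reaches zero; lasso moves each one toward zero by a fixed amount and sets the small ones to exactly zero, which is what yields sparse models.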

Dimension reduction
1. Option 1: subset selection, through the 3 subset selection methods above.
2. Option 2: shrinkage. Does not use plain least squares to fit the model; instead uses Lasso or Ridge, which add a penalty to the RSS.
3. Option 3: dimension reduction. Involves transforming the features and using least squares on the transformed features.

Option 3:
* Apply a multiplier (loading) to each feature value Xj and sum them up to get Zm, a linear combination of the original features
* Now instead of using X as input to fit the model, we use the new Z values
* Most famous is Principal Component Regression
    1. Step 1, get principal components
    2. Step 2 perform least squares on principal components


    1. 1st component: the normalized linear combination of the variables with the largest variance
    2. Possible to beat least squares alone by accepting a little bias in exchange for a larger reduction in variance.
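
Written out, the two steps of the transformation approach look as follows. The symbols are assumptions borrowed from ISLR's notation rather than from these notes: \phi_{jm} are the multipliers (loadings), M < p is the number of new features kept, and \theta_m are the least-squares coefficients on the new features.

$$
Z_m = \sum_{j=1}^{p} \phi_{jm} X_j, \qquad m = 1, \dots, M
$$

$$
y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i, \qquad i = 1, \dots, n
$$

In principal component regression the \phi_{jm} are the principal component loadings, which are computed from X alone without looking at y.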

* Partial Least Squares:



### Interpretation: From Private Lessons with Scott:

* When using indicator variables, the base case is called the referent (reference) category.
    * The referent level of every categorical variable is absorbed into beta0.
    * Example with 2 categorical variables, 1 with true/false and the other with sunny, rainy, overcast:
        * we have 4 variables/coefficients (beta0, 1 for true/false, 2 for the weather)
        * and 6 possible combinations.
    * Collinearity problem: when using indicator variables, we need to drop one of them from the model because the remaining indicators already determine the one left out (complement rule).
        * The collinearity/multicollinearity problem arises when we split a category into indicator features and use all of them in the model, when we only need 2 of the 3 to determine all 3, for example.
        * Use k-1 indicators for a category with k levels to fix the collinearity.
    * There are 3 ways to address this problem.
        * 1st, just split out the categories and use beta0 as the base for each category.
            * Problem with this is that the additive linear fit cannot match the average of every combination of categories, so it will not fit the data as well as it could.
        * 2nd, we can create a category for each combination of categories, so rain and fast is one, overcast and fast is another, rain and slow is a 3rd, etc. for all 6.
            * Problem with this solution is that it is not as interpretable because we no longer have the linear form. Also, we need to assign one of the combinations to beta0 in order to eliminate the collinearity problem we are addressing.
            * Good thing is that it is more flexible and can fit every combination exactly.
        * 3rd, we can create a linear form that isolates each change in the categories, so that we can see through the model the impact of each change.
            * The predictions for this model will be the same as the predictions for #2 above.
            * Interpretation is easier under the linear form.
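
A minimal sketch of the indicator coding discussed above, for the example with one true/false variable and one three-level weather variable. The names (`WEATHER_LEVELS`, `indicators`, `design_row`) and the choice of "sunny" as the referent level are assumptions made for this illustration.

```python
from itertools import product

WEATHER_LEVELS = ["sunny", "rainy", "overcast"]  # "sunny" is the referent level here

def indicators(value, levels, drop_referent=True):
    """One 0/1 column per level; the first level is dropped when it is the referent."""
    kept = levels[1:] if drop_referent else levels
    return [1 if value == level else 0 for level in kept]

def design_row(fast, weather):
    """Row for the additive model: intercept, fast indicator, 2 weather indicators."""
    return [1, int(fast)] + indicators(weather, WEATHER_LEVELS)

# All 6 combinations of the two categorical variables, each mapped to 4 columns.
rows = {combo: design_row(*combo) for combo in product([False, True], WEATHER_LEVELS)}
for combo, row in rows.items():
    print(combo, row)

# Keeping all 3 weather indicators makes them sum to 1 in every row,
# which duplicates the intercept column (perfect collinearity).
full = [indicators(w, WEATHER_LEVELS, drop_referent=False) for w in WEATHER_LEVELS]
```

The referent combination (not fast, sunny) has zeros in every indicator column, so its prediction is beta0 alone; every other coefficient is read as a change relative to that base case.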


### Review LM:

Important Topics to know in LM (Linear Models)
1. Interpreting Linear Form (advanced: logistic)
2. Least Squares fit
3. Testing coefficients
4. Assumptions: (only the linear form in logistic … all “5” assumptions in LM)
________    
5. Multicollinearity
6. Indicator Variables
___________________
7. Forward/backward Selection
8. Standardization
9. diagnostics 
    1. if you talk about assumptions, you need to talk about diagnostics
10. Leverage
11. VIF: Variance Inflation factor
12. Parametric Extrapolation
    1. tied to multicollinearity
13. Regression as MVN/normal

Regularization: L1 and L2 norms

* adds a complexity penalty
    * the more (and larger) coefficients, the greater the penalty.
    * We choose lambda.
* L2 penalty (squared L2 norm) = lambda * sum(Beta_j^2), with no square root
    * The higher a beta, the bigger its cost
    * With a low lambda there is little penalty, so the betas can be almost anything.
* L1 penalty (L1 norm) = lambda * sum(abs(Beta_j))
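
As a quick numeric check of the two penalties, here is a small sketch. It assumes the usual ridge form, in which the L2 penalty is lambda times the sum of squared coefficients (no square root), and the usual convention that the intercept is left out of the penalty. The function names and coefficient values are made up.

```python
def l2_penalty(betas, lam):
    """Ridge penalty: lambda times the sum of squared coefficients."""
    return lam * sum(b ** 2 for b in betas)

def l1_penalty(betas, lam):
    """Lasso penalty: lambda times the sum of absolute coefficients."""
    return lam * sum(abs(b) for b in betas)

betas = [3.0, -4.0]  # slope coefficients only; the intercept is not penalized
print(l2_penalty(betas, lam=0.5))  # 0.5 * (9 + 16)
print(l1_penalty(betas, lam=0.5))  # 0.5 * (3 + 4)
```

Either penalty is added to the RSS to form the quantity being minimized, so setting lambda to 0 recovers ordinary least squares.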