# I. Regularized Linear Models
We've discussed overfitting concerns. One way to mitigate overfitting (a.k.a. regularize a model) is to rein in the inputs, either by shrinking their weights or by dropping some of them. At a high level, "regularizing models" means making predictive models more accurate on unseen data through simplification, i.e. __reducing error on testing data by shrinking or dropping redundant inputs__, usually at the cost of slightly more error on training data. These are the 3 models we'll cover; the math examples come later:

##### Ridge
Shrinking the weights (coefficients) of the variables toward zero, without eliminating any of them.

##### Lasso
Completely eliminating low-impact variables.

##### Elastic Net
A combination of Ridge and Lasso.

# II. Complex vs. Simple Models

### Complex models
- More inputs
- Low "bias"
- Overfitting
- Low error on training data, but higher error on testing data
- In other words, __has trouble__ creating overarching themes that you can apply to unseen data

### Simple models
- Fewer inputs
- High "bias"
- More error on training data, but often more accurate on testing data
- In other words, __better__ at creating overarching themes that you can apply to unseen data

# III. Illustrative Example

### Real-Life Scenario
I was on a web forum for CPA candidates and wanted to predict who was going to pass their licensing exams. Forum members were using their online practice test scores from a week before their official exam dates as a wet finger in the air to gauge whether they'd pass. Assuming the following data points were available, I wondered whether it would help to incorporate additional factors into a model:

- How many hours they logged in the online testing platform
- Quantity of practice questions they attempted to answer (correct or not)
- The prestige of their undergrad college
- GPA
- Years of industry experience

### Example Reasons Why This is Flawed
- The number of questions attempted is probably closely correlated with the number of hours logged, so the two inputs carry mostly the same information
- More experience might mean a better command of certain topics, so you would assume a higher probability of passing. However, people further out from school may have grown rustier when it comes to academic challenges because they've *"been out of school too long"* and lost their edge. Others may have grown complacent: *"I already got the jobs I wanted, no need for a CPA"*
- GPA doesn't map to school prestige 1:1, i.e. going to a more prestigious school may lower your GPA because of a more intense curriculum, a higher bar set by the professors, or a less generous curve due to more competitive classmates

Bottom line: we shouldn't try to cram in every data point we can find.

# IV. Math Explanations and Alpha
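One hyperparameter we set before the experiment is __"alpha"__, which controls how strong the penalty is. A higher alpha means we are more aggressive about shrinking (in Ridge) or eliminating (in Lasso) the coefficients of input variables; an alpha of 0 applies no penalty at all.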
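As a sketch of where alpha enters, here are the two objectives in a common textbook form. The notation is assumed here rather than taken from these notes: $y_i$ is the target for row $i$, $x_{ij}$ is input $j$ of row $i$, $w_j$ is the coefficient of input $j$, and there are $n$ rows and $p$ inputs. Libraries may scale the error term differently (for example, dividing it by $2n$), which changes what a given alpha value means.

$$
\text{Ridge:}\quad \min_{w}\; \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} w_j x_{ij}\Big)^2 \;+\; \alpha \sum_{j=1}^{p} w_j^2
$$

$$
\text{Lasso:}\quad \min_{w}\; \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} w_j x_{ij}\Big)^2 \;+\; \alpha \sum_{j=1}^{p} \lvert w_j \rvert
$$

In both cases the first term is the training data error and the second term is the penalty. With $\alpha = 0$ the penalty disappears and both reduce to ordinary least squares.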

### 1. Ridge Regression
Mechanically, this method adds a penalty based on the squared values of the coefficients (an L2 penalty) on top of the training data error. Conceptually, this is best for shrinking the coefficients of variables, under the assumption that most/all of your variables are useful.
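A minimal sketch of that shrinking behavior, assuming the simplest possible case: one input, no intercept, and a penalty of alpha times the squared coefficient. The function name `ridge_coefficient` and the toy numbers are made up for illustration. In this one-input case the best coefficient has a closed form, `sum(x*y) / (sum(x*x) + alpha)`, so raising alpha pulls the coefficient toward zero without ever making it exactly zero.

```python
def ridge_coefficient(xs, ys, alpha):
    """Coefficient w that minimizes sum((y - w*x)**2) + alpha * w**2 for one input."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)


# Hypothetical toy data that follows y = 2x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# A larger alpha gives a smaller coefficient, but never exactly zero.
for alpha in [0.0, 1.0, 14.0, 1000.0]:
    w = ridge_coefficient(xs, ys, alpha)
    print(f"alpha={alpha:>6}: coefficient={w:.4f}")
```

With alpha at 0 the sketch recovers the unpenalized coefficient of 2; each larger alpha gives a smaller, but still non-zero, coefficient.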

Assume a proposed model is 1(x) + 2(y) + 10(z), with coefficients A = 1, B = 2, and C = 10.

The __Ridge__ penalty would be proportional to 1 + 4 + 100 = 105 (the sum of the squared coefficients, scaled by alpha). Coefficient C is responsible for 100/105 = 95.2% of the penalty. However, that doesn't necessarily translate to proportional shrinkage. How much each coefficient shrinks depends on the data: a coefficient whose input does a lot to reduce training error resists the penalty, while coefficients on inputs that contribute little can take disproportionately larger reductions.

The new model might be 0.8(x) + 1.6(y) + 9(z).
Coefficients A and B took a 20% reduction and C only had a 10% reduction.
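The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the bookkeeping: the "after" coefficients are the illustrative guesses from the text, not the output of an actual Ridge fit, and the variable names are made up here.

```python
# Coefficients before and after the illustrative Ridge shrinkage.
before = {"x": 1.0, "y": 2.0, "z": 10.0}
after = {"x": 0.8, "y": 1.6, "z": 9.0}

# Ridge penalty (ignoring the alpha multiplier): sum of squared coefficients.
squared = {name: coef ** 2 for name, coef in before.items()}
ridge_penalty = sum(squared.values())

# Share of the penalty each coefficient is responsible for.
penalty_share = {name: sq / ridge_penalty for name, sq in squared.items()}

# Relative reduction of each coefficient.
reduction = {name: (before[name] - after[name]) / before[name] for name in before}

for name in before:
    print(f"{name}: {penalty_share[name]:.1%} of penalty, {reduction[name]:.0%} reduction")
```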

### 2. Lasso Regression
Mechanically, this method adds a penalty based on the absolute values of the coefficients (an L1 penalty) on top of the training data error. Conceptually, this is about removing useless variables completely: the penalty acts like a __threshold__ (governed by alpha) that wipes out any coefficient falling below it and shrinks the ones that survive.

Assume our previously proposed model of 1(x) + 2(y) + 10(z)

If the effective threshold is 3, then x and y will be wiped out because their coefficients (1 and 2) are both below 3. The coefficient on z survives, but it typically shrinks as well.
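The thresholding idea can be sketched with a "soft threshold" rule. This is a simplification resting on assumptions: it matches what Lasso does only in the special case where the inputs are uncorrelated and on the same scale (an orthonormal design), and how the threshold relates to alpha depends on how the error term is scaled. The function name `soft_threshold` is made up for this sketch.

```python
import math


def soft_threshold(coefficient, threshold):
    """Shrink a coefficient toward zero by `threshold`; zero it out if it is smaller."""
    shrunk_size = max(abs(coefficient) - threshold, 0.0)
    return math.copysign(shrunk_size, coefficient)


# Coefficients of the proposed model 1(x) + 2(y) + 10(z).
coefficients = {"x": 1.0, "y": 2.0, "z": 10.0}
threshold = 3.0

lasso_like = {name: soft_threshold(c, threshold) for name, c in coefficients.items()}
print(lasso_like)  # {'x': 0.0, 'y': 0.0, 'z': 7.0}
```

Under these assumptions x and y are eliminated, and z is kept but shrunk from 10 to 7.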


### 3. Elastic Net Regression
Keeping this [video](https://www.youtube.com/watch?v=1dKRdX9bfIo&ab_channel=StatQuestwithJoshStarmer) here as a placeholder, since it got me halfway there.
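As a rough sketch of the "combination" idea, the Elastic Net penalty can be written as a weighted mix of the Lasso (absolute value) and Ridge (squared) penalties. The version below follows one common parameterization, the one scikit-learn documents for its `ElasticNet` estimator, where `l1_ratio` sets the mix; other libraries scale things differently. The function name `elastic_net_penalty` is made up for illustration.

```python
def elastic_net_penalty(coefficients, alpha, l1_ratio):
    """Mix of L1 and L2 penalties; l1_ratio=1 is pure Lasso-style, 0 is pure Ridge-style."""
    l1 = sum(abs(c) for c in coefficients)  # Lasso part: sum of absolute values
    l2 = sum(c ** 2 for c in coefficients)  # Ridge part: sum of squares
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2


# Coefficients of the proposed model 1(x) + 2(y) + 10(z).
coefficients = [1.0, 2.0, 10.0]

for l1_ratio in [0.0, 0.5, 1.0]:
    penalty = elastic_net_penalty(coefficients, alpha=1.0, l1_ratio=l1_ratio)
    print(f"l1_ratio={l1_ratio}: penalty={penalty}")
```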


# V. Early Stopping

### What is it?
This is not a predictive model. Rather, this is a method to use the aforementioned predictive models better. We talked about Gradient Descent as an iterative way to find a minimum error (or maximum value).

Early stopping means we stop iterating when we've reached a point of diminishing returns, usually judged on data held out from training (validation data):
- the model is not getting more accurate with each subsequent iteration, OR
- even worse, the model is starting to get less accurate
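The stopping rule can be sketched as a small loop. This is a minimal illustration, not a full training routine: the list of validation errors is made up, and in real training each value would come from evaluating the model on held-out data after a round of Gradient Descent. The function name `find_stopping_epoch` and the `patience` parameter (how many non-improving iterations to tolerate) are hypothetical names chosen for this sketch.

```python
def find_stopping_epoch(validation_errors, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    Stops once the error has failed to improve for `patience` epochs in a row.
    """
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            # Still improving: remember this epoch and reset the counter.
            best_error = error
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            # No improvement (flat or getting worse).
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch

    # Never triggered: ran through every epoch.
    return best_epoch, len(validation_errors) - 1


# Made-up validation errors: improving at first, then getting worse.
errors = [0.90, 0.70, 0.60, 0.61, 0.63, 0.65, 0.50]
best_epoch, stop_epoch = find_stopping_epoch(errors, patience=2)
print(best_epoch, stop_epoch)  # 2 4
```

In this sketch we would keep the model from epoch 2. Training stops at epoch 4, so the later dip to 0.50 is never seen. That is the trade-off of a small `patience` value.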

### Why is it important?
The core benefit is one of accuracy: quitting while you're ahead.

However, a secondary benefit is reduced computational load on your device's processing power and memory.