![process](./images/dsprocess.png)

![why](./images/whybother.png)

## CRISP-DM Process
![crisp](./images/crispdm.png)

## Tasks of each CRISP step
![crisp-tasks](./images/crisptasks.png)

![real_life](./images/real_life.png)

### Validation

![val](./images/validation.png)

![tts](./images/tts.png)

Requirements:
- Random sampling of stationary data from the same distribution
- Large enough test dataset
    - Most common split: 80/20
    - Make sure there are at least 1000-3000 samples in the test set
- Refrain from using test data for model comparison and/or parameter optimisation
- Use the test set only on your final model to understand expected performance
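
A minimal sketch of an 80/20 train/test split with scikit-learn. The synthetic data, the coefficients and the choice of `LinearRegression` are assumptions made only for illustration; the point is that the test set is large enough and is scored exactly once, on the final model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=10_000)

# Random 80/20 split: 8,000 training rows and 2,000 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training data only.
model = LinearRegression().fit(X_train, y_train)

# Score the final model on the test set once.
test_r2 = model.score(X_test, y_test)
print(f"test rows: {len(X_test)}, test R^2: {test_r2:.3f}")
```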

![kfc](./images/kfoldcross.png)

- Makes it possible to work with smaller datasets (higher variance)
- We expect the mean score across folds to be closer to the hidden truth than the estimate from a single train/test split (TTS)
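
A short sketch of 5-fold cross-validation on a small dataset, assuming scikit-learn and synthetic data (both are illustrative choices, not part of the original material). Each fold yields one score; the mean is the performance estimate and the standard deviation hints at how much it depends on the sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Small synthetic dataset (an assumption, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Shuffle before splitting so each fold is a random sample.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# One R^2 score per fold: train on 4 folds, evaluate on the held-out fold.
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(f"fold scores: {np.round(scores, 3)}")
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```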

![cvin](./images/cvinsidetts.png)
- By nesting CV inside TTS we can compare multiple models/parameters and select the best one without looking at the test dataset
- The test dataset is still only looked at for your final model
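
One way to sketch CV nested inside a train/test split is scikit-learn's `GridSearchCV`. The `Ridge` model, the `alpha` grid and the synthetic data are assumptions chosen for illustration. The search only ever sees the training portion; the test set is scored once at the end.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data with 3 informative and 7 irrelevant features (assumption).
rng = np.random.default_rng(2)
X = rng.normal(size=(5_000, 10))
coef = np.array([3.0, -2.0, 1.0] + [0.0] * 7)
y = X @ coef + rng.normal(scale=1.0, size=5_000)

# Outer split: the test set is put aside first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Inner 5-fold CV on the training data compares the candidate alphas.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(Ridge(), param_grid={"alpha": alphas}, cv=5, scoring="r2")
search.fit(X_train, y_train)

# By default the best configuration is refit on the whole training set,
# so this single call evaluates the final model on the test set.
best_alpha = search.best_params_["alpha"]
final_r2 = search.score(X_test, y_test)
print(f"best alpha: {best_alpha}, test R^2: {final_r2:.3f}")
```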

![tvt](./images/tvt.png)

- By nesting TTS inside TTS we eliminate the need for training K times
- Make sure your validation dataset is big enough (same rules as test)
- The test dataset is still only looked at for **your final model**
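
A sketch of a train/validation/test split built from two calls to `train_test_split`, assuming a 60/20/20 division, synthetic data and a handful of `Ridge` candidates (all illustrative assumptions). Each candidate is trained once and compared on the validation set; only the chosen model touches the test set.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption, for illustration only).
rng = np.random.default_rng(5)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=1.0, size=10_000)

# First split: hold out 20% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Second split: 25% of the remaining 80% becomes validation (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

# Compare candidates on the validation set; each is trained only once.
val_scores = {}
for alpha in (0.1, 1.0, 10.0, 100.0):
    val_scores[alpha] = Ridge(alpha=alpha).fit(X_train, y_train).score(X_val, y_val)

# Pick the best candidate, then look at the test set once.
best_alpha = max(val_scores, key=val_scores.get)
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_r2 = final_model.score(X_test, y_test)
print(f"best alpha: {best_alpha}, test R^2: {test_r2:.3f}")
```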

### Bias/Variance tradeoff
![bv](./images/bv.png)

- The data your model will see after deployment will not be identical to the data it was trained on.
- If the magnitude of your errors changes a lot depending on the sample, the model may not work well in real life.
- This may be due to your model learning the idiosyncrasies of the training data too well and expecting to find them in all other data (overfitting).


### Underfitting vs Overfitting

![undovr](./images/underoverfit.jpg)

#### Signs of Fit:

- **Overfitting** (high variance of error):
    - Very high training performance
    - Training performance much higher than validation
- **Underfitting** (high average error):
    - Poor performance in the training dataset
- **Good fit**:
    - Training performance just a bit over test performance
- **Unknown fit** (assumption violation):
    - Test performance higher than Train performance
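
The signs above can be checked by comparing training and validation error side by side. The sketch below uses plain NumPy polynomial fits of increasing degree on synthetic data; the sine-shaped target, the noise level and the degrees are assumptions picked for illustration. Typically the low degree shows high error on both sets (underfitting), while the very high degree shows a low training error with a noticeably worse validation error (overfitting).

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    """Draw n noisy samples from a hypothetical nonlinear relationship."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x, y

# Small training set, larger validation set from the same distribution.
x_train, y_train = make_data(30)
x_val, y_val = make_data(1000)

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

# Fit polynomials of increasing complexity and record both errors.
results = {}
for degree in (1, 4, 15):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_val, y_val))

for degree, (train_mse, val_mse) in results.items():
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")
```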


#### Solving fitting issues:

- When **overfitting**:
     - Increase the size of the training dataset (data variance)
     - Reduce the complexity of the model (model variance)
     - Feature elimination (drop redundant / irrelevant features)
- When **underfitting**:
     - Increase the size of the training dataset (data bias)
     - Increase the complexity of the model (model bias)
     - Feature engineering (add new relevant variables)


![truth](./images/truth.png)

### Feature engineering vs Regularization

Take place in this step:

![substep](./images/substep.png)

#### Feature engineering: (add new relevant variables)
 - Interaction terms
 - Polynomials
 - Ratios created from existing variables
 - Converting a continuous variable to a bucketed variable...
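
A small pandas sketch of the four kinds of engineered features listed above. The column names (`income`, `debt`, `age`), the values and the bucket boundaries are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions.
df = pd.DataFrame({
    "income": [30_000, 52_000, 75_000, 120_000],
    "debt": [5_000, 20_000, 15_000, 60_000],
    "age": [22, 35, 48, 61],
})

# Interaction term: product of two existing variables.
df["income_x_age"] = df["income"] * df["age"]

# Polynomial term: square of an existing variable.
df["age_squared"] = df["age"] ** 2

# Ratio of two existing variables.
df["debt_to_income"] = df["debt"] / df["income"]

# Bucketed version of a continuous variable: (0, 30], (30, 50], (50, 120].
df["age_bucket"] = pd.cut(
    df["age"], bins=[0, 30, 50, 120], labels=["young", "middle", "senior"]
)

print(df)
```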
 
#### Regularization: adds "bias" to reduce model complexity
- The size of "alpha" can control for multicollinearity OR reduce the variable set
- Let's look at [this link](https://bradleyboehmke.github.io/HOML/regularized-regression.html#ridge) to explore more
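
A hedged sketch of the two effects of "alpha", using scikit-learn's `Ridge` and `Lasso` on synthetic data with one nearly duplicated feature. The data, the alpha values and the feature layout are assumptions for illustration. Ridge shrinks coefficients and spreads weight across the collinear pair, while Lasso tends to set some coefficients exactly to zero, which reduces the variable set.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): column 1 is a near copy of column 0,
# and only columns 0 and 2 actually drive the target.
rng = np.random.default_rng(4)
n = 500
X = rng.normal(size=(n, 6))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=n)
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Standardise so the penalty treats every feature on the same scale.
X_std = StandardScaler().fit_transform(X)

# Larger alpha means a stronger penalty in both models.
ridge = Ridge(alpha=10.0).fit(X_std, y)
lasso = Lasso(alpha=0.1).fit(X_std, y)

# Count how many coefficients Lasso set exactly to zero.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("coefficients zeroed by lasso:", n_zero_lasso)
```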


#### **It is an iterative process**

![golden](./images/golden.png)


- **Ockham's Razor**: The simpler a model, the more likely it is to generalise
- **Always ask yourself**: “Will the model be good with a new sample of data?”
- Always validate your model on an unseen dataset (test). **And only check it once!!**




### Delivering value

Your job is **not** to create high performance models

They pay you to _**solve problems**_


## Which tools do I use?
Ridge? Lasso? AIC? BIC? -> all tools to help us find the best modeling specifications

Use Statsmodels? Use Sklearn?

Well, it goes all the way back to:

![crispbo](./images/crispbo.png)

### Working with a small dataset?
Lasso or Ridge likely isn't the best solution

### Huge dataset with over 500 variables?
Probably not the best time to use statsmodels

## Competing model specifications

**Model 1**: 4 normalized original variables, 2 interaction terms, one polynomial, one engineered feature<br>

**Model 2**: starting from the full list of 12 original variables, 42 interaction terms, and squared terms of all variables, a lasso regression reduced the variable set to 6

- Which model is better?
- Which should be used?
- What if their R-squareds are the same?


## Two Scenarios:

**A**: You want to predict credit scores so you can give people recommendations on what to change to get higher scores.

**B**: You want to show that your machine learning shop has reverse engineered the Equifax regression algorithm so investors will fund your think tank.

### Thoughts:

Statsmodels vs sklearn

Regularization or not?