# Machine Learning Day 3

## Model Training and Improvement

- **Homework**: Review and summarize two classic ML articles with peer review. Reproduce the results of at least one of them. At least one of the two articles must have been published after January 1, 2021.
- **Quote**: "Machine Learning is the process we follow to get the right approximators that work in practice." — Yordan

---

## Bias-Variance Tradeoff

1. **Diabetes Dataset Demo**: A walkthrough of model training using the Diabetes dataset.
2. **Create Pipeline**: Use `sklearn.pipeline` to automate processes.
3. **Pitfall**: Fit the scaler on the training data only, then apply that same fitted scaler to the test data and to any new data in production.
4. **One-Hot Encoding**: Using `pd.get_dummies()` to encode categorical variables.
5. **Pipeline**: A combination of preprocessing (e.g., scaling, encoding) and model fitting.
6. **Components**: A pipeline generally consists of a **scaler**, **one-hot encoder**, and a **model**.
7. **Sample Attributes and Target**: Separate feature and target columns in the dataset.
8. **Preprocessor**: Use `sklearn.compose.ColumnTransformer()` to preprocess specific columns.
9. **FunctionTransformer**: Use `sklearn.preprocessing.FunctionTransformer()` to wrap an arbitrary function as a pipeline step.
10. **Nesting Pipelines**: Preprocessors and estimators can be nested within one pipeline.
11. **Saving Pipelines**: Use the `pickle` library to dump/load the pipeline for future use.
12. **Prediction**: Once trained, the pipeline can predict new data with consistent preprocessing.
13. **No Perfect Model**: 100% accuracy is unattainable on real-world data because the data itself contains noise.
14. **Irreducible Error (Noise)**: Also called the **Bayes error**, this is the noise inherent in the data; no model can remove it.
15. **Variance**: Error from the model's sensitivity to the particular training sample. A high-variance model changes substantially when trained on different data.
16. **Bias**: Systematic error caused by wrong or overly simple assumptions built into the model.
17. **Bias-Variance Tradeoff**: The balance between underfitting (high bias) and overfitting (high variance).
18. **Underfitting vs Overfitting**: Underfitting fails to capture the underlying trend, while overfitting performs well on the training data but poorly on unseen data.
19. **Not Always a Tradeoff**: Sometimes a change lowers the total error outright. Well-tuned regularization, for example, can cut variance sharply at the cost of only a small increase in bias.
20. **Optimal Model**: The ideal model strikes a balance, often by accepting a small amount of bias, and uses methods like regularization to prevent overfitting.
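
The pipeline steps above (separating attributes and target, a column-wise preprocessor, a nested model, saving, and predicting) can be sketched roughly as follows. This is a minimal illustration, not the demo from the lecture: it assumes scikit-learn's bundled Diabetes dataset, treats the `sex` column as categorical purely to show one-hot encoding, and picks `Ridge` as a stand-in model. The step names (`scale`, `one_hot`, `preprocess`, `model`) are hypothetical.

```python
import pickle

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Separate attributes (features) and target.
diabetes = load_diabetes(as_frame=True)
attributes, target = diabetes.data, diabetes.target

# Assumption for illustration: "sex" is categorical, all other columns are numeric.
categorical_columns = ["sex"]
numeric_columns = [c for c in attributes.columns if c not in categorical_columns]

# Preprocessor: a different transformer for each group of columns.
preprocessor = ColumnTransformer([
    ("scale", StandardScaler(), numeric_columns),
    ("one_hot", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
])

# The preprocessor is nested inside the pipeline, followed by the model.
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", Ridge(alpha=1.0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    attributes, target, test_size=0.3, random_state=42
)

# Scaler and encoder statistics are learned from the training data only.
pipeline.fit(X_train, y_train)
test_r2 = pipeline.score(X_test, y_test)
print(f"Test R^2: {test_r2:.3f}")

# Save and reload the fitted pipeline; here via an in-memory byte string.
restored = pickle.loads(pickle.dumps(pipeline))
print(restored.predict(X_test.head(3)))
```

Because the scaler and encoder live inside the pipeline, calling `predict()` on new data reuses the statistics learned during `fit()`, which is what guards against the scaling pitfall noted in the list.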

---

## Applying Regularization

1. **Regularization**: A method to find a better bias-variance tradeoff by controlling the model's complexity.
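2. **Weight Coefficients**: A penalty term computed from the model's weight coefficients is added to the loss function to penalize complexity.
3. **Lambda (λ)**: Controls the strength of regularization; larger values penalize the weights more heavily. In scikit-learn's `Ridge`, `Lasso`, and `ElasticNet` this parameter is called `alpha`.
4. **L2 Regularization**: Penalizes the squared L2 norm of the weights (Ridge Regression), shrinking large weights toward zero.
5. **L1 Regularization**: Penalizes the L1 norm of the weights (Lasso Regression), driving some weights to exactly zero.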
6. **ElasticNet**: Combines both L1 and L2 regularization for a balanced approach.
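7. **No Regularization in Linear Regression**: scikit-learn's `LinearRegression` applies no regularization, largely for historical reasons. Use `Ridge`, `Lasso`, or `ElasticNet` for the regularized versions.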
8. **Lasso, Ridge, ElasticNet**: Popular methods for regularization. Lasso (L1), Ridge (L2), and ElasticNet (combination of both).
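9. **Logistic Regression**: scikit-learn's `LogisticRegression` is regularized by default. `C` is the inverse of the regularization strength, so a large `C` value (e.g., `1e12`) means little or no regularization.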
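
To make the difference between the penalties concrete, the sketch below fits an unregularized linear model next to Ridge, Lasso, and ElasticNet on synthetic data. The dataset (20 features, only 5 of them informative) and the `alpha` values are assumptions chosen for illustration, not values from the lecture.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Synthetic data: 20 features, of which only 5 actually influence the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

models = {
    "ols": LinearRegression(),                             # no penalty
    "ridge": Ridge(alpha=10.0),                            # L2 penalty
    "lasso": Lasso(alpha=5.0),                             # L1 penalty
    "elastic_net": ElasticNet(alpha=5.0, l1_ratio=0.5),    # mix of L1 and L2
}

zero_counts = {}
weight_norms = {}
for name, model in models.items():
    model.fit(X, y)
    # How many weights were driven to exactly zero?
    zero_counts[name] = int(np.sum(model.coef_ == 0.0))
    # Overall size of the weight vector (L2 norm).
    weight_norms[name] = float(np.linalg.norm(model.coef_))
    print(f"{name:12s} zero weights: {zero_counts[name]:2d}  "
          f"weight norm: {weight_norms[name]:8.2f}")
```

In a run like this, Ridge shrinks the weight vector without zeroing anything, while Lasso removes some features entirely, which is why L1 regularization doubles as a feature selection tool.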

---

## Training and Testing

1. **Train-Test Split**: Generally, a 70/30 split is used to partition the data.
2. **Test on Unseen Data**: Never evaluate on the training data; the result would be optimistically biased.
3. **Randomized Samples**: Shuffle the data before splitting so that both sets are representative of the whole.
4. **Stratified Sampling**: For classification tasks, pass `stratify=target` to `train_test_split()` so that the train and test sets preserve the original class proportions.
5. **Test Set Size**: The absolute size of the test set matters more than the ratio. Choose a size large enough for a reliable evaluation.
6. **Specialized Test Sets**: Sometimes, specialized sets (e.g., emoji datasets for sentiment analysis) are necessary.
7. **Model Performance**: Use `pipeline.score()` to evaluate model performance. It returns the final estimator's default metric: accuracy for classifiers, R² for regressors.
8. **Multiple Metrics**: One metric is not enough. Use several to get a full picture.
9. **Regression Metrics**: Metrics like **R²** are used to evaluate regression models.
10. **Classification Metrics**: For classification models, use metrics like **accuracy** and the **classification report**.
11. **Metrics Module**: Use `sklearn.metrics` to access a wide range of performance metrics.
12. **Residuals**: The differences between the actual and the predicted values; inspecting them shows where a regression model goes wrong.
13. **Confusion Matrix**: Useful for evaluating classification models.
14. **ROC Curve**: Receiver Operating Characteristic curve for visualizing performance in binary classification.
15. **Limitations of ROC**: The ROC curve is defined for binary classification. For multiclass problems, use a one-vs-rest ("1 vs. all") approach and draw one curve per class.
16. **ROC Curve Clarification**: The diagonal dashed line marks random guessing. The ROC curve should not fall below it, as this would imply worse-than-random performance.
17. **Learning and Validation Curves**: Useful tools for understanding model performance across training sizes and parameter values.
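
The evaluation workflow above might look like the following sketch. It assumes scikit-learn's bundled breast cancer dataset and a scaled `LogisticRegression` as an arbitrary binary classifier; neither comes from the lecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Shuffled 70/30 split; stratify keeps the class proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)                 # hard class labels
y_score = clf.predict_proba(X_test)[:, 1]    # probability of the positive class

# Several metrics, all computed on unseen data.
accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print(f"Accuracy: {accuracy:.3f}")
print(f"ROC AUC:  {auc:.3f}")
print("Confusion matrix (rows = actual, columns = predicted):")
print(matrix)
print(classification_report(y_test, y_pred))
```

For a classifier, `clf.score(X_test, y_test)` returns the same accuracy as `accuracy_score` here. The confusion matrix and the area under the ROC curve add information that a single accuracy number hides.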

---

## Cross-Validation

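1. **Cross-Validation**: Split the data into train, validation, and test sets. The validation set guides model selection, and the test set gives a final estimate of how well the model generalizes.
2. **K-Fold Cross-Validation**: Splits the dataset into K parts (folds) and runs K separate training/validation cycles, each holding out a different fold, for a more robust performance estimate.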
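
A minimal K-fold sketch, assuming the Diabetes dataset, K = 5, and a scaled `Ridge` model as placeholders:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# Five shuffled folds; each fold is held out exactly once.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how often each sample lands in a validation fold.
validation_counts = np.zeros(len(X), dtype=int)
for train_idx, val_idx in folds.split(X):
    validation_counts[val_idx] += 1

# One R^2 score per fold; the pipeline is refit from scratch on every cycle.
scores = cross_val_score(model, X, y, cv=folds)
print("Per-fold R^2:", np.round(scores, 3))
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

The spread of the per-fold scores gives a rough sense of how much the estimate depends on the particular split, which a single train/test split cannot show.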

---

## Model Tuning and Selection

1. **Hyperparameter Tuning**: Optimize the model's hyperparameters (settings chosen before training rather than learned from the data) to improve performance.
2. **Grid Search**: Use `GridSearchCV()` to exhaustively search through predefined parameter grids.
3. **Parameter Grid**: Define a grid of hyperparameters for tuning.
4. **Randomized Search**: Use `RandomizedSearchCV()` to sample a fixed number of settings at random from the search space, which is often cheaper than an exhaustive grid.
5. **Best Params**: The best parameters, chosen by cross-validation performance, are available in the `best_params_` attribute.
6. **CV Results**: The `cv_results_` attribute holds the cross-validation results for every hyperparameter setting tried.
7. **Hyperopt**: An alternative to `GridSearchCV()` that offers more advanced search algorithms.
8. **Optuna**: A general-purpose hyperparameter optimization framework, often used for tuning artificial neural networks (ANN).
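
The two scikit-learn search strategies can be compared in a short sketch. The pipeline, the `alpha` grid, and the log-uniform sampling range below are assumptions made for the example, as are the step names `scale` and `model`.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
pipeline = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

# Parameter grid: "<step name>__<parameter>" addresses a step inside the pipeline.
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# Grid search: tries every combination in the grid with 5-fold cross-validation.
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X, y)
print("Grid search best params:", grid.best_params_)
print("Mean CV score per setting:", grid.cv_results_["mean_test_score"])

# Randomized search: draws n_iter settings from a distribution instead.
random_search = RandomizedSearchCV(
    pipeline,
    {"model__alpha": loguniform(1e-3, 1e3)},
    n_iter=10,
    cv=5,
    random_state=0,
)
random_search.fit(X, y)
print("Randomized search best params:", random_search.best_params_)
```

After fitting, both search objects refit the best setting on the full data by default, so `grid.predict()` can be used directly.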

---

## Feature Selection and Feature Engineering

1. **Occam's Razor**: Simpler models are often better.
2. **Feature Reduction**: Reduce the number of features to improve performance and avoid overfitting.
3. **Focus on Relevant Features**: Only keep features that contribute to the model's predictive power.
4. **Removing Irrelevant Features**: Remove any irrelevant or redundant features that do not help in prediction.
5. **Regularization**: Penalizes model complexity. L1 regularization in particular drives the weights of irrelevant features to zero, which effectively removes them.
6. **Dimensionality Reduction**: Techniques like PCA (Principal Component Analysis) can reduce the dimensionality of the feature space.
7. **Feature Engineering**: Creating new meaningful features from raw data, such as using **geopandas** for geospatial data or **clustering** techniques. This requires deep domain knowledge and expertise.
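
Two of the reduction techniques above, sketched under assumptions: L1-based feature selection via `SelectFromModel` with a `Lasso` (the `alpha=5.0` value is arbitrary), and PCA keeping enough components to explain 95% of the variance (also an arbitrary threshold). The Diabetes dataset serves only as a convenient example.

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Feature selection: keep only the features whose Lasso weight is non-zero.
selector = SelectFromModel(Lasso(alpha=5.0)).fit(X_scaled, y)
X_selected = selector.transform(X_scaled)
print("Features kept by Lasso:", X_selected.shape[1], "of", X.shape[1])
print("Selection mask:", selector.get_support())

# Dimensionality reduction: a float n_components keeps the smallest number of
# principal components that explain at least that share of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print("PCA components kept:", X_reduced.shape[1])
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.3f}")
```

Feature selection keeps a subset of the original, interpretable columns. PCA instead builds new columns as linear combinations of all the originals, which usually costs interpretability.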
