Date: 17/09/2024

## Machine Learning Day 2


# Linear and Logistic Regressions

### SoftUni Summary
[Link](https://softuni.bg/trainings/resources/officedocument/104338/material-summary-machine-learning-september-2024/4488)

## Machine Learning Basics
1. **Occam's Razor**: The principle of preferring the simpler of two models or explanations when both fit the data equally well. In machine learning, this favors models with fewer parameters, which tend to generalize better and carry a lower risk of overfitting.

## Linear Regression
1. **Linear Models**: These models are lightweight, fast, and still effective in many applications, especially when there’s a linear relationship between the features and the target variable.
2. **Use Case**: Linear regression works best when the independent variable (X) has a linear correlation with the dependent variable (y).
3. **Matrix Notation**:
   $$\tilde{y} = \mathbf{a}^T \mathbf{X}$$
   represents the linear equation for prediction, where $\mathbf{a}$ is the vector of coefficients and $\mathbf{X}$ holds the feature values. The intercept $b$ can be folded into $\mathbf{a}$ by adding a constant feature equal to 1.
4. **Loss Function**: The error for each prediction is:
   $$d_i = (y_i - \tilde{y}_i)^2$$
   where $\tilde{y}_i$ is the predicted value and $y_i$ is the actual value.
5. **Total Cost Function (J)**:
   $$J = \frac{1}{n} \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2$$

   This is the Mean Squared Error (MSE), which measures the overall model error across all $n$ samples.
6. **Objective**: We aim to find the parameters $a$ and $b$ that minimize $J$ to achieve the most accurate predictions.
7. **Loss Function Insight**: It evaluates how well the model’s predictions align with the actual data, acting as a gauge of model accuracy.
8. **Optimizer / Solver**: An algorithm that minimizes the cost function $J$, thereby improving model performance.
9. **Gradient Descent**: A popular optimization algorithm that minimizes the cost function by adjusting the model parameters.
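10. **Zero Gradients**: If the partial derivative with respect to a parameter is zero, small changes to that parameter do not change the cost, so gradient descent leaves that parameter unchanged.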
11. **Gradient**: The multidimensional generalization of the derivative: a vector of partial derivatives giving the slope of the cost function with respect to each model parameter.
12. **Saddle Point / Inflection Point**: A point where the gradient is zero but which is neither a maximum nor a minimum.
13. **Global vs. Local Minimum**: The MSE cost of linear regression is convex, so any local minimum is also the global minimum, which simplifies optimization.
14. **Gradient Descent Process**: It repeatedly moves the model parameters in the direction of steepest descent (the negative gradient) to find the global minimum.
15. **Learning Rate $\alpha$**: The step size of each gradient descent update. Too large a value can overshoot the minimum or diverge, while too small a value makes convergence slow.
16. **Parameters and Hyperparameters**: Parameters are learned from the data, while hyperparameters are set manually (e.g., learning rate).
17. **Scikit-Learn Datasets**: Built-in datasets used for model training and evaluation.
18. **Demo: California Housing Dataset**: Example of using regression models on real-world housing data.
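19. **Scaling Features**: `MinMaxScaler` rescales each feature to a fixed range ([0, 1] by default), while `StandardScaler` standardizes each feature to zero mean and unit variance. Putting features on comparable scales often improves model performance and helps gradient-based optimizers converge.
20. **Fit and Transform**: Use `fit_transform` on the training data to learn the scaling parameters and apply them, then use `transform` (not `fit_transform`) on test or future data so that the same parameters are reused.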
21. **Geospatial Features**: `geopandas` can be used for feature engineering by incorporating geographical data like latitude/longitude.
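22. **Model Coefficients**: Use `model.coef_` to inspect each feature’s weight in the fitted model. Coefficients are only directly comparable when the features are on the same scale.
23. **R² Score**: The proportion of the variance in the target that the model explains. A score of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean.
24. **QR Decomposition**: A direct, non-iterative way to solve the least-squares problem, used as an alternative to iterative optimization such as gradient descent.
25. **Dealing with Outliers**: Outliers can skew linear regression models because the squared error penalizes large deviations heavily.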
26. **Outliers vs. Anomalies**: Outliers are data points that deviate significantly, while anomalies may indicate errors or rare events.
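27. **RANSAC (Random Sample Consensus)**: A robust fitting method that repeatedly fits the model to random subsets of the data and keeps the fit that agrees with the most points, so outliers have little influence on the result.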
28. **Inliers vs. Outliers**: Use `ransac_model.inlier_mask_` to identify which data points fit the model (inliers) versus those that don't (outliers).
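29. **Polynomial Regression**: Fits a nonlinear relationship by adding polynomial features (such as $x^2$ and $x^3$) to the inputs. The model remains linear in its coefficients, so the same linear regression machinery applies.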
30. **Curse of Dimensionality**: As the number of features increases, the model’s performance may degrade due to data sparsity.
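
The cost function, gradient, and learning rate items above can be tied together in a small NumPy sketch of batch gradient descent on the MSE. The function name `fit_linear_gd`, the synthetic data, and the values of `alpha` and `n_iters` are assumptions chosen for illustration, not part of the course material.

```python
import numpy as np


def fit_linear_gd(X, y, alpha=0.1, n_iters=2000):
    """Fit y ~ X @ a + b by batch gradient descent on the MSE cost J."""
    n_samples, n_features = X.shape
    a = np.zeros(n_features)  # coefficients (parameters learned from data)
    b = 0.0                   # intercept
    for _ in range(n_iters):
        error = (X @ a + b) - y
        # Partial derivatives of J = mean((y - y_pred)^2) w.r.t. a and b
        grad_a = (2.0 / n_samples) * (X.T @ error)
        grad_b = (2.0 / n_samples) * error.sum()
        # Step against the gradient; alpha is the learning rate (a hyperparameter)
        a -= alpha * grad_a
        b -= alpha * grad_b
    return a, b


def mse(y, y_pred):
    """Mean Squared Error between actual and predicted values."""
    return float(np.mean((y - y_pred) ** 2))


# Synthetic, noise-free data (hypothetical): y = 3*x1 - 2*x2 + 0.5
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5

a, b = fit_linear_gd(X, y)
final_cost = mse(y, X @ a + b)
print("coefficients:", a, "intercept:", b, "MSE:", final_cost)
```

Because the MSE cost is convex, the starting point (zeros here) does not affect which minimum is reached, only how many steps it takes.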

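The scaling, R², and RANSAC notes in the Linear Regression list can be tried together with scikit-learn. The sketch below uses synthetic data with deliberately corrupted targets rather than the California Housing dataset; the data, the `corrupted` mask, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): y = 4*x + 1 plus small noise.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 4.0 * X[:, 0] + 1.0 + rng.normal(0.0, 0.5, size=200)

# Corrupt the first 20 targets so they behave like outliers.
corrupted = np.zeros(200, dtype=bool)
corrupted[:20] = True
y[corrupted] += 60.0

X_train, X_test, y_train, y_test, bad_train, bad_test = train_test_split(
    X, y, corrupted, test_size=0.25, random_state=0
)

# Learn the scaling parameters on the training data only, then reuse them.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# RANSACRegressor wraps a LinearRegression estimator by default.
ransac_model = RANSACRegressor(random_state=0)
ransac_model.fit(X_train_scaled, y_train)

inlier_mask = ransac_model.inlier_mask_  # True where a point fits the model
outlier_mask = ~inlier_mask

# Coefficient of the final linear model, converted back to original units.
slope = ransac_model.estimator_.coef_[0] / scaler.scale_[0]

# R^2 measured on the uncorrupted test points only.
clean_r2 = r2_score(
    y_test[~bad_test], ransac_model.predict(X_test_scaled[~bad_test])
)

print("outliers flagged in training data:", int(outlier_mask.sum()))
print("slope in original units:", slope)
print("R^2 on clean test points:", clean_r2)
```

Dividing the coefficient by `scaler.scale_` undoes the standardization, which makes the fitted slope comparable to the slope of 4 used to generate the data.
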
## Logistic Regression
1. **Binary Classification**: Logistic regression is primarily used for binary classification problems where the output is either 0 or 1.
2. **Logistic Regression Equation**: Derived from linear regression, but the output is passed through a logistic (sigmoid) function to produce probabilities.
3. **Generalized Linear Model (GLM)**: Logistic regression is a GLM: it uses the sigmoid function to map the continuous output $z$ of the linear model into a probability between 0 and 1:
   $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
4. **Quantization**: The predicted probability is thresholded (typically at 0.5) to turn the continuous output into a binary prediction (0 or 1).
5. **Why Sigmoid over Step Function**: The sigmoid transitions smoothly between 0 and 1 and is differentiable everywhere. The step function jumps abruptly between 0 and 1 and has zero gradient everywhere else, making it unsuitable for gradient-based optimization.
6. **Loss Function**: Uses cross-entropy loss, which measures the mismatch between the predicted probabilities and the actual labels:
   * for binary classification:
     $$ \mathcal{L} = - \frac{1}{n} \sum_{i = 1}^n \left[ y_i \log(\tilde{y}_i) + (1 - y_i) \log(1 - \tilde{y}_i) \right] $$
     Where:
     - $n$ is the number of samples
     - $y_i$ is the true label (either 0 or 1)
     - $\tilde{y}_i$ is the predicted probability for the positive class
     - $\log$ is the natural logarithm
   * for multi-class classification:
     $$ \mathcal{L} = - \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{i,j} \log(\tilde{y}_{i,j}) $$

     Where:
     - $k$ is the number of classes
     - $y_{i,j}$ is 1 if sample $i$ belongs to class $j$ and 0 otherwise
     - $\tilde{y}_{i,j}$ is the predicted probability that sample $i$ belongs to class $j$

7. **Demo: MNIST Dataset**: An example of applying logistic regression to the famous MNIST handwritten digits dataset.
8. **Multiclass Classification**: Using one binary logistic regression per digit (10 classes, one-vs-rest), each model estimates the probability of its own digit, and the input is assigned to the digit with the highest probability.
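
The sigmoid, the thresholding step, and the binary cross-entropy loss from the list above can be sketched in plain NumPy. The helper names (`sigmoid`, `binary_cross_entropy`, `fit_logistic_gd`), the two-blob synthetic data, and the hyperparameter values are assumptions for illustration only.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy between labels y and probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)  # keep log() away from 0
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def fit_logistic_gd(X, y, alpha=0.5, n_iters=2000):
    """Fit a logistic regression by batch gradient descent."""
    n_samples, n_features = X.shape
    a = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ a + b)
        error = p - y  # gradient of the cross-entropy w.r.t. z
        a -= alpha * (X.T @ error) / n_samples
        b -= alpha * error.mean()
    return a, b


# Synthetic two-class data (hypothetical): two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-2.0, 1.0, size=(100, 2)),  # class 0
    rng.normal(2.0, 1.0, size=(100, 2)),   # class 1
])
y = np.concatenate([np.zeros(100), np.ones(100)])

a, b = fit_logistic_gd(X, y)
probs = sigmoid(X @ a + b)
preds = (probs >= 0.5).astype(int)  # threshold probabilities at 0.5
loss = binary_cross_entropy(y, probs)
accuracy = float(np.mean(preds == y))
print("loss:", loss, "accuracy:", accuracy)
```

With the sigmoid and cross-entropy combined, the gradient with respect to $z$ reduces to the difference between predicted probability and label, which is why the update looks so similar to the linear regression case.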

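The "one logistic regression per digit" idea can be sketched with scikit-learn's `OneVsRestClassifier`. To keep the example offline and fast, it uses the small built-in `load_digits` dataset (8x8 images) as a stand-in for MNIST; that substitution, the split settings, and `max_iter=1000` are assumptions, not the setup used in the demo.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MinMaxScaler

# Small built-in digits dataset, used here as a stand-in for MNIST.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    test_size=0.25,
    random_state=0,
    stratify=digits.target,
)

# Scale pixel values using parameters learned from the training data only.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# One binary logistic regression per class (one-vs-rest).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train_scaled, y_train)

n_models = len(ovr.estimators_)  # one fitted binary classifier per digit
accuracy = ovr.score(X_test_scaled, y_test)
print("binary models:", n_models, "test accuracy:", accuracy)
```

At prediction time, the wrapper picks the class whose binary model reports the highest score.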