# **Machine Learning**

Machine learning (ML) is a subset of AI focused on building systems that learn from data and make decisions based on it. Instead of being explicitly programmed, a machine learning model learns patterns from historical data and applies those patterns to new, unseen data to make predictions or decisions.


The foundation of ML is data. Models are trained on data, which can be numerical, textual, image-based, or otherwise. The quality and quantity of the data significantly impact the model's performance
    
    TRASH IN -> TRASH OUT
 
<hr>

## **Types of ML**

### **1. Supervised Learning:**

The model is trained on labeled data (data that already contains the correct output)

*Example*: spam email classification (spam or not spam)

*Example models:*
- Linear Regression - used for predicting continuous values
- Logistic Regression - used for binary classification problems
- Support Vector Machines (SVM) - a classification model that finds the optimal hyperplane to separate classes 
- Decision Trees - a model that splits the data into branches based on feature values to make decisions
- Random Forest - an ensemble of decision trees used for classification or regression tasks 
- k-Nearest Neighbors (KNN) - a simple algorithm that classifies data based on the majority class of its neighbors 
- Neural Networks - models built from layers of simple connected units, loosely inspired by the brain; with many layers (deep learning) they can learn complex patterns in data
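
As a rough illustration of learning from labeled data, here is a minimal k-Nearest Neighbors sketch in plain Python. The two features (number of links, number of exclamation marks) and all data points are made up for the example; a real spam filter would use far richer features.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy labeled data (hypothetical): (number of links, number of exclamation marks)
emails = [(0, 0), (1, 0), (0, 1), (7, 5), (8, 6), (6, 7)]
labels = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]

print(knn_predict(emails, labels, (7, 6)))  # spam
print(knn_predict(emails, labels, (1, 1)))  # not spam
```

The labels are what make this supervised: the correct answer for every training point is known in advance.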


### **2. Unsupervised Learning:**

The model works with unlabeled data and tries to find hidden patterns or structures within the data

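*Example models:*
- k-Means - a clustering algorithm that groups data points around k centroids
- Principal Component Analysis (PCA) - a dimensionality reduction technique that projects data onto the directions of greatest variance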

*Example*: clustering customers into groups based on purchasing behavior
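
A small sketch of the clustering idea, using k-means (one common clustering algorithm) on one-dimensional data. The spend values and the starting centroids are hypothetical; note that no labels are given to the algorithm.

```python
def kmeans_1d(values, centroids, n_iters=10):
    """Very small k-means for 1-D data with given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(n_iters):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer; no labels are provided.
spend = [12, 15, 14, 95, 102, 99]
centroids, clusters = kmeans_1d(spend, centroids=[0, 50])
print(clusters)  # [[12, 15, 14], [95, 102, 99]]
```

The algorithm discovers a "low spend" and a "high spend" group on its own; naming those groups is left to a human.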

### **3. Reinforcement Learning:**

An agent learns by interacting with an environment and receiving rewards or penalties based on its actions

*Example*: Training a robot to walk by rewarding it for staying upright
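
A walking robot is too large for a short example, so the sketch below uses a much simpler, hypothetical environment: a corridor of 5 positions where the agent is rewarded only for reaching the right end. It uses tabular Q-learning (one RL algorithm among many) and, as a simplification, explores with purely random actions.

```python
import random

N_STATES = 5          # positions 0..4 in a corridor; position 4 is the goal
ACTIONS = (-1, +1)    # step left or step right

def step(state, action):
    """Hypothetical environment: move, stop at the left wall, reward 1 at the goal."""
    next_state = max(0, state + action)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(episodes=50, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # q[state][action_index] is the agent's estimate of long-term reward.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            a = rng.randrange(2)  # explore with uniformly random actions
            next_state, reward = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward
            # reward + discounted value of the best next action.
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q = train()
# Greedy policy learned for the non-goal positions.
policy = ["left" if row[0] > row[1] else "right" for row in q[:-1]]
print(policy)  # ['right', 'right', 'right', 'right']
```

The agent is never told that "right" is correct; it infers this from the rewards it receives.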

<hr>

## **The 3-step Machine Learning recipe**

### **1. Define a model**
The first step is to define the structure of the model we will use to make predictions. The model acts as a mathematical function that takes input data and produces an output (the prediction).

* Model selection depends on the type of problem (classification, regression, etc.), the nature of the data, and the assumptions we can make about the data
* The parameters of the model are often initialized randomly or with some prior knowledge
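
For instance, a linear model can be written as a plain function of the input and its parameters. This is only a sketch; the starting values of `weights` and `bias` below are arbitrary.

```python
def linear_model(x, weights, bias):
    """Prediction = weighted sum of the input features plus a bias term."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

# Parameters start at arbitrary values; training will adjust them later.
weights = [0.5, -1.0]
bias = 2.0
print(linear_model([4.0, 1.0], weights, bias))  # 0.5*4 - 1.0*1 + 2.0 = 3.0
```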

### **2. Formulate a loss function ("How bad is the model?")**
The loss function (also called the cost function or objective function) quantifies how well or poorly the model is performing by comparing the model's predictions to the true values (labels).

* The loss function is essential because it guides the learning process. It tells the model how far off its predictions are from the true outcomes
* The goal is to minimize the loss — meaning the model should learn to make its predictions as accurate as possible


*Common loss functions:*

* Cross-entropy loss for classification tasks (penalizes the model according to the negative log of the probability it assigned to the correct class. In binary classification the label is 0 or 1; in multi-class classification the same penalty is applied to the probability predicted for the true class)
* Mean squared error (MSE) for regression tasks (averages the squared differences between predicted and actual values, penalizing larger errors more heavily)
* Mean absolute error (MAE) for regression tasks (averages the absolute differences between predicted and true values. Unlike MSE, MAE penalizes errors in proportion to their size rather than their square, so it is useful when you do not want the model to be too sensitive to outliers)
* Hinge loss for classification tasks, especially SVMs (penalizes predictions that are wrong or that fall inside the margin between classes, which pushes the model toward a wide-margin separation)
* etc
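
The four losses above can be sketched in a few lines of plain Python. The label conventions are assumptions made for this example: `binary_cross_entropy` expects labels in {0, 1} and predicted probabilities, while `hinge` expects labels in {-1, +1} and raw scores.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """y_true in {0, 1}; p_pred is the predicted probability of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def hinge(y_true, scores):
    """y_true in {-1, +1}; scores are raw (unthresholded) model outputs."""
    return sum(max(0.0, 1 - t * s) for t, s in zip(y_true, scores)) / len(y_true)

# One outlier (the last point) affects MSE much more than MAE.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 14.0]
print(mse(y_true, y_pred))  # 25.0
print(mae(y_true, y_pred))  # 2.5
```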


### **3. Minimize the loss**
After defining the model and loss function, the next step is to minimize the loss function by adjusting the model's parameters (weights and biases) in the direction that reduces the error.

Gradient descent is the most widely used optimization algorithm; it is described below.


#### **Gradient descent**
The main idea behind gradient descent is to adjust the model’s parameters step by step, moving in the direction that reduces the loss. Gradient descent computes the gradient (or slope) of the loss function with respect to each parameter (weight), and uses this to adjust the parameter in the direction that minimizes the loss


**Key Steps in Gradient Descent**
* calculate the gradient - determine how much the loss function changes with respect to each parameter
* update the parameters - adjust the model’s parameters by moving in the direction opposite to the gradient (this is how you minimize the loss)
* repeat - continue this process iteratively, taking small steps toward the minimum of the loss function until the model’s performance stabilizes (i.e. the loss stops decreasing significantly)
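
The three steps can be sketched for a one-feature linear model trained with MSE. The data, learning rate, and step count are chosen for illustration only.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # 1. Calculate the gradient of the MSE loss with respect to w and b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # 2. Update the parameters in the direction opposite to the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        # 3. Repeat (here: for a fixed number of steps).
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))  # approximately 2.0 and 1.0
```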

**Types of Gradient Descent**
1. *Batch Gradient Descent*
    * Calculates the gradient using the entire dataset
    * Each update uses the exact gradient of the training loss, but this can be slow and computationally expensive, especially for large datasets
    * Pros: Stable, smooth convergence; with a suitable learning rate it reaches the global minimum for convex losses (for non-convex losses it may settle in a local minimum)
    * Cons: Slow for large datasets because it needs to process all data points before updating the model
2. *Stochastic Gradient Descent (SGD)*
    * Instead of calculating the gradient using the entire dataset, SGD uses a single data point at a time to update the parameters
    * This makes each update much cheaper but more "noisy", because individual updates can be erratic
    * Pros: Faster and can handle large datasets efficiently
    * Cons: It may not converge as smoothly, but can still eventually reach a good solution with enough iterations
3. *Mini-batch Gradient Descent*
    * A compromise between batch gradient descent and SGD. It uses a small subset (mini-batch) of the dataset to compute the gradient and update the parameters
    * This approach is commonly used because it balances speed and accuracy
    * Pros: Faster than batch gradient descent, more stable than SGD
    * Cons: Requires tuning to find a good mini-batch size
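
The three variants differ only in how many data points feed each update, so one function with a `batch_size` parameter can sketch all of them. The toy data is noise-free and the hyperparameters are picked for this example, not general recommendations.

```python
import random

def train(xs, ys, batch_size, learning_rate=0.02, epochs=2000, seed=0):
    """Fit y = w*x + b with MSE; batch_size selects the gradient descent variant."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # Walk through the shuffled data in chunks of batch_size.
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = 2 / len(batch) * sum(e * xs[i] for e, i in zip(errors, batch))
            grad_b = 2 / len(batch) * sum(errors)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 * x + 1 for x in xs]  # noise-free toy data from y = 2x + 1

batch_result = train(xs, ys, batch_size=len(xs))  # batch gradient descent
sgd_result = train(xs, ys, batch_size=1)          # stochastic gradient descent
mini_result = train(xs, ys, batch_size=2)         # mini-batch gradient descent
print(batch_result, sgd_result, mini_result)      # each close to (2.0, 1.0)
```

With the same number of epochs, SGD performs six updates per pass over this dataset while batch gradient descent performs one, which is where the speed difference on large datasets comes from.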

**Learning Rate**

The learning rate determines the size of the steps taken in gradient descent. It's a crucial hyperparameter that impacts how quickly or slowly the model converges. Finding the right learning rate is essential to training a model efficiently:

* if the learning rate is too high, the model might take oversized steps and overshoot the minimum. This can make training unstable and prevent it from converging
* if the learning rate is too low, the model will take tiny steps, resulting in slow convergence and requiring more iterations to reach the minimum
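
Both failure modes are easy to see on the toy function f(x) = x², whose minimum is at x = 0. The three learning rates below are picked only to show the effect.

```python
def minimize_quadratic(learning_rate, n_steps=20, x0=1.0):
    """Run gradient descent on f(x) = x**2, whose gradient is 2*x."""
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * 2 * x
    return x

print(abs(minimize_quadratic(0.001)))  # too low: still about 0.96 after 20 steps
print(abs(minimize_quadratic(0.1)))    # reasonable: about 0.01, close to the minimum
print(abs(minimize_quadratic(1.1)))    # too high: about 38, each step overshoots further
```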

**Convergence and stopping criteria**

The goal of optimization is to find the minimum loss (or as close to it as possible), and once that is achieved, we stop. Convergence happens when the model's parameters stabilize and the loss function doesn't decrease significantly anymore.

Common stopping criteria:

* fixed number of iterations - stop after a certain number of updates
* tolerance - stop when the improvement in loss is smaller than a defined threshold (indicating that the model is not learning much more)
* validation performance - stop when performance on the validation set starts to degrade (avoiding overfitting)
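
A sketch of how these criteria might look in code. `train_with_stopping` combines an iteration budget with a tolerance check, and `should_stop_early` is a hypothetical helper for the validation-based criterion; the names, the `patience` parameter, and the toy loss are assumptions made for this example.

```python
def train_with_stopping(loss_and_grad, x0, learning_rate=0.1,
                        max_iters=1000, tol=1e-8):
    """Gradient descent that stops on an iteration budget or a small improvement."""
    x = x0
    prev_loss, _ = loss_and_grad(x)
    for i in range(1, max_iters + 1):      # criterion 1: fixed number of iterations
        _, grad = loss_and_grad(x)
        x -= learning_rate * grad
        loss, _ = loss_and_grad(x)
        if prev_loss - loss < tol:         # criterion 2: tolerance
            return x, i, "tolerance"
        prev_loss = loss
    return x, max_iters, "max_iters"

def should_stop_early(val_losses, patience=3):
    """Criterion 3: True if validation loss has not improved in the last `patience` checks."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# Toy loss f(x) = (x - 3)**2 with gradient 2*(x - 3).
def quadratic(x):
    return (x - 3) ** 2, 2 * (x - 3)

x, iters, reason = train_with_stopping(quadratic, x0=0.0)
print(round(x, 3), iters, reason)

# Validation loss fell, then rose for three checks in a row.
print(should_stop_early([1.0, 0.8, 0.7, 0.75, 0.8, 0.9]))  # True
```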



## **Glossary**

**Model bias**

Model bias refers to the systematic error a model makes because its assumptions about the data are too simple or too rigid. It's like having a strong opinion that prevents you from seeing the full picture and leads you to make the same wrong choices over and over again. In machine learning, high bias causes the model to miss important patterns in the data (underfitting), so it makes consistent errors even on the training data.

**Variance**

In the context of machine learning, variance refers to the model's sensitivity to fluctuations or noise in the training data. A high variance model is one that can change significantly with small changes in the training data, leading to inconsistent predictions. High variance can result in overfitting, where the model fits the training data very closely but struggles to generalize to new, unseen data.