## Training Set: 

* This is the portion of the dataset used to train the machine learning model. The model learns patterns and relationships from this data. The training set typically makes up the majority of the dataset, often around 70-80%.


## Validation Set: 

* While a model is being developed, it's essential to evaluate its performance on data it was not fitted to. This is where the validation set comes in. The validation set is used to tune hyperparameters and compare candidate models during training, so that these choices are not made by overfitting to the training data. Usually, the validation set consists of around 10-20% of the dataset.



## Test Set: 
* Once the model is trained and validated, it's crucial to evaluate its performance on completely unseen data to get an unbiased estimate of its performance. The test set serves this purpose. It helps assess how well the model generalizes to new, unseen data. The test set is separate from the training and validation sets and should not be used during model training or hyperparameter tuning. Typically, it contains around 10-20% of the dataset, similar to the validation set.

# How to Split the Dataset into These Sets

## Random Splitting: 
* One common approach is to randomly shuffle the dataset and then partition it into training, validation, and test sets according to predetermined ratios (e.g., 70% training, 15% validation, 15% test). With enough data, shuffling makes each set likely to be representative of the overall dataset.
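
A minimal sketch of the shuffle-then-partition idea, using only the Python standard library. The function name `random_split`, the default fractions, and the fixed `seed` are illustrative assumptions, not part of the original notes.

```python
import random

def random_split(items, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a copy of `items` and cut it into train/validation/test lists."""
    shuffled = list(items)                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)       # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]           # whatever remains is the test set
    return train, val, test

data = list(range(100))                         # stand-in for 100 samples
train, val, test = random_split(data)
print(len(train), len(val), len(test))
```

Every sample ends up in exactly one of the three lists, so the sets never overlap.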

## Stratified Splitting: 
* In cases where the dataset is imbalanced (i.e., some classes are more prevalent than others), it's essential to maintain the same class distribution across the training, validation, and test sets. Stratified splitting ensures that each subset maintains the same class proportions as the original dataset.
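
The idea can be sketched by splitting each class separately and then merging the pieces. This is a simplified illustration with a hypothetical helper `stratified_split` and a toy 90/10 label list. Libraries such as scikit-learn offer the same behaviour through the `stratify` argument of `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)             # group sample indices by class

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                    # shuffle within each class
        n_test = round(len(indices) * test_frac)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["neg"] * 90 + ["pos"] * 10            # toy imbalanced dataset
train_idx, test_idx = stratified_split(labels)
pos_in_test = sum(labels[i] == "pos" for i in test_idx)
print(len(test_idx), "test samples, of which", pos_in_test, "are positive")
```

With these toy labels the test split holds 18 negatives and 2 positives, the same 90/10 ratio as the full dataset.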

## Cross-Validation:
* Instead of relying on a single training/validation split, cross-validation divides the dataset into multiple subsets, known as folds. The model is trained and validated multiple times, each time using a different fold for validation and the remaining folds for training. This lets every sample be used for both training and validation, and averaging the scores across folds gives a more reliable performance estimate than a single split. A separate test set is usually still held out for the final evaluation.
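
A small sketch of how the folds rotate, assuming a hypothetical generator `k_fold_indices` that works on sample indices. It does not shuffle, so in practice the data would normally be shuffled first.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, k=5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "| train on", len(train_idx), "samples")
```

In a real workflow a model would be fitted on `train_idx` and scored on `val_idx` inside the loop, and the k scores averaged.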

# Bias & Variance

## Bias: 
* Bias refers to the error that is introduced by approximating a real-world problem with a simplified model. A model with high bias pays very little attention to the training data and oversimplifies the underlying patterns. This can lead to significant errors in prediction. In simpler terms, bias is like consistently missing the target, regardless of where you aim.


## Variance: 
* Variance, on the other hand, refers to how much the model's predictions for a given data point change when the model is trained on different training sets. A model with high variance is very sensitive to the training data and captures noise along with the underlying patterns. Such a model might perform very well on the training data but poorly on unseen data because it has essentially memorized the training data instead of learning the general patterns. In simpler terms, variance is like aiming at different spots on the target each time you shoot, leading to scattered results.


In summary:

* High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
* High variance can cause an algorithm to model the random noise in the training data (overfitting).
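
For squared-error loss, the two error sources can be written as a decomposition of the expected prediction error at a point $x$. The notation is an assumption added here: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the trained model, and the expectation is taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$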

Very simple explanation:

* Bias is when your model consistently gives the wrong answer in the same way.
* Variance is when your model gives very different answers depending on the data it's trained on.
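
The two statements above can be checked with a small simulation: train the same kind of model on many freshly sampled training sets and look at its predictions at one fixed point. Everything here is an illustrative assumption: the target function `true_f`, the noise level, and the two toy models (a constant "always predict the mean" model and a 1-nearest-neighbour model).

```python
import random
import statistics

def true_f(x):
    return x * x                                 # assumed ground-truth function

def make_dataset(rng, n=20, noise=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    return statistics.fmean(ys)                  # ignores x0: a very simple model

def predict_1nn(xs, ys, x0):
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]                           # copies the closest training label

def bias_sq_and_variance(predict, x0=0.9, trials=2000, seed=0):
    """Estimate squared bias and variance of `predict` at the point x0."""
    rng = random.Random(seed)
    preds = [predict(*make_dataset(rng), x0) for _ in range(trials)]
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    return bias_sq, statistics.pvariance(preds)

mean_bias, mean_var = bias_sq_and_variance(predict_mean)
nn_bias, nn_var = bias_sq_and_variance(predict_1nn)
print(f"mean model: bias^2={mean_bias:.4f} variance={mean_var:.4f}")
print(f"1-NN model: bias^2={nn_bias:.4f} variance={nn_var:.4f}")
```

The constant model is wrong in the same way on every run (high bias, low variance), while the 1-nearest-neighbour model is right on average but its answer jumps around with the training set (low bias, high variance).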

# Overfitting & Underfitting

## What is Overfitting?
* When a `model performs very well for training data but has poor performance with test data (new data)`, it is `known as overfitting`. In this case, the machine learning model learns `the details and noise in the training data such that it negatively affects the performance of the model on test data`. Overfitting can happen due to `low bias and high variance`.

![image.png](attachment:13d00b7e-d852-4c10-9944-775f3ffcbd38.png)

## Reasons for Overfitting
* Data used for training is not cleaned and contains noise (garbage values) in it
* The model has a high variance
* The size of the training dataset used is not enough
* The model is too complex

## Ways to Tackle Overfitting
* Using K-fold cross-validation
* Using Regularization techniques such as Lasso and Ridge
* Training the model with sufficient data
* Adopting ensembling techniques
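
As a rough illustration of what regularization does, the sketch below uses the closed-form Ridge (L2) solution for a single feature with no intercept. The helper `ridge_slope` and the toy data are assumptions made for this example. Lasso (L1) follows the same idea but has no equally simple closed form and is not shown.

```python
def ridge_slope(xs, ys, lam):
    """Slope w that minimises sum((y - w*x)**2) + lam * w**2 (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]                        # roughly y = 2x with small noise

slopes = [ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)]
for lam, slope in zip((0.0, 1.0, 10.0, 100.0), slopes):
    print(f"lambda={lam:>5}: slope={slope:.3f}")
```

With `lam=0` this is ordinary least squares. As `lam` grows, the penalty pulls the slope toward zero, which limits how closely the model can chase noise in the training data.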

## What is Underfitting?
* When a `model has not learned the patterns in the training data well and is unable to generalize well on the new data`, it is `known as underfitting`. An `underfit model has poor performance on the training data and will result in unreliable predictions`. Underfitting occurs due to `high bias and low variance`.

![image.png](attachment:5cfb8f13-8ae0-47c1-af0c-04b8e5e268c7.png)

## Reasons for Underfitting
* Data used for training is not cleaned and contains noise (garbage values) in it
* The model has a high bias
* The size of the training dataset used is not enough
* The model is too simple

## Ways to Tackle Underfitting
* Increase the number of features in the dataset
* Increase model complexity
* Reduce noise in the data
* Increase the training duration (for example, train for more epochs)

## What Is a Good Fit In Machine Learning?
* A `good fit model accurately captures the underlying patterns in the data without being overly complex or overly simple`. It strikes a balance between `low bias (accurately representing the data) and low variance (generalizing well to new, unseen data)`, effectively minimizing both `underfitting and overfitting`.

![image.png](attachment:b7caf2c3-ef18-4022-8645-04e7830d29ae.png)
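
To make the three cases concrete, here is a small sketch that fits three toy models to the same noisy linear data and compares training and test error. All names and settings are assumptions for illustration: the data follow `y = 2x` plus Gaussian noise, `fit_constant` stands in for an underfit model, `fit_line` for a good fit, and `fit_memorizer` (1-nearest-neighbour) for an overfit model.

```python
import random

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_constant(xs, ys):
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y                      # too simple: ignores x entirely

def fit_line(xs, ys):
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_y + slope * (x - mean_x)   # least-squares line

def fit_memorizer(xs, ys):
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]                       # repeats the closest training label, noise included
    return predict

def sample(rng, n, noise=0.5):
    xs = [rng.uniform(0, 5) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, noise) for x in xs]
    return xs, ys

rng = random.Random(1)
x_train, y_train = sample(rng, 200)
x_test, y_test = sample(rng, 1000)

results = {}
for name, fit in [("constant", fit_constant), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(x_train, y_train)
    results[name] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"{name:>9}: train MSE={results[name][0]:.3f}  test MSE={results[name][1]:.3f}")
```

The pattern to look for: the constant model has high error on both sets (underfitting), the memorizer has zero training error but a clearly higher test error (overfitting), and the line has similar, low error on both (a good fit).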