
### **Supervised Learning Revision Guide**

#### **1. Overview**
- **Definition**: Supervised learning is a type of machine learning where the model is trained on labeled data to predict an output for unseen data.
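- **Key Features**:
  - Training data consists of input-output pairs (features with a known target).
  - Goal: Learn a mapping that minimizes a loss (error) function on the training data while still generalizing to unseen data.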
- **Common Tasks**:
  - **Regression**: Predict continuous values (e.g., house prices, stock prices).
  - **Classification**: Predict categorical labels (e.g., spam or not spam, image classes).

---

#### **2. Key Terminology**
- **Training Set**: Data used to fit the model's parameters.
- **Test Set**: Held-out data, never used during training, used to estimate performance on unseen data.
- **Features**: Input variables (independent variables).
- **Target**: Output variable (dependent variable).
- **Overfitting**: When a model fits noise in the training data, so it performs well on training data but poorly on unseen data.
- **Underfitting**: When a model is too simple to capture the underlying pattern and performs poorly on both training and test data.
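
To make the training/test distinction concrete, here is a minimal sketch of a shuffled hold-out split using only the standard library. The helper name `holdout_split`, the 25% test fraction, and the toy data are assumptions for illustration; in practice scikit-learn's `train_test_split` plays this role.

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and return (train, test) lists (hypothetical helper)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                      # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Each row is (features, target); the values are made up for illustration.
data = [([x], 2 * x + 1) for x in range(20)]
train, test = holdout_split(data, test_fraction=0.25, seed=42)
print(len(train), len(test))
```

A large gap between the training score and the test score is the usual symptom of overfitting; poor scores on both suggest underfitting.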

---

#### **3. Common Algorithms**
**Regression**:
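1. **Linear Regression**: Predicts a continuous target as a linear function of the features.
   - Formula (single feature): \( Y = \beta_0 + \beta_1X + \epsilon \), where \( \beta_0 \) is the intercept, \( \beta_1 \) the slope, and \( \epsilon \) the error term.
   - Metrics: Mean Squared Error (MSE), R².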

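2. **Ridge & Lasso Regression**: Variants of linear regression that add a penalty on coefficient size to reduce overfitting. Ridge uses an L2 penalty (shrinks coefficients); Lasso uses an L1 penalty (can shrink some coefficients exactly to zero).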

3. **Polynomial Regression**: Models non-linear relationships by adding polynomial features (e.g., \( X^2, X^3 \)); the model remains linear in its coefficients.
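
As an illustration of the single-feature linear regression formula, the sketch below fits \( \beta_0 \) and \( \beta_1 \) by ordinary least squares and reports MSE and R². The function names and the five data points are assumptions made up for this example; it is not a substitute for a library implementation.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta0, beta1) minimizing squared error for y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta1 = sxy / sxx                 # slope
    beta0 = mean_y - beta1 * mean_x   # intercept
    return beta0, beta1

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1.0 - ss_res / ss_tot

# Toy data, roughly y = 2x + 1 with noise (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

beta0, beta1 = fit_simple_linear_regression(xs, ys)
preds = [beta0 + beta1 * x for x in xs]
mse = mean_squared_error(ys, preds)
r2 = r_squared(ys, preds)
print(f"beta0={beta0:.3f} beta1={beta1:.3f} MSE={mse:.4f} R2={r2:.4f}")
```

Note that these scores are computed on the same points used for fitting, so they describe fit quality rather than performance on unseen data.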

**Classification**:
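1. **Logistic Regression**: Predicts binary outcomes (e.g., Yes/No).
   - Output: Probability of the positive class, thresholded (commonly at 0.5) to obtain a label.
   - Activation: Sigmoid function, \( \sigma(z) = 1 / (1 + e^{-z}) \), applied to a linear combination of the features.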

2. **Support Vector Machine (SVM)**:
   - Separates classes using the hyperplane with the largest margin; kernels allow non-linear decision boundaries.
   - Often effective in high-dimensional feature spaces.

3. **K-Nearest Neighbors (KNN)**:
   - Instance-based algorithm.
   - Predicts based on the majority class of \( k \)-nearest neighbors.

4. **Decision Trees**:
   - Splits data recursively based on feature thresholds.
   - Prone to overfitting when grown deep without pruning or a depth limit.

5. **Random Forest**:
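   - Ensemble of decision trees, each trained on a bootstrap sample with random feature subsets.
   - Reduces overfitting by averaging (or voting over) the trees' predictions.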

6. **Gradient Boosting**:
   - Combines weak learners sequentially to improve performance.
   - Examples: XGBoost, LightGBM, CatBoost.
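
Two of the classifiers above are simple enough to sketch from scratch: the sigmoid step of logistic regression and the majority vote of KNN. The weights, bias, points, and labels below are hypothetical values chosen for illustration, and Euclidean distance is an assumption (other metrics are possible).

```python
import math
from collections import Counter

def sigmoid(z):
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(features, weights, bias):
    """Logistic regression: sigmoid of a linear combination of the features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already-fitted logistic regression parameters.
p = predict_proba([1.0, 2.0], weights=[0.8, -0.4], bias=-0.2)
label = "yes" if p >= 0.5 else "no"

# Toy 2-D training set with two well-separated classes.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
labels = ["a", "a", "a", "b", "b", "b"]
near_a = knn_predict(points, labels, (1.2, 1.4), k=3)
near_b = knn_predict(points, labels, (6.4, 7.0), k=3)
print(round(p, 3), label, near_a, near_b)
```

Because KNN relies on distances, features on very different scales usually need scaling before this kind of vote is meaningful.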

---

#### **4. Model Evaluation**
- **Metrics for Regression**:
  - Mean Absolute Error (MAE)
  - Mean Squared Error (MSE)
  - Root Mean Squared Error (RMSE)
  - R-squared (R²)

- **Metrics for Classification**:
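  - Accuracy: Fraction of correct predictions; can be misleading on imbalanced data.
  - Precision: TP / (TP + FP), the share of predicted positives that are correct.
  - Recall: TP / (TP + FN), the share of actual positives that are found.
  - F1-Score: Harmonic mean of precision and recall.
  - ROC-AUC: Area under the ROC curve; measures how well the model ranks positives above negatives across thresholds.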

- **Cross-Validation**:
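  - K-Fold Cross-Validation: Split the data into \( k \) folds; train on \( k-1 \) folds, validate on the remaining fold, and average the scores over all \( k \) rounds.
  - Stratified K-Fold: Preserves class proportions in every fold; useful for imbalanced classification datasets.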
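
The classification metrics and the stratified splitting scheme above can be sketched in plain Python as follows. The function names and the ten labels are assumptions for illustration, and the splitter does no shuffling, which a real implementation normally offers.

```python
from collections import defaultdict

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def stratified_k_fold(labels, k):
    """Yield (train_indices, test_indices), dealing each class evenly across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)   # round-robin within each class
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx

# Toy labels and predictions (made up): 4 positives, 6 negatives.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
splits = list(stratified_k_fold(y_true, k=2))
print(accuracy, precision, recall, f1)
```

With these labels, each of the two test folds receives two of the four positives, so both folds keep the original 40% positive rate.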

---

#### **5. Preprocessing Techniques**
- **Feature Scaling**:
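  - Standardization (z-score normalization): Rescales each feature to mean 0 and standard deviation 1.
  - Min-Max Scaling: Rescales each feature to a fixed range, usually [0, 1].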

- **Handling Missing Values**:
  - Mean/Median/Mode Imputation.
  - Dropping rows/columns.

- **Encoding Categorical Variables**:
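  - One-Hot Encoding: One binary column per category; suits nominal (unordered) features.
  - Label Encoding: One integer per category; implies an order, so it suits ordinal features or target labels.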
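
The sketch below walks through mean imputation, standardization, min-max scaling, and one-hot encoding with the standard library. All names and values are assumptions for illustration. One design choice is worth noting: every statistic (fill value, mean, standard deviation, min, max, category list) is computed from the training data and then reused on the test data, so no information leaks from the test set.

```python
import statistics

def impute(values, fill):
    """Replace missing entries (None) with a precomputed fill value."""
    return [fill if v is None else v for v in values]

def standardize(values, mean, std):
    """z-score: subtract the mean and divide by the standard deviation."""
    return [(v - mean) / std for v in values]

def min_max_scale(values, lo, hi):
    """Map lo to 0 and hi to 1; values outside [lo, hi] fall outside [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(value, categories):
    """One binary indicator per known category; unseen values give all zeros."""
    return [1 if value == c else 0 for c in categories]

# Toy data (made up for illustration).
train_ages = [20.0, None, 40.0, 60.0]
test_ages = [None, 50.0]
train_colors = ["red", "green", "blue", "green"]

# Fit every statistic on the training data only.
fill = statistics.fmean(v for v in train_ages if v is not None)
train_filled = impute(train_ages, fill)
mean, std = statistics.fmean(train_filled), statistics.pstdev(train_filled)
lo, hi = min(train_filled), max(train_filled)
categories = sorted(set(train_colors))

# Apply the fitted statistics to both splits.
train_z = standardize(train_filled, mean, std)
train_scaled = min_max_scale(train_filled, lo, hi)
test_scaled = min_max_scale(impute(test_ages, fill), lo, hi)
red_vector = one_hot("red", categories)
print(train_filled, train_scaled, test_scaled, red_vector)
```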

---

#### **6. Hyperparameter Tuning**
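- **Grid Search**: Exhaustive search over every combination in a specified parameter grid; cost grows quickly with the number of parameters.
- **Random Search**: Evaluates a fixed number of configurations sampled at random from specified ranges or distributions.
- **Bayesian Optimization**: Builds a probabilistic model of the score as a function of the hyperparameters and uses it to pick promising configurations, typically needing fewer evaluations than grid or random search.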
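
To show the difference between the first two strategies, here is a minimal sketch of grid search and random search over the same small search space. The `validation_score` function is a hypothetical stand-in for "train a model with these settings and return its cross-validated score"; the parameter names `alpha` and `depth` are likewise assumptions.

```python
import itertools
import random

def validation_score(params):
    """Hypothetical objective; in practice this would train and validate a model."""
    return -((params["alpha"] - 0.1) ** 2) - 0.01 * (params["depth"] - 4) ** 2

def grid_search(grid, score_fn):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def random_search(space, score_fn, n_iter=5, seed=0):
    """Evaluate n_iter randomly sampled configurations and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

space = {"alpha": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
grid_best, grid_score = grid_search(space, validation_score)      # 9 evaluations
rand_best, rand_score = random_search(space, validation_score)    # 5 evaluations
print(grid_best, rand_best)
```

Grid search is guaranteed to find the best point in the grid but pays for every combination; random search trades that guarantee for a fixed evaluation budget.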

---

#### **7. Challenges**
- **Imbalanced Datasets**:
  - Resample the training data: oversample the minority class (randomly or with synthetic samples, e.g., SMOTE) or undersample the majority class.
  - Adjust class weights in models.

- **Feature Selection**:
  - Use methods like Recursive Feature Elimination (RFE), Lasso, or feature importance from tree-based models.
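
Two of the imbalance remedies above can be sketched briefly: class weights and resampling. The weight formula below, `n_samples / (n_classes * class_count)`, is the heuristic scikit-learn documents for `class_weight="balanced"`. The resampler is plain random oversampling with replacement, not SMOTE (which synthesizes new points by interpolation). Function names and data are assumptions for illustration.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy training set (made up): 8 negatives, 2 positives.
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2

weights = balanced_class_weights(labels)
new_rows, new_labels = random_oversample(rows, labels, seed=1)
print(weights, Counter(new_labels))
```

Resampling should be applied to the training split only; evaluating on a resampled test set would give a distorted picture of real-world performance.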

---

#### **8. Tools and Libraries**
- **Python Libraries**:
  - `scikit-learn`: Core library for ML models.
  - `pandas`: Data manipulation.
  - `matplotlib`/`seaborn`: Visualization.

---

#### **9. Workflow Checklist**
1. Understand the problem and define the task (regression/classification).
2. Split the data (hold out a test set; use cross-validation on the training data).
3. Preprocess the data (handle missing values, scaling, encoding), fitting imputers, scalers, and encoders on the training data only to avoid data leakage.
4. Train the model and evaluate it on validation data using appropriate metrics.
5. Tune hyperparameters and refine the model.
6. Test the final model once on the held-out test set.

---

