
### **What is Machine Learning?**
Machine learning (ML) uses data to build systems that automatically identify patterns, adapt to changes, and make predictions or decisions. It’s especially valuable for:
1. **Automating manual tuning or rule-making** for complex problems.
2. Handling **dynamically changing data** in fluctuating environments.
3. Solving **non-trivial problems** where traditional methods fall short.



### **Types of Machine Learning Systems**
#### **1. Based on Supervision During Training**
- **Supervised Learning**: Trains on labeled data, where each example comes with its desired output.
  - Example: Classification (predicting categories) or regression (predicting continuous values).
- **Unsupervised Learning**: Finds hidden patterns in data without labeled outputs.
  - Example: Clustering, visualization, anomaly detection, association rule learning.
- **Semi-Supervised Learning**: Uses a small amount of labeled data to guide the labeling of a large unlabeled dataset.
  - Example: Google Photos tagging faces after a few are labeled.
- **Self-Supervised Learning**: An emerging field where models generate their own labels (e.g., contrastive learning in vision tasks).
- **Reinforcement Learning**: Involves an agent that learns by interacting with its environment, receiving rewards or punishments for its actions.

#### **2. Based on Learning Mode**
- **Batch Learning**:
  - Trains on the full dataset offline.
  - Cannot learn incrementally: to take new data into account it must be retrained from scratch, otherwise it suffers from model rot (performance decay due to changing data distributions).
- **Online Learning**:
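  - Processes data one instance at a time or in mini-batches and updates the model incrementally.
  - Key parameter: **Learning Rate**, which controls how fast the system adapts to new data. A high rate adapts quickly but also forgets old data quickly; a low rate learns more slowly but is less sensitive to noise in new data.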
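
The incremental update can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the one-feature linear model, the `sgd_update` helper, the toy data stream, and the learning-rate value are all hypothetical choices made for the example.

```python
from itertools import cycle, islice

def sgd_update(weights, batch, learning_rate):
    """Return updated (intercept, slope) after one gradient step on a mini-batch."""
    intercept, slope = weights
    grad_intercept = grad_slope = 0.0
    for x, y in batch:
        error = (intercept + slope * x) - y  # prediction error on this instance
        grad_intercept += error
        grad_slope += error * x
    n = len(batch)
    return (intercept - learning_rate * grad_intercept / n,
            slope - learning_rate * grad_slope / n)

# Hypothetical stream: mini-batches of (x, y) pairs generated from y = 2x + 1.
mini_batches = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
stream = islice(cycle(mini_batches), 5000)

weights = (0.0, 0.0)
for batch in stream:
    # Each mini-batch is used for one update and could then be discarded.
    weights = sgd_update(weights, batch, learning_rate=0.1)

intercept, slope = weights
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

The model never sees the whole dataset at once; it only keeps the current weights. On this noise-free toy stream the weights approach the generating values (intercept 1, slope 2). With noisy data, a smaller learning rate would make each update less sensitive to any single batch.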

#### **3. Based on Learning Approach**
- **Instance-Based Learning**:
  - Memorizes training data and relies on similarity measures (e.g., k-Nearest Neighbors).
- **Model-Based Learning**:
  - Builds a generalizable model from the training data (e.g., linear regression, neural networks).
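
To make the contrast concrete, here is a small sketch that answers the same query both ways. The helper names `knn_predict` and `fit_line` and the toy data are assumptions for illustration only; the model-based half uses ordinary least squares for a single feature.

```python
def knn_predict(train, x_new, k=3):
    """Instance-based: average the targets of the k closest stored examples."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in neighbors) / k

def fit_line(train):
    """Model-based: learn (intercept, slope) by ordinary least squares."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    return mean_y - slope * mean_x, slope

# Toy training set following y = 2x.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]

knn_pred = knn_predict(train, 10.0, k=3)   # averages the targets at x = 5, 4, 3
intercept, slope = fit_line(train)
line_pred = intercept + slope * 10.0       # extrapolates with the learned line

print(knn_pred, line_pred)
```

The instance-based predictor must keep the whole training set and can only answer with values it has already seen nearby, so at `x = 10` it returns the average of the three largest targets. The model-based predictor keeps just two parameters and extrapolates along the fitted line.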



### **Challenges in Machine Learning**
1. **Insufficient Quantity of Training Data**: ML models thrive on large, diverse datasets.
2. **Nonrepresentative Training Data**: The dataset must reflect the real-world problem accurately.
3. **Poor-Quality Data**: Missing values, noise, or errors can significantly degrade performance.
4. **Irrelevant Features**: Feature engineering (selection + extraction) is crucial.
5. **Overfitting**: Model performs well on training data but poorly on new data.
   - Solutions:
     - Simplify the model.
     - Regularization (e.g., L1, L2 penalties).
     - Add more data or reduce noise.
6. **Underfitting**: Model fails to capture the complexity of the data.
   - Solutions:
     - Choose a more complex model.
     - Improve features.
     - Reduce constraints (e.g., increase model capacity).
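
As a rough sketch of how a regularization penalty constrains a model, the snippet below fits a one-feature linear model without an intercept using an L2 penalty. The function name `fit_ridge_slope`, the toy data, and the `alpha` values are assumptions for illustration; the formula is the closed-form minimizer for this simplified one-parameter case only.

```python
def fit_ridge_slope(xs, ys, alpha):
    """Slope w minimizing sum((y - w*x)**2) + alpha * w**2 (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unregularized = fit_ridge_slope(xs, ys, alpha=0.0)   # 28 / 14
regularized = fit_ridge_slope(xs, ys, alpha=14.0)    # 28 / 28

print(unregularized, regularized)  # 2.0 1.0
```

Here `alpha` is a hyperparameter set before training, while the slope is the parameter learned from data. Raising `alpha` shrinks the slope toward zero, which constrains the model and can reduce overfitting. Lowering it relaxes the constraint, which is one way to address underfitting.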



### **Bias-Variance Tradeoff**
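- **Bias**: Error from simplifying assumptions in the model. High bias leads to underfitting.
- **Variance**: Error from sensitivity to fluctuations in the training data. High variance leads to overfitting.
- The goal is to find a sweet spot between these two extremes: increasing model complexity typically lowers bias but raises variance, and simplifying the model does the opposite.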
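
The trade-off can be stated more precisely with the standard decomposition of expected squared error. The notation is an assumption introduced here, not part of the notes above: the target is $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the model learned from a random training set, and expectations are taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$

Only the first two terms depend on the choice of model, which is why lowering one at the expense of the other can still reduce the total error.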



### **Training and Evaluation Splits**
1. **Training/Test Split**: A basic division to evaluate generalization.
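2. **Training/Validation/Test Split**: Adds a validation set for comparing models and tuning hyperparameters, so that these choices do not overfit the test set.
3. **Cross-Validation**: Uses data efficiently by dividing it into multiple folds; each fold serves once as the validation set while the model trains on the remaining folds.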
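
The fold mechanics of cross-validation can be sketched in a few lines. This is a simplified illustration: `k_fold_indices` is a hypothetical helper, it uses contiguous folds, and it does not shuffle, which real data would usually need first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every instance lands in exactly one validation fold, so each model is evaluated on data it did not train on, and averaging the per-fold scores gives a steadier estimate than a single split.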



### **Data Mismatch**
- Problem: The training data doesn’t match the distribution of real-world (testing) data.
  - Example: A pet classifier trained on high-resolution web images may fail on mobile device images.
- Solution: Create subsets:
  - **Training Set**: For model training.
  - **Train-Dev Set**: From the same distribution as training data but not used during training.
  - **Validation Set**: For hyperparameter tuning; like the test set, it should come from the target distribution.
  - **Test Set**: From the target distribution.
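
One way to build these subsets is sketched below. The function `split_for_data_mismatch`, the file names, and the split fractions are hypothetical, and the sketch assumes the validation and test sets are both drawn from the target (mobile) data while the train-dev set is held out from the source (web) data.

```python
import random

def split_for_data_mismatch(source_data, target_data, train_dev_fraction=0.1, seed=0):
    """Split source data into train/train-dev and target data into validation/test."""
    rng = random.Random(seed)
    source = list(source_data)
    target = list(target_data)
    rng.shuffle(source)
    rng.shuffle(target)

    n_train_dev = int(len(source) * train_dev_fraction)
    n_validation = len(target) // 2
    return {
        "train": source[n_train_dev:],        # used for training
        "train_dev": source[:n_train_dev],    # same distribution, never trained on
        "validation": target[:n_validation],  # target distribution, for tuning
        "test": target[n_validation:],        # target distribution, final check
    }

# Hypothetical data: plentiful web images, scarce mobile images.
web_images = [f"web_{i}.jpg" for i in range(100)]
mobile_images = [f"mobile_{i}.jpg" for i in range(20)]

splits = split_for_data_mismatch(web_images, mobile_images)
print({name: len(items) for name, items in splits.items()})
```

After training, poor performance on `train_dev` points to overfitting the training set. Good performance on `train_dev` but poor performance on `validation` points to a mismatch between the web and mobile data.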



1) How would you define machine learning?

   Machine learning is building systems that learn from data: they identify patterns in examples and get better at a task without being explicitly programmed with rules.
2) Can you name four types of applications where it shines?
   Machine learning is useful when:
   - You need to gain insights from large, complex datasets
   - Traditional software engineering approaches don't yield good results
   - The system must adapt to new data by learning incrementally
   - Existing solutions require a lot of fine-tuning or long lists of rules
3) What is a labeled training set?

   A training set in which each instance is paired with its expected output (the label), either a class or a value
4) What are the two most common supervised tasks?

   Classification and Regression
5) Can you name four common unsupervised tasks?

   Clustering, Anomaly Detection, Novelty Detection, Visualization, Dimensionality Reduction
6) What type of algorithm would you use to allow a robot to walk in various unknown terrains?

   Reinforcement Learning
7) What type of algorithm would you use to segment your customers into multiple groups?

   Clustering
8) Would you frame the problem of spam detection as a supervised or unsupervised learning problem?

   Supervised, because the model is trained on emails that are already labeled as spam or not spam
9) What is an online learning system?

   A system that learns incrementally, updating on the fly as new data arrives, either one instance at a time or in mini-batches
10) What is out-of-core learning?

    When a dataset is too large to fit in the machine's memory, it is split into smaller batches and an online learning algorithm trains on them sequentially
11) What type of algorithm relies on a similarity measure to make predictions?

    Instance-based learning algorithm
12) What is the difference between a model parameter and a hyperparameter?

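    A model parameter is learned from the data during training (e.g., the slope of a linear model), whereas a hyperparameter is a setting of the learning algorithm itself, fixed by the user before training (e.g., the amount of regularization)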
13) What do model-based algorithms search for? What is the most common strategy to succeed? How do they make predictions?

    Model-based algorithms search for the model parameter values that let the model generalize best to new instances. The most common strategy is to minimize a cost function that measures how badly the model fits the training data, plus a penalty for model complexity if the model is regularized. Predictions are made by feeding the new instance's features into the model with the learned parameters
14) Can you name four of the main challenges of Machine Learning?

    Overfitting, Underfitting, Nonrepresentative data, Poor Quality data
15) If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name 3 possible solutions?
    
    The model is overfitting the training data. Solutions:
    - Simplify or regularize the model
    - More training data
    - Reduce noise in training data
16) What is a test set, and why would you use it?

    A held-out partition of the dataset, used to estimate whether the model generalizes well to new data before it is deployed
17) What is the purpose of a validation set?

    Used to compare models and tune hyperparameters, and to judge whether the model is overfitting the training data, so that the test set is kept for a final, unbiased evaluation
18) What is the train-dev set, when do you need it, and how do you use it?

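    A train-dev set is a part of the training data that is held out and not trained on. You need it when there is a risk of data mismatch, for example when the dataset is not big enough and the training data has to be collected from a third party. After training, evaluate the model on it: poor performance on the train-dev set means the model overfits the training set, while good performance there but poor performance on the validation set points to data mismatch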
19) What can go wrong if you tune hyperparameters using the test set?

    The hyperparameters end up overfitted to the test set, so the measured generalization error is too optimistic and the model will likely perform worse when deployed
