# Day 1: The Machine Learning Workflow & Scikit-learn API (1.2)

## Table of Contents
1. [The Machine Learning Workflow Revisited (1.2.1)](#ml-workflow)
   - Introduction
   - Typical Stages of an ML Project
     - Problem Definition & Framing
     - Data Acquisition
     - Exploratory Data Analysis & Data Preparation
     - Model Selection
     - Model Training
     - Model Evaluation
     - Hyperparameter Tuning
     - Final Evaluation
     - Deployment & Monitoring
   - Key Takeaway
2. [Introduction to the Scikit-learn API (1.2.2)](#sklearn-api)
   - What is Scikit-learn?
   - Why Learn its API?
   - The Estimator API - Core Pattern
     - Choose & Instantiate
     - Fit
     - Predict / Transform
     - Data Format
     - Benefit
3. [Practice Questions](#practice-questions)

<a id="ml-workflow"></a>
## 1.2.1 The Machine Learning Workflow Revisited

### Introduction
Building a successful Machine Learning model is rarely a single step. It's an **iterative process** involving several key stages. Understanding this workflow provides a crucial framework for tackling ML problems systematically.

![ML Workflow](attachment:736aeede-15c0-400d-a695-f6e32042fe91.png)

### Typical Stages of an ML Project

#### Problem Definition & Framing
- Clearly define the business or research question you want to answer.
- Translate it into an ML problem: Is it regression (predicting a number) or classification (predicting a category)?
- Determine what data is likely needed and how success will be measured (e.g., accuracy, error rate, business impact).

#### Data Acquisition
- Gather the necessary data from databases, files, APIs, or other sources.

#### Exploratory Data Analysis (EDA) & Data Preparation
*This is often the most time-consuming and critical phase!*

- **EDA:** Understand the data's structure, distributions, relationships, and potential issues using statistics and visualizations (this is where your Pandas, Matplotlib, and Seaborn skills come in).
  
- **Data Cleaning:** Handle missing values (e.g., imputation), outliers, and inconsistencies.
  
- **Feature Preprocessing:**
  - Handle **Categorical Features:** Convert non-numeric data into a numerical format suitable for ML algorithms (e.g., One-Hot Encoding).
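  - **Feature Scaling:** Bring numerical features to a similar scale (e.g., using StandardScaler or MinMaxScaler). This matters for algorithms that are sensitive to feature magnitude, such as distance-based methods (k-nearest neighbors, SVMs) and regularized linear models.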
  
- **Feature Engineering:** Create new, potentially more informative features from existing ones (can significantly impact performance).
  
- **Data Splitting:** Divide the data into separate sets for training, validation (for tuning), and testing (for final unbiased evaluation). In practice, split before fitting any preprocessing step, so that scalers and encoders learn only from the training set.
  - **Training Set:** Used to fit the model.
  - **Validation Set:** Used during development to compare models and tune hyperparameters.
  - **Test Set:** Kept aside until the very end to estimate how well the model generalizes to new, unseen data.
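
One common way to obtain the three sets is to call `train_test_split` twice. The sketch below uses synthetic data and an arbitrary 60/20/20 split; the proportions and the `random_state` values are illustrative assumptions, not a recommendation from this course.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 100 samples, 3 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Step 1: set aside 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into training (75%) and validation (25%)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(X_train.shape, X_val.shape, X_test.shape)
```

For classification problems with imbalanced classes, passing `stratify=y` to `train_test_split` keeps the class proportions similar across the splits.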

![Data Splitting](attachment:3106b386-7c87-443e-9d8a-342f1544c5d9.png)
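
The cleaning and preprocessing steps from this stage (imputation, one-hot encoding, scaling) can be combined into a single object that is fitted on the training split only. The following is a minimal sketch, assuming a hypothetical toy table with the columns `age` and `city`; the median imputation strategy is just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data: a numeric column with a missing value
# and a categorical column with three distinct values
train_df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0],
    "city": ["Paris", "Lima", "Paris", "Oslo"],
})

# Numeric columns: fill missing values with the median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Apply different preprocessing to numeric and categorical columns
preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Learn imputation values, scaling statistics, and categories from training data
X_train_prepared = preprocessor.fit_transform(train_df)
print(X_train_prepared.shape)  # one scaled numeric column + three one-hot columns
```

New data (validation, test, or production) would then go through `preprocessor.transform(...)`, reusing what was learned here.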

#### Model Selection
- Choose one or more appropriate ML algorithms based on the problem type (regression/classification), data characteristics, and project goals.

#### Model Training
- "Teach" the selected model by feeding it the prepared **training data**. The model learns patterns and relationships during this fit process.

#### Model Evaluation
- Assess the trained model's performance using appropriate metrics (e.g., MSE for regression, accuracy/precision/recall for classification) on the **validation set** (or using cross-validation on the training set).
- This gives an estimate of how well the model is likely to perform on unseen data.
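
Both evaluation options mentioned above can be sketched in a few lines. The example assumes a synthetic classification dataset and uses `LogisticRegression` purely as a stand-in model; accuracy is used as the metric because it is the default score for classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Option 1: fit on the training set, score on the validation set
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
val_accuracy = accuracy_score(y_val, clf.predict(X_val))

# Option 2: 5-fold cross-validation on the training set
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)

print(f"validation accuracy: {val_accuracy:.3f}")
print(f"cross-validated accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```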

#### Hyperparameter Tuning
- Adjust the model's settings (hyperparameters), which are chosen before training rather than learned from the data, to optimize its performance based on the evaluation results.
- This often involves techniques like Grid Search or Random Search.
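
As a rough illustration of Grid Search, scikit-learn's `GridSearchCV` tries every combination in a parameter grid and scores each one with cross-validation. The model (`KNeighborsClassifier`) and the candidate values below are arbitrary choices for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic training data for illustration only
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

# Candidate hyperparameter values to try (arbitrary for this sketch)
param_grid = {"n_neighbors": [3, 5, 7]}

# Each candidate is evaluated with 5-fold cross-validation on the training data
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` follows the same pattern but samples a fixed number of combinations instead of trying all of them.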

#### Final Evaluation
- Once the best model and hyperparameters are chosen, evaluate its final performance on the **held-out test set**.
- Because the test set played no part in training or tuning, this gives an unbiased estimate of performance on new data.

#### Deployment & Monitoring
- Make the finalized model available for making predictions on new data (e.g., integrate into an application).
- Continuously monitor its performance and retrain as needed when performance degrades or new data becomes available.
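
Deployment details vary widely between projects, but a frequent first step is saving the fitted model to disk so another process can load it. The sketch below assumes `joblib` for persistence (one common option, installed alongside scikit-learn) and a hypothetical file name `model.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset following y = 2x
X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X_train, y_train)

# Save the fitted model to a temporary location (hypothetical file name)
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, model_path)

# Later, e.g. inside an application, load it and predict on new data
loaded_model = joblib.load(model_path)
print(loaded_model.predict(np.array([[4.0]])))
```

Only load model files from sources you trust, and load them with the same scikit-learn version that saved them where possible.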

### Key Takeaway
This process is iterative. You might loop back from evaluation to data preparation or model selection if performance isn't satisfactory. Data quality and preparation are paramount.

<a id="sklearn-api"></a>
## 1.2.2 Introduction to the Scikit-learn API

### What is Scikit-learn?
Scikit-learn (sklearn) is the cornerstone library for traditional Machine Learning in Python. It offers a vast collection of algorithms for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing, all designed with efficiency and ease-of-use in mind.

![Scikit-learn Logo](attachment:841ffa3e-87f1-48ad-9ff0-3fa30dafec3e.png)

### Why Learn its API?
Scikit-learn is built on a philosophy of **consistency**. Understanding its core Application Programming Interface (API) makes it incredibly easy to switch between different models and preprocessing steps with minimal code changes. Learn the pattern once, use it everywhere.

### The Estimator API - Core Pattern

#### Choose & Instantiate
Select an algorithm ('Estimator') and create an instance of its class, potentially setting configuration options called **hyperparameters**.

In [None]:
# Example: Instantiating a Linear Regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression()

# Example: Instantiating a Scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
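
Hyperparameters are passed as keyword arguments when the estimator is created, and they can be inspected or changed afterwards. As a small illustration (the choice of `Ridge` and the `alpha` values are arbitrary and not part of the examples above):

```python
from sklearn.linear_model import Ridge

# alpha (the regularization strength) is a hyperparameter set at instantiation
ridge = Ridge(alpha=0.5)

# get_params() returns all hyperparameters as a dictionary
print(ridge.get_params()["alpha"])

# set_params() changes hyperparameters on an existing estimator
ridge.set_params(alpha=2.0)
print(ridge.alpha)
```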

#### Fit
Learn from the data. This is done using the `.fit()` method.

For supervised learning models (classifiers, regressors), `fit` takes the training features (`X_train`) and training labels (`y_train`):

In [None]:
model.fit(X_train, y_train)

For unsupervised transformers (scalers, PCA, imputers), `fit` usually takes only the features (`X_train`) to learn the necessary parameters (e.g., the mean and standard deviation for scaling):

In [None]:
scaler.fit(X_train)

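The `.fit()` method modifies the estimator *in-place*, storing what it learned in attributes whose names end with an underscore (such as `scaler.mean_`), and returns the estimator instance itself (`self`).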
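
A short sketch makes this behavior visible. It uses a tiny hand-made array, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three samples, two features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
returned = scaler.fit(X_train)

# fit() hands back the very same object, which is what allows chaining
print(returned is scaler)

# Learned parameters are stored in attributes ending with an underscore
print(scaler.mean_)   # per-feature mean
print(scaler.scale_)  # per-feature standard deviation
```

Because `fit` returns `self`, a one-liner such as `StandardScaler().fit(X_train)` creates and fits the estimator in a single expression.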

#### Predict / Transform

**Supervised Prediction:** Use `.predict()` on new features (X_new) to get predicted labels or values:

In [None]:
predictions = model.predict(X_new)

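**Classification Probabilities:** For classifiers that support it, `.predict_proba()` returns the estimated probability of each class for every sample (regressors such as `LinearRegression` do not have this method):

In [None]:
# `clf` stands for any fitted classifier with probability support, e.g. LogisticRegression
probabilities = clf.predict_proba(X_new)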
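
To see what the probability output looks like, here is a self-contained sketch with a classifier. `LogisticRegression` and the synthetic dataset are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data for illustration only
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# One row per sample, one column per class
proba = clf.predict_proba(X[:3])
print(proba.shape)

# Column order follows clf.classes_
print(clf.classes_)

# Each row sums to 1
print(proba.sum(axis=1))
```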

**Unsupervised Transformation:** Use `.transform()` to apply the learned transformation (like scaling or dimensionality reduction) to new features (X_new):

In [None]:
X_new_scaled = scaler.transform(X_new)

**Convenience Method: fit_transform**: For transformers, it's common to fit on the training data and immediately transform it. Scikit-learn provides `.fit_transform()` for this:

In [None]:
# Fit the scaler on training data and transform it in one step
X_train_scaled = scaler.fit_transform(X_train)
# IMPORTANT: Only use transform() on the test data!
X_test_scaled = scaler.transform(X_test)

*(We fit only on the training data so that no information from the test set leaks into preprocessing.)*
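
The effect is easy to check on a toy example: the training data ends up centered at zero, while the test data is scaled with the *training* statistics and therefore does not. The arrays below are made up for the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [5.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

print(X_train_scaled.mean())   # approximately 0
print(X_test_scaled.ravel())   # not centered: scaled with training statistics
```

Calling `fit_transform` on the test set instead would compute a new mean and standard deviation from test data, so the model would see features on a different scale than it was trained on.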

#### Data Format
Scikit-learn generally expects data as:
- **X (Features):** A 2D array-like structure (NumPy array, Pandas DataFrame) of shape `(n_samples, n_features)`, where rows represent samples and columns represent features.
- **y (Target):** A 1D array-like structure (NumPy array, Pandas Series) of shape `(n_samples,)` containing the target labels or values corresponding to the samples in X.

![Data Format](attachment:74ca1190-4a26-4bbd-9c50-2e6a1e0469fd.png)
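
A frequent stumbling block is a dataset with a single feature stored as a 1D array, which estimators reject as `X`. A minimal sketch of the usual fix, using made-up values:

```python
import numpy as np

# A single feature stored as a flat 1D array: shape (4,)
x_single_feature = np.array([1.5, 2.0, 3.5, 4.0])

# reshape(-1, 1) turns it into a 2D column: 4 samples, 1 feature
X = x_single_feature.reshape(-1, 1)

# The target stays 1D, with one entry per sample
y = np.array([0, 1, 0, 1])

print(x_single_feature.shape, X.shape, y.shape)
```

With a Pandas DataFrame `df`, selecting columns with a list (`df[["col"]]`) keeps the result 2D, while `df["col"]` returns a 1D Series.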

#### Benefit
This consistent API (fit, predict, transform) dramatically simplifies experimenting with different models and building complex ML pipelines.
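
As an illustration of this benefit, the sketch below chains a scaler and a model into a pipeline and swaps the model without changing any other code. The dataset is synthetic, and the two regressors (`LinearRegression`, `Ridge`) are arbitrary choices for the comparison.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

scores = {}
for estimator in (LinearRegression(), Ridge(alpha=1.0)):
    # The pipeline fits the scaler on the training data, then fits the model
    pipeline = make_pipeline(StandardScaler(), estimator)
    pipeline.fit(X_train, y_train)
    # score() applies the same scaling to X_val before predicting (R^2 for regressors)
    scores[type(estimator).__name__] = pipeline.score(X_val, y_val)

print(scores)
```

The loop body is identical for both models, which is exactly what the shared `fit`/`predict`/`transform` interface makes possible.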

<a id="practice-questions"></a>
## Practice Questions

### ML Workflow Questions
1. Why is the Machine Learning workflow described as "iterative"? Give an example of when you might need to loop back to an earlier stage.
2. What is the difference between the training, validation, and test sets? Why is it important to keep the test set separate until the end?
3. In your own words, explain the importance of Exploratory Data Analysis (EDA) in the ML workflow.
4. What are some common techniques for handling missing values in the data preparation stage? When might you choose one approach over another?
5. Why is feature engineering sometimes considered more important than the choice of algorithm? Give an example of a useful engineered feature.
6. What is "data leakage" and why is it problematic in Machine Learning?

### Scikit-learn API Questions
1. What is the benefit of Scikit-learn's consistent API design across different models and transformers?
2. Explain the difference between the `.fit()`, `.predict()`, and `.transform()` methods in Scikit-learn.
3. Why would you use `.fit_transform()` on your training data but only `.transform()` on your test data?
4. Write the code to instantiate a K-Means clustering model with 5 clusters and fit it to a dataset X.
5. What shape would Scikit-learn expect for:
   - The feature matrix X in a dataset with 1000 samples and 5 features?
   - The target vector y for a binary classification problem with 1000 samples?
6. What would happen if you tried to use `.predict_proba()` on a model that doesn't support probability estimates?
7. Explain why we typically need to apply the same preprocessing steps (like scaling) to both training and test data.