# CRISP-DM Machine Learning Process

## Overview

**CRISP-DM** (Cross-Industry Standard Process for Data Mining) is an **iterative framework** for structuring and organizing Machine Learning (ML) projects: it defines **what needs to happen and in which order**.

It consists of **6 main steps:**

1. **Business Understanding**  
2. **Data Understanding**  
3. **Data Preparation (Feature Engineering)**  
4. **Modeling**  
5. **Evaluation**  
6. **Deployment**

> These steps are not strictly linear — after deployment, we often **iterate** back to earlier stages to refine and improve the model.

---

## CRISP-DM Workflow

> **Cross-Industry Standard Process for Data Mining**  
> (Source: [Wikipedia](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining))

For more in-depth reading, refer to the original guide:  
*CRISP-DM 1.0 Step-by-step Data Mining Guide* (linked in Wikipedia’s references).

---

## Step-by-Step Breakdown

### 1. Business Understanding

Understand the business context and objectives before jumping into data or modeling.

**Key actions:**
- Identify the business problem.
- Identify available data sources.
- Specify requirements, assumptions, and constraints.
- Clarify risks and uncertainties.
- Determine whether the problem is important enough to solve.
- Define success criteria (e.g., cost-benefit analysis).
- Evaluate if machine learning is truly necessary.
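
A cost-benefit success criterion can be written down as simple arithmetic before any data is touched. The sketch below is only an illustration: the function name `net_benefit` and every number in it are hypothetical and do not come from a real project.

```python
def net_benefit(caught, false_alarms, value_per_catch,
                cost_per_false_alarm, running_cost):
    """Value of correct catches minus the cost of mistakes and upkeep."""
    return (caught * value_per_catch
            - false_alarms * cost_per_false_alarm
            - running_cost)

# Hypothetical monthly figures for a candidate ML solution.
candidate = net_benefit(caught=800, false_alarms=40,
                        value_per_catch=2.0, cost_per_false_alarm=5.0,
                        running_cost=500.0)

# Doing nothing catches nothing and costs nothing.
do_nothing = net_benefit(caught=0, false_alarms=0,
                         value_per_catch=2.0, cost_per_false_alarm=5.0,
                         running_cost=0.0)

# The project is only worth starting if it beats the status quo.
worth_building = candidate > do_nothing
print(candidate, worth_building)
```

If the estimated benefit does not clearly exceed the cost of building and running the model, a simpler non-ML solution may be the better choice.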

---

### 2. Data Understanding

Gather, inspect, and assess the data available for analysis.

**Key actions:**
- Analyze available data sources.
- Collect and explore data.
- Identify missing or incomplete information.
- Assess data reliability, quality, and size.
- Determine whether additional data is needed.
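
As a rough illustration of identifying missing information, the sketch below counts empty cells per column using only the Python standard library. The CSV content and the helper `missing_report` are made up for this example; real projects would typically use a data-frame library for the same check.

```python
import csv
import io

# Hypothetical raw export; empty cells stand for missing values.
RAW_CSV = """age,plan,monthly_spend
34,basic,20.5
,premium,99.0
51,,45.0
29,basic,
"""

def missing_report(csv_text):
    """Return (row_count, {column: number of empty cells})."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    rows = 0
    for row in reader:
        rows += 1
        for name, value in row.items():
            if value is None or value.strip() == "":
                missing[name] += 1
    return rows, missing

rows, missing = missing_report(RAW_CSV)
for column, count in missing.items():
    print(f"{column}: {count}/{rows} missing ({count / rows:.0%})")
```

A report like this helps decide whether a column is usable as is, needs imputation, or requires collecting additional data.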

---

### 3. Data Preparation (= Feature Engineering)

Transform and clean the data so it can be fed into ML algorithms.

**Key actions:**
- Extract and engineer features.  
- Clean the data — remove noise and inconsistencies.  
- Build **data pipelines** to convert raw data into structured, clean data.  
- Convert data into **tabular format** for model training.
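
The actions above can be sketched as a tiny pipeline that cleans raw records and turns them into a numeric table. This is a minimal sketch under stated assumptions: the records, the column names, and the helpers `clean` and `to_table` are hypothetical, and median imputation with one-hot encoding is just one common choice.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"age": "34", "plan": " Basic "},
    {"age": None, "plan": "premium"},
    {"age": "51", "plan": "PREMIUM"},
    {"age": "-1", "plan": "basic"},  # inconsistent: negative age
]

def clean(record):
    """Normalize text and treat invalid ages as missing."""
    age = int(record["age"]) if record["age"] is not None else None
    if age is not None and age < 0:
        age = None
    return {"age": age, "plan": record["plan"].strip().lower()}

def to_table(records):
    """Impute missing ages with the median and one-hot encode the plan."""
    cleaned = [clean(r) for r in records]
    fill = median(r["age"] for r in cleaned if r["age"] is not None)
    plans = sorted({r["plan"] for r in cleaned})
    header = ["age"] + [f"plan={p}" for p in plans]
    rows = [
        [r["age"] if r["age"] is not None else fill]
        + [1 if r["plan"] == p else 0 for p in plans]
        for r in cleaned
    ]
    return header, rows

header, rows = to_table(raw_records)
print(header)
for row in rows:
    print(row)
```

In a real project, statistics such as the median should be computed on the training split only and then reused for validation and test data, so that no information leaks between splits.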

> *“Coming up with features is difficult, time-consuming, and requires expert knowledge. Applied Machine Learning is basically feature engineering.”*  
> — **Andrew Ng**, Stanford University

**Further Reading:**
- [Feature Engineering Article on Towards Data Science](https://towardsdatascience.com/) (old but still relevant).

---

### 4. Modeling

This is where **the actual Machine Learning happens**.

**Key actions:**
- Train multiple models (e.g., Logistic Regression, Decision Tree, Neural Network).  
- Tune hyperparameters and select the best settings.
- Improve model performance iteratively.  
- Select the best-performing model.
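
The loop of training several candidates, tuning them, and keeping the best one can be sketched without any ML library. The two "models" here (a majority-class baseline and a one-feature decision stump) are deliberately simple stand-ins for the algorithms named above, and the toy data is hypothetical.

```python
# Hypothetical toy data: (feature value, label).
train = [(0.1, 0), (0.3, 0), (0.4, 0), (0.5, 0),
         (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
val = [(0.2, 0), (0.45, 0), (0.65, 1), (0.95, 1)]

def accuracy(predict, data):
    """Share of examples where the prediction matches the label."""
    return sum(predict(x) == y for x, y in data) / len(data)

def fit_majority(data):
    """Baseline: always predict the most common training label."""
    ones = sum(y for _, y in data)
    label = 1 if ones * 2 >= len(data) else 0
    return lambda x: label

def fit_stump(data, thresholds):
    """Decision stump: tune the threshold on the training data."""
    best_t = max(thresholds,
                 key=lambda t: accuracy(lambda x: int(x > t), data))
    return lambda x: int(x > best_t)

# Train every candidate, then compare them on held-out validation data.
candidates = {
    "majority": fit_majority(train),
    "stump": fit_stump(train, thresholds=[0.25, 0.5, 0.75]),
}
scores = {name: accuracy(model, val) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)
```

Comparing candidates on data that was not used for fitting is what makes the selection meaningful; a trivial baseline also shows whether a more complex model is worth its cost.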

> Sometimes, you may need to go back to **data preparation** to fix data issues or engineer better features.

> **Remember:** Model quality heavily depends on data quality.  
> → “Garbage in, garbage out!”

---

### 5. Evaluation

Assess how well the model addresses the **original business problem**.

**Key actions:**
- Measure performance using appropriate metrics.  
- Check if business goals were met (e.g., *Reduce spam by 50%*).  
- Compare actual results vs. expected outcomes.  
- Evaluate on a **test group** before full deployment.  
- Conduct a **retrospective**:
  - Was the goal achievable?
  - Did we measure the right things?

**Possible decisions:**
- Adjust goals and retrain.  
- Roll out the model to all users.  
- Stop the project if the objectives are met or prove infeasible.
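
Measuring performance and checking it against the business goal might look like the sketch below. It assumes, purely for illustration, that "reduce spam by 50%" is read as "catch at least half of all spam" (that is, recall of at least 0.5); the labels and predictions are made up.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means spam."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

# Hypothetical held-out labels and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # how many flagged mails really are spam
recall = tp / (tp + fn)     # how much of the spam gets caught

# Assumed reading of the goal: at least 50% of spam is filtered out.
goal_met = recall >= 0.5
print(f"precision={precision:.2f} recall={recall:.2f} goal_met={goal_met}")
```

Precision matters here as well: a filter that meets the recall goal but blocks many legitimate mails may still fail the business objective.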

---

### 6. Evaluation + Deployment (Often Happen Together)

Perform **online evaluation** using **live user data**.

**Key actions:**
- Deploy the model for a limited group of users.  
- Collect feedback and evaluate real-world performance.  
- If results are satisfactory, proceed to full deployment.
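
One common way to judge a limited rollout is an A/B comparison between users who get the model and users who do not. The sketch below uses a two-proportion z-test; the user counts, the 0.05 significance level, and the function name `two_proportion_z` are assumptions for illustration, not part of the original notes.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two success rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical results: group A is the control, group B sees the model.
z, p_value = two_proportion_z(success_a=100, n_a=1000,
                              success_b=130, n_b=1000)

# Assumed decision rule: roll out only on a significant improvement.
roll_out = z > 0 and p_value < 0.05
print(f"z={z:.2f} p={p_value:.3f} roll_out={roll_out}")
```

The sample size and the decision threshold should be fixed before the experiment starts, otherwise it is easy to stop at a moment that happens to look favorable.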

---

### 7. Deployment (= Engineering Practices)

**Key actions:**
- Roll out the model to all users.  
- Implement proper **monitoring** and **maintenance** systems.  
- Ensure **scalability**, **reliability**, and **quality**.  
- Prepare a final project report documenting the results.
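
Monitoring can start with something as small as checking whether live inputs still resemble the training data. The sketch below flags a feature whose live mean has moved too far from its training mean; the data, the helper names, and the threshold of 2 standard deviations are hypothetical choices for illustration.

```python
from statistics import mean, stdev

def drift_score(train_values, live_values):
    """Shift of the live mean, measured in training standard deviations."""
    return abs(mean(live_values) - mean(train_values)) / stdev(train_values)

def needs_attention(train_values, live_values, threshold=2.0):
    """Flag the feature when the shift exceeds the assumed threshold."""
    return drift_score(train_values, live_values) > threshold

# Hypothetical values of one input feature.
train_ages = [30, 35, 40, 45, 50]
live_ok = [32, 38, 41, 44, 49]
live_shifted = [60, 65, 70, 75, 80]

print(needs_attention(train_ages, live_ok))       # similar population
print(needs_attention(train_ages, live_shifted))  # population has changed
```

A flagged feature does not prove the model is wrong, but it is a cheap signal that predictions should be reviewed and that retraining may be needed.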

> Once a model is deployed, it must be reliable, maintainable, and ready to scale.

---

### 8. Iterate!

Machine Learning projects require **continuous iteration**.  
After deployment, revisit **Business Understanding** to check if:
- The problem needs redefinition.  
- The model can be improved further.  
- A new version of the model is needed.

> **Rule of thumb:**  
> - Start simple (use a basic model).  
> - Learn from feedback.  
> - Improve gradually.

---

## References

- [Slides: CRISP-DM — ML Zoomcamp](https://www.slideshare.net/slideshow/ml-zoomcamp-14-crispdm/250116522)  
- [Reference Notes](https://knowmledge.com/2023/09/12/ml-zoomcamp-2023-introduction-to-machine-learning-part-4/)
