# CRISP-DM (Data Science Workflows/Processes)

## Intro:

Data science projects can take many forms across many domains (finance, healthcare, higher education, etc.). Despite this variability in subject matter, every data science project's life cycle rests on the same core pillars.

These common pillars of the data science process have been captured and summarized in several guiding frameworks, such as OSEMN and SEMMA. While your future employer may have a preferred process, one of the most widely used among data science teams today is the "CRoss Industry Standard Process for Data Mining" (CRISP-DM).

CRISP-DM was developed in the late 1990s as a generalizable data mining process by an industry consortium whose members included SPSS (later acquired by IBM), NCR, and DaimlerChrysler. Since its creation, it has been adapted and applied to many domains and tasks, and it works particularly well for data science projects.

## The 6 Stages of the CRISP-DM Process

![CRISP-DM Process Model](https://lh7-us.googleusercontent.com/tZuEwVJGcm28FA5bwUMR8uEw19iDzMB3i87gACh2rbvHp0HtSTmY1cRMVszeAjduBAHGly41upwMVKbGDeaQe8Qk-nAcOCJ7bQgM5qvl6iYhcbYmxYpP-KawwkeqQRKx9lVbAfBMRKd9W7h2LuQWgw)

[Image Source](https://www.datascience-pm.com/crisp-dm-2/)

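The CRISP-DM process consists of 6 phases. They are generally followed in order, although projects often loop back to an earlier phase as new information comes to light: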

- Phase 1) Business understanding:
    - *What is the stakeholder’s goal? What do they need?*
- Phase 2) Data understanding:
    - *What data do we have or need, and is it clean?*
- Phase 3) Data preparation:
    - *How do we prepare this data for modeling?*
- Phase 4) Modeling:
    - *What types of models and techniques should we use?*
- Phase 5) Evaluation:
    - *Which of our models best meets the stakeholders’ needs?*
- Phase 6) Deployment:
    - *How do stakeholders receive/access the results?*

While the specific duties of each data scientist will vary depending on their employer/domain, most projects move through these same general phases.
___

## Phase 1) Business Understanding

First, we focus on the stakeholder's needs and requirements.

- Who are the stakeholders? Who is asking you for this project/analysis?
- What do the stakeholders want to know/accomplish?
- How do they plan to use the information/model that we provide?
- What would the stakeholders consider to be a successful outcome?

## Phase 2) Data Understanding

Next, we focus on the data available for the project at a high level.

- 2.1) What data have we been provided?
    - Is there any data we need to collect/combine for our task?
- 2.2) What information is included in the data?
    - How many records (rows)?
    - How many features (columns)?
    - What is the format/data type of each feature (string/integer/etc.)?
    - What does each feature mean, and how does it relate to the stakeholder’s goals?
- 2.3) How clean is the data?
    - Are there missing values?
    - Are there duplicate rows?
    - Are there any features with inconsistent values (e.g., “yes” vs. “Yes”)?
    - Do any features need to be combined or separated?
- 2.4) What do the features look like and how are they related?
    - Explore each feature:
        - I. Dig deeper into the data by exploring and visualizing each feature individually.
        - II. Visualize the features together and identify relationships between them.
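
The questions in 2.2 through 2.4 map onto a handful of pandas calls. The following is a minimal sketch, assuming the data has already been loaded into a `DataFrame`; the toy data and its column names (`customer_id`, `subscribed`, `monthly_spend`) are hypothetical and only stand in for whatever the stakeholder provides.

```python
import pandas as pd

# Hypothetical toy data standing in for the real project data.
df = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3, 4],
        "subscribed": ["yes", "Yes", "Yes", "no", None],
        "monthly_spend": [20.0, 35.5, 35.5, None, 12.0],
    }
)

# 2.2) How many records and features, and what type is each feature?
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
print(df.dtypes)

# 2.3) How clean is the data?
missing_per_column = df.isna().sum()      # count of missing values per column
n_duplicate_rows = df.duplicated().sum()  # rows identical to an earlier row
print(missing_per_column)
print(f"{n_duplicate_rows} duplicate rows")

# Inconsistent labels show up as separate categories ("yes" vs. "Yes").
print(df["subscribed"].value_counts(dropna=False))

# 2.4) Summary statistics give a quick first look at each numeric feature.
print(df.describe())
```

From here, plotting each feature (and pairs of features) is the usual next step for 2.4; the plotting library is a matter of preference.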

## Phase 3) Data Preparation

Once we understand the data, we must prepare it for modeling purposes.

- Select the features to include in your model.
- Construct Data/Feature Engineering
    - Create new columns by combining/separating pre-existing data.
    - For example:
        - Creating separate “Year” and “Month” columns from a “Date” column.
        - Calculating a “Profit” feature by subtracting the “Cost” from the “Revenue”.
- Clean the data for modeling.
    - Drop duplicate information.
    - Correct/impute missing values.
    - Fix inconsistencies within columns.
    - Ensure the datatypes are correct for each column.
        - Are numeric columns stored as a numeric data type?
    - Convert string/categorical features into numeric features for machine learning.
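
The preparation steps above can be sketched with pandas as follows. The sales data here is hypothetical; the column names follow the “Date”, “Revenue”, and “Cost” examples in the list, and `Region` is an added illustrative categorical column. Median imputation and one-hot encoding are just one reasonable choice for each step, not the only option.

```python
import pandas as pd

# Hypothetical sales data; "Region" is an illustrative categorical column.
df = pd.DataFrame(
    {
        "Date": ["2023-01-15", "2023-02-20", "2023-02-20", "2023-03-05"],
        "Revenue": ["100", "250", "250", "80"],  # numbers stored as strings
        "Cost": [60.0, 100.0, 100.0, None],
        "Region": ["north", "North", "North", "south"],
    }
)

# Clean: drop exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# Clean: make sure each column has the right data type.
df["Date"] = pd.to_datetime(df["Date"])
df["Revenue"] = pd.to_numeric(df["Revenue"])

# Clean: fix inconsistent labels and impute missing values with the median.
df["Region"] = df["Region"].str.lower()
df["Cost"] = df["Cost"].fillna(df["Cost"].median())

# Feature engineering: split the date and derive profit.
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Profit"] = df["Revenue"] - df["Cost"]

# Convert the categorical feature into numeric columns (one-hot encoding).
df = pd.get_dummies(df, columns=["Region"], dtype=int)
print(df)
```

In a real project, values used for imputation are normally computed from the training data only, so that information from the test set does not leak into the model.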

A common rule of thumb is that roughly 80% of the time and effort in a data science project is spent on the first 3 phases alone.

## Phase 4) Modeling

- Select which types of models to try, and note their assumptions.
- Set aside some of the data to test the models (for example, hold out 25% for testing).
- Build and evaluate the models.
- Revise and iterate until you reach the stakeholder’s specifications.
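
As an illustration of the hold-out step, here is a minimal sketch using scikit-learn. The synthetic data, the choice of `LinearRegression`, and R² as the metric are all assumptions made for the example; a real project would substitute its own prepared features, candidate models, and the metric that matches the stakeholder's goal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (illustration only): y depends linearly on X plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 25% of the rows for testing; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Build the model on the training data only.
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on both splits; a large gap between them suggests overfitting.
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

Fixing `random_state` keeps the split reproducible, which makes it easier to compare models fairly across iterations.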

## Phase 5) Evaluation (in Business Terms)

- Which model/result(s) should we provide to our stakeholders?
- Do the results meet the stakeholder’s success criteria?
- Summarize findings and make a note of anything you would do differently if you had to start the project from the beginning again.
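
One way to make the first two questions concrete is to write the success criteria down as explicit checks. The sketch below is purely illustrative: the model names, scores, and thresholds are hypothetical placeholders, not results from any real project.

```python
# Hypothetical test-set results for three candidate models.
candidates = {
    "linear_regression": {"test_r2": 0.71, "explainable": True},
    "random_forest": {"test_r2": 0.83, "explainable": False},
    "decision_tree": {"test_r2": 0.78, "explainable": True},
}

# Hypothetical success criteria agreed with the stakeholder in Phase 1.
MIN_TEST_R2 = 0.75
MUST_BE_EXPLAINABLE = True

# Keep only the models that satisfy every criterion.
acceptable = {
    name: result
    for name, result in candidates.items()
    if result["test_r2"] >= MIN_TEST_R2
    and (result["explainable"] or not MUST_BE_EXPLAINABLE)
}

# Among the acceptable models, recommend the best-scoring one (if any).
recommended = max(
    acceptable, key=lambda name: acceptable[name]["test_r2"], default=None
)
print(recommended)
```

Note that the highest-scoring model is not necessarily the one recommended: here the hypothetical stakeholder requires an explainable model, which rules out the top scorer.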

## Phase 6) Deployment

Depending on the project requirements, the deployment phase could be as simple as generating a final report/presentation, or it could be as complex as deploying models in the cloud with an interface for retrieving predictions on-demand.

- As entry-level data scientists and analysts, you will most likely not be responsible for deploying models to the cloud.
- Produce Final Report/Presentation
    - Create a final summary of the results for the stakeholders.
    - Prepare a short (non-technical) executive summary presentation with your results and business recommendations.
- Identify Future Directions/Future Work
    - What would you change or add next if you continued working on the project?

### References

- Hotz, N. (2018, September 10). What is CRISP DM? *Data Science Process Alliance*. https://www.datascience-pm.com/crisp-dm-2/
- Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (1999). *CRISP-DM 1.0: Step-by-step data mining guide*. https://web.archive.org/web/20220401041957/https://www.the-modeling-agency.com/crisp-dm.pdf
- Quantum. (2019, August 20). *Data Science project management methodologies*. Medium. https://medium.datadriveninvestor.com/data-science-project-management-methodologies-f6913c6b29eb