```
From: https://github.com/ksatola
Version: 0.0.1

TODOs
1. see also CRISP-DM

```

# Applied Machine Learning Process
Each machine learning project is different because the specific data at the core of the project is different. That does not mean that others have not worked on similar prediction tasks or perhaps even the same high-level task, but you may be the first to use the specific data that you have collected (unless you are using a standard dataset for practice). This makes each machine learning project unique. No one can tell you what the best results are or might be, or what algorithms to use to achieve them. You must establish a baseline in performance as a point of reference against which to compare all of your models, and you must discover what algorithm works best for your specific dataset.

Even though your project is unique, the steps on the path to a good or even the best result are generally the same from project to project. This is sometimes referred to as the `applied machine learning process`, `data science process`, or the older name `knowledge discovery in databases (KDD)`. The process consists of a sequence of steps. The steps themselves are broadly the same, but their names and the tasks performed in each may differ from description to description:
- **Step 1: Define Problem.** This step is concerned with learning enough about the project to select the framing or framings of the prediction task. For example, is it classification or regression, or some other higher-order problem type? It involves collecting the data that is believed to be useful in making a prediction and clearly defining the form that the prediction will take. It may also involve talking to project stakeholders and other people with deep expertise in the domain. This step also involves taking a close look at the data, as well as perhaps exploring the data using summary statistics and data visualization.

    Defining the problem may involve the following sub-tasks:
    - Gather data from the problem domain.
    - Discuss the project with subject matter experts.
    - Select those variables to be used as inputs and outputs for a predictive model.
    - Review the data that has been collected.
    - Summarize the collected data using statistical methods.
    - Visualize the collected data using plots and charts.
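
The "summarize the collected data" sub-task can be sketched with nothing more than the Python standard library. This is a minimal illustration, not a prescribed method: the `summarize` helper, the column names, and the toy values are all hypothetical. In practice a library call such as `pandas.DataFrame.describe` produces a similar summary.

```python
import statistics


def summarize(columns):
    """Return count, mean, standard deviation, min, and max for each numeric column."""
    summary = {}
    for name, values in columns.items():
        summary[name] = {
            "count": len(values),
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values),  # sample standard deviation
            "min": min(values),
            "max": max(values),
        }
    return summary


# Hypothetical toy data: two input variables and one output variable.
data = {
    "age": [22, 35, 47, 51, 63],
    "income": [28.0, 41.5, 39.0, 62.5, 58.0],
    "purchased": [0, 0, 1, 1, 1],
}

for name, stats in summarize(data).items():
    print(name, stats)
```

Even a summary this small can expose problems such as implausible minimum or maximum values before any modeling begins.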
    
    
- **Step 2: Prepare Data.** This step is concerned with transforming the raw data that was collected into a form that can be used in modeling.

    Preparing data may involve the following sub-tasks:
    - **Data Cleaning:** Identifying and correcting mistakes or errors in the data.
    - **Feature Selection:** Identifying those input variables that are most relevant to the task.
    - **Data Transforms:** Changing the scale or distribution of variables.
    - **Feature Engineering:** Deriving new variables from available data.
    - **Dimensionality Reduction:** Creating compact projections of the data.
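
Two of the sub-tasks above, data cleaning and data transforms, can be sketched as follows. This is a simplified example under stated assumptions: missing values are represented as `None`, mean imputation is only one of many possible cleaning strategies, and the helper names `impute_mean` and `standardize` are hypothetical.

```python
import statistics


def impute_mean(values):
    """Data cleaning: replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def standardize(values):
    """Data transform: rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumes the column is not constant
    return [(v - mean) / std for v in values]


# Hypothetical raw column with one missing value.
raw_income = [28.0, None, 39.0, 62.5, 58.0]
clean_income = impute_mean(raw_income)
scaled_income = standardize(clean_income)

print(clean_income)
print(scaled_income)
```

When a transform like this is used inside a model evaluation, its statistics (here the mean and standard deviation) should be computed on the training data only and then applied to the test data, so that no information leaks from the test set.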


- **Step 3: Evaluate Models.** This step is concerned with evaluating machine learning models on your dataset. It requires that you design a robust test harness for evaluating your models, so that the results you get can be trusted and used to select among the models that you have evaluated. This involves tasks such as selecting a `performance metric` for evaluating the skill of a model, `establishing a baseline` or floor in performance to which all model evaluations can be compared, and choosing a `resampling technique` for splitting the data into training and test sets to simulate how the final model will be used.
    
    Model evaluation may involve the following sub-tasks:
    - Select a performance metric for evaluating model predictive skill.
    - Select a model evaluation procedure.
    - Select algorithms to evaluate.
    - Tune algorithm hyperparameters.
    - Combine predictive models into ensembles.
    
    
    For quick and dirty estimates of model performance, or for a very large dataset, a single `train-test split` of the data may be performed. It is more common to use `k-fold cross-validation` as the data resampling technique, often with repeats of the process to improve the robustness of the result. This step also involves tasks for getting the most out of well-performing models such as hyperparameter tuning and ensembles of models.
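
The three ingredients of the test harness described in this step (a performance metric, a baseline, and a resampling technique) can be combined in a small sketch. It uses classification accuracy as the metric, a majority-class predictor as the baseline, and k-fold cross-validation as the resampling technique. All function names and the toy labels are hypothetical, and libraries such as scikit-learn provide more complete equivalents (for example `KFold` and `DummyClassifier`).

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Resampling: shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_class(labels):
    """Baseline 'model': always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Performance metric: fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate_baseline(labels, k=5, seed=0):
    """Return the baseline accuracy on each of the k held-out folds."""
    folds = k_fold_indices(len(labels), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one currently held out.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        prediction = majority_class([labels[j] for j in train_idx])
        y_test = [labels[j] for j in test_idx]
        scores.append(accuracy(y_test, [prediction] * len(y_test)))
    return scores


# Hypothetical labels: 14 examples of class 0 and 6 of class 1.
labels = [0] * 14 + [1] * 6
scores = cross_validate_baseline(labels, k=5)
print("fold accuracies:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

Any real model evaluated with the same folds and the same metric should beat the mean baseline score; if it does not, it has not learned anything useful from the inputs.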


- **Step 4: Finalize Model.** This step is concerned with selecting and using a final model. Once a suite of models has been evaluated, you must choose a model that represents the solution to the project. This is called `model selection` and may involve further evaluation of candidate models on a hold-out validation dataset, or selection via other project-specific criteria such as model complexity. It may also involve summarizing the performance of the model in a standard way for project stakeholders, which is an important step. Finally, there will likely be tasks related to the `productization of the model`, such as integrating it into a software project or production system and designing a `monitoring and maintenance schedule for the model`.
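
One way to combine a hold-out validation score with a complexity criterion during model selection is sketched here. It is only one possible rule, not a standard one: prefer the simplest candidate whose score is within a small tolerance of the best. The `select_model` helper, the candidate names, the scores, and the complexity ranks are all made up for illustration.

```python
def select_model(candidates, tolerance=0.01):
    """Pick the simplest candidate whose score is within `tolerance` of the best score."""
    best = max(c["score"] for c in candidates)
    close_enough = [c for c in candidates if best - c["score"] <= tolerance]
    return min(close_enough, key=lambda c: c["complexity"])


# Hypothetical hold-out results; a lower complexity rank means a simpler model.
candidates = [
    {"name": "logistic_regression", "score": 0.905, "complexity": 1},
    {"name": "k_nearest_neighbors", "score": 0.850, "complexity": 2},
    {"name": "random_forest", "score": 0.910, "complexity": 3},
]

chosen = select_model(candidates)
print("selected:", chosen["name"])
```

With `tolerance=0.0` the rule reduces to picking the highest-scoring candidate regardless of complexity.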

**There may be a lot of interplay between the definition of the problem and the preparation of the data. There may also be interplay between the data preparation step and the evaluation of models.** Information known about the choice of algorithms and the discovery of well-performing algorithms can also inform the selection and configuration of data preparation methods. For example, the choice of algorithms may impose requirements and expectations on the type and form of input variables in the data. This might require variables to have a specific probability distribution, the removal of correlated input variables, and/or the removal of variables that are not strongly related to the target variable.
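
As an example of an algorithm-driven data preparation requirement, the removal of correlated input variables can be sketched as follows. This is a minimal greedy filter under stated assumptions: it uses the Pearson correlation, an arbitrary threshold of 0.95, and columns that are not constant. The helper names and the toy columns are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)  # assumes neither column is constant


def drop_correlated(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with a column already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical inputs: height is recorded twice in different units.
height_cm = [150, 160, 170, 180, 190]
columns = {
    "height_cm": height_cm,
    "height_in": [h / 2.54 for h in height_cm],
    "weight_kg": [65, 58, 80, 72, 90],
}

print(list(drop_correlated(columns)))
```

Because the filter is greedy, the result depends on column order: the first column of a correlated group is the one that is kept.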