# 1. Background

**Business Objective** - Define the objective in business terms.

* 

**Solution Usage** - How will my solution be used?

* 

**Existing Solutions** - What are the current solutions/workarounds, if any?

* 

**Solution Approach** - How should I frame the problem? 
(supervised/unsupervised, online/offline, etc.)

* 

**Performance Measure** - How should performance be measured?

* 

**Performance vs. Business Objective** - Is the performance measure aligned 
with the business objective?

* 

**Performance Threshold** - What would be the minimum performance needed to 
meet the business objective?

* 

**Comparable Solutions** - What are comparable problems? Can I reuse 
experience or tools?

* 

**Human Expertise** - Is human expertise available?

* 

**Manual Solution** - How would I solve the problem manually?

* 

**Initial Assumptions** - List the assumptions I (or others) have made so far.

* 

**Assumption Verification** - Verify assumptions if possible.

* 

# 2. Get the Data

Note: Automate as much as possible so I can easily get fresh data if needed.

**Requirements** - List the data I need and how much I'll need.

* 

**Sources** - Find and document where I can get the data.

* 

**Storage Requirements** - Check how much space it will take.

* 

**Authorization** - Check legal obligations, and get legal/access 
authorization if necessary.

* 

## Acquire the Data

## Format the Data

Convert the data to a format I can easily manipulate (without changing the data
itself)

## Handle Sensitive Information

Ensure sensitive information is deleted or protected (e.g., anonymized)

## Data Size & Type

Check the size and type of data (time series, sample, geographical, etc.)

## Train/Test Split

Sample a test set, put it aside, and never look at it (no data snooping!)
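
A minimal sketch of one way to keep the test set stable across data refreshes, assuming every instance has a unique, immutable identifier. The `id` field and the helper names are hypothetical, and only the standard library is used:

```python
import zlib

def is_in_test_set(identifier, test_ratio=0.2):
    """Return True if the instance with this identifier belongs to the test set."""
    # crc32 is a stable hash, so an instance never switches sets between runs.
    checksum = zlib.crc32(str(identifier).encode("utf-8")) & 0xFFFFFFFF
    return checksum < test_ratio * 2**32

def split_by_id(rows, id_key, test_ratio=0.2):
    """Split a list of dict rows into (train, test) based on the hashed id."""
    train, test = [], []
    for row in rows:
        (test if is_in_test_set(row[id_key], test_ratio) else train).append(row)
    return train, test

# Hypothetical toy data: 1000 rows with sequential ids.
rows = [{"id": i, "value": i * 1.5} for i in range(1000)]
train_set, test_set = split_by_id(rows, "id")
print(len(train_set), len(test_set))
```

Because the assignment depends only on the identifier, rows added later never move existing instances from the training set into the test set. A purely random split is simpler but loses this property unless the seed and the row order are both fixed.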

# 3. EDA

Note: Try to get insight from a field expert for these steps

## Create a Copy

Create a copy of the data for exploration (sampling it down to a manageable size
if necessary)

## Attributes

Study each attribute and its characteristics:

* Name
* Type (categorical, float/int, bounded/unbounded, text, structured, etc.)
* % of missing values
* Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
* Usefulness for the task
* Type of distribution (Gaussian, uniform, logarithmic, etc.)

| Name | Type | % Null | Noise | Usefulness | Distribution |
| ---- | ---- | ------ | ----- | ---------- | ------------ |
| | | | | | |
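
Part of the table above can be filled in automatically. The sketch below assumes the data is held column-wise in plain Python lists with `None` for missing values. The toy columns and the `profile_attribute` helper are hypothetical, and noise, usefulness, and distribution still need a manual look:

```python
import math
import statistics

def profile_attribute(name, values):
    """Summarize one attribute: inferred type, % missing, and basic statistics."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    pct_missing = 100.0 * (len(values) - len(present)) / len(values) if values else 0.0
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    summary = {
        "name": name,
        "type": "numeric" if numeric else "categorical/text",
        "pct_missing": round(pct_missing, 1),
    }
    if numeric:
        summary["min"], summary["max"] = min(present), max(present)
        summary["median"] = statistics.median(present)
    else:
        summary["n_unique"] = len(set(present))
    return summary

# Hypothetical toy dataset stored column-wise.
columns = {
    "age": [34, 51, None, 29, 42],
    "city": ["Oslo", "Lima", "Oslo", None, "Pune"],
}
profiles = [profile_attribute(name, values) for name, values in columns.items()]
for profile in profiles:
    print(profile)
```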



## Target Attribute(s)

For supervised tasks, identify the target attribute(s)

## Data Visualization

Visualize the data

## Correlation

Study the correlation between attributes
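
As a rough sketch, the Pearson correlation of each attribute with the target can be computed and ranked as follows. The toy columns and function names are hypothetical, and Pearson's coefficient is just one possible choice of correlation measure:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant attribute
    return cov / math.sqrt(var_x * var_y)

def correlations_with_target(columns, target_name):
    """Rank attributes by the absolute value of their correlation with the target."""
    target = columns[target_name]
    scores = {
        name: pearson(values, target)
        for name, values in columns.items()
        if name != target_name
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy data with "price" as the target attribute.
columns = {
    "rooms": [2, 3, 4, 5, 6],
    "age": [30, 12, 25, 8, 15],
    "price": [100, 150, 200, 250, 300],
}
ranking = correlations_with_target(columns, "price")
for name, r in ranking:
    print(f"{name}: {r:+.2f}")
```

Pearson's coefficient only captures linear relationships, so a value near zero does not rule out a strong nonlinear dependency. Scatter plots remain worth checking.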

## Learnings

**Manual Solution - Revisited** - How would I solve the problem manually?

* 

**Potential Transformations** - Identify the promising transformations I may want to apply.

* 

**Extra Data** - Identify extra data that would be useful (go back to section 2, "Get the Data").

* 

**Summary** - Document what I have learned.

* 

# 4. Prepare the Data

Notes:

* Work on copies of the data (keep the original dataset intact)
* Write functions for all data transformations I apply, for 5 reasons:
  * So I can easily prepare the data the next time I get a fresh dataset.
  * So I can easily apply these transformations for future projects.
  * To clean and prepare the test set.
  * To clean and prepare new data instances once my solution is live.
  * To make it easy to treat the preparation choices as hyperparameters.

## 4.1. Clean the Data

### 4.1.1. Outliers

Fix or remove outliers, if necessary.
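
One common heuristic for flagging outliers is the interquartile-range (IQR) rule. The sketch below assumes a single numeric attribute and the conventional multiplier `k=1.5`. Both the rule and the helper names are illustrative choices, not the only option:

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) bounds; values outside them are treated as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(values, k=1.5):
    """Separate values into (kept, outliers) using the IQR rule."""
    low, high = iqr_bounds(values, k)
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers

# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]
kept, outliers = split_outliers(readings)
print("kept:", kept)
print("outliers:", outliers)
```

Whether a flagged value should be removed, clipped, or kept is a judgment call. Some "outliers" are exactly the rare events the model needs to learn.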

### 4.1.2. Missing Values

Fill in missing values (zero, mean, median) or drop the rows/columns.
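
A minimal sketch of median imputation, assuming missing values are represented as `None`. The fill value is learned on the training set only and then reused for the test set and for new instances. The variable and function names are hypothetical:

```python
import statistics

def fit_median(values):
    """Learn the median from the non-missing training values."""
    return statistics.median(v for v in values if v is not None)

def impute(values, fill_value):
    """Replace missing entries with a previously learned fill value."""
    return [fill_value if v is None else v for v in values]

# Hypothetical attribute with missing entries.
train_income = [3.2, None, 4.1, 2.8, None, 5.0]
test_income = [None, 3.9]

median_income = fit_median(train_income)           # learned on the training set only
train_filled = impute(train_income, median_income)
test_filled = impute(test_income, median_income)   # reuse the training median
print(median_income, train_filled, test_filled)
```

Keeping "fit" and "apply" as separate functions makes it easy to reuse the same transformation later and to swap the strategy (zero, mean, median) when treating it as a hyperparameter.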

## 4.2. Feature Selection

Drop the attributes that provide no useful information for the task.

## 4.3. Feature Engineering (Where Appropriate)

### 4.3.1. Discretize Continuous Features

### 4.3.2. Decompose Features

E.g., categorical, date/time, etc.
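
As an illustration, a timestamp can be decomposed into several simpler features, and a categorical value into one-hot indicators. The feature names, the example timestamp, and the color categories are all hypothetical:

```python
from datetime import datetime

def decompose_timestamp(iso_string):
    """Break an ISO 8601 timestamp into simpler numeric/boolean features."""
    ts = datetime.fromisoformat(iso_string)
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_week": ts.weekday(),      # Monday = 0, Sunday = 6
        "hour": ts.hour,
        "is_weekend": ts.weekday() >= 5,
    }

def one_hot(value, categories):
    """Encode a categorical value as one 0/1 indicator per known category."""
    return {f"is_{category}": int(value == category) for category in categories}

features = decompose_timestamp("2024-03-15T14:30:00")
features.update(one_hot("red", ["red", "green", "blue"]))
print(features)
```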

### 4.3.3. Transform Features

Add promising transformations of features (e.g., $\log(x)$, $\sqrt{x}$, $x^2$, etc.)

### 4.3.4. Aggregate Features

Aggregate features into promising new features

## 4.4. Feature Scaling

Standardize or normalize features
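
A short sketch of both options, assuming a single numeric attribute. The scaling parameters are learned from the training data and then applied unchanged to other data. The function names are hypothetical:

```python
import statistics

def fit_standardizer(values):
    """Learn mean and standard deviation from the training values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, std or 1.0  # fall back to 1.0 for a constant attribute

def standardize(values, mean, std):
    """Rescale to zero mean and unit variance (relative to the training data)."""
    return [(v - mean) / std for v in values]

def fit_min_max(values):
    """Learn the minimum and maximum from the training values."""
    return min(values), max(values)

def normalize(values, low, high):
    """Rescale so the training range maps to [0, 1]."""
    span = (high - low) or 1.0
    return [(v - low) / span for v in values]

train = [10.0, 20.0, 30.0, 40.0, 50.0]
test = [60.0]

mean, std = fit_standardizer(train)
low, high = fit_min_max(train)
print(standardize(train, mean, std))
print(normalize(train, low, high))
print(normalize(test, low, high))  # new data can fall outside [0, 1]
```

Min-max normalization is more sensitive to outliers than standardization, because a single extreme value stretches the whole range.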

# 5. Initial Models

Notes:

* If data is huge, may want to sample smaller training sets to train many 
  different models in a reasonable time
  * Beware, this penalizes complex models such as large neural nets or random
    forests.
* Again, try to automate these steps as much as possible.

## 5.1. Model Training

Train many quick-and-dirty models from different categories:

* Linear
* Naive Bayes
* SVM
* Random Forest
* Neural Network
* Etc.

## 5.2. Measure Performance

Measure and compare performance. For each model, use N-fold cross-validation 
and compute the mean and standard deviation of the performance measure across the N folds.
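
The procedure above can be sketched in plain Python as follows. The "model" is a deliberately trivial, hypothetical baseline that always predicts the training mean, and mean absolute error stands in for the performance measure. The `fit` and `score` callables are placeholders for real training and evaluation code:

```python
import random
import statistics

def k_fold_indices(n_samples, n_folds, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, val_idx

def cross_val_scores(fit, score, X, y, n_folds=5):
    """Train on N-1 folds, score on the held-out fold, and repeat for every fold."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), n_folds):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
    return scores

def fit_mean(X_train, y_train):
    """Hypothetical baseline: the 'model' is just the mean training target."""
    return statistics.fmean(y_train)

def mean_absolute_error(model, X_val, y_val):
    """Score the baseline by its mean absolute error on the validation fold."""
    return statistics.fmean(abs(target - model) for target in y_val)

# Hypothetical toy regression data.
X = [[float(i)] for i in range(50)]
y = [2.0 * row[0] + 1.0 for row in X]

scores = cross_val_scores(fit_mean, mean_absolute_error, X, y, n_folds=5)
print(f"MAE: {statistics.fmean(scores):.2f} +/- {statistics.stdev(scores):.2f}")
```

The standard deviation matters as much as the mean here: two models whose mean scores differ by less than the spread across folds are not clearly distinguishable.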

## 5.3. Significant Variables

Analyze the most significant variables for each algorithm.

## 5.4. Error Types

Analyze the types of errors the models made. What data would a human have used 
to avoid these errors?

## 5.5. Feature Engineering - Revisited

Perform a quick round of feature selection and engineering.

## 5.6. Repeat

Repeat steps 5.1 - 5.5 one or two more times.

## 5.7. Model Selection

Shortlist the top 3-5 most promising models. Prefer models that make different
types of errors.

# 6. Fine-Tuning

Notes:

* Want to use as much data as possible in this step, especially as I move 
  toward the end of fine-tuning.
* As always, automate wherever possible.

## 6.1. Hyperparameter Tuning

Fine-tune the hyperparameters using cross-validation:

* Treat the data transformation choices as hyperparameters, especially when not
  sure about them (e.g., if not sure whether to replace missing values with
  zero or the median, or to drop the rows).
* Unless there are very few hyperparameter values to explore, prefer random
  search over grid search. If training is very long, may prefer a Bayesian 
  optimization approach (e.g., using Gaussian process priors. See work by
  [Jasper Snoek et al.](https://arxiv.org/abs/1206.2944)).
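
A minimal random-search sketch, in which a data preparation choice (the imputation strategy) is sampled alongside ordinary model hyperparameters. The search space, the parameter names, and the `evaluate` function are hypothetical. In a real project, `evaluate` would run cross-validation and return the mean error:

```python
import math
import random

def random_search(evaluate, search_space, n_iter=20, seed=0):
    """Sample n_iter random configurations and return the lowest-error one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {name: sample(rng) for name, sample in search_space.items()}
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Each entry maps a hyperparameter name to a function that samples one value.
search_space = {
    "imputation": lambda rng: rng.choice(["zero", "median", "drop_rows"]),
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, 0),  # log-uniform in [1e-4, 1]
    "max_depth": lambda rng: rng.randint(2, 12),
}

def evaluate(params):
    """Stand-in for a cross-validated error (lower is better); purely synthetic."""
    penalty = {"zero": 0.3, "median": 0.0, "drop_rows": 0.1}[params["imputation"]]
    return (
        abs(math.log10(params["learning_rate"]) + 2)
        + 0.1 * abs(params["max_depth"] - 6)
        + penalty
    )

best_params, best_score = random_search(evaluate, search_space, n_iter=50)
print(best_params, round(best_score, 3))
```

Sampling the learning rate on a log scale is a common choice when plausible values span several orders of magnitude.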

## 6.2. Try Ensemble Methods

Combining the best models will often produce better performance than running them 
individually.
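
A toy illustration of hard voting, assuming three classifiers whose predictions are already available as lists of class labels. The predictions are made up so that each model errs on a different instance, which is the situation where voting helps most:

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine class predictions from several models by hard (majority) voting."""
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*predictions_per_model)
    ]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels and predictions; each model is wrong on a different instance.
y_true  = [1, 0, 1, 1, 0, 1]
model_a = [0, 0, 1, 1, 0, 1]
model_b = [1, 1, 1, 1, 0, 1]
model_c = [1, 0, 0, 1, 0, 1]

ensemble = majority_vote([model_a, model_b, model_c])
for name, preds in [("a", model_a), ("b", model_b), ("c", model_c), ("ensemble", ensemble)]:
    print(name, round(accuracy(y_true, preds), 3))
```

If the models made the same mistakes, the vote would simply reproduce them, which is why shortlisting models with different error types pays off.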

## 6.3. Final Performance Measure

Once confident about the final model, measure its performance on the test set
to estimate the generalization error.

> ⚠️ Note
>
> Don't tweak the model after measuring the generalization error. This would
> lead to overfitting the test set.

# 7. Solution Overview

Optional.

1. Document what has been done
2. Create a nice presentation
   1. Make sure to highlight the big picture first
3. Explain why my solution achieves the business objectives.
4. Don't forget to present interesting points noticed along the way.
   1. What worked and what didn't work
   2. Assumptions and limitations
5. Ensure key findings are communicated through beautiful visualizations or 
   easy-to-remember statements (e.g., 'the median income is the number-one 
   predictor of housing prices.')

# 8. Launch

1. Get solution ready for production (plug into production data inputs, write 
   unit tests, etc.)
2. Write monitoring code to check the system's live performance at regular 
   intervals and trigger alerts when it drops.
   1. Beware of slow degradation: models tend to 'rot' as data evolves.
   2. Measuring performance may require a human pipeline (e.g., via a 
      crowdsourcing service)
   3. Also monitor quality of inputs (e.g., a malfunctioning sensor sending 
      random values, or another team's output becoming stale). This is 
      particularly important for online learning systems.
3. Retrain models on a regular basis with fresh data (automate as much as 
   possible)
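
A rough sketch of the monitoring idea, assuming ground-truth labels eventually become available for live predictions. The class name, the accuracy metric, the threshold, and the window size are all hypothetical. A real system would also need to deliver the alert somewhere:

```python
from collections import deque

class PerformanceMonitor:
    """Track rolling accuracy over the most recent labeled predictions."""

    def __init__(self, threshold, window_size=100):
        self.threshold = threshold
        self.window = deque(maxlen=window_size)  # old outcomes drop off automatically

    def record(self, was_correct):
        """Store the outcome of one prediction once its true label is known."""
        self.window.append(1.0 if was_correct else 0.0)

    def rolling_accuracy(self):
        """Accuracy over the current window, or None if nothing was recorded yet."""
        return sum(self.window) / len(self.window) if self.window else None

    def should_alert(self):
        """Alert only once the window is full, to avoid noisy early alarms."""
        if len(self.window) < self.window.maxlen:
            return False
        return self.rolling_accuracy() < self.threshold

monitor = PerformanceMonitor(threshold=0.8, window_size=10)
for _ in range(10):
    monitor.record(True)        # the model starts out healthy
print(monitor.rolling_accuracy(), monitor.should_alert())

for _ in range(5):
    monitor.record(False)       # performance degrades as the data drifts
print(monitor.rolling_accuracy(), monitor.should_alert())
```

A rolling window reacts to gradual degradation that a one-off check would miss. The window size trades off reaction speed against noise.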