# Checklist for Data Science Projects

A checklist to apply to your data science projects. Based on the CRISP-DM structure and the methodology proposed by Aurélien Géron, this routine fits most data science projects.

Remember that it is not a rigid or immutable checklist. On the contrary!

This guide is meant to keep you from starting from scratch. You can (and should) adapt it to your reality when working on a data science project.

### 1 - Understand the Problem
- Look at the big picture and define the scope of the project
- What are the existing solutions?
- How will the solution be used?
- Which approach to use?
    - Supervised Learning
    - Unsupervised Learning
    - Reinforcement Learning
- What is the performance metric?
- What is the minimum performance expected to achieve the goal?
- List the basic assumptions of the project

### 2 - Explore the Data
- Create a copy of the data for the exploration
- Create a Jupyter Notebook to document the exploration
- Study each attribute and its characteristics:
    - Name
    - Type
        - Categorical
        - Numeric
            - int
            - float
        - Structured
        - Unstructured
        - etc
    - % of missing values
    - Noise in the data and type of noise (outliers, stochastic noise, rounding errors)
    - Distribution type
        - Gaussian
        - Uniform
        - Logarithmic
        - etc
- Identify the target variable
- Visualize the data
- Study the correlations between attributes
- Identify the transformations that can be applied
- Identify extra data that may be useful
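
As a minimal sketch of the per-attribute study above, the snippet below builds a summary table (type, % of missing values, number of distinct values) and a correlation matrix with pandas. The toy data, the column names, and the helper `summarize_attributes` are hypothetical and only illustrate the idea; swap in a copy of your own dataset.

```python
import numpy as np
import pandas as pd


def summarize_attributes(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per attribute: dtype, % missing, and distinct values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "pct_missing": df.isna().mean() * 100,
            "n_unique": df.nunique(),
        }
    )


# Hypothetical toy data; replace with your own dataset.
raw = pd.DataFrame(
    {
        "age": [22, 35, np.nan, 41],
        "income": [1800.0, 3200.0, 2500.0, 4100.0],
        "city": ["A", "B", "B", None],
    }
)
explore = raw.copy()  # explore on a copy, never on the original

summary = summarize_attributes(explore)
# Pairwise Pearson correlation between the numeric attributes only.
correlations = explore.select_dtypes("number").corr()

print(summary)
print(correlations)
```

Distribution type and noise usually still need a visual check (histograms, box plots), which this summary does not replace.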

### 3 - Prepare the Data
- Work on copies of the data
- Write functions for all transformations
    
    - 1 - Data Cleaning
        - Repair or remove outliers
        - Fill in missing values or delete rows / columns
            - Zero
            - Average
            - Median
            - etc

    - 2 - Attribute Selection
        - Eliminate attributes (features) that do not contain useful information

    - 3 - Feature Engineering
        - Discretize continuous variables
        - Decompose features (categorical, date, time)
        - Apply transformations to variables
        - Aggregate features to generate new ones

    - 4 - Feature Scaling
        - Normalize or standardize features
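
One possible way to package these transformations as a reusable function is a scikit-learn `ColumnTransformer`. The sketch below covers only median imputation, one-hot encoding, and standardization; outlier handling, attribute selection, and feature engineering are left out. The function name `build_preprocessor`, the column names, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Build a transformer that imputes, encodes, and scales the given columns."""
    numeric = Pipeline(
        [
            ("impute", SimpleImputer(strategy="median")),  # fill gaps with the median
            ("scale", StandardScaler()),  # zero mean, unit variance
        ]
    )
    categorical = Pipeline(
        [
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]
    )
    return ColumnTransformer(
        [
            ("num", numeric, numeric_cols),
            ("cat", categorical, categorical_cols),
        ]
    )


# Hypothetical toy data; replace with your own dataset.
data = pd.DataFrame(
    {
        "age": [22.0, 35.0, np.nan, 41.0],
        "city": ["A", "B", "B", np.nan],
    }
)

preprocessor = build_preprocessor(["age"], ["city"])
prepared = preprocessor.fit_transform(data.copy())  # work on a copy
print(prepared.shape)
```

Keeping every step inside one fitted object means the same transformations can be reapplied to new data with `preprocessor.transform(...)`.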
        
### 4 - Model Construction
- Automate as many steps as possible
- Train more than one model and compare performances
- Analyze the most significant variables for each algorithm
- Fine-tune hyperparameters
- Use cross-validation
- Check the performance of ensemble methods, combining the models that had the best individual performances
- Evaluate the final model on the test dataset
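
The comparison loop above could look roughly like this with scikit-learn. It is a sketch under stated assumptions: the bundled iris dataset, the two candidate models, and accuracy as the metric are illustrative choices, and hyperparameter tuning and ensembling are omitted for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Illustrative dataset; replace with your prepared features and target.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train more than one model and compare them on the same folds.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Refit the best candidate on the full training set.
best_name = max(cv_scores, key=cv_scores.get)
best_model = candidates[best_name].fit(X_train, y_train)

# Touch the test set only once, at the very end.
test_accuracy = best_model.score(X_test, y_test)
print(cv_scores, best_name, test_accuracy)
```

Selecting the model on cross-validation scores and reserving the test set for a single final evaluation keeps the reported performance from being optimistic.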

### 5 - Presentation of the Solution and Deployment
- Document all steps
- Make all steps reproducible (for example, script the file downloads or use the Kaggle API)
- Remember Storytelling
    - Decision makers and directors are probably unfamiliar with the technical details
- Choose the best chart to communicate each insight discovered
- Write unit tests
- Create monitoring routines and alerts
- Determine when to update the model
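
A monitoring rule for deciding when to update the model can start out very simple. The sketch below flags retraining when the average of recent scores drops more than a tolerance below the score measured at deployment. The function name `needs_retraining`, the default tolerance of 0.05, and the sample scores are hypothetical; the right threshold depends on the minimum performance defined in step 1.

```python
from statistics import mean


def needs_retraining(baseline_score, recent_scores, tolerance=0.05):
    """Return True when the recent average falls more than `tolerance` below baseline."""
    if not recent_scores:
        raise ValueError("recent_scores must not be empty")
    return mean(recent_scores) < baseline_score - tolerance


# Hypothetical weekly accuracy values collected after deployment.
stable_weeks = [0.89, 0.91, 0.90]
degraded_weeks = [0.80, 0.82, 0.78]

print(needs_retraining(0.90, stable_weeks))    # performance held up
print(needs_retraining(0.90, degraded_weeks))  # performance dropped
```

A function this small is also easy to cover with unit tests, and its result can feed whatever alerting channel the project already uses.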