# Planning

**The goal** of this stage is to clearly define your goal(s), your measures of success, and your plan for achieving them.

**The deliverable** is documentation of your goal, your measure of success, and how you plan on getting there.

**How to get there:** Answer questions about the final product, and formulate or identify any initial hypotheses (your own or others').

**Common questions include:**
- What will the end product look like?
- What format will it be in?
- Who will it be delivered to?
- How will it be used?
- How will I know I'm done?
- What is my MVP?
- How will I know it's good enough?


**Common hypotheses include:**
- Is attribute V1 related to attribute V2?
- Is the mean of target variable Y for subset A significantly different from that of subset B?
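
The two hypothesis templates above map onto standard statistical tests. The following is a minimal sketch, assuming scipy is available and using synthetic data; the column names (v1, v2, y, group) and the 0.05 significance level are illustrative assumptions, not part of any real project.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data for illustration only; all column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "v1": rng.normal(size=200),
    "group": rng.choice(["A", "B"], size=200),
})
df["v2"] = 2 * df["v1"] + rng.normal(scale=0.5, size=200)
df["y"] = rng.normal(loc=10, size=200) + (df["group"] == "B") * 1.5

alpha = 0.05  # assumed significance level

# Is attribute V1 related to attribute V2? (Pearson correlation)
r, p_corr = stats.pearsonr(df["v1"], df["v2"])

# Is the mean of Y for subset A different from that of subset B?
# (two-sample t-test without assuming equal variances)
y_a = df.loc[df["group"] == "A", "y"]
y_b = df.loc[df["group"] == "B", "y"]
t_stat, p_ttest = stats.ttest_ind(y_a, y_b, equal_var=False)

print(f"correlation: r={r:.2f}, reject H0: {p_corr < alpha}")
print(f"t-test: t={t_stat:.2f}, reject H0: {p_ttest < alpha}")
```

Writing the hypotheses down in this testable form during planning makes it easier to decide later which tests to run in exploration.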

# Acquisition

**The goal** is to create a path from original data sources to the environment in which you will work with the data. You will gather data from sources in order to prepare and clean it in the next step.

**The deliverable** is a file, acquire.py, that contains the function(s) needed to reproduce the acquisition of data.

**How to get there:**

- If the data source is SQL, you may need to do some clean-up, integration, aggregation, or other manipulation of the data in the SQL environment before reading it into your Python environment.
- Using the Python library pandas, acquire the data into a dataframe using a function that reads from your source type, such as pandas.read_csv for acquiring data from a CSV file.
- You may use Spark and/or Hive when acquiring data from a distributed environment, such as HDFS.

Examples of source types include RDBMS, NoSQL, HDFS, cloud files (S3, Google Drive), and static local flat files (csv, txt, xlsx).
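
As a rough sketch of what acquire.py might contain, the functions below wrap pandas.read_csv and pandas.read_sql. The function names, the sample CSV text, and the in-memory SQLite table are hypothetical stand-ins so that the example runs on its own; a real project would point at its actual files and database.

```python
import io
import sqlite3

import pandas as pd


def acquire_csv(source):
    """Read a CSV (path, URL, or file-like object) into a dataframe."""
    return pd.read_csv(source)


def acquire_sql(query, connection):
    """Run a SQL query and return the result as a dataframe."""
    return pd.read_sql(query, connection)


# Stand-in for a flat file on disk (hypothetical columns).
sample_csv = io.StringIO("id,price,bedrooms\n1,250000,3\n2,310000,4\n")
homes = acquire_csv(sample_csv)

# Stand-in for a real RDBMS: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 5.0), ("west", 7.0)],
)

# Aggregation happens in SQL before the data reaches pandas.
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY region
"""
sales = acquire_sql(query, conn)
conn.close()

print(homes.shape)
print(sales)
```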

# Preparation

**The goal** is to have data, split into 3 samples (train, validate, and test), in a format that can easily be explored, analyzed and visualized. 

**The deliverable** is a file, prep.py, that contains the function(s) needed to reproduce the preparation of the data.

**How to get there:**

- Python libraries: pandas, matplotlib, seaborn, scikit-learn.
- Use pandas to perform tasks such as handling null values and outliers, normalizing text, binning data, and changing data types.
- Use matplotlib or seaborn to plot distributions of numeric attributes and target.
- Use scikit-learn to split the data into train, validate, and test samples.
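
One possible shape for prep.py is sketched below. The column names (age, city, target), the bin edges, and the 60/20/20 split proportions are assumptions chosen for illustration. Because scikit-learn's train_test_split produces two samples per call, it is called twice here to get three.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def prep(df):
    """Clean a raw dataframe: normalize text, fill nulls, bin a numeric column."""
    df = df.copy()
    df["city"] = df["city"].str.strip().str.lower()
    df["age"] = df["age"].fillna(df["age"].median())
    df["age_bin"] = pd.cut(
        df["age"], bins=[0, 30, 50, 120], labels=["young", "mid", "senior"]
    )
    return df


def split(df, seed=42):
    """Split into train (60%), validate (20%), and test (20%) samples."""
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    train, validate = train_test_split(
        train_validate, test_size=0.25, random_state=seed
    )
    return train, validate, test


# Synthetic raw data for illustration only.
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "age": rng.integers(18, 80, size=100).astype(float),
    "city": rng.choice([" Austin", "austin ", "DALLAS"], size=100),
    "target": rng.integers(0, 2, size=100),
})
raw.loc[::10, "age"] = np.nan  # inject some missing values

clean = prep(raw)
train, validate, test = split(clean)
print(len(train), len(validate), len(test))
```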

# Exploration & Pre-Processing

**The goal** is to discover the features that have the largest impact on the target variable, i.e. those that provide the most information gain and drive the outcome.

**The deliverable** is a file, preprocess.py, that contains the function(s) needed to reproduce the pre-processing of the data. 

The dataframe resulting from these functions should be one that is pre-processed, i.e. ready to be used in modeling. This means that attributes are reduced to features, features are in a numeric form, there are no missing values, and continuous and/or ordered values are scaled to be unitless.

**How to get there:**

- Use Python libraries: pandas, statsmodels, scipy, numpy, matplotlib, seaborn, scikit-learn.
- Perform statistical testing to understand correlations, significant differences in variables, variable interdependencies, etc.
- Create visualizations that demonstrate relationships across and within attributes and target.
- Use domain knowledge and/or information gained through exploration to construct new features.
- Remove features that are noisy, provide no valuable or new information, or are redundant.
- Use scikit-learn's preprocessing algorithms (e.g. feature selection, feature engineering, dummy variables, binning, clustering) to turn attributes into features.
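
A minimal sketch of a preprocess.py function follows, assuming one numeric column (sqft) and one categorical column (type); both names and the tiny dataframes are hypothetical. It scales numeric values with scikit-learn's MinMaxScaler and encodes categories with pandas.get_dummies. The scaler is fit on the train sample only, so that no information from the other samples leaks into the transformation.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def preprocess(train, test, numeric_cols, categorical_cols):
    """Scale numeric columns and one-hot encode categorical columns."""
    scaler = MinMaxScaler().fit(train[numeric_cols])  # fit on train only

    processed = []
    for df in (train, test):
        df = df.copy()
        df[numeric_cols] = scaler.transform(df[numeric_cols])
        df = pd.get_dummies(df, columns=categorical_cols)
        processed.append(df)

    train_p, test_p = processed
    # Give test the same columns as train, in the same order.
    test_p = test_p.reindex(columns=train_p.columns, fill_value=0)
    return train_p, test_p


# Tiny illustrative samples (hypothetical values).
train = pd.DataFrame({
    "sqft": [800.0, 1200.0, 2000.0],
    "type": ["condo", "house", "house"],
})
test = pd.DataFrame({
    "sqft": [1000.0, 2400.0],
    "type": ["house", "condo"],
})

train_p, test_p = preprocess(train, test, ["sqft"], ["type"])
print(train_p)
print(test_p)
```

Note that scaled test values can fall outside the 0 to 1 range when the test sample contains values beyond the range seen in train, as the second test row does here.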

# Modeling

**The goal** is to create a robust and generalizable model that is a mapping between features and a target outcome.

**The deliverable** is a file, model.py, that contains functions for training the model (fit), predicting the target on new data, and evaluating results.

**How to get there:**

- Python libraries: scikit-learn
- Identify regression, classification, cross-validation, and/or other algorithms that are most appropriate.
- Build your model:
    - Create the model object.
    - Fit the model to your training (in-sample) observations.
    - Predict the target value on your training observations.
    - Evaluate results on the in-sample predictions.
    - Repeat as necessary with other algorithms or hyperparameters.
- Using the best performing model, predict on the test (out-of-sample) observations.
- Evaluate results on the out-of-sample predictions.
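
The steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not a prescribed workflow: the data is synthetic, the two decision-tree candidates and the accuracy metric are arbitrary choices, and the candidates are compared on a validate sample (as produced in the Preparation stage) so that the test sample is used only once, at the end.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data for illustration only.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_tv, X_test, y_tv, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tv, y_tv, test_size=0.25, random_state=0, stratify=y_tv
)

# Create the model objects (hypothetical candidates).
candidates = {
    "tree_depth_2": DecisionTreeClassifier(max_depth=2, random_state=0),
    "tree_depth_5": DecisionTreeClassifier(max_depth=5, random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)  # fit on in-sample observations
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    results[name] = (train_acc, val_acc)
    print(f"{name}: train={train_acc:.3f}, validate={val_acc:.3f}")

# Pick the candidate with the best validate accuracy, then evaluate once on test.
best_name = max(results, key=lambda n: results[n][1])
best_model = candidates[best_name]
test_acc = accuracy_score(y_test, best_model.predict(X_test))
print(f"best: {best_name}, test accuracy={test_acc:.3f}")
```

A large gap between the train and validate scores for a candidate is a sign of overfitting, which is why the in-sample score alone is not used to choose the model here.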

# Delivery

**The goal** is to enable others to use what you have learned or developed through all the previous stages.

**The deliverable** could be of various types:

- A pipeline.py file that takes new observations from acquisition to prediction using the previously built functions.
- A fully deployed model.
- A reproducible report and/or presentation with recommendations of actions to take based on original project goals.
- Predictions made on a specific set of observations.
- A dashboard for observing/monitoring the key drivers, or features, of the target variable.

**How to get there:**

- scikit-learn's Pipeline class (sklearn.pipeline) for chaining pre-processing and modeling steps.
- Tableau for creating a report, presentation, story, or dashboard.
- Jupyter Notebook for creating, e.g., a report or a framework to reproduce your research.
- Flask to build a web server that provides a gateway to your model's predictions.
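
As one illustration of the pipeline option, the sketch below chains a scaler and a model with scikit-learn's Pipeline class so that new observations go from raw features to predictions in a single call. The synthetic data, the step names ("scale", "model"), and the choice of StandardScaler with LinearRegression are assumptions; a real pipeline.py would reuse the functions built in the earlier stages.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data for illustration only.
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=0)

# Each step is a (name, estimator) pair; the last step is the model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X, y)

# New observations pass through scaling and prediction in one call.
new_observations = X[:5]
preds = pipe.predict(new_observations)
print(preds)
```

A fitted pipeline like this is also the kind of object a Flask endpoint could load and call to serve predictions.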