# Start Here

This notebook is about **getting you started with Reproducible Data Science** and giving you a **deeper look** at some of the concepts we will cover in this tutorial. For the latest version of this notebook, visit:

    https://github.com/hackalog/bus_number

## The Bare Minimum
You will need:
* `conda` (via anaconda or miniconda)
* `cookiecutter` 
* `make`
* `git`
* `python >= 3.6` (via `conda`)
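
As a quick sanity check of the list above, the following is a minimal sketch (not part of the original tutorial) that looks for each command-line tool on your `PATH` and checks the Python version. The helper names `find_tools`, `missing_tools`, and `python_ok` are hypothetical, chosen only for this illustration.

```python
import shutil
import sys

# Command-line tools named in the list above.
REQUIRED_TOOLS = ["conda", "cookiecutter", "make", "git"]


def find_tools(names):
    """Map each tool name to its full path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the tool names that could not be found on PATH, in input order."""
    return [name for name, path in find_tools(names).items() if path is None]


def python_ok(version_info=sys.version_info, minimum=(3, 6)):
    """True if the (major, minor) part of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    for name, path in find_tools(REQUIRED_TOOLS).items():
        print(f"{name:<14}{path or 'NOT FOUND'}")
    print(f"{'python >= 3.6':<14}{'OK' if python_ok() else 'TOO OLD'}")
```

Run it from the same shell (and conda environment) you plan to use for the tutorial, since `PATH` can differ between shells.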

### Installing Anaconda
We use `conda` for handling package dependencies, maintaining virtual environments, and installing particular versions of Python. For proper integration with `pip`, make sure you are running conda >= 4.4.0. Some earlier versions of conda have difficulty with editable packages (which is how we install our `src` package).

* See the [Anaconda installation guide](https://conda.io/docs/user-guide/install/index.html) for details
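
To check the conda >= 4.4.0 requirement, here is a minimal sketch, assuming `conda --version` prints something like `conda 4.5.11`. The function names are hypothetical and exist only for this example. It parses the version into a tuple of integers before comparing.

```python
import re
import shutil
import subprocess

MINIMUM_CONDA = (4, 4, 0)  # the minimum version stated above


def parse_version(text):
    """Extract the first X.Y.Z version in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def meets_minimum(text, minimum=MINIMUM_CONDA):
    """True if text contains a version that is at least `minimum`."""
    version = parse_version(text)
    return version is not None and version >= minimum


def installed_conda_version():
    """Return the output of `conda --version`, or None if conda is not on PATH."""
    conda_path = shutil.which("conda")
    if conda_path is None:
        return None
    result = subprocess.run(
        [conda_path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # capture the version even if it goes to stderr
        universal_newlines=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    output = installed_conda_version()
    if output is None:
        print("conda was not found on PATH")
    else:
        verdict = "OK" if meets_minimum(output) else "too old or unrecognized"
        print(f"{output} -> {verdict}")
```

Comparing integer tuples matters here: as plain strings, `"4.10.3"` sorts before `"4.4.0"`, which would wrongly flag a newer conda as too old.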

### Installing Cookiecutter
`cookiecutter` is a Python tool for creating projects from project templates. We use cookiecutter to create a reproducible data science template for starting our data science projects.

To install it:

    conda install -c conda-forge cookiecutter

### make
We use GNU `make` (and `Makefile`s) as a convenient interface to the various stages of the reproducible data science data flow. If for some reason your system doesn't have `make` installed, try:

    conda install -c anaconda make

### git
We use `git` (in conjunction with a workflow tool like GitHub, Bitbucket, or GitLab) to manage version control.

Atlassian has good [instructions for installing git](https://www.atlassian.com/git/tutorials/install-git) if it is not already available on your platform.



## The Reproducible Data Science Process
### How do you spend your "Data Science" time?
A typical data science process involves three main kinds of tasks:
* Munge: Fetch, process data, do EDA
* Science: Train models, Predict, Transform data
* Deliver: Analyze, summarize, publish

where our time tends to be allocated something like this:

<img src="charts/munge-supervised.png" alt="Typical Data science Process" width=500/>

Unfortunately, even though most of the work tends to be in the **munge** part of the process, when we do try to make data science reproducible, we tend to focus mainly on reproducibility of the **science** step.

That seems like a bad idea, especially if we're doing unsupervised learning, where often our time is spent like this:

<img src="charts/munge-unsupervised.png" alt="Typical Data science Process" width=500/>

We're going to try to improve this to a process that is **reproducible from start to finish**. 

There are 4 steps to a fully reproducible data science flow:
* Creating a **Reproducible Environment**
* Creating **Reproducible Data**
* Building **Reproducible Models**
* Achieving **Reproducible Results**

In the next few notebooks, we will look at each of these steps in turn.