# Start Here

This notebook is all about **getting you started doing Reproducible Data Science**, and giving you a **deeper look** at some of the concepts we will cover in this tutorial.

## The Bare Minimum
You will need:
* `cookiecutter` 
* `conda` (and then `python >= 3.6`)
* `make`


# Installation: While we are talking...

(These notes are at: https://github.com/hackalog/bus_number/wiki/Getting-Started)

1. [Install cookiecutter](https://cookiecutter.readthedocs.io/en/latest/installation.html), then use it to install the `pydata_nyc` branch of `cookiecutter-easydata`:

```
cookiecutter https://github.com/hackalog/cookiecutter-easydata.git --checkout pydata_nyc
```

2. Configure a new project. Call it **bus_number**:
<pre>
project_name [project_name]: <b>bus_number</b>
repo_name [bus_number]: <b>↵</b>
module_name [src]: <b>↵</b>
author_name [Your name (or your organization/company/team)]: <b>Kjell Wooding</b>
description [A short description of this project.]: <b>Reproducible Data Science</b>
Select open_source_license:
1 - MIT
2 - BSD-2-Clause
3 - Proprietary
Choose from 1, 2, 3 [1]: <b>↵</b>
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]: <b>↵</b>
aws_profile [default]: <b>↵</b>
Select virtualenv:
1 - conda
2 - virtualenv
Choose from 1, 2 [1]: <b>↵</b>
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 [1]: <b>↵</b>
</pre>

3. Create your Development Environment
```
cd bus_number
make create_environment
conda activate bus_number         # or `source activate bus_number`
make requirements
git init
```
That's it! You're ready to go.

## The Reproducible Data Science Process
### How do you spend your "Data Science" time?
A typical data science process looks something like this:
* Munge: Fetch, process data, do EDA
* Science: Train models, Predict, Transform data
* Deliver: Analyze, summarize, publish

Usually, the reproducible part is in the **science** step only.
<img src="../references/charts/munge-supervised.png" alt="Typical Data science Process" width="400" />

We're going to try to improve this to a process that is reproducible from start to finish. The flow looks like this:

<img src="../references/workflow/workflow-cheat-sheet.png" alt="Reproducible Data Science Workflow" width="400"/>

### Makefiles and the Data Flow
We use a `Makefile` to organize and invoke the various steps in our Data Science pipeline.
You have already used this file when you created your virtual environment in the first place:
```
make create_environment
```



### ASIDE: What's my make target doing?
If you are ever curious what commands a `make` command will invoke (including any invoked dependencies), use `make -n`, which lists the commands without executing them:

In [4]:
%%bash
cd .. && make -n requirements

python3 test_environment.py
conda env update --name bus_number -f environment.yml


We use a cute self-documenting trick in our makefiles (borrowed from `cookiecutter-datascience`) to make it easy to document the various targets that you add. This documentation is produced when you type a plain `make`:

In [5]:
%%bash
cd .. && make

Available rules:

clean               Delete all compiled Python files 
clean_datasets      Delete all processed datasets 
clean_models        Delete all trained models 
clean_predictions   Delete all predictions 
create_environment  Set up python interpreter environment 
data                convert raw datasets into fully processed datasets 
lint                Lint using flake8 
predict             predict / transform / run experiments 
raw                 Fetch, Unpack, and Process raw dataset files 
requirements        Install or update Python Dependencies 
summary             Convert predictions / transforms / experiments into output 
                    data 
sync_data_from_s3   Download Data from S3 
sync_data_to_s3     Upload Data to S3 
test                Run all Unit Tests 
test_environment    Test python environment is set-up correctly 
train  

### ASIDE: The Format of a Makefile

```
## Comment to appear in the auto-generated documentation
thing_to_build: space separated list of dependencies
	command_to_run            # there is a tab before this command.
	another_command_to_run    # every line gets run in a *new shell*
```
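To make the self-documenting idea concrete, here is a minimal Python sketch of it: pair each `## comment` line with the target defined on the line that follows. This is only an illustration of the idea, not the code the template uses (the real trick lives inside the `Makefile` itself). The function name `documented_targets` and the recipe commands in the sample text are hypothetical.

```python
import re

# A tiny sample Makefile. The "## ..." descriptions match the rule list shown
# earlier; the recipe commands themselves are placeholders.
MAKEFILE_TEXT = """\
## Delete all compiled Python files
clean:
\t@echo cleaning

## Lint using flake8
lint:
\t@echo linting
"""

# A target line starts at column 0 with a name followed by a colon.
TARGET = re.compile(r"^([A-Za-z0-9_.-]+)\s*:")


def documented_targets(text):
    """Map each target to the `## comment` on the line directly above it."""
    rules = {}
    pending = None  # most recent "## ..." comment, if it was the previous line
    for line in text.splitlines():
        if line.startswith("## "):
            pending = line[3:].strip()
            continue
        match = TARGET.match(line)
        if match and pending is not None:
            rules[match.group(1)] = pending
        pending = None  # a comment only documents the very next line
    return rules


for target, doc in sorted(documented_targets(MAKEFILE_TEXT).items()):
    print(f"{target:<20}{doc}")
```

Targets without a preceding `##` comment are simply left out of the listing, which is a handy way to keep helper targets out of the documentation.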



In [21]:
%%file Makefile.test

data: raw
	@echo "Build Datasets"
train_test_split:
	@echo "do train/test split"
train: data transform_data train_test_split
	@echo "Train Models"
transform_data:
	@echo "do a data transformation"
raw:
	@echo "Fetch raw data"


Overwriting Makefile.test


If you see: ```*** missing separator.  Stop.```, it's because you have used spaces instead of **tabs** before your commands. 
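If your editor silently converts tabs to spaces, a rough check like the following can point at the offending lines. It is a heuristic sketch, not a Makefile parser: the function name `space_indented_recipes` is hypothetical, and it assumes that any space-indented line directly inside a rule was meant to be a recipe line.

```python
import re

# A rule line starts at column 0, is not a comment, and has a colon
# before any "=" (to skip plain variable assignments).
RULE = re.compile(r"^[^\s#][^=]*:")


def space_indented_recipes(text):
    """Return 1-based line numbers of rule lines indented with spaces."""
    suspects = []
    in_rule = False
    for number, line in enumerate(text.splitlines(), start=1):
        if RULE.match(line):
            in_rule = True
        elif line.startswith("\t"):
            pass  # correctly indented recipe line
        elif in_rule and line.startswith(" ") and line.strip():
            suspects.append(number)
        elif line.strip():
            in_rule = False  # any other non-blank line ends the rule
    return suspects


# The first recipe uses four spaces, the second a real tab.
BROKEN = 'raw:\n    @echo "Fetch raw data"\ndata: raw\n\t@echo "Build Datasets"\n'
print(space_indented_recipes(BROKEN))  # → [2]
```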

In [24]:
%%bash
make -f Makefile.test train

Fetch raw data
Build Datasets
do a data transformation
do train/test split
Train Models


In [27]:
%%file Makefile.test

cycle: cycle_b
	@echo "in a Makefile"
cycle_b: cycle_c
	@echo "have a cycle"
cycle_c: cycle
	@echo "You can't"

Overwriting Makefile.test


In [28]:
%%bash
make -f Makefile.test cycle

You can't
have a cycle
in a Makefile


make: Circular cycle_c <- cycle dependency dropped.


Using a Makefile like this is an easy way to set up a process flow expressed as a Directed Acyclic Graph (DAG).
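To see why the output above comes out in that order, here is a small Python sketch of a depth-first walk over the dependency graph from the first `Makefile.test`. It is an illustration of the DAG idea only, not how `make` is implemented; `DEPENDENCIES` and `build_order` are hypothetical names. One deliberate difference: `make` drops a circular dependency with a message and carries on, whereas this sketch raises an error.

```python
# Dependency graph copied from the first Makefile.test above:
# each target maps to the list of targets it depends on.
DEPENDENCIES = {
    "data": ["raw"],
    "train_test_split": [],
    "train": ["data", "transform_data", "train_test_split"],
    "transform_data": [],
    "raw": [],
}


def build_order(target, graph, done=None, in_progress=None):
    """Return targets in the order a depth-first walk would build them."""
    done = [] if done is None else done
    in_progress = set() if in_progress is None else in_progress
    if target in done:
        return done  # already built, nothing to do
    if target in in_progress:
        raise ValueError(f"circular dependency involving {target!r}")
    in_progress.add(target)
    for dependency in graph.get(target, []):
        build_order(dependency, graph, done, in_progress)
    in_progress.discard(target)
    done.append(target)  # a target is built only after its dependencies
    return done


order = build_order("train", DEPENDENCIES)
print(order)
# → ['raw', 'data', 'transform_data', 'train_test_split', 'train']
```

Feeding it the cyclic graph from the second `Makefile.test` (`cycle` → `cycle_b` → `cycle_c` → `cycle`) raises `ValueError`, which is the "acyclic" part of DAG in action.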

Note: We have only scratched the surface here. There are lots of interesting tricks you can do with make.
* http://zmjones.com/make/
* http://blog.byronjsmith.com/makefile-shortcuts.html
* https://www.gnu.org/software/make/manual/


### ASIDE: Our Favourite Python Parts
Why the `python>=3.6` requirement?
* f-strings: Finally, long, readable strings in our code.
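* dictionaries: insertion order is preserved! (This is an implementation detail of CPython 3.6, and guaranteed by the language from Python 3.7 on.)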
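A quick sketch of both features, using made-up values (the `project` dictionary and the step names are only for illustration):

```python
# f-strings: expressions are evaluated inline, inside the braces.
project = {"name": "bus_number", "python": "3.6"}
message = f"Project {project['name']} needs python >= {project['python']}"
print(message)  # → Project bus_number needs python >= 3.6

# Dictionaries remember the order in which keys were inserted.
steps = {}
for step in ["raw", "data", "train"]:
    steps[step] = len(step)
print(list(steps))  # → ['raw', 'data', 'train']
```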

Other great tools:
* `pathlib`: Sane, multiplatform path handling: https://realpython.com/python-pathlib/
* `doctest`: Examples that always work: https://docs.python.org/3/library/doctest.html
* `joblib`: Especially the persistence part: https://joblib.readthedocs.io/en/latest/persistence.html
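As a small taste of the first two, here is a sketch that combines `pathlib` with a `doctest`. The function `processed_path` and the `data/raw` and `data/processed` layout are assumptions made up for this example; they are not part of the project's `src` module. (`joblib` is a third-party package, so it is left out here.)

```python
import doctest
from pathlib import Path


def processed_path(raw_file):
    """Map a raw data file to a (hypothetical) processed location.

    >>> processed_path("data/raw/trips.csv").as_posix()
    'data/processed/trips.pkl'
    """
    raw_file = Path(raw_file)
    # Swap the directory for its "processed" sibling and the suffix for .pkl.
    new_name = raw_file.with_suffix(".pkl").name
    return raw_file.parent.parent / "processed" / new_name


# Runs the example in the docstring; prints nothing unless it fails.
doctest.run_docstring_examples(
    processed_path, {"processed_path": processed_path}, name="processed_path"
)
```

Because the docstring example is executed, it cannot quietly drift out of date the way an ordinary comment can.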


## Revision Control and Git

### ASIDE: Git vs GitHub
Git is a tool, and by itself, only marginally useful. To use it effectively (especially within a team), you need a **process**. This is what GitHub, GitLab, and Bitbucket help you with.

For a quick start, see our [Github Workflow Cheat Sheet](https://github.com/hackalog/bus_number/wiki/Github-Workflow-Cheat-Sheet).

### ASIDE: Diffing Jupyter Notebooks

Yes, you can `diff` your Jupyter notebooks. The tool is called `nbdime`:

https://nbdime.readthedocs.io/en/stable/index.html

To enable it:
* add `- nbdime` to the pip section of `environment.yml`
* `make requirements`

You will likely want to enable integrations with jupyter notebook and git:

```
nbdime extensions --enable   # Enable Jupyter Notebook extension
nbdime config-git --enable   # Enable automatic nbdiff'ing in `git diff`
```

### ASIDE: The magic pip+conda glue
You might notice that most of our code is imported from a module called `src`. For example:

In [12]:
from src.paths import data_path


The magic piece that makes this module work is in `environment.yml`, our conda environment specification:

In [7]:
!head ../environment.yml

name: bus_number
dependencies:
  - pip
  - pip:
    - -e .
    - python-dotenv>=0.5.1
    - umap-learn
  - setuptools
  - wheel
  - sphinx


Notice the `- -e .` line. This installs an editable version of the module described in `setup.py`:

In [9]:
!cat ../setup.py

from setuptools import find_packages, setup

setup(
    name='src',
    packages=find_packages(),
    version='0.0.1',
    description='Up Your Bus Number: A Primer for Reproducible Data Science',
    author='Tutte Institute for Mathematics and Computing',
    license='MIT',
)
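One way to confirm that the editable install worked is to ask Python where it would load `src` from. This is a minimal sketch; the helper name `module_origin` is hypothetical. In an activated `bus_number` environment you would expect the result to point into your project directory, and `None` means the module cannot be found at all.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# With the editable install in place, this should print a path inside
# your project checkout rather than inside site-packages.
print(module_origin("src"))
```

Because the install is editable, changes you make under `src/` are picked up on the next import (or kernel restart) without reinstalling anything.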
