# Tutorial 1: Reproducible Environments
## Overview

* Requirements: The Bare Minimum 

* Using a Data Science Template: `cookiecutter`

* Virtual Environments: `conda` and environment files
* Revision Control: git and a git workflow
   * Installing, Enabling, and using nbdime
* The Data Science DAG
   * make, Makefiles and data flow
* Python Modules
   * Creating an editable module
* Testing: doctest, pytest, hypothesis

### Test your installation

In [None]:
!conda --version   # or `$CONDA_EXE --version` in some environments

In [None]:
!make --version

In [None]:
!git --version

In [None]:
!cookiecutter --version
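
The same check can be scripted from Python. This is a minimal sketch using only the standard library; the helper name `find_missing_tools` is an illustrative assumption and not part of the tutorial's code.

```python
import shutil

# Command-line tools checked in the cells above.
REQUIRED_TOOLS = ["conda", "make", "git", "cookiecutter"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools were found on the PATH.")
```

A tool reported as missing here may still be usable from your shell, for example when `conda` is provided as a shell function rather than an executable on the PATH.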

## Exercise

Explore your README.md

(Hint: You can use the `%load` magic, or `!cat` to look at it in your notebook)

## Solution

bus_number_tutorial
==============================

Reproducible data science tutorial

GETTING STARTED
---------------

For complete instructions, visit: https://github.com/hackalog/bus_number/wiki/Getting-Started

* Create and switch to the virtual environment:
```
cd bus_number_tutorial
make create_environment
conda activate bus_number_tutorial
make requirements
```
* Explore the notebooks in the `notebooks` directory

Project Organization
------------
* `LICENSE`
* `Makefile`
    * top-level makefile. Type `make` for a list of valid commands
* `README.md`
    * this file
* `data`
    * Data directory, often symlinked to a filesystem with lots of space
    * `data/raw`
        * Raw (immutable) hash-verified downloads
    * `data/interim`
        * Extracted and interim data representations
    * `data/processed`
        * The final, canonical data sets for modeling.
* `docs`
    * A default Sphinx project; see sphinx-doc.org for details
* `models`
    * Trained and serialized models, model predictions, or model summaries
    * `models/trained`
        * Trained models
    * `models/output`
        * predictions and transformations from the trained models
* `notebooks`
    *  Jupyter notebooks. Naming convention is a number (for ordering),
    the creator's initials, and a short `-` delimited description,
    e.g. `1.0-jqp-initial-data-exploration`.
* `references`
    * Data dictionaries, manuals, and all other explanatory materials.
* `reports`
    * Generated analysis as HTML, PDF, LaTeX, etc.
    * `reports/figures`
        * Generated graphics and figures to be used in reporting
    * `reports/tables`
        * Generated data tables to be used in reporting
    * `reports/summary`
        * Generated summary information to be used in reporting
* `requirements.txt`
    * (if using pip+virtualenv) The requirements file for reproducing the
    analysis environment, e.g. generated with `pip freeze > requirements.txt`
* `environment.yml`
    * (if using conda) The YAML file for reproducing the analysis environment
* `setup.py`
    * Turns contents of `src` into a
    pip-installable python module  (`pip install -e .`) so it can be
    imported in python code
* `src`
    * Source code for use in this project.
    * `src/__init__.py`
        * Makes src a Python module
    * `src/data`
        * Scripts to fetch or generate data. In particular:
        * `src/data/make_dataset.py`
            * Run with `python -m src.data.make_dataset fetch`
            or  `python -m src.data.make_dataset process`
    * `src/analysis`
        * Scripts to turn datasets into output products
    * `src/models`
        * Scripts to train models and then use trained models to make predictions.
        e.g. `predict_model.py`, `train_model.py`
* `tox.ini`
    * tox file with settings for running tox; see tox.testrun.org


--------

<p><small>This project was built using <a target="_blank" href="https://github.com/hackalog/cookiecutter-easydata">cookiecutter-easydata</a>, an experimental fork of <a target="_blank" href="https://github.com/drivendata/cookiecutter-data-science">cookiecutter-data-science</a> aimed at making your data science workflow reproducible.</small></p>


## Test your environment

Your `active environment` should be `bus_number_tutorial`


In [None]:
!conda info

In [None]:
# if importing src doesn't work, try `make requirements`
import src
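
If you prefer a programmatic check of the active environment, the sketch below reads the `CONDA_DEFAULT_ENV` variable that conda sets on activation. The helper name `active_conda_env` is an illustrative assumption, and the check only reflects the process that launched the notebook kernel.

```python
import os

EXPECTED_ENV = "bus_number_tutorial"


def active_conda_env():
    """Return the name of the active conda environment, or None if unset."""
    # conda sets CONDA_DEFAULT_ENV when an environment is activated.
    return os.environ.get("CONDA_DEFAULT_ENV")


env_name = active_conda_env()
if env_name == EXPECTED_ENV:
    print(f"OK: running in '{env_name}'")
else:
    print(f"Expected '{EXPECTED_ENV}', but the active environment is {env_name!r}")
```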

## Revision Control: `git`

How do we keep track of our changes? We use **git**.

Before we do anything interesting, let's initialize a git repository (repo) here.


```
git init
git add .
git commit -m "Initial Import"
```

In [None]:
!git status

We will get back to using git again soon.

### Exercise 4: Add a dependency
Modify the environment file so that `make requirements` installs some additional packages
* install `umap-learn` using conda
* install `nbdime` using pip

### Solution 4
<pre>
name: bus_number
channels:
  - conda-forge
dependencies:
  - pip
  - pip:
    - -e .
    - python-dotenv>=0.5.1
    <b>- nbdime</b>
  - setuptools
  - wheel
  - sphinx
  - click
  - coverage
  - pytest-cov
  - jupyter
  - scikit-learn
  - joblib
  - nb_conda
  - nbdime
  - nbval
  - pandas
  - requests
  - python>=3.6
  <b>- umap-learn</b>
</pre>

### Exercise 5: Basic git interactions

See what has changed with git:

In [None]:
#!git status

In [None]:
#!git diff -u

Add or reject the changes incrementally

In [None]:
#!git add -p
#!git reset -p


Commit the changes

In [None]:
#!git commit -v

## The Data Science DAG
DAG = Directed Acyclic Graph.

Because the graph has no cycles, no step can depend (directly or indirectly) on itself, so the process eventually stops. (This is a good thing!)

It also means we can use a super old, but incredibly handy tool to implement this workflow: `make`.

### Make, Makefiles, and the Data Flow


We use a `Makefile` to organize and invoke the various steps in our Data Science pipeline.
You have already used this file when you created your virtual environment in the first place:
```
make create_environment
```
Here are the steps we will be working through in this tutorial:
<img src="references/cheat_sheet.png" alt="Reproducible Data Science Workflow" width="400"/>

A [PDF version of the cheat sheet](references/cheat_sheet.pdf) is also available.



### What's my make target doing?
If you are ever curious what commands a `make` command will invoke (including any invoked dependencies), use `make -n`, which lists the commands without executing them:

In [None]:
%%bash
cd .. && make -n requirements

We use a cute **self-documenting Makefile trick** (borrowed from `cookiecutter-data-science`) to make it easy to document the various targets that you add. This documentation is produced when you type a plain `make`:

In [None]:
%%bash
cd .. && make

### Under the Hood: The Format of a Makefile

```
## Comment to appear in the auto-generated documentation
thing_to_build: space separated list of dependencies
	command_to_run            # there is a tab before this command.
	another_command_to_run    # every line gets run in a *new shell*
```
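
The "new shell for every line" behaviour is easy to trip over, which is why the cells above chain `cd .. && make` on a single line. The sketch below imitates it from Python with `subprocess`, assuming a POSIX shell is available; `run_in_new_shell` is a hypothetical helper used only for this illustration.

```python
import subprocess


def run_in_new_shell(command):
    """Run `command` in a fresh shell and return its stripped stdout."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Two separate shells: the `cd` in the first one is forgotten by the second.
run_in_new_shell("cd /")
separate = run_in_new_shell("pwd")

# One shell: chaining with `&&` keeps the directory change for `pwd`.
chained = run_in_new_shell("cd / && pwd")

print("separate shells:", separate)
print("chained command:", chained)
```

Unless the notebook itself is running from `/`, the first result is the notebook's working directory and the second is `/`.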



### Exercise 6: What does this makefile print?

In [None]:
%%file Makefile.test

data: raw
	@echo "Build Datasets"
train_test_split:
	@echo "do train/test split"
train: data transform_data train_test_split
	@echo "Train Models"
transform_data:
	@echo "do a data transformation"
raw:
	@echo "Fetch raw data"


Note: If you see ```*** missing separator.  Stop.``` it is usually because you have used spaces instead of **tabs** before your commands.

### Solution 6

In [None]:
%%bash
make -f Makefile.test train
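
To see why the messages appear in that order, here is a simplified model of how `make` walks the dependencies: it builds each prerequisite (left to right, depth first) before running the target's own recipe. This is a sketch only; it ignores file timestamps, which real `make` uses to skip up-to-date targets, and the `RULES` table and `build` function are illustrative names.

```python
# target -> (prerequisites, message echoed by the recipe), copied from Makefile.test
RULES = {
    "data": (["raw"], "Build Datasets"),
    "train_test_split": ([], "do train/test split"),
    "train": (["data", "transform_data", "train_test_split"], "Train Models"),
    "transform_data": ([], "do a data transformation"),
    "raw": ([], "Fetch raw data"),
}


def build(target, rules=RULES, done=None):
    """Return recipe messages in the order a depth-first build produces them."""
    if done is None:
        done = set()
    if target in done:
        return []
    done.add(target)  # mark early so each target is built at most once
    prerequisites, message = rules[target]
    output = []
    for prerequisite in prerequisites:
        output.extend(build(prerequisite, rules, done))
    output.append(message)  # the target's own recipe runs last
    return output


for line in build("train"):
    print(line)
```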

### Exercise 7: What happens when you add a cycle to a Makefile?
Set up a makefile with a cyclic dependency and run it.

### Solution 7

In [None]:
%%file Makefile.test

cycle: cycle_b
	@echo "in a Makefile"
cycle_b: cycle_c
	@echo "have a cycle"
cycle_c: cycle
	@echo "You can't"

In [None]:
%%bash
make -f Makefile.test cycle
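
Cycle detection is a standard graph operation, and the same dependency table can be checked in Python. The sketch below uses `graphlib` from the standard library (Python 3.9 or later); `find_cycle` is an illustrative helper name. Note that it behaves differently from GNU make, which typically reports the circular dependency, drops it, and carries on, whereas `graphlib` raises an error.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# target -> prerequisites, copied from the cyclic Makefile.test above
CYCLIC = {"cycle": ["cycle_b"], "cycle_b": ["cycle_c"], "cycle_c": ["cycle"]}


def find_cycle(graph):
    """Return the nodes of a dependency cycle in `graph`, or None for a DAG."""
    try:
        # static_order() is lazy, so consume it to trigger the cycle check.
        list(TopologicalSorter(graph).static_order())
    except CycleError as error:
        # args[1] is the cycle as a list; its first and last nodes are the same.
        return error.args[1]
    return None


print(find_cycle(CYCLIC))
print(find_cycle({"train": ["data"], "data": ["raw"]}))  # None
```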

Using a Makefile like this is an easy way to set up a process flow expressed as a Directed Acyclic Graph (DAG).

**Note**: We have only scratched the surface here. There are lots of interesting tricks you can do with make.
* http://zmjones.com/make/
* http://blog.byronjsmith.com/makefile-shortcuts.html
* https://www.gnu.org/software/make/manual/


## Back to Revision Control: git workflows

Git isn't really a collaboration tool. It's more a tool for implementing collaboration workflows.

What do we mean by workflow? A process built on top of git that incorporates **pull requests** and **branches**. Typically, this is provided by sites such as GitHub, GitLab, or Bitbucket.


# TODO: Link to an easy git tutorial
    

## Life Rules for using `git`

* Always work on a branch: `git checkout -b my_branch_name`. Delete branches once they are merged.
* **Never** push to master. Always **work on a branch** and do a pull request.
* Seriously, don't do work on master if you are collaborating with **anyone**.
* If you pushed it anywhere, or shared it with anyone, don't `git rebase`. In fact, if you're reading this, don't `git rebase`. Save that for when you are comfortable solving git merge nightmares on your own.


### Exercise: 

Create a GitHub/GitLab/BitBucket repo and sync your repo to it.


### Exercise
* Create a branch called `add_sklearn`
* Add a scikit-learn dependency
* Check in these changes using git to your local repo
* Push the new branch to GitHub
* Create a pull request to merge this branch into master
* Merge your PR (delete the branch afterwards)
* Sync your local repo with GitHub, including deleting the merged branches

## Python Modules
By default, we keep our source code in a module called `src`. (This can be overridden in the cookiecutter.)

This is enabled via one line in `environment.yml`:
```
- pip:
  - -e .
```

This creates an **editable module**: pip looks in the current directory for a file called `setup.py`, which indicates the module name and location.

In [None]:
# %load ../setup.py
from setuptools import find_packages, setup

setup(
    name='src',
    packages=find_packages(),
    version='0.0.1',
    description='Up Your Bus Number: A Primer for Reproducible Data Science',
    author='Tutte Institute for Mathematics and Computing',
    license='MIT',
)


This lets you easily use your code in notebooks and other scripts, and avoids any `sys.path.append` silliness.
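
One way to confirm that an editable install is in effect is to ask Python where it would import the module from; for an editable install this should point into your project directory rather than `site-packages`. A minimal sketch, where `module_location` is an illustrative helper name:

```python
import importlib.util


def module_location(name="src"):
    """Return the path a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # not importable; try `make requirements`
    # Packages expose their directory; plain modules expose a file path.
    if spec.submodule_search_locations:
        return list(spec.submodule_search_locations)[0]
    return spec.origin


location = module_location("src")
print("src is imported from:", location)
```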

### ASIDE: Semantic Versioning

Semantic versioning (or *semver*) refers to the convention of versioning with a triple:

    MAJOR.MINOR.PATCH

With the following convention: when releasing new versions, increment the:

*    MAJOR version when you make **incompatible API changes**,
*    MINOR version when you **add functionality** in a backwards-compatible manner, and
*    PATCH version when you make backwards-compatible **bug fixes**.

If you have no other plan, this is a great convention to follow.

For an obscene amount of detail on this concept, see https://semver.org/
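
The increment rules above can be sketched in a few lines of Python. This illustration handles only plain `MAJOR.MINOR.PATCH` strings (no pre-release or build tags), and `bump_version` is a hypothetical helper, not part of the project template.

```python
def bump_version(version, part):
    """Return `version` ('MAJOR.MINOR.PATCH') with `part` incremented."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"  # incompatible API change resets MINOR and PATCH
    if part == "minor":
        return f"{major}.{minor + 1}.0"  # new functionality resets PATCH
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"  # backwards-compatible bug fix
    raise ValueError(f"unknown part: {part!r}")


print(bump_version("0.0.1", "patch"))  # 0.0.2
print(bump_version("0.0.1", "minor"))  # 0.1.0
print(bump_version("0.0.1", "major"))  # 1.0.0
```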

### Exercise
* add your favorite utility function to `src/utils`
* increment the version number of the editable package
* run `make requirements` (required if you added dependencies for your utility function)
* import your utility function and run it from this notebook

### Solution:


## Testing: doctest, pytest, coverage

Optional: Hypothesis and property-based testing
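
As a reminder of what the two styles look like, here is a small sketch of a utility function carrying both a doctest and a pytest-style test. The function `normalize_whitespace` is a made-up example, not something that ships in `src`.

```python
import doctest


def normalize_whitespace(text):
    """Collapse runs of whitespace in `text` into single spaces.

    >>> normalize_whitespace("  reproducible   data  science ")
    'reproducible data science'
    >>> normalize_whitespace("")
    ''
    """
    return " ".join(text.split())


def test_normalize_whitespace():
    # pytest collects functions named test_*; a bare assert is enough.
    assert normalize_whitespace("a \n b") == "a b"


# Run the docstring examples of this one function and summarize the outcome.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for found in finder.find(
    normalize_whitespace,
    name="normalize_whitespace",
    globs={"normalize_whitespace": normalize_whitespace},
):
    runner.run(found)
results = runner.summarize(verbose=False)
print(f"doctest: {results.attempted} attempted, {results.failed} failed")
```

From the command line, `pytest --doctest-modules` collects docstring examples alongside regular tests, and with `pytest-cov` installed, adding `--cov=src` reports coverage.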

# TODO: Add some failing tests that need correcting

### Exercise:
Add a `make test` target to your makefile that:
* runs doctests
* runs pytest unit tests
* (extra credit) Displays test coverage results
    
These tests will fail. Fix them in the next exercise.

### Exercise:
Fix the failing tests.