# Tutorial 1: Reproducible Environments
## Overview

* Requirements: The Bare Minimum 

* Using a Data Science Template: `cookiecutter`

* Virtual Environments: `conda` and environment files

* The Data Science DAG
   * make, Makefiles and data flow

* Revision Control: git and a git workflow
   * Installing, Enabling, and using nbdime
     
* Python Modules
   * Creating an editable module
* Testing: doctest, pytest, hypothesis

## Requirements: The Bare Minimum
You will need:
* `conda` (via anaconda or miniconda)
* `cookiecutter` 
* `make`
* `git`
* `python >= 3.6` (via `conda`)

### Installing Anaconda
We use `conda` for handling package dependencies, maintaining virtual environments, and installing particular versions of Python. For proper integration with pip, make sure you are running conda >= 4.4.0. Some earlier versions of conda have difficulty with editable packages (which is how we install our `src` package).

* See the [Anaconda installation guide](https://conda.io/docs/user-guide/install/index.html) for details

### Installing Cookiecutter
`cookiecutter` is a python tool for creating projects from project templates. We use cookiecutter to create a reproducible data science template for starting our data science projects.

To install it:
```
  conda install -c conda-forge cookiecutter
```
### make
We use GNU `make` (and `Makefiles`) as a convenient interface to the various stages of the reproducible data science data flow. If for some reason your system doesn't have `make` installed, try:
```
  conda install -c anaconda make
```
### git
We use git (in conjunction with a workflow tool like GitHub, BitBucket, or GitLab) to manage version control. 

Atlassian has good [instructions for installing git](https://www.atlassian.com/git/tutorials/install-git) if it is not already available on your platform.

### Exercise 1: Install the requirements


### Solution 1

In [None]:
!conda --version

In [None]:
!make --version

In [None]:
!git --version

In [None]:
!cookiecutter --version
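
The same check can be scripted. The following is a minimal sketch, not part of the template: it uses the standard library's `shutil.which` to report whether each required tool is on your `PATH`. The `find_tools` helper and the `REQUIRED_TOOLS` list are names made up for this illustration.

```python
import shutil

# Tools listed in the requirements above.
REQUIRED_TOOLS = ["conda", "cookiecutter", "make", "git"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_tools(REQUIRED_TOOLS).items():
        print(f"{tool:<14}{path or 'NOT FOUND'}")
```

Any tool reported as `NOT FOUND` still needs to be installed (or added to your `PATH`) before continuing.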

## Using a Data Science Template: `cookiecutter`

We use cookiecutter to create a reproducible data science template for starting our data science projects.

Clone the `cookiecutter-easydata` repo, then run `cookiecutter` on the local copy:
```
git clone https://github.com/hackalog/cookiecutter-easydata.git
cookiecutter cookiecutter-easydata
```

Note: You can also run `cookiecutter` against the GitHub repo directly, even from a particular branch. For example, to use the `pydata_nyc` branch:

```
cookiecutter https://github.com/hackalog/cookiecutter-easydata.git --checkout pydata_nyc
```


### Exercise 2: Start your cookiecutter-based project
Create a project called `bus_number`:
* Use `conda` as your virtualenv manager
* Use python 3.6 or greater

When complete, you should have a fully populated project directory, complete with customized `README.md`

### Solution 2
<pre>
 $ <b>cookiecutter cookiecutter-easydata</b>

project_name [project_name]: <b>bus_number</b>
repo_name [bus_number]: <b>↵</b>
module_name [src]: <b>↵</b>
author_name [Your name (or your organization/company/team)]: <b>Kjell Wooding</b>
description [A short description of this project.]: <b>Reproducible Data Science</b>
Select open_source_license:
1 - MIT
2 - BSD-2-Clause
3 - Proprietary
Choose from 1, 2, 3 [1]: <b>↵</b>
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]: <b>↵</b>
aws_profile [default]: <b>↵</b>
Select virtualenv:
1 - conda
2 - virtualenv
Choose from 1, 2 [1]: <b>↵</b>
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 [1]: <b>↵</b>


 $ <b>cd bus_number</b>

</pre>


##  Virtual Environments: `conda` and environment files

Everyone's computing environment is different. How can we ensure that another user running a different platform can successfully run the code you are creating? How do we know they are using the same versions of your code and all its various supporting libraries? How do we reproduce your working environment on someone else's machine?

In short, by using **virtual environments**. In this case, we're going to use `conda` (as provided by either *anaconda* or *miniconda*) to create and manage these environments. Furthermore, we use an **environment file**, `environment.yml`, to specify all of the dependencies that need to be installed to run our code.
    
Two `make` commands  ensure that we have the appropriate environment:
* `make create_environment`: for the initial creation of a project specific conda environment
* `make requirements`: to update our environment to the latest version of the `environment.yml` specs.

We will get to `make` in the next section.

**Caveat**: Technically speaking, as implemented in this workflow, a `conda` environment is **not reproducible**. Even if you pin a specific version of a package in your `environment.yml`, the versions of its dependencies may be resolved differently from one install to the next. One way to fix this is to add a **lockfile** that records the exact version of every installed package, so that the environment is completely reproducible (e.g. `pipenv` does this). This is the **right way** to handle such things, and we are hoping conda catches up quickly. In the meantime, we've simulated this behavior using an `environment.yml` and an `environment.lock` file generated from it.
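
To make the caveat concrete, here is a rough sketch of the comparison a lockfile enables: given the pins that were locked and the pins actually installed, report every package whose version differs. It assumes conda-style `name=version[=build]` strings; the helper names and the package pins are hypothetical, and the real format of `environment.lock` may differ.

```python
def parse_specs(specs):
    """Turn conda-style 'name=version[=build]' strings into {name: version}.

    Entries without a pinned version map to None.
    """
    pinned = {}
    for spec in specs:
        name, _, rest = spec.strip().partition("=")
        version = rest.split("=")[0] if rest else None
        pinned[name] = version
    return pinned


def version_drift(locked, installed):
    """Return {name: (locked_version, installed_version)} for every mismatch."""
    locked, installed = parse_specs(locked), parse_specs(installed)
    return {
        name: (locked.get(name), installed.get(name))
        for name in sorted(set(locked) | set(installed))
        if locked.get(name) != installed.get(name)
    }


# Hypothetical package pins, for illustration only.
locked = ["numpy=1.15.4=py36_0", "pandas=0.23.4", "scipy=1.1.0"]
installed = ["numpy=1.15.4=py36_1", "pandas=0.24.0", "scipy=1.1.0", "six=1.12.0"]
print(version_drift(locked, installed))
# {'pandas': ('0.23.4', '0.24.0'), 'six': (None, '1.12.0')}
```

This sketch compares versions only and ignores build strings, which is why `numpy` is not reported here. A stricter check would compare those too.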


### Exercise 3: create your environment and install all dependencies

### Solution 3
```
make create_environment
conda activate bus_number   # or source activate bus_number
make requirements
```

### Exercise 4: Add a dependency
Modify the environment file so that `make requirements` installs some additional packages
* install `umap-learn` using conda
* install `nbdime` using pip

## The Data Science DAG
DAG = Directed Acyclic Graph. 

That means the process eventually stops. (This is a good thing!) 

It also means we can use a super old, but incredibly handy tool to implement this workflow: `make`.

### Make, Makefiles, and the Data Flow


We use a `Makefile` to organize and invoke the various steps in our Data Science pipeline.
You have already used this file when you created your virtual environment in the first place:
```
make create_environment
```
Here are the steps we will be working through in this tutorial:
<img src="references/cheat_sheet.png" alt="Reproducible Data Science Workflow" width="400"/>

A [PDF version of the cheat sheet](references/cheat_sheet.pdf) is also available.



### What's my make target doing?
If you are ever curious what commands a `make` command will invoke (including any invoked dependencies), use `make -n`, which lists the commands without executing them:

In [None]:
%%bash
cd .. && make -n requirements

We use a cute **self-documenting Makefile trick** (borrowed from `cookiecutter-data-science`) to make it easy to document the various targets that you add. This documentation is produced when you type a plain `make`:

In [None]:
%%bash
cd .. && make
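
To see the idea behind the self-documenting trick, here is a rough Python approximation: collect every `## ` comment that sits directly above a target and pair it with that target's name. This is only a sketch of the concept, not the code the template's Makefile uses. `documented_targets` and the example Makefile text are made up for this illustration.

```python
import re

# A target line looks like "name: deps". The lookahead skips ":=" assignments.
TARGET_RE = re.compile(r"^([A-Za-z0-9_.-]+)\s*:(?!=)")


def documented_targets(makefile_text):
    """Return (target, description) pairs for targets preceded by '## ' comments."""
    docs = []
    pending = []  # '## ' comment lines seen since the last non-comment line
    for line in makefile_text.splitlines():
        if line.startswith("## "):
            pending.append(line[3:].strip())
            continue
        match = TARGET_RE.match(line)
        if match and pending:
            docs.append((match.group(1), " ".join(pending)))
        pending = []
    return docs


EXAMPLE = """\
## Comment to appear in the auto-generated documentation
thing_to_build: space separated list of dependencies
\tcommand_to_run

undocumented_target:
\t@echo "no help text for this one"
"""

for target, description in documented_targets(EXAMPLE):
    print(f"{target:<20}{description}")
```

Only `thing_to_build` is listed, because `undocumented_target` has no `## ` comment above it.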

### Under the Hood: The Format of a Makefile

```
## Comment to appear in the auto-generated documentation
thing_to_build: space separated list of dependencies
	command_to_run            # there is a tab before this command.
	another_command_to_run    # every line gets run in a *new shell*
```
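
The "new shell" comment above is easy to trip over, so here is a small sketch that imitates it from Python. Each call to the hypothetical `run_in_new_shell` helper starts a fresh `sh` process, much as `make` does for each recipe line, so a `cd` on one line has no effect on the next. It assumes a POSIX system with `sh` available.

```python
import os
import subprocess
import tempfile


def run_in_new_shell(command, cwd):
    """Run one command line in a fresh `sh` process and return its stdout."""
    result = subprocess.run(
        ["sh", "-c", command],
        cwd=cwd,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as workdir:
    os.mkdir(os.path.join(workdir, "data"))
    # Two separate "recipe lines": the `cd` is forgotten before `pwd` runs.
    run_in_new_shell("cd data", cwd=workdir)
    separate = run_in_new_shell("pwd", cwd=workdir)
    # One "recipe line" joined with `&&`: both commands share a shell.
    joined = run_in_new_shell("cd data && pwd", cwd=workdir)

print(os.path.basename(separate) == "data")  # False
print(os.path.basename(joined) == "data")    # True
```

In a Makefile, the practical consequence is the same: chain commands that depend on each other with `&&` on a single recipe line.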



### Exercise 5: What does this makefile print?

In [None]:
%%file Makefile.test

data: raw
	@echo "Build Datasets"
train_test_split:
	@echo "do train/test split"
train: data transform_data train_test_split
	@echo "Train Models"
transform_data:
	@echo "do a data transformation"
raw:
	@echo "Fetch raw data"


Note: If you see: ```*** missing separator.  Stop.``` it's because you have used spaces instead of **tabs** before your commands. 

### Solution 5

In [None]:
%%bash
make -f Makefile.test train

### Exercise 6: What happens when you add a cycle to a Makefile?
Set up a Makefile with a cyclic dependency and run it.

### Solution 6

In [None]:
%%file Makefile.test

cycle: cycle_b
	@echo "in a Makefile"
cycle_b: cycle_c
	@echo "have a cycle"
cycle_c: cycle
	@echo "You can't"

In [None]:
%%bash
make -f Makefile.test cycle

Using a Makefile like this is an easy way to set up a process flow expressed as a Directed Acyclic Graph (DAG).
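
To make the DAG idea concrete, the following sketch reproduces the order in which a serial build visits the targets of the two `Makefile.test` examples above. It is a simplified model (a depth-first walk over prerequisites, left to right), not how `make` is actually implemented, and `build_order` is a name invented for this illustration.

```python
def build_order(rules, goal):
    """Return targets in the order a serial build would run them.

    `rules` maps each target to its list of prerequisites. Prerequisites are
    visited left to right and each target is built at most once.
    """
    order, done, in_progress = [], set(), set()

    def visit(target):
        if target in done:
            return
        if target in in_progress:
            raise ValueError(f"circular dependency involving {target!r}")
        in_progress.add(target)
        for prerequisite in rules.get(target, []):
            visit(prerequisite)
        in_progress.discard(target)
        done.add(target)
        order.append(target)

    visit(goal)
    return order


# The rules from Makefile.test in Exercise 5.
rules = {
    "data": ["raw"],
    "train_test_split": [],
    "train": ["data", "transform_data", "train_test_split"],
    "transform_data": [],
    "raw": [],
}
print(build_order(rules, "train"))
# ['raw', 'data', 'transform_data', 'train_test_split', 'train']

# The cyclic rules from Exercise 6.
cyclic = {"cycle": ["cycle_b"], "cycle_b": ["cycle_c"], "cycle_c": ["cycle"]}
try:
    build_order(cyclic, "cycle")
except ValueError as error:
    print(error)  # circular dependency involving 'cycle'
```

This sketch simply refuses to order a graph that contains a cycle. `make` deals with the cycle in its own way, as the output of Solution 6 shows.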

**Note**: We have only scratched the surface here. There are lots of interesting tricks you can do with make.
* http://zmjones.com/make/
* http://blog.byronjsmith.com/makefile-shortcuts.html
* https://www.gnu.org/software/make/manual/


## Revision Control: `git` and a git workflow

What do we mean by workflow? A process built on top of git that incorporates **pull requests** and **branches**. Typically, this is provided by sites like: GitHub, GitLab, BitBucket.


```
cd bus_number
git init
git add .
git commit -m "Initial Import"
```

# TODO: Fill in an easy git tutorial
    

## Life Rules for using `git`

* Always work on a branch: `git checkout -b my_branch_name`. Delete branches once they are merged.
* **Never** push to master. Always **work on a branch** and do a pull request.
* Seriously, don't do work on master.
* If you pushed it anywhere, don't `git rebase`. In fact, if you're reading this, don't `git rebase`.


### Exercise: 

Create a GitHub/GitLab/BitBucket repo and sync your repo to it.


### Exercise
* Create a branch called `add_sklearn`
* Add a scikit-learn dependency
* Check in these changes using git to your local repo
* Push the new branch to GitHub
* Create a pull request to merge this branch into master
* Merge your PR (delete the branch afterwards)
* Sync your local repo with GitHub, including deleting the merged branches

## Python Modules
By default, we keep our source code in a module called `src`. (This can be overridden in the cookiecutter.)

This is enabled via one line in `environment.yml`:
```
- pip:
  - -e .
```

This creates an **editable module**. `pip` looks in the current directory for a file called `setup.py`, which indicates the module name and location:

In [None]:
# %load ../setup.py
from setuptools import find_packages, setup

setup(
    name='src',
    packages=find_packages(),
    version='0.0.1',
    description='Up Your Bus Number: A Primer for Reproducible Data Science',
    author='Tutte Institute for Mathematics and Computing',
    license='MIT',
)


This lets you easily use your code in notebooks and other scripts, and avoids any `sys.path.append` silliness.
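
One way to confirm that the editable install worked is to ask Python where it would import a module from. The sketch below uses the standard library's `importlib.util.find_spec`. `module_location` is a helper name made up for this example, and whether `src` resolves depends on your environment.

```python
from importlib.util import find_spec


def module_location(name):
    """Return the path Python would import `name` from, or None if it is not importable."""
    spec = find_spec(name)
    if spec is None:
        return None
    if spec.origin is not None:
        return spec.origin
    # Namespace packages have no single origin; fall back to their search path.
    locations = spec.submodule_search_locations
    return next(iter(locations), None) if locations else None


if __name__ == "__main__":
    # `json` ships with Python, so it should always resolve.
    print("json:", module_location("json"))
    # After `make requirements`, this should point into your project directory.
    print("src: ", module_location("src"))
```

If `src` reports `None`, the editable install has not happened in the active environment.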

### Exercise
* add your favorite utility function to `src/utils`
* run `make requirements` (required if you added dependencies for your utility function)
* import your utility function and run it from this notebook

### Solution:


# TODO: fill out this section if need be

## Testing: doctest, pytest, Continuous Integration (CI)

Optional: Hypothesis and property-based testing
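
As a minimal illustration of the doctest style, here is a small function whose docstring examples double as tests. The `clamp` function is a hypothetical utility chosen for this sketch. It is not part of the template.

```python
import doctest


def clamp(value, low, high):
    """Limit `value` to the closed interval [low, high].

    >>> clamp(5, 0, 10)
    5
    >>> clamp(-3, 0, 10)
    0
    >>> clamp(42, 0, 10)
    10
    """
    return max(low, min(value, high))


if __name__ == "__main__":
    # Prints nothing when every example passes; reports each mismatch otherwise.
    doctest.testmod()
```

The same examples can be run from the command line with `python -m doctest -v your_file.py`, or collected by pytest with `pytest --doctest-modules`.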

# TODO: Add some failing tests that need correcting

### Exercise:
Add a `make test` target to your makefile that:
* runs doctests
* runs pytest unit tests
* (extra credit) Displays test coverage results
    
These will fail. Don't fix them yet.

### Exercise:
* Hook your GitHub repo to CI

### Exercise:
Fix the failing tests.

### ASIDE: Our Favourite Python Parts
Why the `python>=3.6` requirement?
* f-strings: Finally, long, readable strings in our code.
* dictionaries: insertion order is preserved! (This is an implementation detail of CPython 3.6 and a language guarantee from Python 3.7 onward.)
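
A short sketch of both features together. The stage names and descriptions are borrowed from the Makefile exercise above purely as sample data.

```python
stages = {}
stages["raw"] = "Fetch raw data"
stages["data"] = "Build Datasets"
stages["train"] = "Train Models"

# Iteration follows insertion order, not alphabetical or hash order.
print(list(stages))  # ['raw', 'data', 'train']

for number, (name, description) in enumerate(stages.items(), start=1):
    # f-strings interpolate expressions and accept format specs inline.
    print(f"{number}. {name:<6}| {description}")
```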

Other great tools:
* `pathlib`: Sane, multiplatform path handling: https://realpython.com/python-pathlib/
* `doctest`: Examples that always work: https://docs.python.org/3/library/doctest.html
* `joblib`: Especially the persistence part: https://joblib.readthedocs.io/en/latest/persistence.html
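
For a quick taste of `pathlib`, the sketch below builds and inspects a path without any string concatenation. The directory layout and file name are hypothetical, and nothing is read from or written to disk.

```python
from pathlib import Path

# Hypothetical project layout, for illustration only.
project_dir = Path("bus_number")
raw_file = project_dir / "data" / "raw" / "routes.csv"

print(raw_file.name)    # routes.csv
print(raw_file.suffix)  # .csv
print(raw_file.stem)    # routes
print(raw_file.parent == project_dir / "data" / "raw")  # True
print(raw_file.with_suffix(".pkl").name)  # routes.pkl
```

The `/` operator joins path components using the right separator for the current platform, which is what makes the same code work on Windows, macOS, and Linux.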
