# Tutorial 1: Reproducible Environments
(Continued from `README.md`)

## Overview

* Requirements: The Bare Minimum 

* Using a Data Science Template: `cookiecutter`

* Virtual Environments: `conda` and environment files
* Revision Control: git and a git workflow
   * Installing, Enabling, and using nbdime
* The Data Science DAG
   * make, Makefiles and data flow
* Python Modules
   * Creating an editable module
* Testing: doctest, pytest, hypothesis

We'll start out by checking that all the requirements from the previous exercises (started in `README.md`) are met.

### Exercise 1: Install the requirements

* Anaconda
* Cookiecutter
* make
* git

***Note***: You don't need cookiecutter in your `bus_number_tutorial` environment (the one you should be using to run this notebook). We already used it to **build** the current `bus_number_tutorial` environment in the `TUTORIAL.md` file.

### Test your installation

In [None]:
!conda --version   # or `$CONDA_EXE --version` in some environments

In [None]:
!make --version

In [None]:
!git --version
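
If you prefer to check all of the tools in one cell, the same idea can be sketched in Python. This is a minimal sketch, assuming the tools are expected on your `PATH`; the names `REQUIRED_TOOLS` and `missing_tools` are made up for illustration and are not part of the tutorial code.

```python
import shutil

# Hypothetical list for illustration: the command-line tools from Exercise 1
REQUIRED_TOOLS = ["conda", "make", "git"]


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```

A tool that is reported as missing may still be installed somewhere that is not on `PATH` (for example, `conda` reachable only via `$CONDA_EXE`), so treat the result as a hint rather than a verdict.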

### Exercise 2: Start your cookiecutter-based project

Create a project called `Bus Number Tutorial`:

    Use conda as your virtualenv manager
    Use python 3.6 or greater

When complete, you should have a fully populated project directory, complete with a customized `README.md`.

We will be working in this project from now on.

### Solution 2

<pre>
 $ <b>cookiecutter https://github.com/hackalog/cookiecutter-easydata.git --checkout bus_number</b>

project_name [project_name]: <b>Bus Number Tutorial</b>
repo_name [bus_number_tutorial]: <b>↵</b>
module_name [src]: <b>↵</b>
author_name [Your name (or your organization/company/team)]: <b>hackalog</b>
description [A short description of this project.]: <b>Reproducible Data Science</b>
Select open_source_license:
1 - MIT
2 - BSD-2-Clause
3 - Proprietary
Choose from 1, 2, 3 [1]: <b>↵</b>
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]: <b>↵</b>
aws_profile [default]: <b>↵</b>
Select virtualenv:
1 - conda
2 - virtualenv
Choose from 1, 2 [1]: <b>↵</b>
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 [1]: <b>↵</b>


 $ <b>cd bus_number_tutorial</b>

</pre>

### Exercise 2b:

Explore the `README.md` from your new `bus_number_tutorial` repo

(Hint: You can use the `%load` magic, or `!cat` to look at it in your notebook)

### Solution 2b:

In [None]:
# %load ../README.md

Bus Number Tutorial
==============================

Reproducible data science tutorial

GETTING STARTED
---------------

For complete instructions, visit: https://github.com/hackalog/bus_number/wiki/Getting-Started

* Create and switch to the  virtual environment:
```
cd bus_number_tutorial
make create_environment
conda activate bus_number_tutorial
make requirements
```
* Explore the notebooks in the `notebooks` directory

Project Organization
------------
* `LICENSE`
* `Makefile`
    * top-level makefile. Type `make` for a list of valid commands
* `README.md`
    * this file
* `data`
    * Data directory. often symlinked to a filesystem with lots of space
    * `data/raw`
        * Raw (immutable) hash-verified downloads
    * `data/interim`
        * Extracted and interim data representations
    * `data/processed`
        * The final, canonical data sets for modeling.
* `docs`
    * A default Sphinx project; see sphinx-doc.org for details
* `models`
    * Trained and serialized models, model predictions, or model summaries
    * `models/trained`
        * Trained models
    * `models/output`
        * predictions and transformations from the trained models
* `notebooks`
    *  Jupyter notebooks. Naming convention is a number (for ordering),
    the creator's initials, and a short `-` delimited description,
    e.g. `1.0-jqp-initial-data-exploration`.
* `references`
    * Data dictionaries, manuals, and all other explanatory materials.
* `reports`
    * Generated analysis as HTML, PDF, LaTeX, etc.
    * `reports/figures`
        * Generated graphics and figures to be used in reporting
    * `reports/tables`
        * Generated data tables to be used in reporting
    * `reports/summary`
        * Generated summary information to be used in reporting
* `requirements.txt`
    * (if using pip+virtualenv) The requirements file for reproducing the
    analysis environment, e.g. generated with `pip freeze > requirements.txt`
* `environment.yml`
    * (if using conda) The YAML file for reproducing the analysis environment
* `setup.py`
    * Turns contents of `src` into a
    pip-installable python module  (`pip install -e .`) so it can be
    imported in python code
* `src`
    * Source code for use in this project.
    * `src/__init__.py`
        * Makes src a Python module
    * `src/data`
        * Scripts to fetch or generate data. In particular:
        * `src/data/make_dataset.py`
            * Run with `python -m src.data.make_dataset fetch`
            or  `python -m src.data.make_dataset process`
    * `src/analysis`
        * Scripts to turn datasets into output products
    * `src/models`
        * Scripts to train models and then use trained models to make predictions.
        e.g. `predict_model.py`, `train_model.py`
* `tox.ini`
    * tox file with settings for running tox; see tox.testrun.org


--------

<p><small>This project was built using <a target="_blank" href="https://github.com/hackalog/cookiecutter-easydata">cookiecutter-easydata</a>, an experimental fork of <a target="_blank" href="https://github.com/drivendata/cookiecutter-data-science">cookiecutter-data-science</a> aimed at making your data science workflow reproducible.</small></p>


### Exercise 3: Set up your virtual environment and install all dependencies

Create and activate your `bus_number_tutorial` conda environment using the above make commands.

Your `active environment` should be `bus_number_tutorial`


In [None]:
!conda info

**Note:** If you are using **JupyterHub**, the bash magics `!` and `%%bash` will not work as expected: they run in your root JupyterHub environment rather than in the conda kernel that you are running this notebook in, so you will not see `bus_number_tutorial`. To get around this, run the bash commands in this notebook from a terminal instance with your `bus_number_tutorial` conda environment activated.
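
One way to check the environment from inside the kernel itself, without going through a shell, is to look at where the running Python interpreter lives. The following is a rough sketch, assuming a standard conda layout in which the environment name is the last component of `sys.prefix`; the helper name `kernel_environment` is hypothetical.

```python
import os
import sys


def kernel_environment(prefix=None):
    """Return the last path component of the interpreter prefix.

    Assumption: in a conda install, this is the environment name.
    """
    prefix = sys.prefix if prefix is None else prefix
    return os.path.basename(os.path.normpath(prefix))


print("Kernel environment:", kernel_environment())
# CONDA_DEFAULT_ENV is set by `conda activate`; it may be absent in some kernels
print("CONDA_DEFAULT_ENV: ", os.environ.get("CONDA_DEFAULT_ENV", "(not set)"))
```

If the first line prints `bus_number_tutorial`, the notebook kernel is running from the expected environment, even when the `!` magics report something else.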

If done correctly, you should also be able to import from `src`:

In [None]:
# if importing src doesn't work, try `make requirements`
import src

### Exercise 4: Pick up this tutorial in your new repo

* Run jupyter notebook and open `notebooks/10-reproducible-environment.ipynb`

If you're currently running this notebook and the checks from the previous exercises worked, then you're in business!

Keep going from here!

## Revision Control: `git`

How do we keep track of our changes? We use **git**.

Before we do anything interesting, let's initialize a git repository (repo) here.


### Exercise 5: Initialize a git repo for `bus_number_tutorial`

```
git init
git add .
git commit -m "Initial Import"
```

In [None]:
!git status

You should see: 
    
    # On branch master
    nothing to commit, working directory clean


We will get back to using git again soon.

### Exercise 6: Add a dependency
Modify the environment file so that `make requirements` installs some additional packages
* install `joblib` using conda
* install `nbdime` using pip

In [None]:
# you should be able to see the difference via git
!git diff ../environment.yml

### Solution 6
Your `environment.yml` should look like this now:
<pre>
name: bus_number_tutorial
channels:
  - conda-forge
dependencies:
  - pip
  - pip:
    - -e .
    - python-dotenv>=0.5.1
    <b>- nbdime</b>
  - setuptools
  - wheel
  - sphinx
  - click
  - coverage
  - pytest-cov
  - jupyter
  <b>- joblib</b>
  - nb_conda
  - nbval
  - pandas
  - requests
  - python>=3.6
</pre>

In [None]:
# Check that you now have joblib and nbdime installed
# Don't forget that you need to run `make requirements` once you've changed the `environment.yml` file
import joblib
import nbdime

In [None]:
# you should be able to see the difference via git
!git diff ../environment.yml

### Exercise 7: Basic git interactions

See what has changed with git:

In [None]:
!git status

In [None]:
!git diff -u ../environment.yml

Add or reject the changes incrementally

In [None]:
#!git add -p
#!git reset -p


Commit the changes

In [None]:
#!git commit -v

### Solution 7

In [None]:
# You should have no differences in your branch now
# Except for those that you've made by running notebooks
!git status

## The Data Science DAG
DAG = Directed Acyclic Graph. 

That means the process eventually stops. (This is a good thing!) 

It also means we can use a super old, but incredibly handy tool to implement this workflow: `make`.

### Make, Makefiles, and the Data Flow


We use a `Makefile` to organize and invoke the various steps in our Data Science pipeline.
You have already used this file when you created your virtual environment in the first place:
```
make create_environment
```
Here are the steps we will be working through in this tutorial:
<img src="references/cheat_sheet.png" alt="Reproducible Data Science Workflow" width="400"/>

A [PDF version of the cheat sheet](references/cheat_sheet.pdf) is also available.



### What's my make target doing?
If you are ever curious what commands a `make` command will invoke (including any invoked dependencies), use `make -n`, which lists the commands without executing them:

In [None]:
%%bash
cd .. && make -n requirements

We use a cute **self-documenting Makefile trick** (borrowed from `cookiecutter-data-science`) to make it easy to document the various targets that you add. This documentation is produced when you type a plain `make`:

In [None]:
%%bash
cd .. && make

### Under the Hood: The Format of a Makefile

```
## Comment to appear in the auto-generated documentation
thing_to_build: space separated list of dependencies
	command_to_run            # there is a tab before this command.
	another_command_to_run    # every line gets run in a *new shell*
```
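
To make the `##` convention concrete, here is a small Python sketch of the idea behind self-documenting targets: a `##` comment immediately before a rule becomes that rule's help text. This is only an illustration under that assumption. It is not how the project's `Makefile` actually produces its help output, and `SAMPLE_MAKEFILE` and `documented_targets` are hypothetical names.

```python
import re

# Hypothetical Makefile contents, following the format shown above
SAMPLE_MAKEFILE = """\
## Fetch raw data
raw:
\t@echo "Fetch raw data"

## Build datasets
data: raw
\t@echo "Build Datasets"

undocumented:
\t@echo "no help text"
"""


def documented_targets(makefile_text):
    """Map each target to the `##` comment on the line just before it."""
    docs = {}
    pending = None  # help text waiting for the next rule line
    for line in makefile_text.splitlines():
        if line.startswith("## "):
            pending = line[3:].strip()
            continue
        match = re.match(r"^([A-Za-z0-9_.-]+)\s*:", line)
        if match and pending is not None:
            docs[match.group(1)] = pending
        pending = None  # a comment only applies to the very next line
    return docs


for target, help_text in documented_targets(SAMPLE_MAKEFILE).items():
    print(f"{target:<15} {help_text}")
```

Targets without a preceding `##` comment (such as `undocumented` here) are simply left out of the listing.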



In [None]:
%%file Makefile.test

data: raw
	@echo "Build Datasets"
train_test_split:
	@echo "do train/test split"
train: data transform_data train_test_split
	@echo "Train Models"
transform_data:
	@echo "do a data transformation"
raw:
	@echo "Fetch raw data"


In [None]:
# Note: you can run a specific Makefile with the -f option
!make -f Makefile.test data

Note: If you see: ```*** missing separator.  Stop.``` it's because you have used spaces instead of **tabs** before your commands. 

### Exercise 8: What does this makefile print when you run `make train`?

### Solution 8

In [None]:
%%bash
make -f Makefile.test train
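
To see why the output comes out in that order, it can help to model the dependency walk in Python. The sketch below assumes a simplified view of `make`: every target is rebuilt (none of them exists as a file), and prerequisites are visited left to right, depth first. Real `make` also compares file timestamps, which this ignores. `RULES` mirrors `Makefile.test`; `build_order` is a hypothetical helper.

```python
# Targets and their prerequisites, copied from Makefile.test above
RULES = {
    "data": ["raw"],
    "train_test_split": [],
    "train": ["data", "transform_data", "train_test_split"],
    "transform_data": [],
    "raw": [],
}


def build_order(target, rules, done=None):
    """Return targets in the order their recipes would run (simplified model)."""
    if done is None:
        done = []
    for dep in rules[target]:
        if dep not in done:
            build_order(dep, rules, done)  # build prerequisites first
    if target not in done:
        done.append(target)  # then the target itself, exactly once
    return done


print(build_order("train", RULES))
```

Each target appears once, and always after everything it depends on, which is exactly the property a DAG gives you.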

### Exercise 9: What happens when you add a cycle to a Makefile?
Set up a makefile with a cyclic dependency and run it.

### Solution 9

In [None]:
%%file Makefile.test

cycle: cycle_b
	@echo "in a Makefile"
cycle_b: cycle_c
	@echo "have a cycle"
cycle_c: cycle
	@echo "You can't"

In [None]:
%%bash
make -f Makefile.test cycle
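
GNU make typically warns about the circular dependency and drops it rather than looping forever. The detection itself can be sketched in a few lines of Python. This is an illustrative depth-first search, not make's actual implementation; `CYCLIC_RULES` mirrors the Makefile above, and `find_cycle` is a hypothetical helper.

```python
# Targets and their prerequisites, copied from the cyclic Makefile.test above
CYCLIC_RULES = {
    "cycle": ["cycle_b"],
    "cycle_b": ["cycle_c"],
    "cycle_c": ["cycle"],
}


def find_cycle(rules):
    """Return one dependency cycle as a list of targets, or None if acyclic."""
    visiting = []     # targets on the current dependency path
    finished = set()  # targets already proven cycle-free

    def visit(target):
        if target in finished:
            return None
        if target in visiting:
            # We came back to a target that is still being resolved
            return visiting[visiting.index(target):] + [target]
        visiting.append(target)
        for dep in rules.get(target, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        finished.add(target)
        return None

    for target in rules:
        cycle = visit(target)
        if cycle:
            return cycle
    return None


print(" -> ".join(find_cycle(CYCLIC_RULES)))
```

A graph for which `find_cycle` returns `None` is a DAG, so its targets can always be put in a valid build order.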

Using a Makefile like this is an easy way to set up a process flow expressed as a Directed Acyclic Graph (DAG).

**Note**: We have only scratched the surface here. There are lots of interesting tricks you can do with make.
* http://zmjones.com/make/
* http://blog.byronjsmith.com/makefile-shortcuts.html
* https://www.gnu.org/software/make/manual/


## Back to Revision Control: git workflows

Git isn't really a collaboration tool. It's more a tool for implementing collaboration workflows.

What do we mean by workflow? A process built on top of git that incorporates **pull requests** and **branches**. Typically, this is provided by sites like: GitHub, GitLab, BitBucket.


### Exercise 10: 

Create a GitHub/GitLab/BitBucket repo and sync your repo to it.


### Solution:
See https://help.github.com/articles/create-a-repo/ to start a GitHub repo.

**Note** Since you already have a repo with readme and license, you do **not** want to choose the option to initialize the repo with those. 

Once you have a GitHub repo called `bus_number_tutorial`:

**Using HTTPS**

    git remote add origin <your new repo URL, e.g. https://github.com/${GITHUB_USERNAME}/bus_number_tutorial.git>
    git push -u origin master

**Using SSH**

    git remote add origin <your new repo, e.g. git@github.com:${GITHUB_USERNAME}/bus_number_tutorial.git>
    git push -u origin master

In [None]:
# your remote repo should now show up
!git remote -v

For example (using SSH):

    origin	git@github.com:${GITHUB_USERNAME}/bus_number_tutorial.git (fetch)
   
    origin	git@github.com:${GITHUB_USERNAME}/bus_number_tutorial.git (push)


## GitHub workflow cheatsheet
See https://github.com/hackalog/bus_number/wiki/Github-Workflow-Cheat-Sheet

## Life Rules for using `git`

* Always work on a branch: `git checkout -b my_branch_name`. Delete branches once they are merged.
* **Never** push to master. Always **work on a branch** and do a pull request.
* Seriously, don't do work on master if you are collaborating with **anyone**.
* If you pushed it anywhere, or shared it with anyone, don't `git rebase`. In fact, if you're reading this, don't `git rebase`. Save that for when you are comfortable solving git merge nightmares on your own.
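
If you want a reminder of the first rule inside a notebook, a small guard along these lines can help. It is a sketch only: it assumes `git` is on your `PATH` and that `master` is the branch you want to protect, and the names `PROTECTED_BRANCHES`, `current_branch`, and `is_protected` are hypothetical.

```python
import subprocess

# Assumption for illustration: the branch names you never want to commit on
PROTECTED_BRANCHES = {"master"}


def current_branch(repo_dir="."):
    """Return the checked-out branch name, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=repo_dir,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:
        return None  # git is not installed or could not be started
    if result.returncode != 0:
        return None  # not a git repo, or no commits yet
    return result.stdout.strip()


def is_protected(branch):
    """True if `branch` is one you should not be working on directly."""
    return branch in PROTECTED_BRANCHES


branch = current_branch()
if branch is not None and is_protected(branch):
    print(f"You are on '{branch}'. Consider `git checkout -b my_branch_name` first.")
```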

Here are some common tasks in git/GitHub:

### Starting the day. Where was I? What was I doing?
```
git branch         # What branch am I currently on? e.g. {my_branch}
git status         # anything I forgot to commit? If so...
git commit ...     # Commit work in progress
```

### Didn't I do some work at home last night?
```
git checkout master       # leave whatever branch I was on
git fetch origin --prune  # Check for something new
git merge origin/master   # If updates available, update!
git branch --merged master # check for any merged branches that can be safely deleted
git branch -d {name_of_merged_branch} # delete any fully merged branches
```

### Anything fun happening upstream?

```
git checkout master
git fetch upstream --prune  # grab latest changes from upstream repo
git merge upstream/master   # merge them into local copy of my fork
git push origin master      # push latest upstream changes to my forked repo
git branch --merged master # check for any merged branches that can be safely deleted
git branch -d {name_of_merged_branch} # delete any fully merged branches
```

Now that `master` is up to date, you should merge whatever happened in `master` into your development branch:
```
git checkout {my_branch}
git merge master               # merges master->{my_branch}
git push origin {my_branch}    # Let GitHub know about the merge
```

#### Some useful references if `gitflow` isn't second nature to you yet
* Introduction to GitHub tutorial: https://lab.github.com/githubtraining/introduction-to-github
* Git Handbook: https://guides.github.com/introduction/git-handbook/

### Exercise 11:
* Create a branch called `add_sklearn`
* Add a scikit-learn dependency
* Check in these changes using git to your local repo
* Push the new branch to GitHub
* Create a pull request to merge this branch into master
* Merge your PR (delete the branch afterwards)
* Sync your local repo with GitHub, including deleting the merged branches

### Solution:

* Create a branch called `add_sklearn`
    
    `git checkout -b add_sklearn`
    
* Add a scikit-learn dependency
    * Add `scikit-learn` to the `environment.yml` file
    * `make requirements`
* Check in these changes using git to your local repo
    * `git add -p`
    * `git commit -m "add scikit-learn dependency"`
* Push the new branch to GitHub
    * `git push origin add_sklearn`
* Create a pull request to merge this branch into master
    * Go to the URL that gets generated from the previous step and create a PR
* Merge your PR (delete the branch afterwards)
    * Follow GitHub/BitBucket instructions for merging your PR and then deleting the `add_sklearn` branch
* Sync your local repo with GitHub, including deleting the merged branches
    * `git checkout master`
    * `git fetch origin --prune`
    * `git merge origin/master`
    * `git branch --merged master` should now display `add_sklearn` indicating that that branch has been merged into `master`
    * `git branch -d add_sklearn`

In [None]:
# You should now only have a branch called master
!git branch

## Python Modules
By default, we keep our source code in a module called `src`. (This can be overridden in the cookiecutter.)

This is enabled via one line in `environment.yml`:
```
- pip:
  - -e .
```

This creates an **editable module**, and looks in the current directory for a file called `setup.py` to indicate the module name and location.

This lets you easily use your code in notebooks and other scripts, and avoids any `sys.path.append` silliness.
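
A quick way to confirm where an importable module actually lives is to ask the import system. The sketch below uses only the standard library; `module_location` is a hypothetical helper name. For an editable install, you would expect the reported path for `src` to point into your project directory rather than into `site-packages`.

```python
import importlib.util


def module_location(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # the module is not importable in this environment
    return spec.origin


# `json` is used here only because it is always available;
# in the tutorial environment, try module_location("src") instead.
print(module_location("json"))
```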

### ASIDE: Semantic Versioning

Semantic versioning (or *semver*) refers to the convention of versioning with a triple:

    MAJOR.MINOR.PATCH

With the following convention: when releasing new versions, increment the:

*    MAJOR version when you make **incompatible API changes**,
*    MINOR version when you **add functionality** in a backwards-compatible manner, and
*    PATCH version when you make backwards-compatible **bug fixes**.

If you have no other plan, this is a great convention to follow.

For an obscene amount of detail on this concept, see https://semver.org/
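
The increment rules can be sketched as a small function. This is a minimal illustration that assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release or build suffixes; `bump_version` is a hypothetical helper, not part of the project.

```python
def bump_version(version, part):
    """Increment one component of a MAJOR.MINOR.PATCH version string.

    Lower-order components are reset to zero when a higher one changes.
    """
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


print(bump_version("0.0.1", "patch"))  # a backwards-compatible bug fix
print(bump_version("0.0.1", "minor"))  # new backwards-compatible functionality
print(bump_version("0.0.1", "major"))  # an incompatible API change
```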

### Exercise 12:
* add your favorite utility function to `src/utils.py`
* increment the version number of the editable package
* run `make requirements` (required if you added dependencies for your utility function)
* import your utility function and run it from this notebook

### Solution:


Add our utility function (note the `-a` flag with the `%%file` magic that allows us to append instead of overwrite the file)

In [None]:
%%file -a ../src/utils.py
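# Imports repeated here in case utils.py does not already provide them
import numpy as np
import pandas as pd


def read_space_delimited(filename, skiprows=None, class_labels=True, metadata=None):
    """Read a space-delimited file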
    
    Data is space-delimited. Last column is the (string) label for the data

    Note: we can't use automatic comment detection, as `#` characters are also
    used as data labels.

    Parameters
    ----------
    skiprows: list-like, int or callable, optional
        list of rows to skip when reading the file. See `pandas.read_csv`
        entry on `skiprows` for more
    class_labels: boolean
        if true, the last column is treated as the class (target) label
    """
    with open(filename, 'r') as fd:
        df = pd.read_csv(fd, skiprows=skiprows, skip_blank_lines=True,
                           comment=None, header=None, sep=' ', dtype=str)
        # targets are last column. Data is everything else
        if class_labels is True:
            target = df.loc[:, df.columns[-1]].values
            data = df.loc[:, df.columns[:-1]].values
        else:
            data = df.values
            target = np.zeros(data.shape[0])
        return data, target, metadata

Increment the version number to `0.0.2`

In [None]:
%%file ../setup.py
from setuptools import find_packages, setup

setup(
    name='src',
    packages=find_packages(),
    version='0.0.2',
    description='A short description of this project.',
    author='Your name (or your organization/company/team)',
    license='MIT',
)


In [None]:
# A handy magic that allows us to edit modules and have them stay up to date in the notebook. In this case, src.
%load_ext autoreload
%autoreload 2

In [None]:
from src.utils import read_space_delimited

In [None]:
read_space_delimited?

## Testing: doctest, pytest, coverage


Python has built-in testing frameworks via:
* doctest: https://docs.python.org/3/library/doctest.html#module-doctest
* unittest: https://docs.python.org/3/library/unittest.html

Additionally, you'll want to make regular use of:
* pytest: https://docs.pytest.org/en/latest/
* pytest-cov: https://pypi.org/project/pytest-cov/
* hypothesis: https://hypothesis.readthedocs.io/en/latest

Cookiecutter (vanilla flavoured) comes with a setup for the `tox` testing framework built in.
* https://tox.readthedocs.io/en/latest/
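
If doctests are new to you, here is a minimal sketch of what one looks like and how it can be run programmatically with the standard library. In practice you would normally let `pytest --doctest-modules` collect them; the `run_doctests` helper below is hypothetical and exists only to show the mechanics.

```python
import doctest


def addition(n1, n2):
    """Return the sum of two numbers.

    >>> addition(10, 10)
    20
    """
    return n1 + n2


def run_doctests(func):
    """Run the examples in `func`'s docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # The globals dict gives the examples access to the function under test
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted


failed, attempted = run_doctests(addition)
print(f"{attempted} example(s) run, {failed} failed")
```

The example in the docstring doubles as documentation and as a test: if the function's behaviour drifts away from what the docstring promises, the doctest fails.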

### Exercise 12:

Add a `make test` target to your makefile that:
* runs doctests
* runs pytest unit tests
* (extra credit) Displays test coverage results
    
When you run `make test`, you will find tests that will fail in `src/test_example.py`. Fix them in the next exercise.

### Solution 12:

    test:
        cd src && pytest --doctest-modules --verbose --cov

In [None]:
!cd .. && make -n test

In [None]:
!cd .. && make test

    cd src && pytest --doctest-modules --verbose --cov
    ============================= test session starts ==============================
    platform linux -- Python 3.6.7, pytest-4.2.1, py-1.7.0, pluggy-0.8.1 -- /opt/software/anaconda3/envs/bus_number_tutorial/bin/python
    cachedir: .pytest_cache
    rootdir: ~/src/devel/bus_number_tutorial, inifile:
    plugins: cov-2.6.1, nbval-0.9.1
    collected 7 items                                                              

    test_example.py::src.test_example.addition FAILED                        [ 14%]
    test_example.py::TestExercises::test_addition FAILED                     [ 28%]
    data/fetch.py::src.data.fetch.available_hashes PASSED                    [ 42%]
    data/fetch.py::src.data.fetch.fetch_file PASSED                          [ 57%]
    data/fetch.py::src.data.fetch.fetch_files PASSED                         [ 71%]
    data/fetch.py::src.data.fetch.get_dataset_filename PASSED                [ 85%]
    data/utils.py::src.data.utils.normalize_labels PASSED                    [100%]

    =================================== FAILURES ===================================
    _____________________ [doctest] src.test_example.addition ______________________
    004 
    005     I'm a failing doctest. Please fix me.
    006     >>> addition(10, 12)
    Expected:
        20
    Got:
        -2

    ~/src/devel/bus_number_tutorial/src/test_example.py:6: DocTestFailure
    _________________________ TestExercises.test_addition __________________________

    self = <src.test_example.TestExercises testMethod=test_addition>

        def test_addition(self):
            """
            I'm a failing unittest. Fix me.
            """
    >       assert subtraction(5, 5) == 0
    E       AssertionError: assert 10 == 0
    E         -10
    E         +0

    test_example.py:22: AssertionError

    ----------- coverage: platform linux, python 3.6.7-final-0 -----------
    Name                         Stmts   Miss  Cover
    ------------------------------------------------
    __init__.py                      0      0   100%
    analysis/__init__.py             0      0   100%
    analysis/analysis.py           105     86    18%
    analysis/run_analysis.py        23      9    61%
    data/__init__.py                 4      0   100%
    data/apply_transforms.py        27     12    56%
    data/datasets.py               311    262    16%
    data/fetch.py                  143    109    24%
    data/localdata.py                1      0   100%
    data/make_dataset.py            15      4    73%
    data/transform_data.py          88     72    18%
    data/transformers.py            42     29    31%
    data/utils.py                   85     61    28%
    features/__init__.py             0      0   100%
    features/build_features.py       0      0   100%
    logging.py                       7      0   100%
    models/__init__.py               3      0   100%
    models/algorithms.py             5      4    20%
    models/model_list.py            74     60    19%
    models/predict.py              100     80    20%
    models/predict_model.py         22      9    59%
    models/train.py                 54     39    28%
    models/train_models.py          25     11    56%
    paths.py                        17      0   100%
    test_example.py                  8      0   100%
    utils.py                        58     45    22%
    visualization/__init__.py        0      0   100%
    visualization/visualize.py       0      0   100%
    workflow.py                      8      0   100%
    ------------------------------------------------
    TOTAL                         1225    892    27%

    ====================== 2 failed, 5 passed in 1.69 seconds ======================
    make: *** [test] Error 1

***Note:*** `make test` is normally functionality built into `cookiecutter-easydata`. We're building it from scratch here for the sake of practice.

### Exercise 13:
Fix the failing tests

In [None]:
%%file ../src/test_example.py
import unittest

def addition(n1, n2):
    """
    I'm addition
    >>> addition(10, 10)
    20
    """
    return n1 + n2

def subtraction(n1, n2):
    """
    I'm subtraction.
    """
    return n1 - n2

class TestExercises(unittest.TestCase):
    def test_subtraction(self):
        """
        I was a failing unittest. Now I pass.
        """
        assert subtraction(5, 5) == 0


In [None]:
# Should pass all tests now!
!cd .. && make test

### Exercise 14:
* Check in all your changes to git
* Merge them into your master branch via a PR in GitHub

In [None]:
!git status