<table style="width:100%; border: 0px solid black;">
    <tr style="width: 100%; border: 0px solid black;">
        <td style="width:75%; border: 0px solid black;">
            <a href="http://www.drivendata.org">
                <img src="https://s3.amazonaws.com/drivendata.org/kif-example/img/dd.png" />
            </a>
        </td>
    </tr>
</table>

# Data Science is Software
## Developer #lifehacks for the Jupyter Data Scientist

### Section 2:  This is my house
#### Environment reproducibility for Python

In [4]:
from __future__ import print_function

import os
import sys

PROJ_ROOT = os.path.join(os.pardir, os.pardir)

# add local python functions
sys.path.append(os.path.join(PROJ_ROOT, "src"))

#### 2.1 The [watermark](https://github.com/rasbt/watermark) extension

Tell everyone when your notebook was run, and with which packages. This is especially useful for nbviewer, blog posts, and other media where you are not sharing the notebook as executable code.

In [1]:
# install the watermark extension
!pip install watermark

# once it is installed, you'll just need this in future notebooks:
%load_ext watermark

Collecting watermark
  Downloading watermark-2.2.0-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: watermark
Successfully installed watermark-2.2.0


In [2]:
%watermark -a "Peter Bull" -d -t -v -p numpy,pandas -g

Author: Peter Bull

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

numpy : 1.19.2
pandas: 1.1.3

Git hash: 9346522e03c824249e6fa06cb9b38b2c121a1ee6
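
If the watermark extension is not available, a similar record can be produced with the standard library. The following is a minimal sketch, not part of watermark itself: the helper name `version_report` is hypothetical, and it assumes Python 3.8 or newer (where `importlib.metadata` exists).

```python
import platform
from importlib import metadata


def version_report(packages):
    """Return a dict of interpreter details and installed package versions.

    Packages that are not installed are reported as None.
    """
    report = {
        "implementation": platform.python_implementation(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


# Print one "name: version" line per entry, similar in spirit to %watermark -v -p
for name, version in version_report(["numpy", "pandas"]).items():
    print(f"{name}: {version}")
```

Unlike `%watermark`, this sketch does not record the author, timestamp, or Git hash.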



#### 2.2 Laying the foundation

Continuum's `conda` tool provides a way to create [isolated environments](http://conda.pydata.org/docs/using/envs.html). In fact, you've already seen this at work if you followed the [pydata setup](https://github.com/drivendata/pydata-setup) instructions to set up your machine for this tutorial. The `conda env` functionality lets you create an isolated environment on your machine for each project, so that you can:

 - Start from "scratch" on each project
 - Choose Python 2 or 3 as appropriate

To create an empty environment:

 - `conda create -n <name> python=3`

**Note: `python=2` will create a Python 2 environment; `python=3` will create a Python 3 environment.**


To work in a particular virtual environment:

 - `conda activate <name>`
 
To leave a virtual environment:

 - `conda deactivate`

**Note: on older conda installations on Windows, the commands are just `activate` and `deactivate`, with no need to type `conda`.**

There are other Python tools for environment isolation, but none of them are perfect. If you're interested in the other options, [`virtualenv`](https://virtualenv.pypa.io/en/stable/) and [`pyenv`](https://github.com/yyuu/pyenv) both provide environment isolation. There are _sometimes_ compatibility issues between the Anaconda Python distribution and these packages, so if you've got Anaconda on your machine you can use `conda env` to create and manage environments.
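
To confirm from inside a notebook which environment the kernel is actually running in, you can inspect the interpreter and a few environment variables. This is a minimal sketch: the helper name `describe_environment` is hypothetical, and it relies on the assumption that `CONDA_DEFAULT_ENV` and `VIRTUAL_ENV` are set by the respective activation scripts, which is not the case if the kernel was started without activating the environment.

```python
import os
import sys


def describe_environment(environ=None):
    """Summarize which Python environment the current interpreter belongs to."""
    environ = os.environ if environ is None else environ
    return {
        # Set by `conda activate` to the name of the active environment
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),
        # Set by virtualenv/venv activation scripts to the environment directory
        "virtual_env": environ.get("VIRTUAL_ENV"),
        # Install prefix of the running interpreter
        "prefix": sys.prefix,
        # True inside a venv-style environment; conda environments
        # typically report False here
        "in_venv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
    }


for key, value in describe_environment().items():
    print(key, "=", value)
```

If `prefix` does not point at the environment you expected, the notebook kernel is probably using a different interpreter than your shell.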

-------------------
 
**`#lifehack`: create a new environment for every project you work on**

**`#lifehack`: if you use Anaconda to manage packages, creating the environment with `mkvirtualenv --system-site-packages <name>` (from virtualenvwrapper) means you don't have to recompile large packages**

------------


#### 2.3 The `pip` [requirements.txt](https://pip.readthedocs.org/en/1.1/requirements.html) file

It's a convention in the Python ecosystem to track a project's dependencies in a file called `requirements.txt`. We recommend using this file to keep track of your MRE, "Minimum reproducible environment".


-----------

**`#lifehack`: never again run `pip install <package>`. Instead, update `requirements.txt` and run `pip install -r requirements.txt`**

**`#lifehack`: for data science projects, favor `package>=0.0.0` rather than `package==0.0.0`. This works well with the `--system-site-packages` flag so you don't have many versions of large packages with complex dependencies sitting around (e.g., numpy, scipy, pandas)**

-------


In [5]:
# what does requirements.txt look like?
print(open(os.path.join(PROJ_ROOT, 'requirements.txt')).read())

coverage>=4.0.3
engarde>=0.3.1
ipython>=4.1.2
jupyter>=1.0.0
matplotlib>=1.5.1
notebook>=4.1.0
numpy>=1.10.4
pandas>=0.17.1
seaborn>=0.7.0
q>=2.6
python-dotenv>=0.5.0
watermark>=1.3.0
pytest>=2.9.2
tqdm
jupyter
ipython
numpy
pandas
matplotlib
watermark
scikit-learn
scipy
nbdime
runipy



The format for a line in the requirements file is:

 | Syntax | Result |
 | --- | --- |
 | `package_name` | installs the latest version available on PyPI |
 | `package_name==X.X.X` | installs exactly version X.X.X |
 | `package_name>=X.X.X` | installs version X.X.X or newer |
 
Now, contributors can create a new virtual environment (using conda or any other tool) and install your dependencies just by running:

 - `pip install -r requirements.txt`
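
To make the three line formats concrete, here is a rough sketch that parses them and checks each requirement against what is installed. It is not how `pip` resolves versions: the helper names are hypothetical, the version comparison is deliberately crude (digits only, so suffixes such as `rc1` are not handled properly), and it assumes Python 3.8 or newer for `importlib.metadata`.

```python
import re
from importlib import metadata

# Matches the three forms in the table above: name, name==X.X.X, name>=X.X.X
REQ_LINE = re.compile(
    r"^\s*([A-Za-z0-9_.\-]+)\s*(?:(==|>=)\s*([0-9][0-9A-Za-z.]*))?\s*$"
)


def parse_requirement(line):
    """Return (name, operator, version), or None for blank/comment/other lines."""
    match = REQ_LINE.match(line.split("#", 1)[0])
    return match.groups() if match else None


def version_key(version):
    """Crude comparison key: '1.10.4' -> (1, 10, 4); trailing zeros are dropped."""
    parts = [int(part) for part in re.findall(r"\d+", version)]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def is_satisfied(operator, required, installed):
    """Check an installed version string (or None) against one requirement."""
    if installed is None:
        return False
    if operator is None:
        return True
    if operator == "==":
        return version_key(installed) == version_key(required)
    return version_key(installed) >= version_key(required)


def check_requirements(lines):
    """Map each required package name to True/False for the current environment."""
    results = {}
    for line in lines:
        parsed = parse_requirement(line)
        if parsed is None:
            continue
        name, operator, required = parsed
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        results[name] = is_satisfied(operator, required, installed)
    return results


print(check_requirements(["numpy>=1.10.4", "pandas>=0.17.1", "tqdm"]))
```

In practice `pip install -r requirements.txt` remains the source of truth; a check like this is only useful as a quick sanity test at the top of a notebook.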

#### 2.4 Separation of configuration from codebase

There are some things you don't want to be openly reproducible: your private database URL, your AWS credentials for downloading the data, your SSN, which you decided to use as a hash. These shouldn't live in source control, but may be essential for collaborators or others reproducing your work.

This is a situation where we can learn from software engineering best practices. The [12-factor app principles](http://12factor.net/) give a set of best practices for building web applications. Many of these principles are relevant to data science codebases as well.

Using a dependency manifest like `requirements.txt` satisfies [II. Explicitly declare and isolate dependencies](http://12factor.net/dependencies). The important principle here is [III. Store config in the environment](http://12factor.net/config):

 > An app’s config is everything that is likely to vary between deploys (staging, production, developer environments, etc). Apps sometimes store config as constants in the code. This is a violation of twelve-factor, which requires strict separation of config from code. Config varies substantially across deploys, code does not. A litmus test for whether an app has all config correctly factored out of the code is whether the codebase could be made open source at any moment, without compromising any credentials.
 
The [`dotenv` package](https://pypi.python.org/pypi/python-dotenv) allows you to easily store these variables in a file that is not in source control (as long as you keep the line `.env` in your `.gitignore` file!). You can then reference these variables as environment variables in your application with `os.environ.get('VARIABLE_NAME')`.

In [6]:
print(open(os.path.join(PROJ_ROOT, '.env')).read())

# Environment variables go here, can be read by `python-dotenv` package:
#
#   `src/script.py`
#   ----------------------------------------------------------------
#    import dotenv
#
#    project_dir = os.path.join(os.path.dirname(__file__), os.pardir)
#    dotenv_path = os.path.join(project_dir, '.env')
#    dotenv.load_dotenv(dotenv_path)
#   ----------------------------------------------------------------
#
# DO NOT ADD THIS FILE TO VERSION CONTROL!
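
To illustrate what loading a `.env` file amounts to, here is a simplified, stdlib-only stand-in. It is not the `python-dotenv` implementation: the helper name `load_env_file` is hypothetical, the `DATABASE_URL` value is a made-up example, and only plain `KEY=VALUE` lines are handled. With `python-dotenv` installed, the `dotenv.load_dotenv(dotenv_path)` call shown in the file comment above is the real entry point.

```python
import os
import tempfile


def load_env_file(path, environ=None):
    """Read KEY=VALUE lines from `path` into `environ`.

    Keys that are already set are left untouched. Returns the parsed pairs.
    """
    environ = os.environ if environ is None else environ
    loaded = {}
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blank lines, comments, and anything that is not a pair
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key = key.strip()
            value = value.strip().strip("'\"")
            loaded[key] = value
            environ.setdefault(key, value)
    return loaded


# Demonstrate on a throwaway file and a plain dict instead of the real os.environ
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w") as handle:
        handle.write("# hypothetical example value\n")
        handle.write("DATABASE_URL=postgres://localhost/example\n")
    fake_environ = {}
    load_env_file(env_path, fake_environ)
    print(fake_environ.get("DATABASE_URL"))
```

Passing a plain dict keeps the demonstration from modifying the real process environment; omit the second argument to populate `os.environ`, after which `os.environ.get('VARIABLE_NAME')` works as described above.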

