# Python training for data engineers

## Python development

### Goal
This section explains the environments in which you can develop Python code: local development, a virtual environment, and Docker. All three work, but I suggest the Docker approach because it keeps your system clean. There is no wrong choice here; each approach has its own benefits, so pick whichever you find most convenient.

### Information
<img src="https://conda.io/docs/_images/conda_logo.svg" width=200px/>

**Conda**
> Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.

* [Home](https://conda.io)
* [Installation Windows](https://conda.io/miniconda.html)
* [Docs](https://conda.io/docs/user-guide/install/windows.html)

<img src="http://jupyter.org/assets/nav_logo.svg" width=200px/>

**Jupyter notebook**
> The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

* [Home](http://jupyter.org/)
* [Installation Windows](http://jupyter.org/install.html)
* [Docs](https://jupyter.readthedocs.io/en/latest/)

Important commands:

| Description                 | Command            |
| :-------------------------- | :----------------: |
| Install package | `pip install <package>`        |
| Remove package | `pip uninstall <package>`        |
| Show packages | `pip freeze` |
| Save package list to file | `pip freeze > requirements.txt` |
| Install packages from file | `pip install -r requirements.txt`| 

### Local development

Packages can be installed with either `conda` or `pip`. The latter is the more common approach, also outside a Conda environment. The packages a script requires are normally installed with `pip`, for example by running `pip install mymodule`. Running `pip freeze` lists the installed packages together with their version numbers, which is useful when the script has to run on other machines. Save the output of `pip freeze` to `requirements.txt`; the same packages can then be installed again by running `pip install -r requirements.txt`. It is common to commit the requirements file to version control such as Git.

```bash
(machine-01) $ pip install pandas
(machine-01) $ pip freeze > requirements.txt

(machine-02) $ pip install -r requirements.txt
```
**Remember: if a package fails to import, make sure the Python package is installed (e.g. `pip install pandas`).**
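
As a rough illustration of the freeze-and-reinstall workflow above, the following sketch compares a pinned package list against what is installed in the current interpreter. The function names `parse_requirements` and `find_mismatches` are hypothetical, the code assumes Python 3.8 or newer (for `importlib.metadata`), and it only handles plain `name==version` lines as written by `pip freeze`.

```python
from importlib import metadata


def parse_requirements(text):
    """Parse 'name==version' lines, such as the output of `pip freeze`."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not a plain pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: problem} for pins that are missing or differ."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if installed != wanted:
            problems[name] = "installed {}, pinned {}".format(installed, wanted)
    return problems


if __name__ == "__main__":
    # Example input; in practice this would be the content of requirements.txt.
    pins = parse_requirements("pandas==1.5.3\n")
    print(find_mismatches(pins))
```

An empty result means that every pinned package is installed at exactly the pinned version.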

### Virtual environment
Instead of polluting the system's Python with the modules used during development, it is advisable to keep development environments separate from the system environment. This can be done with a virtual environment, which is essentially a sandbox that runs its own (possibly different) version of Python and to which all installed packages are attached. `conda` is used to create a new environment, for example one saved in the current folder under the name `python-101-env` with Python version 2.7:

```bash
$ conda create --prefix ./python-101-env python=2.7
```

Because the environment was not created in the default folder for virtual environments, it has to be activated using its full path:

```bash
$ source activate 'C:\Users\j.waterschoot\Documents\Work\Itility\itility-python-101\python-101-env'
``` 

Inside the virtual environment modules can be installed. These will only be available inside the virtual environment, so when running the system's Python interpreter, the imports will not work.

```bash
(python-101-env) $ pip install mymodule
(python-101-env) $ pip install -r requirements.txt
```
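
To confirm that a script really runs inside the environment rather than the system's Python, one option is to inspect the running interpreter. The sketch below is a minimal example; `interpreter_info` and `runs_inside` are hypothetical helper names, and the comparison assumes that `sys.prefix` points at the environment folder, which is the usual behaviour for Conda environments.

```python
import os
import sys


def interpreter_info():
    """Collect basic facts about the interpreter running this code."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
    }


def _normalize(path):
    # Resolve symlinks and normalize case so that paths compare reliably.
    return os.path.normcase(os.path.realpath(path))


def runs_inside(env_dir):
    """Return True if the running interpreter belongs to env_dir."""
    return _normalize(sys.prefix) == _normalize(env_dir)


if __name__ == "__main__":
    print(interpreter_info())
    # Assumes the environment was created as ./python-101-env.
    print(runs_inside("./python-101-env"))
```

If `runs_inside` returns `False` after activation, the shell is most likely still picking up a different Python from the `PATH`.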

### Docker
Finally, code can be developed using Docker. This gives the developer an additional layer of abstraction by creating a complete system (mostly Linux) with a minimal setup, just enough to get the code up and running. For example, Ubuntu 16.04 is used as the base system, Python 3.6 is installed, and the machine runs Jupyter notebook version 4.1. By keeping the Docker files in source control, all developers can easily spin up a Docker container and have the exact same environment as their team members.

#### Dockerfile
```dockerfile
# Base image
FROM jupyter/datascience-notebook
# Switch user to root
USER root
# Update the packages
RUN apt-get update
# Add the Python requirements
ADD requirements.txt .
# Install the Python requirements
RUN pip install -r requirements.txt
# Start the notebook
CMD ["/opt/conda/bin/jupyter", "notebook", "--notebook-dir=/opt/notebooks", "--ip='*'", "--no-browser", "--allow-root", "--port=8118"]
```

#### docker-compose.yml

```yaml
version: '2'
services:
  anaconda:
    build: .
    volumes:
      - ./notebooks:/opt/notebooks
      - ./data:/opt/data
    ports:
      - "8118:8118"
```

#### Process

1. Create the image based on the Dockerfile

   `$ docker build -f Dockerfile .`
   
2. In the same folder, create the container and detach from it. Make sure the `docker-compose.yml` is in this folder.

   `$ docker-compose up -d`
   
3. Check that the container is running

   `$ docker ps`
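
Besides `docker ps`, a quick way to see whether the notebook server answers is to test the published port from Python. This is a minimal sketch, assuming the container runs on the local machine and publishes port 8118 as in the `docker-compose.yml` above; `port_is_open` is a hypothetical helper name. It only checks that something accepts TCP connections on that port, not that it is actually Jupyter.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Port 8118 is the one published in docker-compose.yml.
    print(port_is_open("localhost", 8118))
```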
   