# Productionising Scripts

## Introduction

So you've completed your notebooks and everything looks good. Or you've gotten a few scripts going and you would like to take the stability and productionisability of your scripts one step further. If this is you, then this notebook is for you.

What we'll cover is:
+ Modularising your scripts and importing your functions as modules
+ Creating command line arguments using `argparse`
+ Creating config files using `configparser`
+ Creating logs for your programs using `logging`
+ Putting it all together

## Modularising your code

So you've been putting all your code in 1 .py file - 😥. 

This is usually bad practice as it is very difficult to navigate 1 **HUGE** file and if something doesn't work it is very difficult to find out what. Luckily Python makes it easy to package your .py scripts into modules.

Let's look at an example. Say you've got a lot of helper functions, for example `calc_sum`, `calc_diff`, etc. 

In [1]:
def calc_sum(a,b):
    return a+b
def calc_diff(a,b):
    return a-b
print(calc_sum(1,2))
print(calc_diff(1,2))

3
-1


We can abstract these functions away into a script called `helpers.py` and import them from there, making our main .py file much cleaner. I've created this file, shown below.

In [2]:
!ls -l ./ | grep ".py$"
!echo ""
!cat ./helpers.py

-rw-r--r--@ 1 louwrenslabuschagne  staff   2839 Oct  8 15:11 boiler_plate.py
-rw-r--r--  1 louwrenslabuschagne  staff     68 Oct  8 13:38 helpers.py

def calc_sum(a,b):
    return a+b
def calc_diff(a,b):
    return a-b

In [3]:
from helpers import calc_sum, calc_diff
print(calc_sum(1,2))
print(calc_diff(1,2))

3
-1


However this isn't great, as our main directory will now become polluted with scripts like `helpers.py`, `utils.py`, `db.py`, `transform.py`, etc. What we would actually like to do is abstract these away into a folder (which will become our module) and import them from there. 

Luckily Python makes this very easy: all you need to do to turn a folder into an importable package is to include a file called `__init__.py` - that's it. Just an _empty_ file called `__init__.py` in a directory makes it a regular Python package (a module that can contain other modules). We'll look a bit later at what you can actually put in this file if you want to get more fancy.

I've decided to call my package `mods` - but this would typically be the name of the project you're working on. Also, I've moved the `calc_sum` and `calc_diff` functions into a script called `math.py` that lives inside the `mods` directory. Below I show the directory structure and file contents. You can see that the `__init__.py` file is empty and the `math.py` file contains the 2 functions.

In [4]:
!ls -l mods/
!echo "\n__init__.py"
!cat mods/__init__.py
!echo "\nmath.py"
!cat mods/math.py

total 8
-rw-r--r--  1 louwrenslabuschagne  staff    0 Oct  8 13:46 __init__.py
drwxr-xr-x  4 louwrenslabuschagne  staff  128 Oct  8 13:50 __pycache__
-rw-r--r--  1 louwrenslabuschagne  staff   68 Oct  8 13:46 math.py

__init__.py

math.py
def calc_sum(a,b):
    return a+b
def calc_diff(a,b):
    return a-b

Now we can do the following.

In [5]:
from mods.math import calc_diff, calc_sum
print(calc_sum(1,2))
print(calc_diff(1,2))

3
-1


This allows for much more modularised code and is the way that Python understands packages - at a later stage we'll look at how to package these functions up so they can be installed with `pip`.
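
As a minimal sketch of what a non-empty `__init__.py` can do, the snippet below builds a throwaway package in a temporary directory and re-exports the two helpers from `__init__.py`, so callers can import them straight from the package. The names `demo_mods` and `arith` are made up for this illustration and are not the files shown above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk so the example is self-contained.
# "demo_mods" and "arith" are hypothetical names used only for this sketch.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_mods"
pkg.mkdir()
(pkg / "arith.py").write_text(
    "def calc_sum(a, b):\n"
    "    return a + b\n"
    "\n"
    "def calc_diff(a, b):\n"
    "    return a - b\n"
)
# Re-export the helpers so they are available directly on the package.
(pkg / "__init__.py").write_text("from .arith import calc_sum, calc_diff\n")

sys.path.insert(0, str(root))   # make the temporary directory importable
importlib.invalidate_caches()
demo_mods = importlib.import_module("demo_mods")

print(demo_mods.calc_sum(1, 2))    # 3
print(demo_mods.calc_diff(1, 2))   # -1
```

With the re-export in place, `from demo_mods import calc_sum` works without the caller needing to know which file inside the package holds the function.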

## Command Line Arguments

So you've got a script and you'd like to make it callable with different parameters. For example, the Unix copy command `cp` has many different flags that can be set, as shown below.

In [6]:
!cp

usage: cp [-R [-H | -L | -P]] [-fi | -n] [-apvXc] source_file target_file
       cp [-R [-H | -L | -P]] [-fi | -n] [-apvXc] source_file ... target_directory


How do we get our scripts to accept arguments? 

Well, one way is to use `sys.argv`, which contains all arguments passed to a script. I've created a script called `cmd_args/sys_args.py` that looks as follows:

In [7]:
!cat cmd_args/sys_args.py

import sys
print(sys.argv)

Calling this script with arguments results in the following:

In [8]:
!python cmd_args/sys_args.py hallo -h there -d -2

['cmd_args/sys_args.py', 'hallo', '-h', 'there', '-d', '-2']


As you can see, each argument passed is appended in order to the list `sys.argv`. The first entry in this list is always the script being called, as it was written on the command line. However, using this method isn't great. If the user changes the order of any of the arguments your program will not work - for example, executing the following produces a completely different list from the example above.

In [9]:
!python cmd_args/sys_args.py -h there -d -2 hallo

['cmd_args/sys_args.py', '-h', 'there', '-d', '-2', 'hallo']
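
To make the fragility concrete, here is a small sketch that reads an argument purely by position. The function `read_by_position` is a hypothetical helper, and the two lists are the `sys.argv` values printed above.

```python
def read_by_position(argv):
    """Naive approach: assume the greeting is always the first argument."""
    return argv[1]

first_call = ['cmd_args/sys_args.py', 'hallo', '-h', 'there', '-d', '-2']
second_call = ['cmd_args/sys_args.py', '-h', 'there', '-d', '-2', 'hallo']

print(read_by_position(first_call))    # hallo
print(read_by_position(second_call))   # -h
```

The same arguments in a different order give a different answer, which is exactly the problem a proper argument parser solves.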


Luckily Python comes to the rescue again with the `argparse` library - which is part of the standard library, so there is nothing extra to install. I've created a file `cmd_args/argparse_args.py` showcasing the functionality.

In [10]:
!cat cmd_args/argparse_args.py

import argparse
from datetime import date

parser = argparse.ArgumentParser(prog='Awesome Bananas', 
                                 description='My awesome program that is awesome')
parser.add_argument('-s', '--start_day', 
                    type=str,
                    help='Start day to use in the format 20191008',
                    required=True)
parser.add_argument('-e', '--end_day',
                    type=str,
                    help='End day to use in the format 20191008',
                    default=date.today().strftime('%Y%m%d'))
parser.add_argument('-v', '--verbose',
                    action='store_true',
                    help='Make the program more talkative ')
args = parser.parse_args()
start_day = args.start_day
end_day = args.end_day
v = args.verbose

print('start_day: %s'%start_day)
print('end_day: %s'%end_day)
print('verbose: %r'%v)

There is a lot going on in this piece of code, so let's break it down.

1. First we import the `argparse` module.
2. Then we instantiate an `ArgumentParser` object with `argparse.ArgumentParser(...)` and assign it to a variable - `parser` in our case. Note that there are some optional arguments that you can specify for your program, like the program name (Awesome Bananas here) and a description for your program - the use of these will become apparent shortly.
3. Next, we add all the arguments we would like the user to input. For this example my program's got 3 arguments: `start_day`, `end_day` and `verbose`, and only `start_day` is mandatory. Unpacking the `add_argument` method a bit, you can see that you can assign a shorthand (`-s`) and longhand (`--start_day`) identifier to each argument. If you didn't know, it is convention in the Linux world to give your shorthand arguments 1 hyphen (`-s`) and your longhand arguments 2 hyphens (`--start_day`). Further, you can specify what type you are expecting for each argument, like `str`, `int` etc. Lastly, note the `action` argument in the verbose argument's `add_argument` call. It states `store_true`, and this will cause the `verbose` argument to be `False` if the `-v` flag is not specified and `True` if it is specified.
4. Lastly, we parse all the arguments using the `parse_args()` method. This parses all the arguments and stores them as attributes on a `Namespace` object - `args` in our case. We can then access each argument by its long name, for example `args.start_day`, and use it in our program.
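
The steps above can be tried without a command line by handing `parse_args()` an explicit list, which is handy in notebooks and tests. This is a cut-down sketch (the `prog` name `demo` is arbitrary), not the full `argparse_args.py` script.

```python
import argparse

parser = argparse.ArgumentParser(prog='demo')
parser.add_argument('-s', '--start_day', type=str, required=True)
parser.add_argument('-v', '--verbose', action='store_true')

# An explicit list replaces sys.argv[1:], so no shell is needed.
args = parser.parse_args(['-s', '20191001', '-v'])

print(args.start_day)   # 20191001
print(args.verbose)     # True

# vars() turns the Namespace into a plain dictionary if you prefer one.
as_dict = vars(args)
print(as_dict['start_day'])   # 20191001
```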

Let's see this in action. First I call the program with no arguments:

In [11]:
!python cmd_args/argparse_args.py

usage: Awesome Bananas [-h] -s START_DAY [-e END_DAY] [-v]
Awesome Bananas: error: the following arguments are required: -s/--start_day


First thing to note is the amazing usage printout that comes with the `argparse` package. As I've specified the `start_day` argument to be mandatory, the program displays the correct way to use it, namely: `usage: Awesome Bananas [-h] -s START_DAY [-e END_DAY] [-v]`.

If you didn't know, the convention in the Linux world is that optional arguments are shown in square brackets (`[]`). Here we see that `-h`, `-e` and `-v` are optional and `-s` is mandatory. But wait! We didn't add an argument with a `-h`. Exactly! It comes for free, and again the convention in the Linux world is that `-h` and `--help` are used for help, as shown below.

In [12]:
!python cmd_args/argparse_args.py -h

usage: Awesome Bananas [-h] -s START_DAY [-e END_DAY] [-v]

My awesome program that is awesome

optional arguments:
  -h, --help            show this help message and exit
  -s START_DAY, --start_day START_DAY
                        Start day to use in the format 20191008
  -e END_DAY, --end_day END_DAY
                        End day to use in the format 20191008
  -v, --verbose         Make the program more talkative


Now look at that! We get a beautiful printout and anybody using your program will know exactly how to interact with it. Let's see 2 more examples, one with the `-v` flag set and one using a different `end_day`.

In [13]:
!python cmd_args/argparse_args.py -v -s 20191001

start_day: 20191001
end_day: 20191009
verbose: True


In [14]:
!python cmd_args/argparse_args.py -s 20191001 -e 20191002

start_day: 20191001
end_day: 20191002
verbose: False
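
The script above accepts any string for the day arguments, even though the help text asks for a format like 20191008. One possible refinement, sketched here under the assumption that you want a `date` object back, is to pass a small conversion function as `type`. The function name `yyyymmdd` is made up for this example; `argparse.ArgumentTypeError` is the standard way to report a bad value.

```python
import argparse
from datetime import date, datetime

def yyyymmdd(value):
    """Convert a string such as 20191008 to a date, or reject it."""
    try:
        return datetime.strptime(value, '%Y%m%d').date()
    except ValueError:
        raise argparse.ArgumentTypeError(
            'expected a day like 20191008, got %r' % value)

parser = argparse.ArgumentParser(prog='demo')
parser.add_argument('-s', '--start_day', type=yyyymmdd, required=True)

args = parser.parse_args(['-s', '20191001'])
print(args.start_day)   # 2019-10-01
```

When the conversion fails, `argparse` prints the usage line together with the error message and exits, just as it does for a missing mandatory argument.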


## Config Files

Say you've got some passwords that you need to use in your script, or some other configuration that you'd like to abstract out of your program - then config files are for you - and again Python makes it nice and easy with the built-in `configparser` module. I've created a `config.ini` with the content shown below:

In [15]:
!cat config.ini

[optima]
username = optima_usr
password = optima_pwd

[touchpoint]
username = tp_usr
password = tp_pwd

[vodanetworks]
username = vd_usr
password = vd_pwd

This is the format that you'll have to adhere to - more info <a href='https://docs.python.org/3/library/configparser.html'>here</a>. Now we can create a `ConfigParser()` object and access the different configurations like a dictionary, as shown below.

In [16]:
from configparser import ConfigParser

config = ConfigParser()
config.read('config.ini')

optima_username = config['optima']['username']
optima_pwd = config['optima']['password']

vodanetworks_username = config['vodanetworks']['username']
vodanetworks_pwd = config['vodanetworks']['password']

touchpoint_username = config['touchpoint']['username']
touchpoint_pwd = config['touchpoint']['password']

print(optima_username, optima_pwd, 
      vodanetworks_username, vodanetworks_pwd, 
      touchpoint_username, touchpoint_pwd)

optima_usr optima_pwd vd_usr vd_pwd tp_usr tp_pwd


What I've found to be very useful is to put all my passwords in `config.ini` and then add `*.ini` to my `.gitignore`. That way my passwords never end up on a git server, but at the same time I don't have to type them out each time I run my script.
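
A few `ConfigParser` behaviours are worth knowing when the file is kept out of version control and might be missing on another machine. The sketch below uses `read_string` with a shortened copy of the config so that it runs without a file on disk; the `port` option and its fallback value are made up for illustration.

```python
from configparser import ConfigParser

SAMPLE = """
[optima]
username = optima_usr
password = optima_pwd

[touchpoint]
username = tp_usr
password = tp_pwd
"""

config = ConfigParser()
config.read_string(SAMPLE)   # same parsing as read(), but from a string

# read() returns the list of files it parsed; a missing file is skipped silently,
# so checking the return value is the way to notice that nothing was loaded.
loaded = config.read('no_such_file.ini')
print(loaded)   # []

# Loop over the sections instead of writing one pair of lines per system.
credentials = {name: (config[name]['username'], config[name]['password'])
               for name in config.sections()}
print(credentials['touchpoint'])   # ('tp_usr', 'tp_pwd')

# A fallback avoids an exception for optional settings ('port' is hypothetical).
print(config.get('optima', 'port', fallback='5432'))   # 5432
```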

## Logging

Lastly we'll look at logging - which is essential to productionising your scripts. Imagine you've got a mission critical app running and for some reason it crashes during the night. The next morning you'd like to know what went wrong - and with good logging this should be apparent. 

The naive way to log is to print statements to the CLI - but this is very fragile and doesn't allow for much flexibility. For example, you'd like to see the time, the user, the process ID etc. Of course you can go and write your own log_print function to do this (trust me I have), but why reinvent the wheel? 

Python comes to the rescue with the pre-installed `logging` module. 

It looks something like:

In [17]:
import logging
LOGGING_FORMAT = '%(asctime)-15s %(levelname)-10s %(name)-8s %(message)s'
logging.basicConfig(level=logging.DEBUG, format=LOGGING_FORMAT)

Note that once you've called the `logging.basicConfig` function, calling it again does nothing: if the root logger already has a handler, later calls are silently ignored. That is to say, once you've set a logging level, format and other parameters for your logger, they stay fixed for the session unless you pass `force=True` (available from Python 3.8).

Without `force=True`, resetting this in a notebook means restarting the kernel; for scripts, you'll have to call the script again with different configurations.
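
A small sketch of how repeated `basicConfig` calls behave: a second plain call is ignored once the root logger has a handler, while `force=True` (Python 3.8 and later) replaces the existing setup. The variable names are only for illustration.

```python
import logging

root = logging.getLogger()

# force=True (Python 3.8+) removes any existing handlers, so the demo starts clean.
logging.basicConfig(level=logging.INFO, force=True)
level_after_first = root.level

# The root logger already has a handler, so this call is silently ignored.
logging.basicConfig(level=logging.DEBUG)
level_after_second = root.level

# With force=True the old handlers are removed and the new settings apply.
logging.basicConfig(level=logging.DEBUG, force=True)
level_after_forced = root.level

print(level_after_first == logging.INFO)     # True
print(level_after_second == logging.INFO)    # True
print(level_after_forced == logging.DEBUG)   # True
```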

So why use the logging module? Well... A lot of 3rd party Python libraries also use it, and unless a library sets a level of its own, its messages are filtered by the level you set and printed in your format. So what logging levels are available? In ascending order of importance:

+ logging.DEBUG
+ logging.INFO
+ logging.WARNING
+ logging.ERROR
+ logging.CRITICAL

DEBUG will print everything and CRITICAL will only report big big problems. 

OK. Cool - but how do I use it? By calling `logging.debug(message)` or `logging.critical(message)` for example.

In [18]:
logging.debug('Connecting to DB')
logging.info('Loaded data')
logging.debug('Closed connection to DB')

2019-10-09 06:29:43,055 DEBUG      root     Connecting to DB
2019-10-09 06:29:43,058 INFO       root     Loaded data
2019-10-09 06:29:43,059 DEBUG      root     Closed connection to DB
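
For the crash-during-the-night scenario, `logging.exception` is worth knowing: called inside an `except` block, it logs at ERROR level and appends the traceback. The sketch below writes to an in-memory stream so that it is self-contained; the logger name `nightly_job` is hypothetical.

```python
import io
import logging

stream = io.StringIO()
log = logging.getLogger('nightly_job')   # hypothetical logger name
log.setLevel(logging.ERROR)
log.addHandler(logging.StreamHandler(stream))
log.propagate = False                    # keep the demo output out of the root logger

try:
    1 / 0
except ZeroDivisionError:
    # Logs the message at ERROR level followed by the full traceback.
    log.exception('Nightly job failed')

captured = stream.getvalue()
print(captured)
```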


As you can see the print-out follows the format I specified, viz: `%(asctime)-15s %(levelname)-10s %(name)-8s %(message)s`. 

Here `asctime` is the time, `levelname` is the logging level, `name` is the name of the logger that emitted the record (`root` when you call the module-level functions such as `logging.debug`) and `message` is the message you want to display. See more options <a href='https://docs.python.org/3/howto/logging.html'>here</a>.
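
To see the `name` field change, you can create a named logger with `logging.getLogger`. The sketch below is an illustration only: the logger name `db` is made up, and the output goes to an in-memory stream so the result does not depend on how the root logger was configured earlier.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)-10s %(name)-8s %(message)s'))

log = logging.getLogger('db')   # 'db' is a hypothetical logger name
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False           # keep the demo output out of the root logger

log.debug('Connecting to DB')   # below INFO, so this record is dropped
log.info('Loaded data')         # emitted with name 'db' instead of 'root'

output = stream.getvalue()
print(output)
```

A common convention is `logging.getLogger(__name__)` at the top of each module, so every log line shows which module it came from.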

Not sold yet? Another great feature is that you can specify if you want your logs to be written to a file using the `filename` and `filemode` arguments in the `basicConfig` function. For example:

`logging.basicConfig(filename='app.log', filemode='w', level=logging.DEBUG, format=LOGGING_FORMAT)`

Because I've already initialised my logger in this session, I can't just call this now and expect all subsequent logs to be written to `app.log` - but if you were to initialise it using the command above then this would be the case. You can use `filemode='a'` if you'd like to always append your logs to the same file, or `filemode='w'` to overwrite the file on every run.
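
If you want the logs both in a file and on the console, one option is the `handlers` argument of `basicConfig`, which is used instead of `filename`/`filemode` rather than alongside them. This is a sketch under a few assumptions: the log goes to a temporary directory so nothing is left behind, and `force=True` needs Python 3.8 or later.

```python
import logging
import os
import tempfile

LOGGING_FORMAT = '%(asctime)-15s %(levelname)-10s %(name)-8s %(message)s'
log_path = os.path.join(tempfile.mkdtemp(), 'app.log')

logging.basicConfig(
    level=logging.DEBUG,
    format=LOGGING_FORMAT,
    handlers=[
        logging.FileHandler(log_path, mode='w'),   # plays the role of filename/filemode
        logging.StreamHandler(),                   # keeps printing to the console
    ],
    force=True,   # Python 3.8+: replace any earlier configuration
)
logging.info('Loaded data')

with open(log_path) as fh:
    file_contents = fh.read()
print(file_contents)
```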

Still not sold? Maybe seeing some log messages from a 3rd party module will convince you. Below I connect to a remote server using the `paramiko` module and list the contents of a directory.

In [19]:
from paramiko import SSHClient
import pandas as pd

user = 'bigdata'
host = '10.132.46.152'
directory = '/export0/home/iris/server/logs/'
logging.info('*.ll %s@%s:%s'%(user, host, directory))

ssh_client = SSHClient()
ssh_client.load_system_host_keys()
ssh_client.connect(hostname=host,
                   username=user,
                   key_filename='/Users/louwrenslabuschagne/.ssh/bigdata')

stdin, stdout, stderr = ssh_client.exec_command("""ls -Rtl --time-style=long-iso %s | awk '{print $1,"\t",$2,"\t",$3,"\t",$4,"\t",$5,"\t",$6,"\t",$7,"\t",$8,"\t",$9,$10,$11,$12,$13,$14,$15;}'"""%directory)
files_on_remote = stdout.readlines()
ssh_client.close()

sizes = []
times = []
files = []

for file in files_on_remote:
    fields = [f.strip() for f in file.split('\t')]
    size = fields[4]
    time = fields[5] + ' ' + fields[6]
    file_ = fields[7]

    times.append(time)
    sizes.append(size)
    files.append(file_)
df = pd.DataFrame(dict(size_bytes=sizes,
                       filename=files,
                       file_create_time=times))
df.head()

2019-10-09 06:29:43,888 DEBUG      matplotlib $HOME=/Users/louwrenslabuschagne
2019-10-09 06:29:43,889 DEBUG      matplotlib CONFIGDIR=/Users/louwrenslabuschagne/.matplotlib
2019-10-09 06:29:43,891 DEBUG      matplotlib matplotlib data path: /Users/louwrenslabuschagne/Documents/py37/lib/python3.7/site-packages/matplotlib/mpl-data
2019-10-09 06:29:43,896 DEBUG      matplotlib loaded rc file /Users/louwrenslabuschagne/Documents/py37/lib/python3.7/site-packages/matplotlib/mpl-data/matplotlibrc
2019-10-09 06:29:43,899 DEBUG      matplotlib matplotlib version 3.1.1
2019-10-09 06:29:43,900 DEBUG      matplotlib interactive is False
2019-10-09 06:29:43,901 DEBUG      matplotlib platform is darwin


2019-10-09 06:29:43,978 DEBUG      matplotlib CACHEDIR=/Users/louwrenslabuschagne/.matplotlib
2019-10-09 06:29:43,984 DEBUG      matplotlib.font_manager Using fontManager instance from /Users/louwrenslabuschagne/.matplotlib/fontlist-v310.json
2019-10-09 06:29:44,148 INFO       root     *.ll bigdata@10.132.46.152:/export0/home/iris/server/logs/
2019-10-09 06:29:44,170 DEBUG      paramiko.transport starting thread (client mode): 0x15164c50
2019-10-09 06:29:44,172 DEBUG      paramiko.transport Local version/idstring: SSH-2.0-paramiko_2.6.0
2019-10-09 06:29:44,178 DEBUG      paramiko.transport Remote version/idstring: SSH-2.0-Sun_SSH_2.4
2019-10-09 06:29:44,179 INFO       paramiko.transport Connected (version 2.0, client Sun_SSH_2.4)
2019-10-09 06:29:44,187 DEBUG      paramiko.transport kex algos:['diffie-hellman-group-exchange-sha256', 'diffie-hellman-group-exchange-sha1', 'diffie-hellman-group14-sha1', 'diffie-hellman-group1-sha1'] server key:['ssh-rsa', 'ssh-dss'] client encrypt:['aes12

|   | size_bytes | filename | file_create_time |
|---|---|---|---|
| 0 | | | |
| 1 | | | |
| 2 | 9130025.0 | statsMonitor.log | 2019-10-09 06:28 |
| 3 | 771216.0 | statsLog-tp.log | 2019-10-09 06:15 |
| 4 | 396096.0 | syslog | 2019-10-09 01:00 |

See how beautifully verbose the entire Paramiko connection is. Perhaps too verbose - the `info` logging level would probably be a better option day to day - however, if things aren't working as you expect then the `debug` level will be great.
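
If you want to keep `debug` output for your own code while quietening a chatty library, you can raise the level on that library's logger only. The sketch below works purely with logger names, so it runs without `paramiko` installed; `my_app` is a hypothetical name for your own logger.

```python
import logging

logging.getLogger().setLevel(logging.DEBUG)   # root logger: show everything

# Raise the threshold for one noisy library only. Child loggers such as
# 'paramiko.transport' inherit it through the dotted-name hierarchy.
logging.getLogger('paramiko').setLevel(logging.WARNING)

transport_log = logging.getLogger('paramiko.transport')
app_log = logging.getLogger('my_app')   # hypothetical application logger

print(transport_log.isEnabledFor(logging.DEBUG))     # False
print(transport_log.isEnabledFor(logging.WARNING))   # True
print(app_log.isEnabledFor(logging.DEBUG))           # True
```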

## Putting it all together

Now that you've got all the tools you can have a look at the `boiler_plate.py` setup I'm currently using to speed up development for new projects. 

In [20]:
!python boiler_plate.py -h

usage: Boilerplate Code [-h] [-c CONFIG_FILE] [-d DEBUG_LEVEL] [-f LOG_FILE]

Boilerplate Code python create_ranking_df.py -c config.ini -d D

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG_FILE, --config_file CONFIG_FILE
                        config file with passwords for various DBs
  -d DEBUG_LEVEL, --debug_level DEBUG_LEVEL
  -f LOG_FILE, --log_file LOG_FILE
                        log file to write logs to - defaults to
                        create_ranking_df.log


In [21]:
!python boiler_plate.py -c config.ini -d D

In [22]:
!cat boiler_plate.py.log

2019-10-09 06:29:44,839 DEBUG      root     /Users/louwrenslabuschagne/Documents/gitProjects/productionising-scripts/boiler_plate.py called with arguments: config_file:config.ini, debug_level:D, log_file:boiler_plate.py.log
2019-10-09 06:29:44,840 DEBUG      root     Config for: optima, touchpoint, vodanetworks read in
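
The contents of `boiler_plate.py` are not shown here, so the following is only a hypothetical sketch of how such a script might wire the three pieces together. The `LEVELS` mapping, the default file names and the `main` function are assumptions that loosely mirror the help output above; they are not taken from the real file. The last few lines run it against a temporary config so the example is self-contained.

```python
import argparse
import logging
import os
import tempfile
from configparser import ConfigParser

LOGGING_FORMAT = '%(asctime)-15s %(levelname)-10s %(name)-8s %(message)s'
# Assumed mapping from a one-letter flag to a logging level.
LEVELS = {'D': logging.DEBUG, 'I': logging.INFO, 'W': logging.WARNING,
          'E': logging.ERROR, 'C': logging.CRITICAL}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog='Boilerplate Sketch')
    parser.add_argument('-c', '--config_file', default='config.ini',
                        help='config file with passwords for various DBs')
    parser.add_argument('-d', '--debug_level', choices=sorted(LEVELS), default='I',
                        help='logging level: D, I, W, E or C')
    parser.add_argument('-f', '--log_file', default='boilerplate_sketch.log',
                        help='log file to write logs to')
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # force=True (Python 3.8+) replaces any logging setup done earlier in the session.
    logging.basicConfig(filename=args.log_file, filemode='a',
                        level=LEVELS[args.debug_level],
                        format=LOGGING_FORMAT, force=True)
    logging.debug('called with arguments: %s', vars(args))

    config = ConfigParser()
    if not config.read(args.config_file):   # empty list means nothing was loaded
        logging.error('Could not read config file %s', args.config_file)
        return 1
    logging.debug('Config for: %s read in', ', '.join(config.sections()))
    return 0


# Demo run against a temporary config and log file.
workdir = tempfile.mkdtemp()
config_path = os.path.join(workdir, 'config.ini')
with open(config_path, 'w') as fh:
    fh.write('[optima]\nusername = optima_usr\npassword = optima_pwd\n')
log_path = os.path.join(workdir, 'run.log')

exit_code = main(['-c', config_path, '-d', 'D', '-f', log_path])
with open(log_path) as fh:
    log_text = fh.read()
print(log_text)
```

In a real script you would drop the demo lines and call `main()` with no arguments, so that the values come from the command line.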
