### Previous knowledge
https://github.com/ScienceParkStudyGroup/studyGroup/blob/gh-pages/lessons/20171010_Intro_to_Python_Like/1hr_python_workshop.ipynb

# Pipelines in Python 

Reproducibility is pivotal in science. 
Reproducibility means that you or someone else can repeat exactly what you have done and check whether this gives similar results. **Preferably in a reasonable amount of time!**

___
##### Can you think of examples of situations in which reproducibility is hampered?    
##### (How) can this be avoided?
___





When we use programs on the command line, there is an automatic log of what we have done; we just have to save it. This is in contrast to most procedures in the lab, where we intend to follow a protocol but cannot be sure afterwards whether we actually followed it or made a mistake. Computational work that involves mouse clicks is usually not logged either, so if we notice that a mistake has been made, we have to go back and figure out what went wrong.

Today we will explore how we can use Python to build pipelines. Most pipelines have more or less the following architecture:

<img src='images/Pipelines.png' height=400 width=200>

Individual steps are often performed using third-party software that can be run from the command line. You can run such commands from a Python program using `os.system(cmnd)`, the `system` function from the `os` module (https://docs.python.org/3/library/os.html).

___
##### How do you find out how to use a commandline program? 
___
>- search the online documentation  
>- use man _program name_  
>- run the program with --help (often long help)  
>- run the program with -h (often short help)  
>- run the program without arguments  


For today, we won't bother with third-party software, because it would take too much time and effort to get everyone (Windows/Mac/Linux users) on the same page. Instead we will use operating system commands for our examples.

Create a temporary directory for this course (the Python examples below assume it is called `temp`):

> **mkdir** _dirname_  

Then go into this directory 

> **cd** _dirname_  

and create a file called tempfilename.txt:  
MacOS/Linux  
> **touch** tempfilename.txt  
  
Windows  
> **echo blabla >** tempfilename.txt  
  

Let's now try this in Python.


In [6]:
#first import the os module
import os

# where are we?
# the working directory is the directory you were in when you started python (notebook).
print(os.getcwd())



/Users/like/Dropbox/02.Teaching/StudyGroup/20171219_PipelinesInPython/notebook


In [4]:
# make the directory
os.mkdir('temp')

FileExistsError: [Errno 17] File exists: 'temp'

This raises an error because the directory already exists: we just made it on the command line.  
So let's first delete it. Run these commands from the directory that contains `temp`.

Safely:  
MacOS/Linux  
**rm** temp/tempfilename.txt  
**rmdir** temp  
    
Windows  
**del** temp\tempfilename.txt  
**rmdir** temp  

With force:  
MacOS/Linux  
**rm -r** temp
  
Windows  
**rmdir /s** temp  
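
The same clean-up can also be done from Python, which avoids the differences between operating systems. The sketch below is one possible way, using `os.remove`, `os.rmdir` and `shutil.rmtree` from the standard library. It works inside a throwaway directory made with `tempfile` (an assumption of this example, so that nothing in your own folders is touched).

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    temp_dir = os.path.join(workdir, 'temp')
    fname = os.path.join(temp_dir, 'tempfilename.txt')

    # set up: a directory with one file in it
    os.mkdir(temp_dir)
    open(fname, 'w').close()

    # safely: remove the file first, then the (now empty) directory
    os.remove(fname)
    os.rmdir(temp_dir)
    removed_safely = not os.path.exists(temp_dir)

    # with force: remove the directory and everything in it in one go
    os.mkdir(temp_dir)
    open(fname, 'w').close()
    shutil.rmtree(temp_dir)
    removed_with_force = not os.path.exists(temp_dir)

print(removed_safely, removed_with_force)
```

`os.rmdir` refuses to remove a directory that still contains files, which is what makes the first variant the safe one.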


In [7]:
#make the temp directory again, using the os module
os.mkdir('temp')

#let's go in
os.chdir('temp')

#where are we?
print(os.getcwd())

#make our tempfile again
os.system('touch tempfilename.txt')


/Users/like/Dropbox/02.Teaching/StudyGroup/20171219_PipelinesInPython/notebook/temp


0

`os.system` returns a value. We will not go into this further, but if it returns 0, all is well; a non-zero value indicates that something went wrong. You can use this as a checkpoint in your pipeline.  
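
As a minimal sketch of such a checkpoint: the helper `run_checked` below is a hypothetical name, not part of the `os` module. It stops the pipeline as soon as a command returns a non-zero value.

```python
import os

def run_checked(cmnd):
    """Run a command with os.system and stop if it reports a problem."""
    status = os.system(cmnd)
    if status != 0:
        raise RuntimeError('command failed (status ' + str(status) + '): ' + cmnd)
    return status

# 'echo' exists on MacOS, Linux and Windows, so this step should succeed
run_checked('echo step 1 finished')

# a command that does not exist gives a non-zero return value
try:
    run_checked('this_command_does_not_exist_12345')
except RuntimeError as error:
    print('pipeline stopped:', error)
```

The exact non-zero value differs between operating systems, so the check only tests whether it is 0 or not.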
  
One important point of programming is to automate repetitive tasks. So suppose we want to create 4 files, how do we do this?

To get you started I will repeat what we learned about for-loops

In [8]:
my_name = 'Like'
for i in range(4): # range(n) yields the integers 0, 1, .., n-1
    print(my_name+str(i))
    

Like0
Like1
Like2
Like3


___
##### Can you use a for-loop to generate 4 temporary files: tempfilename0.txt, tempfilename1.txt, .. , tempfilename3.txt?  
First just print these to the screen to check whether your commands look OK, then adjust the code so you actually execute these commands.
___


In [11]:
fname_base = 'tempfilename'
for i in range(4):
    cmnd = 'touch '+fname_base+str(i)+'.txt'
    print(cmnd)
    os.system(cmnd)

touch tempfilename0.txt
touch tempfilename1.txt
touch tempfilename2.txt
touch tempfilename3.txt
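
`touch` is not available on Windows. As an alternative sketch that does not depend on the operating system, the files can be created from Python itself with `pathlib.Path.touch` (standard library). The throwaway directory from `tempfile` is an assumption of this example.

```python
import os
import tempfile
from pathlib import Path

fname_base = 'tempfilename'
with tempfile.TemporaryDirectory() as tmpdir:
    for i in range(4):
        fname = os.path.join(tmpdir, fname_base + str(i) + '.txt')
        Path(fname).touch()  # creates an empty file, like 'touch' does
    created = sorted(os.listdir(tmpdir))

print(created)
```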


Let's revisit our schematic pipeline:
  
<img src='images/Pipelines.png' height=400 width=200>
  
'Data 0.0' probably consists of multiple files, on which we want to perform the same procedure. Python has a module `glob` that can be used to e.g. get a list of all the files in a directory.


In [16]:
import glob


input_fnames = glob.glob('*') # '*' selects all files in this directory
print(str(len(input_fnames))+' files:')
print(input_fnames)





5 files:
['tempfilename.txt', 'tempfilename1.txt', 'tempfilename0.txt', 'tempfilename2.txt', 'tempfilename3.txt']


___
##### Can you use `glob` and `os` to change the file extension for all filenames in the temp directory, from '.txt' into '.bla' ?

Hints:  

MacOS/Linux  
**mv** _oldFilename newFilename_  
  
Windows  
**rename** _oldFilename newFilename_    

___


In [24]:
#Hint:
fname = 'tempfilename1.txt'
print(fname.split('.'))
print(fname[:-3])


['tempfilename1', 'txt']
tempfilename1.


In [27]:
#Answer
input_fnames = glob.glob('*') # '*' selects all files in this directory
for fname in input_fnames:

    cmnd  = 'mv '+fname+' '+fname[:-3]+'bla'
    print(cmnd)
    os.system(cmnd)
    

mv tempfilename.txt tempfilename.bla
mv tempfilename1.txt tempfilename1.bla
mv tempfilename0.txt tempfilename0.bla
mv tempfilename2.txt tempfilename2.bla
mv tempfilename3.txt tempfilename3.bla
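
The answer above relies on `mv` and on the extension being exactly three characters long. A possible alternative, sketched here, stays within Python: `os.path.splitext` splits off the extension and `os.rename` does the renaming. The function name `change_extension` and the throwaway directory are assumptions made for this example.

```python
import glob
import os
import tempfile

def change_extension(directory, old_ext, new_ext):
    """Rename all files in directory that end in old_ext so they end in new_ext."""
    new_fnames = []
    for fname in sorted(glob.glob(os.path.join(directory, '*' + old_ext))):
        root, ext = os.path.splitext(fname)  # e.g. ('.../tempfilename1', '.txt')
        new_fname = root + new_ext
        os.rename(fname, new_fname)
        new_fnames.append(new_fname)
    return new_fnames

with tempfile.TemporaryDirectory() as tmpdir:
    # set up: four empty .txt files
    for i in range(4):
        open(os.path.join(tmpdir, 'tempfilename' + str(i) + '.txt'), 'w').close()
    renamed = change_extension(tmpdir, '.txt', '.bla')
    renamed_basenames = [os.path.basename(fname) for fname in renamed]

print(renamed_basenames)
```

Selecting with `'*' + old_ext` instead of `'*'` means that files with another extension are left alone.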


In [None]:
# you can also specify the full path
indir = os.getcwd()
input_fnames = glob.glob(indir+'*') # no '/' between the directory and '*'
print(str(len(input_fnames))+' files:')
print(input_fnames)

We only get one match, and it is not one of our files! This is because `os.getcwd()` returns the path without a trailing '/', so the pattern ends in `temp*` and matches the `temp` directory itself rather than the files inside it.

In [15]:
#you can also specify the full path
indir = os.getcwd()
input_fnames = glob.glob(indir+'/*') # '*' selects all files in this directory
print(str(len(input_fnames))+' files:')
print(input_fnames)

5 files:
['/Users/like/Dropbox/02.Teaching/StudyGroup/20171219_PipelinesInPython/notebook/temp/tempfilename.txt', '/Users/like/Dropbox/02.Teaching/StudyGroup/20171219_PipelinesInPython/notebook/temp/tempfilename1.txt', '/Users/like/Dropbox/02.Teaching/StudyGroup/20171219_PipelinesInPython/notebook/temp/tempfilename0.txt', '/Users/like/Dropbox/02.Teaching/StudyGroup/20171219_PipelinesInPython/notebook/temp/tempfilename2.txt', '/Users/like/Dropbox/02.Teaching/StudyGroup/20171219_PipelinesInPython/notebook/temp/tempfilename3.txt']
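
To avoid having to remember the separator, the path can be built with `os.path.join`, which inserts the right separator for the operating system. A small sketch, again in a throwaway directory (an assumption of this example):

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as indir:
    # set up: three empty files
    for i in range(3):
        open(os.path.join(indir, 'tempfilename' + str(i) + '.txt'), 'w').close()

    # without a separator the pattern matches the directory itself
    without_separator = glob.glob(indir + '*')

    # os.path.join puts the separator between the directory and '*'
    pattern = os.path.join(indir, '*')
    input_fnames = sorted(glob.glob(pattern))

print(str(len(input_fnames)) + ' files:')
print([os.path.basename(fname) for fname in input_fnames])
```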


Why do we bother with full paths?  
Because you should organize your data such that not all files are in a single directory.
Let's do some project organization:

`scripts` or `notebooks` 
  
`raw input` (read only!)  
  
`1.Data_cleaning`  
subdirs:  
>`1.0_RemoveContaminants`  
>`2.0_RemoveOutliers`  
>`2.1_RemoveOutliers`   
>`3.0_FillInMissingValues`  
>`3.1_FillInMissingValues`

`2.Analyses`  
subdirs:  
>`Frequency plots`  
>`Heatmaps`  
>`PCA`  
>`Cross-comparison__SPECIES`  
>`Cross-comparison__TIME`  

`logs` or `docs`


Creating these directories should be part of your pipeline. Similarly, you can (and probably should) automatically create logfiles, READMEs etc.  
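
As a sketch of how directory creation could be made part of a pipeline: the function `create_project`, the `layout` dictionary (a shortened version of the organization above) and the README text are assumptions for illustration. `os.makedirs` with `exist_ok=True` does not raise a `FileExistsError` when a directory is already there, so the pipeline can be run more than once.

```python
import os
import tempfile

# top-level directories and their subdirectories (shortened)
layout = {
    'scripts': [],
    'raw_data': [],
    '1.Data_cleaning': ['1.0_RemoveContaminants', '2.0_RemoveOutliers', '3.0_FillInMissingValues'],
    '2.Analyses': ['Heatmaps', 'PCA'],
    'logs': [],
}

def create_project(root, layout):
    """Create all directories in layout below root, plus a README."""
    created = []
    for topdir, subdirs in layout.items():
        paths = [os.path.join(root, topdir)]
        paths += [os.path.join(root, topdir, subdir) for subdir in subdirs]
        for path in paths:
            os.makedirs(path, exist_ok=True)  # no error if it already exists
            created.append(path)
    with open(os.path.join(root, 'README.txt'), 'w') as readme:
        readme.write('Directories created by create_project:\n')
        readme.write('\n'.join(created) + '\n')
    return created

with tempfile.TemporaryDirectory() as root:
    create_project(root, layout)
    n_created = len(create_project(root, layout))  # a second run is harmless
    top_level = sorted(os.listdir(root))

print(n_created, 'directories')
print(top_level)
```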
  
A pipeline consists of a series of _modules_ or _steps_ that each have their input and output directory, logfile and README. So your code will look something like this:


#### init() 

#### step1()  
>    - create directories: OUTPUT, logs  
>    - create README: metadata for the files you create
>    - create logfile: cmnds, error messages, etc.  
>    - do the actual task:
>
>**remove_contaminants -i [inputfile] -o [outfile]**  
>
>indir = `raw_data`  
>for _file_ in glob.glob(indir+'/*'):  
>     
>     remove_contaminants -i file -o OUTPUT/file.contaminantsRemoved.out >& logs/remove_contaminants.file.log  
     
#### step2()  
>    - create directories: OUTPUT, logs  
>    - create README: metadata for the files you create
>    - create logfile: cmnds, error messages, etc.  
>    - do the actual task:
>
>**remove_outliers -i [inputfile] -t [threshold] -o [outfile]**  
>
>indir = `1.0_RemoveContaminants/OUTPUT`  
>for _file_ in glob.glob(indir+'/*'):  
>     
>     remove_outliers -i file -o OUTPUT/file.outliersRemoved.out -t T >& logs/remove_outliers.file.log  
     
     
etc...
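
A rough Python sketch of what `step1()` from the outline could look like is given below. `remove_contaminants` is the fictional tool from the outline, so by default the function only builds and logs the commands (`dry_run=True`, a parameter invented for this example) instead of running them. The file names `README.txt` and `step1.log` are also assumptions.

```python
import glob
import os
import tempfile

def step1(indir, stepdir, dry_run=True):
    """Build (and optionally run) one remove_contaminants command per input file."""
    # create directories: OUTPUT, logs
    outdir = os.path.join(stepdir, 'OUTPUT')
    logdir = os.path.join(stepdir, 'logs')
    os.makedirs(outdir, exist_ok=True)
    os.makedirs(logdir, exist_ok=True)

    # create README: metadata for the files we create
    with open(os.path.join(stepdir, 'README.txt'), 'w') as readme:
        readme.write('OUTPUT: files from ' + indir + ' after remove_contaminants\n')

    cmnds = []
    # create logfile: every command is written down before it is run
    with open(os.path.join(logdir, 'step1.log'), 'w') as logfile:
        for infile in sorted(glob.glob(os.path.join(indir, '*'))):
            name = os.path.basename(infile)
            outfile = os.path.join(outdir, name + '.contaminantsRemoved.out')
            toollog = os.path.join(logdir, 'remove_contaminants.' + name + '.log')
            # '> file 2>&1' sends normal output and error messages to the log
            cmnd = ('remove_contaminants -i ' + infile + ' -o ' + outfile
                    + ' > ' + toollog + ' 2>&1')
            logfile.write(cmnd + '\n')
            cmnds.append(cmnd)
            if not dry_run:
                # do the actual task, and stop if it fails
                if os.system(cmnd) != 0:
                    raise RuntimeError('step1 failed on ' + infile)
    return cmnds

with tempfile.TemporaryDirectory() as project:
    # set up: a raw_data directory with two empty input files
    raw_data = os.path.join(project, 'raw_data')
    os.makedirs(raw_data)
    for sample in ['sampleA.txt', 'sampleB.txt']:
        open(os.path.join(raw_data, sample), 'w').close()

    stepdir = os.path.join(project, '1.0_RemoveContaminants')
    cmnds = step1(raw_data, stepdir)
    step1_contents = sorted(os.listdir(stepdir))

print(len(cmnds), 'commands built')
print(step1_contents)
```

`step2()` would follow the same pattern, with `1.0_RemoveContaminants/OUTPUT` as its input directory.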

___
##### You see that we pass arguments to our fictional commandline tools. Can you think of situations in which you want to pass arguments to your own pipeline?
___

#### The argparse module

The `argparse` module lets you pass all kinds of command-line arguments to your own Python program.  
**IT IS AWESOME!**  
Why?  
- it automatically generates help-text with -h or --help  
- if certain arguments are necessary, you can flag them as such and it automatically gives an error message if the user does not pass that argument  
- it can deal with conditional arguments  
- you can group arguments  

Check the docs:  
https://docs.python.org/3/library/argparse.html  
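
A minimal sketch of some of the points above, for a made-up pipeline that removes outliers: the argument names (`--indir`, `--outdir`, `--threshold`) and their defaults are assumptions chosen for this example. Normally you would call `parse_args()` without arguments so that it reads the command line; here a list is passed in so the example also runs inside a notebook.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description='Toy pipeline: remove outliers from all files in a directory.')

    # arguments can be grouped; groups show up as separate sections in the help text
    io_group = parser.add_argument_group('input/output')
    io_group.add_argument('-i', '--indir', required=True,
                          help='directory with the input files (required)')
    io_group.add_argument('-o', '--outdir', default='OUTPUT',
                          help='directory for the output files')

    # type=float converts the text from the command line into a number
    parser.add_argument('-t', '--threshold', type=float, default=2.0,
                        help='threshold for calling a value an outlier')
    return parser

parser = build_parser()
args = parser.parse_args(['-i', 'raw_data', '-t', '3'])
print(args.indir, args.outdir, args.threshold)
```

Leaving out `-i` makes `argparse` print a usage message and exit, and `-h` or `--help` prints the help text built from the `help=` strings.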
