# Repeating and controlling

Very often, when we develop a data analysis routine, as we will do later, we run into three problems: 1) dealing with folders full of data files, 2) repeating an operation, for example over multiple files, and 3) executing an analysis only in certain cases, for example only if the file format is appropriate. In this chapter we address these three problems with three tools: file path handling, ```for``` loops and conditional statements.

In this example, we use the data in the ```bacteria_growth``` folder, which contains information about bacterial growth such as the length of bacteria at birth, the division time, the growth rate etc. We will use a dummy workflow function that just computes the average of the birth length ```Lb``` (more on computations with DataFrames later):

In [5]:
import numpy as np

def workflow_fun(bact):
    
    mean_val = bact['Lb'].mean()
    
    return mean_val

Let's test our workflow on one dataset:

In [6]:
import pandas as pd

bacteria = pd.read_csv('data/bacteria_growth/bact_glucose.csv')

workflow_fun(bact=bacteria)

27.94854330019155

As you can see, our function expects a specific input: in this case a DataFrame that contains a column named ```Lb```, whose mean we want to compute.

## Files and paths

The goal now is to execute the same routine on a series of files. We would, for example, indicate a folder, get a list of its contents and run the workflow on each item. There are multiple ways to deal with paths in Python, but we highly recommend using the ```pathlib``` module, which provides many useful tools to handle path names, extensions etc.

First we use the ```Path``` object to define the path of the folder containing data:

In [7]:
from pathlib import Path

In [8]:
folder = Path('data/bacteria_growth/')

```folder``` is not just a string but an actual object that is much more useful. For example we can ask for the absolute path:

In [9]:
folder.absolute()

PosixPath('/Users/gw18g940/GoogleDrive/BernMIC/Trainings/DAVPy_intro/notebooks/data/bacteria_growth')

or we can ask if the defined path is a folder:

In [10]:
folder.is_dir()

True

In [11]:
folder.is_file()

False

There are also many useful functions to handle the path itself. For example, if you have a folder and a file name, you can simply join them with:

In [12]:
folder.joinpath('myfile.txt')

PosixPath('data/bacteria_growth/myfile.txt')

This spares you the hassle of adding slashes yourself and of making sure that your code works on another operating system.
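As a small aside, ```pathlib``` also lets you join paths with the ```/``` operator, which many people find more readable than ```joinpath```. The sketch below uses the same folder as above; ```myfile.txt``` is just a placeholder name and does not need to exist:

```python
from pathlib import Path

folder = Path('data/bacteria_growth/')

# The / operator builds a new path, just like joinpath does
joined_with_method = folder.joinpath('myfile.txt')
joined_with_operator = folder / 'myfile.txt'

# Both approaches should describe the same path
print(joined_with_method == joined_with_operator)
```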

### Listing files

Now we can use methods attached to this path object to explore its contents. For example, we can check the folder contents with ```iterdir```:

In [13]:
files_in_folder = folder.iterdir()

The returned object is a *generator*. We have not seen this kind of object yet. It behaves somewhat like a list whose contents can only be queried one after the other, for example using the ```next()``` function:

In [14]:
next(files_in_folder)

PosixPath('data/bacteria_growth/bact_glucoseaa.csv')

For the moment we just transform this generator into a regular list:

In [15]:
files_in_folder = folder.iterdir()

files_in_folder = list(files_in_folder)

files_in_folder

[PosixPath('data/bacteria_growth/bact_glucoseaa.csv'),
 PosixPath('data/bacteria_growth/readme.md'),
 PosixPath('data/bacteria_growth/bact_glycerol.csv'),
 PosixPath('data/bacteria_growth/bact_glucose.csv'),
 PosixPath('data/bacteria_growth/.ipynb_checkpoints')]
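The reason we call ```iterdir()``` a second time before building the list is that a generator can only be traversed once: every element that has already been handed out is gone. The following sketch illustrates this with a plain iterator made from a list using the built-in ```iter()``` (a stand-in that is not part of the example above, but behaves the same way in this respect), so that it runs without the data folder:

```python
# Build a simple one-pass iterator from a list (stand-in for folder.iterdir())
items = iter(['a.csv', 'b.csv', 'c.csv'])

first = next(items)       # takes the first element out
remaining = list(items)   # collects whatever is left
empty_now = list(items)   # nothing is left after a full pass

print(first)
print(remaining)
print(empty_now)
```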

### Investigating files

Our goal will be to analyze all the csv files in that folder. However we will need to do some clean-up first as some of the files should be discarded.

Again, each of the elements of ```files_in_folder``` is a ```Path``` object and we can get multiple features such as: 
- the folder the file belongs to:

In [16]:
files_in_folder[0].parent

PosixPath('data/bacteria_growth')

- the name of the file:

In [17]:
files_in_folder[0].name

'bact_glucoseaa.csv'

- the two parts of the file name: stem and extension:

In [18]:
files_in_folder[0].stem

'bact_glucoseaa'

In [19]:
files_in_folder[0].suffix

'.csv'

While all these elements could be recovered from a path represented as a simple string, the ```Path``` object makes this much easier, so we definitely recommend using it!

## Iterating through files

### for loops

As in almost all programming languages, iteration is achieved by using a ```for``` loop which allows us to repeatedly execute a block of code. In many languages, one just tells the for loop how many times it should run through the code block. In Python, instead of this we *traverse a list* and for each element of the list, execute the code block. Let's consider a simple example for the moment. We have a list of numbers:

In [20]:
mylist = [8,3,9,20,27]

and now we want to compute the square of each element in the list. So we write:

In [21]:
for e in mylist:
    result = e ** 2

As you can see, ```for``` loops are written in a relatively "natural" way in Python, stating that "for each element e in mylist execute the following lines". Note that:
1. ```e``` just stands for the currently selected element from mylist.
2. The for loop starts with the ```for``` statement
3. The list used for iteration is specified, here ```mylist```
4. Like a function definition, the for loop definition ends with ```:```
5. The content of the loop is **indented**

Note also that nothing happens when we execute the cell. This is because statements inside a ```for``` loop do not display any output automatically. If we want to see the actual value we have to use the ```print()``` function:

In [22]:
for e in mylist:
    result = e ** 2
print(result)

729


Only the last value is printed because we put the ```print()``` function outside the loop. If we want to see each value we have to **indent** the ```print()``` call so that it is included in the loop:

In [23]:
for e in mylist:
    result = e ** 2
    print(result)

64
9
81
400
729
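Printing is useful to inspect values, but often we want to keep them for later. A common pattern, sketched here with the same ```mylist```, is to create an empty list before the loop and add each result with ```append```; the name ```squares``` is only chosen for this illustration:

```python
mylist = [8, 3, 9, 20, 27]

squares = []  # empty list that will receive the results

for e in mylist:
    result = e ** 2
    squares.append(result)  # store the value instead of only printing it

print(squares)
```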


### Looping using a range

Often we don't want to loop over the content of a list but just want to do some operation N times or for indexes from 0 to N-1. To do that, we can use the built-in ```range()``` function, which does just this: it provides numbers within a certain range. The function doesn't really produce a list per se but can be used as if it were one. For example:

In [24]:
for x in range(8):
    print(x)

0
1
2
3
4
5
6
7


Note that, as always in Python, counting starts at 0: ```range(8)``` runs from 0 to 7 and does not include 8. Of course we could use these indexes to access specific parts of a list. Coming back to the previous example, we might want to calculate the square of only the first three numbers:

In [25]:
for i in range(3):
    result = mylist[i] ** 2
    print(result)

64
9
81


With ```mylist[i]``` we simply use the numbers generated by ```range()``` as indexes of our list.

### Back to files

Now that we know how to iterate through a list, instead of going through a list of numbers we can just go through a list of files. Remember that our files are:

In [26]:
files_in_folder

[PosixPath('data/bacteria_growth/bact_glucoseaa.csv'),
 PosixPath('data/bacteria_growth/readme.md'),
 PosixPath('data/bacteria_growth/bact_glycerol.csv'),
 PosixPath('data/bacteria_growth/bact_glucose.csv'),
 PosixPath('data/bacteria_growth/.ipynb_checkpoints')]

So we can just write:

In [27]:
for f in files_in_folder:
    print(f)

data/bacteria_growth/bact_glucoseaa.csv
data/bacteria_growth/readme.md
data/bacteria_growth/bact_glycerol.csv
data/bacteria_growth/bact_glucose.csv
data/bacteria_growth/.ipynb_checkpoints


And within the for loop, we can add any piece of code that we want. For example, we can check the file extension. As you can see, some of the entries in the folder are not csv files and we would like to exclude them from the analysis. A first step is hence to actually recover the extension with the ```suffix``` attribute:

In [28]:
for f in files_in_folder:
    suffix = f.suffix
    print(f)
    print(suffix)

data/bacteria_growth/bact_glucoseaa.csv
.csv
data/bacteria_growth/readme.md
.md
data/bacteria_growth/bact_glycerol.csv
.csv
data/bacteria_growth/bact_glucose.csv
.csv
data/bacteria_growth/.ipynb_checkpoints
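Notice that the last entry produces an empty line: ```.ipynb_checkpoints``` is a hidden folder, and ```pathlib``` treats a name that consists only of a leading dot followed by text as having no extension, so its ```suffix``` is an empty string. A minimal sketch, using ```PurePosixPath``` (a variant of ```Path``` that never touches the disk) so that it runs even without the data folder:

```python
from pathlib import PurePosixPath

hidden = PurePosixPath('data/bacteria_growth/.ipynb_checkpoints')
regular = PurePosixPath('data/bacteria_growth/bact_glucose.csv')

# A leading dot alone does not count as an extension
print(repr(hidden.suffix))
print(repr(regular.suffix))
```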



## Using conditions


### ```if``` statement

Now that we can recover the extension of each file, we can run our workflow *only if the extension is really csv*! For this we need another very common statement in programming languages, the ```if``` statement. Let's do a simple example first:

In [30]:
a = 3

In [31]:
if a > 4:
    print('Large')

In [32]:
if a < 4:
    print('Small')

Small


We see that the structure of the ```if``` statement is very similar to that of functions and ```for``` loops:
- a condition is stated and ends with ```:```
- the block executed only if the statement is *True* is *indented*

In some cases, we want to execute a different block of code when the condition of the ```if``` statement is *False*. For that we need to use the ```else``` statement, which has the same structure:

In [33]:
a = 10

if a < 4:
    print('Small')
else:
    print('Large')

Large


You can even add multiple sub-cases with ```elif```:

In [34]:
a = 10

if a < 4:
    print('Small')
elif a < 20:
    print('Intermediate')
else:
    print('Large')

Intermediate
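Conditions become more useful when they can be reused, so the same three-way test can be wrapped in a function. This is only a sketch; the function name ```size_label``` is invented for this illustration:

```python
def size_label(a):
    # conditions are checked from top to bottom; the first True one wins
    if a < 4:
        return 'Small'
    elif a < 20:
        return 'Intermediate'
    else:
        return 'Large'

for value in [3, 10, 25]:
    print(value, size_label(value))
```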


## Back to files

So now we want to add an ```if``` statement to our routine that checks the file format. Let's see if we can come up with a suitable test, for example:

In [35]:
files_in_folder[2]

PosixPath('data/bacteria_growth/bact_glycerol.csv')

In [36]:
files_in_folder[2].suffix == '.csv'

True

In [37]:
files_in_folder[1]

PosixPath('data/bacteria_growth/readme.md')

In [38]:
files_in_folder[1].suffix == '.csv'

False

So we can compare our suffix to the string ```.csv``` and that should work:

In [39]:
for f in files_in_folder:
    
    print(f)
    if f.suffix == '.csv':
        print('Is csv file')
    else:
        print('Is NOT csv file')

data/bacteria_growth/bact_glucoseaa.csv
Is csv file
data/bacteria_growth/readme.md
Is NOT csv file
data/bacteria_growth/bact_glycerol.csv
Is csv file
data/bacteria_growth/bact_glucose.csv
Is csv file
data/bacteria_growth/.ipynb_checkpoints
Is NOT csv file
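One caveat of this comparison is that it is case-sensitive: a file with an upper-case extension would not be recognised. A possible way around this, sketched here with made-up file names, is to convert the suffix to lower case before comparing:

```python
from pathlib import PurePosixPath

# Hypothetical file names, only used for illustration
candidates = [PurePosixPath('bact_glucose.csv'),
              PurePosixPath('bact_other.CSV'),
              PurePosixPath('readme.md')]

csv_files = []
for f in candidates:
    # lower() turns '.CSV' into '.csv' so both spellings are accepted
    if f.suffix.lower() == '.csv':
        csv_files.append(f.name)

print(csv_files)
```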


## Complete routine

So now we can finally go through all files, check the extension and execute the workflow only if the file is a csv file:

In [43]:
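keep_value = []  # will hold one mean value per csv file

for f in files_in_folder:

    # skip anything that is not a csv file (readme, checkpoints folder)
    if f.suffix == '.csv':

        bacteria = pd.read_csv(f)

        out = workflow_fun(bacteria)

        keep_value.append(out)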

In [44]:
keep_value

[34.79750161626745, 26.606417765323293, 27.94854330019155]
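As an alternative to filtering inside the loop, ```Path``` objects also offer a ```glob``` method that returns only the entries matching a pattern. The sketch below does not use the real data: it creates a temporary folder with empty placeholder files (all names are invented) so that it can run anywhere:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Create empty placeholder files to mimic the data folder
    for name in ['bact_a.csv', 'bact_b.csv', 'readme.md']:
        (folder / name).touch()

    # glob returns a generator; sorted() turns it into an ordered list
    csv_names = sorted(f.name for f in folder.glob('*.csv'))

print(csv_names)
```

With the real data, the result of ```glob('*.csv')``` on the data folder could presumably be used in place of ```files_in_folder``` in the loop above, which would make the ```if``` test unnecessary.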

## Exercise

Modify the loop above so that it prints the extension of each file that is not a csv file. Also try to come up with a way to find out at which index in the file list each of these files is located.