# Dealing with repetitive tasks in your terminal

## For Loops
### A 'for loop' is a bash programming language statement that lets code be executed repeatedly. Perfect for our metagenomic and genomic needs.

### Calling each file within a folder

In [16]:
%%bash
ls
# make sure the following files are in your current directory

Untitled.ipynb
file1.txt
file2.txt
file3.txt
file4.txt
file5.txt
merged_files.txt


In [8]:
%%bash
echo file1.txt
echo file2.txt
echo file3.txt
# we can call one file at a time, but this is tedious; this is where for loops come in handy

file1.txt
file2.txt
file3.txt


In [12]:
%%bash
for i in *.txt
do
    echo "$i"
done
# See how easy it is

file1.txt
file2.txt
file3.txt
file4.txt
file5.txt


### But we can do more than just call files; we can manipulate them too

In [15]:
%%bash
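for i in *.txt
do
    # ${i%} has no pattern after the %, so nothing is removed: file1.txt is copied to file1.txt_newname.txt
    cp "$i" "${i%}_newname.txt"
done
# list the directory to see the new copies
ls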


Untitled.ipynb
file1.txt
file1.txt_newname.txt
file2.txt
file2.txt_newname.txt
file3.txt
file3.txt_newname.txt
file4.txt
file4.txt_newname.txt
file5.txt
file5.txt_newname.txt
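
The copies above keep the original `.txt` in the middle of the new name because `${i%}` has nothing after the `%` to remove. A minimal sketch of how a pattern after the `%` strips the extension first. The scratch directory and the `demo_dir` and `base` names are made up for illustration, so no real files are touched:

```bash
# Work in a throwaway directory (demo_dir is a made-up name)
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch file1.txt file2.txt

for i in *.txt
do
    base="${i%.txt}"                # remove the shortest trailing match of ".txt"
    cp "$i" "${base}_newname.txt"   # file1.txt is copied to file1_newname.txt
done

ls
```

The list of files matched by `*.txt` is worked out once, before the loop starts, so the new copies are not picked up by the same loop.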


# This can be adapted to run metagenomic programs
### For example, let's say that we want to run the taxonomic profiling program 'kraken' on 20 fasta files. The command for a single file looks like this:

```perl kraken --db Bacteria file1.fasta >> file1.fasta.kraken```

### Now imagine having to do this 20 or 100 times; this is why loops are great.

In [None]:
# example script; this can be adapted to whatever program you might want to run
for f in *fasta
do
    perl kraken --db Bacteria "$f" >> "${f}.kraken"
done
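
Because `>>` appends, running the loop a second time would add the results to the existing output files again. One possible way around this, sketched here as a dry run, is to skip any file whose output already exists and to print each command before running anything for real. The scratch directory, the empty `sample*.fasta` files and `planned_commands.txt` are stand-ins invented for the example; kraken itself is never called:

```bash
# Throwaway directory with made-up input files
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch sample1.fasta sample2.fasta sample3.fasta
touch sample2.fasta.kraken        # pretend sample2 was already processed

for f in *.fasta
do
    out="${f}.kraken"
    if [ -e "$out" ]; then
        echo "skipping $f ($out already exists)"
        continue
    fi
    # Dry run: print the command that would be run, do not execute it
    echo "perl kraken --db Bacteria $f >> $out"
done | tee planned_commands.txt
```

Once the printed commands look right, the `echo` line can be replaced by the real command.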

#  How to parse/read text files line by line?

In [18]:
%%bash 
cat merged_files.txt
# the simplest way to see every line of a file is to print it with cat

/path/to/file1
/path/to/file2
/path/to/another/file3
/path/to/file4
/path/to/another/file5


### But what if we wanted to read the file from a script and assign each line to a unix variable?
### While read loops are perfect when you want to read an input file line by line.

In [19]:
%%bash
while IFS= read -r LINE
do
    echo "$LINE"
done < merged_files.txt
# While there is a line left to process, the loop body is executed. Here each line is a path, but it could just as well be the names of a bunch of files.

/path/to/file1
/path/to/file2
/path/to/another/file3
/path/to/file4
/path/to/another/file5
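
Real list files often contain blank lines or notes. A minimal sketch of a while read loop that skips both. In `read`, `-r` keeps backslashes as they are and `IFS=` keeps leading and trailing spaces. The scratch file and the `count` variable are made up for the example, and treating `#` lines as comments is an assumption about your own list format:

```bash
# Build a small made-up list with a blank line and a comment line
demo_dir=$(mktemp -d)
list="$demo_dir/merged_files.txt"
printf '%s\n' '/path/to/file1' '' '# a note, not a path' '/path/to/another/file3' > "$list"

count=0
while IFS= read -r LINE
do
    [ -z "$LINE" ] && continue                # skip empty lines
    case "$LINE" in '#'*) continue ;; esac    # skip lines that start with #
    count=$((count + 1))
    echo "entry $count: $LINE"
done < "$list"

echo "processed $count entries"
```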


In [None]:
# just an example for when you want to download datasets from the SRA
# sra_list.txt could be a list of the SRA accessions you want to download, one per line
while read -r LINE
do
    fastq-dump --outdir . --gzip --skip-technical --readids --dumpbase --split-files --clip "$LINE"
done < sra_list.txt

# Using multiple nodes/cluster

## A great advantage of a cluster is its multiple nodes, which let you split these repetitive tasks so that each one runs on its own 'computer', allowing a large number of files to be analyzed extremely quickly. In this example we will pretend that we are working on the SDSU anthill cluster.

### First we need to write a bash script
### Each script starts with a "shebang" (#!) followed by the path of the interpreter that should run the rest of the script, in this case bash.
```#!/bin/bash```


In [None]:
# let's write a bash script; first create the file with vim, nano or whatever editor you like (I will use nano)
nano simple_script.sh
# a new window opens; copy and paste the script below into it

In [None]:
#!/bin/bash
#$ -cwd 

# always comment your script so months from now you remember what it does
# A simple script
echo "Hey there" >> new.txt
echo "Hi there" >> new.txt
echo "Whoa there" >> new.txt


### Next you need to make this script executable
### ```chmod +x simple_script.sh```
##### To run it, just type ```./simple_script.sh```
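
The whole create, chmod, run cycle can be tried out in one go. In this sketch the script is written with a here-document instead of an editor, inside a scratch directory invented for the example (`demo_dir` is a made-up name):

```bash
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Write the script without opening an editor; the quoted 'EOF' keeps the text exactly as typed
cat > simple_script.sh <<'EOF'
#!/bin/bash
echo "Hey there" >> new.txt
echo "Hi there" >> new.txt
echo "Whoa there" >> new.txt
EOF

chmod +x simple_script.sh   # add execute permission
ls -l simple_script.sh      # the permissions column should now contain x
./simple_script.sh          # run it
cat new.txt                 # show what the script wrote
```

Without the `chmod +x` step, `./simple_script.sh` would normally fail with a "Permission denied" message.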

### Now let's use it with a cluster: a simple job array
### A common problem is that you have a large number of jobs to run, and they are largely identical in terms of the command to run. For example, you may have 1000 data sets, and you want to run a single program on them using the cluster. The naive solution is to somehow generate 1000 shell scripts and submit them to the queue. This is not efficient, neither for you nor for the head node.


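### We will use SGE_TASK_ID. SGE_TASK_ID is set to a unique number in a range that you define, and is incremented as you define it. First make a new file, here called fasta_files.txt, with all the file names of interest, one per line; in this case we want all files that end with fasta. The script below then picks the line whose number matches its task: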


In [None]:
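#!/bin/bash
#$ -cwd

# save this as new_script.sh
# pick the line of fasta_files.txt whose number matches this task's SGE_TASK_ID
FILE=$(head -n "$SGE_TASK_ID" fasta_files.txt | tail -n 1)

# ${FILE%} removes nothing, so the copy is named <original name>_newoutput.txt
cp "$FILE" "${FILE%}_newoutput.txt"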
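
Outside the cluster SGE_TASK_ID is not set, so the line-picking step can be checked by setting it by hand. This sketch builds fasta_files.txt from made-up empty files in a scratch directory and then pretends to be tasks 1, 2 and 3 in turn. On the cluster the scheduler sets the variable for you:

```bash
# Throwaway directory with made-up input files
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch a.fasta b.fasta c.fasta

# One file name per line; line N belongs to task N
ls *.fasta > fasta_files.txt

# Pretend to be each task in turn (normally the scheduler sets SGE_TASK_ID)
for SGE_TASK_ID in 1 2 3
do
    FILE=$(head -n "$SGE_TASK_ID" fasta_files.txt | tail -n 1)
    echo "task $SGE_TASK_ID -> $FILE"
done
```

`head -n N` keeps the first N lines and `tail -n 1` keeps the last of those, which is line N.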

In [None]:
qsub -t 1-3:1 ./new_script.sh 

### The -t flag of qsub sets the range of SGE_TASK_ID. In this case it covers every number from 1 to 3, incremented by 1. The range can be any set of numbers you define.
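
When every line of the list file should become one task, the end of the range is simply the number of lines in that file. A sketch that works this out and prints the matching qsub command without submitting anything. The scratch list and the variable `n` are made up, and `seq` is only used to show the task numbers such a range would cover:

```bash
# Made-up list of four file names
demo_dir=$(mktemp -d)
printf '%s\n' a.fasta b.fasta c.fasta d.fasta > "$demo_dir/fasta_files.txt"

n=$(wc -l < "$demo_dir/fasta_files.txt")   # number of lines = number of tasks
n=$((n))                                   # drop the padding some versions of wc add

echo "qsub -t 1-${n}:1 ./new_script.sh"    # printed only, not submitted

seq 1 1 "$n"                               # first, step, last: the SGE_TASK_ID values in that range
```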