Building on the previous section of the tutorial, we now turn to a new topic: writing a **script** in Bash.

## **3.1 SCRIPTING IN BASH**
****
A **BASH SCRIPT** is simply a file containing a series of instructions that are executed in order, from first to last. Once the script has been written, you only have to launch it from the command line to execute all the instructions it contains.

It is particularly useful when very repetitive operations must be performed: it is faster than executing the individual commands one by one, and it reduces the risk of making mistakes.

There are several ways to make a bash script.


**Example 1**:

In [None]:
%%bash
vi filename.sh   # opens filename.sh in the vi editor (creating it if it does
                 # not exist), so that you can insert all the instructions
                 # that you want to execute

chmod +x filename.sh    # makes the script executable. If it is not run,
                        # invoking the script from the command line gives
                        # the error
                        # "-bash: ./filename.sh: Permission denied"

./filename.sh       # runs the script: ./ tells the shell to look for the
                    # file in the current directory



**Example 2**:

In [None]:
%%bash
cat > filename.sh    # 'cat >' followed by the filename saves everything
                     # you type into the filename.sh file. The script is
                     # created but is not yet executable.
                     # To finish typing and close the file, press Ctrl+D.
chmod u+x filename.sh   # makes the script executable for its owner
sudo chmod 777 filename.sh  # this command also makes the script executable,
                            # but it grants read, write and execute
                            # permission to every user, and being a sudo
                            # command it will prompt you for the password.
                            # To run the script with ./ its first line
                            # should be ' #!/bin/bash '
bash filename.sh    # 'bash' runs the script, even without execute permission.
                    # Alternatively, you can use ' ./filename.sh '

Another very useful tool for scripting is the **nano** text editor.
It lets you open and edit the script file, moving freely back and forth between lines and commands.

In [None]:
%%bash
nano filename.sh   # to exit the editor once you have made your changes,
                   # press Ctrl+X, then press Y to save the changes or N
                   # to discard them, and finally press Enter. Once you
                   # exit the editor the script is again ready to be executed

The text editor then opens in the terminal with the contents of the file. The first line of a script is dedicated to the interpreter to be used (the interpreter is the program that reads the script commands and executes them), so in this case it must be ` #!/bin/bash `. You can then enter the commands to be executed. When you have finished writing, save the changes to the file with CTRL+O and exit the editor with CTRL+X.
As in the previous cases, you must then make the script executable (` chmod +x filename.sh `) and run it (` ./filename.sh `).
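
As a non-interactive alternative to the editors described above, a script can also be written with a here-document. The following is only a minimal sketch: the file name `hello.sh`, its contents and the temporary working directory are assumptions made for illustration.

```bash
# Work in a temporary directory so that no existing file is overwritten
workdir=$(mktemp -d)
cd "$workdir"

# The quoted delimiter 'EOF' prevents $ from being expanded while the file is written
cat > hello.sh <<'EOF'
#!/bin/bash
name="world"
echo "Hello, $name"
EOF

chmod u+x hello.sh      # make the script executable for the current user
output=$(./hello.sh)    # run it and capture what it prints
echo "$output"
```

Because the whole script is contained in the cell, this form can be convenient where an interactive editor is not available.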


Two very useful structures for writing scripts are **IF STATEMENTS** and **LOOPS**.

**3.2 IF STATEMENTS**
****
The **if** command can be very useful within a script because it allows a series of statements to be executed only if a certain condition is met.

In [None]:
%%bash 
# example of the syntax of an if statement

n=12
if [ "$n" -gt 10 ]                 # the condition to be checked must be 
                                   # placed in brackets [], taking care to 
                                   # respect the spaces
then
echo n is greater than 10          # instruction to be executed if the condition
                                   # is verified

else echo n is not greater than 10 # instruction to be executed if the condition 
                                   # is not verified

fi                                 # it is necessary to remember to close the if 
                                   # with the command 'fi'
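
When more than two outcomes are possible, an `elif` branch can be added between `then` and `else`. The sketch below is an illustration only; the variable `message` and the value of `n` are hypothetical.

```bash
n=10
if [ "$n" -gt 10 ]
then
  message="n is greater than 10"
elif [ "$n" -eq 10 ]        # checked only if the first condition is false
then
  message="n is equal to 10"
else                        # reached only if no condition above is true
  message="n is less than 10"
fi
echo "$message"
```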

The following are some comparison operators that are used to impose the conditions of if statements, but also of while and until loops.

Comparing integers:

In [None]:
if [ "$a" -eq "$b" ]                                  # -eq --> is equal to
if [ "$a" -ne "$b" ]                                  # -ne --> is different from
if [ "$a" -gt "$b" ]  or  if (( "$a" > "$b" ))        # -gt --> is greater than
if [ "$a" -ge "$b" ]  or  if (( "$a" >= "$b" ))       # -ge --> is greater than or equal to
if [ "$a" -lt "$b" ]  or  if (( "$a" < "$b" ))        # -lt --> is smaller than
if [ "$a" -le "$b" ]  or  if (( "$a" <= "$b" ))       # -le --> is smaller than or equal to

Comparing strings:

In [None]:
if [ "$a" = "$b" ]  or   if [ "$a" == "$b" ]       # equal to
if [ "$a" != "$b" ]                                # different from
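
The operators listed above can be tried out with a short sketch like the following. The variable names and values are hypothetical and chosen only for illustration.

```bash
a=7
b=10
s1="ALA"
s2="GLY"

# Integer comparison: -lt compares the two values as numbers
if [ "$a" -lt "$b" ]
then
  int_result="$a is smaller than $b"
else
  int_result="$a is greater than or equal to $b"
fi
echo "$int_result"

# String comparison: != compares the two values as text
if [ "$s1" != "$s2" ]
then
  str_result="$s1 and $s2 are different"
else
  str_result="$s1 and $s2 are equal"
fi
echo "$str_result"
```

Quoting the variables keeps the test well formed even when a variable is empty or contains spaces.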

**3.3 LOOPS**
****
The **FOR** loop can be used to perform repetitive operations, such as parsing all files within a folder. 

In [None]:
%%bash
# for loop syntax example

for file in $(ls)         # what follows the 'for' indicates the elements
                          # for which the instructions contained within
                          # the loop are executed
do
echo "$file"              # after the 'do' go the commands to be executed
                          # on each element of the list (here the name
                          # is simply printed)

done                      # indicates the end of the loop
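
A for loop can also iterate directly over a file-name pattern, which selects only the files of interest. This is a sketch under stated assumptions: the temporary directory and the file names `1ABC.pdb`, `2XYZ.pdb` and `notes.txt` are hypothetical and created only for the demonstration.

```bash
# Create a temporary folder with a few empty example files
workdir=$(mktemp -d)
cd "$workdir"
touch 1ABC.pdb 2XYZ.pdb notes.txt

count=0
for file in *.pdb          # the shell expands *.pdb to the matching file names
do
  echo "Found: $file"
  count=$((count + 1))     # count how many files matched
done
echo "Number of .pdb files: $count"
```

Compared with looping over the output of `ls`, the pattern form is generally more robust when file names contain spaces.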

The **WHILE** loop can be used to repeat a series of instructions as long as a certain condition remains true. The moment the condition defined by the 'while' is no longer true, execution exits the loop and proceeds with the subsequent commands.

In [None]:
%%bash
# while loop syntax example

a=0
while [ "$a" -lt 5 ]      # the 'while' is followed by the condition between [] 
do
((a++))                   # after 'do' you have to insert the instructions that
                          # have to be executed within the loop
                              
done                      # ends the loop
echo $a
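
The **until** loop mentioned earlier works the other way round: it repeats the instructions as long as the condition is false. A minimal sketch, equivalent to the while example above:

```bash
a=0
until [ "$a" -ge 5 ]      # the loop stops as soon as the condition becomes true
do
  a=$((a + 1))            # increment the counter
done
echo "$a"
```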

**3.4 COMMAND *READ* IN BASH ON COLAB**

****
The **read** command, as we have seen, can be used to take input data provided directly by the user and save it in a specific variable. 

Example:

In [None]:
%%bash
echo Insert two numbers: 
read -r a b


If you run this code in your Linux terminal, it works correctly and saves the two numbers you enter in the variables a and b.

When the cell is run on Colab, however, this does not work.
This is because on Colab the bash commands of a cell are executed in a non-interactive subshell, so there is no way to type the input.

To use the **read** command anyway, you must first create a script inside a cell and then run it in a separate cell (see exercise 2).

The same applies if you want to pass parameters directly from the command line: on Colab you must write a script and then run it in a separate cell (**!bash filename.sh parameter_1 parameter_2**) (see exercise 1).

**WARNING**: the syntax of the read command is slightly different in this mode.

Here's how to do it:

In [None]:
%%bash                                         
echo "#!/bin/bash                           
read -p 'insert a number ' var
echo You have inserted \$var " > file.sh

# The first line, #!/bin/bash, indicates the interpreter that runs the script.
# All commands contained in the script must be put between "" and every $
# must be written as \$, so that it is not expanded while the file is created.
# At the end, > file.sh saves the text inside a script file,
# which can be called at any time from another cell.

In [None]:
!bash file.sh

The **!bash filename.sh** command allows you to launch the previously created script.
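
Passing parameters from the command line follows the same two-step pattern. The sketch below is illustrative only: the script name `sum.sh` and the values 3 and 4 are hypothetical, and a here-document is used in place of `echo "..."` so that the `$` signs do not need to be escaped.

```bash
# Work in a temporary directory so that no existing file is overwritten
workdir=$(mktemp -d)
cd "$workdir"

# Step 1: write the script to a file
cat > sum.sh <<'EOF'
#!/bin/bash
# $1 and $2 hold the first and second command-line parameters
echo "The sum of $1 and $2 is $(( $1 + $2 ))"
EOF

# Step 2: run the script, passing the two parameters
result=$(bash sum.sh 3 4)
echo "$result"
```

On Colab the second step would go in a separate cell, for instance as `!bash sum.sh 3 4`.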

**3.5 FILE .PDB**
***
A PDB file describes the three-dimensional structure of a protein contained in the Protein Data Bank. It also contains a whole set of secondary information inserted at the beginning as a header ("HEADER"). This part includes information about the authors of the research that determined the structure of the protein, some experimental observations or even the list of amino acids of which it is composed.

The main as well as the most extensive part, however, remains the list of atoms of which the protein is composed. For each atom a whole range of information is given including:

* the spatial coordinates (x, y, z)
* the name and number of the residue to which it belongs
* the respective chain (useful in the case of proteins with a quaternary structure)
* a temperature factor (describing its vibration)
* the name of the atom, which is very useful for studying specific atoms such as the α-carbon (denoted by CA)
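
The fields listed above can be picked out of an atom line with **awk**. The line used in this sketch is a hypothetical ATOM record written only for illustration; it is not taken from one of the files used in the exercises.

```bash
# Hypothetical ATOM record, laid out like the lines of a .pdb file
line="ATOM     18  CA  ARG A   3     -43.986 -34.605   1.456  1.00 20.00           C"

# awk splits the line on whitespace: $2 = atom number, $3 = atom name,
# $4 = residue name, $7 $8 $9 = x, y, z coordinates
fields=$(echo "$line" | awk '{print $2, $3, $4, $7, $8, $9}')
echo "$fields"
```

Note that PDB files use fixed-width columns, so splitting on whitespace should be treated as a convenience that can fail when two adjacent values touch.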


## **3.6 SCRIPT EXERCISES**
****

The following are some exercises that you can perform to apply what was explained in the first part of the tutorial.

**WARNING**: Before running the exercise cells, run the cell below to temporarily download the necessary files and create the necessary folders on Colab.

In [None]:
#@title **Run this cell in order to load the files needed to perform subsequent exercises**
%%bash
wget https://files.rcsb.org/download/1YZB.pdb
wget https://files.rcsb.org/download/7MCI.pdb
wget https://files.rcsb.org/download/7WN4.pdb
wget https://files.rcsb.org/download/7TZ4.pdb
mkdir -p /content/student


**EXERCISE 1**

Create a script that, when started, receives from the command line the name of a file to take the atoms from, the name of another file to save them to, and the name of a coordinate. The script then sorts the atoms of the first file in descending order of the chosen coordinate and writes them to the second file.

**HINTS:** sort (-k, -nr), $N (for the corresponding data entered at program execution)

In [2]:
#EXERCISE 1
#@title solution ex1
%%bash
echo "#!/bin/bash
case \$3 in      # a case statement (instead of nested ifs) handles
                 # multiple possible choices more cleanly
x)
coordinate=7;;   # the chosen coordinate determines the number
y)               # of the corresponding column
coordinate=8;;
z)
coordinate=9;;
esac

grep ^ATOM \$1 | sort -k\$coordinate -nr > \$2 " > es_1.sh
               # take the atom lines of the chosen file ($1), sort them in
               # descending order and save them to the second chosen file ($2).


In [3]:
!bash es_1.sh 1YZB.pdb test.pdb x

In [None]:
! cat test.pdb    # print the test.pdb file to verify that 
                  # the atoms have been reordered properly

**EXERCISE 2** (script)

Make a script that

* reads all the .pdb files in a folder and prints all their names on the screen.
* receives as input the name of one of them and the name of a residue.
* prints the total number of residues in the file and the number of residues of the chosen type.

**HINTS**: for, ls, read, if


In [5]:
#EXERCISE 2
#@title solution ex2
%%bash
echo "#!/bin/bash
echo The files in the folder are:
for file in \$(ls *.pdb)
do
    echo \$file
done
read -p 'Choose one of the files and type its name: ' filename 
# read -r filename (if you are not working on colab you can just use this command)
read -p 'Choose the residue you want to count and type its abbreviation (all in uppercase): ' res_name
# read -r res_name
for file in \$(ls)
do
  if [ \$file = \$filename ]
  then 
    res_tot=\$(grep ^ATOM \$filename | grep CA | sort | uniq -w 20 | wc -l )
    res_choice=\$(cat \$file| grep \$res_name| grep CA| sort| uniq -w 20| wc -l)
  fi
done
echo There are a total of \$res_tot residues in the \$filename file, \$res_choice of \$res_name " > es_2.sh


#TIPS: alternative code to check if a file is .pdb
#typefile=$(file -b $i) (N.B. $i contains the name of a file in the directory)
#echo $typefile | cut -b 1-17 >typefile.txt
#typefilefinal=$(head -n 1 typefile.txt)
#if [ "$typefilefinal" = "Protein Data Bank" ]

In [None]:
!bash es_2.sh

**EXERCISE 3** (script)

Make a script that

* receives as input the name of a folder.
* reads all the .pdb files in the folder.
* copies into a file called "list_proteins.txt" the name of each file and the title of the protein it contains (the title of the protein is stored within the file).
* counts how many proteins have been written to the file and prints on the screen the words "In the chosen folder there are N proteins and they are: " followed by the list of files with their titles.

**HINTS**: for, ls, if, cat, echo


In [None]:
#EXERCISE 3
#@title solution ex3
%%bash 
>list_proteins.txt              # empty the output file
for file in $(ls)
do
  filename=${file##*/}          # strip any leading path
  extension=${filename##*.}     # keep what follows the last dot
  if [ "$extension" = pdb ]
  then
    title=$(grep ^TITLE "$file")   # the TITLE lines hold the protein title
    echo $file: $title >>list_proteins.txt
  fi
done
proteins_num=$(wc -l < list_proteins.txt)
echo In the folder there are $proteins_num proteins and they are:
cat list_proteins.txt


**EXERCISE 4** (script)

Make a script that allows you to: 

* read all the files in a directory.
* copy all .pdb files to the /content/student directory.

* count the number of HIS residues in each .pdb file and create a hisres.csv file in which each line contains the filename (< filename >) and the number of HIS residues counted (< n_residues >).
* update the minres.csv file with the phrase: "< filename > has the lowest number of histidine residues equal to < rmin >"

**HINTS**: for, ls, if, grep, echo, cd, cat, sort, uniq

Is it possible to avoid reading all the files in the folder, since only the .pdb files are selected afterwards? One option is to read directly only the files with the extension of interest (see the **ls** options).


 

In [None]:
#EXERCISE4
#@title solution ex4
%%bash
rmin=10000 # initialize the minimum with a value larger than any expected count

dir_source=/content
dir_destination=/content/student
cd $dir_destination # move to the directory where the files will be copied
                    # and processed
> hisres.csv        # empty the output file in case the cell has already been run

# for loop to perform the same operations on all .pdb files in the source directory
for file in $(ls $dir_source/*.pdb) # using ls *.pdb you avoid reading all
                                    # files and select, already in the for,
                                    # only the .pdb files
do
  filename=${file##*/}  # strip the path, keeping only the file name
  cp $dir_source/$filename $dir_destination # copies the file to the new directory
  n_residues=$(grep ^ATOM $filename | grep HIS | grep CA | sort | uniq -w 20 | wc -l)
  # selects the ATOM rows, then those with HIS and CA, sorts them, deletes
  # identical rows and counts the residues

  echo $filename , $n_residues >> hisres.csv
  if [ $n_residues -le $rmin ] # checks if the number of residues is less than
                               # or equal to the minimum found so far
  then
    rmin=$n_residues
    echo $filename has the lowest number of histidine residues equal to $rmin > minres.csv
  fi
done

cat hisres.csv # print the contents of the two generated files on the screen
echo
cat minres.csv


**EXERCISE 5** (script)

Make a script that

* reads the files in a folder whose path is provided as input by the user.
* counts the number of ARG residues and saves in the argres.csv file the name of each file with the corresponding number of residues, in order from the file that contains the fewest residues to the one that contains the most.

* saves in a second file, xyz.pdb, the coordinates of the atoms that have a value of z>100. (Each line of the file must contain the number of the atom, its name, the residue and its coordinates.)

Ex. 18 CA ARG -43.986 -34.605 1.456

**HINTS**: read, echo, for, if grep, awk, sort, cat



*Executing this cell will prompt you to enter the path to the folder whose files you want to analyze. In order to proceed with running the program on colab you will need to type **/content**. If you run the code on your linux terminal instead, you will need to enter the full path to the folder on your computer.*

In [9]:
#EXERCISE 5
#@title solution ex5
%%bash
echo "#!/bin/bash
read -p 'Enter the path of the directory you want to access: ' dir_source 
# this command allows you to capture as a variable an input entered by the user

cd \$dir_source

>argres.csv  
>argres.pdb
>xyz.pdb
# these commands are only used for emptying the files in case the script has
# already been run, to avoid mixing old and new data

for file in \$(ls \$dir_source/*.pdb) # in this way the loop reads only the .pdb  
                                      # files in the folder
do
 filename=\${file##*/}
 n_residues=\$(grep ^ATOM \$filename| grep ARG| grep CA|sort| uniq -w 20| wc -l)
 echo \$filename , \$n_residues >>argres.pdb
 grep ^ATOM \$filename | grep ARG | grep CA | sort | uniq -w 20 >arg.pdb
 awk ' \$9>100 {print \$2 , \$3 , \$4 , \$7 , \$8 , \$9} ' arg.pdb >>xyz.pdb 
     # allows selecting only atoms bound to an ARG residue that are
     # in a position with z>100

done

sort -t, -k2 -n argres.pdb > argres.csv # once all files have been read, sorts
                                        # them by number of residues, from the
                                        # fewest to the most

cat argres.csv
echo
cat xyz.pdb " > es_5.sh


In [None]:
!bash es_5.sh

**EXERCISE 6** (script)

Make a script that allows you to:


* Ask the user which folder to open and print on the screen all the files in the folder.
* Ask the user to enter the name of the file to be parsed and the residue to be counted.

* Verify that the file name has been entered in the correct format and count the number of residues of the chosen type.
* Print on the screen " The < filename > contains < n_residues > of < res_name> "

**HINTS**: echo, read, cd, while, grep




*Executing this cell will prompt you to enter the path to the folder whose files you want to analyze. In order to proceed with running the program on colab you will need to type **/content**. If you run the code on your linux terminal instead, you will need to enter the full path to the folder on your computer.*

In [11]:
#@title solution ex6
%%bash
echo "#!/bin/bash
read -p 'Enter the path of the directory you want to access: ' directory 
# acquires a variable entered as input by the user

cd \$directory

read -p 'Enter the name of the file to be parsed: ' filename
read -p 'Enter the name of the residue to be counted, using its three-letter abbreviation: ' res_name

# example of a possible check that should be done
# while loop --> executes the contents of do as long as the condition in [] is true
while [ \${#res_name} -ne 3 ]
# \${#res_name} gives the number of characters in the string res_name
# (the wc command, by contrast, reads its input from a file or a pipe)

do
echo The name of the residue was not entered correctly
read -p ' !!!Enter the name of the residue to be counted, using its three-letter abbreviation: ' res_name
done

n_residues=\$(grep ^ATOM \$filename | grep \$res_name | grep CA | sort | uniq -w 20 | wc -l)
echo In \$filename there are \$n_residues residues of \$res_name " > es_6.sh


In [None]:
!bash es_6.sh