# Advanced topics

 * Subroutines/functions
 * Calling external programs from Perl
 * Command line arguments 
 * BioPerl

## Subroutines

**Subroutines**, also known as **functions**, become useful once your code starts to get longer and more complex.

Here's a simple example of a program that could be simplified with a subroutine:

In [9]:
%%perl

my $number = 5;

print "The cube of $number is: ";
print $number*$number*$number ."\n";
print "One less than this is: ";
print $number*$number*$number-1 ."\n";
print "One more than this is: ";
print $number*$number*$number+1 ."\n";

The cube of 5 is: 125
One less than this is: 124
One more than this is: 126


It's tedious to type `$number*$number*$number` over and over again, so you could make this into a subroutine:

In [31]:
%%perl

my $number = 5;

print "The cube of $number is: ";
print cube($number) ."\n";
print "One less than this is: ";
print cube($number)-1 ."\n";
print "One more than this is: ";
print cube($number)+1 ."\n";

# Subroutines can be placed anywhere in the script, even after the point where they are called in the main script.
# This makes it easier to keep your code tidy because you can corral all the subs at the very end.
sub cube {
    my ($in) = @_;
    my $output = $in*$in*$in;
    return $output;
}

The cube of 5 is: 125
One less than this is: 124
One more than this is: 126


### Benefits of using subroutines

Leaving aside the details of `sub` for now, let's take a big picture look at why using subroutines, or **modularizing** your code, is so useful:

 * Saves on unnecessary typing and repetition - fewer errors can creep in.
 * Updating code is easier - if you decide to square instead of cube, you only have to change it in one place instead of three.
 * For more complex examples of code, you'll have to keep track of fewer variable names at a time, because variables declared with `my` inside a subroutine exist only within that subroutine (their **scope**).
 * Code is easier to understand - if you choose subroutine names carefully, you can have meaningful semantic abstraction. Calling the subroutine `cube` is more meaningful than `slartibartfast` and helps you remember what it's for.

### Syntax of `sub`

A subroutine is like a mathematical function: it takes some *input*, performs some operation on it, and returns an *output*. 

Let's look at the example above line by line:

```
sub cube {
```
Use `sub` to declare a new subroutine called `cube`. The code for the subroutine is enclosed in curly braces `{` and `}`.

```
    my ($in) = @_;
```
The input to a subroutine is placed in a special variable called `@_`. Recall the `@ARGV` special variable that's used to take command line arguments. You can imagine that `@_` is behaving like a miniature `@ARGV`, but instead of acting for the entire script, it is only valid within that subroutine. If you have more than one subroutine, each one has its own `@_`. 

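The above line copies the contents of `@_` into the list `($in)`, so `$in` receives the first argument. Why the parentheses around `$in`? Remember that `@_` is an array. Without the parentheses, `my $in = @_;` would evaluate `@_` in scalar context and give you the *number* of arguments rather than the first argument itself. Alternatives: `my $in = $_[0];` (take the first element directly) or `my @in = @_;` (copy all the arguments into a new array).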
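
To see what the parentheses change, here is a minimal sketch. The subroutine names `with_parens` and `without_parens` are made up for this illustration and are not part of the example above.

```perl
use strict;
use warnings;

# List assignment: $in receives the first element of @_
sub with_parens {
    my ($in) = @_;
    return $in;
}

# Scalar assignment: $in receives the NUMBER of elements in @_
sub without_parens {
    my $in = @_;
    return $in;
}

print with_parens(7, 8, 9), "\n";     # prints 7
print without_parens(7, 8, 9), "\n";  # prints 3
```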

```
    my $output = $in*$in*$in;
```
Perform the operation on the input, and assign the results to a new variable called `$output`. 

Notice the `my` keyword here. The variable `$output` exists only within this subroutine. Outside of this subroutine, in the main code body, you can have another variable called `$output`, which would not be affected. Likewise, if you try to use this `$output` outside of the subroutine, Perl will not find it: with `use strict` the script fails to compile, and with only `use warnings` you get a warning about an uninitialized value. This is what is meant when we say that `$output` is **lexically scoped** to this subroutine.

You might recall seeing something similar with variables that are used in loops or if-else conditions.
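
The following sketch illustrates the scoping described above, assuming a second variable called `$output` in the main body (added here only for the demonstration):

```perl
use strict;
use warnings;

my $output = "main";    # belongs to the main body of the script

sub cube {
    my ($in) = @_;
    my $output = $in * $in * $in;   # a separate variable, visible only inside cube
    return $output;
}

print cube(3), "\n";    # prints 27
print "$output\n";      # prints main, because the outer variable is untouched
```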

```
    return $output;
}
```
Return the variable `$output` as the output of this subroutine.

Don't forget the curly brace at the end!

### Subroutines without input arguments

Not all subroutines require an input. They don't *have* to return an output either: the subroutine below is called only for its side effect of printing a newline.

In [33]:
%%perl

my $number = 5;

print $number*$number;
newline();
print $number;
newline();

sub newline {
    print "\n";
}

25
5


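Within a subroutine, `print` and other actions on standard streams still work. Even if a subroutine doesn't take any inputs, you should still put parentheses after its name when you call it. Here `newline()` works but a bare `newline` doesn't, because the subroutine is defined after the point where it is called, so Perl does not yet know that `newline` is the name of a subroutine.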

### Subroutines with more than one input argument

The input arguments are passed to the subroutine as a list in the special array `@_`. Here's an example of a script with two subroutines, each taking two arguments.

In [36]:
%%perl

use warnings; # Needed to see the warning messages shown below

my $number1 = 5;
my $number2 = 4;

print "Sum of $number1 and $number2: ";
print sum($number1,$number2);
print "\n";

print "Sum of 1 and $number1: ";
print sum(1,$number1);
print "\n";

print "Difference of $number1 and $number2: ";
print diff($number1,$number2);
print "\n";

sub sum {
    my ($in1,$in2) = @_; 
    return $in1 + $in2;
}

sub diff {
    my ($in1,$in2) = @_; 
    # We re-use the variable names $in1 and $in2 but it doesn't matter
    # because they remain inside the lexical scope of their respective
    # subroutines
    return $in1 - $in2;
}

print $in1; # Triggers warnings because $in1 is being used outside of its
            # lexical scope (with "use strict" this would not even compile)

Sum of 5 and 4: 9
Sum of 1 and 5: 6
Difference of 5 and 4: 1


Name "main::in1" used only once: possible typo at - line 32.
Use of uninitialized value $in1 in print at - line 32.
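
Because `@_` is an ordinary array, a subroutine does not have to know in advance how many arguments it will get. A minimal sketch, using a hypothetical subroutine called `sum_all` that is not part of the script above:

```perl
use strict;
use warnings;

# Hypothetical subroutine that adds up however many numbers it is given
sub sum_all {
    my @numbers = @_;    # copy all the arguments into a lexical array
    my $total = 0;
    foreach my $n (@numbers) {
        $total += $n;
    }
    return $total;
}

print sum_all(5, 4), "\n";        # prints 9
print sum_all(1, 2, 3, 4), "\n";  # prints 10
```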


## Calling external programs from a Perl script

Perl is a popular choice as a scripting language - like duct tape to bind together other programs that may be written in other languages. You probably have already been using bash scripts for this purpose. Now let's see how to do it in Perl!

### Method 1 - Backticks

If you want to capture the output of a program or script, enclose the command in backtick characters: `` ` ``

This will capture the STDOUT of the program, which you can put in a variable.

The following example captures the output of the `ls` command:

In [7]:
%%perl

my $ls_output = `ls`;
print $ls_output;

Advanced_topics.ipynb
Reading_writing_files.ipynb
Sorting.ipynb
Why_perl.ipynb
demo_argv.pl
planning.md


The output from `ls` spans several lines, but has been assigned to a single string. We can separate the lines into the elements of an array by using the `split` function:

In [8]:
%%perl

my $ls_output = `ls`;
my @ls_split = split "\n", $ls_output; # Split using the newline character \n
print join "\t", @ls_split; # Print the results separated now by tab character 

Advanced_topics.ipynb	Reading_writing_files.ipynb	Sorting.ipynb	Why_perl.ipynb	demo_argv.pl	planning.md

However, STDERR is not captured and continues to be displayed:

In [13]:
%%perl

my $ls_output = `ls banana`;

ls: banana: No such file or directory


In [16]:
%%perl

# To capture STDERR but discard STDOUT:

my $ls_output = `ls banana 2>&1 1>/dev/null`;
print $ls_output;

ls: banana: No such file or directory
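
Two related variations, sketched here with `echo` so that the output does not depend on which files are in your directory: backticks used in list context return one element per line, and `2>&1` on its own merges STDERR into the captured output. This assumes a Unix-like shell is available.

```perl
use strict;
use warnings;

# In list context, backticks return one element per line of output
my @lines = `echo first; echo second`;
chomp @lines;    # strip the trailing newline from every element
print scalar(@lines), " lines captured: @lines\n";  # prints 2 lines captured: first second

# Appending 2>&1 merges STDERR into STDOUT, so both streams are captured
my $both = `(echo to_stdout; echo to_stderr 1>&2) 2>&1`;
print $both;
```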


### Method 2 - `system`

If you want to check whether a program has executed successfully or not, use the `system` function. 

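This returns the **exit status** of the program, which is zero when successful, and non-zero otherwise. Note that the value returned by `system` is the status as reported by the operating system, in which the program's own exit code is stored in the upper bits. Shift it right by eight bits (`$status >> 8`) to recover the exit code. This is why a failing `ls` that exits with code 1 shows up as 256 in the example below.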

In [20]:
%%perl

my $ls_status1 = system("ls");

my $ls_status2 = system("ls banana");

print "\n\n";
print "Status of the first command: $ls_status1";
print "\n";
print "Status of the second command: $ls_status2";

Advanced_topics.ipynb
Reading_writing_files.ipynb
Sorting.ipynb
Why_perl.ipynb
demo_argv.pl
planning.md


Status of the first command: 0
Status of the second command: 256

ls: banana: No such file or directory


 * The output of the command continues to STDOUT and/or STDERR as before
 * If you care about the output you could include a redirection with `>` in the command
 * The Perl script will wait for the command to finish and return its exit status before continuing with the rest of the script
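
As a sketch of how the number returned by `system` relates to the program's own exit code, the following runs a shell command that deliberately exits with code 3 (an arbitrary value picked for this illustration). It assumes a Unix-like system where `sh` is available.

```perl
use strict;
use warnings;

# Run a command that deliberately exits with code 3
# (passing a list to system runs the program directly, without an extra shell)
my $status = system("sh", "-c", "exit 3");

print "Raw return value: $status\n";       # prints 768, i.e. 3 * 256
print "Exit code: ", $status >> 8, "\n";   # prints 3

# The same raw value is also stored in the special variable $?
print "Exit code from \$?: ", $? >> 8, "\n";   # prints 3
```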

## Command line arguments

How does a program know where to get its input from? Here are some options:

 * Input is hard-coded into the program
 * Read from a file 
 * Input interactively by user from keyboard
 * Piped from STDIN from another program
 * Command-line arguments

We've already covered examples of the first four in this course (for no. 2 to 4, see the section on *Reading and Writing Files*). 

### What are command line arguments?

You've seen them before but perhaps didn't know that there was a special name for them: they are the things that you type after the name of a program at the command line.

For example, if you want to list files in a directory you use `ls`. If you also want to show hidden files, you use `ls -a`. The switch `-a` is an example of a **command line argument**.

Another example: You want to see the manual page for the program `date`, so you type `man date`. The text string `date` is a command line argument for the program `man`.

Another example: You want to display the contents of a file with `cat`, so you type `cat somefile`. The filename `somefile` is an argument for `cat`.

A more complicated example: You want to compress a bunch of files into a tar-gzip archive. You type `tar -czf archive.tar.gz my_files*`. `-czf`, `archive.tar.gz` and `my_files*` are all command-line arguments, but the designer of the program has decided to use a special syntax, which mixes option switches (`-czf`) with filename arguments (`archive.tar.gz`). The wildcard `my_files*` is normally expanded by the shell into a list of matching filenames before `tar` even sees it.

### Command line arguments in Perl

Perl scripts can accept command line arguments. Anything that you type after the name of the script will be put into a special array called `@ARGV`. 

Suppose you have the following script in a file called `demo_argv.pl`:

In [30]:
%%perl

# No arguments are supplied when this runs inside the Jupyter notebook, so nothing is printed for them and Perl warns about uninitialized values

use strict;
use warnings;

my @input_arguments = @ARGV; # Read command line arguments

print "Your first command line argument was: ";
print $input_arguments[0]; # Print the first one
print "\n";

print "One more time now: ";
print $ARGV[0]; # does the same thing
print "\n";

Your first command line argument was: 
One more time now: 


Use of uninitialized value in print at - line 10.
Use of uninitialized value in print at - line 14.


If you run the following bash script, you'll see how this works:

In [23]:
%%bash 

perl demo_argv.pl arg1 arg2 arg3

Your first command line argument was: arg1
One more time now: arg1
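
To look at every argument rather than just the first, you can loop over the array. A minimal sketch, in which the helper name `list_arguments` is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper that builds a numbered report of the arguments it is given
sub list_arguments {
    my @args = @_;
    my $report = "Number of arguments: " . scalar(@args) . "\n";
    for my $i (0 .. $#args) {    # $#args is the index of the last element
        $report .= "Argument $i: $args[$i]\n";
    }
    return $report;
}

print list_arguments(@ARGV);
```

If this were saved as, say, `list_args.pl` (a made-up filename) and run as `perl list_args.pl arg1 arg2 arg3`, it should report three arguments, numbered from 0.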


### Checking for missing command line arguments

Let's say you want to prompt the user and exit gracefully when required arguments are not supplied, e.g. if the user simply types `perl demo_argv.pl` without arguments. 

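You can check whether `@ARGV` contains any elements using an if-else condition. An array used as a condition evaluates to its number of elements, so an empty `@ARGV` counts as false.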

In [24]:
%%perl

use strict;
use warnings;

if (@ARGV) { # If arguments are supplied
    print "Your first command line argument was: ";
    print $ARGV[0]; # Print the first one
    print "\n";
} else { # Otherwise, if no arguments are supplied
    print STDERR "At least one command line argument is required!\n";
}

At least one command line argument is required!


In [27]:
%%perl

# Alternative form that requires less indentation

use strict;
use warnings;

if (!@ARGV) {
    print STDERR "At least one command line argument is required!\n";
    exit;
}

print "Your first command line argument was: ";
print $ARGV[0]; # Print the first one
print "\n";

At least one command line argument is required!


### Things to note

 * Bear in mind that the special variable `@ARGV` is always an array, even if the user only supplies one argument! 
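 * Arguments are split on whitespace by the shell before your script sees them. Commas, dashes and other characters are not treated as separators, so an argument that itself contains spaces has to be quoted on the command line.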
 * Arguments aren't necessarily just filenames. They are simply text strings that your script has to process later. It's up to you to tell the user what they should be!
 * If you want fancy stuff, with switches like Unix programs (e.g. the `tar` example shown above), you should use the modules [`Getopt::Std`](http://perldoc.perl.org/Getopt/Std.html) and/or [`Getopt::Long`](http://perldoc.perl.org/Getopt/Long.html) (I recommend the latter).
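
As a rough sketch of what `Getopt::Long` looks like in use: the option names `--operation` and `--verbose` and the helper `parse_options` are invented for this illustration, while `GetOptionsFromArray` is a function provided by the module. It parses an array you pass in, instead of `@ARGV` directly, which makes it easier to experiment with.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: option names are chosen only for illustration
sub parse_options {
    my @args = @_;
    my %opt = (operation => 'add', verbose => 0);    # default values
    GetOptionsFromArray(
        \@args,
        'operation=s' => \$opt{operation},   # --operation takes a string value
        'verbose'     => \$opt{verbose},     # --verbose is an on/off switch
    ) or die "Could not parse the command line options\n";
    $opt{rest} = [@args];    # whatever is left once the options are removed
    return %opt;
}

my %opt = parse_options(@ARGV);
print "Operation: $opt{operation}\n";
print "Remaining arguments: @{ $opt{rest} }\n";
```

Called with `--operation multiply 3 4`, this sketch should set the operation to `multiply` and leave `3` and `4` as the remaining arguments.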

### Exercises

Write scripts that...

 * Take two integers as command line arguments and report their sum
 * Take two integers and also let the user specify a third argument that decides whether to add or multiply them
 * Bonus points: In the above scripts, give an error message if the inputs are not integers (hint: use regex), or if there are too few / too many command line arguments (hint: check length of `@ARGV`)

## BioPerl

If you're interested in using Perl to directly manipulate sequences and other bioinformatic data, e.g. converting between different formats, the [BioPerl](http://bioperl.org) collection of modules is probably the next step, once you're familiar with basic Perl.

BioPerl is designed differently from basic Perl, because it uses an object-oriented approach. This is beyond the scope of this tutorial, but the best place to start is probably the [Beginner's How-to](http://bioperl.org/howtos/Beginners_HOWTO.html) on the BioPerl website.
