# Reading and writing files

In this tutorial you will learn about

 * Standard streams and filehandles
 * `open` and `close` to read and write external files
 * Using a `while` loop to read in multiple lines from a file
 * `chomp` to process input lines
 * **Extra:** Error reporting with `die`
 * **Extra:** Processing delimited input like CSV or TSV files with `split`
 * **Extra:** Reading input from `STDIN`
 * **Extra:** Reading a whole file into a string ("slurping")

## Review: STDIN, STDOUT, STDERR

The Internet may or may not be a [series of tubes](https://en.wikipedia.org/wiki/Series_of_tubes), but Unix (and Linux) is like a bunch of pipes. A typical Unix program directs input and output through three [**standard streams**](https://en.wikipedia.org/wiki/Standard_streams):

 * Standard input (`STDIN`)
 * Standard output (`STDOUT`)
 * Standard error (`STDERR`) - Used for error messages and diagnostics

If you're using a text console, `STDOUT` and `STDERR` are typically displayed on the screen.

Of course, not all programs require all three streams. For example, some programs like `ls` take no input, and others like `mv` display no standard output but act directly on the filesystem. And as we are all aware, some programs don't say anything when there's an error....

In the bash shell, you can **pipe** the `STDOUT` from one program directly into the `STDIN` of another, like so:

```
ls | less
```

This redirects the `STDOUT` from `ls` not to the terminal window, but to the paged text viewer `less`.

You can also redirect `STDOUT` to a file:

```
ls > list_of_files
```

`STDERR` can also be redirected, using `2>`. For example, trying to find the manual page for the nonexistent program `foobar` produces an error message, which the following command writes to a file instead of the screen:

```
man foobar 2> error_message
```

To convert the contents of a file into a stream, you can use the `cat` command:

```
cat error_message # contents of file to STDOUT - defaults to display on text terminal
cat error_message | sed 's/entry/page/' # Substitute the word 'page' for 'entry'
```

To pipe output from a program to a file *and* to `STDOUT` at the same time, use the program `tee` (the name comes explicitly from a [plumbing metaphor](https://en.wikipedia.org/wiki/Tee_(command%29)).

Okay, now back to perl...

## Filehandles

So far we've been using the `print` command to write output to the screen. By default, it writes to `STDOUT`.

In [12]:
%%perl

print STDOUT "Hello world!\n"; 
# Notice no comma between STDOUT and "Hello world!\n"

print "Hello world!\n"; # does the same thing

Hello world!
Hello world!


You can also print to **standard error**, `STDERR`, which as the name suggests is typically used for error messages or status alerts by a running program, rather than the actual desired output.

In [5]:
%%perl

print STDERR "ERROR: Goodbye world!\n";

ERROR: Goodbye world!


The `STDOUT` and `STDERR` in all-caps are special arguments telling the `print` function where to send its output. 

## Writing to a file

If you want the output from a perl script to be written to a specific file, you could do this using the standard redirection operator in bash, like so

In [None]:
%%bash
# Illustration only - code will not execute
perl example.pl > example.txt

However, you could also specify a specific output file within the perl script itself. This could be useful for several reasons:

 * More than one output file (vs. only one STDOUT)
 * Filename for output is generated by the perl script itself
 * STDOUT already being used for some other output

To do this, you need to learn how to use `open` and `close`.

In [51]:
%%perl

my $output_file = "perlIO_example.txt"; # Define a variable containing output filename

print STDERR "Opening file for writing\n"; 
# Example of using STDERR for a status message that is not the 
# primary desired output of the script but still informative

open (OUTPUT, ">", $output_file);
print OUTPUT "Hello world\n";
close (OUTPUT);


Opening file for writing


You should now find a new file called `perlIO_example.txt` containing the line "Hello world".

Notice how the `print` command now takes an argument `OUTPUT` instead of `STDOUT` or `STDERR`. This argument is known as the "filehandle". The filehandle was created within the `open` function, and serves as an alias (or "handle") for the output stream that is directed to that particular file. This allows more than one file to be open simultaneously for reading and/or writing.

The style of the `open` command here has three parts - this is the currently recommended form, although you will still see tutorials and code examples using the old two-part style on the Web. We'll stick to the **three-part open** in this course.

The three arguments to `open` are: 

 * **Filehandle** (`OUTPUT`)
  * Used by subsequent `print` commands to send output to the right place
 * **Mode** (`">"`)
  * Similar to the bash redirection operators - specifies whether reading or writing
 * **Filename** (`$output_file`)
  * Either a character string or a variable, giving the filename

Every `open` command should be matched by a corresponding `close` for the same filehandle. 

In [52]:
%%perl

open (OUTPUT, ">", "perlIO_example.txt"); # Filename directly given in open command
print OUTPUT "Hello world\n";
close (OUTPUT);

### Using a variable as a filehandle

An all-caps name like `OUTPUT` is the old-school way of specifying a filehandle in perl. The name is arbitrary, so `OUT` or `THEOUTPUT` or `FOOBAR` would all work just as well. 

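However, it's now recommended to use a variable as a filehandle rather than a traditional all-caps name. A name like `OUTPUT` is global, so it can clash with the same name used elsewhere in a program, whereas a variable declared with `my` is limited to its enclosing block, and the file is closed automatically when the variable goes out of scope. (See [`perldoc open`](http://perldoc.perl.org/functions/open.html))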

In the examples below, `$fh` and `$fh2` serve as filehandles, just like `OUTPUT` in the preceding example.

In [54]:
%%perl

my $output_file = "perlIO_example.txt";

open (my $fh, ">", $output_file); # Define filehandle variable within the open command
print $fh "Hello world\n";
close ($fh);

my $fh2; # Filehandle variable can also be defined before the open command
open ($fh2, ">", $output_file);
print $fh2 "Hello world\n";
close($fh2);
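
Because each filehandle variable is independent, several files can be open at once. The following is a minimal sketch of that idea, not part of the original notebook: the filenames `perlIO_example_even.txt` and `perlIO_example_odd.txt` are made up for illustration, and error checking on `open` is left out to keep it short.

```perl
use strict;
use warnings;

# Hypothetical output filenames, chosen only for this illustration
my $even_file = "perlIO_example_even.txt";
my $odd_file  = "perlIO_example_odd.txt";

# Two filehandle variables, both open for writing at the same time
open(my $even_fh, ">", $even_file);
open(my $odd_fh,  ">", $odd_file);

for my $n (1 .. 10) {
    if ($n % 2 == 0) {
        print $even_fh "$n\n"; # even numbers go to the first file
    }
    else {
        print $odd_fh "$n\n";  # odd numbers go to the second file
    }
}

close($even_fh);
close($odd_fh);
```

After running this, the first file should hold the even numbers from 2 to 10 and the second the odd numbers from 1 to 9, one per line.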

## Appending to a file

In the bash shell, if you use the redirection operator `>`, it will overwrite any existing data in the file. If you want to simply *append* to the end of the file, you use the `>>` operator instead.

The same syntax is used for the mode of `open` in perl: `>` for write, `>>` for append.

If the file doesn't already exist, the append mode will create a new file, just like write mode. 

In [55]:
%%perl

my $output_file = "perlIO_example.txt";

open (my $fh, ">>", $output_file);
print $fh "Hello world again\n";
close ($fh);
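
To see the difference between the two modes side by side, here is a small sketch under assumed names (the file `perlIO_example_append.txt` is hypothetical). It writes one line in write mode, then reopens the file three times in append mode; error checking is omitted for brevity.

```perl
use strict;
use warnings;

# Hypothetical filename, used only for this illustration
my $log_file = "perlIO_example_append.txt";

# Write mode: creates the file, or empties it if it already exists
open(my $fh, ">", $log_file);
print $fh "First line\n";
close($fh);

# Append mode: each new open adds to the end instead of overwriting
for my $n (2 .. 4) {
    open(my $append_fh, ">>", $log_file);
    print $append_fh "Line number $n\n";
    close($append_fh);
}
```

The file should end up with four lines. If the mode inside the loop were `">"` instead, only the last line written would survive.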

## Reading from a file

Reading from a file uses the same `open` and `close` commands, except that - you guessed it - the mode is different.

You might also guess, from the analogy to bash, that to specify read mode, you use the `<` operator.

In the example below, let's try to open the file that we created in the earlier example, and print its contents to the screen (STDOUT).

In [56]:
%%perl

my $input_file = "perlIO_example.txt"; # Opening the file we wrote to earlier

open (my $fh, "<", $input_file);
print STDOUT $fh; 
close($fh);

GLOB(0x7fe481802ee8)

Okay, something's weird - What's this `GLOB` business?

Recall what we said earlier about filehandles. They are like an alias, or a reference, pointing to some location to send output or receive input. 

By trying to print the filehandle `$fh` directly, we are simply seeing the internal name that perl has assigned for this particular input stream. It's not meant for human consumption!

What we want to see is not the reference to this file, but instead the *contents* of the file. That's why we'll need an additional operator, the **readline operator**, which operates on the filehandle and gives us the contents. The readline operator is a pair of angle brackets that surround the filehandle, e.g. `<FILEHANDLE>`.

Perl was designed primarily to work with text files. Text files are usually split across several lines. The readline operator returns one line at a time, as shown in the following example:

In [57]:
%%perl

my $input_file = "perlIO_example.txt";

open (my $fh, "<", $input_file);
my $line = <$fh>; 
print "The first line from file $input_file is:\n";
print $line; # first line

my $line2 = <$fh>;
print "The second line from file $input_file is:\n";
print $line2; # second line
close($fh);

The first line from file perlIO_example.txt is:
Hello world
The second line from file perlIO_example.txt is:
Hello world again


### Reading input with a `while` loop

It's tedious to write something like this over and over again, so if all the lines in a file are to be processed in the same way, you could put this in a loop with `while`. There is a special syntax for this:

In [58]:
%%perl

my $input_file = "perlIO_example.txt";

open (my $fh, "<", $input_file);
my $counter = 0; # A counter for the line numbers

while (<$fh>) { # Start the reading loop
    $counter++; # Update the counter
    
    # The current line is always assigned to a special variable called $_
    my $current_line = $_;
    
    print "Line number $counter from file $input_file:\n";
    print $current_line;
    print $_; # Same thing
}
close($fh);

Line number 1 from file perlIO_example.txt:
Hello world
Hello world
Line number 2 from file perlIO_example.txt:
Hello world again
Hello world again
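
If you would rather not rely on `$_`, the loop can also assign each line to a named variable directly. The sketch below is an illustration under assumptions: the file `perlIO_example_loop.txt` and its three lines are made up, and the file is created first so the example runs on its own (error checking omitted).

```perl
use strict;
use warnings;

# Hypothetical filename; the file is created here so the sketch is self-contained
my $file = "perlIO_example_loop.txt";
open(my $out, ">", $file);
print $out "alpha\nbeta\ngamma\n";
close($out);

my $line_count = 0;
my $char_count = 0;

open(my $in, "<", $file);
while (my $line = <$in>) {        # each line goes into $line instead of $_
    $line_count++;
    $char_count += length($line); # length includes the trailing newline
}
close($in);

print "$file has $line_count lines and $char_count characters\n";
```

This should report 3 lines and 17 characters, since each line's newline is counted too.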


### Stripping end-of-line characters with `chomp`

Notice that the lines that you read in still keep the newline character. This can be troublesome if you need to process the text in some way (as you probably do). For example, if your script is supposed to rearrange the words on every line, then the newline character which is attached to the last word of each line would break up the resulting text in a way that you probably do not want.

The newline character at the end of each line can be stripped with the `chomp` command. Note that on Unix-like systems `chomp` removes only the trailing newline by default; a carriage return (`\r`) left over from Windows-style line endings is not removed.

In [59]:
%%perl

my $input_file = "perlIO_example.txt";

print "Without chomp:\n";
open (my $fh, "<", $input_file);
while (<$fh>) {
    print $_; 
}
close($fh);

print "\n\n";

print "With chomp:\n";
open ($fh, "<", $input_file); # Reuse $fh; a second "my" would redeclare the variable
while (<$fh>) {
    chomp; # This implicitly operates on the special variable $_;
    print $_; 
}
close($fh);

Without chomp:
Hello world
Hello world again


With chomp:
Hello worldHello world again
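
The word-rearranging problem mentioned above can be sketched as follows. This is only an illustration: `reverse_words` is a hypothetical helper, and it uses `split`, `join` and `reverse`, which have not been introduced at this point in the tutorial.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse the order of the words in one line
sub reverse_words {
    my ($line) = @_;
    my @words = split / /, $line;      # split on single spaces
    return join(" ", reverse @words);  # glue the words back together, last first
}

my $raw = "Hello world again\n";

# Without chomp, the newline stays attached to the word "again"
my $without_chomp = reverse_words($raw);

# Copy the line, strip the newline from the copy, then rearrange
chomp(my $clean = $raw);
my $with_chomp = reverse_words($clean);

print "Without chomp: $without_chomp\n";
print "With chomp:    $with_chomp\n";
```

Without `chomp`, the result has a line break in the middle (right after "again"); with `chomp`, it is the single line `again world Hello`.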

## Extra: Error messages when files cannot be opened

What happens when a file can't be opened?

In [60]:
%%perl

my $input_file = "fake_file.txt"; # Filename that doesn't exist

open(my $fh, "<", $input_file); # Open file
# Do something to the file here...
close($fh); # Close file

Notice how the error is silently ignored. 

That's not so helpful when it's part of a larger script. If the script doesn't work, you want to know which part is causing the problem. That's where the `die` command comes in handy. 

The syntax should be self-explanatory.

In [61]:
%%perl

my $input_file = "fake_file.txt";

open(my $fh, "<", $input_file) or die ;
close($fh);

print "The script stops at die so commands after that line are not executed\n";

Died at - line 4.


If the `open` command cannot work, e.g. a missing input file, or trying to write to a folder where the user has no write permissions, then `die` will stop the script and tell you which line of the script caused it.

You can also customize the error message, for example to report the name of the file that can't be opened.

In [40]:
%%perl

my $input_file = "fake_file.txt";

print "Custom error message \n";

open(my $fh, "<", $input_file) or die ("Cannot open file $input_file");
close($fh);

Custom error message 


Cannot open file fake_file.txt at - line 6.


The special variable `$!` can be used with the `die` command to report what went wrong. Inside a string it gives a description of the error; in numeric context (for example `$! + 0`) it gives the error number. Don't use `$?` for this: it holds the exit status of the last child process, not the error from `open`.

In [43]:
%%perl

my $input_file = "fake_file.txt";

print "Custom error message with more details \n";

open(my $fh, "<", $input_file) or die ("Cannot open file $input_file. Error code " . ($! + 0) . ": $!");
close($fh);

Custom error message with more details 


Cannot open file fake_file.txt. Error code 2: No such file or directory at - line 6.
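
As a minimal sketch of how `$!` behaves, the example below captures both its text and its number without stopping the script. It assumes that `fake_file.txt` does not exist; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $input_file = "fake_file.txt"; # assumed not to exist

my ($error_text, $error_number);

if (open(my $fh, "<", $input_file)) {
    close($fh);
    print "Unexpectedly opened $input_file\n";
}
else {
    # Read $! straight after the failed open, before anything else can change it
    $error_text   = "$!";   # string context: human-readable description
    $error_number = $! + 0; # numeric context: the underlying error number
    print STDERR "Cannot open $input_file: $error_text (error number $error_number)\n";
}
```

The exact text and number depend on the operating system and on why the `open` failed.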


## Extra: Processing delimited input with `split`

Tables of data are often stored as delimited files such as the CSV (comma-separated values) or TSV (tab-separated values) formats. Reading and processing such input is a very common task that's easily done in perl with the `split` function, which lets you split up a string into an array by specifying what character is used to delimit the data fields.

In [62]:
%%perl

# Write some data to a file in CSV format

my $csv_file = "perlIO_example_table.csv";

open (my $fh, ">", $csv_file) or die;
print $fh "apple,33,pome\n";
print $fh "orange,23,hesperidium\n";
print $fh "pear,11,pome\n";
print $fh "kiwi,41,berry\n";
print $fh "banana,50,berry\n";
close($fh);

# Now let's try to read it in and report only the values of the first column

open ($fh, "<", $csv_file) or die; # Reuse $fh; a second "my" would redeclare the variable
while (<$fh>) {
    chomp; # Don't forget! Strip the newline character from end of each line
    
    # Recall that the contents of current line are stored in special variable $_
    my $line = $_;
    # Split each line into an array, using the comma as separator
    my @line_split = split ",", $line;
    # Report only the value of the first column in each line
    print $line_split[0]; 
    print "\n";
}
close($fh);

apple
orange
pear
kiwi
banana
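
The same approach extends to working with more than one column. The sketch below reuses the rows from the example above, writes them to a hypothetical file `perlIO_example_sum.csv`, and adds up the second column. The column names `$name`, `$count` and `$type` are assumptions about what the fields mean.

```perl
use strict;
use warnings;

# Hypothetical filename; same rows as in the example above
my $csv_file = "perlIO_example_sum.csv";

open(my $out, ">", $csv_file) or die "Cannot write $csv_file: $!";
print $out "apple,33,pome\n";
print $out "orange,23,hesperidium\n";
print $out "pear,11,pome\n";
print $out "kiwi,41,berry\n";
print $out "banana,50,berry\n";
close($out);

my $total = 0;

open(my $in, "<", $csv_file) or die "Cannot read $csv_file: $!";
while (my $line = <$in>) {
    chomp $line;
    # Assign the three fields to named variables in one step
    my ($name, $count, $type) = split /,/, $line;
    $total += $count;           # running sum of the second column
    print "$name\t$count\n";    # first two columns, tab-separated
}
close($in);

print "Total: $total\n";
```

For these five rows the total should come to 158. Note that a plain `split` on commas does not handle quoted fields that themselves contain commas.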


## Extra: Processing STDIN from keyboard or from a file

You've learned how to process input from a file with `open` and filehandles. But recall the review at the beginning about how Unix pipes work. Could you also write a script that could be combined with other programs at the bash command line?

```
cat somefile | perl somescript.pl > output
```

In the hypothetical example above, `cat` is writing the contents of `somefile` to STDOUT, which is piped with the `|` character to the STDIN of a perl script called `somescript.pl`, and the STDOUT of that program is written to a file called `output`. You could read the file directly using perl with `open`, but let's say you want to read from STDIN. How do you do this?

You might have guessed already: use the `STDIN` stream!

In [3]:
%%perl

# This code won't work in Jupyter notebook because it requires user interaction

print "Please type something:\n";
my $input = <STDIN>;
chomp $input; # Remove the return character at the end of the input

print "You typed the following:\n";
print $input;
print "\n";




If you write a script like the above and run it, it'll say "Please type something" and wait until you type a line and press enter, then tell you what you typed. Note the following three points:

 * This is an example of STDIN from the keyboard, i.e. user interaction, which can be used to make text-based role playing games, or for interactive menus, etc.
 * The angle-brackets readline operator only processes one line at a time, and lines are (by default) delimited by the newline character. So the program knows that you've finished typing the input when you press 'enter'. If you want more than one line of input, you could use a loop (more on that later).
 * Instead of typing the input to STDIN, you could redirect it with a pipe!

An example of a bash command that would do this:

```
echo "my input" | perl thescript.pl 
```

If your input has more than one line (e.g. a multi-line file) and you want to process each line as it comes in through the stream, use a `while` loop like so:

In [4]:
%%perl

# This code won't work in Jupyter notebook because it requires external input

while (<STDIN>) {
    # Remember the $_ special variable holds the current line read from the stream in this loop
    chomp; # Chop off the end-of-line character
    my $line = $_; 
    print "The current line is: $line\n";
}

Translated to English: `while` there is still something coming in through the STDIN input stream, do the following: chomp off the end of line character, assign the current line of input to a new variable called `$line`, and print a message to STDOUT saying what that input was. Rinse and repeat.

What happens if you run this perl code without a pipe to STDIN?

```
perl thescript.pl
```

Then you'll be faced with a silent prompt like before, and `<STDIN>` will accept input from the keyboard. But wait, previously you could tell it when you reached the end of input by pressing 'enter', because the `<>` operator only reads one line at a time. Now that you've nested it in a loop, how do you stop?

You should have asked that question before trying to run the script! Simply type `Ctrl-D`, which sends the End of Transmission (EOT) character; a Unix terminal treats it as the end of input. EOT is an example of a [control character](https://en.wikipedia.org/wiki/Control_character). Control characters date back to the days when the default output of a computer was a printer rather than a screen, and have their origins in telegraphic codes.

## Extra: Reading a whole file into a string

Perl's default is to read files line by line. This is because it was designed for processing text like tables and log files, where each line is an entry.

Sometimes it makes sense to have the entire file in a single string, e.g. to apply a regex to search within the whole file. You can still split it up later into lines using `split` on the newline character `\n`. 

To do this, you'll need to look "under the hood". 

There is a special variable called `$/` (that's the normal slash), the *input record separator*. The default value is `\n` (hence each line represents one record). To "slurp" a file into a single string, you'll want to tell Perl that there is no separator, i.e. let `$/` be undefined. 

However, you only want this to apply to a single file and not for your entire script. To "isolate" this undefining of the input record separator, use the command `local` and fence off the slurping within curly braces.

In [5]:
%%perl

my $file = "demo_argv.pl";
my $contents; # Define variable to hold the contents
{ # Curly braces to contain the local definition of the separator
    open (my $fh, "<", $file) or die ("$!"); # Open file
    local $/ = undef; # Temporarily undefine the separator
    $contents = <$fh>; # Slurp!
    close $fh;
}
print length($contents); # Length of file contents in characters

297
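
Once the file is in a single string, it can be searched as a whole or split back into lines. The following sketch illustrates both, using a hypothetical file `perlIO_example_slurp.txt` with made-up contents that is created first so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical filename and contents, created here for illustration
my $file = "perlIO_example_slurp.txt";
open(my $out, ">", $file) or die "Cannot write $file: $!";
print $out "apple pie\nbanana split\napple tart\n";
close($out);

my $contents;
{
    open(my $fh, "<", $file) or die "Cannot read $file: $!";
    local $/ = undef;   # no record separator inside this block
    $contents = <$fh>;  # the whole file in one read
    close($fh);
}

# Split the string back into lines on the newline character
my @lines = split /\n/, $contents;

# Search the whole string at once: collect every match of "apple"
my @matches     = $contents =~ /apple/g;
my $apple_count = scalar @matches;

print "Lines: ", scalar(@lines), "\n";
print "Occurrences of 'apple': $apple_count\n";
```

For this input the script should report 3 lines and 2 occurrences of "apple".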

## Remove files generated in this tutorial

In [63]:
%%bash

# Matches perlIO_example.txt as well as perlIO_example_table.csv
rm perlIO_example*

# Exercises

Write scripts for the following tasks:

 * Write a table of integers from 1 to 100 and their squares to a new file. (Hint: Use a `for` loop.)
 * Add the integers from 101 to 200 and their squares to the same file that you created in the previous exercise. 
 * Read a Fasta file and report only the header lines. (Hint: Use `if` and regex.)