# Reading input files

https://www.gnu.org/software/gawk/manual/html_node/Reading-Files.html#Reading-Files

awk reads all input either from the standard input (by default, this is the keyboard, but often it is a pipe from another command) or from files whose names you specify on the awk command line. If you specify input files, awk reads them in order, processing all the data from one before going on to the next. The name of the current input file can be found in the predefined variable FILENAME (see Built-in Variables).
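
As a minimal sketch of reading several files in order, the cell below creates two throwaway files (the names `first.txt` and `second.txt` are made up for this illustration) and prints FILENAME next to each record:

```shell
# Two small input files in a temporary directory (file names are hypothetical).
dir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$dir/first.txt"
printf 'gamma\n' > "$dir/second.txt"

# awk processes first.txt completely before moving on to second.txt;
# FILENAME names the file the current record came from.
out=$(cd "$dir" && awk '{ print FILENAME ": " $0 }' first.txt second.txt)
printf '%s\n' "$out"

rm -rf "$dir"
```

Both lines of `first.txt` should appear before the single line of `second.txt`, each prefixed with its file name.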

The input is read in units called records, and is processed by the rules of your program one record at a time. By default, each record is one line. Each record is automatically split into chunks called fields. This makes it more convenient for programs to work on the parts of a record.

On rare occasions, you may need to use the getline command. The getline command is valuable both because it can do explicit input from any number of files, and because the files used with it do not have to be named on the awk command line (see Getline).
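
A small sketch of explicit input with getline, assuming a throwaway file called `extra.txt` (a hypothetical name) whose path is handed to awk as the variable `extra` rather than as a command-line input file:

```shell
dir=$(mktemp -d)
printf 'x\ny\n' > "$dir/extra.txt"

# No input file is named on the command line; the path is passed in as a variable
# and read explicitly with getline inside a BEGIN rule.
out=$(awk -v extra="$dir/extra.txt" 'BEGIN {
    while ((getline line < extra) > 0)   # positive while a record was read
        print "read:", line
    close(extra)
}')
printf '%s\n' "$out"

rm -rf "$dir"
```

Comparing the result of getline with `> 0` keeps the loop from spinning forever if the file cannot be opened, since getline then returns a negative value.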

## Record splitting

### Record Splitting with Standard awk

In [4]:
# Default RS is newline
awk '{ print $0 }' data/mail_data.txt
echo --- Same with record numbers
awk '{ print NR,$0 }' data/mail_data.txt


Amelia       555-5553     amelia.zodiacusque@gmail.com    F
Anthony      555-3412     anthony.asserturo@hotmail.com   A
Becky        555-7685     becky.algebrarum@gmail.com      A
Bill         555-1675     bill.drowning@hotmail.com       A
Broderick    555-0542     broderick.aliquotiens@yahoo.com R
Camilla      555-2912     camilla.infusarum@skynet.be     R
Fabius       555-1234     fabius.undevicesimus@ucb.edu    F
Julie        555-6699     julie.perscrutabor@skeeve.com   F
Martin       555-6480     martin.codicibus@hotmail.com    A
Samuel       555-3430     samuel.lanceolis@shu.edu        A
Jean-Paul    555-2127     jeanpaul.campanorum@nyu.edu     R
--- Same with record numbers
1 Amelia       555-5553     amelia.zodiacusque@gmail.com    F
2 Anthony      555-3412     anthony.asserturo@hotmail.com   A
3 Becky        555-7685     becky.algebrarum@gmail.com      A
4 Bill         555-1675     bill.drowning@hotmail.com       A
5 Broderick    555-0542     broderick.aliquotiens@yahoo.com R
6

In [3]:
# Use the lowercase letter u as the record separator
awk 'BEGIN { RS = "u" }
     { print NR,$0 }' data/mail_data.txt
     
# Note that newlines no longer separate records, so they are printed as ordinary characters inside a record

1 Amelia       555-5553     amelia.zodiac
2 sq
3 e@gmail.com    F
Anthony      555-3412     anthony.assert
4 ro@hotmail.com   A
Becky        555-7685     becky.algebrar
5 m@gmail.com      A
Bill         555-1675     bill.drowning@hotmail.com       A
Broderick    555-0542     broderick.aliq
6 otiens@yahoo.com R
Camilla      555-2912     camilla.inf
7 sar
8 m@skynet.be     R
Fabi
9 s       555-1234     fabi
10 s.
11 ndevicesim
12 s@
13 cb.ed
14     F
J
15 lie        555-6699     j
16 lie.perscr
17 tabor@skeeve.com   F
Martin       555-6480     martin.codicib
18 s@hotmail.com    A
Sam
19 el       555-3430     sam
20 el.lanceolis@sh
21 .ed
22         A
Jean-Pa
23 l    555-2127     jeanpa
24 l.campanor
25 m@ny
26 .ed
27      R


In [5]:
# The same can be done on the command line: an assignment placed before the file name takes effect before that file is read
awk '{ print NR,$0 }' RS="u" data/mail_data.txt

1 Amelia       555-5553     amelia.zodiac
2 sq
3 e@gmail.com    F
Anthony      555-3412     anthony.assert
4 ro@hotmail.com   A
Becky        555-7685     becky.algebrar
5 m@gmail.com      A
Bill         555-1675     bill.drowning@hotmail.com       A
Broderick    555-0542     broderick.aliq
6 otiens@yahoo.com R
Camilla      555-2912     camilla.inf
7 sar
8 m@skynet.be     R
Fabi
9 s       555-1234     fabi
10 s.
11 ndevicesim
12 s@
13 cb.ed
14     F
J
15 lie        555-6699     j
16 lie.perscr
17 tabor@skeeve.com   F
Martin       555-6480     martin.codicib
18 s@hotmail.com    A
Sam
19 el       555-3430     sam
20 el.lanceolis@sh
21 .ed
22         A
Jean-Pa
23 l    555-2127     jeanpa
24 l.campanor
25 m@ny
26 .ed
27      R


### Record Splitting with gawk

When using gawk, the value of RS is not limited to a one-character string. It can be any regular expression.

gawk sets the variable RT to the input text that matched RS. When RS is a single character, RT contains that same character. However, when RS is a regular expression, RT contains the actual input text that matched the regular expression.

If the input file ends without any text matching RS, gawk sets RT to the null string.

In [6]:
# Set RS to a regular expression that matches either a newline or a run of one or more uppercase letters,
# with optional leading and/or trailing spaces:

echo record 1 AAAA record 2 BBBB record 3 |
gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
             { print "Record =", $0,"and RT = [" RT "]" }'

Record = record 1 and RT = [ AAAA ]
Record = record 2 and RT = [ BBBB ]
Record = record 3 and RT = [
]
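
The remark that RT is the null string when the input ends without a match for RS can be checked with a short sketch. It assumes gawk is installed (RT is a gawk feature), so the cell is guarded and does nothing useful otherwise:

```shell
if command -v gawk >/dev/null 2>&1; then
    # The input "a,b" has no trailing comma, so the last record ends at end of input.
    out=$(printf 'a,b' | gawk 'BEGIN { RS = "," } { print $0 " RT=[" RT "]" }')
    printf '%s\n' "$out"
else
    echo "gawk is not installed; RT is a gawk extension" >&2
fi
```

The first record should report the comma that ended it, while the second should report an empty RT.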


## Fields

When awk reads an input record, the record is automatically parsed or separated by the awk utility into chunks called fields. By default, fields are separated by whitespace, like words in a line. Whitespace in awk means any string of one or more spaces, TABs, or newlines; other characters that are considered whitespace by other languages (such as formfeed, vertical tab, etc.) are not considered whitespace by awk.

The purpose of fields is to make it more convenient for you to refer to these pieces of the record. You don’t have to use them—you can operate on the whole record if you want—but fields are what make simple awk programs so powerful.

You use a dollar sign $ to refer to a field in an awk program, followed by the number of the field you want. Thus, $1 refers to the first field, $2 to the second, and so on. Unlike in the Unix shells, the field numbers are not limited to single digits. $127 is the 127th field in the record.

NF is a predefined variable whose value is the number of fields in the current record. awk automatically updates the value of NF each time it reads a record. No matter how many fields there are, the last field in a record can be represented by $NF. 

The use of $0, which looks like a reference to the “zeroth” field, is a special case: it represents the whole input record.
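
A quick sketch of NF, $NF, and $0 side by side, using two made-up input lines with different field counts:

```shell
# For each record print: the field count, the first field, the last field,
# and the whole record in brackets.
out=$(printf 'one two three\nfour five\n' |
    awk '{ print NF, $1, $NF, "[" $0 "]" }')
printf '%s\n' "$out"
```

Because NF is updated for every record, $NF picks out `three` on the first line and `five` on the second.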

In [7]:
# Search for li in first field only
awk '$1 ~ /li/ { print $0 }' data/mail_data.txt


Amelia       555-5553     amelia.zodiacusque@gmail.com    F
Julie        555-6699     julie.perscrutabor@skeeve.com   F


In [8]:
# Search for li anywhere in the record (matching is case-sensitive); print the first and last fields
awk '/li/ { print $1, $NF }' data/mail_data.txt

Amelia F
Broderick R
Julie F
Samuel A


In [10]:
# Nonconstant Field Numbers
awk '{ print NR, $NR }' data/mail_data.txt
# For record number i (NR), prints the i-th field; each record has only 4 fields, so the field is empty from record 5 on

1 Amelia
2 555-3412
3 becky.algebrarum@gmail.com
4 A
5 
6 
7 
8 
9 
10 
11 


In [11]:
# Same as print $4
awk '{ print $(2*2) }' data/mail_data.txt

F
A
A
A
R
R
F
F
A
A
R


### Changing the Contents of a Field

The contents of a field, as seen by awk, can be changed within an awk program; this changes what awk perceives as the current input record. The actual input is untouched; awk never modifies the input file.

In [12]:
awk '{ nboxes = $3 ; $3 = $3 - 10; print nboxes, $3 }' data/inventory_shipped.txt

25 15
32 22
24 14
52 42
34 24
42 32
34 24
34 24
55 45
54 44
87 77
35 25
 -10
36 26
58 48
75 65
70 60


In [15]:
# When the value of a field is changed (as perceived by awk), the text of the input record is recalculated 
# to contain the new field where the old one was. In other words, $0 changes to reflect the altered field. 
# Thus, this program prints a copy of the input file, with 10 subtracted from the second field of each line:

cat data/inventory_shipped.txt
echo
echo -------
awk '{ $2 = $2 - 10; print $0 }' data/inventory_shipped.txt

Jan  13  25  15 115
Feb  15  32  24 226
Mar  15  24  34 228
Apr  31  52  63 420
May  16  34  29 208
Jun  31  42  75 492
Jul  24  34  67 436
Aug  15  34  47 316
Sep  13  55  37 277
Oct  29  54  68 525
Nov  20  87  82 577
Dec  17  35  61 401

Jan  21  36  64 620
Feb  26  58  80 652
Mar  24  75  70 495
Apr  21  70  74 514
-------
Jan 3 25 15 115
Feb 5 32 24 226
Mar 5 24 34 228
Apr 21 52 63 420
May 6 34 29 208
Jun 21 42 75 492
Jul 14 34 67 436
Aug 5 34 47 316
Sep 3 55 37 277
Oct 19 54 68 525
Nov 10 87 82 577
Dec 7 35 61 401
 -10
Jan 11 36 64 620
Feb 16 58 80 652
Mar 14 75 70 495
Apr 11 70 74 514


In [21]:
# It is also possible to assign contents to fields that are out of range. For example:

awk '{ $6 = ($5 + $4 + $3)
       print $6 }' data/inventory_shipped.txt

155
282
286
535
271
609
537
397
369
647
746
497
0
720
790
640
658


Creating a new field changes awk’s internal copy of the current input record, which is the value of `$0`. Thus, if you do `print $0` after adding a field, the record printed includes the new field, with the appropriate number of field separators between it and the previously existing fields.

This recomputation affects and is affected by NF (the number of fields; see Fields). For example, the value of NF is set to the number of the highest field you create. The exact format of $0 is also affected by a feature that has not been discussed yet: the output field separator, OFS, used to separate the fields (see Output Separators).

Note, however, that merely referencing an out-of-range field does not change the value of either $0 or NF. Referencing an out-of-range field only produces an empty string. For example:

```
if ($(NF+1) != "")
    print "can't happen"
else
    print "everything is normal"
```
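
The fragment above can be wrapped into a runnable sketch. The variable names `before` and `verdict` are made up for this illustration, and the message text is reworded to avoid an apostrophe inside the single-quoted program:

```shell
# Reference a field one past the end, then confirm NF and $0 are unchanged.
out=$(echo a b c d | awk '{
    before = NF
    if ($(NF+1) != "")
        verdict = "unexpected"
    else
        verdict = "everything is normal"
    print before, NF, verdict, $0
}')
printf '%s\n' "$out"
```

NF should be 4 both before and after the reference, and the record should print exactly as it was read.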

In [23]:
# making an assignment to an existing field changes the value of $0 
# but does not change the value of NF, even when you assign the empty string to a field. For example:
echo a b c d | awk '{ OFS = ":"; $2 = ""
print $0; print NF }' 

# The field is still there; it just has an empty value, delimited by the two colons between ‘a’ and ‘c’. 


a::c:d
4


In [24]:
# This example shows what happens if you create a new field:
echo a b c d | awk '{ OFS = ":"; $2 = ""; $6 = "new"
                    print $0; print NF }'

a::c:d::new
6


In [27]:
# Decrementing NF throws away the values of the fields after the new value of NF and recomputes $0
echo a b c d e f | awk '{ print "NF =", NF; print $0;
                         NF = 3; print $0;print "NF =", NF; }'

NF = 6
a b c d e f
a b c
NF = 3


It is important to remember that, until you assign to a field or to NF, $0 is the full record, exactly as it was read from the input. This includes any leading or trailing whitespace, and the exact whitespace (or other characters) that separates the fields.

It is a common error to try to change the field separators in a record simply by setting FS and OFS, and then expecting a plain `print` or `print $0` to print the modified record.

But this does not work, because nothing was done to change the record itself. Instead, you must force the record to be rebuilt, typically with a statement such as `$1 = $1`: assigning to any field, even with its own value, makes awk recompute $0 using OFS.
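
A minimal sketch of the difference, using a made-up three-field input and `-` as the output field separator:

```shell
# Setting OFS alone leaves $0 exactly as it was read.
unchanged=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { print $0 }')

# Assigning a field to itself forces awk to rebuild $0 with OFS.
rebuilt=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { $1 = $1; print $0 }')

printf '%s\n%s\n' "$unchanged" "$rebuilt"
```

The first command should print `a b c` and the second `a-b-c`.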

## Field splitting

### Default field splitting

The field separator, which is either a single character or a regular expression, controls the way awk splits an input record into fields. awk scans the input record for character sequences that match the separator; the fields themselves are the text between the matches.
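
As a sketch of a regular-expression separator, the following treats a comma or semicolon, together with any spaces after it, as the field separator. The input line is made up for illustration:

```shell
# FS longer than one character is treated as a regular expression:
# here, a comma or semicolon followed by zero or more spaces.
out=$(echo 'red,green;  blue' |
    awk 'BEGIN { FS = "[,;] *" } { print NF; print "[" $3 "]" }')
printf '%s\n' "$out"
```

The text matched by the separator, including the two spaces after the semicolon, is not part of any field, so the third field is just `blue`.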

In [32]:
# Default FS is a single space " ", which splits fields on runs of spaces, TABs, and newlines
echo 'moo    goo gai       pan' | awk '{ print ":" $2 ";" }'

:goo;


In [33]:
# With "oo" as FS, the spaces are no longer separators and become part of the field values
echo 'moo    goo gai       pan' | awk 'BEGIN {FS="oo"}; { print ":" $2 ";" }'

:    g;


In [34]:
# With the default FS, the second field is the second whitespace-separated word
echo 'John Q. Smith, 29 Oak St., Walamazoo, MI 42139' | awk '{ print ":" $2 ";" }'

:Q.;


In [35]:
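# With -F",", the second field is the text between the first and second commas, leading space included
echo 'John Q. Smith, 29 Oak St., Walamazoo, MI 42139' | awk -F"," '{ print ":" $2 ";" }'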

: 29 Oak St.;


## Constant width data


In [40]:
w

 13:27:26 up 7 days,  6:27,  0 users,  load average: 0.11, 0.05, 0.01
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT


In [41]:
# FIELDWIDTHS is a gawk extension: each number is the width, in characters, of one field
w | gawk 'BEGIN  { FIELDWIDTHS = "9 6 10 6 7 7 35" }
NR > 2 {
    idle = $4
    sub(/^ +/, "", idle)   # strip leading spaces
    if (idle == "")
        idle = 0
    if (idle ~ /:/) {      # hh:mm
        split(idle, t, ":")
        idle = t[1] * 60 + t[2]
    }
    if (idle ~ /days/)
        idle *= 24 * 60 * 60

    print $1, $2, idle
}'
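
Since the output of `w` varies between systems, a self-contained sketch with made-up fixed-width data may be easier to follow. It assumes gawk is installed (FIELDWIDTHS is a gawk extension) and uses a hypothetical layout of 6 characters for a name, 3 for a code, and 6 for a city:

```shell
if command -v gawk >/dev/null 2>&1; then
    # Columns 1-6: name, 7-9: code, 10-15: city (sample data invented for this sketch).
    out=$(printf 'Amelia555Boston\nBill  167Austin\n' |
        gawk 'BEGIN { FIELDWIDTHS = "6 3 6" }
              { print "[" $1 "][" $2 "][" $3 "]" }')
    printf '%s\n' "$out"
else
    echo "gawk is not installed; FIELDWIDTHS is a gawk extension" >&2
fi
```

Padding spaces stay inside the field, which is why the `w` example strips leading spaces from the idle column with sub() before using it.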