GNU awk

Date: 2017-10-28

$ awk --version | head -n1
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)

Prerequisites and notes

familiarity with programming concepts like variables, printing, control structures, arrays, etc

familiarity with regular expressions; if not, check out the ERE portion of GNU sed regular expressions, which is close enough to the features available in gawk

this tutorial is primarily focused on short programs that are easily usable from the command line, similar to using grep, sed, etc

see the Gawk: Effective AWK Programming manual for complete reference; it has information on other awk versions as well as notes on the POSIX standard

Field processing

Default field separation

$0 contains the entire input record; default input record separator is newline character

$1 contains the first field text; default input field separator is one or more of continuous space, tab or newline characters

$2 contains the second field text and so on

$(2+3) result of expressions can be used, this one evaluates to $5 and hence gives fifth field; similarly, if variable i has value 2, then $(i+3) will give fifth field. See also gawk manual - Expressions

NF is a built-in variable which contains number of fields in the current record; so, $NF will give last field, $(NF-1) will give second last field and so on

$ cat fruits.txt
fruit   qty
apple   42
banana  31
fig     90
guava   6

$ # print only first field
$ awk '{print $1}' fruits.txt
fruit
apple
banana
fig
guava

$ # print only second field
$ awk '{print $2}' fruits.txt
qty
42
31
90
6
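The NF based expressions described above are not exercised by the fruits.txt example, so here is a minimal sketch. The input line and the shell variable names (line, last, second_last, fifth) are made up for illustration.

```shell
# hypothetical one-line input with five fields
line='one two three four five'

# $NF gives the last field
last=$(printf '%s\n' "$line" | awk '{print $NF}')
# $(NF-1) gives the second last field
second_last=$(printf '%s\n' "$line" | awk '{print $(NF-1)}')
# with i=2, $(i+3) is the same as $5
fifth=$(printf '%s\n' "$line" | awk -v i=2 '{print $(i+3)}')

printf '%s\n' "$last" "$second_last" "$fifth"
```

With this input, the three values printed should be five, four and five.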

Specifying different input field separator

by using -F command line option

by setting FS variable

See FPAT and FIELDWIDTHS section for other ways of defining input fields

$ # second field where input field separator is :
$ echo 'foo:123:bar:789' | awk -F: '{print $2}'
123

$ # last field
$ echo 'foo:123:bar:789' | awk -F: '{print $NF}'
789

$ # first and last field
$ # note the use of , and space between output fields
$ echo 'foo:123:bar:789' | awk -F: '{print $1, $NF}'
foo 789

$ # second last field
$ echo 'foo:123:bar:789' | awk -F: '{print $(NF-1)}'
bar

$ # use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | awk -F';' '{print $3}'
three

Regular expressions based input field separator

$ echo 'Sample123string54with908numbers' | awk -F'[0-9]+' '{print $2}'
string

$ # first field will be empty as there is nothing before '{'
$ echo '{foo}   bar=baz' | awk -F'[{}= ]+' '{print $1}'

$ echo '{foo}   bar=baz' | awk -F'[{}= ]+' '{print $2}'
foo
$ echo '{foo}   bar=baz' | awk -F'[{}= ]+' '{print $3}'
bar

default input field separator is one or more of continuous space, tab or newline characters (will be termed as whitespace here on)

exact same behavior if FS is assigned single space character

in addition, leading and trailing whitespace won't be considered when splitting the input record

$ printf ' a    ate b\tc   \n'
 a    ate b	c   
$ printf ' a    ate b\tc   \n' | awk '{print $1}'
a
$ printf ' a    ate b\tc   \n' | awk '{print NF}'
4

$ # same behavior if FS is assigned to single space character
$ printf ' a    ate b\tc   \n' | awk -F' ' '{print $1}'
a
$ printf ' a    ate b\tc   \n' | awk -F' ' '{print NF}'
4

$ # for anything else, leading/trailing whitespaces will be considered
$ printf ' a    ate b\tc   \n' | awk -F'[ \t]+' '{print $2}'
a
$ printf ' a    ate b\tc   \n' | awk -F'[ \t]+' '{print NF}'
6

assigning empty string to FS will split the input record character wise

note the use of command line option -v to set FS

$ echo 'apple' | awk -v FS= '{print $1}'
a
$ echo 'apple' | awk -v FS= '{print $2}'
p
$ echo 'apple' | awk -v FS= '{print $NF}'
e

$ # detecting multibyte characters depends on locale
$ printf 'hi👍 how are you?' | awk -v FS= '{print $3}'
👍

Specifying different output field separator

by setting OFS variable

also gets added between every argument to print statement; use printf to avoid this

default is single space

$ # statements inside BEGIN are executed before processing any input text
$ echo 'foo:123:bar:789' | awk 'BEGIN{FS=OFS=":"} {print $1, $NF}'
foo:789
$ # can also be set using command line option -v
$ echo 'foo:123:bar:789' | awk -F: -v OFS=':' '{print $1, $NF}'
foo:789

$ # changing a field will re-build contents of $0
$ echo ' a      ate b  ' | awk '{$2 = "foo"; print $0}' | cat -A
a foo b$

$ # $1=$1 is an idiomatic way to re-build when there is nothing else to change
$ echo 'foo:123:bar:789' | awk -F: -v OFS='-' '{print $0}'
foo:123:bar:789
$ echo 'foo:123:bar:789' | awk -F: -v OFS='-' '{$1=$1; print $0}'
foo-123-bar-789

$ # OFS is used to separate different arguments given to print
$ echo 'foo:123:bar:789' | awk -F: -v OFS='\t' '{print $1, $3}'
foo     bar

$ echo 'Sample123string54with908numbers' | awk -F'[0-9]+' '{$1=$1; print $0}'
Sample string with numbers
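As a further sketch of the re-build behavior, the snippet below converts unevenly spaced input to comma separated output. The input string and the variable names (input, unchanged, rebuilt) are assumptions for illustration only.

```shell
# hypothetical input: whitespace separated columns with uneven spacing
input='  apple   42  red '

# no field is modified, so $0 is printed as is and OFS plays no role
unchanged=$(printf '%s\n' "$input" | awk -v OFS=, '{print $0}')

# assigning to a field ($1=$1) re-builds $0 using OFS
rebuilt=$(printf '%s\n' "$input" | awk -v OFS=, '{$1=$1; print $0}')

printf '[%s]\n' "$unchanged" "$rebuilt"
```

The first value keeps the original spacing, while the second should come out as apple,42,red since leading and trailing whitespace is dropped with the default FS.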

Filtering

Idiomatic print usage

print statement with no arguments will print contents of $0

if condition is specified without corresponding statements, contents of $0 is printed if condition evaluates to true

1 is typically used to represent always true condition and thus print contents of $0

$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # displaying contents of input file(s) similar to 'cat' command
$ # equivalent to using awk '{print $0}' and awk '1'
$ awk '{print}' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

Field comparison

Each block of statements within {} can be prefixed by an optional condition so that those statements will execute only if condition evaluates to true

Condition specified without corresponding statements will lead to printing contents of $0 if condition evaluates to true

$ # if first field exactly matches the string 'apple'
$ awk '$1=="apple"{print $2}' fruits.txt
42

$ # print first field if second field > 35
$ # NR>1 to avoid the header line
$ # NR built-in variable contains record number
$ awk 'NR>1 && $2>35{print $1}' fruits.txt
apple
fig

$ # print header and lines with qty < 35
$ awk 'NR==1 || $2<35' fruits.txt
fruit   qty
banana  31
guava   6

If the above examples are too confusing, think of it as syntactic sugar

Statements are grouped within {}

inside {}, we have an if control structure

Like C language, braces are not needed for single statements within if, but consider that {} is used for clarity

From this explicit syntax, remove the outer {}, if and () used for if

As we'll see later, this allows mashing up a few lines of program compactly on the command line itself

Of course, for medium to large programs, it is better to put the code in a separate file. See awk scripts section



$ # awk '$1=="apple"{print $2}' fruits.txt
$ awk '{
         if($1 == "apple"){
            print $2
         }
       }' fruits.txt
42

$ # awk 'NR==1 || $2<35' fruits.txt
$ awk '{
         if(NR==1 || $2<35){
            print $0
         }
       }' fruits.txt
fruit   qty
banana  31
guava   6

Regular expressions based filtering

the REGEXP is specified within // and by default acts upon $0

See also stackoverflow - lines around matching regexp

$ # all lines containing the string 'are'
$ # same as: grep 'are' poem.txt
$ awk '/are/' poem.txt
Roses are red,
Violets are blue,
And so are you.

$ # negating REGEXP, same as: grep -v 'are' poem.txt
$ awk '!/are/' poem.txt
Sugar is sweet,

$ # same as: grep 'are' poem.txt | grep -v 'so'
$ awk '/are/ && !/so/' poem.txt
Roses are red,
Violets are blue,

$ # lines starting with 'a' or 'b'
$ awk '/^[ab]/' fruits.txt
apple   42
banana  31

$ # print last field of all lines containing 'are'
$ awk '/are/{print $NF}' poem.txt
red,
blue,
you.

strings can be used as well, which will be interpreted as REGEXP if necessary

this allows using shell variables instead of hardcoded REGEXP; see the Using shell variables section, which also notes the difference between using // and string



$ awk '$0 !~ "are"' poem.txt
Sugar is sweet,

$ awk '$0 ~ "^[ab]"' fruits.txt
apple   42
banana  31

$ # also helpful if search strings have the / delimiter character
$ cat paths.txt
/foo/a/report.log
/foo/y/power.log

$ awk '/\/foo\/a\//' paths.txt
/foo/a/report.log
$ awk '$0 ~ "/foo/a/"' paths.txt
/foo/a/report.log

REGEXP matching against specific field

$ # if first field contains 'a'
$ awk '$1 ~ /a/' fruits.txt
apple   42
banana  31
guava   6

$ # if first field contains 'a' and qty > 20
$ awk '$1 ~ /a/ && $2 > 20' fruits.txt
apple   42
banana  31

$ # if first field does NOT contain 'a'
$ awk '$1 !~ /a/' fruits.txt
fruit   qty
fig     90

Fixed string matching

to search a string literally, index function can be used instead of REGEXP; similar to grep -F

the function returns the starting position and 0 if no match found

$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # no output since '+' is meta character, would need '/a\+b/'
$ awk '/a+b/' eqns.txt
$ # same as: grep -F 'a+b' eqns.txt
$ awk 'index($0,"a+b")' eqns.txt
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # much easier than '/i\*\(t\+9-g\)/'
$ awk 'index($0,"i*(t+9-g)")' eqns.txt
i*(t+9-g)/8,4-a+b

$ # check only last field
$ awk -F, 'index($NF,"a+b")' eqns.txt
i*(t+9-g)/8,4-a+b
$ # index not needed if entire field/line is being compared
$ awk -F, '$1=="a+b"' eqns.txt
a+b,pi=3.14,5e12

return value is useful to match at specific position

for ex: at start/end of line

$ # start of line
$ awk 'index($0,"a+b")==1' eqns.txt
a+b,pi=3.14,5e12

$ # end of line
$ # length function returns number of characters, by default acts on $0
$ awk 'index($0,"a+b")==length()-length("a+b")+1' eqns.txt
i*(t+9-g)/8,4-a+b

$ # to avoid repetitions, save the search string in variable
$ awk -v s="a+b" 'index($0,s)==length()-length(s)+1' eqns.txt
i*(t+9-g)/8,4-a+b

Line number based filtering

Built-in variable NR contains total records read so far

Use FNR if you need line numbers separately for multiple file processing

$ # same as: head -n2 poem.txt | tail -n1
$ awk 'NR==2' poem.txt
Violets are blue,

$ # print 2nd and 4th line
$ awk 'NR==2 || NR==4' poem.txt
Violets are blue,
And so are you.

$ # same as: tail -n1 poem.txt
$ # statements inside END are executed after processing all input text
$ awk 'END{print}' poem.txt
And so are you.

$ awk 'NR==4{print $2}' fruits.txt
90

for large input, use exit to avoid unnecessary record processing

$ seq 14323 14563435 | awk 'NR==234{print; exit}'
14556

$ # sample time comparison
$ time seq 14323 14563435 | awk 'NR==234{print; exit}'
14556

real    0m0.004s
user    0m0.004s
sys     0m0.000s
$ time seq 14323 14563435 | awk 'NR==234{print}'
14556

real    0m2.167s
user    0m2.280s
sys     0m0.092s

Case Insensitive filtering

$ # same as: grep -i 'rose' poem.txt
$ awk -v IGNORECASE=1 '/rose/' poem.txt
Roses are red,

$ # for small enough set, can also use REGEXP character class
$ awk '/[rR]ose/' poem.txt
Roses are red,

$ # another way is to use built-in string function 'tolower'
$ awk 'tolower($0) ~ /rose/' poem.txt
Roses are red,

Changing record separators

RS to change input record separator

default is newline character

$ s='this is a sample string'

$ # space as input record separator, printing all records
$ printf "$s" | awk -v RS=' ' '{print NR, $0}'
1 this
2 is
3 a
4 sample
5 string

$ # print all records containing 'a'
$ printf "$s" | awk -v RS=' ' '/a/'
a
sample

ORS to change output record separator

gets added to every print statement; use printf to avoid this

default is newline character

$ seq 3 | awk '{print $0}'
1
2
3

$ # note that there is empty line after last record
$ seq 3 | awk -v ORS='\n\n' '{print $0}'
1

2

3

$ # dynamically changing ORS
$ # can also use: seq 6 | awk '{ORS = NR%2 ? " " : RS} 1'
$ seq 6 | awk '{ORS = NR%2 ? " " : "\n"} 1'
1 2
3 4
5 6
$ seq 6 | awk '{ORS = NR%3 ? "-" : "\n"} 1'
1-2-3
4-5-6

Paragraph mode

When RS is set to empty string, one or more consecutive empty lines is used as input record separator

Can also use regular expression RS="\n\n+" but there are subtle differences, see gawk manual - multiline records. Important points from that link quoted below

However, there is an important difference between ‘RS = ""’ and ‘RS = "\n\n+"’. In the first case, leading newlines in the input data file are ignored, and if a file ends without extra blank lines after the last record, the final newline is removed from the record. In the second case, this special processing is not done

Now that the input is separated into records, the second step is to separate the fields in the records. One way to do this is to divide each of the lines into fields in the normal manner. This happens by default as the result of a special feature. When RS is set to the empty string and FS is set to a single character, the newline character always acts as a field separator. This is in addition to whatever field separations result from FS

When FS is the null string ("") or a regexp, this special feature of RS does not apply. It does apply to the default field separator of a single space: ‘FS = " "’

Consider the below sample file

$ cat sample.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he

Filtering paragraphs

$ # print all paragraphs containing 'it'
$ # if extra newline at end is undesirable, can use
$ # awk -v RS= '/it/{print c++ ? "\n" $0 : $0}' sample.txt
$ awk -v RS= -v ORS='\n\n' '/it/' sample.txt
Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

$ # based on number of lines in each paragraph
$ awk -F'\n' -v RS= -v ORS='\n\n' 'NF==1' sample.txt
Hello World

$ awk -F'\n' -v RS= -v ORS='\n\n' 'NF==2 && /do/' sample.txt
Just do-it
Believe it

Much ado about nothing
He he he


Re-structuring paragraphs

$ # default FS is one or more of continuous space, tab or newline characters
$ # default OFS is single space
$ # so, $1=$1 will change it uniformly to single space between fields
$ awk -v RS= '{$1=$1} 1' sample.txt
Hello World
Good day How are you
Just do-it Believe it
Today is sunny Not a bit funny No doubt you like it too
Much ado about nothing He he he

$ # a better usecase
$ awk 'BEGIN{FS="\n"; OFS=". "; RS=""; ORS="\n\n"} {$1=$1} 1' sample.txt
Hello World

Good day. How are you

Just do-it. Believe it

Today is sunny. Not a bit funny. No doubt you like it too

Much ado about nothing. He he he


Multicharacter RS

Some marker like Error or Warning etc

$ cat report.log
blah blah
Error: something went wrong
more blah
whatever
Error: something surely went wrong
some text
some more text
blah blah blah

$ awk -v RS='Error:' 'END{print NR-1}' report.log
2
$ awk -v RS='Error:' 'NR==1' report.log
blah blah

$ # filter 'Error:' block matching particular string
$ # to preserve formatting, use: '/whatever/{print RS $0}'
$ awk -v RS='Error:' '/whatever/' report.log
 something went wrong
more blah
whatever

$ # blocks with more than 3 lines
$ # splitting string with 3 newlines will yield 4 fields
$ awk -F'\n' -v RS='Error:' 'NF>4{print RS $0}' report.log
Error: something surely went wrong
some text
some more text
blah blah blah


Regular expression based RS

the RT variable will contain string matched by RS

Note that entire input is treated as single string, so ^ and $ anchors will apply only once - not every line

$ s='Sample123string54with908numbers'
$ printf "$s" | awk -v RS='[0-9]+' 'NR==1'
Sample

$ # note the relationship between record and separators
$ printf "$s" | awk -v RS='[0-9]+' '{print NR " : " $0 " - " RT}'
1 : Sample - 123
2 : string - 54
3 : with - 908
4 : numbers - 

$ # need to be careful of empty records
$ printf '123string54with908' | awk -v RS='[0-9]+' '{print NR " : " $0}'
1 : 
2 : string
3 : with

$ # and newline at end of input
$ printf '123string54with908\n' | awk -v RS='[0-9]+' '{print NR " : " $0}'
1 : 
2 : string
3 : with
4 : 


Joining lines based on specific end of line condition

$ cat msg.txt
Hello there.
It will rain to-
day. Have a safe
and pleasant jou-
rney.

$ # join lines ending with - to next line
$ # by manipulating RS and ORS
$ awk -v RS='-\n' -v ORS= '1' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.

$ # by manipulating ORS alone, sub function covered in later sections
$ awk '{ORS = sub(/-$/,"") ? "" : "\n"} 1' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.

$ # easier: perl -pe 's/-\n//' msg.txt as newline is still part of input line

processing null terminated input

$ printf 'foo\0bar\0' | cat -A
foo^@bar^@$ printf 'foo\0bar\0' | awk -v RS='\0' '{print}'
foo
bar

Substitute functions

Use sub string function for replacing first occurrence

Use gsub for replacing all occurrences

By default, $0 which contains input record is modified, can specify any other field or variable as needed

$ # replacing first occurrence
$ echo '1-2-3-4-5' | awk '{sub("-", ":")} 1'
1:2-3-4-5

$ # replacing all occurrences
$ echo '1-2-3-4-5' | awk '{gsub("-", ":")} 1'
1:2:3:4:5

$ # return value for sub/gsub is number of replacements made
$ echo '1-2-3-4-5' | awk '{n=gsub("-", ":"); print n} 1'
4
1:2:3:4:5

$ # // format is better suited to specify search REGEXP
$ echo '1-2-3-4-5' | awk '{gsub(/[^-]+/, "abc")} 1'
abc-abc-abc-abc-abc

$ # replacing all occurrences only for third field
$ echo 'one;two;three;four' | awk -F';' '{gsub("e", "E", $3)} 1'
one two thrEE four

Use gensub to return the modified string unlike sub or gsub which modifies inplace

it also supports back-references and ability to modify specific match

acts upon $0 if target is not specified

$ # replace second occurrence
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(":", "-", 2)} 1'
foo:123-bar:baz
$ # use REGEXP as needed
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", 2)} 1'
foo:XYZ:bar:baz

$ # or print the returned string directly
$ echo 'foo:123:bar:baz' | awk '{print gensub(":", "-", 2)}'
foo:123-bar:baz

$ # replace third occurrence
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", 3)} 1'
foo:123:XYZ:baz

$ # replace all occurrences, similar to gsub
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", "g")} 1'
XYZ:XYZ:XYZ:XYZ

$ # target other than $0
$ echo 'foo:123:bar:baz' | awk -F: -v OFS=: '{$1=gensub(/o/, "b", 2, $1)} 1'
fob:123:bar:baz

back-reference examples

use \" within double-quotes to represent " character in replacement string

use \\1 to represent \1 - the first captured group and so on

& or \0 will back-reference entire matched string

$ # replacing last occurrence without knowing how many occurrences are there
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/(.*):/, "\\1-", 1)} 1'
foo:123:bar-baz
$ echo 'foo and bar and baz land good' | awk '{$0=gensub(/(.*)and/, "\\1XYZ", 1)} 1'
foo and bar and baz lXYZ good
$ # use word boundaries as necessary
$ echo 'foo and bar and baz land good' | awk '{$0=gensub(/(.*)\<and\>/, "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good

$ # replacing last but one
$ echo '456:foo:123:bar:789:baz' | awk '{$0=gensub(/(.*):(.*:)/, "\\1-\\2", 1)} 1'
456:foo:123:bar-789:baz

$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\"&\"", "g")} 1'
"foo":"123":"bar":"baz"

saving quotes in variables - to avoid escaping double quotes or having to use octal code for single quotes

$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\047&\047", "g")} 1'
'foo':'123':'bar':'baz'
$ echo 'foo:123:bar:baz' | awk -v sq="'" '{$0=gensub(/[^:]+/, sq"&"sq, "g")} 1'
'foo':'123':'bar':'baz'

$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\"&\"", "g")} 1'
"foo":"123":"bar":"baz"
$ echo 'foo:123:bar:baz' | awk -v dq='"' '{$0=gensub(/[^:]+/, dq"&"dq, "g")} 1'
"foo":"123":"bar":"baz"

Inplace file editing

Use this option with caution, preferably after testing that the awk code is working as intended

$ cat greeting.txt
Hi there
Have a nice day

$ awk -i inplace '{gsub("e", "E")} 1' greeting.txt
$ cat greeting.txt
Hi thErE
HavE a nicE day

Multiple input files are treated individually and changes are written back to respective files

$ cat f1
I ate 3 apples
$ cat f2
I bought two bananas and 3 mangoes

$ awk -i inplace '{gsub("3", "three")} 1' f1 f2
$ cat f1
I ate three apples
$ cat f2
I bought two bananas and three mangoes

to create backups of original file, set INPLACE_SUFFIX variable

$ awk -i inplace -v INPLACE_SUFFIX='.bkp' '{gsub("three", "3")} 1' f1
$ cat f1
I ate 3 apples
$ cat f1.bkp
I ate three apples

See gawk manual - Enabling In-Place File Editing for implementation details

Using shell variables

when awk code is part of shell program and shell variable needs to be passed as input to awk code

for example:

command line argument passed to shell script, which is in turn passed on to awk

control structures in shell script calling awk with different search strings

See also stackoverflow - How do I use shell variables in an awk script?

$ # examples tested with bash shell

$ f='apple'
$ awk -v word="$f" '$1==word' fruits.txt
apple   42
$ f='fig'
$ awk -v word="$f" '$1==word' fruits.txt
fig     90

$ q='20'
$ awk -v threshold="$q" 'NR==1 || $2>threshold' fruits.txt
fruit   qty
apple   42
banana  31
fig     90

accessing shell environment variables

$ # existing environment variable
$ awk 'BEGIN{print ENVIRON["PWD"]}'
/home/learnbyexample
$ awk 'BEGIN{print ENVIRON["SHELL"]}'
/bin/bash

$ # defined along with awk code
$ word='hello world' awk 'BEGIN{print ENVIRON["word"]}'
hello world

$ # using ENVIRON also prevents awk's interpretation of escape sequences
$ s='a\n=c'
$ foo="$s" awk 'BEGIN{print ENVIRON["foo"]}'
a\n=c
$ awk -v foo="$s" 'BEGIN{print foo}'
a
=c

passing REGEXP

See also gawk manual - Using Dynamic Regexps

$ s='are'
$ # for: awk '!/are/' poem.txt
$ awk -v s="$s" '$0 !~ s' poem.txt
Sugar is sweet,
$ # for: awk '/are/ && !/so/' poem.txt
$ awk -v s="$s" '$0 ~ s && !/so/' poem.txt
Roses are red,
Violets are blue,

$ r='[^-]+'
$ echo '1-2-3-4-5' | awk -v r="$r" '{gsub(r, "abc")} 1'
abc-abc-abc-abc-abc

$ # escape sequence has to be doubled when string is interpreted as REGEXP
$ s='foo and bar and baz land good'
$ echo "$s" | awk '{$0=gensub("(.*)\\<and\\>", "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good
$ # hence passing as variable should be
$ r='(.*)\\<and\\>'
$ echo "$s" | awk -v r="$r" '{$0=gensub(r, "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good

$ # or use ENVIRON
$ r='(.*)\<and\>'
$ echo "$s" | r="$r" awk '{$0=gensub(ENVIRON["r"], "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good

Multiple file input

Example to show difference between NR and FNR

$ # NR for overall record number
$ awk 'NR==1' poem.txt greeting.txt
Roses are red,

$ # FNR for individual file's record number
$ # same as: head -q -n1 poem.txt greeting.txt
$ awk 'FNR==1' poem.txt greeting.txt
Roses are red,
Hi thErE
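To see both counters side by side, here is a small sketch that prints NR and FNR for every record. The two temporary files and the names (dir, out) are made up for this illustration.

```shell
# two small hypothetical files created only for this sketch
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/f1"
printf 'c\nd\ne\n' > "$dir/f2"

# NR keeps counting across files, FNR restarts at 1 for every file
out=$(awk '{print NR, FNR, $0}' "$dir/f1" "$dir/f2")
printf '%s\n' "$out"

rm -r "$dir"
```

For the third record overall (first line of the second file), the line printed should be 3 1 c.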

Constructs to do some processing before starting each file as well as at the end

BEGINFILE - to add code to be executed before start of each input file

ENDFILE - to add code to be executed after processing each input file

FILENAME - file name of current input file being processed

$ # similar to: tail -n1 poem.txt greeting.txt
$ awk 'BEGINFILE{print "file: "FILENAME} ENDFILE{print $0"\n------"}' poem.txt greeting.txt
file: poem.txt
And so are you.
------
file: greeting.txt
HavE a nicE day
------

And of course, there can be usual awk code

$ awk 'BEGINFILE{print "file: "FILENAME} FNR==1; ENDFILE{print "------"}' poem.txt greeting.txt
file: poem.txt
Roses are red,
------
file: greeting.txt
Hi thErE
------

$ awk 'BEGINFILE{c++; print "file: "FILENAME} FNR==2; END{print "\nTotal input files: "c}' poem.txt greeting.txt
file: poem.txt
Violets are blue,
file: greeting.txt
HavE a nicE day

Total input files: 2

Control Structures

Syntax is similar to C language and single statements inside control structures don't require to be grouped within {}

See gawk manual - Control Statements for details

Remember that by default there is a loop that goes over all input records and constructs like BEGIN and END fall outside that loop

$ cat nums.txt
42
-2
10101
-3.14
-75
$ awk '{sum += $1} END{print sum}' nums.txt
10062.9

$ # uninitialized variables will have empty string
$ printf '' | awk '{sum += $1} END{print sum}'

$ # so either add '0' or use unary '+' operator to convert to number
$ printf '' | awk '{sum += $1} END{print +sum}'
0
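The same care is needed for counters that may never get incremented. A minimal sketch, assuming input that has no line matching the hypothetical pattern xyz:

```shell
# count lines containing 'xyz'; the sample input has no such line,
# so the counter c is never assigned a value
count_raw=$(printf 'foo\nbar\n' | awk '/xyz/{c++} END{print c}')

# adding 0 forces numeric context
count_num=$(printf 'foo\nbar\n' | awk '/xyz/{c++} END{print c+0}')

printf '[%s] [%s]\n' "$count_raw" "$count_num"
```

The first value is an empty string and the second is 0, which matters if the result is consumed by another program expecting a number.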

if-else and loops

We have already seen simple if examples in Filtering section

See also gawk manual - Switch

$ # same as: sed -n '/are/ s/so/SO/p' poem.txt
$ # remember that sub/gsub returns number of substitutions made
$ awk '/are/{if(sub("so", "SO")) print}' poem.txt
And SO are you.
$ # of course, can also use
$ awk '/are/ && sub("so", "SO")' poem.txt
And SO are you.

$ # if-else example
$ awk 'NR>1{if($2>40) $0="+"$0; else $0="-"$0} 1' fruits.txt
fruit   qty
+apple   42
-banana  31
+fig     90
-guava   6

conditional operator

See also stackoverflow - finding min and max value of a column

$ cat nums.txt
42
-2
10101
-3.14
-75

$ # changing -ve to +ve and vice versa
$ # same as: awk '{if($0 ~ /^-/) sub(/^-/,""); else sub(/^/,"-")} 1' nums.txt
$ awk '{$0 ~ /^-/ ? sub(/^-/,"") : sub(/^/,"-")} 1' nums.txt
-42
2
-10101
3.14
75
$ # can also use: awk '!sub(/^-/,""){sub(/^/,"-")} 1' nums.txt

for loop

similar to C language, break and continue statements are also available

See also stackoverflow - find missing numbers from sequential list

$ awk 'BEGIN{for(i=2; i<11; i+=2) print i}'
2
4
6
8
10

$ # looping each field
$ s='scat:cat:no cat:abdicate:cater'
$ echo "$s" | awk -F: -v OFS=: '{for(i=1;i<=NF;i++) if($i=="cat") $i="CAT"} 1'
scat:CAT:no cat:abdicate:cater

$ # can also use sub function
$ echo "$s" | awk -F: -v OFS=: '{for(i=1;i<=NF;i++) sub(/^cat$/,"CAT",$i)} 1'
scat:CAT:no cat:abdicate:cater

while loop

do-while is also available

$ awk 'BEGIN{i=2; while(i<11){print i; i+=2}}'
2
4
6
8
10

$ # recursive substitution
$ # here again return value of sub/gsub is useful
$ echo 'titillate' | awk '{while( gsub(/til/, "") ) print}'
tilate
ate

next and nextfile

next will skip rest of statements and start processing next line of current file being processed

there is a loop by default which goes over all input records, next is applicable for that

it is similar to continue statement within loops

it is often used in Two file processing

$ # here 'next' is used to skip processing header line
$ awk 'NR==1{print; next} /a.*a/{$0="*"$0} /[eiou]/{$0="-"$0} 1' fruits.txt
fruit   qty
-apple   42
*banana  31
-fig     90
-*guava   6

nextfile is useful to skip remaining lines from current file being processed and move on to next file

$ # same as: head -q -n1 poem.txt greeting.txt fruits.txt
$ awk 'FNR>1{nextfile} 1' poem.txt greeting.txt fruits.txt
Roses are red,
Hi thErE
fruit   qty

$ # specific field
$ awk 'FNR>2{nextfile} {print $1}' poem.txt greeting.txt fruits.txt
Roses
Violets
Hi
HavE
fruit
apple

$ # similar to 'grep -il'
$ awk -v IGNORECASE=1 '/red/{print FILENAME; nextfile}' *
colors_1.txt
colors_2.txt
poem.txt
$ awk -v IGNORECASE=1 '$1 ~ /red/{print FILENAME; nextfile}' *
colors_1.txt
colors_2.txt

Multiline processing

Processing consecutive lines

$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # match two consecutive lines
$ awk 'p~/are/ && /is/{print p ORS $0} {p=$0}' poem.txt
Violets are blue,
Sugar is sweet,
$ # if only the second line is needed
$ awk 'p~/are/ && /is/; {p=$0}' poem.txt
Sugar is sweet,

$ # match three consecutive lines
$ awk 'p2~/red/ && p1~/blue/ && /is/{print p2} {p2=p1; p1=$0}' poem.txt
Roses are red,

$ # common mistake
$ sed -n '/are/{N;/is/p}' poem.txt

$ # would need something like this and not practical to extend for other cases
$ sed '$!N; /are.*\n.*is/p; D' poem.txt
Violets are blue,
Sugar is sweet,

Consider this sample input file

$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz

extracting lines around matching line

See also stackoverflow - lines around matching regexp

how n && n-- works:

need to note that right hand side of && is processed only if left hand side is true

so for example, if initially n=2, then we get

2 && 2; n=1 - evaluates to true

1 && 1; n=0 - evaluates to true

0 && - evaluates to false ... no decrementing n and hence will be false until n is re-assigned non-zero value

$ # similar to: grep --no-group-separator -A1 'BEGIN' range.txt
$ awk '/BEGIN/{n=2} n && n--' range.txt
BEGIN
1234
BEGIN
a

$ # only print the line after matching line
$ # can also use: awk '/BEGIN/{n=1; next} n && n--' range.txt
$ awk 'n && n--; /BEGIN/{n=1}' range.txt
1234
a

$ # generic case: print nth line after match
$ awk 'n && !--n; /BEGIN/{n=3}' range.txt
END
c

$ # print second line prior to matched line
$ awk '/END/{print p2} {p2=p1; p1=$0}' range.txt
1234
b

$ # save all lines in an array for generic case
$ awk '/END/{print a[NR-3]} {a[NR]=$0}' range.txt
BEGIN
a

$ # or use the reversing trick
$ tac range.txt | awk 'n && !--n; /END/{n=3}' | tac
BEGIN
a
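The n && n-- walkthrough can be traced with a small sketch that prints the value of n before and after the condition for every line. The four-line input (with x as the marker line) and the names before, printed and trace are hypothetical.

```shell
# hypothetical input: 'x' is the line that starts the countdown
trace=$(printf 'x\na\nb\nc\n' | awk '
    /x/ { n = 2 }
    {
        before = n
        # n-- is evaluated only when n is non-zero
        printed = (n && n--) ? "yes" : "no"
        print $0, "n_before=" before, "printed=" printed, "n_after=" n
    }')
printf '%s\n' "$trace"
```

For lines x and a the condition is true and n is decremented; for b and c it is false and n stays at 0.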

Checking if multiple strings are present at least once in entire input file

If there are lots of strings to check, use arrays

$ # can also use BEGINFILE instead of FNR==1
$ awk 'FNR==1{s1=s2=0} /is/{s1=1} /are/{s2=1} s1&&s2{print FILENAME; nextfile}' *
poem.txt
sample.txt

$ awk 'FNR==1{s1=s2=0} /foo/{s1=1} /report/{s2=1} s1&&s2{print FILENAME; nextfile}' *
paths.txt
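One possible way to use arrays when there are many strings to check is sketched below. It is only an illustration: the helper check_all, the awk variable words and the temporary sample file are made up, a single input file is assumed, and fixed string matching via index is used instead of REGEXP.

```shell
# hypothetical sample file created only for this sketch
tmp=$(mktemp)
printf 'Roses are red,\nSugar is sweet,\nAnd so are you.\n' > "$tmp"

# $1: space separated list of fixed strings, $2: file to check
check_all() {
    awk -v words="$1" '
        BEGIN { total = split(words, w, " ") }
        {
            for (i = 1; i <= total; i++)
                # count a string only the first time it is seen
                if (!(w[i] in seen) && index($0, w[i])) { seen[w[i]]; found++ }
        }
        found == total { print "all found"; done = 1; exit }
        END { if (!done) print "some missing" }
    ' "$2"
}

r1=$(check_all 'are is so' "$tmp")
r2=$(check_all 'are is blue' "$tmp")
printf '%s\n' "$r1" "$r2"
rm -f "$tmp"
```

The first call should report all found and the second some missing, since blue does not occur in the sample file.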

Two file processing

We'll use awk's associative arrays (key-value pairs) here

key can be number or string

See also gawk manual - Arrays

Unlike comm the input files need not be sorted and comparison can be done based on certain field(s) as well

Comparing whole lines

Consider the following test files

$ cat colors_1.txt
Blue
Brown
Purple
Red
Teal
Yellow

$ cat colors_2.txt
Black
Blue
Green
Red
White

common lines and lines unique to one of the files

For two files as input, NR==FNR will be true only when the first file is being processed

Using next will skip the rest of the code while the first file is processed

a[$0] will create unique keys (here the entire line content is used as key) in array a. Just referencing a key creates it if it doesn't already exist, with an empty string as value (which also acts as zero in numeric context)

$0 in a will be true if the key already exists in array a
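To see why NR==FNR singles out the first file, the two counters can be printed side by side. A minimal sketch, assuming two small throwaway files (first.txt and second.txt are made-up names, created in a temporary directory):

```shell
tmp=$(mktemp -d)
printf '%s\n' Blue Red > "$tmp/first.txt"
printf '%s\n' Black Blue > "$tmp/second.txt"
# NR keeps counting across files, FNR restarts at 1 for every file
counters=$(cd "$tmp" && awk '{print NR, FNR, (NR==FNR ? "first" : "second"), $0}' first.txt second.txt)
printf '%s\n' "$counters"
rm -r "$tmp"
```

One consequence worth keeping in mind: if the first file is empty, NR==FNR is also true while the second file is read, so the idiom quietly treats the second file as the first.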

$ # common lines
$ # same as: grep -Fxf colors_1.txt colors_2.txt
$ awk 'NR==FNR{a[$0]; next} $0 in a' colors_1.txt colors_2.txt
Blue
Red

$ # lines from colors_2.txt not present in colors_1.txt
$ # same as: grep -vFxf colors_1.txt colors_2.txt
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_1.txt colors_2.txt
Black
Green
White

$ # reversing the order of input files gives
$ # lines from colors_1.txt not present in colors_2.txt
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_2.txt colors_1.txt
Brown
Purple
Teal
Yellow

Comparing specific fields

Consider the sample input file

$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67

single field

For example, to compare only the first field, use $1 instead of $0 as the key

$ cat list1
ECE
CSE

$ # extract only lines matching first field specified in list1
$ awk 'NR==FNR{a[$1]; next} $1 in a' list1 marks.txt
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67

$ # if header is needed as well
$ awk 'NR==FNR{a[$1]; next} FNR==1 || $1 in a' list1 marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67

multiple fields

Create a string by adding some character between the fields to act as the key. This avoids, for example, the two field values abc and 123 matching two other field values ab and c123: with _ as the added character, the key would be abc_123 for the first case and ab_c123 for the second. This can still lead to a false match if the input data itself contains _

There is also a built-in way to do this, see gawk manual - Multidimensional Arrays



$ cat list2
EEE Moi
CSE Amy
ECE Raj

$ # extract only lines matching both fields specified in list2
$ awk 'NR==FNR{a[$1"_"$2]; next} $1"_"$2 in a' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67

$ # uses SUBSEP as separator, whose default value is non-printing character \034
$ awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67
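The false match that a separator guards against can be reproduced with the abc/123 and ab/c123 values from the explanation above. This is a sketch under assumptions: keys.txt and data.txt are made-up file names created in a temporary directory.

```shell
tmp=$(mktemp -d)
printf 'abc 123\n' > "$tmp/keys.txt"
printf 'ab c123\n' > "$tmp/data.txt"
# plain concatenation: both pairs collapse to the same key abc123
plain=$(awk 'NR==FNR{a[$1 $2]; next} ($1 $2) in a' "$tmp/keys.txt" "$tmp/data.txt")
# SUBSEP-joined key keeps the field boundary, so nothing matches
subsep=$(awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' "$tmp/keys.txt" "$tmp/data.txt")
printf 'plain: [%s]\nsubsep: [%s]\n' "$plain" "$subsep"
rm -r "$tmp"
```

The first command reports a line that should not match, while the second prints nothing, which is the expected result for this data.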

field and value comparison

$ cat list3
ECE 70
EEE 65
CSE 80

$ # extract line matching Dept and minimum marks specified in list3
$ awk 'NR==FNR{d[$1]; m[$1]=$2; next} $1 in d && $3 >= m[$1]' list3 marks.txt
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
ECE     Om      92

getline

If the entire line (instead of fields) from one file is needed to change the other file, using getline would be faster

But use it with caution. See gawk manual - getline for details, especially about corner cases, errors, etc. See gawk manual - Closing Input and Output Redirections if you have to start from the beginning of the file again



$ # replace mth line in poem.txt with nth line from nums.txt
$ awk -v m=3 -v n=2 'BEGIN{while(n-- > 0) getline s < "nums.txt"} FNR==m{$0=s} 1' poem.txt
Roses are red,
Violets are blue,
-2
And so are you.

$ # without getline, but slower due to NR==FNR check for every line processed
$ awk -v m=3 -v n=2 'NR==FNR{if(FNR==n){s=$0; nextfile} next} FNR==m{$0=s} 1' nums.txt poem.txt
Roses are red,
Violets are blue,
-2
And so are you.

Another use case is if two files are to be processed exactly for same line numbers

$ # print line from fruits.txt if corresponding line from nums.txt is +ve number
$ awk -v file='nums.txt' '{getline num < file; if(num>0) print}' fruits.txt
fruit   qty
banana  31

$ # without getline, but has to save entire file in array
$ awk 'NR==FNR{n[FNR]=$0; next} n[FNR]>0' nums.txt fruits.txt
fruit   qty
banana  31
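Regarding the caution about getline: with the `getline var < file` form, the result is 1 when a record was read, 0 at end of file and -1 on error, so checking it guards against a missing or short file. The sketch below assumes a made-up three-line nums.txt created in a temporary directory, and the variable names are only illustrative.

```shell
tmp=$(mktemp -d)
printf '%s\n' 3.14 -2 42 > "$tmp/nums.txt"
# stop reading as soon as getline no longer returns 1
second=$(awk -v file="$tmp/nums.txt" -v n=2 'BEGIN{
    while ((getline line < file) > 0)
        if (++c == n) { print line; break }
    close(file)
}')
# a file that cannot be opened gives -1 instead of aborting the program
missing=$(awk -v file="$tmp/no_such_file" 'BEGIN{status = (getline line < file); print status}')
printf 'second line: %s\nstatus for missing file: %s\n' "$second" "$missing"
rm -r "$tmp"
```

Calling close(file) afterwards means a later getline on the same file starts again from its first line.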

Creating new fields

Number of fields in input record can be changed by simply manipulating NF

$ # reducing fields
$ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{NF=2} 1'
foo,bar

$ # creating new empty field(s)
$ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{NF=5} 1'
foo,bar,123,baz,

$ # assigning to field greater than NF will create empty fields as needed
$ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{$7=42} 1'
foo,bar,123,baz,,,42

$ # adding a new 'Grade' field
$ awk 'BEGIN{OFS="\t"; g[9]="S"; g[8]="A"; g[7]="B"; g[6]="C"; g[5]="D"}
       {NF++; if(NR==1)$NF="Grade"; else $NF=g[int($(NF-1)/10)]} 1' marks.txt
Dept    Name    Marks   Grade
ECE     Raj     53      D
ECE     Joel    72      B
EEE     Moi     68      C
CSE     Surya   81      A
EEE     Tia     59      D
ECE     Om      92      S
CSE     Amy     67      C

two file example

$ cat list4
Raj class_rep
Amy sports_rep
Tia placement_rep

$ awk -v OFS='\t' 'NR==FNR{r[$1]=$2; next}
         {NF++; if(FNR==1)$NF="Role"; else $NF=r[$2]} 1' list4 marks.txt
Dept    Name    Marks   Role
ECE     Raj     53      class_rep
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59      placement_rep
ECE     Om      92
CSE     Amy     67      sports_rep

Dealing with duplicates

The default value of an uninitialized variable is 0 in numeric context and an empty string in text context; either way it evaluates to false when used as a condition

Illustration to show default numeric value and array in action

$ printf 'mad\n42\n42\ndam\n42\n'
mad
42
42
dam
42

$ printf 'mad\n42\n42\ndam\n42\n' | awk '{print $0 "\t" int(a[$0]); a[$0]++}'
mad     0
42      0
42      1
dam     0
42      2

$ # only those entries with second column value zero will be retained
$ printf 'mad\n42\n42\ndam\n42\n' | awk '!a[$0]++'
mad
42
dam

first, examples that retain only first copy of duplicates

$ cat duplicates.txt
abc  7   4
food toy ****
abc  7   4
test toy 123
good toy ****

$ # whole line
$ awk '!seen[$0]++' duplicates.txt
abc  7   4
food toy ****
test toy 123
good toy ****

$ # particular column
$ awk '!seen[$2]++' duplicates.txt
abc  7   4
food toy ****

$ # total count
$ awk '!seen[$2]++{c++} END{print +c}' duplicates.txt
2

if input is so large that integer numbers can overflow

See also gawk manual - Arbitrary-Precision Integer Arithmetic

$ # avoid unnecessary counting altogether
$ awk '!($2 in seen); {seen[$2]}' duplicates.txt
abc  7   4
food toy ****

$ # use arbitrary-precision integers, limited only by available memory
$ awk -M '!($2 in seen){c++} {seen[$2]} END{print +c}' duplicates.txt
2

For multiple fields, separate them using , or form a string with some character in between. Choose a character unlikely to appear in the input data, else there can be false matches

$ awk '!seen[$2"_"$3]++' duplicates.txt
abc  7   4
food toy ****
test toy 123

$ # can also use simulated multidimensional array
$ # SUBSEP, whose default is \034 non-printing character, is used as separator
$ awk '!seen[$2,$3]++' duplicates.txt
abc  7   4
food toy ****
test toy 123

retaining specific numbered copy

$ # second occurrence of duplicate
$ awk '++seen[$2]==2' duplicates.txt
abc  7   4
test toy 123

$ # third occurrence of duplicate
$ awk '++seen[$2]==3' duplicates.txt
good toy ****

retaining only last copy of duplicate

$ # reverse the input line-wise, retain first copy and then reverse again
$ tac duplicates.txt | awk '!seen[$2]++' | tac
abc  7   4
good toy ****

filtering based on duplicate count

This allows emulating the uniq command for specific fields

See also unix.stackexchange - retain only parent directory paths

$ # all duplicates based on 1st column
$ awk 'NR==FNR{a[$1]++; next} a[$1]>1' duplicates.txt duplicates.txt
abc  7   4
abc  7   4

$ # all duplicates based on 3rd column
$ awk 'NR==FNR{a[$3]++; next} a[$3]>1' duplicates.txt duplicates.txt
abc  7   4
food toy ****
abc  7   4
good toy ****

$ # more than 2 duplicates based on 2nd column
$ awk 'NR==FNR{a[$2]++; next} a[$2]>2' duplicates.txt duplicates.txt
food toy ****
test toy 123
good toy ****

$ # only unique lines based on 3rd column
$ awk 'NR==FNR{a[$3]++; next} a[$3]==1' duplicates.txt duplicates.txt
test toy 123

Lines between two REGEXPs

This section deals with filtering lines bound by two REGEXPs (referred to as blocks)

For simplicity the two REGEXPs usually used in below examples are the strings BEGIN and END

All unbroken blocks

Consider the sample input file below, which doesn't have any broken blocks (i.e. BEGIN and END are always present in pairs)

$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz

Extracting lines between starting and ending REGEXP

$ # include both starting/ending REGEXP
$ # can also use: awk '/BEGIN/,/END/' range.txt
$ # which is similar to sed -n '/BEGIN/,/END/p'
$ # but not suitable to extend for other cases
$ awk '/BEGIN/{f=1} f; /END/{f=0}' range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
END

$ # exclude both starting/ending REGEXP
$ # can also use: awk '/BEGIN/{f=1; next} /END/{f=0} f' range.txt
$ awk '/END/{f=0} f; /BEGIN/{f=1}' range.txt
1234
6789
a
b
c

Include only start or end REGEXP

$ # include only starting REGEXP
$ awk '/BEGIN/{f=1} /END/{f=0} f' range.txt
BEGIN
1234
6789
BEGIN
a
b
c

$ # include only ending REGEXP
$ awk 'f; /END/{f=0} /BEGIN/{f=1}' range.txt
1234
6789
END
a
b
c
END

Extracting lines other than lines between the two REGEXPs

$ awk '/BEGIN/{f=1} !f; /END/{f=0}' range.txt
foo
bar
baz

$ # the other three cases would be
$ awk '/END/{f=0} !f; /BEGIN/{f=1}' range.txt
$ awk '!f; /BEGIN/{f=1} /END/{f=0}' range.txt
$ awk '/BEGIN/{f=1} /END/{f=0} !f' range.txt

Specific blocks

Getting first block

$ awk '/BEGIN/{f=1} f; /END/{exit}' range.txt
BEGIN
1234
6789
END

$ # use other tricks discussed in previous section as needed
$ awk '/END/{exit} f; /BEGIN/{f=1}' range.txt
1234
6789

Getting last block

$ # reverse input linewise, change the order of REGEXPs, finally reverse again
$ tac range.txt | awk '/END/{f=1} f; /BEGIN/{exit}' | tac
BEGIN
a
b
c
END

$ # or, save the blocks in a buffer and print the last one alone
$ # ORS contains output record separator, which is newline by default
$ seq 30 | awk '/4/{f=1; b=$0; next} f{b=b ORS $0} /6/{f=0} END{print b}'
24
25
26

Getting blocks based on a counter

$ # all blocks
$ seq 30 | sed -n '/4/,/6/p'
4
5
6
14
15
16
24
25
26

$ # get only 2nd block
$ # can also use: seq 30 | awk -v b=2 '/4/{c++} c==b{print; if(/6/) exit}'
$ seq 30 | awk -v b=2 '/4/{c++} c==b; /6/ && c==b{exit}'
14
15
16

$ # to get all blocks greater than 'b' blocks
$ seq 30 | awk -v b=1 '/4/{f=1; c++} f && c>b; /6/{f=0}'
14
15
16
24
25
26

excluding a particular block

$ # excludes 2nd block
$ seq 30 | awk -v b=2 '/4/{f=1; c++} f && c!=b; /6/{f=0}'
4
5
6
24
25
26

Broken blocks

If there are blocks with an ending REGEXP but without a corresponding start, awk '/BEGIN/{f=1} f; /END/{f=0}' will suffice

Consider the modified input file where a starting REGEXP doesn't have a corresponding ending

$ cat broken_range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
baz

$ # the file reversing trick comes in handy here as well
$ tac broken_range.txt | awk '/END/{f=1} f; /BEGIN/{f=0}' | tac
BEGIN
1234
6789
END

But if both kinds of broken blocks are present, accumulate the records and print accordingly

$ cat multiple_broken.txt
qqqqqqq
BEGIN
foo
BEGIN
1234
6789
END
bar
END
0-42-1
BEGIN
a
BEGIN
b
END
; as ; s ; sd ;

$ awk '/BEGIN/{f=1; buf=$0; next}
       f{buf=buf ORS $0}
       /END/{f=0; if(buf) print buf; buf=""}' multiple_broken.txt
BEGIN
1234
6789
END
BEGIN
b
END

Arrays

We've already seen examples using arrays; some more are discussed in this section

array looping

$ # average marks for each department
$ awk 'NR>1{d[$1]+=$3; c[$1]++} END{for(i in d)print i, d[i]/c[i]}' marks.txt
ECE 72.3333
EEE 63.5
CSE 74

Sorting

See gawk manual - Predefined Array Scanning Orders for more details

$ # by default, keys are traversed in random order
$ awk 'BEGIN{a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}'
x 12
z 1
b 42

$ # index sorted ascending order as strings
$ awk 'BEGIN{PROCINFO["sorted_in"] = "@ind_str_asc";
       a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}'
b 42
x 12
z 1

$ # value sorted ascending order as numbers
$ awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc";
       a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}'
z 1
x 12
b 42

deleting array elements

$ cat list5
CSE Surya 75
EEE Jai 69
ECE Kal 83

$ # update entry if a match is found
$ # else append the new entries
$ awk '{ky=$1"_"$2} NR==FNR{upd[ky]=$0; next}
       ky in upd{$0=upd[ky]; delete upd[ky]} 1;
       END{for(i in upd)print upd[i]}' list5 marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE Surya 75
EEE     Tia     59
ECE     Om      92
CSE     Amy     67
ECE Kal 83
EEE Jai 69

true multidimensional arrays

length of sub-arrays need not be same. See gawk manual - Arrays of Arrays for details

$ awk 'NR>1{d[$1][$2]=$3} END{for(i in d["ECE"])print i}' marks.txt
Joel
Raj
Om

$ awk -v f='CSE' 'NR>1{d[$1][$2]=$3} END{for(i in d[f])print i, d[f][i]}' marks.txt
Surya 81
Amy 67

awk scripts

For larger programs, save the code in a file and use the -f command line option

; is not needed to terminate a statement

See also gawk manual - Command-Line Options for other related options

$ cat buf.awk
/BEGIN/{
    f=1
    buf=$0
    next
}

f{
    buf=buf ORS $0
}

/END/{
    f=0
    if(buf)
        print buf
    buf=""
}

$ awk -f buf.awk multiple_broken.txt
BEGIN
1234
6789
END
BEGIN
b
END

Another advantage is that single quotes can be freely used

$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\047&\047", "g")} 1'
'foo':'123':'bar':'baz'

$ cat quotes.awk
{
    $0 = gensub(/[^:]+/, "'&'", "g")
}

1

$ echo 'foo:123:bar:baz' | awk -f quotes.awk
'foo':'123':'bar':'baz'

If the code has been first tried out on command line, add -o option to get a pretty printed version

$ awk -o -v OFS='\t' 'NR==FNR{r[$1]=$2; next}
         {NF++; if(FNR==1)$NF="Role"; else $NF=r[$2]} 1' list4 marks.txt
Dept    Name    Marks   Role
ECE     Raj     53      class_rep
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59      placement_rep
ECE     Om      92
CSE     Amy     67      sports_rep

File name can be passed along -o option, otherwise by default awkprof.out will be used

$ cat awkprof.out
        # gawk profile, created Tue Oct 24 15:10:02 2017

        # Rule(s)

        NR == FNR {
                r[$1] = $2
                next
        }

        {
                NF++
                if (FNR == 1) {
                        $NF = "Role"
                } else {
                        $NF = r[$2]
                }
        }

        1 {
                print $0
        }

$ # note that other command line options have to be provided as usual
$ # for ex: awk -v OFS='\t' -f awkprof.out list4 marks.txt

Miscellaneous

FPAT and FIELDWIDTHS

FS defines the field separator

In contrast, FPAT defines what the fields themselves should be made up of

See also gawk manual - Defining Fields by Content

$ s='Sample123string54with908numbers'

$ # define fields to be one or more consecutive digits
$ echo "$s" | awk -v FPAT='[0-9]+' '{print $1, $2, $3}'
123 54 908

$ # define fields to be one or more consecutive alphabets
$ echo "$s" | awk -v FPAT='[a-zA-Z]+' '{print $1, $2, $3, $4}'
Sample string with numbers

For simpler csv input having quoted strings, where the fields themselves can contain a comma, using FPAT is a reasonable approach

Use a proper parser if the input can have other cases like newlines in fields

See unix.stackexchange - using csv parser for a sample program in perl



$ s='foo,"bar,123",baz,abc'

$ echo "$s" | awk -F, '{print $2}'
"bar

$ echo "$s" | awk -v FPAT='"[^"]*"|[^,]*' '{print $2}'
"bar,123"

if input has well defined fields based on number of characters, FIELDWIDTHS can be used to specify width of each field

$ awk -v FIELDWIDTHS='8 3' -v OFS= '/fig/{$2=35} 1' fruits.txt
fruit   qty
apple   42
banana  31
fig     35
guava   6

$ # without FIELDWIDTHS
$ awk '/fig/{$2=35} 1' fruits.txt
fruit   qty
apple   42
banana  31
fig 35
guava   6

String functions

length function - returns length of string, by default acts on $0

$ seq 8 13 | awk 'length()==1'
8
9

$ awk 'NR==1 || length($1)>4' fruits.txt
fruit   qty
apple   42
banana  31
guava   6

$ # character count and not byte count is calculated, similar to 'wc -m'
$ printf 'hi👍' | awk '{print length()}'
3

$ # use -b option if number of bytes are needed
$ printf 'hi👍' | awk -b '{print length()}'
6

split function - similar to FS splitting input record into fields

use patsplit function to get results similar to FPAT

See also gawk manual - Split function

See also unix.stackexchange - delimit second column

$ # 1st argument is string to be split
$ # 2nd argument is array to save results, indexed from 1
$ # 3rd argument is separator, default is FS
$ s='foo,1996-10-25,hello,good'
$ echo "$s" | awk -F, '{split($2,d,"-"); print "Month is: " d[2]}'
Month is: 10

$ # using regular expression to define separator
$ # return value is number of fields after splitting
$ s='Sample123string54with908numbers'
$ echo "$s" | awk '{n=split($0,s,/[0-9]+/); for(i=1;i<=n;i++)print s[i]}'
Sample
string
with
numbers

$ # use 4th argument if separators are needed as well
$ echo "$s" | awk '{n=split($0,s,/[0-9]+/,seps); for(i=1;i<n;i++)print seps[i]}'
123
54
908

$ # single row to multiple rows based on splitting last field
$ s='foo,baz,12:42:3'
$ echo "$s" | awk -F, '{n=split($NF,a,":"); NF--; for(i=1;i<=n;i++) print $0,a[i]}'
foo baz 12
foo baz 42
foo baz 3

substr function allows extracting a specified number of characters from a given string; indexing starts with 1

See gawk manual - substr function for corner cases and details

$ # 1st argument is string to be worked on
$ # 2nd argument is starting position
$ # 3rd argument is number of characters to be extracted
$ echo 'abcdefghij' | awk '{print substr($0,1,5)}'
abcde
$ echo 'abcdefghij' | awk '{print substr($0,4,3)}'
def

$ # if 3rd argument is not given, string is extracted until end
$ echo 'abcdefghij' | awk '{print substr($0,6)}'
fghij

$ echo 'abcdefghij' | awk -v OFS=':' '{print substr($0,2,3), substr($0,6,3)}'
bcd:fgh

$ # if only few characters are needed from input line, can use empty FS
$ echo 'abcdefghij' | awk -v FS= '{print $3}'
c
$ echo 'abcdefghij' | awk -v FS= '{print $3, $5}'
c e

Executing external commands

External commands can be issued using the system function

Output would be as usual on stdout unless redirected while calling the command

Return value of system depends on exit status of executed command, see gawk manual - Input/Output Functions for details

$ awk 'BEGIN{system("echo Hello World")}'
Hello World

$ wc poem.txt
 4 13 65 poem.txt
$ awk 'BEGIN{system("wc poem.txt")}'
 4 13 65 poem.txt

$ awk 'BEGIN{system("seq 10 | paste -sd, > out.txt")}'
$ cat out.txt
1,2,3,4,5,6,7,8,9,10

$ ls xyz.txt
ls: cannot access 'xyz.txt': No such file or directory
$ echo $?
2
$ awk 'BEGIN{s=system("ls xyz.txt"); print "Status: " s}'
ls: cannot access 'xyz.txt': No such file or directory
Status: 2

$ cat f2
I bought two bananas and three mangoes
$ echo 'f1,f2,odd.txt' | awk -F, '{system("cat " $2)}'
I bought two bananas and three mangoes

printf formatting

Similar to the printf function in C and the shell built-in command

use sprintf function to save result in variable instead of printing

See also gawk manual - printf
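A small sketch of the sprintf variant mentioned above: the formatted text is returned as a string, so it can be stored and reused. The values and the variable name `s` are made up for illustration.

```shell
formatted=$(awk 'BEGIN{
    # sprintf returns the formatted string instead of printing it
    s = sprintf("%05d|%.2f|%-4s|", 42, 3.14159, "ab")
    print "length=" length(s), "value=" s
}')
printf '%s\n' "$formatted"
```

Unlike printf, nothing is written until the saved string is printed explicitly, which is handy when the formatted text is used as an array key or as part of a larger record.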

$ awk '{sum += $1} END{print sum}' nums.txt
10062.9

$ # note that ORS is not appended and has to be added manually
$ awk '{sum += $1} END{printf "%.2f\n", sum}' nums.txt
10062.86

$ awk '{sum += $1} END{printf "%10.2f\n", sum}' nums.txt
  10062.86

$ awk '{sum += $1} END{printf "%010.2f\n", sum}' nums.txt
0010062.86

$ awk '{sum += $1} END{printf "%d\n", sum}' nums.txt
10062

$ awk '{sum += $1} END{printf "%+d\n", sum}' nums.txt
+10062

$ awk '{sum += $1} END{printf "%e\n", sum}' nums.txt
1.006286e+04

to refer argument by positional number (starts with 1), use <num>$

$ # can also use: awk 'BEGIN{printf "hex=%x\noct=%o\ndec=%d\n", 15, 15, 15}'
$ awk 'BEGIN{printf "hex=%1$x\noct=%1$o\ndec=%1$d\n", 15}'
hex=f
oct=17
dec=15

$ # adding prefix to hex/oct numbers
$ awk 'BEGIN{printf "hex=%1$#x\noct=%1$#o\ndec=%1$d\n", 15}'
hex=0xf
oct=017
dec=15

strings

$ # prefix remaining width with spaces
$ awk 'BEGIN{printf "%6s:%5s\n", "foo", "bar"}'
   foo:  bar

$ # suffix remaining width with spaces
$ awk 'BEGIN{printf "%-6s:%-5s\n", "foo", "bar"}'
foo   :bar  

$ # truncate
$ awk 'BEGIN{printf "%.2s\n", "foobar"}'
fo

avoid using printf without format specifier

$ awk 'BEGIN{s="solve: 5 % x = 1"; printf s}'
awk: cmd. line:1: fatal: not enough arguments to satisfy format string
        `solve: 5 % x = 1'
                    ^ ran out for this one

$ awk 'BEGIN{s="solve: 5 % x = 1"; printf "%s\n", s}'
solve: 5 % x = 1

Redirecting print output

Redirecting to a file instead of stdout using >

Similar to the behavior in shell, if the file already exists it is overwritten; use >> to append to an existing file without deleting its content

However, unlike shell, subsequent redirections to the same file within the same awk invocation will append to it

See also gawk manual - Closing Input and Output Redirections if you have too many redirections

$ seq 6 | awk 'NR%2{print > "odd.txt"; next} {print > "even.txt"}'
$ cat odd.txt
1
3
5
$ cat even.txt
2
4
6

$ awk 'NR==1{col1=$1".txt"; col2=$2".txt"; next}
       {print $1 > col1; print $2 > col2}' fruits.txt
$ cat fruit.txt
apple
banana
fig
guava
$ cat qty.txt
42
31
90
6
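The difference from shell redirection can be checked with a short experiment. A minimal sketch, assuming a throwaway out.txt in a temporary directory (the file and variable names are made up):

```shell
tmp=$(mktemp -d)
out="$tmp/out.txt"
# within one awk run, the second print > f appends, since the file stays open
awk -v f="$out" 'BEGIN{print "first" > f; print "second" > f}'
same_run=$(cat "$out")
# a new awk run truncates the file again, like > in shell
awk -v f="$out" 'BEGIN{print "third" > f}'
new_run=$(cat "$out")
printf 'same run:\n%s\nnew run:\n%s\n' "$same_run" "$new_run"
rm -r "$tmp"
```

So > only truncates when the file is first opened by a given awk process; every later print to the same name in that process adds to it.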

redirecting to shell command

this is useful if you have different things to redirect to different commands, otherwise it can be done as usual in shell acting on awk's output

all redirections to the same command get combined as a single input to that command

$ # same as: echo 'foo good 123' | awk '{print $2}' | wc -c
$ echo 'foo good 123' | awk '{print $2 | "wc -c"}'
5

$ # to avoid newline character being added to print
$ echo 'foo good 123' | awk -v ORS= '{print $2 | "wc -c"}'
4

$ # assuming no format specifiers in input
$ echo 'foo good 123' | awk '{printf $2 | "wc -c"}'
4

$ # same as: echo 'foo good 123' | awk '{printf $2 $3 | "wc -c"}'
$ echo 'foo good 123' | awk '{printf $2 | "wc -c"; printf $3 | "wc -c"}'
7

Gotchas and Tips

using $ for variables

only input record $0 and field contents $1, $2, etc need $

See also unix.stackexchange - Why does awk print the whole line when I want it to print a variable?

$ # wrong
$ awk -v word="apple" '$1==$word' fruits.txt

$ # right
$ awk -v word="apple" '$1==word' fruits.txt
apple   42
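Why the wrong version prints nothing: $ is a field-access operator applied to the value of the variable. A non-numeric string counts as 0, so $word ends up meaning $0, and the comparison is between the first field and the whole line. A small sketch with a made-up input line and variable names:

```shell
# word holds a non-numeric string, so $word is the same as $0
whole=$(echo 'apple 42' | awk -v word='apple' '{print $word}')
# a numeric value selects that field instead
field=$(echo 'apple 42' | awk -v idx=2 '{print $idx}')
printf '%s\n%s\n' "$whole" "$field"
```

The second command shows the legitimate use of $ with a variable: picking a field whose number is decided at run time.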

dos style line endings

See also unix.stackexchange - filtering when last column has \r

$ # no issue with unix style line ending
$ printf 'foo bar\n123 789\n' | awk '{print $2, $1}'
bar foo
789 123

$ # dos style line ending causes trouble
$ printf 'foo bar\r\n123 789\r\n' | awk '{print $2, $1}'
 foo
 123

$ # easy to deal by simply setting appropriate RS
$ # note that ORS would still be newline character only
$ printf 'foo bar\r\n123 789\r\n' | awk -v RS='\r\n' '{print $2, $1}'
bar foo
789 123
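Another option, not part of the examples above, is to keep the default RS and strip the trailing carriage return from each record before using the fields. A minimal sketch on the same sample input:

```shell
# modifying $0 with sub() causes the fields to be split again
fixed=$(printf 'foo bar\r\n123 789\r\n' | awk '{sub(/\r$/, "")} {print $2, $1}')
printf '%s\n' "$fixed"
```

This variant also copes with input that mixes both line ending styles, since lines without a carriage return are simply left unchanged.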

relying on default initial value

$ # step 1 - works for single file
$ awk '{sum += $1} END{print sum}' nums.txt
10062.9

$ # step 2 - change to work for multiple file
$ awk '{sum += $1} ENDFILE{print FILENAME, sum}' nums.txt
nums.txt 10062.9

$ # step 3 - check with multiple file input
$ # oops, default numerical value '0' for sum works only once
$ awk '{sum += $1} ENDFILE{print FILENAME, sum}' nums.txt <(seq 3)
nums.txt 10062.9
/dev/fd/63 10068.9

$ # step 4 - correctly initialize variables
$ awk 'BEGINFILE{sum=0} {sum += $1} ENDFILE{print FILENAME, sum}' nums.txt <(seq 3)
nums.txt 10062.9
/dev/fd/63 6

use unary operator + to force numeric conversion

$ awk '{sum += $1} END{print FILENAME, sum}' nums.txt
nums.txt 10062.9
$ awk '{sum += $1} END{print FILENAME, sum}' /dev/null
/dev/null 
$ awk '{sum += $1} END{print FILENAME, +sum}' /dev/null
/dev/null 0

concatenate empty string to force string comparison

$ echo '5 5.0' | awk '{print $1==$2 ? "same" : "different", "string"}'
same string
$ echo '5 5.0' | awk '{print $1""==$2 ? "same" : "different", "string"}'
different string

beware of expressions going negative in field calculations

$ cat misc.txt
foo
good bad ugly
123 xyz
a b c d

$ # trying to delete last two fields
$ awk '{NF -= 2} 1' misc.txt
awk: cmd. line:1: (FILENAME=misc.txt FNR=1) fatal: NF set to negative value

$ # dynamically change it depending on number of fields
$ awk '{NF = (NF<=2) ? 0 : NF-2} 1' misc.txt

good

a b

$ # similarly, trying to access 3rd field from end
$ awk '{print $(NF-2)}' misc.txt
awk: cmd. line:1: (FILENAME=misc.txt FNR=1) fatal: attempt to access field -1
$ awk 'NF>2{print $(NF-2)}' misc.txt
good
b

If input is ASCII alone, a simple trick to improve speed is to set LC_ALL=C

For simple non-regex based column filtering, using the cut command might give faster results. See stackoverflow - how to split columns faster for an example



$ # all words containing exactly 3 lowercase a
$ time awk -F'a' 'NF==4{cnt++} END{print +cnt}' /usr/share/dict/words
1019
real    0m0.075s

$ time LC_ALL=C awk -F'a' 'NF==4{cnt++} END{print +cnt}' /usr/share/dict/words
1019
real    0m0.045s

