Skip to content

Use awk to provide cut like syntax for field extraction

License

Notifications You must be signed in to change notification settings

learnbyexample/regexp-cut

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 

Repository files navigation

regexp-cut

Uses awk to provide cut like syntax for field extraction. The command name is rcut.

⚠️ ⚠️ Work under construction!


Motivation

cut's syntax is handy for many field extraction problems. But it doesn't allow multi-character or regexp delimiters. So, this project aims to provide cut like syntax for those cases. Currently uses mawk in a bash script.

ℹ️ Note that rcut isn't feature compatible or a replacement for the cut command. rcut helps when you need features like regexp field separator.


Features

  • Default field separation is same as awk
  • Both input (-d) and output (-o) field separators can be multiple characters
  • Input field separator can use regular expressions
    • this script uses mawk by default
    • you can change it to gawk for better regexp support with -g option
  • If input field separator is a single character, output field separator will also be this same character
  • Fixed string input field separator can be enabled by using the -F option
    • if -o is not used, value passed to the -d option will be set as the output field separator
  • Field range can be specified by using - separator (same as cut)
    • - by itself means all the fields (this is also the default if -f option isn't used at all)
    • if start of the range isn't given, default is 1
    • if end of the range isn't given, default is last field of a line
  • Negative indexing is allowed if you use -n option
    • -1 means the last field, -2 means the second-last field and so on
    • you'll have to use : to specify field ranges
  • Multiple fields and ranges can be separated using , character (same as cut)
  • Unlike cut, order matters with the -f option and field/range duplication is also allowed
    • this assumes -c (complement) is not active
  • Using -c option will print all the fields in the same order as input except the fields specified by -f option
  • Using -s option will suppress lines not matching the input field separator
  • Minimum field number is forced to be 1
  • Maximum field number is forced to be last field of a line

⚠️ ⚠️ Work under construction!


Examples

$ cat spaces.txt
   1 2	3  
x y z
 i          j 		k	

# by default, it uses awk's space/tab field separation and trimming
# unlike cut, order matters
$ rcut -f3,1 spaces.txt
3 1
z x
k i

# multi-character delimiter
$ echo 'apple:-:fig:-:guava' | rcut -d:-: -f2
fig

# regexp delimiter
$ echo 'Sample123string42with777numbers' | rcut -d'[0-9]+' -f1,4
Sample numbers

# fixed string delimiter
$ echo '123)(%)*#^&(*@#.[](\\){1}\xyz' | rcut -Fd')(%)*#^&(*@#.[](\\){1}\' -f1,2 -o,
123,xyz

# multiple ranges can be specified, order matters
$ printf '1 2 3 4 5\na b c d e\n' | rcut -f2-3,5,1,2-4
2 3 5 1 2 3 4
b c e a b c d

# last field
$ printf 'apple ball cat\n1 2 3 4 5' | rcut -nf-1
cat
5

# except last two fields
$ printf 'apple ball cat\n1 2 3 4 5' | rcut -cnf-2:
apple
1 2 3

# suppress lines without input field delimiter
$ printf '1,2,3,4\nhello\na,b,c\n' | rcut -sd, -f2
2
b

# -g option will switch to gawk
$ echo '1aa2aa3' | rcut -gd'a{2}' -f2
2

See Examples.md for many more examples.


Tests

You can use script.awk to check if all the example code snippets are working as expected.

$ cd examples/
$ awk -f script.awk Examples.md

TODO

  • Step value other than 1 for field range
  • What to do if start of the range is greater than end?
  • And possibly more...

Similar tools

  • hck — close to drop in replacement for cut that can use a regex delimiter, works on compressed files, etc
  • choose — negative indexing, regexp based delimiters, etc

Contributing

  • Please open an issue for typos/bugs/suggestions/etc
  • Even for pull requests, open an issue for discussion before submitting PRs
  • In case you need to reach me, mail me at echo 'bGVhcm5ieWV4YW1wbGUubmV0QGdtYWlsLmNvbQo=' | base64 --decode or send a DM via twitter

License

This project is licensed under MIT, see LICENSE file for details.

About

Use awk to provide cut like syntax for field extraction

Topics

Resources

License

Stars

Watchers

Forks

Languages