<style> p { max-width: 500px; } </style>

# Utilities

## [cut](https://linux.die.net/man/1/cut)

This allows you to select parts of each line of a file to print.

Seems like it might be useful for a CSV or space-delimited file.


In [17]:
cd ~/proj/shell
# Remove extensions.
# -f1 is field 1, -d. sets the delimiter to a dot
ls | head -n 8 | cut -f1 -d.

# Truncate to 40 characters.
cut -c-40 data/lorem.txt | sed '/^$/d'
echo

# Print characters 40-80 and 100-120
cut -c40-80,100-120 data/lorem.txt | sed '/^$/d'

data
dist
get-pip
lib
LICENCE
notebook
__pycache__
pyproject
Lorem ipsum dolor sit amet, consectetur 
Curabitur maximus arcu magna, in fringil
Proin luctus est odio, id sollicitudin i
Praesent et gravida lacus. Nulla a ultri
Mauris eu felis non dui semper pulvinar 

 adipiscing elit. Nam posuere congue frincula tristique effici
lla lectus viverra id. Cras nec tempor erauris id efficitur ma
ipsum rutrum sed. Cras facilisis tinciduna diam efficitur et. 
ices elit, et elementum urna. Sed ornare lestie, vel eleifend 
 sit amet nec neque. Phasellus lacinia, jhoncus, magna eros co



So it has a few interesting tricks, but it only works within each line; it can't select or drop whole lines.

That's why I had to remove the blank lines with sed.
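
To try the CSV idea from the top of this section, here's a minimal sketch. The sample rows are made up for illustration, and it assumes no quoted fields containing commas, since `cut` knows nothing about CSV quoting.

```shell
# Hypothetical sample data: a header plus two rows.
csv='name,age,city
Ann,31,Perth
Bob,27,Hobart'

# Keep fields 1 and 3, using a comma as the delimiter.
picked=$(printf '%s\n' "$csv" | cut -d, -f1,3)
printf '%s\n' "$picked"
```

The output keeps the same delimiter between the selected fields.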



## [paste](https://linux.die.net/man/1/paste)

This one combines multiple files line by line, joining the matching lines with a delimiter.

I collected some data for names in "data/firstname_raw.txt".

Then it was transformed with:

In [1]:
cd ~/proj/shell
cat data/firstname_raw.txt | cut -f1 | 
    sed '/^[A-Z]/d;s/^[ ]*//;/^$/d;s/\(.\)\(.*\)/\1\L\2/' > data/firstname.txt


<style> p,ul { max-width: 500px; } </style>

This knowledge of `cut` and `sed` is already paying off.
 * The raw data is tab separated (cut's default delimiter).
 * The first field is the name.
 * The sed command then:
    - Removes unindented lines (headings).
    - Removes the space indent from remaining lines.
    - Removes empty lines (table gaps).
    - Converts everything after the first character to lowercase.
    
Next I copied a table of common surnames from Wikipedia and put it into "data/surname_raw.txt".

I transformed them with:

In [29]:
cd ~/proj/shell

cat data/surname_raw.txt | cut -f2 | sed 's/ //g' > data/surname.txt



I then used paste to combine them:

In [34]:
# -d for delimiter, default is tab.
paste -d' ' data/firstname.txt data/surname.txt | head -n 5

Joshua Smith
Lachlan Jones
Ethan Williams
Thomas Brown
James Wilson



Looks good, but the number of first and last names doesn't match.
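
When the files have different lengths, `paste` keeps going until the longest one runs out and treats the shorter ones as if they had empty lines. A small sketch with made-up temporary files (the file names and contents here are hypothetical stand-ins, not the real data files):

```shell
# Hypothetical stand-in files: three first names, two surnames.
tmp=$(mktemp -d)
printf '%s\n' Joshua Lachlan Ethan > "$tmp/first.txt"
printf '%s\n' Smith Jones > "$tmp/last.txt"

# The third line gets an empty second field, leaving a trailing delimiter.
pasted=$(paste -d' ' "$tmp/first.txt" "$tmp/last.txt")
printf '%s\n' "$pasted"

rm -r "$tmp"
```

So a mismatch doesn't cause an error, just names with a trailing space and no surname.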


## [sort](https://linux.die.net/man/1/sort)

This does what you'd expect: it sorts lines in an order specified by flags.

There are plenty of options.
 * `-d` dictionary order (only blanks and alphanumeric characters are compared).
 * `-n` numeric
 * `-h` human numeric (1K, 5M). Numeric with units.
 * `-R` random
 * `-M` month sort (JAN, then FEB, etc)
 * `-u` unique items only.
 * `-r` reverse
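
The difference between the default ordering and `-n` is easiest to see side by side. A minimal sketch with made-up numbers:

```shell
# Made-up numbers, one per line.
nums=$(printf '%s\n' 10 9 100)

# Default sort compares text, so "100" lands before "9".
as_text=$(printf '%s\n' "$nums" | sort | paste -sd, -)
# -n compares numeric values instead.
as_numbers=$(printf '%s\n' "$nums" | sort -n | paste -sd, -)

echo "text:    $as_text"
echo "numeric: $as_numbers"
```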

In [1]:
cd ~/proj/shell

# Forwards dictionary order.
sort data/firstname.txt | head -n 4

# Backwards
sort -d -r data/surname.txt | head -n 4

# Smallest directories (du without -a only lists directories)
du -h | sort -h | head -n 5

# Largest directories
du -h | sort -hr | head -n 5

echo




Amelia
Benjamin
Caitlin
Charlotte
Wilson
Williams
White
Walker
4.0K	./dist
4.0K	./lib
4.0K	./__pycache__
4.0K	./.venv/include/python3.11
4.0K	./.venv/lib64
103M	.
101M	./.venv/lib
101M	./.venv
52M	./.venv/lib/python3.12/site-packages
52M	./.venv/lib/python3.12



In [None]:
cd ~/proj/shell
# Random name
export first=$(sort -R data/firstname.txt | head -n 1)
export last=$(sort -R data/surname.txt | head -n 1)
echo "$first $last"

Lucy Johnson



## [uniq](https://linux.die.net/man/1/uniq)

This filters out adjacent duplicate lines in a file.

I'm already familiar with this one, so I'll explore the other options.

In [17]:
cd ~/proj/shell

# Find the list of nationalities in the surnames.
# Only repeated lines next to each other count, so they have to be sorted.
# The '-c' option is interesting because it counts how many times each line occurred.
cat data/surname_raw.txt |  # Open the raw data.
cut -f4 |                   # Take the fourth field (ethnicity of name)
sed 's/ //g' |              # Remove the extra spaces in the list.
sed 's/,/\n/g' |            # Multiple origins are comma separated, so split them.
sort |                      # Must be sorted for uniq to do its magic.
uniq -c |                   # Count the number of names with each ethnicity.
sort -nr                    # Now sort from most to least.

# Unsurprisingly for Australia, most of these are from parts of the UK.


     14 English
      8 Irish
      7 Scottish
      3 Welsh
      1 Vietnamese
      1 Punjabi
      1 Korean
      1 Indian
      1 French
      1 Chinese


Hmm, I like that. I put it in the source file "03_name_origins".
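
The "adjacent lines only" rule is worth seeing in isolation. A minimal sketch with made-up input, which also tries `-d` (print only the repeated lines):

```shell
# Made-up input with a non-adjacent duplicate.
lines=$(printf '%s\n' Irish English Irish)

# Unsorted: the two "Irish" lines aren't adjacent, so both survive.
unsorted=$(printf '%s\n' "$lines" | uniq | paste -sd, -)
# Sorted first: duplicates become adjacent and collapse.
sorted=$(printf '%s\n' "$lines" | sort | uniq | paste -sd, -)
# -d prints only the lines that were repeated.
repeated=$(printf '%s\n' "$lines" | sort | uniq -d)

echo "unsorted: $unsorted"
echo "sorted:   $sorted"
echo "repeated: $repeated"
```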


## [seq](https://linux.die.net/man/1/seq)

I'm aware this is a command to help shell loops.

It prints a sequence between two numbers.

```sh
seq [OPTION]... LAST
seq [OPTION]... FIRST LAST
seq [OPTION]... FIRST INCREMENT LAST
```

Apparently you can also do an increment.



In [63]:
# 1,2,3
seq 1 3

# Apparently you don't need sed to merge lines. `paste -s` can do that.
# Negatives also work.
seq -1 2 | paste -s

# This prints nothing: the default increment is +1, so it never counts down.
#seq 5 1 | paste -s -d "," -

# An explicit negative increment works.
seq 5 -1 1 | paste -sd,

# Decimal numbers can be used.
seq 1 0.5 3 | paste -sd,

# Doesn't this introduce floating point bugs?
seq 0.1 0.1 1 | paste -sd,
# Surprisingly not, likely due to the limited precision output.

# Oops, you can just do '-s,' to keep it in one line.
# '-w' pads it... with zeros?
seq -s, -w 2 2 20


1
2
3
-1	0	1	2
5,4,3,2,1
1.0,1.5,2.0,2.5,3.0
0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0
02,04,06,08,10,12,14,16,18,20



Much more flexible than I thought!

I'll have to find a practical use for that later.
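
One possible use, sketched here as a guess rather than something from my own scripts: generating zero-padded names in a loop. The `page_NN.txt` pattern is purely hypothetical.

```shell
# -w pads every number to the width of the widest one (08, 09, 10).
pages=$(
    for i in $(seq -w 8 10); do
        echo "page_$i.txt"
    done
)
printf '%s\n' "$pages"
```

The padding keeps the names in the right order when they're later sorted as text.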


## [tr](https://linux.die.net/man/1/tr)

Converts characters from one set to another or deletes them.

I've used it only occasionally.

It definitely sounds useful.
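
Two small sketches of what set-to-set translation allows: squeezing repeated characters with `-s`, and ROT13 by mapping each half of the alphabet onto the other. The input strings are made up.

```shell
# -s squeezes each run of the listed character down to a single one.
squeezed=$(echo "too    many   spaces" | tr -s ' ')

# ROT13: A-M maps to N-Z and N-Z maps to A-M, same again for lowercase.
rot13=$(echo "Hello" | tr 'A-Za-z' 'N-ZA-Mn-za-m')

echo "$squeezed"
echo "$rot13"
```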




In [72]:
# Make the 'A' lowercase.
echo Asdf | tr 'A' 'a'

# Delete the letter 'e'.
echo Example text | tr -d 'e'

# Uppercase everything.
# There are a lot of these [:sets:] of characters.
echo Hello world. | tr '[:lower:]' '[:upper:]'

# Strip spaces
echo " spaces s p a c e s   " | tr -d " "

# Fix hex case.
# This uses a custom range.
echo 0X1ab3 | tr 'Xa-f' 'xA-F'

# Delete everything *not* in the set (this includes the trailing newline).
echo 'asdf#$*()@#$' | tr -dc '[:alnum:]'

# Replace everything not in the set with an underscore.
echo 'asdf#$*()@#$' | tr -c '[:alpha:]' '[_*]'



asdf
Exampl txt
HELLO WORLD.
spacesspaces
0x1AB3
asdf
asdf_________



## [wc](https://linux.die.net/man/1/wc)

Counts lines, words, bytes, or characters.
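
Bytes and characters only differ once non-ASCII text shows up. A minimal sketch, assuming UTF-8 encoded input; the character count from `-m` depends on the current locale, so no result is claimed for it here.

```shell
# "héllo" written with octal escapes so the bytes are explicit:
# \303\251 is the two-byte UTF-8 encoding of é.
word='h\303\251llo'

# -c counts bytes: 4 ASCII letters plus 2 bytes for é.
bytes=$(printf "$word" | wc -c)
# -m counts characters according to the locale's encoding.
chars=$(printf "$word" | wc -m)

echo "bytes: $bytes"
echo "chars: $chars"
```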



In [None]:
cd ~/proj/shell

# First names, should just be a matter of lines.
wc -l data/firstname.txt 

# Probably want just the number for most things.
# Reading from stdin means wc has no filename to print.
echo -n Surnames: 
wc -l < data/surname.txt

wc -lmw data/lorem.txt


39 data/firstname.txt
Surnames:
20
   8  411 2777 data/lorem.txt



## [xargs](https://linux.die.net/man/1/xargs)

It seems to just convert piped input into command-line arguments.
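
A minimal sketch of the three ways I'd expect to use it most, with made-up input: everything on one command line, one item per invocation with `-n 1`, and placeholder substitution with `-I`.

```shell
# By default xargs packs all input items onto one command line.
all_at_once=$(printf '%s\n' a b c | xargs echo)

# -n 1 runs the command once per item instead.
one_each=$(printf '%s\n' a b c | xargs -n 1 echo "item:")

# -I NAME substitutes each input line wherever NAME appears.
wrapped=$(printf '%s\n' a b | xargs -I NAME echo "<NAME>")

printf '%s\n' "$all_at_once" "$one_each" "$wrapped"
```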


In [None]:
cd ~/proj/shell

# List notebooks
echo notebook | xargs ls

# Print odd lines with sed.
# seq and sed build the script "1p;3p;5p;7p;9p", which xargs hands to sed -n.
seq -s, 1 2 9 |
sed 's/,/p;/g;s/$/p/' |
xargs -I PAT sed -n PAT data/lorem.txt |
cut -c-40


00_start.ipynb	01_sed.ipynb  02_fileutil.ipynb
Lorem ipsum dolor sit amet, consectetur 
Curabitur maximus arcu magna, in fringil
Proin luctus est odio, id sollicitudin i
Praesent et gravida lacus. Nulla a ultri
Mauris eu felis non dui semper pulvinar 



## [fold](https://linux.die.net/man/1/fold)

Splits lines after a fixed length.

Should be able to do what I've been doing with cut, except it can break at spaces.


In [1]:
cd ~/proj/shell

# Move to next line after 40 characters.
cat data/lorem.txt | head -n 1 | fold -w 40 | head -n 4

echo

# Same, but only break at word boundaries.
cat data/lorem.txt | head -n 1 | fold -s -w 40 | head -n 4


Lorem ipsum dolor sit amet, consectetur 
adipiscing elit. Nam posuere congue frin
gilla. Quisque vehicula tristique effici
tur. Aliquam pellentesque lacus blandit,

Lorem ipsum dolor sit amet, consectetur 
adipiscing elit. Nam posuere congue 
fringilla. Quisque vehicula tristique 
efficitur. Aliquam pellentesque lacus 



Unfortunately the GNU version on Linux comes "broken". 

They just never implemented the "-c" option for characters, so it only counts bytes.

This means it's going to create a huge mess the moment it encounters a non-ASCII character, which makes it pretty much useless.

After some reading, it turns out <em>neither `fmt` nor even `par` handles this!</em>

Yikes. 😑

I've been spoilt by the great Unicode support of modern languages, but when I look at shell utilities it's a dumpster fire.

So let's see:
 * fold - no Unicode support on Linux.
 * fmt - should in theory support it, but it doesn't.
 * cut - goes by bytes only, even in character mode.
 * par - not Unicode-aware; it split characters, making the text unreadable.


Maybe I should just filter it through `awk`... or `python`.

So in the end I gave up and wrote a dozen or so lines of Python.

That's all it took to:
 * Handle command-line arguments with optional width and help.
 * Pipe input in and out.
 * Wrap text with a Unicode-aware standard library.
 
Did I mention that I can just copy-paste it to any modern system and expect it to work?

Anyway... I'm adding it as 'parwrap' in the portable binaries.



## [printf](https://linux.die.net/man/1/printf)

![Printer](data/printer.png)

Yep, it exists as a shell command as well.

The man page says it should take all the same arguments as C printf.

Though there are some extra formatting codes added.

And there's the '%q' specifier, which prints its argument in a shell-quoted format.
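
A minimal sketch of the C-style parts: the format string is reused until the arguments run out, and widths, alignment and precision behave as in C. The values are made up, and the float formatting assumes a locale that uses `.` as the decimal point.

```shell
# The format is applied again for each leftover argument.
list=$(printf '%s;' a b c)

# Left-justify to 8 columns, right-justify an integer to 5,
# and print a float in 6 columns with 2 decimals.
# Assumes a locale with '.' as the decimal point.
row=$(printf '%-8s|%5d|%6.2f' lorem 2777 3.14159)

echo "$list"
echo "$row"
```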



In [None]:
cd ~/proj/shell

printf '%s' "Hello World"

# Escaping a single quote is still a pain in the arse.
printf '%s' '"It'"'"'s cold" said Tom'

# du -hb prints "size<TAB>name"; field 4 of `stat -t` is the raw mode in hex.
for file in data/*.txt; do
    printf "%s\t%s\n" "$(du -hb "$file")" "$(stat -t "$file" | cut -d' ' -f4)"
done



Hello World
"It's cold" said Tom
502	data/bronze.txt	81b4
0	data/cat.txt	81b4
572	data/firstname_raw.txt	81b4
265	data/firstname.txt	81b4
2777	data/lorem.txt	81b4
695	data/surname_raw.txt	81b4
136	data/surname.txt	81b4
173	data/tree.txt	81b4


## TODO

### Utils
 - nl
 - shuf
 - tree
 - column
 - file
 - expr
 - look
 - yes
 - factor
 - tac
 - rev

### Separate Chapters
 - bc
 - awk
 - jq

### Maybe? ###
 - expr
 - m4
 - ed
 - (scripting with vim -c)
 - tclsh/wish
 