# Notes from DataQuest Courses

This is mostly for non-Python topics.

## Useful Shell Commands

`column -t -s"," file.csv`

Pretty-prints the CSV file `file.csv` as an aligned table, using `","` as the column separator.

`shuf -n 10 file.csv`

Samples 10 lines at random from `file.csv` (without replacement) and prints them.

`cat *.csv`

Concatenates and outputs the content of all CSV files in the current directory. The output can be redirected into a new CSV file.

`cut -d"," -f1,3-5,8,12`

Selects and outputs columns 1, 3, 4, 5, 8, 12 (starting from 1), given that the column separator is `","`. Useful to select certain columns from a data file.

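**Note**: `cut` cannot change the ordering of columns. Selected columns are always output in the order in which they appear in the input, regardless of how they are listed after `-f` (so `-f3,1` gives the same result as `-f1,3`).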

**Note**: `cut` also displays rows which do not contain the separator, unless the `-s` option is used.
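Both notes can be checked on a small file. The following is a minimal sketch; the file `people.csv` and its contents are made up for illustration. The `awk` line at the end is not part of the `cut` toolset, it is shown only as one possible way to reorder columns when that is needed.

```shell
# Hypothetical sample data: the last line contains no comma at all.
tmpdir=$(mktemp -d)
printf 'id,name,city\n1,Ann,Oslo\n2,Bob,Rome\nno separator here\n' > "$tmpdir/people.csv"

# Fields come out in file order, even though 3 is listed before 1.
# The line without a separator is passed through unchanged.
cut_unordered=$(cut -d"," -f3,1 "$tmpdir/people.csv")

# With -s, lines that lack the separator are dropped.
cut_strict=$(cut -s -d"," -f3,1 "$tmpdir/people.csv")

# One way to really reorder columns: awk (skips lines with fewer than 3 fields).
reordered=$(awk -F"," -v OFS="," 'NF >= 3 {print $3, $1}' "$tmpdir/people.csv")

printf '%s\n\n%s\n\n%s\n' "$cut_unordered" "$cut_strict" "$reordered"
rm -rf "$tmpdir"
```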

`tail -n+2 file.csv`

Display all rows except for the first one (i.e., starting from 2nd). This is useful to remove the column header line of a CSV file.

## Sorting Files in the Shell

This is done with the `sort` command.

`sort file.csv`

Sort lines of `file.csv` in lexicographic order (ascending). Use `-r` for reverse order (descending). Use `-u` in order to remove duplicate lines.

`sort -u *.csv`

Concatenate all CSV files in the current directory, sort rows, and remove all duplicates.
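If the CSV files carry a header line, `sort -u *.csv` treats it like any other row. A possible workaround, sketched below with two made-up files `part1.csv` and `part2.csv`, is to strip each header with `tail -n+2`, sort the remaining rows, and put a single header back on top. This assumes all files share the same header.

```shell
tmpdir=$(mktemp -d)
printf 'id,score\n3,0.5\n1,0.9\n' > "$tmpdir/part1.csv"
printf 'id,score\n1,0.9\n2,0.7\n' > "$tmpdir/part2.csv"

# Naive merge: the duplicate header is removed, but the remaining
# header line is sorted in among the data rows.
naive=$(cd "$tmpdir" && sort -u *.csv)

# Header-aware merge: take the header from one file, strip it from all
# files with tail -n+2, then sort and deduplicate only the data rows.
header=$(head -n 1 "$tmpdir/part1.csv")
rows=$(cd "$tmpdir" && for f in *.csv; do tail -n+2 "$f"; done | sort -u)
merged=$(printf '%s\n%s' "$header" "$rows")

printf '%s\n' "$merged"
rm -rf "$tmpdir"
```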

For a CSV file, we can specify the sort order column by column. Examples:

`sort -t"," -k2,2gr -k4,4g example_data_no_header.csv`

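The column separator is `","`. Sorts by the 2nd column first, interpreted as a number and in reverse (descending) order, then breaks ties by the 4th column, interpreted as a number and in ascending order. Here, `g` (general numeric) switches from lexicographic comparison to comparison between numbers, and `r` reverses the order for that key only.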

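In fact, `-k` takes a range of columns: `-k4,5` uses a key that runs from the start of the 4th column to the end of the 5th. This is why a single column is written as `-k2,2`; a bare `-k2` would extend the key from the 2nd column to the end of the line.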
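The difference between lexicographic and numeric keys is easiest to see on a tiny file. This is a sketch on made-up data, assuming a `sort` that supports `-g` (GNU coreutils does); the locale is pinned only to make the result reproducible.

```shell
# Pin the locale so "." is the decimal point and text compares bytewise.
export LC_ALL=C

tmpdir=$(mktemp -d)
data="$tmpdir/example.csv"   # hypothetical data, no header line
printf 'a,10,x,2.5\nb,9,x,4\nc,10,x,1.5\nd,2,x,7\n' > "$data"

# Lexicographic, descending on column 2: as text, "9" sorts above "10".
lex_top=$(sort -t"," -k2,2r "$data" | head -n 1)

# Numeric: column 2 descending, ties broken by column 4 ascending.
numeric=$(sort -t"," -k2,2gr -k4,4g "$data")

echo "top row with text comparison: $lex_top"
printf '%s\n' "$numeric"
rm -rf "$tmpdir"
```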

## Pipelining and Redirection

* Use `>filename` to write the output (stdout) into file `filename`, overwriting the previous content (if any)
* Use `>>filename` to append the output (stdout) to file `filename`
* Use `2>filename` to write error output (stderr) into file `filename`, overwriting the previous content (if any)
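The three operators can be tried out in a scratch directory. A minimal sketch; the file names `out.txt` and `err.txt` are arbitrary.

```shell
tmpdir=$(mktemp -d)
out="$tmpdir/out.txt"
err="$tmpdir/err.txt"

echo "first"  > "$out"    # creates out.txt
echo "second" > "$out"    # overwrites it: "first" is gone
echo "third" >> "$out"    # appends a second line

# ls on a nonexistent path writes its complaint to stderr, which lands
# in err.txt instead of on the terminal. "|| true" ignores the exit code.
ls "$tmpdir/does-not-exist" 2> "$err" || true

out_content=$(cat "$out")
err_content=$(cat "$err")
printf 'out.txt:\n%s\nerr.txt:\n%s\n' "$out_content" "$err_content"
rm -rf "$tmpdir"
```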

Examples for pipelining:

`ls -l /bin | tail -n+2 | wc -l`

Counts the entries listed by `ls -l /bin`. The `tail` command removes the first line of the `ls -l` output (the `total` summary line), so that only the entries themselves are counted.

`ls -l /bin | grep "^d" | wc -l`

Counts the number of directories listed by `ls -l /bin`.
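To see that both pipelines count what they claim, they can be run against a directory with known content. A sketch using a throwaway directory with two subdirectories and three files (all names made up):

```shell
tmpdir=$(mktemp -d)
mkdir "$tmpdir/dir_a" "$tmpdir/dir_b"
touch "$tmpdir/f1.txt" "$tmpdir/f2.txt" "$tmpdir/f3.txt"

# All entries: drop the leading "total ..." line, then count lines.
n_entries=$(ls -l "$tmpdir" | tail -n+2 | wc -l)

# Directories only: in ls -l output their mode string starts with "d".
n_dirs=$(ls -l "$tmpdir" | grep "^d" | wc -l)

echo "entries: $n_entries, directories: $n_dirs"
rm -rf "$tmpdir"
```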

Streams are typically numbered 0 (stdin), 1 (stdout), 2 (stderr). This is why `2>` redirects `stderr` to a file. Note that `>` is short for `1>`.

`command >filename 2>&1`

Redirects both `stdout` and `stderr` to the same file `filename`. This works by copying the file descriptor 1 to 2. Note that the order matters here:

`command 2>&1 >filename`

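does the same as `command >filename`, as long as both 1 and 2 initially point to the terminal. The redirections are processed from left to right: `2>&1` first points 2 at whatever 1 currently refers to (the terminal, so nothing changes), and only then is 1 redirected to `filename`. Error output therefore still appears on the terminal. In the first example, file descriptor 1 was redirected to `filename` before being copied to 2, so both streams end up in the file.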
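The effect of the ordering can be observed with a small test command. In this sketch, `emit` is a hypothetical helper that writes one line to each stream. The second call runs inside `$(...)`, so "the current stdout" is the capture pipe rather than the terminal, which makes the leaked error output visible in a variable.

```shell
tmpdir=$(mktemp -d)

# Hypothetical helper: one line to stdout, one line to stderr.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

# stdout is sent to the file first, then stderr is pointed at the same file.
emit > "$tmpdir/both.txt" 2>&1
both=$(cat "$tmpdir/both.txt")

# stderr is pointed at the current stdout (the capture pipe) first,
# then only stdout is moved to the file.
leaked=$(emit 2>&1 > "$tmpdir/stdout_only.txt")
stdout_only=$(cat "$tmpdir/stdout_only.txt")

echo "both.txt holds:        $(echo "$both" | tr '\n' ';')"
echo "stdout_only.txt holds: $stdout_only"
echo "escaped the file:      $leaked"
rm -rf "$tmpdir"
```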