scut, cols, and stats

by Harry Mangalam harry.mangalam@uci.edu updated Oct 21, 2020

Summary

'scut' is a short Perl script that acts as a better (if slower) 'cut': it
extracts arbitrary columns, selected using regexes you supply. It also has a 'join' function not unlike the *nix 'join' command (search for 'join').
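
As a rough illustration of regex-driven column selection (not scut's actual code), the Python sketch below splits each line on a regex delimiter and emits the requested fields in the requested order. The function name `select_columns`, its parameters, and the 0-based field numbering are assumptions made for this sketch only.

```python
import re

def select_columns(line, delim_regex=r"\s+", fields=(0,)):
    """Split `line` on a regex delimiter and return the requested fields.

    Field numbers are 0-based here (an assumption of this sketch);
    fields that do not exist on a given line are silently skipped.
    """
    parts = re.split(delim_regex, line.strip())
    return [parts[i] for i in fields if i < len(parts)]

rows = [
    "alpha,  12;  x",
    "beta,7;y",
]
for row in rows:
    # Delimiter: a comma or semicolon followed by optional whitespace.
    # Fields are requested out of order, which plain 'cut' cannot do.
    print("\t".join(select_columns(row, r"[,;]\s*", fields=(2, 0))))
```

Reordering fields and splitting on a pattern rather than a single character are the kinds of things that are awkward with plain 'cut'.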

Being unoptimized Perl, it is considerably slower than 'cut', but it can do things that 'cut' can't. If you have 100s of GB of input to slice & dice, it may be worthwhile to spend some time learning the finer points of 'cut' and 'awk'; but if you just need to chew thru 100s of MB to GBs of complex text, scut may be of interest.

In addition to scut, there are 2 other small utilities here.

'cols' is a utility to 'columnize' lots of irregularly spaced data, developed with and often used with scut. It is similar to column/columns.
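
To make 'columnize' concrete, here is a minimal Python sketch of the general idea: split irregularly spaced lines into cells, measure the widest cell in each column, and pad every cell to that width. It is not the 'cols' implementation; `columnize`, `sep`, and `gap` are hypothetical names chosen for this example.

```python
def columnize(lines, sep=None, gap=2):
    """Return `lines` re-spaced so that each column lines up.

    sep=None splits on any run of whitespace (str.split default);
    `gap` is the number of spaces placed between columns.
    """
    rows = [line.split(sep) for line in lines if line.strip()]
    if not rows:
        return []
    ncols = max(len(row) for row in rows)
    widths = [0] * ncols
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(cell))
    spacer = " " * gap
    return [
        spacer.join(cell.ljust(widths[i]) for i, cell in enumerate(row)).rstrip()
        for row in rows
    ]

for out in columnize(["a   bb c", "dddd e   ff"]):
    print(out)
```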

Both 'scut' and 'cols' are documented in the included scut_cols_HOWTO.html.

'stats' is another Perl utility that consumes all numeric-like data fed to it via STDIN and emits some useful descriptive statistics. 'stats -h' will give you all the help you need. It can also stream-transform numeric input and either compute stats on the transformed data or print the transformed values to STDOUT without the stats calculation.

Those transforms are: log10, ln, sqrt, x^2, x^3, 1/x, sin, cos, tan, asin, acos, atan, round, abs, exp, trunc (integer part), frac (decimal part)

It can also emit only the value you're interested in. For example, if you only want the median, pipe a stream of numbers into 'stats --median' and it will print the median as a single unadorned value.
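
The transform-then-summarize flow described above can be sketched in Python as follows. This is only an illustration of the idea, not the 'stats' source: the `describe` function and the `TRANSFORMS` table are hypothetical, only a few of the listed transforms are included, and the use of the sample (n-1) standard deviation is an assumption.

```python
import math
import statistics

# A small subset of the transforms listed above, keyed by name.
TRANSFORMS = {
    "log10": math.log10,
    "ln": math.log,
    "sqrt": math.sqrt,
    "abs": abs,
}

def describe(values, xf=None):
    """Optionally transform `values`, then return a few descriptive stats."""
    if xf is not None:
        values = [TRANSFORMS[xf](v) for v in values]
    n = len(values)
    std_dev = statistics.stdev(values)  # sample std dev (assumption)
    return {
        "Sum": sum(values),
        "Number": n,
        "Mean": statistics.fmean(values),
        "Median": statistics.median(values),
        "Min": min(values),
        "Max": max(values),
        "Std_Dev": std_dev,
        "SEM": std_dev / math.sqrt(n),
    }

summary = describe([1, 10, 100], xf="log10")
# Emit a single unadorned value, in the spirit of asking for only the median.
print(summary["Median"])
```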

Recent Changes

Apr 1, 2023

added sample size estimation: --sample=#, where # is the Margin Of Error that you estimate for the sample population. It uses the input number pool to estimate the std_dev and takes the confidence level you'll be using via --conf (e.g. --conf=90; default 95%).
added '--all' to print all the descriptive stats, otherwise the output is still as above.
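
For readers unfamiliar with sample size estimation, the Python sketch below uses the standard textbook formula for estimating a mean, n = (z * s / E)^2, where s is the standard deviation estimated from the input pool, E is the margin of error, and z is the normal quantile for the chosen confidence level. Whether 'stats' uses exactly this formula is an assumption; `sample_size` and its parameters are hypothetical names.

```python
import math
from statistics import NormalDist, stdev

def sample_size(pool, margin_of_error, conf=95.0):
    """Estimate how many samples are needed to pin down a mean.

    pool            -- numbers used to estimate the standard deviation
    margin_of_error -- acceptable +/- error, in the same units as the data
    conf            -- two-sided confidence level in percent (default 95)
    """
    s = stdev(pool)
    # Two-sided quantile: 95% -> 0.975 -> z of about 1.96
    z = NormalDist().inv_cdf(0.5 + conf / 200.0)
    return math.ceil((z * s / margin_of_error) ** 2)

pool = [1, 2, 3, 4, 5]
print(sample_size(pool, 0.5))           # default 95% confidence
print(sample_size(pool, 0.5, conf=90))  # lower confidence needs fewer samples
```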

e.g., calculate the file size distribution in the current directory:

# try the following with and without the '--xf=ln' and the '--raw'

   $ ls -l | awk '{print $5}' | stats --raw --xf=ln --dist=2 --x=20 --y=10
or $ ls -l | scut -f=4        | stats --dist=2 --x=20 --y=10

# which yields:

Sum       694.1
Number    172
Mean      4.03546
Median    4.14357
Mode      3.61236
NModes    11
Min       0
Max       8.45349
Range     8.45349
Variance  2.41354
Std_Dev   1.55356
SEM       0.118458
95% Conf  3.80329 to 4.26764
          (for a normal distribution - see skew)
Skew      -0.300842
          (skew = 0 for a symmetric dist)
Std_Skew  -1.61074
Kurtosis  0.483482
          (K=3 for a normal dist)


Distribution
X BinSize 0.422674703876582
Y BinSize  2.66666666666667

YMax:24
      |           *        
      |         *          
      |        *           
      |            *       
      |     *    *         
      |                    
      |      **     *      
      |    *          *    
      |**            *     
      |  **            ****
      |--------------------
  X Min               X Max
   0.00                8.45
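
The figures in the sample output above are consistent with the usual relations SEM = Std_Dev / sqrt(Number) and 95% Conf = Mean +/- 1.96 * SEM. The X bin size equals Range divided by the --x=20 value, and the Y bin size equals YMax divided by (10 - 1) for --y=10. The Python sketch below just rechecks that arithmetic from the printed numbers; it is an observation about the output, not a description of how 'stats' computes it, and `sem_and_ci` is a hypothetical helper.

```python
import math

def sem_and_ci(mean, std_dev, n, z=1.96):
    """Return (SEM, lower, upper) for a normal-theory confidence interval."""
    sem = std_dev / math.sqrt(n)
    return sem, mean - z * sem, mean + z * sem

# Values copied from the sample output above.
sem, lo, hi = sem_and_ci(mean=4.03546, std_dev=1.55356, n=172)
x_bin = 8.45349 / 20   # Range / number of X bins (--x=20)
y_bin = 24 / (10 - 1)  # YMax / (rows - 1) for --y=10

print(f"SEM      {sem:.6f}")
print(f"95% Conf {lo:.4f} to {hi:.4f}")
print(f"X BinSize {x_bin:.6f}")
print(f"Y BinSize {y_bin:.6f}")
```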
