Skip to content

scut is a better (if slower) cut, extracts arbitrary columns to be selected based on regexes, also has join fn().

Notifications You must be signed in to change notification settings

hjmangalam/scut

Repository files navigation

scut, cols, and stats

by Harry Mangalam harry.mangalam@uci.edu updated Oct 21, 2020

Summary

'scut' is a short perl script that acts as a better (if slower) 'cut', and
extracts arbitrary columns to be selected based on regexes you supply. It also has a 'join' function not unlike the *nix join (search for 'join') command.

Being unoptimized perl, it is considerably slower than 'cut' but it can do things that cut can't do, so if you have 100s of GB of input to slice & dice, it may be worthwhile to spend some time learning the finer points of 'cut' and 'awk', but it you just need to chew thru 100s of MB to GBs of complex text, scut may be of interest.

In addition to scut, there are 2 other small utilities here.

'cols' is a utility to 'columnize' lots of irregularly spaced data, developed with and often used with scut. It is similar to column/columns.

Both 'scut' and 'cols' are documented in the included scut_cols_HOWTO.html

'stats' is another Perl utility to consume all numeric-like data fed to it via STDIN and emit some useful descriptive statistics. 'stats -h' will give you all the help you need. It also has the ability to stream-transform numeric input and apply stats on those transformed data or print it to STDOUT without the stats calculation.

Those transforms are: log10, ln, sqrt, x^2, x^3, 1/x, sin, cos, tan, asin, acos, atan, round, abs, exp, trunc (integer part), frac (decimal part)

It can also emit only the value you're interested in. So if you only want the 'Median', if you pipe some stream of numbers | 'stats --median', it will provide an unadorned single value of the median.

Recent Changes

Apr 1 , 2023

  • added sample size estimation: --sample=#, where # is the Margin Of Error that you estimate in the sample population. Uses the input number pool to estimate the std_dev, and requires you provide the confidence interval (ex --conf=90) you'll be using (default 95%).

  • added '--all' to print all the descriptive stats, otherwise the output is still as above.

eg, calculate the file size distribution in the current directory:

# try the following with and without the '--xf=ln' and the '--raw'

   $ ls -l | awk '{print $5}' |stats --raw --xf=ln --dist=2 --x=20 --y=10
or $ ls -l | scut -f=4        |stats --dist=2 --x=20 --y=10

# which yields:

Sum       694.1
Number    172
Mean      4.03546
Median    4.14357
Mode      3.61236
NModes    11
Min       0
Max       8.45349
Range     8.45349
Variance  2.41354
Std_Dev   1.55356
SEM       0.118458
95% Conf  3.80329 to 4.26764
          (for a normal distribution - see skew)
Skew      -0.300842
          (skew = 0 for a symmetric dist)
Std_Skew  -1.61074
Kurtosis  0.483482
          (K=3 for a normal dist)


Distribution
X BinSize 0.422674703876582
Y BinSize  2.66666666666667

YMax:24
      |           *        
      |         *          
      |        *           
      |            *       
      |     *    *         
      |                    
      |      **     *      
      |    *          *    
      |**            *     
      |  **            ****
      |--------------------
  X Min               X Max
   0.00                8.45 


About

scut is a better (if slower) cut, extracts arbitrary columns to be selected based on regexes, also has join fn().

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published