Get basic R statistical functions working in Perl as if they were part of List::Util, like min, max, sum, etc.
I've used Artificial Intelligence tools such as Claude, Gemini, and Grok to write this, as well as my own gray matter.
There are other similar tools on CPAN, but I wanted speed and a List::Util-like interface, which I've achieved here with the help of AI (which often required many attempts to get right).
The subroutines are called directly through eXternal Subroutines (XS) for performance and portability.
Other modules on CPAN can do PARTS of this, but this one works the way that I want it to.
aov(
{
yield => [5.5, 5.4, 5.8, 4.5, 4.8, 4.2],
ctrl => [1, 1, 1, 0, 0, 0]
},
'yield ~ ctrl');
which returns
{
ctrl {
Df 1,
"F value" 25.6000000000001,
"Mean Sq" 1.70666666666667,
Pr(>F) 0.00718232855871859,
"Sum Sq" 1.70666666666667
},
Residuals {
Df 4,
"Mean Sq" 0.0666666666666665,
"Sum Sq" 0.266666666666666
}
}
You can also perform Two-Way ANOVA with categorical interactions using the * operator. The parser will implicitly evaluate the main effects alongside the interaction:
my $res_2way = aov($data_2way, 'len ~ supp * dose');
It is robust against rank deficiency; collinear terms will gracefully receive 0 degrees of freedom and 0 sum of squares, matching R's behavior.
my @test_data = ([762, 327, 468], [484, 239, 477]);
my $test_data = chisq_test(\@test_data);
which outputs:
{
data.name "Perl ArrayRef",
expected [
[0] [
[0] 703.671381936888,
[1] 319.645266594124,
[2] 533.683351468988
],
[1] [
[0] 542.328618063112,
[1] 246.354733405876,
[2] 411.316648531012
]
],
method "Pearson's Chi-squared test",
observed [
[0] [
[0] 762,
[1] 327,
[2] 468
],
[1] [
[0] 484,
[1] 239,
[2] 477
]
],
p.value 2.95358918321176e-07,
parameter {
df 2
},
statistic {
X-squared 30.0701490957547
}
}
It also supports 1D arrays for Goodness of Fit tests:
my $chisq_1d = chisq_test([10, 20, 30]);
For 2x2 matrices, Yates' Continuity Correction is applied automatically, exactly like in R.
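As an illustration of that correction (a sketch, not the module's internals): for a 2x2 table the statistic replaces each (O - E)^2 / E term with (|O - E| - 0.5)^2 / E.

```perl
# Illustrative only: Pearson's chi-squared statistic for a 2x2 table
# with Yates' continuity correction, the same adjustment chisq_test
# applies automatically for 2x2 input.
use strict;
use warnings;

my @obs = ([10, 2], [3, 15]);
my @row = (0, 0);
my @col = (0, 0);
my $n   = 0;
for my $i (0, 1) {
    for my $j (0, 1) {
        $row[$i] += $obs[$i][$j];
        $col[$j] += $obs[$i][$j];
        $n       += $obs[$i][$j];
    }
}
my $chi2 = 0;
for my $i (0, 1) {
    for my $j (0, 1) {
        my $e    = $row[$i] * $col[$j] / $n;       # expected count
        my $diff = abs($obs[$i][$j] - $e) - 0.5;   # Yates' correction
        $chi2   += $diff**2 / $e;
    }
}
printf "X-squared = %.4f\n", $chi2;   # about 10.46
```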
cor($array1, $array2, $method = 'pearson'),
that is, pearson is the default and will be used if $method is not specified.
Just like in R, the pearson, spearman, and kendall methods are available.
If you provide an array of arrays (a matrix), cor will compute the correlation matrix automatically.
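For reference, the default pearson method computes the familiar product-moment coefficient; here is a minimal pure-Perl sketch of that calculation (illustrative, not the XS code itself):

```perl
# Illustrative pure-Perl version of the default 'pearson' method:
# covariance over the product of the standard deviations.
use strict;
use warnings;

my @x = (1, 2, 3, 4, 5);
my @y = (2, 4, 5, 4, 5);
my $n = @x;
my ($sx, $sy) = (0, 0);
$sx += $_ for @x;
$sy += $_ for @y;
my ($mean_x, $mean_y) = ($sx / $n, $sy / $n);
my ($sxy, $sxx, $syy) = (0, 0, 0);
for my $i (0 .. $n - 1) {
    $sxy += ($x[$i] - $mean_x) * ($y[$i] - $mean_y);
    $sxx += ($x[$i] - $mean_x)**2;
    $syy += ($y[$i] - $mean_y)**2;
}
my $r = $sxy / sqrt($sxx * $syy);
printf "r = %.7f\n", $r;   # 0.7745967, matching R's cor(c(1,2,3,4,5), c(2,4,5,4,5))
```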
my $result = cor_test(
'x' => $x,
'y' => $y,
alternative => 'two.sided',
method => 'pearson',
continuity => 1
);
cor_test handles undef (or NA) values seamlessly by computing over pairwise-complete observations.
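In other words, pairs with an undef on either side are dropped before the statistic is computed. A hypothetical sketch of that filtering step (not the actual implementation):

```perl
# Illustrative only: keep a pair only when both sides are defined,
# which is what "pairwise complete observations" means.
use strict;
use warnings;

my @x = (1, 2, undef, 4, 5);
my @y = (2, undef, 3, 8, 10);
my (@cx, @cy);
for my $i (0 .. $#x) {
    next unless defined $x[$i] and defined $y[$i];
    push @cx, $x[$i];
    push @cy, $y[$i];
}
print "kept ", scalar @cx, " of ", scalar @x, " pairs\n";   # kept 3 of 5 pairs
```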
cov($array1, $array2, 'pearson')
or
cov($array1, $array2, 'spearman')
or
cov($array1, $array2, 'kendall')
my $array_data = [
[10, 2],
[3, 15]
];
my $res1 = fisher_test($array_data);
which returns a hash reference:
{
alternative "two.sided",
conf_int [
[0] 2.75338278824932,
[1] 301.462337971516
],
estimate {
"odds ratio" 21.3053175567504
},
method "Fisher's Exact Test for Count Data",
p_value 0.00053672411914343
}
$ft = fisher_test( {
Guess => {
Milk => 3, Tea => 1
},
Truth => {
Milk => 1, Tea => 3
}
});
I have the p-value calculated very precisely, but there is some inexactness (approximately 1%) in the confidence intervals which I couldn't rectify. The answers are very close to R's, and the p-values are identical.
takes a hash of arrays as input
my %tooth_growth = (
dose => [qw(0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0)],
len => [qw(4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0 16.5 16.5 15.2 17.3 22.5
17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5
15.2 21.5 17.6 9.7 14.5 10.0 8.2 9.4 16.5 9.7 19.7 23.3 23.6 26.4 20.0
25.2 25.8 21.2 14.5 27.3 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3 29.4 23.0)],
supp => [qw(VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC
VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ)]
);
my $glm_teeth = glm(
data => \%tooth_growth,
formula => 'len ~ dose + supp',
family => 'gaussian'
);
I'm not completely confident that this is working perfectly, though I've gotten this subroutine to work for simple cases.
In addition to the gaussian default, it fully supports logistic regression using the binomial family parameter via Iteratively Reweighted Least Squares (IRLS):
my $glm_bin = glm(formula => 'am ~ wt + hp', data => \%mtcars, family => 'binomial');
Computes the histogram of the given data values.
my $res = hist([1, 2, 2, 3, 3, 3, 4, 4, 5], breaks => 4);
If breaks is not explicitly provided, it defaults to calculating the number of bins using Sturges' formula.
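Sturges' formula itself is simple; this sketch assumes the usual ceil(log2(n)) + 1 form of the fallback bin count:

```perl
# Illustrative: Sturges' formula for the default number of histogram
# bins, k = ceil(log2(n)) + 1, where n is the number of observations.
use strict;
use warnings;
use POSIX qw(ceil);

sub sturges {
    my $n = shift;
    return ceil(log($n) / log(2)) + 1;   # Perl's log() is natural log
}

print sturges(9),   "\n";   # 5
print sturges(100), "\n";   # 8
```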
Performs a Kruskal-Wallis rank sum test; see https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kruskal.test
Essentially, the test determines whether all groups have the same median, i.e. the same distribution (an excellent review is at https://library.virginia.edu/data/articles/getting-started-with-the-kruskal-wallis-test)
my @xk = (2.9, 3.0, 2.5, 2.6, 3.2); # normal subjects
my @yk = (3.8, 2.7, 4.0, 2.4); # with obstructive airway disease
my @zk = (2.8, 3.4, 3.7, 2.2, 2.0); # with asbestosis
my @x = (@xk, @yk, @zk);
my @g = (
(map {'Normal subjects'} 0..4),
(map {'Subjects with obstructive airway disease'} 0..3),
map {'Subjects with asbestosis'} 0..4
);
my $t0 = Time::HiRes::time();
my $kt = kruskal_test(\@x, \@g);
my $t1 = Time::HiRes::time();
printf("Kruskal calculation in %g seconds.\n", $t1-$t0);
p $kt;
I feel that this is better, and more easily read, than what you get in R:
my %x = (
'normal.subjects' => [2.9, 3.0, 2.5, 2.6, 3.2],
'obs. airway disease' => [3.8, 2.7, 4.0, 2.4],
'asbestosis' => [2.8, 3.4, 3.7, 2.2, 2.0]
);
$t0 = Time::HiRes::time();
$kt = kruskal_test(\%x);
$t1 = Time::HiRes::time();
printf("Kruskal calculation via HoA in %g seconds.\n", $t1-$t0);
p $kt;
The Kolmogorov-Smirnov test, which tests whether two arrays/lists of data come from the same distribution, is implemented simply:
$ks = ks_test(\@x, \@y, alternative => 'greater');
returning a hash reference.
Also, a single array can be tested against a normal distribution:
$ks = ks_test($ksx, 'pnorm');
The p-value precision is about 1e-8, which I want to improve, but am not sure how.
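For the two-sample case, the underlying statistic D is the largest absolute gap between the two empirical CDFs. A pure-Perl sketch for distinct values (illustrative, not the module's code; ties across samples would need extra care):

```perl
# Illustrative: the two-sample Kolmogorov-Smirnov statistic D, found
# with a single merge pass over the two sorted samples.
use strict;
use warnings;

sub ks_D {
    my ($u, $v) = @_;
    my @s = sort { $a <=> $b } @$u;
    my @t = sort { $a <=> $b } @$v;
    my ($i, $j, $d) = (0, 0, 0);
    while ($i < @s and $j < @t) {
        if ($s[$i] <= $t[$j]) { $i++ } else { $j++ }
        my $gap = abs($i / @s - $j / @t);   # gap between empirical CDFs
        $d = $gap if $gap > $d;
    }
    return $d;
}

print ks_D([1, 2, 3, 4], [5, 6, 7, 8]), "\n";   # 1 (fully separated samples)
print ks_D([1, 3, 5, 7], [2, 4, 6, 8]), "\n";   # 0.25
```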
This is the linear models function.
$lm = lm(formula => 'mpg ~ wt + hp', data => $mtcars);
where $mtcars is a hash of hashes
lm also supports generating interaction terms directly within the formula using the * operator:
my $lm = lm(formula => 'mpg ~ wt * hp^2', data => \%mtcars);
If your data contains missing numbers (NA or undef), lm handles listwise deletion dynamically to ensure mathematical integrity before fitting.
the dot operator also works:
$lm = lm(formula => 'y ~ .', data => $dot_data);
my $mat1 = matrix(
data => [1..6],
nrow => 2
);
You can also pass byrow => 1 if you want the matrix populated row-wise instead of column-wise.
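To make the fill order concrete, here is an illustrative pure-Perl sketch (the fill_matrix helper is hypothetical, not part of the module): with data => [1..6] and nrow => 2, column-wise fill gives rows [1,3,5] and [2,4,6], while byrow => 1 gives [1,2,3] and [4,5,6], matching R's matrix().

```perl
# Illustrative: column-wise (default) versus row-wise matrix fill.
use strict;
use warnings;

sub fill_matrix {
    my ($data, $nrow, $byrow) = @_;
    my $ncol = @$data / $nrow;
    my @m;
    for my $k (0 .. $#$data) {
        my ($i, $j) = $byrow ? (int($k / $ncol), $k % $ncol)    # fill across rows
                             : ($k % $nrow,      int($k / $nrow)); # fill down columns
        $m[$i][$j] = $data->[$k];
    }
    return \@m;
}

my $colwise = fill_matrix([1 .. 6], 2, 0);
my $rowwise = fill_matrix([1 .. 6], 2, 1);
print "@{ $colwise->[0] }\n";   # 1 3 5
print "@{ $rowwise->[0] }\n";   # 1 2 3
```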
mean(1,2,3);
or
my @arr = 1..8;
mean(@arr, 4, 5)
or
mean([1,1], [2,2]) # 1.5
works like mean, taking array references and arrays:
median( $test_data[$i][0] )
Returns array of false-discovery-rate-corrected p-values, where methods available are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr"
my @q = p_adjust(\@pvalues, $method);
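As an illustration of the "BH" (fdr) method (a sketch, not the module's implementation): sort the p-values in descending order, take the running minimum of p * n / rank, and report each result in the original position:

```perl
# Illustrative pure-Perl Benjamini-Hochberg adjustment.
use strict;
use warnings;
use List::Util qw(min);

sub bh_adjust {
    my @p    = @{ shift() };
    my $n    = @p;
    my @ord  = sort { $p[$b] <=> $p[$a] } 0 .. $n - 1;   # indices, p descending
    my @q;
    my $running = 1;
    my $rank    = $n;
    for my $idx (@ord) {
        my $val  = $p[$idx] * $n / $rank--;
        $running = min($running, $val);   # running minimum, capped at 1
        $q[$idx] = $running;
    }
    return \@q;
}

my $adj = bh_adjust([0.01, 0.02, 0.03, 0.04]);
printf "%.2f ", $_ for @$adj;   # 0.04 0.04 0.04 0.04, as in R's p.adjust(..., "BH")
print "\n";
```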
$test_data = power_t_test(
n => 30, delta => 0.5,
sd => 1.0, sig_level => 0.05
);
It also allows configuring the test type (type => 'one.sample', 'two.sample', 'paired') and alternative hypothesis (alternative => 'one.sided'). You can also pass strict => 1 to strictly evaluate both tails of the distribution.
Calculates sample quantiles using R's continuous Type 7 interpolation.
my $quantile = quantile('x' => [1..99], probs => [0.05, 0.1, 0.25]);
If the probs parameter is omitted, it behaves identically to R by defaulting to the 0, 25, 50, 75, and 100 percentiles (c(0, .25, .5, .75, 1)). The returned hash keys match R's standardized naming convention (e.g., "25%", "33.3%").
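Type 7 interpolation is easy to state: with sorted data and h = (n - 1) * p, interpolate linearly between the two order statistics that bracket h. A sketch (the quantile7 helper is hypothetical, for illustration only):

```perl
# Illustrative: R's Type 7 continuous quantile for a single probability.
use strict;
use warnings;
use POSIX qw(floor);

sub quantile7 {
    my ($data, $p) = @_;
    my @x = sort { $a <=> $b } @$data;
    my $h  = $#x * $p;                          # fractional index, 0-based
    my $lo = floor($h);
    my $hi = $lo + 1 > $#x ? $#x : $lo + 1;
    return $x[$lo] + ($h - $lo) * ($x[$hi] - $x[$lo]);
}

printf "%g\n", quantile7([1 .. 99], 0.25);   # 25.5, as in R's quantile(1:99, .25)
printf "%g\n", quantile7([1 .. 99], 0.05);   # 5.9
```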
Create a binomial distribution of numbers
my $binom = rbinom( n => $n, prob => 0.5, size => 9);
It hooks directly into Perl's internal PRNG system, respecting srand() seeds.
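One way such a generator can sit on top of Perl's seeded rand() is to count successes over size Bernoulli trials; this is a hedged sketch of the idea, not necessarily the algorithm the module uses:

```perl
# Illustrative only: a binomial draw built on rand(), so srand() makes
# the sequence reproducible. The module's actual algorithm may differ.
use strict;
use warnings;

sub rbinom_one {
    my ($size, $prob) = @_;
    my $k = 0;
    $k += (rand() < $prob) ? 1 : 0 for 1 .. $size;
    return $k;
}

srand(42);   # reproducible across runs because rand() is seeded
my @draws = map { rbinom_one(9, 0.5) } 1 .. 5;
print "@draws\n";   # five integers, each between 0 and 9
```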
I've tried to make this as simple as possible, trying to follow from R:
my $test_data = read_table('t/HepatitisCdata.csv');
output types can be AoH (aoh), HoA (hoa), or HoH (hoh)
read_table($filename, 'output.type' => 'aoh');
read_table($filename, 'output.type' => 'hoa');
and, like Text::CSV_XS, filters can be applied in order to save RAM on big files:
$test_data = read_table(
't/HepatitisCdata.csv',
filter => {
Sex => sub {$_ eq 'f'} # where "Sex" is the column name, and "$_" is the value for that column
},
'output.type' => 'aoh'
);
Make a normal distribution of numbers, with preset mean (mean), standard deviation (sd), and count (n).
my ($rmean, $sd, $n) = (10, 2, 9999);
my $normals = rnorm( n => $n, mean => $rmean, sd => $sd);
Make an approximately uniform distribution of numbers
my $unif = runif( n => $n, min => 0, max => 1);
where n is the number of items and the values lie between min and max;
this is to match R's behavior:
runif( 9 )
will make 9 numbers in [0,1]
runif(9, 0, 99)
will match n, min, and max respectively
my @scaled_results = scale(1..5);
You can also pass an options hash to disable centering or scaling:
my @scaled_results = scale(1..5, { center => false, scale => true });
It fully supports matrix operations. By passing an array of arrays, scale processes the data column by column independently:
my $scaled_mat = scale([[1, 2], [3, 4], [5, 6]]);
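Per column (or for a plain vector), scaling amounts to subtracting the mean (centering) and dividing by the sample standard deviation; a pure-Perl sketch of that arithmetic:

```perl
# Illustrative: z-scores for a single vector, the per-column operation
# behind scale with the default center and scale options.
use strict;
use warnings;

my @x   = (1 .. 5);
my $n   = @x;
my $sum = 0;
$sum += $_ for @x;
my $mean = $sum / $n;                       # 3
my $ss   = 0;
$ss += ($_ - $mean)**2 for @x;              # 10
my $sd = sqrt($ss / ($n - 1));              # sqrt(2.5)
my @z  = map { ($_ - $mean) / $sd } @x;
printf "%.4f ", $_ for @z;   # -1.2649 -0.6325 0.0000 0.6325 1.2649
print "\n";
```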
my $stdev = sd(2,4,4,4,5,5,7,9);
The correct answer is 2.1380899352994.
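You can verify that value by hand: the sample standard deviation is the square root of the sum of squared deviations over n - 1, here sqrt(32/7):

```perl
# Checking sd(2,4,4,4,5,5,7,9) by hand in plain Perl.
use strict;
use warnings;

my @x   = (2, 4, 4, 4, 5, 5, 7, 9);
my $sum = 0;
$sum += $_ for @x;
my $mean = $sum / @x;               # 5
my $ss   = 0;
$ss += ($_ - $mean)**2 for @x;      # 32
my $sd = sqrt($ss / (@x - 1));      # sqrt(32/7)
printf "%.13f\n", $sd;              # 2.1380899352994
```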
Works as close to R's seq as I could make it; it is very similar to Perl's C-style for loops. Returns an array, not an array reference.
say 'seq(1, 5):';
my @seq = seq(1, 5);
say join(', ', @seq), "\n";
say 'seq(1, 2, 0.25):';
@seq = seq(1, 2, 0.25);
say join(", ", @seq), "\n";
for (my $idx = 2; $idx >= 1; $idx -= 0.25) { # count down to pop
is_approx(pop @seq, $idx, "seq item $idx with fractional step");
}
say 'seq(10, 5, -1):';
@seq = seq(10, 5, -1);
say join(", ", @seq), "\n";
for (my $idx = 5; $idx <= 10; $idx++) { # count up to match the popped values
is_approx(pop @seq, $idx, "seq item $idx with negative step");
}
Tests whether the data in an array reference are normally distributed; returns a p-value and a statistic.
my $shapiro = shapiro_test(
[1..5]
);
and returns the hash reference:
{
p.value 0.589650577093106,
p_value 0.589650577093106,
statistic 0.960870680168535,
W 0.960870680168535
}
Returns the sum, accepting both arrays and array references.
my $test_data = [1..8];
sum($test_data)
which I prefer to List::Util's required dereferencing:
sum(@{ $test_data });
The direct form is shorter and much easier to read.
There are 1-sample and 2-sample t-tests:
my $t_test = t_test( $test_data[$i][$j], mu => mean( $test_data[$i][$j] ));
or 2-sample:
$t_test = t_test(
$test_data[3][0],
$test_data[3][1],
paired => true
);
returns a hash reference, which looks like:
conf_int => [
-0.06672889, 0.25672889
],
df => 5,
estimate => 0.095,
p_value => 0.19143688433660,
statistic => 1.50996688705414
the two groups compared can optionally be labeled x and y, just like in R:
$t_test = t_test(
'x' => $test_data[3][0],
'y' => $test_data[3][1],
paired => true
);
as simple as possible:
var(2, 4, 5, 8, 9)
$test_data = wilcox_test(
[1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30],
[0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29]
);
It fully supports paired tests (paired => true) and can calculate exact p-values (the default for small samples, as in R).
mimics R's "write.table", with the data as the first argument to the subroutine and the output file as the second
write_table(\@data_aoh, $tmp_file, sep => "\t", 'row.names' => true);
You can also precisely filter and reorder which columns are written by passing an array reference to col.names:
write_table(\@data, $tmp_file, sep => "\t", 'col.names' => ['c', 'a']);