[GSOC]Descriptive Statistics command-line program #742

Merged
merged 7 commits into from Aug 8, 2016

3 participants

@keon
Member
keon commented Jul 27, 2016

I originally built a class that calculates descriptive statistics, but after a few discussions I shrank everything down to a minimal set of functions for better performance and maintainability.
I also merged all commits into one to discard the unnecessary ones.

Sample output on "iris.csv" would be:

[INFO ] dim     var     mean    std     median  min     max     range   skew    kurt    SE      
[INFO ] 0       0.6857  5.8433  0.8281  5.8000  4.3000  7.9000  3.6000  0.3149  -0.5521 0.0676  
[INFO ] 1       0.1880  3.0540  0.4336  3.0000  2.0000  4.4000  2.4000  0.3341  0.2908  0.0354  
[INFO ] 2       3.1132  3.7587  1.7644  4.3500  1.0000  6.9000  5.9000  -0.2745 -1.4019 0.1441  
[INFO ] 3       0.5824  1.1987  0.7632  1.3000  0.1000  2.5000  2.4000  -0.1050 -1.3398 0.0623  

Users can control the width and precision with the -w and -p flags.
I checked the output against Excel and the values match.

keon added some commits Jul 11, 2016
@keon keon add descriptive statistics executable 5aed5ba
@keon keon Merge branch 'master' of github.com:keonkim/mlpack 63d5959
@keon keon add descriptive statistics cli executable
27ac82e
@stereomatchingkiss stereomatchingkiss and 1 other commented on an outdated diff Jul 29, 2016
...lpack/methods/preprocess/preprocess_describe_main.cpp
+* Calculates the sum of deviations to the Nth Power
+*
+* @param input Vector that captures a dimension of a dataset
+* @param rowMean Mean of the given vector.
+* @return sum of nth power deviations
+*/
+double SumNthPowerDeviations(const arma::rowvec& input,
+ const double& rowMean,
+ const size_t Nth) // Degree of Power
+{
+ double sum = 0;
+ for (size_t i = 0; i < input.n_elem; ++i)
+ {
+ sum += pow(input(i) - rowMean, Nth);
+ }
+ return sum;
@stereomatchingkiss
stereomatchingkiss Jul 29, 2016 edited Contributor

This one could be vectorized. Whether performance improves would need to be measured, but it should make the code shorter and easier to read.

return arma::sum(arma::pow(input - rowMean,
                             static_cast<double>(Nth)));
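
For readers who want to check the arithmetic without Armadillo, here is a minimal sketch of the same reduction using only the standard library. The name `SumNthPowerDeviationsStd` and the use of `std::vector` are assumptions made for illustration; the PR itself operates on `arma::rowvec`.

```cpp
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for SumNthPowerDeviations that takes a std::vector
// instead of an arma::rowvec.
double SumNthPowerDeviationsStd(const std::vector<double>& input,
                                const double mean,
                                const std::size_t n)
{
  // Fold over the input, adding (x - mean)^n for each element.
  return std::accumulate(input.begin(), input.end(), 0.0,
      [mean, n](const double sum, const double x)
      {
        return sum + std::pow(x - mean, static_cast<double>(n));
      });
}
```

For the values 1, 2, 3, 4, 5 with mean 3, the second-power sum is 10 and the third-power sum is 0, which makes a quick sanity check for either version.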
@keon
keon Aug 6, 2016 Member

updated

@stereomatchingkiss stereomatchingkiss and 1 other commented on an outdated diff Jul 29, 2016
...lpack/methods/preprocess/preprocess_describe_main.cpp
+ * @param rowMean Mean of the given vector.
+ * @return Skewness of the given vector.
+ */
+double Skewness(const arma::rowvec& input,
+ const double& rowStd,
+ const double& rowMean,
+ const bool population)
+{
+ double skewness = 0;
+ double S3 = pow(rowStd, 3);
+ double M3 = SumNthPowerDeviations(input, rowMean, 3);
+ double n = input.n_elem;
+ if (population)
+ {
+ // Calculate Population Skewness
+ skewness = n * M3 / (n * n * S3);
@stereomatchingkiss
stereomatchingkiss Jul 29, 2016 Contributor

Why not just write

skewness = M3 / (n*S3);

Besides, according to the formula on this page, I think the equation for population skewness should be

std::sqrt(n) * M3 / (n*S3);

@keon
keon Aug 6, 2016 edited Member

@stereomatchingkiss I matched the result from Excel.
I believe this is equivalent to the "direct skewness formula" on that page.
I changed it to the simplified version you recommended.

@keon
keon Aug 6, 2016 Member

updated

@stereomatchingkiss
stereomatchingkiss Aug 6, 2016 Contributor

My fault, sorry for the misunderstanding. Yes, they are the same thing, as you mentioned.
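
The equivalence agreed on in this thread can be written out as follows. This is a sketch under two assumptions: that `M3` is the sum of cubed deviations, and that `rowStd` is the population standard deviation when the population flag is set (so that `S3` is its cube). The symbols $M_k$ and $\sigma$ are introduced here for illustration.

```latex
% M_k = \sum_i (x_i - \bar{x})^k, \qquad \sigma = \sqrt{M_2 / n}
\[
  \frac{M_3}{n\,\sigma^{3}}
  = \frac{M_3}{n\,(M_2/n)^{3/2}}
  = \frac{\sqrt{n}\,M_3}{M_2^{3/2}}
  = \frac{M_3/n}{(M_2/n)^{3/2}}
\]
```

The leftmost form is the simplified code expression `M3 / (n*S3)`; the rightmost is the usual moment-ratio definition of population skewness.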

@stereomatchingkiss stereomatchingkiss and 1 other commented on an outdated diff Jul 29, 2016
...lpack/methods/preprocess/preprocess_describe_main.cpp
+ const double& rowMean,
+ const bool population)
+{
+ double skewness = 0;
+ double S3 = pow(rowStd, 3);
+ double M3 = SumNthPowerDeviations(input, rowMean, 3);
+ double n = input.n_elem;
+ if (population)
+ {
+ // Calculate Population Skewness
+ skewness = n * M3 / (n * n * S3);
+ }
+ else
+ {
+ // Calculate Sample Skewness
+ skewness = n * M3 / ((n-1) * (n-2) * S3);
@stereomatchingkiss
stereomatchingkiss Jul 29, 2016 edited Contributor

As the page mentions, the equation should be

(n*std::sqrt(n-1)/std::sqrt(n-2)) * std::sqrt(n) * M3 / (n*S3)

Please correct me if I am wrong

@keon
keon Aug 6, 2016 Member

@stereomatchingkiss I matched the result from Excel.
This formula appears near the bottom of this page.

@stereomatchingkiss
stereomatchingkiss Aug 6, 2016 Contributor

@keonkim I think you are right; my mistake. Thanks for the correction.
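
To make the two branches concrete, here is a standalone sketch of both skewness formulas from the diff, written against `std::vector` rather than Armadillo. The names `SumPowDev` and `SkewnessSketch` are hypothetical, and the function computes the mean and standard deviation itself instead of receiving them as parameters.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sum of (x_i - mean)^p over the vector.
double SumPowDev(const std::vector<double>& x, const double mean, const int p)
{
  double sum = 0.0;
  for (const double v : x)
    sum += std::pow(v - mean, p);
  return sum;
}

// Skewness of x. With population == true the n-based formula is used;
// otherwise the sample correction is applied (needs at least 3 elements).
double SkewnessSketch(const std::vector<double>& x, const bool population)
{
  const double n = static_cast<double>(x.size());
  double mean = 0.0;
  for (const double v : x)
    mean += v / n;

  const double m2 = SumPowDev(x, mean, 2);
  const double m3 = SumPowDev(x, mean, 3);
  // Population std divides by n, sample std divides by n - 1.
  const double stdDev = std::sqrt(m2 / (population ? n : n - 1));
  const double s3 = std::pow(stdDev, 3);

  return population ? m3 / (n * s3)
                    : n * m3 / ((n - 1) * (n - 2) * s3);
}
```

The two results differ only by a factor that depends on n: the sample value equals the population value multiplied by sqrt(n * (n - 1)) / (n - 2), so they converge as n grows.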

@stereomatchingkiss stereomatchingkiss and 1 other commented on an outdated diff Jul 29, 2016
...lpack/methods/preprocess/preprocess_describe_main.cpp
+ double S4 = pow(rowStd, 4);
+ double norm3 = (3 * (n-1) * (n-1)) / ((n-2) * (n-3));
+ double normC = (n * (n+1))/((n-1) * (n-2) * (n-3));
+ double normM = M4 / S4;
+ kurtosis = normC * normM - norm3;
+ }
+ return kurtosis;
+}
+/**
+ * Calculates standard error of standard deviation.
+ *
+ * @param input Vector that captures a dimension of a dataset
+ * @param rowStd Standard Deviation of the given vector.
+ * @return Standard error of the standard deviation of the given vector.
+ */
+double StandardError(const arma::rowvec& input, const double rowStd)
@stereomatchingkiss
stereomatchingkiss Jul 29, 2016 edited Contributor

I think we could replace the rowvec parameter with a size_t, because this reduces the dependency on a specific type.
In a medium or large project, reducing the dependencies of your API makes the code easier to maintain.

P.S. This is not an issue in this small CLI program.

@keon
keon Aug 6, 2016 Member

updated
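
A sketch of what the narrower signature might look like. The body of `StandardError` is not shown in this thread, so the formula below is an assumption: it is the one the sample output is consistent with, since iris has 150 rows and 0.8281 / sqrt(150) is about 0.0676. The name `StandardErrorSketch` is hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical version that takes only the element count, so the function no
// longer depends on arma::rowvec.
// Assumed formula: standard deviation divided by the square root of n.
double StandardErrorSketch(const std::size_t n, const double stdDev)
{
  return stdDev / std::sqrt(static_cast<double>(n));
}
```

A caller holding an `arma::rowvec` would pass `row.n_elem` as the first argument.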

@stereomatchingkiss stereomatchingkiss and 1 other commented on an outdated diff Jul 29, 2016
...lpack/methods/preprocess/preprocess_describe_main.cpp
+}
+/**
+ * Calculates Skewness of the given vector.
+ *
+ * @param input Vector that captures a dimension of a dataset
+ * @param rowStd Standard Deviation of the given vector.
+ * @param rowMean Mean of the given vector.
+ * @return Skewness of the given vector.
+ */
+double Skewness(const arma::rowvec& input,
+ const double& rowStd,
+ const double& rowMean,
+ const bool population)
+{
+ double skewness = 0;
+ double S3 = pow(rowStd, 3);
@stereomatchingkiss
stereomatchingkiss Jul 29, 2016 Contributor

good candidate for const

@keon
keon Aug 6, 2016 Member

updated

@stereomatchingkiss stereomatchingkiss and 1 other commented on an outdated diff Jul 29, 2016
...lpack/methods/preprocess/preprocess_describe_main.cpp
+
+ // Print statistics of the row i.
+ Log::Info << boost::format(numberFormat)
+ % i
+ % arma::var(row, population)
+ % rowMean
+ % rowStd
+ % arma::median(row)
+ % rowMin
+ % rowMax
+ % (rowMax - rowMin) // range
+ % Skewness(row, rowStd, rowMean, population)
+ % Kurtosis(row, rowStd, rowMean, population)
+ % StandardError(row, rowStd) << endl;
+ }
+ }
@stereomatchingkiss
stereomatchingkiss Jul 29, 2016 Contributor

We can reduce some duplicated code

auto printStatResults = [&](int dim)
{
    arma::rowvec row = data.row(dim);
    double rowMax = arma::max(row);
    double rowMin = arma::min(row);
    double rowMean = arma::mean(row);
    double rowStd = arma::stddev(row, population);

    // Print statistics of the given dimension.
    Log::Info << boost::format(numberFormat)
        % dim
        % arma::var(row, population)
        % rowMean
        % rowStd
        % arma::median(row)
        % rowMin
        % rowMax
        % (rowMax - rowMin) // range
        % Skewness(row, rowStd, rowMean, population)
        % Kurtosis(row, rowStd, rowMean, population)
        % StandardError(row, rowStd) << endl;
};

if(CLI::HasParam("dimension")){
    printStatResults(dimension);
}else{
    for(size_t i = 0; i < data.n_rows; ++i){
        printStatResults(i);
    }
}
@keon
keon Aug 6, 2016 Member

updated

@stereomatchingkiss
Contributor

This CLI assumes every column is one sample. What if the data has one sample per row?

@rcurtin rcurtin and 1 other commented on an outdated diff Aug 1, 2016
...lpack/methods/preprocess/preprocess_describe_main.cpp
+PARAM_INT_IN("precision", "Precision of the output statistics.", "p", 4);
+PARAM_INT_IN("width", "Width of the output table.", "w", 8);
+PARAM_FLAG("population", "If specified, the program will calculate statistics "
+ "assuming the dataset is the population. By default, the program will "
+ "assume the dataset as a sample.", "P");
+
+/**
+* Calculates the sum of deviations to the Nth Power
+*
+* @param input Vector that captures a dimension of a dataset
+* @param rowMean Mean of the given vector.
+* @return sum of nth power deviations
+*/
+double SumNthPowerDeviations(const arma::rowvec& input,
+ const double& rowMean,
+ const size_t Nth) // Degree of Power
@rcurtin
rcurtin Aug 1, 2016 Member

For consistency with the style guide, I'd name this parameter n (lowercase) instead of Nth.

@keon
keon Aug 6, 2016 Member

updated

@rcurtin rcurtin and 1 other commented on an outdated diff Aug 1, 2016
...lpack/methods/preprocess/preprocess_describe_main.cpp
+ double rowMean = arma::mean(row);
+ double rowStd = arma::stddev(row, population);
+
+ // Print statistics of the given dimension.
+ Log::Info << boost::format(numberFormat)
+ % dimension
+ % arma::var(row, population)
+ % rowMean
+ % rowStd
+ % arma::median(row)
+ % rowMin
+ % rowMax
+ % (rowMax - rowMin) // range
+ % Skewness(row, rowStd, rowMean, population)
+ % Kurtosis(row, rowStd, rowMean, population)
+ % StandardError(row, rowStd) << endl;
@rcurtin
rcurtin Aug 1, 2016 Member

Everywhere else in the mlpack code we use just regular iostreams instead of boost::format... is it possible to use iostreams and iomanips here? Like std::setw, etc.

@keon
keon Aug 6, 2016 Member

@rcurtin sorry for the late response.
I tried std::setw first, but I found that std::setprecision is not compatible with Log::Info.
So I chose boost::format instead. It is not fast, but I did not think it would cause a performance issue in this case.

@rcurtin
rcurtin Aug 6, 2016 Member

No problem, I think your solution is perfectly reasonable. Would you like to open a ticket about the Log::Info and std::setw incompatibility? We don't need to solve it now (or soon), but it would maybe be a good task for someone who is looking for ways to contribute, since it's a pretty self-contained and easy-to-replicate problem. :)
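
For reference, one possible iostreams-only approach is to format each cell into a `std::ostringstream` first and then stream the finished string to the logger. This is only a sketch: `FormatCell` is a hypothetical helper, and whether the resulting string sidesteps the `Log::Info` issue described above is an assumption that was not tested in this thread.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: left-aligned, fixed-point cell of the given width and
// precision, built with standard stream manipulators only.
std::string FormatCell(const double value, const int width, const int precision)
{
  std::ostringstream oss;
  oss << std::left << std::fixed << std::setw(width)
      << std::setprecision(precision) << value;
  return oss.str();
}
```

With a width of 8 and a precision of 4, the value 5.8433 becomes `"5.8433  "`, matching the column layout in the sample output at the top of the thread.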

@rcurtin rcurtin and 1 other commented on an outdated diff Aug 4, 2016
...lpack/methods/preprocess/preprocess_describe_main.cpp
+ const string inputFile = CLI::GetParam<string>("input_file");
+ const size_t dimension = static_cast<size_t>(CLI::GetParam<int>("dimension"));
+ const size_t precision = static_cast<size_t>(CLI::GetParam<int>("precision"));
+ const size_t width = static_cast<size_t>(CLI::GetParam<int>("width"));
+ const bool population = CLI::HasParam("population");
+
+ // Load the data
+ arma::mat data;
+ data::Load(inputFile, data, false, true /*transpose*/);
+
+ // Generate boost format recipe.
+ const string widthPrecision("%-"+
+ to_string(width)+ "." +
+ to_string(precision));
+ const string widthOnly("%-"+
+ to_string(width)+ ".");
@rcurtin
rcurtin Aug 4, 2016 Member

Be sure to follow the style guide---wrapped lines should be indented twice (not once), and try to place spaces between operators (like a + b not a+b) for readability. :)

@keon
keon Aug 6, 2016 Member

updated
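
As an illustration of the two style points (wrapped lines indented twice, spaces around operators), here is a sketch of how a printf-style format specifier can be assembled from the width and precision. It uses `std::snprintf` rather than `boost::format` so that it stays self-contained; `MakeFormatSpec` and `FormatValue` are hypothetical names, not functions from the PR.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: builds a printf-style spec such as "%-8.4f".
std::string MakeFormatSpec(const std::size_t width,
                           const std::size_t precision)
{
  return "%-" + std::to_string(width) + "." + std::to_string(precision) +
      "f";
}

// Formats one value with the spec built above. Output longer than the buffer
// is truncated by snprintf.
std::string FormatValue(const double value,
                        const std::size_t width,
                        const std::size_t precision)
{
  char buffer[64];
  std::snprintf(buffer, sizeof(buffer),
      MakeFormatSpec(width, precision).c_str(), value);
  return std::string(buffer);
}
```

With the defaults from the parameter declarations (width 8, precision 4), the spec is `"%-8.4f"`.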

keon added some commits Aug 6, 2016
@keon keon polish describe executable program
4cf1dde
@keon keon adhere to the style guide
a5c996d
@stereomatchingkiss stereomatchingkiss and 1 other commented on an outdated diff Aug 7, 2016
...lpack/methods/preprocess/preprocess_describe_main.cpp
+ * @param rowStd Standard Deviation of the given vector.
+ * @param rowMean Mean of the given vector.
+ * @return Kurtosis of the given vector.
+ */
+double Kurtosis(const arma::rowvec& input,
+ const double& fStd,
+ const double& fMean,
+ const bool population)
+{
+ double kurtosis = 0;
+ const double M4 = SumNthPowerDeviations(input, fMean, 4);
+ const double n = input.n_elem;
+ if (population)
+ {
+ // Calculate Population Excess Kurtosis
+ double M2 = SumNthPowerDeviations(input, fMean, 2);
@stereomatchingkiss
stereomatchingkiss Aug 7, 2016 Contributor

I think this is a good candidate for const too, like the other variables in this function.
Sorry for being picky about const correctness.

After this is fixed, I think the code is ready to merge.

@keon
keon Aug 7, 2016 Member

updated
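
For reference, the kurtosis formulas discussed in this review can be written as below. The sample form is transcribed from the `normC`, `normM`, and `norm3` terms in the earlier diff hunk. The population form is the standard excess-kurtosis definition and is an assumption here, since the diff shows only that the population branch computes `M2`. The symbols $M_k$, $s$, $G_2$, and $g_2$ are introduced for illustration.

```latex
% M_k = \sum_i (x_i - \bar{x})^k, \quad s = \text{sample standard deviation}
\[
  G_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \cdot \frac{M_4}{s^{4}}
        - \frac{3(n-1)^{2}}{(n-2)(n-3)}
  \qquad \text{(sample)}
\]
\[
  g_2 = \frac{n\,M_4}{M_2^{2}} - 3
  \qquad \text{(population excess kurtosis, assumed)}
\]
```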

@keon keon use const instead of just normal types
3dca3fb
@stereomatchingkiss stereomatchingkiss and 1 other commented on an outdated diff Aug 7, 2016
...lpack/methods/preprocess/preprocess_describe_main.cpp
+ "assuming the dataset is the population. By default, the program will "
+ "assume the dataset as a sample.", "P");
+PARAM_FLAG("rowMajor", "If specified, the program will calculate statistics "
+ "assuming the dataset is organized in row major. By default, the program "
+ "will assume the dataset is a column major.", "r");
+
+/**
+* Calculates the sum of deviations to the Nth Power.
+*
+* @param input Vector that captures a dimension of a dataset.
+* @param rowMean Mean of the given vector.
+* @param n Degree of power.
+* @return sum of nth power deviations.
+*/
+double SumNthPowerDeviations(const arma::rowvec& input,
+ const double& fMean,
@stereomatchingkiss
stereomatchingkiss Aug 7, 2016 edited Contributor

@keonkim I discovered a small problem before merging: could you align the parameters, as load.hpp does? Thanks.

@keon
keon Aug 8, 2016 Member

updated

@keon keon align parameters in describe executable
23c54cb
@stereomatchingkiss stereomatchingkiss merged commit acd81e1 into mlpack:master Aug 8, 2016

1 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build failed
continuous-integration/appveyor/pr AppVeyor build succeeded
@stereomatchingkiss stereomatchingkiss changed the title from Descriptive Statistics command-line program to [GSOC]Descriptive Statistics command-line program Aug 27, 2016