New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSOC]Descriptive Statistics command-line program #742

Merged
merged 7 commits into from Aug 8, 2016

Conversation

Projects
None yet
3 participants
@keon
Member

keon commented Jul 27, 2016

I originally built a class that calculates descriptive statistics. But after a few discussion, I ended up shrinking all of the functions down to minimum to provide maximum performance and maintainability.
I also merged all commits to one to discard unnecessary commits.

Sample output on "iris.csv" would be:

[INFO ] dim     var     mean    std     median  min     max     range   skew    kurt    SE      
[INFO ] 0       0.6857  5.8433  0.8281  5.8000  4.3000  7.9000  3.6000  0.3149  -0.5521 0.0676  
[INFO ] 1       0.1880  3.0540  0.4336  3.0000  2.0000  4.4000  2.4000  0.3341  0.2908  0.0354  
[INFO ] 2       3.1132  3.7587  1.7644  4.3500  1.0000  6.9000  5.9000  -0.2745 -1.4019 0.1441  
[INFO ] 3       0.5824  1.1987  0.7632  1.3000  0.1000  2.5000  2.4000  -0.1050 -1.3398 0.0623  

Users can control the width and precision using -w and -p flag.
I tested the output using excel and they match perfectly.

Show outdated Hide outdated src/mlpack/methods/preprocess/preprocess_describe_main.cpp
{
sum += pow(input(i) - rowMean, Nth);
}
return sum;

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Jul 29, 2016

Contributor

This one could be vectorize, performance is better or not need to measure, but it should make the codes shorter and easier to read.

return arma::sum(arma::pow(input - rowMean,
                             static_cast<double>(Nth)));
@stereomatchingkiss

stereomatchingkiss Jul 29, 2016

Contributor

This one could be vectorize, performance is better or not need to measure, but it should make the codes shorter and easier to read.

return arma::sum(arma::pow(input - rowMean,
                             static_cast<double>(Nth)));

This comment has been minimized.

@keon

keon Aug 6, 2016

Member

updated

@keon

keon Aug 6, 2016

Member

updated

Show outdated Hide outdated src/mlpack/methods/preprocess/preprocess_describe_main.cpp
if (population)
{
// Calculate Population Skewness
skewness = n * M3 / (n * n * S3);

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Jul 29, 2016

Contributor

Why not just write

skewness = M3 / (n*S3);

Besides, according to the formula at this page, I think the equation of population skewness should be

std::sqrt(n) * M3 / (n*S3);

@stereomatchingkiss

stereomatchingkiss Jul 29, 2016

Contributor

Why not just write

skewness = M3 / (n*S3);

Besides, according to the formula at this page, I think the equation of population skewness should be

std::sqrt(n) * M3 / (n*S3);

This comment has been minimized.

@keon

keon Aug 6, 2016

Member

@stereomatchingkiss I imitated the result from excel.
I guess this is equivalent to "direct skewness formula" in the page.
I changed it to a simplified version you recommended.

@keon

keon Aug 6, 2016

Member

@stereomatchingkiss I imitated the result from excel.
I guess this is equivalent to "direct skewness formula" in the page.
I changed it to a simplified version you recommended.

This comment has been minimized.

@keon

keon Aug 6, 2016

Member

updated

@keon

keon Aug 6, 2016

Member

updated

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Aug 6, 2016

Contributor

My fault, sorry for misunderstanding, yes, they are the same thing as you mentioned.

@stereomatchingkiss

stereomatchingkiss Aug 6, 2016

Contributor

My fault, sorry for misunderstanding, yes, they are the same thing as you mentioned.

Show outdated Hide outdated src/mlpack/methods/preprocess/preprocess_describe_main.cpp
else
{
// Calculate Sample Skewness
skewness = n * M3 / ((n-1) * (n-2) * S3);

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Jul 29, 2016

Contributor

As the page mention, the equation should be

(n*std::sqrt(n-1)/std::sqrt(n-2)) * std::sqrt(n) * M3 / (n*S3)

Please correct me if I am wrong

@stereomatchingkiss

stereomatchingkiss Jul 29, 2016

Contributor

As the page mention, the equation should be

(n*std::sqrt(n-1)/std::sqrt(n-2)) * std::sqrt(n) * M3 / (n*S3)

Please correct me if I am wrong

This comment has been minimized.

@keon

keon Aug 6, 2016

Member

@stereomatchingkiss I imitated the result from excel.
This formula appears near the bottom of this page

@keon

keon Aug 6, 2016

Member

@stereomatchingkiss I imitated the result from excel.
This formula appears near the bottom of this page

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Aug 6, 2016

Contributor

@KeonKim I think you are right, my mistakes, thanks for your correction.

@stereomatchingkiss

stereomatchingkiss Aug 6, 2016

Contributor

@KeonKim I think you are right, my mistakes, thanks for your correction.

Show outdated Hide outdated src/mlpack/methods/preprocess/preprocess_describe_main.cpp
* @param rowStd Standard Deviation of the given vector.
* @return Standard error of the stanrdard devation of the given vector.
*/
double StandardError(const arma::rowvec& input, const double rowStd)

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Jul 29, 2016

Contributor

I think we could replace rowvec by size_t, because this could reduce dependency on specific type.
In medium or large size project, reduce dependency of your api can make your codes easier to maintain.

ps : This is not an issue in this small CLI program

@stereomatchingkiss

stereomatchingkiss Jul 29, 2016

Contributor

I think we could replace rowvec by size_t, because this could reduce dependency on specific type.
In medium or large size project, reduce dependency of your api can make your codes easier to maintain.

ps : This is not an issue in this small CLI program

This comment has been minimized.

@keon

keon Aug 6, 2016

Member

updated

@keon

keon Aug 6, 2016

Member

updated

Show outdated Hide outdated src/mlpack/methods/preprocess/preprocess_describe_main.cpp
const bool population)
{
double skewness = 0;
double S3 = pow(rowStd, 3);

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Jul 29, 2016

Contributor

good candidate for const

@stereomatchingkiss

stereomatchingkiss Jul 29, 2016

Contributor

good candidate for const

This comment has been minimized.

@keon

keon Aug 6, 2016

Member

updated

@keon

keon Aug 6, 2016

Member

updated

Show outdated Hide outdated src/mlpack/methods/preprocess/preprocess_describe_main.cpp
% Kurtosis(row, rowStd, rowMean, population)
% StandardError(row, rowStd) << endl;
}
}

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Jul 29, 2016

Contributor

We can reduce some duplicate codes

auto printStatResults = [&](int dim)
{
    arma::rowvec row = data.row(dim);
    double rowMax = arma::max(row);
    double rowMin = arma::min(row);
    double rowMean = arma::mean(row);
    double rowStd = arma::stddev(row, population);

    // Print statistics of the given dimension.
    Log::Info << boost::format(numberFormat)
        % dim
        % arma::var(row, population)
        % rowMean
        % rowStd
        % arma::median(row)
        % rowMin
        % rowMax
        % (rowMax - rowMin) // range
        % Skewness(row, rowStd, rowMean, population)
        % Kurtosis(row, rowStd, rowMean, population)
% StandardError(row, rowStd) << endl;
};

if(CLI::HasParam("dimension")){
    printStatResults(dimension);
}else{
    for(size_t i = 0; i < data.n_rows; ++i){
        printStatResults(i);
    }
}
@stereomatchingkiss

stereomatchingkiss Jul 29, 2016

Contributor

We can reduce some duplicate codes

auto printStatResults = [&](int dim)
{
    arma::rowvec row = data.row(dim);
    double rowMax = arma::max(row);
    double rowMin = arma::min(row);
    double rowMean = arma::mean(row);
    double rowStd = arma::stddev(row, population);

    // Print statistics of the given dimension.
    Log::Info << boost::format(numberFormat)
        % dim
        % arma::var(row, population)
        % rowMean
        % rowStd
        % arma::median(row)
        % rowMin
        % rowMax
        % (rowMax - rowMin) // range
        % Skewness(row, rowStd, rowMean, population)
        % Kurtosis(row, rowStd, rowMean, population)
% StandardError(row, rowStd) << endl;
};

if(CLI::HasParam("dimension")){
    printStatResults(dimension);
}else{
    for(size_t i = 0; i < data.n_rows; ++i){
        printStatResults(i);
    }
}

This comment has been minimized.

@keon

keon Aug 6, 2016

Member

updated

@keon

keon Aug 6, 2016

Member

updated

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Jul 29, 2016

Contributor

This CLI assume every column equal to one sample, what if the data is every row equal to one sample?

Contributor

stereomatchingkiss commented Jul 29, 2016

This CLI assume every column equal to one sample, what if the data is every row equal to one sample?

Show outdated Hide outdated src/mlpack/methods/preprocess/preprocess_describe_main.cpp
*/
double SumNthPowerDeviations(const arma::rowvec& input,
const double& rowMean,
const size_t Nth) // Degree of Power

This comment has been minimized.

@rcurtin

rcurtin Aug 1, 2016

Member

For consistency with the style guide, I'd name this parameter n (lowercase) instead of Nth.

@rcurtin

rcurtin Aug 1, 2016

Member

For consistency with the style guide, I'd name this parameter n (lowercase) instead of Nth.

This comment has been minimized.

@keon

keon Aug 6, 2016

Member

updated

@keon

keon Aug 6, 2016

Member

updated

Show outdated Hide outdated src/mlpack/methods/preprocess/preprocess_describe_main.cpp
% (rowMax - rowMin) // range
% Skewness(row, rowStd, rowMean, population)
% Kurtosis(row, rowStd, rowMean, population)
% StandardError(row, rowStd) << endl;

This comment has been minimized.

@rcurtin

rcurtin Aug 1, 2016

Member

Everywhere else in the mlpack code we use just regular iostreams instead of boost::format... is it possible to use iostreams and iomanips here? Like std::setw, etc.

@rcurtin

rcurtin Aug 1, 2016

Member

Everywhere else in the mlpack code we use just regular iostreams instead of boost::format... is it possible to use iostreams and iomanips here? Like std::setw, etc.

This comment has been minimized.

@keon

keon Aug 6, 2016

Member

@rcurtin sorry for late response.
I tried with std::setw first, but I found that std::setprecision is not compatible with Log::Info.
So the boost::format alternative was chosen. It is not fast, but I thought this would not cause performance issue in this case.

@keon

keon Aug 6, 2016

Member

@rcurtin sorry for late response.
I tried with std::setw first, but I found that std::setprecision is not compatible with Log::Info.
So the boost::format alternative was chosen. It is not fast, but I thought this would not cause performance issue in this case.

This comment has been minimized.

@rcurtin

rcurtin Aug 6, 2016

Member

No problem, I think your solution is perfectly reasonable. Would you like to open a ticket about the Log::Info and std::setw incompatibility? We don't need to solve it now (or soon), but it would maybe be a good task for someone who is looking for ways to contribute, since it's a pretty self-contained and easy-to-replicate problem. :)

@rcurtin

rcurtin Aug 6, 2016

Member

No problem, I think your solution is perfectly reasonable. Would you like to open a ticket about the Log::Info and std::setw incompatibility? We don't need to solve it now (or soon), but it would maybe be a good task for someone who is looking for ways to contribute, since it's a pretty self-contained and easy-to-replicate problem. :)

Show outdated Hide outdated src/mlpack/methods/preprocess/preprocess_describe_main.cpp
to_string(width)+ "." +
to_string(precision));
const string widthOnly("%-"+
to_string(width)+ ".");

This comment has been minimized.

@rcurtin

rcurtin Aug 4, 2016

Member

Be sure to follow the style guide---wrapped lines should be indented twice (not once), and try to place spaces between operators (like a + b not a+b) for readability. :)

@rcurtin

rcurtin Aug 4, 2016

Member

Be sure to follow the style guide---wrapped lines should be indented twice (not once), and try to place spaces between operators (like a + b not a+b) for readability. :)

This comment has been minimized.

@keon

keon Aug 6, 2016

Member

updated

@keon

keon Aug 6, 2016

Member

updated

keon added some commits Aug 6, 2016

Show outdated Hide outdated src/mlpack/methods/preprocess/preprocess_describe_main.cpp
if (population)
{
// Calculate Population Excess Kurtosis
double M2 = SumNthPowerDeviations(input, fMean, 2);

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Aug 7, 2016

Contributor

I think this is a good candidate of const too, as other variables in this function.
Sorry of being picky on const correctness.

After this is fixed, I think the codes are ready to merge

@stereomatchingkiss

stereomatchingkiss Aug 7, 2016

Contributor

I think this is a good candidate of const too, as other variables in this function.
Sorry of being picky on const correctness.

After this is fixed, I think the codes are ready to merge

This comment has been minimized.

@keon

keon Aug 7, 2016

Member

updated

@keon

keon Aug 7, 2016

Member

updated

Show outdated Hide outdated src/mlpack/methods/preprocess/preprocess_describe_main.cpp
* @return sum of nth power deviations.
*/
double SumNthPowerDeviations(const arma::rowvec& input,
const double& fMean,

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Aug 7, 2016

Contributor

@KeonKim Discover a small problem before merged, could you align the paramters?like load.hpp did, thanks

@stereomatchingkiss

stereomatchingkiss Aug 7, 2016

Contributor

@KeonKim Discover a small problem before merged, could you align the paramters?like load.hpp did, thanks

This comment has been minimized.

@keon

keon Aug 8, 2016

Member

updated

@keon

keon Aug 8, 2016

Member

updated

@stereomatchingkiss stereomatchingkiss merged commit acd81e1 into mlpack:master Aug 8, 2016

1 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build failed
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details

@stereomatchingkiss stereomatchingkiss changed the title from Descriptive Statistics command-line program to [GSOC]Descriptive Statistics command-line program Aug 27, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment