[GSOC]Descriptive Statistics command-line program #742

Merged
merged 7 commits into from Aug 8, 2016
Changes from 5 commits
1 change: 1 addition & 0 deletions src/mlpack/methods/preprocess/CMakeLists.txt
@@ -15,5 +15,6 @@ set(MLPACK_SRCS ${MLPACK_SRCS} ${DIR_SRCS} PARENT_SCOPE)
#add_cli_executable(preprocess_stats)
add_cli_executable(preprocess_split)
add_cli_executable(preprocess_binarize)
add_cli_executable(preprocess_describe)
#add_cli_executable(preprocess_scan)
#add_cli_executable(preprocess_imputer)
226 changes: 226 additions & 0 deletions src/mlpack/methods/preprocess/preprocess_describe_main.cpp
@@ -0,0 +1,226 @@
/**
* @file preprocess_describe_main.cpp
* @author Keon Kim
*
* Descriptive Statistics Class and CLI executable.
*/
#include <mlpack/core.hpp>
#include <boost/format.hpp>
#include <boost/lexical_cast.hpp>

using namespace mlpack;
using namespace mlpack::data;
using namespace std;
using namespace boost;

PROGRAM_INFO("Descriptive Statistics", "This utility takes a dataset and "
"prints out the descriptive statistics of the data. Descriptive statistics "
"is the discipline of quantitatively describing the main features of a "
"collection of information, or the quantitative description itself. The "
"program does not modify the original file, but instead prints out the "
"statistics to the console. The printed result will look like a table."
"\n\n"
"Optionally, the width and precision of the output can be adjusted with "
"the --width (-w) and --precision (-p) options. A user can also select a "
"specific dimension to analyze with --dimension (-d) if there are too many "
"dimensions. --population (-P) is a flag which can be used when the user "
"wants the dataset to be considered as a population. Otherwise, the dataset "
"will be considered as a sample."
"\n\n"
"For example, to print out statistical facts about dataset.csv with the "
"default settings, we could run"
"\n\n"
"$ mlpack_preprocess_describe -i dataset.csv -v"
"\n\n"
"If we want to customize the width to 10 and precision to 5 and consider "
"the dataset as a population, we could run"
"\n\n"
"$ mlpack_preprocess_describe -i dataset.csv -w 10 -p 5 -P -v");

// Define parameters for data.
PARAM_STRING_IN_REQ("input_file", "File containing data.", "i");
PARAM_INT_IN("dimension", "Dimension of the data. Use this to specify a "
"single dimension to describe.", "d", 0);
PARAM_INT_IN("precision", "Precision of the output statistics.", "p", 4);
PARAM_INT_IN("width", "Width of the output table.", "w", 8);
PARAM_FLAG("population", "If specified, the program will calculate statistics "
"assuming the dataset is the population. By default, the program will "
"treat the dataset as a sample.", "P");
PARAM_FLAG("rowMajor", "If specified, the program will calculate statistics "
"assuming the dataset is organized in row-major order. By default, the "
"program will assume the dataset is column major.", "r");

/**
* Calculates the sum of deviations to the Nth Power.
*
* @param input Vector that captures a dimension of a dataset.
* @param fMean Mean of the given vector.
* @param n Degree of power.
* @return sum of nth power deviations.
*/
double SumNthPowerDeviations(const arma::rowvec& input,
const double& fMean,
Contributor

@KeonKim I noticed a small problem before merging: could you align the parameters, as load.hpp does? Thanks.

Member Author

updated

size_t n) // Degree of Power
{
return arma::sum(arma::pow(input - fMean, static_cast<double>(n)));
}

/**
* Calculates Skewness of the given vector.
*
* @param input Vector that captures a dimension of a dataset.
* @param fStd Standard deviation of the given vector.
* @param fMean Mean of the given vector.
* @param population If true, treat the data as a population, not a sample.
* @return Skewness of the given vector.
*/
double Skewness(const arma::rowvec& input,
const double& fStd,
const double& fMean,
const bool population)
{
double skewness = 0;
const double S3 = pow(fStd, 3);
const double M3 = SumNthPowerDeviations(input, fMean, 3);
const double n = input.n_elem;
if (population)
{
// Calculate Population Skewness
skewness = M3 / (n * S3);
}
else
{
// Calculate Sample Skewness
skewness = n * M3 / ((n - 1) * (n - 2) * S3);
}
return skewness;
}

/**
* Calculates excess kurtosis of the given vector.
*
* @param input Vector that captures a dimension of a dataset.
* @param fStd Standard deviation of the given vector.
* @param fMean Mean of the given vector.
* @param population If true, treat the data as a population, not a sample.
* @return Excess kurtosis of the given vector.
*/
double Kurtosis(const arma::rowvec& input,
const double& fStd,
const double& fMean,
const bool population)
{
double kurtosis = 0;
const double M4 = SumNthPowerDeviations(input, fMean, 4);
const double n = input.n_elem;
if (population)
{
// Calculate Population Excess Kurtosis
double M2 = SumNthPowerDeviations(input, fMean, 2);
Contributor

I think this is a good candidate for const too, like the other variables in this function.
Sorry for being picky about const correctness.

After this is fixed, I think the code is ready to merge.

Member Author

updated

kurtosis = n * (M4 / pow(M2, 2)) - 3;
}
else
{
// Calculate Sample Excess Kurtosis
double S4 = pow(fStd, 4);
double norm3 = (3 * (n - 1) * (n - 1)) / ((n - 2) * (n - 3));
double normC = (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3));
double normM = M4 / S4;
kurtosis = normC * normM - norm3;
}
return kurtosis;
}

/**
* Calculates the standard error of the mean.
*
* @param size Number of elements in the given vector.
* @param fStd Standard deviation of the given vector.
* @return Standard error of the mean of the given vector.
*/
double StandardError(const size_t size, const double& fStd)
{
return fStd / sqrt(size);
}
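
For reference, the quantities the helper functions above compute can be written out as follows. This is read directly off the code in this diff, not taken from mlpack documentation. The symbol $M_k$ stands for the value returned by `SumNthPowerDeviations`, and $s$ is the standard deviation passed in as `fStd`, which `main` computes with divisor $n$ when `--population` is given and $n - 1$ otherwise.

```latex
\begin{aligned}
M_k &= \sum_{i=1}^{n} (x_i - \bar{x})^k \\
\text{skewness (population)} &= \frac{M_3}{n\,s^3} \\
\text{skewness (sample)} &= \frac{n\,M_3}{(n-1)(n-2)\,s^3} \\
\text{excess kurtosis (population)} &= \frac{n\,M_4}{M_2^{2}} - 3 \\
\text{excess kurtosis (sample)} &= \frac{n(n+1)}{(n-1)(n-2)(n-3)} \cdot \frac{M_4}{s^4}
  - \frac{3(n-1)^2}{(n-2)(n-3)} \\
\mathrm{SE} &= \frac{s}{\sqrt{n}}
\end{aligned}
```

The sample forms are undefined for very small inputs: the skewness needs $n > 2$ and the kurtosis needs $n > 3$, which the code does not check.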

int main(int argc, char** argv)
{
// Parse command line options.
CLI::ParseCommandLine(argc, argv);
const string inputFile = CLI::GetParam<string>("input_file");
const size_t dimension = static_cast<size_t>(CLI::GetParam<int>("dimension"));
const size_t precision = static_cast<size_t>(CLI::GetParam<int>("precision"));
const size_t width = static_cast<size_t>(CLI::GetParam<int>("width"));
const bool population = CLI::HasParam("population");
const bool rowMajor = CLI::HasParam("rowMajor");

// Load the data
arma::mat data;
data::Load(inputFile, data);

// Generate boost format recipe.
const string widthPrecision("%-"+
to_string(width)+ "." +
to_string(precision));
const string widthOnly("%-"+
to_string(width)+ ".");
string stringFormat = "";
string numberFormat = "";
for (size_t i = 0; i < 11; ++i)
{
stringFormat += widthOnly + "s";
numberFormat += widthPrecision + "f";
}

Timer::Start("statistics");
// Headers
Log::Info << boost::format(stringFormat)
% "dim" % "var" % "mean" % "std" % "median" % "min" % "max"
% "range" % "skew" % "kurt" % "SE" << endl;

// Lambda function to print out the results.
auto printStatResults = [&](size_t dim, bool rowMajor)
{
arma::rowvec feature;
if (rowMajor)
feature = arma::conv_to<arma::rowvec>::from(data.col(dim));
else
feature = data.row(dim);

// f at the front means "feature"
const double fMax = arma::max(feature);
const double fMin = arma::min(feature);
const double fMean = arma::mean(feature);
const double fStd = arma::stddev(feature, population);

// Print statistics of the given dimension.
Log::Info << boost::format(numberFormat)
% dim
% arma::var(feature, population)
% fMean
% fStd
% arma::median(feature)
% fMin
% fMax
% (fMax - fMin) // range
% Skewness(feature, fStd, fMean, population)
% Kurtosis(feature, fStd, fMean, population)
% StandardError(feature.n_elem, fStd)
<< endl;
};

// If the user specified a dimension, describe the statistics of that
// dimension. If no dimension is specified, describe all dimensions.
if(CLI::HasParam("dimension"))
{
printStatResults(dimension, rowMajor);
}
else
{
const size_t dimensions = rowMajor ? data.n_cols : data.n_rows;
for(size_t i = 0; i < dimensions; ++i)
{
printStatResults(i, rowMajor);
}
}
Contributor

We can reduce some duplicated code:

auto printStatResults = [&](int dim)
{
    arma::rowvec row = data.row(dim);
    double rowMax = arma::max(row);
    double rowMin = arma::min(row);
    double rowMean = arma::mean(row);
    double rowStd = arma::stddev(row, population);

    // Print statistics of the given dimension.
    Log::Info << boost::format(numberFormat)
        % dim
        % arma::var(row, population)
        % rowMean
        % rowStd
        % arma::median(row)
        % rowMin
        % rowMax
        % (rowMax - rowMin) // range
        % Skewness(row, rowStd, rowMean, population)
        % Kurtosis(row, rowStd, rowMean, population)
        % StandardError(row, rowStd) << endl;
};

if(CLI::HasParam("dimension")){
    printStatResults(dimension);
}else{
    for(size_t i = 0; i < data.n_rows; ++i){
        printStatResults(i);
    }
}

Member Author

updated

Timer::Stop("statistics");
}
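
The skewness and excess kurtosis formulas in this file can be checked without Armadillo. The following is a minimal stand-alone sketch, assuming only the C++ standard library. The `std::vector` interface and the helper names `Mean`, `StdDev` and `ExcessKurtosis` are hypothetical, chosen for illustration; they are not part of mlpack.

```cpp
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-alone helpers mirroring the formulas in the PR.

// Arithmetic mean of the input.
double Mean(const std::vector<double>& x)
{
  return std::accumulate(x.begin(), x.end(), 0.0) /
      static_cast<double>(x.size());
}

// Sum of (x_i - mean)^n over all elements.
double SumNthPowerDeviations(const std::vector<double>& x,
                             const double mean,
                             const int n)
{
  double sum = 0.0;
  for (const double v : x)
    sum += std::pow(v - mean, n);
  return sum;
}

// Standard deviation: divisor N for a population, N - 1 for a sample.
double StdDev(const std::vector<double>& x, const bool population)
{
  const double n = static_cast<double>(x.size());
  const double m2 = SumNthPowerDeviations(x, Mean(x), 2);
  return std::sqrt(m2 / (population ? n : n - 1));
}

// Population or sample skewness (the sample form requires n > 2).
double Skewness(const std::vector<double>& x, const bool population)
{
  const double n = static_cast<double>(x.size());
  const double s3 = std::pow(StdDev(x, population), 3);
  const double m3 = SumNthPowerDeviations(x, Mean(x), 3);
  return population ? m3 / (n * s3)
                    : n * m3 / ((n - 1) * (n - 2) * s3);
}

// Population or sample excess kurtosis (the sample form requires n > 3).
double ExcessKurtosis(const std::vector<double>& x, const bool population)
{
  const double n = static_cast<double>(x.size());
  const double mean = Mean(x);
  const double m4 = SumNthPowerDeviations(x, mean, 4);
  if (population)
  {
    const double m2 = SumNthPowerDeviations(x, mean, 2);
    return n * m4 / (m2 * m2) - 3.0;
  }
  const double s4 = std::pow(StdDev(x, false), 4);
  const double normC = (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3));
  const double norm3 = (3 * (n - 1) * (n - 1)) / ((n - 2) * (n - 3));
  return normC * (m4 / s4) - norm3;
}
```

For the symmetric input {1, 2, 3, 4, 5}, the skewness is 0, the population excess kurtosis is -1.3, and the sample excess kurtosis is -1.2.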