[GSOC]Descriptive Statistics command-line program #742
Changes from 5 commits
5aed5ba
63d5959
27ac82e
4cf1dde
a5c996d
3dca3fb
23c54cb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,226 @@ | ||
/** | ||
* @file preprocess_describe_main.cpp | ||
* @author Keon Kim | ||
* | ||
* Descriptive Statistics Class and CLI executable. | ||
*/ | ||
#include <mlpack/core.hpp> | ||
#include <boost/format.hpp> | ||
#include <boost/lexical_cast.hpp> | ||
|
||
using namespace mlpack; | ||
using namespace mlpack::data; | ||
using namespace std; | ||
using namespace boost; | ||
|
||
PROGRAM_INFO("Descriptive Statistics", "This utility takes a dataset and " | ||
"prints out the descriptive statistics of the data. Descriptive statistics " | ||
"is the discipline of quantitatively describing the main features of a " | ||
"collection of information, or the quantitative description itself. The " | ||
"program does not modify the original file, but instead prints out the " | ||
"statistics to the console. The printed result will look like a table." | ||
"\n\n" | ||
"Optionally, the width and precision of the output can be adjusted using " | ||
"the --width (-w) and --precision (-p) options. A user can also select a " | ||
"specific dimension to analyze with --dimension (-d) if the dataset has " | ||
"too many dimensions. --population (-P) is a flag which can be used when " | ||
"the user wants the dataset to be treated as a population. Otherwise, the " | ||
"dataset will be treated as a sample." | ||
"\n\n" | ||
"As a simple example, to print out statistical facts about dataset.csv " | ||
"with the default settings, we could run" | ||
"\n\n" | ||
"$ mlpack_preprocess_describe -i dataset.csv -v" | ||
"\n\n" | ||
"If we want to customize the width to 10 and precision to 5 and consider " | ||
"the dataset as a population, we could run" | ||
"\n\n" | ||
"$ mlpack_preprocess_describe -i dataset.csv -w 10 -p 5 -P -v"); | ||
|
||
// Define parameters for data. | ||
PARAM_STRING_IN_REQ("input_file", "File containing data.", "i"); | ||
PARAM_INT_IN("dimension", "Dimension of the data. Use this to specify a " | ||
"single dimension to describe.", "d", 0); | ||
PARAM_INT_IN("precision", "Precision of the output statistics.", "p", 4); | ||
PARAM_INT_IN("width", "Width of the output table.", "w", 8); | ||
PARAM_FLAG("population", "If specified, the program will calculate statistics " | ||
"assuming the dataset is the population. By default, the program will " | ||
"treat the dataset as a sample.", "P"); | ||
PARAM_FLAG("rowMajor", "If specified, the program will calculate statistics " | ||
"assuming the dataset is row major. By default, the program will assume " | ||
"the dataset is column major.", "r"); | ||
|
||
/** | ||
* Calculates the sum of deviations from the mean raised to the nth power. | ||
* | ||
* @param input Vector that captures a dimension of a dataset. | ||
* @param fMean Mean of the given vector. | ||
* @param n Degree of power. | ||
* @return Sum of nth power deviations. | ||
*/ | ||
double SumNthPowerDeviations(const arma::rowvec& input, | ||
const double& fMean, | ||
size_t n) // Degree of Power | ||
{ | ||
return arma::sum(arma::pow(input - fMean, static_cast<double>(n))); | ||
} | ||
|
||
/** | ||
* Calculates Skewness of the given vector. | ||
* | ||
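* @param input Vector that captures a dimension of a dataset. | ||
* @param fStd Standard deviation of the given vector. | ||
* @param fMean Mean of the given vector. | ||
* @param population If true, calculate the population skewness; otherwise, | ||
* calculate the sample skewness. | ||
* @return Skewness of the given vector. | ||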
*/ | ||
double Skewness(const arma::rowvec& input, | ||
const double& fStd, | ||
const double& fMean, | ||
const bool population) | ||
{ | ||
double skewness = 0; | ||
const double S3 = pow(fStd, 3); | ||
const double M3 = SumNthPowerDeviations(input, fMean, 3); | ||
const double n = input.n_elem; | ||
if (population) | ||
{ | ||
// Calculate Population Skewness | ||
skewness = M3 / (n * S3); | ||
} | ||
else | ||
{ | ||
// Calculate Sample Skewness | ||
skewness = n * M3 / ((n - 1) * (n - 2) * S3); | ||
} | ||
return skewness; | ||
} | ||
|
||
/** | ||
* Calculates excess kurtosis of the given vector. | ||
* | ||
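* @param input Vector that captures a dimension of a dataset. | ||
* @param fStd Standard deviation of the given vector. | ||
* @param fMean Mean of the given vector. | ||
* @param population If true, calculate the population excess kurtosis; | ||
* otherwise, calculate the sample excess kurtosis. | ||
* @return Excess kurtosis of the given vector. | ||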
*/ | ||
double Kurtosis(const arma::rowvec& input, | ||
const double& fStd, | ||
const double& fMean, | ||
const bool population) | ||
{ | ||
double kurtosis = 0; | ||
const double M4 = SumNthPowerDeviations(input, fMean, 4); | ||
const double n = input.n_elem; | ||
if (population) | ||
{ | ||
// Calculate Population Excess Kurtosis | ||
double M2 = SumNthPowerDeviations(input, fMean, 2); | ||
I think this is a good candidate for const too, like the other variables in this function. After this is fixed, I think the code is ready to merge.
updated |
||
kurtosis = n * (M4 / pow(M2, 2)) - 3; | ||
} | ||
else | ||
{ | ||
// Calculate Sample Excess Kurtosis | ||
double S4 = pow(fStd, 4); | ||
double norm3 = (3 * (n - 1) * (n - 1)) / ((n - 2) * (n - 3)); | ||
double normC = (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3)); | ||
double normM = M4 / S4; | ||
kurtosis = normC * normM - norm3; | ||
} | ||
return kurtosis; | ||
} | ||
|
||
/** | ||
* Calculates the standard error of the mean from the standard deviation. | ||
* | ||
* @param size Number of elements in the given vector. | ||
* @param fStd Standard deviation of the given vector. | ||
* @return Standard error, computed as fStd / sqrt(size). | ||
*/ | ||
double StandardError(const size_t size, const double& fStd) | ||
{ | ||
return fStd / sqrt(size); | ||
} | ||
|
||
int main(int argc, char** argv) | ||
{ | ||
// Parse command line options. | ||
CLI::ParseCommandLine(argc, argv); | ||
const string inputFile = CLI::GetParam<string>("input_file"); | ||
const size_t dimension = static_cast<size_t>(CLI::GetParam<int>("dimension")); | ||
const size_t precision = static_cast<size_t>(CLI::GetParam<int>("precision")); | ||
const size_t width = static_cast<size_t>(CLI::GetParam<int>("width")); | ||
const bool population = CLI::HasParam("population"); | ||
const bool rowMajor = CLI::HasParam("rowMajor"); | ||
|
||
// Load the data | ||
arma::mat data; | ||
data::Load(inputFile, data); | ||
|
||
// Generate boost format recipe. | ||
const string widthPrecision("%-"+ | ||
to_string(width)+ "." + | ||
to_string(precision)); | ||
const string widthOnly("%-"+ | ||
to_string(width)+ "."); | ||
string stringFormat = ""; | ||
string numberFormat = ""; | ||
for (size_t i = 0; i < 11; ++i) | ||
{ | ||
stringFormat += widthOnly + "s"; | ||
numberFormat += widthPrecision + "f"; | ||
} | ||
|
||
Timer::Start("statistics"); | ||
// Headers | ||
Log::Info << boost::format(stringFormat) | ||
% "dim" % "var" % "mean" % "std" % "median" % "min" % "max" | ||
% "range" % "skew" % "kurt" % "SE" << endl; | ||
|
||
// Lambda function to print out the results. | ||
auto printStatResults = [&](size_t dim, bool rowMajor) | ||
{ | ||
arma::rowvec feature; | ||
if (rowMajor) | ||
feature = arma::conv_to<arma::rowvec>::from(data.col(dim)); | ||
else | ||
feature = data.row(dim); | ||
|
||
// f at the front means "feature" | ||
const double fMax = arma::max(feature); | ||
const double fMin = arma::min(feature); | ||
const double fMean = arma::mean(feature); | ||
const double fStd = arma::stddev(feature, population); | ||
|
||
// Print statistics of the given dimension. | ||
Log::Info << boost::format(numberFormat) | ||
% dim | ||
% arma::var(feature, population) | ||
% fMean | ||
% fStd | ||
% arma::median(feature) | ||
% fMin | ||
% fMax | ||
% (fMax - fMin) // range | ||
% Skewness(feature, fStd, fMean, population) | ||
% Kurtosis(feature, fStd, fMean, population) | ||
% StandardError(feature.n_elem, fStd) | ||
<< endl; | ||
}; | ||
|
||
// If the user specified a dimension, describe the statistics of that | ||
// dimension. If no dimension is specified, describe all dimensions. | ||
if (CLI::HasParam("dimension")) | ||
{ | ||
printStatResults(dimension, rowMajor); | ||
} | ||
else | ||
{ | ||
const size_t dimensions = rowMajor ? data.n_cols : data.n_rows; | ||
for (size_t i = 0; i < dimensions; ++i) | ||
{ | ||
printStatResults(i, rowMajor); | ||
} | ||
} | ||
We can reduce some duplicated code.
updated |
||
Timer::Stop("statistics"); | ||
} | ||
|
@KeonKim I discovered a small problem before merging: could you align the parameters, like load.hpp does? Thanks.
updated
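
For readers who want to check the sample and population formulas used by `Skewness()` and `Kurtosis()` in the diff above without building mlpack, here is a minimal standalone sketch. It is an illustration under stated assumptions, not the PR's code: it uses `std::vector<double>` instead of `arma::rowvec`, and the helper names `Mean`, `SumPowDev`, `StdDev`, and `ExcessKurtosis` are hypothetical. The arithmetic is meant to mirror the formulas in the diff, including using the population standard deviation (divide by N) when `population` is true and the sample standard deviation (divide by N - 1) otherwise.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical standalone helpers for illustration; not part of mlpack.

// Arithmetic mean of the values.
double Mean(const std::vector<double>& x)
{
  assert(!x.empty());
  double sum = 0.0;
  for (const double v : x)
    sum += v;
  return sum / static_cast<double>(x.size());
}

// Sum of (x_i - mean)^p over all elements.
double SumPowDev(const std::vector<double>& x, const double mean, const int p)
{
  double sum = 0.0;
  for (const double v : x)
    sum += std::pow(v - mean, p);
  return sum;
}

// Standard deviation: divides by N for a population, N - 1 for a sample.
double StdDev(const std::vector<double>& x,
              const double mean,
              const bool population)
{
  const double n = static_cast<double>(x.size());
  return std::sqrt(SumPowDev(x, mean, 2) / (population ? n : n - 1));
}

// Skewness, following the two branches in the diff.
double Skewness(const std::vector<double>& x, const bool population)
{
  const double n = static_cast<double>(x.size());
  const double mean = Mean(x);
  const double s3 = std::pow(StdDev(x, mean, population), 3);
  const double m3 = SumPowDev(x, mean, 3);
  if (population)
    return m3 / (n * s3);
  assert(x.size() > 2); // The sample formula divides by (n - 2).
  return n * m3 / ((n - 1) * (n - 2) * s3);
}

// Excess kurtosis, following the two branches in the diff.
double ExcessKurtosis(const std::vector<double>& x, const bool population)
{
  const double n = static_cast<double>(x.size());
  const double mean = Mean(x);
  const double m4 = SumPowDev(x, mean, 4);
  if (population)
  {
    const double m2 = SumPowDev(x, mean, 2);
    return n * (m4 / (m2 * m2)) - 3;
  }
  assert(x.size() > 3); // The sample formula divides by (n - 3).
  const double s4 = std::pow(StdDev(x, mean, false), 4);
  const double norm3 = (3 * (n - 1) * (n - 1)) / ((n - 2) * (n - 3));
  const double normC = (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3));
  return normC * (m4 / s4) - norm3;
}
```

As a hand-checkable case, the values 1, 2, 3, 4, 5 are symmetric, so both skewness variants are 0; the population excess kurtosis works out to 5 * 34 / 100 - 3 = -1.3 and the sample excess kurtosis to 1.25 * 5.44 - 8 = -1.2. Note that the sample branches require more than two (skewness) or three (kurtosis) elements, a precondition the sketch asserts.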