### Subversion checkout URL

You can clone with
or
.

# New Statistics Module#118

Merged
merged 5 commits into from
+3,798 −11

### 3 participants

Collaborator

This pull request should be merged only after pull request 139. Some minor changes will be necessary for that. However, I'd like to open the discussion already now because I'll be offline for two days.

commented on the diff
src/modules/stats/t_test.cpp
 ((65 lines not shown)) + typename HandleTraits::ReferenceToDouble x_sum; + typename HandleTraits::ReferenceToDouble correctedX_square_sum; + + typename HandleTraits::ReferenceToUInt64 numY; + typename HandleTraits::ReferenceToDouble y_sum; + typename HandleTraits::ReferenceToDouble correctedY_square_sum; + + typename HandleTraits::ReferenceToDouble parameter; +}; + +/** + * @brief Update the corrected sum of squares + * + * For numerical stability, we should not compute the sample variance in the + * naive way. The literature has many examples where this gives bad results + * even with moderately sized inputs.
 Collaborator fschopp added a note Feb 15, 2012 In fact, the implementation below is like in Chan et al. (and not a naive one). Owner cwelton added a note May 9, 2012 Good to add to the comments, would also be good to note that further improvement would be desirable, e.g. Ogita et al. Collaborator fschopp added a note May 10, 2012 Yes, this should point to MADLIB-499 and subtasks. to join this conversation on GitHub. Already have an account? Sign in to comment
Owner

Current plan is to schedule this for pull.

commented on the diff
src/config/Version.yml
 @@ -1 +1 @@ -version: 0.3 +version: 0.4dev
 Owner cwelton added a note May 9, 2012 Good, I was going to file a jira for this. I didn't expect to see it in the inferential statistics PR, but I'm happy to see it changing. We should probably still have a jira to remove the "dev" prior to release. Collaborator fschopp added a note May 10, 2012 Yes, we might want to factor this out into a separate commit. to join this conversation on GitHub. Already have an account? Sign in to comment
commented on an outdated diff
src/modules/stats/chi_squared_test.cpp
 ((19 lines not shown)) + +// Import names from other MADlib modules +using prob::chiSquaredCDF; + +namespace stats { + +// Workaround for Doxygen: A header file that does not declare namespaces is to +// be ignored if and only if it is processed stand-alone +#undef _DOXYGEN_IGNORE_HEADER_FILE +#include "chi_squared_test.hpp" + +/** + * @brief Transition state for chi-squared functions + * + * Note: We assume that the DOUBLE PRECISION array is initialized by the + * database with length 8, and all elemenets are 0.
 Owner cwelton added a note May 9, 2012 s/elemenets/elements/ Owner cwelton added a note May 9, 2012 The assumption that the state will be an array of length 8 is somewhat suspect. This will be the case when it is properly initialized by an aggregate, but individual functions of can be directly callable. When called directly the parameters can be anything that a user provides. For instance, what is the behavior if a user calls: select chi2_gof_test_merge_states(array[1],array[2]);? If this will reference uninitialized memory then it needs to be fixed. Collaborator fschopp added a note May 10, 2012 This assumption here is meant as a precondition for the correct outcome (in a mathematical sense). If this precondition is not satisfied, e.g., because the function is called directly, the result will not be correct. The C++ AL will raise an exception when an out-of-bounds array access occurs. That said, your concern raises a valid point in that there are cases where the code takes the address of an array element (with bounds checking), but then accesses memory relative to that pointer (no bounds checking any more). This is the case when Eigen objects are mapped to different locations in a single array. I'll check those cases and fix if necessary. to join this conversation on GitHub. Already have an account? Sign in to comment
commented on the diff
src/modules/stats/chi_squared_test.cpp
 ((64 lines not shown)) + typename HandleTraits::ReferenceToInt64 df; +}; + +inline +void +updateSumSquaredDeviations(double &ioLeftNumRows, double &ioLeftSumExp, + double &ioLeftSumObsSquareOverExp, double &ioLeftSumObs, + double &ioLeftSumSquaredDeviations, + double inRightNumRows, double inRightSumExp, + double inRightSumObsSquareOverExp, double inRightSumObs, + double inRightSumSquaredDeviations) { + + if (inRightNumRows <= 0) + return; + + // FIXME: Use compensated sums for numerical stability
 Owner cwelton added a note May 9, 2012 File jiras for unresolved issues/improvements and refer to the jira number in the comment. Collaborator fschopp added a note May 10, 2012 MADLIB-500, MADLIB-501 to join this conversation on GitHub. Already have an account? Sign in to comment
commented on an outdated diff
src/modules/stats/chi_squared_test.cpp
 ((82 lines not shown)) + + ioLeftSumExp * inRightSumObsSquareOverExp + + ioLeftSumObsSquareOverExp * inRightSumExp + - 2 * ioLeftSumObs * inRightSumObs; + + ioLeftNumRows += inRightNumRows; + ioLeftSumExp += inRightSumExp; + ioLeftSumObsSquareOverExp += inRightSumObsSquareOverExp; + ioLeftSumObs += inRightSumObs; +} + +AnyType +chi2_gof_test_transition::run(AnyType &args) { + Chi2TestTransitionState > state = args[0]; + int64_t observed = args[1].getAs(); + double expected = args[2].isNull() ? 1 : args[2].getAs(); + int64_t df = args[3].isNull() ? -1 : args[3].getAs();
 Owner cwelton added a note May 9, 2012 Slight discrepancy from the documented behavior. Documented: @param df Degrees of freedom. This is the number of events reduced by the degree of freedom lost by using the observed numbers for defining the expected number of observations. If this parameter is \c NULL, the degree of freedom is taken as \f$(k - 1) \f$. Actual: @param df Degrees of freedom. This is the number of events reduced by the degree of freedom lost by using the observed numbers for defining the expected number of observations. If this parameter is \c NULL or < 0, the degree of freedom is taken as \f$(k - 1) \f$. Collaborator fschopp added a note May 10, 2012 Probably the right thing to do is to raise an error if df <= 0. Also, we should make df an optional argument (which currently means defining an overloaded UDA). Update: Implemented as I suggested. to join this conversation on GitHub. Already have an account? Sign in to comment
commented on an outdated diff
src/modules/stats/chi_squared_test.cpp
 ((88 lines not shown)) + ioLeftSumObsSquareOverExp += inRightSumObsSquareOverExp; + ioLeftSumObs += inRightSumObs; +} + +AnyType +chi2_gof_test_transition::run(AnyType &args) { + Chi2TestTransitionState > state = args[0]; + int64_t observed = args[1].getAs(); + double expected = args[2].isNull() ? 1 : args[2].getAs(); + int64_t df = args[3].isNull() ? -1 : args[3].getAs(); + + if (state.uniformDist != args[2].isNull()) { + if (state.numRows > 0) + throw std::invalid_argument("Expected number of observations must " + "be given for all events or must be NULL for all events, in " + "which case a discrete uniform distribution is assumed.");
 Owner cwelton added a note May 9, 2012 Not technically correct since NULL and -1 are treated as equivalent, but not a huge deal. Collaborator fschopp added a note May 16, 2012 For args[2], the expected number of observations, this error msg is correct. You are thinking about args[3], the degree of freedom. to join this conversation on GitHub. Already have an account? Sign in to comment
commented on the diff
src/modules/stats/kolmogorov_smirnov_test.cpp
 ((19 lines not shown)) + +// Import names from other MADlib modules +using prob::kolmogorovCDF; + +namespace stats { + +// Workaround for Doxygen: A header file that does not declare namespaces is to +// be ignored if and only if it is processed stand-alone +#undef _DOXYGEN_IGNORE_HEADER_FILE +#include "kolmogorov_smirnov_test.hpp" + +/** + * @brief Transition state for Kolmogorov-Smirnov-Test functions + * + * Note: We assume that the DOUBLE PRECISION array is initialized by the + * database with length 7, and all elemenets are 0.
 Owner cwelton added a note May 9, 2012 Like chi_squared I am concerned about the assumption of what the input looks like for user callable functions. Owner cwelton added a note May 9, 2012 Also s/elemenets/elements/ Collaborator haradh1 added a note May 10, 2012 And looks like the required length is 6. Collaborator fschopp added a note May 11, 2012 True. This is, BTW, an example where directly calling the function using an array of length 5 would lead to undefined behavior. to join this conversation on GitHub. Already have an account? Sign in to comment
commented on an outdated diff
src/modules/stats/kolmogorov_smirnov_test.cpp
 ((65 lines not shown)) + KSTestTransitionState > state = args[0]; + int sample = args[1].getAs() ? 0 : 1; + double value = args[2].getAs(); + ColumnVector2 expectedNum; + expectedNum << args[3].getAs(), args[4].getAs(); + + if (state.expectedNum != expectedNum) { + if (state.num.sum() > 0) + throw std::invalid_argument("Number of samples must be constant " + "parameters."); + + state.expectedNum = expectedNum; + } + + if (state.last > value && state.num.sum() > 0) + throw std::invalid_argument("Must be used as an ordered aggregate.");
 Owner cwelton added a note May 9, 2012 Not just as an ordered aggregate, but also ordered by the same expression as the input parameter. E.g. this is good: select agg(x order by x) ..., whereas these are not select agg(x order by x desc), select agg(x order by y). It would be good if the database had a better way to express the relevant properties so that a user can call the function simply, unfortunately it does not. Owner cwelton added a note May 9, 2012 Additionally ordered aggregates don't scale particularly well due to the need of establishing a single global ordering. Not that I see any immediate alternatives. Collaborator fschopp added a note May 10, 2012 Unfortunately, anything that requires the rank of values also requires a sort. The only alternative would be online approximation algorithms. Though I am doubtful that that would be easy to parallelize. Research on anything like "online multi-stream Kolmogorov-Smirnov test approximation algorithms" is probably limited or non-existent... to join this conversation on GitHub. Already have an account? Sign in to comment
commented on the diff
src/modules/stats/mann_whitney_test.cpp
 ((19 lines not shown)) + +// Import names from other MADlib modules +using prob::normalCDF; + +namespace stats { + +// Workaround for Doxygen: A header file that does not declare namespaces is to +// be ignored if and only if it is processed stand-alone +#undef _DOXYGEN_IGNORE_HEADER_FILE +#include "mann_whitney_test.hpp" + +/** + * @brief Transition state for Mann-Whitney-Test functions + * + * Note: We assume that the DOUBLE PRECISION array is initialized by the + * database with length 7, and all elemenets are 0.
 Owner cwelton added a note May 9, 2012 Same comment regarding user supplied parameters. Collaborator fschopp added a note May 11, 2012 No problem here, bounds will be checked. to join this conversation on GitHub. Already have an account? Sign in to comment
commented on an outdated diff
src/modules/stats/mann_whitney_test.cpp
 ((60 lines not shown)) +/** + * @brief Perform the Mann-Whitney-test transition step + */ +AnyType +mw_test_transition::run(AnyType &args) { + MWTestTransitionState > state = args[0]; + int sample = args[1].getAs() ? 0 : 1; + double value = args[2].getAs(); + + if (state.last < value) { + state.numTies.setZero(); + } else if (state.last == value) { + for (int i = 0; i <= 1; i++) + state.rankSum(i) += state.numTies(i) * 0.5; + } else if (state.num.sum() > 0) { // also satisfied here: state.last > value + throw std::invalid_argument("Must be used as an ordered aggregate.");
 Owner cwelton added a note May 9, 2012 Same comment regarding use as an ordered aggregate. Collaborator fschopp added a note May 16, 2012 Fixed:  throw std::invalid_argument("Must be used as an ordered aggregate, " "in ascending order of the second argument.");  to join this conversation on GitHub. Already have an account? Sign in to comment
commented on an outdated diff
src/modules/stats/one_way_anova.cpp
 ((20 lines not shown)) + +// Import names from other MADlib modules +using prob::fisherF_CDF; + +namespace stats { + +// Workaround for Doxygen: A header file that does not declare namespaces is to +// be ignored if and only if it is processed stand-alone +#undef _DOXYGEN_IGNORE_HEADER_FILE +#include "one_way_anova.hpp" + +/** + * @brief Transition state for one-way ANOVA functions + * + * Note: We assume that the DOUBLE PRECISION array is initialized by the + * database with length 1, and all elemenets are 0.
 Owner cwelton added a note May 9, 2012 Same comment regarding user supplied parameters and elemenets Collaborator fschopp added a note May 11, 2012 Out-of-bounds access possible here. Needs fix. Collaborator fschopp added a note May 16, 2012 Fixed. to join this conversation on GitHub. Already have an account? Sign in to comment
commented on an outdated diff
src/modules/stats/one_way_anova.cpp
 ((43 lines not shown)) + rebind(utils::nextPowerOfTwo(static_cast(mStorage[0]))); + } + + /** + * @brief Convert to backend representation + * + * We define this function so that we can use TransitionState in the argument + * list and as a return type. + */ + inline operator AnyType() const { + return mStorage; + } + + /** + * @brief Return the index (in the num, sum, and coorected_square_sum + * fields) of a group value
 Owner cwelton added a note May 9, 2012 s/coorected/corrected/ to join this conversation on GitHub. Already have an account? Sign in to comment
commented on the diff
src/modules/stats/one_way_anova.cpp
 ((99 lines not shown)) + groupValues, groupValues + numGroups, inValue) - groupValues; + + if (pos >= numGroups || groupValues[pos] != inValue) { + // Did not find this group value. We have to start a new group. + throw std::runtime_error("Could not find a grouping value during " + "one-way ANOVA."); + } + return pos; +} + +template <> +uint32_t +OWATransitionState >::idxOfGroup( + const Allocator& inAllocator, uint16_t inValue) { + + // FIXME: Think of using proper iterators. Add some safety. Overflow checks.
 Owner cwelton added a note May 9, 2012 Jiras for known issues remaining. Edit: I don't see any explicit issues in the code below related to overflow issues, but the comment concerns me. If you do not believe this code is safe then it isn't ready for merge. Collaborator fschopp added a note May 11, 2012 The issue here is that the transition state contains pointers within the transition-state array, but no bounds-checking is currently performed. Probably not the only such place in MADlib, and there is MADLIB-499 anyway. But yes, some input validation should be added for now. to join this conversation on GitHub. Already have an account? Sign in to comment
commented on the diff
src/modules/stats/one_way_anova.cpp
 ((149 lines not shown)) + groupValues[pos] = inValue; + + std::copy(oldSelf.posToIndices, oldSelf.posToIndices + pos, posToIndices); + std::copy(oldSelf.posToIndices + pos, + oldSelf.posToIndices + oldSelf.numGroups, posToIndices + pos + 1); + posToIndices[pos] = oldSelf.numGroups; + + num.segment(0, oldSelf.numGroups) << oldSelf.num; + sum.segment(0, oldSelf.numGroups) << oldSelf.sum; + corrected_square_sum.segment(0, oldSelf.numGroups) << oldSelf.corrected_square_sum; + } + } + return posToIndices[pos]; +} + +// FIXME: Same function used for t_test. Factor out.
 Owner cwelton added a note May 9, 2012 Jiras for known issues remaining Collaborator fschopp added a note May 11, 2012 to join this conversation on GitHub. Already have an account? Sign in to comment
commented on the diff
src/modules/stats/one_way_anova.cpp
 ((87 lines not shown)) + typename HandleTraits::DoublePtr posToIndices; + typename HandleTraits::ColumnVectorTransparentHandleMap num; + typename HandleTraits::ColumnVectorTransparentHandleMap sum; + typename HandleTraits::ColumnVectorTransparentHandleMap corrected_square_sum; +}; + +template <> +uint32_t +OWATransitionState >::idxOfGroup( + const Allocator&, uint16_t inValue) { + + uint16_t pos = std::lower_bound( + groupValues, groupValues + numGroups, inValue) - groupValues; + + if (pos >= numGroups || groupValues[pos] != inValue) { + // Did not find this group value. We have to start a new group.
 Owner cwelton added a note May 9, 2012 I'm a bit confused by the "we have to start a new group" comment in combination with throwing a runtime_error. Collaborator fschopp added a note May 10, 2012 Good catch. This is comment is a left-over from copied code, and it is clearly confusing and inappropriate. The comment should read sth. like: "Did not find this group value. Since the underlying state is immutable, we will raise a runtime error. Obviously, this error should never occur." to join this conversation on GitHub. Already have an account? Sign in to comment
commented on the diff
src/modules/stats/t_test.cpp
 ((25 lines not shown)) +namespace stats { + +// Workaround for Doxygen: A header file that does not declare namespaces is to +// be ignored if and only if it is processed stand-alone +#undef _DOXYGEN_IGNORE_HEADER_FILE +#include "t_test.hpp" + +struct internal : public AbstractionLayer { + static AnyType tStatsToResult(double inT, double inDegreeOfFreedom); +}; + +/** + * @brief Transition state for t-Test functions + * + * Note: We assume that the DOUBLE PRECISION array is initialized by the + * database with length 6, and all elemenets are 0.
 Owner cwelton added a note May 9, 2012 Same comment about user supplied arguments and elemenets Collaborator fschopp added a note May 11, 2012 No issue here. Bounds will be checked. to join this conversation on GitHub. Already have an account? Sign in to comment
commented on an outdated diff
src/modules/stats/wilcoxon_signed_rank_test.cpp
 ((65 lines not shown)) + * + * Index 0 always refers to the positive values and index 1 refers to + * the negative values. + */ +AnyType +wsr_test_transition::run(AnyType &args) { + WSRTestTransitionState > state = args[0]; + double value = args[1].getAs(); + + // Ignore values of zero. + if (value == 0) + return state; + + int sample = value > 0 ? 0 : 1; + + // FIXME: The following epsilon is hard-coded
 Owner cwelton added a note May 9, 2012 Yes it is. What fix would you propose? Collaborator fschopp added a note May 10, 2012 I'd rather express it as a function of the machine epsilon and give an explanation for this expression. Collaborator fschopp added a note May 16, 2012 Fixed. Added a new function utils::almostEqual(). to join this conversation on GitHub. Already have an account? Sign in to comment
commented on an outdated diff
src/modules/stats/mann_whitney_test.cpp
 ((56 lines not shown)) + typename HandleTraits::ColumnVectorTransparentHandleMap rankSum; + typename HandleTraits::ReferenceToDouble last; +}; + +/** + * @brief Perform the Mann-Whitney-test transition step + */ +AnyType +mw_test_transition::run(AnyType &args) { + MWTestTransitionState > state = args[0]; + int sample = args[1].getAs() ? 0 : 1; + double value = args[2].getAs(); + + if (state.last < value) { + state.numTies.setZero(); + } else if (state.last == value) {
 Owner cwelton added a note May 9, 2012 Is there a reason you perform floating point equality here, but allow for an epsilon in wsr_test_transition? Collaborator fschopp added a note May 10, 2012 Not anything I remember. The WSR epsilon test might have been motivated by me testing some sample data. I wrote all this in one week, so there was not much time, and this just seems inconsistent. I'll check again before merging. Collaborator fschopp added a note May 16, 2012 Replaced by almost-equal testing. to join this conversation on GitHub. Already have an account? Sign in to comment
was assigned
was assigned
Owner

My review is complete.

1) I have concerns about direct invocation of the aggregate sub-functions due to assumptions about the transition state that are never verified.

2) For outstanding issues it would be good to have jiras tracking them so that they do not become lost and forgotten.

3) Some inconsistencies between behavior and documentation.

4) A couple small typos.

commented on the diff
src/modules/stats/one_way_anova.cpp
 ((112 lines not shown)) + const Allocator& inAllocator, uint16_t inValue) { + + // FIXME: Think of using proper iterators. Add some safety. Overflow checks. + uint16_t pos = std::lower_bound( + groupValues, groupValues + numGroups, inValue) - groupValues; + + if (pos >= numGroups || groupValues[pos] != inValue) { + // Did not find this group value. We have to start a new group. + + uint16_t numGroupsReserved = utils::nextPowerOfTwo( + static_cast(numGroups)); + if (numGroupsReserved > numGroups) { + // We have enough reserve space allocated. + std::copy(groupValues + pos, groupValues + numGroups, + groupValues + pos + 1); + groupValues[pos] = inValue;
 Collaborator haradh1 added a note May 9, 2012 This is an insertion sort, which may become a bottleneck. Collaborator fschopp added a note May 10, 2012 True, but I don't think the number of groups will ever become significant (but rather be in the single-digit range, or at most two-digit range). Otherwise, you are right, and I made a wrong assumption. Are you aware of use cases with more groups? Collaborator haradh1 added a note May 10, 2012 No, actually I don't have any idea of the expected number of groups. If it's only two-digits at most, it may be ok. It is always true my assumption doesn't work when some crazy man shows up, which bothers me :). Especially performance issues. to join this conversation on GitHub. Already have an account? Sign in to comment
commented on the diff
src/modules/stats/t_test.cpp
 ((189 lines not shown)) + double sampleVariance = state.correctedX_square_sum + / degreeOfFreedom; + double t = std::sqrt(state.numX / sampleVariance) + * (state.x_sum / state.numX); + + return internal::tStatsToResult(t, degreeOfFreedom); +} + +/** + * @brief Perform the pooled (i.e., assuming equal variances) two-sample t-Test + * final step + */ +AnyType +t_test_two_pooled_final::run(AnyType &args) { + TTestTransitionState > state = args[0]; +
 Collaborator haradh1 added a note May 9, 2012 For t_test_two_pooled_final, t_test_two_unpooled_final, and t_test_final, is there no need to check to return NULL as in t_test_one_final? Collaborator fschopp added a note May 10, 2012 True, that should be done. to join this conversation on GitHub. Already have an account? Sign in to comment
commented on the diff
src/ports/postgres/modules/stats/hypothesis_tests.sql_in
 ((491 lines not shown)) + * + * @usage + * - Test null hypothesis that two samples stem from the same distribution: + *
SELECT (ks_test(first, value,
+ *    (SELECT count(value) FROM source WHERE first),
+ *    (SELECT count(value) FROM source WHERE NOT first)
+ *    ORDER BY value
+ *)).* FROM source
+ * + * @note + * This aggregate must be used as an ordered aggregate + * (ORDER BY \em value) and will raise an exception if values are + * not ordered. + */ +CREATE +m4_ifdef(__GREENPLUM__',ORDERED')
 Collaborator haradh1 added a note May 10, 2012 If we support stats in GPDB 4.0, this will fail. Owner cwelton added a note May 10, 2012 Good point. We should document and not install the unsupported module on GPDB 4.0. Or... have some way of identifying which portions of the module are not supported and skip only those portions. Collaborator fschopp added a note May 10, 2012 I assume ordered aggregates are the only unsupported feature in GPDB 4.0. We thus could omit all rank-based tests, or we simply omit all hypothesis tests. Alternatively, we could leave them in, but the user would still have to ensure that the aggregate is seeing the values in order (which may or may not be possible using workarounds). Collaborator fschopp added a note May 16, 2012 For now, module stats is deactivated on GP4.0. to join this conversation on GitHub. Already have an account? Sign in to comment
was assigned
Collaborator

Done my quick review. Assign back to Florian.

 Florian Schoppmann New Statistics module (initial version): - Implemented one-sample t-test and two-sample (pooled and unpooled) t-test - F-test - Kolmogorov-Smirnov test - Mann-Whitney test - Wilcoxon-Signed-Rank test - Implemented One-way ANOVA - Added Pearson's chi-squared goodness-of-fit test (for arbitrary distributions). Can also be used as a test of independence. - Added unit tests for all of the above Utility module: - Added small but common utility functions: assert() and relative_error() C++ AL / utilities: - Allow "masquerading" mutable references to be passed by reference to the real type - Added names for constant-size vectors (so far only for size 2 and 3) Documentation: - Added module hypothesis tests - Added warning for Wilcoxon-Signed-Rank test. This needs to be solved. 12f1713 Florian Schoppmann Documentation: - doxysql: Support for aggregate default arguments and the NULL keyword - Added a bibliography database (in BibTeX format) that we can use in doxygen - Added amsfonts package to LaTeX generator Build system: - Bumped version number to 0.4dev 3ea3524 Florian Schoppmann Documentation: - Updated hypothesis-tests module. Gave more precise description of what is computed and how. Included example for chi-squared test of independence. 2a135e2 Florian Schoppmann Hypothesis tests / One-way ANOVA: - Fixed bug in merge function 882c7a8
commented on an outdated diff
src/modules/stats/wilcoxon_signed_rank_test.cpp
 ((70 lines not shown)) +wsr_test_transition::run(AnyType &args) { + WSRTestTransitionState > state = args[0]; + double value = args[1].getAs(); + + // Ignore values of zero. + if (value == 0) + return state; + + int sample = value > 0 ? 0 : 1; + + // FIXME: The following epsilon is hard-coded + const double epsilon = 1e-10; + + if (std::fabs(state.last) + epsilon < std::fabs(value)) { + state.numTies.setZero(); + } else if (std::fabs(std::fabs(state.last) - std::fabs(value)) < epsilon) {
 Collaborator fschopp added a note May 16, 2012 Replaced by: // For almostEqual, we choose a precision of 2 * 3 units in the last place. // This is because we assume that value is the result of adding up to three // values (in the dependent paired test, value may be computed as // "first - second - mu_0"). if (utils::almostEqual(std::fabs(state.last), std::fabs(value), 3) { [...]  Update: Please disregard this comment. Collaborator fschopp added a note May 16, 2012 Hmm, this one is actually problematic. If value is the result of a subtraction, then we are facing the problem of loss of significance. (We obviously know the difference with much less precision than we knew the original values.) From a numerical perspective, the only clean solution is to change the signature and have three parameters first, second, and mu_0. Alternatively, the user could provide a different argument that tells us the precision with which the differences are known. to join this conversation on GitHub. Already have an account? Sign in to comment
 Florian Schoppmann Changes resulting from discussion around pull request 118. Hypothesis tests: - Out-of-bounds checks for transition states - t-test/f-test: UDAs return NULL in case of insufficient data - Wilcoxon-signed-rank test/Mann-Whitney test: Updated testing of ties. For WSR tests, I added a new precision parameter. This parameter has to be used in the unit test. - Disabled module on GPDB4.0 because ordered aggregates are required - If UDAs are not properly used as ordered aggregates, the error messages are now more informative. - UDA chi2_gof_test now has default arguments k-means: - Fixed a wrong schema name in test 242cbff
Collaborator

I have made the following modifications before merging. The biggest change involves a change in functionality and is shown in bold font below. I'll merge already now to facilitate QA, but I suggest a quick post-merge review of the latest commit as well.

Changes resulting from discussion around pull request 118.
Hypothesis tests:

• Out-of-bounds checks for transition states
• t-test/f-test: UDAs return NULL in case of insufficient data
• Wilcoxon-signed-rank test/Mann-Whitney test: Updated testing of ties. For WSR tests, I added a new precision parameter. This parameter has to be used in the unit test.
• Disabled module on GPDB4.0 because ordered aggregates are required
• If UDAs are not properly used as ordered aggregates, the error messages are now more informative.
• UDA chi2_gof_test now has default arguments

k-means:

• Fixed a wrong schema name in test
merged commit 242cbff into madlib:master
to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on May 16, 2012
1. Florian Schoppmann committed
- Implemented one-sample t-test and two-sample (pooled and unpooled) t-test
- F-test
- Kolmogorov-Smirnov test
- Mann-Whitney test
- Wilcoxon-Signed-Rank test
- Implemented One-way ANOVA
- Added Pearson's chi-squared goodness-of-fit test (for arbitrary distributions). Can also be used as a test of independence.
- Added unit tests for all of the above

Utility module:
- Added small but common utility functions: assert() and relative_error()

C++ AL / utilities:
- Allow "masquerading" mutable references to be passed by reference to the real type
- Added names for constant-size vectors (so far only for size 2 and 3)

Documentation:
- Added module hypothesis tests
- Added warning for Wilcoxon-Signed-Rank test. This needs to be solved.
2. Florian Schoppmann committed
- doxysql: Support for aggregate default arguments and the NULL keyword
- Added a bibliography database (in BibTeX format) that we can use in doxygen
- Added amsfonts package to LaTeX generator

Build system:
- Bumped version number to 0.4dev
3. Florian Schoppmann committed
- Updated hypothesis-tests module. Gave more precise description of what is computed and how. Included example for chi-squared test of independence.
4. Florian Schoppmann committed
- Fixed bug in merge function
Commits on May 17, 2012
1. Florian Schoppmann committed
Hypothesis tests:
- Out-of-bounds checks for transition states
- t-test/f-test: UDAs return NULL in case of insufficient data
- Wilcoxon-signed-rank test/Mann-Whitney test: Updated testing of ties. For WSR tests, I added a new precision parameter. This parameter has to be used in the unit test.
- Disabled module on GPDB4.0 because ordered aggregates are required
- If UDAs are not properly used as ordered aggregates, the error messages are now more informative.
- UDA chi2_gof_test now has default arguments

k-means:
- Fixed a wrong schema name in test