# New Statistics Module #118

Merged
merged 5 commits into from almost 2 years ago
 +3,798 11

### 3 participants

Collaborator

This pull request should be merged only after pull request 139. Some minor changes will be necessary for that. However, I'd like to open the discussion now because I'll be offline for two days.

commented on the diff
 src/modules/stats/t_test.cpp 

    +    typename HandleTraits::ReferenceToDouble x_sum;
    +    typename HandleTraits::ReferenceToDouble correctedX_square_sum;
    +
    +    typename HandleTraits::ReferenceToUInt64 numY;
    +    typename HandleTraits::ReferenceToDouble y_sum;
    +    typename HandleTraits::ReferenceToDouble correctedY_square_sum;
    +
    +    typename HandleTraits::ReferenceToDouble parameter;
    +};
    +
    +/**
    + * @brief Update the corrected sum of squares
    + *
    + * For numerical stability, we should not compute the sample variance in the
    + * naive way. The literature has many examples where this gives bad results
    + * even with moderately sized inputs.

Owner

Current plan is to schedule this for pull.

commented on the diff
 src/config/Version.yml 

    @@ -1 +1 @@
    -version: 0.3
    +version: 0.4dev

Owner cwelton added a note May 09, 2012

Good, I was going to file a jira for this. I didn't expect to see it in the inferential statistics PR, but I'm happy to see it changing. We should probably still have a jira to remove the "dev" prior to release.

Collaborator fschopp added a note May 10, 2012

Yes, we might want to factor this out into a separate commit.

 src/modules/stats/chi_squared_test.cpp 

    +
    +// Import names from other MADlib modules
    +using prob::chiSquaredCDF;
    +
    +namespace stats {
    +
    +// Workaround for Doxygen: A header file that does not declare namespaces is to
    +// be ignored if and only if it is processed stand-alone
    +#undef _DOXYGEN_IGNORE_HEADER_FILE
    +#include "chi_squared_test.hpp"
    +
    +/**
    + * @brief Transition state for chi-squared functions
    + *
    + * Note: We assume that the DOUBLE PRECISION array is initialized by the
    + * database with length 8, and all elemenets are 0.

Owner cwelton added a note May 09, 2012

s/elemenets/elements/

Owner cwelton added a note May 09, 2012

The assumption that the state will be an array of length 8 is somewhat suspect. This will be the case when it is properly initialized by an aggregate, but the individual functions can also be called directly. When called directly, the parameters can be anything that a user provides. For instance, what is the behavior if a user calls: select chi2_gof_test_merge_states(array[1],array[2]);? If this will reference uninitialized memory then it needs to be fixed.

Collaborator fschopp added a note May 10, 2012

This assumption here is meant as a precondition for the correct outcome (in a mathematical sense). If this precondition is not satisfied, e.g., because the function is called directly, the result will not be correct. The C++ AL will raise an exception when an out-of-bounds array access occurs.

That said, your concern raises a valid point in that there are cases where the code takes the address of an array element (with bounds checking), but then accesses memory relative to that pointer (no bounds checking any more). This is the case when Eigen objects are mapped to different locations in a single array. I'll check those cases and fix if necessary.

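To make the concern in this thread concrete, the following is a minimal sketch of validating a state array before any element or pointer-relative access. It uses `std::vector<double>` as a stand-in for the database array; `kChi2StateLength`, `requireStateLength`, and `mergeStates` are hypothetical names, and the element-wise addition is only a placeholder for the real merge logic.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical fixed layout size, mirroring the "length 8" precondition.
constexpr std::size_t kChi2StateLength = 8;

// Reject any state array that does not have the expected layout size.
inline void requireStateLength(const std::vector<double> &state,
                               std::size_t expected) {
    if (state.size() != expected)
        throw std::invalid_argument(
            "Invalid transition state: unexpected array length.");
}

// Illustrative merge: validate both sides before touching any element.
inline std::vector<double> mergeStates(std::vector<double> left,
                                       const std::vector<double> &right) {
    requireStateLength(left, kChi2StateLength);
    requireStateLength(right, kChi2StateLength);
    for (std::size_t i = 0; i < kChi2StateLength; ++i)
        left[i] += right[i];  // placeholder only; not the actual merge formula
    return left;
}
```

Under this sketch, a direct call with one-element arrays, analogous to `chi2_gof_test_merge_states(array[1],array[2])`, fails with an exception instead of reading past the end of the array.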
commented on the diff
 src/modules/stats/chi_squared_test.cpp 

    +    typename HandleTraits::ReferenceToInt64 df;
    +};
    +
    +inline
    +void
    +updateSumSquaredDeviations(double &ioLeftNumRows, double &ioLeftSumExp,
    +    double &ioLeftSumObsSquareOverExp, double &ioLeftSumObs,
    +    double &ioLeftSumSquaredDeviations,
    +    double inRightNumRows, double inRightSumExp,
    +    double inRightSumObsSquareOverExp, double inRightSumObs,
    +    double inRightSumSquaredDeviations) {
    +
    +    if (inRightNumRows <= 0)
    +        return;
    +
    +    // FIXME: Use compensated sums for numerical stability

 src/modules/stats/chi_squared_test.cpp 

    +        + ioLeftSumExp * inRightSumObsSquareOverExp
    +        + ioLeftSumObsSquareOverExp * inRightSumExp
    +        - 2 * ioLeftSumObs * inRightSumObs;
    +
    +    ioLeftNumRows += inRightNumRows;
    +    ioLeftSumExp += inRightSumExp;
    +    ioLeftSumObsSquareOverExp += inRightSumObsSquareOverExp;
    +    ioLeftSumObs += inRightSumObs;
    +}
    +
    +AnyType
    +chi2_gof_test_transition::run(AnyType &args) {
    +    Chi2TestTransitionState > state = args[0];
    +    int64_t observed = args[1].getAs();
    +    double expected = args[2].isNull() ? 1 : args[2].getAs();
    +    int64_t df = args[3].isNull() ? -1 : args[3].getAs();

Owner cwelton added a note May 09, 2012

Slight discrepancy from the documented behavior.

Documented: @param df Degrees of freedom. This is the number of events reduced by the degree of freedom lost by using the observed numbers for defining the expected number of observations. If this parameter is \c NULL, the degree of freedom is taken as \f$(k - 1) \f$.

Actual: @param df Degrees of freedom. This is the number of events reduced by the degree of freedom lost by using the observed numbers for defining the expected number of observations. If this parameter is \c NULL or < 0, the degree of freedom is taken as \f$(k - 1) \f$.

Collaborator fschopp added a note May 10, 2012

Probably the right thing to do is to raise an error if df <= 0. Also, we should make df an optional argument (which currently means defining an overloaded UDA).

Update: Implemented as I suggested.

 src/modules/stats/chi_squared_test.cpp 

    +    ioLeftSumObsSquareOverExp += inRightSumObsSquareOverExp;
    +    ioLeftSumObs += inRightSumObs;
    +}
    +
    +AnyType
    +chi2_gof_test_transition::run(AnyType &args) {
    +    Chi2TestTransitionState > state = args[0];
    +    int64_t observed = args[1].getAs();
    +    double expected = args[2].isNull() ? 1 : args[2].getAs();
    +    int64_t df = args[3].isNull() ? -1 : args[3].getAs();
    +
    +    if (state.uniformDist != args[2].isNull()) {
    +        if (state.numRows > 0)
    +            throw std::invalid_argument("Expected number of observations must "
    +                "be given for all events or must be NULL for all events, in "
    +                "which case a discrete uniform distribution is assumed.");

Owner cwelton added a note May 09, 2012

Not technically correct since NULL and -1 are treated as equivalent, but not a huge deal.

Collaborator fschopp added a note May 16, 2012

For args[2], the expected number of observations, this error msg is correct. You are thinking about args[3], the degree of freedom.

commented on the diff
 src/modules/stats/kolmogorov_smirnov_test.cpp 

    +
    +// Import names from other MADlib modules
    +using prob::kolmogorovCDF;
    +
    +namespace stats {
    +
    +// Workaround for Doxygen: A header file that does not declare namespaces is to
    +// be ignored if and only if it is processed stand-alone
    +#undef _DOXYGEN_IGNORE_HEADER_FILE
    +#include "kolmogorov_smirnov_test.hpp"
    +
    +/**
    + * @brief Transition state for Kolmogorov-Smirnov-Test functions
    + *
    + * Note: We assume that the DOUBLE PRECISION array is initialized by the
    + * database with length 7, and all elemenets are 0.

 src/modules/stats/kolmogorov_smirnov_test.cpp 

    +    KSTestTransitionState > state = args[0];
    +    int sample = args[1].getAs() ? 0 : 1;
    +    double value = args[2].getAs();
    +    ColumnVector2 expectedNum;
    +    expectedNum << args[3].getAs(), args[4].getAs();
    +
    +    if (state.expectedNum != expectedNum) {
    +        if (state.num.sum() > 0)
    +            throw std::invalid_argument("Number of samples must be constant "
    +                "parameters.");
    +
    +        state.expectedNum = expectedNum;
    +    }
    +
    +    if (state.last > value && state.num.sum() > 0)
    +        throw std::invalid_argument("Must be used as an ordered aggregate.");

Owner cwelton added a note May 09, 2012

Not just as an ordered aggregate, but also ordered by the same expression as the input parameter. E.g., this is good: select agg(x order by x) ..., whereas these are not: select agg(x order by x desc), select agg(x order by y). It would be good if the database had a better way to express the relevant properties so that a user could call the function simply; unfortunately, it does not.

Owner cwelton added a note May 09, 2012

Additionally, ordered aggregates don't scale particularly well because they need to establish a single global ordering. Not that I see any immediate alternatives.

Collaborator fschopp added a note May 10, 2012

Unfortunately, anything that requires the rank of values also requires a sort. The only alternative would be online approximation algorithms, though I am doubtful that they would be easy to parallelize. Research on anything like "online multi-stream Kolmogorov-Smirnov test approximation algorithms" is probably limited or non-existent...

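To illustrate why the input must arrive sorted ascending by the tested value, here is a small stand-alone sketch of a two-sample Kolmogorov-Smirnov statistic computed in a single pass. It is not the code from this pull request: `Row` and `ksStatistic` are hypothetical names, the tie handling is one reasonable choice among several, and the sample sizes are passed in up front in the same spirit as the two count arguments of the aggregate.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// One input row: (belongs to the first sample?, value).
using Row = std::pair<bool, double>;

// Two-sample KS statistic over rows sorted ascending by value.
// n1 and n2 are the sizes of the two samples, known in advance.
inline double ksStatistic(const std::vector<Row> &rows, double n1, double n2) {
    double seen1 = 0.0, seen2 = 0.0, maxDiff = 0.0;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        // A single pass can only detect, not repair, a wrong ordering.
        if (i > 0 && rows[i].second < rows[i - 1].second)
            throw std::invalid_argument(
                "Rows must be ordered ascending by value.");

        (rows[i].first ? seen1 : seen2) += 1.0;

        // Compare the empirical CDFs only after the last of any tied values.
        const bool lastOfTie =
            i + 1 == rows.size() || rows[i + 1].second != rows[i].second;
        if (lastOfTie)
            maxDiff = std::max(maxDiff, std::fabs(seen1 / n1 - seen2 / n2));
    }
    return maxDiff;
}
```

Because the running counts stand in for ranks, feeding the rows in descending order or ordered by an unrelated expression would silently produce a different number unless the check above rejects it.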
commented on the diff
 src/modules/stats/mann_whitney_test.cpp 

    +
    +// Import names from other MADlib modules
    +using prob::normalCDF;
    +
    +namespace stats {
    +
    +// Workaround for Doxygen: A header file that does not declare namespaces is to
    +// be ignored if and only if it is processed stand-alone
    +#undef _DOXYGEN_IGNORE_HEADER_FILE
    +#include "mann_whitney_test.hpp"
    +
    +/**
    + * @brief Transition state for Mann-Whitney-Test functions
    + *
    + * Note: We assume that the DOUBLE PRECISION array is initialized by the
    + * database with length 7, and all elemenets are 0.

 src/modules/stats/mann_whitney_test.cpp 

    +/**
    + * @brief Perform the Mann-Whitney-test transition step
    + */
    +AnyType
    +mw_test_transition::run(AnyType &args) {
    +    MWTestTransitionState > state = args[0];
    +    int sample = args[1].getAs() ? 0 : 1;
    +    double value = args[2].getAs();
    +
    +    if (state.last < value) {
    +        state.numTies.setZero();
    +    } else if (state.last == value) {
    +        for (int i = 0; i <= 1; i++)
    +            state.rankSum(i) += state.numTies(i) * 0.5;
    +    } else if (state.num.sum() > 0) { // also satisfied here: state.last > value
    +        throw std::invalid_argument("Must be used as an ordered aggregate.");

Owner cwelton added a note May 09, 2012

Same comment regarding use as an ordered aggregate.

Collaborator fschopp added a note May 16, 2012

Fixed:

    throw std::invalid_argument("Must be used as an ordered aggregate, "
        "in ascending order of the second argument.");

 src/modules/stats/one_way_anova.cpp 

    +
    +// Import names from other MADlib modules
    +using prob::fisherF_CDF;
    +
    +namespace stats {
    +
    +// Workaround for Doxygen: A header file that does not declare namespaces is to
    +// be ignored if and only if it is processed stand-alone
    +#undef _DOXYGEN_IGNORE_HEADER_FILE
    +#include "one_way_anova.hpp"
    +
    +/**
    + * @brief Transition state for one-way ANOVA functions
    + *
    + * Note: We assume that the DOUBLE PRECISION array is initialized by the
    + * database with length 1, and all elemenets are 0.

Owner cwelton added a note May 09, 2012

Same comment regarding user-supplied parameters and "elemenets".

Collaborator fschopp added a note May 11, 2012

Out-of-bounds access possible here. Needs fix.

Collaborator fschopp added a note May 16, 2012

Fixed.

 src/modules/stats/one_way_anova.cpp 

    +        rebind(utils::nextPowerOfTwo(static_cast(mStorage[0])));
    +    }
    +
    +    /**
    +     * @brief Convert to backend representation
    +     *
    +     * We define this function so that we can use TransitionState in the argument
    +     * list and as a return type.
    +     */
    +    inline operator AnyType() const {
    +        return mStorage;
    +    }
    +
    +    /**
    +     * @brief Return the index (in the num, sum, and coorected_square_sum
    +     * fields) of a group value

commented on the diff
 src/modules/stats/one_way_anova.cpp 

    +        groupValues, groupValues + numGroups, inValue) - groupValues;
    +
    +    if (pos >= numGroups || groupValues[pos] != inValue) {
    +        // Did not find this group value. We have to start a new group.
    +        throw std::runtime_error("Could not find a grouping value during "
    +            "one-way ANOVA.");
    +    }
    +    return pos;
    +}
    +
    +template <>
    +uint32_t
    +OWATransitionState >::idxOfGroup(
    +    const Allocator& inAllocator, uint16_t inValue) {
    +
    +    // FIXME: Think of using proper iterators. Add some safety. Overflow checks.

commented on the diff
 src/modules/stats/one_way_anova.cpp 

    +            groupValues[pos] = inValue;
    +
    +            std::copy(oldSelf.posToIndices, oldSelf.posToIndices + pos, posToIndices);
    +            std::copy(oldSelf.posToIndices + pos,
    +                oldSelf.posToIndices + oldSelf.numGroups, posToIndices + pos + 1);
    +            posToIndices[pos] = oldSelf.numGroups;
    +
    +            num.segment(0, oldSelf.numGroups) << oldSelf.num;
    +            sum.segment(0, oldSelf.numGroups) << oldSelf.sum;
    +            corrected_square_sum.segment(0, oldSelf.numGroups) << oldSelf.corrected_square_sum;
    +        }
    +    }
    +    return posToIndices[pos];
    +}
    +
    +// FIXME: Same function used for t_test. Factor out.

commented on the diff
 src/modules/stats/one_way_anova.cpp 

    +    typename HandleTraits::DoublePtr posToIndices;
    +    typename HandleTraits::ColumnVectorTransparentHandleMap num;
    +    typename HandleTraits::ColumnVectorTransparentHandleMap sum;
    +    typename HandleTraits::ColumnVectorTransparentHandleMap corrected_square_sum;
    +};
    +
    +template <>
    +uint32_t
    +OWATransitionState >::idxOfGroup(
    +    const Allocator&, uint16_t inValue) {
    +
    +    uint16_t pos = std::lower_bound(
    +        groupValues, groupValues + numGroups, inValue) - groupValues;
    +
    +    if (pos >= numGroups || groupValues[pos] != inValue) {
    +        // Did not find this group value. We have to start a new group.

Owner cwelton added a note May 09, 2012

I'm a bit confused by the "we have to start a new group" comment in combination with throwing a runtime_error.

Collaborator fschopp added a note May 10, 2012

Good catch. This comment is a leftover from copied code, and it is clearly confusing and inappropriate. The comment should read something like: "Did not find this group value. Since the underlying state is immutable, we will raise a runtime error. Obviously, this error should never occur."

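For readers skimming the thread, the read-only lookup boils down to a binary search that throws when the key is absent. The sketch below restates that with iterators on a `std::vector`; it is an illustration only, and this free-standing `idxOfGroup` signature is an assumption, not the member function from the diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Return the position of inValue in a sorted vector of group values.
// The state is treated as read-only here, so a missing value is an error
// rather than a reason to insert a new group.
inline std::size_t idxOfGroup(const std::vector<std::uint16_t> &groupValues,
                              std::uint16_t inValue) {
    const auto it =
        std::lower_bound(groupValues.begin(), groupValues.end(), inValue);
    if (it == groupValues.end() || *it != inValue)
        throw std::runtime_error(
            "Could not find a grouping value during one-way ANOVA.");
    return static_cast<std::size_t>(it - groupValues.begin());
}
```

Comparing the iterator against `end()` before dereferencing plays the role of the `pos >= numGroups` check in the hunk above.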
commented on the diff
 src/modules/stats/t_test.cpp 

    +namespace stats {
    +
    +// Workaround for Doxygen: A header file that does not declare namespaces is to
    +// be ignored if and only if it is processed stand-alone
    +#undef _DOXYGEN_IGNORE_HEADER_FILE
    +#include "t_test.hpp"
    +
    +struct internal : public AbstractionLayer {
    +    static AnyType tStatsToResult(double inT, double inDegreeOfFreedom);
    +};
    +
    +/**
    + * @brief Transition state for t-Test functions
    + *
    + * Note: We assume that the DOUBLE PRECISION array is initialized by the
    + * database with length 6, and all elemenets are 0.

 src/modules/stats/wilcoxon_signed_rank_test.cpp 

    + *
    + * Index 0 always refers to the positive values and index 1 refers to
    + * the negative values.
    + */
    +AnyType
    +wsr_test_transition::run(AnyType &args) {
    +    WSRTestTransitionState > state = args[0];
    +    double value = args[1].getAs();
    +
    +    // Ignore values of zero.
    +    if (value == 0)
    +        return state;
    +
    +    int sample = value > 0 ? 0 : 1;
    +
    +    // FIXME: The following epsilon is hard-coded

 src/modules/stats/mann_whitney_test.cpp 

    +    typename HandleTraits::ColumnVectorTransparentHandleMap rankSum;
    +    typename HandleTraits::ReferenceToDouble last;
    +};
    +
    +/**
    + * @brief Perform the Mann-Whitney-test transition step
    + */
    +AnyType
    +mw_test_transition::run(AnyType &args) {
    +    MWTestTransitionState > state = args[0];
    +    int sample = args[1].getAs() ? 0 : 1;
    +    double value = args[2].getAs();
    +
    +    if (state.last < value) {
    +        state.numTies.setZero();
    +    } else if (state.last == value) {

Owner cwelton added a note May 09, 2012

Is there a reason you perform floating point equality here, but allow for an epsilon in wsr_test_transition?

Collaborator fschopp added a note May 10, 2012

Not anything I remember. The WSR epsilon test might have been motivated by me testing some sample data. I wrote all this in one week, so there was not much time, and this just seems inconsistent. I'll check again before merging.

Collaborator fschopp added a note May 16, 2012

Replaced by almost-equal testing.

Owner
commented

My review is complete.

1) I have concerns about direct invocation of the aggregate sub-functions due to assumptions about the transition state that are never verified.

2) For outstanding issues it would be good to have jiras tracking them so that they do not become lost and forgotten.

3) Some inconsistencies between behavior and documentation.

4) A couple small typos.

commented on the diff
 src/modules/stats/one_way_anova.cpp 

    +    const Allocator& inAllocator, uint16_t inValue) {
    +
    +    // FIXME: Think of using proper iterators. Add some safety. Overflow checks.
    +    uint16_t pos = std::lower_bound(
    +        groupValues, groupValues + numGroups, inValue) - groupValues;
    +
    +    if (pos >= numGroups || groupValues[pos] != inValue) {
    +        // Did not find this group value. We have to start a new group.
    +
    +        uint16_t numGroupsReserved = utils::nextPowerOfTwo(
    +            static_cast(numGroups));
    +        if (numGroupsReserved > numGroups) {
    +            // We have enough reserve space allocated.
    +            std::copy(groupValues + pos, groupValues + numGroups,
    +                groupValues + pos + 1);
    +            groupValues[pos] = inValue;

commented on the diff
 src/modules/stats/t_test.cpp 

    +    double sampleVariance = state.correctedX_square_sum
    +        / degreeOfFreedom;
    +    double t = std::sqrt(state.numX / sampleVariance)
    +        * (state.x_sum / state.numX);
    +
    +    return internal::tStatsToResult(t, degreeOfFreedom);
    +}
    +
    +/**
    + * @brief Perform the pooled (i.e., assuming equal variances) two-sample t-Test
    + * final step
    + */
    +AnyType
    +t_test_two_pooled_final::run(AnyType &args) {
    +    TTestTransitionState > state = args[0];
    +

Collaborator haradh1 added a note May 09, 2012

For t_test_two_pooled_final, t_test_two_unpooled_final, and t_test_final, is there no need to check to return NULL as in t_test_one_final?

Collaborator fschopp added a note May 10, 2012

True, that should be done.

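As a sketch of the "return NULL on insufficient data" behavior discussed here, the one-sample formula from the hunk above can be guarded as follows. `OneSampleState` and `tStatistic` are hypothetical names, `std::optional` (C++17) stands in for a SQL NULL, and the threshold of two observations is an assumption based on the degrees of freedom being numX - 1.

```cpp
#include <cmath>
#include <cstdint>
#include <optional>

// Hypothetical plain-struct view of the one-sample part of the state.
struct OneSampleState {
    std::uint64_t numX;
    double x_sum;
    double correctedX_square_sum;
};

// One-sample t statistic; an empty optional plays the role of SQL NULL.
inline std::optional<double> tStatistic(const OneSampleState &s) {
    if (s.numX < 2)
        return std::nullopt;  // variance undefined with fewer than 2 values

    const double n = static_cast<double>(s.numX);
    const double degreeOfFreedom = n - 1.0;
    const double sampleVariance = s.correctedX_square_sum / degreeOfFreedom;
    // Same expression as in the diff: sqrt(n / variance) * mean.
    return std::sqrt(n / sampleVariance) * (s.x_sum / n);
}
```

For the values 1, 2, 3 (sum 6, corrected sum of squares 2) this gives sqrt(3) * 2; a state holding a single observation yields an empty optional instead of a division by zero.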
commented on the diff
 src/ports/postgres/modules/stats/hypothesis_tests.sql_in 

    + *
    + * @usage
    + * - Test null hypothesis that two samples stem from the same distribution:
    + * SELECT (ks_test(first, value,
    + *    (SELECT count(value) FROM source WHERE first),
    + *    (SELECT count(value) FROM source WHERE NOT first)
    + *    ORDER BY value
    + *)).* FROM source
    + *
    + * @note
    + * This aggregate must be used as an ordered aggregate
    + * (ORDER BY \em value) and will raise an exception if values are
    + * not ordered.
    + */
    +CREATE
    +m4_ifdef(__GREENPLUM__',ORDERED')

Collaborator
commented

Done my quick review. Assign back to Florian.

 src/modules/stats/wilcoxon_signed_rank_test.cpp 

    +wsr_test_transition::run(AnyType &args) {
    +    WSRTestTransitionState > state = args[0];
    +    double value = args[1].getAs();
    +
    +    // Ignore values of zero.
    +    if (value == 0)
    +        return state;
    +
    +    int sample = value > 0 ? 0 : 1;
    +
    +    // FIXME: The following epsilon is hard-coded
    +    const double epsilon = 1e-10;
    +
    +    if (std::fabs(state.last) + epsilon < std::fabs(value)) {
    +        state.numTies.setZero();
    +    } else if (std::fabs(std::fabs(state.last) - std::fabs(value)) < epsilon) {

Collaborator fschopp added a note May 16, 2012

Replaced by:

    // For almostEqual, we choose a precision of 2 * 3 units in the last place.
    // This is because we assume that value is the result of adding up to three
    // values (in the dependent paired test, value may be computed as
    // "first - second - mu_0").
    if (utils::almostEqual(std::fabs(state.last), std::fabs(value), 3) {
    [...]

Update: Please disregard this comment.

Collaborator fschopp added a note May 16, 2012

Hmm, this one is actually problematic. If value is the result of a subtraction, then we are facing the problem of loss of significance. (We obviously know the difference with much less precision than we knew the original values.) From a numerical perspective, the only clean solution is to change the signature and have three parameters first, second, and mu_0. Alternatively, the user could provide a different argument that tells us the precision with which the differences are known.

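For context on "units in the last place", here is one way such a comparison could be written. This is a hedged sketch and not the `utils::almostEqual` from the code base, whose implementation is not shown in this thread; `almostEqualUlps` is a hypothetical name, and measuring the spacing just below the larger magnitude is a design choice made here.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// True if a and b differ by at most maxUlps spacings of double values,
// measured at the larger of the two magnitudes.
inline bool almostEqualUlps(double a, double b, int maxUlps) {
    if (std::isnan(a) || std::isnan(b))
        return false;
    if (a == b)
        return true;  // also covers equal infinities and +0 vs. -0
    if (std::isinf(a) || std::isinf(b))
        return false;

    const double larger = std::max(std::fabs(a), std::fabs(b));
    // Distance from `larger` to the next representable value toward zero.
    const double ulp = larger - std::nextafter(larger, 0.0);
    return std::fabs(a - b) <= maxUlps * ulp;
}
```

A check like this handles rounding noise such as 0.1 + 0.2 versus 0.3, but it cannot undo the loss of significance described above: once two nearly equal inputs have been subtracted, the true uncertainty of the difference can be many ULPs of the result, which is why a caller-supplied precision was proposed.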
Collaborator
commented

I have made the following modifications before merging. The biggest change involves a change in functionality and is shown in bold font below. I'll merge now to facilitate QA, but I suggest a quick post-merge review of the latest commit as well.

Changes resulting from discussion around pull request 118.
Hypothesis tests:

• Out-of-bounds checks for transition states
• t-test/f-test: UDAs return NULL in case of insufficient data
• Wilcoxon-signed-rank test/Mann-Whitney test: Updated testing of ties. For WSR tests, I added a new precision parameter. This parameter has to be used in the unit test.
• Disabled module on GPDB4.0 because ordered aggregates are required
• If UDAs are not properly used as ordered aggregates, the error messages are now more informative.
• UDA chi2_gof_test now has default arguments

k-means:

• Fixed a wrong schema name in test
merged commit 242cbff into from
closed this

Showing 5 unique commits by 1 author.

May 16, 2012
New Statistics module (initial version):
- Implemented one-sample t-test and two-sample (pooled and unpooled) t-test
- F-test
- Kolmogorov-Smirnov test
- Mann-Whitney test
- Wilcoxon-Signed-Rank test
- Implemented One-way ANOVA
- Added Pearson's chi-squared goodness-of-fit test (for arbitrary distributions). Can also be used as a test of independence.
- Added unit tests for all of the above

Utility module:
- Added small but common utility functions: assert() and relative_error()

C++ AL / utilities:
- Allow "masquerading" mutable references to be passed by reference to the real type
- Added names for constant-size vectors (so far only for size 2 and 3)

Documentation:
- Added module hypothesis tests
- Added warning for Wilcoxon-Signed-Rank test. This needs to be solved.
12f1713
Documentation:
- doxysql: Support for aggregate default arguments and the NULL keyword
- Added a bibliography database (in BibTeX format) that we can use in doxygen
- Added amsfonts package to LaTeX generator

Build system:
- Bumped version number to 0.4dev
3ea3524
Documentation:
- Updated hypothesis-tests module. Gave more precise description of what is computed and how. Included example for chi-squared test of independence.
2a135e2
Hypothesis tests / One-way ANOVA:
- Fixed bug in merge function
882c7a8
May 17, 2012
Changes resulting from discussion around pull request 118.
Hypothesis tests:
- Out-of-bounds checks for transition states
- t-test/f-test: UDAs return NULL in case of insufficient data
- Wilcoxon-signed-rank test/Mann-Whitney test: Updated testing of ties. For WSR tests, I added a new precision parameter. This parameter has to be used in the unit test.
- Disabled module on GPDB4.0 because ordered aggregates are required
- If UDAs are not properly used as ordered aggregates, the error messages are now more informative.
- UDA chi2_gof_test now has default arguments

k-means:
- Fixed a wrong schema name in test
242cbff