Commits on Sep 27, 2012
  1. New k-means: Simplified silhouette function

    Jira: MADLIB-681
    Implemented a function to compute the simplified silhouette coefficient.
    Additions for this function in other parts of the code:
    Matrix operations:
    - Function that returns the two closest columns in a matrix (relative to a given
    C++ AL:
    - Support for INTEGER arrays
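
    For illustration, a minimal Python sketch of a simplified silhouette
    coefficient. It assumes Euclidean distance and the common definition
    (b - a) / max(a, b), where a and b are the distances from a point to its
    closest and second-closest centroid. The helper names are hypothetical and
    are not the functions added in this commit.

```python
import math


def two_closest(point, centroids):
    """Return the distances from point to its closest and second-closest centroid."""
    dists = sorted(math.dist(point, c) for c in centroids)
    return dists[0], dists[1]


def simplified_silhouette(points, centroids):
    """Average of (b - a) / max(a, b) over all points (needs >= 2 centroids)."""
    total = 0.0
    for p in points:
        a, b = two_closest(p, centroids)
        # b >= a by construction; if both are 0 the point contributes 0.
        total += 0.0 if b == 0 else (b - a) / max(a, b)
    return total / len(points)
```

    Values close to 1 indicate that points lie much nearer to their own
    centroid than to the next one; values close to 0 indicate overlap.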
    Florian Schoppmann committed Sep 27, 2012
  2. Function for retrieving column of a matrix

    Jira: MADLIB-682
    The following query returns NULL in PostgreSQL:
    SELECT matrix[1] FROM (SELECT ARRAY[ARRAY[1,2], ARRAY[3,4]] matrix) q;
    Reason: Since PostgreSQL does not attach dimensionality to the array type 
    information, it is unknown during parsing whether matrix[1] is an expression of
    type INTEGER or INTEGER[]. We still want to be able to retrieve columns (or rows)
    from matrices. I therefore added a function that handles this special case.
    Florian Schoppmann committed Sep 27, 2012
Commits on Sep 26, 2012
  1. SVM: Fix regression when there is no data on segments

    Jira: MADLIB-677
    In parallel mode, we build one sub-model for each segment.  In the combination
    step, we should skip the segments with no source data.
    Although it is questionable to use segment id as a logical unit of calculation,
    we'll address it as a separate issue in the future.
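
    A hedged Python sketch of the idea of skipping empty segments. It assumes,
    purely for illustration, that each sub-model is a weight vector and that
    sub-models are combined by averaging; the commit does not state how the
    combination is done, and combine_submodels is a hypothetical name.

```python
def combine_submodels(submodels):
    """Average per-segment weight vectors, ignoring segments without data.

    A segment that saw no source rows is represented by None here.
    """
    trained = [w for w in submodels if w is not None]
    if not trained:
        raise ValueError("no segment produced a sub-model")
    dim = len(trained[0])
    return [sum(w[i] for w in trained) / len(trained) for i in range(dim)]
```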
    aojwang committed with haradh1 Sep 20, 2012
  2. Multi-logistic: Remove indication of cg/igd

    Jira: MADLIB-668
    renyi533 committed with haradh1 Sep 21, 2012
Commits on Sep 21, 2012
  1. New k-means: Query tuning

    Jira: MADLIB-678
    The previous optimization fence "OFFSET 0" was suitable for PostgreSQL only. 
    However, an optimization fence is also needed on Greenplum: with it, the innermost
    subquery can be sped up by a factor of about 2.
    An alternative would be moving the innermost subquery that contains the
    closest_column calls into a common table expression, but unfortunately that is not
    supported by earlier Greenplum versions.
    It is unfortunate that we have to use such tricks, but we do not want to forgo a
    factor-2 performance improvement.
    Florian Schoppmann committed Sep 21, 2012
Commits on Sep 20, 2012
  1. Sampling: Include macros in SQL file

    Jira: MADLIB-678, MADLIB-584
    File sample.sql_in did not include the file containing our predefined macros.
    This implied that on Greenplum, the weighted sampling aggregate was defined
    without a merge function. This severely impacted performance on multi-node
    Greenplum instances.
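
    To show why a merge function matters, here is a minimal Python sketch of a
    weighted sampling aggregate whose state is (sample, total_weight). The
    function names and the state layout are assumptions for illustration, not
    the definitions in sample.sql_in. With a merge step, each segment can
    aggregate its own rows and only the small partial states need to be
    combined.

```python
import random


def transition(state, value, weight, rng):
    """Fold one row into the state (sample, total_weight)."""
    sample, total = state
    if weight <= 0:
        return state
    total += weight
    # Keep the new value with probability weight / total.
    if rng.random() < weight / total:
        sample = value
    return (sample, total)


def merge(left, right, rng):
    """Combine two partial states, e.g. from two segments."""
    (left_sample, left_total), (right_sample, right_total) = left, right
    total = left_total + right_total
    if total == 0:
        return left
    # Pick the left sample with probability proportional to its weight.
    sample = left_sample if rng.random() < left_total / total else right_sample
    return (sample, total)
```

    Usage: start each segment from the empty state (None, 0.0), apply
    transition to every row, then merge the per-segment states.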
    Florian Schoppmann committed Sep 20, 2012
  2. Madpack: Remove MADlibSchema variable

    Jira: MADLIB-645
    haradh1 committed Sep 5, 2012
Commits on Sep 18, 2012
  1. Documentation: Replaced inefficient example of using linear regression

    Jira: MADLIB-664
    The documentation for linear regression included the following example:
    SELECT (linregr(price, array[1, bedroom, bath, size])).* FROM houses;
    However, both PostgreSQL and GPDB expand the .* and end up calling the
    linregr() aggregate more often than needed.
    farzadras committed with Florian Schoppmann Sep 14, 2012
Commits on Sep 17, 2012
  1. New k-means: Query tuning

    Jira: MADLIB-678
    During development I added an "OFFSET 0" to the main k-means query, which improved
    PostgreSQL performance by a factor of 3, because it caused a GroupAggregate to be
    replaced by a HashAggregate. On Greenplum, however, the OFFSET 0 hurts performance.
    Therefore, the OFFSET 0 is now only added conditionally. In the future, it would
    be nice to completely get rid of this implicit hint to the planner.
    Florian Schoppmann committed Sep 17, 2012
  2. Utilities: Fix for exec_sql_using(), no longer returns invalid memory

    Jira: MADLIB-676
    It turns out that the function internal_execute_using_kmeans_args(), which is 
    internally implemented by the C function exec_sql_using(), calls the function
    datumCopy() to copy the result datum from an SPI query. However, this is done
    between SPI_connect() and SPI_finish(), i.e., in a memory context that will be
    destroyed by SPI_finish(). Debug builds of Greenplum intentionally overwrite
    deallocated memory, which makes the problem reproducible. The overwritten memory
    then leaves the datum with invalid header information, which in turn causes
    subsequent usage of that datum to exhibit undefined behavior.
    The solution is to copy the result datum into the "upper executor context", e.g.,
    with the function SPI_copytuple().
    Florian Schoppmann committed Sep 17, 2012
Commits on Sep 16, 2012
  1. Build System: Replace GREENPLUM macro, take 3

    Jira: MADLIB-666
    Some files were missing the necessary m4_include.
    Florian Schoppmann committed Sep 16, 2012
Commits on Sep 15, 2012
  1. k-means random seeding: Fixed query execution problems on PG/GP

    Jira: MADLIB-522
    The previous query for random seeding gave non-deterministic results (sometimes
    0 rows were returned, sometimes more). Converting a join to a subquery now
    guarantees that the query is executed correctly.
    Florian Schoppmann committed Sep 15, 2012
Commits on Sep 14, 2012
  1. New k-means: Fixed SQL for compensating for dropped centroids

    Jira: MADLIB-522
    During the reassignment phase k-means may produce stray centroids. In this case
    we sample new centroids using kmeans++. The SQL for this operation had a bug,
    which is now fixed. Also, for GP, the SQL needed to be written so that it would
    not execute on segment nodes.
    I added a unit test for the case where a centroid is dropped (this is easy to
    reproduce by simply giving an initial centroid more than once).
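
    The resampling step can be sketched in Python as follows. This is an
    illustration of the k-means++ rule (sample proportionally to the squared
    distance to the nearest existing centroid), not the SQL used in this
    commit; the function names are hypothetical, and squared Euclidean
    distance is an assumption.

```python
import random


def squared_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sample_replacement(points, centroids, rng):
    """Draw one point with probability proportional to its squared distance
    to the nearest remaining centroid (requires at least one centroid)."""
    weights = [min(squared_dist(p, c) for c in centroids) for p in points]
    if sum(weights) == 0:
        # Every point coincides with a centroid; fall back to a uniform draw.
        return rng.choice(points)
    return rng.choices(points, weights=weights, k=1)[0]


def refill_centroids(points, centroids, k, rng):
    """Top up a centroid list that lost entries until it has k entries again."""
    centroids = list(centroids)
    while len(centroids) < k:
        centroids.append(sample_replacement(points, centroids, rng))
    return centroids
```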
    Florian Schoppmann committed Sep 14, 2012
  2. Build System: Replace GREENPLUM macro, take 2

    Jira: MADLIB-666
    haradh1 committed Sep 14, 2012
  3. Minor changes necessary for integration

    Jira: None
    Most important: [Mutable]Mapped{Matrix|Vector} is now called
    Florian Schoppmann committed Sep 14, 2012
  4. New k-means: Removed argument point id from argument list

    Jira: MADLIB-522
    Now that we have a function to sample vectors directly, it is unnecessary to
    supply a point id argument to k-means (it was only used for weighted sampling
    in k-means++). This should improve performance and usability.
    Other changes:
    - Added return type to closest_column. On Greenplum, we would otherwise see an
      error of the form "cannot serialize transient record type".
    - Adapted k-means unit tests.
    - New unit test for kmeans with initial centroids provided in table
    - Fixed oversight that squaredDistNorm1 did not return the square of the
    - C++ AL: Fixed bug in NativeArrayToMappedMatrix() and
    Florian Schoppmann committed Sep 13, 2012
  5. Sampling: Now use modular transition state of weighted sampling

    Jira: MADLIB-499, MADLIB-584
    Added a function for weighted sampling of vectors (previously only int64 was
    supported). The C++ is completely generic.
    Florian Schoppmann committed Sep 13, 2012
  6. k-means: Shortcut for 4 built-in distance functions

    Jira: MADLIB-522
    - Added a kmeans() function that takes the initial centroids from a relation (as
      opposed to an array)
    - As part of that, extended the exec_sql_using() function that now supports
      queries that return a single value
    - Added a kmeans iteration state that is slightly different from the kmeans
      return type
    - Slight refactoring of closest_column function: Now is a generic function that
      works both with UDFs and normal C++ functions. On top of that, implemented a
      cheat for the four most used distance functions. In the future, additional
      performance tuning of the FunctionHandle class should make this cheat
    Florian Schoppmann committed Sep 12, 2012
  7. C++ AL: Directly call FunctionHandle without detour through the backend

    Jira: MADLIB-672
    Performance improvements. However, memory allocation/deallocation for the
    boost::any and boost::function types is still expensive. We should probably
    resort to a plain pointer, even if it is less elegant and less type-safe.
    Florian Schoppmann committed Sep 11, 2012
  8. k-means: Some tuning of the kmeans SQL

    Jira: MADLIB-522
    Performance tuning:
    - It turns out that the inter-iteration state is better accessed via
      subqueries than by joining the state table and adding a WHERE clause.
      If there are multiple subqueries in a subquery, adding an OFFSET 0 can give
      PostgreSQL/GP a hint that the outer query should not be flattened.
    - Replaced all occurrences of the kmeans module by kmeans_new.
    Florian Schoppmann committed Sep 6, 2012
  9. k-means: objective_fn in output, random seeding. Primitive unit test.

    Jira: MADLIB-522
    - k-means now reports the value of the objective function (sum of squared
      distances, over all points) in the output.
    - Random seeding is now included.
    - Replaced the old primitive unit test with a new primitive unit test. The result
      is not checked for sanity.
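
    As a small illustration of the reported objective, a Python sketch that
    sums, over all points, the squared distance to the closest centroid. It
    assumes squared Euclidean distance; the function names are hypothetical.

```python
def squared_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def objective(points, centroids):
    """Sum over all points of the squared distance to the closest centroid."""
    return sum(min(squared_dist(p, c) for c in centroids) for p in points)
```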
    Florian Schoppmann committed Aug 8, 2012
  10. Revert "Temporarily deactivate random-sampling module for gppkg release [...]"

    Jira: None
    This reverts commit 15bfc3d and reactivates the random-sampling module.
    Florian Schoppmann committed Aug 7, 2012
  11. k-Means: Verify that function parameters have correct signature

    Jira: MADLIB-522
    Florian Schoppmann committed Aug 3, 2012
  12. Vector operations: Added "distances" and normalized average

    Jira: MADLIB-522, MADLIB-532
    List of changes:
    Vector operations:
    - New functions: "squared_angle" and "squared_tanimoto"
    - Added a UDA "normalized_avg"
    - Changed UDA "avg_vector" to "avg".
    - New unit tests
    - Parameters "fn_squared_dist" and "agg_mean" now of type TEXT, so that
      specifying overloaded functions is easier and does not require passing a
      regprocedure value any more
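
    A hedged Python sketch of what such functions might compute. It assumes
    that "squared" means the square of the usual angle and Tanimoto distances,
    and that the normalized average is the mean rescaled to unit length; the
    commit itself does not spell out the definitions, so treat these as one
    plausible reading rather than the library's exact semantics.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def squared_angle(x, y):
    """Square of the angle (in radians) between two non-zero vectors."""
    cosine = dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return math.acos(cosine) ** 2


def squared_tanimoto(x, y):
    """Square of the Tanimoto distance 1 - <x,y> / (|x|^2 + |y|^2 - <x,y>)."""
    xy = dot(x, y)
    distance = 1.0 - xy / (dot(x, x) + dot(y, y) - xy)
    return distance * distance


def normalized_avg(vectors):
    """Mean of the vectors, rescaled to unit Euclidean length."""
    n = len(vectors)
    mean = [sum(column) / n for column in zip(*vectors)]
    norm = math.sqrt(dot(mean, mean))
    return [m / norm for m in mean]
```

    None of these sketches handles zero vectors; a real implementation would
    need to decide what to return in that case.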
    Florian Schoppmann committed Jul 27, 2012
  13. k-means: Minor changes in SQL/Python code to reflect changes in design document

    Jira: None
    - In the k-means SQL/Python code, replaced "col_" by "expr_", because users are
      not limited to passing column names but any expression that has the correct
    - Fixed a compiler warning in avg_vector.cpp.
    Florian Schoppmann committed Jul 24, 2012
  14. k-Means: Modularized recentering

    Jira: MADLIB-522, MADLIB-586
    Linear Algebra:
    - Added the avg_vector UDA that computes the average of a set of vectors
    - Removed the internal kmeans_step aggregate. Instead, kmeans now asks for an aggregate function that calculates the mean, given a set of vectors.
    - Added UDFs for simulating default arguments
    Other changes:
    - Renamed C file containing code for exec_sql_using()
    - Added verbose mode to Python IterationController
    Florian Schoppmann committed May 29, 2012
  15. New k-means implementation, rewritten as an iteratively called aggregate function

    Jira: MADLIB-522, MADLIB-585
    - Complete redesign and rewrite. Main new features:
      1) The new implementation is nothing more than an iteratively called aggregate
         function. No conversion between temporary tables and iteration states takes
         place.
      2) The distance metric is fully pluggable.
      3) Clear separation between the seeding algorithm and Lloyd's local search
         heuristic.
      4) The new code should be much easier to read and maintain.
    - Uses C++ AL
    - First incarnation of general Python abstraction layer (currently still in, though). Main feature: No data movement between backend and PL/Python.
    - Implemented a C function exec_sql_using() that can be used to emulate the EXECUTE ... USING command that is not available before PostgreSQL 8.4. This seems to be the only "guaranteed lossless" way to write function arguments into a table.
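
    The "iteratively called aggregate" design can be sketched in plain Python.
    This is a simplified model, not the actual C++/SQL code: transition plays
    the role of the aggregate's per-row step, final computes the new centroids,
    and lloyd is the driver that calls the aggregate once per iteration. All
    names are hypothetical, and the distance function is passed in to mirror
    the pluggable metric.

```python
def squared_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def closest(point, centroids, dist):
    """Index of the centroid nearest to point under the given distance."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def transition(state, point, centroids, dist):
    """Per-row step: add the point to the running sum of its cluster."""
    sums, counts = state
    i = closest(point, centroids, dist)
    sums[i] = [s + v for s, v in zip(sums[i], point)]
    counts[i] += 1
    return state


def final(state, centroids):
    """Turn sums and counts into new centroids.

    In this sketch an empty cluster simply keeps its old centroid.
    """
    sums, counts = state
    return [[s / n for s in total] if n else list(old)
            for total, n, old in zip(sums, counts, centroids)]


def lloyd(points, centroids, dist=squared_dist, max_iter=20):
    """Call the aggregate once per iteration until the centroids stop moving."""
    for _ in range(max_iter):
        dim = len(centroids[0])
        state = ([[0.0] * dim for _ in centroids], [0] * len(centroids))
        for p in points:
            state = transition(state, p, centroids, dist)
        new_centroids = final(state, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

    Only the small state (sums and counts per centroid) is carried between
    rows, which corresponds to the point above that no conversion between
    temporary tables and iteration states is needed.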
    Florian Schoppmann committed May 15, 2012
Commits on Sep 12, 2012
  1. Association Rules: Disable in GPDB4.0

    Jira: MADLIB-661
    Comment out the 'assoc_rule' from src/ports/greenplum/4.0/config/Modules.yml,
    as it's now using array_agg() which is not supported in GPDB 4.0.  We hope
    to support this again by replacing array_agg() with our implementation in the
    near future.
    aojwang committed with haradh1 Sep 11, 2012
  2. New Method: Add linear-chain conditional random field for NLP

    Jira: MADLIB-647
    This adds CRF on top of the existing Viterbi module.  The design document
    still has room for improvement; we hope to do additional work on it in the
    future.
    livingstream committed with haradh1 Sep 12, 2012
Commits on Sep 10, 2012
  1. Build System: Remove GREENPLUM from m4_ifdef macro

    Jira: MADLIB-639
    Rahul Iyer committed Sep 7, 2012
  2. SVM: Limit the number of features in the training table

    Jira: MADLIB-602
    If the number of features of a dataset is greater than 102400, we report
    an error.  This is due to the current array-based implementation, which
    has a 1G size limit per datum.  If the table is empty, we will report an error,
    aojwang committed with haradh1 Aug 31, 2012
Commits on Sep 6, 2012
  1. k-means: Improve performance with array

    Jira: MADLIB-454
    It seems the sparse vector was the scalability bottleneck.  In some of our
    use cases, arrays are 10 times faster than sparse vectors.  Although we
    still have concerns about space efficiency, we saw that arrays are also
    packed well enough by compression and the TOAST mechanism.
    Let's see whether our test cases confirm the improvement with this fix.
    renyi533 committed with haradh1 Aug 31, 2012
Commits on Sep 5, 2012
  1. Madpack: Add pg_catalog prefix to version() call

    Jira: MADLIB-473
    Users may have a search_path containing madlib and public, in which case
    madpack gets confused when calling the version() function.  Add the
    pg_catalog prefix to make sure the platform's own function is called.
    haradh1 committed Sep 5, 2012
  2. k-means: Check uniqueness of user-provided IDs

    Jira: MADLIB-649
    renyi533 committed with haradh1 Sep 3, 2012
Commits on Sep 1, 2012
  1. Association Rules: Rewrite Apriori for performance improvement

    Jira: MADLIB-462 MADLIB-475 MADLIB-638
    After studying the current implementation of Apriori, we found the following issues:
    1) The result was unstable, since it used the hash key to uniquely identify an
       svec.  As we know, two svecs may have the same hash key.  Therefore, we
       no longer use the hash key.
    2) There were a lot of cross joins.  Actually, we can use an inner join to do the
       same thing as the cross join did.  The cross join was making the process
    3) It was changing the sparse vector in many places, which caused the memory
       issues.  Now, we don't change the svec.  We just construct the svec and
       then do some operations (such as minus, multiply, etc.) on that.
    4) It used PLPGSQL to implement the algorithm, including aggregate function.
    Based on the above analysis, we rewrote it. We use PL/Python and the C++ AL to
    address these issues.
    Although we added code for SRFs to the UDF class, it is still fairly immature
    and leaves a lot of room for a more general-purpose redesign.
    aojwang committed with haradh1 Sep 1, 2012