Pivotal tracker: 88636158
Change:
- Kmeans++ makes 'k' passes over the data to compute the k initial centroids, which is very slow on large datasets. Seeding can now run on a subsample of the data, with the size controlled by a user-defined parameter.
- The default behavior is still to seed from the complete dataset, as described in the original algorithm. A subsampling ratio parameter lets the user set the size of the subsample used for seeding.
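The seeding change can be illustrated with a minimal pure-Python sketch. It is not the MADlib implementation: the function name `kmeanspp_seed` and the parameter name `seeding_sample_ratio` are assumptions chosen for illustration, and points are modelled as tuples held in memory rather than rows in a table.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, k, seeding_sample_ratio=1.0, rng=None):
    """Pick k initial centroids with k-means++ seeding.

    seeding_sample_ratio (hypothetical name) is the fraction of the data
    used for seeding; 1.0 seeds from the complete dataset.
    """
    rng = rng or random.Random()
    if not 0.0 < seeding_sample_ratio <= 1.0:
        raise ValueError("seeding_sample_ratio must be in (0, 1]")
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")

    # Draw the subsample once; every seeding pass then touches only the subsample.
    sample_size = int(round(len(points) * seeding_sample_ratio))
    sample_size = min(len(points), max(k, sample_size))
    if sample_size == len(points):
        sample = list(points)
    else:
        sample = rng.sample(list(points), sample_size)

    # First centroid: uniform at random.
    centroids = [rng.choice(sample)]
    min_d2 = [squared_distance(p, centroids[0]) for p in sample]

    # Each further centroid costs one pass over the sample.
    while len(centroids) < k:
        if sum(min_d2) == 0:
            # All sampled points coincide with a chosen centroid.
            nxt = rng.choice(sample)
        else:
            # Sample proportionally to the squared distance to the nearest centroid.
            nxt = rng.choices(sample, weights=min_d2, k=1)[0]
        centroids.append(nxt)
        min_d2 = [min(d, squared_distance(p, nxt)) for p, d in zip(sample, min_d2)]
    return centroids
```

With a ratio of 0.1 on 1000 points, each of the k passes scans about 100 points instead of 1000, which is where the speedup comes from. The trade-off is that the seeds are only as representative as the subsample.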
Additional author: Rahul Iyer <email@example.com>
Pivotal Tracker: #87321270
JIRA: MADLIB-901
Changes:
- allow expression as id column in multinom_predict, mlogregr_predict, and coxph_predict
- fix _filter_recursive_view_dependency (MADLIB-901)
- fix coxph_result in old change lists
- drop old lmf, coxph functions for hawq reinstall
- remove .block(), still problematic in OSX
- add coxph_predict help function with no arguments
Additional author: Rahul Iyer <firstname.lastname@example.org>
Pivotal Tracker: #87237284
Changes:
- keep var_imp_score in oob_prediction table
- vectorize distribution_agg for permutation
- edit array_add for use as merge/transition function
Pivotal Tracker: #87646110
Details:
The table_exists function in validate_args checks for the table in all schemas in the search path. For output tables, we only want to check the current schema. We add a boolean flag to distinguish the two cases and update every call that checks for an output table. The commit also updates all modules that use table_exists, setting the flag to True when an output table is validated.
Others:
- Add online help message to coxph_predict function
- Add drop functions and change lists for coxph functions for HAWQ reinstall and upgrade, missed in v1.6
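The difference between the two lookups can be modelled without a database. The sketch below is an assumption-laden illustration, not the validate_args code: the catalog is a plain dict from schema name to table names, and the parameter names `catalog`, `search_path` and `only_first_schema` are hypothetical. It treats the current schema as the first schema in the search path that exists.

```python
def table_exists(table_name, catalog, search_path, only_first_schema=False):
    """Return True if table_name is visible through the search path.

    catalog: dict mapping schema name -> set of table names (a stand-in
             for the database catalog).
    only_first_schema: hypothetical flag name; when True, only the
             current schema is checked, as wanted for output tables.
    """
    # Schemas listed in the search path that do not exist are skipped.
    visible = [schema for schema in search_path if schema in catalog]
    if only_first_schema:
        visible = visible[:1]
    return any(table_name in catalog[schema] for schema in visible)


catalog = {"public": {"iris"}, "madlib": {"model_out"}}
search_path = ["public", "madlib"]

# Input-table check: any schema on the search path counts.
found_anywhere = table_exists("model_out", catalog, search_path)
# Output-table check: a same-named table in another schema should not block creation.
found_in_current = table_exists("model_out", catalog, search_path,
                                only_first_schema=True)
```

Here `found_anywhere` is true and `found_in_current` is false, so an output table named model_out could still be created in the current schema.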
Pivotal tracker: 87646110
Changes:
- If the final group state is terminated (or, in the no-grouping case, the single state is terminated) for GLM controllers, the iteration convergence test query returns an empty result. This is avoided by finishing automatically when there are no active states.
- Validation in columns_exist_in_table was not comparing against the unquoted names of the columns in the table. It now compares the unquoted input with the unquoted column names.
- A couple of minor validation bugs were fixed in ordinal() and multinom().
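The column-name fix amounts to normalizing both sides before comparing. The following is a rough sketch only: the helper names `unquote_identifier` and `columns_exist` are hypothetical, and it assumes both the user input and the table's column names arrive written in SQL identifier syntax (quoted names keep their case, unquoted names fold to lowercase as in PostgreSQL).

```python
def unquote_identifier(name):
    """Normalize a SQL identifier to the name the database stores."""
    name = name.strip()
    if len(name) >= 2 and name[0] == '"' and name[-1] == '"':
        # Quoted identifier: drop the quotes, unescape doubled quotes, keep case.
        return name[1:-1].replace('""', '"')
    # Unquoted identifier: PostgreSQL folds it to lowercase.
    return name.lower()


def columns_exist(requested, table_columns):
    """True if every requested column matches a table column after unquoting."""
    existing = {unquote_identifier(col) for col in table_columns}
    return all(unquote_identifier(col) in existing for col in requested)
```

Comparing the raw strings instead would make `"MyCol"` (with quotes) fail to match a column that really is named MyCol.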
Additional author: Rahul Iyer <email@example.com>
Pivotal Tracker: #88080334
Changes:
- remove views w/ duplicate rows that were passed to tree_train (use src_view with poisson count directly instead)
- enable updating stats using weights_as_counts for RF
Pivotal tracker: 86025702
The summary function uses PERCENTILE_CONT to compute the percentiles. This functionality was previously available only on GPDB 4.2.2 or higher, and the code explicitly checked for that platform. The function is now also available in PostgreSQL 9.4 and on HAWQ 1.2.0, so we enable quantiles on those platforms.
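For readers unfamiliar with PERCENTILE_CONT, it returns a continuous percentile by linear interpolation between adjacent ordered values. A small Python model of that interpolation is sketched below; it illustrates the SQL semantics only and is not code from the summary function.

```python
import math


def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation, ignoring None (NULL)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(v for v in values if v is not None)
    if not ordered:
        return None
    # Zero-based fractional position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return float(ordered[lower])
    # Weight each neighbour by its closeness to the fractional position.
    return ordered[lower] * (upper - position) + ordered[upper] * (position - lower)


median = percentile_cont([1, 2, 3, 4], 0.5)        # 2.5
first_quartile = percentile_cont([10, 20, 30], 0.25)  # 15.0
```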
Pivotal tracker: 86653930
In random forest, we join the original source table with a Poisson count table as part of the bootstrap. This is a join between two big tables. Replacing that join with a single temporary table gives about a 20% speedup. We also provide a user option to run RF on only a random subsample of the dataset. The bootstrap is then performed on that random subset, which improves the runtime of the method.
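The idea of attaching the Poisson count to each row in a single pass, rather than joining against a separate count table, can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the RF code: the names `bootstrap_with_counts` and `sample_ratio` are hypothetical, rows live in a list instead of a table, and the Poisson mean of 1 is an assumption of this sketch.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def bootstrap_with_counts(rows, sample_ratio=1.0, rng=None):
    """Build one bootstrap 'table' of (row, count) pairs in a single pass.

    sample_ratio (hypothetical name) keeps each row with that probability
    before the bootstrap count is drawn.
    """
    rng = rng or random.Random()
    result = []
    for row in rows:
        # Optional subsampling: skip rows outside the random subset.
        if rng.random() >= sample_ratio:
            continue
        count = poisson_draw(1.0, rng)
        # Rows drawn zero times are out-of-bag and are not materialized.
        if count > 0:
            result.append((row, count))
    return result
```

Storing the count next to the row plays the role of the single temporary table: downstream training can weight each row by its count instead of joining two large tables or duplicating rows.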
Additional author: Feng, Xixuan (Aaron) <firstname.lastname@example.org>
Pivotal Tracker: #86332928
JIRA: MADLIB-753
Changes:
- add install-check for correctness test on iris data
- add toy dataset install-check testcases, install_test_4 and install_test_5
- add overloaded SQL functions to support numeric/categorical variables
  - create_nb_prepared_data_tables
  - create_nb_classify_view
  - create_nb_probs_view
Pivotal Tracker: #75480818
Changes:
- added a function to retrieve information on a particular tree in the forest, in the format required by R's randomForest library.