Version 0.20 (under development)

As well as a plethora of new features and enhancements, this release is the first to be accompanied by a glossary developed by Joel Nothman. The glossary is a reference resource to help users and contributors become familiar with the terminology and conventions used in Scikit-learn.

Changed models

The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.

  • decomposition.IncrementalPCA in Python 2 (bug fix)
  • isotonic.IsotonicRegression (bug fix)
  • linear_model.ARDRegression (bug fix)
  • linear_model.OrthogonalMatchingPursuit (bug fix)
  • metrics.roc_auc_score (bug fix)
  • metrics.roc_curve (bug fix)
  • neural_network.BaseMultilayerPerceptron (bug fix)
  • neural_network.MLPRegressor (bug fix)
  • neural_network.MLPClassifier (bug fix)

Details are listed in the changelog below.

(While we are trying to better inform users by providing this information, we cannot guarantee that this list is complete.)

Changelog

Support for Python 3.3 has been officially dropped.

New features

Classifiers and regressors

  • ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor now support early stopping via n_iter_no_change, validation_fraction and tol. 7071 by Raghav RV.
  • Added naive_bayes.ComplementNB, which implements the Complement Naive Bayes classifier described in Rennie et al. (2003). 8190 by Michael A. Alcorn <airalcorn2>.
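As a rough illustration of the early stopping entry above, the sketch below fits a gradient boosting classifier with the three parameters named there. The synthetic dataset, the seeds and the specific parameter values are assumptions chosen for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data; sizes and seeds are arbitrary choices for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of boosting stages
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=5,       # stop after 5 stages without sufficient improvement
    tol=1e-4,                 # minimum improvement that counts as progress
    random_state=0,
)
clf.fit(X, y)

# n_estimators_ reports how many stages were actually fitted.
print(clf.n_estimators_)
```

If early stopping triggers, n_estimators_ will be smaller than the requested n_estimators.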

Preprocessing

  • Added preprocessing.CategoricalEncoder, which encodes categorical features as a numeric array, either using a one-hot (or dummy) encoding scheme or by converting to ordinal integers. Compared to the existing OneHotEncoder, this new class handles all feature types (including string-valued features) and derives the categories from the unique values in each feature instead of from the maximum value. 9151 by Vighnesh Birodkar <vighneshbirodkar> and Joris Van den Bossche.
  • Added preprocessing.PowerTransformer, which implements the Box-Cox power transformation, allowing users to map data from a variety of distributions to one that is as close to Gaussian as possible. This is useful as a variance-stabilizing transformation in situations where normality and homoscedasticity are desirable. 10210 by Eric Chang <ericchang00> and Maniteja Nandana <maniteja123>.
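A minimal sketch of the Box-Cox transformation described above, applied to skewed, strictly positive data. The log-normal sample is an assumption made for the example, and method="box-cox" is passed explicitly because the default method may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Log-normal data is strictly positive and right-skewed (illustrative choice).
rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))

# Box-Cox requires strictly positive input.
pt = PowerTransformer(method="box-cox")
X_t = pt.fit_transform(X)

print(pt.lambdas_)            # fitted power parameter, one per feature
print(X_t.mean(), X_t.std())  # standardized output by default
```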

Model evaluation

  • Added the metrics.balanced_accuracy_score metric and a corresponding 'balanced_accuracy' scorer for binary classification. 8066 by xyguo and Aman Dalmia <dalmia>.
  • Added multioutput.RegressorChain for multi-target regression. 9257 by Kumar Ashutosh <thechargedneutron>.
  • Added the preprocessing.TransformedTargetRegressor which transforms the target y before fitting a regression model. The predictions are mapped back to the original space via an inverse transform. 9041 by Andreas Müller and Guillaume Lemaitre <glemaitre>.
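To make the balanced accuracy entry above concrete: balanced accuracy is the average of the per-class recalls, so it penalizes a classifier that neglects the minority class. The toy labels below are invented for the example.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Imbalanced toy labels: four negatives, two positives.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

# Plain accuracy: 5 of 6 predictions are correct.
acc = accuracy_score(y_true, y_pred)

# Recall is 1.0 for class 0 and 0.5 for class 1, so the average is 0.75.
bal = balanced_accuracy_score(y_true, y_pred)

print(acc, bal)
```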

Enhancements

Classifiers and regressors

  • In gaussian_process.GaussianProcessRegressor, the predict method is faster when using return_std=True, in particular when called several times in a row. 9234 by andrewww <andrewww> and Minghui Liu <minghui-liu>.
  • Add named_estimators_ attribute in ensemble.VotingClassifier to access fitted estimators. 9157 by Herilalaina Rakotoarison <herilalaina>.
  • Add var_smoothing parameter in naive_bayes.GaussianNB to give precise control over the variance calculation. 9681 by Dmitry Mottl <Mottl>.
  • Add n_iter_no_change parameter in neural_network.BaseMultilayerPerceptron, neural_network.MLPRegressor, and neural_network.MLPClassifier to control the maximum number of epochs that may fail to meet the tol improvement. 9456 by Nicholas Nadeau <nnadeau>.
  • A parameter check_inverse was added to preprocessing.FunctionTransformer to ensure that func and inverse_func are the inverse of each other. 9399 by Guillaume Lemaitre <glemaitre>.
  • Add sample_weight parameter to the fit method of linear_model.BayesianRidge for weighted linear regression. 10111 by Peter St. John <pstjohn>.
  • dummy.DummyClassifier and dummy.DummyRegressor now only require X to be an object with finite length or shape. 9832 by Vrishank Bhardwaj <vrishank97>.
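As one example from the list above, the sketch below exercises the check_inverse option of preprocessing.FunctionTransformer. The pairing of np.log1p with np.expm1 and the deliberately mismatched pair are assumptions picked for illustration. As far as we can tell, a mismatch is reported as a warning at fit time rather than as an error.

```python
import warnings

import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[0.0, 1.0], [2.0, 3.0]])

# log1p and expm1 are true inverses, so the check passes silently.
ft = FunctionTransformer(func=np.log1p, inverse_func=np.expm1, check_inverse=True)
X_t = ft.fit_transform(X)
X_back = ft.inverse_transform(X_t)

# A mismatched pair (sqrt is not undone by exp) is flagged during fit.
bad = FunctionTransformer(func=np.sqrt, inverse_func=np.exp, check_inverse=True)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    bad.fit(X)

print(np.allclose(X_back, X), len(caught))
```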

Cluster

  • cluster.KMeans, cluster.MiniBatchKMeans and cluster.k_means passed with algorithm='full' now enforce row-major ordering, improving runtime. 10471 by Gaurav Dhingra <gxyd>.

Preprocessing

  • preprocessing.PolynomialFeatures now supports sparse input. 10452 by Aman Dalmia <dalmia> and Joel Nothman.
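A small sketch of the sparse input support mentioned above. The CSR format and the tiny matrix are arbitrary choices for the example; which sparse formats are handled most efficiently may depend on the scikit-learn version.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import PolynomialFeatures

# Two samples with two features (a, b), stored as a sparse matrix.
X = sparse.csr_matrix(np.array([[1.0, 2.0], [0.0, 3.0]]))

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# The output stays sparse; columns are 1, a, b, a^2, a*b, b^2.
print(sparse.issparse(X_poly))
print(X_poly.toarray())
```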

Model evaluation and meta-estimators

  • A scorer based on metrics.brier_score_loss is now available. 9521 by Hanmin Qin <qinhanmin2014>.
  • The default of the iid parameter of model_selection.GridSearchCV and model_selection.RandomizedSearchCV will change from True to False in version 0.22 to correspond to the standard definition of cross-validation, and the parameter will be removed altogether in version 0.24. This parameter is of greatest practical significance where the sizes of the test sets in cross-validation are very unequal, for example in group-based CV strategies. 9085 by Laurent Direr <ldirer> and Andreas Müller.
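The effect of the iid setting can be seen with plain arithmetic. The sketch below does not call GridSearchCV; it only contrasts the two averaging conventions, on the understanding that iid=True weights each fold's score by its test set size while iid=False takes the unweighted mean over folds. The fold scores and sizes are made up.

```python
import numpy as np

# Hypothetical per-fold test scores and test set sizes (very unequal folds).
fold_scores = np.array([0.9, 0.6])
fold_sizes = np.array([90, 10])

# Size-weighted mean: (0.9 * 90 + 0.6 * 10) / 100 = 0.87
weighted = np.average(fold_scores, weights=fold_sizes)

# Unweighted mean over folds: (0.9 + 0.6) / 2 = 0.75
unweighted = fold_scores.mean()

print(weighted, unweighted)
```

With equally sized folds the two numbers coincide, which is why the change matters mostly for group-based splitters.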

Metrics

  • metrics.roc_auc_score now supports binary y_true other than {0, 1} or {-1, 1}. 9828 by Hanmin Qin <qinhanmin2014>.
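For instance, string labels can be passed directly. In this sketch the labels and scores are invented, and the label that sorts last ("pos") is assumed to be treated as the positive class.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Binary targets that are neither {0, 1} nor {-1, 1}.
y_true = np.array(["neg", "neg", "pos", "pos"])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Three of the four (positive, negative) pairs are ranked correctly.
auc_value = roc_auc_score(y_true, y_score)
print(auc_value)
```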

Linear, kernelized and related models

  • Deprecate random_state parameter in svm.OneClassSVM as the underlying implementation is not random. 9497 by Albert Thomas <albertcthomas>.

Miscellaneous

  • Add filename attribute to datasets that have a CSV file. 9101 by alex-33 <alex-33> and Maskani Filali Mohamed <maskani-moh>.

Bug fixes

Classifiers and regressors

  • Fixed a bug in isotonic.IsotonicRegression which incorrectly combined weights when fitting a model to data involving points with identical X values. 9432 by Dallas Card <dallascard>.
  • Fixed a bug in neural_network.BaseMultilayerPerceptron, neural_network.MLPRegressor, and neural_network.MLPClassifier: the number of epochs without tol improvement before stopping, previously hardcoded to 2, is now set by the new n_iter_no_change parameter, which defaults to 10. 9456 by Nicholas Nadeau <nnadeau>.
  • Fixed a bug in neural_network.MLPRegressor where fitting quit unexpectedly early due to local minima or fluctuations. 9456 by Nicholas Nadeau <nnadeau>.
  • Fixed a bug in naive_bayes.GaussianNB which incorrectly raised an error for a prior list that summed to 1. 10005 by Gaurav Dhingra <gxyd>.
  • Fixed a bug in linear_model.LogisticRegression where, with multi_class='multinomial', the predict_proba method returned incorrect probabilities in the case of binary outcomes. 9939 by Roger Westover <rwolst>.
  • Fixed a bug in linear_model.OrthogonalMatchingPursuit that was broken when setting normalize=False. 10071 by Alexandre Gramfort.
  • Fixed a bug in linear_model.ARDRegression which caused incorrectly updated estimates for the standard deviation and the coefficients. 10153 by Jörg Döpfert <jdoepfert>.
  • Fixed a bug when fitting ensemble.GradientBoostingClassifier or ensemble.GradientBoostingRegressor with warm_start=True which previously raised a segmentation fault due to a non-conversion of CSC matrix into CSR format expected by decision_function. Similarly, Fortran-ordered arrays are converted to C-ordered arrays in the dense case. 9991 by Guillaume Lemaitre <glemaitre>.
  • Fixed a bug in neighbors.NearestNeighbors where fitting failed when a) the distance metric was a callable and b) the input was sparse. 9579 by Thomas Kober <tttthomasssss>.
  • Fixed a bug in svm.SVC where, when the kernel argument was unicode in Python 2, the predict_proba method raised an unexpected TypeError given dense inputs. 10412 by Jiongyan Zhang <qmick>.
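As a quick check related to the naive_bayes.GaussianNB entry above, the sketch below passes the class priors as a plain Python list that sums to 1 up to floating point rounding. The one-dimensional toy data and the prior values are assumptions for the example. It shows the accepted usage only and does not reproduce the original failure.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Three well-separated classes on a single feature (illustrative data).
X = np.array([[-2.0], [-1.5], [0.0], [0.2], [2.0], [2.5]])
y = np.array([0, 0, 1, 1, 2, 2])

# Priors given as a list, one entry per class.
priors = [0.1, 0.2, 0.7]
clf = GaussianNB(priors=priors).fit(X, y)

print(clf.class_prior_)
print(clf.predict([[2.2]]))
```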

Decomposition, manifold learning and clustering

  • Fix for uninformative error in decomposition.IncrementalPCA: now an error is raised if the number of components is larger than the chosen batch size. The n_components=None case was adapted accordingly. 6452 by Wally Gauze <wallygauze>.
  • Fixed a bug where the partial_fit method of decomposition.IncrementalPCA used integer division instead of float division on Python 2 versions. 9492 by James Bourbeau <jrbourbeau>.
  • Fixed a bug where the fit method of cluster.AffinityPropagation stored cluster centers as a 3d array instead of a 2d array in case of non-convergence. For the same class, fixed undefined and arbitrary behavior in case of training data where all samples had equal similarity. 9612 by Jonatan Samoocha <jsamoocha>.
  • In decomposition.PCA, selecting an n_components parameter greater than the number of samples now raises an error. Similarly, the n_components=None case now selects the minimum of n_samples and n_features. 8484 by Wally Gauze <wallygauze>.
  • Fixed a bug in datasets.fetch_kddcup99, where data were not properly shuffled. 9731 by Nicolas Goix.
  • Fixed a bug in decomposition.PCA where users got an unexpected error with large datasets when n_components='mle' on Python 3 versions. 9886 by Hanmin Qin <qinhanmin2014>.
  • Fixed a bug when setting parameters on a meta-estimator, involving both a wrapped estimator and its parameter. 9999 by Marcus Voss <marcus-voss> and Joel Nothman.
  • k_means now gives a warning if the number of distinct clusters found is smaller than n_clusters. This may occur when the number of distinct points in the data set is smaller than the number of clusters one is looking for. 10059 by Christian Braune <christianbraune79>.
  • Fixed a bug in datasets.make_circles, where an odd number of data points could not be generated. 10037 by Christian Braune <christianbraune79>.
  • Fixed a bug in cluster.spectral_clustering where the normalization of the spectrum was using a division instead of a multiplication. 8129 by Jan Margeta <jmargeta>, Guillaume Lemaitre <glemaitre>, and Devansh D. <devanshdalal>.
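The decomposition.PCA behavior described in this list can be checked on a small matrix with fewer samples than features. The random data and the choice of 8 components are assumptions for the example; the default solver is left untouched.

```python
import numpy as np
from sklearn.decomposition import PCA

# 5 samples, 10 features (illustrative shape).
rng = np.random.RandomState(0)
X = rng.normal(size=(5, 10))

# With n_components=None, the minimum of n_samples and n_features is used.
pca = PCA().fit(X)
print(pca.n_components_)

# Requesting more components than there are samples raises an error.
try:
    PCA(n_components=8).fit(X)
    raised = False
except ValueError:
    raised = True
print(raised)
```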

Metrics

  • Fixed a bug in metrics.precision_recall_fscore_support when a truncated range(n_labels) is passed as the value for labels. 10377 by Gaurav Dhingra <gxyd>.
  • Fixed a bug due to floating point error in metrics.roc_auc_score with non-integer sample weights. 9786 by Hanmin Qin <qinhanmin2014>.
  • Fixed a bug where metrics.roc_curve sometimes started on the y-axis instead of at (0, 0), which was inconsistent with the documentation and other implementations. Note that this does not influence the result of metrics.roc_auc_score. 10093 by alexryndin <alexryndin> and Hanmin Qin <qinhanmin2014>.
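A short sketch of the metrics.roc_curve behavior after the fix: the returned curve begins at the origin and ends at (1, 1). The labels and scores are toy values chosen for the example.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# First and last points of the curve as (fpr, tpr) pairs.
print((fpr[0], tpr[0]), (fpr[-1], tpr[-1]))
```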

Neighbors

  • Fixed a bug so that predict in neighbors.RadiusNeighborsRegressor can handle an empty neighbor set when using non-uniform weights. A new warning is also raised when no neighbors are found for some samples. 9655 by Andreas Bjerre-Nielsen <abjer>.

Feature Extraction

  • Fixed a bug in feature_extraction.image.extract_patches_2d which would throw an exception if max_patches was greater than or equal to the number of all possible patches rather than simply returning the number of possible patches. 10100 by Varun Agrawal <varunagrawal>.
  • Fixed a bug in feature_extraction.text.CountVectorizer, feature_extraction.text.TfidfVectorizer, feature_extraction.text.HashingVectorizer to support 64 bit sparse array indexing necessary to process large datasets with more than 2·10⁹ tokens (words or n-grams). 9147 by Claes-Fredrik Mannby <mannby> and Roman Yurchak.
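To illustrate the extract_patches_2d entry above: a 4x4 image has 3 x 3 = 9 possible 2x2 patches, so asking for more than that should yield 9 patches instead of an exception. The image contents and the max_patches value are arbitrary, and because patches are drawn at random the sketch checks only the output shape.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

# A 4x4 grayscale image has (4 - 2 + 1) ** 2 = 9 distinct 2x2 patch positions.
image = np.arange(16, dtype=float).reshape(4, 4)

# max_patches exceeds the number of possible patches.
patches = extract_patches_2d(image, (2, 2), max_patches=100, random_state=0)
print(patches.shape)
```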

API changes summary

Linear, kernelized and related models

  • Deprecate random_state parameter in svm.OneClassSVM as the underlying implementation is not random. 9497 by Albert Thomas <albertcthomas>.
  • Deprecate positive=True option in linear_model.Lars as the underlying implementation is broken. Use linear_model.Lasso instead. 9837 by Alexandre Gramfort.

Metrics

  • Deprecate reorder parameter in metrics.auc as it's no longer required for metrics.roc_auc_score. Moreover using reorder=True can hide bugs due to floating point error in the input. 9851 by Hanmin Qin <qinhanmin2014>.

Cluster

  • Deprecate the unused pooling_func parameter in cluster.AgglomerativeClustering. 9875 by Kumar Ashutosh <thechargedneutron>.

Changes to estimator checks

  • Allow tests in estimator_checks.check_estimator to test functions that accept pairwise data. 9701 by Kyle Johnson <gkjohns>.
  • Allow estimator_checks.check_estimator to check that there are no private settings apart from parameters during estimator initialization. 9378 by Herilalaina Rakotoarison <herilalaina>.
  • Add test estimator_checks.check_methods_subset_invariance to check that estimator methods are invariant if applied to a data subset. 10420 by Jonathan Ohayon <Johayon>.