sklearn
As well as a plethora of new features and enhancements, this release is the first to be accompanied by a glossary developed by Joel Nothman. The glossary is a reference resource to help users and contributors become familiar with the terminology and conventions used in scikit-learn.
The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.
- decomposition.IncrementalPCA in Python 2 (bug fix)
- isotonic.IsotonicRegression (bug fix)
- linear_model.ARDRegression (bug fix)
- linear_model.OrthogonalMatchingPursuit (bug fix)
- metrics.roc_auc_score (bug fix)
- metrics.roc_curve (bug fix)
- neural_network.BaseMultilayerPerceptron (bug fix)
- neural_network.MLPRegressor (bug fix)
- neural_network.MLPClassifier (bug fix)
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot guarantee that this list is complete.)
Support for Python 3.3 has been officially dropped.
Classifiers and regressors
- ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor now support early stopping via n_iter_no_change, validation_fraction and tol. #7071 by Raghav RV.
- Added naive_bayes.ComplementNB, which implements the Complement Naive Bayes classifier described in Rennie et al. (2003). #8190 by Michael A. Alcorn <airalcorn2>.
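As a minimal sketch of the early stopping described above, the following fits a gradient boosting classifier on synthetic data. The dataset and the specific parameter values are illustrative assumptions, not recommendations from the changelog.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, used here only for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of boosting stages
    n_iter_no_change=5,       # stop after 5 stages without sufficient improvement
    validation_fraction=0.2,  # share of the training data held out for validation
    tol=1e-4,                 # minimum improvement that counts as progress
    random_state=0,
)
clf.fit(X, y)

# Number of stages actually fitted; at most n_estimators.
print(clf.n_estimators_)
```

Leaving n_iter_no_change at its default of None disables early stopping, so all n_estimators stages are fitted.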
Preprocessing
- Added preprocessing.CategoricalEncoder, which encodes categorical features as a numeric array, using either a one-hot (or dummy) encoding scheme or conversion to ordinal integers. Compared to the existing OneHotEncoder, this new class handles all feature types (including string-valued features) and derives the categories from the unique values in the features instead of the maximum value in the features. #9151 by Vighnesh Birodkar <vighneshbirodkar> and Joris Van den Bossche.
- Added preprocessing.PowerTransformer, which implements the Box-Cox power transformation, allowing users to map data to a distribution that is as close to Gaussian as possible. This is useful as a variance-stabilizing transformation in situations where normality and homoscedasticity are desirable. #10210 by Eric Chang <ericchang00> and Maniteja Nandana <maniteja123>.
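The Box-Cox transformation mentioned above can be sketched as follows. The example assumes strictly positive input (which Box-Cox requires) and passes method="box-cox" explicitly, since the default method may differ between versions. The log-normal sample is an illustrative assumption.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
# Strictly positive, right-skewed sample (illustrative only).
X = rng.lognormal(size=(1000, 1))

pt = PowerTransformer(method="box-cox", standardize=True)
X_trans = pt.fit_transform(X)

# One estimated Box-Cox lambda per feature.
print(pt.lambdas_)
# With standardize=True the output has zero mean and unit variance.
print(X_trans.mean(), X_trans.std())
```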
Model evaluation
- Added the metrics.balanced_accuracy_score metric and a corresponding 'balanced_accuracy' scorer for binary classification. #8066 by xyguo and Aman Dalmia <dalmia>.
- Added multioutput.RegressorChain for multi-target regression. #9257 by Kumar Ashutosh <thechargedneutron>.
- Added the preprocessing.TransformedTargetRegressor, which transforms the target y before fitting a regression model. The predictions are mapped back to the original space via an inverse transform. #9041 by Andreas Müller and Guillaume Lemaitre <glemaitre>.
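A minimal sketch of the target transformation described above. Note one assumption: the import below uses sklearn.compose, which is where released versions expose TransformedTargetRegressor, rather than the preprocessing path named in this entry. The noise-free exponential data is purely illustrative.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 2, size=(200, 1))
# The target is exponential in X, so log(y) is exactly linear in X.
y = np.exp(1.5 * X[:, 0] + 0.5)

reg = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,          # applied to y before fitting
    inverse_func=np.exp,  # applied to predictions to return to the original space
)
reg.fit(X, y)
pred = reg.predict(X)

# Predictions are reported on the original (untransformed) scale.
print(np.allclose(pred, y))
```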
Classifiers and regressors
- In gaussian_process.GaussianProcessRegressor, the predict method is faster when using return_std=True, in particular when called several times in a row. #9234 by andrewww <andrewww> and Minghui Liu <minghui-liu>.
- Added the named_estimators_ attribute to ensemble.VotingClassifier to access fitted estimators. #9157 by Herilalaina Rakotoarison <herilalaina>.
- Added the var_smoothing parameter to naive_bayes.GaussianNB to give precise control over the variance calculation. #9681 by Dmitry Mottl <Mottl>.
- Added the n_iter_no_change parameter to neural_network.BaseMultilayerPerceptron, neural_network.MLPRegressor, and neural_network.MLPClassifier to control the maximum number of epochs allowed without meeting the tol improvement. #9456 by Nicholas Nadeau <nnadeau>.
- A parameter check_inverse was added to preprocessing.FunctionTransformer to ensure that func and inverse_func are the inverse of each other. #9399 by Guillaume Lemaitre <glemaitre>.
- Added the sample_weight parameter to the fit method of linear_model.BayesianRidge for weighted linear regression. #10111 by Peter St. John <pstjohn>.
- dummy.DummyClassifier and dummy.DummyRegressor now only require X to be an object with finite length or shape. #9832 by Vrishank Bhardwaj <vrishank97>.
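Two of the additions above can be sketched together: looking up a fitted sub-estimator by name on a voting classifier, and setting var_smoothing on GaussianNB. The estimator names "lr" and "gnb" and the iris dataset are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        # var_smoothing adds a fraction of the largest feature variance
        # to all variances for numerical stability.
        ("gnb", GaussianNB(var_smoothing=1e-9)),
    ],
    voting="soft",
)
vote.fit(X, y)

# Fitted sub-estimators are available by the names given above.
fitted_gnb = vote.named_estimators_["gnb"]
print(fitted_gnb.class_prior_)
```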
Cluster
- cluster.KMeans, cluster.MiniBatchKMeans and cluster.k_means passed with algorithm='full' now enforce row-major ordering, improving runtime. #10471 by Gaurav Dhingra <gxyd>.
Preprocessing
- preprocessing.PolynomialFeatures now supports sparse input. #10452 by Aman Dalmia <dalmia> and Joel Nothman.
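A short sketch of passing sparse input to PolynomialFeatures, assuming SciPy's CSR format. The 2x2 matrix is an illustrative assumption.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import PolynomialFeatures

# Small sparse matrix with two features, a and b (illustrative only).
X = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 3.0]]))

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# The output stays sparse; columns are 1, a, b, a^2, a*b, b^2.
print(sparse.issparse(X_poly), X_poly.shape)
print(X_poly.toarray())
```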
Model evaluation and meta-estimators
- A scorer based on metrics.brier_score_loss is also available. #9521 by Hanmin Qin <qinhanmin2014>.
- The default of the iid parameter of model_selection.GridSearchCV and model_selection.RandomizedSearchCV will change from True to False in version 0.22 to correspond to the standard definition of cross-validation, and the parameter will be removed altogether in version 0.24. This parameter is of greatest practical significance where the sizes of different test sets in cross-validation are very unequal, i.e. in group-based CV strategies. #9085 by Laurent Direr <ldirer> and Andreas Müller.
Metrics
- metrics.roc_auc_score now supports binary y_true other than {0, 1} or {-1, 1}. #9828 by Hanmin Qin <qinhanmin2014>.
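As a small illustration of the entry above, the labels below are {2, 5} rather than {0, 1}. The labels and scores are illustrative assumptions; the larger label is treated as the positive class.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Binary labels that are neither {0, 1} nor {-1, 1}.
y_true = np.array([2, 2, 5, 5])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```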
Linear, kernelized and related models
- Deprecate random_state parameter in svm.OneClassSVM as the underlying implementation is not random. #9497 by Albert Thomas <albertcthomas>.
Miscellaneous
- Add filename attribute to datasets that have a CSV file. #9101 by alex-33 <alex-33> and Maskani Filali Mohamed <maskani-moh>.
Classifiers and regressors
- Fixed a bug in isotonic.IsotonicRegression which incorrectly combined weights when fitting a model to data involving points with identical X values. #9432 by Dallas Card <dallascard>.
- Fixed a bug in neural_network.BaseMultilayerPerceptron, neural_network.MLPRegressor, and neural_network.MLPClassifier: the new n_iter_no_change parameter now defaults to 10, replacing the previously hardcoded value of 2. #9456 by Nicholas Nadeau <nnadeau>.
- Fixed a bug in neural_network.MLPRegressor where fitting quit unexpectedly early due to local minima or fluctuations. #9456 by Nicholas Nadeau <nnadeau>.
- Fixed a bug in naive_bayes.GaussianNB which incorrectly raised an error for a prior list which summed to 1. #10005 by Gaurav Dhingra <gxyd>.
- Fixed a bug in linear_model.LogisticRegression where, when using the parameter multi_class='multinomial', the predict_proba method returned incorrect probabilities in the case of binary outcomes. #9939 by Roger Westover <rwolst>.
- Fixed linear_model.OrthogonalMatchingPursuit, which was broken when setting normalize=False. #10071 by Alexandre Gramfort.
- Fixed a bug in linear_model.ARDRegression which caused incorrectly updated estimates for the standard deviation and the coefficients. #10153 by Jörg Döpfert <jdoepfert>.
- Fixed a bug when fitting ensemble.GradientBoostingClassifier or ensemble.GradientBoostingRegressor with warm_start=True, which previously raised a segmentation fault because a CSC matrix was not converted into the CSR format expected by decision_function. Similarly, Fortran-ordered arrays are converted to C-ordered arrays in the dense case. #9991 by Guillaume Lemaitre <glemaitre>.
- Fixed a bug in neighbors.NearestNeighbors where fitting a NearestNeighbors model failed when a) the distance metric used is a callable and b) the input to the NearestNeighbors model is sparse. #9579 by Thomas Kober <tttthomasssss>.
- Fixed a bug in svm.SVC where, when the kernel argument is unicode in Python 2, the predict_proba method raised an unexpected TypeError given dense inputs. #10412 by Jiongyan Zhang <qmick>.
Decomposition, manifold learning and clustering
- Fix for uninformative error in decomposition.IncrementalPCA: now an error is raised if the number of components is larger than the chosen batch size. The n_components=None case was adapted accordingly. #6452 by Wally Gauze <wallygauze>.
- Fixed a bug where the partial_fit method of decomposition.IncrementalPCA used integer division instead of float division on Python 2 versions. #9492 by James Bourbeau <jrbourbeau>.
- Fixed a bug where the fit method of cluster.AffinityPropagation stored cluster centers as a 3d array instead of a 2d array in case of non-convergence. For the same class, fixed undefined and arbitrary behavior in case of training data where all samples had equal similarity. #9612 by Jonatan Samoocha <jsamoocha>.
- In decomposition.PCA, selecting an n_components parameter greater than the number of samples now raises an error. Similarly, the n_components=None case now selects the minimum of n_samples and n_features. #8484 by Wally Gauze <wallygauze>.
- Fixed a bug in datasets.fetch_kddcup99, where data were not properly shuffled. #9731 by Nicolas Goix.
- Fixed a bug in decomposition.PCA where users got an unexpected error with large datasets when n_components='mle' on Python 3 versions. #9886 by Hanmin Qin <qinhanmin2014>.
- Fixed a bug when setting parameters on a meta-estimator, involving both a wrapped estimator and its parameter. #9999 by Marcus Voss <marcus-voss> and Joel Nothman.
- k_means now gives a warning if the number of distinct clusters found is smaller than n_clusters. This may occur when the number of distinct points in the data set is actually smaller than the number of clusters one is looking for. #10059 by Christian Braune <christianbraune79>.
- Fixed a bug in datasets.make_circles, where no odd number of data points could be generated. #10037 by Christian Braune <christianbraune79>.
- Fixed a bug in cluster.spectral_clustering where the normalization of the spectrum was using a division instead of a multiplication. #8129 by Jan Margeta <jmargeta>, Guillaume Lemaitre <glemaitre>, and Devansh D. <devanshdalal>.
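The make_circles fix in the list above can be checked with a short sketch. The sample count of 11 is an illustrative assumption chosen to be odd.

```python
from sklearn.datasets import make_circles

# An odd total number of points, split between the outer and inner circle.
X, y = make_circles(n_samples=11, random_state=0)

print(X.shape, y.shape)  # (11, 2) (11,)
```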
Metrics
- Fixed a bug in metrics.precision_recall_fscore_support when a truncated range(n_labels) is passed as the value for labels. #10377 by Gaurav Dhingra <gxyd>.
- Fixed a bug due to floating point error in metrics.roc_auc_score with non-integer sample weights. #9786 by Hanmin Qin <qinhanmin2014>.
- Fixed a bug where metrics.roc_curve sometimes started on the y-axis instead of at (0, 0), which is inconsistent with the documentation and other implementations. Note that this does not influence the result from metrics.roc_auc_score. #10093 by alexryndin <alexryndin> and Hanmin Qin <qinhanmin2014>.
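A brief sketch of the roc_curve behavior described above: the returned curve begins at the origin and ends at (1, 1). The labels and scores are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# First point of the curve is (0, 0); last point is (1, 1).
print(fpr[0], tpr[0])
print(fpr[-1], tpr[-1])
```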
Neighbors
- Fixed a bug so that predict in neighbors.RadiusNeighborsRegressor can handle an empty neighbor set when using non-uniform weights. It also raises a new warning when no neighbors are found for samples. #9655 by Andreas Bjerre-Nielsen <abjer>.
Feature Extraction
- Fixed a bug in feature_extraction.image.extract_patches_2d which would throw an exception if max_patches was greater than or equal to the number of all possible patches, rather than simply returning the number of possible patches. #10100 by Varun Agrawal <varunagrawal>.
- Fixed a bug in feature_extraction.text.CountVectorizer, feature_extraction.text.TfidfVectorizer and feature_extraction.text.HashingVectorizer to support the 64 bit sparse array indexing necessary to process large datasets with more than 2·10⁹ tokens (words or n-grams). #9147 by Claes-Fredrik Mannby <mannby> and Roman Yurchak.
Linear, kernelized and related models
- Deprecate random_state parameter in svm.OneClassSVM as the underlying implementation is not random. #9497 by Albert Thomas <albertcthomas>.
- Deprecate positive=True option in linear_model.Lars as the underlying implementation is broken. Use linear_model.Lasso instead. #9837 by Alexandre Gramfort.
Metrics
- Deprecate reorder parameter in metrics.auc as it's no longer required for metrics.roc_auc_score. Moreover, using reorder=True can hide bugs due to floating point error in the input. #9851 by Hanmin Qin <qinhanmin2014>.
Cluster
- Deprecate the unused pooling_func parameter in cluster.AgglomerativeClustering. #9875 by Kumar Ashutosh <thechargedneutron>.
- Allow tests in estimator_checks.check_estimator to test functions that accept pairwise data. #9701 by Kyle Johnson <gkjohns>.
- Allow estimator_checks.check_estimator to check that there are no private settings apart from parameters during estimator initialization. #9378 by Herilalaina Rakotoarison <herilalaina>.
- Add test estimator_checks.check_methods_subset_invariance to check that estimator methods are invariant if applied to a data subset. #10420 by Jonathan Ohayon <Johayon>.