Scikit-learn contains a number of utilities to help with development. These are located in `sklearn.utils`, and include tools in a number of categories. All the following functions and classes are in the module `sklearn.utils`.

Validation Tools

These are tools used to check and validate input. When you write a function which accepts arrays, matrices, or sparse matrices as arguments, the following should be used when applicable.
- `assert_all_finite`: Throw an error if an array contains NaNs or Infs.
- `safe_asarray`: Convert input to an array or sparse matrix. Equivalent to `np.asarray`, but sparse matrices are passed through.
- `as_float_array`: Convert input to an array of floats. If a sparse matrix is passed, a sparse matrix will be returned.
- `array2d`: Equivalent to `np.atleast_2d`, but the `order` and `dtype` of the input are maintained.
- `atleast2d_or_csr`: Equivalent to `array2d`, but if a sparse matrix is passed, it will be converted to CSR format. Also calls `assert_all_finite`.
- `check_arrays`: Check that all input arrays have consistent first dimensions. This works for an arbitrary number of arrays.
- `warn_if_not_float`: Warn if the input is not of floating-point type. The input `X` is assumed to have an `X.dtype` attribute.
If your code relies on a random number generator, it should never use functions like `numpy.random.random` or `numpy.random.normal`, which draw from the global state and can lead to repeatability issues in unit tests. Instead, a `numpy.random.RandomState` object should be used, built from a `random_state` argument passed to the class or function. The function `check_random_state`, below, can then be used to create such a random number generator object.
- `check_random_state`: Create an `np.random.RandomState` object from a parameter `random_state`.
  - If `random_state` is `None` or `np.random`, then a randomly-initialized `RandomState` object is returned.
  - If `random_state` is an integer, then it is used to seed a new `RandomState` object.
  - If `random_state` is a `RandomState` object, then it is passed through.
For example:
>>> from sklearn.utils import check_random_state
>>> random_state = 0
>>> random_state = check_random_state(random_state)
>>> random_state.rand(4)
array([ 0.5488135 , 0.71518937, 0.60276338, 0.54488318])
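The dispatch described above can be sketched in plain numpy. This is a simplified re-implementation for illustration only; the real function lives in `sklearn.utils`:

```python
import numbers

import numpy as np


def check_random_state(seed):
    # Simplified sketch: dispatch on the type of `seed` exactly as
    # described in the list above.
    if seed is None or seed is np.random:
        return np.random.mtrand._rand  # the global RandomState instance
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError("%r cannot be used to seed a RandomState" % seed)
```

Writing estimators this way means a user can pass `None`, an integer seed, or a shared `RandomState` instance, and all three behave consistently.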
Efficient Linear Algebra & Array Operations

- `extmath.randomized_range_finder`: Construct an orthonormal matrix whose range approximates the range of the input. This is used in `extmath.randomized_svd`, below.
- `extmath.randomized_svd`: Compute the k-truncated randomized SVD. This algorithm finds the exact truncated singular value decomposition using randomization to speed up the computations. It is particularly fast on large matrices from which you wish to extract only a small number of components.
- `arrayfuncs.cholesky_delete`: (used in `sklearn.linear_model.least_angle.lars_path`) Remove an item from a Cholesky factorization.
- `arrayfuncs.min_pos`: (used in `sklearn.linear_model.least_angle`) Find the minimum of the positive values within an array.
- `extmath.norm`: Compute the Euclidean (L2) vector norm by directly calling the BLAS `nrm2` function. This is more stable than `scipy.linalg.norm`. See Fabian's blog post for a discussion.
- `extmath.fast_logdet`: Efficiently compute the log of the determinant of a matrix.
- `extmath.density`: Efficiently compute the density of a sparse vector.
- `extmath.safe_sparse_dot`: Dot product which correctly handles `scipy.sparse` inputs. If the inputs are dense, it is equivalent to `numpy.dot`.
- `extmath.logsumexp`: Compute the sum of X assuming X is in the log domain. This is equivalent to calling `np.log(np.sum(np.exp(X)))`, but is robust to overflow/underflow errors. Note that there is similar functionality in `np.logaddexp.reduce`, but because of its pairwise nature, that routine is slower for large arrays. Scipy has a similar routine in `scipy.misc.logsumexp` (in scipy versions < 0.10, it is found in `scipy.maxentropy.logsumexp`), but the scipy version does not accept an `axis` keyword.
- `extmath.weighted_mode`: An extension of `scipy.stats.mode` which allows each item to have a real-valued weight.
- `resample`: Resample arrays or sparse matrices in a consistent way. Used in `shuffle`, below.
- `shuffle`: Shuffle arrays or sparse matrices in a consistent way. Used in `sklearn.cluster.k_means`.
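The overflow problem that a log-sum-exp routine solves can be shown with a minimal numpy sketch of the standard max-shift trick (illustrative only, not sklearn's implementation):

```python
import numpy as np


def logsumexp(x):
    # Stable log(sum(exp(x))): subtract the max before exponentiating,
    # so np.exp never sees large positive arguments.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))


x = np.array([1000.0, 1000.0])
# The naive formula np.log(np.sum(np.exp(x))) overflows to inf here,
# while the shifted version returns the exact answer 1000 + log(2).
stable = logsumexp(x)
```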
Efficient Random Sampling

- `random.sample_without_replacement`: Implements efficient algorithms for sampling `n_samples` integers from a population of size `n_population` without replacement.
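A naive pure-numpy sketch of the same semantics (sklearn's Cython version selects among faster algorithms; this permutation-based approach is only illustrative):

```python
import numpy as np


def sample_without_replacement(n_population, n_samples, random_state=None):
    # Illustrative sketch: permute the whole population and keep the
    # first n_samples indices, guaranteeing no duplicates.
    rng = np.random.RandomState(random_state)
    return rng.permutation(n_population)[:n_samples]


idx = sample_without_replacement(10, 4, random_state=0)
```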
Efficient Routines for Sparse Matrices

The `sklearn.utils.sparsefuncs` cython module hosts compiled extensions to efficiently process `scipy.sparse` data.

- `sparsefuncs.mean_variance_axis0`: Compute the means and variances along axis 0 of a CSR matrix. Used for normalizing the tolerance stopping criterion in `sklearn.cluster.k_means_.KMeans`.
- `sparsefuncs.inplace_csr_row_normalize_l1` and `sparsefuncs.inplace_csr_row_normalize_l2`: Can be used to normalize individual sparse samples to unit L1 or L2 norm, as done in `sklearn.preprocessing.Normalizer`.
- `sparsefuncs.inplace_csr_column_scale`: Can be used to multiply the columns of a CSR matrix by a constant scale (one scale per column). Used for scaling features to unit standard deviation in `sklearn.preprocessing.StandardScaler`.
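The in-place row normalization can be sketched by working directly on the CSR `data` and `indptr` arrays. This is an illustrative re-implementation, not the compiled sklearn code:

```python
import numpy as np
import scipy.sparse as sp


def inplace_csr_row_normalize_l2(X):
    # For each row, X.data[indptr[i]:indptr[i+1]] holds the nonzero
    # values; divide them by the row's L2 norm in place.
    for i in range(X.shape[0]):
        start, end = X.indptr[i], X.indptr[i + 1]
        norm = np.sqrt(np.sum(X.data[start:end] ** 2))
        if norm != 0.0:
            X.data[start:end] /= norm


X = sp.csr_matrix(np.array([[3.0, 4.0], [0.0, 0.0]]))
inplace_csr_row_normalize_l2(X)  # first row becomes [0.6, 0.8]
```

Operating on the raw CSR buffers avoids densifying the matrix, which is the whole point of these compiled helpers.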
Graph Routines

- `graph.single_source_shortest_path_length`: (not currently used in scikit-learn) Return the shortest path from a single source to all connected nodes on a graph. Code is adapted from networkx. If this is ever needed again, it would be far faster to use a single iteration of Dijkstra's algorithm from `graph_shortest_path`.
- `graph.graph_laplacian`: (used in `sklearn.cluster.spectral.spectral_embedding`) Return the Laplacian of a given graph. There is specialized code for both dense and sparse connectivity matrices.
- `graph_shortest_path.graph_shortest_path`: (used in `sklearn.manifold.Isomap`) Return the shortest path between all pairs of connected points on a directed or undirected graph. Both the Floyd-Warshall algorithm and Dijkstra's algorithm are available. The algorithm is most efficient when the connectivity matrix is a `scipy.sparse.csr_matrix`.
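For illustration, an equivalent all-pairs shortest-path computation is available in recent scipy versions as `scipy.sparse.csgraph.shortest_path`; a minimal example on a three-node weighted graph:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

# Undirected graph with edges 0-1 (weight 1.0) and 1-2 (weight 2.0),
# stored as CSR as recommended above.
graph = csr_matrix(np.array([[0.0, 1.0, 0.0],
                             [1.0, 0.0, 2.0],
                             [0.0, 2.0, 0.0]]))

# method="D" selects Dijkstra's algorithm; the result is a dense
# matrix of pairwise distances, so dist[0, 2] == 1.0 + 2.0 == 3.0.
dist = shortest_path(graph, method="D", directed=False)
```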
Backports

- `fixes.Counter`: (partial backport of `collections.Counter` from Python 2.7) Used in `sklearn.feature_extraction.text`.
- `fixes.unique`: (backport of `np.unique` from numpy 1.4) Find the unique entries in an array. In numpy versions < 1.4, `np.unique` is less flexible. Used in `sklearn.cross_validation`.
- `fixes.copysign`: (backport of `np.copysign` from numpy 1.4) Change the sign of `x1` to that of `x2`, element-wise.
- `fixes.in1d`: (backport of `np.in1d` from numpy 1.4) Test whether each element of an array is in a second array. Used in `sklearn.datasets.twenty_newsgroups` and `sklearn.feature_extraction.image`.
- `fixes.savemat`: (backport of `scipy.io.savemat` from scipy 0.7.2) Save an array in MATLAB format. In earlier versions, the keyword `oned_as` is not available.
- `fixes.count_nonzero`: (backport of `np.count_nonzero` from numpy 1.6) Count the nonzero elements of a matrix. Used in tests of `sklearn.linear_model`.
- `arrayfuncs.solve_triangular`: (back-ported from scipy 0.9) Used in `sklearn.linear_model.omp`; independent back-ports exist in `sklearn.mixture.gmm` and `sklearn.gaussian_process`.
- `sparsetools.connected_components`: (backported from `scipy.sparse.connected_components` in scipy 0.12) Used in `sklearn.cluster.hierarchical`, as well as in tests for `sklearn.feature_extraction`.
ARPACK

- `arpack.eigs`: (backported from `scipy.sparse.linalg.eigs` in scipy 0.10) Sparse non-symmetric eigenvalue decomposition using the Arnoldi method. A limited version of `eigs` is available in earlier scipy versions.
- `arpack.eigsh`: (backported from `scipy.sparse.linalg.eigsh` in scipy 0.10) Sparse symmetric (Hermitian) eigenvalue decomposition using the Lanczos method. A limited version of `eigsh` is available in earlier scipy versions.
- `arpack.svds`: (backported from `scipy.sparse.linalg.svds` in scipy 0.10) Compute the k-truncated sparse singular value decomposition using the Arnoldi method. A limited version of `svds` is available in earlier scipy versions.
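As a usage sketch of the underlying routine (scipy's own `eigsh` rather than the sklearn backport), the two largest eigenvalues of a sparse diagonal matrix:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

# Sparse 100x100 diagonal matrix with known eigenvalues 1..100.
A = diags(np.arange(1.0, 101.0))

# which="LA" asks ARPACK for the k largest (algebraic) eigenvalues,
# without ever forming the dense matrix.
vals, vecs = eigsh(A, k=2, which="LA")  # vals ~ [99.0, 100.0]
```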
Benchmarking

- `bench.total_seconds`: (back-ported from `timedelta.total_seconds` in Python 2.7) Used in `benchmarks/bench_glm.py`.
Testing Functions

- `testing.assert_in`, `testing.assert_not_in`: Assertions for container membership. Designed for forward compatibility with Nose 1.0.
- `testing.assert_raise_message`: Assertion for checking the message of a raised error.
- `testing.mock_mldata_urlopen`: Mocks the urlopen function to fake requests to mldata.org. Used in tests of `sklearn.datasets`.
- `testing.all_estimators`: Returns a list of all estimators in scikit-learn, to test for consistent behavior and interfaces.
Multiclass and Multilabel Utility Functions

- `multiclass.is_multilabel`: Helper function to check whether the task is a multi-label classification one.
- `multiclass.is_label_indicator_matrix`: Helper function to check whether a classification output is in label indicator matrix format.
- `multiclass.unique_labels`: Helper function to extract an ordered array of unique labels from a list of labels.
Helper Functions

- `gen_even_slices`: Generator to create `n_packs` slices going up to `n`. Used in `sklearn.decomposition.dict_learning` and `sklearn.cluster.k_means`.
- `arraybuilder.ArrayBuilder`: Helper class to incrementally build a 1-d `numpy.ndarray`. Currently used in `sklearn.datasets._svmlight_format.pyx`.
- `safe_mask`: Helper function to convert a mask to the format expected by the numpy array or scipy sparse matrix on which it is used (sparse matrices support integer indices only, while numpy arrays support both boolean masks and integer indices).
- `safe_sqr`: Helper function for unified squaring (`**2`) of array-likes, matrices and sparse matrices.
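The behavior of `gen_even_slices` can be sketched in pure Python (an illustrative re-implementation: split `range(n)` into `n_packs` contiguous slices whose lengths differ by at most one):

```python
def gen_even_slices(n, n_packs):
    # Distribute the remainder n % n_packs over the first packs so
    # every slice has length n // n_packs or n // n_packs + 1.
    start = 0
    for pack_num in range(n_packs):
        this_n = n // n_packs
        if pack_num < n % n_packs:
            this_n += 1
        if this_n > 0:
            yield slice(start, start + this_n)
            start += this_n


slices = list(gen_even_slices(10, 3))  # [slice(0, 4), slice(4, 7), slice(7, 10)]
```

Such slices are handy for chunking a data matrix into nearly equal batches for parallel or incremental processing.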
Hash Functions

`murmurhash3_32` provides a Python wrapper for the `MurmurHash3_x86_32` C++ non-cryptographic hash function. This hash function is suitable for implementing lookup tables, Bloom filters, Count-Min Sketch, feature hashing and implicitly defined sparse random projections:

>>> from sklearn.utils import murmurhash3_32
>>> murmurhash3_32("some feature", seed=0) == -384616559
True
>>> murmurhash3_32("some feature", seed=0, positive=True) == 3910350737
True

The `sklearn.utils.murmurhash` module can also be "cimported" from other cython modules so as to benefit from the high performance of MurmurHash while skipping the overhead of the Python interpreter.
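As an illustration of the feature-hashing use case, here is a toy hashing vectorizer built on `murmurhash3_32`; the helper `hash_features` and its defaults are hypothetical, for illustration only:

```python
import numpy as np
from sklearn.utils import murmurhash3_32


def hash_features(tokens, n_features=16, seed=0):
    # Toy hashing trick: each token increments the bucket selected by
    # its MurmurHash3 value modulo n_features, so the vector size is
    # fixed regardless of the vocabulary.
    x = np.zeros(n_features)
    for tok in tokens:
        x[murmurhash3_32(tok, seed=seed, positive=True) % n_features] += 1
    return x


v = hash_features(["some feature", "other feature", "some feature"])
```

A real hashing vectorizer would also use the hash sign to reduce collision bias, but this sketch shows why a fast, seedable, deterministic hash is the key ingredient.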
Warnings and Exceptions

- `deprecated`: Decorator to mark a function or class as deprecated.
- `ConvergenceWarning`: Custom warning to catch convergence problems. Used in `sklearn.covariance.graph_lasso`.
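A minimal sketch of such a deprecation decorator using only the standard library (the real `sklearn.utils.deprecated` also supports decorating classes and rewrites docstrings; treat this as an approximation):

```python
import functools
import warnings


def deprecated(extra=""):
    # Sketch: wrap a function so that calling it emits a
    # DeprecationWarning before running the original body.
    def decorate(func):
        message = "Function %s is deprecated" % func.__name__
        if extra:
            message += "; %s" % extra

        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            warnings.warn(message, category=DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapped

    return decorate


@deprecated("use new_power instead")
def old_power(x):
    return x ** 2
```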