Utilities for Developers

Scikit-learn contains a number of utilities to help with development. All of the functions and classes described below are located in the module :mod:`sklearn.utils` and are grouped by category.

Warning

These utilities are meant to be used internally within the scikit-learn package. They are not guaranteed to be stable between versions of scikit-learn. Backports, in particular, will be removed as the scikit-learn dependencies evolve.

.. currentmodule:: sklearn.utils

Validation Tools

These are tools used to check and validate input. When you write a function which accepts arrays, matrices, or sparse matrices as arguments, the following should be used when applicable.

  • :func:`assert_all_finite`: raise an error if the array contains NaNs or Infs.
  • :func:`as_float_array`: convert input to an array of floats. If a sparse matrix is passed, a sparse matrix will be returned.
  • :func:`check_array`: convert input to a 2d array and raise an error on sparse matrices. Allowed sparse matrix formats can be given optionally, and 1d or nd arrays can be allowed as well. Calls :func:`assert_all_finite` by default.
  • :func:`check_X_y`: check that X and y have consistent length, call check_array on X, and call column_or_1d on y. For multilabel classification or multitarget regression, specify multi_output=True, in which case check_array will be called on y.
  • :func:`indexable`: check that all input arrays have consistent length and can be sliced or indexed using safe_index. This is used to validate input for cross-validation.
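
As a rough illustration of the validation helpers above, the sketch below runs check_array and check_X_y on a small made-up dataset. The data and variable names are assumptions chosen for the example, and exact error messages may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.utils import check_array, check_X_y

# Toy data, made up for illustration only.
X = [[1, 2], [3, 4], [5, 6]]
y = [0, 1, 0]

# check_array converts the nested list to a 2d NumPy array.
X_checked = check_array(X)

# check_X_y validates both inputs and checks that their lengths agree.
X_valid, y_valid = check_X_y(X, y)

# Non-finite values are rejected by default with a ValueError.
error_message = None
try:
    check_array([[1.0, np.nan]])
except ValueError as exc:
    error_message = str(exc)
```

Calling the validators once at the top of a function means the rest of the code can assume well-formed array input.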

If your code relies on a random number generator, it should never call functions like numpy.random.random or numpy.random.normal directly, because they draw from NumPy's global random state. This can lead to repeatability issues in unit tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function. The function :func:`check_random_state`, below, can then be used to create a random number generator object.

  • :func:`check_random_state`: create a np.random.RandomState object from a parameter random_state.
    • If random_state is None or np.random, then a randomly-initialized RandomState object is returned.
    • If random_state is an integer, then it is used to seed a new RandomState object.
    • If random_state is a RandomState object, then it is passed through.

For example:

>>> from sklearn.utils import check_random_state
>>> random_state = 0
>>> random_state = check_random_state(random_state)
>>> random_state.rand(4)
array([ 0.5488135 ,  0.71518937,  0.60276338,  0.54488318])
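
A common way to apply this is to accept a random_state argument and resolve it once at the start of the function. The sketch below assumes a hypothetical helper named sample_noise, which is not part of scikit-learn and exists only to show the pattern.

```python
import numpy as np
from sklearn.utils import check_random_state


def sample_noise(n_samples, random_state=None):
    """Hypothetical helper: draw Gaussian noise reproducibly."""
    # Resolve None, an int seed, or a RandomState into a RandomState.
    rng = check_random_state(random_state)
    return rng.normal(size=n_samples)


# The same integer seed yields the same draws.
a = sample_noise(3, random_state=42)
b = sample_noise(3, random_state=42)

# An existing RandomState instance is passed through unchanged.
rng = np.random.RandomState(0)
same_object = check_random_state(rng) is rng
```

Because the generator is created from the argument, a caller can fix the seed in tests without touching global state.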

Efficient Linear Algebra & Array Operations

Efficient Random Sampling

Efficient Routines for Sparse Matrices

The sklearn.utils.sparsefuncs Cython module hosts compiled extensions to efficiently process scipy.sparse data.
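
As one possible illustration, the sketch below uses mean_variance_axis from this module to compute per-column statistics of a CSR matrix without densifying it. The choice of function and the toy matrix are assumptions made for the example; the text above does not single out any particular routine.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

# Small float CSR matrix, made up for illustration.
X = sp.csr_matrix(np.array([[1.0, 0.0], [3.0, 4.0]]))

# Column-wise means and variances, computed on the sparse data directly.
means, variances = mean_variance_axis(X, axis=0)
```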

Graph Routines

Backports

ARPACK

  • :func:`arpack.eigs` (backported from scipy.sparse.linalg.eigs in scipy 0.10) Sparse non-symmetric eigenvalue decomposition using the Arnoldi method. A limited version of eigs is available in earlier scipy versions.
  • :func:`arpack.eigsh` (backported from scipy.sparse.linalg.eigsh in scipy 0.10) Sparse symmetric (or Hermitian) eigenvalue decomposition using the Lanczos method. A limited version of eigsh is available in earlier scipy versions.
  • :func:`arpack.svds` (backported from scipy.sparse.linalg.svds in scipy 0.10) Sparse truncated singular value decomposition, built on the ARPACK eigensolvers. A limited version of svds is available in earlier scipy versions.
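
The sketch below shows what these routines compute by calling the upstream scipy.sparse.linalg functions directly, assuming a SciPy recent enough to ship them itself. The diagonal test matrix is made up so that the expected eigenvalues and singular values are easy to read off.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh, svds

# Sparse diagonal matrix with entries 1..6 (illustrative only).
A = sp.diags(np.arange(1.0, 7.0), 0, format="csr")

# The two largest-magnitude eigenvalues of the symmetric matrix.
eigenvalues, eigenvectors = eigsh(A, k=2, which="LM")

# The two largest singular values and the matching singular vectors.
u, singular_values, vt = svds(A, k=2)
```

For this matrix both calls should recover the values 5 and 6, since a diagonal matrix with positive entries has identical eigenvalues and singular values.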

Benchmarking

Testing Functions

Multiclass and multilabel utility functions

Helper Functions

  • :func:`gen_even_slices`: generator that yields n_packs evenly sized slices going up to n. Used in sklearn.decomposition.dict_learning and sklearn.cluster.k_means.
  • :func:`safe_mask`: Helper function to convert a mask to the format expected by the numpy array or scipy sparse matrix on which to use it (sparse matrices support integer indices only while numpy arrays support both boolean masks and integer indices).
  • :func:`safe_sqr`: Helper function for unified squaring (**2) of array-likes, matrices and sparse matrices.
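
A short sketch of the three helpers above, using small made-up arrays. The variable names are illustrative assumptions, and the imports assume these helpers are exposed from sklearn.utils in the installed version.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils import gen_even_slices, safe_mask, safe_sqr

# Split the range 0..10 into 3 slices of near-equal size.
slices = list(gen_even_slices(10, 3))

# The same boolean mask can select rows of dense and sparse input.
X = np.arange(6).reshape(3, 2)
mask = np.array([True, False, True])
dense_rows = X[safe_mask(X, mask)]
X_sparse = sp.csr_matrix(X)
sparse_rows = X_sparse[safe_mask(X_sparse, mask)]

# Element-wise squaring of an array-like.
squared = safe_sqr(np.array([1, 2, 3]))
```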

Hash Functions

  • :func:`murmurhash3_32` provides a Python wrapper for the MurmurHash3_x86_32 C++ non-cryptographic hash function. This hash function is suitable for implementing lookup tables, Bloom filters, Count-Min Sketch, feature hashing and implicitly defined sparse random projections:

    >>> from sklearn.utils import murmurhash3_32
    >>> murmurhash3_32("some feature", seed=0) == -384616559
    True
    
    >>> murmurhash3_32("some feature", seed=0, positive=True) == 3910350737
    True
    

    The sklearn.utils.murmurhash module can also be "cimported" from other Cython modules so as to benefit from the high performance of MurmurHash while skipping the overhead of the Python interpreter.
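
To make the feature hashing use case concrete, here is a minimal sketch that maps string tokens to a fixed number of buckets. The helper hash_bucket, the bucket count, and the token list are hypothetical and chosen only for illustration; this is not how scikit-learn's own feature hashing is implemented.

```python
from sklearn.utils import murmurhash3_32

N_BUCKETS = 16  # hypothetical table size


def hash_bucket(token, n_buckets=N_BUCKETS, seed=0):
    """Hypothetical helper: map a string token to a bucket index."""
    # positive=True returns an unsigned value, so the modulo is non-negative.
    return murmurhash3_32(token, seed=seed, positive=True) % n_buckets


# Count token occurrences in a fixed-size table instead of a dictionary.
tokens = ["some feature", "another feature", "some feature"]
counts = [0] * N_BUCKETS
for token in tokens:
    counts[hash_bucket(token)] += 1
```

Identical tokens always land in the same bucket, while distinct tokens may collide; that is the usual trade-off of the hashing trick.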

Warnings and Exceptions