
Commit

Merge branch 'master' into multiple_grid_search
Conflicts:
	sklearn/grid_search.py
	sklearn/learning_curve.py
mblondel committed Feb 7, 2014
2 parents eaa3aeb + 5319994 commit 5d8570b
Showing 115 changed files with 14,454 additions and 11,264 deletions.
18 changes: 14 additions & 4 deletions .travis.yml
@@ -3,12 +3,22 @@ env:
- COVERAGE=--with-coverage
python:
- "2.7"
- "2.6"
- "3.3"
virtualenv:
system_site_packages: true
before_install:
- sudo apt-get update -qq
- sudo apt-get install -qq python-scipy python-nose
- sudo apt-get install python-pip
- if [[ $TRAVIS_PYTHON_VERSION != '2.7' ]]; then wget http://repo.continuum.io/miniconda/Miniconda-2.2.2-Linux-x86_64.sh -O miniconda.sh ; fi
- if [[ $TRAVIS_PYTHON_VERSION != '2.7' ]]; then chmod +x miniconda.sh ; fi
- if [[ $TRAVIS_PYTHON_VERSION != '2.7' ]]; then ./miniconda.sh -b ; fi
- if [[ $TRAVIS_PYTHON_VERSION != '2.7' ]]; then export PATH=/home/travis/anaconda/bin:$PATH ; fi
- if [[ $TRAVIS_PYTHON_VERSION != '2.7' ]]; then conda update --yes conda ; fi
- if [[ $TRAVIS_PYTHON_VERSION != '2.7' ]]; then conda create -n testenv --yes pip python=$TRAVIS_PYTHON_VERSION ; fi
- if [[ $TRAVIS_PYTHON_VERSION != '2.7' ]]; then source activate testenv ; fi
- if [[ $TRAVIS_PYTHON_VERSION != '2.7' ]]; then conda install --yes numpy scipy nose ; fi
- if [[ $TRAVIS_PYTHON_VERSION == '2.7' ]]; then sudo apt-get update -qq ; fi
- if [[ $TRAVIS_PYTHON_VERSION == '2.7' ]]; then sudo apt-get install -qq python-scipy python-nose python-pip ; fi
install:
- python setup.py build_ext --inplace
- if [ "${COVERAGE}" == "--with-coverage" ]; then sudo pip install coverage; fi
4 changes: 4 additions & 0 deletions doc/Makefile
@@ -37,6 +37,10 @@ clean:
-rm -rf modules/generated/*

html:
# These two lines make the build a bit more lengthy, and the
# embedding of images more robust
rm -rf $(BUILDDIR)/html/_images
#rm -rf _build/doctrees/
$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html/stable
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html/stable"
1 change: 1 addition & 0 deletions doc/model_selection.rst
@@ -11,3 +11,4 @@ Model selection and evaluation
modules/grid_search
modules/pipeline
modules/model_evaluation
modules/learning_curve
4 changes: 4 additions & 0 deletions doc/modules/classes.rst
@@ -624,6 +624,7 @@ From text
:template: function.rst

learning_curve.learning_curve
learning_curve.validation_curve

.. _linear_model_ref:

@@ -657,6 +658,8 @@ From text
linear_model.LogisticRegression
linear_model.MultiTaskLasso
linear_model.MultiTaskElasticNet
linear_model.MultiTaskLassoCV
linear_model.MultiTaskElasticNetCV
linear_model.OrthogonalMatchingPursuit
linear_model.OrthogonalMatchingPursuitCV
linear_model.PassiveAggressiveClassifier
@@ -1057,6 +1060,7 @@ Pairwise metrics
preprocessing.Normalizer
preprocessing.OneHotEncoder
preprocessing.StandardScaler
preprocessing.PolynomialFeatures

.. autosummary::
:toctree: generated/
49 changes: 44 additions & 5 deletions doc/modules/clustering.rst
@@ -309,12 +309,44 @@ of each iterates until convergence.

Mean Shift
==========
:class:`MeanShift` clustering aims to discover *blobs* in a smooth density of
samples. It is a centroid-based algorithm, which works by updating candidates
for centroids to be the mean of the points within a given region. These
candidates are then filtered in a post-processing stage to eliminate
near-duplicates, forming the final set of centroids.

Given a candidate centroid :math:`x_i` for iteration :math:`t`, the candidate
is updated according to the following equation:

.. math::

    x_i^{t+1} = x_i^t + m(x_i^t)

where :math:`N(x_i)` is the neighborhood of samples within a given distance
around :math:`x_i` and :math:`m` is the *mean shift* vector, computed for each
centroid, that points towards a region of the maximum increase in the density
of points. It is computed using the following equation, effectively updating
a centroid to be the mean of the samples within its neighborhood:

.. math::

    m(x_i) = \frac{\sum_{x_j \in N(x_i)}K(x_j - x_i)x_j}{\sum_{x_j \in N(x_i)}K(x_j - x_i)}

Instead of taking the number of clusters as a parameter, the algorithm sets it
automatically; it relies instead on a parameter ``bandwidth``, which dictates
the size of the region to search through. This parameter can be set manually,
but it can also be estimated using the provided ``estimate_bandwidth`` function,
which is called if the bandwidth is not set.
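
With a flat kernel (:math:`K` equal to 1 inside the neighborhood and 0
outside), the update simply averages the neighbors. A small NumPy sketch of
one such step, on made-up one-dimensional data::

    import numpy as np

    X = np.array([[1.0], [1.2], [1.4], [5.0]])  # made-up samples
    x = np.array([1.1])                         # current candidate centroid
    bandwidth = 0.5

    # Flat kernel: the neighbors are the samples within `bandwidth` of x.
    neighbors = X[np.linalg.norm(X - x, axis=1) < bandwidth]
    x_new = neighbors.mean(axis=0)              # x + m(x): the neighborhood mean
    print(x_new)                                # [1.2]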

The algorithm is not highly scalable, as it requires multiple nearest-neighbor
searches during execution. It is guaranteed to converge, but it stops
iterating once the change in centroids is small.

Labelling a new sample is performed by finding the nearest centroid for that
sample.

:class:`MeanShift` clusters data by estimating *blobs* in a smooth
density of points matrix. This algorithm automatically sets its numbers
of cluster. It will have difficulties scaling to thousands of samples.
The utility function :func:`estimate_bandwidth` can be used to guess
the optimal bandwidth for :class:`MeanShift` from the data.
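
A minimal usage sketch (the blob data and parameter values here are made up
for illustration)::

    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth
    from sklearn.datasets import make_blobs

    # Hypothetical toy data: three Gaussian blobs.
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

    # Estimate the bandwidth from the data, then cluster.
    bandwidth = estimate_bandwidth(X, quantile=0.2)
    ms = MeanShift(bandwidth=bandwidth)
    ms.fit(X)

    print(ms.cluster_centers_)    # one row per discovered centroid
    print(np.unique(ms.labels_))  # cluster ids; their number is set automatically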

.. figure:: ../auto_examples/cluster/images/plot_mean_shift_1.png
:target: ../auto_examples/cluster/plot_mean_shift.html
@@ -327,6 +359,13 @@ the optimal bandwidth for :class:`MeanShift` from the data.
* :ref:`example_cluster_plot_mean_shift.py`: Mean Shift clustering
on a synthetic 2D dataset with 3 classes.

.. topic:: References:

* `"Mean shift: A robust approach toward feature space analysis."
<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.76.8968&rep=rep1&type=pdf>`_
D. Comaniciu, & P. Meer *IEEE Transactions on Pattern Analysis and Machine Intelligence* (2002)


.. _spectral_clustering:

Spectral clustering
6 changes: 3 additions & 3 deletions doc/modules/ensemble.rst
@@ -313,7 +313,7 @@ AdaBoost
========

The module :mod:`sklearn.ensemble` includes the popular boosting algorithm
AdaBoost, introduced in 1995 by Freud and Schapire [FS1995]_.
AdaBoost, introduced in 1995 by Freund and Schapire [FS1995]_.

The core principle of AdaBoost is to fit a sequence of weak learners (i.e.,
models that are only slightly better than random guessing, such as small
@@ -388,8 +388,8 @@ decision trees).
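
A minimal sketch of the API on synthetic data (the dataset and parameter
values here are illustrative only)::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    # Made-up toy problem.
    X, y = make_classification(n_samples=200, random_state=0)

    # Each boosting round fits a weak learner (a shallow decision tree by
    # default) on a reweighted version of the data.
    clf = AdaBoostClassifier(n_estimators=50)
    clf.fit(X, y)
    print(clf.score(X, y))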

.. topic:: References

.. [FS1995] Y. Freud, and R. Schapire, "A decision theoretic generalization of
online learning and an application to boosting", 1997.
.. [FS1995] Y. Freund, and R. Schapire, "A Decision-Theoretic Generalization of
On-Line Learning and an Application to Boosting", 1997.
.. [ZZRH2009] J. Zhu, H. Zou, S. Rosset, T. Hastie. "Multi-class AdaBoost",
2009.
18 changes: 9 additions & 9 deletions doc/modules/feature_extraction.rst
@@ -87,7 +87,7 @@ suitable for feeding into a classifier (maybe after being piped into a
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
<1x6 sparse matrix of type '<... 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names()
@@ -176,7 +176,7 @@ can be constructed using::

and fed to a hasher with::

hasher = FeatureHasher(input_type=string)
hasher = FeatureHasher(input_type='string')
X = hasher.transform(raw_X)

to get a ``scipy.sparse`` matrix ``X``.
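
Put together, a self-contained sketch (the ``tokens`` helper here is a
made-up stand-in for a real feature-extraction step)::

    from sklearn.feature_extraction import FeatureHasher

    def tokens(doc):
        # Hypothetical helper: emit one string per token occurrence;
        # FeatureHasher counts repeated strings itself.
        return doc.lower().split()

    corpus = ['the quick brown fox', 'jumped over the lazy dog']
    hasher = FeatureHasher(n_features=2 ** 10, input_type='string')
    X = hasher.transform(tokens(d) for d in corpus)
    print(X.shape)  # (2, 1024), as a scipy.sparse matrix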
@@ -310,7 +310,7 @@ corpus of text documents::
>>> X = vectorizer.fit_transform(corpus)
>>> X # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
<4x9 sparse matrix of type '<... 'numpy.int64'>'
with 19 stored elements in Compressed Sparse Column format>
with 19 stored elements in Compressed Sparse ... format>

The default configuration tokenizes the string by extracting words of
at least 2 letters. The specific function that does this step can be
@@ -430,7 +430,7 @@ content of the documents::
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
<6x3 sparse matrix of type '<... 'numpy.float64'>'
with 9 stored elements in Compressed Sparse Row format>
with 9 stored elements in Compressed Sparse ... format>

>>> tfidf.toarray() # doctest: +ELLIPSIS
array([[ 0.85..., 0. ..., 0.52...],
@@ -457,7 +457,7 @@ class called :class:`TfidfVectorizer` that combines all the options of
>>> vectorizer.fit_transform(corpus)
... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
<4x9 sparse matrix of type '<... 'numpy.float64'>'
with 19 stored elements in Compressed Sparse Row format>
with 19 stored elements in Compressed Sparse ... format>

While the tf–idf normalization is often very useful, there might
be cases where the binary occurrence markers might offer better
@@ -621,7 +621,7 @@ span across words::
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
<1x4 sparse matrix of type '<... 'numpy.int64'>'
with 4 stored elements in Compressed Sparse Column format>
with 4 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
... [' fox ', ' jump', 'jumpy', 'umpy '])
True
@@ -630,7 +630,7 @@
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
<1x5 sparse matrix of type '<... 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Column format>
with 5 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
... ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'])
True
@@ -699,7 +699,7 @@ meaning that you don't have to call ``fit`` on it::
>>> hv.transform(corpus)
... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
<4x10 sparse matrix of type '<... 'numpy.float64'>'
with 16 stored elements in Compressed Sparse Row format>
with 16 stored elements in Compressed Sparse ... format>

You can see that 16 non-zero feature tokens were extracted in the vector
output: this is less than the 19 non-zeros extracted previously by the
@@ -724,7 +724,7 @@ Let's try again with the default setting::
>>> hv.transform(corpus)
... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
<4x1048576 sparse matrix of type '<... 'numpy.float64'>'
with 19 stored elements in Compressed Sparse Row format>
with 19 stored elements in Compressed Sparse ... format>

We no longer get the collisions, but this comes at the expense of a much larger
dimensionality of the output space.
12 changes: 6 additions & 6 deletions doc/modules/label_propagation.rst
@@ -9,8 +9,8 @@ Semi-Supervised
`Semi-supervised learning
<http://en.wikipedia.org/wiki/Semi-supervised_learning>`_ is a situation
in which some of the samples in your training data are not labeled. The
semi-supervised estimators, in :mod:`sklean.semi_supervised` are able to
make use of this addition unlabeled data to capture better the shape of
semi-supervised estimators in :mod:`sklearn.semi_supervised` are able to
make use of this additional unlabeled data to better capture the shape of
the underlying data distribution and generalize better to new samples.
These algorithms can perform well when we have a very small amount of
labeled points and a large amount of unlabeled points.
@@ -19,14 +19,14 @@ labeled points and a large amount of unlabeled points.

It is important to assign an identifier to unlabeled points along with the
labeled data when training the model with the `fit` method. The identifier
that this implementation uses the integer value :math:`-1`.
that this implementation uses is the integer value :math:`-1`.
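
For example, a minimal sketch of this convention, on made-up one-dimensional
data::

    import numpy as np
    from sklearn.semi_supervised import LabelPropagation

    X = np.array([[0.0], [0.1], [0.2], [3.0], [3.1], [3.2]])
    # -1 marks the unlabeled points; only the first and last samples are labeled.
    y = np.array([0, -1, -1, -1, -1, 1])

    model = LabelPropagation().fit(X, y)
    print(model.transduction_)  # inferred labels for all six samples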

.. _label_propagation:

Label Propagation
=================

Label propagation denote a few variations of semi-supervised graph
Label propagation denotes a few variations of semi-supervised graph
inference algorithms.

A few features available in this model:
@@ -75,11 +75,11 @@ available:
* knn (:math:`1[x' \in kNN(x)]`). :math:`k` is specified by keyword
  ``n_neighbors``.

RBF kernel will produce a fully connected graph which is represented in memory
The RBF kernel will produce a fully connected graph which is represented in memory
by a dense matrix. This matrix may be very large and combined with the cost of
performing a full matrix multiplication calculation for each iteration of the
algorithm can lead to prohibitively long running times. On the other hand,
the KNN kernel will produce a much more memory friendly sparse matrix
the KNN kernel will produce a much more memory-friendly sparse matrix
which can drastically reduce running times.
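
For instance, switching from the dense to the sparse graph is a single
keyword change (parameter values here are illustrative)::

    from sklearn.semi_supervised import LabelSpreading

    # Dense, fully connected graph: fine for small data, memory-hungry otherwise.
    rbf_model = LabelSpreading(kernel='rbf', gamma=20)

    # Sparse k-nearest-neighbors graph: much more memory-friendly.
    knn_model = LabelSpreading(kernel='knn', n_neighbors=7)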

.. topic:: Examples