Merge branch 'master' into emnb
Conflicts:
	sklearn/preprocessing/__init__.py
larsmans committed Dec 21, 2011
2 parents 0ce2348 + d38e3a7 commit be29a50
Showing 64 changed files with 5,949 additions and 2,689 deletions.
4 changes: 2 additions & 2 deletions .gitignore
@@ -5,8 +5,8 @@
*.swp
.DS_Store
build
scikits/learn/datasets/__config__.py
scikits/learn/**/*.html
sklearn/datasets/__config__.py
sklearn/**/*.html

dist/
doc/_build/
1 change: 1 addition & 0 deletions doc/datasets/index.rst
@@ -116,6 +116,7 @@ can be used to build artificial datasets of controlled size and complexity.
:template: function.rst

make_classification
make_multilabel_classification
make_regression
make_blobs
make_friedman1
20 changes: 11 additions & 9 deletions doc/datasets/twenty_newsgroups.rst
@@ -7,15 +7,14 @@ and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon messages posted before
and after a specific date.

The 20 newsgroups dataset is also available through the generic
``mldata`` dataset loader introduced earlier. However mldata
provides a version where the data is already vectorized.

This is not the case for this loader. Instead, it returns the list of
the raw text files that can be fed to text feature extractors such as
:class:`sklearn.feature_extraction.text.Vectorizer` with custom
parameters so as to extract feature vectors.

This module contains two loaders. The first one,
``sklearn.datasets.fetch_20newsgroups``,
returns a list of the raw text files that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.Vectorizer`
with custom parameters so as to extract feature vectors.
The second one, ``sklearn.datasets.fetch_20newsgroups_vectorized``,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

Usage
-----
@@ -99,6 +98,9 @@ zero features)::
>>> vectors.nnz / vectors.shape[0]
118

``sklearn.datasets.fetch_20newsgroups_vectorized`` is a function which returns
ready-to-use tfidf features instead of file names.

.. _`20 newsgroups website`: http://people.csail.mit.edu/jrennie/20Newsgroups/
.. _`TF-IDF`: http://en.wikipedia.org/wiki/Tf-idf

81 changes: 62 additions & 19 deletions doc/developers/index.rst
@@ -1,3 +1,9 @@
.. toctree::
:numbered:
index.rst
performance.rst
utilities.rst

============
Contributing
============
@@ -18,6 +24,7 @@ You are also welcome to post there feature requests or links to pull-requests.

.. _git_repo:


Retrieving the latest code
==========================

@@ -44,28 +51,27 @@ additional utilities.
Contributing code
=================

.. note:
.. note::

To avoid duplicated work it is highly advised to contact the developers
mailing list before starting work on a non-trivial feature.

https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

How to contribute
-----------------

The preferred way to contribute to `scikit-learn` is to fork the main
The preferred way to contribute to Scikit-Learn is to fork the main
repository on
`github <http://github.com/scikit-learn/scikit-learn/>`__:

1. `Create an account <https://github.com/signup/free>`_ on
github if you don't have one already.

2. Fork the `scikit-learn repo
2. Fork the `project repository
<http://github.com/scikit-learn/scikit-learn>`__: click on the 'Fork'
button, at the top, center of the page. This creates a copy of
the code on the github server where you can work.
the code on the GitHub server where you can work.

3. Clone this copy to your local disk (you need the `git` program to do
this)::
@@ -103,6 +109,10 @@ rules before submitting a pull request:

* Follow the `coding-guidelines`_ (see below).

* When applicable, use the Validation tools and other code in the
``sklearn.utils`` submodule. A list of utility routines available
for developers can be found in the :ref:`developers-utils` page.

* All public methods should have informative docstrings with sample
usage presented as doctests when appropriate.

@@ -144,7 +154,6 @@ You can also check for common programming errors with the following tools:
$ pip install pep8
$ pep8 path/to/module.py


Bonus points for contributions that include a performance analysis with
a benchmark script and profiling output (please report on the mailing
list or on the github wiki).
@@ -159,7 +168,6 @@ details on profiling and cython optimizations.
on all new contributions will get the overall code base quality in the
right direction.


EasyFix Issues
--------------

@@ -267,25 +275,67 @@ In addition, we add the following guidelines:
<https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt>`_
in all your docstrings.


A good example of code that we like can be found `here
<https://svn.enthought.com/enthought/browser/sandbox/docs/coding_standard.py>`_.


Input validation
----------------

The module ``sklearn.utils`` contains various functions for doing input
.. currentmodule:: sklearn.utils

The module :mod:`sklearn.utils` contains various functions for doing input
validation/conversion. Sometimes, ``np.asarray`` suffices for validation;
do `not` use ``np.asanyarray`` or ``np.atleast_2d``, since those let NumPy's
``np.matrix`` through, which has a different API
(e.g., ``*`` means dot product on ``np.matrix``,
but Hadamard product on ``np.ndarray``).

In other cases, be sure to call ``safe_asarray``, ``atleast2d_or_csr``,
``as_float_array`` or ``array2d`` on any array-like argument passed to a
In other cases, be sure to call :func:`safe_asarray`, :func:`atleast2d_or_csr`,
:func:`as_float_array` or :func:`array2d` on any array-like argument passed to a
scikit-learn API function. The exact function to use depends mainly on whether
``scipy.sparse`` matrices must be accepted.

For more information, refer to the :ref:`developers-utils` page.
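The difference between ``np.asarray`` and ``np.asanyarray`` described above can be seen in a minimal sketch that uses only NumPy. The variable names are illustrative and are not part of ``sklearn.utils``:

```python
import numpy as np

# A small np.matrix input, as a caller might pass to an API function.
m = np.matrix([[1, 2], [3, 4]])

# np.asarray converts the matrix to a plain ndarray.
converted = np.asarray(m)

# np.asanyarray lets the np.matrix subclass through unchanged.
passed_through = np.asanyarray(m)

# On an ndarray, ``*`` is the elementwise (Hadamard) product.
elementwise = converted * converted

# On an np.matrix, ``*`` is the matrix (dot) product.
matrix_product = passed_through * passed_through
```

Because the same expression yields different results for the two types, code that assumes ``ndarray`` semantics should normalize its input first.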

Random Numbers
--------------

If your code depends on a random number generator, do not use
``numpy.random.random()`` or similar routines. To ensure
repeatability in error checking, the routine should accept a keyword
``random_state`` and use this to construct a
``numpy.random.RandomState`` object.
See :func:`sklearn.utils.check_random_state` in :ref:`developers-utils`.

Here's a simple example of code using some of the above guidelines:

::

    from sklearn.utils import array2d, check_random_state

    def choose_random_sample(X, random_state=0):
        """
        Choose a random point from X

        Parameters
        ----------
        X : array-like, shape = (n_samples, n_features)
            array representing the data
        random_state : RandomState or an int seed (0 by default)
            A random number generator instance to define the state of the
            random permutations generator.

        Returns
        -------
        x : numpy array, shape = (n_features,)
            A random point selected from X
        """
        X = array2d(X)
        random_state = check_random_state(random_state)
        i = random_state.randint(X.shape[0])
        return X[i]
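To illustrate why a ``random_state`` argument gives repeatable results, here is a NumPy-only sketch. The helper ``pick_row`` is hypothetical; it builds a ``numpy.random.RandomState`` directly from an integer seed as a stand-in for the ``check_random_state`` utility:

```python
import numpy as np

def pick_row(X, seed):
    """Hypothetical helper: pick one row of X using a seeded generator."""
    X = np.asarray(X)
    # A fresh RandomState built from the same seed yields the same draws.
    rng = np.random.RandomState(seed)
    i = rng.randint(X.shape[0])
    return X[i]

data = np.arange(20).reshape(10, 2)

# Two calls with the same seed select the same row.
first = pick_row(data, seed=0)
second = pick_row(data, seed=0)
```

Calls that share a seed return the same row, which is what makes tests that rely on random draws reproducible.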


APIs of scikit-learn objects
============================
@@ -295,7 +345,6 @@ objects. In addition, to avoid the proliferation of framework code, we
try to adopt simple conventions and limit to a minimum the number of
methods an object has to implement.


Different objects
-----------------

@@ -333,7 +382,6 @@ multiple interfaces):

score = obj.score(data)


Estimators
----------

@@ -397,10 +445,9 @@ following is wrong::
# the argument in the constructor
self.param3 = param2

The scikit-learn relies on this mechanism to introspect objects to set
Scikit-Learn relies on this mechanism to introspect objects to set
their parameters by cross-validation.


Fitting
^^^^^^^

@@ -456,14 +503,12 @@ Any attribute that ends with ``_`` is expected to be overridden when
you call ``fit`` a second time without taking any previous value into
account: **fit should be idempotent**.
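The conventions above (constructor arguments stored unchanged, fitted attributes ending in ``_`` and recomputed on every ``fit``) can be sketched with a toy class. ``MeanEstimator`` and its ``scale`` parameter are hypothetical and exist only for this illustration:

```python
import numpy as np

class MeanEstimator:
    """Hypothetical estimator, used only to illustrate the conventions."""

    def __init__(self, scale=1.0):
        # The constructor only stores the parameter under its own name.
        self.scale = scale

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # mean_ is recomputed from scratch, ignoring any previous value.
        self.mean_ = self.scale * X.mean(axis=0)
        return self

est = MeanEstimator()

# First fit: mean_ is derived from the first dataset only.
first = est.fit([[0.0, 0.0], [2.0, 4.0]]).mean_.copy()

# Second fit on different data: mean_ is overridden, not accumulated.
second = est.fit([[10.0, 10.0]]).mean_.copy()

# Refitting on the first dataset reproduces the first result.
again = est.fit([[0.0, 0.0], [2.0, 4.0]]).mean_.copy()
```

Since ``fit`` never reads the previous ``mean_``, the result depends only on the constructor parameters and the data passed to the latest call.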


Optional Arguments
^^^^^^^^^^^^^^^^^^

In iterative algorithms, number of iterations should be specified by
an int called ``n_iter``.


Unresolved API issues
----------------------

@@ -473,15 +518,13 @@ Some things must still be decided:
* which exception should be raised when the arrays' shapes do not match
in fit()?


Working notes
---------------

For unresolved issues, TODOs, remarks on ongoing work, developers are
advised to maintain notes on the github wiki:
https://github.com/scikit-learn/scikit-learn/wiki


Specific models
-----------------

