Merge branch 'master' into emnb

Conflicts: sklearn/preprocessing/__init__.py
larsmans · Dec 21, 2011 · commit be29a50
2 parents 0ce2348 + d38e3a7

Showing 64 changed files with 5,949 additions and 2,689 deletions.
diff --git a/.gitignore b/.gitignore
@@ -5,8 +5,8 @@
 *.swp
 .DS_Store
 build
-scikits/learn/datasets/__config__.py
-scikits/learn/**/*.html
+sklearn/datasets/__config__.py
+sklearn/**/*.html
 
 dist/
 doc/_build/

diff --git a/doc/datasets/index.rst b/doc/datasets/index.rst
@@ -116,6 +116,7 @@ can be used to build artifical datasets of controled size and complexity.
    :template: function.rst
 
    make_classification
+   make_multilabel_classification
    make_regression
    make_blobs
    make_friedman1

diff --git a/doc/datasets/twenty_newsgroups.rst b/doc/datasets/twenty_newsgroups.rst
@@ -7,15 +7,14 @@ and the other one for testing (or for performance evaluation). The split
 between the train and test set is based upon a messages posted before
 and after a specific date.
 
-The 20 newsgroups dataset is also available through the generic
-``mldata`` dataset loader introduced earlier. However mldata
-provides a version where the data is already vectorized.
-
-This is not the case for this loader. Instead, it returns the list of
-the raw text files that can be fed to  text feature extractors such as
-:class:`sklearn.feature_extraction.text.Vectorizer` with custom
-parameters so as to extract feature vectors.
-
+This module contains two loaders. The first one, 
+``sklearn.datasets.fetch_20newsgroups``,
+returns a list of the raw text files that can be fed to text feature
+extractors such as :class:`sklearn.feature_extraction.text.Vectorizer`
+with custom parameters so as to extract feature vectors.
+The second one, ``sklearn.datasets.fetch_20newsgroups_vectorized``,
+returns ready-to-use features, i.e., it is not necessary to use a feature
+extractor.
 
 Usage
 -----
@@ -99,6 +98,9 @@ zero features)::
   >>> vectors.nnz / vectors.shape[0]
   118
 
+``sklearn.datasets.fetch_20newsgroups_vectorized`` is a function which returns 
+ready-to-use tfidf features instead of file names.
+
 .. _`20 newsgroups website`: http://people.csail.mit.edu/jrennie/20Newsgroups/
 .. _`TF-IDF`: http://en.wikipedia.org/wiki/Tf-idf
 

diff --git a/doc/developers/index.rst b/doc/developers/index.rst
@@ -1,3 +1,9 @@
+.. toctree::
+   :numbered:
+   index.rst
+   performance.rst
+   utilities.rst
+
 ============
 Contributing
 ============
@@ -18,6 +24,7 @@ You are also welcome to post there feature requests or links to pull-requests.
 
 .. _git_repo:
 
+
 Retrieving the latest code
 ==========================
 
@@ -44,28 +51,27 @@ additional utilities.
 Contributing code
 =================
 
-.. note:
+.. note::
 
   To avoid duplicated work it is highly advised to contact the developers
   mailing list before starting work on a non-trivial feature.
 
   https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
 
-
 How to contribute
 -----------------
 
-The prefered way to contribute to `scikit-learn` is to fork the main
+The preferred way to contribute to Scikit-Learn is to fork the main
 repository on
 `github <http://github.com/scikit-learn/scikit-learn/>`__:
 
  1. `Create an account <https://github.com/signup/free>`_ on
     github if you don't have one already.
 
- 2. Fork the `scikit-learn repo
+ 2. Fork the `project repository
     <http://github.com/scikit-learn/scikit-learn>`__: click on the 'Fork'
     button, at the top, center of the page. This creates a copy of
-    the code on the github server where you can work.
+    the code on the GitHub server where you can work.
 
  3. Clone this copy to your local disk (you need the `git` program to do
     this)::
@@ -103,6 +109,10 @@ rules before submitting a pull request:
 
     * Follow the `coding-guidelines`_ (see below).
 
+    * When applicable, use the Validation tools and other code in the
+      ``sklearn.utils`` submodule.  A list of utility routines available
+      for developers can be found in the :ref:`developers-utils` page.
+
     * All public methods should have informative docstrings with sample
       usage presented as doctests when appropriate.
 
@@ -144,7 +154,6 @@ You can also check for common programming errors with the following tools:
         $ pip install pep8
         $ pep8 path/to/module.py
 
-
 Bonus points for contributions that include a performance analysis with
 a benchmark script and profiling output (please report on the mailing
 list or on the github wiki).
@@ -159,7 +168,6 @@ details on profiling and cython optimizations.
   on all new contributions will get the overall code base quality in the
   right direction.
 
-
 EasyFix Issues
 --------------
 
@@ -267,25 +275,67 @@ In addition, we add the following guidelines:
       <https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt>`_
       in all your docstrings.
 
+
 A good example of code that we like can be found `here
 <https://svn.enthought.com/enthought/browser/sandbox/docs/coding_standard.py>`_.
 
-
 Input validation
 ----------------
 
-The module ``sklearn.utils`` contains various functions for doing input
+.. currentmodule:: sklearn.utils
+
+The module :mod:`sklearn.utils` contains various functions for doing input
 validation/conversion. Sometimes, ``np.asarray`` suffices for validation;
 do `not` use ``np.asanyarray`` or ``np.atleast_2d``, since those let NumPy's
 ``np.matrix`` through, which has a different API
 (e.g., ``*`` means dot product on ``np.matrix``,
 but Hadamard product on ``np.ndarray``).
 
-In other cases, be sure to call ``safe_asarray``, ``atleast2d_or_csr``,
-``as_float_array`` or ``array2d`` on any array-like argument passed to a
+In other cases, be sure to call :func:`safe_asarray`, :func:`atleast2d_or_csr`,
+:func:`as_float_array` or :func:`array2d` on any array-like argument passed to a
 scikit-learn API function. The exact function to use depends mainly on whether
 ``scipy.sparse`` matrices must be accepted.
 
+For more information, refer to the :ref:`developers-utils` page.
+
+Random Numbers
+--------------
+
+If your code depends on a random number generator, do not use
+``numpy.random.random()`` or similar routines.  To ensure
+repeatability in error checking, the routine should accept a keyword
+``random_state`` and use this to construct a
+``numpy.random.RandomState`` object.
+See :func:`sklearn.utils.check_random_state` in :ref:`developers-utils`.
+
+Here's a simple example of code using some of the above guidelines:
+
+::
+
+    from sklearn.utils import array2d, check_random_state
+
+    def choose_random_sample(X, random_state=0):
+        """
+        Choose a random point from X

+        Parameters
+        ----------
+        X : array-like, shape = (n_samples, n_features)
+            array representing the data
+        random_state : RandomState or an int seed (0 by default)
+            A random number generator instance to define the state of the
+            random permutations generator.

+        Returns
+        -------
+        x : numpy array, shape = (n_features,)
+            A random point selected from X
+        """
+        X = array2d(X)
+        random_state = check_random_state(random_state)
+        i = random_state.randint(X.shape[0])
+        return X[i]
+
 
 APIs of scikit-learn objects
 ============================
@@ -295,7 +345,6 @@ objects. In addition, to avoid the proliferation of framework code, we
 try to adopt simple conventions and limit to a minimum the number of
 methods an object has to implement.
 
-
 Different objects
 -----------------
 
@@ -333,7 +382,6 @@ multiple interfaces):
 
       score = obj.score(data)
 
-
 Estimators
 ----------
 
@@ -397,10 +445,9 @@ following is wrong::
         # the argument in the constructor
         self.param3 = param2
 
-The scikit-learn relies on this mechanism to introspect object to set
+Scikit-Learn relies on this mechanism to introspect object to set
 their parameters by cross-validation.
 
-
 Fitting
 ^^^^^^^
 
@@ -456,14 +503,12 @@ Any attribute that ends with ``_`` is expected to be overridden when
 you call ``fit`` a second time without taking any previous value into
 account: **fit should be idempotent**.
 
-
 Optional Arguments
 ^^^^^^^^^^^^^^^^^^
 
 In iterative algorithms, number of iterations should be specified by
 an int called ``n_iter``.
 
-
 Unresolved API issues
 ----------------------
 
@@ -473,15 +518,13 @@ Some things are must still be decided:
     * which exception should be raised when arrays' shape do not match
       in fit() ?
 
-
 Working notes
 ---------------
 
 For unresolved issues, TODOs, remarks on ongoing work, developers are
 adviced to maintain notes on the github wiki:
 https://github.com/scikit-learn/scikit-learn/wiki
 
-
 Specific models
 -----------------