changed readme to RST
paulgb committed Apr 25, 2013
1 parent b10d6ee commit d46db24
Showing 5 changed files with 85 additions and 25 deletions.
48 changes: 48 additions & 0 deletions LICENSE
@@ -0,0 +1,48 @@
sklearn-pandas -- bridge code for cross-validation of pandas data frames
with sklearn

This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.

Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:

1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.

Paul Butler <paulgb@gmail.com>

The source code of DataFrameMapper is derived from code originally written by
Ben Hamner and released under the following license.

Copyright (c) 2013, Ben Hamner
Author: Ben Hamner (ben@benhamner.com)
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,2 @@
include LICENSE
include README.rst
57 changes: 34 additions & 23 deletions README.md → README.rst
@@ -2,7 +2,7 @@
Sklearn-pandas
==============

This module provides a bridge between [Scikit-Learn](http://scikit-learn.org/stable/)'s machine learning methods and [pandas](http://pandas.pydata.org/)-style Data Frames.
This module provides a bridge between `Scikit-Learn <http://scikit-learn.org/stable/>`__'s machine learning methods and `pandas <http://pandas.pydata.org/>`__-style Data Frames.

In particular, it provides:

@@ -12,40 +12,42 @@ In particular, it provides:
Installation
------------

You can install `sklearn-pandas` with `pip`.
You can install ``sklearn-pandas`` with ``pip``::

# pip install sklearn-pandas

Tests
-----

The examples in this file double as basic sanity tests. To run them, use `doctest`, which is included with python.
The examples in this file double as basic sanity tests. To run them, use ``doctest``, which is included with python::

# python -m doctest README.md

Usage
-----

### Import
Import
******

Import what you need from the `sklearn_pandas` package. The choices are:
Import what you need from the ``sklearn_pandas`` package. The choices are:

* `DataFrameMapper`, a class for mapping pandas data frame columns to different sklearn transformations
* `cross_val_score`, similar to `sklearn.cross_validation.cross_val_score` but working on pandas DataFrames
* ``DataFrameMapper``, a class for mapping pandas data frame columns to different sklearn transformations
* ``cross_val_score``, similar to `sklearn.cross_validation.cross_val_score` but working on pandas DataFrames

For this demonstration, we will import both.
For this demonstration, we will import both::

>>> from sklearn_pandas import DataFrameMapper, cross_val_score

For these examples, we'll also use pandas and sklearn.
For these examples, we'll also use pandas and sklearn::

>>> import pandas as pd
>>> import sklearn.preprocessing, sklearn.decomposition, \
... sklearn.linear_model, sklearn.pipeline, sklearn.metrics

### Load some Data
Load some Data
**************

Normally you'll read the data from a file, but for demonstration purposes I'll create a data frame from a Python dict.
Normally you'll read the data from a file, but for demonstration purposes I'll create a data frame from a Python dict::

>>> data = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
... 'children': [4., 6, 3, 3, 2, 3, 5, 4],
@@ -54,19 +56,21 @@ Normally you'll read the data from a file, but for demonstration purposes I'll c
Transformation Mapping
----------------------

### Map the Columns to Transformations
Map the Columns to Transformations
**********************************

The mapper takes a list of pairs. The first is a column name from the pandas DataFrame (or a list of multiple columns, as we will see later). The second is an object which will perform the transformation which will be applied to that column.
The mapper takes a list of pairs. The first is a column name from the pandas DataFrame (or a list of multiple columns, as we will see later). The second is an object which will perform the transformation which will be applied to that column::

>>> mapper = DataFrameMapper([
... ('pet', sklearn.preprocessing.LabelBinarizer()),
... ('children', sklearn.preprocessing.StandardScaler())
... ])


### Test the Transformation
Test the Transformation
***********************

We can use the `fit_transform` shortcut to both fit the model and see what transformed data looks like.
We can use the ``fit_transform`` shortcut to both fit the model and see what transformed data looks like::

>>> mapper.fit_transform(data)
array([[ 1. , 0. , 0. , 0.20851441],
@@ -78,22 +82,23 @@ We can use the `fit_transform` shortcut to both fit the model and see what trans
[ 1. , 0. , 0. , 1.04257207],
[ 0. , 0. , 1. , 0.20851441]])

Note that the first three columns are the output of the `LabelBinarizer` (corresponding to _cat_, _dog_, and _fish_ respectively) and the fourth column is the standardized value for the number of children. In general, the columns are ordered according to the order given when the `DataFrameMapper` is constructed.
Note that the first three columns are the output of the ``LabelBinarizer`` (corresponding to _cat_, _dog_, and _fish_ respectively) and the fourth column is the standardized value for the number of children. In general, the columns are ordered according to the order given when the ``DataFrameMapper`` is constructed.

Now that the transformation is trained, we confirm that it works on new data.
Now that the transformation is trained, we confirm that it works on new data::

>>> mapper.transform({'pet': ['cat'], 'children': [5.]})
array([[ 1. , 0. , 0. , 1.04257207]])

### Transform Multiple Columns
Transform Multiple Columns
**************************

Transformations may require multiple input columns. In these cases, the column names can be specified in a list.
Transformations may require multiple input columns. In these cases, the column names can be specified in a list::

>>> mapper2 = DataFrameMapper([
... (['children', 'salary'], sklearn.decomposition.PCA(1))
... ])
Now running `fit_transform` will run PCA on the `children` and `salary` columns and return the first principal component.
Now running ``fit_transform`` will run PCA on the ``children`` and ``salary`` columns and return the first principal component::

>>> mapper2.fit_transform(data)
array([[ 47.62288153],
@@ -108,14 +113,20 @@ Now running `fit_transform` will run PCA on the `children` and `salary` columns
Cross-Validation
----------------

Now that we can combine features from pandas DataFrames, we may want to use cross-validation to see whether our model works. Scikit-learn provides features for cross-validation, but they expect numpy data structures and won't work with `DataFrameMapper`.
Now that we can combine features from pandas DataFrames, we may want to use cross-validation to see whether our model works. Scikit-learn provides features for cross-validation, but they expect numpy data structures and won't work with ``DataFrameMapper``.

To get around this, sklearn-pandas provides a wrapper on sklearn's `cross_val_score` function which passes a pandas DataFrame to the estimator rather than a numpy array.
To get around this, sklearn-pandas provides a wrapper on sklearn's ``cross_val_score`` function which passes a pandas DataFrame to the estimator rather than a numpy array::

>>> pipe = sklearn.pipeline.Pipeline([
... ('featurize', mapper),
... ('lm', sklearn.linear_model.LinearRegression())])
>>> cross_val_score(pipe, data, data.salary, sklearn.metrics.mean_squared_error)
array([ 2018.185 , 6.72033058, 1899.58333333])

Sklearn-pandas' `cross_val_score` function provides exactly the same interface as sklearn's function of the same name.
Sklearn-pandas' ``cross_val_score`` function provides exactly the same interface as sklearn's function of the same name.

Credit
------

The code for ``DataFrameMapper`` is based on code originally written by `Ben Hamner <https://github.com/benhamner>`__.

2 changes: 1 addition & 1 deletion setup.py
@@ -3,7 +3,7 @@
from setuptools import setup

setup(name='sklearn-pandas',
version='0.2',
version='0.0.1',
description='Pandas integration with sklearn',
author='Paul Butler',
author_email='paulgb@gmail.com',
1 change: 0 additions & 1 deletion sklearn_pandas/__init__.py
@@ -2,7 +2,6 @@
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import cross_validation
import pdb

def cross_val_score(estimator, X, *args, **kwargs):
class DataFrameWrapper(object):
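
Note on the README text in this commit: it describes the mapper as a list of (column, transformer) pairs whose outputs are joined in construction order. The following is a minimal sketch of that idea in plain Python, not the code in ``sklearn_pandas/__init__.py``. ``TinyLabelBinarizer``, ``TinyStandardScaler`` and ``TinyMapper`` are hypothetical stand-ins written for illustration; they are assumed to mimic only the small part of the behaviour the README describes (sorted class labels, population standard deviation) and operate on a dict of lists rather than a pandas DataFrame.

```python
# Illustrative stand-ins only: these are NOT sklearn's or sklearn-pandas' classes.
from math import sqrt


class TinyLabelBinarizer:
    """Hypothetical one-hot encoder over sorted class labels."""

    def fit(self, values):
        self.classes_ = sorted(set(values))
        return self

    def transform(self, values):
        return [[1.0 if v == c else 0.0 for c in self.classes_] for v in values]


class TinyStandardScaler:
    """Hypothetical scaler: subtract the mean, divide by the population std."""

    def fit(self, values):
        n = len(values)
        self.mean_ = sum(values) / n
        self.std_ = sqrt(sum((v - self.mean_) ** 2 for v in values) / n)
        return self

    def transform(self, values):
        return [[(v - self.mean_) / self.std_] for v in values]


class TinyMapper:
    """Apply one transformer per column and join the outputs left to right."""

    def __init__(self, features):
        self.features = features  # list of (column_name, transformer) pairs

    def fit(self, table):
        for column, transformer in self.features:
            transformer.fit(table[column])
        return self

    def transform(self, table):
        blocks = [t.transform(table[c]) for c, t in self.features]
        # Concatenate the per-column blocks row by row, in construction order.
        return [sum(row_parts, []) for row_parts in zip(*blocks)]

    def fit_transform(self, table):
        return self.fit(table).transform(table)


# Same toy data as the README, held as a dict of lists (an assumption of this sketch).
data = {
    'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
}
mapper = TinyMapper([
    ('pet', TinyLabelBinarizer()),
    ('children', TinyStandardScaler()),
])
rows = mapper.fit_transform(data)
new_rows = mapper.transform({'pet': ['cat'], 'children': [5.]})
```

Each row of ``rows`` holds the three indicator columns for *cat*, *dog* and *fish* followed by the standardized ``children`` value, which is the column ordering the README describes. ``new_rows`` reuses the fitted statistics on unseen input, as in the README's ``mapper.transform`` call.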