An integration of pandas dataframes with scikit learn.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
pandas_sklearn
tests
.gitignore
.travis.yml
LICENSE
README.md
pytest.ini
requirements.txt
setup.py
tox.ini

README.md

Build Status

pandas2sklearn

An integration of pandas dataframes with scikit learn.

The module contains:

  • dealing with dataframes in a scikit learn DataSet fashion.
  • transformation mechanism that can be easily integrated in scikit learn pipelines, DataSetTransformer.

Installation

The module can be easily installed with pip:

> pip install pandas2sklearn

Tests

The module contains some basic testing of the provided functionalities.

> py.test

Usage

The module contains two classes:

DataSet

The DataSet is wrapper around pandas DataFrame, that converts you can use to select:

  • id
  • features
  • target

Example, suppose we have a DataFrame that has the following columns;

df.coumns = id, FN1, FN2, FN3, FN4, FN5, FC1, FC2, FC3, FC4, FC5, FC6, target

from pandas_sklearn import DataSet

dataset = DataSet(df, target_column='target', id_column='id')

dataset.has_target() == True
dataset.has_id() == True
dataset.target == df['target']
dataset.id == df['id']
dataset.target_names == ['FN1', 'FN2', 'FN3', 'FN4', 'FN5', 'FC1', 'FC2', 'FC3', 'FC4', 'FC5', 'FC6']
dataset.data == df[['FN1', 'FN2', 'FN3', 'FN4', 'FN5', 'FC1', 'FC2', 'FC3', 'FC4', 'FC5', 'FC6']]


# removing some features that are not needed FN4, FN5, FC1, FC5, FC6
dataset.set_feature_names(usage=DataSet.EXCLUDE, columns=['FN4', 'FN5', 'FC1', 'FC5', 'FC6'])
dataset.target_names == ['FN1', 'FN2', 'FN3', 'FC2', 'FC3', 'FC4']

# converting the dataset to dictionary
dataset.to_dict() == [
    {'FN1': 12, 'FN2': 23, 'FC2': 'coffee', 'FC2': 'xbox one', 'FC4': 'inch'},
    ...
]

DataSetTransformer

A feature wise transformer, applies a scikit-learn transformer to one or more features. e.g.

DataSetTransformer([
    (['petal length (cm)', 'petal width (cm)'], StandardScaler()),
    ('sepal length (cm)', MinMaxScaler()),
    ('sepal width (cm)', None),
]))

It could be used together with pipelines, e.g.

pipeline = Pipeline([
    ('preprocess', DataSetTransformer([
        (['petal length (cm)', 'petal width (cm)'], StandardScaler()),
        ('sepal length (cm)', MinMaxScaler()),
        ('sepal width (cm)', None),
    ])),
    ('classify', SVC(kernel='linear'))
])

Credit

The DataSetTransformer is based on the work of Ben Hamner and Paul Butler.