An implementation of the k-means++ clustering algorithm using Pandas
Python
Latest commit da64a0e Mar 4, 2015 Jack Maney Cleanup. Should close #3

README.md

K-means++ in Pandas

An implementation of the k-means++ clustering algorithm using Pandas.

IMPORTANT NOTE

This package should not be used in production. The implementation of k-means++ contained therein is much slower than that of scikit-learn. Use that instead.

The only reason why I wrote any of this is to teach myself Pandas.

Prerequisites

  • Python 2.7 or lower; this is not Python 3 compatible (yet).
  • Pandas (obviously).
  • NumPy

Installation

If you have pip, then just do

pip install k-means-plus-plus

Otherwise,

  • Clone the repository:

    git clone https://github.com/jackmaney/k-means-plus-plus-pandas.git
    
  • Enter the newly-created folder containing the repo

    cd k-means-plus-plus-pandas
    
  • And run the installation manually:

    python setup.py install
    

Usage

Here are the constructor arguments:

  • data_frame: A Pandas data frame representing the data that you wish to cluster. Rows represent observations, and columns represent variables.

  • k: The number of clusters that you want.

  • columns=None: A list of column names upon which you wish to cluster your data. If this argument isn't provided, then all of the columns are selected. Note: Columns upon which you want to cluster must be numeric and have no numpy.nan values.

  • max_iterations=None: The maximum number of times that you wish to iterate k-means. If no value is provided, then the iterations continue until stability is reached (ie the cluster assignments don't change between one iteration and the next).

  • appended_column_name=None: If this value is set with a string, then a column will be appended to your data with the given name that contains the cluster assignments (which are integers from 0 to k-1). If this argument is not set, then you still have access to the clusters via the clusters attribute.

Once you've constructed a KMeansPlusPlus object, then just call the cluster method, and everything else should happen automagically. Take a look at the examples folder.

TODO:

  • Add on features that take iterations of k-means++ clusters and compares them via, eg, concordance matrices, Jaccard indices, etc.

  • Given a data frame, implement the so-called Elbow Method to take a stab at an optimal value for k.

  • Make this into a proper Python module that can be installed via pip.

  • Python 3 compatibility (probably via six).