Drop support for missing values, following sklearn
nicodv committed Aug 30, 2021
1 parent 61d9b16 commit a20f6ed
Showing 1 changed file with 8 additions and 23 deletions.
31 changes: 8 additions & 23 deletions README.rst
@@ -98,29 +98,6 @@ Usage
The examples directory showcases simple use cases of both k-modes
('soybean.py') and k-prototypes ('stocks.py').

Missing / unseen data
_____________________

The k-modes algorithm accepts :code:`np.NaN` values as missing values in
the :code:`X` matrix. However, users are strongly advised to fill in the
missing data themselves in a way that makes sense for the problem at
hand. This is especially important when many values are missing.

The k-modes algorithm currently handles missing data as follows. When
fitting the model, :code:`np.NaN` values are encoded into their own
category (let's call it "unknown values"). When predicting, the model
treats any values in :code:`X` that (1) it has not seen before during
training, or (2) are missing, as being a member of the "unknown values"
category. Simply put, the algorithm treats any missing / unseen data as
matching with each other but mismatching with non-missing / seen data
when determining similarity between points.

The k-prototypes algorithm also accepts :code:`np.NaN` values as missing
values for the categorical variables, but does *not* accept missing values
for the numerical variables. It is up to the user to handle these missing
data in a way that is appropriate for the problem at hand.

Parallel execution
------------------

@@ -171,6 +148,14 @@ A: This is a feature, not a bug. :code:`kmodes` is telling you that it can't mak

----

**Q: I'm getting the following error: "ValueError: Input contains NaN, infinity, or a value too large for dtype('float64')."**

A: Following scikit-learn, the k-modes algorithm does not accept :code:`np.NaN`
values in the :code:`X` matrix. Users are advised to fill in the missing
data before fitting, in a way that makes sense for the problem at hand.
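
As an illustration only, one possible way to fill in missing categorical
values is to replace them with the most frequent value of their column. The
sketch below uses only the standard library; :code:`is_missing` and
:code:`fill_missing_with_mode` are hypothetical helpers and not part of
:code:`kmodes`. Mode imputation is an assumption made for this example and
may not be appropriate for every data set.

```python
import math
from collections import Counter


def is_missing(value):
    """True for None and for float NaN (np.nan is an ordinary float NaN)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing_with_mode(rows):
    """Return a copy of rows with missing entries replaced by the column mode.

    Hypothetical helper for illustration; not part of kmodes.
    """
    rows = [list(row) for row in rows]  # copy so the input is left untouched
    n_columns = len(rows[0]) if rows else 0
    for j in range(n_columns):
        observed = [row[j] for row in rows if not is_missing(row[j])]
        if not observed:
            raise ValueError(f"column {j} has no observed values to impute from")
        # Most frequent observed value; ties go to the value seen first.
        mode = Counter(observed).most_common(1)[0][0]
        for row in rows:
            if is_missing(row[j]):
                row[j] = mode
    return rows


X = [
    ["red", "small"],
    ["red", float("nan")],
    [None, "large"],
    ["blue", "small"],
]
X_filled = fill_missing_with_mode(X)
```

The filled rows contain no missing values, so they can be converted with
:code:`np.array(X_filled)` and passed to the estimator's :code:`fit` or
:code:`fit_predict` method. Whether the column mode is a sensible
replacement depends on the problem at hand.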

----

**Q: How would you like your library to be cited?**

A: Something along these lines would do nicely: