Errors in NearestNeighborLearner #21

@GoogleCodeExporter

What steps will reproduce the problem?
I tried to use learning.NearestNeighborLearner on the Sex Classification 
dataset from the Wikipedia article on Naive Bayes classifiers: 
http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_Classification

What is the expected output? What do you see instead?
Expected: the prediction for the test point is printed. Instead, the program 
won't run because of bugs in the implementation of NearestNeighborLearner.

What version of the product are you using?
Bug exists in r30


Please provide any additional information below.
Here's my sample code:

import learning

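# Each row is [height, weight, foot size, sex]; the last column is the target.
examples = [[6, 180, 12, 'male'], [5.92, 190, 11, 'male'],
            [5.58, 170, 12, 'male'], [5, 100, 6, 'female'],
            [5.5, 150, 8, 'female'], [5.42, 130, 7, 'female'],
            [5.75, 150, 9, 'female']]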

ds = learning.DataSet(examples)
nnl = learning.NearestNeighborLearner(2)
nnl.train(ds)
print(nnl.predict([5.1, 105, 6.3]))

And I would expect it to print 'female'.

I believe the following fixes should work:
old learning.py, lines 217 - 231

        else:
            ## Maintain a sorted list of (distance, example) pairs.
            ## For very large k, a PriorityQueue would be better
            best = [] 
            for e in examples:
                d = self.distance(e, example)
                if len(best) < k: 
                    e.append((d, e))
                elif d < best[-1][0]:
                    best[-1] = (d, e)
                    best.sort()
            return mode([e[self.dataset.target] for (d, e) in best])

    def distance(self, e1, e2):
        return mean_boolean_error(e1, e2)


new learning.py:

        else:
            ## Maintain a sorted list of (distance, example) pairs.
            ## For very large k, a PriorityQueue would be better
            best = [] 
            for e in self.dataset.examples:
                d = self.distance(e, example)
                if len(best) < self.k: 
                    best.append((d, e))
                elif d < best[-1][0]:
                    best[-1] = (d, e)
                    best.sort()
            return mode([e[self.dataset.target] for (d, e) in best])

    def distance(self, e1, e2):
        return mean_error(e1, e2)


Specifically:
1) Changed 'examples' to self.dataset.examples.
2) Changed 'k' to self.k.
3) Changed e.append((d, e)) to best.append((d, e)).
4) I could be wrong, but I believe you wanted mean_error, not 
mean_boolean_error, in your distance function.
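
To illustrate why the choice of distance function matters here, below is a small standalone sketch. The two helpers are re-implemented from my reading of them (mean_boolean_error as the fraction of positions that differ, mean_error as the average absolute difference); they are assumptions written for this example, not imported from the library.

```python
def mean_error(xs, ys):
    """Average absolute difference between paired values (assumed behaviour)."""
    pairs = list(zip(xs, ys))  # zip stops at the shorter list, so the label is ignored
    return sum(abs(x - y) for x, y in pairs) / float(len(pairs))


def mean_boolean_error(xs, ys):
    """Fraction of paired values that are not equal (assumed behaviour)."""
    pairs = list(zip(xs, ys))
    return sum(x != y for x, y in pairs) / float(len(pairs))


query = [5.1, 105, 6.3]
near = [5, 100, 6, 'female']   # row from the dataset that is close to the query
far = [6, 180, 12, 'male']     # row from the dataset that is far from the query

# No attribute matches exactly, so the boolean version cannot tell the rows apart.
print(mean_boolean_error(near, query), mean_boolean_error(far, query))

# The absolute-difference version ranks the close row ahead of the far one.
print(mean_error(near, query), mean_error(far, query))
```

With continuous attributes, exact matches are rare, so under these assumptions the boolean version gives 1.0 for both rows and the nearest neighbours end up being whichever examples come first.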

For the gender classification example, it seems to work great. Thanks!
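
For anyone who wants to check the expected result without the library, here is a minimal standalone sketch of the same idea. The names predict, EXAMPLES and TARGET are hypothetical stand-ins for this example only, and Counter replaces the library's mode. Unlike the patch above, the sketch also sorts after each append so that best[-1] is always the farthest neighbour kept; that is my own addition, not part of the proposed fix.

```python
from collections import Counter

# Rows are [height, weight, foot size, sex], as in the sample code above.
EXAMPLES = [[6, 180, 12, 'male'], [5.92, 190, 11, 'male'],
            [5.58, 170, 12, 'male'], [5, 100, 6, 'female'],
            [5.5, 150, 8, 'female'], [5.42, 130, 7, 'female'],
            [5.75, 150, 9, 'female']]
TARGET = 3  # index of the label column (stand-in for self.dataset.target)


def mean_error(xs, ys):
    """Average absolute difference between paired values."""
    pairs = list(zip(xs, ys))  # zip truncates, so the label column is skipped
    return sum(abs(x - y) for x, y in pairs) / float(len(pairs))


def predict(query, examples, k, distance=mean_error):
    """Return the most common label among the k examples nearest to query."""
    best = []  # sorted list of (distance, example) pairs, nearest first
    for e in examples:
        d = distance(e, query)
        if len(best) < k:
            best.append((d, e))
            best.sort(key=lambda pair: pair[0])
        elif d < best[-1][0]:
            best[-1] = (d, e)  # replace the farthest neighbour kept so far
            best.sort(key=lambda pair: pair[0])
    labels = [e[TARGET] for (d, e) in best]
    return Counter(labels).most_common(1)[0][0]


print(predict([5.1, 105, 6.3], EXAMPLES, 2))
```

This should print female, since the two nearest rows, [5, 100, 6] and [5.42, 130, 7], are both labelled female.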

Original issue reported on code.google.com by tblana...@gmail.com on 19 Oct 2010 at 5:21
