In [1]:
# Example from Programming Collective Intelligence, Chapter 6

### Classify documents based on their contents ###

In [2]:
from docclass import *

Build a simple classifier and train it on the sample data:

In [3]:
cl = classifier(getwords)
sampletrain(cl)

Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps


Count how many documents labeled 'good' contain the word 'quick':

In [4]:
cl.fcount('quick','good')

2.0

Calculate the probability that a document labeled 'good' contains the word 'quick', that is, Pr(quick | good):

In [5]:
cl.fprob('quick', 'good')

0.6666666666666666
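
The two values above can be reproduced by hand. The following is a minimal sketch, not the code in docclass: the helper `words` is a hypothetical stand-in for getwords, and the good/bad labels are assumed from the book's sample data, since the training output only prints the sentences.

```python
import re
from collections import defaultdict

# Sentences as printed by sampletrain(); the labels are an assumption
# taken from the book's example, they are not shown in the output.
SAMPLES = [
    ("Nobody owns the water.", "good"),
    ("the quick rabbit jumps fences", "good"),
    ("buy pharmaceuticals now", "bad"),
    ("make quick money at the online casino", "bad"),
    ("the quick brown fox jumps", "good"),
]

def words(doc):
    """Hypothetical stand-in for getwords: unique lowercase words of 3 to 19 characters."""
    return {w.lower() for w in re.split(r"\W+", doc) if 2 < len(w) < 20}

feature_counts = defaultdict(int)   # (word, category) -> number of documents containing the word
category_counts = defaultdict(int)  # category -> number of documents

for doc, cat in SAMPLES:
    category_counts[cat] += 1
    for w in words(doc):
        feature_counts[(w, cat)] += 1

def fprob(word, cat):
    """Fraction of the documents in `cat` that contain `word`."""
    if category_counts[cat] == 0:
        return 0.0
    return feature_counts[(word, cat)] / category_counts[cat]

print(feature_counts[("quick", "good")])  # 2 of the 3 'good' documents contain 'quick'
print(fprob("quick", "good"))             # 2/3
```

Counting each word at most once per document is why 'quick' scores 2 out of 3 for 'good'.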

An assumed probability is used when you have very little information about the feature in question. A good number to start with is 0.5. You also need to decide how much weight to give the assumed probability: a weight of 1 means the assumed probability counts as much as one occurrence of the word. weightedprob returns a weighted average of the probability computed by the function passed in (here cl.fprob) and the assumed probability.

Calculate the weighted probability of the word 'money' for the categories 'good' and 'bad':

In [6]:
print(cl.weightedprob('money','good',cl.fprob))
print(cl.weightedprob('money','bad',cl.fprob))

0.25
0.5
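
The weighted average described above can be sketched as a one-line formula. This is an illustration under assumptions: the function name and parameter names are hypothetical, and the inputs are worked out by hand from the sample data ('money' occurs in one training document, which is assumed to be labeled 'bad').

```python
def weighted_prob(basic_prob, total_count, weight=1.0, assumed_prob=0.5):
    """Blend a measured probability with an assumed one.

    total_count is how many times the feature was seen across all categories;
    the more often it was seen, the less the assumed probability matters.
    """
    return (weight * assumed_prob + total_count * basic_prob) / (weight + total_count)

# 'money' was seen once in total: 0 of 3 'good' documents, 1 of 2 'bad' documents.
print(weighted_prob(0.0, 1))  # (0.5 + 0.0) / 2 = 0.25
print(weighted_prob(0.5, 1))  # (0.5 + 0.5) / 2 = 0.5
```

With a weight of 1 and a single observation, the result sits exactly halfway between the measured value and 0.5.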


Train a simple naive Bayes classifier:

In [7]:
cl=naivebayes(getwords)
sampletrain(cl)

Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps


Calculate the naive Bayes score of the document 'quick rabbit' for each category. The score is Pr(document | category) × Pr(category) and is not normalized:

In [8]:
print(cl.prob('quick rabbit','good'))
print(cl.prob('quick rabbit','bad'))

quick rabbit
0.15625
quick rabbit
0.05
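
As a cross-check, the two scores can be rebuilt from raw counts. This is a sketch under assumptions: the count tables are read off the sample data by hand (with labels assumed from the book), and the function names are illustrative rather than the docclass API.

```python
# Documents per category and per-word document counts, taken by hand
# from the five training sentences (labels are an assumption).
CATEGORY_COUNTS = {"good": 3, "bad": 2}
FEATURE_COUNTS = {
    "quick": {"good": 2, "bad": 1},
    "rabbit": {"good": 1, "bad": 0},
}

def fprob(word, cat):
    """Pr(word | category): share of the category's documents containing the word."""
    return FEATURE_COUNTS[word][cat] / CATEGORY_COUNTS[cat]

def weighted_prob(word, cat, weight=1.0, assumed_prob=0.5):
    """Weighted average of fprob and the assumed probability."""
    total = sum(FEATURE_COUNTS[word].values())
    return (weight * assumed_prob + total * fprob(word, cat)) / (weight + total)

def naive_bayes_score(doc_words, cat):
    """Pr(document | category) * Pr(category), treating words as independent."""
    prior = CATEGORY_COUNTS[cat] / sum(CATEGORY_COUNTS.values())
    doc_prob = 1.0
    for w in doc_words:
        doc_prob *= weighted_prob(w, cat)
    return doc_prob * prior

print(naive_bayes_score(["quick", "rabbit"], "good"))  # approximately 0.15625
print(naive_bayes_score(["quick", "rabbit"], "bad"))   # approximately 0.05
```

Only the ratio between the two scores matters for classification, which is why the missing normalization by Pr(document) does no harm.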


Classify new instances:

In [9]:
print(cl.classify('quick rabbit',default='unknown'))
print(cl.classify('quick money',default='unknown'))

quick rabbit
quick rabbit
good
quick money
quick money
bad


In the case of spam filtering, it’s much more important to avoid having good email messages classified as spam than it is to catch every single spam message. The occasional spam message in your inbox can be tolerated, but an important email that is automatically filtered to junk mail might get overlooked completely. If you have to search through your junk mail folder for important email messages, there’s really no point in having a spam filter.

To deal with this problem, you can set up a minimum threshold for each category. For a new item to be classified into a particular category, its probability must be a specified multiple of the probability for every other category. This multiple is the threshold. For spam filtering, the threshold for bad could be 3, so that the probability for bad would have to be 3 times higher than the probability for good. The threshold for good could be set to 1, so a message would be classified as good whenever its probability for good is at all higher than its probability for bad. Any message where the probability for bad is higher, but not 3 times higher, would be classified as unknown.

Set the threshold for 'bad' to 3 and classify 'quick money' again:

In [10]:
cl.setthreshold('bad',3.0)
cl.classify('quick money',default='unknown')

quick money
quick money


'unknown'
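
The threshold rule can be sketched in a few lines. This is an illustration, not the docclass implementation: the function takes precomputed scores, and the two scores for 'quick money' are worked out by hand from the sample data under the same label assumptions as the book's example.

```python
def classify(scores, thresholds, default="unknown"):
    """Pick the best-scoring category, unless a rival comes too close.

    scores: category -> naive Bayes score
    thresholds: category -> factor by which it must beat every other category
                (categories without an entry use 1.0)
    """
    best = max(scores, key=scores.get)
    threshold = thresholds.get(best, 1.0)
    for cat, score in scores.items():
        if cat != best and score * threshold > scores[best]:
            return default
    return best

# Hand-computed scores for 'quick money' after one pass over the sample data.
scores = {"good": 0.09375, "bad": 0.1}

print(classify(scores, {}))            # bad: 0.1 beats 0.09375 with threshold 1
print(classify(scores, {"bad": 3.0}))  # unknown: 0.1 is not 3 times 0.09375
```

The score for 'bad' is only about 1.07 times the score for 'good', so a threshold of 3 sends the message to 'unknown'.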

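Train on the same sample data ten more times (a simple form of oversampling), so that the observed counts outweigh the assumed probability: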

In [13]:
for i in range(10): 
    sampletrain(cl)

Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
[... the same five lines repeat for each remaining pass; output truncated ...]

In [14]:
print(cl.classify('quick money',default='unknown'))

quick money
quick money
bad
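
Why the result flips from 'unknown' to 'bad' can be seen by recomputing the two scores for a given number of training passes. The sketch below is an assumption-laden illustration: the per-pass counts are read off the sample data by hand (labels assumed from the book), and `quick_money_scores` is a hypothetical helper.

```python
def weighted_prob(basic_prob, total_count, weight=1.0, assumed_prob=0.5):
    """Weighted average of a measured probability and an assumed one."""
    return (weight * assumed_prob + total_count * basic_prob) / (weight + total_count)

def quick_money_scores(passes):
    """Naive Bayes scores for 'quick money' after `passes` runs over the sample data.

    Per pass: 'quick' is in 2 of 3 good and 1 of 2 bad documents,
    'money' is in 0 of 3 good and 1 of 2 bad documents.
    Repeating the data leaves these fractions unchanged but raises the totals.
    """
    quick_total = 3 * passes
    money_total = 1 * passes
    good = weighted_prob(2 / 3, quick_total) * weighted_prob(0.0, money_total) * 0.6
    bad = weighted_prob(0.5, quick_total) * weighted_prob(0.5, money_total) * 0.4
    return good, bad

for passes in (1, 11):
    good, bad = quick_money_scores(passes)
    print(passes, bad / good)  # about 1.07 after 1 pass, about 6.04 after 11 passes
```

After eleven passes in total, the weighted probability of 'money' for 'good' has dropped from 0.25 to about 0.04, so the score for 'bad' clears the threshold of 3.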


Train a classifier based on the Fisher method:

In [15]:
cl=fisherclassifier(getwords)
sampletrain(cl)

Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps


Calculate the probability of a category given a word (cprob), and its weighted version:

In [16]:
print(cl.cprob('quick','good'))
print(cl.cprob('money','bad'))
print(cl.weightedprob('money','bad',cl.cprob))

0.571428571429
1.0
0.75
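
The category probability above normalizes Pr(word | category) across all categories. A minimal sketch, assuming the per-category values are already known (they are hand-computed from the sample data here, and the function signature is illustrative rather than the one in docclass):

```python
def cprob(fprobs, cat):
    """Pr(category | word), given Pr(word | category) for every category."""
    clf = fprobs[cat]
    if clf == 0:
        return 0.0
    return clf / sum(fprobs.values())

def weighted_prob(basic_prob, total_count, weight=1.0, assumed_prob=0.5):
    """Weighted average of a measured probability and an assumed one."""
    return (weight * assumed_prob + total_count * basic_prob) / (weight + total_count)

quick = {"good": 2 / 3, "bad": 1 / 2}  # 'quick': 2 of 3 good, 1 of 2 bad documents
money = {"good": 0.0, "bad": 1 / 2}    # 'money': 0 of 3 good, 1 of 2 bad documents

print(cprob(quick, "good"))                     # (2/3) / (2/3 + 1/2) = 4/7, about 0.5714
print(cprob(money, "bad"))                      # 0.5 / 0.5 = 1.0
print(weighted_prob(cprob(money, "bad"), 1))    # (0.5 + 1.0) / 2 = 0.75
```

Because 'money' was seen only once, the weighted value pulls the overconfident 1.0 back to 0.75.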


Calculate the Fisher probabilities of the document 'quick rabbit' for each category:

In [17]:
print(cl.fisherprob('quick rabbit','good'))
print(cl.fisherprob('quick rabbit','bad'))

quick rabbit
0.78013986589
quick rabbit
0.356335962833
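
The Fisher method multiplies the per-word probabilities, takes -2 times the natural log of the product, and feeds the result to an inverse chi-square function with twice as many degrees of freedom as there are words. The following sketch assumes the weighted per-word probabilities are given; they are hand-computed from the sample data, and the function names are illustrative.

```python
import math

def inv_chi2(chi, df):
    """Inverse chi-square function for an even number of degrees of freedom."""
    m = chi / 2.0
    total = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_prob(feature_probs):
    """Combine per-word category probabilities with the Fisher method."""
    product = math.prod(feature_probs)
    score = -2 * math.log(product)
    return inv_chi2(score, 2 * len(feature_probs))

# Weighted cprob values for 'quick' and 'rabbit' (weight 1, assumed probability 0.5):
# 'quick' was seen 3 times, 'rabbit' once.
good = [(0.5 + 3 * (4 / 7)) / 4, (0.5 + 1 * 1.0) / 2]
bad = [(0.5 + 3 * (3 / 7)) / 4, (0.5 + 1 * 0.0) / 2]

print(fisher_prob(good))  # about 0.7801
print(fisher_prob(bad))   # about 0.3563
```

Unlike the naive Bayes scores, the two Fisher values are each on a 0 to 1 scale and do not need to sum to 1.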


Classify with the Fisher classifier, first with the default minimums and then after setting a minimum for 'bad' and for 'good':

In [18]:
print(cl.classify('quick rabbit'))
print(cl.classify('quick money'))
cl.setminimum('bad',0.8)
print(cl.classify('quick money'))
cl.setminimum('good',0.4)
print(cl.classify('quick money'))

quick rabbit
quick rabbit
good
quick money
quick money
bad
quick money
quick money
good
quick money
quick money
good
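
The effect of the minimums can be sketched as follows. This is an illustration under assumptions: the rule shown (a category wins only if its Fisher probability exceeds its own minimum and beats the best so far) follows the book's description, and the two scores for 'quick money' are rounded hand computations from the sample data, not output of the notebook.

```python
def classify(fisher_scores, minimums, default=None):
    """Return the category with the highest Fisher probability above its minimum.

    Categories without an entry in `minimums` use a minimum of 0.
    """
    best, best_score = default, 0.0
    for cat, score in fisher_scores.items():
        if score > minimums.get(cat, 0.0) and score > best_score:
            best, best_score = cat, score
    return best

# Approximate Fisher probabilities for 'quick money' (hand-computed, rounded).
scores = {"good": 0.412, "bad": 0.701}

print(classify(scores, {}))                          # bad
print(classify(scores, {"bad": 0.8}))                # good: 'bad' no longer clears its minimum
print(classify(scores, {"bad": 0.8, "good": 0.4}))   # good: 0.412 still exceeds 0.4
```

Raising the minimum for 'good' above 0.412 would leave no qualifying category, and the default would be returned.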
