# Naïve Bayes

Pros: Works with a small amount of data, handles multiple classes

Cons: Sensitive to how the input data is prepared

Works with: Nominal values

## General approach to naïve Bayes

1. Collect: Any method.
    
2. Prepare: Numeric or Boolean values are needed.

3. Analyze: With many features, plotting features isn’t helpful. Looking at histograms is a better idea.

4. Train: Calculate the conditional probabilities of the independent features.

5. Test: Calculate the error rate.

6. Use: One common application of naïve Bayes is document classification. You
can use naïve Bayes in any classification setting. It doesn’t have to be text.

Let’s make a quick filter for an online message board that flags
a message as inappropriate if the author uses negative or abusive language. Filtering
out this sort of thing is common because abusive postings make people not come back
and can hurt an online community. We’ll have two categories: abusive and not. We’ll
use 1 to represent abusive and 0 to represent not abusive.

First, we’re going to show how to transform lists of text into a vector of numbers.
Next, we’ll show how to calculate conditional probabilities from these vectors. Then,
we’ll create a classifier, and finally, we’ll look at some practical considerations.

## Prepare: making word vectors from text

In [1]:
from numpy import *

In [2]:
def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

The text has been labeled by a
human and will be used to train a program to automatically detect abusive posts.

The function createVocabList() creates a list of all the unique words in all
of our documents. To build this list, use the Python set data type: pass a list of
items to the set constructor, and the resulting set holds each distinct item exactly once.

In [3]:
def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        #The | operator is used for union of two sets; recall that this is the bitwise OR operator
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)
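
As a small illustration of the set-union idea, the sketch below builds a vocabulary from two toy documents. The names `docs`, `vocab`, and `vocab_list` are made up for this example. One assumption worth flagging: a Python set has no guaranteed order, so the word order of `list(vocabSet)` can differ between runs; sorting is one way to get a reproducible vocabulary.

```python
# Two toy documents that share the words 'my' and 'dog'.
docs = [['my', 'dog', 'has', 'flea'],
        ['my', 'dog', 'park']]

vocab = set()
for doc in docs:
    vocab |= set(doc)        # in-place union; duplicates collapse to one entry

# Sets are unordered, so sort to get the same list on every run.
vocab_list = sorted(vocab)
print(vocab_list)            # ['dog', 'flea', 'has', 'my', 'park']
```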

Next comes the function setOfWords2Vec(), which takes the vocabulary list and a
document and outputs a vector of 1s and 0s that represents whether each word from our
vocabulary is present in the given document. It first creates a vector the same length
as the vocabulary list, fills it with 0s, and then sets the entry for each word found
in the document to 1.

In [4]:
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else: print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
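
For larger vocabularies, calling `vocabList.index(word)` for every word scans the list each time. The sketch below is one possible variant, not the notebook's function: it builds a word-to-position dictionary once and looks words up in it. The name `set_of_words_to_vec` is hypothetical, and unlike setOfWords2Vec() it silently skips words that are not in the vocabulary.

```python
def set_of_words_to_vec(vocab_list, input_words):
    """Return a 0/1 presence vector for input_words over vocab_list."""
    # Build the word -> position lookup once instead of calling list.index() per word.
    index_of = {word: i for i, word in enumerate(vocab_list)}
    vec = [0] * len(vocab_list)
    for word in input_words:
        i = index_of.get(word)
        if i is not None:        # words outside the vocabulary are ignored
            vec[i] = 1
    return vec

vocab = ['dog', 'flea', 'has', 'my', 'park']
print(set_of_words_to_vec(vocab, ['my', 'dog', 'has', 'flea', 'problems']))
# [1, 1, 1, 1, 0]
```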

In [5]:
listOposts,listClasses = loadDataSet()

In [6]:
myVocabList = createVocabList(listOposts)

In [7]:
myVocabList

['flea',
 'posting',
 'to',
 'steak',
 'mr',
 'not',
 'stupid',
 'help',
 'has',
 'quit',
 'garbage',
 'him',
 'park',
 'so',
 'dalmation',
 'please',
 'maybe',
 'take',
 'I',
 'my',
 'stop',
 'cute',
 'is',
 'problems',
 'buying',
 'food',
 'licks',
 'how',
 'love',
 'worthless',
 'dog',
 'ate']

In [8]:
setOfWords2Vec(myVocabList,listOposts[0])

[1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0]

## Train: calculating probabilities from word vectors

Count the number of documents in each class

for every training document:

    for the class of that document:

        if a token appears in the document ➞ increment the count for that token
        increment the total token count for that class

for each class:

    for each token:

        divide the token count by the total token count to get conditional probabilities

return conditional probabilities for each class

In [9]:
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    #Prior probability of the abusive class (class 1)
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    #Start counts at 1 and denominators at 2.0 so an unseen word never gets probability 0
    p0Num = ones(numWords); p1Num = ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    
    for i in range(numTrainDocs):
        #Vector Addition
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    #Element-wise division, then log of each conditional probability
    p1Vect = log(p1Num/p1Denom)
    p0Vect = log(p0Num/p0Denom)
    return p0Vect,p1Vect,pAbusive
# We use logs because multiplying many small probabilities underflows to zero in floating point; summing their logs does not.
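
Two numerical choices in trainNB0() are easier to see on toy numbers: taking logs, and starting the counts at 1 with denominators at 2.0. The sketch below is only an illustration; the probabilities, the counts, and the helper `word_prob` are made up for this example.

```python
import math

# 1) Underflow: the product of many small probabilities collapses to 0.0,
#    while the sum of their logs stays a usable finite number.
probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p
log_sum = sum(math.log(p) for p in probs)
print(product)      # 0.0
print(log_sum)      # a large negative, but finite, number

# 2) Zero counts: a word never seen in a class would get probability 0 and
#    wipe out the whole product (and log(0) is undefined).
def word_prob(count, total, smooth):
    """Estimate p(word | class); smoothing mirrors the ones()/2.0 initialization."""
    if smooth:
        return (count + 1) / (total + 2.0)
    return count / total

unsmoothed = word_prob(0, 20, smooth=False)   # 0.0
smoothed = word_prob(0, 20, smooth=True)      # 1/22, small but nonzero
print(unsmoothed, smoothed)
```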

In [10]:
trainMat = []

In [11]:
#This for loop populates the trainMat list with word vectors
for postinDoc in listOposts:
    trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
p0V,p1V,pAb = trainNB0(trainMat,listClasses)

In [12]:
pAb

0.5

This is the prior probability that any document is abusive: three of the six training posts are labeled 1.

## Naïve Bayes classify function

In [13]:
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise multiplication
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0
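
The comparison in classifyNB() can be written out as the usual naïve Bayes decision rule in log space. The notation is an assumption of this note rather than something the notebook defines: $x_i$ is entry $i$ of vec2Classify, $w_i$ is the $i$-th vocabulary word, $N$ is the vocabulary size, and $c$ is the class (1 for abusive, 0 for not).

```latex
\begin{align*}
p(c \mid \mathbf{x}) &\propto p(c) \prod_{i=1}^{N} p(w_i \mid c)^{x_i} \\
\log p(c \mid \mathbf{x}) &= \log p(c) + \sum_{i=1}^{N} x_i \log p(w_i \mid c) + \text{const} \\
\hat{c} &= \arg\max_{c \in \{0, 1\}} \Big[ \log p(c) + \sum_{i=1}^{N} x_i \log p(w_i \mid c) \Big]
\end{align*}
```

The sum corresponds to `sum(vec2Classify * p1Vec)` (or `p0Vec`), and $\log p(c)$ to `log(pClass1)` or `log(1.0 - pClass1)`. The constant is $-\log p(\mathbf{x})$, which is the same for both classes and so can be dropped from the comparison.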

In [14]:
def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print (testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print (testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb))


In [15]:
testingNB()

['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1


If a word appears more than once in a document, that might convey information that a
simple present-or-absent flag loses. Counting occurrences instead is known as the
bag-of-words model, whereas the approach used so far, which records only presence, is
the set-of-words model. A bag of words can have multiple occurrences of each word,
whereas a set of words can have only one occurrence of each word.

In [16]:
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
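
To make the difference concrete, the sketch below runs both models on a toy document in which one word repeats. The function names `set_of_words` and `bag_of_words` and the sample data are illustrative stand-ins for setOfWords2Vec() and bagOfWords2VecMN().

```python
def set_of_words(vocab_list, words):
    """Presence vector: 1 if the word occurs at all, else 0."""
    vec = [0] * len(vocab_list)
    for w in words:
        if w in vocab_list:
            vec[vocab_list.index(w)] = 1
    return vec

def bag_of_words(vocab_list, words):
    """Count vector: number of times each vocabulary word occurs."""
    vec = [0] * len(vocab_list)
    for w in words:
        if w in vocab_list:
            vec[vocab_list.index(w)] += 1
    return vec

vocab = ['dog', 'garbage', 'stupid']
doc = ['stupid', 'dog', 'stupid', 'stupid']
print(set_of_words(vocab, doc))   # [1, 0, 1]
print(bag_of_words(vocab, doc))   # [1, 0, 3]
```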

In [17]:
emailText =open(r'C:\Users\piush\Desktop\Data\pythonTutorials\ML in action\email\ham\6.txt').read()

textParse() takes a big string and parses the text into a list of
strings. It drops any token of two characters or fewer and converts everything to
lowercase.
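
As a quick check of that behavior, here is a minimal sketch of the same tokenizing idea on a made-up sentence. The name `text_parse` and the sample string are assumptions for illustration. It splits on `\W+` (one or more non-word characters); a pattern such as `\W*` can match the empty string, and from Python 3.7 on `re.split` then also splits between individual characters.

```python
import re

def text_parse(big_string):
    """Split on runs of non-word characters, keep tokens longer than
    two characters, and lowercase them."""
    tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in tokens if len(tok) > 2]

sample = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
print(text_parse(sample))
# ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']
```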

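spamTest() automates the naïve Bayes spam classifier: it loads 25 spam and 25 ham emails, holds out 10 at random as a test set, trains on the remaining 40, and reports the error rate on the held-out emails.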

A single call to spamTest() is only one iteration with one random split. To get a
good estimate of the classifier’s true error, you should repeat it multiple times and
take the average error rate.

In [18]:
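#Split on runs of characters that aren't letters, digits, or underscores
def textParse(bigString):    #input is big string, #output is word list
    import re
    #\W+ rather than \W*: a pattern that can match the empty string makes Python 3.7+ split between every character
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 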
    
def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open(r'C:\Users\piush\Desktop\Data\pythonTutorials\ML in action\email\spam\%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open(r'C:\Users\piush\Desktop\Data\pythonTutorials\ML in action\email\ham\%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    #The emails that go into the test set and the training set will be randomly selected.
    trainingSet = list(range(50)); testSet=[]           #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print ("classification error",docList[docIndex])
    print ('the error rate is: ',float(errorCount)/len(testSet))
    #return vocabList,fullText

In [19]:
spamTest()

classification error ['scifinance', 'now', 'automatically', 'generates', 'gpu', 'enabled', 'pricing', 'risk', 'model', 'source', 'code', 'that', 'runs', '300x', 'faster', 'than', 'serial', 'code', 'using', 'new', 'nvidia', 'fermi', 'class', 'tesla', 'series', 'gpu', 'scifinance', 'derivatives', 'pricing', 'and', 'risk', 'model', 'development', 'tool', 'that', 'automatically', 'generates', 'and', 'gpu', 'enabled', 'source', 'code', 'from', 'concise', 'high', 'level', 'model', 'specifications', 'parallel', 'computing', 'cuda', 'programming', 'expertise', 'required', 'scifinance', 'automatic', 'gpu', 'enabled', 'monte', 'carlo', 'pricing', 'model', 'source', 'code', 'generation', 'capabilities', 'have', 'been', 'significantly', 'extended', 'the', 'latest', 'release', 'this', 'includes']
classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
the error rate is:  0.2
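
Averaging over several random hold-out splits, as suggested earlier, could look roughly like the sketch below. It is not the notebook's code: `holdout_split`, `average_error_rate`, and the stand-in evaluator are hypothetical names, and the evaluator is a trivial "always predict 0" rule used only so the example runs on its own. In practice the evaluator would train and test the classifier on the given indices and return its error rate.

```python
import random

def holdout_split(n_items, n_test, rng):
    """Randomly split range(n_items) into (train_indices, test_indices)."""
    test = rng.sample(range(n_items), n_test)
    held_out = set(test)
    train = [i for i in range(n_items) if i not in held_out]
    return train, test

def average_error_rate(evaluate, n_items, n_test, n_runs=10, seed=0):
    """Average evaluate(train, test) over n_runs random hold-out splits."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_runs):
        train, test = holdout_split(n_items, n_test, rng)
        rates.append(evaluate(train, test))
    return sum(rates) / n_runs

# Stand-in evaluator: labels alternate 1, 0, ... as in spamTest(), and the
# "classifier" always predicts 0, so its error is the share of 1s in the test set.
labels = [1, 0] * 25

def always_predict_zero(train, test):
    return sum(labels[i] for i in test) / len(test)

print(average_error_rate(always_predict_zero, n_items=50, n_test=10))
```

Fixing the seed makes the averaged figure reproducible; removing it gives a fresh set of splits on each call.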




In [20]:
import feedparser

In [21]:
ny = feedparser.parse('http://newyork.craiglist.org/stp/index.rss')

In [23]:
ny['entries']
len(ny['entries'])

25

In [29]:
#Calculate frequency of occurrence
def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True) 
    return sortedFreq[:30]
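
calcMostFreq() calls `fullText.count(token)` once per vocabulary word, which rescans the whole text each time. A possible alternative, sketched below under the assumption that only the standard library's `collections.Counter` is used, counts every word in a single pass. The name `calc_most_freq` and the toy data are illustrative, and the order of tied words may differ from the original function.

```python
from collections import Counter

def calc_most_freq(vocab_list, full_text, n=30):
    """Return the n most frequent vocabulary words as (word, count) pairs."""
    counts = Counter(full_text)                      # one pass over the text
    vocab_counts = {w: counts[w] for w in vocab_list}  # missing words count as 0
    return sorted(vocab_counts.items(), key=lambda pair: pair[1], reverse=True)[:n]

words = ['the', 'dog', 'the', 'park', 'the', 'dog']
print(calc_most_freq(['dog', 'park', 'the'], words, n=2))
# [('the', 3), ('dog', 2)]
```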

localWords() below removes the 30 most frequent words before training. A small
percentage of the total words makes up a large portion of the text, because a large
percentage of language is redundancy and structural glue. Another common approach is
to remove not just the most common words but also this structural glue, using a
predefined list. Such a list is known as a stop word list, and a number of them are
available.
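
Filtering a vocabulary against a stop word list might look like the sketch below. The handful of words in `STOP_WORDS` is a made-up sample for illustration only, not a real published list, and `remove_stop_words` is a hypothetical helper name.

```python
# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {'the', 'and', 'to', 'is', 'of', 'in', 'a', 'that'}

def remove_stop_words(vocab_list, stop_words=STOP_WORDS):
    """Return the vocabulary without any word found in stop_words."""
    return [w for w in vocab_list if w.lower() not in stop_words]

vocab = ['the', 'dog', 'and', 'park', 'is', 'cute']
print(remove_stop_words(vocab))   # ['dog', 'park', 'cute']
```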

In [33]:
def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList = []; fullText =[]
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    for i in range(minLen):
        #Access one feed at a time
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1) #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    top30Words = calcMostFreq(vocabList,fullText)   #remove top 30 words
    for pairW in top30Words:
        if pairW[0] in vocabList: vocabList.remove(pairW[0])
    trainingSet = list(range(2*minLen)); testSet=[]           #create test set
    for i in range(20):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
    print ('the error rate is: ',float(errorCount)/len(testSet))
    return vocabList,p0V,p1V

In [34]:
ny = feedparser.parse('http://newyork.craiglist.org/stp/index.rss')
sf = feedparser.parse('http://sfbay.craiglist.org/stp/index.rss')

In [36]:
vocabList,pSF,pNY = localWords(ny,sf)

the error rate is:  0.45




## Analyze: displaying locally used words

You can sort the vectors pSF and pNY and then print out the words from vocabList at
the same index.
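
The idea of pairing each word with its log probability and sorting can be sketched in a few lines. The helper `top_words` is a hypothetical name, and the log-probability values are made up for illustration rather than taken from a real run.

```python
def top_words(vocab_list, log_probs, k=3):
    """Return the k words with the highest log probability."""
    pairs = sorted(zip(vocab_list, log_probs),
                   key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in pairs[:k]]

vocab = ['park', 'symphony', 'boat', 'poker']
log_p = [-4.2, -5.9, -3.1, -5.0]     # made-up values for illustration
print(top_words(vocab, log_p, k=2))  # ['boat', 'park']
```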

In [40]:
def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V=localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print("SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**")
    for item in sortedSF:
        print (item[0])
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print ("NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**")
    for item in sortedNY:
        print (item[0])

In [41]:
getTopWords(ny,sf)

the error rate is:  0.35
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
would
indian
open
home
work
lady
hate
professional
coming
from
what
chat
more
good
alone
really
people
shopping
working
park
area
wearing
hoping
give
will
see
married
sure
funny
all
office
wouldn
take
going
wear
feel
any
read
attractive
older
underwear
they
hates
find
poetry
interests
emails
still
exchange
write
age
etc
basically
two
here
little
tell
ideas
water
whether
hello
poker
idea
mature
laid
friend
symphony
wanted
browsing
leisure
participle
nickel
evenings
years
benefits
maybe
only
lol
drug
sharing
sun
ads
expectations
creek
wednesday
answer
bad
enforcement
wants
yet
learn
already
talking
well
tall
know
email
male
down
drugs
man
woman
got
expert
board
dinner
too
semi
city
lonely
hang
women
word
week
luck
wonder
occasion
desi
discreet
boat
taking
plays
try
settled
games
party
thhongs
movies
dime
because
meet
couple
while
shores
others
massage
other
also
daily
along
without
basicall
cute
fra



One thing to note: a lot of stop words appear in the output. It would be interesting
to see how things would change if you removed a fixed list of stop words; the
classification error would likely go down as well.

Underflow is one problem that can be addressed
by using the logarithm of probabilities in your calculations. The bag-of-words model is
an improvement on the set-of-words model when approaching document classification.
There are a number of other improvements, such as removing stop words, and
you can spend a long time optimizing a tokenizer.