In this program analysis I’ll be looking at a simple Python script from Chapter 4 of [Machine Learning in Action](https://www.manning.com/books/machine-learning-in-action). It uses a naive Bayes classifier to determine whether a forum post is abusive.

The [original code](https://github.com/pbharrin/machinelearninginaction/blob/master/Ch04/bayes.py) can be found in the book’s associated GitHub repository.

The first function defined is `loadDataSet`, which creates the dataset used throughout the program. It defines a list of posts, each represented as a list of the words in the post in the order they occur. A second variable, `classVec`, holds a binary label for each post indicating whether it is abusive.

In [4]:
'''
From https://github.com/pbharrin/machinelearninginaction
Example in Chapter 4 of Machine Learning in Action
'''

from numpy import *

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

listOPosts, listClasses = loadDataSet()

listOPosts

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

Next we have `createVocabList`, which builds a list of all the unique words used across our posts. It is called with the list of posts returned by `loadDataSet`.

In [18]:
def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

vocabList = createVocabList(listOPosts)
vocabList

['food', 'so', 'love', 'how', 'maybe', 'dalmation', 'posting', 'mr',
 'take', 'worthless', 'garbage', 'ate', 'to', 'buying', 'stop', 'please',
 'my', 'is', 'stupid', 'help', 'cute', 'park', 'problems', 'him',
 'quit', 'I', 'dog', 'licks', 'flea', 'steak', 'has', 'not']
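Because `createVocabList` builds the vocabulary from a `set`, the order of the returned list is not guaranteed and can differ between runs, so the positions you see may not match the output here. As a minimal sketch, a variant that sorts the words gives a reproducible order. The name `create_sorted_vocab_list` is my own and is not part of the book’s code.

```python
def create_sorted_vocab_list(data_set):
    """Return the unique words across all posts, sorted (hypothetical helper)."""
    vocab = set()
    for document in data_set:
        vocab.update(document)   # in-place union with this post's words
    return sorted(vocab)

# Small made-up dataset, just to show the behaviour
posts = [['my', 'dog', 'has', 'flea'], ['stop', 'my', 'dog']]
print(create_sorted_vocab_list(posts))
# ['dog', 'flea', 'has', 'my', 'stop']
```

A fixed order matters only for readability and reproducibility; the classifier works with any order as long as the same vocabulary list is used for training and classification.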

After that comes `setOfWords2Vec`, a function which takes the vocabulary list created by `createVocabList` and a single post from `loadDataSet`. So if our list of posts was created using:

    listOPosts, listClasses = loadDataSet()

The second argument passed to `setOfWords2Vec` might be `listOPosts[0]`, which is the first post in the list. The function returns a list of binary values the same length as the vocabulary list, with a `1` at the position of each vocabulary word present in the post and a `0` for each word that is missing.

In [19]:
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else: print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

exampleVec = setOfWords2Vec(vocabList, listOPosts[0])
exampleVec

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
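`vocabList.index(word)` scans the list on every call. That is fine for 32 words but gets slow for a large vocabulary. As a sketch under the same inputs, a dictionary mapping each word to its position produces the same vector with constant-time lookups. `set_of_words_to_vec_fast` is a hypothetical name, and unlike the original it silently skips unknown words instead of printing a message.

```python
def set_of_words_to_vec_fast(vocab_list, input_set):
    """Binary presence vector built with a word -> index lookup table."""
    index_of = {word: i for i, word in enumerate(vocab_list)}
    vec = [0] * len(vocab_list)
    for word in input_set:
        i = index_of.get(word)   # None when the word is not in the vocabulary
        if i is not None:
            vec[i] = 1
    return vec

# Made-up vocabulary and post for illustration
vocab = ['dog', 'flea', 'has', 'my', 'stop']
print(set_of_words_to_vec_fast(vocab, ['my', 'dog', 'barks']))
# [1, 0, 0, 1, 0]
```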

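With these three pieces in place we can begin the training phase. The first function in the training process is `trainNB0`. For each class it adds up how often each vocabulary word occurs in that class’s posts, then divides by the total number of words in those posts. This gives, for every word, an estimate of the probability of seeing it in an abusive post and in a non-abusive post. The counts start at 1 and the denominators at 2.0 so that a word never seen in a class does not end up with a probability of zero, and the function returns the logarithm of each probability so that later steps can add values instead of multiplying very small numbers.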

In [20]:
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
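    p0Num = ones(numWords); p1Num = ones(numWords)      #word counts start at 1 so no probability is zero
    p0Denom = 2.0; p1Denom = 2.0                        #denominators start at 2.0 (smoothing)
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]                     #per-word counts in abusive posts
            p1Denom += sum(trainMatrix[i])              #total words in abusive posts
        else:
            p0Num += trainMatrix[i]                     #per-word counts in non-abusive posts
            p0Denom += sum(trainMatrix[i])              #total words in non-abusive posts
    p1Vect = log(p1Num/p1Denom)          #log probabilities avoid underflow later
    p0Vect = log(p0Num/p0Denom)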
    return p0Vect,p1Vect,pAbusive

`trainNB0` expects a matrix of vectors as input. To create this matrix we need to run `setOfWords2Vec` on every post in our dataset. We can do this with the following code snippet as outlined in Chapter 4:

In [22]:
trainMat=[]
for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(vocabList, postinDoc))
    
trainMat

[[0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  1,
  0],
 [0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  1,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1],
 [0,
  1,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  1,
  0,
  0,
  1,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0],
 [0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 [0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  1,
  0,
  1,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  0],
 [1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  0,
  0,
  0,
  0]]

This gives us a training matrix called `trainMat`. We then pass this matrix to `trainNB0` along with `listClasses`, the list holding the class label for each post:

In [26]:
p0V,p1V,pAb=trainNB0(trainMat,listClasses)
p0V

array([-3.25809654, -2.56494936, -2.56494936, -2.56494936, -3.25809654,
       -2.56494936, -3.25809654, -2.56494936, -3.25809654, -3.25809654,
       -3.25809654, -2.56494936, -2.56494936, -3.25809654, -2.56494936,
       -2.56494936, -1.87180218, -2.56494936, -3.25809654, -2.56494936,
       -2.56494936, -3.25809654, -2.56494936, -2.15948425, -3.25809654,
       -2.56494936, -2.56494936, -2.56494936, -2.56494936, -2.56494936,
       -2.56494936, -3.25809654])
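To see where these numbers come from, the values can be reproduced by hand. The three non-abusive posts contain 7 + 8 + 9 = 24 words, so the denominator is 24 + 2 = 26, and each numerator is the word’s count in those posts plus 1. The sketch below redoes that arithmetic with the standard `math` module only. The variable and function names are mine, and counting with `list.count` matches `trainNB0` here only because no word repeats within a post.

```python
import math

# The three posts labelled 0 (not abusive) in loadDataSet
non_abusive_posts = [
    ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
    ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
    ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
]

# Denominator starts at 2.0, then adds every word in the class: 2 + 24 = 26
denom = 2.0 + sum(len(post) for post in non_abusive_posts)

def log_prob(word):
    """Smoothed log probability of `word` in a non-abusive post."""
    count = sum(post.count(word) for post in non_abusive_posts)
    return math.log((count + 1) / denom)

print(log_prob('my'))      # log(4/26), about -1.8718 (3 occurrences)
print(log_prob('him'))     # log(3/26), about -2.1595 (2 occurrences)
print(log_prob('stupid'))  # log(1/26), about -3.2581 (0 occurrences)
```

These match the entries for `my`, `him` and `stupid` in `p0V` (positions 16, 23 and 18 in the vocabulary list printed earlier).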

Running `trainNB0` returns three values. The first, `p0V`, is a vector holding the log probability of each vocabulary word appearing in a non-abusive post. The second, `p1V`, holds the log probability of each word appearing in an abusive post. The third, `pAb`, is the probability that any given document is abusive; for our dataset it is 0.5, as half the documents are abusive.

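Next we have `classifyNB`, which takes a forum post in vector form and the three values generated by `trainNB0`. It computes a log score for each class by summing the log probabilities of the words present in the post and adding the log of the class prior, then returns 1 if the abusive score is higher and 0 otherwise.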

In [30]:
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0

In [35]:
examplePost = ["that", "dog", "is", "cute"]
exampleVec = setOfWords2Vec(vocabList, examplePost)
classifyNB(exampleVec, p0V, p1V, pAb)

the word: that is not in my Vocabulary!


0

In [36]:
examplePost = ["your", "post", "is", "stupid"]
exampleVec = setOfWords2Vec(vocabList, examplePost)
classifyNB(exampleVec, p0V, p1V, pAb)

the word: your is not in my Vocabulary!
the word: post is not in my Vocabulary!


1
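`classifyNB` adds log probabilities rather than multiplying raw probabilities. One reason, sketched below with made-up numbers, is that a product of many small probabilities underflows to zero in floating point, while the equivalent sum of logs stays finite and can still be compared between classes.

```python
import math

# Hypothetical per-word probabilities for a 100-word document
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p           # the true value, 1e-500, is below the smallest float

log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # about -1151.29
```

Since the logarithm is monotonic, comparing `p1 > p0` on the log scores picks the same class as comparing the underlying products would.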

After `classifyNB` we have `bagOfWords2VecMN`, which is an alternative to `setOfWords2Vec`. Whereas `setOfWords2Vec` simply records a binary value for each word to indicate whether it appeared, the bag-of-words vector keeps track of words which appear more than once in a post. It does this by storing the number of times each word appears instead of a binary “1” value.

In [37]:
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

multipleWordExample = ["cute", "dog", "looks", "like", "my", "dog"]
bagVecExample = bagOfWords2VecMN(vocabList, multipleWordExample)
bagVecExample

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0]
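The same counting can be written with `collections.Counter` from the standard library. This is only a sketch of an alternative, not the book’s code, and `bag_of_words_to_vec` is a hypothetical name. Words outside the vocabulary are ignored, as in the original.

```python
from collections import Counter

def bag_of_words_to_vec(vocab_list, input_words):
    """Count how many times each vocabulary word occurs in the post."""
    counts = Counter(input_words)                   # word -> number of occurrences
    return [counts[word] for word in vocab_list]    # missing words count as 0

# Made-up vocabulary for illustration
vocab = ['cute', 'dog', 'my', 'stupid']
print(bag_of_words_to_vec(vocab, ['cute', 'dog', 'looks', 'like', 'my', 'dog']))
# [1, 2, 1, 0]
```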

After the function for creating a bag-of-words vector we have `testingNB`. This function wraps the other functions so that they can be run together without manually walking through each of the steps we’ve listed. It tests two hardcoded examples: `['love', 'my', 'dalmation']`, which is classified as not abusive, and `['stupid', 'garbage']`, which is classified as abusive.

In [43]:
def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb))
    
testingNB()

['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1


Next we have `textParse`, a helper function which takes a string and returns a list of its words in lowercase, keeping only those longer than 2 characters. It makes it easy to use input text without having to manually write each sentence as a list. An example of using it is:

In [44]:
def textParse(bigString):    #input is big string, #output is word list
    import re
    listOfTokens = re.split(r'\W+', bigString)    #\W+ rather than \W*, which can match the empty string
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 

textParse("This is a large body of text")

['this', 'large', 'body', 'text']
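A note on the pattern: the book’s original code splits on `\W*`, which can match an empty string. Since Python 3.7, `re.split` also splits on empty matches, so that pattern breaks the text into single characters and the length filter then discards everything. The sketch below compares the two patterns; `keep_long_tokens` is a hypothetical helper that applies the same filter as `textParse`.

```python
import re

text = "This is a large body of text"

def keep_long_tokens(tokens):
    """Lowercase the tokens and keep those longer than 2 characters."""
    return [tok.lower() for tok in tokens if len(tok) > 2]

char_tokens = re.split(r'\W*', text)   # Python 3.7+: empty strings and single characters
word_tokens = re.split(r'\W+', text)   # splits on runs of non-word characters only

print(keep_long_tokens(char_tokens))   # [] on Python 3.7+
print(keep_long_tokens(word_tokens))   # ['this', 'large', 'body', 'text']
```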

The functions beyond `textParse` relate to another example, which uses naive Bayes to categorize emails as spam or not spam. We won’t go into these functions, as they are outside the scope of this analysis.