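## 1. Classification with Bayesian Decision Theory

> Pros: remains effective with small amounts of data; handles multi-class problems

> Cons: sensitive to how the input data is prepared

> Works with: nominal data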

The core idea of Bayesian decision theory is simple and sensible: choose the decision with the highest probability.
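
As a minimal sketch of this rule, assume we already have a posterior probability for each class. The class names and numbers below are made up for illustration, and `bayes_decision` is a hypothetical helper, not part of the notebook's code:

```python
def bayes_decision(posteriors):
    """Return the class label with the highest posterior probability."""
    return max(posteriors, key=posteriors.get)

# Hypothetical posteriors p(c | x) for a single sample
posteriors = {'class_0': 0.3, 'class_1': 0.7}
decision = bayes_decision(posteriors)
print(decision)
```

The rest of these notes is about estimating those probabilities from data; the decision step itself stays this simple.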

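## 2. Conditional Probability

Conditional probability is the probability of one event given that another event is known to have occurred.

$p\left( c|x\right) =\dfrac {p\left( x,c\right)}{p\left( x\right) }$

Computing the posterior probability from prior knowledge (Bayes' rule) is the most exciting part: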

$p\left( c|x\right) =\dfrac {p\left( x|c\right) p\left( c\right) }{p\left( x\right) }$
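
A small worked example may help. The sketch below assumes a two-class problem, so that $p(x)$ can be expanded as $p(x|c)p(c) + p(x|\neg c)p(\neg c)$. The function name `posterior` and all numbers are illustrative assumptions:

```python
def posterior(likelihood_c, prior_c, likelihood_not_c):
    """Compute p(c | x) with Bayes' rule for a two-class problem."""
    prior_not_c = 1.0 - prior_c
    # Total probability of observing x under either class
    evidence = likelihood_c * prior_c + likelihood_not_c * prior_not_c
    return likelihood_c * prior_c / evidence

# Made-up numbers: p(x | c) = 0.2, p(c) = 0.5, p(x | not c) = 0.05
p = posterior(likelihood_c=0.2, prior_c=0.5, likelihood_not_c=0.05)
print(round(p, 3))
```

With these numbers the evidence is 0.125 and the posterior is 0.1 / 0.125 = 0.8, so observing x raises the probability of class c from 0.5 to 0.8.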

## 3. Document Classification with Naive Bayes

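A document usually has many features. If each feature needs N samples, then with M features the number of samples required grows exponentially, on the order of $N^M$. If we assume the features are independent of one another, the requirement drops to $M \times N$, which is linear in the number of features. This independence assumption is where the *naive* in naive Bayes comes from.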
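
To get a feel for the difference, here is a rough sketch that compares the two sample counts. The values N = 10 and M = 1000 are chosen only for illustration, and `samples_needed` is a hypothetical helper:

```python
def samples_needed(n_per_feature, n_features, independent):
    """Rough sample count with or without the independence assumption."""
    if independent:
        # Each feature is estimated separately: linear in the number of features
        return n_per_feature * n_features
    # Every combination of feature values must be covered: exponential
    return n_per_feature ** n_features

N, M = 10, 1000  # illustrative values
full = samples_needed(N, M, independent=False)
naive = samples_needed(N, M, independent=True)
print('without independence: a number with {} digits'.format(len(str(full))))
print('with independence: {}'.format(naive))
```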

## 4. Text Classification with Python

### 4.1 Prepare the Data: Building Word Vectors from Text

In [9]:
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                  ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                  ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                  ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                  ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                  ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

In [10]:
def createVocabList(dataSet):
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return vocabSet

In [11]:
def setOfWords2Vec(vocabList, inputSet):
    vocabList = list(vocabList)
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word : {0} is not in my Vocabulary!'.format(word))
    return returnVec

In [12]:
# test
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
myVocabList

{'I',
 'ate',
 'buying',
 'cute',
 'dalmation',
 'dog',
 'flea',
 'food',
 'garbage',
 'has',
 'help',
 'him',
 'how',
 'is',
 'licks',
 'love',
 'maybe',
 'mr',
 'my',
 'not',
 'park',
 'please',
 'posting',
 'problems',
 'quit',
 'so',
 'steak',
 'stop',
 'stupid',
 'take',
 'to',
 'worthless'}

In [13]:
setOfWords2Vec(myVocabList, listOPosts[0])[:10]

[0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
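
Because the vocabulary here is a Python set, the position of each word in the vector depends on the set's iteration order, which may differ between interpreter sessions. A possible alternative, sketched below under that assumption, is to sort the vocabulary once and look positions up in a dict. The names `create_sorted_vocab` and `set_of_words_to_vec` are hypothetical and not part of the notebook:

```python
def create_sorted_vocab(data_set):
    """Collect every distinct word and return them in a fixed, sorted order."""
    vocab = set()
    for document in data_set:
        vocab |= set(document)
    return sorted(vocab)

def set_of_words_to_vec(vocab_list, input_words):
    """Mark with 1 each vocabulary word that occurs in input_words."""
    index = {word: i for i, word in enumerate(vocab_list)}  # O(1) lookups
    vec = [0] * len(vocab_list)
    for word in input_words:
        if word in index:
            vec[index[word]] = 1
    return vec

docs = [['my', 'dog', 'has', 'flea'], ['stop', 'my', 'dog']]
vocab = create_sorted_vocab(docs)
print(vocab)
print(set_of_words_to_vec(vocab, ['dog', 'stop', 'unknown']))
```

Unknown words are silently skipped in this sketch, whereas `setOfWords2Vec` above prints a message for them.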

### 4.2 Train the Algorithm: Computing Probabilities from Word Vectors

In [14]:
import numpy as np

In [15]:
# Training function
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    # Prior p(c=1): fraction of training documents labeled 1
    pAbusive = np.sum(trainCategory)/numTrainDocs
    # Per-word occurrence counts for each class
    p0Num = np.zeros(numWords)
    p1Num = np.zeros(numWords)
    # Total number of words seen in each class
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += np.sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += np.sum(trainMatrix[i])
    # Conditional probabilities p(w_i | c) for each class
    p1Vect = p1Num/p1Denom
    p0Vect = p0Num/p0Denom
    return p0Vect, p1Vect, pAbusive

In [16]:
trainMat = []
for postingDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postingDoc))

p0V, p1V, pAb = trainNB0(trainMat, listClasses)

In [17]:
p0V

array([ 0.        ,  0.04166667,  0.04166667,  0.04166667,  0.        ,
        0.04166667,  0.        ,  0.04166667,  0.        ,  0.04166667,
        0.04166667,  0.04166667,  0.08333333,  0.04166667,  0.04166667,
        0.04166667,  0.04166667,  0.        ,  0.04166667,  0.        ,
        0.04166667,  0.04166667,  0.04166667,  0.        ,  0.04166667,
        0.125     ,  0.        ,  0.04166667,  0.04166667,  0.        ,
        0.        ,  0.        ])

In [18]:
p1V

array([ 0.05263158,  0.        ,  0.        ,  0.        ,  0.05263158,
        0.        ,  0.05263158,  0.        ,  0.05263158,  0.        ,
        0.        ,  0.05263158,  0.05263158,  0.        ,  0.        ,
        0.10526316,  0.05263158,  0.05263158,  0.        ,  0.05263158,
        0.        ,  0.        ,  0.        ,  0.10526316,  0.        ,
        0.        ,  0.05263158,  0.        ,  0.        ,  0.05263158,
        0.05263158,  0.15789474])

In [19]:
pAb

0.5

### 4.3 Test the Algorithm: Adapting the Classifier to Real-World Conditions

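When classifying a document with the Bayes classifier, we multiply many probabilities together. If any one of them is 0, the whole product is 0. To avoid this, the word counts are initialized to 1 and the denominators to 2 instead of 0.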
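
The effect of the changed initialization can be seen on a toy example. This is only a sketch: the counts are made up, and `word_probabilities` and `product` are hypothetical helpers that mirror the starting values 1 and 2 used in these notes:

```python
def word_probabilities(counts, count_init=0.0, denom_init=0.0):
    """Estimate p(word | class) from raw word counts for one class."""
    total = sum(counts) + denom_init
    return [(c + count_init) / total for c in counts]

def product(values):
    """Multiply all values together."""
    result = 1.0
    for v in values:
        result *= v
    return result

counts = [3, 0, 1]  # made-up counts; the second word never occurs in this class
raw = word_probabilities(counts)
smoothed = word_probabilities(counts, count_init=1.0, denom_init=2.0)
print(product(raw))       # one zero factor wipes out the whole product
print(product(smoothed))  # every factor is now strictly positive
```

Note that with these starting values the smoothed estimates need not sum to exactly 1; in the classifier they serve as scores that are compared between classes.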

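Another problem is underflow: multiplying too many very small numbers rounds the result to 0. Taking logarithms avoids this and turns the product into a sum.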
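
A short sketch of the underflow problem, using 100 made-up probabilities of 1e-5 each. The true product, 1e-500, is far below what a double-precision float can represent, while the sum of logarithms is an ordinary number:

```python
import math

probs = [1e-5] * 100  # hypothetical per-word probabilities

# Direct multiplication: the running product eventually rounds to 0.0
product = 1.0
for p in probs:
    product *= p

# Sum of logs: same information, no underflow
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

Since the logarithm is monotonically increasing, comparing log scores picks the same class as comparing the original products would.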

In [20]:
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = np.sum(trainCategory)/numTrainDocs
    # Start counts at 1 and denominators at 2 so that no probability is ever 0
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += np.sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += np.sum(trainMatrix[i])
    # Store log probabilities to avoid underflow when many are combined
    p1Vect = np.log(p1Num/p1Denom)
    p0Vect = np.log(p0Num/p0Denom)
    return p0Vect, p1Vect, pAbusive

In [21]:
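def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # p0Vec and p1Vec hold log probabilities, so summing replaces multiplying
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    # Choose the class with the larger log score
    if p1 > p0:
        return 1
    else:
        return 0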

In [22]:
def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postingDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postingDoc))
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print('{0} classified as: {1}'.format(testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print('{0} classified as: {1}'.format(testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))

In [23]:
testingNB()

['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1


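### 4.4 Prepare the Data: The Bag-of-Words Model

Above, each feature records only whether a word appears; this is called the **set-of-words model**. If a word appears more than once in a document, the count may carry additional information; recording it gives the **bag-of-words model**. The change to the code is small.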

In [72]:
def bagOfWordVecMN(vocabList, inputSet):
    vocabList = list(vocabList)
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec 
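
The difference between the two models is easiest to see side by side. The sketch below is self-contained and uses `collections.Counter`; `set_vector` and `bag_vector` are hypothetical names, and the vocabulary and document are made up:

```python
from collections import Counter

def set_vector(vocab, words):
    """Set-of-words: 1 if the word occurs at all, else 0."""
    present = set(words)
    return [1 if w in present else 0 for w in vocab]

def bag_vector(vocab, words):
    """Bag-of-words: how many times each vocabulary word occurs."""
    counts = Counter(words)
    return [counts[w] for w in vocab]

vocab = ['book', 'python', 'the']
words = ['the', 'book', 'the', 'best', 'book']
print(set_vector(vocab, words))
print(bag_vector(vocab, words))
```

Here 'book' and 'the' each occur twice, which the bag-of-words vector records and the set-of-words vector does not.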

## 5. Example: Filtering Spam Email with Naive Bayes

### 5.1 Prepare the Data: Tokenizing Text

In [25]:
mySent = 'This book is the best book on Python or M.L. I have ever laid eye upon.'
mySent.split()

['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M.L.',
 'I',
 'have',
 'ever',
 'laid',
 'eye',
 'upon.']

The split works reasonably well, but punctuation is treated as part of the words.

In [26]:
import re
regEx = re.compile(r'\W+')  # use + (not *) so the pattern never matches an empty string
listOfTokens = regEx.split(mySent)
listOfTokens

['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M',
 'L',
 'I',
 'have',
 'ever',
 'laid',
 'eye',
 'upon',
 '']

The result contains an empty string, which can be filtered out:

In [27]:
[tok.lower() for tok in listOfTokens if len(tok) > 0]

['this',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'python',
 'or',
 'm',
 'l',
 'i',
 'have',
 'ever',
 'laid',
 'eye',
 'upon']
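
An alternative sketch: instead of splitting on non-word characters and then filtering out empty strings, `re.findall` can collect the word characters directly. The function name `tokenize` and its `min_len` parameter are assumptions made for this illustration:

```python
import re

def tokenize(text, min_len=1):
    """Lower-cased runs of word characters with at least min_len characters."""
    return [tok.lower() for tok in re.findall(r'\w+', text) if len(tok) >= min_len]

sentence = 'This book is the best book on Python or M.L. I have ever laid eye upon.'
tokens = tokenize(sentence)
long_tokens = tokenize(sentence, min_len=3)
print(tokens)
print(long_tokens)
```

Setting `min_len=3` drops very short tokens such as 'm', 'l' and 'is', which is a common way to discard fragments of abbreviations.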

### 5.2 Test the Algorithm: Cross-Validation with Naive Bayes

In [28]:
def textParse(bigString):
    import re
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

In [53]:
def spamTest():
    docList = []
    classList = []
    fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/{0}.txt'.format(i)).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/{0}.txt'.format(i)).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    trainingSet = list(range(50))
    testSet = []
    for i in range(10):
        randIndex = int(np.random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is: {}'.format(errorCount/len(testSet)))

In [55]:
spamTest()

the error rate is: 0.1
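
The random train/test split inside `spamTest` can also be written with the standard `random` module. This is a sketch, not the notebook's implementation; `holdout_split` and its `seed` parameter are hypothetical:

```python
import random

def holdout_split(n_docs, n_test, seed=None):
    """Randomly pick n_test document indices for testing; the rest are for training."""
    rng = random.Random(seed)
    test = rng.sample(range(n_docs), n_test)  # sampled without replacement
    test_set = set(test)
    train = [i for i in range(n_docs) if i not in test_set]
    return train, test

train_idx, test_idx = holdout_split(50, 10, seed=0)
print(len(train_idx), len(test_idx))
```

Because the split is random, the error rate changes from run to run; averaging it over several splits gives a steadier estimate.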


## 6. Example: Using a Naive Bayes Classifier to Find Regional Tendencies in Personal Ads

### 6.1 Collect the Data: Importing RSS Feeds

In [56]:
import feedparser

In [63]:
ny = feedparser.parse('https://www.craigslist.org/about/best/all/index.rss')
len(ny['entries'])

25

In [66]:
ny['entries']

[{'author': 'robot@craigslist.org',
  'author_detail': {'email': 'robot@craigslist.org'},
  'authors': [{'email': 'robot@craigslist.org'}],
  'dc_source': 'https://www.craigslist.org/about/best/tpk/6581431260.html',
  'dc_type': 'text',
  'id': 'https://www.craigslist.org/about/best/tpk/6581431260.html',
  'link': 'https://www.craigslist.org/about/best/tpk/6581431260.html',
  'links': [{'href': 'https://www.craigslist.org/about/best/tpk/6581431260.html',
    'rel': 'alternate',
    'type': 'text/html'}],
  'rights': 'copyright 2018 craigslist',
  'rights_detail': {'base': 'https://www.craigslist.org/about/best/all/index.rss',
   'language': None,
   'type': 'text/plain',
   'value': 'copyright 2018 craigslist'},
  'summary': "1995 Honda CBR 900 RR, very unique.  Yes it is wrapped with wrinke crush velvet.  Slavage title and has a 96 motor in it.  Will need tires.  We've had the bike since 2001. Cross posted.<br>\n<br>",
  'summary_detail': {'base': 'https://www.craigslist.org/about/bes

In [64]:
def calcMostFreaq(vocabList, fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]
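
The same ranking can be sketched with `collections.Counter`, which counts all tokens in a single pass instead of calling `count` once per vocabulary word. `most_frequent` is a hypothetical name, and the sample data is made up:

```python
from collections import Counter

def most_frequent(vocab, full_text, top_n=30):
    """Return the top_n (word, count) pairs among words that are in vocab."""
    counts = Counter(tok for tok in full_text if tok in vocab)
    return counts.most_common(top_n)

vocab = {'the', 'dog', 'park'}
full_text = ['the', 'dog', 'the', 'park', 'the', 'dog', 'cat']
top = most_frequent(vocab, full_text, top_n=2)
print(top)
```

One difference from the function above: vocabulary words that never occur in `full_text` are left out here rather than listed with a count of 0.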

In [82]:
def localWords(feed1, feed0):
    import feedparser
    docList = []
    classList = []
    fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocaList = createVocabList(docList)
    top30Words = calcMostFreaq(vocaList, fullText)
    for pairW in top30Words:
        # Remove the most frequent words
        if pairW[0] in vocaList:
            vocaList.remove(pairW[0])
    trainingSet = list(range(2*minLen))
    testSet = []
    for i in range(5):
        randIndex = int(np.random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWordVecMN(vocaList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWordVecMN(vocaList, docList[docIndex])
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is: {}'.format(errorCount/len(testSet)))
    return vocaList, p0V, p1V

In [83]:
ny = feedparser.parse('https://newyork.craigslist.org/search/res?format=rss')
sf = feedparser.parse('https://sfbay.craigslist.org/search/apa?format=rss')
vocabList, pSF, pNY = localWords(ny, sf)

the error rate is: 0.0


### 6.2 Analyze the Data: Displaying Locally Used Words

In [92]:
def getTopWords(ny, sf):
    import operator
    vocabList, p0V, p1V = localWords(ny, sf)
    vocabList = list(vocabList)
    topNY = []
    topSF = []
    for i in range(len(p0V)):
        if p0V[i] > -6.0:
            topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0:
            topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print('SF**SF**SF**SF**SF**SF**SF**SF**SF**SF')
    for item in sortedSF:
        print(item[0])
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print('NY**NY**NY**NY**NY**NY**NY**NY**NY**NY')
    for item in sortedNY:
        print(item[0])

In [93]:
getTopWords(ny, sf)

the error rate is: 0.0
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF
water
garbage
bath
bedrooms
5
location
floor
great
newly
hardwood
floors
family
refrigerator
building
bathroom
this
park
updated
house
room
renovated
full
rent
remodeled
stove
access
over
walk
space
on
single
3
sq
shower
street
10
located
high
spacious
ft
tub
car
features
dishwasher
easy
apartment
gas
living
appliances
recently
neighborhood
garage
group
story
pays
meder
end
range
large
washer
sf
280
available
flooring
details
owner
storage
gate
baths
first
entertainment
2nd
disposal
ca
parking
grocery
stores
nine
private
has
requirements
beautiful
closet
copy
near
schools
distance
w
incl
eat
included
garden
granite
all
month
pane
walking
text
steam
heat
at
furnished
apartments
electric
appliance
dryer
tops
cdl
kitchens
lots
now
windows
bathrooms
charming
101
downtown
freeways
site
dan
verifiable
your
bay
ready
luxury
appointment
level
dinner
small
areas
oven
850
nobel
me
modern
500
combo
winning
call
creek
c
year
2018
flat
h

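> For classification, using probabilities is sometimes more effective than using hard rules.

> The assumption that features are independent reduces the amount of data naive Bayes needs, and the classifier can still achieve good results.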

> Taking logarithms of probabilities, the bag-of-words model, and removing stop words are all commonly used techniques.
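
As a small sketch of stop-word removal: the stop-word list below is a tiny, made-up sample for illustration only, and `remove_stop_words` is a hypothetical helper. Real applications typically use a much longer list.

```python
# Hypothetical, deliberately short stop-word list
STOP_WORDS = {'the', 'is', 'on', 'or', 'i', 'have', 'this'}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = ['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python']
filtered = remove_stop_words(tokens)
print(filtered)
```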