Here I show how to use Naive Bayes for multi-class classification/discrimination

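Import the class sklearn.naive_bayes.MultinomialNB, the multinomial Naive Bayes classifier. It is not a logistic regression; it is a Naive Bayes model for multi-class problems on count-like features such as word counts.

If your features are binary/boolean (for example, whether a word occurs or not), BernoulliNB is the variant designed for that case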

I will also compare the accuracy of BOW, TF-IDF, and hashing as vectorizing techniques
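
As a rough illustration of the two classifiers mentioned above, the sketch below fits both on a tiny made-up count matrix. The data, labels, and variable names are assumptions for demonstration only and are not part of the dataset used in this notebook.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Made-up word-count matrix: rows are documents, columns are words.
X = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 0],
])
y = np.array([0, 0, 1, 1])

# MultinomialNB models the counts themselves.
multinomial = MultinomialNB().fit(X, y)

# BernoulliNB binarizes every feature first (count > 0 becomes 1),
# so it only looks at whether a word is present or absent.
bernoulli = BernoulliNB(binarize=0.0).fit(X, y)

new_doc = np.array([[5, 0, 0]])
print(multinomial.predict(new_doc))
print(bernoulli.predict(new_doc))
```

Both models accept any number of classes; the difference lies in how the features are modelled, not in how many labels there are.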

In [1]:
# to get f1 score
from sklearn import metrics
import numpy as np
import sklearn.datasets
import re
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split



Define some helper functions for preprocessing

In [2]:
# keep only letters, digits and spaces, then collapse repeated spaces
def clearstring(string):
    string = re.sub('[^A-Za-z0-9 ]+', '', string)
    string = string.split(' ')
    string = filter(None, string)
    string = [y.strip() for y in string]
    string = ' '.join(string)
    return string

# sklearn.datasets.load_files reads each document as a single element,
# so we split every document into lines and repeat its label for each line
def separate_dataset(trainset):
    datastring = []
    datatarget = []
    for i in range(len(trainset.data)):
        data_ = trainset.data[i].split('\n')
        # list() is needed on Python 3, where filter() returns an iterator
        data_ = list(filter(None, data_))
        for n in range(len(data_)):
            data_[n] = clearstring(data_[n])
        datastring += data_
        for n in range(len(data_)):
            datatarget.append(trainset.target[i])
    return datastring, datatarget
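
To make the effect of the preprocessing concrete, here is a small sketch that applies the same cleaning steps to a made-up string and a made-up two-line document. The sample text and the label value 3 are assumptions for illustration.

```python
import re

def clearstring(string):
    # Same steps as the helper above: drop everything except letters,
    # digits and spaces, then collapse repeated spaces.
    string = re.sub('[^A-Za-z0-9 ]+', '', string)
    tokens = [token.strip() for token in string.split(' ') if token]
    return ' '.join(tokens)

sample = 'Hello,   World!  #123'
cleaned = clearstring(sample)
print(cleaned)

# One document with two non-empty lines becomes two training strings
# that share the document's label.
document = 'first line!\n\nsecond, line\n'
label = 3
lines = [clearstring(line) for line in document.split('\n') if line]
labels = [label] * len(lines)
print(lines, labels)
```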

I included 6 classes as sub-folders of local/
1. adidas (wear)
2. apple (electronic)
3. hungry (status)
4. kerajaan (government related)
5. nike (wear)
6. pembangkang (opposition related)
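
load_files takes each sub-folder name as a class label and each file inside it as one document. The sketch below builds a throwaway directory with two of the class names to show that layout; the file name sample.txt and the file contents are made up for illustration.

```python
import os
import tempfile
from sklearn.datasets import load_files

# Made-up contents; one file per category folder.
samples = {
    'adidas': 'new adidas shoes\nadidas running jacket\n',
    'apple': 'apple released a new phone\n',
}

with tempfile.TemporaryDirectory() as root:
    for category, text in samples.items():
        os.makedirs(os.path.join(root, category))
        path = os.path.join(root, category, 'sample.txt')
        with open(path, 'w', encoding='utf-8') as handle:
            handle.write(text)
    # File contents are read during this call, so the directory
    # can be deleted afterwards.
    toy = load_files(container_path=root, encoding='UTF-8')

print(toy.target_names)  # folder names, sorted
print(len(toy.data))     # one entry per file, not per line
```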

In [3]:
# change the encoding to match your files
trainset = sklearn.datasets.load_files(container_path = 'local', encoding = 'UTF-8')
trainset.data, trainset.target = separate_dataset(trainset)
print (trainset.target_names)
print (len(trainset.data))
print (len(trainset.target))

['adidas', 'apple', 'hungry', 'kerajaan', 'nike', 'pembangkang']
25292
25292


So we have 25292 strings and 6 classes

It is time to change them into vector representations

In [4]:
# bag-of-words
bow = CountVectorizer().fit_transform(trainset.data)

# tf-idf, computed from the BOW counts
tfidf = TfidfTransformer().fit_transform(bow)

# hashing with the default n_features; MultinomialNB needs non-negative features,
# so disable the alternating sign (newer scikit-learn releases dropped non_negative = True)
hashing = HashingVectorizer(alternate_sign = False).fit_transform(trainset.data)
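
To see what the three representations look like, the following sketch vectorizes a made-up three-document corpus. The documents and the small n_features value are assumptions chosen to keep the output readable; the cell above uses the default n_features.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfTransformer)

# Made-up corpus for illustration.
docs = ['nike shoes nike', 'apple phone', 'hungry for food']

count_vectorizer = CountVectorizer()
bow_toy = count_vectorizer.fit_transform(docs)          # raw word counts
tfidf_toy = TfidfTransformer().fit_transform(bow_toy)   # re-weighted counts
hashing_toy = HashingVectorizer(
    n_features=2 ** 10, alternate_sign=False).fit_transform(docs)

print(bow_toy.shape)    # (documents, vocabulary size)
print(tfidf_toy.shape)  # same shape as the BOW matrix
print(hashing_toy.shape)  # (documents, n_features), no vocabulary stored
print(hashing_toy.min() >= 0)  # non-negative, as MultinomialNB requires
```

The hashing matrix has a fixed width regardless of the vocabulary, at the cost of possible collisions between words.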



Feed Naive Bayes with BOW features

First split the data into a train set (80% of our dataset) and a validation set (20% of our dataset)

In [5]:
train_X, test_X, train_Y, test_Y = train_test_split(bow, trainset.target, test_size = 0.2)

bayes_multinomial = MultinomialNB().fit(train_X, train_Y)
predicted = bayes_multinomial.predict(test_X)
print('accuracy validation set: ', np.mean(predicted == test_Y))

# print scores
print(metrics.classification_report(test_Y, predicted, target_names = trainset.target_names))

accuracy validation set:  0.847598339593
             precision    recall  f1-score   support

     adidas       0.90      0.77      0.83       289
      apple       0.79      0.90      0.84       460
     hungry       0.86      0.95      0.90      1074
   kerajaan       0.85      0.82      0.84      1407
       nike       0.90      0.78      0.84       330
pembangkang       0.84      0.82      0.83      1499

avg / total       0.85      0.85      0.85      5059
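
The accuracy printed above is the fraction of matching labels. The sketch below, using made-up labels, checks that this manual computation agrees with metrics.accuracy_score and shows how the per-class recall in the report is obtained.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels for three classes.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 0])

# Fraction of positions where prediction and truth agree.
manual_accuracy = np.mean(y_pred == y_true)
library_accuracy = metrics.accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)

# Recall per class: correctly predicted samples of a class
# divided by all true samples of that class.
per_class_recall = metrics.recall_score(y_true, y_pred, average=None)
print(per_class_recall)
```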



Feed Naive Bayes with TF-IDF features

First split the data into a train set (80% of our dataset) and a validation set (20% of our dataset)

In [6]:
train_X, test_X, train_Y, test_Y = train_test_split(tfidf, trainset.target, test_size = 0.2)

bayes_multinomial = MultinomialNB().fit(train_X, train_Y)
predicted = bayes_multinomial.predict(test_X)
print('accuracy validation set: ', np.mean(predicted == test_Y))

# print scores
print(metrics.classification_report(test_Y, predicted, target_names = trainset.target_names))

accuracy validation set:  0.800553469065
             precision    recall  f1-score   support

     adidas       0.94      0.53      0.68       320
      apple       0.98      0.58      0.73       456
     hungry       0.82      0.92      0.86      1087
   kerajaan       0.86      0.82      0.84      1372
       nike       0.91      0.59      0.72       330
pembangkang       0.70      0.87      0.77      1494

avg / total       0.82      0.80      0.80      5059



Feed Naive Bayes with hashing features

First split the data into a train set (80% of our dataset) and a validation set (20% of our dataset)

In [7]:
train_X, test_X, train_Y, test_Y = train_test_split(hashing, trainset.target, test_size = 0.2)

bayes_multinomial = MultinomialNB().fit(train_X, train_Y)
predicted = bayes_multinomial.predict(test_X)
print('accuracy validation set: ', np.mean(predicted == test_Y))

# print scores
print(metrics.classification_report(test_Y, predicted, target_names = trainset.target_names))

accuracy validation set:  0.793239770706
             precision    recall  f1-score   support

     adidas       0.95      0.58      0.72       286
      apple       1.00      0.48      0.65       461
     hungry       0.92      0.91      0.91      1077
   kerajaan       0.86      0.80      0.83      1366
       nike       0.98      0.54      0.70       345
pembangkang       0.64      0.89      0.75      1524

avg / total       0.83      0.79      0.79      5059
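
The three runs above each draw a different random split, so part of the accuracy gap may come from the split rather than from the vectorizer. One possible way to compare them on identical data is sketched below: split the raw strings once with a fixed random_state and wrap each vectorizer in a pipeline, so it is fitted on the train set only. The toy corpus, the labels, and the dictionary names are assumptions for illustration; this is not the procedure that produced the numbers above.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfTransformer)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus; the notebook itself uses the strings loaded from local/.
texts = [
    'new adidas running shoes', 'adidas jacket on sale', 'adidas football boots',
    'adidas sneakers restock', 'adidas store opening', 'adidas shirt discount',
    'apple releases new phone', 'apple laptop review', 'apple watch update',
    'apple tablet price', 'apple store event', 'apple phone battery',
]
labels = [0] * 6 + [1] * 6

# One split, shared by all three vectorizers.
train_text, test_text, train_y, test_y = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

pipelines = {
    'bow': make_pipeline(CountVectorizer(), MultinomialNB()),
    'tfidf': make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB()),
    'hashing': make_pipeline(HashingVectorizer(alternate_sign=False), MultinomialNB()),
}

scores = {}
for name, pipeline in pipelines.items():
    pipeline.fit(train_text, train_y)                  # vectorizer sees train data only
    scores[name] = pipeline.score(test_text, test_y)   # accuracy on the held-out split
    print(name, scores[name])
```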

