# Multinomial Bayes Classifiers for Text Classification

Bayes classifiers are also called Naive Bayes or Independence Bayes classifiers.
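Both names refer to the same simplifying assumption: given the class, the words of a document are treated as independent of each other. A document is then scored per class as the log prior plus the sum of the log word likelihoods. The following is a minimal sketch on a made-up count matrix (the data and all variable names are assumptions for illustration, not part of the notebook). It computes the smoothed likelihoods by hand and compares them with `MultinomialNB`:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: 4 documents x 3 vocabulary words (hypothetical data).
X = np.array([
    [2, 1, 0],
    [3, 0, 0],
    [0, 1, 2],
    [0, 0, 3],
])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing, the MultinomialNB default

# Per-class word counts and smoothed likelihoods P(word | class).
class_counts = np.array([X[y == c].sum(axis=0) for c in (0, 1)])
likelihood = (class_counts + alpha) / (
    class_counts.sum(axis=1, keepdims=True) + alpha * X.shape[1]
)
prior = np.bincount(y) / len(y)

# Score a new document: log P(class) + sum over words of count * log P(word | class).
x_new = np.array([1, 1, 0])
log_scores = np.log(prior) + x_new @ np.log(likelihood).T
manual_prediction = int(np.argmax(log_scores))

# scikit-learn should arrive at the same likelihoods and the same prediction.
clf = MultinomialNB(alpha=alpha).fit(X, y)
sklearn_prediction = int(clf.predict(x_new.reshape(1, -1))[0])
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.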

In [1]:
import utils_bayes
from sklearn.naive_bayes import MultinomialNB
import numpy as np

In [2]:
available_themes = list(utils_bayes.newsgroups_categories.keys())
available_themes

['Religion2',
 'Computer',
 'Computer2',
 'Computer3',
 'Computer4',
 'Computer5',
 'Verkauf',
 'Autos',
 'Motorräder',
 'Baseball',
 'Hockey',
 'Kryptographie',
 'Elektronik',
 'Medizin',
 'Space',
 'Religion',
 'Waffen (US)',
 'Mittlerer Westen',
 'Politik']

In [3]:
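# Note: `themen` is defined here but not used below;
# the model is trained on all available topics.
themen = ["Space", "Hockey", "Baseball", "Computer4", "Politik"]

model, predict_category = utils_bayes.get_multinomial_nb_model(
    themen=available_themes
)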

`model` is a pipeline consisting of a `TfidfVectorizer` followed by a `MultinomialNB` Bayes classifier.

In [4]:
model

Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('multinomialnb',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)
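The helper `utils_bayes.get_multinomial_nb_model` is not shown in this notebook, so how it builds the pipeline is unknown. The step names `'tfidfvectorizer'` and `'multinomialnb'` in the output above match the names that `make_pipeline` generates automatically, so a comparable pipeline could plausibly be assembled as in this sketch. The miniature corpus and the name `toy_model` are assumptions made for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature corpus; the notebook trains on newsgroups posts instead.
texts = [
    "The rocket reached orbit around Mars",
    "NASA plans another rocket launch",
    "The hockey team scored in overtime",
    "A hockey goalie stopped every shot",
]
topics = ["Space", "Space", "Hockey", "Hockey"]

# make_pipeline names each step after its lowercased class name.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(texts, topics)

step_names = [name for name, _ in toy_model.steps]
prediction = toy_model.predict(["A rocket on its way to Mars"])[0]
```

Bundling the vectorizer and the classifier in one pipeline means raw strings can be passed straight to `fit` and `predict`.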

In [5]:
predict_category("Launching a rocket to Mars in 2035.")

'Space'

In [6]:
predict_category("Team A won the championship")

'Hockey'

In [7]:
predict_category("I bought a new MacBook Pro but it's actually slower than my Lenovo laptop!")

'Computer3'

In [8]:
predict_category("I broke my keyboard. Accidentally spilled coffee all over it")

'Computer5'

In [9]:
predict_category("I bought a new computer but it's actually slower than my Lenovo laptop!")

'Computer3'

In [10]:
predict_category("new computer is broken")

'Verkauf'

In [11]:
predict_category("The team rocket won the MacBook-Price, that is like a MarsBall")

'Hockey'

In [12]:
predict_category("hungry")

'Medizin'

In [13]:
predict_category("I bought a new mouse and keyboard and it's broken")

'Computer4'

In [14]:
predict_category("The mouse got the MacBook-Price, that is like a MarsBall")

'Computer4'
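The examples above always return a label, even for an input as short as "hungry", because a classifier's `predict` picks the most probable class no matter how weak the evidence is. Assuming `predict_category` wraps the pipeline's `predict` (the helper is not shown), the class probabilities from `predict_proba` are one way to see how decisive a prediction is. A minimal sketch with a hypothetical toy pipeline:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature corpus with two balanced classes.
texts = [
    "The rocket reached orbit around Mars",
    "NASA plans another rocket launch",
    "The hockey team scored in overtime",
    "A hockey goalie stopped every shot",
]
topics = ["Space", "Space", "Hockey", "Hockey"]
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, topics)

classes = list(toy_model.named_steps["multinomialnb"].classes_)

# "hungry" is not in the toy vocabulary, so its feature vector is all zeros
# and the probabilities fall back to the class priors.
unseen_proba = toy_model.predict_proba(["hungry"])[0]

# "rocket" only occurs in Space documents, so Space gets the larger share.
known_proba = toy_model.predict_proba(["rocket"])[0]
```

In this toy setup an unknown word yields the class priors, so the returned label carries no information about the text itself.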

### 2. Scikit-Learn's Newsgroups Dataset

In [15]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="train")

In [16]:
print(newsgroups["DESCR"])

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [17]:
data = newsgroups["data"]

In [18]:
len(data)

11314

In [19]:
print(data[2])

From: twillis@ec.ecn.purdue.edu (Thomas E Willis)
Subject: PB questions...
Organization: Purdue University Engineering Computer Network
Distribution: usa
Lines: 36

well folks, my mac plus finally gave up the ghost this weekend after
starting life as a 512k way back in 1985.  sooo, i'm in the market for a
new machine a bit sooner than i intended to be...

i'm looking into picking up a powerbook 160 or maybe 180 and have a bunch
of questions that (hopefully) somebody can answer:

* does anybody know any dirt on when the next round of powerbook
introductions are expected?  i'd heard the 185c was supposed to make an
appearence "this summer" but haven't heard anymore on it - and since i
don't have access to macleak, i was wondering if anybody out there had
more info...

* has anybody heard rumors about price drops to the powerbook line like the
ones the duo's just went through recently?

* what's the impression of the display on the 180?  i could probably swing
a 180 if i got the 80Mb disk

In [20]:
newsgroups["target_names"]

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [21]:
labels = newsgroups["target"]

### 3. Vectorizing text

In [22]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [23]:
labels

array([7, 4, 4, ..., 3, 1, 8])
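The integers in `labels` are indices into `newsgroups["target_names"]`. As a small sketch, the first three labels from the output above can be mapped back to their newsgroup names (the list is copied from the earlier output; `first_labels` and `first_topics` are names introduced here):

```python
import numpy as np

# Copied from the output of newsgroups["target_names"].
target_names = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]

# First three entries of the `labels` array.
first_labels = np.array([7, 4, 4])
first_topics = [target_names[i] for i in first_labels]
```

The third post, `data[2]`, is the PowerBook question printed earlier, which fits its label `comp.sys.mac.hardware`.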

In [24]:
CountVectorizer?

In [25]:
count_vectorizer = CountVectorizer(stop_words="english")
count_vectorizer.fit(data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [26]:
# the size of the vocabulary equals the number of distinct words in the whole training set (after stop-word removal)
len(count_vectorizer.vocabulary_)

129796
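To make the role of `vocabulary_` concrete, here is a minimal sketch on a hypothetical two-document corpus (the texts and the `toy_` names are assumptions for illustration). Each token that survives stop-word removal gets one column index, and `transform` counts occurrences per document in those columns:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus standing in for the newsgroups posts.
toy_docs = [
    "The rocket launch to Mars",
    "The hockey team and the rocket",
]
toy_vectorizer = CountVectorizer(stop_words="english")
toy_vectorizer.fit(toy_docs)

# vocabulary_ maps each kept token to its column index (alphabetical order).
toy_vocabulary = dict(toy_vectorizer.vocabulary_)

# One row per document, one column per vocabulary entry.
toy_counts = toy_vectorizer.transform(toy_docs).toarray()
```

Stop words such as "the", "to" and "and" are dropped, and all tokens are lowercased before counting.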

In [27]:
# CountVectorizer is a transformer; .todense() materializes the full matrix in memory
vectorized_data = count_vectorizer.transform(data).todense()

In [28]:
vectorized_data.shape

(11314, 129796)

In [29]:
vectorized_data

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [30]:
len(np.nonzero(vectorized_data[0, :])[1])

55
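The dense matrix above is almost entirely zeros: the first post uses only 55 of the 129796 vocabulary entries. At shape `(11314, 129796)` with `int64` entries, the dense form needs roughly 11.7 GB, while `transform` already returns a SciPy sparse matrix that stores only the non-zero counts. The following sketch, on a hypothetical toy corpus, reads the non-zero counts directly from the sparse result without calling `.todense()`:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus standing in for the newsgroups posts.
toy_docs = [
    "The rocket launch to Mars",
    "The hockey team and the rocket",
]
sparse_counts = CountVectorizer(stop_words="english").fit_transform(toy_docs)

# Non-zero entries per document, read directly from the sparse matrix.
nonzeros_per_doc = sparse_counts.getnnz(axis=1)

# Same number for the first document via the dense route, for comparison.
dense_nonzeros_doc0 = np.count_nonzero(sparse_counts[0].toarray())

# Rough size of the notebook's full dense int64 matrix, in GB.
dense_size_gb = 11314 * 129796 * 8 / 1e9
```

`MultinomialNB` accepts sparse input, so the dense conversion is only needed for inspection.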

In [31]:
# transform again, this time keeping the sparse matrix for training
vectorized_data = count_vectorizer.transform(data)

bayes_classifier = MultinomialNB()

bayes_classifier.fit(vectorized_data, labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
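Once the classifier is fitted, new text has to pass through the same fitted vectorizer before calling `predict`, and the predicted integer is then looked up in the list of target names. This is a minimal sketch on a hypothetical toy corpus; with the notebook's objects the same steps would use `count_vectorizer`, `bayes_classifier` and `newsgroups["target_names"]`:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data with two classes.
toy_docs = [
    "rocket launch orbit mars",
    "nasa rocket orbit",
    "hockey team goal ice",
    "team won hockey game",
]
toy_labels = [0, 0, 1, 1]
toy_target_names = ["sci.space", "rec.sport.hockey"]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

# New text must be transformed with the already fitted vectorizer (no refit).
new_counts = toy_vectorizer.transform(["A rocket to Mars"])
predicted_label = int(toy_classifier.predict(new_counts)[0])
predicted_topic = toy_target_names[predicted_label]
```

Calling `fit` on the vectorizer again at prediction time would change the column layout, and the classifier's learned weights would no longer line up with the features.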