In [16]:
###### Config #####
import sys, os, platform
if os.path.isdir("ds-assets"):
  !cd ds-assets && git pull
else:
  !git clone https://github.com/lutzhamel/ds-assets.git
colab = 'google.colab' in sys.modules
system = platform.system() # "Windows", "Linux", "Darwin"
home = "ds-assets/assets/"
sys.path.append(home)

Already up to date.


In [17]:
# notebook level imports
import pandas as pd
import dsutils
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV


# NLP & ML: Classification

We saw that we can convert text documents into a ‘vector model’ (bag-of-words).

We showed that the vector model allows us to perform mathematical analysis on documents such as *which documents are similar to each other?*
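As a reminder of what such an analysis looks like, here is a minimal sketch of cosine similarity on bag-of-words vectors. The three toy documents and the helper names `bag_of_words` and `cosine_similarity` are made up for illustration and are not part of `dsutils`.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-zA-Z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy documents
docs = {
    "space1": "the rocket launch put the satellite in orbit",
    "space2": "the satellite reached orbit after the launch",
    "politics1": "the senate passed the tax bill",
}
vectors = {name: bag_of_words(text) for name, text in docs.items()}

sim_space = cosine_similarity(vectors["space1"], vectors["space2"])
sim_mixed = cosine_similarity(vectors["space1"], vectors["politics1"])
print(f"space1 vs space2:    {sim_space:.2f}")
print(f"space1 vs politics1: {sim_mixed:.2f}")
```

The two space documents share more terms and therefore score higher than the space/politics pair. The remaining similarity in the mixed pair comes from the word "the", which is one reason stop words are usually removed.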

Next question: can we construct *machine learning classification models* on document collections using the vector model? -- **Yes!**



Consider again our news article data set. We would like to construct a classifier that can correctly classify political and science documents.


# Data

Preprocess our data into a docterm matrix and set up train and test sets.

We are using the noheaders version because headers are too predictive.
If we use the headers we will always get close to 100% accurate models.


In [18]:
# get the newsgroup database

# NOTE: news article headers are extremely predictive so we skip them
#       here.  Hint: try the exercise below with the headers by
#       loading newsgroups.csv instead.

# newsgroups = pd.read_csv(home+"newsgroups.csv")
newsgroups = pd.read_csv(home+"newsgroups-noheaders.csv")
newsgroups.head(n=10)

Unnamed: 0,text,label
0,\nIn billions of dollars (%GNP):\nyear GNP ...,space
1,ajteel@dendrite.cs.Colorado.EDU (A.J. Teel) w...,space
2,\nMy opinion is this: In a society whose econ...,space
3,"Ahhh, remember the days of Yesterday? When we...",space
4,"\n""...a la Chrysler""?? Okay kids, to the near...",space
5,"\n As for advertising -- sure, why not? A N...",politics
6,"\n What, pray tell, does this mean? Just who ...",space
7,\nWhere does the shadow come from? There's no...,politics
8,^^^^^^^^^...,politics
9,"#Yet, when a law was proposed for Virginia tha...",space


In [19]:
newsgroups.shape

(1038, 2)

Constructing our docterm matrix.

In [20]:
docterm = dsutils.docterm_matrix(newsgroups['text'], 
                                 min_df=2,
                                 token_pattern="[a-zA-Z]+",
                                 stop_words="english",
                                 stem=True)   
docterm

Unnamed: 0,aa,abandon,abbey,abc,abil,abl,aboard,abolish,abort,abroad,...,yugoslavia,yup,z,zealand,zenit,zero,zeta,zip,zone,zoo
doc0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
doc1033,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc1034,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc1035,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc1036,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
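To make the structure of this matrix concrete, here is a rough sketch of how a document-term count matrix with a `min_df` cutoff could be built by hand. It is only an approximation of the idea: the function `docterm_counts` and the toy documents are hypothetical, and the sketch skips the stop-word removal and stemming that the `dsutils.docterm_matrix` call above requests.

```python
import re
from collections import Counter

def docterm_counts(docs, min_df=2, token_pattern=r"[a-zA-Z]+"):
    """Return (vocab, rows) where rows[i][j] counts term vocab[j] in docs[i].

    Terms that occur in fewer than min_df documents are dropped.
    """
    tokenized = [re.findall(token_pattern, d.lower()) for d in docs]

    # document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vocab = sorted(t for t, df in doc_freq.items() if df >= min_df)

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[t] for t in vocab])
    return vocab, rows

# Hypothetical toy documents
toy_docs = [
    "Orbit, orbit: the shuttle reached orbit.",
    "The shuttle launch was delayed.",
    "Congress delayed the vote.",
]
vocab, rows = docterm_counts(toy_docs)
print(vocab)
for row in rows:
    print(row)
```

Each row is one document and each column one term, just like the table above. Note how `min_df=2` removes "orbit" even though it appears three times, because it occurs in only one document.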


To compare the models consistently, we split off a single test set on which every model is evaluated.

In [21]:
# set up train and test sets
X_train, X_test, y_train, y_test = \
  train_test_split(docterm,
                   newsgroups['label'],
                   train_size=0.8,
                   test_size=0.2,
                   random_state=2)

# Decision Trees

In [22]:
# tree model

# model object
model = DecisionTreeClassifier()

# grid search
param_grid = {'max_depth': list(range(1,31))}
grid = GridSearchCV(model, param_grid, cv=3).fit(X_train, y_train)

print("Grid Search: best parameters: {}".format(grid.best_params_))
best_model = grid.best_estimator_

Grid Search: best parameters: {'max_depth': 27}


Evaluate the model.

In [23]:
# accuracy
dsutils.acc_score(best_model, X_test, y_test,as_string=True)


'Accuracy: 0.74 (0.68, 0.80)'

In [24]:
# confusion matrix
dsutils.confusion_matrix(best_model, X_test, y_test)

Unnamed: 0,politics,space
politics,88,21
space,34,65
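The accuracy can be recovered directly from the confusion matrix, since the diagonal holds the correctly classified test documents. The sketch below also computes a normal-approximation 95% interval for a proportion. That this is the method `dsutils.acc_score` uses is an assumption, although with these counts it rounds to the same numbers as the reported output.

```python
import math

# Confusion matrix counts copied from the decision tree output above.
# Which axis holds the true labels does not matter for accuracy:
# the diagonal holds the correct predictions either way.
matrix = [[88, 21],
          [34, 65]]

correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
accuracy = correct / total

# Normal-approximation 95% interval for a proportion (assumed method).
half_width = 1.96 * math.sqrt(accuracy * (1 - accuracy) / total)
lower, upper = accuracy - half_width, accuracy + half_width

print(f"Accuracy: {accuracy:.2f} ({lower:.2f}, {upper:.2f})")
```

The interval is fairly wide because the test set holds only 208 documents.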


# KNN

Now let's apply our KNN algorithm (k-nearest neighbors). Since documents are considered points in an n-dimensional space, KNN seems well suited for this problem.
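As a rough illustration of the idea, the following sketch classifies a document by majority vote among its nearest training points. The term counts, the column meanings, and the helper names are hypothetical, and Euclidean distance is an assumption made for the sketch.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Majority label among the k training points closest to query."""
    order = sorted(range(len(train_X)),
                   key=lambda i: euclidean(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical term counts for the columns: orbit, launch, senate, vote
train_X = [
    [3, 1, 0, 0],
    [2, 2, 0, 0],
    [1, 3, 0, 1],
    [0, 0, 2, 3],
    [0, 1, 3, 1],
    [0, 0, 1, 2],
]
train_y = ["space", "space", "space", "politics", "politics", "politics"]

prediction = knn_predict(train_X, train_y, [2, 1, 0, 0], k=3)
print(prediction)
```

In this four-dimensional toy space the neighbors are easy to tell apart. The real docterm matrix has thousands of mostly zero columns, and distances can behave quite differently there.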

In [25]:

# KNN
model = KNeighborsClassifier()

# grid search
param_grid = {'n_neighbors': list(range(1,11))}
grid = GridSearchCV(model, param_grid, cv=3).fit(X_train, y_train)

print("Grid Search: best parameters: {}".format(grid.best_params_))
best_model = grid.best_estimator_

Grid Search: best parameters: {'n_neighbors': 2}


In [26]:
# accuracy
dsutils.acc_score(best_model, X_test, y_test,as_string=True)


'Accuracy: 0.53 (0.47, 0.60)'

In [27]:
# confusion matrix
dsutils.confusion_matrix(best_model, X_test, y_test)

Unnamed: 0,politics,space
politics,19,90
space,7,92


# Naive Bayes (NB)

* “Standard” model for text processing
* Fast to train, has no problems with very high dimensional data
* NB is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors.
* In simple terms, a NB classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
* For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties are treated as contributing independently to the probability that this fruit is an apple. That is why the method is known as ‘Naive’.


## The Algorithm


Bayes' theorem provides a way of calculating the probability of label $c$ given feature $x$,
<br>
<br>
$$
P(c|x) = \frac{P(x|c)P(c)}{P(x)}
$$

where
  * $P(c|x)$ is the probability of label $c$ given the feature $x$.
  * $P(c)$ is the probability of label $c$ .
  * $P(x|c)$ is the probability of the feature $x$ given label $c$.
  * $P(x)$ is the probability of feature $x$.

What is remarkable about the Naive Bayes algorithm is that all these probabilities can be computed by just counting values in the training data.
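A small worked example of this counting, using a made-up data set of ten labeled documents and a single feature (whether the document contains the word "orbit"):

```python
# Hypothetical training data: (label, does the document contain "orbit"?)
training = [
    ("space", True), ("space", True), ("space", True), ("space", False),
    ("politics", True), ("politics", False), ("politics", False),
    ("politics", False), ("politics", False), ("politics", False),
]

n = len(training)
n_space = sum(1 for label, _ in training if label == "space")
n_orbit = sum(1 for _, has_orbit in training if has_orbit)
n_space_and_orbit = sum(1 for label, has_orbit in training
                        if label == "space" and has_orbit)

p_c = n_space / n                          # P(c):   fraction of space documents
p_x = n_orbit / n                          # P(x):   fraction containing "orbit"
p_x_given_c = n_space_and_orbit / n_space  # P(x|c): "orbit" among space documents

p_c_given_x = p_x_given_c * p_c / p_x      # Bayes' theorem
print(f"P(space | orbit) = {p_c_given_x:.2f}")
```

The result agrees with counting directly: three of the four documents that contain "orbit" are labeled space.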

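In order to predict a label $c$ based on multiple features $x_1$ through $x_n$ we use the independence assumption, which gives the formula,

$$
P(c|x_1,x_2,\ldots,x_n) \propto P(x_1|c)\times P(x_2|c)\times\ldots \times P(x_n|c)\times P(c)
$$

The denominator $P(x_1,x_2,\ldots,x_n)$ is the same for every label, so we can drop it and simply predict the label $c$ with the largest value on the right-hand side.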

For example, take our tennis playing data

[Source](https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained)


###  Example

**Think of Naive Bayes like a really organized sorting system**

Imagine you're building a spam filter for emails. You want to sort incoming messages into two buckets: **"Spam"** and **"Not Spam"**. Naive Bayes is like a very fast and simple sorting agent that looks at the **words** in each message to decide which bucket it should go into.

Here’s how it works, step by step.


**1. Learn from Examples**

You first show the agent a bunch of emails that are already labeled as spam or not spam. The agent reads all the emails and **counts** how often each word shows up in spam emails versus not spam emails.

For example, it might notice:
- The word *“free”* shows up a lot in spam.
- The word *“meeting”* shows up more in not-spam.

It stores these counts like a memory of what words are common in each category.


**2. Make Decisions Based on Word Presence**

When a new email comes in, the agent looks at each word in the email and asks:
- “Have I seen this word more often in spam or in not-spam messages?”

It gives a kind of "score" to both categories (spam and not spam) based on what it remembers. The category with the higher score wins, and the email gets sorted into that bucket.


**3. Why It’s Called “Naive”**

The agent assumes that **each word works independently**. That means it doesn’t care about the order of the words or how they relate to each other. It just combines the scores from each word as if they’re separate votes (multiplying the per-word probabilities, or equivalently adding their logarithms). This “naive” assumption makes things fast and simple.


**Summary**

Naive Bayes is like a fast, word-counting sorting machine that decides what category something belongs to based on past examples. It learns which words tend to show up in each category, and then classifies new items by seeing which category the words in them are more similar to.
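The three steps above can be sketched in a few lines of plain Python. This is a simplified multinomial Naive Bayes with add-one (Laplace) smoothing, not the scikit-learn implementation. The four training emails and the function names `tokenize`, `train`, and `predict` are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-zA-Z]+", text.lower())

def train(examples, alpha=1.0):
    """Step 1: count words per label. alpha is the smoothing constant."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab, alpha

def predict(model, text):
    """Steps 2 and 3: score each label word by word, highest score wins."""
    word_counts, label_counts, vocab, alpha = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # log P(c) plus the sum of log P(word|c); logs avoid underflow
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled emails
examples = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("notes from the project meeting", "not spam"),
]
model = train(examples)
print(predict(model, "claim your free prize"))
print(predict(model, "agenda for the meeting"))
```

Smoothing matters here: without `alpha`, a single word that never occurred with a label would drive that label's probability to zero.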



## Text Classification

Let’s take our text classification problem and use a Naive Bayes classifier on it.

In [28]:

# Naive Bayes
model = MultinomialNB().fit(X_train, y_train)

# NOTE: we keep the MultinomialNB defaults here (e.g. the smoothing
#       parameter alpha=1.0) - no searching over parameter space!

In [29]:
# accuracy
dsutils.acc_score(model, X_test, y_test,as_string=True)


'Accuracy: 0.96 (0.93, 0.98)'

In [30]:
# confusion matrix
dsutils.confusion_matrix(model, X_test, y_test)

Unnamed: 0,politics,space
politics,102,7
space,2,97


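**Observation**: Naive Bayes trains very fast and has a higher accuracy than DT or KNN. Its accuracy interval (0.93, 0.98) does not overlap with the intervals for DT (0.68, 0.80) or KNN (0.47, 0.60), so the difference in accuracy is statistically significant!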
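One simple way to check this claim is to test whether the reported intervals overlap. The sketch below uses the numbers printed by the accuracy cells above. The helper `overlaps` is illustrative, and non-overlap is a conservative check rather than a formal hypothesis test.

```python
# Reported (accuracy, lower, upper), copied from the outputs above.
results = {
    "decision tree": (0.74, 0.68, 0.80),
    "knn":           (0.53, 0.47, 0.60),
    "naive bayes":   (0.96, 0.93, 0.98),
}

def overlaps(a, b):
    """True if the (lower, upper) intervals of a and b share any point."""
    return a[1] <= b[2] and b[1] <= a[2]

nb = results["naive bayes"]
for name in ("decision tree", "knn"):
    status = "overlaps" if overlaps(nb, results[name]) else "does not overlap"
    print(f"naive bayes interval {status} with the {name} interval")
```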


# Project

Lab #6, see BrightSpace