In [16]:
###### Set Up #####
# verify our folder with the data and module assets is installed
# if it is installed make sure it is the latest
!test -e ds-assets && cd ds-assets && git pull && cd ..
# if it is not installed clone it
!test ! -e ds-assets && git clone https://github.com/lutzhamel/ds-assets.git
# point to the folder with the assets
home = "ds-assets/assets/"
import sys
sys.path.append(home)      # add home folder to module search path
!TZ=America/New_York date

Already up to date.
Fri Feb 16 07:04:41 AM EST 2024


# NLP & ML: Classification

We saw that we can convert text documents into a ‘vector model’ (bag-of-words).

We showed that the vector model allows us to perform mathematical analysis on documents such as *which documents are similar to each other?*

Next question: can we construct *machine learning classification models* on document collections using the vector model? -- **Yes!**
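
As a quick reminder of what the vector model looks like, here is a minimal sketch in plain Python. The three documents are made up for illustration (they are not from the newsgroup data), and cosine similarity is just one common choice of similarity measure.

```python
import math
import re

# Three tiny made-up documents (hypothetical, not from the newsgroup data).
docs = [
    "the rocket reached orbit",
    "the senate passed the bill",
    "the rocket launch put a satellite in orbit",
]

def tokenize(doc):
    """Lowercase a document and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", doc.lower()))

# The vocabulary is the sorted set of all words in the collection.
vocab = sorted(set().union(*(tokenize(d) for d in docs)))

# Binary bag-of-words: 1 if the word occurs in the document, else 0.
vectors = [[1 if w in tokenize(d) else 0 for w in vocab] for d in docs]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_space = cosine(vectors[0], vectors[2])   # two space-related documents
sim_mixed = cosine(vectors[0], vectors[1])   # space vs. politics
print(f"space vs space:    {sim_space:.2f}")
print(f"space vs politics: {sim_mixed:.2f}")
```

The two space-related documents share more words, so their vectors point in a more similar direction than the space/politics pair.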



Consider again our news article data set. We would like to construct a classifier that can correctly classify political and science documents.

Let's set up our environment.


# Setup

Set up the general modules we need for our text classification.

In [17]:
# data handling
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split

# model evaluation
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from confint import classification_confint

# Data

Preprocess our data into a docarray and set up train and test sets.

We are using the noheaders version because headers are too predictive.
If we use the headers we will always get close to 100% accurate models.


In [18]:
# get the newsgroup database
#newsgroups = pd.read_csv(home+"newsgroups.csv")
newsgroups = pd.read_csv(home+"newsgroups-noheaders.csv")
newsgroups.head(n=10)

Unnamed: 0,text,label
0,\nIn billions of dollars (%GNP):\nyear GNP ...,space
1,ajteel@dendrite.cs.Colorado.EDU (A.J. Teel) w...,space
2,\nMy opinion is this: In a society whose econ...,space
3,"Ahhh, remember the days of Yesterday? When we...",space
4,"\n""...a la Chrysler""?? Okay kids, to the near...",space
5,"\n As for advertising -- sure, why not? A N...",politics
6,"\n What, pray tell, does this mean? Just who ...",space
7,\nWhere does the shadow come from? There's no...,politics
8,^^^^^^^^^...,politics
9,"#Yet, when a law was proposed for Virginia tha...",space


In [19]:
newsgroups.shape

(1038, 2)

In [20]:
# construct the docarray

# build the stemmer object
stemmer = PorterStemmer()

# build a new default analyzer using CountVectorizer that only
# uses words, [a-zA-Z]+, and also eliminates stop words
analyzer= CountVectorizer(analyzer = "word",
                          stop_words = 'english',
                          token_pattern = "[a-zA-Z]+").build_analyzer()

# build a new analyzer that stems using the default analyzer to
# create the words to be stemmed
def stemmed_words(doc):
    return [stemmer.stem(w) for w in analyzer(doc)]

# build docarray
vectorizer = CountVectorizer(analyzer=stemmed_words,
                             binary=True,
                             min_df=2) # keep only words that appear in at least two documents
docarray = vectorizer.fit_transform(newsgroups['text']).toarray()
doc_df = pd.DataFrame(docarray, columns=list(vectorizer.get_feature_names_out()))
print("We have {} articles with {} features".format(doc_df.shape[0],doc_df.shape[1]))
doc_df.head()

We have 1038 articles with 6045 features


Unnamed: 0,aa,abandon,abbey,abc,abil,abl,aboard,abolish,abort,abroad,...,yugoslavia,yup,z,zealand,zenit,zero,zeta,zip,zone,zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [21]:
# set up train and test sets
X_train, X_test, y_train, y_test = \
  train_test_split(doc_df,
                   newsgroups['label'],
                   train_size=0.8,
                   test_size=0.2,
                   random_state=2)

# Decision Trees

In [22]:
# train tree model
from sklearn.tree import DecisionTreeClassifier

# model object
model = DecisionTreeClassifier(random_state=0)

# grid search
param_grid = {'max_depth': list(range(1,31))}
grid = GridSearchCV(model, param_grid, cv=3, verbose=10, n_jobs=-1)
grid.fit(X_train, y_train)
print("Grid Search: best parameters: {}".format(grid.best_params_))
best_model = grid.best_estimator_

Fitting 3 folds for each of 30 candidates, totalling 90 fits
Grid Search: best parameters: {'max_depth': 21}


**Observation**: The resulting model is very complex with a depth of 21.

In [23]:
# Evaluate the best model
predict_y = best_model.predict(X_test)
acc = accuracy_score(y_test, predict_y)
lb,ub = classification_confint(acc,X_test.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

Accuracy: 0.74 (0.68,0.80)


In [24]:
# build the confusion matrix
labels = ['politics','space']
cm = confusion_matrix(y_test, predict_y, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print("Confusion Matrix:\n{}".format(cm_df))

Confusion Matrix:
          politics  space
politics        90     19
space           36     63


# KNN

Now let's apply our KNN (k-nearest neighbors) algorithm. Since documents are considered points in an n-dimensional space, KNN seems well suited for this problem.
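
To make the idea of "documents as points" concrete, here is a minimal sketch of a KNN classifier in plain Python. The binary document vectors and the four-word vocabulary are hypothetical, and Euclidean distance is assumed as the distance measure.

```python
from collections import Counter
import math

# Hypothetical binary document vectors over a 4-word vocabulary
# (rocket, orbit, senate, vote) together with their labels.
train = [
    ([1, 1, 0, 0], "space"),
    ([1, 0, 0, 0], "space"),
    ([0, 1, 0, 0], "space"),
    ([0, 0, 1, 1], "politics"),
    ([0, 0, 1, 0], "politics"),
    ([0, 0, 0, 1], "politics"),
]

def euclidean(u, v):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_predict(x, train, k=3):
    """Return the majority label among the k training points closest to x."""
    neighbors = sorted(train, key=lambda item: euclidean(x, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

pred_a = knn_predict([1, 1, 0, 1], train)   # mostly space words
pred_b = knn_predict([0, 0, 1, 1], train)   # only politics words
print(pred_a, pred_b)
```

Ties in distance are broken by the order of the training data here, which is a simplification of this sketch.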

In [25]:
from sklearn.neighbors import KNeighborsClassifier

# KNN
model = KNeighborsClassifier()

# grid search
param_grid = {'n_neighbors': list(range(1,11))}
grid = GridSearchCV(model, param_grid, cv=3, verbose=10, n_jobs=-1)
grid.fit(X_train, y_train)
print("Grid Search: best parameters: {}".format(grid.best_params_))
best_model = grid.best_estimator_

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Grid Search: best parameters: {'n_neighbors': 2}


In [26]:
# Evaluate the best model
predict_y = best_model.predict(X_test)
acc = accuracy_score(y_test, predict_y)
lb,ub = classification_confint(acc,X_test.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

Accuracy: 0.53 (0.47,0.60)


In [27]:
# build the confusion matrix
labels = ['politics','space']
cm = confusion_matrix(y_test, predict_y, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print("Confusion Matrix:\n{}".format(cm_df))

Confusion Matrix:
          politics  space
politics        19     90
space            7     92


# Naive Bayes (NB)

* “Standard” model for text processing
* Fast to train, has no problems with very high dimensional data
* NB is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors.
* In simple terms, a NB classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
* For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.
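
Written out for the apple example, the independence assumption says that the joint likelihood of the features factors into a product of per-feature likelihoods (the feature names here are shorthand for the properties in the example above):
<br>
<br>
<center>
$
\large
P(red, round, 3in\,|\,apple) = P(red\,|\,apple)\times P(round\,|\,apple)\times P(3in\,|\,apple)
$
</center>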


## The Algorithm


Bayes' theorem provides a way of calculating the probability of label $c$ given feature $x$,
<br>
<br>
<center>
$
\Large
P(c|x) = \frac{P(x|c)P(c)}{P(x)}
$
</center>

where
  * $P(c|x)$ is the probability of label $c$ given the feature $x$.
  * $P(c)$ is the probability of label $c$ .
  * $P(x|c)$ is the probability of the feature $x$ given label $c$.
  * $P(x)$ is the probability of feature $x$.

What is remarkable about the Naive Bayes algorithm is that all these probabilities can be computed by just counting values in the training data.

## Example

Let's assume we have a feature `Weather` with feature values `Sunny`, `Overcast`, and `Rainy`, as well as a target `Play` that contains the labels `yes` and `no`.  

<img src="https://www.analyticsvidhya.com/wp-content/uploads/2015/08/Bayes_41.png">

We want to determine whether we play tennis when it is sunny. That is, we compute the two probabilities,

1. $P(Yes|Sunny)$
1. $P(No|Sunny)$

and then pick the label with the higher probability.

Let's look at $P(Yes|Sunny)$,
<br>
<br>
<center>
$
\large
P(Yes|Sunny) = \frac{P(Sunny|Yes)P(Yes)}{P(Sunny)} = \frac{3/9\times 9/14}{5/14} = \frac{.33 \times .64}{.36}=.60$
</center>

where,
* $P(Sunny|Yes)$ is computed by counting the number of `Yes` labels (9) and then counting how often `Sunny` appears in the context of `Yes` (3), therefore, $P(Sunny|Yes) = 3/9$.
* $P(Yes)$ is computed by counting how many `Yes` labels appear in the whole data set (9) and then dividing by the number of rows in the data set (14), therefore, $P(Yes) = 9/14$.
* $P(Sunny)$ is computed by counting how often the feature value `Sunny` appears in the data set (5) and dividing it by the number of rows of the data set (14), therefore, $P(Sunny) = 5/14$.
<br>
<br>

Now, let's look at $P(No|Sunny)$,
<br>
<br>
<center>
$
\large
P(No|Sunny) = \frac{P(Sunny|No)P(No)}{P(Sunny)} = \frac{2/5\times 5/14}{5/14} = \frac{.40 \times .36}{.36}=.40$
</center>
<br>
<br>
Here, the individual probabilities are computed in a similar fashion as above.


**Observation**: We predict `Yes`, that is, we play tennis when it is sunny, because $P(Yes|Sunny)$ is the higher of the two probabilities.
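
The counting argument above can be reproduced in a few lines of plain Python. This is a minimal sketch that uses only the counts quoted in the worked example; the helper name `posterior` is made up for illustration.

```python
from fractions import Fraction

# Counts taken from the worked example above.
n_rows = 14        # rows in the data set
n_yes = 9          # rows labeled Yes
n_no = 5           # rows labeled No
n_sunny = 5        # rows where Weather is Sunny
n_sunny_yes = 3    # Sunny rows labeled Yes
n_sunny_no = 2     # Sunny rows labeled No

def posterior(n_feature_and_label, n_label, n_feature, n_rows):
    """Bayes' theorem with every probability estimated by counting."""
    likelihood = Fraction(n_feature_and_label, n_label)   # P(x|c)
    prior = Fraction(n_label, n_rows)                     # P(c)
    evidence = Fraction(n_feature, n_rows)                # P(x)
    return likelihood * prior / evidence

p_yes = posterior(n_sunny_yes, n_yes, n_sunny, n_rows)
p_no = posterior(n_sunny_no, n_no, n_sunny, n_rows)
print(f"P(Yes|Sunny) = {p_yes} = {float(p_yes):.2f}")
print(f"P(No|Sunny)  = {p_no} = {float(p_no):.2f}")
print("Prediction:", "Yes" if p_yes > p_no else "No")
```

Using `Fraction` keeps the arithmetic exact, so the two posteriors come out as 3/5 and 2/5 and sum to one.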

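In order to predict a label $c$ based on multiple features $x_1$ through $x_n$ we use the formula below. The denominator $P(x_1,x_2,\ldots,x_n)$ is dropped because it is the same for every label and therefore does not affect which label scores highest,

<br>
<br>
<center>
$
\large
P(c|x_1,x_2,\ldots,x_n) \propto P(x_1|c)\times P(x_2|c)\times\ldots \times P(x_n|c)\times P(c)$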
</center>
<br>
<br>
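
With several features, Naive Bayes scores each label by multiplying the prior $P(c)$ with the per-feature likelihoods $P(x_i|c)$ and picks the label with the highest score. The sketch below illustrates this on a small hypothetical data set with two features (`Weather` and a made-up `Wind` feature); the rows and the helper names `score` and `predict` are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical training rows: (Weather, Wind) -> Play
rows = [
    (("Sunny", "Weak"), "Yes"),
    (("Sunny", "Strong"), "No"),
    (("Overcast", "Weak"), "Yes"),
    (("Rainy", "Weak"), "Yes"),
    (("Rainy", "Strong"), "No"),
    (("Sunny", "Weak"), "Yes"),
    (("Overcast", "Strong"), "Yes"),
    (("Rainy", "Strong"), "No"),
]

label_counts = Counter(label for _, label in rows)

# feature_counts[label][i][value] = how often feature i takes this value
# among the rows with the given label
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in rows:
    for i, value in enumerate(features):
        feature_counts[label][i][value] += 1

def score(features, label):
    """Unnormalized posterior: P(c) times the product of P(x_i|c)."""
    result = label_counts[label] / len(rows)                  # P(c)
    for i, value in enumerate(features):
        # P(x_i|c), estimated by counting
        result *= feature_counts[label][i][value] / label_counts[label]
    return result

def predict(features):
    """Pick the label with the highest score."""
    return max(label_counts, key=lambda label: score(features, label))

query = ("Sunny", "Strong")
scores = {label: score(query, label) for label in label_counts}
print(scores)
print("Prediction:", predict(query))
```

Note that a feature value never seen with a label gives that label a score of zero in this sketch. Library implementations such as scikit-learn's `MultinomialNB` avoid this with a smoothing parameter (`alpha`).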

[Source](https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained)


## Text Classification

Let’s take our text classification problem and use a Naive Bayes classifier on it.

The setup and data prep are the same as for the KNN classifier.

In [28]:

from sklearn.naive_bayes import MultinomialNB

# Naive Bayes
model = MultinomialNB()

# train the model
# NOTE: we use MultinomialNB with its default settings (e.g. smoothing alpha=1.0),
# so we skip the grid search over the parameter space here
best_model = model.fit(X_train, y_train)


In [29]:
# Evaluate the best model
predict_y = best_model.predict(X_test)
acc = accuracy_score(y_test, predict_y)
lb,ub = classification_confint(acc,X_test.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

Accuracy: 0.96 (0.93,0.98)


In [30]:
# build the confusion matrix
labels = ['politics','space']
cm = confusion_matrix(y_test, predict_y, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print("Confusion Matrix:\n{}".format(cm_df))

Confusion Matrix:
          politics  space
politics       102      7
space            2     97


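**Observation**: Naive Bayes trains very fast and has a higher accuracy than the decision tree or KNN. The difference in accuracy is statistically significant: its confidence interval (0.93,0.98) does not overlap with that of the decision tree (0.68,0.80) or KNN (0.47,0.60).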
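
The significance claim can be checked by comparing confidence intervals. The sketch below assumes a 95% interval based on the normal approximation. The implementation of `classification_confint` is not shown in this notebook and may differ, so `normal_confint` is a hypothetical stand-in. The counts of correct predictions are read off the diagonals of the three confusion matrices.

```python
import math

def normal_confint(acc, n, z=1.96):
    """95% interval for an accuracy via the normal approximation (assumed method)."""
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

n_test = 208   # size of the test set (sum of the entries of a confusion matrix)

# Correct predictions: the diagonal entries of each confusion matrix.
correct = {"Decision tree": 90 + 63, "KNN": 19 + 92, "Naive Bayes": 102 + 97}

intervals = {}
for name, hits in correct.items():
    acc = hits / n_test
    intervals[name] = normal_confint(acc, n_test)
    lb, ub = intervals[name]
    print(f"{name:13s} accuracy {acc:.2f} ({lb:.2f},{ub:.2f})")

# Two accuracies differ significantly (at roughly the 95% level)
# when their confidence intervals do not overlap.
nb_lb = intervals["Naive Bayes"][0]
for name in ("Decision tree", "KNN"):
    print(f"NB interval lies above the {name} interval:", nb_lb > intervals[name][1])
```

Under this assumption the computed intervals agree with the ones reported above, and the Naive Bayes interval lies entirely above the other two.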



# Project

See BrightSpace