# DAML 07 - Feature Engineering and Naive Bayes

Michal Grochmal <michal.grochmal@city.ac.uk>

Most machine learning algorithms work only with numeric data,
but not all data in the world is numeric.
Two examples of non-numerical data that are meaningful to learn from are
categorical features and plain text (e.g. product reviews).
There are tricks that allow us to deal with non-numerical data;
these tricks are part of *feature engineering*.

For a start let's import a handful of things.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-v0_8-talk')  # this style was named 'seaborn-talk' before matplotlib 3.6

## Feature Engineering

Dealing with non-numerical data is only a part of feature engineering,
although it is the application most often called by this name.
Actually, feature engineering is not a single collection of techniques
but a generic name for the tasks performed on the input data of our model.
These include:

- Modifying existing features - e.g. scaling
- Selecting only a subset of features - e.g. removing highly correlated features
- Building new features from existing ones - e.g. squaring features to get only non-negative values
- Encoding features in a different representation - e.g. one-hot-encoding
- Learning new features from data - e.g. huge neural networks fed with lots of data

The last example typically requires huge amounts of data,
often on the order of hundreds of millions of samples.
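
The first three tasks in the list can be sketched in a few lines of `numpy`.
This is only an illustration on made-up random data: the three features, the 0.95 correlation
threshold and all variable names are assumptions, not part of any dataset used in this notebook.

```python
import numpy as np

# Made-up data: x2 is almost a scaled copy of x1, x3 is unrelated noise.
rng = np.random.default_rng(0)
x1 = rng.normal(loc=50.0, scale=10.0, size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X_raw = np.column_stack([x1, x2, x3])

# Modifying existing features: scale each column to zero mean and unit variance.
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Selecting a subset: drop a feature if it is highly correlated with an earlier one.
corr = np.corrcoef(X_scaled, rowvar=False)
keep = [i for i in range(corr.shape[0])
        if all(abs(corr[i, j]) < 0.95 for j in range(i))]
X_selected = X_scaled[:, keep]

# Building new features: append the square of every feature we kept.
X_engineered = np.hstack([X_selected, X_selected ** 2])
print(keep, X_engineered.shape)
```

In practice `sklearn` provides ready-made transformers for most of these steps,
the code above only shows that none of them is complicated.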

## Categorical Data

When we deal with the proper names of things or people we are most often dealing
with categorical data.  One example of such would be:

In [None]:
country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man']
area = np.array([14130, 77933, 20779, 130279, 572])
population2011 = np.array([1810863, 5313600, 3063456, 53012456, 83314])
data = pd.DataFrame({'area': area,
                     'country': country,
                     'population': population2011})
data.country = data.country.astype('category')
data

Note that we forced the country column to have a categorical type.
In `pandas` that is a way of assigning integer codes to a column;
these codes then reference a set of categorical labels.

This also means that the data is now completely numerical,
i.e. we can do this:

In [None]:
pd.merge(pd.DataFrame({'country': data.country.cat.codes}),
         data[['area', 'population']], left_index=True, right_index=True)
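
To see where these numbers come from, we can look at the two pieces a categorical column is made of.
The sketch below rebuilds the column on its own (the name `country_col` is just for illustration)
and prints the labels, the codes, and the labels recovered from the codes.

```python
import pandas as pd

country_col = pd.Series(
    ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man'],
    dtype='category')

# The labels are stored only once; by default pandas sorts them.
print(list(country_col.cat.categories))
# Each row stores just the integer position of its label.
print(list(country_col.cat.codes))
# Looking the codes up in the categories gives back the original labels.
recovered = country_col.cat.categories[country_col.cat.codes.to_numpy()]
print(list(recovered))
```

The codes follow the alphabetical order of the labels, they carry no other meaning.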

Yet, that is *not* enough.
Numerical values have an order, therefore we can test them for inequality.
Based on the category codes above we can say that:

$$\texttt{Isle of Man} < \texttt{Wales}$$

or

$$\texttt{Scotland} > \texttt{England}$$

Unfortunately, apart from their use in rugby jokes, these inequalities are rather useless.
Moreover, they are likely to confuse an ML algorithm.
Instead we need to encode the data into a form called **one-hot-encoding**.
Each category becomes a separate column, and for each sample
only one of these columns contains a one; all the other columns contain zeros.

`pandas`' `get_dummies` exists for this exact purpose,
to build a one-hot-encoding from a categorical feature.

In [None]:
pd.get_dummies(data, prefix_sep='=')

And this is something that we can feed into an ML technique without worrying about confusing it.
That said, this representation can use huge amounts of memory if a feature has a large number of categories.
To alleviate the memory problem `sklearn` can output the one-hot-encoding as a sparse matrix (from `scipy`);
this way we only need to store the ones.
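
A minimal sketch of the sparse variant, using `OneHotEncoder` from `sklearn.preprocessing`
on the same five country names.
The variable names are illustrative; the encoder is used with its default settings,
which return a `scipy` sparse matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# The encoder expects a 2D input: one row per sample, one column per feature.
countries = np.array([['Northern Ireland'], ['Scotland'], ['Wales'],
                      ['England'], ['Isle of Man']])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(countries)

# Only the non-zero entries (the ones) are stored.
print(sparse.issparse(encoded), encoded.shape, encoded.nnz)
print(encoder.categories_[0])
# Converting to a dense array is fine here, but defeats the purpose on big data.
print(encoded.toarray())
```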

## Textual Data

Plain, unorganized text data presents different challenges when we transform it into a numeric representation.
For a start we cannot just one-hot-encode words because they may appear more than once in each sample.
We could encode the presence of words in each sample but, when distinguishing between samples,
some words are more important than others, e.g. we can safely assume that
the word "the" will appear in almost every sample.

Search engine research produced an elegant technique to encode words in plain text:
*Term Frequency by Inverse Document Frequency* (TF-IDF).
Each word in a sample is represented by the count of this word in the sample,
scaled down by the number of samples across the corpus in which this same word appears.
Each sample has one feature for each word in the entire corpus (all samples),
and all words that are not present in the sample are encoded as zeros.
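
A small worked example may help here.
The sketch below computes the weights by hand on a made-up three sentence corpus and compares
them with `TfidfVectorizer`.
It assumes the smoothed formula given in the `sklearn` documentation,
$\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$,
and switches off the row normalisation (`norm=None`) that `sklearn` applies by default.
Other libraries and textbooks use slightly different variants of the formula.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus, three samples.
corpus = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'the rocket reached the moon',
]

# Term frequency: raw counts, one row per sample and one column per word.
counter = CountVectorizer()
counts = counter.fit_transform(corpus).toarray()
n_samples = counts.shape[0]

# Document frequency: in how many samples does each word appear?
doc_freq = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency (assumed formula, see the text above).
idf = np.log((1 + n_samples) / (1 + doc_freq)) + 1
manual = counts * idf

# The same computation done by sklearn, without row normalisation.
from_sklearn = TfidfVectorizer(norm=None).fit_transform(corpus).toarray()

# Column order follows the vocabulary indices.
words = sorted(counter.vocabulary_, key=counter.vocabulary_.get)
for word, weight in zip(words, idf):
    print(f'{word:8s} idf = {weight:.3f}')
print(np.allclose(manual, from_sklearn))
```

The word "the" appears in every sample and therefore gets the smallest weight,
whilst words that appear in a single sample get the biggest one.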

This produces a huge sparse matrix representation of the data.
We can try it out with samples from *newsgroups*.
And since newsgroups are aggregated by topic we will try to classify the samples into topics.

In [None]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = [
    'comp.graphics',
    'comp.windows.x',
    'misc.forsale',
    'rec.autos',
    'rec.sport.hockey',
    'sci.med',
    'sci.space',
    'soc.religion.christian',
    'talk.politics.misc',
]
train = fetch_20newsgroups(categories=newsgroups, subset='train')
test = fetch_20newsgroups(categories=newsgroups, subset='test')
train.target_names

The dataset is already divided into train and test sets.
We will make a pipeline of a TF-IDF preprocessor and a Naive Bayes classifier.
The Naive Bayes classifier is a very simple **generative** technique: it assumes that
the features are independent of each other within a class and then just attempts
to build a simple probabilistic generator around the center of each class.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

Since Naive Bayes has very few hyperparameters to tune (little more than a smoothing constant)
and is very fast to train, it is a very good technique for a classification baseline.
Here we use a multinomial Naive Bayes classifier because our features are (weighted) word counts.

In [None]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train.data, train.target)
labels = model.predict(test.data)
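
To make the classifier less of a black box, here is a minimal sketch of what a multinomial
Naive Bayes computes, written in plain `numpy` on a made-up word count matrix.
The toy data and the helper `predict` are assumptions for illustration only;
the estimates use a smoothing constant `alpha` in the way described in the `sklearn` documentation,
and the sketch checks its own answers against `MultinomialNB`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts: rows are samples, columns are words.
X = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 2, 0, 3],
    [0, 1, 1, 2],
    [1, 0, 2, 0],
])
y = np.array([0, 0, 1, 1, 0])
alpha = 1.0  # smoothing constant, avoids zero probabilities for unseen words

classes = np.unique(y)
# Prior: the fraction of training samples that belong to each class.
log_prior = np.log(np.array([(y == c).mean() for c in classes]))
# Likelihood: smoothed share of each word among all the words seen in a class.
word_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))


def predict(samples):
    # Adding log-probabilities word by word is the "naive" independence assumption.
    scores = samples @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]


X_new = np.array([[2, 0, 1, 0], [0, 1, 0, 2]])
reference = MultinomialNB(alpha=alpha).fit(X, y)
print(predict(X_new), reference.predict(X_new))
```

Training is nothing more than counting, which is why the model fits so quickly even on a huge sparse matrix.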

Since we have lots of classes (9 different newsgroup topics) a single score may
not be the best way to understand how our model works.
Instead we will build a confusion matrix, which counts, for each true class,
how many samples were assigned to each predicted class.
From it we can read the true positives, false positives, true negatives and false negatives for each class,
and then evaluate which classes the model is better at identifying.
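
How these four counts come out of the matrix can be sketched on a tiny made-up example with three classes.
The labels below are invented for illustration.
Note that `confusion_matrix` places the true labels on the rows and the predicted labels on the columns.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for ten samples and three classes.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
toy_mat = confusion_matrix(y_true, y_pred)  # rows: true, columns: predicted

tp = np.diag(toy_mat)               # predicted as the class and belonging to it
fp = toy_mat.sum(axis=0) - tp       # predicted as the class but belonging elsewhere
fn = toy_mat.sum(axis=1) - tp       # belonging to the class but predicted elsewhere
tn = toy_mat.sum() - tp - fp - fn   # everything that does not involve the class

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(toy_mat)
print('TP', tp, 'FP', fp, 'FN', fn, 'TN', tn)
print('precision', precision.round(2), 'recall', recall.round(2))
```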

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
mat = confusion_matrix(test.target, labels)
fig = plt.figure(figsize=(12, 12))
ax = sns.heatmap(mat.T, square=True, annot=True, fmt='d', cmap='viridis',
                 xticklabels=train.target_names, yticklabels=train.target_names)
ax.set_xlabel('true label')
ax.set_ylabel('predicted label');

The worst result is across *religion* and *politics*.
No surprises there, these topics get intermingled in the real world too.

That said, with a very simple classifier and some data encoding we have built
a model that can tell us the topic of a sentence.
We can see it in action with a small helper function:

In [None]:
def predict_chat(sentence):
    predicted = model.predict([sentence])
    return train.target_names[predicted[0]]


print('TUNING', predict_chat("I've added a new set of cylinders, now I'm not even making 10 miles per gallon"))
print('BALL', predict_chat('The ball never went even close to the goal'))
print('BUTTON', predict_chat("Dude, I'm telling you, there is no such button on my screen"))
print('WIFE', predict_chat('My wife went shopping in the morning, has not come back yet'))
print('PRESCRIPTION', predict_chat('Got my prescription rejected at the pharmacy'))
print('APOLLO', predict_chat('No one ever landed on the moon, it was all a farce'))

Given that all this is doing is checking word frequency probabilities,
this is a rather amazing result for such a simple algorithm.

We can still see the problems with *religion* and *politics* in the predictions.
This problem happens because these two topics use lots of *stop words*,
i.e. very common words that are used to build sentences but carry little meaning on their own.
For example:

In [None]:
predict_chat('the what where')

If we remove the stop words from the data representation we should
get a better separation between religion and politics.
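
A minimal sketch of how this could be done: `TfidfVectorizer` accepts `stop_words='english'`,
which drops the words from a built-in English list before the weights are computed.
The two sentences below are made up, they only show the effect on the vocabulary.
Whether the separation on the newsgroups data actually improves is something we would need to check
by training the model again.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences with plenty of stop words.
docs = [
    'the church is where the sermon was held',
    'what the senate voted on is the budget',
]

with_stop = TfidfVectorizer().fit(docs)
without_stop = TfidfVectorizer(stop_words='english').fit(docs)

print(sorted(with_stop.vocabulary_))
print(sorted(without_stop.vocabulary_))

# In the pipeline used earlier the change would be a single argument, e.g.
# make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
```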

## References

- [Term frequency and weighting - Introduction to Information Retrieval][1]
- [Weighting Schemes - Introduction to Information Retrieval][2]

[1]: https://nlp.stanford.edu/IR-book/html/htmledition/term-frequency-and-weighting-1.html "TF-IDF"
[2]: https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html "weighting"