This notebook follows the [scikit-learn text analytics tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html), with some cells reordered and some explanations added.

We will first load all the packages and modules used in the tutorial (the tutorial itself imports them one by one as needed).

In [None]:
import numpy as np # to compute accuracy
import pandas as pd # to print nicer tables

from sklearn.datasets import fetch_20newsgroups # get dataset
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer # for preprocessing, frequencies

# classification algorithms
from sklearn.naive_bayes import MultinomialNB # naive bayes
from sklearn.linear_model import SGDClassifier # linear SVM (when loss='hinge'), trained with SGD

# evaluate
from sklearn import metrics

# streamline code
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Get dataset

The dataset is a part of the "20 Newsgroups" built-in dataset. Many packages for machine learning and data science come with built-in datasets for tests and tutorials.
In this case, the dataset can be retrieved with the `sklearn.datasets.fetch_20newsgroups()` function. Let's check out the documentation.

In [None]:
help(fetch_20newsgroups)

The tutorial provides a list of 4 out of the 20 categories for a partial dataset.

In [None]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [None]:
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [None]:
print(twenty_train.DESCR) # get full documentation of the dataset

As indicated in the documentation, `twenty_train` has a few interesting attributes:

- `data` = the contents
- `filenames` = the paths to the files where the contents are stored
- `target` = the index of the category that each element belongs to

These three attributes have the same length. For a given index `i`,
the filename of `twenty_train.data[i]` is `twenty_train.filenames[i]`,
and its category is `twenty_train.target_names[twenty_train.target[i]]`.
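
The indexing relationship can be sketched on a tiny stand-in object. Everything below (the `toy` object, its texts and its file names) is made up for illustration; only the attribute names mirror those of `twenty_train`.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the object returned by fetch_20newsgroups
toy = SimpleNamespace(
    data=["text about rendering", "text about faith", "text about vaccines"],
    filenames=["graphics/001", "religion/002", "med/003"],
    target=[0, 2, 1],  # one category index per document
    target_names=["comp.graphics", "sci.med", "soc.religion.christian"],
)

# data, filenames and target are aligned: position i describes the same document
for i in range(len(toy.data)):
    label = toy.target_names[toy.target[i]]
    print(f"{toy.filenames[i]} | {label} | {toy.data[i]}")
```

Note that `target_names` is shorter than the other three attributes: it has one entry per category, not one per document.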

**What's the point of `target`?**
In machine learning, the idea of a **classifier** is that you build a model that can predict the category that something belongs to. You build the model by studying a **training** dataset, learning patterns from it, and then applying them to unseen **test** data.
A **supervised learning algorithm** requires the training dataset to be labelled with the categories that you want to learn. In this case, we have thousands of items and for each of them we know which category it belongs to: "atheism", "computer graphics", "medicine" or "Christianity". The idea is that the model learns the patterns that characterize the texts from each category so that, given a new text from the test dataset, it can reliably classify it in one of those four categories.

In [None]:
twenty_train.target_names

In [None]:
type(twenty_train)

In [None]:
len(twenty_train.data)

In [None]:
twenty_train.filenames[0]

In [None]:
print(twenty_train.data[0])

In [None]:
twenty_train.target[0]

In [None]:
twenty_train.target_names[twenty_train.target[0]]

# BOW representation of the text
In plain text, all a computer sees of the words are their characters. However, similarity between texts based on their _characters_ is not very informative.
Instead, we use a bag-of-words (BOW) representation: each document becomes a sequence of numbers (a **vector**) in which each number is the frequency of a given word in that document. Seen the other way around, each word is also a vector, with one frequency per document. The idea is that documents in which the same words occur with similar frequencies are similar to each other.

This will be represented as a **sparse matrix**: a table of numbers that has mostly zeros (because most words do not occur in most documents).

In [None]:
help(CountVectorizer)

In [None]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape # a matrix with one row per document and one column per word

In [None]:
count_vect.vocabulary_.get('algorithm') # column index of 'algorithm' in the count matrix (not its frequency)

In [None]:
len(count_vect.vocabulary_)

We can then transform the raw frequencies into **tf-idf** (Term frequency x Inverse Document Frequency) so that:
- The weights are normalised per document, so long documents do not dominate just because they contain more words
- Words that occur in many documents have less weight
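
To see what the transformation does, the sketch below recomputes tf-idf by hand on a toy count matrix and compares it with `TfidfTransformer`. It assumes the transformer's default settings (smoothed IDF, then L2 normalisation of each row); the counts and variable names are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Toy counts: 3 documents (rows) x 5 terms (columns)
toy_counts = np.array([[1, 0, 1, 0, 1],
                       [0, 1, 1, 0, 1],
                       [1, 1, 0, 1, 2]])

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed inverse document frequency: rarer terms get a larger weight
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

weighted = toy_counts * idf
# L2-normalise each row so that document length does not dominate
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk_tfidf = TfidfTransformer().fit_transform(csr_matrix(toy_counts)).toarray()
print(np.allclose(manual_tfidf, sk_tfidf))
```

The last column occurs in every document, so it receives the smallest IDF weight of the five terms.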

In [None]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

# Train a classifier
For this tutorial, a naïve Bayes classifier is used.

The previous functions, which just transformed the data, only needed the data itself.
In this case, instead, we need two elements for training: the (transformed) data and the labels. The `i`th label corresponds to the `i`th row of the data matrix, which is the numerical representation of the `i`th training document.

In [None]:
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

We can now use our model `clf` to **predict** the label of new data, such as the strings below in `docs_new`.
First, we have to transform this new data in the same way, using the vectorizer `count_vect` and the `tfidf_transformer`. We don't _fit_ them again, though: the vocabulary and the IDF weights must stay the ones learned from the training data, so we only _transform_ the new data.
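
The difference between fitting and transforming can be sketched on a toy corpus. The documents and labels below are invented, and the tf-idf step is left out to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training documents; 0 = religion, 1 = graphics (hypothetical labels)
toy_train = ["god faith church", "prayer faith god",
             "pixel render gpu", "gpu shader pixel"]
toy_labels = [0, 0, 1, 1]

toy_vect = CountVectorizer()
X_toy = toy_vect.fit_transform(toy_train)  # learns the vocabulary, then counts
toy_clf = MultinomialNB().fit(X_toy, toy_labels)

# New documents contain words never seen in training ("today", "farm")
toy_new = ["god prayer today", "gpu render farm"]
X_toy_new = toy_vect.transform(toy_new)  # reuses the learned vocabulary

print(X_toy.shape, X_toy_new.shape)  # same number of columns
print(toy_clf.predict(X_toy_new))
```

Unseen words are silently dropped by `transform`, so the new matrix has exactly the columns the classifier was trained on. Calling `fit_transform` on the new data instead would build a different vocabulary and the columns would no longer line up.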

In [None]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
X_new_tfidf.shape

In [None]:
# column indices of the words that occur in at least one of the new documents
words_present = np.unique(X_new_tfidf.nonzero()[1])
words_present

In [None]:
# raw counts of the words in the mini corpus
pd.DataFrame(X_new_counts.toarray()[:,words_present],
            columns = count_vect.get_feature_names_out()[words_present])

In [None]:
# tf-idf of the words in the mini corpus
pd.DataFrame(X_new_tfidf.toarray()[:,words_present],
            columns = count_vect.get_feature_names_out()[words_present])

Once we have transformed the new data, we can apply our model to try to predict the labels.

In [None]:
predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print(f'{doc} => {twenty_train.target_names[category]}')

For this tiny test set, it seems alright!

# Pipeline
If you need to run this workflow many times you can make it more compact by using a pipeline, which sequentially applies a list of transformations (such as `CountVectorizer` and `TfidfTransformer`) and then a final estimator (in this case `MultinomialNB`).

Why would you want to run this many times? Because, even though we haven't seen this yet, the transformations and the estimator have _hyperparameters_ whose values affect the results. You might want to explore different values of these hyperparameters to see which combination gives the best result, thus **tuning your model**. In that case, a pipeline streamlines your code.

In [None]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(twenty_train.data, twenty_train.target)

In [None]:
predicted_via_pipeline = text_clf.predict(docs_new)

for doc, category in zip(docs_new, predicted_via_pipeline):
    print(f'{doc} => {twenty_train.target_names[category]}')

As you can see, one line of code sufficed to apply all transformations _and_ training to the training data, and another one to apply all transformations _and_ prediction to the new data.
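
To make the equivalence concrete, the sketch below runs the manual steps and a pipeline side by side on an invented toy corpus and checks that they give the same predictions. All `toy_` names and documents are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_train = ["god faith church", "prayer faith god",
             "pixel render gpu", "gpu shader pixel"]
toy_labels = [0, 0, 1, 1]
toy_new = ["god prayer", "gpu render"]

# Manual version: fit each step on the training data, then reuse it
vect, tfidf, nb = CountVectorizer(), TfidfTransformer(), MultinomialNB()
nb.fit(tfidf.fit_transform(vect.fit_transform(toy_train)), toy_labels)
manual_pred = nb.predict(tfidf.transform(vect.transform(toy_new)))

# Pipeline version: the same three steps chained together
toy_pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
toy_pipe.fit(toy_train, toy_labels)
pipe_pred = toy_pipe.predict(toy_new)

print(np.array_equal(manual_pred, pipe_pred))
# Hyperparameters of each step are addressable as <step>__<argument>
print('clf__alpha' in toy_pipe.get_params())
```

The `<step>__<argument>` names returned by `get_params()` are the ones used to tune a pipeline.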

# Evaluate performance
Here we only tested on two sentences that were, in addition, designed to be very easy to classify. With a larger test dataset, checking the matches manually is time-consuming and error-prone.

The tutorial suggests a very simple measure of _accuracy_, i.e. the proportion of labels that the model got right.

In [None]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42) # retrieve test data
docs_test = twenty_test.data # extract the data part
predicted = text_clf.predict(docs_test) # predict with the train model
[(a, b, a == b) for a, b in zip(twenty_test.target, predicted)]

In [None]:
np.mean(predicted == twenty_test.target)
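
The accuracy computed above is just the mean of a boolean array. A small worked example with invented labels shows the arithmetic and compares it with `metrics.accuracy_score`.

```python
import numpy as np
from sklearn import metrics

# Invented labels: 4 of the 5 predictions match the true label
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1])

# True counts as 1 and False as 0, so the mean is the proportion of matches
manual_accuracy = np.mean(y_pred == y_true)
sk_accuracy = metrics.accuracy_score(y_true, y_pred)
print(manual_accuracy, sk_accuracy)
```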

As a next step, the tutorial shows how to use a different kind of classification algorithm: a linear support vector machine, which `SGDClassifier` implements when `loss='hinge'`. The cells below do the same thing we did above but with this algorithm.

In [None]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])

text_clf.fit(twenty_train.data, twenty_train.target)

In [None]:
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

With some functions from `sklearn.metrics` we can also look into the performance in more detail.

In [None]:
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))

In the confusion matrix, the rows indicate the true label and the columns the predicted label.

In [None]:
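# use target_names for the labels: class indices follow its (alphabetical) order, not the order of `categories`
pd.DataFrame(metrics.confusion_matrix(twenty_test.target, predicted),
             columns=twenty_test.target_names, index=twenty_test.target_names)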
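
A two-class toy example (invented labels) makes the orientation of the confusion matrix easier to check, and shows how the precision and recall in the classification report can be read off it.

```python
import numpy as np
from sklearn import metrics

# Invented labels: two documents of class 0 are mistaken for class 1
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 1, 1, 1]

cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)  # rows = true label, columns = predicted label

# Precision for class 1: correct class-1 predictions / all class-1 predictions (column)
precision_1 = cm[1, 1] / cm[:, 1].sum()
# Recall for class 1: correct class-1 predictions / all true class-1 items (row)
recall_1 = cm[1, 1] / cm[1, :].sum()
print(precision_1, recall_1)
```

Here every true class-1 item is found (recall 1.0), but only half of the items predicted as class 1 really belong to it (precision 0.5).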

## Grid search

As mentioned before, we might want to try out different values for hyperparameters that could affect our model. A grid search looks at all possible combinations of values that you provide.

This is done by providing the pipeline and a dictionary of parameters to `sklearn.model_selection.GridSearchCV`. This dictionary has the hyperparameter names as keys, with the format `<pipeline-item>__<argument>`, and a list or tuple of candidate values as values. The one below, for example, requests:

- The values `(1, 1)` and `(1, 2)` for the `ngram_range` argument of `CountVectorizer()`, which is the `vect` element of the Pipeline. `(1, 1)` extracts only single words (unigrams), while `(1, 2)` extracts both unigrams and bigrams.
- The values `True` and `False` for the `use_idf` argument of `TfidfTransformer()`, which is the `tfidf` element of the Pipeline. These values switch the inverse-document-frequency reweighting on or off; with `False`, only the (normalised) term frequencies are used.
- The values `0.01` and `0.001` for the `alpha` argument of `SGDClassifier()`, which is the `clf` element of the Pipeline. This is the constant that multiplies the regularization term: the higher the value, the stronger the regularization.

The `cv` argument defines the number of folds for cross-validation. This means that the training set is split into 5 pieces of roughly equal size, and the train-and-evaluate workflow is run 5 times _for each combination of hyperparameters_, each time holding out one piece for validation and training on the other 4.

Finally, `n_jobs = -1` tells the computer to use as many CPU cores as we have available to run this computationally expensive task.
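
To get a feeling for how much work this is, `ParameterGrid` can enumerate the combinations that a grid like the one described above expands to. This is only a sketch for counting; `toy_parameters` repeats the values listed above and `n_folds` is a name introduced here.

```python
from sklearn.model_selection import ParameterGrid

toy_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

grid = list(ParameterGrid(toy_parameters))  # every combination, as a list of dicts
n_folds = 5

print(len(grid))            # 2 x 2 x 2 combinations
print(len(grid) * n_folds)  # model fits needed for 5-fold cross-validation
for combo in grid[:2]:
    print(combo)
```

Each extra value for one hyperparameter multiplies the number of fits, which is why grid searches become expensive quickly.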

In [None]:
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

In [None]:
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)

Because it can take so long with the full training dataset, the tutorial suggests looking at the first 400 documents only.

In [None]:
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])

The result is a classifier that we can use to predict. The difference with the previous one is that its hyperparameters were chosen by a cross-validated search, after which the best combination was refit on all the data passed to `fit`.

In [None]:
twenty_train.target_names[gs_clf.predict(['God is love'])[0]]

In [None]:
gs_clf.best_score_ # the score of the best model

In [None]:
gs_clf.best_params_ # the hyperparameters of the best model

Finally, the `cv_results_` attribute gives us the details of the search. It is a dictionary that converts to a table with one row per combination of hyperparameters and columns for different properties, such as the time each fit took, the value of each parameter, the score on each fold and a summary of these scores.

In [None]:
pd.DataFrame(gs_clf.cv_results_)
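
The full table is wide, so it often helps to keep only a few columns and sort by rank. The sketch below does this on a small, invented grid search (toy documents, a naïve Bayes classifier and 2 folds) so that it runs in a moment; the column names are those of `cv_results_`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up corpus: 4 documents per class; 0 = religion, 1 = graphics
toy_docs = [
    "god faith church", "prayer faith god", "church prayer faith", "god church prayer",
    "pixel render gpu", "gpu shader pixel", "render shader gpu", "pixel gpu render",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
toy_grid = {'vect__ngram_range': [(1, 1), (1, 2)], 'clf__alpha': [0.1, 1.0]}
toy_gs = GridSearchCV(toy_pipe, toy_grid, cv=2).fit(toy_docs, toy_labels)

results = pd.DataFrame(toy_gs.cv_results_)
# Keep the most informative columns and put the best combination first
summary = results[['params', 'mean_test_score', 'std_test_score',
                   'rank_test_score']].sort_values('rank_test_score')
print(summary)
```

The same column selection and sorting can be applied to `pd.DataFrame(gs_clf.cv_results_)`.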