# Experiments with text classifiers in sklearn

In this exercise, we'll experiment with various classification algorithms in scikit-learn using the [20 Newsgroups collection](http://people.csail.mit.edu/jrennie/20Newsgroups/).

The first part of the notebook walks through a detailed example of text classification with sklearn (based on [scikit-learn's "Working with text data" tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)).
The actual exercise is at the bottom, where you'll be asked to perform various experiments.

## Load data

To get faster execution times, we will work on a subset of the dataset, using only 5 of the 20 available categories.

In [None]:
pip install ipytest

In [None]:
categories = [
    "alt.atheism",
    "soc.religion.christian", 
    "talk.religion.misc",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware"
]

We load the documents from those categories, divided into train and test sets.

In [None]:
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=123)
test = fetch_20newsgroups(subset="test", categories=categories, shuffle=True, random_state=123)

Check which categories got loaded.

In [None]:
print(train.target_names)

Check the size of training and test splits.

In [None]:
print(f"Training instances: {len(train.data)}")
print(f"Test instances:     {len(test.data)}")

Check target labels of some of the train and test instances.

In [None]:
print(train.target[:10])
print(test.target[:10])

## Train a model

Bag-of-words document representation, using raw term counts.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train.data)

Check dimensionality (instances x features).

In [None]:
print(X_train_counts.shape)

Check vocabulary (sample 10 terms).

In [None]:
from itertools import islice

# Only look at the first 10 vocabulary entries instead of looping over all of them.
for term, term_id in islice(count_vect.vocabulary_.items(), 10):
    print(f"{term} (ID: {term_id})")
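To make the bag-of-words representation more concrete, here is a minimal sketch on two made-up sentences (the `toy_` names and the sentences are illustrative assumptions, not part of the exercise data). Each row of the resulting matrix is a document and each column is a term. Note that the default tokenizer only keeps tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only for illustration.
toy_docs = ["the cat sat on the mat", "the dog sat"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# List the terms in column order (term IDs are assigned alphabetically).
toy_terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_terms)

# Dense view of the sparse count matrix: one row per document.
print(toy_counts.toarray())
```

In the first row, the column for "the" holds a 2, because that term occurs twice in the first document.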

Learn a Naive Bayes model on the training data (by default it uses Laplace smoothing with alpha=1).

In [None]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB(alpha=1.0)
classifier.fit(X_train_counts, train.target)
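The effect of Laplace smoothing can be checked by hand on a tiny example. The sketch below uses a made-up count matrix (all names and numbers are illustrative assumptions) and compares a manual estimate of the smoothed term probabilities with what `MultinomialNB` stores in `feature_log_prob_`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up term-count matrix: 3 documents x 3 terms, two classes.
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3]])
y_toy = np.array([0, 0, 1])

alpha = 1.0
nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Manual estimate for class 0:
# (term count in class + alpha) / (total count in class + alpha * vocabulary size)
class0_counts = X_toy[y_toy == 0].sum(axis=0)
manual = (class0_counts + alpha) / (class0_counts.sum() + alpha * X_toy.shape[1])

print(manual)                            # smoothed P(term | class 0)
print(np.exp(nb.feature_log_prob_[0]))   # the same values, as estimated by sklearn
```

The third term never occurs in class 0, yet its smoothed probability is nonzero. This is what keeps a single unseen term from driving a document's class probability to zero.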

## Apply the model

First, extract the same feature representation by re-using the `CountVectorizer` from before.

In [None]:
X_test_counts = count_vect.transform(test.data)

Check dimensionality (documents x features).

In [None]:
print(X_test_counts.shape)
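The number of features is the same as for the training data because `transform` maps documents onto the vocabulary learned during fitting; terms that were not seen at that point are ignored. A small sketch with made-up sentences (the names and data here are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vect = CountVectorizer()
toy_vect.fit(["the cat sat", "the dog sat"])  # learned vocabulary: cat, dog, sat, the

# "bird" and "flew" were never seen during fitting, so transform() drops them.
new_counts = toy_vect.transform(["the bird flew", "the cat sat"])

print(new_counts.shape)      # the number of columns equals the vocabulary size
print(new_counts.toarray())
```

Calling `fit_transform` on the test data instead would build a new vocabulary, and the resulting columns would no longer line up with what the classifier was trained on.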

Then, predict labels for test instances.

In [None]:
predicted = classifier.predict(X_test_counts)

Look at some of the predicted labels.

In [None]:
print(predicted[:10])

## Evaluate model performance

We use accuracy (the fraction of test instances that are classified correctly) as our evaluation measure here.

In [None]:
from sklearn import metrics

print(f"{metrics.accuracy_score(test.target, predicted):.3f}")
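As a sanity check on what this number means, accuracy can be computed by hand as the fraction of matching labels. The sketch below uses made-up label arrays (an illustrative assumption, not output from the model above) and also prints a confusion matrix, which shows which classes get mixed up.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels for five instances.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

# Accuracy is the share of positions where prediction and truth agree.
manual_accuracy = np.mean(y_true == y_pred)
print(manual_accuracy)
print(metrics.accuracy_score(y_true, y_pred))

# Rows correspond to true classes, columns to predicted classes.
toy_confusion = metrics.confusion_matrix(y_true, y_pred)
print(toy_confusion)
```

On the real data, `metrics.confusion_matrix(test.target, predicted)` may help reveal which newsgroups are hardest to tell apart.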

## Exercise

1) Use TF weighting instead of the raw counts. (See the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) for `TfidfTransformer` usage.)

2) Try at least one different classifier, e.g., [linear SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC) (or [other SVMs](https://scikit-learn.org/stable/modules/svm.html#svm-classification)).

3) Record the results you got in the table below. How far can you push accuracy?

### Solution

We build a pipeline for each row in the table, then run and evaluate them in a single loop.

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

#### Naive Bayes variants

In [None]:
pipeline_nb_raw = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])

In [None]:
pipeline_nb_tf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', MultinomialNB()),
])
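It may help to see what `TfidfTransformer(use_idf=False)` actually computes. With its default settings (`norm="l2"`), it divides each document's count vector by its Euclidean length, so the "TF" weights are length-normalized counts rather than relative frequencies. A minimal sketch on a made-up count matrix (names and numbers are illustrative assumptions):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up counts: 2 documents x 3 terms.
counts = np.array([[3, 4, 0],
                   [1, 0, 0]])

tf = TfidfTransformer(use_idf=False).fit_transform(csr_matrix(counts)).toarray()

# Manual L2 normalization of each row.
manual_tf = counts / np.linalg.norm(counts, axis=1, keepdims=True)

print(tf)
print(manual_tf)
```

Both printouts should agree, with every row having unit length.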

#### SVM variants

In [None]:
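# SGDClassifier with its default hinge loss is a linear SVM trained with SGD.
# No random_state is set, so accuracy may vary slightly between runs.
pipeline_svm_raw = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', SGDClassifier()),
])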

In [None]:
pipeline_svm_tf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', SGDClassifier()),
])
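The exercise text also suggests `LinearSVC` as a linear SVM. As a hedged sketch of how it could be swapped in for `SGDClassifier`, the pipeline below is fitted on a handful of made-up documents (the documents, labels, and the name `pipeline_linearsvc_tf` are illustrative assumptions). On the real data you would fit it on `train.data` and `train.target` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up documents: label 0 = PC hardware, label 1 = Mac hardware.
toy_docs = [
    "cpu motherboard ram",
    "mac powerbook apple",
    "ram disk cpu",
    "apple mac monitor",
]
toy_labels = [0, 1, 0, 1]

pipeline_linearsvc_tf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer(use_idf=False)),
    ("clf", LinearSVC()),
])
pipeline_linearsvc_tf.fit(toy_docs, toy_labels)

toy_predicted = pipeline_linearsvc_tf.predict(["cpu ram", "apple mac"])
print(toy_predicted)
```

Whether `LinearSVC` or `SGDClassifier` performs better on the newsgroups data is an empirical question; this sketch only shows the mechanics.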

In [None]:
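# Label each pipeline so the printed accuracies can be matched to table rows.
pipelines = {
    "NB, raw counts": pipeline_nb_raw,
    "NB, TF": pipeline_nb_tf,
    "SVM, raw counts": pipeline_svm_raw,
    "SVM, TF": pipeline_svm_tf,
}
for name, pipeline in pipelines.items():
    pipeline.fit(train.data, train.target)
    predicted = pipeline.predict(test.data)
    print(f"{name}: {metrics.accuracy_score(test.target, predicted):.3f}")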

### Results

| Model | Term weighting | Accuracy |
| -- | -- |:--:|
| Naive Bayes | Raw counts | 0.864 |
| Naive Bayes | TF | 0.667 |
| SVM | Raw counts | 0.819 |
| SVM | TF | 0.851 |
| ... | ... | ... | 


## Optional exercise

Can you push performance even further? You could try, for example, more sophisticated text preprocessing (tokenization, stop-word removal, and stemming) using [NLTK](https://www.nltk.org/) (which is part of the Anaconda distribution). See, e.g., [this article](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a) for some hints.
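As one possible starting point, stop-word removal can be tried without NLTK by using scikit-learn's built-in English stop-word list. This is a swapped-in alternative to the NLTK route suggested above, and the example sentence is made up. Stemming is not covered by this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentence, used only for illustration.
docs = ["The monitor is connected to the motherboard"]

plain = CountVectorizer().fit(docs)
no_stop = CountVectorizer(stop_words="english").fit(docs)

# Compare the vocabularies with and without stop-word filtering.
print(sorted(plain.vocabulary_))
print(sorted(no_stop.vocabulary_))
```

The same `stop_words="english"` argument could be passed to the `CountVectorizer` step of any of the pipelines above. Whether it improves accuracy on this dataset is something to test rather than assume.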