# Experiments with text classifiers in sklearn

In this exercise we experiment with various classification algorithms in scikit-learn using the [20 Newsgroups collection](http://people.csail.mit.edu/jrennie/20Newsgroups/).

The first part of the notebook walks through a detailed example of text classification with sklearn (based on [scikit-learn's "Working with text data" tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)).
The actual exercise is at the bottom, where you are asked to run various experiments.

## Load data

To keep execution times short, we work on a subset of the data containing only 5 of the 20 available categories:

In [1]:
categories = [
    "alt.atheism",
    "soc.religion.christian", 
    "talk.religion.misc",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware"
]

We load the documents from those categories, divided into train and test sets.

In [2]:
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=123)
test = fetch_20newsgroups(subset="test", categories=categories, shuffle=True, random_state=123)

Check which categories got loaded.

In [3]:
print(train.target_names)

['alt.atheism', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'soc.religion.christian', 'talk.religion.misc']


Check the size of training and test splits.

In [4]:
print(f"Training instances: {len(train.data)}")
print(f"Test instances:     {len(test.data)}")

Training instances: 2624
Test instances:     1745


Check target labels of some of the train and test instances.

In [5]:
print(train.target[:10])
print(test.target[:10])

[3 2 1 1 2 3 3 3 1 3]
[2 4 1 3 1 0 4 1 0 1]


## Train a model

Build a bag-of-words document representation using raw term counts.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train.data)

Check dimensionality (instances x features).

In [7]:
print(X_train_counts.shape)

(2624, 35019)


Check vocabulary (sample 10 terms).

In [8]:
from itertools import islice

# Stop after the first 10 entries instead of looping over the whole vocabulary.
for term, term_id in islice(count_vect.vocabulary_.items(), 10):
    print(f"{term} (ID: {term_id})")

from (ID: 15098)
news (ID: 22831)
cbnewsk (ID: 8460)
att (ID: 6168)
com (ID: 9375)
subject (ID: 30550)
re (ID: 26598)
bible (ID: 7004)
unsuitable (ID: 33054)
for (ID: 14816)
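
To make the term-to-ID mapping more concrete, here is a minimal sketch on a made-up two-document corpus (`toy_docs`, `toy_vect` and `terms_by_id` are illustrative names, not part of the notebook). It suggests why the IDs above look unordered: `vocabulary_` appears to list terms in the order they were first seen, while the IDs are column indices assigned alphabetically. Note that the default tokenizer drops one-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus, for illustration only.
toy_docs = [
    "the bible is a book",
    "the mac is a computer",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# Sort terms by their column ID so they line up with the count matrix.
terms_by_id = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(terms_by_id)
print(toy_counts.toarray())
```

Each row of the printed matrix is one document, and each column holds the raw count of the term with that ID.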


Learn a multinomial Naive Bayes model on the training data. By default it uses Laplace (add-one) smoothing, i.e., `alpha=1.0`, which we set explicitly here.

In [9]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB(alpha=1.0)
classifier.fit(X_train_counts, train.target)

MultinomialNB()
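
As a rough illustration of what Laplace smoothing does, the sketch below recomputes the smoothed per-class term probabilities by hand and compares them with the fitted model. The count matrix `X_toy` and labels `y_toy` are made-up numbers, not the newsgroups data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up term-count matrix: 3 documents x 3 terms.
X_toy = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [0, 1, 3],
])
y_toy = np.array([0, 0, 1])

alpha = 1.0
nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Row c holds the term counts summed over all documents of class c.
counts = nb.feature_count_

# Laplace smoothing: add alpha to every term count, then normalise per class.
n_terms = counts.shape[1]
smoothed = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * n_terms)

print(smoothed)
print(np.allclose(np.log(smoothed), nb.feature_log_prob_))
```

Without smoothing, the third term would get probability zero in class 0, and any document containing it could never be assigned to that class.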

## Apply the model

First, extract the same feature representation by re-using the `CountVectorizer` from before.

In [10]:
X_test_counts = count_vect.transform(test.data)

Check dimensionality (instances x features). The number of features must match the training matrix.

In [11]:
print(X_test_counts.shape)

(1745, 35019)
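
The feature count stays the same because `transform` maps new text onto the vocabulary learned during `fit_transform`. A small sketch with made-up sentences (the variable names are illustrative) shows that terms unseen at fitting time are simply ignored:

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()

# The vocabulary is learned from these two made-up documents only.
X_fit = vect.fit_transform(["mac hardware question", "bible study question"])

# "unseen" and "keyboard" are not in the vocabulary, so they get no column.
X_new = vect.transform(["unseen mac keyboard question"])

print(X_fit.shape, X_new.shape)
print(X_new.toarray())
```

Calling `fit_transform` on the test data instead would build a different vocabulary, and the classifier's learned feature indices would no longer line up.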


Then, predict labels for test instances.

In [12]:
predicted = classifier.predict(X_test_counts)

Look at some of the predicted labels.

In [13]:
print(predicted[:10])

[2 4 1 3 1 0 3 2 0 1]


## Evaluate model performance

We use accuracy, i.e., the fraction of test instances that are labelled correctly, as our measure here.

In [14]:
from sklearn import metrics

print(f"{metrics.accuracy_score(test.target, predicted):.3f}")

0.864
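
To see what this number means, the sketch below computes accuracy by hand for the first ten test instances only, using the gold and predicted labels printed earlier in the notebook. The result describes just those ten instances, not the full test set.

```python
import numpy as np
from sklearn import metrics

# First ten gold and predicted labels, copied from the outputs above.
gold = np.array([2, 4, 1, 3, 1, 0, 4, 1, 0, 1])
pred = np.array([2, 4, 1, 3, 1, 0, 3, 2, 0, 1])

# Accuracy is the fraction of positions where prediction and gold label agree.
manual_accuracy = np.mean(gold == pred)

print(manual_accuracy)
print(metrics.accuracy_score(gold, pred))
```

Eight of the ten labels agree, so both lines report 0.8.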


## Exercise

1) Use TF weighting and TF-IDF weighting instead of the raw counts. (See the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) for `TfidfTransformer` usage.)

2) Try at least one different classifier, e.g., [linear SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC) (or [other SVMs](https://scikit-learn.org/stable/modules/svm.html#svm-classification)).

3) Record the results you got in the table below. How far can you push accuracy?
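
As a starting point, here is one possible way to wire these pieces together, sketched on a made-up four-document corpus rather than the newsgroups data. The helper `build_pipeline` and the step names are hypothetical. It assumes that `TfidfTransformer(use_idf=False)` is an acceptable reading of "TF weighting"; with the default `norm="l2"` this yields length-normalised term frequencies.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus; label 0 = hardware, label 1 = religion.
toy_docs = [
    "mac hardware disk drive",
    "pc hardware disk controller",
    "bible church faith",
    "faith church prayer",
]
toy_labels = [0, 0, 1, 1]


def build_pipeline(use_idf):
    """Raw counts -> TF (use_idf=False) or TF-IDF (use_idf=True) -> linear SVM."""
    return Pipeline([
        ("counts", CountVectorizer()),
        ("weights", TfidfTransformer(use_idf=use_idf)),
        ("clf", LinearSVC()),
    ])


for use_idf in (False, True):
    model = build_pipeline(use_idf).fit(toy_docs, toy_labels)
    print(use_idf, model.predict(["disk drive for a mac", "church prayer"]))
```

A pipeline keeps vectorizer, weighting, and classifier together, so the same fitted steps are applied to training and test text. For the actual exercise, fit on `train.data` and `train.target` and evaluate on the test split.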

In [16]:
# TODO Add your code here.

### Results

| Model | Term weighting | Accuracy |
| -- | -- |:--:|
| Naive Bayes | Raw counts | 0.864 |
| Naive Bayes | TF | |
| Naive Bayes | TF-IDF | |
| SVM | Raw counts | |
| SVM | TF | |
| SVM | TF-IDF | |
| ... | ... | ... | 


## Optional exercise

Can you push performance even further? You could try, for example, more sophisticated text preprocessing (tokenization, stopword removal, and stemming) using [NLTK](https://www.nltk.org/) (which is part of the Anaconda distribution). See, e.g., [this article](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a) for some hints.
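
As a first step before reaching for NLTK, the sketch below shows stopword removal only, using scikit-learn's built-in English stopword list in place of NLTK's. The sentence is made up, and stemming is not covered here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example sentence.
doc = ["The disk drive is not recognised by the controller"]

plain = CountVectorizer().fit(doc)
no_stop = CountVectorizer(stop_words="english").fit(doc)

print(sorted(plain.vocabulary_))
print(sorted(no_stop.vocabulary_))
```

The second vocabulary is smaller because common function words such as "the" are filtered out before counting. Whether this helps accuracy on the newsgroups data is something to test and record in the results table.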