# Exercise #2: Experimenting with text classification

In this exercise we will be using the [20 Newsgroups collection](http://people.csail.mit.edu/jrennie/20Newsgroups/).

To keep execution times short, we work with a subset of the dataset containing only 5 of its 20 categories:

In [13]:
categories = ['alt.atheism', 'soc.religion.christian', 'talk.religion.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware']

## Loading data

We load the documents from those categories, divided into train and test sets.

In [14]:
from sklearn.datasets import fetch_20newsgroups
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

We check which categories were loaded. Note that `target_names` lists them in alphabetical order, not in the order given above:

In [15]:
train.target_names

['alt.atheism',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'soc.religion.christian',
 'talk.religion.misc']

Check the size of the training and test splits:

In [16]:
print("Training instances: {}".format(len(train.data)))
print("Test instances:     {}".format(len(test.data)))

Training instances: 2624
Test instances:     1745


## Building a model

### Simply using word counts

Create a bag-of-words representation

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train.data)
X_train_counts.shape

(2624, 35019)
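To make the shape above concrete, here is a minimal sketch of what `CountVectorizer` produces, using a hypothetical two-document toy corpus (the `toy_*` names are illustrative and not part of the exercise data). Each row is a document, each column a vocabulary term, and each cell a raw term count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the 20 Newsgroups data
toy_docs = ["the cat sat", "the cat sat on the mat"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)  # sparse matrix: documents x terms

# vocabulary_ maps each term to its column index; sort terms by that index
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

print(toy_vocab)             # terms in column order
print(toy_counts.toarray())  # one row of counts per document
```

With the default settings, the columns come out in alphabetical order (`cat`, `mat`, `on`, `sat`, `the`), and the second row holds a 2 in the `the` column because that word occurs twice in the second document.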

Learn a model on the training data

In [18]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train_counts, train.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

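Apply the model to the test data. The test documents must first be mapped to the same feature representation, so we call `transform` (not `fit_transform`) on the vectorizer that was fitted on the training data: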

In [19]:
X_test_counts = count_vect.transform(test.data)
predicted = classifier.predict(X_test_counts)
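One way to avoid mixing up `fit_transform` and `transform` is to chain the vectorizer and the classifier in a scikit-learn pipeline. The following is a minimal sketch on a hypothetical toy corpus (all `toy_*` names and documents are made up for illustration); the same pattern should apply to `train.data` and `test.data`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: label 0 = hardware, label 1 = religion
toy_train_docs = [
    "cpu motherboard ram",
    "ram disk cpu",
    "church prayer faith",
    "faith bible prayer",
]
toy_train_labels = [0, 0, 1, 1]
toy_test_docs = ["cpu disk", "bible church"]

# fit() learns the vocabulary and the classifier on the training documents;
# predict() reuses that vocabulary to transform the test documents.
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_train_docs, toy_train_labels)
toy_predicted = toy_pipeline.predict(toy_test_docs)

print(toy_predicted)
```

Because the pipeline owns both steps, the test documents can never be vectorized with a vocabulary different from the one the classifier was trained on.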

Evaluate model performance

In [20]:
from sklearn import metrics
metrics.accuracy_score(test.target, predicted)

0.8641833810888252
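Accuracy is the fraction of test documents whose predicted label matches the true label. A single number does not show which categories get confused with each other, so a confusion matrix can be a useful complement. The sketch below uses hypothetical label arrays (`y_true` and `y_pred` are made up) rather than the actual predictions.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels for six documents, three classes
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

acc = metrics.accuracy_score(y_true, y_pred)
manual_acc = float(np.mean(y_true == y_pred))  # same quantity, computed by hand

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)

print(acc, manual_acc)
print(cm)
```

On the real data, `metrics.confusion_matrix(test.target, predicted)` would show, for example, how often documents from the closely related religion categories are assigned to one another.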

## Exercise

1) Use TF weighting and TF-IDF weighting instead of the raw counts. See [this notebook](../../code/text_feature_extraction.ipynb) or the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) for `TfidfTransformer` usage.

2) Try at least one different classifier, e.g., [linear SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC) (or [other SVMs](https://scikit-learn.org/stable/modules/svm.html#svm-classification)).

3) Record the results you got in the table below. How far can you push accuracy?
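As a starting point, the sketch below shows the relevant API calls on a hypothetical toy corpus (the `tiny_*` names and documents are invented for illustration). It assumes that "TF weighting" means `TfidfTransformer` with `use_idf=False`, which keeps the term frequencies and the default L2 row normalisation but drops the IDF factor. Running the experiments on the newsgroups data is left as the exercise.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: label 0 = hardware, label 1 = religion
tiny_docs = ["cpu ram cpu", "ram disk", "faith prayer faith", "prayer bible"]
tiny_labels = [0, 0, 1, 1]

tiny_counts = CountVectorizer().fit_transform(tiny_docs)

# TF only: the IDF factor is disabled, rows are L2-normalised by default
tiny_tf = TfidfTransformer(use_idf=False).fit_transform(tiny_counts)

# TF-IDF: default settings
tiny_tfidf = TfidfTransformer().fit_transform(tiny_counts)

# Length of each document vector after TF weighting
tiny_tf_norms = np.sqrt(tiny_tf.multiply(tiny_tf).sum(axis=1))

# A different classifier: linear SVM on TF-IDF features, chained in a pipeline
tiny_svm = make_pipeline(CountVectorizer(), TfidfTransformer(), LinearSVC())
tiny_svm.fit(tiny_docs, tiny_labels)
tiny_svm_pred = tiny_svm.predict(["disk cpu", "bible faith"])

print(tiny_tf.shape, tiny_tfidf.shape)
print(tiny_svm_pred)
```

Swapping `TfidfTransformer()` for `TfidfTransformer(use_idf=False)`, or `LinearSVC()` for `MultinomialNB()`, in the pipeline gives the other combinations listed in the results table.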

In [21]:
# TODO: write code here

### Results

| Model | Term weighting | Accuracy |
| -- | -- |:--:|
| Naive Bayes | Raw counts | 0.864 |
| Naive Bayes | TF | |
| Naive Bayes | TF-IDF | |
| SVM | Raw counts | |
| SVM | TF | |
| SVM | TF-IDF | |
| ... | ... | ... | 
