# Exercise #3: Building a multiclass classifier using scikit-learn

In this exercise we will be using the 20 Newsgroups collection.  
Official description, quoted from the [website](http://people.csail.mit.edu/jrennie/20Newsgroups/):

> The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

To keep execution times short, we will work with only 4 of the 20 categories available in the dataset:

In [None]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

## Loading data

We load the documents from those categories, divided into train and test sets.

In [None]:
from sklearn.datasets import fetch_20newsgroups
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

Check the categories that were loaded

In [None]:
train.target_names

Check the size of training and test splits

In [None]:
print("Training instances: {}".format(len(train.data)))
print("Test instances:     {}".format(len(test.data)))

Check target labels

In [None]:
for t in train.target[:10]:
    print(t, train.target_names[t])

## Building a model

Create a bag-of-words representation, where each document becomes a sparse vector of term counts

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train.data)
X_train_counts.shape
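
To make the shape above easier to read, here is a minimal sketch of what `CountVectorizer` produces on a two-document toy corpus. The corpus and the `toy_*` names are made up for illustration and are not part of the exercise data. With default settings the vectorizer lowercases text and keeps tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not taken from 20 Newsgroups)
toy_docs = ["the cat sat", "the cat sat on the mat"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)  # sparse matrix: documents x vocabulary terms

# Vocabulary terms listed in column order
toy_terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_terms)             # one entry per column
print(toy_counts.toarray())  # one row per document, term counts as values
```

Each row is a document and each column a vocabulary term, so `X_train_counts.shape` reads as (number of training documents, vocabulary size).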

Learn a model on the training data

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Create the classifier, then fit it once on the training counts
classifier = MultinomialNB()
classifier.fit(X_train_counts, train.target)
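
As a rough illustration of what fitting does, the sketch below trains `MultinomialNB` on a tiny hand-made count matrix. The data and the `toy_*` names are hypothetical: term 0 occurs only in class 0 and term 1 only in class 1, so the model should assign a new document to the class whose term it contains.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 4 documents x 2 terms
toy_X = np.array([[2, 0], [3, 0], [0, 2], [0, 3]])
toy_y = np.array([0, 0, 1, 1])  # class label of each document

toy_clf = MultinomialNB()  # default additive smoothing (alpha=1.0)
toy_clf.fit(toy_X, toy_y)

# Two unseen documents, each containing a single term
toy_pred = toy_clf.predict(np.array([[1, 0], [0, 1]]))
print(toy_pred)
```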

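Apply the model to the test data, first extracting the same feature representation with the already fitted vectorizer (`transform`, not `fit_transform`, so that the columns match those seen during training)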

In [None]:
X_test_counts = count_vect.transform(test.data)
predicted = classifier.predict(X_test_counts)
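
The sketch below, on made-up sentences, illustrates why `transform` is used here: it reuses the vocabulary learned during fitting, so words that never appeared in the fitted data are dropped and the column count stays fixed. The variable names and sentences are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
# Fit on two hypothetical "training" documents (6 distinct terms)
vect.fit(["graphics card driver", "doctor prescribed medicine"])

# transform() keeps the fitted vocabulary; "new" and "unseen" are not in it
row = vect.transform(["new graphics card unseen"])
print(row.shape)  # one row, one column per fitted vocabulary term
print(row.sum())  # only "graphics" and "card" are counted
```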

Evaluate model performance

In [None]:
# TODO compare the predicted labels (`predicted`) against the gold test labels (`test.target`)
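
One possible way to approach this step is sketched below on hypothetical label lists, using functions from `sklearn.metrics`. The `gold` and `pred` values are invented; in the exercise the same calls would presumably take `test.target` and `predicted`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical gold and predicted labels for 6 documents, 3 classes
gold = [0, 1, 2, 2, 1, 0]
pred = [0, 1, 2, 1, 1, 0]

acc = accuracy_score(gold, pred)   # fraction of labels that match
cm = confusion_matrix(gold, pred)  # rows: gold class, columns: predicted class

print("Accuracy: {:.3f}".format(acc))
print(cm)
print(classification_report(gold, pred))  # per-class precision, recall, F1
```

Accuracy gives a single summary number, while the confusion matrix and the per-class report show which categories are mistaken for one another.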