# Building a text classifier using scikit-learn

This example uses the 20 Newsgroups collection.  
The official description, quoted from the [website](http://people.csail.mit.edu/jrennie/20Newsgroups/):

> The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

To keep execution times short, we work with a subset of the dataset containing only 4 of the 20 available categories:

In [1]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

## Loading data

We load the documents from those categories, already divided into training and test sets:

In [3]:
from sklearn.datasets import fetch_20newsgroups
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

Check which categories were loaded:

In [4]:
train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

Check the size of the training and test splits:

In [5]:
print("Training instances: {}".format(len(train.data)))
print("Test instances:     {}".format(len(test.data)))

Training instances: 2257
Test instances:     1502


Check the first ten target labels and the category name each one maps to:

In [7]:
for t in train.target[:10]:
    print(t, train.target_names[t])

1 comp.graphics
1 comp.graphics
3 soc.religion.christian
3 soc.religion.christian
3 soc.religion.christian
3 soc.religion.christian
3 soc.religion.christian
2 sci.med
2 sci.med
2 sci.med


## Building a model

### Simply using word counts

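Create a bag-of-words representation, in which each document becomes a row of token counts and each column corresponds to one token in the vocabulary: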

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train.data)
X_train_counts.shape

(2257, 35788)
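To make the shape above easier to read, here is a minimal sketch of what `CountVectorizer` produces on a tiny corpus. The two sentences and the names `toy_docs`, `toy_vect`, `toy_counts`, and `vocab` are made up for illustration and are not part of the 20 Newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate the representation
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index; list the tokens in column order
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

print(vocab)
print(toy_counts.toarray())  # dense view is fine for a corpus this small
```

The matrix has one row per document and one column per distinct token, which is why the training matrix above has 2257 rows and 35788 columns.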

Train a multinomial naive Bayes model on the training data:

In [9]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

classifier = MultinomialNB()
classifier.fit(X_train_counts, train.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

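Apply the model to the test data, first extracting the same feature representation with the already-fitted vectorizer (`transform`, not `fit_transform`, so that the test columns match the training vocabulary):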

In [10]:
X_test_counts = count_vect.transform(test.data)
predicted = classifier.predict(X_test_counts)
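One way to avoid having to remember the `fit_transform` / `transform` distinction is to chain the vectorizer and the classifier in a scikit-learn `Pipeline`. The sketch below is an alternative arrangement, not the approach used in this notebook, and it runs on a hypothetical four-document toy set (`toy_train_docs`, `toy_train_labels`, `toy_test_docs`) rather than on the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for train.data / train.target
toy_train_docs = [
    "opengl renders the image",
    "the doctor prescribed medicine",
    "image rendering with opengl",
    "medicine from the doctor",
]
toy_train_labels = ["comp.graphics", "sci.med", "comp.graphics", "sci.med"]
toy_test_docs = ["opengl image", "doctor medicine"]

# The pipeline fits the vectorizer on training text only, then reuses
# the learned vocabulary when predicting on new text
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(toy_train_docs, toy_train_labels)

toy_predicted = text_clf.predict(toy_test_docs).tolist()
print(toy_predicted)
```

With the real data, the same pipeline would be fitted with `train.data` and `train.target` and applied to `test.data`.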

Evaluate the model's accuracy on the test set:

In [11]:
from sklearn import metrics

metrics.accuracy_score(test.target, predicted)

0.93408788282290278
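Accuracy is a single number and hides which categories get confused with one another. As a rough sketch, `sklearn.metrics` also offers a confusion matrix and a per-class report. The label lists `y_true` and `y_pred` below are invented stand-ins for `test.target` and `predicted`, so the numbers say nothing about the model above.

```python
from sklearn import metrics

# Hypothetical labels standing in for test.target and predicted
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = metrics.accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)

print(acc)
print(cm)
# Precision, recall and F1 for each class
print(metrics.classification_report(y_true, y_pred))
```

On the notebook's data, passing `target_names=test.target_names` to `classification_report` would label each row with its newsgroup name.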