This example shows how scikit-learn can be used to classify documents by topic using a bag-of-words approach. It stores the features in a scipy.sparse matrix and demonstrates various classifiers that handle sparse matrices efficiently.

The dataset used in this example is the 20 newsgroups dataset. It will be automatically downloaded, then cached.

The bar plot indicates the accuracy, training time (normalized) and test time (normalized) of each classifier.
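
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch. It is not how scikit-learn implements its vectorizers; the helper name `bag_of_words` and the dict-per-row layout are assumptions chosen for illustration. Each document becomes a mapping from column index to count, and only non-zero entries are stored, which is the property a scipy.sparse matrix exploits.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, rows); each row maps a column index to a count.

    Hypothetical helper for illustration only, not a scikit-learn API.
    """
    vocabulary = {}
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        row = {}
        for token, count in counts.items():
            # Unseen tokens get the next free column index.
            col = vocabulary.setdefault(token, len(vocabulary))
            row[col] = count
        rows.append(row)
    return vocabulary, rows


docs = ["the cat sat", "the dog sat on the mat"]
vocabulary, rows = bag_of_words(docs)

# Only non-zero counts are stored, so a row's size depends on the number
# of distinct tokens in the document, not on the vocabulary size.
print(len(vocabulary), rows)
```

With a realistic vocabulary of tens of thousands of tokens, almost every entry of a dense document-term matrix would be zero, which is why a sparse representation is used.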



### Version

In [1]:
import sklearn
sklearn.__version__

'0.18.1'

### Imports

In [2]:
# __future__ imports must come before any other statement
from __future__ import print_function

import plotly.plotly as py
import plotly.graph_objs as go

import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import density
from sklearn import metrics

### Calculations

Display progress logs on stdout

In [3]:
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')


Parse command-line arguments

In [4]:
op = OptionParser()
op.add_option("--report",
              action="store_true", dest="print_report",
              help="Print a detailed classification report.")
op.add_option("--chi2_select",
              action="store", type="int", dest="select_chi2",
              help="Select some number of features using a chi-squared test")
op.add_option("--confusion_matrix",
              action="store_true", dest="print_cm",
              help="Print the confusion matrix.")
op.add_option("--top10",
              action="store_true", dest="print_top10",
              help="Print ten most discriminative terms per class"
                   " for every classifier.")
op.add_option("--all_categories",
              action="store_true", dest="all_categories",
              help="Whether to use all categories or not.")
op.add_option("--use_hashing",
              action="store_true",
              help="Use a hashing vectorizer.")
op.add_option("--n_features",
              action="store", type=int, default=2 ** 16,
              help="n_features when using the hashing vectorizer.")
op.add_option("--filtered",
              action="store_true",
              help="Remove newsgroup information that is easily overfit: "
                   "headers, signatures, and quoting.")

op.print_help()

Usage: __main__.py [options]

Options:
  -h, --help            show this help message and exit
  --report              Print a detailed classification report.
  --chi2_select=SELECT_CHI2
                        Select some number of features using a chi-squared
                        test
  --confusion_matrix    Print the confusion matrix.
  --top10               Print ten most discriminative terms per class for
                        every classifier.
  --all_categories      Whether to use all categories or not.
  --use_hashing         Use a hashing vectorizer.
  --n_features=N_FEATURES
                        n_features when using the hashing vectorizer.
  --filtered            Remove newsgroup information that is easily overfit:
                        headers, signatures, and quoting.


To read these options from the command line, add

    (opts, args) = op.parse_args()

and then set the following variables:

    all_categories = opts.all_categories
    filtered = opts.filtered
    use_hashing = opts.use_hashing
    n_features = opts.n_features
    select_chi2 = opts.select_chi2
    print_cm = opts.print_cm
    print_top10 = opts.print_top10
    print_report = opts.print_report

For this tutorial we are taking these values as:

In [5]:
all_categories = True
filtered = True
use_hashing = True
n_features = 2 ** 16
select_chi2 = 10
print_cm = True
print_top10 = True
print_report = True
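
The hard-coded values above stand in for parsed options. As a sketch of the alternative, `parse_args` also accepts an explicit argument list, which is convenient in a notebook where `sys.argv` belongs to the kernel rather than to the script. The parser below is a reduced copy with only four of the options, and the argument list is an assumption chosen for illustration.

```python
from optparse import OptionParser

# Reduced parser with a subset of the options defined earlier.
op = OptionParser()
op.add_option("--chi2_select", action="store", type="int", dest="select_chi2")
op.add_option("--use_hashing", action="store_true")
op.add_option("--n_features", action="store", type="int", default=2 ** 16)
op.add_option("--filtered", action="store_true")

# An explicit list is parsed instead of sys.argv[1:].
opts, args = op.parse_args(["--use_hashing", "--chi2_select", "10"])

use_hashing = opts.use_hashing   # True: flag was passed
select_chi2 = opts.select_chi2   # 10: converted to int by optparse
n_features = opts.n_features     # 65536: the declared default
filtered = opts.filtered         # None: flag not passed, no default declared
```

Flags that are not passed and have no declared default come back as `None`, which is falsy, so checks such as `if filtered:` still behave as expected.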

Load some categories from the training set

In [6]:
if all_categories:
    categories = None
else:
    categories = [
        'alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space',
    ]

In [7]:
if filtered:
    remove = ('headers', 'footers', 'quotes')
else:
    remove = ()

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
print('data loaded')

Loading 20 newsgroups dataset for categories:
all
data loaded


The order of labels in `target_names` can differ from the order in `categories`

In [8]:
target_names = data_train.target_names

def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

In [9]:
data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print("%d documents - %0.3fMB (training set)" % (
    len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(data_test.data), data_test_size_mb))

11314 documents - 13.782MB (training set)
7532 documents - 8.262MB (test set)


Extract the labels, then vectorize the training set and the test set

In [10]:
y_train, y_test = data_train.target, data_test.target

print("Extracting features from the training data using a sparse vectorizer")
t0 = time()
if use_hashing:
    vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                                   n_features=n_features)
    X_train = vectorizer.transform(data_train.data)
else:
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    X_train = vectorizer.fit_transform(data_train.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_train_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_train.shape)
print()

print("Extracting features from the test data using the same vectorizer")
t0 = time()
X_test = vectorizer.transform(data_test.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_test_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()

Extracting features from the training data using a sparse vectorizer
done in 3.209865s at 4.294MB/s
n_samples: 11314, n_features: 65536

Extracting features from the test data using the same vectorizer
done in 1.774443s at 4.656MB/s
n_samples: 7532, n_features: 65536
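
The hashing branch above calls `transform` without fitting, because a hashing vectorizer keeps no vocabulary. The following is a minimal sketch of the underlying hashing trick, assuming CRC32 as the hash function for the sake of a stdlib-only example; scikit-learn's `HashingVectorizer` uses a different hash and has more options, so this is an illustration rather than its implementation. The helper name `hash_vectorize` is hypothetical.

```python
import zlib


def hash_vectorize(doc, n_features=2 ** 16):
    """Map a document to {column: count} using the hashing trick.

    Hypothetical helper; CRC32 is an assumption made for illustration.
    """
    row = {}
    for token in doc.lower().split():
        # The column is computed from the token itself: no vocabulary
        # is stored, so nothing has to be learned from the data.
        col = zlib.crc32(token.encode("utf-8")) % n_features
        row[col] = row.get(col, 0) + 1
    return row


row_a = hash_vectorize("sparse features for sparse text")
row_b = hash_vectorize("sparse text")

# The same token always lands in the same column, in any document.
sparse_col = zlib.crc32(b"sparse") % 2 ** 16
print(sparse_col in row_a, sparse_col in row_b)
```

The trade-off is that the mapping cannot be inverted: distinct tokens may collide in one column, and there is no way to recover token strings from column indices.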



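Build the mapping from integer feature index to original token string. The hashing vectorizer keeps no such mapping, so `feature_names` is `None` in that case.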

In [11]:
if use_hashing:
    feature_names = None
else:
    feature_names = vectorizer.get_feature_names()

if select_chi2:
    print("Extracting %d best features by a chi-squared test" %
          select_chi2)
    t0 = time()
    ch2 = SelectKBest(chi2, k=select_chi2)
    X_train = ch2.fit_transform(X_train, y_train)
    X_test = ch2.transform(X_test)
    if feature_names:
        # keep selected feature names
        feature_names = [feature_names[i] for i
                         in ch2.get_support(indices=True)]
    print("done in %fs" % (time() - t0))
    print()

if feature_names:
    feature_names = np.asarray(feature_names)


def trim(s):
    """Trim string to fit on terminal (assuming 80-column display)"""
    return s if len(s) <= 80 else s[:77] + "..."


Extracting 10 best features by a chi-squared test
done in 0.427792s
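
As a rough illustration of what the chi-squared selection step scores, the sketch below computes a per-feature statistic from observed versus expected class totals and keeps the `k` highest-scoring columns. It is written from the textbook formulation in plain Python and is not scikit-learn's code; the helper names `chi2_scores` and `select_k_best` and the toy data are assumptions.

```python
def chi2_scores(X, y):
    """Chi-squared statistic of each non-negative feature against the labels."""
    classes = sorted(set(y))
    n_samples, n_features = len(X), len(X[0])
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for c in classes:
        members = [row for row, label in zip(X, y) if label == c]
        class_prob = len(members) / float(n_samples)
        for j in range(n_features):
            observed = sum(row[j] for row in members)
            # Expected total if the feature were independent of the class.
            expected = class_prob * feature_total[j]
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores


def select_k_best(scores, k):
    """Return the sorted column indices of the k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy counts: feature 0 occurs only in class 0, feature 1 is uniform.
X = [[3, 1], [2, 1], [0, 1], [0, 1]]
y = [0, 0, 1, 1]
scores = chi2_scores(X, y)
kept = select_k_best(scores, k=1)
print(scores, kept)
```

A feature spread evenly across classes scores zero and is dropped first, while a feature concentrated in one class scores high and is kept.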



Benchmark classifiers

In [12]:
def benchmark(clf):
    print('_' * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)

    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print("test time:  %0.3fs" % test_time)

    score = metrics.accuracy_score(y_test, pred)
    print("accuracy:   %0.3f" % score)

    if hasattr(clf, 'coef_'):
        print("dimensionality: %d" % clf.coef_.shape[1])
        print("density: %f" % density(clf.coef_))

        if print_top10 and feature_names is not None:
            print("top 10 keywords per class:")
            for i, label in enumerate(target_names):
                top10 = np.argsort(clf.coef_[i])[-10:]
                print(trim("%s: %s" % (label, " ".join(feature_names[top10]))))
        print()

    if print_report:
        print("classification report:")
        print(metrics.classification_report(y_test, pred,
                                            target_names=target_names))

    if print_cm:
        print("confusion matrix:")
        print(metrics.confusion_matrix(y_test, pred))

    print()
    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time
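
The `benchmark` function relies only on the `fit` and `predict` methods, so the same timing pattern works with any object that provides them. Below is a stripped-down sketch with a hypothetical majority-class baseline; the class `MajorityClassifier`, the helpers `accuracy` and `timed_benchmark`, and the toy data are all assumptions introduced for illustration.

```python
from collections import Counter
from time import time


class MajorityClassifier(object):
    """Hypothetical baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / float(len(y_true))


def timed_benchmark(clf, X_train, y_train, X_test, y_test):
    """Time fit and predict separately, as the benchmark function does."""
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0

    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0

    return type(clf).__name__, accuracy(y_test, pred), train_time, test_time


name, score, train_time, test_time = timed_benchmark(
    MajorityClassifier(), [[0], [1], [2]], [1, 1, 0], [[3], [4]], [1, 0])
print(name, score)
```

A baseline like this is a useful reference point when reading the accuracies reported for the real classifiers.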


In [13]:
results = []
for clf, name in (
        (RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier"),
        (Perceptron(n_iter=50), "Perceptron"),
        (PassiveAggressiveClassifier(n_iter=50), "Passive Aggressive"),
        (KNeighborsClassifier(n_neighbors=10), "kNN"),
        (RandomForestClassifier(n_estimators=100), "Random forest")):
    print('=' * 80)
    print(name)
    results.append(benchmark(clf))

for penalty in ["l2", "l1"]:
    print('=' * 80)
    print("%s penalty" % penalty.upper())
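    # Train Liblinear model
    # Note: loss='l2' is a deprecated alias of loss='squared_hinge';
    # scikit-learn emits a deprecation warning for it in this version.
    results.append(benchmark(LinearSVC(loss='l2', penalty=penalty,
                                       dual=False, tol=1e-3)))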

    # Train SGD model
    results.append(benchmark(SGDClassifier(alpha=.0001, n_iter=50,
                                           penalty=penalty)))



Ridge Classifier
________________________________________________________________________________
Training: 
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
        max_iter=None, normalize=False, random_state=None, solver='lsqr',
        tol=0.01)



In Ridge, only 'sag' solver can currently fit the intercept when X is sparse. Solver has been automatically changed into 'sag'.



train time: 0.216s
test time:  0.001s
accuracy:   0.186
dimensionality: 10
density: 1.000000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.00      0.00      0.00       389
 comp.os.ms-windows.misc       0.53      0.49      0.51       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.69      0.34      0.45       396
         rec.motorcycles       0.85      0.25      0.39       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.07      0.96      0.12       399
               sci.crypt       0.67      0.41      0.50       396
         sci.electronics


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



train time: 0.417s
test time:  0.001s
accuracy:   0.129
dimensionality: 10
density: 1.000000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.22      0.11      0.15       319
           comp.graphics       0.04      0.01      0.02       389
 comp.os.ms-windows.misc       0.00      0.00      0.00       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.07      0.02      0.03       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.68      0.33      0.44       396
         rec.motorcycles       0.91      0.25      0.39       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.00      0.00      0.00       399
               sci.crypt       0.87      0.33      0.48       396
         sci.electronics


loss='l2' has been deprecated in favor of loss='squared_hinge' as of 0.16. Backward compatibility for the loss='l2' will be removed in 1.0




test time:  0.002s
accuracy:   0.186
dimensionality: 10
density: 1.000000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.00      0.00      0.00       389
 comp.os.ms-windows.misc       0.52      0.49      0.51       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.70      0.33      0.45       396
         rec.motorcycles       0.85      0.26      0.40       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.00      0.00      0.00       399
               sci.crypt       0.67      0.41      0.51       396
         sci.electronics       0.00      0


loss='l2' has been deprecated in favor of loss='squared_hinge' as of 0.16. Backward compatibility for the loss='l2' will be removed in 1.0



train time: 0.433s
test time:  0.001s
accuracy:   0.162
dimensionality: 10
density: 0.455000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.00      0.00      0.00       389
 comp.os.ms-windows.misc       0.06      0.98      0.12       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.69      0.33      0.45       396
         rec.motorcycles       0.85      0.26      0.40       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.00      0.00      0.00       399
               sci.crypt       0.66      0.41      0.50       396
         sci.electronics
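
Many rows in the reports above show 0.00 precision and recall, and the warning explains that precision is ill-defined for labels that are never predicted. The sketch below works through that case on toy labels. The helper `precision_recall` is hypothetical and only mirrors the convention named in the warning (fall back to 0.0 when the denominator is zero).

```python
def precision_recall(y_true, y_pred, label):
    """Per-class precision and recall, using 0.0 when a ratio is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    # With no predicted samples, tp / predicted would be 0 / 0.
    precision = tp / float(predicted) if predicted else 0.0
    recall = tp / float(actual) if actual else 0.0
    return precision, recall


y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "b", "a", "b"]  # label "c" is never predicted

for label in ("a", "b", "c"):
    print(label, precision_recall(y_true, y_pred, label))
```

Label "a" has perfect recall but only half of its predictions are correct, while label "c" gets 0.0 for both scores because the classifier never outputs it.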

Train SGD with Elastic Net penalty

In [14]:
print('=' * 80)
print("Elastic-Net penalty")
results.append(benchmark(SGDClassifier(alpha=.0001, n_iter=50,
                                       penalty="elasticnet")))

Elastic-Net penalty
________________________________________________________________________________
Training: 
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=50, n_jobs=1,
       penalty='elasticnet', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
train time: 0.488s
test time:  0.001s
accuracy:   0.180
dimensionality: 10
density: 0.845000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.06      0.84      0.11       389
 comp.os.ms-windows.misc       0.52      0.49      0.50       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
   

Train NearestCentroid without threshold

In [15]:

print('=' * 80)
print("NearestCentroid (aka Rocchio classifier)")
results.append(benchmark(NearestCentroid()))


NearestCentroid (aka Rocchio classifier)
________________________________________________________________________________
Training: 
NearestCentroid(metric='euclidean', shrink_threshold=None)
train time: 0.020s
test time:  0.002s
accuracy:   0.180
classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.17      0.01      0.02       319
           comp.graphics       1.00      0.01      0.02       389
 comp.os.ms-windows.misc       0.57      0.48      0.52       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.22      0.02      0.04       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.70      0.34      0.45       396
         rec.motorcycles       0.90      0.25      0.39       398
      rec.sport.baseball       0.00      0.00      0.00       397
  

Train sparse Naive Bayes classifiers

In [16]:
print('=' * 80)
print("Naive Bayes")
results.append(benchmark(MultinomialNB(alpha=.01)))
results.append(benchmark(BernoulliNB(alpha=.01)))

Naive Bayes
________________________________________________________________________________
Training: 
MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
train time: 0.009s
test time:  0.001s
accuracy:   0.183
dimensionality: 10
density: 1.000000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.19      0.01      0.01       389
 comp.os.ms-windows.misc       0.54      0.47      0.50       394
comp.sys.ibm.pc.hardware       0.33      0.00      0.01       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.27      0.01      0.01       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.71      0.32      0.44       396
         rec.motorcycles       0.85      0.26      0.40       398
      rec.sport.baseball       0.08      0.01      0.02   

LinearSVC with L1-based feature selection

In [17]:
print('=' * 80)
print("LinearSVC with L1-based feature selection")
# The smaller C, the stronger the regularization.
# The more regularization, the more sparsity.
results.append(benchmark(Pipeline([
  ('feature_selection', LinearSVC(penalty="l1", dual=False, tol=1e-3)),
  ('classification', LinearSVC())
])))

LinearSVC with L1-based feature selection
________________________________________________________________________________
Training: 
Pipeline(steps=[('feature_selection', LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l1', random_state=None, tol=0.001,
     verbose=0)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])



Function transform is deprecated; Support to use estimators as feature selectors will be removed in version 0.19. Use SelectFromModel instead.



train time: 0.436s
test time:  0.002s
accuracy:   0.186
classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.00      0.00      0.00       389
 comp.os.ms-windows.misc       0.52      0.49      0.51       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.70      0.33      0.45       396
         rec.motorcycles       0.85      0.26      0.40       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.00      0.00      0.00       399
               sci.crypt       0.67      0.41      0.51       396
         sci.electronics       0.00      0.00      0.00       


Function transform is deprecated; Support to use estimators as feature selectors will be removed in version 0.19. Use SelectFromModel instead.
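
The deprecation warning above recommends `SelectFromModel`. A minimal sketch of the same two-step pipeline with the L1-penalized `LinearSVC` wrapped in `SelectFromModel` could look as follows. The synthetic data from `make_classification` is an assumption used only to keep the example self-contained; it is not the 20 newsgroups data, and the resulting numbers are not comparable to the output above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption), not the newsgroups features.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

clf = Pipeline([
    # SelectFromModel keeps the features whose L1-penalized
    # coefficients are not (close to) zero.
    ('feature_selection',
     SelectFromModel(LinearSVC(penalty="l1", dual=False, tol=1e-3))),
    ('classification', LinearSVC()),
])
clf.fit(X, y)

n_kept = int(clf.named_steps['feature_selection'].get_support().sum())
print("features kept: %d of %d" % (n_kept, X.shape[1]))
```

Because the pipeline still exposes `fit` and `predict`, it could be passed to `benchmark` in the same way as the deprecated version.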



### Plot Results

In [18]:

indices = np.arange(len(results))

results = [[x[i] for x in results] for i in range(4)]

clf_names, score, training_time, test_time = results
training_time = np.array(training_time) / np.max(training_time)
test_time = np.array(test_time) / np.max(test_time)

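# All three traces share the same x positions so that the grouped bars
# line up per classifier; the ticks are labelled with classifier names.
p1 = go.Bar(x=indices, y=score,
            name="score",
            marker=dict(color='navy'))

p2 = go.Bar(x=indices, y=training_time,
            name="training time",
            marker=dict(color='cyan'))

p3 = go.Bar(x=indices, y=test_time,
            name="test time",
            marker=dict(color='darkorange'))


layout = go.Layout(title="Score", barmode='group',
                   xaxis=dict(tickvals=indices, ticktext=clf_names))
fig = go.Figure(data=[p1, p2, p3], layout=layout)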

In [19]:
py.iplot(fig)
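
The plotting cell first turns the list of per-classifier tuples into four parallel sequences and then divides each timing series by its maximum. The sketch below reproduces those two steps in plain Python on three of the result tuples printed earlier (accuracy, training time, test time); the helper `normalize` is a hypothetical name, and `zip(*results)` is used as an equivalent of the nested list comprehension.

```python
# (name, accuracy, train time in s, test time in s), taken from the output above.
results = [
    ("RidgeClassifier", 0.186, 0.216, 0.001),
    ("NearestCentroid", 0.180, 0.020, 0.002),
    ("MultinomialNB", 0.183, 0.009, 0.001),
]

# Transpose rows into columns.
clf_names, score, training_time, test_time = [list(col) for col in zip(*results)]


def normalize(values):
    """Scale values so that the largest one becomes 1.0."""
    peak = max(values)
    return [v / peak for v in values]


training_time = normalize(training_time)
test_time = normalize(test_time)
print(clf_names, training_time, test_time)
```

After normalization the slowest classifier in each series has the value 1.0, so the timing bars share the 0 to 1 range of the accuracy bars but only show relative, not absolute, durations.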

### License

Authors:

    Peter Prettenhofer <peter.prettenhofer@gmail.com>
    Olivier Grisel <olivier.grisel@ensta.org>
    Mathieu Blondel <mathieu@mblondel.org>
    Lars Buitinck

License:

    BSD 3 clause

