Source: https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

## 20 Newsgroups Dataset

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

Load the training dataset:

In [1]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Check the target names (categories) and some data files:

In [2]:
# list the categories (a bare expression is only displayed when it is the last line of a cell)
twenty_train.target_names

# print the first three lines of the first data file
print("\n".join(twenty_train.data[0].split("\n")[:3]))

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu


## Count Vectorization

Text files are ordered sequences of words. To run machine learning algorithms on them, we need to convert the text files into numerical feature vectors. We will use the "Bag of Words" model for our example.

Briefly, we segment each text file into words (for English, roughly by splitting on whitespace), count the number of times each word occurs in each document, and assign each unique word an integer id. Each unique word in our dictionary corresponds to one feature (descriptive feature).
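The steps above can be sketched in plain Python. This is only a toy illustration on two made-up sentences, with naive whitespace tokenization; `CountVectorizer` uses its own tokenizer (by default it lowercases the text and keeps tokens of two or more alphanumeric characters), so its vocabulary on real data will differ.

```python
from collections import Counter

# Two made-up documents (not taken from the 20 Newsgroups data).
docs = ["the cat sat", "the cat sat on the mat"]

# Step 1: segment each document into words (naive whitespace split).
tokenized = [doc.lower().split() for doc in docs]

# Step 2: assign each unique word an integer id (sorted for a stable order).
unique_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

# Step 3: count how often each vocabulary word occurs in each document.
def to_count_vector(tokens, vocabulary):
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    for word, idx in vocabulary.items():
        vector[idx] = counts[word]
    return vector

matrix = [to_count_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)  # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(matrix)      # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Each row of `matrix` is one document and each column is one word, which is the same layout as the Document-Term matrix built below.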

Scikit-learn's `CountVectorizer` is a high-level component that we will use to create the feature vectors.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

# print the dimension of the Document-Term matrix
X_train_counts.shape

(11314, 130107)

Calling `count_vect.fit_transform(twenty_train.data)` learns the vocabulary dictionary and returns a Document-Term matrix of shape [n_samples, n_features]: here, 11,314 documents by 130,107 distinct terms.

## Term Frequencies (TF)

Just counting the number of words in each document has one issue: it gives more weight to longer documents than to shorter ones. To avoid this, we can use the frequency of each word in each document instead of its raw count. This is known as TF (Term Frequency), and is computed as count(word) / total number of words in the document.
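A minimal sketch of that formula, using made-up documents: the long document is simply the short one repeated four times, so the raw counts differ by a factor of four while the term frequencies are identical. The function name `term_frequencies` is our own and not part of scikit-learn.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return count(word) / total number of words for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

short_doc = "the cat sat".split()
long_doc = short_doc * 4  # same content, four times as long

raw_short = Counter(short_doc)
raw_long = Counter(long_doc)
tf_short = term_frequencies(short_doc)
tf_long = term_frequencies(long_doc)

# Raw counts favour the longer document...
print(raw_short["cat"], raw_long["cat"])
# ...while term frequencies put both documents on the same scale.
print(tf_short["cat"], tf_long["cat"])
```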

## Term Frequency x Inverse Document Frequency (TF-IDF)

We can also reduce the weight of very common words (e.g. the, is, an, ...) that occur in nearly all documents. This is known as TF-IDF, i.e. Term Frequency times Inverse Document Frequency.

We can compute both with scikit-learn's `TfidfTransformer`:

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# print the dimension of the Document-Term matrix
X_train_tfidf.shape

(11314, 130107)
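To make the weighting concrete, here is a small pure-Python sketch on a made-up count matrix. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 applied to raw counts, followed by L2 normalization of each row, which to the best of our understanding matches the documented defaults of `TfidfTransformer`; treat it as an illustration rather than a reimplementation.

```python
import math

# Made-up count matrix: rows are documents, columns follow `terms`.
terms = ["cat", "dog", "the"]
counts = [
    [2, 0, 3],
    [0, 1, 2],
    [1, 1, 4],
]
n_docs = len(counts)

# Document frequency: in how many documents does each term appear?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# Smoothed inverse document frequency (assumed formula, see the note above).
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit Euclidean length."""
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    return [value / norm for value in weighted] if norm else weighted

tfidf = [tfidf_row(row) for row in counts]

for term, weight in zip(terms, idf):
    print(f"{term}: idf={weight:.4f}")
```

The word "the" appears in every document, so it receives the lowest IDF weight, while "cat" and "dog" appear in only two of the three documents and are weighted up. The shape of the matrix does not change, which is why `X_train_tfidf` has the same dimensions as `X_train_counts`.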

## Naive Bayes Classification

Naive Bayes Classification is one of the simplest algorithms for text classification. We can train a classifier using scikit-learn this way:

In [8]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
clf

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
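For intuition, the following is a rough sketch of what a multinomial Naive Bayes classifier does, written in plain Python on a made-up four-document corpus. It assumes additive (Laplace) smoothing with `alpha=1.0`, mirroring the default shown above; the helper names `train_nb` and `predict_nb` and the toy labels are our own, and the code is not how scikit-learn implements the model internally.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    vocab = sorted({word for doc in docs for word in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Additive smoothing: every vocabulary word gets `alpha` extra counts.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            word: math.log((word_counts[c][word] + alpha) / total) for word in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log score; unseen words are ignored."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][word] for word in doc if word in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Made-up training data with hypothetical labels.
train_docs = [
    "engine car wheel".split(),
    "car road engine".split(),
    "goal team match".split(),
    "team player goal".split(),
]
train_labels = ["autos", "autos", "sports", "sports"]

log_prior, log_likelihood = train_nb(train_docs, train_labels)
print(predict_nb("car engine".split(), log_prior, log_likelihood))  # autos
print(predict_nb("team goal".split(), log_prior, log_likelihood))   # sports
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log likelihoods can simply be summed.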

### Training pipeline

We can also build a training pipeline that does all the above with fewer lines of code:

In [13]:
from sklearn.pipeline import Pipeline

# Create a pipeline using a Naive Bayes classifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

## Testing the Classifier

Once the classifier is trained, we can test its performance on the test set:

In [14]:
import numpy as np

# Fetch the test set
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)

# Run the classifier and compute its accuracy on the test set
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.7738980350504514

The accuracy is roughly 77.39%, which is not bad for a naive classifier.

## SVM

Let's try Support Vector Machines (SVM) to see if we get a performance improvement:

In [15]:
from sklearn.linear_model import SGDClassifier

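# Create a pipeline using a linear SVM trained with stochastic gradient descent.
# Recent scikit-learn releases use `max_iter` in place of the removed `n_iter`;
# `tol=None` keeps the fixed five passes over the training data.
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                                   alpha=1e-3, max_iter=5, tol=None,
                                                   random_state=42))])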
_ = text_clf_svm.fit(twenty_train.data, twenty_train.target)

# Run the classifier and compute its accuracy on the test set
predicted_svm = text_clf_svm.predict(twenty_test.data)
np.mean(predicted_svm == twenty_test.target)



0.82381837493361654

With the SVM, accuracy on the test set improves to roughly 82.38%.

## Parameter Grid Search

By now you may be wondering how the parameters to the SVM were selected (such as `alpha=1e-3`). Almost all classifiers have parameters that can be tuned to improve performance. Scikit-learn provides an extremely useful tool called `GridSearchCV` for this:

In [16]:
from sklearn.model_selection import GridSearchCV

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
}

Here, we are creating a dictionary of parameters for which we would like to do performance tuning, mapping each parameter name to the candidate values to try.

Each parameter name begins with the name of the pipeline step it belongs to, followed by a double underscore. For example, the step we named `vect` has a parameter `ngram_range`, which is addressed as `vect__ngram_range`. This parameter controls the use of unigrams and bigrams, and we are telling scikit-learn to choose the best value for our classifier.

Once we've identified the parameter(s) we would like to optimize, we will create an instance of a grid search by passing the classifier and parameters to it. In this case, we are optimizing the Naive Bayes classifier (`text_clf`). We've also set `n_jobs=-1` so that the optimizer can utilize multiple cores on this machine:

In [20]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

This may take a few minutes to run, depending on your machine configuration.
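Part of the reason for the running time is that a grid search tries every combination of the listed values, and each combination is fitted once per cross-validation fold. The sketch below only enumerates the combinations with `itertools.product` to show how many there are and how each `step__param` name splits into a pipeline step and a parameter; it does not train anything and is not how `GridSearchCV` is implemented.

```python
from itertools import product

# Same grid as above, repeated here so the snippet runs on its own.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

names = list(parameters)
value_lists = [parameters[name] for name in names]

# One dictionary per candidate setting: the Cartesian product of all value lists.
candidates = [dict(zip(names, combo)) for combo in product(*value_lists)]
print(len(candidates))  # 8, i.e. 2 * 2 * 2 combinations

# Each name is "<pipeline step>__<parameter of that step>".
for name in names:
    step, _, param = name.partition('__')
    print(f"step={step!r}, parameter={param!r}")
```

Adding one more parameter with two candidate values would double the number of fits, so it pays to keep the grid small at first.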

To see the best mean score and the params, run the following code:

In [21]:
gs_clf.best_score_

0.90675269577514583

In [22]:
gs_clf.best_params_

{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

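The best mean cross-validated accuracy is now ~90.7% for the Naive Bayes classifier, and the corresponding parameters are `{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}`. Note that `best_score_` is averaged over cross-validation folds of the training data, so it is not directly comparable with the test-set accuracy reported earlier.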

Similarly, the following code yields a best cross-validated score of ~89.79% for the SVM classifier:

In [23]:
from sklearn.model_selection import GridSearchCV
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)],
                  'tfidf__use_idf': (True, False),
                  'clf-svm__alpha': (1e-2, 1e-3)
}
gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)
gs_clf_svm.best_score_



0.89791408873961465

In [24]:
gs_clf_svm.best_params_

{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}