#### 1. Loading the training data

In [6]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

#### 2. Extracting features from text files

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)
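The shape reads as (documents, distinct tokens): each row is one post and each column counts one vocabulary term. A minimal pure-Python sketch of the same idea follows. The `toy_docs` corpus and the whitespace tokenizer are assumptions made for illustration; `CountVectorizer` uses its own tokenizer and returns a sparse matrix rather than nested lists.

```python
from collections import Counter

# Hypothetical toy corpus, not part of the 20 Newsgroups data.
toy_docs = ["the cat sat", "the dog sat down", "the cat ran"]

# Simplified tokenizer (an assumption): lowercase, then split on whitespace.
tokenized = [doc.lower().split() for doc in toy_docs]

# Vocabulary: every distinct token, sorted, mapped to a column index.
vocab = {tok: i for i, tok in enumerate(sorted({t for doc in tokenized for t in doc}))}

# Dense count matrix: one row per document, one column per vocabulary term.
counts = []
for doc in tokenized:
    freq = Counter(doc)
    counts.append([freq.get(tok, 0) for tok in vocab])

shape = (len(counts), len(vocab))
print(shape)
```

Here the toy corpus has 3 documents and 6 distinct tokens, so the matrix is 3 by 6, in the same way the real data gives 11314 by 130107.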

#### 3. TF-IDF

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)
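The shape is unchanged because TF-IDF only reweights the counts; it does not add or remove columns. The sketch below reproduces the weighting in plain Python, assuming the `TfidfTransformer` defaults (smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row). The `toy_counts` matrix and the helper `tfidf_row` are hypothetical names used only here.

```python
import math

# Hypothetical count matrix: 3 documents x 3 terms.
toy_counts = [
    [1, 1, 0],
    [1, 0, 2],
    [1, 0, 0],
]

n_docs = len(toy_counts)
n_terms = len(toy_counts[0])

# Document frequency: number of documents that contain each term.
df = [sum(1 for row in toy_counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed default behaviour).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight one row of counts by IDF, then scale it to unit L2 length."""
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted

toy_tfidf = [tfidf_row(row) for row in toy_counts]
```

A term that appears in every document (column 0) gets the lowest IDF, so in the first row the rarer term in column 1 ends up with the larger weight even though both counts are 1.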

#### 4. Machine Learning
#### The goal here is to show how to save and load a model in sklearn, so a simple Naive Bayes (NB) classifier fitted on the training data is enough.

In [11]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

Building a pipeline: all of the steps above (counting, TF-IDF weighting and classification) can be chained into a single pipeline with less code, as follows.
The names ‘vect’, ‘tfidf’ and ‘clf’ are arbitrary, but they are how each step is referred to later.
The rest of this notebook uses 'text_clf'.

In [12]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
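As a sketch of how the step names come into play, the example below builds the same kind of pipeline on a tiny made-up corpus (`toy_docs`, `toy_labels` and `toy_clf` are hypothetical names, chosen so the real `text_clf` is left untouched). The step name prefixes a parameter as `<step>__<parameter>` in `set_params`, and `named_steps` looks up a step by name.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the newsgroup posts.
toy_docs = [
    "the rocket reached orbit",
    "the team won the game",
    "orbit and launch schedule",
    "game score and team news",
]
toy_labels = [0, 1, 0, 1]

toy_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# The step name prefixes the parameter: here, alpha of the 'clf' step.
toy_clf.set_params(clf__alpha=0.5)
toy_clf.fit(toy_docs, toy_labels)

# Fitted steps can be retrieved by the names chosen above.
vocab_size = len(toy_clf.named_steps['vect'].vocabulary_)
alpha = toy_clf.named_steps['clf'].alpha
print(vocab_size, alpha)
```

The same naming scheme is what a parameter search such as `GridSearchCV` would use to address individual steps.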

#### 5. Evaluating

In [14]:

import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.7738980350504514
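The expression `np.mean(predicted == target)` is the accuracy: the comparison yields a boolean array, and its mean is the fraction of correct predictions. A small sketch with made-up labels (`toy_target` and `toy_predicted` are hypothetical) shows that it agrees with `sklearn.metrics.accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions are correct.
toy_target = np.array([0, 1, 2, 2, 1])
toy_predicted = np.array([0, 1, 1, 2, 1])

# Element-wise comparison gives a boolean array; True counts as 1 in the mean.
matches = toy_predicted == toy_target
manual_accuracy = np.mean(matches)

# The library helper computes the same fraction.
library_accuracy = accuracy_score(toy_target, toy_predicted)
print(manual_accuracy, library_accuracy)
```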

#### 6. Saving sklearn model with pickle

In [17]:
import pickle

In [22]:
with open('clf_text.pickle', 'wb') as f:
    pickle.dump(text_clf, f)

#### 7. Loading sklearn model with pickle

In [27]:
with open('clf_text.pickle', 'rb') as f:
    clf = pickle.load(f)

In [28]:
new_predicted = clf.predict(twenty_test.data)
np.mean(new_predicted == twenty_test.target)

0.7738980350504514

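The loaded model gives exactly the same accuracy (0.7738980350504514) as the original pipeline, so the whole pipeline, including the fitted vectorizer and TF-IDF weights, survived the round trip.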
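Matching accuracy is a good sign, but a stricter check is to compare the predictions themselves. The sketch below does this on a tiny made-up corpus, pickling to an in-memory byte string with `pickle.dumps` instead of a file; `toy_docs`, `toy_labels`, `toy_clf`, `restored` and `new_docs` are hypothetical names used only for this illustration.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the newsgroup posts.
toy_docs = [
    "the rocket reached orbit",
    "the team won the game",
    "orbit and launch schedule",
    "game score and team news",
]
toy_labels = [0, 1, 0, 1]

toy_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
]).fit(toy_docs, toy_labels)

# Round trip through pickle in memory (same format as the file-based version).
restored = pickle.loads(pickle.dumps(toy_clf))

new_docs = ["launch to orbit", "the team score"]

# Compare predictions element by element, not only their average.
same_labels = np.array_equal(toy_clf.predict(new_docs), restored.predict(new_docs))
same_probs = np.allclose(toy_clf.predict_proba(new_docs),
                         restored.predict_proba(new_docs))
print(same_labels, same_probs)
```

Two practical points apply to the file-based version as well: unpickling can execute arbitrary code, so only load pickle files from a trusted source, and a pickled model is best loaded with the same scikit-learn version that saved it.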