### Classification

This section shows how the vector space model can be used to apply machine learning techniques such as classification.

In [1]:
from sklearn.datasets import fetch_20newsgroups

# Strip headers, footers and quoted replies so the classifier learns from the message text rather than metadata
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# print categories
print(list(newsgroups_train.target_names))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [4]:
# Number of categories
print(len(newsgroups_train.target_names))

20


In [6]:
# Show a document
docid = 1
doc = newsgroups_train.data[docid]
cat = newsgroups_train.target[docid]
print("Category id " +  str(cat) + " " + newsgroups_train.target_names[cat])
print("Doc " + doc)

Category id 4 comp.sys.mac.hardware
Doc A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.


In [7]:
# number of files 
newsgroups_train.filenames.shape

(11314,)

In [8]:
# Convert the training documents to TF-IDF vectors
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')

vectors_train = vectorizer.fit_transform(newsgroups_train.data)
vectors_train.shape

(11314, 101322)
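
To make the shape above concrete, here is a minimal sketch on a hypothetical three-document corpus (the `toy_` names and the sentences are illustrative and not part of the dataset). Each row is a document, each column is a vocabulary term, and with the default settings each row is scaled to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "the mac drive has a problem",
    "the hockey team won the game",
    "the mac monitor and the drive",
]

toy_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# Rows are documents, columns are the terms kept after stop-word removal
print(toy_vectors.shape)

# Terms in the learned vocabulary; stop words such as "the" are absent
print(sorted(toy_vectorizer.vocabulary_))

# With the default norm='l2', every document vector has unit length
row_norms = np.sqrt(toy_vectors.multiply(toy_vectors).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

The result is a SciPy sparse matrix, which is why a vocabulary of about 100,000 terms stays manageable: only the non-zero weights are stored.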

# Classifier

After creating the vectors, we can train a classifier with scikit-learn. Here we use a multinomial naive Bayes model.

In [11]:
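from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Train a multinomial naive Bayes classifier with light smoothing
model = MultinomialNB(alpha=.01)
model.fit(vectors_train, newsgroups_train.target)

# Vectorize the test set with the vocabulary learned from the training set
vectors_test = vectorizer.transform(newsgroups_test.data)

# Evaluate with the support-weighted F1 score
pred = model.predict(vectors_test)
metrics.f1_score(newsgroups_test.target, pred, average='weighted')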




0.695453607190013
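
The `average='weighted'` option computes one F1 score per class and averages them, weighting each class by its number of true examples. The sketch below checks this on hypothetical labels (`y_true` and `y_pred` are made up for illustration).

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for a 3-class problem, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

# One F1 score per class, in label order
per_class_f1 = metrics.f1_score(y_true, y_pred, average=None)

# Support: number of true examples of each class
support = np.bincount(y_true)

# Support-weighted mean of the per-class scores
manual_weighted = np.sum(per_class_f1 * support) / support.sum()
sklearn_weighted = metrics.f1_score(y_true, y_pred, average='weighted')

print(per_class_f1)
print(manual_weighted, sklearn_weighted)
```

Because large classes dominate the weighted average, it can hide poor performance on small classes; `average='macro'` weights every class equally instead.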

The result can be improved (hyperparameter optimization, better preprocessing, etc.).
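
As one possible way to tune the model, the sketch below searches over the smoothing parameter `alpha` with cross-validation. It is an illustration under assumptions: the toy corpus, the candidate values, and the use of `Pipeline` with `GridSearchCV` are choices made here, not part of the notebook above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: class 0 = hardware, class 1 = sport
toy_docs = [
    "the mac drive has a problem",
    "apple mac monitor for sale",
    "new scsi drive and controller card",
    "pc monitor driver problem",
    "the hockey team won the game",
    "baseball players hit runs this season",
    "the team plays hockey this year",
    "nhl season games and baseball games",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Chain vectorizer and classifier so each CV fold fits its own vocabulary
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', stop_words='english')),
    ('nb', MultinomialNB()),
])

# Candidate smoothing values (assumed for illustration)
param_grid = {'nb__alpha': [0.001, 0.01, 0.1, 1.0]}

search = GridSearchCV(pipeline, param_grid, scoring='f1_weighted', cv=2)
search.fit(toy_docs, toy_labels)

print(search.best_params_)
print(search.predict(["hockey game tonight"]))
```

On the real data, `toy_docs` and `toy_labels` would be replaced by `newsgroups_train.data` and `newsgroups_train.target`, with a larger `cv`. Vectorizer options such as `ngram_range` or `min_df` can be added to the same grid with the `tfidf__` prefix.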

In [12]:
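from sklearn.utils.extmath import density

# feature_log_prob_ holds the per-class term weights; the older coef_ alias
# was removed from MultinomialNB in scikit-learn 1.1
print("dimensionality: %d" % model.feature_log_prob_.shape[1])
print("density: %f" % density(model.feature_log_prob_))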

dimensionality: 101322
density: 1.000000
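
A density of 1.0 means that every entry of the per-class weight matrix is non-zero. This is expected for naive Bayes: with additive smoothing (`alpha > 0`), each term gets a small positive probability in every class, so every log probability is finite and non-zero. A minimal sketch on a hypothetical toy corpus, reading the weights from `feature_log_prob_`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils.extmath import density

# Hypothetical toy corpus, for illustration only
toy_docs = ["mac drive problem", "hockey team game", "mac monitor", "hockey season"]
toy_labels = [0, 1, 0, 1]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_docs)
toy_model = MultinomialNB(alpha=0.01).fit(toy_vectors, toy_labels)

# The input vectors are sparse: most terms do not occur in a given document
print("input density: %f" % density(toy_vectors))

# The learned log probabilities are dense: smoothing leaves no zero entries
print("weights density: %f" % density(toy_model.feature_log_prob_))
```

A linear model with L1 regularization, by contrast, would typically report a density well below 1.0.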


In [14]:
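# Review the top 10 features per category for the naive Bayes model
import numpy as np

def show_top10(classifier, vectorizer, categories):
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    for i, category in enumerate(categories):
        # feature_log_prob_ replaces the coef_ alias removed in scikit-learn 1.1
        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))

show_top10(model, vectorizer, newsgroups_train.target_names)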

alt.atheism: islam atheists say just religion atheism think don people god
comp.graphics: looking format 3d know program file files thanks image graphics
comp.os.ms-windows.misc: card problem thanks driver drivers use files dos file windows
comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive
comp.sys.mac.hardware: know monitor does quadra simms thanks problem drive apple mac
comp.windows.x: using windows x11r5 use application thanks widget server motif window
misc.forsale: asking email sell price condition new shipping offer 00 sale
rec.autos: don ford new good dealer just engine like cars car
rec.motorcycles: don just helmet riding like motorcycle ride bikes dod bike
rec.sport.baseball: braves players pitching hit runs games game baseball team year
rec.sport.hockey: league year nhl games season players play hockey team game
sci.crypt: people use escrow nsa keys government chip clipper encryption key
sci.electronics: don thanks voltage used know does lik

In [15]:
# Try the classifier on two new documents
new_docs = ['This is a survey of PC computers', 'God is love']
new_vectors = vectorizer.transform(new_docs)

pred_docs = model.predict(new_vectors)

print(pred_docs)
print([newsgroups_train.target_names[i] for i in pred_docs])

[ 2 15]
['comp.os.ms-windows.misc', 'soc.religion.christian']
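
`predict` returns only the most likely category. To see how confident the model is, `predict_proba` returns one probability per category. The sketch below uses a hypothetical two-class toy model (the documents, labels and category names are made up) so that it runs without the downloaded dataset; the same calls apply to `model` and `new_vectors` above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical categories and training documents, for illustration only
toy_names = ['hardware', 'religion']
toy_docs = ["pc drive and monitor", "mac drive problem",
            "god is love", "people and religion"]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
toy_model = MultinomialNB(alpha=0.01)
toy_model.fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

# Reuse the fitted vectorizer so new documents share the training vocabulary
toy_new = toy_vectorizer.transform(['This is a survey of PC computers', 'God is love'])

# One row per document, one column per category; each row sums to 1
proba = toy_model.predict_proba(toy_new)

for doc_proba in proba:
    best = np.argmax(doc_proba)
    print(toy_names[best], doc_proba.round(3))
```

Words that the vectorizer never saw during training (here "survey" and "computers") are ignored, so a prediction can rest on very few terms.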
