In [1]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
print(list(newsgroups_train.target_names))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [2]:
cats = ['alt.atheism', 'sci.space', 'misc.forsale', 'rec.autos']

In [3]:
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats, remove=('headers', 'footers', 'quotes'))

In [4]:
group_names = {0:'atheism', 1:'forsale', 2:'autos', 3:'space'}

atheism_docs, forsale_docs, autos_docs, space_docs = [], [], [], []
for idx, label in enumerate(newsgroups_train.target):
    if label == 0:
        atheism_docs.append(idx)
    elif label == 1:
        forsale_docs.append(idx)
    elif label == 2:
        autos_docs.append(idx)
    elif label == 3:
        space_docs.append(idx)
print(len(atheism_docs), len(forsale_docs), len(autos_docs), len(space_docs))

480 585 594 593


In [5]:
len(newsgroups_train.data), len(newsgroups_train.target)

(2252, 2252)

# Vectorize Data

In [6]:
import re
docs = [doc.replace('\n',' ').replace('\t',' ').strip() for doc in newsgroups_train.data]
docs = [re.sub(' {2,}',' ', doc) for doc in docs]

In [7]:
import spacy
nlp = spacy.load('en_core_web_sm')

vectors = [nlp(doc) for doc in docs]

In [8]:
vectors = [vector.vector for vector in vectors]

# Per Class Average

In [23]:
import numpy as np

def get_class_average(vectors, class_docs):
    class_vectors = np.vstack((vectors[class_docs[0]], vectors[class_docs[1]]))
    for i in range(2,len(class_docs)):
        tmp = vectors[class_docs[i]]
        if tmp.shape[0] > 0:
            class_vectors = np.vstack((class_vectors, tmp))
    return class_vectors, np.mean(class_vectors, axis=0)
    
atheism_vectors = np.vstack((vectors[atheism_docs[0]], vectors[atheism_docs[1]]))
#for i in range(2,len(atheism_docs)):
#    tmp = vectors[atheism_docs[i]]
    
#    try:
#        atheism_vectors = np.vstack((atheism_vectors, tmp))
#    except:
#        print(atheism_vectors.shape, tmp.shape)
#atheism_vectors.shape
#np.vstack((vectors[atheism_docs[0]], vectors[atheism_docs[1]])).mean(axis=0).shape

In [11]:
np.mean(atheism_vectors, axis=0).shape

(384,)

In [24]:
atheism_vectors, atheism_avg = get_class_average(vectors, atheism_docs)
forsale_vectors, forsale_avg = get_class_average(vectors, forsale_docs)
autos_vectors, autos_avg = get_class_average(vectors, autos_docs)
space_vectors, space_avg = get_class_average(vectors, space_docs)

# Test

In [11]:
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats, remove=('headers', 'footers', 'quotes'))

In [28]:
test_docs = [nlp(doc) for doc in newsgroups_train.data]
test_vectors = [vector.vector for vector in test_docs]

In [73]:
from sklearn.metrics.pairwise import cosine_similarity
import operator
import warnings
warnings.filterwarnings('ignore')
group_names = {0:'atheism', 1:'forsale', 2:'autos', 3:'space'}

def classify_doc(doc, vector):
    results = {}
    results['atheism'] = cosine_similarity(atheism_avg, vector)[0]
    results['forsale'] = cosine_similarity(forsale_avg, vector)[0]
    results['autos'] = cosine_similarity(autos_avg, vector)[0]
    results['space'] = cosine_similarity(space_avg, vector)[0]
    #print(doc)
    sorted_results = sorted(results.items(), key=operator.itemgetter(1))
    return sorted_results, results

def evaluate(idx):
    results, _ = classify_doc(test_docs[idx], test_vectors[idx])
    print('True class: {l}'.format(l=group_names[newsgroups_test.target[idx]]))
    print('Predicted class: {p} with cosine similarity {c}'.format(p=results[0][0], c=results[0][1][0]))
    print(test_docs[idx])

In [74]:
evaluate(0)

True class: forsale
Predicted class: forsale with cosine similarity 0.8243203163146973
Consumer Reports once wrote about the S-10 Blazer that it "shook and rattled
like a tired taxi cab".  There is one noise that is expecially irritating -
the back window squeaks.  I believe its because the whole tailgate assembly
and window are not solid.  Anyway, has anyone had the same problem, and have
you found any fixes?


In [75]:
evaluate(1)

True class: atheism
Predicted class: forsale with cosine similarity 0.5610883235931396


	Agreed.

--


       "Satan and the Angels do not have freewill.  
        They do what god tells them to do. "


In [76]:
evaluate(2)

True class: atheism
Predicted class: atheism with cosine similarity 0.839500367641449
4 month old Sega Genesis, barely used, one controller, in original
box, with Sonics 1 and 2.  $130 gets the whole bundle shipped to you.

Turns out they're not as addictive when they're yours.  Anyway, mail me if 
you're interested in this marvel of modern technology.



In [77]:
evaluate(3)

True class: forsale
Predicted class: forsale with cosine similarity 0.8124094009399414

Tammy, is this all explicitly stated in the bible, or do you assume
that you know that Ezekiel indirectly mentioned? It could have been
another metaphor, for instance Ezekiel was mad at his landlord, so he
talked about him when he wrote about the prince of Tyre.

Sorry, but my interpretation is more mundane, Ezekiel wrote about 
the prince of Tyre when we wrote about the prince of Tyre.
 
Cheers,
Kent


In [78]:
evaluate(4)

True class: atheism
Predicted class: atheism with cosine similarity 0.3254280686378479
FOR SALE: ****************************************************************

386-40 with VGA Color Monitor, dual floppy, VGA card with 1MB on board, joystick,
mouse, 2 MB RAM, no hard drive.


FOR ONLY $500!  Respond quickly!






In [79]:
evaluate(5)

True class: forsale
Predicted class: atheism with cosine similarity 0.5689576864242554
If the  new  Kuiper belt object *is*  called 'Karla', the next
one  should be called 'Smiley'.


In [80]:
evaluate(6)

True class: forsale
Predicted class: forsale with cosine similarity 0.7575559020042419


 Could you explain what any of the above pertains to? Is this a position 
statement on something or typing practice? 
--


       "Satan and the Angels do not have freewill.  
        They do what god tells them to do. "


In [81]:
evaluate(7)

True class: space
Predicted class: atheism with cosine similarity 0.6346250176429749
16 bit MFM FD/HD controller 	- $25/b.o.

copy card w/ software and cable	- $30/b.o.
(can copy any protected software)

if interested, please reply to this account



In [83]:
evaluate(8)

True class: space
Predicted class: atheism with cosine similarity 0.7935458421707153


nice theory.  too bad the MR2's never came with a four cylinder over 2.0
liters.  More like 1.6.  Or did they? were the nonturbo MR2II's  2.2 or
some such?

I also understand that anyone using balancing shafts on four cylinders, must
pay SAAB a royalty for using their patented design..like Porsche's 3.0 I4...


In [84]:
evaluate(9)

True class: space
Predicted class: atheism with cosine similarity 0.7535079121589661
Second week of January (prime ski season at one of the largest Poconos ski
areas).  Just north of Allentown.
Condo sleeps 6-8 depending on how friendly you all are.  Has hot tub,
deck.  Easy access to parking lot and shuttle to slopes (condo is a few
miles from the slopes).

Cost: $6000 OBRO, price based on what we paid for it (used, also) and
current market.
[RICHR]


In [85]:
evaluate(10)

True class: atheism
Predicted class: forsale with cosine similarity 0.6919900178909302

Don't listen to this guy, he's just a crank.  At first, this business
about being the "one true god" was tolerated by the rest of us,
but now it has gotten completely out of hand.

Besides, it really isn't so bad when people stop believing in you.
It's much more relaxing when mortals aren't always begging you for favors.
