In [80]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

In [81]:
remove = ('headers','quotes','footers')
newsgroups = fetch_20newsgroups(remove=remove)  

By default, the `TfidfVectorizer`
- Only uses unigrams
- Log-scales the inverse document frequency and smooths it (adding 1 to the document frequencies)
- Normalizes each row with $L_2$ normalization (the sum of squares is 1), so that the dot product of two rows is equal to their cosine similarity 
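
The defaults listed above can be checked on a small example. The sketch below uses a made-up three-document corpus (`toy_docs` and every `toy_`/`manual_` name are illustrative, not part of the notebook), rebuilds the weights by hand from raw counts using the smoothed formula $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, and compares them with what `TfidfVectorizer` returns.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for this check
toy_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "linux cat command prints files",
]

# Raw unigram counts (same tokenization and column order as TfidfVectorizer)
toy_counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)

# Smoothed, log-scaled idf, then L2 normalization of each row
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual_tfidf = toy_counts * manual_idf
manual_tfidf = manual_tfidf / np.linalg.norm(manual_tfidf, axis=1, keepdims=True)

toy_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
matches_sklearn = np.allclose(manual_tfidf, toy_tfidf)

# With unit-length rows, the dot product is the cosine similarity
dot_01 = toy_tfidf[0] @ toy_tfidf[1]
cosine_01 = dot_01 / (np.linalg.norm(toy_tfidf[0]) * np.linalg.norm(toy_tfidf[1]))
print(matches_sklearn, np.isclose(dot_01, cosine_01))
```

Because every row already has unit norm, the denominator in the cosine formula is 1, which is why a plain sparse dot product is enough for ranking.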

Removing stopwords is not that important: they are already down-weighted by the `idf` term, since stopwords appear in almost all documents.
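
As a rough illustration of this point, the sketch below fits a vectorizer on a made-up corpus (`toy_docs` is an assumption for the example, not the newsgroups data) in which "the" appears in every document, and lists the learned `idf_` values from lowest to highest.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" occurs in every document
toy_docs = [
    "the cat sat on the mat",
    "the dog chased the ball",
    "the kernel schedules the process",
]

toy_vectorizer = TfidfVectorizer(stop_words=None).fit(toy_docs)

# Map each term to its learned idf weight
idf_by_term = {
    term: toy_vectorizer.idf_[col]
    for term, col in toy_vectorizer.vocabulary_.items()
}

# Terms found in every document get the smallest possible idf
for term, idf in sorted(idf_by_term.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} {idf:.3f}")
```

Note that the term frequency is still a raw count by default, so a stopword repeated many times in a short document keeps some weight; the `idf` term reduces its influence rather than removing it.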

In [82]:
vectorizer = TfidfVectorizer(stop_words=None)

In [83]:
vectorizer.fit(newsgroups.data)
corpus_vec = vectorizer.transform(newsgroups.data)

In [84]:
corpus_vec

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 1103627 stored elements and shape (11314, 101631)>

Given a query, a simple way to find the most relevant documents in the corpus is to vectorize the query with the same vectorizer, and find the documents with the highest cosine similarity. 

The 20 newsgroups dataset is not particularly large, so we probably won't gain much by using the `HashingVectorizer`, but on a larger corpus this would be a good application for it.
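
For reference, here is a minimal sketch of what the hashing variant could look like, assuming a `HashingVectorizer` feeding a `TfidfTransformer`. The toy corpus and the choice of `n_features=2**18` are illustrative assumptions, not settings tuned for this dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for a much larger collection
toy_docs = [
    "my cat sleeps on the sofa",
    "use cat to print the file",
    "the hockey game went to overtime",
]

# Stateless: tokens are hashed to column indices, no vocabulary is stored.
# norm=None and alternate_sign=False keep plain non-negative counts.
hasher = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)
hashed_counts = hasher.transform(toy_docs)

# The idf weights still have to be learned from the corpus
toy_tfidf = TfidfTransformer()
toy_matrix = toy_tfidf.fit_transform(hashed_counts)

# Queries go through the same two steps
toy_query = toy_tfidf.transform(hasher.transform(["cat"]))
toy_scores = (toy_matrix @ toy_query.T).toarray().ravel()
print(toy_matrix.shape, toy_scores)
```

The trade-off is that columns can no longer be mapped back to words, and distinct tokens may occasionally collide in the same column.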

In [85]:
query = ["cat"] # has to be an iterable over documents

query_vec = vectorizer.transform(query)

In [86]:
query_vec

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 1 stored elements and shape (1, 101631)>

In [87]:
similarity_scores = corpus_vec.dot(query_vec.T).toarray().flatten()

In [88]:
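# find the indices of the 5 highest similarity scores
top_idx = np.argpartition(similarity_scores,-5)[-5:]
# argpartition does not order the selected entries, so sort them by score (ascending)
top_idx = top_idx[np.argsort(similarity_scores[top_idx])]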
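
The query steps above can be wrapped in a small helper. This is only a sketch: `top_k_documents` is a hypothetical name, the toy corpus is made up, and the code relies on the default $L_2$ normalization so that the dot product equals the cosine similarity. Since `np.argpartition` only guarantees which entries land in the last `k` positions, not their order, the helper sorts the selected candidates explicitly.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_k_documents(query, fitted_vectorizer, doc_matrix, k=5):
    """Return (indices, scores) of the k most similar rows, best first."""
    query_vec = fitted_vectorizer.transform([query])
    scores = (doc_matrix @ query_vec.T).toarray().ravel()
    k = min(k, scores.size)
    # Unordered indices of the k largest scores
    candidates = np.argpartition(scores, -k)[-k:]
    # Order the candidates by decreasing score
    ranked = candidates[np.argsort(scores[candidates])[::-1]]
    return ranked, scores[ranked]

# Hypothetical toy corpus for a quick check
toy_docs = [
    "the cat sat on the mat",
    "cat cat cat",
    "the dog chased the ball",
    "pipe the output of cat into grep",
]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

toy_idx, toy_scores = top_k_documents("cat", toy_vectorizer, toy_matrix, k=3)
for i, s in zip(toy_idx, toy_scores):
    print(f"{s:.3f}  {toy_docs[i]}")
```

Partitioning first and sorting only the `k` candidates avoids a full sort of the score array, which matters more as the corpus grows.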

The main weakness of TF-IDF is its lack of context awareness: here the query "cat" matches both the animal and the Linux command `cat`. More modern contextual approaches, such as BERT, address this. Word2Vec still has the same problem (`cat` with either meaning maps to the same embedding), but has the advantage of encoding some semantic similarity: if we train Word2Vec on a large corpus containing discussion of both domestic animals and computers, we expect the embedding for `cat` to be close both to other animals and to other Linux commands.
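
The context blindness is easy to reproduce in isolation. In the sketch below (a made-up three-document corpus, with `toy_` names that are not part of the notebook), both senses of "cat" share a single column of the TF-IDF matrix, so the query scores against both documents, while a document without the token scores exactly zero.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: two senses of "cat" plus an unrelated document
toy_docs = [
    "my cat sleeps on the sofa all day",          # the animal
    "use cat to print the file to the terminal",  # the shell command
    "the hockey game went to overtime",           # no "cat" at all
]

toy_vectorizer = TfidfVectorizer()
toy_vec = toy_vectorizer.fit_transform(toy_docs)

# The query activates the single "cat" column, whatever the intended sense
toy_query = toy_vectorizer.transform(["cat"])
toy_scores = (toy_vec @ toy_query.T).toarray().ravel()

for doc, score in zip(toy_docs, toy_scores):
    print(f"{score:.3f}  {doc}")
```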

In [89]:
categories_dict = {i:name for i,name in enumerate(newsgroups.target_names)}
readable_labels = np.array([categories_dict[label] for label in newsgroups.target])

In [90]:
for idx in reversed(top_idx):
    print("Similarity: ",similarity_scores[idx])
    print("Group: ",readable_labels[idx])
    print("Sample:")
    print(newsgroups.data[idx][:500],end='\n\n')

Similarity:  0.3939521519157273
Group:  sci.electronics
Sample:

We use them as Christmas tree decorations, the cat doesn't eat these.

-- 

Similarity:  0.31950344372186057
Group:  comp.os.ms-windows.misc
Sample:

Because you are uptight?

Many computer-literate people see advantages in each system.

You act like a Mac ate your cat.

Similarity:  0.2588604371075505
Group:  rec.motorcycles
Sample:

	There should be no worries about the trans.


	Does this count?

$ cat dod.faq | mailx -s "HAHAHHA" jburnside@ll.mit.edu (waiting to press
							 return...)

Later,

Similarity:  0.23204008172780952
Group:  rec.sport.hockey
Sample:

It would seem logical that the mask is Potvins. His nickname is "The Cat", 
which would go a long ways towards explaining the panther. 

Of course, it could be an old story and the mask is Fuhrs, too.....

Similarity:  0.21892456660595316
Group:  misc.forsale
Sample:
For Sale:

1982 - 16' Hobie Cat Special, very good condition with
trailer, catbox, righting sys