# Part 2: Applications of tf-idf and cosine similarity

In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

## Load the data

In [2]:
# select these 4 groups
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
# fetch the training split, restricted to the 4 groups above
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, random_state=42)

## Feature importances

In [3]:
vectorizer = TfidfVectorizer(stop_words='english',analyzer='word')
vectors = vectorizer.fit_transform(newsgroups_train.data)

1. For 4 groups (classes) of the 20newsgroups corpus (your choice), find the 10 most important words by:
    * total tf-idf score
    * average tf-idf score (average only over non-zero values)
    * highest tf (only) score across corpus (try using `use_idf = False` in `TfidfVectorizer` )

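### Let's build a reverse (index-to-word) dictionary.

First, a quick sanity check: the average number of distinct terms per document after stop-word removal, i.e. the average number of non-zero entries per row.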

In [4]:
vectors.nnz / float(vectors.shape[0])

114.78072763028516

In [5]:
index_2_word = {}
for word, idx in vectorizer.vocabulary_.items():
   index_2_word[idx] = word

Now, let's conver the sparse matrix tfidf representation to a regulary numpy array. Note that this may not be the optimal solution when your sparse matrix dimension is gigantic.

In [6]:
tfidf_numpy = vectors.toarray()
tfidf_numpy.shape

(2034, 33814)
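
If the dense array does not fit in memory, the same per-word statistics can be computed on the sparse matrix directly. The following is a minimal sketch on a toy matrix, not the notebook's data; `toy` is a made-up stand-in for `vectors`, and it assumes a SciPy `csr_matrix`, which is what `fit_transform` returns here.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for `vectors`: 3 documents (rows) x 4 terms (columns).
toy = csr_matrix(np.array([
    [0.5, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.3, 0.4],
    [0.1, 0.0, 0.0, 0.0],
]))

# Column sums come back as a 1 x n_terms matrix; flatten to a 1-D array.
total = np.asarray(toy.sum(axis=0)).ravel()

# Number of stored (non-zero) entries in each column.
nonzero_counts = toy.getnnz(axis=0)

# Column-wise maximum is returned as a sparse 1 x n_terms matrix.
col_max = toy.max(axis=0).toarray().ravel()

print(total, nonzero_counts, col_max)
```

Only the 1-D results are densified, so memory use scales with the vocabulary size rather than with documents times vocabulary.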

Rank the words by their 1) total tf-idf score, 2) mean tf-idf score and 3) max tf score.

In [7]:
total_score = tfidf_numpy.sum(axis=0)

The mean score is a little tricky because we only average over the non-zero values. See this Stack Overflow [answer](https://stackoverflow.com/questions/38542548/numpy-mean-of-nonzero-values).

In [8]:
mean_score = np.true_divide(tfidf_numpy.sum(axis=0),(tfidf_numpy!=0).sum(axis=0))
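
To see why the non-zero mean differs from the plain mean, here is a small sketch on a made-up 3 x 3 array (`toy` is illustrative, not the notebook's data). It also guards against all-zero columns; that guard is an addition for generality, since in this notebook every vocabulary word occurs in at least one document.

```python
import numpy as np

# Toy scores: 3 documents (rows) x 3 terms (columns).
toy = np.array([
    [0.5, 0.0, 0.0],
    [0.0, 0.0, 0.3],
    [0.1, 0.0, 0.9],
])

col_sum = toy.sum(axis=0)
nonzero_counts = (toy != 0).sum(axis=0)

# The plain mean divides by the number of documents, zeros included.
plain_mean = toy.mean(axis=0)

# Divide only where a column has non-zero entries; other columns keep 0.0 from `out`.
nonzero_mean = np.divide(col_sum, nonzero_counts,
                         out=np.zeros_like(col_sum),
                         where=nonzero_counts > 0)

print(plain_mean)
print(nonzero_mean)
```

The first and third columns get a larger non-zero mean than plain mean because the zero entries no longer dilute the average.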

The max score requires another instance of `vectorizer`. Since it only uses term frequency, I am going to call its output `tf_numpy`.

In [9]:
tf_only_vectorizer = TfidfVectorizer(stop_words='english', use_idf = False, analyzer='word')
tf_numpy = tf_only_vectorizer.fit_transform(newsgroups_train.data).toarray()

In [10]:
index_2_word_tf_only = {}
for word, idx in tf_only_vectorizer.vocabulary_.items():
   index_2_word_tf_only[idx] = word

### Do the two dicts agree?
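Yes. Both vectorizers tokenize and filter stop words in the same way, so they build the same vocabulary; `use_idf` only changes the weights.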

In [11]:
index_2_word == index_2_word_tf_only

True

In [12]:
max_score_tf_only = tf_numpy.max(axis=0)

#### Now, let's use the index-to-word map to find the top 10.

[How do I get indices of N maximum values in a NumPy array?](https://stackoverflow.com/questions/6910641/how-do-i-get-indices-of-n-maximum-values-in-a-numpy-array)

In [13]:
def idx_2_words(idx, dict_ = index_2_word):
    '''
    idx is a list, return the corresponding words in the `dict_` as a sorted list
    '''
    return sorted([dict_.get(idx_,None) for idx_ in idx])

In [14]:
top_k = 10
total_score_index_10 = total_score.argsort()[-top_k:][::-1]
print("By total score, most important 10 words are: {}".format(idx_2_words(total_score_index_10)))

By total score, most important 10 words are: ['article', 'com', 'edu', 'god', 'lines', 'organization', 'people', 'space', 'subject', 'writes']


In [15]:
mean_score_index_10 = mean_score.argsort()[-top_k:][::-1]
print("By mean score, most important 10 words are: {}".format(idx_2_words(mean_score_index_10)))

By mean score, most important 10 words are: ['b12', 'dlb', 'enviroleague', 'kewageshig', 'landis', 'p_c', 'sphinx', 'stereoscopic', 'vonnegut', 'xxxx']


In [16]:
top_k = 10
max_score_index_10 = max_score_tf_only.argsort()[-top_k:][::-1]
print("By max score, most important 10 words are: {}".format(idx_2_words(max_score_index_10, index_2_word_tf_only)))

By max score, most important 10 words are: ['000', '___', 'ellipse', 'gb', 'jpeg', 'law', 'ra', 'space', 'tyre', 'xxxx']
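
Note that `idx_2_words` sorts the words alphabetically, so the printed lists do not show which word ranks first. A possible variant that keeps the rank order is sketched below, using `np.argpartition` as discussed in the Stack Overflow question linked above. The helper name `top_k_ranked` and the toy data are made up for illustration.

```python
import numpy as np

def top_k_ranked(scores, index_to_word, k):
    """Return (word, score) pairs for the k largest scores, best first."""
    scores = np.asarray(scores)
    k = min(k, scores.size)
    if k <= 0:
        return []
    # argpartition places the k largest entries in the last k slots, unordered.
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates by descending score.
    ranked = candidates[np.argsort(scores[candidates])[::-1]]
    return [(index_to_word[int(i)], float(scores[i])) for i in ranked]

# Toy scores and a toy index-to-word map (illustrative only).
toy_scores = np.array([0.2, 1.5, 0.7, 3.1, 0.9])
toy_index_to_word = {0: "alpha", 1: "beta", 2: "gamma", 3: "delta", 4: "epsilon"}

top_3 = top_k_ranked(toy_scores, toy_index_to_word, k=3)
print(top_3)
```

With the notebook's objects, a call would look like `top_k_ranked(total_score, index_2_word, 10)`.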


2. Do the top 10 words change based on each of the different ranking methods?

> Yes. 

3. Also do this for each category of article (each of the 20 newsgroups) and compare the top words of each. You should treat each category of newsgroup as a separate "corpus" for this question.

> Left to the students as an exercise. The names of the 20 newsgroups are listed below.

In [17]:
list(fetch_20newsgroups(subset='train').target_names)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

#### Ranking

You can use cosine similarity to rank the relevance of a document to a given search query using the following process:
* Convert the search query into a feature vector: treat the query as a document in your corpus and apply tf-idf vectorizing to it.
* Normalize your query vector and all of your document vectors (since documents are often much longer than a query)
* Compute the cosine similarity between the search query and each of your documents
* Rank the documents by their similarity score

Sample queries are available in `data/queries.txt`.

1. For each query, find the 3 most relevant articles from the 20 Newsgroups corpus.

We are still going to use the 4 categories we fetched earlier as an example.

#### 1. First, let's read the queries into a list.

In [18]:
with open("../data/queries.txt","r") as fp:
    queries = list(map(lambda x: x.strip(), fp.readlines()))

In [19]:
queries

['cheerleader nation tv show',
 'budget rental cars',
 'children who have died from moms postpartum depression',
 'compaq presario notebook v5005us',
 'boxed set of fruits basket',
 'sun sentinal news paper',
 'puerto rico economy',
 'wireless networking',
 'hidden valley ranch commercials',
 'jimmy carter the panama canal']

#### 2. Now, we write a function to turn search query into (normalized) feature vector. We also want to normalize our `tfidf_numpy` array.

In [20]:
def query2vec(single_query, normalized = True, vectorizer = vectorizer):
    vector = vectorizer.transform([single_query]) # iterable as if it is a document
    vector = vector.toarray().flatten() # flatten the 1 by N matrix to a simple vector
    if normalized:
        norm = np.linalg.norm(vector)
        if norm == 0:
            return vector
        else:
            return vector/norm
    return vector

In [21]:
# normalize our tfidf_numpy. Note that you want each row to be normalized, since each row represents a document.
from sklearn.preprocessing import normalize as sk_normalizer
tfidf_numpy_doc_normalized = sk_normalizer(tfidf_numpy,axis=1)

#### 3. For each query, find the top 3 most likely docs


In [22]:
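def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # return 0.0 when either vector is all zeros instead of dividing by zero
    return np.dot(a, b) / denom if denom else 0.0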

#### For each query, we calculate the similarity score with every document and print out the indices of the top 3 documents.

In [23]:
for idx, query in enumerate(queries):
    # calculate the query vector
    query_vector = query2vec(query)
    print("Checking query: [{}]".format(query))
    cos_scores = [cosine_similarity(query_vector, tfidf_numpy_doc_normalized[i]) for i in range(tfidf_numpy_doc_normalized.shape[0])]
    # find the top 3 maximal value's index
    top_3_index = np.array(cos_scores).argsort()[-3:][::-1]
    # print the contents to check that the results agree with intuition.
    for index in top_3_index:
        print(index)
        # uncomment the following line to check the actual contents.
        # print(newsgroups_train.data[index])

Checking query: [cheerleader nation tv show]
1471
695
1809
Checking query: [budget rental cars]
1399
354
1215
Checking query: [children who have died from moms postpartum depression]
899
838
2021
Checking query: [compaq presario notebook v5005us]
1264
1008
1530
Checking query: [boxed set of fruits basket]
1226
264
1494
Checking query: [sun sentinal news paper]
1246
1238
1927
Checking query: [puerto rico economy]
1382
441
1947
Checking query: [wireless networking]
665
876
813
Checking query: [hidden valley ranch commercials]
283
1539
482
Checking query: [jimmy carter the panama canal]
43
1855
1195
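
The per-document loop above can also be written as a single matrix-vector product. The sketch below uses a toy matrix and a hypothetical helper, `rank_documents`, which is not part of the notebook. It returns an empty result when the query vector is all zeros, which is what would happen for a query that shares no words with the vocabulary; in that case every similarity is undefined and any ranking would be arbitrary.

```python
import numpy as np

def rank_documents(query_vec, doc_matrix, k=3):
    """Return indices of the k documents most similar to query_vec, best first."""
    query_norm = np.linalg.norm(query_vec)
    if query_norm == 0:
        # No overlap with the vocabulary: nothing meaningful to rank.
        return np.array([], dtype=int)
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    # Keep zero-norm documents at score 0 instead of dividing by zero.
    safe_norms = np.where(doc_norms == 0, 1.0, doc_norms)
    scores = (doc_matrix @ query_vec) / (safe_norms * query_norm)
    return scores.argsort()[::-1][:k]

# Toy document matrix: 4 documents x 3 terms, including one empty document.
toy_doc_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 2.0],
    [0.0, 0.0, 0.0],
    [3.0, 3.0, 0.0],
])

best = rank_documents(np.array([1.0, 0.0, 0.0]), toy_doc_matrix, k=2)
empty = rank_documents(np.zeros(3), toy_doc_matrix)
print(best, empty)
```

With the notebook's objects, a call would look like `rank_documents(query2vec(query), tfidf_numpy_doc_normalized)`.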
