## Term Frequency - Inverse Document Frequency

The more frequently a term appears in a given document, and the fewer other documents in the corpus it appears in, the higher its TF-IDF value.
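To make that intuition concrete, here is a minimal sketch of the plain textbook form of TF-IDF (term frequency multiplied by the log of the inverse document frequency). The toy corpus and the `tf_idf` helper are hypothetical and exist only for illustration; scikit-learn uses a slightly different variant.

```python
import math
from collections import Counter

# Toy, pre-tokenized corpus (hypothetical, for illustration only).
docs = [
    ["lunar", "orbit", "fuel"],
    ["lunar", "orbit", "orbit"],
    ["radio", "satellite", "fuel"],
]

def tf_idf(term, doc, docs):
    # Term frequency: share of the document's tokens that are `term`.
    tf = Counter(doc)[term] / len(doc)
    # Document frequency: number of documents that contain `term`.
    df = sum(1 for d in docs if term in d)
    if df == 0:
        # A term that never occurs in the corpus gets a weight of zero here.
        return 0.0
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(len(docs) / df)
    return tf * idf

# "radio" occurs in one document, "fuel" in two, so "radio" scores higher
# within the third document even though both appear there exactly once.
print(tf_idf("radio", docs[2], docs))
print(tf_idf("fuel", docs[2], docs))
```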

In [2]:
import spacy

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

We use a dataset that ships with scikit-learn so that we can work with a larger corpus.
20 Newsgroups is a collection of roughly 18,000 newsgroup posts across 20 topics.

In [3]:
corpus = fetch_20newsgroups(categories=['sci.space'], remove=['headers','footers','quotes'])

In [4]:
print(type(corpus))

<class 'sklearn.utils._bunch.Bunch'>


In [5]:
len(corpus.data)

593

In [7]:
corpus.data[:2]

["\nAny lunar satellite needs fuel to do regular orbit corrections, and when\nits fuel runs out it will crash within months.  The orbits of the Apollo\nmotherships changed noticeably during lunar missions lasting only a few\ndays.  It is *possible* that there are stable orbits here and there --\nthe Moon's gravitational field is poorly mapped -- but we know of none.\n\nPerturbations from Sun and Earth are relatively minor issues at low\naltitudes.  The big problem is that the Moon's own gravitational field\nis quite lumpy due to the irregular distribution of mass within the Moon.",
 '\nGlad to see Griffin is spending his time on engineering rather than on\nritual purification of the language.  Pity he got stuck with the turkey\nrather than one of the sensible options.']

In [10]:
nlp = spacy.load('en_core_web_sm')
# get rid of named entity recognition and dependency parsing
unwanted_pipes = ['ner','parser']

In [9]:
# drop punctuation and whitespace (including newlines), keep only purely alphabetic tokens,
# and return each token's lemma (lemmatization relies on POS tagging, so the tagger stays enabled)
def spacy_tokenizer(doc):
    with nlp.disable_pipes(*unwanted_pipes):
        # t.lemma_ is the lemma string; t.lemma (no underscore) is its integer hash ID
        return [t.lemma_ for t in nlp(doc) if not t.is_punct and not t.is_space and t.is_alpha]

In [11]:
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)
features = vectorizer.fit_transform(corpus.data)

In [12]:
# number of unique tokens
print(len(vectorizer.get_feature_names_out()))

9469


In [15]:
# the dimensions of our feature matrix. X rows (documents) and Y columns (tokens)
print(features.shape)

(593, 9469)


In [16]:
# what the encoding of the first document looks like in sparse format
print(features[0])

  (0, 6609)	0.10452465319204224
  (0, 2792)	0.12746673572641337
  (0, 6474)	0.15331277268825122
  (0, 1975)	0.10862134983649849
  (0, 5931)	0.17102243214182047
  (0, 5823)	0.09939758959196421
  (0, 1398)	0.10183273017124134
  (0, 2352)	0.08455248650428543
  (0, 8033)	0.08929749232550656
  (0, 3940)	0.11094564582599897
  (0, 7284)	0.0824741348739743
  (0, 6066)	0.05104326044583604
  (0, 2483)	0.10269890253976902
  (0, 7976)	0.13259379932649137
  (0, 9015)	0.12524362655380208
  (0, 1370)	0.07376358403679217
  (0, 6880)	0.09512941822420008
  (0, 4099)	0.051873252060803496
  (0, 2766)	0.13901479196044814
  (0, 3214)	0.11219221685212125
  (0, 4047)	0.06321553828701997
  (0, 8309)	0.060883075560077576
  (0, 7551)	0.048917557559216875
  (0, 1942)	0.12319856435864923
  (0, 2697)	0.15331277268825122
  :	:
  (0, 2079)	0.09036221462758656
  (0, 1860)	0.17102243214182047
  (0, 3332)	0.10452465319204224
  (0, 486)	0.09913077191467295
  (0, 3867)	0.20402427950185778
  (0, 7733)	0.10099496207230896
 

Several TF-IDF variants exist. Among other things, scikit-learn by default smooths the IDF component (it adds one to both the numerator and the denominator, as if one extra document contained every term) and L2-normalizes each document vector.
Both behaviors can be changed using the smooth_idf and norm parameters, respectively.
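As a rough illustration of what smoothing and normalization do, the sketch below reimplements the IDF formulas given in the scikit-learn documentation in plain Python, then L2-normalizes the resulting weights. The helper names and the term counts and document frequencies are made up for the example; this is not scikit-learn's actual code.

```python
import math

def idf(n_docs, df, smooth=True):
    # Formulas as documented for scikit-learn's TfidfTransformer:
    #   smooth_idf=True  -> ln((1 + n) / (1 + df)) + 1
    #   smooth_idf=False -> ln(n / df) + 1
    if smooth:
        return math.log((1 + n_docs) / (1 + df)) + 1
    return math.log(n_docs / df) + 1

def l2_normalize(weights):
    # Scale the vector to unit Euclidean length (the effect of norm='l2').
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

# Hypothetical document: raw counts of three terms and the number of
# documents (out of 593) that contain each of them.
counts = [2, 1, 1]
dfs = [40, 5, 300]

raw = [c * idf(593, df) for c, df in zip(counts, dfs)]
unit = l2_normalize(raw)

print(raw)
print(unit)
print(sum(w * w for w in unit))  # approximately 1.0 after normalization
```

Because every document vector ends up with unit length, long and short documents become directly comparable when similarities are computed.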

In [19]:
# transform query into a TF-IDF vector
query = ['lunar orbit']
query_tfidf = vectorizer.transform(query)

In [20]:
# calculate the cosine similarity between the query and each document
# cosine_similarity returns a 2-D array of shape (n_documents, 1); flatten() turns it into a 1-D array
cosine_similarities = cosine_similarity(features, query_tfidf).flatten()
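
For intuition about what the call above computes, here is a minimal plain-Python sketch of cosine similarity between two dense vectors. The `cosine` helper and the three-dimensional vectors are hypothetical; returning 0.0 for an all-zero vector is a choice made in this sketch.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption of this sketch: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical 3-term vocabulary; each vector holds one weight per term.
doc_a = [0.6, 0.8, 0.0]   # shares the first term with the query
doc_b = [0.0, 0.0, 1.0]   # shares no term with the query
query = [1.0, 0.0, 0.0]

print(cosine(doc_a, query))
print(cosine(doc_b, query))
```

A document that shares no terms with the query scores zero, which is why most entries of the similarity array are 0 for a short query.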

In [21]:
import numpy as np

In [23]:
# numpy's argsort() returns an array of *indices* that would sort the input
# the sort is ascending, so the k largest cosine similarities sit at the end; slicing with a stop of -(k + 1) and a step of -1 returns those last k indices, most similar first

def top_k(arr, k):
    kth_largest = (k + 1) * -1
    return np.argsort(arr)[:kth_largest:-1]
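
The slice in `top_k` can be hard to read, so here is a small plain-Python analog of the same logic, using `sorted` over indices in place of `np.argsort`. The function name `top_k_plain` and the scores are made up for illustration.

```python
def top_k_plain(values, k):
    # Indices that would sort `values` in ascending order (what argsort returns).
    ascending = sorted(range(len(values)), key=lambda i: values[i])
    # A stop of -(k + 1) with a step of -1 walks backwards from the end
    # over exactly the last k entries, so the largest value comes first.
    return ascending[:-(k + 1):-1]

scores = [0.10, 0.48, 0.00, 0.27, 0.43]

print(top_k_plain(scores, 3))  # → [1, 4, 3]
```

Here the ascending index order is `[2, 0, 3, 4, 1]`, and walking back over the last three entries gives indices 1, 4, and 3, the positions of 0.48, 0.43, and 0.27.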

In [25]:
# run the query: get the indices of the 5 documents most similar to it
top_related_indices = top_k(cosine_similarities, 5)
print(top_related_indices)

[249 108   0 312 509]


In [27]:
print(cosine_similarities[top_related_indices])

[0.4784239  0.42898437 0.27362524 0.19484222 0.19133134]


In [28]:
print(corpus.data[top_related_indices[0]])


Actually, Hiten wasn't originally intended to go into lunar orbit at all,
so it indeed didn't have much fuel on hand.  The lunar-orbit mission was
an afterthought, after Hagoromo (a tiny subsatellite deployed by Hiten
during a lunar flyby) had a transmitter failure and its proper insertion
into lunar orbit couldn't be positively confirmed.

It should be noted that the technique does have disadvantages.  It takes
a long time, and you end up with a relatively inconvenient lunar orbit.
If you want something useful like a low circular polar orbit, you do have
to plan to expend a certain amount of fuel, although it is reduced from
what you'd need for the brute-force approach.


In [29]:
print(corpus.data[top_related_indices[1]])


Their Hiten engineering-test mission spent a while in a highly eccentric
Earth orbit doing lunar flybys, and then was inserted into lunar orbit
using some very tricky gravity-assist-like maneuvering.  This meant that
it would crash on the Moon eventually, since there is no such thing as
a stable lunar orbit (as far as anyone knows), and I believe I recall
hearing recently that it was about to happen.


In [30]:
# another query
query2 = ['satellite']
query_tfidf2 = vectorizer.transform(query2)

cosine_similarities2 = cosine_similarity(features, query_tfidf2).flatten()
top_related_indices2 = top_k(cosine_similarities2, 5)

print(top_related_indices2)
print(cosine_similarities2[top_related_indices2])

[378 138 248  61 236]
[0.38931334 0.34090226 0.29842859 0.25673657 0.24577701]


In [31]:
print(corpus.data[top_related_indices2[0]])



As an Amateur Radio operator (VHF 2metres) I like to keep up with what is 
going up (and for that matter what is coming down too).
 
In about 30 days I have learned ALOT about satellites current, future and 
past all the way back to Vanguard series and up to Astro D observatory 
(space).  I borrowed a book from the library called Weater Satellites (I 
think, it has a photo of the earth with a TIROS type satellite on it.)
 
I would like to build a model or have a large color poster of one of the 
TIROS satellites I think there are places in the USA that sell them.
ITOS is my favorite looking satellite, followed by AmSat-OSCAR 13 
(AO-13).
 
TTYL
73
Jim
