# Natural Language Processing Demystified | Simple Vectorization
https://nlpdemystified.org<br>
https://github.com/futuremojo/nlp-demystified

### spaCy upgrade and package installation.

At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy and download a statistical language model.
<br><br>
**IMPORTANT**<br>
If you're running this in the cloud rather than using a local Jupyter server on your machine, then the notebook will **time out** after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical packages.
<br><br>
Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:<br>
https://research.google.com/colaboratory/local-runtimes.html

In [1]:
# !pip install -U spacy==3.*
# !python -m spacy download en_core_web_sm
# !python -m spacy info

# Basic Bag-of-Words (BOW)

Course module for this demo: https://www.nlpdemystified.org/course/basic-bag-of-words

In [2]:
import spacy

from scipy import spatial
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Plain frequency BOW

In [3]:
# A corpus of sentences.
corpus = [
  "Red Bull drops hint on F1 engine.",
  "Honda exits F1, leaving F1 partner Red Bull.",
  "Hamilton eyes record eighth F1 title.",
  "Aston Martin announces sponsor."
]

We want to build a basic bag-of-words (BOW) representation of our corpus. Based on what you now know from the lesson, you can probably do this from scratch using dictionaries and lists (and maybe that's a good exercise). Fortunately, there are robust libraries which make it easy.
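
As a rough idea of what that from-scratch exercise could look like, here is a minimal sketch using only dictionaries and lists. The function name `build_bow` and the regular expression (lower-cased tokens of two or more word characters) are assumptions chosen for illustration; they only approximate how a library tokenizes text.

```python
import re

def build_bow(corpus):
    """Return (vocabulary, matrix) for a list of strings."""
    # Lower-case each document and keep tokens of 2+ word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]

    # Map each unique token to a column index, in alphabetical order.
    unique_tokens = sorted({tok for doc in tokenized for tok in doc})
    vocabulary = {tok: i for i, tok in enumerate(unique_tokens)}

    # One row per document, one column per token, holding raw counts.
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for tok in doc:
            row[vocabulary[tok]] += 1
        matrix.append(row)
    return vocabulary, matrix

corpus = [
    "Red Bull drops hint on F1 engine.",
    "Honda exits F1, leaving F1 partner Red Bull.",
    "Hamilton eyes record eighth F1 title.",
    "Aston Martin announces sponsor.",
]
vocabulary, matrix = build_bow(corpus)

# Column index of "f1" and how often it occurs in the second document.
print(vocabulary["f1"], matrix[1][vocabulary["f1"]])
```

Sorting the tokens before assigning indices keeps the column order stable across runs, which makes the rows easier to compare by eye.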

We can use the scikit-learn **CountVectorizer** which takes a collection of text documents and creates a matrix of token counts:<br>
https://scikit-learn.org/stable/index.html<br>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html




In [4]:
vectorizer = CountVectorizer()

In [5]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [6]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


In [12]:
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
vectorizer2.get_feature_names_out()

array(['and this', 'document is', 'first document', 'is the', 'is this',
       'second document', 'the first', 'the second', 'the third',
       'third one', 'this document', 'this is', 'this the'], dtype=object)

In [8]:
print(X2.toarray())

[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]


The *fit_transform* method does two things:
1. It learns a vocabulary dictionary from the corpus.
2. It returns a matrix where each row represents a document and each column represents a token (i.e. term).<br>

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit_transform


In [9]:
bow = vectorizer.fit_transform(corpus)

We can take a look at the features and vocabulary dictionary. Notice the **CountVectorizer** took care of tokenization for us. It also removed punctuation and lower-cased everything.

In [10]:
# View features (tokens).
print(vectorizer.get_feature_names_out())

# View vocabulary dictionary.
vectorizer.vocabulary_

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}

Specifically, the **CountVectorizer** generates a sparse matrix using an efficient, compressed representation. The sparse matrix object includes a number of useful methods:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html

In [13]:
print(type(bow))

<class 'scipy.sparse._csr.csr_matrix'>


If we look at the raw structure, we'll see tuples where the first element represents the document, and the second element represents a token ID. It's then followed by a count of that token. So in the second document (index 1), token 1 ("document") occurs twice.

In [14]:
print(bow)

  (0, 8)	1
  (0, 3)	1
  (0, 6)	1
  (0, 2)	1
  (0, 1)	1
  (1, 8)	1
  (1, 3)	1
  (1, 6)	1
  (1, 1)	2
  (1, 5)	1
  (2, 8)	1
  (2, 3)	1
  (2, 6)	1
  (2, 0)	1
  (2, 7)	1
  (2, 4)	1
  (3, 8)	1
  (3, 3)	1
  (3, 6)	1
  (3, 2)	1
  (3, 1)	1
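
To make that layout concrete, here is a small sketch that converts a dense count matrix into (row, column) to count entries, skipping zeros. It uses plain Python rather than scipy, and `to_coordinates` is a hypothetical helper name; real CSR storage is more compact than this dictionary.

```python
def to_coordinates(dense):
    """Collect the non-zero cells of a dense matrix as {(row, col): value}."""
    return {
        (r, c): value
        for r, row in enumerate(dense)
        for c, value in enumerate(row)
        if value != 0
    }

# The dense counts printed earlier for the four-document corpus.
dense = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]
entries = to_coordinates(dense)

# Count of token 1 in the second document (index 1).
print(entries[(1, 1)])
# How many cells actually need storing versus the full matrix size.
print(len(entries), "non-zero cells out of", len(dense) * len(dense[0]))
```

Only the non-zero cells are kept, which is why sparse formats pay off when the vocabulary is large and each document uses a small fraction of it.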


Before we explore further, we want to make a few modifications.
1. What if we want to use another tokenizer like spaCy's?
2. Instead of frequency, what if we want to have a binary BOW?


## Binary BOW with custom tokenizer

**CountVectorizer** supports using a custom tokenizer. For every document, it will call your tokenizer and expect a list of tokens returned. We'll create a simple callback below which has spaCy tokenize and filter tokens, and then return them.

In [15]:
# As usual, we start by importing spaCy and loading a statistical model.
nlp = spacy.load('en_core_web_sm')

# Create a tokenizer callback using spaCy under the hood. Here, we tokenize
# the passed-in text and return the tokens, filtering out punctuation.
def spacy_tokenizer(doc):
  return [t.text for t in nlp(doc) if not t.is_punct]


This time, we instantiate **CountVectorizer** with our custom tokenizer (*spacy_tokenizer*), turn off case-folding, and also set the *binary* parameter to *True* so we simply get 1s and 0s marking token presence rather than token frequency.

In [16]:
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True)
bow = vectorizer.fit_transform(corpus)
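
Conceptually, a binary BOW is just the frequency BOW with every non-zero count clipped to 1. A minimal plain-Python sketch of that idea (the helper name `binarize` is made up for illustration):

```python
def binarize(counts):
    """Replace every non-zero count with 1, leaving zeros untouched."""
    return [[1 if value > 0 else 0 for value in row] for row in counts]

counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],  # one token appears twice here
]
binary = binarize(counts)
print(binary[1])
```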



Looking at the resulting feature names and vocabulary dictionary, we can see our *spacy_tokenizer* being used. If you're not convinced, you can remove the punctuation filtering in our tokenizer and rerun the code.

In [17]:
print(vectorizer.get_feature_names_out())
vectorizer.vocabulary_

['And' 'Is' 'This' 'document' 'first' 'is' 'one' 'second' 'the' 'third'
 'this']


{'This': 2,
 'is': 5,
 'the': 8,
 'first': 4,
 'document': 3,
 'second': 7,
 'And': 0,
 'this': 10,
 'third': 9,
 'one': 6,
 'Is': 1}

To get a dense array representation of our sparse matrix, use *toarray*.<br>
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.toarray.html#scipy.sparse.csr_matrix.toarray

We can also index and slice into the sparse matrix.

In [18]:
print('A dense representation like we saw in the slides.')
print(bow.toarray())
print()
print('Indexing and slicing.')
print(bow[0])
print()
print(bow[0:2])

A dense representation like we saw in the slides.
[[0 0 1 1 1 1 0 0 1 0 0]
 [0 0 1 1 0 1 0 1 1 0 0]
 [1 0 0 0 0 1 1 0 1 1 1]
 [0 1 0 1 1 0 0 0 1 0 1]]

Indexing and slicing.
  (0, 2)	1
  (0, 5)	1
  (0, 8)	1
  (0, 4)	1
  (0, 3)	1

  (0, 2)	1
  (0, 5)	1
  (0, 8)	1
  (0, 4)	1
  (0, 3)	1
  (1, 2)	1
  (1, 5)	1
  (1, 8)	1
  (1, 3)	1
  (1, 7)	1


## Cosine Similarity

Writing your own cosine similarity function is straightforward using numpy (left as an exercise). There are multiple ways to calculate it using scipy.
<br><br>
One way is using the **spatial** package, which is a collection of spatial algorithms and data structures. It has a method to calculate cosine *distance*. To get the cosine *similarity*, we have to subtract the distance from 1.<br>
https://docs.scipy.org/doc/scipy/reference/spatial.html<br>
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html#scipy.spatial.distance.cosine

In [19]:
# The cosine method expects array_like inputs, so we need to generate
# arrays from our sparse matrix.
doc1_vs_doc2 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[1].toarray()[0])
doc1_vs_doc3 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[2].toarray()[0])
doc1_vs_doc4 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[3].toarray()[0])

print(corpus)

print(f"Doc 1 vs Doc 2: {doc1_vs_doc2}")
print(f"Doc 1 vs Doc 3: {doc1_vs_doc3}")
print(f"Doc 1 vs Doc 4: {doc1_vs_doc4}")

['This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?']
Doc 1 vs Doc 2: 0.8
Doc 1 vs Doc 3: 0.3651483716701107
Doc 1 vs Doc 4: 0.6
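
The relationship between cosine similarity and cosine distance can be checked by hand. The sketch below uses plain Python (not numpy or scipy) and a hypothetical helper `cosine_similarity_plain`; it assumes neither vector is all zeros, since that would cause a division by zero.

```python
import math

def cosine_similarity_plain(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Binary BOW rows for the first two documents, copied from the dense output above.
doc1 = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0]
doc2 = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0]

similarity = cosine_similarity_plain(doc1, doc2)
distance = 1 - similarity  # similarity and distance always sum to 1
print(similarity, distance)
```

The two documents share four of their five tokens, so the similarity works out to 4 / (√5 · √5) = 0.8, matching the result above up to floating-point rounding.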


Another approach is using scikit-learn's *cosine_similarity* which computes the metric between multiple vectors. Here, we pass it our BOW and get a matrix of cosine similarities between each document.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

In [20]:
# cosine_similarity can take either array-likes or sparse matrices.
print(cosine_similarity(bow))

[[1.         0.8        0.36514837 0.6       ]
 [0.8        1.         0.36514837 0.4       ]
 [0.36514837 0.36514837 1.         0.36514837]
 [0.6        0.4        0.36514837 1.        ]]


## N-grams

**CountVectorizer** includes an *ngram_range* parameter to generate different n-grams. *ngram_range* is specified as a (minimum, maximum) tuple. By default, it is set to (1, 1), which generates only unigrams. Setting it to (1, 2) generates both unigrams and bigrams.

In [21]:
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True, ngram_range=(1,2))
bigrams = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print('Number of features: {}'.format(len(vectorizer.get_feature_names_out())))
print(vectorizer.vocabulary_)

['And' 'And this' 'Is' 'Is this' 'This' 'This document' 'This is'
 'document' 'document is' 'first' 'first document' 'is' 'is the' 'one'
 'second' 'second document' 'the' 'the first' 'the second' 'the third'
 'third' 'third one' 'this' 'this is' 'this the']
Number of features: 25
{'This': 4, 'is': 11, 'the': 16, 'first': 9, 'document': 7, 'This is': 6, 'is the': 12, 'the first': 17, 'first document': 10, 'second': 14, 'This document': 5, 'document is': 8, 'the second': 18, 'second document': 15, 'And': 0, 'this': 22, 'third': 20, 'one': 13, 'And this': 1, 'this is': 23, 'the third': 19, 'third one': 21, 'Is': 2, 'Is this': 3, 'this the': 24}


In [22]:
# Setting ngram_range to (2, 2) generates only bigrams.
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True, ngram_range=(2,2))
bigrams = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(vectorizer.vocabulary_)

['And this' 'Is this' 'This document' 'This is' 'document is'
 'first document' 'is the' 'second document' 'the first' 'the second'
 'the third' 'third one' 'this is' 'this the']
{'This is': 3, 'is the': 6, 'the first': 8, 'first document': 5, 'This document': 2, 'document is': 4, 'the second': 9, 'second document': 7, 'And this': 0, 'this is': 12, 'the third': 10, 'third one': 11, 'Is this': 1, 'this the': 13}
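
To see where these features come from, n-grams can be produced by sliding a window of each size over the token list. This is a simplified sketch; the function `ngrams` is a hypothetical helper, and its output order is not meant to match any library.

```python
def ngrams(tokens, min_n, max_n):
    """Return all n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of size n across the tokens.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["This", "is", "the", "first", "document"]
print(ngrams(tokens, 1, 2))  # unigrams and bigrams
print(ngrams(tokens, 2, 2))  # bigrams only
```

A document with five tokens yields five unigrams and four bigrams, which hints at how quickly the feature count grows as the n-gram range widens.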


## Basic Bag-of-Words Exercises

In [35]:
#
# EXERCISE: Create a spacy_tokenizer callback which takes a string and returns
# a list of tokens (each token's text) with punctuation filtered out.
#
corpus = [
  "Students use their GPS-enabled cellphones to take birdview photographs of a land in order to find specific danger points such as rubbish heaps.",
  "Teenagers are enthusiastic about taking aerial photograph in order to study their neighbourhood.",
  "Aerial photography is a great way to identify terrestrial features that aren’t visible from the ground level, such as lake contours or river paths.",
  "During the early days of digital SLRs, Canon was pretty much the undisputed leader in CMOS image sensor technology.",
  "Syrian President Bashar al-Assad tells the US it will 'pay the price' if it strikes against Syria."
]

nlp = spacy.load('en_core_web_sm')

def custom_tokenizer_callback(text):
    # Note: this relies on CountVectorizer's default tokenizer; spaCy is not used here.
 
    # Initialize CountVectorizer
    vectorizer = CountVectorizer()

    # Learn the vocabulary and transform the documents into a bag-of-words matrix
    bag_of_words_matrix = vectorizer.fit_transform(text)

    # Get the vocabulary (unique words) and their corresponding indices
    vocabulary = vectorizer.get_feature_names_out()

    return vocabulary
 
print("\nVocabulary:")
print(custom_tokenizer_callback(corpus))


Vocabulary:
['about' 'aerial' 'against' 'al' 'are' 'aren' 'as' 'assad' 'bashar'
 'birdview' 'canon' 'cellphones' 'cmos' 'contours' 'danger' 'days'
 'digital' 'during' 'early' 'enabled' 'enthusiastic' 'features' 'find'
 'from' 'gps' 'great' 'ground' 'heaps' 'identify' 'if' 'image' 'in' 'is'
 'it' 'lake' 'land' 'leader' 'level' 'much' 'neighbourhood' 'of' 'or'
 'order' 'paths' 'pay' 'photograph' 'photographs' 'photography' 'points'
 'president' 'pretty' 'price' 'river' 'rubbish' 'sensor' 'slrs' 'specific'
 'strikes' 'students' 'study' 'such' 'syria' 'syrian' 'take' 'taking'
 'technology' 'teenagers' 'tells' 'terrestrial' 'that' 'the' 'their' 'to'
 'undisputed' 'us' 'use' 'visible' 'was' 'way' 'will']


In [36]:
#
# EXERCISE: Initialize a CountVectorizer object and set it to use
# your spacy_tokenizer with lower-casing off and to create a binary BOW.
#

# Instantiate a CountVectorizer object called 'vectorizer'.
# Initialize CountVectorizer with binary=True
vectorizer = CountVectorizer(binary=True)

# Create a binary BOW from the corpus using your CountVectorizer.
# Fit and transform the corpus
binary_bow = vectorizer.fit_transform(corpus)

# Convert to array for better readability
binary_bow_array = binary_bow.toarray()

# Get feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Display results
print("Feature Names:", feature_names)
print("Binary BoW Representation:\n", binary_bow_array)


Feature Names: ['about' 'aerial' 'against' 'al' 'are' 'aren' 'as' 'assad' 'bashar'
 'birdview' 'canon' 'cellphones' 'cmos' 'contours' 'danger' 'days'
 'digital' 'during' 'early' 'enabled' 'enthusiastic' 'features' 'find'
 'from' 'gps' 'great' 'ground' 'heaps' 'identify' 'if' 'image' 'in' 'is'
 'it' 'lake' 'land' 'leader' 'level' 'much' 'neighbourhood' 'of' 'or'
 'order' 'paths' 'pay' 'photograph' 'photographs' 'photography' 'points'
 'president' 'pretty' 'price' 'river' 'rubbish' 'sensor' 'slrs' 'specific'
 'strikes' 'students' 'study' 'such' 'syria' 'syrian' 'take' 'taking'
 'technology' 'teenagers' 'tells' 'terrestrial' 'that' 'the' 'their' 'to'
 'undisputed' 'us' 'use' 'visible' 'was' 'way' 'will']
Binary BoW Representation:
 [[0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 1
  0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1
  1 0 0 1 0 0 0 0]
 [1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
  0 0 0 1 0 0 1 0 0 

In [46]:
#
# The string below is a whole paragraph. We want to create another
# binary BOW but using the vocabulary of our *current* CountVectorizer. This means
# that words in this paragraph which AREN'T already in the vocabulary won't be
# represented. This is to illustrate how BOW can't handle out-of-vocabulary words
# unless you rebuild your whole vocabulary. Still, we'll see that if there's
# enough overlapping vocabulary, some similarity can still be picked up.
#
# Note that we call 'transform' only instead of 'fit_transform' because the
# fit step (i.e. vocabulary build) is already done and we don't want to re-fit here.
#
corpus_s = [
    "Teenagers take aerial shots of their neighbourhood using digital cameras sitting in old bottles which are launched via kites - a common toy for children living in the favelas. "
    "They then use GPS-enabled smartphones to take pictures of specific danger points - such as rubbish heaps, which can become a breeding ground for mosquitoes carrying dengue fever."
    ]

new_bow = vectorizer.transform(corpus_s)

custom_tokenizer_callback(corpus_s)

array(['aerial', 'are', 'as', 'become', 'bottles', 'breeding', 'cameras',
       'can', 'carrying', 'children', 'common', 'danger', 'dengue',
       'digital', 'enabled', 'favelas', 'fever', 'for', 'gps', 'ground',
       'heaps', 'in', 'kites', 'launched', 'living', 'mosquitoes',
       'neighbourhood', 'of', 'old', 'pictures', 'points', 'rubbish',
       'shots', 'sitting', 'smartphones', 'specific', 'such', 'take',
       'teenagers', 'the', 'their', 'then', 'they', 'to', 'toy', 'use',
       'using', 'via', 'which'], dtype=object)
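
The effect of calling *transform* without re-fitting can be illustrated with a fixed vocabulary: tokens that are not in it are silently dropped. The sketch below is plain Python with a made-up four-word vocabulary and a hypothetical helper `encode_with_vocabulary`; it is not how scikit-learn is implemented.

```python
import re

def encode_with_vocabulary(text, vocabulary):
    """Binary-encode text against a fixed vocabulary; unknown tokens are ignored."""
    row = [0] * len(vocabulary)
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        if token in vocabulary:
            row[vocabulary[token]] = 1
    return row

# A vocabulary "fit" on some earlier corpus (hypothetical, for illustration).
vocabulary = {"aerial": 0, "danger": 1, "photograph": 2, "rubbish": 3}

# "shots", "heaps" and "mosquitoes" are out-of-vocabulary and leave no trace.
encoded = encode_with_vocabulary("Aerial shots of rubbish heaps and mosquitoes.", vocabulary)
print(encoded)
```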

In [47]:
#
# EXERCISE: using the pairwise cosine_similarity method from sklearn,
# calculate the similarities between each document from the corpus against
# this new document (new_bow). HINT: You can pass two parameters to
# cosine_similarity in this case. See the docs:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
#
# Which document is the most similar? Which is the least similar? Do the results make sense
# based on what you see?
#
# Calculate pairwise cosine similarity
similarities = cosine_similarity(binary_bow, new_bow)

print("Pairwise Cosine Similarities:")
print(similarities)

Pairwise Cosine Similarities:
[[0.68181818]
 [0.41391868]
 [0.26673253]
 [0.20100756]
 [0.05330018]]


In [None]:
#
# EXERCISE: Implement your own cosine similarity method using numpy.
# It should take two numpy arrays and output the similarity metric.
# HINTS:
# https://numpy.org/doc/stable/reference/generated/numpy.dot.html
# https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html
#
# Verify the similarity between the first document in the corpus and the
# paragraph is the same as the one you got from using pairwise cosine_similarity.
#
import numpy as np
from numpy.linalg import norm

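def cos_sim(a, b):
  # Dot product of the vectors divided by the product of their norms.
  return np.dot(a, b) / (norm(a) * norm(b))

# Verify our function against sklearn for the first document vs. the paragraph.
first_doc = binary_bow[0].toarray()[0]
paragraph = new_bow.toarray()[0]
assert np.isclose(cos_sim(first_doc, paragraph), cosine_similarity(binary_bow, new_bow)[0, 0])

print("Cosine Similarity:", cosine_similarity(new_bow, binary_bow))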
# Pairwise Cosine Similarities:
# [[0.68181818]
#  [0.41391868]
#  [0.26673253]
#  [0.20100756]
#  [0.05330018]]

Cosine Similarity: [[0.68181818 0.41391868 0.26673253 0.20100756 0.05330018]]


In [50]:
#
# EXERCISE: In spacy_tokenizer, instead of returning the plain text,
# return the lemma_ attribute instead. How do the cosine similarity
# results differ? What if you filter out stop words as well?
#

# Custom preprocessing function using spaCy
def spacy_preprocessor(text):
    doc = nlp(text)
    # Lemmatize and remove stopwords
    return " ".join([token.lemma_ for token in doc if not token.is_stop and not token.is_punct])

# Preprocess text using spaCy
preprocessed_texts = [spacy_preprocessor(text) for text in corpus]

def custom_tokenizer_callback(text):
    # Note: this relies on CountVectorizer's default tokenizer; spaCy is not used here.
 
    # Initialize CountVectorizer
    vectorizer = CountVectorizer()

    # Learn the vocabulary and transform the documents into a bag-of-words matrix
    bag_of_words_matrix = vectorizer.fit_transform(text)

    # Get the vocabulary (unique words) and their corresponding indices
    vocabulary = vectorizer.get_feature_names_out()

    return vocabulary
 
print("\nVocabulary:")
print(custom_tokenizer_callback(preprocessed_texts))
 


Vocabulary:
['aerial' 'al' 'assad' 'bashar' 'birdview' 'canon' 'cellphone' 'cmos'
 'contour' 'danger' 'day' 'digital' 'early' 'enable' 'enthusiastic'
 'feature' 'find' 'gps' 'great' 'ground' 'heap' 'identify' 'image' 'lake'
 'land' 'leader' 'level' 'neighbourhood' 'order' 'path' 'pay' 'photograph'
 'photography' 'point' 'president' 'pretty' 'price' 'river' 'rubbish'
 'sensor' 'slrs' 'specific' 'strike' 'student' 'study' 'syria' 'syrian'
 'take' 'technology' 'teenager' 'tell' 'terrestrial' 'undisputed' 'use'
 'visible' 'way']


# TF-IDF

Course module for this demo: https://www.nlpdemystified.org/course/tf-idf

**NOTE: If the notebook timed out, you may need to re-upgrade spaCy and re-install the language model as follows:**

In [None]:
# !pip install -U spacy==3.*
# !python -m spacy download en_core_web_sm
# !python -m spacy info

In [51]:
import spacy

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Fetching datasets

This time around, rather than using a short toy corpus, let's use a larger dataset. scikit-learn has a **datasets** module with utilities to load datasets of our own as well as fetch popular reference datasets online.<br>
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets
<br><br>
We'll use the **20 newsgroups** dataset, which is a collection of 18,000 newsgroup posts across 20 topics.<br>
https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset
<br><br>
List of datasets available:<br>
https://scikit-learn.org/stable/datasets.html#datasets

The **datasets** module includes fetchers for each dataset in scikit-learn. For our purposes, we'll fetch only the posts from the *sci.space* topic, and skip headers, footers, and quoted text from other posts.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups
<br><br>
By default, the fetcher retrieves the *training* subset of the data only. If you don't know what that means, it'll become clear later in the course when we discuss modelling. For now, it doesn't matter for our purposes.

In [52]:
corpus = fetch_20newsgroups(categories=['sci.space'],
                            remove=('headers', 'footers', 'quotes'))

We get back a **Bunch** container object containing the data as well as other information.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html
<br><br>
The actual posts are accessed through the *data* attribute, which is a list of strings, each one representing a post.

In [53]:
print(type(corpus))

<class 'sklearn.utils._bunch.Bunch'>


In [54]:
# Number of posts in our dataset.
len(corpus.data)

593

In [55]:
# View first two posts.
corpus.data[:2]

["\nAny lunar satellite needs fuel to do regular orbit corrections, and when\nits fuel runs out it will crash within months.  The orbits of the Apollo\nmotherships changed noticeably during lunar missions lasting only a few\ndays.  It is *possible* that there are stable orbits here and there --\nthe Moon's gravitational field is poorly mapped -- but we know of none.\n\nPerturbations from Sun and Earth are relatively minor issues at low\naltitudes.  The big problem is that the Moon's own gravitational field\nis quite lumpy due to the irregular distribution of mass within the Moon.",
 '\nGlad to see Griffin is spending his time on engineering rather than on\nritual purification of the language.  Pity he got stuck with the turkey\nrather than one of the sensible options.']

## Creating TF-IDF features

In [56]:
# Like before, if we want to use spaCy's tokenizer, we need
# to create a callback. Remember to upgrade spaCy if you need
# to (refer to the beginning of the file for commentary and instructions).
nlp = spacy.load('en_core_web_sm')

# We don't need named-entity recognition nor dependency parsing for
# this so these components are disabled. This will speed up the
# pipeline. We do need part-of-speech tagging however.
unwanted_pipes = ["ner", "parser"]

# For this exercise, we'll remove punctuation and spaces (which
# includes newlines), filter for tokens consisting of alphabetic
# characters, and return the lemma (which requires POS tagging).
def spacy_tokenizer(doc):
  with nlp.disable_pipes(*unwanted_pipes):
    return [t.lemma_ for t in nlp(doc) if \
            not t.is_punct and \
            not t.is_space and \
            t.is_alpha]

Like the classes to create raw frequency and binary bag-of-words vectors, scikit-learn includes a similar class called **TfidfVectorizer** to create TF-IDF vectors from a corpus.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
<br><br>
The usage pattern is similar in that we call *fit_transform* on the corpus which generates the vocabulary dictionary (fit step), and generates the TF-IDF vectors (transform step).

In [57]:
%%time
# Use the default settings of TfidfVectorizer.
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)
features = vectorizer.fit_transform(corpus.data)



CPU times: total: 30.1 s
Wall time: 33 s


In [58]:
# The number of unique tokens.
print(len(vectorizer.get_feature_names_out()))

9463


In [59]:
# The dimensions of our feature matrix. X rows (documents) by Y columns (tokens).
print(features.shape)

(593, 9463)


In [60]:
# What the encoding of the first document looks like in sparse format.
print(features[0])

  (0, 424)	0.07006735123597327
  (0, 4943)	0.17755697785104502
  (0, 7310)	0.08827255510573831
  (0, 5573)	0.07462737620371114
  (0, 3317)	0.1987888389166129
  (0, 8517)	0.06551158102003457
  (0, 2378)	0.04343054547334542
  (0, 6912)	0.13559878838138195
  (0, 5908)	0.21554277358564625
  (0, 1847)	0.13559878838138195
  (0, 370)	0.1054358136369086
  (0, 9237)	0.0715855496878138
  (0, 4402)	0.07522156165875085
  (0, 7244)	0.0978911139378133
  (0, 5963)	0.0643662961391887
  (0, 4393)	0.07654434326236456
  (0, 9274)	0.059872496633831214
  (0, 1902)	0.13559878838138195
  (0, 9311)	0.1929427392927135
  (0, 5402)	0.10099174099290609
  (0, 8393)	0.20401777246040834
  (0, 5817)	0.09912761029075574
  (0, 449)	0.10452131953855516
  (0, 5429)	0.17101697764367227
  (0, 1348)	0.09035933266335426
  :	:
  (0, 6381)	0.1533078830125271
  (0, 5048)	0.12319463512940872
  (0, 1145)	0.04891599740875135
  (0, 9181)	0.06193123322498519
  (0, 4611)	0.06321352212439027
  (0, 5679)	0.11218863865345223
  (0, 6207)

As we mentioned in the slides, there are several TF-IDF variations. Among other things, scikit-learn's version adds **smoothing** (adding one to the numerator and denominator of the IDF component) and normalizes each document vector by default. These can be disabled if desired using the *smooth_idf* and *norm* parameters respectively. See here for more information:<br>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
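
To make the smoothing and normalization steps tangible, here is a plain-Python sketch of one document's TF-IDF row. It assumes the smoothed formula given in scikit-learn's documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The helper name `tfidf_row` and the toy numbers are hypothetical, and this is not the library's actual implementation.

```python
import math

def tfidf_row(counts, doc_freq, n_docs):
    """TF-IDF for one document using smoothed IDF and L2 normalization."""
    # Smoothed IDF: add one to the numerator and denominator, then add 1 so
    # that terms appearing in every document still get a non-zero weight.
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    weights = [tf * w for tf, w in zip(counts, idf)]
    # Scale the row to unit Euclidean length (skipped for an all-zero row).
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights] if length else weights

# Hypothetical toy numbers: 4 documents, 3 terms.
doc_freq = [4, 2, 1]   # number of documents containing each term
counts = [1, 1, 1]     # term counts in the document being encoded
row = tfidf_row(counts, doc_freq, n_docs=4)
print(row)
```

With equal term counts, the rarest term (found in one document) ends up with the largest weight and the term found in every document with the smallest, which is the behaviour TF-IDF is designed to produce.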


## Querying the data

The similarity measuring techniques we learned previously can be used here in the same way. In effect, we can query our data using this sequence:
1. *Transform* our query using the same vocabulary from our *fit* step on our corpus.
2. Calculate the pairwise cosine similarities between each document in our corpus and our query.
3. Sort them in descending order by score.

In [61]:
# Transform the query into a TF-IDF vector.
query = ["lunar orbit"]
query_tfidf = vectorizer.transform(query)

In [62]:
# Calculate the cosine similarities between the query and each document.
# We're calling flatten() here because cosine_similarity returns a 2D
# array (one row per document) and we just want a flat, 1D array.
cosine_similarities = cosine_similarity(features, query_tfidf).flatten()

Now that we have our list of cosine similarities, we can use this utility function to return the indices of the top k documents with the highest cosine similarities.

In [63]:
import numpy as np

# numpy's argsort() method returns a list of *indices* that
# would sort an array:
# https://numpy.org/doc/stable/reference/generated/numpy.argsort.html
#
# The sort is ascending, but we want the largest k cosine_similarities
# at the bottom of the sort. So we negate k, and get the last k
# entries of the indices list in reverse order. There are faster
# ways to do this using things like argpartition but this is
# more succinct.
def top_k(arr, k):
  kth_largest = (k + 1) * -1
  return np.argsort(arr)[:kth_largest:-1]
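
For reference, the argpartition alternative mentioned in the comment above might look like the sketch below. The name `top_k_partition` is hypothetical, and the function assumes 1 <= k <= len(arr). It avoids sorting the whole array by first isolating the k largest entries and then sorting only those.

```python
import numpy as np

def top_k_partition(arr, k):
    """Indices of the k largest values, ordered from largest to smallest."""
    arr = np.asarray(arr)
    # argpartition puts the indices of the k largest values in the last k
    # slots, in no particular order.
    candidates = np.argpartition(arr, -k)[-k:]
    # Sort only those k candidates by score, descending.
    return candidates[np.argsort(arr[candidates])[::-1]]

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
print(top_k_partition(scores, 3))
```

For a few hundred documents the difference is negligible; the partition approach mainly matters when the score array is very large and k is small.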

In [64]:
# So for our query above, these are the top five documents.
top_related_indices = top_k(cosine_similarities, 5)
print(top_related_indices)

[249 108   0 312 509]


In [65]:
# Let's take a look at their respective cosine similarities.
print(cosine_similarities[top_related_indices])

[0.47796463 0.42917994 0.27361651 0.19484941 0.19147591]


In [66]:
# Top match.
print(corpus.data[top_related_indices[0]])


Actually, Hiten wasn't originally intended to go into lunar orbit at all,
so it indeed didn't have much fuel on hand.  The lunar-orbit mission was
an afterthought, after Hagoromo (a tiny subsatellite deployed by Hiten
during a lunar flyby) had a transmitter failure and its proper insertion
into lunar orbit couldn't be positively confirmed.

It should be noted that the technique does have disadvantages.  It takes
a long time, and you end up with a relatively inconvenient lunar orbit.
If you want something useful like a low circular polar orbit, you do have
to plan to expend a certain amount of fuel, although it is reduced from
what you'd need for the brute-force approach.


In [67]:
# Second-best match.
print(corpus.data[top_related_indices[1]])


Their Hiten engineering-test mission spent a while in a highly eccentric
Earth orbit doing lunar flybys, and then was inserted into lunar orbit
using some very tricky gravity-assist-like maneuvering.  This meant that
it would crash on the Moon eventually, since there is no such thing as
a stable lunar orbit (as far as anyone knows), and I believe I recall
hearing recently that it was about to happen.


In [68]:
# Try a different query
query = ["satellite"]
query_tfidf = vectorizer.transform(query)

cosine_similarities = cosine_similarity(features, query_tfidf).flatten()
top_related_indices = top_k(cosine_similarities, 5)

print(top_related_indices)
print(cosine_similarities[top_related_indices])

[378 138 248 539  61]
[0.38932857 0.34067377 0.29841515 0.266025   0.25696839]


In [69]:
print(corpus.data[top_related_indices[0]])



As an Amateur Radio operator (VHF 2metres) I like to keep up with what is 
going up (and for that matter what is coming down too).
 
In about 30 days I have learned ALOT about satellites current, future and 
past all the way back to Vanguard series and up to Astro D observatory 
(space).  I borrowed a book from the library called Weater Satellites (I 
think, it has a photo of the earth with a TIROS type satellite on it.)
 
I would like to build a model or have a large color poster of one of the 
TIROS satellites I think there are places in the USA that sell them.
ITOS is my favorite looking satellite, followed by AmSat-OSCAR 13 
(AO-13).
 
TTYL
73
Jim


So here we have the beginnings of a simple search engine but we're a far cry from competing with commercial off-the-shelf search engines, let alone Google.
<br>
- For each query, we're scanning through our entire corpus, but in practice, you'll want to create an **inverted index**. Search applications such as Elasticsearch do that under the hood.
- You'd also want to evaluate the efficacy of your search using metrics like **precision** and **recall**.
- Document ranking also tends to be more sophisticated, using ranking functions like Okapi BM25. With major search engines, ranking also involves hundreds of variables such as what the user searched for previously, what they tend to click on, where they are physically located, and so on. These variables are part of the "secret sauce" and are closely guarded by companies.
- Beyond word presence, intent and meaning are playing a larger role.
<br>

Information Retrieval is a huge, rich topic and beyond search, it's also key in tasks such as question-answering.
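
As a rough illustration of the inverted index idea mentioned above, the sketch below maps each token to the set of documents containing it, so a query only needs to look at documents that share at least one token with it. The function names, the whitespace tokenization, and the three toy documents are all assumptions for illustration; production systems such as Elasticsearch are far more elaborate.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def candidate_docs(index, query):
    """Ids of documents containing at least one query token."""
    hits = set()
    for token in query.lower().split():
        hits |= index.get(token, set())
    return hits

docs = ["lunar orbit insertion", "weather satellite poster", "stable lunar orbit"]
index = build_inverted_index(docs)

print(sorted(candidate_docs(index, "lunar orbit")))
print(sorted(candidate_docs(index, "satellite")))
```

Scoring (for example with cosine similarity) would then be run only on the candidate set instead of the entire corpus.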

## TF-IDF Exercises

**EXERCISE**<br>
Read up on these concepts we just mentioned if you're curious.<br>

https://en.wikipedia.org/wiki/Inverted_index<br>
https://en.wikipedia.org/wiki/Precision_and_recall<br>
https://en.wikipedia.org/wiki/Okapi_BM25<br>
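
As a starting point for measuring search quality, precision and recall for a single query can be computed from two sets: what the system retrieved and what is actually relevant. In the sketch below, the retrieved indices are the top five from the "lunar orbit" query earlier, while the relevant set is a made-up, hand-labelled example; `precision_recall` is a hypothetical helper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = [249, 108, 0, 312, 509]  # top-5 result indices for one query
relevant = [249, 108, 0, 77]         # hypothetical hand-labelled relevant documents
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```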

In [70]:
#
# EXERCISE: fetch multiple topics from the 20 newsgroups
# dataset and query them using the approach we followed.
# A list of topics can be found here:
# https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset
#
# If you're feeling ambitious, incorporate n-grams or
# look at how you can measure precision and recall.
#
from sklearn.datasets import fetch_20newsgroups

# Download the dataset
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)

In [71]:
# The fetch_20newsgroups function returns a Bunch object, which behaves like a dictionary
# whose keys are also accessible as attributes. The most useful ones are:
#
#   data: the raw text of each newsgroup post.
#   target: the integer category index of each post in data.
#   target_names: the category names; target_names[i] is the name for index i.
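
To show how these three attributes relate, here's a tiny hand-built `Bunch` that mimics the shape of the dataset. The posts and labels are invented for the example, and the real `target` is a NumPy array rather than a list.

```python
from sklearn.utils import Bunch

# Hypothetical stand-in for the object returned by fetch_20newsgroups.
toy = Bunch(
    data=["post about rockets", "post about cars", "another rocket post"],
    target=[1, 0, 1],
    target_names=["rec.autos", "sci.space"],
)

# Attribute access and dictionary-style access return the same object.
print(toy.data is toy["data"])  # True

# target holds indices into target_names, one per entry in data.
for text, label in zip(toy.data, toy.target):
    print(f"{toy.target_names[label]}: {text}")
```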

In [72]:
# Exploring the datasets
print("Category names: ", newsgroups_train.target_names)
print("First text sample: ", newsgroups_train.data[0])
print("Corresponding category: ", newsgroups_train.target[0])

Category names:  ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
First text sample:  From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made,

In [73]:
# Preprocess the text data.
# A model needs the text converted into a structured numeric format,
# typically with text vectorization (e.g., bag-of-words or TF-IDF).
from sklearn.feature_extraction.text import TfidfVectorizer

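# Use TF-IDF vectorization. Fit the vocabulary and IDF weights on the training set only,
# then reuse them to transform the test set.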
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)
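
The same fit-then-transform pattern also supports the query approach the exercise asks for. Here's a minimal sketch on a made-up three-document corpus (the `toy_` names are only there to avoid clobbering the notebook's variables): the vectorizer is fitted on the documents, the query is transformed with that fixed vocabulary, and documents are ranked by cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the newsgroup posts.
toy_docs = [
    "weather satellites orbit the earth",
    "the shuttle launch was delayed",
    "amateur radio satellites relay signals",
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
doc_matrix = toy_vectorizer.fit_transform(toy_docs)   # learns vocabulary + IDF

# transform() reuses the fitted vocabulary; it never adds new terms.
query_vec = toy_vectorizer.transform(["satellites in orbit"])
similarities = cosine_similarity(query_vec, doc_matrix).ravel()

# Rank document indices from most to least similar.
ranking = similarities.argsort()[::-1].tolist()
print(ranking)  # [0, 2, 1]

# A query made only of unseen words maps to an all-zero vector.
print(toy_vectorizer.transform(["hockey"]).nnz)  # 0
```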

In [74]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

# Train the model
model = MultinomialNB()
model.fit(X_train, newsgroups_train.target)

# Make predictions
predictions = model.predict(X_test)
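
Note that `predict` returns category indices, not names. As a small self-contained sketch (toy posts and labels invented for illustration, with `toy_` names to keep the notebook's variables intact), here's how predictions can be mapped back to readable category names:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical two-category training set.
toy_names = ["rec.autos", "sci.space"]
toy_texts = [
    "car engine dealer tires",
    "engine car wheels brakes",
    "rocket launch orbit satellite",
    "satellite orbit nasa launch",
]
toy_targets = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_targets)

# New posts must go through the same fitted vectorizer.
new_posts = ["new tires for my car", "the satellite reached orbit"]
predicted = toy_model.predict(toy_vectorizer.transform(new_posts))

# Look up each predicted index in the list of category names.
for post, idx in zip(new_posts, predicted):
    print(f"{toy_names[idx]}: {post}")
```

With the real dataset, the equivalent lookup would be `newsgroups_test.target_names[idx]` for each entry in `predictions`.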

In [75]:
# Evaluate the model
accuracy = accuracy_score(newsgroups_test.target, predictions)
report = classification_report(newsgroups_test.target, predictions, target_names=newsgroups_test.target_names)

print("Accuracy: ", accuracy)
print(report)

Accuracy:  0.8169144981412639
                          precision    recall  f1-score   support

             alt.atheism       0.80      0.69      0.74       319
           comp.graphics       0.78      0.72      0.75       389
 comp.os.ms-windows.misc       0.79      0.72      0.75       394
comp.sys.ibm.pc.hardware       0.68      0.81      0.74       392
   comp.sys.mac.hardware       0.86      0.81      0.84       385
          comp.windows.x       0.87      0.78      0.82       395
            misc.forsale       0.87      0.80      0.83       390
               rec.autos       0.88      0.91      0.90       396
         rec.motorcycles       0.93      0.96      0.95       398
      rec.sport.baseball       0.91      0.92      0.92       397
        rec.sport.hockey       0.88      0.98      0.93       399
               sci.crypt       0.75      0.96      0.84       396
         sci.electronics       0.84      0.65      0.74       393
                 sci.med       0.92      0.79