# A Simple Search Engine
The goal of this notebook is to explore ways of building a simple search engine for books.  
The general approach is as follows.  
  - Create word vectors of each of the sentences in the book.
  - Use something like the package `annoy` to create an index of the sentence vectors in the book (we will use sklearn for now).
  - Search terms will be assigned a vector and can be compared against the index.

In [1]:
import spacy
import re
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import cosine

For testing, we use a pretrained model from `spaCy`. This could be extended by updating the model with text from the book we want to search, which would help refine our searches and return better results.

In [2]:
# Load the spaCy model object; this is what we interact with for
# text processing.
nlp = spacy.load("en_core_web_md")

Here we read the book into memory.

In [3]:
with open(
    "resources/the-hound-of-baskervilles.txt", mode="r", encoding="utf8"
) as file:
    book = file.read()

We will trim the book, dropping the title page and table of contents.

In [4]:
chapter1_idx = book.index("CHAPTER I")
book_trim = book[chapter1_idx:]
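One caveat with the cell above: `str.index` raises a `ValueError` when the marker is missing, and it always matches the first occurrence, which could be a table-of-contents entry in other editions. A minimal sketch of a more forgiving variant follows. The helper name `trim_front_matter` and the sample text are made up for illustration.

```python
def trim_front_matter(text: str, marker: str = "CHAPTER I") -> str:
    """Return text starting at the first occurrence of marker.

    If the marker is absent, the text is returned unchanged.
    """
    # str.find returns -1 when the marker is missing instead of raising.
    start = text.find(marker)
    return text if start == -1 else text[start:]


# Hypothetical sample text, not the real book.
sample = "Title page\nContents\nCHAPTER I\nMr. Sherlock Holmes"
print(trim_front_matter(sample))
```

If the table of contents also contains the marker, searching from a later offset (for example with the `start` argument of `str.find`) would be one way to skip it.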

In [5]:
print(book_trim[0:152], "...")

CHAPTER I
    Mr. Sherlock Holmes


Mr. Sherlock Holmes, who was usually very late in the mornings, save
upon those not infrequent occasions when he was ...


Here we process the book using the spaCy model we loaded earlier. There are more efficient ways to do this, but for now this is good enough.

In [6]:
full_doc = nlp(book_trim)
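One possible speed-up is to split the book into paragraphs first and process them as a stream rather than as one large string. The sketch below covers only the splitting step. The helper name `split_paragraphs` and the sample text are assumptions for illustration.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split text on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


# Hypothetical sample shaped like the start of the trimmed book.
sample = (
    "CHAPTER I\n    Mr. Sherlock Holmes\n\n\n"
    "Mr. Sherlock Holmes, who was usually very late.\n\n"
    "He was seated at the table."
)
chunks = split_paragraphs(sample)
print(len(chunks))
```

The resulting chunks could then be passed to `nlp.pipe(chunks)`, which processes texts in batches and yields one document per chunk in order. Whether this is actually faster for a single book would need to be measured.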

The data structure returned by the spaCy model can be queried to get the data we need. For instance, we can loop over all of the sentences in the book and save them in a list for further processing.

In [7]:
bks = [s for s in full_doc.sents]

In [8]:
bks[0].__class__

spacy.tokens.span.Span

As we can see, each returned "sentence" is a spaCy `Span`, a slice of the full document that we can continue to work with. Next we will use these spans to get a vector for each sentence. By default, spaCy computes a span's vector as the average of its token vectors. This is the numerical representation of the sentence's meaning, and it is what we will be searching against.

In [9]:
# An example of some output from the sentence vector of the first sentence.
bks[0].vector[0:10]

array([-0.02991001,  0.41294003,  0.05577499, -0.624195  ,  0.17363301,
        0.0434285 ,  0.38105   , -0.07594499,  0.279755  ,  1.78855   ],
      dtype=float32)
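To make the idea of a sentence vector concrete: spaCy's default for a span is the average of its token vectors. The sketch below reproduces that averaging with NumPy on made-up 4-dimensional vectors. The numbers are purely illustrative and do not come from the model.

```python
import numpy as np

# Hypothetical 4-dimensional token vectors for a three-token sentence.
token_vectors = np.array(
    [
        [1.0, 0.0, 2.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [2.0, 2.0, 1.0, 0.0],
    ],
    dtype=np.float32,
)

# Average over the token axis to get one vector for the whole sentence.
sentence_vector = token_vectors.mean(axis=0)
print(sentence_vector)
```

Because the result is an average, word order is ignored, and long sentences tend to blur toward a generic vector.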

In [10]:
# Create matrix of sentence vectors and sentence text
# svc = normalize(np.array([s.vector for s in bks]), axis=1, norm="l1")
svc = np.array([s.vector for s in bks])
stx = np.array([s.text for s in bks])
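The commented-out line above hints at normalizing the sentence vectors. One option, sketched here as an assumption rather than as what the notebook does, is L2 normalization: for unit-length vectors, squared Euclidean distance equals `2 - 2 * cosine_similarity`, so a Euclidean nearest-neighbor search ranks results the same way cosine similarity would. The helper name `l2_normalize` is made up for illustration.

```python
import numpy as np


def l2_normalize(rows: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    # Sentences with no known words can produce zero vectors; avoid 0/0.
    norms[norms == 0] = 1.0
    return rows / norms


# Hypothetical toy matrix standing in for the sentence vectors.
toy = np.array([[3.0, 4.0], [0.0, 0.0], [0.5, 0.0]])
unit = l2_normalize(toy)
print(np.linalg.norm(unit, axis=1))
```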

We now fit a nearest neighbors model to our sentence vectors. This lets us efficiently search the sentences using the vector of our search phrase.

In [11]:
nnb = NearestNeighbors(leaf_size=3, n_neighbors=3)
nnb.fit(svc)

NearestNeighbors(leaf_size=3, n_neighbors=3)
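With its default settings, `NearestNeighbors` measures Euclidean distance. To show what a query computes conceptually, here is a brute-force sketch in NumPy. It is a stand-in for illustration, not how scikit-learn implements the search, and the name `brute_force_neighbors` and the toy data are assumptions.

```python
import numpy as np


def brute_force_neighbors(index: np.ndarray, query: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the row indices of the k vectors closest to query."""
    # Euclidean distance from the query to every indexed vector.
    distances = np.linalg.norm(index - query, axis=1)
    # Sort ascending and keep the k smallest.
    return np.argsort(distances)[:k]


# Hypothetical 2-dimensional "sentence vectors".
toy_index = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]])
print(brute_force_neighbors(toy_index, np.array([0.9, 0.9]), k=2))
```

This scans every vector on each query. Tree-based or approximate indexes such as `annoy` exist to avoid that full scan on larger collections.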

Here is an example of a search. We take a phrase, get the vector for that phrase, and then find the indices of the closest sentences. These are the results the user sees. Because we know the position of each sentence in the book, we can also return the user to that location to read from.

In [12]:
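# Embed the query and look up the indices of the nearest sentence vectors.
answer = nnb.kneighbors([nlp("the canine").vector], return_distance=False)
for i in answer.flatten():
    # Show the matching sentence followed by the next two for context.
    t = stx[i : (i + 3)]
    print(" ".join(t), end=f"\n{'_'*30}\n")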

And the dog?" 

"Has been in the habit of carrying this stick behind his master. 
Being a heavy stick the dog has held it tightly by the middle, and
the marks of his teeth are very plainly visible.
______________________________
The giant hound was dead. 

 Sir Henry lay insensible where he had fallen.
______________________________
A sheep-dog of the moor? Or a spectral hound, black, silent, and
monstrous? Was there a human agency in the matter?
______________________________


As you can see, we searched for "the canine" and got back sentences with a similar meaning. None of these sentences contain the word "canine", but because the meaning of "the canine" is close to "dog", the search returns results that mention dogs and hounds.
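The search loop above prints each hit followed by the next two sentences. If context on both sides of the hit is preferred, a window centered on the hit is one alternative. The sketch below clamps the window at the start and end of the book. The helper name `context_window` and its parameters are hypothetical.

```python
def context_window(sentences, hit, before=1, after=1):
    """Return the hit sentence with up to `before` and `after` neighbors."""
    start = max(hit - before, 0)
    stop = min(hit + after + 1, len(sentences))
    return list(sentences[start:stop])


# Hypothetical stand-in for the array of sentence texts.
toy = ["s0", "s1", "s2", "s3"]
print(context_window(toy, 0))
print(context_window(toy, 2))
print(context_window(toy, 3))
```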

### A Note On Word Vectors
Word vectors (also known as word embeddings, or sentence embeddings, depending on the context) provide a way for us to represent words in vector space. One of the most common models for creating them is Word2Vec. The closer two words are in vector space, the more similar their meanings are.

In [13]:
def word_cosine_similarity(w1, w2, model):
    return 1 - cosine(model.vocab[w1].vector, model.vocab[w2].vector)
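SciPy's `cosine` returns a cosine *distance*, which is why the function above subtracts it from 1 to get a similarity. For reference, the same similarity can be written directly from its definition, the dot product divided by the product of the vector lengths. This is a minimal sketch on toy vectors, and the function name is chosen for illustration.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between a and b: 1 same direction, 0 orthogonal, -1 opposite."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine_similarity([1, 0], [2, 0]))   # same direction
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction
```

Note that the result is undefined when either vector is all zeros, which can happen for words the model has no vector for.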

Other text similarity measures, such as edit distance or Soundex, check whether words have a similar spelling or sound. Word vectors capture semantic similarity instead.

In [14]:
word_cosine_similarity("farmer", "framer", nlp)

0.06580065190792084

In [15]:
word_cosine_similarity("farmer", "agriculture", nlp)

0.5305896997451782

Here we see that the words "farmer" and "framer", though spelled almost the same, are not very similar in meaning. However, the words "farmer" and "agriculture" are much more similar.
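For contrast, here is how a spelling-based measure sees the same word pairs. The sketch below is a plain dynamic-programming implementation of Levenshtein edit distance, written for illustration and not taken from any library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,               # delete ca
                    current[j - 1] + 1,            # insert cb
                    previous[j - 1] + (ca != cb),  # substitute, or match at no cost
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("farmer", "framer"))       # 2
print(levenshtein("farmer", "agriculture"))
```

By edit distance, "farmer" and "framer" are only two edits apart, while "farmer" and "agriculture" are far apart. This is the opposite of the ordering the word vectors give.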