# Building and Testing Semantic Search Engines

I love executing the final cell in a Jupyter Notebook and getting hit with a mic drop moment.

That's exactly what happened at the end of this notebook while building a couple of semantic search engines that used **embeddings** to grasp the context and nuance of language. One engine relied on pretrained embeddings (`word2vec-google-news-300`), while the other was fully customized, trained from scratch on my own dataset.

Intuitively, I thought I knew which one would win.

But the results told a different story.

Along the way, I dug into embeddings -- how they work, how they're built, and how they power all sorts of AI tools, like ChatGPT, by turning language into numbers. Below is a roadmap of what this notebook covers.

## The Roadmap

This notebook walks through:
- Loading and using pretrained embeddings
- A quick word on embeddings
- Preparing the corpus for semantic search
- Creating a TF-IDF search engine for baseline comparison
- Building a semantic search engine with pretrained embeddings
- Creating the document-features matrix
- Engineering the semantic search function
- Building a semantic search engine with custom-trained embeddings
- Creating (another) document-features matrix
- Re-engineering the semantic search function
- Comparing the search engines and crowning an unexpected winner
  
Let's begin by downloading some pretrained embeddings.

## Loading and using pretrained embeddings

In [2]:
import gensim.downloader
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [3]:
# Time to download the model, specifically: 'word2vec-google-news-300'
# Note: This might take a while
model = gensim.downloader.load('word2vec-google-news-300')
model

<gensim.models.keyedvectors.KeyedVectors at 0x13ee54440>

Note: As the list above shows, the available models were trained on sources such as Google News, Wikipedia and Twitter. The number at the end of each name is the vector size. So, if it says "300", then each word is represented by a vector of size 300.

## A quick word on embeddings

Before we build the search engines, let’s take a moment to see how embeddings work.

**Embeddings turn words into numbers**  -- dense numerical vectors that capture meaning. Words that appear in similar contexts tend to have similar vectors.

Let’s try a quick example using pretrained embeddings from Google News:

In [4]:
# Use the .get_vector() method to get the vector for a given word, specifically 'snowboard'
# Note: This vector is, of course, a size of 300 but just show the first 100 elements to keep the output clean
model.get_vector('snowboard')[:100]

array([-0.31835938, -0.34179688,  0.25585938, -0.01184082,  0.11083984,
        0.16210938,  0.17578125,  0.00418091, -0.1328125 , -0.25390625,
        0.17871094, -0.08740234,  0.06176758,  0.10351562, -0.03808594,
       -0.08251953,  0.0016098 ,  0.28515625, -0.10058594,  0.09521484,
       -0.05175781,  0.39648438, -0.22460938,  0.06298828, -0.15332031,
        0.15625   ,  0.09082031,  0.36914062,  0.5       , -0.15332031,
       -0.02087402,  0.10595703,  0.12988281,  0.33007812,  0.03369141,
       -0.40039062,  0.27734375, -0.09082031,  0.1171875 , -0.12890625,
       -0.00491333, -0.03955078,  0.15136719,  0.01037598, -0.10595703,
       -0.09814453,  0.08496094, -0.01220703,  0.32421875,  0.19433594,
       -0.07080078, -0.06542969,  0.0534668 ,  0.15234375, -0.00891113,
       -0.05786133,  0.08544922, -0.03125   ,  0.04272461, -0.23144531,
       -0.09814453, -0.14648438, -0.05078125, -0.08154297, -0.0703125 ,
       -0.51953125,  0.09179688, -0.0168457 , -0.12158203,  0.42

We can also compute how similar two words are using **cosine similarity**. 

In [35]:
# Check two words that should have high similarity
model.similarity('snowboard', 'ski')

0.72749674

In [36]:
# Now try for moderate similarity
model.similarity('snowboard', 'mountain') 

0.42081076

In [37]:
# And now low similarity
model.similarity('snowboard', 'podcast')

0.10670625
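For intuition, here is a minimal sketch of what a cosine-similarity score measures: the dot product of two vectors divided by the product of their lengths. The three-dimensional vectors below are made up for illustration (they are not real word2vec values), and the `cosine` helper is a hypothetical name, not part of gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (NOT real embeddings)
snowboard = [1.0, 2.0, 0.0]
ski = [2.0, 4.0, 0.0]       # points in the same direction as snowboard
podcast = [0.0, 0.0, 3.0]   # orthogonal to snowboard

print(cosine(snowboard, ski))      # same direction, so 1.0 up to rounding
print(cosine(snowboard, podcast))  # orthogonal, so 0.0
```

Because the score depends only on direction, a long vector and a short vector pointing the same way still count as a perfect match.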

## Preparing the corpus for semantic search

Now to build a search engine -- but instead of using TF-IDF or bag-of-words models, we'll use **embeddings**. This **semantic search engine** won't just match exact words; it will be capable of matching a word with a *different* word related to it. Think car vs. automobile. Or Uber and Lyft. Or, perhaps, something more complex than that, as we will see.

Let's start by loading our dataset and creating the corpus.

In [7]:
# Import pandas, load the dataset and create the corpus
import pandas as pd
fake_news = pd.read_csv('fake_news.csv',index_col=0)
fake_news = fake_news.sample(n=5000,random_state=0)

# Combine title and text columns to form the corpus
corpus = (fake_news.title + '. ' + fake_news.text).values
len(corpus)

5000

In [6]:
# Display the first 3 elements of corpus
corpus[:3]

array(["Ex-Interpol chief says ready to testify for Argentina's Fernandez. BUENOS AIRES (Reuters) - Argentina s previous government never asked Interpol to drop arrest warrants against a group of Iranians accused of bombing a Jewish center, the ex-head of the police agency said on Wednesday, as the government proceeded with treason charges against the former president. Former Interpol chief Ronald Noble said in an email on Wednesday that he wants to testify that the government of former President Cristina Fernandez did not ask to have the arrest warrants lifted as part of a  memorandum  she had with Iran. If a judge allows Noble to testify, the treason case filed this month against Fernandez and 11 other top officials could crumble. She denies wrongdoing and calls the charge politically motivated. The arrest warrants  were not affected in their validity by the approval of the memorandum,  Noble said in an email to a federal appeals court that was seen by Reuters. The Fernandez administ

We now have the foundation for our semantic search engine: a corpus of 5,000 news articles, with each document consisting of the article's title followed by its full text.

## Creating a TF-IDF search engine for baseline comparison

Before we jump into semantic search with embeddings, let’s build a baseline using a more traditional method: **TF-IDF**. TF-IDF, or Term Frequency–Inverse Document Frequency, scores words based on how *important* they are to a document -- relative to the rest of the corpus. 

**Note**: TF-IDF doesn't capture context or meaning like embeddings do, but it’s still a popular and helpful approach in many search and NLP tasks.
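To make the idea concrete, here is a small sketch of the textbook formula, term frequency times the log of inverse document frequency, on a made-up three-document corpus. This is a simplified variant for intuition only: the `tf_idf` helper and the toy documents are assumptions, and scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical three-document toy corpus, already tokenized
docs = [
    ["budget", "deficit", "city"],
    ["city", "mayor", "city"],
    ["snowboard", "city"],
]

def tf_idf(term, doc, docs):
    """Textbook TF-IDF: (term count / doc length) * log(N / document frequency).

    Assumes the term occurs in at least one document.
    """
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

print(tf_idf("city", docs[1], docs))     # 'city' is in every document, so idf = log(1) = 0
print(tf_idf("deficit", docs[0], docs))  # a rarer term gets a positive weight
```

The takeaway: a word that shows up everywhere carries little weight, while a word concentrated in a few documents carries a lot.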

We'll begin by using the `clean_text` function, which tokenizes the text, lowercases it, removes stop words and lemmatizes each word.

In [9]:
from gensim.utils import tokenize
import nltk
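nltk.download('punkt')
nltk.download('stopwords')
# Note: WordNetLemmatizer also relies on the 'wordnet' corpus; if it isn't installed yet,
# run nltk.download('wordnet') as well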
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def clean_text(text):
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    tokens = list(tokenize(text))
    #res = ' '.join([stemmer.stem(t.lower()) for t in tokens if t.lower() not in stop_words]) 
    res = ' '.join([lemmatizer.lemmatize(t.lower()) for t in tokens if t.lower() not in stop_words]) 
    if len(res) == 0:
        return ' '
    else:
        return res

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/karlbuscheck/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/karlbuscheck/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now we need to turn the corpus into a document-term matrix with TF-IDF values. This matrix will be the foundation for our TF-IDF-based search engine.

In [10]:
# So, import the TF-IDF Vectorizer, initialize it and then store the result in a sparse matrix
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(preprocessor=clean_text,ngram_range=(1,2))
res = tfidf.fit_transform(corpus)

In [11]:
# Check the results
res

<5000x778839 sparse matrix of type '<class 'numpy.float64'>'
	with 1926603 stored elements in Compressed Sparse Row format>

This is quite a big TF-IDF matrix. It has:

- **5,000 rows** (one per article)
- **778,839 columns** (one per unique token or bigram)
- Over **1.9 million non-zero values** stored in sparse format

That’s nearly **3.9 billion total cells** -- the vast majority of which are zero, highlighting the sparsity of text data.
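The arithmetic behind that claim is quick to check. This sketch only reuses the three numbers reported in the output above; nothing here touches the real matrix.

```python
rows, cols = 5_000, 778_839   # shape reported for the TF-IDF matrix
stored = 1_926_603            # non-zero entries reported above

total_cells = rows * cols
density = stored / total_cells

print(f"{total_cells:,} total cells")
print(f"{density:.4%} of cells are non-zero")
print(f"about {stored / rows:.0f} non-zero entries per article on average")
```

In other words, well under a tenth of a percent of the matrix holds any information, which is why a sparse format is the only practical way to store it.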

In [12]:
# Get the feature names
features = tfidf.get_feature_names_out()
features

array(['_cingraham', '_cingraham december', '_js', ..., 'zzsg pbf', 'zzz',
       'zzz ek'], dtype=object)

In [13]:
# Check the length of this: It will, of course, match the number of columns, each 
# representing the features or terms in our vocabulary
len(features)

778839

Now let’s define the `tfidf_search()` function, which powers the engine:

- Transforms the user's query into a TF-IDF vector
- Computes cosine similarity between the query and each document in the corpus
- Sorts and returns the top 10 most similar articles

In [18]:
# First, import cosine similarity from scikit-learn
from sklearn.metrics.pairwise import cosine_similarity

In [19]:
# This is a search function
# The user writes a query and the function will return the 10 most similar
# documents, or articles; The function transforms the query into a vector and then matches it
# against all 5,000 document vectors using cosine similarity 
def tfidf_search(query):
    qv = tfidf.transform([query])
    sim = cosine_similarity(res,qv).reshape(res.shape[0])
    indices = sim.argsort()
    for i in range(10):
        ind = indices[-i-1]
        print(f'======= DOCUMENT {ind}, SIMILARITY {sim[ind]}')
        print(corpus[ind] + '\n')
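The ranking step inside `tfidf_search` is worth isolating. Below is a sketch of a hypothetical helper, `top_k` (not part of the notebook), that returns the indices of the best matches instead of printing them. It uses the same `argsort` idea, just reversed once up front, and the scores are made up.

```python
import numpy as np

def top_k(sim, k=10):
    """Return the indices of the k highest scores, best first."""
    k = min(k, len(sim))              # guard for corpora smaller than k
    return np.argsort(sim)[::-1][:k]  # argsort is ascending, so reverse it

# Hypothetical similarity scores for a five-document corpus
sim = np.array([0.10, 0.72, 0.00, 0.42, 0.05])
print(top_k(sim, k=3))  # indices of the three best matches: 1, 3, 0
```

Returning indices rather than printing makes it easier to compare engines later, since the same ranking code can be reused for any similarity array.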

In [20]:
# Run the function with a sample query
tfidf_search('deficit of New York')

New York City budget boss to state: We're not a 'piggy bank'. NEW YORK (Reuters) - New York City’s fiscal watchdog said the city faces much bigger budget deficits in coming years than the mayor has forecast and warned state lawmakers about treating the city like a “piggy bank.” The state, like the city, is finalizing its budget and will soon make decisions affecting New York City. “Some upstate legislators just don’t get it,” New York City Comptroller Scott Stringer said during his review of the city budget on Wednesday. “They see New York City as their piggy bank.”  Stringer cautioned against last-minute, late-night budget decisions that could deprive the city of resources. His comments reflected growing frustration with state lawmakers over their suggestions that the city is flush with cash after a stronger economic recovery than other parts of the state. New York State Governor Andrew Cuomo has had a tense relationship with New York City Mayor Bill de Blasio, persuading his fellow D

This TF-IDF search engine will serve as the baseline -- the benchmark against which we compare the semantic engine -- or, more accurately, *engines*.

We will build two semantic search engines. One will be built with **pretrained embeddings**, downloaded from a popular package. And the second will be trained from scratch using **custom embeddings** built on our own dataset.

## Building a semantic search engine with pretrained embeddings

By using pretrained embeddings, this semantic search engine will be able to find synonyms and perform tip-of-the-tongue search -- allowing users to retrieve information even when they can't remember the exact words, just related concepts.

In [21]:
# Need to create a matrix with one row for each of the 5,000 documents
# but just 300 columns, one per embedding dimension -- this is a document-features matrix
# Start by grabbing the first document in 'corpus' and applying the clean_text function
clean_text(corpus[0])

'ex interpol chief say ready testify argentina fernandez buenos aire reuters argentina previous government never asked interpol drop arrest warrant group iranian accused bombing jewish center ex head police agency said wednesday government proceeded treason charge former president former interpol chief ronald noble said email wednesday want testify government former president cristina fernandez ask arrest warrant lifted part memorandum iran judge allows noble testify treason case filed month fernandez top official could crumble denies wrongdoing call charge politically motivated arrest warrant affected validity approval memorandum noble said email federal appeal court seen reuters fernandez administration always expressed belief warrant remain effect email said accusation fernandez government worked behind scene clear accused bomber amia jewish community center order improve trade argentina iran heart charge treason brought fernandez served president eight year succeeded mauricio macri

In [22]:
# Worth noting, each word in the above output has an embedding vector
# So, split that first document and then grab the first word
clean_text(corpus[0]).split()[0]

'ex'

In [23]:
# Now retrieve the embedding for that first word, 'ex'
# But first, import gensim and the Google News model again
import gensim.downloader
model = gensim.downloader.load('word2vec-google-news-300')

In [24]:
# And now use the .get_vector() method
model.get_vector(clean_text(corpus[0]).split()[0])

array([-0.02172852,  0.20703125, -0.04931641, -0.10205078,  0.01879883,
        0.32421875, -0.18945312, -0.05102539,  0.17480469,  0.08935547,
        0.09619141, -0.23339844, -0.06298828,  0.03063965, -0.06103516,
        0.25976562, -0.02099609, -0.11621094,  0.08886719,  0.03320312,
        0.09228516,  0.04614258,  0.14257812,  0.08105469,  0.04980469,
       -0.12988281, -0.45117188,  0.35742188, -0.08154297,  0.24023438,
        0.23535156,  0.03442383, -0.20507812,  0.10449219,  0.1484375 ,
       -0.00799561,  0.15429688, -0.06542969, -0.05078125,  0.06835938,
        0.02185059, -0.421875  ,  0.12890625, -0.15039062, -0.10351562,
       -0.21289062,  0.06591797, -0.1640625 , -0.01635742, -0.00222778,
        0.04370117, -0.02905273, -0.07714844,  0.18359375, -0.24707031,
       -0.00167084, -0.37695312,  0.07226562, -0.02087402,  0.19042969,
        0.04492188,  0.06689453, -0.16308594,  0.15332031, -0.19628906,
       -0.05786133,  0.08447266, -0.05932617,  0.09472656,  0.09

So how do we represent an entire document using word embeddings?

Each word in the document has a 300-dimensional vector. The answer is to take the **average of all its word vectors** -- a simple but effective technique for capturing the document’s overall meaning.

This logic will be built into a function called `doc_to_vec`.

In [44]:
# First need to work on a single document, let's grab the first document in corpus
# And then display just the first 100 characters to make sure it all worked
doc = corpus[0]
doc[:100]

"Ex-Interpol chief says ready to testify for Argentina's Fernandez. BUENOS AIRES (Reuters) - Argentin"

In [45]:
# Split it, run 'clean_text' and save it as 'words'
words = clean_text(doc).split()
words[:10]

['ex',
 'interpol',
 'chief',
 'say',
 'ready',
 'testify',
 'argentina',
 'fernandez',
 'buenos',
 'aire']

In [46]:
# Then create a vector of 0's; later, each word's vector gets added to it and the sum is divided by the number of words
# Have to import numpy and create an array sized with the model's .vector_size attribute
# And, finally, save it as 'vec'
import numpy as np
vec = np.zeros(model.vector_size)
vec

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

In [47]:
# Now write a for loop over each word in 'words': if the word is in the model, add its vector to 'vec'
# (dividing by the number of additions comes next)
for w in words:
    if w in model:
        vec += model.get_vector(w)

In [48]:
# See how that worked, but just display the first 10 elements in the array to keep the output clean
vec[:10]

array([ -0.54648495,   5.94399261,  15.24498749,  -1.65078735,
        -5.38089371,  -4.43889618,   9.56978607, -12.96095085,
        21.75265503,   7.47306824])

In [49]:
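# Now we have to divide by 'n', which isn't the length of 'words' because some words aren't in the model;
# instead, start n at 0 and add 1 for each word that is in the model
# So, now add this counter into the above for loop
# Caution: 'vec' is not reset to zeros here, so this second pass adds every vector again
# and the values below are double those from the previous cell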
n =0
for w in words:
    if w in model:
        n+=1
        vec += model.get_vector(w)

In [50]:
# Now check the output, or, at least, the first 10 elements
vec[:10]

array([ -1.09296989,  11.88798523,  30.48997498,  -3.30157471,
       -10.76178741,  -8.87779236,  19.13957214, -25.9219017 ,
        43.50531006,  14.94613647])

In [51]:
# Now to add the division, if n is greater than 1, to get the average embedding
if n > 1:
    vec /= n

In [52]:
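# Check the averages
# Note: because 'vec' was summed twice above, these values are twice the true averages;
# the doc_to_vec function below starts from a fresh zero vector and returns the correct ones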
vec[:10]

array([-0.00473147,  0.05146314,  0.13199123, -0.01429253, -0.04658782,
       -0.038432  ,  0.08285529, -0.11221602,  0.18833468,  0.06470189])

In [53]:
# Check n, or the number of words in the model
n

231

In [54]:
# And check that against the number of words in the first document
# As we see, this output shows us that 13 words didn't exist in the embedding model
len(words)

244

In [55]:
# And now to create the doc_to_vec function
# This takes a document, cleans it and splits the text, then creates a vector of 0's
# and runs the loop from above that finds the average embedding
import numpy as np
def doc_to_vec(doc):
    words = clean_text(doc).split()
    vec = np.zeros(model.vector_size)
    n =0
    for w in words:
        if w in model:
            n+=1
            vec += model.get_vector(w)
    if n > 1:
        vec /= n
    return vec
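For comparison, the same averaging can be written more compactly with `np.mean`. This is only a sketch: `toy_model` is a hypothetical dict of three-dimensional vectors standing in for the real 300-dimensional model, and `average_vector` is an illustrative name, not a function from the notebook or from gensim.

```python
import numpy as np

# Hypothetical stand-in for the embedding model: a dict of 3-dimensional vectors
toy_model = {
    "ski": np.array([1.0, 0.0, 2.0]),
    "snow": np.array([3.0, 2.0, 0.0]),
}

def average_vector(words, vectors, size=3):
    """Mean of the vectors for in-vocabulary words; zeros if none are found."""
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        return np.zeros(size)
    return np.mean(found, axis=0)

print(average_vector(["ski", "snow", "zzz"], toy_model))  # 'zzz' is skipped
print(average_vector(["zzz"], toy_model))                 # no known words, so all zeros
```

Like `doc_to_vec`, it skips out-of-vocabulary words and falls back to a zero vector when nothing matches, so an unknown query never triggers a division by zero.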

In [56]:
# Did it work??
# Yep! This is the average embedding of the first document
doc_to_vec(corpus[0])

array([-0.00236574,  0.02573157,  0.06599562, -0.00714627, -0.02329391,
       -0.019216  ,  0.04142765, -0.05610801,  0.09416734,  0.03235094,
        0.00094525, -0.10188774, -0.05663897,  0.05454634, -0.11048407,
        0.12240396,  0.05195611,  0.0305763 , -0.00822418, -0.04395149,
        0.02717957, -0.00876832,  0.06016745,  0.00144615,  0.0555344 ,
       -0.07885468, -0.02828024,  0.01021679, -0.01010204,  0.00117896,
        0.02186994, -0.05381884, -0.04472397,  0.02414305, -0.00488717,
        0.00897075,  0.05367006,  0.02855303,  0.00290052,  0.04421214,
        0.01523024, -0.06157423,  0.08854659, -0.02685681, -0.03955939,
       -0.07202905, -0.02628904, -0.03108334, -0.06978426,  0.07961201,
        0.01632562, -0.01460577, -0.00936275,  0.0083045 , -0.03270064,
        0.02789584, -0.09299268, -0.04247826, -0.00665045, -0.06149351,
       -0.02748085,  0.0864182 , -0.05477596,  0.01050879, -0.04192663,
       -0.01995803, -0.00390559,  0.02105683, -0.01258343,  0.02

Now, apply the `doc_to_vec` function to corpus to make the document-features matrix!

### Creating the document-features matrix

Now that we have the `doc_to_vec` function, we want a matrix with one row per document and 300 columns, one per embedding feature.

In [57]:
# Start by building the matrix from scratch, or creating a matrix of 0's
# Note: This can be dense because it's only 300 features
doc_embeddings_matrix = np.zeros((len(corpus), model.vector_size))
doc_embeddings_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [62]:
# And confirm the shape, which will be one row per document and one column for each embedding feature
doc_embeddings_matrix.shape

(5000, 300)

In [59]:
# Now to populate the matrix of 0's row-by-row
for i in range(len(corpus)):
    doc_embeddings_matrix[i,:] = doc_to_vec(corpus[i])

In [60]:
# See what 'doc_embeddings_matrix' looks like now:
# The output shows the embedding of the first document, then the second and so on
doc_embeddings_matrix

array([[-0.00236574,  0.02573157,  0.06599562, ..., -0.07791119,
         0.02072513,  0.03539616],
       [-0.03122442,  0.09486838,  0.06432471, ..., -0.03230939,
         0.03290488,  0.01085769],
       [ 0.02205895,  0.02822519,  0.0260407 , ..., -0.07960629,
         0.01159129,  0.03910882],
       ...,
       [ 0.04574295,  0.03951742,  0.07762926, ..., -0.05901357,
         0.02745964,  0.01396595],
       [ 0.03577278,  0.05002436,  0.06524986, ..., -0.04451846,
         0.04804829,  0.03157182],
       [-0.00676485,  0.03968262,  0.03376636, ...,  0.02053665,
         0.05231315, -0.03880857]])

In [61]:
# To that above point, the length of the embeddings for the first document
# is, of course, 300
len(doc_embeddings_matrix[0])

300

**Worth pointing out**: Those numbers in the above **document-features matrix** are the average embeddings of the words that compose the documents. It's like a compression algorithm -- a semantic compression -- in which each document, however many words it contains, is compressed into an array of 300 numbers. This compression is semantically charged: two similar documents will have similar vector representations -- just like how two similar words are close to each other.
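A back-of-the-envelope sketch shows how much smaller this representation is than the TF-IDF one. It reuses only the shapes and counts reported earlier, and assumes the matrix holds 8-byte floats, which is what `np.zeros` creates by default.

```python
docs, dims = 5_000, 300
tfidf_cols = 778_839          # TF-IDF vocabulary size reported earlier
tfidf_stored = 1_926_603      # non-zero TF-IDF entries reported earlier

dense_values = docs * dims                 # every cell is stored
dense_megabytes = dense_values * 8 / 1e6   # float64 = 8 bytes per value

print(f"{dense_values:,} values, about {dense_megabytes:.0f} MB as float64")
print(f"{tfidf_cols / dims:.0f}x fewer columns than the TF-IDF matrix")
print(dense_values < tfidf_stored)  # fewer numbers than TF-IDF's non-zeros alone
```

So even though every cell is filled in, the dense embedding matrix holds fewer numbers than the sparse TF-IDF matrix stores.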

## Engineering the semantic search function

This `semantic_search` function is ultimately very similar to the `tfidf_search` function we built above. So, let's start revamping it.

In [63]:
# Here's that tfidf_search function, once again, for review
def tfidf_search(query):
    qv = tfidf.transform([query])
    sim = cosine_similarity(res,qv).reshape(res.shape[0])
    indices = sim.argsort()
    for i in range(10):
        ind = indices[-i-1]
        print(f'======= DOCUMENT {ind}, SIMILARITY {sim[ind]}')
        print(corpus[ind] + '\n')

In [64]:
# Now to rename it semantic_search and make some changes
def semantic_search(query):
    # 1. Turn the query from a string to embeddings, a vector of 300 with the doc_to_vec() method
    qv = doc_to_vec(query)
    # 2. Need to find the most similar documents, or those whose embeddings are most similar to those of the query
    sim = cosine_similarity(doc_embeddings_matrix,qv).reshape(doc_embeddings_matrix.shape[0])
    indices = sim.argsort()
    for i in range(10):
        ind = indices[-i-1]
        print(f'======= DOCUMENT {ind}, SIMILARITY {sim[ind]}')
        print(corpus[ind] + '\n')

In [65]:
# Time to see if it works
# Commenting because this does NOT work, getting a ValueError
# semantic_search('what does the president eat for lunch?')

In [66]:
# Now to account for the ValueError: scikit-learn's cosine_similarity expects 2D arrays,
# but 'qv' is a 1D vector. So, need to reshape 'qv' into a single row
def semantic_search(query):
    # 1. Turn the query from a string to embeddings, a vector of 300 with the doc_to_vec() method and reshape 'qv'
    qv = doc_to_vec(query).reshape(1,-1)
    # 2. Need to find the most similar documents, or those whose embeddings are most similar to those of the query
    sim = cosine_similarity(doc_embeddings_matrix,qv).reshape(doc_embeddings_matrix.shape[0])
    indices = sim.argsort()
    for i in range(10):
        ind = indices[-i-1]
        print(f'======= DOCUMENT {ind}, SIMILARITY {sim[ind]}')
        print(corpus[ind] + '\n')
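To see what the `reshape(1, -1)` fix actually does, here is a small NumPy-only sketch with a made-up two-document matrix. The cosine scores are computed by hand, so the example does not depend on scikit-learn; the array values are hypothetical.

```python
import numpy as np

docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 2.0, 0.0]])   # hypothetical 2-document, 3-feature matrix
qv = np.array([3.0, 0.0, 0.0])       # a single query vector, 1-D

print(qv.shape)                  # (3,)
qv_2d = qv.reshape(1, -1)        # one row, as many columns as needed
print(qv_2d.shape)               # (1, 3)

# Manual cosine similarity of every document row against the query row
scores = (docs @ qv_2d.T).ravel() / (
    np.linalg.norm(docs, axis=1) * np.linalg.norm(qv_2d)
)
print(scores)                    # first document matches, second does not
```

The 1-D vector has no notion of rows, which is why a function expecting a samples-by-features matrix rejects it; the reshaped version is explicitly one sample with three features.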

In [67]:
# See how it works now
# Notice: The word 'lunch' doesn't appear right away but related words do
semantic_search('what does the president eat for lunch?')

'Can I get it to go?' Canada's Trudeau charms  Manila while ordering fried chicken. MANILA (Reuters) - Canadian Prime Minister Justin Trudeau hopped from one table to the next, chatted with people and posed for selfies on Sunday at a fastfood chain store in Manila, charming residents of the Philippines capital for the second time in two years. Trudeau, in Manila for a summit of regional leaders, dropped in at an outlet of fastfood giant Jollibee Foods Corp  after a visit to a nearby women s clinic that advocates family planning, a touchy subject in the Catholic-majority Philippines. He greeted nearly everyone in the store, shaking hands and exchanging hugs with fans after ordering fried chicken and a strawberry float.  Can I get it to go? I ll eat it in the car,  Trudeau said, before going behind the counter for a photograph with Jollibee staff. Earlier, when he landed at Clark airport, a smiling Trudeau waded into a crowd of children gathered to greet dignitaries arriving for the summ

**To recap**: The semantic search engine takes the query string, turns it into a 300-dimensional vector, and then finds the ten most similar document vectors in the corpus.

## Building a semantic search engine with custom-trained embeddings

What if we want to train embeddings on our own dataset?

And what exactly does that mean? It means creating a new embedding model -- similar to the one we previously downloaded from Gensim -- but this time, we’ll train it ourselves on our own data. The resulting model will be tailored specifically to our dataset, not the Google News corpus.

In [77]:
# Start with import of the Word2Vec class
from gensim.models import Word2Vec

In [78]:
# Need to create a list of lists for Word2Vec
lol_corpus = [doc.split() for doc in corpus]

In [79]:
# Check the first list, or just the first 10 elements to keep the output clean
lol_corpus[0][:10]

['Ex-Interpol',
 'chief',
 'says',
 'ready',
 'to',
 'testify',
 'for',
 "Argentina's",
 'Fernandez.',
 'BUENOS']

In [80]:
# Need to clean the list with the clean_text function
lol_corpus = [clean_text(doc).split() for doc in corpus]

In [81]:
# Check the results
lol_corpus[0][:10]

['ex',
 'interpol',
 'chief',
 'say',
 'ready',
 'testify',
 'argentina',
 'fernandez',
 'buenos',
 'aire']

In [84]:
# Train a Word2Vec model on our tokenized corpus
# Set vector_size to 300, so that each word will be represented as a 300-dimension vector
# Set window=3 to define the context window, or how many words before and after to consider
# Use workers=4 to train with four worker threads, which speeds things up on a multicore machine
m = Word2Vec(lol_corpus, vector_size=300, window=3, workers=4)
m

<gensim.models.word2vec.Word2Vec at 0x501eebdd0>

In [85]:
# To get the trained word vectors, access the .wv attribute of the object m
mymodel = m.wv

In [86]:
# Display mymodel
mymodel

<gensim.models.keyedvectors.KeyedVectors at 0x501b48f20>

In [87]:
# Note: The above model or 'mymodel' has all the methods we've seen above with the pretrained models
# So, for instance, can get the vector for a specific word
# Only display the first 10 elements to keep the output clean
mymodel.get_vector('president')[:10]

array([ 0.05456392, -0.80775434,  0.66254705, -0.23936757,  0.36111695,
        0.5756765 ,  0.47005334,  0.35943198,  0.25090083, -1.1076299 ],
      dtype=float32)
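As an aside on the `window=3` setting used in training: the sketch below shows which neighbors count as "context" for each word under a fixed symmetric window. It is a simplification for intuition only. `context_pairs` is a hypothetical helper, and gensim's actual training loop differs in its details.

```python
def context_pairs(tokens, window=3):
    """Yield (center, context) pairs for a fixed symmetric window."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

# The first few cleaned tokens of the first document
tokens = ["ex", "interpol", "chief", "say", "ready", "testify"]
pairs = list(context_pairs(tokens, window=3))
print([ctx for center, ctx in pairs if center == "chief"])
```

Each word is trained to be predictive of the words in its window, which is how words used in similar contexts end up with similar vectors.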

Now that we have a model, we need to build a document-features matrix and run a search engine on top of it. We'll start by reusing the structure from earlier -- just updated to use our new Word2Vec model.

First up: define a fresh version of the `doc_to_vec` function -- we’ll call it `doc_to_vec2` -- that computes the average vector for each document.

In [89]:
# Here's the doc_to_vec2 function
# Note: It's now 'mymodel'
import numpy as np
def doc_to_vec2(doc):
    words = clean_text(doc).split()
    vec = np.zeros(mymodel.vector_size)
    n=0
    for w in words:
        if w in mymodel:
            n+=1
            vec += mymodel.get_vector(w)
    if n > 1:
        vec /= n
    return vec

### Creating (another) document-features matrix

Just like before, we want a matrix with one row per document and 300 columns -- one for each embedding feature. But this time, we're using vectors from our custom-trained Word2Vec model instead of the pretrained one.

In [90]:
# This will be called doc_embeddings_matrix2
# Note: Once again, it's updated to 'mymodel'
doc_embeddings_matrix2 = np.zeros((len(corpus),mymodel.vector_size))
doc_embeddings_matrix2

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [91]:
# Check the shape
doc_embeddings_matrix2.shape

(5000, 300)

In [94]:
# Then, for each document in the corpus, set the i-th line of the matrix equal to doc_to_vec2 for corpus document i
for i in range(len(corpus)):
    doc_embeddings_matrix2[i,:] = doc_to_vec2(corpus[i])

In [93]:
# Check the results
doc_embeddings_matrix2

array([[ 0.06729469,  0.16997326, -0.09481063, ..., -0.13429278,
         0.02676992,  0.05866554],
       [-0.08530327,  0.43262977, -0.18977185, ...,  0.00408701,
        -0.10635933, -0.0875218 ],
       [ 0.10422009,  0.24829183, -0.03545439, ..., -0.03133481,
         0.07109447,  0.15943093],
       ...,
       [ 0.10370929,  0.18866114,  0.06195659, ..., -0.0369685 ,
         0.06524089, -0.04617735],
       [ 0.14721203,  0.13671954, -0.07118182, ..., -0.06210374,
        -0.06724542,  0.0848758 ],
       [ 0.06078459,  0.14632636, -0.13052803, ..., -0.06495154,
         0.0097929 ,  0.01151082]])

Matrix built -- same shape, new engine.

## Re-engineering the semantic search function

Now to modify the `semantic_search` function and rename it `semantic_search2`!

In [96]:
# This is built on top of the original, but, of course, it has been renamed and updated
# with 'doc_to_vec2' and 'doc_embeddings_matrix2'
def semantic_search2(query):
    # 1. turn the query from a string to embeddings
    qv = doc_to_vec2(query).reshape(1, -1)
    # 2. find the documents whose embeddings are most similar to those of the query
    sim = cosine_similarity(doc_embeddings_matrix2,qv).reshape(doc_embeddings_matrix2.shape[0])
    indices = sim.argsort()
    for i in range(10):
        ind = indices[-i-1]
        print(f'======= DOCUMENT {ind}, SIMILARITY {sim[ind]}')
        print(corpus[ind] + '\n')

In [97]:
# Time to test it out
semantic_search2('what does the president eat for lunch?')

LIVE FEED: PRESIDENT TRUMP Speaks At CPAC – 10:00 a.m. EST.  

REMEMBER WHEN WE HAD A COMMANDER IN CHIEF WHO REALLY LOVED AND RESPECTED OUR MILITARY?. Our military men and women never had to wonder if President George W. Bush cared about them Former Press Secretary for George W. Bush, Dana Perino, has a new book out about her tenure during the Bush Administration entitled,  And the Good News Is : Lessons and Advice from the Bright Side. One of the stories from the book is certainly raising some eyebrows about the former president. She describes a visit to Walter Reed military hospital by the then president during 2005. One of the men the president was visiting was a Marine who was in intensive care. What s his prognosis?  the president asked. Well, we don t know sir, because he s not opened his eyes since he arrived, so we haven t been able to communicate with him. But no matter what, Mr. President, he has a long road ahead of him,  said the CNO.The president and his aides then proceed

Uh-oh! This search engine doesn't seem to be working very well. Time to figure out why.

## Comparing the search engines and crowning an unexpected winner

Now to compare the three search engines. And, perhaps, crown an unlikely winner:
1. The TF-IDF search engine
2. The semantic search engine with pretrained embeddings
3. The semantic search engine with custom-trained embeddings

In [102]:
# Start with the same query and run it through each of the three search engines
query = 'what does the president eat for lunch?'

### The TF-IDF search engine

We start with a simple bag-of-words baseline. It scans the corpus for documents that match individual terms from the query -- no understanding of meaning, just raw frequency.

In [103]:
# The TF-IDF search engine is just looking for exact words
tfidf_search(query)

 ‘Lunch Shaming’: Schools Punish Poor Kids Who Can’t Pay For Lunch With Appalling Humiliation (VIDEO). As we all know, there are many children in America who cannot afford school lunch. Some of these kids are  lucky  (and I use the term loosely) to qualify for free or reduced price school lunches. Others are not. To qualify for these government-sanctioned programs, the children have to be from a family whose income is no more than 30% above the poverty line. That means that a family of four can make no more than $32,000 a year to qualify. Any adult living in America today understands that this is a very difficult   in fact, damn near impossible   amount of money to live on. So, this leads to many children around the country going through the lunch line at school with the inability to pay. However, the tactics used to solve this issue are downright appalling, and they are being deployed all over the nation.A practice referred to as  lunch shaming  is being used to humiliate poor childre

#### The results?

Not so great. Mostly we are just seeing stories that contain words like "lunch" and "president."
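
To see why term matching behaves this way, here is a small sketch with scikit-learn's `TfidfVectorizer`. The two-sentence corpus is made up for illustration (it is not the notebook's dataset), and this is not necessarily how `tfidf_search` was built earlier. A document that shares no non-stop-word terms with the query gets a similarity of exactly zero, no matter how related its meaning is.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, not the news dataset used in this notebook
toy_corpus = [
    "the president gave a speech about lunch programs",
    "the prime minister ordered fried chicken to go",
]
toy_query = "what does the president eat for lunch"

# Drop English stop words so shared words like "the" don't count as matches
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(toy_corpus)
query_vec = vectorizer.transform([toy_query])

# One similarity score per document
scores = cosine_similarity(doc_matrix, query_vec).ravel()
print(scores)
```

The first document scores above zero because it contains "president" and "lunch". The second, which is actually about a leader getting food, scores zero because none of its remaining words appear in the query.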

### The semantic search engine with pretrained embeddings

Here we swap in `semantic_search` -- the engine powered by pretrained embeddings learned from a massive Google News corpus. It looks at *meaning*, not just word overlap.

In [104]:
# Try the semantic_search function which is built with the pretrained embeddings
semantic_search(query)

'Can I get it to go?' Canada's Trudeau charms  Manila while ordering fried chicken. MANILA (Reuters) - Canadian Prime Minister Justin Trudeau hopped from one table to the next, chatted with people and posed for selfies on Sunday at a fastfood chain store in Manila, charming residents of the Philippines capital for the second time in two years. Trudeau, in Manila for a summit of regional leaders, dropped in at an outlet of fastfood giant Jollibee Foods Corp  after a visit to a nearby women s clinic that advocates family planning, a touchy subject in the Catholic-majority Philippines. He greeted nearly everyone in the store, shaking hands and exchanging hugs with fans after ordering fried chicken and a strawberry float.  Can I get it to go? I ll eat it in the car,  Trudeau said, before going behind the counter for a photograph with Jollibee staff. Earlier, when he landed at Clark airport, a smiling Trudeau waded into a crowd of children gathered to greet dignitaries arriving for the summ

#### The results?

This one works surprisingly well! The engine retrieved results where exact words like "lunch" and "president" didn't even show up, but the meaning (hello, cosine similarity) was still there. That’s the magic of pretrained embeddings: they understand semantic similarity between terms and return results based on meaning, not just matching letters.

This is how cutting-edge search engines work.
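
As a rough illustration of the cosine similarity mentioned above, here is a toy example with hand-made 3-dimensional vectors. The numbers are invented for the sketch. Real `word2vec-google-news-300` vectors have 300 dimensions and are learned from data, but the comparison works the same way.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: the dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors: pretend the first axis means "food"
# and the last axis means "politics"
lunch = np.array([0.9, 0.1, 0.0])
fried_chicken = np.array([0.8, 0.2, 0.1])
election = np.array([0.0, 0.2, 0.9])

print(cosine(lunch, fried_chicken))  # close to 1: pointing the same way
print(cosine(lunch, election))       # close to 0: nearly unrelated
```

"lunch" and "fried chicken" share no letters in common as strings, yet their vectors point in almost the same direction, which is what lets the engine surface the Trudeau story.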

### The semantic search engine with custom-trained embeddings

Now we're using `semantic_search2` -- our own homegrown engine trained from scratch.

In [108]:
# Try the semantic search engine with embeddings trained on our dataset
# It should be great, but...
semantic_search2(query)

LIVE FEED: PRESIDENT TRUMP Speaks At CPAC – 10:00 a.m. EST.  

REMEMBER WHEN WE HAD A COMMANDER IN CHIEF WHO REALLY LOVED AND RESPECTED OUR MILITARY?. Our military men and women never had to wonder if President George W. Bush cared about them Former Press Secretary for George W. Bush, Dana Perino, has a new book out about her tenure during the Bush Administration entitled,  And the Good News Is : Lessons and Advice from the Bright Side. One of the stories from the book is certainly raising some eyebrows about the former president. She describes a visit to Walter Reed military hospital by the then president during 2005. One of the men the president was visiting was a Marine who was in intensive care. What s his prognosis?  the president asked. Well, we don t know sir, because he s not opened his eyes since he arrived, so we haven t been able to communicate with him. But no matter what, Mr. President, he has a long road ahead of him,  said the CNO.The president and his aides then proceed

#### The results?

Not good.

**So, why did the semantic search engine with custom-trained embeddings struggle so much?**

Embeddings are **data hungry**. Even though I trained the custom embeddings on over 5,000 news articles, that wasn't nearly enough data for the model to learn how words behave in context. The pretrained model, by contrast, had already been trained on a massive trove of data -- nearly 100 billion words from a Google News dataset.
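
One quick way to sanity-check this kind of problem is to count how often each query word appears in the training text. The sketch below uses a made-up mini corpus and a simple whitespace tokenizer. The helper names and the `min_count=5` threshold are assumptions for illustration, not values taken from this notebook.

```python
from collections import Counter

def token_counts(docs):
    """Count lowercase, whitespace-separated tokens across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

def rare_query_words(query, counts, min_count=5):
    """Return the query words seen fewer than min_count times in training."""
    return [w for w in query.lower().split() if counts[w] < min_count]

# Hypothetical mini corpus standing in for the training articles
toy_docs = ["the president spoke today", "the president met the press"] * 3
counts = token_counts(toy_docs)

rare = rare_query_words("the president eats lunch", counts)
print(rare)
```

Words that show up only a handful of times (or never) give the model very little context to learn from, so their vectors, if they exist at all, are unlikely to capture much meaning.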

It was a good reminder: Efficiency beats power -- every time. Build (or download) the tool that solves your problem, not the shiny toy.