# How I Built a Lightweight Search Engine Using Classic NLP

In this notebook, I set out to build my own simple search engine based on fundamental NLP techniques. No fancy models, just classic tools like TF-IDF and cosine similarity.

Along the way, I asked a fun question: **"Who's the most famous podcaster in the world?"**

The results? Wildly wrong -- and yet surprisingly revealing. This little failure ended up turning into one of the project's biggest insights -- a clear illustration of the strengths and weaknesses of a bag-of-words style search. 

In short: It worked. And it didn't. And that's exactly what makes it interesting.

## Table of Contents
- Import pandas and load the dataset
- Create the Corpus
- Make a Bag-of-words (BoW) Model using CountVectorizer
- Reduce features through lemmatization
- TF-IDF Vectorization
- Deploying the search engine: Finding the most relevant articles using Cosine Similarity
- 'WORLD'S MOST FAMOUS VICTIMS...' -- The model worked, technically
- Making the Top 10: Sorting Results Like a Google Search
- Build the SEARCH function

## Import pandas and load the dataset

In [23]:
import pandas as pd

In [25]:
# Load the 'fake_news' dataset
fake_news = pd.read_csv('fake_news.csv', index_col=0)

In [27]:
# Check to make sure the dataset loaded properly
fake_news.head()

Unnamed: 0,title,text,subject,date,fake
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",1
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",1
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",1
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",1


In [29]:
# Check to see the shape of this DataFrame
fake_news.shape

(44898, 5)

As we see above, this is a sizable dataset -- nearly 45K articles to dig through once this search engine is built. To keep development and computation time manageable, let's work with a random 1,000-row sample. That is still enough data to demonstrate how the search engine works.

In [33]:
# Create the random sample of 1,000 articles
fake_news_1000 = fake_news.sample(n=1000, random_state=0)
fake_news_1000

Unnamed: 0,title,text,subject,date,fake
35305,Ex-Interpol chief says ready to testify for Ar...,BUENOS AIRES (Reuters) - Argentina s previous ...,worldnews,"December 20, 2017",0
29180,U.S. warns North Korea of 'overwhelming' respo...,SEOUL (Reuters) - U.S. President Donald Trump’...,politicsNews,"February 3, 2017",0
29805,Big security risks in Trump feud with spy agen...,WASHINGTON (Reuters) - An unprecedented pre-pr...,politicsNews,"January 13, 2017",0
38237,France puts suspected militant under investiga...,PARIS (Reuters) - A suspected Islamist militan...,worldnews,"November 15, 2017",0
5099,Rudy Giuliani Turns Into A Blithering Idiot W...,During a live CNN interview with Rudy Giuliani...,News,"August 11, 2016",1
...,...,...,...,...,...
33676,"Obama, Argentina's Macri discuss Brazil's poli...",BUENOS AIRES (Reuters) - U.S. President Barack...,politicsNews,"March 23, 2016",0
44767,"In war-torn Darfur, new U.S. aid chief stresse...","ZAM ZAM CAMP, North Darfur (Reuters) - Washing...",worldnews,"August 28, 2017",0
43317,"Turkey's Erdogan, Iraq's Abadi to discuss Iraq...",ANKARA (Reuters) - Turkish President Tayyip Er...,worldnews,"September 17, 2017",0
6436,WATCH: Top Trump Aide Admits He Sees The Pres...,Donald Trump is a reality television star and ...,News,"May 11, 2016",1


Notice above that the updated DataFrame, as would be expected, is exactly 1,000 rows.

## Create the Corpus

In [35]:
# Don't want a DataFrame, want an array of documents
# So, need to concatenate 'title' and 'text' columns, or Series
# Display 'title'
fake_news_1000.title

35305    Ex-Interpol chief says ready to testify for Ar...
29180    U.S. warns North Korea of 'overwhelming' respo...
29805    Big security risks in Trump feud with spy agen...
38237    France puts suspected militant under investiga...
5099      Rudy Giuliani Turns Into A Blithering Idiot W...
                               ...                        
33676    Obama, Argentina's Macri discuss Brazil's poli...
44767    In war-torn Darfur, new U.S. aid chief stresse...
43317    Turkey's Erdogan, Iraq's Abadi to discuss Iraq...
6436      WATCH: Top Trump Aide Admits He Sees The Pres...
15829    12 Yr. Old Videotapes A Public Service Announc...
Name: title, Length: 1000, dtype: object

In [37]:
# Now display 'text'
fake_news_1000.text

35305    BUENOS AIRES (Reuters) - Argentina s previous ...
29180    SEOUL (Reuters) - U.S. President Donald Trump’...
29805    WASHINGTON (Reuters) - An unprecedented pre-pr...
38237    PARIS (Reuters) - A suspected Islamist militan...
5099     During a live CNN interview with Rudy Giuliani...
                               ...                        
33676    BUENOS AIRES (Reuters) - U.S. President Barack...
44767    ZAM ZAM CAMP, North Darfur (Reuters) - Washing...
43317    ANKARA (Reuters) - Turkish President Tayyip Er...
6436     Donald Trump is a reality television star and ...
15829     You are one of the most narcissistic, power-h...
Name: text, Length: 1000, dtype: object

In [39]:
# Just for fun, check the length of each article, aka the word count
fake_news_1000.text.str.split().str.len()

35305    411
29180    514
29805    798
38237    267
5099     462
        ... 
33676     76
44767    737
43317    511
6436     309
15829    360
Name: text, Length: 1000, dtype: int64

In [41]:
# Now back to the concatenation of the above Series, or columns
fake_news_1000.title + '. ' + fake_news_1000.text

35305    Ex-Interpol chief says ready to testify for Ar...
29180    U.S. warns North Korea of 'overwhelming' respo...
29805    Big security risks in Trump feud with spy agen...
38237    France puts suspected militant under investiga...
5099      Rudy Giuliani Turns Into A Blithering Idiot W...
                               ...                        
33676    Obama, Argentina's Macri discuss Brazil's poli...
44767    In war-torn Darfur, new U.S. aid chief stresse...
43317    Turkey's Erdogan, Iraq's Abadi to discuss Iraq...
6436      WATCH: Top Trump Aide Admits He Sees The Pres...
15829    12 Yr. Old Videotapes A Public Service Announc...
Length: 1000, dtype: object

In [43]:
# The above output is a pandas Series so need to turn it into an array of strings
# By adding .values this provides the array that can be fed into the model
# And save it as 'corpus'
# Also: Notice no space before the period but a space after for the clean output seen below
# Now, effectively, the title flows smoothly into the corresponding article
corpus = (fake_news_1000.title + '. ' + fake_news_1000.text).values
corpus

array(["Ex-Interpol chief says ready to testify for Argentina's Fernandez. BUENOS AIRES (Reuters) - Argentina s previous government never asked Interpol to drop arrest warrants against a group of Iranians accused of bombing a Jewish center, the ex-head of the police agency said on Wednesday, as the government proceeded with treason charges against the former president. Former Interpol chief Ronald Noble said in an email on Wednesday that he wants to testify that the government of former President Cristina Fernandez did not ask to have the arrest warrants lifted as part of a  memorandum  she had with Iran. If a judge allows Noble to testify, the treason case filed this month against Fernandez and 11 other top officials could crumble. She denies wrongdoing and calls the charge politically motivated. The arrest warrants  were not affected in their validity by the approval of the memorandum,  Noble said in an email to a federal appeals court that was seen by Reuters. The Fernandez administ

## Make a Bag-of-words (BoW) Model using CountVectorizer

The Bag-of-words model is a foundational NLP technique for turning raw text into usable numerical features. **CountVectorizer** is the tool we'll use to build this -- it simply counts how many times each word appears in each document. This transforms our text into a matrix of token counts, capturing how often each word shows up in each article.

Now, to fit a CountVectorizer using the tokenize function from gensim.

In [45]:
# First need to import tokenize from gensim
from gensim.utils import tokenize

In [47]:
# Then need to import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [49]:
# Preprocess the text then fit the vectorizer
# Create a modified corpus list
# So, need to start with a for loop
mod_corpus = []
for doc in corpus:
    tokens = list(tokenize(doc))
    break

In [51]:
# Use break above to check the first document in the corpus
corpus[0]

"Ex-Interpol chief says ready to testify for Argentina's Fernandez. BUENOS AIRES (Reuters) - Argentina s previous government never asked Interpol to drop arrest warrants against a group of Iranians accused of bombing a Jewish center, the ex-head of the police agency said on Wednesday, as the government proceeded with treason charges against the former president. Former Interpol chief Ronald Noble said in an email on Wednesday that he wants to testify that the government of former President Cristina Fernandez did not ask to have the arrest warrants lifted as part of a  memorandum  she had with Iran. If a judge allows Noble to testify, the treason case filed this month against Fernandez and 11 other top officials could crumble. She denies wrongdoing and calls the charge politically motivated. The arrest warrants  were not affected in their validity by the approval of the memorandum,  Noble said in an email to a federal appeals court that was seen by Reuters. The Fernandez administration 

Scrolling back up to the concatenated corpus confirms this is the first document.

In [53]:
# Now check to make sure the list is being tokenized correctly
# But just show the first 10 items to keep the output clean
tokens[:10]

['Ex',
 'Interpol',
 'chief',
 'says',
 'ready',
 'to',
 'testify',
 'for',
 'Argentina',
 's']

In [55]:
# Now grab the for loop that was started above and update it with the .append(tokens) line
mod_corpus = []
for doc in corpus:
    tokens = list(tokenize(doc))
    mod_corpus.append(tokens)

In [57]:
# Confirm that 'mod_corpus' has a length of 1000
len(mod_corpus)

1000

In [59]:
# This is a list of lists
# Again, slicing to show just the first 10 items for clean output sake
mod_corpus[0][:10]

['Ex',
 'Interpol',
 'chief',
 'says',
 'ready',
 'to',
 'testify',
 'for',
 'Argentina',
 's']

In [61]:
# Prefer to have this as a list of documents
# So, need to concatenate all these strings
# So, once again will modify the above for loop, this time with the space and join tokens
mod_corpus = []
for doc in corpus:
    tokens = list(tokenize(doc))
    mod_corpus.append(' '.join(tokens))

In [63]:
# Now check the first element to confirm this worked
mod_corpus[0]

'Ex Interpol chief says ready to testify for Argentina s Fernandez BUENOS AIRES Reuters Argentina s previous government never asked Interpol to drop arrest warrants against a group of Iranians accused of bombing a Jewish center the ex head of the police agency said on Wednesday as the government proceeded with treason charges against the former president Former Interpol chief Ronald Noble said in an email on Wednesday that he wants to testify that the government of former President Cristina Fernandez did not ask to have the arrest warrants lifted as part of a memorandum she had with Iran If a judge allows Noble to testify the treason case filed this month against Fernandez and other top officials could crumble She denies wrongdoing and calls the charge politically motivated The arrest warrants were not affected in their validity by the approval of the memorandum Noble said in an email to a federal appeals court that was seen by Reuters The Fernandez administration always expressed its 

Now to fit the CountVectorizer...

In [65]:
# Fit the CountVectorizer
# And call .fit_transform to create the sparse matrix
# Note: Here's the size of the matrix: 1000x21801
count_vect = CountVectorizer()
res = count_vect.fit_transform(mod_corpus)
res

<1000x21801 sparse matrix of type '<class 'numpy.int64'>'
	with 206938 stored elements in Compressed Sparse Row format>

While not required for building the search engine, it's helpful to inspect the vocabulary generated by CountVectorizer. Below we see how many unique tokens there are and even preview a few of them.

In [73]:
# Optional: Check the vocabulary size and sample tokens from CountVectorizer
feature_names = count_vect.get_feature_names_out()
feature_names

array(['_cingraham', '_js', 'a_r_marshall', ..., 'zweiman', 'zww', 'zzsg'],
      dtype=object)

## Reduce features through lemmatization

To reduce redundancy and improve model focus, we'll clean the text by lemmatizing words and removing common stop words.

In [78]:
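# This function tokenizes the text, lower-cases the tokens, removes the stop words, and lemmatizes what remains
# Note: WordNetLemmatizer also needs the 'wordnet' corpus; run nltk.download('wordnet') if it is not already installed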
from gensim.utils import tokenize
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def clean_text(text):
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    tokens = list(tokenize(text))
    #res = ' '.join([stemmer.stem(t.lower()) for t in tokens if t.lower() not in stop_words]) 
    res = ' '.join([lemmatizer.lemmatize(t.lower()) for t in tokens if t.lower() not in stop_words]) 
    if len(res) == 0:
        return ' '
    else:
        return res

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/karlbuscheck/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/karlbuscheck/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## TF-IDF Vectorization

Now that the text has been cleaned up, let's use TF-IDF Vectorization to transform it into a matrix of weighted features. This process gives more weight to terms that are distinctive to a document and less weight to terms that are common across the whole corpus.

In [83]:
# Start by importing TfidfVectorizer
# Then use the clean_text function and set up the ngram_range
# And finally .fit_transform() the corpus

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(preprocessor=clean_text, ngram_range=(1,2))
res = tfidf.fit_transform(corpus)
res

<1000x200188 sparse matrix of type '<class 'numpy.float64'>'
	with 379092 stored elements in Compressed Sparse Row format>

So, the sparse matrix has 1,000 rows (one per article or document) and just over 200K columns, each representing a unique word or word pair. Stored densely as float64, that would take roughly 1.6 GB, nearly all of it zeros, so we keep it sparse.

In [88]:
# Check the features of the matrix
# and save it as the variable 'features_1000'
features_1000 = tfidf.get_feature_names_out()
# Check the first 100 to make sure everything worked
features_1000[:100]

array(['_cingraham', '_cingraham december', '_js', '_js ajs',
       'a_r_marshall', 'a_r_marshall march', 'aa', 'aa fisher',
       'aa minus', 'aaron', 'aaron also', 'aaron bernstein',
       'aaron gouveia', 'aaron president', 'aaron rich', 'aaron said',
       'aaron sorkin', 'aaron told', 'aaron work', 'aarp',
       'aarp advocate', 'ab', 'ab cbn', 'abaaoud', 'abaaoud see', 'abadi',
       'abadi according', 'abadi called', 'abadi discus', 'abadi elected',
       'abadi government', 'abadi united', 'abadi week', 'abandon',
       'abandon campaign', 'abandon choreographed', 'abandon decision',
       'abandon nominee', 'abandon puerto', 'abandon renouncing',
       'abandon trump', 'abandon weapon', 'abandoned',
       'abandoned embassy', 'abandoned home', 'abandoned hometown',
       'abandoning', 'abandoning effort', 'abandoning franken',
       'abandoning gop', 'abandoning israel', 'abandoning right',
       'abated', 'abated early', 'abbas', 'abbas appeared',
       'abbas 

## Deploying the search engine: Finding the most relevant articles using Cosine Similarity

Time to bring it all together. We'll run a sample query through our search engine, use cosine similarity to compare it against every article, and finally return the most relevant results.

Let's try a fun sample query.

In [95]:
# Here's the query:
# Worth noting: It doesn't have to be a question
query = 'Who\'s the most famous podcaster in the world?'
query

"Who's the most famous podcaster in the world?"

Now we're ready to run the query through the search engine. We've already trained a TF-IDF vectorizer (tfidf) on the 1,000 articles and saved the resulting document-term matrix (res). Now we'll transform the sample query into a matching TF-IDF vector and compare it to every row in the matrix to find the most similar documents.

In [98]:
# Here's the vectorizer
tfidf

In [100]:
# Here's the document-term matrix
res

<1000x200188 sparse matrix of type '<class 'numpy.float64'>'
	with 379092 stored elements in Compressed Sparse Row format>

In [102]:
# Now need to find which documents, of the above matrix, are most similar to the query
# This, of course, is the essence of what a search engine does
# So have to transform the query into a vector, and it has to be a list in square brackets
qv = tfidf.transform([query])
qv

<1x200188 sparse matrix of type '<class 'numpy.float64'>'
	with 2 stored elements in Compressed Sparse Row format>

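*Notice*: The above matrix has just one row (the query) but the same number of columns as the document-term matrix! Also notice that only 2 elements are stored, which means only two of the query's features (words or word pairs) exist in the fitted vocabulary.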

**Cosine similarity** compares the direction of the two vectors -- in this case the query and the article -- and ignores their length. It's like asking: "Does this article use the same mix of terms as the query, regardless of how long it is?" The closer their direction, the higher the similarity score -- just like a short podcast clip and a three-hour episode that cover the same topics.
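The "direction, not length" idea is easy to check by hand. Below is a small sketch with made-up vectors; `cosine` is a hypothetical helper written out from the textbook formula, and the last lines compare it against scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(a, b):
    # Dot product divided by the product of the two vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])  # shares no non-zero dimension with a

sim_ab = cosine(a, b)  # same direction, so the score is 1
sim_ac = cosine(a, c)  # nothing in common, so the score is 0

# scikit-learn expects 2-D inputs (one row per vector)
sk_ab = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(sim_ab, sim_ac, sk_ab)
```

In search terms, `c` is an article that shares no terms with the query: its score is exactly zero no matter how long it is.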

In [111]:
# Find the cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
# Compute the similarity on the matrix and the query vector
cosine_similarity(res,qv)

array([[0.        ],
       [0.        ],
       [0.00576795],
       [0.        ],
       [0.        ],
       [0.03901848],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.02081021],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.00401586],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.014556  ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.02077854],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.00535952],
       [0.00538259],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.   

**Interpreting the similarity scores**: Because TF-IDF weights are never negative, the cosine similarity values seen above range from 0 to 1, where 1 means the two vectors point in exactly the same direction and 0 means the query and the article share no terms at all. The higher the number, the more similar the article is to the query.

In [116]:
# Check the shape of the similarity matrix
# Note: Since we're comparing 1,000 documents to 1 query, the result is a 1000x1 matrix, 
# one similarity score for each article
cosine_similarity(res,qv).shape

(1000, 1)

In [118]:
# Reshape the array
# And store it as 'sim'
# Note: These numbers are the query's similarity to each document
sim = cosine_similarity(res,qv).reshape(res.shape[0])
sim

array([0.        , 0.        , 0.00576795, 0.        , 0.        ,
       0.03901848, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.02081021, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.00401586,
       0.        , 0.        , 0.        , 0.014556  , 0.        ,
       0.        , 0.        , 0.02077854, 0.        , 0.        ,
       0.        , 0.00535952, 0.00538259, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.01106835,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.04478181, 0.        , 0.        , 0.        ,
       0.00844553, 0.        , 0.        , 0.01027265, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

So, what did we find?

Let's retrieve the document with the highest similarity score. It's not necessarily relevant, but it's the closest match the model could find based on overlapping tokens and TF-IDF weights. This is where search gets interesting -- sometimes you get a chart-topping episode, so to speak, and sometimes you get noise.

In [122]:
# Find the index of the most similar document
sim.argmax()

951

In [124]:
# And now view the document
corpus[sim.argmax()]

'WORLD’S MOST FAMOUS VICTIMS Purchase Stunning Number Of Luxury Homes…Not A Bad Payout For A Couple Who Divided Our Nation. The Obamas are moving into a nine-bedroom mansion in the Kalorama section of Washington   the posh neighborhood of diplomats and DC old money   while younger daughter Sasha finishes high school at Sidwell Friends. But they have apparently been buying real estate elsewhere, too.According to sources, the Obamas have purchased a house in Rancho Mirage, Calif., not far from Sunnylands, the former Annenberg estate, which presidents use as a getaway and which is thought of as the unofficial West Coast Camp David.Rancho Mirage, where Gerald Ford retired, is a top destination for golf   a favorite pastime of President Obama. The local daily newspaper, the Desert Sun, has reported rumors of such a sale for more than a year. But sources say now it s a done deal. Via: NYPThe Obamas, who also own a home in Chicago, will rent the  8,200-square-foot property while their younges

## 'WORLD'S MOST FAMOUS VICTIMS...' -- The model worked, technically

Well, the search engine worked, technically, at least. It surfaced the document or article that most overlapped with the language in the query: "Who's the most famous podcaster in the world?" The problem, of course, is that it also missed the most important word of all: *podcaster*.

This result highlights a key limitation of the current setup: TF-IDF only understands *words*, not meaning.

The driving force behind this result was the overlap with the headline "WORLD'S MOST FAMOUS...". Once stop words such as "most" are removed, the shared terms are "world" and "famous", which have nothing to do with podcasts, the crux of the query. This is why building smarter, more effective search tools requires more advanced models that understand context and semantics, not just matching terms.

Still, it's pretty cool to see how computers can start piecing things together -- even when the outputs get weird.

## Making the Top 10: Sorting Results Like a Google Search

In [129]:
# Now to create the top 10 list, sort of the "Google search"
# Start by using .argsort()
indices = sim.argsort()
# So this will be sorted from the index of the smallest to the largest cosine similarities
indices

array([  0, 623, 624, 626, 627, 628, 629, 630, 631, 632, 633, 634, 622,
       636, 638, 639, 640, 642, 643, 644, 645, 646, 647, 649, 650, 637,
       651, 621, 619, 589, 590, 592, 593, 594, 595, 596, 598, 599, 603,
       604, 620, 605, 607, 609, 610, 611, 612, 613, 614, 615, 616, 617,
       618, 606, 588, 652, 654, 689, 690, 692, 693, 694, 696, 698, 699,
       700, 702, 703, 687, 704, 706, 708, 710, 712, 713, 714, 715, 716,
       717, 718, 719, 705, 653, 686, 684, 655, 656, 658, 659, 662, 663,
       664, 665, 667, 668, 669, 685, 670, 672, 673, 675, 676, 677, 678,
       679, 680, 681, 682, 683, 671, 720, 587, 584, 486, 487, 488, 489,
       490, 491, 492, 493, 494, 497, 498, 485, 500, 503, 504, 505, 507,
       508, 509, 511, 512, 513, 514, 515, 502, 517, 484, 482, 451, 452,
       455, 456, 459, 460, 461, 462, 463, 464, 466, 483, 467, 469, 470,
       471, 473, 474, 475, 477, 478, 479, 480, 481, 468, 585, 518, 520,
       557, 558, 560, 561, 562, 564, 565, 566, 567, 568, 569, 55
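Since `.argsort()` sorts ascending, the best matches sit at the *end* of the array. A small sketch with made-up scores shows one way to flip the order and keep only the top few (the `toy_` names are illustrative):

```python
import numpy as np

# Made-up similarity scores for five documents
toy_sim = np.array([0.0, 0.04, 0.0, 0.01, 0.02])

order = toy_sim.argsort()   # indices from lowest to highest score
top3 = order[::-1][:3]      # reverse, then keep the three best

print(top3)

# Optionally drop documents that share nothing with the query
top3_nonzero = [int(i) for i in top3 if toy_sim[i] > 0]
print(top3_nonzero)
```

Documents with identical scores (here, the zeros) come back in no particular order, which is why the start of the real `indices` array looks scrambled.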

Below we loop through the top 10 articles returned by the search engine. Each one is shown with its similarity score, a measure of how strongly its weighted terms overlap with the query (not a confidence or probability).

**Spoiler alert**: The results aren't great. Despite asking about podcasts, none of the articles are actually about podcasts; they only share overlapping terms like "world's most famous." This is a great reminder that TF-IDF matches *words*, not *meaning*.

In [131]:
# Have to create a for loop
for i in range(10):
    ind = indices[-i-1]
    print(f'====== DOCUMENT {ind}, SIMILARITY {sim[ind]}')
    print(corpus[ind] + '\n')

WORLD’S MOST FAMOUS VICTIMS Purchase Stunning Number Of Luxury Homes…Not A Bad Payout For A Couple Who Divided Our Nation. The Obamas are moving into a nine-bedroom mansion in the Kalorama section of Washington   the posh neighborhood of diplomats and DC old money   while younger daughter Sasha finishes high school at Sidwell Friends. But they have apparently been buying real estate elsewhere, too.According to sources, the Obamas have purchased a house in Rancho Mirage, Calif., not far from Sunnylands, the former Annenberg estate, which presidents use as a getaway and which is thought of as the unofficial West Coast Camp David.Rancho Mirage, where Gerald Ford retired, is a top destination for golf   a favorite pastime of President Obama. The local daily newspaper, the Desert Sun, has reported rumors of such a sale for more than a year. But sources say now it s a done deal. Via: NYPThe Obamas, who also own a home in Chicago, will rent the  8,200-square-foot property while their youngest

## Build the SEARCH function

Now that we've seen the limitations of the search engine with this initial query, let's build out a function so that other queries can be tried quickly.

In [136]:
# Can turn this into a function so that it can be deployed quickly
def search(query, docterm_matrix, vectorizer, document_corpus):
    # Use the vectorizer that was passed in rather than the global tfidf
    qv = vectorizer.transform([query])
    sim = cosine_similarity(docterm_matrix,qv).reshape(docterm_matrix.shape[0])
    indices = sim.argsort()
    for i in range(10):
        ind = indices[-i-1]
        print(f'====== DOCUMENT {ind}, SIMILARITY {sim[ind]}')
        print(document_corpus[ind] + '\n')   
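For testing, it can be handy to have a version that *returns* the results instead of printing them. The sketch below is one possible variant, not the function used in this notebook: `search_top_k`, its `k` parameter, and the toy corpus are all hypothetical names and data introduced for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def search_top_k(query, docterm_matrix, vectorizer, document_corpus, k=10):
    """Return up to k (index, score, document) tuples, best match first."""
    qv = vectorizer.transform([query])
    sim = cosine_similarity(docterm_matrix, qv).ravel()
    top = sim.argsort()[::-1][:k]
    # Skip documents that share no terms with the query
    return [(int(i), float(sim[i]), document_corpus[i]) for i in top if sim[i] > 0]

# Tiny made-up corpus to exercise the function
toy_corpus = [
    "China urges U.S. to respect trade relations",
    "Obamas buy luxury homes",
    "U.S. and China hold talks",
]
toy_tfidf = TfidfVectorizer()
toy_res = toy_tfidf.fit_transform(toy_corpus)

results = search_top_k("China relations", toy_res, toy_tfidf, toy_corpus, k=2)
for ind, score, doc in results:
    print(f"====== DOCUMENT {ind}, SIMILARITY {score:.3f}")
    print(doc + "\n")
```

Returning a list makes it easy to check results in code, and filtering out zero scores avoids padding the top 10 with articles that have nothing in common with the query.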

In [185]:
# Try a sample search, particularly one that might better fit this dataset
search('U.S. and China relations', res, tfidf, corpus)

Ahead of Trump trip, China urges U.S. not to allow Taiwan president in. BEIJING/TAIPEI (Reuters) - China urged the United States on Friday not to allow Taiwan s president to travel through U.S. territory en route to the island s diplomatic allies in the Pacific, a sensitive visit shortly ahead of U.S. President Donald Trump s trip to Beijing.  China considers democratic Taiwan to be a wayward province ineligible for state-to-state relations and has never renounced the use of force to bring the island under its control.  China regularly calls Taiwan the most sensitive and important issue between it and the United States, and Beijing always complains to Washington about transit stops by Taiwanese presidents. President Tsai Ing-wen leaves on Saturday on a weeklong trip to three Pacific island allies - Tuvalu, the Solomon Islands and the Marshall Islands - via Honolulu and Guam. In a statement on Friday, a Taiwanese government spokesman said Tsai s trip was aimed at strengthening ties with