# SearchBetter demos

## Getting started

Before you run this demo, you'll need to do a few things:

* Make sure you define `secure.py` in the `src/` directory. We've provided a `secure.py.example` for you to work from!
* Download the [Udacity course listings](https://www.udacity.com/public-api/v0/courses) and put it at 
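
The cells in this demo read three constants from `secure.py`: `DATASET_PATH_BASE`, `INDEX_PATH_BASE`, and `MODEL_PATH_BASE`. As a rough sketch only, a `secure.py` that satisfies those references might look like the following. The directory layout is a placeholder we made up for illustration, and the real file may need other settings too, so defer to `secure.py.example`.

```python
# secure.py: hypothetical sketch; defer to secure.py.example for the real template.
# The three constant names are the ones the demo cells reference.
# The directory values are placeholders, not the project's actual layout.
import os

_ROOT = os.path.expanduser('~/searchbetter-data')  # placeholder location

# The demo builds paths by string concatenation (e.g. DATASET_PATH_BASE + 'udacity-api.json'),
# so each base path ends with a path separator.
DATASET_PATH_BASE = os.path.join(_ROOT, 'datasets', '')  # raw datasets
INDEX_PATH_BASE = os.path.join(_ROOT, 'indices', '')     # cached search indices
MODEL_PATH_BASE = os.path.join(_ROOT, 'models', '')      # trained Word2Vec models
```

Joining with an empty final component is what adds the trailing separator, which keeps the concatenated paths in the later cells valid.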

In [8]:
# First, let's get all the imports out of the way...

import gensim.models.word2vec as word2vec
from pprint import pprint

import sys
sys.path.append('../')
sys.path.append('../src/')

import searchbetter.search as search
reload(search)
import searchbetter.rewriter as rewriter
reload(rewriter)

import secure

## Making a search engine

SearchBetter lets you make custom, batteries-included search engines out of any dataset, no matter how large or how small. We include some examples in `search.py`. As an example, consider the pre-built Udacity search engine, which searches over a dump of all Udacity courses.
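
To give a feel for what "index the dataset, then search it" means, here is a minimal sketch using a toy in-memory inverted index. It is not how SearchBetter is implemented (the real engines use Whoosh indices cached on disk). `ToySearchEngine`, the sample courses, and the rule that every query word must appear in the title are all assumptions made for illustration.

```python
from __future__ import print_function
from collections import defaultdict


class ToySearchEngine(object):
    """Hypothetical in-memory stand-in for an index-backed search engine."""

    def __init__(self, documents):
        self.documents = documents
        # Inverted index: lowercase word -> positions of documents containing it
        self.index = defaultdict(set)
        for position, doc in enumerate(documents):
            for word in doc['title'].lower().split():
                self.index[word].add(position)

    def search(self, term):
        """Return documents whose title contains every word of the query."""
        words = term.lower().split()
        if not words:
            return []
        matches = set.intersection(*[self.index.get(w, set()) for w in words])
        return [self.documents[i] for i in sorted(matches)]


# Made-up sample data, shaped like the results shown in this demo
courses = [
    {'slug': 'toy-android-basics', 'title': 'Android Basics'},
    {'slug': 'toy-machine-learning', 'title': 'Machine Learning'},
    {'slug': 'toy-learning-android', 'title': 'Learning Android'},
]
engine = ToySearchEngine(courses)
android_hits = engine.search('android')
ml_hits = engine.search('machine learning')
print(len(android_hits), 'results for android')
print(len(ml_hits), 'results for machine learning')
```

Building the index once up front is what makes repeated searches cheap, which is also why it is worth caching an index on disk instead of rebuilding it every run.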

In [9]:
# Create a search engine that searches over all Udacity courses.
# Under the hood, this uses the Whoosh library for Python to index
# the course data stored in a JSON file and then run searches against it.
dataset_path = secure.DATASET_PATH_BASE + 'udacity-api.json'
index_path = secure.INDEX_PATH_BASE + 'udacity'

# Use `create=False` if you've already made the search engine, `create=True` if this is
# your first time making it. We cache the search indices behind search engines on disk.
### UNCOMMENT THE BELOW IF YOU'RE RUNNING THIS FOR THE FIRST TIME
# udacity_engine = search.UdacitySearchEngine(dataset_path, index_path, create=True)
### COMMENT OUT THE BELOW IF YOU'RE RUNNING THIS FOR THE FIRST TIME
udacity_engine = search.UdacitySearchEngine(dataset_path, index_path, create=False)

# We expose a simple searching API
search_term = "android"
udacity_results = udacity_engine.search(search_term)
print "%d Udacity search results for '%s':" % (len(udacity_results), search_term)
pprint(udacity_results)

print "\n"

# Searching works on bigrams (two-word queries) too!
search_term = "machine learning"
udacity_results = udacity_engine.search(search_term)
print "%d Udacity search results for '%s':" % (len(udacity_results), search_term)
pprint(udacity_results)

32 Udacity search results for 'android':
[{'slug': u'developing-android-apps--ud853',
  'title': u'Developing Android Apps'},
 {'slug': u'new-android-fundamentals--ud851',
  'title': u'New Android Fundamentals'},
 {'slug': u'android-for-beginners--ud834', 'title': u'Android for Beginners'},
 {'slug': u'android-for-beginners--ud834', 'title': u'Android for Beginners'},
 {'slug': u'android-tv-and-google-cast-development--ud875B',
  'title': u'Android TV and Google Cast Development'},
 {'slug': u'gradle-for-android-and-java--ud867',
  'title': u'Gradle for Android and Java'},
 {'slug': u'android-wear-development--ud875A',
  'title': u'Android Wear Development'},
 {'slug': u'material-design-for-android-developers--ud862',
  'title': u'Material Design for Android Developers'},
 {'slug': u'android-basics-user-input--ud836',
  'title': u'Android Basics: User Input'},
 {'slug': u'android-basics-user-input--ud836',
  'title': u'Android Basics: User Input'},
 {'slug': u'android-basics-networking--ud

## Query rewriting

Sometimes the plain-vanilla search engine just doesn't cut it: some search queries don't return enough results. With query rewriting, the search engine looks for terms semantically related to the user's query in addition to the query itself. This helps find more search results, which is particularly useful if the bare query doesn't get any hits.

SearchBetter has two built-in query rewriters: a simple one that uses Wikipedia's Categories API to find similar terms, and a more complex one that uses Google's Word2Vec (a machine learning algorithm for finding similar words, trained here on Wikipedia article dumps) to find similar phrases.
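
As an illustration of the idea behind the Word2Vec rewriter, the sketch below ranks candidate terms by the cosine similarity of their word vectors and returns the closest ones. Everything in it is an assumption made for the example: the 3-dimensional vectors are invented by hand (a trained model learns much larger vectors from its corpus), and `toy_rewrite` is a hypothetical helper, not SearchBetter's `Word2VecRewriter`.

```python
from __future__ import print_function
import math

# Made-up 3-dimensional vectors, for illustration only.
TOY_VECTORS = {
    'socialism': [0.9, 0.8, 0.1],
    'communism': [0.85, 0.9, 0.15],
    'capitalism': [0.7, 0.6, 0.4],
    'android': [0.1, 0.05, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_rewrite(term, vectors, top_n=2):
    """Return the top_n most similar known terms, followed by the original term."""
    if term not in vectors:
        return [term]  # unknown word: fall back to the bare query
    scored = [(cosine_similarity(vectors[term], vec), other)
              for other, vec in vectors.items() if other != term]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:top_n]] + [term]


rewrites = toy_rewrite('socialism', TOY_VECTORS)
print(rewrites)
```

With these invented vectors, `communism` and `capitalism` point in nearly the same direction as `socialism`, while `android` does not, so only the first two come back as rewrites.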

In [10]:
# Query rewriting lets you turn a single search query into
# multiple related queries. You can then search for *all*
# of these queries, which can give you more (and more useful)
# results than the original query alone would.

# First, a rewriter that uses the Wikipedia category API
# to find terms related to the original term
wiki_rewriter = rewriter.WikipediaRewriter()
term = "socialism"
wiki_rewritten_queries = wiki_rewriter.rewrite(term)
print "Rewrites of '%s' using Wikipedia Categories:" % term
pprint(wiki_rewritten_queries)


print "\n"


# Second, a rewriter that uses Word2Vec to find similar
# words to the entered term. This is a machine learning
# algorithm trained on a large text corpus.
# Prepare the corpus (from Wikipedia) to use for the Word2Vec Rewriter.
corpus = word2vec.LineSentence(secure.DATASET_PATH_BASE + 'wikiclean8')

# Now make the rewriter...
model_path = secure.MODEL_PATH_BASE+'word2vec/word2vec'

## UNCOMMENT the below line if it's your first time making this rewriter
# w2v_rewriter = rewriter.Word2VecRewriter(model_path, create=True, corpus=corpus, bigrams=True)
## COMMENT OUT the below line if it's your first time making this rewriter
w2v_rewriter = rewriter.Word2VecRewriter(model_path, create=False)

w2v_rewritten_queries = w2v_rewriter.rewrite(term)
print "Rewrites of '%s' using Word2Vec:" % term
pprint(w2v_rewritten_queries)

Rewrites of 'socialism' using Wikipedia Categories:
['anti-capitalism', 'anti-fascism', 'socialism']


Rewrites of 'socialism' using Word2Vec:
[u'communism',
 u'capitalism',
 u'ideology',
 u'fascism',
 u'liberalism',
 u'marxism',
 u'marxist',
 u'laissez faire',
 u'imperialism',
 u'nationalism',
 u'socialism']


## Putting it together: Query-Rewriting Search Engines

As we've seen, query rewriters convert one search term into a set of semantically similar ones. By searching for the whole set of terms instead of just one, we hope to get more (and more useful) results out of a search engine.

With SearchBetter, you can connect any query rewriter to any search engine and automatically start getting more results.
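
One way to picture the connection: the engine asks its rewriter for a list of queries, runs each one, and merges the hits without duplicates. The sketch below shows that flow with toy classes. Only the method names `set_rewriter`, `search`, and `rewrite` come from this demo. The classes, the substring matching, and the merge order are assumptions for illustration and may differ from what SearchBetter does internally.

```python
from __future__ import print_function


class ToyRewriter(object):
    """Hypothetical rewriter backed by a hand-written table of related terms."""

    def __init__(self, related):
        self.related = related

    def rewrite(self, term):
        # Related terms first, then the original term
        return self.related.get(term, []) + [term]


class ToyEngine(object):
    """Hypothetical engine that matches queries against titles by substring."""

    def __init__(self, titles):
        self.titles = titles
        self.rewriter = None

    def set_rewriter(self, rewriter):
        self.rewriter = rewriter

    def _search_one(self, term):
        return [t for t in self.titles if term.lower() in t.lower()]

    def search(self, term):
        # Without a rewriter, search only for the bare term
        queries = self.rewriter.rewrite(term) if self.rewriter else [term]
        results = []
        for query in queries:
            for title in self._search_one(query):
                if title not in results:  # keep each hit once, in first-seen order
                    results.append(title)
        return results


engine = ToyEngine(['Intro to Artificial Intelligence', 'Machine Learning',
                    'Deep Learning', 'Android Basics'])
bare = engine.search('artificial intelligence')

engine.set_rewriter(ToyRewriter(
    {'artificial intelligence': ['machine learning', 'deep learning']}))
rewritten = engine.search('artificial intelligence')

print(len(bare), 'result(s) without rewriting')
print(len(rewritten), 'result(s) with rewriting')
```

Because the engine only depends on the rewriter having a `rewrite` method, any rewriter can be swapped in, and passing `None` switches rewriting off again.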

In [18]:
# Let's plug our two rewriters into the search engine
# to compare the results

# Suppose this is the user's search term
search_term = 'Artificial Intelligence'

# First, what do we get without rewriting?
udacity_engine.set_rewriter(None)
bare_results = udacity_engine.search(search_term)
print "Without rewriting, %d results for '%s':" % (len(bare_results), search_term)
pprint(bare_results)

print "\n"

# Next, try the Wikipedia rewriter
udacity_engine.set_rewriter(wiki_rewriter)
wiki_rewritten_results = udacity_engine.search(search_term)
print "With Wikipedia Categories rewriting, %d results for '%s':" % (len(wiki_rewritten_results), search_term)
pprint(wiki_rewritten_results)

print "\n"

# Last, try the Word2Vec rewriter
udacity_engine.set_rewriter(w2v_rewriter)
w2v_rewritten_results = udacity_engine.search(search_term)
print "With Word2Vec rewriting, %d results for '%s':" % (len(w2v_rewritten_results), search_term)
pprint(w2v_rewritten_results)

Without rewriting, 5 results for 'Artificial Intelligence':
[{'slug': u'knowledge-based-ai-cognitive-systems--ud409',
  'title': u'Knowledge-Based AI: Cognitive Systems'},
 {'slug': u'intro-to-artificial-intelligence--cs271',
  'title': u'Intro to Artificial Intelligence'},
 {'slug': u'artificial-intelligence-for-robotics--cs373',
  'title': u'Artificial Intelligence for Robotics'},
 {'slug': u'deep-learning--ud730', 'title': u'Deep Learning'},
 {'slug': u'machine-learning--ud262', 'title': u'Machine Learning'}]


With Wikipedia Categories rewriting, 5 results for 'Artificial Intelligence':
[{'slug': u'knowledge-based-ai-cognitive-systems--ud409',
  'title': u'Knowledge-Based AI: Cognitive Systems'},
 {'slug': u'intro-to-artificial-intelligence--cs271',
  'title': u'Intro to Artificial Intelligence'},
 {'slug': u'artificial-intelligence-for-robotics--cs373',
  'title': u'Artificial Intelligence for Robotics'},
 {'slug': u'deep-learning--ud730', 'title': u'Deep Learning'},
 {'slug': u'm