# Searching and Ranking
This section walks through the steps needed to crawl, index, and search a set of pages, and to rank the results in several ways.

## Building a search engine
1. __Develop a way to collect the documents.__ This usually involves _crawling_.
2. __Once the documents are collected, they need to be indexed.__ This involves creating a big table of the documents and the locations of all the different words. The index only has to _store a reference_ (such as a file system path or URL) to each document's location.
3. __Return a ranked list of documents for a query.__

Additionally, we will build a neural network for ranking queries. The neural network will learn to associate searches with results based on which links people click after they get a list of search results. It will use this information to reorder the results to better reflect what people have clicked on in the past.

### Building a Crawler
Steps:
1. Download the pages.
2. Pass them to the indexer.
3. Parse the pages to find all the links to the pages that have to be crawled.
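
The three steps above can be sketched as a breadth-first loop. This is a minimal Python 3 sketch under stated assumptions, not the class from `searchengine.py`: the names `LinkParser`, `crawl`, `fetch`, and `index_page` are hypothetical, and `fetch` is passed in as a callback so the loop can be tried on in-memory pages without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_pages, fetch, index_page, depth=2):
    """Breadth-first crawl to the given depth; returns the set of visited URLs.

    fetch(url) returns the page HTML, or None if the page could not be opened.
    index_page(url, html) hands the page to the indexer.
    Both callbacks are hypothetical and supplied by the caller.
    """
    seen = set()
    frontier = list(start_pages)
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            if url in seen:
                continue
            seen.add(url)
            html = fetch(url)              # step 1: download the page
            if html is None:
                print("Could not open %s" % url)
                continue
            index_page(url, html)          # step 2: pass it to the indexer
            parser = LinkParser()          # step 3: find links to crawl next
            parser.feed(html)
            for href in parser.links:
                # Resolve relative links and drop any #fragment.
                link, _fragment = urldefrag(urljoin(url, href))
                if link.startswith("http") and link not in seen:
                    next_frontier.append(link)
        frontier = next_frontier
    return seen


# Toy in-memory "web" (made-up pages) standing in for real downloads.
pages = {
    "http://example.com/": '<a href="/about">About</a> <a href="/missing">Gone</a>',
    "http://example.com/about": '<a href="/">Home</a>',
}
indexed = []
visited = crawl(["http://example.com/"], pages.get,
                lambda url, html: indexed.append(url))
```

Passing `fetch` and `index_page` as arguments keeps the traversal logic separate from downloading and indexing, so each part can be swapped or tested on its own.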

_The code for the __crawler__ class is in __searchengine.py__._

In [6]:
"""
Testing Crawler
"""
import searchengine
pagelist = ['http://kiwitobes.com/']
crawler = searchengine.crawler('')
crawler.crawl(pagelist)

Indexing http://kiwitobes.com/
Indexing https://kiwitobes.com/2013/10/22/tuxedo-jacket/comment-page-1/
Indexing https://kiwitobes.com/author/kiwitobes/
Indexing https://kiwitobes.com/feed/
Indexing https://kiwitobes.wordpress.com/wp-login.php
Indexing https://kiwitobes.com/2013/10/22/tuxedo-jacket/
Indexing https://kiwitobes.com/category/uncategorized/
Indexing https://kiwitobes.com/comments/feed/
Could not open http://www.adafruit.com/products/1430
Indexing https://kiwitobes.com/2013/09/26/twitter-lights-and-memory-limits-with-arduino-yun/
Indexing https://kiwitobes.com/2013/08/
Indexing http://kiwitobes.com/
Indexing https://kiwitobes.com/
Indexing https://www.temboo.com/arduino/read-a-tweet
Indexing http://cocktaillabs.tumblr.com/
Indexing http://www.amazon.com/gp/product/B00F6YJK3S/ref=as_li_qf_sp_asin_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=B00F6YJK3S&linkCode=as2&tag=tasktoy-20
Indexing https://kiwitobes.com/about/
Indexing https://kiwitobes.files.wordpress.com/2013/09/im

### Building the index
1. The index is a list of all the different words, along with the documents in which they appear and their locations in those documents.
2. We’ll be looking at the actual text on the page and ignoring nontext elements.
3. We’ll also be indexing individual words with all the punctuation characters removed.
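
As an illustration of the structure described above, here is a minimal in-memory sketch in Python 3, assuming each document has already been reduced to plain text. The function name `build_index` and the sample documents are made up for this example; the notebook itself keeps its index in a database file (`searchindex.db`) rather than a dictionary.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to {document reference: [positions of the word]}."""
    index = defaultdict(lambda: defaultdict(list))
    for ref, text in documents.items():
        # Keep only runs of letters and digits, so punctuation is dropped.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word][ref].append(position)
    return index


# Made-up documents keyed by a reference (here, a file name).
docs = {
    "a.html": "Search engines rank pages.",
    "b.html": "Pages link to pages!",
}
index = build_index(docs)
print(dict(index["pages"]))
# {'a.html': [3], 'b.html': [0, 3]}
```

Storing positions rather than just document references is what later allows ranking by where, and how close together, the query words appear.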

In [20]:
"""
setting up the schema (commands already present in the code)
"""
reload(searchengine)
crawler=searchengine.crawler('searchindex.db')
crawler.createindextables()
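
The notebook does not show what `createindextables` creates. Purely as an illustration, a schema of this kind could be set up with the standard `sqlite3` module as sketched below (Python 3). The function name `create_index_tables` and the table and column names (`urllist`, `wordlist`, `wordlocation`) are assumptions and may differ from those in `searchengine.py`.

```python
import sqlite3


def create_index_tables(con):
    """Create a minimal word-location schema (hypothetical table names)."""
    con.execute("CREATE TABLE urllist(url TEXT)")
    con.execute("CREATE TABLE wordlist(word TEXT)")
    # One row per occurrence: which word, on which page, at which position.
    con.execute(
        "CREATE TABLE wordlocation(urlid INTEGER, wordid INTEGER, location INTEGER)"
    )
    # Indexes speed up lookups by word and by URL.
    con.execute("CREATE INDEX wordidx ON wordlist(word)")
    con.execute("CREATE INDEX urlidx ON urllist(url)")
    con.commit()


con = sqlite3.connect(":memory:")  # in-memory database; nothing is written to disk
create_index_tables(con)
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
# ['urllist', 'wordlist', 'wordlocation']
```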

### Finding the Words of a Page
The `gettextonly` function returns a long string containing all the text on the page.<br>
`separatewords` splits a string into a list of separate words so that they can be added to the index. This function considers anything nonalphanumeric to be a separator.<br>
_Note (future implementation): use a stemming algorithm in place of the normal `separatewords` function._
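
The behaviour described for `gettextonly` and `separatewords` can be sketched with the Python 3 standard library. This is a stand-in under stated assumptions, not the code from `searchengine.py`: the names `TextOnly`, `get_text_only`, and `separate_words` are hypothetical, lowercasing the words is an assumption, and this simple extractor does not skip the contents of `<script>` or `<style>` tags.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Accumulates the text nodes of a page and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def get_text_only(html):
    """Return all text on the page as one long string."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def separate_words(text):
    """Split on runs of non-alphanumeric characters and drop empty strings."""
    return [word.lower() for word in re.split(r"[^A-Za-z0-9]+", text) if word]


# Made-up page used only for this example.
page = "<html><body><h1>Tuxedo Jacket</h1><p>Lights, and memory!</p></body></html>"
words = separate_words(get_text_only(page))
print(words)
# ['tuxedo', 'jacket', 'lights', 'and', 'memory']
```

Because every non-alphanumeric character is a separator, a word such as `It's` is split into `it` and `s`; a stemming step, as the note above suggests, would be a separate refinement.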