## Project 5: Search Engine
The first step in creating a search engine is to collect the documents. Once collected, the documents need to be indexed. The next step is to return a ranked list of documents for a query. Finally, you'll build a neural network for ranking queries.

Chinese-language examples: http://www.jianshu.com/p/42e12d502cc6 and http://www.jianshu.com/p/e9f79d69a375

### searchengine module
The module has two classes: one for crawling and creating the database, and the other for doing full-text searches by querying the database.

Crawling (or spidering): the crawler is seeded with a small set of pages to index, then follows the links on those pages to find other pages, whose links it follows in turn.
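The seeding-and-following behaviour described above amounts to a breadth-first traversal. The following is only a minimal sketch under stated assumptions: `get_links` is a hypothetical callable standing in for "download the page and extract its links", and `fake_web` is a made-up in-memory link graph, so no network access is needed.

```python
def crawl(seed_pages, get_links, depth=2):
    """Breadth-first crawl: visit the seeds, then the pages they link to,
    repeating for `depth` levels.

    get_links(url) is a hypothetical callable returning the URLs linked from url.
    """
    indexed = []   # pages in the order they were visited
    seen = set()   # pages already visited
    pages = list(seed_pages)
    for _ in range(depth):
        newpages = []
        for page in pages:
            if page in seen:
                continue
            seen.add(page)
            indexed.append(page)
            for url in get_links(page):
                if url not in seen and url not in newpages:
                    newpages.append(url)
        pages = newpages   # the next level consists of the newly found links
    return indexed


# A tiny made-up "web" standing in for real HTTP requests.
fake_web = {
    'a': ['b', 'c'],
    'b': ['c', 'd'],
    'c': [],
    'd': ['e'],
    'e': [],
}
visited = crawl(['a'], lambda url: fake_web.get(url, []), depth=2)
print(visited)  # ['a', 'b', 'c']
```

With `depth=2` the crawl visits the seed and the pages it links to directly; page `d` is discovered but only visited if the depth is raised to 3.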


In [1]:
import urllib.request
c = urllib.request.urlopen('http://www.diveintopython.net')
contents = c.read()
print(contents[0:200])

b'\r\n<!DOCTYPE html\r\n  PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\r\n<html lang="en">\r\n   <head>\r\n      <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859'


In [3]:
ignorewords = {'the', 'of', 'to', 'and', 'a', 'in', 'is', 'it'}  # stop words to skip when indexing
print(ignorewords)

{'is', 'it', 'to', 'of', 'in', 'and', 'the', 'a'}


### def crawl(self, pages, depth=2)

In [23]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(contents, 'lxml')
links = soup('a')
print(links[:10])

[<a name="intro"></a>, <a class="plain" href="http://www.amazon.com/gp/product/B001GIOFN8/ref=as_li_tf_tl?ie=UTF8&amp;tag=diveintopython-20&amp;linkCode=as2&amp;camp=1789&amp;creative=9325&amp;creativeASIN=B001GIOFN8"><img alt="[Dive Into Python]" height="140" src="images/cover-small.jpg" style="border: 0" title="Dive Into Python: Buy it at Amazon.com" width="106"/></a>, <a href="toc/index.html">read the book</a>, <a href="#download" title="Download Dive Into Python">download it</a>, <a title="Dive Into Python in your language">multiple languages</a>, <a name="read"></a>, <a href="toc/index.html"><i class="citetitle">Dive Into <span class="application">Python</span></i></a>, <a href="appendix/history.html">read the revision history</a>, <a href="mailto:josh@djazzee.com">Email me</a>, <a name="download"></a>]


In [13]:
for link in links[:3]:
    print(link.attrs)
    print(dict(link.attrs))
    print(link.attrs == dict(link.attrs))

{'name': 'intro'}
{'name': 'intro'}
True
{'class': ['plain'], 'href': 'http://www.amazon.com/gp/product/B001GIOFN8/ref=as_li_tf_tl?ie=UTF8&tag=diveintopython-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=B001GIOFN8'}
{'class': ['plain'], 'href': 'http://www.amazon.com/gp/product/B001GIOFN8/ref=as_li_tf_tl?ie=UTF8&tag=diveintopython-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=B001GIOFN8'}
True
{'href': 'toc/index.html'}
{'href': 'toc/index.html'}
True


#### urllib.parse
Documentation: https://docs.python.org/3/library/urllib.parse.html

urllib.parse.urljoin(base, url, allow_fragments=True)

Construct a full (“absolute”) URL by combining a “base URL” (base) with another URL (url). Informally, this uses components of the base URL, in particular the addressing scheme, the network location and (part of) the path, to provide missing components in the relative URL.

The allow_fragments argument has the same meaning and default as for urlparse().

Note: if url is an absolute URL (that is, starting with // or scheme://), the url's host name and/or scheme will be present in the result. The cell below applies urljoin to the first few links extracted from the page.
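To make the behaviour of `urljoin` concrete, here is a short sketch covering the cases that matter for the crawler: a relative path, a fragment-only link, an already absolute URL, and a scheme-relative URL. The base URL is the page used in this notebook; the last case reuses the example URLs from the Python documentation.

```python
from urllib.parse import urljoin

base = 'http://www.diveintopython.net'

# Relative path: resolved against the base URL.
relative = urljoin(base, 'toc/index.html')

# Fragment only: the base is kept and the fragment is appended.
fragment = urljoin(base, '#download')

# Absolute URL: the base is ignored entirely.
absolute = urljoin(base, 'http://www.amazon.com/gp/product/B001GIOFN8')

# Scheme-relative URL (starts with //): only the scheme comes from the base.
scheme_relative = urljoin('http://www.cwi.nl/%7Eguido/Python.html',
                          '//www.python.org/%7Eguido')

print(relative)         # http://www.diveintopython.net/toc/index.html
print(fragment)         # http://www.diveintopython.net#download
print(absolute)         # http://www.amazon.com/gp/product/B001GIOFN8
print(scheme_relative)  # http://www.python.org/%7Eguido
```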

In [30]:
import urllib.parse  
page = 'http://www.diveintopython.net'
newpages = set()
for link in links[:5]:
    if 'href' in link.attrs:
        url = urllib.parse.urljoin(page, link['href'])
        print(url)
        if url.find("'") != -1:  # find() returns the index where the substring starts, or -1 if it is absent
            continue  # skip URLs that contain a single quote
        print(url.split('#'))
        url = url.split('#')[0]  # remove the fragment (location) portion
        # if url[:4] == 'http' and not self.isindexed(url):
        if url[:4] == 'http':
            newpages.add(url)
print('***  newpages set: ***')
print(newpages)

http://www.amazon.com/gp/product/B001GIOFN8/ref=as_li_tf_tl?ie=UTF8&tag=diveintopython-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=B001GIOFN8
['http://www.amazon.com/gp/product/B001GIOFN8/ref=as_li_tf_tl?ie=UTF8&tag=diveintopython-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=B001GIOFN8']
http://www.diveintopython.net/toc/index.html
['http://www.diveintopython.net/toc/index.html']
http://www.diveintopython.net#download
['http://www.diveintopython.net', 'download']
***  newpages set: ***
{'http://www.diveintopython.net', 'http://www.diveintopython.net/toc/index.html', 'http://www.amazon.com/gp/product/B001GIOFN8/ref=as_li_tf_tl?ie=UTF8&tag=diveintopython-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=B001GIOFN8'}


#### separate words

In [17]:
import re
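# def separatewords(self, text):
text = 'This is a test. 123 and 456? Yes, or no. yes!'
# \W matches any non-word character (anything other than a letter, digit, or underscore)
# re.compile turns the regular-expression string into a Pattern object
splitter = re.compile(r'\W+')  # with \W* (zero or more) older Python 3 versions emit a FutureWarning
print(splitter.split(text))  # the result ends with an empty string '' because the text ends with '!'
print(re.split(r'\W+', text))
[s.lower() for s in re.split(r'\W+', text) if s != '']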

['This', 'is', 'a', 'test', '123', 'and', '456', 'Yes', 'or', 'no', 'yes', '']
['This', 'is', 'a', 'test', '123', 'and', '456', 'Yes', 'or', 'no', 'yes', '']


['this', 'is', 'a', 'test', '123', 'and', '456', 'yes', 'or', 'no', 'yes']
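The splitting step and the `ignorewords` set from earlier can be combined so that only the words worth indexing remain. This is a sketch, not the module's actual method: `separatewords` follows the commented-out signature above but is written as a plain function, and `indexable_words` is a hypothetical helper name introduced here for illustration.

```python
import re

ignorewords = {'the', 'of', 'to', 'and', 'a', 'in', 'is', 'it'}
splitter = re.compile(r'\W+')


def separatewords(text):
    """Split on runs of non-word characters, lowercase, and drop empty strings."""
    return [s.lower() for s in splitter.split(text) if s != '']


def indexable_words(text):
    """Hypothetical helper: the separated words with the stop words removed."""
    return [w for w in separatewords(text) if w not in ignorewords]


words = indexable_words('This is a test. 123 and 456? Yes, or no. yes!')
print(words)  # ['this', 'test', '123', '456', 'yes', 'or', 'no', 'yes']
```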

In [10]:
import sqlite3
# getentryid
con = sqlite3.connect('searchindex.db')
cur = con.execute("select rowid from wordlist where word = 'python' ")  # the Python string is double-quoted so the SQL literal can use single quotes
res = cur.fetchone()
print(res)

(17,)
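The lookup above can be wrapped in a small helper that returns the existing row id or creates the entry when it is missing. This is only a sketch under assumptions: the name `getentryid` comes from the comment in the cell, but its signature and the `createnew` flag are guesses, and an in-memory database with a one-column `wordlist` table stands in for `searchindex.db`.

```python
import sqlite3


def getentryid(con, table, field, value, createnew=True):
    """Return the rowid of `value` in `table`, inserting it first if it is missing.

    `table` and `field` are interpolated into the SQL text, so they must be
    trusted identifiers; `value` is passed as a bound parameter.
    """
    cur = con.execute("select rowid from %s where %s = ?" % (table, field), (value,))
    res = cur.fetchone()
    if res is not None:
        return res[0]
    if not createnew:
        return None
    cur = con.execute("insert into %s (%s) values (?)" % (table, field), (value,))
    return cur.lastrowid


# In-memory stand-in for searchindex.db (assumed schema: wordlist(word)).
con = sqlite3.connect(':memory:')
con.execute("create table wordlist(word)")
first = getentryid(con, 'wordlist', 'word', 'python')   # inserted, new rowid
again = getentryid(con, 'wordlist', 'word', 'python')   # found, same rowid
other = getentryid(con, 'wordlist', 'word', 'search')   # inserted, next rowid
missing = getentryid(con, 'wordlist', 'word', 'crawler', createnew=False)
print(first, again, other, missing)  # 1 1 2 None
con.close()
```

Binding the word with `?` avoids building the SQL string by hand, which also sidesteps the quoting issue noted in the cell above.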
