--- 
<a id='build_it' ></a>

## Building our Whoosh Schema

Recall that the `book/` folder is a collection of text files, each holding its own book chapter.

In Whoosh, we structure the retrieval system by defining a storage schema.

```
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
                )
```

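This tells us we are defining records that have `(filename, content)` fields. The cell below also adds a `line_num` field and stores the content so it can be shown with each hit.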

In [3]:
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                line_num=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer(),stored=True)
               )

--- 
<a id='load_it' ></a>

## Loading Data

The books are in the `book/` folder inside the `lab/` folder.

Create the _whoosh_ index files in the `indexes` folder, then ingest the files.

To load the data, a Python script will follow the basic crawling behavior:

 1. For each file/folder in the specified starting folder:
 1. If it is a folder, recurse into folder and process contents
 1. If it is a file, read contents and load into indexer.
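
The crawl described above can be sketched with the standard library alone. This is a minimal illustration, not the loader used in this lab: the helper name `find_text_files` and the `.txt` filter are assumptions made for the example, and it only collects paths instead of writing to an index.

```python
import os

def find_text_files(start_folder):
    """Return the sorted paths of all .txt files under start_folder.

    Hypothetical helper for illustration: it recurses by hand so that
    each step of the crawl is visible.
    """
    found = []
    for entry in sorted(os.listdir(start_folder)):
        path = os.path.join(start_folder, entry)
        if os.path.isdir(path):
            # Folder: recurse into it and process its contents
            found.extend(find_text_files(path))
        elif entry.endswith(".txt"):
            # File: this is where the contents would be read and indexed
            found.append(path)
    return sorted(found)
```

In practice `os.walk` performs the same recursion for you, which is why a loader built on it does not need to call itself for each subfolder.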

In [4]:
import os, os.path
from whoosh import index

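# Make sure the index directory exists before creating the index in it
os.makedirs("indexes", exist_ok=True)

# Note, this clears the existing index in the directory
ix = index.create_in("indexes", schema)

# Get a writer from the created index
writer = ix.writer()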

In [5]:
def loadFile(writer, fname):
    '''
    Read file contents, load into database.
    '''
    line_no = 1
    with open(fname, 'r') as infile:
        for line in infile:
            line = line.rstrip('\n')
            line_no += 1
            writer.add_document(filename=fname, \
                                line_num=str(line_no),\
                                content=line)
    print("Indexed: ", fname)


#     with open(fname, 'r') as infile:
#         content=infile.read()
#         txt = content.splitlines()
#         for line in txt:
#             writer.add_document(filename=fname, content=line)
#         print("Indexed: ", fname)

def processFolder(writer,folder):
    '''
    Process a folder for files and subfolders
    '''
    print('Processing folder: ',folder)
    for root, dirs, files in os.walk(folder):
        print("root = ", root)
        # Process Files
        for file in files:
            if file.endswith(".txt"):
                filename = os.path.join(root, file)
                print('Processing File:',filename)
                loadFile(writer,filename)
            else:
                print("Unhandled File")
        # os.walk already descends into every subfolder, so no explicit recursion is needed

# Functions defined,  get the party started:
processFolder(writer,"book")
writer.commit() # save changes

Processing folder:  book
root =  book
Processing File: book/1chron.txt
Indexed:  book/1chron.txt
Processing File: book/1corinth.txt
Indexed:  book/1corinth.txt
Processing File: book/1john.txt
Indexed:  book/1john.txt
Processing File: book/1kings.txt
Indexed:  book/1kings.txt
Processing File: book/1peter.txt
Indexed:  book/1peter.txt
Processing File: book/1samuel.txt
Indexed:  book/1samuel.txt
Processing File: book/1thess.txt
Indexed:  book/1thess.txt
Processing File: book/1timothy.txt
Indexed:  book/1timothy.txt
Processing File: book/2chron.txt
Indexed:  book/2chron.txt
Processing File: book/2corinth.txt
Indexed:  book/2corinth.txt
Processing File: book/2john.txt
Indexed:  book/2john.txt
Processing File: book/2kings.txt
Indexed:  book/2kings.txt
Processing File: book/2peter.txt
Indexed:  book/2peter.txt
Processing File: book/2samuel.txt
Indexed:  book/2samuel.txt
Processing File: book/2thess.txt
Indexed:  book/2thess.txt
Processing File: book/2timothy.txt
Indexed:  book/2timothy.txt
Pr

--- 
<a id='search_me' ></a>

## Executing Queries

Read: 
  http://whoosh.readthedocs.io/en/latest/searching.html

In [6]:
from whoosh.qparser import QueryParser

qp = QueryParser("content", schema=ix.schema)
q = qp.parse(u"love")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit["filename"])

book/luke.txt
book/john.txt
book/1john.txt
book/1john.txt
book/1john.txt
book/1john.txt
book/john.txt
book/1john.txt
book/john.txt
book/malachi.txt


By default, the results contain at most the first 10 matching documents. To get more results, pass the `limit` keyword to `search()`, for example `s.search(q, limit=20)` to return at most 20 results.

If you want all results, use `limit=None`. However, setting a limit whenever possible makes searches faster because Whoosh doesn't need to examine and score every document.

-----
<a id='Scoring' ></a>

## Scoring

By this point you should be familiar with all the code above. The code cell above ranks its results with Whoosh's default scoring. In the coming cells, we will choose the scoring criterion explicitly when searching the index.


Normally the list of result documents is sorted by score. The whoosh.scoring module contains implementations of various scoring algorithms. The default is [BM25F](https://en.wikipedia.org/wiki/Okapi_BM25). You can set the scoring object to use when you create the searcher using the weighting keyword argument: 

```
from whoosh import scoring

with myindex.searcher(weighting=scoring.TF_IDF()) as s:
    ...
```
    
    
A weighting model is a WeightingModel subclass with a scorer() method that produces a “scorer” instance. This instance has a method that takes the current matcher and returns a floating point score.



### TF-IDF

So why do we have to score the terms? Previously, we simply used the number of times a token (i.e., a word, or more generally an n-gram) occurs in a document to classify the document. Even with the removal of stop words, however, this can still overemphasize tokens that occur across many documents (e.g., names or general concepts). An alternative technique that often provides robust improvements in classification accuracy is to use the frequency of token occurrence, normalized by the frequency with which the token occurs in all documents. In this manner, we give higher weight in the classification process to tokens that are more strongly tied to a particular label.

Formally, this concept is known as [term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf–idf) (or tf-idf). We will use this scoring method and compare its search results with those from the default scoring.
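
To make the weighting concrete, here is a small sketch that computes a textbook tf-idf score (raw term count times `log(N / df)`) over three made-up sentences. The toy documents and the function name `tf_idf` are assumptions for illustration. Whoosh's `TF_IDF` weighting may use a slightly different idf formula, so these numbers are not expected to match Whoosh's scores.

```python
import math

# Toy corpus (made up for illustration); each string plays the role of one document
docs = [
    "love one another as i have loved you",
    "love not the world",
    "the lord is my shepherd",
]
tokenized = [d.split() for d in docs]

def tf_idf(term, doc_tokens, all_docs):
    """Textbook tf-idf: raw count in the document times log(N / df)."""
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in all_docs if term in tokens)
    if df == 0:
        return 0.0  # term does not occur anywhere in the corpus
    return tf * math.log(len(all_docs) / df)

# "love" occurs in 2 of the 3 documents and "shepherd" in only 1,
# so a single "shepherd" gets a higher weight than a single "love"
love_score = tf_idf("love", tokenized[0], tokenized)
shepherd_score = tf_idf("shepherd", tokenized[2], tokenized)
```

Note that this sketch does no stemming, so "loved" is not counted as an occurrence of "love", unlike the `StemmingAnalyzer` used in the schema.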

In the code cell below, documents with a better TF-IDF score appear higher in the search results list. Compare these results with those of the earlier search cell, which used the default scoring. Read the references below to understand what TF-IDF is about and how it is applied in Whoosh.
 

-----

Reference: 

- [Scoring and sorting](http://whoosh.readthedocs.io/en/latest/searching.html#scoring-and-sorting)
- [TF-IDF](http://www.tfidf.com/)

In [7]:
from whoosh.qparser import QueryParser
from whoosh import scoring

qp = QueryParser("content", schema=ix.schema)
q = qp.parse(u"love")

with ix.searcher(weighting=scoring.TF_IDF()) as s:
    results = s.search(q)
    for hit in results:
        print(hit["filename"])

book/luke.txt
book/1john.txt
book/1john.txt
book/1john.txt
book/1john.txt
book/1samuel.txt
book/hosea.txt
book/john.txt
book/john.txt
book/malachi.txt


You can observe that **"1samuel.txt"** and **"hosea.txt"** have made it into the top 10, while **john.txt**, which held positions 2, 7 and 9 under the default scoring, now appears only at positions 8 and 9 because the ranking is based on TF-IDF scores.


----

### Filtering results

You can use the filter keyword argument in search() to specify a set of documents to permit in the results. The argument can be a whoosh.query.Query object, a whoosh.searching.Results object, or a set-like object containing document numbers. The searcher caches filters so if for example you use the same query filter with a searcher multiple times, the additional searches will be faster because the searcher will cache the results of running the filter query. You can also specify a mask keyword argument to specify a set of documents that are not permitted in the results. 
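
Conceptually, `filter` and `mask` act like set operations on document numbers: a hit is kept only if it is in the allowed set and not in the masked set. The sketch below mimics that behaviour with plain Python sets. The document numbers and the helper `apply_filter_and_mask` are made up for illustration, and this is not how Whoosh implements it internally.

```python
def apply_filter_and_mask(hits, allowed=None, masked=None):
    """Keep hits that are in `allowed` (if given) and not in `masked` (if given).

    Hypothetical helper: `hits` is a ranked list of document numbers.
    """
    kept = []
    for docnum in hits:
        if allowed is not None and docnum not in allowed:
            continue  # not permitted by the filter
        if masked is not None and docnum in masked:
            continue  # explicitly excluded by the mask
        kept.append(docnum)
    return kept

ranked_hits = [7, 3, 12, 5, 9]   # made-up document numbers, best first
allowed_docs = {3, 5, 7, 9}      # e.g. documents that come from one file
masked_docs = {5}                # e.g. documents containing an unwanted term

result = apply_filter_and_mask(ranked_hits, allowed_docs, masked_docs)
```

The ranking order of the surviving hits is unchanged; filtering and masking only remove entries.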

Let's first look up documents in which "hate" appears.

----

In [8]:
from whoosh.qparser import QueryParser
from whoosh import scoring

qp = QueryParser("content", schema=ix.schema)
q = qp.parse(u"hate")

with ix.searcher(weighting=scoring.TF_IDF()) as s:
    results = s.search(q)
    for hit in results:
        print(hit["filename"])

book/deut.txt
book/john.txt
book/2samuel.txt
book/proverbs.txt
book/psalms.txt
book/titus.txt
book/1john.txt
book/1kings.txt
book/2chron.txt
book/2chron.txt


In the code cell below, we use the `filter` argument to allow only hits from `book/john.txt` and the `mask` argument to exclude any document whose content contains "hate". In the results below, every hit comes from john.txt and none of them contains "hate".

In [9]:
from whoosh.query import Term

with ix.searcher(weighting=scoring.TF_IDF()) as s:
    qp = QueryParser("content", ix.schema)
    user_q = qp.parse(u"love")

    # Only show documents that come from book/john.txt
    allow_q = Term("filename", "book/john.txt")
    # Don't show any documents where the "content" field contains "hate"
    restrict_q = Term("content","hate")

    results = s.search(user_q, mask=restrict_q, filter=allow_q)
    for hit in results:
        print(hit["filename"], hit["content"])

book/john.txt 13:34: A new commandment I give unto you, That ye love one another; as I have loved you, that ye also love one another.
book/john.txt 15:9: As the Father hath loved me, so have I loved you: continue ye in my love.
book/john.txt 13:1: Now before the feast of the passover, when Jesus knew that his hour was come that he should depart out of this world unto the Father, having loved his own which were in the world, he loved them unto the end.
book/john.txt 14:21: He that hath my commandments, and keepeth them, he it is that loveth me: and he that loveth me shall be loved of my Father, and I will love him, and will manifest myself to him.
book/john.txt 14:23: Jesus answered and said unto him, If a man love me, he will keep my words: and my Father will love him, and we will come unto him, and make our abode with him.
book/john.txt 15:10: If ye keep my commandments, ye shall abide in my love; even as I have kept my Father's commandments, and abide in his love.
book/john.txt 15:12

-----

Let's put our results into a pandas dataframe.

In [10]:
from whoosh.searching import Hit 
import numpy as np
from IPython.display import display
import pandas as pd

with ix.searcher(weighting=scoring.TF_IDF()) as s:
    qp = QueryParser("content", ix.schema)
    user_q = qp.parse(u"love")
    
    results = s.search(user_q)
    print("Total no of matches: ",len(results))
    
    rank=[]
    docnum=[]
    score=[]
    filenames=[]
    lines=[]
    line_num=[]
    
    for i in np.arange(0,10):
        rank.append(results[i].rank)
        docnum.append(results[i].docnum)
        score.append(results[i].score)
        filenames.append(results[i]['filename'])
        line_num.append(results[i]['line_num'])
        lines.append(results[i]['content'])
       
    df = pd.DataFrame({'filename' : filenames, 'line_num' : line_num, 'line' : lines, 'docnum' : docnum, \
                            'score' : score, 'rank' : rank})
    display(df)

Total no of matches:  363


| | docnum | filename | line | line_num | rank | score |
|---|---|---|---|---|---|---|
| 0 | 21820 | book/luke.txt | 6:32: For if ye love them which love you, what... | 287 | 0 | 21.811635 |
| 1 | 1408 | book/1john.txt | 2:15: Love not the world, neither the things ... | 27 | 1 | 16.358726 |
| 2 | 1456 | book/1john.txt | 4:10: Herein is love, not that we loved God, ... | 75 | 2 | 16.358726 |
| 3 | 1462 | book/1john.txt | 4:16: And we have known and believed the love... | 81 | 3 | 16.358726 |
| 4 | 1464 | book/1john.txt | 4:18: There is no fear in love; but perfect l... | 83 | 4 | 16.358726 |
| 5 | 2949 | book/1samuel.txt | 20:17: And Jonathan caused David to swear agai... | 537 | 5 | 16.358726 |
| 6 | 14192 | book/hosea.txt | 3:1: Then said the LORD unto me, Go yet, love ... | 37 | 6 | 16.358726 |
| 7 | 18898 | book/john.txt | 13:34: A new commandment I give unto you, That... | 622 | 7 | 16.358726 |
| 8 | 18942 | book/john.txt | 15:9: As the Father hath loved me, so have I l... | 666 | 8 | 16.358726 |
| 9 | 22690 | book/malachi.txt | 1:2: I have loved you, saith the LORD. Yet ye... | 4 | 9 | 16.358726 |


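The `line_num` above is the line number recorded by `loadFile`. Because `line_no` starts at 1 and is incremented before each document is added, the stored value is one greater than the line's 1-based position in the text file. `docnum` is the document's number within the whole index we have created.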