# Building and Loading Text Search in Python Whoosh using TF-IDF

For this practice, we will build full-text search capability in Python as we did in the lab, using TF-IDF scoring. 

This time, our data is in the folder **`/dsa/data/all_datasets/hp`**  - but no, 
this is not Hewlett Packard documentation. 
It is something much more enchanting!

Throughout the practice, reflection questions are asked. 
Take the time to answer them - consult the documentation for libraries and functions if needed, 
experiment with the code, and ask your classmates.


## 1. Building the Whoosh Schema

Import the necessary libraries and build a schema including filename, line_num and content.

In [1]:
# Add your code below 
# -----------------------


#TO DO: import the necessary libraries for this step
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

#TO DO: build the schema
schema = Schema(filename=ID(stored=True),
                line_num=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
               )

#### Reflection

 - Which libraries did you import and why?

 - Explain how you built the schema - did you use ID, TEXT, KEYWORD or STORED? If so, where and why? ([Documentation available here](http://whoosh.readthedocs.io/en/latest/schema.html))



----

## 2. Loading the Data

* In the first cell, import any libraries you need, create the index in the folder `hp_index` within the practices folder, and get a writer for the index.
* In the second cell, complete the `loadFile` function.
* In the third cell, process the folder and persist your changes.

In [2]:
# Add your code below 
# -----------------------

#TO DO: import the necessary libraries for this step
import os
from whoosh import index

#TO DO: Create the index
# Note, this clears the existing index in the directory
os.makedirs("hp_index", exist_ok=True)

ix = index.create_in("hp_index", schema)

#TO DO: Get a writer from the created index
writer = ix.writer()

In [3]:
# Complete code below 
# -----------------------

def loadFile(writer, fname):
    '''
    Read file contents, load into database.
    '''
    line_no = 1
    with open(fname, 'r', encoding="utf-8") as infile:
        # TODO: create indexes for each line in the input file
        for line in infile:
            line = line.strip("\n")
            writer.add_document(filename=fname, line_num=str(line_no), content=line)
            line_no += 1  # Increment after adding so the first line is numbered 1 (line_no starts at 1)
        #-------------------------------------------------------
        print("Indexed: ", fname)


def processFolder(writer,folder):
    print('Processing folder: ',folder)
    for root, dirs, files in os.walk(folder):
        # add a new line to separate folders in the output
        print("\nroot = ", root)
        # Process Files
        for file in files:
            if file.endswith(".txt"):
                filename = os.path.join(root, file)
                print('root:', root, '; file:', file, '; filename:', filename)
                print('Processing File:',filename)
                loadFile(writer,filename)
            else:
                print("Unhandled File")


In [4]:
# Add your code below 
# -----------------------

# TODO: process the folder and persist your changes 
processFolder(writer, '/dsa/data/all_datasets/hp')
    
writer.commit()

Processing folder:  /dsa/data/all_datasets/hp

root =  /dsa/data/all_datasets/hp
root: /dsa/data/all_datasets/hp ; file: CHAPTER 1.txt ; filename: /dsa/data/all_datasets/hp/CHAPTER 1.txt
Processing File: /dsa/data/all_datasets/hp/CHAPTER 1.txt
Indexed:  /dsa/data/all_datasets/hp/CHAPTER 1.txt
root: /dsa/data/all_datasets/hp ; file: CHAPTER 2.txt ; filename: /dsa/data/all_datasets/hp/CHAPTER 2.txt
Processing File: /dsa/data/all_datasets/hp/CHAPTER 2.txt
Indexed:  /dsa/data/all_datasets/hp/CHAPTER 2.txt
root: /dsa/data/all_datasets/hp ; file: CHAPTER 3.txt ; filename: /dsa/data/all_datasets/hp/CHAPTER 3.txt
Processing File: /dsa/data/all_datasets/hp/CHAPTER 3.txt
Indexed:  /dsa/data/all_datasets/hp/CHAPTER 3.txt
root: /dsa/data/all_datasets/hp ; file: CHAPTER 4.txt ; filename: /dsa/data/all_datasets/hp/CHAPTER 4.txt
Processing File: /dsa/data/all_datasets/hp/CHAPTER 4.txt
Indexed:  /dsa/data/all_datasets/hp/CHAPTER 4.txt
root: /dsa/data/all_datasets/hp ; file: CHAPTER 5.txt ; filename: /

#### Reflection

 - Which libraries did you import and why?
 - In loadFile, how did you get the line number for each line?
 - In loadFile, which code line adds an index for the processed line?
 - In processFolder, what does the following line do? Give an example.
```
filename = os.path.join(root, file)
```
 - What code line makes sure the index gets persisted? (How is it saved so it can be used?)
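
To explore the folder-walking and line-numbering questions without touching the real data, the following sketch runs the same `os.walk` / `os.path.join` pattern on a temporary folder. The file names, their contents, and the helper `numbered_lines` are hypothetical stand-ins; `enumerate(..., start=1)` is shown as one alternative to a manual counter, not as the expected answer.

```python
import os
import tempfile


def numbered_lines(path):
    """Return (line_number, text) pairs, numbering from 1 as loadFile does."""
    with open(path, "r", encoding="utf-8") as infile:
        return [(n, line.rstrip("\n")) for n, line in enumerate(infile, start=1)]


with tempfile.TemporaryDirectory() as folder:
    # Hypothetical stand-in files; the practice data lives elsewhere.
    with open(os.path.join(folder, "CHAPTER 1.txt"), "w", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")
    with open(os.path.join(folder, "notes.md"), "w", encoding="utf-8") as f:
        f.write("ignored\n")

    found = []
    for root, dirs, files in os.walk(folder):
        for file in sorted(files):
            if file.endswith(".txt"):
                # root is the directory being visited, file is a bare name;
                # joining them gives a path that open() can use.
                filename = os.path.join(root, file)
                found.append((os.path.relpath(filename, folder),
                              numbered_lines(filename)))

print(found)
```

The output log above shows the same joining step on the real data: root `/dsa/data/all_datasets/hp` and file `CHAPTER 1.txt` become `/dsa/data/all_datasets/hp/CHAPTER 1.txt`.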

----

## 3. Executing Queries
* In the first cell, import any libraries you need, and find the indexes of lines where the string 'Harry' appears. Display the top 10 hits.
* In the second cell, import any additional libraries you need, and find the indexes of lines where the string 'Harry' appears using TF-IDF as the scoring mechanism. Display the top 10 hits.
* In the third cell, import any additional libraries you need, and use a filter to list the indexes in chapter 6 corresponding to the search string 'Harry' using TF-IDF as the scoring mechanism. Display the top 10 hits.

In [8]:
# Add your code below 
# -----------------------

#TO DO: import the necessary libraries for this step
from whoosh.qparser import QueryParser

#TO DO: Find the indexes of lines where the string 'Harry' appears. 
qp = QueryParser("content", schema = ix.schema)
q = qp.parse("Harry")

#TO DO: display the top 10 hits
with ix.searcher() as s:
    results = s.search(q)
    for i in results:
        print(i['filename'], i['line_num'], i.score, i.rank)



/dsa/data/all_datasets/hp/CHAPTER 6.txt 707 4.662217912394149 0
/dsa/data/all_datasets/hp/CHAPTER 2.txt 395 4.411890906186834 1
/dsa/data/all_datasets/hp/CHAPTER 1.txt 96 4.235177587134104 2
/dsa/data/all_datasets/hp/CHAPTER 2.txt 44 4.235177587134104 3
/dsa/data/all_datasets/hp/CHAPTER 3.txt 17 4.235177587134104 4
/dsa/data/all_datasets/hp/CHAPTER 3.txt 338 4.235177587134104 5
/dsa/data/all_datasets/hp/CHAPTER 5.txt 915 4.235177587134104 6
/dsa/data/all_datasets/hp/CHAPTER 6.txt 348 4.235177587134104 7
/dsa/data/all_datasets/hp/CHAPTER 6.txt 423 4.235177587134104 8
/dsa/data/all_datasets/hp/CHAPTER 6.txt 755 4.235177587134104 9


In [11]:
# Add your code below 
# -----------------------

#TO DO: import the necessary libraries for this step
from whoosh.qparser import QueryParser
from whoosh import scoring

#TO DO: Find the indexes of lines where the string 'Harry' appears using TF_IDF as the scoring mechanism. 
qp = QueryParser("content", schema=ix.schema)
q = qp.parse("Harry")

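#TO DO: display the top 10 hits
# Note: BM25F is Whoosh's default ranking model (here with custom parameters), not TF-IDF.
# To score with TF-IDF as the task asks, pass scoring.TF_IDF() as the weighting instead.
w = scoring.BM25F(B=0.8, content_B=1.0, K1=1.5)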

with ix.searcher(weighting = w) as s:
    results = s.search(q)
    for i in results:
        print(i['filename'], i['line_num'], i.score, i.rank)



/dsa/data/all_datasets/hp/CHAPTER 1.txt 96 5.0434198813672255 0
/dsa/data/all_datasets/hp/CHAPTER 2.txt 44 5.0434198813672255 1
/dsa/data/all_datasets/hp/CHAPTER 3.txt 17 5.0434198813672255 2
/dsa/data/all_datasets/hp/CHAPTER 3.txt 338 5.0434198813672255 3
/dsa/data/all_datasets/hp/CHAPTER 5.txt 915 5.0434198813672255 4
/dsa/data/all_datasets/hp/CHAPTER 6.txt 348 5.0434198813672255 5
/dsa/data/all_datasets/hp/CHAPTER 6.txt 423 5.0434198813672255 6
/dsa/data/all_datasets/hp/CHAPTER 6.txt 707 5.0434198813672255 7
/dsa/data/all_datasets/hp/CHAPTER 6.txt 755 5.0434198813672255 8
/dsa/data/all_datasets/hp/CHAPTER 7.txt 306 5.0434198813672255 9


In [12]:
# Add your code below 
# -----------------------

#TO DO: import the necessary libraries for this step
from whoosh.qparser import QueryParser  # already imported above; repeated so this cell runs on its own
from whoosh import scoring
from whoosh.query import Term

#TO DO: Use a filter to list the indexes in chapter 6 corresponding to the search string 'Harry' 
# using TF_IDF as the scoring mechanism. 

with ix.searcher(weighting = scoring.TF_IDF()) as s:
    qp = QueryParser("content", ix.schema)
    user_q = qp.parse("Harry")
    
    allow_q = Term("filename", "/dsa/data/all_datasets/hp/CHAPTER 6.txt")

    #TO DO: display the top 10 hits
    results = s.search(user_q, filter = allow_q)
    for i in results:
        print(i['filename'], i['line_num'], i.score, i.rank)
    

/dsa/data/all_datasets/hp/CHAPTER 6.txt 401 6.305127613017597 0
/dsa/data/all_datasets/hp/CHAPTER 6.txt 707 6.305127613017597 1
/dsa/data/all_datasets/hp/CHAPTER 6.txt 808 6.305127613017597 2
/dsa/data/all_datasets/hp/CHAPTER 6.txt 5 3.1525638065087986 3
/dsa/data/all_datasets/hp/CHAPTER 6.txt 6 3.1525638065087986 4
/dsa/data/all_datasets/hp/CHAPTER 6.txt 7 3.1525638065087986 5
/dsa/data/all_datasets/hp/CHAPTER 6.txt 10 3.1525638065087986 6
/dsa/data/all_datasets/hp/CHAPTER 6.txt 13 3.1525638065087986 7
/dsa/data/all_datasets/hp/CHAPTER 6.txt 19 3.1525638065087986 8
/dsa/data/all_datasets/hp/CHAPTER 6.txt 40 3.1525638065087986 9


#### Reflection

 - Which libraries did you import and why?
 - What differences do you see in the results of the first two cells?
 - What do those differences mean?
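
When thinking about what the scores mean, it can help to compute a TF-IDF score by hand. The sketch below uses one simple variant, score = tf × idf with idf = log(N / (df + 1)) + 1; that exact smoothing is an assumption made for illustration, and Whoosh's internal formula may differ in detail. The toy lines and the function name `tf_idf_scores` are made up.

```python
import math
from collections import Counter


def tf_idf_scores(term, docs):
    """Score each document for `term` as tf * idf (one simple TF-IDF variant)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain the term at least once.
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    # Assumed smoothing; the same idf applies to every document.
    idf = math.log(n_docs / (doc_freq + 1)) + 1
    # Term frequency is the raw count; document length is not used at all.
    return [Counter(tokens)[term] * idf for tokens in tokenized]


# Hypothetical stand-ins for indexed lines.
lines = [
    "harry looked at harry in the mirror",
    "harry ran",
    "the owl hooted",
    "hagrid knocked",
    "a very long line that mentions harry only once among many other words",
]
scores = tf_idf_scores("harry", lines)
print(scores)
```

In this variant a line that contains the term twice scores exactly twice as much as a line that contains it once, and a short line scores the same as a long one. The third cell's output has the same shape: the top scores (6.305...) are exactly double the rest (3.152...). BM25F, by contrast, also normalizes by document length, which is one reason its ranking and scores can differ.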

----

# Save your notebook, then `File > Close and Halt`