# This exercise is to use whoosh to build a search engine 
# over your own DSA Notebooks

Have you ever had the thought:
"I know we did this before! Where did I see that in the course materials?"

This exercise is to build a technical solution to aid in answering that question.

**NOTE:** This is a little more like a practice, but it is the exercise for the week.

## Here are the steps 
### I) Build (conceptual):
  1. Crawl through your home directory and find all notebooks (`.ipynb`)
  2. Extract the visible text from the notebooks
  3. Use Whoosh to Index
  
### II) Query:
  1. Open the index
  1. Query

--- 
## Preliminaries
### Parsing Visible Text from a notebook

In [43]:
import json

# Test on myself, like all evil scientists
filename = './DSA_Notebook_Search_Engine.ipynb'

# Use the JSON library to get it in a DOM-like structure
# This is similar to using BeautifulSoup on HTML/KML/XML
# The with-block closes the file as soon as it has been parsed
with open(filename) as f:
    file_data = json.load(f)

# File data is now a map; recall that JSON is a combination of dictionaries and lists
cells = file_data.get('cells')

print("Dumping {} Non-output Cells from {}".format(len(cells), filename))

# Count the Cells
cno = 1

# for each cell in the notebook
for c in cells:
    
    #extract and test the cell type
    cell_type = c['cell_type']
    if ('code'==cell_type or 'markdown'==cell_type or 'raw'==cell_type ):
        print("# -------------{}----------------".format(cno))
        print("**********************JACKY: CELL TYPE = ", cell_type)
        # run the source into lines, it is actually a list of strings/lines
        source = c['source']
        for l in source:
            print(l.strip('\n'))
        cno += 1


Dumping 30 Non-output Cells from ./DSA_Notebook_Search_Engine.ipynb
# -------------1----------------
**********************JACKY: CELL TYPE =  markdown
# This exercise is to use whoosh to build a search engine 
# over your own DSA Notebooks

Have you ever had the thought:
"I know we did this before! Where did I see that in the course materials?"

This exercise is to build a technical solution to aid in answering that question.

**NOTE:** This is a little more like a practice, but it is the exercise for the week.

## Here are the steps 
### I) Build (conceptual):
  1. Crawl through your home directory and find all notebooks (`.ipynb`)
  2. Extract the visible text from the notebooks
  3. Use Whoosh to Index
  
### II) Query:
  1. Open the index
  1. Query
# -------------2----------------
**********************JACKY: CELL TYPE =  markdown
--- 
## Preliminaries
### Parsing Visible Text from a notebook
# -------------3----------------
**********************JACKY: CELL TYPE =  code
import s

If you have followed the pattern of the way Dr. Scott builds up code... 

## a) Time to create a function!

### Your function should return a list of cells, each cell having all 
### lines concatenated into one long `string` of data.

The comment lines, `# TODO ... `, will define steps for you to complete.
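
As a rough illustration of the target shape, here is a minimal sketch that works on a tiny hand-made notebook dictionary instead of a file on disk. The `toy_nb` data and the `VISIBLE_TYPES` name are made up for this example. It shows that each cell's `source` is a list of line strings, and that joining them keeps the embedded newlines.

```python
# A tiny, hand-made stand-in for the JSON structure of a notebook.
# (Hypothetical data for illustration only; real notebooks carry more keys.)
toy_nb = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some text"]},
        {"cell_type": "code", "source": ["import json\n", "print('hi')"]},
    ]
}

# The cell types treated as "visible" in this exercise.
VISIBLE_TYPES = ("code", "markdown", "raw")

cell_texts = []
for cell in toy_nb.get("cells") or []:
    if cell["cell_type"] in VISIBLE_TYPES:
        # "".join concatenates the lines without stripping the newlines.
        cell_texts.append("".join(cell["source"]))

print(cell_texts)
```

Each entry of `cell_texts` is one long string per cell, which is the shape the later indexing steps expect.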

In [44]:
def visibleTextFromNB(filename):
    '''
    # TODO Describe this function's purpose and return result
    
    Pull all the non-output visible cells (code, markdown, raw)
    from a Jupyter notebook and concatenate the lines of each
    cell into one block of text.
    
    Returns : a list of strings, one per visible cell
    '''
    #####################################
    # TODO: Parse file, pull cells
    #####################################
    
    with open(filename) as f:
        file_data = json.load(f)
    cells = file_data.get('cells')
    
    cell_list = []
    
    # Check before looping: some files have no top-level 'cells' key
    if cells is None:
        return cell_list
    
    for c in cells:
        cell_type = c['cell_type']
        if cell_type in ('code', 'markdown', 'raw'):
            # source is a list of lines; join them and keep the newlines
            cell_list.append("".join(c["source"]))
    
    # return the list
    return cell_list

#End of function: visibleTextFromNB 

## Test your function:

In [45]:
#################################
#        NO EDIT CELL
#################################

# Use the lab notebook
filename = '../labs/Text_Search_TFIDF.ipynb'
cells = visibleTextFromNB(filename)

# Print the begin and end
print(cells[0])
print(cells[1])
print('...')
print(cells[len(cells)-1])

# Building and Loading Text Search in Python Whoosh using TFIDF


## OUTLINE

 1. [Task at hand](#task)
 1. [Buiding our Whoosh Schema](#build_it)
 1. [Loading Data](#load_it)
 1. [Scoring](#Scoring)
 1. [Executing Queries, Google-lite...very very lite](#TFIFD) 
--- 
<a id='task' ></a>

## Task at hand

For this lab, we are going to revist the IR_with_Python_Whoosh lab in module 5 which walks us through the process of creating full text search capability within Python. In addition to that we are going to incorporate a scoring technique called TFIDF for ranking documents based on TFIDF scores of the terms occuring in the documents. 

We will walk through the process to build the search engine in Python using whoosh. We will compare the serach results with and without TFIDF method.
...
The line_num above is the actual line number in the text file. docnum should be the index number in the whole indexes we have created.


#### <span style="background:yellow">Expected Output</span>

```
# Building and Loading Text Search in Python Whoosh using TFIDF


## OUTLINE

 1. [Task at hand](#task)
 1. [Buiding our Whoosh Schema](#build_it)
 1. [Loading Data](#load_it)
 1. [Scoring](#Scoring)
 1. [Executing Queries, Google-lite...very very lite](#TFIFD) 
--- 
<a id='task' ></a>

## Task at hand

For this lab, we are going to revist the IR_with_Python_Whoosh lab in module 5 which walks us through the process of creating full text search capability within Python. In addition to that we are going to incorporate a scoring technique called TFIDF for ranking documents based on TFIDF scores of the terms occuring in the documents. 

We will walk through the process to build the search engine in Python using whoosh. We will compare the serach results with and without TFIDF method.
...
The line_num above is the actual line number in the text file. docnum should be the index number in the whole indexes we have created.
```

# b) Create a _draft_ function to walk through a directory and find notebooks (.ipynb)

Collect the directory-walking code from the Module 5 examples.

#### Note, the starting folder will be `'~'`, the alias for your home directory

#### Note, do not process files yet, just construct the function and print the notebook path names
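
One point worth knowing before writing the draft: `os.walk` descends into subfolders on its own, and editing its `dirs` list in place is the standard way to keep it out of folders you want to skip. The sketch below tries this on a throwaway temporary tree. The `find_notebooks` helper, the `SKIP_DIRS` set, and the file names are all invented for the example.

```python
import os
import tempfile

# Folder names to prune; this particular set is an assumption for the example.
SKIP_DIRS = {".config", ".cache", ".ssh"}

def find_notebooks(folder):
    """Return a sorted list of notebook paths under folder (hypothetical helper)."""
    found = []
    for root, dirs, files in os.walk(folder):
        # Assigning to dirs[:] edits the list in place, so os.walk
        # never descends into the pruned folders.
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        for name in files:
            if name.endswith(".ipynb") and not name.endswith("-checkpoint.ipynb"):
                found.append(os.path.join(root, name))
    return sorted(found)

# Build a small throwaway tree to try it on.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "labs"))
    os.makedirs(os.path.join(tmp, ".cache"))
    for rel in ("labs/a.ipynb", "labs/a-checkpoint.ipynb",
                ".cache/hidden.ipynb", "notes.txt"):
        with open(os.path.join(tmp, *rel.split("/")), "w") as f:
            f.write("{}")
    relative = [os.path.relpath(p, tmp) for p in find_notebooks(tmp)]

print(relative)
```

Only the notebook under `labs` survives: the checkpoint copy, the text file, and everything under `.cache` are filtered out.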

In [46]:
import os, os.path

def walkFolder(folder):
    '''
    Walk a folder and all of its subfolders.
    Prints the path of each notebook that is found.
    '''
    # print('Processing folder: ', folder)
    
    # Folders that should not be searched
    skip_dirs = {".config", ".cache", ".ssh", ".local", ".git", ".ipython"}
    
    #####################################
    # TODO: walk through the filesystem starting at folder
    # HINT: os.walk
    #####################################
    
    for root, dirs, files in os.walk(folder):
        # print("root = ", root)
        
        #####################################
        # TODO: Process Files
        # HINT: skips file.endswith("-checkpoint.ipynb")
        #####################################
        
        for file in files:
            if (file.endswith(".ipynb")
                    and not file.endswith("-checkpoint.ipynb")):
                filename = os.path.join(root, file)
                print("Found Notebook: ", filename)
           
        #####################################
        # TODO: Recurse into subfolders
        # HINT: Skip these
        #           .config
        #           .cache
        #           .ssh
        #           ... etc.         
        #####################################
        
        # os.walk already descends into subfolders on its own, so no
        # recursive call is needed. Pruning dirs in place keeps it
        # out of the folders listed in skip_dirs.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
    

############# END for walkFolder

## Test your function:

In [47]:
#################################
#        NO EDIT CELL
#################################

# Use your top-level home directory
initial_root = os.path.expanduser('~') 

walkFolder(initial_root)


Found Notebook:  /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/module1.ipynb
Found Notebook:  /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/labs/Module1_Python_Intro_DataScience_JACKY.ipynb
Found Notebook:  /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/labs/intro_data_science_python.ipynb
Found Notebook:  /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/labs/intro_data_science_r.ipynb
Found Notebook:  /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/labs/Module1_R_Intro_DataScience_JACKY.ipynb
Found Notebook:  /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/practices/data_science_practice_python.ipynb
Found Notebook:  /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/practices/data_science_practice_r.ipynb
Found Notebook:  /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/answers/data_science_practice_python.ipynb
Found Notebook:  /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/answers/data_science_practice_r_answers.ipynb
Found Notebook:  /dsa/home/lz6m7/sp17dsa7600_

#### <span style="background:yellow">Expected Output</span>

   * Expected output similar to

```
Found Notebook: /dsa/home/scottgs/Testing Python.ipynb
Found Notebook: /dsa/home/scottgs/Testing R.ipynb
...
Found Notebook: /dsa/home/scottgs/sp17DMIR/modules/module5/practices/Text_Preprocessing.ipynb
Found Notebook: /dsa/home/scottgs/sp17DMIR/modules/module5/exercises/Exercises_Python_text_search.ipynb
Found Notebook: /dsa/home/scottgs/sp17DMIR/modules/module5/answers/Text_Preprocessing_Answers.ipynb
```

# c) Create the schema for whoosh and initialize 
# the index in `notebooks` folder

Recall, in the work above you pulled each visible cell from the notebook into a list of cells.

So, we can store the fields that we want to see in our results:
  * Filename
  * Cell No.   
  
Then, we will also index the cell content.
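
To make the stored-versus-indexed distinction concrete, here is a conceptual sketch in plain Python. It is not how Whoosh works internally. The `postings` dictionary plays the role of the indexed content, while the `(filename, cell_no)` pairs are the stored values handed back for a hit. All names and sample data are invented for illustration.

```python
from collections import defaultdict

# Made-up (filename, cell_no, content) records standing in for notebook cells.
docs = [
    ("a.ipynb", 1, "Build the index with Whoosh"),
    ("a.ipynb", 2, "Query the index"),
    ("b.ipynb", 1, "Cluster the data"),
]

# Map each lowercased term to the (filename, cell_no) pairs that contain it.
postings = defaultdict(list)
for filename, cell_no, content in docs:
    for term in set(content.lower().split()):
        postings[term].append((filename, cell_no))

# A lookup returns only the stored values, not the cell text itself.
hits = sorted(postings["index"])
print(hits)
```

The lookup for `index` returns both cells of `a.ipynb`, which is the kind of answer ("cell N of notebook F") the finished search engine should give.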

In [48]:
import os

from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer
from whoosh import index


#####################################
# TODO: Create the schema
#####################################
schema = Schema(filename=ID(stored=True),
                cell_no=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer(), stored=True)
               )

#####################################
# TODO: Create the index and initialize a `writer`
#####################################
# create_in requires the target folder to exist already
index_dir = "indexes"
os.makedirs(index_dir, exist_ok=True)
ix = index.create_in(index_dir, schema)
writer = ix.writer()



In [49]:
#################################
#        NO EDIT CELL
#################################

#  REPEATED for better review

def visibleTextFromNB(filename):
    '''
    # TODO Describe this function's purpose and return result
    This function pulls all the non-output visible cells from
    a JupyterNotebook and concatenates it all into a block of
    text.
    Returns : a list of the cells
    '''
    #####################################
    # TODO: Parse file, pull cells
    #####################################

    file_data = json.load(open(filename))

    # File data is now a map, recall a JSON format is a combo of dictionaries and lists
    cells = file_data.get('cells')

    #####################################
    # TODO: Append cells into a list of cells
    # HINT: Do not strip the newline, \n
    #####################################

    cell_list = []
    if cells == None:
        return cell_list
    
    # for each cell in the notebook
    for c in cells:

        #extract and test the cell type
        cell_type = c['cell_type']
        if ('code'==cell_type or 'markdown'==cell_type or 'raw'==cell_type ):
            cell_text = ""
            # run the source into lines, it is actually a list of strings/lines
            source = c['source']
            for l in source:
                cell_text += l
            cell_list.append(cell_text)

            
    #####################################
    # TODO: Append cells into a list of cells
    # HINT: Do not strip the newline, \n
    #####################################

    # return the list
    return cell_list

#End of function: visibleTextFromNB 

## d) Write function to load file into the index
  * See [Module 5 Lab](../../module5/labs/IR_with_Python_Whoosh.ipynb#This-should-look-familiar!)

In [50]:
def loadFile(writer, fname):
    '''
    Read the visible cells of a notebook and add each one
    to the index as its own document.
    '''
    #####################################
    # TODO: Get cell text from function
    #####################################
    cells = visibleTextFromNB(fname)
    
    # Number the cells from 1; cell_no is an ID field, so store it as a string
    for cellnum, c in enumerate(cells, start=1):
        print("Starting file:", fname , " -- cellnum", cellnum)
        print("")
        
        writer.add_document(filename=fname,
                            cell_no=str(cellnum),
                            content=c)
    
    print("")
    print("Indexed: ", fname)
    print("==================")

# END of function

# e) Adapt the folder walking function to invoke file load

### HINT: add the `writer` as a parameter
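
Passing the `writer` in as a parameter also makes the loading code easy to dry-run without touching a real index. The sketch below assumes the only writer methods the pipeline needs are `add_document` and `commit`. The `ListWriter` class and the `load_cells` helper are hypothetical stand-ins written for this example, not part of Whoosh.

```python
class ListWriter:
    """Hypothetical stand-in that records documents instead of indexing them."""

    def __init__(self):
        self.docs = []
        self.committed = False

    def add_document(self, **fields):
        # Keep the keyword arguments exactly as they were passed in.
        self.docs.append(fields)

    def commit(self):
        self.committed = True


def load_cells(writer, fname, cells):
    """Add one document per cell, numbering the cells from 1."""
    for cell_no, text in enumerate(cells, start=1):
        writer.add_document(filename=fname, cell_no=str(cell_no), content=text)


dry_run = ListWriter()
load_cells(dry_run, "demo.ipynb", ["# Title", "print('hi')"])
dry_run.commit()
print(dry_run.docs)
```

Because the function only talks to whatever object it is handed, the same code can later be given a real index writer with no changes.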

In [51]:
import os, os.path

#####################################
# TODO: Adapt the parameters
#####################################
def walkFolder(writer, folder):
    '''
    Walk a folder and all of its subfolders.
    Loads each notebook that is found into the index via writer.
    '''
    # Folders that should not be searched
    skip_dirs = {".config", ".cache", ".ssh", ".local", ".git", ".ipython"}
    
    for root, dirs, files in os.walk(folder):
        # print("root = ", root)
        
        #####################################
        # TODO: Process Files
        # HINT: skips file.endswith("-checkpoint.ipynb")
        #####################################
        
        for file in files:
            if (file.endswith(".ipynb")
                    and not file.endswith("-checkpoint.ipynb")):
                filename = os.path.join(root, file)
                loadFile(writer, filename)
                print("Found Notebook: ", filename)
                print("")
           
        #####################################
        # TODO: Recurse into subfolders
        # HINT: Skip these
        #           .config
        #           .cache
        #           .ssh
        #           ... etc.         
        #####################################
        
        # os.walk already descends into subfolders on its own, so no
        # recursive call is needed. Pruning dirs in place keeps it
        # out of the folders listed in skip_dirs.
        dirs[:] = [d for d in dirs if d not in skip_dirs]

############# END for walkFolder

# f) Run the index build 

In [52]:
#################################
#        NO EDIT CELL
#################################

# Use your top-level home directory
initial_root = os.path.expanduser('~') 
walkFolder(writer,initial_root)


# Commit changes
writer.commit() # save changes

Starting file: /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/module1.ipynb  -- cellnum 1


Indexed:  /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/module1.ipynb
Found Notebook:  /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/module1.ipynb

Starting file: /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/labs/Module1_Python_Intro_DataScience_JACKY.ipynb  -- cellnum 1

Starting file: /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/labs/Module1_Python_Intro_DataScience_JACKY.ipynb  -- cellnum 2

Starting file: /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/labs/Module1_Python_Intro_DataScience_JACKY.ipynb  -- cellnum 3

Starting file: /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/labs/Module1_Python_Intro_DataScience_JACKY.ipynb  -- cellnum 4

Starting file: /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/labs/Module1_Python_Intro_DataScience_JACKY.ipynb  -- cellnum 5

Starting file: /dsa/home/lz6m7/sp17dsa7600_lz6m7/modules/module1/labs/Module1_Python_Intro_DataScie

In [53]:
# DO NOT RUN UNLESS YOU NEED A RESET

# from whoosh import writing
# writer = ix.writer()  # a committed writer is closed, so open a new one first
# writer.commit(mergetype=writing.CLEAR)

# g) Execute three queries
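
The queries below rank hits with `scoring.TF_IDF()`. As a reminder of the idea behind that ranking, here is a toy calculation using the textbook formula (term frequency times the log of inverse document frequency). Whoosh's own implementation may differ in its details, and the `toy_cells` data and the `tf_idf` helper are invented for this sketch.

```python
import math
from collections import Counter

# Made-up cells standing in for indexed documents.
toy_cells = {
    "cell 1": "whoosh query parser query",
    "cell 2": "cluster the data",
    "cell 3": "parse the query",
}

tokenized = {name: text.split() for name, text in toy_cells.items()}
n_docs = len(tokenized)

def tf_idf(term, tokens):
    """Textbook TF-IDF: raw term count times log(N / document frequency)."""
    tf = Counter(tokens)[term]
    df = sum(1 for toks in tokenized.values() if term in toks)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)

scores = {name: tf_idf("query", toks) for name, toks in tokenized.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The cell that mentions `query` twice ranks above the cell that mentions it once, and the cell that never mentions it scores zero.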

## 1

In [54]:
from whoosh.qparser import QueryParser
from whoosh import scoring

# Get the query string (input() already returns unicode text in Python 3)
qstr = input("Input a query: ")

print("searching for ",qstr)

####################################
# TODO: Build query parser and parse query
####################################

with ix.searcher(weighting = scoring.TF_IDF()) as s:
    qp = QueryParser("content", ix.schema)
    user_q = qp.parse(qstr)

    ####################################
    # TODO: Search the content field
    ####################################
    
    results = s.search(user_q)
    for hit in results:
        print("cell {} of {}".format(hit["cell_no"], hit["filename"]))





Input a query: QueryParser
searching for  QueryParser
cell 26 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module6/exercises/DSA_Notebook_Search_Engine.ipynb
cell 26 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module6/exercises/DSA_Notebook_Search_Engine-Copy1.ipynb
cell 14 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module5/exercises/Exercises_Python_text_search.ipynb
cell 11 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module5/labs/IR_with_Python_Whoosh.ipynb
cell 9 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module6/labs/Text_Search_TFIDF.ipynb
cell 14 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module6/labs/Text_Search_TFIDF.ipynb
cell 17 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module6/labs/Text_Search_TFIDF.ipynb
cell 8 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module6/practices/Text_Search_TFIDF_Practice.ipynb
cell 9 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module6/practices/Text_Search_TFIDF_Practice.ipynb
cell 25 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module6/exercises/DSA_Notebook_Sea

#### <span style="background:yellow">Expected Output</span>

  *  Example output from searching for QueryParser

```
Input a qeury: QueryParser
searching for  QueryParser
content:queryparser
Cell 11 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module5/labs/IR_with_Python_Whoosh.ipynb'
Cell 27 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module5/answers/ParsingWikipediaLifeformPage.ipynb'
Cell 25 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/exercises/DSA_Notebook_Search_Engine.ipynb'
Cell 9 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/labs/Text_Search_TFIDF.ipynb'
Cell 14 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/labs/Text_Search_TFIDF.ipynb'
Cell 17 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/labs/Text_Search_TFIDF.ipynb'
Cell 45 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/labs/Topic_modelling.ipynb'
Cell 7 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/answers/TFIDF_Scoring_Practice.ipynb'
Cell 8 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/answers/TFIDF_Scoring_Practice.ipynb'
Cell 11 of Notebook '/dsa/home/scottgs/sp17DMIR/modules/module5/labs/IR_with_Python_Whoosh.ipynb'
```

---
## 2

In [55]:
from whoosh.qparser import QueryParser
from whoosh import scoring

# Get the query string (input() already returns unicode text in Python 3)
qstr = input("Input a query: ")

print("searching for ",qstr)

####################################
# TODO: Build query parser and parse query
####################################

with ix.searcher(weighting = scoring.TF_IDF()) as s:
    qp = QueryParser("content", ix.schema)
    user_q = qp.parse(qstr)

    ####################################
    # TODO: Search the content field
    ####################################
    
    results = s.search(user_q)
    for hit in results:
        print("cell {} of {}".format(hit["cell_no"], hit["filename"]))





Input a query: Clusters
searching for  Clusters
cell 1 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/DBSCAN_Clustering.ipynb
cell 2 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/KMeans_Clustering.ipynb
cell 10 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/Hierarchical_Clustering.ipynb
cell 15 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/DBSCAN_Clustering.ipynb
cell 21 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/DBSCAN_Clustering.ipynb
cell 36 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/Hierarchical_Clustering.ipynb
cell 32 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Practices/Practice_DBSCAN.ipynb
cell 27 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Answers/Practice_DBSCAN.ipynb
cell 12 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/Hierarchical_Clustering.ipynb
cell 23 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/KMeans_Clustering.ipynb


## 3

In [56]:
from whoosh.qparser import QueryParser
from whoosh import scoring

# Get the query string (input() already returns unicode text in Python 3)
qstr = input("Input a query: ")

print("searching for ",qstr)

####################################
# TODO: Build query parser and parse query
####################################

with ix.searcher(weighting = scoring.TF_IDF()) as s:
    qp = QueryParser("content", ix.schema)
    user_q = qp.parse(qstr)

    ####################################
    # TODO: Search the content field
    ####################################
    
    results = s.search(user_q)
    for hit in results:
        print("cell {} of {}".format(hit["cell_no"], hit["filename"]))





Input a query: Milk
searching for  Milk
cell 36 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/DBSCAN_Clustering.ipynb
cell 1 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module2/labs/Association_Rules_and_Frequent_Pattern_Mining.ipynb
cell 28 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/DBSCAN_Clustering.ipynb
cell 35 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/DBSCAN_Clustering.ipynb
cell 4 of /dsa/home/lz6m7/BC-Python_lz6m7/notebooks/L7-BeautifulSoup.ipynb
cell 4 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module2/labs/Visualizing_Association_Rules.ipynb
cell 29 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/DBSCAN_Clustering.ipynb
cell 43 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/DBSCAN_Clustering.ipynb
cell 44 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/DBSCAN_Clustering.ipynb
cell 47 of /dsa/home/lz6m7/sp17DMIR_lz6m7/modules/module3/Labs/DBSCAN_Clustering.ipynb


# SAVE YOUR NOTEBOOK