# Exercise: Building and Loading Text Search in Python Whoosh

--- 
<a id='task' ></a>

## Task at hand


In this exercise, we walk through building a full-text search capability in Python that can be integrated into other analytical processes.

You previously worked with the _`book`_ data. In this exercise, we will work with some wiki data. 

--- 
<a id='build_it' ></a>

## Building our Whoosh Schema

Recall that the `book/` folder is a collection of text files, one per book chapter.

In Whoosh, we structure the retrieval system by defining a storage schema.

From the lab with the text files:
```
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
                )
```

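This defines each record as a `(filename, content)` pair: `filename` is stored and returned with search hits, while `content` is indexed with a stemming analyzer but not stored.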

For this exercise, we will be using a few Wikipedia pages for our data source.

### 1) For this exercise, you should look at a few of these web pages:

  * https://en.wikipedia.org/wiki/Nyctimantis
  * https://en.wikipedia.org/wiki/Osteocephalus
  * https://en.wikipedia.org/wiki/Osteopilus
  
Specifically, inspect the HTML source and find the infobox table:
```HTML
<table class="infobox biota" ... </table>
```



<img src="../images/table_inspect.png" height=400 width=600 />
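
To get a feel for the row structure before writing the real parser, here is a minimal sketch that uses only the standard library's `html.parser` rather than BeautifulSoup. The `TwoCellRows` class and the `sample` markup are hypothetical and heavily simplified; real Wikipedia infoboxes are more deeply nested, and this sketch reads rows from every table rather than filtering on the `infobox` class.

```python
from html.parser import HTMLParser

class TwoCellRows(HTMLParser):
    """Collect label -> value text from table rows that have two or more <td> cells."""

    def __init__(self):
        super().__init__()
        self.rows = {}
        self._cells = None   # cell texts of the row being read, or None outside a row
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_td = True

    def handle_data(self, data):
        # Only text inside a <td> is collected
        if self._in_td:
            self._cells[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            # Rows with fewer than two <td> cells (such as the <th> header row here) are skipped
            if len(self._cells) > 1:
                self.rows[self._cells[0]] = self._cells[1]
            self._cells = None

# Simplified, made-up markup in the shape of an infobox
sample = """
<table class="infobox biota">
<tr><th colspan="2">Scientific classification</th></tr>
<tr><td>Kingdom:</td><td><a href="#">Animalia</a></td></tr>
<tr><td>Family:</td><td><a href="#">Hylidae</a></td></tr>
</table>
"""

parser = TwoCellRows()
parser.feed(sample)
rows = parser.rows
print(rows)
```

The label/value pairing is the part that carries over to the exercise: each classification row contributes one searchable field.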



**Task: You need to extend the above schema definition to collect this frog table data when available.**

* `content` will be all of the visible text on the HTML page
* Table information such as kingdom, phylum, class, order, family, subfamily, and genus should be searchable

In [1]:
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer()),
                # Extend the schema definition to capture relevant table data
                kingdom=TEXT(stored=True),
                phylum=TEXT(stored=True),
                a_class=TEXT(stored=True),  # named a_class because class is a reserved word in Python
                order=TEXT(stored=True),
                family=TEXT(stored=True),
                subfamily=TEXT(stored=True),
                genus=TEXT(stored=True)
               )

--- 
<a id='load_it' ></a>

## Loading Data

For this exercise, we have created a small folder of a few Wikipedia pages under the `en.wikipedia.org/wiki` folder in the common datasets folder:


In [2]:
! ls /dsa/data/all_datasets/en.wikipedia.org/wiki

Acris.html	     Hylidae.html	   Plectrohyla.html
Anotheca.html	     Hylinae.html	   Pseudacris.html
Aparasphenodon.html  Hyloscirtus.html	   Pseudis.html
Aplastodiscus.html   Hypsiboas.html	   Ptychohyla.html
Argenteohyla.html    Isthmohyla.html	   Scarthyla.html
Bokermannohyla.html  Itapotihyla.html	   Scinax.html
Bromeliohyla.html    Lysapsus.html	   Smilisca.html
Charadrahyla.html    Megastomatohyla.html  Sphaenorhynchus.html
Corythomantis.html   Myersiohyla.html	   Tepuihyla.html
Dendropsophus.html   Nyctimantis.html	   Tlalocohyla.html
Duellmanohyla.html   Osteocephalus.html    Trachycephalus.html
Ecnomiohyla.html     Osteopilus.html	   Triprion.html
Exerodonta.html      Phyllodytes.html	   Xenohyla.html
Hyla.html	     Phytotriades.html




You will create the _whoosh_ index files in the `modules/module6/exercises/wiki_index` folder, then ingest the files. (The solution cells below use a folder named `animal_index` instead.)

To load the data, write a Python script that follows this basic crawling behavior:

 1. For each file/folder in the specified starting folder:
 1. If it is a folder, recurse into folder and process contents
 1. If it is a file, read contents and load into indexer.
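
The three steps above can be sketched with `os.walk`, which handles the recursion into subfolders. This is a minimal illustration rather than the exercise solution: `find_html_files` and the temporary file names are hypothetical, and the function only collects paths instead of loading anything into an index.

```python
import os
import tempfile

def find_html_files(start_folder):
    """Return the sorted paths of all .html files under start_folder."""
    matches = []
    # os.walk visits start_folder and every subfolder beneath it
    for root, dirs, files in os.walk(start_folder):
        for name in files:
            if name.endswith(".html"):
                matches.append(os.path.join(root, name))
    return sorted(matches)

# Demonstrate on a throwaway folder tree (made-up file names)
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ["a.html", "notes.txt", os.path.join("sub", "b.html")]:
        with open(os.path.join(tmp, rel), "w") as f:
            f.write("<html></html>")
    found = [os.path.relpath(p, tmp) for p in find_html_files(tmp)]
    print(found)
```

In the exercise, the body of the inner loop would call the loading function instead of appending to a list.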
 
## Follow the lab for Python IR with whoosh to complete this exercise.

### 2) Create / Initialize the whoosh index and get the `writer` object.

In [3]:
import os, os.path
from whoosh import index
from bs4 import BeautifulSoup

# Step 2 below this comment

os.makedirs("animal_index", exist_ok=True)  # create a directory for the index

# Note: create_in() clears any existing index in the directory
ix = index.create_in("animal_index", schema)

# Get a writer for the newly created index
writer = ix.writer()


### 3) Adapt the helper functions

Note the subtle changes from the lab.
Adapt the code below so that it recursively finds the HTML (`.html`) files, parses them, and indexes them with the Whoosh API.
Trust no code; verify all code segments.


In [4]:
from bs4 import Comment

def visible(element):  # return True for text nodes that render as visible page text
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:  # non-visible tags
        return False
    elif isinstance(element, Comment):  # HTML comments are Comment nodes in bs4
        return False
    return True

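def pullBiota(soup):
    '''
    Return the label/value pairs (e.g. 'Kingdom:' -> 'Animalia') from the infobox table.
    '''
    data = {}

    table = soup.find('table', class_='infobox')
    if table is None:  # page has no infobox; return no table data
        return data

    for row in table.find_all('tr'):
        cells = row.find_all('td')

        # Rows with at least two cells hold a label and its value
        if len(cells) > 1:
            data[cells[0].find(text=True)] = cells[1].find(text=True)

    return data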


def loadFile(writer, fname):
    '''
    Read file contents, load into database.
    '''
    with open(fname, 'r') as infile:
        html = infile.read()   # raw HTML; named html so that content can hold the visible text
        
        #-----------------------------------------
        soup = BeautifulSoup(html, 'html.parser')
        texts = soup.find_all(text=True)
        # Process all the visible text
        visible_texts = filter(visible, texts)
        #-----------------------------------------
        
        # TODO: Assemble all visible_texts into a content string
        # Hint: Iterate over visible_texts line by line; remove newlines; create a concatenated string
        content = ""  # Starting with a blank string
        
        for i in visible_texts:
            content += str(i.rstrip())

        # print(content)
        
        # TODO: Process the "<table class="infobox biota" ... </table> data
        infotable = pullBiota(soup)
        
        writer.add_document(
            filename=fname,
            content=content,
            kingdom=str(infotable.get('Kingdom:')),
            phylum=str(infotable.get('Phylum:')),
            a_class=str(infotable.get('Class:')),
            order=str(infotable.get('Order:')),
            family=str(infotable.get('Family:')),
            subfamily=str(infotable.get('Subfamily:')),
            genus=str(infotable.get('Genus:'))
        )
        
        # Report progress; the document becomes searchable after writer.commit()
        print("Indexed: ", fname)

def processFolder(writer,folder):
    '''
    Process a folder for files and subfolders
    '''
    print('Processing folder: ',folder)
    for root, dirs, files in os.walk(folder):
        print("root = ", root)
        # Process Files
        for file in files:
            if file.endswith(".html"):
                filename = os.path.join(root, file)
                print('Processing File:',filename)
                loadFile(writer,filename)
            else:
                print("Unhandled File")
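
One detail worth verifying in `loadFile`: appending stripped fragments directly can fuse the last word of one text node with the first word of the next, which changes the terms that get indexed. A small sketch, using made-up fragments, compares direct concatenation with joining on a single space:

```python
# Made-up text fragments, as they might come from adjacent HTML elements
fragments = ["Tree frogs\n", "  are amphibians. \n", "\n", "Family: Hylidae"]

# Direct concatenation after rstrip()
fused = ""
for fragment in fragments:
    fused += fragment.rstrip()

# Alternative: collapse whitespace inside each fragment, skip empty ones,
# and join the rest with single spaces
spaced = " ".join(" ".join(fragment.split())
                  for fragment in fragments if fragment.strip())

print(repr(fused))
print(repr(spaced))
```

In the fused string, `amphibians.` and `Family:` run together with no whitespace between them; the spaced version keeps them apart.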



### 4) Parse with our defined functions in place.

In [5]:
# Start processing the folder and commit the work
# ---------------------------------------------------

processFolder(writer, '/dsa/data/all_datasets/en.wikipedia.org/wiki')
    
writer.commit()


Processing folder:  /dsa/data/all_datasets/en.wikipedia.org/wiki
root =  /dsa/data/all_datasets/en.wikipedia.org/wiki
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Argenteohyla.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Argenteohyla.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Bokermannohyla.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/w

## Testing Indexed documents

In [6]:
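from whoosh.index import open_dir
ix = open_dir('animal_index')
# List the stored fields of every indexed document; the with block closes the searcher
with ix.searcher() as searcher:
    for doc in searcher.documents():
        print(doc)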

{'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html', 'genus': 'Acris', 'kingdom': 'Animalia', 'order': 'Anura', 'phylum': 'Chordata', 'subfamily': 'None'}
{'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html', 'genus': 'Anotheca', 'kingdom': 'Animalia', 'order': 'Anura', 'phylum': 'Chordata', 'subfamily': 'None'}
{'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html', 'genus': 'Aparasphenodon', 'kingdom': 'Animalia', 'order': 'Anura', 'phylum': 'Chordata', 'subfamily': 'None'}
{'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html', 'genus': 'Aplastodiscus', 'kingdom': 'Animalia', 'order': 'Anura', 'phylum': 'Chordata', 'subfamily': 'None'}
{'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/dsa/data/all_datasets/en.wiki
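
The `'subfamily': 'None'` values above are the literal string produced by `str(None)`: `dict.get` returns `None` when the infobox has no such row, and `str()` turns that into text that is indexed like any other value. One possible alternative, sketched here with a hypothetical `infotable` dictionary and a hypothetical `LABELS` mapping, is to keep only the fields that were actually found:

```python
# Hypothetical parsed infobox with no "Subfamily:" row
infotable = {"Kingdom:": "Animalia", "Family:": "Hylidae", "Genus:": "Acris"}

# Schema field name -> infobox label (assumed to match the page text)
LABELS = {"kingdom": "Kingdom:", "family": "Family:",
          "subfamily": "Subfamily:", "genus": "Genus:"}

# What str(dict.get(...)) yields for a missing row
as_string = str(infotable.get("Subfamily:"))

# Keep only the fields that have a value; these could be passed as
# writer.add_document(**fields) so that missing rows are simply left unset
fields = {name: infotable[label] for name, label in LABELS.items()
          if infotable.get(label)}

print(as_string)
print(fields)
```

Whether to store a placeholder or leave the field unset is a design choice; leaving it unset avoids hits on the word "none".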

--- 
<a id='search_me' ></a>

### 5) Executing Queries

Read: 
  http://whoosh.readthedocs.io/en/latest/searching.html
  
Previously, we hard-coded query strings into the code cells.

Now, use the `input()` function to collect a query string from the user, then execute the search.
For this task, focus only on the `content` field.

In [7]:
from whoosh.qparser import QueryParser

# Write your code below this comment:
# --------------------------------------

user_query = input("Enter a search term: ")
qp = QueryParser("content", schema=ix.schema)
q = qp.parse(user_query)

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit['filename'], hit['kingdom'], hit.score, hit.rank)



Enter a search term: frog
/dsa/data/all_datasets/en.wikipedia.org/wiki/Smilisca.html Animalia 2.2658413039464014 0
/dsa/data/all_datasets/en.wikipedia.org/wiki/Sphaenorhynchus.html Animalia 2.2539348167985533 1
/dsa/data/all_datasets/en.wikipedia.org/wiki/Hylidae.html Animalia 2.218049415505333 2
/dsa/data/all_datasets/en.wikipedia.org/wiki/Pseudis.html Animalia 2.150816768195243 3
/dsa/data/all_datasets/en.wikipedia.org/wiki/Osteopilus.html Animalia 2.1385716279006313 4
/dsa/data/all_datasets/en.wikipedia.org/wiki/Ptychohyla.html Animalia 2.1385716279006313 5
/dsa/data/all_datasets/en.wikipedia.org/wiki/Pseudacris.html Animalia 2.0725425534185784 6
/dsa/data/all_datasets/en.wikipedia.org/wiki/Hyla.html Animalia 2.0232093772682416 7
/dsa/data/all_datasets/en.wikipedia.org/wiki/Phytotriades.html Animalia 2.012830471669969 8
/dsa/data/all_datasets/en.wikipedia.org/wiki/Ecnomiohyla.html Animalia 1.953644630201365 9


### 6) Write two example queries to ensure you can search the index 

That is, make sure you can search on the fields you added to the index from the infobox biota table.

```HTML
<table class="infobox biota" ... </table>
```
For this search, we will ignore the `content` field and search over the other fields. We can use `MultifieldParser` to specify the fields of interest.


In [8]:
# Write your code below this comment:
# --------------------------------------
from whoosh.qparser import MultifieldParser
from whoosh import qparser

# OMIT CONTENT
qp = MultifieldParser(["kingdom","phylum","a_class","order","family","genus"], 
                      schema=ix.schema, group=qparser.OrGroup)  
user_query = input("Enter a search term: ")
q = qp.parse(user_query)

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)


Enter a search term: Animalia
<Hit {'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html', 'genus': 'Acris', 'kingdom': 'Animalia', 'order': 'Anura', 'phylum': 'Chordata', 'subfamily': 'None'}>
<Hit {'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html', 'genus': 'Anotheca', 'kingdom': 'Animalia', 'order': 'Anura', 'phylum': 'Chordata', 'subfamily': 'None'}>
<Hit {'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html', 'genus': 'Aparasphenodon', 'kingdom': 'Animalia', 'order': 'Anura', 'phylum': 'Chordata', 'subfamily': 'None'}>
<Hit {'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html', 'genus': 'Aplastodiscus', 'kingdom': 'Animalia', 'order': 'Anura', 'phylum': 'Chordata', 'subfamily': 'None'}>
<Hit {'a_class': 'Amphibia', 'fami

In [9]:
# Write your code below this comment:
# --------------------------------------

# OMIT CONTENT
qp = MultifieldParser(["kingdom","phylum","a_class","order","family","genus"], 
                      schema=ix.schema, group=qparser.OrGroup)
q = qp.parse('Amphibia')

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)




<Hit {'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html', 'genus': 'Acris', 'kingdom': 'Animalia', 'order': 'Anura', 'phylum': 'Chordata', 'subfamily': 'None'}>
<Hit {'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html', 'genus': 'Anotheca', 'kingdom': 'Animalia', 'order': 'Anura', 'phylum': 'Chordata', 'subfamily': 'None'}>
<Hit {'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html', 'genus': 'Aparasphenodon', 'kingdom': 'Animalia', 'order': 'Anura', 'phylum': 'Chordata', 'subfamily': 'None'}>
<Hit {'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html', 'genus': 'Aplastodiscus', 'kingdom': 'Animalia', 'order': 'Anura', 'phylum': 'Chordata', 'subfamily': 'None'}>
<Hit {'a_class': 'Amphibia', 'family': 'Hylidae', 'filename': '/

# SAVE YOUR NOTEBOOK WITH ALL EXECUTED CELLS
# Then, `File > Close and Halt`