# Text Matching

For our first implementation we are going to create a list of medical words.  We will
then compare the content of a document against that list and say that the document is a 
medical document if any of those words are found in it.  

## First Step

The first step is to obtain a list of medical terms.  To begin with we are going
to use terms that we can find online.  One of the better known medical websites is 
[webmd.com](https://webmd.com), which happens to have a dictionary containing some medical terms.  

Let's start by scraping the list of terms from this site. 

In [14]:
# Start by loading our imports

import bs4
import requests
import itertools
import json
import os
import string
import time

from bs4 import BeautifulSoup

Now that we have our imports in place, let's start with a quick analysis of the webpage format.  We
need this so that we can use Beautiful Soup to correctly extract the text we are looking for.  

The first page we will be looking at is 
[the collection of words that start with A](https://dictionary.webmd.com/default.htm?filter=A).  

If we look over the format we can see that we are looking for a single `ul` tag with the class name
`az-index-results-group-list`.  Within that element, the text of each `li` element contains one of the
medical words that we are looking for.  

In [12]:
def read_terms_from_webmd():
    base_url = 'https://dictionary.webmd.com/default.htm?filter='

    words = []
    for letter in string.ascii_uppercase:
        url = base_url + letter

        result = requests.get(url)
        status_code = result.status_code
        content = result.content

        if status_code < 200 or status_code >= 300:
            # We failed, so don't return any results
            return None

        soup = BeautifulSoup(content, 'html5lib')
        word_list_tag = soup.find('ul', class_='az-index-results-group-list')
        if word_list_tag is None:
            # The page layout is not what we expected, so treat it as a failure
            return None
        word_tags = word_list_tag.find_all('li')

        words.extend(w.get_text().strip() for w in word_tags)

    return words

In [13]:
all_terms = read_terms_from_webmd()
all_terms[0]

'Ablutophobia'

Now that we have downloaded our terms, let's go ahead and store them to disk (if needed).  

In [19]:
save_to_disk = False

if save_to_disk:
    directory = os.path.join('..', 'data')
    filename = f'webmd_dictionary_{time.strftime("%Y-%m-%d")}.txt'
    with open(os.path.join(directory, filename), 'w') as webmd_file:
        for line in all_terms:
            print(line, file=webmd_file)
            
def load_webmd_from_disk(filename):
    with open(filename, 'r') as webmd_file:
        # Skip blank lines: an empty term would match every text
        return [line.strip() for line in webmd_file if line.strip()]
        

For now we are not going to do much to clean up the data; we will just take it at face value. Normally
we would want to make sure that there aren't any oddities in the data, but we can get to that
after trying this approach for a bit.  
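
As a rough idea of what such a later check could look like, the sketch below trims whitespace, drops
empty entries, and removes duplicates that differ only by case.  The `clean_terms` name, the sample
terms, and the choice to lowercase everything are assumptions made for illustration; nothing in the
steps here depends on them.  

```python
def clean_terms(raw_terms):
    """Return trimmed, lowercased terms without blanks or duplicates (hypothetical helper)."""
    seen = set()
    cleaned = []
    for term in raw_terms:
        normalized = term.strip().lower()
        # Skip blanks and anything we have already kept
        if not normalized or normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned


# Made-up sample input, not the scraped data
sample_terms = ['Ablutophobia', ' ablutophobia ', '', 'Abscess']
print(clean_terms(sample_terms))  # ['ablutophobia', 'abscess']
```

Note that lowercasing the terms only makes sense if the text being checked is lowercased as well.  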

## Second Step

In our first step we obtained a list of medical terms.  We can now check for their existence
in a body of text (including a document), and if any of the terms are found we will label that text
as being medical in nature.  

We will create the comparison function and run it against all the data that we are trying to label
correctly.  

In [24]:
webmd_terms_file = 'webmd_dictionary_2018-08-05.txt'

all_terms = load_webmd_from_disk(os.path.join('..', 'data', webmd_terms_file))

def is_text_medical(text):
    # Case-sensitive substring check: True if any term appears anywhere in text
    return any(term in text for term in all_terms)
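
One thing to keep in mind is that `term in text` is a case-sensitive substring test, so a short term
can match inside a longer unrelated word and a lowercase mention of a capitalized term is missed.  A
possible alternative, sketched here with a hypothetical `build_term_matcher` helper and made-up sample
terms, compiles the terms into a single case-insensitive regular expression that only matches whole
words.  It is one option to try, not the approach used in this notebook.  

```python
import re


def build_term_matcher(terms):
    """Build a whole-word, case-insensitive matcher (hypothetical helper)."""
    # Escape each term so punctuation is treated literally; put longer terms
    # first so multi-word terms are tried before their shorter prefixes
    escaped = [re.escape(t) for t in sorted(terms, key=len, reverse=True) if t]
    if not escaped:
        # No terms means nothing can match
        return lambda text: False
    pattern = re.compile(
        r'(?<!\w)(?:' + '|'.join(escaped) + r')(?!\w)', re.IGNORECASE
    )
    return lambda text: pattern.search(text) is not None


# Sample terms chosen for illustration only
matches_medical_term = build_term_matcher(['Abscess', 'Acne'])

print(matches_medical_term('The patient has an abscess.'))  # True
print(matches_medical_term('The Acnew project shipped.'))   # False
```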

In [27]:
## TODO: Load the test set and check to see how well our solution works. 
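
Until the real test set is loaded, the shape of that check could look something like the following. 
Everything here is a placeholder: the `evaluate` helper, the three-item labeled sample, and the
`stand_in` classifier are assumptions for illustration, and the real test set and `is_text_medical`
would take their place.  

```python
def evaluate(classifier, labeled_texts):
    """Return (precision, recall) for a binary classifier (hypothetical helper)."""
    true_pos = false_pos = false_neg = 0
    for text, is_medical in labeled_texts:
        predicted = classifier(text)
        if predicted and is_medical:
            true_pos += 1
        elif predicted and not is_medical:
            false_pos += 1
        elif not predicted and is_medical:
            false_neg += 1
    # Guard against division by zero when there are no positives
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Made-up examples; the real test set would be loaded from disk
sample = [
    ('Patient presents with an Abscess on the arm.', True),
    ('Quarterly revenue grew by four percent.', False),
    ('Chronic cough and mild fever reported.', True),
]


def stand_in(text):
    # Stand-in for is_text_medical, using a single hard-coded term
    return 'Abscess' in text


print(evaluate(stand_in, sample))  # (1.0, 0.5)
```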