# Summer 2021 -> INSY 5379 -> Capstone Project
### UTA MSBA students: 
- Phanikrishna Karanam
- Swetha Gollamudi
- Joel Andrews

#### Generic Web Scraper - Brand Promise

__Summary__: This is a generic web scraper that extracts details related to the Brand Promise (based on claims) offered by Health IT (HIT) vendors on their websites.

__Steps performed__:
- Read the input file containing the URLs of HIT vendors
- Open each URL in a browser and extract the page source using Selenium
- Parse the source HTML using BeautifulSoup and extract the required data
- Filter & preprocess the data
  - Identify stop words
- Write the extracted data into the respective columns of the output file based on identified keywords & modifiers
  - Write errors into an error file
- Extract all href links and filter them through the href stop strings
- Repeat steps 2 through 5 for the URLs obtained from the href links
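
The keyword and modifier matching in the write step can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: `find_claims` and the sample rows are hypothetical, and it assumes each keyword row starts with a claim keyword followed by its modifier stems.

```python
import re

def find_claims(text, keyword_rows):
    """Return (claim, modifier) pairs whose patterns both occur in text."""
    hits = []
    lowered = text.lower()
    for row in keyword_rows:
        claim, modifiers = row[0].lower(), row[1:]
        if not re.search(claim, lowered):
            continue                      # Claim keyword absent, skip its modifiers
        for modifier in modifiers:
            if re.search(modifier.lower(), lowered):
                hits.append((claim, modifier))
    return hits

# Hypothetical keyword rows: claim first, then modifier stems
sample_rows = [['outcome', 'improv', 'better'], ['cost', 'lower', 'reduc']]
sample_text = 'Our customers are improving outcomes with real-time data'
print(find_claims(sample_text, sample_rows))
```

A single sentence can yield several rows when it matches more than one claim or modifier, which is why the same evidence text can appear repeatedly in the output file.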

__Next steps__: 
- Work with Canton domain experts to:
    - Identify predictors / regressors based on the extracted text data 
    - Obtain or create training data and identify target labels 
- Use the extracted text data to build a term matrix based on the keywords & modifiers that appear in the output file
- Use unsupervised learning methods to cluster the data and assign the cluster number as the label
- Using the output from unsupervised learning, train machine learning models to classify the labels in the target variable
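
As a rough idea of what the term matrix step might look like, the sketch below counts how often each keyword or modifier stem occurs in each evidence string. The function name `build_term_matrix`, the term list and the sample sentences are assumptions for illustration; a real implementation would likely use a library vectorizer and feed the matrix to a clustering method, neither of which is shown here.

```python
def build_term_matrix(evidence_rows, terms):
    """Return one row of substring counts per evidence string, one column per term."""
    matrix = []
    for text in evidence_rows:
        lowered = text.lower()
        matrix.append([lowered.count(term) for term in terms])
    return matrix

# Hypothetical terms (claim keywords and modifier stems) and evidence strings
terms = ['outcome', 'data', 'improv', 'lower']
evidence = [
    'Our customers are improving outcomes with real-time data',
    'Lower cost and improved data quality',
]

for row in build_term_matrix(evidence, terms):
    print(row)
```

Counting stems by substring keeps the sketch consistent with the stem-style modifiers (for example `improv`) that appear in the output file.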

In [15]:
import time 
from datetime import datetime

start_time = time.time()
start = datetime.now().strftime('%H:%M:%S')

print('Program Start Time:', start)

Program Start Time: 23:27:43


In [2]:
# Import necessary libraries

import pandas as pd
import bs4
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time
import re
import string
from urllib.parse import urljoin

In [3]:
def download_browser(my_url):
    executable_path = r'C:\Users\karan\AppData\Roaming\Python\Python37\site-packages\geckodriver.exe'
    
    options = webdriver.FirefoxOptions()
    
    # Selenium drives Firefox (through geckodriver) without opening a browser window (headless argument).
    # Note: '--ignore-certificate-errors' and '--incognito' are Chrome-style switches and may have no effect in Firefox
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    options.add_argument('--headless')
    
    browser = webdriver.Firefox(executable_path = executable_path, options = options)
    try:
        browser.get(my_url)
        # Pass the page source to Beautiful Soup for parsing
        page_soup = soup(browser.page_source, 'lxml')
    except Exception:
        browser.quit()                      # Do not leave a Firefox process behind when the page fails to load
        raise
    return (browser, page_soup)

In [4]:
in_file = 'C:/MSBA/Summer 2021/INSY 5379 - Capstone Project/HIT Vendors.xlsx'
keyword_file = 'C:/MSBA/Summer 2021/INSY 5379 - Capstone Project/brand_promise_high_level_keywords - Updated.xlsx'
out_file = 'C:/MSBA/Summer 2021/INSY 5379 - Capstone Project/HIT Output - Brand Promise.csv'
out_file_err = 'C:/MSBA/Summer 2021/INSY 5379 - Capstone Project/HIT Output_Error.csv'

header = ['Company Name', 'Claim', 'Modifier', 'Evidence', 'Link']
header_err = ['Company Name', 'Link', 'Error']

df = pd.read_excel(in_file)
df.drop_duplicates(subset = ['Name'], inplace = True)

input_list = df[['Name', 'Website']].values.tolist()

df_key = pd.read_excel(keyword_file).fillna('')
header_key = list(df_key.columns)

temp_list = df_key.values.tolist()
keywords_list = [list(filter(None, key_list)) for key_list in temp_list]
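
The list comprehension above relies on `fillna('')` having replaced empty cells first, because `filter(None, ...)` only drops falsy values and `NaN` is truthy. A small sketch with made-up rows (the real keyword file's contents are not shown in this notebook) illustrates the effect:

```python
# Hypothetical rows as they might look after fillna(''):
# claim keyword first, modifiers after, '' padding for short rows
sample_rows = [
    ['outcome', 'improv', 'better', ''],
    ['cost', 'lower', '', ''],
]

# filter(None, row) removes the '' padding and keeps everything else in order
keyword_rows = [list(filter(None, row)) for row in sample_rows]
print(keyword_rows)

# NaN is truthy, so it would survive the filter if fillna('') were skipped
print(bool(float('nan')))
```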

In [5]:
def write_file(keyword, modifier, txt, link):    
    # comp_name and out are globals set by the main loop
    out.append([comp_name, keyword, modifier, txt, link])

In [6]:
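def write_error_file(link):
    # comp_name, out_err and e are globals; e is only bound inside the caller's except block
    out_err.append([comp_name, link, str(e)])    # str() so that drop_duplicates compares messages, not exception objects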

In [14]:
# Raw strings keep the regex escapes (such as \.) from being read as Python string escapes
stop_strings = ['@', r'\.com', r'www\.', '__', '{', '}', r'\.\.', r'\[#', r'\.js', '=', r'\.css', r'\.png']

href_stop_strings = ['support', 'video', 'podcast', 'case-study', 'contact', 'careers', 'disclaimer', 'terms', 'blog',
                     r'\#top', 'privacy', 'policy', 'news', 'press', 'release', 'testimonial', 'signup', 'feed', 'webinar',
                     r'\.css', r'\.png', 'woff', 'manifest', 'cache', r'\.ttf', r'\.otf', r'\.ico', r'\.zip', r'\.mp3', r'\.xls',
                     r'\.xml']
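
The notebook checks each stop string with its own `re.search` call. One possible alternative, shown here only as a sketch, is to combine the fragments into a single case-insensitive pattern that is compiled once. `compile_stop_pattern` and the shortened sample list are hypothetical names and values, not part of the notebook.

```python
import re

def compile_stop_pattern(stop_patterns):
    """Combine several regex fragments into one case-insensitive pattern."""
    combined = '|'.join('(?:{})'.format(p) for p in stop_patterns)
    return re.compile(combined, re.IGNORECASE)

# Shortened, hypothetical sample of href stop strings
sample_stops = [r'\.css', r'\.png', 'careers', 'privacy']
stop_pattern = compile_stop_pattern(sample_stops)

print(bool(stop_pattern.search('/assets/site.CSS')))      # Matches the \.css fragment
print(bool(stop_pattern.search('/solutions/analytics')))  # No fragment matches
```

Wrapping each fragment in a non-capturing group keeps a fragment that itself contains `|` from changing the meaning of its neighbours.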

In [8]:
def extract_text(link): 
    browser, page_soup = download_browser(link)                             # Open the page and parse its source 

    # Pause for 2 seconds between page loads (the page source was already captured in download_browser)
    time.sleep(2)

    for text_string in page_soup.stripped_strings:   
        # Eliminate all non printable characters in text (re.escape keeps ']', '\' and '^' literal inside the class)
        txt = re.sub('[^{}]'.format(re.escape(string.printable)), '', text_string.strip())

        # Eliminate multiple spaces in text
        txt_list = txt.split()                              
        txt = (' ').join(txt_list)

        if len(txt) > 2:
            skip_text = False

            for stop_string in stop_strings:                                # Eliminate all stop strings
                if re.search(stop_string.lower(), txt.lower()):
                    skip_text = True
                    break     

            # Keep text with fewer than 10 digits (skips phone numbers) and more than 4 words
            if not skip_text and len(re.sub('[^0-9]', '', txt)) <= 9 and len(txt_list) > 4:
                
                for keywords in keywords_list:     
                    if re.search(keywords[0].lower(), txt.lower()):                    
                    
                        for modifier in keywords[1:]:    
                            if re.search(modifier.lower(), txt.lower()):
                                
                                write_file(keywords[0].lower(), modifier, txt, link) # Capture all extracted & filtered text
                                #break                                         # If required to stop after 1st modifier hit
        
    return (browser, page_soup)
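
The cleaning and filtering rules inside `extract_text` can be looked at in isolation. The sketch below is an approximation under stated assumptions: `clean_text` and `is_candidate` are hypothetical helper names, and the thresholds mirror the ones used above (at most 9 digits, more than 4 words).

```python
import re
import string

# Character class matching anything outside string.printable
NON_PRINTABLE = re.compile('[^{}]'.format(re.escape(string.printable)))

def clean_text(raw):
    """Drop non-printable characters and collapse runs of whitespace."""
    return ' '.join(NON_PRINTABLE.sub('', raw).split())

def is_candidate(text, max_digits=9, min_words=5):
    """Keep text with few digits (not a phone number) and enough words."""
    digits = len(re.sub('[^0-9]', '', text))
    return digits <= max_digits and len(text.split()) >= min_words

print(clean_text('Improve\u00a0 outcomes \n with   data\u2122'))
print(is_candidate('Call us at 800-555-0100 for a demo today'))
print(is_candidate('Our customers are improving outcomes'))
```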

In [9]:
def extract_href():
    href_pool = []
    
    temp_link = link.split('//')[1]
    if 'www.' in temp_link:                 # Test for 'www.' (with the dot) so the split below always has 2 parts
        link_internal = temp_link.split('www.')[1].split('/')[0]
    else: 
        link_internal = temp_link.split('/')[0]

    for href_list in href_lists:
        href = href_list['href'].strip()
        skip_href = False
        
        if len(href) > 1 and '@' not in href:
            for href_stop_string in href_stop_strings:                              # Eliminate all stop strings
                if re.search(href_stop_string.lower(), href.lower()):
                    skip_href = True
                    break     

            if skip_href == False: 
                if ((re.search(link_internal, href.lower())) or 
                    ('//' not in href and (href.startswith('/') or href.startswith('#')))):  

                    href_updated = urljoin(link, href)
                    href_pool.append(href_updated)
                      
    # Remove duplicate links while preserving their original order
    href_pool_clean = list(dict.fromkeys(href_pool))
    
    return href_pool_clean
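
As a point of comparison for `extract_href`, the sketch below resolves each href with `urljoin` and then compares parsed host names instead of searching for the domain with a regular expression. This is a swapped-in technique, not the notebook's method; `internal_links` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin, urlparse

def internal_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique links on the same site."""
    base_host = urlparse(base_url).netloc.lower()
    if base_host.startswith('www.'):
        base_host = base_host[4:]

    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href.strip())
        host = urlparse(absolute).netloc.lower()
        # Accept the bare domain and any of its subdomains
        if host == base_host or host.endswith('.' + base_host):
            resolved.append(absolute)

    # dict.fromkeys keeps the first occurrence of each link, in order
    return list(dict.fromkeys(resolved))

sample_hrefs = ['/products', 'https://example.com/about', 'https://other.org/x',
                '/products', '#top', 'mailto:info@example.com']
print(internal_links('https://www.example.com/', sample_hrefs))
```

Comparing hosts avoids accepting an external link that merely mentions the vendor's domain somewhere in its path or query string.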

In [10]:
out = []
out_err = []
i = 1

for record in input_list:           # 'record' avoids shadowing the built-in input()
    
    comp_name = record[0].strip()   # Capture company name
    link = record[1].strip()        # Capture company url
        
    try: 
        # Extract text from home page
        browser, page_soup = extract_text(link) 
        browser.quit()                                                          # Quit the browser for every URL 
        
        href_lists = page_soup.find_all(href = re.compile(r'\S'))
        href_pool_clean = extract_href()
        
        # Extract text from all href tags collected from home page
        for href_link in href_pool_clean:          
            try: 
                browser, page_soup = extract_text(href_link)

            except Exception as e:                                              # Write error record for run time errors
                print('url# {}: {} Error: {}'.format(i, href_link, e))
                write_error_file(href_link)

            finally: 
                browser.quit()                                                  # Quit the browser for every URL  
    
    except Exception as e:                                                      # Write error record for run time errors
        print('url# {}: {} Error: {}'.format(i, link, e))
        write_error_file(link)
    
    finally: 
        if i % 10 == 0:                                                         # Print after every 10 records
            print('# of URLs processed:', i)
        i = i + 1                                                               # Counter tracking # of records processed 

url# 6: https://www.mediquant.com/wp-content/themes/Divi/core/admin/fonts/modules.ttf Error: Message: TimedPromise timed out after 300000 ms

# of URLs processed: 10
url# 11: https://www.medhost.com/who-we-serve/ Error: Message: Reached error page: about:neterror?e=redirectLoop&u=https%3A//www.medhost.com/who-we-serve/&c=UTF-8&d=Firefox%20has%20detected%20that%20the%20server%20is%20redirecting%20the%20request%20for%20this%20address%20in%20a%20way%20that%20will%20never%20complete.

url# 13: https://liberation.medecision.com/register/ Error: Message: Reached error page: about:neterror?e=dnsNotFound&u=https%3A//liberation.medecision.com/register/&c=UTF-8&d=We%20can%E2%80%99t%20connect%20to%20the%20server%20at%20liberation.medecision.com.

url# 15: https://www.meddata.com/upcoming-conferences/?ical=1 Error: Message: TimedPromise timed out after 300000 ms

# of URLs processed: 20
url# 21: https://www.lumahealth.io/wp-content/themes/LumaHealth2/css/fonts/FreightSansProMedium-Regular.woff2 Er

url# 78: https://www.harmonyhit.com/AttendediAgent_regfix.exe Error: Message: TimedPromise timed out after 300000 ms

url# 78: https://www.harmonyhit.com/AttendediAgent-5.4.exe Error: Message: TimedPromise timed out after 300000 ms

url# 78: https://www.harmonyhit.com/UnattendediAgent-6.0.exe Error: Message: TimedPromise timed out after 300000 ms

url# 78: https://www.harmonyhit.com/UnattendediAgent-5.4.exe Error: Message: TimedPromise timed out after 300000 ms

# of URLs processed: 80
url# 82: https://cdn.goliathtechnologies.com/wp-content/themes/Divi/core/admin/fonts/modules.ttf Error: Message: TimedPromise timed out after 300000 ms

url# 86: https://www.gehealthcare.com/dist/GEHC/Project/GEHC/fontStyles/inspira/GEInspiraSerif-BoldItalic-v01.woff2 Error: Message: TimedPromise timed out after 300000 ms

url# 86: https://www.gehealthcare.com/dist/GEHC/Project/GEHC/fontStyles/inspira/GEInspiraSerif-Bold-v01.woff2 Error: Message: TimedPromise timed out after 300000 ms

url# 86: https://w
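
Several of the timeouts in the log above come from font and installer files (`.ttf`, `.woff2`, `.exe`) rather than from pages. One way to skip such links before a browser is opened, sketched here under the assumption that checking the path's extension is sufficient, is shown below. `SKIP_EXTENSIONS` and `is_static_asset` are hypothetical names, and the extension list is only an example.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical set of extensions to skip; extend as needed
SKIP_EXTENSIONS = {'.ttf', '.otf', '.woff', '.woff2', '.exe', '.zip', '.css', '.png', '.ico'}

def is_static_asset(url):
    """Return True when the URL path ends in an extension listed in SKIP_EXTENSIONS."""
    path = urlparse(url).path                       # Ignores the query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in SKIP_EXTENSIONS

print(is_static_asset('https://www.harmonyhit.com/AttendediAgent-5.4.exe'))
print(is_static_asset('https://www.medhost.com/who-we-serve/'))
```

Because only the path is inspected, calendar and feed links such as `?ical=1` are not caught by this check and would need a separate rule.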

In [11]:
df_out = pd.DataFrame(out, columns = header)
df_out.drop_duplicates(subset = ['Company Name', 'Claim', 'Modifier', 'Evidence'], inplace = True)

#df_out.sort_values(by = ['Company Name', 'Claim', 'Modifier', 'Evidence'], inplace = True)
#df_out.reset_index(drop = True, inplace = True)

df_out.to_csv(out_file, index = False, encoding='utf-8')
df_out

Unnamed: 0,Company Name,Claim,Modifier,Evidence,Link
0,Meditology Services,process,leverag,As a provider of data analytics to health plan...,https://www.meditologyservices.com/incident-re...
1,Meditology Services,data,leverag,As a provider of data analytics to health plan...,https://www.meditologyservices.com/incident-re...
2,Meditology Services,data,better,UVM had better security protocols than many ho...,https://www.meditologyservices.com/how-hackers...
3,MEDITECH,outcome,improv,Our customers are improving outcomes with real...,https://ehr.meditech.com/
8,MEDITECH,efficiency,improv,Princeton Community Hospital Improves Response...,https://ehr.meditech.com/ehr-solutions
...,...,...,...,...,...
25829,Document Storage Systems,outcome,improv,TheraDoc provides a real-time picture of the r...,https://www.dssinc.com/theradoc-form
25830,Document Storage Systems,time,lower,TheraDoc provides a real-time picture of the r...,https://www.dssinc.com/theradoc-form
25831,Document Storage Systems,data,improv,TheraDoc provides a real-time picture of the r...,https://www.dssinc.com/theradoc-form
25832,Document Storage Systems,data,lower,TheraDoc provides a real-time picture of the r...,https://www.dssinc.com/theradoc-form


In [12]:
df_out_err = pd.DataFrame(out_err, columns = header_err)
df_out_err.drop_duplicates(inplace = True)

df_out_err.to_csv(out_file_err, index = False, encoding='utf-8')
df_out_err

Unnamed: 0,Company Name,Link,Error
0,MediQuant,https://www.mediquant.com/wp-content/themes/Di...,Message: TimedPromise timed out after 300000 ms\n
1,MEDHOST,https://www.medhost.com/who-we-serve/,Message: Reached error page: about:neterror?e=...
2,Medecision,https://liberation.medecision.com/register/,Message: Reached error page: about:neterror?e=...
3,MedData,https://www.meddata.com/upcoming-conferences/?...,Message: TimedPromise timed out after 300000 ms\n
4,Luma Health,https://www.lumahealth.io/wp-content/themes/Lu...,Message: TimedPromise timed out after 300000 ms\n
...,...,...,...
62,eMids Technologies,https://www.emids.com/events/?ical=1,Message: TimedPromise timed out after 300000 ms\n
63,Ellkay,http://jobs.ellkay.com,Message: Reached error page: about:neterror?e=...
64,eHealth Exchange,https://ehealthexchange.org/all-events/?ical=1,Message: TimedPromise timed out after 300000 ms\n
65,"DSS, Inc.",https://www.dssinc.com/homepage?format=rss,Message: TimedPromise timed out after 300000 ms\n


In [13]:
end = datetime.now().strftime('%H:%M:%S')
end_time = time.time()
seconds = end_time - start_time

print('Program Start Time   :', start)
print('Program End Time     :', end)
print('Program Elapsed Time :', time.strftime('%H:%M:%S', time.gmtime(seconds)))

Program Start Time   : 04:42:39
Program End Time     : 05:12:48
Program Elapsed Time : 00:30:08
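
Formatting the elapsed time with `time.gmtime` wraps around after 24 hours, which could matter if the vendor list grows. A small alternative sketch follows; `format_elapsed` is a hypothetical helper, not part of the notebook.

```python
def format_elapsed(seconds):
    """Format a duration as HH:MM:SS without wrapping at 24 hours."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return '{:02d}:{:02d}:{:02d}'.format(hours, minutes, secs)

print(format_elapsed(1808))     # 30 minutes and 8 seconds
print(format_elapsed(90000))    # 25 hours
```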
