# Summer 2021 -> INSY 5379 -> Capstone Project
### UTA MSBA students: 
- Phanikrishna Karanam
- Swetha Gollamudi
- Joel Andrews

#### Generic Web Scraper - Products & Solutions

__Summary__: A generic web scraper that extracts details on the Products, Solutions, Services, Platforms & Clients of Health IT (HIT) vendors from their websites.

__The following steps are performed__:
- Read input file containing the URLs of HIT vendors
- Open the URL in browser and extract the page source using Selenium
- Parse the source html data using BeautifulSoup and extract required data
- Filter & Preprocess the data
  - Identify Stop words and Stop phrases
- Write extracted data into respective columns of output file based on identified keywords
   - Write errors into an error file

__Next steps__: 
- By interacting with Canton domain experts:
    - Identify Predictors / Regressors based on extracted text data. Build word clouds to identify frequent words
    - Obtain / Create Training data and identify Target labels 
- Train & build a Machine Learning model to classify labels in the target variable
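
As a starting point for the word-frequency step, the sketch below counts the most common words across `Page_Text`-style strings using only the standard library. The `top_words` helper, its parameters, and the sample strings are illustrative assumptions and not part of the project code; a word cloud library could consume the same counts.

```python
import re
from collections import Counter

def top_words(page_texts, n=10, min_len=3):
    """Return the n most frequent words across ' | '-delimited text strings."""
    counts = Counter()
    for text in page_texts:
        # Keep alphabetic runs only and lower-case them before counting
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if len(w) >= min_len)
    return counts.most_common(n)

# Made-up sample strings in the same ' | '-joined shape as the Page_Text column
sample = [
    "Patient Access | Mid-Revenue Cycle | Patient Financial Services",
    "Revenue Cycle Services | Cloud Services",
]
print(top_words(sample, n=3))
```

Very common words would still need to be filtered out before the counts are used to pick predictors.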

In [1]:
# Import necessary libraries

import pandas as pd
import bs4
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time
import re
import string

 - Function to launch the browser, load the URL and parse the html page source

In [2]:
def download_browser(my_url):
    executable_path = r'C:\Users\karan\AppData\Roaming\Python\Python37\site-packages\geckodriver.exe'
    
    options = webdriver.FirefoxOptions()
    
    # Selenium drives the Firefox browser through geckodriver without actually opening a browser window 
    # (headless argument). Note: '--ignore-certificate-errors' and '--incognito' are Chrome-style flags and may have
    # no effect in Firefox
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    options.add_argument('--headless')
    
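    browser = webdriver.Firefox(executable_path = executable_path, options = options)
    try:
        browser.get(my_url)
    except Exception:
        browser.quit()    # Close this browser here: the caller never receives it when the page load fails
        raise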
    
    # Pass the page source to Beautiful Soup for parsing
    page_soup = soup(browser.page_source, 'lxml')
    return (browser, page_soup)

- Function to initialize the lists. Will be called for every URL

In [3]:
def init_lists():
    products = []
    solutions = []
    services = []
    platforms = []
    clients = []
    page_text = []
    return (products, solutions, services, platforms, clients, page_text)

- Specify Input / Output files
- Read the Input file and create a list of Company name and URL

In [4]:
in_file = 'C:/MSBA/Summer 2021/INSY 5379 - Capstone Project/HIT Vendors.xlsx'
out_file = 'C:/MSBA/Summer 2021/INSY 5379 - Capstone Project/HIT Output.csv'
out_file_err = 'C:/MSBA/Summer 2021/INSY 5379 - Capstone Project/HIT Output_Error.csv'

header = ['Company Name', 'Link', 'Products', 'Solutions', 'Services', 'Platforms', 'Clients', 'Page_Text']
header_err = ['Company Name', 'Link', 'Error']

#fh_in = open(in_file, 'r')
df = pd.read_excel(in_file)
#fh_in.close()

input_list = df[['Name', 'Website']].values.tolist()

 - Function to append the extracted data for each URL to the output list

In [5]:
def write_file():    
    temp = []
    temp = [comp_name, link, ' | '.join(products_clean), ' | '.join(solutions_clean), ' | '.join(services_clean), 
            ' | '.join(platforms_clean), ' | '.join(clients_clean), ' | '.join(page_text_clean)] 
    out.append(temp)
    return ()

 - Function to append any runtime error encountered for an input URL to the error output list

In [6]:
def write_error_file():
    temp_err = []
    temp_err = [comp_name, link, e] 
    out_err.append(temp_err)
    return ()
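
Both helpers above read `comp_name`, `link`, the cleaned lists and `e` from global variables set in the main loop. A minimal alternative sketch, assuming a hypothetical `build_row` helper that is not part of the notebook, passes those values in explicitly so the row format can be checked on its own:

```python
def build_row(comp_name, link, columns):
    """Return one output row: name, link, then each column joined with ' | '."""
    return [comp_name, link] + [' | '.join(entries) for entries in columns]

# Made-up values for illustration only
row = build_row('Example Vendor', 'http://example.com',
                [['Charting', 'Billing'], [], ['Services']])
print(row)
```

The main loop could then call `out.append(build_row(...))` with the six cleaned lists, in the same order as `header`.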

 - Define the Keywords for output file columns, Stop Words and Stop Phrases

In [7]:
products_key = ['product', 'software', 'consult']

solutions_key = ['solution', 'care', 'value', 'industry', 'offer', 'practice', 'capabili', 'virtual', 
                 'monitor', 'real', 'clinical', 'surveillance', 'what', 'how', 'help', 'systems']

services_key = ['service']

platforms_key = ['platform', 'oprx']

clients_key = ['client', 'partner', 'serve']

# Add stop words in lower case
stop_words = ['read report', 'careers', 'industry report', 'read the case study', 'read the report', 'join our team', 
              'overview', 'partners', 'customers', 'success stories', 'blogs', 'menu', 'previous', 'next', 
              'events & webinars', 'education & training', 'white paper', 'research paper', 'resources', 'terms of use', 
              'news', 'podcasts', 'request a demo', 'get a demo', 'our manifesto', 'schedule a demo', 'request demo',
              'cookie policy', 'blog', 'events', 'webinars', 'downloads', 'about us', 'about', 'advisors', 'contact', 
              'privacy', 'schedule a free demo', 'login', 'read now', 'press releases', 'flip through our magazine', 
              'brochures', 'case studies', 'case study', 'ebooks', 'infographics', 'magazines', 'videos', 'white papers', 
              'discover all our resources', 'learn more about who we are', 'executive team', 'board of directors', 
              'regulatory and compliance', 'see more demos', 'watch the video', 'listen to the podcast', 'news & events', 
              'brand', 'developer network', 'view webinar', 'read story', 'view event', 'log in', 'log out',
              'start the conversation now', 'code of ethics', 'privacy policy', 'terms and conditions', 'skip to content', 
              'home', 'request your demo today', 'read more', 'faq & help center', 'sign in', 'sign out', 'learn more', 
              'get in touch', 'view testimonials', 'terms of service', 'contact us', 'new registration click here', 
              'sites', 'sumo', 'faqs', 'faq', 'terms & conditions', 'site map', 'sitemap', 'customer stories', 
              'support', 'back to top', 'help', 'connect', 'news & blog', 'log in / sign up', 'scroll to top', 'follow', 
              'legal & privacy', 'close menu', 'terms of use and privacy policy', 'email', 'call', 'terms', 'company', 
              'jobs', 'manifesto', 'awards', 'news / events', 'join us', 'meet the team', 'rss', 'get started', 'tour', 
              'read article', 'read more reviews', 'accessibility', 'i understand', 'poland', 'switzerland', 
              'asia pacific', 'czechia', 'denmark', 'germany', 'norway', 'middle east & africa', 'ireland', 'italy', 
              'netherlands', 'france', 'brazil', 'united states', 'austria', 'turkey', 'sweden', 'canada', 'india', 
              'belgium', 'south africa', 'benelux', 'russia', 'united kingdom', 'romania', 'spain', 'slovakia', 
              'australia', 'argentina', 'chile', 'colombia', 'ecuador', 'mexico', 'peru', 'panama', 'portugal', 
              'new zealand', 'online courses', 'to the top', 'to top', 'book an appointment', 'our story', 'view demo', 
              'tutorials', 'demo', 'watch now', 'shop online', 'live demo', 'view demo', 'schedule demo', 'open positions',
              'livechat', 'brochure', 'book a demo', 'call us', 'find us', 'deutsch', 'french', 'japanese', 'portuguese', 
              'english', 'chinese', 'nederlands', 'request a quote', 'accept', 'decline','requestdemotoday', 'history',
              'read the study', 'read the post', 'my account', 'customer portal', 'share your experience', 'back', 
              'shop', 'back to dashboard', 'demos', 'log-in', 'uk & europe', 'asia-pacific', 'register', 'reach us', 
              'term of use', 'help center', 'chat', 'more', 'costa rica', 'croatia', 'czech republic', 'egypt', 'finland', 
              'hong kong', 'hungary', 'indonesia', 'israel', 'japan', 'korea', 'lithuania', 'luxembourg', 'malaysia',
              'morocco', 'nigeria', 'norway', 'philippines', 'qatar', 'saudi arabia', 'singapore', 'thailand', 'taiwan',
              'tunisia', 'united arab emirates', 'vietnam', 'singapore', 'skype', 'whatsapp', 'glossary', 'search'] 

stop_strings = ['download', 'webinar', 'case stud', 'explore', 'whitepaper', 'reviews', 'disclosure', 'register now',
                'press', 'click', 'login', 'blog', 'about', 'scroll', 'council', 'news', 'sign up', 'contact us', 'china',  
                'questions', 'feedback', 'free', 'read story', 'conversation', 'our story', 'terms of use', 'privacy', 
                'cookie', 'view all', 'navigation', 'legal', "let's talk", "let's connect", 'watch video', 'awards', 'faq',
                'leadership','get started', 'more info', 'more details', 'contact support', 'twitter', 'facebook', 
                'youtube', 'linkedin', 'wechat', 'pinterest', 'instagram', 'subscribe', 'try now', 'get a quote', 'http',  
                'javascript', 'my personal information', 'hiring', 'skip to','infographic', 'technical support', 'email us',
                ' more', 'follow us', 'terms of service', 'conditions', 'user agreement', 'not yet registered', 'show me',
                'forgot password', 'careers', 'jobs', 'live chat', 'tour today', 'sign in', 'sign out', 'help guide', 
                'question', 'testimonials', 'for a demo', 'schedule a call', 'google', 'instant demo', 'english', 
                'how it works', 'reference', 'vacanc', 'apply','demo today', 'open position', 'video', 'book demo',
                'item', 'software demo', 'engaging', 'tweet', 'online review', 'learn more', 'e-mail', 'contact form', 
                'buy now', 'open menu', 'full team', 'article', 'social media', 'polic', 'client portal', 'speaker', 
                'customer support', 'code of conduct', 'a demo', 'out more', 'north america', 'podcast', 'help center', 
                'create account', 'chat with us', 'request pricing', 'personal demo', 'acknowledge', 'github', 'android',
                'ios', 'window', 'messenger', 'continue', 'disclaimer', 'meeting', 'go to', 'your demo', 'talk to', 
                'send message', 'we are', 'a message', 'help desk', 'sign-in', 'sign-out', 'contact sales', 'to top',
                '@', r'\.com', r'www\.', '__', '{', '}', r'\.\.', r'\[#', r'\.js']

# Note: Each stop string is used as a reg-ex pattern, so special characters like '.' or '[' must be preceded by the
#       escape character '\'. Raw strings (r'...') keep Python from interpreting that backslash itself
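
Because the stop strings are used as regular expressions, an unescaped `.` matches any character. One way to avoid hand-escaping, sketched here under the assumption that every stop string is meant as literal text, is to let `re.escape` build the pattern. The `plain_stop_strings` subset and the `has_stop_string` helper are hypothetical names for illustration.

```python
import re

# Hypothetical subset of stop strings, written as plain text with no manual escaping
plain_stop_strings = ['.com', 'www.', '[#', 'sign up', '..']

# re.escape makes each string match literally; '|' combines them into one pattern
stop_pattern = re.compile('|'.join(re.escape(s) for s in plain_stop_strings))

def has_stop_string(text):
    """Return True if any stop string occurs anywhere in the lower-cased text."""
    return stop_pattern.search(text.lower()) is not None

print(has_stop_string('Visit www.example.org'))   # contains 'www.'
print(has_stop_string('dot com'))                 # no literal '.com'
```

A single compiled pattern also replaces the inner loop over `stop_strings` with one search per text.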

Main processing section of code:
- For each URL in the input file, the following steps are performed:
    - Download the html page source using browser via Selenium 
    - Parse the page source using BeautifulSoup
    - Eliminate non-printable characters and multiple spaces
    - Loop through the keywords, stop words and stop phrases, and write matching text into the respective columns 
    - Look for duplicate entries and clean up
    - Write an entry into output list / error output list
    - Close the browser
- Load the output list and error output list into dataframes and create respective output files
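
The text clean-up and keyword matching steps can be tried in isolation, without a browser. The sketch below is a simplified stand-in for the loop that follows: `clean_text` and `route` are hypothetical helper names, and `route` uses plain substring checks instead of `re.search`, which is equivalent only for keywords without reg-ex special characters.

```python
import re
import string

# Characters outside string.printable are dropped, as in the notebook
NON_PRINTABLE = re.compile('[^{}]'.format(re.escape(string.printable)))

def clean_text(raw):
    """Strip non-printable characters and collapse runs of whitespace."""
    printable = NON_PRINTABLE.sub('', raw.strip())
    return ' '.join(printable.split())

def route(href, keys_by_column):
    """Return the output columns whose keywords appear in the link's href."""
    href = href.lower()
    return [col for col, keys in keys_by_column.items()
            if any(key in href for key in keys)]

# Subset of the keyword lists defined earlier in the notebook
keys_by_column = {'Products': ['product', 'software', 'consult'],
                  'Services': ['service']}

print(clean_text('  Patient\u00a0 Access\n Solutions '))
print(route('/Products/EHR-Software', keys_by_column))
```

Since non-ASCII letters are not in `string.printable`, a link text such as `Español` loses its `ñ` in this step.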

In [8]:
out = []
out_err = []
i = 1

for row in input_list:                # 'row' avoids shadowing the built-in input()
    
    comp_name = row[0].strip()      # Capture company name
    link = row[1].strip()           # Capture company url
    
    products, solutions, services, platforms, clients, page_text = init_lists() # Initialize lists
        
    try: 
        browser, page_soup = download_browser(link)                             # Download browser and parse page source 
        
        # Pause for 3 seconds between requests (the page source was already captured in download_browser)
        time.sleep(3)
        
        # Capture all tags which have href variable present in them
        u_lists = page_soup.find_all(href = re.compile(r'\S'))
        #u_lists = page_soup.find_all(class_ = re.compile(r'(drop)|(nav)|(menu)|(item)|(sub)'), href = re.compile(r'\S'))
        
        for u_list in u_lists:
            
            # Eliminate all non printable characters in text
            txt = re.sub('[^{}]'.format(string.printable), '', u_list.text.strip())
            
            # Eliminate multiple spaces in text
            txt_list = txt.split()                              
            txt = (' ').join(txt_list)
                     
            if len(txt) > 2 and txt.lower() not in stop_words:                  # Eliminate stop words & unwanted text          
                skip_text = False
                
                for stop_string in stop_strings:                                # Eliminate all stop strings
                    if re.search(stop_string.lower(), txt.lower()):
                        skip_text = True
                        break                        
                
                # Skip text with 10 or more digits (likely a phone number) and keep only strings with fewer than 10 words
                if not skip_text and len(re.sub('[^0-9]', '', txt)) <= 9 and len(txt_list) < 10:
                    href = u_list['href'].strip()
                    
                    # Match on respective key words and write into respective columns 
                    for product in products_key:
                        if re.search(product.lower(), href.lower()):
                            products.append(txt)

                    for solution in solutions_key:
                        if re.search(solution.lower(), href.lower()):
                            solutions.append(txt)

                    for service in services_key:
                        if re.search(service.lower(), href.lower()):
                            services.append(txt)

                    for platform in platforms_key:
                        if re.search(platform.lower(), href.lower()):
                            platforms.append(txt)

                    for client in clients_key:
                        if re.search(client.lower(), href.lower()):
                            clients.append(txt)

                    page_text.append(txt)                                       # Capture all extracted & filtered text 
        
        # Remove duplicates for respective column entries, keeping the first-seen order
        products_clean = list(dict.fromkeys(products))
        solutions_clean = list(dict.fromkeys(solutions))
        services_clean = list(dict.fromkeys(services))
        platforms_clean = list(dict.fromkeys(platforms))
        clients_clean = list(dict.fromkeys(clients))
        page_text_clean = list(dict.fromkeys(page_text))

        write_file()                                                            # Write output record
    
    except Exception as e:                                                      # Write error record for run time errors
        print('url# {}: {} Error: {}'.format(i, link, e))
        write_error_file()
    finally: 
        try:
            browser.quit()                                                      # Quit the browser for every URL
        except Exception:
            pass                                                                # Browser is undefined or already closed when the page load failed
        
        if i % 25 == 0:                                                         # Print after every 25 records
            print('# of URLs processed:', i)
        i = i + 1                                                               # Advance to the number of the next URL


# of URLs processed: 25
# of URLs processed: 50
# of URLs processed: 75
url: http://www.triadretail.com Error: Message: Reached error page: about:neterror?e=dnsNotFound&u=http%3A//www.triadretail.com/&c=UTF-8&d=We%20can%E2%80%99t%20connect%20to%20the%20server%20at%20www.triadretail.com.

# of URLs processed: 100
# of URLs processed: 125
# of URLs processed: 150
url: http://www.textnora.com Error: Message: Reached error page: about:neterror?e=dnsNotFound&u=http%3A//www.textnora.com/&c=UTF-8&d=We%20can%E2%80%99t%20connect%20to%20the%20server%20at%20www.textnora.com.

# of URLs processed: 175
# of URLs processed: 200
# of URLs processed: 225
url: http://medicaldevicedevelopments.services Error: Message: Reached error page: about:neterror?e=dnsNotFound&u=http%3A//medicaldevicedevelopments.services/&c=UTF-8&d=We%20can%E2%80%99t%20connect%20to%20the%20server%20at%20medicaldevicedevelopments.services.

url: http://kalco.om/home/ Error: Message: Reached error page: about:neterror?e=dnsNotFound

 - Load the output list into a DataFrame and export to output CSV file

In [9]:
df_out = pd.DataFrame(out, columns = header)
df_out.to_csv(out_file, index = False, encoding='utf-8')

df_out

Unnamed: 0,Company Name,Link,Products,Solutions,Services,Platforms,Clients,Page_Text
0,MEDHOST,http://www.medhost.com/,,Revenue Cycle Solutions | Patient Access Solut...,Revenue Cycle Services | Cloud Services and Ma...,,General Acute Care and Critical Access Hospita...,Clinical Suite | EDIS Emergency Department | C...
1,nThrive,http://www.nthrive.com/,,Patient Access | Mid-Revenue Cycle | Patient F...,,,,Patient Access | Mid-Revenue Cycle | Patient F...
2,TrialStat Solutions Inc.,https://www.trialstat.com,,,Services,,,eClinical Technology | TrialStat EDC | Randomi...
3,Curve Dental,http://www.curvedental.com,Charting | Scheduling | Billing | Reporting | ...,,,,Integration Partners,Features | Charting | Scheduling | Billing | S...
4,VetSuccess,http://www.vetsuccess.com,Products | Performance Reports | Practice Over...,Practice Overview Report | Preventive Care Sna...,,,Lapsing Client Toolkit,Products | Performance Reports | Practice Over...
...,...,...,...,...,...,...,...,...
737,"Chetu, Inc.",https://www.chetu.com/healthcare.php,Embedded Software | Product Lifecycle Manageme...,Healthcare | Animation & Graphic Design | Arti...,Field Service Management | Location Based Serv...,,Web Hosting | Our Clients | Channel Partner,Espaol | Industries | Agriculture | Aviation |...
738,TractManager,https://www.tractmanager.com/,Provider Credentialing Software | Provider Dir...,All Solutions | Value Management | Solutions,CVO Services,,,SOLUTIONS | All Solutions | Provider Data Mana...
739,Claimocity,https://claimocity.com,Top Technology Partners | PracticeSuite | Prac...,PracticeSuite | Practice Management | Practice...,Credentialing Services,,Top Technology Partners,Top Technology Partners | PracticeSuite | Cont...
740,Bridge Connector,https://bridgeconnector.co,,,,,,


 - Load the output error list into a DataFrame and export to output error CSV file

In [10]:
df_out_err = pd.DataFrame(out_err, columns = header_err)
df_out_err.to_csv(out_file_err, index = False, encoding='utf-8')

df_out_err

Unnamed: 0,Company Name,Link,Error
0,Triad,http://www.triadretail.com,Message: Reached error page: about:neterror?e=...
1,Nora,http://www.textnora.com,Message: Reached error page: about:neterror?e=...
2,PROTOMED SA,http://medicaldevicedevelopments.services,Message: Reached error page: about:neterror?e=...
3,Khalifa Al Hinai Advocates and Legal Consultancy,http://kalco.om/home/,Message: Reached error page: about:neterror?e=...
4,EHR1,https://ehr1.com/,Message: Reached error page: about:neterror?e=...
5,Learning Track,https://learningtrack.com/,Message: Reached error page: about:neterror?e=...
6,SRIT India Pvt. Ltd.,http://www.sritindia.com,Message: Reached error page: about:neterror?e=...
7,MedSMART,http://med-smart.org,Message: Reached error page: about:neterror?e=...
8,RL Solutions,http://www.rlsolutions.com/,Message: Reached error page: about:neterror?e=...
9,Infolinx,https://www.infolinx.com/healthcare/,Message: Reached error page: about:neterror?e=...
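
The error table truncates each message, but the run log above shows Firefox messages of the form `about:neterror?e=dnsNotFound&...`. A small sketch, assuming a hypothetical `neterror_code` helper that is not part of the notebook, extracts the `e=` value so that failures can be grouped by cause:

```python
from collections import Counter
from urllib.parse import parse_qs

def neterror_code(message):
    """Return the 'e=' value of a Firefox 'about:neterror?...' message, or None."""
    marker = 'about:neterror?'
    start = message.find(marker)
    if start == -1:
        return None
    # Everything after the marker is a URL query string
    query = message[start + len(marker):]
    return parse_qs(query).get('e', [None])[0]

# First message copied from the run log above; the second is a made-up non-matching case
messages = [
    'Message: Reached error page: about:neterror?e=dnsNotFound',
    'some other failure',
]
print(Counter(neterror_code(m) for m in messages))
```

Applied to the `Error` column before truncation, the same helper would show how many URLs failed on DNS lookup versus other causes.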
