# Scraping Maryland Business Licenses with Selenium

Maryland has a [great portal](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp) for searching business licenses. The only catch is that you have to check a box before it lets you in.

1. Try to visit [the public search page](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp)
2. Get redirected to an "I agree to this" page. Check the box saying you've read the disclaimer, then click "Enter the Site".
3. Click "Search License Records" down at the bottom of the page
4. You're now on the search page! From the "Jurisdiction" dropdown, select "Statewide"
5. In the "Trade Name" field, type "Vap%" to try to find vape shops
6. Click "Next" in the bottom right-hand corner to go to the next page
7. Click "Click for detail" to see the details for a specific business license.

That's a lot of stuff! **Let's get to work.**

In [None]:
# To press Enter in a field:
# from selenium.webdriver.common.keys import Keys
# element.send_keys(Keys.RETURN)

# To use a select:
# from selenium.webdriver.support.ui import Select
# select = Select(driver.find_element_by_name('phy_city'))
# select.select_by_visible_text('Houston')

## Preparation

### When you search for a business license, what URL should Selenium try to visit first?

In [None]:
# https://jportal.mdcourts.gov/license/pbPublicSearch.jsp

**It isn't going to work, though! It's going to redirect to that intro page.** If your browser has already accepted the disclaimer, open the site in an *Incognito* window to walk through the "check the box" pages again.

### How will you identify the checkbox to check it?

Selenium can find elements by:

- Name
- Tag name
- Class name
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

- *TIP: I have a secret awesome way to do this, but you have to ask.*

In [None]:
# It has an ID: 'checkbox'
# Or we can use XPath
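
If XPath is new to you, here's a minimal sketch of the two styles used in this notebook: finding an element by one of its attributes, and finding it by its position in the page. It runs on a made-up snippet of markup (an assumption, not the real disclaimer page) with Python's built-in `xml.etree.ElementTree`, which understands a small subset of XPath, so you can try it without opening a browser.

```python
import xml.etree.ElementTree as ET

# Made-up markup loosely imitating a disclaimer page; NOT the real portal HTML.
SAMPLE = """
<html>
  <body>
    <form action="/enter">
      <input id="checkbox" type="checkbox" name="agree" />
      <input type="submit" value="Enter the Site" />
    </form>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# By attribute, anywhere in the tree: the same idea as //*[@id="checkbox"]
by_id = root.find(".//*[@id='checkbox']")

# By position: the second <input> inside the form
by_position = root.find("./body/form/input[2]")

print(by_id.tag, by_id.get("name"))
print(by_position.get("value"))
```

The attribute version keeps working if the page layout shifts a little. The positional version breaks as soon as someone adds another input to the form.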

### How will you identify the button to select it, or the form to submit it?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

In [None]:
# There is only one form tag on this page, so we can select it and call .submit() on it

### Once you're on the next page, how will you click the "SEARCH LICENSE RECORDS" link or follow its URL?

In [None]:
# Search for the link with XPath and click on it with click()

### Now you're on the form page. How do you pick "Statewide" for the jurisdiction dropdown?

In [None]:
# Select(driver.find_element_by_xpath('//*[@id="slcJurisdiction"]')).select_by_visible_text('Statewide')

### How do you type "vap%" into the Trade Name field?

In [None]:
# driver.find_element_by_xpath('//*[@id="txtTradeName"]').send_keys('vap%')

### How do you click the submit button or submit the form?

In [None]:
# driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/form').submit()

### How can you find and click the 'Next' button on the search results page?

In [None]:
# driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a')
# Or, it's the last link of the table with btmnavtable class

## Okay, let's scrape!

### Use Selenium to search for vape shops statewide, and then click through until it reaches the last page.

Don't scrape yet!

In [73]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time
import pandas as pd

driver = webdriver.Chrome('/Users/mathieurudaz/Desktop/LEDE/chromedriver')
driver.get('https://jportal.mdcourts.gov/license/pbPublicSearch.jsp')

In [74]:
driver.find_element_by_xpath('//*[@id="checkbox"]').click()
driver.find_element_by_xpath('/html/body/table/tbody/tr[7]/td/form/div/input[2]').click()

In [75]:
driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[6]/td[2]/a[2]').click()

In [76]:
Select(driver.find_element_by_xpath('//*[@id="slcJurisdiction"]')).select_by_visible_text('Statewide')

In [77]:
driver.find_element_by_xpath('//*[@id="txtTradeName"]').send_keys('vap%')

In [78]:
driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/form').submit()

### Use Selenium to scrape the first page of search results for vape shops statewide (well, we'll try).

- *TIP: You can find elements/text using Selenium, or hand the page to BeautifulSoup with `doc = BeautifulSoup(driver.page_source, 'html.parser')` (after `from bs4 import BeautifulSoup`)*

I've included some BeautifulSoup code that might be helpful. If you use it, **ask me how I made it.** It's important.

In [None]:
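# IF YOU ARE USING BEAUTIFULSOUP, HERE IS SOME SAMPLE CODE
# IF YOU USE THIS CODE, ASK ME HOW I MADE IT. IT'S IMPORTANT.
from bs4 import BeautifulSoup

# Parse whatever page the Selenium browser is currently showing
doc = BeautifulSoup(driver.page_source, 'html.parser')
business_headers = doc.find_all('tr', class_='searchfieldtitle')
len(business_headers)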

In [None]:
# You'll probably need to find specific tds inside
# of the rows instead of looking at the whole thing.
for header in business_headers:
    rows = header.find_next_siblings('tr')
    print("HEADER is", header.text.strip())
    print("ROW 0 IS", rows[0].text.strip())
    print("ROW 1 IS", rows[1].text.strip())
    print("ROW 2 IS", rows[2].text.strip())
    print("ROW 3 IS", rows[3].text.strip())
    print("----")
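
The sample above prints the four rows that follow each header. If you instead collect every row on the page into one flat list, you need a way to hand each business its own group of four. Here's a minimal sketch of that grouping. `chunk` is a hypothetical helper, and the sample strings are copied from the first two search results, not fetched live.

```python
def chunk(items, size):
    """Split a flat list into consecutive groups of `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Two headers and their eight rows, as they appear in the search results.
headers = ["VAPE IT STORE I", "VAPE IT STORE II"]
rows = [
    "AMIN NARGIS Lic. Status: Issued",
    "1724 N SALISBURY BLVD UNIT 2 License: 22173807",
    "SALISBURY, MD 21801 Issued Date: 4/27/2017",
    "Wicomico County",
    "AMIN NARGIS Lic. Status: Issued",
    "1015 S SALISBURY BLVD License: 22173808",
    "SALISBURY, MD 21801 Issued Date: 4/27/2017",
    "Wicomico County",
]

# Pair each header with its own group of four rows.
businesses = [
    {"name": header, "rows": group}
    for header, group in zip(headers, chunk(rows, 4))
]

for business in businesses:
    print(business["name"], "->", business["rows"][1])
```

This only works if every business really has exactly four rows. If one result is missing a row, everything after it will be off by one, so it's worth checking that the number of rows is four times the number of headers.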

In [212]:
business_headers = driver.find_elements_by_class_name('searchfieldtitle')
rows = driver.find_elements_by_class_name('tablecelltext')
row_index = 0
business_list = []


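# Selenium's .text returns everything inside a cell, including the text of
# any nested <span>. This helper removes the span's text and returns only
# what is left. It returns None when the cell is empty.
def get_text_without_tags(web_element):
    raw_text = web_element.text
    if not raw_text:
        return None
    # find_elements returns an empty list instead of raising if there is no span
    spans = web_element.find_elements_by_xpath('.//span')
    tag_text = spans[0].text if spans else ''
    return raw_text.replace(tag_text, '').strip()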
        
    
for header in business_headers:
    if len(header.find_elements_by_tag_name('a')) > 0:
        url = header.find_elements_by_tag_name('a')[0].get_attribute('href')
    else:
        url = None
    
    business_list.append({
        "name": header.find_element_by_tag_name('span').text.strip(),
        "adress": (rows[row_index].text.strip() + '\n' +
                   rows[row_index + 1].text.strip() + '\n' +
                   rows[row_index + 2].text.strip() + '\n' +
                   rows[row_index + 3].text.strip()),
        "url": url,
        "lic_status": get_text_without_tags(rows[row_index].find_elements_by_tag_name('td')[2]),
        "license": get_text_without_tags(rows[row_index + 1].find_elements_by_tag_name('td')[2]),
        "issues_date": get_text_without_tags(rows[row_index + 2].find_elements_by_tag_name('td')[2])
    })
    
    row_index += 4

business_list

[{'adress': 'AMIN NARGIS Lic. Status: Issued\n1724 N SALISBURY BLVD UNIT 2 License: 22173807\nSALISBURY, MD 21801 Issued Date: 4/27/2017\nWicomico County',
  'issues_date': '4/27/2017',
  'lic_status': 'Issued',
  'license': '22173807',
  'name': 'VAPE IT STORE I',
  'url': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=PYc0oqEAnBc%3D'},
 {'adress': 'AMIN NARGIS Lic. Status: Issued\n1015 S SALISBURY BLVD License: 22173808\nSALISBURY, MD 21801 Issued Date: 4/27/2017\nWicomico County',
  'issues_date': '4/27/2017',
  'lic_status': 'Issued',
  'license': '22173808',
  'name': 'VAPE IT STORE II',
  'url': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=3Cdtgrdpq1s%3D'},
 {'adress': 'ANJ DISTRIBUTIONS LLC Lic. Status: Issued\n2299 JOHNS HOPKINS ROAD License: 02104436\nGAMBRILLS, MD 21054 Issued Date: 4/05/2017\nAnne Arundel County',
  'issues_date': '4/05/2017',
  'lic_status': 'Issued',
  'license': '02104436',
  'name': 'VAPEPAD THE',
  'url': 'https://jportal
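
Notice that the `adress` values above still have the labels and their values mixed in, because `.text` on a whole row returns everything in it. One way to untangle that after the fact is a regular expression. This is only a sketch: `split_row` is a hypothetical helper, and the three labels are assumed from the output above rather than from the page's HTML.

```python
import re

# The three labels seen in the scraped text; anything after a label is its value.
LABEL_PATTERN = re.compile(r"\s*(Lic\. Status|License|Issued Date):\s*(.*)$")


def split_row(line):
    """Split one scraped line into (left_text, label, value)."""
    match = LABEL_PATTERN.search(line)
    if match is None:
        # No label on this line (for example, the county line)
        return line.strip(), None, None
    return line[:match.start()].strip(), match.group(1), match.group(2).strip()


# The first result's combined text, copied from the output above.
combined = (
    "AMIN NARGIS Lic. Status: Issued\n"
    "1724 N SALISBURY BLVD UNIT 2 License: 22173807\n"
    "SALISBURY, MD 21801 Issued Date: 4/27/2017\n"
    "Wicomico County"
)

parsed = [split_row(line) for line in combined.split("\n")]
address = "\n".join(left for left, _, _ in parsed)
fields = {label: value for _, label, value in parsed if label}

print(address)
print(fields)
```

Cleaning the text this way is a fallback. Selecting the specific `td` you want in the first place is usually less fragile than parsing a blob of text afterwards.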

### Save these into `vape-results.csv`

In [205]:
df = pd.DataFrame(business_list)
df.head()

|  | adress | issues_date | lic_status | license | name | url |
|---|---|---|---|---|---|---|
| 0 | AMIN NARGIS Lic. Status: Issued\n1724 N SALISB... | 4/27/2017 | Issued | 22173807.0 | VAPE IT STORE I | https://jportal.mdcourts.gov/license/pbLicense... |
| 1 | AMIN NARGIS Lic. Status: Issued\n1015 S SALISB... | 4/27/2017 | Issued | 22173808.0 | VAPE IT STORE II | https://jportal.mdcourts.gov/license/pbLicense... |
| 2 | ANJ DISTRIBUTIONS LLC Lic. Status: Issued\n229... | 4/05/2017 | Issued | 2104436.0 | VAPEPAD THE | https://jportal.mdcourts.gov/license/pbLicense... |
| 3 | COX TRADING COMPANY L L C Lic. Status: Issued\... | 5/31/2017 | Issued | 17165957.0 | VAPE FROG | https://jportal.mdcourts.gov/license/pbLicense... |
| 4 | COX TRADING LLC Lic. Status: Pending\n346 RITC... |  | Pending |  | VAPE FROG |  |


In [206]:
df.to_csv('vape-results.csv', index=False)

### Open `vape-results.csv` to make sure there aren't any extra weird columns

In [207]:
pd.read_csv('vape-results.csv').head()

|  | adress | issues_date | lic_status | license | name | url |
|---|---|---|---|---|---|---|
| 0 | AMIN NARGIS Lic. Status: Issued\n1724 N SALISB... | 4/27/2017 | Issued | 22173807.0 | VAPE IT STORE I | https://jportal.mdcourts.gov/license/pbLicense... |
| 1 | AMIN NARGIS Lic. Status: Issued\n1015 S SALISB... | 4/27/2017 | Issued | 22173808.0 | VAPE IT STORE II | https://jportal.mdcourts.gov/license/pbLicense... |
| 2 | ANJ DISTRIBUTIONS LLC Lic. Status: Issued\n229... | 4/05/2017 | Issued | 2104436.0 | VAPEPAD THE | https://jportal.mdcourts.gov/license/pbLicense... |
| 3 | COX TRADING COMPANY L L C Lic. Status: Issued\... | 5/31/2017 | Issued | 17165957.0 | VAPE FROG | https://jportal.mdcourts.gov/license/pbLicense... |
| 4 | COX TRADING LLC Lic. Status: Pending\n346 RITC... |  | Pending |  | VAPE FROG |  |
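
One thing to watch for in the results: license `02104436` shows up as `2104436.0`. A column that looks numeric gets converted to a float when it's read, and the leading zero is lost. Here's a small sketch with the standard library's `csv` module (not pandas), using two values from the results, to show that the zero survives as long as the value is treated as text.

```python
import csv
import io

# Two rows taken from the scraped results; licenses are stored as strings.
rows = [
    {"name": "VAPEPAD THE", "license": "02104436"},
    {"name": "VAPE IT STORE I", "license": "22173807"},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "license"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the csv module leaves every value as a string.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))

print(read_back[0]["license"])
```

With pandas, passing `dtype={'license': str}` to `pd.read_csv` asks it to keep that column as text instead of guessing a numeric type.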


### Use Selenium to scrape ALL pages of results, save the results into `vape-results-all.csv`.

In [211]:
vape_results = []
next_button = True

# Click on next until the button disappears
while next_button:    
    business_headers = driver.find_elements_by_class_name('searchfieldtitle')
    rows = driver.find_elements_by_class_name('tablecelltext')
    row_index = 0
    
    for header in business_headers:
        if len(header.find_elements_by_tag_name('a')) > 0:
            url = header.find_elements_by_tag_name('a')[0].get_attribute('href')
        else:
            url = None

        vape_results.append({
            "name": header.find_element_by_tag_name('span').text.strip(),
            "adress": (rows[row_index].text.strip() + '\n' +
                       rows[row_index + 1].text.strip() + '\n' +
                       rows[row_index + 2].text.strip() + '\n' +
                       rows[row_index + 3].text.strip()),
            "url": url,
            "lic_status": get_text_without_tags(rows[row_index].find_elements_by_tag_name('td')[2]),
            "license": get_text_without_tags(rows[row_index + 1].find_elements_by_tag_name('td')[2]),
            "issues_date": get_text_without_tags(rows[row_index + 2].find_elements_by_tag_name('td')[2])
        })

        row_index += 4
    
    next_button = len(driver.find_elements_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a')) > 0
    
    if next_button:
        driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a').click()
        time.sleep(2)

vape_results

[{'adress': 'AMIN NARGIS Lic. Status: Issued\n1724 N SALISBURY BLVD UNIT 2 License: 22173807\nSALISBURY, MD 21801 Issued Date: 4/27/2017\nWicomico County',
  'issues_date': '4/27/2017',
  'lic_status': 'Issued',
  'license': '22173807',
  'name': 'VAPE IT STORE I',
  'url': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=PYc0oqEAnBc%3D'},
 {'adress': 'AMIN NARGIS Lic. Status: Issued\n1015 S SALISBURY BLVD License: 22173808\nSALISBURY, MD 21801 Issued Date: 4/27/2017\nWicomico County',
  'issues_date': '4/27/2017',
  'lic_status': 'Issued',
  'license': '22173808',
  'name': 'VAPE IT STORE II',
  'url': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=3Cdtgrdpq1s%3D'},
 {'adress': 'ANJ DISTRIBUTIONS LLC Lic. Status: Issued\n2299 JOHNS HOPKINS ROAD License: 02104436\nGAMBRILLS, MD 21054 Issued Date: 4/05/2017\nAnne Arundel County',
  'issues_date': '4/05/2017',
  'lic_status': 'Issued',
  'license': '02104436',
  'name': 'VAPEPAD THE',
  'url': 'https://jportal
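
The loop above mixes the "keep clicking Next" logic with the Selenium calls, which makes it hard to check without a browser open. Here's a minimal sketch of the same pattern with the page access passed in as functions. Everything in it is hypothetical (`scrape_all_pages`, `FakeBrowser`, the `max_pages` guard). In the real notebook the three functions would wrap the `find_elements` and `click` calls.

```python
def scrape_all_pages(get_rows, has_next, go_next, max_pages=1000):
    """Collect rows from every page, stopping when there is no Next link."""
    results = []
    for _ in range(max_pages):  # guard against looping forever
        results.extend(get_rows())
        if not has_next():
            break
        go_next()
    return results


class FakeBrowser:
    """Stand-in for a Selenium driver, paging through lists in memory."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def rows(self):
        return list(self.pages[self.index])

    def has_next(self):
        return self.index < len(self.pages) - 1

    def click_next(self):
        self.index += 1


browser = FakeBrowser([["A", "B"], ["C"], ["D", "E"]])
all_rows = scrape_all_pages(browser.rows, browser.has_next, browser.click_next)
print(all_rows)
```

The rows are read before checking for the Next link, so the last page still gets scraped even though it has no link to click.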

In [215]:
df_vape_results_all = pd.DataFrame(vape_results)
df_vape_results_all.head()

|  | adress | issues_date | lic_status | license | name | url |
|---|---|---|---|---|---|---|
| 0 | AMIN NARGIS Lic. Status: Issued\n1724 N SALISB... | 4/27/2017 | Issued | 22173807.0 | VAPE IT STORE I | https://jportal.mdcourts.gov/license/pbLicense... |
| 1 | AMIN NARGIS Lic. Status: Issued\n1015 S SALISB... | 4/27/2017 | Issued | 22173808.0 | VAPE IT STORE II | https://jportal.mdcourts.gov/license/pbLicense... |
| 2 | ANJ DISTRIBUTIONS LLC Lic. Status: Issued\n229... | 4/05/2017 | Issued | 2104436.0 | VAPEPAD THE | https://jportal.mdcourts.gov/license/pbLicense... |
| 3 | COX TRADING COMPANY L L C Lic. Status: Issued\... | 5/31/2017 | Issued | 17165957.0 | VAPE FROG | https://jportal.mdcourts.gov/license/pbLicense... |
| 4 | COX TRADING LLC Lic. Status: Pending\n346 RITC... |  | Pending |  | VAPE FROG |  |


In [220]:
df_vape_results_all.to_csv('vape-results-all.csv', index=False)

In [221]:
pd.read_csv('vape-results-all.csv').head()

|  | adress | issues_date | lic_status | license | name | url |
|---|---|---|---|---|---|---|
| 0 | AMIN NARGIS Lic. Status: Issued\n1724 N SALISB... | 4/27/2017 | Issued | 22173807.0 | VAPE IT STORE I | https://jportal.mdcourts.gov/license/pbLicense... |
| 1 | AMIN NARGIS Lic. Status: Issued\n1015 S SALISB... | 4/27/2017 | Issued | 22173808.0 | VAPE IT STORE II | https://jportal.mdcourts.gov/license/pbLicense... |
| 2 | ANJ DISTRIBUTIONS LLC Lic. Status: Issued\n229... | 4/05/2017 | Issued | 2104436.0 | VAPEPAD THE | https://jportal.mdcourts.gov/license/pbLicense... |
| 3 | COX TRADING COMPANY L L C Lic. Status: Issued\... | 5/31/2017 | Issued | 17165957.0 | VAPE FROG | https://jportal.mdcourts.gov/license/pbLicense... |
| 4 | COX TRADING LLC Lic. Status: Pending\n346 RITC... |  | Pending |  | VAPE FROG |  |
