# Scraping Maryland Business Licenses with Selenium

Maryland has a [great portal](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp) for searching business licenses. The only problem is that you have to check a box before you can get in. Done by hand, the whole process looks like this:

1. Try to visit [the public search page](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp)
2. Get redirected to an "I agree to this" page. Check the box saying you've read the disclaimer, then click "Enter the Site".
3. Click "Search License Records" down at the bottom of the page
4. You're now on the search page! From the "Jurisdiction" dropdown, select "Statewide"
5. In the "Trade Name" field, type "Vap%" to try to find vape shops
6. Click "Next" in the bottom right-hand corner to go to the next page
7. Click "Click for detail" to see the details for a specific business license.

That's a lot of stuff! **Let's get to work.**

## Preparation

### When you search for a business license, what URL should Selenium try to visit first?

In [1]:
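from selenium import webdriver
from selenium.webdriver.common.keys import Keys
# Note: find_element_by_xpath() and the other find_element_by_* helpers used in
# this notebook were removed in Selenium 4.3. On newer versions, import By from
# selenium.webdriver.common.by and call driver.find_element(By.XPATH, ...) instead.
driver = webdriver.Chrome()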

In [2]:
# url = "https://jportal.mdcourts.gov/license/index_disclaimer.jsp"

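**Visiting the search page directly isn't going to work, though! It's going to redirect to that intro page,** which is why Selenium starts at the disclaimer page instead. You can use *Incognito mode* in your own browser to go back through the "check the box, etc." series of pages.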

### How will you identify the checkbox to check it?

Selenium can find elements by:

- Name
- Class
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

- *TIP: I have a secret awesome way to do this, but you have to ask.*
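
To make the choice of strategy concrete, here is a minimal sketch that tries several locators for the same checkbox, in order, and returns the first match. It assumes the checkbox has `id="checkbox"` (taken from the XPath used in this notebook). The names `CHECKBOX_LOCATORS` and `find_first` are hypothetical, made up for illustration. The strategy strings `"id"`, `"css selector"` and `"xpath"` are the values behind Selenium's `By.ID`, `By.CSS_SELECTOR` and `By.XPATH` constants.

```python
# Candidate locators for the disclaimer checkbox, tried in this order.
# Each pair is (strategy, value). The id "checkbox" is an assumption
# based on the XPath used elsewhere in this notebook.
CHECKBOX_LOCATORS = [
    ("id", "checkbox"),
    ("css selector", "#checkbox"),
    ("xpath", '//*[@id="checkbox"]'),
]


def find_first(driver, locators):
    """Return the first element matched by any locator, or None.

    find_elements() returns an empty list instead of raising when nothing
    matches, so no exception handling is needed here.
    """
    for strategy, value in locators:
        matches = driver.find_elements(strategy, value)
        if matches:
            return matches[0]
    return None
```

With a live browser this would be used as `checkbox = find_first(driver, CHECKBOX_LOCATORS)`, followed by `checkbox.click()` if the result is not `None`.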

In [3]:
# checkbox = driver.find_element_by_xpath('//*[@id="checkbox"]')
# checkbox.click()

### How will you identify the button to select it, or the form to submit it?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

In [4]:
# button = driver.find_element_by_xpath('/html/body/table/tbody/tr[7]/td/form/div/input[2]')
# button.click()

### Once you're on the next page, how will you click the "SEARCH LICENSE RECORDS" link or follow its URL?

In [5]:
# link = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[6]/td[2]/a[2]')
# link.click()

### Now you're on the form page. How do you pick "Statewide" for the jurisdiction dropdown?

In [6]:
# from selenium.webdriver.support.ui import Select
# dropdown = driver.find_element_by_xpath('//*[@id="slcJurisdiction"]')
# select = Select(dropdown)
# select.select_by_visible_text("Statewide")

### How do you type "vap%" into the Trade Name field?

In [7]:
# tradeName = driver.find_element_by_xpath('//*[@id="txtTradeName"]')
# tradeName.send_keys("vap%")

### How do you click the submit button or submit the form?

In [8]:
# searchButton = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/form/table/tbody/tr[14]/td/input[1]')
# searchButton.click()

### How can you find and click the 'Next' button on the search results page?

In [9]:
# nextButton = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr')
# nextButton.click()

# Okay, let's scrape!

### Use Selenium to search for vape shops statewide, and then click through until it reaches the last page.

Don't scrape yet!

In [18]:
# Pulls in ability to control a web browser using Selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
# Tell Selenium to open browser 
driver = webdriver.Chrome() 
# In order to get Selenium to visit a page, use driver.get()
driver.get('https://jportal.mdcourts.gov/license/index_disclaimer.jsp')
checkbox = driver.find_element_by_xpath('//*[@id="checkbox"]')
checkbox.click()
button = driver.find_element_by_xpath('/html/body/table/tbody/tr[7]/td/form/div/input[2]')
button.click()
link = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[6]/td[2]/a[2]')
link.click()

dropdown = driver.find_element_by_xpath('//*[@id="slcJurisdiction"]')
select = Select(dropdown)
select.select_by_visible_text("Statewide")
tradeName = driver.find_element_by_xpath('//*[@id="txtTradeName"]')
tradeName.send_keys("vap%")

searchButton = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/form/table/tbody/tr[14]/td/input[1]')
searchButton.click()

nextButton = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr')
nextButton.click()
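
The cell above clicks "Next" only once. To keep going until the last page, one option is a loop that stops as soon as the "Next" link can no longer be found. The sketch below assumes the last page simply has no element at that XPath. The helper name `click_through_pages` and the `max_pages` safety cap are hypothetical. It uses `find_elements("xpath", ...)`, which returns an empty list rather than raising when nothing matches.

```python
NEXT_XPATH = '/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr'


def click_through_pages(driver, next_xpath=NEXT_XPATH, max_pages=500):
    """Click "Next" until it disappears; return how many pages were visited.

    max_pages is a safety cap (an arbitrary assumption) so that a page which
    always shows a "Next" link cannot trap the loop forever.
    """
    pages_seen = 1  # the page we start on
    while pages_seen < max_pages:
        next_links = driver.find_elements("xpath", next_xpath)
        if not next_links:
            break  # no "Next" link: assume this is the last page
        next_links[0].click()
        pages_seen += 1
    return pages_seen
```

Calling `click_through_pages(driver)` right after submitting the search should leave the browser on the final page of results, if the assumption about the missing "Next" link holds.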

### Use Selenium to scrape the first page of search results for vape shops statewide (well, we'll try).

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source)`*

I've included some BeautifulSoup code that might be helpful. If you use it, **ask me how I made it.** It's important.

In [21]:
# IF YOU ARE USING BEAUTIFULSOUP, HERE IS SOME SAMPLE CODE
# IF YOU USE THIS CODE, ASK ME HOW I MADE IT. IT'S IMPORTANT.
from bs4 import BeautifulSoup
doc = BeautifulSoup(driver.page_source, "html.parser")
business_headers = doc.find_all('tr',class_='searchfieldtitle')
business_headers

[<tr class="searchfieldtitle">
 <td class="searchlistnumber">1.</td>
 <td class="searchlistitem"><span class="copybold">VAPE IT STORE I</span></td>
 <td><a href="pbLicenseDetail.jsp?owi=KdVIRFZaRSk%3D"><img alt="Click for Detail of VAPE IT STORE I" src="images/link_click-detail.gif"/></a></td>
 </tr>, <tr class="searchfieldtitle">
 <td class="searchlistnumber">2.</td>
 <td class="searchlistitem"><span class="copybold">VAPE IT STORE II</span></td>
 <td><a href="pbLicenseDetail.jsp?owi=cwUAHU2nGzk%3D"><img alt="Click for Detail of VAPE IT STORE II" src="images/link_click-detail.gif"/></a></td>
 </tr>, <tr class="searchfieldtitle">
 <td class="searchlistnumber">3.</td>
 <td class="searchlistitem"><span class="copybold">VAPEPAD THE</span></td>
 <td><a href="pbLicenseDetail.jsp?owi=tqd4cn5Q%2BEw%3D"><img alt="Click for Detail of VAPEPAD THE" src="images/link_click-detail.gif"/></a></td>
 </tr>, <tr class="searchfieldtitle">
 <td class="searchlistnumber">4.</td>
 <td class="searchlistitem"><

In [23]:
# You'll probably need to find specific tds inside
# of the rows instead of looking at the whole thing.
for header in business_headers:
    rows = header.find_next_siblings('tr')
    print(header.text.strip())
    print(rows[0].text.strip())
    print(rows[1].text.strip())
    print(rows[2].text.strip())
    print(rows[3].text.strip())
    print("___")

1.
VAPE IT STORE I
AMIN NARGIS
Lic. Status: Issued
1724 N SALISBURY BLVD UNIT 2
License: 22173807
SALISBURY, MD 21801
Issued Date: 4/27/2017
Wicomico County
___
2.
VAPE IT STORE II
AMIN NARGIS
Lic. Status: Issued
1015 S SALISBURY BLVD
License: 22173808
SALISBURY, MD 21801
Issued Date: 4/27/2017
Wicomico County
___
3.
VAPEPAD THE
ANJ DISTRIBUTIONS LLC
Lic. Status: Issued
2299 JOHNS HOPKINS ROAD
License: 02104436
GAMBRILLS, MD 21054
Issued Date: 4/05/2017
Anne Arundel County
___
4.
VAPE FROG
COX TRADING COMPANY L L C
Lic. Status: Issued
110 S. PINEY RD
License: 17165957
CHESTER, MD 21619
Issued Date: 5/31/2017
Queen Anne's County
___
5.
VAPE FROG
Pending *
COX TRADING LLC
Lic. Status: Pending
346 RITCHIE HIGHWAY
SEVERNA PARK, MD 21146
Anne Arundel County
___
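
The printed text shows that several cells follow a "Label: value" pattern (for example `Lic. Status: Issued`), while others are plain text. If you work from the text rather than from the tags, splitting at the first colon separates the two parts. This is only a sketch: `split_label` is a hypothetical helper, and the sample strings are copied from the output above.

```python
def split_label(text):
    """Split 'Label: value' at the first colon and return (label, value).

    Text without a colon comes back as ('', text), so callers can tell
    plain cells (names, addresses, counties) from labelled ones.
    """
    label, sep, value = text.partition(":")
    if not sep:
        return "", text.strip()
    return label.strip(), value.strip()


for cell in ["Lic. Status: Issued", "License: 22173807", "Wicomico County"]:
    print(split_label(cell))
```

This is an alternative to reaching for `.find('span').next_sibling`; it trades precision for simplicity and would misfire on a plain cell that happens to contain a colon.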


In [24]:
list_companies = []
for header in business_headers:
    company = {}
    rows = header.find_next_siblings('tr')
    company['trade_name'] = header.find_all('td')[1].text.strip()
    company['company_name'] = rows[0].find_all('td')[1].text.strip()
    company['city_name'] = rows[2].find_all('td')[1].text.strip()
    company['address'] = rows[1].find_all('td')[1].text.strip()
    company['county'] = rows[3].find_all('td')[1].text.strip()
    company_url = header.find_all('td')[2].find('a', href=True)
    if company_url:
        company['detail_link'] = 'https://jportal.mdcourts.gov/license/'+company_url['href']
    else:
        company['detail_link'] = ''
    license_number = rows[1].find_all('td')[2].find('span')
    if license_number is not None:
        company['license_status'] = rows[0].find_all('td')[2].find('span').next_sibling.strip()
        company['license_number'] = rows[1].find_all('td')[2].find('span').next_sibling.strip()
        company['issue_date'] = rows[2].find_all('td')[2].find('span').next_sibling.strip()
    else:
        company['license_status'] = rows[0].find_all('td')[2].find('span').next_sibling.strip()
        company['license_number'] = ""
        company['issue_date'] = ""
    list_companies.append(company)
list_companies

[{'address': '1724 N SALISBURY BLVD UNIT 2',
  'city_name': 'SALISBURY, MD 21801',
  'company_name': 'AMIN NARGIS',
  'county': 'Wicomico County',
  'detail_link': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=KdVIRFZaRSk%3D',
  'issue_date': '4/27/2017',
  'license_number': '22173807',
  'license_status': 'Issued',
  'trade_name': 'VAPE IT STORE I'},
 {'address': '1015 S SALISBURY BLVD',
  'city_name': 'SALISBURY, MD 21801',
  'company_name': 'AMIN NARGIS',
  'county': 'Wicomico County',
  'detail_link': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=cwUAHU2nGzk%3D',
  'issue_date': '4/27/2017',
  'license_number': '22173808',
  'license_status': 'Issued',
  'trade_name': 'VAPE IT STORE II'},
 {'address': '2299 JOHNS HOPKINS ROAD',
  'city_name': 'GAMBRILLS, MD 21054',
  'company_name': 'ANJ DISTRIBUTIONS LLC',
  'county': 'Anne Arundel County',
  'detail_link': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=tqd4cn5Q%2BEw%3D',
  'issue_dat
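
The `detail_link` values above are built by gluing the site prefix onto the relative `href`. The standard library's `urllib.parse.urljoin` performs the same kind of resolution a browser does, which can be a little more forgiving about slashes. A minimal sketch, where `BASE_URL` and `absolute_link` are hypothetical names:

```python
from urllib.parse import urljoin

# Prefix used for the detail links in this notebook
BASE_URL = "https://jportal.mdcourts.gov/license/"


def absolute_link(href, base=BASE_URL):
    """Resolve a relative href against the base URL; empty input stays empty."""
    if not href:
        return ""
    return urljoin(base, href)


print(absolute_link("pbLicenseDetail.jsp?owi=KdVIRFZaRSk%3D"))
# https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=KdVIRFZaRSk%3D
```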

### Save these into `vape-results.csv`

In [25]:
import pandas as pd
df = pd.DataFrame(list_companies)
df.to_csv('vape-results.csv',index=False)

### Open `vape-results.csv` to make sure there aren't any extra weird columns

In [26]:
df = pd.read_csv('vape-results.csv')
df

| | address | city_name | company_name | county | detail_link | issue_date | license_number | license_status | trade_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1724 N SALISBURY BLVD UNIT 2 | SALISBURY, MD 21801 | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | 22173807.0 | Issued | VAPE IT STORE I |
| 1 | 1015 S SALISBURY BLVD | SALISBURY, MD 21801 | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | 22173808.0 | Issued | VAPE IT STORE II |
| 2 | 2299 JOHNS HOPKINS ROAD | GAMBRILLS, MD 21054 | ANJ DISTRIBUTIONS LLC | Anne Arundel County | https://jportal.mdcourts.gov/license/pbLicense... | 4/05/2017 | 2104436.0 | Issued | VAPEPAD THE |
| 3 | 110 S. PINEY RD | CHESTER, MD 21619 | COX TRADING COMPANY L L C | Queen Anne's County | https://jportal.mdcourts.gov/license/pbLicense... | 5/31/2017 | 17165957.0 | Issued | VAPE FROG |
| 4 | 346 RITCHIE HIGHWAY | SEVERNA PARK, MD 21146 | COX TRADING LLC | Anne Arundel County | | | | Pending | VAPE FROG |
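
One thing to notice in the table above: `license_number` came back as a float, so `02104436` turned into `2104436.0` and lost its leading zero. With pandas, passing `dtype={'license_number': str}` to `pd.read_csv` keeps that column as text. The same idea can be sketched with the standard library's `csv` module, which never converts types. The file name `license_demo.csv` and the two sample rows (a subset of the data above) are only for illustration.

```python
import csv
import os
import tempfile

rows = [
    {"trade_name": "VAPEPAD THE", "license_number": "02104436"},
    {"trade_name": "VAPE FROG", "license_number": ""},  # pending, no number yet
]

# Write to a throwaway file so the real results file is left alone
path = os.path.join(tempfile.mkdtemp(), "license_demo.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["trade_name", "license_number"])
    writer.writeheader()
    writer.writerows(rows)

# DictReader returns every field as a string, so the leading zero survives
with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded[0]["license_number"])  # 02104436
```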


## Use Selenium to scrape ALL pages of results, save the results into `vape-results-all.csv`.

In [27]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
import pandas as pd
from bs4 import BeautifulSoup

In [28]:
driver = webdriver.Chrome()
url = "https://jportal.mdcourts.gov/license/index_disclaimer.jsp"
driver.get(url)
checkbox = driver.find_element_by_xpath('//*[@id="checkbox"]')
checkbox.click()
button = driver.find_element_by_xpath('/html/body/table/tbody/tr[7]/td/form/div/input[2]')
button.click()
link = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[6]/td[2]/a[2]')
link.click()

dropdown = driver.find_element_by_xpath('//*[@id="slcJurisdiction"]')
select = Select(dropdown)
select.select_by_visible_text("Statewide")
tradeName = driver.find_element_by_xpath('//*[@id="txtTradeName"]')
tradeName.send_keys("vap%")
searchButton = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/form/table/tbody/tr[14]/td/input[1]')
searchButton.click()

In [29]:
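def scrape_this_page():
    """Parse the results page currently open in `driver` and return a list of dicts, one per license."""
    doc = BeautifulSoup(driver.page_source, "html.parser")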
    business_headers = doc.find_all('tr',class_='searchfieldtitle')
    this_list = []
    
    for header in business_headers:
        company = {}
        rows = header.find_next_siblings('tr')
        company['trade_name'] = header.find_all('td')[1].text.strip()
        company['company_name'] = rows[0].find_all('td')[1].text.strip()
        company['city_name'] = rows[2].find_all('td')[1].text.strip()
        company['address'] = rows[1].find_all('td')[1].text.strip()
        company['county'] = rows[3].find_all('td')[1].text.strip()
        company_url = header.find_all('td')[2].find('a', href=True)
        
        if company_url:
            company['detail_link'] = 'https://jportal.mdcourts.gov/license/'+company_url['href']
        else:
            company['detail_link'] = ''
            
        license_number = rows[1].find_all('td')[2].find('span')
        
        if license_number is not None:
            company['license_status'] = rows[0].find_all('td')[2].find('span').next_sibling.strip()
            company['license_number'] = rows[1].find_all('td')[2].find('span').next_sibling.strip()
            company['issue_date'] = rows[2].find_all('td')[2].find('span').next_sibling.strip()
        else:
            company['license_status'] = rows[0].find_all('td')[2].find('span').next_sibling.strip()
            company['license_number'] = ""
            company['issue_date'] = ""
            
        this_list.append(company)
        
    return this_list

In [30]:
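from selenium.common.exceptions import NoSuchElementException

company_list = []

while True:
    # Scrape outside the try block so parsing errors aren't silently swallowed
    this_list = scrape_this_page()
    company_list += this_list
    try:
        nextButton = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr')
        nextButton.click()
    except NoSuchElementException:
        # No "Next" link found, which we take to mean this was the last page
        break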

In [31]:
company_list

[{'address': '1724 N SALISBURY BLVD UNIT 2',
  'city_name': 'SALISBURY, MD 21801',
  'company_name': 'AMIN NARGIS',
  'county': 'Wicomico County',
  'detail_link': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=%2BExmxmrSeEE%3D',
  'issue_date': '4/27/2017',
  'license_number': '22173807',
  'license_status': 'Issued',
  'trade_name': 'VAPE IT STORE I'},
 {'address': '1015 S SALISBURY BLVD',
  'city_name': 'SALISBURY, MD 21801',
  'company_name': 'AMIN NARGIS',
  'county': 'Wicomico County',
  'detail_link': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=AOTGUn5%2FFjc%3D',
  'issue_date': '4/27/2017',
  'license_number': '22173808',
  'license_status': 'Issued',
  'trade_name': 'VAPE IT STORE II'},
 {'address': '2299 JOHNS HOPKINS ROAD',
  'city_name': 'GAMBRILLS, MD 21054',
  'company_name': 'ANJ DISTRIBUTIONS LLC',
  'county': 'Anne Arundel County',
  'detail_link': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=OQTtBF%2F1Lf8%3D',
  'issue

In [32]:
df = pd.DataFrame(company_list)
df.to_csv('vape-results-all.csv',index=False)

In [33]:
df.head()

| | address | city_name | company_name | county | detail_link | issue_date | license_number | license_status | trade_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1724 N SALISBURY BLVD UNIT 2 | SALISBURY, MD 21801 | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | 22173807.0 | Issued | VAPE IT STORE I |
| 1 | 1015 S SALISBURY BLVD | SALISBURY, MD 21801 | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | 22173808.0 | Issued | VAPE IT STORE II |
| 2 | 2299 JOHNS HOPKINS ROAD | GAMBRILLS, MD 21054 | ANJ DISTRIBUTIONS LLC | Anne Arundel County | https://jportal.mdcourts.gov/license/pbLicense... | 4/05/2017 | 2104436.0 | Issued | VAPEPAD THE |
| 3 | 110 S. PINEY RD | CHESTER, MD 21619 | COX TRADING COMPANY L L C | Queen Anne's County | https://jportal.mdcourts.gov/license/pbLicense... | 5/31/2017 | 17165957.0 | Issued | VAPE FROG |
| 4 | 346 RITCHIE HIGHWAY | SEVERNA PARK, MD 21146 | COX TRADING LLC | Anne Arundel County | | | | Pending | VAPE FROG |
