# Scraping Maryland Business Licenses with Selenium

Maryland has a [great portal](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp) for searching business licenses. The only problem is that you have to check a box before you can get in. Here's what the process looks like in a browser:

1. Try to visit [the public search page](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp)
2. Get redirected to an "I agree to this" page. Check the box saying you've read the disclaimer, then click "Enter the Site".
3. Click "Search License Records" down at the bottom of the page
4. You're now on the search page! From the "Jurisdiction" dropdown, select "Statewide"
5. In the "Trade Name" field, type "Vap%" to try to find vape shops
6. Click "Next" in the bottom right-hand corner to go to the next page
7. Click "Click for detail" to see the details for a specific business license.

That's a lot of stuff! **Let's get to work.**

## Preparation

### When you search for a business license, what URL should Selenium try to visit first?

In [245]:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://jportal.mdcourts.gov/license/index_disclaimer.jsp")

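**Visiting the search page directly isn't going to work, though! It redirects to that intro page,** which is why the code above starts at the disclaimer page instead. To walk back through the "check the box, etc." series of pages in your own browser, open the site in an *Incognito* window.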
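
One way to confirm the redirect rather than guess at it is to compare the page you asked for with `driver.current_url` once the page has loaded. This is a minimal sketch: the helper names `page_name` and `was_redirected` are made up for illustration, and only the two `.jsp` page names come from the URLs above.

```python
import posixpath
from urllib.parse import urlparse

SEARCH_URL = "https://jportal.mdcourts.gov/license/pbPublicSearch.jsp"


def page_name(url):
    """Return the last path segment of a URL, e.g. 'pbPublicSearch.jsp'."""
    return posixpath.basename(urlparse(url).path)


def was_redirected(requested_url, current_url):
    """True when the browser ended up on a different page than requested."""
    return page_name(requested_url) != page_name(current_url)


# With a live browser (not run here):
#     driver.get(SEARCH_URL)
#     if was_redirected(SEARCH_URL, driver.current_url):
#         print("Sent to", page_name(driver.current_url))
```

Comparing only the page name keeps the check from tripping over query strings the site may append.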

### How will you identify the checkbox to check it?

Selenium can find elements by:

- Name
- Class
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

- *TIP: I have a secret awesome way to do this, but you have to ask.*
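
To make that list concrete, here is one element located three different ways. It's a sketch, assuming Selenium's generic `find_elements(by, value)` call, where the strategy names are the plain strings behind `By.ID`, `By.CSS_SELECTOR` and `By.XPATH`. The `id` of `checkbox` is the one used in the next cell; `find_first` is a hypothetical helper. Newer Selenium releases (4.3 and later) removed the `find_element_by_*` shortcuts, so this generic form is also the one that keeps working there.

```python
# Strategy names are the string values of selenium.webdriver.common.by.By
# (By.ID == "id", By.CSS_SELECTOR == "css selector", By.XPATH == "xpath").
CHECKBOX_LOCATORS = [
    ("id", "checkbox"),
    ("css selector", "#checkbox"),
    ("xpath", '//*[@id="checkbox"]'),
]


def find_first(driver, locators):
    """Try each (strategy, value) pair in order and return the first match."""
    for by, value in locators:
        matches = driver.find_elements(by, value)  # empty list if nothing matches
        if matches:
            return matches[0]
    raise LookupError(f"No element matched any of {locators!r}")


# With a live browser (not run here):
#     checkbox = find_first(driver, CHECKBOX_LOCATORS)
```

All three locators describe the same element, so in practice you would pick whichever is shortest and least likely to break when the page layout changes.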

In [246]:
checkbox = driver.find_element_by_xpath('//*[@id="checkbox"]')

In [247]:
checkbox.click()

### How will you identify the button to select it, or the form to submit it?

Selenium can submit a form in one of two ways:

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**
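
The cell below takes the button route. For comparison, here is a sketch of the form route. It assumes the checkbox sits inside the same `<form>` as the button, which this notebook does not confirm, and `enter_site` is a hypothetical name.

```python
def enter_site(driver):
    """Tick the disclaimer checkbox, then submit the form that contains it."""
    checkbox = driver.find_element("id", "checkbox")
    if not checkbox.is_selected():  # avoid un-ticking an already ticked box
        checkbox.click()
    # "./ancestor::form" walks up from the checkbox to its enclosing <form>.
    form = checkbox.find_element("xpath", "./ancestor::form")
    form.submit()


# With a live browser (not run here):
#     enter_site(driver)
```

Searching from the checkbox itself means there is no need to know the form's name or position on the page.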

In [248]:
enter_the_site = driver.find_element_by_xpath('/html/body/table/tbody/tr[7]/td/form/div/input[2]')
enter_the_site.click()

### Once you're on the next page, how will you click the "SEARCH LICENSE RECORDS" link or follow its URL?

In [249]:
search_license_record = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[6]/td[2]/a[2]')
search_license_record.click()
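
An absolute XPath like the one above breaks as soon as a table row is added or removed. Since this target is a link, matching on its text is a sturdier option. A sketch, assuming the link's visible text contains "SEARCH LICENSE RECORDS" in that capitalization (link-text matching is generally case-sensitive, so copy what you see on the page); `click_link_containing` is a hypothetical helper.

```python
def click_link_containing(driver, text):
    """Click the first link whose visible text contains `text`."""
    # "partial link text" is the string value of By.PARTIAL_LINK_TEXT.
    links = driver.find_elements("partial link text", text)
    if not links:
        raise LookupError(f"No link containing {text!r} on this page")
    links[0].click()
    return links[0]


# With a live browser (not run here):
#     click_link_containing(driver, "SEARCH LICENSE RECORDS")
```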

### Now you're on the form page. How do you pick "Statewide" for the jurisdiction dropdown?

In [250]:
from selenium.webdriver.support.ui import Select

In [251]:
jurisdiction = driver.find_element_by_xpath('//*[@id="slcJurisdiction"]')
# Wrap the <select> element in Select so Selenium can pick options from it
jurisdiction_select = Select(jurisdiction)
jurisdiction_select.select_by_visible_text('Statewide')

### How do you type "vap%" into the Trade Name field?

In [252]:
from selenium.webdriver.common.keys import Keys
trade_name = driver.find_element_by_xpath('//*[@id="txtTradeName"]')
trade_name.send_keys('vap%')
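
The `%` behaves like a wildcard: judging by the results further down, `vap%` returns every trade name that starts with "VAP", regardless of case. The sketch below mimics that with a regular expression. It assumes the portal follows SQL `LIKE` conventions, which the site does not document here, and "CLOUD VAPOR" is a made-up name added to show a non-match.

```python
import re


def like_match(pattern, text):
    """Case-insensitive match where '%' stands for any run of characters."""
    # Escape everything except '%', which becomes the regex '.*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, text, flags=re.IGNORECASE | re.DOTALL) is not None


names = ["VAPE FROG", "VAPEPAD THE", "CLOUD VAPOR"]
print([name for name in names if like_match("vap%", name)])
# ['VAPE FROG', 'VAPEPAD THE']
```

Under the same assumption, a pattern like `%vap%` would also catch names that merely contain "vap".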

### How do you click the submit button or submit the form?

In [253]:
trade_name.send_keys(Keys.RETURN)

### How can you find and click the 'Next' button on the search results page?

In [254]:
# Named next_button so it doesn't shadow Python's built-in next()
next_button = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr')
next_button.click()

# Okay, let's scrape!

### Use Selenium to search for vape shops statewide, and then click through until it reaches the last page.

Don't scrape yet!

In [255]:
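from selenium.common.exceptions import NoSuchElementException

# Keep clicking "Next" until the link is no longer on the page
while True:
    try:
        driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr').click()
    except NoSuchElementException:
        break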
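
A variation on this loop uses `find_elements` (plural), which returns an empty list instead of raising an error, and adds a cap so a page that always shows a "Next" link can't keep the loop running forever. This is a sketch: `click_through` and `max_pages` are hypothetical, and it relies on the same assumption as the loop above, namely that the link is absent on the last page.

```python
NEXT_XPATH = "/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr"


def click_through(driver, next_xpath=NEXT_XPATH, max_pages=100):
    """Click "Next" until it disappears; return how many clicks were made."""
    clicks = 0
    while clicks < max_pages:
        buttons = driver.find_elements("xpath", next_xpath)
        if not buttons:  # no "Next" link left, so this is the last page
            break
        buttons[0].click()
        clicks += 1
    return clicks


# With a live browser (not run here):
#     print("Clicked Next", click_through(driver), "times")
```

The return value doubles as a page count: the number of clicks plus one is the number of result pages visited.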

### Use Selenium to scrape the first page of search results for vape shops statewide (well, we'll try).

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source,'html.parser')`*

I've included some BeautifulSoup code that might be helpful. If you use it, **ask me how I made it.** It's important.

In [323]:
from bs4 import BeautifulSoup

In [324]:
doc = BeautifulSoup(driver.page_source, 'html.parser')
doc

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="EN" xmlns="http://www.w3.org/1999/xhtml"><head>
<title>Maryland Judiciary Business Licenses Online</title>
<link href="theme/styles.css" rel="STYLESHEET" type="text/css"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
</head>
<body>
<!-- HEADER AREA - Header graphic, company logo and link images go here-->
<table border="0" cellpadding="0" cellspacing="0" summary="Page Layout Table" width="100%">
<tbody><tr>
<td colspan="3">
<img alt="MARYLAND BUSINESS LICENSES ONLINE" height="66" src="images/header_new.gif" width="596"/><img alt="" src="images/spacer.gif" width="35"/>
</td>
</tr>
<tr>
<td class="headerline">
<img alt="blank spacer" height="18" src="images/spacer.gif" width="35"/>
</td>
<td align="LEFT" class="headerline">
<table border="0" cellpadding="0" cellspacing="0" summary="Navigation Menu">
<tbody><tr valign="top">
<td>
<img alt="" height="18

In [325]:
business_headers = doc.find_all('tr',class_='searchfieldtitle')
len(business_headers)

5

In [326]:
for header in business_headers:
    rows = header.find_next_siblings('tr')
    print("HEADER is", header.text.strip())
    print("ROW 0 IS", rows[0].text.strip())
    print("ROW 1 IS", rows[1].text.strip())
    print("ROW 2 IS", rows[2].text.strip())
    print("ROW 3 IS", rows[3].text.strip())
    print("----")

HEADER is 1.
VAPE IT STORE I
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1724 N SALISBURY BLVD UNIT 2
License: 22173807
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
----
HEADER is 2.
VAPE IT STORE II
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1015 S SALISBURY BLVD
License: 22173808
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
----
HEADER is 3.
VAPEPAD THE
ROW 0 IS ANJ DISTRIBUTIONS LLC
Lic. Status: Issued
ROW 1 IS 2299 JOHNS HOPKINS ROAD
License: 02104436
ROW 2 IS GAMBRILLS, MD 21054
Issued Date: 4/05/2017
ROW 3 IS Anne Arundel County
----
HEADER is 4.
VAPE FROG
ROW 0 IS COX TRADING COMPANY L L C
Lic. Status: Issued
ROW 1 IS 110 S. PINEY RD
License: 17165957
ROW 2 IS CHESTER, MD 21619
Issued Date: 5/31/2017
ROW 3 IS Queen Anne's County
----
HEADER is 5.
VAPE FROG
Pending *
ROW 0 IS COX TRADING LLC
Lic. Status: Pending
ROW 1 IS 346 RITCHIE HIGHWAY
ROW 2 IS SEVERNA PARK, MD 21146
ROW 3 IS Anne Arundel County
---

### Save these into `vape-results.csv`

In [327]:
business_headers[0].find_next_siblings('tr')

[<tr class="tablecelltext">
 <td> </td>
 <td>AMIN NARGIS</td>
 <td><span class="copybold">Lic. Status:</span> Issued</td>
 </tr>, <tr class="tablecelltext">
 <td> </td>
 <td>1724 N SALISBURY BLVD UNIT 2</td>
 <td><span class="copybold">License:</span> 22173807</td>
 </tr>, <tr class="tablecelltext">
 <td> </td>
 <td>SALISBURY, MD 21801</td>
 <td><span class="copybold">Issued Date:</span> 4/27/2017</td>
 </tr>, <tr class="tablecelltext">
 <td> </td>
 <td>Wicomico County</td>
 <td></td>
 </tr>, <tr class="searchfieldtitle">
 <td class="searchlistnumber">2.</td>
 <td class="searchlistitem"><span class="copybold">VAPE IT STORE II</span></td>
 <td><a href="pbLicenseDetail.jsp?owi=LVmS56v8b84%3D"><img alt="Click for Detail of VAPE IT STORE II" src="images/link_click-detail.gif"/></a></td>
 </tr>, <tr class="tablecelltext">
 <td> </td>
 <td>AMIN NARGIS</td>
 <td><span class="copybold">Lic. Status:</span> Issued</td>
 </tr>, <tr class="tablecelltext">
 <td> </td>
 <td>1015 S SALISBURY BLVD</td

In [329]:
vape_shops = []

for header in business_headers:
    rows = header.find_next_siblings('tr')
    if rows[2].find_all('td')[2].find('span', attrs={'class':'copybold'}):
        issued_date = rows[2].find_all('td')[2].find('span', attrs={'class':'copybold'}).next.next[1:]
    if rows[1].find_all('td')[2].find('span', attrs={'class':'copybold'}):
        license = rows[1].find_all('td')[2].find('span', attrs={'class':'copybold'}).next.next[1:]
    if rows[0].find_all('td')[2].find('span', attrs={'class':'copybold'}):
        status = rows[0].find_all('td')[2].find('span', attrs={'class':'copybold'}).next.next[1:]
    vape_shops.append({
        'header': header.find_all('td')[0].text.strip(),
        'shop name': header.find_all('td')[1].text.strip(),
        'company': rows[0].find_all('td')[1].text.strip(),
        'address': rows[1].find_all('td')[1].text.strip(),
        'county': rows[2].find_all('td')[1].text.strip(),
        'status': status,
        'license': license,
        'issued date': issued_date
    })

vape_shops

[{'address': '1724 N SALISBURY BLVD UNIT 2',
  'company': 'AMIN NARGIS',
  'county': 'SALISBURY, MD 21801',
  'header': '1.',
  'issued date': '4/27/2017',
  'license': '22173807',
  'shop name': 'VAPE IT STORE I',
  'status': 'Issued'},
 {'address': '1015 S SALISBURY BLVD',
  'company': 'AMIN NARGIS',
  'county': 'SALISBURY, MD 21801',
  'header': '2.',
  'issued date': '4/27/2017',
  'license': '22173808',
  'shop name': 'VAPE IT STORE II',
  'status': 'Issued'},
 {'address': '2299 JOHNS HOPKINS ROAD',
  'company': 'ANJ DISTRIBUTIONS LLC',
  'county': 'GAMBRILLS, MD 21054',
  'header': '3.',
  'issued date': '4/05/2017',
  'license': '02104436',
  'shop name': 'VAPEPAD THE',
  'status': 'Issued'},
 {'address': '110 S. PINEY RD',
  'company': 'COX TRADING COMPANY L L C',
  'county': 'CHESTER, MD 21619',
  'header': '4.',
  'issued date': '5/31/2017',
  'license': '17165957',
  'shop name': 'VAPE FROG',
  'status': 'Issued'},
 {'address': '346 RITCHIE HIGHWAY',
  'company': 'COX TRADIN

### Open `vape-results.csv` to make sure there aren't any extra weird columns

In [330]:
import pandas as pd

In [331]:
df = pd.DataFrame(vape_shops)
df.head()

| Unnamed: 0 | address | company | county | header | issued date | license | shop name | status |
|---|---|---|---|---|---|---|---|---|
| 0 | 1724 N SALISBURY BLVD UNIT 2 | AMIN NARGIS | SALISBURY, MD 21801 | 1.0 | 4/27/2017 | 22173807 | VAPE IT STORE I | Issued |
| 1 | 1015 S SALISBURY BLVD | AMIN NARGIS | SALISBURY, MD 21801 | 2.0 | 4/27/2017 | 22173808 | VAPE IT STORE II | Issued |
| 2 | 2299 JOHNS HOPKINS ROAD | ANJ DISTRIBUTIONS LLC | GAMBRILLS, MD 21054 | 3.0 | 4/05/2017 | 2104436 | VAPEPAD THE | Issued |
| 3 | 110 S. PINEY RD | COX TRADING COMPANY L L C | CHESTER, MD 21619 | 4.0 | 5/31/2017 | 17165957 | VAPE FROG | Issued |
| 4 | 346 RITCHIE HIGHWAY | COX TRADING LLC | SEVERNA PARK, MD 21146 | 5.0 | 5/31/2017 | 17165957 | VAPE FROG | Pending |
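
Two details in this output deserve a second look. The `county` column holds the city, state and ZIP line, while the printed rows earlier show the county on the fourth detail row. And the pending VAPE FROG listing repeats the license number and issue date of the listing before it: `status`, `license` and `issued_date` are only reassigned when a label is present, so stale values carry over from the previous loop iteration. Below is a minimal sketch of a record builder that avoids both problems. It works on plain text already pulled from the cells; `split_label` and `build_record` are hypothetical names, and the separate `city` key is an addition.

```python
LABEL_TO_KEY = {
    "Lic. Status": "status",
    "License": "license",
    "Issued Date": "issued date",
}


def split_label(cell_text):
    """Split 'License: 22173807' into ('License', '22173807')."""
    label, sep, value = cell_text.partition(":")
    if not sep:  # no colon means the cell carries no labelled value
        return None, None
    return label.strip(), value.strip()


def build_record(shop_name, detail_rows):
    """Build one record from four (middle cell, right cell) text pairs."""
    # Start every record with empty fields so nothing leaks in from the last one.
    record = {"shop name": shop_name, "status": None, "license": None, "issued date": None}
    middle = [left.strip() for left, _ in detail_rows]
    record["company"], record["address"], record["city"], record["county"] = middle
    for _, right in detail_rows:
        label, value = split_label(right)
        if label in LABEL_TO_KEY and value:
            record[LABEL_TO_KEY[label]] = value
    return record


pending = build_record("VAPE FROG", [
    ("COX TRADING LLC", "Lic. Status: Pending"),
    ("346 RITCHIE HIGHWAY", ""),
    ("SEVERNA PARK, MD 21146", ""),
    ("Anne Arundel County", ""),
])
print(pending["county"], pending["license"])  # Anne Arundel County None
```

With BeautifulSoup, the four pairs could come from something like `[(r.find_all('td')[1].text, r.find_all('td')[2].text) for r in rows[:4]]`, since each detail row in the HTML above has three cells.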


In [332]:
df.to_csv("vape-results.csv", index=False)
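
When the file is read back, two things can look like "weird columns". One is an `Unnamed: 0` column, which appears when a CSV that was written with its index is re-read. The other is license numbers that lose a leading zero (`02104436` turning into `2104436`) because the column is parsed as a number. `index=False` takes care of the first. For the second, `pd.read_csv` accepts `dtype={'license': str}` to keep the column as text. The sketch below shows the same round trip with the standard library's `csv` module, which always hands values back as strings; the two rows are copied from the results above.

```python
import csv
import io

rows = [
    {"shop name": "VAPEPAD THE", "license": "02104436"},
    {"shop name": "VAPE FROG", "license": "17165957"},
]

buffer = io.StringIO()  # stands in for the file on disk
writer = csv.DictWriter(buffer, fieldnames=["shop name", "license"])
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0]["license"])  # 02104436
```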

## Use Selenium to scrape ALL pages of results, and save the results into `vape-results-all.csv`.

In [333]:
from selenium.common.exceptions import NoSuchElementException

vape_shops_all = []

while True:
    # Parse whichever results page the browser is currently showing
    doc = BeautifulSoup(driver.page_source, 'html.parser')
    business_headers = doc.find_all('tr',class_='searchfieldtitle')

    for header in business_headers:
        rows = header.find_next_siblings('tr')
        if rows[2].find_all('td')[2].find('span', attrs={'class':'copybold'}):
            issued_date = rows[2].find_all('td')[2].find('span', attrs={'class':'copybold'}).next.next[1:]
        if rows[1].find_all('td')[2].find('span', attrs={'class':'copybold'}):
            license = rows[1].find_all('td')[2].find('span', attrs={'class':'copybold'}).next.next[1:]
        if rows[0].find_all('td')[2].find('span', attrs={'class':'copybold'}):
            status = rows[0].find_all('td')[2].find('span', attrs={'class':'copybold'}).next.next[1:]
        vape_shops_all.append({
            'header': header.find_all('td')[0].text.strip(),
            'shop name': header.find_all('td')[1].text.strip(),
            'company': rows[0].find_all('td')[1].text.strip(),
            'address': rows[1].find_all('td')[1].text.strip(),
            'county': rows[2].find_all('td')[1].text.strip(),
            'status': status,
            'license': license,
            'issued date': issued_date
        })

    # Go to the next page; stop once there is no "Next" link left
    try:
        driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr').click()
    except NoSuchElementException:
        break

vape_shops_all

[{'address': '1724 N SALISBURY BLVD UNIT 2',
  'company': 'AMIN NARGIS',
  'county': 'SALISBURY, MD 21801',
  'header': '1.',
  'issued date': '4/27/2017',
  'license': '22173807',
  'shop name': 'VAPE IT STORE I',
  'status': 'Issued'},
 {'address': '1015 S SALISBURY BLVD',
  'company': 'AMIN NARGIS',
  'county': 'SALISBURY, MD 21801',
  'header': '2.',
  'issued date': '4/27/2017',
  'license': '22173808',
  'shop name': 'VAPE IT STORE II',
  'status': 'Issued'},
 {'address': '2299 JOHNS HOPKINS ROAD',
  'company': 'ANJ DISTRIBUTIONS LLC',
  'county': 'GAMBRILLS, MD 21054',
  'header': '3.',
  'issued date': '4/05/2017',
  'license': '02104436',
  'shop name': 'VAPEPAD THE',
  'status': 'Issued'},
 {'address': '110 S. PINEY RD',
  'company': 'COX TRADING COMPANY L L C',
  'county': 'CHESTER, MD 21619',
  'header': '4.',
  'issued date': '5/31/2017',
  'license': '17165957',
  'shop name': 'VAPE FROG',
  'status': 'Issued'},
 {'address': '346 RITCHIE HIGHWAY',
  'company': 'COX TRADIN

In [334]:
len(vape_shops_all)

32
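
A count of 32 is easier to trust with a quick consistency check. Each listing carries its position in the `header` field (`'1.'`, `'2.'`, and so on), and the numbering appears to continue across pages. A sketch like the following flags gaps or repeats, which could appear if a page were parsed twice or skipped. `numbering_problems` is a hypothetical name.

```python
def numbering_problems(records):
    """Return (missing, repeated) listing numbers based on each record's 'header'."""
    seen = [int(record["header"].rstrip(".")) for record in records]
    highest = max(seen, default=0)
    missing = sorted(set(range(1, highest + 1)) - set(seen))
    repeated = sorted(n for n in set(seen) if seen.count(n) > 1)
    return missing, repeated


# With the scraped list (not run here):
#     print(numbering_problems(vape_shops_all))
sample = [{"header": "1."}, {"header": "2."}, {"header": "2."}, {"header": "4."}]
print(numbering_problems(sample))  # ([3], [2])
```

Two empty lists mean every number from 1 to the highest one seen appears exactly once.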

In [335]:
df_1 = pd.DataFrame(vape_shops_all)
df_1

| Unnamed: 0 | address | company | county | header | issued date | license | shop name | status |
|---|---|---|---|---|---|---|---|---|
| 0 | 1724 N SALISBURY BLVD UNIT 2 | AMIN NARGIS | SALISBURY, MD 21801 | 1.0 | 4/27/2017 | 22173807 | VAPE IT STORE I | Issued |
| 1 | 1015 S SALISBURY BLVD | AMIN NARGIS | SALISBURY, MD 21801 | 2.0 | 4/27/2017 | 22173808 | VAPE IT STORE II | Issued |
| 2 | 2299 JOHNS HOPKINS ROAD | ANJ DISTRIBUTIONS LLC | GAMBRILLS, MD 21054 | 3.0 | 4/05/2017 | 2104436 | VAPEPAD THE | Issued |
| 3 | 110 S. PINEY RD | COX TRADING COMPANY L L C | CHESTER, MD 21619 | 4.0 | 5/31/2017 | 17165957 | VAPE FROG | Issued |
| 4 | 346 RITCHIE HIGHWAY | COX TRADING LLC | SEVERNA PARK, MD 21146 | 5.0 | 5/31/2017 | 17165957 | VAPE FROG | Pending |
| 5 | 185 MITCHELLS CHANCE RD | DISBROW II EMERSON HARRINGTON | EDGEWATER, MD 21037 | 6.0 | 4/13/2017 | 2102408 | VAPE LOFT (THE) | Issued |
| 6 | 7104 MINSTREL UNIT #7 | DISCOUNT TOBACCO ESSEX LLC | COLUMBIA, MD 21045 | 7.0 | 5/19/2017 | 13141786 | VAPE N CIGAR | Issued |
| 7 | 330 ONE FORTY VILLAGE ROAD | FAIRGROUND VILLAGE LLC | WESTMINSTER, MD 21157 | 8.0 | 4/21/2017 | 6126253 | VAPE DOJO | Issued |
| 8 | 29890 THREE NOTCH ROAD | GRIMM JENNIFER | CHARLOTTE HALL, MD 20622 | 9.0 | 4/21/2017 | 6126253 | VAPE HAVEN | Pending |
| 9 | 356 ROMANCOKE ROAD | HUTCH VAPES LLC | STEVENSVILLE, MD 21666 | 10.0 | 4/13/2017 | 17166688 | VAPE BIRD | Issued |


In [336]:
df_1.to_csv("vape-results-all.csv", index=False)