# Scraping Maryland Business Licenses with Selenium

Maryland has a [great portal](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp) for searching business licenses. The only catch is that you have to check a box before you can get in.

1. Try to visit [the public search page](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp)
2. Get redirected to an "I agree to this" page. Check the box saying you've read the disclaimer, then click "Enter the Site".
3. Click "Search License Records" down at the bottom of the page
4. You're now on the search page! From the "Jurisdiction" dropdown, select "Statewide"
5. In the "Trade Name" field, type "Vap%" to try to find vape shops
6. Click "Next" in the bottom right-hand corner to go to the next page
7. Click "Click for detail" to see the details for a specific business license.

That's a lot of stuff! **Let's get to work.**

## Preparation

### When you search for a business license, what URL should Selenium try to visit first?

https://jportal.mdcourts.gov/license/pbPublicSearch.jsp

In [1]:
from selenium import webdriver

In [2]:
driver = webdriver.Chrome()

In [3]:
driver.get('https://jportal.mdcourts.gov/license/pbPublicSearch.jsp')
#"it isn't going to work" (below) means the Selenium browser won't land on the search page,
#because it is redirected to the intro page first.

**It isn't going to work, though! It's going to redirect to that intro page.** You can use *Incognito mode* to go back through the "Check the box, etc" series of pages.

### How will you identify the checkbox to check it?

Selenium can find elements by:

- Name
- Class
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

- *TIP: I have a secret awesome way to do this, but you have to ask.*

In [4]:
#you'll find it by inspecting the page and looking for the xpath, which in this case is: //*[@id="checkbox"]
checkbox = driver.find_element_by_xpath('//*[@id="checkbox"]')
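
To see what an XPath like `//*[@id="checkbox"]` is actually saying, here is a minimal sketch that runs without a browser. It uses Python's built-in `xml.etree.ElementTree`, which understands only a small subset of XPath, and the HTML snippet is a made-up stand-in for the disclaimer page, not the real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the disclaimer page (not the real markup).
page = """
<html>
  <body>
    <form id="disclaimer">
      <input type="checkbox" id="checkbox" name="agree" />
      <input type="submit" value="Enter the Site" />
    </form>
  </body>
</html>
"""

root = ET.fromstring(page)

# Reads as: any element, at any depth, whose id attribute is "checkbox".
# ElementTree wants a leading "." on the path; Selenium does not.
checkbox = root.find(".//*[@id='checkbox']")
print(checkbox.tag, checkbox.attrib["type"])

# An absolute path spells out every step from the root instead,
# so it stops matching as soon as the page layout changes.
same_checkbox = root.find("./body/form/input[1]")
print(same_checkbox is checkbox)
```

Both lookups land on the same element here, but the id-based one keeps working if the form moves somewhere else on the page.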

### How will you identify the button to select it, or the form to submit it?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

In [5]:
checkbox.click()

In [6]:
from selenium.webdriver.common.keys import Keys
#instead of finding the button, press Return while the checkbox has focus to submit the form
checkbox.send_keys(Keys.RETURN)

### Once you're on the next page, how will you click the "SEARCH LICENSE RECORDS" link or follow its URL?

In [7]:
search = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[6]/td[2]/a[2]')
search.send_keys(Keys.RETURN)

### Now you're on the form page. How do you pick "Statewide" for the jurisdiction dropdown?

In [8]:
from selenium.webdriver.support.ui import Select

In [9]:
#one way to do this is with Selenium's Select helper:
# select_tag = driver.find_element_by_name('Jurisdiction') #Jurisdiction is the drop down menu
# select = Select(select_tag)
# select.select_by_visible_text('Statewide')

#another way is to click the option directly by its xpath:
jurisdiction = driver.find_element_by_xpath('//*[@id="slcJurisdiction"]/option[2]')
jurisdiction.click()
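
The two approaches above differ in how they pick the option: `option[2]` goes by position, while `select_by_visible_text` goes by the label. A rough browser-free sketch of that difference, using the standard library's `xml.etree.ElementTree`. Only the id `slcJurisdiction` comes from the page; the option list itself is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical option list; only the id "slcJurisdiction" comes from the real page.
dropdown = ET.fromstring("""
<select id="slcJurisdiction">
  <option value="">Select a jurisdiction</option>
  <option value="S">Statewide</option>
  <option value="01">Allegany County</option>
</select>
""")

# By position: XPath counts from 1, so option[2] is the second option.
by_position = dropdown.find("option[2]")

# By visible text: look for the option whose label is "Statewide".
by_text = next(o for o in dropdown.findall("option") if o.text.strip() == "Statewide")

print(by_position is by_text)
print(by_text.get("value"))
```

If the site ever reorders its dropdown, the positional lookup silently picks a different jurisdiction, while the text lookup either finds "Statewide" or fails loudly.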

### How do you type "vap%" into the Trade Name field?

In [10]:
#first select the trade name field 
type_vap = driver.find_element_by_xpath('//*[@id="txtTradeName"]')
#click on it
type_vap.click()
#type something into the box
type_vap.send_keys('VAP%')
#and hit return
type_vap.send_keys(Keys.RETURN)
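
The `%` in `VAP%` acts as a wildcard. The sketch below assumes the portal treats it the way SQL's `LIKE` does (any run of characters) and ignores case; neither is confirmed by the site, and `like_to_regex` is a hypothetical helper. The first three names come from the results further down, the last one is made up to show a non-match.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE-style pattern, where % matches any run of characters, into a regex."""
    parts = [re.escape(piece) for piece in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$", re.IGNORECASE)


# The last name is invented: it contains "VAP" but does not start with it.
names = ["VAPE IT STORE II", "VAPEPAD THE", "VAPE FROG", "EVAPORATOR SUPPLY CO"]

matcher = like_to_regex("VAP%")
matches = [name for name in names if matcher.match(name)]
print(matches)
```

Under those assumptions, `VAP%` only catches trade names that begin with "VAP", so a shop with "vape" in the middle of its name would need a search like `%VAP%`.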

### How do you click the submit button or submit the form?

In [11]:
#already done: hitting return at the end of the previous cell submitted the form
#type_vap.send_keys(Keys.RETURN)

### How can you find and click the 'Next' button on the search results page?

In [12]:
while True:
    try:
        driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a').click()
    except:
        break
        
#these are other approaches I tried that don't work, kept here as notes to help understand the idea

# while True:
#     try:
#         driver.find_element_by_tag_name('nobr').click()
#     except:
#          break


# #the same idea as a rough while loop (get_the_button is a placeholder, not a real Selenium method):
# while True:
#     try:
#         driver.get_the_button('button').click()
#         #try to get that link and click it. if you get an error, exit the while loop
#     except:
#         break
#     #grab the next button


# Okay, let's scrape!

### Use Selenium to search for vape shops statewide, and then click through until it reaches the last page.

Don't scrape yet!

In [13]:
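while True:
    try:
        #the loop in In [12] already clicked through to the last page. this one walks back to the first page:
        #simply replace the xpath with the one for the back button (td[1] instead of td[3])
        driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[1]/a').click()
    except:
        break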

### Use Selenium to scrape the first page of search results for vape shops statewide (well, we'll try).

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source)`*

I've included some BeautifulSoup code that might be helpful. If you use it, **ask me how I made it.** It's important.

In [14]:
from bs4 import BeautifulSoup
doc = BeautifulSoup(driver.page_source, 'html.parser')
# results = driver.find_elements_by_tag_name('td')
# #they're all in the tag name td 
# #be very careful that ELEMENTS is PLURAL because otherwise it'll only
# #find ONE!!!
# for each_element in results:
#     print(each_element.text.strip())

In [15]:
# IF YOU ARE USING BEAUTIFULSOUP, HERE IS SOME SAMPLE CODE
# IF YOU USE THIS CODE, ASK ME HOW I MADE IT. IT'S IMPORTANT.
business_headers = doc.find_all('tr',class_='searchfieldtitle')
len(business_headers)

5

In [18]:
# You'll probably need to find specific tds inside
# of the rows instead of looking at the whole thing.
for header in business_headers:
    rows = header.find_next_siblings('tr')
    print("HEADER is", header.text.strip())
    print("ROW 0 is", rows[0].text.strip())
    print("ROW 1 is", rows[1].text.strip())
    print("ROW 2 is", rows[2].text.strip())
    print("ROW 3 is", rows[3].text.strip())
    print("----")

HEADER is 1.
VAPE IT STORE II
ROW 0 is AMIN NARGIS
Lic. Status: Issued
ROW 1 is 1015 S SALISBURY BLVD
License: 22173808
ROW 2 is SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 is Wicomico County
----
HEADER is 2.
VAPE IT STORE I
ROW 0 is AMIN NARGIS
Lic. Status: Issued
ROW 1 is 1724 N SALISBURY BLVD UNIT 2
License: 22173807
ROW 2 is SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 is Wicomico County
----
HEADER is 3.
VAPEPAD THE
ROW 0 is ANJ DISTRIBUTIONS LLC
Lic. Status: Issued
ROW 1 is 2299 JOHNS HOPKINS ROAD
License: 02104436
ROW 2 is GAMBRILLS, MD 21054
Issued Date: 4/05/2017
ROW 3 is Anne Arundel County
----
HEADER is 4.
VAPE FROG
ROW 0 is COX TRADING COMPANY L L C
Lic. Status: Issued
ROW 1 is 110 S. PINEY RD
License: 17165957
ROW 2 is CHESTER, MD 21619
Issued Date: 5/31/2017
ROW 3 is Queen Anne's County
----
HEADER is 5.
VAPE FROG
Pending *
ROW 0 is COX TRADING LLC
Lic. Status: Pending
ROW 1 is 346 RITCHIE HIGHWAY
ROW 2 is SEVERNA PARK, MD 21146
ROW 3 is Anne Arundel County
----
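
Each printed row mixes plain text (an address, a city) with a labelled field such as `License: 22173808`. A small sketch of pulling those apart once you have the text, assuming the two pieces are separated by a line break as they appear in the output above. `parse_row_text` is a hypothetical helper, not part of BeautifulSoup.

```python
def parse_row_text(text):
    """Split one row's text into free text plus any 'Label: value' fields."""
    free_text = []
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        label, sep, value = line.partition(": ")
        if sep:
            # Lines such as "License: 22173808" become dictionary entries.
            fields[label] = value
        else:
            free_text.append(line)
    return " ".join(free_text), fields


address, fields = parse_row_text("1015 S SALISBURY BLVD\nLicense: 22173808")
print(address)
print(fields)
```

Picking out the individual `td` cells, as the comment in the cell above suggests, avoids this kind of string splitting altogether, so treat this as a fallback.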

### Save these into `vape-results.csv`

In [17]:
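#first import pandas
import pandas as pd
#doc is a whole BeautifulSoup page, not rows of data, so pd.DataFrame(doc) raises a TypeError.
#build one dictionary per business from business_headers instead
results = []
for header in business_headers:
    rows = header.find_next_siblings('tr')
    current = {'row_' + str(i): rows[i].text.strip() for i in range(4)}
    current['header'] = header.text.strip()
    results.append(current)
df = pd.DataFrame(results)
df.to_csv("vape-results.csv", index=False)
vape_results = pd.read_csv('vape-results.csv')
vape_results.head()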

### Open `vape-results.csv` to make sure there aren't any extra weird columns
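
With pandas, leaving out `index=False` in `to_csv` writes the row index as an unnamed first column, which `read_csv` then labels `Unnamed: 0`. One way to check for that kind of stray column without opening the file by hand is to read back just the header row. The sketch below uses only the standard library's `csv` module and writes its own small sample file to a temporary folder; the three column names are illustrative, not a required schema.

```python
import csv
import os
import tempfile

# Two sample rows in the shape of the scraped results; the keys are illustrative.
rows = [
    {"header": "VAPE IT STORE II", "company": "AMIN NARGIS", "county": "Wicomico County"},
    {"header": "VAPE FROG", "company": "COX TRADING LLC", "county": "Anne Arundel County"},
]

# Write to a temporary folder so the sketch doesn't touch any real file.
path = os.path.join(tempfile.mkdtemp(), "vape-results.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["header", "company", "county"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back and look only at the header row.
with open(path, newline="") as f:
    reader = csv.reader(f)
    columns = next(reader)
    data = list(reader)

print(columns)

# A blank or "Unnamed" column name is the usual sign of a saved index.
unexpected = [c for c in columns if c == "" or c.startswith("Unnamed")]
print(unexpected)
```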

## Use Selenium to scrape ALL pages of results, save the results into `vape-results-all.csv`.

In [40]:
while True:
    #re-parse the page every time through the loop, otherwise doc still holds the page we parsed earlier
    doc = BeautifulSoup(driver.page_source, 'html.parser')
    business_headers = doc.find_all('tr',class_='searchfieldtitle')
    for header in business_headers:
        rows = header.find_next_siblings('tr')
        print("HEADER is", header.text.strip())
        print("ROW 0 is", rows[0].text.strip())
        print("ROW 1 is", rows[1].text.strip())
        print("ROW 2 is", rows[2].text.strip())
        print("ROW 3 is", rows[3].text.strip())
        print("----")
    try:
        #click Next only after printing, so the page we're on doesn't get skipped
        driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a').click()
    except:
        break

In [250]:
vape_results_all = []


# for link in rows.find_all('a'):
#     print(link.get('href'))

while True:
    doc = BeautifulSoup(driver.page_source, 'html.parser')
    #we have to be on the first page and grab the information. as the loop comes back around, it will say okay, we're
    #here, now let's take all of the information (defined below) from the page we're on and then save it into our list
    business_headers = doc.find_all('tr',class_='searchfieldtitle')
    #this is where we're getting to where the actual information is
    for header in business_headers:
        current = {}
        #name the dictionary (current)
        rows = header.find_next_siblings('tr')
        #like above, we're going through each header and finding its next siblings.
        #REMEMBER that rows is a new list of the tr elements that come after the header. so then you say, within
        #each of those, find me these things.
        current['header'] = header.find_all('td')[1].text.strip()
        #here we look inside header (one of the business_headers we've ALREADY DEFINED ABOVE), not rows,
        #because rows holds the table rows that come after the header, i.e. its next siblings.
        current['link_or_status'] = header.find_all('td')[2].text.strip()
#         current['link_or_status'] = header.find_all('td')[2].find('a')
#         if current['link_or_status']:
#             link = header.find_all('td')[2].find('a')
#             print(link['href'])
        #here we're defining all of the keys in our dictionary
        current['company'] = rows[0].find_all('td')[1].text.strip()
        current['address'] =  rows[1].find_all('td')[1].text.strip()
        current['license_status'] = rows[1].find_all('td')[2].text.strip()
        current['city_state'] = rows[2].find_all('td')[1].text.strip()
        current['county'] = rows[3].text.strip()
        vape_results_all.append(current)
        #remember to append the dictionary to the list you defined above (vape_results_all)
    try:
        driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a').click()
        #here you're telling it to click through to the next page
    except:
        break




In [251]:
print(vape_results_all)

[{'header': 'VAPE DOJO', 'link_or_status': '', 'company': 'WALKER TRADING COMPANY INC', 'address': '3570 SAINT JOHNS LANE, SUITE 109', 'license_status': 'License: 13144108', 'city_state': 'ELLICOTT CITY, MD 21042', 'county': 'Howard County'}, {'header': 'VAPEZ YARDHOUSE', 'link_or_status': 'Pending *', 'company': 'YARKHOUSE EMPIRE INC', 'address': '3315 PLAZA WAY', 'license_status': '', 'city_state': 'WALDORF, MD 20603', 'county': 'Charles County'}]


In [252]:
import pandas as pd
df = pd.DataFrame(vape_results_all)
#use the file name the heading asks for
df.to_csv("vape-results-all.csv", index=False)
vape_results_all = pd.read_csv('vape-results-all.csv')
vape_results_all.head()

| | address | city_state | company | county | header | license_status | link_or_status |
|---|---|---|---|---|---|---|---|
| 0 | 3570 SAINT JOHNS LANE, SUITE 109 | ELLICOTT CITY, MD 21042 | WALKER TRADING COMPANY INC | Howard County | VAPE DOJO | License: 13144108 | |
| 1 | 3315 PLAZA WAY | WALDORF, MD 20603 | YARKHOUSE EMPIRE INC | Charles County | VAPEZ YARDHOUSE | | Pending * |
