# Background
In 2017, I worked on a consulting project for a marijuana company in Denver, Colorado. At the time, there was a sense in the community that Colorado was paving the way for legalization and that the industry was more inclusive, with wealth more evenly distributed, than other commercial sectors. That notion does not hold up against the current economic landscape: over 80% of business owners are white, in contrast to a historical record of Black and Latinx people being disproportionately targeted for marijuana possession. (A study from Queens College found that, before legalization, Black and Latinx people were 13 percentage points more disproportionately targeted for marijuana possession arrests than their white counterparts.) Everywhere, the argument seemed to be: "OK, we're working on being more socially equitable from a minority-inclusive perspective, but we're doing a great job from an income-class perspective."

This led me to wonder: who actually owns marijuana in Colorado? How is income distributed among people? To start answering this question, I decided that retail license owners would be a good first proxy for market share and ownership. To collect all retail licenses, I scraped the Colorado Department of Revenue's website.

Update: Marijuana co-ops were recently banned in Colorado, and there is a bill currently on the docket to expunge the records of everyone with possession records.

# Imports

In [None]:
import pandas as pd
import requests
import matplotlib
import numpy as np
import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select

%matplotlib inline

# Scraper

In [None]:
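# The goal is to build a list of dictionaries of licensees that we can turn into a DataFrame.
licensees = []
# Links (or elements) that could not be scraped are collected here for troubleshooting.
# The scraper appends to this list, so it must be defined before the scraper cell runs.
failed = []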
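
To make the target shape concrete: each entry in `licensees` is meant to become one DataFrame row that pairs a person with one facility license. The sketch below is a hypothetical helper (`expand_records` is not part of the scraper, and the values are placeholders) showing how one person with several facility licenses might be expanded into several rows. The keys mirror the ones the scraper uses.

```python
def expand_records(full_name, plicense, facility_licenses):
    """Return one dict per facility license for a single person.

    Hypothetical helper for illustration only.
    """
    rows = []
    for flicense in facility_licenses:
        # Build a fresh dict each time so the rows do not share state.
        rows.append({
            'full_name': full_name,
            'plicense': plicense,
            'flicense': flicense,
        })
    return rows


# Placeholder values, not real license data.
rows = expand_records('Jane Doe', 'P-0001', ['F-100', 'F-200'])
for row in rows:
    print(row)
```

Building a new dict per row matters: appending the same dict object twice and then changing its `flicense` would leave every row showing the last license.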

In [None]:
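# For debugging: chromedriver must be compatible with the installed version of Chrome.
# Note: this notebook uses the Selenium 3 API (a positional chromedriver path and find_element_by_css_selector);
# Selenium 4 replaced these with Service objects and find_element(By.CSS_SELECTOR, ...).
driver=webdriver.Chrome('./chromedriver')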
driver.get('https://codor.mylicense.com/med_verification/Search.aspx?facility=N')

button=Select(driver.find_element_by_css_selector('#t_web_lookup__license_type_name'))
options = button.options

# There are 8 different license types to iterate through.
# When you select a license type and click "Search", all last names of that license type show up.
# If you select "All" as the license type, you have to specify a letter of the alphabet as well.
# For this reason, it's generally more efficient to iterate through license types
# (num is the index of the license type in the drop-down menu).
# Use range(1, 9) to collect everything.
# For a demo, comment out the full range and uncomment the shorter range(6, 7) below.
for num in range(1,9):
# for num in range(6,7):
    driver.get('https://codor.mylicense.com/med_verification/Search.aspx?facility=N')
    button=Select(driver.find_element_by_css_selector('#t_web_lookup__license_type_name'))
    #num is the index of the license type in the menu bar (we want to skip "all" so we start with 1)
    button.select_by_index(num)
    text_input = driver.find_element_by_css_selector('#t_web_lookup__last_name')
    # Every search generates unique URLs. There are so many pages for "Key" (num 6) that the URLs we collect time out before we can grab info from them.
    # For this reason, I had to add letter batching for the "Key" option; comment this branch out for demo purposes.
    if num == 6:
        for letter in ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']:
            text_input.send_keys(letter)
            search = driver.find_element_by_css_selector('#sch_button')
            search.click()
        #collect individual person data
        #while True is because we paginate with a css_selector element and break when there are no more next page elements
        #the reason we did not paginate with number elements is that we are only displayed 40 pages at a time and a "..." item you click to see more
        #if this were to be run more than one time, it would probably be more readable to turn this section into a function, as it is used for all cases
            while True:
                #in this case, we selected all elements that contained a link to an individual person's license 
                for a in driver.find_elements_by_css_selector("td[rowspan='0'] a"):
                        #extract the link element
                        try:
                            link = a.get_attribute('href')
                        #if there exists an element that should have a link but we fail to extract it, we should append to a failed list for troubleshooting
                        except:
                            failed.append(a)
                            continue
                        a.click()
                        # After clicking on a person, a new window opens; switch to it to grab that person's info.
                        driver.switch_to.window(driver.window_handles[1])
                        # Once the person's page is open, we can switch to BeautifulSoup for a cleaner scraping experience.
                        # Sometimes an individual page is slow to load, so the name and license elements don't exist yet.
                        # We parse the page up to three times, sleeping 2 and then 5 seconds between attempts.
                        for wait in (0, 2, 5):
                            time.sleep(wait)
                            soup = BeautifulSoup(driver.page_source, 'html.parser')
                            full_name = soup.find('span', id="_ctl25__ctl1_full_name")
                            plicense = soup.find(id='_ctl32__ctl1_license_no')
                            if full_name is not None and plicense is not None:
                                break
                        else:
                            # If it still fails, record the link for troubleshooting (an empty failed list means success),
                            # close the detail window, and return to the results page before moving on.
                            failed.append(link)
                            driver.close()
                            driver.switch_to.window(driver.window_handles[0])
                            continue
                        person = {'full_name': full_name.text, 'plicense': plicense.text}
                        print(full_name.text)
                        #finding the facility license
                        regexp = re.compile(r'_ctl37__ct.*_license_no')
                        licenses = soup.find_all(id=regexp)
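                        # Some people have multiple facility licenses to their name; let's grab all of them.
                        for license in licenses:
                            print(license.text)
                            # Append a copy per facility license; reusing one dict would overwrite earlier flicense values.
                            licensees.append(dict(person, flicense=license.text))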
                        print('------------')
                        #after a certain number of windows are open, selenium will crash so it is good practice to close the window
                        driver.close()
                        #switch back to the main search result page to paginate or move on
                        driver.switch_to.window(driver.window_handles[0])
            #once we have iterated through all the names/links on the page, we must go to the next page (unless it's the last)
            #if there is a next page, we'll click it to start the loop again
                try:
                    nextPage = driver.find_element_by_css_selector('#datagrid_results > tbody > tr:last-child span + a')
                    nextPage.click()
                except:
                    break
            # Pagination for this letter is finished, so go through the flow again for the next letter:
            # reload the search form (outside the while loop, so it doesn't interrupt pagination).
            driver.get('https://codor.mylicense.com/med_verification/Search.aspx?facility=N')
            button=Select(driver.find_element_by_css_selector('#t_web_lookup__license_type_name'))
            # Here, we know that we only want to select the 6th (Key) license type.
            button.select_by_index(6)
            text_input = driver.find_element_by_css_selector('#t_web_lookup__last_name')
                    
                            
    else:
        search = driver.find_element_by_css_selector('#sch_button')
        search.click()

    while True:
        # This loop handles each individual person on the current results page.
        # See the comments in the "Key" branch above for details.
        for a in driver.find_elements_by_css_selector("td[rowspan='0'] a"):
            try:
                link = a.get_attribute('href')
            except:
                failed.append(a)
                continue
            a.click()
            # After clicking on a person, a new window opens; switch to it to grab that person's info.
            driver.switch_to.window(driver.window_handles[1])
            # Once the person's page is open, we can switch to BeautifulSoup for a cleaner scraping experience.
            # Parse the page up to three times, sleeping 2 and then 5 seconds between attempts, in case it is slow to load.
            for wait in (0, 2, 5):
                time.sleep(wait)
                soup = BeautifulSoup(driver.page_source, 'html.parser')
                full_name = soup.find('span', id="_ctl25__ctl1_full_name")
                plicense = soup.find(id='_ctl32__ctl1_license_no')
                if full_name is not None and plicense is not None:
                    break
            else:
                # Still missing after the retries: record the link, close the detail window, and move on.
                failed.append(link)
                driver.close()
                driver.switch_to.window(driver.window_handles[0])
                continue
            person = {'full_name': full_name.text, 'plicense': plicense.text}
            print(full_name.text)
            #finding the facility license
            regexp = re.compile(r'_ctl37__ct.*_license_no')
            licenses = soup.find_all(id=regexp)
            for license in licenses:
                print(license.text)
                # Append a copy per facility license; reusing one dict would overwrite earlier flicense values.
                licensees.append(dict(person, flicense=license.text))
            print('------------')
            driver.close()
            driver.switch_to.window(driver.window_handles[0])
        #once we have iterated through all the names/links on the page, we must go to the next page (unless it's the last)
        try:
            nextPage = driver.find_element_by_css_selector('#datagrid_results > tbody > tr:last-child span + a')
            nextPage.click()
        except:
            break

    # when there's no more next page, we'll want to return back to the search page and go to the next index number
    driver.get('https://codor.mylicense.com/med_verification/Search.aspx?facility=N')
    button=Select(driver.find_element_by_css_selector('#t_web_lookup__license_type_name'))
    button.select_by_index(num)
    text_input = driver.find_element_by_css_selector('#t_web_lookup__last_name')
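
The wait-and-retry handling for slow detail pages could be pulled into a small helper. This is a minimal sketch, not the scraper's actual code: `fetch_with_retries` and its parameters are hypothetical names, and `parse` stands in for whatever function extracts a value from the current page and returns `None` when the element isn't there yet. The default delays mirror the 2 and 5 second waits described in the scraper's comments.

```python
import time


def fetch_with_retries(parse, delays=(0, 2, 5), sleep=time.sleep):
    """Call parse() until it returns something other than None.

    Waits delays[i] seconds before attempt i and returns None if every
    attempt fails. `sleep` is injectable so the helper can be exercised
    without real waiting.
    """
    for delay in delays:
        if delay:
            sleep(delay)
        result = parse()
        if result is not None:
            return result
    return None


# Usage with a stand-in parser that only succeeds on its third call.
attempts = []


def fake_parse():
    attempts.append(1)
    return 'Jane Doe' if len(attempts) == 3 else None


waits = []  # records the requested waits instead of actually sleeping
name = fetch_with_retries(fake_parse, sleep=waits.append)
print(name, waits)
```

In the scraper, `parse` might re-read `driver.page_source` and look up the name span on each call, so a slow page gets a fresh parse after every wait.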


# Transform and Save

In [None]:
df = pd.DataFrame(licensees)

In [None]:
df.to_csv('colorado_licenses.csv',index=False)
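
As a first look at the ownership question, one could count how many facility licenses are attached to each person in the saved file. This is only a sketch under assumptions: it uses the standard library instead of pandas, `licenses_per_person` is a hypothetical name, and the inline CSV text is placeholder data with the same columns as `colorado_licenses.csv`, not scraped results.

```python
import csv
import io
from collections import Counter


def licenses_per_person(csv_file):
    """Count distinct facility licenses per personal license number."""
    seen = set()
    counts = Counter()
    for row in csv.DictReader(csv_file):
        pair = (row['plicense'], row['flicense'])
        if pair in seen:
            continue  # skip exact duplicate person/facility rows
        seen.add(pair)
        counts[row['plicense']] += 1
    return counts


# Placeholder rows, not real data.
sample = io.StringIO(
    "full_name,plicense,flicense\n"
    "Person A,P-1,F-10\n"
    "Person A,P-1,F-11\n"
    "Person B,P-2,F-10\n"
    "Person A,P-1,F-10\n"
)
counts = licenses_per_person(sample)
print(counts.most_common())
```

With the real file, `open('colorado_licenses.csv', newline='')` would take the place of the `io.StringIO` object.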