# Texas Cosmetologist Violations

Texas has a system for [searching for license violations](https://www.tdlr.texas.gov/cimsfo/fosearch.asp). You're going to search for cosmetologists!

## Setup: Import what you'll need to scrape the page

We'll be using Selenium for this, *not* BeautifulSoup and requests.

In [1]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()

## Starting your search

Starting from [here](https://www.tdlr.texas.gov/cimsfo/fosearch.asp), search for cosmetologist violations for people with the last name **Nguyen**.

In [2]:
driver.get('https://www.tdlr.texas.gov/cimsfo/fosearch.asp')

In [3]:
# search field is called 'pht_lnm'
# Selenium 4 removed find_element_by_name; 'name' is the locator strategy
lastname = driver.find_element('name', 'pht_lnm')
lastname.send_keys('nguyen')

In [4]:
# dropdown is called 'pht_status'
# Selenium 4 removed find_element_by_name; 'name' is the locator strategy
profession = Select(driver.find_element('name', 'pht_status'))
profession.select_by_visible_text('Cosmetologists')

In [5]:
# click button via xpath (Selenium 4 removed find_element_by_xpath; 'xpath' is the locator strategy)
button = driver.find_element('xpath', '//*[@id="dat-menu"]/div/div[2]/div/div/section/div/div/table/tbody/tr/td/form/table/tbody/tr[18]/td/input[1]')
driver.execute_script("arguments[0].scrollIntoView(true)", button)
button.click()

## Scraping

Once you are on the results page, do this.

### Loop through each result and print the entire row

Okay wait, that's a heck of a lot. Use `[:10]` to only do the first ten (`listname[:10]` gives you the first ten).
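
A quick illustration of that slice, as a minimal sketch. The list `rows` here is a made-up stand-in, not the scraped data:

```python
# Stand-in for the list of result rows; the real list comes from Selenium.
rows = [f"row {i}" for i in range(25)]

first_ten = rows[:10]   # items at index 0 through 9
print(len(first_ten))   # 10
print(first_ten[-1])    # row 9

# Slicing does not raise if the list is shorter than the slice.
print(len(rows[:100]))  # 25
```

The same slice works on whatever list of elements or strings you loop over, so you can test on a few rows before running the full loop.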

In [6]:
# I know that we were supposed to solve this in a different way. I tried and I failed.
# So here's my solution with a minimum amount of selenium and a maximum of list stuff.
# Isn't it the result that counts? :-)
#
# raw results first:
spaguys_raw = []
# Selenium 4 removed find_elements_by_tag_name; 'tag name' is the locator strategy
for row in driver.find_elements('tag name', 'tr')[1:]:
    spaguys_raw.append(row.text)

In [7]:
# This would've been doable in one step, but who knows, maybe I can use the raw data again?
# clean results, split by line breaks, remove empty entries:
spaguys = []
for guy in spaguys_raw:
    spaguys.append(list(filter(None, guy.split('\n'))))
spaguys

[['NGUYEN, TOAN HUU',
  'City: SAN ANTONIO',
  'County: BEXAR',
  'Zip Code: 78217',
  'License #(s): 780948, 1706491, 1699123',
  'Complaint # COS20180004289 Date: 5/30/2018',
  'Respondent is assessed an administrative penalty in the amount of $500. Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day.'],
 ['NGUYEN, HANH CONG',
  'City: EL PASO',
  'County: EL PASO',
  'Zip Code: 79934',
  'License #: 737708',
  'Complaint # COS20180006594 Date: 5/30/2018',
  'Respondent is assessed an administrative penalty in the amount of $1,000. Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day; Respondent failed to use items subject to possible cross contamination in a manner that does not contaminate the remaining product.'],
 ['NGUYEN, KHIEM VAN',
  'City: LONGVIEW',
  'County: GREGG',
  'Zip Code: 75604',
  'License #: 731665',
  'Complaint # COS20180000257 Date: 5/17/2018',
  'Respondent is assessed an adm

### Loop through each result and print each person's name

You'll get an error because the first one doesn't have a name. How do you make that not happen?! If you want to ignore an error, you use code like this:

```python
try:
    int("not a number")  # stand-in for "try to do something"
except Exception:
    print("It didn't work")
```

It should help you out. If you don't want to print anything, you can type `pass` instead of the `print` statement.
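
Here is a small sketch of that pattern applied to a loop. The `rows` list is invented for illustration; its first entry is empty to mimic a row with no name in it:

```python
# Stand-in rows: the first one mimics a header row with no name in it.
rows = [
    [],
    ["NGUYEN, TOAN HUU", "City: SAN ANTONIO"],
    ["NGUYEN, HANH CONG", "City: EL PASO"],
]

names = []
for row in rows:
    try:
        names.append(row[0])   # raises IndexError on the empty first row
    except IndexError:
        pass                   # skip it silently

print(names)
```

Catching the specific error (`IndexError` here) rather than everything makes it less likely that an unrelated bug gets silently skipped.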

**Why doesn't the first one have a name?**

In [8]:
# Some rows have more than one name. Good thing I know what they're called
for guy in spaguys:
    if len(guy) == 7:
        print(guy[0])
    else:
        names = [n for n in guy if "NGUYEN" in n]
        print(names)

NGUYEN, TOAN HUU
NGUYEN, HANH CONG
NGUYEN, KHIEM VAN
NGUYEN, DIEP THI NGOC
['NGUYEN, LAN T-THUY', 'NGUYEN, SAMLOI']
['NGUYEN, TUAN A', 'NGUYEN, TUAN VAN']
NGUYEN, THAO B
NGUYEN, BETH MARIA
NGUYEN, TRUNG N
NGUYEN, NGAT THI
NGUYEN, KELLY PHUONG N
NGUYEN, CHAU THI
NGUYEN, XUAN T
NGUYEN, THANH C
NGUYEN, HAI
NGUYEN, JENNIFER T
NGUYEN, TONY VAN
NGUYEN, HANH THAO TRAN
NGUYEN, QUYEN THI MAI
NGUYEN, OANH THI
NGUYEN, THU NHU
NGUYEN, PHUNG THI
NGUYEN, TUAN
NGUYEN, PHUOC BA
NGUYEN, THAI VAN
NGUYEN, JIMMY
NGUYEN, QUI VAN
NGUYEN, KIM LIEN TRAN
NGUYEN, THUY HONG
NGUYEN, TRANG YEN
NGUYEN, BINH THANH
NGUYEN, MUA THI
NGUYEN, TRUNG H
NGUYEN, TRANG N
NGUYEN, PHUONG TUYET TH
NGUYEN, KIM AN THI
NGUYEN, DUC V
NGUYEN, DUC VAN
NGUYEN, AN QUY
NGUYEN, TONY H
NGUYEN, PHUONG THAO THI
NGUYEN, CHINH K
NGUYEN, SAM
NGUYEN, HUE T
NGUYEN, TAM VAN
NGUYEN, CAM THI
NGUYEN, HONG THI ANH
NGUYEN, HUYEN
NGUYEN, TOAN MINH
NGUYEN, SONNY T
NGUYEN, HOANG THI
NGUYEN, NHUNG TUYET
NGUYEN, TOQUYEN
NGUYEN, LY V
NGUYEN, NAM QUANG
NGUYEN

### Loop through each result, printing each violation description ("Basis for order")

> - *Tip: You'll get an error even if you're ALMOST right - which row is causing the problem?*
> - *Tip: You can get the HTML of something by doing `.get_attribute('innerHTML')` - it might help you diagnose your issue.*
> - *Tip: Or I guess you could just skip the one with the problem...*

In [9]:
# This includes the penalty. I could do a .split() if we wanted to have the violation description separate,
# but I think this is useful info. I'll isolate the fine later
for guy in spaguys:    
    print(guy[-1])

Respondent is assessed an administrative penalty in the amount of $500. Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day.
Respondent is assessed an administrative penalty in the amount of $1,000. Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day; Respondent failed to use items subject to possible cross contamination in a manner that does not contaminate the remaining product.
Respondent is assessed an administrative penalty in the amount of $1,250. Respondent failed to follow whirlpool foot spas cleaning and sanitization procedures as required; Respondent failed to clean, disinfect, and sterilize manicure and pedicure implements after each use; Respondent failed to clean and disinfect all wax pots.
Respondent is assessed an administrative penalty in the amount of $500. Respondent failed to disinfect tools, implements, and supplies with an EPA-registered disinfectant solution; Respondent failed to
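
The comment in the cell above mentions splitting off the penalty. A minimal sketch of one way to do it, assuming the penalty is written as a dollar amount in the first sentence. The helper name `split_violation` is made up for illustration:

```python
import re

def split_violation(text):
    """Return (fine, description); fine is None when no dollar amount appears."""
    match = re.search(r"\$(\d[\d,]*)", text)
    if match is None:
        return None, text
    fine = int(match.group(1).replace(",", ""))
    # Assumption: the penalty sentence comes first and ends at the first ". ".
    _, _, description = text.partition(". ")
    return fine, description or text

sample = (
    "Respondent is assessed an administrative penalty in the amount of $500. "
    "Respondent failed to clean and sanitize whirlpool foot spas as required "
    "at the end of each day."
)
fine, description = split_violation(sample)
print(fine)         # 500
print(description)
```

Anchoring on the `$` sign means other digits in the description are ignored, which a plain "keep every digit" approach would not do.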

### Loop through each result, printing the complaint number

> - *Tip: Think about the order of the elements*

In [10]:
# The original data stores the code with the string "complaint #" and the date, therefore the .split()
for guy in spaguys:    
    print(guy[-2].split()[2])

COS20180004289
COS20180006594
COS20180000257
COS20180004915
COS20180009255
COS20140018343
COS20180008846
COS20180000897
COS20170023893
COS20180004076
COS20180004498
COS20180008220
COS20170009055
COS20180002334
COS20170019449
COS20170021681
COS20180004089
COS20180004300
COS20180004340
COS20180004475
COS20180004720
COS20180004864
COS20180006279
COS20180004329
COS20170020336
COS20180000630
COS20180003692
COS20180002266
COS20180003857
COS20180004081
COS20180005797
COS20170018997
COS20180002614
COS20180003845
COS20170008359
COS20180004075
COS20170021316
COS20170022035
COS20180004639
COS20170009421
COS20180003532
COS20180004016
COS20170022895
COS20180001881
COS20170017965
COS20180001081
COS20180001141
COS20180003707
COS20170014082
COS20180001604
COS20180000225
COS20170022848
COS20180002313
COS20180002216
COS20180000650
COS20180001594
COS20180002227
COS20170018593
COS20170019077
COS20170022385
COS20170022737
COS20180000914
COS20170008502
COS20170015324
COS20170020893
COS20170022810
COS2018000
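
As an alternative to counting positions after `.split()`, a regular expression can pull out the complaint number and the date by their labels. This is only a sketch, assuming every line follows the `Complaint # ... Date: ...` shape seen in the output:

```python
import re
from datetime import datetime

line = "Complaint # COS20180004289 Date: 5/30/2018"

match = re.search(r"Complaint #\s*(\S+)\s+Date:\s*(\S+)", line)
complaint_number = match.group(1)
# Parse the date so it can be sorted or filtered later.
complaint_date = datetime.strptime(match.group(2), "%m/%d/%Y").date()

print(complaint_number)  # COS20180004289
print(complaint_date)    # 2018-05-30
```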

## Saving the results

### Loop through each result to create a list of dictionaries

Each dictionary must contain

- Person's name
- Violation description
- Violation number
- License Numbers
- Zip Code
- County
- City

Create a new dictionary for each result (except the header).

> *Tip: If you want to ask for the "next sibling," you can't use `find_next_sibling` in Selenium. Use `element.find_element("xpath", "following-sibling::div")` to find the next div, or `element.find_element("xpath", "following-sibling::*")` to find the next anything. (Older Selenium versions spell this `element.find_element_by_xpath(...)`.)*

In [11]:
spaguys_total = []
# I go through the list of lists and store each row's entries in a dictionary.
# The try/except is for a few rows of incomplete data in the set.
for guy in spaguys:    
    try:  
        guy_dic = {}

        # There are some tricky rows with multiple names, zip codes, county names and city names
        # I'll get the info for the easy cells first
        # There's a lot of .split()'ing here since the values are stored with a description
        if len(guy) == 7:
            guy_dic['name'] = guy[0]
            guy_dic['zip'] = guy[3].split(': ')[1]
            guy_dic['violation_number'] = guy[-2].split()[2]
            # The number is stored together with the date, so why not save the date as well...
            guy_dic['date'] = guy[-2].split()[4]
            guy_dic['violation'] = guy[-1]
            
            # The violation includes the fine, if there is one. Let's separate the amounts by taking out all numbers
            # This might need some refining if there are other numbers in the string but we are ok for now...
            try:
                guy_dic['fine'] = int(''.join(amount for amount in guy[-1] if amount.isdigit()))
            except ValueError:  # int('') fails when the text contains no digits
                guy_dic['fine'] = 'no fine'
            
            guy_dic['county'] = guy[2].split(': ')[1]
            guy_dic['city'] = guy[1].split(': ')[1]
            # Could isolate the license numbers, but I don't understand the system: do some people have multiple licenses?
            guy_dic['license'] = guy[4].split(': ')[1]

        # Now for the rows with multiple entries for name and/or city, county...
        else:
            # part of the info is in the same place for all rows
            # I can take this info from the end, hence no problem with multiple names/addresses
            guy_dic['violation_number'] = guy[-2].split()[2]
            guy_dic['date'] = guy[-2].split()[4]            
            guy_dic['violation'] = guy[-1] 
            
            try:
                guy_dic['fine'] = int(''.join(amount for amount in guy[-1] if amount.isdigit()))
            except ValueError:  # int('') fails when the text contains no digits
                guy_dic['fine'] = 'no fine'
                
            guy_dic['license'] = guy[-3].split(': ')[1]
            # now 2 names:
            multinames = [n for n in guy if "NGUYEN" in n]
            guy_dic['name2'] = multinames[1]
            guy_dic['name'] = multinames[0]
            # Let's check if they run the same place together. Then we just need one address. 
            # I assume that same zip = same place as we don't have a street address
            multizip = [z for z in guy if "Zip Code" in z]
            
            if multizip[0].split(': ')[1] != multizip[1].split(': ')[1]:
                guy_dic['zip'] = multizip[0].split(': ')[1] 
                guy_dic['zip2'] = multizip[1].split(': ')[1]
                multicity = [ci for ci in guy if "City: " in ci]
                guy_dic['city'] = multicity[0].split(': ')[1] 
                guy_dic['city2'] = multicity[1].split(': ')[1]
                multicounty = [co for co in guy if "County: " in co]
                guy_dic['county'] = multicounty[0].split(': ')[1] 
                guy_dic['county2'] = multicounty[1].split(': ')[1]          
            else:
                # If they are located at the same address, I only write one address
                guy_dic['zip'] = guy[3].split(': ')[1]
                guy_dic['county'] = guy[2].split(': ')[1]
                guy_dic['city'] = guy[1].split(': ')[1]    
                
        spaguys_total.append(guy_dic)
    except Exception:
        # There are a few entries that are lacking data or are otherwise broken. I'll check them by hand:
        print(guy)

['NGUYEN, HOANG JUDY', 'City: HOUSTON', 'County:', 'Zip Code: 77095', 'License #: 971279', 'Complaint # COS20160018533 Date: 8/15/2016', 'Respondent is assessed an administrative penalty in the amount of $500. Respondent practiced cosmetology services in an unlicensed beauty salon.']
['NGUYEN, KENNY KHANH', 'Company: CK NAILS LIC 745308', 'City: LEANDER', 'County: WILLIAMSON', 'Zip Code: 78641', 'License #: 745308', 'Complaint # COS20150021110 Date: 12/16/2015', 'Respondent is assessed an administrative penalty in the amount of $500. Respondent failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used.']
['NGUYEN, THUY T', 'City: CYPRESS', 'County:', 'Zip Code: 77429', 'License #: 748032', 'Complaint # COS20150020171 Date: 11/16/2015', 'Respondent is assessed an administrative penalty in the amount of $1,250. Respondent leased space in a salon to an individual who engaged in the practice of cosmetology but had not obtaine
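
The rows printed above break the position-based approach because a value is missing (`'County:'`) or an extra `Company:` line shifts everything. One possible way around this is to read each line by its label instead of its position. The sketch below is an assumption-laden alternative, not the approach used in this notebook: `LABELS` and `parse_row` are hypothetical names, and rows with two respondents would still need the special handling shown earlier.

```python
# Map the labels seen in the scraped text to dictionary keys.
LABELS = {
    "City": "city",
    "County": "county",
    "Zip Code": "zip",
    "Company": "company",
    "License #": "license",
    "License #(s)": "license",
}

def parse_row(lines):
    """Build a dict from one row's lines by reading labels, not positions."""
    record = {"name": lines[0], "violation": lines[-1]}
    for line in lines[1:-1]:
        if line.startswith("Complaint #"):
            parts = line.split()
            record["violation_number"] = parts[2]
            record["date"] = parts[4]
            continue
        label, sep, value = line.partition(":")
        key = LABELS.get(label.strip())
        if sep and key:
            # An empty value such as "County:" becomes None instead of raising.
            # setdefault keeps the first value when a label repeats.
            record.setdefault(key, value.strip() or None)
    return record

row = [
    "NGUYEN, HOANG JUDY",
    "City: HOUSTON",
    "County:",
    "Zip Code: 77095",
    "License #: 971279",
    "Complaint # COS20160018533 Date: 8/15/2016",
    "Respondent is assessed an administrative penalty in the amount of $500. "
    "Respondent practiced cosmetology services in an unlicensed beauty salon.",
]
parsed = parse_row(row)
print(parsed["city"], parsed["county"], parsed["zip"])  # HOUSTON None 77095
```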

### Save that to a CSV

- Tip: You'll want to use pandas here

In [12]:
df = pd.DataFrame(spaguys_total)
df

| | city | city2 | county | county2 | date | fine | license | name | name2 | violation | violation_number | zip | zip2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SAN ANTONIO | | BEXAR | | 5/30/2018 | 500 | 780948, 1706491, 1699123 | NGUYEN, TOAN HUU | | Respondent is assessed an administrative penal... | COS20180004289 | 78217 | |
| 1 | EL PASO | | EL PASO | | 5/30/2018 | 1000 | 737708 | NGUYEN, HANH CONG | | Respondent is assessed an administrative penal... | COS20180006594 | 79934 | |
| 2 | LONGVIEW | | GREGG | | 5/17/2018 | 1250 | 731665 | NGUYEN, KHIEM VAN | | Respondent is assessed an administrative penal... | COS20180000257 | 75604 | |
| 3 | HOUSTON | | HARRIS | | 5/17/2018 | 500 | 1347649, 760528 | NGUYEN, DIEP THI NGOC | | Respondent is assessed an administrative penal... | COS20180004915 | 77014 | |
| 4 | SAN ANTONIO | | BEXAR | | 5/17/2018 | 575 | 767339 | NGUYEN, LAN T-THUY | NGUYEN, SAMLOI | Respondent is assessed an administrative penal... | COS20180009255 | 78255 | |
| 5 | AUSTIN | ARLINGTON | TRAVIS | TARRANT | 5/9/2018 | 1000 | 681274 | NGUYEN, TUAN A | NGUYEN, TUAN VAN | Respondent is assessed an administrative penal... | COS20140018343 | 78723 | 76011 |
| 6 | EULESS | | TARRANT | | 5/9/2018 | 750 | 721373, 1142884 | NGUYEN, THAO B | | Respondent is assessed an administrative penal... | COS20180008846 | 76039 | |
| 7 | HOUSTON | | HARRIS | | 4/30/2018 | 32916 | 1470271 | NGUYEN, BETH MARIA | | Respondent's Cosmetology Operator license was ... | COS20180000897 | 77083 | |
| 8 | AMARILLO | | POTTER | | 4/25/2018 | 1300 | 1196244, 767015, 767014 | NGUYEN, TRUNG N | | Respondent is assessed an administrative penal... | COS20170023893 | 79106 | |
| 9 | PITTSBURG | | CAMP | | 4/25/2018 | 625 | 759931 | NGUYEN, NGAT THI | | Respondent is assessed an administrative penal... | COS20180004076 | 75686 | |


In [13]:
df.to_csv("spa_violations.csv", index=False)

### Open the CSV file and examine the first few. Make sure you didn't save an extra weird unnamed column.

In [14]:
pd.read_csv("spa_violations.csv")

| | city | city2 | county | county2 | date | fine | license | name | name2 | violation | violation_number | zip | zip2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SAN ANTONIO | | BEXAR | | 5/30/2018 | 500 | 780948, 1706491, 1699123 | NGUYEN, TOAN HUU | | Respondent is assessed an administrative penal... | COS20180004289 | 78217 | |
| 1 | EL PASO | | EL PASO | | 5/30/2018 | 1000 | 737708 | NGUYEN, HANH CONG | | Respondent is assessed an administrative penal... | COS20180006594 | 79934 | |
| 2 | LONGVIEW | | GREGG | | 5/17/2018 | 1250 | 731665 | NGUYEN, KHIEM VAN | | Respondent is assessed an administrative penal... | COS20180000257 | 75604 | |
| 3 | HOUSTON | | HARRIS | | 5/17/2018 | 500 | 1347649, 760528 | NGUYEN, DIEP THI NGOC | | Respondent is assessed an administrative penal... | COS20180004915 | 77014 | |
| 4 | SAN ANTONIO | | BEXAR | | 5/17/2018 | 575 | 767339 | NGUYEN, LAN T-THUY | NGUYEN, SAMLOI | Respondent is assessed an administrative penal... | COS20180009255 | 78255 | |
| 5 | AUSTIN | ARLINGTON | TRAVIS | TARRANT | 5/9/2018 | 1000 | 681274 | NGUYEN, TUAN A | NGUYEN, TUAN VAN | Respondent is assessed an administrative penal... | COS20140018343 | 78723 | 76011.0 |
| 6 | EULESS | | TARRANT | | 5/9/2018 | 750 | 721373, 1142884 | NGUYEN, THAO B | | Respondent is assessed an administrative penal... | COS20180008846 | 76039 | |
| 7 | HOUSTON | | HARRIS | | 4/30/2018 | 32916 | 1470271 | NGUYEN, BETH MARIA | | Respondent's Cosmetology Operator license was ... | COS20180000897 | 77083 | |
| 8 | AMARILLO | | POTTER | | 4/25/2018 | 1300 | 1196244, 767015, 767014 | NGUYEN, TRUNG N | | Respondent is assessed an administrative penal... | COS20170023893 | 79106 | |
| 9 | PITTSBURG | | CAMP | | 4/25/2018 | 625 | 759931 | NGUYEN, NGAT THI | | Respondent is assessed an administrative penal... | COS20180004076 | 75686 | |
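
Note the `zip2` value `76011.0` in the output above: when a column has missing values, `pd.read_csv` reads the numbers that are present as floats. If zip codes should stay as text, one option is to pass `dtype` when reading. The sketch below uses a tiny in-memory CSV (built with `io.StringIO`, values borrowed from the table) rather than the real file:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for spa_violations.csv.
csv_text = (
    "name,zip,zip2\n"
    '"NGUYEN, TUAN A",78723,76011\n'
    '"NGUYEN, TOAN HUU",78217,\n'
)

# Default: zip2 contains a missing value, so the column is read as float.
default = pd.read_csv(io.StringIO(csv_text))
print(default["zip2"].iloc[0])   # 76011.0

# With dtype, both zip columns are kept as strings.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str, "zip2": str})
print(as_text["zip2"].iloc[0])   # 76011
```

Keeping zip codes as strings also protects any leading zeros, which a numeric column would drop.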
