# Texas Cosmetologist Violations

Texas has a system for [searching for license violations](https://www.tdlr.texas.gov/cimsfo/fosearch.asp). You're going to search for cosmetologists!

## Setup: Import what you'll need to scrape the page

We'll be using Selenium for this, *not* BeautifulSoup and requests.

In [1]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()

## Starting your search

Starting from [here](https://www.tdlr.texas.gov/cimsfo/fosearch.asp), search for cosmetologist violations for people with the last name **Nguyen**.

In [2]:
driver.get('https://www.tdlr.texas.gov/cimsfo/fosearch.asp')

In [3]:
# search field is called 'pht_lnm'
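# find_element_by_name was removed in Selenium 4; 'name' is the locator strategy (the value of By.NAME)
lastname = driver.find_element('name', 'pht_lnm')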
lastname.send_keys('nguyen')

In [4]:
# dropdown is called 'pht_status'
profession = Select(driver.find_element('name', 'pht_status'))
profession.select_by_visible_text('Cosmetologists')

In [5]:
# click button via xpath
button = driver.find_element('xpath', '//*[@id="dat-menu"]/div/div[2]/div/div/section/div/div/table/tbody/tr/td/form/table/tbody/tr[18]/td/input[1]')
driver.execute_script("arguments[0].scrollIntoView(true)", button)
button.click()

## Scraping

Once you are on the results page, work through the steps below.

### Loop through each result and print the entire row

Okay wait, that's a heck of a lot. Use `[:10]` to only do the first ten (`listname[:10]` gives you the first ten).

In [6]:
# I know that we were supposed to solve this in a different way. I tried and I failed.
# So here's my solution with a minimum amount of selenium involved.
# Isn't it the result that counts? :-)
# raw results first:
spaguys = []
for row in driver.find_elements('tag name', 'tr')[1:11]:
    #print(row.text)
    spaguys.append(row.text)

In [7]:
# clean results, split by line breaks, remove empty entries:
spaguys_clean = []
for guy in spaguys:
    spaguys_clean.append(list(filter(None, guy.split('\n'))))
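
To make the cleaning step concrete, here is a minimal sketch on a single hard-coded string, with no browser involved. The exact layout of a row's text (labels such as `City:` and `License #:`, and where the blank lines fall) is an assumption reconstructed from how the later cells parse it, not something copied from the live page.

```python
# Hypothetical row text: the labels and blank lines are assumptions
# inferred from the parsing code in this notebook, not copied from the site.
sample_row_text = (
    "NGUYEN, TOAN HUU\n"
    "City: SAN ANTONIO\n"
    "County: BEXAR\n"
    "Zip Code: 78217\n"
    "\n"
    "License #: 780948, 1706491, 1699123\n"
    "\n"
    "Complaint # COS20180004289\n"
    "Respondent is assessed an administrative penalty in the amount of $500."
)

# split('\n') keeps an empty string for each blank line;
# filter(None, ...) drops those empty strings
cleaned = list(filter(None, sample_row_text.split("\n")))

print(len(cleaned))
print(cleaned[0])
```

The cleaned list has seven items, which is the length the later cells treat as a normal single-person row.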

### Loop through each result and print each person's name

You'll get an error because the first one doesn't have a name. How do you make that not happen?! If you want to ignore an error, you use code like this:

```python
try:
   try to do something
except:
   print("It didn't work")
```

It should help you out. If you don't want to print anything, you can type `pass` instead of the `print` statement.
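
As a minimal sketch of that pattern with no browser involved, the loop below tries to read the second item of each list and skips the lists that are too short. The `rows` list is made-up illustration data (the first entry stands in for a header row with no name). Catching `IndexError` specifically, rather than using a bare `except`, is a design choice here so that unrelated bugs still surface.

```python
# Made-up stand-in data: the first entry plays the role of a header row
rows = [
    ["(header)"],
    ["row 1", "NGUYEN, TOAN HUU"],
    ["row 2", "NGUYEN, HANH CONG"],
]

names = []
for row in rows:
    try:
        # Raises IndexError when the row has no second item
        names.append(row[1])
    except IndexError:
        # Nothing to collect for this row, so move on quietly
        pass

print(names)
```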

**Why doesn't the first one have a name?**

In [8]:
for guy in spaguys_clean:
    if len(guy) == 7:
        print(guy[0])
    else:
        names = [n for n in guy if "NGUYEN" in n]
        print(names)

NGUYEN, TOAN HUU
NGUYEN, HANH CONG
NGUYEN, KHIEM VAN
NGUYEN, DIEP THI NGOC
['NGUYEN, LAN T-THUY', 'NGUYEN, SAMLOI']
['NGUYEN, TUAN A', 'NGUYEN, TUAN VAN']
NGUYEN, THAO B
NGUYEN, BETH MARIA
NGUYEN, TRUNG N
NGUYEN, NGAT THI


### Loop through each result, printing each violation description ("Basis for order")

> - *Tip: You'll get an error even if you're ALMOST right - which row is causing the problem?*
> - *Tip: You can get the HTML of something by doing `.get_attribute('innerHTML')` - it might help you diagnose your issue.*
> - *Tip: Or I guess you could just skip the one with the problem...*

In [9]:
for guy in spaguys_clean:    
    # we could also count backwards and print(guy[-1]) here
    complaints = [c for c in guy if "Respondent" in c]
    print(complaints)

['Respondent is assessed an administrative penalty in the amount of $500. Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day.']
['Respondent is assessed an administrative penalty in the amount of $1,000. Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day; Respondent failed to use items subject to possible cross contamination in a manner that does not contaminate the remaining product.']
['Respondent is assessed an administrative penalty in the amount of $1,250. Respondent failed to follow whirlpool foot spas cleaning and sanitization procedures as required; Respondent failed to clean, disinfect, and sterilize manicure and pedicure implements after each use; Respondent failed to clean and disinfect all wax pots.']
['Respondent is assessed an administrative penalty in the amount of $500. Respondent failed to disinfect tools, implements, and supplies with an EPA-registered disinfectant solution; Respon

### Loop through each result, printing the complaint number

- TIP: Think about the order of the elements

In [10]:
for guy in spaguys_clean:    
    compnum = [c for c in guy if "Complaint #" in c]
    print(compnum[0].split()[2])

COS20180004289
COS20180006594
COS20180000257
COS20180004915
COS20180009255
COS20140018343
COS20180008846
COS20180000897
COS20170023893
COS20180004076
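
Splitting on whitespace works as long as the complaint line always has the same number of words before the ID. As an alternative sketch, a regular expression can pull the ID out regardless of its position in the line. The pattern assumes every ID looks like the ten printed above (`COS` followed by digits); the helper name `find_complaint_id` and the sample line layout are hypothetical.

```python
import re

# Assumed pattern: 'COS' followed by one or more digits, as in the output above
COMPLAINT_ID = re.compile(r"COS\d+")

def find_complaint_id(line):
    """Return the first complaint ID found in the line, or None if there is none."""
    match = COMPLAINT_ID.search(line)
    return match.group(0) if match else None

# Hypothetical line layout; only the ID itself comes from the results above
print(find_complaint_id("Complaint # COS20180004289"))
print(find_complaint_id("City: SAN ANTONIO"))
```

Returning `None` for lines without an ID lets the caller decide whether to skip the row or raise an error.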


## Saving the results

### Loop through each result to create a list of dictionaries

Each dictionary must contain

- Person's name
- Violation description
- Violation number
- License Numbers
- Zip Code
- County
- City

Create a new dictionary for each result (except the header).

> *Tip: If you want to ask for the "next sibling," you can't use `find_next_sibling` in Selenium. Use an XPath instead: `element.find_element("xpath", "following-sibling::div")` finds the next div, and `element.find_element("xpath", "following-sibling::*")` finds the next anything. (Older Selenium versions spell this `element.find_element_by_xpath(...)`.)*

In [11]:
spaguys_dict = []
for guy in spaguys_clean:
    guy_dic = {}
    
    # There are some tricky rows with multiple names, zip codes, county names and city names
    # Let's get the info for the normal cells first
    if len(guy) == 7:
        guy_dic['name'] = guy[0]
        guy_dic['zip'] = guy[3].split(': ')[1]
        guy_dic['violation_number'] = guy[-2].split()[2]
        guy_dic['violation'] = guy[-1]       
        guy_dic['county'] = guy[2].split(': ')[1]
        guy_dic['city'] = guy[1].split(': ')[1]
        # We could split the license numbers, but I don't really see a system here. Do some people have multiple licenses?
        guy_dic['license'] = guy[4].split(': ')[1]
    
    # Now for the doubles. We could join them, but extra columns are better I think
    else:
        # Let's first look at the info that is the same for double/single rows. We can take it from the back, so multiple names/addresses are no problem
        guy_dic['violation_number'] = guy[-2].split()[2]
        guy_dic['violation'] = guy[-1]       
        guy_dic['license'] = guy[-3].split(': ')[1]
        
        # now 2 names:
        multinames = [n for n in guy if "NGUYEN" in n]
        guy_dic['name2'] = multinames[1]
        guy_dic['name'] = multinames[0]
        
        # Let's check if they run the same place together. Then we just need one address. We assume that same zip = same place
        multizip = [z for z in guy if "Zip Code" in z]
        if multizip[0].split(': ')[1] != multizip[1].split(': ')[1]:
            guy_dic['zip'] = multizip[0].split(': ')[1] 
            guy_dic['zip2'] = multizip[1].split(': ')[1]
            
            multicity = [ci for ci in guy if "City: " in ci]
            guy_dic['city'] = multicity[0].split(': ')[1] 
            guy_dic['city2'] = multicity[1].split(': ')[1]
            
            multicounty = [co for co in guy if "County: " in co]
            guy_dic['county'] = multicounty[0].split(': ')[1] 
            guy_dic['county2'] = multicounty[1].split(': ')[1]          
        
        else:
            guy_dic['zip'] = guy[3].split(': ')[1]
            guy_dic['county'] = guy[2].split(': ')[1]
            guy_dic['city'] = guy[1].split(': ')[1]    
            
    spaguys_dict.append(guy_dic)
spaguys_dict

[{'name': 'NGUYEN, TOAN HUU',
  'zip': '78217',
  'violation_number': 'COS20180004289',
  'violation': 'Respondent is assessed an administrative penalty in the amount of $500. Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day.',
  'county': 'BEXAR',
  'city': 'SAN ANTONIO',
  'license': '780948, 1706491, 1699123'},
 {'name': 'NGUYEN, HANH CONG',
  'zip': '79934',
  'violation_number': 'COS20180006594',
  'violation': 'Respondent is assessed an administrative penalty in the amount of $1,000. Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day; Respondent failed to use items subject to possible cross contamination in a manner that does not contaminate the remaining product.',
  'county': 'EL PASO',
  'city': 'EL PASO',
  'license': '737708'},
 {'name': 'NGUYEN, KHIEM VAN',
  'zip': '75604',
  'violation_number': 'COS20180000257',
  'violation': 'Respondent is assessed an administrative penalty in the 
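
The positional indexing above depends on a normal row having exactly seven lines. One alternative, sketched here under assumptions, is to key off the `Label: value` text instead and collect repeated labels into lists, so rows with two people need no special case. The helper name `parse_labeled_lines` and the sample line layout are hypothetical; only the values come from the two-person AUSTIN/ARLINGTON result in this notebook.

```python
def parse_labeled_lines(lines):
    """Group 'Label: value' lines by label; repeated labels build up a list."""
    record = {}
    for line in lines:
        label, sep, value = line.partition(": ")
        if sep:  # only lines that actually contain ': '
            record.setdefault(label, []).append(value)
    return record

# Hypothetical cleaned row with two people; the line layout is an assumption
sample = [
    "NGUYEN, TUAN A",
    "City: AUSTIN",
    "County: TRAVIS",
    "Zip Code: 78723",
    "NGUYEN, TUAN VAN",
    "City: ARLINGTON",
    "County: TARRANT",
    "Zip Code: 76011",
]

record = parse_labeled_lines(sample)
print(record["City"])
print(record["Zip Code"])
```

Lines without `': '`, such as the names, are ignored by this helper and would still need separate handling. A violation description that happens to contain `': '` would be picked up as a label, so it is safer to pass in only the address lines.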

### Save that to a CSV

- Tip: You'll want to use pandas here

In [12]:
df = pd.DataFrame(spaguys_dict)
df

|  | city | city2 | county | county2 | license | name | name2 | violation | violation_number | zip | zip2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SAN ANTONIO |  | BEXAR |  | 780948, 1706491, 1699123 | NGUYEN, TOAN HUU |  | Respondent is assessed an administrative penal... | COS20180004289 | 78217 |  |
| 1 | EL PASO |  | EL PASO |  | 737708 | NGUYEN, HANH CONG |  | Respondent is assessed an administrative penal... | COS20180006594 | 79934 |  |
| 2 | LONGVIEW |  | GREGG |  | 731665 | NGUYEN, KHIEM VAN |  | Respondent is assessed an administrative penal... | COS20180000257 | 75604 |  |
| 3 | HOUSTON |  | HARRIS |  | 1347649, 760528 | NGUYEN, DIEP THI NGOC |  | Respondent is assessed an administrative penal... | COS20180004915 | 77014 |  |
| 4 | SAN ANTONIO |  | BEXAR |  | 767339 | NGUYEN, LAN T-THUY | NGUYEN, SAMLOI | Respondent is assessed an administrative penal... | COS20180009255 | 78255 |  |
| 5 | AUSTIN | ARLINGTON | TRAVIS | TARRANT | 681274 | NGUYEN, TUAN A | NGUYEN, TUAN VAN | Respondent is assessed an administrative penal... | COS20140018343 | 78723 | 76011.0 |
| 6 | EULESS |  | TARRANT |  | 721373, 1142884 | NGUYEN, THAO B |  | Respondent is assessed an administrative penal... | COS20180008846 | 76039 |  |
| 7 | HOUSTON |  | HARRIS |  | 1470271 | NGUYEN, BETH MARIA |  | Respondent's Cosmetology Operator license was ... | COS20180000897 | 77083 |  |
| 8 | AMARILLO |  | POTTER |  | 1196244, 767015, 767014 | NGUYEN, TRUNG N |  | Respondent is assessed an administrative penal... | COS20170023893 | 79106 |  |
| 9 | PITTSBURG |  | CAMP |  | 759931 | NGUYEN, NGAT THI |  | Respondent is assessed an administrative penal... | COS20180004076 | 75686 |  |


In [13]:
df.to_csv("spa_violations.csv", index=False)

### Open the CSV file and examine the first few. Make sure you didn't save an extra weird unnamed column.

In [14]:
pd.read_csv("spa_violations.csv")

|  | city | city2 | county | county2 | license | name | name2 | violation | violation_number | zip | zip2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SAN ANTONIO |  | BEXAR |  | 780948, 1706491, 1699123 | NGUYEN, TOAN HUU |  | Respondent is assessed an administrative penal... | COS20180004289 | 78217 |  |
| 1 | EL PASO |  | EL PASO |  | 737708 | NGUYEN, HANH CONG |  | Respondent is assessed an administrative penal... | COS20180006594 | 79934 |  |
| 2 | LONGVIEW |  | GREGG |  | 731665 | NGUYEN, KHIEM VAN |  | Respondent is assessed an administrative penal... | COS20180000257 | 75604 |  |
| 3 | HOUSTON |  | HARRIS |  | 1347649, 760528 | NGUYEN, DIEP THI NGOC |  | Respondent is assessed an administrative penal... | COS20180004915 | 77014 |  |
| 4 | SAN ANTONIO |  | BEXAR |  | 767339 | NGUYEN, LAN T-THUY | NGUYEN, SAMLOI | Respondent is assessed an administrative penal... | COS20180009255 | 78255 |  |
| 5 | AUSTIN | ARLINGTON | TRAVIS | TARRANT | 681274 | NGUYEN, TUAN A | NGUYEN, TUAN VAN | Respondent is assessed an administrative penal... | COS20140018343 | 78723 | 76011.0 |
| 6 | EULESS |  | TARRANT |  | 721373, 1142884 | NGUYEN, THAO B |  | Respondent is assessed an administrative penal... | COS20180008846 | 76039 |  |
| 7 | HOUSTON |  | HARRIS |  | 1470271 | NGUYEN, BETH MARIA |  | Respondent's Cosmetology Operator license was ... | COS20180000897 | 77083 |  |
| 8 | AMARILLO |  | POTTER |  | 1196244, 767015, 767014 | NGUYEN, TRUNG N |  | Respondent is assessed an administrative penal... | COS20170023893 | 79106 |  |
| 9 | PITTSBURG |  | CAMP |  | 759931 | NGUYEN, NGAT THI |  | Respondent is assessed an administrative penal... | COS20180004076 | 75686 |  |
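
Two things are worth checking in a round trip like this. Writing the CSV with the index included is the usual source of an `Unnamed: 0` column on re-read, and a numeric-looking column with missing values is read back as floats, which is why `zip2` shows up as `76011.0`. The sketch below uses an in-memory buffer instead of a file and a two-row stand-in for the real data. Passing `dtype=str` is one way to keep zip codes as text, assuming that is what you want; it is not the only option.

```python
import io
import pandas as pd

# Two-row stand-in for the scraped results (values from the table above)
df = pd.DataFrame([
    {"name": "NGUYEN, TOAN HUU", "zip": "78217"},
    {"name": "NGUYEN, TUAN A", "zip": "78723", "zip2": "76011"},
])

# With the index written out, the CSV gains a nameless first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(list(with_index.columns))

# index=False avoids that; dtype=str keeps zip codes from becoming floats
clean = pd.read_csv(io.StringIO(df.to_csv(index=False)), dtype=str)
print(list(clean.columns))
print(clean.loc[1, "zip2"])
```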
