# Texas Cosmetologist Violations

Texas has a system for [searching for license violations](https://www.tdlr.texas.gov/cimsfo/fosearch.asp). You're going to search for cosmetologists!

## Setup: Import what you'll need to scrape the page

We'll be using Selenium for this, *not* BeautifulSoup and requests.

In [1]:
from selenium import webdriver
driver = webdriver.Chrome()

## Starting your search

Starting from [here](https://www.tdlr.texas.gov/cimsfo/fosearch.asp), search for **cosmetologist violations** for people with the last name **Nguyen**.

In [2]:
driver.get("https://www.tdlr.texas.gov/cimsfo/fosearch.asp")

In [3]:
textbox = driver.find_element_by_xpath("/html/body/div[1]/div/div[2]/div/div/section/div/div/table/tbody/tr/td/form/table/tbody/tr[7]/td/p/input")

In [4]:
textbox.send_keys("Nguyen")

In [5]:
driver.find_element_by_xpath("/html/body/div[1]/div/div[2]/div/div/section/div/div/table/tbody/tr/td/form/table/tbody/tr[18]/td/input[1]").click()



## Scraping

Once you are on the results page, do this.

### Loop through each result and print the entire row

Okay wait, that's a heck of a lot. Use `[:10]` to only do the first ten (`listname[:10]` gives you the first ten).
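As a quick illustration of the slice itself, here is a minimal sketch that uses plain Python lists in place of Selenium rows (the `rows` and `short` lists are made up for the example):

```python
# Stand-in for a long list of scraped rows (hypothetical data)
rows = [f"row {n}" for n in range(25)]

# [:10] keeps items 0 through 9
first_ten = rows[:10]
print(len(first_ten))

# Slicing never raises an error when the list is shorter than the slice
short = ["a", "b", "c"]
print(short[:10])
```

Because slicing is forgiving about length, the same `[:10]` works whether the page returns three rows or three hundred.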

In [6]:
# Grab every table row (<tr>) on the page; the first one is the header row
table = driver.find_elements_by_tag_name('tr')
for each in table[:10]:
    print(each.text)

Name and Location Order Basis for Order
NGUYEN, MIMI PHAM
City: KATY
County: HARRIS
Zip Code: 77449


License #: 784210

Complaint # COS20190010072 Date: 11/12/2020

Respondent is assessed an administrative penalty in the amount of $1,125. Respondent failed properly clean and sanitize the metal implements used at the Salon; Respondent failed to disinfect tools, implements, and supplies with an EPA-registered disinfectant solution.
NGUYEN, HA
City: ARLINGTON
County: TARRANT
Zip Code: 76017


License #: 764888

Complaint # COS20190016762 Date: 11/12/2020

Respondent is assessed an administrative penalty in the amount of $2,250. Respondent failed to clean and sanitize four (4) whirlpool foot spas as required at the end of each day, constituting two (2) violations; Respondent failed to keep a record of the date and time of four (4) foot spas daily or bi-weekly cleaning and if the foot spas were not used, constituting two (2) violations.
NGUYEN, THAO HONG
City: SAN ANTONIO
County: BEXAR
Zip

### Loop through each result and print each person's name

You'll get an error because the first one doesn't have a name. How do you make that not happen?! If you want to ignore an error, you use code like this:

```python
try:
   # try to do something
except:
   # Instead of stopping on an error, it'll jump down here instead
   print("It didn't work")
```

It should help you out. If you don't want to print anything, you can type `pass` instead of the `print` statement. Most people use `pass`, but it's also nice to print out debug statements so you know when/where it's running into errors.
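Here is a minimal sketch of that pattern, with plain lists standing in for the cells Selenium would return (the `rows` data is made up). It catches `IndexError` specifically rather than using a bare `except`, which is one way to avoid hiding unrelated mistakes:

```python
# Each inner list plays the role of the cells found in one row (hypothetical data)
rows = [
    [],                               # header-like row: nothing matched
    ["NGUYEN, MIMI PHAM", "KATY"],
    ["NGUYEN, HA", "ARLINGTON"],
]

names = []
skipped = 0
for cells in rows:
    try:
        names.append(cells[0])        # raises IndexError when the list is empty
    except IndexError:
        skipped += 1                  # keep a count instead of silently passing

print(names)
print(f"skipped {skipped} row(s)")
```

Counting the skipped rows is a middle ground between `pass` and printing a message for every failure.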

**Why doesn't the first one have a name?**

In [7]:
for each in table[:10]:
    try:
        name = each.find_elements_by_class_name('results_text')
        print (name[0].text)
    except:
        print ("not working")
    

not working
NGUYEN, MIMI PHAM
NGUYEN, HA
NGUYEN, THAO HONG
NGUYEN, CINDY
NGUYEN, CHAU KHANH LINH
NGUYEN, TRANG T
NGUYEN, DUNG MINH
NGUYEN, YEN NHI THI
NGUYEN, JOHNNY DAT


### Loop through each result, printing each violation description ("Basis for order")

> - *Tip: You'll get an error even if you're ALMOST right - which row is causing the problem?*
> - *Tip: You can get the HTML of something by doing `.get_attribute('innerHTML')` - it might help you diagnose your issue.*
> - *Tip: Or I guess you could just skip the one with the problem...*

In [8]:
for each in table[:10]:
    try:
        description = each.find_elements_by_tag_name('td')
        print (description[2].text)
        print ("------")
    except:
        print ("not working")
    

not working
Respondent failed properly clean and sanitize the metal implements used at the Salon; Respondent failed to disinfect tools, implements, and supplies with an EPA-registered disinfectant solution.
------
Respondent failed to clean and sanitize four (4) whirlpool foot spas as required at the end of each day, constituting two (2) violations; Respondent failed to keep a record of the date and time of four (4) foot spas daily or bi-weekly cleaning and if the foot spas were not used, constituting two (2) violations.
------
Respondent failed to clean, disinfect, and sterilize manicure and pedicure implements after each use.
------
Respondent failed to clean and disinfect all wax pots; Respondent failed to properly clean multi-use items prior to each service.
------
Respondent engaged in fraud or deceit in obtaining a certificate, license, or permit.
------
Respondent failed properly clean and sanitize the metal implements used at the Salon; Respondent failed to wipe clean and disin

### Loop through each result, printing the complaint number

- TIP: Think about the order of the elements

In [9]:
for each in table[:10]:
    try:
        comp = each.find_elements_by_class_name('results_text')
        print (f'Complaint number: {comp[5].text}')
        print ("------")
    except:
        print ("not working")
    

not working
Complaint number: COS20190010072
------
Complaint number: COS20190016762
------
Complaint number: COS20200010387
------
Complaint number: COS20200010502
------
Complaint number: COS20190008104
------
Complaint number: COS20200010511
------
Complaint number: COS20200004202
------
Complaint number: COS20190004199
------
Complaint number: COS20200000101
------


## Saving the results

### Loop through each result to create a list of dictionaries

Each dictionary must contain:

- Person's name
- Violation description
- Complaint number
- License number(s)
- Zip code
- County
- City

Create a new dictionary for each result (except the header).

> *Tip: If you want to ask for the "next sibling," you can't use `find_next_sibling` in Selenium. Instead, use `element.find_element_by_xpath("following-sibling::div")` to find the next div, or `element.find_element_by_xpath("following-sibling::*")` to find the next element of any kind.*

In [10]:
# for each in table[1:10]:
#     #try:
#     info = each.find_elements_by_class_name('results_text')
#     compno = info[5].text
#     licenseno = info[4].text
#     zipcode = info[3].text
#     county = info[2].text
#     city = info[1].text
#     print (city, county, zipcode, licenseno, compno)

In [11]:
eachdict = []
for each in table:
    try:
        # 'results_text' cells, in order: name, city, county, zip, license, complaint
        info = each.find_elements_by_class_name('results_text')
        # The third <td> in each row holds the "Basis for Order" description
        description = each.find_elements_by_tag_name('td')
        eachdict.append({'name': info[0].text,
                         'description': description[2].text,
                         'complaint number': info[5].text,
                         'license number': info[4].text,
                         'city': info[1].text,
                         'county': info[2].text,
                         'zip code': info[3].text
                         })
    except Exception:
        # Rows that don't have these cells (such as the header) fail on indexing; skip them
        pass

eachdict

[{'name': 'NGUYEN, MIMI PHAM',
  'description': 'Respondent failed properly clean and sanitize the metal implements used at the Salon; Respondent failed to disinfect tools, implements, and supplies with an EPA-registered disinfectant solution.',
  'complaint number': 'COS20190010072',
  'license number': '784210',
  'city': 'KATY',
  'county': 'HARRIS',
  'zip code': '77449'},
 {'name': 'NGUYEN, HA',
  'description': 'Respondent failed to clean and sanitize four (4) whirlpool foot spas as required at the end of each day, constituting two (2) violations; Respondent failed to keep a record of the date and time of four (4) foot spas daily or bi-weekly cleaning and if the foot spas were not used, constituting two (2) violations.',
  'complaint number': 'COS20190016762',
  'license number': '764888',
  'city': 'ARLINGTON',
  'county': 'TARRANT',
  'zip code': '76017'},
 {'name': 'NGUYEN, THAO HONG',
  'description': 'Respondent failed to clean, disinfect, and sterilize manicure and pedicure

### Save that to a CSV

- Tip: Use `pd.DataFrame` to create a dataframe, and then save it to a CSV.

In [12]:
import pandas as pd
df = pd.DataFrame(eachdict)
df.to_csv("Texas Cosmetology Violations.csv",sep=',',index=False)
df



| | name | description | complaint number | license number | city | county | zip code |
|---|---|---|---|---|---|---|---|
| 0 | NGUYEN, MIMI PHAM | Respondent failed properly clean and sanitize ... | COS20190010072 | 784210 | KATY | HARRIS | 77449 |
| 1 | NGUYEN, HA | Respondent failed to clean and sanitize four (... | COS20190016762 | 764888 | ARLINGTON | TARRANT | 76017 |
| 2 | NGUYEN, THAO HONG | Respondent failed to clean, disinfect, and ste... | COS20200010387 | 799926, 1753491 | SAN ANTONIO | BEXAR | 78238 |
| 3 | NGUYEN, CINDY | Respondent failed to clean and disinfect all w... | COS20200010502 | 806232, 1260359, 1280071 | CORPUS CHRISTI | NUECES | 78414 |
| 4 | NGUYEN, CHAU KHANH LINH | Respondent engaged in fraud or deceit in obtai... | COS20190008104 | 1764073 | MONTGOMERY | OUT OF STATE | 36116 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 159 | NGUYEN, BINH THANH | Respondent leased space in a salon to an indiv... | COS20180012368 | 772744 | LAREDO | WEBB | 78045 |
| 160 | NGUYEN, DU HUU | Respondent operated a massage establishment wi... | MAS20180006447 | 749746, 1410150 | ARLINGTON | TARRANT | 76014 |
| 161 | NGUYEN, SAMANTHA TRAN | Respondent failed to clean, disinfect, and ste... | COS20180006595 | 743492, 1279657, 1562243 | MCKINNEY | COLLIN | 75070 |
| 162 | NGUYEN, THU LE | Respondent failed to comply with an order prev... | COS20180007798 | 11224436, 685119 | SAN ANTONIO | BEXAR | 78216 |


### Open the CSV file and examine the first few. Make sure you didn't save an extra weird unnamed column.

In [13]:
df2 = pd.read_csv("Texas Cosmetology Violations.csv")
df2.head()

| | name | description | complaint number | license number | city | county | zip code |
|---|---|---|---|---|---|---|---|
| 0 | NGUYEN, MIMI PHAM | Respondent failed properly clean and sanitize ... | COS20190010072 | 784210 | KATY | HARRIS | 77449 |
| 1 | NGUYEN, HA | Respondent failed to clean and sanitize four (... | COS20190016762 | 764888 | ARLINGTON | TARRANT | 76017 |
| 2 | NGUYEN, THAO HONG | Respondent failed to clean, disinfect, and ste... | COS20200010387 | 799926, 1753491 | SAN ANTONIO | BEXAR | 78238 |
| 3 | NGUYEN, CINDY | Respondent failed to clean and disinfect all w... | COS20200010502 | 806232, 1260359, 1280071 | CORPUS CHRISTI | NUECES | 78414 |
| 4 | NGUYEN, CHAU KHANH LINH | Respondent engaged in fraud or deceit in obtai... | COS20190008104 | 1764073 | MONTGOMERY | OUT OF STATE | 36116 |
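The "extra weird unnamed column" shows up when `to_csv` writes the dataframe's index (its default behavior) and `read_csv` then reads that index back as a column named `Unnamed: 0`. A minimal sketch of the difference, assuming pandas is installed and using an in-memory buffer and a made-up two-row frame rather than the scraped data:

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the scraped results
df_demo = pd.DataFrame([
    {"name": "NGUYEN, HA", "city": "ARLINGTON"},
    {"name": "NGUYEN, CINDY", "city": "CORPUS CHRISTI"},
])

# Default: the index is written as a first column with an empty header
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)      # ['Unnamed: 0', 'name', 'city']
print(cols_without_index)   # ['name', 'city']
```

Checking `df2.columns` after reading the file back is a quick way to confirm that no extra column was saved.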


## Let's do this an easier way

Use Selenium and `pd.read_html` to get the table as a dataframe.

In [14]:
df3 = pd.read_html(driver.page_source)
df3 = df3[0]
df3

# Note: this is not a tidy table. Each cell packs several fields together, so I'm leaving it as is for now. Splitting the fields into separate columns would probably need regex.

| | Name and Location | Order | Basis for Order |
|---|---|---|---|
| 0 | NGUYEN, MIMI PHAM City: KATY County: HARRIS Zi... | Date: 11/12/2020Respondent is assessed an admi... | Respondent failed properly clean and sanitize ... |
| 1 | NGUYEN, HA City: ARLINGTON County: TARRANT Zip... | Date: 11/12/2020Respondent is assessed an admi... | Respondent failed to clean and sanitize four (... |
| 2 | NGUYEN, THAO HONG City: SAN ANTONIO County: BE... | Date: 11/12/2020Respondent is assessed an admi... | Respondent failed to clean, disinfect, and ste... |
| 3 | NGUYEN, CINDY City: CORPUS CHRISTI County: NUE... | Date: 10/29/2020Respondent is assessed an admi... | Respondent failed to clean and disinfect all w... |
| 4 | NGUYEN, CHAU KHANH LINH City: MONTGOMERY Count... | Date: 10/26/2020The Respondent's Cosmetology M... | Respondent engaged in fraud or deceit in obtai... |
| ... | ... | ... | ... |
| 159 | NGUYEN, BINH THANH City: LAREDO County: WEBB Z... | Date: 9/12/2018Respondent is assessed an admin... | Respondent leased space in a salon to an indiv... |
| 160 | NGUYEN, DU HUU City: ARLINGTON County: TARRANT... | Date: 9/12/2018Respondent is assessed an admin... | Respondent operated a massage establishment wi... |
| 161 | NGUYEN, SAMANTHA TRAN City: MCKINNEY County: C... | Date: 9/4/2018Respondent is assessed an admini... | Respondent failed to clean, disinfect, and ste... |
| 162 | NGUYEN, THU LE City: SAN ANTONIO County: BEXAR... | Date: 8/7/2018The Respondent's Cosmetology Man... | Respondent failed to comply with an order prev... |
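On the regex question: one possible approach is a pattern with named groups that pulls the name, city, county, and zip code out of the combined "Name and Location" text. This is only a sketch under assumptions, not the notebook's method. The sample string is assembled from the row text printed earlier, so the real cell text may differ in spacing or order, and `LOCATION_PATTERN` and `split_location` are hypothetical names.

```python
import re

# Lazy groups stop at the next label, so multi-word values such as
# "SAN ANTONIO" or "OUT OF STATE" are captured whole.
LOCATION_PATTERN = re.compile(
    r"^(?P<name>.+?)\s+City:\s*(?P<city>.+?)"
    r"\s+County:\s*(?P<county>.+?)\s+Zip Code:\s*(?P<zip_code>\d{5})"
)


def split_location(cell):
    """Return a dict of the captured fields, or None if the text doesn't match."""
    match = LOCATION_PATTERN.search(cell)
    return match.groupdict() if match else None


# Assumed cell text, pieced together from the printed row output
sample = "NGUYEN, MIMI PHAM City: KATY County: HARRIS Zip Code: 77449 License #: 784210"
print(split_location(sample))
```

If the pattern fits the real cells, pandas' `Series.str.extract` accepts the same pattern string (`LOCATION_PATTERN.pattern`) and returns one column per named group.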
