# Texas Cosmetologist Violations

Texas has a system for [searching for license violations](https://www.tdlr.texas.gov/cimsfo/fosearch.asp). You're going to search for cosmetologists!

## Setup: Import what you'll need to scrape the page

We'll be using Selenium for this, *not* BeautifulSoup and requests.

In [2]:
import pandas as pd

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select

from webdriver_manager.chrome import ChromeDriverManager



In [3]:
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))



Could not get version for google-chrome with the any command: /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --version
Current google-chrome version is UNKNOWN
Get LATEST chromedriver version for UNKNOWN google-chrome
Driver [/Users/prinzmagtulis/.wdm/drivers/chromedriver/mac64/96.0.4664.45/chromedriver] found in cache


## Starting your search

Starting from [here](https://www.tdlr.texas.gov/cimsfo/fosearch.asp), search for cosmetologist violations for people with the last name **Nguyen**.

In [4]:
driver.get("https://www.tdlr.texas.gov/cimsfo/fosearch.asp")

In [5]:
driver.find_element(By.ID, "pht_status").click()

In [6]:
driver.find_element(By.XPATH, '//*[@id="pht_status"]/option[10]').click()

In [7]:
driver.find_element(By.XPATH, '//*[@id="pht_lnm"]').send_keys("Nguyen")


In [8]:
driver.find_element(By.XPATH, '//*[@id="dat-menu"]/div/div[2]/div/div/section/div/div/table/tbody/tr/td/form/table/tbody/tr[18]/td/input[1]').click()

## Scraping

Once you are on the results page, do this. **I step you through things bit by bit, so it's going to be a little different than we did in class.** Also, no `pd.read_html` allowed because this isn't actual tabular data!

> You can use either Selenium by itself or Selenium+BeautifulSoup to scrape the results page. The choice is up to you!

### Loop through each result and print the entire row

Okay wait, maybe not, it's a heck of a lot of rows. Use `[:10]` to only do the first ten! For example, if you saved the table rows into `results` you might do something like this:

```python
for result in results[:10]:
    print(result)
```

Although you'd want to print out the text from the row (I give example output below).

> *Tip: If you're using Selenium, `By.TAG_NAME` is used if you don't have a class or ID. If you're using BeautifulSoup, just do your normal thing.*

In [9]:
from bs4 import BeautifulSoup
doc = BeautifulSoup(driver.page_source, "html.parser")

In [26]:
doc.find_all('tr')

[<tr><th style="padding:4px; text-align:left; background:#c2c2c2;">Name and Location</th><th style="padding:4px; text-align:left; background:#c2c2c2;">Order</th><th style="padding:4px; text-align:left; background:#c2c2c2;">Basis for Order</th></tr>,
 <tr style="background:#ffffff;"> <td style="padding:4px; text-align:left; font-size:11px; font:Arial, Helvetica, sans-serif; width:22%;"><span class="results_text">NGUYEN, THANH </span><br/> <span class="default_text">City:</span> <span class="results_text">FRISCO</span><br/> <span class="default_text">County:</span> <span class="results_text">COLLIN</span><br/> <span class="default_text">Zip Code:</span> <span class="results_text">75034</span><br/><br/><br/><span class="default_text"> License #:</span> <span class="results_text">790672</span><br/><br/><span class="default_text">Complaint #</span> <span class="results_text">COS20210004784</span></td> <td style="padding:4px; text-align:left; font-size:11px; font:Arial, Helvetica, sans-serif

In [11]:
rows = driver.find_elements(By.TAG_NAME, "tr")
for row in rows[:10]:
    print(row.text)

Name and Location Order Basis for Order
NGUYEN, THANH
City: FRISCO
County: COLLIN
Zip Code: 75034


License #: 790672

Complaint # COS20210004784 Date: 11/16/2021

Respondent is assessed an administrative penalty in the amount of $1,875. Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day, the Department is charging 2 violations; Respondent operated a cosmetology salon without the appropriate license.
NGUYEN, DAI T
City: HOUSTON
County: Harris
Zip Code: 77034


License #: 765339

Complaint # COS20210005027 Date: 11/16/2021

Respondent is assessed an administrative penalty in the amount of $1,500. Respondent failed to follow whirlpool foot spas cleaning and sanitization procedures as required; Respondent failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used; Respondent failed to store eyelash extensions in a sealed bag or covered container and kept in a clean dry debris-free s

The result should look something like this:

```
Name and Location Order Basis for Order
NGUYEN, THANH
City: FRISCO
County: COLLIN
Zip Code: 75034


License #: 790672

Complaint # COS20210004784 Date: 11/16/2021

Respondent is assessed an administrative penalty in the amount of $1,875. Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day, the Department is charging 2 violations; Respondent operated a cosmetology salon without the appropriate license.
NGUYEN, LONG D
City: SAN SABA
County: SAN SABA
Zip Code: 76877
```

### Loop through each result and print each person's name

You'll get an error because the first one doesn't have a name. How do you make that not happen?! If you want to ignore an error, you use code like this:

```python
try:
   # try to do something
except:
   print("It didn't work")
```

It should help you out. If you don't want to print anything when there's an error, you can type `pass` instead of the `print` statement.
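
Where the `try` sits matters. Here is a minimal sketch, using plain dictionaries as hypothetical stand-ins for table rows (so it runs without a browser): with the `try` inside the loop, the one failing row is reported and the loop keeps going.

```python
# Hypothetical stand-ins for table rows: the header has no "name" key,
# the same way the header row on the results page has no name element.
rows = [
    {"header": "Name and Location"},
    {"name": "NGUYEN, THANH"},
    {"name": "NGUYEN, LONG D"},
]

names = []
for row in rows:
    try:
        # Raises KeyError for the header row only.
        names.append(row["name"])
    except KeyError:
        # Only this row is replaced by a placeholder; the loop continues.
        names.append("Doesn't have a name")

for name in names:
    print(name)
```

If the `try` wrapped the whole `for` loop instead, the first error would end the loop and nothing after the header would print. Catching a specific exception (here `KeyError`; with Selenium it would typically be `NoSuchElementException`) also avoids hiding unrelated bugs.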

**Why doesn't the first one have a name?**

Output should look like this:

```
Doesn't have a name
NGUYEN, THANH
NGUYEN, LONG D
NGUYEN, LUCIE HUONG
NGUYEN, CHINH
NGUYEN, JIMMY
```

* *Tip: The name has a class you can use. The class name is reused in a lot of places, but because it's the first one you don't have to worry about that!*
* *Tip: Instead of searching across the entire page – `driver.find_element` or `doc.select_one` – you should be doing your searching just inside of each **row** (I used this technique in the beginning of class with BeautifulSoup when we were scraping the books page)* 
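
The "search inside each row" idea can be sketched without a browser. The example below is a stand-in, not the Selenium or BeautifulSoup API: it uses the standard library's `xml.etree.ElementTree` on a trimmed, hand-written imitation of the results table (an assumed structure), and the helper name `first_result_text` is made up for illustration. The same pattern applies to `row.find_element(...)` or `row.select_one(...)`.

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed imitation of the results table (assumed structure).
html = """
<table>
  <tr><th>Name and Location</th><th>Order</th><th>Basis for Order</th></tr>
  <tr>
    <td><span class="results_text">NGUYEN, THANH</span>
        <span class="default_text">City:</span>
        <span class="results_text">FRISCO</span></td>
    <td>Date: 11/16/2021</td>
    <td>Respondent operated a cosmetology salon without the appropriate license.</td>
  </tr>
</table>
""".strip()

table = ET.fromstring(html)


def first_result_text(row):
    """Return the text of the first results_text span inside this row, or None."""
    span = row.find(".//span[@class='results_text']")
    return None if span is None else span.text


# Search row by row instead of across the whole table.
names = [first_result_text(row) for row in table.findall("tr")]
print(names)
```

The header row is built from `<th>` cells and contains no `results_text` span, so the search inside it comes back empty. That is why the first row has no name.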

In [172]:
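rows = driver.find_elements(By.TAG_NAME, "tr")
# Skip the header row (index 0); it has no name element.
for row in rows[1:10]:
    try:
        # The first "results_text" element in each row is the person's name.
        name = row.find_element(By.CLASS_NAME, "results_text")
        print(name.text)
    except Exception:
        # With the try inside the loop, one bad row doesn't stop the rest.
        print("Doesn't have a name")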

NGUYEN, THANH
NGUYEN, DAI T
NGUYEN, LONG D
NGUYEN, LUCIE HUONG
NGUYEN, CHINH
NGUYEN, JIMMY
NGUYEN, NAM
NGUYEN, DUC
NGUYEN, THU THAO THI


### Loop through each result, printing each violation description ("Basis for order")

Your results should look something like:

```
Doesn't have a violation
Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day, the Department is charging 2 violations; Respondent operated a cosmetology salon without the appropriate license.
Respondent failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used, the Department is charging 2 violations; Respondent failed to clean, disinfect, and sterilize manicure and pedicure implements after each use; Respondent failed to clean and disinfect manicure tables prior to use for each client.
...
```

> - *Tip: You'll get an error even if you're ALMOST right - which row is causing the problem?*
> - *Tip: If you're using Selenium by itself, you can get the HTML of something by doing `.get_attribute('innerHTML')` – that way it'll look like BeautifulSoup when you print it. It might help you diagnose your issue!*
> - *Tip: Or I guess you could just skip the one with the problem...*
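
One way to see which row causes the problem is to count the cells before indexing into them. The sketch below uses lists of strings as hypothetical stand-ins for the `td` cells found in each row, and the helper name `violation_text` is made up for illustration. The header row is built from `th` cells, so a search for `td` inside it finds nothing.

```python
def violation_text(cells):
    """Return the third cell's text, or a placeholder if the row has fewer than three cells."""
    if len(cells) < 3:
        return "Doesn't have a violation"
    return cells[2]


rows = [
    # Header row: no <td> cells at all, so cells[2] would raise IndexError.
    [],
    # A regular result row: name cell, order cell, basis-for-order cell.
    [
        "NGUYEN, THANH",
        "Date: 11/16/2021",
        "Respondent operated a cosmetology salon without the appropriate license.",
    ],
]

results = [violation_text(cells) for cells in rows]
for text in results:
    print(text)
```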

In [174]:
rows = driver.find_elements(By.TAG_NAME, "tr")
for respondents in rows[1:10]:
    cells=respondents.find_elements(By.TAG_NAME, "td")
    print(cells[2].text)
    print("--------")

Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day, the Department is charging 2 violations; Respondent operated a cosmetology salon without the appropriate license.
--------
Respondent failed to follow whirlpool foot spas cleaning and sanitization procedures as required; Respondent failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used; Respondent failed to store eyelash extensions in a sealed bag or covered container and kept in a clean dry debris-free storage area.
--------
Respondent failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used, the Department is charging 2 violations; Respondent failed to clean, disinfect, and sterilize manicure and pedicure implements after each use; Respondent failed to clean and disinfect manicure tables prior to use for each client.
--------
Respondent failed to keep a reco

### Loop through each result, printing the complaint number

Output should look like this:

```
Doesn't have a complaint number
COS20210004784
COS20210009745
COS20210011484
...
```

- *Tip: Think about the order of the elements. Can you count from the opposite direction than you normally do?*
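
Counting from the opposite direction means negative indexes: `[-1]` is the last item, `[-2]` the one before it. A small sketch with plain lists standing in for the text of the spans in one result's first cell; the second list, with an extra value in the middle, is a hypothetical case added only to show why counting from the front is fragile.

```python
# Text of the spans in one result's first cell, in page order.
one_license = ["NGUYEN, THANH", "FRISCO", "COLLIN", "75034", "790672", "COS20210004784"]

# Hypothetical: an extra value in the middle shifts every position after it.
extra_value = ["NGUYEN, LONG D", "SAN SABA", "SAN SABA", "76877", "760420", "1620583", "COS20210009745"]

complaint_numbers = []
for spans in (one_license, extra_value):
    # The complaint number is always last here, so count from the end.
    complaint_numbers.append(spans[-1])

print(complaint_numbers)
print(extra_value[5])  # counting from the front picks the wrong value here
```

This only works if the list you index really ends with the complaint number. If other cells in the row reuse the same class, search inside the first cell rather than the whole row.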

In [134]:
rows = driver.find_elements(By.TAG_NAME, "tr")
for complaint in rows[1:10]:
    cells=complaint.find_elements(By.CLASS_NAME, "results_text")
    print(cells[5].text)

COS20210004784
COS20210005027
COS20210009745
COS20210011484
COS20210011721
COS20200007069
COS20210010530
COS20200007141
COS20200000839


## Saving the results

### Loop through each result to create a list of dictionaries

Each dictionary must contain

- Person's name
- Violation description
- Violation number
- License Numbers
- Zip Code
- County
- City

Create a new dictionary for each result (except the header).

Based on what you print out, the output might look something like:

```
This row is broken: Name and Location Order Basis for Order
{'name': 'NGUYEN, THANH', 'city': 'FRISCO', 'county': 'COLLIN', 'zip_code': '75034', 'complaint_no': 'COS20210004784', 'license_numbers': '790672', 'complaint': 'Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day, the Department is charging 2 violations; Respondent operated a cosmetology salon without the appropriate license.'}
{'name': 'NGUYEN, LONG D', 'city': 'SAN SABA', 'county': 'SAN SABA', 'zip_code': '76877', 'complaint_no': 'COS20210009745', 'license_numbers': '760420, 1620583', 'complaint': 'Respondent failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used, the Department is charging 2 violations; Respondent failed to clean, disinfect, and sterilize manicure and pedicure implements after each use; Respondent failed to clean and disinfect manicure tables prior to use for each client.'}
```

> *Tip: If you want to ask for the "next sibling," you can't use `find_next_sibling` in Selenium. Instead, use `element.find_element(By.XPATH, "following-sibling::div")` to find the next div, or `element.find_element(By.XPATH, "following-sibling::*")` to find the next anything.*
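
An alternative to positions and sibling lookups is to read the labels out of the cell's text. This is a different technique from the tip above, sketched under the assumption that the first cell's `.text` looks like the lines printed earlier (`City: FRISCO`, and so on). The `LABELS` mapping and the `parse_cell` helper are made up for illustration.

```python
# Assumed shape of the first cell's text, copied from the printed row above.
cell_text = """NGUYEN, THANH
City: FRISCO
County: COLLIN
Zip Code: 75034


License #: 790672"""

# Label as it appears on the page -> key used in our dictionary.
LABELS = {
    "City:": "city",
    "County:": "county",
    "Zip Code:": "zip_code",
    "License #:": "license_numbers",
}


def parse_cell(text):
    """Build a dictionary from 'Label: value' lines; the first non-empty line is the name."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    record = {"name": lines[0]}
    for line in lines[1:]:
        for label, key in LABELS.items():
            if line.startswith(label):
                # Keep everything after the label as the value.
                record[key] = line[len(label):].strip()
    return record


record = parse_cell(cell_text)
print(record)
```

Matching on labels does not depend on how many values come before the one you want, which positional indexes like `[4]` and `[5]` do.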

In [164]:
dataset = []
for row in rows[1:10]:
    # Look up the row's "results_text" elements once instead of once per field.
    spans = row.find_elements(By.CLASS_NAME, "results_text")
    data = {}
    data['name'] = spans[0].text
    data['city'] = spans[1].text
    data['county'] = spans[2].text
    data['zip_code'] = spans[3].text
    data['complaint_no'] = spans[5].text
    data['license_numbers'] = spans[4].text
    # The third <td> in the row holds the "Basis for Order" text.
    data['complaint'] = row.find_elements(By.TAG_NAME, "td")[2].text
    dataset.append(data)
dataset

[{'name': 'NGUYEN, THANH',
  'city': 'FRISCO',
  'county': 'COLLIN',
  'zip_code': '75034',
  'complaint_no': 'COS20210004784',
  'license_numbers': '790672',
  'complaint': 'Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day, the Department is charging 2 violations; Respondent operated a cosmetology salon without the appropriate license.'},
 {'name': 'NGUYEN, DAI T',
  'city': 'HOUSTON',
  'county': 'Harris',
  'zip_code': '77034',
  'complaint_no': 'COS20210005027',
  'license_numbers': '765339',
  'complaint': 'Respondent failed to follow whirlpool foot spas cleaning and sanitization procedures as required; Respondent failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used; Respondent failed to store eyelash extensions in a sealed bag or covered container and kept in a clean dry debris-free storage area.'},
 {'name': 'NGUYEN, LONG D',
  'city': 'SAN SABA',
  'county': 'SAN 

### Save that to a CSV named `output.csv`

The dataframe should look something like...

|index|name|city|county|zip_code|complaint_no|license_numbers|complaint|
|---|---|---|---|---|---|---|---|
|0|NGUYEN, THANH|FRISCO|COLLIN|75034|COS20210004784|790672|Respondent failed to clean and sanitize whirlp...|
|1|NGUYEN, LONG D|SAN SABA|SAN SABA|76877|COS20210009745|760420, 1620583|Respondent failed to keep a record of the date...|


- *Tip: If you send a list of dictionaries to `pd.DataFrame(...)`, it will create a dataframe out of that list!*

In [161]:
df= pd.DataFrame(dataset)
df

|index|name|city|county|zip_code|complaint_no|license_numbers|complaint|
|---|---|---|---|---|---|---|---|
|0|NGUYEN, THANH|FRISCO|COLLIN|75034|COS20210004784|790672|Respondent failed to clean and sanitize whirlp...|
|1|NGUYEN, DAI T|HOUSTON|Harris|77034|COS20210005027|765339|Respondent failed to follow whirlpool foot spa...|
|2|NGUYEN, LONG D|SAN SABA|SAN SABA|76877|COS20210009745|760420, 1620583|Respondent failed to keep a record of the date...|
|3|NGUYEN, LUCIE HUONG|UVALDE|UVALDE|78801|COS20210011484|762626, 1811788|Respondent failed to keep a record of the date...|
|4|NGUYEN, CHINH|TEMPLE|BELL|76502|COS20210011721|777067|Respondent failed to follow whirlpool foot spa...|
|5|NGUYEN, JIMMY|ROWLETT|DALLAS|75088|COS20200007069|796773|Respondent failed to clean and sanitize whirlp...|
|6|NGUYEN, NAM|HOUSTON|HARRIS|77025|COS20210010530|688039|Respondents failed to follow proper sequential...|
|7|NGUYEN, DUC|ABILENE|TAYLOR|79605|COS20200007141|758793|Respondent failed to clean and sanitize whirlp...|
|8|NGUYEN, THU THAO THI|SAN ANTONIO|BEXAR|78244|COS20200000839|802892, 1286737|Respondent performed or attempted to perform a...|


In [177]:
df.to_csv("output.csv", index=False)

### Open the CSV file and examine the first few. Make sure you didn't save an extra weird unnamed column.

In [178]:
pd.read_csv("output.csv")

|index|name|city|county|zip_code|complaint_no|license_numbers|complaint|
|---|---|---|---|---|---|---|---|
|0|NGUYEN, THANH|FRISCO|COLLIN|75034|COS20210004784|790672|Respondent failed to clean and sanitize whirlp...|
|1|NGUYEN, DAI T|HOUSTON|Harris|77034|COS20210005027|765339|Respondent failed to follow whirlpool foot spa...|
|2|NGUYEN, LONG D|SAN SABA|SAN SABA|76877|COS20210009745|760420, 1620583|Respondent failed to keep a record of the date...|
|3|NGUYEN, LUCIE HUONG|UVALDE|UVALDE|78801|COS20210011484|762626, 1811788|Respondent failed to keep a record of the date...|
|4|NGUYEN, CHINH|TEMPLE|BELL|76502|COS20210011721|777067|Respondent failed to follow whirlpool foot spa...|
|5|NGUYEN, JIMMY|ROWLETT|DALLAS|75088|COS20200007069|796773|Respondent failed to clean and sanitize whirlp...|
|6|NGUYEN, NAM|HOUSTON|HARRIS|77025|COS20210010530|688039|Respondents failed to follow proper sequential...|
|7|NGUYEN, DUC|ABILENE|TAYLOR|79605|COS20200007141|758793|Respondent failed to clean and sanitize whirlp...|
|8|NGUYEN, THU THAO THI|SAN ANTONIO|BEXAR|78244|COS20200000839|802892, 1286737|Respondent performed or attempted to perform a...|
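
The "weird unnamed column" usually comes from saving the dataframe's index: `to_csv` writes it under a blank header, and `pd.read_csv` then labels it `Unnamed: 0`. As a rough check that doesn't need pandas, the sketch below reads only the header line with the standard library's `csv` module. The helper name `unnamed_columns` and the two sample strings are made up for illustration.

```python
import csv
import io


def unnamed_columns(csv_text):
    """Return positions of header cells that are blank or start with 'Unnamed:'."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [
        position
        for position, name in enumerate(header)
        if not name.strip() or name.startswith("Unnamed:")
    ]


# Header as written when the index is saved too: the first header cell is blank.
with_index = ',name,city\n0,"NGUYEN, THANH",FRISCO\n'
# Header as written with index=False: every column has a name.
without_index = 'name,city\n"NGUYEN, THANH",FRISCO\n'

print(unnamed_columns(with_index))
print(unnamed_columns(without_index))
```

To run the same check on the real file, pass in the text of `output.csv`, for example `unnamed_columns(open("output.csv").read())`.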
