# Texas Cosmetologist Violations

Texas has a system for [searching for license violations](https://www.tdlr.texas.gov/cimsfo/fosearch.asp). You're going to search for cosmetologists!

> You can use the classwork notebook and also [my Everything scraping reference](https://jonathansoma.com/everything/scraping)

## Setup: Import what you'll need to scrape the page

We'll be using Playwright for this, *not* requests.

In [1]:
from playwright.async_api import async_playwright

playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

## Starting your search

Starting from [here](https://www.tdlr.texas.gov/cimsfo/fosearch.asp), search for cosmetologist violations for people with the last name **Nguyen**.

In [2]:
page = await browser.new_page()
await page.goto("https://www.tdlr.texas.gov/cimsfo/fosearch.asp")

<Response url='https://www.tdlr.texas.gov/cimsfo/fosearch.asp' request=<Request url='https://www.tdlr.texas.gov/cimsfo/fosearch.asp' method='GET'>>

In [3]:
await page.locator("#pht_status").click()

In [4]:
await page.locator("#pht_status").select_option('Cosmetologists')

['COS']

In [5]:
await page.locator("#pht_lnm").click()

In [8]:
await page.locator("#pht_lnm").press_sequentially("Nguyen")

In [11]:
await page.get_by_role("banner").get_by_role("button", name="Search").click()

## Scraping

Once you are on the results page, do this. **I step you through things bit by bit, so it's going to be a little different than we did in class.** Also, no `pd.read_html` allowed because this isn't reaaallly tabular data (or there's just too much to clean up).

Once you've loaded the page: **what selector are you going to use to find the data?** In class we used things like `.article` or `.head-list`.

> You honestly can do this all with Playwright, no BeautifulSoup involved! But we didn't cover that in class so you should stick to this method:
> 
> ```python
> html = await page.content()
> doc = BeautifulSoup(html)
> ```
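
As a rough illustration of how class and tag selectors behave once the HTML is in BeautifulSoup, here is a minimal sketch. The markup, the `.name` class and the variable names are made up for the example and are not the real structure of the TDLR results page; you still need to inspect the page to find the selector that fits.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: NOT the real TDLR page structure
html = """
<table>
  <tr><th>Name and Location</th><th>Order</th></tr>
  <tr><td><span class="name">DOE, JANE</span></td><td>Date: 1/1/2021</td></tr>
  <tr><td><span class="name">ROE, RICHARD</span></td><td>Date: 2/2/2021</td></tr>
</table>
"""
doc = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of every match
rows = doc.select("tr")      # tag selector: all table rows, header included
named = doc.select(".name")  # class selector: only elements with class="name"

print(len(rows), len(named))
for element in named:
    print(element.text)
```

Note that the tag selector also picks up the header row, which is why the later steps have to deal with one result that has no name in it.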

In [12]:
from bs4 import BeautifulSoup

# Parse the rendered page HTML; naming the parser avoids bs4's "no parser specified" warning
html = await page.content()
doc = BeautifulSoup(html, "html.parser")

### Grab the rows, loop through each result and print the entire row

It's probably too many rows to do all at once: once you have a list, use `[:10]` to only show the first ten! For example, if you saved the table rows into `results` you might do something like this:

```python
for result in results[:10]:
    print("------")
    print(result)
```

Although you'd want to print out the text from the row (I give example output below).

In [26]:
results = doc.find_all('td')
for result in results[:10]:
    print("------")
    print(result.text)

------
NGUYEN, THU THI  City: HOUSTON County: HARRIS Zip Code: 77031 License #: 816854Complaint # COS20230009980
------
Date: 10/23/2023Respondents Thu Thi Nguyen and Tien Minh Vo are assessed an administrative penalty in the amount of $1,350.
------
Respondents failed to clean, disinfect, and sterilize manicure and pedicure implements after each use; Respondents failed to discard single use implements after each use; Respondents failed to keep floors, walls, ceilings, shelves, furniture, furnishings, and fixtures clean and in good repair; Respondents failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used; Respondents failed to clean, disinfect, and sterilize metal instruments with a Department-approved sterilizer in accordance with the sterilizer or sanitizers manufacturer's instructions.
------
NGUYEN, TUONG  City: FORT WORTH County: TARRANT Zip Code: 76120 License #: 742271Complaint # COS20230001960
------
Date: 10/

The result should look something like this:

```
Name and Location Order Basis for Order
NGUYEN, THANH
City: FRISCO
County: COLLIN
Zip Code: 75034


License #: 790672

Complaint # COS20210004784 Date: 11/16/2021

Respondent is assessed an administrative penalty in the amount of $1,875. Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day, the Department is charging 2 violations; Respondent operated a cosmetology salon without the appropriate license.
NGUYEN, LONG D
City: SAN SABA
County: SAN SABA
Zip Code: 76877
```

### Loop through each result and print each person's name

You'll probably get an error because the first one doesn't have a name. How do you make that not happen?! If you want to ignore an error, you use code like this:

```python
try:
   # try to do something
except:
   print("It didn't work")
```

It should help you out. If you don't want to print anything when there's an error, you can type `pass` instead of the `print` statement.

**Why doesn't the first one have a name?**

Output should look like this:

```
Doesn't have a name
NGUYEN, THANH
NGUYEN, LONG D
NGUYEN, LUCIE HUONG
NGUYEN, CHINH
NGUYEN, JIMMY
```

* *Tip: The name has a class you can use. The class name is reused in a lot of places, but because it's the first one you don't have to worry about that!*
* *Tip: Instead of searching across the entire page – like `doc.select_one` – you should be doing your searching just inside of each **row** (I used this technique in the beginning of class with BeautifulSoup when we were scraping the books page)* 
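
Putting the two tips together with `try`/`except`, a sketch of the loop might look like the following. The markup is a simplified stand-in (only the `results_text` class name comes from this page), so treat the row structure as an assumption. It catches `AttributeError` specifically rather than using a bare `except:`, because that is the error raised when `select_one` finds nothing and returns `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page's tags and nesting may differ
html = """
<table>
  <tr><th>Name and Location</th><th>Order</th></tr>
  <tr><td><span class="results_text">NGUYEN, THANH</span>
          City: <span class="results_text">FRISCO</span></td></tr>
  <tr><td><span class="results_text">NGUYEN, LONG D</span>
          City: <span class="results_text">SAN SABA</span></td></tr>
</table>
"""
doc = BeautifulSoup(html, "html.parser")

names = []
for row in doc.select("tr"):
    try:
        # Search inside this row only; select_one returns the FIRST match or None
        names.append(row.select_one(".results_text").text)
    except AttributeError:
        # None has no .text, which is what happens on the header row
        names.append("Doesn't have a name")

for name in names:
    print(name)
```

Searching row by row keeps each name tied to its own row, which matters later when you collect several fields per person.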

In [36]:
names = doc.find_all(class_='results_text')
for name in names[0:30:7]:
    print("------")
    print(name.text)

------
NGUYEN, THU THI 
------
NGUYEN, TUONG 
------
NGUYEN, HUONG T
------
NGUYEN, STEVEN QUOC 
------
NGUYEN, HUONG THI NGOC 


### Loop through each result, printing each violation description ("Basis for order")

Your results should look something like:

```
Doesn't have a violation
Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day, the Department is charging 2 violations; Respondent operated a cosmetology salon without the appropriate license.
Respondent failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used, the Department is charging 2 violations; Respondent failed to clean, disinfect, and sterilize manicure and pedicure implements after each use; Respondent failed to clean and disinfect manicure tables prior to use for each client.
...
```

> - *Tip: You'll get an error even if you're ALMOST right - which row is causing the problem?*
> - *Tip: Or I guess you could just skip the one with the problem using try/except...*

In [50]:
violations = doc.find_all(style="padding:4px; text-align:left; font-size:11px; font:Arial, Helvetica, sans-serif; width:39%;")
for violation in violations:
    print("------")
    print(violation.text)

------
Date: 10/23/2023Respondents Thu Thi Nguyen and Tien Minh Vo are assessed an administrative penalty in the amount of $1,350.
------
Respondents failed to clean, disinfect, and sterilize manicure and pedicure implements after each use; Respondents failed to discard single use implements after each use; Respondents failed to keep floors, walls, ceilings, shelves, furniture, furnishings, and fixtures clean and in good repair; Respondents failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used; Respondents failed to clean, disinfect, and sterilize metal instruments with a Department-approved sterilizer in accordance with the sterilizer or sanitizers manufacturer's instructions.
------
Date: 10/4/2023Respondents Trang Phan, Tuong Nguyen, and Victory Lotus LLC are assessed an administrative penalty in the amount of $2,575.
------
Respondents employed an individual as an operator who had not obtained a cosmetology licens

### Loop through each result, printing the complaint number

Output should look similar to this:

```
Doesn't have a complaint number
COS20210004784
COS20210009745
COS20210011484
...
```

- *Tip: Think about the order of the elements. Can you count from the opposite direction than you normally do?*
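
"Counting from the opposite direction" refers to Python's negative indexes: `[-1]` is the last item, `[-2]` the one before it. A small sketch with plain lists, which are hypothetical stand-ins for the text of the elements found inside one row:

```python
# Hypothetical stand-ins for the element text found inside two different rows
first_row = ["NGUYEN, THANH", "FRISCO", "COLLIN", "75034", "790672", "COS20210004784"]
# This row has one extra item (a second license number)
second_row = ["NGUYEN, LONG D", "SAN SABA", "SAN SABA", "76877", "760420", "1620583", "COS20210009745"]

for cells in [first_row, second_row]:
    # A fixed positive index like cells[5] would be wrong for the longer row,
    # but the last item is always at [-1]
    print(cells[-1])
```

The same indexing works on the list that `row.select(...)` or `row.find_all(...)` returns, assuming the complaint number really is the last matching element in each row.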

In [67]:
complaint_nums = doc.find_all(class_='results_text')
for complaint_num in complaint_nums[5:40:7]:
    print("------")
    print(complaint_num.text)

------
COS20230009980
------
COS20230001960
------
COS20220017901
------
COS20220018135
------
COS20220016107


## Saving the results

### Loop through each result to create a list of dictionaries

Each dictionary must contain

- Person's name
- Violation description
- Violation number
- License Numbers
- Zip Code
- County
- City

Create a new dictionary for each result (except the header).

Based on what you print out, the output might look something like:

```
This row is broken: Name and Location Order Basis for Order
{'name': 'NGUYEN, THANH', 'city': 'FRISCO', 'county': 'COLLIN', 'zip_code': '75034', 'complaint_no': 'COS20210004784', 'license_numbers': '790672', 'complaint': 'Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day, the Department is charging 2 violations; Respondent operated a cosmetology salon without the appropriate license.'}
{'name': 'NGUYEN, LONG D', 'city': 'SAN SABA', 'county': 'SAN SABA', 'zip_code': '76877', 'complaint_no': 'COS20210009745', 'license_numbers': '760420, 1620583', 'complaint': 'Respondent failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used, the Department is charging 2 violations; Respondent failed to clean, disinfect, and sterilize manicure and pedicure implements after each use; Respondent failed to clean and disinfect manicure tables prior to use for each client.'}
```
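
If you end up with the whole name-and-location cell as one string, a regular expression is one possible way to split it into those keys. The sketch below is only an assumption about the text format, based on two of the cell strings printed in this notebook. The pattern and the `parse_name_cell` helper are hypothetical, and rows that list more than one respondent will not match and would need separate handling.

```python
import re

# Hypothetical pattern inferred from printed cell text; not checked against every row
CELL_PATTERN = re.compile(
    r"^(?P<name>.+?)\s+City:\s*(?P<city>.*?)\s*County:\s*(?P<county>.*?)\s*"
    r"Zip Code:\s*(?P<zip_code>\d*)\s*"
    r"License #(?:\(s\))?:\s*(?P<license_numbers>[\d, ]+?)\s*"
    r"Complaint #\s*(?P<complaint_no>\w+)"
)

def parse_name_cell(text):
    """Return a dict of fields, or None if the text does not fit the pattern."""
    match = CELL_PATTERN.search(text)
    return match.groupdict() if match else None

single = parse_name_cell(
    "NGUYEN, THU THI  City: HOUSTON County: HARRIS Zip Code: 77031 "
    "License #: 816854Complaint # COS20230009980"
)
multi = parse_name_cell(
    "NGUYEN, STEVEN QUOC  City: SAN ANGELO County: TOM GREEN Zip Code: 76904 "
    "License #(s): 748707, 1458935Complaint # COS20220018135"
)
header = parse_name_cell("Name and Location")  # does not fit the pattern

print(single)
print(multi)
print(header)
```

Selecting each field by its own element is usually sturdier than a regex; this is a fallback for when the text is all you have.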

> _**Tip:** Depending on how you do this, you might want to figure out how to search for "next siblings" or "following siblings"_
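
For the sibling approach, BeautifulSoup has `find_next_sibling`, which starts at one element and returns the next element after it at the same level. A minimal sketch, assuming a made-up row where the three cells sit side by side (the class names here are invented for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical row: three sibling cells, loosely modeled on the results table
html = """
<table><tr>
  <td class="who">NGUYEN, THANH</td>
  <td class="order">Date: 11/16/2021</td>
  <td class="basis">Respondent operated a cosmetology salon without the appropriate license.</td>
</tr></table>
"""
doc = BeautifulSoup(html, "html.parser")

who = doc.select_one(".who")
# Passing a tag name skips the whitespace between cells and lands on the next <td>
order = who.find_next_sibling("td")
basis = order.find_next_sibling("td")

print(order.text)
print(basis.text)
```

This is handy when you can find the name cell reliably but the cells next to it have no distinctive class of their own.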

In [68]:
entries = doc.find_all(style="padding:4px; text-align:left; font-size:11px; font:Arial, Helvetica, sans-serif; width:22%;")
for entry in entries:
    print("------")
    print(entry.text)

------
NGUYEN, THU THI  City: HOUSTON County: HARRIS Zip Code: 77031 License #: 816854Complaint # COS20230009980
------
NGUYEN, TUONG  City: FORT WORTH County: TARRANT Zip Code: 76120 License #: 742271Complaint # COS20230001960
------
NGUYEN, HUONG T City: SAN ANTONIO County: BEXAR Zip Code: 78259 License #: 775126Complaint # COS20220017901
------
NGUYEN, STEVEN QUOC  City: SAN ANGELO County: TOM GREEN Zip Code: 76904 License #(s): 748707, 1458935Complaint # COS20220018135
------
NGUYEN, HUONG THI NGOC  City: MANSFIELD County: TARRANT Zip Code: 76063 License #: 829081Complaint # COS20220016107
------
NGUYEN, TAMMY  City: TOMBALL County: HARRIS Zip Code: 77375NGUYEN, THANG  City: TOMBALL County: HARRIS Zip Code: 77375 License #: 789659Complaint # COS20220013443
------
NGUYEN, TOAN DUC  City: SUGAR LAND County: FORT BEND Zip Code: 77478 License #: 789406Complaint # COS20220015714
------
NGUYEN, NHAN TRONG  City: SAN ANTONIO County: BEXAR Zip Code: 78247 License #(s): 694927, 1096417Compl

In [13]:
# Grab all the HTML on the page
html = await page.content()

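# And push it into pd.read_html (wrapped in StringIO, since newer pandas
# versions deprecate passing a literal HTML string)
from io import StringIO
import pandas as pd

tables = pd.read_html(StringIO(html), header=0)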
df = tables[0]
df.head()



Unnamed: 0,Name and Location,Order,Basis for Order
0,"NGUYEN, THU THI City: HOUSTON County: HARRIS ...",Date: 10/23/2023 Respondents Thu Thi Nguyen an...,"Respondents failed to clean, disinfect, and st..."
1,"NGUYEN, TUONG City: FORT WORTH County: TARRAN...","Date: 10/4/2023 Respondents Trang Phan, Tuong ...",Respondents employed an individual as an opera...
2,"NGUYEN, HUONG T City: SAN ANTONIO County: BE...",Date: 8/23/2023 Respondent is assessed an admi...,Respondent failed to clean and disinfect facia...
3,"NGUYEN, STEVEN QUOC City: SAN ANGELO County: ...",Date: 7/28/2023 Respondent is assessed an admi...,Respondent failed to cooperate with the inspec...
4,"NGUYEN, HUONG THI NGOC City: MANSFIELD County...",Date: 7/24/2023 Respondent is assessed an admi...,"Respondent failed to clean, disinfect, and ste..."


### Save that to a CSV named `output.csv`

The dataframe should look something like...

|index|name|city|county|zip_code|complaint_no|license_numbers|complaint|
|---|---|---|---|---|---|---|---|
|0|NGUYEN, THANH|FRISCO|COLLIN|75034|COS20210004784|790672|Respondent failed to clean and sanitize whirlp...|
|1|NGUYEN, LONG D|SAN SABA|SAN SABA|76877|COS20210009745|760420, 1620583|Respondent failed to keep a record of the date...|


- *Tip: If you send a list of dictionaries to `pd.DataFrame(...)`, it will create a dataframe out of that list!*
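
A short sketch of that tip, using two hand-typed dictionaries with a few of the expected keys (the values are copied from the example output above; in your version the list would come from your scraping loop). It also shows what `index=False` changes in the CSV header:

```python
import pandas as pd

# Hand-typed sample rows; in practice this list is built inside your loop
rows = [
    {"name": "NGUYEN, THANH", "city": "FRISCO", "zip_code": "75034"},
    {"name": "NGUYEN, LONG D", "city": "SAN SABA", "zip_code": "76877"},
]
df = pd.DataFrame(rows)

# With no filename, to_csv returns the CSV as a string so we can look at it
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header starts with an empty column name
print(without_index.splitlines()[0])  # header has only the real columns
```

The empty first column name in the first header is what shows up as `Unnamed: 0` when the file is read back, so passing `index=False` when saving avoids it.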

In [14]:
df.to_csv("output.csv", index=False)

### Open the CSV file and examine the first few. Make sure you didn't save an extra weird unnamed column.

In [15]:
pd.read_csv("output.csv")

Unnamed: 0,Name and Location,Order,Basis for Order
0,"NGUYEN, THU THI City: HOUSTON County: HARRIS ...",Date: 10/23/2023 Respondents Thu Thi Nguyen an...,"Respondents failed to clean, disinfect, and st..."
1,"NGUYEN, TUONG City: FORT WORTH County: TARRAN...","Date: 10/4/2023 Respondents Trang Phan, Tuong ...",Respondents employed an individual as an opera...
2,"NGUYEN, HUONG T City: SAN ANTONIO County: BE...",Date: 8/23/2023 Respondent is assessed an admi...,Respondent failed to clean and disinfect facia...
3,"NGUYEN, STEVEN QUOC City: SAN ANGELO County: ...",Date: 7/28/2023 Respondent is assessed an admi...,Respondent failed to cooperate with the inspec...
4,"NGUYEN, HUONG THI NGOC City: MANSFIELD County...",Date: 7/24/2023 Respondent is assessed an admi...,"Respondent failed to clean, disinfect, and ste..."
...,...,...,...
90,"NGUYEN, DUC City: ABILENE County: TAYLOR Zip...",Date: 10/12/2021 Respondent is assessed an adm...,Respondent failed to clean and sanitize whirlp...
91,"NGUYEN, THU THAO THI City: SAN ANTONIO County...",Date: 10/11/2021 Respondent is assessed an adm...,Respondent performed or attempted to perform a...
92,"NGUYEN, MINH NHU City: DALLAS County: Zip Cod...",Date: 9/17/2021 Respondent is assessed an admi...,Respondent failed to follow whirlpool foot spa...
93,"NGUYEN, DUNG VAN City: SAULT SAINTE MARIE Cou...",Date: 9/9/2021 Respondent is assessed an admin...,Respondent performed or attempted to perform a...
