# Texas Barber Violations

Texas has a system for [searching for license violations](https://www.tdlr.texas.gov/cimsfo/fosearch.asp). You're going to search for barbers in Houston!

## Preparation: Knowing your tags

These questions are the same for every data set, so they might not fit this one exactly.

### What is the tag and class name for every row of data?

In [1]:
# tr

### What is the tag and class name for every person's name?

In [2]:
# td[0] span[0]

### What is the tag and class name for the violation number?

In [3]:
# td[0] span[-1]

### What is the tag and class name for the description of their violation?

In [4]:
# td[2]
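
The shorthand answers above map onto BeautifulSoup calls roughly as follows. This is a minimal sketch against a made-up row: the sample HTML (including the name, date, and number) is an assumption for illustration, not the real page's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one data row shaped the way the notes above describe it.
sample_html = """
<table>
  <tr>
    <td><span>DOE, JANE</span><span>City: HOUSTON</span><span>BAR00000000000</span></td>
    <td>01/01/2017</td>
    <td>Respondent performed barbering without the required license.</td>
  </tr>
</table>
"""

row = BeautifulSoup(sample_html, 'html.parser').find('tr')   # tr
cells = row.find_all('td')

name = cells[0].find_all('span')[0].text.strip()      # td[0] span[0]
number = cells[0].find_all('span')[-1].text.strip()   # td[0] span[-1]
description = cells[2].text.strip()                   # td[2]

print(name, number, description, sep=' | ')
```

The later cells write `row.td.span` instead of `cells[0].find_all('span')[0]`; both reach the first `span` of the first `td`.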

## Setup: Import what you'll need to scrape the page

Use `requests`, not `urllib`.

In [5]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Try to scrape the page

To test if you requested the page correctly, save the BeautifulSoup document as `doc` and run the code `doc.find_all('tr')[1].text` to get the text of the second `<tr>` element, which is the first row of data.

- If the result starts with **\nPlease enter at least one (1) parameter**, you were NOT successful.
- If the result starts with **MONTES DE OCA, REINIER**, you were successful.
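
One way to make that check reusable is a small helper. This is a sketch, assuming the failure page always contains the "Please enter at least one (1) parameter" text quoted above; `looks_successful` is a hypothetical name, and the two HTML snippets are invented stand-ins for real responses.

```python
from bs4 import BeautifulSoup

ERROR_TEXT = 'Please enter at least one (1) parameter'

def looks_successful(doc):
    """Return True if the second <tr> exists and does not hold the error message."""
    rows = doc.find_all('tr')
    if len(rows) < 2:
        return False
    return ERROR_TEXT not in rows[1].text

# Invented stand-ins for a failed and a successful response
failed = BeautifulSoup(
    '<table><tr><td>Search</td></tr>'
    '<tr><td>\nPlease enter at least one (1) parameter</td></tr></table>',
    'html.parser')
worked = BeautifulSoup(
    '<table><tr><th>Name</th></tr>'
    '<tr><td>MONTES DE OCA, REINIER</td></tr></table>',
    'html.parser')

print(looks_successful(failed))
print(looks_successful(worked))
```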

### Try to request the page however you think you should.

"Try" to do it, because it *will not work.* Once you've learned that it won't work, you should **ask how to do it on the board**.

In [6]:
url = 'https://www.tdlr.texas.gov/cimsfo/fosearch_results.asp'

response = requests.post(url)
doc = BeautifulSoup(response.text, 'html.parser')

### Try to request the page with the correct data parameters

Secret tip: It still won't work. **Ask why not on the board.**

In [7]:
url = 'https://www.tdlr.texas.gov/cimsfo/fosearch_results.asp'

data = {
    'pht_status': 'BAR',
    'pht_lic': '',
    'pht_lnm': '',
    'pht_fnm': '',
    'pht_oth_name': '',
    'phy_city': 'HOUSTON',
    'phy_cnty': '-1',
    'phy_zip': '',
    'B1': 'Search'
}

response = requests.post(url, data=data)
doc = BeautifulSoup(response.text, 'html.parser')

### What is the smallest `curl` command that still gives you a result?

In [8]:
# curl 'https://www.tdlr.texas.gov/cimsfo/fosearch_results.asp' -H 'Referer: https://www.tdlr.texas.gov/cimsfo/' --data 'pht_status=BAR&pht_lic=&pht_lnm=&pht_fnm=&pht_oth_name=&phy_city=HOUSTON+++++++++++++&phy_cnty=-1&phy_zip=&B1=Search'
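
To compare the `--data` string from curl with a Python dictionary, the standard library can decode it. This sketch uses `urllib.parse` only for decoding and encoding the form body (the request itself still goes through `requests`); the variable names are illustrative.

```python
from urllib.parse import parse_qsl, urlencode

curl_data = ('pht_status=BAR&pht_lic=&pht_lnm=&pht_fnm=&pht_oth_name='
             '&phy_city=HOUSTON+++++++++++++&phy_cnty=-1&phy_zip=&B1=Search')

# keep_blank_values=True keeps the empty fields such as pht_lic
form_fields = dict(parse_qsl(curl_data, keep_blank_values=True))

# Each '+' decodes to a space, so the city arrives padded with trailing spaces
print(repr(form_fields['phy_city']))

# Strip the padding to get the plain 'HOUSTON' used in the notebook's dictionary
form_fields['phy_city'] = form_fields['phy_city'].strip()
print(form_fields)

# Encoding the dictionary again gives a form body without the padding
encoded = urlencode(form_fields)
print(encoded)
```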

## Request the page with the correct data parameters AND the correct MINIMUM headers

This time it should work.

In [9]:
url = 'https://www.tdlr.texas.gov/cimsfo/fosearch_results.asp'

data = {
    'pht_status': 'BAR',
    'pht_lic': '',
    'pht_lnm': '',
    'pht_fnm': '',
    'pht_oth_name': '',
    'phy_city': 'HOUSTON',
    'phy_cnty': '-1',
    'phy_zip': '',
    'B1': 'Search'
}

headers = {
    'Referer': 'https://www.tdlr.texas.gov/cimsfo/'
}

# Send the form fields together with the Referer header, then parse the returned HTML
response = requests.post(url, data=data, headers=headers)
doc = BeautifulSoup(response.text, 'html.parser')

## Scraping

### Loop through each `tr` and print each person's name

You'll get an error because the first row (the header) doesn't have a name. How do you make that not happen? I'm happy to help if you ask on the board.

In [10]:
for row in doc.find_all('tr')[1:]:
    print(row.td.span.text.strip())

MONTES DE OCA, REINIER
ALFORD, RAYMOND
CHAPMAN, JESSICA
SALAZAR-ALVAREZ, SAMUEL
GONZALES, DAVID
FLORES, CHRISTOPHER
ARMSTEAD, CEDRIC J
MORAH, PATRICK
TREJO, BLADIMAR A
DAVIS, RICHARD D
HOPKINS, JOSHUA
NINO, ROBERT
HEATH, LOLETHA N
SALAZAR-ALVAREZ, SAMUEL
MONTES DE OCA, REINIER
Company:
Company:
SUTTON, EMANUEL B
SHEPHARD, JAMES C
HERNANDEZ, MARIA DIOCELINA
WILLIAMS, DONTUEL
JOHNSON, JEFFERY J
Company:
HUERTA, FRANCISCO
TIPTON, SELINA I
ARREOLA, ERIC D
HARRISON, OTTO M
RIVERA TORRES, ANGEL D
PECK, MARVIN
MOTA SOTO, CRISTIAN D
WADDLE, EDDIE D
SON, YOUNG J
HILL, BRIAN
BROWN, DELRICK JAREL
FRANKLIN, KELVIN
LEDET, LEON
WILLIAMS, DONTUEL
LACY, JUSTIN J
Company:
ARELLANO, GREGORY F
MACEDO, ANTONIO
MILLER, SHAWN ERIC
HAYWARD, ABBIE DEAN
BROWN, CHARLES EARL
MCQUEEN, IDA M
MCQUEEN, IDA M
CAESAR, RON
MORRIS, VICTOR B
NOLAN, CHRIS B
BICKHAM, DONNELL
LOUIS, DIONNE N
HARRELL, KENTON D
SUBRAHMANIAN, CHITRA N
FRANKLIN, LAWRENCE W
ADAMS, KERRY
PATTERSON, RONALD
LANCASTER, MARLANA DEVONE S
TOLDEN, REGIN
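
Slicing with `[1:]` works when only the first row lacks a name. An alternative sketch skips any row without a `td`, and also deals with the rows that print `Company:` above. It assumes, without having checked the real markup, that the header row has no `td` and that in company rows the first `span` is a label followed by a `span` holding the actual name. The sample HTML is invented.

```python
from bs4 import BeautifulSoup

# Invented rows: a header, a person, and a company
sample_html = """
<table>
  <tr><th>Name and Location</th><th>Order</th></tr>
  <tr><td><span>DOE, JANE</span><span>BAR00000000001</span></td></tr>
  <tr><td><span>Company:</span><span>EXAMPLE BARBER SHOP</span><span>BAR00000000002</span></td></tr>
</table>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

names = []
for row in sample_doc.find_all('tr'):
    # row.td is None for rows with no <td>, such as a header made of <th> cells
    if row.td is None:
        continue
    spans = row.td.find_all('span')
    name = spans[0].text.strip()
    # Assumption: a 'Company:' label is followed by a span with the company name
    if name == 'Company:' and len(spans) > 1:
        name = spans[1].text.strip()
    names.append(name)

print(names)
```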

### Loop through each `tr`, printing each violation description

- TIP: What is the container tag name for it?
- TIP: You'll get an error even if you're ALMOST right - which row is causing the problem?

In [11]:
for row in doc.find_all('tr')[1:]:
    print(row.find_all('td')[2].text.strip())

Respondent performed barbering without the required license.
Respondent performed barbering without the required license.
Respondent failed to electronically submit to the Department at least one time per month student's accrued hours.
Respondent performed barbering without the required license.
Respondent leased space in a barber shop to an individual who engaged in the practice of barbering but had not obtained a barber license.
Respondent leased space in a barber shop to an individual who engaged in the practice of barbering but had not obtained a barber license.
The Respondent's license was revoked upon Respondent's imprisonment in a penitentiary.
Respondent leased space in a barber shop to an individual who engaged in the practice of barbering but had not obtained a barber license; Respondent failed to prepare fresh disinfectant solution daily or more often as needed, for immersion of implements.
Respondent performed barbering without the required license.
Respondent practiced bar
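
When a loop like this fails partway through, it helps to locate the offending row instead of guessing. This sketch lists every row with fewer than three `td` cells; the sample table is invented and `short_rows` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Invented table: a header made of <th> cells plus one data row
sample_html = """
<table>
  <tr><th>Name</th><th>Date</th><th>Order</th></tr>
  <tr><td>DOE, JANE</td><td>01/01/2017</td><td>Respondent performed barbering without the required license.</td></tr>
</table>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

short_rows = []
for index, row in enumerate(sample_doc.find_all('tr')):
    cells = row.find_all('td')
    if len(cells) < 3:
        # Indexing cells[2] on this row would raise an IndexError
        short_rows.append((index, len(cells)))

print(short_rows)  # [(0, 0)]
```

Here only row 0 is reported, which is why starting the loop at `[1:]` avoids the error.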

### Loop through each `tr`, printing the complaint number

- TIP: It should be the last piece of the first `td`

In [12]:
for row in doc.find_all('tr')[1:]:
    print(row.td.find_all('span')[-1].text.strip())

BAR20170009735
BAR20170013061
BAR20160014463
BAR20170009706
BAR20160024898
BAR20170003858
BAR20170017750
BAR20170001067
BAR20170015712
BAR20160026976
BAR20170004945
BAR20170005752
BAR20170008862
BAR20170009706
BAR20170009735
BAR20170010211
BAR20170015711
BAR20170005607
BAR20170012408
BAR20160015455
BAR20170004000
BAR20170004622
BAR20170009953
BAR20160019178
BAR20170003998
BAR20170005585
BAR20170004247
BAR20170004644
BAR20170001084
BAR20170003233
BAR20170007267
BAR20170004607
BAR20170004726
BAR20170000258
BAR20170000872
BAR20170000888
BAR20170004000
BAR20170001296
BAR20170001765
BAR20160000930
BAR20160012081
BAR20160023793
BAR20160020239
BAR20160025221
BAR20160003560
BAR20160014501
BAR20160020292
BAR20160019711
BAR20160020884
BAR20160013209
BAR20160013204
BAR20160013487
BAR20160020715
BAR20150023440
BAR20160015167
BAR20160010384
BAR20160010595
BAR20160010403
BAR20160003712
BAR20160002891
BAR20160006458
BAR20160017201
BAR20160005691
BAR20160012167
BAR20160010137
BAR20150016373
BAR2015002

## Saving the results

### Loop through each `tr` to create a list of dictionaries

Each dictionary must contain

- Person's name
- Violation description
- Violation number

Create a new dictionary for each `tr` (except the header).

In [13]:
violations = []

# Skip the header row, then build one dictionary per data row
for violation in doc.find_all('tr')[1:]:
    violations.append({
        'name': violation.td.span.text.strip(),
        'description': violation.find_all('td')[2].text.strip(),
        'number': violation.td.find_all('span')[-1].text.strip()
    })

violations

[{'description': 'Respondent performed barbering without the required license.',
  'name': 'MONTES DE OCA, REINIER',
  'number': 'BAR20170009735'},
 {'description': 'Respondent performed barbering without the required license.',
  'name': 'ALFORD, RAYMOND',
  'number': 'BAR20170013061'},
 {'description': "Respondent failed to electronically submit to the Department at least one time per month student's accrued hours.",
  'name': 'CHAPMAN, JESSICA',
  'number': 'BAR20160014463'},
 {'description': 'Respondent performed barbering without the required license.',
  'name': 'SALAZAR-ALVAREZ, SAMUEL',
  'number': 'BAR20170009706'},
 {'description': 'Respondent leased space in a barber shop to an individual who engaged in the practice of barbering but had not obtained a barber license.',
  'name': 'GONZALES, DAVID',
  'number': 'BAR20160024898'},
 {'description': 'Respondent leased space in a barber shop to an individual who engaged in the practice of barbering but had not obtained a barber li

### Save that to a CSV

In [14]:
df = pd.DataFrame(violations)
df.to_csv('violations.csv', index=False)

### Open the CSV file and examine the first few rows. Make sure you didn't save an extra weird unnamed column.

In [15]:
pd.read_csv('violations.csv')

|   | description | name | number |
|---|---|---|---|
| 0 | Respondent performed barbering without the req... | MONTES DE OCA, REINIER | BAR20170009735 |
| 1 | Respondent performed barbering without the req... | ALFORD, RAYMOND | BAR20170013061 |
| 2 | Respondent failed to electronically submit to ... | CHAPMAN, JESSICA | BAR20160014463 |
| 3 | Respondent performed barbering without the req... | SALAZAR-ALVAREZ, SAMUEL | BAR20170009706 |
| 4 | Respondent leased space in a barber shop to an... | GONZALES, DAVID | BAR20160024898 |
| 5 | Respondent leased space in a barber shop to an... | FLORES, CHRISTOPHER | BAR20170003858 |
| 6 | The Respondent's license was revoked upon Resp... | ARMSTEAD, CEDRIC J | BAR20170017750 |
| 7 | Respondent leased space in a barber shop to an... | MORAH, PATRICK | BAR20170001067 |
| 8 | Respondent performed barbering without the req... | TREJO, BLADIMAR A | BAR20170015712 |
| 9 | Respondent practiced barbering in an unlicense... | DAVIS, RICHARD D | BAR20160026976 |
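
The unnamed column comes from saving the DataFrame's index. This sketch shows the difference using in-memory buffers instead of files; the two rows are invented placeholders, not real violations.

```python
import io
import pandas as pd

# Invented placeholder rows
violations = [
    {'name': 'DOE, JANE', 'description': 'Example description.', 'number': 'BAR00000000001'},
    {'name': 'ROE, RICHARD', 'description': 'Another example.', 'number': 'BAR00000000002'},
]
df = pd.DataFrame(violations)

# Default behaviour: the index is written as a first column with an empty header
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out, so no extra column appears on reload
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

If the first list contains `Unnamed: 0` and the second does not, the `index=False` argument is doing its job.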
