# Texas Tow Trucks (`.apply` and `requests`)

We're going to scrape some [tow trucks in Texas](https://www.tdlr.texas.gov/tools_search/).

Try searching for the TDLR Number `006179570C`.

## Preparation

> You do not need to actually find this using BeautifulSoup; this is more for you to say "it's a td, it isn't special, but it looks like the third td in a tr with a class" or something similar.

### What is the URL you will be scraping?

In [119]:
# https://www.tdlr.texas.gov/tools_search/
# or
# https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006179570C

### When you search for information on a specific tow truck company, do you need form data? If so, what is your form data going to be?

In [None]:
# Yes, we need to post the TDLR Number:

# { 
# "namedata":"",
# "name_carrier_type":"COMPANY",
# "searchtype":"mcr",
# "mcrdata":"006179570C",
# "citydata":"",
# "city_status":"A",
# "city_carrier_type":"tow",
# "zipcodedata":"",
# "zip_status":"ALL",
# "zip_carrier_type":"all",
# "proc":""
# }

# But we can also use the url
# https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006179570C
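
The query-string form of the lookup can also be built in code rather than typed by hand. This is a minimal sketch using only the standard library; `build_display_url` is a hypothetical helper name, and the only thing assumed about the site is the `mcrnumber` parameter visible in the URL above.

```python
from urllib.parse import urlencode

# Detail-page address taken from the URL shown above
DISPLAY_URL = 'https://www.tdlr.texas.gov/tools_search/mccs_display.asp'

def build_display_url(tdlr_number):
    """Return the detail-page URL for one TDLR number (hypothetical helper)."""
    # urlencode escapes any characters that are not safe in a query string
    return DISPLAY_URL + '?' + urlencode({'mcrnumber': tdlr_number})

url = build_display_url('006179570C')
print(url)
```

Using `urlencode` instead of plain string concatenation makes no difference for these numbers, but it keeps the URL valid if a value ever contains a space or other special character.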

## Scrape this page

Scrape this page, displaying the following:

- The business name
- Phone number
- License status
- Physical address

- *TIP: This one isn't very fun, but I have some secret tricks. **Ask me on the board**.*

In [149]:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import re

In [154]:
driver = webdriver.Chrome('/Users/mathieurudaz/Desktop/LEDE/chromedriver')

In [195]:
driver.get('https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=0649468VSF')

In [208]:
# Extract text without child tags from a tag
def get_text_without_tags(web_element):
    text = web_element.text
    for t in web_element.find_elements_by_xpath('.//*'):
        text = text.replace(t.text, '')
    return text.strip()

# Get the physical address from the td tag
def get_address(webElement):
    m = re.search(r"(Physical:)\s(.*\s.*)", webElement.text, re.MULTILINE)
    if m:
        return m.group(2)

# Get the license status from the font tag
def get_license_status(webElement):
    m = re.search(r"\((.+)\)", webElement.text, re.MULTILINE)
    if m:
        return m.group(1)
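
The two regular expressions can be tried out on plain strings, without a browser. The sketch below assumes cell text shaped roughly like what Selenium returns; the sample strings are made up for illustration (only the physical address is copied from this notebook's output), and the function names are hypothetical.

```python
import re

# Made-up cell text (assumption): a label line followed by a two-line address
sample_address_cell = (
    "Mailing:\nPO BOX 1\nVERNON, TX. 76384\n"
    "Physical:\n1529 WILBARGER ST\nVERNON, TX. 76384"
)
# Made-up status text (assumption): the status sits inside parentheses
sample_status_cell = "Status (Expired)"

def extract_physical_address(text):
    # After "Physical:", capture one line, a whitespace character, then the next line
    m = re.search(r"Physical:\s(.*\s.*)", text)
    return m.group(1) if m else None

def extract_status(text):
    # Capture whatever sits between the parentheses
    m = re.search(r"\((.+)\)", text)
    return m.group(1) if m else None

address = extract_physical_address(sample_address_cell)
status = extract_status(sample_status_cell)
missing = extract_status("no parentheses here")  # no match, so None
```

Because `.` does not match a newline, each `.*` stays on one line and the `\s` between them consumes the line break. `re.MULTILINE` only changes how `^` and `$` behave, so it has no effect on patterns like these.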

In [214]:
print('Business name:', get_text_without_tags(driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[2]/tbody/tr[2]/td[1]')))
print('Phone number:', get_text_without_tags(driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[2]/tbody/tr[4]/td[1]')))
print('License status:', get_license_status(driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[3]/tbody/tr[2]/td[1]/font')))
print('Physical address:', get_address(driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[3]/tbody/tr[2]/td[2]')))

Business name: HEATH SMITH
Phone number: 9405520687
License status: Expired
Physical address: 1529 WILBARGER ST
VERNON, TX. 76384


# Using .apply to find data about SEVERAL tow truck companies

The file `trucks-subset.csv` has information about the trucks; we'll use it to find the pages to scrape.

### Open up `trucks-subset.csv` and save it into a dataframe

In [215]:
df_trucks = pd.read_csv('trucks-subset.csv')

### Open up `trucks-subset.csv` in a text editor, then look at your dataframe. Is something different about them? If so, make them match.

- *TIP: I can help with this.*

In [216]:
df_trucks

,TDLR Number
0,006507931C
1,006179570C
2,006502097C


## Go through each row of the dataset, printing out the URL you will need to scrape for the information on that row

For example, `https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006507931C`.

- *TIP: use .apply and a function*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*

In [217]:
base_url = 'https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber='
# Apply to the 'TDLR Number' column so the function runs once per row
df_trucks['TDLR Number'].apply(lambda number: print(base_url + number))

https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006507931C
https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006179570C
https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006502097C


0    None
1    None
2    None
Name: TDLR Number, dtype: object
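
`.apply` collects whatever the function returns. A function that only prints returns `None`, which is why the result above is filled with `None`. The sketch below contrasts the two cases on a small stand-in dataframe; `print_url`, `build_url`, and `demo` are illustrative names, not part of the assignment.

```python
import pandas as pd

BASE_URL = 'https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber='

def print_url(tdlr_number):
    # Prints the URL but has no return statement, so it returns None
    print(BASE_URL + tdlr_number)

def build_url(tdlr_number):
    # Returns the URL so .apply can collect it
    return BASE_URL + tdlr_number

# Stand-in dataframe using two TDLR numbers from this notebook
demo = pd.DataFrame({'TDLR Number': ['006507931C', '006179570C']})

printed = demo['TDLR Number'].apply(print_url)   # nothing useful is collected
returned = demo['TDLR Number'].apply(build_url)  # one URL string per row
```

Only `returned` holds something worth saving into a new column.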

### Save this URL into a new column of your dataframe

- *TIP: Use a function and `.apply`*
- *TIP: Be sure to use `return`*

In [218]:
# Apply to the 'TDLR Number' column so each row gets its own URL string
df_trucks['url'] = df_trucks['TDLR Number'].apply(lambda number: base_url + number)
df_trucks

,TDLR Number,url
0,006507931C,https://www.tdlr.texas.gov/tools_search/mccs_d...
1,006179570C,https://www.tdlr.texas.gov/tools_search/mccs_d...
2,006502097C,https://www.tdlr.texas.gov/tools_search/mccs_d...


## Go through each row of the dataset, printing out information about each tow truck company.

Now you will be **scraping** inside of your function.

- The business name
- Phone number
- License status
- Physical address

Just print it out for now.

- *TIP: use .apply and a function*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*

In [230]:
def get_truck_info(row):
    # Load this row's page; the lookups below query the live page through Selenium
    driver.get(row['url'])
    print(get_text_without_tags(driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[2]/tbody/tr[2]/td[1]')))
    print(get_text_without_tags(driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[2]/tbody/tr[4]/td[1]')))
    print(get_license_status(driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[3]/tbody/tr[2]/td[1]/font')))
    print(get_address(driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[3]/tbody/tr[2]/td[2]')))
    print('------')
    
df_trucks.apply(get_truck_info, axis=1)

AUGUSTUS E SMITH
9032276464
Active
103 N MAIN ST
BONHAM, TX. 75418
------
B.D. SMITH TOWING
8173330706
Active
13619 BRETT JACKSON RD.
FORT WORTH, TX. 76179
------
BARRY MICHAEL SMITH
8066544404
Active
4501 W CEMETERY RD
CANYON, TX. 79015
------


0    None
1    None
2    None
dtype: object

## Scrape the following information for each row of the dataset, and save it into new columns in your dataframe.

- The business name
- Phone number
- License status
- Physical address

It's basically what we did before, but using the function a little differently.

- *TIP: Use .apply and a function*
- *TIP: Remember to use `return`*

In [231]:
def get_truck_info(row):
    # Load this row's page; the lookups below query the live page through Selenium
    driver.get(row['url'])
    return pd.Series({
        'name': get_text_without_tags(driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[2]/tbody/tr[2]/td[1]')),
        'phone_number': get_text_without_tags(driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[2]/tbody/tr[4]/td[1]')),
        'license_status': get_license_status(driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[3]/tbody/tr[2]/td[1]/font')),
        'address': get_address(driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[3]/tbody/tr[2]/td[2]'))
    })
    
df_trucks_complete = df_trucks.apply(get_truck_info, axis=1).join(df_trucks)
df_trucks_complete.head()

,address,license_status,name,phone_number,TDLR Number,url
0,"103 N MAIN ST\nBONHAM, TX. 75418",Active,AUGUSTUS E SMITH,9032276464,006507931C,https://www.tdlr.texas.gov/tools_search/mccs_d...
1,"13619 BRETT JACKSON RD.\nFORT WORTH, TX. 76179",Active,B.D. SMITH TOWING,8173330706,006179570C,https://www.tdlr.texas.gov/tools_search/mccs_d...
2,"4501 W CEMETERY RD\nCANYON, TX. 79015",Active,BARRY MICHAEL SMITH,8066544404,006502097C,https://www.tdlr.texas.gov/tools_search/mccs_d...


### Save your dataframe as a CSV

In [222]:
df_trucks_complete.to_csv('df_trucks_complete.csv', index=False)

### Re-open your dataframe to confirm you didn't save any extra weird columns

In [223]:
pd.read_csv('df_trucks_complete.csv').head()

,address,license_status,name,phone_number,TDLR Number,url
0,"103 N MAIN ST\nBONHAM, TX. 75418",Active,AUGUSTUS E SMITH,9032276464,006507931C,https://www.tdlr.texas.gov/tools_search/mccs_d...
1,"13619 BRETT JACKSON RD.\nFORT WORTH, TX. 76179",Active,B.D. SMITH TOWING,8173330706,006179570C,https://www.tdlr.texas.gov/tools_search/mccs_d...
2,"4501 W CEMETERY RD\nCANYON, TX. 79015",Active,BARRY MICHAEL SMITH,8066544404,006502097C,https://www.tdlr.texas.gov/tools_search/mccs_d...


## Repeat this process for the entire `tow-trucks.csv` file

In [224]:
df_tow_trucks = pd.read_csv('tow-trucks.csv')

In [225]:
# Apply to the 'TDLR Number' column so each row gets its own URL string
df_tow_trucks['url'] = df_tow_trucks['TDLR Number'].apply(lambda number: base_url + number)
df_tow_trucks.head()

,TDLR Number,url
0,006507931C,https://www.tdlr.texas.gov/tools_search/mccs_d...
1,006179570C,https://www.tdlr.texas.gov/tools_search/mccs_d...
2,006502097C,https://www.tdlr.texas.gov/tools_search/mccs_d...
3,006494912C,https://www.tdlr.texas.gov/tools_search/mccs_d...
4,0649468VSF,https://www.tdlr.texas.gov/tools_search/mccs_d...


In [232]:
df_tow_trucks_complete = df_tow_trucks.apply(get_truck_info, axis=1).join(df_tow_trucks)
df_tow_trucks_complete.head()

In [234]:
df_tow_trucks_complete.to_csv('df_tow_trucks_complete.csv', index=False)

In [235]:
pd.read_csv('df_tow_trucks_complete.csv').head()

,address,license_status,name,phone_number,TDLR Number,url
0,"103 N MAIN ST\nBONHAM, TX. 75418",Active,AUGUSTUS E SMITH,9032276464,006507931C,https://www.tdlr.texas.gov/tools_search/mccs_d...
1,"13619 BRETT JACKSON RD.\nFORT WORTH, TX. 76179",Active,B.D. SMITH TOWING,8173330706,006179570C,https://www.tdlr.texas.gov/tools_search/mccs_d...
2,"4501 W CEMETERY RD\nCANYON, TX. 79015",Active,BARRY MICHAEL SMITH,8066544404,006502097C,https://www.tdlr.texas.gov/tools_search/mccs_d...
3,"1529 WILBARGER ST\nVERNON, TX. 76384",Expired,HEATH SMITH,940-552-0687,006494912C,https://www.tdlr.texas.gov/tools_search/mccs_d...
4,"1529 WILBARGER ST\nVERNON, TX. 76384",Expired,HEATH SMITH,9405520687,0649468VSF,https://www.tdlr.texas.gov/tools_search/mccs_d...
