# Web Scraping with Python Using Selenium  
### Demo of scraping location details from the Panera Bread website

**Resources:**  
* Downloading/Installation: https://selenium-python.readthedocs.io/installation.html  
* Getting started and using special keys: https://selenium-python.readthedocs.io/getting-started.html  
* Locating elements: https://selenium-python.readthedocs.io/locating-elements.html  

**Useful Code Snippets:**  
to run the driver and get a link:
```
    link = ''
    driver = Chrome(r"C:\Users\lvincent\Documents\chromedriver87.exe")
    driver.get(link)
```
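to slow down the scraping (`implicitly_wait` is a session-wide setting, so calling it once after creating the driver is enough; add the `time.sleep` after each `driver.get(link)`):
```
    driver.implicitly_wait(10)         # makes the driver wait up to 10 seconds for an element to appear before a find_element call fails
    time.sleep(random.randint(1,3))    # sleeps (pauses run) either 1 or 2 or 3 seconds
```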
common ways to locate elements:
```
    driver.find_element_by_xpath('').text                  # using xpath
    driver.find_element_by_xpath('').get_attribute("href") # to get information out of attributes, e.g. grabbing a link
    driver.find_element_by_tag_name('body').text           # using the tag name
    driver.find_elements_by_tag_name('div')                # returns a list of elements that match, not just the first
    driver.find_element_by_xpath('').get_attribute("innerHTML") # to get contents of script tags or other tags
```
for getting page with selenium but scraping with BeautifulSoup:
```
    content = driver.find_element_by_tag_name('body').get_attribute("innerHTML")
    soup = BeautifulSoup(content, 'lxml')
```
to get current url, useful when redirected:
```
    driver.current_url
```
**Other Useful Resources:**  
* For Chrome, download the driver from https://chromedriver.chromium.org/downloads and save it where you can find it (you may need to update it if your version of Chrome gets updated)
* XPaths reference: https://www.w3schools.com/xml/xpath_syntax.asp  

In [30]:
from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys 
import time
import pandas as pd
import re
import json
import numpy as np
import requests
from bs4 import BeautifulSoup
import csv
import random

## Initial testing to find data on webpage

In [94]:
link = 'https://locations.panerabread.com/al/alabaster/100-s-colonial-drive.html'
driver = Chrome(r"C:\Users\lvincent\Documents\chromedriver87.exe")
driver.get(link)

In [7]:
##try to find brand name using xpath, returns only information about the webelement:
driver.find_element_by_xpath('//*[@id="location-name"]/span[1]')

<selenium.webdriver.remote.webelement.WebElement (session="15ff8e856d35a00bb70881ea5e04b1f7", element="c3a2e68a-e0dd-4576-bb7a-459dbbe0a401")>

In [3]:
##must use .text to get the contents of the tag:
driver.find_element_by_xpath('//*[@id="location-name"]/span[1]').text

'Panera Bread'

In [96]:
##can use the same method to get the merchant name and most other fields of interest
driver.find_element_by_xpath('//*[@id="location-name"]/span[2]').text

'Alabaster -S Colonial Drive'

In [8]:
##latitude and longitude are stored in an attribute of a meta tag, so .text returns nothing:
driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[2]/div/div/div[2]/div[1]/span[1]/meta[1]').text

''

In [9]:
##to get contents of attribute use .get_attribute:
driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[2]/div/div/div[2]/div[1]/span[1]/meta[1]').get_attribute("content")

'33.2276333'

In [20]:
##Location features are a list but .text combines the text as one string:
driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[2]/div/div/div[2]/div[3]/div[1]/ul').text.strip()

'Delivery Available\nDine In\nCurbside\nKiosk\nRapid Pick-up'

In [21]:
##This can be split into a list:
locationFeatures = driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[2]/div/div/div[2]/div[3]/div[1]/ul').text.strip()
locationFeatures.split('\n')

['Delivery Available', 'Dine In', 'Curbside', 'Kiosk', 'Rapid Pick-up']

In [22]:
##location hours are returned as a JSON string:
driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div').get_attribute("data-days")

'[{"day":"MONDAY","intervals":[{"end":2100,"start":600}]},{"day":"TUESDAY","intervals":[{"end":2100,"start":600}]},{"day":"WEDNESDAY","intervals":[{"end":2100,"start":600}]},{"day":"THURSDAY","intervals":[{"end":2100,"start":600}]},{"day":"FRIDAY","intervals":[{"end":2100,"start":600}]},{"day":"SATURDAY","intervals":[{"end":2100,"start":600}]},{"day":"SUNDAY","intervals":[{"end":2100,"start":600}]}]'
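Because the `data-days` attribute comes back as a JSON string, it can be decoded with the standard `json` module before saving. The sketch below is one possible way to do that. The helper names `parse_hours` and `format_hhmm` are hypothetical, and reading `600` as 06:00 and `2100` as 21:00 (24-hour HHMM integers) is an assumption based on the sample output above.

```python
import json


def format_hhmm(value):
    """Format an integer such as 600 as '06:00' (assumes 24-hour HHMM encoding)."""
    hours, minutes = divmod(int(value), 100)
    return f"{hours:02d}:{minutes:02d}"


def parse_hours(raw):
    """Map each day name to a list of (open, close) time strings."""
    parsed = {}
    for entry in json.loads(raw):
        parsed[entry["day"]] = [
            (format_hhmm(interval["start"]), format_hhmm(interval["end"]))
            for interval in entry.get("intervals", [])
        ]
    return parsed


# Shortened sample in the same shape as the data-days value shown above
sample = (
    '[{"day":"MONDAY","intervals":[{"end":2100,"start":600}]},'
    '{"day":"TUESDAY","intervals":[{"end":2100,"start":600}]}]'
)
hours = parse_hours(sample)
print(hours["MONDAY"])
```

Keeping the raw JSON string in the results table, as the rest of this notebook does, is also reasonable; decoding is only needed when the hours have to be queried or displayed.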

In [24]:
##consolidate fields and double check results:
brand_name = driver.find_element_by_xpath('//*[@id="location-name"]/span[1]').text.strip()
merchant_name = driver.find_element_by_xpath('//*[@id="location-name"]/span[2]').text.strip()
latitude = driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[2]/div/div/div[2]/div[1]/span[1]/meta[1]').get_attribute("content").strip()
longitude = driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[2]/div/div/div[2]/div[1]/span[1]/meta[2]').get_attribute("content").strip()
address1 = driver.find_element_by_xpath('//*[@id="address"]/span[1]/span[1]').text.strip()
address2 = driver.find_element_by_xpath('//*[@id="address"]/span[1]/span[2]').text.strip()
city = driver.find_element_by_xpath('//*[@id="address"]/span[2]/span[1]').text.strip()
state = driver.find_element_by_xpath('//*[@id="address"]/abbr[1]').text.strip()
postalCode = driver.find_element_by_xpath('//*[@id="address"]/span[3]').text.strip()
country = driver.find_element_by_xpath('//*[@id="address"]/abbr[2]').get_attribute("title")
telephone = driver.find_element_by_xpath('//*[@id="telephone"]').text.strip().replace(')','').replace('(','').replace(' ','').replace('-','')
locationFeatures = driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[2]/div/div/div[2]/div[3]/div[1]/ul').text.strip()
locationFeatures = locationFeatures.split('\n')
locationHours = driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div').get_attribute("data-days")

print(brand_name)
print(merchant_name)
print(latitude)
print(longitude)
print(address1)
print(address2)
print(city)
print(state)
print(postalCode)
print(telephone)
print(country)
print(locationFeatures)
print(locationHours)

Panera Bread
Alabaster -S Colonial Drive
33.2276333
-86.8046543
100 S Colonial Drive
Suite 200
Alabaster
AL
35007
2056644525
United States
['Delivery Available', 'Dine In', 'Curbside', 'Kiosk', 'Rapid Pick-up']
[{"day":"MONDAY","intervals":[{"end":2100,"start":600}]},{"day":"TUESDAY","intervals":[{"end":2100,"start":600}]},{"day":"WEDNESDAY","intervals":[{"end":2100,"start":600}]},{"day":"THURSDAY","intervals":[{"end":2100,"start":600}]},{"day":"FRIDAY","intervals":[{"end":2100,"start":600}]},{"day":"SATURDAY","intervals":[{"end":2100,"start":600}]},{"day":"SUNDAY","intervals":[{"end":2100,"start":600}]}]
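The chained `.replace()` calls used for `telephone` remove parentheses, spaces, and hyphens one at a time. A regular expression that drops every non-digit character is a possible alternative. This is a minimal sketch: the helper name `normalize_phone` is hypothetical, and the raw format `(205) 664-4525` is an assumption inferred from the characters being stripped, not something shown on the page.

```python
import re


def normalize_phone(raw):
    """Keep only the digits of a phone number string."""
    return re.sub(r"\D", "", raw)


# Assumed raw format; the notebook only shows the cleaned result
print(normalize_phone("(205) 664-4525"))
```

Unlike the chained replacements, this also removes characters such as `.` or `+`, which may or may not be desirable depending on how the numbers are used later.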


## Getting a list of individual merchant links

In [97]:
##first looking at main find location page with links to states:
link = 'https://locations.panerabread.com/'
driver = Chrome(r"C:\Users\lvincent\Documents\chromedriver87.exe")
driver.get(link)

In [50]:
##to find and grab all the state page links:
# driver.find_element_by_class_name('c-directory-list-content-item')
# driver.find_elements_by_class_name('c-directory-list-content-item-link')
test = driver.find_elements_by_class_name('c-directory-list-content-item-link')
# for t in test:
#     print(t.text)

# test[0].get_attribute('href')

links = []
for t in test:
    link = t.get_attribute('href')
    links.append(link)
links

['https://locations.panerabread.com/al.html',
 'https://locations.panerabread.com/az.html',
 'https://locations.panerabread.com/ar.html',
 'https://locations.panerabread.com/ca.html',
 'https://locations.panerabread.com/co.html',
 'https://locations.panerabread.com/ct.html',
 'https://locations.panerabread.com/de.html',
 'https://locations.panerabread.com/fl.html',
 'https://locations.panerabread.com/ga.html',
 'https://locations.panerabread.com/id.html',
 'https://locations.panerabread.com/il.html',
 'https://locations.panerabread.com/in.html',
 'https://locations.panerabread.com/ia.html',
 'https://locations.panerabread.com/ks.html',
 'https://locations.panerabread.com/ky.html',
 'https://locations.panerabread.com/la.html',
 'https://locations.panerabread.com/me.html',
 'https://locations.panerabread.com/md.html',
 'https://locations.panerabread.com/ma.html',
 'https://locations.panerabread.com/mi.html',
 'https://locations.panerabread.com/mn.html',
 'https://locations.panerabread.co

In [51]:
##testing to find all the city links on state pages
link = 'https://locations.panerabread.com/al.html'
driver.get(link)

test = driver.find_elements_by_class_name('c-directory-list-content-item-link')

links = []
for t in test:
    link = t.get_attribute('href')
    links.append(link)
links

['https://locations.panerabread.com/al/alabaster/100-s-colonial-drive.html',
 'https://locations.panerabread.com/al/athens/1323-us-highway-72-e.html',
 'https://locations.panerabread.com/al/auburn.html',
 'https://locations.panerabread.com/al/birmingham.html',
 'https://locations.panerabread.com/al/cullman/1118-cullman-shopping-center-nw.html',
 'https://locations.panerabread.com/al/decatur/1241-point-mallard-pkwy-al-67.html',
 'https://locations.panerabread.com/al/dothan/3515-ross-clark-circle.html',
 'https://locations.panerabread.com/al/enterprise/847-boll-weevil-circle.html',
 'https://locations.panerabread.com/al/florence/304-cox-creek-pkwy.html',
 'https://locations.panerabread.com/al/gadsden/508-e-meighan-boulevard.html',
 'https://locations.panerabread.com/al/gardendale/521-fieldstown-rd.html',
 'https://locations.panerabread.com/al/hoover/1790-riverchase-dr.html',
 'https://locations.panerabread.com/al/huntsville.html',
 'https://locations.panerabread.com/al/madison/8179-highw

In [54]:
# count = len(re.findall("/", links[0]))
# print(count)

test_links = links[0:5]
for t in test_links:
    count = len(re.findall("/", t))
    print(count, ":", t)

5 : https://locations.panerabread.com/al/alabaster/100-s-colonial-drive.html
5 : https://locations.panerabread.com/al/athens/1323-us-highway-72-e.html
4 : https://locations.panerabread.com/al/auburn.html
4 : https://locations.panerabread.com/al/birmingham.html
5 : https://locations.panerabread.com/al/cullman/1118-cullman-shopping-center-nw.html
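The slash counts above separate city pages (4) from individual store pages (5), but they include the two slashes in `https://`. Counting path segments instead is a possible variation that does not depend on the scheme. The function name `link_level` and the labels it returns are hypothetical, chosen only for this sketch.

```python
from urllib.parse import urlparse


def link_level(url):
    """Classify a directory link by the number of non-empty path segments."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) >= 3:
        return "store"
    if len(segments) == 2:
        return "city"
    if len(segments) == 1:
        return "state"
    return "other"


for url in [
    "https://locations.panerabread.com/al.html",
    "https://locations.panerabread.com/al/auburn.html",
    "https://locations.panerabread.com/al/alabaster/100-s-colonial-drive.html",
]:
    print(link_level(url), ":", url)
```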


In [63]:
##testing to find all the individual merchant links on city pages
link = 'https://locations.panerabread.com/al/auburn.html'
driver.get(link)

test = driver.find_elements_by_class_name('c-location-grid-item-link')

# links = []
# for t in test:
#     link = t.get_attribute('href')
#     links.append(link)
# links

# driver.find_element_by_class_name('c-location-grid-item-link').text

links = []
for t in test:
    if t.text == 'View Location Details':
        link = t.get_attribute('href')
        links.append(link)
links

['https://locations.panerabread.com/al/auburn/1550-opelika-rd.html',
 'https://locations.panerabread.com/al/auburn/231-mell-st.html']

In [82]:
##function to get all individual merchant web links:
def getStoreLinks():
    statelinks = []
    citylinks = []
    storelinks = []

    mainlink = 'https://locations.panerabread.com/'
    driver = Chrome(r"C:\Users\lvincent\Documents\chromedriver87.exe")
    driver.get(mainlink)

   
    test_list = driver.find_elements_by_class_name('c-directory-list-content-item-link')
    
    for t in test_list:
        link = t.get_attribute('href')
        count = len(re.findall("/", link))
#         print(link)
        if count >= 5:
            storelinks.append(link)
        elif count >=4:
            citylinks.append(link)
        else:
            statelinks.append(link)
            
    ##for testing on one state:
#     statelinks = ['https://locations.panerabread.com/al.html']

    for statelink in statelinks:
        driver.get(statelink)
        time.sleep(random.randint(0,2)) #sleeps either 0 or 1 or 2 seconds
        
        test_list2 = driver.find_elements_by_class_name('c-directory-list-content-item-link')
        
        for t in test_list2:
            link = t.get_attribute('href')
            count = len(re.findall("/", link))
#             print(link)
            if count >= 5:
                storelinks.append(link)
            else:
                citylinks.append(link)

    for citylink in citylinks:
        driver.get(citylink)
        time.sleep(random.randint(0,2)) #sleeps either 0 or 1 or 2 seconds
        
        test_list3 = driver.find_elements_by_class_name('c-location-grid-item-link')
        
        for t in test_list3:
            if t.text == 'View Location Details':
                link = t.get_attribute('href')
                count = len(re.findall("/", link))
#                 print(link)
                if count >= 5:
                    storelinks.append(link)

    driver.quit()  # quit() ends the whole browser session, not just the current window
    
    return storelinks
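The traversal in `getStoreLinks` (main page, then state pages, then city pages) can be tried out without opening a browser by passing the page-fetching step in as a function. The sketch below is an illustration under stated assumptions, not the notebook's method: `collect_store_links`, `path_depth`, and `get_links` are hypothetical names, the `example.com` URLs are made-up test data, and `get_links` is expected to return the relevant hrefs for a page (in the notebook that would mean wrapping `driver.get` and the class-name lookups, including the 'View Location Details' filter on city pages).

```python
from collections import deque
from urllib.parse import urlparse


def path_depth(url):
    """Number of non-empty path segments in the URL."""
    return len([s for s in urlparse(url).path.split("/") if s])


def collect_store_links(get_links, root):
    """Walk directory pages breadth-first and return store-level links.

    get_links(url) is a hypothetical callable returning the hrefs on a page.
    Links with three or more path segments are treated as store pages.
    """
    stores, seen = [], set()
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for href in get_links(page):
            if href in seen:
                continue  # skip links that were already handled
            seen.add(href)
            if path_depth(href) >= 3:
                stores.append(href)
            else:
                queue.append(href)
    return stores


# Made-up site structure used only to exercise the traversal
fake_site = {
    "https://example.com/": ["https://example.com/al.html"],
    "https://example.com/al.html": [
        "https://example.com/al/alabaster/100-s-colonial-drive.html",
        "https://example.com/al/auburn.html",
    ],
    "https://example.com/al/auburn.html": [
        "https://example.com/al/auburn/1550-opelika-rd.html",
        "https://example.com/al/auburn/231-mell-st.html",
    ],
}
store_links = collect_store_links(lambda url: fake_site.get(url, []), "https://example.com/")
print(store_links)
```

Separating the traversal from the browser calls makes it easier to spot mistakes in which page is being requested at each level.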

In [77]:
storelinks = getStoreLinks()
storelinks

https://locations.panerabread.com/al/alabaster/100-s-colonial-drive.html
https://locations.panerabread.com/al/athens/1323-us-highway-72-e.html
https://locations.panerabread.com/al/auburn.html
https://locations.panerabread.com/al/birmingham.html
https://locations.panerabread.com/al/cullman/1118-cullman-shopping-center-nw.html
https://locations.panerabread.com/al/decatur/1241-point-mallard-pkwy-al-67.html
https://locations.panerabread.com/al/dothan/3515-ross-clark-circle.html
https://locations.panerabread.com/al/enterprise/847-boll-weevil-circle.html
https://locations.panerabread.com/al/florence/304-cox-creek-pkwy.html
https://locations.panerabread.com/al/gadsden/508-e-meighan-boulevard.html
https://locations.panerabread.com/al/gardendale/521-fieldstown-rd.html
https://locations.panerabread.com/al/hoover/1790-riverchase-dr.html
https://locations.panerabread.com/al/huntsville.html
https://locations.panerabread.com/al/madison/8179-highway-72-w.html
https://locations.panerabread.com/al/mobi

['https://locations.panerabread.com/wy/cheyenne/2440-dell-range-blvd.html',
 'https://locations.panerabread.com/al/alabaster/100-s-colonial-drive.html',
 'https://locations.panerabread.com/al/athens/1323-us-highway-72-e.html',
 'https://locations.panerabread.com/al/cullman/1118-cullman-shopping-center-nw.html',
 'https://locations.panerabread.com/al/decatur/1241-point-mallard-pkwy-al-67.html',
 'https://locations.panerabread.com/al/dothan/3515-ross-clark-circle.html',
 'https://locations.panerabread.com/al/enterprise/847-boll-weevil-circle.html',
 'https://locations.panerabread.com/al/florence/304-cox-creek-pkwy.html',
 'https://locations.panerabread.com/al/gadsden/508-e-meighan-boulevard.html',
 'https://locations.panerabread.com/al/gardendale/521-fieldstown-rd.html',
 'https://locations.panerabread.com/al/hoover/1790-riverchase-dr.html',
 'https://locations.panerabread.com/al/madison/8179-highway-72-w.html',
 'https://locations.panerabread.com/al/montgomery/7224-eastchase-parkway.htm

## Running scraper on multiple pages and saving results

In [32]:
##function to pull info from all individual merchant web pages:
def runpull(driver, link):
    ##setup df for results:
    column_names = ["brand_name","merchant_name","latitude", "longitude", "address1", "address2",
                  "city","state","postalCode", "country", "telephone","locationFeatures","locationHours"]
    results = pd.DataFrame(columns = column_names)
    isBad = False

    ##Run scrape:
    try:
#     if 2 == 2:
        driver.get(link)
        driver.implicitly_wait(10)
        time.sleep(random.randint(1,3)) #sleeps either 1 or 2 or 3 seconds

        brand_name = driver.find_element_by_xpath('//*[@id="location-name"]/span[1]').text.strip()
        merchant_name = driver.find_element_by_xpath('//*[@id="location-name"]/span[2]').text.strip()
        latitude = driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[2]/div/div/div[2]/div[1]/span[1]/meta[1]').get_attribute("content").strip()
        longitude = driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[2]/div/div/div[2]/div[1]/span[1]/meta[2]').get_attribute("content").strip()
        address1 = driver.find_element_by_xpath('//*[@id="address"]/span[1]/span[1]').text.strip()
        try:
            address2 = driver.find_element_by_xpath('//*[@id="address"]/span[1]/span[2]').text.strip()
        except Exception:  # address2 is optional, so a failed lookup just leaves it blank
            address2 = ''
        city = driver.find_element_by_xpath('//*[@id="address"]/span[2]/span[1]').text.strip()
        state = driver.find_element_by_xpath('//*[@id="address"]/abbr[1]').text.strip()
        postalCode = driver.find_element_by_xpath('//*[@id="address"]/span[3]').text.strip()
        country = driver.find_element_by_xpath('//*[@id="address"]/abbr[2]').get_attribute("title")
        telephone = driver.find_element_by_xpath('//*[@id="telephone"]').text.strip().replace(')','').replace('(','').replace(' ','').replace('-','')
        locationFeatures = driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[2]/div/div/div[2]/div[3]/div[1]/ul').text.strip()
        locationFeatures = locationFeatures.split('\n')
        locationHours = driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div').get_attribute("data-days")

        print(brand_name)
        print(merchant_name)
        print(latitude)
        print(longitude)
        print(address1)
        print(address2)
        print(city)
        print(state)
        print(postalCode)
        print(telephone)
        print(country)
        print(locationFeatures)
        print(locationHours)

        new_row = {
                  'brand_name': brand_name, 
                  'merchant_name': merchant_name, 
                  'latitude': latitude,
                  'longitude': longitude,
                  'address1': address1, 
                  'address2': address2,
                  'city': city, 
                  'state': state, 
                  'postalCode': postalCode, 
                  'country': country,
                  'telephone': telephone,
                  'locationFeatures': locationFeatures, 
                  'locationHours': locationHours
                }

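        ##DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions:
        results = pd.concat([results, pd.DataFrame([new_row])], ignore_index=True)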

    except Exception:  # any failure on the page marks the link as bad
        isBad = True
  
    return results, isBad
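Inside `runpull`, `address2` gets its own `try`/`except` because not every store has a second address line. If more fields turn out to be optional, one option is a small helper that takes the lookup as a function and falls back to a default. This is only a sketch: `text_or_default` is a hypothetical name, and in the notebook the lookup would be something like `lambda: driver.find_element_by_xpath('...').text`.

```python
def text_or_default(lookup, default=""):
    """Call lookup() and return its stripped result, or default if it raises."""
    try:
        return lookup().strip()
    except Exception:
        return default


def missing_element():
    # Stand-in for a lookup that fails because the element is absent
    raise LookupError("element not found")


print(repr(text_or_default(lambda: "  Suite 200 ")))
print(repr(text_or_default(missing_element)))
```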

In [31]:
##Run web scraping and save results:

##make output table:
column_names = ["brand_name","merchant_name","latitude", "longitude", "address1", "address2",
                "city","state","postalCode", "country", "telephone","locationFeatures","locationHours"]
output = pd.DataFrame(columns = column_names)

##set up list to save bad links:
bad_links = []

##sample test:
storeLinksTest = ['https://locations.panerabread.com/al/alabaster/100-s-colonial-drive.html',
                  'https://locations.panerabread.com/al/hoover/1790-riverchase-dr.html']

##set link list to use for run:
storeLinksRun = storeLinksTest
# storeLinksRun = storelinks  ##full list returned by getStoreLinks()


##run function to pull info from individual merchant web pages:
driver = Chrome(r"C:\Users\lvincent\Documents\chromedriver87.exe")
for link in storeLinksRun:
    print(link)
    df, isBad = runpull(driver, link)
    output = pd.concat([output, df], ignore_index=True)
    if isBad:
        print('error on: ', link)
        bad_links.append(link)
driver.quit()

##set file names for output to save to (for csv and excel file):
filecsv = 'panera_results_selenium.csv'
filexlsx = 'panera_results_selenium.xlsx'

##save output (to both csv and excel file):
output.to_csv(filecsv)
output.to_excel(filexlsx) 

output

https://locations.panerabread.com/al/alabaster/100-s-colonial-drive.html
Panera Bread
Alabaster -S Colonial Drive
33.2276333
-86.8046543
100 S Colonial Drive
Suite 200
Alabaster
AL
35007
2056644525
United States
['Delivery Available', 'Dine In', 'Curbside', 'Kiosk', 'Rapid Pick-up']
[{"day":"MONDAY","intervals":[{"end":2100,"start":600}]},{"day":"TUESDAY","intervals":[{"end":2100,"start":600}]},{"day":"WEDNESDAY","intervals":[{"end":2100,"start":600}]},{"day":"THURSDAY","intervals":[{"end":2100,"start":600}]},{"day":"FRIDAY","intervals":[{"end":2100,"start":600}]},{"day":"SATURDAY","intervals":[{"end":2100,"start":600}]},{"day":"SUNDAY","intervals":[{"end":2100,"start":600}]}]
https://locations.panerabread.com/al/hoover/1790-riverchase-dr.html
Panera Bread
Hoover - Riverchase Dr
33.3736708
-86.8099188
1790 Riverchase Dr
Suite 104
Hoover
AL
35244
2054020023
United States
['Delivery Available', 'Dine In', 'Drive Thru', 'Curbside', 'Kiosk', 'Rapid Pick-up']
[{"day":"MONDAY","intervals":[{

| | brand_name | merchant_name | latitude | longitude | address1 | address2 | city | state | postalCode | country | telephone | locationFeatures | locationHours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Panera Bread | Alabaster -S Colonial Drive | 33.2276333 | -86.8046543 | 100 S Colonial Drive | Suite 200 | Alabaster | AL | 35007 | United States | 2056644525 | [Delivery Available, Dine In, Curbside, Kiosk,... | [{"day":"MONDAY","intervals":[{"end":2100,"sta... |
| 1 | Panera Bread | Hoover - Riverchase Dr | 33.3736708 | -86.8099188 | 1790 Riverchase Dr | Suite 104 | Hoover | AL | 35244 | United States | 2054020023 | [Delivery Available, Dine In, Drive Thru, Curb... | [{"day":"MONDAY","intervals":[{"end":2100,"sta... |
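When a column holds Python lists, as `locationFeatures` does, `to_csv` writes each list as its string representation, which is awkward to turn back into a list later. One possible approach is to JSON-encode such columns before saving. The sketch below uses the standard `csv` module and an in-memory buffer so it runs on its own. The reduced field list, the `write_rows` helper, and the sample row are assumptions made for illustration, not the notebook's full output.

```python
import csv
import io
import json

# Reduced set of columns, just for this illustration
FIELDS = ["merchant_name", "city", "locationFeatures", "locationHours"]


def write_rows(rows, handle):
    """Write rows as CSV, storing the features list as a JSON string."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        flat = dict(row)
        flat["locationFeatures"] = json.dumps(row["locationFeatures"])
        writer.writerow(flat)


rows = [{
    "merchant_name": "Alabaster -S Colonial Drive",
    "city": "Alabaster",
    "locationFeatures": ["Delivery Available", "Dine In"],
    "locationHours": '[{"day":"MONDAY","intervals":[{"end":2100,"start":600}]}]',
}]

buffer = io.StringIO(newline="")
write_rows(rows, buffer)

# Read the data back and decode the JSON column into a list again
buffer.seek(0)
restored = list(csv.DictReader(buffer))
features = json.loads(restored[0]["locationFeatures"])
print(features)
```

Since `locationHours` is already a JSON string, it round-trips through the file unchanged and can be decoded with `json.loads` in the same way.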
