# Web Scraping with Python Using BeautifulSoup  
### Demo of scraping location details from the Panera Bread website

**Resources:**  
* Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (a thorough resource that walks through an example demonstrating the different features)  
* Overview and latest release info: https://www.crummy.com/software/BeautifulSoup/  
  
**Useful Code Snippets:**  
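To fetch a page and make the soup:
```
import requests
from bs4 import BeautifulSoup

link = ''  # URL of the page to scrape
response = requests.get(link)
print(response)  # <Response [200]> means the request succeeded
soup = BeautifulSoup(response.text, 'lxml')
```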
To slow down the scraping (add after each `requests.get(link)`; needs `import time` and `import random`):
```
time.sleep(random.randint(1, 3))  # pauses the run for 1, 2, or 3 seconds
```
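As a variation on the pause above, here is a minimal sketch that draws a fractional delay instead of a whole number of seconds. The helper name `polite_pause` and its default bounds are assumptions made for illustration, not part of the original notebook.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Tiny bounds here so the example finishes quickly; use seconds in a real run.
waited = polite_pause(0.0, 0.01)
print(f"waited {waited:.4f} seconds")
```

Returning the delay makes it easy to log how long each pause actually took.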
Common ways to find elements:
```
soup.find('div')                 # returns the first div tag with everything inside it
soup.find_all('div')             # returns all div tags (with their contents) as a list;
                                 # access individual results by index
soup.find_all('div')[0]          # returns the first div tag, same as soup.find('div')
len(soup.find_all('a'))          # returns the number of <a> tags
soup.find('div', 'featured')     # returns the first div tag whose class attribute includes 'featured'
soup.find(itemprop='telephone')  # use this format to locate by itemprop attribute
soup.find(id='telephone')        # use this format to locate by id attribute
soup.find('a').get('href')       # use .get() with an attribute name to read that attribute's value
```
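The lookups above can be tried without any network access. The sketch below runs them on a small made-up HTML string (not taken from the Panera Bread site) and uses Python's built-in `html.parser` so it works even where `lxml` is not installed. It illustrates that a second positional string is matched against the CSS class only, while other attributes need a keyword argument or an `attrs` dict.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
html = """
<div class="featured promo" id="top">First</div>
<div data-kind="featured">Second</div>
<a href="/stores.html">Stores</a>
"""

# 'html.parser' ships with Python; 'lxml' can be used instead if installed.
soup = BeautifulSoup(html, 'html.parser')

# A second positional string is matched against the class attribute.
by_class = soup.find('div', 'featured')
# class_ is the keyword form of the same lookup.
by_class_kw = soup.find('div', class_='featured')
# Attributes that are not valid Python names go in an attrs dict.
by_attr = soup.find('div', attrs={'data-kind': 'featured'})

print(by_class.get_text())
print(by_attr.get_text())
print(soup.find('a').get('href'))
```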

**Other Useful Resources:**  
* Working with JSON data in Python: https://realpython.com/python-json/#a-very-brief-history-of-json  

In [78]:
##import libraries:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json 
import re
import time
import math
import random

## Initial testing to find data on webpage

In [79]:
##test to see if link works and if so then parse with BeautifulSoup
link = 'https://locations.panerabread.com/al/alabaster/100-s-colonial-drive.html'
response = requests.get(link)
print(response)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'lxml')
    print(soup.prettify())

<Response [200]>
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <link href="//www.yext-pixel.com" rel="dns-prefetch"/>
  <link href="//a.cdnmktg.com" rel="dns-prefetch"/>
  <link href="//a.mktgcdn.com" rel="dns-prefetch"/>
  <meta content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no" name="viewport"/>
  <meta content="telephone=no" name="format-detection"/>
  <link href="../../favicon.ico" rel="shortcut icon"/>
  <meta content="Visit your local Panera Bread at 100 S Colonial Drive in Alabaster, AL to find soup, salad, bakery, pastries, coffee near you. Dine-in, pickup, and delivery." name="description"/>
  <meta content="" name="keywords"/>
  <meta content="Panera Bread at 100 S Colonial Drive Alabaster, AL | bread, soup, salad, coffee, dessert" property="og:title"/>
  <meta content="Visit your local Panera Bread at 100 S Colonial Drive 

In [82]:
##find section with all the main info content
# soup.find_all('div','nap-info')
len(soup.find_all('div','nap-info'))

1

In [83]:
##assign section with all the main info content to variable "info"
info = soup.find_all('div','nap-info')[0]
print(info.prettify())

<div class="nap-info">
 <div class="c-location-title-wrapper">
  <h1 aria-level="1" class="c-location-title" id="location-name" itemprop="name">
   <span class="location-name-brand">
    Panera Bread
   </span>
   <span class="location-name-geo">
    Alabaster -S Colonial Drive
   </span>
  </h1>
 </div>
 <div class="nap-info-content">
  <div class="nap-info-left">
   <span class="coordinates" itemprop="geo" itemscope="" itemtype="http://schema.org/GeoCoordinates">
    <meta content="33.2276333" itemprop="latitude"/>
    <meta content="-86.8046543" itemprop="longitude"/>
   </span>
   <address class="c-address" data-country="US" id="address" itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
    <span class="c-address-street" itemprop="streetAddress">
     <span class="c-address-street-1">
      100 S Colonial Drive
     </span>
     <span class="c-address-street-2">
      Suite 200
     </span>
    </span>
    <span class="c-address-city">
     <span itemprop=

In [85]:
##find brand name tag within info:
info.find('span', 'location-name-brand')

<span class="location-name-brand">Panera Bread</span>

In [86]:
info.find('span', 'location-name-brand').contents

['Panera Bread']

In [87]:
info.find('span', 'location-name-brand').contents[0]

'Panera Bread'

In [88]:
##assign the brand name to variable brand_name
brand_name = info.find('span', 'location-name-brand').contents[0].strip()
print(brand_name)

Panera Bread
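An alternative to `.contents[0].strip()` is `get_text(strip=True)`, which BeautifulSoup provides on every tag. The sketch below compares the two on a fragment modeled on the prettified output above; it uses the built-in `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Fragment modeled on the brand-name span shown in the prettified output.
html = '''
<span class="location-name-brand">
  Panera Bread
</span>
'''
tag = BeautifulSoup(html, 'html.parser').find('span', 'location-name-brand')

via_contents = tag.contents[0].strip()   # first child node, whitespace stripped
via_get_text = tag.get_text(strip=True)  # all text inside the tag, stripped

print(via_contents)
print(via_get_text)
```

The two agree here because the span holds a single string. When the first child is another tag rather than a string, `.contents[0].strip()` fails, whereas `get_text` still returns the combined text.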


In [89]:
address1 = info.find('span', 'c-address-street-1').contents[0].strip()
print(address1)

100 S Colonial Drive


In [37]:
city = info.find(itemprop='addressLocality').contents[0].strip()
print(city)

Alabaster


In [43]:
info = soup.find('div', 'nap-info')
brand_name = info.find('span', 'location-name-brand').contents[0].strip()
merchant_name = info.find('span', 'location-name-geo').contents[0].strip()
address1 = info.find('span', 'c-address-street-1').contents[0].strip()
address2 = info.find('span', 'c-address-street-2').contents[0].strip()
city = info.find(itemprop='addressLocality').contents[0].strip()
state = info.find(itemprop='addressRegion').contents[0].strip()
postalCode = info.find(itemprop='postalCode').contents[0].strip()
country = info.find(itemprop='addressCountry').contents[0].strip()
telephone = info.find(itemprop='telephone').contents[0].strip().replace(')','').replace('(','').replace(' ','').replace('-','')
print(brand_name)
print(merchant_name)
print(address1)
print(address2)
print(city)
print(state)
print(postalCode)
print(telephone)

Panera Bread
Alabaster -S Colonial Drive
100 S Colonial Drive
Suite 200
Alabaster
AL
35007
2056644525
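The chain of `.replace()` calls used for the telephone number can also be written as a single regular expression that removes every non-digit character. A minimal sketch follows; the helper name `digits_only` is an assumption made for illustration.

```python
import re


def digits_only(phone):
    """Drop every non-digit character from a phone number string."""
    return re.sub(r'\D', '', phone)


print(digits_only('(205) 664-4525'))  # 2056644525
```

This also handles separators the chained version does not list, such as dots or a leading plus sign.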


In [39]:
#get the latitude and longitude coordinates from tag attribute:
info.find(itemprop="latitude")

<meta content="33.2276333" itemprop="latitude"/>

In [91]:
# info.find(itemprop="latitude")
info.find(itemprop="latitude").get('content')

'33.2276333'

In [41]:
latitude = info.find(itemprop="latitude").get('content')
longitude = info.find(itemprop="longitude").get('content')
print(latitude)
print(longitude)

33.2276333
-86.8046543


## Running scraper on multiple pages and saving results

In [68]:
##function to get all individual merchant web links:
def getStoreLinks(response):
    statelinks = []
    citylinks = []
    storelinks = []

    soup = BeautifulSoup(response.text, 'lxml')

    ##run all states:
#     li = soup.find_all('li', 'c-directory-list-content-item')

    ##test run for just Louisiana (13 stores)
    li = [soup.find_all('li', 'c-directory-list-content-item')[15]]
    
    
    for x in li:
        stateitem = x.find('a', 'c-directory-list-content-item-link')
        link = 'https://locations.panerabread.com/' + stateitem.get('href')
        # print(link)
        stateNum = x.find('span').contents
        if '(1)' in stateNum:
            storelinks.append(link)
        else:
            statelinks.append(link)

    for x in statelinks:
        time.sleep(random.randint(0,2)) #sleeps either 0 or 1 or 2 seconds
        response2 = requests.get(x)
        if response2.status_code == 200:
            soup2 = BeautifulSoup(response2.text, 'lxml')

            li2 = soup2.find_all('li', 'c-directory-list-content-item')

            for item in li2:
                cityitem = item.find('a', 'c-directory-list-content-item-link')
                link = 'https://locations.panerabread.com/' + cityitem.get('href')
                # print(link)
                cityNum = item.find('span').contents
                if '(1)' in cityNum:
                    storelinks.append(link)
                else:
                    citylinks.append(link)

    for x in citylinks:
        time.sleep(random.randint(0,2)) #sleeps either 0 or 1 or 2 seconds
        response3 = requests.get(x)
        if response3.status_code == 200:
            soup3 = BeautifulSoup(response3.text, 'lxml')
            li3 = soup3.find_all('a', 'c-location-grid-item-link')

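            ##keep this loop inside the status check so a failed request
            ##does not reuse (or lack) li3 from a previous iteration
            for item in li3:
                if 'View Location Details' in item.contents:
                    link = 'https://locations.panerabread.com/' + item.get('href')[2:]
                    # print(link)
                    storelinks.append(link)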

    return storelinks
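Building links by string concatenation depends on the exact shape of each `href`; the doubled slash (`//la/baton-rouge/...`) in some of the links printed further down is one visible effect. A hedged alternative is `urllib.parse.urljoin` from the standard library, which resolves a relative `href` against the page it was found on. The city page URL and the `href` values below are assumptions about the site's markup, chosen only to illustrate the call.

```python
from urllib.parse import urljoin

# Hypothetical city page URL and relative href, shaped like the links in this demo.
city_page = 'https://locations.panerabread.com/la/baton-rouge.html'
relative_href = '../la/baton-rouge/5000-hennessey-dr.html'

# urljoin resolves the '..' segment against the page the link came from.
store_link = urljoin(city_page, relative_href)
print(store_link)

# A plain relative href on the root directory page resolves the same way.
root_page = 'https://locations.panerabread.com/'
state_link = urljoin(root_page, 'la.html')
print(state_link)
```

With this approach the `[2:]` slice is no longer needed, since the leading `..` is handled by the resolver.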

In [48]:
##function to pull info from all individual merchant web pages:
def runpull(response4):
    ##setup df for results
    column_names = ["brand_name","merchant_name","latitude", "longitude", "address1", "address2",
                  "city","state","postalCode", "country", "telephone","locationFeatures","locationHours"]

    results = pd.DataFrame(columns = column_names)

    ##Process the response object 
    soup4 = BeautifulSoup(response4.text, 'lxml')
  
    info = soup4.find('div', 'nap-info')
    brand_name = info.find('span', 'location-name-brand').contents[0].strip()
    merchant_name = info.find('span', 'location-name-geo').contents[0].strip()
    latitude = info.find(itemprop="latitude").get('content')
    longitude = info.find(itemprop="longitude").get('content')
    address1 = info.find('span', 'c-address-street-1').contents[0].strip()
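    ##address2 is optional: find() returns None when the span is missing,
    ##and .contents is empty when the span has no text
    try:
        address2 = info.find('span', 'c-address-street-2').contents[0].strip()
    except (AttributeError, IndexError):
        address2 = ''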
    city = info.find(itemprop='addressLocality').contents[0].strip()
    state = info.find(itemprop='addressRegion').contents[0].strip()
    postalCode = info.find(itemprop='postalCode').contents[0].strip()
    country = info.find(itemprop='addressCountry').contents[0].strip()
    telephone = info.find(itemprop='telephone').contents[0].strip()
    locationFeatures = info.find('ul', 'nap-info-features-list').contents
    locationHours = info.find('div', 'c-location-hours-details-wrapper js-location-hours').get('data-days')

    lf = []
    for feature in locationFeatures: 
        feature = str(feature)
        feature = feature.replace('<li>','').replace('</li>','')
        lf.append(feature)
    locationFeatures = lf

    print(brand_name)
    print(merchant_name)
    print(latitude)
    print(longitude)
    print(address1)
    print(address2)
    print(city)
    print(state)
    print(postalCode)
    print(telephone)
    print(country)
    print(locationFeatures)
    print(locationHours)

    new_row = {
              'brand_name': brand_name, 
              'merchant_name': merchant_name, 
              'latitude': latitude,
              'longitude': longitude,
              'address1': address1, 
              'address2': address2,
              'city': city, 
              'state': state, 
              'postalCode': postalCode, 
              'country': country,
              'telephone': telephone,
              'locationFeatures': locationFeatures, 
              'locationHours': locationHours
            }
  
    ##DataFrame.append was removed in pandas 2.0; build the one-row frame directly
    results = pd.DataFrame([new_row], columns=column_names)
  
    return results
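Any of the fields pulled in `runpull` can be absent on a given page, not only `address2`. One way to handle that uniformly is a small helper that returns a default when `find` comes back empty. The sketch below is an assumption about how this could be done, not the notebook's own method; the name `text_or_default` and the HTML fragment (with a made-up address) are hypothetical, and `html.parser` is used so the example runs without `lxml`.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, *args, default='', **kwargs):
    """Return the stripped text of the first match under parent, or default."""
    tag = parent.find(*args, **kwargs)
    return tag.get_text(strip=True) if tag is not None else default


# Made-up fragment that reuses the class names seen on the store pages.
html = '''
<div class="nap-info">
  <span class="c-address-street-1"> 123 Example St </span>
  <span itemprop="addressLocality">Sampletown</span>
</div>
'''
info = BeautifulSoup(html, 'html.parser').find('div', 'nap-info')

address1 = text_or_default(info, 'span', 'c-address-street-1')
address2 = text_or_default(info, 'span', 'c-address-street-2')  # missing tag
city = text_or_default(info, itemprop='addressLocality')

print(repr(address1), repr(address2), repr(city))
```

The positional and keyword arguments are passed straight through to `find`, so the helper accepts the same lookups used elsewhere in this notebook.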

In [71]:
#Run function to get list of individual merchant links:
response = requests.get('https://locations.panerabread.com/')
if response.status_code == 200:
    storeLinks = getStoreLinks(response)
    print(storeLinks)
    print("Number of individual merchant links: ", len(storeLinks))
    storeLinksRun = storeLinks
    ##save links list to json file (optional)
    with open("panera_storelinks.json", "w") as write_file:
        json.dump(storeLinks, write_file)
else:
    print('error')

['https://locations.panerabread.com/la/bossier-city/2650-airline-drive.html', 'https://locations.panerabread.com/la/covington/70411-highway-21.html', 'https://locations.panerabread.com/la/harvey/2424-manhattan-blvd.html', 'https://locations.panerabread.com/la/lafayette/2622-johnston-st.html', 'https://locations.panerabread.com/la/lake-charles/3404-nelson-road.html', 'https://locations.panerabread.com/la/metairie/4848-veterans-boulevard.html', 'https://locations.panerabread.com/la/shreveport/1705-east-70th-street.html', 'https://locations.panerabread.com/la/slidell/70-town-center-parkway.html', 'https://locations.panerabread.com//la/baton-rouge/3304-patrick-f-taylor-hall.html', 'https://locations.panerabread.com//la/baton-rouge/5000-hennessey-dr.html', 'https://locations.panerabread.com//la/baton-rouge/7877-jefferson-highway.html', 'https://locations.panerabread.com//la/new-orleans/309-n--carrollton-avenue.html', 'https://locations.panerabread.com//la/new-orleans/31-mcalister-drive.html
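Saving the links to JSON means a later session can reload them instead of crawling the directory pages again. The sketch below shows the round trip with `json.dump` and `json.load`. It writes to a temporary directory (an assumption made so the example cannot overwrite a real file) and uses two of the links printed above.

```python
import json
import os
import tempfile

links = [
    'https://locations.panerabread.com/la/bossier-city/2650-airline-drive.html',
    'https://locations.panerabread.com/la/covington/70411-highway-21.html',
]

# A temporary directory keeps this sketch from touching a real results file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'panera_storelinks.json')
    with open(path, 'w') as write_file:
        json.dump(links, write_file)
    with open(path) as read_file:
        reloaded = json.load(read_file)

print(reloaded == links)
```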

In [56]:
##Run web scraping and save results:

##make output table:
column_names = ["brand_name","merchant_name","latitude", "longitude", "address1", "address2",
                "city","state","postalCode", "country", "telephone","locationFeatures","locationHours"]
output = pd.DataFrame(columns = column_names)

##set up list to save bad links:
bad_links = []

##sample test:
storeLinksTest = ['https://locations.panerabread.com/al/alabaster/100-s-colonial-drive.html',
                  'https://locations.panerabread.com/al/hoover/1790-riverchase-dr.html']

##set link list to use for run:
storeLinksRun = storeLinksTest
# storeLinksRun = storeLinks


##run function to pull info from individual merchant web pages:
for link in storeLinksRun:
    time.sleep(random.randint(0,2)) #sleeps either 0 or 1 or 2 seconds
    print(link)
    response4 = requests.get(link)
    if response4.status_code == 200:
        df = runpull(response4)
        ##DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        output = pd.concat([output, df], ignore_index=True)
    else:
        print('error on: ', link)
        bad_links.append(link)


##set file names for output to save to (for csv and excel file):
filecsv = 'panera_results.csv'
filexlsx = 'panera_results.xlsx'

##save output (to both csv and excel file):
output.to_csv(filecsv)
output.to_excel(filexlsx) 

output

https://locations.panerabread.com/al/alabaster/100-s-colonial-drive.html
Panera Bread
Alabaster -S Colonial Drive
33.2276333
-86.8046543
100 S Colonial Drive
Suite 200
Alabaster
AL
35007
(205) 664-4525
US
['Delivery Available', 'Dine In', 'Curbside', 'Kiosk', 'Rapid Pick-up']
[{"day":"MONDAY","intervals":[{"end":2100,"start":600}]},{"day":"TUESDAY","intervals":[{"end":2100,"start":600}]},{"day":"WEDNESDAY","intervals":[{"end":2100,"start":600}]},{"day":"THURSDAY","intervals":[{"end":2100,"start":600}]},{"day":"FRIDAY","intervals":[{"end":2100,"start":600}]},{"day":"SATURDAY","intervals":[{"end":2100,"start":600}]},{"day":"SUNDAY","intervals":[{"end":2100,"start":600}]}]
https://locations.panerabread.com/al/hoover/1790-riverchase-dr.html
Panera Bread
Hoover - Riverchase Dr
33.3736708
-86.8099188
1790 Riverchase Dr
Suite 104
Hoover
AL
35244
(205) 402-0023
US
['Delivery Available', 'Dine In', 'Drive Thru', 'Curbside', 'Kiosk', 'Rapid Pick-up']
[{"day":"MONDAY","intervals":[{"end":2100,"st

| | brand_name | merchant_name | latitude | longitude | address1 | address2 | city | state | postalCode | country | telephone | locationFeatures | locationHours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Panera Bread | Alabaster -S Colonial Drive | 33.2276333 | -86.8046543 | 100 S Colonial Drive | Suite 200 | Alabaster | AL | 35007 | US | (205) 664-4525 | [Delivery Available, Dine In, Curbside, Kiosk,... | [{"day":"MONDAY","intervals":[{"end":2100,"sta... |
| 1 | Panera Bread | Hoover - Riverchase Dr | 33.3736708 | -86.8099188 | 1790 Riverchase Dr | Suite 104 | Hoover | AL | 35244 | US | (205) 402-0023 | [Delivery Available, Dine In, Drive Thru, Curb... | [{"day":"MONDAY","intervals":[{"end":2100,"sta... |
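The `locationHours` column holds the raw `data-days` attribute, which is a JSON string. A minimal sketch of turning it into a per-day dictionary follows. The sample is a shortened copy of the value printed for the Alabaster store, the helper names are hypothetical, and reading `600` as 06:00 and `2100` as 21:00 is an assumption about how the site encodes times.

```python
import json

# Shortened copy of the data-days value printed above (two days only).
raw_hours = ('[{"day":"MONDAY","intervals":[{"end":2100,"start":600}]},'
             '{"day":"TUESDAY","intervals":[{"end":2100,"start":600}]}]')


def to_clock(value):
    """Format an integer such as 600 as '06:00' (assumes HHMM encoding)."""
    return f'{value // 100:02d}:{value % 100:02d}'


def parse_hours(raw):
    """Map each day name to a list of (open, close) time strings."""
    return {
        entry['day']: [(to_clock(i['start']), to_clock(i['end']))
                       for i in entry['intervals']]
        for entry in json.loads(raw)
    }


hours = parse_hours(raw_hours)
print(hours['MONDAY'])
```

Each day maps to a list because the source structure allows more than one interval per day.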
