### Web-Scraping Concepts:
* Much data is published on the web, designed for human consumption
    * https://www.softwarefindr.com/how-many-websites-are-there/
    * https://www.eetimes.com/author.asp?section_id=36&doc_id=1330462
* Two ways to access it:
    * APIs:  e.g. REST https://en.wikipedia.org/wiki/Representational_state_transfer
    * **Scraping**
* Web publishing process:
    * **Data** &rarr; (server side scripts) &rarr; HTML &rarr; (browser) &rarr; Visual page
    * http://i2.sitepoint.com/graphics/1733_first_principles.thumb.jpg
* Web scraping process:
    * URL &rarr; ([fetch](#1.-Fetch)) &rarr; HTML &rarr; ([parse](#2.-Parse)) &rarr; **Data** &rarr; ([crawl](#3.-Crawl)) &rarr; URL &rarr; ...
    * Need to algorithmically parse HTML structure
* Common patterns:
    * Page == Column :  e.g. index page, list
    * Page == Row : e.g. product page
    * Page == k Rows : e.g. paginated search result list
* Cautions:
    * Getting blocked by the server
    * Human activity or cyber attack?
    * Retrieved page might not look the same as in browser, due to JavaScript
* Ethical concerns?
    * Does the provider allow scraping?
    * Are there commercial restrictions?
    * Intellectual property?
    * Will rapid crawling harm the site (e.g. by acting like a denial-of-service attack)?
    * Privacy issues?  Personal identifying information?
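
The question of whether the provider allows scraping can be partly checked in code through the site's `robots.txt`. A minimal sketch using only the standard library; the rules and the `my-scraper` user-agent name below are made up for illustration and are not any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# For a live site, call parser.set_url(".../robots.txt") and then parser.read().
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())


def may_fetch(url, user_agent="my-scraper"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


allowed = may_fetch("https://example.com/grocery-delivery/")
blocked = may_fetch("https://example.com/checkout/cart")
delay = parser.crawl_delay("my-scraper")  # seconds between requests, or None
```

A `robots.txt` file only states the provider's crawling preferences; terms of service, copyright, and privacy questions still need to be checked separately.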

## Need to solve:
1. How to move through multiple result pages (pagination)
2. How to expand content hidden behind a "read more" button
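
For the first item, one common approach is to generate one URL per result page and fetch them in turn. A minimal sketch, assuming the site accepts a page number as a query parameter; the parameter name `page` is hypothetical and would need to be confirmed in the browser's address bar or network tab before use.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def page_urls(base_url, last_page, param="page"):
    """Yield base_url once per page with the (hypothetical) page parameter set."""
    parts = urlparse(base_url)
    query = dict(parse_qsl(parts.query))  # keep existing parameters such as keywords
    for number in range(1, last_page + 1):
        query[param] = str(number)
        yield urlunparse(parts._replace(query=urlencode(query)))


urls = list(page_urls("https://www.mercato.com/store-results?keywords=celery", 3))
```

Each generated URL can then be passed to the same fetch and parse steps used for a single page.
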
## FYI
### Fetch
* Browser user agent: https://www.whatismybrowser.com/detect/what-is-my-user-agent
* selenium 4.2.0: https://pypi.org/project/selenium/
* REST API: e.g. https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets , http://socialmedia-class.org/twittertutorial.html
* AJAX API: e.g. https://twitter.com/i/search/timeline?f=tweets&q=march%20madness

In [1]:
import requests

In [63]:
url = "https://www.mercato.com/grocery-delivery/ma/boston/fruits-and-vegetables-delivery?keywords=tomato"
header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36 Edg/98.0.1108.62'}
page = requests.get(url, headers=header)
page.reason

'200'

In [64]:
page.text



In [65]:
requests.utils.default_user_agent()


'python-requests/2.28.0'
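
Besides `reason`, a response carries the numeric status in `status_code`, and `raise_for_status()` turns 4xx/5xx responses into exceptions. A minimal sketch of a fetch helper built on those; `fetch_html`, its `get` parameter, the shortened user-agent string, and the 10-second timeout are illustrative choices, not part of the notebook.

```python
import requests

# Shortened browser-like user agent, for illustration only.
HEADERS = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"}


def fetch_html(url, get=requests.get, timeout=10):
    """Fetch url and return its HTML text, raising on HTTP error statuses."""
    response = get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.text
```

The `get` parameter exists only so the helper can be exercised with a stub instead of a live request.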

### Parse

##### HTML DOM structure:
* DOM = Document Object Model
* https://www.w3.org/TR/WD-DOM/introduction.html
* https://html.spec.whatwg.org (section 4: The elements of HTML)
* Browser's Developer Tools, Inspect Element

##### Two strategies to parse HTML DOM:
1. Search for unique tag attributes (class, id, …)
2. Navigate DOM hierarchy structure, top down
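
Both strategies can be tried on a small page. A minimal sketch on a made-up HTML snippet (the tags, id, and class names are illustrative only), using Python's built-in `html.parser` so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

demo_html = """
<html><body>
  <div id="results">
    <ul>
      <li class="item">Roma Tomatoes</li>
      <li class="item">Plum Tomatoes</li>
    </ul>
  </div>
</body></html>
"""
demo = BeautifulSoup(demo_html, "html.parser")

# Strategy 1: search for an identifying attribute anywhere in the document.
by_search = [li.get_text(strip=True) for li in demo.find_all("li", {"class": "item"})]

# Strategy 2: walk the hierarchy top down, one level at a time.
ul = demo.html.body.div.ul
by_navigation = [li.get_text(strip=True) for li in ul.find_all("li", recursive=False)]
```

Searching tends to survive layout changes better, while navigation helps when the target has no distinguishing attribute of its own.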

In [66]:
import bs4
from bs4 import BeautifulSoup

In [67]:
soup = bs4.BeautifulSoup(page.text, 'html5lib')
#soup = BeautifulSoup(page.text, "lxml")
soup

<!DOCTYPE html>
<html lang="en"><head>
<link crossorigin="" href="https://ssl.google-analytics.com" rel="preconnect"/>
<title>Fruits &amp; Veggies Delivery in Boston, MA - Mercato</title>
<meta content="Get fresh groceries delivered in 1-Hr.  Order fresh quality food from your favorite butcher, seafood market, baker, fruit &amp; vegetable market and grocer online and get delivery to your door." name="DESCRIPTION"/>
<meta content="INDEX, FOLLOW" name="ROBOTS"/>
<meta content="Fruits &amp; Veggies Delivery in Boston, MA - Mercato" property="og:title"/>
<meta content="https://images.mercato.com/Mercato_OG_image_1200x630V2.jpg" property="og:image"/>
<meta content="Get fresh groceries delivered in 1-Hr.  Order fresh quality food from your favorite butcher, seafood market, baker, fruit &amp; vegetable market and grocer online and get delivery to your door." property="og:description"/>
<meta content="website" property="og:type"/>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8

In [26]:
#! pip3 install html5lib
#! pip install --upgrade pip
#! python -m pip --version
#! pip install --user --upgrade pip


##### 2.1 Search

Use the browser's "Inspect Element" feature to examine the objects you want to extract.
Look for unique identifiers, then search for those unique identifiers in the soup:

`tag = soup.find('tag', {dictionary of identifiers})`

`tags = soup.find_all('tag', {dictionary of identifiers})`
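
The two calls differ in what they return when nothing matches, which matters for error handling. A minimal sketch on a made-up snippet (the `price` and `absent` class names are illustrative):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<p class="price">$3.99 each</p><p class="price">$4.99 each</p>',
    "html.parser",
)

first = demo.find("p", {"class": "price"})             # first matching tag
every = demo.find_all("p", {"class": "price"})         # list of all matching tags
missing_one = demo.find("p", {"class": "absent"})      # None when nothing matches
missing_all = demo.find_all("p", {"class": "absent"})  # empty list when nothing matches

first_text = first.get_text(strip=True)
all_texts = [tag.get_text(strip=True) for tag in every]
```

Because `find` can return `None`, its result should be checked before calling `.text` or `.get_text()` on it.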

In [109]:
# <a href="https://www.mercato.com/item/tomato-on-the-vine-backyard/1131735?featuredStoreId=1739" class="js--product-tile-name-link product-tile-rebrand__prod-name js--product-name" data-store-product-id="6196556">&nbsp;Tomato On The Vine Backyard&nbsp;</a>

#<a href="https://www.mercato.com/item/tomatillo/72864?featuredStoreId=2153" class="js--product-tile-name-link product-tile-rebrand__prod-name js--product-name" data-store-product-id="7792188">Tomatillo</a>
name = soup.find_all('a', {'class': "js--product-tile-name-link product-tile-rebrand__prod-name js--product-name"})

# <h1 class="modal-heading">On The Vine Red Tomato</h1>
#tag = soup.select("h1[class='modal-heading']")
#tag = soup.select(".modal-heading")
name

[<a class="js--product-tile-name-link product-tile-rebrand__prod-name js--product-name" data-store-product-id="6196556" href="https://www.mercato.com/item/tomato-on-the-vine-backyard/1131735?featuredStoreId=1739">
  Tomato On The Vine Backyard  
 </a>,
 <a class="js--product-tile-name-link product-tile-rebrand__prod-name js--product-name" data-store-product-id="6196556" href="https://www.mercato.com/item/tomato-on-the-vine-backyard/1131735?featuredStoreId=1739">
  Tomato On The Vine Backyard </a>,
 <a class="js--product-tile-name-link product-tile-rebrand__prod-name js--product-name" data-store-product-id="6196246" href="https://www.mercato.com/item/organic-heirloom-tomatoes/163064?featuredStoreId=1739">
 Organic Heirloom Tomatoes 
 </a>,
 <a class="js--product-tile-name-link product-tile-rebrand__prod-name js--product-name" data-store-product-id="6196246" href="https://www.mercato.com/item/organic-heirloom-tomatoes/163064?featuredStoreId=1739">
 Organic Heirloom Tomatoes</a>,
 <a clas

In [110]:
name = [t.text for t in name]

In [111]:
# Remove newlines and non-breaking spaces from each name.
names = [n.replace('\n', '').replace('\xa0', '') for n in name]

In [112]:
# Each product name appears twice in a row, so keep only the first of each pair.
del names[1::2]
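
Deleting every second entry relies on each name appearing exactly twice in a row. A slightly more defensive sketch, assuming only that duplicates are identical strings and adjacent, collapses each run of equal names with `itertools.groupby`; the sample list is made up to show where the two approaches differ.

```python
from itertools import groupby

raw_names = [
    "Tomato On The Vine Backyard",
    "Tomato On The Vine Backyard",
    "Roma Tomatoes",   # appears only once
    "Plum Tomatoes",   # appears only once
    "Vine Ripe Tomatoes",
    "Vine Ripe Tomatoes",
]

# groupby yields one key per run of consecutive equal items.
deduped = [name for name, _run in groupby(raw_names)]

# Keeping every other entry (the effect of `del names[1::2]`) loses "Plum Tomatoes".
every_other = raw_names[::2]
```

Two distinct adjacent products that share a name would still be merged by this approach; keying on the `data-store-product-id` attribute of each link instead would avoid that.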

In [144]:
names

['Tomato On The Vine Backyard',
 'Organic Heirloom Tomatoes',
 'Roma Tomatoes',
 'Kumato Tomatoes - 16 Ounces',
 'Earthbound Farms Organic Cherry Tomatoes - 1 Pint',
 'Tomato Organic On The Vine ',
 'Backyard Farms Cocktail Tomatoes - 10 Ounces',
 'Cherry Tomatoes on the Vine',
 'Mucci Farms Sweet Cocktail Tomatoes - 1 Pound',
 'Mammamia medley tomatoes',
 'Mutti Tomatoes, Cherry - 14 Ounces',
 'Sundried Tomatoes ',
 'Mutti Peeled Tomatoes - 14 Ounces',
 'Beefsteak Tomatoes',
 'Campari Tomatoes 16oz',
 'Grape Tomatoes - 1 Pint',
 'Beefsteak Tomato (Hot House)',
 'Sun Dried Tomatoes',
 'Cherry Tomatoes',
 'Heirloom Tomatoes',
 'Vine Ripe Tomatoes',
 'Naturesweet Cherubs Salad Tomatoes, Heavenly - 10 Ounce...',
 'NatureSweet Glorys Tomatoes - 10 Ounces',
 'Plum Tomatoes',
 'Cluster Tomatoes',
 'Field & Farm Tomatoes, Grape, Sweet - 12 Ounces',
 'Vine Ripe Tomatoes',
 'MamaMia Tomatoes, Medley - 1 Each',
 'Produce Tomatoes',
 'Greenhouse Tomatoes',
 'Heirloom Tomatoes',
 'Signature Brand 

In [126]:
# <div class="product-tile-rebrand__price-block"><div>$4.99 each</div></div>
price = soup.find_all('div', {'class': "product-tile-rebrand__price-block"})
price

[<div class="product-tile-rebrand__price-block">
 <div>
 $1.99 per lb<sup class="regular">*</sup>
 </div>
 </div>,
 <div class="product-tile-rebrand__price-block">
 <div>
 $6.75 per lb<sup class="regular">*</sup>
 </div>
 </div>,
 <div class="product-tile-rebrand__price-block">
 <div>
 $2.49 per lb<sup class="regular">*</sup>
 </div>
 </div>,
 <div class="product-tile-rebrand__price-block">
 <div>
 $3.99 each</div>
 </div>,
 <div class="product-tile-rebrand__price-block">
 <div>
 $4.99 each</div>
 </div>,
 <div class="product-tile-rebrand__price-block">
 <div>
 $3.50 per lb<sup class="regular">*</sup>
 </div>
 </div>,
 <div class="product-tile-rebrand__price-block">
 <div>
 $3.95 each</div>
 </div>,
 <div class="product-tile-rebrand__price-block">
 <div>
 $5.50 per lb<sup class="regular">*</sup>
 </div>
 </div>,
 <div class="product-tile-rebrand__price-block">
 <div>
 $3.49 each</div>
 </div>,
 <div class="product-tile-rebrand__price-block">
 <div>
 $3.50 each</div>
 </div>,
 <div clas

In [127]:
price = [t.text for t in price]

In [128]:
# Remove newlines from each price string.
prices = [n.replace('\n', '') for n in price]

In [146]:
prices

['$1.99 per lb*',
 '$6.75 per lb*',
 '$2.49 per lb*',
 '$3.99 each',
 '$4.99 each',
 '$3.50 per lb*',
 '$3.95 each',
 '$5.50 per lb*',
 '$3.49 each',
 '$3.50 each',
 '$2.90 each',
 '$8.95 per lb*',
 '$5.90 each',
 '$2.99 per lb*',
 '$4.99 each',
 '$4.39 each',
 '$3.29 per lb*',
 '$3.99 each',
 '$4.39 each',
 '$6.59 per lb*',
 '$5.49 per lb*',
 '$4.99 each',
 '$4.99 each',
 '$2.99 per lb*',
 '$3.49 per lb*',
 '$4.99 each',
 '$3.49 per lb*',
 '$4.99 each',
 '$3.49 per lb*',
 '$3.49 per lb*',
 '$5.99 per lb*',
 '$4.99 each',
 '$3.49 per lb*',
 '$4.99 each',
 '$4.99 each',
 '$3.99 per lb*',
 '$5.99 per lb*',
 '$2.49 per lb*',
 '$3.99 each',
 '$4.99 each',
 '$3.99 each',
 '$3.49 per lb*',
 '$4.49 each',
 '$3.99 per lb*',
 '$4.99 each',
 '$3.49 per lb*',
 '$4.49 each',
 '$1.49 per lb*',
 '$14.99 each',
 '$4.99 each',
 '$2.99 per lb*',
 '$3.99 per lb*',
 '$3.99 each',
 '$1.99 per lb*',
 '$1.99 per lb*',
 '$3.59 each',
 '$2.69 per lb*',
 '$2.19 per lb*',
 '$2.99 each',
 '$2.99 per lb*',
 '$2.2

In [141]:
len(prices)

101
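
The price strings mix an amount with a unit, so they need splitting before any numeric comparison. A minimal sketch, assuming every string follows the `$<amount> <unit>` shape seen in the `prices` output; `parse_price` is an illustrative helper, not part of the notebook.

```python
import re

PRICE_RE = re.compile(r"\$(\d+(?:\.\d+)?)\s*(.*)")


def parse_price(text):
    """Split a string such as '$1.99 per lb*' into (1.99, 'per lb')."""
    match = PRICE_RE.search(text)
    if match is None:
        return None  # unexpected format: leave it to the caller to inspect
    amount, unit = match.groups()
    return float(amount), unit.rstrip("*").strip()


parsed = [parse_price(p) for p in ["$1.99 per lb*", "$3.99 each", "$14.99 each"]]
```

Keeping the unit matters here because "per lb" and "each" prices are not directly comparable.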

In [131]:
# <a class="js--mParticle-store_clicked" data-store-id="1739" data-store-name="Eataly Boston" href="/shop/eataly-boston">Eataly Boston</a>
store = soup.find_all('a', {'class': "js--mParticle-store_clicked"})
store

[<a class="js--mParticle-store_clicked" data-store-id="1739" data-store-name="Eataly Boston" href="/shop/eataly-boston">
 Eataly Boston</a>,
 <a class="mercato-button mercato-button--green-pine store-profile-rebrand__shop-button js--mParticle-store_clicked" data-store-id="1739" data-store-name="Eataly Boston" href="/shop/eataly-boston">
 Shop all 3,290 items!
 </a>,
 <a class="js--mParticle-store_clicked" data-store-id="454" data-store-name="Savenor's Butcher &amp; Market (Boston) " href="/shop/savenors-butcher-market-boston">
 Savenor's Butcher &amp; Market (Boston) </a>,
 <a class="mercato-button mercato-button--green-pine store-profile-rebrand__shop-button js--mParticle-store_clicked" data-store-id="454" data-store-name="Savenor's Butcher &amp; Market (Boston) " href="/shop/savenors-butcher-market-boston">
 Shop all 828 items!
 </a>,
 <a class="js--mParticle-store_clicked" data-store-id="1127" data-store-name="Foodie's Market (South Boston)" href="/shop/foodies-market-south-boston">

In [133]:
store = [t.text for t in store]

In [135]:
# Remove newlines from each store name.
stores = [n.replace('\n', '') for n in store]

In [137]:
# Every second link is the store's "Shop all N items!" button, so drop it.
del stores[1::2]

In [138]:
stores

['Eataly Boston',
 "Savenor's Butcher & Market (Boston) ",
 "Foodie's Market (South Boston)",
 "Foodie's Market",
 "Wollaston's Market - Marino Center",
 "America's Food Basket - Codman Square",
 'The Daily Market',
 "America's Food Basket - Hyde Park",
 'Happy Market & Spirits']

In [149]:
# Write one store name per line; the with-block closes the file automatically.
with open("stores.txt", "w") as textfile:
    for element in stores:
        textfile.write(element + "\n")

So far we have the product names, the prices, and the store names. The next step is to find how many products belong to each store: at the moment the products cannot be assigned to stores, because the flat lists do not show where one store's products end and the next store's begin.
We therefore split the task into two parts.
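
One possible way to find the break points is to collect store links and product links in a single pass, in document order, and assign each product to the most recent store seen. A minimal sketch on a made-up snippet: it reuses the `js--mParticle-store_clicked` and `js--product-name` classes and the `data-store-name` and `data-store-product-id` attributes seen in the outputs, and it assumes (unverified) that each store's link precedes its products in the HTML.

```python
from bs4 import BeautifulSoup

store_html = """
<a class="js--mParticle-store_clicked" data-store-name="Store A" href="/shop/a">Store A</a>
<a class="js--product-name" data-store-product-id="1" href="/item/1">Roma Tomatoes</a>
<a class="js--product-name" data-store-product-id="1" href="/item/1">Roma Tomatoes</a>
<a class="js--mParticle-store_clicked" data-store-name="Store B" href="/shop/b">Store B</a>
<a class="js--product-name" data-store-product-id="2" href="/item/2">Plum Tomatoes</a>
"""
demo = BeautifulSoup(store_html, "html.parser")

by_store = {}   # store name -> list of product names
seen = set()    # (store, product id) pairs already recorded
current = None  # most recent store encountered

# A list of classes matches tags carrying any one of them, in document order.
wanted = ["js--mParticle-store_clicked", "js--product-name"]
for tag in demo.find_all("a", class_=wanted):
    if tag.has_attr("data-store-name"):
        current = tag["data-store-name"].strip()
        by_store.setdefault(current, [])
    elif current is not None:
        key = (current, tag.get("data-store-product-id"))
        if key not in seen:  # skip the repeated link for the same product
            seen.add(key)
            by_store[current].append(tag.get_text(strip=True))

counts = {store: len(items) for store, items in by_store.items()}
```

Reading the store from `data-store-name` rather than the link text means the "Shop all N items!" button resolves to the same store instead of creating a new entry.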

In [171]:
url = "https://www.mercato.com/store-results?keywords=celery"
header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36 Edg/98.0.1108.62'}
page = requests.get(url, headers=header)
page.reason

'200'

In [175]:
page.text

'<!DOCTYPE html>\n<html lang="en">\n<head>\n<link rel="preload" href="https://dye1fo42o13sl.cloudfront.net/fonts/mercato-fontawesome27.woff2?33722410" as="font" crossorigin>\n<link rel="preload" href="https://dye1fo42o13sl.cloudfront.net/fonts/OpenSans-Bold.woff2" as="font" crossorigin>\n<link rel="preload" href="https://dye1fo42o13sl.cloudfront.net/fonts/OpenSans-Light.woff2" as="font" crossorigin>\n<link rel="preload" href="https://dye1fo42o13sl.cloudfront.net/fonts/OpenSans-LightItalic.woff2" as="font" crossorigin>\n<link rel="preload" href="https://dye1fo42o13sl.cloudfront.net/fonts/OpenSans-Regular.woff2" as="font" crossorigin>\n<link rel="preload" href="https://dye1fo42o13sl.cloudfront.net/fonts/slick.woff" as="font" crossorigin>\n<link rel="preload" href="https://dye1fo42o13sl.cloudfront.net/static/builds/20220628150257_655d94e8/frontend/public/home-bundle.js" as="script">\n<link rel="preconnect" href="https://fcmatch.youtube.com" crossorigin>\n<link rel="preconnect" href="https

In [172]:
requests.utils.default_user_agent()

'python-requests/2.28.0'

In [173]:
soup = bs4.BeautifulSoup(page.text, 'html5lib')
#soup = BeautifulSoup(page.text, "lxml")
soup

<!DOCTYPE html>
<html lang="en"><head>
<link as="font" crossorigin="" href="https://dye1fo42o13sl.cloudfront.net/fonts/mercato-fontawesome27.woff2?33722410" rel="preload"/>
<link as="font" crossorigin="" href="https://dye1fo42o13sl.cloudfront.net/fonts/OpenSans-Bold.woff2" rel="preload"/>
<link as="font" crossorigin="" href="https://dye1fo42o13sl.cloudfront.net/fonts/OpenSans-Light.woff2" rel="preload"/>
<link as="font" crossorigin="" href="https://dye1fo42o13sl.cloudfront.net/fonts/OpenSans-LightItalic.woff2" rel="preload"/>
<link as="font" crossorigin="" href="https://dye1fo42o13sl.cloudfront.net/fonts/OpenSans-Regular.woff2" rel="preload"/>
<link as="font" crossorigin="" href="https://dye1fo42o13sl.cloudfront.net/fonts/slick.woff" rel="preload"/>
<link as="script" href="https://dye1fo42o13sl.cloudfront.net/static/builds/20220628150257_655d94e8/frontend/public/home-bundle.js" rel="preload"/>
<link crossorigin="" href="https://fcmatch.youtube.com" rel="preconnect"/>
<link crossorigin=

In [176]:
# <a href="https://www.mercato.com/item/organic-celery-hearts-1-pound/3655?featuredStoreId=1739" class="js--product-tile-name-link product-tile-rebrand__prod-name js--product-name notranslate" data-store-product-id="7167228">Organic Celery Hearts - 1 Pound</a>

name = soup.find_all('a', {'class': "js--product-tile-name-link product-tile-rebrand__prod-name js--product-name notranslate"})

name

[]
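
An empty list here can have more than one cause. Passing a multi-class string to `find_all` matches only when the `class` attribute is exactly that string, so a different class order or an extra class yields no match. It is also possible that the products on this page are loaded by JavaScript and are absent from the fetched HTML. A minimal sketch of the first case on a made-up snippet, matching on individual classes instead:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<a class="js--product-name notranslate extra" href="/item/1">Celery</a>',
    "html.parser",
)

# Exact string comparison against the whole class attribute: no match here.
exact = demo.find_all("a", {"class": "js--product-name notranslate"})

# A single class name matches any tag that carries it.
single = demo.find_all("a", class_="js--product-name")

# A CSS selector requires both classes but ignores their order and any extras.
css = demo.select("a.js--product-name.notranslate")
```

If the single-class search is also empty on the real page, searching `page.text` for a known product name is a quick way to tell whether the data is in the fetched HTML at all.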