<p><a name="sections"></a></p>
<br>
<br>

# Sections

- <a href="#intro">Introduction to Beautiful Soup</a><br>
    - <a href="#web">What is web scraping</a><br>
    - <a href="#html">Introduction to HTML</a><br>
    - <a href="#beautiful">Basics of Beautiful Soup</a><br>

- <a href="#example">Examples</a><br>
    - <a href="#calendar">Python User Group Calendar</a><br>
    - <a href="#yelp">Scrape Yelp Reviews</a><br>

<p><a name="web"></a></p>

## What is web scraping?

- HTML is short for **HyperText Markup Language**. It's a language for presenting content on the Web.

- Plain text is turned into an HTML document by **tags** that are then interpreted by a browser.

- Using BeautifulSoup, you can easily extract the tag values from HTML source code.

### Beautiful Soup VS Regular Expressions

In [3]:
# the source code of hi.html
!cat data/hi.html
# Windows user
# !type data\hi.html

<!DOCTYPE html>
<html>
    <head>
        <title>Hi</title> <!--Im a comment, ignore me.-->
    </head>
    <body>
        <a href='http://www.crummy.com/software/BeautifulSoup/'>Hello, beautifulsoup!</a>
    </body>
</html>


### Example:
- Extract the characters between the title tags. 


- In this case it's `Hi` (`<title>Hi</title>`).

- **Solution using Regular Expressions**

In [4]:
import re
hi_path = 'data/hi.html'
with open(hi_path, 'r') as f:
    hi = f.read()
    print(re.findall('<title>(.*)</title>', hi))

['Hi']


- **Solution using BeautifulSoup**

In [5]:
from bs4 import BeautifulSoup
with open(hi_path, 'r') as f:
    hi = f.read()
    hi = BeautifulSoup(hi, 'html.parser')
    print(hi.title) # find the title tag
    print(hi.title.string)  # find the value of tag

<title>Hi</title>
Hi


**Compared with regular expressions:**
    
- Beautiful Soup's syntax is much simpler, while regular expressions are more flexible.
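
To make the trade-off concrete, here is a small sketch of how the regular expression above behaves when the markup changes slightly. The multi-line HTML string is a hypothetical variant of `hi.html`, not the file used in this tutorial.

```python
import re

# Hypothetical variant of hi.html: the title now spans several lines
html = """<html>
<head>
<title>
    Hi
</title>
</head>
</html>"""

# '.' does not match newlines by default, so the original pattern finds nothing
naive = re.findall(r'<title>(.*)</title>', html)

# re.DOTALL lets '.' cross line breaks; '.*?' stops at the first closing tag
robust = re.findall(r'<title>(.*?)</title>', html, flags=re.DOTALL)
titles = [t.strip() for t in robust]

print(naive)   # []
print(titles)  # ['Hi']
```

The pattern has to be adjusted by hand for each such variation, which is the price of the extra flexibility.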

<p><a name="html"></a></p>

## Introduction to HTML

### Tag

- The `<title>` tags in this example designate the enclosed text as the title to be displayed in the head of the browser tab.
![hi](pic/hi.png)

- Tags are always enclosed by `<` and `>` to distinguish them from the content. 
- A pair of tags consists of a start tag and an end tag that carry the same name; the end tag has a slash `/` before the name.

### Values

Values are the content between start tags and end tags.

- **Example**

`<title>Hi</title>`: It's a title tag with a value of `Hi`.

### Attributes
Tags have another feature called attributes.

- **Example**

`<a href='http://www.crummy.com/software/BeautifulSoup/'>Hello, beautifulsoup!</a>`

This is an anchor tag `<a>` with an `href` attribute whose value is the URL http://www.crummy.com/software/BeautifulSoup/. It links the enclosed text to another address (a hyperlink).

### Tree structure
- The first tag in the example is the `<html>` tag. 

- Between the `<html>` tags, several tags are opened and closed again: `<head>`, `<title>`, `<body>`, and `<a>`.

    - The `<head>` and `<body>` tags are directly enclosed by the `<html>` tag. 
    - The `<title>` tag is enclosed by the `<head>` tag.
    - The `<a>` tag is enclosed by the `<body>` tag.


- A good way to describe the multiple layers of an HTML document is the tree analogy. 
![html](pic/html.png)

- The `html` tag is the root tag that splits into two branches, `<head>` and `<body>`; `<head>` is followed by another branch called `<title>`; `<body>` is followed by another branch called `<a>`.
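
The same tree can be explored in code. The following is a minimal sketch that assumes the `bs4` package is installed and uses an inline copy of the `hi.html` markup (whitespace removed for brevity); the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Inline copy of the hi.html structure, without whitespace between tags
html = ("<html><head><title>Hi</title></head>"
        "<body><a href='http://www.crummy.com/software/BeautifulSoup/'>"
        "Hello, beautifulsoup!</a></body></html>")
soup = BeautifulSoup(html, 'html.parser')

# Direct child tags of the root <html> tag (recursive=False stops at one level)
root_children = [tag.name for tag in soup.html.find_all(True, recursive=False)]

# Each tag knows which tag encloses it
title_parent = soup.title.parent.name
a_parent = soup.a.parent.name

print(root_children)  # ['head', 'body']
print(title_parent)   # head
print(a_parent)       # body
```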

<p><a name="beautiful"></a></p>

## Basics of Beautiful Soup

### Parse HTML

- The `prettify()` method returns the document with indentation added, which helps you see the tree structure of the HTML document.

In [6]:
from bs4 import BeautifulSoup
# open a local file and parse the plain text by BeautifulSoup directly
with open(hi_path, 'r') as f:
    hi = f.read()
    hi = BeautifulSoup(hi, 'html.parser')
    print(type(hi)) # get a bs4.BeautifulSoup object
    print('\n')
    print(hi.prettify())

<class 'bs4.BeautifulSoup'>


<!DOCTYPE html>
<html>
 <head>
  <title>
   Hi
  </title>
  <!--Im a comment, ignore me.-->
 </head>
 <body>
  <a href="http://www.crummy.com/software/BeautifulSoup/">
   Hello, beautifulsoup!
  </a>
 </body>
</html>



### Names, Values, and Attributes

Beautiful Soup can extract the name, value, and attributes of a tag. The corresponding properties of a Tag object are:
- name
- string
- attrs

In [7]:
print("The name of the a tag is: ", hi.a.name)
print("The value of the a tag is: ", hi.a.string)
print("The attributes of the a tag are: ", hi.a.attrs)

The name of the a tag is:  a
The value of the a tag is:  Hello, beautifulsoup!
The attributes of the a tag are:  {'href': 'http://www.crummy.com/software/BeautifulSoup/'}


### get_text() & get()
- For tags that have more than one child, `.string` does not work: it returns `None`.

In [8]:
print(hi.html.string)

None


- Use the `get_text()` method instead. It extracts all the text inside the tag, including the text of its child tags.

In [9]:
print(hi.html.get_text())



Hi 


Hello, beautifulsoup!
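
The blank lines in this output come from the whitespace between tags in the source file. As a small sketch, `get_text()` also accepts `separator` and `strip` arguments that tidy this up; the inline HTML below is a shortened stand-in for `hi.html`.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for hi.html, with line breaks between the tags
html = """<html>
<head><title>Hi</title></head>
<body><a href='#'>Hello, beautifulsoup!</a></body>
</html>"""
soup = BeautifulSoup(html, 'html.parser')

# Default behaviour: whitespace between tags is kept
raw = soup.html.get_text()

# strip=True drops surrounding whitespace and empty strings;
# separator is placed between the remaining pieces of text
clean = soup.html.get_text(separator=' ', strip=True)

print(repr(clean))  # 'Hi Hello, beautifulsoup!'
```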




- `get()` returns the value of a tag's attribute. For example, the following code gets the `href` of the `a` tag.

- It is the same as calling `hi.a.attrs` first and then looking up the key `href` in the resulting dictionary.

In [10]:
print(hi.a.get('href'))

http://www.crummy.com/software/BeautifulSoup/


In [11]:
print(hi.a.attrs)

{'href': 'http://www.crummy.com/software/BeautifulSoup/'}
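
One practical difference shows up when an attribute is missing. The sketch below, using an inline `a` tag rather than the tutorial file, compares `get()` with dictionary-style access.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<a href='http://www.crummy.com/software/BeautifulSoup/'>Hello</a>",
    'html.parser')
link = soup.a

href = link.get('href')             # attribute exists: returns its value
missing = link.get('id')            # attribute absent: returns None
fallback = link.get('id', 'no-id')  # attribute absent: returns the given default

# Dictionary-style access raises KeyError for a missing attribute
try:
    link['id']
    raised = False
except KeyError:
    raised = True

print(href, missing, fallback, raised)
```

Because `get()` never raises for a missing attribute, it is convenient when some tags on a page lack the attribute you are looking for.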


### find() & find_all()
The methods `find()` and `find_all()` offer flexible ways to search for tags.

In [12]:
!cat data/article.html
# Windows user
# !type data\article.html

<!DOCTYPE html>
<html>
    <head>
        <title>Article</title>
    </head>
    <body>
        <h1 id='one'>One</h1>
        	<p>This is the first paragraph.</p>
        <h2 id='two'>Two</h2>
        	<p><a href='www.google.com'>Here is the Google website.</a></p>
        <h3 id='three'>Three</h3>
        	<p>This is the third paragraph.</p>
    </body>
</html>


![article](pic/article.png)

In [13]:
article_path = 'data/article.html'
with open(article_path, 'r') as f:
    article = f.read()
    article = BeautifulSoup(article, 'html.parser')

- Return only the first `p` tag.

In [16]:
print(article.p)

<p>This is the first paragraph.</p>


- `find()` returns the first `p` tag, which is equivalent to `article.p`.

In [17]:
print(article.find('p'))

<p>This is the first paragraph.</p>


- `find_all()` returns all p tags

In [18]:
print(article.find_all('p'))

[<p>This is the first paragraph.</p>, <p><a href="www.google.com">Here is the Google website.</a></p>, <p>This is the third paragraph.</p>]


- To find the tags that have specific attributes, you can pass a dictionary as the `attrs` argument.

In [19]:
print(article.find_all('h1', attrs={'id':'one'}))

[<h1 id="one">One</h1>]


- You can also pass a function that takes a tag and returns `True` for the tags you want; `find_all()` returns the list of Tag objects that satisfy it.
- The following call returns the same result as the `attrs` example above:

In [20]:
# the tags whose attribute id equals 'one'
print(article.find_all(lambda tag: tag.get('id') == 'one'))

[<h1 id="one">One</h1>]
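
A few other ways of calling `find_all()` are sketched below on an inline copy of `article.html`: an attribute passed as a keyword argument, a list of tag names, and the `limit` argument. This is an illustration, not an exhaustive list of what the method accepts.

```python
from bs4 import BeautifulSoup

# Inline copy of data/article.html
html = """<html><body>
<h1 id='one'>One</h1><p>This is the first paragraph.</p>
<h2 id='two'>Two</h2><p><a href='www.google.com'>Here is the Google website.</a></p>
<h3 id='three'>Three</h3><p>This is the third paragraph.</p>
</body></html>"""
soup = BeautifulSoup(html, 'html.parser')

# An attribute can be given as a keyword argument instead of an attrs dictionary
by_keyword = soup.find_all('h1', id='one')

# A list of names matches any of the listed tags, in document order
headings = [tag.get_text() for tag in soup.find_all(['h1', 'h2', 'h3'])]

# limit stops the search after the given number of matches
first_two = soup.find_all('p', limit=2)

print(by_keyword)      # [<h1 id="one">One</h1>]
print(headings)        # ['One', 'Two', 'Three']
print(len(first_two))  # 2
```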


<p><a name="example"></a></p>

## Examples

<p><a name="calendar"></a></p>

### Python User Group Calendar

Let's extract the time, location, and event titles from this web page [Python User Group Calendar](https://www.python.org/events/python-user-group/).

<img src=pic/events.png width=800/>

- In the examples so far, we saved the HTML document locally. For a real web scraping project, you don't want to download every page by hand before you start scraping.
- The [Requests package](http://docs.python-requests.org/en/master/) used here is a widely used Python library that makes sending HTTP requests simple.
- The `get` method sends a GET request, one type of [HTTP request](https://www.tutorialspoint.com/http/http_methods.htm). It is the method most often used to retrieve information from a web server.
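
A request can fail or hang, so it can help to check the response before parsing it. The helper below is a minimal sketch, not part of the original notebook: `fetch_soup` and its `timeout` default are hypothetical names and values chosen for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper for illustration)."""
    # timeout makes the call give up instead of waiting forever
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises an HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')


# Usage (needs network access):
# soup = fetch_soup('https://www.python.org/events/python-user-group/')
# print(soup.title.string)
```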

In [52]:
import requests
response = requests.get('https://www.python.org/events/python-user-group/')
text = BeautifulSoup(response.text, 'html.parser')

In [53]:
# Print the indented HTML of the page (the output is long and not shown here)
print(text.prettify())

#### Title
Titles are in `h3` tags with an attribute `class="event-title"`.
<img src=pic/title.png width=900/>

In [23]:
titleTags = text.find_all('h3', {'class': "event-title"})
titleTags

[<h3 class="event-title"><a href="/events/python-user-group/835/">PyCon Latam 2019</a></h3>,
 <h3 class="event-title"><a href="/events/python-user-group/829/">Django Girls Abuja</a></h3>,
 <h3 class="event-title"><a href="/events/python-user-group/858/">Python Mauritius UserGroup (Pymug) June Meetup</a></h3>]

In [24]:
titleString = [tag.string for tag in titleTags]
titleString

['PyCon Latam 2019',
 'Django Girls Abuja',
 'Python Mauritius UserGroup (Pymug) June Meetup']

#### Time
Times are in `time` tags that have a `datetime` attribute.

![time](pic/time.png)

In [25]:
timeTags = text.find_all(lambda tag: 'datetime' in tag.attrs)
timeTags

[<time datetime="2019-08-29T00:00:00+00:00">29 Aug. – 31 Aug. <span class="say-no-more"> 2019</span></time>,
 <time datetime="2019-06-28T00:00:00+00:00">28 June – 29 June <span class="say-no-more"> 2019</span></time>,
 <time datetime="2019-06-23T06:00:00+00:00">23 June<span class="say-no-more"> 2019</span> 6am UTC – 10am UTC</time>]

In [26]:
timeString = [tag.get('datetime') for tag in timeTags]
timeString

['2019-08-29T00:00:00+00:00',
 '2019-06-28T00:00:00+00:00',
 '2019-06-23T06:00:00+00:00']

#### Location
Locations are in `span` tags with the attribute `class="event-location"`.

<img src=pic/location.png width=900/>

In [27]:
locationTags = text.find_all("span", {"class": "event-location"})
locationTags

[<span class="event-location">Hotel Friendly, Puerto Vallarta, México</span>,
 <span class="event-location">Abuja, Nigeria</span>,
 <span class="event-location">Curepipe, Mauritius</span>]

In [28]:
locationString = [tag.string for tag in locationTags]
locationString

['Hotel Friendly, Puerto Vallarta, México',
 'Abuja, Nigeria',
 'Curepipe, Mauritius']

### Web Scraping Project Workflow
- We have been lucky so far because there are no missing values on this page. But what if the location of one event is missing? The three lists would then have different lengths, and we could no longer tell which title goes with which time and location.
- The general workflow of a web scraping project is as follows:
    - Find a unique attribute that locates the **top-level** tags you are interested in.
        - Each tag could be a listing, a review, an item...
        - **one unique tag -> one row in the csv file**
    - Here, we want to locate each event tag whose child tags contain the title, datetime, and location that we will save as columns in a csv file.
    - Then go one level deeper to find the child tags of each event. If something is missing, replace it with an empty string.
    - On this page, the events are `li` tags inside `ul` tags that have the attribute **class="list-recent-events menu"**.
    - The next question is: what is the best data structure to represent a single event?

In [32]:
# Save all the events in a list
result = []
# Save all the ul tags, each ul is a section of the page
uls = text.find_all('ul', {'class': 'list-recent-events menu'})
for ul in uls:
    # Save all the li tags, each li is an event
    lis = ul.find_all('li')
    for li in lis:
        # Initialize an empty dictionary for each event
        event = {}
        # find() returns None when a tag is missing, so accessing
        # .string or .get() on it raises AttributeError
        try:
            title = li.find('a').string
        except AttributeError:
            # Skip list items that have no title link
            continue
        try:
            time = li.find('time').get('datetime')
        except AttributeError:
            time = ""
        try:
            location = li.find('span', {'class':'event-location'}).string.strip()
        except AttributeError:
            location = ""
        
        # Assign the values in the dictionary
        event['location'] = location
        event['time'] = time
        event['title'] = title
        result.append(event)

In [33]:
result

[{'location': 'Hotel Friendly, Puerto Vallarta, México',
  'time': '2019-08-29T00:00:00+00:00',
  'title': 'PyCon Latam 2019'},
 {'location': 'Abuja, Nigeria',
  'time': '2019-06-28T00:00:00+00:00',
  'title': 'Django Girls Abuja'},
 {'location': 'Curepipe, Mauritius',
  'time': '2019-06-23T06:00:00+00:00',
  'title': 'Python Mauritius UserGroup (Pymug) June Meetup'}]
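
To see why the container-first approach matters, here is a small sketch on a hypothetical page fragment in which the second event has no location. The markup and event names are made up for illustration; only the class names follow the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second event has no location
html = """
<ul class="list-recent-events menu">
  <li><h3 class="event-title"><a href="#">Event A</a></h3>
      <span class="event-location">Abuja, Nigeria</span></li>
  <li><h3 class="event-title"><a href="#">Event B</a></h3></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# Page-wide lists: lengths differ, so rows can no longer be matched up
titles = [t.get_text() for t in soup.find_all('h3', {'class': 'event-title'})]
locations = [s.get_text() for s in soup.find_all('span', {'class': 'event-location'})]
print(len(titles), len(locations))  # 2 1

# Container-first: one dictionary per <li>, with '' for anything missing
rows = []
for li in soup.find_all('li'):
    title_tag = li.find('a')
    location_tag = li.find('span', {'class': 'event-location'})
    rows.append({
        'title': title_tag.get_text() if title_tag else '',
        'location': location_tag.get_text() if location_tag else '',
    })
print(rows)
```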

<p><a name="yelp"></a></p>

## Scrape Yelp Reviews
- Let's apply what we have learned to a more complicated example: scraping Yelp reviews.
- Our task is to scrape all the reviews of the ABC Kitchen restaurant on Yelp: https://www.yelp.com/biz/abc-kitchen-new-york
- You can extend this code to other restaurants.

### Step 1: Find the pattern of url

- Here we add a `User-Agent` field to the headers of our request, because some web servers check header fields to block automated scrapers.
- `User-Agent` is the field most commonly checked because it identifies the browser (or program) that sends the request.

In [36]:
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
            }

response = requests.get('https://www.yelp.com/biz/abc-kitchen-new-york', headers=headers)
text = BeautifulSoup(response.text, 'html.parser')


<!DOCTYPE HTML>

<!--[if lt IE 7 ]> <html xmlns:fb="http://www.facebook.com/2008/fbml" class="ie6 ie ltie9 ltie8 no-js" lang="en"> <![endif]-->
<!--[if IE 7 ]>    <html xmlns:fb="http://www.facebook.com/2008/fbml" class="ie7 ie ltie9 ltie8 no-js" lang="en"> <![endif]-->
<!--[if IE 8 ]>    <html xmlns:fb="http://www.facebook.com/2008/fbml" class="ie8 ie ltie9 no-js" lang="en"> <![endif]-->
<!--[if IE 9 ]>    <html xmlns:fb="http://www.facebook.com/2008/fbml" class="ie9 ie no-js" lang="en"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html class="no-js" lang="en" xmlns:fb="http://www.facebook.com/2008/fbml"> <!--<![endif]-->
<head>
<script>
            (function() {
                var main = null;

                var main=function(){window.onerror=function(k,a,c,i,f){var j=(document.getElementsByTagName("html")[0].getAttribute("webdriver")==="true"||navigator.userAgent==="selenium");var h=f&&(f.name==="ServerSideRenderingError"||f.name==="CSRFallbackError");if(j&&!h){document.body.inne

- If you go to the second page, you can see the url becomes https://www.yelp.com/biz/abc-kitchen-new-york?start=20
- Similarly, the url of the third page is https://www.yelp.com/biz/abc-kitchen-new-york?start=40
- But how do we find out the url of the last page?

In [37]:
import re

temp = 'lemon--p__373c0__3Qnnj text__373c0__2pB8f text-color--mid__373c0__3G312 text-align--left__373c0__2pnx_ text-size--large__373c0__1568g'
num_reviews = text.find('p', attrs={'class': temp}).string
num_reviews = int(re.findall(r'\d+', num_reviews)[0])
print(num_reviews)

2887


In [38]:
url_list = []
for i in range(0, num_reviews, 20):
    url_list.append('https://www.yelp.com/biz/abc-kitchen-new-york?start='+str(i))
print(url_list[:10])

['https://www.yelp.com/biz/abc-kitchen-new-york?start=0', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=20', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=40', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=60', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=80', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=100', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=120', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=140', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=160', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=180']
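
The same URL list can be produced by a small reusable function. This is a sketch only: `build_page_urls` and its `page_size` parameter are hypothetical names, and the page size of 20 is taken from the `start=20`, `start=40` pattern observed above.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, num_reviews, page_size=20):
    """Return one URL per page of results (hypothetical helper)."""
    return [base_url + '?' + urlencode({'start': start})
            for start in range(0, num_reviews, page_size)]


urls = build_page_urls('https://www.yelp.com/biz/abc-kitchen-new-york', 2887)
print(len(urls))  # 145
print(urls[-1])   # https://www.yelp.com/biz/abc-kitchen-new-york?start=2880
```

With 2887 reviews and 20 reviews per page, the last page starts at review 2880, so the list has 145 entries.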


### Step 2: Find all the review divs on the page

In [43]:
temp = \
'lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT'
reviews = text.find_all('li', attrs={'class': temp})
print(len(reviews))

20


### Step 3: Scrape the detail information

For debugging purposes, we usually test the code on one review first and then apply it to the others.

In [45]:
review = reviews[0]

# Username
temp = 'lemon--span__373c0__3997G text__373c0__2pB8f text-color--inherit__373c0__w_15m text-align--left__373c0__2pnx_ text-weight--bold__373c0__3HYJa'
username = review.find('span', attrs={'class': temp}).string
print(username)

Asher W.


In [46]:
# Location
temp = 'lemon--span__373c0__3997G text__373c0__2pB8f text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_ text-weight--bold__373c0__3HYJa text-size--small__373c0__3SGMi'
location = review.find('span', attrs={'class': temp}).get_text()
print(location)

New York, NY


In [47]:
# Rating
temp = 'lemon--span__373c0__3997G display--inline__373c0__1DbOG border-color--default__373c0__2oFDT'
rating = review.find('span', attrs={'class': temp}).find('div').get('aria-label')
rating = float(re.findall(r'\d+', rating)[0])
print(rating)

5.0


In [48]:
# Date
temp = 'lemon--span__373c0__3997G text__373c0__2pB8f text-color--mid__373c0__3G312 text-align--left__373c0__2pnx_'
date = review.find('span', attrs={'class': temp}).get_text()
print(date)

6/19/2019


In [49]:
# Content
temp = 'lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_'
content = review.find('p', attrs={'class': temp}).get_text()
print(content)

I've literally walked by ABC kitchen every day for 5 years and said "I've gotta try that place sometime." "I've eaten at every other great restaurant in Flatiron, why haven't I been to ABC?"Well the other night we finally made it to ABC Kitchen, and damn do we regret waiting this long. We started at the bar. I had a marg which I asked them to rim with their adobo chili salt. My girlfriend and her mother drank wine. The marg was solid, great kick from the chipotle infused salt. The bar was very nice. We were seated shortly after. Our waitress (though I can't remember her name) was an absolutely sweetheart. Not only was she kind and funny but very knowledgeable and opinionated when it came to the menu. We love opinionated waiters because any smart diner knows that nobody knows the food better than the people who serve it and eat it daily, so we often ask for advice when we are having an internal dilemma while choosing between various delicious sounding dishes. With her expertise we arriv
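
The long class strings used above look auto-generated, so they may change whenever the site is updated, and matching the full string would then stop working. One possible alternative, sketched below on a hypothetical fragment, is to match a single class by its prefix using a regular expression with the `class_` argument. Whether the `comment__` prefix stays stable on the real site is an assumption.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment that reuses class names from the cells above
html = """
<li>
  <p class="lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH">Great food.</p>
  <p class="lemon--p__373c0__3Qnnj text__373c0__2pB8f">6/19/2019</p>
</li>
"""
soup = BeautifulSoup(html, 'html.parser')

# The pattern is tested against each class of a tag individually,
# so any <p> with a class starting with 'comment__' is returned
comment_class = re.compile(r'^comment__')
content_tags = soup.find_all('p', class_=comment_class)
contents = [tag.get_text() for tag in content_tags]

print(contents)  # ['Great food.']
```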

### Step 4: Apply to all the reviews and save them to a csv file

In [50]:
import csv
# On Windows, open() uses the system's default text encoding rather than UTF-8.
# Setting encoding='utf-8' explicitly avoids many encoding issues.
# newline='' prevents blank lines between rows on Windows.
with open('reviews.csv', 'w', encoding='utf-8', newline='') as csvfile:
    review_writer = csv.writer(csvfile)
    for review in reviews:
        dic = {}
        username = review.find('span', attrs={'class': 'lemon--span__373c0__3997G text__373c0__2pB8f text-color--inherit__373c0__w_15m text-align--left__373c0__2pnx_ text-weight--bold__373c0__3HYJa'})\
                   .string.strip() 
        location = review.find('span', attrs={'class': 'lemon--span__373c0__3997G text__373c0__2pB8f text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_ text-weight--bold__373c0__3HYJa text-size--small__373c0__3SGMi'})\
                   .get_text().strip()
        date = review.find('span', attrs={'class': 'lemon--span__373c0__3997G text__373c0__2pB8f text-color--mid__373c0__3G312 text-align--left__373c0__2pnx_'})\
                   .get_text().strip()
        rating = review.find('span', attrs={'class': 'lemon--span__373c0__3997G display--inline__373c0__1DbOG border-color--default__373c0__2oFDT'})\
                   .find('div').get('aria-label')
        rating = float(re.findall(r'\d+', rating)[0])
        content = review.find('p', attrs={'class': 'lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_'})\
                   .get_text().strip()
        dic['username'] = username
        dic['location'] = location
        dic['date'] = date
        dic['rating'] = rating
        dic['content'] = content
        review_writer.writerow(dic.values())
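
The file written above has no header row, so the column order is only implied by the order of the dictionary keys. As an alternative sketch, `csv.DictWriter` writes a header and takes dictionaries directly. The row below is made-up sample data, and an in-memory `io.StringIO` buffer stands in for the real file so the example runs without touching disk.

```python
import csv
import io

# Made-up sample row with the same keys as the dictionaries built above
rows = [
    {'username': 'Jane D.', 'location': 'New York, NY', 'date': '6/19/2019',
     'rating': 5.0, 'content': 'Sample review text'},
]
fieldnames = ['username', 'location', 'date', 'rating', 'content']

# Stand-in for open('reviews.csv', 'w', encoding='utf-8', newline='')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()    # first line: the column names
writer.writerows(rows)  # one line per dictionary, in fieldnames order

lines = buffer.getvalue().splitlines()
print(lines[0])  # username,location,date,rating,content
print(lines[1])  # Jane D.,"New York, NY",6/19/2019,5.0,Sample review text
```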

### Step 5: Apply to all the pages

In [51]:
import time
import random


def scrape_single_page(reviews, csvwriter):
    for review in reviews:
        dic = {}
        username = review.find('span', attrs={'class': 'lemon--span__373c0__3997G text__373c0__2pB8f text-color--inherit__373c0__w_15m text-align--left__373c0__2pnx_ text-weight--bold__373c0__3HYJa'})\
                   .string.strip() 
        location = review.find('span', attrs={'class': 'lemon--span__373c0__3997G text__373c0__2pB8f text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_ text-weight--bold__373c0__3HYJa text-size--small__373c0__3SGMi'})\
                   .get_text().strip()
        date = review.find('span', attrs={'class': 'lemon--span__373c0__3997G text__373c0__2pB8f text-color--mid__373c0__3G312 text-align--left__373c0__2pnx_'})\
                   .get_text().strip()
        rating = review.find('span', attrs={'class': 'lemon--span__373c0__3997G display--inline__373c0__1DbOG border-color--default__373c0__2oFDT'})\
                   .find('div').get('aria-label')
        rating = float(re.findall(r'\d+', rating)[0])
        content = review.find('p', attrs={'class': 'lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_'})\
                   .get_text().strip()
        dic['username'] = username
        dic['location'] = location
        dic['date'] = date
        dic['rating'] = rating
        dic['content'] = content
        # Use the writer passed in as an argument, not a global variable
        csvwriter.writerow(dic.values())
    

with open('reviews.csv', 'w', encoding='utf-8', newline='') as csvfile:
    review_writer = csv.writer(csvfile)
    for index, url in enumerate(url_list):
        response = requests.get(url, headers=headers)
        text = BeautifulSoup(response.text, 'html.parser')
        reviews = text.find_all('li', attrs={'class': 'lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT'})
        scrape_single_page(reviews, review_writer)
        # Random sleep to avoid getting banned from the server
        time.sleep(random.randint(1,3))
        # Log the progress
        print('Finished page ' + str(index + 1))

Finished page 1
Finished page 2
Finished page 3
Finished page 4


KeyboardInterrupt: 
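
With about 145 pages, a long run can be interrupted or hit a page that fails to load. The sketch below shows one way to keep going after a failed page and record it for a later retry. All names here (`scrape_pages`, `fetch_page`, `fake_fetch`) are hypothetical; the fetch function is passed in as an argument so the loop can be tried out without any network access.

```python
import random
import time


def scrape_pages(urls, fetch_page, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch_page(url) for each url, pausing between requests.

    Returns (results, failed): the collected rows and a list of
    (url, error message) pairs for pages that raised an exception.
    """
    results, failed = [], []
    for index, url in enumerate(urls):
        try:
            results.extend(fetch_page(url))
        except Exception as error:  # broad on purpose: record and move on
            failed.append((url, str(error)))
        # Random pause between requests, skipped after the last page
        if index < len(urls) - 1:
            sleep(random.uniform(min_delay, max_delay))
    return results, failed


# Demo with a fake fetcher so that nothing is downloaded
def fake_fetch(url):
    if url.endswith('start=20'):
        raise ValueError('simulated failure')
    return [{'url': url}]


demo_urls = ['https://www.yelp.com/biz/abc-kitchen-new-york?start=' + str(i)
             for i in (0, 20, 40)]
rows, failed = scrape_pages(demo_urls, fake_fetch, sleep=lambda seconds: None)
print(len(rows))  # 2
print(failed)
```

In a real run, `fetch_page` would download the page with `requests`, parse it, and return the list of review dictionaries for that page.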