<p><a name="sections"></a></p>
<br>
<br>
# Sections

- <a href="#intro">Introduction to Beautiful Soup</a><br>
    - <a href="#web">What is web scraping</a><br>
    - <a href="#html">Introduction to HTML</a><br>
    - <a href="#beautiful">Basics of Beautiful Soup</a><br>

- <a href="#example">Examples</a><br>
    - <a href="#calendar">Python User Group Calendar</a><br>
    - <a href="#yelp">Scrape Yelp Reviews</a><br>

<p><a name="web"></a></p>
## What is web scraping?
[[back to top]](#sections)

- HTML is short for **HyperText Markup Language**. It's a language for presenting content on the Web.

- Plain text is turned into an HTML document by **tags** that are then interpreted by a browser.

- Using BeautifulSoup, you can easily extract the tag values from HTML source code.

### Beautiful Soup VS Regular Expressions

In [1]:
# the source code of hi.html
!type data/hi.html

The syntax of the command is incorrect.


### Example:
- Extract the characters between the title tags. 


- In this case it's `Hi` (`<title>Hi</title>`).

- **Solution using Regular Expressions**

In [3]:
import re
import Requests
hi_path = 'data/hi.html'
with open(hi_path, 'r') as f:
    hi = f.read()
    print(re.findall('<title>(.*)</title>', hi))

['Hi']


- **Solution using BeautifulSoup**

In [4]:
from bs4 import BeautifulSoup
with open(hi_path, 'r') as f:
    hi = f.read()
    hi = BeautifulSoup(hi, 'html.parser')
    print(hi.title) # find the title tag
    print(hi.title.string)  # find the value of tag

<title>Hi</title>
Hi


**Compared with regular expressions:**
    
- Beautiful Soup's syntax is much simpler, while regular expressions are more flexible.

<p><a name="html"></a></p>
## Introduction to HTML
[[back to top]](#sections)

### Tag

- The `<title>` tags in this example designate the enclosed text as the title to be displayed in the head of the browser tab.
![hi](pic/hi.png)

- Tags are always enclosed by `<` and `>` to distinguish them from the content. 
- A pair of tags consist of start and end tags which carry the same name, but the end tag is preceded by a slash `/` .

### Values

Values are the content between start tags and end tags.

- **Example**

`<title>Hi</title>`: It's a title tag with a value of `Hi`.

### Attributes
Tags have another feauture called attributes.

- **Example**

`<a href='http://www.crummy.com/software/BeautifulSoup/'>Hello, beautifulsoup!</a>`

The anchor tag `<a>` with an attribute `href` and hyperlink—http://www.crummy.com/software/BeautifulSoup/. It creates an association of text points to another address (a hyperlink).

### Tree structure
- The first element in the example is the `<html>` element. 


- Between the `<html>` tags of this element, several tags are opened and closed again: `<head>, <title>` , and
`<body>, <a>`.

    - The `<head>` and `<body>` tags are directly enclosed by the `<html>` element. 
    - The `<title>` element is enclosed by the `<head>` tag.
    - The `<a>` element is enclosed by the `<body>` tag.


- A good way to describe the multiple layers of an HTML document is the tree analogy. 
![html](pic/html.png)

- The <html> element is the root element that splits into two branches, `<head>` and `<body>`; `<head>` is followed by another branch called `<title>`; `<body>` is followed by another branch called `<a>`.

<p><a name="beautiful"></a></p>
## Basics of Beautiful Soup
[[back to top]](#sections)

### Parse HTML

- The `prettify()` method adds indentations so that it will help you understand the tree structure of the html document.

In [13]:
from bs4 import BeautifulSoup
# open a local file and parse the plain text by BeautifulSoup directly
with open(hi_path, 'r') as f:
    hi = f.read()
    hi = BeautifulSoup(hi, 'html.parser')
    print(type(hi)) # get a bs4.BeautifulSoup object
    print('\n')
    print(hi.prettify())

<class 'bs4.BeautifulSoup'>


<!DOCTYPE html>
<html>
 <head>
  <title>
   Hi
  </title>
  <!--Im a comment, ignore me.-->
 </head>
 <body>
  <a href="http://www.crummy.com/software/BeautifulSoup/">
   Hello, beautifulsoup!
  </a>
 </body>
</html>



### Names, Values, and Attributes

Beautiful Soup can extract the `name`, `value` and `attributes` of tags. The corresponding methods are:
- name
- string
- attrs

In [12]:
print("The name of a tags is: ", hi.a.name)
print("The value of a tags is: ", hi.a.string)
print("The attribute of a tags is: ", hi.a.attrs)

The name of a tags is:  a
The value of a tags is:  Hello, beautifulsoup!
The attribute of a tags is:  {'href': 'http://www.crummy.com/software/BeautifulSoup/'}


### get_text() & get()
- For tags that have child tags the string does not work

In [14]:
print(hi.html.string)

None


- Use the get_text method instead. The `get_text()` method will extract all the contents of child tags.

In [15]:
print(hi.html.get_text())



Hi 


Hello, beautifulsoup!




- `get()` is used to find the attribute of a tag. For example, we can get the href of tag a using the following code. 

- It is the same as run `hi.a.attrs` first and then find the value of key `href` from the dictionary.

In [16]:
print(hi.a.get('href'))

http://www.crummy.com/software/BeautifulSoup/


In [17]:
print(hi.a.attrs)

{'href': 'http://www.crummy.com/software/BeautifulSoup/'}


### find() & find_all()
The functions `find` and `findall` are flexible for finding tags.

In [22]:
!type data/article.html

The syntax of the command is incorrect.


![article](pic/article.png)

In [23]:
article_path = 'data/article.html'
with open(article_path, 'r') as f:
    article = f.read()
    article = BeautifulSoup(article, 'html.parser')

- Return only the first `p` tag.

In [24]:
print(article.p)

<p>This is the first paragraph.</p>


- `find()` returns the first p tags, which is equivalent to article.p

In [25]:
print(article.find('p'))

<p>This is the first paragraph.</p>


- `find_all()` returns all p tags

In [26]:
print(article.find_all('p'))

[<p>This is the first paragraph.</p>, <p><a href="www.google.com">Here is the Google website.</a></p>, <p>This is the third paragraph.</p>]


- To find the tags that have specific attributes, you can pass a dictionary as the `attrs` argument.

In [27]:
print(article.find_all('h1', attrs={'id':'one'}))

[<h1 id="one">One</h1>]


- You can also specify a function to extract a list of Tag objects that match the given criteria.
- It is the same as the following:

In [28]:
# the tags whose attribute id equals 'one'
print(article.find_all(lambda tag: tag.get('id') == 'one'))

[<h1 id="one">One</h1>]


<p><a name="example"></a></p>
## Examples
[[back to top]](#sections)

<p><a name="calendar"></a></p>
### Python User Group Calendar
[[back to top]](#sections)

Let's extract the time, location, and event titles from this web page [Python User Group Calendar](https://www.python.org/events/python-user-group/).

<img src=pic/events.png width=800/>

In [29]:
import requests
text = requests.get('https://www.python.org/events/python-user-group/').text
text = BeautifulSoup(text, 'html.parser')

In [30]:
print(text.prettify())

<!DOCTYPE doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" dir="ltr" lang="en">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <link href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js" rel="prefetch"/>
  <meta content="Python.org" name="application-name"/>
  <meta content="The official home of the Python Programming Language" name="msapplication-tooltip"/>
  <meta content="Python.org" name="apple-mobile-web-app-title"/>
  <meta content="yes" name="apple-mobile-web-app-capable"/>
  <meta content="black" name="apple-mobile-web-app-status-bar-style"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="True" name="HandheldFrien

#### Title
Titles are in `h3` tags with an attribute `class="event-title"`.
<img src=pic/title.png width=900/>

In [None]:
titleTags = text.find_all('h3', {'class': "event-title"})
titleTags

In [None]:
titleString = [tag.get_text() for tag in titleTags]
titleString

#### Time
Times are in the `time` tags with the attribute `datetime`.

![time](pic/time.png)

In [31]:
timeTags = text.find_all(lambda tag: 'datetime' in tag.attrs)
timeTags

[<time datetime="2017-10-14T00:00:00+00:00">14 Oct. – 15 Oct. <span class="say-no-more"> 2017</span></time>,
 <time datetime="2017-10-14T00:00:00+00:00">14 Oct. – 16 Oct. <span class="say-no-more"> 2017</span></time>,
 <time datetime="2017-10-20T00:00:00+00:00">20 Oct. – 21 Oct. <span class="say-no-more"> 2017</span></time>,
 <time datetime="2017-10-21T00:00:00+00:00">21 Oct. – 22 Oct. <span class="say-no-more"> 2017</span></time>,
 <time datetime="2017-10-21T00:00:00+00:00">21 Oct. – 22 Oct. <span class="say-no-more"> 2017</span></time>,
 <time datetime="2017-10-23T18:00:00+00:00">23 Oct.<span class="say-no-more"> 2017</span></time>,
 <time datetime="2017-10-01T00:00:00+00:00">01 Oct. – 02 Oct. <span class="say-no-more"> 2017</span></time>,
 <time datetime="2017-09-30T00:00:00+00:00">30 Sept. – 01 Oct. <span class="say-no-more"> 2017</span></time>]

In [32]:
timeString = [tag.get('datetime') for tag in timeTags]
timeString

['2017-10-14T00:00:00+00:00',
 '2017-10-14T00:00:00+00:00',
 '2017-10-20T00:00:00+00:00',
 '2017-10-21T00:00:00+00:00',
 '2017-10-21T00:00:00+00:00',
 '2017-10-23T18:00:00+00:00',
 '2017-10-01T00:00:00+00:00',
 '2017-09-30T00:00:00+00:00']

#### Location
Locations are in `span` tags with the attribute `class="envet-location"`.

<img src=pic/location.png width=900/>

In [33]:
locationTags = text.find_all("span", {"class": "event-location"})
locationTags

[<span class="event-location">Cartagena, Colombia</span>,
 <span class="event-location">Freetown, Sierra Leone</span>,
 <span class="event-location">Former Tax Academy, Jos, Nigeria</span>,
 <span class="event-location">Metropolitan City of Rome, Italy</span>,
 <span class="event-location">RİZE, TURKEY</span>,
 <span class="event-location">Campus Madrid, Calle Moreno Nieto, 2, Madrid, Spain</span>,
 <span class="event-location">Bishkek, Kyrgyz Republic</span>,
 <span class="event-location">Outbox Hub at Soliz House on Lumumba avenue,  Kampala, Uganda</span>]

In [34]:
locationString = [tag.get_text() for tag in locationTags]
locationString

['Cartagena, Colombia',
 'Freetown, Sierra Leone',
 'Former Tax Academy, Jos, Nigeria',
 'Metropolitan City of Rome, Italy',
 'RİZE, TURKEY',
 'Campus Madrid, Calle Moreno Nieto, 2, Madrid, Spain',
 'Bishkek, Kyrgyz Republic',
 'Outbox Hub at Soliz House on Lumumba avenue,  Kampala, Uganda']

- Let's quickly wrap up what we have got so far.
- The workflow of a web scraping project is:
 - Find the unique attribute that will locate the **top level** tags that you are interested in.
 - We want to locate the event tag that its child tags contain the title, datetime and location that you want to save as columns in a csv file: 
 
 **one unique tag -> one row in csv file**
 - Then you go levels deeper to find the child tags of each event.
 - The event tags have a unique class attribute equal to **list-recent-events menu**.
 - Next question is: what is the best data structure to represent one single event?

In [35]:
result = []
uls = text.find_all('ul', {'class': 'list-recent-events menu'})
for ul in uls:
    lis = ul.find_all('li')
    for li in lis:
        event = {}
        try:
            title = li.find('a').get_text()
        except:
            continue       
        try:
            time = li.find('time').get('datetime')
        except:
            time = ""
        try:
            location = li.find('span', {'class':'event-location'}).get_text().strip()
        except:
            location = ""
            
        event['location'] = location
        event['time'] = time
        event['title'] = title
        result.append(event)

In [36]:
result

[{'location': 'Cartagena, Colombia',
  'time': '2017-10-14T00:00:00+00:00',
  'title': 'Django Girls Cartagena, Colombia'},
 {'location': 'Freetown, Sierra Leone',
  'time': '2017-10-14T00:00:00+00:00',
  'title': 'Django Girls Freetown'},
 {'location': 'Former Tax Academy, Jos, Nigeria',
  'time': '2017-10-20T00:00:00+00:00',
  'title': 'Django Girls Jos'},
 {'location': 'Metropolitan City of Rome, Italy',
  'time': '2017-10-21T00:00:00+00:00',
  'title': 'Django Girls Rome'},
 {'location': 'RİZE, TURKEY',
  'time': '2017-10-21T00:00:00+00:00',
  'title': 'Django Girls Rize'},
 {'location': 'Campus Madrid, Calle Moreno Nieto, 2, Madrid, Spain',
  'time': '2017-10-23T18:00:00+00:00',
  'title': 'Meetup Python Madrid'},
 {'location': 'Bishkek, Kyrgyz Republic',
  'time': '2017-10-01T00:00:00+00:00',
  'title': 'Django Girls Bishkek'},
 {'location': 'Outbox Hub at Soliz House on Lumumba avenue,  Kampala, Uganda',
  'time': '2017-09-30T00:00:00+00:00',
  'title': 'Django Girls Kampala 201

<p><a name="yelp"></a></p>
### Scrape Yelp Reviews
[[back to top]](#sections)

#### Step 1: Find the pattern of url

In [37]:
from bs4 import BeautifulSoup
import requests

headers = {
    'Connection': 'keep-alive',
    'Access-Control-Request-Headers': 'content-type',
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get('https://www.yelp.com/biz/abc-kitchen-new-york', headers=headers).text
soup = BeautifulSoup(response, 'html.parser')

- If you go to the second page, you can see the url becomes https://www.yelp.com/biz/abc-kitchen-new-york?start=20. 
- Similarly, the url to the thid page: https://www.yelp.com/biz/abc-kitchen-new-york?start=40
- But how do we find out the url of the last page?

In [38]:
import re
num_reviews = soup.find('span', attrs={'class': 'review-count rating-qualifier'}).text
num_reviews = int(re.search('\d+', num_reviews).group())
print(num_reviews)

2513


In [39]:
url_list = []
for i in range(0, num_reviews, 20):
    url_list.append('https://www.yelp.com/biz/abc-kitchen-new-york?start='+str(i))
print(url_list)

['https://www.yelp.com/biz/abc-kitchen-new-york?start=0', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=20', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=40', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=60', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=80', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=100', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=120', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=140', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=160', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=180', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=200', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=220', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=240', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=260', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=280', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=300', 'https://www.yelp.com/biz/abc-kitchen-new-york?start=320', 'h

#### Step 2: Find all the review divs on the page

In [40]:
reviews = soup.find_all('div', attrs={'class':'review review--with-sidebar'})
print(len(reviews))

20


#### Step 3: Scrape the detail information

For debugging purpose, we usually test it out on one review and then apply to the others.

In [41]:
review = reviews[0]

# Username
username = review.find('a', attrs={'class': 'user-display-name js-analytics-click'}).text
print(username)

Rob D.


In [42]:
# Location
location = review.find('li', attrs={'class': 'user-location responsive-hidden-small'}).text
print(location)


Manhattan, NY



In [43]:
# Rating
rating = review.find('img', attrs={'class': 'offscreen'}).get('alt')
rating = float(re.search('\d+', rating).group())
print(rating)

5.0


In [44]:
# Date
date = review.find('span', attrs={'class': 'rating-qualifier'}).text
print(date)


        9/24/2017
    


In [45]:
# Content
content = review.find('p').text
print(content)

I would marry that turkey sandwich and spoil it for the rest of it's short delicious life. Hands down one of the best sandwiches in the city - pretty sure our neighboring tables needed to mute my bites as the crunch was off the charts.  It is bonkers good. If you want a classier step up from The Smith - I highly recommend the brunch here, the cappuccinos are also amazing Also - no surprise... the ambiance is second to none.


#### Step 4: Apply to all the reviews and save them to a csv file

In [47]:
import csv
with open('reviews.csv', 'w') as csvfile:
    review_writer = csv.writer(csvfile)
    for review in reviews:
        dic = {}
        username = review.find('a', attrs={'class': 'user-display-name js-analytics-click'}).text
        location = review.find('li', attrs={'class': 'user-location responsive-hidden-small'}).text.strip()
        date = review.find('span', attrs={'class': 'rating-qualifier'}).text.strip()
        rating = review.find('img', attrs={'class': 'offscreen'}).get('alt')
        rating = float(re.search('\d+', rating).group())
        content = review.find('p').text
        dic['username'] = username
        dic['location'] = location
        dic['date'] = date
        dic['rating'] = rating
        dic['content'] = content
        review_writer.writerow(dic.values())

#### Step 5: Apply to all the pages

In [48]:
import time
import random


def scrape_single_page(reviews, csvwriter):
    for review in reviews:
        dic = {}
        username = review.find('a', attrs={'class': 'user-display-name js-analytics-click'}).text
        location = review.find('li', attrs={'class': 'user-location responsive-hidden-small'}).text.strip()
        date = review.find('span', attrs={'class': 'rating-qualifier'}).text.strip()
        rating = review.find('img', attrs={'class': 'offscreen'}).get('alt')
        rating = float(re.search('\d+', rating).group())
        content = review.find('p').text
        dic['username'] = username
        dic['location'] = location
        dic['date'] = date
        dic['rating'] = rating
        dic['content'] = content
        csvwriter.writerow(dic.values())
    

with open('reviews.csv', 'w') as csvfile:
    review_writer = csv.writer(csvfile)
    for index, url in enumerate(url_list):
        response = requests.get(url, headers=headers).text
        soup = BeautifulSoup(response, 'html.parser')
        reviews = soup.find_all('div', attrs={'class':'review review--with-sidebar'})
        scrape_single_page(reviews, review_writer)
        # Random sleep to avoid getting banned from the server
        time.sleep(random.randint(1,3))
        # Log the progress
        print('Finished page ' + str(index + 1))

Finished page 1
Finished page 2
Finished page 3
Finished page 4
Finished page 5
Finished page 6


UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>