<p><a name="sections"></a></p>
<br>
<br>
# Sections

- <a href="#intro">Introduction to Beautiful Soup</a><br>
    - <a href="#web">What is web scraping</a><br>
    - <a href="#html">Introduction to HTML</a><br>
    - <a href="#beautiful">Basics of Beautiful Soup</a><br>

- <a href="#example">Examples</a><br>
    - <a href="#calendar">Python User Group Calendar</a><br>
    - <a href="#yelp">Scrape Yelp Reviews</a><br>

<p><a name="web"></a></p>
## What is web scraping?

- HTML is short for **HyperText Markup Language**. It's a language for presenting content on the Web.

- Plain text is turned into an HTML document by **tags** that are then interpreted by a browser.

- Using BeautifulSoup, you can easily extract the tag values from HTML source code.

### Beautiful Soup VS Regular Expressions

In [None]:
# the source code of hi.html
!cat data/hi.html
# Windows user
# !type data\hi.html

### Example:
- Extract the characters between the title tags. 


- In this case it's `Hi` (`<title>Hi</title>`).

- **Solution using Regular Expressions**

In [None]:
import re
hi_path = 'data/hi.html'
with open(hi_path, 'r') as f:
    hi = f.read()
    # Raw string for the pattern; the non-greedy .*? stops at the first closing tag
    print(re.findall(r'<title>(.*?)</title>', hi))

- **Solution using BeautifulSoup**

In [None]:
from bs4 import BeautifulSoup
with open(hi_path, 'r') as f:
    hi = f.read()
    hi = BeautifulSoup(hi, 'html.parser')
    print(hi.title) # find the title tag
    print(hi.title.string)  # find the value of tag

**Compared with regular expressions:**
    
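- Beautiful Soup's syntax is much simpler and tolerates variations in the markup (extra attributes, line breaks, nesting), while regular expressions are more flexible for matching arbitrary text patterns.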
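
To make the comparison concrete, here is a minimal sketch on a made-up HTML string (not the contents of `data/hi.html`). The `id="main"` attribute and the two-line title text are assumptions chosen for illustration: they are enough to break the simple pattern, while the Beautiful Soup call stays the same.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: the title tag carries an attribute and its text spans two lines.
html = '<html><head><title id="main">Hi\nthere</title></head><body></body></html>'

# The literal '<title>' no longer appears in the source, so the pattern finds nothing.
regex_result = re.findall(r'<title>(.*)</title>', html)
print(regex_result)  # prints []

# Beautiful Soup parses the tag regardless of its attributes or line breaks.
soup = BeautifulSoup(html, 'html.parser')
title_text = soup.title.string
print(repr(title_text))
```

The regular expression could be adapted to this case, but each new variation in the markup needs another adjustment to the pattern.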

<p><a name="html"></a></p>
## Introduction to HTML

### Tag

- The `<title>` tags in this example designate the enclosed text as the title to be displayed in the head of the browser tab.
![hi](pic/hi.png)

- Tags are always enclosed by `<` and `>` to distinguish them from the content. 
- A pair of tags consists of a start tag and an end tag that carry the same name; the end tag is preceded by a slash `/`.

### Values

Values are the content between start tags and end tags.

- **Example**

`<title>Hi</title>`: It's a title tag with a value of `Hi`.

### Attributes
Tags have another feature called attributes.

- **Example**

`<a href='http://www.crummy.com/software/BeautifulSoup/'>Hello, beautifulsoup!</a>`

This is an anchor tag `<a>` with an `href` attribute whose value is the URL http://www.crummy.com/software/BeautifulSoup/. It turns the enclosed text into a hyperlink that points to that address.

### Tree structure
- The first tag in the example is the `<html>` tag. 

- Between the `<html>` tags, several tags are opened and closed again: `<head>`, `<title>`, `<body>`, and `<a>`.

    - The `<head>` and `<body>` tags are directly enclosed by the `<html>` tag. 
    - The `<title>` tag is enclosed by the `<head>` tag.
    - The `<a>` tag is enclosed by the `<body>` tag.


- A good way to describe the multiple layers of an HTML document is the tree analogy. 
![html](pic/html.png)

- The `html` tag is the root tag that splits into two branches, `<head>` and `<body>`; `<head>` is followed by another branch called `<title>`; `<body>` is followed by another branch called `<a>`.
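
The tree can also be walked in code. The sketch below uses a made-up document with the same tag layout as the example (it does not read `data/hi.html`) and relies on Beautiful Soup's `.children` and `.parent` attributes, which this tutorial does not otherwise cover.

```python
from bs4 import BeautifulSoup

# A made-up document with the same tag layout as the example above.
html = ('<html><head><title>Hi</title></head>'
        '<body><a href="http://www.crummy.com/software/BeautifulSoup/">'
        'Hello, beautifulsoup!</a></body></html>')
soup = BeautifulSoup(html, 'html.parser')

# Direct children of the root tag: the two main branches.
branches = [child.name for child in soup.html.children]
print(branches)

# Each inner tag knows which tag encloses it.
title_parent = soup.title.parent.name
a_parent = soup.a.parent.name
print(title_parent, a_parent)
```

In a file with line breaks between tags, `.children` also yields the whitespace text nodes, so the list would contain `None` entries for them.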

<p><a name="beautiful"></a></p>
## Basics of Beautiful Soup

### Parse HTML

- The `prettify()` method adds indentation, which helps you see the tree structure of the HTML document.

In [None]:
from bs4 import BeautifulSoup
# open a local file and parse the plain text by BeautifulSoup directly
with open(hi_path, 'r') as f:
    hi = f.read()
    hi = BeautifulSoup(hi, 'html.parser')
    print(type(hi)) # get a bs4.BeautifulSoup object
    print('\n')
    print(hi.prettify())

### Names, Values, and Attributes

Beautiful Soup can extract the name, value, and attributes of a tag. The corresponding attributes of a tag object are:
- `name`
- `string`
- `attrs`

In [None]:
print("The name of the a tag is: ", hi.a.name)
print("The value of the a tag is: ", hi.a.string)
print("The attributes of the a tag are: ", hi.a.attrs)

### get_text() & get()
- For tags that have more than one child, `.string` does not work: it returns `None`.

In [None]:
print(hi.html.string)

- Use the `get_text()` method instead. It extracts all the text inside the tag, including the text of its child tags.

In [None]:
print(hi.html.get_text())
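
The difference between `.string` and `get_text()` is easiest to see on a small made-up snippet (an assumption for illustration, not one of the tutorial's data files):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b>!</p>', 'html.parser')

inner = soup.b.string     # the b tag has exactly one text child
outer = soup.p.string     # the p tag has three children, so this is None
full = soup.p.get_text()  # concatenates every piece of text inside p
print(inner, outer, full)
```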

- `get()` returns the value of one attribute of a tag. For example, we can get the `href` of the `a` tag using the following code.

- It is the same as running `hi.a.attrs` first and then looking up the key `href` in the resulting dictionary.

In [None]:
print(hi.a.get('href'))

In [None]:
print(hi.a.attrs)
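
One practical difference between the two approaches shows up when the attribute is absent. The sketch below uses a made-up anchor tag; the `title` attribute it asks for is deliberately missing.

```python
from bs4 import BeautifulSoup

html = '<a href="http://www.crummy.com/software/BeautifulSoup/">Hello</a>'
link = BeautifulSoup(html, 'html.parser').a

href = link.get('href')           # same value as link.attrs['href']
same_href = link.attrs['href']
missing = link.get('title')       # no such attribute: returns None
fallback = link.get('title', '')  # a default can be supplied instead
print(href, missing, repr(fallback))
```

Looking up a missing key with `link.attrs['title']` would raise a `KeyError`, so `get()` is the safer choice when an attribute is optional.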

### find() & find_all()
The methods `find()` and `find_all()` are flexible tools for finding tags.

In [None]:
!cat data/article.html
# Windows user
# !type data\article.html

![article](pic/article.png)

In [None]:
article_path = 'data/article.html'
with open(article_path, 'r') as f:
    article = f.read()
    article = BeautifulSoup(article, 'html.parser')

- Return only the first `p` tag.

In [None]:
print(article.p)

- `find()` returns the first `p` tag, which is equivalent to `article.p`.

In [None]:
print(article.find('p'))

- `find_all()` returns all p tags

In [None]:
print(article.find_all('p'))

- To find the tags that have specific attributes, you can pass a dictionary as the `attrs` argument.

In [None]:
print(article.find_all('h1', attrs={'id':'one'}))

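- You can also pass a function that takes a tag and returns `True` for the tags you want; `find_all()` then returns the list of Tag objects that match.
- The following call finds every tag whose `id` equals `one`, whatever its name: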

In [None]:
# the tags whose attribute id equals 'one'
print(article.find_all(lambda tag: tag.get('id') == 'one'))
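
It helps to know what the two methods return when nothing matches. A minimal sketch on a made-up snippet (not `data/article.html`):

```python
from bs4 import BeautifulSoup

html = '<h1 id="one">First</h1><p>a</p><p>b</p>'
soup = BeautifulSoup(html, 'html.parser')

first_h2 = soup.find('h2')      # no h2 tag in the snippet: None
all_h2 = soup.find_all('h2')    # no h2 tag in the snippet: empty list
all_p = soup.find_all('p')      # both p tags
by_id = soup.find_all(attrs={'id': 'one'})  # any tag with id="one"
print(first_h2, all_h2, len(all_p), by_id)
```

Because `find()` returns `None` on a miss, chaining a call such as `.string` onto it raises an `AttributeError`, which is why later examples guard against missing tags.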

<p><a name="example"></a></p>
## Examples

<p><a name="calendar"></a></p>
### Python User Group Calendar

Let's extract the time, location, and event titles from this web page [Python User Group Calendar](https://www.python.org/events/python-user-group/).

<img src=pic/events.png width=800/>

- In the examples so far, the HTML document was saved locally. In a real web scraping project, however, you would not download every page by hand before scraping it.
- The [Requests package](http://docs.python-requests.org/en/master/) we are using here is widely used and makes it easy to send HTTP requests from Python.
- The `get` method sends a GET request, one type of [HTTP request](https://www.tutorialspoint.com/http/http_methods.htm). It is most often used to retrieve information from a web server.

In [None]:
import requests
response = requests.get('https://www.python.org/events/python-user-group/')
text = BeautifulSoup(response.text, 'html.parser')

In [None]:
print(text.prettify())

#### Title
Titles are in `h3` tags with an attribute `class="event-title"`.
<img src=pic/title.png width=900/>

In [None]:
titleTags = text.find_all('h3', {'class': "event-title"})
titleTags

In [None]:
titleString = [tag.string for tag in titleTags]
titleString

#### Time
Times are in the `time` tags that have a `datetime` attribute.

![time](pic/time.png)

In [None]:
timeTags = text.find_all(lambda tag: 'datetime' in tag.attrs)
timeTags

In [None]:
timeString = [tag.get('datetime') for tag in timeTags]
timeString

#### Location
Locations are in `span` tags with the attribute `class="event-location"`.

<img src=pic/location.png width=900/>

In [None]:
locationTags = text.find_all("span", {"class": "event-location"})
locationTags

In [None]:
locationString = [tag.string for tag in locationTags]
locationString

### Web Scraping Project Workflow
- We have been lucky so far because there are no missing values on this page. But what if the location of one event is missing? The three lists would then have different lengths, and we could no longer tell which title, time, and location belong together.
- The general workflow of a web scraping project looks like this:
 - Find the unique attribute that locates the **top level** tags you are interested in.
     - Each tag could be a listing, review, item...
     - **one unique tag -> one row in csv file**
 - Here, we want to locate each event tag whose child tags contain the title, datetime, and location that we want to save as columns in a csv file.
 - Then go one level deeper to find the child tags of each event. If something is missing there, replace it with an empty string.
 - The events are `li` tags inside `ul` tags that have the attribute `class="list-recent-events menu"`.
 - The next question is: what is the best data structure to represent a single event?

In [None]:
# Save all the events in a list
result = []
# Save all the ul tags, each ul is a section of the page
uls = text.find_all('ul', {'class': 'list-recent-events menu'})
for ul in uls:
    # Save all the li tags, each li is an event
    lis = ul.find_all('li')
    for li in lis:
        # Initialize an empty dictionary for each event
        event = {}
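        # Use try/except to handle missing values: calling a method on a
        # missing tag (None) raises AttributeError
        try:
            title = li.find('a').string
        except AttributeError:
            # Skip list items that have no title link
            continue
        try:
            time = li.find('time').get('datetime')
        except AttributeError:
            time = ""
        try:
            location = li.find('span', {'class':'event-location'}).string.strip()
        except AttributeError:
            location = ""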
        
        # Assign the values in the dictionary
        event['location'] = location
        event['time'] = time
        event['title'] = title
        result.append(event)

In [None]:
result
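
A list of dictionaries maps directly onto the "one tag, one row" idea. As a sketch of the saving step, the standard library's `csv.DictWriter` can write such a list with a header row. The events below are made-up placeholders rather than scraped data, and `io.StringIO` stands in for a real file so the snippet runs without touching the disk.

```python
import csv
import io

# Made-up events in the same shape as the dictionaries built above.
sample_result = [
    {'title': 'Example Meetup', 'time': '2030-01-15T18:00:00+00:00', 'location': 'Example City'},
    {'title': 'Another Meetup', 'time': '', 'location': ''},  # missing values stay empty
]

# Stand-in for open('events.csv', 'w', newline='', encoding='utf-8')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['title', 'time', 'location'])
writer.writeheader()             # column names as the first row
writer.writerows(sample_result)  # one row per event dictionary
csv_text = buffer.getvalue()
print(csv_text)
```

Naming the columns in `fieldnames` fixes their order in the file, independent of the order in which the keys were assigned.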

<p><a name="yelp"></a></p>
## Scrape Yelp Reviews
- Let's apply what we have learned to a more complicated example: scraping Yelp reviews.
- Our task is to scrape all the reviews of the ABC Kitchen restaurant on Yelp: https://www.yelp.com/biz/abc-kitchen-new-york
- You can easily extend this code to all the restaurants.

### Step 1: Find the pattern of url

- Here we add a `User-Agent` field to the headers of our request, because web servers sometimes check header fields in order to block automated scrapers.
- `User-Agent` is the most commonly checked field because it identifies your browser.

In [None]:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get('https://www.yelp.com/biz/abc-kitchen-new-york', headers=headers)
text = BeautifulSoup(response.text, 'html.parser')

- If you go to the second page, you can see the URL becomes https://www.yelp.com/biz/abc-kitchen-new-york?start=20
- Similarly, the URL of the third page is https://www.yelp.com/biz/abc-kitchen-new-york?start=40
- But how do we find out the URL of the last page?

In [None]:
import re
num_reviews = text.find('span', attrs={'class': 'review-count rating-qualifier'}).string
num_reviews = int(re.findall(r'\d+', num_reviews)[0])
print(num_reviews)

In [None]:
url_list = []
for i in range(0, num_reviews, 20):
    url_list.append('https://www.yelp.com/biz/abc-kitchen-new-york?start='+str(i))
print(url_list[:10])
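
The same pagination logic can be wrapped in a small function to check how the review count maps to page URLs. The function name `page_urls` and the review count of 45 are assumptions for illustration; the page size of 20 comes from the `start=20` and `start=40` URLs above.

```python
def page_urls(base_url, num_reviews, per_page=20):
    """Build one URL per page of results; each page starts per_page reviews later."""
    return [f'{base_url}?start={start}' for start in range(0, num_reviews, per_page)]


# Hypothetical count: 45 reviews need three pages (starting at 0, 20 and 40).
urls = page_urls('https://www.yelp.com/biz/abc-kitchen-new-york', 45)
for url in urls:
    print(url)
```

The last page starts at the largest multiple of 20 below the review count, so no separate lookup of the final URL is needed.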

### Step 2: Find all the review divs on the page

In [None]:
reviews = text.find_all('div', attrs={'class':'review review--with-sidebar'})
print(len(reviews))

### Step 3: Scrape the detail information

For debugging purposes, we usually test the code on one review and then apply it to the others.

In [None]:
review = reviews[0]

# Username
username = review.find('a', attrs={'class': 'user-display-name js-analytics-click'}).string
print(username)

In [None]:
# Location
location = review.find('li', attrs={'class': 'user-location responsive-hidden-small'}).get_text()
print(location)

In [None]:
# Rating
rating = review.find('img', attrs={'class': 'offscreen'}).get('alt')
rating = float(re.findall(r'\d+', rating)[0])
print(rating)
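
The pattern `\d+` keeps only the first run of digits in the `alt` text. The sketch below assumes the text has the form `'4.0 star rating'` (the exact wording on the live page is not confirmed here) and compares it with a pattern that also keeps a decimal part, in case fractional ratings appear.

```python
import re

# Assumed alt-text format; the real page may word this differently.
alt_texts = ['4.0 star rating', '3.5 star rating']

# First run of digits only: the fractional part is dropped.
digits_only = [float(re.findall(r'\d+', alt)[0]) for alt in alt_texts]

# Digits with an optional decimal part.
with_decimals = [float(re.search(r'\d+(?:\.\d+)?', alt).group()) for alt in alt_texts]

print(digits_only)
print(with_decimals)
```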

In [None]:
# Date
date = review.find('span', attrs={'class': 'rating-qualifier'}).get_text()
print(date)

In [None]:
# Content
content = review.find('p').get_text()
print(content)

### Step 4: Apply to all the reviews and save them to a csv file

In [None]:
import csv
# On Windows, open() uses the system's default text encoding rather than UTF-8.
# Setting encoding='utf-8' explicitly avoids many encoding issues;
# newline='' lets the csv module control the line endings.
with open('reviews.csv', 'w', encoding='utf-8', newline='') as csvfile:
    review_writer = csv.writer(csvfile)
    for review in reviews:
        dic = {}
        username = review.find('a', attrs={'class': 'user-display-name js-analytics-click'}).string
        location = review.find('li', attrs={'class': 'user-location responsive-hidden-small'}).get_text().strip()
        date = review.find('span', attrs={'class': 'rating-qualifier'}).get_text().strip()
        rating = review.find('img', attrs={'class': 'offscreen'}).get('alt')
        rating = float(re.findall(r'\d+', rating)[0])
        content = review.find('p').text
        dic['username'] = username
        dic['location'] = location
        dic['date'] = date
        dic['rating'] = rating
        dic['content'] = content
        review_writer.writerow(dic.values())
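
The loop above assumes every review contains every field; if one `find()` call returns `None`, the chained call raises an `AttributeError` and the loop stops. One possible way to apply the "replace missing values with an empty string" rule is a small helper. The name `text_or_empty` and the simplified single-class markup below are hypothetical, chosen for illustration only.

```python
from bs4 import BeautifulSoup


def text_or_empty(parent, name, class_name):
    """Return the stripped text of the first matching child tag, or '' if none exists."""
    tag = parent.find(name, attrs={'class': class_name})
    return tag.get_text().strip() if tag is not None else ''


# Made-up review markup with a username but no location.
html = '<div class="review"><a class="user-display-name"> Ann </a></div>'
review = BeautifulSoup(html, 'html.parser').div

username = text_or_empty(review, 'a', 'user-display-name')
location = text_or_empty(review, 'li', 'user-location')
print(repr(username), repr(location))
```

With such a helper, every review still produces one complete row, and missing fields show up as empty cells in the csv file.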

### Step 5: Apply to all the pages

In [None]:
import time
import random


def scrape_single_page(reviews, csvwriter):
    for review in reviews:
        dic = {}
        username = review.find('a', attrs={'class': 'user-display-name js-analytics-click'}).text
        location = review.find('li', attrs={'class': 'user-location responsive-hidden-small'}).text.strip()
        date = review.find('span', attrs={'class': 'rating-qualifier'}).text.strip()
        rating = review.find('img', attrs={'class': 'offscreen'}).get('alt')
        rating = float(re.findall(r'\d+', rating)[0])
        content = review.find('p').text
        dic['username'] = username
        dic['location'] = location
        dic['date'] = date
        dic['rating'] = rating
        dic['content'] = content
        csvwriter.writerow(dic.values())
    

with open('reviews.csv', 'w', encoding='utf-8', newline='') as csvfile:
    review_writer = csv.writer(csvfile)
    for index, url in enumerate(url_list):
        response = requests.get(url, headers=headers)
        text = BeautifulSoup(response.text, 'html.parser')
        reviews = text.find_all('div', attrs={'class':'review review--with-sidebar'})
        scrape_single_page(reviews, review_writer)
        # Random sleep to avoid getting banned from the server
        time.sleep(random.randint(1,3))
        # Log the progress
        print('Finished page ' + str(index + 1))