# <center>Web Scraping</center>

References: 
 - https://www.dataquest.io/blog/web-scraping-tutorial-python/
 - https://www.crummy.com/software/BeautifulSoup/bs4/doc/#

## 1. Different ways to access data on the web
 - Scrape HTML web pages 
 - Download data file directly 
    * data files such as csv, txt
    * pdf files
 - Access data through Application Programming Interface (API), e.g. The Movie DB, Twitter
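
As a rough illustration of the last two options, the sketch below parses CSV text and a JSON API response with Python's standard library. The two strings are made-up stand-ins (assumptions, not data from a real site); in practice they would come from something like `requests.get(url).text`.

```python
import csv
import io
import json

# Hypothetical stand-ins for downloaded content (not real data)
csv_text = "city,country\nLondon,UK\nParis,France\n"
json_text = '{"title": "Example Movie", "year": 2020}'

# A downloaded CSV file can be read row by row as dictionaries
rows = list(csv.DictReader(io.StringIO(csv_text)))

# An API typically returns JSON, which maps to Python dicts and lists
record = json.loads(json_text)

print(rows[0]["city"])
print(record["title"])
```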
 

## 2.  Basic structure of HTML ##

* HTML tags: <font color="green">head</font>, <font color="green">body</font>, <font color="green">p</font>, <font color="green">a</font>, <font color="green">form</font>, <font color="green">table</font>, ...
* A tag may have properties. 
  * For example, tag <font color="green">a</font> has property (or attribute) <font color="green">href</font>, the target of the link
  *  <font color="green">class</font> and <font color="green">id</font> are special properties used by html to control the style of each element through Cascading Style Sheets (CSS). <font color="green">id</font> is the unique identifier of an element, and <font color="green">class</font> is used to group elements for styling. 
      - An element can be associated with multiple classes. These classes are separated by space, e.g. `<h2 class="city main">London</h2>`
      - Very often, we can scrape by class (e.g. all `city` names) if CSS is used in the page
      - For an illustrative example, check https://www.w3schools.com/html/tryit.asp?filename=tryhtml_classes_multiple
* A tag can be referenced by its position relative to other tags 
  * **child** – a child is a tag inside another tag, e.g. the two <font color="green">p</font> tags are children of the <font color="green">div</font> tag.
  * **parent** – a parent is the tag another tag is inside, e.g. the <font color="green">html</font> tag is the parent of the <font color="green">body</font> tag.
  * **sibling** – a sibling is a tag that has the same parent as another tag, e.g. in the sample html, the <font color="green">head</font> and <font color="green">body</font> tags are siblings, since they’re both directly inside <font color="green">html</font>. The two outer <font color="green">p</font> tags are also siblings, since they’re both directly inside <font color="green">body</font>.

* Sample HTML source code: http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html
* An HTML document can be viewed as a tree structure

<img src="html.png">

Web page displayed:
--------------------

First paragraph.

Second paragraph.

**First outer paragraph.**

**Second outer paragraph.**
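
The tree view of the sample page can also be printed with a few lines of code. The sketch below uses Python's built-in `html.parser` and indents each tag by its depth. The `TreePrinter` class is a hypothetical helper written for this illustration, and it assumes every tag is explicitly closed (void tags such as `<br>` would need extra handling).

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collect tag names, indented by their depth in the document tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # a start tag opens a new node one level below its parent
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # an end tag closes the node and moves back up to the parent
        self.depth -= 1


# Shortened version of the sample page
SAMPLE = (
    "<html><head><title>A simple example page</title></head>"
    "<body><div><p>First paragraph.</p><p>Second paragraph.</p></div></body></html>"
)

printer = TreePrinter()
printer.feed(SAMPLE)
print("\n".join(printer.lines))
```

Tags printed at the same indentation under the same parent are siblings, e.g. `head` and `body`, or the two `p` tags.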

## 3. Access HTML Elements: CSS Selector and XPATH 

### 3.1. CSS Selector 
- Most HTML pages are styled using CSS. 
- To identify elements (e.g. `<h2 class="city main">London</h2>`) on a page by their style, select the class (e.g. `city`) they belong to. 
- A CSS selector allows you to pick out the elements associated with CSS styles. 
- Advantages of using CSS selectors
  * They are often faster than XPath, although this depends on the library or browser.
  * They are generally easier to learn and use.
  * You have a good chance of finding your elements, since styled elements usually carry explicit class names or ids.

### 3.2. XPath 
- XPath stands for XML Path Language -- a query language for identifying elements in an XML document (HTML can usually be treated as XML too)
- It uses path expressions that navigate from a starting point to the intended element, like following a path through the document tree, e.g.
  - `/`: root of the document
  - `//`: match at any depth (anywhere in the document when used at the start of an expression)
  - `@`: attribute
  - `/html//p`: retrieve all `p` tags under `html` from the root
  - `//p[@id="first"]`: retrieve all `p` tags with an attribute `id` with value `first`
  
- Advantages:
   - Flexible
   - Supports pattern matching in element selection
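
As a small sketch of these expressions, the example below uses Python's built-in `xml.etree.ElementTree`, which supports only a limited subset of XPath and requires relative paths (hence `.//` instead of `//`). It assumes the page is well-formed XML, which holds for the shortened sample here but not for HTML in general.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the sample page
SAMPLE = """<html>
  <body>
    <div>
      <p class="inner-text first-item" id="first">First paragraph.</p>
      <p class="inner-text">Second paragraph.</p>
    </div>
    <p class="outer-text first-item" id="second">First outer paragraph.</p>
    <p class="outer-text">Second outer paragraph.</p>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# all p tags at any depth below the root
all_p = root.findall(".//p")

# p tags whose id attribute equals "first"
first = root.findall(".//p[@id='first']")

# the first p child of each div (XPath positions start at 1, not 0)
first_in_div = root.findall(".//div/p[1]")

print(len(all_p))
print(first[0].text.strip())
print(first_in_div[0].get("id"))
```

For full XPath support on real-world HTML, a library such as lxml or scrapy is the usual choice.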


### 3.3 Examples of CSS Selector and XPath

| Goal|CSS Selector|XPath| 
|:-----|:-----|:-----|
|All `p` elements|`p`|`//p` |  
|All `p` descendants  under `body` | `body p`| `//body//p`|  
|Element (e.g. `p`) By ID (e.g. `first`) |`p#first`|`//p[@id='first']`|    
|Element (e.g. `p`) By Class (e.g. `inner-text`)|`p.inner-text`|`//p[contains(@class,'inner-text')]`|
|Element (e.g. `p`) with Attribute (e.g. `name`)|`p[name=xyz]`|`//p[@name='xyz']`|    
|All child elements under `p`|`p>*`|`//p/*`|          
|First child of all `p`|`p>*:first-child`|`//p/*[1]`|                     
|All `p` with an `a` child|Not possible in CSS 3 (`p:has(> a)` in Selectors Level 4)|`//p[a]` |                      
|Next element|`p + *`|`//p/following-sibling::*[1]`|
|Previous element|Not possible in CSS 3|`//p/preceding-sibling::*[1]`|

- For details of XPath, see https://www.w3schools.com/xml/xpath_syntax.asp
- For details of CSS Selector, see https://www.w3schools.com/cssref/css_selectors.asp

## 4. Steps to scrape HTML web pages 

  1. Preparation
     * Install modules <font color="green">requests</font>, <font color="green">BeautifulSoup4/scrapy/selenium/...</font>.
        * **requests**: allows you to send HTTP/1.1 requests using Python. To install:
           - Open terminal (Mac) or Anaconda Command Prompt (Windows)
           - Issue: `pip install requests`
        * **BeautifulSoup**: a web page parsing library. To install, use: `pip install beautifulsoup4`
     
  2. Use the <font color="green">**requests**</font> library to retrieve the source code
  3. View the **source code of the web page** to find the html elements that you will scrape
      * **Firefox**: right click on the web page and select "view page source"
      * **Safari**: follow the instructions at http://ccm.net/faq/33026-safari-view-the-source-code-of-a-webpage
      * **Internet Explorer**: see the instructions at https://www.computerhope.com/issues/ch000746.htm
  4. Use libraries to parse the source code. Available libraries:
      * <font color="green">BeautifulSoup</font>: simple; supports CSS selectors but not XPath
      * <font color="green">scrapy (https://scrapy.org/)</font>: supports CSS selectors and XPath
      * <font color="green">Selenium</font>: can scrape dynamic web pages
      * <font color="green">lxml</font>: another good library for web page scraping
      * ...

## 5. Scrape the sample html using BeautifulSoup ###
- Kinds of Objects in BeautifulSoup
  * <font color="green">**Tag**</font>: an XML or HTML tag
  * <font color="green">**Name**</font>: every tag has a name
  * <font color="green">**Attributes**</font>: a tag may have any number of attributes. A tag's attributes are exposed as a **dictionary** in the form of {attribute1_name:attribute1_value, attribute2_name:attribute2_value, ...}. If an attribute can hold multiple values (e.g. `class`), the value is stored as a list
  * <font color="green">**NavigableString**</font>: the text within a tag
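
A minimal sketch of these object kinds, using the one-line `h2` example from Section 2 with an extra `id` attribute added purely for illustration (the `id` value is an assumption, not part of the sample page):

```python
from bs4 import BeautifulSoup

# One-line document; the id attribute is added here only for illustration
html = '<h2 class="city main" id="london">London</h2>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.h2                        # Tag object for the first h2
tag_type = type(tag).__name__        # kind of object
string_type = type(tag.string).__name__

print(tag_type)      # Tag
print(tag.name)      # name of the tag
print(tag.attrs)     # attributes as a dictionary; class is a list
print(string_type)   # NavigableString
print(tag.string)    # text within the tag
```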

### 5.1 Retrieve HTML and Navigate the HTML 

In [1]:
# Exercise 5.1.1 Import requests and beautifulsoup packages

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import requests package
import requests                   

# import BeautifulSoup from package bs4 (i.e. beautifulsoup4)
from bs4 import BeautifulSoup     

In [4]:
# Exercise 5.1.2 Get web page content

# send a get request to the web page
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")    

# status_code 200 indicates success. 
# 4xx and 5xx status codes indicate a failure
if page.status_code==200:   
    
    # content property gives the content returned in bytes 
    print(page.content)  # text in bytes
    print(page.text)     # text in unicode

b'<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <div>\n            <p class="inner-text first-item" id="first">\n                First paragraph.\n            </p>\n            <p class="inner-text">\n                Second paragraph.\n            </p>\n        </div>\n        <p class="outer-text first-item" id="second">\n            <b>\n                First outer paragraph.\n            </b>\n        </p>\n        <p class="outer-text">\n            <b>\n                Second outer paragraph.\n            </b>\n        </p>\n    </body>\n</html>'
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
        <p class="outer-text first-item" id="second">
            <b>
           

**Basics of HTTP (Hypertext Transfer Protocol)**
- HTTP is designed to enable communications between clients (e.g. browser) and servers (e.g. Apache web server).
- A client submits an HTTP **request** to the server; then the server returns a **response** to the client. 
- Two commonly used methods for a request-response between a client and server:
  - GET - Requests data from a specified resource
  - POST - Submits data to be processed to a specified resource
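
To make the difference between the two methods concrete, the sketch below builds a GET and a POST request with `requests` without sending them. The `example.com` URLs and the parameter names are placeholders chosen for illustration.

```python
import requests

# GET: parameters travel in the URL (query string)
get_req = requests.Request(
    "GET", "https://example.com/search", params={"q": "python"}
).prepare()

# POST: data travels in the request body
post_req = requests.Request(
    "POST", "https://example.com/submit", data={"name": "alice"}
).prepare()

print(get_req.method, get_req.url)
print(post_req.method, post_req.url, post_req.body)
```

Calling `requests.get(...)` or `requests.post(...)` prepares the same kind of request and also sends it to the server.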

In [5]:
# Exercise 5.1.3 Parse web page content

# Process the returned content using beautifulsoup module

# initiate a beautifulsoup object using the html source and Python’s html.parser
soup = BeautifulSoup(page.content, 'html.parser')  

# soup object stands for the **root** 
# node of the html document tree

print("Soup object:")


# print soup object nicely
print(soup.prettify())                             


Soup object:
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>


In [6]:
# soup.children returns an iterator over all child nodes
print("\nsoup children nodes:")
soup_children=soup.children
print(soup_children)

# convert to list
soup_children=list(soup.children)
print("\nlist of children of root:")
print(len(soup_children))


soup children nodes:
<list_iterator object at 0x0000021A5542AA90>

list of children of root:
1


In [7]:
                    
# html is the only child of the root node
html=soup_children[0]    

html

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

In [8]:
# Exercise 5.1.4 Get head and body tag

html_children=list(html.children)

print("how many children under html? ", len(html_children))

for idx, child in enumerate(html_children):
    print("Child {} is: {}\n".format(idx, child))

how many children under html?  5
Child 0 is: 


Child 1 is: <head>
<title>A simple example page</title>
</head>

Child 2 is: 


Child 3 is: <body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>

Child 4 is: 




In [9]:
# head is the second child of html
head=html_children[1]

# extract all text inside head
print("\nhead text:")
print(head.get_text())

# body is the fourth child of html
body=html_children[3]


head text:

A simple example page



In [10]:
# Exercise 5.1.5 Continue the navigation through html document tree 

# Task 1. get div tag inside body. 
# div is the second child of body


# Task 2: get the first p in div (2nd child of div)


In [11]:
# Exercise 5.1.6 Get details of a tag

# get the first p tag in the div of body
div=list(body.children)[1]
p=list(div.children)[1]
p

# get the details of p tag
# first, get the data type of p
print("\ndata type:")
print(type(p))
# get tag name (property of p object)
print ("\ntag name: ")     
print(p.name)

<p class="inner-text first-item" id="first">
                First paragraph.
            </p>


data type:
<class 'bs4.element.Tag'>

tag name: 
p


In [12]:
# a tag's attributes are stored in a dictionary. 
# use <tag>.attrs to get the dictionary
# each attribute name of the tag is a key

# get all attributes 
p.attrs

# get "class" attribute
print ("\ntag class: ")
print(p["class"])

# how to determine if 'id' is an attribute of p?

# get text of p tag
p.get_text()


{'class': ['inner-text', 'first-item'], 'id': 'first'}


tag class: 
['inner-text', 'first-item']


'\n                First paragraph.\n            '
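
One way to answer the question in the cell above (whether `id` is an attribute of a tag) is sketched below on a two-paragraph snippet taken from the sample page. Either `has_attr()` or a membership test on `attrs` works; `get()` returns `None` instead of raising an error when the attribute is missing.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="inner-text first-item" id="first">First paragraph.</p>'
    '<p class="inner-text">Second paragraph.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

# check whether a tag has an id attribute
print(first_p.has_attr("id"))     # first p has an id
print("id" in second_p.attrs)     # second p does not

# get() avoids a KeyError when the attribute is missing
print(second_p.get("id"))

# strip=True removes the surrounding whitespace from the text
print(first_p.get_text(strip=True))
```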

### 5.2. Navigating the html document tree ###
 
* Going down
  * <font color="green">**contents**</font>: get a tag's direct children as a **list**
  * <font color="green">**children**</font>: get a tag's direct children as an **iterator**
  * <font color="green">**descendants**</font>:  get an iterator over all of a tag's descendants, including direct children, the children of its direct children, and so on
* Going up
  * <font color="green">**parent**</font>: get a tag's parent
  * <font color="green">**parents**</font>: get an iterator for a tag's ancestors, from the parent to the very top of the document
* Going sideways
  * <font color="green">**next_sibling**</font>: get a tag's next sibling
  * <font color="green">**previous_sibling**</font>: get a tag's previous sibling
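
The properties above can be tried on a small snippet. This is a sketch on a made-up fragment written without line breaks, so that no whitespace text nodes appear between the tags (on the real sample page, `'\n'` strings show up as siblings and children).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    "<div><p id='first'>First paragraph.</p>"
    "<p id='second'>Second <b>bold</b> paragraph.</p></div>"
)
soup = BeautifulSoup(html, "html.parser")
div = soup.div
first_p = div.contents[0]          # contents is a list of direct children

# Going down
child_names = [child.name for child in div.children]
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]

# Going up
parent_name = first_p.parent.name
ancestor_names = [a.name for a in first_p.parents]

# Going sideways
next_id = first_p.next_sibling["id"]
previous = first_p.previous_sibling   # None: first_p has no previous sibling

print(child_names)
print(descendant_tags)
print(parent_name, ancestor_names)
print(next_id, previous)
```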

In [13]:
# Exercise 5.2.1  get siblings of p object
print("\nget siblings of the first p tag")
print(list(p.next_siblings))

# get next p tag within the div
# get the sibling next to the next sibling of p
print("\nget the 2nd p tag")
print(p.next_sibling.next_sibling)


get siblings of the first p tag
['\n', <p class="inner-text">
                Second paragraph.
            </p>, '\n']

get the 2nd p tag
<p class="inner-text">
                Second paragraph.
            </p>


* `find_all()`: Looks through a tag’s descendants and retrieves all descendants that match the given filters (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all). A few examples:
    * By tag name: `soup.find_all("title")`
    * By list of tag names: `soup.find_all(["p", "title"])`
    * By attribute: `soup.find_all(class_="inner-text")`
    * By multiple attributes: `soup.find_all(attrs={"class": "inner-text", "id":"first"})`

In [14]:
soup.find_all("title")

[<title>A simple example page</title>]

In [15]:
soup.find_all(["b", "title"])

[<title>A simple example page</title>,
 <b>
                 First outer paragraph.
             </b>,
 <b>
                 Second outer paragraph.
             </b>]

In [16]:
# Class is an attribute. 
# Since "class" is a reserved word in Python, 
# use "class_" here
soup.find_all(class_="inner-text")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]

In [17]:
soup.find_all(attrs={"class": "inner-text", "id":"first"})

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

### 5.3.  Select tags into a list by CSS Selectors: select ###
* CSS selectors are patterns originally used by the CSS language to select the HTML elements to style. 
* They can also be used to **choose html elements by node path**
* List of patterns can be found at https://www.w3schools.com/cssref/css_selectors.asp
* Some examples:
  1. `div p` – finds all <font color="green">p</font> tags inside a <font color="green">div</font> tag.
  2. `body p b` – finds all <font color="green">b</font> tags inside <font color="green">p</font> tags within a <font color="green">body</font> tag
  3. `p.outer-text` – finds all <font color="green">p</font> tags with a <font color="green">class</font> of **outer-text**.
  4. `p#first` – finds all <font color="green">p</font> tags with an <font color="green">id</font> attribute of **first**
  5. `p[class=outer-text]` – finds all <font color="green">p</font> tags with a class attribute that is **exactly** "outer-text" (no other class). Note [ ] is the generic way to define a filter on any attribute. "." is just for "class" attribute.
  6. `p[class~=outer-text]` – finds all <font color="green">p</font> tags  with a class attribute that **contains** a value "outer-text" (it may contain other values too, equivalent to p.outer-text). 
  7. `body p.outer-text b` – finds any <font color="green">b</font> tags within <font color="green">p</font> tags with a <font color="green">class</font> of **outer-text** inside of a <font color="green">body</font> tag.
  8. `div, p` – finds all <font color="green">div</font> and <font color="green">p</font> tags (without nesting relationships). Compare it with example #1!
  9. `p.outer-text.first-item` – finds all <font color="green">p</font> tags  with **both class attribute "outer-text" and "first-item"**.
  10. What about finding all p with class "outer-text" but not class "first-item"?
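
A few of these patterns are sketched below on a shortened copy of the sample page. The `texts` helper is hypothetical and only keeps the output compact. For item 10, one possible approach is the `:not()` pseudo-class, which recent BeautifulSoup versions support through the Soup Sieve package.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample page
SAMPLE = """<html>
<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
</html>"""

soup = BeautifulSoup(SAMPLE, "html.parser")


def texts(selector):
    """Return the stripped text of every tag matched by the selector."""
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


inside_div = texts("div p")                       # p descendants of div
exact_class = texts("p[class=outer-text]")        # class is exactly outer-text
contains_class = texts("p[class~=outer-text]")    # class list contains outer-text
both_classes = texts("p.outer-text.first-item")   # both classes present
outer_only = texts("p.outer-text:not(.first-item)")  # outer-text without first-item

print(inside_div)
print(exact_class)
print(contains_class)
print(both_classes)
print(outer_only)
```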

In [18]:
# Exercise 5.3.1: select p tags within the body tag
# Notice the space between body and p
# This means p is a **descendant** of body 
# p is not necessarily a direct child of body
soup.select("body p")


[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [19]:
# Exercise 5.3.2.: select b tags within p tags in the body

In [20]:
# Exercise 5.3.3.: finds all p tags with a class of outer-text
soup.select("p.outer-text")


[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [21]:
# Exercise 5.3.4.: select p tags with id "first"
soup.select("p#first")


[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

In [22]:
# Exercise 5.3.5.:find p tag within body and
# with a class attribute which is **exactly** "outer-text"

# Note: this is the generic way to set  
# a filter on any attribute

soup.select("body p[class=outer-text]")


# compare the result with Exercise 5.3.3.
# how to select a link (i.e. tag "a") with a specific target, 
#   for example, http://www.stevens.edu?

[<p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [23]:
# Exercise 5.3.6. find p tags within body 
# which have a class attribute **containing** a value "outer-text"
# Note the use of "~". 

soup.select("body p[class~=outer-text]")

# This is equivalent to soup.select("body p.outer-text")
# However, it's a generic way to set condition 
# on any type of attributes, not just "class" attribute

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [24]:
# Exercise 5.3.7. select b tags within 
# p tags which have a class outer-text and 
# are within the body tag



In [25]:
# Exercise 5.3.8. select all div and p tags 
# Compare the result with Exercise 5.3.1.
# "," between selectors means "or" (the union of both matches), 
# while " " (space) between tags means "descendant"

soup.select("div, p")

[<div>
 <p class="inner-text first-item" id="first">
                 First paragraph.
             </p>
 <p class="inner-text">
                 Second paragraph.
             </p>
 </div>,
 <p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [None]:
# Exercise 5.3.9. select p tags 
# with two classes: outer-text and first-item
soup.select("p.outer-text.first-item")

# what if another class, say "xxx" also required?

In [None]:
# Exercise 5.3.10 finding all p tags with class "outer-text" 
# but not class "first-item"


### 5.4.  Example: downloading weather forecast for the next week for New York City ###
- Instruction:
    1. Open web site http://forecast.weather.gov/MapClick.php?lat=40.7146&lon=-74.0071#.WXi6hlGQzIU and inspect page source
    2. Find "Extended Forecast for" in the source code
    3. Extract div tags in this section using "div#seven-day-forecast-body div ul li div.tombstone-container"
       * Notice that the div under "Extended Forecast for" is what we need
       * Follow the path to weather forecast for each period
       <img src='weather.png' width='100%'>
    4. For each div tag, extract text in different p tags and represent the result as a tuple, e.g. ("Today", "Mostly Sunny", "High: 75F"). Save the 7-day forecast as a list and print the list 

In [26]:
# Exercise 5.4.1. downloading weather forecast for the next week for New York City 

# send a get request to the web page
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=40.7146&lon=-74.0071#.WXi6hlGQzIU")
rows=[]

# status_code 200 indicates success. 
# 4xx and 5xx status codes indicate a failure 
if page.status_code==200:        
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # find a block with id='seven-day-forecast-body'
    # follow the path down to the div for each period
    divs=soup.select("div#seven-day-forecast-body "
                     "div ul li div.tombstone-container")
    # for testing you can print the number of divs found
    #print(len(divs))
    
    for idx, div in enumerate(divs):
        # for testing you can print idx, div
        #print(idx, div)
        
        # initiate the variable for each period
        title=None
        desc=None
        temp=None
        
        # get title
        p_title=div.select("p.period-name")
        
        # test if "period-name" indeed exists
        # before you get the text
        # (select returns an empty list, which is falsy, if nothing matches)
        if p_title:
            title=p_title[0].get_text()
        
        # get description
        p_desc=div.select("p.short-desc")
        if p_desc:
            desc=p_desc[0].get_text()
        
        # get temperature
        p_temp=div.select("p.temp")
        if p_temp:
            temp=p_temp[0].get_text()
            
        # add title, description, and temperature as a tuple into the list
        rows.append((title, desc, temp))
        print((title, desc, temp))


('ThisAfternoon', 'Slight ChanceShowers', 'High: 73 °F')
('Tonight', 'Slight ChanceShowers', 'Low: 71 °F')
('Wednesday', 'ChanceShowers', 'High: 76 °F')
('WednesdayNight', 'ChanceShowers', 'Low: 75 °F')
('Thursday', 'ShowersLikely', 'High: 74 °F')
('ThursdayNight', 'ShowersLikely', 'Low: 64 °F')
('Friday', 'ShowersLikely', 'High: 74 °F')
('FridayNight', 'ChanceShowers thenPartly Cloudy', 'Low: 61 °F')
('Saturday', 'Mostly Sunny', 'High: 74 °F')


## 6. Scrapy and XPath
- Another excellent library for web scraping
- It supports both XPath and CSS selector
- For details, check: https://docs.scrapy.org/en/latest/intro/tutorial.html

In [30]:
from scrapy import Selector
import requests

page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")    

# status_code 200 indicates success. 
# 4xx and 5xx status codes indicate a failure
if page.status_code==200:   
    
    sel = Selector(text = page.text)



In [28]:
# soup.select("body p")


sel.xpath('//body//p').extract()  # XPath selector

sel.css("body p").extract()  # CSS selector


In [29]:
# soup.select("p.outer-text")
# soup.select("body p[class~=outer-text]")

sel.xpath('//p[contains(@class,"outer-text")]').extract()

sel.css("p.outer-text").extract()


In [None]:
# soup.select("p#first")

sel.xpath('//p[@id="first"]').extract()

sel.css("p#first").extract()

In [None]:
# soup.select("body p[class=outer-text]")

sel.xpath('//body//p[@class="outer-text"]').extract()

sel.css("body p[class=outer-text]").extract()

In [None]:
# soup.select("p.outer-text.first-item")

sel.xpath('//body//p[contains(@class,"outer-text") and \
                     contains(@class,"first-item")]').extract()

sel.css("p.outer-text.first-item").extract()