# XPaths and Selectors

> Leverage XPath syntax to explore Scrapy selectors. Both of these concepts will move you towards being able to scrape an HTML document. This is a summary of the lecture "Web Scraping in Python", via DataCamp.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Datacamp]
- image: 

In [2]:
from scrapy import Selector

## XPathology
- Slashes and Brackets
    - Single forward slash `/` looks forward one generation
    - Double forward slash `//` looks forward through all future generations
    - Square brackets `[]` help narrow in on specific elements

## Off the Beaten XPath


## Selector Objects


### XPath Chaining
`Selector` and `SelectorList` objects allow for chaining when using the `xpath` method. This means you can call `xpath` again on the result of a previous `xpath` call. For example, if `sel` is the name of our `Selector`, then
```
sel.xpath('/html/body/div[2]')
```
is the same as
```
sel.xpath('/html').xpath('./body/div[2]')
```
which is also the same as
```
sel.xpath('/html').xpath('./body').xpath('./div[2]')
```
The only catch is that you need to "glue together" the XPath pieces by starting each subsequent XPath string with a period, which makes that path relative to the elements already selected (notice the periods in the XPath strings in the examples above).

In [6]:
html = '''
<html>
    <body>
        <div>HELLO</div>
        <div>
            <p>GOODBYE</p>
        </div>
        <div>
            <span>
                <p>NOPE</p>
                <p>ALMOST</p>
                <p>YOU GOT IT!</p>
            </span>
        </div>
    </body>
</html>
'''

sel = Selector(text=html)

In [11]:
# Chain together xpath methods to select desired p element
sel.xpath('//div').xpath('./span/p[3]')

[<Selector xpath='./span/p[3]' data='<p>YOU GOT IT!</p>'>]

### Divvy Up This Exercise
You will use the `html` variable below as the HTML document to set up a `Selector` object, then create a `SelectorList` that selects all `div` elements. Finally, you will check your understanding of what the `SelectorList` contains.

In [12]:
html = '''
<html>
    <body>
        <div>Div 1: <p>paragraph 1</p></div>
        <div>Div 2: <p>paragraph 2</p> <p>paragraph 3</p> </div>
        <div>Div 3: <p>paragraph 4</p> <p>paragraph 5</p> <p>paragraph 6</p></div>
        <div>Div 4: <p>paragraph 7</p></div>
        <div>Div 5: <p>paragraph 8</p></div>
    </body>
</html>
'''

In [15]:
# Create a Selector selecting html as the HTML document
sel = Selector(text=html)

# Create a SelectorList of all div elements in the HTML document
divs = sel.xpath('//div')
divs

[<Selector xpath='//div' data='<div>Div 1: <p>paragraph 1</p></div>'>,
 <Selector xpath='//div' data='<div>Div 2: <p>paragraph 2</p> <p>par...'>,
 <Selector xpath='//div' data='<div>Div 3: <p>paragraph 4</p> <p>par...'>,
 <Selector xpath='//div' data='<div>Div 4: <p>paragraph 7</p></div>'>,
 <Selector xpath='//div' data='<div>Div 5: <p>paragraph 8</p></div>'>]

## The Source of the Source


### Requesting a Selector
Your task is to create a `Selector` object `sel` using the HTML source code stored in `html`.

In [16]:
import requests

url = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

# Fetch the HTML source into html (requests returns .content as bytes)
html = requests.get(url).content

# Create the Selector object sel from html
sel = Selector(text=html)

# Print out the number of elements in the HTML document
print("There are 1020 elements in the HTML document.")
print("You have found: ", len(sel.xpath('//*')))

There are 1020 elements in the HTML document.
You have found:  1020
