# CSS Locators, Chaining, and Responses

> Learn CSS Locator syntax and begin chaining CSS Locators together with XPath. We also introduce Response objects, which behave like Selectors but give us extra tools to mobilize our scraping efforts across multiple websites. This is a summary of the lecture "Web Scraping in Python" from DataCamp.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Datacamp]
- image: 

In [1]:
from scrapy import Selector

## From XPath to CSS
- Rosetta CSStone
    - `/` replaced by `>` (except first character)
        - XPath: `/html/body/div`
        - CSS Locator: `html > body > div`
    - `//` replaced by a blank space (except first character)
        - XPath: `//div/span//p`
        - CSS Locator: `div > span p`
    - `[N]` replaced by `:nth-of-type(N)`
        - XPath: `//div/p[2]`
        - CSS Locator: `div > p:nth-of-type(2)`
- Attributes in CSS
    - To find an element by class, use a period `.`
    - To find an element by id, use a pound sign `#`

### The (X)Path to CSS Locators
Many people prefer using CSS Locator notation to XPath notation. As we will see later, it often makes attribute selection very easy. To help get you more comfortable going back and forth between XPath and CSS Locator strings, we give you a chance in this exercise to do some direct "translation" between the two.

In [2]:
# Create the XPath string equivalent to the CSS Locator
xpath = '/html/body/span[1]//a'

# Create the CSS Locator string equivalent to the XPath
css_locator = 'html > body > span:nth-of-type(1) a'

# Create the XPath string equivalent to the CSS Locator
xpath = '//div[@id="uid"]/span//h4'

# Create the CSS Locator string equivalent to the XPath
css_locator = 'div#uid > span h4'

### Get an "a" in this Course
We have loaded the HTML from a secret website which you will use to set up a `Selector` object and the function `how_many_elements()`. When passing this function a CSS Locator string, it will print out the number of elements that the CSS Locator you wrote has selected.

In the second part of this problem, we want you to create a CSS Locator string which will select a certain collection of elements as described here: select the hyperlink (`a` element) children of all `div` elements belonging to the class `"course-block"` (that is, any `div` element with a class attribute such that `"course-block"` is one of the classes assigned). The number of such elements is 11, so you can check your solution with `how_many_elements` if you choose.

In [13]:
def how_many_elements( css ):
    sel = Selector( text = data )
    print( len(sel.css( css )) )

In [14]:
with open('./dataset/all.html', 'r') as file:
    data = file.read().replace('\n', '')

In [21]:
# Create a selector from the html (of a secret website)
sel = Selector(text=data)

# Fill in the blank
css_locator = 'div.course-block > a'

# Print the number of selected elements
how_many_elements(css_locator)

11


### The CSS Wildcard
You can use the wildcard `*` in CSS Locators too! In fact, we can use it in a similar way, when we want to ignore the tag type. For example:

- The CSS Locator string `'*'` selects all elements in the HTML document.
- The CSS Locator string `'*.class-1'` selects all elements which belong to `class-1`, but this is unnecessary since the string `'.class-1'` will also do the same job.
- The CSS Locator string `'*#uid'` selects the element with `id` attribute equal to `uid`, but this is unnecessary since the string `'#uid'` will also do the same job.

In this exercise, we want you to work by analogy with the wildcard character you know from XPath notation to discover how to select all the children of a certain element in CSS Locator notation.

In [22]:
# Create the CSS Locator to all children of the element whose id is uid
css_locator = '#uid > *'

# Print the number of selected elements
how_many_elements(css_locator)

0


## CSS Attributes and Text Selection


### You've been `href`ed
In a previous exercise, you created a CSS Locator string to select the hyperlink (`a` element) children of all `div` elements belonging to the class `"course-block"`. Here we have created a `SelectorList` called `course_as` having selected those hyperlink children.

Now, we want you to fill in the blank below to extract the `href` attribute values from these elements. This is another example of chaining, as we've seen in a previous exercise.

The point here is that we can chain together calls to the methods `css` and `xpath`, and combine them! We help nudge you in the correct direction by giving you the solution if we chain with another call to the `css` method.

In [23]:
# Select all hyperlinks of div elements belonging to class "course-block"
course_as = sel.css('div.course-block > a')

# Selecting all href attributes chaining with css
hrefs_from_css = course_as.css('::attr(href)')

# Selecting all href attributes chaining with xpath
hrefs_from_xpath = course_as.xpath('./@href')

### Top Level Text
This exercise will have you write an XPath and a CSS Locator string that direct to the text of a specific paragraph `p` element. The `p` element in the HTML is uniquely defined by its `id` attribute, which is `"p3"`. With this small piece of information, you should be able to create the desired strings; however, we have preloaded the variable `html` with a string containing the HTML to which this paragraph belongs, if you want to peruse it.

In this exercise, you will only be selecting the text directly within the element, which does not include the text in future generations (descendants) of the element.

In [33]:
def our_xpath( xpath ):
    xextr = sel.xpath( xpath ).extract()
    return xextr

def our_css( css ):
    cextr = sel.css( css ).extract()
    return cextr

def print_results( xpath, css_locator ):
    print( "Your XPath extracts to following:")
    print( our_xpath(xpath) )
    print("_________________\n")
    print( "Your CSS Locator extracts the following:")
    print( our_css(css_locator) )
    return None

In [34]:
with open('./dataset/text_extract.html', 'r') as file:
    html = file.read().replace('\n', '')

sel = Selector(text=html)

# Create an XPath string to the desired text.
xpath = '//p[@id="p3"]/text()'

# Create a CSS Locator string to the desired text
css_locator = 'p#p3::text'

# Print the text from our selection
print_results(xpath, css_locator)

Your XPath extracts to following:
['Here is the ', ' link you want!']
_________________

Your CSS Locator extracts the following:
['Here is the ', ' link you want!']


### All Level Text
This exercise is similar to the previous, but differs in that you will be selecting text from multiple generations of a given element.

You will write XPath and CSS Locator strings that direct to the text of a specific paragraph `p` element. The `p` element in the HTML is uniquely defined by its `id` attribute, which is `"p3"`. With this small piece of information, you should be able to create the desired strings; however, we have preloaded the variable `html` with a string containing the HTML to which this paragraph belongs, if you want to peruse it.

In this exercise, you will be selecting all the text within the element, including the text within its future generations (descendants).

In [35]:
# Create an XPath string to the desired text.
xpath = '//p[@id="p3"]//text()'

# Create a CSS Locator string to the desired text
css_locator = 'p#p3 ::text'

# Print the text from our selections
print_results(xpath, css_locator)

Your XPath extracts to following:
['Here is the ', 'DataCamp', ' link you want!']
_________________

Your CSS Locator extracts the following:
['Here is the ', 'DataCamp', ' link you want!']


### Respond Please!
- Selector vs Response:
    - The Response has all the tools we learned with Selectors:
        - `xpath` and `css` methods followed by `extract` and `extract_first` methods.
    - The Response also keeps track of the URL where the HTML code was loaded from.
    - The Response helps us move from one site to another, so that we can "crawl" the web while scraping.

### Reveal By Response
Your job is to figure out the URL and the title of the website using the response variable. You learned how to find the URL in the last lesson. To find the website title, what you need to know is:

- The title is the text from the `title` element
- The `title` element is a child of the `head` element, which is a child of the `html` root element.

> Note: the `html` root element only has one child `head` element, and the `head` element only has one child `title` element.

In [38]:
def print_url_title( url, title ):
    print( "Here is what you found:" )
    print( "\t-URL: %s" % url )
    print( "\t-Title: %s" % title )

In [76]:
from scrapy.http.response.text import TextResponse

response = TextResponse(url='https://www.datacamp.com/courses/all', encoding='utf-8')

In [77]:
# Get the URL to the website loaded in response
this_url = response.url

# Get the title of the website loaded in response
this_title = response.xpath('//title/text()').extract_first()

# Print out our findings
print_url_title(this_url, this_title)

Here is what you found:
	-URL: https://www.datacamp.com/courses/all
	-Title: None


### Responding with Selectors
Something that we should emphasize at this point about the relationship between `Selector` and `Response` objects is that both return a `SelectorList` when using the `xpath` or `css` methods to direct to elements. In this exercise, we'll prove it to you by having you find all hyperlink elements belonging to the class `course-block__link` (notice the double underscore!) and looking at the object that is produced when doing so.

Recall that to find an element by class, you can use a period (`.`). For example, `div.class-2` selects all `div` elements belonging to `class-2`.

We have pre-loaded both a `Response` object named `response` and a `Selector` object named `sel` with the content from the same "secret" website. Once you complete the task of creating a CSS Locator, you will compare the output from `response.css` and `sel.css` to see that they are effectively the same!