## The (X)Path to CSS Locators

<p>Many people prefer using CSS Locator notation to XPath notation. As we will see later, it often makes attribute selection very easy. To help get you more comfortable going back and forth between XPath and CSS Locator strings, we give you a chance in this exercise to do some direct "translation" between the two.</p>
<p><strong>Note that the exercises in this chapter may take some time to load.</strong></p>

## Get an "a" in this Course

In [None]:
from scrapy import Selector
import requests
# html = requests.get( 'https://assets.datacamp.com/production/repositories/2560/datasets/0f78aa6961422247398f079e099e179f6bf4aec9/all_long' ).content
html = requests.get('https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short').content

def how_many_elements( css ):
  sel = Selector( text = html )
  print( len(sel.css( css )) )

<p>We have loaded the HTML from a secret website which you will use to set up a <code>Selector</code> object and the function <code>how_many_elements()</code>. When passing this function a CSS Locator string, it will print out the number of elements that the CSS Locator you wrote has selected.</p>
<p>In the second part of this problem, we want you to create a CSS Locator string which will select a certain collection of elements as described here: Select the hyperlink (<code>a</code> element) children of all <code>div</code> elements belonging to the class <code>"course-block"</code> (that is, any <code>div</code> element with a class attribute such that <code>"course-block"</code> is one of the classes assigned). The number of such elements is 11, so you can check your solution with <code>how_many_elements</code> if you choose.</p>

## The CSS Wildcard

In [None]:
# Create the CSS Locator to all children of the element whose id is uid
css_locator = "#uid > *"

<p>You can use the wildcard <code>*</code> in CSS Locators too! In fact, we can use it in a similar way, when we want to ignore the tag type. For example:</p>
<ul>
<li>The CSS Locator string <code>'*'</code> selects all elements in the HTML document. </li>
<li>The CSS Locator string <code>'*.class-1'</code> selects all elements which belong to <code>class-1</code>, but this is unnecessary since the string <code>'.class-1'</code> will also do the same job.</li>
<li>The CSS Locator string <code>'*#uid'</code> selects the element with <code>id</code> attribute equal to <code>uid</code>, but this is unnecessary since the string <code>'#uid'</code> will also do the same job.</li>
</ul>
<p>In this exercise, we want you to work by analogy with the wildcard character you know from XPath notation to discover how to select all the children of a certain element in CSS Locator notation.</p>

<ul>
<li>Assign to the variable <code>css_locator</code> a CSS Locator string which will select all children (regardless of tag-type) of the unique element in the HTML document that has its <code>id</code> attribute equal to <code>uid</code>.</li>
</ul>

<ul>
<li>The exercise discussion already gives you a method to find the element with <code>id</code> equal to <code>uid</code>; what you need to do is find a way to use the wildcard character and your knowledge of how to move down one generation (in CSS Locator notation) to select all the children of the element.</li>
</ul>

## You've been `href`ed

In [None]:
from scrapy import Selector
import requests
# html = requests.get( 'https://assets.datacamp.com/production/repositories/2560/datasets/3ac9c2faa22664a688c5c5ee42e76d47d6b297dc/all' ).content
html = requests.get('https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short').content

sel = Selector( text = html )

In [None]:
from scrapy import Selector

# Create a selector object from a secret website
sel = Selector( text = html )

# Select all hyperlinks of div elements belonging to class "course-block"
course_as = sel.css( 'div.course-block > a' )

# Selecting all href attributes chaining with css
hrefs_from_css = course_as.css( '::attr(href)' )

# Selecting all href attributes chaining with xpath
hrefs_from_xpath = course_as.xpath( './@href' )

<p>In a previous exercise, you created a CSS Locator string to select the hyperlink (<code>a</code> element) children of all <code>div</code> elements belonging to the class <code>"course-block"</code>. Here we have created a <code>SelectorList</code> called <code>course_as</code> having selected those hyperlink children. </p>
<p>Now, we want you to fill in the blank below to extract the <code>href</code> attribute values from these elements. This is another example of chaining, as we've seen in a previous exercise.</p>
<p>The point here is that we can chain together calls to the methods <code>css</code> and <code>xpath</code>, and combine them! We help nudge you in the correct direction by giving you the solution if we chain with another call to the <code>css</code> method.</p>

<ul>
<li>Set up the <code>Selector</code> object <code>sel</code> using the string <code>html</code> as the text input.</li>
<li>Assign to the variable <code>hrefs_from_xpath</code> the <code>href</code> attribute values from the elements in <code>course_as</code>. Your solution should match <code>hrefs_from_css</code>!</li>
</ul>

<ul>
<li>Don't forget when chaining with <code>xpath</code>, to use a period as "glue" as you did in a previous exercise.</li>
</ul>

## Top Level Text

html = '''
<html>
<body>
<div id="this-div">
<p id="p1" class="class-1">This is not the element you are looking for</p>
<p id="p2" class="class-12">
<a href="https://www.google.com">Google</a> is linked to here, but this isn't the link you are looking for. 
</p>
<p id="p3" class="class-1 class-12">
Here is the <a href="https://www.datacamp.com" id="a-exercise">DataCamp</a> link you want!
</p>
</div>
</body>
</html>
'''

from scrapy.http import TextResponse
res = TextResponse( url = "https://www.DataCamp.com", body = html, encoding = 'utf-8' )

def our_xpath( xpath ):
  xextr = res.xpath( xpath ).extract()
  return xextr
  
def our_css( css ):
  cextr = res.css( css ).extract()
  return cextr


def print_results( xpath, css_locator ):
  print( "Your XPath extracts the following:")
  print( our_xpath(xpath) )
  print("_________________\n")
  print( "Your CSS Locator extracts the following:")
  print( our_css(css_locator) )
  return None

In [None]:
# Create an XPath string to the desired text.
xpath = '//p[@id="p3"]/text()'

# Create a CSS Locator string to the desired text.
css_locator = 'p#p3::text'

# Print the text from our selections
print_results( xpath, css_locator )

<p>This exercise will have you write an XPath string and a CSS Locator string that direct to the text of a specific paragraph <code>p</code> element. The <code>p</code> element in the HTML is uniquely defined by its <code>id</code> attribute, which is <code>"p3"</code>. With this small piece of information, you should be able to create the desired strings; however, we have preloaded the variable <code>html</code> with a string containing the HTML to which this element belongs, if you want to peruse it.</p>
<p>In this exercise, you will only be selecting the text within the element, which <strong>does not include</strong> the text in future generations of the element. We have created a function <code>print_results</code> for you to compare which elements your strings direct to.</p>

<ul>
<li>Assign to the variable <code>xpath</code> an XPath string directing to the text within the paragraph <code>p</code> element with <code>id</code> equal to <code>p3</code>, which <strong>does not include</strong> the text of future generations of this <code>p</code> element.</li>
<li>Assign to the variable <code>css_locator</code> a CSS Locator string directing to this same text.</li>
</ul>

<ul>
<li>Remember that in CSS Locator notation, the pound sign <code>#</code> is used in helping identify an element by its id. </li>
<li>Don't forget that for an XPath string, you need to have parentheses following the word text (i.e., <code>text()</code> should be part of the string). </li>
<li>Don't forget that for a CSS Locator string, you need to connect the word text with a double colon <code>::</code>.</li>
</ul>

## All Level Text

html = '''
<html>
<body>
<div id="this-div">
<p id="p1" class="class-1">This is not the element you are looking for</p>
<p id="p2" class="class-12">
<a href="https://www.google.com">Google</a> is linked to here, but this isn't the link you are looking for. 
</p>
<p id="p3" class="class-1 class-12">
Here is the <a href="https://www.datacamp.com" id="a-exercise">DataCamp</a> link you want!
</p>
</div>
</body>
</html>
'''

from scrapy.http import TextResponse
res = TextResponse( url = "https://www.DataCamp.com", body = html, encoding = 'utf-8' )

def our_xpath( xpath ):
  xextr = res.xpath( xpath ).extract()
  return xextr
  
def our_css( css ):
  cextr = res.css( css ).extract()
  return cextr


def print_results( xpath, css_locator ):
  print( "Your XPath extracts the following:")
  print( our_xpath(xpath) )
  print("_________________\n")
  print( "Your CSS Locator extracts the following:")
  print( our_css(css_locator) )
  return None

In [None]:
# Create an XPath string to the desired text.
xpath = '//p[@id="p3"]//text()'

# Create a CSS Locator string to the desired text.
css_locator = 'p#p3 ::text'

# Print the text from our selections
print_results( xpath, css_locator )

<p>This exercise is similar to the previous, but differs in that you will be selecting text from multiple generations of a given element.</p>
<p>You will write an XPath string and a CSS Locator string that direct to the text of a specific paragraph <code>p</code> element. The <code>p</code> element in the HTML is uniquely defined by its <code>id</code> attribute, which is <code>"p3"</code>. With this small piece of information, you should be able to create the desired strings; however, we have preloaded the variable <code>html</code> with a string containing the HTML to which this element belongs, if you want to peruse it.</p>
<p>In this exercise, you will select the text within the element, <strong>including</strong> all text within its future generations. We have created a function <code>print_results</code> for you to compare which elements your strings direct to.</p>

<ul>
<li>Assign to the variable <code>xpath</code> an XPath string directing to the text within the paragraph <code>p</code> element with <code>id</code> equal to <code>p3</code>, which <strong>includes</strong>  the text of future generations of this <code>p</code> element. </li>
<li>Assign to the variable <code>css_locator</code> a CSS Locator string directing to this same text.</li>
</ul>

<ul>
<li>Remember that in CSS Locator notation, the pound sign <code>#</code> is used in helping identify an element by its id. </li>
<li>Don't forget that for an XPath string, you need to have parentheses following the word text (i.e., <code>text()</code> should be part of the string). </li>
<li>Don't forget that for a CSS Locator string, you need to connect the word text with a double colon <code>::</code>.</li>
</ul>

## Reveal By Response

In [None]:
import requests
from scrapy.http import TextResponse

# html = requests.get( 'https://assets.datacamp.com/production/repositories/2560/datasets/3ac9c2faa22664a688c5c5ee42e76d47d6b297dc/all' ).content
html = requests.get('https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short').content

response = TextResponse( url = 'https://www.datacamp.com/courses/all', 
                         body = html, 
                         encoding = 'utf-8' )

def print_url_title( url, title ):
  print( "Here is what you found:" )
  print( "\t-URL: %s" % url )
  print( "\t-Title: %s" % title )

In [None]:
# Get the URL to the website loaded in response
this_url = response.url

# Get the title of the website loaded in response
this_title = response.xpath( '/html/head/title/text()' ).extract_first()

# Print out our findings
print_url_title( this_url, this_title )

<p>We have pre-loaded a <code>Response</code> object, named <code>response</code>, with the content from a secret website. Your job is to figure out the URL and the title of the website using the response variable. You learned how to find the URL in the last lesson. To find the website title, what you need to know is:</p>
<ul>
<li>The title is the <strong>text</strong> from the <code>title</code> element</li>
<li>The <code>title</code> element is a child of the <code>head</code> element, which is a child of the <code>html</code> root element.</li>
</ul>
<p>To note: the <code>html</code> root element only has one child <code>head</code> element, and the <code>head</code> element only has one child <code>title</code> element.</p>

<ul>
<li>Assign to the variable <code>this_url</code> the URL used to load the <code>response</code> variable. </li>
<li>Assign to the variable <code>this_title</code> the title of the website used to load the <code>response</code> variable. Since we only want the text from the single element we will select, we use the <code>extract_first()</code> method to extract the text.</li>
<li><em>Regardless of whether you use <code>xpath</code> or <code>css</code>, make sure that you are selecting the <strong>text</strong> within the title element, and not just the title itself.</em></li>
</ul>

<ul>
<li>You can access the URL of the <code>response</code> via its <code>.url</code> attribute.</li>
<li>You can use your choice of either the <code>xpath</code> or <code>css</code> method within <code>response</code> to get to the title.</li>
<li>If you are using <code>xpath</code>, don't forget that you will need to include <code>/text()</code> within your XPath string to point to the text of the title. </li>
<li>If you are using <code>css</code>, don't forget that you will need to include <code>::text</code> within your CSS Locator string.</li>
</ul>

## Responding with Selectors

In [None]:
import requests
from scrapy.http import TextResponse
from scrapy import Selector 

# html = requests.get( 'https://assets.datacamp.com/production/repositories/2560/datasets/3ac9c2faa22664a688c5c5ee42e76d47d6b297dc/all' ).content
html = requests.get('https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short').content

response = TextResponse( url ='https://www.datacamp.com/courses/all', 
                         body = html, 
                         encoding = 'utf-8' )
sel = Selector( text = html )

In [None]:
# Create a CSS Locator string to the desired hyperlink elements
css_locator = 'a.course-block__link'

# Select the hyperlink elements from response and sel
response_as = response.css( css_locator )
sel_as = sel.css( css_locator )

# Examine similarity
nr = len( response_as )
ns = len( sel_as )
for i in range( min(nr, ns, 2) ):
  print( "Element %d from response: %s" % (i+1, response_as[i]) )
  print( "Element %d from sel: %s" % (i+1, sel_as[i]) )
  print( "" )

<p>Something that we should emphasize at this point about the relationship between <code>Selector</code> and <code>Response</code> objects is that <strong>both</strong> return a <code>SelectorList</code> when you use the <code>xpath</code> or <code>css</code> methods to direct to elements. In this exercise, we'll prove it to you by having you find all hyperlink elements belonging to the class <code>course-block__link</code> (notice the double underscore!) and looking at the object that is produced when doing so.</p>
<p>Recall that to find an element by class, you can use a period (<code>.</code>). For example, <code>div.class-2</code> selects all div elements belonging to <code>class-2</code>.</p>
<p>We have pre-loaded both a <code>Response</code> object named <code>response</code> and a <code>Selector</code> object named <code>sel</code> with the content from the same "secret" website. Once you complete the task of creating a CSS Locator, you will compare the output from <code>response.css</code> and <code>sel.css</code> to see that they are effectively the same!</p>

<ul>
<li>Assign to the variable <code>css_locator</code> a CSS Locator string which directs to all hyperlink <code>a</code> elements belonging to the class <code>course-block__link</code>. </li>
<li>Assign to the variable <code>response_as</code> the output of passing the <code>css_locator</code> variable to the <code>css</code> method in <code>response</code>.</li>
<li>Assign to the variable <code>sel_as</code>  the output of passing the <code>css_locator</code> variable to the <code>css</code> method in <code>sel</code>.</li>
</ul>

<ul>
<li>Your CSS Locator string should take the form <code>a.____</code>, where the blank represents the class that you want to find.</li>
<li>Don't forget the double underscore!</li>
</ul>

## Selecting from a Selection

In [None]:
import requests
from scrapy.http import TextResponse
from scrapy import Selector

# html = requests.get( 'https://assets.datacamp.com/production/repositories/2560/datasets/3ac9c2faa22664a688c5c5ee42e76d47d6b297dc/all' ).content
html = requests.get('https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short').content

response = TextResponse( url ='https://www.datacamp.com/courses/all', 
                         body = html, 
                         encoding = 'utf-8' )


<p>In this exercise, you will find the text from an <code>h4</code> element within a particular <code>div</code> element. It will occur in steps where the first step is selecting a family of <code>div</code> elements, and the second step is narrowing in on the first one, from which we will grab the <code>h4</code> element text. This process of progressively narrowing in on elements (e.g., first to the <code>div</code> elements, then to the <code>h4</code> element) is another example of "chaining", even if it doesn't look exactly the same as we've seen it before.</p>
<p>Along the way in this exercise, there is a variable <code>first_div</code> set up for you to use. Think carefully about what type of object <code>first_div</code> is!</p>

## Titular

In [None]:
import requests
from scrapy.http import TextResponse

# html = requests.get( 'https://assets.datacamp.com/production/repositories/2560/datasets/3ac9c2faa22664a688c5c5ee42e76d47d6b297dc/all' ).content
html = requests.get('https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short').content

response = TextResponse( url = 'https://www.datacamp.com/courses/all',
                         body = html,
                         encoding = 'utf-8' )

In [None]:
# Create a SelectorList of the course titles
crs_title_els = response.css( 'h4::text' )

# Extract the course titles 
crs_titles = crs_title_els.extract()

# Print out the course titles 
for el in crs_titles:
  print( ">>", el )

<p>Similar to the work given in the previous lesson, we will have you use a pre-loaded <code>Response</code> object, named <code>response</code>, to scrape the course titles from the (shortened version of the) DataCamp course directory <a href="https://www.datacamp.com/courses/all">https://www.datacamp.com/courses/all</a>. To successfully do so, you only need to know the following:</p>
<ul>
<li>The course titles <strong>are the text</strong> from all the <code>h4</code> elements within the HTML document. </li>
</ul>
<p>We ask you to extract these course titles here.</p>

<ul>
<li>Using <code>response</code>, assign to the variable <code>crs_title_els</code> a <code>SelectorList</code> of the selected course titles.</li>
<li>Assign to the variable <code>crs_titles</code> a list created by extracting the course titles from <code>crs_title_els</code>.</li>
</ul>

<ul>
<li>You can use either the <code>xpath</code> or <code>css</code> method within <code>response</code> to get to the course titles.</li>
<li>Remember that every <code>h4</code> element is used for the course titles (and you need to get the text from these elements).</li>
</ul>

## Scraping with Children

In [None]:
from scrapy.http import TextResponse
import requests

_url = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

_html = requests.get( _url ).content

_response = TextResponse( url = _url, body = _html, encoding = 'utf-8' )

_as = _response.xpath('//body')

mystery = _as[0]

In [None]:
# Calculate the number of children of the mystery element
how_many_kids = len( mystery.xpath( './*' ) )

# Print out the number
print( "The number of elements you selected was:", how_many_kids )

<p>We did a cute trick in the lesson to calculate how many children there were of one of the <code>div</code> elements belonging to the class <code>course-block</code>. Here we ask you to find the number of children of a mystery element (already stored within a <code>Selector</code> object, so you can use the <code>xpath</code> or <code>css</code> method).</p>
<p>To be explicit, we have created the <code>Selector</code> object <code>mystery</code> in the following way:</p>
<ul>
<li>We first loaded a <code>Response</code> variable using a secret website as the input.</li>
<li>Then we used a call to the <code>xpath</code> method to create a <code>SelectorList</code> of elements (but we won't say which ones)</li>
<li>Finally, we let <code>mystery</code> be the first <code>Selector</code> object of this <code>SelectorList</code>.</li>
</ul>

<ul>
<li><p>Fill in the blank below to chain on a call to <code>xpath</code> so that we can calculate the number of children of the mystery element; we assign this number to the variable <code>how_many_kids</code>.</p>
<ul>
<li><em>Remember, if you use <code>xpath</code>, this really is an instance of chaining, so don't forget to use a period (<code>.</code>) as glue.</em></li></ul></li>
</ul>

<ul>
<li>Remember that children are only down one generation, and thus you only need to use a <em>single</em> forward slash (<code>/</code>).</li>
<li>You will need to use an asterisk (<code>*</code>) wildcard in your solution.</li>
</ul>