# 🌐 Scraping, Part 1.3: CSS Selectors

*More flexibility, more power*

## Q: What is CSS?

We can actually see some in `example.com`'s HTML:

```html
<style type="text/css">
body {
    background-color: #f0f0f2;
    margin: 0;
    padding: 0;
    font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
    
}
div {
    width: 600px;
    margin: 5em auto;
    padding: 2em;
    background-color: #fdfdff;
    border-radius: 0.5em;
    box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
/* [...] */
</style>
```

### CSS is a mini-language 

... for applying style/behavior to a *particular set* of elements on a page.

The rules look like this:

```css
selector {
    property: value;
    another-property: another-value;
}
```

For the purpose of web-scraping, all we care about is the `selector` part.

These CSS selectors turn out to be a remarkably expressive, powerful language for __pointing to HTML elements__.

## How do CSS selectors work?

Each CSS selector *matches* some set of elements.

Here are some of the key matching rules you'll want to know:

### `tagname`

Selects all instances of `<tagname>`, no matter where in the document they are.

Example: `p`

### `.classname`

The `.` selector limits matches to elements whose `class` attribute includes `classname`, as in `class="classname"`. (This is a handy shortcut, since `class` is a very common HTML attribute.)

Example: `.active` or `p.active`

### `#idname`

The `#` selector limits matches to elements with `id="idname"`. (This is another handy shortcut, since `id` is *also* a very common HTML attribute.)

Example: `#main` or `p#main`

### `tagname, othertagname`

Selects all instances of `<tagname>` *and* `<othertagname>`.

Example: `p, a`

### `tagname othertagname`

Selects all instances of `<othertagname>` that are *descendants* (at any level) of `<tagname>`.

Example: `div a`

### `tagname > othertagname`

Selects all instances of `<othertagname>` that are *direct* children of `<tagname>`.

Example: `div > p`

### `tagname + othertagname`

Selects all instances of `<othertagname>` that *immediately follow* `<tagname>` as a *sibling*.

Example: `p + p`

### `*`

Stands in for *any tag*; helpful when used in combination with other selection rules.

Example: `div > *`

## Let's try using them

## Q: Do you remember how to get example.com's HTML into Python?

In [1]:
import requests

In [2]:
example_html = requests.get("https://example.com").text

In [3]:
print(example_html[:100])

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <m


## Let's use CSS selectors in `lxml`

Install `cssselect`, which adds CSS selection capabilities to `lxml`:

```
pip install cssselect
```

Import `lxml.html`:

In [4]:
import lxml.html
# You don't have to do anything special to import `cssselect`;
# `lxml` will handle that for you.

... and turn the HTML text into an `lxml.html.HtmlElement` object:

In [5]:
dom = lxml.html.fromstring(example_html)
type(dom)

lxml.html.HtmlElement

We can use the `.cssselect("selector")` method on any element object, including the top-level `HtmlElement`.

Let's start simple, getting the title:

In [6]:
dom.cssselect("title")

[<Element title at 0x10300cc20>]

## Q: What do you notice?

## A: `.cssselect(...)` returns *a list* of elements matching the selector

... not just the first match.

So for simple tag selections, `.cssselect("tagname")` behaves roughly like `.findall(".//tagname")`: both return a list of every matching element below the starting point, at any depth.

If we want to work with the first element, we'll need to get that item from the list:

In [7]:
dom.cssselect("title")[0]

<Element title at 0x10300cc20>

Let's get the paragraphs:

In [8]:
dom.cssselect("p")

[<Element p at 0x106942840>, <Element p at 0x106942c00>]

## Q: What if we wanted not just the `<p>` tags, but also the `<h1>` tag?

What are some ways we could do that?

In [9]:
dom.cssselect("p, h1")

[<Element h1 at 0x106943330>,
 <Element p at 0x106942840>,
 <Element p at 0x106942c00>]

In [10]:
dom.cssselect("div > *")

[<Element h1 at 0x106943330>,
 <Element p at 0x106942840>,
 <Element p at 0x106942c00>]

## Q: What do you notice about the ordering of the elements in the list?

In [11]:
dom.cssselect("p, h1")

[<Element h1 at 0x106943330>,
 <Element p at 0x106942840>,
 <Element p at 0x106942c00>]

A: They're in the order they *appear in the document*, not in the order they were specified by the rule.

(That means `p, h1` is 100% equivalent to `h1, p`.)

## Q: How would you expect the results of `div *` to differ from `div > *`?

A: `div *` will get *any* element inside a `<div>` tag, not just the immediate children.

In [12]:
dom.cssselect("div *")

[<Element h1 at 0x106943330>,
 <Element p at 0x106942840>,
 <Element p at 0x106942c00>,
 <Element a at 0x1069438d0>]

## Now let's try with `BeautifulSoup`

Do you remember how to get started with it?

In [13]:
from bs4 import BeautifulSoup

In [14]:
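# Naming the parser explicitly avoids a warning and keeps results consistent across machines.
soup = BeautifulSoup(example_html, "html.parser")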
type(soup)

bs4.BeautifulSoup

CSS selections with `BeautifulSoup` use the same selector syntax as with `lxml`; the method is called `.select(...)` instead of `.cssselect(...)`, and the results are `BeautifulSoup` tag objects:

In [15]:
soup.select("p, h1")

[<h1>Example Domain</h1>,
 <p>This domain is for use in illustrative examples in documents. You may use this
     domain in literature without prior coordination or asking for permission.</p>,
 <p><a href="https://www.iana.org/domains/example">More information...</a></p>]
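`BeautifulSoup` also offers `.select_one(...)`, which returns only the first match (or `None` if there is none) instead of a list. Here is a short sketch on a made-up snippet; the URL in it is a placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, invented for this illustration.
snippet = """
<div>
  <h1>Heading</h1>
  <p>first</p>
  <p><a href="https://example.org/more">More</a></p>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

# .select() returns a list of every match, in document order.
headings_and_paras = [tag.name for tag in soup.select("p, h1")]

# .select_one() returns the first match, or None when nothing matches.
first_link = soup.select_one("div a")
no_table = soup.select_one("table")

print(headings_and_paras)
print(first_link.get_text(), first_link["href"])
print(no_table)
```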

---
