## Attribution

These slides were adapted from [the companion notebooks](https://github.com/REMitchell/python-scraping) for [Web Scraping in Python](http://shop.oreilly.com/product/0636920034391.do), which are open sourced and provided for free.  If you are interested in a more detailed presentation of web scraping in Python, this book is a great source.

In [5]:
!pip install composable
!pip install composablesoup



In [6]:
!pip install composable --upgrade
!pip install composablesoup --upgrade

Requirement already up-to-date: composable in /home/sbada9048/.pyenv/versions/anaconda3-2020.02/lib/python3.7/site-packages (0.2.5)
Requirement already up-to-date: composablesoup in /home/sbada9048/.pyenv/versions/anaconda3-2020.02/lib/python3.7/site-packages (0.2.4)


In [7]:
from composablesoup import find, find_all, get_text, has_attr
from composable.sequence import slice, head
from composable.strict import map, filter
from composable.string import replace
from composable import from_toolz as tlz

## CSS and Styling HTML Pages

In this section, we will introduce styling web pages using **Cascading Style Sheets (CSS)**, which is common practice in modern web design.  As a consequence, most, if not all, HTML tags have attributes that classify and group the tags, often in a meaningful, contextual way.  These attributes are useful when web scraping, as we will see in the following sections.

### Exploration

1. Go to [this page](http://www.pythonscraping.com/pages/warandpeace.html)
2. Notice that
    1. All of the quotes are colored <font color="#ff5555">red</font>
    2. All of the character names are colored <font color="#55ff55">green</font>
3. Now right click and view the page source.  Look at the `<style>` tag at the top of the page.  *These entries are CSS selectors, which apply style to all matching tags*.
4. Finally, note that
    1. Each quotation is surrounded by `<span class="red">...</span>`
    2. Each name is surrounded by `<span class="green">...</span>`

### CSS Selectors

* A **CSS selector** applies a style to all matching tags.
* The following selector
    * is named `green`
    * applies a <font color="#55ff55">green</font> font

```
.green{
	color:#55ff55;
}
```

### Applying CSS selectors to HTML tags

* Apply a selector with the `class` attribute.
* We can apply the `green` selector using

```
<span class="green">...</span>
```
* Conceptually, `class="green"` has the same effect as writing the style inline:
```
<span style="color:#55ff55;">...</span>
```
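To make the grouping role of `class` concrete, here is a minimal sketch that uses only the standard library's `html.parser` to collect the text of each `span` by its class. The HTML snippet and the `ClassTextCollector` name are made up for illustration. The rest of these slides use Beautiful Soup instead.

```python
from collections import defaultdict
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Hypothetical helper: group the text inside <span> tags by class."""

    def __init__(self):
        super().__init__()
        self.by_class = defaultdict(list)
        self._open_spans = []  # class value of each currently open <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._open_spans.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "span" and self._open_spans:
            self._open_spans.pop()

    def handle_data(self, data):
        # Only keep text that sits inside a <span> with a class attribute.
        if self._open_spans and self._open_spans[-1]:
            self.by_class[self._open_spans[-1]].append(data.strip())


# Illustrative snippet in the style of the War and Peace page.
html = """<p><span class="red">Well, Prince</span> said
<span class="green">Anna Pavlovna Scherer</span> to
<span class="green">Prince Vasili Kuragin</span>.</p>"""

collector = ClassTextCollector()
collector.feed(html)
collector.close()
grouped = dict(collector.by_class)
print(grouped)
```

The same class names that drive the styling also tell us which spans are quotes and which are names.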


### Reading War and Peace

In [8]:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get('http://www.pythonscraping.com/pages/warandpeace.html')
war_and_peace = BeautifulSoup(r.content, "html.parser")

## Searching for HTML Attributes

We can search for any HTML tag by attribute using `find` and `find_all`.  This method of searching is particularly advantageous when dealing with pages that are styled using CSS selectors, as most or all tags will be marked with a `class` attribute, and these attributes are often related to the context of the content.

In this section, we will illustrate searching with tag attributes using `find` and `find_all`.

### A note on `find` and `find_all`

* `soup.find` returns the first matching tag
* `soup.find_all` returns a list of all matching tags
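The difference also shows up when nothing matches. The sketch below uses plain Beautiful Soup on a small inline snippet (made up for illustration, so no network access is needed): `find` gives back `None`, while `find_all` gives back an empty list.

```python
from bs4 import BeautifulSoup

# Illustrative snippet standing in for the real page.
html = """
<p><span class="red">Well, Prince</span> said
<span class="green">Anna Pavlovna Scherer</span>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_span = soup.find("span")        # a single Tag
all_spans = soup.find_all("span")     # a list-like collection of Tags

missing_one = soup.find("table")      # no match: None
missing_all = soup.find_all("table")  # no match: empty list

print(first_span.get_text())
print(len(all_spans))
print(missing_one, list(missing_all))
```

Code that loops over the result of `find_all` simply does nothing on an empty result, whereas code that uses the result of `find` has to check for `None` first.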

In [9]:
war_and_peace.find('span')

<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>

In [10]:
war_and_peace.find_all('span')[:2]

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="green">Anna
 Pavlovna Scherer</span>]

### Pipeable `find` and `find_all`

The `composablesoup` module provides pipeable versions of both functions.  We will use these exclusively, since they keep multi-step searches readable and composable.
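For readers curious how a `>>` pipeline can work at all, here is a minimal sketch built on Python's reflected right-shift operator. It is not the actual implementation of `composable`. The `Pipe` class and the `head`, `replace`, and `pmap` helpers below are hypothetical stand-ins written only for this illustration.

```python
class Pipe:
    """Hypothetical wrapper so that `data >> Pipe(f)` evaluates `f(data)`."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, left):
        # Python calls this for `left >> self` when `left` cannot handle `>>` itself.
        return self.func(left)

    def __call__(self, value):
        # Keep the wrapper usable as an ordinary function, e.g. inside pmap.
        return self.func(value)


def head(n):
    """Pipeable: keep the first n items."""
    return Pipe(lambda seq: list(seq)[:n])


def replace(old, new):
    """Pipeable: str.replace with the string supplied last."""
    return Pipe(lambda s: s.replace(old, new))


def pmap(func):
    """Pipeable, strict map: apply func to every item and return a list."""
    return Pipe(lambda seq: [func(item) for item in seq])


names = ["Anna\nPavlovna Scherer", "Empress Marya\nFedorovna", "Prince Vasili Kuragin"]
cleaned = names >> pmap(replace("\n", " ")) >> head(2)
print(cleaned)  # ['Anna Pavlovna Scherer', 'Empress Marya Fedorovna']
```

Each helper takes its configuration first and the data last, which is what lets the steps read top to bottom.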

In [11]:
(war_and_peace 
 >> find('span')
)

<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>

In [12]:
(war_and_peace
 >> find_all('span')
 >> head(2)
)

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="green">Anna
 Pavlovna Scherer</span>]

### Use `find_all` when 

* There might be multiple instances
* (almost always, it's a safer option)

### Use `find` when 

* You know there is exactly one instance
* You know you really only want the first
* (almost never, `find_all` is almost always better)

### Two ways to search tag attributes

* Dictionary: `bs.find_all('span', {'class': 'green'})`
* Keyword: `bs.find_all('span', class_='green')`

**Note:** We use the keyword `class_` here because `class` is a reserved Python keyword used to define classes.  Other attributes, like `src`, do not need the trailing `_`.
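As a quick check that the two spellings are interchangeable, the following sketch runs both against a small inline snippet using plain Beautiful Soup. The HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative snippet with two names and one quote.
html = """
<span class="green">Anna Pavlovna Scherer</span>
<span class="red">Well, Prince</span>
<span class="green">Prince Vasili Kuragin</span>
"""
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all("span", {"class": "green"})  # attribute dictionary
by_keyword = soup.find_all("span", class_="green")   # class_ keyword

names_dict = [tag.get_text() for tag in by_dict]
names_keyword = [tag.get_text() for tag in by_keyword]
print(names_dict == names_keyword)  # True
```

The dictionary form is handy when the attribute name is not a valid Python identifier, such as `data-file-width`.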

### Getting all names using an attribute dictionary

In [13]:
(war_and_peace
 >> find_all('span', attrs = {'class':'green'})
 >> head(3)
)

[<span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>]

### Cleaning up the name tags

In [14]:
(war_and_peace
 >> find_all('span', attrs = {'class':'green'})
 >> map(get_text)
 >> head(3)
)

['Anna\nPavlovna Scherer', 'Empress Marya\nFedorovna', 'Prince Vasili Kuragin']

In [15]:
(war_and_peace
 >> find_all('span', attrs = {'class':'green'})
 >> map(get_text)
 >> map(replace('\n', ' '))
 >> head(3)
)

['Anna Pavlovna Scherer', 'Empress Marya Fedorovna', 'Prince Vasili Kuragin']
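Replacing `'\n'` works for this page, but scraped text often contains tabs or runs of spaces as well. One common alternative, sketched here with a hypothetical `normalize_whitespace` helper and made-up input, is to split on any whitespace and rejoin with single spaces.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    return " ".join(text.split())


# Illustrative inputs; the second one has a newline followed by extra spaces.
raw_names = ["Anna\nPavlovna Scherer", "Empress Marya\n  Fedorovna", "Prince Vasili Kuragin"]
clean_names = [normalize_whitespace(name) for name in raw_names]
print(clean_names)
# ['Anna Pavlovna Scherer', 'Empress Marya Fedorovna', 'Prince Vasili Kuragin']
```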

### Getting all quotes

In [16]:
(war_and_peace
 >> find_all('span', attrs = {'class':'red'})
 >> head(2)
)

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="red">If you have nothing better to do, Count [or Prince], and if the
 prospect of spending an evening with a poor invalid is not too
 terrible, I shall be very charmed to see you tonight between 7 and 10-
 Annette Scherer.</span>]

<font color="red"><h2>Exercise 1</h2></font>

Write a list comprehension to 

1. Pull each quote out of the `span` tag.
2. Wrap the quote in `"`

In [18]:
from composable import pipeable
# Replace each newline with a space so that words on adjacent lines stay separated
remove_line = pipeable(lambda text: text.replace('\n', ' '))
# Wrap the cleaned text in double quotes
add_quote = pipeable(lambda text: '"' + text + '"')
(war_and_peace
 >> find_all('span', attrs = {'class':'red'})
 >> map(get_text)
 >> map(remove_line)
 >> map(add_quote)
 >> head(2)
)

['"Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don\'t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist- I really believe he is Antichrist- I will have nothing more to do with you and you are no longer my friend, no longer my \'faithful slave,\' as you call yourself! But how do you do? I see I have frightened you- sit down and tell me all the news."',
 '"If you have nothing better to do, Count [or Prince], and if the prospect of spending an evening with a poor invalid is not too terrible, I shall be very charmed to see you tonight between 7 and 10- Annette Scherer."']
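Since the exercise asks for a list comprehension, here is one possible version written against plain Beautiful Soup rather than the pipeable helpers. The inline HTML is a shortened stand-in for the real page, and collapsing whitespace with `split`/`join` is a choice made for this sketch.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative stand-in for the War and Peace page.
html = """
<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes.</span>
<span class="green">Anna Pavlovna Scherer</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Pull the text out of each red span, collapse line breaks, and wrap it in quotes.
quotes = ['"' + " ".join(tag.get_text().split()) + '"'
          for tag in soup.find_all("span", {"class": "red"})]
print(quotes)
```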

## Getting Data From Tag Attributes

Tags also carry useful information in attributes that have nothing to do with CSS. For example,

* the `src` attribute in `img` tags
* the `href` attribute in `a` tags

In this section, we will look at pulling this information out of a tag.

### Reading the Wikipedia Web Scraping Page

In [19]:
import requests
from bs4 import BeautifulSoup
s = requests.Session() # Start a session
r = s.get('https://en.wikipedia.org/wiki/Web_scraping') # Get a static page
web_scraping = BeautifulSoup(r.content, "html.parser")

### Step 1 - Search For All Tags

In [20]:
(web_scraping
 >> find_all('a')
 >> head(3)
)

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>]

### Accessing Attribute Data Looks Like Indexing

* **Syntax:** `tag[attribute_string]`
* This returns the corresponding data

In [21]:
example_a_tag1 = (web_scraping
                 >> find_all('a')
                 >> head(3)
                 >> tlz.get(1)
                )
example_a_tag1

<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>

In [22]:
example_a_tag1['href']

'#mw-head'

### Searching for Nonexistent Attributes is BAD

* If the attribute doesn't exist, indexing raises a `KeyError`

In [23]:
example_a_tag2 = (web_scraping
                 >> find_all('a')
                 >> head(3)
                 >> tlz.get(0)
                )
example_a_tag2

<a id="top"></a>

In [24]:
example_a_tag2['href']

KeyError: 'href'
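The failure can be reproduced in isolation on a two-tag snippet (made up to mirror the tags above). The sketch also shows `Tag.get`, which behaves like `dict.get` and returns `None` or a supplied default instead of raising.

```python
from bs4 import BeautifulSoup

# Two anchors: the first has no href, the second does.
html = '<a id="top"></a><a class="mw-jump-link" href="#mw-head">Jump to navigation</a>'
soup = BeautifulSoup(html, "html.parser")
anchor_without_href, anchor_with_href = soup.find_all("a")

try:
    anchor_without_href["href"]
    raised = False
except KeyError:
    raised = True  # indexing a missing attribute raises KeyError

safe_missing = anchor_without_href.get("href")      # None
safe_default = anchor_without_href.get("href", "")  # '' (caller-chosen default)
present = anchor_with_href.get("href")              # '#mw-head'
print(raised, safe_missing, repr(safe_default), present)
```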

### Using a filter to avoid exceptions

* We can filter out the tags that lack the attribute before indexing them
* Use the `has_attr` Tag method

In [25]:
(web_scraping
 >> find_all('a')
 >> filter(has_attr('href'))
 >> head(3)
)

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="image" href="/wiki/File:Question_book-new.svg"><img alt="" data-file-height="399" data-file-width="512" decoding="async" height="39" src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/75px-Question_book-new.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/100px-Question_book-new.svg.png 2x" width="50"/></a>]

In [26]:
(web_scraping
 >> find_all('a')
 >> filter(has_attr('href'))
 >> map(tlz.get('href'))
 >> head(10)
)

['#mw-head',
 '#searchInput',
 '/wiki/File:Question_book-new.svg',
 '/wiki/Wikipedia:Verifiability',
 'https://en.wikipedia.org/w/index.php?title=Web_scraping&action=edit',
 '/wiki/Help:Referencing_for_beginners',
 '//www.google.com/search?as_eq=wikipedia&q=%22Web+scraping%22',
 '//www.google.com/search?tbm=nws&q=%22Web+scraping%22+-wikipedia',
 '//www.google.com/search?&q=%22Web+scraping%22+site:news.google.com/newspapers&source=newspapers',
 '//www.google.com/search?tbs=bks:1&q=%22Web+scraping%22+-wikipedia']
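Notice that the output mixes fragments (`#...`), site-relative paths (`/wiki/...`), and protocol-relative URLs (`//www...`). If absolute URLs are needed, one option is the standard library's `urllib.parse.urljoin`. This is a sketch: the base URL is the page we fetched, and the three `href` values are taken or shortened from the output above.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Web_scraping"

# One example of each kind of link seen in the output.
hrefs = [
    "#mw-head",                         # fragment on the same page
    "/wiki/Wikipedia:Verifiability",    # path relative to the site root
    "//www.google.com/search?tbm=nws",  # protocol-relative URL (shortened)
]

absolute = [urljoin(base_url, href) for href in hrefs]
for url in absolute:
    print(url)
```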

<font color="red"><h2>Exercise 2</h2></font>

Write a list comprehension to get the `src` for all `img` tags on the Wikipedia site.

In [29]:
(web_scraping
 >> find_all('img')
 >> filter(has_attr('src'))
 >> map(tlz.get('src'))
)

['//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png',
 '//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Ambox_globe_content.svg/48px-Ambox_globe_content.svg.png',
 '//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1',
 '/static/images/footer/wikimedia-button.png',
 '/static/images/footer/poweredby_mediawiki_88x31.png']

<font color="red"><h2>Exercise 3</h2></font>

Get all image `src` and link `href` from your Assignment 1 website.

In [30]:
s = requests.Session() # Start a session
r = s.get('https://kingyeet9048.github.io/my_site/') # Get a static page
assignment_1 = BeautifulSoup(r.content, "html.parser")

In [31]:
assignment_1

<!DOCTYPE html>

<html lang="en">
<head>
<title>First CS344 Website</title>
<meta charset="utf-8"/>
<meta content="Sulaiman Bada" name="author"/>
<link href="css/styles.css" rel="stylesheet"/>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.5.2/js/bootstrap.min.js"></script>
<link href="https://maxcdn.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" rel="stylesheet"/>
</head>
<body>
<div class="main">
<h1 class="center" style="margin-top: 0.5em; color: darkmagenta">All About Me!</h1>
<h2>What have I been doing this summer?</h2>
<p>
                I have done a variety of things from helping a non-profit organization to
                being a finalist for a pitch competition. I'm set to give my pitch on <b>September 28th(ish).</b>
</p>
<table style="margin-top: 5em;">
<tr>
<th>Projects</th>
<th>Done?</th>
</tr>
<tr>
<td><a href="https://www.bridgeshealthwinona.org/copy-of-coronavirus" target="_blank">Covid Tool for Bridges Health Winona</a></td>
<td>Yes</td>
</tr>
<tr>

In [41]:
get_image = (assignment_1
 >> find_all('img')
 >> filter(has_attr('src'))  
 >> map(tlz.get('src'))
)
get_image

['images/news.jpg']

In [42]:
get_link = (assignment_1
 >> find_all('a')
 >> filter(has_attr('href'))  
 >> map(tlz.get('href'))
)
get_link

['https://www.bridgeshealthwinona.org/copy-of-coronavirus']

## More Complicated Searches

Next, we will

* Search for multiple tags at once
* Search for more than one class

### Searching for a list of tags

Using a list of tags with `find_all` returns all such tags.

In [43]:
(war_and_peace
 >> find_all(['h1', 'h2','h3','h4','h5','h6'])
)

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
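Plain Beautiful Soup also accepts a compiled regular expression in place of the list, which can be shorter when the tag names follow a pattern. The sketch below compares the two on a small inline snippet made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Illustrative snippet with two headings and one paragraph.
html = "<h1>War and Peace</h1><h2>Chapter 1</h2><p>Well, Prince</p>"
soup = BeautifulSoup(html, "html.parser")

by_list = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
by_regex = soup.find_all(re.compile(r"^h[1-6]$"))  # tag names h1 through h6

headings = [tag.get_text() for tag in by_regex]
print(headings)  # ['War and Peace', 'Chapter 1']
```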

### Matching more than one class

We can match more than one `class` by passing a set of attribute values.

In [44]:
(war_and_peace
 >> find_all('span', attrs = {'class':{'green', 'red'}})
 >> head(3)
)

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>]
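With plain Beautiful Soup, the same "any of these classes" search can be written by passing a list to `class_`. A minimal sketch follows. The snippet, including its `blue` span, is invented to show that non-listed classes are skipped.

```python
from bs4 import BeautifulSoup

# Illustrative snippet; the "blue" span should not be matched.
html = """
<span class="red">Well, Prince</span>
<span class="green">Anna Pavlovna Scherer</span>
<span class="blue">not matched</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A tag matches if its class is any one of the listed values.
matches = soup.find_all("span", class_=["green", "red"])
classes = [tag["class"] for tag in matches]
print(classes)  # [['red'], ['green']]
```

Results come back in document order, not in the order the classes were listed. Because `class` can hold several values, Beautiful Soup returns it as a list.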

### Searching tag text only

We can search the text of a page, ignoring tag names, using the `text` keyword.  A plain string must match the whole text node exactly.

In [45]:
(war_and_peace
 >> find_all(None, text='the prince')
)

['the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince']
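Because a plain string only matches a text node that equals it exactly, partial matches need a regular expression. The sketch below uses plain Beautiful Soup on an invented snippet. It passes the search text as `string`, the name this argument has in Beautiful Soup 4.4.0 and later (`text` is the older name).

```python
import re

from bs4 import BeautifulSoup

# Illustrative snippet: one exact match and two partial matches.
html = """
<span class="green">the prince</span>
<span class="green">Prince Vasili Kuragin</span>
<span class="red">Well, Prince, so Genoa and Lucca</span>
"""
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="the prince")                           # whole node must equal this
partial = soup.find_all(string=re.compile("prince", re.IGNORECASE))  # substring, any case

print(len(exact), len(partial))  # 1 3
```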

### Text search returns a `NavigableString`

* A `NavigableString` is more than plain text
* It allows access to the surrounding tags

In [46]:
(war_and_peace
 >> find_all(None, text='the prince')
 >> map(type)
)

[bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.NavigableString]

### Getting the surrounding tag with `parent`

More information on parent tags is on the way.

In [47]:
(war_and_peace
 >> find_all(None, text='the prince')
 >> map(lambda ns: ns.parent)
)

[<span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>]
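Once we have the parent tag, everything covered earlier applies to it: its name, its attributes, and its own parent. A short sketch with plain Beautiful Soup on a one-line snippet made up for illustration:

```python
from bs4 import BeautifulSoup

html = '<p><span class="green">the prince</span> replied.</p>'
soup = BeautifulSoup(html, "html.parser")

text_node = soup.find(string="the prince")  # a NavigableString
parent = text_node.parent                   # the enclosing <span> Tag

print(parent.name)         # span
print(parent["class"])     # ['green']
print(parent.parent.name)  # p
```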