# Scraping the Web with Python
A workshop by [Le Wagon](https://www.lewagon.com)

In [42]:
import requests # A library to make HTTP requests (access websites from code, not from the browser)
from bs4 import BeautifulSoup # A library to parse HTML and XML documents
import pandas as pd

## Getting page contents
We use the popular [requests](https://github.com/psf/requests) library to make an HTTP GET request to the page we want to scrape, and save the result in a `response` variable.

When using the web, you make HTTP requests all the time: every time you type something in the address bar and hit "Enter", or click a link on a website, the browser makes a GET request!

To get the HTML of the page, we read the `response.content` attribute. It holds the raw bytes of the response body (`response.text` holds the same body decoded into a string).

In [2]:
url = "http://books.toscrape.com/index.html"
response = requests.get(url)
html = response.content
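The difference between bytes and text can be tried out without any network access. The sketch below uses a made-up byte string as a stand-in for a response body (it is an assumption for illustration, not data fetched from the site):

```python
# A made-up stand-in for the bytes a server might send back (UTF-8 encoded).
raw_body = "<p class='price_color'>£51.77</p>".encode("utf-8")

# Raw bytes, comparable to what response.content holds.
print(type(raw_body))

# Decoding turns the bytes into a regular string, comparable to response.text.
decoded_body = raw_body.decode("utf-8")
print(type(decoded_body))
print(decoded_body)

# The pound sign takes two bytes in UTF-8, so the byte count is one higher.
print(len(raw_body), len(decoded_body))
```

Beautiful Soup accepts either form, which is why passing `response.content` works in this workshop.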

Now we need to pass the resulting HTML to [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), one of the most popular libraries in the Python community for processing HTML and XML data.

It will allow us to navigate the tree of an HTML document and retrieve data through its handy syntax. Here, we are telling the library to process the page we just saved as an HTML document:

In [3]:
scraped = BeautifulSoup(html, 'html.parser')
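Beautiful Soup parses any HTML string the same way, which is handy for experimenting offline. Here is a minimal sketch on a tiny hand-written document (the markup and the `sample` names are assumptions for illustration, not the real page):

```python
from bs4 import BeautifulSoup

# A tiny hand-written document, not the page we are scraping.
sample_html = """
<html>
  <head><title>Demo page</title></head>
  <body><h1>Hello</h1><p>First paragraph</p></body>
</html>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# The parsed object lets us reach tags by name.
print(sample.title.text)  # text inside <title>
print(sample.h1.text)     # text inside the first <h1>
print(sample.p.text)      # text inside the first <p>
```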

You can take a look at the formatted source code of the page just by printing out the resulting object. We keep it in a variable named `scraped`, and we will work with this variable throughout the whole session.

In [4]:
print(scraped)

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

## First steps

Beautiful Soup has an intuitive API and good [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) that you can use to help yourself out.

How would we get the title of the page we just scraped? Let's try `scraped.title`


In [48]:
scraped.title

<title>
    All products | Books to Scrape - Sandbox
</title>

If you `print` it, you will see a bit of HTML, but we can't use this directly as text content. The object returned by `scraped.title` is an internal representation of an HTML tag used by the library. Check the type:

In [6]:
type(scraped.title)

bs4.element.Tag

If we want to get the text data inside the HTML element, we need to access `.text` on this object:

In [7]:
scraped.title.text

'\n    All products | Books to Scrape - Sandbox\n'

If we examine the string closely, we will see that it has a lot of *whitespace*: `\n`, which is a special character for a line break ([Newline](https://en.wikipedia.org/wiki/Newline)), and a bunch of spaces.

We can get rid of all of that by calling `.strip()` on our string:

In [8]:
title_text = scraped.title.text.strip()

Now we have saved a proper text representation into a variable and we can operate on it later, if we want.

In [9]:
print(title_text)

All products | Books to Scrape - Sandbox
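As an alternative to `.text.strip()`, Beautiful Soup's `get_text` method accepts a `strip=True` argument. A small sketch on a copy of the title markup shown above (the variable names are chosen for illustration):

```python
from bs4 import BeautifulSoup

# The same <title> markup as in the output above, parsed on its own.
title_html = "<title>\n    All products | Books to Scrape - Sandbox\n</title>"
title_tag = BeautifulSoup(title_html, "html.parser").title

# Two ways to get the text without the surrounding whitespace.
with_strip_method = title_tag.text.strip()
with_get_text = title_tag.get_text(strip=True)

print(with_strip_method)
print(with_get_text)
```

For a tag that contains a single piece of text, both approaches give the same string.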


Beautiful Soup even has a method that will automatically extract all text from a page. But beware, most of the time the result will not be pretty!

In [10]:
scraped.get_text()
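To see why the raw result is hard to read, and one way to tidy it up, here is a sketch on a made-up fragment (the book name and price are placeholders, not data from the site). It uses the `separator` and `strip` arguments of `get_text`:

```python
from bs4 import BeautifulSoup

# A made-up fragment with placeholder values.
snippet = BeautifulSoup(
    "<div><h3>Some Book</h3><p>£10.00</p><p>In stock</p></div>",
    "html.parser",
)

# Without arguments, the pieces of text are glued together.
glued = snippet.get_text()
print(glued)

# A separator plus strip=True keeps the pieces apart and trims each of them.
compact = snippet.get_text(separator=" | ", strip=True)
print(compact)
```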



We rarely need the whole unformatted text of a page. We need to get picky and only find the stuff we want. For that, we will use [CSS Selectors](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors), which are a universal way to find your way around a page by using tags and attributes in HTML.

## Finding a single element

### Working with a tag

The `<h1>` tag is reserved for the main heading of a document. In well-written HTML, there will be just one `<h1>` per page. Let's see how easy it is to get its contents with Beautiful Soup:

In [11]:
scraped.find('h1').text

'All products'

The `find` method accepts the name of a tag as an argument and always returns **a single element**. If there is more than one element with the same tag, only the first one is returned. Keep it in mind for later!

For convenience, instead of typing `scraped.find('h1')`, you can just type `scraped.h1` with the same result!

In [12]:
scraped.h1.text

'All products'

Sometimes it's easier to assign an element we want to a variable first, and then operate on it:

In [13]:
heading = scraped.h1
heading.text

'All products'
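One detail worth knowing: when nothing matches, `find` returns `None`, so accessing `.text` on the result would raise an error. A minimal sketch on a made-up fragment (the markup and the fallback string are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# A made-up fragment with two headings of the same kind.
page = BeautifulSoup("<h1>First</h1><h1>Second</h1>", "html.parser")

# find returns only the first match.
first_heading = page.find("h1")
print(first_heading.text)

# There is no <h2> here, so find returns None.
missing = page.find("h2")
print(missing)

# Guard against None before reading .text.
subtitle = missing.text if missing is not None else "(no subtitle)"
print(subtitle)
```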

### Working with attributes

HTML tags also tend to have *attributes*. For instance, in `<div class="row" id="main">`, both __class__ and __id__ are attributes of the `<div>` tag, while _"row"_ and _"main"_ are their values.

You can ask Beautiful Soup to show you all attributes with their values by accessing `.attrs` on an element:

In [14]:
body = scraped.body
body.attrs

{'id': 'default', 'class': ['default']}

You can also get the value of any given attribute using square-bracket notation:

In [15]:
body['class']

['default']

If the attribute does not exist on an element and you ask for its value, you will get an error:

In [16]:
body['name'] # Error!

KeyError: 'name'
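If a missing attribute should not stop the program, a tag's `.get` method returns `None` (or a default you choose) instead of raising `KeyError`. A short sketch on a made-up `<div>` (the attribute values are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# A made-up tag with an id and two classes.
tag = BeautifulSoup('<div id="main" class="row wide"></div>', "html.parser").div

print(tag.get("id"))           # value of an existing attribute
print(tag.get("name"))         # missing attribute: None, no error
print(tag.get("name", "n/a"))  # missing attribute with a fallback value

# class can hold several values, so Beautiful Soup returns it as a list.
print(tag["class"])
```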

⚠️ Don't forget that if the `find` method matches more than one element on a page, only the first matching element will be returned. In order to work with multiple elements, we need to get used to __looping!__

In [None]:
scraped.a # There is definitely more than one <a> on a page, but we only get to see the first

## Working with multiple elements

How would we display all the links on the page? There is a `.find_all` method for that: it takes the name of a tag as an argument and returns **all matched elements**.

In [44]:
all_links = scraped.find_all('a')

Note that the `all_links` variable now contains a __collection__. We will need to _iterate_ over it to work with individual links. Luckily, Python makes it easy for us with the `for... in` loop:

In [45]:
for link in all_links:
    print(link.text.strip())

Books to Scrape
Home
Books
Travel
Mystery
Historical Fiction
Sequential Art
Classics
Philosophy
Romance
Womens Fiction
Fiction
Childrens
Religion
Nonfiction
Music
Default
Science Fiction
Sports and Games
Add a comment
Fantasy
New Adult
Young Adult
Science
Poetry
Paranormal
Art
Psychology
Autobiography
Parenting
Adult Fiction
Humor
Horror
History
Food and Drink
Christian Fiction
Business
Biography
Thriller
Contemporary
Spirituality
Academic
Self Help
Historical
Christian
Suspense
Short Stories
Novels
Health
Politics
Cultural
Erotica
Crime

A Light in the ...

Tipping the Velvet

Soumission

Sharp Objects

Sapiens: A Brief History ...

The Requiem Red

The Dirty Little Secrets ...

The Coming Woman: A ...

The Boys in the ...

The Black Maria

Starving Hearts (Triangular Trade ...

Shakespeare's Sonnets

Set Me Free

Scott Pilgrim's Precious Little ...

Rip it Up and ...

Our Band Could Be ...

Olio

Mesaerion: The Best Science ...

Libertarianism for Beginners

It's Only the Himalayas

In the code above, we are [looping](https://www.learnpython.org/en/Loops) over the collection stored in the `all_links` variable. `for link...` on line 1 starts the loop and gives the name `link` to each individual element of the collection. We use that name on line 2 to print each link's text with the surrounding whitespace removed. The blank lines in the output come from links that wrap a cover image and contain no text.
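The same kind of loop can collect values into a list instead of printing them. A sketch using list comprehensions on a made-up menu (the markup and file names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# A made-up sidebar menu, not the real page.
menu_html = """
<ul>
  <li><a href="travel.html"> Travel </a></li>
  <li><a href="mystery.html"> Mystery </a></li>
</ul>
"""
menu = BeautifulSoup(menu_html, "html.parser")

# Collect the cleaned-up text of every link.
link_texts = [link.text.strip() for link in menu.find_all("a")]

# Collect the href attribute of every link.
link_targets = [link["href"] for link in menu.find_all("a")]

print(link_texts)
print(link_targets)
```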

We have just printed all the text from all the links on the page, but this is not very useful: the links leading to categories in the sidebar get mixed up with the links leading to book titles.

What if we want to just print the links to book titles?

First, we will need to go back to Developer Tools in our browser, and inspect the document closely. What is different between the "book title link" and the "category link" in our page's markup? Take a close look at surrounding elements!

Here's the "category" link:

```html
<li>
   <a href="catalogue/category/books/science-fiction_16/index.html">                       
     Science Fiction                         
   </a>
</li>
```

And here's the "book title" link:

```html
<h3>
    <a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a>
</h3>
```

> __Test yourself:__ What are the differences?


<details>
<summary>
    <strong>View answer</strong>
</summary>
    <ul>
        <li>
            The "category" link is enclosed in a <strong>li</strong> tag, while the "book title" link is enclosed in <strong>h3</strong>.
        </li>
        <li>
            The "book title" link has an additional <em>title</em> attribute.
        </li>
    </ul>
        
        
</details>

## Filtering results

### Using attributes

Now, there are two ways to fetch only the links with book titles. The first uses attributes: both the `find` and `find_all` methods take additional *keyword arguments* for attributes.

In [None]:
# only returns <div> tags that have a class 'product_price'
scraped.find_all('div', class_='product_price') 

# returns the first <div> tag that has a class 'product_price' and ignores the rest
scraped.find('div', class_='product_price') 

⚠️ Note that we have to type `class_=` and not `class=`. This is one of the quirks of Python as a language: `class` is a reserved word and cannot be used as an argument name.

If we don't care about the value of the attribute, and just want to keep the tags that have it and filter out the others, we can pass `True` as the argument value.

And it gives us the solution to our problem!

In [46]:
# Filtering links based on attributes
links_with_title = scraped.find_all('a', title=True)
for link in links_with_title:
    print(link.text.strip())

A Light in the ...
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History ...
The Requiem Red
The Dirty Little Secrets ...
The Coming Woman: A ...
The Boys in the ...
The Black Maria
Starving Hearts (Triangular Trade ...
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little ...
Rip it Up and ...
Our Band Could Be ...
Olio
Mesaerion: The Best Science ...
Libertarianism for Beginners
It's Only the Himalayas
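The attribute filters can be tried on a small fragment. The sketch below uses made-up markup (the titles and file names are placeholders) and also shows the `attrs` dictionary, which is useful for attribute names such as `data-loading-text` that are not valid Python keyword arguments:

```python
from bs4 import BeautifulSoup

# A made-up fragment loosely modelled on a book card.
card_html = """
<div>
  <a href="a.html"><img alt="cover"/></a>
  <a href="a.html" title="Full Title of Book A">Full Title ...</a>
  <button data-loading-text="Adding...">Add to basket</button>
</div>
"""
card = BeautifulSoup(card_html, "html.parser")

# Any <a> that has a title attribute, whatever its value.
titled = card.find_all("a", title=True)

# Only <a> tags whose title has this exact value.
exact = card.find_all("a", title="Full Title of Book A")

# Attribute names containing dashes go into the attrs dictionary.
buttons = card.find_all(attrs={"data-loading-text": "Adding..."})

print(len(titled), len(exact), len(buttons))
```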


### Using selectors

Instead of using the `find_all` method, which takes the name of a tag and optional attribute arguments, we can opt for the more flexible [select](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) method of Beautiful Soup, which accepts any valid _selector string_ as an argument:

In [None]:
# Using the more specific selector
title_links = scraped.select('h3 > a')
for link in title_links:
    print(link.text)

Here, we are using the [Child Combinator](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors/Combinators#Child_combinator) syntax from CSS selectors to only find `<a>` elements that are __direct children__ of `<h3>` elements.

Same result as above!

⚠️ Note that the `select` method always returns a **collection** of results that you will have to **iterate over** with a `for... in` loop.
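When only the first match is needed, Beautiful Soup also offers `select_one`, which returns a single element. Here is a sketch of a few selector styles on a made-up book card (the title and price are placeholders):

```python
from bs4 import BeautifulSoup

# A made-up book card with placeholder values.
pod_html = """
<article class="product_pod">
  <h3><a href="x.html" title="Example Book">Example ...</a></h3>
  <div class="product_price"><p class="price_color">£10.00</p></div>
</article>
"""
pod = BeautifulSoup(pod_html, "html.parser")

# Child combinator: <a> elements directly inside <h3>.
links = pod.select("h3 > a")

# Class selectors with a descendant combinator; select_one returns one element.
price_tag = pod.select_one("article.product_pod p.price_color")

# Attribute selector: <a> elements that have a title attribute.
titled_links = pod.select("a[title]")

print(links[0].text, price_tag.text, titled_links[0]["title"])
```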

## Navigating the element tree

Well-formed HTML looks like a flock of birds in a V formation:

<img src="https://upload.wikimedia.org/wikipedia/commons/0/0b/Eurasian_Cranes_migrating_to_Meyghan_Salt_Lake.jpg" width="600">

If you take a look at this picture and imagine that each crane is an HTML element, you can immediately tell how they are related. Starting from the left, each next bird is, in HTML speak, a _child_ of the previous one (and a grandchild of the one before that). If we look from the right, then each next bird is a _parent_ (or a grandparent, great-grandparent and so on).

If several elements are located on the same level of indentation, they are called *siblings*:

```html
 <p class="parent"> <!-- parent -->
     <!-- children -->
     <i class="child-a"></i> <!-- sibling --> 
     <i class="child-b"></i> <!-- sibling --> 
     <i class="child-c"></i> <!-- sibling --> 
 </p>
```

Very often, the data you need to scrape is nested inside a component that is built from multiple tags. So you need to be able to navigate the relations between HTML elements. Let's take a look at the markup of a "book card" component from the page we're scraping:

In [41]:
item = scraped.find("article", class_="product_pod")
print(item.prettify()) # prettify() method makes HTML more readable

<article class="product_pod">
 <div class="image_container">
  <a href="catalogue/a-light-in-the-attic_1000/index.html">
   <img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/>
  </a>
 </div>
 <p class="star-rating Three">
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
 </p>
 <h3>
  <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">
   A Light in the ...
  </a>
 </h3>
 <div class="product_price">
  <p class="price_color">
   £51.77
  </p>
  <p class="instock availability">
   <i class="icon-ok">
   </i>
   In stock
  </p>
  <form>
   <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">
    Add to basket
   </button>
  </form>
 </div>
</article>



With Beautiful Soup, we can navigate the hierarchy starting from the outermost element just by _chaining_ tag names and `.find` calls together:

In [22]:
book_title = item.h3.a['title']
book_price = item.find("div", class_="product_price").p.text

print('"{}" costs {}'.format(book_title, book_price))

"A Light in the Attic" costs £51.77


### Going further

Beautiful Soup has a lot of ways to navigate the element tree! To read up on them, refer to the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree). Attributes that can come in handy in your future scraping endeavours:

* `children`
* `descendants`
* `parent` and `parents`
* `next_sibling` and `previous_sibling`
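A few of these can be tried on the parent and sibling markup from earlier in this section. In this sketch the markup is written on one line, without whitespace between tags, so that every sibling is a tag (the variable names are chosen for illustration):

```python
from bs4 import BeautifulSoup

# The parent/children example, written without whitespace between the tags.
tree_html = (
    '<p class="parent">'
    '<i class="child-a"></i><i class="child-b"></i><i class="child-c"></i>'
    "</p>"
)
tree = BeautifulSoup(tree_html, "html.parser")

# Start from the middle child and look around.
middle = tree.find("i", class_="child-b")

parent_classes = middle.parent["class"]
previous_classes = middle.previous_sibling["class"]
next_classes = middle.next_sibling["class"]

# children lets us loop over the direct children of an element.
child_names = [child["class"][0] for child in tree.p.children]

print(parent_classes, previous_classes, next_classes, child_names)
```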

## Following the links

What if we also want to get book descriptions from the website? They are not found on the homepage: you need to click on the book title to get to a separate page with more information about the product.

Let's try writing a simple scraper in Python that will request the landing page, then click on the first book title in the grid, and then scrape information from the page that follows:

In [24]:
BASE_URL = "http://books.toscrape.com/"
home_url = BASE_URL + "index.html"
home_response = requests.get(home_url)
home_html = home_response.content
home_scraped = BeautifulSoup(home_html, 'html.parser')

book_title = home_scraped.find("a", title=True)
title_text = book_title['title']
title_link = book_title['href']

product_response = requests.get(BASE_URL + title_link)
product_html = product_response.content
product_scraped = BeautifulSoup(product_html, 'html.parser')

# Note we call next_sibling twice: the first sibling is the whitespace between the tags, the second is the <p> with the description
description = product_scraped.find("div", id="product_description").next_sibling.next_sibling

And here's our description:

In [25]:
print(title_text)
print('\n')
print(description.text)

A Light in the Attic


It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. 
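The scraper calls `next_sibling` twice because whitespace between tags counts as a sibling too. The sketch below illustrates this on a simplified, made-up version of the product page markup (an assumption for illustration), and shows `find_next_sibling`, which skips over text and returns the next matching tag:

```python
from bs4 import BeautifulSoup

# A simplified, made-up version of the product page markup.
product_html = """
<div id="product_description"><h2>Product Description</h2></div>
<p>An example description.</p>
"""
product = BeautifulSoup(product_html, "html.parser")
heading_block = product.find("div", id="product_description")

# The first sibling is just the line break between the two tags.
first_sibling = heading_block.next_sibling
print(repr(first_sibling))

# The second sibling is the <p> we are after.
second_sibling = heading_block.next_sibling.next_sibling
print(second_sibling.text)

# find_next_sibling looks for the next sibling tag with the given name.
via_helper = heading_block.find_next_sibling("p")
print(via_helper.text)
```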

## A complete scraper

Now that we know this approach works, we can try to automate the process even more. Computer programs are very good at repetitive tasks, and we've just got ourselves a great repetitive task to automate!

Let's replace our first `find` call with a `find_all`, and do some looping! We will also need a data structure to keep all the contents we scraped. We can decide what we want to do with it later: either save it to the database of our promising startup, or write it to a file.

In [34]:
BASE_URL = "http://books.toscrape.com/"
home_url = BASE_URL + "index.html"
home_response = requests.get(home_url)
home_html = home_response.content
home_scraped = BeautifulSoup(home_html, 'html.parser')

infos = []

book_titles = home_scraped.find_all("a", title=True)

for book_title in book_titles:
    title_text = book_title['title']
    title_link = book_title['href']
    product_response = requests.get(BASE_URL + title_link)
    product_html = product_response.content
    product_scraped = BeautifulSoup(product_html, 'html.parser')
    description = product_scraped.find("div", id="product_description").next_sibling.next_sibling
    infos.append({title_text : description.text})
    
infos

[{'A Light in the Attic': "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for y
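The scraper above builds each product URL by concatenating strings, which works because the links on the home page are relative and have no leading slash. A possible alternative is `urljoin` from Python's standard library, sketched below; the `page-2.html` path is a hypothetical example, not taken from the site:

```python
from urllib.parse import urljoin

BASE_URL = "http://books.toscrape.com/"

# A relative link like the ones found on the home page.
plain = urljoin(BASE_URL, "catalogue/tipping-the-velvet_999/index.html")

# A leading slash does not produce a double slash in the result.
leading_slash = urljoin(BASE_URL, "/catalogue/tipping-the-velvet_999/index.html")

# Links relative to a nested page are resolved too (hypothetical page path).
relative_up = urljoin(BASE_URL + "catalogue/page-2.html", "../index.html")

print(plain)
print(leading_slash)
print(relative_up)
```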

We can also prepare our data for further analysis with the powerful **pandas** library. A convenient data structure to start with is a dictionary with the column names as keys, each mapped to a list of row values.

In [35]:
panda_dict = {
    "Titles": [],
    "Descriptions": []
}

for book_title in book_titles:
    title_text = book_title['title']
    title_link = book_title['href']
    
    panda_dict["Titles"].append(title_text)
    
    product_response = requests.get(BASE_URL + title_link)
    product_html = product_response.content
    product_scraped = BeautifulSoup(product_html, 'html.parser')
    description = product_scraped.find("div", id="product_description").next_sibling.next_sibling
    
    panda_dict["Descriptions"].append(description.text)
    

In [36]:
df = pd.DataFrame(data=panda_dict)

And let's take a look at our data frame below!

In [37]:
df

| | Titles | Descriptions |
|---|---|---|
| 0 | A Light in the Attic | It's hard to imagine a world without A Light i... |
| 1 | Tipping the Velvet | "Erotic and absorbing...Written with starling ... |
| 2 | Soumission | Dans une France assez proche de la nôtre, un h... |
| 3 | Sharp Objects | WICKED above her hipbone, GIRL across her hear... |
| 4 | Sapiens: A Brief History of Humankind | From a renowned historian comes a groundbreaki... |
| 5 | The Requiem Red | Patient Twenty-nine.A monster roams the halls ... |
| 6 | The Dirty Little Secrets of Getting Your Dream... | Drawing on his extensive experience evaluating... |
| 7 | The Coming Woman: A Novel Based on the Life of... | "If you have a heart, if you have a soul, Kare... |
| 8 | The Boys in the Boat: Nine Americans and Their... | For readers of Laura Hillenbrand's Seabiscuit ... |
| 9 | The Black Maria | Praise for Aracelis Girmay:"[Girmay's] every l... |