# Python Web Scraping — Try it yourself!

Let's start by importing the necessary libraries and parsing the front page of the [Books To Scrape](http://books.toscrape.com/index.html) website.

Every paragraph in this document is a *cell* that contains either descriptive text or a snippet of runnable Python code.

To run the cell, select it and click "Run" in the toolbar, or just press `Shift-Enter`. Double-clicking the cell allows you to edit its contents.

**Pro tip** 🤓:  Run your cells often to catch possible errors early! 

In [1]:
import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/index.html"
response = requests.get(url)
html = response.content
scraped = BeautifulSoup(html, 'html.parser')

After **running** the cell above, you can use the `scraped` variable to look for elements on the page.

In [2]:
# Run this cell to see how:

scraped.h1

<h1>All products</h1>
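
For an offline illustration of the same dot notation, here is a minimal sketch that parses a small HTML string instead of a live page. The `sample_html` content and the variable names are made up for this example and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the real page is much larger
sample_html = """
<html>
  <body>
    <h1>All products</h1>
    <p>Welcome to the demo shop</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# Dot notation returns the FIRST tag with that name
first_heading = sample.h1
first_paragraph = sample.p

print(first_heading)         # the whole <h1> element, tags included
print(first_paragraph.text)  # only the text of the first <p>
```

Note that `sample.p` ignores the second paragraph: dot notation only ever gives you the first match.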

😲 If you feel lost, you can refresh your knowledge in the "Takeaways" section on the Learn platform and in the lecture slides!

### Challenge 1: Print the title of the page

To print output in Python, you can use the `print()` function. It can either take a literal value as an argument (`print("hello")`, `print(2)`), or a variable — in that case the function will print the value that the variable refers to! 

```python
name = "Bob"
print(name) # => Bob
```

Remember you need to print just the _text_ inside the `<title>` tag, not the whole element!
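
The difference between an element and its text can be seen in the short sketch below. It uses a made-up HTML string (an assumption for illustration, not the real page) with extra whitespace around the title to show what `.strip()` is for.

```python
from bs4 import BeautifulSoup

# Made-up HTML with extra whitespace around the title text
sample_html = "<html><head><title>\n    Demo shop | Sandbox\n</title></head></html>"
sample = BeautifulSoup(sample_html, "html.parser")

title_element = sample.title   # the whole <title> element
raw_text = title_element.text  # text only, surrounding whitespace included
clean_text = raw_text.strip()  # leading and trailing whitespace removed

print(repr(raw_text))  # repr() makes the whitespace visible
print(clean_text)
```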

In [4]:
# write your code here
page_title=scraped.title.text.strip()
print(page_title)

All products | Books to Scrape - Sandbox


<details>
<summary>
    <strong>Reveal answer 🤫</strong>
</summary>
<pre>
title_text = scraped.title.text.strip()
print(title_text)
</pre>    
</details>

### Challenge 2: Print the *full* title of the first book on a page

Remember how to locate a single element with BeautifulSoup. If lost, revisit the slides on the Learn platform, or visit the "Takeaways" section for a quick recap.
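
As a quick recap, the sketch below locates a single element and reads one of its attributes. The markup is a made-up imitation of a product card, and the `data-id` attribute is a hypothetical name used only to show what happens when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Made-up product markup: the link text is shortened, the attribute is not
sample_html = """
<article class="product_pod">
  <h3><a href="demo-book.html" title="A Very Long Demo Book Title">A Very Long Demo ...</a></h3>
</article>
"""
sample = BeautifulSoup(sample_html, "html.parser")

link = sample.find("a")        # first <a> element, same result as sample.a

visible_text = link.text       # the text shown on the page
full_title = link["title"]     # value of the title attribute
missing = link.get("data-id")  # .get() returns None if the attribute is absent

print(visible_text)
print(full_title)
```

Square brackets raise a `KeyError` when the attribute does not exist, so `.get()` can be the safer choice when you are not sure it is there.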

In [13]:
# write your code here
# first_book = scraped.article.h3.a.text  # <- gives the shortened title ending in "...", not the full one
first_book=scraped.article.h3.a["title"]
print(first_book)

A Light in the Attic


<details>
<summary>
    <strong>Reveal answer 🤫</strong>
</summary>
<pre>
title = scraped.article.h3.a["title"]
print(title)
</pre>    
</details>

### Challenge 3: Print *all* the full titles from the page

Use the Beautiful Soup methods that return a _collection_ of elements. Remind yourself how to **loop** over them (the `for ... in ...` construct).
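
Here is a minimal sketch of collecting several elements and looping over them. The HTML list and its `book` class are made up for illustration and do not match the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two items share the "book" class, one does not
sample_html = """
<ul>
  <li class="book">First Demo Book</li>
  <li class="book">Second Demo Book</li>
  <li class="other">Not a book</li>
</ul>
"""
sample = BeautifulSoup(sample_html, "html.parser")

# find_all() returns a list-like collection of every matching element
books = sample.find_all("li", class_="book")
print(len(books))

# Loop over the collection and handle one element at a time
for book in books:
    print(book.text)
```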

In [14]:
# write your code here
all_items = scraped.find_all("article", class_="product_pod")  # alternatively, select the links directly: scraped.find_all("a", title=True)
for each in all_items:
    print(each.h3.a["title"])

A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas


In [16]:
# Alternative: instead of just printing the titles, save them in a list so they can be reused:
all_items = scraped.find_all("article", class_="product_pod")  # or select the links directly: scraped.find_all("a", title=True)
collection=[]
for each in all_items:
    collection.append(each.h3.a["title"])
print(collection)

['A Light in the Attic', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History of Humankind', 'The Requiem Red', 'The Dirty Little Secrets of Getting Your Dream Job', 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'The Black Maria', 'Starving Hearts (Triangular Trade Trilogy, #1)', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'Rip it Up and Start Again', 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'Olio', 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'Libertarianism for Beginners', "It's Only the Himalayas"]


<details>
<summary>
    <strong>Reveal answer 🤫</strong>
</summary>
<pre>
links = scraped.find_all("a", title=True)
for link in links:
    print(link["title"])
</pre>    
</details>

### Challenge 4: Print all the *prices* from the page

Here's how you can strip the currency symbol and convert the text to a numerical value (assuming the price element is stored in a variable called `price`):

`price = float(price.text.lstrip("£"))`

You can use a **CSS class selector** for this task
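
The two ideas above (a CSS class selector and the text-to-number conversion) can be combined as in the sketch below. The HTML is a made-up sample that only borrows the `price_color` class name, and the prices in it are invented.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two prices and one unrelated paragraph
sample_html = """
<p class="price_color">£51.77</p>
<p class="availability">In stock</p>
<p class="price_color">£13.50</p>
"""
sample = BeautifulSoup(sample_html, "html.parser")

# A leading dot in a CSS selector means "elements with this class"
price_elements = sample.select(".price_color")

prices = []
for element in price_elements:
    # lstrip("£") removes the leading pound sign, float() converts the rest
    prices.append(float(element.text.lstrip("£")))

print(prices)
```

Once the prices are numbers rather than strings, you can do arithmetic with them, for example `sum(prices)` or `max(prices)`.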

In [47]:
# write your code here
all_prices_color = scraped.select(".price_color")
# print(all_prices_color)
list_all_prices = []
for each in all_prices_color:
    # print(float(each.text.lstrip("£")))  # <- if we just want to print the prices
    list_all_prices.append(float(each.text.lstrip("£")))  # <- if we want to save the prices in a list (to use or check later)
print(list_all_prices)

[51.77, 53.74, 50.1, 47.82, 54.23, 22.65, 33.34, 17.93, 22.6, 52.15, 13.99, 20.66, 17.46, 52.29, 35.02, 57.25, 23.88, 37.59, 51.33, 45.17]


<details>
<summary>
    <strong>Reveal answer 🤫</strong>
</summary>
<pre>
prices = scraped.select(".price_color")
for price in prices:
    price = float(price.text.lstrip("£"))
    print(price)
</pre>    
</details>

### Challenge 5: Get a corresponding price for each title

This is what the resulting data structure should look like (a list of dictionaries):

```
[{'A Light in the Attic': 51.77}, {'Tipping the Velvet': 53.74}, ...]
```

Note that the real list will be much longer, with one dictionary per book on the page.

A reminder on how you can append a dictionary to a list:

```python
title_prices = []

articles = []  # replace with the collection of articles you find on the page

for article in articles:
    title = ...  # get the article's title
    price = ...  # get the article's price
    title_prices.append({title: price})

```
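
To see the append pattern with concrete values, here is a small sketch in plain Python. The titles and prices are invented, and the single-dictionary variant at the end is only an alternative shape for comparison, not what the challenge asks for.

```python
# Invented sample data standing in for scraped values
titles = ["Demo Book A", "Demo Book B"]
prices = [10.5, 7.25]

# One small dictionary per book, collected in a list
title_prices = []
for title, price in zip(titles, prices):
    title_prices.append({title: price})

print(title_prices)

# Alternative shape: a single dictionary mapping every title to its price
price_by_title = dict(zip(titles, prices))
print(price_by_title["Demo Book B"])
```

A list of dictionaries keeps the books in page order and allows duplicate titles, while a single dictionary makes it easier to look up the price of a known title.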

In [70]:
title_prices = []

# write your code here
title_price_info=scraped.select(".product_pod")
for each in title_price_info:
    price_currency=each.find("p", class_="price_color")
    price=float(price_currency.text.lstrip("£"))
    title=each.h3.a["title"]
    title_prices.append({title: price})

#print(title_price_info)
print(title_prices)

[{'A Light in the Attic': 51.77}, {'Tipping the Velvet': 53.74}, {'Soumission': 50.1}, {'Sharp Objects': 47.82}, {'Sapiens: A Brief History of Humankind': 54.23}, {'The Requiem Red': 22.65}, {'The Dirty Little Secrets of Getting Your Dream Job': 33.34}, {'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull': 17.93}, {'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics': 22.6}, {'The Black Maria': 52.15}, {'Starving Hearts (Triangular Trade Trilogy, #1)': 13.99}, {"Shakespeare's Sonnets": 20.66}, {'Set Me Free': 17.46}, {"Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)": 52.29}, {'Rip it Up and Start Again': 35.02}, {'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991': 57.25}, {'Olio': 23.88}, {'Mesaerion: The Best Science Fiction Stories 1800-1849': 37.59}, {'Libertarianism for Beginners': 51.33}, {"It's Only the Himalayas": 45.17}]


<details>
<summary>
    <strong>Reveal answer 🤫</strong>
</summary>
<pre>
title_prices = []

articles = scraped.select(".product_pod")

for article in articles:
    title = article.h3.a["title"]
    price = article.find("p", class_="price_color")
    price_float = float(price.text.lstrip("£"))
    title_prices.append({title: price_float}) # Create a dictionary and append it to the list
    
print(title_prices)
</pre>    
</details>

## Above and beyond

Now you can click on the "Save to browser storage" icon next to "Download" at the top of this notebook. Next time you connect to MyBinder, you can restore your work by clicking "Restore from browser storage".

Take as much time as you need to build a scraper for any website you want! Keep in mind that the information you scrape should be publicly accessible and not protected by a login or password.
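
If you are unsure where to start, one possible structure is sketched below: one function downloads a page and another parses it, so the parsing can be tried on a small HTML string first. The function names `fetch_html` and `parse_books` and the sample markup are assumptions made for this sketch, and a different website will need different selectors.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw HTML (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on 404, 500 and similar errors
    return response.content


def parse_books(html):
    """Return a list of {title: price} dictionaries found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for article in soup.select(".product_pod"):
        title = article.h3.a["title"]
        price_element = article.find("p", class_="price_color")
        price = float(price_element.text.lstrip("£"))
        books.append({title: price})
    return books


# Offline check on made-up markup that imitates one product card
sample_html = """
<article class="product_pod">
  <h3><a href="demo.html" title="Demo Book">Demo ...</a></h3>
  <p class="price_color">£9.99</p>
</article>
"""
print(parse_books(sample_html))
```

To scrape the live page, you could call `parse_books(fetch_html("http://books.toscrape.com/index.html"))`.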

In [41]:
# Write your code for a complete scraper when you feel like it :)