# Exercise - Scrape a Webpage

In this exercise, you'll scrape the Internet Archive (archive.org) to gather data on one item in the archive: the book "Robinson Crusoe".

This exercise lets you practice scraping a webpage, gathering data via an API, and filtering the important data for an item in a catalog. Data wranglers apply these skills in many applications, including gathering product reviews and movie recommendations.

Let's get started!

In [52]:
#Imports - do not modify
import requests
from bs4 import BeautifulSoup
import json

## 1. Scrape the HTML

1.1 The novel Robinson Crusoe is available on archive.org at the following URL: https://archive.org/details/cu31924011498676/mode/2up. 

Create an HTTP GET request via the requests library to get the HTML in Unicode from this page.

In [53]:
#FILL IN
item = "https://archive.org/details/cu31924011498676/mode/2up"

#Create an HTTP GET request
book_metadata = requests.get(item)

#Raise an exception if we made a request resulting in an error
book_metadata.raise_for_status()
#Access the content of the response in Unicode
#(.text is the decoded str; .content is raw bytes, and both are attributes, not methods)
html = book_metadata.text
#Print the start of the result to check that it worked
print(html[:500])
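
For context on the Unicode requirement: in the requests library, `content` holds the raw response bytes and `text` holds the same payload decoded to a string, and both are attributes rather than methods. The sketch below mimics that distinction with plain Python bytes so it runs without a network call. The sample HTML string is made up for illustration and is not the real archive.org page.

```python
# Made-up sample payload standing in for a page body (not the real archive.org HTML)
sample_html = "<title>Robinson Crusoe</title>"

# Analogous to response.content: the raw bytes of the body
raw_bytes = sample_html.encode("utf-8")

# Analogous to response.text: the bytes decoded into a Unicode string
decoded_text = raw_bytes.decode("utf-8")

print(type(raw_bytes).__name__)     # bytes
print(type(decoded_text).__name__)  # str
print(callable(raw_bytes))          # False: a bytes object cannot be called like a function
```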

1.2 Use BeautifulSoup to parse the HTML. Optionally, you could prettify the HTML, so you can look through the file. 

In [None]:
#FILL IN
#Save a local copy of the HTML so you can search it in a text editor
with open('book_metadata.html','wb') as file:
    file.write(book_metadata.content)
#Use BeautifulSoup to parse the HTML
book_prefix = BeautifulSoup(book_metadata.content,'html.parser')

In [None]:
#OPTIONAL - Print a clean (prettier) version to look through
print(book_prefix.prettify())

Let's now answer a few questions around this item by getting specific tags from the HTML.

**Note:** You can use a find/search tool (e.g., Ctrl+F on Windows; Command+F on Mac) to identify tags in the HTML, or download the prettified version to your system and search for the tags in a text editor for the next section.

1.3 What is the username of the uploader? Print the username in text **(not in HTML)**.

**Hint**: This is the HTML snippet containing the username

```
<a class="item-upload-info__uploader-name" href="/details/@hank_b"> hank_b </a>
```
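
As a minimal sketch of how such a tag can be located, the example below parses only the hint snippet rather than the full page. The variable names `snippet`, `link`, `uploader`, and `profile_path` are illustrative and not part of the exercise.

```python
from bs4 import BeautifulSoup

# The hint snippet from above, parsed on its own instead of the full page
snippet = '<a class="item-upload-info__uploader-name" href="/details/@hank_b"> hank_b </a>'
soup = BeautifulSoup(snippet, "html.parser")

# class_ (with a trailing underscore) filters on the HTML class attribute
link = soup.find("a", class_="item-upload-info__uploader-name")

uploader = link.get_text(strip=True)  # text content without surrounding whitespace
profile_path = link["href"]           # attribute values are read like dictionary keys

print(uploader)
print(profile_path)
```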

In [None]:
#Search for the uploader's username tag and strip the surrounding whitespace
user_name_collection=book_prefix.find('a',class_='item-upload-info__uploader-name').text.strip()

user_name_collection
    

In [None]:
#FILL IN
#Find the tag
username =user_name_collection

#Strip the username from the HTML
#Example code: username = username[0].text.strip()

print(username)

1.4 How many pages does the book have? Print the results in text **(not in HTML)**.

**Hint**: This is the HTML snippet containing the number of pages.
```
<dl class="metadata-definition">
           <dt>
            Pages
           </dt>
           <dd class="">
            <span itemprop="numberOfPages">
             418
            </span>
           </dd>
          </dl>
...
```
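
Tags can also be matched on attributes other than `class`. The sketch below, run against a trimmed copy of the snippet, looks up the `itemprop` attribute and guards against a missing tag before converting the text to an integer. The names `page_text` and `page_count` are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the hint snippet above
snippet = """
<dl class="metadata-definition">
  <dt>Pages</dt>
  <dd class=""><span itemprop="numberOfPages"> 418 </span></dd>
</dl>
"""
soup = BeautifulSoup(snippet, "html.parser")

# attrs matches arbitrary attributes such as itemprop
span = soup.find("span", attrs={"itemprop": "numberOfPages"})

# find() returns None when nothing matches, so guard before reading the text
page_text = span.get_text(strip=True) if span is not None else None
page_count = int(page_text) if page_text is not None else None

print(page_text)
print(page_count)
```

Note that the exercise's assertion compares against the string `'418'`, so the integer conversion here is only for cases where you need to do arithmetic on the value.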

In [54]:
#FILL IN
#Find the tag
no_of_pages = book_prefix.find('span',itemprop='numberOfPages').text.strip()

#Strip the number from the HTML
# no_of_pages = 
print(no_of_pages)

418


Check your work with the assertions below.

In [55]:
#DO NOT MODIFY
#Ensure these assert statements pass before moving to the next section
assert username == 'hank_b'
assert no_of_pages == '418'

## 2. Use the API

With the Internet Archive, an item's metadata can be fetched by making an HTTP GET request to its metadata API at https://archive.org/metadata/{identifier}. 

Our item's identifier is cu31924011498676.
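
One way to build the request URL is to fill the identifier into the template given above. This is a small sketch; `METADATA_URL_TEMPLATE` and `metadata_url` are hypothetical names introduced here for illustration.

```python
# URL template quoted in the text above
METADATA_URL_TEMPLATE = "https://archive.org/metadata/{identifier}"


def metadata_url(identifier):
    """Hypothetical helper: fill the item identifier into the template."""
    return METADATA_URL_TEMPLATE.format(identifier=identifier)


url = metadata_url("cu31924011498676")
print(url)
```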

2.1 Use the requests library to get the metadata in JSON format and print the JSON.

In [56]:
#FILL IN
#Create an HTTP GET request to the metadata API
book= requests.get('https://archive.org/metadata/cu31924011498676')

#Raise an exception if we made a request resulting in an error
book.raise_for_status()
#Get the JSON

json_book_metadata=json.loads(book.content)
print(json_book_metadata)

{'created': 1703009806, 'd1': 'ia600307.us.archive.org', 'd2': 'ia800307.us.archive.org', 'dir': '/19/items/cu31924011498676', 'files': [{'name': '__ia_thumb.jpg', 'source': 'original', 'mtime': '1700390884', 'size': '27630', 'md5': '5e1dff9ea5cdfb3b5145466b9b495190', 'crc32': 'b2b415eb', 'sha1': '935fbc7e7c2ee006a7f355666396a5d56ec78364', 'format': 'Item Tile', 'rotation': '0'}, {'name': 'cu31924011498676.djvu', 'source': 'derivative', 'format': 'DjVu', 'original': 'cu31924011498676_djvu.xml', 'md5': '16569bba84d9dc64169ba7eb874e7aff', 'mtime': '1256993280', 'size': '5096265', 'crc32': '518887ee', 'sha1': 'a6323321b9a05ddcd526a78c95b7a5133432ce66'}, {'name': 'cu31924011498676.epub', 'source': 'derivative', 'original': 'cu31924011498676_abbyy.gz', 'mtime': '1700390882', 'size': '4340746', 'md5': '69e950415956abe17987b6a7a935c0fe', 'crc32': 'b729bd6c', 'sha1': '42d98cd0588ec38d7e5a84648d40b4ef1af02fc3', 'format': 'EPUB'}, {'name': 'cu31924011498676.gif', 'source': 'derivative', 'format'

Inspect the hierarchy of attributes and retrieve values from the JSON to answer the following questions:
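
A quick way to get oriented in a nested JSON document is to list each top-level key alongside the type of its value. The sketch below does this on a tiny stand-in for the response body: the values are copied from the printed output above, but the real response has many more keys and file entries. `summarize` is a hypothetical helper, not part of the exercise.

```python
import json

# Tiny stand-in for the API response body (the real one is much larger)
sample_body = b'''{
  "created": 1703009806,
  "dir": "/19/items/cu31924011498676",
  "files": [{"name": "__ia_thumb.jpg", "size": "27630"}]
}'''

# json.loads accepts bytes as well as str
data = json.loads(sample_body)


def summarize(obj):
    """Hypothetical helper: map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in obj.items()}


summary = summarize(data)
print(summary)

# Lists are indexed by position, dictionaries by key
print(data["files"][0]["name"])
```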

2.2 What camera was used?

In [72]:
#FILL IN
#Inspect the top-level keys to locate the camera attribute
print(json_book_metadata.keys())

#Get the name of the camera
camera_name = json_book_metadata['metadata']['camera']
camera_name

dict_keys(['created', 'd1', 'd2', 'dir', 'files', 'files_count', 'item_last_updated', 'item_size', 'metadata', 'reviews', 'server', 'uniq', 'workable_servers'])

2.3 What is the size of the **PDF** of the book?

**Hint:** The `files` attribute has a list as a value, so you will need to use list indexing to get to the PDF attribute. 
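
A hard-coded list index depends on the order of the `files` list. As an alternative sketch, you could search the list for an entry whose name ends in `.pdf`. The sample list below is a made-up excerpt: the PDF file name is an assumption, and `find_size_by_suffix` is a hypothetical helper.

```python
# Made-up excerpt of the files list; the .pdf file name is assumed for illustration
sample_files = [
    {"name": "__ia_thumb.jpg", "size": "27630"},
    {"name": "cu31924011498676.epub", "size": "4340746"},
    {"name": "cu31924011498676.pdf", "size": "8987708"},
]


def find_size_by_suffix(files, suffix):
    """Hypothetical helper: return the size of the first file whose name ends with suffix."""
    for entry in files:
        if entry.get("name", "").lower().endswith(suffix):
            return entry.get("size")
    return None  # no matching file found


pdf_size_example = find_size_by_suffix(sample_files, ".pdf")
print(pdf_size_example)
```

This returns only the first match, so an item with several PDF files would need an extra check on the file name.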

In [89]:
#FILL IN
#Get the size of the PDF from its entry in the files list
pdf_size=json_book_metadata['files'][4]['size']
pdf_size


'8987708'

Check your work with the assertions below.

In [None]:
#DO NOT MODIFY
#Ensure these assert statements pass before moving on
assert camera_name == 'EOS-1DS MARK ll'
assert pdf_size == '8987708'