# STA 220 Assignment 2

Due __February 20__ by __11:59pm__. Submit your work by uploading it to Gradescope through Canvas.

Please rename this file to "H2_Lastname_Firstname_srnr", where srnr are the last four digits of your student ID number, and export it as a PDF file.

The objective of this assignment is to solidify your understanding of Scraping and XML.

Instructions:

1. Provide your solutions in new cells following each exercise description. Create as many new cells as necessary. Use code cells for your Python scripts and Markdown cells for explanatory text or answers to non-coding questions.

2. Prioritize code readability. Just as in writing a book, the clarity of each line matters. Adopt the __one-statement-per-line__ rule. If you have a lengthy code statement, consider breaking it into multiple lines for clarity. Note you can use `'''` to start and end strings in Python that are written over multiple lines.

3. To help understand and maintain code, you should add comments to explain your code. Use the hash symbol (#) to start writing a comment.

4. Submit your final work as a __.pdf__ file on __Gradescope__. To convert your .ipynb file into one of these formats, navigate to "File", select "Download as", and then choose either "PDF via LaTeX" or "HTML". If "PDF via LaTeX" does not work for you, export to "HTML", and then use Chrome to print the .html file into PDF. Gradescope only accepts PDF files.

5. This assignment will be graded on your proficiency in programming. Be sure to demonstrate your abilities and submit your own, correct and readable solutions. 
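
As a small illustration of items 2 and 3 above, the sketch below splits one long statement across several lines inside parentheses and uses a triple-quoted string. The names `query_url` and `xpath_note` are made up for this example and are not part of the assignment.

```python
# Break a long expression inside parentheses:
# still one statement, but spread over several short lines.
query_url = (
    "https://books.toscrape.com/"
    "catalogue/category/books/"
    "classics_6/index.html"
)

# A triple-quoted string may span multiple lines;
# the line break becomes part of the string.
xpath_note = '''Select every product card,
then read the link in its heading.'''

print(query_url)
print(xpath_note.count("\n"))  # number of line breaks stored in the string
```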

## Setting

We will scrape the website [books.toscrape.com](https://books.toscrape.com) and use an XML parser to extract the information. You may also use BeautifulSoup4 instead. The following packages may be useful:

In [56]:
import requests
import lxml.html as lx
import re
import pandas as pd

Furthermore, you may want to declare some variables beforehand. Feel free to adjust them (in particular the headers):

In [57]:
base_url = 'https://books.toscrape.com/'
headers = {
    'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:144.0) Gecko/20100101 Firefox/144.0"
}

## Exercise 1 [5 Points]

### 1a) [2 Points]

#### Task

Write a function `get_categories` (no arguments) that returns a dictionary consisting of:
- Keys: the Book categories (such as Travel, History) that can be found on the left side of the page `https://books.toscrape.com/index.html`
- Values: the _relative_ links to the page that lists all books of the category. Each relative link should start with 'catalogue/category' and end with 'index.html'.
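
To make the expected result concrete, here is a minimal offline sketch of turning a nested sidebar list into such a dictionary. The markup in `SAMPLE_SIDEBAR` is a hand-written assumption that only imitates the structure of the real page, and the sketch uses the standard-library `xml.etree.ElementTree` (which supports a small XPath subset) instead of `lxml`, so it runs without a network connection.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the sidebar markup (an assumption, not the real page).
SAMPLE_SIDEBAR = """
<div>
  <ul class="nav nav-list">
    <li>
      <a href="catalogue/category/books_1/index.html">Books</a>
      <ul>
        <li><a href="catalogue/category/books/travel_2/index.html">
            Travel
        </a></li>
        <li><a href="catalogue/category/books/mystery_3/index.html">
            Mystery
        </a></li>
      </ul>
    </li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE_SIDEBAR.strip())

# Only the links of the inner list are categories; the outer "Books" link is skipped.
links = root.findall(".//ul[@class='nav nav-list']/li/ul/li/a")

sample_categories = {}
for link in links:
    name = "".join(link.itertext()).strip()  # remove the surrounding whitespace
    sample_categories[name] = link.get("href")

print(sample_categories)
```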

#### Solution START

In [58]:
def get_categories():
    response = requests.get(base_url, headers=headers)
    tree = lx.fromstring(response.content)
    
    categories = {}

    # Select the category links in the sidebar (the nested list below "Books")
    links = tree.xpath('//ul[@class="nav nav-list"]/li/ul/li/a')
    for link in links:
        name = link.text_content().strip()
        href = link.get("href")

        if "catalogue/category" in href:
            categories[name] = href
    
    return categories




#### Solution END

Please run the following code to get full credit:

In [59]:
categories = get_categories()
len(categories)

50

#### Example

In [60]:
pd.DataFrame.from_dict(categories, orient = 'index').head()

Unnamed: 0,0
Travel,catalogue/category/books/travel_2/index.html
Mystery,catalogue/category/books/mystery_3/index.html
Historical Fiction,catalogue/category/books/historical-fiction_4/...
Sequential Art,catalogue/category/books/sequential-art_5/inde...
Classics,catalogue/category/books/classics_6/index.html


### 1b) [1 Points]

#### Task

Write a function `get_books_from_page` that takes a `url` as argument and returns a list of links to the books found on this page (without clicking on the next button). The `url` should point to a page that opens when you click on one of the categories, e.g., [https://books.toscrape.com/catalogue/category/books/classics_6/index.html](https://books.toscrape.com/catalogue/category/books/classics_6/index.html).
The function should return `None` if the page does not contain any books. (See the examples below.)
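
The links on a category page are relative to that page. As a hedged illustration (not required by the task), the standard-library function `urllib.parse.urljoin` can resolve such a link against the URL of the page it was found on:

```python
from urllib.parse import urljoin

# URL of a category page and one relative book link as it appears on that page
page_url = "https://books.toscrape.com/catalogue/category/books/art_25/index.html"
relative_link = "../../../wall-and-piece_971/index.html"

# Each '../' moves one directory up from the category page
absolute_link = urljoin(page_url, relative_link)
print(absolute_link)
```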

#### Solution START

In [61]:
def get_books_from_page(url):
    response = requests.get(url, headers = headers)
    tree = lx.fromstring(response.content)

    # create a book list
    books = tree.xpath('//article[@class="product_pod"]/h3/a/@href')

    if not books:
        return None
    
    return books

#### Solution END

Please run the following code to get full credit:

In [62]:
get_books_from_page('https://books.toscrape.com/catalogue/category/books/art_25/index.html')

['../../../wall-and-piece_971/index.html',
 '../../../feathers-displays-of-brilliant-plumage_695/index.html',
 '../../../art-and-fear-observations-on-the-perils-and-rewards-of-artmaking_559/index.html',
 '../../../the-new-drawing-on-the-right-side-of-the-brain_550/index.html',
 '../../../history-of-beauty_521/index.html',
 '../../../the-story-of-art_500/index.html',
 '../../../the-art-book_490/index.html',
 '../../../ways-of-seeing_94/index.html']

In [63]:
get_books_from_page('https://books.toscrape.com/catalogue/category/books/art_25/page-2.html') is None

True

#### Example

In [64]:
get_books_from_page('https://books.toscrape.com/catalogue/category/books/classics_6/index.html')[:5]

['../../../the-secret-garden_413/index.html',
 '../../../the-metamorphosis_409/index.html',
 '../../../the-pilgrims-progress_353/index.html',
 '../../../the-hound-of-the-baskervilles-sherlock-holmes-5_348/index.html',
 '../../../little-women-little-women-1_331/index.html']

In [65]:
get_books_from_page('https://books.toscrape.com/catalogue/category/books/classics_6/page-3.html') is None

True

#### Extra self practice

In [66]:
# Goal: extract the books from all pages of a category by following the "next" button
'''
from urllib.parse import urljoin

def get_all_books(url):
    all_books = []
    while True:
        response = requests.get(url, headers= headers)
        tree = lx.fromstring(response.content)

        # books in current page
        books = tree.xpath('//article[@class="product_pod"]/h3/a/text()')
        if not books:
            return None
        
        all_books.extend(books)
        next_page = tree.xpath('//li[@class="next"]/a/@href')

        if not next_page:
            break

        url = urljoin(url, next_page[0])
    return all_books

get_all_books("https://books.toscrape.com/catalogue/category/books/mystery_3/index.html")
'''


### 1c) [2 Points]

#### Task

Write a function `get_all_books_of_category` that takes a string `category` as argument and does the following:
- Use the dictionary `categories` from 1a) to look up the link to the first page of the category (ending with `index.html`).
- Call the function `get_books_from_page` for the first page of the category and store the result as a list `book_list`.
- Loop `i` from 2 to 10:
  - Call the function `get_books_from_page` for the i-th page of the category (ending with `page-i.html`) and add its links to `book_list`.
  - Stop the loop if the function returns `None`. In particular, the loop shall not try to access page $i$ if page $(i-1)$ already returned `None`.

Afterwards, the function shall return the list `book_list` that contains the URLs of all books of this category. (See examples below.)
You may print a statement saying how many pages were found for the category.

Note that the return must be a list whose elements are urls (strings). In particular, it must not be a list of lists!

Remark: The difference between 1c) and 1b) is that here we want to get all books of one category while for 1b) we had to get all books that were listed on one of the pages of a category. Thus, 1c) is more or less applying 1b) to all pages of a category.
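
The loop described above can be sketched offline as follows. This is only an illustration under stated assumptions: `fake_get_books_from_page`, `FAKE_SITE` and `collect_books` are hypothetical names, and the fake function stands in for the real `get_books_from_page` so that no request is sent.

```python
# Hypothetical stand-in for the site: maps a page URL to its book links.
FAKE_SITE = {
    "cat/index.html": ["a.html", "b.html"],
    "cat/page-2.html": ["c.html"],
}
requested = []  # records which pages were actually requested


def fake_get_books_from_page(url):
    requested.append(url)
    return FAKE_SITE.get(url)  # None when the page does not exist


def collect_books(first_page_url, fetch=fake_get_books_from_page, max_pages=10):
    book_list = list(fetch(first_page_url) or [])
    for i in range(2, max_pages + 1):
        page_url = first_page_url.replace("index.html", f"page-{i}.html")
        books = fetch(page_url)
        if books is None:
            break  # stop at the first missing page
        book_list.extend(books)  # extend keeps the list flat; append would nest it
    return book_list


all_books = collect_books("cat/index.html")
print(all_books)
print(requested)
```

Note that page 3 is requested once (and found missing), but page 4 and later pages are never accessed.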

#### Solution START

In [67]:
from urllib.parse import urljoin

def get_all_books_of_category(category):
    # Look up the relative link of the category and build the URL of its first page
    relative_link = categories[category]
    first_page_url = urljoin(base_url, relative_link)
    book_list = get_books_from_page(first_page_url)
    
    if book_list is None:
        return None
    page_count = 1

    # Pages 2..10 follow the pattern page-i.html; stop at the first missing page
    for i in range(2, 11):
        page_url = first_page_url.replace("index.html", f"page-{i}.html")
        books = get_books_from_page(page_url)
        if books is None:
            break
        book_list.extend(books)  # extend keeps the result a flat list
        page_count += 1
        
    print(f"{category} has {page_count} pages.")
    return book_list
    


#### Solution END

Please run the following code to get full credit:

In [68]:
get_all_books_of_category('Art')

Art has 1 pages.


['../../../wall-and-piece_971/index.html',
 '../../../feathers-displays-of-brilliant-plumage_695/index.html',
 '../../../art-and-fear-observations-on-the-perils-and-rewards-of-artmaking_559/index.html',
 '../../../the-new-drawing-on-the-right-side-of-the-brain_550/index.html',
 '../../../history-of-beauty_521/index.html',
 '../../../the-story-of-art_500/index.html',
 '../../../the-art-book_490/index.html',
 '../../../ways-of-seeing_94/index.html']

#### Example

In [69]:
mystery = get_all_books_of_category('Mystery')

Mystery has 2 pages.


In [70]:
mystery[:5]

['../../../sharp-objects_997/index.html',
 '../../../in-a-dark-dark-wood_963/index.html',
 '../../../the-past-never-ends_942/index.html',
 '../../../a-murder-in-time_877/index.html',
 '../../../the-murder-of-roger-ackroyd-hercule-poirot-4_852/index.html']

## Exercise 2 [5 Points]

Consider the following code snippet:
```python
    response = requests.get(url, headers = headers)
    response.raise_for_status()
    response.encoding = "utf-8"
    html = lx.fromstring(response.text)
```
where url is the url (as a string) to the page of one book, e.g. `url = 'https://books.toscrape.com/catalogue/unicorn-tracks_951/index.html'`.

The following functions shall take the object `html` as described above as argument.
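
The line `response.encoding = "utf-8"` in the snippet matters because the page contains the pound sign. The following sketch, which needs no network access, shows what can happen when UTF-8 bytes are decoded with ISO-8859-1 instead (a fallback that may be used when the response headers do not declare a charset):

```python
# The price as UTF-8 bytes, as a server would send it
raw_bytes = "£18.78".encode("utf-8")

# Decoding with the wrong codec turns the two-byte pound sign into two characters
wrong = raw_bytes.decode("iso-8859-1")
right = raw_bytes.decode("utf-8")

print(repr(wrong))
print(repr(right))
```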

Please run the following code to get full credit:

In [71]:
url = 'https://books.toscrape.com/catalogue/unicorn-tracks_951/index.html'
response = requests.get(url, headers = headers)
response.raise_for_status()
response.encoding = "utf-8"
html = lx.fromstring(response.text)

### 2a) [1 Points]

#### Task

Write a function `get_rating` that gets the `html` (as described above) as argument and returns the number of stars (as integer) the book's rating has. For this, you may use the following dictionary:

In [72]:
stars = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
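
As a hint on how the dictionary can be used, here is a minimal sketch that maps a class string to a number. It assumes the rating is encoded in a class attribute such as `"star-rating Three"`; the helper `rating_from_class` is a hypothetical name, not part of the assignment.

```python
stars = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_from_class(class_value):
    """Return the star count encoded in a class string, or None if absent."""
    for word in class_value.split():
        if word in stars:
            return stars[word]
    return None


print(rating_from_class("star-rating Three"))
```

Searching all words instead of taking a fixed position keeps the lookup independent of the order of the class names.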

#### Solution START

In [73]:
def get_rating(html):
    stars = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

    rating_element = html.xpath('//p[contains(@class, "star-rating")]')

    if not rating_element:
        return None
    
    rating_class = rating_element[0].get("class")
    rating_word = rating_class.split()[1]

    return stars[rating_word] # retrieve value from dictionary

#### Solution END

Please run the following code to get full credit:

In [74]:
get_rating(html)

3

### 2b) [1 Points]

#### Task

Write a function `get_title` that gets the `html` (as described above) as argument and returns the book's title (as string).

#### Solution START

In [75]:
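def get_book_title(html):
    # The title is the <h1> inside the main product block
    title_html = html.xpath('//div[contains(@class, "product_main")]/h1')

    if not title_html:
        return None
    
    title = title_html[0].text.strip()
    return title

# The task asks for the name `get_title`; keep both names available
get_title = get_book_title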

#### Solution END

Please run the following code to get full credit:

In [76]:
get_book_title(html)

'Unicorn Tracks'

### 2c) [1 Points]

#### Task

Write a function `get_stock` that gets the `html` (as described above) as argument and returns how many books are still available (the stock). Note that the return must be an integer, not a string.
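
The availability is shown as text such as `In stock (16 available)`. A minimal sketch of pulling the integer out of such a string with a regular expression (the helper name `parse_stock` is made up for this example, and returning 0 for unmatched text is a design assumption):

```python
import re


def parse_stock(availability_text):
    """Extract the number of available copies; return 0 if none is given."""
    match = re.search(r"\((\d+) available\)", availability_text)
    return int(match.group(1)) if match else 0


print(parse_stock("In stock (16 available)"))
print(type(parse_stock("In stock (16 available)")))
```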

#### Solution START

In [77]:
def get_stock(html):
    stock_text = html.xpath(
        '//table[contains(@class, "table-striped")]'
        '//tr[th[text()="Availability"]]/td/text()'
    )

    if not stock_text:
        return None
    
    # stock_text is a list
    text = stock_text[0]
    number = re.search(r'\d+', text) #match object

    if number:
        return int(number.group())
    return 0

#### Solution END

Please run the following code to get full credit:

In [78]:
get_stock(html)

16

In [79]:
type(get_stock(html))

int

### 2d) [2 Points]

#### Task

The following task is meant to combine the previous work/functions to a meaningful output.

Write a function `get_book_info` that takes an url (as string) to one of the book pages as argument and does the following:
- Uses the requests module and an xml parser to get the xml-parsed html code of the page.
- Gets the title `title` of the book
- Calculates how many books are available (`stock`)
- Gets the rating `rating` of the book
- Uses `pd.read_html` to read the one table of the book page that contains information like UPC/Tax/Number of reviews and stores the result as a pandas DataFrame called `table`.
- Adds the following entries to the DataFrame: 'Rating': `rating`, 'Title': `title` and 'Stock': `stock`.
- Sets the column consisting of the descriptions (like 'Rating', 'Title', 'UPC', etc) as index of the DataFrame.
- Returns the DataFrame.

For this task, you may use the functions you defined in earlier tasks.
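
The table-related steps above can be tried out without any scraping. The sketch below builds a small two-column DataFrame by hand (as a stand-in for what `pd.read_html` might return for the product table), sets the description column as index, and adds new rows with `.loc`. The values are taken from the example book; the two-row table itself is an assumption for illustration.

```python
import pandas as pd

# Stand-in for the product table: column 0 holds descriptions, column 1 the values
table = pd.DataFrame(
    [["UPC", "7ae099f3898e0209"], ["Number of reviews", "0"]]
)

# Use the descriptions as index, then add rows by label
table = table.set_index(0)
table.loc["Title"] = "Unicorn Tracks"
table.loc["Rating"] = 3
table.loc["Stock"] = 16

print(table)
```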

#### Solution START

In [80]:
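from io import StringIO

def get_book_info(url):
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    # Decode as UTF-8; otherwise the pound sign may be decoded as 'Â£'
    response.encoding = "utf-8"
    html = lx.fromstring(response.text)

    title = get_book_title(html)
    stock = get_stock(html)
    rating = get_rating(html)

    # The book page contains one table (UPC, prices, tax, ...)
    table = pd.read_html(StringIO(response.text))[0]

    # Use the description column (column 0) as index
    table.set_index(0, inplace=True)

    table.loc["Title"] = title
    table.loc["Rating"] = rating
    table.loc["Stock"] = stock

    return table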

#### Solution END

Please run the following code to get full credit:

In [81]:
get_book_info("https://books.toscrape.com/catalogue/salt_731/index.html")

Unnamed: 0_level_0,1
0,Unnamed: 1_level_1
UPC,86cbddb61ea78bb7
Product Type,Books
Price (excl. tax),£46.78
Price (incl. tax),£46.78
Tax,£0.00
Availability,In stock (14 available)
Number of reviews,0
Title,salt.
Rating,4
Stock,14


#### Example

In [82]:
get_book_info("https://books.toscrape.com/catalogue/unicorn-tracks_951/index.html")

Unnamed: 0_level_0,1
0,Unnamed: 1_level_1
UPC,7ae099f3898e0209
Product Type,Books
Price (excl. tax),£18.78
Price (incl. tax),£18.78
Tax,£0.00
Availability,In stock (16 available)
Number of reviews,0
Title,Unicorn Tracks
Rating,3
Stock,16


## Exercise 3 [5 Points]

### 3a) [1 Points]

#### Task

Write a function `get_all_books` (no arguments) that does the following:
- Gets all categories using a function from Exercise 1 and does for all categories `c` the following:
- Applies the function `get_all_books_of_category` to `c`.
- For each link to one book `l`, it applies the function `get_book_info` and adds one more line to the returned DataFrame consisting of ['Category': `c`].
- Concatenates all such Dataframes to one single DataFrame.

Afterwards, create a DataFrame `df` that is the return of the function `get_all_books`. Consider using the time module if necessary. 
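
The combination step can be sketched on toy data. Since each per-book DataFrame has the descriptions as index and a single value column, one possible approach (an assumption, not the only valid one) is to transpose each of them so that a book becomes one row, and then concatenate the rows. `fake_book_info` is a hypothetical stand-in for `get_book_info`.

```python
import pandas as pd


def fake_book_info(title, stock):
    """Hypothetical stand-in for get_book_info: descriptions as index, one value column."""
    return pd.DataFrame([["Title", title], ["Stock", stock]]).set_index(0)


frames = []
for title, stock, category in [("Book A", 3, "Travel"), ("Book B", 7, "Art")]:
    info = fake_book_info(title, stock)
    info.loc["Category"] = category
    frames.append(info.T)  # transpose: descriptions become columns, the book one row

df_demo = pd.concat(frames, ignore_index=True)
print(df_demo)
```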

#### Solution START

In [None]:
# single thread
def get_all_books():
    categories_dict = get_categories()
    all_dfs = []
    for category in categories_dict.keys():
        print(f"Processing category: {category}")
        book_links = get_all_books_of_category(category)
        if not book_links:
            continue
        for link in book_links:
            # Book links are relative to the category page ('../../../<book>/index.html')
            book_url = "https://books.toscrape.com/catalogue/" + link.replace('../../../', '')
            book_df = get_book_info(book_url)
            book_df.loc["Category"] = category
            # Transpose so that each book becomes one row
            all_dfs.append(book_df.T)
    final_df = pd.concat(all_dfs, ignore_index=True)
    return final_df

In [84]:
# multithreading
from concurrent.futures import ThreadPoolExecutor, as_completed

def get_all_books_multi():
    categories_dict = get_categories()
    all_task = []
    for category in categories_dict.keys():
        print(f"Processing category: {category}")
        book_links = get_all_books_of_category(category)
        if not book_links:
            continue
        for link in book_links:
            book_url = urljoin("https://books.toscrape.com/catalogue/", link.replace('../../../', ''))
            all_task.append((book_url, category))
    print(f"Preparing to download {len(all_task)} books")

    all_dfs = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        future_to_info = {executor.submit(get_book_info, url): (url, category) for url, category in all_task}

        for future in as_completed(future_to_info):
            url, category = future_to_info[future]
            try:
                book_df = future.result()
                book_df.loc["Category"] = category
                all_dfs.append(book_df.T)
            except Exception as e:
                print(e)
    final_df = pd.concat(all_dfs, ignore_index=True)
    return final_df

In [85]:
df=get_all_books_multi()
df

Processing category: Travel
Travel has 1 pages.
Processing category: Mystery
Mystery has 2 pages.
Processing category: Historical Fiction
Historical Fiction has 2 pages.
Processing category: Sequential Art
Sequential Art has 4 pages.
Processing category: Classics
Classics has 1 pages.
Processing category: Philosophy
Philosophy has 1 pages.
Processing category: Romance
Romance has 2 pages.
Processing category: Womens Fiction
Womens Fiction has 1 pages.
Processing category: Fiction
Fiction has 4 pages.
Processing category: Childrens
Childrens has 2 pages.
Processing category: Religion
Religion has 1 pages.
Processing category: Nonfiction
Nonfiction has 6 pages.
Processing category: Music
Music has 1 pages.
Processing category: Default
Default has 8 pages.
Processing category: Science Fiction
Science Fiction has 1 pages.
Processing category: Sports and Games
Sports and Games has 1 pages.
Processing category: Add a comment
Add a comment has 4 pages.
Processing category: Fantasy
Fantasy has

Unnamed: 0,UPC,Product Type,Price (excl. tax),Price (incl. tax),Tax,Availability,Number of reviews,Title,Rating,Stock,Category
0,a94350ee74deaa07,Books,£37.33,£37.33,£0.00,In stock (7 available),0,Under the Tuscan Sun,3,7,Travel
1,ce60436f52c5ee68,Books,£49.43,£49.43,£0.00,In stock (15 available),0,Full Moon over Noah’s Ark: An Odyssey to Mou...,4,15,Travel
2,cc1936a9f4e93477,Books,£44.34,£44.34,£0.00,In stock (7 available),0,A Summer In Europe,2,7,Travel
3,1809259a5a5f1d8d,Books,£36.94,£36.94,£0.00,In stock (8 available),0,Vagabonding: An Uncommon Guide to the Art of L...,2,8,Travel
4,f9705c362f070608,Books,£48.87,£48.87,£0.00,In stock (14 available),0,See America: A Celebration of Our National Par...,3,14,Travel
...,...,...,...,...,...,...,...,...,...,...,...
995,2b5054a4192e9b06,Books,£52.65,£52.65,£0.00,In stock (14 available),0,Why the Right Went Wrong: Conservatism--From G...,4,14,Politics
996,3968e3fbf4695d7c,Books,£56.86,£56.86,£0.00,In stock (12 available),0,Equal Is Unfair: America's Misguided Fight Aga...,1,12,Politics
997,bb8245f52c7cce8f,Books,£36.58,£36.58,£0.00,In stock (15 available),0,Amid the Chaos,1,15,Cultural
998,88c21fcd38e2486e,Books,£19.19,£19.19,£0.00,In stock (15 available),0,Dark Notes,5,15,Erotica


#### Solution END

### 3b) [1 Points]

#### Task

Use the DataFrame `df` to determine how many books the page `books.toscrape.com` has. Add a new column `Price` to the DataFrame that contains the Price (incl. tax) as float (without the currency).
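
One way to strip the currency symbol is a regular expression applied to the whole column. The sketch below demonstrates this on a hand-made Series; the sample values are assumptions chosen for illustration.

```python
import pandas as pd

# Hand-made sample of price strings
prices_text = pd.Series(["£37.33", "£49.43", "£0.00"])

# expand=False returns a Series (not a one-column DataFrame) for a single group
prices = prices_text.str.extract(r"(\d+\.\d+)", expand=False).astype(float)

print(prices.tolist())
```

Because the pattern only matches the digits and the decimal point, it ignores whatever characters precede the number.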

#### Solution START

In [86]:
total_books = len(df)
print(f"The website has {total_books} books")

# expand=False makes str.extract return a Series instead of a one-column DataFrame
df['Price'] = df['Price (incl. tax)'].str.extract(r'(\d+\.\d+)', expand=False).astype(float)
print(df.head())

The website has 1000 books
0               UPC Product Type Price (excl. tax) Price (incl. tax)     Tax  \
0  a94350ee74deaa07        Books            £37.33            £37.33   £0.00   
1  ce60436f52c5ee68        Books            £49.43            £49.43   £0.00   
2  cc1936a9f4e93477        Books            £44.34            £44.34   £0.00   
3  1809259a5a5f1d8d        Books            £36.94            £36.94   £0.00   
4  f9705c362f070608        Books            £48.87            £48.87   £0.00   

0             Availability Number of reviews  \
0   In stock (7 available)                 0   
1  In stock (15 available)                 0   
2   In stock (7 available)                 0   
3   In stock (8 available)                 0   
4  In stock (14 available)                 0   

0                                              Title Rating Stock Category  \
0                               Under the Tuscan Sun      3     7   Travel   
1  Full Moon over Noah’s Ark: An Odyssey to M

#### Solution END

Please run the following code to get full credit:

In [87]:
df.head(10)

Unnamed: 0,UPC,Product Type,Price (excl. tax),Price (incl. tax),Tax,Availability,Number of reviews,Title,Rating,Stock,Category,Price
0,a94350ee74deaa07,Books,£37.33,£37.33,£0.00,In stock (7 available),0,Under the Tuscan Sun,3,7,Travel,37.33
1,ce60436f52c5ee68,Books,£49.43,£49.43,£0.00,In stock (15 available),0,Full Moon over Noah’s Ark: An Odyssey to Mou...,4,15,Travel,49.43
2,cc1936a9f4e93477,Books,£44.34,£44.34,£0.00,In stock (7 available),0,A Summer In Europe,2,7,Travel,44.34
3,1809259a5a5f1d8d,Books,£36.94,£36.94,£0.00,In stock (8 available),0,Vagabonding: An Uncommon Guide to the Art of L...,2,8,Travel,36.94
4,f9705c362f070608,Books,£48.87,£48.87,£0.00,In stock (14 available),0,See America: A Celebration of Our National Par...,3,14,Travel,48.87
5,48736df57e7bec9f,Books,£30.54,£30.54,£0.00,In stock (6 available),0,The Great Railway Bazaar,1,6,Travel,30.54
6,366a236aa1ea6f07,Books,£23.21,£23.21,£0.00,In stock (3 available),0,The Road to Little Dribbling: Adventures of an...,1,3,Travel,23.21
7,9e60929f521fa280,Books,£56.88,£56.88,£0.00,In stock (6 available),0,A Year in Provence (Provence #1),4,6,Travel,56.88
8,a22124811bfa8350,Books,£45.17,£45.17,£0.00,In stock (19 available),0,It's Only the Himalayas,2,19,Travel,45.17
9,747cf7fca2ccdbd4,Books,£38.95,£38.95,£0.00,In stock (3 available),0,Neither Here nor There: Travels in Europe,3,3,Travel,38.95


#### Example

In [88]:
df.head(5)

Unnamed: 0,UPC,Product Type,Price (excl. tax),Price (incl. tax),Tax,Availability,Number of reviews,Title,Rating,Stock,Category,Price
0,a94350ee74deaa07,Books,£37.33,£37.33,£0.00,In stock (7 available),0,Under the Tuscan Sun,3,7,Travel,37.33
1,ce60436f52c5ee68,Books,£49.43,£49.43,£0.00,In stock (15 available),0,Full Moon over Noah’s Ark: An Odyssey to Mou...,4,15,Travel,49.43
2,cc1936a9f4e93477,Books,£44.34,£44.34,£0.00,In stock (7 available),0,A Summer In Europe,2,7,Travel,44.34
3,1809259a5a5f1d8d,Books,£36.94,£36.94,£0.00,In stock (8 available),0,Vagabonding: An Uncommon Guide to the Art of L...,2,8,Travel,36.94
4,f9705c362f070608,Books,£48.87,£48.87,£0.00,In stock (14 available),0,See America: A Celebration of Our National Par...,3,14,Travel,48.87


### 3c) [3 Points]

#### Task

Group the DataFrame `df` of the last task by category and do the following:
- Provide a pandas Series `books_per_category` that lists all categories (as keys) and the number of books of this category (as value).
- Provide a DataFrame `avg_price_per_category` that reports the average book price per category.
- Provide another DataFrame `books_to_order` that contains the $20$ most expensive books whose
  1. stock is less than $10$ AND
  2. rating is at least three stars
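
The three steps can be tried on a toy DataFrame first. The data and the `toy_*` names below are invented for illustration; only the column names mirror those of `df`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Category": ["Art", "Art", "Travel", "Travel"],
    "Price": [10.0, 30.0, 50.0, 20.0],
    "Stock": [3, 12, 5, 8],
    "Rating": [4, 5, 2, 3],
})

# Number of books per category (a Series indexed by category)
toy_counts = toy.groupby("Category").size()

# Average price per category; the double brackets keep the result a DataFrame
toy_avg = toy.groupby("Category")[["Price"]].mean()

# Both conditions must hold, so combine the boolean masks with &
toy_filtered = toy[(toy["Stock"] < 10) & (toy["Rating"] >= 3)]
toy_top = toy_filtered.sort_values(by="Price", ascending=False).head(2)

print(toy_counts)
print(toy_avg)
print(toy_top)
```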

#### Solution START

In [89]:
books_per_category = df.groupby('Category').size()
avg_price_per_category = df.groupby('Category')[['Price']].mean()

books_to_order = df[
    (df['Stock']<10) & (df['Rating']>= 3)
].sort_values(by='Price', ascending=False).head(20)

#### Solution END

Please run the following code to get full credit:

In [90]:
avg_price_per_category

Unnamed: 0_level_0,Price
Category,Unnamed: 1_level_1
Academic,13.12
Add a comment,35.796418
Adult Fiction,15.36
Art,38.52
Autobiography,37.053333
Biography,33.662
Business,32.46
Childrens,32.638276
Christian,42.496667
Christian Fiction,34.385


In [91]:
type(avg_price_per_category)

pandas.DataFrame

In [92]:
books_per_category

Category
Academic                1
Add a comment          67
Adult Fiction           1
Art                     8
Autobiography           9
Biography               5
Business               12
Childrens              29
Christian               3
Christian Fiction       6
Classics               19
Contemporary            3
Crime                   1
Cultural                1
Default               152
Erotica                 1
Fantasy                48
Fiction                65
Food and Drink         30
Health                  4
Historical              2
Historical Fiction     26
History                18
Horror                 17
Humor                  10
Music                  13
Mystery                32
New Adult               6
Nonfiction            110
Novels                  1
Paranormal              1
Parenting               1
Philosophy             11
Poetry                 19
Politics                3
Psychology              7
Religion                7
Romance                35
Sci

In [93]:
books_to_order.shape

(20, 12)

In [94]:
books_to_order

Unnamed: 0,UPC,Product Type,Price (excl. tax),Price (incl. tax),Tax,Availability,Number of reviews,Title,Rating,Stock,Category,Price
191,9cc207168a03470d,Books,£59.99,£59.99,£0.00,In stock (4 available),0,The Perfect Play (Play by Play #1),3,4,Romance,59.99
270,07e6810fd3236bda,Books,£59.98,£59.98,£0.00,In stock (5 available),0,Last One Home (New Beginnings #1),3,5,Fiction,59.98
925,6478ccb4416e6a5d,Books,£59.92,£59.92,£0.00,In stock (6 available),0,The Barefoot Contessa Cookbook,5,6,Food and Drink,59.92
964,9c4d061c1e2fe6bf,Books,£59.71,£59.71,£0.00,In stock (4 available),0,The Bone Hunters (Lexy Vaughan & Steven Macaul...,3,4,Thriller,59.71
392,60376aa71be66083,Books,£59.45,£59.45,£0.00,In stock (6 available),0,The Man Who Mistook His Wife for a Hat and Oth...,4,6,Nonfiction,59.45
854,c53d9fefcda371e9,Books,£59.04,£59.04,£0.00,In stock (3 available),0,Life Without a Recipe,5,3,Autobiography,59.04
203,6e712ea24e77bd96,Books,£58.99,£58.99,£0.00,In stock (1 available),0,Listen to Me (Fusion #1),3,1,Romance,58.99
512,4fd0a2a350f016e6,Books,£58.87,£58.87,£0.00,In stock (9 available),0,Unlimited Intuition Now,4,9,Default,58.87
850,612369a5947a012e,Books,£58.81,£58.81,£0.00,In stock (5 available),0,Approval Junkie: Adventures in Caring Too Much,5,5,Autobiography,58.81
737,63e20a0f98218a87,Books,£58.75,£58.75,£0.00,In stock (1 available),0,Myriad (Prentor #1),4,1,Fantasy,58.75


#### Example

In [95]:
avg_price_per_category.head()

Unnamed: 0_level_0,Price
Category,Unnamed: 1_level_1
Academic,13.12
Add a comment,35.796418
Adult Fiction,15.36
Art,38.52
Autobiography,37.053333


In [96]:
books_per_category.head()

Category
Academic          1
Add a comment    67
Adult Fiction     1
Art               8
Autobiography     9
dtype: int64

In [97]:
books_to_order.head()

Unnamed: 0,UPC,Product Type,Price (excl. tax),Price (incl. tax),Tax,Availability,Number of reviews,Title,Rating,Stock,Category,Price
191,9cc207168a03470d,Books,£59.99,£59.99,£0.00,In stock (4 available),0,The Perfect Play (Play by Play #1),3,4,Romance,59.99
270,07e6810fd3236bda,Books,£59.98,£59.98,£0.00,In stock (5 available),0,Last One Home (New Beginnings #1),3,5,Fiction,59.98
925,6478ccb4416e6a5d,Books,£59.92,£59.92,£0.00,In stock (6 available),0,The Barefoot Contessa Cookbook,5,6,Food and Drink,59.92
964,9c4d061c1e2fe6bf,Books,£59.71,£59.71,£0.00,In stock (4 available),0,The Bone Hunters (Lexy Vaughan & Steven Macaul...,3,4,Thriller,59.71
392,60376aa71be66083,Books,£59.45,£59.45,£0.00,In stock (6 available),0,The Man Who Mistook His Wife for a Hat and Oth...,4,6,Nonfiction,59.45
