# Web Scraping
* Much of the information on the internet is presented on web pages meant purely for human consumption.
* This can be either intentional or unintentional.
* Even though websites appear to be highly unstructured, the underlying language, HTML, cannot be: it is built from nested, labeled tags.
    * We can take advantage of this fact to generate structured datasets from the "unstructured" website.
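
As a minimal sketch of that idea, the snippet below pulls every `<h3>` heading out of a tiny hand-written page. It uses only the standard library's `html.parser` rather than the Beautiful Soup library used in the rest of this notebook, and the page content and the `TitleCollector` class are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text found inside every <h3> element."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self.in_h3 = True

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_h3 = False

    def handle_data(self, data):
        # Only keep text that appears between <h3> and </h3>.
        if self.in_h3 and data.strip():
            self.titles.append(data.strip())


# Hypothetical page, written by hand so the example runs offline.
page = ("<html><body><h1>Shop</h1><h3>First Book</h3>"
        "<p>Some text</p><h3>Second Book</h3></body></html>")

collector = TitleCollector()
collector.feed(page)
print(collector.titles)
```

The tags act as labels, so the headings can be picked out without guessing where they sit in the text.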
    
## Legal Disclaimer
* The legality of web scraping is a subject of debate and varies on a case-by-case basis. We need to be cognizant of this when engaging in such practices:
    * http://www.prowebscraper.com/blog/six-compelling-facts-about-legality-of-web-scraping/
    
* Be sure to consult the website's terms of service (ToS) to check whether scraping it would violate them.
    
* If it's allowed, a general rule of thumb is to not disrupt or damage the service that is freely providing the information. In other words, if you decide to scrape a website, don't send more requests than would be reasonable for a human browsing it.
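
Alongside the ToS, many sites publish a robots.txt file that states which paths automated clients may visit. The sketch below checks URLs against such a file with the standard library's `urllib.robotparser`. The robots.txt content and the example.com URLs are hypothetical, written inline so the example runs offline, and the fallback delay of 3 seconds is an assumption rather than a rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent='*'):
    """Return True if the robots.txt rules permit fetching this URL."""
    return parser.can_fetch(user_agent, url)


# Use the delay the site asks for, or fall back to an assumed 3 seconds.
delay = parser.crawl_delay('*') or 3

print(allowed('http://example.com/catalogue/page-1.html'))
print(allowed('http://example.com/private/data.html'))
print(delay)
```

For a real site, `parser.set_url(...)` followed by `parser.read()` would download the file instead of parsing an inline string.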
    
## Overview

* We use both the requests library (seen previously) and a library called "Beautiful Soup", which is commonly used for web scraping in Python.

* We will introduce the basic concepts on a "scraping sandbox" provided by a web scraping service: http://toscrape.com/

* More specifically, we will focus on the "book store example": http://books.toscrape.com/

# The Made-Up Bookstore

* Suppose we want to create an algorithm for selecting the next book we want to buy and read.
    * Let's keep it simple, like stars per dollar!
* There's no API for this bookstore, so we'd need to go through it and manually enter the data into a spreadsheet.
* Alternatively, we can have the computer take care of this tedious work. We just need to be clever about it!

## Get the data

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time

In [None]:
content = requests.get('http://books.toscrape.com/')

In [None]:
print(content.content)

## Make sense of the data

* This is clearly a mess, so how do we make sense of it?
* The answer is Beautiful Soup!
    * It's an HTML parsing library that's pretty easy to use.

In [None]:
soup = BeautifulSoup(content.content, 'html.parser') # name the parser explicitly to avoid a warning

## Grabbing the elements we care about

In [None]:
pods = soup.find_all('article',class_="product_pod")

In [None]:
print(pods[0])

In [None]:
# Okay now we can find the title
pods[0].find('h3').find('a').get('title')


In [None]:
# Okay now we can find ratings
pods[0].find('p',class_='star-rating').get('class')[1]

In [None]:
# bingo we got the price!
pods[0].find('p',class_="price_color").text
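
To experiment with these lookups without touching the network, the same calls can be run on a small hand-written snippet. The HTML below is a hypothetical stand-in that mirrors only the tags and classes used above; the title, link, and price are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical product markup, modeled on the lookups used in this notebook.
html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="example.html" title="Example Book">Example B...</a></h3>
  <p class="price_color">£10.00</p>
</article>
"""

pod = BeautifulSoup(html, 'html.parser').find('article', class_='product_pod')

# The full title lives in the link's title attribute, not its visible text.
title = pod.find('h3').find('a').get('title')

# 'class' is multi-valued, so get() returns a list; the rating is the second entry.
rating = pod.find('p', class_='star-rating').get('class')[1]

# The price is plain text with a leading currency symbol.
price = pod.find('p', class_='price_color').text

print(title, rating, price)
```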

## Getting more than one Book

In [None]:
def switch_stars(stars):
    # Map the rating word from the CSS class (e.g. 'Three') to an integer.
    # Anything unrecognized maps to 0.
    ratings = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}

    return ratings.get(stars.lower().strip(), 0)

In [None]:
def parse_pods(pods):

    data = {'titles':[],
            'stars':[],
            'price':[]}

    for pod in pods:

        title = pod.find('h3').find('a').get('title')

        data['titles'].append(title)

        stars = switch_stars(pod.find('p',class_='star-rating').get('class')[1])

        data['stars'].append(stars)

        price = float(pod.find('p',class_="price_color").text[1:])

        data['price'].append(price)
        
    df = pd.DataFrame(data=data)
    
    return df
    

In [None]:
df = parse_pods(pods)
display(df)

## Adding more pages
* We're going to use a pretty sloppy hack to grab the next pages.
* Can you see how the URL changes as we switch pages?


In [None]:
dataframes = []

for i in range(1,11):
    content = requests.get(f'http://books.toscrape.com/catalogue/page-{i}.html', timeout=10)
    content.raise_for_status() # stop early if a page fails to load
    soup = BeautifulSoup(content.content, 'html.parser')
    pods = soup.find_all('article',class_="product_pod")
    
    dataframes.append(parse_pods(pods))

    time.sleep(3) # pause between requests so we're not jerks.
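
The "hack" relies on the page number being the only part of the URL that changes. As a small sketch, the URL list can be built and inspected on its own before any request is sent. The helper name `page_urls` and the constant `BASE_URL` are made up for this example.

```python
# URL pattern used by the loop above; only the page number varies.
BASE_URL = 'http://books.toscrape.com/catalogue/page-{}.html'


def page_urls(first, last):
    """Return the catalogue URLs for pages first..last (inclusive)."""
    return [BASE_URL.format(i) for i in range(first, last + 1)]


urls = page_urls(1, 10)
print(urls[0])
print(urls[-1])
print(len(urls))
```

Keeping URL generation separate from fetching makes it easy to check the pattern offline and to change the page range later.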

## Making a decision

In [None]:
df = pd.concat(dataframes,ignore_index=True)
display(df)

In [None]:
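# Lower price per star means better value (the inverse of "stars per dollar").
# Books whose rating could not be parsed have 0 stars, so their price per star is infinite and they sort last.
df['price_per_star'] = df['price']/df['stars']
df.sort_values(by = 'price_per_star',ascending=True)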
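
Sorting by ascending price per star gives the same ranking as sorting by descending stars per dollar, since one is the reciprocal of the other. The plain-Python sketch below checks this on a few made-up books (the titles, prices, and ratings are hypothetical). Unlike pandas, plain Python raises an error on division by zero, so unrated books are filtered out first.

```python
# Made-up sample data for illustration only.
books = [
    {'title': 'Book A', 'price': 20.0, 'stars': 4},
    {'title': 'Book B', 'price': 10.0, 'stars': 1},
    {'title': 'Book C', 'price': 15.0, 'stars': 5},
    {'title': 'Book D', 'price': 12.0, 'stars': 0},  # rating unknown
]

# Drop books with 0 stars to avoid dividing by zero.
rated = [b for b in books if b['stars'] > 0]

by_price_per_star = sorted(rated, key=lambda b: b['price'] / b['stars'])
by_stars_per_dollar = sorted(rated, key=lambda b: b['stars'] / b['price'],
                             reverse=True)

order_a = [b['title'] for b in by_price_per_star]
order_b = [b['title'] for b in by_stars_per_dollar]
print(order_a)
print(order_b)
```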

## Challenges

* Can you grab the authors? 
* Can you grab the genres? 
* Can you get the next pages from the hyperlinks?
* Can you get the "ISBN"?
* Can you get the description?
