## Introduction

We're going to run through an "Example Project" today. I'll walk through the basics of a web scraping project over the next couple of hours, and you'll then be able to use it as a template for your own project. In addition to what we cover tonight, please try to incorporate material we've covered in past lectures (Plotly visualizations, for example).

Following that we're going to touch on classification and what it is. If we have any time remaining, we're going to start work on our projects.

## Libraries Used

In [2]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

import plotly.offline as p
import plotly.graph_objs as go

p.init_notebook_mode(connected=True)
from IPython.display import Image

## Project Rubric

In [None]:
Image('/Users/MattMecca/Documents/Work-related material/Flatiron School/Random Notes and Documents/grading-rubric.png')

## Example Project

Here we've got a website (**https://www.humblebundle.com/books/big-data-books?hmb_source=navbar&hmb_medium=product_tile&hmb_campaign=tile_index_4**) and we want to scrape specific data from the HTML that the website's server sends back to us.

"**Humble Bundle** is a distribution platform selling games, ebooks, software, and other digital content. Since Humble's founding in 2010, our mission has been to support charity ("Humble") while providing awesome content to customers at great prices ("Bundle"). We started by offering only game bundles, but have branched out to include an online storefront, a monthly subscription service, a publishing initiative, and lots more.

The core of our bundle "philosophy" is flexible pricing. When you buy a bundle, you can choose the price you want to pay. You can even choose how your money is divided – between the creators, charity, Humble Partners, and Humble Bundle."

**Humble Bundle** has "bundle" deals at different price tiers (1 dollar or more, 8 dollars or more, or 15 dollars or more). Each successive "bundle" includes everything in the "bundles" that preceded it. Our analysis will mainly involve these bundles.

## Making our HTTP GET Request

In [3]:
url = "https://www.humblebundle.com/books/big-data-books"
    

In [4]:
response = requests.get(url)

In [4]:
response.text # The actual HTML document that we've gotten
# back from humblebundle's server



## Parsing our Text using BeautifulSoup

In [5]:
beautsoup_object = BeautifulSoup(response.text, 'html.parser')
# Tells BeautifulSoup to use 'html.parser', Python's built-in HTML parser

In [None]:
print(beautsoup_object)

## Quick Question

How can we make this "prettier"?

In [None]:
print(beautsoup_object.prettify()[:4000])


## Navigation Reminders

What does this do?

In [1]:
beautsoup_object.a 

NameError: name 'beautsoup_object' is not defined

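**.a** returns the first < a >, or **hyperlink**, tag in the document. (The NameError above only means this cell was run before **beautsoup_object** was created; run the parsing cell first.)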

In [None]:
beautsoup_object.p # 'p' stands for <p> or PARAGRAPH tags

In [51]:
print(beautsoup_object.title)

<title>Humble Book Bundle: Big Data by Packt (pay what you want and help charity)</title>


In [52]:
print(beautsoup_object.title.string)

Humble Book Bundle: Big Data by Packt (pay what you want and help charity)


In [54]:
print(beautsoup_object.title.parent.name)


head


## Let's "Inspect" the Relevant Elements of our HTML  

The elements we want seem to be *heading 2* (h2) elements of **class** "dd-header-headline".

* **div** is an HTML element that groups other elements of the page together. 

* **class** is an attribute. All HTML elements can carry a **class** attribute, and if an element has one, you can write code that selects it by that **class**. **Classes** are usually used to apply some sort of *style* to the webpage we're viewing, which makes them convenient ***labels*** to search our document with.

In [6]:
beautsoup_object.find_all("dd-header-headline")

[]

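No bueno. Given a plain string, **find_all()** looks for *tags* with that name, and there is no tag called "dd-header-headline"; that is a class. What if we used the **select()** method?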

In [46]:
beautsoup_object.select(".dd-header-headline")

[<h2 class="dd-header-headline">
     Pay $1 or more!
   </h2>, <h2 class="dd-header-headline">
     Pay $8 or more to also unlock!
   </h2>, <h2 class="dd-header-headline">
     Pay $15 or more to also unlock!
   </h2>, <h2 class="dd-header-headline">
     Support Charity
   </h2>, <h2 class="dd-header-headline">
 </h2>]

The leading dot is CSS selector syntax: it tells **select()** to match on the **class** rather than on the tag name. We can also combine a tag name with a class, so this would work as well:

In [47]:
beautsoup_object.select("h2.dd-header-headline")

[<h2 class="dd-header-headline">
     Pay $1 or more!
   </h2>, <h2 class="dd-header-headline">
     Pay $8 or more to also unlock!
   </h2>, <h2 class="dd-header-headline">
     Pay $15 or more to also unlock!
   </h2>, <h2 class="dd-header-headline">
     Support Charity
   </h2>, <h2 class="dd-header-headline">
 </h2>]

In [48]:
bundle_names = beautsoup_object.select(".dd-header-headline")
type(bundle_names)

list

**select** finds multiple instances and returns a list, whereas **find** (NOT **find_all**) finds the first, so they don't do the same thing. **select_one** would be the equivalent to **find**.

If you want to search for tags that match two or more CSS classes, you should use a CSS **selector**.
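To make the differences between these methods concrete, here is a minimal sketch run against a small hand-written HTML snippet. The snippet and the names `sample_html` and `soup` are made up for illustration; they are not taken from the Humble Bundle page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely modeled on the headline markup above
sample_html = """
<div class="dd-header">
  <h2 class="dd-header-headline big">Pay $1 or more!</h2>
  <h2 class="dd-header-headline">Pay $8 or more to also unlock!</h2>
  <p class="dd-header-headline">Not a heading</p>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector and returns every match
all_by_class = soup.select(".dd-header-headline")      # h2, h2 and p
h2_only = soup.select("h2.dd-header-headline")         # only the two h2 tags

# select_one() and find() both return just the first match
first_css = soup.select_one("h2.dd-header-headline")
first_find = soup.find("h2", class_="dd-header-headline")

# A CSS selector can require two classes at once
two_classes = soup.select("h2.dd-header-headline.big")

# find_all() with a bare string searches for a TAG of that name
no_such_tag = soup.find_all("dd-header-headline")
```

Here `select_one` and `find` land on the same first h2, while `find_all` with only the class name as its argument comes back empty because no tag carries that name.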

## Question

How do you think we could get it to work using the **.find_all()** method? ***Hint***:

In [None]:
?BeautifulSoup.find_all

In [None]:
beautsoup_object.find_all('h2', class_ = "dd-header-headline")

### Slicing and Dicing

In [49]:
bundle_names[0]

<h2 class="dd-header-headline">
    Pay $1 or more!
  </h2>

In [50]:
bundle_names[0].text # Gives us the text without the tags around
                   # it

'\n    Pay $1 or more!\n  '

What if we want to get rid of the white space, though?

In [51]:
bundle_names[0].text.strip() # Much better

'Pay $1 or more!'

In [52]:
for bundle in bundle_names:
    print(bundle.text.strip())

Pay $1 or more!
Pay $8 or more to also unlock!
Pay $15 or more to also unlock!
Support Charity



In [53]:
# Could write it like this in order to get rid of 'Support Charity'

for bundle in bundle_names[0:3]:
    print(bundle.text.strip())

Pay $1 or more!
Pay $8 or more to also unlock!
Pay $15 or more to also unlock!


What we want from each one of these bundles is the following:

* The **bundle name** and the **bundle price**
    * The **products** of each **bundle** 

With **list comprehensions** we build a **list** from some iterable object (which does not need to be a **list** itself).

The first part is the **output**, or what we do to each element of the iterable; the second part is the iteration itself. See:

* list_comprehension = [*output* **for** *elem* **in** *iterable*]
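As a quick illustration of that pattern, here is a small sketch on made-up strings (the list `raw_headlines` is hypothetical, not scraped data). It also shows the optional trailing `if`, which filters elements and which we will rely on later when extracting prices.

```python
# Hypothetical headline strings with stray whitespace and one blank entry
raw_headlines = ["  Pay $1 or more!\n", "\n Support Charity ", "   "]

# output = headline.strip(), iterable = raw_headlines
cleaned = [headline.strip() for headline in raw_headlines]

# A trailing `if` keeps only the elements that pass a test
non_empty = [headline.strip() for headline in raw_headlines if headline.strip()]

# The iterable does not have to be a list; here it is a range
squares = [n ** 2 for n in range(4)]
```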

You could write the following **for** loop as the **list comprehension** that follows:

In [54]:
stripped_bundles = []
for bundle in bundle_names[0:3]:
    stripped_bundles.append(bundle.text.strip())
    
stripped_bundles

['Pay $1 or more!',
 'Pay $8 or more to also unlock!',
 'Pay $15 or more to also unlock!']

In [55]:
stripped_bundles = [bundle.text.strip() for bundle in bundle_names[0:3]]
stripped_bundles

['Pay $1 or more!',
 'Pay $8 or more to also unlock!',
 'Pay $15 or more to also unlock!']

#### You could condense this code even further with the following:

In [56]:
stripped_bundles = [bundle.text.strip() for bundle in beautsoup_object.select("h2.dd-header-headline")]
stripped_bundles

['Pay $1 or more!',
 'Pay $8 or more to also unlock!',
 'Pay $15 or more to also unlock!',
 'Support Charity',
 '']

In [57]:
type(stripped_bundles)

list

## Getting the Product Names

### Quick Question

Here's the url: https://www.humblebundle.com/books/big-data-books?hmb_source=navbar&hmb_medium=product_tile&hmb_campaign=tile_index_4. How could we look at the page's HTML?

In [11]:
product_url = 'https://www.humblebundle.com/books/big-data-books?hmb_source=navbar&hmb_medium=product_tile&hmb_campaign=tile_index_4'

In [15]:
response02 = requests.get(product_url)
    

In [25]:
beautsoup_obj02 = BeautifulSoup(response02.text, 'html.parser')

In [28]:
beautsoup_obj02.select(".dd-image-box-caption")

[<div class="dd-image-box-caption dd-image-box-text dd-image-box-white ">
 <i class="hb hb-lock dd-caption-lock"></i>
     
     Mastering Apache Spark 2.x
   </div>,
 <div class="dd-image-box-caption dd-image-box-text dd-image-box-white ">
 <i class="hb hb-lock dd-caption-lock"></i>
     
     Splunk Essentials
   </div>,
 <div class="dd-image-box-caption dd-image-box-text dd-image-box-white ">
 <i class="hb hb-lock dd-caption-lock"></i>
     
     MongoDB Cookbook
   </div>,
 <div class="dd-image-box-caption dd-image-box-text dd-image-box-white ">
 <i class="hb hb-lock dd-caption-lock"></i>
     
     Getting Started with Hadoop 2.x
   </div>,
 <div class="dd-image-box-caption dd-image-box-text dd-image-box-white ">
 <i class="hb hb-lock dd-caption-lock"></i>
     
     Learning ElasticSearch 5.0
   </div>,
 <div class="dd-image-box-caption dd-image-box-text dd-image-box-white ">
 <i class="hb hb-lock dd-caption-lock"></i>
 <span data-sheets-userformat='{"2":15235,"3":{"1":0},"4":[nu

In [32]:
product_name = beautsoup_obj02.select(".dd-image-box-caption")
type(product_name)

list

In [34]:
product_name[0].text.strip()

'Mastering Apache Spark 2.x'

In [35]:
stripped_product = []
for product in product_name:
    stripped_product.append(product.text.strip())

In [36]:
stripped_product

['Mastering Apache Spark 2.x',
 'Splunk Essentials',
 'MongoDB Cookbook',
 'Getting Started with Hadoop 2.x',
 'Learning ElasticSearch 5.0',
 'Three Months of Mapt Pro for $30 Coupon',
 'Modern Big Data Processing with Hadoop',
 'Apache Hive Essentials',
 'Learning Elastic Stack 6.0',
 'Learning Hadoop 2',
 'Apache Spark with Scala',
 'Working with Big Data in Python',
 'Statistics for Data Science',
 'Python Data Analysis',
 'Learning R for Data Visualization',
 'Big Data Analytics with Hadoop 3',
 'Mastering MongoDB 3.x',
 'Artificial Intelligence for Big Data',
 "Big Data Architect's Handbook",
 'Hadoop Real-World Solutions Cookbook',
 'Build scalable applications with Apache Kafka',
 'Learning Apache Cassandra',
 'Data Science Algorithms in a Week',
 'Python Data Science Essentials',
 'Mastering Tableau 10',
 'Java for Data Science']

In [39]:
type(stripped_product)

list

In [41]:
stripped_product[:25]

['Mastering Apache Spark 2.x',
 'Splunk Essentials',
 'MongoDB Cookbook',
 'Getting Started with Hadoop 2.x',
 'Learning ElasticSearch 5.0',
 'Three Months of Mapt Pro for $30 Coupon',
 'Modern Big Data Processing with Hadoop',
 'Apache Hive Essentials',
 'Learning Elastic Stack 6.0',
 'Learning Hadoop 2',
 'Apache Spark with Scala',
 'Working with Big Data in Python',
 'Statistics for Data Science',
 'Python Data Analysis',
 'Learning R for Data Visualization',
 'Big Data Analytics with Hadoop 3',
 'Mastering MongoDB 3.x',
 'Artificial Intelligence for Big Data',
 "Big Data Architect's Handbook",
 'Hadoop Real-World Solutions Cookbook',
 'Build scalable applications with Apache Kafka',
 'Learning Apache Cassandra',
 'Data Science Algorithms in a Week',
 'Python Data Science Essentials',
 'Mastering Tableau 10']

When inspecting the HTML, we see the following:

In [62]:
# HTML copied from the browser inspector, kept as a comment so the cell runs:
# <div class="dd-image-box-caption">

In [63]:
prod_names = beautsoup_object.select(".dd-image-box-caption")
prod_names[:1000]

[<div class="dd-image-box-caption dd-image-box-text dd-image-box-white ">
 <i class="hb hb-lock dd-caption-lock"></i>
     
     Mastering Apache Spark 2.x
   </div>,
 <div class="dd-image-box-caption dd-image-box-text dd-image-box-white ">
 <i class="hb hb-lock dd-caption-lock"></i>
     
     Splunk Essentials
   </div>,
 <div class="dd-image-box-caption dd-image-box-text dd-image-box-white ">
 <i class="hb hb-lock dd-caption-lock"></i>
     
     MongoDB Cookbook
   </div>,
 <div class="dd-image-box-caption dd-image-box-text dd-image-box-white ">
 <i class="hb hb-lock dd-caption-lock"></i>
     
     Getting Started with Hadoop 2.x
   </div>,
 <div class="dd-image-box-caption dd-image-box-text dd-image-box-white ">
 <i class="hb hb-lock dd-caption-lock"></i>
     
     Learning ElasticSearch 5.0
   </div>,
 <div class="dd-image-box-caption dd-image-box-text dd-image-box-white ">
 <i class="hb hb-lock dd-caption-lock"></i>
 <span data-sheets-userformat='{"2":15235,"3":{"1":0},"4":[nu

We see that we've got a bunch of Big Data books listed here:

* Mastering Apache Spark 2.x;
* Splunk Essentials;
* MongoDB Cookbook;
* Getting Started with Hadoop 2.x;
* etc.

In [None]:
type(prod_names)

In [46]:
stripped_prod_names = [product.text.strip() for product in prod_names]
stripped_prod_names

['Mastering Apache Spark 2.x',
 'Splunk Essentials',
 'MongoDB Cookbook',
 'Getting Started with Hadoop 2.x',
 'Learning ElasticSearch 5.0',
 'Three Months of Mapt Pro for $30 Coupon',
 'Modern Big Data Processing with Hadoop',
 'Apache Hive Essentials',
 'Learning Elastic Stack 6.0',
 'Learning Hadoop 2',
 'Apache Spark with Scala',
 'Working with Big Data in Python',
 'Statistics for Data Science',
 'Python Data Analysis',
 'Learning R for Data Visualization',
 'Big Data Analytics with Hadoop 3',
 'Mastering MongoDB 3.x',
 'Artificial Intelligence for Big Data',
 "Big Data Architect's Handbook",
 'Hadoop Real-World Solutions Cookbook',
 'Build scalable applications with Apache Kafka',
 'Learning Apache Cassandra',
 'Data Science Algorithms in a Week',
 'Python Data Science Essentials',
 'Mastering Tableau 10',
 'Java for Data Science']

Or:

In [47]:
stripped_prod_names = []
for product in prod_names:
    stripped_prod_names.append(product.text.strip())
    
stripped_prod_names

['Mastering Apache Spark 2.x',
 'Splunk Essentials',
 'MongoDB Cookbook',
 'Getting Started with Hadoop 2.x',
 'Learning ElasticSearch 5.0',
 'Three Months of Mapt Pro for $30 Coupon',
 'Modern Big Data Processing with Hadoop',
 'Apache Hive Essentials',
 'Learning Elastic Stack 6.0',
 'Learning Hadoop 2',
 'Apache Spark with Scala',
 'Working with Big Data in Python',
 'Statistics for Data Science',
 'Python Data Analysis',
 'Learning R for Data Visualization',
 'Big Data Analytics with Hadoop 3',
 'Mastering MongoDB 3.x',
 'Artificial Intelligence for Big Data',
 "Big Data Architect's Handbook",
 'Hadoop Real-World Solutions Cookbook',
 'Build scalable applications with Apache Kafka',
 'Learning Apache Cassandra',
 'Data Science Algorithms in a Week',
 'Python Data Science Essentials',
 'Mastering Tableau 10',
 'Java for Data Science']

## Extracting the Price

We see that the price is in the bundle name:

In [58]:
stripped_bundles

['Pay $1 or more!',
 'Pay $8 or more to also unlock!',
 'Pay $15 or more to also unlock!',
 'Support Charity',
 '']

In [59]:
[bundle.split()[1] for bundle in stripped_bundles if bundle.startswith("Pay")]

['$1', '$8', '$15']
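The prices above are still strings such as '$1'. If we want numbers we can sort or plot, one option is a regular expression. This is only a sketch: the helper name `extract_price` is my own, and it assumes prices are whole dollar amounts written as "$" followed by digits.

```python
import re

# Bundle headlines as scraped above
stripped_bundles = ['Pay $1 or more!',
                    'Pay $8 or more to also unlock!',
                    'Pay $15 or more to also unlock!',
                    'Support Charity',
                    '']

def extract_price(headline):
    """Return the first whole-dollar amount in `headline` as an int, or None."""
    match = re.search(r"\$(\d+)", headline)
    return int(match.group(1)) if match else None

prices = [extract_price(headline) for headline in stripped_bundles]

# Drop the headlines that carried no price at all
numeric_prices = [price for price in prices if price is not None]
```

Unlike splitting on whitespace and taking the second word, this does not depend on the price being in a fixed position in the headline.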

## Streamlining the Process

Here, we'll target the 'dd-game-row' class. This lets us work ***row*** by ***row***: for each ***row*** we can pull out the bundle **title** and then the **products** that belong to that bundle. This is the same as what we were doing before, but a bit more streamlined.

Also, by building a **nested dictionary** keyed by bundle name, we keep each bundle's products grouped under its title. It's from this data structure that we can derive our insight.

In [60]:
bundles = beautsoup_object.select('.dd-game-row')
bundles

[<div class="main-content-row dd-game-row js-nav-row">
 <div class="u-constrain-width">
 <div class="dd-header">
 <h2 class="dd-header-headline">
     Pay $1 or more!
   </h2>
 <h3 class="dd-header-subheader">
 </h3>
 </div>
 <div class="dd-image-box-list">
 <div class="dd-image-box game-boxes hoverable desktop">
 <div class="dd-image-box-figure u-lazy-load" data-slideout="masteringapachespark2_x">
 <div class="dd-image-box-badge-holder u-hide-onerror">
 </div>
 <div class="dd-image-holder">
 <img class="dd-image-box-figure-img" data-retina-src="https://humblebundle.imgix.net/misc/files/hashed/a883a6da28992ae6e1772b6e5c6045ef2c335239.png?auto=format&amp;dpr=2&amp;fit=clip&amp;h=240&amp;w=180&amp;s=0241b3b7926c52ad30063b6289e3bafb" data-src="https://humblebundle.imgix.net/misc/files/hashed/a883a6da28992ae6e1772b6e5c6045ef2c335239.png?auto=format&amp;fit=crop&amp;fm=png&amp;h=218&amp;w=150&amp;s=cf7161ce312f788bbebc8546c8425596"/>
 </div>
 <span class="hover-black-overlay u-hide-onerror"

In [61]:
product_names = beautsoup_object.select(".dd-image-box-caption")[0].text.strip()
product_names

'Mastering Apache Spark 2.x'

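In [41]:
url = "https://www.humblebundle.com/books/big-data-books"
response = requests.get(url)

# Re-parse the fresh response so that `bundles` actually reflects it
beautsoup_object = BeautifulSoup(response.text, 'html.parser')
bundles = beautsoup_object.select('.dd-game-row')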

bundle_dict = {}

for bundle in bundles:
    if bundle.select('.dd-header-headline'):
        
        # Getting bundle name
        bundle_name = bundle.select('.dd-header-headline')[0].text.strip()
        
        # Getting product names
        prod_names = bundle.select('.dd-image-box-caption')
        prod_names = [prodname.text.strip() for prodname in prod_names]
        
        # Add a product tier to our data structure
        bundle_dict[bundle_name] = {'products':prod_names}

In [42]:
bundle_dict

{'Pay $1 or more!': {'products': ['Mastering Apache Spark 2.x',
   'Splunk Essentials',
   'MongoDB Cookbook',
   'Getting Started with Hadoop 2.x',
   'Learning ElasticSearch 5.0',
   'Three Months of Mapt Pro for $30 Coupon']},
 'Pay $8 or more to also unlock!': {'products': ['Modern Big Data Processing with Hadoop',
   'Apache Hive Essentials',
   'Learning Elastic Stack 6.0',
   'Learning Hadoop 2',
   'Apache Spark with Scala',
   'Working with Big Data in Python',
   'Statistics for Data Science',
   'Python Data Analysis',
   'Learning R for Data Visualization']},
 'Pay $15 or more to also unlock!': {'products': ['Big Data Analytics with Hadoop 3',
   'Mastering MongoDB 3.x',
   'Artificial Intelligence for Big Data',
   "Big Data Architect's Handbook",
   'Hadoop Real-World Solutions Cookbook',
   'Build scalable applications with Apache Kafka',
   'Learning Apache Cassandra',
   'Data Science Algorithms in a Week',
   'Python Data Science Essentials',
   'Mastering Tableau 10'

In [48]:
bundle_dict.keys()

dict_keys(['Pay $1 or more!', 'Pay $8 or more to also unlock!', 'Pay $15 or more to also unlock!'])
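Since each successive bundle also includes the bundles before it, the nested dictionary can be extended with a running total per tier. The sketch below uses `sample_bundles`, an abbreviated, hand-typed stand-in for `bundle_dict` (two products per tier only); the keys 'all_unlocked' and 'n_unlocked' are names I made up.

```python
# Abbreviated stand-in for bundle_dict (two products per tier, for brevity)
sample_bundles = {
    'Pay $1 or more!': {'products': ['Splunk Essentials', 'MongoDB Cookbook']},
    'Pay $8 or more to also unlock!': {'products': ['Apache Hive Essentials', 'Learning Hadoop 2']},
    'Pay $15 or more to also unlock!': {'products': ['Mastering MongoDB 3.x', 'Learning Apache Cassandra']},
}

unlocked_so_far = []
for tier_name, tier_info in sample_bundles.items():
    # Dicts keep insertion order (Python 3.7+), so tiers run cheapest first.
    # `+` builds a new list each time, so every tier stores its own copy.
    unlocked_so_far = unlocked_so_far + tier_info['products']
    tier_info['all_unlocked'] = unlocked_so_far
    tier_info['n_unlocked'] = len(unlocked_so_far)

counts_by_tier = [info['n_unlocked'] for info in sample_bundles.values()]
```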

### If we wanted to break this down:

In [50]:
for bundle_name, bundle_info in bundle_dict.items(): 
    print(bundle_name)
    print('Products:')
    print(', '.join(bundle_info['products']))
    print('\n\n')

Pay $1 or more!
Products:
Mastering Apache Spark 2.x, Splunk Essentials, MongoDB Cookbook, Getting Started with Hadoop 2.x, Learning ElasticSearch 5.0, Three Months of Mapt Pro for $30 Coupon



Pay $8 or more to also unlock!
Products:
Modern Big Data Processing with Hadoop, Apache Hive Essentials, Learning Elastic Stack 6.0, Learning Hadoop 2, Apache Spark with Scala, Working with Big Data in Python, Statistics for Data Science, Python Data Analysis, Learning R for Data Visualization



Pay $15 or more to also unlock!
Products:
Big Data Analytics with Hadoop 3, Mastering MongoDB 3.x, Artificial Intelligence for Big Data, Big Data Architect's Handbook, Hadoop Real-World Solutions Cookbook, Build scalable applications with Apache Kafka, Learning Apache Cassandra, Data Science Algorithms in a Week, Python Data Science Essentials, Mastering Tableau 10, Java for Data Science





**.items()** is to ***dictionaries*** roughly what **enumerate()** is to lists: it lets us loop over (key, value) tuple pairs. (In Python 3 it returns a view object rather than a list.) The string method **.join()** returns a single string in which the elements of a sequence of strings are joined by the separator string it is called on.
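A tiny sketch of both methods on a made-up dictionary (the name `tiers` and its contents are illustrative only):

```python
# Hypothetical two-tier dictionary
tiers = {'$1': ['Splunk Essentials'],
         '$8': ['Apache Hive Essentials', 'Learning Hadoop 2']}

# .items() yields (key, value) pairs; list() materializes the view
pairs = list(tiers.items())

# enumerate() instead pairs a running index with each element (here, each key)
indexed = list(enumerate(tiers))

# .join() is called on the separator and glues the strings together
joined = ', '.join(tiers['$8'])
```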



## Paired Programming

Use Plotly's documentation to try and come up with some cool visual. Perhaps we can count how many times 'Apache', 'Spark', 'Scala', and other big data buzz words are used in each of our bundles.
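One possible starting point for the counting step is sketched below. The dictionary `sample_bundles` is a shortened, hand-typed stand-in for the scraped data, and `count_buzzwords` is a hypothetical helper. It matches whole words only, so that 'Scala' is not counted inside 'scalable'.

```python
# Shortened stand-in for the scraped bundles (titles taken from the output above)
sample_bundles = {
    '$1': ['Mastering Apache Spark 2.x', 'Getting Started with Hadoop 2.x'],
    '$8': ['Apache Hive Essentials', 'Learning Hadoop 2', 'Apache Spark with Scala'],
    '$15': ['Build scalable applications with Apache Kafka'],
}
buzzwords = ['Apache', 'Spark', 'Scala', 'Hadoop']

def count_buzzwords(titles, words):
    """Count how many titles contain each word as a whole, case-sensitive token."""
    counts = {word: 0 for word in words}
    for title in titles:
        tokens = title.split()
        for word in words:
            if word in tokens:
                counts[word] += 1
    return counts

buzz_counts = {tier: count_buzzwords(titles, buzzwords)
               for tier, titles in sample_bundles.items()}
```

For each tier, the buzzwords could then serve as the x values and the counts as the y values of a Plotly bar trace.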

## Classification Overview

We went over regression basics the other day. Today we'll do the same for classification. What is classification, you ask?

"It is customary to refer to problems with a ***continuous or quantitative response*** as **regression** problems. In contrast, when the response variable is **categorical** or **qualitative** in nature, we are dealing with ***classification*** problems. A statistical learning technique to predict a **qualitative response** is called a ***classifier***."

Today, we're going to look at a classifying technique that is a natural extension of linear regression: **logistic regression**.

***NOTE***: much of the below terminology (especially in the second paragraph) is going to be completely foreign to you. Do not stress over it. I only include it here because this is, technically, what separates a **GLM** from a **linear model**. Much of this will make more sense to you when we finally get into **statistics and modeling techniques**.

**Logistic Regression** is a type of ***Generalized* Linear Model (GLM)**. There are a couple differences between ***generalized* linear regression models** and **linear regression models**. 

The first is that **generalized** models allow our response, or 'y,' variable to follow distributions that are not **normal**. We haven't gone over distributions yet, so you'll have to wait for further elaboration on that point. What I'll give you now is an example. Say we're trying to predict whether or not someone will sign up for a special promotion, and we have data on their income level, their gender, and their age. The response variable will be one of two things: **yes** or **no**. We can translate the yes to **1** and the no to **0**, a sort of ***binary dummy variable***. But in order to predict a response variable that follows a ***Bernoulli*** distribution (more on that later), we have to make sure our output is between 0 and 1. In order to do that, we apply a nonlinear transformation to the right-hand side of our regression equation. I'll show more on that on the board.
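For logistic regression, that nonlinear transformation is the logistic (sigmoid) function. The sketch below shows how it squeezes a linear combination of predictors into a probability. The coefficients are invented purely for illustration (nothing here is fitted to data), and gender is left out to keep the example short.

```python
import math

def sigmoid(z):
    """Logistic function: turns a real-valued score into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: intercept, income (in units of $10k), age
intercept, b_income, b_age = -4.0, 0.3, 0.05

def signup_probability(income_10k, age):
    # The usual linear-regression right-hand side...
    linear_part = intercept + b_income * income_10k + b_age * age
    # ...passed through the sigmoid so the result can be read as P(y = 1)
    return sigmoid(linear_part)

p_signup = signup_probability(income_10k=6, age=40)

# Classify as "yes" (1) when the probability reaches 0.5, otherwise "no" (0)
predicted_class = 1 if p_signup >= 0.5 else 0
```

The 0.5 cutoff is a common default rather than a requirement; a different threshold may suit a different problem.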

The second is that **GLM**s use ***link*** functions. Instead of equating the mean of the response to the linear combination of explanatory variables, in a **GLM** it is a **function** of the response mean, not necessarily the mean itself, that is linearly related to the predictors. The **link function** does just this, ***linking*** the mean of the response to a linear combination of the explanatory variables.
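Written out for logistic regression, the link is the logit. The notation here is my own: $p = P(y = 1 \mid x)$ is the mean of the Bernoulli response, $x_1, \dots, x_k$ are the predictors and $\beta_0, \dots, \beta_k$ the coefficients.

```latex
$$
g(p) \;=\; \log\frac{p}{1 - p} \;=\; \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longleftrightarrow\quad
p \;=\; \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
$$
```

The left-hand form shows the link function applied to the mean; solving for $p$ gives the right-hand form, which always lies strictly between 0 and 1.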

#### But so –– textbooks aside –– what is a logistic regression model actually, and how can we use it to perform classification? 

* Bernoulli response + Logit link = Logistic regression model

Let's take a look at the board: 

## Quasi-homework: SQL for Next Tuesday's Class

Please have MySQL downloaded and the sample datasets that I email you uploaded by Thursday's class. Matt will be here to help with your projects, but he can also help you troubleshoot through any MySQL downloading/data uploading issues. It's important that we get this done before meeting next week.

For **Windows**:
https://www.youtube.com/watch?v=iHTI_Nk7uwo


For **Mac**:
https://www.youtube.com/watch?v=iOlJxOkp6sI



## Helpful Links

"Using BeautifulSoup to parse HTML and extract press briefings URLs": http://www.compjour.org/warmups/govt-text-releases/intro-to-bs4-lxml-parsing-wh-press-briefings/

