# Web Scraping

### Introduction

The web is full of great datasets, but not all of them are readily available for download and analysis.
Today we'll take a look at how you can surf the web robotically, saving the relevant information into storage containers as you go!

We'll be taking a look at a few packages:
* Beautiful Soup
* Selenium
* re
* Pandas

Our general approach will be:
* pick a domain/set of web pages to scrape
* investigate those web pages using the developer tools in your web browser (such as Chrome or Firefox)
* write rules to select the relevant objects from the DOM
* parse information from those objects and store it in a container

In [19]:
from bs4 import BeautifulSoup
import requests
import selenium
import re
import pandas as pd

### Web Page Introduction: **The DOM**

Before we start scraping, it helps to know a little about how web pages are structured.

"The Document Object Model (DOM) is a programming interface for HTML and XML documents. It represents the page so that programs can change the document structure, style, and content. The DOM represents the document as nodes and objects. That way, programming languages can connect to the page."

https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction
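
To make the "nodes and objects" idea concrete, here is a minimal sketch using Python's built-in `xml.dom.minidom`, which implements a small DOM interface. The page string is invented for this illustration, and it has to be well-formed XML for `minidom` to accept it; real-world HTML is usually messier, which is one reason the rest of this lesson relies on Beautiful Soup instead.

```python
from xml.dom import minidom

# A tiny, well-formed page invented for this illustration
page = (
    "<html><head><title>Demo</title></head>"
    "<body><h1 id='top'>Hello</h1>"
    "<p>First <a href='/a.html'>link</a></p></body></html>"
)

dom = minidom.parseString(page)


def outline(node, depth=0):
    """Return one indented line per element node in the tree."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        depth += 1
    for child in node.childNodes:
        lines.extend(outline(child, depth))
    return lines


tree_lines = outline(dom)
print("\n".join(tree_lines))

# Programs can also change the document through the same interface
link = dom.getElementsByTagName("a")[0]
link.setAttribute("href", "/b.html")
link_href = link.getAttribute("href")
```

The indentation of the printed outline mirrors the nesting of the tags: each element is a node, and its children are the nodes written inside it.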

### An example webpage


### Grabbing a Web Page

In [25]:
html_page = requests.get('https://www.azlyrics.com/')  # Make a GET request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser')  # Pass the page contents to Beautiful Soup for parsing

In [32]:
#Preview the soup....MMMMMM 
soup.prettify()[:1000]

'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <meta content="noarchive" name="robots"/>\n  <meta content="AZLyrics" name="name"/>\n  <meta content="lyrics,music,song lyrics,songs,paroles" name="keywords"/>\n  <base href="//www.azlyrics.com"/>\n  <script src="//www.azlyrics.com/external.js" type="text/javascript">\n  </script>\n  <title>\n   AZLyrics - Song Lyrics from A to Z\n  </title>\n  <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/css/bootstrap.min.css" rel="stylesheet"/>\n  <link href="//www.azlyrics.com/bsaz.css" rel="stylesheet"/>\n  <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->\n  <!--[if lt IE 9]>\r\n      <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>\r\n      <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script

In [29]:
# Get all the hyperlinks on a page (find_all is the current name for the older findAll alias)
soup.find_all('a')

[<a class="navbar-brand" href="//www.azlyrics.com"><img alt="AZLyrics.com" class="pull-left" src="//www.azlyrics.com/az_logo_tr.png" style="max-height:40px; margin-top:-10px;"/></a>,
 <a class="btn btn-menu" href="//www.azlyrics.com/a.html">A</a>,
 <a class="btn btn-menu" href="//www.azlyrics.com/b.html">B</a>,
 <a class="btn btn-menu" href="//www.azlyrics.com/c.html">C</a>,
 <a class="btn btn-menu" href="//www.azlyrics.com/d.html">D</a>,
 <a class="btn btn-menu" href="//www.azlyrics.com/e.html">E</a>,
 <a class="btn btn-menu" href="//www.azlyrics.com/f.html">F</a>,
 <a class="btn btn-menu" href="//www.azlyrics.com/g.html">G</a>,
 <a class="btn btn-menu" href="//www.azlyrics.com/h.html">H</a>,
 <a class="btn btn-menu" href="//www.azlyrics.com/i.html">I</a>,
 <a class="btn btn-menu" href="//www.azlyrics.com/j.html">J</a>,
 <a class="btn btn-menu" href="//www.azlyrics.com/k.html">K</a>,
 <a class="btn btn-menu" href="//www.azlyrics.com/l.html">L</a>,
 <a class="btn btn-menu" href="//www.
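
The list of anchors above is still raw markup. As a rough sketch of the remaining steps in our general approach (parse, clean, store), the example below copies three of those anchors into a string so it runs offline, pulls out each link's label and `href`, resolves the protocol-relative URLs, and stores the rows in a Pandas DataFrame. The names `snippet`, `menu_soup`, `rows`, and `links_df` are made up for this illustration.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

# Three anchors copied from the output above, so the sketch runs offline
snippet = """
<a class="btn btn-menu" href="//www.azlyrics.com/a.html">A</a>
<a class="btn btn-menu" href="//www.azlyrics.com/b.html">B</a>
<a class="btn btn-menu" href="//www.azlyrics.com/c.html">C</a>
"""
menu_soup = BeautifulSoup(snippet, "html.parser")

rows = []
for a in menu_soup.find_all("a", class_="btn-menu"):
    rows.append({
        "label": a.get_text(strip=True),
        # Resolve the protocol-relative href against the page URL
        "url": urljoin("https://www.azlyrics.com/", a["href"]),
    })

# One row per link, one column per extracted field
links_df = pd.DataFrame(rows)
print(links_df)
```

Against the live page you would loop over `soup.find_all('a')` instead of the hard-coded snippet; the extraction and storage steps would stay the same.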

### Parsing the DOM

### Cleaning Elements

### Storing Elements

### Visualizing Results

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

* Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.

* Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't detect one. Then you just have to specify the original encoding.

* Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.

Beautiful Soup parses anything you give it and does the tree traversal for you. You can tell it "Find all the links", "Find all the links of class externalLink", "Find all the links whose URLs match foo.com", or "Find the table heading that's got bold text, then give me that text."
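
Each of those four requests maps onto a short Beautiful Soup call. The sketch below shows one way to write them; the HTML string is hypothetical, made up so that every query has something to find, and the variable names are ours.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical HTML, made up to exercise the four requests above
html = """
<a href="http://foo.com/page">Foo</a>
<a class="externalLink" href="http://example.org/">Example</a>
<a href="/local.html">Local</a>
<table><tr><th><b>Artist</b></th><th>Year</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# "Find all the links"
all_links = soup.find_all("a")

# "Find all the links of class externalLink"
external_links = soup.find_all("a", class_="externalLink")

# "Find all the links whose URLs match foo.com"
foo_links = soup.find_all("a", href=re.compile(r"foo\.com"))

# "Find the table heading that's got bold text, then give me that text"
bold_th = next(th for th in soup.find_all("th") if th.b is not None)
bold_text = bold_th.b.get_text()

print(len(all_links), len(external_links), len(foo_links), bold_text)
```

Note that `class_` has a trailing underscore because `class` is a reserved word in Python, and that passing a compiled regular expression as an attribute value makes Beautiful Soup search the attribute rather than require an exact match.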

### Selenium

### Importing Selenium and Setting the driver

In [4]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

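### Download a Browser Driver
Selenium controls each browser through a separate driver executable: ChromeDriver for Chrome, geckodriver for Firefox. The cells below use Firefox, so they need geckodriver.
https://sites.google.com/a/chromium.org/chromedriver/home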

In [11]:
cd /Users/matthew.mitchell/Documents/Tools/

/Users/matthew.mitchell/Documents/Tools


In [15]:
from selenium.webdriver.firefox.service import Service  # Selenium 4+
# Point the Service at the geckodriver executable itself (a file, not a directory, so no trailing slash)
driver = webdriver.Firefox(service=Service('/Users/matthew.mitchell/Documents/Tools/Scraping/geckodriver'))
driver.get("http://www.gmail.com")

### Inspecting the Web Page for Relevant Elements

Right-click (Control-click on a Mac) on an element you want to interact with robotically and choose Inspect.
![](./inspect.png)

You should see the following pane open up displaying the underlying DOM.


![](./inspect2.png)

### Table of Top Commands/Methods

http://akul.me/blog/2016/selenium-cheatsheet/

### Scrolling + Javascript Snippets

### Summary
You should now have a brief introduction to web scraping! The possibilities for what you can do are nearly endless. That said, not all websites will be thrilled with your new prowess. Surfing the web at superhuman speeds will get you banned from many domains and may violate the terms and conditions of many websites, particularly those that require a login. Here are a few considerations to keep in mind along the way.

* Does the website have terms and conditions that restrict automated access?
* Test your scraping bot on small samples to debug it before scaling to hundreds, thousands, or millions of requests.
* Start thinking about your IP address: getting blacklisted from a website is no fun. Consider using a VPN.
* Slow your bot down! Add delays along the way with the time module. Specifically, time.sleep(seconds) pauses the program for the given number of seconds.
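
As a rough illustration of the last point, the sketch below pauses between consecutive requests. `fetch_politely` and `fake_fetch` are hypothetical names invented here; `fake_fetch` stands in for a real call such as `requests.get` so the example runs without touching the network, and the 0.1 second delay is only there to keep the demo quick.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results[url] = fetch(url)
    return results


# Stand-in for requests.get so the sketch runs without network access
def fake_fetch(url):
    return f"<html>{url}</html>"


start = time.monotonic()
pages = fetch_politely(["page-1", "page-2", "page-3"], fake_fetch, delay_seconds=0.1)
elapsed = time.monotonic() - start

print(f"Fetched {len(pages)} pages in {elapsed:.2f} seconds")
```

For a real scraper, a delay of a second or more per request is a more considerate starting point, and passing the fetch function in as an argument makes it easy to test the loop on a small sample first.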

# Resources

#### Beautiful Soup - a good go-to tool for parsing the DOM
https://www.crummy.com/software/BeautifulSoup/

#### Selenium - Browser automation
https://www.seleniumhq.org/

#### Scrapy - another package for scraping larger datasets at scale
https://scrapy.org/