****
# Web Scraping the Internet with Python
****
<p style="text-align: right"><i>Jesus Perez Colino<br>First version: Oct 2014<br>Last revision: Nov 2015</i></p>

### About this notebook: 
Notebook prepared by **Jesus Perez Colino** Version 0.1, First Released: 01/10/2014, Alpha.  

- This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). This work is offered for free, with the hope that it will be useful.


- **Summary**: This notebook contains a brief introduction to **web scraping** from Python using **regular expressions** and **Beautiful Soup**.


- **Python & packages versions** to reproduce the results of this notebook: 

In [17]:
import IPython
import re
from sys import version 
from urllib import urlopen
from bs4 import BeautifulSoup 
print ' Reproducibility conditions for this notebook '.center(85,'-')
print 'Python version:     ' + version
print 'RegEx version:      ' + re.__version__
print 'IPython version:    ' + IPython.__version__
print '-'*85

-------------------- Reproducibility conditions for this notebook -------------------
Python version:     2.7.10 |Anaconda 2.4.0 (x86_64)| (default, Oct 19 2015, 18:31:17) 
[GCC 4.2.1 (Apple Inc. build 5577)]
RegEx version:      2.2.1
IPython version:    4.0.0
-------------------------------------------------------------------------------------


# Introduction

Before we begin designing web-scraping functions, there are two basic concepts that we need to review in order to build efficient ones: 
- *first*, Regular Expressions and 
- *second*, the package BeautifulSoup.

## Basics about Regular Expressions (regex)

In the context of this notebook, a **Regular Expression** is a specific kind of text pattern that you can use in many modern applications, not only in Python but also in many other programming languages (such as Java, Scala, Perl or Ruby). You can use them to verify whether input fits the text pattern, to find text that matches the pattern within a larger body of text, to replace text matching the pattern with other text or with rearranged bits of the matched text, to split a block of text into a list of subtexts, and to shoot yourself in the foot. 
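The four everyday uses listed above (verify, find, replace, split) map onto a handful of functions in Python's `re` module. The following is a minimal sketch; the sample text and the date pattern are made up for illustration only, and the code is written so that it should run under both the Python 2.7 used in this notebook and Python 3.

```python
import re

# Illustrative sample text (not taken from any real data set).
text = "2014-10-01, 2015-11-30; 2016-01-15"
date = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Verify: does the text start with something that fits the pattern?
starts_with_date = date.match(text) is not None

# Find: every substring that fits the pattern.
all_dates = [m.group() for m in date.finditer(text)]

# Replace: rearrange the captured pieces (year-month-day -> day/month/year).
rearranged = date.sub(r"\3/\2/\1", text)

# Split: cut the text on commas or semicolons followed by optional spaces.
pieces = re.split(r"[,;]\s*", text)

print(starts_with_date)
print(all_dates)
print(rearranged)
print(pieces)
```

Compiling the pattern once with `re.compile` and reusing the compiled object keeps the four calls consistent with each other.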

**Regular expressions** are so called because they are used to identify *regular strings*, that is, strings that can be generated by a series of linear rules. For instance:
- Write the letter “a” at least once.
- Append to this the letter “b” exactly five times.
- Append to this the letter “c” any even number of times.
- Optionally, write the letter “d” at the end.

Strings that follow these rules include `aaaabbbbbccccd`, `aabbbbbcc`, and so on (there are infinitely many variations).

Regular expressions are merely a shorthand way of expressing these sets of rules. For instance, here’s the regular expression for the series of steps just described:

`aa*bbbbb(cc)*(d|)`


where 

- `aa*`: the letter `a` is written, followed by `a*` (read as “a star”), which means “any number of a’s, including 0 of them.” Together they guarantee that `a` appears at least once.
- `bbbbb`: five b’s in a row.
- `(cc)*`: two c’s surrounded by parentheses and followed by an asterisk, meaning that you can have any number of pairs of c’s (including 0 pairs), so the number of c’s is always even.
- `(d|)`: a bar between two expressions means that it can be “this thing or that thing.” Here the second alternative is empty, so we are saying “add a d or add nothing,” which makes the final d optional. Note that spaces inside a regular expression are matched literally, so none are written around the bar.
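The four rules can be checked in Python with a few lines. This is a minimal sketch: the helper name `follows_rules` and the list of candidate strings are hypothetical, and the pattern is written with an empty alternative, `(d|)`, so that the trailing d is optional. The `^` and `$` anchors force the whole string to obey the rules, not just a part of it.

```python
import re

# The four rules as a single pattern; ^ and $ make the whole string count.
RULES = re.compile(r"^aa*bbbbb(cc)*(d|)$")

def follows_rules(text):
    """Return True if text obeys the a/b/c/d rules (hypothetical helper)."""
    return RULES.match(text) is not None

candidates = ["aaaabbbbbccccd", "aabbbbbcc", "abbbbb", "abbbbbccc", "bbbbbcc"]
for candidate in candidates:
    print("%-16s %s" % (candidate, follows_rules(candidate)))
```

The last two candidates are rejected: `abbbbbccc` has an odd number of c’s, and `bbbbbcc` has no leading a.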

The classic example of a regular expression is one that identifies an email address: 

`[A-Za-z0-9\._+-]+@[A-Za-z]+\.(com|org|edu|net)`

- `[A-Za-z0-9\._+-]+` matches the first part of the address, which must consist of one or more of the following characters: uppercase letters, lowercase letters, the digits 0-9, periods (.), underscores (_), plus signs (+), or hyphens (-). “A-Z” means “any uppercase letter, A through Z.” By putting all these possible sequences and symbols in brackets (as opposed to parentheses) we are saying “this symbol can be any one of the things we’ve listed in the brackets.” Note also that the final + sign means “these characters can occur as many times as they want to, but must occur at least once.”
- `@` the email address must contain the @ symbol.
- `[A-Za-z]+` the @ symbol must be followed by at least one uppercase or lowercase letter.
- `\.` followed by a period (.). The backslash is needed because an unescaped period matches any character.
- `(com|org|edu|net)` ending with *com, org, edu, or net* (in reality there are many more top-level domains, but these four suffice for the sake of the example).

In [13]:
import re
p = re.compile(r"[A-Za-z0-9\._+-]+@[A-Za-z]+\.(com|org|edu|net)")  # raw string keeps the backslashes intact
print p

<_sre.SRE_Pattern object at 0x105c2d6d0>


In [15]:
string = 'purple alice-b@mymail.com monkey dishwasher'
match = re.search(p, string)
if match:
    print match.group() 

alice-b@mymail.com
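The same pattern can also pull every address out of a longer text. The sketch below is only an illustration: the sample sentence and its addresses are invented. It uses `finditer` and not `findall` because the pattern contains a capturing group, and with a single group `findall` would return only the captured top-level domain, not the whole address.

```python
import re

# Same pattern as above, written as a raw string.
EMAIL = re.compile(r"[A-Za-z0-9\._+-]+@[A-Za-z]+\.(com|org|edu|net)")

# Invented sample text; the last address uses a domain the pattern does not cover.
text = "Write to alice-b@mymail.com or bob.smith@school.edu, not to carol@site.xyz"

# finditer yields match objects: group() is the full match, group(1) the captured domain.
found = [(m.group(), m.group(1)) for m in EMAIL.finditer(text)]
for address, tld in found:
    print(address + " -> " + tld)

# findall returns only what the single capturing group matched.
only_tlds = EMAIL.findall(text)
print(only_tlds)
```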



## Basics about Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. 

Here’s an HTML document as a basic example (taken from the [Beautiful Soup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)) for this introduction:

In [18]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [19]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

In [21]:
print soup.title
print soup.title.name
print soup.title.string

<title>The Dormouse's story</title>
title
The Dormouse's story


In [22]:
print soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


In [29]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [23]:
# One common task is extracting all the URLs found within a page’s <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
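Searches with `find_all` can also be narrowed by attribute, which is the technique the crawling example later in this notebook relies on. The sketch below assumes `bs4` is installed and uses a cut-down, slightly modified copy of the document above (the third link is invented) so that the block runs on its own.

```python
import re
from bs4 import BeautifulSoup

# Cut-down variant of the sample document; the third link is made up for contrast.
snippet = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="/local/page" class="other" id="link3">Local</a>
</p>
"""
soup = BeautifulSoup(snippet, "html.parser")

# Filter by tag name and CSS class (class_ avoids clashing with the Python keyword).
sisters = [a.get_text() for a in soup.find_all("a", class_="sister")]

# Filter by an attribute whose value matches a regular expression.
ends_in_ie = [a["href"] for a in soup.find_all("a", href=re.compile(r"ie$"))]

print(sisters)
print(ends_in_ie)
```

Passing a compiled regular expression as an attribute value tells Beautiful Soup to keep only the tags whose attribute matches it.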


In [25]:
# Another common task is extracting all the text from a page:

print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...



In [28]:
print(soup.prettify(formatter="minimal"))

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


# Some Crawling Examples

**Example 1**: Create a list of the Wikipedia articles that the Wikipedia article on the Black-Scholes model currently links to.

If you run this, you should see a list of all article URLs (as relative paths starting with `/wiki/`) that the Wikipedia article on the Black-Scholes model links to. 

In [6]:
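from urllib import urlopen
from bs4 import BeautifulSoup 
import re

# Download the article and parse it with Python's built-in HTML parser
html = urlopen("https://en.wikipedia.org/wiki/Black–Scholes_model")
bsObj = BeautifulSoup(html, "html.parser")

# Keep only links in the article body that start with /wiki/ and contain no colon
article_link = re.compile(r"^(/wiki/)((?!:).)*$")
for link in bsObj.find("div", {"id": "bodyContent"}).find_all("a", href=article_link):
    if 'href' in link.attrs:
        print(link.attrs['href'])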

/wiki/Mathematical_model
/wiki/Financial_market
/wiki/Derivative_(finance)
/wiki/Option_style
/wiki/Option_(finance)
/wiki/Chicago_Board_Options_Exchange
/wiki/Volatility_smile
/wiki/Fischer_Black
/wiki/Myron_Scholes
/wiki/Journal_of_Political_Economy
/wiki/Partial_differential_equation
/wiki/Black%E2%80%93Scholes_equation
/wiki/Hedge_(finance)
/wiki/Delta_hedging
/wiki/Investment_bank
/wiki/Hedge_fund
/wiki/Robert_C._Merton
/wiki/Options_pricing
/wiki/Nobel_Memorial_Prize_in_Economic_Sciences
/wiki/Risk_management
/wiki/Black-Scholes_formula
/wiki/No-arbitrage_bounds
/wiki/Risk-neutral_measure
/wiki/Black-Scholes_equation
/wiki/Volatility_surface
/wiki/Derivative_(finance)#OTC_and_exchange-traded
/wiki/Risk-free_interest_rate
/wiki/Random_walk
/wiki/Geometric_Brownian_motion
/wiki/Dividend
/wiki/Arbitrage
/wiki/Short_selling
/wiki/Frictionless_market
/wiki/Hedge_(finance)
/wiki/Transaction_cost
/wiki/Strike_price
/wiki/Risk-free_interest_rate
/wiki/Continuous_compounding
/wiki/Force_o
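The regular expression used in Example 1 can be tried without any network access. It requires the link to start with `/wiki/`, and the negative lookahead `(?!:)` in front of every following character rejects any link that contains a colon. The `href` values below are hypothetical examples of the kinds of links one might expect on a Wikipedia page, not output from the page itself.

```python
import re

# Pattern from Example 1: starts with /wiki/ and has no colon anywhere after it.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

# Hypothetical href values, chosen to exercise each case.
hrefs = [
    "/wiki/Fischer_Black",
    "/wiki/Category:Finance_theories",
    "/wiki/File:Example.png",
    "#cite_note-1",
    "//en.wikipedia.org/wiki/Myron_Scholes",
    "/wiki/Derivative_(finance)#OTC_and_exchange-traded",
]

article_links = [h for h in hrefs if ARTICLE_LINK.match(h)]
for h in article_links:
    print(h)
```

Only the first and last entries survive the filter: the others either contain a colon or do not start with `/wiki/`.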

In [None]:
#TODO... 