# Rule 1 - Web Scraping - Part 2
***

# Recap

Recall that in [Lesson 1](https://github.com/modulusmath/Finance/blob/master/Rule%20%231%20Financial%20Scraping%20-%20Crash%20Course%20-%20Part%20%231.ipynb) we connected to the **MarketWatch** site, downloaded the **Revenue** page for symbol **'goog'**, parsed the HTML and pulled out the **revenue** data.

Now, one of our **design goals** is to run without any **tracebacks**, catching the obvious errors so we can **quickly spot** problems.

We got **lucky** in that we saw **no errors**, but what if the server is down or we can't connect? 

It's unlikely a large site is completely down but it's best to be *__**prepared**__*.

# HTTP / Network Error Checking

Setting up our request object again:

In [8]:
import requests

In [43]:
url ="http://www.marketwatch.com/investing/stock/goog/financials/"

In [44]:
r = requests.get(url,timeout=10)

Instead of making an attempt straight away, we'll ['try'](https://docs.python.org/3/tutorial/errors.html) it:

In [21]:
def err_web(url):
    """Catch errors from the web connection.

    All or nothing here: if the request fails or the status is a 4xx/5xx
    error, print the problem and return the exception; otherwise return
    the response.
    """
    try:
        r = requests.get(url,timeout=10)
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        print ("HTTP Error:",errh)
        return errh
    except requests.exceptions.ConnectionError as errc:
        print ("Fatal Error Connecting:",errc)
        return errc
    except requests.exceptions.Timeout as errt:
        print ("Timeout Error:",errt)
        return errt
    except requests.exceptions.RequestException as err:
        print ("OOps: Something Else",err)
        return err 
    else:
        return r

If requests.get raises an **exception**, we'll **catch it**.

First, though, we call **[raise_for_status()](http://docs.python-requests.org/en/master/user/quickstart/)** to turn any **HTTP error** status (4xx or 5xx) into an exception; then we explicitly check for **Timeouts** and **Connection Failures**.
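To make the two ideas above concrete, here is a minimal offline sketch. It builds `requests.Response` objects by hand (an assumption made purely so the example needs no network) to show when raise_for_status() raises, and it checks how the requests exception classes relate to each other, which is why the order of the except clauses matters.

```python
import requests

# Build responses by hand so the example runs without a network connection.
ok = requests.Response()
ok.status_code = 200
ok.raise_for_status()               # 2xx: returns quietly, nothing is raised

missing = requests.Response()
missing.status_code = 404
try:
    missing.raise_for_status()      # 4xx/5xx: raises HTTPError
    raised = False
except requests.exceptions.HTTPError:
    raised = True

# A connect timeout is both a ConnectionError and a Timeout, so whichever
# of those two except clauses is listed first is the one that handles it.
exc = requests.exceptions
is_connection_error = issubclass(exc.ConnectTimeout, exc.ConnectionError)
is_timeout = issubclass(exc.ConnectTimeout, exc.Timeout)

print(raised, is_connection_error, is_timeout)
```

Since err_web() lists ConnectionError before Timeout, a connection that times out is reported by the "Fatal Error Connecting" branch.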

Let's see:

In [23]:
response = err_web("http://www.marketwatch.com/investing/stock/goog/financialsZZ/")

HTTP Error: 404 Client Error: Not Found for url: https://www.marketwatch.com/investing/stock/goog/financialsZZ/


In [24]:
print (response)

404 Client Error: Not Found for url: https://www.marketwatch.com/investing/stock/goog/financialsZZ/


That's a lot **better** than a **traceback**. Ditto if we can't connect:

In [27]:
import time
print (time.strftime("%H:%M:%S"))
response = err_web("http://1.1.1.1/")
print (time.strftime("%H:%M:%S"))

12:07:47
Fatal Error Connecting: HTTPConnectionPool(host='1.1.1.1', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fc188e6fa58>, 'Connection to 1.1.1.1 timed out. (connect timeout=10)'))
12:07:57


10 seconds later we have our answer - **no ugly tracebacks**!
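One consequence of this design is that err_web() hands back either a response or the exception it caught, so the caller has to tell the two apart before parsing anything. A minimal sketch of such a check follows; `fetch_succeeded` is a hypothetical helper name, not part of the lesson's code, and it relies only on the fact that every caught error is an `Exception` instance.

```python
def fetch_succeeded(result):
    """Return True if result is a response, False if it is a caught exception.

    Hypothetical helper: err_web() returns the exception object on failure,
    and every requests exception is an Exception subclass.
    """
    return not isinstance(result, Exception)


# Stand-ins so the sketch runs offline: any exception counts as a failure,
# anything else is treated as a successful response.
print(fetch_succeeded(TimeoutError("timed out")))  # False
print(fetch_succeeded("pretend response"))         # True
```

In the notebook, this would be called as `fetch_succeeded(response)` right after `response = err_web(url)`, before building the soup.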

# HTML parsing/Beautiful Soup Error Checking

Assuming we connected reliably, we now have a **response** object from requests. We also created a **soup** object and pulled out some HTML:

In [122]:
from bs4 import BeautifulSoup as bsoup
soup_sales   = bsoup(r.content,"lxml")
a_href_sales = soup_sales.find('a',attrs={'data-ref':'ratio_SalesNet1YrGrowth'})

What if our anchor in the HTML **"ratio_SalesNet1YrGrowth"** does not exist? 

What if MarketWatch changes their **HTML** and **layout**? (hint: they will).

Screen scraping is an art form, like building a **sandcastle** at the beach: it's going to **fall down** at some point, and you have **little control** over when that happens.

If **Beautiful Soup fails to find** the tag you're looking for, **find()** returns **None**, whose type is *__**NoneType**__*:

In [101]:
print (type(a_href_sales))

<class 'NoneType'>


That means calling **find_parent()** on that **None** result raises an **AttributeError**:

In [100]:
from bs4 import BeautifulSoup as bsoup
soup_sales   = bsoup(r.content,"lxml")
a_href_sales = soup_sales.find('a',attrs={'data-ref':'ratio_SalesNet1YrGrowthOOps'})
sales_td_parent = a_href_sales.find_parent()


AttributeError: 'NoneType' object has no attribute 'find_parent'

We'll need to **catch the Exception** and handle it gracefully:

In [123]:
def get_rev(soup_sales):
    """Return the revenue values from the soup as a list (empty if not found)."""
    revenue_master = []

    # If we got here the HTTP call succeeded, so we have a valid soup object.
    # find() returns None when nothing matches, so the worst case is that
    # a_href_sales is None; no AttributeError is raised on this line.
    a_href_sales = soup_sales.find('a', attrs={'data-ref': 'ratio_SalesNet1YrGrowth'})
    try:
        # If a_href_sales is None, find_parent() raises the AttributeError,
        # so both soup calls are grouped inside the try.
        sales_td_parent = a_href_sales.find_parent()
        sales_data_links = sales_td_parent.find_next_siblings("td", attrs={'class': 'valueCell'})
    except AttributeError:
        print("No Sales data web patterns found")
        print("")
        return revenue_master
    else:
        # find_next_siblings() returns a (possibly empty) list, so this
        # check is only a defensive guard.
        if sales_data_links is not None:
            for link in sales_data_links:
                rev_val = link.string
                revenue_master.append(rev_val)
        return revenue_master

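Hence, we group both soup calls inside one **try** and **bail** if either fails, alerting the **user**. Also note that **find_next_siblings()** returns a list that may be **empty** rather than None, so the explicit None check on **sales_data_links** is only a defensive guard; with no matching cells we simply return an empty list.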

In [124]:
get_rev(soup_sales)

['49.96B', '59.73B', '65.83B', '73.59B', '89.73B']

In this case we get all the data we are looking for. Nice.
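The failure path is worth exercising too. The sketch below is an alternative to the try/except in get_rev(): it tests for **None** explicitly before walking the tree. The HTML fragment, the `cell_values` name and the use of the built-in "html.parser" are all assumptions made so the example runs offline; the real page is larger and may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment shaped like one row of the financials table.
SAMPLE_HTML = (
    "<table><tr>"
    "<td><a data-ref='ratio_SalesNet1YrGrowth'>Sales</a></td>"
    "<td class='valueCell'>49.96B</td>"
    "<td class='valueCell'>59.73B</td>"
    "</tr></table>"
)

def cell_values(html, data_ref):
    """Return the valueCell strings in the row anchored by data_ref, or []."""
    soup = BeautifulSoup(html, "html.parser")   # built-in parser, no lxml needed
    anchor = soup.find('a', attrs={'data-ref': data_ref})
    if anchor is None:                          # find() matched nothing
        return []
    cells = anchor.find_parent().find_next_siblings("td", attrs={'class': 'valueCell'})
    return [str(cell.string) for cell in cells]

found = cell_values(SAMPLE_HTML, 'ratio_SalesNet1YrGrowth')
missing = cell_values(SAMPLE_HTML, 'ratio_SalesNet1YrGrowthOOps')
print(found)
print(missing)
```

Either style works; the explicit check avoids masking an unrelated AttributeError, while the try/except keeps the happy path shorter.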

# Validating Data

The data we receive for Sales/Revenue was: **['49.96B', '59.73B', '65.83B', '73.59B', '89.73B']**   

**Wonderful** but...
    

Q: How can we be so sure it will **always** be good **'as it is'**?
A: We can't

Q: What if some of the numbers are **negative**? 
A: Maybe they are prepended with **'-'**. Maybe they are enclosed in **parentheses**: (). Maybe both! Lions and tigers and bears, oh my!

Q: What if a company reports in Millions, Thousands? lowercase? Both Millions **and** Billions? 
A: We'll need to catch all of those cases.

In [103]:
import re

def check_data(data):
    """Return a number and a denomination value (typically M or B).

    Returns (None, None) if data is None, and ('NA', 'NA') if no number
    is found, so that every list entry stays filled.
    """
    mynanstring = 'NA'
    if data is None:
        return None, None

    # Optional '-' and/or '(' prefix, digits with optional commas,
    # up to two decimals, then an optional m/M/b/B denomination.
    data_pat = re.compile(r"([-(]?[-(]?[0-9,]+\.?[0-9]{,2})([mMbB]?)")
    data_is_valid = data_pat.search(data)
    if not data_is_valid:
        return mynanstring, mynanstring

    num = data_is_valid.group(1)
    denom_val = data_is_valid.group(2) or None

    # An opening parenthesis marks an accounting-style negative.
    braces = '(' in num
    num = num.replace('(', "").replace(')', "").replace(",", "")

    # Negate only if the number is not already negative.
    if braces and '-' not in num:
        return -float(num), denom_val
    return float(num), denom_val

We'll pass each of the values into **check_data()** and return either a **float** and a **denomination value** or a custom **"Not a Number"** string you can set.
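Before running it, it may help to see what the pattern inside check_data() captures on its own, before the clean-up steps strip the parentheses and commas. This is a small sketch using the same pattern written as a raw string; the sample inputs are made up for illustration.

```python
import re

# Same pattern as in check_data(), written as a raw string.
data_pat = re.compile(r"([-(]?[-(]?[0-9,]+\.?[0-9]{,2})([mMbB]?)")

for text in ['(-49.96m)', '1,234.5B', 'HHHHH']:
    match = data_pat.search(text)
    if match:
        # group(1) is the signed number, group(2) the denomination letter
        print(text, '->', match.group(1), '|', repr(match.group(2)))
    else:
        print(text, '-> no match')
```

The first group still carries the leading '(' or '-', which is why check_data() inspects it for a parenthesis before converting to float.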

Let's see:

In [72]:
rev=['49.96B', '59.73B', '65.83B', '73.59B', '89.73B']

In [82]:
for value in rev:
    num,den = check_data(value)
    print (num,den)

49.96 B
59.73 B
65.83 B
73.59 B
89.73 B


In [113]:
rev=['(-49.96m)', '-59.73M', '65.83m', '-(73.59M)', 'HHHHH','-(W!)','(8B)','$$$$$']
for value in rev:
    num,den = check_data(value)
    print (num,den)    

-49.96 m
-59.73 M
65.83 m
-73.59 M
NA NA
NA NA
-8.0 B
NA NA


Great! **No matter** what sequence of dashes and parentheses we see, we return a **negative number** when **appropriate** and NA if other things get tossed in there.
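Since a company might report in both Millions **and** Billions, one possible next step is to scale each (number, denomination) pair to a common unit. The sketch below is not part of the lesson's code: `to_units` and the `MULTIPLIERS` table are hypothetical names, and the multipliers assume M means millions and B means billions.

```python
# Assumed multipliers; extend the table if other suffixes turn up.
MULTIPLIERS = {'M': 1_000_000, 'B': 1_000_000_000}

def to_units(num, denom):
    """Scale a (number, denomination) pair to a plain float, or None if unusable."""
    if not isinstance(num, (int, float)):
        return None                      # covers None and the 'NA' marker
    if denom is None:
        return float(num)                # no suffix: take the number as-is
    factor = MULTIPLIERS.get(denom.upper())
    if factor is None:
        return None                      # unknown suffix
    return num * factor

print(to_units(-8.0, 'B'))    # -8000000000.0
print(to_units(2.5, 'm'))     # 2500000.0
print(to_units('NA', 'NA'))   # None
```

Upper-casing the suffix handles the lowercase case from the questions above without adding more table entries.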

In the **next lesson** we'll look at displaying the data, calculating growth rates and finally plotting the data we receive.