> **Note:** In most sessions you will be solving exercises posed in a Jupyter notebook that looks like this one. Because you are cloning a GitHub repository that only we can push to, you should **NEVER EDIT** any of the files you pull from GitHub. Instead, either make a new notebook and write your solutions there, or **make a copy of this notebook and save it somewhere else** on your computer, outside the `sds` folder that you cloned, and write your answers in the copy. If you edit the notebook you pulled from GitHub, those edits (possibly your solutions to the exercises) may be overwritten and lost the next time you pull from GitHub. This is important, so don't hesitate to ask if it is unclear.

In [4]:
import scraping_class
logfile = 'log_exercise_7.txt' ## name your log file.
connector = scraping_class.Connector(logfile)

# Exercise Set 7: Parsing and Information Extraction

In this Exercise Set we shall develop our webscraping skills even further by practicing **parsing** and navigating HTML trees using BeautifulSoup. We will also practice extracting information from raw text, with no HTML tags to help, using regular expressions.

But just as importantly, you will get a chance to think about **data quality issues** and how to ensure reliability when curating your own web data.

## Exercise Section 7.1: Logging and data quality

> **Ex. 7.1.1:** *Why is it important to log processes in your data collection?*



In [2]:
# [Answer to Ex. 7.1.1 here]

> **Ex. 7.1.2:**
*How does logging help with both ensuring and documenting the quality of your data?*


In [2]:
# [Answer to Ex. 7.1.2 here]
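
To make the discussion concrete, here is a minimal sketch of what a request log could record. It is not how `scraping_class.Connector` works internally; the function names `log_request` and `read_log`, the column names, and the `example.com` URLs are all assumptions made for illustration.

```python
import csv
import os
import tempfile
import time

def log_request(logfile, url, status_code, n_bytes, seconds):
    """Append one row describing a single request to a CSV log file."""
    is_new = not os.path.exists(logfile)
    with open(logfile, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            # write the column names once, when the file is first created
            writer.writerow(["timestamp", "url", "status_code", "n_bytes", "seconds"])
        writer.writerow([time.time(), url, status_code, n_bytes, round(seconds, 3)])

def read_log(logfile):
    """Read the log back as a list of dicts (all values are strings)."""
    with open(logfile, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# hypothetical calls: in a real scraper the values would come from the response
demo_log = os.path.join(tempfile.mkdtemp(), "demo_log.csv")
log_request(demo_log, "https://example.com/a", 200, 5120, 0.42)
log_request(demo_log, "https://example.com/b", 404, 310, 0.18)

# the log lets you find failed calls afterwards and document them
failed = [row["url"] for row in read_log(demo_log) if row["status_code"] != "200"]
print(failed)
```

With a log like this you can report how many calls failed, when they were made, and whether response sizes or timings changed during the collection.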

## Exercise Section 7.2: Parsing a Table from HTML using BeautifulSoup.

Yesterday I showed you a neat little prepackaged function in pandas that did all the work. Today, however, we will learn the mechanics behind it. *(This is not just for educational purposes; sometimes the package will not do exactly what you want.)*

We hit the Basketball stats page from yesterday again: https://www.basketball-reference.com/leagues/NBA_2018.html.


> **Ex. 7.2.1:** Here we practice simply locating the table node of interest using the `find` method built into BeautifulSoup. But first we have to fetch the HTML using the `requests` module and parse the tree using `BeautifulSoup`. Then use the **>Inspector<** tool (*right click on the table > press inspect element*) in your browser to see how to locate the Eastern Conference table node, i.e. the *tag* name of the node, and maybe some defining *attributes*.

In [56]:
import requests
from bs4 import BeautifulSoup

# define the url and fetch the HTML using the requests module
url = 'https://www.basketball-reference.com/leagues/NBA_2018.html'
response = requests.get(url)
html = response.text

# parse the raw HTML into a tree using BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
#soup = BeautifulSoup(response[0].text,'lxml') # parse the HTML

#Eastern Conference starts with: table class="suppress_all sortable stats_table" id="confs_standings_E"
eastern_conf = soup.find('table',{'id':'confs_standings_E'}) 
print(eastern_conf)

<table class="suppress_all sortable stats_table" data-cols-to-freeze="1" id="confs_standings_E"><caption>Conference Standings Table</caption>
<colgroup><col/><col/><col/><col/><col/><col/><col/><col/></colgroup>
<thead>
<tr>
<th aria-label="Eastern Conference" class="poptip sort_default_asc left" data-stat="team_name" scope="col">Eastern Conference</th>
<th aria-label="Wins" class="poptip right" data-stat="wins" data-tip="Wins" scope="col">W</th>
<th aria-label="Losses" class="poptip right" data-stat="losses" data-tip="Losses" scope="col">L</th>
<th aria-label="Win-Loss Percentage" class="poptip right" data-stat="win_loss_pct" data-tip="Win-Loss Percentage" scope="col">W/L%</th>
<th aria-label="Games Behind" class="poptip sort_default_asc right" data-stat="gb" data-tip="Games Behind" scope="col">GB</th>
<th aria-label="Points Per Game" class="poptip right" data-stat="pts_per_g" data-tip="Points Per Game" scope="col">PS/G</th>
<th aria-label="Opponent Points Per Game" class="poptip righ
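
If you want to experiment with locating nodes without hitting the website, the sketch below does the same lookup on a tiny hand-written HTML string. The snippet in `demo_html` is made up for illustration (only the two `id` values are taken from the page), and it uses the built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# hypothetical, stripped-down stand-in for the real page
demo_html = """
<html><body>
<table id="confs_standings_E" class="stats_table"><caption>East</caption></table>
<table id="confs_standings_W" class="stats_table"><caption>West</caption></table>
</body></html>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# three equivalent ways of locating a table by its id attribute
by_id = demo_soup.find("table", {"id": "confs_standings_E"})  # attribute dict
by_kwarg = demo_soup.find("table", id="confs_standings_E")    # keyword argument
by_css = demo_soup.select_one("table#confs_standings_E")      # CSS selector

print(by_id["id"], by_kwarg["id"], by_css["id"])
print(by_id.caption.text)

# find returns None when nothing matches, so check before using the result
missing = demo_soup.find("table", id="no_such_table")
print(missing is None)
```

Searching on `id` is usually the safest choice, since an id should be unique on a page while class names are often shared by many tables.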

You have located the table and should now build a function that starts at a "table node", parses the information, and outputs a pandas DataFrame.

Inspect the element either within the notebook or through the **>Inspector<** tool and start to see how a table is written in HTML. Which tag names can be used to locate rows? How will you iterate through columns? Where is the header located?

> **Ex. 7.1.2:** First you parse the header, which can be found under the canonical tag name: `thead`.
Next you use the `find_all` method to search for the header cell tag, and iterate through each of the elements, extracting the text using the `.text` attribute built into the node object. Store the header values in a list container.
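
As a rough illustration of the header step, the sketch below parses a small made-up table (the HTML in `demo_table_html` is an assumption, not the real page) and collects the column names both from `.text` and from the `aria-label` attribute.

```python
from bs4 import BeautifulSoup

# hypothetical miniature table with the same header layout as the real one
demo_table_html = """
<table id="demo">
<thead><tr>
<th aria-label="Team">Team</th>
<th aria-label="Wins">W</th>
<th aria-label="Losses">L</th>
</tr></thead>
</table>
"""
demo_table = BeautifulSoup(demo_table_html, "html.parser").find("table")

# locate the header section first, then every header cell inside it
header_nodes = demo_table.find("thead").find_all("th")

# .text gives the short label that is displayed in the browser
short_names = [th.text for th in header_nodes]

# .get falls back to the displayed text if a cell has no aria-label
long_names = [th.get("aria-label", th.text) for th in header_nodes]

print(short_names)
print(long_names)
```

Which of the two lists you use as column names is a design choice: the displayed text is brief, while the `aria-label` values are more self-explanatory.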

> **Ex. 7.1.3:** Next you locate the rows, using the canonical tag name: `tbody`. From here you search for all row tags. Figure out the tag name yourself, by inspecting the `tbody` node in Python or by using the **Inspector**.

> **Ex. 7.1.4:** Next run through all the rows and extract each value, similar to how you extracted the header. However, here is a slight variation: since each value node can have a different tag depending on whether it is a digit or a string, you should use the `.children` attribute instead of `.find_all` (or compile a regex that matches both the `td` tag and the `th` tag).
>Once the value nodes of each row have been located using `.children`, you should extract the values. Store the extracted rows as a list of lists: ```[[val1,val2,...valk],...]```
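
The three ways of collecting the value nodes mentioned above can be compared on a small made-up table body. This is only a sketch: the HTML, the team names, and the helper `cell_value` are hypothetical, and the real rows may contain extra attributes or whitespace.

```python
import re
from bs4 import BeautifulSoup, Tag

# hypothetical rows: a th cell holds the name, td cells hold the numbers
demo_rows_html = """
<table><tbody>
<tr><th>Team A</th><td>10</td><td>5</td></tr>
<tr><th>Team B</th><td>8</td><td>7</td></tr>
</tbody></table>
"""
demo_body = BeautifulSoup(demo_rows_html, "html.parser").find("tbody")

def cell_value(cell):
    """Return th text as a string and td text as a float (NaN if not numeric)."""
    text = cell.text.strip()
    if cell.name == "th":
        return text
    try:
        return float(text)
    except ValueError:
        return float("nan")

rows_children, rows_list, rows_regex = [], [], []
for tr in demo_body.find_all("tr"):
    # 1) .children yields every direct child, so keep only real tags
    rows_children.append([cell_value(c) for c in tr.children if isinstance(c, Tag)])
    # 2) find_all accepts a list of tag names
    rows_list.append([cell_value(c) for c in tr.find_all(["th", "td"])])
    # 3) find_all also accepts a compiled regex that is matched against tag names
    rows_regex.append([cell_value(c) for c in tr.find_all(re.compile(r"^t[hd]$"))])

print(rows_children)
print(rows_children == rows_list == rows_regex)
```

The `isinstance(c, Tag)` check matters because `.children` can also yield plain text nodes, such as line breaks between cells.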

[<a href="/teams/TOR/2018.html">Toronto Raptors</a>, <a href="/teams/BOS/2018.html">Boston Celtics</a>, <a href="/teams/PHI/2018.html">Philadelphia 76ers</a>, <a href="/teams/CLE/2018.html">Cleveland Cavaliers</a>, <a href="/teams/IND/2018.html">Indiana Pacers</a>, <a href="/teams/MIA/2018.html">Miami Heat</a>, <a href="/teams/MIL/2018.html">Milwaukee Bucks</a>, <a href="/teams/WAS/2018.html">Washington Wizards</a>, <a href="/teams/DET/2018.html">Detroit Pistons</a>, <a href="/teams/CHO/2018.html">Charlotte Hornets</a>, <a href="/teams/NYK/2018.html">New York Knicks</a>, <a href="/teams/BRK/2018.html">Brooklyn Nets</a>, <a href="/teams/CHI/2018.html">Chicago Bulls</a>, <a href="/teams/ORL/2018.html">Orlando Magic</a>, <a href="/teams/ATL/2018.html">Atlanta Hawks</a>]


In [None]:
# Ex. 7.1.2: Store the header values in a list container.
# I search within the eastern_conf table node, not the whole soup
header_find = eastern_conf.find('thead')  # locate the header section
headers = header_find.find_all('th')      # find all header cells
#print(headers)

# create empty list for header values
header_list = []

# iterate through the header cells and append each label to the list
for header in headers:
    if header.has_attr('aria-label'):
        header_list.append(header['aria-label'])
#print(header_list)
# to use the displayed text instead (e.g. 'W' rather than 'Wins'), append header.text

In [58]:
#Ex. 7.1.3: And from here you search for all row tags.
row_find = eastern_conf.find('tbody')
# all team rows start with: <tr class="full_table">
rows = row_find.find_all('tr',{'class':'full_table'})
#print(rows)


In [59]:
#Ex 7.1.4: run through all the rows and extract each value

# each value node has a different tag depending on its type (td for digits, th for strings)
# function that checks the tag and converts td values to float
import numpy as np
def convert_value_type(value_node):
    if value_node.name == 'th':
        return value_node.text
    else: # assume node is td:
        try:
            return float(value_node.text)
        except ValueError:
            return np.nan # assume there is no value if the conversion fails
data = []
for row_node in rows:
    row = []
    for child in row_node.children:
        if child.name in ('th', 'td'): # skip any text nodes between cells
            row.append(convert_value_type(child))
    data.append(row)

#print(data)

> **Ex. 7.2.6:** Now locate all tables from the page, using the `.find_all` method searching for the table tag name. Iterate through the table nodes and apply the function created for parsing html tables. Store each table in a dictionary using the table name as key. The name is found by accessing the id attribute of each table node, using dictionary-style syntax - i.e. `table_node['id']`.
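
The pattern of looping over table nodes and keying the results by `id` can be tried offline first. The sketch below is an illustration under assumptions: the page in `demo_page`, the ids `table_a` and `table_b`, and the simplified parser `demo_parse_table` (which keeps every value as a string) are all made up.

```python
import pandas as pd
from bs4 import BeautifulSoup

# hypothetical page: two data tables with ids and one table without an id
demo_page = """
<table id="table_a"><thead><tr><th>Team</th><th>W</th></tr></thead>
<tbody><tr><th>Team A</th><td>10</td></tr><tr><th>Team B</th><td>8</td></tr></tbody></table>
<table id="table_b"><thead><tr><th>Team</th><th>L</th></tr></thead>
<tbody><tr><th>Team C</th><td>3</td></tr></tbody></table>
<table><tbody><tr><td>table without an id</td></tr></tbody></table>
"""

def demo_parse_table(table_node):
    """Simplified table parser: header from thead, string values from tbody."""
    header = [th.text for th in table_node.thead.find_all("th")]
    data = []
    for tr in table_node.tbody.find_all("tr"):
        data.append([cell.text for cell in tr.find_all(["th", "td"])])
    return pd.DataFrame(data, columns=header)

demo_tables = {}
for node in BeautifulSoup(demo_page, "html.parser").find_all("table"):
    table_id = node.get("id")  # .get returns None instead of raising KeyError
    if table_id is None or node.thead is None:
        continue  # skip tables we cannot name or that have no header
    demo_tables[table_id] = demo_parse_table(node)

print(list(demo_tables))
print(demo_tables["table_a"])
```

Using `node.get("id")` rather than `node["id"]` is a defensive choice: dictionary-style access raises a `KeyError` as soon as one table on the page lacks an id.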

In [67]:
import numpy as np
import pandas as pd

#from solution:
#df = pd.DataFrame(data,columns=header)
def parse_html_table(table_node):
    # parse header
    header = table_node.thead.find_all('th') # locate each column name using the `th` tag
    # extract the label you want: for brevity use .text, for a more informative label get the aria-label attribute
    header = [col['aria-label'] for col in header]
    # parse rows: the canonical tbody locates the data body.
    body = table_node.tbody
    # rows are found using the "tr" tag
    rows = body.find_all('tr')
    # each value node has a different tag depending on its type (td for digits, th for strings)
    # function that checks the tag and converts td values to float
    def convert_value_type(value_node):
        if value_node.name == 'th':
            return value_node.text
        else: # assume node is td:
            try:
                return float(value_node.text)
            except ValueError:
                return np.nan # assume there is no value if the conversion fails
    data = []
    for row_node in rows:
        row = []
        for child in row_node.children:
            if child.name in ('th', 'td'): # skip any text nodes between cells
                row.append(convert_value_type(child))
        data.append(row)
    df = pd.DataFrame(data,columns=header)
    return df # without this return the function gives back None

df = parse_html_table(eastern_conf)
df.head()

In [69]:
all_tables = soup.find_all('table')

dicts = {}

for table_node in all_tables:
    name = table_node['id'] # access the id attribute.
    table = parse_html_table(table_node) # apply the parse_html_table function
    dicts[name] = table # store table in the dictionary


list(dicts) # list the table names; each value is a DataFrame


['confs_standings_E',
 'confs_standings_W',
 'divs_standings_E',
 'divs_standings_W']

> **Ex. 7.2.7 (extra):** Compare your results to the pandas implementation, `pd.read_html`.
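
As a starting point for the comparison, here is a small sketch of calling `pd.read_html` on a made-up table. The HTML in `demo_html_table` and the id `demo` are assumptions; on the real page you would pass the fetched HTML instead. Note that `pd.read_html` needs a parser backend such as `lxml` to be installed.

```python
import io
import pandas as pd

# hypothetical miniature table with a header row and two data rows
demo_html_table = """
<table id="demo">
<thead><tr><th>Team</th><th>W</th><th>L</th></tr></thead>
<tbody>
<tr><th>Team A</th><td>10</td><td>5</td></tr>
<tr><th>Team B</th><td>8</td><td>7</td></tr>
</tbody>
</table>
"""

# read_html returns a list of DataFrames, one per matching table;
# attrs restricts the search to tables with the given attributes
frames = pd.read_html(io.StringIO(demo_html_table), attrs={"id": "demo"})
df_pandas = frames[0]

print(len(frames))
print(df_pandas.dtypes)
print(df_pandas)
```

Things worth comparing with your own parser are the column names, the inferred data types, and how rows that are not real data (for example sub-headers inside `tbody`) are handled.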