# Python HTML Web Scraping Examples

## First Example
This first example finds every table cell (`td`) tag that holds a session number. The printout lacks context, because each session number has lost its association with its host course.

When we find tags that contain the data of interest, note below that we can use the `.text` attribute to extract the text inside them.

Attribute values can be extracted by using square brackets with the attribute name (as a string) inside, just as we index dictionaries.

In [None]:
from bs4 import BeautifulSoup

htmlPath = './CourseSchedule.htm'
# Read the whole file; the with-block closes it even if reading fails
with open(htmlPath) as f:
    htmlDoc = f.read()

# Parse the HTML with the lxml parser
htmlParsed = BeautifulSoup(htmlDoc, "lxml")

results = htmlParsed.find_all('td', attrs={'class' : 'Num'})
print('Results: ',results)
print()
print('Number of tags found: ', len(results))
print()
print('Results Data Type: ', type(results))
print()
print('Full Tag , Text , Width , Class')
for result in results:
    # .text gives the cell's text; square brackets read attribute values
    print(result, ',', result.text, ',', result['width'], ',', result['class'])
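
The two access patterns described above can be tried without the course file. The following is a minimal sketch, assuming a small made-up HTML snippet (the `width="40"` values and the table layout are hypothetical, not taken from `CourseSchedule.htm`) and using Python's built-in `html.parser` in place of `lxml` so that nothing beyond `bs4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for CourseSchedule.htm (structure and values are assumptions)
html = """
<table>
  <tr><td class="Num" width="40">1</td><td class="Time">8:00 - 9:20 a.m.</td></tr>
  <tr><td class="Num" width="40">2</td><td class="Time">9:30 - 10:50 a.m.</td></tr>
</table>
"""

# html.parser ships with Python; the notebook itself uses "lxml"
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td", attrs={"class": "Num"})

# .text returns the text inside each tag
session_numbers = [cell.text for cell in cells]

# Square brackets read an attribute value, like indexing a dictionary
widths = [cell["width"] for cell in cells]

# "class" can hold several values, so BeautifulSoup returns it as a list
classes = [cell["class"] for cell in cells]

# .get() returns None for a missing attribute instead of raising KeyError
missing = cells[0].get("height")

print(session_numbers, widths, classes, missing)
```

Indexing with square brackets raises a `KeyError` when the attribute is absent, so `.get()` may be the safer choice when not every tag is guaranteed to carry the attribute.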

## Keeping the Context: Hierarchical Search
In this example we first search for each course, and then find each of its sessions through a secondary search.

Note first, as shown above, that the data type returned by a BeautifulSoup `.find_all()` call is a `ResultSet`. This data type is defined in the BeautifulSoup package and is, essentially, a list of HTML tags. Each of those tags can itself be searched again with BeautifulSoup.

In [5]:
from bs4 import BeautifulSoup

htmlPath = './CourseSchedule.htm'
# Read the whole file; the with-block closes it even if reading fails
with open(htmlPath) as f:
    htmlDoc = f.read()

# Parse the HTML with the lxml parser
htmlParsed = BeautifulSoup(htmlDoc, "lxml")

results1 = htmlParsed.find_all('table', attrs={'class' : 'Course'})
#print('len(results1):',len(results1))
#print('results1:',results1)

for result1 in results1:
    # The course name sits in the row with class "Name"
    results2a = result1.find('tr', attrs={'class': 'Name'})
    courseName = results2a.text
    # Each session of this course is a row with class "Session"
    results2b = result1.find_all('tr', attrs={'class': 'Session'})
    #print('Course Name: ',courseName,'\n results2b:',results2b)

    for result2 in results2b:
        # Search within one session row for its number, time, and days cells
        results3a = result2.find_all('td', attrs={'class': 'Num'})
        results3b = result2.find_all('td', attrs={'class': 'Time'})
        results3c = result2.find_all('td', attrs={'class': 'Days'})

        for i in range(len(results3a)):
            print(courseName+': ', results3a[i].text+',', results3b[i].text+',', results3c[i].text)


BUAD 5012 Competing Through Business Analytics:  1, 8:00 - 9:20 a.m., M W
BUAD 5012 Competing Through Business Analytics:  2, 9:30 - 10:50 a.m., M W
BUAD 5012 Competing Through Business Analytics:  3, 12:30 - 1:50 p.m., M W
BUAD 5042 Heuristic Algorithms:  1, 8:00 - 9:20 a.m., T Th
BUAD 5042 Heuristic Algorithms:  2, 9:30 - 10:50 a.m., T Th
BUAD 5042 Heuristic Algorithms:  3, 12:30 - 1:50 p.m., T Th
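
The same hierarchical idea can be sketched on an inline document, collecting the results into a list of tuples rather than printing them. The HTML below is a guess at the file's structure based only on the class names used in the code above (`Course`, `Name`, `Session`, `Num`, `Time`, `Days`), and `html.parser` is swapped in for `lxml` to keep the sketch dependency-light.

```python
from bs4 import BeautifulSoup

# Assumed structure, modeled on the class names searched for in the notebook
html = """
<table class="Course">
  <tr class="Name"><td>BUAD 5012 Competing Through Business Analytics</td></tr>
  <tr class="Session"><td class="Num">1</td><td class="Time">8:00 - 9:20 a.m.</td><td class="Days">M W</td></tr>
  <tr class="Session"><td class="Num">2</td><td class="Time">9:30 - 10:50 a.m.</td><td class="Days">M W</td></tr>
</table>
<table class="Course">
  <tr class="Name"><td>BUAD 5042 Heuristic Algorithms</td></tr>
  <tr class="Session"><td class="Num">1</td><td class="Time">8:00 - 9:20 a.m.</td><td class="Days">T Th</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet, which behaves like a list of Tag objects
courses = soup.find_all("table", attrs={"class": "Course"})
is_list_like = isinstance(courses, list)

rows = []
for course in courses:
    # First-level search: the course name, kept so each session stays in context
    name = course.find("tr", attrs={"class": "Name"}).text.strip()
    # Second-level search: only the sessions inside this course's table
    for session in course.find_all("tr", attrs={"class": "Session"}):
        num = session.find("td", attrs={"class": "Num"}).text
        time = session.find("td", attrs={"class": "Time"}).text
        days = session.find("td", attrs={"class": "Days"}).text
        rows.append((name, num, time, days))

for row in rows:
    print(row)
```

Because each inner search starts from a single course or session tag, a session can never be paired with the wrong course, which is the context the first example lost.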


## Parsing Data from the Internet
This code requests an HTML page from a [site](https://www.basketball-reference.com/players/w/walljo01/gamelog-advanced/2017/) containing basketball data and then extracts game statistics from it.

In [None]:
from bs4 import BeautifulSoup  # Parsing HTML
import requests  # Internet information requests
import re        # Regular Expressions (Regex) package

htmlPath = 'https://www.basketball-reference.com/players/w/walljo01/gamelog-advanced/2017/'
htmlDoc = requests.get(htmlPath).content

# Parse the HTML
htmlParsed = BeautifulSoup(htmlDoc, 'lxml')
# Get data from one game: games will be of the ResultSet data type
games = htmlParsed.find_all('tr', attrs={'id' : 'pgl_advanced.424'})  # Finds one row of data
# The next statement uses Regex to find all rows of data
# (the dot is escaped so that it matches only a literal period)
#games = htmlParsed.find_all('tr', attrs={'id' : re.compile(r'^pgl_advanced\.')})

print('Number of games found:',len(games))
print('Game stats are in data type:',type(games),'\n')
all_games = []  # set up an empty list in which to store the data 
for game in games:
    new_game = []   #create empty list to store the data from the next game
    for field in game.find_all('td'):
        new_game.append(field.text)   # append next data point to the current game list
        
    all_games.append(new_game)        # append current game data list to overall list
    
    
print('\nHere\'s the Data')
print(all_games)
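
The regular-expression variant of the search can be tried offline. This is a rough sketch on a made-up table: the row ids follow the `pgl_advanced.` prefix used above, but the cell values and the two non-matching rows are invented for illustration, and `html.parser` replaces `lxml`.

```python
import re
from bs4 import BeautifulSoup

# Made-up rows; only the id prefix mirrors the page scraped above
html = """
<table>
  <tr id="pgl_advanced.1"><th>1</th><td>val-a</td><td>val-b</td></tr>
  <tr id="pgl_advanced.2"><th>2</th><td>val-c</td><td>val-d</td></tr>
  <tr id="pgl_advancedX3"><td>not a game row</td></tr>
  <tr class="thead"><td>header row without an id</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# ^ anchors the match at the start of the id; \. matches a literal period
pattern = re.compile(r"^pgl_advanced\.")
games = soup.find_all("tr", attrs={"id": pattern})

# One inner list per game, holding the text of each <td> cell in that row
all_games = [[field.text for field in game.find_all("td")] for game in games]

print("Number of games found:", len(games))
print(all_games)
```

With an unescaped dot the pattern would also accept the `pgl_advancedX3` row, since `.` matches any character. Note too that searching only for `td` skips any `th` cells in a row, so if a page stores a column in `th` tags (an assumption to check against the actual page), that column will not appear in the collected data.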