#### Overview


The IR cycle looks something like this:
1. Collect documents (e.g. web crawling or retrieving specific pages)
1. Extract *structured* information from documents (i.e. convert the document's external format to match your schema)
1. Index the documents
1. Query the index

Most of the documents of interest have some structured elements and some unstructured elements. 
We will look at Seattle U computer science faculty home pages.  For example, Prof. Dingle's page:
https://www.seattleu.edu/scieng/computer-science/faculty-and-staff/dingle-adair.html

Things like the name, email, office number, and phone number are structured, and the list of research interests is (probably) unstructured narrative. 

Here we are going to concentrate on the first two steps of the cycle.  We will work on
1. Making a service call to get an HTML document
1. Parsing the document to pull out certain fields 
1. Packaging and storing document data so it's ready for indexing

The exercise: for each faculty page, extract the following information:

1. Name
1. Phone number
1. Email address
1. Research interests

Many sites have a "structured API" (for example https://docs.microsoft.com/en-us/linkedin/), which takes a request (e.g. for a person or handle) and returns a data structure (e.g. containing the person's name, contacts, employment history).
But sometimes we have to extract structured information directly from a web page. That is tricky and dangerous because the HTML is structured for display, not for meaning: it can change abruptly and break all your extraction code, and there is no guarantee that every page of interest has the same structure.
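
To make the contrast concrete, here is a minimal sketch of what consuming a structured API response can look like. The JSON payload and its field names are invented for illustration; they are not the actual format of the LinkedIn API or any other service.

```python
import json

# Hypothetical response body from a structured API (all field names are made up)
payload = '{"name": "Pat Example", "email": "pat@example.edu", "positions": [{"title": "Professor", "start": 2001}]}'

person = json.loads(payload)  # JSON text -> Python dict

# Fields are addressed by name, independent of how any web page displays them
print(person["name"])
print(person["positions"][0]["title"])
```

With HTML there is no such contract: the same fields have to be located through whatever tags and classes the page happens to use for layout.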


#### Service Calls

The first step is getting the HTML source for a page.

We will be making calls to an HTTP server, so we need to talk about requests and responses.  This is useful both for retrieval and because you will be making requests to Solr, which is itself a service.

We will use the Python requests library: http://docs.python-requests.org/en/master/


In [1]:
import requests

In [None]:
url = "https://www.seattleu.edu/scieng/computer-science/faculty-and-staff/dingle-adair.html"
response = requests.get(url)

In [None]:
type(response)

In [None]:
response.status_code

In [None]:
response.headers

In [None]:
response.text
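
The cells above need network access. As a rough offline illustration of the same request/response cycle, the sketch below starts a throwaway HTTP server on the local machine and fetches one page from it using only the standard library. The handler class, the page content, and the function name are all made up for this example. The three values it returns correspond to what requests exposes as `response.status_code`, `response.headers`, and `response.text`.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

PAGE = b"<html><head><title>Hello</title></head><body><h1>Hi</h1></body></html>"


class OnePageHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with the same small HTML page."""

    def do_GET(self):
        self.send_response(200)  # status code
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)  # response body

    def log_message(self, format, *args):
        pass  # suppress per-request logging


def fetch_from_local_server():
    # Port 0 asks the operating system for any free port
    server = HTTPServer(("127.0.0.1", 0), OnePageHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        opener = build_opener(ProxyHandler({}))  # ignore any proxy settings
        url = f"http://127.0.0.1:{server.server_port}/"
        with opener.open(url, timeout=5) as resp:
            return resp.status, dict(resp.headers), resp.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


status, headers, text = fetch_from_local_server()
print(status)
print(headers["Content-Type"])
print(text)
```

In real retrieval code it is worth checking the status code before parsing, since an error page is also HTML and will parse without complaint.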

#### HTML String to Parsed HTML

Beautiful Soup package
* Documentation:  https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* Installation: pip install beautifulsoup4

In [2]:
from bs4 import BeautifulSoup as soup

In [None]:
page = soup(response.text, "html.parser")

In [None]:
type(page)

In [None]:
print(page.title)

In [None]:
print(page.head)

In [None]:
print(page.body)

In [None]:
print(page.text)

In [None]:
print(page.prettify())

In [None]:
## This extracts the name!
page.find('div', {'id': 'zoneA'}).find('h1', {'id': 'pageTitle'}).text

In [None]:
## This extracts the email!!
page.find('div', {'id': 'zoneA'}).\
find('div', {'class': "staffBioPageInfo"}).\
find('p', {'class': 'Email'}).\
find('a').\
text

In [None]:
## This extracts the phone!!
page.find('div', {'id': 'zoneA'}).\
find('div', {'class': "staffBioPageInfo"}).\
find('p', {'class': 'Phone'}).\
text.replace('Phone: ', '')

In [None]:
## This extracts the bio information!!!
page.find('div', {'class': "ExtendedBiography"}).text
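
One caveat about the chained `.find()` calls: `find` returns `None` when nothing matches, so the next call in the chain raises `AttributeError` as soon as one level is missing. A possible way to soften this is sketched below, using Beautiful Soup's `select_one` with CSS selectors. The helper `text_at` and the sample HTML are made up for illustration; the sample only imitates the ids and classes used above and is not the actual page source.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the ids and classes used above;
# it is not the actual page source.
SAMPLE = """
<div id="zoneA">
  <h1 id="pageTitle">Pat Example, Ph.D.</h1>
  <div class="staffBioPageInfo">
    <p class="Email">Email: <a href="mailto:pat@example.edu">pat@example.edu</a></p>
  </div>
</div>
"""


def text_at(parsed, selector):
    """Return the stripped text of the first element matching a CSS selector, or None."""
    node = parsed.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


sample_page = BeautifulSoup(SAMPLE, "html.parser")
name = text_at(sample_page, "#zoneA #pageTitle")
email = text_at(sample_page, "#zoneA .staffBioPageInfo p.Email a")
phone = text_at(sample_page, "#zoneA .staffBioPageInfo p.Phone")  # not present in the sample

print(name, email, phone)
```

A missing field then shows up as `None` in the result instead of stopping the whole extraction.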

### Packaging it Up

In [3]:
handles = ['dingle-adair', 
           'mckee-michael', 
           'hanks-steven', 
           'khadivi-pejman', 
           'leblanc-richard', 
           'koenig-michael',
          'kong-hidy',
          'larson-eric',
          'li-lin',
          'lundeen-kevin',
          'mishra-aditya',
          'obare-james',
          'oh-sheila',
          'reeder-susan',
          'wong-jason',
          'zhu-yingwu-']

In [11]:
from random import choice
r = list(range(1990, 2020))
choice(r)

2006

In [15]:
from random import choice

def choose_joined_year():
    # Pick a random year in 1995-2014 for the 'joined' field
    return choice(range(1995, 2015))

def extract_faculty_info(handle):
    url = f"https://www.seattleu.edu/scieng/computer-science/faculty-and-staff/{handle}.html"
    response = requests.get(url)
    page = soup(response.text, "html.parser")
    zone = page.find('div', {'id': 'zoneA'})
    name = zone.find('h1', {'id': 'pageTitle'}).text
    info = zone.find('div', {'class': "staffBioPageInfo"})
    # Email, phone, and bio may be absent on a page; store None when a field is missing
    email = info.find('p', {'class': 'Email'})
    email = email.find('a').text if email is not None else None
    phone = info.find('p', {'class': 'Phone'})
    phone = phone.text.replace('Phone: ', '') if phone is not None else None
    bio = page.find('div', {'class': "ExtendedBiography"})
    bio = bio.text if bio is not None else None
    return {'name': name, 'email': email, 'phone': phone, 'bio': bio, 'joined': choose_joined_year(), "handle": handle}


In [16]:
for handle in handles:
    print(extract_faculty_info(handle))

{'name': 'Adair Dingle, Ph.D.', 'email': 'dingle@seattleu.edu', 'phone': '(206) 296-5516', 'bio': "Dr. Dingle's Personal Webpage\n\xa0\nTeaching Interests:\n\nData Structures\nFoundations of Computer Science\nObject-Oriented Software Development\nLanguages and Computation\nDesign Patterns and Refactoring\n\n\xa0\nResearch Interests:\nReclaiming Garbage and Education: Java Memory Leaks, Tracking the Design of Objects: Encapsulation Through Polymorphism, The Maintainability Gap, Assessing the Ripple Effect of Language Choice in CS1, Improving C++ Performance Using Temporaries, The Object-Ownership Model: A Case Study for Inheritance and Operator Overloading", 'joined': 2011, 'handle': 'dingle-adair'}
{'name': 'Michael McKee', 'email': 'mckeem@seattleu.edu', 'phone': None, 'bio': '\xa0\nTeaching Interests:\n\nProgramming & Problem Solving\nData Structures And Algorithms\nDatabases\nSoftware Economics\nSoftware Testing\nData Analytics\n\nResearch Interests:\n\nCS Education, Databases, Data

{'name': 'James Obare', 'email': 'obarej@seattleu.edu', 'phone': '(206) 296-2837', 'bio': '\xa0\nTeaching Interests:\n\nIntro to Computer Science\nIntro Computers & Applications\nComp Systems Principles\nCyber Security\nComputer Organization and Architecture\n', 'joined': 1998, 'handle': 'obare-james'}
{'name': 'Sheila Oh', 'email': 'ohsh@seattleu.edu', 'phone': '(206) 296-2164', 'bio': "Professor Oh's Personal Webpage\xa0\n\xa0\nTeaching Interests:\n\nProgramming & Prob Solving\nFoundations of Computer Sci\nFundamentals of Databases\nObject-Oriented Concepts\nDatabase Systems\nData Structures And Algorithms\n", 'joined': 2012, 'handle': 'oh-sheila'}
{'name': 'Susan Reeder', 'email': 'sreeder@seattleu.edu', 'phone': '(206) 296-5508', 'bio': "Professor Reeder's Personal Webpage\n\xa0\nTeaching Interests:\n\nProgramming and Data Types\nProgramming & Prob Solving\nData Structures, Object-Oriented Development\nThe Art of Web Design\n", 'joined': 2005, 'handle': 'reeder-susan'}
{'name': 'Ja
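
The exercise asks for research interests, but the records above keep the whole biography as one string, and some pages have no "Research Interests:" section at all. The sketch below is one possible way to pull the topics out of the `bio` text. The function name is made up, and it assumes topics are separated by commas or newlines, which will split any topic that itself contains a comma.

```python
import re


def research_interests(bio):
    """Return the topics listed after 'Research Interests:', or [] if there are none."""
    if not bio:
        return []
    # \xa0 is a non-breaking space (&nbsp; in the HTML); treat it as a plain space
    text = bio.replace("\xa0", " ")
    _, marker, rest = text.partition("Research Interests:")
    if not marker:
        return []
    # Assumes topics are separated by commas or newlines
    return [topic.strip() for topic in re.split(r"[,\n]", rest) if topic.strip()]


# Shortened excerpts of two 'bio' values from the output above
with_research = "\xa0\nTeaching Interests:\n\nData Structures\n\n\xa0\nResearch Interests:\nThe Maintainability Gap, Improving C++ Performance Using Temporaries"
without_research = "\xa0\nTeaching Interests:\n\nIntro to Computer Science\nCyber Security\n"

print(research_interests(with_research))
print(research_interests(without_research))
```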

#### Serializing / Storing

It is often useful or necessary to store these "documents" before indexing.  Usually this means storing the URL or handle and having it point to the parsed document.  That way a crawler can skip a page that has already been stored.
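
The idea of skipping pages that are already stored can be sketched as follows. The names `load_or_extract` and `fake_extract` are made up for this example, and a temporary directory stands in for the real storage location so the sketch runs offline. In the notebook, the real extraction function would be passed in place of `fake_extract`.

```python
import json
import tempfile
from pathlib import Path


def load_or_extract(handle, extract, directory):
    """Return the stored record for `handle`, calling `extract` only if nothing is stored."""
    path = Path(directory) / f"{handle}.json"
    if path.exists():
        # Already stored: skip fetching and parsing the page
        with path.open() as f:
            return json.load(f)
    record = extract(handle)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        json.dump(record, f)
    return record


# Stand-in for a real extractor; it records how often it is called
calls = []

def fake_extract(handle):
    calls.append(handle)
    return {"handle": handle, "name": "Pat Example"}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_extract("example-pat", fake_extract, tmp)
    second = load_or_extract("example-pat", fake_extract, tmp)  # served from disk

print(first == second, calls)
```

The second call never reaches the extractor, which is the behavior a crawler wants for pages it has already processed.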

Two implementations:
* Quick, easy, and efficient: Python's "pickle" serializer.
* Still quick and easy to use, but leaves us readable text for indexing: write a JSON string.

In [None]:
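import os
import pickle
os.makedirs("stored", exist_ok=True)  # make sure the output directory exists
dingle = extract_faculty_info('dingle-adair')
with open("stored/dingle-adair.p", "wb") as f:
    pickle.dump(dingle, f)
with open("stored/dingle-adair.p", "rb") as f:
    recovered = pickle.load(f)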

In [None]:
type(recovered)

#### Put some aside for next lecture

In [None]:
for handle in handles:
    data = extract_faculty_info(handle)
    print(str(data) + "\n")
    with open(f"stored/{handle}.p", "wb") as f:
        pickle.dump(data, f)

#### Also write out a plain-text version so we can use non-Python tools

In [None]:
import json

data = extract_faculty_info('dingle-adair')
# Forward slash works on every OS; json.dumps writes valid JSON (str(data) would not)
with open("json/dingle-adair.json", "w") as f:
    f.write(json.dumps(data))
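
A note on what goes into a `.json` file: the `str()` form of a Python dict looks similar to JSON but is not JSON, so non-Python tools will generally reject it. The short sketch below shows the difference on a small record whose values are taken from the output earlier in this notebook.

```python
import json

record = {'name': 'Michael McKee', 'phone': None, 'handle': 'mckee-michael'}

as_repr = str(record)         # Python literal syntax: single quotes, None
as_json = json.dumps(record)  # JSON syntax: double quotes, null

print(as_repr)
print(as_json)

# The JSON text round-trips; the str() text is rejected by a JSON parser
assert json.loads(as_json) == record
try:
    json.loads(as_repr)
    parsed = True
except json.JSONDecodeError:
    parsed = False
print(parsed)
```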

In [17]:
import json

for handle in handles:
    data = extract_faculty_info(handle)
    with open(f"json/{handle}.json", "w") as f:
        json.dump(data, f)