# [CDJ] Web Scraping Workbook

## Packages
The [`requests`](https://requests.readthedocs.io/en/latest/) library allows you to download files from the web. You can use it to get information from web pages so that you can save them to files or analyze their data in Python.

After using `requests` to access web data, you'll use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse that information, organized in HTML. `BeautifulSoup` makes querying a tree of tags and their attributes much easier than trying to parse HTML from scratch. You'll need to spend some time looking at the target web page and finding the combination of tag names and classes you're interested in, but `BeautifulSoup` can help access that information once you know what you need.

As usual, we will still need NumPy and Pandas for storing the data we got from the web.

In [1]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np

# Example: College of Engineering Faculty

Today, our goal is to collect information about College of Engineering Faculty: their name, position, office and headshot.

We will use this page: [https://www.engineering.cornell.edu/faculty-directory](https://www.engineering.cornell.edu/faculty-directory)

In [2]:
url = "https://www.engineering.cornell.edu/faculty-directory"

We need to put together clues from the structure of the page and the appearance of elements on the page. Appearance is usually determined by CSS, which is typically attached to elements through attributes such as `class` and `style`. Figuring out how to automatically find values from HTML will involve looking at HTML source, Cmd/Ctrl-F searching for values that you want, and figuring out how to identify styles or containing elements. Modern web pages are long and have lots of complicated elements, many of which do not appear as visible content. Starting from the top of the document and reading through is not recommended.

## Inspect the web page

Take a look at the HTML of the page. You can view this in a couple of different ways:

*   Right click > "View Page Source" - this will give you the entire source code
*   Right click > "Inspect" - this will start an interactive session

As you look, take note of the following:

* What tag encloses the information about each person on the webpage? Is there a unique attribute associated with it?
* What tag encloses each of the individual information fields (e.g., name, email) for a single person?

## Making the request

The following code makes a `GET` request to the web host for the specified URL. HTTP is the protocol used to make web requests.

In [3]:
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

It is good practice to do a confidence check of the status code. 200 is success. Others you have probably seen: 404 is "page not found", 403 is "you do not have access". Codes starting with 3 are often redirects. 500 means an error occurred on the server, often a bug in the server-side code. We included a `User-Agent` header so that the request looks more like it comes from a browser and less like a bot.

In [4]:
print(response.status_code, response.reason)

200 OK
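
As a rough aid for reading status codes, the sketch below maps a code to its standard reason phrase and its broad class, using only Python's standard library. The helper name `describe_status` is made up for this workbook and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return e.g. '404 Not Found (client error)' for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # the code is not a registered HTTP status

    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


for code in (200, 301, 403, 404, 500):
    print(describe_status(code))
```

In the notebook you could call `describe_status(response.status_code)`. `requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.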


You're never going to get the analysis of a web page right the first time, so it's good to save a local copy of the HTML source so we don't need to hit the server again.

In [5]:
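with open("faculty.html", "w", encoding="utf-8") as writer:
  writer.write(response.text)

Here we're immediately reading the file again, but you could split these into two separate notebooks, one for downloading, one for analysis. Passing the same `encoding` when writing and reading keeps accented characters in names intact on every platform.

In [6]:
with open("faculty.html", "r", encoding="utf-8") as reader:
  html_source = reader.read()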
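
The save-then-read pattern can also be wrapped in a small helper that only downloads when no local copy exists yet. This is a minimal sketch, not part of the original notebook: `load_or_fetch` and the stand-in `fake_fetch` are hypothetical names, and the demo writes to a temporary directory so it runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_or_fetch(path: Path, fetch: Callable[[], str]) -> str:
    """Return the cached text at `path`, calling `fetch` only if it is missing."""
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch()
    path.write_text(text, encoding="utf-8")
    return text


# Demo with a stand-in fetch function so no network access is needed.
calls = []


def fake_fetch() -> str:
    calls.append(1)
    return "<html><title>demo</title></html>"


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "faculty.html"
    first = load_or_fetch(cache_file, fake_fetch)   # "downloads" and saves
    second = load_or_fetch(cache_file, fake_fetch)  # reads the saved copy

print(len(calls))  # number of times fake_fetch actually ran
```

In the notebook, the `fetch` argument could be something like `lambda: requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text`.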

There are a lot of things that can go wrong when you are accessing web documents. Get in the habit of constantly adding confidence checks to make sure the state of variables is what you expect it to be.

In [7]:
html_source[:50]

'<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="'
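
Looking at the first characters by eye works, but a confidence check can also be written as assertions that fail loudly. The sketch below is one possible shape for such a check. The function name `check_html` and the sample string are invented for illustration.

```python
def check_html(text: str, must_contain: str) -> None:
    """Raise AssertionError if `text` does not look like the page we expected."""
    assert text.strip(), "the document is empty"
    assert "<html" in text[:500].lower(), "the document does not start like HTML"
    assert must_contain in text, f"expected marker {must_contain!r} not found"


sample = (
    '<!DOCTYPE html>\n<html lang="en"><body>'
    '<article class="person--listing"></article></body></html>'
)
check_html(sample, "person--listing")  # passes silently when all checks hold
```

In the notebook this could be called as `check_html(html_source, "person--listing")` once you know which class name marks the content you need.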

## Parsing the HTML

Here's where we turn the HTML text from a single long string into a searchable tree of tags. `BeautifulSoup` can support different ways of parsing (including XML). Here we'll use an HTML parser.

In [8]:
page = BeautifulSoup(html_source, "html.parser")

Now that we have a structured document we can ask for specific tags. See the Beautiful Soup documentation linked at the top for more details.

In [9]:
page.title

<title>Faculty Directory | Cornell Engineering</title>

We can also find all of the instances of a given tag. Here we find all links and display the first 10.

In [10]:
links = page.find_all("a")
print("there are", len(links), "links on the page")
links[:10]

there are 207 links on the page


[<a class="element-invisible element-focusable" href="#main-content">Skip to main content</a>,
 <a class="fa fa-id-badge is-active" data-drupal-link-system-path="node/32" href="/faculty-directory">Faculty Directory</a>,
 <a class="fa fa-heart-o" data-drupal-link-system-path="node/314" href="/giving">Giving</a>,
 <a data-drupal-link-system-path="node/20998" href="/industry-partners">Industry Partners</a>,
 <a class="fa fa-list-ul" data-drupal-link-system-path="node/216" href="/programs-departments">Programs &amp; Departments</a>,
 <a class="fa fa-newspaper-o" data-drupal-link-system-path="node/20714" href="/cornell-engineering-news">News</a>,
 <a class="fa fa-check-circle" data-drupal-link-system-path="node/20" href="/about">About Us</a>,
 <a href="https://www.cornell.edu">
 <img alt="Cornell University" src="/themes/custom/cornell/assets/img/cornell_seal_simple_web_b31b1b.svg"/>
 </a>,
 <a href="/">
 <img alt="Cornell Engineering" src="/themes/custom/cornell/assets/img/CornellEngineeri

But we can be more specific than just finding all instances of a type of tag.

Inspecting the HTML, we find that each person is wrapped in an ``<article>`` tag with a class called ``person--listing``. Thus, we can obtain all instances of `article` with this specific class:

In [11]:
articles = page.find_all("article", {"class":"person--listing"})
print("there are", len(articles), "articles on the page")

there are 24 articles on the page
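
To see how the class filter behaves without touching the live site, here is a small sketch on a made-up HTML snippet (the snippet and the names in it are invented for illustration). It also shows the `class_` keyword and `select`, Beautiful Soup's CSS-selector method, as alternative ways to express the same query.

```python
from bs4 import BeautifulSoup

snippet = """
<div>
  <article class="person--listing faculty--listing"><h2>Ada Example</h2></article>
  <article class="person--listing"><h2>Bob Example</h2></article>
  <article class="news--listing"><h2>Not a person</h2></article>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

# Each query matches any <article> whose class list contains
# "person--listing", even when the tag carries other classes too.
by_dict = soup.find_all("article", {"class": "person--listing"})
by_keyword = soup.find_all("article", class_="person--listing")
by_css = soup.select("article.person--listing")

print(len(by_dict), len(by_keyword), len(by_css))
print([a.h2.text for a in by_dict])
```

All three forms return the same two tags here. Which one to use is mostly a matter of taste.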


Observe the first person:

In [12]:
articles[0]

<article aria-label="About Nicholas L. Abbott" class="person--listing faculty--listing"><div class="row"><div class="faculty-pic columns small-12 medium-12 large-12"><a class="person__portrait" href="https://www.cheme.cornell.edu/faculty-directory/nicholas-l-abbott"><img alt="Nick Abbott" height="560" loading="lazy" src="/sites/default/files/styles/directory_square/public/content/faculty/image/Abbott_REIS_D_560.jpg?itok=IqeF8iI9" width="560"/></a></div><div class="faculty-info columns small-12 medium-12 large-12"><h2 class="h3 person__name"><a href="https://www.cheme.cornell.edu/faculty-directory/nicholas-l-abbott"><span>Nicholas L. Abbott</span></a></h2><div class="person__positions"><div class="person__position">
            Tisch University Professor
          </div></div><div class="person__office"><div class="person__department">
                                          Smith School of Chemical and Biomolecular Engineering
      
                          </div><div class="person

Now we can inspect this piece of HTML and try to figure out how to get to the information we want.

For example, if we want to get the name of the person, we can see that it is wrapped in an `<h2>` tag with the class `person__name`. We can thus access it:

In [13]:
articles[0].h2

<h2 class="h3 person__name"><a href="https://www.cheme.cornell.edu/faculty-directory/nicholas-l-abbott"><span>Nicholas L. Abbott</span></a></h2>

This gives us the entire tag. But we want the text:

In [14]:
articles[0].h2.text

'Nicholas L. Abbott'

At the same time, we get a link to the person's page in an ``<a>`` tag within this ``<h2>``. This could also be useful. Notice that the link is the value of the `'href'` attribute. We can access attribute values with square brackets: ``tag['attribute name']``.

In [15]:
articles[0].h2.a['href']

'https://www.cheme.cornell.edu/faculty-directory/nicholas-l-abbott'
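
The same square-bracket access applies to the headshot. In the first article shown earlier, the `<img>` tag's `src` attribute is a path relative to the site root rather than a full URL. Below is a minimal sketch of turning such a path into an absolute URL with the standard library's `urljoin`, using the `src` value copied from that output.

```python
from urllib.parse import urljoin

page_url = "https://www.engineering.cornell.edu/faculty-directory"

# Value of the src attribute from the first article's <img> tag;
# in the notebook this would come from something like articles[0].img['src'].
src = (
    "/sites/default/files/styles/directory_square/public/content/"
    "faculty/image/Abbott_REIS_D_560.jpg?itok=IqeF8iI9"
)

headshot_url = urljoin(page_url, src)
print(headshot_url)

# Links that are already absolute are returned unchanged.
profile = "https://www.cheme.cornell.edu/faculty-directory/nicholas-l-abbott"
print(urljoin(page_url, profile))
```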

We also want the position of the person. Again, by inspecting the HTML, we can see that this information is contained within a ``<div>`` tag with the class `person__position`.

In [16]:
articles[0].find("div", {"class":"person__position"}).text

'\n            Tisch University Professor\n          '

Due to the way web pages are formatted, we often get this kind of stray whitespace. No worries: we can use ordinary Python string methods like [strip()](https://www.w3schools.com/python/ref_string_strip.asp) to remove the leading and trailing whitespace.

In [17]:
articles[0].find("div", {"class":"person__position"}).text.strip()

'Tisch University Professor'
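
Note that `strip()` only trims the two ends of a string. Scraped text can also contain doubled spaces in the middle, for example between a first and last name. A small sketch of one way to handle both cases follows. The helper name `clean_text` is hypothetical, and the inputs are illustrative strings.

```python
def clean_text(text: str) -> str:
    """Trim the ends and collapse every internal run of whitespace to one space."""
    return " ".join(text.split())


raw_position = "\n            Tisch University Professor\n          "
raw_name = "Rachit  Agarwal"  # two spaces between first and last name

print(repr(raw_name.strip()))         # the inner double space survives strip()
print(repr(clean_text(raw_name)))
print(repr(clean_text(raw_position)))
```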

We could follow the same process to get the department, office location, phone and email:

In [18]:
print("Department:", articles[0].find("div", {"class":"person__department"}).text.strip())
print("Office Location:", articles[0].find("div", {"class":"person__location"}).text.strip())
print("Phone:", articles[0].find("div", {"class":"person__phone"}).text.strip())
print("Email:", articles[0].find("div", {"class":"person__email"}).text.strip())

Department: Smith School of Chemical and Biomolecular Engineering
Office Location: 360 Olin Hall
Phone: 607/255-3601
Email: nla34@cornell.edu


## Automate the Process

Now that we know how to access the name, link, office and other information about one person, we can automate this process to get the same data for every person listed on the faculty directory.

This can be done by first creating an empty list for each data field we want to gather, and then populating these lists by appending to them in a for loop over each person.

However, if you call `.text` directly on every field, you will get an ``AttributeError: 'NoneType' object has no attribute 'text'``.

Why? Some people do not have a phone number listed, so `find` returns `None` for them, and `None` has no `.text` attribute. The solution is to append a NaN value when this error occurs, using [try-except](https://www.w3schools.com/python/python_try_except.asp):

In [19]:
name_l = []
link_l = []
position_l = []
department_l = []
office_l = []
phone_l = []
email_l = []

for person in articles:
    name_l.append(person.h2.text)
    link_l.append(person.h2.a['href'])
    position_l.append(person.find("div", {"class":"person__position"}).text.strip())
    department_l.append(person.find("div", {"class":"person__department"}).text.strip())
    office_l.append(person.find("div", {"class":"person__location"}).text.strip())
    email_l.append(person.find("div", {"class":"person__email"}).text.strip())

    try:
        phone_l.append(person.find("div", {"class":"person__phone"}).text.strip())
    except AttributeError:
        # find() returned None because no phone number is listed
        phone_l.append(np.nan)
        
name_l

['Nicholas L. Abbott',
 'Mohamed Abdelfattah',
 'Geoffrey Abers',
 'Jayadev Acharya',
 'Hunter Adams',
 'Steven Graham Adie',
 'Khurram Khan Afridi',
 'Rachit  Agarwal',
 'Christopher A. Alabi',
 'John D. Albertson',
 'David H. Albonesi',
 'Warren Douglas Allmon',
 'Lorenzo  Alvisi',
 'Nelly Andarawis-Puri',
 'C. Lindsay Anderson',
 'James Francis Antaki',
 'Alyssa B. Apsel',
 'Lynden A. Archer',
 'Shivaun D. Archer',
 'Chloé Arson',
 'Yoav Artzi',
 'Toby R. Ault',
 'C Thomas Avedisian',
 'Victoria Averbukh']
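
An alternative to wrapping each lookup in try-except is to check the result of `find` for `None` before reading `.text`. The sketch below shows that idea on an invented listing. The helper name `field_text` and the HTML snippet are assumptions made for illustration, not code from the original notebook.

```python
from bs4 import BeautifulSoup


def field_text(person, css_class, default=None):
    """Return the stripped text of the first <div> with `css_class`, or `default`."""
    tag = person.find("div", {"class": css_class})
    if tag is None:  # find() returns None when nothing matches
        return default
    return tag.text.strip()


# Invented listing with an email but no phone number.
snippet = """
<article class="person--listing">
  <h2><a href="https://example.com/ada"><span>Ada Example</span></a></h2>
  <div class="person__email"> ada@example.com </div>
</article>
"""
person = BeautifulSoup(snippet, "html.parser").article

print(field_text(person, "person__email"))
print(field_text(person, "person__phone"))
```

To match the notebook's behavior, you could pass `default=np.nan` so that missing fields end up as NaN in the dataframe.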

Now that we have all of our data in individual lists, we can put them into a dataframe.

In [21]:
pd.DataFrame({"Name":name_l, "Link":link_l, "Position":position_l, 
              "Department":department_l, "Office":office_l, 
              "Email":email_l, "Phone":phone_l})

| | Name | Link | Position | Department | Office | Email | Phone |
|---|---|---|---|---|---|---|---|
| 0 | Nicholas L. Abbott | https://www.cheme.cornell.edu/faculty-director... | Tisch University Professor | Smith School of Chemical and Biomolecular Engi... | 360 Olin Hall | nla34@cornell.edu | 607/255-3601 |
| 1 | Mohamed Abdelfattah | https://www.engineering.cornell.edu/faculty-di... | Assistant Professor | Electrical and Computer Engineering | Cornell Tech | mohamed@cornell.edu | NaN |
| 2 | Geoffrey Abers | https://www.engineering.cornell.edu/faculty-di... | Chair of Earth and Atmospheric Sciences | Earth and Atmospheric Sciences | 2160B Snee Hall/4126 Snee Hall | abers@cornell.edu | 607/255-3879 |
| 3 | Jayadev Acharya | https://www.engineering.cornell.edu/faculty-di... | Associate Professor | Electrical and Computer Engineering | Frank H T Rhodes Hall, Room 382 | acharya@cornell.edu | NaN |
| 4 | Hunter Adams | https://www.engineering.cornell.edu/faculty-di... | Lecturer | Electrical and Computer Engineering | Phillips Hall, Room 208 | vha3@cornell.edu | 717/304-0047 |
| 5 | Steven Graham Adie | https://www.engineering.cornell.edu/faculty-di... | Associate Professor, Associate Director and Di... | Meinig School of Biomedical Engineering | Weill Hall, Room 113 | sga42@cornell.edu | 607/2552656 |
| 6 | Khurram Khan Afridi | https://www.engineering.cornell.edu/faculty-di... | Associate Professor | Electrical and Computer Engineering | Phillips Hall, Room 420 | kka34@cornell.edu | NaN |
| 7 | Rachit Agarwal | https://www.engineering.cornell.edu/faculty-di... | Assistant Professor | Computer Science | Gates Hall 411C | ra625@cornell.edu | 607-255-4280 |
| 8 | Christopher A. Alabi | https://www.cheme.cornell.edu/faculty-director... | Associate Professor | Smith School of Chemical and Biomolecular Engi... | 356A Olin Hall | caa238@cornell.edu | 607/255-7889 |
| 9 | John D. Albertson | https://www.engineering.cornell.edu/faculty-di... | Professor | Civil and Environmental Engineering | Hollister Hall, Room 113 | albertson@cornell.edu | 607/255-9671 |


Notice that we only got 24 records, so we did not get all the Engineering faculty from this page. Why?

Look at the website again: the directory is paginated by initial letter, and we choose a page by selecting a letter. If we look at the URL after selecting a letter, we can see a pattern: a suffix of the form `?letter=X` is added to the URL. For example, all faculty with the initial "A" are at https://www.engineering.cornell.edu/faculty-directory?letter=A.

We can leverage this pattern to obtain all faculty by varying the suffix.
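
As a quick sketch of the URL pattern just described, the snippet below builds one directory URL per letter using only the standard library. The variable names are illustrative.

```python
import string
from urllib.parse import urlencode

base_url = "https://www.engineering.cornell.edu/faculty-directory"

# One URL per initial letter, A through Z.
letter_urls = [
    f"{base_url}?{urlencode({'letter': letter})}"
    for letter in string.ascii_uppercase
]

print(len(letter_urls))
print(letter_urls[0])
print(letter_urls[-1])
```

`requests` can also build the query string for you: `requests.get(base_url, params={"letter": letter})` requests the same URL.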

### Pagination

In [22]:
name_l = []
link_l = []
position_l = []
department_l = []
office_l = []
phone_l = []
email_l = []

for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    url = "https://www.engineering.cornell.edu/faculty-directory?letter=" + letter
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    page = BeautifulSoup(response.text, "html.parser")
    articles = page.find_all("article", {"class":"person--listing"})

    for person in articles:
        name_l.append(person.h2.text)
        link_l.append(person.h2.a['href'])
        position_l.append(person.find("div", {"class":"person__position"}).text.strip())
        department_l.append(person.find("div", {"class":"person__department"}).text.strip())

        # Office, email and phone are missing for some people: find() then
        # returns None, and reading .text on None raises AttributeError.
        try:
            office_l.append(person.find("div", {"class":"person__location"}).text.strip())
        except AttributeError:
            office_l.append(np.nan)

        try:
            email_l.append(person.find("div", {"class":"person__email"}).text.strip())
        except AttributeError:
            email_l.append(np.nan)

        try:
            phone_l.append(person.find("div", {"class":"person__phone"}).text.strip())
        except AttributeError:
            phone_l.append(np.nan)
        
pd.DataFrame({"Name":name_l, "Link":link_l, "Position":position_l, 
              "Department":department_l, "Office":office_l, 
              "Email":email_l, "Phone":phone_l})

| | Name | Link | Position | Department | Office | Email | Phone |
|---|---|---|---|---|---|---|---|
| 0 | Nicholas L. Abbott | https://www.cheme.cornell.edu/faculty-director... | Tisch University Professor | Smith School of Chemical and Biomolecular Engi... | 360 Olin Hall | nla34@cornell.edu | 607/255-3601 |
| 1 | Mohamed Abdelfattah | https://www.engineering.cornell.edu/faculty-di... | Assistant Professor | Electrical and Computer Engineering | Cornell Tech | mohamed@cornell.edu | NaN |
| 2 | Geoffrey Abers | https://www.engineering.cornell.edu/faculty-di... | Chair of Earth and Atmospheric Sciences | Earth and Atmospheric Sciences | 2160B Snee Hall/4126 Snee Hall | abers@cornell.edu | 607/255-3879 |
| 3 | Jayadev Acharya | https://www.engineering.cornell.edu/faculty-di... | Associate Professor | Electrical and Computer Engineering | Frank H T Rhodes Hall, Room 382 | acharya@cornell.edu | NaN |
| 4 | Hunter Adams | https://www.engineering.cornell.edu/faculty-di... | Lecturer | Electrical and Computer Engineering | Phillips Hall, Room 208 | vha3@cornell.edu | 717/304-0047 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 351 | Lenan Zhang | https://www.engineering.cornell.edu/faculty-di... | Assistant Professor (Starting in July 2024) | Sibley School of Mechanical and Aerospace Engi... | NaN | lzhang@cornell.edu | NaN |
| 352 | Zhiru Zhang | https://www.engineering.cornell.edu/faculty-di... | Associate Professor | Electrical and Computer Engineering | Frank H T Rhodes Hall, Room 320 | zhiruz@cornell.edu | 607/255-5954 |
| 353 | Qing Zhao | https://www.engineering.cornell.edu/faculty-di... | Joseph C. Ford Professor of Engineering | Electrical and Computer Engineering | Frank H T Rhodes Hall, Room 325 | qz16@cornell.edu | NaN |
| 354 | Yu Zhong | https://www.mse.cornell.edu/faculty-directory/... | Assistant Professor | Materials Science and Engineering | 229 Bard Hall | yz2833@cornell.edu | NaN |
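
Keeping seven parallel lists in step is easy to get wrong: if one `append` is skipped, the columns no longer line up. One alternative, sketched here with made-up records rather than the scraped data, is to build one dictionary per person and collect those in a single list.

```python
# Invented records standing in for what the loop would extract per person.
extracted = [
    ("Ada Example", "Professor", "ada@example.com", "607/555-0100"),
    ("Bob Example", "Lecturer", "bob@example.com", None),  # no phone listed
]

rows = []
for name, position, email, phone in extracted:
    # One dict per person keeps every field of a record together.
    rows.append({
        "Name": name,
        "Position": position,
        "Email": email,
        "Phone": phone,
    })

print(len(rows))
print(rows[1]["Phone"])
```

In the notebook, `pd.DataFrame(rows)` would turn this list of dictionaries into a table with one column per key.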


# Exercise: Gimme! Coffee

Time for some exercise!

Please scrape all the coffee that is on sale on Gimme! Coffee's website: [https://gimmecoffee.com/coffee/](https://gimmecoffee.com/coffee/)

First, make a request and check that the response code is what you expect.

In [None]:
# Your Code: Feel free to add new cells

Save the html locally as "gimme.html" and read it in again.

In [None]:
# Your Code

Construct a `BeautifulSoup` object from the HTML you just read.

In [None]:
# Your Code

Try to obtain all the elements that represent a coffee item using `.find_all()`.

In [None]:
# Your Code

Obtain the price of the first coffee item

In [None]:
# Your Code

Obtain the link of the first coffee item. E.g., https://gimmecoffee.com/mt-pleasant-coffee-pods-fairtrade-organic/

In [None]:
# Your Code

Now, automate the process over all the coffee items and put the name, description, link, price in separate lists.

In [None]:
# Your Code

Construct a dataframe to store the data and show the dataframe.

In [None]:
# Your Code