# Scraping Intro Homework: Columbia J-School Data Faculty

## Stephanie Andrews

In this assignment, we'll practice our scraping skills by examining the Columbia Journalism School's listing of data faculty: https://journalism.columbia.edu/faculty?expertise=116

### 0) Setup

Import `requests`, `BeautifulSoup`, and `pandas`. Remember that even though we installed the library as `pip install beautifulsoup4`, the import statement we practiced is slightly different.

In [1]:
import requests
import pandas as pd

In [2]:
from bs4 import BeautifulSoup

### 1) Grab the HTML for the webpage linked above

Use `requests` to get the HTML, assigning the response to a variable.

In [3]:
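# requests.get returns a Response object; its raw HTML bytes live in .content
html_doc = requests.get("https://journalism.columbia.edu/faculty?expertise=116")
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc.content, "html.parser")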

### 2) Use `BeautifulSoup` to convert the HTML into its DOM representation

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <!-- Google tag (gtag.js) -->
  <script async="" src="https://www.googletagmanager.com/gtag/js?id=G-KQW8XM5VEJ">
  </script>
  <script>
   window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'G-KQW8XM5VEJ');
  </script>
  <!-- Anti-flicker snippet (recommended) INC1425337 -->
  <style>
   .async-hide { opacity: 0 !important}
  </style>
  <script>
   (function(a,s,y,n,c,h,i,d,e){s.className+=' '+y;h.start=1*new Date;
h.end=i=function(){s.className=s.className.replace(RegExp(' ?'+y),'')};
(a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;
})(window,document.documentElement,'async-hide','dataLayer',4000,
{'GTM-MV2DS4J':true});
  </script>
  <!-- Modified Analytics tracking code with Optimize plugin INC1425337 -->
 </head>
 <body>
  <p>
   <script>
    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    

### 3) Use `.select(...)` to select all elements representing a faculty member

Assign the resulting elements to a variable named `faculty_els`.

You'll want to use "View Source" or open the Element Inspector to figure out which elements to target.

Note: An element's `class` attribute can contain *multiple* classes, separated by spaces. For example: `<div class="potato hamburger">Hello</div>` has two classes, `potato` and `hamburger`. 

A CSS selector for *either* class will match that element, so both `soup.select(".potato")` and `soup.select(".hamburger")` return it.
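To make the note about multiple classes concrete, here is a minimal sketch that parses a tiny hand-written snippet rather than the live page. The second `div` and the variable names (`demo_soup`, `potato_els`, and so on) are assumptions added for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first div is the example from the note,
# the second div is added here only for contrast.
html = '<div class="potato hamburger">Hello</div><div class="potato">Bye</div>'
demo_soup = BeautifulSoup(html, "html.parser")

potato_els = demo_soup.select(".potato")          # matches both divs
hamburger_els = demo_soup.select(".hamburger")    # matches only the first div
both_els = demo_soup.select(".potato.hamburger")  # requires both classes at once

print(len(potato_els), len(hamburger_els), len(both_els))
# BeautifulSoup exposes the class attribute as a list of class names
print(potato_els[0]["class"])
```

Chaining the class names with no space between them, as in `.potato.hamburger`, narrows the match to elements that carry every listed class.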

In [5]:
faculty_els = soup.select(".faculty-bio")
faculty_els

[<article class="node node-faculty-bio node-teaser teaser faculty-bio" id="node-1596">
 <div class="faculty-photo">
 <a href="/faculty/denise-ajiri"><img alt="" class="img-responsive" height="269" src="https://journalism.columbia.edu/files/soj/styles/faculty_photo/public/content/image/2022/31/f-88-6-13176708_01wgebpz_denise.jpg?itok=k-ND0dPI" width="269"/></a> </div>
 <h2 class="title regular"><a class="about-link" href="/faculty/denise-ajiri">Denise Ajiri</a></h2>
 <div class="sub-title"><p>Adjunct Assistant Professor</p>
 </div>
 </article>,
 <article class="node node-faculty-bio node-teaser teaser faculty-bio" id="node-1156">
 <div class="faculty-photo">
 <a href="/faculty/andrea-fuller"><img alt="" class="img-responsive" height="269" src="https://journalism.columbia.edu/files/soj/styles/faculty_photo/public/content/image/2019/04/andrea-fuller.jpg?itok=o-b7JFxn" width="269"/></a> </div>
 <h2 class="title regular"><a class="about-link" href="/faculty/andrea-fuller">Andrea Fuller</a><

### 4) Count the number of matching elements, using `len`

Does it match the number of faculty you see on the page? (It should.)

In [6]:
len(faculty_els)

8

### 5) For each faculty member, print their name, title, and faculty page URL

You'll want to construct a `for` loop. In each iteration of the loop, print out something that looks like this:

```
Denise Ajiri's title is 'Adjunct Assistant Professor'. You can find more information about them @ /faculty/denise-ajiri
---
```

You'll note that the "href" is not a complete URL, but rather a "[relative path](https://www.w3schools.com/html/html_filepaths.asp)". Don't worry too much about that for now, although you're welcome to try "solving" that part.
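One possible way to "solve" the relative path is the standard library's `urljoin`, which resolves a relative `href` against the URL of the page it came from. This is only a sketch; the variable names are illustrative and not part of the assignment.

```python
from urllib.parse import urljoin

# The page we scraped and one relative href taken from the example output
page_url = "https://journalism.columbia.edu/faculty?expertise=116"
relative_href = "/faculty/denise-ajiri"

# urljoin keeps the scheme and host of page_url and swaps in the new path
full_url = urljoin(page_url, relative_href)
print(full_url)
```

Because the `href` starts with `/`, it replaces the whole path and query string of `page_url` instead of being appended to them.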

In [7]:
for f in faculty_els:
    title = f.select_one(".sub-title p").text
    name = f.select_one(".title .about-link").text
    slug = f.select_one(".title .about-link")["href"]

    print(f"{name}'s title is '{title}'.",
          f"You can find more information about them @ {slug}")
    print("---")


Denise Ajiri's title is 'Adjunct Assistant Professor'. You can find more information about them @ /faculty/denise-ajiri
---
Andrea Fuller's title is 'Adjunct Faculty'. You can find more information about them @ /faculty/andrea-fuller
---
Robert Gebeloff's title is 'Adjunct Faculty'. You can find more information about them @ /faculty/robert-gebeloff
---
Mark Hansen's title is 'David and Helen Gurley Brown Professor of Journalism and Innovation; Director, David and Helen Gurley Brown Institute of Media Innovation'. You can find more information about them @ /faculty/mark-hansen
---
Tom  Meagher's title is 'Adjunct Faculty'. You can find more information about them @ /faculty/tom-meagher
---
Dhrumil Mehta's title is 'Associate Professor in Data Journalism; Deputy Director of the Tow Center for Digital Journalism'. You can find more information about them @ /faculty/dhrumil-mehta
---
Matt Rocheleau's title is 'Adjunct Faculty'. You can find more information about them @ /faculty/matt-roch
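The output above prints `Tom  Meagher` with two spaces, and the loop would raise an error if a card lacked one of the selected elements, since `select_one` returns `None` when nothing matches. The sketch below shows one way to guard against both. It runs on hand-written markup modeled on the faculty cards; the second card, the `clean_text` helper, and the `/faculty/example-person` path are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the faculty cards; the second card
# deliberately has no sub-title to show the missing-element case.
html = """
<article class="faculty-bio">
  <h2 class="title regular"><a class="about-link" href="/faculty/tom-meagher">Tom  Meagher</a></h2>
  <div class="sub-title"><p>Adjunct Faculty</p></div>
</article>
<article class="faculty-bio">
  <h2 class="title regular"><a class="about-link" href="/faculty/example-person">Example Person</a></h2>
</article>
"""
demo_soup = BeautifulSoup(html, "html.parser")


def clean_text(el):
    """Return the element's text with whitespace runs collapsed, or None if el is None."""
    if el is None:
        return None
    return " ".join(el.get_text().split())


rows = []
for card in demo_soup.select(".faculty-bio"):
    link = card.select_one(".title .about-link")
    rows.append({
        "name": clean_text(link),
        "title": clean_text(card.select_one(".sub-title p")),
        "href": link["href"] if link is not None else None,
    })

print(rows)
```

Splitting on whitespace and rejoining with single spaces is a simple normalization; whether to apply it depends on whether you want names exactly as published or in a tidied form.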

### 6) Now, let's do the same thing, but storing the info in a `pandas` `DataFrame`

Specifically, a `DataFrame` with the columns `name`, `title`, `href`.

In [8]:
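all_faculty_objs = []
for f in faculty_els:
    # Look up the name link once; it holds both the name text and the href
    link = f.select_one(".title .about-link")
    # Order matches the `columns` list passed to pd.DataFrame below
    faculty_obj = [f.select_one(".sub-title p").text,
                   link.text,
                   link["href"]]
    all_faculty_objs.append(faculty_obj)

# Build the DataFrame, naming the columns in the same order as each row
df = pd.DataFrame(all_faculty_objs, columns=["title", "name", "href"])
df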

| | title | name | href |
|---|---|---|---|
| 0 | Adjunct Assistant Professor | Denise Ajiri | /faculty/denise-ajiri |
| 1 | Adjunct Faculty | Andrea Fuller | /faculty/andrea-fuller |
| 2 | Adjunct Faculty | Robert Gebeloff | /faculty/robert-gebeloff |
| 3 | David and Helen Gurley Brown Professor of Jour... | Mark Hansen | /faculty/mark-hansen |
| 4 | Adjunct Faculty | Tom Meagher | /faculty/tom-meagher |
| 5 | Associate Professor in Data Journalism; Deputy... | Dhrumil Mehta | /faculty/dhrumil-mehta |
| 6 | Adjunct Faculty | Matt Rocheleau | /faculty/matt-rocheleau |
| 7 | John S. and James L. Knight Professor of Profe... | Giannina Segnini | /faculty/giannina-segnini |
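As an alternative to a list of lists, a `DataFrame` can also be built from a list of dictionaries, in which case the keys become the column names and no separate `columns` argument is needed. The sketch below uses two records copied from the output above; `records` and `demo_df` are illustrative names, not part of the assignment.

```python
import pandas as pd

# Two records copied from the printed output, kept short for illustration
records = [
    {"name": "Denise Ajiri", "title": "Adjunct Assistant Professor", "href": "/faculty/denise-ajiri"},
    {"name": "Andrea Fuller", "title": "Adjunct Faculty", "href": "/faculty/andrea-fuller"},
]

# Dictionary keys become column names, in the order they first appear
demo_df = pd.DataFrame(records)
print(demo_df.columns.tolist())
print(demo_df)
```

With this shape, each value is labeled by its key, which makes it harder to swap `name` and `title` by accident.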


### 7) Using that `DataFrame`, calculate how many are "Adjunct Faculty"

In [9]:
adjunct_faculty = df[df["title"] == "Adjunct Faculty"]
adjunct_faculty_str = ", ".join(adjunct_faculty["name"])

In [10]:
print(f"There are {len(adjunct_faculty)} Adjunct Faculty members:")
print(f"{adjunct_faculty_str}.")

There are 4 Adjunct Faculty members:
Andrea Fuller, Robert Gebeloff, Tom  Meagher, Matt Rocheleau.
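Note that the exact comparison above counts only titles equal to "Adjunct Faculty", so "Adjunct Assistant Professor" is left out. The sketch below contrasts an exact match with a substring match on a four-row subset of the data; `demo_df` and the other variable names are illustrative, and which count is "right" depends on how you read the question.

```python
import pandas as pd

# A subset of the scraped rows, using titles as printed in step 5
demo_df = pd.DataFrame({
    "name": ["Denise Ajiri", "Andrea Fuller", "Robert Gebeloff", "Dhrumil Mehta"],
    "title": [
        "Adjunct Assistant Professor",
        "Adjunct Faculty",
        "Adjunct Faculty",
        "Associate Professor in Data Journalism; Deputy Director of the Tow Center for Digital Journalism",
    ],
})

# Exact match: only rows whose title is precisely "Adjunct Faculty"
exact_count = int((demo_df["title"] == "Adjunct Faculty").sum())

# Substring match: any title containing the word "Adjunct"
adjunct_any_count = int(demo_df["title"].str.contains("Adjunct").sum())

# value_counts tallies every distinct title in one call
title_counts = demo_df["title"].value_counts()

print(exact_count, adjunct_any_count)
print(title_counts)
```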
