# Scraping Intro Homework: Columbia J-School Data Faculty

In this assignment, we'll practice our scraping skills by examining the Columbia Journalism School's listing of data faculty: https://journalism.columbia.edu/faculty?expertise=116

### 0) Setup

Import `requests`, `BeautifulSoup`, and `pandas`. Remember that even though we installed the library with `pip install beautifulsoup4`, the import statement we practiced uses a different package name, `bs4`.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### 1) Grab the HTML for the webpage linked above

Use `requests` to get the HTML, assigning it to a variable.

In [2]:
columbia_html = requests.get("https://journalism.columbia.edu/faculty?expertise=116").text
columbia_html

'<!DOCTYPE html>\n<html lang="en" dir="ltr">\n<head>\n\t\n<!-- Google tag (gtag.js) -->\n<script async src="https://www.googletagmanager.com/gtag/js?id=G-KQW8XM5VEJ"></script>\n<script>\n  window.dataLayer = window.dataLayer || [];\n  function gtag(){dataLayer.push(arguments);}\n  gtag(\'js\', new Date());\n\n  gtag(\'config\', \'G-KQW8XM5VEJ\');\n</script>\n\n<!-- Anti-flicker snippet (recommended) INC1425337\xa0-->\n<style>.async-hide { opacity: 0 !important} </style>\n<script>(function(a,s,y,n,c,h,i,d,e){s.className+=\' \'+y;h.start=1*new Date;\nh.end=i=function(){s.className=s.className.replace(RegExp(\' ?\'+y),\'\')};\n(a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;\n})(window,document.documentElement,\'async-hide\',\'dataLayer\',4000,\n{\'GTM-MV2DS4J\':true});</script>\n<!-- Modified Analytics tracking code with Optimize plugin INC1425337 -->\n\xa0 \xa0 <script>\n\xa0 \xa0 (function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){\
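
The `?expertise=116` part of the URL is a query string. As a minimal sketch (the name `BASE_URL` is an assumption for illustration, not part of the assignment), `requests` can assemble that query string from a `params` dictionary. The example below only builds the request and does not send it:

```python
import requests

# Hypothetical constant for illustration: the page URL without its query string.
BASE_URL = "https://journalism.columbia.edu/faculty"

# Build (but do not send) the request, letting requests assemble the query string.
prepared = requests.Request("GET", BASE_URL, params={"expertise": 116}).prepare()
print(prepared.url)
```

To actually fetch the page this way, the same `params` dictionary can be passed to `requests.get(BASE_URL, params={"expertise": 116}, timeout=10)`, and calling `.raise_for_status()` on the response raises an error on a 4xx or 5xx reply instead of silently returning an error page.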

### 2) Use `BeautifulSoup` to convert the HTML into its DOM representation

In [5]:
columbia_soup = BeautifulSoup(columbia_html)
columbia_soup

<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<!-- Google tag (gtag.js) -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-KQW8XM5VEJ"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'G-KQW8XM5VEJ');
</script>
<!-- Anti-flicker snippet (recommended) INC1425337 -->
<style>.async-hide { opacity: 0 !important} </style>
<script>(function(a,s,y,n,c,h,i,d,e){s.className+=' '+y;h.start=1*new Date;
h.end=i=function(){s.className=s.className.replace(RegExp(' ?'+y),'')};
(a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;
})(window,document.documentElement,'async-hide','dataLayer',4000,
{'GTM-MV2DS4J':true});</script>
<!-- Modified Analytics tracking code with Optimize plugin INC1425337 --></head><body><p>
    <script>
    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    (i[r].q=i[r].q||[]).push(arguments)},i[r
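
The cell above lets `BeautifulSoup` choose a parser. As a small sketch on a hand-written page (the `sample_html` string is made up for illustration and is not the Columbia HTML), the parser can also be named explicitly, which keeps results consistent across machines with different parsers installed:

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used only for illustration.
sample_html = (
    "<html><head><title>Faculty</title></head>"
    "<body><h2><a href='/faculty/denise-ajiri'>Denise Ajiri</a></h2></body></html>"
)

# "html.parser" is the parser that ships with Python's standard library.
soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.text   # text inside the <title> tag
first_link = soup.find("a")    # first <a> element in the document
print(page_title)
print(first_link.text, first_link["href"])
```

Different parsers can repair malformed HTML in slightly different ways, so the printed soup may vary a little depending on which one is used.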

### 3) Use `.select(...)` to select all elements representing a faculty member

Assign the resulting elements to a variable named `faculty_els`.

You'll want to use "View Source" or pop open the Element Inspector to figure out which elements to target.

Note: An element's `class` attribute can contain *multiple* classes, separated by spaces. For example: `<div class="potato hamburger">Hello</div>` has two classes, `potato` and `hamburger`. 

A CSS selector for *either* class, `soup.select(".potato")` *or* `soup.select(".hamburger")`, will match that element.
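
The note above can be checked on a tiny snippet. This is a sketch on made-up HTML (the second `<div>` is an assumption added for contrast), not on the Columbia page:

```python
from bs4 import BeautifulSoup

# The first div is the example from the note; the second is added for contrast.
sample_html = (
    '<div class="potato hamburger">Hello</div>'
    '<div class="potato">Goodbye</div>'
)
soup = BeautifulSoup(sample_html, "html.parser")

potato_els = soup.select(".potato")          # matches both divs
hamburger_els = soup.select(".hamburger")    # matches only the first div
both_els = soup.select(".potato.hamburger")  # requires both classes on one element

# With find_all, a multi-class string is compared against the exact attribute value,
# so the order of the classes matters.
exact_match = soup.find_all("div", class_="potato hamburger")
reordered = soup.find_all("div", class_="hamburger potato")

print(len(potato_els), len(hamburger_els), len(both_els))
print(len(exact_match), len(reordered))
```

Because `class_="potato hamburger"` depends on the exact attribute string, a CSS selector such as `.potato.hamburger` is usually the less fragile choice when a page's class order might change.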

In [14]:
faculty_els = columbia_soup.select(".views-row")
faculty_els

[<div class="views-row views-row-1 views-row-odd views-row-first col-sm-3">
 <div class="views-field views-field-edit-node"> <span class="field-content"></span> </div>
 <div class="views-field views-field-rendered-entity"> <span class="field-content"> <article class="node node-faculty-bio node-teaser teaser faculty-bio" id="node-1596">
 <div class="faculty-photo">
 <a href="/faculty/denise-ajiri"><img alt="" class="img-responsive" height="269" src="https://journalism.columbia.edu/files/soj/styles/faculty_photo/public/content/image/2022/31/f-88-6-13176708_01wgebpz_denise.jpg?itok=k-ND0dPI" width="269"/></a> </div>
 <h2 class="title regular"><a class="about-link" href="/faculty/denise-ajiri">Denise Ajiri</a></h2>
 <div class="sub-title"><p>Adjunct Assistant Professor</p>
 </div>
 </article>
 </span> </div> </div>,
 <div class="views-row views-row-2 views-row-even col-sm-3">
 <div class="views-field views-field-edit-node"> <span class="field-content"></span> </div>
 <div class="views-fiel

### 4) Count the number of matching elements, using `len`

Does it match the number of faculty you see on the page? (It should.)

In [7]:
len(faculty_els)

8

### 5) For each faculty member, print their name, title, and faculty page URL

You'll want to construct a `for` loop. In each iteration of the loop, print out something that looks like this:

```
Denise Ajiri's title is 'Adjunct Assistant Professor'. You can find more information about them @ /faculty/denise-ajiri
---
```

You'll note that the "href" is not a complete URL, but rather a "[relative path](https://www.w3schools.com/html/html_filepaths.asp)". Don't worry too much about that for now, although you're welcome to try "solving" that part.
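
One possible way to "solve" the relative path, sketched here with the standard library's `urljoin` (the variable names are assumptions for illustration), is to resolve the `href` against the URL of the page it was scraped from:

```python
from urllib.parse import urljoin

# The page we scraped, and one relative href found on it.
PAGE_URL = "https://journalism.columbia.edu/faculty?expertise=116"
relative_href = "/faculty/denise-ajiri"

# urljoin resolves the relative path against the page's scheme and host.
full_url = urljoin(PAGE_URL, relative_href)
print(full_url)  # https://journalism.columbia.edu/faculty/denise-ajiri
```

Simple string concatenation with the site's domain also works for this page; `urljoin` additionally copes with hrefs that are already absolute.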

In [11]:
for faculty_el in faculty_els:
    name = faculty_el.find('h2', class_='title regular')
    title = faculty_el.find('div', class_='sub-title')
    url = faculty_el.find('a')['href']
    print(f"{name.text}'s title is '{title.text}'. You can find more information about them @ https://journalism.columbia.edu{url}")
    print("---------")

Denise Ajiri's title is 'Adjunct Assistant Professor
'. You can find more information about them @ https://journalism.columbia.edu/faculty/denise-ajiri
---------
Andrea Fuller's title is 'Adjunct Faculty
'. You can find more information about them @ https://journalism.columbia.edu/faculty/andrea-fuller
---------
Robert Gebeloff's title is 'Adjunct Faculty
'. You can find more information about them @ https://journalism.columbia.edu/faculty/robert-gebeloff
---------
Mark Hansen's title is 'David and Helen Gurley Brown Professor of Journalism and Innovation; Director, David and Helen Gurley Brown Institute of Media Innovation
'. You can find more information about them @ https://journalism.columbia.edu/faculty/mark-hansen
---------
Tom  Meagher's title is 'Adjunct Faculty
'. You can find more information about them @ https://journalism.columbia.edu/faculty/tom-meagher
---------
Dhrumil Mehta's title is 'Associate Professor in Data Journalism; Deputy Director of the Tow Center for Digital

In [23]:
# a different way, using select (recommended)

for faculty_el in faculty_els:
    name = faculty_el.select('h2 a')[0].text
    title = faculty_el.select('.sub-title')[0].text.strip()
    url = faculty_el.select('h2 a')[0]['href']
    print(f"{name}'s title is '{title}'. You can find more information about them @ https://journalism.columbia.edu{url}")
    print("---------")


Denise Ajiri's title is 'Adjunct Assistant Professor'. You can find more information about them @ https://journalism.columbia.edu/faculty/denise-ajiri
---------
Andrea Fuller's title is 'Adjunct Faculty'. You can find more information about them @ https://journalism.columbia.edu/faculty/andrea-fuller
---------
Robert Gebeloff's title is 'Adjunct Faculty'. You can find more information about them @ https://journalism.columbia.edu/faculty/robert-gebeloff
---------
Mark Hansen's title is 'David and Helen Gurley Brown Professor of Journalism and Innovation; Director, David and Helen Gurley Brown Institute of Media Innovation'. You can find more information about them @ https://journalism.columbia.edu/faculty/mark-hansen
---------
Tom  Meagher's title is 'Adjunct Faculty'. You can find more information about them @ https://journalism.columbia.edu/faculty/tom-meagher
---------
Dhrumil Mehta's title is 'Associate Professor in Data Journalism; Deputy Director of the Tow Center for Digital Jour
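
The two loops above differ in their output because the second one calls `.strip()`, not because one uses `.find` and the other `.select`. A minimal sketch on a snippet modeled on the `sub-title` markup shown earlier (the `snippet` string is hand-typed for illustration):

```python
from bs4 import BeautifulSoup

# Modeled on the sub-title markup above: note the newline before </div>.
snippet = '<div class="sub-title"><p>Adjunct Assistant Professor</p>\n</div>'
title_el = BeautifulSoup(snippet, "html.parser").select_one(".sub-title")

raw_title = title_el.text                        # keeps the trailing newline
clean_title = title_el.text.strip()              # removes surrounding whitespace
also_clean = title_el.get_text(strip=True)       # strips each piece of text

print(repr(raw_title))
print(repr(clean_title))
```

The same cleanup applies to elements found with `.find`, since `.text` behaves identically on the element either method returns.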

### 6) Now, let's do the same thing, but storing the info in a `pandas` `DataFrame`

Specifically, a `DataFrame` with the columns `name`, `title`, `href`.

In [28]:
elements = []

for faculty_el in faculty_els:
    name = faculty_el.find('h2', class_='title regular').text
    title = faculty_el.find('div', class_='sub-title').text.strip()  # call strip(); without the parentheses the method object itself is stored
    url = faculty_el.find('a')['href']
    url = f"https://journalism.columbia.edu{url}"
    
    teacher = {'name' : name,
            'title' : title,
            'url' : url}
    elements.append(teacher)

df = pd.DataFrame(elements)
df

|   | name | title | url |
|---|---|---|---|
| 0 | Denise Ajiri | Adjunct Assistant Professor | https://journalism.columbia.edu/faculty/denise... |
| 1 | Andrea Fuller | Adjunct Faculty | https://journalism.columbia.edu/faculty/andrea... |
| 2 | Robert Gebeloff | Adjunct Faculty | https://journalism.columbia.edu/faculty/robert... |
| 3 | Mark Hansen | David and Helen Gurley Brown Professor of Jour... | https://journalism.columbia.edu/faculty/mark-h... |
| 4 | Tom Meagher | Adjunct Faculty | https://journalism.columbia.edu/faculty/tom-me... |
| 5 | Dhrumil Mehta | Associate Professor in Data Journalism; Deputy... | https://journalism.columbia.edu/faculty/dhrumi... |
| 6 | Matt Rocheleau | Adjunct Faculty | https://journalism.columbia.edu/faculty/matt-r... |
| 7 | Giannina Segnini | John S. and James L. Knight Professor of Profe... | https://journalism.columbia.edu/faculty/gianni... |


In [35]:
# The trailing \n in the title comes from the element's text, not from .find vs. .select; calling .strip() removes it either way.

elements = []

for faculty_el in faculty_els:
    name = faculty_el.select('h2 a')[0].text
    title = faculty_el.select('.sub-title')[0].text.strip()
    url = faculty_el.select('h2 a')[0]['href']

    teacher = {'name' : name,
               'title' : title,
               'url' : 'https://journalism.columbia.edu'+url}
    elements.append(teacher)

df = pd.DataFrame(elements)
df

|   | name | title | url |
|---|---|---|---|
| 0 | Denise Ajiri | Adjunct Assistant Professor | https://journalism.columbia.edu/faculty/denise... |
| 1 | Andrea Fuller | Adjunct Faculty | https://journalism.columbia.edu/faculty/andrea... |
| 2 | Robert Gebeloff | Adjunct Faculty | https://journalism.columbia.edu/faculty/robert... |
| 3 | Mark Hansen | David and Helen Gurley Brown Professor of Jour... | https://journalism.columbia.edu/faculty/mark-h... |
| 4 | Tom Meagher | Adjunct Faculty | https://journalism.columbia.edu/faculty/tom-me... |
| 5 | Dhrumil Mehta | Associate Professor in Data Journalism; Deputy... | https://journalism.columbia.edu/faculty/dhrumi... |
| 6 | Matt Rocheleau | Adjunct Faculty | https://journalism.columbia.edu/faculty/matt-r... |
| 7 | Giannina Segnini | John S. and James L. Knight Professor of Profe... | https://journalism.columbia.edu/faculty/gianni... |


In [39]:
df = pd.DataFrame([{
    'name' : faculty_el.select('h2 a')[0].text,
    'title' : faculty_el.select('.sub-title')[0].text.strip(),
    'url' : faculty_el.select('h2 a')[0]['href'],
} for faculty_el in faculty_els ])

df

|   | name | title | url |
|---|---|---|---|
| 0 | Denise Ajiri | Adjunct Assistant Professor | /faculty/denise-ajiri |
| 1 | Andrea Fuller | Adjunct Faculty | /faculty/andrea-fuller |
| 2 | Robert Gebeloff | Adjunct Faculty | /faculty/robert-gebeloff |
| 3 | Mark Hansen | David and Helen Gurley Brown Professor of Jour... | /faculty/mark-hansen |
| 4 | Tom Meagher | Adjunct Faculty | /faculty/tom-meagher |
| 5 | Dhrumil Mehta | Associate Professor in Data Journalism; Deputy... | /faculty/dhrumil-mehta |
| 6 | Matt Rocheleau | Adjunct Faculty | /faculty/matt-rocheleau |
| 7 | Giannina Segnini | John S. and James L. Knight Professor of Profe... | /faculty/giannina-segnini |


### 7) Using that `DataFrame`, calculate how many are "Adjunct Faculty"

In [31]:
df['title'].value_counts()['Adjunct Faculty']

4
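
An alternative way to count, sketched here on a hand-typed subset of the rows above (`sample_df` is an assumption for illustration and stands in for the scraped `df`), is to compare the column against the title and sum the resulting booleans. Unlike indexing into `value_counts()`, this returns 0 rather than raising a `KeyError` when the title never appears:

```python
import pandas as pd

# Hand-typed subset of the scraped rows, used only for illustration.
sample_df = pd.DataFrame([
    {"name": "Denise Ajiri", "title": "Adjunct Assistant Professor"},
    {"name": "Andrea Fuller", "title": "Adjunct Faculty"},
    {"name": "Robert Gebeloff", "title": "Adjunct Faculty"},
    {"name": "Tom Meagher", "title": "Adjunct Faculty"},
    {"name": "Matt Rocheleau", "title": "Adjunct Faculty"},
])

# Exact match: True/False per row, and True counts as 1 when summed.
exact_count = (sample_df["title"] == "Adjunct Faculty").sum()

# Substring match: also counts "Adjunct Assistant Professor".
adjunct_any_count = sample_df["title"].str.contains("Adjunct").sum()

# A title that is absent simply yields 0.
missing_count = (sample_df["title"] == "Dean").sum()

print(exact_count, adjunct_any_count, missing_count)
```

Whether "Adjunct Assistant Professor" should count as adjunct faculty is a reporting decision; the exact match mirrors the answer in the cell above.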


