# Scraping Intro Homework: Columbia J-School Data Faculty

In this assignment, we'll practice our scraping skills by examining the Columbia Journalism School's listing of data faculty: https://journalism.columbia.edu/faculty?expertise=116

### 0) Setup

Import `requests`, `BeautifulSoup`, and `pandas`. Remember that even though we installed the library as `pip install beautifulsoup4`, the import statement we practiced is slightly different.

In [39]:
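import requests
from bs4 import BeautifulSoup  # installed as beautifulsoup4, imported from bs4
import pandas as pd

sample_http = requests.get("https://example.com")  # quick connectivity check
sample_http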

<Response [200]>

### 1) Grab the HTML for the webpage linked above

Use `requests` to get the HTML, assigning it to a variable.

In [40]:
journalism_http = requests.get("https://journalism.columbia.edu/faculty?expertise=116").text
print(journalism_http)

<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
	
<!-- Google tag (gtag.js) -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-KQW8XM5VEJ"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'G-KQW8XM5VEJ');
</script>

<!-- Anti-flicker snippet (recommended) INC1425337 -->
<style>.async-hide { opacity: 0 !important} </style>
<script>(function(a,s,y,n,c,h,i,d,e){s.className+=' '+y;h.start=1*new Date;
h.end=i=function(){s.className=s.className.replace(RegExp(' ?'+y),'')};
(a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;
})(window,document.documentElement,'async-hide','dataLayer',4000,
{'GTM-MV2DS4J':true});</script>
<!-- Modified Analytics tracking code with Optimize plugin INC1425337 -->
    <script>
    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date()

### 2) Use `BeautifulSoup` to convert the HTML into its DOM representation

In [41]:
from bs4 import BeautifulSoup

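# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(journalism_http, "html.parser")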
type(soup)


bs4.BeautifulSoup

### 3) Use `.select(...)` to select all elements representing a faculty member

Assign the resulting elements to a variable named `faculty_els`.

You'll want to use "View Source" or open the Element Inspector to figure out which elements to target.

Note: An element's `class` attribute can contain *multiple* classes, separated by spaces. For example: `<div class="potato hamburger">Hello</div>` has two classes, `potato` and `hamburger`. 

A CSS selector for *either* class, `soup.select(".potato")` *or* `soup.select(".hamburger")`, will match that element.
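
As a minimal sketch of the note above, the snippet below parses the same one-line `<div>` (plus a second, made-up `<div>` added only for contrast) and checks which selectors match. The variable names are illustrative and not part of the assignment.

```python
from bs4 import BeautifulSoup

# The first <div> is the example from the note above; the second is a
# made-up element added only to show a selector that does not match it.
html = """
<div class="potato hamburger">Hello</div>
<div class="potato">Goodbye</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

potato_els = demo_soup.select(".potato")          # both divs carry this class
hamburger_els = demo_soup.select(".hamburger")    # only the first div
both_els = demo_soup.select(".potato.hamburger")  # element must carry both classes

print(len(potato_els), len(hamburger_els), len(both_els))
print(potato_els[0]["class"])  # the class attribute comes back as a list
```

Chaining class selectors without a space, as in `.potato.hamburger`, narrows the match to elements that have every listed class.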

In [42]:
faculty_els = soup.select('.faculty-bio')
faculty_els

[<article class="node node-faculty-bio node-teaser teaser faculty-bio" id="node-1596">
 <div class="faculty-photo">
 <a href="/faculty/denise-ajiri"><img alt="" class="img-responsive" height="269" src="https://journalism.columbia.edu/files/soj/styles/faculty_photo/public/content/image/2022/31/f-88-6-13176708_01wgebpz_denise.jpg?itok=k-ND0dPI" width="269"/></a> </div>
 <h2 class="title regular"><a class="about-link" href="/faculty/denise-ajiri">Denise Ajiri</a></h2>
 <div class="sub-title"><p>Adjunct Assistant Professor</p>
 </div>
 </article>,
 <article class="node node-faculty-bio node-teaser teaser faculty-bio" id="node-1156">
 <div class="faculty-photo">
 <a href="/faculty/andrea-fuller"><img alt="" class="img-responsive" height="269" src="https://journalism.columbia.edu/files/soj/styles/faculty_photo/public/content/image/2019/04/andrea-fuller.jpg?itok=o-b7JFxn" width="269"/></a> </div>
 <h2 class="title regular"><a class="about-link" href="/faculty/andrea-fuller">Andrea Fuller</a><

### 4) Count the number of matching elements, using `len`

Does it match the number of faculty you see on the page? (It should.)

In [43]:
len(faculty_els)

8

### 5) For each faculty member, print their name, title, and faculty page URL

You'll want to construct a `for` loop. In each iteration of the loop, print out something that looks like this:

```
Denise Ajiri's title is 'Adjunct Assistant Professor'. You can find more information about them @ /faculty/denise-ajiri
---
```

You'll note that the "href" is not a complete URL, but rather a "[relative path](https://www.w3schools.com/html/html_filepaths.asp)". Don't worry too much about that for now, although you're welcome to try "solving" that part.
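
One possible way to "solve" the relative path, sketched here with the standard library's `urljoin`. The base URL is the listing page linked at the top of this notebook; `base_url`, `relative_href`, and `full_url` are illustrative names.

```python
from urllib.parse import urljoin

# Base URL: the faculty listing page this notebook scrapes.
base_url = "https://journalism.columbia.edu/faculty?expertise=116"
# A relative path as it appears in an href on that page.
relative_href = "/faculty/denise-ajiri"

# urljoin resolves the relative path against the base URL's scheme and host.
full_url = urljoin(base_url, relative_href)
print(full_url)
```

Because the href starts with `/`, it replaces the whole path of the base URL, and the `?expertise=116` query string is dropped.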

In [None]:
for faculty in faculty_els:
    # Faculty name: the text of the element with class "title"
    f_name = faculty.select('.title')
    for name in f_name:
        name = name.text

    # Job title: .text keeps the trailing newline from the HTML,
    # which is why the output below breaks before the period
    f_title = faculty.select('.sub-title')
    for title in f_title:
        title = title.text

    # Faculty page URL: the href attribute of the "about-link" anchor
    f_url = faculty.select('.about-link')
    for link in f_url:
        link = link['href']

    print(f"{name}'s title is {title}. You can find more information on this faculty at {link}")

Denise Ajiri's title is Adjunct Assistant Professor
. You can find more information on this faculty at /faculty/denise-ajiri
Andrea Fuller's title is Adjunct Faculty
. You can find more information on this faculty at /faculty/andrea-fuller
Robert Gebeloff's title is Adjunct Faculty
. You can find more information on this faculty at /faculty/robert-gebeloff
Mark Hansen's title is David and Helen Gurley Brown Professor of Journalism and Innovation; Director, David and Helen Gurley Brown Institute of Media Innovation
. You can find more information on this faculty at /faculty/mark-hansen
Tom  Meagher's title is Adjunct Faculty
. You can find more information on this faculty at /faculty/tom-meagher
Dhrumil Mehta's title is Associate Professor in Data Journalism; Deputy Director of the Tow Center for Digital Journalism
. You can find more information on this faculty at /faculty/dhrumil-mehta
Matt Rocheleau's title is Adjunct Faculty
. You can find more information on this faculty at /facult

In [None]:
#Simpler version of the code above
for el in faculty_els:
    name_1 = el.select("h2 a")[0].text
    title_1 = el.select(".sub-title")[0].text.strip()
    link_1 = el.select("h2 a")[0]["href"]
    print(f"{name_1}'s title is '{title_1}'. You can find more information about them @ {link_1}")
    print("---")


### 6) Now, let's do the same thing, but storing the info in a `pandas` `DataFrame`

Specifically, a `DataFrame` with the columns `name`, `title`, `href`.

In [117]:
faculty_list = []

for faculty in faculty_els:
    # One dictionary per faculty member (named faculty_info to avoid shadowing the built-in dict)
    faculty_info = {}

    # Faculty name
    f_name = faculty.select('.title')
    for name in f_name:
        faculty_info["name"] = name.text

    # Job title, stripped of surrounding whitespace
    f_title = faculty.select('.sub-title')
    for title in f_title:
        faculty_info["title"] = title.text.strip()

    # Faculty page URL
    f_url = faculty.select('.about-link')
    for link in f_url:
        faculty_info["link"] = link['href']

    # Collect the dictionaries in faculty_list so they can be turned into a DataFrame
    faculty_list.append(faculty_info)

print(faculty_list)




[{'name': 'Denise Ajiri', 'title': 'Adjunct Assistant Professor', 'link': '/faculty/denise-ajiri'}, {'name': 'Andrea Fuller', 'title': 'Adjunct Faculty', 'link': '/faculty/andrea-fuller'}, {'name': 'Robert Gebeloff', 'title': 'Adjunct Faculty', 'link': '/faculty/robert-gebeloff'}, {'name': 'Mark Hansen', 'title': 'David and Helen Gurley Brown Professor of Journalism and Innovation; Director, David and Helen Gurley Brown Institute of Media Innovation', 'link': '/faculty/mark-hansen'}, {'name': 'Tom  Meagher', 'title': 'Adjunct Faculty', 'link': '/faculty/tom-meagher'}, {'name': 'Dhrumil Mehta', 'title': 'Associate Professor in Data Journalism; Deputy Director of the Tow Center for Digital Journalism', 'link': '/faculty/dhrumil-mehta'}, {'name': 'Matt Rocheleau', 'title': 'Adjunct Faculty', 'link': '/faculty/matt-rocheleau'}, {'name': 'Giannina Segnini', 'title': 'John S. and James L. Knight Professor of Professional Practice in Data Journalism', 'link': '/faculty/giannina-segnini'}]


In [118]:
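# Scratch check: '.subtitle' matches nothing (the class is 'sub-title'),
# so every dictionary stays empty
for faculty in faculty_els:
    faculty_info = {}
    f_title = faculty.select('.subtitle')
    for title in f_title:
        faculty_info["title"] = title.text

    print(faculty_info)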


{}
{}
{}
{}
{}
{}
{}
{}


In [120]:
# Build a DataFrame from the list of dictionaries
df = pd.DataFrame(faculty_list)

In [121]:
df

| | name | title | link |
|---|---|---|---|
| 0 | Denise Ajiri | Adjunct Assistant Professor | /faculty/denise-ajiri |
| 1 | Andrea Fuller | Adjunct Faculty | /faculty/andrea-fuller |
| 2 | Robert Gebeloff | Adjunct Faculty | /faculty/robert-gebeloff |
| 3 | Mark Hansen | David and Helen Gurley Brown Professor of Jour... | /faculty/mark-hansen |
| 4 | Tom Meagher | Adjunct Faculty | /faculty/tom-meagher |
| 5 | Dhrumil Mehta | Associate Professor in Data Journalism; Deputy... | /faculty/dhrumil-mehta |
| 6 | Matt Rocheleau | Adjunct Faculty | /faculty/matt-rocheleau |
| 7 | Giannina Segnini | John S. and James L. Knight Professor of Profe... | /faculty/giannina-segnini |
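
The instructions for this step ask for an `href` column, while the cells above store the URL under `link`. One way to reconcile the two is `DataFrame.rename`, sketched here on a two-row stand-in copied from the output above rather than on the full `df`. The names `demo_df` and `renamed_df` are illustrative.

```python
import pandas as pd

# Two rows copied from the scraped output above, used as a small stand-in for df.
demo_df = pd.DataFrame([
    {"name": "Denise Ajiri", "title": "Adjunct Assistant Professor", "link": "/faculty/denise-ajiri"},
    {"name": "Andrea Fuller", "title": "Adjunct Faculty", "link": "/faculty/andrea-fuller"},
])

# rename returns a new DataFrame; demo_df itself keeps the "link" column.
renamed_df = demo_df.rename(columns={"link": "href"})
print(list(renamed_df.columns))
```

Storing the value under the key `"href"` in the loop would give the same column name without a rename.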


### 7) Using that `DataFrame`, calculate how many are "Adjunct Faculty"

In [129]:
count = df['title'].value_counts()['Adjunct Faculty']
print(count)

4
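
Indexing the result of `value_counts()` raises a `KeyError` when the label does not occur at all. A sketch of an alternative that returns 0 in that case, using a stand-in Series of the eight titles printed in step 6 (the name `titles` and the label `"Dean"` are illustrative, not from the page):

```python
import pandas as pd

# The eight titles as printed in step 6.
titles = pd.Series([
    "Adjunct Assistant Professor",
    "Adjunct Faculty",
    "Adjunct Faculty",
    "David and Helen Gurley Brown Professor of Journalism and Innovation; "
    "Director, David and Helen Gurley Brown Institute of Media Innovation",
    "Adjunct Faculty",
    "Associate Professor in Data Journalism; "
    "Deputy Director of the Tow Center for Digital Journalism",
    "Adjunct Faculty",
    "John S. and James L. Knight Professor of Professional Practice in Data Journalism",
])

# Comparing to a string yields a boolean Series; summing it counts the True values.
adjunct_count = int((titles == "Adjunct Faculty").sum())
# "Dean" is a made-up label that appears nowhere, so the count is 0 rather than an error.
missing_count = int((titles == "Dean").sum())
print(adjunct_count, missing_count)
```

The comparison is an exact match, so "Adjunct Assistant Professor" is not counted as "Adjunct Faculty".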

