# Real-World Example: Scraping the Columbia Faculty Page
Our target: [Columbia Faculty](http://www.journalism.columbia.edu/page/10/10?category_ids%5B%5D=2&category_ids%5B%5D=3&category_ids%5B%5D=37)

How many adjunct professors are on faculty?
Is it possible to get access to the database underlying the website? If not, we turn to **scraping**!

+ Take a look at the page you're scraping using your browser's developer tools.
+ What are the tags associated with the item(s) you want? 

**In the case of the Columbia Faculty page, our goals are:**
+ Find all the `<li>` tags 
+ Inside all of the `<li>` tags, find the `<h4>` tag
    + Inside the `<h4>`, the name is the content of an `<a>` tag
+ Inside the `<li>` tag, find the `<p>` with class `description`.
    + The title of the professor is the content of that tag

In [6]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [7]:
# this can also be done with requests
url = 'http://www.journalism.columbia.edu/page/10/10?category_ids%5B%5D=2&category_ids%5B%5D=3&category_ids%5B%5D=37'
faculty_html = urlopen(url).read()

In [8]:
print(faculty_html)

b'<!doctype html>\n<!--[if IE 7]>\n<html class="ie ie7" lang="en-US">\n<![endif]-->\n<!--[if IE 8]>\n<html class="ie ie8" lang="en-US">\n<![endif]-->\n<!--[if IE 9]>\n<html class="ie ie9" lang="en-US">\n<![endif]-->\n<!--[if !(IE 7) | !(IE 8) | !(IE 9)  ]><!-->\n<html lang="en-US">\n<!--<![endif]-->\n<head>\n\t<meta charset="utf-8" />\n\t<title> Full-Time, Adjunct & Visiting Faculty - Columbia University Graduate School of Journalism</title>\n\t<meta name="description" content=\'\'>\n\t<meta name="keywords" content=\'\'>\n\t<meta name="viewport" content="width=device-width, initial-scale=1.0">\n\t<link rel="shortcut icon" href="/images/theme_images/mike1/favicon.ico"/>\n\t\n\t<link type="text/css" rel="stylesheet" href="/stylesheets/theme_stylesheets/mike1/style.css" media="all">\n\t<link type="text/css" rel="stylesheet" href="/stylesheets/theme_stylesheets/mike1/flexslider.css" media="all">\n\t<link type="text/css" rel="stylesheet" href="/stylesheets/theme_stylesheets/mike1/colorbox.c

In [9]:
# 'html.parser' is the HTML parser that comes with Python
document = BeautifulSoup(faculty_html, 'html.parser')

In [10]:
document

<!DOCTYPE doctype html>

<!--[if IE 7]>
<html class="ie ie7" lang="en-US">
<![endif]-->
<!--[if IE 8]>
<html class="ie ie8" lang="en-US">
<![endif]-->
<!--[if IE 9]>
<html class="ie ie9" lang="en-US">
<![endif]-->
<!--[if !(IE 7) | !(IE 8) | !(IE 9)  ]><!-->
<html lang="en-US">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<title> Full-Time, Adjunct &amp; Visiting Faculty - Columbia University Graduate School of Journalism</title>
<meta content="" name="description">
<meta content="" name="keywords">
<meta content="width=device-width, initial-scale=1.0" name="viewport">
<link href="/images/theme_images/mike1/favicon.ico" rel="shortcut icon"/>
<link href="/stylesheets/theme_stylesheets/mike1/style.css" media="all" rel="stylesheet" type="text/css">
<link href="/stylesheets/theme_stylesheets/mike1/flexslider.css" media="all" rel="stylesheet" type="text/css">
<link href="/stylesheets/theme_stylesheets/mike1/colorbox.css" media="all" rel="stylesheet" type="text/css">
<!--[if lte IE 8]>
		

In [11]:
# Double-check that we're good to go:
document.find('h2').string

' Full-Time, Adjunct & Visiting Faculty'

**Our task:** Just print out the names of all the faculty members.

In [12]:
h2_tag = document.find('h2')
h2_tag.string

' Full-Time, Adjunct & Visiting Faculty'

In [13]:
li_tags = document.find('li')

for item in li_tags:
    h4_tag = item.find('h4')
    a_tag = h4_tag.find('a')
    print(a_tag.string)

AttributeError: 'NoneType' object has no attribute 'find'

**AN ERROR!!!**

Our error message basically says that our `h4_tag` has the value `None`. This means that the item we searched did not contain an `<h4>` tag, even though we expected it to.

Since the `<li>` tag doesn't have the structure we expected, we probably need to inspect it further. We could go back to the webpage and inspect the source code using our developer tools. We could also print within our for loop to see what we're actually working with. 

> **None** is the Python value that stands for nothing, as one might suspect; its type is `NoneType`. It evaluates to false in a condition. It's called null in some other languages (JavaScript).
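
To make the behavior of `None` concrete, here is a minimal sketch. The `lookup` dictionary is made up for illustration and has nothing to do with the faculty page.

```python
# dict.get returns None when the key is missing
lookup = {'h4': 'a heading'}   # hypothetical data, for illustration only
missing = lookup.get('h5')

print(missing is None)   # the usual way to test for None
print(bool(missing))     # None counts as false in a condition

# Calling a method on None raises the same kind of error we saw above
try:
    missing.find('a')
except AttributeError as error:
    message = str(error)
    print(message)
```

This is why a plain `if some_tag:` check is enough to skip a missing tag before calling `.find` on it.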

In [14]:
li_tags = document.find('li')

for item in li_tags:
    print(item)

<a href="/page/1-about-the-school/1">About the School</a>
 


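Turns out, we didn't get the `<li>` tags we wanted. `find` returns only the first matching tag (here, the one holding the "About the School" link), and looping over that one tag walks through its children. To get every match we need `find_all`. And when you want to be more specific about which tags you get, look "up the tree" to find the parent of the tags you want. All of the `<li>` tags we're looking for are the children of this `<ul class="experts-list">`. Let's go back and look at our document!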
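
The difference between `find` and `find_all` can be sketched on a tiny document. The `sample_html` snippet and the names in it are invented for illustration; they only imitate the shape of the real page.

```python
from bs4 import BeautifulSoup

# A made-up snippet, not the real Columbia markup
sample_html = """
<ul class="nav"><li><a href="/about">About</a></li></ul>
<ul class="experts-list">
  <li><h4><a href="/p/1">Doe, Jane</a></h4></li>
  <li><h4><a href="/p/2">Roe, Richard</a></h4></li>
</ul>
"""
sample = BeautifulSoup(sample_html, 'html.parser')

first_li = sample.find('li')      # one Tag: the first <li> in the document
all_li = sample.find_all('li')    # every <li> in the document

# Looping over a single Tag walks through its children, not over <li> tags
children = list(first_li)

# Start from the parent <ul> to get only the <li> tags we care about
experts = sample.find('ul', {'class': 'experts-list'})
expert_names = [li.find('a').string for li in experts.find_all('li')]
print(len(all_li), children, expert_names)
```

Searching from the parent tag keeps the navigation items out of the results without any extra filtering.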

In [16]:
document


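Unfortunately, the code below will not work either. It finds the right `<ul>`, but then calls `document.find('li')` again instead of searching inside `ul_tag`, so it is still looping over an item that does not have an `<h4>` tag in it. We need to search inside `ul_tag` with `find_all`, and add a **conditional** to our for loop so that items without an `<h4>` are skipped!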

In [15]:
ul_tag = document.find('ul', {'class': 'experts-list'})
li_tags = document.find('li')

for item in li_tags:
    h4_tag = item.find('h4')
    a_tag = h4_tag.find('a')
    print(a_tag.string)

AttributeError: 'NoneType' object has no attribute 'find'

In [17]:
ul_tag = document.find('ul', {'class': 'experts-list'})
li_tags = ul_tag.find_all('li')

for item in li_tags:
    h4_tag = item.find('h4')

    if h4_tag:
        a_tag = h4_tag.find('a')
        p_tag = item.find('p', {'class': 'description'})
        print(a_tag.string, "/", p_tag.string)

Adkison, Abbey  / Assistant Director, Multi-Media Journalism
Alarcón, Daniel / Assistant Professor of Broadcast Journalism
Barclay, Dolores  / Adjunct Faculty
Baum, Geraldine / Adjunct Faculty
Bell, Emily / Professor of Professional Practice & Director, Tow Center for Digital Journalism
Benedict, Helen  / Professor
Bennet, John  / Adjunct Faculty
Bennett, Rob / Adjunct Faculty
Berman, Nina / Associate Professor
Blair, Gwenda  / Adjunct Faculty
Blum, David  / Adjunct Faculty
Bockelman, Matt / None
Bodarky, George / Adjunct Assistant Professor 
Bogdanich, Walt  / Adjunct Faculty
Bourin, Lennart / Adjunct Faculty
Bradley, Theresa / Adjunct Faculty
Brainard, Curtis  / Staff Writer
Bruder, Jessica / Adjunct Faculty
Burford, Melanie  / Adjunct Faculty
Burleigh, Nina  / Adjunct Faculty
Cabot, Heather / Adjunct Professor
Cabral, Elena  / Adjunct Faculty & Assistant Director, Student Services
Canipe, Chris / None
Casciato, Tom / Adjunct Faculty
Cohen, Julie / Adjunct Faculty
Cohen, Lisa R. / Di
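
Some titles above print as `None`. In Beautiful Soup, `.string` is `None` whenever a tag has no children or more than one child, so a `<p>` can exist and still give `None`. The sketch below shows both cases on made-up markup; the output above does not tell us which case applies on the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup: one description with text, one empty, one split by a <br/>
sample_html = """
<p class="description">Adjunct Faculty</p>
<p class="description"></p>
<p class="description">Professor<br/>Director</p>
"""
sample = BeautifulSoup(sample_html, 'html.parser')
p_tags = sample.find_all('p', {'class': 'description'})

# .string only works when the tag holds exactly one piece of text
strings = [p.string for p in p_tags]

# get_text gathers all the text inside the tag, however it is split up
texts = [p.get_text(' ', strip=True) for p in p_tags]

print(strings)
print(texts)
```

If `None` values get in the way later, `get_text` is one possible replacement for `.string`; it returns an empty string instead of `None`.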

Yay, better! Now, instead of just printing our data, let's **SAVE** our data so that we can continue to play with it. Our FAVORITE data structure is a **list of dictionaries**. Let's add to our for loop from above so that we are creating a dictionary for each professor, and then adding that dictionary to a list.

In [18]:
# Initializing an empty list
profs = []
ul_tag = document.find('ul', {'class': 'experts-list'})
li_tags = ul_tag.find_all('li')

for item in li_tags:
    h4_tag = item.find('h4')

    if h4_tag:
        a_tag = h4_tag.find('a')
        p_tag = item.find('p', {'class': 'description'})
        
        # Creating a dictionary called "prof_map"
        prof_map = {'name': a_tag.string, 'title': p_tag.string}
        
        # Adding our dictionary to our list
        profs.append(prof_map)
        
        # Commenting out our print statement from earlier
        # print(a_tag.string, "/", p_tag.string)
        
profs

[{'name': 'Adkison, Abbey ',
  'title': 'Assistant Director, Multi-Media Journalism'},
 {'name': 'Alarcón, Daniel',
  'title': 'Assistant Professor of Broadcast Journalism'},
 {'name': 'Barclay, Dolores ', 'title': 'Adjunct Faculty'},
 {'name': 'Baum, Geraldine', 'title': 'Adjunct Faculty'},
 {'name': 'Bell, Emily',
  'title': 'Professor of Professional Practice & Director, Tow Center for Digital Journalism'},
 {'name': 'Benedict, Helen ', 'title': 'Professor'},
 {'name': 'Bennet, John ', 'title': 'Adjunct Faculty'},
 {'name': 'Bennett, Rob', 'title': 'Adjunct Faculty'},
 {'name': 'Berman, Nina', 'title': 'Associate Professor'},
 {'name': 'Blair, Gwenda ', 'title': 'Adjunct Faculty'},
 {'name': 'Blum, David ', 'title': 'Adjunct Faculty'},
 {'name': 'Bockelman, Matt', 'title': None},
 {'name': 'Bodarky, George', 'title': 'Adjunct Assistant Professor '},
 {'name': 'Bogdanich, Walt ', 'title': 'Adjunct Faculty'},
 {'name': 'Bourin, Lennart', 'title': 'Adjunct Faculty'},
 {'name': 'Bradley
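
Several names and titles in the output above carry trailing spaces, and some titles are `None`. One way to tidy this up is sketched below on a few entries copied from that output. The `clean` helper is a hypothetical name introduced here, not part of the original notebook.

```python
# A few entries copied from the output above
sample_profs = [
    {'name': 'Adkison, Abbey ', 'title': 'Assistant Director, Multi-Media Journalism'},
    {'name': 'Bockelman, Matt', 'title': None},
    {'name': 'Bodarky, George', 'title': 'Adjunct Assistant Professor '},
]

def clean(value):
    """Strip surrounding whitespace, leaving None untouched."""
    return value.strip() if value is not None else None

clean_profs = [
    {'name': clean(prof['name']), 'title': clean(prof['title'])}
    for prof in sample_profs
]
print(clean_profs)
```

Cleaning the strings once, right after scraping, means later comparisons such as `title == "Adjunct Faculty"` are not thrown off by stray whitespace.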

**Aside:** Dictionaries are considered equal (`==`) if they have the exact same keys and values.

In [19]:
x = {'a': 1, 'b': 2}
y = {'a': 1, 'b': 2}
x == y

True
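
As a small extension of the cell above, this sketch shows that key order does not matter for `==`, and that two equal dictionaries are still two separate objects.

```python
x = {'a': 1, 'b': 2}
y = {'b': 2, 'a': 1}   # same pairs, written in a different order

same_contents = (x == y)   # compares keys and values
same_object = (x is y)     # compares whether both names point at one object
print(same_contents, same_object)
```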

In [20]:
for item in profs:
    print(item['name'])

Adkison, Abbey 
Alarcón, Daniel
Barclay, Dolores 
Baum, Geraldine
Bell, Emily
Benedict, Helen 
Bennet, John 
Bennett, Rob
Berman, Nina
Blair, Gwenda 
Blum, David 
Bockelman, Matt
Bodarky, George
Bogdanich, Walt 
Bourin, Lennart
Bradley, Theresa
Brainard, Curtis 
Bruder, Jessica
Burford, Melanie 
Burleigh, Nina 
Cabot, Heather
Cabral, Elena 
Canipe, Chris
Casciato, Tom
Cohen, Julie
Cohen, Lisa R.
Cohen, Sarah
Coll, Steve
Cooper, Ann
Coronel, Sheila 
Coyne , Kevin 
Cross, June 
Cunningham, Brent 
DePalma, Anthony
Deitsch, Richard
Diamond, Becky
Dinges, John
Donahue, Kerry 
Drew, Christopher 
Edsall, Thomas B. 
Einhorn, Cheryl
Elliott, Justin 
Epstein, Randi Hutter 
Evans, Farrell 
Ford, Constance Mitchell 
Freedman, Samuel 
Freeman, George 
French, Howard 
Fried, Stephen 
Garcia, Mario
Gezari, Vanessa
Gilderman, Greg
Gitlin, Todd
Giudice, Barbara 
Goldensohn, Marty
Goldman, Ari 
Goldstein, Jacob
Grueskin, Bill
Haburchak, Alan
Hajdu, David 
Hancock, LynNell
Hansen, Mark
Harris, Mark
Harte

## Aside: string indexing

**New task:** print all professors whose last names start with M.

In [21]:
# Not quite there yet...  
for item in profs:
    
    print(item['name'])

Adkison, Abbey 
Alarcón, Daniel
Barclay, Dolores 
Baum, Geraldine
Bell, Emily
Benedict, Helen 
Bennet, John 
Bennett, Rob
Berman, Nina
Blair, Gwenda 
Blum, David 
Bockelman, Matt
Bodarky, George
Bogdanich, Walt 
Bourin, Lennart
Bradley, Theresa
Brainard, Curtis 
Bruder, Jessica
Burford, Melanie 
Burleigh, Nina 
Cabot, Heather
Cabral, Elena 
Canipe, Chris
Casciato, Tom
Cohen, Julie
Cohen, Lisa R.
Cohen, Sarah
Coll, Steve
Cooper, Ann
Coronel, Sheila 
Coyne , Kevin 
Cross, June 
Cunningham, Brent 
DePalma, Anthony
Deitsch, Richard
Diamond, Becky
Dinges, John
Donahue, Kerry 
Drew, Christopher 
Edsall, Thomas B. 
Einhorn, Cheryl
Elliott, Justin 
Epstein, Randi Hutter 
Evans, Farrell 
Ford, Constance Mitchell 
Freedman, Samuel 
Freeman, George 
French, Howard 
Fried, Stephen 
Garcia, Mario
Gezari, Vanessa
Gilderman, Greg
Gitlin, Todd
Giudice, Barbara 
Goldensohn, Marty
Goldman, Ari 
Goldstein, Jacob
Grueskin, Bill
Haburchak, Alan
Hajdu, David 
Hancock, LynNell
Hansen, Mark
Harris, Mark
Harte

In [22]:
# indexing strings
message = "bungalow"
message[2:6]

'ngal'

In [23]:
# first character
message[0]

'b'

In [24]:
# last character
message[-1]

'w'

In [25]:
# get first 3 characters:
message[0:3]

'bun'

In [26]:
# if the first index is 0 (as in you're getting the first 3 characters), you can leave it out
message[:3]

'bun'

In [27]:
# to get everything after the 4th character
message[4:]

'alow'

In [28]:
# negative indexes count back from the end
message[-5:-2]

'gal'

In [29]:
# print all professors whose last names start with M.
for item in profs:
    prof_name = item['name']
    if prof_name[0] == "M":
        print(item['name'])

Maciulis, Tony
Maharidge, Dale 
Mason, Tom
Matloff, Judith 
Maytal, Itai
McCormick, David 
McCray, Melvin
McDonald, Erica
McGregor, Susan E.
Mencher, Melvin
Merchant, Preston
Mintz, James
Morais, Betsy
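
Indexing with `[0]` works here because every name has at least one character. As a hedged alternative, `str.startswith` does the same check and also copes with an empty string. The sample list below reuses names from the output above plus one hypothetical empty name.

```python
# Names copied from the output above, plus one hypothetical empty name
sample_names = ['Maciulis, Tony', 'Bell, Emily', 'McCray, Melvin', '']

m_names = []
for name in sample_names:
    # ''[0] would raise an IndexError; ''.startswith('M') is simply False
    if name.startswith('M'):
        m_names.append(name)

print(m_names)
```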


## Aside: more on counting 

In [30]:
mcount = 0
# print and count all professors whose last names start with M.
for item in profs:
    prof_name = item['name']
    if prof_name[0] == "M":
        print(item['name'])
        mcount += 1
print(mcount)

Maciulis, Tony
Maharidge, Dale 
Mason, Tom
Matloff, Judith 
Maytal, Itai
McCormick, David 
McCray, Melvin
McDonald, Erica
McGregor, Susan E.
Mencher, Melvin
Merchant, Preston
Mintz, James
Morais, Betsy
13


In [31]:
m_profs = []
# collect all professors whose last names start with M.
for item in profs:
    prof_name = item['name']
    if prof_name[0] == "M":
        m_profs.append(item)
print(len(m_profs))

13


In [32]:
m_profs

[{'name': 'Maciulis, Tony', 'title': 'Adjunct Faculty'},
 {'name': 'Maharidge, Dale ', 'title': 'Professor '},
 {'name': 'Mason, Tom', 'title': None},
 {'name': 'Matloff, Judith ', 'title': 'Adjunct faculty'},
 {'name': 'Maytal, Itai', 'title': 'Adjunct Faculty'},
 {'name': 'McCormick, David ', 'title': 'Adjunct Faculty'},
 {'name': 'McCray, Melvin', 'title': 'Adjunct Faculty'},
 {'name': 'McDonald, Erica', 'title': None},
 {'name': 'McGregor, Susan E.',
  'title': 'Assistant Professor & Assistant Director, Tow Center for Digital Journalism'},
 {'name': 'Mencher, Melvin', 'title': 'Professor Emeritus'},
 {'name': 'Merchant, Preston', 'title': None},
 {'name': 'Mintz, James', 'title': 'Adjunct Faculty'},
 {'name': 'Morais, Betsy', 'title': 'Adjunct Faculty'}]
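
The loop-and-append pattern used to build `m_profs` can also be written as a list comprehension. This is a sketch on three entries taken from the outputs above, not a replacement for the notebook's approach.

```python
# Three entries copied from the outputs above
sample_profs = [
    {'name': 'Maciulis, Tony', 'title': 'Adjunct Faculty'},
    {'name': 'Bell, Emily',
     'title': 'Professor of Professional Practice & Director, Tow Center for Digital Journalism'},
    {'name': 'Mason, Tom', 'title': None},
]

# Same filter as the for loop: keep the dictionary if the name starts with "M"
m_profs = [prof for prof in sample_profs if prof['name'][0] == 'M']
print(len(m_profs))
```

Both forms produce the same list; the comprehension is shorter, while the explicit loop leaves room for extra steps such as printing or counting.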

## Aside: += 1 structure

We used `mcount += 1` in the code above.  

That's the same as saying: `mcount = mcount + 1`. This also works: `x *= 2` is the same as `x = x * 2`. 
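
A short sketch of the same shorthand on a few types. The variable names and values are chosen only for illustration.

```python
count = 0
count += 1     # same as count = count + 1
count *= 10    # same as count = count * 10

label = 'Adjunct'
label += ' Faculty'    # on strings, += joins text onto the end

names = ['Bell, Emily']
names += ['Coll, Steve']   # on lists, += extends the list in place

print(count, label, names)
```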

In [33]:
# Find all the professors who are listed as "Adjunct Faculty" 

adjunct_profs = []

for item in profs:
    if item['title'] == "Adjunct Faculty":
        adjunct_profs.append(item)
print(len(adjunct_profs))

88


In [34]:
# Share of professors listed as adjunct faculty, as a fraction (multiply by 100 for a percentage)
print(len(adjunct_profs) / len(profs))

0.49162011173184356


In [35]:
adjunct_profs = []

for item in profs:
    
    # if item['title'] exists and the string 'Adjunct' appears somewhere in item['title']
    if item['title'] and "Adjunct" in item['title']:
        adjunct_profs.append(item)

print(len(adjunct_profs))
print(len(adjunct_profs) / len(profs))

110
0.6145251396648045
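
Two details are worth a sketch here. The `in` check is case-sensitive, and the earlier output includes a title spelled 'Adjunct faculty', which an exact match on "Adjunct Faculty" misses. The result is also a fraction rather than a percentage. The example below uses four entries copied from the `m_profs` output, so its numbers describe only this sample, not the full faculty list.

```python
# Four entries copied from the m_profs output above
sample_profs = [
    {'name': 'Maciulis, Tony', 'title': 'Adjunct Faculty'},
    {'name': 'Matloff, Judith ', 'title': 'Adjunct faculty'},
    {'name': 'Mason, Tom', 'title': None},
    {'name': 'Mencher, Melvin', 'title': 'Professor Emeritus'},
]

# Lowercase the title before checking, and skip entries whose title is None
adjuncts = [
    prof for prof in sample_profs
    if prof['title'] and 'adjunct' in prof['title'].lower()
]

share = len(adjuncts) / len(sample_profs)
print(f"{share:.1%}")   # format the fraction as a percentage
```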
