## Scraping websites

<div class="alert alert-info"> 
<h1> Your turn</h1>
<p>Find the email addresses for the <a href="http://as.nyu.edu/sociology/people/faculty.html">NYU Sociology faculty</a>.
<p> <em> Remember to put things in functions as soon as possible. </em>
<p>Then try your script on the politics faculty page. Did it work? If not, fix it.
</div>



In [99]:
import requests
import re
import pandas as pd

In [100]:
text = 'My email address is mailto:neal.caren@gmail.com so be sure to send me notes.'

re.findall('mailto:neal.caren@gmail.com so', text)

['mailto:neal.caren@gmail.com so']

In [101]:
re.findall('mailto:(.*?) so', text)

['neal.caren@gmail.com']
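
The `?` in `(.*?)` makes the capture non-greedy: it stops at the first ` so` it can find. A minimal sketch of the difference, using a made-up sentence (an assumption for illustration) in which ` so` appears twice:

```python
import re

# Made-up sentence in which ' so' occurs twice.
text = 'Write to mailto:neal.caren@gmail.com so I get it, and do so today.'

# Non-greedy: capture as little as possible before a following ' so'.
lazy = re.findall('mailto:(.*?) so', text)

# Greedy: capture as much as possible, so the match runs to the last ' so'.
greedy = re.findall('mailto:(.*) so', text)

print(lazy)
print(greedy)
```

With the greedy version the captured text swallows most of the sentence, which is why the patterns in this section use `(.*?)`.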

In [102]:
emails = re.findall('mailto:(.*?) so', text)
df = pd.DataFrame(emails, columns = ['email_address'])
df['department'] = 'Sociology'
df

Unnamed: 0,email_address,department
0,neal.caren@gmail.com,Sociology


In [103]:
df.to_csv('emails.csv')

In [113]:
url = 'http://as.nyu.edu/sociology/people/faculty.html'
html = requests.get(url).text
emails = re.findall('mailto:(.*?)"', html)

In [114]:
emails

[u'delia.b@nyu.edu',
 u'siwei.cheng@nyu.edu',
 u'vivek.chibber@nyu.edu',
 u'sarahkcowan@nyu.edu',
 u'pd1092@nyu.edu',
 u'jo.dixon@nyu.edu',
 u'pengland@nyu.edu',
 u'thomas.ertman@nyu.edu',
 u'David.Garland@nyu.edu',
 u'amanda.geller@nyu.edu',
 u'kathleen.gerson@nyu.edu',
 u'jgoodwin.nyu@gmail.com',
 u'david.greenberg@nyu.edu',
 u'lynne.haney@nyu.edu',
 u'ruth.horowitz@nyu.edu',
 u'mikehout@nyu.edu',
 u'mary.beth.hunzaker@nyu.edu',
 u'robert.max.jackson@nyu.edu',
 u'gj1@nyu.edu',
 u'jerolmack@nyu.edu',
 u'nahoko.kameo@nyu.edu',
 u'eric.klinenberg@nyu.edu',
 u'sl53@nyu.edu',
 u'manza@nyu.edu',
 u'harvey.molotch@nyu.edu',
 u'ann.morning@nyu.edu',
 u'dr101@nyu.edu',
 u'pts1@nyu.edu',
 u'iddo.tavory@nyu.edu',
 u'lawrence.wu@nyu.edu']

In [115]:
url = 'http://as.nyu.edu/politics/people.html'
html = requests.get(url).text


In [123]:
emails = re.findall('theme__text--light">(.*?)<', html)

In [126]:
re.findall('theme__text--light">(.*?)<', html.replace(' [at] ','@') )

[u'bernd.beber@nyu.edu',
 u'nathaniel.beck@nyu.edu',
 u'steven.brams@nyu.edu',
 u'bruce.buenodemesquita@nyu.edu',
 u'amy.catalinac@nyu.edu',
 u'kanchan.chandra [@] gmail.com',
 u'cdawes@nyu.edu',
 u'David.Denoon@nyu.edu',
 u'eric.dickson@nyu.edu',
 u'tiberiu.dragu@nyu.edu',
 u'patrick.egan@nyu.edu',
 u'john.ferejohn [@] nyu.edu',
 u'michael.gilligan@nyu.edu',
 u'sanford.gordon@nyu.edu',
 u'catherine.hafer@nyu.edu',
 u'christine.harrington@nyu.edu',
 u'anna.harvey@nyu.edu',
 u'stephen.holmes@nyu.edu',
 u'jch2@nyu.edu',
 u'dimitri.landa@nyu.edu',
 u'jenn.larson@nyu.edu',
 u'michael.laver@nyu.edu',
 u'bernard.manin@nyu.edu',
 u'lmm1@nyu.edu',
 u'rebecca.morton@nyu.edu',
 u'jonathan.nagler@nyu.edu',
 u'obertell@netscape.net',
 u'pp14@nyu.edu',
 u'ryan.pevnick@nyu.edu',
 u'adam.przeworski@nyu.edu',
 u'pablo.querubin@gmail.com',
 u'peter.rosendorff@nyu.edu',
 u'hr31@nyu.edu',
 u'arturas.rozenas@nyu.edu',
 u'cds2083@nyu.edu',
 u'shanker.satyanath@nyu.edu',
 u'ms5@nyu.edu',
 u'ms268@nyu.edu',
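
The list above still contains addresses written as `name [@] host`, because only the ` [at] ` form was replaced. One way to catch both spellings is a single regular expression. This is a sketch, not the notebook's own code: `clean_email` is a hypothetical helper, and the second sample address is made up.

```python
import re

def clean_email(raw):
    """Turn 'name [at] host' or 'name [@] host' into 'name@host'.

    clean_email is a hypothetical helper, not part of the original notebook.
    """
    # Optional spaces, then [at] or [@] in any letter case, then optional spaces.
    pattern = r'\s*\[\s*(?:at|@)\s*\]\s*'
    return re.sub(pattern, '@', raw, flags=re.IGNORECASE).strip()

samples = ['kanchan.chandra [@] gmail.com',   # form seen in the output above
           'someone [at] example.edu',        # made-up address
           'cdawes@nyu.edu']                  # already clean, left unchanged

cleaned = [clean_email(sample) for sample in samples]
print(cleaned)
```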


### Tip 1: You can split the HTML and process each section

In [229]:
'Remember when we learned about split?'.split()

['Remember', 'when', 'we', 'learned', 'about', 'split?']

In [230]:
'Remember when we learned about split?'.split('we')

['Remember when ', ' learned about split?']

In [139]:
html = requests.get('http://as.nyu.edu/sociology/people/faculty.html').text

slices = html.split('filtered-items-item js-filter-item')

In [129]:
len(slices)

33

In [130]:
slices[4]

u'" data-filter-keys="letter,department" data-filter-letter="Chibber" data-filter-department="Department of Sociology">\n                <a class="facultydirectorybio-person__name theme__text--dark" href="/sociology/people/faculty.vivek-a-chibber.html" title="Vivek Chibber">Vivek Chibber</a>\n                <span class="facultydirectorybio-person__position">Professor Of Sociology</span>\n                <span class="facultydirectorybio-person__department">Department of Sociology</span>\n                <br/>\n                <span class="facultydirectorybio-person__email">\n                    <a href="mailto:vivek.chibber@nyu.edu" title="Email Vivek A Chibber" class="theme__text--light">vivek.chibber@nyu.edu</a>\n                </span>\n                \n                <span class="facultydirectorybio-person__address">Puck Building, Room 4120, 295 Lafayette Street New York, NY 10012</span>\n                \n                \n                <span class="facultydirectorybio-person_

In [134]:
re.findall('title="(.*?)"', slices[4])

[u'Vivek Chibber', u'Email Vivek A Chibber']

In [135]:
re.findall('title="(.*?)"', slices[2])

[u'Delia Baldassarri', u'Email Delia Baldassarri']

In [136]:
re.findall('title="(.*?)"', slices[7])[0]

u'Jo Dixon'

In [146]:
def find_name(slice):
    return re.findall('title="(.*?)"', slice)[0]

In [147]:
find_name(slices[15])

u'Lynne Haney'

In [143]:
def find_email(slice):
    return re.findall('mailto:(.*?)"', slice)[0]

In [144]:
find_email(slices[15])

u'lynne.haney@nyu.edu'

In [153]:
def scrape_person(slice):
    name = find_name(slice)
    email = find_email(slice)
    entry = {'name' : name,
             'email': email }
    return entry

In [154]:
scrape_person(slices[17])

{'email': u'mikehout@nyu.edu', 'name': u'Michael Hout'}

In [156]:
for slice in slices:
    print(scrape_person(slice))

IndexError: list index out of range

In [158]:
print(slices[0])

<!DOCTYPE HTML>
<html>
    <head><meta name="viewport" content="width=device-width, initial-scale=1">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta name="keywords">
<meta name="description">

    
        <meta name="dateCreated" content="2017-06-27">
    


    
    
<link rel="stylesheet" href="/etc/clientlibs/foundation/main.css" type="text/css">
<script type="text/javascript" src="/etc/clientlibs/granite/jquery.js"></script>
<script type="text/javascript" src="/etc/clientlibs/granite/utils.js"></script>
<script type="text/javascript" src="/etc/clientlibs/granite/jquery/granite.js"></script>
<script type="text/javascript" src="/etc/clientlibs/foundation/jquery.js"></script>
<script type="text/javascript" src="/etc/clientlibs/foundation/main.js"></script>



    
    <link href="/etc/designs/nyu-as.css" rel="stylesheet" type="text/css"/>









<!-- Gotham/Mercury Font Key -->
<link rel="sty
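
The first slice is everything that comes before the first marker, which here is the page header. It contains no `title="..."` attribute, so `re.findall` returns an empty list and asking for element `[0]` raises the `IndexError`. A small sketch with a made-up miniature page (the names and markup are assumptions for illustration):

```python
import re

# Made-up miniature page: a header followed by two marked entries.
html = ('<html><head></head><body>'
        '<div class="js-filter-item"><a title="Ada Example" href="#">Ada</a></div>'
        '<div class="js-filter-item"><a title="Bo Sample" href="#">Bo</a></div>')

slices = html.split('js-filter-item')
print(len(slices))  # two markers produce three slices

# Nothing in the header matches, so findall gives back an empty list.
header_matches = re.findall('title="(.*?)"', slices[0])
print(header_matches)

# Indexing an empty list is what raises the error.
try:
    header_matches[0]
    failed = False
except IndexError as error:
    failed = True
    print('IndexError: %s' % error)

# The later slices do contain a name.
first_name = re.findall('title="(.*?)"', slices[1])[0]
print(first_name)
```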

In [159]:
for slice in slices:
    try:
        print(scrape_person(slice))
    except IndexError:
        print('Empty')
              

Empty
Empty
{'name': u'Delia Baldassarri', 'email': u'delia.b@nyu.edu'}
{'name': u'Siwei Cheng', 'email': u'siwei.cheng@nyu.edu'}
{'name': u'Vivek Chibber', 'email': u'vivek.chibber@nyu.edu'}
{'name': u'Sarah Cowan', 'email': u'sarahkcowan@nyu.edu'}
{'name': u'Paul DiMaggio', 'email': u'pd1092@nyu.edu'}
{'name': u'Jo Dixon', 'email': u'jo.dixon@nyu.edu'}
{'name': u'Paula England', 'email': u'pengland@nyu.edu'}
{'name': u'Thomas Ertman', 'email': u'thomas.ertman@nyu.edu'}
{'name': u'David Garland', 'email': u'David.Garland@nyu.edu'}
{'name': u'Amanda Geller', 'email': u'amanda.geller@nyu.edu'}
{'name': u'Kathleen Gerson', 'email': u'kathleen.gerson@nyu.edu'}
{'name': u'Jeff Goodwin', 'email': u'jgoodwin.nyu@gmail.com'}
{'name': u'David Greenberg', 'email': u'david.greenberg@nyu.edu'}
{'name': u'Lynne Haney', 'email': u'lynne.haney@nyu.edu'}
{'name': u'Ruth Horowitz', 'email': u'ruth.horowitz@nyu.edu'}
{'name': u'Michael Hout', 'email': u'mikehout@nyu.edu'}
{'name': u'Mary Beth Hunzaker'

In [163]:
directory = []

for slice in slices:
    try:
        entry = scrape_person(slice)
        directory.append(entry)
    except IndexError:
        print('Empty')
              

Empty
Empty
Empty


In [166]:
df = pd.DataFrame(directory)
df

Unnamed: 0,email,name
0,delia.b@nyu.edu,Delia Baldassarri
1,siwei.cheng@nyu.edu,Siwei Cheng
2,vivek.chibber@nyu.edu,Vivek Chibber
3,sarahkcowan@nyu.edu,Sarah Cowan
4,pd1092@nyu.edu,Paul DiMaggio
5,jo.dixon@nyu.edu,Jo Dixon
6,pengland@nyu.edu,Paula England
7,thomas.ertman@nyu.edu,Thomas Ertman
8,David.Garland@nyu.edu,David Garland
9,amanda.geller@nyu.edu,Amanda Geller


<div class="alert alert-info"> 
<h1> Your turn</h1>
<p>Add a field for faculty rank (such as Associate or Assistant Professor) to scrape_person.
<p> Bonus: Add some try/except to scrape_person so it returns results even when a field is missing (as with Abend).
</div>



### Tip 2: Download Once, Load Many Times

In [2]:
import codecs

def save_file(text, file_name):
    with codecs.open(file_name, 'wb', 'utf8') as outfile:
        outfile.write(text)


In [6]:
url = 'http://mobilizationjournal.org/toc/maiq/22/2'
html= requests.get(url).text

save_file(html, 'moby_22_2.html')


In [10]:
def load_file(file_name):
    with codecs.open(file_name, 'rb', 'utf8') as infile:
        text = infile.read()
    return text

In [17]:
def download_file(volume, issue):
    url = 'http://mobilizationjournal.org/toc/maiq/' + str(volume) + '/' + str(issue)
    html= requests.get(url).text
    return html

### The logic:

*If we have a file, load it. If we don't, get it and save it.*

### In Python

1. `Try` to load the file.
2. If that fails, get it from the internet and save a copy.

In [39]:
try:
    html = load_file('moby_15_2.html')
except:
    html = download_file(15, 2)
    save_file(html, 'moby_15_2.html')

In [22]:
!ls moby_*

moby_15_2.html moby_22_2.html


In [43]:
volume = 10
issue = 1

file_name = 'moby_' + str(volume) + '_' + str(issue) + '.html'

try:
    html = load_file(file_name)
except:
    print('Could not find it. Going to the internet')
    html = download_file(volume, issue)
    save_file(html, file_name)

In [44]:
!ls moby_*

moby_10_1.html moby_15_2.html


In [45]:
def load_or_download(volume, issue):
    '''Loads a local HTML. If not found, gets the file from the internet.'''
    
    file_name = 'moby_' + str(volume) + '_' + str(issue) + '.html'
    try:
        html = load_file(file_name)
    except:
        print('Could not find it. Going to the internet')
        html = download_file(volume, issue)
        save_file(html, file_name)

In [46]:
load_or_download(5, 3)

Could not find it. Going to the internet


In [47]:
def load_or_download(volume, issue):
    '''Loads a local HTML. If not found, gets the file from the internet.'''
    
    file_name = 'moby_' + str(volume) + '_' + str(issue) + '.html'
    
    try:
        html = load_file(file_name)
    except:
        print('Could not find it. Going to the internet')
        html = download_file(volume, issue)
        save_file(html, file_name)
        
    return html
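
To check the load-or-download logic without touching the network, the download step can be swapped for a stand-in. The sketch below is an illustration under assumptions, not the notebook's code: `load_or_fetch`, `fetch`, and `fake_download` are hypothetical names, `io.open` is used in place of `codecs.open`, and only `IOError` is caught so that other mistakes are not silently hidden.

```python
import io
import os
import tempfile

def load_or_fetch(file_name, fetch):
    """Return the text in file_name; if it can't be read, call fetch() and cache it.

    load_or_fetch and fetch are hypothetical names used for this sketch.
    """
    try:
        with io.open(file_name, 'r', encoding='utf8') as infile:
            return infile.read()
    except IOError:
        # Only a missing or unreadable file lands here.
        text = fetch()
        with io.open(file_name, 'w', encoding='utf8') as outfile:
            outfile.write(text)
        return text

calls = []

def fake_download():
    """Stand-in for download_file that records each call."""
    calls.append(1)
    return u'<html>pretend page</html>'

cache_path = os.path.join(tempfile.mkdtemp(), 'moby_test.html')

first = load_or_fetch(cache_path, fake_download)   # file missing: fetch and save
second = load_or_fetch(cache_path, fake_download)  # file present: read the saved copy
print(len(calls))
```

The stand-in is called only once; the second request is served from the saved file.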

In [48]:
load_or_download(7, 3)

Could not find it. Going to the internet


u'\n\n\n\n\n        \n        \n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">\n<head>\n    \n\n\n\n\n\n\n<title>\n    \n            Mobilization: An International Quarterly\n            -\n            \n                    Error\n                \n        \n</title>\n\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n    \n<meta name="robots" content="noarchive,noindex,nofollow" />\n\n\n\n\n\n\n\n\n\n\n<meta name="MSSmartTagsPreventParsing" content="true"/>\n\n\n\n\n    \n    \n    <script>\n        (function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){\n            (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n                m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n        })(window,document,\'script\',\'/

In [49]:
load_or_download(7, 3)

u'\n\n\n\n\n        \n        \n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">\n<head>\n    \n\n\n\n\n\n\n<title>\n    \n            Mobilization: An International Quarterly\n            -\n            \n                    Error\n                \n        \n</title>\n\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n    \n<meta name="robots" content="noarchive,noindex,nofollow" />\n\n\n\n\n\n\n\n\n\n\n<meta name="MSSmartTagsPreventParsing" content="true"/>\n\n\n\n\n    \n    \n    <script>\n        (function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){\n            (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n                m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n        })(window,document,\'script\',\'/
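
Note the `Error` in the `<title>` of the output above: the site answered with an error page for this volume and issue, and that page was cached like any other. One option is to look at the page before saving it. The sketch below assumes an error page can be recognized by its title; `page_title` and `looks_like_error_page` are hypothetical helpers. Checking `status_code` on the response from `requests.get` is another option, provided the site sets it.

```python
import re

def page_title(html):
    """Return the first <title> text with whitespace collapsed, or None if absent."""
    match = re.search(r'<title>(.*?)</title>', html, flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        return None
    return ' '.join(match.group(1).split())

def looks_like_error_page(html):
    """Assumption: an error page has no title or a title ending in 'Error'."""
    title = page_title(html)
    return title is None or title.endswith('Error')

# Trimmed copy of the <title> block from the output above.
sample = ('<html><head>\n<title>\n    \n'
          '            Mobilization: An International Quarterly\n'
          '            -\n            \n'
          '                    Error\n                \n        \n'
          '</title>\n</head></html>')

print(page_title(sample))
print(looks_like_error_page(sample))
```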

<div class="alert alert-info"> 
<h1> Your turn. </h1>
<p>
Modify the functions above to work to download UNC faculty web pages.
<p> Hint 1: load_file doesn't need to be changed.
<p> Hint 2: Instead of volume and issue, we only have one thing: department, such as sociology or politicalscience (keep it 1 word). 
<p> Hint 3: download_file needs a new url and needs to be modified based on Hint 2.
</div>

In [67]:
def load_file(file_name):
    with codecs.open(file_name, 'rb', 'utf8') as infile:
        text = infile.read()
    return text

def download_file(department):
    url = 'http://' + department + '.unc.edu/people/faculty/' 
    html= requests.get(url).text
    return html

In [68]:
def load_or_download(department):
    '''Loads a local HTML. If not found, gets the file from the internet.'''
    
    file_name = 'email_' + department + '.html'
    try:
        html = load_file(file_name)
    except:
        print('Could not find it. Going to the internet')
        html = download_file(department)
        save_file(html, file_name)
        



In [70]:
load_or_download('sociology')

In [73]:
!ls ema*

email_sociology.html emails.csv


### Tip 3: Pandas lets you cheat

[You had choices.](http://www.sv.uio.no/english/research/phd/summer-school/courses-2017/)

In [214]:
import pandas as pd

url = 'http://www.sv.uio.no/english/research/phd/summer-school/courses-2017/'

courses = pd.read_html(url)

This returns a list of HTML tables on the page as pandas dataframes. 

In [224]:
courses

[                                               title  \
 0  Mixed Methods: Towards a Methodological Pluralism   
 1          Political Violence: A Relational Approach   
 2                        Case Study Research Methods   
 3                      Democracy and Democratization   
 4                              Democracy and Justice   
 
                                           instructor  \
 0  Professor Giampietro Gobo, University of Milan...   
 1  Professor Donatella della Porta and Assistant ...   
 2  Professor Andrew Bennett, Georgetown Universit...   
 3  Professor David J. Samuels, University of Minn...   
 4        Professor Ian Shapiro, Yale University, USA   
 
                                          disciplines               date  
 0  Research Methodology, Political Science, Socio...  24 - 28 July 2017  
 1                       Sociology, Political Science  24 - 28 July 2017  
 2  Research Methodology, Political Science, Socio...  24 - 28 July 2017  
 3          

In [215]:
courses[0]

Unnamed: 0,24 - 28 July 2017,Lecturers,Main disciplines
0,Mixed Methods: Towards a Methodological Pluralism,"Professor Giampietro Gobo, University of Milan...","Research Methodology, Political Science, Socio..."
1,Political Violence: A Relational Approach,Professor Donatella della Porta and Assistant ...,"Sociology, Political Science"
2,Case Study Research Methods,"Professor Andrew Bennett, Georgetown Universit...","Research Methodology, Political Science, Socio..."
3,Democracy and Democratization,"Professor David J. Samuels, University of Minn...",Political Science
4,Democracy and Justice,"Professor Ian Shapiro, Yale University, USA",Political Science


In [216]:
courses[1]

Unnamed: 0,31 July - 4 August 2017,Lecturers,Main disciplines
0,Climate Change Adaptation and Transformations ...,"Professor Karen O'Brien, University of Oslo, N...","Human Geography, Environment and Climate"
1,The Politics of Nature in the Anthropocene: An...,"Associate Professor Andrew S. Mathews, Univers...","Anthropology, Environment and Climate"
2,Responsible Research and Innovation,Professor Richard Owen and Senior Lecturer Sar...,"Innovation Studies, STS"
3,Collecting and Analyzing Big Data,"Associate Professor Neal Caren, University of ...","Economics, Sociology, Big Data"
4,Exploring Educational Transfer,"Professor Dr. Florian Waldow, Humboldt-Univers...","Comparative Education, Sociology"
5,Political Psychology,"Professor Fathali M. Moghaddam, Georgetown Uni...","Psychology, Political Science"


In [217]:
week1_df = courses[0]
week2_df = courses[1]

pd.concat([week1_df, week2_df])

Unnamed: 0,24 - 28 July 2017,Lecturers,Main disciplines,31 July - 4 August 2017
0,Mixed Methods: Towards a Methodological Pluralism,"Professor Giampietro Gobo, University of Milan...","Research Methodology, Political Science, Socio...",
1,Political Violence: A Relational Approach,Professor Donatella della Porta and Assistant ...,"Sociology, Political Science",
2,Case Study Research Methods,"Professor Andrew Bennett, Georgetown Universit...","Research Methodology, Political Science, Socio...",
3,Democracy and Democratization,"Professor David J. Samuels, University of Minn...",Political Science,
4,Democracy and Justice,"Professor Ian Shapiro, Yale University, USA",Political Science,
0,,"Professor Karen O'Brien, University of Oslo, N...","Human Geography, Environment and Climate",Climate Change Adaptation and Transformations ...
1,,"Associate Professor Andrew S. Mathews, Univers...","Anthropology, Environment and Climate",The Politics of Nature in the Anthropocene: An...
2,,Professor Richard Owen and Senior Lecturer Sar...,"Innovation Studies, STS",Responsible Research and Innovation
3,,"Associate Professor Neal Caren, University of ...","Economics, Sociology, Big Data",Collecting and Analyzing Big Data
4,,"Professor Dr. Florian Waldow, Humboldt-Univers...","Comparative Education, Sociology",Exploring Educational Transfer
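
The course titles end up in two different columns because the first column of each table is named after its own week. pandas lines rows up by column name, so the names have to agree before the tables are stacked. A minimal sketch with made-up data (the course and lecturer values are placeholders), using `pd.concat`:

```python
import pandas as pd

# Made-up stand-ins for the two scraped tables; only the first header differs.
week1 = pd.DataFrame([['Course A', 'Prof. X']],
                     columns=['24 - 28 July 2017', 'Lecturers'])
week2 = pd.DataFrame([['Course B', 'Prof. Y']],
                     columns=['31 July - 4 August 2017', 'Lecturers'])

# Stacking as-is keeps both date-named columns and fills the gaps with NaN.
misaligned = pd.concat([week1, week2])
print(misaligned.shape)

# Giving both tables the same column names first lets the rows line up.
week1.columns = ['title', 'instructor']
week2.columns = ['title', 'instructor']
aligned = pd.concat([week1, week2], ignore_index=True)
print(aligned)
```

Passing `ignore_index=True` also renumbers the rows, which avoids the repeated `0, 1, 2, ...` index.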


In [218]:

week1_df.columns = ['title', 'instructor', 'disciplines']
week1_df['date'] = '24 - 28 July 2017'
week1_df

Unnamed: 0,title,instructor,disciplines,date
0,Mixed Methods: Towards a Methodological Pluralism,"Professor Giampietro Gobo, University of Milan...","Research Methodology, Political Science, Socio...",24 - 28 July 2017
1,Political Violence: A Relational Approach,Professor Donatella della Porta and Assistant ...,"Sociology, Political Science",24 - 28 July 2017
2,Case Study Research Methods,"Professor Andrew Bennett, Georgetown Universit...","Research Methodology, Political Science, Socio...",24 - 28 July 2017
3,Democracy and Democratization,"Professor David J. Samuels, University of Minn...",Political Science,24 - 28 July 2017
4,Democracy and Justice,"Professor Ian Shapiro, Yale University, USA",Political Science,24 - 28 July 2017


In [219]:
week2_df.columns = ['title', 'instructor', 'disciplines']
week2_df['date'] = '31 July - 4 August 2017'
week2_df

Unnamed: 0,title,instructor,disciplines,date
0,Climate Change Adaptation and Transformations ...,"Professor Karen O'Brien, University of Oslo, N...","Human Geography, Environment and Climate",31 July - 4 August 2017
1,The Politics of Nature in the Anthropocene: An...,"Associate Professor Andrew S. Mathews, Univers...","Anthropology, Environment and Climate",31 July - 4 August 2017
2,Responsible Research and Innovation,Professor Richard Owen and Senior Lecturer Sar...,"Innovation Studies, STS",31 July - 4 August 2017
3,Collecting and Analyzing Big Data,"Associate Professor Neal Caren, University of ...","Economics, Sociology, Big Data",31 July - 4 August 2017
4,Exploring Educational Transfer,"Professor Dr. Florian Waldow, Humboldt-Univers...","Comparative Education, Sociology",31 July - 4 August 2017
5,Political Psychology,"Professor Fathali M. Moghaddam, Georgetown Uni...","Psychology, Political Science",31 July - 4 August 2017


In [220]:
catalog_df = pd.concat([week1_df, week2_df])

catalog_df

Unnamed: 0,title,instructor,disciplines,date
0,Mixed Methods: Towards a Methodological Pluralism,"Professor Giampietro Gobo, University of Milan...","Research Methodology, Political Science, Socio...",24 - 28 July 2017
1,Political Violence: A Relational Approach,Professor Donatella della Porta and Assistant ...,"Sociology, Political Science",24 - 28 July 2017
2,Case Study Research Methods,"Professor Andrew Bennett, Georgetown Universit...","Research Methodology, Political Science, Socio...",24 - 28 July 2017
3,Democracy and Democratization,"Professor David J. Samuels, University of Minn...",Political Science,24 - 28 July 2017
4,Democracy and Justice,"Professor Ian Shapiro, Yale University, USA",Political Science,24 - 28 July 2017
0,Climate Change Adaptation and Transformations ...,"Professor Karen O'Brien, University of Oslo, N...","Human Geography, Environment and Climate",31 July - 4 August 2017
1,The Politics of Nature in the Anthropocene: An...,"Associate Professor Andrew S. Mathews, Univers...","Anthropology, Environment and Climate",31 July - 4 August 2017
2,Responsible Research and Innovation,Professor Richard Owen and Senior Lecturer Sar...,"Innovation Studies, STS",31 July - 4 August 2017
3,Collecting and Analyzing Big Data,"Associate Professor Neal Caren, University of ...","Economics, Sociology, Big Data",31 July - 4 August 2017
4,Exploring Educational Transfer,"Professor Dr. Florian Waldow, Humboldt-Univers...","Comparative Education, Sociology",31 July - 4 August 2017


In [222]:
catalog_df.to_csv('catalog.csv', encoding='utf8')

<div class="alert alert-info"> 
<h1> Your turn</h1>
<p>Scrape the 2016 course listings. Add them to our catalog.
</div>


### Tip 3: Sometimes you can hack their API.

![](https://raw.githubusercontent.com/nealcaren/CSSS-CABD/master/images/HttpFox.png)

### Tip 4: If you know HTML, you can also parse the page.

In [35]:
from bs4 import BeautifulSoup

In [36]:
soup = BeautifulSoup(page_html, "lxml")

In [37]:
soup.find_all('div', attrs={'class':'art_title'})

[<div class="art_title">Nonviolent Resistance Research</div>,
 <div class="art_title">Do Contemporaneous Armed Challenges Affect the Outcomes of Mass Nonviolent Campaigns?</div>,
 <div class="art_title">Revolution, Nonviolence, and the Arab Uprisings</div>,
 <div class="art_title">Nonviolence as a Weapon of the Resourceful: From Claims to Tactics in Mobilization</div>,
 <div class="art_title">Rightful Radical Resistance: Mass Mobilization and Land Struggles in India and Brazil</div>,
 <div class="art_title">Decolonizing Civil Resistance</div>,
 <div class="art_title">The Dynamics of Nonviolence Knowledge</div>,
 <div class="art_title">Book Reviews</div>]

In [38]:
ts = soup.find_all('div', attrs={'class':'art_title'})

for i in ts:
    print(i.contents[0])


Nonviolent Resistance Research
Do Contemporaneous Armed Challenges Affect the Outcomes of Mass Nonviolent Campaigns?
Revolution, Nonviolence, and the Arab Uprisings
Nonviolence as a Weapon of the Resourceful: From Claims to Tactics in Mobilization
Rightful Radical Resistance: Mass Mobilization and Land Struggles in India and Brazil
Decolonizing Civil Resistance
The Dynamics of Nonviolence Knowledge
Book Reviews


In [39]:
[i.contents[0] for i in ts]

[u'Nonviolent Resistance Research',
 u'Do Contemporaneous Armed Challenges Affect the Outcomes of Mass Nonviolent Campaigns?',
 u'Revolution, Nonviolence, and the Arab Uprisings',
 u'Nonviolence as a Weapon of the Resourceful: From Claims to Tactics in Mobilization',
 u'Rightful Radical Resistance: Mass Mobilization and Land Struggles in India and Brazil',
 u'Decolonizing Civil Resistance',
 u'The Dynamics of Nonviolence Knowledge',
 u'Book Reviews']

In [40]:
def load_or_get(volume, issue):
    '''
    Tries to open a Moby issue. If not found, gets it from the internet.
    '''
    
    file_name = 'moby_%s_%s.html' % (volume, issue)
    url       = 'http://mobilizationjournal.org/toc/maiq/%s/%s' % (volume, issue)
    
    # First, try to find the file stored locally.
    try:
        with codecs.open(file_name, 'rb', 'utf8') as infile:
            page_html = infile.read()
    # If that didn't work, try getting it from the internet.
    except Exception:
        print('Going to the internet to get %s-%s' % (issue, volume))
        page = requests.get(url)
        page_html = page.text
            
        # Save the file so you only go to the page once. It is polite. 
        with codecs.open(file_name, 'wb', 'utf8') as outfile:
            outfile.write(page_html)
    
    # Don't forget to send the stuff back.
    return page_html

In [41]:
page_html = load_or_get(8,1)

Going to the internet to get 1-8


In [42]:
page_html = load_or_get(10,1)

Going to the internet to get 1-10


In [43]:
def scrape_headlines(page_html):
    titles = re.findall(r'div class="art_title">(.*?)</div', page_html)
    return titles

In [44]:
volume = 13

for issue in [1,2]:
    page_html = load_or_get(volume, issue)
    print(scrape_headlines(page_html))

Going to the internet to get 1-13
[u'Repression and Crime Control: Why Social Movement Scholars Should Pay Attention to Mass Incarceration as a Form of Repression', u'Borrowing from the Women\'s Movement "for Reasons of Public Security": A Study of Social Movement Outcomes and Judicial Activism in the European Union', u'Ideology, Strategy and Conflict in a Social Movement Organization: The Sierra Club Immigration Wars', u'Situating Movements Historically: May 1968, Alain Touraine, and New Social Movement Theory', u'The Spatial Dynamics of the May 1968 French Demonstrations', u'Forming Coalitions: A Network-Theoretic Approach to the Contemporary South Korean Environmental Movement']
Going to the internet to get 2-13
[u'Assessing Stability in the Patterns of Selection Bias in Newspaper Coverage of Protest During the Transition from Communism in Belarus<sup>*</sup>', u'Validity and Media-Derived Protest Event Data: Examining Relative Coverage Tendencies in Mexican News Media <sup>*</sup>'
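
Some of the titles above still carry markup such as `<sup>*</sup>`, because the regular expression returns whatever sits between the opening and closing `div`. A rough cleanup is sketched below. `strip_tags` is a hypothetical helper, and a regex like this is only a reasonable assumption for short, simple fragments such as these titles, not for whole pages.

```python
import re

def strip_tags(fragment):
    """Remove anything that looks like an HTML tag and tidy the whitespace.

    strip_tags is a hypothetical helper, not part of the original notebook.
    """
    without_tags = re.sub(r'<[^>]+>', '', fragment)
    # Collapse runs of whitespace and trim the ends.
    return ' '.join(without_tags.split())

title = ('Validity and Media-Derived Protest Event Data: Examining Relative '
         'Coverage Tendencies in Mexican News Media <sup>*</sup>')

clean_title = strip_tags(title)
print(clean_title)
```

The footnote asterisk itself is kept; only the surrounding tags are removed.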

### Still stuck?


### Tip 5: Spoofing a browser

`headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}`

`r = requests.get(url, headers=headers)`

### Tip 6: Keep cookies

`s = requests.Session()`

`s.get('http://httpbin.org/get')`


### Tip 7: Authentication in requests

`requests.get('https://api.github.com/user', auth=('user', 'pass'))`

### Tip 8: Still still stuck


3. Selenium - control the browser. 