## Scraping websites

<div class="alert alert-info"> 
<h1> Your turn</h1>
<p>Find the email addresses for the <a href="http://as.nyu.edu/sociology/people/faculty.html">NYU Sociology faculty</a>.
<p> <em> Remember to put things in functions as soon as possible. </em>
<p>Then try your script on the politics faculty page. Did it work? If not, fix it.
</div>



In [None]:
import requests
import re
import pandas as pd

In [None]:
text = 'My email address is mailto:neal.caren@gmail.com so be sure to send me notes.'

re.findall('mailto:neal.caren@gmail.com so', text)

In [None]:
re.findall('mailto:(.*?) so', text)

In [None]:
emails = re.findall('mailto:(.*?) so', text)
df = pd.DataFrame(emails, columns = ['email_address'])
df['department'] = 'Sociology'
df

In [None]:
df.to_csv('emails.csv')

Tip 1: You can split the HTML and process each section

In [None]:
def find_name(slice):
    return re.findall('title="(.*?)"', slice)[0]

In [None]:
def find_email(slice):
    return re.findall('mailto:(.*?)"', slice)[0]

In [None]:
def scrape_person(slice):
    name = find_name(slice)
    email = find_email(slice)
    entry = {'name' : name,
             'email': email }
    return entry

In [None]:
directory = []

for slice in slices:
    try:
        entry = scrape_person(slice)
        directory.append(entry)
    except IndexError:
        # find_name or find_email found no match in this slice
        print('Empty')

In [None]:
df = pd.DataFrame(directory)
df
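The loop above relies on a list named `slices` that is never built in this section. Here is a minimal sketch of one way to produce it, assuming each person's entry starts with the same opening tag. The sample markup, the `person` class name, and the names and addresses are all made up for illustration; the real page will need its own delimiter.

```python
import re

# Hypothetical markup: assume every faculty entry opens with this same tag.
sample_html = '''
<h2>Faculty</h2>
<div class="person"><a title="Ada Example" href="mailto:ada@example.edu">Email</a></div>
<div class="person"><a title="Bo Sample" href="mailto:bo@example.edu">Email</a></div>
'''

# Split on the repeated opening tag. The first piece is everything before
# the first entry (here, the heading), so drop it.
slices = sample_html.split('<div class="person">')[1:]

for chunk in slices:
    name = re.findall('title="(.*?)"', chunk)[0]
    email = re.findall('mailto:(.*?)"', chunk)[0]
    print(name, email)
```

Splitting first keeps each name paired with the email from the same entry, which a single `re.findall` over the whole page cannot guarantee when a field is missing.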

<div class="alert alert-info"> 
<h1> Your turn</h1>
<p>Add a field for faculty rank (such as Associate or Assistant Professor) to scrape_person.
<p> Bonus: Add try/except to scrape_person so it still returns a result when a field is missing (as with Abend).
</div>



Tip 2: Download Once, Load Many Times

In [None]:
import codecs

def save_file(text, file_name):
    with codecs.open(file_name, 'wb', 'utf8') as outfile:
        outfile.write(text)


In [None]:
url = 'http://mobilizationjournal.org/toc/maiq/22/2'
html= requests.get(url).text

save_file(html, 'moby_22_2.html')


In [None]:
def load_file(file_name):
    with codecs.open(file_name, 'rb', 'utf8') as infile:
        text = infile.read()
    return text

In [None]:
def download_file(volume, issue):
    # The path is /toc/maiq/<volume>/<issue>, as in the 22/2 example above
    url = 'http://mobilizationjournal.org/toc/maiq/' + str(volume) + '/' + str(issue)
    html= requests.get(url).text
    return html

In [None]:
def load_or_download(volume, issue):
    '''Loads a local HTML. If not found, gets the file from the internet.'''
    
    file_name = 'moby_' + str(volume) + '_' + str(issue) + '.html'
    
    try:
        html = load_file(file_name)
    except IOError:
        print('Could not find it. Going to the internet')
        html = download_file(volume, issue)
        save_file(html, file_name)
        
    return html
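To check the caching logic without touching the network, the same pattern can be sketched with the download step passed in as a function. This is only an illustration: `load_or_fetch` and `fake_fetch` are hypothetical names, and the fake function stands in for `requests.get(url).text`.

```python
import os
import tempfile

def load_or_fetch(file_name, fetch):
    '''Return the cached text if file_name exists; otherwise call fetch() and cache it.'''
    if os.path.exists(file_name):
        with open(file_name, encoding='utf8') as infile:
            return infile.read()
    text = fetch()
    with open(file_name, 'w', encoding='utf8') as outfile:
        outfile.write(text)
    return text

calls = []

def fake_fetch():
    # Stand-in for a real download; records each time it is called.
    calls.append(1)
    return '<html>table of contents</html>'

cache_path = os.path.join(tempfile.mkdtemp(), 'moby_22_2.html')

first = load_or_fetch(cache_path, fake_fetch)   # not cached yet: calls fake_fetch
second = load_or_fetch(cache_path, fake_fetch)  # cached: reads the file instead

print(len(calls))
```

The second call reads from disk, so the site is contacted only once no matter how many times the notebook is rerun.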

<div class="alert alert-info"> 
<h1> Your turn. </h1>
<p>
Modify the functions above to download UNC faculty web pages.
<p> Hint 1: load_file doesn't need to be changed.
<p> Hint 2: Instead of volume and issue, we only have one thing: department, such as sociology or politicalscience (keep it to one word).
<p> Hint 3: download_file needs a new URL and must be modified based on Hint 2.
</div>

In [None]:
import pandas as pd

url = 'http://www.sv.uio.no/english/research/phd/summer-school/courses-2017/'

courses = pd.read_html(url)

In [None]:
week1_df = courses[0]
week2_df = courses[1]

In [None]:
week1_df.columns = ['title', 'instructor', 'disciplines']
week1_df['date'] = '24 - 28 July 2017'


In [None]:
week2_df.columns = ['title', 'instructor', 'disciplines']
week2_df['date'] = '31 July - 4 August 2017'


In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two tables
catalog_df = pd.concat([week1_df, week2_df])

catalog_df

In [None]:
catalog_df.to_csv('catalog.csv', encoding='utf8')

<div class="alert alert-info"> 
<h1> Your turn</h1>
<p>Scrape the 2016 course listings. Add it to our catalog.
</div>


### Still stuck?


Tip 5: Spoofing a browser

`headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}`

`r = requests.get(url, headers=headers)`
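One way to confirm the header is actually attached is to build the request without sending it. The sketch below assumes a placeholder URL and a shortened user-agent string, both chosen only for illustration.

```python
import requests

# Placeholder values for illustration only.
headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64)'}
url = 'http://httpbin.org/get'

# Build the request without sending it, then inspect the outgoing headers.
prepared = requests.Request('GET', url, headers=headers).prepare()

# Header lookups on a prepared request are case-insensitive.
print(prepared.headers['User-Agent'])
```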

Tip 6: Keep cookies

`s = requests.Session()`

`s.get('http://httpbin.org/get')`
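A session remembers cookies and sends them back on later requests to the same site. The sketch below shows this offline by setting a cookie by hand (normally the server sets it) and preparing a request without sending it. The cookie name and value are made up.

```python
import requests

s = requests.Session()

# Normally the server sets cookies in its response; here one is added by
# hand so the example runs without a network connection.
s.cookies.set('visited', 'yes', domain='httpbin.org', path='/')

# Prepare (but do not send) a request through the session.
prepared = s.prepare_request(requests.Request('GET', 'http://httpbin.org/get'))

# The session attaches its stored cookies to the outgoing request.
print(prepared.headers.get('Cookie'))
```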


Tip 7: Authentication in requests

`requests.get('https://api.github.com/user', auth=('user', 'pass'))`
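The `auth` tuple is sent as HTTP Basic authentication: the username and password are joined with a colon, base64-encoded, and placed in an `Authorization` header. A small sketch that inspects this without sending anything, using the same placeholder credentials as above:

```python
import base64
import requests

# Build the request without sending it.
prepared = requests.Request(
    'GET', 'https://api.github.com/user', auth=('user', 'pass')
).prepare()

# What Basic auth should look like: "Basic " + base64("user:pass").
expected = 'Basic ' + base64.b64encode(b'user:pass').decode('ascii')

print(prepared.headers['Authorization'])
print(prepared.headers['Authorization'] == expected)
```

Base64 is an encoding, not encryption, so Basic auth is only safe over HTTPS.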

### Tip 8: Still still stuck


3. Selenium - control the browser. 