# Web Scraping in Python

In this appendix lecture we'll go over how to scrape information from the web using Python. 

##### We'll go to a website, decide what information we want, see where and how it is stored, then scrape it and load it into a pandas DataFrame!

#### Some things you should consider before web scraping a website:

1.) You should check a site's terms and conditions before you scrape it.

2.) Space out your requests so you don't overload the site's server; sending too many too quickly could get you blocked.

3.) Scrapers break over time. Web pages change their layout often, so you'll more than likely have to rewrite your code.

4.) Web pages are usually inconsistent, so you'll more than likely have to clean up the data after scraping it.

5.) Every web page and situation is different, so you'll have to spend time configuring your scraper.
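
Point 2 above can be sketched as a small helper that pauses between requests. This is only a minimal sketch: `fetch_all`, `delay_seconds`, and the one-second default are hypothetical names and values chosen for illustration, and `fetch` stands in for whatever function downloads a page (for example `requests.get`).

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Passing the download function in as an argument keeps the pacing logic separate from the networking code, so the same helper works with any fetcher.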

#### To learn more about HTML I suggest these two resources:

[W3Schools](http://www.w3schools.com/html/)

[Codecademy](http://www.codecademy.com/tracks/web)


#### There are three modules we'll need in addition to Python:

1.) BeautifulSoup, which you can download by typing: *pip install beautifulsoup4* or *conda install beautifulsoup4* (for the Anaconda distribution of Python) in your command prompt.

2.) lxml, which you can download by typing: *pip install lxml* or *conda install lxml* (for the Anaconda distribution of Python) in your command prompt.

3.) requests, which you can download by typing: *pip install requests* or *conda install requests* (for the Anaconda distribution of Python) in your command prompt.

We'll start with our imports:

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from pandas import Series,DataFrame

If you get an import error, check which Python installation (Anaconda or not) you're using. If you installed Jupyter with Anaconda, it will likely default to an Anaconda environment. You can check by running the commands below:

In [2]:
import sys
sys.version

'3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:52:12) \n[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]'

In [3]:
sys.executable

'/Users/mac28/anaconda/envs/py35/bin/python'

You may need to activate that specific environment and then install the packages there to be able to use them in your notebook.
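
To see which packages the running interpreter cannot find, a check along these lines may help. It is a sketch rather than a required step: `missing_packages` is a hypothetical helper name, and it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util

def missing_packages(names):
    """Return the import names that cannot be found by the running interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# These are import names, not pip names: beautifulsoup4 is imported as bs4.
print(missing_packages(['bs4', 'lxml', 'requests']))
```

Anything it lists is missing from the environment shown by `sys.executable`. Running pip through that same interpreter (its path followed by `-m pip install` and the package name) installs into the environment the notebook is actually using.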

## Find the webpage that you want to scrape

My scraping script is specific to Multnomah County apartment listings on Craigslist, so you'll want to:
* navigate to that page
* look at the source code.

In [4]:
id_number = '5715121196'

In [5]:
url = 'https://portland.craigslist.org/mlt/apa'+'/'+id_number+'.html'
url

'https://portland.craigslist.org/mlt/apa/5715121196.html'

In [6]:
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c)
print(type(soup))
print(soup)

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>

<html class="no-js">
<head>
<title>OHSU - One Bedroom, One Bath Condo - Charming Historical Building</title>
<link href="http://portland.craigslist.org/mlt/apa/5715121196.html" rel="canonical">
<meta content="Showing Instructions: To schedule a showing in this competitive rental market, just click the green Schedule Showing button on our website to quickly request a showing. It's as easy as 1,2,3. Get the " name="description">
<meta content="noarchive,nofollow,unavailable_after:Wednesday, 07-Sep-16 14:40:13 PDT" name="robots">
<meta content="preview" name="twitter:card">
<meta content="Showing Instructions: To schedule a showing in this competitive rental market, just click the green Schedule Showing button on our website to quickly request a showing. It's as easy as 1,2,3. Get the " property="og:description">
<meta content="https://images.craigslist.org/00n0n_iDdLMUCqtJB_600x450.jpg" property="og:image">
<meta content="craigslist" property=



Because the call above does not name a parser, BeautifulSoup also prints a warning. To get rid of it, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")


### So now how do we find the information in this mess?

* Use HTML inspection tools. I've been using Firebug, but there are many others out there, including the developer tools built into most browsers.


In [7]:
summary = soup.find("span",{'class':'price'})

print(summary)


<span class="price">$1595</span>


In [8]:
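if summary is None:
    # No price element on the page, so report a missing value
    print(np.nan)
else:
    # Grab the first text node inside the <span>
    text = summary.find(text=True)
    print(text)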

$1595


So eventually you can create a function that looks something like this.

In [9]:
def get_price(id_number):
    """Return the listed price as text, or NaN if the page has no price."""
    url = 'https://portland.craigslist.org/mlt/apa'+'/'+id_number+'.html'

    result = requests.get(url)
    c = result.content
    soup = BeautifulSoup(c,'html.parser')

    # The price sits in <span class="price">...</span>
    summary = soup.find("span",{'class':'price'})
    if summary is None:
        return np.nan
    else:
        text = summary.find(text=True)
        return text

In [10]:
get_price('5715121196')

'$1595'
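
The price comes back as text, so it usually needs cleaning before any arithmetic. The sketch below shows one possible way to do that. `price_to_number` is a hypothetical helper, and it assumes prices look like `'$1595'` or `'$1,595'`; anything else becomes NaN.

```python
import math

def price_to_number(text):
    """Convert a price string like '$1595' to a float; return NaN otherwise."""
    if not isinstance(text, str):
        # Covers the NaN that get_price returns when no price was found.
        return math.nan
    cleaned = text.replace('$', '').replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        # Text that is not a plain number, e.g. 'call for price'.
        return math.nan
```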

Now that we know how to get ONE piece of info, we can repeat the process on many of them.
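
One possible way to scale this up, assuming the goal is one row per listing, is to run a getter such as `get_price` over a list of listing ids. This is a sketch only: `collect_prices`, `price_getter`, and `delay_seconds` are hypothetical names, and the pause between calls follows the earlier advice about spacing out requests.

```python
import time

def collect_prices(id_numbers, price_getter, delay_seconds=1.0):
    """Build one row (a dict) per listing by calling price_getter on each id."""
    rows = []
    for i, id_number in enumerate(id_numbers):
        if i > 0:
            # Space out the requests so the server is not overloaded.
            time.sleep(delay_seconds)
        rows.append({'id': id_number, 'price': price_getter(id_number)})
    return rows
```

A list of dicts like this can be passed straight to `DataFrame(...)`, for example `DataFrame(collect_prices(ids, get_price))` where `ids` is your own list of listing ids, giving one row per listing with `id` and `price` columns.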

Riley - 
* Show how many functions I worked on
* Show final Script
