#LocaleBnB - AirBnB contextual recommender

###G Scott Stukey, Zipfian Academy, April 2015 - July 2015

Welcome to my capstone project, LocaleBnB - AirBnB Contextual Recommender. 

###Background:
This project was born out of my frustrations with AirBnB's search functionality. While my fiancee & I were booking in Montreal, we knew that we wanted to stay in a fun, young neighborhood & experience "local Montreal" and stay away from touristy areas. We wanted to find our 'Mission, SF' or 'Park Slope, Brooklyn' of Montreal. We found that while we could filter by neighborhood, we didn't know what the neighborhoods were like. We had no way of mapping the neighborhood to our criteria!

Interestingly, the AirBnB content team has an amazing [neighborhood guide](https://www.airbnb.com/locations) section on their site that gives great information about various cities (e.g. [San Francisco](https://www.airbnb.com/locations/san-francisco)) & the [traits of neighborhoods](https://www.airbnb.com/locations/san-francisco/neighborhoods) found in these cities (e.g. [Alamo Square](https://www.airbnb.com/locations/san-francisco/alamo-square) is: Loved by San Franciscans, Stunning Views, & Touristy).

Using the neighborhood guides put together for NYC & SF neighborhoods & scraping 4,000+ listings on AirBnB, I'm building an app that uses a listing's description to predict whether the listing's neighborhood has a specified trait. I'm then applying that model to score the existing search results & re-sort them based on how much the user likes each trait.
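
As a rough sketch of that scoring step (not the app's actual code): assume the model produces, for each listing, a probability per neighborhood trait, and the user supplies a weight per trait. The weighted-sum rule and the names `score_listing`, `rank_listings`, `trait_probs`, and `user_weights` are all hypothetical, and the probabilities below are made up for illustration.

```python
def score_listing(trait_probs, user_weights):
    """Weighted sum of predicted trait probabilities (hypothetical scoring rule)."""
    return sum(user_weights.get(trait, 0.0) * prob
               for trait, prob in trait_probs.items())


def rank_listings(listings, user_weights):
    """Return the listings sorted from best to worst match for this user."""
    return sorted(listings,
                  key=lambda l: score_listing(l["trait_probs"], user_weights),
                  reverse=True)


# Made-up probabilities, for illustration only
listings = [
    {"listing_id": "a", "trait_probs": {"Nightlife": 0.9, "Touristy": 0.8}},
    {"listing_id": "b", "trait_probs": {"Nightlife": 0.6, "Touristy": 0.1}},
]
# This user likes nightlife and wants to avoid touristy areas
user_weights = {"Nightlife": 1.0, "Touristy": -1.0}
ranked = rank_listings(listings, user_weights)
```

Here listing "b" ranks first: it is less of a nightlife spot, but far less touristy, which is what this user asked for.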

###Project Details:
I'm going to do the project in 4 phases: 
* Data Acquisition & ETL (scraping, storing, feature engineering)
* Exploratory Data Analysis
* Predictive Modeling
* App Creation

###Notebook Details:
This notebook was created retroactively, after much of the project was done, to explain my methodology. Prior to its creation, I used several notebooks to test & iterate toward good solutions. For the actual project, I created classes for Neighborhoods, Search Results, & Room Listings, which helped me with my app & code.

Also note, I will be creating import cells for each section, so that if you wanted to recreate a particular section you can do it without worrying about the above code.

#Data Acquisition & ETL (scraping, storing, feature engineering)

###Extracting neighborhoods from AirBnB's city guide

In [17]:
# import some things for this section
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [7]:
# I've already made a file with the 2 cities I want to scrape:
# NOTE: city_id is unique to this project; it is not based on AirBnB's city ids 
df = pd.read_csv('../data/city_list.csv')
print df.head()

   city_id           city state        country
0        1  San-Francisco    CA  United-States
1        2       New-York    NY  United-States


In [18]:
# let's scrape an example city ('https://www.airbnb.com/locations/new-york') to see what we're working with
r = requests.get('https://www.airbnb.com/locations/new-york')

In [31]:
soup = BeautifulSoup(r.content)
soup.prettify

<bound method BeautifulSoup.prettify of <!DOCTYPE html>
<html lang="en" xmlns:fb="http://ogp.me/ns/fb#">
<head>
<meta charset="utf-8"/>
<script>
  sherlock_firstbyte = Number(new Date());
</script>
<title>New York Travel Guide - Airbnb Neighborhoods</title>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<script type="text/javascript">window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"fc09a36731","applicationID":"1024027","transactionName":"dlwMQktaWAgBEB1RWkFaB0UWRlwLEw==","queueTime":10,"applicationTime":203,"ttGuid":"","agentToken":null,"agent":"js-agent.newrelic.com/nr-632.min.js"}</script>
<script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var o=n[t]={exports:{}};e[t][0].call(o.exports,function(n){var o=e[t][1][n];return r(o?o:n)},o,o.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<t.length;o++)r(t[o]

In [33]:
#Let's find a nice list of neighborhoods to work with
neighborhood_list_raw = soup.find('div', {'class':'neighborhood-list'}).find_all('a')[1:]    # the 0 index is a link to 'all-neighborhoods'
neighborhood_list_raw[:5]

[<a href="/locations/new-york/alphabet-city">Alphabet City</a>,
 <a href="/locations/new-york/astoria">Astoria</a>,
 <a href="/locations/new-york/battery-park-city">Battery Park City</a>,
 <a href="/locations/new-york/bedford-stuyvesant">Bedford-Stuyvesant</a>,
 <a href="/locations/new-york/boerum-hill">Boerum Hill</a>]

In [34]:
#Let's extract the data we need from them
neighborhoods = []

for i, hood in enumerate(neighborhood_list_raw):
    hood_name = hood.get_text()
    hood_url = hood['href']
    neighborhoods.append((i, hood_name, hood_url))

hood_df = pd.DataFrame(neighborhoods, columns=["neighborhood_id", "neighborhood", "neighborhood_url"])
hood_df.head()

   neighborhood_id        neighborhood                        neighborhood_url
0                0       Alphabet City       /locations/new-york/alphabet-city
1                1             Astoria             /locations/new-york/astoria
2                2   Battery Park City   /locations/new-york/battery-park-city
3                3  Bedford-Stuyvesant  /locations/new-york/bedford-stuyvesant
4                4         Boerum Hill         /locations/new-york/boerum-hill


Using the above code as a template, I was able to create **'scrape_neighborhood_list.py'**

This script is used to grab neighborhoods from cities
* takes in a file ('../data/city_list.csv')
* grabs the neighborhoods we wish to scrape from the file
* scrapes AirBnB's city guide for each of the cities to grab the hoods

Note: In the script, I exported the neighborhood list to a csv ('../data/neighborhood_list.csv'), as it wasn't large enough to warrant a full database. If I plan to scale this project, a PostgreSQL database would do the trick just fine.
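
A minimal sketch of the bookkeeping such a script needs, with the scraping itself left out so it runs offline. It assumes the guide URL is the lowercased city name from city_list.csv, and that each city's links have already been pulled out as (name, href) pairs. The helper names `city_guide_url`, `neighborhood_rows`, and `rows_to_csv` are hypothetical and are not taken from the actual script.

```python
import csv
import io

BASE_URL = "https://www.airbnb.com"


def city_guide_url(city):
    """'San-Francisco' -> 'https://www.airbnb.com/locations/san-francisco'"""
    return "%s/locations/%s" % (BASE_URL, city.lower())


def neighborhood_rows(city_id, city, links, start_id=0):
    """Turn (name, href) pairs scraped for one city into CSV-ready rows."""
    return [
        {"neighborhood_id": start_id + i, "neighborhood": name,
         "neighborhood_url": href, "city_id": city_id, "city": city.lower()}
        for i, (name, href) in enumerate(links)
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text (a script would write this to a file)."""
    fields = ["neighborhood_id", "neighborhood", "neighborhood_url",
              "city_id", "city"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


links = [("Alamo Square", "/locations/san-francisco/alamo-square"),
         ("Bayview", "/locations/san-francisco/bayview")]
rows = neighborhood_rows(1, "San-Francisco", links)
csv_text = rows_to_csv(rows)
```

Passing `start_id` lets the ids keep counting up from one city to the next, so each neighborhood gets a project-wide id.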

###Extracting neighborhood data from AirBnB's neighborhood guides

In [42]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [43]:
city_df = pd.read_csv('../data/city_list.csv')
hood_df = pd.read_csv('../data/neighborhood_list.csv')
hood_df.head()

   neighborhood_id    neighborhood                         neighborhood_url  city_id           city
0                0    Alamo Square    /locations/san-francisco/alamo-square        1  san-francisco
1                1         Bayview         /locations/san-francisco/bayview        1  san-francisco
2                2  Bernal Heights  /locations/san-francisco/bernal-heights        1  san-francisco
3                3       Chinatown       /locations/san-francisco/chinatown        1  san-francisco
4                4    Civic Center    /locations/san-francisco/civic-center        1  san-francisco


In [47]:
#Let's find my old stomping grounds, Carroll Gardens, Brooklyn!
hood_df[hood_df['neighborhood']=="Carroll Gardens"]

    neighborhood_id     neighborhood                     neighborhood_url  city_id      city
49               49  Carroll Gardens  /locations/new-york/carroll-gardens        2  new-york


In [86]:
# let's scrape an example neighborhood ('https://www.airbnb.com/locations/new-york/carroll-gardens') to see what we're working with
r = requests.get('https://www.airbnb.com/locations/new-york/carroll-gardens')

In [87]:
soup = BeautifulSoup(r.content)
soup.prettify

<bound method BeautifulSoup.prettify of <!DOCTYPE html>
<html lang="en" xmlns:fb="http://ogp.me/ns/fb#">
<head>
<meta charset="utf-8"/>
<script>
  sherlock_firstbyte = Number(new Date());
</script>
<title>Carroll Gardens, New York Guide - Airbnb Neighborhoods</title>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<script type="text/javascript">window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"fc09a36731","applicationID":"1024027","transactionName":"dlwMQktaWAgBEB1cVlxUClRWR1wLCwZBHUBdXBU=","queueTime":17,"applicationTime":318,"ttGuid":"","agentToken":null,"agent":"js-agent.newrelic.com/nr-632.min.js"}</script>
<script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var o=n[t]={exports:{}};e[t][0].call(o.exports,function(n){var o=e[t][1][n];return r(o?o:n)},o,o.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<t

In [70]:
# Let's grab a bunch of information that we might need:
def extract_features(soup):

    headline = soup.find('div', {'class':'center description'}).get_text().strip()

    description = soup.find('meta', {'name':'description'})['content']

    # Note: Not all neighborhoods in this guide have traits, so chaining .find_all('span') like this fails for those (handled in the production code)
    traits = []    
    traits_html = soup.find('ul', {'class':'traits'}).find_all('span')
    for trait in traits_html:
        traits.append(trait.get_text())

    tags = []    
    for tag in soup.find_all('div', {'class':'neighborhood-tag'}):
        tags.append(tag.get_text().strip())

    similar_hoods = []
    similar_hood_html = soup.find('ul', {'class':'trait-neighborhoods neighborhoods'})
    for similar_hood in similar_hood_html.find_all('li'):
        similar_hoods.append(similar_hood['data-neighborhood-permalink'])

    neighboring_hoods = []
    for neighboring_hood in soup.find('p', {'class':'lede center'}).find_all('a'):
        neighboring_hoods.append(neighboring_hood.get_text())

    caption_bar = soup.find('div', {'class':'caption bar'}).find_all('li')
    public_trans = caption_bar[0].strong.get_text()
    having_a_car = caption_bar[1].strong.get_text()

    data_bbox = soup.find('div', {'id':'inner-map'})['data-bbox']
    data_x = float(soup.find('div', {'id':'inner-map'})['data-x'])
    data_y = float(soup.find('div', {'id':'inner-map'})['data-y'])
    
    print "HEADLINE: \n%s\n" % headline
    print "DESCRIPTION: \n%s\n" % description
    print "TRAITS: \n%s\n" % traits
    print "TAGS: \n%s\n" % tags
    print "SIMILAR HOODS: \n%s\n" % similar_hoods
    print "NEIGHBORING HOODS: \n%s\n" % neighboring_hoods
    print "PUBLIC TRANS: %s" % public_trans
    print "HAVING A CAR: %s" % having_a_car
    print 
    print "GEO INFORMATION: \n%s, \n%s, %s" % (data_bbox, data_x, data_y)

In [71]:
extract_features(soup) 

HEADLINE: 
Carroll Gardens' brownstones, boutiques, and brunch spots are equal parts cool, calm, and collected.

DESCRIPTION: 
Carroll Gardens has established itself as a Brooklyn favorite. Although flush with hip bars, boutiques, and restaurants, this neighborhood has never lost its old-NYC mystique. Quintessential brownstones line tree-trimmed sidewalks and local retailers and Italian eateries populate its cheerful main street. For a stroll or a stay, Carroll Gardens promises a healthy dose of Brooklyn's cool candor.

TRAITS: 
[u'Peace & Quiet', u'Nightlife', u'Dining', u'Loved by New Yorkers', u'Great Transit']

TAGS: 
[u'sought-after', u'beautiful', u'quiet', u'trendy', u'family friendly', u'foodie', u'front gardens', u'hipsters', u'historic', u'professional', u'italian-american', u'calm', u'strollers', u'brownstone', u'family-friendly', u'lesbian friendly', u'families', u'restaurants/bars', u'italian', u'court st.', u'family friendly', u'quaint', u'strategically located', u'vibran

Scanning the output, I agree with nearly all of it!

Callouts:
- Interesting to see that a neighborhood can have both "Peace & Quiet" and "Nightlife" as traits, but CG has it all!
- I wouldn't call Greenwood Heights 'neighboring' to Carroll Gardens.
- I had a car and found it easy, though on a recent trip back I saw that my secret spots down 1st & Gowanus are now condos.
- The odd thing is that "Brooklyn" is listed as *neighboring* Carroll Gardens. Let's look at that a little more:

In [96]:
# Why is Brooklyn a 'neighbor' of Carroll Gardens?:
soup.find('p', {'class':'lede center'})

<p class="lede center">
          Carroll Gardens is within <a href="/locations/neighborhoods/3795">Brooklyn</a> and bordered by <a href="/locations/neighborhoods/1260">Cobble Hill</a>, <a href="/locations/neighborhoods/1304">Columbia Street Waterfront</a>, <a href="/locations/neighborhoods/1213">Boerum Hill</a>, <a href="/locations/neighborhoods/1298">Gowanus</a>, <a href="/locations/neighborhoods/1347">Greenwood Heights</a>, and <a href="/locations/neighborhoods/1301">Red Hook</a>
</p>

So extract_features() doesn't distinguish the 'within' link from the 'bordered by' links: it grabs every link in the paragraph, which is how Brooklyn ends up as a 'neighbor'. It's something to keep in mind; however, since a listing for a place in Williamsburg will say that it's in Williamsburg and NOT Brooklyn, this isn't much of a concern for our purposes.
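
If the distinction ever did matter, one way to recover it is to split the paragraph at the phrase "bordered by" and collect the link texts on each side. This is only a sketch: it assumes every lede paragraph uses the wording shown above, it uses a regular expression on the raw HTML to stay dependency-free (with BeautifulSoup you would walk the tag's children instead), and `split_lede` is a hypothetical helper.

```python
import re

# Matches the visible text of each <a ...>...</a> link
LINK_TEXT = re.compile(r"<a\b[^>]*>(.*?)</a>", re.S)


def split_lede(lede_html):
    """Return (containing_areas, bordering_hoods) from the lede paragraph HTML."""
    within_part, sep, bordered_part = lede_html.partition("bordered by")
    if not sep:
        # No "bordered by" phrase: fall back to treating every link as a neighbor
        return [], LINK_TEXT.findall(lede_html)
    return LINK_TEXT.findall(within_part), LINK_TEXT.findall(bordered_part)


lede = ('<p class="lede center">Carroll Gardens is within '
        '<a href="/locations/neighborhoods/3795">Brooklyn</a> and bordered by '
        '<a href="/locations/neighborhoods/1260">Cobble Hill</a>, '
        '<a href="/locations/neighborhoods/1298">Gowanus</a>, and '
        '<a href="/locations/neighborhoods/1301">Red Hook</a></p>')
containing, bordering = split_lede(lede)
```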

In [74]:
# let's scrape an example SF neighborhood! Again, using my own hood as an example
# ('https://www.airbnb.com/locations/san-francisco/duboce-triangle')
r = requests.get('https://www.airbnb.com/locations/san-francisco/duboce-triangle')

In [75]:
soup = BeautifulSoup(r.content)
extract_features(soup)

HEADLINE: 
Be in the center of everything without having to brave the crowds or the noise.

DESCRIPTION: 
Although Duboce Triangle’s main attraction is its humble namesake park, this quaint residential neighborhood’s centrality contributes to its sought-after status. While this put-together neighborhood is distinctly tranquil, the bustling nightlife and shopping options of The Castro and the Mission are mere steps away. When you want to trek to more distant San Francisco destinations, Duboce Triangle’s multiple MUNI lines make it easy.

TRAITS: 
[u'Great Transit']

TAGS: 
[u'attractive', u'central', u'public transport', u'victorians', u'park', u'well kempt', u'sunny', u'greenery', u'dogs', u'residential', u'desirable', u'location', u'charming', u'community', u'quaint', u'trails', u'family', u'strollers', u'desirable', u'hip', u'waterfront', u'hip', u'trendy', u'chic']

SIMILAR HOODS: 
['glen-park', 'noe-valley', 'south-beach']

NEIGHBORING HOODS: 
[u'Lower Haight', u'Hayes Valley', u'H

Whoa! AirBnB's Content team is good!  

Again, the one piece of information I'm happy with, but not in love with, is Neighboring Hoods (SoMa & Duboce?!).

I still have my car, so I'll pretend that having a car is difficult... (you better not take my spot!)

Using the above code as a template, I was able to create **'scrape_neighborhoods.py'**

This script is used to scrape the neighborhood pages for their content
* takes in a file ('../data/neighborhood_list.csv')
* scrapes AirBnB's neighborhood guide for neighborhoods

I also run the **'extract_features_from_neighborhoods.py'** file, which extracts features from the Neighborhoods & extends the MongoDB document for each neighborhood with this information.

The reason for the above split was to scrape the data from AirBnB (and not get banned) and to separate the scraping from the feature extraction.

Most of these functions rely on the methods of the AirBnBNeighborhood class in **'airbnb/airbnbneighborhood.py'**.

Notes: 
* The extract_features() function in this notebook does NOT handle neighborhoods that are missing a piece of information. Some neighborhoods don't have traits, similar hoods, or neighboring hoods. I take care of that in the production code.
* As mentioned above, extract_features() doesn't discern when a hood is within another hood (e.g. Carroll Gardens is 'within' Brooklyn). It's something to keep in mind; however, since a listing for a place in Williamsburg will say that it's in Williamsburg and NOT Brooklyn, this isn't much of a concern for our purposes.
* When I saved the r.content string to a MongoDB database, then re-imported that string and parsed it with BeautifulSoup, I ran into some issues. As a workaround, I pickle the requests object before saving it to the collection, then unpickle it to parse it with BeautifulSoup.
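
The pickle round trip from the last note looks roughly like this. To keep the sketch runnable offline, a plain dict stands in for the requests object, and the database write is left out; `to_blob` and `from_blob` are hypothetical names. The bytes returned by `to_blob` are what would be stored as a binary field in the document.

```python
import pickle


def to_blob(response_like):
    """Serialize the whole object to bytes for storage."""
    return pickle.dumps(response_like, protocol=2)


def from_blob(blob):
    """Rebuild the object from the stored bytes."""
    return pickle.loads(blob)


# Stand-in for the scraped response (assumption: the real code pickles the requests object)
original = {
    "url": "https://www.airbnb.com/locations/new-york/carroll-gardens",
    "status_code": 200,
    "content": b"<html><title>Carroll Gardens</title></html>",
}
blob = to_blob(original)
restored = from_blob(blob)
```

Because the content travels as raw bytes inside the pickle, it comes back byte-for-byte identical. Only unpickle blobs you wrote yourself, since unpickling untrusted data can execute arbitrary code.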

###Extracting data from AirBnB's Search Results

In [382]:
import requests
from bs4 import BeautifulSoup
import html5lib

In [319]:
# let's scrape an example search result to see what we're working with
# https://www.airbnb.com/s/Portland--OR--United-States?checkin=09%2F16%2F2015&checkout=09%2F22%2F2015&guests=4&price_max=300
# we see we have the city, state & country in the search URL, and all of the other information is passed via parameters
r = requests.get('https://www.airbnb.com/s/Portland--OR--United-States?checkin=09%2F16%2F2015&checkout=09%2F22%2F2015&guests=4&price_max=300')

In [384]:
r.content

'<!DOCTYPE html>\n\n<!--[if lt IE 8]>\n\n<html lang="en"\n      \n      xmlns:fb="http://ogp.me/ns/fb#"\n      class="ie">\n\n<![endif]-->\n\n<!--[if IE 8]>\n\n<html lang="en"\n      \n      xmlns:fb="http://ogp.me/ns/fb#"\n      class="ie ie8">\n\n<![endif]-->\n\n<!--[if IE 9]>\n\n  <html lang="en"\n      \n      xmlns:fb="http://ogp.me/ns/fb#"\n      class="ie ie9">\n\n<![endif]-->\n\n<!--[if (gt IE 9)|!(IE)]><!-->\n<html lang="en"\n      \n      xmlns:fb="http://ogp.me/ns/fb#">\n\n<!--<![endif]-->\n\n  <head>\n      <link rel="dns-prefetch" href="//maps.googleapis.com">\n      <link rel="dns-prefetch" href="//maps.gstatic.com">\n      <link rel="dns-prefetch" href="//mts0.googleapis.com">\n      <link rel="dns-prefetch" href="//mts1.googleapis.com">\n\n    <!--[if IE]><![endif]-->\n    <meta charset="utf-8">\n\n    <!--[if IE 8]>\n      <link href="https://a1.muscache.com/airbnb/static/packages/common_o2.1_ie8-8341a24cc71ce465b36585d097252e3d.css" media="all" rel="stylesheet" type="

In [418]:
soup = BeautifulSoup(r.content)
soup

<!DOCTYPE html>
<!--[if lt IE 8]>

<html lang="en"
      
      xmlns:fb="http://ogp.me/ns/fb#"
      class="ie">

<![endif]--><!--[if IE 8]>

<html lang="en"
      
      xmlns:fb="http://ogp.me/ns/fb#"
      class="ie ie8">

<![endif]--><!--[if IE 9]>

  <html lang="en"
      
      xmlns:fb="http://ogp.me/ns/fb#"
      class="ie ie9">

<![endif]--><!--[if (gt IE 9)|!(IE)]><!--><html lang="en" xmlns:fb="http://ogp.me/ns/fb#">
<!--<![endif]-->
<head>
<link href="//maps.googleapis.com" rel="dns-prefetch"/>
<link href="//maps.gstatic.com" rel="dns-prefetch"/>
<link href="//mts0.googleapis.com" rel="dns-prefetch"/>
<link href="//mts1.googleapis.com" rel="dns-prefetch"/>
<!--[if IE]><![endif]-->
<meta charset="utf-8"/>
<!--[if IE 8]>
      <link href="https://a1.muscache.com/airbnb/static/packages/common_o2.1_ie8-8341a24cc71ce465b36585d097252e3d.css" media="all" rel="stylesheet" type="text/css" />
    <![endif]-->
<!--[if !(IE 8)]><!-->
<link href="https://a2.muscache.com/airbnb/static/pa

In [420]:
listings_raw = soup.find_all("div", {"class": "listing"})

len(listings_raw)

8

I must note here that there should be 18 listings on each page (or thereabouts, specifically on the first page). This is confirmed by doing ctrl-f on the page and searching for *class="listing"*, which gives 18 results. However, in our parsed soup object, we only get 8 back.

Based on the [BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser), this might be due to old parsers. I tried both html.parser & lxml; I couldn't get html5lib working on my machine. My guess is that something in the markup (a broken tag, some JS, or divception with a broken tag somewhere) is causing this.

Again, for the purposes of our project, we just need a representative sample, so being able to get the top 8+ is useful enough. In our EDA, we'll definitely explore how many listings we find on each page.
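
The ctrl-f check can be automated so the raw count can be compared against what the parser returns. A minimal sketch: it counts class attributes in the raw HTML text that contain the given class name as one of their tokens. `count_class_in_raw_html` is a hypothetical helper, and because it ignores HTML structure it will also count matches inside comments or scripts, so treat the number as an upper bound.

```python
import re

# Captures the value of every double-quoted class attribute
CLASS_ATTR = re.compile(r'class\s*=\s*"([^"]*)"')


def count_class_in_raw_html(html, class_name):
    """Count class attributes in raw HTML text that include class_name."""
    return sum(1 for value in CLASS_ATTR.findall(html)
               if class_name in value.split())


sample = ('<div class="listing" data-id="1"></div>'
          '<div class="listing featured" data-id="2"></div>'
          '<div class="listing-img"></div>')
raw_count = count_class_in_raw_html(sample, "listing")
```

In the notebook, the comparison would be this raw count on the decoded page text versus `len(listings_raw)`. A gap like 18 vs 8 points at the parser dropping part of the tree rather than at the selector.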

In [424]:
# Let's grab a bunch of information that we'll need:
def extract_thumbnail_data(listing):
    cur_data = {}
    listing_id = listing['data-id']
    cur_data['listing_id'] = listing_id

    cur_data['lat'] = listing['data-lat']
    cur_data['lng'] = listing['data-lng']

    # cur_data['thumbnail_img'] = listing.find("img")['src']    # Old Image
    if listing.find('img') is not None:
        tmp_img = listing.find('img')['data-urls'][2:]
        tmp_img = tmp_img[:tmp_img.find("\"")]
        cur_data['thumbnail_img'] = tmp_img
    else:
        cur_data['thumbnail_img'] = "/no_thumbnail.jpg"

    cur_data['blurb'] = listing['data-name']

    if listing.find("span"):
        cur_data['thumbnail_price'] = listing.find("span").get_text()
    else:
        cur_data['thumbnail_price'] = "n/a"

    if listing.find("div", {'itemprop':"description"}):
        if listing.find("div", {'itemprop':"description"}).find('a'):
            tmp = listing.find("div", {'itemprop':"description"}).find('a').get_text()
            if tmp.find(u'\xb7') != -1:
                tmp=tmp[:tmp.find(u'\xb7')]
            cur_data['listing_type'] = tmp
        else:
            cur_data['listing_type'] = "not available"
    else:
        cur_data['listing_type'] = "not available"


    return cur_data

In [430]:
for listing in listings_raw:
    t = extract_thumbnail_data(listing)
    for x in t:
        print "%s: %s" % (x, t[x])

listing_id: 4963976
thumbnail_price: 167
thumbnail_img: https://a2.muscache.com/ac/pictures/62186231/8ee4277e_original.jpg?interpolation=lanczos-none&size=x_medium&output-format=jpg&output-quality=70
lat: 45.55706779081872
lng: -122.63691374623141
listing_type: 
  Entire home/apt
     
blurb: Handbuilt Alberta Arts Cottage
listing_id: 3167041
thumbnail_price: 122
thumbnail_img: https://a1.muscache.com/ac/pictures/41482889/c7c1259e_original.jpg?interpolation=lanczos-none&size=x_medium&output-format=jpg&output-quality=70
lat: 45.557603765748595
lng: -122.63400815862627
listing_type: 
  Entire home/apt
     
blurb: Awesome Alberta Arts loft apartment
listing_id: 6560720
thumbnail_price: 122
thumbnail_img: https://a1.muscache.com/ac/pictures/83530370/1039e84a_original.jpg?interpolation=lanczos-none&size=x_medium&output-format=jpg&output-quality=70
lat: 45.53139419479622
lng: -122.69989668431994
listing_type: 
  Entire home/apt
     
blurb: Apartment in Portland's Nob Hill
listing_id: 63029

Great! So for every listing in the search result, we now have great information about it, including:
- The listing name & id
- a blurb, or mini-description (separate from the full description we'll scrape later)
- an image to work with
- the lat & lng of the listing (YASS!!!!!!)
- listing type for most (though, not all)
- the price for most (though, not all)

Where this information will come in handy is when our app does live scrapes.  We'll want to grab a lot of the information to use before we ever scrape the listing.
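
Before a live app uses these values, they need a little tidying: the output above shows lat/lng as strings and listing_type padded with whitespace. A minimal sketch of that cleanup follows. `tidy_thumbnail` and `first_thumbnail_url` are hypothetical helpers, and reading data-urls as a JSON list of strings is an assumption inferred from the slicing in extract_thumbnail_data().

```python
import json


def first_thumbnail_url(data_urls, default="/no_thumbnail.jpg"):
    """Return the first URL, assuming data-urls holds a JSON list of strings."""
    try:
        urls = json.loads(data_urls)
    except (ValueError, TypeError):
        return default
    if isinstance(urls, list) and urls:
        return urls[0]
    return default


def tidy_thumbnail(raw):
    """Normalize the dict returned by extract_thumbnail_data()."""
    tidy = dict(raw)
    tidy["lat"] = float(raw["lat"])
    tidy["lng"] = float(raw["lng"])
    # Drop anything after the middle dot, then the surrounding whitespace
    tidy["listing_type"] = raw["listing_type"].split(u"\xb7")[0].strip()
    price = raw["thumbnail_price"].strip()
    tidy["thumbnail_price"] = int(price) if price.isdigit() else None
    return tidy


raw = {"listing_id": "4963976", "thumbnail_price": "167",
       "lat": "45.55706779081872", "lng": "-122.63691374623141",
       "listing_type": "\n  Entire home/apt\n     ",
       "blurb": "Handbuilt Alberta Arts Cottage"}
tidy = tidy_thumbnail(raw)
```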

Using the above code as a template, I was able to create **'scrape_search_results.py'**

This script is used to scrape AirBnB's search result pages for their content
* takes in a file ('../data/city_list.csv')
* generates a sampling of dates (1)
* scrapes AirBnB's search results

(1) I keyed in on 2-day availabilities (Tues-Thurs & Fri-Sun, from June-Dec) to get a good representative sample of listings
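
The date sampling and URL building can be sketched as follows (Python 3). Assumptions to flag: the year is taken to be 2015, the place segment is built by joining city, state and country with "--" as in the Portland example, and `sample_stays` and `search_url` are hypothetical names rather than functions from the script.

```python
from datetime import date, timedelta
from urllib.parse import urlencode  # Python 3 location of urlencode


def sample_stays(start, end):
    """Yield (checkin, checkout) pairs for Tue-Thu and Fri-Sun stays in [start, end]."""
    day = start
    while day <= end:
        if day.weekday() in (1, 4):  # 1 = Tuesday, 4 = Friday
            checkout = day + timedelta(days=2)
            if checkout <= end:
                yield day, checkout
        day += timedelta(days=1)


def search_url(city, state, country, checkin, checkout, guests=None, price_max=None):
    """Build a search URL in the same shape as the Portland example."""
    place = "--".join([city, state, country])
    params = [("checkin", checkin.strftime("%m/%d/%Y")),
              ("checkout", checkout.strftime("%m/%d/%Y"))]
    if guests is not None:
        params.append(("guests", guests))
    if price_max is not None:
        params.append(("price_max", price_max))
    return "https://www.airbnb.com/s/%s?%s" % (place, urlencode(params))


stays = list(sample_stays(date(2015, 6, 1), date(2015, 12, 31)))
example = search_url("Portland", "OR", "United-States",
                     date(2015, 9, 16), date(2015, 9, 22),
                     guests=4, price_max=300)
```

`urlencode` takes care of escaping the slashes in the dates as `%2F`, so `example` reproduces the URL scraped earlier in this section.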

I also run the **'extract_listings_from_search_results.py'** file, which inserts all of the listings found in search results into a listings collection. This essentially "initializes" the listings collection

Again, the reason for the above split was to scrape the data from AirBnB (and not get banned) and to separate the scraping from the feature extraction.

Most of these functions rely on the methods of the search result class in **'airbnb/airbnbsearchresult.py'**.

Notes: 
* I ran the **'scrape_search_results.py'** file multiple times. There was one time where I closed my computer mid-run. Not cool. It corrupted a file in my MongoDB collection and I had to scrape all of the information again. Lesson learned.
* There are likely duplicate runs in the production database. For the most part, I don't care which search results are returned, so long as I have a bevy of listings from which to pull (or try to pull) data. It does not impact my project.
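
Both notes come down to simple bookkeeping, sketched here with plain Python containers instead of MongoDB. Duplicates across runs collapse when listings are keyed by listing_id, and an interrupted scrape can resume by skipping ids that are already stored. `unique_listings` and `pending_ids` are hypothetical helpers, not functions from the project.

```python
def unique_listings(search_runs):
    """Collapse listings from many search runs into one dict keyed by listing_id."""
    listings = {}
    for run in search_runs:
        for listing in run:
            # Keep the first copy seen; later duplicates are ignored
            listings.setdefault(listing["listing_id"], listing)
    return listings


def pending_ids(all_ids, scraped_ids):
    """Ids still to scrape, so an interrupted run can pick up where it stopped."""
    done = set(scraped_ids)
    return [i for i in all_ids if i not in done]


run_1 = [{"listing_id": "4963976", "blurb": "Handbuilt Alberta Arts Cottage"},
         {"listing_id": "3167041", "blurb": "Awesome Alberta Arts loft apartment"}]
run_2 = [{"listing_id": "3167041", "blurb": "Awesome Alberta Arts loft apartment"},
         {"listing_id": "6560720", "blurb": "Apartment in Portland's Nob Hill"}]
listings = unique_listings([run_1, run_2])
todo = pending_ids(sorted(listings), scraped_ids=["3167041"])
```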

###Extracting data from AirBnB's Listings

In [432]:
import requests
from bs4 import BeautifulSoup

In [433]:
# let's scrape an example listing to see what we're working with
# https://www.airbnb.com/rooms/1943748 (from above)
# Note, when clicking through from a search, you get the parameters passed in the URL
# https://www.airbnb.com/rooms/19463748?checkin=09%2F16%2F2015&checkout=09%2F22%2F2015&guests=4&s=XKml
r = requests.get('https://www.airbnb.com/rooms/1943748')

In [434]:
soup = BeautifulSoup(r.content)
soup.prettify

<bound method BeautifulSoup.prettify of <!DOCTYPE html>
<!--[if lt IE 8]>

<html lang="en"
      
      xmlns:fb="http://ogp.me/ns/fb#"
      class="ie">

<![endif]--><!--[if IE 8]>

<html lang="en"
      
      xmlns:fb="http://ogp.me/ns/fb#"
      class="ie ie8">

<![endif]--><!--[if IE 9]>

  <html lang="en"
      
      xmlns:fb="http://ogp.me/ns/fb#"
      class="ie ie9">

<![endif]--><!--[if (gt IE 9)|!(IE)]><!--><html lang="en" xmlns:fb="http://ogp.me/ns/fb#">
<!--<![endif]-->
<head>
<link href="//maps.googleapis.com" rel="dns-prefetch"/>
<link href="//maps.gstatic.com" rel="dns-prefetch"/>
<link href="//mts0.googleapis.com" rel="dns-prefetch"/>
<link href="//mts1.googleapis.com" rel="dns-prefetch"/>
<!--[if IE]><![endif]-->
<meta charset="utf-8"/>
<!--[if IE 8]>
      <link href="https://a1.muscache.com/airbnb/static/packages/common_o2.1_ie8-8341a24cc71ce465b36585d097252e3d.css" media="all" rel="stylesheet" type="text/css" />
    <![endif]-->
<!--[if !(IE 8)]><!-->
<link href="

In [445]:
def extract_listing_features(soup):

    listing_name = soup.find('div', {'class':"rich-toggle wish_list_button"})['data-name']
    address = soup.find('div', {'class':"rich-toggle wish_list_button"})['data-address']
    hood = soup.find('div',{'id':'neighborhood-seo-link'}).h3.a.get_text().strip()
    num_saved = soup.find('div', {'class':"rich-toggle wish_list_button"})['title']
    price_currency = soup.find('meta', {'itemprop': 'priceCurrency'})['content']
    price = soup.find('meta', {'itemprop': 'price'})['content']

    headline = soup.find('meta', {'property':"og:description"})['content']

    description_raw = soup.find('div', {'class':'row description'}).find('div', {'class':'expandable-content expandable-content-long'}).get_text()

    print listing_name
    print address
    print hood
    print num_saved
    print price_currency
    print price
    print
    print "HEADLINE:"
    print headline
    print
    print "DESCRIPTION"
    print description_raw

In [446]:
extract_listing_features(soup)

Classic Close-In Portland
Southeast Ankeny Street, Portland (Buckman)
Buckman
Saved 223 times
USD
138

HEADLINE:
House in Portland, United States. My spacious 3 bedroom house is a classic &quot;Portland style&quot; home with hardwood floors and art by local artists throughout. Located about a mile from downtown it&#x27;s centrally located and a great base for exploring the city on foot, by bike, public tra...

DESCRIPTION

The Space
The first floor has an open layout that includes a living room with a gas fireplace for chilly days. South-facing French doors open onto an elevated front deck that is surrounded by foliage, so you can watch the world go by without being on stage. The kitchen has a retro chrome cafe table for quick meals, and there's a formal dining room in the living room. There is a half bath off the kitchen and a stereo and turntable with lots of records and CDs.
One bedroom, a media room, and the bathroom are located on the second floor. The bedroom is a simple guest ro

NICE! We got the information we needed!

We'll definitely have to do some feature engineering on the description (hence the name, description_raw), but for the most part we're in a good place.
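
As a first taste of that cleanup, here is a minimal sketch (Python 3) that decodes the HTML entities visible in the headline above (`&quot;`, `&#x27;`) and breaks a description into lowercase tokens. `clean_headline` and `tokenize` are hypothetical helpers, and the tokenizer is deliberately simple; it is not the feature engineering used in the modeling phase.

```python
import html  # Python 3 standard library
import re


def clean_headline(headline):
    """Decode HTML entities and collapse runs of whitespace."""
    text = html.unescape(headline)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(description):
    """Lowercase word tokens; a deliberately simple starting point."""
    return re.findall(r"[a-z0-9']+", description.lower())


headline = ("a classic &quot;Portland style&quot; home; "
            "it&#x27;s central &amp;  sunny")
cleaned = clean_headline(headline)
tokens = tokenize("The Space\nSouth-facing French doors")
```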

In [447]:
#Let's try one more:
r = requests.get('https://www.airbnb.com/rooms/44323')
soup = BeautifulSoup(r.content)
extract_listing_features(soup)

Hip Inner Mission Garden Room
Lexington St, San Francisco (Mission District)
Mission District
Saved 2433 times
USD
69

HEADLINE:
House in San Francisco, United States. We are currently not accepting ANY short or long term reservations for the Garden Room. We will soon start a major remodel of our entire home and we are unsure as to when we will reopen the new &amp; improved space.  Why rent a room for $60-$90 per ni...

DESCRIPTION

The Space
Why rent a room for $60-$90 per night and share your bathroom with the host? 
Why be a $20 cab ride away from Valencia Street's bars, night clubs and restaurants when you can stay directly in the middle of our vibrant neighborhood?
Our 1876 Victorian home is located in the sunny Liberty Hill Historic District. Features include:
   * Full size bed and linens for your comfort.
   * Private garden path entrance with your own key.
   * Sunny patio with gurgling fountain.
   * Private bathroom with tile-lined shower adjacent to room.
   * Plush bathroo

Interesting that this one doesn't have the same pieces of information as the previous one did. It only has the Space section.

In [449]:
#Let's try at least one more:
r = requests.get('https://www.airbnb.com/rooms/736080')
soup = BeautifulSoup(r.content)
extract_listing_features(soup)

HUGE Master Suite with Private Bath
17th Ave, San Francisco (Inner Sunset)
Inner Sunset
Saved 61 times
USD
119

HEADLINE:
Apartment in San Francisco, United States. Beautiful, Modern MASTER Bedroom with Private BATH &amp; Private PARKING GARAGE SPOT &amp; WASHER/DRYER!  The Master Bedroom: 420 sq ft!!! Room is Larger than most studios in SF!   •	Private Bathroom •	Separate Large Living Room/Office with couch, cof...

DESCRIPTION

The Space
Beautiful, Modern MASTER Bedroom with Private BATH & Private PARKING GARAGE SPOT & WASHER/DRYER!
The Master Bedroom: 420 sq ft!!! Room is Larger than most studios in SF!  
•	Private Bathroom
•	Separate Large Living Room/Office with couch, coffee table, 30 in TV, DVD player
•	Queen Size Bed with Super Comfortable Pillow Top Mattress
•	2 Closets with Full Length Mirrors
•	Full Size Dresser with Vanity Mirror
•	Backyard view with Natural Light
•	High Speed Wireless Internet, Comcast Cable with DVR
***Free Garage Parking Spot Available!!!!!!***
The C

Using the above code as a template, I was able to create **'scrape_listings.py'**

This script is used to scrape AirBnB's listing pages for their content
* scrapes every listing found in the listing collection

NOTE: 
* While running this script, I got stopped several times.
* At a certain point, I used a VPN from Canada to scrape, so some of my data was scraped from Canada.
* Thankfully, I had a 3500+ chunk that I was able to scrape overnight in a single night.
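
One common way to reduce how often a long scrape gets stopped is to pause between requests and back off after a failure. The sketch below shows that pattern; it is not what scrape_listings.py does. The fetch function is passed in so the example runs without a network, and `fetch_with_retries`, `scrape_all`, and the delay values are hypothetical.

```python
import time


def fetch_with_retries(fetch, url, max_tries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of tries: let the caller see the error
            sleep(base_delay * 2 ** attempt)


def scrape_all(fetch, urls, pause=1.0, sleep=time.sleep):
    """Fetch urls one at a time, pausing between requests."""
    results = {}
    for url in urls:
        results[url] = fetch_with_retries(fetch, url, sleep=sleep)
        sleep(pause)
    return results
```

With requests, `fetch` could be a small function that calls `requests.get(url)` and then `raise_for_status()`, so that HTTP error responses trigger the retry path too.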

I also run the **'extract_features_from_listings.py'** file, which extracts features from the listings & extends the MongoDB document for each listing with this information.

The reason for the above split was to scrape the data from AirBnB (and not get banned) and to separate the scraping from the feature extraction.

Most of these functions rely on the methods of the AirBnBListing class in **'airbnb/airbnblisting.py'**.

Notes: 
* I didn't do it here, but during the modeling phase I will do some feature engineering on the description to help put it in the model.

#Continue onto the Analysis:
###See /notebook_2_eda-analysis.ipynb