# Project 3:  Web Scraping
### Finding Underpriced RVs on Craigslist

![](https://snag.gy/WrdUMx.jpg)

In this project we will be practicing our web scraping skills.  You can use Scrapy or Python requests in order to complete this project.  It may be helpful to write some prototype code in this notebook to test your assumptions, then move it into a Python file that can be run from the command line.

> In order to run code from the command line, instead of the notebook, you just need to save your code to a file (with a .py extension), and run it using the Python interpreter:<br><br>
> `python my_file.py`

You will be building a process to scrape a single category of search results on Craigslist, that can easily be applied to other categories by changing the search terms.  The main goal is to be able to target and scrape a single page given a set of parameters.
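As a rough illustration of "a single page given a set of parameters", the sketch below builds a search URL from a region, a category code, and optional query parameters. The helper name `build_search_url` is hypothetical, and the URL pattern is assumed from the example URLs used later in this notebook, not from any documented Craigslist API.

```python
from urllib.parse import urlencode


def build_search_url(region, category, **params):
    """Return a Craigslist-style search URL for one region and category.

    Assumption: URLs follow https://<region>.craigslist.org/search/<category>.
    """
    base = "https://{}.craigslist.org/search/{}".format(region, category)
    if not params:
        return base
    # Sort the parameters so the same inputs always produce the same URL.
    return base + "?" + urlencode(sorted(params.items()))


print(build_search_url("sfbay", "rva", min_price=100))
```

Changing only the `region` or `category` argument is then enough to target a different site or section.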

**If you use Scrapy, provide your code in a folder.**

## Import your libraries for scrapy / requests / pandas / numpy / etc
Set up whichever libraries you need. Review past material for reference.

In [1]:
# PREPARE REQUIRED LIBRARIES
import numpy as np
import requests
import pandas as pd
import scrapy




## 1.  Scrape for the largest US cities (non-exhaustive list)
Search, research, and scrape Wikipedia for a list of the largest US cities.  There are a few sources, but find one that is in a nice table.  We don't want all cities, just significant cities.  Examine your source.  Look for markup that differentiates the data you want from the rest of the page.

- Use requests
- Build XPath query(ies)
- Extract to a list
- Clean your list
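The "clean your list" step might look something like the sketch below. It assumes the extracted names can carry stray whitespace, bracketed footnote markers, blanks, and duplicates; the helper name `clean_names` and the sample values are illustrative only.

```python
import re


def clean_names(raw_names):
    """Strip whitespace and bracketed footnote markers; drop blanks and duplicates."""
    cleaned = []
    seen = set()
    for name in raw_names:
        # Remove markers such as "[a]" that sometimes trail a table cell.
        name = re.sub(r"\[[^\]]*\]", "", name).strip()
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned


sample = [" New York ", "Los Angeles[a]", "", "Chicago", "Chicago"]
print(clean_names(sample))
```

The original order is preserved, which keeps the list sorted by population if the source table was.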

In [2]:
# SCRAPE WIKIPEDIA FOR LARGEST US CITIES (NON-EXHAUSTIVE LIST)

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

import requests

response = requests.get('https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population')
HTML = response.text  
HTML[0:150]  


u'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of United States cities by population - Wiki'

## 1.2 Only retain cities with properly formed ASCII

Optionally, filter out any cities with improper ASCII characters.  A smaller list will be easier to look at.  However, you may not need to filter these if you spend more time scraping a more concise city list.  This list should help you narrow down the list of regional Craigslist sites.
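One possible filtering function is sketched below. It keeps only names made entirely of ASCII characters; the function names are hypothetical, and the sample includes a name with an en dash of the kind that shows up in the scraped city list.

```python
def is_ascii(text):
    """Return True when every character in text is plain ASCII."""
    return all(ord(ch) < 128 for ch in text)


def keep_ascii(names):
    """Return only the names that contain no non-ASCII characters."""
    return [name for name in names if is_ascii(name)]


sample = ["Tucson", "Winston\u2013Salem", "St. Louis"]
print(keep_ascii(sample))
```

Dropping such names is a shortcut; replacing the en dash with a hyphen would keep the city in the list instead.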

In [3]:
# ONLY RETAIN PROPERLY FORMED CITIES WITH FILTERING FUNCTION

city = Selector(text=HTML).xpath('//td/i/a/text()|//td[2]/a/text()').extract()


## 2.  Write a function to capture current pricing information via Craigslist in one city.
Choose a city from your scraped data, then go to the corresponding city section on Craigslist and search for "rv" in the auto section.  Write a method that pulls out the prices.
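A price-extraction method could be sketched as follows. It assumes the scraper returns strings that contain a dollar amount, such as the `result-price` spans seen in the search results; `parse_prices` is a hypothetical helper name.

```python
import re

# A dollar sign followed by digits, optionally with thousands separators.
PRICE_RE = re.compile(r"\$\s*(\d[\d,]*)")


def parse_prices(fragments):
    """Convert strings such as '<span ...>$6300</span>' into floats."""
    prices = []
    for fragment in fragments:
        match = PRICE_RE.search(fragment)
        if match:
            prices.append(float(match.group(1).replace(",", "")))
    return prices


sample = [
    '<span class="result-price">$180</span>',
    '<span class="result-price">$6300</span>',
    "no price here",
]
print(parse_prices(sample))
```

Fragments without a recognizable price are skipped, so the output can be shorter than the input.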

In [4]:
response = requests.get('https://sfbay.craigslist.org/search/rva?min_price=100')
HTML = response.text  
HTML[0:150]


u'\ufeff<!DOCTYPE html>\n\n<html class="no-js"><head>\n    <title>SF bay area recreational vehicles  - craigslist</title>\n\n    <meta name="description" content='


## 2.1 Create a mapping of cities to corresponding regional Craigslist URLs

Major US cities on Craigslist typically have their own corresponding section (e.g., SF Bay Area, NYC, Boston, Miami, Seattle, etc.).  Later, you will use these to query search results for the various metropolitan regions listed on Craigslist.  Between the major metropolitan Craigslist sites, the only thing that differs is the URL that corresponds to each one.

The point of the "mapping":  Create a data structure that lets you iterate over both the name of each city from Wikipedia and the corresponding value that will allow you to construct the Craigslist URL for that region.

> For San Francisco (the Bay Area metropolitan area), the URL for the RV search result is:
> http://sfbay.craigslist.org/search/sss?query=rv
>
> The convention is http://[region].craigslist.org/search/sss?query=rv
> Replacing [region] with the corresponding region name will allow you to quickly iterate through each regional Craigslist site, and scrape the prices from the search results.  Keep this in mind while you build this "mapping".
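
A minimal sketch of such a mapping is a dictionary from city name to region identifier. The entries below are hand-written assumptions based on the regions named in this notebook (`sfbay`, `portland`, `phoenix`); the real identifier for each city should be checked on Craigslist, since it is not always the city name with the spaces removed.

```python
# City name (as scraped from Wikipedia) -> Craigslist region identifier.
# These three entries are illustrative assumptions, not a verified list.
REGION_BY_CITY = {
    "San Francisco": "sfbay",
    "Portland": "portland",
    "Phoenix": "phoenix",
}

URL_TEMPLATE = "http://{region}.craigslist.org/search/sss?query={query}"


def search_urls(region_by_city, query):
    """Return a dict of city name -> search URL for the given query."""
    return {
        city: URL_TEMPLATE.format(region=region, query=query)
        for city, region in region_by_city.items()
    }


for city, url in sorted(search_urls(REGION_BY_CITY, "rv").items()):
    print(city, url)
```

Keeping the city name as the key means the results can later be joined back to the Wikipedia data.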


In [5]:
Selector(text=HTML).xpath('//link[@rel="next"]/@href').extract()


[u'https://sfbay.craigslist.org/search/rva?s=120&min_price=100']

In [6]:
Selector(text=HTML).xpath('//p/a[@class="result-title hdrlnk"]/text()').extract()

[u'Daily rental Mercedes based',
 u'RV for sale',
 u"30' 1998 Mallard. Bunkhouse Model",
 u'1987 "Sunrader" Toyota MH',
 u'camping trailer for sale',
 u'2004 Lance 820 Cab over Camper',
 u'RV Covers, Trailer Covers,',
 u'35 ft cyclone',
 u'1978 Chevy RV',
 u"1979 26'Winnebago",
 u'Gorgeous Monaco Diplomat 2006',
 u'Onan 4.0 Generator',
 u'L+I+FE STYLE-\U0001f4a5\U0001f4a5\U0001f4a5 19..90 Winnebago Minnie',
 u"2004- 27' Minnie Winnebago WF427P",
 u'\u273fBeautiful Like-New RV for Rent - with Slide Out - Sleeps 6!',
 u'RV for Rent - Beautiful RVs at Affordable Prices!!!',
 u"\u27b830' RV For Rent - Like New! with 4 Bunk beds - RARE! Sleeps up to 10!",
 u"2001 34' Monaco with slide",
 u'2002 Ultra Gulf Stream, Class A- Low miles',
 u'2007 Forest River Georgetown SE 350DS Bunkhouse',
 u'Airstream Bathtubs',
 u"Tear Drop Trailer 14'",
 u"1997 Holiday Rambler Vacationer 38' with slide out",
 u'1987 Toyota Sunrader RV',
 u'pop up tent trailer',
 u"2011 SABRE 31' Travel Trailer LIKE NEW!!!!$$

In [7]:
Selector(text=HTML).xpath('//span[@class="result-meta"]/span[@class="result-price"]').extract()

[u'<span class="result-price">$180</span>',
 u'<span class="result-price">$6300</span>',
 u'<span class="result-price">$6450</span>',
 u'<span class="result-price">$10900</span>',
 u'<span class="result-price">$8900</span>',
 u'<span class="result-price">$12500</span>',
 u'<span class="result-price">$795</span>',
 u'<span class="result-price">$26900</span>',
 u'<span class="result-price">$400</span>',
 u'<span class="result-price">$3200</span>',
 u'<span class="result-price">$106000</span>',
 u'<span class="result-price">$150</span>',
 u'<span class="result-price">$2016</span>',
 u'<span class="result-price">$40000</span>',
 u'<span class="result-price">$165</span>',
 u'<span class="result-price">$165</span>',
 u'<span class="result-price">$175</span>',
 u'<span class="result-price">$25000</span>',
 u'<span class="result-price">$6900</span>',
 u'<span class="result-price">$40900</span>',
 u'<span class="result-price">$150</span>',
 u'<span class="result-price">$17500</span>',
 u'<span 

In [8]:
Selector(text=HTML).xpath('//p/a/@href').extract()

[u'/eby/rvs/6178099805.html',
 u'/nby/rvs/6188098678.html',
 u'/nby/rvs/6188098381.html',
 u'/sfc/rvs/6184813462.html',
 u'/pen/rvs/6188087232.html',
 u'/sby/rvs/6188072055.html',
 u'/nby/rvd/6186581416.html',
 u'/nby/rvs/6180906524.html',
 u'/nby/rvs/6171517396.html',
 u'/nby/rvs/6176267890.html',
 u'/eby/rvs/6188012028.html',
 u'/nby/rvs/6178005392.html',
 u'/sfc/rvs/6187948022.html',
 u'/nby/rvs/6188004876.html',
 u'/sby/rvs/6184580435.html',
 u'/sby/rvs/6146934904.html',
 u'/sby/rvs/6157370653.html',
 u'/nby/rvs/6181276736.html',
 u'/sfc/rvs/6187966006.html',
 u'/nby/rvs/6174207787.html',
 u'/eby/rvs/6182476403.html',
 u'/eby/rvs/6181149130.html',
 u'/sby/rvs/6163638518.html',
 u'/scz/rvs/6187955349.html',
 u'/sby/rvs/6187934673.html',
 u'/sby/rvs/6174860941.html',
 u'/eby/rvs/6183262137.html',
 u'/nby/rvs/6183649607.html',
 u'/nby/rvs/6187909393.html',
 u'/sfc/rvs/6183077985.html',
 u'/sby/rvs/6180759925.html',
 u'/nby/rvs/6183174592.html',
 u'/nby/rvs/6187714129.html',
 u'/sfc/rv

In [9]:
city = [r.encode('utf-8') for r in city]

In [10]:
city = [i.replace(" ", "") for i in city]

In [11]:
city


['NewYork',
 'LosAngeles',
 'Chicago',
 'Houston',
 'Philadelphia',
 'SanAntonio',
 'SanDiego',
 'Dallas',
 'SanJose',
 'Jacksonville',
 'SanFrancisco',
 'FortWorth',
 'Charlotte',
 'Seattle',
 'ElPaso',
 'Detroit',
 'Memphis',
 'Portland',
 'LasVegas',
 'Louisville',
 'Baltimore',
 'Milwaukee',
 'Albuquerque',
 'Tucson',
 'Fresno',
 'Mesa',
 'KansasCity',
 'LongBeach',
 'ColoradoSprings',
 'Miami',
 'VirginiaBeach',
 'Omaha',
 'Oakland',
 'Minneapolis',
 'Tulsa',
 'Arlington',
 'NewOrleans',
 'Wichita',
 'Cleveland',
 'Tampa',
 'Bakersfield',
 'Aurora',
 'Anaheim',
 'SantaAna',
 'CorpusChristi',
 'Riverside',
 'Lexington',
 'St.Louis',
 'Stockton',
 'Pittsburgh',
 'Cincinnati',
 'Anchorage',
 'Henderson',
 'Greensboro',
 'Plano',
 'Newark',
 'Toledo',
 'Orlando',
 'ChulaVista',
 'Irvine',
 'FortWayne',
 'JerseyCity',
 'Durham',
 'St.Petersburg',
 'Laredo',
 'Buffalo',
 'Lubbock',
 'Chandler',
 'Scottsdale',
 'Glendale',
 'Reno',
 'Norfolk',
 'Winston\xe2\x80\x93Salem',
 'NorthLasVegas

In [12]:
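def wrapper(f):
    # Decorator: call f once per city, passing that city's search URL
    def fun(l):
        for item in l:
            f('http://{}.craigslist.org/search/rva?min_price=100'.format(item))
    return fun

@wrapper
def sort_list(l):
    # Despite its name, this only prints the URL it receives
    print(l)

sort_list(city)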

http://NewYork.craigslist.org/search/rva?min_price=100
http://LosAngeles.craigslist.org/search/rva?min_price=100
http://Chicago.craigslist.org/search/rva?min_price=100
http://Houston.craigslist.org/search/rva?min_price=100
http://Philadelphia.craigslist.org/search/rva?min_price=100
http://SanAntonio.craigslist.org/search/rva?min_price=100
http://SanDiego.craigslist.org/search/rva?min_price=100
http://Dallas.craigslist.org/search/rva?min_price=100
http://SanJose.craigslist.org/search/rva?min_price=100
http://Jacksonville.craigslist.org/search/rva?min_price=100
http://SanFrancisco.craigslist.org/search/rva?min_price=100
http://FortWorth.craigslist.org/search/rva?min_price=100
http://Charlotte.craigslist.org/search/rva?min_price=100
http://Seattle.craigslist.org/search/rva?min_price=100
http://ElPaso.craigslist.org/search/rva?min_price=100
http://Detroit.craigslist.org/search/rva?min_price=100
http://Memphis.craigslist.org/search/rva?min_price=100
http://Portland.craigslist.org/search/rva


## 3. Define a function to calculate mean and median price per city.

Now that you've created a list of cities you want to scrape, adapt your solution for grabbing data in one region site, to grab data for all regional sites that you collected, then calculate the mean and median price of RV results from each city.

> Look at the URLs from a few different regions (e.g., portland, phoenix, sfbay), and find what they have in common.  Determine the smallest part of the URL string that needs to change, and figure out how to replace only that portion of the URL in order to iterate through each city.
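
Once prices are collected per region, the summary step could be sketched as below. It assumes the prices are already numeric and grouped in a dictionary keyed by region; `price_summary` is a hypothetical name, and the sample values are a handful of prices taken from the outputs in this notebook, not a full dataset.

```python
from statistics import mean, median


def price_summary(prices_by_city):
    """Return {city: {"mean": ..., "median": ...}} for cities with prices."""
    summary = {}
    for city, prices in prices_by_city.items():
        if prices:  # skip regions where no priced listings were found
            summary[city] = {"mean": mean(prices), "median": median(prices)}
    return summary


sample = {
    "sfbay": [180.0, 6300.0, 6450.0, 10900.0],
    "portland": [200.0, 6500.0, 89995.0],
    "phoenix": [],
}
for city, stats in sorted(price_summary(sample).items()):
    print(city, stats)
```

Reporting the median alongside the mean matters here because a few very expensive motorhomes can pull the mean well above the typical listing.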


## 4. Run your scraping process, and save your results to a CSV file.
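
Saving the results could be as simple as the sketch below, which uses the standard library `csv` module. The column names mirror the ones in the CSV files loaded later in this notebook plus a `city` column; the function name and the sample row (including its placeholder link) are hypothetical.

```python
import csv
import io

FIELDS = ["city", "price", "link", "listing"]


def write_listings(rows, fileobj):
    """Write a header line followed by one CSV line per listing dict."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {
        "city": "portland",
        "price": 200.0,
        "link": "https://example.org/listing-1",  # placeholder, not a real listing
        "listing": "Tent Trailer",
    }
]

# An in-memory buffer stands in for a real file so the example has no side effects.
buffer = io.StringIO()
write_listings(rows, buffer)
print(buffer.getvalue())
```

To write to disk under Python 3, pass a file opened with `open("rv_prices.csv", "w", newline="")` instead of the buffer; the file name is only an example.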


## 5. Do an analysis of the RV market.

Go ahead, we'll wait.  Is anything notable about the data?

In [13]:
rvsf_df = pd.read_csv('/Users/keatoncarano/Desktop/dsi-sf-7-materials-Keaton/craigslist/craigslist/craigslist/spiders/sfrv.csv')
rvsf_df[['price']] = rvsf_df[['price']].replace(r'[\$,]','',regex=True).astype(float)


In [14]:
rvpr_df = pd.read_csv('/Users/keatoncarano/Desktop/dsi-sf-7-materials-Keaton/craigslist/craigslist/craigslist/spiders/cl_crawl_port2.csv')
rvpr_df[['price']] = rvpr_df[['price']].replace(r'[\$,]','',regex=True).astype(float)


In [15]:
rvsf_df.head()

Unnamed: 0,price,link,listing
0,11999.0,https://sfbay.craigslist.org/nby/rvd/618710067...,SPECIAL - 2003 - 29 foot Arctic Fox Travel Tr...
1,25000.0,https://sfbay.craigslist.org/sby/rvs/617674163...,2000 Winnebago 35ft Itasca 18k miles
2,147900.0,https://sfbay.craigslist.org/sby/rvd/618708817...,2007 Monaco Camelot 42PDQ Quad Slide-Out Class...
3,5800.0,https://sfbay.craigslist.org/sfc/rvs/618704183...,Hybrid Expandable Travel Trailer
4,9300.0,https://sfbay.craigslist.org/eby/rvs/618707407...,2007 KEYSTONE SPRINGDALE USED 3 TIMES


In [16]:
#SF mean and Median Prices
print np.mean(rvsf_df.price), np.median(rvsf_df.price)

31609.8108553 19900.0


In [17]:
sflisting = rvsf_df.groupby(['listing'])['price'].mean()

In [18]:
rvpr_df.head(3)

Unnamed: 0,price,link,listing
0,89995.0,https://portland.craigslist.org/mlt/rvs/617235...,2007 Winnebago 36G Motorhome
1,200.0,https://portland.craigslist.org/clc/rvs/615451...,Tent Trailer
2,89995.0,https://portland.craigslist.org/mlt/rvs/617235...,2007 Winnebago 36G Motorhome


In [19]:
sflisting.shape

(1072,)

In [20]:
rvpr_df.dropna()


Unnamed: 0,price,link,listing
0,89995.0,https://portland.craigslist.org/mlt/rvs/617235...,2007 Winnebago 36G Motorhome
1,200.0,https://portland.craigslist.org/clc/rvs/615451...,Tent Trailer
2,89995.0,https://portland.craigslist.org/mlt/rvs/617235...,2007 Winnebago 36G Motorhome
3,89995.0,https://portland.craigslist.org/mlt/rvs/617221...,2007 Winnebago 36G Motorhome
4,89995.0,https://portland.craigslist.org/mlt/rvs/617221...,2007 Winnebago 36G Motorhome
5,89995.0,https://portland.craigslist.org/mlt/rvs/618033...,2007 Winnebago 36G Motorhome
6,6500.0,https://portland.craigslist.org/wsc/rvs/618669...,Clean class C!!!
7,65000.0,https://portland.craigslist.org/clc/rvs/615417...,Camper/Truck F450 2008 Lariat diesel and 2012 ...
8,34900.0,https://portland.craigslist.org/mlt/rvd/618667...,2005 Coachmen Freedom Ramp & Camp 30' Class C ...
9,44900.0,https://portland.craigslist.org/mlt/rvd/618667...,2008 Dorado B+ 26' Double Slide


In [21]:
np.mean(rvpr_df['price'])

31847.466606498194

In [22]:
#portland mean and median prices

print np.mean(rvpr_df.price), np.median(rvpr_df.price)

31847.4666065 19995.0


In [23]:
rvsf_df['city'] = 'sfbayarea'


In [24]:
rvpr_df['city'] = 'portland'

In [25]:
rvpr_df.head()

Unnamed: 0,price,link,listing,city
0,89995.0,https://portland.craigslist.org/mlt/rvs/617235...,2007 Winnebago 36G Motorhome,portland
1,200.0,https://portland.craigslist.org/clc/rvs/615451...,Tent Trailer,portland
2,89995.0,https://portland.craigslist.org/mlt/rvs/617235...,2007 Winnebago 36G Motorhome,portland
3,89995.0,https://portland.craigslist.org/mlt/rvs/617221...,2007 Winnebago 36G Motorhome,portland
4,89995.0,https://portland.craigslist.org/mlt/rvs/617221...,2007 Winnebago 36G Motorhome,portland


In [26]:
rv_all = pd.concat([rvsf_df, rvpr_df], ignore_index=True)

In [27]:
rv_all.head()

Unnamed: 0,price,link,listing,city
0,11999.0,https://sfbay.craigslist.org/nby/rvd/618710067...,SPECIAL - 2003 - 29 foot Arctic Fox Travel Tr...,sfbayarea
1,25000.0,https://sfbay.craigslist.org/sby/rvs/617674163...,2000 Winnebago 35ft Itasca 18k miles,sfbayarea
2,147900.0,https://sfbay.craigslist.org/sby/rvd/618708817...,2007 Monaco Camelot 42PDQ Quad Slide-Out Class...,sfbayarea
3,5800.0,https://sfbay.craigslist.org/sfc/rvs/618704183...,Hybrid Expandable Travel Trailer,sfbayarea
4,9300.0,https://sfbay.craigslist.org/eby/rvs/618707407...,2007 KEYSTONE SPRINGDALE USED 3 TIMES,sfbayarea



### 5.1 Does it make sense to buy RVs in one region and sell them in another?

Take into account the cost of shipping or driving an RV from one regional market to another.
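
A back-of-the-envelope check might look like the sketch below. The two mean prices are the regional means computed in this notebook, rounded to cents; the transport cost is a purely hypothetical figure chosen for illustration.

```python
def resale_margin(buy_price, sell_price, transport_cost):
    """Return the profit left after buying, moving, and reselling one RV."""
    return sell_price - buy_price - transport_cost


sf_mean = 31609.81        # mean SF Bay Area price from this notebook
portland_mean = 31847.47  # mean Portland price from this notebook
transport_cost = 500.0    # hypothetical cost of moving the RV between regions

margin = resale_margin(sf_mean, portland_mean, transport_cost)
print(round(margin, 2))
```

With these numbers the gap between the regional means is smaller than the assumed transport cost, so the trade would lose money on average.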

In [28]:
region = rv_all.groupby(['city'])['price'].mean()

In [29]:
region


city
portland     31847.466606
sfbayarea    31609.810855
Name: price, dtype: float64

In [30]:
#It would appear that you could buy an RV in the SF Bay Area and sell it for a profit in Portland. However, this may just be
#due to the RVs listed in Portland being more expensive.

### 5.2 Can you pull out the "make" from the markup and include that in your analysis?
How reliable is this data and does it make sense?
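
One way to attempt this is to match listing titles against a short list of known makes, as sketched below. The list is an assumption built from makes that happen to appear in the sample listings in this notebook; it is far from complete, and free-text titles will often contain no recognizable make at all.

```python
import re

# Illustrative list only; extend it with the makes you care about.
KNOWN_MAKES = ["Winnebago", "Monaco", "Airstream", "Keystone", "Coachmen", "Toyota"]

_CANONICAL = {make.lower(): make for make in KNOWN_MAKES}
_MAKE_RE = re.compile(
    r"\b(" + "|".join(re.escape(make) for make in KNOWN_MAKES) + r")\b",
    re.IGNORECASE,
)


def extract_make(title):
    """Return the first known make found in title, or None if there is none."""
    match = _MAKE_RE.search(title)
    if match is None:
        return None
    # Map whatever casing the poster used back to the canonical spelling.
    return _CANONICAL[match.group(1).lower()]


for title in ["2007 KEYSTONE SPRINGDALE USED 3 TIMES", "Tent Trailer"]:
    print(title, "->", extract_make(title))
```

Counting how many titles return `None` gives a quick measure of how reliable the extracted make is.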

In [32]:
sflisting = rvsf_df.groupby(['listing'])['price'].mean()
#It is possible to pull the make from the listing title, but the title is essentially unstructured data and is
#nearly impossible to sort, because each Craigslist poster can type the string however they want.
#Looking at the listings above, almost all titles are unique: the groupby only reduces
#the number of listings by 144 (from 1216 rows to 1072 unique titles).

#HOWEVER, if I were looking for a specific make or model, I would filter the pandas dataframe for that
#make / model in the title. I could also search for multiple makes and models, etc. However, I do not know
#specifics about RVs.


print sflisting.shape, rvsf_df['listing'].shape


(1072,) (1216,)


### 5.3 Are there any other variables you could pull out of the markup to help describe your dataset?

In [33]:
#we could get more granular and attempt to look at surrounding cities to a major city to see if we could find the same 
#make and model except for a cheaper price. However, we would run into the issue of difficulty extracting the make
#and model info from the description


## 6. Move your project into scrapy (if you haven't used Scrapy yet)

>Start a project by using the command `scrapy startproject [projectname]`
> - Update your settings.py (review our past example)
> - Update your items.py
> - Create a spiders file in your `[project_name]/[project_name]/spiders` directory

You can update your spider class with the complete list of craigslist "start urls" to effectively scrape all of the regions.  Start with one to test.

Updating your parse method with the method you chose should require minimal changes.  It will require you to update your parse method to use the response parameter, and an item model (defined in items.py).

In [None]:
#done


## 7.  Choose another area of Craigslist to scrape.

**Choose an area having more than a single page of results, then scrape multiple regions, multiple pages of search results, and/or details pages.**

This is the true exercise of being able to understand how to successfully plan, develop, and employ a broader scraping strategy.  Even though this seems like a challenging task, a few tweaks of your current code can make this very manageable if you've pieced together all the touch points.  If you are still confused about some of the milestones within this process, this is an excellent opportunity to round out your understanding, or to build a list of questions to fill in your gaps.

_Use Scrapy!  Provide your code in this project directory when you submit this project._
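
For multiple pages of results, one option is to generate the page URLs up front, as sketched below. It assumes pagination works through an `s` offset parameter in steps of 120, which is inferred from the single "next" link captured earlier in this notebook (`s=120`); the site may change this, so following the `rel="next"` link on each page is likely the more robust choice. The function name `page_urls` is hypothetical.

```python
from urllib.parse import urlencode


def page_urls(base_url, pages, page_size=120, **params):
    """Return search URLs for the first `pages` result pages.

    Assumption: page N (counting from 0) starts at offset s = N * page_size.
    """
    urls = []
    for page in range(pages):
        pairs = sorted(params.items())
        if page > 0:
            # The first page carries no offset; later pages put it first.
            pairs = [("s", page * page_size)] + pairs
        query = urlencode(pairs)
        urls.append((base_url + "?" + query) if query else base_url)
    return urls


for url in page_urls("https://sfbay.craigslist.org/search/rva", 3, min_price=100):
    print(url)
```

In a Scrapy spider, a list like this could serve as the start URLs for one region.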

In [35]:
#look at sf apartments between 2000 and 4000 in rent
sf_apt_r = requests.get('https://sfbay.craigslist.org/search/sfc/apa?min_price=2000&max_price=4000&availabilityMode=0')
HTML = sf_apt_r.text  
HTML[0:150]

u'\ufeff<!DOCTYPE html>\n\n<html class="no-js"><head>\n    <title>SF bay area apts/housing for rent  - craigslist</title>\n\n    <meta name="description" content='

In [41]:
#description
description = Selector(text=HTML).xpath('//p/a[@class="result-title hdrlnk"]/text()').extract()

In [53]:
#Pull rent per month
rent = Selector(text=HTML).xpath('//span[@class="result-meta"]/span[@class="result-price"]/text()').extract()

In [43]:
#pull neighborhood from each listing
hood = Selector(text=HTML).xpath('//span[@class="result-meta"]/span[@class="result-hood"]/text()').extract()

In [52]:
#pull housing details from each listing (assigned to rooms only, so hood is not overwritten)
rooms = Selector(text=HTML).xpath('//span[@class="result-meta"]/span[@class="housing"]/text()').extract()

I could then combine all of these into a pandas dataframe, clean the data, and perform some analysis.
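
A sketch of that combining step is shown below. It assumes the four lists line up one-to-one, which is not guaranteed: a listing with no neighborhood or housing span makes that list shorter, so the function refuses to zip mismatched lists. The function name and the sample values are made up for illustration.

```python
def to_records(descriptions, rents, hoods, rooms):
    """Combine parallel field lists into one dict per listing."""
    lengths = {len(descriptions), len(rents), len(hoods), len(rooms)}
    if len(lengths) != 1:
        # Zipping would silently pair fields from different listings.
        raise ValueError("field lists differ in length; extract fields per listing")
    return [
        {"description": d, "rent": r, "hood": h.strip(), "rooms": m.strip()}
        for d, r, h, m in zip(descriptions, rents, hoods, rooms)
    ]


# Made-up sample values in the shape the XPath queries return.
records = to_records(
    ["Sunny 1BR near park"],
    ["$2500"],
    [" (mission district)"],
    ["\n 1br - 650ft2 - \n"],
)
print(records)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(records)`. If the lengths do differ, selecting each result row first and reading its fields relative to that row keeps the values aligned.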