# Project 3:  Web Scraping
### Finding Underpriced RVs on Craigslist

![](https://snag.gy/WrdUMx.jpg)

In this project we will be practicing our web scraping skills.  You can use Scrapy or Python requests in order to complete this project.  It may be helpful to write some prototype code in this notebook to test your assumptions, then move it into a Python file that can be run from the command line.

> In order to run code from the command line, instead of the notebook, you just need to save your code to a file (with a .py extension), and run it using the Python interpreter:<br><br>
> `python my_file.py`
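
As a rough illustration of what such a file could look like, here is a minimal sketch of a command-line entry point. The option names (`--region`, `--query`) and the `main` function are hypothetical, and the scraping call itself is left out.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line options for a (hypothetical) scraping script."""
    parser = argparse.ArgumentParser(description="Scrape one Craigslist search page.")
    parser.add_argument("--region", default="sfbay", help="Craigslist subdomain, e.g. sfbay")
    parser.add_argument("--query", default="rv", help="search term")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The real scraping call would go here; this sketch only echoes its inputs.
    message = "Would scrape region={} query={}".format(args.region, args.query)
    print(message)
    return message


if __name__ == "__main__":
    main()
```

Saved as `my_file.py`, this could be run as `python my_file.py --region lasvegas`.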

You will be building a process that scrapes a single category of search results on Craigslist and that can easily be applied to other categories by changing the search terms.  The main goal is to be able to target and scrape a single page given a set of parameters.

**If you use Scrapy, provide your code in a folder.**

## Import your libraries for scrapy / requests / pandas / numpy / etc
Set up whichever libraries you need. Review past material for reference.

In [9]:
# PREPARE REQUIRED LIBRARIES

from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# data modules
import numpy as np
import scipy.stats as stats
import pandas as pd

# plotting modules
import matplotlib.pyplot as plt
import seaborn as sns

# make sure charts appear in the notebook:
%matplotlib inline


import requests

response = requests.get("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population")
HTML = response.text  
HTML[0:150]    

u'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of United States cities by population - Wiki'



## 1.  Scrape for the largest US cities (non-exhaustive list)
Search, research, and scrape Wikipedia for a list of the largest US cities.  There are a few sources, but find one that presents the cities in a clean table.  We don't want all cities, just significant ones.  Examine your source and look for markup that distinguishes the city names from the rest of the page.

- Use requests
- Build XPath query(ies)
- Extract to a list
- Clean your list
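
The last step, cleaning the list, might look something like the sketch below. It assumes the extracted cells are plain strings and that purely numeric cells (such as footnote markers) are noise; the function name is hypothetical. De-duplicating is a choice: it merges same-named cities in different states, which may or may not be what you want.

```python
def clean_city_names(raw_cells):
    """Return sorted, de-duplicated city names from scraped table cells."""
    cleaned = set()
    for cell in raw_cells:
        name = cell.strip()
        # Skip empty cells and purely numeric cells (footnote or rank values).
        if not name or name.isdigit():
            continue
        cleaned.add(name)
    return sorted(cleaned)


# A few cells in the shape an XPath query might return.
sample = ["10", "Akron", " Aurora", "Aurora", "192", "Ann Arbor", ""]
print(clean_city_names(sample))
```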

In [10]:
cities = Selector(text = HTML).xpath("//td[2]/text()|  //td[2]/a/text()").extract()

In [11]:
sorted(cities)

[u'0',
 u'1',
 u'10',
 u'10',
 u'12',
 u'192',
 u'2',
 u'22',
 u'3',
 u'38',
 u'4',
 u'5',
 u'51',
 u'54',
 u'6',
 u'7',
 u'73',
 u'8',
 u'9',
 u'Abilene',
 u'Akron',
 u'Alexandria',
 u'Allentown',
 u'Amarillo',
 u'Anaheim',
 u'Ann Arbor',
 u'Antioch',
 u'Arlington',
 u'Arvada',
 u'Athens',
 u'Augusta',
 u'Aurora',
 u'Aurora',
 u'Bakersfield',
 u'Bayam\xf3n',
 u'Beaumont',
 u'Bellevue',
 u'Berkeley',
 u'Boulder',
 u'Broken Arrow',
 u'Brownsville',
 u'Buffalo',
 u'Burbank',
 u'Caguas',
 u'California',
 u'Cambridge',
 u'Cape Coral',
 u'Carlsbad',
 u'Carolina',
 u'Carrollton',
 u'Cary',
 u'Cedar Rapids',
 u'Centennial',
 u'Chandler',
 u'Chattanooga',
 u'Chesapeake',
 u'Chula Vista',
 u'Cincinnati',
 u'Clarksville',
 u'Clearwater',
 u'Cleveland',
 u'Clinton',
 u'Clovis',
 u'College Station',
 u'Colorado Springs',
 u'Columbia',
 u'Columbus',
 u'Concord',
 u'Coral Springs',
 u'Corona',
 u'Corpus Christi',
 u'Costa Mesa',
 u'Dallas',
 u'Daly City',
 u'Davenport',
 u'Davie',
 u'Dayton',
 u'Del

In [12]:
# Strip right single quotation marks (’) from the city names
cities = [item.replace(u'’', "") for item in cities]

In [13]:
cities = pd.DataFrame(cities)

In [14]:
pd.DataFrame(cities).head()

| | 0 |
|---|---|
| 0 | San Antonio |
| 1 | San Diego |
| 2 | Dallas |
| 3 | San Jose |
| 4 | San Francisco |


## 1.2 Only retain cities with properly formed ASCII

Optionally, filter out any cities whose names contain non-ASCII characters.  A smaller list will be easier to look at.  However, you may not need to filter these if you spend more time scraping a more concise city list.  This list should help you narrow down the list of regional Craigslist sites.
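
One possible way to do this filtering is sketched below, written for Python 3. The helper name `is_ascii` is hypothetical, and the sample names are a few of the values that appear in the scraped list.

```python
def is_ascii(text):
    """True when every character in text is a 7-bit ASCII character."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


names = ["Bakersfield", "Bayam\xf3n", "Caguas"]
# Keep only the names that can be represented in plain ASCII.
ascii_only = [name for name in names if is_ascii(name)]
print(ascii_only)
```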

In [15]:

# Above 

## 2.  Write a function to capture current pricing information via Craigslist in one city.
Choose a city from your scraped data, then go to the corresponding city section on Craigslist and search for "rv" in the auto section.  Write a method that pulls out the prices.
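
To illustrate the idea without any network access, here is a minimal Python 3 sketch that pulls prices out of a snippet of markup. It swaps in the standard library's `html.parser` for a Scrapy `Selector` so that it has no dependencies. The `result-price` class name matches the XPath used in this notebook; the sample markup and the names `PriceParser` and `extract_prices` are assumptions made for the example.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every <span class="result-price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "result-price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    """Return the listing prices found in html as integers."""
    parser = PriceParser()
    parser.feed(html)
    return [int(p.replace("$", "").replace(",", "")) for p in parser.prices]


# Hand-written markup for illustration; real result pages are much larger.
sample_html = (
    '<li><span class="result-meta"><span class="result-price">$7500</span></span></li>'
    '<li><span class="result-meta"><span class="result-price">$28,500</span></span></li>'
)
print(extract_prices(sample_html))
```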

In [26]:
import scrapy

from bs4 import BeautifulSoup as bs
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

In [27]:
response2 = requests.get("https://lasvegas.craigslist.org/search/rva")
CL = response2.text  
CL[0:150]    

u'\ufeff<!DOCTYPE html>\n\n<html class="no-js"><head>\n    <title>las vegas recreational vehicles  - craigslist</title>\n\n    <meta name="description" content="l'

In [28]:
prices = Selector(text = CL).xpath("//span[@class='result-meta']/span[@class='result-price']/text() ").extract()

In [29]:
details = Selector(text = CL).xpath("//a[@class='result-title hdrlnk']/text()").extract()

In [30]:
pd.DataFrame(prices).shape

(106, 1)

In [31]:
pd.DataFrame(details).head()


| | 0 |
|---|---|
| 0 | 2008 Coachmen |
| 1 | 1998 Monaco Windsor 38' diesel Pusher trade fo... |
| 2 | Extra clean travel trailer |
| 3 | 2001 FourWinds Hurricane |
| 4 | Dometic model 320 with spray |


In [32]:
df = pd.DataFrame(prices)

In [33]:
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106 entries, 0 to 105
Data columns (total 1 columns):
0    106 non-null object
dtypes: object(1)
memory usage: 920.0+ bytes


In [34]:
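# Drop any non-ASCII characters from the price strings
def clean_text(column):
    # Encode to ASCII (ignoring characters that do not fit), then decode back to text
    return column.str.encode('ascii', 'ignore').str.decode('ascii')

df[0] = clean_text(df[0])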

type(df[0])

pandas.core.series.Series

In [35]:
df.head()

| | 0 |
|---|---|
| 0 | \$7500 |
| 1 | \$28500 |
| 2 | \$6900 |
| 3 | \$18900 |
| 4 | \$210 |


In [36]:
df.shape

(106, 1)

In [37]:
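# Remove commas and dollar signs, then convert to int.
# regex=False treats '$' as a literal character rather than a regex anchor.
df[0] = df[0].str.replace(',', '', regex=False)
df[0] = df[0].str.replace('$', '', regex=False)
df[0] = df[0].astype(int)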

In [38]:
df[0].head()

0     7500
1    28500
2     6900
3    18900
4      210
Name: 0, dtype: int64

In [41]:
df[0].sum()

2011999

In [40]:
df[0].mean()


18981.122641509435


## 2.1 Create a mapping of cities to corresponding regional Craigslist URLs

Major US cities on Craigslist typically have their own corresponding section (e.g. SF Bay Area, NYC, Boston, Miami, Seattle).  Later, you will use these to query search results for the various metropolitan regions listed on Craigslist.  Between the major metropolitan Craigslist sites, the only thing that differs is the URL that corresponds to each one.

The point of the "mapping":  Create a data structure that lets you iterate over both the name of each city from Wikipedia and the corresponding value that allows you to construct the Craigslist URL for that region.

> For San Francisco (the Bay Area metropolitan area), the URL for the RV search result is:
> http://sfbay.craigslist.org/search/sss?query=rv
>
> The convention is http://[region].craigslist.org/search/sss?query=rv
> Replacing [region] with the corresponding city name will allow you to quickly iterate through each regional Craigslist site, and scrape the prices from the search results.  Keep this in mind while you build this "mapping".
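
A minimal sketch of such a mapping, assuming a plain dictionary is enough. The names `CITY_TO_REGION`, `URL_TEMPLATE` and `regional_urls` are hypothetical, and the region values would need to be checked against the real Craigslist subdomains.

```python
# Hypothetical mapping from Wikipedia city names to Craigslist subdomains.
CITY_TO_REGION = {
    "San Francisco": "sfbay",
    "New York": "newyork",
    "Boston": "boston",
    "Miami": "miami",
    "Seattle": "seattle",
}

URL_TEMPLATE = "http://{region}.craigslist.org/search/sss?query={query}"


def regional_urls(city_to_region, query="rv"):
    """Return {city name: search URL} for every mapped city."""
    return {
        city: URL_TEMPLATE.format(region=region, query=query)
        for city, region in city_to_region.items()
    }


for city, url in sorted(regional_urls(CITY_TO_REGION).items()):
    print(city, url)
```

Keeping the city name as the key and the region as the value means the same structure can label results by city while building URLs by region.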


In [52]:
major_cities = {"New_York": "newyork", "Alaska": "anchorage", "Arizona": "phoenix",
                "California": "sfbay", "Texas": "houston", "Colorado": "denver",
                "Utah": "saltlakecity", "Maryland": "baltimore", "Massachusetts": "boston",
                "Washington": "seattle", "Ohio": "cleveland", "Florida": "miami",
                "Michigan": "detroit", "Oregon": "portland"}

# list() keeps this a real list on Python 3, where .values() returns a view
large_cities = list(major_cities.values())
large_cities

['miami',
 'saltlakecity',
 'anchorage',
 'cleveland',
 'baltimore',
 'houston',
 'newyork',
 'phoenix',
 'denver',
 'detroit',
 'seattle',
 'portland',
 'sfbay',
 'boston']

In [66]:
url_dict = {region: "https://" + region + ".craigslist.org/search/cta?query=rv"
            for region in large_cities}
url_dict

{'anchorage': 'https://anchorage.craigslist.org/search/cta?query=rv',
 'baltimore': 'https://baltimore.craigslist.org/search/cta?query=rv',
 'boston': 'https://boston.craigslist.org/search/cta?query=rv',
 'cleveland': 'https://cleveland.craigslist.org/search/cta?query=rv',
 'denver': 'https://denver.craigslist.org/search/cta?query=rv',
 'detroit': 'https://detroit.craigslist.org/search/cta?query=rv',
 'houston': 'https://houston.craigslist.org/search/cta?query=rv',
 'miami': 'https://miami.craigslist.org/search/cta?query=rv',
 'newyork': 'https://newyork.craigslist.org/search/cta?query=rv',
 'phoenix': 'https://phoenix.craigslist.org/search/cta?query=rv',
 'portland': 'https://portland.craigslist.org/search/cta?query=rv',
 'saltlakecity': 'https://saltlakecity.craigslist.org/search/cta?query=rv',
 'seattle': 'https://seattle.craigslist.org/search/cta?query=rv',
 'sfbay': 'https://sfbay.craigslist.org/search/cta?query=rv'}


## 3. Define a function to calculate mean and median price per city.

Now that you've created a list of cities you want to scrape, adapt your solution for grabbing data in one region site, to grab data for all regional sites that you collected, then calculate the mean and median price of RV results from each city.

> Look at the URLs from a few different regions (e.g. portland, phoenix, sfbay), and find what they have in common.  Identify the one part of the URL string that has to change between regions, and figure out how to replace only that portion in order to iterate through each city.
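
Once the prices for each region have been collected, the per-city statistics could be computed along these lines. This is a sketch using only the standard library; the function name, the dictionary layout and the sample prices are assumptions, and the 250 and 300000 cut-offs are simply example bounds for implausible prices.

```python
import statistics


def summarize_prices(prices_by_region, low=250, high=300000):
    """Return {region: {"mean": ..., "median": ..., "count": ...}}."""
    summary = {}
    for region, prices in prices_by_region.items():
        # Build a new list instead of deleting while iterating.
        kept = [p for p in prices if low <= p <= high]
        if not kept:
            continue  # nothing usable for this region
        summary[region] = {
            "mean": statistics.mean(kept),
            "median": statistics.median(kept),
            "count": len(kept),
        }
    return summary


# Made-up prices, purely for illustration.
example = {"sfbay": [500, 10000, 21500, 100], "portland": [4000, 6000]}
print(summarize_prices(example))
```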

In [67]:
def city_rv_stats(city, url_dict): 
    url = url_dict[city]
    response = requests.get(url)
    HTML = response.text
    
    # Extract the price strings and convert them to integers
    raw_prices = Selector(text=HTML).xpath("//span/span[@class='result-price']/text()").extract()
    prices = [int(amt.replace("$", "").replace(",", "")) for amt in raw_prices]
        
    # Remove implausible prices. Build a new list here: deleting items from a
    # list while iterating over it skips the element after each deletion.
    prices = [amt for amt in prices if 250 <= amt <= 300000]
    
    mean = np.mean(prices)
    median = int(np.median(prices)) 
    std = np.std(prices)
    
    # Control outliers by standard deviation: keep prices within 3 std of the
    # mean, then update the mean and std on the trimmed list.
    prices = [amt for amt in prices if np.abs(amt - mean) <= 3*std]
    std = np.std(prices)
    mean = np.mean(prices)
            
    # return the price dict
    return {"city":city, "mean":int(mean), "std": int(std),
            "max": np.max(prices), "median":median, "min": np.min(prices)}

In [77]:
# Test on a single region
sfbay_stats = city_rv_stats("sfbay", url_dict)
sfbay_stats

{'city': 'sfbay',
 'max': 59763,
 'mean': 14700,
 'median': 10000,
 'min': 500,
 'std': 13294}


## 4. Run your scraping process, and save your results to a CSV file.
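
If you are not relying on Scrapy's own export, one way to save scraped rows is sketched below for Python 3. The column names follow the CSV loaded in this section (`price`, `detail`, `area`); the function name `save_listings` is hypothetical and the sample row is one of the rows shown in the loaded data.

```python
import csv


def save_listings(rows, path):
    """Write listing dicts to a CSV file with price, detail and area columns."""
    fieldnames = ["price", "detail", "area"]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# One row in the same shape as the scraped listings.
rows = [
    {"price": "$4000", "detail": "Pop up camper for sale", "area": "bakersfield"},
]
```

A call such as `save_listings(rows, "crawlcl.csv")` would then produce a file that `pd.read_csv` can load.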

In [167]:
results = pd.read_csv("/Users/NVR/Desktop/dsi-sf-7-materials-nvr/projects/project-3/craigslist/crawlcl.csv")

In [168]:
rv_market = pd.DataFrame(results)

In [169]:
rv_market.head()

| | price | detail | area |
|---|---|---|---|
| 0 | \$24500 | Remodeled 25' Mallard Sprinter RV | las vegas |
| 1 | \$4000 | Pop up camper for sale | bakersfield |
| 2 | \$26500 | Trade for Class A Motor-home With Slides | flagstaff |
| 3 | \$15000 | 2004 FLEETWOOD PIONEER & 2006 CHEVY SILVERADO | imperial co |
| 4 | \$300000 | 2008 Country Coach Affinity 45' | inland empire |


In [170]:
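# regex=False treats '$' as a literal character rather than a regex anchor
rv_market['price'] = rv_market['price'].str.replace(',', '', regex=False)
rv_market['price'] = rv_market['price'].str.replace('$', '', regex=False)
rv_market['price'] = rv_market['price'].astype(int)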

In [172]:
rv_market = rv_market[rv_market['price'] > 1000]  #remove low values 

In [175]:
rv_market = rv_market[rv_market['price'] < 300000] #remove high values


## 5. Do an analysis of the RV market.

Go ahead, we'll wait.  Anything notable about the data?

In [176]:
most_county_rv_sales = pd.DataFrame(rv_market.groupby(['area'])['price'].sum())

In [177]:
most_county_rv_sales.sort_values("price", ascending=False).head()

| area | price |
|---|---|
| skagit | 1453535 |
| victoria, BC | 1394773 |
| moses lake | 1297787 |
| tri-cities, WA | 1287679 |
| salem | 1197890 |


In [178]:
least_county_rv_sales = most_county_rv_sales.sort_values("price", ascending=True).head()
least_county_rv_sales

| area | price |
|---|---|
| fredericksburg | 3000 |
| norfolk | 4500 |
| syracuse | 5000 |
| washington, DC | 6400 |
| richmond, VA | 6500 |


In [179]:
high_median_price_rv_sales = pd.DataFrame(rv_market.groupby(['area'])['price'].median())

In [180]:
high_median_price_rv_sales.sort_values("price", ascending=False).head()

| area | price |
|---|---|
| provo | 69497.0 |
| twin falls | 58495.0 |
| ogden | 57997.0 |
| elko | 41997.5 |
| harrisburg | 41875.0 |


In [181]:
low_median_price_rv_sales = high_median_price_rv_sales.sort_values("price", ascending=True).head()
low_median_price_rv_sales

| area | price |
|---|---|
| salt lake | 2650.0 |
| fredericksburg | 3000.0 |
| washington, DC | 3200.0 |
| plattsburgh | 3253.0 |
| baltimore | 4000.0 |


In [182]:
high_avg_rv_price=pd.DataFrame(rv_market.groupby(['area'])['price'].mean()) 
high_avg_rv_price.sort_values("price", ascending=False).head()

| area | price |
|---|---|
| twin falls | 93622.5 |
| provo | 91211.0 |
| elko | 76973.75 |
| frederick | 65833.333333 |
| klamath falls | 63248.666667 |


In [183]:
low_avg_rv_price = pd.DataFrame(rv_market.groupby(['area'])['price'].mean())
low_avg_rv_price.sort_values("price", ascending=True).head()

| area | price |
|---|---|
| fredericksburg | 3000.0 |
| washington, DC | 3200.0 |
| baltimore | 3600.0 |
| outer banks | 4325.0 |
| norfolk | 4500.0 |


In [186]:
#According to the data, it seems that Killeen-Temple, Texas, is the healthiest used RV market in the United States.
#RV sales in Killeen-Temple are almost double those of the next market, Skagit, Washington.
#The poorest RV markets in the US include Fredericksburg, Miss, and Syracuse, NY.



### 5.1 Does it make sense to buy RVs in one region and sell them in another?

Factor in the cost of shipping or driving an RV from one regional market to another.
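
A back-of-the-envelope version of this calculation is sketched below. The two prices are the "salt lake" and "ogden" medians from the grouped results above; the distance and per-mile cost are illustrative guesses, and the function name is hypothetical. Medians compare different mixes of vehicles, so the result is only a rough upper bound on any real margin.

```python
def arbitrage_margin(buy_price, sell_price, miles, cost_per_mile):
    """Expected profit from buying in one market and selling in another."""
    transport_cost = miles * cost_per_mile
    return sell_price - buy_price - transport_cost


# Median prices for two areas; distance and per-mile cost are assumed values.
margin = arbitrage_margin(buy_price=2650, sell_price=57997, miles=40, cost_per_mile=1.50)
print(margin)
```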

In [187]:
#After observing the median prices of the highest- and lowest-priced markets, it appears
#that arbitrage could be executed effectively by transporting RVs. However, we do not have information
#about the overall quality of the RVs, which could be why some markets have lower-priced RVs, i.e.,
#the RVs for sale are either very old or in very poor condition.
#
#One particularly interesting arbitrage case is buying in Salt Lake City, Utah, and driving to
#Ogden, Utah. These two cities are only about 45 minutes apart, but the spread in RV prices is drastic.
#In Ogden, the median RV price is 57997.0, whereas in Salt Lake City it is just under 3000
#(assuming the "salt lake" area really is Salt Lake City).
#
#It is also interesting to note that RV prices rise the further west you go. This may be because
#much of the RV tourism occurs on the West Coast, and particularly from Texas up through the Great Plains.
#The East Coast, however, is in many ways undesirable for most RV travelers.

### 5.2 Can you pull out the "make" from the markup and include that in your analysis?
How reliable is this data and does it make sense?

In [206]:
make_sales = pd.DataFrame(rv_market.groupby(['detail'])['price'].count())

In [207]:
make_sales.head()

| detail | price |
|---|---|
| ! gorgeous arctic fox fifth wheel!! | 1 |
| !! BEST DEAL ON 2016 AIRSTREAM FLYING CLOUD 23D!! | 1 |
| !! RARE 2005 CHINOOK GLACIER CLASS C GORGEOUS!! | 1 |
| !!!!!!!!!!!!!!! ALUMINUM TRAILER POLISHING !!!!!!!!!!!!!!!!!!!! | 1 |
| !!BEST DEAL ON NEW CLASS C DIESEL MERCEDES | 1 |


In [None]:
#This data is unreliable because the variation in naming styles is too vast to capture the commonly occurring
# types. A solution could be to implement a "Model Dictionary" and then to sort by the relevant makes.
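
The "Model Dictionary" idea could be prototyped roughly as follows. The list of makes is a small hand-picked sample taken from titles that appear in this notebook, and `find_make` is a hypothetical helper. Plain substring matching is an assumption that will produce some false positives on real listings.

```python
# A small, hand-picked list of makes; a real list would be much longer.
KNOWN_MAKES = ["airstream", "chinook", "coachmen", "country coach", "fleetwood", "monaco"]


def find_make(title, known_makes=KNOWN_MAKES):
    """Return the first known make found in a listing title, else None."""
    lowered = title.lower()
    for make in known_makes:
        if make in lowered:
            return make
    return None


print(find_make("!! BEST DEAL ON 2016 AIRSTREAM FLYING CLOUD 23D!!"))
```

Applied to the `detail` column (for example with `rv_market['detail'].apply(find_make)`), this would give a column that can be grouped on instead of the raw title.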

### 5.3 Are there any other variables you could pull out of the markup to help describe your dataset?

In [None]:
# Similarly to the issue noted above, other useful stats like "Feet", "Year" and "Miles" are all unstandardized in the
# descriptions. To extract these we would want to make a dictionary or list including all of the possible combinations of
# the variables listed above, and then search each listing's details to see if it includes our keywords.
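
As an alternative to enumerating every keyword combination, regular expressions can catch the most common patterns. The sketch below is only a starting point: the patterns assume a four-digit year starting with 19 or 20 and a two-digit length followed by `'`, `ft` or `feet`, and the names are hypothetical.

```python
import re

# Four-digit model year such as 1998 or 2008.
YEAR_RE = re.compile(r"\b((?:19|20)\d{2})\b")
# Two-digit length followed by a foot marker, e.g. 38' or 25 ft.
LENGTH_RE = re.compile(r"\b(\d{2})\s*(?:'|ft\b|feet\b)", re.IGNORECASE)


def extract_features(title):
    """Pull a model year and a length in feet out of a listing title, if present."""
    year = YEAR_RE.search(title)
    length = LENGTH_RE.search(title)
    return {
        "year": int(year.group(1)) if year else None,
        "length_ft": int(length.group(1)) if length else None,
    }


print(extract_features("1998 Monaco Windsor 38' diesel Pusher"))
```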


## 6. Move your project into scrapy (if you haven't used Scrapy yet)

>Start a project by using the command `scrapy startproject [project_name]`
> - Update your settings.py (review our past example)
> - Update your items.py
> - Create a spiders file in your `[project_name]/[project_name]/spiders` directory
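
For the `settings.py` step, a conservative starting point might look like the fragment below. The project name `craigslist` is an assumption, and the delay and concurrency values are illustrative rather than recommended figures.

```python
# settings.py (fragment); "craigslist" is an assumed project name.
BOT_NAME = "craigslist"

SPIDER_MODULES = ["craigslist.spiders"]
NEWSPIDER_MODULE = "craigslist.spiders"

# Respect robots.txt and keep the request rate low.
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 2  # base delay, in seconds, between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_ENABLED = True
```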

You can update your spider class with the complete list of craigslist "start urls" to effectively scrape all of the regions.  Start with one to test.

Moving the extraction logic you chose into your spider's parse method should require minimal changes: make it read from the response parameter and populate an item model (defined in items.py).


## 7.  Choose another area of Craigslist to scrape.

**Choose an area having more than a single page of results, then scrape multiple regions, multiple pages of search results and/or details pages.**

This is the true exercise of being able to understand how to successfully plan, develop, and employ a broader scraping strategy.  Even though this seems like a challenging task, a few tweaks of your current code can make this very manageable if you've pieced together all the touch points.  If you are still confused as to some of the milestones within this process, this is an excellent opportunity to round out your understanding, or help you build a list of questions to fill in your gaps.
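
One of those tweaks is generating a URL for each page of results. The sketch below assumes the site paginates with an offset query parameter; the parameter name `s` and the page size of 120 are assumptions that should be checked against the "next page" link on a real results page. The function name is hypothetical.

```python
from urllib.parse import urlencode


def page_urls(base_url, query, pages, page_size=120, offset_param="s"):
    """Build one search URL per results page using an offset parameter."""
    urls = []
    for page in range(pages):
        params = {"query": query}
        if page > 0:
            # Later pages are addressed by how many results to skip.
            params[offset_param] = page * page_size
        urls.append(base_url + "?" + urlencode(params))
    return urls


for url in page_urls("https://sfbay.craigslist.org/search/cta", "rv", pages=3):
    print(url)
```

In a Scrapy spider, following the page's own "next" link is usually more robust than computing offsets, since it does not depend on the page size.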

_Use Scrapy!  Provide your code in this project directory when you submit this project._