
# Introduction to Web Scraping and Spiders with `scrapy`

_Authors: Dave Yerrington (SF), Sam Stack (DC)_

---


### Learning Objectives
- Understand the structure and content of HTML
- Learn about elements, attributes, and element hierarchy in HTML
- Learn about XPath and using multiple and singular selections
- Practice using Scrapy to get data from craigslist
- Practice using Beautiful Soup to parse data from craigslist
- Walk through the construction of a spider built using Scrapy

### Hypertext Markup Language

Let's break that down further:


> **Hypertext** - at its most basic level, is text which contains links to other texts and other forms of media such as images or video

> **Markup Language** - a system for defining the presentation of text. HTML is a markup language for the web.

> Kim Goulbourne, General Assembly

HTML defines the **structure** of a webpage. This is important because, as we'll see shortly, web scraping is about identifying the structure of a webpage in a way that allows you to algorithmically target the portions of the page that hold the data you're after.

One of the largest sources of data in the world is all around us. We consume the web in some form every day. One of the most powerful Python toolsets we will learn allows us to extract and normalize data from unstructured sources like webpages.

**If you can see it, it can be scraped, mined, and put into a dataframe.**

Before we begin the actual process of web scraping with Python, it is important to cover the basic constructs that describe HTML as unstructured data.

Then we will cover a powerful selection technique called XPath, and look at a basic workflow using a framework called [Scrapy](http://www.scrapy.org).

## The Document Object Model (The DOM)

---

"The Document Object Model (DOM) provides a representation of the document as a structured group of nodes and objects that have properties and methods. With the DOM, programmers can build documents, navigate their structure, and add, modify, or delete elements and content. Anything found in an HTML document can be accessed, changed, deleted, or added using the DOM."

- Kim Goulbourne, General Assembly


## The DOM (continued)
---
In the HTML DOM (Document Object Model), everything is a node:
 * The document itself is a document node.
 * All HTML elements are element nodes.
 * All HTML attributes are attribute nodes.
 * Text inside HTML elements is a text node.
 * Comments are comment nodes.
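
To make these node types concrete, here is a minimal sketch using Python's standard-library `xml.dom.minidom`. It is an XML DOM implementation used here as a stand-in for a browser's HTML DOM, and the snippet and the `node_kind` helper are illustrative assumptions rather than part of the tools used in this lesson.

```python
from xml.dom import Node, minidom

# Map DOM node-type constants to readable names
NODE_NAMES = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.ATTRIBUTE_NODE: "attribute",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
}


def node_kind(node):
    """Return a readable name for a DOM node's type (hypothetical helper)."""
    return NODE_NAMES.get(node.nodeType, "other")


# A tiny, well-formed document: one element, one attribute, text, and a comment
doc = minidom.parseString('<p id="intro">Hello<!-- a note --></p>')
p = doc.documentElement

kinds = {
    "doc": node_kind(doc),
    "p": node_kind(p),
    "id": node_kind(p.getAttributeNode("id")),
    "Hello": node_kind(p.childNodes[0]),
    "note": node_kind(p.childNodes[1]),
}
print(kinds)
```

Each piece of the markup maps to one of the node types in the list above.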

### Elements
Elements begin and end with opening and closing "tags", which are tag names enclosed in angle brackets. The tag names that open and close an element must match.

```
<title>I am a title.</title>
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```

As you may have several different titles or paragraphs on a single page, you can assign `id` attributes to elements to create unique reference points. IDs are also very useful for labelling nested elements.
```
<title id ='title_1'>I am the first title.</title>
<p id ='para_1'>I am the first paragraph.</p>
<title id ='title_2'>I am the second title.</title>
<p id ='para_2'>I am the second paragraph.</p>
```

**Elements can have parents and children:**
It is important to remember that an element can be both a parent and a child. Which term applies depends on which related element you are comparing it to.

```
<body id = 'parent'>
    <div id = 'child_1'>I am the child of 'parent'
        <div id = 'child_2'>I am the child of 'child_1'
            <div id = 'child_3'>I am the child of 'child_2'
                <div id = 'child_4'>I am the child of 'child_3'</div>
            </div>
        </div>
    </div>
</body>
```
**or**
```
<body id = 'parent'>
    <div id = 'child_1'>I am the parent of 'child_2'
        <div id = 'child_2'>I am the parent of 'child_3'
            <div id = 'child_3'> I am the parent of 'child_4'
                <div id = 'child_4'>I am not a parent </div>
            </div>
        </div>
    </div>
</body>
```
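
The same relationships can be checked in code. This is a minimal sketch that uses the standard library's `xml.etree.ElementTree` (not Scrapy) on a trimmed, well-formed copy of the markup above. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

HTML = """
<body id="parent">
    <div id="child_1">
        <div id="child_2">
            <div id="child_3"></div>
        </div>
    </div>
</body>
""".strip()

root = ET.fromstring(HTML)

# child_2 seen as a parent: iterating an element yields its direct children
child_2 = root.find(".//div[@id='child_2']")
children_of_child_2 = [child.get("id") for child in child_2]

# child_2 seen as a child: '..' steps up to its parent element
parent_of_child_2 = root.find(".//div[@id='child_2']/..").get("id")

print(children_of_child_2, parent_of_child_2)
```

The element `child_2` is the parent of `child_3` and, at the same time, the child of `child_1`.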





### Attributes

HTML elements can have attributes. They describe properties and characteristics of elements. Some affect how the element behaves or looks when rendered by the browser.

Anchor elements have `href` attributes that tell the browser where to go when the link is clicked. Browsers typically render anchors underlined and in a distinct color as a visual cue that sets them apart from the surrounding text.

**Markup that describes an element with attributes literally looks like this:**

```
<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>
```

**However, this element, once rendered, looks like this**

[An Awesome Website](https://www.youtube.com/watch?v=dQw4w9WgXcQ)
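
Attributes are also what a scraper usually reads. As a small sketch, the standard library's `html.parser` can pull the `href` out of the anchor above. The `HrefCollector` class is a hypothetical helper written for this illustration, not something Scrapy or Beautiful Soup provides.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every anchor element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


collector = HrefCollector()
collector.feed(
    '<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>'
)
print(collector.hrefs)
```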





### Element hierarchy Visually Represented

![Nodes](http://www.computerhope.com/jargon/d/dom1.jpg)

### Element hierarchy in code:

```
<html>
    <head>
        <title>Example</title>
    </head>
    
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body> 
</html>
```

### You are now qualified HTML experts

Your HTML learning can continue...

Read all about the different elements supported amongst modern browsers:
 * [HTML5 Cheatsheet](http://websitesetup.org/html5-cheat-sheet/)
 * [Mozilla HTML Element Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
 * [HTML5 Visual Cheatsheet](http://www.unitedleather.biz/PDF/HTML5-Visual-Cheat-Sheet1.pdf)
 

## What is XPath?

---
<img src="assets/obama_wiki.png" width="550" height="300">
XPath is a syntax that we'll use to target sections of an HTML/XML page. You can think of XPath as a query language for HTML/XML.

Understanding how to identify elements and attributes within HTML documents gives us the capability to write simple expressions that create structured data.

## XPath Helper?

---

To make this process easier to deal with, we will be using XPath Helper, which is a Chrome extension. It's not necessary, but highly recommended to help build XPath expressions.

[XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en)

XPath expressions can select elements, element attributes, and element text. A selection can target either a single item or multiple items. Generally, if you're not specific enough, you will end up selecting multiple elements.


<a id='multiple-selections'></a>

### Multiple selections

***Multiple selections*** are useful for capturing search results, or any repeating element. For instance, the _titles_ in the search results for apartment listings on Craigslist:


**URL**

[http://sfbay.craigslist.org/search/sfc/apa](http://sfbay.craigslist.org/search/sfc/apa)


**Example HTML Markup**
```
...
<span class="pl"> 
    <time datetime="2016-01-12 23:27" title="Tue 12 Jan 11:27:35 PM">Jan 12</time> 
    <a href="/sfc/apa/5400584579.html" data-id="5400584579" class="result-title hdrlnk">Welcome home to a sweetly renovated four bedroom one and a half bath</a> 
</span>
...
```

**XPath - Multiple Titles** _copy this into the XPath Helper Query box_
```
//a[@class='result-title hdrlnk']
```

**Returns (Ad Titles)**
```
***New Remodeled two bedroom Apartment***
WONDERFUL ONE BR APARTMENT HOME
Beautiful 1bed/1bath Apartment in Russian Hill NO SECURITY DEPOSIT
Knockout SF View|Green Oasis|Private Driveway|Furnished
3BR/3BA Spacious, Beautiful SOMA Loft: 5 month lease
Nob Hill Large Studio - Light, Quiet, Lovely Building
etc...
```
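
Outside the browser, a multiple selection returns a list. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, on hypothetical, simplified search-results markup. The `PAGE` string and its links are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified search-results markup
PAGE = """
<ul class="rows">
    <li class="result-row"><a class="result-title hdrlnk" href="/sfc/apa/1.html">WONDERFUL ONE BR APARTMENT HOME</a></li>
    <li class="result-row"><a class="result-title hdrlnk" href="/sfc/apa/2.html">Nob Hill Large Studio</a></li>
    <li class="result-row"><a class="other" href="/about">about</a></li>
</ul>
""".strip()

root = ET.fromstring(PAGE)

# findall() returns every element that matches the query, in document order
titles = [a.text for a in root.findall(".//a[@class='result-title hdrlnk']")]
print(titles)
```

The third anchor is skipped because its `class` attribute does not match the full value in the query.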

### Singular selections

***Singular selections*** are necessary when you want to grab specific, unique text within elements.  Here's an example of a details page on Craigslist:

> *Note: this example may be expired if you view it sometime after July 31st, 2017. Please replace this with a current Craigslist listing!*

**URL**

[https://sfbay.craigslist.org/sfc/apa/d/exquisite-level-contemporary/6244612708.html](https://sfbay.craigslist.org/sfc/apa/d/exquisite-level-contemporary/6244612708.html)

**HTML Markup**

```
<div class="postinginfos">
    <p class="postinginfo">post id: 6244612708</p>
    <p class="postinginfo" style="opacity: 1;">posted: <time class="timeago" datetime="2017-07-31T18:51:09-0700" title="2017-07-31  6:51pm">38 minutes ago</time></p>
```

**XPath - Single Item**

```
//p[@class='postinginfo'][2]/time
```
**Returns (Time of posting or age of Post)**
```
38 minutes ago
```
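
A singular selection can be sketched the same way. This example again uses the standard library's `xml.etree.ElementTree` as a stand-in for XPath Helper, on a closed, well-formed copy of the markup above. The positional predicate `[2]` picks the second `<p>`.

```python
import xml.etree.ElementTree as ET

PAGE = """
<div class="postinginfos">
    <p class="postinginfo">post id: 6244612708</p>
    <p class="postinginfo">posted: <time class="timeago" datetime="2017-07-31T18:51:09-0700">38 minutes ago</time></p>
</div>
""".strip()

root = ET.fromstring(PAGE)

# find() returns a single element (or None), unlike findall()
posted = root.find("./p[@class='postinginfo'][2]/time")
posted_text = posted.text
posted_at = posted.get("datetime")
print(posted_text, posted_at)
```

The element's text gives the age of the post, while its `datetime` attribute holds the exact timestamp.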

## XPATH HELPER SHORTCUTS!

- Hit Ctrl-Shift-X (or Command-Shift-X on OS X), or click the XPath Helper button in the toolbar, to open the XPath Helper console.
- Hold down Shift as you mouse over elements on the page. The query box will continuously update to show the XPath query for the element below the mouse pointer, and the results box will show the results for the current query.

## A Note On Specificity

The concept of specificity is recurrent in front-end web development and other applications where one must traverse HTML. When you write your XPath queries, make them specific enough to target the element you intend, but not so specific that you sacrifice generalizability or readability.

---

As an example, let's return to our multiple selection XPath query to be used on http://sfbay.craigslist.org/search/sfc/apa



```
//a[@class='result-title hdrlnk']
```


If our intention was to grab **only the first** instance of the anchor tag with the class 'result-title hdrlnk', we were not being specific enough to target this. Instead, we've grabbed every anchor element of that class.

---
#### Two alternatives that achieve our goal of grabbing the first anchor element might be:

1) Specify the first anchor element that satisfies our query:

```
(//a[@class='result-title hdrlnk'])[1]
```

2) Specify the `data-id` attribute, a unique identifier for the anchor element on this page:

```
//a[@data-id='6247436293']
```

#### The following query, however, would be overly specific:

```
/html[@class='js canvas draggable fileAPI geolocation hashChange matchMedia picture pushState placeholder no-touchCapable transitions localStorage']/body[@class='search has-map en desktop grid has-map-view-button']/section[@id='page-top']/form[@id='searchform']/div[@id='sortable-results']/ul[@class='rows']/li[@class='result-row'][1]/p[@class='result-info']/a[@class='result-title hdrlnk']
```
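
To see why over-specific queries cost generalizability, here is a rough sketch comparing a full-path query with a class-based one before and after a hypothetical redesign of the page. It uses the standard library's `xml.etree.ElementTree`, and the two pages and the `first_title` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

OLD_PAGE = """<html><body><section><ul>
    <li><a class="result-title hdrlnk">First listing</a></li>
    <li><a class="result-title hdrlnk">Second listing</a></li>
</ul></section></body></html>"""

# Same content after a hypothetical redesign wraps the list in a <div>
NEW_PAGE = OLD_PAGE.replace("<section><ul>", "<section><div><ul>").replace(
    "</ul></section>", "</ul></div></section>"
)

BRITTLE = "./body/section/ul/li[1]/a"          # spells out every ancestor
ROBUST = ".//a[@class='result-title hdrlnk']"  # relies only on the class


def first_title(page, query):
    """Return the text of the first match, or None (hypothetical helper)."""
    match = ET.fromstring(page).find(query)
    return match.text if match is not None else None


for name, page in (("old", OLD_PAGE), ("new", NEW_PAGE)):
    print(name, first_title(page, BRITTLE), first_title(page, ROBUST))
```

The full-path query stops matching as soon as one ancestor changes, while the class-based query keeps working on both versions of the page.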

## A simple  example using `scrapy` and `XPath`.

---

Below is an example of how to get information out of some fake HTML using the XPath capabilities of the `scrapy` package. You will likely need to install the scrapy package using `conda install scrapy`.   
**Note:** `conda install` also installs prebuilt binary dependencies such as `lxml`. `pip install scrapy` installs the Python dependencies as well, but on some systems it may need a compiler and system libraries to build them.

We will use the `Selector` class from the `Scrapy` library to help us construct our query.

A `Selector` takes the HTML target as an argument and can then use several query flavors to extract information. In our case we will use `XPath` as our query language (though we could also use CSS).

As with writing Python scripts, there are several ways to access the exact same information in HTML. Let's try a few out.

In [1]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

# HTML structure string
HTML = """
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800" class="xh-highlight">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892" class="tsb">email to friend</a></p>
    <p class="postinginfo"><a class="bestof-link" data-flag="9" href="https://post.craigslist.org/flag?flagCode=9&amp;postingID=5400585892" title="nominate for best-of-CL"><span class="bestof-icon">♥ </span>\
    <span class="bestof-text">best of</span></a> <sup>[<a href="http://www.craigslist.org/about/best-of-craigslist">?</a>]</sup>    </p>
</div>
"""

# Option 1: use the exact class name to get its associated text
best = Selector(text=HTML).xpath("//span[@class='bestof-text']/text()").extract()
best

[u'best of']

In [2]:
# Option 2: use the 'contains()' function to extract any text that includes the text 'best of'
best = Selector(text=HTML).xpath("//span[contains(text(), 'best of')]/text()").extract()
best

[u'best of']

In [3]:
# Option 3: first grab the entire anchor element where class='bestof-link'
best =  Selector(text=HTML).xpath("/html/body/div/p/a[@class='bestof-link']")
# then search inside that chunk for the text of the element with class='bestof-text'
nested_best =  best.xpath("./span[@class='bestof-text']/text()").extract()
nested_best

[u'best of']

_Option 3 will probably be the most common for you because there is a good chance that you will want to grab information from several child elements that exist within one parent element._

## Where's Waldo - "XPath Edition"

In this example, we will find Waldo together.  Find Waldo as:

- Element
- Attribute
- Text element

In [4]:
HTML = """
<html>
    <body>
        
        <ul id="waldo">
            <li class="waldo">
                <span> yo Im not here</span>
            </li>
            <li class="waldo">Height:  ???</li>
            <li class="waldo">Weight:  ???</li>
            <li class="waldo">Last Location:  ???</li>
            <li class="nerds">
                <div class="alpha">Bill gates</div>
                <div class="alpha">Zuckerberg</div>
                <div class="beta">Theil</div>
                <div class="animal">parker</div>
            </li>
        </ul>
        
        <ul id="tim">
            <li class="tdawg">
                <span>yo im here</span>
            </li>
        </ul>
        <li>stuff</li>
        <li>stuff2</li>
        
        <div id="cooldiv">
            <span class="dsi-rocks">
               YO!
            </span>
        </div>
        
        
        <waldo>Waldo</waldo>
    </body>
</html>
"""

**Tip:** We can use the asterisk special character '*' as a placeholder for 'all possible'.

```python
# all elements where class='alpha'
Selector(text=HTML).xpath('//*[@class="alpha"]').extract()



#returns

[u'<div class="alpha">Bill gates</div>',
 u'<div class="alpha">Zuckerberg</div>']
```


In [5]:
Selector(text=HTML).xpath('//*[@class="alpha"]').extract()

[u'<div class="alpha">Bill gates</div>',
 u'<div class="alpha">Zuckerberg</div>']

#### Find element 'waldo'

In [6]:
# text contents of the element waldo
Selector(text=HTML).xpath('/html/body/waldo/text()').extract()

[u'Waldo']

**Find attribute(s) 'waldo'**

In [60]:
# All elements that have any attribute whose value is 'waldo'
Selector(text=HTML).xpath('//*[@*="waldo"]').extract()

[u'<ul id="waldo">\n            <li class="waldo">\n                <span> yo Im not here</span>\n            </li>\n            <li class="waldo">Height:  ???</li>\n            <li class="waldo">Weight:  ???</li>\n            <li class="waldo">Last Location:  ???</li>\n            <li class="nerds">\n                <div class="alpha">Bill gates</div>\n                <div class="alpha">Zuckerberg</div>\n                <div class="beta">Theil</div>\n                <div class="animal">parker</div>\n            </li>\n        </ul>',
 u'<li class="waldo">\n                <span> yo Im not here</span>\n            </li>',
 u'<li class="waldo">Height:  ???</li>',
 u'<li class="waldo">Weight:  ???</li>',
 u'<li class="waldo">Last Location:  ???</li>']

In [11]:
# All elements whose class attribute is 'waldo'
Selector(text=HTML).xpath('//*[@class="waldo"]').extract()

[u'<li class="waldo">\n                <span> yo Im not here</span>\n            </li>',
 u'<li class="waldo">Height:  ???</li>',
 u'<li class="waldo">Weight:  ???</li>',
 u'<li class="waldo">Last Location:  ???</li>']

**Find text element Waldo**

In [61]:
# gets the whole element whose text is exactly 'Waldo'
Selector(text=HTML).xpath("//*[text()='Waldo']").extract()

[u'<waldo>Waldo</waldo>']

## Using Requests + Beautiful Soup to extract information from a webpage.

---

Beautiful Soup is a Python library useful for pulling data out of HTML and XML files. It works with several parsers, such as `lxml` and Python's built-in `html.parser`, and can be run interactively in a notebook or IDE, so it can be much easier to work with when first extracting information from HTML.

Please make sure that the required packages are installed: 

```bash
# beautiful soup (imported in Python as bs4):
> conda install beautifulsoup4
> conda install lxml

# or if conda doesn't work
> pip install beautifulsoup4
> pip install lxml
```

Let's find another posting for a sweet set of wheels on Craigslist (you will probably have to update the URL to one that hasn't expired):

![](assets/craigslist.jpg)

> *Note: you will need to update this to a current/working craigslist post.*

https://washingtondc.craigslist.org/doc/cto/d/want-reliable-vehicle-for/6244528863.html

### Step 1: fetch the content by URL

In [7]:
# you will need the requests library in order to fully utilize bs4
import requests
from bs4 import BeautifulSoup

# target web page
url = "https://washingtondc.craigslist.org/doc/cto/d/want-reliable-vehicle-for/6244528863.html"
# establishing the connection to the webpage
response = requests.get(url)

# You can use status codes to understand how the target server responds to your request.
# Ex. 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found
print('Status Code: ', response.status_code)

# Pull the decoded HTML out of the response as a Python string
html = response.text

# The first 500 characters of the content
print("\nFirst part of HTML document fetched as string:\n")
print(html[:500])

Status Code:  200

First part of HTML document fetched as string:

<!DOCTYPE html>
<html class="no-js">
<head>
<title>Want a Reliable Vehicle for Work? The 2010 Toyota Camry is Just the Ca - cars &amp; trucks - by owner - vehicle automotive sale</title>
    	<link rel="canonical" href="http://washingtondc.craigslist.org/doc/cto/d/want-reliable-vehicle-for/6244528863.html">
	<meta name="description" content="Available to buy is our 10 Toyota Camry. AT. 4-door sedan has gotten just 97X00 ofmileage. Vehicle engine and wiring just maintained. Installed new brakes, 


[More information on request status codes](http://www.restapitutorial.com/httpstatuscodes.html)
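
If you want to turn a numeric status code into its standard name without looking it up, Python's standard library has `http.HTTPStatus`. The sketch below is only an illustration, and `describe_status` is a hypothetical helper, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return 'code phrase' for a registered HTTP status code (hypothetical helper)."""
    try:
        return "{} {}".format(code, HTTPStatus(code).phrase)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know
        return "{} (unknown status)".format(code)


for code in (200, 400, 403, 404):
    print(describe_status(code))
```

With `requests` itself, calling `response.raise_for_status()` raises an exception for 4xx and 5xx responses, which is a convenient way to stop early when a page could not be fetched.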

### Step 2: Parse HTML document with Beautiful Soup

This step parses the HTML string into a "soup" object whose elements we can navigate and search. Beautiful Soup does not evaluate XPath expressions; soup queries are more like accessing information within a Python object.

**Note:** There are many ways to get the elements in a "soup" object.

Here are a few ways to select HTML elements as "objects" within "soup" as a document.

In [8]:
soup = BeautifulSoup(html, 'lxml')
# Grabbing the element
soup.html.title

<title>Want a Reliable Vehicle for Work? The 2010 Toyota Camry is Just the Ca - cars &amp; trucks - by owner - vehicle automotive sale</title>

In [9]:
# Just the text between the opening and closing tags
print(soup.html.title.text)

Want a Reliable Vehicle for Work? The 2010 Toyota Camry is Just the Ca - cars & trucks - by owner - vehicle automotive sale


In [10]:
# Find all anchor items in the soup
soup.findAll("a")

[<a class="header-logo" href="/" name="logoLink">CL</a>,
 <a href="/">washington, DC</a>,
 <a href="/doc/">district of columbia</a>,
 <a href="/search/doc/sss">for sale</a>,
 <a href="/search/doc/cto">cars &amp; trucks - by owner</a>,
 <a href="https://post.craigslist.org/c/wdc">post</a>,
 <a href="https://accounts.craigslist.org/login/home">account</a>,
 <a class="favlink" href="#"><span aria-hidden="true" class="icon icon-star fav"></span><span class="fav-number"></span><span class="fav-label"> favorites</span></a>,
 <a class="to-banish-page-link" href="#">\n<span aria-hidden="true" class="icon icon-trash red"></span>\n<span class="banished_count"></span>\n<span class="discards-label"> hidden</span>\n</a>,
 <a class="header-logo" href="/">CL</a>,
 <a href="/reply/wdc/cto/6244528863" id="replylink">reply</a>,
 <a class="flaglink" data-flag="28" href="https://post.craigslist.org/flag?flagCode=28&amp;postingID=6244528863&amp;cat=cto&amp;area=wdc" title="flag as prohibited / spam / misca

In [11]:
# Find just the first anchor element in the soup
soup.find('a')

<a class="header-logo" href="/" name="logoLink">CL</a>

In [12]:
# find single or multiple elements
# first parameter: the tag name; second parameter: a dict of attributes to match
element = soup.findAll("a", {"class": "header-logo"})
element[0].text

u'CL'

In [13]:
price_search = soup.findAll('span', {"class": "price"})
price_search[0].text

u'$2050'

In [15]:
# switching back to SF apartment listings
response = requests.get("http://sfbay.craigslist.org/search/sfc/apa")

In [16]:
soup = BeautifulSoup(response.text, "lxml")
search_titles = soup.findAll("a", {"class": "hdrlnk"})

In [17]:
for link in search_titles[0:5]:
    print(link.attrs)

{'data-id': '6253544409', 'href': '/sfc/apa/d/second-chance-3-br-gorgeous/6253544409.html', 'class': ['result-title', 'hdrlnk']}
{'data-id': '6253669482', 'href': '/sfc/apa/d/luxury-spacious-studio/6253669482.html', 'class': ['result-title', 'hdrlnk']}
{'data-id': '6213832532', 'href': '/sfc/apa/d/beautiful-3-bed-2-ba-gar-yard/6213832532.html', 'class': ['result-title', 'hdrlnk']}
{'data-id': '6253615357', 'href': '/sfc/apa/d/stunning-views/6253615357.html', 'class': ['result-title', 'hdrlnk']}
{'data-id': '6253657873', 'href': '/sfc/apa/d/love-purple-rain-the-beach/6253657873.html', 'class': ['result-title', 'hdrlnk']}
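
The `href` values above are relative to the site root, so they need to be joined with the page URL before they can be requested. A minimal sketch with the standard library's `urllib.parse.urljoin`, using two of the hrefs from the output above:

```python
from urllib.parse import urljoin

# The page the links were scraped from
BASE_URL = "http://sfbay.craigslist.org/search/sfc/apa"

hrefs = [
    "/sfc/apa/d/second-chance-3-br-gorgeous/6253544409.html",
    "/sfc/apa/d/luxury-spacious-studio/6253669482.html",
]

# A leading '/' replaces the whole path of the base URL
full_urls = [urljoin(BASE_URL, href) for href in hrefs]
for url in full_urls:
    print(url)
```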


##### > **Check:** How do we know which parameters `findAll()` takes?

### Practice: can you select the price of our junker?  

 - Use XPath Helper to get an idea of where the element is within the HTML document.
 - Try to select using the soup.html.body.something.something method.
 - Try using findAll() to find a concise element.

## Building a Spider with [Scrapy](http://scrapy.org/)

---

> *"Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them."*

Below we will walk through the creation of a **spider** using Scrapy. Spiders are automated processes that crawl through one or more webpages and collect information.

> **Note:** This code should be written in a script outside of jupyter notebook.

<a id='scrapy-project'></a>
### 1. Create a new Scrapy project

In your terminal, `cd` into the directory where you want to create your crawler's folder. I recommend the desktop for easy access to the files we will need to edit.
> `scrapy startproject craigslist`

**Should create output that looks like this:**
<blockquote>
2016-01-13 00:12:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2016-01-13 00:12:45 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-01-13 00:12:45 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'craigslist' created in:
    /Users/davidyerrington/virtualenvs/data/scraping/craigslist

You can start your first spider with:
    cd craigslist
    scrapy genspider example example.com
</blockquote>

**That command generates a set of project files:**
<blockquote>
craigslist/
    scrapy.cfg
    craigslist/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
</blockquote>

Generally, these are our files.  We will go into more detail on these soon.

 * **`scrapy.cfg`:** the project configuration file
 * **`craigslist/`:** the project’s python module, you’ll later import your code from here.
 * **`craigslist/items.py`:** the project’s items file.
 * **`craigslist/pipelines.py`:** the project’s pipelines file.
 * **`craigslist/settings.py`:** the project’s settings file.
 * **`craigslist/spiders/`:** a directory where you’ll later put your spiders.
 
Please add this line to your craigslist/settings.py file before continuing (under the 'NEWSPIDER_MODULE' is fine):
 
 <blockquote>
 DOWNLOAD_HANDLERS = {'s3': None,}
 </blockquote>



### 2. Define an "item"

Next, open the items.py file in your IDE so that we can define an item. 

When we define an item, we are telling our new application what it will be collecting. In essence, an "item" is an entity with attributes (e.g., "title", "description", "price") that are descriptive and relate to elements on the pages we will be scraping.

In more precise terms, this is a model (for those who are familiar with ORM or relational database terms). Don't worry if this is a foreign concept. The main idea to understand is that a model has attributes that closely resemble or relate to elements on our target web page(s).

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class CraigslistItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    price = scrapy.Field()
```


### 3. A spider that crawls

An item is a model that resembles data on a webpage. A spider is something that crawls pages and uses our item model to get and hold items for us.

**Scrapy spiders are python classes.  Let's write our first file, called `craigslist_spider.py` and put it in our `/spiders` directory:**

```python
import scrapy

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sfc/apa"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
```

**Next, let's dive in and crawl from our `/craigslist/craigslist` directory:**

```
> scrapy crawl craigslist
```

**What just happened?**
 * Our application requested the URLs from the `start_urls` class attribute.
 * Ran `parse` over the HTML content returned for each requested URL.
 * What else?
 
```python
    with open(filename, 'wb') as f:
        f.write(response.body)
```

It saved a file in our base project directory, named after the second-to-last segment of the URL. In our case, it should create a file called "sfc". This is taken directly from the Scrapy docs and its only point is to illustrate the workflow so far. It is kind of nice to have a reference to our HTML file though.
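
The file name comes from plain string splitting, which can be checked without running the spider. A short sketch of what `response.url.split("/")[-2]` evaluates to for our start URL (the trailing-slash variant is added only for comparison):

```python
url = "http://sfbay.craigslist.org/search/sfc/apa"

# Splitting on '/' gives:
# ['http:', '', 'sfbay.craigslist.org', 'search', 'sfc', 'apa']
parts = url.split("/")
filename = parts[-2]
print(filename)

# With a trailing slash the last piece is an empty string,
# so the same index picks a different segment
filename_trailing = (url + "/").split("/")[-2]
print(filename_trailing)
```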

There might be some errors listed when we crawl, but they are fine for now.

### 4. XPath + parsing with our spider

So far, we've defined what fields we'll get, some urls to fetch, and saved some content to a file.  Let's actually do something interesting.

**We should let our spider know about the item model we made earlier.  At the top of `craigslist/craigslist/spiders/craigslist_spider.py`, let's add a new import:**

```python
from craigslist.items import CraigslistItem
```

<br><br><br>
**Let's replace our parse method, to find some data from our Craigslist spider response, and map it to our item model, CraigslistItem:**


```python
def parse(self, response):
    items = []  # list for storing the scraped items
    hxs = scrapy.Selector(response)  # Selector lets us query the response HTML
    # each result row holds a paragraph with the title, link, and price
    for sel in hxs.xpath("//li[@class='result-row']/p"):
        item = CraigslistItem()
        item['title'] = sel.xpath("a/text()").extract()  # title text from the 'a' element
        item['link'] = sel.xpath("a/@href").extract()  # href/url from the 'a' element
        # price from the 'result-price' span nested inside another span;
        # extract_first() returns None instead of raising IndexError when it is missing
        item['price'] = sel.xpath('span/span[@class="result-price"]/text()').extract_first()
        items.append(item)
    return items  # Scrapy logs the returned items and passes them to any exporter

```



### Save and examine our scraped data

Scrapy can export our crawled data as CSV.  To save our data, we just need to pass an output file to our crawl call with `-o`; the export format is inferred from the `.csv` extension:

<blockquote>
> scrapy crawl craigslist -o items.csv
</blockquote>

It's always good to iteratively check our data when developing a spider to make sure it's close to what we want. 

> *Pro tip:  The longer your iterations are between checks, the harder it's going to be to understand what's not working and fix bugs.*

You should now have a file called '`items.csv`' in the directory you ran the `scrapy crawl` command from.
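
One quick way to check the export between iterations is to load it and count rows with missing fields. The sketch below uses the standard library's `csv` module on an in-memory sample. The rows, and the prices in particular, are hypothetical stand-ins for whatever your `items.csv` contains.

```python
import csv
import io

# Hypothetical sample standing in for the contents of items.csv
SAMPLE = io.StringIO(
    "link,price,title\n"
    "/sfc/apa/d/luxury-spacious-studio/6253669482.html,$2500,Luxury Spacious Studio\n"
    "/sfc/apa/d/stunning-views/6253615357.html,,Stunning Views\n"
)

# DictReader uses the header row as the keys of each row
rows = list(csv.DictReader(SAMPLE))

# Flag rows where the price field came back empty
missing_price = [row["title"] for row in rows if not row["price"]]
print(len(rows), "rows;", len(missing_price), "without a price")
```

To run the same check on the real export, replace `SAMPLE` with `open("items.csv", newline="")`.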

## Addendum: leveraging XPath to get more results

---

Generally, a workflow that is useful in this context is to load the page in your Chrome browser, check out the page using the XPath Helper plugin, and from that derive your own XPath expressions based on the output.

`text()` selects only the text of a given element (between the tags), and `@attribute_name` is used to select attributes.

**Here are a few examples of `text()`**:
<blockquote>
<h1>Darwin - The Evolution Of An Exhibition</h1>
</blockquote>

The XPath selector for this:

<blockquote>
//h1/text()
</blockquote>

**Here are a few examples of attributes**:

And the description is contained inside a `<div>` tag with `id="description"`:
<blockquote>
<h2>Description:</h2>

<div id="description">
Short documentary made for Plymouth City Museum and Art Gallery regarding the setup of an exhibit about Charles Darwin in conjunction with the 200th anniversary of his birth.
</div>
...
</blockquote>

XPath
<blockquote>
//div[@id='description']
</blockquote>

### Following links for more results

100 results is pretty cool but what if we want more?  We need to follow the "next" links, and find new pages to grab.  We're going to update the **`parse()`** method of our spider class, and use pandas to write the data to csv.

```python
import scrapy
from craigslist.items import CraigslistItem
import pandas as pd


class CraigslistSpider(scrapy.Spider):
    name = 'craigslist'
    allowed_domains = ['craigslist.org']
    start_urls = ['http://sfbay.craigslist.org/search/sfc/apa']

    def parse(self, response):
        items = []
        hxs = scrapy.Selector(response)
        titles = hxs.xpath("//li[@class='result-row']/p")

        for sel in titles:
            item = CraigslistItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            # extract_first() returns None instead of raising IndexError
            # when a listing has no price
            item['price'] = sel.xpath(
                'span/span[@class="result-price"]/text()').extract_first()
            items.append(dict(item))  # plain dicts convert cleanly to a DataFrame

        # Append this page's rows to the csv file
        df = pd.DataFrame(items, columns=['title', 'link', 'price'])
        with open('Mike.csv', 'a') as f:
            df.to_csv(f, header=False, index=False)

        # Does the next page exist? Let's get it!
        next_page = response.xpath(
            "(//a[@class='button next']/@href)[1]").extract_first()

        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url, self.parse)


```

To run the spider, enter:

<blockquote>
> scrapy crawl craigslist
</blockquote>
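
The "follow the next link until there is none" pattern can be sketched without Scrapy or a network connection. In the example below, the `FAKE_SITE` dictionary, its URLs, and the `crawl` function are all hypothetical stand-ins: `fetch` plays the role of the downloader, and `urljoin` does the job of `response.urljoin`.

```python
from urllib.parse import urljoin

# Hypothetical site: each page lists items and, optionally, a relative "next" link
FAKE_SITE = {
    "http://example.com/search": {"items": ["a", "b"], "next": "/search?s=2"},
    "http://example.com/search?s=2": {"items": ["c"], "next": None},
}


def crawl(start_url, fetch):
    """Yield items page by page, following 'next' links until none remain."""
    url = start_url
    seen = set()  # guards against pages that link back to each other
    while url and url not in seen:
        seen.add(url)
        page = fetch(url)
        for item in page["items"]:
            yield item
        # Resolve the relative link against the current page, as a browser would
        url = urljoin(url, page["next"]) if page["next"] else None


results = list(crawl("http://example.com/search", FAKE_SITE.__getitem__))
print(results)
```

Scrapy handles the loop, deduplication, and scheduling for you; the spider only needs to yield a new `Request` for the next page, as in the `parse()` method above.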
