![What is Html](http://designshack.designshack.netdna-cdn.com/wp-content/uploads/htmlbasics-0.jpg)

One of the largest sources of data in the world is all around us.  We consume the web in some form every day.  One of the most powerful skills you can learn is how to extract and normalize data from unstructured sources.  If you can see it, it can be scraped, mined, and put into a dataframe.

Today, we will walk through some basic constructs of HTML as a source of unstructured data, a powerful selection technique called XPath, and a basic workflow with a framework called [Scrapy](http://www.scrapy.org).

# HTML

In the HTML DOM (Document Object Model), everything is a node:
 * The document itself is a document node.
 * All HTML elements are element nodes.
 * All HTML attributes are attribute nodes.
 * Text inside HTML elements are text nodes.
 * Comments are comment nodes.
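
To make the node types above concrete, here is a minimal sketch using Python's standard-library `html.parser`. Note that `html.parser` does not build a DOM; it only reports each piece of markup as it encounters it. The `NodeLister` class, the `list_nodes` helper, and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class NodeLister(HTMLParser):
    """Record each kind of node the parser encounters (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        # An opening tag gives us an element and its attributes.
        self.nodes.append(("element", tag))
        for name, value in attrs:
            self.nodes.append(("attribute", f"{name}={value}"))

    def handle_data(self, data):
        # Text between tags; skip whitespace-only runs.
        if data.strip():
            self.nodes.append(("text", data.strip()))

    def handle_comment(self, data):
        self.nodes.append(("comment", data.strip()))


def list_nodes(markup):
    """Return (kind, value) pairs for the nodes found in the markup."""
    parser = NodeLister()
    parser.feed(markup)
    parser.close()
    return parser.nodes


sample = '<p class="intro">Hello<!-- a note --></p>'
for kind, value in list_nodes(sample):
    print(kind, value)
```

Each printed line pairs a node type from the list above with the piece of markup it came from.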

## Elements
Elements begin and end with opening and closing "tags": the element's name wrapped in angle brackets, with a forward slash before the name in the closing tag.

```
<title>I am a title.</title>
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```

**Elements begin and end with the same tag name, like so:**  `<p></p>`

**Elements can have parents and children:**

```
<body>
    <div>I am inside the parent element
        <div>I am inside a child element</div>
        <div>I am inside another child element</div>
        <div>I am inside yet another child element</div>
    </div>
</body>
```

## Attributes
HTML elements can have attributes.  They describe properties and characteristics of elements.  Some affect how the browser renders the element, or how the element behaves.

One of the most common elements is the "anchor" element.  Anchor elements typically have an "href" attribute, which tells the browser where to go when the link is clicked.  Browsers usually render anchors in a different color, often underlined, as a visual cue to differentiate them from the surrounding text.

**Markup that describes an element with attributes literally looks like this:**

```
<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>
```

**However, this element, once rendered, looks like this**

[An Awesome Website](https://www.youtube.com/watch?v=dQw4w9WgXcQ)
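
As a rough sketch of how the element, its attribute, and its text are separate pieces, the anchor above can be loaded with Python's standard-library `xml.etree.ElementTree`. This is an XML parser, so it only works here because the snippet happens to be well-formed; real-world HTML usually needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

markup = '<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>'
anchor = ET.fromstring(markup)

print(anchor.tag)          # the element name
print(anchor.attrib)       # all attributes, as a dict
print(anchor.get("href"))  # the value of a single attribute
print(anchor.text)         # the text between the tags
```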

## Element Hierarchy

![Nodes](http://www.computerhope.com/jargon/d/dom1.jpg)

**Literally Represented:**

```
<html>
    
    <head>
        <title>Example</title>
    </head>
    
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body>
    
</html>
```
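
The parent/child structure of the page above can be printed as an indented outline. This is a minimal sketch with the standard-library `xml.etree.ElementTree`, which assumes well-formed markup; the `outline` helper is a made-up name for illustration.

```python
import xml.etree.ElementTree as ET

page = """
<html>
    <head>
        <title>Example</title>
    </head>
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body>
</html>
"""


def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(page.strip())
print("\n".join(outline(root)))
```

The deeper an element sits in the tree, the further its tag name is indented.
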
## You are now qualified HTML experts

![](http://hpcc.advancingexpertcare.org/wp-content/uploads/2014/10/certified.jpg)

Your HTML learning can continue...

Read all about the different elements supported amongst modern browsers:
 * [HTML5 Cheatsheet](http://websitesetup.org/html5-cheat-sheet/)
 * [Mozilla HTML Element Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
 * [HTML5 Visual Cheatsheet](http://www.unitedleather.biz/PDF/HTML5-Visual-Cheat-Sheet1.pdf)
 



# What is XPath?

<img src="http://img.crx4chrome.com/63/4c/b1/hgimnogjllphhhkhlmebbmlgjoejdpjl-screenshot.jpg">

Now that we're all familiar with HyperText Markup Language (HTML), we can learn how to identify elements and attributes within HTML documents.  That gives us the ability to write simple expressions that turn pages into structured data.

To make this process easier, we will be using XPath Helper, which is a Chrome extension.  It's not required, but it's highly recommended for building XPath expressions.

[XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en)



## Multiple vs Singular Selections
XPath expressions can select elements, element attributes, and element text.  These selections can match either a single item or multiple items.  Generally, if you're not specific enough, you will end up selecting multiple elements.

***Multiple selections*** are useful for capturing search results, or any repeating element.  For instance, the _titles_ in the apartment listing search results from Craigslist:

**URL**

[http://sfbay.craigslist.org/search/sfc/apa](http://sfbay.craigslist.org/search/sfc/apa)


**HTML Markup**
```
...
<span class="pl"> 
    <time datetime="2016-01-12 23:27" title="Tue 12 Jan 11:27:35 PM">Jan 12</time> 
    <a href="/sfc/apa/5400584579.html" data-id="5400584579" class="hdrlnk">Welcome home to a sweetly renovated four bedroom one and a half bath</a> 
</span>
...
```

**XPath - Multiple Titles**
```
//a[@class='hdrlnk']
```

**Returns (Ad Titles)**
```
***New Remodeled two bedroom Apartment***
WONDERFUL ONE BR APARTMENT HOME
Beautiful 1bed/1bath Apartment in Russian Hill NO SECURITY DEPOSIT
Knockout SF View|Green Oasis|Private Driveway|Furnished
3BR/3BA Spacious, Beautiful SOMA Loft: 5 month lease
Nob Hill Large Studio - Light, Quiet, Lovely Building
etc...
```
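
A multiple selection like this can be tried offline with the standard-library `xml.etree.ElementTree`, which supports a small subset of XPath. This is only a sketch: the markup below is a simplified, hypothetical stand-in for the search results page, and ElementTree needs a relative path (`.//`) and well-formed markup, unlike the selectors Scrapy provides.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified search-results markup modeled on the snippet above.
markup = """
<div>
    <span class="pl"><a href="/sfc/apa/1.html" class="hdrlnk">First listing</a></span>
    <span class="pl"><a href="/sfc/apa/2.html" class="hdrlnk">Second listing</a></span>
    <span class="pl"><a href="/about" class="other">Not a listing</a></span>
</div>
"""
results = ET.fromstring(markup.strip())

# ".//" searches every descendant; the predicate keeps only matching anchors.
links = results.findall(".//a[@class='hdrlnk']")
titles = [a.text for a in links]
print(titles)
```

Because the expression is not tied to one position in the document, every anchor with the `hdrlnk` class is returned.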

<br><br><br>
***Singular selections*** are necessary when you want to grab specific, unique text within elements.  Here's an example (which is probably going to be expired if you view it sometime after Jan 12th, 2016) of a details page on Craigslist:

**URL**
(Only $8000!)
[http://sfbay.craigslist.org/sfc/apa/5400585892.html](http://sfbay.craigslist.org/sfc/apa/5400585892.html)

**HTML Markup**
```
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800" class="xh-highlight">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892" class="tsb">email to friend</a></p>
    <p class="postinginfo"><a class="bestof-link" data-flag="9" href="https://post.craigslist.org/flag?flagCode=9&amp;postingID=5400585892" title="nominate for best-of-CL"><span class="bestof-icon">♥ </span><span class="bestof-text">best of</span></a> <sup>[<a href="http://www.craigslist.org/about/best-of-craigslist">?</a>]</sup>    </p>
</div>
```

**XPath - Single Item**
```
//p[@class='postinginfo'][2]/time
```
**Returns (Time of posting)**
```
2016-01-12 11:23pm
```
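
The same expression can be sketched with the standard-library `xml.etree.ElementTree`, again using a relative path and a trimmed copy of the markup above. One caveat to keep in mind: ElementTree implements only part of XPath, and its positional predicates may not behave exactly like a full XPath engine in every case, so treat this as an approximation of what Scrapy would return.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the posting-info markup above.
markup = """
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892">email to friend</a></p>
</div>
"""
posting = ET.fromstring(markup.strip())

# Second "postinginfo" paragraph, then the <time> element inside it.
matches = posting.findall(".//p[@class='postinginfo'][2]/time")
posted = matches[0].text
print(len(matches), posted)
```
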

# What is [Scrapy](http://scrapy.org/)?

_"Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them."_

## Installation

```
pip install scrapy
```

_*note: Scrapy 1.0.x, the version used in this walkthrough, does not work with Python 3.  Python 3 support arrived in Scrapy 1.1._

### Windows Users

If Scrapy crashes with:

>`ImportError: No module named win32api`

You need to install [pywin32](http://sourceforge.net/projects/pywin32/) because of [this Twisted bug](http://twistedmatrix.com/trac/ticket/3707).


## 1. Create a new Scrapy project


> `scrapy startproject craigslist`

Should create output that looks like this:
<blockquote>
```
2016-01-13 00:12:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2016-01-13 00:12:45 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-01-13 00:12:45 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'craigslist' created in:
    /Users/davidyerrington/virtualenvs/data/scraping/craigslist

You can start your first spider with:
    cd craigslist
    scrapy genspider example example.com
```
</blockquote>

This command generates a set of project files:
<blockquote>
```
craigslist/
    scrapy.cfg
    craigslist/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
```
</blockquote>

Generally, these are our files.  We will go into more depth about these soon.

 * **scrapy.cfg:** the project configuration file
 * **craigslist/:** the project’s python module, you’ll later import your code from here.
 * **craigslist/items.py:** the project’s items file.
 * **craigslist/pipelines.py:** the project’s pipelines file.
 * **craigslist/settings.py:** the project’s settings file.
 * **craigslist/spiders/:** a directory where you’ll later put your spiders.
 
Long story, but please add this line to your craigslist/settings.py file before continuing:
 
 <blockquote>
 ```
 DOWNLOAD_HANDLERS = {'s3': None,}
 ```
 </blockquote>

## 2. Define an "Item"

Basically, when we define an item, we're telling our new application what it will be collecting.  In essence, an "item" is an entity that has attributes (e.g. "title", "description", "price", etc.) that are descriptive and relate to elements on pages that we will be scraping.

In more precise terms, this is a model, for those who are familiar with ORM or relational database terms.  Don't worry if this is a foreign concept.  The main idea to understand is that a model has attributes that closely resemble / relate to elements on our target web page(s).

<blockquote>
```
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CraigslistItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
```
</blockquote>

## 3. A Spider That Crawls

An item is a model that resembles data on a webpage.  A spider is something that crawls, and uses our item model to hold our items for us.

Scrapy spiders are classes.  Let's write our first file, called `craigslist_spider.py` and put it in our `/spiders` directory:

<blockquote>
```
import scrapy

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sfc/apa"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
```
</blockquote>

Let's dive in and crawl from our `/craigslist/craigslist` directory:

<blockquote>
```
scrapy crawl craigslist
```
</blockquote>

What just happened?

 * Our application requested the URLs from the `start_urls` class attribute.
 * It ran `parse` over the HTML markup returned for each requested URL.
 * What else?
 
<blockquote>
```
        with open(filename, 'wb') as f:
            f.write(response.body)
```
</blockquote>

It saved a file in our base project directory.  It should be named based on the end of the URL.  In our case, it should create a file called "sfc".  This is taken directly from the Scrapy docs and its only point is to illustrate the workflow so far.  It is kind of nice to have a reference to our HTML file though.  
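
To see where the name "sfc" comes from, here is the same string manipulation that `parse` applies to `response.url`, run on its own against our start URL:

```python
url = "http://sfbay.craigslist.org/search/sfc/apa"

parts = url.split("/")
print(parts)

filename = parts[-2]  # the second-to-last piece of the URL
print(filename)
```

The last piece is `apa`, so the second-to-last piece, `sfc`, becomes the file name.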

There might be some errors listed when we crawl, but they are fine for now.

## 4. XPath + Parsing w/ Our Spider

So far, we've defined what fields we'll get, some urls to fetch, and saved some content to a file.  Let's actually do something interesting.

We should let our spider know about the item model we made earlier.  At the top of `craigslist/craigslist/spiders/craigslist_spider.py`, let's add a new import:

<blockquote>
```
from craigslist.items import CraigslistItem
```
</blockquote>

Why won't it work otherwise?

<br><br><br>
Let's replace our parse method, to find some data from our Craigslist spider response, and map it to our item model, CraigslistItem.

<blockquote>
```
    def parse(self, response):

        for sel in response.xpath("//div[@class='content']/span[@class='rows']/p"):

            item = CraigslistItem()
            item['title'] =  sel.xpath("span/span/a[@class='hdrlnk']").extract()[0]
            item['link']  =  sel.xpath("span/span/a[@class='hdrlnk']/@href").extract()[0]
            yield item
```
</blockquote>





### Save our data
By default, we can save our crawled data as json.  To save our data, we just need to pass an optional parameter to our crawl call:

<blockquote>
```
scrapy crawl craigslist -o items.json
```
</blockquote>

### Let's Look at Our Data
It's always good to iteratively check our data when developing a spider, to make sure it's close to what we want. 

_Pro tip:  The longer your iterations are between checks, the harder it's going to be to understand what's not working and fix bugs._

```
import pandas as pd

# update this path to your own
# hint: from terminal, use the pwd command in the same directory as items.json to find
# your scraping directory with your json file
pd.read_json("/Users/davidyerrington/virtualenvs/data/scraping/craigslist/craigslist/items.json").head()
```

|   | link | price | title |
|---|------|-------|-------|
| 0 | /sfc/apa/5401845169.html | `<span class="price">$5250</span>` | `<a href="/sfc/apa/5401845169.html" data-id="54...` |
| 1 | /sfc/apa/5401844876.html | `<span class="price">$3995</span>` | `<a href="/sfc/apa/5401844876.html" data-id="54...` |
| 2 | /sfc/apa/5401832595.html | `<span class="price">$3800</span>` | `<a href="/sfc/apa/5401832595.html" data-id="54...` |
| 3 | /sfc/apa/5401842161.html | `<span class="price">$3650</span>` | `<a href="/sfc/apa/5401842161.html" data-id="54...` |
| 4 | /sfc/apa/5401841489.html | `<span class="price">$3750</span>` | `<a href="/sfc/apa/5401841489.html" data-id="54...` |


So what's wrong with this data so far?

### 1)  We need to update the way we select our "title" attribute.  Google this and try to fix the problem.  This is what we do all day as developers.

**Discuss and fix**...


**Solution:**
    
    

### 2) Now, we would like to add the price attribute into our items.  What files need to be updated?  What needs to be added to our spider?  

**Update our code!**

**Solution:**

## More on XPath

Generally, the workflow that is useful in this context is to load the page in your Chrome browser, inspect it with the XPath Helper extension, then derive your own XPath expressions from the output.

In an expression, `text()` selects only the text of a given element (between the tags), and `@attribute_name` selects attributes.

**Here are a few examples**

### Text()

<blockquote>
```
<h1>Darwin - The Evolution Of An Exhibition</h1>
```
</blockquote>

The XPath selector for this:

<blockquote>
```
//h1/text()
```
</blockquote>
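
The difference between selecting an element and selecting only its text can be illustrated with the standard-library `xml.etree.ElementTree`. This is an analogy rather than XPath itself: ElementTree's limited XPath support does not include a `text()` step, so the sketch compares the serialized element with its `.text` property instead. In Scrapy, the corresponding comparison would be between `//h1` and `//h1/text()`.

```python
import xml.etree.ElementTree as ET

h1 = ET.fromstring("<h1>Darwin - The Evolution Of An Exhibition</h1>")

whole_element = ET.tostring(h1, encoding="unicode")  # markup, tags included
text_only = h1.text                                  # just the text between the tags

print(whole_element)
print(text_only)
```
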

### @Attributes

In this example, a description is contained inside a `<div>` tag with `id="description"`:
<blockquote>
```
<h2>Description:</h2>

<div id="description">
Short documentary made for Plymouth City Museum and Art Gallery regarding the setup of an exhibit about Charles Darwin in conjunction with the 200th anniversary of his birth.
</div>
...
```
</blockquote>

The XPath selector for this matches the element by the value of its `id` attribute:
<blockquote>
```
//div[@id='description']
```
</blockquote>

## Getting More Results

100 results is pretty cool, but what if we want more?  We need to follow the "next" links and find new pages to grab.  From the **parse()** method of our spider class, we only need to yield another type of object: a `scrapy.Request` for the next page, with `parse` as its callback.

<blockquote>
```
    def parse(self, response):

        for sel in response.xpath("//div[@class='content']/span[@class='rows']/p"):

            item = CraigslistItem()
            item['title'] =  sel.xpath("span/span/a[@class='hdrlnk']").extract()[0]
            item['link']  =  sel.xpath("span/span/a[@class='hdrlnk']/@href").extract()[0]
            item['price'] =  sel.xpath("span/span/span[@class='price']").extract()[0]
            yield item

        # Does the next page exist?  Let's get it!
        next_page   = response.xpath("(//a[@class='button next']/@href)[1]")

        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

```
</blockquote>
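
The `response.urljoin` call is there because the "next" link may be a relative URL, and a request needs an absolute one.  The standard library's `urllib.parse.urljoin` performs the same kind of resolution, so it can be used to see the effect outside of a spider.  The `next_href` value below is hypothetical, not copied from Craigslist.

```python
from urllib.parse import urljoin

page_url = "http://sfbay.craigslist.org/search/sfc/apa"
next_href = "/search/sfc/apa?s=100"  # hypothetical href value, for illustration only

# Resolve the relative link against the URL of the page it was found on.
next_url = urljoin(page_url, next_href)
print(next_url)
```
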

## Tips

Loading up an interactive "scrapy shell" will help make testing your XPath expressions even easier.

To start the shell, you just need to run this command:

<blockquote>
```
scrapy shell http://sfbay.craigslist.org/search/apa
```
</blockquote>

Learn more about it:  [Scrapy Shell Documentation](http://doc.scrapy.org/en/0.24/topics/shell.html)