### Pipe Dream

#### Setup:

- Understand what we want to do.
- Find sources to help us do it.

#### Acquisition:

- Read in the raw data from online.
- Format these data to be usable.

#### Processing:

- Many options!

### Our Focus

- Acquisition!
- Using scrapy via python.

### HyperText Markup Language

- HTML tags: <...> </...>
- The HTML tree:
~~~
	<html>
		<body>
			<div>
				<p>Hello World!</p>
				<p>Enjoy DataCamp!</p>
			</div>
		</body>
	</html>
~~~

### HTML Tags

~~~
<tag-name attrib-name="attrib info">
	..element contents..
</tag-name>
~~~

- We've seen tag names such as *html*, *div*, and *p*.

**Examples**:

~~~
<div id="unique-id" class="some class">
	..div element contents..
</div>
~~~

- *id* attribute should be unique.
- *class* attribute does not need to be unique.

~~~
<a href="https://www.datacamp.com">
	This text links to DataCamp!
</a>
~~~

- *a* tags are for hyperlinks.
- *href* attribute tells what link to go to.

### Crash Course XPath

- Example:
~~~
xpath = '/html/body/div[2]'
~~~

- A single forward slash '/' moves forward one generation (from an element to its children).
- Tag names between slashes say which element(s) to move to.
- Brackets '[]' after a tag name tell us which of the selected siblings to choose.
	- One-indexed: starts from 1

- A double forward slash '//' looks forward through all future generations. For example, this directs to all *table* elements anywhere in the HTML code:

~~~
xpath = '//table'
~~~

- Or direct to all *table* elements which are descendants of the 2nd *div* child of the *body* element:

~~~
xpath = '/html/body/div[2]//table'
~~~

**The wildcard**

- The asterisk '*' is the wildcard; it matches any element. This selects all children of the *body* element:

~~~
xpath = '/html/body/*'
~~~

**Attribute**

- '@' marks an attribute name; a test in brackets narrows the selection to elements whose attribute has the given value.

~~~
xpath = '//p[@class="class-1"]'

xpath2 = '//*[@id="uid"]'

xpath3 = '//div[@id="uid"]/p[2]'
~~~

**Content with contains**

~~~
contains(@attr-name,"string-expr")
~~~

**Example**:

~~~
xpath = '//*[contains(@class,"class-1")]'
~~~

returns:

~~~
<p class="class-1"> ... </p>

<div class="class-1 class-2"> ... </div>

<p class="class-12"> ... </p>
~~~

while:

~~~
xpath = '//*[@class="class-1"]'
~~~

returns:

~~~
<p class="class-1"> ... </p>
~~~

### scrapy Selector

**Setting up a Selector**

~~~
from scrapy import Selector

html = ...

sel = Selector(text=html)
~~~

**Selecting Selectors**

- We can call the *xpath* method on a *Selector* to create new *Selector* objects for specific pieces of the HTML code.
- The return value is a *SelectorList* of *Selector* objects.

~~~
sel.xpath('//p')
~~~

**Extracting data from a SelectorList**

- Use the *extract()* method:

~~~
sel.xpath('//p').extract()

sel.xpath('//p').extract_first() # grab first element of SelectorList
~~~


#### XPath Chaining

*Selector* and *SelectorList* objects allow for chaining when using the *xpath* method. This means you can apply the *xpath* method again to the result of an earlier *xpath* call. For example, if *sel* is the name of our *Selector*, then

~~~
sel.xpath('/html/body/div[2]')
~~~

is the same as

~~~
sel.xpath('/html').xpath('./body/div[2]')
~~~

or is the same as

~~~
sel.xpath('/html').xpath('./body').xpath('./div[2]')
~~~

The only catch is that you need to "glue together" the XPath pieces by using a period at the start of each subsequent XPath string (notice the periods we added to the XPath strings in our examples).

### HTML text to Selector

~~~
from scrapy import Selector

import requests

url = ...

html = requests.get(url).text

sel = Selector(text=html)
~~~

### CSS Locators

- '/' replaced by '>' (except first character)
	- XPath:
    ~~~
    /html/body/div
    ~~~
	- CSS Locator:
    ~~~
    html > body > div
    ~~~

- '//' replaced by a blank space (except first character)
	- XPath: 
    ~~~
    //div/span//p
    ~~~
	- CSS Locator:
    ~~~
    div > span p
    ~~~

- '[N]' replaced by ':nth-of-type(N)'
	- XPath:
    ~~~
    //div/p[2]
    ~~~
	- CSS Locator:
    ~~~
    div > p:nth-of-type(2)
    ~~~

**Attributes in CSS**

- To find an element by class, use a period '.'
	- Example:
	~~~
	p.class-1
	~~~
	selects all paragraph elements belonging to *class-1*
- To find an element by id, use a pound sign '#'
	- Example:
	~~~
	div#uid
	~~~
	selects the *div* element with *id* equal to *uid*


#### Select *class1* paragraph elements that are children of the *div* with id *uid*

~~~
css_locator = 'div#uid > p.class1'
~~~

**Selectors with CSS**

~~~
from scrapy import Selector

html = ...

sel = Selector(text=html)

sel.css('div > p')

sel.css('div > p').extract()
~~~

#### CSS Attributes and Text Selection

- Using XPath: <xpath-to-element>/@attr-name

~~~
xpath = '//div[@id="uid"]/a/@href'
~~~

- Using CSS Locator: <css-to-element>::attr(attr-name)

~~~
css_locator = 'div#uid > a::attr(href)'
~~~

#### Text Extraction

~~~
<p id="p-example">
	Hello World!
	Try <a href="https://datacamp.com">DataCamp</a> today!
</p>
~~~

- In XPath use *text()*

~~~
sel.xpath('//p[@id="p-example"]/text()').extract() # only text in element
~~~

~~~
sel.xpath('//p[@id="p-example"]//text()').extract() # all text under
~~~

- For CSS Locator, use '::text'

~~~
sel.css('p#p-example::text').extract() # only text in element
~~~

~~~
sel.css('p#p-example ::text').extract() # all text under
~~~


### Selector vs Response

- The Response has all the tools we learned with Selectors:
	- *xpath* and *css* methods, followed by the *extract* and *extract_first* methods.
- The Response also keeps track of the url from which the HTML code was loaded.
- The Response helps us move from one site to another, so that we can "crawl" the web while scraping.

~~~
response.xpath('//div/span[@class="bio"]')

response.css('div > span.bio')

# Chaining
response.xpath('//div').css('span.bio')

# Data extraction
response.xpath('//div').css('span.bio').extract()
response.xpath('//div').css('span.bio').extract_first()
~~~

- The response keeps track of the URL in its *url* attribute:

~~~
print(response.url)
~~~

- The response lets us "follow" a new link with the *follow()* method:

~~~
response.follow(next_url)
~~~


In [None]:
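import requests
from scrapy.http import TextResponse

url = 'https://www.datacamp.com/courses/all'
# Wrap the downloaded page in a TextResponse so we can call css() and xpath() on it
response = TextResponse(body=requests.get(url).content, url=url)

# One Selector per course block on the page
course_divs = response.css('div.course-block')

#print(course_divs)
first_div = course_divs[0]
# './*' selects every direct child of the first course block
children = first_div.xpath('./*')
print(len(children))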

In [None]:
first_child = children[0]
print(first_child.extract())

In [None]:
print(children[1].extract())

In [None]:
print(children[2].extract())

### List of courses:

- In one CSS Locator:

In [None]:
links = response.css('div.course-block > a::attr(href)').extract()

- Using XPath:

In [None]:
hrefs = course_divs.xpath('./a/@href').extract()

In [None]:
len(links) == len(hrefs)

### A Spider

~~~
import scrapy
from scrapy.crawler import CrawlerProcess

class SpiderClassName(scrapy.Spider):
	name = 'spider_name'
	# the code for the spider

process = CrawlerProcess()
process.crawl(SpiderClassName)
process.start()
~~~

In [None]:
import scrapy

class DCSpider(scrapy.Spider):

    name = 'dc_spider'

    def start_requests(self):
        urls = ['https://www.datacamp.com/courses/all']
        for url in urls:
            yield scrapy.Request(url=url,callback=self.parse)

    def parse(self,response):
        links = response.css('div.course-block > a::attr(href)').extract()
        filepath = 'DC_links.csv'
        with open(filepath,'w') as f:
            f.writelines([link+'\n' for link in links])

- In this pattern, the spider defines a method called *start_requests* that yields the first requests.
- It also needs at least one parser method to handle the HTML code of each response.

#### Start requests

~~~
def start_requests(self):
	urls = [...]
	for url in urls:
		yield scrapy.Request(url=url, callback=self.parse)
~~~

- Scrapy downloads each *scrapy.Request* and passes the result to the *callback* method as its *response* argument.
- The *url* argument tells scrapy which page to request.


#### Parse and Crawl

~~~
def parse(self,response):
	links = response.css('div.course-block > a::attr(href)').extract()
	for link in links:
		yield response.follow(url=link,callback=self.parse2)

def parse2(self,response):
	# parse the course sites here!
	pass
~~~

## Capstone:

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

class DC_Chapter_Spider(scrapy.Spider):

    name = 'dc_chapter_spider'

    def start_requests(self):
        url = 'https://www.datacamp.com/courses/all'
        yield scrapy.Request(url=url,callback=self.parse_front)

    def parse_front(self,response):
        course_blocks = response.css('div.course-block')
        course_links = course_blocks.xpath('./a/@href')

        links_to_follow = course_links.extract()

        for url in links_to_follow:
            yield response.follow(url=url,callback=self.parse_pages)

    def parse_pages(self,response):
        crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
        crs_title_ext = crs_title.extract_first().strip()

        ch_titles = response.css('h4.chapter__title::text')
        ch_titles_ext = [t.strip() for t in ch_titles.extract()]

        dc_dict[crs_title_ext] = ch_titles_ext

dc_dict = {}

process = CrawlerProcess()
process.crawl(DC_Chapter_Spider)
process.start()

2019-05-18 21:54:45 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: scrapybot)
2019-05-18 21:54:45 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Windows-10-10.0.17134-SP0
2019-05-18 21:54:45 [scrapy.crawler] INFO: Overridden settings: {}
2019-05-18 21:54:45 [scrapy.extensions.telnet] INFO: Telnet Password: 273e4d060b99a8a1
2019-05-18 21:54:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-05-18 21:54:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defa

2019-05-18 21:55:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/probability-puzzles-in-r> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/writing-functions-and-stored-procedures-in-sql-server> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/time-series-with-datatable-in-r> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/intermediate-interactive-data-visualization-with-plotly-in-r> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/feature-engineering-for-machine-learning-in-python> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:14 [scrapy.core

2019-05-18 21:55:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/garch-models-in-r> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/analyzing-social-media-data-in-python> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/spreadsheet-basics> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/linear-algebra-for-data-science-in-r> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/fraud-detection-in-r> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/supply-chain-analytics

2019-05-18 21:55:46 [scrapy.extensions.logstats] INFO: Crawled 125 pages (at 125 pages/min), scraped 0 items (at 0 items/min)
2019-05-18 21:55:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/foundations-of-predictive-analytics-in-python-part-1> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/marketing-analytics-in-r-choice-modeling> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/advanced-deep-learning-with-keras-in-python> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/categorical-data-in-the-tidyverse> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:55:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/multiv

2019-05-18 21:56:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/developing-r-packages> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:56:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/fundamentals-of-bayesian-data-analysis-in-r> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:56:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/data-manipulation-in-r-with-datatable> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:56:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/network-analysis-in-r> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:56:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/building-web-applications-in-r-with-shiny> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:56:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp

2019-05-18 21:56:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/sentiment-analysis-in-r> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:56:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/visualizing-time-series-data-in-r> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:56:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/intro-to-sql-for-data-science> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:56:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/object-oriented-programming-in-r-s3-and-r6> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:56:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/supervised-learning-with-scikit-learn> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:56:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacam

2019-05-18 21:56:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/cleaning-data-in-r> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:56:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/introduction-to-machine-learning-with-r> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:56:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.datacamp.com/courses/intermediate-r> (referer: https://www.datacamp.com/courses/all)
2019-05-18 21:56:58 [scrapy.core.engine] INFO: Closing spider (finished)
2019-05-18 21:56:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 239532,
 'downloader/request_count': 260,
 'downloader/request_method_count/GET': 260,
 'downloader/response_bytes': 66034877,
 'downloader/response_count': 260,
 'downloader/response_status_count/200': 260,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 5, 19, 0, 56, 58, 996185),

In [2]:
dc_dict

{'Introduction to R': ['Intro to basics',
  'Matrices',
  'Data frames',
  'Vectors',
  'Factors',
  'Lists',
  'Intro to basics',
  'Vectors',
  'Matrices',
  'Factors',
  'Data frames',
  'Lists'],
 'Foundations of Inference': ['Introduction to ideas of inference',
  'Hypothesis testing errors: opportunity cost',
  'Completing a randomization test: gender discrimination',
  'Confidence intervals',
  'Introduction to ideas of inference',
  'Completing a randomization test: gender discrimination',
  'Hypothesis testing errors: opportunity cost',
  'Confidence intervals'],
 'Bond Valuation and Analysis in R': ['Introduction and Plain Vanilla Bond Valuation',
  'Duration and Convexity',
  'Yield to Maturity',
  'Comprehensive Example',
  'Introduction and Plain Vanilla Bond Valuation',
  'Yield to Maturity',
  'Duration and Convexity',
  'Comprehensive Example'],
 'Correlation and Regression': ['Visualizing two variables',
  'Simple linear regression',
  'Model Fit',
  'Correlation',
  '