# Web Scraping with Python

## Pipeline
##### Setup
* Define Objective
* Identify Sources
##### Acquisition
* Access Raw Data
* Parse & Extract
##### Processing
* Analyze --> Learn --> Wrangle --> Explore --> Analyze

## Hyper Text Markup Language (HTML)
### HTML Tags and Attributes
```
<tag-name attrib-name='attrib info'>
    ... element contents ...
</tag-name>
```

#### "div" tag
```
<div id="unique-id" class="some class">
    ... div element contents ...
</div>
```
* id should be unique
* class attribute doesn't need to be unique

#### "a" tag - for linking
```
<a href="https://www.datacamp.com">
    This text links to DataCamp!
</a>
```
* a tags are for hyperlinks
* href attribute tells what link to go to
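
To see tags and attributes as data, here is a minimal sketch that feeds the `div` and `a` snippets above to Python's standard-library `html.parser`. The `TagCollector` class and the combined `html` string are illustrative assumptions, not part of the original notes.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


html = '''
<div id="unique-id" class="some class">
    <a href="https://www.datacamp.com">This text links to DataCamp!</a>
</div>
'''

collector = TagCollector()
collector.feed(html)
for tag, attrs in collector.tags:
    print(tag, attrs)
```

Each entry pairs a tag name with a dictionary of its attributes, which is the information XPath and CSS locators select on later.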

## XPath
`xpath = '/html/body/div[2]'`

##### Simple XPath
* Single forward-slash `/` used to move forward one generation
* A tag name between slashes gives the direction to which element(s) to select
* Brackets `[]` after a tag name tell us which of the selected siblings to choose

##### Double Feature
* Direct to all `table` elements within the entire HTML code:
`xpath = '//table'`
* Direct to all `table` elements which are descendants of the 2nd `div` child of the `body` element: `xpath = '/html/body/div[2]//table'`
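
The single-slash, bracket, and double-slash rules can be tried without Scrapy. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in; it supports only a subset of XPath, needs well-formed markup, and evaluates paths relative to the root `html` element, so `/html/body/div[2]` is written as `./body/div[2]`. The sample `html` string is an assumption made up for illustration.

```python
import xml.etree.ElementTree as ET

html = '''<html>
  <body>
    <div><p>first div</p></div>
    <div>
      <table><tr><td>outer table</td></tr></table>
      <div><table><tr><td>nested table</td></tr></table></div>
    </div>
    <table><tr><td>table outside the second div</td></tr></table>
  </body>
</html>'''

root = ET.fromstring(html)  # root is the <html> element

# Stand-in for '/html/body/div[2]': the 2nd div child of body
second_div = root.findall('./body/div[2]')

# Stand-in for '//table': every table at any depth
all_tables = root.findall('.//table')

# Stand-in for '/html/body/div[2]//table': tables below the 2nd div only
tables_in_second_div = root.findall('./body/div[2]//table')

print(len(second_div), len(all_tables), len(tables_in_second_div))
```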

### XPaths and Selectors

#### XPath Navigation
`xpath = '/html/body'`

`xpath = '/html[1]/body[1]'`

#### The wild card
* The asterisk `*` is the "wildcard"
* It ignores the tag type

`xpath = '/html/body/*'`

#### Off the Beaten XPath

##### (At)tribute
* `@` represents "attribute"
* `@class`, `@id`, `@href`

`xpath = '//p[@class="class-1"]'`

`xpath = '//*[@id="uid"]'`

`xpath = '//div[@id="uid"]/p[2]'`
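
A small sketch of the wildcard and attribute filters, again using the standard library's `xml.etree.ElementTree` as a stand-in for a full XPath engine (an assumption made so the example runs without Scrapy). Paths are relative to the root `html` element, and the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

html = '''<html>
  <body>
    <div id="uid">
      <p class="class-1">one</p>
      <p class="class-2">two</p>
    </div>
    <p class="class-1">three</p>
  </body>
</html>'''

root = ET.fromstring(html)

# Stand-in for '/html/body/*': all children of body, whatever their tag
body_children = [el.tag for el in root.findall('./body/*')]

# Stand-in for '//p[@class="class-1"]'
class_1_texts = [el.text for el in root.findall('.//p[@class="class-1"]')]

# Stand-in for '//*[@id="uid"]': any tag type with the given id
by_id = root.findall('.//*[@id="uid"]')

# Stand-in for '//div[@id="uid"]/p[2]'
second_p = root.find('.//div[@id="uid"]/p[2]')

print(body_children, class_1_texts, by_id[0].tag, second_p.text)
```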

##### Content with Contains
* `xpath` `contains` notation:

`contains(@attri-name, "string-expr")`

`xpath = '//*[contains(@class, "class-1")]'`

`xpath = '//*[@class="class-1"]'`

`xpath = '/html/body/div/p[2]/@class'`

`xpath = '//a[contains(@class, "package-snippet")]/@href'`
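
The difference between `contains(@class, "class-1")` and `@class="class-1"` is substring match versus exact match. The sketch below emulates both tests in plain Python over a hypothetical document, because the standard library's `ElementTree` does not implement `contains`; it illustrates the semantics and is not Scrapy code.

```python
import xml.etree.ElementTree as ET

html = '''<html><body>
<p class="class-1">a</p>
<p class="class-1 class-2">b</p>
<p class="class-12">c</p>
<p class="other">d</p>
</body></html>'''

root = ET.fromstring(html)

# Like '//p[@class="class-1"]': the whole attribute must equal the string
exact = [p.text for p in root.iter('p') if p.get('class') == 'class-1']

# Like '//p[contains(@class, "class-1")]': the string may appear anywhere
containing = [p.text for p in root.iter('p') if 'class-1' in p.get('class', '')]

print(exact)
print(containing)
```

Note that the substring test also matches `class-12`, which may or may not be what you want.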

## Selector Objects

##### Setting up a `Selector`
```
from scrapy import Selector

html = '''<html> .... </html>'''
sel = Selector(text=html)
```
* Created a `scrapy` `Selector` object from a string containing the HTML code
* The `Selector` `sel` has selected the entire HTML document
* Use the `xpath` method of a `Selector` to create new `Selector`s for specific pieces of the HTML code
* The return value is a `SelectorList` of `Selector` objects

`sel.xpath("//p")`

##### Extracting data from a `SelectorList`
* Use the `extract()` method: `sel.xpath("//p").extract()`
* Use the `extract_first()` method to get the first element from the list: `sel.xpath("//p").extract_first()`

##### Extracting data from a `Selector`
```
ps = sel.xpath('//p')
second_p = ps[1]
second_p.extract()
```

##### XPath chaining
```
sel.xpath('/html/body/div[2]')
#OR
sel.xpath('/html').xpath('./body/div[2]')
#OR
sel.xpath('/html').xpath('./body').xpath('./div[2]')
```
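
The idea behind chaining is that a relative path starting with `.` continues from the element already selected. A minimal sketch of the same idea with the standard library's `xml.etree.ElementTree` (a stand-in chosen so it runs without Scrapy; the two-`div` document is hypothetical):

```python
import xml.etree.ElementTree as ET

html = '<html><body><div>first</div><div>second</div></body></html>'
root = ET.fromstring(html)  # root is the <html> element

# One path in a single call
direct = root.find('./body/div[2]')

# The same path split into two chained calls
chained = root.find('./body').find('./div[2]')

print(direct.text, chained is direct)
```

Both routes arrive at the same element, so the choice between them is about readability.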

### HTML text to Selector
```
from scrapy import Selector
import requests

url = 'https://en.wikipedia.org/wiki/Web_scraping'
html = requests.get(url).content
sel = Selector(text=html)
```

## CSS Locators, Chaining and Responses

##### CSS Locators
* `/` replaced by `>` (except first character)
    * XPath: `/html/body/div`
    * CSS Locator: `html > body > div`
* `//` replaced by a `blank space` (except first character)
    * XPath: `//div/span//p`
    * CSS Locator: `div > span p`
* `[N]` replaced by `:nth-of-type(N)`
    * XPath: `//div/p[2]`
    * CSS Locator: `div > p:nth-of-type(2)`
* Example:
```
xpath = '/html/body//div/p[2]'
css = 'html > body div > p:nth-of-type(2)'
```
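
The three replacement rules are mechanical enough to write down as code. The helper below, `xpath_to_css`, is a hypothetical illustration that only handles tag names, slashes, and numeric `[N]` indices; it is not a general XPath-to-CSS converter and is not part of Scrapy.

```python
import re


def xpath_to_css(xpath: str) -> str:
    """Translate a simple XPath into a CSS locator using the three rules above."""
    # Rule 3: [N] becomes :nth-of-type(N)
    css = re.sub(r'\[(\d+)\]', r':nth-of-type(\1)', xpath)
    # The leading '/' or '//' has no CSS counterpart ("except first character")
    css = css.lstrip('/')
    # Rule 2: '//' becomes a blank space (must run before the single-slash rule)
    css = css.replace('//', ' ')
    # Rule 1: '/' becomes '>'
    css = css.replace('/', ' > ')
    return css


for xpath in ['/html/body/div', '//div/span//p', '//div/p[2]', '/html/body//div/p[2]']:
    print(xpath, '->', xpath_to_css(xpath))
```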

##### Attribute in CSS
* To find an element by class, use a period `.`
    * ex: `p.class-1` selects all paragraph elements belonging to class-1
* To find an element by id, use a pound sign `#`
    * ex: `div#uid` selects the div element with id equal to uid
* `div#uid > p.class1` - selects paragraph elements of class1 that are children of the div with id equal to uid
* `.class1` - selects all elements belonging to class1

##### Selectors with CSS
```
from scrapy import Selector

html = '''<html> ...... </html>'''
sel = Selector(text = html)

sel.css("div > p")

sel.css("div > p").extract()
```

##### CSS Wildcard
* `*` asterisk - ignores the tag type
* `*` selects all elements in the HTML document
* `*.class-1` or `.class-1` selects all elements belonging to class-1
* `*#uid` or `#uid` selects the element with id attribute equal to uid

#### Attribute and Text Selection
* Using XPath: `<xpath-to-element>/@attr-name`
    * `xpath = '//div[@id="uid"]/a/@href'`
* Using CSS Locator: `<css-to-element>::attr(attr-name)`
    * `css_locator = 'div#uid > a::attr(href)'`

##### Text Extraction - text() method
* extracts the text directly inside the element, excluding text in future generations
    * xpath: `sel.xpath('//p[@id="p-example"]/text()').extract()`
    * css: `sel.css('p#p-example::text').extract()`
* extracts the text inside the element and in all future generations (note the space before `::text` in the CSS version)
    * xpath: `sel.xpath('//p[@id="p-example"]//text()').extract()`
    * css: `sel.css('p#p-example ::text').extract()`
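
The distinction between an element's own text and the text of its future generations can be sketched with the standard library's `xml.etree.ElementTree` (a stand-in, used here so the example runs without Scrapy). The `<p>` element below is a hypothetical example.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p id="p-example">Hello <b>big</b> world!</p>')

# Like '/text()': only text that sits directly inside <p>.
# ElementTree stores it as p.text plus the "tail" that follows each child.
direct_text = [t for t in [p.text] + [child.tail for child in p] if t]

# Like '//text()': text of <p> and of every descendant, in document order
all_text = list(p.itertext())

print(direct_text)
print(all_text)
```

The word inside `<b>` appears only in the second list, because it belongs to a child element rather than to `<p>` itself.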
    
#### Response
* Response has all the tools we used with Selectors
    * xpath and css methods
    * extract and extract_first methods
* Response also keeps track of the URL where HTML code was loaded from
* Response helps us move from one site to another, so that we can "crawl" the web while scraping
* Example:
    * xpath: `response.xpath('//div/span[@class="bio"]')`
    * css: `response.css('div > span.bio')`
    * chaining: `response.xpath('//div').css('span.bio')`
    * data extraction:
        * `response.xpath('//div').css('span.bio').extract()`
        * `response.xpath('//div').css('span.bio').extract_first()`
* `response` keeps track of the URL in its `url` attribute: `response.url`
* `response` lets us "follow" a new link with the `follow()` method: `response.follow(next_url)`
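
One reason it matters that the response remembers its URL: `href` values on a page are often relative, and `response.follow` resolves them against `response.url`. The sketch below shows that kind of resolution with the standard library's `urljoin`; the `hrefs` values are hypothetical, and this illustrates the idea rather than reproducing Scrapy's own code.

```python
from urllib.parse import urljoin

# The page the links were found on (the role played by response.url)
page_url = 'https://www.datacamp.com/courses/all'

# Hypothetical href values: root-relative, relative, and already absolute
hrefs = [
    '/courses/hypothetical-course',
    'hypothetical-course',
    'https://www.example.com/other',
]

absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
```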

## Scraping
```
# Page being scraped: https://www.datacamp.com/courses/all
course_divs = response.css('div.course-block')
print(len(course_divs))
```

##### Inspecting course-block
```
first_div = course_divs[0]
children = first_div.xpath('./*')
print(len(children))
```

```
# first child
first_child = children[0]
print(first_child.extract())

# second child
second_child = children[1]
print(second_child.extract())

# third child
third_child = children[2]
print(third_child.extract())
```

##### Listful
* CSS: `links = response.css('div.course-block > a::attr(href)').extract()`

```
# step1: course blocks
course_divs = response.css('div.course-block')

# step2: hyperlink elements
hrefs = course_divs.xpath('./a/@href')

# step3: extract the links
links = hrefs.extract()
for l in links:
    print(l)
```

## Spider

#### A Classy Spider
```
import scrapy
from scrapy.crawler import CrawlerProcess

class SpiderClassName(scrapy.Spider):
    name = "spider_name"
    # the code for your spider
    ...
# Running the spider
process = CrawlerProcess()  # initiate a Crawler Process
process.crawl(SpiderClassName)  # tell the process which spider to use
process.start()  # start the crawling process
```

#### Weaving the Web
```
class DCspider(scrapy.Spider):
    name = 'dc_spider'
    def start_requests(self):
        urls = ['https://www.datacamp.com/courses/all']
        for url in urls:
            yield scrapy.Request(url = url, callback = self.parse)
    def parse(self, response):
        # simple example: write out the html
        html_file = 'DC_courses.html'
        with open(html_file, 'wb') as fout:
            fout.write(response.body)
```
* Need to have a method called `start_requests`
* Need to have at least one parser function to handle the HTML code

#### Start Requests - Request for Service
```
def start_requests(self):
    urls = ['https://www.datacamp.com/courses/all']
    for url in urls:
        yield scrapy.Request(url = url, callback = self.parse)
```
```
def start_requests(self):
    url = 'https://www.datacamp.com/courses/all'
    yield scrapy.Request(url = url, callback = self.parse)
```
* scrapy.Request will fill in a response variable
* url argument tells which site to scrape
* callback argument tells where to send the response variable for processing
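
The control flow in the bullets above (yield a request, then hand the resulting response to the callback) can be sketched without Scrapy or network access. `FakeRequest`, `FakeResponse`, `ToySpider`, and `run` are all hypothetical stand-ins; Scrapy's real engine schedules and downloads requests asynchronously, which this toy driver does not attempt to model.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus the callback to run."""
    url: str
    callback: Callable


@dataclass
class FakeResponse:
    """Hypothetical stand-in for the response object handed to the callback."""
    url: str
    body: bytes


class ToySpider:
    def __init__(self):
        self.parsed_urls = []

    def start_requests(self) -> Iterator[FakeRequest]:
        urls = ['https://www.datacamp.com/courses/all']
        for url in urls:
            yield FakeRequest(url=url, callback=self.parse)

    def parse(self, response: FakeResponse) -> None:
        # A real parser would extract data here; this one only records the URL
        self.parsed_urls.append(response.url)


def run(spider: ToySpider) -> None:
    """Toy driver: build a canned response for each request and call its callback."""
    for request in spider.start_requests():
        response = FakeResponse(url=request.url, body=b'<html></html>')  # no network
        request.callback(response)


spider = ToySpider()
run(spider)
print(spider.parsed_urls)
```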

#### Parse and Crawl
```
def parse(self, response):
    # input parsing code with response that you already know!
    # output to a file
    # crawl the web
    pass

# Variant 1: write the extracted links to a file
def parse(self, response):
    links = response.css('div.course-block > a::attr(href)').extract()
    filepath = 'DC_links.csv'
    with open(filepath, 'w') as f:
        f.writelines([link + '\n' for link in links])

# Variant 2: follow each link and hand the response to a second parser
def parse(self, response):
    links = response.css('div.course-block > a::attr(href)').extract()
    for link in links:
        yield response.follow(url = link, callback = self.parse2)

def parse2(self, response):
    # parse the course sites here
    pass
```
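
The file-writing variant of `parse` depends on each link ending in the newline character `'\n'`, since `writelines` adds no separators of its own. A minimal sketch of just that step, using hypothetical links and a temporary directory so that it leaves no files behind:

```python
import os
import tempfile

# Hypothetical links, standing in for the result of .extract()
links = [
    'https://www.datacamp.com/courses/hypothetical-course-a',
    'https://www.datacamp.com/courses/hypothetical-course-b',
]

with tempfile.TemporaryDirectory() as tmp_dir:
    filepath = os.path.join(tmp_dir, 'DC_links.csv')

    # One link per line: writelines writes the strings exactly as given
    with open(filepath, 'w') as f:
        f.writelines([link + '\n' for link in links])

    # Read the file back to confirm the round trip
    with open(filepath) as f:
        read_back = f.read().splitlines()

print(read_back == links)
```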

#### Inspecting Elements
```
import scrapy
from scrapy.crawler import CrawlerProcess

class DC_Chapter_Spider(scrapy.Spider):
    name = "dc_chapter_spider"
    
    def start_requests(self):
        url = 'https://www.datacamp.com/courses/all'
        yield scrapy.Request(url = url, callback = self.parse_front)
        
    def parse_front(self, response):
        # code to parse the front courses page
        pass

    def parse_pages(self, response):
        # code to parse course pages
        # fill in dc_dict here
        pass

dc_dict = dict()

process = CrawlerProcess()
process.crawl(DC_Chapter_Spider)
process.start()
```

#### Parsing Front Page
```
def parse_front(self, response):
    # Narrow in on the course blocks
    course_blocks = response.css('div.course-block')
    
    # Direct to the course links
    course_links = course_blocks.xpath('./a/@href')
    
    # Extract the links (as a list of strings)
    links_to_follow = course_links.extract()
    
    # Follow the links to the next parser
    for url in links_to_follow:
        yield response.follow(url = url, callback = self.parse_pages)
```

#### Parsing the Course Pages
```
def parse_pages(self, response):
    # Direct to the course title text
    crs_title = response.xpath('//h1[contains(@class, "title")]/text()')
    
    # Extract and clean the course title text
    crs_title_ext = crs_title.extract_first().strip()
    
    # Direct to the chapter titles text
    ch_titles = response.css('h4.chapter--title::text')
    
    # Extract and clean the chapter titles text
    ch_titles_ext = [t.strip() for t in ch_titles.extract()]
    
    # Store this in a dictionary
    dc_dict[crs_title_ext] = ch_titles_ext
```
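
The cleaning and storing steps can be tried on their own with plain strings. In the sketch below, `store_course` is a hypothetical helper and the raw strings are made-up stand-ins for what `extract_first()` and `extract()` would return. The `None` check is an addition: `extract_first()` returns `None` when nothing matches, and calling `.strip()` on `None` would raise an `AttributeError`.

```python
dc_dict = dict()


def store_course(raw_title, raw_chapter_titles):
    """Clean one course's extracted strings and store them in dc_dict."""
    if raw_title is None:
        return  # no title matched on this page, so skip it
    crs_title_ext = raw_title.strip()
    ch_titles_ext = [t.strip() for t in raw_chapter_titles]
    dc_dict[crs_title_ext] = ch_titles_ext


# Hypothetical raw values with the stray whitespace typical of scraped text
store_course('\n    Hypothetical Course\n  ', ['\n  Chapter One ', ' Chapter Two\n'])
store_course(None, [])

print(dc_dict)
```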