# Web Scraping
- Markov chain generator : https://pypi.python.org/pypi/markovify/0.4.3
- scrapy : https://scrapy.org/
- bs4 : https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- requests : https://pypi.python.org/pypi/requests/2.11.1

## What is a scraper?
- requests library : fetches any endpoint over HTTP; the response is then parsed with something like bs4
- bs4 : a neat way to parse web pages, but it doesn't scale well to large crawls
- scrapy : a framework for writing spiders that go out and collect whatever data you want

- **pip install scrapy markovify**
- scrapy depends on Twisted, lxml, six, cssselect, and zope.interface
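
As a minimal sketch of the requests + bs4 workflow described above, the snippet below parses a hard-coded HTML string instead of a live page, so it runs without network access. The markup and its ids are assumptions modeled on a Craigslist-style posting, not a copy of a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the text of a fetched page (req.text).
SAMPLE_HTML = """
<html><body>
  <span id="titletextonly">Rabbit to a good home</span>
  <section id="postingbody">His name is Basil.</section>
  <ul class="links">
    <li><a href="/pet/1.html">first</a></li>
    <li><a href="/pet/2.html">second</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# find() returns the first matching tag, or None when nothing matches.
title = soup.find(id='titletextonly').text
missing = soup.find(id='does-not-exist')

# select() takes a CSS selector and always returns a list of tags.
hrefs = [a.get('href') for a in soup.select('ul.links > li > a')]

print(title)
print(missing)
print(hrefs)
```

Because `find()` can return `None`, reading `.text` directly is only safe when the element is known to exist.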

In [36]:
from bs4 import BeautifulSoup
import requests
import os
req = requests.get("https://seoul.craigslist.co.kr/pet/d/rabbit-to-good-home/6373149291.html")
print(req.status_code)

200


In [2]:
req.text

'<!DOCTYPE html>\n<html class="no-js">\n<head>\n<title>Rabbit to a good home - pets</title>\n    \t<link rel="canonical" href="https://seoul.craigslist.co.kr/pet/d/rabbit-to-good-home/6373149291.html">\n\t<meta name="description" content="I\'m looking for a good home for my rabbit. His name is Basil. He would be perfect for anyone with a child or anyone who just wants an easy pet to have around. He\'s potty trained and will come with...">\n\t<meta name="robots" content="noarchive,nofollow,unavailable_after: 19-Dec-17 18:35:25 KST">\n\t<meta name="twitter:card" content="preview">\n\t<meta property="og:description" content="I\'m looking for a good home for my rabbit. His name is Basil. He would be perfect for anyone with a child or anyone who just wants an easy pet to have around. He\'s potty trained and will come with...">\n\t<meta property="og:image" content="https://images.craigslist.org/00L0L_fg0unpXytzp_600x450.jpg">\n\t<meta property="og:site_name" content="craigslist">\n\t<meta pr

In [51]:
from bs4 import BeautifulSoup
import markovify
import requests
import os

urls = [
    'https://newyork.craigslist.org/brk/pet/d/female-roller-pigeon-6/6411589358.html',
    'https://newyork.craigslist.org/wch/pet/d/kitten-made-of-love/6411471250.html',
    'https://newyork.craigslist.org/wch/pet/d/posey-delightful-bundle/6411507144.html',
    'https://newyork.craigslist.org/que/pet/d/looking-to-adopt-loving-adult/6411496297.html',
    'https://newyork.craigslist.org/brx/pet/d/adorable-yorkie-puppies/6411459357.html',
    'https://newyork.craigslist.org/brk/pet/d/dog-walker-dog-sitter/6385375915.html'
    
]


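def process_post(url):
    """Fetch one posting and return (title, body); raise ValueError on failure."""
    req = requests.get(url)
    if req.status_code != 200:
        raise ValueError('unexpected status {} for {}'.format(req.status_code, url))
    soup = BeautifulSoup(req.text, 'html.parser')
    title_tag = soup.find(id='titletextonly')
    body_tag = soup.find(id='postingbody')
    # find() returns None when the element is missing, so check the tags
    # before reading .text (otherwise an AttributeError is raised instead).
    if title_tag is None or body_tag is None:
        raise ValueError('missing title or body in {}'.format(url))
    return (title_tag.text, body_tag.text)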


def save(post_title, post_body):
    with open('titles.txt', 'a') as title_file:
        title_file.write(post_title + '\n')
    with open('bodies.txt', 'a') as body_file:
        body_file.write(post_body + '\n')    
        

def generate(which_file):
    files = {
        'body': 'bodies.txt',
        'title': 'titles.txt'
    }
    with open(files[which_file]) as f:
        text = f.read()
    text_model = markovify.Text(text, state_size=3)
    return text_model.make_sentence(tries=50)
    

def main():
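    # Remove each output file independently so that a missing titles.txt
    # does not leave a stale bodies.txt behind.
    for path in ('titles.txt', 'bodies.txt'):
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass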
   
    first_page = requests.get('https://newyork.craigslist.org/d/pets/search/pet')
    soup = BeautifulSoup(first_page.text, 'html.parser')
    links = soup.select('#sortable-results > ul > li > p > a')
    urls = [link.get('href') for link in links]
       
    for url in urls:
        try:
            title, body = process_post(url)
        except ValueError:
            pass
        else:
            save(title, body)
    #         print('title : {}\n\n body : {}\n\n'.format(title, body))
        
        
if __name__ == '__main__':
    main()
    print(generate('title'))
    print(generate('body'))

None
Please be the lucky person who adopts him.
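
The `None` printed for the title is what markovify's `make_sentence` returns when none of its attempts yields an acceptable sentence; by default it rejects output that overlaps the source text too heavily. A likely cause here is that a handful of short titles with `state_size=3` leaves the chain no room to branch. The following is a simplified standard-library sketch of a word-level Markov chain, not markovify's implementation; `build_chain`, `generate_line`, and the `TITLES` list are names introduced for illustration.

```python
import random
from collections import defaultdict

# Titles copied from the titles.txt output shown in this notebook.
TITLES = [
    'Female Roller Pigeon - $6',
    'A Kitten Made Of Love',
    'Posey - A Delightful Bundle Wrapped Up In A Cat',
    'Looking to Adopt Loving Adult Cat',
    'Adorable Yorkie Puppies!',
    'Dog walker / dog sitter',
]


def build_chain(lines, state_size=1):
    """Map each run of `state_size` words to the words seen right after it."""
    chain = defaultdict(list)
    for line in lines:
        # Pad with None so the chain knows where a line starts and ends.
        words = [None] * state_size + line.split() + [None]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain


def generate_line(chain, state_size=1, rng=random, max_words=50):
    """Walk the chain from the start state until the end marker is drawn."""
    state = (None,) * state_size
    out = []
    while len(out) < max_words:
        word = rng.choice(chain[state])
        if word is None:  # end-of-line marker
            break
        out.append(word)
        state = state[1:] + (word,)
    return ' '.join(out)


# With three words of context, no state in this corpus has more than one
# successor (apart from the start state), so every walk replays a title.
chain3 = build_chain(TITLES, state_size=3)
copies = {generate_line(chain3, 3, random.Random(seed)) for seed in range(20)}
print(copies <= set(TITLES))
```

With a corpus this small, lowering `state_size` or collecting more posts gives the chain somewhere to branch. markovify also provides `markovify.NewlineText` for newline-delimited input such as `titles.txt`.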


In [None]:
#sortable-results > ul > li > p > a

In [34]:
!cat titles.txt

Female Roller Pigeon - $6
A Kitten Made Of Love
Posey - A Delightful Bundle Wrapped Up In A Cat
Looking to Adopt Loving Adult Cat
Adorable Yorkie Puppies!
Dog walker / dog sitter


In [35]:
!cat bodies.txt



QR Code Link to This Post


Female. Cage not included email with any questions.    


QR Code Link to This Post


If there was a recipe for the perfect kitten, it would be called Pepper.  He is everyone's idea of the most playful, the most affectionate and the most adorable kitten that could every be.  If you adopt this bundle of joy, he will be a wonderful companion as well as a friend to any other family cat.  He is FLV/FIV negative and will recently neutered.  He is all set to walk into your loving home.  If interest, please call  show contact info
 (landline), cell #  show contact info
.



QR Code Link to This Post


How would you describe the perfect cat? She would have to be really sweet and have a great disposition.  She would have to be playful and loving and really affectionate.  She would have to sit on your lap and purr with joy at just being with you.  And maybe if your are lucky, she would also be beautiful.  Well, you have found her.  Would you real
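
The bodies above still contain text from page widgets ("QR Code Link to This Post", "show contact info"), which would end up in the Markov model. One possible cleanup step is sketched below; `clean_body` and the `RESIDUE` list are hypothetical names, and the list only covers the two phrases visible in this output.

```python
import re

# Phrases seen leaking into the scraped bodies above (assumed to be complete
# only for this sample output).
RESIDUE = ('QR Code Link to This Post', 'show contact info')


def clean_body(text):
    """Remove widget residue and collapse whitespace in a scraped body."""
    for phrase in RESIDUE:
        text = text.replace(phrase, ' ')
    # Collapse runs of spaces and newlines into single spaces.
    return re.sub(r'\s+', ' ', text).strip()


raw = '\n\nQR Code Link to This Post\n\n\nFemale. Cage not included email with any questions.    \n'
print(clean_body(raw))
```

Calling `save(title, clean_body(body))` in `main()` would keep this residue out of `bodies.txt`.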

In [38]:
!ls

assets					 python_for_file_system.ipynb
live_coding_web_scraping_20171204.ipynb  scraper.py
project1				 tempdir1
python_for_file_system_20171204.ipynb


In [61]:
import scrapy 
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from pet_item import Pet

class PetSpider(scrapy.Spider):
    name = 'petspider'
    start_urls = ['https://newyork.craigslist.org/d/pets/search/pet']
    
    def parse(self, response):
#         links = LinkExtractor(
#             restrict_css=('#sortable-results > ul > li > p')
#         ).extract_links(response)
#         print(links)
        next_page = response.css('.rows .row a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
#         for link in response.css('#sortable-results > ul > li > p'):
#             yield {'href': link.css('a::attr(href)').extract_first()}
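
`response.urljoin` resolves a possibly relative `href` against the URL of the page it came from. A rough standard-library counterpart, useful in the requests/bs4 version where `link.get('href')` may return relative paths, is `urllib.parse.urljoin`. The `href` values below are made up for illustration.

```python
from urllib.parse import urljoin

BASE = 'https://newyork.craigslist.org/d/pets/search/pet'

# Hypothetical href values as they might appear in anchor tags.
hrefs = [
    '/brk/pet/d/example/123.html',     # root-relative path
    'https://example.com/other.html',  # already absolute, left unchanged
    'page2.html',                      # relative to the current path
]

absolute = [urljoin(BASE, href) for href in hrefs]
for url in absolute:
    print(url)
```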
        

In [None]:
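from scrapy.exceptions import DropItem

class PetPipeline(object):
    def process_item(self, item, spider):
        # Assumed intent (the original cell was left unfinished):
        # keep items that have a body and drop the rest.
        if item['body']:
            return item
        raise DropItem('missing body')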

In [62]:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)