### Collecting data from the Playnomore site
- `http://playnomore.co.kr/`
- Use fake_useragent with scrapy
- Pass arguments when running scrapy
- Save the data to a database from pipelines

In [1]:
import scrapy
import requests
from scrapy.http import TextResponse
import pandas as pd

### 1. Create the project

In [2]:
!rm -rf playnomore # remove the project if it already exists
!scrapy startproject playnomore

New Scrapy project 'playnomore', using template directory '/home/ubuntu/.pyenv/versions/3.6.9/envs/python3/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /home/ubuntu/python3/notebook/fastcampus_1학기_수업내용/06_scrapy/playnomore

You can start your first spider with:
    cd playnomore
    scrapy genspider example example.com


### 2. items.py
- title, price, img, link

In [None]:
# Load items.py.
# %load playnomore/playnomore/items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class PlaynomoreItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


- `%`: line magic; applies to a single line
- `%%`: cell magic; applies to the entire cell

In [3]:
%%writefile playnomore/playnomore/items.py
import scrapy


class PlaynomoreItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    img = scrapy.Field()
    link = scrapy.Field()

Overwriting playnomore/playnomore/items.py


### 3. Check the XPath expressions
- Links
- Link -> detail page (title, image URL, price)
- Install fake_useragent
    - `pip install fake_useragent`

In [4]:
url = "http://playnomore.co.kr/category/bag/24/"
req = requests.get(url)
req
# <Response [403]>: the site checks the User-Agent header and blocks the default one.

<Response [403]>

In [None]:
!pip install fake_useragent

In [6]:
from fake_useragent import UserAgent
url = "http://playnomore.co.kr/category/bag/24/"
headers = {"User-Agent" : UserAgent().chrome}
req = requests.get(url, headers=headers)
req # <Response [200]>: the request succeeded
response = TextResponse(req.url, body=req.text, encoding="utf-8")
response

<200 http://playnomore.co.kr/category/bag/24/>

#### Detail page links

In [13]:
links = response.xpath('//*[@id="contents"]/div[2]/div/ul/li/div/a/@href').extract()

# response.urljoin(links[0]) # the hrefs have no domain, so urljoin adds it

links = list(map(response.urljoin, links)) # absolute links for one page
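`response.urljoin` resolves each href against the URL of the page it came from, much like the standard library's `urllib.parse.urljoin` (Scrapy additionally honors an HTML `<base>` tag). The following is a minimal sketch of that resolution step; the two hrefs are hypothetical values shaped like the ones the XPath returns.

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the cell above).
page_url = "http://playnomore.co.kr/category/bag/24/"

# Hypothetical hrefs: root-relative paths without a scheme or domain.
hrefs = [
    "/product/example-item/1/?cate_no=24&display_group=1",
    "/product/another-item/2/?cate_no=24&display_group=1",
]

# Resolve every href against the page URL to get absolute links.
absolute_links = [urljoin(page_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```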

#### Detail page: get the title, price, and image URL

In [15]:
url = links[0]
headers = {"User-Agent" : UserAgent().chrome}
req = requests.get(url, headers=headers)
response = TextResponse(req.url, body=req.text, encoding="utf-8")
response

<200 http://playnomore.co.kr/product/black-play-day-10-micro-baguette-grey-python-180/547/?cate_no=24&display_group=1>

#### The title must be extracted in two parts
- `text()` returns only an element's direct text nodes, not the text of its child elements.

In [18]:
title1 = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[1]/font/text()').extract()
title2 = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[1]/text()').extract()

title = "".join(title1) + "".join(title2)
title

'[Black Play-Day 10%] MICRO BAGUETTE grey python '
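The split happens because part of the title sits inside a child `<font>` element. The sketch below reproduces the situation with the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors; the markup is an assumed simplification of the real page, and ElementTree's `text`/`tail` model differs from XPath text nodes, so treat it as an illustration of the idea only.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup: the prefix is wrapped in <font>,
# the rest of the title follows it as plain text.
markup = '<div><font>[Black Play-Day 10%] </font>MICRO BAGUETTE grey python </div>'
div = ET.fromstring(markup)

font = div.find("font")
prefix = font.text   # text inside the child element
rest = font.tail     # text that follows the child element

# itertext() walks the element and all descendants in document order,
# so one call collects the whole title.
full_title = "".join(div.itertext())

print(repr(prefix))
print(repr(rest))
print(repr(full_title))
```

In Scrapy, wrapping the path in the XPath `string()` function is a likely one-step alternative, since `string()` concatenates all descendant text nodes.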

#### Get the price

In [19]:
price = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[2]/text()').extract()[0]
price

'$ 162'
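The price arrives as a string such as `'$ 162'`. If a numeric value is needed later (for sorting or filtering), it could be converted with a small helper like the one below. `parse_price` is a hypothetical name, not part of the notebook, and the sketch assumes whole-number prices with an optional thousands separator.

```python
import re


def parse_price(raw):
    """Return the integer amount in a string such as '$ 162' (hypothetical helper)."""
    # Grab the first run of digits, allowing commas as thousands separators.
    match = re.search(r"\d[\d,]*", raw)
    if match is None:
        raise ValueError(f"no digits found in {raw!r}")
    return int(match.group().replace(",", ""))


print(parse_price("$ 162"))
```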

#### Get the image URL

In [21]:
img = "http:" + response.xpath('//*[@id="contents"]/div[1]/div[1]/div[1]/div[1]/img/@src').extract()[0]
img

'http://playnomore.co.kr/web/product/big/201910/596078374708bbb81ffd629a8cf88950.jpg'
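Prepending `"http:"` works because the `src` attribute is scheme-relative (it starts with `//`). As a hedged alternative, `urllib.parse.urljoin` copies the scheme from the page URL, so the prefix does not have to be hard-coded. The `src` value below is inferred from the output above by removing the `http:` prefix.

```python
from urllib.parse import urljoin

# Detail page URL from the earlier cell.
page_url = (
    "http://playnomore.co.kr/product/"
    "black-play-day-10-micro-baguette-grey-python-180/547/"
    "?cate_no=24&display_group=1"
)

# Scheme-relative src, inferred from the printed image URL.
src = "//playnomore.co.kr/web/product/big/201910/596078374708bbb81ffd629a8cf88950.jpg"

# urljoin fills in the missing scheme from page_url.
img_url = urljoin(page_url, src)
print(img_url)
```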

### 4. spider.py
- Install scrapy-fake-useragent
    - `pip install scrapy-fake-useragent`

In [22]:
!pip install scrapy-fake-useragent

Collecting scrapy-fake-useragent
  Downloading https://files.pythonhosted.org/packages/0d/1b/c9e5f1917c87a09787271933dcd4bface93ee60248a6a85cc98f2fd58e2c/scrapy_fake_useragent-1.1.0-py2.py3-none-any.whl
Installing collected packages: scrapy-fake-useragent
Successfully installed scrapy-fake-useragent-1.1.0
You are using pip version 18.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [23]:
!pip list | grep fake

fake-useragent        0.1.11   
scrapy-fake-useragent 1.1.0


In [40]:
%%writefile playnomore/playnomore/spiders/spider.py
import scrapy
from playnomore.items import PlaynomoreItem


class PlaynomoreSpider(scrapy.Spider):
    # Spider name; `scrapy crawl` refers to the spider by this name.
    name = "Playnomore"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
        }
    }
    
    def start_requests(self):
        url = "http://playnomore.co.kr/category/bag/24/"
        yield scrapy.Request(url, callback=self.parse)
        
    def parse(self, response):
        links = response.xpath('//*[@id="contents"]/div[2]/div/ul/li/div/a/@href').extract()
        links = list(map(response.urljoin, links))
        for link in links:
            yield scrapy.Request(link, callback=self.page_parse)
                         
    # Build an item object from the detail page data.
    def page_parse(self, response):
        item = PlaynomoreItem()
        title1 = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[1]/font/text()').extract()
        title2 = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[1]/text()').extract()
        item["title"] = "".join(title1) + "".join(title2)
        item["price"] = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[2]/text()').extract()[0]
        item["img"] = "http:" + response.xpath('//*[@id="contents"]/div[1]/div[1]/div[1]/div[1]/img/@src').extract()[0]
        item["link"] = response.url
        yield item

Overwriting playnomore/playnomore/spiders/spider.py
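The `custom_settings` block disables Scrapy's built-in `UserAgentMiddleware` (value `None`) and enables `RandomUserAgentMiddleware` at priority 400. Conceptually, such a middleware rewrites the `User-Agent` header of each outgoing request. The class below is a hand-rolled sketch of that idea, not the code of scrapy-fake-useragent; `RandomUserAgentSketch`, `FakeRequest`, and the user-agent strings are all illustrative assumptions, and `FakeRequest` only exists so the sketch runs without Scrapy.

```python
import random


class RandomUserAgentSketch:
    """Illustrative downloader middleware that picks a User-Agent per request."""

    def __init__(self, user_agents, rng=None):
        self.user_agents = list(user_agents)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Returning None tells Scrapy to continue handling the request.
        return None


class FakeRequest:
    """Stand-in for scrapy.Request: just a URL and a headers mapping."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


# Illustrative pool; a real run would take these from fake_useragent.
UA_POOL = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/78.0.3904.108 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0",
]

middleware = RandomUserAgentSketch(UA_POOL, rng=random.Random(0))
request = FakeRequest("http://playnomore.co.kr/category/bag/24/")
middleware.process_request(request, spider=None)
print(request.headers["User-Agent"])
```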


In [41]:
%%writefile run.sh
cd playnomore
scrapy crawl Playnomore -o playnomore.csv

Overwriting run.sh


In [43]:
!chmod +x run.sh

In [44]:
!./run.sh

2019-11-28 05:24:27 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: playnomore)
2019-11-28 05:24:27 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Oct 24 2019, 05:23:48) - [GCC 7.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-1052-aws-x86_64-with-debian-buster-sid
2019-11-28 05:24:27 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'playnomore', 'FEED_FORMAT': 'csv', 'FEED_URI': 'playnomore.csv', 'NEWSPIDER_MODULE': 'playnomore.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['playnomore.spiders']}
2019-11-28 05:24:27 [scrapy.extensions.telnet] INFO: Telnet Password: b0c82527e89468e1
2019-11-28 05:24:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'sc

2019-11-28 05:24:28 [scrapy.core.scraper] DEBUG: Scraped from <200 http://playnomore.co.kr/product/black-play-day-10-micro-moon-olive-180/542/?cate_no=24&display_group=1>
{'img': 'http://playnomore.co.kr/web/product/big/201909/009a912866af7785c74acb8fd657fa5d.jpg',
 'link': 'http://playnomore.co.kr/product/black-play-day-10-micro-moon-olive-180/542/?cate_no=24&display_group=1',
 'price': '$ 162',
 'title': '[Black Play-Day 10%] MICRO MOON olive '}
2019-11-28 05:24:28 [scrapy.core.scraper] DEBUG: Scraped from <200 http://playnomore.co.kr/product/black-play-day-20-pre-order-micro-candy-white-170/507/?cate_no=24&display_group=1>
{'img': 'http://playnomore.co.kr/web/product/big/201905/f3b7377fccdb62d9657441b1eda6d19d.jpg',
 'link': 'http://playnomore.co.kr/product/black-play-day-20-pre-order-micro-candy-white-170/507/?cate_no=24&display_group=1',
 'price': '$ 136',
 'title': '[Black Play-Day 20%] [PRE-ORDER] MICRO CANDY white '}
2019-11-28 05:24:28 [scrapy.core.scraper] DEBUG: Scraped from

### Check the result with pandas

In [45]:
df = pd.read_csv("playnomore/playnomore.csv")
df.tail(2)

Unnamed: 0,img,link,price,title
29,http://playnomore.co.kr/web/product/big/201910...,http://playnomore.co.kr/product/black-play-day...,$ 162,[Black Play-Day 10%] MICRO BAGUETTE cocoa box
30,http://playnomore.co.kr/web/product/big/201910...,http://playnomore.co.kr/product/black-play-day...,$ 162,[Black Play-Day 10%] MICRO BAGUETTE grey python


### 5. Set arguments

In [47]:
%%writefile playnomore/playnomore/spiders/spider.py
import scrapy
from playnomore.items import PlaynomoreItem


class PlaynomoreSpider(scrapy.Spider):
    # Spider name; `scrapy crawl` refers to the spider by this name.
    name = "Playnomore"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
        }
    }
    
    def __init__(self, category1="bag", category2=24, **kwargs):
        self.start_url = "http://playnomore.co.kr/category/{}/{}/".format(category1, category2)
        super().__init__(**kwargs) # call the parent constructor; only the line above is new
    
    def start_requests(self):
        url = self.start_url
        yield scrapy.Request(url, callback=self.parse)
        
    def parse(self, response):
        links = response.xpath('//*[@id="contents"]/div[2]/div/ul/li/div/a/@href').extract()
        links = list(map(response.urljoin, links))
        for link in links:
            yield scrapy.Request(link, callback=self.page_parse)
                         
    # Build an item object from the detail page data.
    def page_parse(self, response):
        item = PlaynomoreItem()
        title1 = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[1]/font/text()').extract()
        title2 = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[1]/text()').extract()
        item["title"] = "".join(title1) + "".join(title2)
        item["price"] = response.xpath('//*[@id="contents"]/div[1]/div[1]/div[2]/div[2]/text()').extract()[0]
        item["img"] = "http:" + response.xpath('//*[@id="contents"]/div[1]/div[1]/div[1]/div[1]/img/@src').extract()[0]
        item["link"] = response.url
        yield item

Overwriting playnomore/playnomore/spiders/spider.py
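Values passed with `-a` reach the spider's `__init__` as keyword arguments, and they arrive as strings. The URL construction can be checked in isolation with a plain function that mirrors the `format` call in the spider; `build_start_url` is a hypothetical name used only for this sketch.

```python
def build_start_url(category1="bag", category2=24):
    """Mirror the URL construction in the spider's __init__ (hypothetical helper)."""
    return "http://playnomore.co.kr/category/{}/{}/".format(category1, category2)


# Defaults reproduce the bag category crawled earlier.
print(build_start_url())

# Command-line arguments arrive as strings, so "25" is passed as text here.
print(build_start_url("shoes", "25"))
```

Because `str.format` accepts both integers and strings, the default `24` and a command-line `"25"` produce URLs of the same shape.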


#### Collect shoe data
- shoes, 25

In [50]:
%%writefile run.sh
cd playnomore
scrapy crawl Playnomore -o playnomore2.csv -a category1=shoes -a category2=25

Overwriting run.sh


In [51]:
!./run.sh

2019-11-28 05:32:30 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: playnomore)
2019-11-28 05:32:30 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Oct 24 2019, 05:23:48) - [GCC 7.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-1052-aws-x86_64-with-debian-buster-sid
2019-11-28 05:32:30 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'playnomore', 'FEED_FORMAT': 'csv', 'FEED_URI': 'playnomore2.csv', 'NEWSPIDER_MODULE': 'playnomore.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['playnomore.spiders']}
2019-11-28 05:32:30 [scrapy.extensions.telnet] INFO: Telnet Password: ac3d39ac7c5dfd5c
2019-11-28 05:32:30 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 's

2019-11-28 05:32:32 [scrapy.core.scraper] DEBUG: Scraped from <200 http://playnomore.co.kr/product/sold-out-winkygirl-color-blocks-5cm-rose-gold/18/?cate_no=25&display_group=1>
{'img': 'http://playnomore.co.kr/web/product/big/201702/18_shop7_183931.jpg',
 'link': 'http://playnomore.co.kr/product/sold-out-winkygirl-color-blocks-5cm-rose-gold/18/?cate_no=25&display_group=1',
 'price': '$ 414',
 'title': '[SOLD OUT] WINKYGIRL COLOR BLOCKS (5cm) rose gold'}
2019-11-28 05:32:32 [scrapy.core.scraper] DEBUG: Scraped from <200 http://playnomore.co.kr/product/sold-out-shygirl-oxford-chrome/207/?cate_no=25&display_group=1>
{'img': 'http://playnomore.co.kr/web/product/big/201702/207_shop7_312513.jpg',
 'link': 'http://playnomore.co.kr/product/sold-out-shygirl-oxford-chrome/207/?cate_no=25&display_group=1',
 'price': '$ 450',
 'title': '[SOLD OUT] SHYGIRL oxford chrome'}
2019-11-28 05:32:32 [scrapy.core.engine] INFO: Closing spider (finished)
2019-11-28 05:32:32 [scrapy.extensions.feedexport] INFO

In [52]:
df = pd.read_csv("playnomore/playnomore2.csv")
df.tail(2)

Unnamed: 0,img,link,price,title
13,http://playnomore.co.kr/web/product/big/201702...,http://playnomore.co.kr/product/sold-out-winky...,$ 414,[SOLD OUT] WINKYGIRL COLOR BLOCKS (5cm) rose gold
14,http://playnomore.co.kr/web/product/big/201702...,http://playnomore.co.kr/product/sold-out-shygi...,$ 450,[SOLD OUT] SHYGIRL oxford chrome


### 6. Save to MongoDB
- Use pymongo
- Apply pymongo in pipelines.py
- Install pymongo
    - `pip install pymongo==2.8.1`

In [54]:
import pymongo

In [55]:
client = pymongo.MongoClient('mongodb://13.124.41.91:27017/')
client

MongoClient('13.124.41.91', 27017)

In [56]:
db = client.playnomore # database
collection = db.shoes # collection (MongoDB's counterpart to a table)
collection

Collection(Database(MongoClient('13.124.41.91', 27017), 'playnomore'), 'shoes')

In [57]:
# Check that Korean text is stored correctly
data = {"title":"신발"}
collection.insert(data)

ObjectId('5ddf5e67045a62483536dea4')

#### Create a MongoDB module file
- so that the pipeline can import the collection object

In [58]:
%%writefile playnomore/playnomore/mongodb.py
import pymongo

client = pymongo.MongoClient('mongodb://13.124.41.91:27017/')
db = client.playnomore 
collection = db.shoes 

Writing playnomore/playnomore/mongodb.py


In [59]:
!cat playnomore/playnomore/pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class PlaynomorePipeline(object):
    def process_item(self, item, spider):
        return item


In [60]:
%%writefile playnomore/playnomore/pipelines.py
from .mongodb import collection

class PlaynomorePipeline(object):
    
    def process_item(self, item, spider):
        
        data = { "title": item["title"], 
                 "price": item["price"], 
                 "img": item["img"],
                 "link": item["link"],
                }
        collection.insert(data)
        return item

Overwriting playnomore/playnomore/pipelines.py
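The pipeline above relies on `collection.insert`, which matches the pinned pymongo 2.8.1; PyMongo 3 deprecated that method in favor of `insert_one`, and PyMongo 4 removed it. The sketch below shows how the same pipeline might look against a newer PyMongo. It is an assumption-laden dry run: `MongoPipelineSketch` and `FakeCollection` are hypothetical names, and the collection is injected through the constructor only so the example runs without a MongoDB server (in a real project Scrapy instantiates the pipeline itself, so the collection would be supplied via `from_crawler` or `open_spider`).

```python
class MongoPipelineSketch:
    """Illustrative item pipeline that stores each item as one document."""

    def __init__(self, collection):
        # Any object with an insert_one(document) method works here.
        self.collection = collection

    def process_item(self, item, spider):
        # dict(item) copies all fields, so they need not be listed one by one.
        self.collection.insert_one(dict(item))
        # Return the item so later pipelines and feed exports still see it.
        return item


class FakeCollection:
    """Stand-in for a pymongo collection; keeps documents in a list."""

    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        self.documents.append(document)


fake = FakeCollection()
pipeline = MongoPipelineSketch(fake)

# Field values copied from the crawl log in this notebook.
item = {
    "title": "[SOLD OUT] WINKYGIRL COLOR BLOCKS (5cm) multi",
    "price": "$ 414",
    "img": "http://playnomore.co.kr/web/product/big/201702/19_shop7_561442.jpg",
    "link": "http://playnomore.co.kr/product/sold-out-winkygirl-color-blocks-5cm-multi/19/?cate_no=25&display_group=1",
}
returned = pipeline.process_item(item, spider=None)
print(len(fake.documents))
```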


In [61]:
!echo "ITEM_PIPELINES = {" >> playnomore/playnomore/settings.py

In [62]:
!echo "    'playnomore.pipelines.PlaynomorePipeline': 300," >> playnomore/playnomore/settings.py

In [63]:
!echo "}" >> playnomore/playnomore/settings.py

In [68]:
!tail -n 5 playnomore/playnomore/settings.py

#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
ITEM_PIPELINES = {
    'playnomore.pipelines.PlaynomorePipeline': 300,
}
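The integer next to each pipeline in `ITEM_PIPELINES` sets its order: items pass through pipelines from the lowest value to the highest, and values are conventionally kept in the 0-1000 range. The sketch below only illustrates that ordering rule; `PriceCleanupPipeline` is a hypothetical second pipeline that does not exist in this project.

```python
# Same shape as the setting appended to settings.py above.
ITEM_PIPELINES = {
    "playnomore.pipelines.PlaynomorePipeline": 300,
    "playnomore.pipelines.PriceCleanupPipeline": 100,  # hypothetical
}

# Lower values run first, so sorting by value gives the execution order.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in run_order:
    print(ITEM_PIPELINES[path], path)
```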


In [69]:
!cat run.sh

cd playnomore
scrapy crawl Playnomore -o playnomore2.csv -a category1=shoes -a category2=25


In [70]:
!./run.sh

2019-11-28 05:55:28 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: playnomore)
2019-11-28 05:55:28 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Oct 24 2019, 05:23:48) - [GCC 7.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-1052-aws-x86_64-with-debian-buster-sid
2019-11-28 05:55:28 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'playnomore', 'FEED_FORMAT': 'csv', 'FEED_URI': 'playnomore2.csv', 'NEWSPIDER_MODULE': 'playnomore.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['playnomore.spiders']}
2019-11-28 05:55:28 [scrapy.extensions.telnet] INFO: Telnet Password: ccbc6eaf249acb77
2019-11-28 05:55:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 's

2019-11-28 05:55:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://playnomore.co.kr/product/sold-out-shy-friends-oxford-black/181/?cate_no=25&display_group=1> (referer: http://playnomore.co.kr/category/shoes/25/)
2019-11-28 05:55:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://playnomore.co.kr/product/sold-out-winkygirl-color-blocks-metallic-navy-5cm/20/?cate_no=25&display_group=1> (referer: http://playnomore.co.kr/category/shoes/25/)
2019-11-28 05:55:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://playnomore.co.kr/product/sold-out-winkygirl-color-blocks-5cm-metallic-pink/137/?cate_no=25&display_group=1> (referer: http://playnomore.co.kr/category/shoes/25/)
2019-11-28 05:55:29 [scrapy.core.scraper] DEBUG: Scraped from <200 http://playnomore.co.kr/product/sold-out-kiss-me-pink-beige/22/?cate_no=25&display_group=1>
{'img': 'http://playnomore.co.kr/web/product/big/201702/22_shop7_242463.jpg',
 'link': 'http://playnomore.co.kr/product/sold-out-kiss-me-pink-beige/22

2019-11-28 05:55:30 [scrapy.core.scraper] DEBUG: Scraped from <200 http://playnomore.co.kr/product/sold-out-winkygirl-color-blocks-5cm-multi/19/?cate_no=25&display_group=1>
{'img': 'http://playnomore.co.kr/web/product/big/201702/19_shop7_561442.jpg',
 'link': 'http://playnomore.co.kr/product/sold-out-winkygirl-color-blocks-5cm-multi/19/?cate_no=25&display_group=1',
 'price': '$ 414',
 'title': '[SOLD OUT] WINKYGIRL COLOR BLOCKS (5cm) multi'}
2019-11-28 05:55:30 [scrapy.core.engine] INFO: Closing spider (finished)
2019-11-28 05:55:30 [scrapy.extensions.feedexport] INFO: Stored csv feed (15 items) in: playnomore2.csv
2019-11-28 05:55:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 15998,
 'downloader/request_count': 32,
 'downloader/request_method_count/GET': 32,
 'downloader/response_bytes': 303173,
 'downloader/response_count': 32,
 'downloader/response_status_count/200': 17,
 'downloader/response_status_count/301': 15,
 'elapsed_time