Skip to content

lin-zone/scrapyu

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

39 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

scrapyu

Build Status codecov PyPI - Python Version GitHub GitHub stars GitHub forks

UserAgentMiddleware

# settings.py
USERAGENT_TYPE = 'firefox'
DOWNLOADER_MIDDLEWARES = {
   'scrapyu.UserAgentMiddleware': 543,
}

MarkdownPipeline

# settings.py
MARKDOWNS_STORE = 'news'
ITEM_PIPELINES = {
    'scrapyu.MarkdownPipeline': 300,
}
# items.py
import scrapy

class MarkdownItem(scrapy.Item):
    html = scrapy.Field()
    filename = scrapy.Field()

FirefoxCookiesMiddleware

# settings.py
GECKODRIVER_PATH = 'geckodriver'
DOWNLOADER_MIDDLEWARES = {
   'scrapyu.FirefoxCookiesMiddleware': 543,
}

MongoDBPipeline

# settings.py
MONGODB_URI = 'mongodb://localhost:27017'
# or
# MONGODB_HOST = 'localhost'
# MONGODB_PORT = 27017
MONGODB_DATABASE = 'scrapyu'
MONGODB_COLLECTION = 'items'
MONGODB_BUFFER_LENGTH = 100
MONGODB_UNIQUE_KEY = 'title name'       # use only if no buffer
# or
# MONGODB_UNIQUE_KEY = ['title', 'name']
# MONGODB_UNIQUE_KEY = ('title', 'name')
ITEM_PIPELINES = {
    'scrapyu.MongoDBPipeline': 300,
}

RedisDupeFilter

# settings.py
DUPEFILTER_CLASS = 'scrapyu.RedisDupeFilter'
REDIS_DUPE_HOST = 'localhost'
REDIS_DUPE_PORT = 6379
REDIS_DUPE_DATABASE = 0
REDIS_DUPE_PASSWORD = 'password'
REDIS_DUPE_KEY = 'requests'
REDIS_DUPE_IGNORE_URL = r'http://scrapytest.org/\d+'

genspider

scrapyu genspider -l

results in :

Available templates:
  single
  single_splash

generate a single file spider

scrapyu genspider python www.python.org -t single

generate a single file spider, integration splash

scrapyu genspider python www.python.org -t single_splash

About

Scrapy utils

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published