Welcome to Scrapy-Arweave

Scrapy is a popular open-source, collaborative Python framework for extracting the data you need from websites. scrapy-arweave provides Scrapy pipelines and feed exports for storing items on Arweave.

🏠 Homepage

Install

pip install scrapy-arweave

Examples

Usage

Install scrapy-arweave and some additional requirements.

pip install scrapy-arweave

It also depends on libmagic, which must be installed separately:

Debian/Ubuntu

sudo apt-get install libmagic1

Windows

pip install python-magic-bin

OSX

When using Homebrew: brew install libmagic
When using MacPorts: port install file

Add 'scrapy_arweave.pipelines.ImagesPipeline' and/or 'scrapy_arweave.pipelines.FilesPipeline' to the ITEM_PIPELINES setting in your Scrapy project if you need to store images or other files on Arweave. For the Images Pipeline, use:

ITEM_PIPELINES = {'scrapy_arweave.pipelines.ImagesPipeline': 1}

For the Files Pipeline, use:

ITEM_PIPELINES = {'scrapy_arweave.pipelines.FilesPipeline': 1}

The advantage of using the ImagesPipeline for image files is that you can configure extra features such as generating thumbnails and filtering images by size.
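
As a rough sketch of those extra features: Scrapy's stock ImagesPipeline reads the settings below. This assumes scrapy_arweave's ImagesPipeline honors them unchanged, which this README does not confirm. The thumbnail names and all sizes are illustrative.

```python
# settings.py (sketch): options read by Scrapy's standard ImagesPipeline.
# Assumption: scrapy_arweave.pipelines.ImagesPipeline honors them as well.

# Generate two thumbnails per image; names and sizes are illustrative.
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}

# Skip images smaller than 110x110 pixels.
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```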

You can also use both the Files and Images Pipelines at the same time:

ITEM_PIPELINES = {
    'scrapy_arweave.pipelines.ImagesPipeline': 0,
    'scrapy_arweave.pipelines.FilesPipeline': 1,
}

If you are using the ImagesPipeline, make sure to install the Pillow package. The Images Pipeline requires Pillow 7.1.0 or greater, which it uses for thumbnailing and for normalizing images to JPEG/RGB format.

pip install pillow
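
To confirm that the installed Pillow meets the 7.1.0 minimum, a small check like the following may help. It is only a sketch: the helper names `parse_version` and `pillow_is_recent_enough` are hypothetical and not part of scrapy-arweave.

```python
from importlib.metadata import PackageNotFoundError, version

MINIMUM_PILLOW = (7, 1, 0)  # minimum stated above


def parse_version(text):
    """Turn '9.5.0' into (9, 5, 0), keeping only each part's leading digits."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # treat '7.1' as '7.1.0'
    return tuple(parts)


def pillow_is_recent_enough(installed=None):
    """Return True if the given (or installed) Pillow version is new enough."""
    if installed is None:
        try:
            installed = version("Pillow")
        except PackageNotFoundError:
            return False  # Pillow is not installed at all
    return parse_version(installed) >= MINIMUM_PILLOW
```

Calling `pillow_is_recent_enough()` with no argument inspects the current environment. Passing a version string makes the comparison easy to check in isolation.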

Then, configure the target storage setting to a valid value that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.

Add the Arweave store path for files or images as required.

# For ImagesPipeline
IMAGES_STORE = 'ar://images'

# For FilesPipeline
FILES_STORE = 'ar://files'

For more information about ImagesPipeline and FilesPipeline, see the Scrapy documentation on downloading and processing files and images.
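
Scrapy's standard media pipelines look for download URLs in an item's `file_urls` and `image_urls` fields. Assuming the scrapy_arweave pipelines keep those field names (not confirmed here), a spider callback could build its items along these lines. `build_item` and `IMAGE_EXTENSIONS` are hypothetical names used only for illustration.

```python
from urllib.parse import urljoin

# Illustrative list of extensions routed to the ImagesPipeline.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif")


def build_item(page_url, links):
    """Split a page's links into image and file URLs for the media pipelines."""
    # Resolve relative links against the page they were found on.
    absolute = [urljoin(page_url, link) for link in links]
    image_urls = [u for u in absolute if u.lower().endswith(IMAGE_EXTENSIONS)]
    file_urls = [u for u in absolute if u not in image_urls]
    # Scrapy accepts plain dicts as items.
    return {"page": page_url, "image_urls": image_urls, "file_urls": file_urls}
```

In a spider, `parse` would yield `build_item(response.url, links)` with `links` taken from the page's selectors.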

To use feed storage to store the scraped output as json, jsonlines, jsonl, jl, csv, xml, marshal, pickle, etc., set FEED_STORAGES as follows:

from scrapy_arweave.feedexport import get_feed_storages
FEED_STORAGES = get_feed_storages()

Then set WALLET_JWK and GATEWAY_URL, and set FEEDS as follows to store the scraped data:

WALLET_JWK = "<WALLET_JWK>"  # Either the path to a wallet JWK file or the JWK data itself
GATEWAY_URL = "https://arweave.net"

FEEDS = {
    'ar://house.json': {
        "format": "json"
    },
}

For more on FEEDS, see the Scrapy documentation on feed exports.
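
Because WALLET_JWK accepts either a file path or the JWK data itself, one option is to keep the wallet out of version control and read it from the environment. The sketch below assumes a hypothetical environment variable named ARWEAVE_WALLET_JWK and a fallback file named wallet.json. It also declares a second feed to show that several output formats can be listed side by side.

```python
import os

# Hypothetical variable name and fallback path; adjust to your own setup.
# The value may be a path to a JWK file or the JWK data itself.
WALLET_JWK = os.environ.get("ARWEAVE_WALLET_JWK", "wallet.json")
GATEWAY_URL = "https://arweave.net"

# One entry per output feed; the file names are illustrative.
FEEDS = {
    "ar://house.json": {"format": "json"},
    "ar://house.csv": {"format": "csv"},
}
```

FEED_STORAGES still has to be set as shown earlier so that Scrapy knows how to handle the ar:// scheme.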

Now run your scraping as you normally would.

Author

👤 Pawan Paudel

Github: @pawanpaudel93

🤝 Contributing

Contributions, issues and feature requests are welcome!
Feel free to check the issues page.

Show your support

Give a ⭐️ if this project helped you!
