Welcome to the Distributed Scraping Architecture project! This project leverages Scrapy, Celery, Redis, and scrapy-redis to create a scalable and robust web scraping framework.
In today's data-driven world, efficiently gathering and processing large datasets is crucial. This project aims to provide a distributed web scraping architecture that can handle large-scale data extraction tasks reliably.
- Scrapy: Powerful web crawling and scraping framework.
- Celery: Asynchronous task queue/job queue for distributing scraping tasks.
- Redis: In-memory data structure store used as a message broker.
- scrapy-redis: Integrates Scrapy with Redis so crawling can be distributed across multiple nodes (see the settings sketch after this list).
- Adds a new way of executing scrapers using `subprocess`.
- Provides a structured way to start distributed scraping, even for beginners.
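The scrapy-redis integration is wired up through Scrapy's settings. As a rough illustration (the project's actual `settings.py` may differ, and the Redis URL is an assumption), a distributed setup typically points the scheduler and duplicate filter at a shared Redis instance:

```python
# Illustrative excerpt only -- the project's settings.py may differ.
# With these settings, every worker node shares one request queue and one
# duplicate filter stored in Redis, which is what makes the crawl distributed.

SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # queue requests in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # de-duplicate across nodes
SCHEDULER_PERSIST = True                                     # keep the queue between runs
REDIS_URL = "redis://localhost:6379/0"                       # shared Redis instance (assumed)
```

Because the crawl state lives in Redis rather than in any single process, additional worker machines can join or leave the crawl at any time.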
To get started, clone the repository and install the necessary dependencies:
```bash
git clone https://github.com/milan1310/distributed-scrapy-scraping.git
cd distributed-scrapy-scraping
pip install -r requirements.txt
```
- Start Redis: Make sure you have Redis installed and running.

  ```bash
  redis-server
  ```

- Start Celery: Run a Celery worker to process scraping tasks (a sketch of the task wiring follows these steps).

  ```bash
  celery -A tasks worker --loglevel=info
  ```

- Add URLs to the Queue: Use the `add_urls.py` script to add URLs to the Redis queue (sketched below).

  ```bash
  python add_urls.py
  ```

- Run the Spider: Execute the spider to start scraping (sketched below).

  ```bash
  python run_spider.py
  ```
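To make the Celery step concrete, here is a minimal sketch of what the task wiring in `tasks.py` might look like. It is an assumption-based illustration, not the project's actual code: the task name, the default spider name, and the broker URL are guesses, and the spider is launched via `subprocess` in line with the feature noted above.

```python
# tasks.py -- hedged sketch; the real module may be organized differently.
import subprocess

from celery import Celery

# Redis doubles as the Celery message broker (URL is an assumption).
app = Celery("tasks", broker="redis://localhost:6379/0")


@app.task
def run_scraper(spider_name: str = "amazon"):
    """Launch a Scrapy spider in a child process on whichever worker picks up the task."""
    subprocess.run(["scrapy", "crawl", spider_name], check=True)
```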
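The `add_urls.py` step feeds seed URLs into the Redis list that the scrapy-redis spider consumes. A minimal sketch is shown below; the key name `amazon:start_urls` and the example URLs are assumptions, and the real script may source URLs differently (for example via `db.py`/`models.py`).

```python
# add_urls.py -- hedged sketch; key name and URLs are assumptions.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

start_urls = [
    "https://www.amazon.com/s?k=laptops",
    "https://www.amazon.com/s?k=headphones",
]

# scrapy-redis spiders pop their start URLs from this Redis list.
for url in start_urls:
    r.lpush("amazon:start_urls", url)

print(f"Queued {len(start_urls)} URLs")
```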
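Finally, `run_spider.py` starts the crawl itself. The sketch below uses Scrapy's `CrawlerProcess`; the spider import path is an assumption, and the actual script may instead shell out with `subprocess` or dispatch the Celery task.

```python
# run_spider.py -- hedged sketch; the real script may differ.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from amazon_distribution.spiders.amazon_spider import AmazonSpider  # assumed import path

process = CrawlerProcess(get_project_settings())
process.crawl(AmazonSpider)  # with scrapy-redis, the spider waits for URLs to appear in Redis
process.start()
```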
```
distributed-scrapy-scraping/
├── amazon_distribution/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
│       ├── __init__.py
│       └── amazon_spider.py
├── scrapy-redis/
│   ├── __init__.py
│   ├── connection.py
│   ├── defaults.py
│   ├── dupefilter.py
│   ├── picklecompat.py
│   ├── pipeline.py
│   ├── queue.py
│   └── scheduler.py
├── .gitignore
├── add_urls.py
├── celery_app.py
├── db.py
├── models.py
├── requirements.txt
├── run_spider.py
├── scrapy.cfg
└── tasks.py
```
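For orientation, `amazon_spider.py` under `spiders/` is where a scrapy-redis spider would live. Such a spider is typically defined along the following lines; this is a sketch under assumptions (spider name, Redis key, and parsed fields), not the project's actual implementation.

```python
# Sketch of a scrapy-redis spider; the real amazon_spider.py may look different.
from scrapy_redis.spiders import RedisSpider


class AmazonSpider(RedisSpider):
    name = "amazon"
    redis_key = "amazon:start_urls"  # Redis list the spider reads start URLs from

    def parse(self, response):
        # Yield a minimal item; real parsing logic would extract product fields.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
```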
Contributions are welcome! Please fork the repository and submit pull requests.