
Pastecrawler - Efficient Paste Crawler and Storage with Serverless Architecture and IaC

Description:

Pastecrawler crawls and captures pastes from multiple sources, storing each paste's content in Amazon S3 and its metadata in DynamoDB. The project is built on an event-driven, serverless architecture for automatic scalability, and its infrastructure is defined and deployed using Infrastructure as Code (IaC).

Key Features:

  • Paste Crawling: Scrapes pastes from multiple platforms and sources for broad coverage.
  • Serverless Architecture: Serverless compute scales automatically with load, reducing operational overhead and cost.
  • Infrastructure as Code (IaC): The entire stack (AWS resources, event triggers, S3 buckets, and DynamoDB tables) is defined and provisioned as code using the Serverless Framework, which compiles to AWS CloudFormation; Terraform is listed as an alternative in the TODOs.
  • Amazon S3 Integration: Captured paste content is stored in Amazon S3 for easy access, retrieval, and analysis.
  • DynamoDB Metadata: Metadata for each paste (source, timestamps, and other details) is stored in DynamoDB for efficient retrieval; a sample query follows this list.
  • Event-Driven System: Paste capture and storage are processed through event triggers, keeping the system responsive.
  • Caching Mechanism: A cache (Redis) detects duplicate pastes so they are not stored or processed twice.
  • Scalability and Performance: Designed to handle large paste volumes while staying responsive.
  • Configuration and Customization: Sources to crawl, storage settings, and event triggers are easily configurable.
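
Once the local stack is running (see Get Started below), you can inspect a stored metadata item. This is a sketch: the table name pastes and the LocalStack endpoint on port 4566 are assumptions; check serverless.yml for the real table name.

# Scan a single item from the local DynamoDB table (table name assumed)
aws --endpoint-url=http://localhost:4566 --region us-east-1 \
  dynamodb scan --table-name pastes --max-items 1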

Current Flow

[Flow diagram]

Get Started

Start

  1. Install Docker & docker-compose
  2. Make sure you have enough disk space for the extra images & containers
  3. Clone the project
  4. Open a terminal and run:
cd <root project folder>
chmod +x start.local.sh
./start.local.sh
  5. Redis UI (can be skipped):
redis://:eYVX7EwVmmxKPCDmwMtyKVge8oLd2t81@redis:6379
  6. LocalStack health (can be skipped; see the checks after these steps)
  7. Take a 🍵 break for about 7 minutes, then look for the first pastes in the DynamoDB UI
  8. Enjoy 😜

Stop

cd <root project folder>
chmod +x stop.local.sh
./stop.local.sh

Unit Test

  1. Install yarn
  2. Install dependencies and run the tests:
cd <project root folder>
yarn
yarn test
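
To track the coverage goal from the TODO list, you can ask the test runner for a coverage report. This assumes the test script runs Jest; yarn passes the extra flag through to it.

# Run tests with a coverage report (Jest assumed)
yarn test --coverage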

Add more crawlers


Steps

  1. Review the project structure & the serverless.yml file
  2. Add your service (Lambda, EC2, ECS, Batch)
  3. Add any additional services your crawler needs (S3, Redis, ...)
  4. Add the required parameters to the env.<env> file
  5. Add a new DynamoDB table via the Serverless Framework (a hedged serverless.yml sketch follows these steps)
  6. Make sure your service has the right role & policy permissions in serverless.yml
  7. Enjoy ❤️
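
A minimal sketch of what steps 2, 5, and 6 might look like in serverless.yml. Every name here (function, handler path, table, schedule) is a placeholder, not the project's actual configuration; the IAM syntax shown is the Serverless Framework v3 form.

# Hypothetical sketch: all names below are placeholders
provider:
  iam:
    role:
      statements:
        # Least-privilege access for the new crawler (see TODO list)
        - Effect: Allow
          Action:
            - dynamodb:PutItem
            - dynamodb:GetItem
          Resource: !GetAtt NewCrawlerTable.Arn

functions:
  newCrawler:
    handler: src/crawlers/newCrawler.handler   # placeholder path
    events:
      - schedule: rate(5 minutes)              # adjust to the source's rate limits

resources:
  Resources:
    NewCrawlerTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: new-crawler-pastes          # placeholder name
        BillingMode: PAY_PER_REQUEST
        AttributeDefinitions:
          - AttributeName: pasteId
            AttributeType: S
        KeySchema:
          - AttributeName: pasteId
            KeyType: HASH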

TODO

  • Add unit tests; raise coverage above 95%
  • Fix the Serverless Framework security vulnerability, OR migrate to Terraform
  • Apply least-privilege permissions in serverless.yml
  • Add E2E tests (Gauge)
  • Fix TODOs in the code
  • Add type declaration files to support TS
  • Store passwords/secrets in AWS Secrets Manager (a sketch follows this list)
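
For the secrets item, the direction could look like this against the local stack. This is a sketch: the secret name and the LocalStack endpoint are assumptions, and the password shown is the local dev one from Get Started.

# Store the local Redis password in Secrets Manager (names are placeholders)
aws --endpoint-url=http://localhost:4566 --region us-east-1 \
  secretsmanager create-secret \
  --name pastes-crawler/redis-password \
  --secret-string 'eYVX7EwVmmxKPCDmwMtyKVge8oLd2t81'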

Tools

Nice to know

Contacts

LinkedIn
