Pastecrawler crawls and captures pastes from various sources, storing paste content in Amazon S3 and the associated metadata in DynamoDB. The project is built on an event-driven, serverless architecture for scalability, and its infrastructure is defined and deployed as Infrastructure as Code (IaC).
- Paste Crawling: Pastecrawler scrapes pastes from multiple platforms and sources.
- Serverless Architecture: Serverless computing lets Pastecrawler scale automatically with load, which reduces operational overhead and cost.
- Infrastructure as Code (IaC): The entire infrastructure setup, including AWS resources, event triggers, S3 buckets, and DynamoDB tables, is defined and provisioned using IaC tools such as AWS CloudFormation or Terraform.
- Amazon S3 Integration: Captured paste content is securely stored in Amazon S3, allowing easy access, retrieval, and analysis.
- DynamoDB Metadata: Metadata linked to each paste, including source information, timestamps, and other relevant details, is structured and stored in DynamoDB for efficient retrieval.
- Event-Driven System: The project embraces an event-driven approach, processing paste captures and storage through event triggers, ensuring a highly responsive system.
- Caching Mechanism: A cache keeps track of pastes that were already captured, so duplicates do not trigger redundant storage and retrieval operations.
- Scalability and Performance: Pastecrawler is designed to manage substantial paste volumes while maintaining high performance and responsiveness.
- Configuration and Customization: The system offers easy configuration of sources to crawl, storage settings, and event triggers to align with specific requirements.
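
The capture flow described in the list above (content to S3, metadata to DynamoDB, duplicates filtered by a cache) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Paste` and `PasteMetadata` shapes, the `InMemoryStores` class, and the choice to deduplicate by content fingerprint are all assumptions, and the in-memory maps stand in for the real (asynchronous) S3, DynamoDB, and Redis clients.

```typescript
// Hypothetical shapes; the real project may model pastes differently.
interface Paste {
  source: string;    // site the paste was crawled from
  key: string;       // identifier of the paste at its source
  content: string;   // raw paste body
  fetchedAt: string; // ISO timestamp of the crawl
}

interface PasteMetadata {
  id: string;
  source: string;
  key: string;
  fetchedAt: string;
  size: number;
}

// Synchronous in-memory stand-ins, used only to show the flow.
class InMemoryStores {
  readonly content = new Map<string, string>();         // stands in for the S3 bucket
  readonly metadata = new Map<string, PasteMetadata>(); // stands in for the DynamoDB table
  readonly seen = new Set<string>();                    // stands in for the cache
}

// FNV-1a 32-bit hash, used here only to keep the sketch dependency-free.
// A real deployment would more likely use a cryptographic hash such as SHA-256.
function fingerprint(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Returns true if the paste was stored, false if it was skipped as a duplicate.
function capturePaste(paste: Paste, stores: InMemoryStores): boolean {
  const id = fingerprint(paste.content);
  if (stores.seen.has(id)) {
    return false; // already captured: skip both writes
  }
  stores.content.set(id, paste.content);
  stores.metadata.set(id, {
    id,
    source: paste.source,
    key: paste.key,
    fetchedAt: paste.fetchedAt,
    size: paste.content.length,
  });
  stores.seen.add(id);
  return true;
}
```

Checking the cache before either write means a duplicate costs a single lookup; whether the actual project keys its cache on content or on the source identifier is not stated here.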
- Install Docker & docker-compose
- Make sure you have enough disk space for the extra images & containers
- Clone the project
- Open a command line (Terminal) and run:
cd <root project folder>
chmod +x start.local.sh
./start.local.sh
- Redis UI (can be skipped)
redis://:eYVX7EwVmmxKPCDmwMtyKVge8oLd2t81@redis:6379
- LocalStack Health (can be skipped)
- Take a 🍵 break for 7 minutes, then check the DynamoDB UI for the first pastes
- Enjoy 😜
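
The Redis connection string in the steps above follows the `redis://:<password>@<host>:<port>` form. As a rough sketch of how its parts break down, the hypothetical helper `parseRedisUrl` below splits such a string into host, port, and password. It is not part of the project and covers only this simple form (no database index or query parameters). The host `redis` is presumably the docker-compose service name, so it would only resolve from inside the Docker network.

```typescript
// Hypothetical helper: splits a redis:// connection string into its parts.
interface RedisConnection {
  host: string;
  port: number;
  password: string; // empty string when the URL carries no credentials
}

function parseRedisUrl(raw: string): RedisConnection {
  // Groups: optional "user:password@", then host, then optional ":port".
  const match = /^redis:\/\/(?:([^:@]*):([^@]*)@)?([^:\/@]+)(?::(\d+))?\/?$/.exec(raw);
  if (match === null) {
    throw new Error("Not a supported redis:// connection string");
  }
  const host = match[3];
  if (!host) {
    throw new Error("Connection string has no host");
  }
  return {
    host,
    // 6379 is the default Redis port.
    port: match[4] ? Number(match[4]) : 6379,
    // Passwords inside URLs may be percent-encoded.
    password: decodeURIComponent(match[2] || ""),
  };
}
```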
cd <root project folder>
chmod +x stop.local.sh
./stop.local.sh
- Install yarn
cd <project root folder>
yarn
yarn test
- See the project structure & serverless.yml file
- Add your service (lambda, ec2, ecs, batch)
- Add additional services according to your requirements (S3, Redis..)
- Add the required params to the env.<env> file
- Add a new DynamoDB table with the Serverless Framework
- Make sure that your service has the right role & policy permissions in serverless.yml
- Enjoy ❤️
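
When a new service depends on params from the env.<env> file, it can help to fail fast if one is missing. The sketch below shows one way to do that. The parameter names in `REQUIRED_PARAMS` and the helper `missingParams` are purely illustrative assumptions and do not come from the project.

```typescript
// Hypothetical parameter names; replace with the ones your service actually reads.
const REQUIRED_PARAMS: readonly string[] = ["PASTES_BUCKET", "PASTES_TABLE", "REDIS_URL"];

// Returns the names of required params that are absent or blank in `env`.
function missingParams(
  env: Record<string, string | undefined>,
  required: readonly string[],
): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

A service could call `missingParams(process.env, REQUIRED_PARAMS)` during startup and throw if the result is non-empty, so that a misconfigured environment surfaces at deploy time rather than on the first event.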
- Add unit tests, coverage > 95%
- Fix the security vulnerability of the serverless framework OR use Terraform
- Apply least-privilege permissions in serverless.yml
- Add E2E tests with Gauge
- Fix TODOs in the code
- Add type declaration files to support TS
- Store passwords/secrets in secrets-manager