Ingest real-time streaming text data with automatic appending of NLP metadata
This project implements a largely serverless data engineering architecture for ingesting real-time streaming data and automatically appending NLP metadata using managed AWS services. It can serve as a baseline for building more complex ingestion pipelines that power NLP services. A sketch of the Comprehend enrichment step is shown after the service list below.
The following AWS services are leveraged:
- CloudFormation - infrastructure as code (IaC)
- Lambda - ingestion and transformation
- Firehose - stream buffering
- Comprehend - NLP metadata enrichment
- Elasticsearch Service - document storage and search
- S3 - persistent object storage
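To illustrate the enrichment step, below is a minimal sketch of a Firehose transformation Lambda that calls Comprehend to append sentiment metadata to each buffered record. The handler name, the JSON layout, and the assumption that every record carries an English-language `text` field are illustrative only; the Lambda code in this repository may be organized differently.

```python
"""Illustrative sketch of a Firehose transformation Lambda that appends Comprehend metadata."""
import base64
import json

import boto3

comprehend = boto3.client("comprehend")


def handler(event, context):
    """Decode each Firehose record, enrich it with sentiment, and return it re-encoded."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Assumption: each record is a JSON object with an English "text" field.
        sentiment = comprehend.detect_sentiment(
            Text=payload["text"],
            LanguageCode="en",
        )
        payload["nlp"] = {
            "sentiment": sentiment["Sentiment"],
            "sentiment_scores": sentiment["SentimentScore"],
        }

        # Firehose expects each transformed record back as base64 with its original recordId.
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```

Note that Comprehend's synchronous detection APIs limit input size (roughly 5 KB of UTF-8 text per call), so longer documents may need to be truncated or batched.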
This project leverages GitHub Actions for its CI/CD pipeline. If you fork the repository, you can deploy via your own Actions workflows by providing the following Secrets in your repository settings:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_REGION_ID
- IP_ADDRESS
A dataset is provided for demonstration purposes. Use the following script to send example data to the Ingest Lambda for processing:

    python stream.py
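For reference, here is a minimal sketch of what such a streaming script might look like; the actual stream.py in this repository may differ. It assumes the demo dataset is newline-delimited JSON with a `text` field and that the Ingest Lambda (hypothetical name `ingest-lambda`) is invoked asynchronously via boto3.

```python
"""Hypothetical sketch of a demo streaming script (the real stream.py may differ)."""
import json
import time

import boto3

# Assumptions: the demo dataset is newline-delimited JSON and the
# ingest Lambda is named "ingest-lambda" -- adjust both for your deployment.
DATASET_PATH = "data/example.ndjson"
INGEST_FUNCTION_NAME = "ingest-lambda"

lambda_client = boto3.client("lambda")


def stream_records(path: str) -> None:
    """Send each record to the Ingest Lambda, one invocation per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            lambda_client.invoke(
                FunctionName=INGEST_FUNCTION_NAME,
                InvocationType="Event",  # asynchronous, fire-and-forget
                Payload=json.dumps(record).encode("utf-8"),
            )
            time.sleep(0.1)  # throttle to mimic a real-time stream


if __name__ == "__main__":
    stream_records(DATASET_PATH)
```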