Ingest real-time streaming text data with automatic appending of NLP metadata
This project implements a largely serverless data engineering architecture for ingesting real-time streaming data and automatically appending NLP metadata using managed AWS services. It can serve as a baseline for building more complex ingestion pipelines that power NLP services. A sketch of the Comprehend enrichment step is shown after the service list below.
The following AWS services are leveraged:
- CloudFormation - infrastructure as code (IaC)
- Lambda - ingestion and transformation
- Firehose - stream buffering
- Comprehend - NLP metadata enrichment
- Elasticsearch Service - document storage and search
- S3 - persistent object storage
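To illustrate the enrichment step, below is a minimal sketch of a Firehose transformation Lambda that calls Comprehend to append sentiment metadata to each buffered record. The handler name, the JSON layout, and the assumption that every record carries an English-language `text` field are illustrative only; the Lambda code in this repository may be organized differently.

```python
"""Illustrative sketch of a Firehose transformation Lambda that appends Comprehend metadata."""
import base64
import json

import boto3

comprehend = boto3.client("comprehend")


def handler(event, context):
    """Decode each Firehose record, enrich it with sentiment, and return it re-encoded."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Assumption: each record is a JSON object with an English "text" field.
        sentiment = comprehend.detect_sentiment(
            Text=payload["text"],
            LanguageCode="en",
        )
        payload["nlp"] = {
            "sentiment": sentiment["Sentiment"],
            "sentiment_scores": sentiment["SentimentScore"],
        }

        # Firehose expects each transformed record back as base64 with its original recordId.
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```

Note that Comprehend's synchronous detection APIs limit input size (roughly 5 KB of UTF-8 text per call), so longer documents may need to be truncated or batched.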
This project leverages GitHub Actions for its CI/CD pipeline. If you fork the repository, you can deploy via your own Actions workflows by providing the following Secrets in your repository settings:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_REGION_ID
- IP_ADDRESS
A dataset is provided for demonstration purposes. Use the following script to send example data to the Ingest Lambda for processing:

    python stream.py
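For reference, here is a minimal sketch of what such a streaming script might look like; the actual stream.py in this repository may differ. It assumes the demo dataset is newline-delimited JSON with a `text` field and that the Ingest Lambda (hypothetical name `ingest-lambda`) is invoked asynchronously via boto3.

```python
"""Hypothetical sketch of a demo streaming script (the real stream.py may differ)."""
import json
import time

import boto3

# Assumptions: the demo dataset is newline-delimited JSON and the
# ingest Lambda is named "ingest-lambda" -- adjust both for your deployment.
DATASET_PATH = "data/example.ndjson"
INGEST_FUNCTION_NAME = "ingest-lambda"

lambda_client = boto3.client("lambda")


def stream_records(path: str) -> None:
    """Send each record to the Ingest Lambda, one invocation per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            lambda_client.invoke(
                FunctionName=INGEST_FUNCTION_NAME,
                InvocationType="Event",  # asynchronous, fire-and-forget
                Payload=json.dumps(record).encode("utf-8"),
            )
            time.sleep(0.1)  # throttle to mimic a real-time stream


if __name__ == "__main__":
    stream_records(DATASET_PATH)
```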