jritsema/aws-pdf-video-extraction-pipeline

aws-pdf-video-extraction-pipeline

Extract text and transcriptions from PDFs and videos uploaded to an S3 bucket.

This project deploys a Lambda function that is triggered by uploads to an S3 bucket. The function performs the following tasks:

  1. PDF Processing:

    • When a PDF file is uploaded, the function extracts images from the PDF.
    • It then processes the extracted images using Amazon Textract, an AWS service for optical character recognition (OCR).
    • The extracted text from the images is then uploaded to the same S3 bucket, with the file path modified to include the image name and "textract.txt".
  2. Video Processing:

    • When a video file (e.g., .mp4) is uploaded, the function submits a transcription job to Amazon Transcribe, an AWS service for speech-to-text conversion.
    • The transcription output is then uploaded to the S3 bucket, with the file path modified to include the original file name and "transcribe.out".
  3. Transcription Processing:

    • When the transcription output file (with the ".transcribe.out" extension) is uploaded, the function reads the JSON data, extracts the transcript, and uploads it to the S3 bucket with the file path modified to include the original file name and "transcribe.txt".
    • After the transcript is uploaded, the function deletes the intermediate artifacts (the ".transcribe.out" file and any associated objects) from the S3 bucket.
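
The three tasks above amount to routing on the uploaded object's key and then reshaping a service response. The following is a minimal sketch of that routing, not the repository's actual code. The function names, the exact output key layout, and the `.pdf` suffix check are assumptions made for illustration; only `.mp4`, `.transcribe.out`, `transcribe.txt`, and `textract.txt` come from the description above. The JSON and `Blocks` shapes follow the documented Amazon Transcribe and Amazon Textract response formats.

```python
import json
from typing import Optional

# Assumed suffixes: the README names ".mp4" only as an example of a video file.
VIDEO_SUFFIXES = (".mp4",)
TRANSCRIBE_OUT_SUFFIX = ".transcribe.out"


def classify_key(key: str) -> Optional[str]:
    """Map an S3 object key to one of the three tasks, or None to ignore it."""
    lowered = key.lower()
    # Check the intermediate Transcribe output before anything else.
    if lowered.endswith(TRANSCRIBE_OUT_SUFFIX):
        return "transcript"
    if lowered.endswith(".pdf"):
        return "pdf"
    if lowered.endswith(VIDEO_SUFFIXES):
        return "video"
    # Final outputs (*.textract.txt, *.transcribe.txt) land here and are ignored,
    # so writing them back to the same bucket does not start new work.
    return None


def transcript_key(out_key: str) -> str:
    """Derive the final text key from the intermediate key (assumed naming)."""
    return out_key[: -len(TRANSCRIBE_OUT_SUFFIX)] + ".transcribe.txt"


def extract_transcript(raw_json: str) -> str:
    """Join the transcript strings found in an Amazon Transcribe result document."""
    data = json.loads(raw_json)
    return "\n".join(t["transcript"] for t in data["results"]["transcripts"])


def textract_lines(response: dict) -> str:
    """Collect the text of LINE blocks from a Textract DetectDocumentText response."""
    blocks = response.get("Blocks", [])
    return "\n".join(b["Text"] for b in blocks if b.get("BlockType") == "LINE")
```

Because the function writes its results into the bucket that triggers it, ignoring keys that match none of the three cases is what keeps the pipeline from re-processing its own output.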

Usage

Optional: register the pre-commit hooks and install dependencies with asdf.

make init

Deploy the infrastructure (an S3 trigger that invokes a container-based Lambda function).

terraform init && terraform apply
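
Once the trigger is deployed, each upload invokes the function with an S3 notification event. As a rough sketch of what a handler has to do first, the snippet below pulls the bucket and key out of such an event. The function name `uploaded_objects` and the sample bucket and key are hypothetical; the `Records[].s3.bucket.name` and `Records[].s3.object.key` fields follow the standard S3 event notification structure, in which object keys are URL-encoded.

```python
from typing import List, Tuple
from urllib.parse import unquote_plus


def uploaded_objects(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Keys arrive URL-encoded (a space becomes '+'), so decode before use.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


# Trimmed-down example event with a hypothetical bucket and key.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "docs/my+report.pdf"}}}
    ]
}

print(uploaded_objects(sample_event))  # [('my-bucket', 'docs/my report.pdf')]
```

Skipping the decode step is a common source of "key not found" errors when file names contain spaces or other special characters.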

Set up the Python environment for local development and deploy code changes (the function runs as a container on Lambda).

cd lambda
make init
make deploy-container function=my-function

Terraform

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.0 |
| aws | ~> 5.0 |
| docker | >= 3.0 |

Providers

| Name | Version |
|------|---------|
| aws | ~> 5.0 |

Modules

| Name | Source | Version |
|------|--------|---------|
| docker_image | terraform-aws-modules/lambda/aws//modules/docker-build | n/a |
| lambda | terraform-aws-modules/lambda/aws | n/a |

Resources

| Name | Type |
|------|------|
| aws_iam_role.lambda | resource |
| aws_iam_role_policy_attachment.test_attach | resource |
| aws_s3_bucket_notification.main | resource |
| aws_caller_identity.current | data source |
| aws_ecr_authorization_token.token | data source |
| aws_s3_bucket.main | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| region | region | string | "us-east-1" | no |
| s3_bucket | s3 bucket to run image extraction against | string | n/a | yes |

Outputs

No outputs.
